# Data Scoring and Filtering

## Aesthetic Scoring

To evaluate the aesthetic quality of videos, we use a pretrained model from the CLIP+MLP Aesthetic Score Predictor. This model is trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos pairs, and 250K AVA (Aesthetic Visual Analysis) image-text pairs.

The score ranges from 1 to 10, where 5.5 can be considered the threshold for fair aesthetics and 6.5 for good aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.

For videos, we extract the first, middle, and last frames for evaluation. The script also supports images. It can process about 1,000 videos per second on a single GPU, and supports multiple GPUs to further accelerate the process.
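The frame-sampling step above can be sketched as follows. The helper name is illustrative rather than the repository's exact code, and the OpenCV usage in the comment is an assumption based on the repo's "extract frames using opencv" changes:

```python
def sample_frame_indices(num_frames: int) -> list[int]:
    """Return the first, middle, and last frame indices of a clip,
    deduplicated so very short videos (1-2 frames) still work."""
    if num_frames <= 0:
        return []
    candidates = [0, num_frames // 2, num_frames - 1]
    # dict.fromkeys preserves order while dropping duplicates
    return list(dict.fromkeys(candidates))

# Each index could then be read with OpenCV, e.g.:
#   cap = cv2.VideoCapture(path)
#   cap.set(cv2.CAP_PROP_POS_FRAMES, idx); ok, frame = cap.read()
```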

### Requirement

```bash
# install clip
pip install git+https://github.com/openai/CLIP.git
pip install decord

# get pretrained model
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth
```

### Usage

With a `meta.csv` containing the paths to the videos, run the following command:

```bash
# output: meta_aes.csv
torchrun --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta.csv --bs 1024 --num_workers 16
```

This will generate multiple part files; you can use `python -m tools.datasets.csvutil DATA1.csv DATA2.csv` to merge them.
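The exact behavior of `tools.datasets.csvutil` is not documented here; under the assumption that the part files share a header and merging is plain concatenation, the operation amounts to:

```python
import csv

def merge_csv_parts(part_paths, out_path):
    """Concatenate CSV part files that share the same header,
    writing the header only once."""
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                part_header = next(reader)
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {path}")
                writer.writerows(reader)
```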

## Optical Flow Score

Optical flow scores are used to assess the motion in a video; higher scores indicate larger movement. The scores are computed with a pretrained UniMatch model.

First, download the pretrained model:

```bash
wget https://s3.eu-central-1.amazonaws.com/avg-projects/unimatch/pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth -P pretrained_models/unimatch
```

Then run:

```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py --meta_path /path/to/meta.csv
```

The output should be `/path/to/meta_flow.csv` with a new column `flow`.
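How the dense flow field is reduced to a single scalar is not spelled out above; a common choice, shown here as an assumption rather than the pipeline's exact definition, is the mean per-pixel flow magnitude:

```python
import numpy as np

def flow_score(flow: np.ndarray) -> float:
    """Reduce a dense flow field of shape (H, W, 2) -- per-pixel
    (dx, dy) displacements -- to a scalar motion score: the mean
    flow magnitude over all pixels."""
    magnitudes = np.linalg.norm(flow, axis=-1)  # (H, W)
    return float(magnitudes.mean())

# A static scene has zero flow everywhere; a field where every pixel
# moves by (3, 4) has magnitude 5 at every pixel.
static = np.zeros((4, 4, 2))
moving = np.tile([3.0, 4.0], (4, 4, 1))
```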

## Matching Score

Matching scores are calculated to evaluate the alignment between an image/video and its caption. For videos, we compute the matching score between the middle frame and the caption.
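Conceptually, a CLIP-style matching score is the cosine similarity between the frame embedding and the caption embedding. A minimal numpy sketch of that reduction (the embeddings themselves would come from a model such as CLIP, which is omitted here):

```python
import numpy as np

def matching_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text
    embedding: L2-normalize both, then take the dot product."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)
```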

Make sure the meta files contain a `text` column holding the caption of each sample. Then run:

```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/matching/inference.py --meta_path /path/to/meta.csv
```

The output should be `/path/to/meta_match.csv` with a new column `match`. Higher matching scores indicate better image-text/video-text alignment.
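Once a meta file carries a score column, the filtering step comes down to a threshold pass over its rows. The function name and the cutoff in the usage comment are hypothetical examples, not values prescribed by the pipeline:

```python
import csv

def filter_meta(in_path, out_path, column, threshold):
    """Keep only the rows whose score in `column` is at least
    `threshold`, preserving the header and all other columns."""
    with open(in_path, newline="") as f, open(out_path, "w", newline="") as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row[column]) >= threshold:
                writer.writerow(row)

# e.g. filter_meta("meta_match.csv", "meta_filtered.csv", "match", 0.2)
```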