* update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scoring/matching * update scene_cut * update scene_cut * update scene_cut[A * update scene_cut * update scene_cut * update scene_cut * update scene_cut * update scene_cut * update scene_cut * m * m * m * m * m * m * m * m * m * m * m * m * m * m * update readme * update readme * extract frames using opencv everywhere * extract frames using opencv everywhere * extract frames using opencv everywhere * filter panda10m * filter panda10m * m * m * m * m * m * m * m * m * m * m * m * m * m * m * m * m * m * ocr * add ocr * add main.sh * add ocr * add ocr * add ocr * add ocr * add ocr * add ocr * update scene_cut * update remove main.sh |
||
|---|---|---|
| .. | ||
| aesthetic | ||
| matching | ||
| ocr | ||
| optical_flow | ||
| __init__.py | ||
| README.md | ||
Data Scoring and Filtering
Aesthetic Scoring
To evaluate the aesthetic quality of videos, we use a pretrained model from CLIP+MLP Aesthetic Score Predictor. This model is trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos (Logos) pairs, and 250K AVA (The Aesthetic Visual Analysis) image-text pairs.
The score is between 1 and 10, where 5.5 can be considered as the threshold for fair aesthetics, and 6.5 for good aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.
For videos, we extract the first, last, and the middle frames for evaluation. The script also supports images. Our script enables 1k videos/s with one GPU. It also supports multiple GPUs to further accelerate the process.
Requirement
# install clip
pip install git+https://github.com/openai/CLIP.git
pip install decord
# get pretrained model
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth
Usage
With meta.csv containing the paths to the videos, run the following command:
# output: meta_aes.csv
torchrun --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta.csv --bs 1024 --num_workers 16
This will generate multiple part files, you can use python -m tools.datasets.csvutil DATA1.csv DATA2.csv to merge these part files.
Optical Flow Score
Optical flow scores are used to assess the motion of a video. Higher optical flow scores indicate larger movement. TODO: acknowledge UniMatch.
First get the pretrained model.
wget https://s3.eu-central-1.amazonaws.com/avg-projects/unimatch/pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth -P pretrained_models/unimatch
Then run:
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py --meta_path /path/to/meta.csv
The output should be /path/to/meta_flow.csv with column flow.
Matching Score
Matching scores are calculated to evaluate the alignment between an image/video and its caption. For videos, we compute the matching score of the middle frame and the caption.
Make sure meta files contain the column text, which is the caption of the sample. Then run:
torchrun --standalone --nproc_per_node 8 tools/scoring/matching/inference.py --meta_path /path/to/meta.csv
The output should be /path/to/meta_match.csv with column match. Higher matching scores indicate better image-text/video-text alignment.