Open-Sora/tools/scoring/README.md

71 lines
2.8 KiB
Markdown
Raw Normal View History

# Data Scoring and Filtering
- [Data Scoring and Filtering](#data-scoring-and-filtering)
- [Aesthetic Scoring](#aesthetic-scoring)
- [Requirement](#requirement)
- [Usage](#usage)
- [Optical Flow Score](#optical-flow-score)
- [Matching Score](#matching-score)
## Aesthetic Scoring
To evaluate the aesthetic quality of videos, we use a pretrained model from [CLIP+MLP Aesthetic Score Predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor). This model is trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos (Logos) pairs, and 250K AVA (The Aesthetic Visual Analysis) image-text pairs.
The score is between 1 and 10, where 5.5 can be considered as the threshold for fair aesthetics, and 6.5 for good aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.
For videos, we extract the first, last, and the middle frames for evaluation. The script also supports images. Our script enables 1k videos/s with one GPU. It also supports multiple GPUs to further accelerate the process.
### Requirement
```bash
# install clip
pip install git+https://github.com/openai/CLIP.git
pip install decord
# get pretrained model
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth
```
### Usage
With `meta.csv` containing the paths to the videos, run the following command:
```bash
2024-04-02 09:55:58 +02:00
# output: meta_aes.csv
torchrun --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta.csv --bs 1024 --num_workers 16
```
This will generate multiple part files, you can use `python -m tools.datasets.csvutil DATA1.csv DATA2.csv` to merge these part files.
## Optical Flow Score
2024-04-05 04:00:30 +02:00
Optical flow scores are used to assess the motion of a video. Higher optical flow scores indicate larger movement.
TODO: acknowledge UniMatch.
First get the pretrained model.
2024-04-05 04:00:30 +02:00
```bash
wget https://s3.eu-central-1.amazonaws.com/avg-projects/unimatch/pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth -P pretrained_models/unimatch
```
Then run:
2024-04-05 04:00:30 +02:00
```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py --meta_path /path/to/meta.csv
```
2024-04-05 04:00:30 +02:00
The output should be `/path/to/meta_flow.csv` with column `flow`.
## Matching Score
2024-04-05 04:00:30 +02:00
Matching scores are calculated to evaluate the alignment between an image/video and its caption.
For videos, we compute the matching score of the middle frame and the caption.
**Make sure** meta files contain the column `text`, which is the caption of the sample. Then run:
2024-04-05 04:00:30 +02:00
```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/matching/inference.py --meta_path /path/to/meta.csv
```
2024-04-05 04:00:30 +02:00
The output should be `/path/to/meta_match.csv` with column `match`. Higher matching scores indicate better image-text/video-text alignment.