# Data Scoring and Filtering

- [Data Scoring and Filtering](#data-scoring-and-filtering)
  - [Aesthetic Scoring](#aesthetic-scoring)
    - [Requirement](#requirement)
    - [Usage](#usage)
  - [Optical Flow Score](#optical-flow-score)
  - [Matching Score](#matching-score)

## Aesthetic Scoring

To evaluate the aesthetic quality of videos, we use a pretrained model from [CLIP+MLP Aesthetic Score Predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor). This model is trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos (Logos) pairs, and 250K AVA (The Aesthetic Visual Analysis) image-text pairs.

Scores range from 1 to 10. A score of 5.5 can be considered the threshold for fair aesthetics, and 6.5 the threshold for good aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.
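As a rough illustration of how these thresholds might be applied, the sketch below maps a score to a coarse tier. The function name and the `"poor"` label are hypothetical and not part of the scoring script; only the 5.5 and 6.5 cut-offs come from the text above.

```python
def aesthetic_tier(score: float) -> str:
    """Map an aesthetic score to a coarse tier.

    The 5.5 and 6.5 thresholds follow the README; the tier names
    (in particular "poor") are illustrative assumptions.
    """
    if score >= 6.5:
        return "good"
    if score >= 5.5:
        return "fair"
    return "poor"


if __name__ == "__main__":
    for s in (4.8, 5.5, 6.9):
        print(s, aesthetic_tier(s))
```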

For videos, we extract the first, middle, and last frames for evaluation. The script also supports images. It processes about 1k videos/s on one GPU and supports multiple GPUs to further accelerate the process.
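The frame selection described above can be sketched as follows. This is a minimal illustration, assuming the middle frame is taken at `num_frames // 2`; the function name is hypothetical and the actual script may pick indices differently.

```python
from typing import List


def sample_frame_indices(num_frames: int) -> List[int]:
    """Return the indices of the first, middle, and last frames.

    Assumption: the middle frame is at num_frames // 2. For very short
    clips the indices may repeat (e.g. a 1-frame clip yields [0, 0, 0]).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return [0, num_frames // 2, num_frames - 1]


if __name__ == "__main__":
    print(sample_frame_indices(120))
```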

### Requirement

```bash
# install clip
pip install git+https://github.com/openai/CLIP.git
pip install decord

# get pretrained model
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth
```

### Usage

With `meta.csv` containing the paths to the videos, run the following command:

```bash
# output: meta_aes.csv
torchrun --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta.csv --bs 1024 --num_workers 16
```

This will generate multiple part files. You can merge them with `python -m tools.datasets.csvutil DATA1.csv DATA2.csv`.
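If you only need a plain concatenation, the merge step can also be approximated with the standard library. The sketch below is a stand-in, not what `csvutil` does internally; the function name is hypothetical, and it assumes every part file shares the same header.

```python
import csv
from typing import Iterable


def merge_csv_parts(part_paths: Iterable[str], out_path: str) -> int:
    """Concatenate CSV part files that share one header.

    Returns the number of data rows written. Raises ValueError if a
    part file has a different header than the first non-empty one.
    """
    writer = None
    n_rows = 0
    with open(out_path, "w", newline="") as out_f:
        for path in part_paths:
            with open(path, newline="") as in_f:
                reader = csv.DictReader(in_f)
                if reader.fieldnames is None:
                    continue  # skip empty part files
                if writer is None:
                    # The first non-empty part defines the output header.
                    writer = csv.DictWriter(out_f, fieldnames=reader.fieldnames)
                    writer.writeheader()
                elif list(reader.fieldnames) != list(writer.fieldnames):
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Call it with the list of part files and the desired output path; the part file names depend on how the inference script names its outputs.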

## Optical Flow Score

Optical flow scores are used to assess the motion of a video. Higher optical flow scores indicate larger movement.
The scores are computed with a pretrained model from [UniMatch](https://github.com/autonomousvision/unimatch).

First get the pretrained model.

```bash
wget https://s3.eu-central-1.amazonaws.com/avg-projects/unimatch/pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth -P pretrained_models/unimatch
```

Then run:

```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py --meta_path /path/to/meta.csv
```

The output should be `/path/to/meta_flow.csv` with column `flow`.

## Matching Score

Matching scores are calculated to evaluate the alignment between an image/video and its caption.
For videos, we compute the matching score of the middle frame and the caption.

**Make sure** meta files contain the column `text`, which is the caption of the sample. Then run:

```bash
torchrun --standalone --nproc_per_node 8 tools/scoring/matching/inference.py --meta_path /path/to/meta.csv
```

The output should be `/path/to/meta_match.csv` with column `match`. Higher matching scores indicate better image-text/video-text alignment.
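Once the score columns are available, filtering reduces to keeping the rows that reach a minimum on each column. The sketch below is illustrative only: the function name and the threshold values are hypothetical, the `flow` and `match` column names come from the sections above, and the `aes` column name for the aesthetic score is an assumption.

```python
from typing import Dict, List


def filter_rows(rows: List[Dict[str, str]], min_scores: Dict[str, float]) -> List[Dict[str, str]]:
    """Keep rows whose score columns all reach their minimum value.

    `rows` are dicts as produced by csv.DictReader (values are strings);
    `min_scores` maps a column name to its minimum accepted score.
    """
    kept = []
    for row in rows:
        if all(float(row[col]) >= thr for col, thr in min_scores.items()):
            kept.append(row)
    return kept


if __name__ == "__main__":
    rows = [
        {"path": "a.mp4", "aes": "5.8", "flow": "2.0", "match": "0.30"},
        {"path": "b.mp4", "aes": "4.9", "flow": "3.1", "match": "0.28"},
    ]
    # Threshold values below are placeholders, not recommendations.
    print(filter_rows(rows, {"aes": 5.5, "flow": 1.0, "match": 0.25}))
```

Suitable thresholds depend on the dataset, so it is worth inspecting the score distributions before fixing them.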