Aesthetic Scoring

To evaluate the aesthetic quality of videos, we use the pretrained model from the CLIP+MLP Aesthetic Score Predictor. The model was trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos (Logos) pairs, and 250K AVA (Aesthetic Visual Analysis) image-text pairs.

Scores range from 1 to 10. A score of 5.5 can be treated as the threshold for fair aesthetics, and 6.5 as the threshold for good aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.
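As a quick illustration of these thresholds, the sketch below maps a score to a coarse label. The helper name `aesthetic_bucket` and the label strings are hypothetical and not part of the script; only the 5.5 and 6.5 cut-offs come from the text.

```python
# Thresholds taken from the README text; everything else is illustrative.
FAIR_THRESHOLD = 5.5
GOOD_THRESHOLD = 6.5


def aesthetic_bucket(score: float) -> str:
    """Map an aesthetic score (1-10) to a coarse label (hypothetical helper)."""
    if score >= GOOD_THRESHOLD:
        return "good"
    if score >= FAIR_THRESHOLD:
        return "fair"
    return "below fair"
```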

For videos, we extract the first, middle, and last frames for evaluation. The script also supports images. It reaches 1k videos/s on a single GPU and can use multiple GPUs to speed up processing further.
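The frame selection can be sketched as follows. This is a minimal illustration, not the code in `inference.py`: the function name `sample_frame_indices` is hypothetical, and both the choice of `num_frames // 2` as the middle frame and the removal of duplicate indices for very short clips are assumptions.

```python
from typing import List


def sample_frame_indices(num_frames: int) -> List[int]:
    """Return indices of the first, middle, and last frames of a clip.

    Hypothetical helper: duplicates are dropped (order preserved) so that
    clips with fewer than three frames do not score the same frame twice.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    candidates = [0, num_frames // 2, num_frames - 1]
    # dict.fromkeys keeps the first occurrence of each index, in order.
    return list(dict.fromkeys(candidates))
```

For a 100-frame clip this selects frames 0, 50, and 99.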

Requirements

# install clip
pip install git+https://github.com/openai/CLIP.git

# get pretrained model (wget -O does not create the target directory, so make it first)
mkdir -p pretrained_models
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth

Usage

With DATA.csv containing the paths to the videos, run the following command:

# output: DATA_aes.csv
python -m tools.aesthetic.inference DATA.csv
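
Once the scores are written, a common next step is to keep only the samples above a threshold. The sketch below shows one way to do that with the standard library. The column names `path` and `aes` are assumptions about the output layout, so check the header of your own DATA_aes.csv; `filter_by_aesthetic` is a hypothetical helper, not part of the tool.

```python
import csv
import io


def filter_by_aesthetic(rows, min_score=5.5, score_column="aes"):
    """Keep rows whose aesthetic score is at least min_score.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The default column name "aes" is an assumption.
    """
    return [row for row in rows if float(row[score_column]) >= min_score]


# In-memory stand-in for DATA_aes.csv, so the example runs without files.
sample = io.StringIO("path,aes\na.mp4,6.8\nb.mp4,4.9\nc.mp4,5.5\n")
kept = filter_by_aesthetic(csv.DictReader(sample), min_score=5.5)
print([row["path"] for row in kept])  # ['a.mp4', 'c.mp4']
```

To run this on the real output, replace the `io.StringIO` object with `open("DATA_aes.csv", newline="")`.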