# Video Captioning

Human labeling of videos is expensive and time-consuming, so we adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better quality, its speed of 20 s/sample is too slow for us. With batch inference, LLaVA reaches 3 s/sample with comparable quality. LLaVA is the second-best open-source model on the MMMU benchmark and accepts inputs of any resolution.

## Caption

### GPT-4V Captioning

Run the following command to generate captions for videos with GPT-4V:

```bash
python -m tools.caption.caption_gpt4 FOLDER_WITH_VIDEOS output.csv --key $OPENAI_API_KEY
```

The cost is approximately $0.01 per video (3 frames per video). The output is a CSV file with `path` and `caption` columns.
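To inspect the results downstream, the output CSV can be loaded into a simple mapping. This is a minimal sketch, assuming a two-column `path,caption` layout with a header row; the helper name `load_captions` is hypothetical and not part of the repository:

```python
import csv

def load_captions(csv_path):
    """Read a captioning output CSV into a {path: caption} dict.

    Assumes two columns (path, caption) and a header row, which is
    the layout described above; adjust if the file differs.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {row[0]: row[1] for row in reader}
```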

### LLaVA Captioning

First, install LLaVA according to its official instructions. We use the liuhaotian/llava-v1.6-34b model for captioning, which can be downloaded here. Then run the following command to generate captions for videos with LLaVA:

```bash
# we run this on 8x H800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava samples output.csv --tp-size 2 --dp-size 4 --bs 16
```

The Yi-34B-based model requires two 80 GB GPUs and runs at about 3 s/sample. The output is a CSV file with `path` and `caption` columns.
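The `--tp-size` and `--dp-size` flags must multiply to the total number of launched processes: with `--nproc_per_node 8`, tensor parallelism of 2 and data parallelism of 4 give 2 × 4 = 8 GPUs, i.e. four model replicas each spanning two GPUs. A minimal sketch of this constraint (the helper `check_parallel_layout` is hypothetical; the actual script performs its own validation):

```python
def check_parallel_layout(nproc_per_node, tp_size, dp_size):
    """Check that tensor-parallel x data-parallel groups cover all GPUs.

    Returns the number of model replicas (data-parallel groups).
    The constraint tp_size * dp_size == nproc_per_node must hold
    for the torchrun command above (8 = 2 x 4).
    """
    if tp_size * dp_size != nproc_per_node:
        raise ValueError(
            f"tp_size ({tp_size}) * dp_size ({dp_size}) "
            f"must equal nproc_per_node ({nproc_per_node})"
        )
    return nproc_per_node // tp_size  # replicas running batch inference
```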