# Video Captioning

Human labeling of videos is expensive and time-consuming. We adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20s/sample is too slow for us. LLaVA is the second-best open-source model in [MMMU](https://mmmu-benchmark.github.io/) and accepts any resolution. We find the quality of the 34B model is comparable.

![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)

## LLaVA Captioning

We extract three frames from the video for captioning. With batch inference, we can achieve a 10x speedup. With approximately 720p resolution and 1 frame, the speed is 2~3 videos/s on 8 GPUs. If we resize the smaller side to 336, the speed can reach 8 videos/s. In Open-Sora v1.1, to lower the cost, we use the 7B model.

### Requirement

```bash
# create conda env
conda create -n llava python=3.10 -y
conda activate llava

# install torch
pip install torch torchvision

# clone llava
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
# CAUTION: This line removes the torch dependency in pyproject.toml, which is:
# "torch==2.1.2", "torchvision==0.16.2",
# It is better to manually remove it in your local pyproject.toml
sed -i '16d' pyproject.toml

# install llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# install flash attention
pip install flash-attn --no-build-isolation
# install colossalai and decord
pip install colossalai decord
```

### Usage

Prepare a csv file for processing. The csv file can be generated by `convert_dataset.py` according to its [documentation](/tools/datasets/README.md).
Then, run the following command to generate captions for videos/images with LLaVA:

```bash
# caption with mistral-7B
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video

# caption with llava-34B
# NOTE: remember to enable flash attention for this model
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 4 --tp-size 2 --model-path liuhaotian/llava-v1.6-34b --prompt image-3ex --flash-attention

# we run this on 8xH800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 4 --bs 16

# at least two 80G GPUs are required
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16

# can also caption images
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16 --prompt image-3ex
```

Please note that you should add the `--flash-attention` flag when running Llama-based LLaVA models, as it provides a speedup, but turn it off for Mistral-based ones. The reasons can be found in [this issue](https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453).

After running the script with `dp-size=N`, you will get `N` csv part files. Run the following command to merge them:

```bash
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
```

### Resume

Sometimes the process may be interrupted. We can resume it by running the following commands:

```bash
# merge generated results
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv

# get the remaining videos
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
```

Then use the output csv file to resume the process.

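To make the merge and difference steps concrete, here is a minimal Python sketch of the same logic. It is not the implementation of `tools.datasets.datautil`. It assumes each csv has a `path` column that identifies a video, and the helper names (`read_rows`, `merge_parts`, `remaining_rows`, `write_rows`) are hypothetical.

```python
import csv
import glob


def read_rows(csv_path):
    """Read a csv file into a list of dicts, one per row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def merge_parts(pattern):
    """Concatenate all part files matching a glob pattern, in sorted order."""
    rows = []
    for part in sorted(glob.glob(pattern)):
        rows.extend(read_rows(part))
    return rows


def remaining_rows(all_rows, done_rows, key="path"):
    """Keep rows of all_rows whose key does not appear in done_rows.

    ASSUMPTION: `path` uniquely identifies a video in both files.
    """
    done = {row[key] for row in done_rows}
    return [row for row in all_rows if row[key] not in done]


def write_rows(csv_path, rows, fieldnames):
    """Write a list of dicts to a csv file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Under these assumptions, `remaining_rows(read_rows("DATA.csv"), merge_parts("DATA_caption_part*.csv"))` yields the rows that still need captions, which is the role `DATA_remaining.csv` plays above.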
## GPT-4V Captioning

Run the following command to generate captions for videos with GPT-4V:

```bash
# output: DATA_caption.csv
python -m tools.caption.caption_gpt4 DATA.csv --key $OPENAI_API_KEY
```

The cost is approximately $0.01 per video (3 frames per video).
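
For budgeting, the per-video figures in this document can be scaled to a whole dataset. The sketch below uses the $0.01 per video and 20s/sample numbers quoted here. The dataset size of 100,000 videos and the function name `gpt4v_estimate` are hypothetical, and the time estimate assumes requests are sent one at a time.

```python
def gpt4v_estimate(num_videos, cost_per_video=0.01, seconds_per_sample=20.0):
    """Return (total cost in USD, wall-clock hours for sequential requests)."""
    total_cost = num_videos * cost_per_video
    total_hours = num_videos * seconds_per_sample / 3600
    return total_cost, total_hours


# Hypothetical dataset size, chosen only for illustration.
cost, hours = gpt4v_estimate(100_000)
print(f"~${cost:,.0f}, ~{hours:,.0f} hours if requests run one at a time")
```

With these inputs the estimate is roughly $1,000 and about 556 hours of sequential requests. Actual numbers depend on API pricing, rate limits, and how many requests run in parallel.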