# Video Captioning
Human labeling of videos is expensive and time-consuming, so we adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20 s/sample is too slow for us. LLaVA is the second-best open-source model on MMMU and accepts inputs at any resolution, and we find the quality of its 34B model comparable.
## LLaVA Captioning
We extract three frames from each video for captioning. With batch inference, we achieve a 10x speedup, reaching 2.4 videos/s on 8 GPUs.
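The repository's exact sampling strategy lives in `utils.py`; as a minimal sketch of one common approach (the midpoint of each of three equal segments — an assumption, not necessarily what the repo does), evenly spaced frame indices can be computed like this:

```python
def sample_frame_indices(total_frames: int, num_samples: int = 3) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a video.

    Uses the midpoint of each of num_samples equal segments, which avoids
    the (often black or fading) first and last frames.
    """
    if total_frames <= num_samples:
        # Short video: just take every frame we have.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]

# e.g. a 90-frame clip -> frames 15, 45, 75
print(sample_frame_indices(90))
```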
### Requirement
```bash
# create conda env
conda create -n llava python=3.10 -y
conda activate llava
# install torch
pip install torch torchvision
# clone llava
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
# CAUTION: This line removes the pinned torch dependency in pyproject.toml, which is:
# "torch==2.1.2", "torchvision==0.16.2",
# It is safer to remove it manually in your local pyproject.toml
sed -i '16d' pyproject.toml
# install llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# install flash attention
pip install flash-attn --no-build-isolation
# install colossalai
pip install colossalai
```
Since only the 34B model's performance is comparable to GPT-4V's, we only provide usage instructions for the 34B model. The 34B model is available here, or run our script and it will be downloaded automatically.
### Usage
Prepare a CSV file for processing. The CSV file can be generated by convert_dataset.py according to its documentation. Then run the following command to generate captions for videos/images with LLaVA:
```bash
# we run this on 8xH800 GPUs; note that tp-size x dp-size (2 x 4) matches nproc_per_node
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava samples output.csv --tp-size 2 --dp-size 4 --bs 16
```
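The exact input schema is defined by convert_dataset.py and its documentation. As a rough sketch only — assuming a single `path` column listing the media files, which is a hypothetical layout, not a confirmed one — the input CSV can be produced with the standard library:

```python
import csv

# Hypothetical input rows; check convert_dataset.py's documentation for the
# real schema. Here we assume one "path" column pointing at the media files.
rows = [
    {"path": "samples/video_0001.mp4"},
    {"path": "samples/video_0002.mp4"},
]

with open("samples.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["path"])
    writer.writeheader()
    writer.writerows(rows)
```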
## GPT-4V Captioning
Run the following command to generate captions for videos with GPT-4V:
```bash
python -m tools.caption.caption_gpt4 FOLDER_WITH_VIDEOS output.csv --key $OPENAI_API_KEY
```
The cost is approximately $0.01 per video (3 frames per video). The output is a CSV file with path and caption columns.
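Under the hood, sending frames to GPT-4V means base64-encoding them into the standard OpenAI vision message format. The sketch below only assembles the request body (no network call); the prompt, model name, and `max_tokens` are placeholder assumptions, and the actual values used by caption_gpt4.py may differ:

```python
import base64

def build_gpt4v_payload(frames: list[bytes], prompt: str,
                        model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a Chat Completions request body with base64-encoded frames.

    Each frame becomes an image_url content part using a data: URL, which is
    the standard way to pass raw images to the OpenAI vision endpoint.
    """
    content = [{"type": "text", "text": prompt}]
    for frame in frames:
        b64 = base64.b64encode(frame).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 300,  # placeholder cap on the caption length
    }

# Toy bytes standing in for 3 extracted JPEG frames.
payload = build_gpt4v_payload([b"f1", b"f2", b"f3"],
                              "Describe this video in one sentence.")
```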
