diff --git a/tools/caption/README.md b/tools/caption/README.md
index 763b04c..4473a79 100644
--- a/tools/caption/README.md
+++ b/tools/caption/README.md
@@ -1,8 +1,71 @@
 # Video Captioning
 
-Human labeling of videos is expensive and time-consuming. We adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves a better performance, its 20s/sample speed is too slow for us. LLaVA is the second best open-source model in [MMMU](https://mmmu-benchmark.github.io/) and accepts any resolution. We find the quality of 34B model is comparable.
+Human labeling of videos is expensive and time-consuming. We adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its 20s/sample speed is too slow for us. For our v1.2 model, we captioned our training videos with the [PLLaVA](https://github.com/magic-research/PLLaVA) model, which performs competitively on multiple video-based text generation benchmarks, including [MVBench](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1).
+
+## PLLaVA Captioning
+
+To balance captioning speed and quality, we chose the 13B version of PLLaVA configured with 2×2 spatial pooling, and we feed it 4 frames evenly sampled from each video.
+
+### Installation
+
+Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
+
+### Usage
+
+Since PLLaVA is not packaged as an installable module, we add it to `PYTHONPATH` instead:
+
+```bash
+cd .. # step back to pllava_dir
+
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
+nohup python caption_pllava.py \
+    --pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
+    --use_lora \
+    --lora_alpha 4 \
+    --num_frames 4 \
+    --weight_dir PLLaVA/MODELS/pllava-13b \
+    --csv_path meta.csv \
+    --pooling_shape 4-12-12 \
+    > pllava_caption.out 2>&1 &
+```
+
+### PLLaVA vs. LLaVA
+
+In our previous releases, we used [LLaVA](#llava-captioning) for video captioning. We notice that PLLaVA produces better captions, and we cherry-pick one comparison below:
+
+![LLaVA vs PLLaVA](../../assets/readme/llava_vs_pllava_sample.gif)
 
-![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
 
 ## LLaVA Captioning
 
@@ -84,63 +147,6 @@ python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --outpu
 
 Then use the output csv file to resume the process.
 
-## PLLaVA Captioning
-
-
-### Installation
-Install the required dependancies by following our [installation instructions](../../docs/installation.md)'s "Data Dependencies" and "PLLaVA Captioning" sections.
-
-
-
-### Usage
-
-Since PLLaVA is not fashioned as a package, we will use PYTHONPATH to use it.
-
-
-```bash
-cd .. # step back to pllava_dir
-
-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-PYTHONPATH='$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA' \
-nohup python caption_pllava.py \
-    --pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
-    --use_lora \
-    --lora_alpha 4 \
-    --num_frames 4 \
-    --weight_dir PLLaVA/MODELS/pllava-13b \
-    --csv_path meta.csv \
-    --pooling_shape 4-12-12 \
-    > pllava_caption.out 2>&1 &
-
-```
-
 ## GPT-4V Captioning
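
The usage above feeds PLLaVA 4 frames evenly sampled from each video. For readers who want to replicate that sampling step outside of `caption_pllava.py`, here is a minimal sketch assuming OpenCV (`opencv-python`) is installed; `sample_frames_evenly` is an illustrative helper of ours, not part of PLLaVA or Open-Sora:

```python
# Minimal sketch: evenly sample N frames from a video with OpenCV.
# Assumes `pip install opencv-python`; not PLLaVA's actual loader.
import cv2
import numpy as np


def sample_frames_evenly(video_path: str, num_frames: int = 4) -> list[np.ndarray]:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick frame indices spread uniformly across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR ndarray of shape (H, W, 3)
    cap.release()
    return frames
```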
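
Similarly, the `--pooling_shape 4-12-12` flag describes a (frames, height, width) target for spatio-temporal pooling. As a rough illustration of the idea (a sketch, not PLLaVA's actual implementation), adaptive average pooling in PyTorch reduces a feature map to such a shape:

```python
# Illustration only: realizing a 4-12-12 (frames, height, width) pooling
# target with adaptive average pooling; tensor sizes are made up.
import torch

# Dummy visual features: (batch, channels, frames, height, width).
features = torch.randn(1, 1024, 4, 24, 24)

# 2x2 spatial pooling down to the 4-12-12 target.
pool = torch.nn.AdaptiveAvgPool3d((4, 12, 12))
pooled = pool(features)
print(pooled.shape)  # torch.Size([1, 1024, 4, 12, 12])
```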