update pllava readme

Tom Young 2024-06-17 03:20:22 +00:00
parent 0fb6415007
commit dce5fad32f

# Video Captioning
Human labeling of videos is expensive and time-consuming. We adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20s/sample is too slow for us. For our v1.2 model, we captioned our training videos with the [PLLaVA](https://github.com/magic-research/PLLaVA) model. PLLaVA performs highly competitively on multiple video-based text generation benchmarks, including [MVBench](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1).
## PLLaVA Captioning
To balance captioning speed and performance, we chose the 13B version of PLLaVA configured with 2*2 spatial pooling. We feed it 4 frames extracted evenly from each video.
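The "extracted evenly" sampling above can be sketched as follows. This is an illustrative sketch of the common center-of-segment strategy, not necessarily PLLaVA's exact frame-sampling code:

```python
def even_frame_indices(total: int, num_frames: int = 4) -> list[int]:
    """Pick `num_frames` indices spread evenly across a clip of `total` frames."""
    seg = total / num_frames
    # center each sample inside its segment to avoid clustering at the clip ends
    return [min(total - 1, int(seg * i + seg / 2)) for i in range(num_frames)]

print(even_frame_indices(100))  # → [12, 37, 62, 87]
```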
### Installation
Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
<!-- ### Download the PLLaVA repo
First, make sure you are under the directory of tools/caption/pllava_dir. Then,
```bash
git clone https://github.com/magic-research/PLLaVA.git
cd PLLaVA
git checkout fd9194a
```
### Environment
```bash
conda create -n pllava python=3.10
conda activate pllava
pip install -r requirements.txt # change to your own torch version if necessary; torch==2.2.2, torchaudio==2.2.2, torchvision==0.17.2 are known to work on H100
```
### Download weights
```bash
python python_scripts/hf.py # download the weights
``` -->
### Usage
Since PLLaVA is not packaged as an installable module, we add its repository to `PYTHONPATH` to use it.
```bash
cd .. # step back to pllava_dir
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
nohup python caption_pllava.py \
--pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
--use_lora \
--lora_alpha 4 \
--num_frames 4 \
--weight_dir PLLaVA/MODELS/pllava-13b \
--csv_path meta.csv \
--pooling_shape 4-12-12 \
> pllava_caption.out 2>&1 &
```
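For clarity, here is what the `PYTHONPATH` assignment in the command above accomplishes, expressed in Python. The repository root below is a placeholder mirroring `OPEN_SORA_HOME` in the shell command:

```python
import os
import sys

# Make the cloned (non-packaged) PLLaVA repo importable, just as the
# PYTHONPATH export does; "OPEN_SORA_HOME" is a placeholder root.
pllava_root = os.path.join("OPEN_SORA_HOME", "tools", "caption", "pllava_dir", "PLLaVA")
if pllava_root not in sys.path:
    sys.path.append(pllava_root)

print(pllava_root in sys.path)  # → True
```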
### PLLaVA vs. LLaVA
In our previous releases, we used [LLaVA](#llava-captioning) for video captioning. Below is one cherry-picked comparison between the two models:
![LLaVA vs PLLaVA](../../assets/readme/llava_vs_pllava_sample.gif)
![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
## LLaVA Captioning
```bash
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output ...
```
Then use the output csv file to resume the process.
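Conceptually, the `--difference` step keeps only the rows not yet captioned. A minimal Python sketch of that set difference follows; it is not the actual `datautil` implementation, and the `path` key column is an assumption:

```python
import csv

def difference(full_csv: str, done_csv: str, out_csv: str, key: str = "path") -> None:
    """Keep rows of full_csv whose `key` value does not appear in done_csv."""
    with open(done_csv, newline="") as f:
        done = {row[key] for row in csv.DictReader(f)}
    with open(full_csv, newline="") as f:
        reader = csv.DictReader(f)
        rows = [row for row in reader if row[key] not in done]
        fields = reader.fieldnames
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```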
## GPT-4V Captioning