update pllava readme

Tom Young 2024-06-17 03:20:22 +00:00
parent 0fb6415007
commit dce5fad32f

# Video Captioning
Human labeling of videos is expensive and time-consuming. We adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20s/sample is too slow for us. For our v1.2 model, we captioned our training videos with the [PLLaVA](https://github.com/magic-research/PLLaVA) model. PLLaVA performs highly competitively on multiple video-based text generation benchmarks, including [MVBench](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1).
## PLLaVA Captioning
To balance captioning speed and performance, we chose the 13B version of PLLaVA configured with 2*2 spatial pooling. We feed it 4 frames extracted evenly from each video.
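The "extracted evenly" sampling above can be sketched as follows. This is an illustrative sketch of the common center-of-segment strategy, not necessarily PLLaVA's exact frame-sampling code:

```python
def even_frame_indices(total: int, num_frames: int = 4) -> list[int]:
    """Pick `num_frames` indices spread evenly across a clip of `total` frames."""
    seg = total / num_frames
    # center each sample inside its segment to avoid clustering at the clip ends
    return [min(total - 1, int(seg * i + seg / 2)) for i in range(num_frames)]

print(even_frame_indices(100))  # → [12, 37, 62, 87]
```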
### Installation
Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
<!-- ### Download the PLLaVA repo
First, make sure you are under the directory of tools/caption/pllava_dir. Then,
```bash
git clone https://github.com/magic-research/PLLaVA.git
cd PLLaVA
git checkout fd9194a
```
### Environment
```bash
conda create -n pllava python=3.10
conda activate pllava
pip install -r requirements.txt # change to your own torch version if necessary; torch==2.2.2, torchaudio==2.2.2, torchvision==0.17.2 are known to work on H100
```
### Download weights
```bash
python python_scripts/hf.py # download the weights
``` -->
### Usage
Since PLLaVA is not packaged as an installable module, we add its repository to `PYTHONPATH` to use it.
```bash
cd .. # step back to pllava_dir
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
nohup python caption_pllava.py \
--pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
--use_lora \
--lora_alpha 4 \
--num_frames 4 \
--weight_dir PLLaVA/MODELS/pllava-13b \
--csv_path meta.csv \
--pooling_shape 4-12-12 \
> pllava_caption.out 2>&1 &
```
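For clarity, here is what the `PYTHONPATH` assignment in the command above accomplishes, expressed in Python. The repository root below is a placeholder mirroring `OPEN_SORA_HOME` in the shell command:

```python
import os
import sys

# Make the cloned (non-packaged) PLLaVA repo importable, just as the
# PYTHONPATH export does; "OPEN_SORA_HOME" is a placeholder root.
pllava_root = os.path.join("OPEN_SORA_HOME", "tools", "caption", "pllava_dir", "PLLaVA")
if pllava_root not in sys.path:
    sys.path.append(pllava_root)

print(pllava_root in sys.path)  # → True
```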
### PLLaVA vs. LLaVA
In our previous releases, we used [LLaVA](#llava-captioning) for video captioning. Below is one cherry-picked comparison between the two models:
![LLaVA vs PLLaVA](../../assets/readme/llava_vs_pllava_sample.gif)
![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
## LLaVA Captioning
```bash
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output ...
```
Then use the output csv file to resume the process.
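Conceptually, the `--difference` step keeps only the rows not yet captioned. A minimal Python sketch of that set difference follows; it is not the actual `datautil` implementation, and the `path` key column is an assumption:

```python
import csv

def difference(full_csv: str, done_csv: str, out_csv: str, key: str = "path") -> None:
    """Keep rows of full_csv whose `key` value does not appear in done_csv."""
    with open(done_csv, newline="") as f:
        done = {row[key] for row in csv.DictReader(f)}
    with open(full_csv, newline="") as f:
        reader = csv.DictReader(f)
        rows = [row for row in reader if row[key] not in done]
        fields = reader.fieldnames
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```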
## GPT-4V Captioning