# Video Captioning
Human labeling of videos is expensive and time-consuming, so we adopt powerful captioning models to generate captions for videos automatically. Although GPT-4V achieves better performance, its 20 s/sample speed is too slow for us. For our v1.2 model, we captioned the training videos with [PLLaVA](https://github.com/magic-research/PLLaVA), which performs highly competitively on multiple video-based text generation benchmarks, including [MVBench](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1).
## PLLaVA Captioning
To balance captioning speed and performance, we chose the 13B version of PLLaVA configured with 2×2 spatial pooling, and feed it 4 frames evenly extracted from each video.
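As an illustration of the frame extraction, a minimal sketch is shown below; it assumes the `decord` decoder and is not PLLaVA's actual data loader:

```python
# A minimal sketch of even frame extraction (illustrative; not PLLaVA's loader).
import numpy as np
from decord import VideoReader  # assumption: decord is available for decoding

def sample_frames(video_path: str, num_frames: int = 4) -> np.ndarray:
    vr = VideoReader(video_path)
    # Evenly spaced frame indices across the whole clip, endpoints included.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)
```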
### Installation
Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
<!-- ### Download the PLLaVA repo

First, make sure you are in the tools/caption/pllava_dir directory. Then,

```bash
git clone https://github.com/magic-research/PLLaVA.git
cd PLLaVA
git checkout fd9194a
```

### Environment

```bash
conda create -n pllava python=3.10
conda activate pllava
pip install -r requirements.txt  # change to your own torch version if necessary; torch==2.2.2, torchaudio==2.2.2, torchvision==0.17.2 worked on an H100
```

### Download weights

```bash
python python_scripts/hf.py  # download the weights
``` -->
### Usage
Since PLLaVA is not distributed as a package, we add its repository to `PYTHONPATH` so that the captioning script can import it.
```bash
cd .. # step back to pllava_dir

# Replace OPEN_SORA_HOME with the absolute path to your Open-Sora checkout.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
nohup python caption_pllava.py \
    --pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
    --use_lora \
    --lora_alpha 4 \
    --num_frames 4 \
    --weight_dir PLLaVA/MODELS/pllava-13b \
    --csv_path meta.csv \
    --pooling_shape 4-12-12 \
    > pllava_caption.out 2>&1 &
```
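Here `--pooling_shape 4-12-12` matches the configuration above: 4 frames, each pooled to a 12×12 spatial grid by the 2×2 pooling. `meta.csv` lists the videos to caption; below is a minimal sketch for producing it, assuming (our assumption, check `caption_pllava.py`) that the script reads video locations from a `path` column:

```python
# Hypothetical helper that builds meta.csv for captioning.
# Assumption: caption_pllava.py expects one video per row in a `path` column.
import glob

import pandas as pd

paths = sorted(glob.glob("videos/**/*.mp4", recursive=True))
pd.DataFrame({"path": paths}).to_csv("meta.csv", index=False)
```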
### PLLaVA vs. LLaVA
In our previous releases, we used [LLaVA](#llava-captioning) for video captioning. Below, we cherry-pick one comparison between captions from the two models:

|
||||
|
||||

## LLaVA Captioning
If a captioning run is interrupted, you can collect the samples that still lack captions by taking the difference between the full meta file and the captioned one:

```bash
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output
```

Then use the output CSV file to resume the process.
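Conceptually, the `--difference` step keeps the rows of `DATA.csv` whose videos have not yet been written to `DATA_caption.csv`. A rough sketch of that logic (not the actual `tools.datasets.datautil` implementation; the `path` column and output filename are assumptions):

```python
# Rough sketch of the --difference step (illustrative only).
# Assumption: both CSVs identify videos by a `path` column.
import pandas as pd

data = pd.read_csv("DATA.csv")
captioned = set(pd.read_csv("DATA_caption.csv")["path"])
data[~data["path"].isin(captioned)].to_csv("DATA_unfinished.csv", index=False)
```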
## GPT-4V Captioning