# Video Captioning
Human labeling of videos is expensive and time-consuming, so we adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20 s per sample is too slow for us. LLaVA is the second-best open-source model on [MMMU](https://mmmu-benchmark.github.io/) and accepts inputs of any resolution. We find the quality of its 34B model comparable to GPT-4V.
![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
## LLaVA Captioning
We extract three frames from each video for captioning. With batch inference, we can achieve a roughly 10× speedup. At approximately 720p resolution with one frame, the speed is 2~3 videos/s on 8 GPUs. If we resize the smaller side to 336 pixels, the speed reaches 8 videos/s. In Open-Sora v1.1, to lower the cost, we use the 7B model.
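If you want a feel for what the model sees, the sketch below grabs a few frames with the smaller side scaled to 336 px using ffmpeg. This is purely illustrative and is our own approximation of the preprocessing; `caption_llava` handles frame extraction and resizing internally.
```bash
# Illustration only: caption_llava does this internally.
# Scale so the smaller side becomes 336 px, then dump the first three frames.
ffmpeg -i input.mp4 -vf "scale=336:336:force_original_aspect_ratio=increase" -frames:v 3 frame_%d.png
```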
### Installation
Install the required dependencies by following the "Data Dependencies" and "LLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
### Usage
Prepare a csv file for processing. The csv file can be generated by `convert_dataset.py` according to its [documentation](/tools/datasets/README.md).
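Purely as an illustration, a minimal input file might look like the sketch below. The single `path` column is our assumption here; the authoritative schema is described in the datasets documentation linked above.
```bash
# Hypothetical minimal DATA.csv; consult tools/datasets/README.md for the real schema.
cat > DATA.csv <<'EOF'
path
/absolute/path/to/video_0001.mp4
/absolute/path/to/video_0002.mp4
EOF
```
Then, run the following command to generate captions for videos/images with LLaVA: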
```bash
# caption with mistral-7B
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
# caption with llava-34B
# NOTE: remember to enable flash attention for this model
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 4 --tp-size 2 --model-path liuhaotian/llava-v1.6-34b --prompt image-3ex --flash-attention
# we run this on 8xH800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 4 --bs 16
# at least two 80G GPUs are required
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16
# can also caption images
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16 --prompt image-3ex
```
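As the examples suggest, `--dp-size` × `--tp-size` should match `--nproc_per_node`: `dp-size` sets the number of data-parallel workers, while `tp-size` shards the model across GPUs, which is why the larger 34B model uses `--tp-size 2`.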
Please note that you should add the `--flash-attention` flag when running Llama-based LLaVA models, as it provides a speedup, but turn it off for Mistral-based ones. The reasons can be found in [this issue](https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453).
After running the script with `dp-size=N`, you will get `N` partial csv files. Run the following command to merge them:
```bash
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
```
### Resume
Sometimes the process may be interrupted. You can resume it with the following commands:
```bash
# merge generated results
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
# get the remaining videos
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
```
Then rerun the captioning script on the output csv file to resume the process.
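For example, resuming the Mistral-7B run from above only changes the input csv:
```bash
# caption only the videos that do not have captions yet
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA_remaining.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
```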
## PLLaVA Captioning
### Installation
Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our [installation instructions](../../docs/installation.md).
### Usage
Since PLLaVA is not packaged as an installable module, we add its repository to `PYTHONPATH` instead. In the command below, replace `OPEN_SORA_HOME` with the absolute path to your Open-Sora checkout.
```bash
cd .. # step back to pllava_dir
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
nohup python caption_pllava.py \
--pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
--use_lora \
--lora_alpha 4 \
--num_frames 4 \
--weight_dir PLLaVA/MODELS/pllava-13b \
--csv_path meta.csv \
--pooling_shape 4-12-12 \
> pllava_caption.out 2>&1 &
```
## GPT-4V Captioning
Run the following command to generate captions for videos with GPT-4V:
```bash
# output: DATA_caption.csv
python -m tools.caption.caption_gpt4 DATA.csv --key $OPENAI_API_KEY
```
The cost is approximately $0.01 per video (3 frames per video).
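At that rate, captioning 10,000 videos costs on the order of $100 and, at the ~20 s/sample speed noted above, takes more than two days of sequential requests.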
## Camera Motion Detection
Install the required packages with `pip install -v .[data]` (see [installation.md](../../docs/installation.md)).
Run the following command to classify camera motion:
```bash
# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv
```
You may additionally specify `--threshold` to control how sensitive the detection should be, as shown below. For example, `--threshold 0.2` means that a video is only counted as `tilt_up` when its pixels move down by more than 20% of the video height between the starting and ending frames; for a 1080-pixel-tall video, that is a downward shift of more than 216 pixels.
```bash
# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv --threshold 0.2
```
Each video is classified according to 8 categories: `pan_right`, `pan_left`, `tilt_up`, `tilt_down`, `zoom_in`, `zoom_out`, `static`, `unclassified`. The `tilt`, `pan`, and `zoom` categories can overlap with each other.
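To spot-check the detector's output, you can peek at the first few rows of the generated file:
```bash
head -n 5 meta_cmotion.csv  # inspect the added camera-motion annotations
```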