Open-Sora/tools/caption
2024-06-17 06:04:29 +00:00

Video Captioning

Human labeling of videos is expensive and time-consuming, so we use powerful captioning models to generate captions for videos instead. GPT-4V produces better captions, but at roughly 20 s per sample it is too slow for our needs. For our v1.2 model, we captioned the training videos with PLLaVA, which performs competitively on multiple video-based text generation benchmarks, including MVBench.

PLLaVA Captioning

To balance captioning speed and performance, we chose the 13B version of PLLaVA configured with 2*2 spatial pooling. We feed it 4 frames sampled evenly from the video.
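As an illustration of what sampling 4 frames evenly can look like, here is a minimal sketch that picks the middle frame of each of `num_frames` equal segments. The function name `sample_frame_indices` and the midpoint strategy are assumptions made for this example; the actual PLLaVA loader may choose indices differently.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 4) -> list[int]:
    """Return evenly spaced frame indices (hypothetical helper, not PLLaVA's code).

    The video is split into `num_frames` equal segments and the middle
    frame of each segment is selected.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clamp to the last valid index in case of rounding at the end of the video.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
```

For very short videos the same index may be returned more than once, which keeps the number of frames fed to the model constant.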

Installation

Install the required dependencies by following the "Data Dependencies" and "PLLaVA Captioning" sections of our installation instructions.

Usage

Since PLLaVA is not distributed as an installable package, we add its directory to PYTHONPATH so that it can be imported.

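cd .. # step back to pllava_dir

# OPEN_SORA_HOME is assumed to be a placeholder: replace it with the path to your Open-Sora checkout.
# Double quotes are required so that the shell expands $PYTHONPATH.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \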
nohup python caption_pllava.py \
  --pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
  --use_lora \
  --lora_alpha 4 \
  --num_frames 4 \
  --weight_dir PLLaVA/MODELS/pllava-13b \
  --csv_path meta.csv \
  --pooling_shape 4-12-12 \
  > pllava_caption.out 2>&1 &

PLLaVA vs. LLaVA

In our previous releases, we used LLaVA for video captioning. Qualitatively, we observe that PLLaVA is somewhat more likely than LLaVA to capture the details in a video accurately. The two captions below, generated for the same video sample, illustrate the difference.

LLaVA vs PLLaVA

LLaVA: The video is a close-up shot of two gold wedding rings. The rings are placed on a (black surface), casting a soft shadow beneath them. The rings are positioned in such a way that (they are facing each other), creating a mirror image effect. The rings are (identical in size and design), suggesting they are a pair. The lighting in the video is soft and diffused, highlighting the gold color of the rings and creating a warm and inviting atmosphere. The overall style of the video is minimalist and elegant, focusing solely on the rings and their reflection.

PLLaVA: The video shows a pair of gold wedding rings on a (reflective surface). The rings are placed one on top of the other, with the top ring slightly tilted to the left. The rings have a (shiny, metallic finish) and are the main focus of the image. The background is a gradient of dark to light gray, providing a neutral backdrop that highlights the rings. There are no texts or other objects in the image. The style of the video is a simple product display with a focus on the rings, likely intended for promotional or sales purposes. The lighting and shadows suggest a soft, even light source, (possibly a studio light), which creates a reflective surface beneath the rings.

LLaVA Captioning

We extract three frames from the video for captioning. Batch inference gives roughly a 10x speedup. At approximately 720p resolution with 1 frame, the speed is 2~3 videos/s on 8 GPUs. If we resize the smaller side to 336, the speed rises to about 8 videos/s. In Open-Sora v1.1, we use the 7B model to lower the cost.
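To make "resize the smaller side to 336" concrete, the sketch below computes the target frame size while preserving the aspect ratio. The helper name `resize_smaller_side` is hypothetical, and the captioning script may round or resize differently.

```python
def resize_smaller_side(width: int, height: int, target: int = 336) -> tuple[int, int]:
    """Scale (width, height) so that the smaller side equals `target`.

    Hypothetical helper for illustration; the aspect ratio is preserved
    and both sides are rounded to the nearest integer.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    # A 1280x720 (720p) frame shrinks to 597x336.
    print(resize_smaller_side(1280, 720))
```

A 720p frame resized this way has about 4.6 times fewer pixels.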

Installation

Install the required dependencies by following the "Data Dependencies" and "LLaVA Captioning" sections of our installation instructions.

Usage

Prepare a CSV file for processing. It can be generated by convert_dataset.py, as described in that script's documentation. Then run one of the following commands to generate captions for videos or images with LLaVA:

# caption with mistral-7B
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video

# caption with llava-34B
# NOTE: remember to enable flash attention for this model
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 4 --tp-size 2 --model-path liuhaotian/llava-v1.6-34b --prompt image-3ex --flash-attention

# we run this on 8xH800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 4 --bs 16

# at least two 80G GPUs are required
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16

# can also caption images
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16 --prompt image-3ex

Note: add the --flash-attention flag when running Llama-based LLaVA models, since it provides a speedup, but leave it off for Mistral-based ones. The reasons can be found in this issue.
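In every command above, --nproc_per_node equals --dp-size multiplied by --tp-size (8 = 8 x 1, 8 = 4 x 2, 2 = 1 x 2). Assuming this relation holds in general, which the text here does not state explicitly, the data-parallel size for a given number of GPUs can be derived as in this sketch. The function name is hypothetical.

```python
def data_parallel_size(nproc_per_node: int, tp_size: int) -> int:
    """Derive dp-size from the process count and tp-size.

    Assumes nproc_per_node == dp_size * tp_size, as in the example
    commands; this is an inference, not a documented rule.
    """
    if tp_size <= 0 or nproc_per_node <= 0:
        raise ValueError("nproc_per_node and tp_size must be positive")
    if nproc_per_node % tp_size != 0:
        raise ValueError("nproc_per_node must be divisible by tp_size")
    return nproc_per_node // tp_size


if __name__ == "__main__":
    print(data_parallel_size(8, 2))  # 4
```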

After the script finishes with dp-size=N, you will have N partial CSV files. Run the following command to merge them:

python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
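Conceptually, the merge concatenates the rows of all part files into one CSV. The sketch below shows that idea using only the Python standard library. It is not the implementation of tools.datasets.datautil, and the function name `merge_caption_parts` is hypothetical.

```python
import csv
import glob


def merge_caption_parts(pattern: str, output_path: str) -> int:
    """Concatenate all CSV files matching `pattern` into `output_path`.

    Illustrative sketch only, not the datautil implementation.
    Returns the number of data rows written.
    """
    rows: list[dict] = []
    fieldnames: list[str] = []
    for part_path in sorted(glob.glob(pattern)):
        with open(part_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Keep the union of columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    count = merge_caption_parts("DATA_caption_part*.csv", "DATA_caption.csv")
    print(f"merged {count} rows")
```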

Resume

If the process is interrupted, you can resume it. First merge the results generated so far, then compute the videos that still need captions:

# merge generated results
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv

# get the remaining videos
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv

Then pass the output CSV file (DATA_remaining.csv) to the captioning command to resume the process.
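The --difference step keeps the rows of DATA.csv that do not yet appear in DATA_caption.csv. A minimal standard-library sketch of that idea follows. It assumes rows are matched on a column named `path`, which is an assumption for this example, and it is not the datautil implementation.

```python
import csv


def write_remaining(all_path: str, done_path: str, output_path: str, key: str = "path") -> int:
    """Write rows of `all_path` whose `key` is absent from `done_path`.

    Illustrative sketch; the `key` column name is an assumption.
    Returns the number of remaining rows written.
    """
    with open(done_path, newline="", encoding="utf-8") as f:
        done = {row[key] for row in csv.DictReader(f)}
    with open(all_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or [key]
        remaining = [row for row in reader if row[key] not in done]
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(remaining)
    return len(remaining)


if __name__ == "__main__":
    left = write_remaining("DATA.csv", "DATA_caption.csv", "DATA_remaining.csv")
    print(f"{left} videos still need captions")
```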

GPT-4V Captioning

Run the following command to generate captions for videos with GPT-4V:

# output: DATA_caption.csv
python -m tools.caption.caption_gpt4 DATA.csv --key $OPENAI_API_KEY

The cost is approximately $0.01 per video (3 frames per video).
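For budgeting, the figures quoted in this document (about $0.01 per video and about 20 s per sample) can be turned into a rough estimate. The sketch below assumes requests are issued one at a time. The function name is hypothetical, and real costs depend on current API pricing.

```python
def estimate_gpt4v_run(
    num_videos: int,
    cost_per_video: float = 0.01,    # USD, figure quoted in this document
    seconds_per_video: float = 20.0,  # figure quoted in this document
) -> tuple[float, float]:
    """Return (total cost in USD, total hours), assuming sequential requests."""
    if num_videos < 0:
        raise ValueError("num_videos must be non-negative")
    total_cost = num_videos * cost_per_video
    total_hours = num_videos * seconds_per_video / 3600
    return total_cost, total_hours


if __name__ == "__main__":
    cost, hours = estimate_gpt4v_run(10_000)
    print(f"about ${cost:.2f} and {hours:.1f} hours")
```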

Camera Motion Detection

Install the required packages with pip install -v .[data] (see installation.md). Then run the following command to classify camera motion:

# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv

You can additionally pass --threshold to control how sensitive the detection is. For example, --threshold 0.2 means that a video is classified as tilt_up only when its pixels move down by more than 20% of the video height between the first and last frames.

# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv --threshold 0.2

Each video is classified into 8 categories: pan_right, pan_left, tilt_up, tilt_down, zoom_in, zoom_out, static, unclassified. The tilt, pan, and zoom categories can overlap with each other.
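The threshold rule can be sketched as follows for tilt and pan only. This is a simplified illustration, not the detector's implementation. It assumes the net pixel displacement between the first and last frames is already known, that pan uses the video width in the same way tilt uses the height, and that pixels moving left correspond to pan_right. Zoom and unclassified are omitted.

```python
def classify_tilt_pan(
    dx: float, dy: float, width: int, height: int, threshold: float = 0.2
) -> list[str]:
    """Label camera motion from net pixel displacement (illustrative sketch).

    dx > 0 means pixels moved right, dy > 0 means pixels moved down.
    A label is assigned only when the displacement exceeds `threshold`
    times the corresponding frame dimension.
    """
    labels = []
    if dy > threshold * height:
        labels.append("tilt_up")      # pixels moved down, camera tilted up
    elif dy < -threshold * height:
        labels.append("tilt_down")
    if dx < -threshold * width:
        labels.append("pan_right")    # assumed: pixels moved left, camera panned right
    elif dx > threshold * width:
        labels.append("pan_left")
    return labels or ["static"]


if __name__ == "__main__":
    # 200 px down on a 720 px tall video exceeds 20% (144 px).
    print(classify_tilt_pan(0, 200, 1280, 720))  # ['tilt_up']
```

Because tilt and pan are tested independently, a single video can receive both labels, which matches the overlap between categories described above.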