[docs] data processing

Zangwei Zheng 2024-04-01 16:08:53 +08:00
parent c553ee274f
commit ff15a0acfb
2 changed files with 40 additions and 7 deletions

@@ -191,13 +191,48 @@ To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (s
## Data Processing
High-quality data is the key to high-quality models. The datasets we use and our data collection plan
are described [here](/docs/datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes
are described [here](/docs/datasets.md). We provide tools to process video data. Our data processing pipeline includes
the following steps:
1. Downloading datasets. [[docs](/tools/datasets/README.md)]
1. Manage datasets. [[docs](/tools/datasets/README.md)]
2. Split videos into clips. [[docs](/tools/scenedetect/README.md)]
3. Generate video captions. [[docs](/tools/caption/README.md)]
Below is an example workflow for processing data. We recommend reading the detailed documentation for each tool and deciding which tools to use based on your needs.
```bash
# Suppose the video files are under ~/dataset/
# 1. Convert dataset to CSV
# output: ~/dataset.csv
python -m tools.datasets.convert video ~/dataset
# filter out broken videos (broken videos have num_frames=0)
python -m tools.datasets.csvutil ~/dataset.csv --video-info --fmin 2 --output ~/dataset.csv
# 2. Filter dataset by aesthetic scores
# output: ~/dataset_aesthetic.csv
python -m tools.aesthetic.inference ~/dataset.csv
# sort and examine videos by aesthetic scores
python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --sort-descending aesthetic_score
# bad videos (aesthetic_score < 5)
tail ~/dataset_aesthetic.csv
# filter videos by aesthetic scores
# output: ~/dataset_aesthetic_aesmin_5.csv
python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --aesmin 5
# 3. Caption dataset
# output: ~/dataset_aesthetic_aesmin_5_caption.csv
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava ~/dataset_aesthetic_aesmin_5.csv --tp-size 2 --dp-size 4 --bs 16
# remove empty captions and process captions (may need to re-caption lost ones)
python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --remove-caption-prefix --remove-empty-caption
# 4. Sanity check & prepare for training
# sanity check
python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --ext --video-info --output ~/dataset_ready.csv
# filter out videos with fewer than 48 frames
# output: ~/dataset_ready_fmin_48.csv
python -m tools.datasets.csvutil ~/dataset_ready.csv --fmin 48
```
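To make the filtering steps in the workflow concrete, here is a minimal Python sketch of what a threshold filter over such a CSV amounts to. It is not the implementation of `tools.datasets.csvutil`; the `aesthetic_score` and `num_frames` column names are taken from the commands and comments in the workflow, while the `path` column, the `filter_rows` helper, and the sample rows are hypothetical.

```python
import csv
import io


def filter_rows(rows, aes_min=None, frames_min=None):
    """Keep rows that meet the given minimum thresholds (None disables a check)."""
    kept = []
    for row in rows:
        if aes_min is not None and float(row["aesthetic_score"]) < aes_min:
            continue
        if frames_min is not None and int(row["num_frames"]) < frames_min:
            continue
        kept.append(row)
    return kept


# Hypothetical sample data standing in for ~/dataset_aesthetic.csv
sample = io.StringIO(
    "path,num_frames,aesthetic_score\n"
    "a.mp4,120,5.8\n"
    "b.mp4,30,6.1\n"
    "c.mp4,200,4.2\n"
)
rows = list(csv.DictReader(sample))

# Mirrors the intent of `--aesmin 5` followed by `--fmin 48`
selected = filter_rows(rows, aes_min=5, frames_min=48)
print([row["path"] for row in selected])
```

Only `a.mp4` passes both thresholds here: `b.mp4` is too short and `c.mp4` scores below 5.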
## Training
To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights

@@ -54,7 +54,7 @@ torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv
After running the script with `dp-size=N`, you will get `N` partial CSV files. Run the following command to merge them:
```bash
python -m tools.datasets.csvutil DATA_caption_part0.csv DATA_caption_part1.csv DATA_caption_part2.csv DATA_caption_part3.csv --output DATA_caption.csv
python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
```
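For readers who want to merge the parts from their own script, the sketch below shows one way to concatenate them in Python. It is an illustration under assumptions, not the `csvutil` implementation: the `merge_csv_parts` helper is hypothetical, and it assumes every part shares the same header. Note that in the shell command the `*` is expanded by the shell, whereas in Python the pattern must be expanded explicitly with `glob`.

```python
import csv
import glob
import os
import tempfile


def merge_csv_parts(pattern, output_path):
    """Concatenate CSV files matching `pattern` into one file with a single header."""
    fieldnames = None
    merged = []
    for part_path in sorted(glob.glob(pattern)):
        with open(part_path, newline="") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            merged.extend(reader)
    if fieldnames is None:
        raise FileNotFoundError(f"no readable CSV files match {pattern!r}")
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged)
    return len(merged)


# Demo with two hypothetical parts written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for i, body in enumerate(["a.mp4,a cat\n", "b.mp4,a dog\nc.mp4,a bird\n"]):
        with open(os.path.join(tmp, f"DATA_caption_part{i}.csv"), "w") as f:
            f.write("path,text\n" + body)
    merged_count = merge_csv_parts(
        os.path.join(tmp, "DATA_caption_part*.csv"),
        os.path.join(tmp, "DATA_caption.csv"),
    )
    print(merged_count)
```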
### Resume
@@ -63,11 +63,9 @@ Sometimes the process may be interrupted. We can resume the process by running t
```bash
# merge generated results
# output: DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
python -m tools.datasets.csvutil DATA_caption_part0.csv DATA_caption_part1.csv DATA_caption_part2.csv DATA_caption_part3.csv
python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
# get the remaining videos
# output: DATA-DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
python -m tools.datasets.csvutil DATA.csv --difference DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
python -m tools.datasets.csvutil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
```
Then use the output csv file to resume the process.
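Conceptually, the `--difference` step keeps the rows of `DATA.csv` that do not yet appear in the captioned output. The sketch below illustrates that set difference in Python; it assumes rows are identified by a `path` column, which is an assumption made for illustration, and the `difference` helper is hypothetical rather than the tool's own code.

```python
import csv
import io


def difference(all_rows, done_rows, key="path"):
    """Return the rows of `all_rows` whose `key` value is absent from `done_rows`."""
    done = {row[key] for row in done_rows}
    return [row for row in all_rows if row[key] not in done]


# Hypothetical stand-ins for DATA.csv and DATA_caption.csv
all_videos = list(csv.DictReader(io.StringIO("path\na.mp4\nb.mp4\nc.mp4\n")))
captioned = list(csv.DictReader(io.StringIO("path,text\na.mp4,a cat\nc.mp4,a bird\n")))

remaining = difference(all_videos, captioned)
print([row["path"] for row in remaining])
```

In this sample only `b.mp4` still needs a caption, so it is the only row carried into the resumed run.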