[docs] data processing

2026-05-21 11:59:01 +02:00 · 2024-04-01 16:08:53 +08:00 · 2024-04-01 16:08:53 +08:00 · ff15a0acfb
commit ff15a0acfb
parent c553ee274f
2 changed files with 40 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -191,13 +191,48 @@ To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (s
 ## Data Processing

 High-quality Data is the key to high-quality models. Our used datasets and data collection plan
-is [here](/docs/datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes
+is [here](/docs/datasets.md). We provide tools to process video data. Our data processing pipeline includes
 the following steps:

-1. Downloading datasets. [[docs](/tools/datasets/README.md)]
+1. Manage datasets. [[docs](/tools/datasets/README.md)]
 2. Split videos into clips. [[docs](/tools/scenedetect/README.md)]
 3. Generate video captions. [[docs](/tools/caption/README.md)]

+Below is an example workflow for processing data. We recommend reading the detailed documentation for each tool and deciding which tools to use based on your needs.
+
+```bash
+# Suppose the video files are under ~/dataset/
+# 1. Convert dataset to CSV
+# output: ~/dataset.csv
+python -m tools.datasets.convert video ~/dataset
+# filter out broken videos (broken videos have num_frames=0)
+python -m tools.datasets.csvutil ~/dataset.csv --video-info --fmin 2 --output ~/dataset.csv
+
+# 2. Filter dataset by aesthetic scores
+# output: ~/dataset_aesthetic.csv
+python -m tools.aesthetic.inference ~/dataset.csv
+# sort and examine videos by aesthetic scores
+python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --sort-descending aesthetic_score
+# bad videos (aesthetic_score < 5)
+tail ~/dataset_aesthetic.csv
+# filter videos by aesthetic scores
+# output: ~/dataset_aesthetic_aesmin_5.csv
+python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --aesmin 5
+
+# 3. Caption dataset
+# output: ~/dataset_aesthetic_aesmin_5_caption.csv
+torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava ~/dataset_aesthetic_aesmin_5.csv --tp-size 2 --dp-size 4 --bs 16
+# remove caption prefixes and empty captions (videos left without a caption may need re-captioning)
+python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --remove-caption-prefix --remove-empty-caption
+
+# 4. Sanity check & prepare for training
+# sanity check
+python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --ext --video-info --output ~/dataset_ready.csv
+# filter out videos with fewer than 48 frames
+# output: ~/dataset_ready_fmin_48.csv
+python -m tools.datasets.csvutil ~/dataset_ready.csv --fmin 48
+```
+
 ## Training

 To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights
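
Note on the README workflow above: the `--aesmin` and `--fmin` steps amount to row-level threshold filters over the CSV. The sketch below illustrates that idea on an in-memory CSV. It is not the `csvutil` implementation: the function `filter_rows`, the `path` column, and the inclusive (`>=`) thresholds are assumptions made for illustration, while `num_frames` and `aesthetic_score` are the column names mentioned in the workflow comments.

```python
import csv
import io


def filter_rows(rows, aesmin=None, fmin=None):
    """Keep rows meeting the given thresholds (hypothetical helper).

    Assumes both thresholds are inclusive: a row is kept when
    aesthetic_score >= aesmin and num_frames >= fmin.
    """
    kept = []
    for row in rows:
        if aesmin is not None and float(row["aesthetic_score"]) < aesmin:
            continue
        if fmin is not None and int(row["num_frames"]) < fmin:
            continue
        kept.append(row)
    return kept


# Made-up sample data; b.mp4 stands in for a broken video (num_frames=0).
SAMPLE = """path,num_frames,aesthetic_score
a.mp4,120,5.8
b.mp4,0,6.1
c.mp4,60,4.2
d.mp4,30,5.1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Comparable to `--fmin 2`: drops only the broken video.
print([r["path"] for r in filter_rows(rows, fmin=2)])
# ['a.mp4', 'c.mp4', 'd.mp4']

# Comparable to `--aesmin 5` followed by `--fmin 48`.
print([r["path"] for r in filter_rows(rows, aesmin=5, fmin=48)])
# ['a.mp4']
```

Running the filters in sequence, as the workflow does, gives the same rows as applying both thresholds at once, so the order of steps 2 and 4 only affects how many videos get captioned in between.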
--- a/tools/caption/README.md
+++ b/tools/caption/README.md
@@ -54,7 +54,7 @@ torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv
 After running the script, with `dp-size=N`, you will get `N` parts of csv files. Run the following command to merge them:

 ```bash
-python -m tools.datasets.csvutil DATA_caption_part0.csv DATA_caption_part1.csv DATA_caption_part2.csv DATA_caption_part3.csv --output DATA_caption.csv
+python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
 ```

 ### Resume
@@ -63,11 +63,9 @@ Sometimes the process may be interrupted. We can resume the process by running t

 ```bash
 # merge generated results
-# output: DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
-python -m tools.datasets.csvutil DATA_caption_part0.csv DATA_caption_part1.csv DATA_caption_part2.csv DATA_caption_part3.csv
+python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
 # get the remaining videos
-# output: DATA-DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
-python -m tools.datasets.csvutil DATA.csv --difference DATA_caption_part0+DATA_caption_part1+DATA_caption_part2+DATA_caption_part3.csv
+python -m tools.datasets.csvutil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
 ```

 Then use the output csv file to resume the process.
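
Note on the resume flow above: merging the parts and then taking `--difference` is, in effect, a concatenation followed by a set difference keyed on the video identifier. The sketch below shows that logic on in-memory CSV text. It is an illustration under assumptions, not the `csvutil` code: the helpers `read_rows`, `merge`, and `difference` are hypothetical, and so are the `path` key column and the `text` caption column.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def merge(parts):
    """Concatenate the rows of several CSV parts in order."""
    merged = []
    for rows in parts:
        merged.extend(rows)
    return merged


def difference(all_rows, done_rows):
    """Return rows of all_rows whose path does not appear in done_rows."""
    done = {row["path"] for row in done_rows}
    return [row for row in all_rows if row["path"] not in done]


# Made-up sample data: four videos, two of them captioned before the interruption.
DATA = "path\na.mp4\nb.mp4\nc.mp4\nd.mp4\n"
PART0 = "path,text\na.mp4,a cat on a sofa\n"
PART1 = "path,text\nc.mp4,waves at sunset\n"

captioned = merge([read_rows(PART0), read_rows(PART1)])  # plays the role of DATA_caption.csv
remaining = difference(read_rows(DATA), captioned)       # plays the role of DATA_remaining.csv

print([r["path"] for r in captioned])  # ['a.mp4', 'c.mp4']
print([r["path"] for r in remaining])  # ['b.mp4', 'd.mp4']
```

Only the remaining rows need to be passed back to the captioning step; the already captioned rows can be merged with the new results afterwards.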