Mirror of https://github.com/hpcaitech/Open-Sora.git, synced 2026-05-21 11:59:01 +02:00
[docs] data processing
Parent: c553ee274f
Commit: ff15a0acfb

README.md: 39 changed lines
@@ -191,13 +191,48 @@ To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (s
## Data Processing

High-quality data is the key to high-quality models. The datasets we use and our data collection plan
are described [here](/docs/datasets.md). We provide tools to process video data. Our data processing pipeline includes
the following steps:

1. Manage datasets. [[docs](/tools/datasets/README.md)]
2. Split videos into clips. [[docs](/tools/scenedetect/README.md)]
3. Generate video captions. [[docs](/tools/caption/README.md)]

Below is an example workflow for processing data. We recommend reading the detailed documentation for each tool and deciding which tools to use based on your needs.

```bash
# Suppose the video files are under ~/dataset/
# 1. Convert dataset to CSV
# output: ~/dataset.csv
python -m tools.datasets.convert video ~/dataset
# filter out broken videos (broken videos have num_frames=0)
python -m tools.datasets.csvutil ~/dataset.csv --video-info --fmin 2 --output ~/dataset.csv

# 2. Filter dataset by aesthetic scores
# output: ~/dataset_aesthetic.csv
python -m tools.aesthetic.inference ~/dataset.csv
# sort and examine videos by aesthetic scores
python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --sort-descending aesthetic_score
# bad videos (aesthetic_score < 5)
tail ~/dataset_aesthetic.csv
# filter videos by aesthetic scores
# output: ~/dataset_aesthetic_aesmin_5.csv
python -m tools.datasets.csvutil ~/dataset_aesthetic.csv --aesmin 5

# 3. Caption dataset
# output: ~/dataset_aesthetic_aesmin_5_caption.csv
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava ~/dataset_aesthetic_aesmin_5.csv --tp-size 2 --dp-size 4 --bs 16
# remove empty captions and process captions (may need to re-caption lost ones)
python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --remove-caption-prefix --remove-empty-caption

# 4. Sanity check & prepare for training
# sanity check
python -m tools.datasets.csvutil ~/dataset_aesthetic_aesmin_5_caption.csv --ext --video-info --output ~/dataset_ready.csv
# filter out videos with fewer than 48 frames
# output: ~/dataset_ready_fmin_48.csv
python -m tools.datasets.csvutil ~/dataset_ready.csv --fmin 48
```
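
To make the aesthetic threshold in step 2 concrete, here is a minimal sketch of what "keep rows with `aesthetic_score` of at least 5" means for a plain CSV. It is not how `tools.datasets.csvutil` is implemented. The function name `filter_by_score` is hypothetical, and the sketch assumes a simple CSV with no quoted fields containing commas, which is plausible before captions are added.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the header and every
# row whose `aesthetic_score` column is >= the given threshold.
# Assumption: plain CSV, no quoted fields containing commas or newlines.
filter_by_score() {
  local csv="$1" min="$2"
  awk -F, -v min="$min" '
    NR == 1 {
      # Locate the score column by its header name.
      for (i = 1; i <= NF; i++) if ($i == "aesthetic_score") col = i
      if (!col) { print "no aesthetic_score column" > "/dev/stderr"; exit 1 }
      print
      next
    }
    # Force a numeric comparison on both sides.
    ($col + 0) >= (min + 0)
  ' "$csv"
}
```

For example, `filter_by_score ~/dataset_aesthetic.csv 5` prints the rows that would survive a threshold of 5, which can help when choosing a cutoff before running the real filter.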

## Training

To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights

@@ -54,7 +54,7 @@ torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv
After running the script with `dp-size=N`, you will get `N` partial CSV files. Run the following command to merge them:

```bash
python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
```
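
Conceptually, merging the parts amounts to concatenating their rows under a single header. The sketch below illustrates that idea with standard tools only. It does not describe what `csvutil` does internally. The function name `merge_csv_parts` is hypothetical, and the sketch assumes every part has the same one-line header and ends with a newline.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): merge CSV parts that share
# the same header into one CSV written to stdout.
# Assumptions: identical one-line header in every part, trailing newline in each file.
merge_csv_parts() {
  # Print the header once, taken from the first part.
  head -n 1 "$1"
  local f
  for f in "$@"; do
    # Print everything after the header line of each part.
    tail -n +2 "$f"
  done
}
```

A call such as `merge_csv_parts DATA_caption_part*.csv > merged.csv` can serve as a quick cross-check of the row count against the merged file.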

### Resume

@@ -63,11 +63,9 @@ Sometimes the process may be interrupted. We can resume the process by running t

```bash
# merge generated results
python -m tools.datasets.csvutil DATA_caption_part*.csv --output DATA_caption.csv
# get the remaining videos
python -m tools.datasets.csvutil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
```

Then use the output CSV file (`DATA_remaining.csv`) to resume the process.
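
The difference step can be pictured as "rows of `DATA.csv` whose video does not yet appear in the captioned file". The following is a rough sketch of that set difference, not the actual `--difference` implementation. The function name `csv_difference` is hypothetical, and the sketch assumes that the video path is the first column, that paths contain no commas, and that no field spans multiple lines.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the rows of ALL_CSV
# whose first column does not occur as a first column in DONE_CSV.
# Assumptions: the path is the first column, paths contain no commas,
# no multi-line fields, and DONE_CSV has at least a header line.
csv_difference() {
  local all_csv="$1" done_csv="$2"
  awk -F, '
    # First file (DONE_CSV): remember every finished path, skipping the header.
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }
    # Second file (ALL_CSV): keep the header and every unfinished row.
    FNR == 1 || !($1 in seen)
  ' "$done_csv" "$all_csv"
}
```

Running `csv_difference DATA.csv DATA_caption.csv | wc -l` gives a quick estimate of how many videos remain (plus one line for the header).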