mirror of
https://github.com/hpcaitech/Open-Sora.git
synced 2026-05-21 11:59:01 +02:00
# Dataset Download and Management

## Dataset Format

The training data should be provided in a CSV file, with each row containing the following information:

```csv
path, text, num_frames, aesthetic_score
/absolute/path/to/image1.jpg, caption1, num_of_frames, aesthetic_score
/absolute/path/to/video2.mp4, caption2, num_of_frames, aesthetic_score
```
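
To make the format concrete, here is a minimal sketch of reading such a file with Python's standard `csv` module. It is an illustration only, not the loader used by the training code. `load_rows` is a hypothetical helper, and treating `path`, `text` and `num_frames` as the required columns is an assumption. `skipinitialspace=True` is used because the example above puts a space after each comma.

```python
import csv
import io

# Assumption: these three columns are the minimum every row must provide.
REQUIRED_COLUMNS = ("path", "text", "num_frames")


def load_rows(csv_text: str) -> list[dict]:
    """Parse CSV text and check that every row fills the required columns."""
    # skipinitialspace=True drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        # DictReader pads short rows with None, so incomplete rows are caught here.
        if any(row[c] in (None, "") for c in REQUIRED_COLUMNS):
            raise ValueError(f"line {line_no}: incomplete row")
        row["num_frames"] = int(row["num_frames"])
        rows.append(row)
    return rows


# Hypothetical sample data, for illustration only.
sample = """path, text, num_frames
/data/image1.jpg, a red apple, 1
/data/video2.mp4, a dog running, 120
"""
rows = load_rows(sample)
```

Raising on the first incomplete row is a design choice for the sketch; a real pipeline might prefer to log and skip such rows.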

## HD-VG-130M

This dataset comprises 130M text-video pairs. You can download the dataset and prepare it for training according to [the dataset repository's instructions](https://github.com/daooshee/HD-VG-130M). There is a README.md file in the Google Drive link that provides instructions on how to download and cut the videos. For this version, we directly use the dataset provided by the authors.

## VidProM

After downloading the dataset, you can use the following command to prepare the CSV file:

```bash
python -m tools.datasets.convert_dataset vidprom VIDPROM_FOLDER --info VidProM_semantic_unique.csv
```

## Demo Dataset

You can use ImageNet and UCF101 for a quick demo. After downloading the datasets, you can use the following commands to prepare the CSV file for each dataset:

```bash
# ImageNet
python -m tools.datasets.convert_dataset imagenet IMAGENET_FOLDER --split train
# UCF101
python -m tools.datasets.convert_dataset ucf101 UCF101_FOLDER --split videos
```
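
For intuition about what such a conversion involves, the following is a rough sketch that turns a class-per-folder video layout into `path, text` rows. It is not the implementation of `convert_dataset`. The layout `ROOT/ClassName/video.avi`, the extension list, and the idea of deriving the caption from the class folder name are all assumptions, and `class_to_caption`, `build_rows` and `write_csv` are hypothetical names. The sketch leaves out `num_frames`.

```python
import csv
import re
from pathlib import Path

VIDEO_EXTS = {".avi", ".mp4"}  # assumed set of video extensions


def class_to_caption(name: str) -> str:
    """Turn a CamelCase folder name such as 'ApplyEyeMakeup' into 'apply eye makeup'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name).lower()


def build_rows(root: Path) -> list[tuple[str, str]]:
    """Collect (absolute path, caption) pairs from ROOT/ClassName/video files."""
    rows = []
    for video in sorted(root.glob("*/*")):
        if video.suffix.lower() in VIDEO_EXTS:
            rows.append((str(video.resolve()), class_to_caption(video.parent.name)))
    return rows


def write_csv(rows: list[tuple[str, str]], out_path: Path) -> None:
    """Write the rows with a single header line."""
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "text"])
        writer.writerows(rows)
```

A call such as `write_csv(build_rows(Path("UCF101_FOLDER")), Path("ucf101.csv"))` would produce the file; the folder and output names here are placeholders.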

## Manage Datasets

We provide `csvutil.py` to manage the CSV files. You can use the following commands to process the CSV files:

```bash
# generate DATA_fmin_128_fmax_256.csv with frames between 128 and 256
python -m tools.datasets.csvutil DATA.csv --fmin 128 --fmax 256
# generate DATA_root.csv with absolute path
python -m tools.datasets.csvutil DATA.csv --root /absolute/path/to/dataset
# remove videos with no captions
python -m tools.datasets.csvutil DATA.csv --remove-empty-caption
# compute the number of frames for each video
python -m tools.datasets.csvutil DATA.csv --relength
# remove caption prefix
python -m tools.datasets.csvutil DATA.csv --remove-caption-prefix
```
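
The filters above can be pictured as simple row operations. The sketch below illustrates three of them on rows held as dictionaries. It is not the code in `csvutil`: the function names are hypothetical, and treating the `--fmin`/`--fmax` bounds as inclusive is an assumption.

```python
import posixpath


def filter_by_frames(rows, fmin=None, fmax=None):
    """Keep rows whose num_frames lies in [fmin, fmax]; None disables a bound."""
    kept = []
    for row in rows:
        n = int(row["num_frames"])
        if fmin is not None and n < fmin:
            continue
        if fmax is not None and n > fmax:
            continue
        kept.append(row)
    return kept


def remove_empty_captions(rows):
    """Drop rows whose caption is missing or only whitespace."""
    return [r for r in rows if r["text"] and r["text"].strip()]


def add_root(rows, root):
    """Return copies of the rows with `root` prepended to each relative path."""
    return [{**r, "path": posixpath.join(root, r["path"])} for r in rows]


# Hypothetical sample rows, for illustration only.
rows = [
    {"path": "a.mp4", "text": "a cat", "num_frames": "64"},
    {"path": "b.mp4", "text": "", "num_frames": "200"},
    {"path": "c.mp4", "text": "a dog", "num_frames": "256"},
]
kept = add_root(remove_empty_captions(filter_by_frames(rows, 128, 256)), "/data")
```

Each helper returns a new list rather than modifying its input, so the steps can be chained in any order.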
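
To merge multiple CSV files that share the same header, you can use the following command, which keeps only the header row of the first file:

```bash
# skip the first line of every file except the first one
awk 'FNR == 1 && NR != 1 { next } { print }' *.csv > combined.csv
```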
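
When every input file starts with a header row, a merge should keep that row only once. A minimal Python sketch of such a merge, which also checks that the headers match, could look as follows. `merge_csvs` is a hypothetical helper and not part of the repository, and the output file must not be one of the inputs.

```python
import csv


def merge_csvs(inputs, output):
    """Concatenate CSV files that share a header, writing the header once."""
    header = None
    with open(output, "w", newline="") as out:
        writer = csv.writer(out)
        for path in inputs:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file defines the header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path}: header differs from the first file")
                # Copy the remaining data rows unchanged.
                writer.writerows(reader)
    return output
```

Failing on a header mismatch is a deliberate choice here: silently merging files with different columns would produce rows that no longer line up with the header.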