mirror of https://github.com/hpcaitech/Open-Sora.git synced 2026-04-15 11:52:47 +02:00

README.md

Dataset Download and Management

Dataset Download

HD-VG-130M

This dataset comprises 130M text-video pairs. You can download the dataset and prepare it for training according to the dataset repository's instructions. There is a README.md file in the Google Drive link that provides instructions on how to download and cut the videos. For this version, we directly use the dataset provided by the authors.

VidProM

python -m tools.datasets.convert_dataset vidprom VIDPROM_FOLDER --info VidProM_semantic_unique.csv

Demo Dataset

You can use ImageNet and UCF101 for a quick demo. After downloading the datasets, you can use the following command to prepare the csv file for the dataset:

# ImageNet
python -m tools.datasets.convert_dataset imagenet IMAGENET_FOLDER --split train
# UCF101
python -m tools.datasets.convert_dataset ucf101 UCF101_FOLDER --split videos

Dataset Format

The dataset should be provided as a CSV file, which is used for both training and data preprocessing. The CSV file should contain only the following columns (columns can be optional). The aspect ratio is the width divided by the height.

path, text, num_frames, fps, width, height, aspect_ratio, aesthetic_score, clip_score
/absolute/path/to/image1.jpg, caption1, num_of_frames
/absolute/path/to/video2.mp4, caption2, num_of_frames

We use pandas to manage the CSV files. You can use the following code to read and write the CSV files:

import pandas as pd

df = pd.read_csv(input_path)
df.to_csv(output_path, index=False)  # to_csv returns None when given a path, so do not reassign df
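As a minimal sketch of the format described above, the following builds a small table with pandas and derives aspect_ratio as width divided by height. The two rows, their paths, and their values are made up for illustration; only the column names come from this README.

```python
import io

import pandas as pd

# Hypothetical rows; paths and values are illustrative only.
raw = io.StringIO(
    "path,text,num_frames,fps,width,height\n"
    "/absolute/path/to/video1.mp4,caption1,32,24,512,512\n"
    "/absolute/path/to/video2.mp4,caption2,64,24,1280,720\n"
)
df = pd.read_csv(raw)

# Aspect ratio is width divided by height, as stated in the format description.
df["aspect_ratio"] = df["width"] / df["height"]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing index=False keeps the pandas row index out of the output, so the file contains only the dataset columns.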

Dataset Management

We provide csvutil.py to manage the CSV files. You can use the following commands to process the CSV files:

# csvutil takes multiple CSV files as input and merges them into one CSV file
python -m tools.datasets.csvutil DATA1.csv DATA2.csv

# keep only videos with between 128 and 256 frames and a non-empty caption
python -m tools.datasets.csvutil DATA.csv --fmin 128 --fmax 256 --remove-empty-caption
# compute the number of frames for each video
python -m tools.datasets.csvutil DATA.csv --video-info
# remove caption prefix
python -m tools.datasets.csvutil DATA.csv --remove-caption-prefix
# generate DATA_root.csv with absolute paths
python -m tools.datasets.csvutil DATA.csv --abspath /absolute/path/to/dataset

# examine the first 10 rows of the CSV file
head -n 10 DATA1.csv
# count the number of rows in the CSV file (approximate; the header line is included)
wc -l DATA1.csv
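To make the merge and filter steps concrete, here is a rough pandas sketch of the same idea. It is not csvutil's actual implementation: the function name merge_and_filter, its parameters, and the choice of inclusive bounds are assumptions made for illustration.

```python
import pandas as pd


def merge_and_filter(frames, fmin=None, fmax=None, remove_empty_caption=False):
    """Concatenate tables, then apply optional frame-count and caption filters."""
    df = pd.concat(frames, ignore_index=True)
    if fmin is not None:
        df = df[df["num_frames"] >= fmin]  # assumed inclusive lower bound
    if fmax is not None:
        df = df[df["num_frames"] <= fmax]  # assumed inclusive upper bound
    if remove_empty_caption:
        # Treat missing and whitespace-only captions as empty.
        text = df["text"].fillna("").astype(str).str.strip()
        df = df[text != ""]
    return df.reset_index(drop=True)


# Hypothetical in-memory tables standing in for DATA1.csv and DATA2.csv.
a = pd.DataFrame({"path": ["a.mp4", "b.mp4"], "text": ["a cat", ""], "num_frames": [130, 200]})
b = pd.DataFrame({"path": ["c.mp4", "d.mp4"], "text": ["a dog", "a bird"], "num_frames": [64, 256]})

out = merge_and_filter([a, b], fmin=128, fmax=256, remove_empty_caption=True)
print(out)
```

In this toy example, b.mp4 is dropped for its empty caption and c.mp4 for having too few frames, leaving a.mp4 and d.mp4.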

To speed up processing, you can install pandarallel:

pip install pandarallel

To filter text language, you need to install lingua:

pip install lingua-language-detector

To get video information, you need to install opencv-python:

pip install opencv-python
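For reference, the sketch below shows one way video metadata could be read with opencv-python and arranged into the CSV columns listed under Dataset Format. The helpers to_row and read_video_info are hypothetical names and are not part of csvutil. The frame count reported by OpenCV comes from container metadata and may be approximate for some files.

```python
def to_row(path, num_frames, fps, width, height):
    """Arrange raw video properties into the dataset CSV columns."""
    return {
        "path": path,
        "num_frames": int(num_frames),
        "fps": float(fps),
        "width": int(width),
        "height": int(height),
        # Aspect ratio is width divided by height; undefined for zero height.
        "aspect_ratio": width / height if height else None,
    }


def read_video_info(path):
    """Return a row dict for one video, or None if it cannot be opened."""
    import cv2  # provided by opencv-python; imported lazily

    cap = cv2.VideoCapture(path)
    try:
        if not cap.isOpened():
            return None
        return to_row(
            path,
            cap.get(cv2.CAP_PROP_FRAME_COUNT),
            cap.get(cv2.CAP_PROP_FPS),
            cap.get(cv2.CAP_PROP_FRAME_WIDTH),
            cap.get(cv2.CAP_PROP_FRAME_HEIGHT),
        )
    finally:
        cap.release()  # always free the capture handle
```

Calling read_video_info on each path and collecting the resulting dicts into a pandas DataFrame would give a table in the format described earlier.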