Open-Sora/docs/datasets.md
xyupeng a37cd74133 Dev/pxy (#64)
* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scoring/matching

* update scene_cut

* update scene_cut

* update scene_cut[A

* update scene_cut

* update scene_cut

* update scene_cut

* update scene_cut

* update scene_cut

* update scene_cut

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* update readme

* update readme

* extract frames using opencv everywhere

* extract frames using opencv everywhere

* extract frames using opencv everywhere

* filter panda10m

* filter panda10m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* m

* ocr

* add ocr

* add main.sh

* add ocr

* add ocr

* add ocr

* add ocr

* add ocr

* add ocr

* update scene_cut

* update remove main.sh

* update scoring

* update scoring

* update scoring

* update README

* update readme

* update scene_cut
2024-04-23 15:34:58 +08:00

1.3 KiB

Datasets

For Open-Sora 1.1, we conduct mixed training with both images and videos. The main datasets we use are listed below. Please refer to README for data processing.

Panda-70M

Panda-70M is a large-scale dataset with 70M video-caption pairs. We use the training-10M subset for training, which contains ~10M videos of better quality.

Pexels

Pexels is a popular online platform that provides high-quality stock photos, videos, and music for free. Most videos from this website are of high quality. Thus, we use them for both pre-training and HQ fine-tuning. We really appreciate the great platform and the contributors!

Inter4K

Inter4K is a dataset containing 1K video clips with 4K resolution. The dataset is proposed for super-resolution tasks. We use the dataset for HQ fine-tuning.

HD-VG-130M

HD-VG-130M comprises 130M text-video pairs. The caption is generated by BLIP-2. We find the scene and the text quality are relatively poor. For OpenSora 1.0, we only use ~350K samples from this dataset.