# Datasets

For Open-Sora 1.1, we conduct mixed training with both images and videos. The main datasets we use are listed below.
Please refer to [README](/README.md#data-processing) for data processing.

## Panda-70M
[Panda-70M](https://github.com/snap-research/Panda-70M) is a large-scale dataset with 70M video-caption pairs.
We train on the [training-10M subset](https://github.com/snap-research/Panda-70M/tree/main/dataset_dataloading),
which contains ~10M videos of higher quality.

## Pexels
[Pexels](https://www.pexels.com/) is a popular online platform that provides high-quality stock photos, videos, and music for free. 
Most videos on this website are of high quality, so we use them for both pre-training and high-quality (HQ) fine-tuning.
We really appreciate the great platform and the contributors!

## Inter4K
[Inter4K](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1K video clips at 4K resolution.
The dataset was proposed for super-resolution tasks. We use it for HQ fine-tuning.


## HD-VG-130M
[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs. 
The captions are generated by BLIP-2.
We find that both the scene quality and the caption quality are relatively poor. For Open-Sora 1.0, we only use ~350K samples from this dataset.