Merge branch 'main' of https://github.com/celaraze/Open-Sora into celaraze-main

Zangwei Zheng 2024-03-23 14:37:55 +08:00
commit becf0e7d37
7 changed files with 624 additions and 248 deletions

README.md

## Open-Sora: Democratizing Efficient Video Production for All
We present **Open-Sora**, an initiative dedicated to **efficiently** producing high-quality video and making the model,
tools, and content accessible to all. By embracing **open-source** principles,
Open-Sora not only democratizes access to advanced video generation techniques, but also offers a
streamlined and user-friendly platform that simplifies the complexities of video production.
With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation.
[[中文文档]](/docs/zh_CN/README.md)
<h4>Open-Sora is still at an early stage and under active development.</h4>
## 📰 News
* **[2024.03.18]** 🔥 We release **Open-Sora 1.0**, a fully open-source project for video generation.
  Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
  <a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a> acceleration,
  inference, and more. Our provided [checkpoints](#model-weights) can produce 2s 512x512 videos with only 3 days of training.
  [[blog]](https://hpc-ai.com/blog/open-sora-v1.0)
* **[2024.03.04]** Open-Sora provides training with a 46% cost reduction.
  [[blog]](https://hpc-ai.com/blog/open-sora)
## 🎥 Latest Demo
| **2s 512×512** | **2s 512×512** | **2s 512×512** |
|----------------|----------------|----------------|
| [<img src="assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| [<img src="assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |
Videos are downsampled to `.gif` for display. Click for the original videos. Prompts are trimmed for display;
see [here](/assets/texts/t2v_samples.txt) for full prompts. See more samples in our [gallery](https://hpcaitech.github.io/Open-Sora/).
## 🔆 New Features/Updates
* 📍 Open-Sora-v1 released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
* ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
* ✅ Support training acceleration, including an accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by **55%** when training on 64x512x512 videos. Details are provided in [acceleration.md](docs/acceleration.md).
* ✅ We provide a data preprocessing pipeline, including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md), and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found in [datasets.md](docs/datasets.md).
* ✅ We find that the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality, and thus adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in the time dimension deteriorates the quality. See our **[report](docs/report_v1.md)** for more discussion.
* ✅ We investigate different architectures, including DiT, Latte, and our proposed STDiT. Our **STDiT** achieves a better trade-off between quality and speed. See our **[report](docs/report_v1.md)** for more discussion.
* ✅ Support CLIP and T5 text conditioning.
* ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See [commands.md](docs/commands.md) for more instructions.
* ✅ Support inference with official weights from [DiT](https://github.com/facebookresearch/DiT), [Latte](https://github.com/Vchitect/Latte), and [PixArt](https://pixart-alpha.github.io/).
<details>
<summary>View more</summary>
* ✅ Refactor the codebase. See [structure.md](docs/structure.md) to learn the project structure and how to use the config files.
</details>
### TODO list sorted by priority
* [ ] Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See [datasets.md](/docs/datasets.md) for more information. **[WIP]**
* [ ] Training Video-VAE. **[WIP]**
<details>
<summary>View more</summary>

* [ ] Support image and video conditioning.
* [ ] Evaluation pipeline.
* [ ] Incorporate a better scheduler, e.g., the rectified flow scheduler from SD3.
* [ ] Support variable aspect ratios, resolutions, and durations.
* [ ] Support SD3 upon its release.

</details>

## Installation

```bash
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
```
After installation, we suggest reading [structure.md](docs/structure.md) to learn the project structure and how to use the config files.

## Model Weights

| Resolution | Data   | #iterations | Batch Size | GPU days (H800) | URL |
|------------|--------|-------------|------------|-----------------|-----|
| 16×256×256 | 366K | 80k | 8×64 | 117 | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-16x256x256.pth) |
| 16×256×256 | 20K HQ | 24k | 8×64 | 45 | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x256x256.pth) |
| 16×512×512 | 20K HQ | 20k | 2×64 | 35 | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth) |
Our model's weights are partially initialized from [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha). The number of parameters is 724M. More information about training can be found in our **[report](/docs/report_v1.md)**. More about the dataset can be found in [datasets.md](/docs/datasets.md). HQ means high quality.
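If you prefer to fetch a checkpoint programmatically, a minimal sketch using the `huggingface_hub` package (assuming it is installed; the repository and file names come from the URLs in the table above) is:

```python
from huggingface_hub import hf_hub_download

# Download one of the released checkpoints listed above
ckpt_path = hf_hub_download(
    repo_id="hpcai-tech/Open-Sora",
    filename="OpenSora-v1-HQ-16x256x256.pth",
)
print(ckpt_path)  # local path that can be passed to --ckpt-path
```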
:warning: **LIMITATION**: Our model is trained on a limited budget. The quality and text alignment are relatively poor. The model performs badly, especially on generating human beings, and cannot follow detailed instructions. We are working on improving the quality and text alignment.
## Inference
To run inference with our provided weights, first download the [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights into `pretrained_models/t5_ckpts/t5-v1_1-xxl`, then download the model weights. Run the following command to generate samples. See [here](docs/structure.md#inference-config-demos) to customize the configuration.

```bash
# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
```
The speed is tested on H800 GPUs. For inference with other models, see [here](docs/commands.md) for more instructions. To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (at a slightly lower sampling speed).
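For example, the VAE is defined as a Python dict in the inference configs (see the config demos in [structure.md](docs/structure.md)); a sketch of lowering the micro batch size from the demo value of 128 could look like:

```python
# Excerpt from an inference config (illustrative values)
vae = dict(
    type="VideoAutoencoderKL",
    from_pretrained="stabilityai/sd-vae-ft-ema",
    micro_batch_size=64,  # smaller than the demo value of 128 to reduce peak GPU memory
)
```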
## Data Processing
High-quality data is the key to high-quality models. Our used datasets and data collection plan can be found [here](/docs/datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes the following steps:
1. Downloading datasets. [[docs](/tools/datasets/README.md)]
2. Split videos into clips. [[docs](/tools/scenedetect/README.md)]
3. Generate video captions. [[docs](/tools/caption/README.md)]
## Training
To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights into `pretrained_models/t5_ckpts/t5-v1_1-xxl`. Then run the following commands to launch training on a single node.
```bash
# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x512.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
```
To launch training on multiple nodes, prepare a hostfile according to [ColossalAI](https://colossalai.org/docs/basics/launch_colossalai/#launch-with-colossal-ai-cli), and run the following commands.
```bash
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
```

For training other models and advanced usage, see [here](docs/commands.md) for more instructions.
## Contribution
Thanks goes to these wonderful contributors ([emoji key](https://allcontributors.org/docs/en/emoji-key) following the [all-contributors](https://github.com/all-contributors/all-contributors) specification):
<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
If you wish to contribute to this project, you can refer to the [Contribution Guideline](/CONTRIBUTING.md).
## Acknowledgement
* [ColossalAI](https://github.com/hpcaitech/ColossalAI): A powerful large model parallel acceleration and optimization system.
* [DiT](https://github.com/facebookresearch/DiT): Scalable Diffusion Models with Transformers.
* [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT): An acceleration framework for DiT training. We adopt valuable acceleration strategies for training from OpenDiT.
* [PixArt](https://github.com/PixArt-alpha/PixArt-alpha): An open-source DiT-based text-to-image model.
* [Latte](https://github.com/Vchitect/Latte): An attempt to efficiently train DiT for video.
* [StabilityAI VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse-original): A powerful image VAE model.
* [CLIP](https://github.com/openai/CLIP): A powerful text-image embedding model.
* [T5](https://github.com/google-research/text-to-text-transfer-transformer): A powerful text encoder.
* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based on [Yi-34B](https://huggingface.co/01-ai/Yi-34B).
We are grateful for their exceptional work and generous contribution to open source.
## Citation

```bibtex
@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Yang You},
  title = {Open-Sora: Democratizing Efficient Video Production for All},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}
```
[Zangwei Zheng](https://github.com/zhengzangw) and [Xiangyu Peng](https://github.com/xyupeng) equally contributed to this work during their internship at [HPC-AI Tech](https://hpc-ai.com/).
## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=hpcaitech/Open-Sora&type=Date)](https://star-history.com/#hpcaitech/Open-Sora&Date)

docs/zh_CN/README.md
<p align="center">
<img src="../assets/readme/icon.png" width="250"/>
<p>
<div align="center">
<a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
<a href="https://hpcaitech.github.io/Open-Sora/"><img src="https://img.shields.io/badge/Gallery-View-orange?logo=&amp"></a>
<a href="https://discord.gg/shpbperhGs"><img src="https://img.shields.io/badge/Discord-join-blueviolet?logo=discord&amp"></a>
<a href="https://join.slack.com/t/colossalaiworkspace/shared_invite/zt-247ipg9fk-KRRYmUl~u2ll2637WRURVA"><img src="https://img.shields.io/badge/Slack-ColossalAI-blueviolet?logo=slack&amp"></a>
<a href="https://twitter.com/yangyou1991/status/1769411544083996787?s=61&t=jT0Dsx2d-MS5vS9rNM5e5g"><img src="https://img.shields.io/badge/Twitter-Discuss-blue?logo=twitter&amp"></a>
<a href="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/WeChat.png"><img src="https://img.shields.io/badge/微信-小助手加群-green?logo=wechat&amp"></a>
</div>
## Open-Sora: A Fully Open-Source, Efficient Reproduction of Sora-like Video Generation

The **Open-Sora** project is an initiative dedicated to **efficiently** producing high-quality video and making its models, tools, and content accessible to all.
By embracing **open-source** principles, Open-Sora not only makes advanced video generation techniques available at low cost, but also offers a streamlined and user-friendly platform that simplifies the complexities of video production.
With Open-Sora, we hope more developers will join us in exploring innovation, creativity, and inclusivity in the realm of content creation.
[[English]](/README.md)

<h4>Open-Sora is still at an early stage and will be continuously updated.</h4>

## 📰 News

* **[2024.03.18]** 🔥 We released **Open-Sora 1.0**, a fully open-source project for video generation.
  * Open-Sora 1.0 supports the full pipeline of video data preprocessing, training with <a href="https://github.com/hpcaitech/ColossalAI"><img src="../assets/readme/colossal_ai.png" width="8%" ></a> acceleration, inference, and more.
  * Our provided [model weights](#model-weights) can generate 2-second 512x512 videos with only 3 days of training.
* **[2024.03.04]** Open-Sora open-sources its Sora reproduction plan, cutting costs by 46% and extending the sequence length to nearly one million.
## 🎥 最新视频
| **2s 512×512** | **2s 512×512** | **2s 512×512** |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| [<img src="/assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="/assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="/assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| [<img src="/assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="/assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="/assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |
Videos are downsampled to `.gif` for display. Click for the original videos. Prompts are trimmed for display; see [here](/assets/texts/t2v_samples.txt) for full prompts. See more samples in our [gallery](https://hpcaitech.github.io/Open-Sora/).
## 🔆 New Features

* 📍 Open-Sora-v1 has been released. Model weights are available [here](#model-weights). With only 400K video clips and 200 days on H800 (compared with the 152M samples of Stable Video Diffusion), we are able to generate 2-second 512×512 videos.
* ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
* ✅ Support training acceleration, including an accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by **55%** when training on 64x512x512 videos. Details are provided in [acceleration.md](acceleration.md).
* ✅ We provide video cutting and captioning tools for data preprocessing. Instructions can be found [here](/tools/datasets/README.md), and our data collection plan is described in [datasets.md](datasets.md).
* ✅ We find that the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality, and thus adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in the time dimension deteriorates the quality. See our **[report](report_v1.md)** for more discussion.
* ✅ We investigate different architectures, including DiT, Latte, and our proposed **STDiT**. Our STDiT achieves a better trade-off between quality and speed. See our **[report](report_v1.md)** for more discussion.
* ✅ Support CLIP and T5 text conditioning.
* ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet and UCF101). See [commands.md](commands.md) for more instructions.
* ✅ Support inference with official weights from [DiT](https://github.com/facebookresearch/DiT), [Latte](https://github.com/Vchitect/Latte), and [PixArt](https://pixart-alpha.github.io/).
<details>
<summary>查看更多</summary>
* ✅ Refactor the codebase. See [structure.md](structure.md) to learn the project structure and how to use the config files.
</details>
### TODO list sorted by priority

* [ ] Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See [datasets.md](datasets.md) for more information. **[WIP]**
* [ ] Training Video-VAE. **[WIP]**
<details>
<summary>View more</summary>
* [ ] Support image and video conditioning.
* [ ] Evaluation pipeline.
* [ ] Incorporate a better scheduler, e.g., the rectified flow scheduler from SD3.
* [ ] Support variable aspect ratios, resolutions, and durations.
* [ ] Support SD3 upon its release.
</details>
## Table of Contents

* [Installation](#installation)
* [Model Weights](#model-weights)
* [Inference](#inference)
* [Data Processing](#data-processing)
* [Training](#training)
* [Contribution](#contribution)
* [Acknowledgement](#acknowledgement)
* [Citation](#citation)
## Installation
```bash
# create a virtual env
conda create -n opensora python=3.10
# install torch
# the command below is for CUDA 12.1, choose install commands from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision
# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
```
After installation, we suggest reading [structure.md](structure.md) to learn the project structure and how to use the config files.

## Model Weights

| Resolution | Data   | #iterations | Batch Size | GPU days (H800) | URL        |
|------------|--------|-------------|------------|-----------------|------------|
| 16×256×256 | 366K   | 80k         | 8×64       | 117             | [:link:]() |
| 16×256×256 | 20K HQ | 24k         | 8×64       | 45              | [:link:]() |
| 16×512×512 | 20K HQ | 20k         | 2×64       | 35              | [:link:]() |

Our model's weights are partially initialized from [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha). The number of parameters is 724M. More information about training can be found in our **[report](report_v1.md)**. More about the dataset can be found in [datasets.md](datasets.md). HQ means high quality.

:warning: **LIMITATION**: Our model is trained on a limited budget. The quality and text alignment are relatively poor. The model performs badly, especially on generating human beings, and cannot follow detailed instructions. We are working on improving the quality and text alignment.

## Inference

To run inference with our provided weights, first download the [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights into `pretrained_models/t5_ckpts/t5-v1_1-xxl`, then download the model weights. Run the following commands to generate samples. See [here](structure.md#inference-config-demos) to customize the configuration.
```bash
# Sample 16x256x256 (5s/sample)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 16x512x512 (20s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 64x512x512 (40s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth
```
The speed is tested on H800 GPUs. For inference with other models, see [here](commands.md) for more instructions.

## Data Processing

High-quality data is the key to high-quality models. Our used datasets and data collection plan can be found [here](datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes the following steps:

1. Downloading datasets. [[docs](/tools/datasets/README.md)]
2. Splitting videos into clips. [[docs](/tools/scenedetect/README.md)]
3. Generating video captions. [[docs](/tools/caption/README.md)]

## Training

To launch training, first download the [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights into `pretrained_models/t5_ckpts/t5-v1_1-xxl`. Then run the following commands to launch training on a single node.
```bash
# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x512.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
```
To launch training on multiple nodes, prepare a hostfile according to [ColossalAI](https://colossalai.org/docs/basics/launch_colossalai/#launch-with-colossal-ai-cli), and run the following commands.
```bash
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
```
For training other models and advanced usage, see [here](commands.md) for more instructions.

## Contribution

If you wish to contribute to this project, you can refer to the [Contribution Guideline](/CONTRIBUTING.md).

## Acknowledgement
* [DiT](https://github.com/facebookresearch/DiT): Scalable Diffusion Models with Transformers.
* [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT): An acceleration for DiT training. We adopt valuable acceleration
strategies for training progress from OpenDiT.
* [PixArt](https://github.com/PixArt-alpha/PixArt-alpha): An open-source DiT-based text-to-image model.
* [Latte](https://github.com/Vchitect/Latte): An attempt to efficiently train DiT for video.
* [StabilityAI VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse-original): A powerful image VAE model.
* [CLIP](https://github.com/openai/CLIP): A powerful text-image embedding model.
* [T5](https://github.com/google-research/text-to-text-transfer-transformer): A powerful text encoder.
* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based
on [Yi-34B](https://huggingface.co/01-ai/Yi-34B).
We are grateful for their exceptional work and generous contribution to open source.

## Citation
```bibtex
@software{opensora,
author = {Zangwei Zheng and Xiangyu Peng and Yang You},
title = {Open-Sora: Democratizing Efficient Video Production for All},
month = {March},
year = {2024},
url = {https://github.com/hpcaitech/Open-Sora}
}
```
[Zangwei Zheng](https://github.com/zhengzangw) and [Xiangyu Peng](https://github.com/xyupeng) equally contributed to
this work during their internship at [HPC-AI Tech](https://hpc-ai.com/).
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=hpcaitech/Open-Sora&type=Date)](https://star-history.com/#hpcaitech/Open-Sora&Date)

docs/zh_CN/acceleration.md Normal file
# Acceleration

Open-Sora aims to provide a high-speed training framework for diffusion models. We can achieve a **55%** training speed acceleration when training on 64-frame 512x512 videos. Our framework also supports training **1-minute 1080p videos**.

## Accelerated Transformer

Open-Sora boosts training speed with:

- Kernel optimization, including [flash attention](https://github.com/Dao-AILab/flash-attention), fused layernorm kernels, and kernels compiled by ColossalAI.
- Hybrid parallelism, including ZeRO.
- Gradient checkpointing for larger batch sizes.
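These switches show up directly in the training configs; the excerpt below is illustrative, with field names taken from the training config demo in `structure.md`:

```python
# Acceleration-related fields from a training config (illustrative values)
dtype = "bf16"          # mixed-precision computation type
grad_checkpoint = True  # gradient checkpointing for larger batch sizes
plugin = "zero2"        # plugin for distributed training (zero2, zero2-seq)
sp_size = 1             # sequence parallelism size

model = dict(
    type="STDiT-XL/2",
    enable_flashattn=True,         # flash attention kernel
    enable_layernorm_kernel=True,  # fused layernorm kernel
)
```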
Our training speed on images is comparable to [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), a project to accelerate DiT training. The training speed is measured on 8 H800 GPUs with batch size 128 and image size 256x256.

| Model    | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|----------|------------------------|---------------------------|
| DiT | 100 | 26k |
| OpenDiT | 175 | 45k |
| OpenSora | 175 | 45k |
## Efficient STDiT

Our STDiT adopts spatial-temporal attention to model video data. Compared with directly applying full attention as in DiT, our STDiT is more efficient as the number of frames increases. Our current framework only supports sequence parallelism for very long sequences.

The training speed is measured on 8 H800 GPUs with acceleration techniques applied; GC means gradient checkpointing. Both models have T5 conditioning like PixArt.

| Model            | Setting        | Throughput (sample/s/GPU) | Throughput (tokens/s/GPU) |
|------------------|----------------|---------------------------|---------------------------|
| DiT | 16x256 (4k) | 7.20 | 29k |
| STDiT | 16x256 (4k) | 7.00 | 28k |
| DiT | 16x512 (16k) | 0.85 | 14k |
| STDiT | 16x512 (16k) | 1.45 | 23k |
| DiT (GC) | 64x512 (65k) | 0.08 | 5k |
| STDiT (GC) | 64x512 (65k) | 0.40 | 25k |
| STDiT (GC, sp=2) | 360x512 (370k) | 0.10 | 18k |
With a Video-VAE performing 4x downsampling in the temporal dimension, a 24 fps video has 450 frames. The speed gap between STDiT (28k tokens/s) and DiT on images (up to 45k tokens/s) mainly comes from the T5 and VAE encoding, and from temporal attention.

## Accelerated Encoders (T5, VAE)

During training, texts are encoded by T5 and videos are encoded by the VAE. Typically there are two ways to accelerate training:

1. Preprocess text and video data in advance and save it to disk.
2. Encode text and video data during training, and accelerate the encoding process.

For option 1, 120 tokens for one sample require 1 MB of disk space, while a 64x64x64 latent may require 4 MB. Considering a training dataset with 10M video clips, the total disk space required is 50 TB. Our storage system is not ready for this scale of data at the moment.
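A quick back-of-the-envelope check of that estimate, using the per-sample sizes stated above:

```python
# Rough storage estimate for pre-encoding 10M video clips to disk
text_latent_mb = 1       # about 1 MB for 120 T5 tokens per sample
video_latent_mb = 4      # about 4 MB for a 64x64x64 latent per sample
num_clips = 10_000_000

total_tb = num_clips * (text_latent_mb + video_latent_mb) / 1_000_000  # MB -> TB
print(total_tb)  # 50.0 TB
```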
For option 2, we improve the T5 speed and memory requirement. Following [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), we find that the VAE consumes a large amount of GPU memory. Thus, we split the batch into smaller micro-batches for VAE encoding. With both techniques, we can greatly speed up training.
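A minimal sketch of the micro-batching idea is shown below; the `vae_encode` callable is hypothetical and stands in for the project's actual `VideoAutoencoderKL`, which exposes this behaviour through its `micro_batch_size` field.

```python
import torch

def encode_with_micro_batches(vae_encode, frames, micro_batch_size=128):
    # Encode a large batch of frames in smaller chunks to cap peak VAE memory
    latents = []
    with torch.no_grad():  # the VAE is frozen during diffusion training
        for start in range(0, frames.shape[0], micro_batch_size):
            latents.append(vae_encode(frames[start:start + micro_batch_size]))
    return torch.cat(latents, dim=0)
```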
The training speed is measured on 8 H800 GPUs with STDiT.

| Acceleration | Setting       | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|--------------|---------------|------------------------|---------------------------|
| Baseline | 16x256 (4k) | 6.16 | 25k |
| w. faster T5 | 16x256 (4k) | 7.00 | 29k |
| Baseline | 64x512 (65k) | 0.94 | 15k |
| w. both | 64x512 (65k) | 1.45 | 23k |

docs/zh_CN/datasets.md Normal file
# Datasets

## Datasets in Use

### HD-VG-130M

[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs. The captions are generated by BLIP-2. We find the scene cuts and the text quality relatively poor. It contains 20 splits; for OpenSora 1.0, we use the first split. We plan to use the whole dataset and re-process it.
### Inter4K

[Inter4K](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1k video clips at 4K resolution. The dataset was proposed for super-resolution tasks, and we use it for HQ training. The processed videos can be found [here](README.md#data-processing).

### Pexels.com

[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collected 19K video clips from this website for HQ training. The processed videos can be found [here](README.md#data-processing).
## Dataset Watchlist

We are also keeping an eye on the following datasets and may use them in the future, depending on our storage and the quality of the datasets.

| Name              | Size         | Description                   |
|-------------------|--------------|-------------------------------|
| Panda-70M | 70M videos | High quality video-text pairs |
| WebVid-10M | 10M videos | Low quality |
| InternVid-10M-FLT | 10M videos | |
| EGO4D | 3670 hours | |
| OpenDV-YouTube | 1700 hours | |
| VidProM | 6.69M videos | |

docs/zh_CN/report_v1.md Normal file
# Open-Sora v1 Report
OpenAI's Sora is amazing at generating one-minute high-quality videos. However, it reveals almost no information about its details. To make AI more "open", we are dedicated to building an open-source version of Sora. This report describes our first attempt to train a transformer-based video diffusion model.
## Efficiency in choosing the architecture
To lower the computational cost, we want to utilize existing VAE models. Sora uses a spatial-temporal VAE to reduce the temporal dimensions. However, we found no open-source high-quality spatial-temporal VAE model. [MAGVIT](https://github.com/google-research/magvit)'s 4x4x4 VAE is not open-sourced, while [VideoGPT](https://wilson1yan.github.io/videogpt/index.html)'s 2x4x4 VAE shows low quality in our experiments. Thus, we decided to use a 2D VAE (from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original)) in our first version.
Video training involves a large number of tokens. Considering 24 fps 1-minute videos, we have 1440 frames. With 4x VAE downsampling and 2x patch-size downsampling, we have 1440x1024≈1.5M tokens. Full attention over 1.5M tokens leads to a huge computational cost. Thus, we use spatial-temporal attention to reduce the cost, following [Latte](https://github.com/Vchitect/Latte).
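The arithmetic behind that token count, with the per-frame figure treated as an assumption implied by the text (it depends on resolution, VAE downsampling, and patch size), is:

```python
# Token count for a 1-minute 24 fps video
frames = 24 * 60          # 1440 frames
tokens_per_frame = 1024   # per-frame token count implied by the 1440x1024 figure above
total_tokens = frames * tokens_per_frame
print(total_tokens)       # 1474560, i.e. roughly 1.5M tokens
```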
As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper. However, we do not control for a similar number of parameters across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for its efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).
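A schematic sketch of the sequential spatial-temporal arrangement described above is given below. This is only an illustration of the idea, not the project's actual STDiT code: `spatial_attn` and `temporal_attn` are assumed to be shape-preserving attention modules, and the `einops` package is assumed to be available.

```python
from einops import rearrange

def sequential_st_block(x, spatial_attn, temporal_attn, num_frames, num_spatial_tokens):
    # x: (B, T*S, C) tokens for T frames, each frame holding S spatial tokens
    # 1) spatial attention: tokens attend within their own frame
    x = rearrange(x, "b (t s) c -> (b t) s c", t=num_frames, s=num_spatial_tokens)
    x = x + spatial_attn(x)
    # 2) temporal attention, inserted right after: tokens attend across frames
    #    at the same spatial location
    x = rearrange(x, "(b t) s c -> (b s) t c", t=num_frames)
    x = x + temporal_attn(x)
    return rearrange(x, "(b s) t c -> b (t s) c", s=num_spatial_tokens)
```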
![Architecture Comparison](https://i0.imgs.ovh/2024/03/15/eLk9D.png)
To focus on video generation, we hope to train the model based on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention with zeros. This initialization preserves the model's image generation ability at the beginning, while Latte's architecture cannot. The inserted attention increases the number of parameters from 580M to 724M.
![Architecture](https://i0.imgs.ovh/2024/03/16/erC1d.png)
Drawing from the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on the 366K-clip pretraining dataset, and then 16x256x256, 16x512x512, and 64x512x512 on the 20K HQ dataset. With scaled position embeddings, this strategy greatly reduces the computational cost.
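The scaled position embedding is exposed in the configs as explicit scale factors; the values below are taken from the config demos later in this document and are meant only to illustrate the mapping.

```python
# From the 64x512x512 config demos (illustrative)
model = dict(
    type="STDiT-XL/2",
    space_scale=1.0,   # space positional encoding scale (new height / old height)
    time_scale=2 / 3,  # time positional encoding scale (new frame_interval / old frame_interval)
)
```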
We also tried using a 3D patch embedder in DiT. However, with 2x downsampling in the temporal dimension, the generated videos had low quality. Thus, we leave the downsampling to a temporal VAE in our next version. For now, we sample every 3 frames for 16-frame training and every 2 frames for 64-frame training.
## Data is the key to high quality
We find that the number and quality of data have a great impact on the quality of generated videos, even more than the model architecture and training strategy. At this time, we have only prepared the first split (366K video clips) from [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not that accurate. Thus, we further collected 20K relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides free-license videos. We label the videos with LLaVA, an image captioning model, using three frames and a designed prompt. With the designed prompt, LLaVA can generate captions of good quality.
![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
As we lay more emphasis on the quality of data, we prepare to collect more data and build a video preprocessing pipeline in our next version.
## Training Details
With a limited training budget, we made only a few explorations. We find that a learning rate of 1e-4 is too large and scale it down to 2e-5. When training with a large batch size, we find `fp16` less stable than `bf16`; it may lead to generation failure. Thus, we switch to `bf16` for training on 64x512x512. For other hyper-parameters, we follow previous works.
## Loss curves
16x256x256 Pretraining Loss Curve
![16x256x256 Pretraining Loss Curve](https://i0.imgs.ovh/2024/03/16/erXQj.png)
16x256x256 HQ Training Loss Curve
![16x256x256 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/ernXv.png)
16x512x512 HQ Training Loss Curve
![16x512x512 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/erHBe.png)

docs/zh_CN/structure.md Normal file
# Repo & Config Structure
## Repo Structure
```plaintext
Open-Sora
├── README.md
├── docs
│ ├── acceleration.md -> Acceleration & Speed benchmark
│ ├── commands.md -> Commands for training & inference
│ ├── datasets.md -> Datasets used in this project
│ ├── structure.md -> This file
│ └── report_v1.md -> Report for Open-Sora v1
├── scripts
│ ├── train.py -> diffusion training script
│ └── inference.py -> diffusion inference script
├── configs -> Configs for training & inference
├── opensora
│ ├── __init__.py
│ ├── registry.py -> Registry helper
│   ├── acceleration -> Acceleration related code
│   ├── dataset -> Dataset related code
│   ├── models
│   │   ├── layers -> Common layers
│   │   ├── vae -> VAE as image encoder
│   │   ├── text_encoder -> Text encoder
│   │   │   ├── classes.py -> Class id encoder (inference only)
│   │   │   ├── clip.py -> CLIP encoder
│   │   │   └── t5.py -> T5 encoder
│   │   ├── dit
│   │   ├── latte
│   │   ├── pixart
│   │   └── stdit -> Our STDiT related code
│   ├── schedulers -> Diffusion schedulers
│   │   ├── iddpm -> IDDPM for training and inference
│   │ └── dpms -> DPM-Solver for fast inference
│ └── utils
└── tools -> Tools for data processing and more
```
## Configs
Our config files follow [MMEngine](https://github.com/open-mmlab/mmengine). MMEngine reads a config file (a `.py` file) and parses it into a dictionary-like object.
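As a small illustration of what that means in practice (assuming `mmengine` is installed; `Config.fromfile` is its standard entry point for Python-style configs):

```python
from mmengine.config import Config

# Parse a Python config file into a dictionary-like Config object
cfg = Config.fromfile("configs/opensora/inference/16x256x256.py")
print(cfg.num_frames, cfg.image_size, cfg.model["type"])
```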
```plaintext
Open-Sora
└── configs -> Configs for training & inference
├── opensora -> STDiT related configs
│ ├── inference
│ │ ├── 16x256x256.py -> Sample videos 16 frames 256x256
│ │ ├── 16x512x512.py -> Sample videos 16 frames 512x512
│ │ └── 64x512x512.py -> Sample videos 64 frames 512x512
│ └── train
│ ├── 16x256x256.py -> Train on videos 16 frames 256x256
│ ├── 16x512x512.py -> Train on videos 16 frames 512x512
│ └── 64x512x512.py -> Train on videos 64 frames 512x512
├── dit -> DiT related configs
│ ├── inference
│ │ ├── 1x256x256-class.py -> Sample images with ckpts from DiT
│ │ ├── 1x256x256.py -> Sample images with clip condition
│ │ └── 16x256x256.py -> Sample videos
│ └── train
│   ├── 1x256x256.py -> Train on images with clip condition
│   └── 16x256x256.py -> Train on videos
├── latte -> Latte related configs
└── pixart -> PixArt related configs
```
## Inference config demos
To change the inference settings, you can directly modify the corresponding config file. Or you can pass arguments to overwrite the config file ([config_utils.py](/opensora/utils/config_utils.py)). To change sampling prompts, you should modify the `.txt` file passed to the `--prompt_path` argument.
```plaintext
--prompt_path ./assets/texts/t2v_samples.txt -> prompt_path
--ckpt-path ./path/to/your/ckpt.pth -> model["from_pretrained"]
```
The explanation of each field is provided below.
```python
# Define sampling size
num_frames = 64 # number of frames
fps = 24 // 2 # frames per second (divided by 2 for frame_interval=2)
image_size = (512, 512) # image size (height, width)
# Define model
model = dict(
type="STDiT-XL/2", # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
space_scale=1.0, # (Optional) Space positional encoding scale (new height / old height)
time_scale=2 / 3, # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
enable_flashattn=True, # (Optional) Speed up training and inference with flash attention
enable_layernorm_kernel=True, # (Optional) Speed up training and inference with fused kernel
from_pretrained="PRETRAINED_MODEL", # (Optional) Load from pretrained model
no_temporal_pos_emb=True, # (Optional) Disable temporal positional encoding (for image)
)
vae = dict(
type="VideoAutoencoderKL", # Select VAE type
from_pretrained="stabilityai/sd-vae-ft-ema", # Load from pretrained VAE
micro_batch_size=128, # VAE with micro batch size to save memory
)
text_encoder = dict(
type="t5", # Select text encoder type (t5, clip)
from_pretrained="./pretrained_models/t5_ckpts", # Load from pretrained text encoder
model_max_length=120, # Maximum length of input text
)
scheduler = dict(
type="iddpm", # Select scheduler type (iddpm, dpm-solver)
num_sampling_steps=100, # Number of sampling steps
cfg_scale=7.0, # hyper-parameter for classifier-free diffusion
)
dtype = "fp16" # Computation type (fp16, fp32, bf16)
# Other settings
batch_size = 1 # batch size
seed = 42 # random seed
prompt_path = "./assets/texts/t2v_samples.txt" # path to prompt file
save_dir = "./samples" # path to save samples
```
## Training config demos
```python
# Define sampling size
num_frames = 64
frame_interval = 2 # sample every 2 frames
image_size = (512, 512)
# Define dataset
root = None # root path to the dataset
data_path = "CSV_PATH" # path to the csv file
use_image_transform = False # True if training on images
num_workers = 4 # number of workers for dataloader
# Define acceleration
dtype = "bf16" # Computation type (fp16, bf16)
grad_checkpoint = True # Use gradient checkpointing
plugin = "zero2" # Plugin for distributed training (zero2, zero2-seq)
sp_size = 1 # Sequence parallelism size (1 for no sequence parallelism)
# Define model
model = dict(
type="STDiT-XL/2",
space_scale=1.0,
time_scale=2 / 3,
from_pretrained="YOUR_PRETRAINED_MODEL",
enable_flashattn=True, # Enable flash attention
enable_layernorm_kernel=True, # Enable layernorm kernel
)
vae = dict(
type="VideoAutoencoderKL",
from_pretrained="stabilityai/sd-vae-ft-ema",
micro_batch_size=128,
)
text_encoder = dict(
type="t5",
from_pretrained="./pretrained_models/t5_ckpts",
model_max_length=120,
shardformer=True, # Enable shardformer for T5 acceleration
)
scheduler = dict(
type="iddpm",
timestep_respacing="", # Default 1000 timesteps
)
# Others
seed = 42
outputs = "outputs" # path to save checkpoints
wandb = False # Use wandb for logging
epochs = 1000 # number of epochs (just large enough, kill when satisfied)
log_every = 10
ckpt_every = 250
load = None # path to resume training
batch_size = 4
lr = 2e-5
grad_clip = 1.0 # gradient clipping
```