Merge pull request #197 from hpcaitech/celaraze-main

Celaraze main
2026-04-15 03:15:20 +02:00 · 2024-03-23 15:19:32 +08:00 · 745956ed15
commit 745956ed15
parent 6d430b5ed1 bd80f6db0a
7 changed files with 606 additions and 248 deletions
--- a/README.md
+++ b/README.md
@ -12,21 +12,25 @@
 </div>

 ## Open-Sora: Democratizing Efficient Video Production for All
+
 We present **Open-Sora**, an initiative dedicated to **efficiently** producing high-quality video and making the model,
 tools and contents accessible to all. By embracing **open-source** principles,
 Open-Sora not only democratizes access to advanced video generation techniques, but also offers a
 streamlined and user-friendly platform that simplifies the complexities of video production.
-With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation. [[中文]](/docs/README_zh.md)
+With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation.
+
+[[中文文档]](/docs/zh_CN/README.md)

 <h4>Open-Sora is still at an early stage and under active development.</h4>

-
 ## 📰 News

 * **[2024.03.18]** 🔥 We release **Open-Sora 1.0**, a fully open-source project for video generation.
  Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
-<a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a> acceleration,
-inference, and more. Our provided [checkpoints](#model-weights) can produce 2s 512x512 videos with only 3 days training.
+  <a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a>
+  acceleration,
+  inference, and more. Our provided [checkpoints](#model-weights) can produce 2s 512x512 videos with only 3 days
+  training.
  [[blog]](https://hpc-ai.com/blog/open-sora-v1.0)
 * **[2024.03.04]** Open-Sora provides training with 46% cost reduction.
  [[blog]](https://hpc-ai.com/blog/open-sora)
@ -34,37 +38,53 @@ inference, and more. Our provided [checkpoints](#model-weights) can produce 2s 5
 ## 🎥 Latest Demo

 | **2s 512×512**                                                                                                                                                                 | **2s 512×512**                                                                                                                                                              | **2s 512×512**                                                                                                                                    |
-| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
 | [<img src="assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80)                                 | [<img src="assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc)                              | [<img src="assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16)    |
 | A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
 | [<img src="assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94)                                 | [<img src="assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9)                              | [<img src="assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65)    |
 | A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...]                                                           | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...]                                            | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...]                   |

-Videos are downsampled to `.gif` for display. Click for original videos. Prompts are trimmed for display, see [here](/assets/texts/t2v_samples.txt) for full prompts. See more samples at our [gallery](https://hpcaitech.github.io/Open-Sora/).
-
+Videos are downsampled to `.gif` for display. Click for original videos. Prompts are trimmed for display,
+see [here](/assets/texts/t2v_samples.txt) for full prompts. See more samples at
+our [gallery](https://hpcaitech.github.io/Open-Sora/).

 ## 🔆 New Features/Updates

-* 📍 Open-Sora-v1 released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
-* ✅ Three stages training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
-* ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improve **55%** training speed when training on 64x512x512 videos. Details locates at [acceleration.md](docs/acceleration.md).
-* ✅ We provide data preprocessing pipeline, including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md), and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found at [datasets.md](docs/datasets.md).
-* ✅ We find VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has a low quality and thus adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find patching in the time dimension deteriorates the quality. See our **[report](docs/report_v1.md)** for more discussions.
-* ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our **STDiT** achieves a better trade-off between quality and speed. See our **[report](docs/report_v1.md)** for more discussions.
+* 📍 Open-Sora-v1 released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800
+  days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
+* ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each
+  stage.
+* ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism.
+  Open-Sora improves training speed by **55%** when training on 64x512x512 videos. Details can be found
+  in [acceleration.md](docs/acceleration.md).
+* ✅ We provide a data preprocessing pipeline,
+  including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md),
+  and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found
+  at [datasets.md](docs/datasets.md).
+* ✅ We find that the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality and thus adopt a
+  better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in
+  the time dimension deteriorates the quality. See our **[report](docs/report_v1.md)** for more discussion.
+* ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our **STDiT** achieves a better
+  trade-off between quality and speed. See our **[report](docs/report_v1.md)** for more discussion.
 * ✅ Support CLIP and T5 text conditioning.
-* ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See [command.md](docs/commands.md) for more instructions.
-* ✅ Support inference with official weights from [DiT](https://github.com/facebookresearch/DiT), [Latte](https://github.com/Vchitect/Latte), and [PixArt](https://pixart-alpha.github.io/).
+* ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet &
+  UCF101). See [commands.md](docs/commands.md) for more instructions.
+* ✅ Support inference with official weights
+  from [DiT](https://github.com/facebookresearch/DiT), [Latte](https://github.com/Vchitect/Latte),
+  and [PixArt](https://pixart-alpha.github.io/).

 <details>
 <summary>View more</summary>

-* ✅ Refactor the codebase. See [structure.md](docs/structure.md) to learn the project structure and how to use the config files.
+* ✅ Refactor the codebase. See [structure.md](docs/structure.md) to learn the project structure and how to use the
+  config files.

 </details>

 ### TODO list sorted by priority

-* [ ] Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See [datasets.md](/docs/datasets.md) for more information. **[WIP]**
+* [ ] Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity,
+  deduplication, etc.). See [datasets.md](/docs/datasets.md) for more information. **[WIP]**
 * [ ] Training Video-VAE. **[WIP]**

 <details>
@ -118,19 +138,24 @@ cd Open-Sora
 pip install -v .
 ```

-After installation, we suggest reading [structure.md](docs/structure.md) to learn the project structure and how to use the config files.
+After installation, we suggest reading [structure.md](docs/structure.md) to learn the project structure and how to use
+the config files.

 ## Model Weights

 | Resolution | Data   | #iterations | Batch Size | GPU days (H800) | URL                                                                                           |
-| ---------- | ------ | ----------- | ---------- | --------------- | --------------------------------------------------------------------------------------------- |
+|------------|--------|-------------|------------|-----------------|-----------------------------------------------------------------------------------------------|
 | 16×256×256 | 366K   | 80k         | 8×64       | 117             | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-16x256x256.pth)    |
 | 16×256×256 | 20K HQ | 24k         | 8×64       | 45              | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x256x256.pth) |
 | 16×512×512 | 20K HQ | 20k         | 2×64       | 35              | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth) |

-Our model's weight is partially initialized from [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha). The number of parameters is 724M. More information about training can be found in our **[report](/docs/report_v1.md)**. More about dataset can be found in [dataset.md](/docs/dataset.md). HQ means high quality.
+Our model's weights are partially initialized from [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha). The number of
+parameters is 724M. More information about training can be found in our **[report](/docs/report_v1.md)**. More about
+the datasets can be found in [datasets.md](/docs/datasets.md). HQ means high quality.

-:warning: **LIMITATION**: Our model is trained on a limited budget. The quality and text alignment is relatively poor. The model performs badly especially on generating human beings and cannot follow detailed instructions. We are working on improving the quality and text alignment.
+:warning: **LIMITATION**: Our model is trained on a limited budget. The quality and text alignment are relatively poor.
+The model performs especially badly on generating human beings and cannot follow detailed instructions. We are working
+on improving the quality and text alignment.

 ## Inference

@ -163,11 +188,14 @@ torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/i
 torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
 ```

-The speed is tested on H800 GPUs. For inference with other models, see [here](docs/commands.md) for more instructions. To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (slightly lower sampling speed).
+The speed is tested on H800 GPUs. For inference with other models, see [here](docs/commands.md) for more instructions.
+To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (slightly lower sampling speed).

 ## Data Processing

-High-quality Data is the key to high-quality models. Our used datasets and data collection plan is [here](/docs/datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes the following steps:
+High-quality data is the key to high-quality models. The datasets we use and our data collection plan
+are described [here](/docs/datasets.md). We provide tools to process video data. Currently, our data processing pipeline includes
+the following steps:

 1. Downloading datasets. [[docs](/tools/datasets/README.md)]
 2. Split videos into clips. [[docs](/tools/scenedetect/README.md)]
@ -175,7 +203,8 @@ High-quality Data is the key to high-quality models. Our used datasets and data

 ## Training

-To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights into `pretrained_models/t5_ckpts/t5-v1_1-xxl`. Then run the following commands to launch training on a single node.
+To launch training, first download [T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main) weights
+into `pretrained_models/t5_ckpts/t5-v1_1-xxl`. Then run the following commands to launch training on a single node.

 ```bash
 # 1 GPU, 16x256x256
@ -184,7 +213,9 @@ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/1
 torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
 ```

-To launch training on multiple nodes, prepare a hostfile according to [ColossalAI](https://colossalai.org/docs/basics/launch_colossalai/#launch-with-colossal-ai-cli), and run the following commands.
+To launch training on multiple nodes, prepare a hostfile according
+to [ColossalAI](https://colossalai.org/docs/basics/launch_colossalai/#launch-with-colossal-ai-cli), and run the
+following commands.

 ```bash
 colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
@ -194,7 +225,8 @@ For training other models and advanced usage, see [here](docs/commands.md) for m

 ## Contribution

-Thanks goes to these wonderful contributors ([emoji key](https://allcontributors.org/docs/en/emoji-key) following [all-contributors](https://github.com/all-contributors/all-contributors) specification):
+Thanks goes to these wonderful contributors ([emoji key](https://allcontributors.org/docs/en/emoji-key)
+following [all-contributors](https://github.com/all-contributors/all-contributors) specification):

 <!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
 <!-- prettier-ignore-start -->
@ -227,15 +259,18 @@ If you wish to contribute to this project, you can refer to the [Contribution Gu

 ## Acknowledgement

-* [ColossalAI](https://github.com/hpcaitech/ColossalAI): A powerful large model parallel acceleration and optimization system.
+* [ColossalAI](https://github.com/hpcaitech/ColossalAI): A powerful large model parallel acceleration and optimization
+  system.
 * [DiT](https://github.com/facebookresearch/DiT): Scalable Diffusion Models with Transformers.
-* [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT): An acceleration for DiT training. We adopt valuable acceleration strategies for training progress from OpenDiT.
+* [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT): An acceleration project for DiT training. We adopt valuable acceleration
+  strategies for the training process from OpenDiT.
 * [PixArt](https://github.com/PixArt-alpha/PixArt-alpha): An open-source DiT-based text-to-image model.
 * [Latte](https://github.com/Vchitect/Latte): An attempt to efficiently train DiT for video.
 * [StabilityAI VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse-original): A powerful image VAE model.
 * [CLIP](https://github.com/openai/CLIP): A powerful text-image embedding model.
 * [T5](https://github.com/google-research/text-to-text-transfer-transformer): A powerful text encoder.
-* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based on [Yi-34B](https://huggingface.co/01-ai/Yi-34B).
+* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based
+  on [Yi-34B](https://huggingface.co/01-ai/Yi-34B).

 We are grateful for their exceptional work and generous contribution to open source.

@ -251,7 +286,8 @@ We are grateful for their exceptional work and generous contribution to open sou
 }
 ```

-[Zangwei Zheng](https://github.com/zhengzangw) and [Xiangyu Peng](https://github.com/xyupeng) equally contributed to this work during their internship at [HPC-AI Tech](https://hpc-ai.com/).
+[Zangwei Zheng](https://github.com/zhengzangw) and [Xiangyu Peng](https://github.com/xyupeng) equally contributed to
+this work during their internship at [HPC-AI Tech](https://hpc-ai.com/).

 ## Star History

--- a/docs/zh_CN/README.md
+++ b/docs/zh_CN/README.md
@ -15,7 +15,8 @@
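 The **Open-Sora** project is an initiative dedicated to **efficiently** producing high-quality video and making its models, tools and content accessible to all.
 By embracing **open-source** principles, Open-Sora not only makes advanced video generation techniques affordable and widely available, but also offers a streamlined and user-friendly solution that simplifies the complexities of video production.
 With Open-Sora, we hope more developers will join us in exploring innovation, creativity, and inclusivity in the realm of content creation.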
- [[English]](/README.md)
+
+[[English Document]](/README.md)

 <h4>Open-Sora is still at an early stage and will be updated continuously.</h4>

--- a/docs/zh_CN/acceleration.md
+++ b/docs/zh_CN/acceleration.md
@ -0,0 +1,65 @@
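+# Acceleration
+
+Open-Sora aims to provide a high-speed training framework for diffusion models. We achieve a **55%** training speed-up
+when training on 64-frame 512x512 videos. Our framework supports training on **1-minute 1080p videos**.
+
+## Accelerated Transformer
+
+Open-Sora boosts training speed by:
+
+- Kernel optimization, including [flash attention](https://github.com/Dao-AILab/flash-attention), a fused layernorm
+  kernel, and kernels compiled by ColossalAI.
+- Hybrid parallelism, including ZeRO.
+- Gradient checkpointing for larger batch sizes.
+
+Our training speed on images is comparable to [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), a project that
+accelerates DiT training. The training speed is measured on 8 H800 GPUs with batch size 128 and image size 256x256.
+
+| Model    | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
+|----------|------------------------|---------------------------|
+| DiT      | 100                    | 26k                       |
+| OpenDiT  | 175                    | 45k                       |
+| OpenSora | 175                    | 45k                       |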
+
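+## Efficient STDiT
+
+Our STDiT adopts spatial-temporal attention to model video data. Compared with applying full attention directly as in DiT, STDiT becomes more efficient as the number of frames grows. Our current framework supports sequence parallelism only for extremely long sequences.
+
+The training speed is measured on 8 H800 GPUs with the acceleration techniques applied; GC means gradient checkpointing.
+Both models use T5 conditioning like PixArt.
+
+| Model            | Setting        | Throughput (sample/s/GPU) | Throughput (tokens/s/GPU) |
+|------------------|----------------|---------------------------|---------------------------|
+| DiT              | 16x256  (4k)   | 7.20                      | 29k                       |
+| STDiT            | 16x256  (4k)   | 7.00                      | 28k                       |
+| DiT              | 16x512  (16k)  | 0.85                      | 14k                       |
+| STDiT            | 16x512  (16k)  | 1.45                      | 23k                       |
+| DiT (GC)         | 64x512  (65k)  | 0.08                      | 5k                        |
+| STDiT (GC)       | 64x512  (65k)  | 0.40                      | 25k                       |
+| STDiT (GC, sp=2) | 360x512 (370k) | 0.10                      | 18k                       |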
+
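+With 4x downsampling in the temporal dimension by a Video-VAE, a 24fps video has 450 frames. The speed gap between STDiT
+(28k tokens/s) and DiT on images (up to 45k tokens/s) mainly comes from T5 and VAE encoding, and from temporal attention.
+
+## Accelerated Encoder (T5, VAE)
+
+During training, texts are encoded by T5 and videos are encoded by VAE. There are typically two ways to accelerate training:
+
+1. Preprocess the text and video data in advance and save them to disk.
+2. Encode the text and video data during training, and accelerate the encoding process.
+
+For option 1, the 120 tokens of one sample require 1M of disk space, and a 64x64x64 latent requires 4M. For a training
+dataset of 10M video clips, the total disk space required is 50TB. Our storage system is not yet ready for data at this scale.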
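
The 50TB figure follows from the per-sample sizes quoted above. A quick check, assuming that "1M" and "4M" mean megabytes and that decimal units are used (neither is stated in the doc):

```python
# Back-of-the-envelope check of the disk space needed for option 1.
# Assumptions (not stated in the doc): "1M"/"4M" are megabytes, 1 TB = 10^6 MB.
TEXT_MB_PER_SAMPLE = 1       # T5 embedding of 120 tokens
LATENT_MB_PER_SAMPLE = 4     # 64x64x64 VAE latent
NUM_CLIPS = 10_000_000       # 10M video clips


def total_storage_tb(num_clips, text_mb, latent_mb):
    """Return the storage in terabytes for pre-encoded text and video."""
    total_mb = num_clips * (text_mb + latent_mb)
    return total_mb / 1_000_000


storage_tb = total_storage_tb(NUM_CLIPS, TEXT_MB_PER_SAMPLE, LATENT_MB_PER_SAMPLE)
print(storage_tb)  # 50.0
```

The latent dominates: four fifths of the total comes from the video side, so caching only the text embeddings would need roughly 10TB under the same assumptions.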
+
+For option 2, we improve T5's speed and memory footprint. Based on [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), we
+find that the VAE consumes a large amount of GPU memory. We therefore split each batch into smaller micro-batches
+for VAE encoding. With both techniques, we can greatly accelerate training.
+
+The training speed is measured on 8 H800 GPUs with STDiT.
+
+| Acceleration | Setting       | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
+|--------------|---------------|------------------------|---------------------------|
+| Baseline     | 16x256  (4k)  | 6.16                   | 25k                       |
+| w. faster T5 | 16x256  (4k)  | 7.00                   | 29k                       |
+| Baseline     | 64x512  (65k) | 0.94                   | 15k                       |
+| w. both      | 64x512  (65k) | 1.45                   | 23k                       |
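
The batch splitting described for option 2 can be sketched in a few lines. This is a minimal illustration, not the repository's VAE wrapper: `encode_in_micro_batches` and `toy_encoder` are hypothetical names, and plain Python lists stand in for tensors.

```python
from typing import Callable, List, Sequence


def encode_in_micro_batches(
    frames: Sequence[float],
    encode_fn: Callable[[Sequence[float]], List[float]],
    micro_batch_size: int,
) -> List[float]:
    """Encode `frames` chunk by chunk so only one chunk is in flight at a time."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    latents: List[float] = []
    for start in range(0, len(frames), micro_batch_size):
        chunk = frames[start:start + micro_batch_size]
        latents.extend(encode_fn(chunk))
    return latents


def toy_encoder(chunk):
    # Hypothetical stand-in for a VAE encoder: halves every value.
    return [x / 2 for x in chunk]


frames = [float(i) for i in range(10)]
latents = encode_in_micro_batches(frames, toy_encoder, micro_batch_size=4)
assert latents == toy_encoder(frames)  # same result as encoding one big batch
```

Peak memory is bounded by the chunk size rather than the full batch, at the cost of more encoder calls, which is the trade-off behind the `micro_batch_size` setting in the config files.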
--- a/docs/zh_CN/commands.md
+++ b/docs/zh_CN/commands.md
--- a/docs/zh_CN/datasets.md
+++ b/docs/zh_CN/datasets.md
@ -0,0 +1,31 @@
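+# Datasets
+
+## Datasets in use
+
+### HD-VG-130M
+
+[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) contains 130M text-video pairs. The captions are
+generated by BLIP-2. We find the scene cuts and the text quality to be relatively poor. It contains 20 splits. For OpenSora 1.0, we use the first split. We plan to use the whole dataset and re-process it.
+
+### Inter4k
+
+[Inter4k](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1k video clips at 4K resolution. The
+dataset was proposed for super-resolution tasks. We use it for HQ training. The processed videos can be found [here](README.md#数据处理).
+
+### Pexels.com
+
+[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collected 19K video
+clips from this website for HQ training. The processed videos can be found [here](README.md#数据处理).
+
+## Dataset watch list
+
+We are also watching the following datasets and considering using them in the future, depending on our storage space and the quality of the datasets.
+
+| Name              | Size         | Description                   |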
+|-------------------|--------------|-------------------------------|
+| Panda-70M         | 70M videos   | High quality video-text pairs |
+| WebVid-10M        | 10M videos   | Low quality                   |
+| InternVid-10M-FLT | 10M videos   |                               |
+| EGO4D             | 3670 hours   |                               |
+| OpenDV-YouTube    | 1700 hours   |                               |
+| VidProM           | 6.69M videos |                               |
--- a/docs/zh_CN/report_v1.md
+++ b/docs/zh_CN/report_v1.md
@ -0,0 +1,47 @@
+# Open-Sora v1 Report
+
+OpenAI's Sora is amazing at generating one-minute high-quality videos. However, it reveals almost no information about its details. To make AI more "open", we are dedicated to building an open-source version of Sora. This report describes our first attempt to train a transformer-based video diffusion model.
+
+## Efficiency in choosing the architecture
+
+To lower the computational cost, we want to utilize existing VAE models. Sora uses a spatial-temporal VAE to reduce the temporal dimension. However, we found that there is no open-source high-quality spatial-temporal VAE model. [MAGVIT](https://github.com/google-research/magvit)'s 4x4x4 VAE is not open-sourced, while [VideoGPT](https://wilson1yan.github.io/videogpt/index.html)'s 2x4x4 VAE has low quality in our experiments. Thus, we decided to use a 2D VAE (from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original)) in our first version.
+
+Video training involves a large number of tokens. Considering 24fps 1min videos, we have 1440 frames. With VAE downsampling 4x and patch size downsampling 2x, we have 1440x1024≈1.5M tokens. Full attention on 1.5M tokens leads to a huge computational cost. Thus, we use spatial-temporal attention to reduce the cost following [Latte](https://github.com/Vchitect/Latte).
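
The token count above, and the saving from factorised attention, can be reproduced with a short calculation. This is a rough sketch under one assumption the report does not state: 256x256 frames, which is what gives 1024 tokens per frame with the quoted 4x and 2x factors. Counting query-key pairs is used here only as a proxy for attention cost.

```python
# Rough token and attention-cost estimate for a 1-minute, 24 fps video.
# Assumption (not stated in the report): frames are 256x256 pixels.
def tokens_per_frame(height, width, vae_down, patch_down):
    side_factor = vae_down * patch_down
    return (height // side_factor) * (width // side_factor)


def attention_pairs_full(frames, per_frame):
    # Every token attends to every token in the whole video.
    n = frames * per_frame
    return n * n


def attention_pairs_spatial_temporal(frames, per_frame):
    # Spatial: tokens attend within their own frame.
    # Temporal: each spatial location attends across frames.
    return frames * per_frame ** 2 + per_frame * frames ** 2


frames = 60 * 24                                  # 1440 frames
per_frame = tokens_per_frame(256, 256, 4, 2)      # 32 x 32 = 1024 tokens
total_tokens = frames * per_frame                 # 1474560, about 1.5M
ratio = attention_pairs_full(frames, per_frame) / attention_pairs_spatial_temporal(frames, per_frame)
print(total_tokens, round(ratio))
```

Under these assumptions the factorised scheme evaluates about 600 times fewer query-key pairs than full attention.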
+
+As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper. However, we do not keep the number of parameters comparable across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for efficiency. The speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).
+
+![Architecture Comparison](https://i0.imgs.ovh/2024/03/15/eLk9D.png)
+
+To focus on video generation, we hope to train the model based on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of the inserted temporal attention with zero. This initialization preserves the model's image generation ability at the beginning of training, while Latte's architecture cannot. The inserted attention increases the number of parameters from 580M to 724M.
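
Why zero initialisation preserves the pretrained behaviour can be seen in a toy example. The sketch below assumes the inserted temporal attention sits on a residual path (output = input + branch), and every name in it is hypothetical; it is not the STDiT code.

```python
# Toy illustration of zero-initialising the output projection of an inserted
# residual branch. All names are hypothetical; this is not the STDiT code.
def temporal_branch(x, proj_weight):
    # Stand-in for "temporal attention followed by a linear projection":
    # the inner computation is arbitrary, only the final projection matters.
    hidden = [v * 3.0 + 1.0 for v in x]
    return [proj_weight * h for h in hidden]


def block_with_inserted_branch(x, proj_weight):
    # Residual connection: output = input + branch(input).
    return [xi + bi for xi, bi in zip(x, temporal_branch(x, proj_weight))]


x = [0.5, -1.0, 2.0]
# Zero projection: the new branch contributes nothing, so the block is the identity.
assert block_with_inserted_branch(x, proj_weight=0.0) == x
# Non-zero projection (e.g. after some training): the branch changes the output.
assert block_with_inserted_branch(x, proj_weight=0.1) != x
```

With the branch silent at step zero, the network initially computes exactly what the pretrained image model computed, and the temporal path is learned gradually from there.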
+
+![Architecture](https://i0.imgs.ovh/2024/03/16/erC1d.png)
+
+Drawing from the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on 366K pretraining datasets, and then 16x256x256, 16x512x512, and 64x512x512 on 20K datasets. With scaled position embedding, this strategy greatly reduces the computational cost.
+
+We also tried using a 3D patch embedder in DiT. However, with 2x downsampling in the temporal dimension, the generated videos have low quality. Thus, we leave the downsampling to a temporal VAE in our next version. For now, we sample every 3 frames for 16-frame training and every 2 frames for 64-frame training.
+
+## Data is the key to high quality
+
+We find that the quantity and quality of data have a great impact on the quality of generated videos, even larger than the model architecture and training strategy. At this time, we have only prepared the first split (366K video clips) from [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not that accurate. Thus, we further collected 20k relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides freely licensed videos. We label each video with LLaVA, an image captioning model, using three frames and a designed prompt. With the designed prompt, LLaVA can generate good-quality captions.
+
+![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)
+
+As we place more emphasis on data quality, we plan to collect more data and build a video preprocessing pipeline in our next version.
+
+## Training Details
+
+With a limited training budget, we made only a few explorations. We find that a learning rate of 1e-4 is too large and scale it down to 2e-5. When training with a large batch size, we find `fp16` less stable than `bf16`, and it may lead to generation failure. Thus, we switch to `bf16` for training on 64x512x512. For other hyper-parameters, we follow previous works.
+
+## Loss curves
+
+16x256x256 Pretraining Loss Curve
+
+![16x256x256 Pretraining Loss Curve](https://i0.imgs.ovh/2024/03/16/erXQj.png)
+
+16x256x256 HQ Training Loss Curve
+
+![16x256x256 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/ernXv.png)
+
+16x512x512 HQ Training Loss Curve
+
+![16x512x512 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/erHBe.png)
--- a/docs/zh_CN/structure.md
+++ b/docs/zh_CN/structure.md
@ -0,0 +1,178 @@
+# Repo & Config Structure
+
+## Repo Structure
+
+```plaintext
+Open-Sora
+├── README.md
+├── docs
+│   ├── acceleration.md            -> Acceleration & Speed benchmark
+│   ├── commands.md                -> Commands for training & inference
+│   ├── datasets.md                -> Datasets used in this project
+│   ├── structure.md               -> This file
+│   └── report_v1.md               -> Report for Open-Sora v1
+├── scripts
+│   ├── train.py                   -> diffusion training script
+│   └── inference.py               -> diffusion inference script
+├── configs                        -> Configs for training & inference
+├── opensora
+│   ├── __init__.py
+│   ├── registry.py                -> Registry helper
+│   ├── acceleration               -> Acceleration related code
+│   ├── dataset                    -> Dataset related code
+│   ├── models
+│   │   ├── layers                 -> Common layers
+│   │   ├── vae                    -> VAE as image encoder
+│   │   ├── text_encoder           -> Text encoder
+│   │   │   ├── classes.py         -> Class id encoder (inference only)
+│   │   │   ├── clip.py            -> CLIP encoder
+│   │   │   └── t5.py              -> T5 encoder
+│   │   ├── dit
+│   │   ├── latte
+│   │   ├── pixart
+│   │   └── stdit                  -> Our STDiT related code
+│   ├── schedulers                 -> Diffusion schedulers
+│   │   ├── iddpm                  -> IDDPM for training and inference
+│   │   └── dpms                   -> DPM-Solver for fast inference
+│   └── utils
+└── tools                          -> Tools for data processing and more
+```
+
+## Configs
+
+Our config files follow [MMEngine](https://github.com/open-mmlab/mmengine). MMEngine reads the config file (a `.py` file) and parses it into a dictionary-like object.
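
To make the "`.py` file parsed into a dictionary-like object" idea concrete, here is a standard-library sketch. It is not how MMEngine is implemented (MMEngine itself exposes `Config.fromfile` for this); `load_py_config` and the demo file are hypothetical and only mimic the general behaviour.

```python
import runpy
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_py_config(path):
    """Execute a `.py` config and expose its public top-level names as attributes."""
    namespace = runpy.run_path(str(path))
    public = {k: v for k, v in namespace.items() if not k.startswith("_")}
    return SimpleNamespace(**public)


# Write a tiny throwaway config file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    cfg_file = Path(tmp) / "demo_config.py"
    cfg_file.write_text('num_frames = 16\nmodel = dict(type="STDiT-XL/2")\n')
    cfg = load_py_config(cfg_file)

print(cfg.num_frames, cfg.model["type"])  # 16 STDiT-XL/2
```

Because the config is ordinary Python, expressions such as `fps = 24 // 2` or `time_scale=2 / 3` are evaluated when the file is loaded.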
+
+```plaintext
+Open-Sora
+└── configs                        -> Configs for training & inference
+    ├── opensora                   -> STDiT related configs
+    │   ├── inference
+    │   │   ├── 16x256x256.py      -> Sample videos 16 frames 256x256
+    │   │   ├── 16x512x512.py      -> Sample videos 16 frames 512x512
+    │   │   └── 64x512x512.py      -> Sample videos 64 frames 512x512
+    │   └── train
+    │       ├── 16x256x256.py      -> Train on videos 16 frames 256x256
+    │       ├── 16x512x512.py      -> Train on videos 16 frames 512x512
+    │       └── 64x512x512.py      -> Train on videos 64 frames 512x512
+    ├── dit                        -> DiT related configs
+    │   ├── inference
+    │   │   ├── 1x256x256-class.py -> Sample images with ckpts from DiT
+    │   │   ├── 1x256x256.py       -> Sample images with clip condition
+    │   │   └── 16x256x256.py      -> Sample videos
+    │   └── train
+    │       ├── 1x256x256.py       -> Train on images with clip condition
+    │       └── 16x256x256.py      -> Train on videos
+    ├── latte                      -> Latte related configs
+    └── pixart                     -> PixArt related configs
+```
+
+## Inference config demos
+
+To change the inference settings, you can directly modify the corresponding config file. Alternatively, you can pass command-line arguments to override values in the config file ([config_utils.py](/opensora/utils/config_utils.py)). To change the sampling prompts, modify the `.txt` file passed to the `--prompt-path` argument.
+
+```plaintext
+--prompt-path ./assets/texts/t2v_samples.txt  -> prompt_path
+--ckpt-path ./path/to/your/ckpt.pth           -> model["from_pretrained"]
+```
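
How such command-line overrides might be merged into a loaded config is sketched below. This is an illustration only and not the contents of `config_utils.py`: `parse_overrides` and `merge_overrides` are hypothetical helpers, and the flag spellings follow the inference commands in the README (`--prompt-path`, `--ckpt-path`).

```python
import argparse


def parse_overrides(argv):
    # Hypothetical parser: only the two flags from the mapping above are modelled.
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt-path", default=None)
    parser.add_argument("--ckpt-path", default=None)
    return parser.parse_args(argv)


def merge_overrides(cfg, args):
    """Return a copy of `cfg` with command-line values applied on top."""
    merged = dict(cfg)
    merged["model"] = dict(cfg.get("model", {}))
    if args.prompt_path is not None:
        merged["prompt_path"] = args.prompt_path
    if args.ckpt_path is not None:
        merged["model"]["from_pretrained"] = args.ckpt_path
    return merged


cfg = {"prompt_path": "./default.txt", "model": {"type": "STDiT-XL/2"}}
args = parse_overrides(["--ckpt-path", "./path/to/your/ckpt.pth"])
merged = merge_overrides(cfg, args)
```

Flags that are not given leave the config value untouched, so here `prompt_path` keeps its default while `model["from_pretrained"]` takes the checkpoint path.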
+
+The explanation of each field is provided below.
+
+```python
+# Define sampling size
+num_frames = 64               # number of frames
+fps = 24 // 2                 # frames per second (divided by 2 for frame_interval=2)
+image_size = (512, 512)       # image size (height, width)
+
+# Define model
+model = dict(
+    type="STDiT-XL/2",        # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
+    space_scale=1.0,          # (Optional) Space positional encoding scale (new height / old height)
+    time_scale=2 / 3,         # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
+    enable_flashattn=True,    # (Optional) Speed up training and inference with flash attention
+    enable_layernorm_kernel=True, # (Optional) Speed up training and inference with fused kernel
+    from_pretrained="PRETRAINED_MODEL",  # (Optional) Load from pretrained model
+    no_temporal_pos_emb=True,  # (Optional) Disable temporal positional encoding (for image)
+)
+vae = dict(
+    type="VideoAutoencoderKL", # Select VAE type
+    from_pretrained="stabilityai/sd-vae-ft-ema", # Load from pretrained VAE
+    micro_batch_size=128,      # VAE with micro batch size to save memory
+)
+text_encoder = dict(
+    type="t5",                 # Select text encoder type (t5, clip)
+    from_pretrained="./pretrained_models/t5_ckpts", # Load from pretrained text encoder
+    model_max_length=120,      # Maximum length of input text
+)
+scheduler = dict(
+    type="iddpm",              # Select scheduler type (iddpm, dpm-solver)
+    num_sampling_steps=100,    # Number of sampling steps
+    cfg_scale=7.0,             # Classifier-free guidance scale
+)
+dtype = "fp16"                 # Computation type (fp16, fp32, bf16)
+
+# Other settings
+batch_size = 1                 # batch size
+seed = 42                      # random seed
+prompt_path = "./assets/texts/t2v_samples.txt"  # path to prompt file
+save_dir = "./samples"         # path to save samples
+```
+
+## Training config demos
+
+```python
+# Define sampling size
+num_frames = 64
+frame_interval = 2             # sample every 2 frames
+image_size = (512, 512)
+
+# Define dataset
+root = None                    # root path to the dataset
+data_path = "CSV_PATH"         # path to the csv file
+use_image_transform = False    # True if training on images
+num_workers = 4                # number of workers for dataloader
+
+# Define acceleration
+dtype = "bf16"                 # Computation type (fp16, bf16)
+grad_checkpoint = True         # Use gradient checkpointing
+plugin = "zero2"               # Plugin for distributed training (zero2, zero2-seq)
+sp_size = 1                    # Sequence parallelism size (1 for no sequence parallelism)
+
+# Define model
+model = dict(
+    type="STDiT-XL/2",
+    space_scale=1.0,
+    time_scale=2 / 3,
+    from_pretrained="YOUR_PRETRAINED_MODEL",
+    enable_flashattn=True,        # Enable flash attention
+    enable_layernorm_kernel=True, # Enable layernorm kernel
+)
+vae = dict(
+    type="VideoAutoencoderKL",
+    from_pretrained="stabilityai/sd-vae-ft-ema",
+    micro_batch_size=128,
+)
+text_encoder = dict(
+    type="t5",
+    from_pretrained="./pretrained_models/t5_ckpts",
+    model_max_length=120,
+    shardformer=True,           # Enable shardformer for T5 acceleration
+)
+scheduler = dict(
+    type="iddpm",
+    timestep_respacing="",      # Default 1000 timesteps
+)
+
+# Others
+seed = 42
+outputs = "outputs"             # path to save checkpoints
+wandb = False                   # Use wandb for logging
+
+epochs = 1000                   # number of epochs (just large enough, kill when satisfied)
+log_every = 10
+ckpt_every = 250
+load = None                     # path to resume training
+
+batch_size = 4
+lr = 2e-5
+grad_clip = 1.0                 # gradient clipping
+```
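
The relation between `frame_interval` in the training config and `fps = 24 // 2` in the inference config can be illustrated with a short sketch. It assumes 24 fps source videos, as implied by the inference demo; `sample_frame_indices` is a hypothetical helper, not a function from the repository.

```python
def sample_frame_indices(start, num_frames, frame_interval):
    """Indices of the source frames picked for one training clip."""
    return [start + i * frame_interval for i in range(num_frames)]


SOURCE_FPS = 24      # assumption: source videos are 24 fps
frame_interval = 2   # keep every 2nd frame, as in the training config
num_frames = 64

indices = sample_frame_indices(0, num_frames, frame_interval)
playback_fps = SOURCE_FPS // frame_interval   # 12, matching `fps = 24 // 2`
clip_seconds = num_frames / playback_fps      # duration covered by one clip
span = indices[-1] - indices[0] + 1           # source frames spanned by the clip
```

Saving the generated frames at the reduced frame rate keeps motion at its original speed: 64 frames sampled every 2 source frames cover a little over 5 seconds of footage.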