mirror of https://github.com/hpcaitech/Open-Sora.git
synced 2026-04-11 05:13:31 +02:00

update readme

This commit is contained in:
parent a3346808a6
commit ae6d88911d

README.md | 84 changed lines

@@ -25,14 +25,13 @@ With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the

## 📰 News

-* **[2024.04.22]** 🔥 We release **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video and video-to-video** generation. While a [checkpoint]() is provided, our model is still under training. In addition, a full video processing pipeline is released.
+* **[2024.04.22]** 🔥 We release **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video and video-to-video** generation. In addition, a full video processing pipeline is released. [[report]](/docs/report_02.md)
* **[2024.03.18]** We release **Open-Sora 1.0**, a fully open-source project for video generation.
  Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
  <a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a>
  acceleration,
-  inference, and more. Our provided [checkpoints](#model-weights) can produce 2s 512x512 videos with only 3 days
-  training.
-  [[blog]](https://hpc-ai.com/blog/open-sora-v1.0)
+  inference, and more. Our model can produce 2s 512x512 videos with only 3 days training. [[checkpoints]](#model-weights)
+  [[blog]](https://hpc-ai.com/blog/open-sora-v1.0) [[report]](docs/report_01.md)
* **[2024.03.04]** Open-Sora provides training with 46% cost reduction.
  [[blog]](https://hpc-ai.com/blog/open-sora)
@@ -50,35 +49,35 @@ TBD

| [<img src="assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |

+Videos are downsampled to `.gif` for display. Click for original videos. Prompts are trimmed for display;
+see [here](/assets/texts/t2v_samples.txt) for full prompts.
+
</details>

-Videos are downsampled to `.gif` for display. Click for original videos. Prompts are trimmed for display,
-see [here](/assets/texts/t2v_samples.txt) for full prompts. See more samples at
-our [gallery](https://hpcaitech.github.io/Open-Sora/).
+More samples are available in our [gallery](https://hpcaitech.github.io/Open-Sora/).

## 🔆 New Features/Updates

-* 📍 Open-Sora 1.0 released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800
-  days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos. See our **[report](docs/report_01.md)** for more discussions.
-* ✅ Three stages training from an image diffusion model to a video diffusion model. We provide the weights for each
+* 📍 **Open-Sora 1.1** released. Model weights are available [here](). It is trained on **0s~15s, 144p to 720p, various aspect ratios** videos. See our **[report 1.1](docs/report_02.md)** for more discussions.
+* 🔧 **Data processing pipeline v1.1** is released. An automatic processing pipeline from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering (aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
+* ✅ Modified ST-DiT architecture includes rope positional encoding, qk norm, longer text length, etc.
+* ✅ Support training with any resolution, aspect ratio, and duration (including images).
+* ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
+* 📍 **Open-Sora 1.0** released. Model weights are available [here](#model-weights). With only 400K video clips and 200 H800
+  days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos. See our **[report 1.0](docs/report_01.md)** for more discussions.
+* ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each
  stage.
* ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism.
  Open-Sora improves training speed by **55%** when training on 64x512x512 videos. Details are in [acceleration.md](docs/acceleration.md).
* 🔧 **Data preprocessing pipeline v1.0**,
  including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md),
  and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found
  in [datasets.md](docs/datasets.md).

<details>
<summary>View more</summary>

* ✅ We provide a data preprocessing pipeline,
  including [downloading](/tools/datasets/README.md), [video cutting](/tools/scenedetect/README.md),
  and [captioning](/tools/caption/README.md) tools. Our data collection plan can be found
  in [datasets.md](docs/datasets.md).
* ✅ We find the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality and thus adopt a
  better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in
  the time dimension deteriorates the quality. See our **[report](docs/report_01.md)** for more discussions.
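The v1.1 data pipeline bullet above describes a chain of scene cutting, filtering, and captioning that turns raw videos into (text, video clip) pairs. The following sketch shows that flow in miniature; every function here is a hypothetical stand-in, not one of the repository's actual tools:

```python
# Illustrative sketch of a raw-video -> (text, video clip) pipeline.
# All functions are hypothetical placeholders for the real tools
# (scene detection, aesthetic/optical-flow/OCR filters, a captioning model).

def cut_scenes(video):
    # stand-in for scene cutting: split a frame list into 4-frame clips
    return [video[i:i + 4] for i in range(0, len(video), 4)]

def passes_filters(clip, min_frames=2):
    # stand-in for aesthetic / optical-flow / OCR filtering
    return len(clip) >= min_frames

def caption(clip):
    # stand-in for a captioning model
    return f"a clip with {len(clip)} frames"

def build_dataset(video):
    clips = cut_scenes(video)                        # scene cutting
    kept = [c for c in clips if passes_filters(c)]   # filtering
    return [(caption(c), c) for c in kept]           # captioning -> (text, clip) pairs
```

For example, `build_dataset(list(range(10)))` yields three (caption, clip) pairs, the last covering the two leftover frames.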
@@ -124,6 +123,8 @@ our [gallery](https://hpcaitech.github.io/Open-Sora/).

## Installation

+TODO: discuss how to include data installation here.
+
```bash
# create a virtual env
conda create -n opensora python=3.10
@@ -158,6 +159,15 @@ the config files.

## Model Weights

+### Open-Sora 1.1 Model Weights
+
+TBD
+
+### Open-Sora 1.0 Model Weights
+
+<details>
+<summary>View more</summary>
+
| Resolution | Data   | #iterations | Batch Size | GPU days (H800) | URL |
| ---------- | ------ | ----------- | ---------- | --------------- | --------------------------------------------------------------------------------------------- |
| 16×512×512 | 20K HQ | 20k         | 2×64       | 35              | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth) |
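The `:link:` URL in the table above points at a Hugging Face blob page; Hugging Face's direct-download form of such a URL simply swaps `/blob/` for `/resolve/`. A small sketch, using only the 16×512×512 checkpoint URL from the table:

```python
# Convert a Hugging Face blob-page URL (as linked in the table) into the
# direct-download form by swapping /blob/ for /resolve/.
blob_url = "https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth"
download_url = blob_url.replace("/blob/", "/resolve/")
print(download_url)
```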
@@ -174,8 +184,12 @@ the dataset can be found in [datasets.md](/docs/datasets.md). HQ means high qual

The model performs badly, especially on generating human beings, and cannot follow detailed instructions. We are working
on improving the quality and text alignment.

+</details>
+
## Inference

+### Gradio Demo
+
We have provided a [Gradio application](./gradio) in this repository; you can use the following command to start an interactive web application to experience video generation with Open-Sora.

```bash
@@ -185,7 +199,16 @@ python gradio/app.py

This will launch a Gradio application on your localhost. If you want to know more about the Gradio application, you can refer to the [README file](./gradio/README.md).

-Besides, we have also provided an offline inference script. Run the following commands to generate samples, the required model weights will be automatically downloaded. To change sampling prompts, modify the txt file passed to `--prompt-path`. See [here](docs/structure.md#inference-config-demos) to customize the configuration.
+### Open-Sora 1.1 Command Line Inference
+
+TBD
+
+### Open-Sora 1.0 Command Line Inference
+
+<details>
+<summary>View more</summary>
+
+We have also provided an offline inference script. Run the following commands to generate samples; the required model weights will be automatically downloaded. To change sampling prompts, modify the txt file passed to `--prompt-path`. See [here](docs/structure.md#inference-config-demos) to customize the configuration.

```bash
# Sample 16x512x512 (20s/sample, 100 time steps, 24 GB memory)
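As the inference notes above say, sampling prompts come from the txt file passed via `--prompt-path`. A minimal sketch of preparing such a file; the one-prompt-per-line format and the file name are assumptions here, so check the repo's sample file at `/assets/texts/t2v_samples.txt` for the exact format:

```python
# Write a prompt file for --prompt-path. One prompt per line is assumed,
# and the file name is illustrative.
from pathlib import Path

prompts = [
    "A serene underwater scene featuring a sea turtle swimming through a coral reef.",
    "A bustling city street at night, filled with the glow of car headlights.",
]
Path("my_prompts.txt").write_text("\n".join(prompts) + "\n")
```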
@@ -205,8 +228,12 @@ torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/i

The speed is tested on H800 GPUs. For inference with other models, see [here](docs/commands.md) for more instructions.
To lower the memory usage, set a smaller `vae.micro_batch_size` in the config (at a slightly lower sampling speed).
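The `vae.micro_batch_size` setting above trades sampling speed for memory by encoding frames in small chunks rather than all at once. Conceptually it works like the helper below; this is a sketch of the idea, not the repository's implementation:

```python
# Conceptual sketch of VAE micro-batching: process frames in small chunks so
# peak memory scales with micro_batch_size instead of the full frame count.
def encode_in_micro_batches(frames, encode, micro_batch_size):
    out = []
    for i in range(0, len(frames), micro_batch_size):
        out.extend(encode(frames[i:i + micro_batch_size]))  # one small call at a time
    return out
```

With any per-chunk `encode` function, the chunked result matches a single full-batch call, only with a smaller peak working set.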

+</details>
+
## Data Processing

+To be modified
+
High-quality data is the key to high-quality models. The datasets we used and our data collection plan
are listed [here](/docs/datasets.md). We provide tools to process video data. Our data processing pipeline includes
the following steps:
@@ -309,23 +336,10 @@ this work during their internship at [HPC-AI Tech](https://hpc-ai.com/).

* [StabilityAI VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse-original): A powerful image VAE model.
* [CLIP](https://github.com/openai/CLIP): A powerful text-image embedding model.
* [T5](https://github.com/google-research/text-to-text-transfer-transformer): A powerful text encoder.
-* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based
-  on [Yi-34B](https://huggingface.co/01-ai/Yi-34B).
+* [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful image captioning model based on [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and [Yi-34B](https://huggingface.co/01-ai/Yi-34B).

We are grateful for their exceptional work and generous contribution to open source.

-<!-- ## Citation
-
-```bibtex
-@software{opensora,
-  author = {Zangwei Zheng and Xiangyu Peng and Yang You},
-  title = {Open-Sora: Democratizing Efficient Video Production for All},
-  month = {March},
-  year = {2024},
-  url = {https://github.com/hpcaitech/Open-Sora}
-}
-``` -->
-
## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=hpcaitech/Open-Sora&type=Date)](https://star-history.com/#hpcaitech/Open-Sora&Date)
docs/report_01.md

@@ -1,7 +1,5 @@

# Open-Sora 1.0 Report

-> Zangwei Zheng*, Xiangyu Peng*, Shenggui Li, Hongxing Liu, Yang You
-
OpenAI's Sora is amazing at generating one-minute, high-quality videos. However, it reveals almost no details about its implementation. To make AI more "open", we are dedicated to building an open-source version of Sora. This report describes our first attempt to train a transformer-based video diffusion model.

## Efficiency in choosing the architecture

@@ -47,3 +45,5 @@ With a limited training budget, we made only a few explorations. We find learnin

16x512x512 HQ Training Loss Curve

+
+> Core Contributor: Zangwei Zheng*, Xiangyu Peng*, Shenggui Li, Hongxing Liu, Yang You
docs/report_02.md

@@ -1,3 +1,7 @@

# Open-Sora 1.1 Report

-> Zangwei Zheng*, Xiangyu Peng*, Shenggui Li, Hongxing Liu, Junjie Wang, Chenhui Shen, Tom Young, Yang You
+> Core Contributor:
+>
+> * **Algorithm & Acceleration**: Zangwei Zheng, Xiangyu Peng, Shenggui Li, Hongxing Liu, Yukun Zhou
+> * **Data Collection & Pipeline**: Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Junjie Wang, Chenfeng Yu
+> * **System Maintenance**: Wenjun Li, Shenggui Li