Docs/fix zw (#476)

* [docs] inference-long merge

* [docs] update readme
Zheng Zangwei (Alex Zheng) 2024-06-19 18:14:21 +08:00 committed by GitHub
parent 73eda722f8
commit eaa6902527
2 changed files with 11 additions and 7 deletions


@@ -89,6 +89,10 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
- ✅ Trained our 3D-VAE for temporal dimension compression.
- 📍 **Open-Sora 1.1** released. Model weights are available [here](#model-weights). It is trained on **0s~15s, 144p to 720p, various aspect ratios** videos. See our **[report 1.1](/docs/report_02.md)** for more discussions.
- 🔧 **Data processing pipeline v1.1** is released. An automatic [processing pipeline](#data-processing) from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering (aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
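The pipeline stages above can be sketched in plain Python. All names and bodies below are hypothetical stand-ins for illustration only, not the repo's actual tools (those live under `tools/`):

```python
# Illustrative sketch of the raw-video -> (text, video clip) stages; the
# function names (scene_cut, score_clip, caption_clip) are hypothetical.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    aesthetic: float = 0.0
    flow: float = 0.0
    caption: str = ""

def scene_cut(video_path):
    # split a raw video at shot boundaries (e.g. with a scene detector)
    return [Clip(f"{video_path}.clip{i}.mp4") for i in range(3)]

def score_clip(clip):
    # placeholder scores; the real pipeline runs aesthetic and
    # optical-flow models (plus OCR) to rate each clip
    return 5.0, 1.0

def caption_clip(clip):
    # placeholder caption; the real pipeline uses a VLM captioner
    return f"a video clip cut from {clip.path}"

def build_dataset(raw_videos, min_aesthetic=4.5, min_flow=0.5):
    # scene cutting -> filtering -> captioning -> (text, clip) pairs
    pairs = []
    for video in raw_videos:
        for clip in scene_cut(video):
            clip.aesthetic, clip.flow = score_clip(clip)
            if clip.aesthetic < min_aesthetic or clip.flow < min_flow:
                continue  # drop static, low-quality, or text-heavy clips
            clip.caption = caption_clip(clip)
            pairs.append((clip.caption, clip.path))
    return pairs
```

The thresholds here are made up; the point is the shape of the flow: cut first, filter cheaply before running the expensive captioner.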
<details>
<summary>View more</summary>
- ✅ Improved ST-DiT architecture includes RoPE positional encoding, QK-norm, longer text length, etc.
- ✅ Support training with any resolution, aspect ratio, and duration (including images).
- ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
@@ -103,10 +107,6 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
including [downloading](tools/datasets/README.md), [video cutting](tools/scene_cut/README.md),
and [captioning](tools/caption/README.md) tools. Our data collection plan can be found
at [datasets.md](docs/datasets.md).
<details>
<summary>View more</summary>
- ✅ We find the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality and thus adopt a
better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in
the time dimension deteriorates quality. See our **[report](docs/report_01.md)** for more discussions.
@@ -154,7 +154,11 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
Other useful documents and links are listed below.
- Report: each version is trained from an image base separately (not continuously trained), while each newer version incorporates the techniques of the previous one.
  - [report 1.2](docs/report_03.md): rectified flow, 3D-VAE, score conditioning, evaluation, etc.
  - [report 1.1](docs/report_02.md): multi-resolution/length/aspect-ratio, image/video conditioning/editing, data preprocessing, etc.
  - [report 1.0](docs/report_01.md): architecture, captioning, etc.
  - [acceleration.md](docs/acceleration.md)
- Repo structure: [structure.md](docs/structure.md)
- Config file explanation: [config.md](docs/config.md)
- Useful commands: [commands.md](docs/commands.md)
@@ -494,7 +498,7 @@ For training other models and advanced usage, see [here](docs/commands.md) for m
We support evaluation based on:
- Validation loss
- [VBench](https://github.com/Vchitect/VBench/tree/master) score
- VBench-i2v score
- Batch generation for human evaluation


@@ -63,7 +63,7 @@ The latest diffusion models like Stable Diffusion 3 adopt the [rectified flow](http
For resolution-aware timestep sampling, we should apply more noise to higher-resolution images. We extend this idea to video generation and apply more noise to longer videos.
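One common way to realize this is an SD3-style timestep shift that grows with the input size. The sketch below is an assumption about the general mechanism, not Open-Sora's exact schedule; the `sqrt` scaling and the `base_tokens`/`base_shift` defaults are hypothetical:

```python
import math

def shifted_timestep(t, num_tokens, base_tokens=1024, base_shift=1.0):
    # Shift a sampled timestep t in [0, 1] toward the noisy end for larger
    # inputs. num_tokens grows with both spatial resolution and video
    # length, so higher-resolution images and longer videos get more noise.
    # The sqrt scaling and defaults here are illustrative assumptions.
    shift = base_shift * math.sqrt(num_tokens / base_tokens)
    return shift * t / (1 + (shift - 1) * t)
```

For `num_tokens == base_tokens` the mapping is the identity; quadrupling the token count maps t = 0.5 to roughly 0.67, i.e. noticeably more noise, while the endpoints 0 and 1 stay fixed.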
Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model was trained with DDPM and the SDXL VAE, and at a much higher resolution. We find that finetuning on a small dataset easily adapts the model to our video generation setting. The adaptation process is as follows; all training is done on 8 GPUs (adapting the diffusion model is quite fast and straightforward):
1. Multi-resolution image generation ability: we train the model to generate images at resolutions ranging from 144p to 2K for 20k steps.
2. QK-norm: we add the QK-norm to the model and train for 18k steps.
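A minimal sketch of what adding QK-norm means, assuming RMS normalization of queries and keys before the dot-product attention (single head, learned scales omitted; not the repo's exact implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS-normalize over the head dimension (learned scale omitted)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention_qk_norm(q, k, v):
    # Normalizing q and k before the dot product bounds the attention
    # logits, which helps stabilize training. Shapes: (batch, seq, dim).
    q, k = rms_norm(q), rms_norm(k)
    d = q.shape[-1]
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    # numerically stable softmax over the key axis
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because RMS-normalized vectors have bounded norms, the pre-softmax logits cannot blow up, which is why a short finetune suffices to add the norm to a pretrained model.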