diff --git a/README.md b/README.md
index 42201f6..2144526 100644
--- a/README.md
+++ b/README.md
@@ -89,6 +89,10 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
 - ✅ Trained our 3D-VAE for temporal dimension compression.
 - 📍 **Open-Sora 1.1** released. Model weights are available [here](#model-weights). It is trained on **0s~15s, 144p to 720p, various aspect ratios** videos. See our **[report 1.1](/docs/report_02.md)** for more discussions.
 - 🔧 **Data processing pipeline v1.1** is released. An automatic [processing pipeline](#data-processing) from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering(aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
+
+<details>
+<summary>View more</summary>
+
 - ✅ Improved ST-DiT architecture includes rope positional encoding, qk norm, longer text length, etc.
 - ✅ Support training with any resolution, aspect ratio, and duration (including images).
 - ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
@@ -103,10 +107,6 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
   including [downloading](tools/datasets/README.md), [video cutting](tools/scene_cut/README.md), and [captioning](tools/caption/README.md) tools. Our data collection plan can be found at [datasets.md](docs/datasets.md).
-
-<details>
-<summary>View more</summary>
-
 - ✅ We find VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has a low quality and thus adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find patching in the time dimension deteriorates the quality. See our **[report](docs/report_01.md)** for more discussions.
@@ -154,7 +154,11 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
 Other useful documents and links are listed below.
-- Report: [report 1.2](docs/report_03.md), [report 1.1](docs/report_02.md), [report 1.0](docs/report_01.md), [acceleration.md](docs/acceleration.md)
+- Report: each version is trained separately from an image base (not continuously trained), and each newer version incorporates the techniques of the previous one.
+  - [report 1.2](docs/report_03.md): rectified flow, 3D-VAE, score conditioning, evaluation, etc.
+  - [report 1.1](docs/report_02.md): multi-resolution/length/aspect-ratio, image/video conditioning/editing, data preprocessing, etc.
+  - [report 1.0](docs/report_01.md): architecture, captioning, etc.
+  - [acceleration.md](docs/acceleration.md)
 - Repo structure: [structure.md](docs/structure.md)
 - Config file explanation: [config.md](docs/config.md)
 - Useful commands: [commands.md](docs/commands.md)
@@ -494,7 +498,7 @@ For training other models and advanced usage, see [here](docs/commands.md) for m
 We support evaluation based on:
 - Validation loss
-- VBench score
+- [VBench](https://github.com/Vchitect/VBench/tree/master) score
 - VBench-i2v score
 - Batch generation for human evaluation

diff --git a/docs/report_03.md b/docs/report_03.md
index cb9677c..02c55b7 100644
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -63,7 +63,7 @@ Lastest diffusion model like Stable Diffusion 3 adopts the [rectified flow](http
 For the resolution-aware timestep sampling, we should use more noise for images with larger resolution.
 We extend this idea to video generation and use more noise for videos with longer length.
-Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model is trained with DDPM and SDXL VAE, also a much higher resolution. We find finetuning on a small dataset can easily adapt the model for our video generation setting. The adaptation process is as follows, all training is done on 8 GPUs:
+Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model is trained with DDPM and the SDXL VAE, and at a much higher resolution. We find that finetuning on a small dataset can easily adapt the model to our video generation setting. The adaptation process is as follows (adapting the diffusion model is quite fast and straightforward); all training is done on 8 GPUs:
 1. Multi-resolution image generation ability: we train the model to generate different resolution ranging from 144p to 2K for 20k steps.
 2. QK-norm: we add the QK-norm to the model and train for 18k steps.
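
The resolution- and length-aware sampling touched by the report_03.md hunk ("more noise for larger resolution / longer videos") can be sketched as an SD3-style timestep shift. The scaling rule below — shift growing with the square root of relative pixel and frame count — is an illustrative assumption, not Open-Sora's verbatim schedule:

```python
import math

import torch


def shift_timesteps(t: torch.Tensor, height: int, width: int, num_frames: int,
                    base_pixels: int = 256 * 256, base_frames: int = 1) -> torch.Tensor:
    """Shift rectified-flow timesteps t in (0, 1) toward the noise end for
    larger resolutions and longer videos.

    The scaling rule (shift ~ sqrt of relative pixel and frame count) is an
    assumption for illustration; the actual Open-Sora schedule may differ.
    """
    shift = math.sqrt((height * width / base_pixels) * (num_frames / base_frames))
    # SD3-style shift: t' = s*t / (1 + (s-1)*t). Monotone, keeps t' in (0, 1),
    # and moves mass toward t' = 1 (pure noise) when s > 1.
    return shift * t / (1 + (shift - 1) * t)
```

Under this rule a 720p, 32-frame clip is trained at noisier timesteps than a 144p single image for the same base `t`, matching the "more noise for longer videos" intuition in the report.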
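
The "QK-norm" added in step 2 normalizes queries and keys before the attention logits are formed, which bounds the logits and avoids the softmax saturation that destabilizes large-scale or half-precision training. A minimal sketch of one common variant (per-head L2 normalization with a fixed temperature; Open-Sora's own module and its learnable scale may differ):

```python
import torch
import torch.nn.functional as F


def qk_norm_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      temperature: float = 10.0) -> torch.Tensor:
    """Scaled dot-product attention with QK-norm.

    q, k, v: (batch, heads, seq, head_dim). The fixed temperature is an
    illustrative stand-in for the learnable scale used in practice.
    """
    q = F.normalize(q, dim=-1)  # unit-norm queries per head
    k = F.normalize(k, dim=-1)  # unit-norm keys per head
    logits = (q @ k.transpose(-2, -1)) * temperature  # bounded in [-T, T]
    return torch.softmax(logits, dim=-1) @ v
```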