mirror of https://github.com/hpcaitech/Open-Sora.git (synced 2026-04-11 05:13:31 +02:00)

Docs/fix zw (#476)

* [docs] inference-long merge
* [docs] update readme

parent 73eda722f8
commit eaa6902527
README.md
@@ -89,6 +89,10 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.

- ✅ Trained our 3D-VAE for temporal dimension compression.
- 📍 **Open-Sora 1.1** released. Model weights are available [here](#model-weights). It is trained on **0s~15s, 144p to 720p, various aspect ratios** videos. See our **[report 1.1](/docs/report_02.md)** for more discussions.
- 🔧 **Data processing pipeline v1.1** is released. An automatic [processing pipeline](#data-processing) from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering (aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.

<details>
<summary>View more</summary>

- ✅ Improved ST-DiT architecture includes RoPE positional encoding, QK-norm, longer text length, etc.
- ✅ Support training with any resolution, aspect ratio, and duration (including images).
- ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
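The raw-video-to-(text, clip) flow of the pipeline above can be sketched in miniature. All function names and the toy scene-cut heuristic here are illustrative stand-ins, not the repo's actual tooling; a real pipeline would run scene detection, aesthetic/optical-flow/OCR filters, and a captioning model at each stage.

```python
# Hypothetical sketch of the v1.1-style pipeline stages:
# scene cutting -> filtering -> captioning -> managing.

def cut_scenes(video):
    """Split a raw video (here just a list of frame values) into clips."""
    # Toy stand-in: start a new clip when the frame value jumps sharply.
    clips, current = [], []
    current.append(video[0])
    for prev, frame in zip(video, video[1:]):
        if abs(frame - prev) > 10:  # pretend a large delta is a scene cut
            clips.append(current)
            current = []
        current.append(frame)
    clips.append(current)
    return clips

def passes_filters(clip, min_len=2):
    """Stand-in for the aesthetic / optical-flow / OCR filters."""
    return len(clip) >= min_len

def caption(clip):
    """Stand-in captioner; a real pipeline would call a VLM here."""
    return f"clip with {len(clip)} frames"

def build_dataset(video):
    """Produce the (text, video clip) pairs used for training."""
    return [(caption(c), c) for c in cut_scenes(video) if passes_filters(c)]

pairs = build_dataset([0, 1, 2, 50, 51, 52, 53])
# -> [('clip with 3 frames', [0, 1, 2]), ('clip with 4 frames', [50, 51, 52, 53])]
```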
@@ -103,10 +103,6 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.

including [downloading](tools/datasets/README.md), [video cutting](tools/scene_cut/README.md),
and [captioning](tools/caption/README.md) tools. Our data collection plan can be found
at [datasets.md](docs/datasets.md).

<details>
<summary>View more</summary>

- ✅ We find the VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has low quality and thus adopt a
  better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find that patching in
  the time dimension deteriorates the quality. See our **[report](docs/report_01.md)** for more discussions.
@@ -154,7 +154,11 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.

Other useful documents and links are listed below.

- Report: each version is trained from an image base separately (not continuously trained), while each newer version incorporates the techniques from the previous one.
  - [report 1.2](docs/report_03.md): rectified flow, 3D-VAE, score condition, evaluation, etc.
  - [report 1.1](docs/report_02.md): multi-resolution/length/aspect-ratio, image/video conditioning/editing, data preprocessing, etc.
  - [report 1.0](docs/report_01.md): architecture, captioning, etc.
  - [acceleration.md](docs/acceleration.md)
- Repo structure: [structure.md](docs/structure.md)
- Config file explanation: [config.md](docs/config.md)
- Useful commands: [commands.md](docs/commands.md)
@@ -494,7 +498,7 @@ For training other models and advanced usage, see [here](docs/commands.md) for m

We support evaluation based on:

- Validation loss
- [VBench](https://github.com/Vchitect/VBench/tree/master) score
- VBench-i2v score
- Batch generation for human evaluation
@@ -63,7 +63,7 @@ Latest diffusion models like Stable Diffusion 3 adopt the [rectified flow](http

For resolution-aware timestep sampling, we should use more noise for images with larger resolutions. We extend this idea to video generation and use more noise for longer videos.
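The shift described above can be sketched as follows. This is a minimal illustration, not the repo's code: it assumes the SD3-style square-root-of-token-ratio shift, a timestep convention where t in [0, 1] and t = 1 is pure noise, and a `base_tokens` reference size chosen here for the example.

```python
import math

def shift_timestep(t, num_tokens, base_tokens=256):
    """Shift a rectified-flow timestep t in [0, 1] toward the noisy end
    (t = 1) for larger inputs. A video with more spatio-temporal tokens
    gets a larger shift, i.e. more noise at the same nominal t.
    The sqrt(token-ratio) rule follows the SD3-style schedule;
    base_tokens is an assumed reference size, not a value from the report."""
    shift = math.sqrt(num_tokens / base_tokens)
    return shift * t / (1 + (shift - 1) * t)

# The same nominal t maps to a noisier timestep for a longer/larger video:
small = shift_timestep(0.5, num_tokens=256)   # shift = 1 -> unchanged, 0.5
large = shift_timestep(0.5, num_tokens=4096)  # shift = 4 -> 0.8
```

At `num_tokens == base_tokens` the schedule is the identity, so the shift only kicks in as resolution or duration grows beyond the reference.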
Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model was trained with DDPM and the SDXL VAE, and at a much higher resolution. We find that finetuning on a small dataset can easily adapt the model to our video generation setting. The adaptation process is as follows (adapting the diffusion model is quite fast and straightforward); all training is done on 8 GPUs:

1. Multi-resolution image generation ability: we train the model to generate images at different resolutions ranging from 144p to 2K for 20k steps.
2. QK-norm: we add QK-norm to the model and train for 18k steps.
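The QK-norm added in step 2 can be sketched as a minimal single-head attention in NumPy. This is an illustration of the technique, not the ST-DiT implementation; real QK-norm layers typically use an RMSNorm or LayerNorm with learnable parameters rather than the plain L2 normalization shown here.

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with QK-norm: L2-normalize queries and keys along the
    head dimension before the dot product, which bounds the attention
    logits to [-1, 1] and stabilizes training at high learning rates."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    scores = q @ k.T  # logits are now cosine similarities
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because normalization fixes the logit scale, the pretrained weights need only a short finetune (18k steps here) to adjust to the new attention statistics.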