minor fixes

2026-04-10 21:01:26 +02:00 · 2024-06-17 13:37:49 +00:00
commit 31aaa80432
parent 5cbfe53086
4 changed files with 10 additions and 17 deletions
--- a/README.md
+++ b/README.md
@@ -312,7 +312,7 @@ The basic command line inference is as follows:
 ```bash
 # text to video
 python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
-  --num-frames 4s --resolution 720p \
+  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --prompt "a beautiful waterfall"
 ```

@@ -320,7 +320,7 @@ You can add more options to the command line to customize the generation.

 ```bash
 python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
-  --num-frames 4s --resolution 720p \
+  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --num-sampling-steps 30 --flow 5 --aes 6.5 \
  --prompt "a beautiful waterfall"
 ```
@@ -402,20 +402,19 @@ Also check out the [datasets](docs/datasets.md) we use.

 ## VAE
 We train a VAE pipeline that consists of a spatial VAE followed by a temporal VAE.
-For more details, refer to our [VAE documentation](docs/vae.md).
+For more details, refer to [VAE](docs/vae.md).
 Before you run the following commands, follow our [Installation Documentation](docs/installation.md) to install the required dependencies for VAE and Evaluation.

-Once you prepare the data in a `csv` file, run the following commands to train the VAE.
+If you want to train your own VAE, we need to prepare data in the csv following the [data processing](#data-processing) pipeline, then run the following commands.
 Note that you need to adjust the number of trained epochs (`epochs`) in the config file accordingly with respect to your own csv data size.

-
 ```bash
 # stage 1 training, 380k steps, 8 GPUs
 torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage1.py --data-path YOUR_CSV_PATH
 # stage 2 training, 260k steps, 8 GPUs
-torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage[1-3].py --data-path YOUR_CSV_PATH
+torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage2.py --data-path YOUR_CSV_PATH
 # stage 3 training, 540k steps, 24 GPUs
-torchrun --nnodes=3 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage[1-3].py --data-path YOUR_CSV_PATH
+torchrun --nnodes=3 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage3.py --data-path YOUR_CSV_PATH
 ```
 To evaluate the VAE performance, you need to run VAE inference first to generate the videos, then calculate scores on the generated videos:

--- a/docs/installation.md
+++ b/docs/installation.md
@@ -211,4 +211,4 @@ pip install -v .[vae]

 ### Step 2: VAE Evaluation (`cupy` and Potential VAE Errors)

-Refer to [Evaluation's VAE section](#step-3-install-cupy-for-potential-vae-errors).
+Refer to the [Evaluation's VAE section](#step-3-install-cupy-for-potential-vae-errors) above.
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -7,7 +7,7 @@
 - [Evaluation](#evaluation)
 - [Sequence parallelism](#sequence-parallelism)

-In Open-Sora 1.2 release, we train a 1.1B models on >20M data, with training cost 35k H100 GPU hours, supporting 0s~15s, 144p to 720p, various aspect ratios video generation. Our configurations is listed below. Following our 1.1 version, Open-Sora 1.2 can also do image-to-video generation and video extension.
+In Open-Sora 1.2 release, we train a 1.1B models on >20M data, with training cost 35k H100 GPU hours, supporting 0s~16s, 144p to 720p, various aspect ratios video generation. Our configurations is listed below. Following our 1.1 version, Open-Sora 1.2 can also do image-to-video generation and video extension.

 |      | image | 2s  | 4s  | 8s  | 16s |
 | ---- | ----- | --- | --- | --- | --- |
@@ -44,7 +44,7 @@ Our training involves three stages:
 2. For the next 260k steps, We remove the identity loss and just learn the 3D VAE.
 3. For the last 540k steps , since we find only reconstruction 2D VAE's feature cannot lead to further improvement, we remove the loss and train the whole VAE to reconstruct the original videos. This stage is trained on on 24 GPUs.

-For both stage 1 and stage 2 training, we adopt 20% images and 80% videos. We find videos with length different from 17 frames will suffer from blurring. Thus, we use a random number within 34 frames to make our VAE more robust to different video lengths. Our [training](/scripts/train_vae.py) and [inference](/scripts/inference_vae.py) code is available in the Open-Sora 1.2 release.
+For both stage 1 and stage 2 training, we adopt 20% images and 80% videos. Following [Magvit-v2](https://magvit.cs.cmu.edu/v2/), we train video using 17 frames, while zero-padding the first 16 frames for image. However, we find that this setting leads to blurring of videos with length different from 17 frames. Thus, in stage 3, we use a random number within 34 frames for mixed video length training (a.k.a., zero-pad the first  `43-n` frames if we want to train a `n` frame video), to make our VAE more robust to different video lengths. Our [training](/scripts/train_vae.py) and [inference](/scripts/inference_vae.py) code is available in the Open-Sora 1.2 release.

 When using the VAE for diffusion model, our stacked VAE requires small memory as the our VAE's input is already compressed. We also split the input videos input several 17 frames clips to make the inference more efficient.  The performance of our VAE is on par with another open-sourced 3D VAE in [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.1.0.md).

@@ -126,12 +126,6 @@ For example, a video with aesthetic score 5.5, motion score 10, and a detected c

 During inference, we can also use the scores to condition the model. For camera motion, we only label 13k clips with high confidence, and the camera motion detection module is released in our tools.

-[Aesthetic Score Examples TBD]
-
-[Motion Score Examples TBD]
-
-[Camera Motion Detection Module TBD]
-
 ## Evaluation

 Previously, we monitor the training process only by human evaluation, as DDPM traning loss is not well correlated with the quality of generated videos. However, for rectified flow, we find the training loss is well correlated with the quality of generated videos as stated in SD3. Thus, we keep track of rectified flow evaluation loss on 100 images and 1k videos.
--- a/docs/vae.md
+++ b/docs/vae.md
@@ -1,6 +1,6 @@
 # VAE Report

-As [Pixart-Sigma](https://arxiv.org/abs/2403.04692) finds that adapting to a new VAE is simple, we develop a temporal VAE for the diffusion model to adapt to.
+As [Pixart-Sigma](https://arxiv.org/abs/2403.04692) finds that adapting to a new VAE is simple, we develop an additional temporal VAE.
 Specifically, our VAE consists of a pipeline of a [spatial VAE](https://huggingface.co/PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers) followed by a temporal VAE.
 For the temporal VAE, we follow the implementation of [MAGVIT-v2](https://arxiv.org/abs/2310.05737), with the following modifications:
 * We remove the architecture specific to the codebook.
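
Note on the `docs/report_03.md` hunk: the added sentence describes zero-padding at the front of a clip so that a single image becomes a 17-frame input (16 zero frames followed by the image). A minimal sketch of that idea is below. The helper name `zero_pad_front` and the flat-list frame representation are assumptions made for illustration only; this is not the code in `scripts/train_vae.py`.

```python
def zero_pad_front(frames, target_len):
    """Prepend all-zero frames so the clip has exactly target_len frames.

    `frames` is a non-empty list of frames; each frame is a flat list of
    numbers (a stand-in for a real image tensor).
    """
    n = len(frames)
    if n == 0:
        raise ValueError("need at least one frame")
    if n > target_len:
        raise ValueError("clip is longer than target_len")
    frame_size = len(frames[0])
    # Build (target_len - n) independent zero frames, then append the real ones.
    padding = [[0.0] * frame_size for _ in range(target_len - n)]
    return padding + [list(f) for f in frames]


# Image case from the hunk: one frame padded to a 17-frame clip.
image = [[0.5, 0.25, 0.75]]
clip = zero_pad_front(image, 17)
```

The same helper covers the mixed-length case by passing a larger `target_len`; the exact padded length used in stage 3 should be taken from the training config rather than from this sketch.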