Update report_03.md (#452)

fix typo
2026-04-10 21:01:26 +02:00 · 2024-06-18 15:01:38 +08:00 · 2c7e740f16
commit 2c7e740f16
parent 5913013387
1 changed file with 1 addition and 1 deletion
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -40,7 +40,7 @@ We initialize the 2D VAE with [SDXL's VAE](https://huggingface.co/stabilityai/sd

 Our training involves three stages:

-1. For the first 380k steps, we train on 8 GPUs and freese the 2D VAE. The training objective includes the reconstruction of the compressed features from 2D VAE (pink one in the figure) and also add a loss to make features from the 3D VAE similar to the features from the 2D VAE (pink one and green one, called identity loss). We find the latter loss can quickly make the whole VAE achieve a good performance for image and much faster to converge in the next stage.
+1. For the first 380k steps, we train on 8 GPUs and freeze the 2D VAE. The training objective includes the reconstruction of the compressed features from 2D VAE (pink one in the figure) and also add a loss to make features from the 3D VAE similar to the features from the 2D VAE (pink one and green one, called identity loss). We find the latter loss can quickly make the whole VAE achieve a good performance for image and much faster to converge in the next stage.
 2. For the next 260k steps, We remove the identity loss and just learn the 3D VAE.
 3. For the last 540k steps , since we find only reconstruction 2D VAE's feature cannot lead to further improvement, we remove the loss and train the whole VAE to reconstruct the original videos. This stage is trained on on 24 GPUs.
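
The three-stage schedule in the hunk above can be sketched as a small lookup table. This is only an illustration of the step counts and loss terms the report states, not the project's training code: the names `Stage`, `STAGES`, `stage_at`, and the loss labels are hypothetical, steps are assumed to be counted from 0, and the 2D VAE is assumed to stay frozen in stage 2 (the report only says "just learn the 3D VAE").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str            # hypothetical label, not from the report
    steps: int           # number of optimizer steps in this stage
    freeze_2d_vae: bool  # whether the 2D VAE weights are frozen
    losses: tuple        # hypothetical labels for the active loss terms


# Step counts and loss terms follow the report; everything else is assumed.
STAGES = (
    Stage("stage1", 380_000, True, ("feature_reconstruction", "identity")),
    Stage("stage2", 260_000, True, ("feature_reconstruction",)),  # frozen 2D VAE is an assumption
    Stage("stage3", 540_000, False, ("video_reconstruction",)),
)

TOTAL_STEPS = sum(stage.steps for stage in STAGES)


def stage_at(step: int) -> Stage:
    """Return the stage that a 0-indexed global step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for stage in STAGES:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError(f"step {step} is past the end of training ({TOTAL_STEPS} steps)")
```

For example, `stage_at(380_000)` returns the second stage, where the identity loss is no longer in `losses`, and `stage_at(640_000)` returns the third, where only the video reconstruction term remains.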