diff --git a/README.md b/README.md
index 42201f6..2144526 100644
--- a/README.md
+++ b/README.md
@@ -89,6 +89,10 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
 - ✅ Trained our 3D-VAE for temporal dimension compression.
 - 📍 **Open-Sora 1.1** released. Model weights are available [here](#model-weights). It is trained on **0s~15s, 144p to 720p, various aspect ratios** videos. See our **[report 1.1](/docs/report_02.md)** for more discussions.
 - 🔧 **Data processing pipeline v1.1** is released. An automatic [processing pipeline](#data-processing) from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering(aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
+
+<details>
+<summary>View more</summary>
+
 - ✅ Improved ST-DiT architecture includes rope positional encoding, qk norm, longer text length, etc.
 - ✅ Support training with any resolution, aspect ratio, and duration (including images).
 - ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
@@ -103,10 +107,6 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
   including [downloading](tools/datasets/README.md), [video cutting](tools/scene_cut/README.md), and [captioning](tools/caption/README.md) tools. Our data collection plan can be found at [datasets.md](docs/datasets.md).
-
-<details>
-<summary>View more</summary>
-
 - ✅ We find VQ-VAE from [VideoGPT](https://wilson1yan.github.io/videogpt/index.html) has a low quality and thus adopt a better VAE from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original). We also find patching in the time dimension deteriorates the quality. See our **[report](docs/report_01.md)** for more discussions.
@@ -154,7 +154,11 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
 Other useful documents and links are listed below.
-- Report: [report 1.2](docs/report_03.md), [report 1.1](docs/report_02.md), [report 1.0](docs/report_01.md), [acceleration.md](docs/acceleration.md)
+- Report: each version is trained separately from an image base (not continuously trained), and each newer version incorporates the techniques of the previous one.
+  - [report 1.2](docs/report_03.md): rectified flow, 3D-VAE, score conditioning, evaluation, etc.
+  - [report 1.1](docs/report_02.md): multi-resolution/length/aspect-ratio, image/video conditioning/editing, data preprocessing, etc.
+  - [report 1.0](docs/report_01.md): architecture, captioning, etc.
+  - [acceleration.md](docs/acceleration.md)
 - Repo structure: [structure.md](docs/structure.md)
 - Config file explanation: [config.md](docs/config.md)
 - Useful commands: [commands.md](docs/commands.md)
@@ -494,7 +498,7 @@ For training other models and advanced usage, see [here](docs/commands.md) for m
 We support evaluation based on:
 - Validation loss
-- VBench score
+- [VBench](https://github.com/Vchitect/VBench/tree/master) score
 - VBench-i2v score
 - Batch generation for human evaluation

diff --git a/docs/report_03.md b/docs/report_03.md
index cb9677c..02c55b7 100644
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -63,7 +63,7 @@ Lastest diffusion model like Stable Diffusion 3 adopts the [rectified flow](http
 For the resolution-aware timestep sampling, we should use more noise for images with larger resolution.
 We extend this idea to video generation and use more noise for videos with longer length.
-Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model is trained with DDPM and SDXL VAE, also a much higher resolution. We find finetuning on a small dataset can easily adapt the model for our video generation setting. The adaptation process is as follows, all training is done on 8 GPUs:
+Open-Sora 1.2 starts from the [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma) checkpoint. Note that this model is trained with DDPM and the SDXL VAE, and at a much higher resolution. We find that finetuning on a small dataset can easily adapt the model to our video generation setting. The adaptation process is as follows (adapting the diffusion model is quite fast and straightforward); all training is done on 8 GPUs:
 1. Multi-resolution image generation ability: we train the model to generate different resolution ranging from 144p to 2K for 20k steps.
 2. QK-norm: we add the QK-norm to the model and train for 18k steps.
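
The resolution- and length-aware sampling touched by the report_03.md hunk ("more noise for larger resolution / longer videos") can be sketched as an SD3-style timestep shift. The scaling rule below — shift growing with the square root of relative pixel and frame count — is an illustrative assumption, not Open-Sora's verbatim schedule:

```python
import math

import torch


def shift_timesteps(t: torch.Tensor, height: int, width: int, num_frames: int,
                    base_pixels: int = 256 * 256, base_frames: int = 1) -> torch.Tensor:
    """Shift rectified-flow timesteps t in (0, 1) toward the noise end for
    larger resolutions and longer videos.

    The scaling rule (shift ~ sqrt of relative pixel and frame count) is an
    assumption for illustration; the actual Open-Sora schedule may differ.
    """
    shift = math.sqrt((height * width / base_pixels) * (num_frames / base_frames))
    # SD3-style shift: t' = s*t / (1 + (s-1)*t). Monotone, keeps t' in (0, 1),
    # and moves mass toward t' = 1 (pure noise) when s > 1.
    return shift * t / (1 + (shift - 1) * t)
```

Under this rule a 720p, 32-frame clip is trained at noisier timesteps than a 144p single image for the same base `t`, matching the "more noise for longer videos" intuition in the report.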
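
The "QK-norm" added in step 2 normalizes queries and keys before the attention logits are formed, which bounds the logits and avoids the softmax saturation that destabilizes large-scale or half-precision training. A minimal sketch of one common variant (per-head L2 normalization with a fixed temperature; Open-Sora's own module and its learnable scale may differ):

```python
import torch
import torch.nn.functional as F


def qk_norm_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      temperature: float = 10.0) -> torch.Tensor:
    """Scaled dot-product attention with QK-norm.

    q, k, v: (batch, heads, seq, head_dim). The fixed temperature is an
    illustrative stand-in for the learnable scale used in practice.
    """
    q = F.normalize(q, dim=-1)  # unit-norm queries per head
    k = F.normalize(k, dim=-1)  # unit-norm keys per head
    logits = (q @ k.transpose(-2, -1)) * temperature  # bounded in [-T, T]
    return torch.softmax(logits, dim=-1) @ v
```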