mirror of https://github.com/hpcaitech/Open-Sora.git
synced 2026-04-11 13:14:44 +02:00

update docs

This commit is contained in:
parent e763669eab
commit cf95f41002

README.md (32 lines changed)
@@ -25,7 +25,7 @@ With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the

## 📰 News

-* **[2024.04.22]** 🔥 We release **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video and video-to-video** generation. In addition, a full video processing pipeline is released. [[report]](/docs/report_02.md)
+* **[2024.04.22]** 🔥 We release **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video, video-to-video, and infinite-time** generation. In addition, a full video processing pipeline is released. [[report]](/docs/report_02.md)
* **[2024.03.18]** We release **Open-Sora 1.0**, a fully open-source project for video generation.
  Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
  <a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a>
@@ -117,6 +117,7 @@ More samples are available in our [gallery](https://hpcaitech.github.io/Open-Sor

* [Inference](#inference)
* [Data Processing](#data-processing)
* [Training](#training)
+* [Evaluation](#evaluation)
* [Contribution](#contribution)
* [Acknowledgement](#acknowledgement)
@@ -128,6 +129,7 @@ Other useful documents and links are listed below.

* Useful commands: [commands.md](docs/commands.md)
* Data processing pipeline and dataset: [datasets.md](docs/datasets.md)
* Each data processing tool's README: [dataset conventions and management](/tools/datasets/README.md), [scene cutting](/tools/scene_cut/README.md), [scoring](/tools/scoring/README.md), [caption](/tools/caption/README.md)
+* Evaluation: [eval](/eval/README.md)
* Gallery: [gallery](https://hpcaitech.github.io/Open-Sora/)

## Installation
@@ -163,9 +165,6 @@ cd Open-Sora

pip install -v .
```

After installation, we suggest reading [structure.md](docs/structure.md) to learn the project structure and how to use the config files.

## Model Weights

### Open-Sora 1.1 Model Weights
@@ -210,7 +209,15 @@ This will launch a Gradio application on your localhost. If you want to know mor

### Open-Sora 1.1 Command Line Inference

-TBD
+Since Open-Sora 1.1 supports inference with dynamic input size, you can pass the input size as an argument.
+
+```bash
+# video sampling
+python scripts/inference.py configs/opensora-v1-1/inference/sample.py \
+    --ckpt-path CKPT_PATH --prompt "A beautiful sunset over the city" --num-frames 32 --image-size 480 854
+```
+
+See [here](docs/commands.md#inference-with-open-sora-11) for more instructions.

### Open-Sora 1.0 Command Line Inference
@@ -280,6 +287,17 @@ python -m tools.datasets.csvutil ~/dataset_ready.csv --fmin 48

+### Open-Sora 1.1 Training
+
+Once you have prepared the data in a `csv` file, run the following commands to launch training on a single node.
+
+```bash
+# one node
+torchrun --standalone --nproc_per_node 8 scripts/train.py \
+    configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
+# multiple nodes
+colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
+    configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
+```
+
### Open-Sora 1.0 Training

<details>
@@ -306,6 +324,10 @@ For training other models and advanced usage, see [here](docs/commands.md) for m

</details>

+## Evaluation
+
+See [here](eval/README.md) for more instructions.
+
## Contribution

Thanks goes to these wonderful contributors ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

docs/commands.md
@@ -1,9 +1,63 @@

# Commands

- [Inference](#inference)
  - [Inference with Open-Sora 1.1](#inference-with-open-sora-11)
  - [Inference with DiT pretrained on ImageNet](#inference-with-dit-pretrained-on-imagenet)
  - [Inference with Latte pretrained on UCF101](#inference-with-latte-pretrained-on-ucf101)
  - [Inference with PixArt-α pretrained weights](#inference-with-pixart-α-pretrained-weights)
  - [Inference with checkpoints saved during training](#inference-with-checkpoints-saved-during-training)
  - [Inference Hyperparameters](#inference-hyperparameters)
- [Training](#training)
  - [Training Hyperparameters](#training-hyperparameters)
  - [Search batch size for buckets](#search-batch-size-for-buckets)

## Inference

You can modify the corresponding config files to change the inference settings. See more details [here](/docs/structure.md#inference-config-demos).

### Inference with Open-Sora 1.1

Since Open-Sora 1.1 supports inference with dynamic input size, you can pass the input size as an argument.

```bash
# image sampling with prompt path
python scripts/inference.py configs/opensora-v1-1/inference/sample.py \
    --ckpt-path CKPT_PATH --prompt-path assets/texts/t2i_samples.txt --num-frames 1 --image-size 1024 1024

# image sampling with prompt
python scripts/inference.py configs/opensora-v1-1/inference/sample.py \
    --ckpt-path CKPT_PATH --prompt "A beautiful sunset over the city" --num-frames 1 --image-size 1024 1024

# video sampling
python scripts/inference.py configs/opensora-v1-1/inference/sample.py \
    --ckpt-path CKPT_PATH --prompt "A beautiful sunset over the city" --num-frames 16 --image-size 480 854
```

You can adjust `--num-frames` and `--image-size` to generate different results. We recommend using the same image size as the training resolution, which is defined in [aspect.py](/opensora/datasets/aspect.py). Some examples are shown below.

- 240p
  - 16:9 240x426
  - 3:4 276x368
  - 1:1 320x320
- 480p
  - 16:9 480x854
  - 3:4 554x738
  - 1:1 640x640
- 720p
  - 16:9 720x1280
  - 3:4 832x1110
  - 1:1 960x960
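For scripting, the recommended sizes above can be captured in a small lookup table (values copied from the list above; the full tables live in [aspect.py](/opensora/datasets/aspect.py), and this helper is illustrative only):

```python
# (height, width) per resolution and aspect ratio, copied from the list above
RECOMMENDED_SIZES = {
    "240p": {"16:9": (240, 426), "3:4": (276, 368), "1:1": (320, 320)},
    "480p": {"16:9": (480, 854), "3:4": (554, 738), "1:1": (640, 640)},
    "720p": {"16:9": (720, 1280), "3:4": (832, 1110), "1:1": (960, 960)},
}

def image_size(resolution: str, ratio: str) -> tuple:
    """Return the (height, width) to pass to --image-size."""
    return RECOMMENDED_SIZES[resolution][ratio]
```

For example, `image_size("480p", "16:9")` gives the `480 854` used in the video-sampling command above.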

`inference-long.py` is compatible with `inference.py` and supports advanced features.

```bash
# long video generation
# image condition
# video extending
# video connecting
# video editing
```

### Inference with DiT pretrained on ImageNet

The following command automatically downloads the pretrained weights on ImageNet and runs inference.
@@ -71,19 +125,6 @@ vae = dict(
)
```

-### Evaluation
-
-Use the following commands to generate predefined samples.
-
-```bash
-# image
-bash scripts/misc/sample.sh /path/to/ckpt --image
-# video
-bash scripts/misc/sample.sh /path/to/ckpt --video
-# video edit
-bash scripts/misc/sample.sh /path/to/ckpt --video-edit
-```

## Training

To resume training, run the following command. ``--load`` differs from ``--ckpt-path`` in that it also loads the optimizer and dataloader states.

docs/config.md (280 lines changed)
@@ -1,40 +1,45 @@

# Config Guide

-## Inference config demos
+- [Inference Config](#inference-config)
+- [Advanced Inference Config](#advanced-inference-config)
+- [Inference Args](#inference-args)
+- [Training Config](#training-config)
+- [Training Args](#training-args)
+- [Training Bucket Configs](#training-bucket-configs)

-To change the inference settings, you can directly modify the corresponding config file. Or you can pass arguments to overwrite the config file ([config_utils.py](/opensora/utils/config_utils.py)). To change sampling prompts, you should modify the `.txt` file passed to the `--prompt_path` argument.
+Our config files follow [MMEngine](https://github.com/open-mmlab/mmengine). MMEngine reads the config file (a `.py` file) and parses it into a dictionary-like object. We expose some fields of the config file as command line arguments (defined in [opensora/utils/config_utils.py](/opensora/utils/config_utils.py)). To change the inference settings, you can directly modify the corresponding config file, or pass arguments to overwrite it.

```plaintext
--prompt_path ./assets/texts/t2v_samples.txt -> prompt_path
--ckpt-path ./path/to/your/ckpt.pth -> model["from_pretrained"]
```
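Conceptually, the override works like the following sketch (a simplification for illustration only; the real logic in [config_utils.py](/opensora/utils/config_utils.py) is built on MMEngine's `Config`, and the special-casing of `ckpt_path` here mirrors the mapping shown above):

```python
def merge_args_into_config(config: dict, args: dict) -> dict:
    """Overwrite config fields with non-None CLI arguments.

    `ckpt_path` is special-cased onto model["from_pretrained"], as in the
    mapping above; every other argument overwrites the same-named field.
    """
    for key, value in args.items():
        if value is None:  # unset CLI arguments leave the config untouched
            continue
        if key == "ckpt_path":
            config.setdefault("model", {})["from_pretrained"] = value
        else:
            config[key] = value
    return config
```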

## Inference Config

The explanation of each field is provided below.

```python
# Define sampling size
-num_frames = 64  # number of frames
-fps = 24 // 2  # frames per second (divided by 2 for frame_interval=2)
-image_size = (512, 512)  # image size (height, width)
+num_frames = 64  # number of frames, 1 means image
+fps = 24  # frames per second (condition for generation)
+frame_interval = 3  # output video will have fps/frame_interval frames per second
+image_size = (240, 426)  # image size (height, width)

# Define model
model = dict(
-    type="STDiT-XL/2",  # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
-    space_scale=1.0,  # (Optional) Space positional encoding scale (new height / old height)
-    time_scale=2 / 3,  # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
+    type="STDiT2-XL/2",  # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
+    from_pretrained="PRETRAINED_MODEL",  # (Optional) Load from pretrained model
+    no_temporal_pos_emb=True,  # (Optional) Disable temporal positional encoding (for image)
+    input_sq_size=512,  # Base spatial position embedding size
+    qk_norm=True,  # Normalize query and key in attention
    enable_flashattn=True,  # (Optional) Speed up training and inference with flash attention
+    # Turn enable_flashattn to False if you skip flashattn installation
    enable_layernorm_kernel=True,  # (Optional) Speed up training and inference with fused kernel
+    # Turn enable_layernorm_kernel to False if you skip apex installation
)
vae = dict(
    type="VideoAutoencoderKL",  # Select VAE type
    from_pretrained="stabilityai/sd-vae-ft-ema",  # Load from pretrained VAE
-    micro_batch_size=128,  # VAE with micro batch size to save memory
+    micro_batch_size=4,  # VAE with micro batch size to save memory
)
text_encoder = dict(
    type="t5",  # Select text encoder type (t5, clip)
    from_pretrained="DeepFloyd/t5-v1_1-xxl",  # Load from pretrained text encoder
-    model_max_length=120,  # Maximum length of input text
+    model_max_length=200,  # Maximum length of input text
)
scheduler = dict(
    type="iddpm",  # Select scheduler type (iddpm, dpm-solver)

@@ -42,101 +47,186 @@ scheduler = dict(
    cfg_scale=7.0,  # hyper-parameter for classifier-free diffusion
    cfg_channel=3,  # how many channels to use for classifier-free diffusion, if None, use all channels
)
-dtype = "fp16"  # Computation type (fp16, fp32, bf16)
+dtype = "bf16"  # Computation type (fp16, fp32, bf16)

# Condition
prompt_path = "./assets/texts/t2v_samples.txt"  # path to prompt file
prompt = None  # prompt has higher priority than prompt_path

# Other settings
batch_size = 1  # batch size
seed = 42  # random seed
save_dir = "./samples"  # path to save samples
```
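A quick sanity check of how the sampling-size fields interact: the model generates `num_frames` frames, and the saved video plays at `fps / frame_interval` frames per second (an illustrative helper, not project code):

```python
def clip_duration_seconds(num_frames: int, fps: int, frame_interval: int) -> float:
    """Duration of the saved clip: num_frames frames played back at
    an output rate of fps / frame_interval frames per second."""
    return num_frames / (fps / frame_interval)

# With the config above: 64 frames at 24/3 = 8 fps -> an 8-second clip.
```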

## Training config demos

```python
# Define sampling size
num_frames = 64
frame_interval = 2  # sample every 2 frames
image_size = (512, 512)

# Define dataset
root = None  # root path to the dataset
data_path = "CSV_PATH"  # path to the csv file
use_image_transform = False  # True if training on images
num_workers = 4  # number of workers for dataloader

# Define acceleration
dtype = "bf16"  # Computation type (fp16, bf16)
grad_checkpoint = True  # Use gradient checkpointing
plugin = "zero2"  # Plugin for distributed training (zero2, zero2-seq)
sp_size = 1  # Sequence parallelism size (1 for no sequence parallelism)

# Define model
model = dict(
    type="STDiT-XL/2",
    space_scale=1.0,
    time_scale=2 / 3,
    from_pretrained="YOUR_PRETRAINED_MODEL",
    enable_flashattn=True,  # Enable flash attention
    enable_layernorm_kernel=True,  # Enable layernorm kernel
)
vae = dict(
    type="VideoAutoencoderKL",
    from_pretrained="stabilityai/sd-vae-ft-ema",
    micro_batch_size=128,
)
text_encoder = dict(
    type="t5",
    from_pretrained="DeepFloyd/t5-v1_1-xxl",
    model_max_length=120,
    shardformer=True,  # Enable shardformer for T5 acceleration
)
scheduler = dict(
    type="iddpm",
    timestep_respacing="",  # Default 1000 timesteps
)

# Others
seed = 42
outputs = "outputs"  # path to save checkpoints
wandb = False  # Use wandb for logging

epochs = 1000  # number of epochs (just large enough, kill when satisfied)
log_every = 10
ckpt_every = 250
load = None  # path to resume training

batch_size = 4
lr = 2e-5
grad_clip = 1.0  # gradient clipping
```

-## Inference-long specific arguments
+## Advanced Inference Config

The [`inference-long.py`](/scripts/inference-long.py) script is used to generate long videos, and it also provides all functions of the [`inference.py`](/scripts/inference.py) script. The following arguments are specific to the `inference-long.py` script.

```python
loop = 10
condition_frame_length = 4
-reference_path = ["one.png;two.mp4"]
-mask_strategy = ["0,0,0,1,0;0,0,0,1,-1"]
+reference_path = [
+    "https://cdn.openai.com/tmp/s/interp/d0.mp4",
+    None,
+    "assets/images/condition/wave.png",
+]
+mask_strategy = [
+    "0,0,0,0,8,0.3",
+    None,
+    "0,0,0,0,1;0,0,0,-1,1",
+]
```

-To generate a long video of any time, our strategy is to generate a video with a fixed length first, and then use the last `condition_frame_length` number of frames for the next video generation. This will loop for `loop` times. Thus, the total length of the video is `loop * (num_frames - condition_frame_length) + condition_frame_length`.
-To condition the generation on images or videos, we introduce the `mask_strategy`. It is 5 number tuples separated by `;`. Each tuple indicates an insertion of the condition image or video into the target generation. The meaning of each number is:
-
-- First number: the index of the condition image or video in the `reference_path`. (0 means one.png, and 1 means two.mp4)
-- Second number: the loop index of the condition image or video. (0 means the first loop, 1 means the second loop, etc.)
-- Third number: the start frame of the condition image or video. (0 means the first frame, and images only have one frame)
-- Fourth number: the number of frames to insert. (1 means insert one frame, and images only have one frame)
-- Fifth number: the location to insert. (0 means insert at the beginning, 1 means insert at the end, and -1 means insert at the end of the video)

+The following figure provides an illustration of the `mask_strategy`:
+
+![mask strategy loop](/assets/readme/report_mask_config.png)
+
+To generate a long video of infinite time, our strategy is to generate a video with a fixed length first, and then use the last `condition_frame_length` number of frames for the next video generation. This will loop for `loop` times. Thus, the total length of the video is `loop * (num_frames - condition_frame_length) + condition_frame_length`.
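As a sanity check, the length formula above can be written out directly (an illustrative helper, not part of the codebase):

```python
def total_video_length(loop: int, num_frames: int, condition_frame_length: int) -> int:
    """Total frames after `loop` rounds, where every round after the first
    reuses the last `condition_frame_length` frames as its condition."""
    return loop * (num_frames - condition_frame_length) + condition_frame_length

# e.g. loop=10, num_frames=16, condition_frame_length=4 -> 10*(16-4)+4 = 124 frames
```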

-Thus, "0,0,0,1,-1" means insert the first frame of one.png at the end of the video at the first loop.
-
-## Bucket Configs

+To condition the generation on images or videos, we introduce the `mask_strategy`. It consists of 6-number tuples separated by `;`. Each tuple indicates an insertion of a condition image or video into the target generation. The meaning of each number is:
+
+- **First number**: the loop index of the condition image or video. (0 means the first loop, 1 means the second loop, etc.)
+- **Second number**: the index of the condition image or video in the `reference_path`.
+- **Third number**: the start frame of the condition image or video. (0 means the first frame, and images only have one frame)
+- **Fourth number**: the location to insert. (0 means insert at the beginning, 1 means insert at the end, and -1 means insert at the end of the video)
+- **Fifth number**: the number of frames to insert. (1 means insert one frame, and images only have one frame)
+- **Sixth number**: the edit rate of the condition image or video. (0 means no edit, 1 means full edit)
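For illustration, such a strategy string can be parsed with a small helper like the sketch below (hypothetical code, not the project's actual parser; padding missing trailing fields with zeros is an assumption):

```python
from typing import List, NamedTuple

class MaskTuple(NamedTuple):
    loop_index: int       # which loop this condition applies to
    ref_index: int        # index into reference_path
    ref_start_frame: int  # start frame within the reference
    target_location: int  # where to insert (0 = beginning, -1 = end of video)
    num_frames: int       # how many frames to insert (1 for images)
    edit_rate: float      # 0 = no edit, 1 = full edit

def parse_mask_strategy(strategy: str) -> List[MaskTuple]:
    tuples = []
    for part in strategy.split(";"):
        fields = part.split(",")
        fields += ["0"] * (6 - len(fields))  # assumed default for omitted fields
        tuples.append(MaskTuple(*map(int, fields[:5]), float(fields[5])))
    return tuples
```

For example, `"0,0,0,0,8,0.3"` parses to a single tuple inserting 8 frames at the start of the first loop with an edit rate of 0.3.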

To facilitate usage, we also accept passing the reference path and mask strategy as a JSON object appended to the prompt. For example:

```plaintext
'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff\'s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff\'s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.{"reference_path": "assets/images/condition/cliff.png", "mask_strategy": "0"}'
```
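A sketch of how such a prompt could be split into text and options (a hypothetical helper, not the project's implementation; it assumes the JSON object, when present, sits at the very end of the prompt):

```python
import json

def split_prompt(prompt: str):
    """Split 'some text{"reference_path": ..., "mask_strategy": ...}' into
    (text, options); returns (prompt, {}) when there is no JSON suffix."""
    start = prompt.rfind("{")
    if start != -1 and prompt.endswith("}"):
        try:
            return prompt[:start], json.loads(prompt[start:])
        except json.JSONDecodeError:
            pass  # the braces were part of the prompt text itself
    return prompt, {}
```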

## Inference Args

You can use `python scripts/inference.py --help` to see the following arguments:

- `--seed`: random seed
- `--ckpt-path`: path to the checkpoint (`model["from_pretrained"]`)
- `--batch-size`: batch size
- `--save-dir`: path to save samples
- `--sample-name`: if None, samples are named `sample_{index}.mp4/png`; otherwise, they are named `{sample_name}_{index}.mp4/png`
- `--start-index`: start index of the sample
- `--end-index`: end index of the sample
- `--num-sample`: number of samples to generate for each prompt. The samples are suffixed with `-0`, `-1`, `-2`, etc.
- `--prompt-as-path`: if True, use the prompt as the name for saving samples
- `--prompt-path`: path to the prompt file
- `--prompt`: prompt string list
- `--num-frames`: number of frames
- `--fps`: frames per second
- `--image-size`: image size
- `--num-sampling-steps`: number of sampling steps (`scheduler["num_sampling_steps"]`)
- `--cfg-scale`: hyper-parameter for classifier-free diffusion (`scheduler["cfg_scale"]`)
- `--loop`: loop for long video generation
- `--condition-frame-length`: condition frame length for long video generation
- `--reference-path`: reference path for long video generation
- `--mask-strategy`: mask strategy for long video generation

Example commands for inference can be found in [commands.md](/docs/commands.md).

## Training Config

```python
# Define dataset
dataset = dict(
    type="VariableVideoTextDataset",  # Select dataset type
    # VideoTextDataset for Open-Sora 1.0, VariableVideoTextDataset for Open-Sora 1.1
    data_path=None,  # Path to the dataset
    num_frames=None,  # Number of frames, set None since we support dynamic training
    frame_interval=3,  # Frame interval
    image_size=(None, None),  # Image size, set None since we support dynamic training
    transform_name="resize_crop",  # Transform name
)
# For bucket config usage, see the next section
bucket_config = {
    "144p": {1: (1.0, 48), 16: (1.0, 17), 32: (1.0, 9), 64: (1.0, 4), 128: (1.0, 1)},
    "256": {1: (0.8, 254), 16: (0.5, 17), 32: (0.5, 9), 64: (0.5, 4), 128: (0.5, 1)},
    "240p": {1: (0.1, 20), 16: (0.9, 17), 32: (0.8, 9), 64: (0.8, 4), 128: (0.8, 2)},
    "512": {1: (0.5, 86), 16: (0.2, 4), 32: (0.2, 2), 64: (0.2, 1), 128: (0.0, None)},
    "480p": {1: (0.4, 54), 16: (0.4, 4), 32: (0.0, None)},
    "720p": {1: (0.1, 20), 16: (0.1, 2), 32: (0.0, None)},
    "1024": {1: (0.3, 20)},
    "1080p": {1: (0.4, 8)},
}
# Mask ratios in training
mask_ratios = {
    "mask_no": 0.75,  # 75% no mask
    "mask_quarter_random": 0.025,  # 2.5% random mask with 1 frame to 1/4 of the frames
    "mask_quarter_head": 0.025,  # 2.5% mask at the beginning with 1 frame to 1/4 of the frames
    "mask_quarter_tail": 0.025,  # 2.5% mask at the end with 1 frame to 1/4 of the frames
    "mask_quarter_head_tail": 0.05,  # 5% mask at the beginning and end with 1 frame to 1/4 of the frames
    "mask_image_random": 0.025,  # 2.5% random mask with 1 image to 1/4 of the images
    "mask_image_head": 0.025,  # 2.5% mask at the beginning with 1 image to 1/4 of the images
    "mask_image_tail": 0.025,  # 2.5% mask at the end with 1 image to 1/4 of the images
    "mask_image_head_tail": 0.05,  # 5% mask at the beginning and end with 1 image to 1/4 of the images
}

# Define acceleration
num_workers = 8  # Number of workers for dataloader
num_bucket_build_workers = 16  # Number of workers for bucket building
dtype = "bf16"  # Computation type (fp16, fp32, bf16)
grad_checkpoint = True  # Use gradient checkpointing
plugin = "zero2"  # Plugin for training
sp_size = 1  # Sequence parallel size

# Define model
model = dict(
    type="STDiT2-XL/2",  # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
    from_pretrained=None,  # Load from pretrained model
    input_sq_size=512,  # Base spatial position embedding size
    qk_norm=True,  # Normalize query and key in attention
    enable_flashattn=True,  # (Optional) Speed up training and inference with flash attention
    enable_layernorm_kernel=True,  # (Optional) Speed up training and inference with fused kernel
)
vae = dict(
    type="VideoAutoencoderKL",  # Select VAE type
    from_pretrained="stabilityai/sd-vae-ft-ema",
    micro_batch_size=4,  # VAE with micro batch size to save memory
    local_files_only=True,  # Load from local files only (the first time should be False)
)
text_encoder = dict(
    type="t5",  # Select text encoder type (t5, clip)
    from_pretrained="DeepFloyd/t5-v1_1-xxl",
    model_max_length=200,  # Maximum length of input text
    shardformer=True,  # Use shardformer
    local_files_only=True,  # Load from local files only (the first time should be False)
)
scheduler = dict(
    type="iddpm",  # Select scheduler type (iddpm, iddpm-speed)
    timestep_respacing="",
)

# Others
seed = 42  # random seed
outputs = "outputs"  # path to save outputs
wandb = False  # Use wandb or not

epochs = 1000  # Number of epochs (set a large number and kill the process when you want to stop)
log_every = 10
ckpt_every = 500
load = None

batch_size = None
lr = 2e-5
grad_clip = 1.0
```

## Training Args

- `--seed`: random seed
- `--ckpt-path`: path to the checkpoint (`model["from_pretrained"]`)
- `--batch-size`: batch size
- `--wandb`: use wandb or not
- `--load`: path to the checkpoint to load
- `--data-path`: path to the dataset (`dataset["data_path"]`)

See [commands.md](/docs/commands.md) for example commands.

## Training Bucket Configs

We support multi-resolution/aspect-ratio/num_frames training with buckets. To enable dynamic training (for STDiT2), use the `VariableVideoText` dataset, and set the `bucket_config` in the config. An example is:

```python
bucket_config = {
@@ -154,6 +244,8 @@ This looks a bit difficult to understand at the first glance. Let's understand t

### Three-level bucket

![bucket](/assets/readme/report_bucket.png)

We design a three-level bucket: `(resolution, num_frames, aspect_ratios)`. The resolutions and aspect ratios are predefined in [aspect.py](/opensora/datasets/aspect.py). Commonly used resolutions (e.g., 240p, 1080p) are supported, and the name represents the number of pixels (e.g., 240p is nominally 240x426; however, we define "240p" to represent any size with HxW of approximately 240x426 = 102240 pixels). The aspect ratios are defined for each resolution, so you do not need to define them in the `bucket_config`.

The `num_frames` is the number of frames in each sample, with `num_frames=1` reserved for images. If `frame_interval` is not 1, a bucket with `num_frames=k` contains videos with `k * frame_interval` frames, except for images. Only a video with more frames than `num_frames` and more pixels than `resolution` may be put into the bucket.
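The eligibility rule described above can be sketched as follows (an illustration of the text, not the project's actual bucket code):

```python
def fits_bucket(video_h, video_w, video_frames, bucket_pixels, bucket_num_frames,
                frame_interval=1):
    """A video qualifies for a (resolution, num_frames) bucket when it has at
    least as many pixels as the bucket resolution and enough frames:
    num_frames * frame_interval, except for image buckets (num_frames=1)."""
    if bucket_num_frames == 1:  # image bucket
        needed_frames = 1
    else:
        needed_frames = bucket_num_frames * frame_interval
    return video_h * video_w >= bucket_pixels and video_frames >= needed_frames

# "240p" represents any size with about 240 * 426 = 102240 pixels
```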

@@ -181,7 +273,7 @@ bucket_config = {
}
```

-If you want to train a model supporting different resolutions of images, you can use the following config:
+If you want to train a model supporting different resolutions of images, you can use the following config (example: [image.py](/configs/opensora-v1-1/train/image.py)):

```python
bucket_config = {
@@ -205,7 +297,7 @@ bucket_config = {
}
```

-And similarly for videos:
+And similarly for videos (example: [video.py](/configs/opensora-v1-1/train/video.py)):

```python
bucket_config = {

docs/report_02.md

@@ -23,7 +23,7 @@ We made the following modifications to the original ST-DiT for better training s

- **[Rope embedding](https://arxiv.org/abs/2104.09864) for temporal attention**: Following LLMs' best practice, we change the sinusoidal positional encoding to rope embedding for temporal attention, since it is also a sequence prediction task.
- **AdaIN and LayerNorm for temporal attention**: we wrap the temporal attention with AdaIN and LayerNorm, as for the spatial attention, to stabilize training.
- **[QK-normalization](https://arxiv.org/abs/2302.05442) with [RMSNorm](https://arxiv.org/abs/1910.07467)**: Following [SD3](https://arxiv.org/pdf/2403.03206.pdf), we apply QK-normalization to all attention layers for better training stability in half precision.
-- **Dynamic input size support and video information condition**: To support multi-resolution, aspect-ratio, and fps training, we make ST-DiT-2 accept any input size. Extending [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha)'s idea, we condition on the video's height, width, aspect ratio, frame length, and fps.
+- **Dynamic input size support and video information condition**: To support multi-resolution, aspect-ratio, and fps training, we make ST-DiT-2 accept any input size and automatically scale positional embeddings. Extending [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha)'s idea, we condition on the video's height, width, aspect ratio, frame length, and fps.
- **Extending T5 tokens from 120 to 200**: our captions are usually less than 200 tokens, and we find the model handles longer text well.

## Support for Multi-time/resolution/aspect ratio/fps Training
@@ -52,6 +52,8 @@ As shown in the figure, a bucket is a triplet of `(resolution, num_frame, aspect

Considering that our computational resources are limited, we further introduce two attributes, `keep_prob` and `batch_size`, for each `(resolution, num_frame)` pair to reduce the computational cost and enable multi-stage training. Specifically, a high-resolution video is downsampled to a lower resolution with probability `1 - keep_prob`, and the batch size for each bucket is `batch_size`. In this way, we can control the number of samples in different buckets and balance the GPU load by searching for a good batch size for each bucket.
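The `keep_prob` mechanism can be sketched like this (illustrative only; the assumption that candidate buckets are tried from highest to lowest resolution follows the description above):

```python
import random

def assign_resolution(candidates, rng=random):
    """`candidates`: bucket resolutions the video qualifies for, ordered from
    highest to lowest, as (name, keep_prob) pairs. With probability
    1 - keep_prob the video is pushed down to a lower-resolution bucket."""
    for name, keep_prob in candidates[:-1]:
        if rng.random() < keep_prob:
            return name
    return candidates[-1][0]  # the lowest-resolution bucket always accepts
```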

+A detailed explanation of the bucket usage in training is available in [docs/config.md](/docs/config.md#training-bucket-configs).

## Masked DiT as Image/Video-to-Video Model

Transformers can be easily extended to support image-to-image and video-to-video tasks. We propose a mask strategy to support image and video conditioning. The mask strategy is shown in the figure below.
@@ -66,6 +68,8 @@ An illustration of masking strategy config to use in inference is given as follo

![mask strategy](/assets/readme/report_mask_config.png)

+A detailed explanation of the mask strategy usage is available in [docs/config.md](/docs/config.md#advanced-inference-config).

## Data Collection & Pipeline

As we found in Open-Sora 1.0 that the number and quality of data are crucial for training a good model, we worked hard on scaling the dataset. First, we created an automatic pipeline following [SVD](https://arxiv.org/abs/2311.15127), including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions.

eval/README.md

@@ -0,0 +1,35 @@

# Evaluation

## Human evaluation

To conduct human evaluation, we need to generate a variety of samples. We provide many prompts in `assets/texts`, and define some test settings covering different resolutions, durations, and aspect ratios in `eval/sample.sh`. To facilitate the use of multiple GPUs, we split the sampling tasks into several parts.

```bash
# image
bash eval/sample.sh /path/to/ckpt -1
# video (2a to 2f)
bash eval/sample.sh /path/to/ckpt -2a
# video edit
bash eval/sample.sh /path/to/ckpt -3
# launch 8 jobs at once (you must read the script to understand the details)
bash eval/launch.sh /path/to/ckpt
```

## VBench

[VBench](https://github.com/Vchitect/VBench) is a benchmark for short text-to-video generation. We provide a script for easily generating the samples required by VBench.

```bash
# 4a to 4h
bash eval/vbench.sh /path/to/ckpt -4a
# launch 8 jobs at once (you must read the script to understand the details)
bash eval/launch.sh /path/to/ckpt
```

After generation, install the VBench package according to their [instructions](https://github.com/Vchitect/VBench?tab=readme-ov-file#hammer-installation). Then, run `bash eval/vbench/vbench.sh` to evaluate the generated samples.

## VBench-i2v

[VBench-i2v](https://github.com/Vchitect/VBench/tree/master/vbench2_beta_i2v) is a benchmark for short image-to-video generation (beta version).

TBD

opensora/datasets/aspect.py

@@ -105,21 +105,23 @@ ASPECT_RATIO_720P = {

# S = 409920
ASPECT_RATIO_480P = {
-    "0.39": (400, 1026),
-    "0.42": (414, 986),
-    "0.48": (444, 925),
-    "0.50": (452, 904),
-    "0.52": (462, 888),
-    "0.56": (480, 854),  # base
-    "0.66": (520, 788),
-    "0.75": (554, 738),
-    "1.00": (640, 640),
-    "1.33": (738, 554),
-    "1.52": (790, 520),
-    "1.78": (854, 480),
-    "1.92": (888, 462),
-    "2.00": (906, 454),
-    "2.10": (928, 442),
+    "0.38": (294, 784),
+    "0.43": (314, 732),
+    "0.48": (332, 692),
+    "0.50": (340, 680),
+    "0.53": (350, 662),
+    "0.54": (352, 652),
+    "0.56": (360, 640),  # base
+    "0.62": (380, 608),
+    "0.67": (392, 588),
+    "0.75": (416, 554),
+    "1.00": (480, 480),
+    "1.33": (554, 416),
+    "1.50": (588, 392),
+    "1.78": (640, 360),
+    "1.89": (660, 350),
+    "2.00": (678, 340),
+    "2.08": (692, 332),
}

# S = 230400