a bunch of update

2026-04-11 21:42:26 +02:00 · 2024-04-24 09:23:24 +00:00 · 44e618bbbc
commit 44e618bbbc
parent 9e0b314dab
11 changed files with 9206 additions and 83 deletions
--- a/assets/texts/t2v_car.txt
+++ b/assets/texts/t2v_car.txt
@@ -0,0 +1 @@
+|0|A car driving on the in forest.|2|A car driving in the desert.|4|A car driving near the coast.|6|A car driving in the city.|8|A car driving near a mountain.|10|A car driving on the surface of a river.|12|A car driving on the surface of the earch.|14|A car driving in the universe.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4", "mask_strategy": "0,0,0,0,16,0.4"}
--- a/assets/texts/t2v_samples.txt
+++ b/assets/texts/t2v_samples.txt
@@ -1,5 +1,5 @@
 A soaring drone footage captures the majestic beauty of a coastal cliff, its red and yellow stratified rock faces rich in color and against the vibrant turquoise of the sea. Seabirds can be seen taking flight around the cliff's precipices. As the drone slowly moves from different angles, the changing sunlight casts shifting shadows that highlight the rugged textures of the cliff and the surrounding calm sea. The water gently laps at the rock base and the greenery that clings to the top of the cliff, and the scene gives a sense of peaceful isolation at the fringes of the ocean. The video captures the essence of pristine natural beauty untouched by human structures.
-The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.
+A majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.
 A vibrant scene of a snowy mountain landscape. The sky is filled with a multitude of colorful hot air balloons, each floating at different heights, creating a dynamic and lively atmosphere. The balloons are scattered across the sky, some closer to the viewer, others further away, adding depth to the scene.  Below, the mountainous terrain is blanketed in a thick layer of snow, with a few patches of bare earth visible here and there. The snow-covered mountains provide a stark contrast to the colorful balloons, enhancing the visual appeal of the scene.  In the foreground, a few cars can be seen driving along a winding road that cuts through the mountains. The cars are small compared to the vastness of the landscape, emphasizing the grandeur of the surroundings.  The overall style of the video is a mix of adventure and tranquility, with the hot air balloons adding a touch of whimsy to the otherwise serene mountain landscape. The video is likely shot during the day, as the lighting is bright and even, casting soft shadows on the snow-covered mountains.
 The vibrant beauty of a sunflower field. The sunflowers, with their bright yellow petals and dark brown centers, are in full bloom, creating a stunning contrast against the green leaves and stems. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. The sun is shining brightly, casting a warm glow on the flowers and highlighting their intricate details. The video is shot from a low angle, looking up at the sunflowers, which adds a sense of grandeur and awe to the scene. The sunflowers are the main focus of the video, with no other objects or people present. The video is a celebration of nature's beauty and the simple joy of a sunny day in the countryside.
 A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell, is the main focus of the video, swimming gracefully towards the right side of the frame. The coral reef, teeming with life, is visible in the background, providing a vibrant and colorful backdrop to the turtle's journey. Several small fish, darting around the turtle, add a sense of movement and dynamism to the scene. The video is shot from a slightly elevated angle, providing a comprehensive view of the turtle's surroundings. The overall style of the video is calm and peaceful, capturing the beauty and tranquility of the underwater world.
--- a/docs/report_02.md
+++ b/docs/report_02.md
@@ -5,7 +5,6 @@
 - [Masked DiT as Image/Video-to-Video Model](#masked-dit-as-imagevideo-to-video-model)
 - [Data Collection \& Pipeline](#data-collection--pipeline)
 - [Training Details](#training-details)
-- [Results and Evaluation](#results-and-evaluation)
 - [Limitation and Future Work](#limitation-and-future-work)

 In Open-Sora 1.1 release, we train a 700M models on 10M data (Open-Sora 1.0 trained on 400K data) with a better STDiT architecture. We implement the following features mentioned in [sora's report](https://openai.com/research/video-generation-models-as-world-simulators):
@@ -103,13 +102,11 @@ With limited computational resources, we have to carefully monitor the training

 To summarize, the training of Open-Sora 1.1 requires approximately **9 days** on 64 H800 GPUs.

-## Results and Evaluation
-
 ## Limitation and Future Work

 As we get one step closer to the replication of Sora, we find many limitations for the current model, and these limitations point to the future work.

-- **Generation Failure**: we fine many cases (especially when the total token number is large or the content is complex),  our model fails to generate the scene.
+- **Generation Failure**: we fine many cases (especially when the total token number is large or the content is complex),  our model fails to generate the scene. There may be a collapse in the temporal attention and we are working hard on it.
 - **Noisy generation and influency**: we find the generated model is sometimes noisy and not fluent, especially for long videos. We think the problem is due to not using a temporal VAE. As [Pixart-Sigma](https://arxiv.org/abs/2403.04692) finds that adapting to a new VAE is simple, we plan to develop a temporal VAE for the model in the next version.
 - **Lack of time consistency**: we find the model cannot generate videos with high time consistency. We think the problem is due to the lack of training FLOPs. We plan to collect more data and continue training the model to improve the time consistency.
 - **Bad human generation**: We find the model cannot generate high-quality human videos. We think the problem is due to the lack of human data. We plan to collect more human data and continue training the model to improve the human generation.
--- a/eval/launch.sh
+++ b/eval/launch.sh
@@ -14,14 +14,14 @@ LOG_BASE=logs/sample/$CKPT_BASE
 echo "Logging to $LOG_BASE"

 # == sample & human evaluation ==
-CUDA_VISIBLE_DEVICES=0 bash eval/sample.sh $CKPT -1 >${LOG_BASE}_1.log 2>&1 &
-CUDA_VISIBLE_DEVICES=1 bash eval/sample.sh $CKPT -2a >${LOG_BASE}_2a.log 2>&1 &
-CUDA_VISIBLE_DEVICES=2 bash eval/sample.sh $CKPT -2b >${LOG_BASE}_2b.log 2>&1 &
-CUDA_VISIBLE_DEVICES=3 bash eval/sample.sh $CKPT -2c >${LOG_BASE}_2c.log 2>&1 &
-CUDA_VISIBLE_DEVICES=4 bash eval/sample.sh $CKPT -2d >${LOG_BASE}_2d.log 2>&1 &
-CUDA_VISIBLE_DEVICES=5 bash eval/sample.sh $CKPT -2e >${LOG_BASE}_2e.log 2>&1 &
-CUDA_VISIBLE_DEVICES=6 bash eval/sample.sh $CKPT -2f >${LOG_BASE}_2f.log 2>&1 &
-CUDA_VISIBLE_DEVICES=7 bash eval/sample.sh $CKPT -2g >${LOG_BASE}_2g.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=0 bash eval/sample.sh $CKPT -1 >${LOG_BASE}_1.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=1 bash eval/sample.sh $CKPT -2a >${LOG_BASE}_2a.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=2 bash eval/sample.sh $CKPT -2b >${LOG_BASE}_2b.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=3 bash eval/sample.sh $CKPT -2c >${LOG_BASE}_2c.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=4 bash eval/sample.sh $CKPT -2d >${LOG_BASE}_2d.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=5 bash eval/sample.sh $CKPT -2e >${LOG_BASE}_2e.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=6 bash eval/sample.sh $CKPT -2f >${LOG_BASE}_2f.log 2>&1 &
+# CUDA_VISIBLE_DEVICES=7 bash eval/sample.sh $CKPT -2g >${LOG_BASE}_2g.log 2>&1 &

 # CUDA_VISIBLE_DEVICES=0 bash eval/sample.sh $CKPT -2h >${LOG_BASE}_2h.log 2>&1 &

@@ -35,4 +35,14 @@ CUDA_VISIBLE_DEVICES=7 bash eval/sample.sh $CKPT -2g >${LOG_BASE}_2g.log 2>&1 &
 # CUDA_VISIBLE_DEVICES=6 bash eval/sample.sh $CKPT -4g >${LOG_BASE}_4g.log 2>&1 &
 # CUDA_VISIBLE_DEVICES=7 bash eval/sample.sh $CKPT -4h >${LOG_BASE}_4h.log 2>&1 &

+# == vbench i2v ==
+CUDA_VISIBLE_DEVICES=0 bash eval/sample.sh $CKPT -5a >${LOG_BASE}_5a.log 2>&1 &
+CUDA_VISIBLE_DEVICES=1 bash eval/sample.sh $CKPT -5b >${LOG_BASE}_5b.log 2>&1 &
+CUDA_VISIBLE_DEVICES=2 bash eval/sample.sh $CKPT -5c >${LOG_BASE}_5c.log 2>&1 &
+CUDA_VISIBLE_DEVICES=3 bash eval/sample.sh $CKPT -5d >${LOG_BASE}_5d.log 2>&1 &
+CUDA_VISIBLE_DEVICES=4 bash eval/sample.sh $CKPT -5e >${LOG_BASE}_5e.log 2>&1 &
+CUDA_VISIBLE_DEVICES=5 bash eval/sample.sh $CKPT -5f >${LOG_BASE}_5f.log 2>&1 &
+CUDA_VISIBLE_DEVICES=6 bash eval/sample.sh $CKPT -5g >${LOG_BASE}_5g.log 2>&1 &
+CUDA_VISIBLE_DEVICES=7 bash eval/sample.sh $CKPT -5h >${LOG_BASE}_5h.log 2>&1 &
+
 # kill all by: pkill -f "inference"
--- a/eval/multiple.sh
+++ b/eval/multiple.sh
@@ -14,93 +14,260 @@ if [[ $CKPT == *"ema"* ]]; then
 else
    CKPT_BASE=$(basename $CKPT)
 fi
-OUTPUT="./samples_${CKPT_BASE}_${NAME}"
+OUTPUT="./samples/samples_${CKPT_BASE}_${NAME}"
 start=$(date +%s)

 # Generate samples

-# 16x240p
+# == 16x240p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_1_1 \
+    --num-frames 16 --image-size 320 320 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_16_9 \
+    --num-frames 16 --image-size 240 426 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_9_16 \
+    --num-frames 16 --image-size 426 240 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_4_3 \
+    --num-frames 16 --image-size 276 368 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_3_4 \
+    --num-frames 16 --image-size 368 276 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_1_2 \
+    --num-frames 16 --image-size 226 452 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x240p_2_1 \
+    --num-frames 16 --image-size 452 226 --num-sample $NUM_SAMPLE

-# 64x240p
+# == 64x240p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_1_1 \
+    --num-frames 64 --image-size 320 320 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_16_9 \
+    --num-frames 64 --image-size 240 426 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_9_16 \
+    --num-frames 64 --image-size 426 240 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_4_3 \
+    --num-frames 64 --image-size 276 368 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_3_4 \
+    --num-frames 64 --image-size 368 276 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_1_2 \
+    --num-frames 64 --image-size 226 452 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x240p_2_1 \
+    --num-frames 64 --image-size 452 226 --num-sample $NUM_SAMPLE

-# 128x240p
+# == 128x240p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_1_1 \
+    --num-frames 128 --image-size 320 320 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_16_9 \
+    --num-frames 128 --image-size 240 426 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_9_16 \
+    --num-frames 128 --image-size 426 240 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_4_3 \
+    --num-frames 128 --image-size 276 368 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_3_4 \
+    --num-frames 128 --image-size 368 276 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_1_2 \
+    --num-frames 128 --image-size 226 452 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x240p_2_1 \
+    --num-frames 128 --image-size 452 226 --num-sample $NUM_SAMPLE

-# 16x320p
+# == 16x360p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_1_1 \
+    --num-frames 16 --image-size 480 480 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_16_9 \
+    --num-frames 16 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_9_16 \
+    --num-frames 16 --image-size 640 360 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_4_3 \
+    --num-frames 16 --image-size 416 554 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_3_4 \
+    --num-frames 16 --image-size 554 416 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_1_2 \
+    --num-frames 16 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x360p_2_1 \
+    --num-frames 16 --image-size 640 360 --num-sample $NUM_SAMPLE

-# 64x320p
+# == 64x360p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_1_1 \
+    --num-frames 64 --image-size 480 480 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_16_9 \
+    --num-frames 64 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_9_16 \
+    --num-frames 64 --image-size 640 360 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_4_3 \
+    --num-frames 64 --image-size 416 554 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_3_4 \
+    --num-frames 64 --image-size 554 416 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_1_2 \
+    --num-frames 64 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x360p_2_1 \
+    --num-frames 64 --image-size 640 360 --num-sample $NUM_SAMPLE

-# 128x320p
+# == 128x360p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_1_1 \
+    --num-frames 128 --image-size 480 480 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_16_9 \
+    --num-frames 128 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_9_16 \
+    --num-frames 128 --image-size 640 360 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_4_3 \
+    --num-frames 128 --image-size 416 554 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_3_4 \
+    --num-frames 128 --image-size 554 416 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_1_2 \
+    --num-frames 128 --image-size 360 640 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 128x360p_2_1 \
+    --num-frames 128 --image-size 640 360 --num-sample $NUM_SAMPLE

-# 16x480p
+# == 16x480p ==
 # 1:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_1_1 \
-    --num-frames 16 --image-size 360 360 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 640 640 --num-sample $NUM_SAMPLE
 # 16:9
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_16_9 \
-    --num-frames 16 --image-size 360 640 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 480 854 --num-sample $NUM_SAMPLE
 # 9:16
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_9_16 \
-    --num-frames 16 --image-size 1280 720 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 854 480 --num-sample $NUM_SAMPLE
 # 4:3
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_4_3 \
-    --num-frames 16 --image-size 832 1108 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 554 738 --num-sample $NUM_SAMPLE
 # 3:4
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_3_4 \
-    --num-frames 16 --image-size 1108 832 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 738 554 --num-sample $NUM_SAMPLE
 # 1:2
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_1_2 \
-    --num-frames 16 --image-size 1358 600 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 452 904 --num-sample $NUM_SAMPLE
 # 2:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x480p_2_1 \
-    --num-frames 16 --image-size 600 1358
+    --num-frames 16 --image-size 904 452 --num-sample $NUM_SAMPLE

-# 32x480p
+# == 32x480p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_1_1 \
+    --num-frames 32 --image-size 640 640 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_16_9 \
+    --num-frames 32 --image-size 480 854 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_9_16 \
+    --num-frames 32 --image-size 854 480 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_4_3 \
+    --num-frames 32 --image-size 554 738 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_3_4 \
+    --num-frames 32 --image-size 738 554 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_1_2 \
+    --num-frames 32 --image-size 452 904 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x480p_2_1 \
+    --num-frames 32 --image-size 904 452 --num-sample $NUM_SAMPLE

-# 64x480p
+# == 64x480p ==
+# 1:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_1_1 \
+    --num-frames 64 --image-size 640 640 --num-sample $NUM_SAMPLE
+# 16:9
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_16_9 \
+    --num-frames 64 --image-size 480 854 --num-sample $NUM_SAMPLE
+# 9:16
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_9_16 \
+    --num-frames 64 --image-size 854 480 --num-sample $NUM_SAMPLE
+# 4:3
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_4_3 \
+    --num-frames 64 --image-size 554 738 --num-sample $NUM_SAMPLE
+# 3:4
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_3_4 \
+    --num-frames 64 --image-size 738 554 --num-sample $NUM_SAMPLE
+# 1:2
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_1_2 \
+    --num-frames 64 --image-size 452 904 --num-sample $NUM_SAMPLE
+# 2:1
+eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 64x480p_2_1 \
+    --num-frames 64 --image-size 904 452 --num-sample $NUM_SAMPLE

-
-# 16x720p
+# == 16x720p ==
 # 1:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_1_1 \
-    --num-frames 16 --image-size 960 960 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 960 960 --num-sample $NUM_SAMPLE
 # 16:9
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_16_9 \
-    --num-frames 16 --image-size 720 1280 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 720 1280 --num-sample $NUM_SAMPLE
 # 9:16
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_9_16 \
-    --num-frames 16 --image-size 1280 720 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 1280 720 --num-sample $NUM_SAMPLE
 # 4:3
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_4_3 \
-    --num-frames 16 --image-size 832 1108 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 832 1108 --num-sample $NUM_SAMPLE
 # 3:4
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_3_4 \
-    --num-frames 16 --image-size 1108 832 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 1108 832 --num-sample $NUM_SAMPLE
 # 1:2
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_1_2 \
-    --num-frames 16 --image-size 1358 600 --num-samples $NUM_SAMPLE
+    --num-frames 16 --image-size 1358 600 --num-sample $NUM_SAMPLE
 # 2:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 16x720p_2_1 \
    --num-frames 16 --image-size 600 1358

-# 32x720p
+# == 32x720p ==
 # 1:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_1_1 \
-    --num-frames 32 --image-size 960 960 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 960 960 --num-sample $NUM_SAMPLE
 # 16:9
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_16_9 \
-    --num-frames 32 --image-size 720 1280 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 720 1280 --num-sample $NUM_SAMPLE
 # 9:16
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_9_16 \
-    --num-frames 32 --image-size 1280 720 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 1280 720 --num-sample $NUM_SAMPLE
 # 4:3
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_4_3 \
-    --num-frames 32 --image-size 832 1108 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 832 1108 --num-sample $NUM_SAMPLE
 # 3:4
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_3_4 \
-    --num-frames 32 --image-size 1108 832 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 1108 832 --num-sample $NUM_SAMPLE
 # 1:2
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_1_2 \
-    --num-frames 32 --image-size 1358 600 --num-samples $NUM_SAMPLE
+    --num-frames 32 --image-size 1358 600 --num-sample $NUM_SAMPLE
 # 2:1
 eval $CMD --ckpt-path $CKPT --prompt \"$PROMPT\" --save-dir $OUTPUT --sample-name 32x720p_2_1 \
    --num-frames 32 --image-size 600 1358
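The 240p part of the grid above follows a regular pattern: three frame counts times seven aspect ratios, with `--sample-name` built as `<frames>x<resolution>_<ratio>`. A minimal Python sketch that enumerates the same names and sizes, using only values copied from the hunk. `SIZES_240P`, `FRAMES_240P`, and `sample_grid` are hypothetical names for illustration and are not part of the repository.

```python
# The two values passed to --image-size for each 240p aspect ratio,
# copied from the hunk above (hypothetical table name).
SIZES_240P = {
    "1_1": (320, 320),
    "16_9": (240, 426),
    "9_16": (426, 240),
    "4_3": (276, 368),
    "3_4": (368, 276),
    "1_2": (226, 452),
    "2_1": (452, 226),
}
# Frame counts used for the 240p sections in the hunk.
FRAMES_240P = (16, 64, 128)


def sample_grid(frames=FRAMES_240P, sizes=SIZES_240P, resolution="240p"):
    """Return (sample_name, num_frames, size_a, size_b) tuples in script order."""
    grid = []
    for num_frames in frames:
        for ratio, (size_a, size_b) in sizes.items():
            name = f"{num_frames}x{resolution}_{ratio}"
            grid.append((name, num_frames, size_a, size_b))
    return grid


if __name__ == "__main__":
    for name, num_frames, size_a, size_b in sample_grid():
        print(f"--sample-name {name} --num-frames {num_frames} --image-size {size_a} {size_b}")
```

Keeping the sizes in one table makes it easier to spot entries that repeat another ratio's values, which is hard to see in the expanded shell form.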
--- a/eval/sample.sh
+++ b/eval/sample.sh
@@ -210,49 +210,49 @@ function run_vbenck_i2v_b() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 140 --end-index 280 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_c() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 280 --end-index 420 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_d() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 420 --end-index 560 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_e() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 560 --end-index 700 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_f() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 700 --end-index 840 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_g() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 840 --end-index 980 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 function run_vbenck_i2v_h() {
  eval $CMD_REF --ckpt-path $CKPT --save-dir ${OUTPUT}_vbench_i2v --prompt-as-path --num-sample 5 \
    --prompt-path assets/texts/VBench/all_i2v.txt \
    --start-index 980 \
-    --image-size 256 256
+    --num-frames $VBENCH_I2V_FRAMES --image-size $VBENCH_I2V_H $VBENCH_I2V_W --batch-size $VBENCH_BS
 }

 ### Main
--- a/eval/vbench_i2v/vbench2_i2v_full_info.json
+++ b/eval/vbench_i2v/vbench2_i2v_full_info.json
--- a/opensora/models/layers/blocks.py
+++ b/opensora/models/layers/blocks.py
@@ -171,6 +171,7 @@ class Attention(nn.Module):

        qkv = qkv.view(qkv_shape).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
+        # WARNING: this may be a bug
        if self.rope:
            q = self.rotary_emb(q)
            k = self.rotary_emb(k)
--- a/opensora/schedulers/iddpm/gaussian_diffusion.py
+++ b/opensora/schedulers/iddpm/gaussian_diffusion.py
@@ -408,7 +408,7 @@ class GaussianDiffusion:
        if mask is not None:
            if mask.shape[0] != x.shape[0]:
                mask = mask.repeat(2, 1)  # HACK
-            mask_t = (mask * len(self.betas) - 1).to(torch.int)
+            mask_t = (mask * len(self.betas)).to(torch.int)

            # x0: copy unchanged x values
            # x_noise: add noise to x values
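The one-line change above shifts how a mask value in [0, 1] is mapped to a timestep index. The sketch below reproduces only that arithmetic in plain Python, assuming 1000 diffusion steps and using `int()` as a stand-in for the float-to-int cast, which truncates toward zero in the same way. The function names are hypothetical and not from the repository.

```python
NUM_STEPS = 1000  # assumed value of len(self.betas)


def mask_t_before(mask, num_steps=NUM_STEPS):
    # Old expression: (mask * len(self.betas) - 1), then cast to int.
    return int(mask * num_steps - 1)


def mask_t_after(mask, num_steps=NUM_STEPS):
    # New expression: (mask * len(self.betas)), then cast to int.
    return int(mask * num_steps)


if __name__ == "__main__":
    for m in (0.0, 0.5, 1.0):
        print(m, mask_t_before(m), mask_t_after(m))
```

With these inputs, a mask of 0.0 maps to -1 under the old expression and to 0 under the new one, while 1.0 maps to 999 and 1000 respectively.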
--- a/scripts/inference-long.py
+++ b/scripts/inference-long.py
@@ -232,7 +232,6 @@ def main():
        batch_prompts_raw = prompts[i : i + cfg.batch_size]
        batch_prompts_raw, additional_infos = extract_json_from_prompts(batch_prompts_raw)
        batch_prompts_loops = process_prompts(batch_prompts_raw, cfg.loop)
-        video_clips = []
        # handle the last batch
        if len(batch_prompts_raw) < cfg.batch_size and cfg.multi_resolution == "STDiT2":
            model_args["height"] = model_args["height"][: len(batch_prompts_raw)]
@@ -250,37 +249,39 @@ def main():
        refs_x = collect_references_batch(cfg.reference_path[i : i + cfg.batch_size], vae, cfg.image_size)
        mask_strategy = cfg.mask_strategy[i : i + cfg.batch_size]

-        # 4.3. long video generation
-        for loop_i in range(cfg.loop):
-            # 4.4 sample in hidden space
-            batch_prompts = [prompt[loop_i] for prompt in batch_prompts_loops]
-            z = torch.randn(len(batch_prompts), vae.out_channels, *latent_size, device=device, dtype=dtype)
+        # 4.3. diffusion sampling
+        old_sample_idx = sample_idx
+        # generate multiple samples for each prompt
+        for k in range(cfg.num_sample):
+            sample_idx = old_sample_idx
+            video_clips = []

-            # 4.5. apply mask strategy
-            masks = None
-            # if cfg.reference_path is not None:
-            if loop_i > 0:
-                ref_x = vae.encode(video_clips[-1])
-                for j, refs in enumerate(refs_x):
-                    if refs is None:
-                        refs_x[j] = [ref_x[j]]
-                    else:
-                        refs.append(ref_x[j])
-                    if mask_strategy[j] is None:
-                        mask_strategy[j] = ""
-                    else:
-                        mask_strategy[j] += ";"
-                    mask_strategy[
-                        j
-                    ] += f"{loop_i},{len(refs)-1},-{cfg.condition_frame_length},0,{cfg.condition_frame_length}"
-            masks = apply_mask_strategy(z, refs_x, mask_strategy, loop_i)
+            # 4.4. long video generation
+            for loop_i in range(cfg.loop):
+                # 4.4 sample in hidden space
+                batch_prompts = [prompt[loop_i] for prompt in batch_prompts_loops]

-            # 4.6. diffusion sampling
-            old_sample_idx = sample_idx
-            # generate multiple samples for each prompt
-            for k in range(cfg.num_sample):
-                sample_idx = old_sample_idx
+                # 4.5. apply mask strategy
+                masks = None
+                # if cfg.reference_path is not None:
+                if loop_i > 0:
+                    ref_x = vae.encode(video_clips[-1])
+                    for j, refs in enumerate(refs_x):
+                        if refs is None:
+                            refs_x[j] = [ref_x[j]]
+                        else:
+                            refs.append(ref_x[j])
+                        if mask_strategy[j] is None:
+                            mask_strategy[j] = ""
+                        else:
+                            mask_strategy[j] += ";"
+                        mask_strategy[
+                            j
+                        ] += f"{loop_i},{len(refs)-1},-{cfg.condition_frame_length},0,{cfg.condition_frame_length}"

+                # sampling
+                z = torch.randn(len(batch_prompts), vae.out_channels, *latent_size, device=device, dtype=dtype)
+                masks = apply_mask_strategy(z, refs_x, mask_strategy, loop_i)
                samples = scheduler.sample(
                    model,
                    text_encoder,
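The hunk above swaps the nesting of the two loops: the per-sample loop (`k`) moves outside the long-video loop (`loop_i`), and `video_clips` is reset at the start of each sample. A minimal sketch of the resulting iteration order, with the model calls left out. `old_order` and `new_order` are hypothetical helpers, and the real script does considerably more work per step.

```python
def old_order(num_sample, loop):
    """Iteration order before the change: loop_i outside, k inside."""
    return [(loop_i, k) for loop_i in range(loop) for k in range(num_sample)]


def new_order(num_sample, loop):
    """Iteration order after the change: k outside, loop_i inside."""
    order = []
    for k in range(num_sample):
        clips = []  # mirrors `video_clips = []` being reset per sample
        for loop_i in range(loop):
            clips.append((k, loop_i))
        order.extend(clips)
    return order


if __name__ == "__main__":
    print(old_order(num_sample=2, loop=3))
    print(new_order(num_sample=2, loop=3))
```

In the new order, the clip read through `video_clips[-1]` when `loop_i > 0` comes from the list that was reset for the current sample.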
--- a/scripts/inference.py
+++ b/scripts/inference.py
@@ -114,7 +114,6 @@ def main():
        # 4.2 sample in hidden space
        batch_prompts_raw = prompts[i : i + cfg.batch_size]
        batch_prompts = [text_preprocessing(prompt) for prompt in batch_prompts_raw]
-        z = torch.randn(len(batch_prompts), vae.out_channels, *latent_size, device=device, dtype=dtype)
        # handle the last batch
        if len(batch_prompts_raw) < cfg.batch_size and cfg.multi_resolution == "STDiT2":
            model_args["height"] = model_args["height"][: len(batch_prompts_raw)]
@@ -145,6 +144,7 @@ def main():
                    continue

            # sampling
+            z = torch.randn(len(batch_prompts), vae.out_channels, *latent_size, device=device, dtype=dtype)
            samples = scheduler.sample(
                model,
                text_encoder,
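Both inference scripts now draw the initial latent `z` right before `scheduler.sample` rather than once per batch, so each sample appears to start from its own noise draw. The sketch below illustrates the difference with the standard-library `random` module standing in for `torch.randn`. All names are hypothetical, and the latent is reduced to a short list of floats.

```python
import random


def draw_latent(rng, size=4):
    """Stand-in for torch.randn: a short list of Gaussian draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def initial_latents_shared(num_sample, seed=0):
    """Old placement: one draw per batch, reused for every sample."""
    rng = random.Random(seed)
    z = draw_latent(rng)
    return [z for _ in range(num_sample)]


def initial_latents_per_sample(num_sample, seed=0):
    """New placement: a fresh draw inside the per-sample loop."""
    rng = random.Random(seed)
    return [draw_latent(rng) for _ in range(num_sample)]


if __name__ == "__main__":
    print(initial_latents_shared(3))
    print(initial_latents_per_sample(3))
```

Whether the final outputs differed under the old placement depends on how much randomness the scheduler itself injects, which this sketch does not model.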