remove images

Zangwei Zheng 2025-03-05 10:27:32 +08:00
parent ae143b54cf
commit c4983aedb2
16 changed files with 74 additions and 74 deletions


@@ -1,5 +1,5 @@
<p align="center">
<img src="./assets/readme/icon.png" width="250"/>
<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
</p>
<div align="center">
<a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
@@ -44,7 +44,7 @@ With Open-Sora, our goal is to foster innovation, creativity, and inclusivity wi
- **[2024.04.25]** We released **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video, video-to-video, infinite time** generation. In addition, a full video processing pipeline is released. [[checkpoints]](#open-sora-11-model-weights) [[report]](/docs/report_02.md)
- **[2024.03.18]** We released **Open-Sora 1.0**, a fully open-source project for video generation.
Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
<a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a>
<a href="https://github.com/hpcaitech/ColossalAI"><img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/colossal_ai.png" width="8%" ></a>
acceleration,
inference, and more. Our model can produce 2s 512x512 videos with only 3 days of training. [[checkpoints]](#open-sora-10-model-weights)
[[blog]](https://hpc-ai.com/blog/open-sora-v1.0) [[report]](/docs/report_01.md)
@@ -346,7 +346,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
In the Gradio application, the basic options are as follows:
![Gradio Demo](assets/readme/gradio_basic.png)
![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_basic.png)
The easiest way to generate a video is to input a text prompt and click the "**Generate video**" button (scroll down if you cannot find it). The generated video will be displayed in the right panel. Checking "**Enhance prompt with GPT4o**" will use GPT-4o to refine the prompt, while the "**Random Prompt**" button will have GPT-4o generate a random prompt for you. Due to OpenAI's API limits, the prompt refinement result has some randomness.
@@ -359,7 +359,7 @@ Then, you can choose the **resolution**, **duration**, and **aspect ratio** of t
Note that besides text-to-video, you can also use **image-to-video generation**. You can upload an image and then click the "**Generate video**" button to generate a video with the image as the first frame. Alternatively, you can fill in the text prompt, click the "**Generate image**" button to generate an image from the prompt, and then click the "**Generate video**" button to generate a video from that image with the same model.
![Gradio Demo](assets/readme/gradio_option.png)
![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_option.png)
Then you can specify more options, including "**Motion Strength**", "**Aesthetic**", and "**Camera Motion**". If "Enable" is not checked or the choice is "none", the information is not passed to the model. Otherwise, the model will generate videos with the specified motion strength, aesthetic score, and camera motion.
@@ -540,7 +540,7 @@ To this end, we establish a complete pipeline for data processing, which could s
The pipeline is shown below. For detailed information, please refer to [data processing](docs/data_processing.md).
Also check out the [datasets](docs/datasets.md) we use.
![Data Processing Pipeline](assets/readme/report_data_pipeline.png)
![Data Processing Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)
## Training


@@ -7,11 +7,11 @@ multi_resolution = "STDiT2"
# Condition
prompt_path = None
prompt = [
'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. {"reference_path": "assets/images/condition/cliff.png", "mask_strategy": "0"}',
'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png","mask_strategy": "0"}',
'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. {"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png", "mask_strategy": "0"}',
'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png","mask_strategy": "0"}',
'A car driving on the ocean.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4","mask_strategy": "0,0,-8,0,8"}',
'A snowy forest.{"reference_path": "https://cdn.pixabay.com/video/2021/04/25/72171-542991404_large.mp4","mask_strategy": "0,0,0,0,15,0.8"}',
'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}',
'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}',
'|0|a white jeep equipped with a roof rack driving on a dirt road in a coniferous forest.|2|a white jeep equipped with a roof rack driving on a dirt road in the desert.|4|a white jeep equipped with a roof rack driving on a dirt road in a mountain.|6|A white jeep equipped with a roof rack driving on a dirt road in a city.|8|a white jeep equipped with a roof rack driving on a dirt road on the surface of a river.|10|a white jeep equipped with a roof rack driving on a dirt road under the lake.|12|a white jeep equipped with a roof rack flying into the sky.|14|a white jeep equipped with a roof rack driving in the universe. Earth is the background.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4", "mask_strategy": "0,0,0,0,15"}',
]


@@ -66,7 +66,7 @@ python scripts/inference_i2v.py configs/opensora-v1-3/inference/v2v.py \
--num-frames 97 --resolution 720p --aspect-ratio "9:16" --cond-type i2v_head --use-sdedit True \
--use-oscillation-guidance-for-image True --image-cfg-scale 2.0 \
--use-oscillation-guidance-for-text True --cfg-scale 7.5 \
--prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/wave.png"}'
--prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png"}'
# last-frame condition generation
python scripts/inference_i2v.py configs/opensora-v1-3/inference/v2v.py \
@@ -88,7 +88,7 @@ python scripts/inference_i2v.py configs/opensora-v1-3/inference/v2v.py \
--num-frames 97 --resolution 720p --aspect-ratio "9:16" --cond-type i2v_loop --use-sdedit True \
--use-oscillation-guidance-for-image True --image-cfg-scale 2.0 \
--use-oscillation-guidance-for-text True --cfg-scale 7.5 \
--prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png"}'
--prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset2.png"}'
```
### Inference with Open-Sora 1.2
@@ -144,7 +144,7 @@ You can adjust the `--num-frames` and `--image-size` to generate different resul
# image condition
python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
--num-frames 32 --image-size 240 426 --sample-name image-cond \
--prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/wave.png","mask_strategy": "0"}'
--prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png","mask_strategy": "0"}'
# video extending
python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
@@ -159,7 +159,7 @@ python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckp
# video connecting
python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
--num-frames 32 --image-size 240 426 --sample-name connect \
--prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}'
--prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}'
# video editing
python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \


@@ -69,7 +69,7 @@ condition_frame_length = 4
reference_path = [
"https://cdn.openai.com/tmp/s/interp/d0.mp4",
None,
"assets/images/condition/wave.png",
"https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png",
]
mask_strategy = [
"0,0,0,0,8,0.3",
@@ -80,7 +80,7 @@ mask_strategy = [
The following figure provides an illustration of the `mask_strategy`:
![mask_strategy](/assets/readme/report_mask_config.png)
![mask_strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask_config.png)
To generate an arbitrarily long video, our strategy is to first generate a video of fixed length, and then condition the next generation on the last `condition_frame_length` frames of the previous one. This loops `loop` times, so the total length of the video is `loop * (num_frames - condition_frame_length) + condition_frame_length`.
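This length arithmetic can be sketched directly (the function name is ours, for illustration):

```python
def total_video_length(num_frames: int, condition_frame_length: int, loop: int) -> int:
    """Total frames when each loop reuses the last `condition_frame_length`
    frames of the previous clip as its condition."""
    # The first loop contributes `num_frames` frames; every later loop adds
    # only the newly generated `num_frames - condition_frame_length` frames.
    return loop * (num_frames - condition_frame_length) + condition_frame_length
```

For example, with `num_frames=16`, `condition_frame_length=4`, and `loop=3`, the total is 40 frames.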
@@ -96,7 +96,7 @@ To condition the generation on images or videos, we introduce the `mask_strategy
To facilitate usage, we also accept passing the reference path and mask strategy as JSON appended to the prompt. For example,
```plaintext
'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff\'s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff\'s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.{"reference_path": "assets/images/condition/cliff.png", "mask_strategy": "0"}'
'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff\'s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff\'s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png", "mask_strategy": "0"}'
```
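A minimal sketch of how such a prompt could be split into text and config (assuming the JSON object is flat and appended at the end; the helper name is ours, not the repo's):

```python
import json

def split_prompt(prompt: str):
    """Split 'text{json}' into (text, config dict)."""
    start = prompt.find("{")
    if start == -1:
        return prompt, {}
    return prompt[:start], json.loads(prompt[start:])

text, cfg = split_prompt(
    'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/wave.png", "mask_strategy": "0"}'
)
# text -> 'A breathtaking sunrise scene.'
# cfg["mask_strategy"] holds the mask-strategy tuple as a string, here "0"
```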
## Inference Args
@@ -244,7 +244,7 @@ This looks a bit difficult to understand at first glance. Let's understand t
### Three-level bucket
![bucket](/assets/readme/report_bucket.png)
![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)
We design a three-level bucket: `(resolution, num_frames, aspect_ratios)`. The resolutions and aspect ratios are predefined in [aspect.py](/opensora/datasets/aspect.py). Commonly used resolutions (e.g., 240p, 1080p) are supported, and each name represents a pixel count rather than an exact size: 240p is nominally 240x426, but we define it to cover any size with HxW of approximately 240x426 = 102240 pixels. The aspect ratios are defined for each resolution, so you do not need to define them in the `bucket_config`.
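The pixel-budget interpretation can be illustrated with a small check (the numbers follow the text above; the dict and function are ours, not the actual definitions in `aspect.py`):

```python
# Named resolutions as pixel budgets, per the 240p = 240x426 = 102240 example.
RESOLUTION_PIXELS = {"240p": 240 * 426, "480p": 480 * 854, "720p": 720 * 1280}

def matches_resolution(h: int, w: int, name: str, tol: float = 0.1) -> bool:
    """True if an HxW frame is within `tol` of the named pixel budget."""
    target = RESOLUTION_PIXELS[name]
    return abs(h * w - target) / target <= tol
```

Under this rule a 320x320 frame (102400 pixels) would still count as 240p.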


@@ -3,7 +3,7 @@
We establish a complete pipeline for video/image data processing. The pipeline is shown below.
![pipeline](/assets/readme/report_data_pipeline.png)
![pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)
First, raw videos,
either from the Internet or public datasets, are split into shorter clips based on scene detection.


@@ -10,11 +10,11 @@ The video training involves a large number of tokens. Considering 24fps 1min vid
As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper, although we do not control for a similar number of parameters across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).
![Architecture Comparison](/assets/readme/report_arch_comp.png)
![Architecture Comparison](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch_comp.png)
To focus on video generation, we hope to train the model based on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention with zero. This initialization preserves the model's image generation ability at the beginning of training, which Latte's architecture cannot. The inserted attention increases the number of parameters from 580M to 724M.
![Architecture](/assets/readme/report_arch.jpg)
![Architecture](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch.jpg)
Drawing from the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on a 366K-clip pretraining dataset, and then 16x256x256, 16x512x512, and 64x512x512 on a 20K-clip dataset. With scaled position embeddings, this strategy greatly reduces the computational cost.
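The zero-initialization trick described above can be sketched in a few lines (a toy block, not the repo's STDiT implementation):

```python
import torch
import torch.nn as nn

class TemporalBlockStub(nn.Module):
    """Toy temporal-attention block whose output projection is zero-initialized,
    so at the start of training the block is an identity residual and the
    pretrained image weights are preserved."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero projection: the block
        nn.init.zeros_(self.proj.bias)    # contributes nothing at init

    def forward(self, x):  # x: (batch, frames, dim)
        out, _ = self.attn(x, x, x)
        return x + self.proj(out)  # residual; equals x before any training
```

Because the projection is zero, the forward pass returns its input unchanged until training updates those weights.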
@@ -24,7 +24,7 @@ We also try to use a 3D patch embedder in DiT. However, with 2x downsampling on
We find that the number and quality of data have a greater impact on the quality of generated videos than the model architecture and training strategy. At this time, we only prepared the first split (366K video clips) from [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not that accurate. Thus, we further collect 20K relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides free-license videos. We caption the videos with LLaVA, an image captioning model, using three frames and a designed prompt. With the designed prompt, LLaVA can generate good-quality captions.
![Caption](/assets/readme/report_caption.png)
![Caption](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_caption.png)
As we lay more emphasis on the quality of data, we plan to collect more data and build a video preprocessing pipeline in our next version.
@@ -36,14 +36,14 @@ With a limited training budget, we made only a few explorations. We find learnin
16x256x256 Pretraining Loss Curve
![16x256x256 Pretraining Loss Curve](/assets/readme/report_loss_curve_1.png)
![16x256x256 Pretraining Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_1.png)
16x256x256 HQ Training Loss Curve
![16x256x256 HQ Training Loss Curve](/assets/readme/report_loss_curve_2.png)
![16x256x256 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_2.png)
16x512x512 HQ Training Loss Curve
![16x512x512 HQ Training Loss Curve](/assets/readme/report_loss_curve_3.png)
![16x512x512 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_3.png)
> Core Contributor: Zangwei Zheng*, Xiangyu Peng*, Shenggui Li, Hongxing Liu, Yang You


@@ -45,7 +45,7 @@ For the simplicity of implementation, we choose the bucket method. We pre-define
</details>
![bucket](/assets/readme/report_bucket.png)
![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)
As shown in the figure, a bucket is a triplet of `(resolution, num_frame, aspect_ratio)`. We provide predefined aspect ratios for each resolution that cover most common video aspect ratios. Before each epoch, we shuffle the dataset and allocate the samples to different buckets as shown in the figure. We put a sample into the bucket with the largest resolution and frame length that is smaller than the video's.
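The allocation rule amounts to a greedy lookup over buckets sorted from largest to smallest; a hypothetical sketch (bucket names and sizes are ours, not the repo's config):

```python
# Buckets sorted from largest to smallest: (name, pixel budget, num_frames).
BUCKETS = [
    ("720p-64", 720 * 1280, 64),
    ("480p-32", 480 * 854, 32),
    ("240p-16", 240 * 426, 16),
]

def assign_bucket(h: int, w: int, n_frames: int):
    """Return the largest bucket this video can fill, or None if it is
    too small for every bucket."""
    for name, pixels, frames in BUCKETS:
        if h * w >= pixels and n_frames >= frames:
            return name
    return None
```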
@@ -57,7 +57,7 @@ A detailed explanation of the bucket usage in training is available in [docs/con
Transformers can be easily extended to support image-to-image and video-to-video tasks. We propose a mask strategy to support image and video conditioning. The mask strategy is shown in the figure below.
![mask strategy](/assets/readme/report_mask.png)
![mask strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask.png)
Typically, we unmask the frames to be conditioned on for image/video-to-video conditioning. During the ST-DiT forward pass, unmasked frames are given timestep 0, while the others keep the same timestep t. We find that directly applying this strategy to a trained model yields poor results, since the diffusion model did not learn to handle different timesteps within one sample during training.
@@ -65,7 +65,7 @@ Inspired by [UL2](https://arxiv.org/abs/2205.05131), we introduce random mask st
An illustration of the mask strategy config used in inference is given below. A five-number tuple provides great flexibility in defining the mask strategy. By conditioning on generated frames, we can autoregressively generate infinite frames (although errors propagate).
![mask strategy config](/assets/readme/report_mask_config.png)
![mask strategy config](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask_config.png)
A detailed explanation of the mask strategy usage is available in [docs/config.md](/docs/config.md#advanced-inference-config).
@@ -73,21 +73,21 @@ A detailed explanation of the mask strategy usage is available in [docs/config.m
As we found in Open-Sora 1.0 that the number and quality of data are crucial for training a good model, we work hard on scaling the dataset. First, we create an automatic pipeline following [SVD](https://arxiv.org/abs/2311.15127), including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions. More information can be found in [docs/data_processing.md](/docs/data_processing.md).
![pipeline](/assets/readme/report_data_pipeline.png)
![pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)
We plan to use [panda-70M](https://snap-research.github.io/Panda-70M/) and other data to train the model, approximately 30M+ samples in total. However, we find disk IO a bottleneck when training and processing data at the same time. Thus, we could only prepare a 10M dataset, and it did not go through the full processing pipeline that we built. Finally, we use a dataset with 9.7M videos + 2.6M images for pre-training, and 560K videos + 1.6M images for fine-tuning. The pretraining dataset statistics are shown below. More information about the dataset can be found in [docs/datasets.md](/docs/datasets.md).
Image text tokens (by T5 tokenizer):
![image text tokens](/assets/readme/report_image_textlen.png)
![image text tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_image_textlen.png)
Video text tokens (by T5 tokenizer). We directly use panda's short captions for training and caption the other datasets ourselves. The generated captions are usually less than 200 tokens.
![video text tokens](/assets/readme/report_video_textlen.png)
![video text tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_textlen.png)
Video duration:
![video duration](/assets/readme/report_video_duration.png)
![video duration](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_duration.png)
## Training Details


@@ -34,7 +34,7 @@ For Open-Sora 1.0 & 1.1, we used stability-ai's 83M 2D VAE, which compress the v
Considering the high computational cost of training a 3D VAE, we hope to reuse the knowledge learnt by the 2D VAE. We notice that after the 2D VAE's compression, features adjacent in the temporal dimension are still highly correlated. Thus, we propose a simple video compression network, which first compresses the video spatially by 8x8, then compresses it temporally by 4x. The network is shown below:
![video_compression_network](/assets/readme/report_3d_vae.png)
![video_compression_network](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_3d_vae.png)
We initialize the 2D VAE with [SDXL's VAE](https://huggingface.co/stabilityai/sdxl-vae), which is better than the one we previously used. For the 3D VAE, we adopt the VAE structure from [Magvit-v2](https://magvit.cs.cmu.edu/v2/), which contains 300M parameters. Together with the 83M 2D VAE, the video compression network has 384M parameters in total. We train the 3D VAE for 1.2M steps with a local batch size of 1. The training data is videos from Pexels and Pixabay, mainly 17 frames at 256x256 resolution. Causal convolutions are used in the 3D VAE to make image reconstruction more accurate.
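The shape arithmetic of this two-stage compression can be sketched as follows (the causal handling of the first frame is our assumption, chosen to match the 17-frame training setting):

```python
def latent_shape(frames: int, h: int, w: int):
    """Latent grid after 8x8 spatial (2D VAE) then 4x temporal (3D VAE)
    compression, with the first frame kept on its own by causal convs."""
    t = 1 + (frames - 1) // 4  # e.g. 17 frames -> 5 latent frames
    return t, h // 8, w // 8

# A 17x256x256 training clip compresses to a 5x32x32 latent grid.
```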
@@ -108,9 +108,9 @@ While MiraData and Vript have captions from GPT, we use [PLLaVA](https://github.
Some statistics of the video data used in this stage are shown below. We present basic statistics of duration and resolution, as well as aesthetic score and optical flow score distributions.
We also extract tags for objects and actions from video captions and count their frequencies.
![stats](/assets/readme/report-03_video_stats.png)
![object_count](/assets/readme/report-03_objects_count.png)
![action_count](/assets/readme/report-03_actions_count.png)
![stats](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_video_stats.png)
![object_count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_objects_count.png)
![action_count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_actions_count.png)
We mainly train on 720p and 1080p videos in this stage, aiming to extend the model's ability to larger resolutions. We use a mask ratio of 25% during training. The training config is located in [stage3.py](/configs/opensora-v1-2/train/stage3.py). We train the model for 15K steps, approximately 2 epochs.
@@ -132,12 +132,12 @@ Previously, we monitor the training process only by human evaluation, as DDPM tr
We sampled 1K videos from Pixabay as a validation dataset. We calculate the evaluation loss for images and for videos of different lengths (2s, 4s, 8s, 16s) at different resolutions (144p, 240p, 360p, 480p, 720p). For each setting, we equidistantly sample 10 timesteps, then average all the losses. We also provide a [video](https://streamable.com/oqkkf1) showing samples generated with a fixed prompt at different training steps.
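A sketch of this averaging scheme (the exact timestep placement and function names are our assumptions):

```python
def equidistant_timesteps(num_train_steps: int = 1000, k: int = 10):
    """k timesteps spread evenly over the diffusion schedule (bin midpoints)."""
    step = num_train_steps // k
    return [i * step + step // 2 for i in range(k)]

def mean_eval_loss(loss_fn, settings, timesteps):
    """Average loss over every (setting, timestep) pair."""
    losses = [loss_fn(s, t) for s in settings for t in timesteps]
    return sum(losses) / len(losses)
```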
![Evaluation Loss](/assets/readme/report_val_loss.png)
![Video Evaluation Loss](/assets/readme/report_vid_val_loss.png)
![Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_val_loss.png)
![Video Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_vid_val_loss.png)
In addition, we also keep track of [VBench](https://vchitect.github.io/VBench-project/) scores during training. VBench is an automatic benchmark for short video generation. We calculate the VBench score with 240p 2s videos. The two metrics verify that our model continues to improve during training.
![VBench](/assets/readme/report_vbench_score.png)
![VBench](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_vbench_score.png)
All the evaluation code is released in the `eval` folder. Check the [README](/eval/README.md) for more details.
@@ -150,7 +150,7 @@ All the evaluation code is released in `eval` folder. Check the [README](/eval/R
We use sequence parallelism to support long-sequence training and inference. Our implementation is based on Ulysses, and the workflow is shown below. When sequence parallelism is enabled, we only need to apply the `all-to-all` communication to the spatial block in STDiT, as only the spatial computation depends on the sequence dimension.
![SP](../assets/readme/sequence_parallelism.jpeg)
![SP](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sequence_parallelism.jpeg)
Currently, we have not used sequence parallelism for training, as the data resolution is small; we plan to do so in the next release. For inference, you can use sequence parallelism in case your GPU runs out of memory. A simple benchmark shows that sequence parallelism can achieve a speedup.
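A shape-level sketch of the Ulysses-style all-to-all (no real communication; a single array's leading axis stands in for the ranks, and the function name is ours):

```python
import numpy as np

def ulysses_all_to_all(x: np.ndarray, world: int) -> np.ndarray:
    """Regroup from sequence-sharded (each 'rank' holds a sequence shard with
    all heads) to head-sharded (each 'rank' holds the full sequence for a
    slice of the heads). x: (world, seq // world, heads, head_dim)."""
    w, s, h, d = x.shape
    assert w == world and h % world == 0
    return (x.reshape(w, s, world, h // world, d)
             .transpose(2, 0, 1, 3, 4)          # bring the rank-destination axis out front
             .reshape(world, w * s, h // world, d))
```

After the regrouping, each rank can run attention over the full sequence for its subset of heads, which is why only the spatial block needs the communication.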


@@ -1,5 +1,5 @@
<p align="center">
<img src="../../assets/readme/icon.png" width="250"/>
<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
</p>
<div align="center">
<a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
@@ -26,7 +26,7 @@
* **[2024.04.25]** 🤗 We released the [Gradio demo for Open-Sora](https://huggingface.co/spaces/hpcai-tech/open-sora) on Hugging Face Spaces.
* **[2024.04.25]** We released **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video, video-to-video, and infinite-time** generation. In addition, a full video processing pipeline was released. [[Model Weights]](#模型权重) [[Report]](report_v2.md) [[WeChat Article]](https://mp.weixin.qq.com/s/nkPSTep2se__tzp5OfiRQQ)
* **[2024.03.18]** We released **Open-Sora 1.0**, a fully open-source video generation project. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
<a href="https://github.com/hpcaitech/ColossalAI"><img src="/assets/readme/colossal_ai.png" width="8%" ></a>
<a href="https://github.com/hpcaitech/ColossalAI"><img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/colossal_ai.png" width="8%" ></a>
acceleration, inference, and more. Our model can produce 2s 512x512 videos with only 3 days of training. [[Model Weights]](#模型权重)
[[WeChat Article]](https://mp.weixin.qq.com/s/H52GW8i4z1Dco3Sg--tCGw) [[Report]](report_v1.md)
* **[2024.03.04]** Open-Sora now provides training with a 46% cost reduction.
@@ -77,9 +77,9 @@
| **2s 512×512** | **2s 512×512** | **2s 512×512** |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| [<img src="/assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="/assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="/assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | Drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| [<img src="/assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="/assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="/assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene of a sea turtle swimming among coral reefs. The turtle's shell is greenish-brown [...] |
Videos are downsampled to .gif for display. Click to view the original videos. Prompts are trimmed for display; see [here](/assets/texts/t2v_samples.txt) for the full prompts.
@@ -279,7 +279,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
In the Gradio application, the basic options are as follows:
![Gradio Demo](/assets/readme/gradio_basic.png)
![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_basic.png)
The easiest way to generate a video is to enter a text prompt and click the "**Generate video**" button (scroll down if you cannot find it). The generated video will be displayed in the right panel. Checking "**Enhance prompt with GPT4o**" will use GPT-4o to refine the prompt, while the "**Random Prompt**" button will have GPT-4o generate a random prompt for you. Due to OpenAI's API limits, the prompt refinement result has some randomness.
@@ -292,7 +292,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
Note that besides text-to-video, you can also use image-to-video generation. You can upload an image and then click the "**Generate video**" button to generate a video with the image as the first frame. Alternatively, you can fill in a text prompt, click the "**Generate image**" button to generate an image from the prompt, and then click the "**Generate video**" button to generate a video from that image with the same model.
![Gradio Demo](/assets/readme/gradio_option.png)
![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_option.png)
You can then specify more options, including "**Motion Strength**", "**Aesthetic**", and "**Camera Motion**". If "Enable" is not checked or the choice is "none", the information is not passed to the model. Otherwise, the model will generate videos with the specified motion strength, aesthetic score, and camera motion.
@@ -457,7 +457,7 @@ torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/i
High-quality data is crucial for training a good generative model. To this end, we establish a complete data processing pipeline that can seamlessly convert raw videos into high-quality video-text pairs. The pipeline is shown below. For details, please refer to [data processing](docs/data_processing.md). Also check out the [datasets](docs/datasets.md) we use.
![Data Processing Pipeline](/assets/readme/report_data_pipeline.png)
![Data Processing Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)
## Training


@@ -1,5 +1,5 @@
<p align="center">
<img src="../../assets/readme/icon.png" width="250"/>
<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
</p>
<div align="center">
@@ -34,9 +34,9 @@
| **2s 512×512** | **2s 512×512** | **2s 512×512** |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| [<img src="/assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="/assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="/assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| [<img src="/assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="/assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="/assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |
Videos are downsampled to `.gif` format for display; click them to view the original videos. Prompts are trimmed for display; see [here](/assets/texts/t2v_samples.txt) for the full text. Find more samples in our [gallery](https://hpcaitech.github.io/Open-Sora/).

View file

@@ -11,11 +11,11 @@ OpenAI's Sora is remarkable at generating one-minute high-quality videos. However,
As shown in the figure, in STDiT (ST stands for spatial-temporal) we insert a temporal attention layer right after each spatial attention layer. This is similar to variant 3 in the Latte paper; however, we did not control the variants to a similar number of parameters. Although the Latte paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, for the same number of iterations, performance ranks as DiT (full) > STDiT (sequential) > STDiT (parallel) ≈ Latte. We therefore chose STDiT (sequential) for efficiency. A speed benchmark is available [here](/docs/acceleration.md#efficient-stdit).
![Architecture Comparison](/assets/readme/report_arch_comp.png)
![Architecture Comparison](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch_comp.png)
To focus on video generation, we want to train our model on top of a strong image generation model. PixArt-α is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) and initialize the projection layer of each inserted temporal attention to zero. This initialization preserves the model's image generation ability at the start, whereas Latte's architecture cannot. The inserted attention increases the parameter count from 580M to 724M.
![Architecture](/assets/readme/report_arch.jpg)
![Architecture](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch.jpg)
Following the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on a 366K pretraining dataset, then 16x256x256, 16x512x512, and 64x512x512 on a 20K dataset. By scaling the position embeddings, this strategy greatly reduces the computational cost.
@@ -26,7 +26,7 @@ OpenAI's Sora is remarkable at generating one-minute high-quality videos. However,
We find that the quantity and quality of data have a great impact on the quality of generated videos, even more than the model architecture and training strategy. Currently, we only prepared the first split (366K video clips) from [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos is uneven and the captions are not accurate enough. We therefore further collected 20k relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides free-license videos. We label the videos with LLaVA, an image captioning model, using three frames and a designed prompt. With the designed prompt, LLaVA is able to generate high-quality captions.
![Caption](/assets/readme/report_caption.png)
![Caption](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_caption.png)
Since we pay closer attention to data quality, we plan to collect more data and build a video preprocessing pipeline in the next version.
@@ -38,12 +38,12 @@ OpenAI's Sora is remarkable at generating one-minute high-quality videos. However,
16x256x256 pretraining loss curve
![16x256x256 Pretraining Loss Curve](/assets/readme/report_loss_curve_1.png)
![16x256x256 Pretraining Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_1.png)
16x256x256 HQ training loss curve
![16x256x256 HQ Training Loss Curve](/assets/readme/report_loss_curve_2.png)
![16x256x256 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_2.png)
16x512x512 HQ training loss curve
![16x512x512 HQ Training Loss Curve](/assets/readme/report_loss_curve_3.png)
![16x512x512 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_3.png)

View file

@@ -45,7 +45,7 @@
</details>
![bucket](/assets/readme/report_bucket.png)
![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)
如图所示桶是分辨率帧数量宽高比的三元组。我们为不同的分辨率提供预定义的宽高比涵盖了大多数常见的视频宽高比。在每个epoch之前我们打乱数据集并将样本分配到不同的桶中如图所示。我们将样本放入最大分辨率和帧长度小于视频的桶中。
@@ -57,7 +57,7 @@
A transformer can easily be extended to support image-to-image and video-to-video tasks. We propose a masking strategy to support conditioning on images and videos. The masking strategy is illustrated in the figure below.
![mask strategy](/assets/readme/report_mask.png)
![mask strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask.png)
To turn an image or a video into another video, we typically select the frames to condition on and unmask them. During the forward pass of the ST-DiT model, the unmasked frames are assigned timestep 0, while the other frames keep their original timestep t. We find that directly applying this strategy to a trained model yields poor results, because during training the diffusion model never learned to handle frames with different timesteps within one sample.
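The per-frame timestep assignment described above amounts to the following minimal sketch (frame indices and the timestep value are illustrative):

```python
# Condition ("unmasked") frames get diffusion timestep 0, i.e. they are
# treated as clean; all other frames keep the sampled timestep t.
def per_frame_timesteps(num_frames, cond_frames, t):
    """cond_frames: set of frame indices used as the condition."""
    return [0 if i in cond_frames else t for i in range(num_frames)]
```

For image-to-video, for example, `cond_frames = {0}` keeps only the first frame clean while the rest are denoised.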
@@ -65,7 +65,7 @@ A transformer can easily be extended to support image-to-image and video-to-video tasks
The figure below illustrates the mask strategy configuration used for inference. The five-number tuple provides great flexibility in defining a mask strategy.
![mask strategy config](/assets/readme/report_mask_config.png)
![mask strategy config](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask_config.png)
A detailed explanation of mask strategy usage can be found in the [config documentation](/docs/config.md#advanced-inference-config).
@@ -74,18 +74,18 @@ A transformer can easily be extended to support image-to-image and video-to-video tasks
As we saw in the Open-Sora 1.0 release, data quantity and quality are crucial for training a good model, so we worked hard to scale up the dataset. First, we created an automatic pipeline following [SVD](https://arxiv.org/abs/2311.15127), including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions.
![pipeline](/assets/readme/report_data_pipeline.png)
![pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)
We planned to train the model with [Panda-70M](https://snap-research.github.io/Panda-70M/) and other data, roughly 30M samples in total. However, we found that disk IO became a bottleneck when training and data processing ran at the same time. We could therefore only prepare a 10M dataset, without finishing all the processing steps of our pipeline. In the end, we used 9.7M videos and 2.6M images for pretraining, and 560K videos and 1.6M images for fine-tuning. Statistics of the pretraining dataset are shown below.
Image text tokens (with the T5 tokenizer)
![image text tokens](/assets/readme/report_image_textlen.png)
![image text tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_image_textlen.png)
Video text tokens (with the T5 tokenizer). We directly use Panda's short captions for training and caption the other datasets ourselves. The generated captions are usually shorter than 200 tokens.
![video text tokens](/assets/readme/report_video_textlen.png)
![video text tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_textlen.png)
Video duration:
![video duration](/assets/readme/report_video_duration.png)
![video duration](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_duration.png)
## Training Details

View file

@@ -33,7 +33,7 @@
Considering that training a 3D VAE is computationally expensive, we want to reuse the knowledge learned by a 2D VAE. We note that after 2D VAE compression, temporally adjacent features are still highly correlated. We therefore propose a simple video compression network that first compresses the video 8x8 spatially and then 4x temporally. The network is shown below:
![video_compression_network](/assets/readme/report_3d_vae.png)
![video_compression_network](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_3d_vae.png)
We initialize the 2D VAE with [SDXL's VAE](https://huggingface.co/stabilityai/sdxl-vae), which is better than the one we used before. For the 3D VAE, we adopt the VAE structure from [Magvit-v2](https://magvit.cs.cmu.edu/v2/), which contains 300M parameters. Together with the 83M 2D VAE, the video compression network has 384M parameters in total. We train the 3D VAE for 1.2M steps with a batch size of 1. The training data are videos from Pexels and Pixabay, mostly 17 frames at 256x256 resolution. Causal convolutions are used in the 3D VAE to make image reconstruction more accurate.
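The resulting latent shape can be estimated with the sketch below. The causal first-frame handling `(T - 1) // 4 + 1` is an assumption modeled on Magvit-v2-style causal convolutions, not a confirmed implementation detail:

```python
# Back-of-the-envelope latent shape for the two-stage compression:
# 8x8 spatially (2D VAE), then 4x temporally (3D VAE). The first frame
# is assumed to be encoded on its own by the causal convolutions.
def latent_shape(frames, height, width):
    return ((frames - 1) // 4 + 1, height // 8, width // 8)
```

For the main training size of 17 frames at 256x256, this gives a 5x32x32 latent; a single image (1 frame) compresses spatially only.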
@@ -106,9 +106,9 @@ Open-Sora 1.2 starts from [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma)
MiraData and Vript come with GPT-generated captions, while we caption the rest with [PLLaVA](https://github.com/magic-research/PLLaVA). Compared with LLaVA, which can only caption single frames/images, PLLaVA is designed and trained specifically for video captioning. An [accelerated PLLaVA](/tools/caption/README.md#pllava-captioning) is released in our `tools/`. In practice, we use the pretrained PLLaVA 13B model and select 4 frames from each video to generate a caption, with a spatial pooling shape of 2*2.
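The frame selection step can be sketched as below. This is a hypothetical helper, not the actual PLLaVA preprocessing code, and the mid-interval sampling rule is an assumption:

```python
# Pick `num_samples` frames spread evenly over the video by taking the
# middle of each of `num_samples` equal intervals.
def sample_frame_indices(total_frames, num_samples=4):
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```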
Some statistics of the video data used in this stage are shown below. We provide basic statistics of duration and resolution, as well as distributions of the aesthetic score and optical flow score. We also extract object and action tags from the video captions and count their frequencies.
![stats](/assets/readme/report-03_video_stats.png)
![object_count](/assets/readme/report-03_objects_count.png)
![object_count](/assets/readme/report-03_actions_count.png)
![stats](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_video_stats.png)
![object_count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_objects_count.png)
![action_count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report-03_actions_count.png)
In this stage, we train mainly at 720p and 1080p to improve the model's performance on high-definition videos. During training we use a mask ratio of 25%. The training configuration is in [stage3.py](/configs/opensora-v1-2/train/stage3.py). We train the model for 15k steps, roughly 2 epochs.
@@ -130,12 +130,12 @@ MiraData and Vript come with GPT-generated captions, while we caption the rest with [PLLaVA](https://git
We sampled 1k videos from Pixabay as the validation dataset. We compute the evaluation loss for images at different resolutions (144p, 240p, 360p, 480p, 720p) and videos of different lengths (2s, 4s, 8s, 16s). For each setting, we sample 10 equidistant timesteps and then average all the losses.
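The timestep sampling in this protocol can be sketched as below, assuming the common 1000-step diffusion schedule (an assumption; the text does not state the schedule):

```python
# 10 equidistant timesteps per setting: the midpoint of each of 10
# equal slices of the training schedule.
def eval_timesteps(num_train_steps=1000, num_eval=10):
    step = num_train_steps // num_eval
    return [i * step + step // 2 for i in range(num_eval)]
```

Averaging the loss over these fixed timesteps makes runs comparable, since random timestep draws would add noise to the validation curve.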
![Evaluation Loss](/assets/readme/report_val_loss.png)
![Video Evaluation Loss](/assets/readme/report_vid_val_loss.png)
![Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_val_loss.png)
![Video Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_vid_val_loss.png)
In addition, we track the [VBench](https://vchitect.github.io/VBench-project/) score during training. VBench is an automatic video evaluation benchmark for short video generation. We compute the VBench score on 240p 2s videos. Both metrics verify that our model keeps improving during training.
![VBench](/assets/readme/report_vbench_score.png)
![VBench](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_vbench_score.png)
All evaluation code is released in the `eval` folder. See the [evaluation guide](/eval/README.md) for more details.
@@ -148,7 +148,7 @@ MiraData and Vript come with GPT-generated captions, while we caption the rest with [PLLaVA](https://git
We use sequence parallelism to support long-sequence training and inference. Our implementation is based on Ulysses, with the workflow shown below. With sequence parallelism enabled, we only need to apply `all-to-all` communication to the spatial blocks in STDiT, because only the computation on spatial information is interdependent along the sequence dimension.
![SP](/assets/readme/sequence_parallelism.jpeg)
![SP](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sequence_parallelism.jpeg)
Currently, we have not trained with sequence parallelism because the training data resolution is small; we plan to use it in the next version. For inference, you can use sequence parallelism in case your GPU runs out of memory. The table below shows that sequence parallelism achieves a speedup:
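Conceptually, the Ulysses all-to-all is a transpose of shards across ranks: each rank trades its sequence shard (all heads) for a head shard (full sequence), runs attention, then transposes back. A single-process sketch, purely illustrative and not our distributed implementation:

```python
# all-to-all as a shard transpose: send[src][dst] is the chunk rank
# `src` sends to rank `dst`; recv[dst] lists what rank `dst` receives.
def all_to_all(send):
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]
```

With 2 ranks, rank 0 holding sequence shard `a` and rank 1 holding `b` (each split into head groups `h0`, `h1`), after the exchange rank 0 holds head group `h0` for the full sequence and rank 1 holds `h1`.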

View file

@@ -71,7 +71,7 @@ We have also tested this Gradio app on Hugging Face Spaces. You can follow the s
## Advanced Usage
![Gradio Demo](../assets/readme/gradio_advanced.png)
![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_advanced.png)
For the "**FPS**" option, as now we fix the output video's FPS to 24, this option will not affect the output video's length. Thus, for a smaller FPS, the video is supposed to be longer but accelerated due to 24 FPS. Thus, the video will be less smooth but faster. For a larger FPS, the video will be smoother but slower.

View file

@@ -599,7 +599,7 @@ def main():
"""
<div style='text-align: center;'>
<p align="center">
<img src="https://github.com/hpcaitech/Open-Sora/raw/main/assets/readme/icon.png" width="250"/>
                <img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
</p>
<div style="display: flex; gap: 10px; justify-content: center;">
<a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>

View file

@@ -63,10 +63,10 @@ nohup torchrun --nproc_per_node 8 --standalone caption_pllava.py \
### PLLaVA vs. LLaVA
In our previous releases, we used [LLaVA](#llava-captioning) for video captioning.
Qualitatively speaking, we observe that PLLaVA has a somewhat higher chance of accurately capturing the details in the video than LLaVA. See below for a comparison on a video sample.
<!-- <img src="../../assets/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA"> -->
<!-- <img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA"> -->
<figure>
<img src="../../assets/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA">
    <img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA">
</figure>
@@ -250,4 +250,4 @@ CUDA_VISIBLE_DEVICES=1,2,3,4 python3 caption_llava_next.py \
--mm_spatial_pool_mode average \
--mm_newline_position grid \
--prompt "Please provide a detailed description of the video, focusing on the main subjects, their actions, the background scenes."
```
```