Mirror of https://github.com/hpcaitech/Open-Sora.git, synced 2026-04-14 18:25:35 +02:00

Commit: eff63fa91b "fix image"
Parent: 668694320a

Changed: README.md (10)
@@ -1,5 +1,5 @@
 <p align="center">
-<img src="./assets/readme/icon.png" width="250"/>
+<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
 </p>
 <div align="center">
 <a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
@@ -29,7 +29,7 @@ With Open-Sora, our goal is to foster innovation, creativity, and inclusivity wi
 - **[2024.04.25]** We released **Open-Sora 1.1**, which supports **2s~15s, 144p to 720p, any aspect ratio** text-to-image, **text-to-video, image-to-video, video-to-video, infinite time** generation. In addition, a full video processing pipeline is released. [[checkpoints]]() [[report]](/docs/report_02.md)
 - **[2024.03.18]** We released **Open-Sora 1.0**, a fully open-source project for video generation.
 Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with
-<a href="https://github.com/hpcaitech/ColossalAI"><img src="assets/readme/colossal_ai.png" width="8%" ></a>
+<a href="https://github.com/hpcaitech/ColossalAI"><img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/colossal_ai.png" width="8%" ></a>
 acceleration,
 inference, and more. Our model can produce 2s 512x512 videos with only 3 days training. [[checkpoints]](#open-sora-10-model-weights)
 [[blog]](https://hpc-ai.com/blog/open-sora-v1.0) [[report]](/docs/report_01.md)
@@ -287,7 +287,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
 In the Gradio application, the basic options are as follows:

-![Gradio Demo](/assets/readme/gradio_basic.png)
+![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_basic.png)

 The easiest way to generate a video is to input a text prompt and click the "**Generate video**" button (scroll down if you cannot find it). The generated video will be displayed in the right panel. Checking "**Enhance prompt with GPT4o**" will use GPT-4o to refine the prompt, while the "**Random Prompt**" button will have GPT-4o generate a random prompt for you. Due to OpenAI's API limits, the prompt refinement result has some randomness.
@@ -301,7 +301,7 @@ Then, you can choose the **resolution**, **duration**, and **aspect ratio** of t
 Note that besides text-to-video, you can also use image-to-video generation. You can upload an image and then click the "**Generate video**" button to generate a video with the image as the first frame. Alternatively, you can fill in the text prompt and click the "**Generate image**" button to generate an image from the text prompt, and then click the "**Generate video**" button to generate a video from the image generated by the same model.

-![Gradio Demo](/assets/readme/gradio_option.png)
+![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_option.png)

 Then you can specify more options, including "**Motion Strength**", "**Aesthetic**" and "**Camera Motion**". If "Enable" is not checked or the choice is "none", the information is not passed to the model. Otherwise, the model will generate videos with the specified motion strength, aesthetic score, and camera motion.
@@ -415,7 +415,7 @@ To this end, we establish a complete pipeline for data processing, which could s
 The pipeline is shown below. For detailed information, please refer to [data processing](docs/data_processing.md).
 Also check out the [datasets](docs/datasets.md) we use.

-![Data Processing Pipeline](/assets/readme/report_data_pipeline.png)
+![Data Processing Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)

 ## Training
@@ -7,11 +7,11 @@ multi_resolution = "STDiT2"
 # Condition
 prompt_path = None
 prompt = [
-    'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. {"reference_path": "assets/images/condition/cliff.png", "mask_strategy": "0"}',
-    'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png","mask_strategy": "0"}',
+    'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. {"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png", "mask_strategy": "0"}',
+    'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png","mask_strategy": "0"}',
     'A car driving on the ocean.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4","mask_strategy": "0,0,-8,0,8"}',
     'A snowy forest.{"reference_path": "https://cdn.pixabay.com/video/2021/04/25/72171-542991404_large.mp4","mask_strategy": "0,0,0,0,15,0.8"}',
-    'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}',
+    'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}',
     '|0|a white jeep equipped with a roof rack driving on a dirt road in a coniferous forest.|2|a white jeep equipped with a roof rack driving on a dirt road in the desert.|4|a white jeep equipped with a roof rack driving on a dirt road in a mountain.|6|A white jeep equipped with a roof rack driving on a dirt road in a city.|8|a white jeep equipped with a roof rack driving on a dirt road on the surface of a river.|10|a white jeep equipped with a roof rack driving on a dirt road under the lake.|12|a white jeep equipped with a roof rack flying into the sky.|14|a white jeep equipped with a roof rack driving in the universe. Earth is the background.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4", "mask_strategy": "0,0,0,0,15"}',
 ]
@@ -67,7 +67,7 @@ You can adjust the `--num-frames` and `--image-size` to generate different resul
 # image condition
 python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
     --num-frames 32 --image-size 240 426 --sample-name image-cond \
-    --prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/wave.png","mask_strategy": "0"}'
+    --prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png","mask_strategy": "0"}'

 # video extending
 python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
@@ -82,7 +82,7 @@ python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckp
 # video connecting
 python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
     --num-frames 32 --image-size 240 426 --sample-name connect \
-    --prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}'
+    --prompt 'A breathtaking sunrise scene.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset1.png;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}'

 # video editing
 python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
@@ -69,7 +69,7 @@ condition_frame_length = 4
 reference_path = [
     "https://cdn.openai.com/tmp/s/interp/d0.mp4",
     None,
-    "assets/images/condition/wave.png",
+    "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png",
 ]
 mask_strategy = [
     "0,0,0,0,8,0.3",
@@ -80,7 +80,7 @@ mask_strategy = [
 The following figure provides an illustration of the `mask_strategy`:

-![mask strategy](/assets/readme/mask_strategy.png)
+![mask strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/mask_strategy.png)

 To generate a video of unbounded length, our strategy is to first generate a video of fixed length, then use its last `condition_frame_length` frames to condition the next generation. This loops `loop` times, so the total length of the video is `loop * (num_frames - condition_frame_length) + condition_frame_length`.
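The total-length formula above can be sketched in a few lines of Python (an illustration only; the function name is ours, not the repo's API):

```python
def total_frames(loop: int, num_frames: int, condition_frame_length: int) -> int:
    # Every loop after the first reuses condition_frame_length frames
    # from the previous clip as its conditioning prefix, so only the
    # remaining (num_frames - condition_frame_length) frames are new.
    return loop * (num_frames - condition_frame_length) + condition_frame_length

# e.g. loop=5, num_frames=16, condition_frame_length=4 gives 5*12+4 = 64 frames
```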
@@ -96,7 +96,7 @@ To condition the generation on images or videos, we introduce the `mask_strategy
 To facilitate usage, we also accept passing the reference path and mask strategy as a JSON string appended to the prompt. For example,

 ```plaintext
-'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff\'s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff\'s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.{"reference_path": "assets/images/condition/cliff.png", "mask_strategy": "0"}'
+'Drone view of waves crashing against the rugged cliffs along Big Sur\'s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff\'s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff\'s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.{"reference_path": "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png", "mask_strategy": "0"}'
 ```

 ## Inference Args
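Such a prompt can be split back into text and options with a minimal parser (a hedged sketch, not the repo's actual implementation; it assumes the JSON object is the trailing `{...}` and that the prompt text itself contains no `{`):

```python
import json

def split_prompt(text: str):
    # Everything before the first '{' is the prompt; the rest is the
    # appended JSON options object.
    brace = text.find('{')
    if brace == -1:
        return text, {}
    return text[:brace], json.loads(text[brace:])

# split_prompt('A scene.{"mask_strategy": "0"}') -> ('A scene.', {'mask_strategy': '0'})
```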
@@ -244,7 +244,7 @@ This looks a bit difficult to understand at the first glance. Let's understand t
 ### Three-level bucket

-![bucket](/assets/readme/report_bucket.png)
+![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)

 We design a three-level bucket: `(resolution, num_frames, aspect_ratios)`. The resolutions and aspect ratios are predefined in [aspect.py](/opensora/datasets/aspect.py). Commonly used resolutions (e.g., 240p, 1080p) are supported, and each name represents a number of pixels (e.g., 240p is nominally 240x426, but we define 240p to cover any size whose HxW is approximately 240x426 = 102240 pixels). The aspect ratios are defined for each resolution, so you do not need to define them in the `bucket_config`.
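The pixel-count convention described above can be sketched as follows (the names, budgets, and the 10% tolerance here are illustrative; the actual definitions live in `aspect.py`):

```python
# Each resolution name maps to a pixel budget rather than a fixed shape.
RESOLUTION_PIXELS = {"240p": 240 * 426, "480p": 480 * 854}

def matches_resolution(height: int, width: int, name: str, tol: float = 0.1) -> bool:
    # A size counts as e.g. "240p" if H*W is within tol of the budget.
    target = RESOLUTION_PIXELS[name]
    return abs(height * width - target) <= tol * target
```

Under this convention, 256x426 still counts as 240p, since its pixel count is close to the 102240-pixel budget.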
@@ -3,7 +3,7 @@
 We establish a complete pipeline for video/image data processing. The pipeline is shown below.

-![pipeline](/assets/readme/report_data_pipeline.png)
+![pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)

 First, raw videos,
 either from the Internet or public datasets, are split into shorter clips based on scene detection.
@@ -10,11 +10,11 @@ The video training involves a large amount of tokens. Considering 24fps 1min vid
 As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper, although we do not control for a similar number of parameters across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).

-![Architecture Comparison](/assets/readme/report_arch_comp.png)
+![Architecture Comparison](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch_comp.png)

 To focus on video generation, we hope to train the model based on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention with zero. This initialization preserves the model's image generation ability at the beginning, which Latte's architecture cannot. The inserted attention increases the number of parameters from 580M to 724M.

-![Architecture](/assets/readme/report_arch.jpg)
+![Architecture](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_arch.jpg)

 Drawing from the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on a 366K pretraining dataset, and then 16x256x256, 16x512x512, and 64x512x512 on a 20K dataset. With scaled position embeddings, this strategy greatly reduces the computational cost.
@@ -24,7 +24,7 @@ We also try to use a 3D patch embedder in DiT. However, with 2x downsampling on
 We find that the number and quality of data have a great impact on the quality of generated videos, even larger than the model architecture and training strategy. At this time, we only prepared the first split (366K video clips) from [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not that accurate. Thus, we further collected 20K relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides videos under a free license. We caption the videos with LLaVA, an image captioning model, using three frames and a designed prompt. With this designed prompt, LLaVA can generate good-quality captions.

-![Caption](/assets/readme/report_caption.png)
+![Caption](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_caption.png)

 As we lay more emphasis on the quality of data, we plan to collect more data and build a video preprocessing pipeline in our next version.
@@ -36,14 +36,14 @@ With a limited training budgets, we made only a few exploration. We find learnin
 16x256x256 Pretraining Loss Curve

-![16x256x256 Pretraining Loss Curve](/assets/readme/report_loss_curve_1.png)
+![16x256x256 Pretraining Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_1.png)

 16x256x256 HQ Training Loss Curve

-![16x256x256 HQ Training Loss Curve](/assets/readme/report_loss_curve_2.png)
+![16x256x256 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_2.png)

 16x512x512 HQ Training Loss Curve

-![16x512x512 HQ Training Loss Curve](/assets/readme/report_loss_curve_3.png)
+![16x512x512 HQ Training Loss Curve](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_loss_curve_3.png)

 > Core Contributor: Zangwei Zheng*, Xiangyu Peng*, Shenggui Li, Hongxing Liu, Yang You
@@ -45,7 +45,7 @@ For the simplicity of implementation, we choose the bucket method. We pre-define
 </details>

-![bucket](/assets/readme/report_bucket.png)
+![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)

 As shown in the figure, a bucket is a triplet of `(resolution, num_frame, aspect_ratio)`. We provide predefined aspect ratios for each resolution that cover most common video aspect ratios. Before each epoch, we shuffle the dataset and allocate the samples to buckets as shown in the figure. We put a sample into the bucket with the largest resolution and frame length that are smaller than the video's.
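The allocation rule above can be sketched as follows (the bucket list is illustrative, not the repo's actual `bucket_config`; aspect ratio is omitted for brevity):

```python
# Predefined (pixel_count, num_frames) buckets, illustrative values only.
BUCKETS = [(256 * 256, 16), (512 * 512, 16), (512 * 512, 64)]

def assign_bucket(pixels: int, frames: int):
    # Pick the largest predefined bucket that does not exceed the
    # video's own pixel count and frame length.
    candidates = [b for b in BUCKETS if b[0] <= pixels and b[1] <= frames]
    return max(candidates) if candidates else None
```

A video larger than every bucket lands in the largest one it can fill; a video smaller than all buckets is simply dropped (returns `None` here).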
@@ -57,7 +57,7 @@ A detailed explanation of the bucket usage in training is available in [docs/con
 Transformers can be easily extended to support image-to-image and video-to-video tasks. We propose a mask strategy to support image and video conditioning. The mask strategy is shown in the figure below.

-![mask strategy](/assets/readme/report_mask.png)
+![mask strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask.png)

 Typically, we unmask the frames to be conditioned on for image/video-to-video conditioning. During the ST-DiT forward pass, unmasked frames are given timestep 0, while the other frames keep the sampled timestep t. We find that directly applying this strategy to a trained model yields poor results, as the diffusion model never learned to handle different timesteps within one sample during training.
@@ -65,7 +65,7 @@ Inspired by [UL2](https://arxiv.org/abs/2205.05131), we introduce random mask st
 An illustration of the mask strategy config used in inference is given below. A five-number tuple provides great flexibility in defining the mask strategy. By conditioning on generated frames, we can autoregressively generate an unbounded number of frames (although errors propagate).

-![mask strategy config](/assets/readme/report_mask_config.png)
+![mask strategy config](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask_config.png)

 A detailed explanation of the mask strategy usage is available in [docs/config.md](/docs/config.md#advanced-inference-config).
@@ -73,21 +73,21 @@ A detailed explanation of the mask strategy usage is available in [docs/config.m
 As we found in Open-Sora 1.0, data quantity and quality are crucial for training a good model, so we worked hard on scaling the dataset. First, we created an automatic pipeline following [SVD](https://arxiv.org/abs/2311.15127), including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions. More information can be found in [docs/data_processing.md](/docs/data_processing.md).

-![Data Pipeline](/assets/readme/report_data_pipeline.png)
+![Data Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)

 We planned to use [Panda-70M](https://snap-research.github.io/Panda-70M/) and other data to train the model, approximately 30M+ samples. However, we found disk IO to be a bottleneck when training and processing data at the same time. Thus, we could only prepare a 10M dataset, which did not go through the full processing pipeline we built. Finally, we use a dataset with 9.7M videos + 2.6M images for pre-training, and 560K videos + 1.6M images for fine-tuning. The pretraining dataset statistics are shown below. More information about the dataset can be found in [docs/datasets.md](/docs/datasets.md).

 Image text tokens (by T5 tokenizer):

-![Image Text Tokens](/assets/readme/report_image_text_tokens.png)
+![Image Text Tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_image_text_tokens.png)

 Video text tokens (by T5 tokenizer). We directly use Panda's short captions for training, and caption the other datasets ourselves. A generated caption is usually less than 200 tokens.

-![Video Text Tokens](/assets/readme/report_video_text_tokens.png)
+![Video Text Tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_text_tokens.png)

 Video duration:

-![Video Duration](/assets/readme/report_video_duration.png)
+![Video Duration](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_duration.png)

 ## Training Details
@@ -34,7 +34,7 @@ For Open-Sora 1.0 & 1.1, we used stability-ai's 83M 2D VAE, which compress the v
 Considering the high computational cost of training a 3D VAE, we hope to reuse the knowledge learnt in the 2D VAE. We notice that after the 2D VAE's compression, features adjacent in the temporal dimension are still highly correlated. Thus, we propose a simple video compression network, which first compresses the video in the spatial dimension by 8x8, then compresses it in the temporal dimension by 4x. The network is shown below:

-![video compression network](/assets/readme/report_3d_vae.png)
+![video compression network](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_3d_vae.png)

 We initialize the 2D VAE with [SDXL's VAE](https://huggingface.co/stabilityai/sdxl-vae), which is better than the one we used previously. For the 3D VAE, we adopt the structure of the VAE in [Magvit-v2](https://magvit.cs.cmu.edu/v2/), which contains 300M parameters. Along with the 83M 2D VAE, the video compression network has 384M parameters in total. We train the 3D VAE for 1.2M steps with a local batch size of 1. The training data is videos from Pexels and Pixabay, mainly 17 frames at 256x256 resolution. Causal convolutions are used in the 3D VAE to make image reconstruction more accurate.
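The two-stage compression above implies a simple latent-shape calculation (a rough sketch; it ignores the extra leading frame kept by the causal convolutions, so the temporal size is only approximate):

```python
def latent_shape(frames: int, height: int, width: int,
                 spatial: int = 8, temporal: int = 4):
    # 8x8 spatial compression by the 2D VAE, then 4x temporal
    # compression by the 3D VAE.
    return (frames // temporal, height // spatial, width // spatial)
```

For example, a 16x256x256 clip would map to a roughly 4x32x32 latent, an overall compression factor of 8 * 8 * 4 = 256.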
@@ -108,9 +108,9 @@ While MiraData and Vript have captions from GPT, we use [PLLaVA](https://github.
 Some statistics of the video data used in this stage are shown below. We present basic statistics of duration and resolution, as well as the aesthetic score and optical flow score distributions.
 We also extract tags for objects and actions from video captions and count their frequencies.
-![stats](/assets/readme/report_v1.2_video_stats.png)
+![stats](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_v1.2_video_stats.png)
-![object count](/assets/readme/report_v1.2_objects.png)
+![object count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_v1.2_objects.png)
-![action count](/assets/readme/report_v1.2_actions.png)
+![action count](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_v1.2_actions.png)

 We mainly train on 720p and 1080p videos in this stage, aiming to extend the model's ability to larger resolutions. We use a mask ratio of 25% during training. The training config is located in [stage3.py](/configs/opensora-v1-2/train/stage3.py). We train the model for 15K steps, approximately 2 epochs.
@@ -132,12 +132,12 @@ Previously, we monitor the training process only by human evaluation, as DDPM tr
 We sampled 1K videos from Pixabay as a validation dataset. We calculate the evaluation loss for images and for videos of different lengths (2s, 4s, 8s, 16s) at different resolutions (144p, 240p, 360p, 480p, 720p). For each setting, we equidistantly sample 10 timesteps, then average all the losses. We also provide a [video](https://streamable.com/oqkkf1) showing the videos sampled with a fixed prompt at different training steps.

-![Evaluation Loss](/assets/readme/report_val_loss.png)
+![Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_val_loss.png)
-![Image Evaluation Loss](/assets/readme/report_val_loss_img.png)
+![Image Evaluation Loss](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_val_loss_img.png)

 In addition, we also keep track of [VBench](https://vchitect.github.io/VBench-project/) scores during training. VBench is an automatic video evaluation benchmark for short video generation. We calculate the VBench score with 240p 2s videos. The two metrics verify that our model continues to improve during training.

-![VBench](/assets/readme/report_vbench.png)
+![VBench](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_vbench.png)

 All the evaluation code is released in the `eval` folder. Check the [README](/eval/README.md) for more details.
@@ -150,7 +150,7 @@ All the evaluation code is released in `eval` folder. Check the [README](/eval/R
 We use sequence parallelism to support long-sequence training and inference. Our implementation is based on Ulysses, and the workflow is shown below. When sequence parallelism is enabled, we only need to apply the `all-to-all` communication to the spatial block in STDiT, as only the spatial computation depends on the sequence dimension.

-![sequence parallelism](/assets/readme/report_sp.png)
+![sequence parallelism](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_sp.png)

 Currently, we have not used sequence parallelism for training, as the data resolution is small; we plan to do so in the next release. For inference, you can use sequence parallelism in case your GPU runs out of memory. A simple benchmark shows that sequence parallelism can achieve speedup
@@ -1,5 +1,5 @@
 <p align="center">
-<img src="../../assets/readme/icon.png" width="250"/>
+<img src="../..https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
 </p>
 <div align="center">
 <a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
@@ -59,9 +59,9 @@
 | **2s 512×512** | **2s 512×512** | **2s 512×512** |
 | --- | --- | --- |
-| [<img src="/assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="/assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="/assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
+| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
 | A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
-| [<img src="/assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="/assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="/assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
+| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
 | A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |

 Videos are downsampled to .gif for display. Click to view the original videos. Prompts are trimmed for display; see [here](/assets/texts/t2v_samples.txt) for the full prompts.
@@ -252,7 +252,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
 In the Gradio application, the basic options are as follows:

-![Gradio Demo](/assets/readme/gradio_basic.png)
+![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_basic.png)

 The easiest way to generate a video is to input a text prompt and click the "**Generate video**" button (scroll down if you cannot find it). The generated video will be displayed in the right panel. Checking "**Enhance prompt with GPT4o**" will use GPT-4o to refine the prompt, while the "**Random Prompt**" button will have GPT-4o generate a random prompt for you. Due to OpenAI's API limits, the prompt refinement result has some randomness.
@@ -266,7 +266,7 @@ export OPENAI_API_KEY=YOUR_API_KEY
 Note that besides text-to-video, you can also use image-to-video generation. You can upload an image and then click the "**Generate video**" button to generate a video with the image as the first frame. Alternatively, you can fill in the text prompt and click the "**Generate image**" button to generate an image from the text prompt, and then click the "**Generate video**" button to generate a video from the image generated by the same model.

-![Gradio Demo](/assets/readme/gradio_option.png)
+![Gradio Demo](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/gradio_option.png)

 Then you can specify more options, including "**Motion Strength**", "**Aesthetic**" and "**Camera Motion**". If "Enable" is not checked or the choice is "none", the information is not passed to the model. Otherwise, the model will generate videos with the specified motion strength, aesthetic score, and camera motion.
@@ -341,7 +341,7 @@ torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/i
 High-quality data is crucial for training a good generative model. To this end, we establish a complete data processing pipeline that can seamlessly convert raw videos into high-quality video-text pairs. The pipeline is shown below. For detailed information, please refer to [data processing](docs/data_processing.md). Also check out the [datasets](docs/datasets.md) we use.

-![Data Processing Pipeline](/assets/readme/report_data_pipeline.png)
+![Data Processing Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)

 ## Training
@@ -1,5 +1,5 @@
 <p align="center">
-<img src="../../assets/readme/icon.png" width="250"/>
+<img src="../..https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
 <p>

 <div align="center">
@@ -34,9 +34,9 @@
 | **2s 512×512** | **2s 512×512** | **2s 512×512** |
 | --- | --- | --- |
-| [<img src="/assets/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="/assets/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="/assets/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
+| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_0.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/de1963d3-b43b-4e68-a670-bb821ebb6f80) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_1.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/13f8338f-3d42-4b71-8142-d234fbd746cc) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_2.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/fa6a65a6-e32a-4d64-9a9e-eabb0ebb8c16) |
 | A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
-| [<img src="/assets/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="/assets/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="/assets/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
+| [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_3.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/64232f84-1b36-4750-a6c0-3e610fa9aa94) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_4.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/983a1965-a374-41a7-a76b-c07941a6c1e9) | [<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/sample_5.gif" width="">](https://github.com/hpcaitech/Open-Sora/assets/99191637/ec10c879-9767-4c31-865f-2e8d6cf11e65) |
 | A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |

 Videos are downsampled to `.gif` for display. Click to view the original videos. Text is trimmed for display; see [here](/assets/texts/t2v_samples.txt) for the full prompts. See more samples in our [gallery](https://hpcaitech.github.io/Open-Sora/).
@@ -45,7 +45,7 @@
 </details>

-![bucket](/assets/readme/report_bucket.png)
+![bucket](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_bucket.png)

 As shown in the figure, a bucket is a triplet of (resolution, number of frames, aspect ratio). We provide predefined aspect ratios for each resolution that cover most common video aspect ratios. Before each epoch, we shuffle the dataset and allocate the samples to buckets as shown in the figure. We put a sample into the bucket with the largest resolution and frame length that are smaller than the video's.
@@ -57,7 +57,7 @@
 Transformers can be easily extended to support image-to-image and video-to-video tasks. We propose a mask strategy to support image and video conditioning. The mask strategy is shown in the figure below.

-![mask strategy](/assets/readme/report_mask.png)
+![mask strategy](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask.png)

 Typically, we unmask the frames to be conditioned on for image/video-to-video conditioning. During the ST-DiT forward pass, unmasked frames are given timestep 0, while the other frames keep their original timestep t. We find that directly applying this strategy to a trained model yields poor results, as the diffusion model never learned to handle different timesteps within one sample during training.
@@ -65,7 +65,7 @@ Transformers can be easily extended to support image-to-image and video-to-video tasks
 An illustration of the mask strategy config used in inference is given below. A five-number tuple provides great flexibility in defining the mask strategy.

-![mask strategy config](/assets/readme/report_mask_config.png)
+![mask strategy config](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_mask_config.png)

 A detailed explanation of the mask strategy usage is available in [docs/config.md](/docs/config.md#advanced-inference-config).
@@ -74,18 +74,18 @@ Transformers can be easily extended to support image-to-image and video-to-video tasks
 As we found in Open-Sora 1.0, data quantity and quality are crucial for training a good model, so we worked hard on scaling the dataset. First, we created an automatic pipeline following [SVD](https://arxiv.org/abs/2311.15127), including scene cutting, captioning, various scoring and filtering, and dataset management scripts and conventions.

-![Data Pipeline](/assets/readme/report_data_pipeline.png)
+![Data Pipeline](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_data_pipeline.png)

 We planned to use [Panda-70M](https://snap-research.github.io/Panda-70M/) and other data to train the model, approximately 30M+ samples. However, we found disk IO to be a bottleneck when training and processing data at the same time. Thus, we could only prepare a 10M dataset, which did not go through the full processing pipeline we built. Finally, we use a dataset with 9.7M videos + 2.6M images for pre-training, and 560K videos + 1.6M images for fine-tuning. The pretraining dataset statistics are shown below.

 Image text tokens (by T5 tokenizer):
-![Image Text Tokens](/assets/readme/report_image_text_tokens.png)
+![Image Text Tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_image_text_tokens.png)

 Video text tokens (by T5 tokenizer). We directly use Panda's short captions for training, and caption the other datasets ourselves. A generated caption is usually less than 200 tokens.
-![Video Text Tokens](/assets/readme/report_video_text_tokens.png)
+![Video Text Tokens](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_text_tokens.png)

 Video duration:
-![Video Duration](/assets/readme/report_video_duration.png)
+![Video Duration](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_video_duration.png)

 ## Training Details
@@ -33,7 +33,7 @@
 Considering the high computational cost of training a 3D VAE, we hope to reuse the knowledge learnt in the 2D VAE. We notice that after the 2D VAE's compression, features adjacent in the temporal dimension are still highly correlated. Thus, we propose a simple video compression network, which first compresses the video in the spatial dimension by 8x8, then compresses it in the temporal dimension by 4x. The network is shown below:

-![video compression network](/assets/readme/report_3d_vae.png)
+![video compression network](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/report_3d_vae.png)

 We initialize the 2D VAE with [SDXL's VAE](https://huggingface.co/stabilityai/sdxl-vae), which is better than the one we used previously. For the 3D VAE, we adopt the structure of the VAE in [Magvit-v2](https://magvit.cs.cmu.edu/v2/), which contains 300M parameters. Along with the 83M 2D VAE, the video compression network has 384M parameters in total. We train the 3D VAE for 1.2M steps with a batch size of 1. The training data is videos from Pexels and Pixabay, mainly 17 frames at 256x256 resolution. Causal convolutions are used in the 3D VAE to make image reconstruction more accurate.
@@ -125,9 +125,9 @@ Open-Sora 1.2 starts from [PixArt-Σ 2K](https://github.com/PixArt-alpha/PixArt-sigma)
We sampled 1k videos from Pixabay as a validation set. We compute the evaluation loss on images at different resolutions (144p, 240p, 360p, 480p, 720p) and on videos of different lengths (2s, 4s, 8s, 16s). For each setting, we sample 10 equidistant timesteps, then average all the losses.
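The timestep-sampling part of this protocol can be sketched as follows (the 1000-step diffusion schedule and the exact spacing are assumptions, not the actual evaluation script):

```python
def eval_timesteps(num_train_steps=1000, num_eval=10):
    """10 equidistant timesteps across the diffusion schedule; the eval
    loss is computed at each and then averaged."""
    stride = num_train_steps // num_eval
    return [i * stride + stride // 2 for i in range(num_eval)]

print(eval_timesteps())  # → [50, 150, 250, 350, 450, 550, 650, 750, 850, 950]
```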




In addition, we track the [VBench](https://vchitect.github.io/VBench-project/) score during training. VBench is an automatic evaluation benchmark for short video generation; we compute the score on 240p 2s videos. Both metrics confirm that our model improves steadily during training.


@@ -206,14 +206,14 @@ function run_video_h() { # 61min
--prompt-path assets/texts/t2v_ref.txt --start-index 0 --end-index 3 \
--num-frames 2s --resolution 360p --aspect-ratio 9:16 \
--loop 5 --condition-frame-length 5 \
--reference-path assets/images/condition/cliff.png assets/images/condition/wave.png assets/images/condition/ship.png \
--reference-path https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/ship.png \
--mask-strategy "0" "0" "0" --batch-size $DEFAULT_BS
eval $CMD --ckpt-path $CKPT --save-dir $OUTPUT --sample-name ref_L5C10_16s_360p_9_16 \
--prompt-path assets/texts/t2v_ref.txt --start-index 0 --end-index 3 \
--num-frames 16s --resolution 360p --aspect-ratio 9:16 \
--loop 5 --condition-frame-length 10 \
--reference-path assets/images/condition/cliff.png assets/images/condition/wave.png assets/images/condition/ship.png \
--reference-path https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/wave.png https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/ship.png \
--mask-strategy "0" "0" "0" --batch-size $DEFAULT_BS
# 3.2
@@ -221,7 +221,7 @@ function run_video_h() { # 61min
--prompt-path assets/texts/t2v_ref.txt --start-index 3 --end-index 6 \
--num-frames 16s --resolution 360p --aspect-ratio 9:16 \
--loop 1 \
--reference-path assets/images/condition/cliff.png "assets/images/condition/cactus-sad.png\;assets/images/condition/cactus-happy.png" https://cdn.openai.com/tmp/s/interp/d0.mp4 \
--reference-path https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cliff.png "https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cactus-sad.png\;https://github.com/hpcaitech/Open-Sora-Demo/blob/main/images/condition/cactus-happy.png" https://cdn.openai.com/tmp/s/interp/d0.mp4 \
--mask-strategy "0" "0\;0,1,0,-1,1" "0,0,0,0,${QUAD_FRAMES},0.5" --batch-size $DEFAULT_BS
}
@@ -71,7 +71,7 @@ We have also tested this Gradio app on Hugging Face Spaces. You can follow the s
## Advanced Usage


For the "**FPS**" option: since the output video's FPS is currently fixed at 24, this option does not change the output video's length. With a smaller sampling FPS, the clip covers a longer real-time span but is played back accelerated at 24 FPS, so the video looks less smooth and faster; with a larger sampling FPS, the video looks smoother but slower.
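To make the trade-off concrete, here is a tiny sketch of the playback relationship (illustrative only; `OUTPUT_FPS` mirrors the fixed 24 FPS mentioned above, and the helper name is hypothetical):

```python
OUTPUT_FPS = 24  # the app always encodes the output at 24 FPS

def playback_speedup(sampling_fps):
    """How much faster real time appears when frames sampled at
    sampling_fps are played back at the fixed 24 FPS."""
    return OUTPUT_FPS / sampling_fps

print(playback_speedup(12))  # → 2.0: 12-FPS sampling plays back 2x faster
```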
@@ -183,7 +183,6 @@ from opensora.utils.inference_utils import (
prepare_multi_resolution_info,
refine_prompts_by_openai,
split_prompt,
has_openai_key
)
from opensora.utils.misc import to_torch_dtype
@@ -494,7 +493,7 @@ def main():
"""
<div style='text-align: center;'>
<p align="center">
<img src="https://github.com/hpcaitech/Open-Sora/raw/main/assets/readme/icon.png" width="250"/>
<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/icon.png" width="250"/>
</p>
<div style="display: flex; gap: 10px; justify-content: center;">
<a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a>
@@ -63,10 +63,10 @@ nohup python caption_pllava.py \
### PLLaVA vs. LLaVA
In our previous releases, we used [LLaVA](#llava-captioning) for video captioning.
Qualitatively speaking, we observe that PLLaVA has a somewhat higher chance of accurately capturing the details in the video than LLaVA. See below for a comparison on a video sample.
<!-- <img src="../../assets/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA"> -->
<!-- <img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA"> -->
<figure>
<img src="../../assets/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA">
<img src="https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/llava_vs_pllava_sample.gif" width="300" height="200" alt="LLaVA vs PLLaVA">
</figure>