diff --git a/README.md b/README.md
index 3b63cae..3ebf6fd 100644
--- a/README.md
+++ b/README.md
@@ -37,15 +37,16 @@ With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the
## 🎥 Latest Demo
+More samples are available in our [gallery](https://hpcaitech.github.io/Open-Sora/).
-| **2s 240×426** | **2s 240×426** |
-| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c31ebc52-de39-4a4e-9b1e-9211d45e05b2) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c31ebc52-de39-4a4e-9b1e-9211d45e05b2) |
-| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/f7ce4aaa-528f-40a8-be7a-72e61eaacbbd) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/5d58d71e-1fda-4d90-9ad3-5f2f7b75c6a9) |
+| **2s 240×426** | **2s 240×426** |
+| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c31ebc52-de39-4a4e-9b1e-9211d45e05b2) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c31ebc52-de39-4a4e-9b1e-9211d45e05b2) |
+| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/f7ce4aaa-528f-40a8-be7a-72e61eaacbbd) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/5d58d71e-1fda-4d90-9ad3-5f2f7b75c6a9) |
-| **2s 426×240** | **2s 426×240** | **4s 480×854** |
-| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/34ecb4a0-4eef-4286-ad4c-8e3a87e5a9fd) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/3e892ad2-9543-4049-b005-643a4c1bf3bf) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c1619333-25d7-42ba-a91c-18dbc1870b18) |
+| **2s 426×240** | **2s 426×240** | **4s 480×854** |
+| ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/34ecb4a0-4eef-4286-ad4c-8e3a87e5a9fd) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/3e892ad2-9543-4049-b005-643a4c1bf3bf) | [](https://github.com/hpcaitech/Open-Sora-dev/assets/99191637/c1619333-25d7-42ba-a91c-18dbc1870b18) |
@@ -63,8 +64,6 @@ see [here](/assets/texts/t2v_samples.txt) for full prompts.
-More samples are available in our [gallery](https://hpcaitech.github.io/Open-Sora/).
-
## 🔆 New Features/Updates
* 📍 **Open-Sora 1.1** released. Model weights are available [here](). It is trained on videos of **0s~15s** duration, **144p to 720p** resolution, and **various aspect ratios**. See our **[report 1.1](docs/report_02.md)** for more discussions.
@@ -176,11 +175,10 @@ pip install -v .
### Open-Sora 1.1 Model Weights
-| Resolution | Data | #iterations | Batch Size | URL |
-| ---------- | --------------------- | ----------- | ---------- | --------------------------------------------------------------------------------------------- |
-| dynamic | 10M videos + 2M images | 100 | dynamic | [:link:](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage2) |
-| dynamic | 20K HQ | 4k | dynamic | [:link:](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage3) |
-
+| Resolution | Data | #iterations | Batch Size | URL |
+| ------------------ | -------------------------- | ----------- | ------------------------------------------------- | -------------------------------------------------------------------- |
+| mainly 144p & 240p | 10M videos + 2M images | 100k | [dynamic](/configs/opensora-v1-1/train/stage2.py) | [:link:](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage2) |
+| 144p to 720p | 500K HQ videos + 1M images | 4k | [dynamic](/configs/opensora-v1-1/train/stage3.py) | [:link:](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage3) |
### Open-Sora 1.0 Model Weights
@@ -223,12 +221,12 @@ This will launch a Gradio application on your localhost. If you want to know mor
Since Open-Sora 1.1 supports inference with dynamic input size, you can pass the input size as an argument.
```bash
-# video sampling
+# text to video
python scripts/inference.py configs/opensora-v1-1/inference/sample.py \
--ckpt-path CKPT_PATH --prompt "A beautiful sunset over the city" --num-frames 32 --image-size 480 854
```
-See [here](docs/commands.md#inference-with-open-sora-11) for more instructions.
+See [here](docs/commands.md#inference-with-open-sora-11) for more instructions, including text-to-image, image-to-video, video-to-video, and infinite-time generation.
### Open-Sora 1.0 Command Line Inference
diff --git a/configs/opensora-v1-1/inference/sample-ref.py b/configs/opensora-v1-1/inference/sample-ref.py
index 557bb70..735c01b 100644
--- a/configs/opensora-v1-1/inference/sample-ref.py
+++ b/configs/opensora-v1-1/inference/sample-ref.py
@@ -14,18 +14,26 @@ prompt = [
loop = 2
condition_frame_length = 4
-reference_path = [
- "https://cdn.openai.com/tmp/s/interp/d0.mp4",
- None,
- "assets/images/condition/wave.png",
-]
-# valid when reference_path is not None
-# (loop id, ref id, ref start, target start, length, edit_ratio)
+# (
+# loop id, [the loop index of the condition image or video]
+# reference id, [the index of the condition image or video in the reference_path]
+# reference start, [the start frame of the condition image or video]
+# target start, [the start frame in the generated video at which to insert]
+# length, [the number of frames to insert]
+# edit_ratio [the edit rate of the condition image or video]
+# )
+# See https://github.com/hpcaitech/Open-Sora/blob/main/docs/config.md#advanced-inference-config for more details
+# See https://github.com/hpcaitech/Open-Sora/blob/main/docs/commands.md#inference-with-open-sora-11 for more examples
mask_strategy = [
"0,0,0,0,8,0.3",
None,
"0",
]
+reference_path = [
+ "https://cdn.openai.com/tmp/s/interp/d0.mp4",
+ None,
+ "assets/images/condition/wave.png",
+]
# Define model
model = dict(
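For reference, the six-field `mask_strategy` entries documented in the comments above ("loop id, reference id, reference start, target start, length, edit_ratio") can be sketched as plain Python. This parser is a hypothetical illustration of the format, not code from the repository; the default values for omitted trailing fields (e.g. `"0"` or `"0,0,0,-8,8"`) are assumptions for the sketch.

```python
def parse_mask_strategy(entry):
    """Split a ';'-separated mask_strategy string into labeled fields."""
    # Assumed defaults for omitted trailing fields (illustrative only).
    defaults = {
        "loop_id": 0,
        "ref_id": 0,
        "ref_start": 0,
        "target_start": 0,
        "length": 1,
        "edit_ratio": 0.0,
    }
    keys = list(defaults)
    strategies = []
    for part in entry.split(";"):
        fields = dict(defaults)
        # zip stops at the shorter sequence, so short entries keep defaults.
        for key, value in zip(keys, part.split(",")):
            fields[key] = float(value) if key == "edit_ratio" else int(value)
        strategies.append(fields)
    return strategies

# "0,0,0,0,8,0.3": one reference, 8 frames, edit ratio 0.3
single = parse_mask_strategy("0,0,0,0,8,0.3")
# "0;0,1,0,-1,1": two references; negative target start counts from the end
pair = parse_mask_strategy("0;0,1,0,-1,1")
```

Negative indices (as in `"0,0,0,-8,8"` for video extending) simply pass through `int()`, matching the convention of counting from the end of the clip.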
diff --git a/docs/commands.md b/docs/commands.md
index 944fc88..2d7420f 100644
--- a/docs/commands.md
+++ b/docs/commands.md
@@ -51,11 +51,30 @@ You can adjust the `--num-frames` and `--image-size` to generate different resul
`inference-long.py` is compatible with `inference.py` and supports advanced features.
```bash
-# long video generation
# image condition
+python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
+ --num-frames 32 --image-size 240 426 --sample-name image-cond \
+ --prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/wave.png","mask_strategy": "0"}'
+
# video extending
+python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
+ --num-frames 32 --image-size 240 426 --sample-name video-extend \
+ --prompt 'A car driving on the ocean.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4","mask_strategy": "0,0,0,-8,8"}'
+
+# long video generation
+python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
+ --num-frames 32 --image-size 240 426 --loop 16 --condition-frame-length 8 --sample-name long \
+ --prompt '|0|a white jeep equipped with a roof rack driving on a dirt road in a coniferous forest.|2|a white jeep equipped with a roof rack driving on a dirt road in the desert.|4|a white jeep equipped with a roof rack driving on a dirt road in a mountain.|6|A white jeep equipped with a roof rack driving on a dirt road in a city.|8|a white jeep equipped with a roof rack driving on a dirt road on the surface of a river.|10|a white jeep equipped with a roof rack driving on a dirt road under the lake.|12|a white jeep equipped with a roof rack flying into the sky.|14|a white jeep equipped with a roof rack driving in the universe. Earth is the background.{"reference_path": "https://cdn.openai.com/tmp/s/interp/d0.mp4", "mask_strategy": "0,0,0,0,16"}'
+
# video connecting
+python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
+ --num-frames 32 --image-size 240 426 --sample-name connect \
+ --prompt 'A breathtaking sunrise scene.{"reference_path": "assets/images/condition/sunset1.png;assets/images/condition/sunset2.png","mask_strategy": "0;0,1,0,-1,1"}'
+
# video editing
+python scripts/inference-long.py configs/opensora-v1-1/inference/sample.py --ckpt-path CKPT_PATH \
+ --num-frames 32 --image-size 480 853 --sample-name edit \
+ --prompt 'A cyberpunk-style city at night.{"reference_path": "https://cdn.pixabay.com/video/2021/10/12/91744-636709154_large.mp4","mask_strategy": "0,0,0,0,32,0.4"}'
```
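The commands above embed the per-prompt options as a JSON object appended directly to the prompt text (carrying `reference_path` and `mask_strategy`). A minimal sketch of how such a prompt could be split, assuming the JSON always begins at the first `{`; this is a hypothetical helper for illustration, not the repository's actual prompt parser.

```python
import json

def split_prompt(prompt):
    """Separate the text part of a prompt from its trailing JSON options."""
    brace = prompt.find("{")  # assumes the options object starts at the first "{"
    if brace == -1:
        return prompt, {}
    return prompt[:brace], json.loads(prompt[brace:])

text, opts = split_prompt(
    'A breathtaking sunrise scene.'
    '{"reference_path": "assets/images/condition/wave.png","mask_strategy": "0"}'
)
```

Prompts without a trailing object (plain text-to-video) come back with an empty options dict.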
### Inference with DiT pretrained on ImageNet
diff --git a/docs/report_02.md b/docs/report_02.md
index 9d2c1eb..ec54853 100644
--- a/docs/report_02.md
+++ b/docs/report_02.md
@@ -106,7 +106,7 @@ To summarize, the training of Open-Sora 1.1 requires approximately **9 days** on
As we get one step closer to the replication of Sora, we find many limitations of the current model, and these limitations point to future work.
-- **Generation Failure**: we fine many cases (especially when the total token number is large or the content is complex), our model fails to generate the scene. There may be a collapse in the temporal attention and we have identified a potential bug in our code. We are working hard to fix it.
+- **Generation Failure**: we find many cases (especially when the total token number is large or the content is complex) where our model fails to generate the scene. There may be a collapse in the temporal attention, and we have identified a potential bug in our code. We are working hard to fix it. Besides, we will increase our model size and training data to improve the generation quality in the next version.
- **Noisy generation and lack of fluency**: we find the generated videos are sometimes noisy and not fluent, especially for long videos. We think the problem is due to not using a temporal VAE. As [Pixart-Sigma](https://arxiv.org/abs/2403.04692) finds that adapting to a new VAE is simple, we plan to develop a temporal VAE for the model in the next version.
- **Lack of time consistency**: we find the model cannot generate videos with high time consistency. We think the problem is due to the lack of training FLOPs. We plan to collect more data and continue training the model to improve the time consistency.
- **Bad human generation**: We find the model cannot generate high-quality human videos. We think the problem is due to the lack of human data. We plan to collect more human data and continue training the model to improve the human generation.