From 5ac84ca80c9e557e2b2a4be598e7ca08fc4578c3 Mon Sep 17 00:00:00 2001
From: Arshad Nazir
Date: Thu, 20 Feb 2025 08:36:05 +0000
Subject: [PATCH] Just a spelling mistake (#776)

---
 docs/report_02.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/report_02.md b/docs/report_02.md
index ab7cb0b..9cdacc8 100644
--- a/docs/report_02.md
+++ b/docs/report_02.md
@@ -106,7 +106,7 @@ To summarize, the training of Open-Sora 1.1 requires approximately **9 days** on
 
 As we get one step closer to the replication of Sora, we find many limitations for the current model, and these limitations point to the future work.
 
-- **Generation Failure**: we fine many cases (especially when the total token number is large or the content is complex), our model fails to generate the scene. There may be a collapse in the temporal attention and we have identified a potential bug in our code. We are working hard to fix it. Besides, we will increase our model size and training data to improve the generation quality in the next version.
+- **Generation Failure**: we find many cases (especially when the total token number is large or the content is complex), our model fails to generate the scene. There may be a collapse in the temporal attention and we have identified a potential bug in our code. We are working hard to fix it. Besides, we will increase our model size and training data to improve the generation quality in the next version.
 - **Noisy generation and influency**: we find the generated model is sometimes noisy and not fluent, especially for long videos. We think the problem is due to not using a temporal VAE. As [Pixart-Sigma](https://arxiv.org/abs/2403.04692) finds that adapting to a new VAE is simple, we plan to develop a temporal VAE for the model in the next version.
 - **Lack of time consistency**: we find the model cannot generate videos with high time consistency. We think the problem is due to the lack of training FLOPs. We plan to collect more data and continue training the model to improve the time consistency.
 - **Bad human generation**: We find the model cannot generate high-quality human videos. We think the problem is due to the lack of human data. We plan to collect more human data and continue training the model to improve the human generation.