Open-Sora: Towards Open Reproduction of Sora
Open-Sora is an open-source initiative dedicated to efficiently reproducing OpenAI's Sora. Our project aims to cover the full pipeline, including video data preprocessing, training with acceleration, efficient inference and more. Operating on a limited budget, we prioritize the vibrant open-source community, providing access to text-to-image, image captioning, and language models. We hope to make a contribution to the community and make the project more accessible to everyone.
📰 News
- [2024.03.18] 🔥 We release Open-Sora 1.0, an open-source project to reproduce OpenAI Sora. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with acceleration, inference, and more. Our provided checkpoint can produce 2s 512x512 videos.
🎥 Latest Demo
[Demo: three sample videos, each 2s 512x512]
Videos are downsampled to .gif. Click the video for original ones.
🔆 New Features/Updates
- 📍 Open-Sora-v1 is trained on xxx. We train the model in three stages. Model weights are available here. Training details can be found here. [WIP]
- ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64x512x512 videos. Details can be found in acceleration.md.
- ✅ We provide video cutting and captioning tools for data preprocessing. Instructions can be found here and our data collection plan can be found at datasets.md.
- ✅ We find that the VQ-VAE from VideoGPT produces low-quality results and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension degrades quality. See our report for more discussion.
- ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
- ✅ Support CLIP and T5 text conditioning.
- ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See command.md for more instructions.
- ✅ Support inference with official weights from DiT, Latte, and PixArt.
- ✅ Refactor the codebase. See structure.md to learn the project structure and how to use the config files.
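The "images as one-frame videos" idea from the list above can be illustrated with a minimal sketch on tensor shapes. The `(C, T, H, W)` layout and the helper name `as_video_shape` are assumptions made for illustration, not the project's actual data layout or API.

```python
from typing import Tuple


def as_video_shape(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Map an image shape (C, H, W) or a video shape (C, T, H, W) to (C, T, H, W).

    The channel-first layout with the time axis in position 1 is an assumption.
    """
    if len(shape) == 3:
        # Image: insert a time axis of length 1, i.e. a one-frame video.
        c, h, w = shape
        return (c, 1, h, w)
    if len(shape) == 4:
        # Video: already has a time axis, keep it unchanged.
        return tuple(shape)
    raise ValueError(f"expected 3 or 4 dimensions, got {len(shape)}")


print(as_video_shape((3, 256, 256)))      # image treated as a one-frame video
print(as_video_shape((3, 16, 256, 256)))  # 16-frame video, unchanged
```

With a shared layout like this, the same model code path can consume both image datasets and video datasets.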
TODO list sorted by priority
- Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See datasets.md for more information. [WIP]
- Training Video-VAE. [WIP]
- Support image and video conditioning.
- Evaluation pipeline.
- Incorporate a better scheduler, e.g., rectified flow in SD3.
- Support variable aspect ratios, resolutions, durations.
- Support SD3 when released.
Installation
# create a virtual env and activate it
conda create -n opensora python=3.10
conda activate opensora
# install torch
# the command below is for CUDA 12.1, choose install commands from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision
# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v -e .
After installation, we suggest reading structure.md to learn the project structure and how to use the config files.
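To check that the environment is set up as expected, a small script along these lines may help. It is only a sketch: the import names (`flash_attn`, `apex`, `xformers`, `opensora`) are inferred from the install commands above, and the split into required and optional packages follows the "(optional)" comments there.

```python
import importlib.util

# Import names assumed from the install commands above.
REQUIRED = ["torch", "torchvision", "xformers", "opensora"]
OPTIONAL = ["flash_attn", "apex"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def report():
    """Print which required and optional packages are missing."""
    for label, modules in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = find_missing(modules)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")


if __name__ == "__main__":
    report()
```

The check only looks up whether each package can be located; it does not import it, so it will not catch a package that is installed but built against a mismatched CUDA version.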
Model Weights
| Model | #Params | url |
|---|---|---|
| 16x256x256 | | |
Inference
To run inference with our provided weights, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then run the following commands to generate samples. See here to customize the configuration.
# Sample 16x256x256 (~2s)
python scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 16x512x512 (~2s)
python scripts/inference.py configs/opensora/inference/16x512x512.py
# Sample 64x512x512 (~5s)
python scripts/inference.py configs/opensora/inference/64x512x512.py
For inference with other models, see here for more instructions.
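The commands above can also be assembled from Python, for example to loop over several configs. The sketch below only builds the argument list and checks that the T5 directory is populated; the helper names `build_inference_command` and `t5_weights_present` are hypothetical and not part of the repository.

```python
import sys
from pathlib import Path

# Location of the T5 weights, as stated in the instructions above.
T5_DIR = Path("pretrained_models/t5_ckpts/t5-v1_1-xxl")


def build_inference_command(config, ckpt_path=None):
    """Build the argument list for scripts/inference.py (hypothetical helper)."""
    cmd = [sys.executable, "scripts/inference.py", config]
    if ckpt_path is not None:
        cmd += ["--ckpt-path", ckpt_path]
    return cmd


def t5_weights_present(t5_dir=T5_DIR):
    """Return True if the T5 directory exists and contains at least one entry."""
    t5_dir = Path(t5_dir)
    return t5_dir.is_dir() and any(t5_dir.iterdir())


cmd = build_inference_command(
    "configs/opensora/inference/16x256x256.py",
    ckpt_path="./path/to/your/ckpt.pth",
)
print(" ".join(cmd))
```

From the repository root, the resulting list can be passed to `subprocess.run(cmd, check=True)` once `t5_weights_present()` returns `True`.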
Data Processing
Split video into clips
We provide code to split a long video into separate clips efficiently using multiprocessing. See tools/data/scene_detect.py.
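As a rough illustration of the idea (not the implementation in tools/data/scene_detect.py), the sketch below cuts a video into clips in parallel with `multiprocessing.Pool`. It assumes `ffmpeg` is on the `PATH` and that scene boundaries are already available as `(start, end)` pairs in seconds; the function names and the output file naming are hypothetical.

```python
import subprocess
from multiprocessing import Pool


def ffmpeg_clip_command(video, start, end, output):
    """Build an ffmpeg command that copies the span from start to end (seconds)."""
    return [
        "ffmpeg", "-y", "-i", video,
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-c", "copy", output,
    ]


def run_command(cmd):
    """Run one command and return its exit code (top-level so Pool can pickle it)."""
    return subprocess.run(cmd, check=False).returncode


def split_video(video, scenes, workers=4):
    """Cut `video` into one clip per (start, end) pair using a process pool."""
    cmds = [
        ffmpeg_clip_command(video, start, end, f"clip_{i:04d}.mp4")
        for i, (start, end) in enumerate(scenes)
    ]
    with Pool(workers) as pool:
        return pool.map(run_command, cmds)


# Example call (requires ffmpeg and a real input file):
# split_video("input.mp4", [(0.0, 2.5), (2.5, 7.0)], workers=2)
```

Because the clips are written with stream copy (`-c copy`), cut points snap to nearby keyframes; re-encoding would give frame-accurate cuts at the cost of speed.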
Generate video caption
Training
To launch training, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then run the following commands to launch training on a single node.
# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
To launch training on multiple nodes, prepare a hostfile following the ColossalAI documentation, and run the following command.
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
For training other models and advanced usage, see here for more instructions.
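The training commands take a CSV file through `--data-path`, but this section does not specify its layout. The sketch below writes one row per sample with a video path followed by its caption; this two-column layout, the absence of a header row, and the helper name `write_training_csv` are all assumptions, so check the repository documentation for the expected format before using it.

```python
import csv


def write_training_csv(samples, csv_path):
    """Write (video_path, caption) pairs as CSV rows and return the row count.

    The two-column layout without a header is an assumption, not a documented format.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for video_path, caption in samples:
            writer.writerow([video_path, caption])
            count += 1
    return count


# Example call with hypothetical clip paths and captions:
# write_training_csv(
#     [("clips/clip_0000.mp4", "a dog running on a beach")],
#     "train.csv",
# )
```

Using the `csv` module rather than joining strings by hand keeps captions that contain commas or quotes correctly escaped.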
Acknowledgement
- DiT: Scalable Diffusion Models with Transformers.
- OpenDiT: An acceleration framework for DiT training. The OpenDiT team provided valuable suggestions on accelerating our training process.
- PixArt: An open-source DiT-based text-to-image model.
- Latte: An attempt to efficiently train DiT for video.
- StabilityAI VAE: A powerful image VAE model.
- CLIP: A powerful text-image embedding model.
- T5: A powerful text encoder.
- LLaVA: A powerful image captioning model based on Yi-34B.
We are grateful for their exceptional work and generous contribution to open source.
Citation
@software{opensora,
author = {Zangwei Zheng and Xiangyu Peng and Yang You},
title = {Open-Sora: Towards Open Reproduction of Sora},
month = {March},
year = {2024},
url = {https://github.com/hpcaitech/Open-Sora}
}
Zangwei Zheng and Xiangyu Peng contributed equally to this work during their internship at HPC-AI Tech.


