
Open-Sora: Towards Open Reproduction of Sora

Open-Sora is an open-source initiative dedicated to efficiently reproducing OpenAI's Sora. Our project aims to cover the full pipeline, including video data preprocessing, training with acceleration, efficient inference, and more. Operating on a limited budget, we build on the vibrant open-source community, which provides text-to-image, image-captioning, and language models. We hope to contribute to the community and make the project accessible to everyone.

📰 News

  • [2024.03.18] 🔥 We release Open-Sora 1.0, an open-source project to reproduce OpenAI Sora. Open-Sora 1.0 supports a full pipeline including video data preprocessing, training with acceleration, inference, and more. Our provided checkpoint can produce 2~5s 512×512 videos with only 3 days of training.

🎥 Latest Demo

Three demo clips, each 2s at 512×512.

Videos are downsampled to .gif for display; click a video for the original.

🔆 New Features/Updates

  • 📍 Open-Sora-v1 released. Model weights are available here. With only 400K video clips and 200 H800 days, we are able to generate 2s 512×512 videos.
  • Three stages training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
  • Support training acceleration including accelerated transformers, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64×512×512 videos. Details are located in acceleration.md.
  • We provide video cutting and captioning tools for data preprocessing. Instructions can be found here and our data collection plan can be found at datasets.md.
  • We find that the VQ-VAE from VideoGPT is of low quality, and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension deteriorates quality. See our report for more discussion.
  • We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
  • Support CLIP and T5 text conditioning.
  • By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See command.md for more instructions.
  • Support inference with official weights from DiT, Latte, and PixArt.
  • Refactor the codebase. See structure.md to learn the project structure and how to use the config files.
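
By viewing images as one-frame videos, images and videos can share a single training loop. As a minimal sketch of this idea (in NumPy for illustration; not the repo's actual code), an image batch just gets a time axis of length one:

```python
import numpy as np

def as_video(x: np.ndarray) -> np.ndarray:
    """Treat an image batch [B, C, H, W] as a one-frame video [B, C, T=1, H, W].

    Video batches [B, C, T, H, W] pass through unchanged, so images and
    videos can be fed to the same model.
    """
    if x.ndim == 4:  # image batch: insert a time axis of length 1
        x = np.expand_dims(x, axis=2)
    return x

images = np.zeros((8, 3, 256, 256), dtype=np.float32)
videos = np.zeros((8, 3, 16, 256, 256), dtype=np.float32)
assert as_video(images).shape == (8, 3, 1, 256, 256)
assert as_video(videos).shape == (8, 3, 16, 256, 256)
```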

TODO list sorted by priority

  • Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See datasets.md for more information. [WIP]
  • Training Video-VAE. [WIP]
  • Support image and video conditioning.
  • Evaluation pipeline.
  • Incorporate a better scheduler, e.g., the rectified flow scheduler in SD3.
  • Support variable aspect ratios, resolutions, durations.
  • Support SD3 when released.

Contents

Installation

# create and activate a virtual env
conda create -n opensora python=3.10
conda activate opensora

# install torch
# the command below is for CUDA 12.1, choose install commands from 
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision

# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v -e .

After installation, we suggest reading structure.md to learn the project structure and how to use the config files.

Model Weights

| Resolution | Data   | #iterations | Batch Size | GPU days (H800) | URL |
| ---------- | ------ | ----------- | ---------- | --------------- | --- |
| 16×256×256 | 366K   | 80k         | 8×64       | 117             |     |
| 16×256×256 | 20K HQ | 24k         | 8×64       | 45              |     |
| 16×512×512 | 20K HQ | 20k         | 2×64       | 35              |     |
| 64×512×512 | 50K HQ |             | 4×64       |                 |     |

Our model's weights are partially initialized from PixArt-α. The model has 724M parameters. More information about training can be found in report_v1.md. More about the dataset can be found in datasets.md.

Inference

To run inference with our provided weights, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then run the following commands to generate samples. See here to customize the configuration.

# Sample 16x256x256 (~2s)
python scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 16x512x512 (~2s)
python scripts/inference.py configs/opensora/inference/16x512x512.py
# Sample 64x512x512 (~5s)
python scripts/inference.py configs/opensora/inference/64x512x512.py
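
Each inference config (e.g., 16x256x256.py) is a plain Python file whose top-level assignments define the run. The sketch below is a hypothetical illustration of what such a file might contain; the actual field names live in configs/opensora/inference/ and may differ:

```python
# hypothetical config sketch -- check the real files under configs/ for exact fields
num_frames = 16                 # frames per generated clip
image_size = (256, 256)         # spatial resolution
model = dict(type="STDiT-XL/2") # spatial-temporal DiT backbone
prompt_path = "./assets/texts/t2v_samples.txt"  # one text prompt per line
```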

For inference with other models, see here for more instructions.

Data Processing (WIP)

Split video into clips

We provide code to split a long video into separate clips efficiently using multiprocessing. See tools/data/scene_detect.py.
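
The actual implementation lives in tools/data/scene_detect.py. As a rough illustration of the idea only (the helper names below are hypothetical, not the repo's API): detected scene-cut frame indices are turned into (start, end) clip ranges, and many videos are processed in parallel with multiprocessing:

```python
from multiprocessing import Pool

def cuts_to_clips(num_frames, cut_frames):
    """Turn detected scene-cut frame indices into (start, end) clip ranges."""
    bounds = [0] + sorted(cut_frames) + [num_frames]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

if __name__ == "__main__":
    # one (num_frames, cut_frames) job per video; a real tool would obtain
    # cut_frames from a scene detector, then extract each clip with ffmpeg
    jobs = [(300, [120, 210]), (150, [])]
    with Pool(processes=2) as pool:
        clips_per_video = pool.starmap(cuts_to_clips, jobs)
    assert clips_per_video[0] == [(0, 120), (120, 210), (210, 300)]
    assert clips_per_video[1] == [(0, 150)]  # no cuts: whole video is one clip
```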

Generate video captions

Training

To launch training, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then run the following commands to launch training on a single node.

# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

To launch training on multiple nodes, prepare a hostfile according to ColossalAI, and run the following commands.
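
A hostfile is a plain text file listing one reachable node per line (hypothetical hostnames shown; see the ColossalAI documentation for details):

```
node1
node2
```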

colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

For training other models and advanced usage, see here for more instructions.

Acknowledgement

  • DiT: Scalable Diffusion Models with Transformers.
  • OpenDiT: An acceleration framework for DiT training. The OpenDiT team provided valuable suggestions on accelerating our training process.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.
  • StabilityAI VAE: A powerful image VAE model.
  • CLIP: A powerful text-image embedding model.
  • T5: A powerful text encoder.
  • LLaVA: A powerful image captioning model based on Yi-34B.

We are grateful for their exceptional work and generous contribution to open source.

Citation

@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Yang You},
  title = {Open-Sora: Towards Open Reproduction of Sora},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}

Zangwei Zheng and Xiangyu Peng equally contributed to this work during their internship at HPC-AI Tech.

Star History

Star History Chart