diff --git a/README.md b/README.md
index 72fc2c9..1bbedfc 100644
--- a/README.md
+++ b/README.md
@@ -211,7 +211,7 @@ docker run -ti --gpus all -v {MOUNT_DIR}:/data opensora
 | Model     | Model Size | Data | #iterations | Batch Size | URL                                                           |
 | --------- | ---------- | ---- | ----------- | ---------- | ------------------------------------------------------------- |
 | Diffusion | 1.1B       | 30M  | 70k         | Dynamic    | [:link:](https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3) |
-| VAE       | 384M       |      |             |            | [:link:](https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2) |
+| VAE       | 384M       | 3M   | 1.18M       | 8          | [:link:](https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2) |

 See our **[report 1.2](docs/report_03.md)** for more infomation.

@@ -237,7 +237,7 @@ See our **[report 1.1](docs/report_02.md)** for more infomation.
 View more

 | Resolution | Model Size | Data   | #iterations | Batch Size | GPU days (H800) | URL                                                                                           |
-| ---------- | ---------- | ------ | ----------- | ---------- | --------------- |
+| ---------- | ---------- | ------ | ----------- | ---------- | --------------- | --------------------------------------------------------------------------------------------- |
 | 16×512×512 | 700M       | 20K HQ | 20k         | 2×64       | 35              | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth) |
 | 16×256×256 | 700M       | 20K HQ | 24k         | 8×64       | 45              | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x256x256.pth) |
 | 16×256×256 | 700M       | 366K   | 80k         | 8×64       | 117             | [:link:](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-16x256x256.pth)    |
@@ -408,6 +408,7 @@ Before you run the following commands, follow our [Installation Documentation](d
 Once you prepare the data in a `csv` file, run the following commands to train the VAE. Note that you need to adjust the number of trained epochs (`epochs`) in the config file accordingly with respect to your own csv data size.
+
 ```bash
 # stage 1 training, 380k steps, 8 GPUs
 torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage1.py --data-path YOUR_CSV_PATH
diff --git a/configs/vae/inference/image.py b/configs/vae/inference/image.py
index cb25757..2eebcb0 100644
--- a/configs/vae/inference/image.py
+++ b/configs/vae/inference/image.py
@@ -21,7 +21,7 @@ num_workers = 4
 # Define model
 model = dict(
     type="OpenSoraVAE_V1_2",
-    from_pretrained="pretrained_models/vae-pipeline",
+    from_pretrained="hpcai-tech/OpenSora-VAE-v1.2",
     micro_frame_size=None,
     micro_batch_size=4,
     cal_loss=True,
diff --git a/configs/vae/inference/video.py b/configs/vae/inference/video.py
index 4697a2e..e4211b8 100644
--- a/configs/vae/inference/video.py
+++ b/configs/vae/inference/video.py
@@ -21,7 +21,7 @@ num_workers = 4
 # Define model
 model = dict(
     type="OpenSoraVAE_V1_2",
-    from_pretrained="pretrained_models/vae-pipeline",
+    from_pretrained="hpcai-tech/OpenSora-VAE-v1.2",
     micro_frame_size=None,
     micro_batch_size=4,
     cal_loss=True,
diff --git a/configs/vae/train/stage1.py b/configs/vae/train/stage1.py
index a6899ec..151d86d 100644
--- a/configs/vae/train/stage1.py
+++ b/configs/vae/train/stage1.py
@@ -46,7 +46,7 @@ use_image_identity_loss = True
 # Others
 seed = 42
-outputs = "outputs"
+outputs = "outputs/vae_stage1"
 wandb = False

 epochs = 100  # NOTE: adjust accordingly w.r.t dataset size
diff --git a/configs/vae/train/stage2.py b/configs/vae/train/stage2.py
index 80748d4..d6961e0 100644
--- a/configs/vae/train/stage2.py
+++ b/configs/vae/train/stage2.py
@@ -20,7 +20,7 @@ plugin = "zero2"
 model = dict(
     type="VideoAutoencoderPipeline",
     freeze_vae_2d=False,
-    from_pretrained=None,
+    from_pretrained="outputs/vae_stage1",
     cal_loss=True,
     vae_2d=dict(
         type="VideoAutoencoderKL",
@@ -46,7 +46,7 @@ use_image_identity_loss = False
 # Others
 seed = 42
-outputs = "outputs"
+outputs = "outputs/vae_stage2"
 wandb = False

 epochs = 100  # NOTE: adjust accordingly w.r.t dataset size
diff --git a/configs/vae/train/stage3.py b/configs/vae/train/stage3.py
index 2b6bc12..464a3ef 100644
--- a/configs/vae/train/stage3.py
+++ b/configs/vae/train/stage3.py
@@ -20,7 +20,7 @@ plugin = "zero2"
 model = dict(
     type="VideoAutoencoderPipeline",
     freeze_vae_2d=False,
-    from_pretrained=None,
+    from_pretrained="outputs/vae_stage2",
     cal_loss=True,
     vae_2d=dict(
         type="VideoAutoencoderKL",
@@ -45,7 +45,7 @@ use_image_identity_loss = False
 # Others
 seed = 42
-outputs = "outputs"
+outputs = "outputs/vae_stage3"
 wandb = False

 epochs = 100  # NOTE: adjust accordingly w.r.t dataset size
diff --git a/docs/report_03.md b/docs/report_03.md
index 2a5d6a5..853dc38 100644
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -147,7 +147,12 @@ In addition, we also keep track of [VBench](https://vchitect.github.io/VBench-pr
 All the evaluation code is released in `eval` folder. Check the [README](/eval/README.md) for more details.

-[Final performance TBD]
+
+| Model          | Total Score | Quality Score | Semantic Score |
+| -------------- | ----------- | ------------- | -------------- |
+| Open-Sora V1.0 | 75.91%      | 78.81%        | 64.28%         |
+| Open-Sora V1.2 | 79.23%      | 80.71%        | 73.30%         |
+

 ## Sequence parallelism
diff --git a/docs/vae.md b/docs/vae.md
index 114a7e5..998562f 100644
--- a/docs/vae.md
+++ b/docs/vae.md
@@ -59,5 +59,3 @@ We are grateful for the following work:
 * [Taming Transformers](https://github.com/CompVis/taming-transformers): Taming Transformers for High-Resolution Image Synthesis
 * [3D blur pooling](https://github.com/adobe/antialiased-cnns/pull/39/commits/3d6f02b6943c58b68c19c07bc26fad57492ff3bc)
 * [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan)
-
-Special thanks go to the authors of [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan) for their valuable advice and help.
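Taken together, the `stage2.py` and `stage3.py` config changes chain the three VAE training stages: stage 2 initializes `from_pretrained` from stage 1's output directory, and stage 3 from stage 2's. A minimal Python sketch of that chaining follows; the `stage_config` helper is hypothetical and not part of the Open-Sora codebase, and only the `outputs/vae_stageN` directory names come from the diff (the diff does not change stage 1's `from_pretrained`, so `None` there is just a placeholder).

```python
# Hypothetical helper illustrating how the staged VAE configs chain
# checkpoints. Only the "outputs/vae_stageN" directory names are taken
# from the config diff; the function itself is illustrative.
STAGES = ("stage1", "stage2", "stage3")


def stage_config(stage: str) -> dict:
    """Return the (from_pretrained, outputs) pair each stage config sets."""
    idx = STAGES.index(stage)
    # Later stages resume from the previous stage's output directory;
    # stage 1's from_pretrained is untouched by the diff (None here is
    # a placeholder, not a claim about the real stage1.py).
    prev = f"outputs/vae_{STAGES[idx - 1]}" if idx > 0 else None
    return {"from_pretrained": prev, "outputs": f"outputs/vae_{stage}"}


for s in STAGES:
    print(s, stage_config(s))
```

This chaining is why each stage now writes to its own `outputs/vae_stageN` directory instead of the shared `outputs` default: the next stage's `from_pretrained` path must point at a directory that only the previous stage wrote to.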