Hongxin Liu 2561963140
[feature] add data process script (#2)
* [misc] update gitignore

* [feature] add data process script
2024-02-21 11:33:36 +08:00
diffusion [init] migrate dit repo (#1) 2024-02-20 11:19:59 +08:00
.gitignore [feature] add data process script (#2) 2024-02-21 11:33:36 +08:00
download.py [init] migrate dit repo (#1) 2024-02-20 11:19:59 +08:00
LICENSE Initial commit 2024-02-20 11:01:34 +08:00
models.py [init] migrate dit repo (#1) 2024-02-20 11:19:59 +08:00
preprocess_data.py [feature] add data process script (#2) 2024-02-21 11:33:36 +08:00
README.md [feature] add data process script (#2) 2024-02-21 11:33:36 +08:00
sample.py [init] migrate dit repo (#1) 2024-02-20 11:19:59 +08:00
train.py [init] migrate dit repo (#1) 2024-02-20 11:19:59 +08:00

🎥 Open-Sora

📍 Overview

This repository is an unofficial implementation of OpenAI's Sora. It is built on the facebookresearch/DiT repository.

Dataset preparation

We use the MSR-VTT dataset, a large-scale video description dataset. The raw videos must be preprocessed before training the model.

Before running preprocess_data.py, prepare a captions file and a video directory. The captions file must be a JSON or JSONL file, and the video directory must contain all the videos.

Here is an example of the captions file:

[
    {
        "file": "video0.mp4",
        "captions": ["a girl is throwing away folded clothes", "a girl throwing cloths around"]
    },
    {
        "file": "video1.mp4",
        "captions": ["a  comparison of two opposing team football athletes"]    
    }
]
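
As a minimal sketch of how such a captions file could be read, the helper below accepts either format. The function name load_captions is hypothetical, and the JSONL layout (one object per line with the same "file" and "captions" keys) is an assumption; preprocess_data.py may parse the file differently.

```python
import json
from pathlib import Path


def load_captions(path):
    """Load caption records from a .json file (a list of objects)
    or a .jsonl file (assumed: one object per line)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() == ".jsonl":
        # Assumption: each non-empty line is one {"file": ..., "captions": [...]} object.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("captions file must contain a list of records")
    for record in records:
        # Every record needs a video file name and a list of captions.
        if not isinstance(record, dict) or "file" not in record or "captions" not in record:
            raise ValueError(f"record is missing 'file' or 'captions': {record!r}")
    return records
```

Validating the keys up front surfaces a malformed captions file before any GPU work starts.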

Here is an example of the video directory:

.
├── video0.mp4
├── video1.mp4
└── ...

Each video may have multiple captions, so the script outputs one video-caption pair per caption. For example, the first video above has two captions, so it yields two video-caption pairs.
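
The pairing described above can be sketched as follows. The function name expand_pairs is hypothetical and only illustrates the expansion; it is not taken from preprocess_data.py.

```python
def expand_pairs(records):
    """Turn caption records into a flat list of (video_file, caption) pairs."""
    pairs = []
    for record in records:
        # One output pair per caption, so a video with N captions appears N times.
        for caption in record["captions"]:
            pairs.append((record["file"], caption))
    return pairs


records = [
    {
        "file": "video0.mp4",
        "captions": ["a girl is throwing away folded clothes", "a girl throwing cloths around"],
    },
    {
        "file": "video1.mp4",
        "captions": ["a  comparison of two opposing team football athletes"],
    },
]

pairs = expand_pairs(records)
print(len(pairs))  # 3: two pairs for video0.mp4, one for video1.mp4
```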

We use VQ-VAE to quantize the video frames and CLIP to extract the text features.

The output is an Arrow dataset with the following columns: "video_file", "video_latent_states", and "text_latent_states". The shape of "video_latent_states" is (T, H, W), and the shape of "text_latent_states" is (S, D).
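
A quick sanity check of one output row might look like the sketch below. It assumes the row has already been read into a plain Python dict whose latent columns are nested lists; how the dataset is loaded is not specified here, and the helper names nested_shape and check_row are hypothetical.

```python
EXPECTED_COLUMNS = {"video_file", "video_latent_states", "text_latent_states"}


def nested_shape(value):
    """Return the shape of a regular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def check_row(row):
    """Check that a row has the three columns and the documented ranks."""
    missing = EXPECTED_COLUMNS - set(row)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    video_shape = nested_shape(row["video_latent_states"])  # expected (T, H, W)
    text_shape = nested_shape(row["text_latent_states"])    # expected (S, D)
    if len(video_shape) != 3:
        raise ValueError(f"video_latent_states should be 3-D, got shape {video_shape}")
    if len(text_shape) != 2:
        raise ValueError(f"text_latent_states should be 2-D, got shape {text_shape}")
    return video_shape, text_shape
```

The check only verifies the number of dimensions, since the concrete values of T, H, W, S, and D depend on the input video and the models used.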

How to run the script:

python preprocess_data.py /path/to/captions.json /path/to/video_dir /path/to/output_dir

Note that this script must be run on a machine with a GPU. To avoid CUDA out-of-memory (OOM) errors, it filters out videos that are too long.
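
The length filter can be illustrated with the sketch below. It assumes length is measured in frames and that the caller supplies the limit; the actual criterion and threshold in preprocess_data.py are not stated here, so both the function name filter_long_videos and the max_frames parameter are hypothetical.

```python
def filter_long_videos(frame_counts, max_frames):
    """Split videos into those kept and those skipped for being too long.

    frame_counts: dict mapping video file name -> number of frames (assumed measure).
    max_frames: hypothetical upper bound chosen to fit the GPU memory available.
    """
    kept, skipped = [], []
    for video_file, num_frames in frame_counts.items():
        if num_frames <= max_frames:
            kept.append(video_file)
        else:
            skipped.append(video_file)
    return kept, skipped
```

Returning the skipped files as well makes it easy to log which videos were dropped.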