diff --git a/README.md b/README.md index 1a70030..59c09dc 100644 --- a/README.md +++ b/README.md @@ -248,7 +248,7 @@ is [here](/docs/datasets.md). We provide tools to process video data. Our data processing pipeline includes the following steps: 1. Manage datasets. [[docs](/tools/datasets/README.md)] -2. Scene detection and video splitting. [[docs](/tools/splitting/README.md)] +2. Scene detection and video splitting. [[docs](/tools/scene_cut/README.md)] 3. Score and filter videos. [[docs](/tools/scoring/README.md)] 4. Generate video captions. [[docs](/tools/caption/README.md)] diff --git a/docs/datasets.md b/docs/datasets.md index cc9293c..2d5e995 100644 --- a/docs/datasets.md +++ b/docs/datasets.md @@ -1,15 +1,24 @@ # Datasets +For Open-Sora 1.1, we conduct mixed training with both images and videos. The main datasets we use are listed below. +Please refer to the [README](/README.md#data-processing) for data processing. + +## Panda-70M +[Panda-70M](https://github.com/snap-research/Panda-70M) is a large-scale dataset with 70M video-caption pairs. +We use the [training-10M subset](https://github.com/snap-research/Panda-70M/tree/main/dataset_dataloading) for training, +which contains ~10M videos of higher quality. + +## Pexels +[Pexels](https://www.pexels.com/) is a popular online platform that provides high-quality stock photos, videos, and music for free. +Most videos from this website are of high quality. Thus, we use them for both pre-training and HQ fine-tuning. +We really appreciate the great platform and the contributors! + +## Inter4K +[Inter4K](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1K video clips at 4K resolution. +The dataset was proposed for super-resolution tasks. We use it for HQ fine-tuning. + + ## HD-VG-130M - -[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs. The caption is generated by BLIP-2. We find the cut and the text quality are relatively poor. It contains 20 splits. 
For OpenSora 1.0, we use the first split (~350K). We plan to use the whole dataset and re-process it. - -You can download the dataset and prepare it for training according to [the dataset repository's instructions](https://github.com/daooshee/HD-VG-130M). There is a README.md file in the Google Drive link that provides instructions on how to download and cut the videos. For this version, we directly use the dataset provided by the authors. - -## Inter4k - -[Inter4k](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1k video clips with 4K resolution. The dataset is proposed for super-resolution tasks. We use the dataset for HQ training. The videos are processed as mentioned [here](/README.md#data-processing). - -## Pexels.com - -[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collect 19K video clips from this website for HQ training. The videos are processed as mentioned [here](/README.md#data-processing). +[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs. +The captions are generated by BLIP-2. +We find that both the scene cuts and the captions are of relatively low quality. For Open-Sora 1.0, we only use ~350K samples from this dataset. diff --git a/tools/scene_cut/README.md b/tools/scene_cut/README.md index 87aa201..0f1ac7d 100644 --- a/tools/scene_cut/README.md +++ b/tools/scene_cut/README.md @@ -2,7 +2,7 @@ In many cases, raw videos contain several scenes and are too long for training. Thus, it is essential to split them into shorter clips based on scenes. Here, we provide code for scene detection and video splitting. -## Formatting +## Prepare a meta file At this step, you should have a raw video dataset prepared. We need a meta file for the dataset. To create a meta file from a folder, run: ```bash
You may already have a meta file for the videos and want to keep its information. 
The following command will add a new column `path` to the meta file. ```bash -python tools/scene_cut/process_meta.py --task append_path --meta_path /path/to/meta.csv --folder_path /path/to/video/folder +python tools/scene_cut/convert_id_to_path.py /path/to/meta.csv --folder_path /path/to/video/folder ``` This should output - `{prefix}_path-filtered.csv` with column `path` (broken videos filtered) @@ -28,8 +28,7 @@ We use [`PySceneDetect`](https://github.com/Breakthrough/PySceneDetect) for this **Make sure** the input meta file has column `path`, which is the path of a video. ```bash -python tools/scene_cut/scene_detect.py --meta_path /path/to/meta.csv -python tools/scene_cut/scene_detect.py --meta_path /mnt/hdd/data/pexels_new/raw/meta/popular_6_format.csv +python tools/scene_cut/scene_detect.py /path/to/meta.csv ``` The output is `{prefix}_timestamp.csv` with column `timestamp`. Each cell in column `timestamp` is a list of tuples, with each tuple indicating the start and end timestamp of a scene @@ -39,18 +38,13 @@ with each tuple indicating the start and end timestamp of a scene After obtaining timestamps for scenes, we conduct video splitting (cutting). **Make sure** the meta file contains column `timestamp`. -TODO: output video size, min_duration, max_duration - ```bash -python tools/scene_cut/main_cut_pandarallel.py \ - --meta_path /path/to/meta.csv \ - --out_dir /path/to/output/dir - -python tools/scene_cut/main_cut_pandarallel.py \ - --meta_path /mnt/hdd/data/pexels_new/raw/meta/popular_6_format_timestamp.csv \ - --out_dir /mnt/hdd/data/pexels_new/scene_cut/data/popular_6 +python tools/scene_cut/cut.py /path/to/meta.csv --save_dir /path/to/output/dir ``` -This yields video clips saved in `/path/to/output/dir`. The video clips are named as `{video_id}_scene-{scene_id}.mp4` +This will save video clips to `/path/to/output/dir`. 
The video clips are named `{video_id}_scene-{scene_id}.mp4`. -TODO: meta for video clips +To create a new meta file for the generated clips, run: +```bash +python -m tools.datasets.convert video /path/to/video/folder --output /path/to/save/meta.csv +``` diff --git a/tools/scene_cut/process_meta.py b/tools/scene_cut/convert_id_to_path.py similarity index 64% rename from tools/scene_cut/process_meta.py rename to tools/scene_cut/convert_id_to_path.py index 9d982de..025cb7b 100644 --- a/tools/scene_cut/process_meta.py +++ b/tools/scene_cut/convert_id_to_path.py @@ -1,15 +1,5 @@ -""" -1. format_raw_meta() - - only keep intact videos - - add 'path' column (abs path) -2. create_meta_for_folder() -""" - import os -# os.chdir('../..') -print(f"Current working directory: {os.getcwd()}") - import argparse import json from functools import partial @@ -18,7 +8,42 @@ import numpy as np import pandas as pd from pandarallel import pandarallel from tqdm import tqdm -from utils_video import is_intact_video +import cv2 +from mmengine.logging import print_log +from moviepy.editor import VideoFileClip + + +def is_intact_video(video_path, mode="moviepy", verbose=False, logger=None): + if not os.path.exists(video_path): + if verbose: + print_log(f"Could not find '{video_path}'", logger=logger) + return False + + if mode == "moviepy": + try: + VideoFileClip(video_path) + if verbose: + print_log(f"The video file '{video_path}' is intact.", logger=logger) + return True + except Exception as e: + if verbose: + print_log(f"Error: {e}", logger=logger) + print_log(f"The video file '{video_path}' is not intact.", logger=logger) + return False + elif mode == "cv2": + try: + cap = cv2.VideoCapture(video_path) + if cap.isOpened(): + cap.release() + if verbose: + print_log(f"The video file '{video_path}' is intact.", logger=logger) + return True + except Exception as e: + if verbose: + print_log(f"Error: {e}", logger=logger) + if verbose: + print_log(f"The video file '{video_path}' is not intact.", logger=logger) + return False + else: + raise ValueError(f"Unknown mode '{mode}'") def has_downloaded_success(json_path): @@ -36,12 +61,28 @@ def has_downloaded_success(json_path): return True -def append_format_pandarallel(meta_path, folder_path, mode=".json"): - def is_intact(row, mode=".json"): +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("meta_path", type=str) + parser.add_argument("--folder_path", type=str, required=True) + parser.add_argument("--mode", type=str, default=None) + + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + meta_path = args.meta_path + folder_path = args.folder_path + mode = args.mode + + def is_intact(row, mode=None): video_id = row["id"] - # video_path = os.path.join(root_raw, f"data/{split}/{video_id}.mp4") video_path = os.path.join(folder_path, f"{video_id}.mp4") row["path"] = video_path + if mode == ".mp4": if is_intact_video(video_path): return True, video_path @@ -74,7 +115,6 @@ def append_format_pandarallel(meta_path, folder_path, mode=".json"): meta.to_csv(out_path, index=False) print(f"New meta (shape={meta.shape}) with intact info saved to '{out_path}'") - # meta_format = meta[meta['intact']] meta_format = meta[np.array(intact)] meta_format.drop("intact", axis=1, inplace=True) out_path = os.path.join(meta_dirpath, f"{wo_ext}_path-filtered.csv") @@ -82,40 +122,5 @@ def append_format_pandarallel(meta_path, folder_path, mode=".json"): print(f"New meta (shape={meta_format.shape}) with format info saved to '{out_path}'") -def create_subset(meta_path): - meta = pd.read_csv(meta_path) - meta_subset = meta.iloc[:100] - - wo_ext, ext = os.path.splitext(meta_path) - out_path = f"{wo_ext}_head-100{ext}" - meta_subset.to_csv(out_path, index=False) - print(f"New meta (shape={meta_subset.shape}) saved to '{out_path}'") - - -def parse_args(): - parser = argparse.ArgumentParser() - parser.add_argument("--task", default="append_path", required=True) - parser.add_argument("--meta_path", type=str, required=True) - 
parser.add_argument("--folder_path", type=str, required=True) - parser.add_argument("--mode", type=str, default=None) - parser.add_argument("--num_workers", default=5, type=int) - - args = parser.parse_args() - return args - - -def main(): - args = parse_args() - meta_path = args.meta_path - task = args.task - - if task == "append_path": - append_format_pandarallel(meta_path=meta_path, folder_path=args.folder_path, mode=args.mode) - elif task == "create_subset": - create_subset(meta_path=meta_path) - else: - raise ValueError - - if __name__ == "__main__": main() diff --git a/tools/scene_cut/main_cut_pandarallel.py b/tools/scene_cut/cut.py similarity index 76% rename from tools/scene_cut/main_cut_pandarallel.py rename to tools/scene_cut/cut.py index d300eda..353e392 100644 --- a/tools/scene_cut/main_cut_pandarallel.py +++ b/tools/scene_cut/cut.py @@ -11,7 +11,7 @@ from pandarallel import pandarallel from scenedetect import FrameTimecode -def process_single_row(row, save_dir, log_name=None): +def process_single_row(row, args, log_name=None): video_path = row["path"] logger = None @@ -28,7 +28,14 @@ def process_single_row(row, save_dir, log_name=None): scene_list = eval(timestamp) scene_list = [(FrameTimecode(s, fps=1), FrameTimecode(t, fps=1)) for s, t in scene_list] split_video( - video_path, scene_list, save_dir=save_dir, min_seconds=2, max_seconds=15, shorter_size=720, logger=logger + video_path, + scene_list, + save_dir=args.save_dir, + min_seconds=args.min_seconds, + max_seconds=args.max_seconds, + target_fps=args.target_fps, + shorter_size=args.shorter_size, + logger=logger, ) @@ -36,10 +43,10 @@ def split_video( video_path, scene_list, save_dir, - min_seconds=None, - max_seconds=None, + min_seconds=2.0, + max_seconds=15.0, target_fps=30, - shorter_size=512, + shorter_size=720, verbose=False, logger=None, ): @@ -121,9 +128,14 @@ def split_video( def parse_args(): parser = argparse.ArgumentParser() - parser.add_argument("--meta_path", 
default="./data/pexels_new/raw/meta/popular_5_format_timestamp.csv") - parser.add_argument("--out_dir", default="./data/pexels_new/scene_cut/data/popular_5") - parser.add_argument("--num_workers", default=5, type=int) + parser.add_argument("meta_path", type=str) + parser.add_argument("--save_dir", type=str, required=True) + parser.add_argument("--min_seconds", type=float, default=None, + help='if set, clips shorter than min_seconds are ignored') + parser.add_argument("--max_seconds", type=float, default=None, + help='if set, clips longer than max_seconds are truncated') + parser.add_argument("--target_fps", type=int, default=30, help='target FPS of the output clips') + parser.add_argument("--shorter_size", type=int, default=720, help='resize the shorter side to this value, keeping the aspect ratio') args = parser.parse_args() return args @@ -131,25 +143,24 @@ def main(): args = parse_args() - meta_path = args.meta_path - out_dir = args.out_dir - assert os.path.basename(os.path.dirname(out_dir)) == "data" - - os.makedirs(out_dir, exist_ok=True) - - meta = pd.read_csv(meta_path) + save_dir = args.save_dir + os.makedirs(save_dir, exist_ok=True) # create logger - log_dir = os.path.dirname(out_dir) - log_name = os.path.basename(out_dir) + log_dir = os.path.dirname(save_dir) + log_name = os.path.basename(save_dir) timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(time.time())) log_path = os.path.join(log_dir, f"{log_name}_{timestamp}.log") logger = MMLogger.get_instance(log_name, log_file=log_path) # logger = None + # initialize pandarallel pandarallel.initialize(progress_bar=True) - process_single_row_partial = partial(process_single_row, save_dir=out_dir, log_name=log_name) + process_single_row_partial = partial(process_single_row, args=args, log_name=log_name) + + # process + meta = pd.read_csv(args.meta_path) meta.parallel_apply(process_single_row_partial, axis=1) diff --git a/tools/scene_cut/main_cut_multi_thread.py b/tools/scene_cut/main_cut_multi_thread.py deleted file mode 100644 
index cab91f5..0000000 --- a/tools/scene_cut/main_cut_multi_thread.py +++ /dev/null @@ -1,162 +0,0 @@ -import argparse -import os -import subprocess -from concurrent.futures import ThreadPoolExecutor, as_completed - -import pandas as pd -from imageio_ffmpeg import get_ffmpeg_exe -from mmengine.logging import print_log -from scenedetect import FrameTimecode -from tqdm import tqdm - - -def single_process(row, save_dir, logger=None): - # video_id = row['videoID'] - # video_path = os.path.join(root_src, f'{video_id}.mp4') - video_path = row["path"] - - # check mp4 integrity - # if not is_intact_video(video_path, logger=logger): - # return False - - timestamp = row["timestamp"] - if not (timestamp.startswith("[") and timestamp.endswith("]")): - return False - scene_list = eval(timestamp) - scene_list = [(FrameTimecode(s, fps=1), FrameTimecode(t, fps=1)) for s, t in scene_list] - split_video(video_path, scene_list, save_dir=save_dir, logger=logger) - return True - - -def split_video( - video_path, - scene_list, - save_dir, - min_seconds=None, - max_seconds=None, - target_fps=30, - shorter_size=512, - verbose=False, - logger=None, -): - """ - scenes shorter than min_seconds will be ignored; - scenes longer than max_seconds will be cut to save the beginning max_seconds. - Currently, the saved file name pattern is f'{fname}_scene-{idx}'.mp4 - - Args: - scene_list (List[Tuple[FrameTimecode, FrameTimecode]]): each element is (s, t): start and end of a scene. 
- min_seconds (float | None) - max_seconds (float | None) - target_fps (int | None) - shorter_size (int | None) - """ - FFMPEG_PATH = get_ffmpeg_exe() - - save_path_list = [] - for idx, scene in enumerate(scene_list): - s, t = scene # FrameTimecode - if min_seconds is not None: - if (t - s).get_seconds() < min_seconds: - continue - - duration = t - s - if max_seconds is not None: - fps = s.framerate - max_duration = FrameTimecode(timecode="00:00:00", fps=fps) - max_duration.frame_num = round(fps * max_seconds) - duration = min(max_duration, duration) - - # save path - fname = os.path.basename(video_path) - fname_wo_ext = os.path.splitext(fname)[0] - # TODO: fname pattern - save_path = os.path.join(save_dir, f"{fname_wo_ext}_scene-{idx}.mp4") - - # ffmpeg cmd - cmd = [FFMPEG_PATH] - - # Only show ffmpeg output for the first call, which will display any - # errors if it fails, and then break the loop. We only show error messages - # for the remaining calls. - # cmd += ['-v', 'error'] - - # input path - # cmd += ["-i", video_path] - - # clip to cut - cmd += ["-nostdin", "-y", "-ss", str(s.get_seconds()), "-i", video_path, "-t", str(duration.get_seconds())] - # cmd += ["-nostdin", "-y", "-ss", str(s.get_seconds()), "-t", str(duration.get_seconds())] - - # target fps - # cmd += ['-vf', 'select=mod(n\,2)'] - if target_fps is not None: - cmd += ["-r", f"{target_fps}"] - - # aspect ratio - if shorter_size is not None: - cmd += ["-vf", f"scale='if(gt(iw,ih),-2,{shorter_size})':'if(gt(iw,ih),{shorter_size},-2)'"] - # cmd += ['-vf', f"scale='if(gt(iw,ih),{shorter_size},trunc(ow/a/2)*2)':-2"] - - cmd += ["-map", "0", save_path] - - proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - stdout, stderr = proc.communicate() - if verbose: - stdout = stdout.decode("utf-8") - print_log(stdout, logger=logger) - - save_path_list.append(video_path) - print_log(f"Video clip saved to '{save_path}'", logger=logger) - - return save_path_list - - -def parse_args(): 
- parser = argparse.ArgumentParser() - parser.add_argument("--root", default="F:/Panda-70M/") - parser.add_argument("--split", default="test") - parser.add_argument("--num_workers", default=5, type=int) - - args = parser.parse_args() - return args - - -def main(): - # args = parse_args() - # root = args.root - # split = args.split - - root = "F:/Panda-70M/" - root, split = "F:/pexels_new/", "popular_2" - meta_path = os.path.join(root, f"raw/meta/{split}_format_timestamp.csv") - root_dst = os.path.join(root, f"scene_cut/data/{split}") - - folder_dst = root_dst - # folder_src = os.path.join(root_src, f'data/{split}') - # folder_dst = os.path.join(root_dst, os.path.relpath(folder_src, root_src)) - os.makedirs(folder_dst, exist_ok=True) - - meta = pd.read_csv(meta_path) - - # create logger - # folder_path_log = os.path.dirname(root_dst) - # log_name = os.path.basename(root_dst) - # timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(time.time())) - # log_path = os.path.join(folder_path_log, f"{log_name}_{timestamp}.log") - # logger = MMLogger.get_instance(log_name, log_file=log_path) - logger = None - - tasks = [] - pool = ThreadPoolExecutor(max_workers=1) - for idx, row in meta.iterrows(): - task = pool.submit(single_process, row, folder_dst, logger) - tasks.append(task) - - for task in tqdm(as_completed(tasks), total=len(meta)): - task.result() - pool.shutdown() - - -if __name__ == "__main__": - main() diff --git a/tools/scene_cut/scene_detect.py b/tools/scene_cut/scene_detect.py index c0f45cc..f3a6548 100644 --- a/tools/scene_cut/scene_detect.py +++ b/tools/scene_cut/scene_detect.py @@ -29,53 +29,24 @@ def process_single_row(row): return False, "" -def main(): - meta_path = "F:/pexels_new/raw/meta/popular_1_format.csv" - meta = pd.read_csv(meta_path) - - timestamp_list = [] - for idx, row in tqdm(meta.iterrows()): - video_path = row["path"] - - detector = AdaptiveDetector( - adaptive_threshold=1.5, - luma_only=True, - ) - # detector = ContentDetector() - 
scene_list = detect(video_path, detector, start_in_scene=True) - - timestamp = [(s.get_timecode(), t.get_timecode()) for s, t in scene_list] - timestamp_list.append(timestamp) - - meta["timestamp"] = timestamp_list - - wo_ext, ext = os.path.splitext(meta_path) - out_path = f"{wo_ext}_timestamp{ext}" - meta.to_csv(out_path, index=False) - print(f"New meta with timestamp saved to '{out_path}'.") - - def parse_args(): parser = argparse.ArgumentParser() - parser.add_argument("--meta_path", default="F:/pexels_new/raw/meta/popular_1_format.csv") - parser.add_argument("--num_workers", default=5, type=int) + parser.add_argument("meta_path", type=str) args = parser.parse_args() return args -def main_pandarallel(): +def main(): args = parse_args() meta_path = args.meta_path - # meta_path = 'F:/pexels_new/raw/meta/popular_1_format.csv' - meta = pd.read_csv(meta_path) - pandarallel.initialize(progress_bar=True) + + meta = pd.read_csv(meta_path) ret = meta.parallel_apply(process_single_row, axis=1) succ, timestamps = list(zip(*ret)) - meta["timestamp"] = timestamps meta = meta[np.array(succ)] @@ -86,4 +57,4 @@ def main_pandarallel(): if __name__ == "__main__": - main_pandarallel() + main() diff --git a/tools/scene_cut/utils_video.py b/tools/scene_cut/utils_video.py deleted file mode 100644 index 4847180..0000000 --- a/tools/scene_cut/utils_video.py +++ /dev/null @@ -1,97 +0,0 @@ -import os - -import cv2 -from mmengine.logging import print_log -from moviepy.editor import VideoFileClip - - -def iterate_files(folder_path): - for root, dirs, files in os.walk(folder_path): - # root contains the current directory path - # dirs contains the list of subdirectories in the current directory - # files contains the list of files in the current directory - - # Process files in the current directory - for file in files: - file_path = os.path.join(root, file) - # print("File:", file_path) - yield file_path - - # Process subdirectories and recursively call the function - for subdir in dirs: - 
subdir_path = os.path.join(root, subdir) - # print("Subdirectory:", subdir_path) - iterate_files(subdir_path) - - -def iterate_folders(folder_path): - for root, dirs, files in os.walk(folder_path): - for subdir in dirs: - subdir_path = os.path.join(root, subdir) - yield subdir_path - # print("Subdirectory:", subdir_path) - iterate_folders(subdir_path) - - -def clone_folder_structure(root_src, root_dst, verbose=False): - src_path_list = iterate_folders(root_src) - src_relpath_list = [os.path.relpath(x, root_src) for x in src_path_list] - - os.makedirs(root_dst, exist_ok=True) - dst_path_list = [os.path.join(root_dst, x) for x in src_relpath_list] - for folder_path in dst_path_list: - os.makedirs(folder_path, exist_ok=True) - if verbose: - print(f"Create folder: '{folder_path}'") - - -def is_intact_video(video_path, mode="moviepy", verbose=False, logger=None): - if not os.path.exists(video_path): - if verbose: - print_log(f"Could not find '{video_path}'", logger=logger) - return False - - if mode == "moviepy": - try: - VideoFileClip(video_path) - if verbose: - print_log(f"The video file '{video_path}' is intact.", logger=logger) - return True - except Exception as e: - if verbose: - print_log(f"Error: {e}", logger=logger) - print_log(f"The video file '{video_path}' is not intact.", logger=logger) - return False - elif mode == "cv2": - try: - cap = cv2.VideoCapture(video_path) - if cap.isOpened(): - if verbose: - print_log(f"The video file '{video_path}' is intact.", logger=logger) - return True - except Exception as e: - if verbose: - print_log(f"Error: {e}", logger=logger) - print_log(f"The video file '{video_path}' is not intact.", logger=logger) - return False - else: - raise ValueError - - -def count_frames(video_path, logger=None): - cap = cv2.VideoCapture(video_path) - - if not cap.isOpened(): - print_log(f"Error: Could not open video file '{video_path}'", logger=logger) - return - - total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - print_log(f"Total 
frames in the video '{video_path}': {total_frames}", logger=logger) - - cap.release() - - -def count_files(root, suffix=".mp4"): - files_list = iterate_files(root) - cnt = len([x for x in files_list if x.endswith(suffix)]) - return cnt
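For reviewers, the trimming and naming rules that `cut.py` keeps from `split_video` can be sketched in pure Python. This is a simplified sketch, not the shipped implementation: it works on plain start/end seconds instead of PySceneDetect's `FrameTimecode`, and `build_cut_commands` is a hypothetical helper that only assembles the ffmpeg commands without running them.

```python
import os


def build_cut_commands(video_path, scenes, save_dir,
                       min_seconds=None, max_seconds=None,
                       target_fps=30, shorter_size=720):
    """Build one ffmpeg command per scene, mirroring split_video's rules:
    scenes shorter than min_seconds are skipped, scenes longer than
    max_seconds keep only their first max_seconds, and clips are named
    {video_id}_scene-{scene_id}.mp4."""
    fname_wo_ext = os.path.splitext(os.path.basename(video_path))[0]
    commands = []
    for idx, (start, end) in enumerate(scenes):
        duration = end - start
        if min_seconds is not None and duration < min_seconds:
            continue  # drop scenes that are too short
        if max_seconds is not None:
            duration = min(duration, max_seconds)  # truncate long scenes
        save_path = os.path.join(save_dir, f"{fname_wo_ext}_scene-{idx}.mp4")
        cmd = ["ffmpeg", "-nostdin", "-y", "-ss", str(start),
               "-i", video_path, "-t", str(duration)]
        if target_fps is not None:
            cmd += ["-r", str(target_fps)]
        if shorter_size is not None:
            # resize the shorter side, preserving the aspect ratio
            cmd += ["-vf",
                    f"scale='if(gt(iw,ih),-2,{shorter_size})':'if(gt(iw,ih),{shorter_size},-2)'"]
        cmd += ["-map", "0", save_path]
        commands.append((save_path, cmd))
    return commands


cmds = build_cut_commands("raw/abc123.mp4", [(0.0, 1.0), (1.0, 20.0)],
                          "clips", min_seconds=2.0, max_seconds=15.0)
```

With these arguments, the first scene (1 s) is dropped and the second is truncated to 15 s, yielding a single command whose output is `clips/abc123_scene-1.mp4`; note the scene index comes from the scene's position in the full list, not from the surviving clips.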