# Scene Detection and Video Splitting
In many cases, raw videos contain several scenes and are too long for training. Thus, it is essential to split them into shorter 
clips based on scenes. Here, we provide code for scene detection and video splitting.

## Formatting
At this step, you should have a raw video dataset prepared. We need a meta file for the dataset. To create a meta file from a folder, run:

```bash
python -m tools.datasets.convert video /path/to/video/folder --output /path/to/save/meta.csv
```
This should output a `.csv` file with column `path`.
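
For intuition, the meta file is essentially a CSV that lists video paths. The following is a minimal sketch of building one by hand, assuming only a `path` column is needed. The `build_meta` helper and the extension list are hypothetical and are not part of `tools.datasets.convert`, which may behave differently.

```python
import csv
import os

# Assumed set of video extensions; adjust to your dataset.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def build_meta(video_folder, output_csv):
    """Walk video_folder and write a CSV with a single `path` column."""
    paths = []
    for root, _, files in os.walk(video_folder):
        for name in files:
            if os.path.splitext(name)[1].lower() in VIDEO_EXTS:
                paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()  # deterministic row order
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])
        writer.writerows([p] for p in paths)
    return paths
```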

If you already have a meta file for the videos and want to keep its information,
**make sure** it has a column `id` holding the id of each video, and that each video file is named `{id}.mp4`.
The following command adds a new column `path` to the meta file.

```bash
python tools/scene_cut/process_meta.py --task append_path --meta_path /path/to/meta.csv --folder_path /path/to/video/folder
```
This should output two files:
- `{prefix}_path-filtered.csv` with column `path` (broken videos filtered out)
- `{prefix}_path_intact.csv` with columns `path` and `intact` (`intact` indicates whether a video is intact)
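
The idea behind this step can be sketched as follows. This is an illustration only, not the code of `process_meta.py`: the helpers `append_path` and `filter_intact` are hypothetical, and the intact check here (file exists and is non-empty) is an assumption; the actual script may apply a stricter test.

```python
import os


def append_path(rows, folder_path):
    """Return copies of rows with `path` and `intact` columns added.

    rows: iterable of dicts, each with an `id` key.
    """
    out = []
    for row in rows:
        path = os.path.join(folder_path, f"{row['id']}.mp4")
        # Assumed check: the file exists and is not empty.
        intact = os.path.isfile(path) and os.path.getsize(path) > 0
        out.append({**row, "path": path, "intact": intact})
    return out


def filter_intact(rows):
    """Keep only intact rows and drop the `intact` column."""
    return [
        {k: v for k, v in row.items() if k != "intact"}
        for row in rows
        if row["intact"]
    ]
```

All other columns of the original meta file are carried over unchanged, which is the point of appending `path` rather than regenerating the file.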


## Scene Detection
The next step is to detect scenes in a video. 
We use [`PySceneDetect`](https://github.com/Breakthrough/PySceneDetect) for this job. 
**Make sure** the input meta file has column `path`, which is the path of a video.

```bash
python tools/scene_cut/scene_detect.py --meta_path /path/to/meta.csv

# For example:
python tools/scene_cut/scene_detect.py --meta_path /mnt/hdd/data/pexels_new/raw/meta/popular_6_format.csv
```
The output is `{prefix}_timestamp.csv` with column `timestamp`. Each cell in column `timestamp` is a list of tuples, 
with each tuple indicating the start and end timestamp of a scene 
(e.g., `[('00:00:01.234', '00:00:02.345'), ('00:00:03.456', '00:00:04.567')]`).
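
When the CSV is read back, each `timestamp` cell is likely to arrive as a string rather than a Python list. The sketch below, which assumes the cell has exactly the format shown above, parses it with `ast.literal_eval` and computes scene durations. The helper names are hypothetical.

```python
import ast


def parse_timestamps(cell):
    """Parse one `timestamp` cell into a list of (start, end) string pairs."""
    return [(start, end) for start, end in ast.literal_eval(cell)]


def to_seconds(timecode):
    """Convert 'HH:MM:SS.mmm' to seconds as a float."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


cell = "[('00:00:01.234', '00:00:02.345'), ('00:00:03.456', '00:00:04.567')]"
scenes = parse_timestamps(cell)
durations = [round(to_seconds(end) - to_seconds(start), 3) for start, end in scenes]
print(durations)  # [1.111, 1.111]
```

Durations computed this way could be used to drop scenes that are too short or too long before cutting.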

## Video Splitting
After obtaining timestamps for scenes, we conduct video splitting (cutting).
**Make sure** the meta file contains column `timestamp`.

TODO: output video size, min_duration, max_duration

```bash
python tools/scene_cut/main_cut_pandarallel.py \
    --meta_path /path/to/meta.csv \
    --out_dir /path/to/output/dir

# For example:
python tools/scene_cut/main_cut_pandarallel.py \
    --meta_path /mnt/hdd/data/pexels_new/raw/meta/popular_6_format_timestamp.csv \
    --out_dir /mnt/hdd/data/pexels_new/scene_cut/data/popular_6
```

This yields video clips saved in `/path/to/output/dir`. The clips are named `{video_id}_scene-{scene_id}.mp4`.
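
To illustrate how timestamps map to clip files, here is a rough sketch that plans one `ffmpeg` invocation per scene. It is not the implementation of `main_cut_pandarallel.py`: the `plan_cuts` helper is hypothetical, zero-based scene ids are an assumption, and the command simply re-encodes the segment with `ffmpeg` defaults.

```python
import os


def plan_cuts(video_path, scenes, out_dir):
    """Return (output_path, ffmpeg_args) for each scene of one video.

    scenes: list of (start, end) timecode strings.
    """
    video_id = os.path.splitext(os.path.basename(video_path))[0]
    plans = []
    # Assumption: scene ids start at 0.
    for scene_id, (start, end) in enumerate(scenes):
        out_path = os.path.join(out_dir, f"{video_id}_scene-{scene_id}.mp4")
        # -ss/-to select the segment; -y overwrites an existing output file.
        args = ["ffmpeg", "-y", "-i", video_path, "-ss", start, "-to", end, out_path]
        plans.append((out_path, args))
    return plans
```

If `ffmpeg` is installed, each `args` list can be executed with `subprocess.run(args, check=True)`.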

TODO: meta for video clips