update video stats
The training mainly happens on 360p and 480p. We train the model for 23k steps.
In this stage, we collect ~2M video clips with a total length of 5K hours from all kinds of sources, including:
- Free-license videos, sourced from Pexels, Pixabay, Mixkit, etc.
- [MiraData](https://github.com/mira-space/MiraData): a high-quality dataset with long videos, mainly from games and city/scenic exploration.
- [Vript](https://github.com/mutonix/Vript/tree/main): a densely annotated dataset.
- And some other datasets.
While MiraData and Vript come with GPT-generated captions, we use [PLLaVA](https://github.com/magic-research/PLLaVA) to caption the remaining videos. Unlike LLaVA, which can only caption a single frame/image, PLLaVA is specifically designed and trained for video captioning. Our accelerated version of PLLaVA is released in our tools. In practice, we use the pretrained PLLaVA 13B model and select 4 frames from each video for captioning.
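As a rough illustration of the "4 frames per video" setup, the sketch below uniformly samples 4 frame indices from a clip. The function name and the midpoint-of-segment strategy are assumptions for illustration, not the repo's actual sampling code.

```python
# Hypothetical sketch: pick `num_samples` evenly spaced frame indices
# from a clip, mirroring the 4-frames-per-video captioning setup.
def sample_frame_indices(num_frames: int, num_samples: int = 4) -> list[int]:
    """Return `num_samples` evenly spaced frame indices in [0, num_frames)."""
    if num_frames < num_samples:
        raise ValueError("clip has fewer frames than requested samples")
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

print(sample_frame_indices(100))  # [12, 37, 62, 87]
```

The sampled frames would then be passed to the captioning model as a batch.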
Some statistics of the video data used in this stage are shown below: basic statistics of duration and resolution, as well as the aesthetic score and optical flow score distributions. We also extract tags for objects and actions from the video captions and count their frequencies.
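The tag-counting step can be sketched as follows. The `tags_per_video` data and the assumption that tags arrive as per-video lists (e.g. extracted by an LLM or a POS tagger) are illustrative, not the project's actual pipeline.

```python
from collections import Counter

# Illustrative sketch (not the repo's actual code): count object/action
# tag frequencies across captions, assuming tags were already extracted
# into one list per video.
tags_per_video = [
    ["person", "walking", "street"],
    ["dog", "running", "street"],
    ["person", "running"],
]
tag_counts = Counter(tag for tags in tags_per_video for tag in tags)
print(tag_counts.most_common(3))  # [('person', 2), ('street', 2), ('running', 2)]
```

The resulting frequency table is what a distribution plot like the ones below would be drawn from.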

|
||||

|
||||
|
||||
While MiraData and Vript have captions from GPT, we use [PLLaVA](https://github.com/magic-research/PLLaVA) to caption the rest ones. Compared with LLaVA, which is only capable of single frame/image captioning, PLLaVA is specially designed and trained for video captioning. The accelerated PLLaVA is released in our tools. In practice, we use the pretrained PLLaVA 13B model and select 4 frames from each video for captioning.
|
||||

|
||||

|
||||
|
||||
We mainly train on 720p and 1080p videos in this stage, aiming to extend the model's ability to higher resolutions. We use a mask ratio of 25% during training. The training config is located in [stage3.py](/configs/opensora-v1-2/train/stage3.py). We train the model for 15k steps, which is approximately 2 epochs.