[feat] prompt refine

2026-04-10 04:37:45 +02:00 · 2024-06-14 07:37:00 +00:00 · 992aae7d8f
commit 992aae7d8f
parent 823ddfb436
6 changed files with 85 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -266,6 +266,16 @@ This will launch a Gradio application on your localhost. If you want to know mor

 ### Open-Sora 1.2 Command Line Inference

+### GPT-4o Prompt Refinement
+
+We find that GPT-4o can refine prompts and improve the quality of the generated video. With this feature, you can also write prompts in other languages (e.g., Chinese). To enable it, set your OpenAI API key in the environment:
+
+```bash
+export OPENAI_API_KEY=YOUR_API_KEY
+```
+
+Then run inference with `--llm-refine True` to enable GPT-4o prompt refinement.
+
 ### Open-Sora 1.1 Command Line Inference

 <details>
--- a/assets/texts/t2v_pllava.txt
+++ b/assets/texts/t2v_pllava.txt
@@ -0,0 +1,10 @@
+a close-up shot of a woman standing in a room with a white wall and a plant on the left side. the woman has curly hair and is wearing a green tank top. she is looking to the side with a neutral expression on her face. the lighting in the room is soft and appears to be natural, coming from the left side of the frame. the focus is on the woman, with the background being out of focus. there are no texts or other objects in the video. the style of the video is a simple, candid portrait with a shallow depth of field.
+a serene scene of a pond filled with water lilies. the water is a deep blue, providing a striking contrast to the pink and white flowers that float on its surface. the flowers, in full bloom, are the main focus of the video. they are scattered across the pond, with some closer to the camera and others further away, creating a sense of depth. the pond is surrounded by lush greenery, adding a touch of nature to the scene. the video is taken from a low angle, looking up at the flowers, which gives a unique perspective and emphasizes their beauty. the overall composition of the video suggests a peaceful and tranquil setting, likely a garden or a park.
+a professional setting where a woman is presenting a slide from a presentation. she is standing in front of a projector screen, which displays a bar chart. the chart is colorful, with bars of different heights, indicating some sort of data comparison. the woman is holding a pointer, which she uses to highlight specific parts of the chart. she is dressed in a white blouse and black pants, and her hair is styled in a bun. the room has a modern design, with a sleek black floor and a white ceiling. the lighting is bright, illuminating the woman and the projector screen. the focus of the image is on the woman and the projector screen, with the background being out of focus. there are no texts visible in the image. the relative positions of the objects suggest that the woman is the main subject of the image, and the projector screen is the object of her attention. the image does not provide any information about the content of the presentation or the context of the meeting.
+a bustling city street from the perspective of a car. the car, a sleek black sedan, is in motion, driving down the street. the dashboard of the car is visible in the foreground, providing a view of the road ahead. the street is lined with parked cars on both sides, their colors muted in the bright sunlight. buildings rise on either side of the street, their windows reflecting the sunlight. the sky above is a clear blue, and the sun is shining brightly, casting a warm glow on the scene. the street is busy with pedestrians and other vehicles, adding to the dynamic nature of the scene. the video does not contain any text. the relative positions of the objects suggest a typical city street scene with the car in the foreground, the parked cars on either side, and the buildings in the background. the sunlight illuminates the scene, highlighting the colors and details of the objects. the pedestrians and other vehicles are in motion, adding a sense of life and activity to the scene. the buildings provide a sense of depth and scale to the image. the video does not contain any text or countable objects. the
+a serene scene in a park. the sun is shining brightly, casting a warm glow on the lush green trees and the grassy field. the camera is positioned low, looking up at the towering trees, which are the main focus of the image. the trees are dense and full of leaves, creating a canopy of green that fills the frame. the sunlight filters through the leaves, creating a beautiful pattern of light and shadow on the ground. the overall atmosphere of the video is peaceful and tranquil, evoking a sense of calm and relaxation.
+a moment in a movie theater. a couple is seated in the middle of the theater, engrossed in the movie they are watching. the man is dressed in a casual outfit, complete with a pair of sunglasses, while the woman is wearing a cozy sweater. they are seated on a red theater seat, which stands out against the dark surroundings. the theater itself is dimly lit, with the screen displaying the movie they are watching. the couple appears to be enjoying the movie, their attention completely absorbed by the on-screen action. the theater is mostly empty, with only a few other seats visible in the background. the video does not contain any text or additional objects. the relative positions of the objects are such that the couple is in the foreground, while the screen and the other seats are in the background. the focus of the video is clearly on the couple and their shared experience of watching a movie in a theater.
+a scene where a person is examining a dog. the person is wearing a blue shirt with the word "volunteer" printed on it. the dog is lying on its side, and the person is using a stethoscope to listen to the dog's heartbeat. the dog appears to be a golden retriever and is looking directly at the camera. the background is blurred, but it seems to be an indoor setting with a white wall. the person's focus is on the dog, and they seem to be checking its health. the dog's expression is calm, and it seems to be comfortable with the person's touch. the overall atmosphere of the video is calm and professional.
+a close-up shot of a woman applying makeup. she is using a black brush to apply a dark powder to her face. the woman has blonde hair and is wearing a black top. the background is black, which contrasts with her skin tone and the makeup. the focus is on her face and the brush, with the rest of her body and the background being out of focus. the lighting is soft and even, highlighting the texture of the makeup and the woman's skin. there are no texts or other objects in the video. the woman's expression is neutral, and she is looking directly at the camera. the video does not contain any action, as it is a still shot of a woman applying makeup. the relative position of the woman and the brush is such that the brush is in her hand and is being used to apply the makeup to her face. the video does not contain any other objects or actions. the woman is the only person in the video, and she is the main subject. the video does not contain any sound. the description is based on the visible content of the video and does not include any assumptions or interpretations.
+a young woman is seated in a black gaming chair in a room filled with computer monitors and other gaming equipment. she is wearing a red tank top and black pants, and her hair is styled in loose waves. the room is dimly lit, with the glow of the monitors casting a soft light on her face. she is holding a black game controller in her hands, and her attention is focused on the screen in front of her. the room is filled with other gaming equipment, including keyboards and mice, and there are other chairs and desks scattered around the room. the woman appears to be engrossed in her game, her posture relaxed yet focused. the room is quiet, the only sound coming from the beeps and boops of the game. the woman is the only person in the room, adding a sense of solitude to the scene. the video does not contain any text. the relative positions of the objects suggest a well-organized gaming setup, with the woman at the center, surrounded by her gaming equipment. the video does not contain any action, but the woman's focused expression suggests that she is in the middle of an intense g
+a breathtaking aerial view of a coastal landscape at sunset. the sky, painted in hues of orange and pink, serves as a stunning backdrop to the scene. the sun, partially obscured by the horizon, casts a warm glow on the landscape below. the foreground of the image is dominated by a rocky cliff, its rugged surface adding a touch of raw beauty to the scene. the cliff's edge is adorned with patches of green vegetation, providing a stark contrast to the otherwise barren landscape. the middle ground of the image reveals a winding road that hugs the coastline. the road, appearing as a thin line against the vast expanse of the landscape, guides the viewer's eye towards the horizon. in the background, the silhouette of mountains can be seen, their peaks shrouded in a light mist. the mountains, along with the road, add depth to the image, creating a sense of distance and scale. overall, the video presents a serene and majestic coastal landscape, captured at the perfect moment of sunset. the colors
--- a/docs/report_03.md
+++ b/docs/report_03.md
@@ -25,7 +25,7 @@ Besides features introduced in Open-Sora 1.1, Open-Sora 1.2 highlights:
 - Easy and effective model conditioning
 - Better evaluation metrics

-All implementations (both training and inference) of the above improvements are available in the Open-Sora 1.2 release. The following sections will introduce the details of the improvements. We also refine our codebase and documentation to make it easier to use.
+All implementations (both training and inference) of the above improvements are available in the Open-Sora 1.2 release. The following sections will introduce the details of the improvements. We also refine our codebase and documentation to make them easier to use and develop, and add an LLM to [refine input prompts](/README.md#gpt-4o-prompt-refinement) and support more languages.

 ## Video compression network

--- a/opensora/utils/config_utils.py
+++ b/opensora/utils/config_utils.py
@@ -45,6 +45,7 @@ def parse_args(training=False):
        # prompt
        parser.add_argument("--prompt-path", default=None, type=str, help="path to prompt txt file")
        parser.add_argument("--prompt", default=None, type=str, nargs="+", help="prompt list")
+        parser.add_argument("--llm-refine", default=None, type=str2bool, help="enable LLM prompt refinement")

        # image/video
        parser.add_argument("--num-frames", default=None, type=str, help="number of frames")
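
Note on the new flag: because `--llm-refine` is declared with `type=str2bool` rather than `action="store_true"`, it takes an explicit value such as `True`. The sketch below illustrates that behavior in isolation. The `str2bool` shown here is a hypothetical stand-in (the repository's own helper is not part of this diff and may accept different spellings), and `build_parser` and `parse_llm_refine` are illustrative names only.

```python
import argparse


def str2bool(value):
    """Hypothetical stand-in for the repository's str2bool helper (assumption)."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")


def build_parser():
    """Build a parser holding only the flag added in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--llm-refine", default=None, type=str2bool, help="enable LLM refine")
    return parser


def parse_llm_refine(argv):
    """Return the parsed value of --llm-refine for a list of CLI tokens."""
    return build_parser().parse_args(argv).llm_refine


if __name__ == "__main__":
    print(parse_llm_refine(["--llm-refine", "True"]))
    print(parse_llm_refine([]))
```

With `default=None`, omitting the flag leaves the value unset, which is consistent with `scripts/inference.py` reading it through `cfg.get("llm_refine", False)`.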
--- a/opensora/utils/inference_utils.py
+++ b/opensora/utils/inference_utils.py
@@ -198,3 +198,61 @@ def append_generated(vae, generated_video, refs_x, mask_strategy, loop_i, condit
 def dframe_to_frame(num):
    assert num % 5 == 0, f"Invalid num: {num}"
    return num // 5 * 17
+
+
+OPENAI_CLIENT = None
+SYS_PROMPTS = None
+SYS_PROMPTS_PATH = "assets/texts/t2v_pllava.txt"
+SYS_PROMPTS_TEMPLATE = """
+You need to refine the user's input prompt. The user's input prompt is used for a video generation task. You need to refine the user's prompt to make it more suitable for the task. Here are some examples of refined prompts:
+{}
+
+The refined prompt should pay attention to all objects in the video. The description should be useful for an AI to re-generate the video. The description should be no more than six sentences. The refined prompt should be in English.
+"""
+
+
+def get_openai_response(sys_prompt, usr_prompt, model="gpt-4o"):
+    global OPENAI_CLIENT
+    if OPENAI_CLIENT is None:
+        from openai import OpenAI
+
+        OPENAI_CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
+
+    completion = OPENAI_CLIENT.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "system",
+                "content": sys_prompt,
+            },  # system message: refinement instructions plus example prompts
+            {
+                "role": "user",
+                "content": usr_prompt,
+            },  # user message: the raw prompt to be refined
+        ],
+    )
+
+    return completion.choices[0].message.content
+
+
+def refine_prompt_by_openai(prompt):
+    global SYS_PROMPTS
+    if SYS_PROMPTS is None:
+        examples = load_prompts(SYS_PROMPTS_PATH)
+        SYS_PROMPTS = SYS_PROMPTS_TEMPLATE.format("\n".join(examples))
+
+    response = get_openai_response(SYS_PROMPTS, prompt)
+    return response
+
+
+def refine_prompts_by_openai(prompts):
+    new_prompts = []
+    for prompt in prompts:
+        try:
+            new_prompt = refine_prompt_by_openai(prompt)
+            print(f"[Info] Refine prompt: {prompt} -> {new_prompt}")
+            new_prompts.append(new_prompt)
+        except Exception as e:
+            print(f"[Warning] Failed to refine prompt: {prompt} due to {e}")
+            new_prompts.append(prompt)
+    return new_prompts
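
Note on the fallback behavior: `refine_prompts_by_openai` keeps the original prompt whenever a refinement call raises, so a missing API key or a network error degrades to unrefined inference instead of aborting the run. The sketch below reproduces that pattern without calling the OpenAI API. `refine_prompts` takes the refiner as a parameter and `flaky_refiner` is a hypothetical stub; neither name exists in the repository.

```python
def refine_prompts(prompts, refine_fn):
    """Refine each prompt with refine_fn, keeping the original prompt on failure."""
    refined = []
    for prompt in prompts:
        try:
            refined.append(refine_fn(prompt))
        except Exception as exc:
            # Same policy as the commit: warn and fall back to the raw prompt.
            print(f"[Warning] Failed to refine prompt: {prompt} due to {exc}")
            refined.append(prompt)
    return refined


def flaky_refiner(prompt):
    """Hypothetical stub standing in for the GPT-4o call (assumption)."""
    if "fail" in prompt:
        raise RuntimeError("simulated API error")
    return f"a detailed video of {prompt}"


results = refine_prompts(["a cat", "fail case"], flaky_refiner)
```

The output list always has the same length and order as the input, which is what lets `scripts/inference.py` reassign `batch_prompts` in place.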
--- a/scripts/inference.py
+++ b/scripts/inference.py
@@ -25,6 +25,7 @@ from opensora.utils.inference_utils import (
    get_save_path_name,
    load_prompts,
    prepare_multi_resolution_info,
+    refine_prompts_by_openai,
 )
 from opensora.utils.misc import all_exists, create_logger, is_distributed, is_main_process, to_torch_dtype

@@ -148,6 +149,10 @@ def main():
        # == get reference for condition ==
        refs = collect_references_batch(refs, vae, image_size)

+        # == refine prompt by openai ==
+        if cfg.get("llm_refine", False):
+            batch_prompts = refine_prompts_by_openai(batch_prompts)
+
        # == score ==
        batch_prompts = append_score_to_prompts(batch_prompts, aes=cfg.get("aes", None), flow=cfg.get("flow", None))