Describe a universe, a campaign, a pet, or a long-form story! UniVA will plan, compose and produce the video for you.
🚀 Try UniVA Demo System
I want a 20-second video with 4 segments. The main subject is a girl dancing, but each segment needs a different background: a cyberpunk scene for the first, an aesthetic ink dream for the second, a retro film look for the third, and abstract geometric dynamics for the fourth. The subject should stay consistent across all four segments, and her movements should remain coherent.
Please create an advertisement based on the following product advertising requirements. 1. Kneading dough in hands, close-up shot, highlighting the texture of the dough. 2. Sprinkling cherry blossom petals on freshly baked bread, slow motion close-up. 3. Customers taste bread in the store and show satisfied smiles. 4. The brand logo appears, with the text: 'BreadTalk'.
Please generate a 30-second short documentary video based on the following story beats. 1. Close-up of clay meeting a spinning wheel; fingers press and a rib tool carves spirals as slip flicks outward under warm studio light. 2. Over-the-shoulder time-lapse: the vessel rises from cylinder to wide bowl; wet sheen glistens while the wheel slows. 3. Kiln-loading montage: shelves slide in, the door seals, orange heat blooms; a thermocouple readout climbs as a notebook of glaze formulas flips. 4. Slow-motion glaze pour coats the cooled bowl; cross-dissolve into a firing time-lapse where crystalline patterns emerge. 5. Morning reveal: final bowl on a wooden table beside steaming tea; the potter signs the foot and exhales in quiet satisfaction.
Recreate a new video that mirrors the original's style—cinematic transitions, lighting, pacing, and tone—but tells the story of an elderly man reliving his youth through a dreamlike journey across time.
Please generate a 20-second mood piece based on the following sequence. 1. Macro close-up of raindrops striking a neon-lit puddle; ripples mirror street signage in shimmering bokeh. 2. Slow-motion silhouettes under translucent umbrellas traverse a zebra crossing while headlight trails streak past. 3. An elevated train thunders by; droplets bead and slide down a window as the interior sound softens to breath. 4. A cat shelters under an awning; a vendor hands a steaming paper cup to a passerby, vapor curling into mist. 5. Dawn edges in: clouds lift to a pastel sky, a final drip falls from a traffic light, and the city exhales.
Create a prequel to the original video that introduces the backstory of the same characters, matching their look, voice, and animation style, but telling a different story that leads into the original events.
Clip 1 (Morning Preparation): The man stands before a mirror and adjusts the collar of his grey overcoat, his eyes filled with confidence. Clip 2 (Focused at Work): The man is focused on his computer screen, his fingers typing swiftly on the keyboard. Clip 3 (Afternoon Meeting): In a bright meeting room, the man engages in a meeting, listening attentively. Clip 4 (End of the Day): At dusk, he closes his laptop, and looks out the window, with an expression of satisfaction and relief on his face.
Maintain all plot and motion as-is, and apply a Chinese ink-painting style to the visuals.
UniVA is built for longer stories, richer worlds, and multi-step video workflows that go far beyond the samples above. If you can imagine it, UniVA can orchestrate it.
Try it end-to-end with our live agentic system:
🚀 Try UniVA Demo

No fixed templates. Just your imagination + UniVA's tools.

A planner reasons over long-horizon goals and memory, while an executor is grounded in MCP tools. Together they turn vague prompts into structured, verifiable plans instead of one-shot guesses.
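To make the planner–executor split concrete, here is a minimal Python sketch of the idea. This is not UniVA's actual API — the tool names, step schema, and hard-coded decomposition are all illustrative; a real planner would reason with an LLM over goals and memory, and a real executor would verify each step's output.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One verifiable unit of a plan: a tool name plus its arguments."""
    tool: str
    args: dict
    done: bool = False

def plan(prompt: str) -> list[Step]:
    # A real planner would decompose the prompt with an LLM;
    # here we hard-code a three-step decomposition for illustration.
    return [
        Step("script_writer", {"prompt": prompt}),
        Step("keyframe_gen", {"shots": 4}),
        Step("video_gen", {"duration_s": 20}),
    ]

def execute(steps: list[Step], tools: dict) -> list[str]:
    # The executor grounds each step in a concrete tool call.
    outputs = []
    for step in steps:
        result = tools[step.tool](**step.args)
        step.done = True  # a real executor would verify the result here
        outputs.append(result)
    return outputs

# Stub tools standing in for real MCP-backed generators.
tools = {
    "script_writer": lambda prompt: f"script for: {prompt}",
    "keyframe_gen": lambda shots: f"{shots} keyframes",
    "video_gen": lambda duration_s: f"{duration_s}s video",
}
outputs = execute(plan("a girl dancing across four backgrounds"), tools)
```

The point of the structure is that each `Step` is inspectable and checkable before the next one runs, which is what distinguishes a structured plan from a one-shot guess.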
UniVA connects video, vision, language, and utility tools through the Model Context Protocol. Agents dynamically select and chain tools, enabling plug-and-play expansion and multi-step pipelines instead of isolated black-box calls.
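One way to picture "dynamically select and chain tools" is as a search over tool signatures: if each tool declares what media type it consumes and produces, an agent can find a chain from what it has to what it needs. The sketch below is a toy illustration with invented tool names, not UniVA's registry.

```python
from collections import deque

# Each tool declares (input media type, output media type).
TOOLS = {
    "transcribe": ("video", "text"),
    "text2image": ("text", "image"),
    "image2video": ("image", "video"),
}

def find_chain(src: str, dst: str):
    """Breadth-first search for a tool chain converting src media to dst."""
    queue = deque([(src, [])])
    seen = set()
    while queue:
        media, chain = queue.popleft()
        if media == dst and chain:
            return chain
        for name, (inp, out) in TOOLS.items():
            if inp == media and name not in seen:
                seen.add(name)
                queue.append((out, chain + [name]))
    return None  # no chain of registered tools reaches dst
```

Because new tools only need to declare their signature to become reachable, this is what makes the pipeline plug-and-play rather than a set of isolated black-box calls.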
A hierarchical memory design maintains story state, user preference, and tool context, so characters, styles, and constraints stay coherent across long videos and iterative edits.
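A minimal sketch of what such a hierarchy might look like, with the three scopes named in the text. The class name, scope layout, and lookup order are assumptions for illustration, not UniVA's implementation.

```python
class HierarchicalMemory:
    """Three memory scopes with different lifetimes."""

    def __init__(self):
        self.story = {}         # project-lifetime: characters, plot, constraints
        self.preferences = {}   # user-lifetime: style, aspect ratio, pacing
        self.tool_context = {}  # call-lifetime: last tool's inputs/outputs

    def remember_character(self, name: str, description: str):
        # Story state keeps characters consistent across segments and edits.
        self.story.setdefault("characters", {})[name] = description

    def resolve(self, key: str):
        # Look a fact up from the most ephemeral scope outward.
        for scope in (self.tool_context, self.story, self.preferences):
            if key in scope:
                return scope[key]
        return None

mem = HierarchicalMemory()
mem.preferences["style"] = "Chinese ink painting"
mem.remember_character("girl", "dancer, consistent across all segments")
```

Keeping character descriptions in the long-lived story scope is what would let an edit in segment four still see the same "girl" defined in segment one.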
UniVA integrates an extensive, modular toolset via the Model Context Protocol (MCP), enabling flexible plug-and-play extension across diverse media tasks.
Note: The above taxonomy covers only the current set of meta-functions supported in UniVA. Thanks to the MCP architecture, the system is highly extensible, allowing seamless integration of new tools, capabilities, and media modalities in the future.
UniVA-Bench is a unified benchmark for agent-oriented video intelligence, mirroring real workflows where understanding, generation, and editing are intertwined instead of isolated single-step tasks.
Multi-question QA on the same long video, covering narrative, style, transitions, and fine-grained semantics to test temporal reasoning and memory.
Realistic editing chains: replacement, attribute change, style transfer, and composition while keeping characters, scenes, and story logic consistent.
Three creation modes — LongText2Video, Image/Entities2Video, and Video2Video — test whether agents can plan, preserve identity, and produce coherent multi-shot narratives.
If you find UniVA useful for your research, please cite our work.
@misc{liang2025univauniversalvideoagent,
title={UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist},
author={Zhengyang Liang and Daoan Zhang and Huichi Zhou and Rui Huang and Bobo Li and Yuechen Zhang and Shengqiong Wu and Xiaohan Wang and Jiebo Luo and Lizi Liao and Hao Fei},
year={2025},
eprint={2511.08521},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.08521},
}
We sincerely thank our colleagues, collaborators, and research partners for their valuable discussions and constructive feedback that helped shape the design and implementation of UniVA.
The current version of UniVA is a research prototype, and its overall capability is subject to the performance limitations of various backend modules, including perception, reasoning, and generative components. We will continue to refine and expand these modules to further enhance the agent’s reliability, scalability, and generalization ability in future releases.
Open-Source Policy: The code and models of UniVA are released under an open academic license. They are freely available for research and educational purposes, but any form of commercial use is strictly prohibited without explicit written permission from the authors. Unauthorized commercial use or redistribution may infringe intellectual property rights and result in legal liability.
For collaboration, licensing inquiries, or commercial partnerships, please contact us at haofei7419@gmail.com.