UniVA: Universal Video Agent

towards Open-Source Next-Generation Video Generalist

🤖 Highly automated, interactive, proactive video creation experience

AGENTIC CREATION · PROACTIVE WORKFLOW · MCP-NATIVE · EXTENSIBLE
  • Multi-round co-creation. Talk like a director; UniVA iterates shots & stories with you.
  • Deep memory & context. Global + user memory keep preferences, lore, styles consistent.
  • Implicit intent reading. Understands vague & evolving instructions; less prompt hacking.
  • Proactive agent. Auto plans, checks, and suggests better shots & stories, not just obeys.
  • End-to-end workspace. UniVA plans, calls tools, and delivers complete videos.
  • Extensibility. MCP-native, modular design; easy to extend with new models & tools.

🎬 Omnipotent, unified, industrial-grade video production engine

UNIVERSAL VIDEO FABRIC · INDUSTRIAL QUALITY
  • Any-conditioned pipeline. Text / Image / Entity / Video → controllable video in one framework.
  • Super HD & consistent. Cinematic quality with stable identity & objects.
  • Complex narratives. Multi-scene, multi-role, multi-shot stories under structured control.
  • Ultra-long & fine-grained editing. From long-form cuts to per-shot/per-object refinement.
  • Grounded by understanding. Long-video comprehension & segmentation guide generation & edits.

Try UniVA as your video director

Describe a universe, a campaign, a pet, or a long-form story! UniVA will plan, compose and produce the video for you.

🚀 Try UniVA Demo System
Invitation only. Request access via the registration form.


Technical Highlights

Plan–Act Agents

Dual-Agent Orchestration

A planner reasons over long-horizon goals and memory, while an executor grounds each step in MCP tool calls. Together they turn vague prompts into structured, verifiable plans instead of one-shot guesses, following the loop sketched below.

High-level goal → Structured plan → MCP tool calls → Checked outputs

[Figure: Dual-Agent Architecture]
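
As a rough illustration, a plan-act loop of this kind can be sketched in a few lines of Python. The Planner / Executor interfaces below (plan, replan, call, verify, record) are assumptions for exposition, not UniVA's actual API.

# Hedged sketch of a plan-act loop: the planner drafts a structured plan
# from the goal plus memory; the executor grounds each step in an MCP tool
# call and verifies the output; a failed check triggers a targeted replan.
# All interfaces here are assumptions, not UniVA's real API.

from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # MCP tool name, e.g. "text2video"
    args: dict   # arguments for the tool call
    done: bool = False


def run(goal: str, planner, executor, memory, max_replans: int = 3):
    plan = planner.plan(goal, memory)                      # vague prompt -> list[Step]
    for _ in range(max_replans):
        for step in plan:
            if step.done:
                continue
            output = executor.call(step.tool, step.args)   # grounded MCP call
            if executor.verify(step, output):              # checked outputs
                step.done = True
                memory.record(step, output)
            else:
                # feed the failure back so the planner can revise the plan
                plan = planner.replan(goal, plan, step, output, memory)
                break
        else:
            return memory.assemble_video()                 # every step verified
    raise RuntimeError("plan did not converge within the replan budget")

The point of the loop is that the final video is only assembled once every step has passed verification; a failed check routes back through the planner instead of silently continuing.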
MCP-native Tools

Composable Tool Fabric

UniVA connects video, vision, language, and utility tools through the Model Context Protocol. Agents dynamically select and chain tools, enabling plug-and-play expansion and multi-step pipelines instead of isolated black-box calls.

[Figure: Synergistic Components and Tool Pipelines]
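
To make the chaining concrete, here is what a two-step pipeline could look like from the client side, using the official MCP Python SDK. The server command and the tool names (video_segment, entity2video) are hypothetical stand-ins, not UniVA's published tool registry.

# Hypothetical two-step tool chain over MCP: segmentation output feeds
# generation. Tool names and the server command are illustrative; the
# client calls (ClientSession, list_tools, call_tool) are the MCP SDK.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def pipeline(video_path: str, prompt: str):
    server = StdioServerParameters(command="univa-tools")  # assumed entry point
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # agents discover tools at runtime
            print([t.name for t in tools.tools])
            # Step 1: ground the request with segmentation.
            masks = await session.call_tool(
                "video_segment", {"video": video_path, "query": "main character"})
            # Step 2: chain the grounded entities into generation.
            return await session.call_tool(
                "entity2video", {"entities": masks.content, "prompt": prompt})


asyncio.run(pipeline("demo.mp4", "same character, new scene at night"))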
Memory & Consistency

Three-Level Memory Mechanism

A hierarchical memory design maintains story state, user preferences, and tool context, so characters, styles, and constraints stay coherent across long videos and iterative edits.

[Figure: Three-Level Memory Mechanism]
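
One way to picture the hierarchy is as three nested stores that are merged into the context of each tool call. The sketch below is our own illustration of the idea; the field names are assumptions, not UniVA's schema.

# Sketch of a three-level memory: story state, user preferences, and
# tool context, merged per call. Field names are assumptions.

from dataclasses import dataclass, field


@dataclass
class StoryMemory:
    """Global story state: characters, lore, and the shot history."""
    characters: dict = field(default_factory=dict)
    shots: list = field(default_factory=list)


@dataclass
class UserMemory:
    """Per-user standing preferences: styles, pacing, constraints."""
    preferences: dict = field(default_factory=dict)


@dataclass
class ToolMemory:
    """Per-session tool context: recent tool inputs and outputs."""
    calls: list = field(default_factory=list)


@dataclass
class UniVAMemory:
    story: StoryMemory = field(default_factory=StoryMemory)
    user: UserMemory = field(default_factory=UserMemory)
    tools: ToolMemory = field(default_factory=ToolMemory)

    def context_for_call(self) -> dict:
        # Assemble the slice of memory a tool call needs, so characters,
        # styles, and constraints stay consistent across edits.
        return {
            "characters": self.story.characters,
            "style": self.user.preferences.get("style"),
            "recent_calls": self.tools.calls[-5:],
        }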

Function Walkthrough

UniVA integrates an extensive, modular toolset via the Model Context Protocol (MCP), enabling flexible plug-and-play extension across diverse media tasks.

Note: the above taxonomy covers only the current set of meta-functions supported in UniVA. Thanks to the MCP architecture, the system is openly extensible: new tools, capabilities, and media modalities can be integrated seamlessly in the future.
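
To make that concrete: because the fabric is MCP-native, adding a capability amounts to standing up one more MCP server. Below is a minimal sketch using the MCP Python SDK's FastMCP helper; the tool itself (stylize_clip) is a made-up example, not part of UniVA's toolset.

# Made-up example of extending the fabric with a new MCP server.
# FastMCP and mcp.run() are the official MCP Python SDK; the tool
# body is a placeholder for a real style-transfer model.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-video-tools")


@mcp.tool()
def stylize_clip(video_path: str, style: str) -> str:
    """Apply a named style to a clip and return the output path."""
    out = video_path.replace(".mp4", f".{style}.mp4")
    # ... invoke your style-transfer model of choice here ...
    return out


if __name__ == "__main__":
    mcp.run()  # serve over stdio; the agent can now discover this tool

Once the server is running, the agent discovers the new tool at runtime and can splice it into its plans like any built-in.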

UniVA-Bench

UniVA-Bench is a unified benchmark for agent-oriented video intelligence, mirroring real workflows where understanding, generation, and editing are intertwined instead of isolated single-step tasks.

Understanding

Long-Video QA & Reasoning

Multi-question QA on the same long video, covering narrative, style, transitions, and fine-grained semantics to test temporal reasoning and memory.

  • Shot-level & story-level queries.
  • Requires using context across the entire video.
Editing

Multi-Step Long-Video Editing

Realistic editing chains: replacement, attribute change, style transfer, and composition, all while keeping characters, scenes, and story logic consistent.

  • Plans must select & combine proper tools.
  • Rewards temporal and identity coherence.
Generation

Tool-Augmented Video Creation

Three creation modes (LongText2Video, Image/Entities2Video, and Video2Video) test whether agents can plan, preserve identity, and produce coherent multi-shot narratives.

  • Storyboard-first, identity-aware generation.
  • Evaluates end-to-end agentic pipelines.
How we evaluate UniVA-Bench

  • Task Quality: CLIP / DINO similarity, QA accuracy, and J & F for segmentation-style tracks (see the sketch after this list).
  • User Preference: MLLM-as-a-Judge for perceptual and instruction-following quality.
  • Agentic Metrics: wPED, DepCov, and ReplanQ for plan quality, dependency coverage, and robustness with / without memory.
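
J & F are the standard region-similarity and contour-accuracy scores from video object segmentation. As a minimal illustration, the region term J is the mean mask IoU across frames (the contour term F additionally compares mask boundaries):

# Region similarity J for segmentation-style tracks: mean IoU between
# predicted and ground-truth masks across frames. This is the standard
# DAVIS-style J; it is shown here for illustration only.

import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean mask arrays of shape (T, H, W)."""
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    # frames where both masks are empty count as a perfect match
    ious = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return float(ious.mean())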
[Figure: UniVA-Bench Overview]
For detailed metrics, ablations, and comparisons against baselines, please refer to our paper.

Team

Zhengyang Liang* (SMU) · Daoan Zhang* (UR) · Huichi Zhou (UCL) · Rui Huang (NUS) · Bobo Li (NUS) · Yuechen Zhang (CUHK) · Shengqiong Wu (NUS) · Xiaohan Wang (Stanford) · Jiebo Luo (UR) · Lizi Liao (SMU) · Hao Fei# (NUS)

*Core contributors, equal contribution (zyliang@smu.edu.sg; daoan.zhang@rochester.edu)

#Project lead, correspondence (haofei7419@gmail.com)  

Citation

If you find UniVA useful for your research, please kindly cite our work.

🔖 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
@misc{liang2025univauniversalvideoagent,
      title={UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist}, 
      author={Zhengyang Liang and Daoan Zhang and Huichi Zhou and Rui Huang and Bobo Li and Yuechen Zhang and Shengqiong Wu and Xiaohan Wang and Jiebo Luo and Lizi Liao and Hao Fei},
      year={2025},
      eprint={2511.08521},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.08521}, 
}
    

Acknowledgement

We sincerely thank our colleagues, collaborators, and research partners for their valuable discussions and constructive feedback that helped shape the design and implementation of UniVA.

The current version of UniVA is a research prototype, and its overall capability is subject to the performance limitations of various backend modules, including perception, reasoning, and generative components. We will continue to refine and expand these modules to further enhance the agent’s reliability, scalability, and generalization ability in future releases.

Open-Source Policy: The code and models of UniVA are released under an open academic license. They are freely available for research and educational purposes, but any form of commercial use is strictly prohibited without explicit written permission from the authors. Unauthorized commercial use or redistribution may infringe intellectual property rights and result in legal liability.

For collaboration, licensing inquiries, or commercial partnerships, please contact us at haofei7419@gmail.com.