UniVA: Universal Video Agent

towards Open-Source Next-Generation Video Generalist

🤖 Highly automated, interactive, proactive video creation experience

AGENTIC CREATION · PROACTIVE WORKFLOW · MCP-NATIVE · EXTENSIBLE
  • Multi-round co-creation. Talk like a director; UniVA iterates shots & stories with you.
  • Deep memory & context. Global + user memory keep preferences, lore, styles consistent.
  • Implicit intent reading. Understands vague & evolving instructions; less prompt hacking.
  • Proactive agent. Auto plans, checks, and suggests better shots & stories, not just obeys.
  • End-to-end workspace. UniVA plans, calls tools, and delivers complete videos.
  • Extensibility. MCP-native, modular design; easy to extend with new models & tools.

🎬 Omnipotent, unified, industrial-grade video production engine

UNIVERSAL VIDEO FABRIC · INDUSTRIAL QUALITY
  • Any-conditioned pipeline. Text / Image / Entity / Video → controllable video in one framework.
  • Super HD & consistent. Cinematic quality with stable identity & objects.
  • Complex narratives. Multi-scene, multi-role, multi-shot stories under structured control.
  • Ultra-long & fine-grained editing. From long-form cuts to per-shot/per-object refinement.
  • Grounded by understanding. Long-video comprehension & segmentation guide generation & edits.

Try UniVA as your video director

Describe a universe, a campaign, a pet, or a long-form story! UniVA will plan, compose and produce the video for you.

🚀 Try UniVA Demo System
Invitation only. Request access via the registration form.


Technical Highlights

Plan–Act Agents

Dual-Agent Orchestration

A planner reasons over long-horizon goals and memory, while an executor grounds each step in MCP tool calls. Together they turn vague prompts into structured, verifiable plans instead of one-shot guesses, following the loop sketched below.

High-level goal → Structured plan → MCP tool calls → Checked outputs

[Figure: Dual-Agent Architecture]
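
As a rough illustration, a plan-act loop of this kind can be sketched in a few lines of Python. The Planner / Executor interfaces below (plan, replan, call, verify, record) are assumptions for exposition, not UniVA's actual API.

# Hedged sketch of a plan-act loop: the planner drafts a structured plan
# from the goal plus memory; the executor grounds each step in an MCP tool
# call and verifies the output; a failed check triggers a targeted replan.
# All interfaces here are assumptions, not UniVA's real API.

from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # MCP tool name, e.g. "text2video"
    args: dict   # arguments for the tool call
    done: bool = False


def run(goal: str, planner, executor, memory, max_replans: int = 3):
    plan = planner.plan(goal, memory)                      # vague prompt -> list[Step]
    for _ in range(max_replans):
        for step in plan:
            if step.done:
                continue
            output = executor.call(step.tool, step.args)   # grounded MCP call
            if executor.verify(step, output):              # checked outputs
                step.done = True
                memory.record(step, output)
            else:
                # feed the failure back so the planner can revise the plan
                plan = planner.replan(goal, plan, step, output, memory)
                break
        else:
            return memory.assemble_video()                 # every step verified
    raise RuntimeError("plan did not converge within the replan budget")

The point of the loop is that the final video is only assembled once every step has passed verification; a failed check routes back through the planner instead of silently continuing.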
MCP-native Tools

Composable Tool Fabric

UniVA connects video, vision, language, and utility tools through the Model Context Protocol. Agents dynamically select and chain tools, enabling plug-and-play expansion and multi-step pipelines instead of isolated black-box calls.

[Figure: Synergistic Components and Tool Pipelines]
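
To make the chaining concrete, here is what a two-step pipeline could look like from the client side, using the official MCP Python SDK. The server command and the tool names (video_segment, entity2video) are hypothetical stand-ins, not UniVA's published tool registry.

# Hypothetical two-step tool chain over MCP: segmentation output feeds
# generation. Tool names and the server command are illustrative; the
# client calls (ClientSession, list_tools, call_tool) are the MCP SDK.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def pipeline(video_path: str, prompt: str):
    server = StdioServerParameters(command="univa-tools")  # assumed entry point
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # agents discover tools at runtime
            print([t.name for t in tools.tools])
            # Step 1: ground the request with segmentation.
            masks = await session.call_tool(
                "video_segment", {"video": video_path, "query": "main character"})
            # Step 2: chain the grounded entities into generation.
            return await session.call_tool(
                "entity2video", {"entities": masks.content, "prompt": prompt})


asyncio.run(pipeline("demo.mp4", "same character, new scene at night"))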
Memory & Consistency

Three-Level Memory Mechanism

A hierarchical memory design maintains story state, user preferences, and tool context, so characters, styles, and constraints stay coherent across long videos and iterative edits.

[Figure: Three-Level Memory Mechanism]
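
One way to picture the hierarchy is as three nested stores that are merged into the context of each tool call. The sketch below is our own illustration of the idea; the field names are assumptions, not UniVA's schema.

# Sketch of a three-level memory: story state, user preferences, and
# tool context, merged per call. Field names are assumptions.

from dataclasses import dataclass, field


@dataclass
class StoryMemory:
    """Global story state: characters, lore, and the shot history."""
    characters: dict = field(default_factory=dict)
    shots: list = field(default_factory=list)


@dataclass
class UserMemory:
    """Per-user standing preferences: styles, pacing, constraints."""
    preferences: dict = field(default_factory=dict)


@dataclass
class ToolMemory:
    """Per-session tool context: recent tool inputs and outputs."""
    calls: list = field(default_factory=list)


@dataclass
class UniVAMemory:
    story: StoryMemory = field(default_factory=StoryMemory)
    user: UserMemory = field(default_factory=UserMemory)
    tools: ToolMemory = field(default_factory=ToolMemory)

    def context_for_call(self) -> dict:
        # Assemble the slice of memory a tool call needs, so characters,
        # styles, and constraints stay consistent across edits.
        return {
            "characters": self.story.characters,
            "style": self.user.preferences.get("style"),
            "recent_calls": self.tools.calls[-5:],
        }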

Function Walkthrough

UniVA integrates an extensive, modular toolset via the Model Context Protocol (MCP), enabling flexible plug-and-play extension across diverse media tasks.

Note: the above taxonomy covers only the current set of meta-functions supported in UniVA. Thanks to the MCP architecture, the system is openly extensible: new tools, capabilities, and media modalities can be integrated seamlessly in the future.
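
To make that concrete: because the fabric is MCP-native, adding a capability amounts to standing up one more MCP server. Below is a minimal sketch using the MCP Python SDK's FastMCP helper; the tool itself (stylize_clip) is a made-up example, not part of UniVA's toolset.

# Made-up example of extending the fabric with a new MCP server.
# FastMCP and mcp.run() are the official MCP Python SDK; the tool
# body is a placeholder for a real style-transfer model.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-video-tools")


@mcp.tool()
def stylize_clip(video_path: str, style: str) -> str:
    """Apply a named style to a clip and return the output path."""
    out = video_path.replace(".mp4", f".{style}.mp4")
    # ... invoke your style-transfer model of choice here ...
    return out


if __name__ == "__main__":
    mcp.run()  # serve over stdio; the agent can now discover this tool

Once the server is running, the agent discovers the new tool at runtime and can splice it into its plans like any built-in.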

UniVA-Bench

UniVA-Bench is a unified benchmark for agent-oriented video intelligence, mirroring real workflows where understanding, generation, and editing are intertwined instead of isolated single-step tasks.

Understanding

Long-Video QA & Reasoning

Multi-question QA on the same long video, covering narrative, style, transitions, and fine-grained semantics to test temporal reasoning and memory.

  • Shot-level & story-level queries.
  • Requires using context across the entire video.
Editing

Multi-Step Long-Video Editing

Realistic editing chains: replacement, attribute change, style transfer, and composition, all while keeping characters, scenes, and story logic consistent.

  • Plans must select & combine proper tools.
  • Rewards temporal and identity coherence.
Generation

Tool-Augmented Video Creation

Three creation modes (LongText2Video, Image/Entities2Video, and Video2Video) test whether agents can plan, preserve identity, and produce coherent multi-shot narratives.

  • Storyboard-first, identity-aware generation.
  • Evaluates end-to-end agentic pipelines.
How we evaluate UniVA-Bench

  • Task Quality: CLIP / DINO similarity, QA accuracy, and J & F for segmentation-style tracks (see the sketch after this list).
  • User Preference: MLLM-as-a-Judge for perceptual and instruction-following quality.
  • Agentic Metrics: wPED, DepCov, and ReplanQ for plan quality, dependency coverage, and robustness with / without memory.
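
J & F are the standard region-similarity and contour-accuracy scores from video object segmentation. As a minimal illustration, the region term J is the mean mask IoU across frames (the contour term F additionally compares mask boundaries):

# Region similarity J for segmentation-style tracks: mean IoU between
# predicted and ground-truth masks across frames. This is the standard
# DAVIS-style J; it is shown here for illustration only.

import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean mask arrays of shape (T, H, W)."""
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    # frames where both masks are empty count as a perfect match
    ious = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return float(ious.mean())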
[Figure: UniVA-Bench Overview]
For detailed metrics, ablations, and comparisons against baselines, please refer to our paper.

Team

Zhengyang Liang* (SMU) · Daoan Zhang* (UR) · Huichi Zhou (UCL) · Rui Huang (NUS) · Bobo Li (NUS) · Yuechen Zhang (CUHK) · Shengqiong Wu (NUS) · Xiaohan Wang (Stanford) · Jiebo Luo (UR) · Lizi Liao (SMU) · Hao Fei# (NUS)

*Core contributors, equal contribution (zyliang@smu.edu.sg; daoan.zhang@rochester.edu)

#Project lead, correspondence (haofei7419@gmail.com)  

Citation

If you find UniVA useful for your research, please kindly cite our work.

🔖 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
@misc{liang2025univauniversalvideoagent,
      title={UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist}, 
      author={Zhengyang Liang and Daoan Zhang and Huichi Zhou and Rui Huang and Bobo Li and Yuechen Zhang and Shengqiong Wu and Xiaohan Wang and Jiebo Luo and Lizi Liao and Hao Fei},
      year={2025},
      eprint={2511.08521},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.08521}, 
}
    

Acknowledgement

We sincerely thank our colleagues, collaborators, and research partners for their valuable discussions and constructive feedback that helped shape the design and implementation of UniVA.

The current version of UniVA is a research prototype, and its overall capability is subject to the performance limitations of various backend modules, including perception, reasoning, and generative components. We will continue to refine and expand these modules to further enhance the agent’s reliability, scalability, and generalization ability in future releases.

Open-Source Policy: The code and models of UniVA are released under an open academic license. They are freely available for research and educational purposes, but any form of commercial use is strictly prohibited without explicit written permission from the authors. Unauthorized commercial use or redistribution may infringe intellectual property rights and result in legal liability.

For collaboration, licensing inquiries, or commercial partnerships, please contact us at haofei7419@gmail.com.