Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation
Embodied AI agents that can perceive, think, and act in the real world mark a major step forward for robotics. A central challenge is scalable, reliable robotic manipulation: the deliberate control of objects through selective contact. Despite advances in analytic methods, model-based approaches, and large-scale data-driven learning, most systems still operate in fragmented stages of data collection, training, and evaluation. These stages typically demand custom setups, manual curation, and task-specific tuning, inefficiencies that obscure failure patterns and hinder reproducibility, underscoring the need for a unified framework spanning learning and assessment.
Research in robotic manipulation has progressed from analytical models to neural world models that learn dynamics directly from sensory inputs, operating in both pixel and latent spaces. Large-scale video generation models can produce realistic visuals, but they frequently lack the action conditioning, long-horizon temporal consistency, and multi-view reasoning essential for control. Vision-language-action models can follow instructions but are constrained by imitation-based learning, which limits error recovery and planning. Policy evaluation also remains difficult: physics simulators require extensive tuning, and real-world testing is resource-intensive. Existing metrics tend to prioritize visual quality over task success, highlighting the need for benchmarks that reflect real-world manipulation performance.
The Genie Envisioner (GE), developed by the AgiBot Genie Team, NUS LV-Lab, and BUAA, serves as a unified platform for robotic manipulation that integrates policy learning, simulation, and evaluation within a video-generative framework. Its core component, GE-Base, is a large-scale, instruction-driven video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world tasks. GE-Act translates these representations into precise action trajectories, while GE-Sim provides rapid, action-conditioned video-based simulation. The EWMBench benchmark assesses visual realism, physical accuracy, and instruction-action alignment. Trained on over 1 million episodes, GE generalizes across diverse robots and tasks, facilitating scalable, memory-aware, and physically grounded embodied intelligence research.
GE’s architecture is built around three core components:
- GE-Base: A multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes that learns latent trajectories capturing scene evolution under specific commands.
- GE-Act: Converts latent video representations into executable action trajectories via a lightweight flow-matching decoder, enabling fast, precise motor control even on robots absent from the training data (a minimal sketch of this decoding step follows the list).
- GE-Sim: Utilizes the generative backbone of GE-Base as an action-conditioned neural simulator, enabling closed-loop, video-based rollouts at speeds far beyond real hardware (a closed-loop rollout sketch appears further below).
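To make the GE-Act decoding step concrete, here is a minimal sketch of flow-matching action generation: a small velocity network conditioned on a video latent is integrated with Euler steps from Gaussian noise to a 54-step action chunk. The class names, dimensions (e.g., a 14-dimensional dual-arm action space), and interfaces are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of flow-matching action decoding, loosely in the spirit of GE-Act.
# All class and argument names here are illustrative assumptions, not the released API.
import torch
import torch.nn as nn


class ActionFlowDecoder(nn.Module):
    """Tiny velocity-field network: maps (video latent, noisy action chunk, time) -> velocity."""

    def __init__(self, latent_dim=512, action_dim=14, horizon=54, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.action_dim = action_dim
        in_dim = latent_dim + horizon * action_dim + 1  # latent + flattened actions + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, video_latent, noisy_actions, t):
        # video_latent: (B, latent_dim); noisy_actions: (B, horizon, action_dim); t: (B, 1)
        x = torch.cat([video_latent, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


@torch.no_grad()
def decode_actions(decoder, video_latent, steps=10):
    """Integrate the learned velocity field from noise to an action trajectory (Euler steps)."""
    batch = video_latent.shape[0]
    actions = torch.randn(batch, decoder.horizon, decoder.action_dim)  # start from Gaussian noise
    for i in range(steps):
        t = torch.full((batch, 1), i / steps)
        velocity = decoder(video_latent, actions, t)
        actions = actions + velocity / steps  # Euler update along the flow
    return actions


# Usage: decode a 54-step dual-arm trajectory from a stand-in video latent.
decoder = ActionFlowDecoder()
latent = torch.randn(2, 512)          # placeholder for GE-Base's instruction-conditioned latent
trajectory = decode_actions(decoder, latent)
print(trajectory.shape)               # torch.Size([2, 54, 14])
```

In the actual system, the conditioning latent would come from GE-Base's instruction-conditioned video representation rather than a random tensor.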
The EWMBench suite evaluates the system comprehensively across video realism, physical consistency, and alignment between instructions and resulting actions.
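The sketch below illustrates, under assumed interfaces, how a GE-Act-style policy and a GE-Sim-style neural simulator could be wired into a closed-loop rollout for evaluation. The stub classes, method names, and tensor shapes are placeholders for illustration, not the actual Genie Envisioner API.

```python
# Illustrative closed-loop evaluation loop: a policy (GE-Act-style) proposes an action chunk,
# a neural world model (GE-Sim-style) renders the predicted next observation, and the loop repeats.
# All interfaces below are assumptions for illustration, not the actual Genie Envisioner API.
import numpy as np


class StubPolicy:
    """Stand-in policy: returns a chunk of joint-space actions given an observation and instruction."""
    def act(self, obs, instruction, horizon=54, action_dim=14):
        rng = np.random.default_rng(0)
        return rng.normal(size=(horizon, action_dim))  # placeholder action chunk


class StubNeuralSimulator:
    """Stand-in action-conditioned video simulator: maps (obs, actions) -> predicted frames."""
    def step(self, obs, actions, num_views=3, h=224, w=224):
        # A real video-generative simulator would roll the scene forward conditioned on the actions;
        # here we just return blank multi-view frames of the right shape.
        frames = np.zeros((len(actions), num_views, h, w, 3), dtype=np.uint8)
        return frames[-1]  # last predicted multi-view frame becomes the next observation


def closed_loop_rollout(policy, simulator, instruction, init_obs, num_chunks=5):
    """Alternate policy action chunks and simulated observations for a fixed number of chunks."""
    obs, trace = init_obs, []
    for _ in range(num_chunks):
        actions = policy.act(obs, instruction)
        obs = simulator.step(obs, actions)
        trace.append((actions, obs))
    return trace


trace = closed_loop_rollout(
    StubPolicy(), StubNeuralSimulator(),
    instruction="fold the shirt and place it in the box",
    init_obs=np.zeros((3, 224, 224, 3), dtype=np.uint8),
)
print(len(trace), trace[0][0].shape)  # 5 chunks, each with a (54, 14) action chunk
```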
In evaluations, Genie Envisioner demonstrated strong performance in both real-world and simulated environments across various robotic manipulation tasks. GE-Act achieved rapid control generation (54-step trajectories in 200 ms) and consistently outperformed leading vision-language-action baselines in both step-wise and end-to-end success rates. It adapted to new robot types, such as Agilex Cobot Magic and Dual Franka, with only one hour of task-specific data, excelling in complex deformable object tasks. GE-Sim delivered high-fidelity, action-conditioned video simulations for scalable, closed-loop policy testing. The EWMBench benchmark confirmed GE-Base’s superior temporal alignment, motion consistency, and scene stability over state-of-the-art video models, closely aligning with human quality assessments.
In conclusion, Genie Envisioner is a unified, scalable platform for dual-arm robotic manipulation that merges policy learning, simulation, and evaluation into a cohesive video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model that captures the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act translates these representations into precise, adaptable action plans, even for new robot types with minimal retraining. GE-Sim offers high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench provides rigorous evaluation of realism, alignment, and consistency. Extensive real-world tests highlight the system’s superior performance, establishing it as a robust foundation for general-purpose, instruction-driven embodied intelligence.
Check out the Paper and GitHub Page for more information.