
NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World Environments

AI has made significant strides in language processing, mathematics, and code generation. However, extending these capabilities to physical environments remains a challenge. Physical AI aims to bridge this gap by developing systems that perceive, understand, and act within dynamic, real-world settings. Unlike traditional AI, which primarily processes text or symbols, Physical AI engages with sensory inputs, particularly video, to generate responses grounded in real-world physics. These systems are designed for navigation, manipulation, and interaction, relying on common-sense reasoning and an embodied understanding of space, time, and physical laws. Applications include robotics, autonomous vehicles, and human-machine collaboration, where adaptability to real-time perception is essential.

The current limitations of AI models stem from their weak connection to real-world physics. While they excel in abstract tasks, they often struggle to predict physical consequences or respond appropriately to sensory data. Concepts such as gravity and spatial relationships are not inherently understood, rendering them unreliable for embodied tasks. Training directly in physical environments is costly and risky, hindering development and iteration. This lack of physical grounding and embodied understanding poses a significant barrier to effectively deploying AI in real-world applications.

Previously, tools for physical reasoning in AI were fragmented. Vision-language models linked visual and textual data but lacked depth in reasoning. Rule-based systems were inflexible and failed in novel scenarios. Simulations and synthetic data often overlooked the nuances of real-world physics. Moreover, there was no standardized framework to define or evaluate physical common sense or embodied reasoning, making progress difficult to quantify. Reinforcement learning approaches lacked task-specific reward structures, leading to models that struggled with cause-and-effect reasoning and physical feasibility.

Researchers from NVIDIA have introduced Cosmos-Reason1, a suite of multimodal large language models specifically designed for physical reasoning tasks. The models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, are trained in two major phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL). A key differentiator of this approach is the introduction of a dual-ontology system. One hierarchical ontology organizes physical common sense into three main categories: Space, Time, and Fundamental Physics, further divided into 16 subcategories. The second ontology is two-dimensional, mapping reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. These ontologies serve as training guides and evaluation tools for benchmarking AI’s physical reasoning.
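To make the dual-ontology idea concrete, it can be pictured as a small data structure: a nested mapping for the hierarchical physical-common-sense ontology and a capability-by-agent grid for the embodied-reasoning ontology. The sketch below is illustrative only; the top-level categories and the agent types come from the description above, while the example subcategories and capability names are placeholders standing in for the full 16 subcategories and capability axis defined in the paper.

```python
# Illustrative sketch of the two ontologies described above (not NVIDIA's code).
# Top-level categories follow the article; the subcategories listed are examples
# only, standing in for the 16 subcategories defined in the paper.

PHYSICAL_COMMON_SENSE_ONTOLOGY = {
    "Space": ["spatial relationships", "..."],          # example subcategories
    "Time": ["action ordering", "..."],
    "Fundamental Physics": ["object permanence", "..."],
}

# Two-dimensional embodied-reasoning ontology: reasoning capabilities x agent types.
EMBODIED_AGENTS = ["human", "robot arm", "humanoid robot", "autonomous vehicle"]
EXAMPLE_CAPABILITIES = [  # placeholder capability axis, drawn from the evaluation tasks
    "task-completion verification",
    "next-action prediction",
    "action feasibility assessment",
]

def ontology_cell(capability: str, agent: str) -> str:
    """Address one cell of the 2-D grid, e.g. for tagging a benchmark question."""
    return f"{capability} / {agent}"
```

Used this way, the ontologies double as an index over the benchmark: every question can be tagged with the subcategory or grid cell it probes, which is what makes the evaluation systematic rather than ad hoc.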

The architecture of Cosmos-Reason1 employs a decoder-only LLM augmented with a vision encoder. Videos are processed to extract visual features, which are then projected into a shared space with language tokens. This integration allows the model to reason over both textual and visual data simultaneously. The researchers curated a dataset of approximately 4 million annotated video-text pairs for training, including action descriptions, multiple-choice questions, and long chain-of-thought reasoning traces. The reinforcement learning phase uses rule-based, verifiable rewards derived from human-labeled multiple-choice questions and from self-supervised video tasks, such as predicting the temporal direction of a clip and solving puzzles built from spatiotemporal patches. Because every reward can be checked against a known answer, the RL training stays closely tied to real-world physical logic.
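A minimal sketch of how a decoder-only LLM can consume projected video features alongside text tokens is shown below. The module names, dimensions, and pooling choices here (VideoTextFusion, vision_dim, llm_dim) are assumptions for illustration, not details of the Cosmos-Reason1 implementation.

```python
import torch
import torch.nn as nn

class VideoTextFusion(nn.Module):
    """Toy sketch: project video features into the LLM's token-embedding space
    and prepend them to the text embeddings (dimensions are made up)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)  # vision -> LLM space

    def forward(self, video_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_visual_tokens, vision_dim) from a vision encoder
        # text_embeds: (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.projector(video_feats)
        # The decoder-only LLM then attends over the combined token sequence.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Usage with random tensors standing in for real encoder outputs:
fusion = VideoTextFusion()
video = torch.randn(2, 256, 1024)   # e.g. pooled frame features
text = torch.randn(2, 32, 4096)
sequence = fusion(video, text)       # (2, 288, 4096), fed to the decoder-only LLM
```

The key design point is that once visual features live in the same embedding space as language tokens, the decoder needs no architectural changes to reason jointly over video and text.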

The team constructed three benchmarks for physical common sense, encompassing 604 questions from 426 videos, and six benchmarks for embodied reasoning, featuring 610 questions from 600 videos, covering a wide range of tasks. The Cosmos-Reason1 models demonstrated superior performance compared to previous baselines, particularly after the RL phase. Improvements were noted in task completion verification, predicting the next plausible actions, and assessing the physical feasibility of actions. These enhancements were observed in both model sizes, with Cosmos-Reason1-56B exhibiting stronger performance across most metrics. This performance improvement highlights the effectiveness of structured ontologies and multimodal data in enhancing physical reasoning in AI.

Key Takeaways from the Research on Cosmos-Reason1

  • Two models introduced: Cosmos-Reason1-7B and Cosmos-Reason1-56B, specifically trained for physical reasoning tasks.
  • Training conducted in two phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL).
  • Training dataset includes approximately 4 million annotated video-text pairs curated for physical reasoning.
  • Reinforcement learning employs rule-based and verifiable rewards derived from human annotations and video-based tasks (a minimal sketch follows this list).
  • Utilization of two ontologies: a hierarchical one with three categories and 16 subcategories, and a two-dimensional one mapping agent capabilities.
  • Benchmarks: 604 questions from 426 videos for physical common sense, and 610 from 600 videos for embodied reasoning.
  • Performance gains observed across all benchmarks after RL training, particularly in predicting next actions and verifying task completion.
  • Real-world applicability for robots, vehicles, and other embodied agents across diverse environments.
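
The rule-based, verifiable rewards mentioned above lend themselves to a very small sketch: because each training item has a single checkable answer (a multiple-choice label, the true temporal direction of a clip, or the correct arrangement of shuffled spatiotemporal patches), the reward reduces to an exact-match test. The function name and data layout below are assumptions for illustration, not the project's actual code.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 for an exact match with the verifiable label, else 0.0.

    Works for human-labeled multiple-choice answers ("A".."D") as well as for
    self-supervised labels such as a video's temporal direction
    ("forward"/"reversed") or the ordering of shuffled spatiotemporal patches.
    """
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0

# Examples (hypothetical items):
assert verifiable_reward("B", "b") == 1.0               # multiple-choice question
assert verifiable_reward("forward", "reversed") == 0.0  # temporal-direction task
```

Because the reward never depends on a learned judge, it cannot be gamed in the way open-ended reward models can, which is what makes this style of RL signal attractive for physical-reasoning tasks.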

In conclusion, the Cosmos-Reason1 initiative illustrates how AI can be better equipped for the physical world. It addresses critical limitations in perception, reasoning, and decision-making that have impeded progress in deploying AI in embodied scenarios. The structured training pipeline, grounded in real-world data and ontological frameworks, ensures that the models are accurate and adaptable. These advancements represent a significant step forward in bridging the gap between abstract AI reasoning and the requirements of systems that must operate in unpredictable, real-world environments.

Check out the Paper, Project Page, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.