Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration
Recent advances in multimodal foundation models have driven significant progress in domains such as mathematics and knowledge-based reasoning. While these models achieve human-competitive accuracy on benchmarks like AIME, GPQA, MATH-500, and OlympiadBench, they fall short in a critical area: physical reasoning. This kind of reasoning requires integrating disciplinary knowledge, symbolic operations, and real-world constraints, a combination that differs fundamentally from purely mathematical reasoning.
For instance, interpreting a "smooth surface" as one with a zero friction coefficient requires models to maintain physical consistency throughout the reasoning chain, since physical laws hold regardless of the reasoning path taken. And although these models demonstrate strong visual comprehension by integrating visual and textual inputs, it remains uncertain whether they can genuinely perform the advanced physical reasoning that real-world scenarios demand.
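To make that consistency requirement concrete, here is a small illustrative snippet (not taken from the paper) showing how the zero-friction assumption implied by "smooth" must be carried through every step of a simple incline problem:

```python
import math

def incline_acceleration(theta_deg: float, mu: float, g: float = 9.81) -> float:
    """Acceleration of a block sliding down an inclined plane.

    A "smooth surface" in the problem statement implies mu = 0, and that
    assumption has to hold at every step of the reasoning chain.
    """
    theta = math.radians(theta_deg)
    # Friction opposes the motion; on a smooth surface the second term vanishes.
    return g * (math.sin(theta) - mu * math.cos(theta))

# Smooth 30-degree incline: a = g * sin(30°) ≈ 4.91 m/s²
print(incline_acceleration(30.0, mu=0.0))
# Silently reintroducing friction (mu = 0.2) mid-derivation changes the answer,
# which is exactly the kind of inconsistency the benchmark probes for.
print(incline_acceleration(30.0, mu=0.2))
```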
To address this gap, researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have introduced PHYX, a novel benchmark aimed at evaluating the physical reasoning capabilities of foundation models. PHYX comprises 3,000 visually grounded physics questions, meticulously curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics.
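For illustration only, a single PHYX-style item can be pictured as a small multimodal record; the field names below are hypothetical and do not reflect the released dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhyXStyleItem:
    """Hypothetical layout of one visually grounded physics question.

    Field names are illustrative, not the released dataset's schema.
    """
    question: str                         # problem statement referring to the figure
    image_path: str                       # schematic or diagram the question is grounded in
    domain: str                           # one of the six domains, e.g. "Mechanics"
    answer: str                           # reference answer for the open-ended setting
    choices: Optional[list[str]] = None   # only populated in the multiple-choice variant
```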
PHYX evaluates physics-based reasoning through multimodal problem-solving with three core innovations:
- 3,000 newly collected questions featuring realistic physical scenarios that require integrated visual analysis and causal reasoning,
- Expert-validated data design encompassing six fundamental physics domains, and
- Strict, unified three-step evaluation protocols (a generic sketch of such an answer-checking pipeline follows this list).
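The article does not spell out the three steps, so the following extract–normalize–compare pipeline is only a generic sketch of what a unified protocol for open-ended physics answers might look like, not the paper's actual implementation:

```python
import math
import re

def extract_final_answer(response: str) -> str:
    """Assumed step 1: pull a final answer out of a free-form model response."""
    match = re.search(r"final answer\s*[:=]\s*(.+)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    return lines[-1].strip() if lines else ""

def normalize(answer: str) -> str:
    """Assumed step 2: drop whitespace, trailing periods, and case."""
    return re.sub(r"\s+", "", answer).rstrip(".").lower()

def is_correct(response: str, reference: str, rel_tol: float = 1e-2) -> bool:
    """Assumed step 3: numeric comparison with tolerance, else exact string match."""
    pred, ref = normalize(extract_final_answer(response)), normalize(reference)
    try:
        return math.isclose(float(pred), float(ref), rel_tol=rel_tol)
    except ValueError:
        return pred == ref

print(is_correct("Plugging in, the final answer: 4.905", "4.91"))  # True within 1%
```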
The data collection process for PHYX consists of four stages designed to ensure high-quality data. It begins with a comprehensive survey of core physics disciplines to ensure coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. The process respects copyright restrictions and avoids data contamination by selecting questions whose answers are not immediately available. Quality control then applies a three-stage cleaning process: duplicate detection through lexical-overlap analysis, manual review by physics Ph.D. students, and removal of the shortest 10% of questions by text length, yielding 3,000 high-quality questions from an initial pool of 3,300.
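The exact tooling behind those cleaning stages is not described here; as a rough sketch, the two automatable steps (duplicate detection by lexical overlap and the length-based filter) could look like the following, with the thresholds being assumptions:

```python
from difflib import SequenceMatcher

def lexical_overlap(a: str, b: str) -> float:
    """Rough lexical similarity between two question texts, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def clean_pool(questions: list[str],
               dup_threshold: float = 0.9,
               drop_fraction: float = 0.1) -> list[str]:
    """Sketch of the two automatable cleaning stages (threshold values are assumptions)."""
    # Stage 1: near-duplicate removal -- keep a question only if it is not too
    # similar to anything already kept.
    kept: list[str] = []
    for q in questions:
        if all(lexical_overlap(q, k) < dup_threshold for k in kept):
            kept.append(q)
    # (Stage 2, manual review by physics Ph.D. students, cannot be automated here.)
    # Stage 3: drop the shortest `drop_fraction` of the remaining questions.
    kept.sort(key=len)
    cutoff = int(len(kept) * drop_fraction)
    return kept[cutoff:]
```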
PHYX poses significant challenges for current models: even the lowest-performing human experts achieve 75.6% accuracy, surpassing all evaluated models. The results also show that multiple-choice formats can narrow performance gaps by letting weaker models rely on surface-level cues, whereas open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o's performance on PHYX with its previously reported results on MathVista and MATH-V (both at 63.8%) underscores that physical reasoning requires a deeper integration of abstract concepts and real-world knowledge, making it more challenging than purely mathematical settings.
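One simple reason multiple-choice formats compress performance gaps is their non-zero guessing floor: with four options, a model that has learned nothing still lands near 25%, whereas guessing an open-ended numeric answer almost never succeeds. The toy simulation below (not from the paper) makes that floor explicit:

```python
import random

def guessing_floor(num_choices: int, trials: int = 100_000) -> float:
    """Accuracy of pure random guessing on multiple-choice items."""
    hits = sum(random.randrange(num_choices) == 0 for _ in range(trials))
    return hits / trials

print(f"{guessing_floor(4):.1%}")  # ≈ 25.0% with four options; open-ended guessing is ~0%
```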
In conclusion, PHYX stands as the first large-scale benchmark for assessing physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluations reveal that state-of-the-art models exhibit limitations in physical reasoning, often relying on memorized knowledge, mathematical formulas, and superficial visual patterns rather than a true understanding of physical principles. It is important to note that PHYX focuses exclusively on English-language prompts and annotations, which may limit the assessment of multilingual reasoning abilities. Additionally, while the images used depict physically realistic scenarios, they are frequently schematic or textbook-style rather than actual photographs, potentially failing to capture the complexity of perception in natural environments.
Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project.