VideoGameBench: Can Vision-Language Models Complete Popular Video Games?

VideoGameBench is a rigorous benchmark that evaluates VLMs’ real-time decision-making, perception, memory, and planning by challenging them to complete 1990s-era video games with only raw visual inputs and minimal control instructions.

Key Highlights

  • Real-Time, Visually Rich Environments – Evaluates VLMs on 23 popular Game Boy and MS-DOS games, including 3 held-out secret games to assess generalization to unseen environments.
  • Raw Pixels Only – No game-specific APIs, state overlays, or memory modules; models rely solely on frame-level screenshots and high-level objective/control text.
  • VG-Agent Scaffold – Implements ReAct-style agents with a textual scratchpad memory, system/game prompts, and historical frame context for sequential action generation (see the agent-loop sketch after this list).
  • Strict Evaluation Rules – Forbids external tools, RAM access, overlays, and human guidance, in contrast to past agents such as “Gemini Plays Pokémon” that relied on engineered pathfinding tools.
  • Zero-Shot Evaluation in Two Modes – Models are benchmarked in full real-time (VideoGameBench), where the game keeps running while the model thinks, and in a latency-free mode (VideoGameBench Lite), where the emulator pauses for agent reasoning.
  • Automated Checkpoint-Based Scoring – Progress is detected by perceptually hashing game frames and matching them against checkpoint frames from YouTube walkthroughs, enabling milestone-level measurement (see the hashing sketch after this list).
  • Cross-Genre Coverage – Games span platformers, FPS, strategy, RPG, puzzle, and racing—requiring spatial reasoning, object interaction, resource management, and fine motor control.
  • Challenging Results – The best model (Gemini 2.5 Pro) achieved an average game completion of just 0.48% in real time and 1.6% in Lite mode; no model reached even the first checkpoint in 9/10 test games.
  • Diagnostic Game Failures – Frontier VLMs failed at basic tasks (e.g., dragging, 2D grid navigation, clicking) in custom practice games, revealing fundamental deficits in spatial grounding.
  • Common Failure Modes – Models exhibit a “knowing–doing gap” (e.g., stating the correct key press but executing it at the wrong moment or position), memory overwriting, repeated action loops, hallucinated progress, and visual misperception.
  • Open Tools & Code – Benchmark code, prompts, and interface are open-source to drive transparent evaluation and community improvements.
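To make the VG-Agent idea concrete, here is a minimal ReAct-style loop sketched in Python. The `vlm` client, its `generate()` signature, and the `thought`/`action` reply fields are illustrative assumptions, not the benchmark's actual interface.

```python
from collections import deque

class VGAgent:
    """Minimal ReAct-style loop: observe recent frames, reason in a scratchpad, emit one action."""

    def __init__(self, vlm, system_prompt, game_prompt, history_len=3):
        self.vlm = vlm                      # hypothetical VLM client exposing .generate()
        self.system_prompt = system_prompt  # benchmark-wide rules and control instructions
        self.game_prompt = game_prompt      # high-level objective text for this game
        self.frames = deque(maxlen=history_len)  # short window of recent screenshots
        self.scratchpad = ""                # textual memory the model rewrites each step

    def step(self, frame):
        self.frames.append(frame)
        # One model call yields a thought (kept as memory) and an action (sent to the emulator).
        reply = self.vlm.generate(
            system=self.system_prompt,
            text=f"{self.game_prompt}\n\nMemory:\n{self.scratchpad}\n\nNext action?",
            images=list(self.frames),
        )
        self.scratchpad = reply["thought"]  # overwrite memory with the updated notes
        return reply["action"]              # e.g. a button press such as 'A' or 'UP'
```

The scratchpad-overwrite design is what makes the memory-overwriting failure mode above possible: each step the model replaces its notes wholesale rather than appending to a log.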
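The checkpoint-scoring idea can likewise be sketched with the open-source `imagehash` library: hash each incoming frame and compare it against reference frames taken from a walkthrough. The file paths and the Hamming-distance threshold below are assumptions for illustration; the benchmark's exact values may differ.

```python
import imagehash
from PIL import Image

# Hypothetical reference frames extracted from a YouTube walkthrough.
CHECKPOINT_HASHES = [
    imagehash.phash(Image.open(path))
    for path in ("checkpoints/level1_boss.png", "checkpoints/level2_start.png")
]

HAMMING_THRESHOLD = 8  # assumed tolerance for a "match"

def matched_checkpoint(frame: Image.Image) -> int | None:
    """Return the index of the first checkpoint this frame matches, if any."""
    frame_hash = imagehash.phash(frame)
    for i, ref in enumerate(CHECKPOINT_HASHES):
        # imagehash overloads '-' as the Hamming distance between hashes
        if frame_hash - ref <= HAMMING_THRESHOLD:
            return i
    return None
```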
