
OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks

OpenAI has introduced GDPval, an evaluation suite designed to assess AI model performance on real-world, economically valuable tasks across 44 occupations drawn from the nine U.S. sectors that contribute most to GDP. Unlike traditional academic benchmarks, GDPval focuses on authentic deliverables such as presentations, spreadsheets, briefs, CAD artifacts, and audio/video materials, evaluated by occupational experts through blinded pairwise comparisons.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals with an average of 14 years of experience. The tasks map to O*NET work activities and span multi-modal file handling, including documents, slides, images, audio, video, spreadsheets, and CAD files. A public gold subset provides prompts and reference deliverables, while primary scoring still relies on expert pairwise judgments, since deliverable quality is subjective and format-sensitive.
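
For readers who want to poke at the released tasks, the sketch below shows one way to load and summarize the public gold subset from Hugging Face; the dataset id and field names here are assumptions made for illustration, so consult the dataset card for the actual schema.

```python
# Minimal sketch of browsing the public GDPval gold subset from Hugging Face.
# The dataset id ("openai/gdpval"), split name, and "occupation" field are
# assumptions for illustration; check the dataset card for the real schema.
from collections import Counter

from datasets import load_dataset

gold = load_dataset("openai/gdpval", split="train")  # split name assumed

# Count tasks per occupation to see how the 44 occupations are represented.
by_occupation = Counter(row["occupation"] for row in gold)
for occupation, n_tasks in by_occupation.most_common(10):
    print(f"{occupation}: {n_tasks} tasks")
```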

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review. For the top models, reported win/tie rates against expert deliverables approach parity, and error profiles cluster around instruction-following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding, such as format checks and rendering artifacts for self-inspection, yield predictable performance gains.
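
To make the win/tie metric concrete, the following sketch reduces a set of blinded pairwise judgments to a single rate; the record layout and labels are illustrative rather than GDPval's actual data format.

```python
# Minimal sketch of reducing blinded pairwise judgments to a win/tie rate.
# The Judgment record and its labels are illustrative, not GDPval's format.
from dataclasses import dataclass

@dataclass
class Judgment:
    task_id: str
    preferred: str  # "model", "human", or "tie" (graders are blinded to source)

def win_or_tie_rate(judgments: list[Judgment]) -> float:
    """Fraction of comparisons where the model deliverable is preferred
    over, or judged equal to, the expert deliverable."""
    favorable = sum(j.preferred in ("model", "tie") for j in judgments)
    return favorable / len(judgments)

sample = [Judgment("t1", "human"), Judgment("t2", "model"), Judgment("t3", "tie")]
print(win_or_tie_rate(sample))  # -> 0.666...
```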

Time–Cost Math: Where AI Pays Off

GDPval conducts scenario analyses comparing human-only workflows to model-assisted workflows with expert review. It quantifies human completion time and wage-based cost, reviewer time and cost, model latency, API cost, and empirically observed win rates. The results indicate potential time and cost reductions across several task classes even once review overhead is included.
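
The arithmetic behind such a scenario analysis is straightforward; the sketch below compares a human-only workflow against a model-drafts-then-expert-reviews workflow, where every number is an illustrative placeholder rather than a figure from the GDPval report.

```python
# Back-of-envelope version of the scenario analysis: human-only cost versus a
# model-drafts-then-expert-reviews workflow. All numbers are made-up
# placeholders, not figures from the GDPval report.

def human_only_cost(task_hours: float, wage: float) -> float:
    """Expert completes the deliverable from scratch."""
    return task_hours * wage

def model_assisted_cost(task_hours: float, wage: float, review_hours: float,
                        api_cost: float, win_rate: float) -> float:
    """Model drafts, expert reviews; with probability (1 - win_rate) the draft
    is rejected and the expert redoes the task on top of the review time."""
    expected_rework = (1.0 - win_rate) * task_hours * wage
    return api_cost + review_hours * wage + expected_rework

human = human_only_cost(task_hours=6.0, wage=60.0)
assisted = model_assisted_cost(task_hours=6.0, wage=60.0, review_hours=1.0,
                               api_cost=2.0, win_rate=0.45)
print(f"human-only: ${human:.0f}  model-assisted (expected): ${assisted:.0f}")
```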

Automated Judging: Useful Proxy, Not Oracle

An automated pairwise grader for the gold subset shows approximately 66% agreement with human experts, within about 5 percentage points of the human-to-human agreement rate of roughly 71%. The automated grader is positioned as an accessible proxy for rapid iteration, not a replacement for expert review.
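
The agreement numbers above amount to counting how often two graders pick the same side of a comparison; the following sketch computes that statistic on made-up labels.

```python
# The agreement statistic is the share of pairwise comparisons on which two
# graders pick the same side. The labels below are made up for illustration.

def agreement_rate(grader_a: list[str], grader_b: list[str]) -> float:
    """Fraction of items where two graders (e.g., the automated judge and a
    human expert) give the same verdict: 'model', 'human', or 'tie'."""
    assert len(grader_a) == len(grader_b)
    return sum(a == b for a, b in zip(grader_a, grader_b)) / len(grader_a)

auto_judge = ["model", "human", "tie", "model", "human"]
expert     = ["model", "human", "model", "model", "tie"]
print(agreement_rate(auto_judge, expert))  # -> 0.6
```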

Why This Isn’t Yet Another Benchmark

GDPval differentiates itself through:

  • Occupational breadth: Covering the top GDP sectors and a wide range of O*NET work activities, rather than narrowly focused domains.
  • Deliverable realism: Multi-file, multi-modal inputs/outputs that emphasize structure, formatting, and data handling.
  • Moving ceiling: Utilizing human preference win rates against expert deliverables, allowing for re-baselining as models improve.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work and excludes physical labor, long-horizon interactivity, and organization-specific tooling. Tasks are one-shot and precisely specified; ablation studies show performance drops when context is reduced. Task construction and grading are resource-intensive, which motivates both the automated grader (whose limitations are documented) and plans to expand coverage in future versions.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evaluations with occupational, multi-modal, file-centric tasks, and reports human preference outcomes, time and cost analyses, and the effects of reasoning effort and agent scaffolding. As a version-0 release, it is expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by combining expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time and cost trade-offs while surfacing failure modes and the effects of scaffolding and reasoning effort. Version 0 remains limited to computer-mediated, one-shot tasks and still depends on expert review, but it establishes a reproducible baseline for tracking real-world capability gains across occupations.

Check out the Paper, technical details, and the dataset on Hugging Face.
