
SDBench and MAI-DxO: Advancing Realistic, Cost-Aware Clinical Reasoning with AI


Understanding the Target Audience for SDBench and MAI-DxO

The target audience for SDBench and MAI-DxO includes healthcare professionals, medical researchers, and AI developers focused on enhancing clinical reasoning and diagnostic processes. Their pain points often include the limitations of current AI diagnostic tools, the cost of unnecessary testing, and the challenges of integrating AI into real-world clinical settings.

The goals of this audience are to improve diagnostic accuracy, reduce healthcare costs, and develop more interactive and realistic clinical reasoning tools. They are particularly interested in advancements in AI that allow for dynamic decision-making, cost-effective diagnostics, and educational applications for medical training.

In terms of communication preferences, this audience favors concise, data-driven content that provides clear insights into the effectiveness and applicability of AI solutions in healthcare.

Advancing Realistic, Cost-Aware Clinical Reasoning with AI

AI has the potential to make expert medical reasoning more accessible, but current evaluations often fall short by relying on simplified, static scenarios. Real clinical practice is far more dynamic; physicians adjust their diagnostic approach step by step, asking targeted questions and interpreting new information as it becomes available. This iterative process helps them refine hypotheses, weigh costs and benefits of tests, and avoid premature conclusions.

While language models have shown strong performance on structured exams, these assessments do not reflect real-world complexity, where premature decisions and over-testing remain serious concerns often overlooked by static evaluations.

Challenges in Medical Problem-Solving

Medical problem-solving has been explored for decades, with early AI systems utilizing Bayesian frameworks to guide sequential diagnoses in specialties such as pathology and trauma care. However, these approaches faced challenges due to the need for extensive expert input. Recent studies have shifted toward using language models for clinical reasoning, often evaluated through static, multiple-choice benchmarks that have become saturated.

Projects like AMIE and NEJM-CPC introduced more complex case material but still relied on fixed vignettes. While some newer approaches assess conversational quality or basic information gathering, few capture the full complexity of real-time, cost-sensitive diagnostic decision-making.

Introducing SDBench and MAI-DxO

To better reflect real-world clinical reasoning, researchers from Microsoft AI developed SDBench, a benchmark based on 304 real diagnostic cases from the New England Journal of Medicine, where doctors or AI systems must interactively ask questions and order tests before making a final diagnosis. A language model acts as a gatekeeper, revealing information only when specifically requested.

To improve performance, they introduced MAI-DxO, an orchestrator system co-designed with physicians that simulates a virtual medical panel to choose high-value, cost-effective tests. When paired with models like OpenAI’s o3, it achieved up to 85.5% accuracy while significantly reducing diagnostic costs.
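The virtual-panel idea can be pictured as several persona "roles" weighing candidate next actions against their cost, with the highest-value, most cost-effective action winning. The sketch below is a hypothetical illustration of that trade-off only; the role names, scoring function, and `budget_weight` parameter are assumptions, not the published MAI-DxO design, which orchestrates language-model personas rather than numeric lookups.

```python
# Hypothetical sketch of the panel's cost-aware action selection.
# value_of and cost_of are illustrative stand-ins for persona judgments.

def panel_choose(candidates, value_of, cost_of, budget_weight=0.001):
    """Pick the candidate action with the best value-per-dollar trade-off."""
    def score(action):
        # Diagnostic value minus a penalty proportional to the test's cost.
        return value_of(action) - budget_weight * cost_of(action)
    return max(candidates, key=score)


# Example: a cheap CBC with slightly lower expected value beats an
# expensive CT once cost is factored in (all numbers are made up).
value = {"cbc": 0.60, "ct chest": 0.65}.get
cost = {"cbc": 30.0, "ct chest": 450.0}.get
choice = panel_choose(["cbc", "ct chest"], value, cost)
```

Framing test selection as an explicit value-versus-cost optimization is one simple way to capture the paper's finding that structured orchestration reduces unnecessary testing.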

The SDBench Framework

The Sequential Diagnosis Benchmark (SDBench) was built using 304 NEJM Case Challenge scenarios (2017–2025), covering a wide range of clinical conditions. Each case was transformed into an interactive simulation where diagnostic agents could ask questions, request tests, or make a final diagnosis. A gatekeeper, powered by a language model and guided by clinical rules, responded to these actions using realistic case details or synthetic but consistent findings. Diagnoses were evaluated by a judge model using a physician-authored rubric focused on clinical relevance. Costs were estimated using CPT codes and pricing data to reflect real-world diagnostic constraints.
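The interaction loop described above, in which an agent asks questions, orders tests at a price, and finally commits to a diagnosis while a gatekeeper reveals only what was requested, can be sketched minimally as follows. This is an illustrative toy, not the benchmark's implementation: the class names, the fact and price tables, and the default test price are all assumptions, and the real SDBench gatekeeper and judge are language models guided by clinical rules, not dictionaries.

```python
from dataclasses import dataclass

# Illustrative case facts and CPT-style prices (invented for this sketch).
CASE_FACTS = {
    "chief complaint": "fever and weight loss for three weeks",
    "cbc": "mild anemia, elevated white count",
}
TEST_COSTS = {"cbc": 30.0, "ct chest": 450.0}


@dataclass
class Gatekeeper:
    """Reveals case information only when it is specifically requested,
    and accumulates the cost of every test the agent orders."""
    facts: dict
    costs: dict
    spent: float = 0.0

    def ask(self, question: str) -> str:
        return self.facts.get(question, "No additional information available.")

    def order_test(self, test: str) -> str:
        self.spent += self.costs.get(test, 100.0)  # assumed default price
        return self.facts.get(test, "Result within normal limits.")


def run_episode(gatekeeper: Gatekeeper, actions: list) -> tuple:
    """Drive one diagnostic episode: a sequence of (kind, payload) actions
    ending with a final diagnosis. Returns (diagnosis, total cost)."""
    for kind, payload in actions:
        if kind == "ask":
            gatekeeper.ask(payload)
        elif kind == "test":
            gatekeeper.order_test(payload)
        elif kind == "diagnose":
            return payload, gatekeeper.spent
    return "no diagnosis", gatekeeper.spent
```

A run might chain `("ask", "chief complaint")`, `("test", "cbc")`, and `("diagnose", "lymphoma")`, after which the accumulated `spent` value stands in for the per-case diagnostic cost that SDBench estimates from CPT codes and pricing data.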

Performance Evaluation

The researchers evaluated various AI diagnostic agents on SDBench and found that MAI-DxO consistently outperformed both off-the-shelf models and physicians. Standard models showed a trade-off between cost and accuracy, while MAI-DxO, built on o3, delivered higher accuracy at lower cost through structured reasoning and decision-making. For instance, it reached 81.9% accuracy at $4,735 per case, compared with off-the-shelf o3's 78.6% at $7,850. It remained robust across multiple models and held-out test data, indicating strong generalizability.

The system significantly improved weaker models and helped stronger ones utilize resources more efficiently, reducing unnecessary tests through smarter information gathering.

Conclusion

SDBench is a new diagnostic benchmark that turns NEJM CPC cases into realistic, interactive challenges, requiring AI systems or doctors to actively ask questions, order tests, and make diagnoses, all with associated costs. Unlike static benchmarks, it mimics real clinical decision-making. The researchers also introduced MAI-DxO, an orchestrator that simulates a panel of diverse medical personas to achieve high diagnostic accuracy at lower cost. Current results are promising, especially in complex cases, but limitations include a lack of everyday conditions and real-world constraints. Future work aims to test the system in real clinics and low-resource settings, with potential for global health impact and educational use.

Check out the Technical Details. All credit for this research goes to the researchers of this project.
