Friday, July 04, 2025

Microsoft Says Its New AI System Diagnosed Patients 4 Times More Accurately Than Human Doctors after poaching Google researchers

Good news! Competition is good, more competition is better!

The AI doctor is in the house! Will we finally be able to provide top-tier medical services in low-income countries around the world?

The average cost per case is still far higher for AI than for human doctors (about 2-3 times higher), but the accuracy is about 4 times higher than that of generalist physicians. I bet the costs will come down very soon.

The once almighty profession or cult of medical doctors is also facing some challenges in the age of AI. We may need far fewer doctors in the not-so-distant future.

"... The Microsoft team used 304 case studies sourced from the New England Journal of Medicine to devise a test called the Sequential Diagnosis Benchmark. A language model broke down each case into a step-by-step process that a doctor would perform in order to reach a diagnosis. ...

then built a system called the MAI Diagnostic Orchestrator (MAI-DxO) that queries several leading AI models—including OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, Meta’s Llama, and xAI’s Grok—in a way that loosely mimics several human experts working together. ..."

"The Microsoft AI team shares research that demonstrates how AI can sequentially investigate and solve medicine’s most complex diagnostic challenges—cases that expert physicians struggle to answer.

Benchmarked against real-world case records published each week in the New England Journal of Medicine, we show that the Microsoft AI Diagnostic Orchestrator (MAI-DxO) correctly diagnoses up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians. MAI-DxO also gets to the correct diagnosis more cost-effectively than physicians. ..."

From the abstract:
"Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings.
In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they’ve just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative diagnostic process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters.
A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed.
To complement the benchmark, we present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests.
When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy—four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3.
When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.
We highlight how AI systems, when guided to think iteratively and act judiciously, can advance both diagnostic precision and cost-effectiveness in clinical care."
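
The encounter protocol described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual code: the `Gatekeeper` class, the `run_encounter` helper, the toy case, and every dollar figure are made up here. Only the general idea (findings are revealed solely on request, and each visit or test adds to a running cost) comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Holds the full case and reveals a finding only when explicitly asked."""
    findings: dict        # question or test name -> result text (toy data)
    test_prices: dict     # test name -> hypothetical price in dollars
    visit_price: int = 300  # hypothetical charge per question
    spent: int = 0

    def ask(self, question):
        # Every question is billed as a visit in this simplified sketch.
        self.spent += self.visit_price
        return self.findings.get(question, "No information available.")

    def order_test(self, test):
        # Tests are billed at their listed (hypothetical) price.
        self.spent += self.test_prices.get(test, 0)
        return self.findings.get(test, "No information available.")


def run_encounter(gatekeeper, steps, final_diagnosis, truth):
    """Replay scripted (kind, name) steps, then score accuracy and cost."""
    transcript = []
    for kind, name in steps:
        if kind == "ask":
            reply = gatekeeper.ask(name)
        else:
            reply = gatekeeper.order_test(name)
        transcript.append((name, reply))
    return {
        "correct": final_diagnosis == truth,
        "cost": gatekeeper.spent,
        "transcript": transcript,
    }


# Invented toy case, for illustration only.
case = Gatekeeper(
    findings={
        "fever duration": "10 days",
        "chest x-ray": "right lower lobe infiltrate",
    },
    test_prices={"chest x-ray": 120},
)
result = run_encounter(
    case,
    [("ask", "fever duration"), ("test", "chest x-ray")],
    final_diagnosis="pneumonia",
    truth="pneumonia",
)
print(result["correct"], result["cost"])  # prints: True 420
```

In the real benchmark both the gatekeeper and the diagnosing agent are language models, and the final diagnosis is judged more carefully than by the exact string match used here.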

Microsoft Says Its New AI System Diagnosed Patients 4 Times More Accurately Than Human Doctors | WIRED "The tech giant poached several top Google researchers to help build a powerful AI tool that can diagnose patients and potentially cut health care costs."

The Path to Medical Superintelligence (original news release)



The MAI-Dx Orchestrator turns any language model into a virtual panel of clinicians: it can ask follow-up questions, order tests, or deliver a diagnosis, then run a cost check and verify its own reasoning before deciding whether to proceed.  
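
A rough Python sketch of that decision step follows. Everything in it is an assumption made for illustration: the `Action` type, the `choose_action` function, the confidence threshold, and the budget rule are hypothetical and are not taken from Microsoft's implementation, which relies on language models rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str      # "ask", "test", or "diagnose"
    detail: str
    cost: int = 0  # hypothetical price in dollars


def choose_action(confidence, proposed, spent, budget,
                  leading_diagnosis, threshold=0.8):
    """Pick the next step, applying a simple cost check before proceeding."""
    # Confident enough: commit to a diagnosis instead of spending more.
    if confidence >= threshold:
        return Action("diagnose", leading_diagnosis)
    # Cost check: if the proposed step would exceed the budget,
    # stop and commit to the current leading diagnosis.
    if spent + proposed.cost > budget:
        return Action("diagnose", leading_diagnosis)
    # Otherwise go ahead with the proposed question or test.
    return proposed


# Invented numbers: a $900 test would push spending from $400 past $1,000.
step = choose_action(
    confidence=0.55,
    proposed=Action("test", "chest CT", 900),
    spent=400,
    budget=1000,
    leading_diagnosis="sarcoidosis",
)
print(step.kind, step.detail)  # prints: diagnose sarcoidosis
```

With a larger budget the same call would return the proposed test unchanged, which is the trade-off between accuracy and cost that the different MAI-DxO configurations explore.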


Comparison of AI-powered diagnostic agents by accuracy and average diagnostic test cost per case. Top-performing agents appear toward the top-left quadrant, reflecting higher accuracy and lower cost. The lower dotted line represents the performance range of the best individual foundation models. The purple line traces the performance of MAI-DxO across different configurations. The red cross indicates the average performance of 21 practicing physicians.



Figure 1: Example of an AI agent solving a sequential-diagnosis reasoning problem.

