A new paper tests OpenAI's o1 against physicians on medical benchmarks and real ER cases. Across a variety of scenarios, o1 outperformed both human doctors and older models.
Key facts
- o1 outperformed both human physicians and older AI models on medical benchmarks.
- Study included real ER cases, not just synthetic exams.
- Authors urge prospective clinical trials.
- Paper does not disclose exact benchmark scores.
A new preprint evaluates OpenAI's o1 reasoning model against human physicians on medical benchmarks and real emergency room cases. According to the paper shared by @emollick, "across a variety of scenarios and applications, the large language model outperformed both human physicians and older models." The results span multiple medical domains, including diagnostic accuracy, treatment recommendations, and clinical reasoning tasks drawn from actual ER encounters.
The study's authors explicitly call for clinical validation, stating that the model's potential suggests an "urgent need for prospective trials." This is not a synthetic benchmark exercise: the evaluation includes real patient cases, giving the comparison direct clinical relevance. The paper does not disclose the exact benchmark scores or the number of physicians in the control group, but the directional claim is unambiguous: o1 beat the humans.
Why this matters more than the press release suggests
This paper is significant not just because a model beat doctors (that has happened before, with GPT-4 on certain diagnostic tasks) but because o1 is a reasoning-specialized model, not a general-purpose chatbot. The result suggests that chain-of-thought reasoning architectures may offer a structural advantage in clinical decision-making, where step-by-step logic is critical. Prior work (e.g., Singhal et al. 2023 on Med-PaLM 2) showed specialized medical LLMs approaching physician performance; o1 appears to cross that threshold without medical fine-tuning, relying instead on inference-time reasoning.
However, the paper does not report error analysis or harm potential. A model that outperforms on average could still make catastrophic mistakes in edge cases. The call for prospective trials is appropriate: benchmarks, even on real cases, are not deployment.
Key Takeaways
- o1 beat human physicians on medical benchmarks and real ER cases, per a new paper.
- Authors urge prospective trials.
What to watch
Watch for the paper's full release on arXiv and whether OpenAI publishes an o1-medical variant. If prospective trial results emerge within 12 months showing clinical safety equivalence, expect accelerated deployment in triage and decision-support tools.