Frontier LLMs beat clinical AI tools like OpenEvidence in all three evaluations, per a new paper. Clinical AI tools performed comparably to Google Search AI Overview on the RCQ benchmark.
Key facts
- Frontier LLMs outperformed clinical AI tools in all three evaluations.
- Clinical AI tools matched Google Search AI Overview on RCQ.
- OpenEvidence was the named clinical AI tool in the study.
- The source is a tweet from @emollick; no preprint is linked.
- No details on which frontier LLMs were tested.
There has been a push to use OpenEvidence AI for doctors, but a new paper suggests general models are much better. According to @emollick, the study found that "Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ."
The paper compares frontier general-purpose LLMs against specialized clinical AI tools, including OpenEvidence, which has been promoted for healthcare use. The results challenge the assumption that domain-specific tools automatically outperform general models in medical contexts. On the RCQ (a clinical question-answering benchmark), clinical AI tools performed on par with a general-purpose search feature rather than ahead of it.
The tweet does not name which frontier LLMs were tested or which clinical tools were evaluated beyond OpenEvidence. It provides no arXiv link or author list, so details on methodology, dataset size, and evaluation criteria remain unclear. The findings align with a broader trend where large general models, trained on diverse data, sometimes match or exceed specialized systems in domain tasks.
Why general models might win
General LLMs benefit from larger training corpora, more compute, and broader knowledge coverage. Clinical AI tools, by contrast, may be trained on narrower medical datasets, potentially limiting their ability to handle edge cases or ambiguous questions. The paper suggests that the gap may be due to the frontier models' superior reasoning and broader pre-training, not just medical knowledge.
Implications for healthcare AI
Hospitals and health systems evaluating AI tools may need to reconsider the assumption that specialized clinical AI is always superior. The finding that a general search feature (Google AI Overview) matches clinical tools on the RCQ benchmark raises questions about the value proposition of proprietary clinical AI systems. However, the study's scope appears limited, with no patient data and no real-world clinical workflow testing reported, so these results are preliminary.
What to watch

Watch for a preprint of the full paper on arXiv or medRxiv. If the authors release methodology details, the key metric to track is the performance delta on RCQ between the frontier LLMs tested (not yet named) and OpenEvidence, along with any validation studies in real-world clinical workflows.
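
Once per-question results are published, the performance delta is a simple difference in accuracy between two systems graded on the same questions. Here is a minimal sketch in Python, assuming each answer has been graded as correct or incorrect. The function names and the example grades are hypothetical placeholders for illustration, not figures or methods from the study.

```python
from typing import Sequence


def accuracy(graded: Sequence[bool]) -> float:
    """Fraction of questions graded correct."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def performance_delta(system_a: Sequence[bool], system_b: Sequence[bool]) -> float:
    """Accuracy of system A minus accuracy of system B on the same questions."""
    if len(system_a) != len(system_b):
        raise ValueError("both systems must be graded on the same question set")
    return accuracy(system_a) - accuracy(system_b)


# Placeholder grades for illustration only; NOT results from the paper.
general_model = [True, True, False, True]
clinical_tool = [True, False, False, True]

delta = performance_delta(general_model, clinical_tool)
print(f"delta = {delta:+.2f}")
```

A positive delta would mean the first system answered more questions correctly. With a small question set, a delta of this kind should be read alongside a confidence interval before drawing conclusions.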





