A clinical LLM rejection predictor achieved AUROC 0.719 over 4.5 months of prospective user feedback at an academic medical center. The pre-response classifier uses deployment-specific context like provider type and department, not just query content.
Key facts
- AUROC 0.719 from 4.5-month prospective study.
- Pre-response classifier uses provider type, department, model.
- Two downstream use cases: guardrail triggering and abstention.
- Static benchmarks miss user acceptance, per the paper.
Large language models embedded in electronic health records often produce outputs clinicians ignore or override, but static benchmarks miss this rejection signal entirely. A team from an academic medical center (authors include Alyssa Unell, Miguel Fuentes, and Brenna Li) trained a pre-response classifier that estimates the risk that a user will reject the LLM output before generation begins, according to Deployment-Centered Evaluation.
How the classifier works
The model ingests query content plus deployment-specific context: provider type, department name, and which language model would generate the response. Over 4.5 months of prospective analysis, it reached an AUROC of 0.719, modest but operationally useful for triggering guardrails or abstaining on queries with high rejection risk. The authors emphasize that static benchmarks "tend to measure correctness rather than user acceptance," creating blind spots for real-world clinical utility.
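To make the setup concrete, here is a minimal sketch of a pre-response rejection classifier and of what an AUROC score measures. It is not the authors' implementation: the bag-of-features encoding, the hand-rolled logistic regression, every function name (`featurize`, `train_logistic`, `predict_risk`, `auroc`), and the four toy requests are assumptions made up for illustration. The only things taken from the article are the inputs (query text, provider type, department, model) and the fact that the score is computed before any response is generated.

```python
import math
from collections import defaultdict


def featurize(query, provider_type, department, model_name):
    """Turn one request into a set of binary feature names.

    Every input here is known before the LLM generates anything.
    """
    feats = {f"word={w}" for w in query.lower().split()}
    feats.add(f"provider={provider_type}")
    feats.add(f"dept={department}")
    feats.add(f"model={model_name}")
    return feats


def train_logistic(examples, epochs=200, lr=0.1):
    """Fit logistic regression by plain SGD. Label 1 means the user rejected the output."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            z = bias + sum(weights[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss with respect to z
            bias -= lr * grad
            for f in feats:
                weights[f] -= lr * grad
    return weights, bias


def predict_risk(model, feats):
    """Estimated probability that the user will reject the (not yet generated) response."""
    weights, bias = model
    z = bias + sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def auroc(scores, labels):
    """Probability that a random rejected case outscores a random accepted one.

    Ties count as half a win, so 0.5 is chance and 1.0 is a perfect ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one rejected and one accepted case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, invented requests (NOT data from the study); label 1 = rejected.
toy = [
    (featurize("summarize discharge note", "physician", "cardiology", "model-a"), 0),
    (featurize("draft patient message", "nurse", "oncology", "model-a"), 1),
    (featurize("summarize lab trends", "physician", "oncology", "model-b"), 0),
    (featurize("draft referral letter", "nurse", "cardiology", "model-b"), 1),
]
model = train_logistic(toy)
scores = [predict_risk(model, feats) for feats, _ in toy]
# Scored on the training examples only to show the mechanics, not as an evaluation.
print(auroc(scores, [label for _, label in toy]))
```

Read this way, an AUROC of 0.719 means that for a randomly chosen rejected response and a randomly chosen accepted one, the classifier assigns the higher risk to the rejected one about 72% of the time.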
Two downstream use cases
The paper evaluates two applications: guardrail triggering (intercepting likely-rejected outputs) and abstention (withholding responses when rejection risk is high). Both leverage the pre-response prediction rather than post-hoc feedback, which is sparse in clinical settings. The key insight: deployment-specific context improves rejection prediction over query-only baselines, a finding that may generalize beyond clinical systems to any LLM deployment where user acceptance varies by role or domain.
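One way to picture how a single risk score could drive both use cases is a simple threshold policy. The sketch below is an assumption, not the paper's method: the article does not say whether the two applications share one score-to-action rule, and the `Policy` class, the `route` function, and the threshold values 0.5 and 0.8 are hypothetical placeholders that a real deployment would have to tune on its own feedback data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical thresholds on predicted rejection risk (not from the paper)."""
    guardrail_at: float = 0.5  # at or above this, add extra checks before showing output
    abstain_at: float = 0.8    # at or above this, withhold the response entirely

    def __post_init__(self):
        if not 0.0 <= self.guardrail_at <= self.abstain_at <= 1.0:
            raise ValueError("need 0 <= guardrail_at <= abstain_at <= 1")


def route(risk, policy=Policy()):
    """Map a pre-response rejection risk to an action, before any text is generated."""
    if risk >= policy.abstain_at:
        return "abstain"
    if risk >= policy.guardrail_at:
        return "guardrail"
    return "respond"


for risk in (0.2, 0.6, 0.9):
    print(risk, route(risk))
```

Because the decision is made before generation, an abstained query costs no model call, and the guardrail path can be reserved for the requests most likely to be rejected.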
Why this matters more than the abstract suggests
Most clinical LLM evaluations rely on dense annotation or correctness metrics. This work flips the frame: it predicts user behavior from the same sparse feedback clinicians actually provide. An AUROC of 0.719 is modest discrimination (0.5 is chance, 1.0 is a perfect ranking), but it is a proof of concept that rejection risk can be forecast from pre-response signals alone. The real value is in the feature design: deployment-specific features are cheap to collect and transfer across departments.
What to watch
Watch for follow-up work extending this to multi-department deployments at larger health systems, and whether the AUROC holds above 0.7 with more diverse provider types. Also monitor whether EHR vendors like Epic or Oracle Health adopt pre-response rejection predictors in production EHR integrations.
Source: arxiv.org







