A new 14B-parameter LLM, Lung-R1, scored 4.3583 on an EMR diagnosis benchmark, the best result in a 20-system evaluation. The model, described in a June 2026 arXiv paper, uses a 59,038-node knowledge graph called LungKG to constrain its reasoning chains.
Key facts
- LungKG: 59,038 nodes, 164,308 edges.
- 15 entity types, 112 relation types.
- Lung-R1-14B EMR Diagnosis score: 4.3583.
- Beats strongest baseline by 0.1476 points.
- Evaluated across 20 systems on 3 tasks.
Pulmonary diagnosis remains a hard problem for LLMs because it requires integrating heterogeneous evidence from electronic medical records (EMRs), not just recalling textbook knowledge. The authors of "Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning" formalize this as the "Pulmonary Knowledge-to-Diagnosis Gap."
To bridge it, they built LungKG, which they describe as the first structured pulmonary knowledge graph for organizing diagnostic knowledge. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, and it serves as both a reusable resource and the foundation for model adaptation.
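To make the node, edge, entity-type, and relation-type counts concrete, here is a minimal sketch of a typed knowledge graph in Python. The class name `TypedGraph`, its methods, and the sample entries are hypothetical illustrations; they are not LungKG's actual schema or contents, which this article does not describe.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """Toy typed knowledge graph: nodes carry an entity type, edges a relation type."""
    node_types: dict = field(default_factory=dict)   # node name -> entity type
    edges: set = field(default_factory=set)          # (head, relation, tail) triples
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, name, entity_type):
        self.node_types[name] = entity_type

    def add_edge(self, head, relation, tail):
        # Both endpoints must already exist as typed nodes.
        if head not in self.node_types or tail not in self.node_types:
            raise KeyError("add both nodes before linking them")
        if (head, relation, tail) not in self.edges:
            self.edges.add((head, relation, tail))
            self.out_edges[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Return tails reachable from head, optionally filtered by relation type."""
        return [t for r, t in self.out_edges[head] if relation is None or r == relation]

    def stats(self):
        # The same four counts the article quotes for LungKG.
        return {
            "nodes": len(self.node_types),
            "edges": len(self.edges),
            "entity_types": len(set(self.node_types.values())),
            "relation_types": len({r for _, r, _ in self.edges}),
        }


kg = TypedGraph()
# Illustrative entries only; these are assumptions, not taken from LungKG.
kg.add_node("pneumonia", "Disease")
kg.add_node("cough", "Symptom")
kg.add_node("chest X-ray", "Examination")
kg.add_edge("pneumonia", "has_symptom", "cough")
kg.add_edge("pneumonia", "diagnosed_by", "chest X-ray")

print(kg.stats())
print(kg.neighbors("pneumonia", "has_symptom"))
```

At LungKG's reported scale, `stats()` on the real graph would return 59,038 nodes, 164,308 edges, 15 entity types, and 112 relation types.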
How Lung-R1 works
The training pipeline has three stages: KG-constrained reasoning-chain construction, supervised fine-tuning (SFT), and KG-guided reinforcement learning (RL). The RL stage rewards reasoning paths that stay within the graph's relational structure, penalizing jumps that lack edge support.
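The idea of rewarding edge-supported reasoning can be sketched as follows. This is an assumption-laden illustration, not the paper's reward: the function `path_reward`, the +1/-1 step scores, and the averaging are all hypothetical choices, and the sample triples are invented for the example.

```python
def path_reward(steps, kg_edges, supported=1.0, unsupported=-1.0):
    """Average per-step score over a reasoning path.

    Each step is a (head, relation, tail) triple. A step scores `supported`
    if that exact triple is an edge in the graph, and `unsupported` otherwise.
    The scoring scheme is a hypothetical stand-in for the paper's reward.
    """
    if not steps:
        return 0.0
    total = sum(supported if step in kg_edges else unsupported for step in steps)
    return total / len(steps)


# Toy edge set; assumed for illustration, not drawn from LungKG.
kg_edges = {
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "diagnosed_by", "chest X-ray"),
}

grounded = [
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "diagnosed_by", "chest X-ray"),
]
with_jump = [
    ("pneumonia", "has_symptom", "cough"),
    ("cough", "treated_by", "antibiotics"),  # no such edge in the toy graph
]

print(path_reward(grounded, kg_edges))   # 1.0
print(path_reward(with_jump, kg_edges))  # 0.0
```

A path whose every step matches an edge gets the maximum reward, while a single unsupported jump pulls the average down, which is the behavior the RL stage is described as encouraging.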
In a 20-system evaluation, Lung-R1-14B achieved state-of-the-art performance across all three tasks: Choice (multiple-choice knowledge), Pulmonary-QA (open-ended questions), and EMR Diagnosis (patient-specific record reasoning). The EMR Diagnosis score of 4.3583 surpassed the strongest non-Lung-R1 baseline by 0.1476 points. The authors did not disclose the exact baseline model, and the number of systems compared does not by itself establish that the margin is statistically significant.
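The reported score and margin together imply the strongest baseline's score and the size of the gain relative to it. The short calculation below uses only the two figures quoted above; the variable names are illustrative.

```python
lung_r1_score = 4.3583   # reported EMR Diagnosis score
margin = 0.1476          # reported lead over the strongest non-Lung-R1 baseline

# The baseline score follows by subtraction; the relative gain is margin / baseline.
baseline_score = lung_r1_score - margin
relative_gain = margin / baseline_score

print(f"implied baseline score: {baseline_score:.4f}")  # 4.2107
print(f"relative gain: {relative_gain:.2%}")            # 3.51%
```

So the strongest baseline scored roughly 4.2107, and Lung-R1-14B improved on it by about 3.5 percent.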
Why the graph matters
The improvement is modest (0.1476 points on a 5-point scale), but the approach signals a shift away from pure retrieval-augmented generation (RAG) for clinical reasoning. RAG retrieves text chunks; LungKG supplies structured relations. The graph constrains the LLM to reason over explicit disease-symptom-treatment edges rather than freeform text associations. This could reduce hallucination in high-stakes diagnostic settings, though the paper does not report hallucination rates.

What to watch
Two open questions stand out: whether the authors publicly release LungKG as a reusable resource, and whether follow-up work reports hallucination rates or ablations of the KG-constrained RL stage. A clinical deployment study at a partner hospital would be the strongest signal of real-world viability.

Source: arxiv.org







