A new arXiv preprint from Sherpa.ai demonstrates that federated fine-tuning with parameter-efficient methods such as QLoRA achieves accuracy close to that of centralized training on four healthcare and finance benchmarks. The paper, submitted May 13, 2026, compares LoRA, QLoRA, and IA3 across non-IID data splits that mimic real institutional heterogeneity.
Key Facts
- Submitted to arXiv on May 13, 2026.
- Evaluates LoRA, QLoRA, and IA3 on MedQA, MedMCQA, FPB, FiQA-SA.
- Federated fine-tuning outperforms isolated single-institution learning.
- QLoRA and IA3 improve efficiency with limited accuracy degradation.
- Framework built on Sherpa.ai Federated Learning platform.
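The efficiency claim for QLoRA rests largely on weight quantization: storing the frozen backbone in 4-bit NF4 instead of 16-bit floats cuts weight memory roughly fourfold. The paper does not name its backbone model, so the sketch below uses a hypothetical 7B-parameter model purely to illustrate the arithmetic (it ignores activations, optimizer state, and quantization metadata overhead):

```python
def param_memory_gb(n_params, bits):
    """Rough memory footprint of model weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3

n = 7e9  # hypothetical 7B backbone; the paper does not disclose its model
fp16 = param_memory_gb(n, 16)  # ~13.0 GiB for half-precision weights
nf4 = param_memory_gb(n, 4)    # ~3.3 GiB for 4-bit quantized weights
```

This 4x reduction is what lets resource-constrained institutions participate in federated rounds at all, which is the Green AI angle the authors emphasize.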
The Benchmark Design
The study, titled "Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning" [arXiv:2605.13936], evaluates three parameter-efficient fine-tuning (PEFT) strategies — LoRA, QLoRA, and IA3 — on four datasets: MedQA, MedMCQA (healthcare), FPB, and FiQA-SA (finance). The framework runs on the Sherpa.ai Federated Learning platform, enabling nodes to jointly fine-tune a shared LLM without exchanging private data.
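The paper does not disclose the aggregation protocol used by the Sherpa.ai platform, but the standard pattern for federated PEFT is FedAvg applied only to the small adapter weights: the frozen base LLM and the private data never leave each institution. The sketch below is a minimal illustration of that pattern, with local training replaced by random perturbations and all names (`local_lora_update`, `fedavg`, the adapter keys) purely hypothetical:

```python
import random

def local_lora_update(adapter, seed):
    """Hypothetical local step: each site nudges its copy of the low-rank
    adapter weights using its own private data (simulated here with small
    random perturbations -- no real training loop)."""
    rng = random.Random(seed)
    return {name: [w + rng.uniform(-0.01, 0.01) for w in weights]
            for name, weights in adapter.items()}

def fedavg(adapters, sizes):
    """Sample-size-weighted average of adapter parameters (FedAvg).
    Only the small adapter tensors are exchanged between sites."""
    total = sum(sizes)
    return {
        name: [sum(a[name][i] * n for a, n in zip(adapters, sizes)) / total
               for i in range(len(adapters[0][name]))]
        for name in adapters[0]
    }

# Shared adapter state (e.g., LoRA A/B matrices, flattened for illustration).
global_adapter = {"lora_A": [0.0] * 4, "lora_B": [0.0] * 4}
client_sizes = [1200, 800, 500]  # hypothetical per-site sample counts

for rnd in range(3):  # three federated rounds
    updates = [local_lora_update(global_adapter, seed=rnd * 10 + k)
               for k in range(len(client_sizes))]
    global_adapter = fedavg(updates, client_sizes)
```

Because only adapter deltas cross the network, the per-round communication cost scales with the adapter size (typically well under 1% of the model), not with the full LLM.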
Key Results
According to the paper, "federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning." QLoRA and IA3 showed "limited accuracy degradation" while improving efficiency, a finding the authors frame through a Green AI lens. The non-IID data splits reflect institutional data heterogeneity — differing population characteristics, documentation patterns, and label distributions across sites.
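The paper does not specify how its non-IID splits were generated, but a common way to simulate the label-distribution skew described above is a per-class Dirichlet partition: smaller concentration values of `alpha` give each site a more skewed label mix. The helper below is a self-contained sketch of that technique (the function name and toy labels are illustrative, not from the paper):

```python
import random

def dirichlet_label_split(labels, n_sites, alpha, seed=0):
    """Partition sample indices across sites, drawing each class's
    per-site proportions from a Dirichlet(alpha) distribution.
    Smaller alpha -> more skewed (more non-IID) label mixes."""
    rng = random.Random(seed)
    sites = [[] for _ in range(n_sites)]
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        # A Dirichlet draw is a normalized vector of Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_sites)]
        total = sum(draws)
        start = 0
        for s in range(n_sites):
            if s == n_sites - 1:
                take = len(idx) - start  # last site absorbs the remainder
            else:
                take = min(int(draws[s] / total * len(idx)), len(idx) - start)
            sites[s].extend(idx[start:start + take])
            start += take
    return sites

labels = ["pos", "neg", "neu"] * 100  # toy sentiment labels (FiQA-SA-like)
splits = dirichlet_label_split(labels, n_sites=3, alpha=0.3)
```

With `alpha=0.3`, some sites end up dominated by one or two classes, which mirrors the cross-institutional label-distribution differences the authors describe.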

Why This Matters
What sets this work apart: it is the first systematic cross-domain benchmark showing that federated PEFT can approach centralized accuracy while outperforming isolated training under realistic non-IID conditions. Previous work focused on IID settings or single domains. The paper's explicit comparison across healthcare and finance suggests the approach generalizes beyond narrow verticals.

Limitations
The paper does not disclose exact accuracy deltas for each dataset, nor does it specify the pretrained backbone model used. The evaluated tasks are closed-ended QA and classification, which are simpler than open-ended generation, so the findings may not transfer to more complex federated scenarios.

What to Watch
Watch for follow-up work extending the benchmark to open-ended generation tasks (e.g., clinical note summarization) and larger model sizes (70B+). Also track whether any healthcare or finance institution publicly adopts Sherpa.ai's platform for production fine-tuning.