Microsoft researchers found that current AI assistants corrupt about 25% of document content during long editing sessions. The paper, titled "LLMs Corrupt Your Documents When You Delegate," tests 19 models across 52 domains with 20 sequential edits per run.
Key facts
- 19 models tested across 52 domains.
- 20 sequential editing interactions per run.
- ~25% of document content corrupted on average.
- Agentic tool use did not improve results.
- Failures were occasional big mistakes, not tiny slips.
A new Microsoft paper reveals that even frontier large language models systematically damage documents during extended editing sessions. The researchers tested 19 models on reversible task pairs, where a model edits a file and then tries to undo that edit. A reliable system should restore the original document; instead, models corrupted about 25% of document content on average, with many damaging far more [per the arXiv preprint 2604.15597].
The failures were not gradual degradation but occasional catastrophic mistakes that silently broke parts of the document and compounded over time. The study spanned 52 domains—coding, science, accounting, music notation—with 20 editing interactions per run. Agentic tool use did not improve outcomes. Bigger files, longer workflows, and irrelevant extra documents all made corruption worse [according to @rohanpaul_ai].
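To make the methodology concrete, here is a minimal sketch of a reversible-pair check, assuming a character-level diff ratio as the corruption metric; the paper's exact metric and harness are not described here, and `model_edit`/`model_undo` are hypothetical wrappers around the model under test.

```python
import difflib

def corruption_rate(original: str, restored: str) -> float:
    """Fraction of the original document that fails to survive an
    edit/undo round trip (0.0 means perfect restoration)."""
    return 1.0 - difflib.SequenceMatcher(None, original, restored).ratio()

def reversible_pair_check(document: str, model_edit, model_undo) -> float:
    """Apply a model's edit, ask the model to undo it, and score the result.

    model_edit and model_undo are hypothetical callables wrapping the
    model under test; a reliable editor scores near 0.0.
    """
    edited = model_edit(document)
    restored = model_undo(edited)
    return corruption_rate(document, restored)
```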
Why the 25% figure matters
The unique take: this paper exposes a structural blind spot in LLM evaluation. Current benchmarks test single-turn accuracy or narrow coding tasks, but delegated AI work requires maintaining correctness across many edits. The paper's reversible-pair methodology, where a model must undo its own prior edit, directly measures this reliability. A 25% corruption rate means that by the end of a typical 20-edit session, roughly a quarter of the document's content has been silently damaged. In enterprise document workflows (contracts, financial reports, codebases), that failure rate is unacceptable.
Prior work, such as the "Agentic AI" benchmarks from 2025, focused on task completion rates for single-shot actions. This paper shifts the lens to longitudinal reliability, a far harder problem. The finding that agentic tool use didn't help suggests the core issue is not tool orchestration but the models' inability to maintain a consistent internal representation of the document state across sequential operations.
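Some back-of-the-envelope arithmetic shows why session length matters so much. As a toy model (the per-edit number below is illustrative, not the paper's), assume each interaction independently corrupts a small fixed fraction of the document and nothing is ever repaired:

```python
# Toy model of compounding damage; illustrative numbers, not the paper's.
per_edit_damage = 0.014      # hypothetical fraction corrupted per interaction
n_interactions = 20          # session length used in the study

intact = (1 - per_edit_damage) ** n_interactions
print(f"intact after {n_interactions} interactions: {intact:.1%}")  # ~75.4%
print(f"corrupted: {1 - intact:.1%}")                               # ~24.6%
```

Even a 1.4% per-edit error rate compounds to roughly the 25% figure over 20 interactions. The paper finds the reality is burstier (occasional catastrophic failures rather than a steady drip), but the geometric decay illustrates why longer workflows and bigger files fare worse.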
What the paper doesn't say
The paper does not disclose which specific frontier models were tested, nor does it provide per-model corruption rates. It also does not explore whether fine-tuning on document-edit traces could reduce the corruption rate—a likely next research direction. The authors leave open whether larger context windows or chain-of-thought reasoning could mitigate the compounding errors.
Implications for AI-as-a-service
For companies building AI-powered document editing tools—Google Docs AI, Microsoft Copilot, Notion AI, Cursor—this paper is a warning. The demo-ready performance of these systems on single edits masks a fundamental unreliability for multi-step workflows. The 25% corruption baseline means that any AI document assistant deployed without a rollback mechanism or human-in-the-loop validation will silently introduce errors that compound over time.
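A minimal defensive pattern is easy to sketch: snapshot the document before every AI edit and refuse to commit when a post-edit check fails. The hooks below (`ai_edit`, `validate`) are hypothetical stand-ins, not any vendor's API:

```python
def guarded_edit(document: str, ai_edit, validate) -> str:
    """Apply one AI edit behind a snapshot, rolling back instead of
    silently committing a corrupted result.

    ai_edit wraps the model call; validate runs whatever checks the
    workflow affords (diff-size limits, schema checks, human review).
    """
    snapshot = document  # str is immutable, so the snapshot is free
    candidate = ai_edit(document)
    return candidate if validate(snapshot, candidate) else snapshot

def guarded_session(document: str, instructions, ai_edit, validate) -> str:
    # Guard every step so a single bad edit cannot compound unchecked.
    for instruction in instructions:
        document = guarded_edit(
            document, lambda d: ai_edit(d, instruction), validate
        )
    return document
```

Validation can be as cheap as rejecting edits whose diff is far larger than the instruction implies, or as heavy as routing every change through human review; either way, the session never silently commits a change it cannot account for.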
What to watch
Watch for follow-up work from Microsoft or other labs on fine-tuning models for multi-step document reliability. Also watch whether any major AI document tool (Google Docs AI, Copilot) adds explicit rollback validation or corruption-rate disclosures in their next release notes. The paper's reversible-pair methodology may become a standard eval for agentic document editing.