
AI Research · Score: 88

Microsoft: LLMs Corrupt 25% of Docs in Long Edits

Microsoft paper shows LLMs corrupt ~25% of document content across 52 domains during 20-edit sessions, with failures compounding silently.

9h ago · 3 min read · AI-Generated
How much document content do LLMs corrupt during long editing jobs?

Microsoft researchers found that current LLMs corrupt about 25% of document content during long editing jobs across 52 domains, with failures compounding over 20 interactions.

TL;DR

Frontier models corrupt ~25% of document content. · Failures are occasional big mistakes that compound silently, not tiny slips. · Agentic tool use did not help in tests.

Microsoft researchers found that current AI assistants corrupt about 25% of document content during long editing jobs. The paper, titled "LLMs Corrupt Your Documents When You Delegate," tests 19 models across 52 domains with 20 sequential edits.

Key facts

  • 19 models tested across 52 domains.
  • 20 sequential editing interactions per run.
  • ~25% document content corrupted on average.
  • Agentic tool use did not improve results.
  • Failures were occasional big mistakes, not tiny slips.

A new Microsoft paper reveals that even frontier large language models systematically damage documents during extended editing sessions. The researchers tested 19 models—including frontier systems—on reversible task pairs where a model edits a file and then tries to undo that edit. A reliable system should return to the original document; instead, models corrupted about 25% of document content on average, with many models damaging far more [per the arXiv preprint 2604.15597].
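
The paper's harness isn't public, but the reversible-pair idea is easy to sketch. The Python below is a minimal, hypothetical version: `apply_edit` stands in for whatever LLM call performs an edit, and corruption is approximated with a diff-based similarity ratio, which may differ from the authors' actual metric.

```python
import difflib

def round_trip_corruption(original: str, apply_edit) -> float:
    """One reversible task pair: edit, then ask the model to undo it.

    apply_edit(doc, instruction) is a hypothetical stand-in for an LLM
    editing call; the paper's real prompts and metric are not public.
    """
    edited = apply_edit(original, "Apply the requested edit.")
    restored = apply_edit(edited, "Undo the previous edit exactly.")

    # A reliable editor returns the original document. ratio() is
    # 2*matches/(len(a)+len(b)), so 1 - ratio() approximates the
    # fraction of content that was silently damaged.
    return 1.0 - difflib.SequenceMatcher(None, original, restored).ratio()

def session_corruption(original: str, apply_edit, n_pairs: int = 10) -> float:
    """Chain edit/undo pairs (10 pairs, roughly mirroring the paper's
    20-interaction sessions) so residual damage compounds over the run."""
    doc = original
    for _ in range(n_pairs):
        doc = apply_edit(doc, "Apply the requested edit.")
        doc = apply_edit(doc, "Undo the previous edit exactly.")
    return 1.0 - difflib.SequenceMatcher(None, original, doc).ratio()
```

A perfect editor scores 0.0 here; by the paper's own metric, models averaged around 25% of content damaged.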

The failures were not gradual degradation but occasional catastrophic mistakes that silently broke parts of the document and compounded over time. The study spanned 52 domains—coding, science, accounting, music notation—with 20 editing interactions per run. Agentic tool use did not improve outcomes. Bigger files, longer workflows, and irrelevant extra documents all made corruption worse [according to @rohanpaul_ai].

Why the 25% figure matters

The unique take: this paper exposes a structural blind spot in LLM evaluation. Current benchmarks test single-turn accuracy or narrow coding tasks, but delegated AI work requires maintaining correctness across many edits. The paper's reversible-pair methodology, in which a model must undo its own prior edit, directly measures this reliability. A ~25% corruption rate means that by the end of a session, roughly a quarter of the document's content has been silently damaged. In enterprise document workflows (contracts, financial reports, codebases), that failure rate is unacceptable.

Prior work, such as the "Agentic AI" benchmarks from 2025, focused on task completion rates for single-shot actions. This paper shifts the lens to longitudinal reliability, a far harder problem. The finding that agentic tool use didn't help suggests the core issue is not tool orchestration but the models' inability to maintain a consistent internal representation of the document state across sequential operations.
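
One way to make that diagnosis concrete: if the model can't track document state, the harness can. The hypothetical Python sketch below hashes each paragraph and flags sections that changed even though the edit was scoped elsewhere, which is exactly the silent damage the paper describes. It assumes paragraph boundaries stay stable across edits; a real implementation would need a proper diff to handle insertions and deletions.

```python
import hashlib

def section_hashes(doc: str) -> list[str]:
    """Fingerprint each paragraph so untouched sections can be verified."""
    return [hashlib.sha256(p.encode()).hexdigest() for p in doc.split("\n\n")]

def silently_damaged(before: str, after: str, edited: set[int]) -> list[int]:
    """Indices of sections that changed although the edit was not
    supposed to touch them. Assumes paragraph count is preserved;
    a production version would align sections with a real diff."""
    pre, post = section_hashes(before), section_hashes(after)
    return [i for i, h in enumerate(pre)
            if i not in edited and (i >= len(post) or post[i] != h)]
```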

What the paper doesn't say

The paper does not disclose which specific frontier models were tested, nor does it provide per-model corruption rates. It also does not explore whether fine-tuning on document-edit traces could reduce the corruption rate—a likely next research direction. The authors leave open whether larger context windows or chain-of-thought reasoning could mitigate the compounding errors.

Implications for AI-as-a-service

For companies building AI-powered document editing tools—Google Docs AI, Microsoft Copilot, Notion AI, Cursor—this paper is a warning. The demo-ready performance of these systems on single edits masks a fundamental unreliability for multi-step workflows. The 25% corruption baseline means that any AI document assistant deployed without a rollback mechanism or human-in-the-loop validation will silently introduce errors that compound over time.
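
Neither the paper nor the vendors specify what such a rollback mechanism would look like, but a minimal version is easy to sketch. In the hypothetical Python below, `apply_edit` and `validate` are placeholders, not any vendor's actual API; `validate` might run a test suite, a schema check, or route to a human reviewer.

```python
def guarded_edit(doc: str, instruction: str, apply_edit, validate) -> str:
    """Apply an LLM edit, but keep the last known-good version unless
    the candidate passes validation."""
    candidate = apply_edit(doc, instruction)
    if validate(doc, candidate):
        return candidate   # accept the validated edit
    return doc             # reject: roll back to the known-good document
```

The open design question is what `validate` can check cheaply enough to run on every edit; at a ~25% content-corruption baseline, skipping it is not an option.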

What to watch


Watch for follow-up work from Microsoft or other labs on fine-tuning models for multi-step document reliability. Also watch whether any major AI document tool (Google Docs AI, Copilot) adds explicit rollback validation or corruption-rate disclosures in their next release notes. The paper's reversible-pair methodology may become a standard eval for agentic document editing.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala AYADI.


AI Analysis

This paper is a necessary corrective to the hype around AI agents for document work. The field has focused on single-turn accuracy and task completion rates, but the real-world bottleneck is longitudinal reliability. The ~25% corruption figure is devastating: by the end of a multi-step workflow, roughly a quarter of the document's content has been silently damaged. The fact that agentic tool use didn't help suggests the problem is fundamental to the autoregressive generation process: the model cannot maintain a consistent document state across multiple forward passes.

Compare this to the 2025 benchmark results where models scored 90%+ on single-turn code generation. Those numbers now look misleading for production use. The paper's reversible-pair methodology is elegant in that it directly measures the model's ability to maintain document integrity, and it may well become a standard eval in the agentic AI community.

One limitation: the paper doesn't break down corruption rates by model size or architecture. It's possible that larger models, or those with stronger multi-step reasoning (such as Anthropic's Claude or OpenAI's o-series with chain-of-thought), perform better. Without that granularity, the 25% figure is a useful baseline but not a complete picture.
