
AI Research · Score: 88

Microsoft: LLMs Corrupt 25% of Docs in Long Edits

Microsoft paper shows LLMs corrupt ~25% of document content across 52 domains during 20-edit sessions, with failures compounding silently.

9h ago · 3 min read · AI-Generated
How much document content do LLMs corrupt during long editing jobs?

Microsoft researchers found that current LLMs corrupt about 25% of document content during long editing jobs across 52 domains, with failures compounding over 20 interactions.

TL;DR

Frontier models corrupt ~25% of document content. · Failures are occasional big mistakes that compound silently, not tiny slips. · Agentic tool use did not help in tests.

Microsoft researchers found that current AI assistants corrupt about 25% of document content during long editing jobs. The paper, titled "LLMs Corrupt Your Documents When You Delegate," tests 19 models across 52 domains with 20 sequential edits.

Key facts

  • 19 models tested across 52 domains.
  • 20 sequential editing interactions per run.
  • ~25% document content corrupted on average.
  • Agentic tool use did not improve results.
  • Failures were occasional big mistakes, not tiny slips.

A new Microsoft paper reveals that even frontier large language models systematically damage documents during extended editing sessions. The researchers tested 19 models—including frontier systems—on reversible task pairs where a model edits a file and then tries to undo that edit. A reliable system should return to the original document; instead, models corrupted about 25% of document content on average, with many models damaging far more [per the arXiv preprint 2604.15597].
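
The paper's harness isn't public, but the reversible-pair idea is easy to sketch. The Python below is a minimal, hypothetical version: `apply_edit` stands in for whatever LLM call performs an edit, and corruption is approximated with a diff-based similarity ratio, which may differ from the authors' actual metric.

```python
import difflib

def round_trip_corruption(original: str, apply_edit) -> float:
    """One reversible task pair: edit, then ask the model to undo it.

    apply_edit(doc, instruction) is a hypothetical stand-in for an LLM
    editing call; the paper's real prompts and metric are not public.
    """
    edited = apply_edit(original, "Apply the requested edit.")
    restored = apply_edit(edited, "Undo the previous edit exactly.")

    # A reliable editor returns the original document. ratio() is
    # 2*matches/(len(a)+len(b)), so 1 - ratio() approximates the
    # fraction of content that was silently damaged.
    return 1.0 - difflib.SequenceMatcher(None, original, restored).ratio()

def session_corruption(original: str, apply_edit, n_pairs: int = 10) -> float:
    """Chain edit/undo pairs (10 pairs, roughly mirroring the paper's
    20-interaction sessions) so residual damage compounds over the run."""
    doc = original
    for _ in range(n_pairs):
        doc = apply_edit(doc, "Apply the requested edit.")
        doc = apply_edit(doc, "Undo the previous edit exactly.")
    return 1.0 - difflib.SequenceMatcher(None, original, doc).ratio()
```

A perfect editor scores 0.0 here; by the paper's own metric, models averaged around 25% of content damaged.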

The failures were not gradual degradation but occasional catastrophic mistakes that silently broke parts of the document and compounded over time. The study spanned 52 domains—coding, science, accounting, music notation—with 20 editing interactions per run. Agentic tool use did not improve outcomes. Bigger files, longer workflows, and irrelevant extra documents all made corruption worse [according to @rohanpaul_ai].

Why the 25% figure matters

The unique take: this paper exposes a structural blind spot in LLM evaluation. Current benchmarks test single-turn accuracy or narrow coding tasks, but delegated AI work requires maintaining correctness across many edits. The paper's reversible-pair methodology, in which a model must undo its own prior edit, directly measures this reliability. A ~25% corruption rate means that by the end of a session, roughly a quarter of the document's content has been silently damaged. In enterprise document workflows (contracts, financial reports, codebases), that failure rate is unacceptable.

Prior work, such as the "Agentic AI" benchmarks from 2025, focused on task completion rates for single-shot actions. This paper shifts the lens to longitudinal reliability, a far harder problem. The finding that agentic tool use didn't help suggests the core issue is not tool orchestration but the models' inability to maintain a consistent internal representation of the document state across sequential operations.
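
One way to make that diagnosis concrete: if the model can't track document state, the harness can. The hypothetical Python sketch below hashes each paragraph and flags sections that changed even though the edit was scoped elsewhere, which is exactly the silent damage the paper describes. It assumes paragraph boundaries stay stable across edits; a real implementation would need a proper diff to handle insertions and deletions.

```python
import hashlib

def section_hashes(doc: str) -> list[str]:
    """Fingerprint each paragraph so untouched sections can be verified."""
    return [hashlib.sha256(p.encode()).hexdigest() for p in doc.split("\n\n")]

def silently_damaged(before: str, after: str, edited: set[int]) -> list[int]:
    """Indices of sections that changed although the edit was not
    supposed to touch them. Assumes paragraph count is preserved;
    a production version would align sections with a real diff."""
    pre, post = section_hashes(before), section_hashes(after)
    return [i for i, h in enumerate(pre)
            if i not in edited and (i >= len(post) or post[i] != h)]
```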

What the paper doesn't say

The paper does not disclose which specific frontier models were tested, nor does it provide per-model corruption rates. It also does not explore whether fine-tuning on document-edit traces could reduce the corruption rate—a likely next research direction. The authors leave open whether larger context windows or chain-of-thought reasoning could mitigate the compounding errors.

Implications for AI-as-a-service

For companies building AI-powered document editing tools—Google Docs AI, Microsoft Copilot, Notion AI, Cursor—this paper is a warning. The demo-ready performance of these systems on single edits masks a fundamental unreliability for multi-step workflows. The 25% corruption baseline means that any AI document assistant deployed without a rollback mechanism or human-in-the-loop validation will silently introduce errors that compound over time.
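
Neither the paper nor the vendors specify what such a rollback mechanism would look like, but a minimal version is easy to sketch. In the hypothetical Python below, `apply_edit` and `validate` are placeholders, not any vendor's actual API; `validate` might run a test suite, a schema check, or route to a human reviewer.

```python
def guarded_edit(doc: str, instruction: str, apply_edit, validate) -> str:
    """Apply an LLM edit, but keep the last known-good version unless
    the candidate passes validation."""
    candidate = apply_edit(doc, instruction)
    if validate(doc, candidate):
        return candidate   # accept the validated edit
    return doc             # reject: roll back to the known-good document
```

The open design question is what `validate` can check cheaply enough to run on every edit; at a ~25% content-corruption baseline, skipping it is not an option.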

What to watch


Watch for follow-up work from Microsoft or other labs on fine-tuning models for multi-step document reliability. Also watch whether any major AI document tool (Google Docs AI, Copilot) adds explicit rollback validation or corruption-rate disclosures in their next release notes. The paper's reversible-pair methodology may become a standard eval for agentic document editing.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala AYADI.


AI Analysis

This paper is a necessary corrective to the hype around AI agents for document work. The field has focused on single-turn accuracy and task completion rates, but the real-world bottleneck is longitudinal reliability. The ~25% corruption figure is devastating: by the end of a multi-step workflow, roughly a quarter of the document's content has been silently damaged. The fact that agentic tool use didn't help suggests the problem is fundamental to the autoregressive generation process: the model cannot maintain a consistent document state across multiple forward passes.

Compare this to the 2025 benchmark results where models scored 90%+ on single-turn code generation. Those numbers now look misleading for production use. The paper's reversible-pair methodology is elegant in that it directly measures the model's ability to maintain document integrity, and it may well become a standard eval in the agentic AI community.

One limitation: the paper doesn't break down corruption rates by model size or architecture. It's possible that larger models, or those with stronger multi-step reasoning (such as Anthropic's Claude or OpenAI's o-series with chain-of-thought), perform better. Without that granularity, the 25% figure is a useful baseline but not a complete picture.
