
LLMs corrupt your documents when you delegate

Based on: Laban, Schnabel & Neville — LLMs Corrupt Your Documents When You Delegate (Microsoft Research, arXiv 2604.15597, April 2026). Overview: cekrem.github.io

The core finding

Delegating document editing to an LLM introduces silent, compounding corruption. Even frontier models degrade a quarter of document content after 20 interactions. The damage is not random noise — it is sparse but severe, and it accumulates monotonically with no plateau.

DELEGATE-52 benchmark

The researchers built a benchmark of 310 work environments across 52 professional domains (coding, crystallography, music notation, accounting, genealogy…). Each environment contains:

  • A real seed document (~2–5k tokens, permissive license)
  • 5–10 complex editing tasks (forward + inverse instruction pairs)
  • A distractor context (~8–12k tokens of topically related but irrelevant documents)

Evaluation method — round-trip relay: each task is defined as a reversible pair (forward edit → backward edit). A perfect model returns the original document exactly. Chaining N round-trips simulates a long delegated workflow without requiring human annotation.

seed doc → [forward edit] → transformed doc → [backward edit] → reconstructed doc
                                                                        ↕ sim(seed, reconstructed)
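The relay above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's harness: the `Editor` signature, the toy editors, and the `surface_similarity` metric are all assumptions made for the example (the benchmark itself scores with domain-specific parsers, and a real editor would be a model call).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical signature: an editor takes (document, instruction) and returns a document.
Editor = Callable[[str, str], str]


def surface_similarity(a: str, b: str) -> float:
    """Stand-in metric based on character overlap (not the paper's parser-based metric)."""
    return SequenceMatcher(None, a, b).ratio()


def round_trip_relay(seed: str, tasks: List[Tuple[str, str]], edit: Editor) -> List[float]:
    """Chain (forward, backward) instruction pairs and score the document after each round trip."""
    doc = seed
    scores = []
    for forward, backward in tasks:
        doc = edit(doc, forward)   # forward edit: seed-like doc -> transformed doc
        doc = edit(doc, backward)  # backward edit: transformed doc -> reconstructed doc
        scores.append(surface_similarity(seed, doc))
    return scores


# Toy editors standing in for a model call.
def perfect_editor(doc: str, instruction: str) -> str:
    return doc.upper() if instruction == "upper" else doc.lower()


def lossy_editor(doc: str, instruction: str) -> str:
    # Silently drops the last character on every call.
    return perfect_editor(doc, instruction)[:-1]


seed = "two hundred grams of butter"
tasks = [("upper", "lower")] * 3
perfect_scores = round_trip_relay(seed, tasks, perfect_editor)
lossy_scores = round_trip_relay(seed, tasks, lossy_editor)
```

With the perfect editor every score stays at 1.0; with the lossy one the score drops after each round trip, which is the compounding behaviour the relay is designed to expose.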

Similarity is measured with domain-specific parsers (not generic embeddings), because small semantic changes — 200g → 800g of butter — must register as errors even if surface overlap is high.
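To make the butter example concrete, here is a small sketch comparing a surface-overlap score with a parser-based one. The recipe text, the `parse_quantities` mini-parser, and the scoring rule are hypothetical and far simpler than the parsers the benchmark uses; the point is only that the two scores disagree.

```python
import re
from difflib import SequenceMatcher


def parse_quantities(recipe: str) -> dict:
    """Hypothetical mini-parser: maps ingredient name -> grams for lines like '200g butter'."""
    return {name.strip(): int(grams)
            for grams, name in re.findall(r"(\d+)\s*g\s+([^\n]+)", recipe)}


def parsed_similarity(a: str, b: str) -> float:
    """Fraction of ingredients whose quantity is identical in both recipes."""
    qa, qb = parse_quantities(a), parse_quantities(b)
    keys = set(qa) | set(qb)
    if not keys:
        return 1.0
    return sum(qa.get(k) == qb.get(k) for k in keys) / len(keys)


original = "200g butter\n300g flour\n100g sugar\n50g cocoa"
corrupted = "800g butter\n300g flour\n100g sugar\n50g cocoa"

surface = SequenceMatcher(None, original, corrupted).ratio()  # character overlap
semantic = parsed_similarity(original, corrupted)              # parsed quantities
```

Only one character differs, so `surface` stays above 0.95, while `semantic` is 0.75 because one of the four quantities is wrong.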

Results

19 models tested (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro, and 16 others) over 20 interactions:

Model tier | Content corrupted after 20 interactions
Frontier (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro) | ~25%
Average across all 19 models | ~50%

Key observations:

  • Monotonic decline: there is no plateau at 20 interactions, and when the run is extended to 100 interactions the scores keep falling.
  • Short-term performance does not predict long-term reliability. Two models nearly identical at 2 interactions (91.5% vs 91.1%) diverged to 48.3% vs 64.1% at 20 interactions.
  • Python is the only domain with majority readiness (17/19 models ≥ 98%). Every other domain (prose, structured records, niche formats) falls short of that bar. The pattern is clear: LLMs handle domains where correctness has a mechanical, verifiable definition. Unstructured text has no spec to check against, so corruption there goes unnoticed.

Note: The benchmark includes other Code & Configuration domains (e.g., Database) that did not reach the same readiness bar as Python.

Python's advantage likely comes from two compounding factors: it is massively overrepresented in LLM training data, and its correctness criteria are unambiguous (interpreter, linters, test frameworks).

A less common language — COBOL, Fortran, a niche DSL — would probably degrade as badly as prose, because the model has less training signal and no strong mechanical definition of "correct". The paper's framing supports this: it is not code that succeeds, it is mechanically verifiable domains.

This interpretation is the authors' framing extended by the reviewer; the paper does not test niche languages directly. Source: arXiv 2604.15597 — Table 2 (per-domain readiness breakdown after 20 interactions) and Figure 8 (document characteristics effect sizes).
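A quick calculation shows why the two-interaction scores quoted in the observations above say so little about the 20-interaction outcome. The sketch below applies a naive geometric extrapolation, which is an assumption of this example and not a model proposed in the paper: it supposes each further pair of interactions loses the same fraction as the first pair. The names `model_a` and `model_b` are placeholders, and pairing the figures in the order they are listed is also an assumption (the gap computed at the end does not depend on it).

```python
def extrapolate(score_at_2: float, target: int = 20) -> float:
    """Naive geometric extrapolation: same relative loss for every 2 interactions."""
    return score_at_2 ** (target / 2)


# (score at 2 interactions, observed score at 20 interactions), as quoted above.
models = {"model_a": (0.915, 0.483), "model_b": (0.911, 0.641)}

for name, (early, observed) in models.items():
    predicted = extrapolate(early)
    print(f"{name}: predicted {predicted:.3f} at 20 interactions, observed {observed:.3f}")

# How far apart the two models are, predicted versus observed.
predicted_gap = abs(extrapolate(0.915) - extrapolate(0.911))
observed_gap = abs(0.483 - 0.641)
```

Under this assumption the two models would end up within about two percentage points of each other, whereas the reported scores are roughly sixteen points apart. The early scores alone do not carry the information needed to tell the two models apart.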

Aggravating factors

Factor | Effect
Longer interaction | Degradation compounds; short simulations underestimate severity
Larger documents | More surface area for errors
Distractor context | Irrelevant files in the context window worsen results
Agentic tool use | +6% additional degradation: more capability, more ways to confidently do the wrong thing

Nature of the errors

Errors are sparse but severe: the model does not produce gibberish. It makes small, confident changes — a detail shifted, a qualification dropped, a meaning subtly altered. These are invisible on a quick scan and only detectable by careful comparison against the original.
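A word-level diff is one way to do that careful comparison mechanically. The sketch below uses Python's standard `difflib`; the two sentences are made up for illustration and do not come from the benchmark.

```python
import difflib


def changed_words(original: str, edited: str) -> list:
    """Return (operation, original words, edited words) for every span that differs."""
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [(tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Invented example: one value shifted, one qualification dropped.
original = "The dose may be increased to 20 mg only under medical supervision."
edited = "The dose may be increased to 40 mg under medical supervision."

changes = changed_words(original, edited)
# changes == [("replace", "20", "40"), ("delete", "only", "")]
```

Both sentences read fluently on their own; only the comparison shows that a number changed and the word "only" disappeared.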

Errors also interact: an early corruption changes context, which shifts subsequent outputs, which compounds further. The document drifts away from the original without any single catastrophic failure.

Frontier models corrupt more than they delete; weaker models delete more (Appendix F of the paper).

Implications

For users

  • Do not delegate judgment-heavy editing without reviewing every diff carefully.
  • Delegation is safer for tasks with tight constraints and mechanical verification (code with tests, structured data with a schema).
  • The demo always looks fine. Failure accumulates in the 20th, 50th, 100th interaction — by which point the original intent may be unrecoverable.
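For structured data, the "mechanical verification" mentioned above can be as simple as a gate that rejects any delegated edit touching fields it was not asked to touch. The sketch below is one possible shape for such a gate; the `SCHEMA`, the invoice record, and the `accept_edit` function are hypothetical names invented for this example.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"invoice_id": str, "amount": float, "currency": str}


def schema_errors(record: dict) -> list:
    """List missing or wrongly typed fields according to SCHEMA."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def accept_edit(original_json: str, edited_json: str, allowed_fields: set) -> list:
    """Return a list of problems; an empty list means the edit passes the mechanical checks."""
    try:
        edited = json.loads(edited_json)
    except json.JSONDecodeError:
        return ["edited document is not valid JSON"]
    if not isinstance(edited, dict):
        return ["edited document is not a JSON object"]
    original = json.loads(original_json)
    problems = schema_errors(edited)
    # Any field outside the allowed set must come back unchanged.
    for field in original:
        if field not in allowed_fields and edited.get(field) != original[field]:
            problems.append(f"unexpected change to {field}")
    return problems


original = '{"invoice_id": "A-17", "amount": 120.0, "currency": "EUR"}'
good_edit = '{"invoice_id": "A-17", "amount": 120.0, "currency": "USD"}'
bad_edit = '{"invoice_id": "A-17", "amount": 210.0, "currency": "USD"}'

good_result = accept_edit(original, good_edit, {"currency"})
bad_result = accept_edit(original, bad_edit, {"currency"})
```

The first edit passes with no problems; the second is flagged because `amount` changed even though only `currency` was in scope. A check like this does not replace review, but it turns one class of silent corruption into a visible failure.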

For developers / NLP practitioners

  • Short-horizon benchmarks (2 interactions) are insufficient — they do not predict long-horizon reliability.
  • Generic similarity metrics (including LLM-as-a-judge) fail to capture fine-grained semantic changes; domain-specific parsers are necessary.
  • Agentic scaffolding does not fix the underlying problem.

Connection to Naur's "Programming as Theory Building"

Peter Naur (1985) argued that a program is not its source code — it is the theory (mental model) held by the people who built it. When that understanding is lost, the artifact becomes incomprehensible.

Delegating to an LLM causes the theory to die twice:

  1. You never built the understanding, because you delegated instead of engaging.
  2. The LLM silently corrupted the artifact itself.

You lose both the map and the territory.

Sources