LLMs corrupt your documents when you delegate
Based on: Laban, Schnabel & Neville — LLMs Corrupt Your Documents When You Delegate (Microsoft Research, arXiv 2604.15597, April 2026). Overview: cekrem.github.io
The core finding
Delegating document editing to an LLM introduces silent, compounding corruption. Even frontier models corrupt roughly a quarter of document content after 20 interactions. The damage is not random noise: it is sparse but severe, and it accumulates monotonically with no plateau.
DELEGATE-52 benchmark
The researchers built a benchmark of 310 work environments across 52 professional domains (coding, crystallography, music notation, accounting, genealogy…). Each environment contains:
- A real seed document (~2–5k tokens, permissive license)
- 5–10 complex editing tasks (forward + inverse instruction pairs)
- A distractor context (~8–12k tokens of topically related but irrelevant documents)
Evaluation method — round-trip relay: each task is defined as a reversible pair (forward edit → backward edit). A perfect model returns the original document exactly. Chaining N round-trips simulates a long delegated workflow without requiring human annotation.
seed doc → [forward edit] → transformed doc → [backward edit] → reconstructed doc
score = sim(seed doc, reconstructed doc)
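The relay can be sketched in a few lines, assuming each edit is a plain function from document text to document text. The names `round_trip_relay` and `exact_match` are illustrative stand-ins, not the benchmark's code; in the real setup each edit would be an LLM call and `sim` a domain-specific scorer.

```python
from typing import Callable, List, Sequence, Tuple

Edit = Callable[[str], str]               # hypothetical wrapper around one editing call
Similarity = Callable[[str, str], float]  # returns a score in [0, 1]


def exact_match(a: str, b: str) -> float:
    """Crude stand-in scorer: 1.0 only if the documents are identical."""
    return 1.0 if a == b else 0.0


def round_trip_relay(seed: str,
                     task_pairs: Sequence[Tuple[Edit, Edit]],
                     sim: Similarity) -> List[float]:
    """Chain (forward, backward) edit pairs and score each reconstruction."""
    scores = []
    doc = seed
    for forward, backward in task_pairs:
        doc = backward(forward(doc))   # one round trip: edit, then its inverse
        scores.append(sim(seed, doc))  # always compare against the original seed
    return scores
```

With a lossless pair (reversing the string twice) every score stays at 1.0. With a lossy pair (`str.upper` followed by `str.lower`) the capitalisation is lost on the first round trip and is never recovered, because each later round starts from the already damaged document.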
Similarity is measured with domain-specific parsers (not generic embeddings), because small semantic changes, such as 200g → 800g of butter, must register as errors even if surface overlap is high.
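To make the butter example concrete, here is a toy comparison of a surface-overlap score with a parsed, field-by-field score. The one-ingredient-per-line recipe format, the regular expression, and the function names are assumptions made for illustration; the paper's parsers are not reproduced here.

```python
import re
from difflib import SequenceMatcher

# Toy line format (an assumption): "<quantity><unit> <ingredient>", e.g. "200g butter"
LINE = re.compile(r"(?P<qty>\d+)\s*(?P<unit>g|ml)\s+(?P<name>.+)")


def parse_recipe(text: str) -> dict:
    """Parse recipe lines into {ingredient: (quantity, unit)}; skip unparseable lines."""
    items = {}
    for line in text.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            items[match["name"]] = (int(match["qty"]), match["unit"])
    return items


def surface_similarity(a: str, b: str) -> float:
    """Character-level overlap, blind to what the characters mean."""
    return SequenceMatcher(None, a, b).ratio()


def parsed_similarity(a: str, b: str) -> float:
    """Fraction of ingredients whose quantity and unit agree exactly."""
    pa, pb = parse_recipe(a), parse_recipe(b)
    names = set(pa) | set(pb)
    if not names:
        return 1.0
    return sum(pa.get(n) == pb.get(n) for n in names) / len(names)


seed = "200g butter\n300g flour\n100ml milk"
edited = "800g butter\n300g flour\n100ml milk"
```

On this pair the character-level score stays above 0.95 because a single character changed, while the parsed score drops to 2/3 because one of the three ingredients now carries the wrong quantity.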
Results
19 models tested (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro, and 16 others) over 20 interactions:
| Model tier | Content corrupted after 20 interactions |
|---|---|
| Frontier (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro) | ~25% |
| Average across all 19 models | ~50% |
Key observations:
- Monotonic decline: there is no plateau at 20 interactions, and when the run is extended to 100 interactions the scores keep falling.
- Short-term performance does not predict long-term reliability. Two models nearly identical at 2 interactions (91.5% vs 91.1%) diverged to 48.3% vs 64.1% at 20 interactions.
- Python is the only domain with majority readiness (17/19 models ≥ 98%). Every other domain (prose, structured records, niche formats) fails. The pattern suggests that LLMs cope best in domains where correctness has a mechanical, verifiable definition. Unstructured text has no spec, so corruption there goes unnoticed.
Note: The benchmark includes other Code & Configuration domains (e.g., Database) that did not reach the same readiness bar as Python.
Python's advantage likely comes from two compounding factors: it is massively overrepresented in LLM training data, and its correctness criteria are unambiguous (interpreter, linters, test frameworks).
A less common language (COBOL, Fortran, a niche DSL) would probably degrade far more than Python, mainly because the model has seen much less of it in training; some niche DSLs also lack the interpreter, linter, and test tooling that pins down "correct". The paper's framing supports this: what succeeds is not code as such but mechanically verifiable domains.
This interpretation is the authors' framing extended by the reviewer; the paper does not test niche languages directly. Source: arXiv 2604.15597 — Table 2 (per-domain readiness breakdown after 20 interactions) and Figure 8 (document characteristics effect sizes).
Aggravating factors
| Factor | Effect |
|---|---|
| Longer interaction | Degradation compounds — short simulations underestimate severity |
| Larger documents | More surface area for errors |
| Distractor context | Irrelevant files in the context window worsen results |
| Agentic tool use | +6% additional degradation — more capability, more ways to confidently do the wrong thing |
Nature of the errors
Errors are sparse but severe: the model does not produce gibberish. It makes small, confident changes — a detail shifted, a qualification dropped, a meaning subtly altered. These are invisible on a quick scan and only detectable by careful comparison against the original.
Errors also interact: an early corruption changes context, which shifts subsequent outputs, which compounds further. The document drifts away from the original without any single catastrophic failure.
Frontier models tend to alter content rather than delete it; weaker models more often delete it outright (Appendix F of the paper).
Implications
For users
- Do not delegate judgment-heavy editing without reviewing every diff carefully.
- Delegation is safer for tasks with tight constraints and mechanical verification (code with tests, structured data with a schema).
- A short demo usually looks fine. Failure accumulates by the 20th, 50th, or 100th interaction, by which point the original intent may be unrecoverable.
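One way to make "review every diff" routine is to gate each delegated edit behind a size check and a printed diff. The sketch below uses Python's standard `difflib`; the function name `review_edit` and the `max_changed_lines` threshold are hypothetical, and a small diff is only a prompt for human review, not proof that the edit is correct.

```python
import difflib
from typing import List, Tuple


def review_edit(original: str, edited: str,
                max_changed_lines: int) -> Tuple[bool, List[str]]:
    """Return (within_budget, unified_diff_lines) for a proposed edit."""
    diff = list(difflib.unified_diff(
        original.splitlines(), edited.splitlines(),
        fromfile="original", tofile="edited", lineterm=""))
    # The first two lines are the "---" / "+++" file headers; skip them
    # so that only real additions and removals are counted.
    changed = [line for line in diff[2:] if line.startswith(("+", "-"))]
    return len(changed) <= max_changed_lines, diff


original = "Bake at 180C.\nUse 200g butter.\nServe warm."
edited = "Bake at 180C.\nUse 800g butter.\nServe warm."

ok, diff = review_edit(original, edited, max_changed_lines=2)
print("\n".join(diff))  # a reviewer still has to read this
```

Here the edit stays within the two-line budget (one removed line, one added line), yet the diff shows that the butter quantity changed. The check limits how much there is to read; it does not replace reading it.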
For developers / NLP practitioners
- Short-horizon benchmarks (2 interactions) are insufficient — they do not predict long-horizon reliability.
- Generic similarity metrics (including LLM-as-a-judge) fail to capture fine-grained semantic changes; domain-specific parsers are necessary.
- Agentic scaffolding does not fix the underlying problem.
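The first point can be checked against the two-model figures quoted in the Results. The sketch below projects each model's 2-interaction score to 20 interactions under a deliberately naive assumption that is not from the paper: the same fractional loss at every interaction. The labels "model A" and "model B" are placeholders, and pairing 91.5% with 48.3% and 91.1% with 64.1% assumes the figures are listed in the same order.

```python
def extrapolate(score_at_k: float, k: int, n: int) -> float:
    """Project a score measured after k interactions to n interactions,
    assuming (hypothetically) a constant fractional loss per interaction."""
    return score_at_k ** (n / k)


score_at_2 = {"model A": 0.915, "model B": 0.911}      # quoted in the Results
observed_at_20 = {"model A": 0.483, "model B": 0.641}  # quoted in the Results

for name, s2 in score_at_2.items():
    projected = extrapolate(s2, k=2, n=20)
    print(f"{name}: projected {projected:.1%}, observed {observed_at_20[name]:.1%}")
```

The naive projection puts both models at about 41% and 39%, less than two points apart, whereas the observed scores differ by almost 16 points. Under these assumptions, the short-horizon numbers carry almost no information about which model holds up over a long workflow.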
Connection to Naur's "Programming as Theory Building"
Peter Naur (1985) argued that a program is not its source code — it is the theory (mental model) held by the people who built it. When that understanding is lost, the artifact becomes incomprehensible.
Delegating to an LLM causes the theory to die twice:
- You never built the understanding, because you delegated instead of engaging.
- The LLM silently corrupted the artifact itself.
You lose both the map and the territory.
Sources
- arXiv 2604.15597 — LLMs Corrupt Your Documents When You Delegate
- cekrem.github.io — overview post
- DELEGATE-52 dataset
- Peter Naur — Programming as Theory Building (1985)