Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust: the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
This is where commit-based version control can help mitigate the damage or detect it early. The golden rule of AI assistance in programming (if you’re going to use it) is to check the changes and make sure you understand everything before you commit them. If documents had a git-like version control system, it would be easier to detect corruption early.
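For plain-text documents, that review step can be roughed out in a few lines. This is only a sketch using Python's standard `difflib`; the function names `review_diff` and `changed_line_count`, the file name `document.md`, and the sample text are all made up for illustration. With git itself, the equivalent is running `git diff` before `git commit`.

```python
import difflib


def review_diff(before: str, after: str, name: str = "document.md") -> list[str]:
    """Return the unified diff between two versions of a text document.

    `name` is a hypothetical file name, used only for the diff headers.
    """
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


def changed_line_count(diff_lines: list[str]) -> int:
    """Count added and removed lines, skipping the two file-header lines."""
    body = diff_lines[2:]  # the first two lines are the '---' / '+++' headers
    return sum(1 for line in body if line.startswith(("+", "-")))


# Illustrative data: an edit that silently changes one value.
before = "Title\nThe dose is 5 mg.\nEnd\n"
after = "Title\nThe dose is 50 mg.\nEnd\n"

diff = review_diff(before, after)
print("\n".join(diff))
print("changed lines:", changed_line_count(diff))
```

The diff only surfaces the change; someone still has to read it and notice that the value is wrong.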
Or just use Markdown files in a git repo IDK
Or use Markdown, LaTeX, or Typst. No reason to use Word docs anymore, imo.
That’s not how humans work. We’re lazy.
You ever use snapshot tests? They’re garbage. They’re garbage because people glance at a 300-line diff and go “that seems right”, because a good chunk of the time it is. But some portion of the time, it’s borked up.
Having it in git doesn’t solve that problem.