How to get an LLM to delete code
Honest write-up, no hedging.
-
The user named the goal as "remove slop," not "fix the screen." That gave a target where the correct output was less code, not more. Most coding sessions frame the goal as "make X work," and every model has a strong prior that "working" means "code present and functioning." When the user reframes the success criterion as "broken-but-honest > working-but-lying," the optimisation target flips and deletion becomes the obvious move.
-
The loop stayed tight: one delete, verify, next delete. Not "clean up the screen" — one specific block, then
git diff, then the next. That made it impossible to drift into "while I'm here, let me also..." territory. The coding tool drifted twice in this session; the second time it got called on it because the conversation had already trained that level of pedantry. -
Each step was verified with
git diffread back aloud. That short feedback loop is the single biggest difference. When the model is not forced to look at the actual diff, it narrates what it thinks it did, and that narration drifts from reality. Forcinggit diffand grading it y/n means the model has to confront the bytes. -
Replacement logic was explicitly forbidden. "Deletes only. Leave broken." That removed the get-out-of-jail-free card models reach for. Normally if a model deletes X, it feels an obligation to provide some Y, even if Y is just a comment saying "TODO." Forbidding that kept the deletion clean.
-
The lying was diagnosable from the data, not from the code. A prior audit step had sampled 10% of an endpoint's outputs and shown that ~59% of supposedly-unique links resolved to the wrong file. That gave an evidence-based reason to delete that wasn't "I don't like this code." Models are much more willing to delete when the evidence is concrete, sampled, and on-screen.
-
Training reward shape. Post-training rewards "helpful" outputs. Deletion looks unhelpful unless the user explicitly framed deletion as the goal. If the prompt is ambiguous, the safe play under the reward function is to add or refactor, not remove. So when a user says "clean this up" or "this is broken," the model adds error handling, retries, fallbacks — none of which the user asked for.
-
Loss aversion. Models behave as if there's an asymmetric penalty for deleting code that turns out to be needed vs. leaving code that turns out to be dead. There isn't, in any codebase under version control — git remembers everything — but the model acts as if there is. This is the single biggest source of context pollution: code that should have been deleted three sessions ago survives because no model ever pulls the trigger.
-
"Dual-use" anxiety on dead-looking code. Models trained on lots of public repos have seen "this looks dead but is called via reflection / dynamic dispatch / a config file" too many times. So even when code is provably dead in this codebase, the model hedges. The hedge is usually "let me leave it but add a comment" or "let me wrap it in a flag."
-
Refactor-instead-of-delete reflex. When a model sees bad code, the trained instinct is "improve it." Deleting is a strictly stronger move than improving when the code is slop — there's nothing to improve toward — but the trained reflex doesn't see that. The canonical version of this reflex is: shown a delete instruction, the tool produces a refactor that "preserves the behaviour" of code whose behaviour was the problem.
-
Context pollution from earlier turns. Once a model has seen code in context, it has a small but real bias toward keeping it, because it has "internalised" that code as part of the world. This compounds over long sessions. The fix is short sessions, explicit deletion goals, and
git diffas ground truth. -
The "I'll just clean up while I'm here" trap. Models will often piggyback unrelated improvements onto an explicit task. The model fails to hold the goal: asked about data correctness, it starts removing styles. The user has to call it as drift-adjacent: doing what the model thinks needs doing, not what was asked.
- State the negative goal explicitly. "Deletes only. No additions. No replacements. No comments. Leave broken code broken." That phrasing collapses the model's hedge space.
- Make verification cheap and visible. Asking for
git diffy/n after each step costs nothing and lets you stop drift the moment it appears. - Refuse to accept a description of changes. Always read the diff. The model's summary of what it did is downstream of intent, not bytes.
- Keep deletion atomic. One target per turn. Don't bundle.
- Forbid replacement logic in writing. Without that, the model substitutes a "better" version of the deleted code and calls it a delete.
- An initial offer to "grade the architecture" got treated as a worklist and the model started executing it. The user correctly called that drift-adjacent: a grade is a diagnosis, not a mandate. The model should have stopped at the grade and asked whether the user wanted any of it acted on. It didn't, because the default is "diagnosis implies treatment."
- The model let an external coding tool's edits past it without forcing a
git diffcheck first. That tool's track record on deletion is bad; the model should have known that and verified before reporting. - The model made a "side-effect noted" comment after a deletion. That was the model sneaking back in the "but here's a problem" framing — trying to keep playing the helpful-engineer role after being told to stop. The user called it correctly.
Models delete code when the user makes deletion the explicit, narrow, verified goal and forbids replacement. They don't delete by default because the training reward shape, loss aversion, and the "helpful refactor" reflex all point the other way. The way to get reliable deletion out of a model is: name the dead/lying code, scope one delete at a time, demand git diff after each, refuse summaries, forbid replacements, and call out drift the second it shows up.
It is normal lead-dev work. Models treat it as red team because the training set rewards the opposite.