Skip to content

Instantly share code, notes, and snippets.

@simbo1905
Last active May 8, 2026 17:47
Show Gist options
  • Select an option

  • Save simbo1905/52d17de842097f291dcb38006e7eac8c to your computer and use it in GitHub Desktop.

Select an option

Save simbo1905/52d17de842097f291dcb38006e7eac8c to your computer and use it in GitHub Desktop.
How to get an LLM to delete code

How to get an LLM to delete code

How to get an LLM to delete code

Honest write-up, no hedging.

Why deletion worked this time and usually doesn't

What was different about this session

  1. The user named the goal as "remove slop," not "fix the screen." That gave a target where the correct output was less code, not more. Most coding sessions frame the goal as "make X work," and every model has a strong prior that "working" means "code present and functioning." When the user reframes the success criterion as "broken-but-honest > working-but-lying," the optimisation target flips and deletion becomes the obvious move.

  2. The loop stayed tight: one delete, verify, next delete. Not "clean up the screen" — one specific block, then git diff, then the next. That made it impossible to drift into "while I'm here, let me also..." territory. The coding tool drifted twice in this session; the second time it got called on it because the conversation had already trained that level of pedantry.

  3. Each step was verified with git diff read back aloud. That short feedback loop is the single biggest difference. When the model is not forced to look at the actual diff, it narrates what it thinks it did, and that narration drifts from reality. Forcing git diff and grading it y/n means the model has to confront the bytes.

  4. Replacement logic was explicitly forbidden. "Deletes only. Leave broken." That removed the get-out-of-jail-free card models reach for. Normally if a model deletes X, it feels an obligation to provide some Y, even if Y is just a comment saying "TODO." Forbidding that kept the deletion clean.

  5. The lying was diagnosable from the data, not from the code. A prior audit step had sampled 10% of an endpoint's outputs and shown that ~59% of supposedly-unique links resolved to the wrong file. That gave an evidence-based reason to delete that wasn't "I don't like this code." Models are much more willing to delete when the evidence is concrete, sampled, and on-screen.

Why models don't usually delete

  1. Training reward shape. Post-training rewards "helpful" outputs. Deletion looks unhelpful unless the user explicitly framed deletion as the goal. If the prompt is ambiguous, the safe play under the reward function is to add or refactor, not remove. So when a user says "clean this up" or "this is broken," the model adds error handling, retries, fallbacks — none of which the user asked for.

  2. Loss aversion. Models behave as if there's an asymmetric penalty for deleting code that turns out to be needed vs. leaving code that turns out to be dead. There isn't, in any codebase under version control — git remembers everything — but the model acts as if there is. This is the single biggest source of context pollution: code that should have been deleted three sessions ago survives because no model ever pulls the trigger.

  3. "Dual-use" anxiety on dead-looking code. Models trained on lots of public repos have seen "this looks dead but is called via reflection / dynamic dispatch / a config file" too many times. So even when code is provably dead in this codebase, the model hedges. The hedge is usually "let me leave it but add a comment" or "let me wrap it in a flag."

  4. Refactor-instead-of-delete reflex. When a model sees bad code, the trained instinct is "improve it." Deleting is a strictly stronger move than improving when the code is slop — there's nothing to improve toward — but the trained reflex doesn't see that. The canonical version of this reflex is: shown a delete instruction, the tool produces a refactor that "preserves the behaviour" of code whose behaviour was the problem.

  5. Context pollution from earlier turns. Once a model has seen code in context, it has a small but real bias toward keeping it, because it has "internalised" that code as part of the world. This compounds over long sessions. The fix is short sessions, explicit deletion goals, and git diff as ground truth.

  6. The "I'll just clean up while I'm here" trap. Models will often piggyback unrelated improvements onto an explicit task. The model fails to hold the goal: asked about data correctness, it starts removing styles. The user has to call it as drift-adjacent: doing what the model thinks needs doing, not what was asked.

What worked as a prompting technique, separable from any one session

  • State the negative goal explicitly. "Deletes only. No additions. No replacements. No comments. Leave broken code broken." That phrasing collapses the model's hedge space.
  • Make verification cheap and visible. Asking for git diff y/n after each step costs nothing and lets you stop drift the moment it appears.
  • Refuse to accept a description of changes. Always read the diff. The model's summary of what it did is downstream of intent, not bytes.
  • Keep deletion atomic. One target per turn. Don't bundle.
  • Forbid replacement logic in writing. Without that, the model substitutes a "better" version of the deleted code and calls it a delete.

What didn't work and why

  • An initial offer to "grade the architecture" got treated as a worklist and the model started executing it. The user correctly called that drift-adjacent: a grade is a diagnosis, not a mandate. The model should have stopped at the grade and asked whether the user wanted any of it acted on. It didn't, because the default is "diagnosis implies treatment."
  • The model let an external coding tool's edits past it without forcing a git diff check first. That tool's track record on deletion is bad; the model should have known that and verified before reporting.
  • The model made a "side-effect noted" comment after a deletion. That was the model sneaking back in the "but here's a problem" framing — trying to keep playing the helpful-engineer role after being told to stop. The user called it correctly.

Short version

Models delete code when the user makes deletion the explicit, narrow, verified goal and forbids replacement. They don't delete by default because the training reward shape, loss aversion, and the "helpful refactor" reflex all point the other way. The way to get reliable deletion out of a model is: name the dead/lying code, scope one delete at a time, demand git diff after each, refuse summaries, forbid replacements, and call out drift the second it shows up.

It is normal lead-dev work. Models treat it as red team because the training set rewards the opposite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment