Skip to content

Instantly share code, notes, and snippets.

@oneryalcin
Created April 28, 2026 08:58
Show Gist options
  • Select an option

  • Save oneryalcin/f6f241c3befea0e1e4e73f25501f9935 to your computer and use it in GitHub Desktop.

Select an option

Save oneryalcin/f6f241c3befea0e1e4e73f25501f9935 to your computer and use it in GitHub Desktop.
Golden Rule of Tools in Code Mode

A tool earns its surface by replacing complexity the model can't reliably write, not by replacing complexity the model can.

xlsx info and xlsx get clear that bar:

  • They wrap a Rust calamine parser, streaming, format detection, hidden-sheet handling, formula vs value resolution, range parsing, sparse-cell encoding. Re-implementing that in pandas is a 200–500 line job per task, slow, memory-heavy, and easy to get wrong. The tool is the only sane path.
  • xlsx analyze clears it too, but for a different reason: it's an LLM call with a curated prompt, caching, retry/fallback, transient-error handling. Not impossible to inline, but the prompt-engineering and infra are the actual product.

union_count doesn't clear that bar. It's six lines of pandas. Modern models write six lines of pandas correctly an overwhelming fraction of the time, and when they don't, the bug is visible in the report because the code is right there. Adding a tool to "make sure they get it right" is solving the wrong problem — the right answer is "let them write it, and make sure the report has a Methodology block that surfaces what they did".

What this gives us as a heuristic

A new tool is warranted when:

  1. The wrapped logic is non-trivial to re-implement — hundreds of lines, or specialized infrastructure (Rust, network, DB, LLM, parser).
  2. The wrapped logic is hard to get right — concurrency, edge cases the model wouldn't think of, version-pinned external behaviour.
  3. OR the wrapping centralizes something that changes — model versions, registry URLs, retry budgets, prompts. The tool exists so callers don't get tied to those.

A new tool is not warranted when:

  1. The model can write the equivalent in a screen of code it can read back in the report.
  2. The tool would just be "a function I trust slightly more than the inline version."
  3. The "centralization" argument is hypothetical — there's no second caller and may never be.

union_count fails all three of the "warranted" tests and matches all three of the "not warranted" ones.

★ Insight ─────────────────────────────────────

  • This is a deeper version of the Carmack point ("ruthless simplicity wins"): when you have a model that can write working code on demand, the model itself is the abstraction layer. Adding a function is duplicating it. The cost isn't the function — it's the function plus the prompt that teaches the function plus the docs plus the discoverability burden plus the maintenance forever.
  • The trap I was falling into is one I'd flag in someone else's PR: I was sizing the function by the bug it fixes, not by the complexity it removes. Bug size is a red herring for tool design. A 1-line bug can warrant a 200-line tool (e.g. a Rust parser fixing one off-by-one in pandas) and a 1000-line bug can warrant zero tool changes (just better prompts and docs).
  • This also explains why xlsx analyze was the right thing to build: it's not 200 lines of pandas, it's an LLM call wrapped in caching, retries, prompt-versioning, and transient-error fallback. The inline version of that is genuinely beyond what an agent should hand-roll mid-task. Different category. ─────────────────────────────────────────────────
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment