Self-instrumenting your AI agent to A/B its own tools

A small technique that's paid off: have your coding agent emit a consistent marker in its own output every time it makes a judgment call, so you can later mine your transcripts to see which choice actually wins over time.

The setup

I run web searches through more than one engine (Brave, Google-via-Serper, and the model's built-in search), in parallel, on the same query. The agent answers my question normally — but then appends one line ranking the engines for that query, always led by the same sentinel emoji:

🔎 brave > google > builtin here — brave's top hit was authoritative, google buried it, builtin synthesized but added an unsourced claim

That's it. No dashboards, no scoring rubric in the moment — just a single ranked line with a fixed prefix. The prefix is the whole trick: it's a grep-able hook I planted in my own output on purpose.

Why this works

The agent is already making the judgment ("which result was best?") on every search. Normally that judgment evaporates the moment the turn ends. By forcing it into a structured-ish line with a unique marker, you turn a fleeting decision into longitudinal data you can mine later — without any extra logging infrastructure, database, or behavior change to the actual task.

The marker matters more than the format. Free-form prose after the marker is fine; you parse it after the fact.

Mining it

Coding-agent transcripts are typically just JSONL on disk (one file per session). A ~150-line script scans every transcript for the marker line, normalizes the tool names, and tallies first-place finishes plus head-to-head win rates:

📊 SEARCH/FETCH SCOREBOARD  (124 verdicts, 2 ties)

First-place finishes (ties count ½ each):
  brave        68.0  (58% of its 118 appearances)
  google       48.5  (41% of its 119 appearances)
  reddit        4.5  (75% of its  6 appearances)

Head-to-head:
  brave vs google      brave:67  google:49  (n=116)

The interesting result: across 124 real queries, the two general engines split 58/41. That's wide enough that the disagreement itself is the value — dropping the "losing" engine would mean silently taking a worse answer ~4 times in 10. A near-tie (95/5) would have told me I could safely drop one. The data answers a question I couldn't have eyeballed.

Parsing notes

A few things that make the after-the-fact parse robust:

Match multi-word tokens before bare ones (serper scrape before scrape, web search before generic) so context-specific names normalize correctly.
Treat order-of-appearance as the ranking — the convention is best-to-worst, so the first tool named is the winner. This survives prose variation ("roughly even this round", "X adds what Y couldn't") far better than trying to parse > operators rigidly.
Normalize aliases to canonical names (serper → google, websearch/builtin/anthropic → builtin).

When not to add structure

I considered making the line machine-readable (lead with 🔎 [brave>google] then prose). Decided against it: the prose parses fine for the volume I have, the edge cases are rare and don't move the tally, and the only consumer is the same agent that wrote it. A rigid tag would only earn its keep if a different tool had to parse it reliably, or at much larger scale. Don't pay for precision you won't use.

The general pattern

This isn't really about search engines. It's: whenever your agent picks between interchangeable tools/strategies and forms an opinion about which was better, have it leave a grep-able breadcrumb. Then your own transcript history becomes a free, continuously-growing A/B dataset for tuning your setup — model choice, retrieval strategy, fetch tool, prompt variant, whatever. You're already generating the judgments; just make them mineable.

Conceptron/self_instrumenting_search_eval.md

Select an option

No results found