A small technique that's paid off: have your coding agent emit a consistent marker in its own output every time it makes a judgment call, so you can later mine your transcripts to see which choice actually wins over time.
I run web searches through more than one engine (Brave, Google-via-Serper, and the model's built-in search), in parallel, on the same query. The agent answers my question normally — but then appends one line ranking the engines for that query, always led by the same sentinel emoji:
🔎 brave > google > builtin here — brave's top hit was authoritative, google buried it, builtin synthesized but added an unsourced claim