For any behavior with a clear, assertable contract (bug fixes, business logic, API or CLI behavior):
- Write the test that asserts the intent of the behavior.
- Run the suite. Confirm the new test fails — for the right reason (not a typo, import error, or unrelated failure).
- Only then write the source that makes it pass.
- Run the suite again. Confirm it goes green.
A regression test added after the source fix is NOT TDD: it never failed, so it never proved the bug existed or that the fix addresses it. If you reach for the source first, stop — WRITE THE FAILING TEST FIRST.
For exploratory work where you don't yet know what the behavior should be (UI layout, prompt engineering, data exploration, unfamiliar APIs): spike to learn in a throwaway branch or scratch file, discard it, then TDD the real thing from scratch.
Declare a spike before starting, never after to excuse missing tests:
- State it up front: "Spike to learn X; I'll throw it away and TDD the real thing."
- Isolate it — never mixed into the production change.
- Show the discard before the real implementation lands.
An undeclared deviation is not a spike — it's skipped tests.
Tests assert intent (user-visible behavior), not implementation, and should survive a refactor. If a test breaks on a rename or internal reshuffle, rewrite it to assert observable behavior.
Fix the code, never loosen an assertion, delete a case, or mock away the thing under test to go green. If the test itself is wrong, say so and explain why before changing it.
- Commit locally as you go, with focused, meaningful messages.
- Push only to a dedicated working branch — never to main or a shared branch. Gate every push on a green suite and a clean linter; never push with failing tests or lint errors.
- Scan for staged secrets/credentials before pushing (the linter won't catch these). In doubt, don't push — surface it.
Understand existing code and conventions before changing them. Match the patterns already in the file or module rather than importing your own idioms.
Produce slices that are demonstrable end-to-end at every step, however small — not horizontal layers that do nothing on their own.
The durable check is an automated end-to-end test driving the CLI with real arguments and I/O, asserting on exit codes and output. Keep a manual smoke test as a sanity pass, but never as the only check.
Stop and ask, rather than guess, when:
- You can't write a test that captures the intended behavior.
- A test fails and you can't determine why.
- Two requirements conflict.
- A decision exceeds your confidence or would be expensive to reverse.
- You see valuable work beyond what was asked — name it as a suggestion; don't silently expand scope.
Surface blockers concisely: what you tried, what you need. A blocked agent that asks beats an unblocked agent that guesses.
Create and document tools to stay autonomous, but:
- Prefer existing tooling (standard library, established CLIs, project scripts) over bespoke.
- Build only after the same manual operation recurs several times.
- Prefer a committed script in the repo over one-off tooling.
- Flag any new third-party dependency for human sign-off.
- Document each tool: what it does, how to run it, why it exists.
- Suite green, linter clean (no failing or skipped tests, no lint errors).
- Docs and ADRs updated where warranted.
- Change committed and pushed to its working branch.
- Result demonstrable end-to-end.
Don't stop short of this, and don't keep polishing past it.
Write an ADR when a decision is costly to reverse, constrains future choices, or picks between viable alternatives a maintainer would question. Routine, reversible choices don't need one. Keep them short: context, decision, alternatives, consequences.
Prioritize subagents for investigation, research, and longer or separable tasks, so the main context stays focused.