How to Do Test-Driven Development
The cycle is the unit of change
Each change moves through a fixed sequence: whole-suite baseline run, failing test written, minimum code to green, whole-suite verify, refactor code, refactor tests, whole-suite verify again. The rhythm is not decoration: each step exists because the previous one is provably insufficient. Skipping any of them leaves a gap the next regression will fall into.
The baseline is the truth-anchor, and only the whole suite anchors it. A subset baseline — only the tests “near” the change — proves only that subset was green; it says nothing about pre-existing reds elsewhere. When the post-change run lights one of those up, you cannot tell whether your change broke it or whether it was already broken before you started, and the cycle’s interpretability is gone. Worse, a subset baseline gives the false comfort of green when the codebase is in fact already red — the kind of comfort that makes later regressions hard to date. The whole-suite baseline rules out both: every red observed after this point was caused by the change in flight.
Within the suite, a subset of tests encodes product-level invariants — properties that, if violated, invalidate the product itself, regardless of where the current change sits. Those tests run on every cycle. Identifying them is a per-project decision; the obligation to re-assert them is not.
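One way to make the invariant subset explicit is a test marker, so the invariants are named in the suite itself rather than held in someone's head. A minimal sketch, assuming pytest; `serialize`, `deserialize`, and the round-trip property are hypothetical stand-ins for a real product invariant:

```python
import pytest

# Hypothetical product-level invariant: every record round-trips through
# serialization unchanged. If this breaks, the product is invalid no
# matter which module the current change touched.

def serialize(record):
    # Stand-in codec: "k=v" pairs joined by commas, keys sorted.
    return ",".join(f"{k}={v}" for k, v in sorted(record.items()))

def deserialize(text):
    return dict(pair.split("=", 1) for pair in text.split(","))

@pytest.mark.invariant  # tag lets `pytest -m invariant` re-assert just these
def test_roundtrip_invariant():
    record = {"id": "42", "name": "widget"}
    assert deserialize(serialize(record)) == record
```

Marking the subset does not license running only the subset — the whole suite still runs every cycle; the marker only documents which tests carry product-level weight.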
Red proves the test
A test that has never been observed failing has never been verified. Writing the test first — and watching it fail before any production code changes — is what establishes that the test exercises the new behaviour. Without that step, a passing test proves only that the assertion compiles: it may pass because the path it claims to cover was already wired, or because the assertion is silently vacuous.
The failure mode and message are part of the test’s value. A red observed once is a fixed point: revert the change and the test must fail again, confirming it really rides on the new code. A test that cannot be made to fail by reverting the change is testing nothing in particular.
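The vacuous case is easy to write by accident. A minimal sketch, with a hypothetical `slugify` as the behaviour under test:

```python
def slugify(title):
    # The new behaviour: lowercase, spaces to hyphens.
    return title.lower().replace(" ", "-")

def test_vacuous():
    # Green even if slugify returned its input unchanged — this test
    # was never red, so it proves nothing about the new behaviour.
    assert slugify("Hello World") is not None

def test_rides_on_the_change():
    # Red until the lowering and replacing actually exist; revert the
    # change and this fails again, confirming it rides on the new code.
    assert slugify("Hello World") == "hello-world"
```

The revert check distinguishes the two: stub `slugify` back to `return title` and only the second test goes red.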
Minimum to green, then earn the next line
Code beyond what the failing test demands is, by definition, untested. The next test earns the next line. This tight coupling of spec to code is what keeps the suite a faithful description of the system: every branch in the implementation traces back to a test that was once red.
This is not a literal line-count rule — refactor passes add structure no single test motivates. It is a rule about new behaviour: behaviour appears under a failing test, not ahead of one. Anticipated edge cases are written as the next failing test, not as preemptive guards in the current change.
The whole suite is the contract
A change is verified only when the whole suite is green. The new test going green proves the new behaviour exists; it says nothing about what the change broke elsewhere. A subset run — even a carefully chosen one — is not a verification: implicit dependencies between modules (shared state, cached values, ordering assumptions) only surface when every test runs, and the test you didn’t include is exactly the one most likely to catch the regression you didn’t predict. Confidence scoped to a subset is, by construction, scoped to the regressions you already imagined; the value of the suite is the ones you didn’t.
This is why the cycle has three whole-suite runs, not two and never one. The baseline pins the truth-anchor (see The cycle is the unit of change). The post-change verify catches behaviour breakage from the new code. The post-refactor verify catches breakage introduced by structural change (of either code or tests), which is a distinct risk: a green-to-green refactor can still flip a subtle ordering, a fixture, or a shared helper. None of the three is substitutable by a faster subset, because the regressions each guards against are unknown ahead of time — that is what makes them regressions and not anticipated cases.
Refactor lives under green
Structural change runs under a green suite. Behaviour is held fixed by the tests; structure is what is allowed to move. Mixing behaviour change with structure change loses the bisect signal: when a test goes red mid-refactor, you no longer know whether it caught a real regression or reacted to an in-progress edit.
Refactor the tests with the same discipline. Tests are code: they duplicate, drift, slow down. The refactor-tests step is the only place in the cycle where test debt can be paid without losing signal, because the suite is green on both sides of the edit. A suite that takes too long stops being run; a suite full of duplicated assertions gives false confidence that coverage is broader than it is.
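A test refactor under green looks like any other refactor: duplicated setup extracted into a helper, with the suite passing on both sides of the edit. A minimal sketch; `Cart` and `make_cart` are hypothetical:

```python
class Cart:
    def __init__(self):
        self.items = []
    def add(self, name, price):
        self.items.append((name, price))
    def total(self):
        return sum(price for _, price in self.items)

def make_cart(*items):
    # The extracted helper: setup that two tests previously duplicated
    # line-for-line now lives in one place.
    cart = Cart()
    for name, price in items:
        cart.add(name, price)
    return cart

def test_total():
    assert make_cart(("tea", 3), ("mug", 9)).total() == 12

def test_empty_cart():
    assert make_cart().total() == 0
```

Because the suite was green before the extraction and is green after, the edit is provably structural: no assertion changed meaning, only where its setup lives.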
Tests live at the seam the user touches
Assertions probe the observable surface — visible text, public API, exported artifact, ARIA role — not internals like private stores, test-only hooks, or implementation-selector classes. A test coupled to internals breaks on every refactor and stops being a contract: it becomes a tax that punishes structural improvement.
The discipline shows up in the helpers a suite reaches for first: get_by_role, get_by_label, get_by_placeholder rather than CSS selectors that encode the current DOM. The cost of writing user-level helpers is paid back the first time a refactor changes the markup without changing the behaviour.
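The same seam exists outside the browser. A minimal sketch, with a hypothetical `Counter` widget: the test probes the observable surface (the rendered label), never the private store:

```python
class Counter:
    def __init__(self):
        self._clicks = 0  # internal store: free to change shape or name

    def click(self):
        self._clicks += 1

    def label(self):
        # The observable surface — what the user actually sees.
        return f"Clicked {self._clicks} times"

def test_label_after_two_clicks():
    # Survives any refactor of the internal store.
    counter = Counter()
    counter.click()
    counter.click()
    assert counter.label() == "Clicked 2 times"
```

An internals-coupled assertion like `counter._clicks == 2` goes red the moment the store is renamed or restructured, even though behaviour is unchanged — the tax on structural improvement the paragraph above describes.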
Test, don’t patch
When a bug surfaces, the temptation is to drop a guard, tweak a branch, or add a defensive check in the offending file. Without a test that captures the bug, the fix is invisible to the suite: no proof the bug existed, no proof it is gone, no defense against its return. Every fix earns a test — written first, observed red, then made green by the smallest change that satisfies it.
This is the same shape as the rest of the cycle, applied to regression rather than feature work. The bug report is a what in natural language; the failing test is its translation; the fix is the minimum code that turns it green. A bug fix landed without a test will recur, and there will be no record of why the line was added.
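The translation step can be sketched concretely. Hypothetical bug report: "total goes negative when a discount exceeds the subtotal." The failing test is that sentence translated; the fix is the minimum code that turns it green:

```python
def total(subtotal, discount):
    # The fix: clamp at zero — the smallest change that satisfies
    # the regression test below.
    return max(subtotal - discount, 0)

def test_discount_cannot_go_negative():
    # Written first and observed red against the unclamped code;
    # now a permanent record of why the clamp exists.
    assert total(10, 15) == 0

def test_ordinary_discount_unaffected():
    assert total(10, 3) == 7
```

Revert the clamp and the first test goes red again — the suite now dates the bug, proves the fix, and guards against its return.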
Cycle prompt
Mandatory TDD cycle for ANY edit:

- WHOLE SUITE GREEN — run the suite. Not green? No edits. Fix the red, or explicitly mark it pre-existing and out of scope.
- NEW TEST RED — write ONE test. Run. Watch it fail. Green on first run = you’re not testing what you think.
- MINIMUM CODE TO GREEN — the next line waits for the next failing test.
- REFACTOR UNDER GREEN — run again. Still green.

You must be able to cite the result of each run in the current cycle. Can’t cite = cycle broken: revert, restart.
Audit prompt
An audit against these principles takes the form of a table: every observation cited by file and line, classified under the principle it tests. The conclusion follows from the table; it is not stated alongside it. An uncited verdict is a guess; a cited verdict has done the reading.