TDD in the AI Era
Posted by bsstahl on 2026-06-03 and Filed Under: development
AI has increased the speed with which we can generate solutions. These tools are so much faster at generating code than development processes have been in the past that it has outpaceed our ability to understand and reason on these solutions. We have to depend more than ever on our processes for understanding and gating development. Test-Driven Development (TDD) still provides a core feedback loop we can rely on: define behavior first, verify it with executable checks, then improve design through refactoring with confidence. This Red-Green-Refactor loop is even more critical in the AI era, because speed without discipline creates tremendous risk for all but the most trivial of codebases.
This post is grounded in the ideas I cover in my workshop, TDD in the AI Era. It is also the extended version of my answer to the AI-related question in my TDD FAQ.
The Bottleneck Has Moved
Before AI-assisted tooling, implementation throughput was often constrained by typing speed and mechanical coding effort. We would understand the requirement, translate intent into code, and pay that cost at the keyboard. With modern LLM-based tooling, that specific constraint has dropped sharply. Work that once took hours to type can now be generated in seconds.
In Theory of Constraints terms, removing one bottleneck exposes the next. In AI-assisted delivery, the dominant constraints are now:
- Problem framing and decomposition: The quality of your specification, and your ability to partition behavior into small testable units, now matter more than raw implementation speed.
- Verification: Plausible output is cheap; trustworthy output is not. The cost of confidence has not fallen at the same rate as the cost of generation.
- Comprehension: Fast generation can outrun human understanding. Rebuilding a lost mental model is expensive, especially in distributed systems with asynchronous flows.
TDD is one of the few practices that addresses all three directly.
Capability Is Not Permission
An AI can generate an end-to-end implementation from a well-formed prompt. That is real capability. The architectural question, though, is not "can we generate this?" but "should we accept this implementation path in this lifecycle context?"
There are cases where prompt-and-accept is perfectly rational: throwaway scripts, temporary migration utilities, or intentionally black-boxed components behind stable contracts. But for shared libraries, long-lived product code, and systems with non-trivial operational characteristics, ownership cannot be delegated away. Someone still owns correctness, observability, and evolution.
That is exactly why Red-Green-Refactor remains essential. It keeps humans in control of behavior boundaries while still leveraging AI for acceleration.
Tests Matter More Than the Code
The key mindset shift is straightforward: implementations are increasingly disposable; tests are durable assets.
If your tests are sound, any contributor, human or AI, can refactor aggressively and preserve behavior. If tests are weak, missing, or mis-specified, elegant generated code does not make the system trustworthy.
None of that is new in principle. What has changed is urgency. In the AI era, regeneration is nearly free, so the artifact that persists is the behavioral contract encoded in tests. Future engineers depend on that contract to reason about change, prevent regressions, and avoid accidental dual-writes across boundaries.
In practice, tests also become your best comprehension artifact. Generated code explains one possible how. Well-designed tests, designed and validated by humans from intent, explain what and why. In fast-moving systems, that distinction is not academic; it is operational.
AI as a Pairing Partner
The most productive model I have found is to treat AI as a pairing partner, not an autonomous code generator.
A generator takes input and emits output. A pairing partner participates in design: challenging assumptions, comparing alternatives, evaluating trade-offs, and iterating quickly. Used that way, AI is highly effective. It can generate candidate designs rapidly, adapt to feedback without ego, and help explore options teams might otherwise skip.
But architecture ownership does not transfer. AI does not own your bounded contexts, integration strategy, reliability posture, or acceptance criteria. Those remain human responsibilities. AI changes velocity; it does not change accountability.
The Loop Stays the Same
Red-Green-Refactor remains the control system. What changes is who does the typing and what each phase looks like in practice.
Red: Establish Constraints Before Writing Code
Red is where the highest-value work happens in AI-assisted TDD. Before implementation, use the AI dialogue to surface assumptions explicitly:
- What are the inputs, outputs, and failure modes?
- Which invariants are non-negotiable?
- Which dependencies are in scope, and which are intentionally excluded?
- What idempotent behavior is required if retries occur?
Then write the smallest failing test that encodes that boundary. A reliable prompt pattern here is: "Strict TDD mode; no implementation yet."
If you accumulate multiple failing tests before producing any working behavior, treat that as a decomposition signal. The behavior is probably too large for a single loop iteration.
Green: Implement the Minimum, No More
With one failing test in place, ask for the smallest implementation that makes that test pass and nothing else. Prompt pattern: "Strict TDD mode; satisfy current tests only."
Then run the tests and verify the red-to-green transition explicitly. Do not infer success from code appearance. AI can satisfy assertions in shallow ways: hard-coded constants, short-circuit logic, or path-specific hacks that technically pass but are behaviorally wrong.
Also keep docs and intent notes synchronized. If behavior is concrete enough to test, it is concrete enough to describe.
Refactor: Prove and Improve
Refactor is where AI can deliver significant value, provided tests are holding the line.
Structural refactoring (naming, extraction, abstraction boundaries) remains the classic goal. AI can accelerate this safely when constrained by: "Strict TDD mode; refactor only; preserve behavior."
Mutation testing has moved from optional in theory to a practical workflow component. Agents can introduce realistic mutations (boundary flips, operator substitutions, null-path inversions, return-value swaps), run the suite, and report surviving mutants. Surviving mutants are not trivia; they expose weak assertions and blind spots.
Use coverage as a diagnostic signal, not a vanity metric. Uncovered paths after green are questions to answer: missing test, unintended behavior, or intentional boundary.
What AI Should Not Do
A few patterns are worth watching for and avoiding:
AI-authored tests accepted without a genuine red phase. If tests are synthesized from existing implementation, they usually encode current behavior rather than intended behavior. If you need to generate tests after implementation, it is a good idea to comment-out all functionality in favor of the standard "Not Implemented" mechanism for the programming environment, and then restore functionality 1 test at a time. Mutation testing is also a key to success here.
Scope creep through cheap generation. When implementation cost approaches zero, overbuilding risk increases. TDD provides the governor: no test, no feature.
Implementation-coupled white-box tests. AI frequently mirrors internals in test shape. Those tests are brittle and break during valid refactors. Prefer behavior-based assertions at boundaries.
Throughput beyond comprehension. Shipping code faster than humans can understand it creates latent operational risk. TDD gates are comprehension checkpoints that keep teams from accumulating opaque behavior, and moderate the size of pull-requests to keep validation steps practical.
Starting Your Own Workflow
The simplest AI-assisted TDD workflow needs no special tooling beyond the AI chat tool you already have. You can start immediately:
- Describe the problem and explicitly define boundaries, success criteria, and non-goals.
- Co-author one failing test and verify it fails for the expected reason.
- Request the minimal implementation required to pass that test.
- Verify green, then refactor for clarity and maintainability.
- Repeat with the next behavior.
From there, add richer workflow elements as new bottlenecks and pain points appear. Useful additions include IDE agents, mutation pipelines, coverage gates, and architecture checks. The foundation does not change. Human judgment owns test intent and boundary decisions; AI amplifies execution.
The Discipline Is the Point
AI reduces the marginal cost of producing code. It does not remove the cost of producing correct, maintainable, and comprehensible systems. Red-Green-Refactor closes that gap by enforcing pre-implementation intent, executable verification, and continuous design improvement.
Comprehension is not optional polish. It is an engineering control. A codebase that "works" but cannot be reasoned about is operational debt waiting to mature.
A codebase where behavior is specified through tests before implementation exists is different. It is explainable, evolvable, and safer to change. That was true before AI. With near-zero generation cost, it is even more true now.