TDD in the AI Era

Barry S. Stahl

Solution Architect & Developer

Microsoft .NET MVP

Physics & Math Nerd

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com


Favorite Physicists & Mathematicians

Favorite Physicists

Harold "Hal" Stahl
Carl Sagan
Richard Feynman
Marie Curie
Nikola Tesla
Albert Einstein
Neil deGrasse Tyson
Niels Bohr
Galileo Galilei
Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble, Leonard Susskind, Christiaan Huygens

Favorite Mathematicians

Ada Lovelace
Alan Turing
Johannes Kepler
René Descartes
Isaac Newton
Emmy Noether
George Boole
Blaise Pascal
Johann Gauss
Grace Hopper

Other notables: Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot

Fediverse Supporter

See my presentation from CodeMash 2025: How the Fediverse Could Save Democracy & Why It Probably Won't
Hear my thoughts on why a thriving Fediverse is so critical on David Giard's Technology and Friends show.
Learn more about the Fediverse at fediverse.party.
Get your own Mastodon account at Join Mastodon.
- Don't stress on what server to use, just pick one and go. You can always move later.
Follow me on Mastodon at @bsstahl@CognitiveInheritance.com.

Some OSS Projects I Run

Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
Prehensile Pony-Tail : A static site generator built in C#
TestHelperExtensions : A set of extension methods helpful when building unit tests
Conference Scheduler : A conference schedule optimizer
IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
LiquidNun : Library of abstractions and implementations for loosely-coupled applications
Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

http://GiveCamp.org

Achievement Unlocked

The Abstraction Ladder

Machine code
- Exact instructions the hardware executes.
  - Maximum precision
High-level languages
- Express logic in readable source code.
  - Explicit control flow & data structures
Frameworks and libraries
- Reusable abstractions and patterns.
  - Accelerate implementation and consistency.
AI-assisted intent
- Describe outcomes, constraints, & behavior.
  - Tools synthesize structure.

Judgement Required

We can delegate implementation now
- A strong spec can be thrown over the wall and AI may produce plausible code.
Capability is not permission
- The decision is not "can we"; it is "should we in this context".
Use over-the-wall only with strict boundaries
- Disposable code or permanently black-boxed services with stable contracts.
Otherwise keep humans in the loop
- Shared, long-lived, or high-risk code needs human-readable intent and design.

Theory of Constraints

Typing is no longer the main bottleneck
- We produce code faster than we can understand it.
Understanding the problem dominates
- This is now the constraint.
- Spec quality and architectural choices limit progress.
- Architect for ability to reason about the system.
Decomposition still matters
- Large problems need careful slicing.
Verification is the safety valve
- Confidence costs (time & money).

Principle - Simplest Thing First

Start with the smallest meaningful step
- Write the minimum failing test that proves you are solving the right problem.
Then evolve safely
- Make it pass, then refactor after behavior is protected by tests.
Principle, not recipe
- This is the same mindset we apply to evolving an AI-assisted workflow.

Red-Green-Refactor Defines Scope

Use the loop as the decomposition boundary
- Each step is a bounded implementation unit.
Red defines control
- Failing tests represent our definition of done.
Green uses minimal bounded implementation
- Implement only what is needed and in-scope.
Refactor restores clarity before expansion
- Review, validate, and clean up before moving on.
Result: speed without loss of accountability
- Controlled implementation in a logical process.

TDD Helps Keep Scope In-Check

TDD breaks work into manageable chunks
Each loop covers a single behavior
Small steps reduce cognitive load and risk
Focused tests clarify intent and boundaries
Confident refactoring is possible with narrow scope

Tests Matter More Than the Code

Define what the system must do
- Any agent (human or machine) can read the test to understand intent
Guarantee correctness
- Visible to both machines and humans
Make implementation safely changeable
- Any agent (human or machine) can refactor ruthlessly
Tests outlive implementations
- Code is temporary; behavior is durable
- Tests preserve the behavior

Key takeaways
- If the tests are right, the code can be anything
- If the tests are wrong or missing, the code can't be trusted

Coupling and Cohesion

Coupling: dependency load between units
- Tighter coupling means 1 change forces many edits
- Increases complexity and raises risk
Cohesion: focus within a unit
- High cohesion means modules have 1 reason to change
- Changes are isolated to a single module
The TDD loop exposes design friction
- Red reveals setup and dependency pain
- Green shows overreach pressure
- Refactor makes unclear responsibilities obvious

The Facade Family of Patterns

Facade creates a stable test seam
- Callers get a small, stable surface area
Repository is facade for data access
- Tests can isolate persistence with fakes/mocks
Strategy is facade for business logic
- Behavior can be tested independent of algorithm

What Do We Need to Test?

Do we need 100% test coverage?
What is "too trivial" to test?
Should we test infrastructure code like logging?

What Do We Need to Test?

Image Source: Jeremy Clark (2015), Unit Testing Makes me Faster
Jeremy's article: Unit Test Coverage: What Parts of Your Application Do Your Users Not Care About?

Testing In Isolation

London School — Interaction-Driven
- Steve Freeman & Nat Pryce
  - Growing Object-Oriented Software, Guided by Tests
- Unit = a single class (collaborators mocked)
- Tests assert interactions
- Pros: strict boundaries, explicit roles
- Cons: brittle tests tied to internals
Detroit School — Behavior-Driven
- Kent Beck & Martin Fowler
  - Test-Driven Development: By Example (Kent Beck)
- Unit = a behavior across objects (few mocks)
- Tests assert outcomes
- Pros: resilient tests, refactoring freedom
- Cons: harder to isolate business rules


Recommended Approach

Detroit Philosophy at the Core
- Test behavior, not interactions
- Use real collaborators within a layer
- Keep tests stable under refactoring
Guiding Principle
- Define layer boundaries using SRP
- Mock/Fake boundaries, not objects
- Test behaviors, not collaborations
- Examples: data stores, external services, etc.
Why This Works
- Invariants tested in full isolation
- Internal object graph can evolve freely
- Clear seams for integration, observability, and resilience
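One way to picture "mock/fake boundaries, not objects" is the sketch below. It assumes a hypothetical registration feature (`RegistrationService`, `NameNormalizer`, `InMemoryUserStore`), none of which comes from the slides. The data store sits at the layer boundary and is faked. The normalizer is an internal collaborator and is used for real. The test asserts the outcome, not which calls were made.

```python
class InMemoryUserStore:
    """Fake for the data-store boundary (hypothetical name)."""

    def __init__(self) -> None:
        self.saved = {}

    def save(self, user_id: str, display_name: str) -> None:
        self.saved[user_id] = display_name


class NameNormalizer:
    """Real collaborator inside the layer; used as-is, never mocked."""

    def normalize(self, raw: str) -> str:
        # Collapse runs of whitespace, then capitalize each word.
        return " ".join(raw.split()).title()


class RegistrationService:
    def __init__(self, store) -> None:
        self._store = store                  # boundary: injected, fakeable
        self._normalizer = NameNormalizer()  # internal detail: free to change

    def register(self, user_id: str, raw_name: str) -> None:
        self._store.save(user_id, self._normalizer.normalize(raw_name))


def test_register_stores_normalized_name() -> None:
    store = InMemoryUserStore()
    RegistrationService(store).register("u1", "  ada   lovelace ")
    # Outcome-based assertion: survives refactoring of the internals.
    assert store.saved["u1"] == "Ada Lovelace"
```

If `NameNormalizer` were later inlined or split into two classes, this test would not need to change.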

TDD Before AI

Red: write one failing test.
- Describe intent and any boundaries.
- Tools might help stub the tests.
Green: Write the minimum code for tests to pass.
- Most manual typing happened here.
- Don't create structure, just get to green.
Refactor.
- Clean up names and logic.
- Improve structure and design.
- Run full test suite to prevent regressions.

Tests Drive the Next Action

Change Password service

Invariants
- newPassword must meet criteria
- oldPassword must match current password
If invariants are satisfied, password is updated
- Change is successful → return true
- Return false otherwise
Start with Test 1: minimum length check
- Stub the function
- Create the test
- See the test fail appropriately
Questions:
- Method signature (input/output)
- API Surface location
- Criteria for password strength

Test Cases:

[*] <n chars → false

[ ] old <> current → false

[ ] All rules satisfied → true

Method Stubs

Language	Canonical Stub Mechanism
C#	`throw new NotImplementedException();`
C++	`throw std::logic_error("Not implemented");`
Elixir	`raise "not implemented"`
Go	`panic("not implemented")`
Java	`throw new UnsupportedOperationException("Not implemented");`
JavaScript / TypeScript	`throw new Error("Not implemented");`
Kotlin	`TODO("Not implemented")`
PHP	`throw new \BadMethodCallException("Not implemented");`
Python	`raise NotImplementedError("Not implemented")`
Ruby	`raise NotImplementedError, "Not implemented"`
Rust	`todo!()`
Swift	`fatalError("Not implemented")`

Stub the API Surface
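A possible stub in Python, assuming a `PasswordService` class with a `change` method. The parameter names and order are guesses, since the method signature is one of the open questions listed earlier.

```python
class PasswordService:
    """Hypothetical API surface for the Change Password example."""

    def change(self, user_id: str, old_password: str, new_password: str) -> bool:
        # Stub only: the signature exists so that a test can call it and fail.
        raise NotImplementedError("Not implemented")
```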

Min Length Test

Tests Must Fail Properly

[Image: failure message from the PasswordService Change minimum-length test]

Do the Simplest Thing
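One reading of "the simplest thing" for the first test case (`<n chars → false`), sketched in Python with the same assumed `PasswordService.change` signature. With only that test in place, nothing yet requires a length check.

```python
class PasswordService:
    def change(self, user_id: str, old_password: str, new_password: str) -> bool:
        # Simplest implementation that passes the only test so far.
        # No test demands a True result yet, so no real logic is written.
        return False
```

The remaining items on the test list (old password mismatch, all rules satisfied) are what force real behavior into the method.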

Refactor-Now Signals

Duplication is growing
- Similar logic appears in multiple places.
Responsibilities are blurred
- Single Responsibility Principle (SRP) is violated.
Naming no longer matches intent
- Tests pass, but names and structure obscure the behavior.
Change impact is widening
- Behavior updates require touching many files or call sites.
Tests are brittle or hard to write
- Setup is heavy, seams are unclear, or assertions over-specify implementation details.
Invariants are scattered
- Critical rules are enforced in several locations instead of one clear boundary.

Next Test Decision

Next behavior to add
- Reject the password change when the supplied current password does not match the stored password.
New pressure introduced
- The test needs current-credential context that likely comes from storage.
Constraint to preserve
- Keep scope small and feedback fast while protecting design boundaries.

Option 1: Compare in method
- Repository returns credential data; matching rule stays local.
Option 2: Compare via in-process service
- A domain service inside this app owns behavior.
Option 3: Delegate to external auth
- Use an external authority and consume the result.

Same Discipline, New Pairing

What stays the same
- Red-Green-Refactor remains the control loop.
- Tests still define intent and verify correctness.
- Human judgment still owns architecture, boundaries, and acceptance criteria.
What changes with AI pairing
- Idea/prototype throughput increases.
- Prompting and decomposition quality become bottlenecks.
- Verification rigor must increase because plausible output is cheap.

AI as a Pairing Partner

Mutual learning relationship
- You learn from their approaches and methods.
- They learn and remember your patterns & preferences.
Dialogue and refinement
- Have conversations with AIs about approaches.
- You guide direction; they generate options.
Key difference from junior developers
- No ego or defensiveness; they adapt to your feedback.
- Consistent availability and tireless iteration.
- Stop anytime and pick up where you left off.

TDD with AI Coding Agents

Red: compact context and discuss scope with AI.
- Architecture constraints, done criteria, and test strategy.
- Externalize assumptions before the first failing test.
Green: generate code and supporting docs.
- AI proposes implementation options.
- Human forces limitation to "Simplest thing...".
- Keep documentation aligned with the implementation.
Refactor: clean up structure and improve.
- Better naming, shape, and reuse.
- Push toward a DSL whenever reasonable.
- Use mutation testing to validate the test suite.

Red Step: Establish Constraints

Setup
- Compact context if needed.
- Discuss scope and strategy.
Define inputs and outputs.
- Data shape, required/optional fields.
- Validations and state changes.
- Emitted events and visible behavior.
Define constraints and done criteria.
- Feature Invariants.
- Allowed external calls and dependencies.
- Idempotency and ordering.
- Errors and retries that are in-scope.
- What is explicitly out-of-scope.
Add failing test(s) that enforces scope.
- More than 1-2 tests to start is a scope smell.
- Prompt pattern: "Strict TDD mode; no behavior yet."

Green Step: Implement Minimal Pass

Implement the minimum behavior to make tests pass.
- Lock scope to Red-step constraints.
Propose implementation options.
- Only what is required to make the tests pass.
Determine tools/frameworks only when required.
- Refactor toward preferred patterns later.
- Do not build object graphs yet.
Keep documentation aligned with implementation.
- Update or create docs to reflect behavior and constraints.
Run tests and confirm the suite is green.
- Prompt pattern: "Strict TDD mode; satisfy tests only."

Refactor Step: Prove and Improve

Refactor for clarity without changing behavior.
- Improve names, decomposition, and reuse.
- Create object graphs and add frameworks as needed.
- Refactor both tests and production code if appropriate.
  - Be sure tests are stable before touching production code.
Strengthen confidence in the suite.
- Run full tests after each meaningful refactor batch.
- Run coverage and inspect uncovered paths.
- Use mutation testing to expose weak assertions.
Squash here if appropriate to keep history clean.
Prompt pattern: "Strict TDD mode; refactor only; preserve behavior."

More & Better Refactorings

Use builders to set up data more maintainably
- For both tests and production code
Extract methods to improve readability
- Make liberal use of Extension methods or equivalents
Create tests that are as declarative as possible
- The test's intent should be directly readable in the test
- Readable tests matter more than clean prod code
- Clean Code -- But for Tests!
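A small test-data builder, sketched in Python under assumed names (`ChangeRequest`, `ChangeRequestBuilder`). The builder starts from a valid request, so each test states only the one detail it cares about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    user_id: str
    old_password: str
    new_password: str


class ChangeRequestBuilder:
    """Hypothetical builder: valid defaults, overridden one field at a time."""

    def __init__(self) -> None:
        self._user_id = "user-1"
        self._old_password = "current-password"
        self._new_password = "a-valid-new-password"

    def with_old_password(self, value: str) -> "ChangeRequestBuilder":
        self._old_password = value
        return self

    def with_new_password(self, value: str) -> "ChangeRequestBuilder":
        self._new_password = value
        return self

    def build(self) -> ChangeRequest:
        return ChangeRequest(self._user_id, self._old_password, self._new_password)


# Declarative use in a test: only the relevant detail is visible.
too_short_request = ChangeRequestBuilder().with_new_password("short").build()
```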

Validate Using Coverage

Did we test what we built?
- Verify expected and edge behaviors have explicit assertions.
Missing coverage is a critical signal
- Uncovered paths may indicate code is not acting as expected.

Did we build what we tested?
- Confirm implementation still matches test intent after refactor.
Missing coverage is a critical signal
- Uncovered paths may mean we coded ahead of the test.

Coverage Decision Gate

Question every gap
- Missing test?
- Poorly behaving code?
- Intentional scope boundary?
Decision before proceeding
- If alignment is unclear, pause and strengthen tests before moving on.
Proceed rule
- Move forward when confidence is behavior-based, not percentage-based.

Validate Tests with Mutations

Mutate code intentionally
- Mimic realistic defects
Run existing tests against each mutant
Killed mutants - Tests fail as expected
- Validates tests detect incorrect behavior
Surviving mutants - Tests still pass
- Expose weak assertions or missing cases
AI agents automate the loop
- Mutations can be run early and often
- Used to be too difficult for regular use
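The killed/surviving distinction can be sketched by hand in Python. Everything here is illustrative: `add_fee`, the two hand-written mutants, and the two "suites" are assumptions. Real mutation tools generate the mutants automatically.

```python
def add_fee(total: int, fee: int) -> int:
    return total + fee


def mutant_subtract(total: int, fee: int) -> int:
    return total - fee  # arithmetic-operator mutant


def mutant_ignore_fee(total: int, fee: int) -> int:
    return total  # mutant that drops the fee entirely


MUTANTS = [mutant_subtract, mutant_ignore_fee]


def weak_suite(fn) -> bool:
    # Only checks a zero fee, so it cannot tell + from -.
    return fn(100, 0) == 100


def strong_suite(fn) -> bool:
    return fn(100, 0) == 100 and fn(100, 5) == 105


def surviving(suite, mutants) -> list:
    # A mutant survives when the suite still passes with it in place.
    return [m.__name__ for m in mutants if suite(m)]
```

Here both mutants survive `weak_suite` and none survive `strong_suite`, which is the signal that the extra assertion was worth adding.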

Mutation Testing in the AI Era

Run mutation tests early and often
- Validate that tests catch real defects, not just happy paths.
Treat surviving mutants as feedback
- Strengthen assertions, add missing cases, or tighten boundaries.
AI changes the economics
- What used to be costly and slow is now practical inside normal loops.

[Image: flowchart of mutation testing within the TDD loop]

Some Types of Mutations

Boundary conditions
- Original: if (count < max)
- Mutated: if (count <= max)
Arithmetic operators
- Original: total + fee
- Mutated: total - fee
Null / optional paths
- Original: user?.email handling
- Mutated: force null branch behavior
Return / exception behavior
- Original: return Success
- Mutated: throw ValidationException

Mutation Test Prompts

Select candidates + run tests
- "Knowing this feature and tests, create code mutations likely to expose gaps in the tests, and run the tests with them in place."
Improve tests
- "For surviving mutants, implement stronger tests and assertions. Do not modify production code."
Iterate
- Repeat until survivors are equivalent or out of scope. When complete, remove the mutations.

Optional Amplifiers, Core Workflow

Refactoring prompts and mutation testing
- Increase confidence, speed, and signal quality.
Capability amplifiers
- These are optional power-ups, not entry requirements.
Use when risk is high
- Apply them when complexity or cost of change increases.
Skip when loop is already crisp
- If feedback is clear and fast, keep the cycle lean.
Baseline stays constant
- Red-Green-Refactor remains the control system.
Next step
- Define the default workflow teams can run consistently first.

Human-AI Adoption Patterns

Start narrow, then iterate
- Strong adopters begin with constrained use and expand as confidence grows.
Speed gains do not remove human responsibility
- AI helps with implementation throughput.
- Humans still own decomposition, design judgment, and verification.
Calibration beats novelty
- Teams improve by iterating on outputs.
TDD-aligned takeaway
- Begin with the simplest safe loop, then refactor.

Research: Barke et al. (2023); Nair et al. (2023); Chen et al. (2021).

Types of Tools Available

By host
- In the browser
- In the terminal
- In the IDE
By purpose
- Code generation
- Specification generators
- Workflow orchestrators

Code Generation Tools

What they do
- Generate code, tests, and refactor suggestions.
How they fit TDD
- Used inside each red-green-refactor step.
- Verify with test suite before accepting output.
Why they help
- Reduce time spent on syntax and boilerplate.
- Humans focus on correctness and risk decisions.
Practical guardrails
- Small prompts, short diffs, and frequent test runs
- Review generated code as if from a new teammate.

Specification Generation Tools

What they produce
- Structured requirements
- Constraints
- Acceptance criteria
How they fit TDD
- First, create a specification
- Then derive tests from it
- Finally, implement only what is needed.
Why they help
- Reduce churn
- Catch requirement gaps
- Keep test intent aligned with behavior

Workflow Tools

What they do
- Coordinate multi-step tasks.
- Bridge tools, repositories, and prompts.
- Include state, retries, and handoffs.
How they fit TDD
- Automate repetitive workflow segments.
- Preserve human checkpoints.
Where to add guardrails
- Require plan approval
- Enforce test pass gates
- Cap diff size
- Require explicit review before merge
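The four guardrails can be expressed as a single gate, sketched here in Python. `ChangeSet`, `blocked_reasons`, and the 200-line cap are assumptions for illustration and not the behavior of any particular workflow tool.

```python
from dataclasses import dataclass
from typing import List

MAX_DIFF_LINES = 200  # assumed cap; tune per team


@dataclass(frozen=True)
class ChangeSet:
    plan_approved: bool
    tests_passed: bool
    diff_lines: int
    human_reviewed: bool


def blocked_reasons(change: ChangeSet, max_diff_lines: int = MAX_DIFF_LINES) -> List[str]:
    """Return every guardrail the change violates; an empty list means it may merge."""
    reasons = []
    if not change.plan_approved:
        reasons.append("plan has not been approved")
    if not change.tests_passed:
        reasons.append("tests are not passing")
    if change.diff_lines > max_diff_lines:
        reasons.append(f"diff exceeds {max_diff_lines} lines")
    if not change.human_reviewed:
        reasons.append("no explicit human review")
    return reasons
```

Returning all violated guardrails, not just the first, gives the human checkpoint the full picture in one pass.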

Tool Matrix by Purpose

Tool	Code Gen	Spec Gen	Workflow	Reference
Claude	Primary	Supported	Supported	Claude Code Docs
Cline/Roo	Primary	Supported	Supported	Cline Docs & Roo Docs
Copilot Chat	Primary	Supported	Supported	Copilot Chat Docs
Copilot CLI	Primary	Supported	Supported	Copilot CLI Docs
Cursor	Primary	Supported	Supported	Cursor Docs
Kiro	Supported	Primary	Supported	Kiro Docs
OpenClaw	Supported	Supported	Primary	OpenClaw Repos
OpenCode	Primary	Supported	Supported	OpenCode Repos
OpenSpec	Supported	Primary	Supported	OpenSpec Repos
SpecKit	Supported	Primary	Supported	SpecKit Repo
Squad/Ralph	Supported	Supported	Primary	Squad Repos

AI Coach

What AI Engineer Coach is
- VS Code extension to analyze coding session logs.
- Turns usage details into coaching insights.
- Helps improve how your team uses AI coding tools.
What it provides
- Anti-pattern detection
- Skill discovery
- Lots more now and in the future
Reference
- AI Engineer Coach

Who Succeeds in the AI Era

Define the problem and constraints clearly
- Give the model bounded, testable targets.
Recognize good architecture and tool choices
- Match patterns and tools to the scenario.
Verify output with tests and quality gates
- Treat model output as a draft, not a guarantee.
Evolve workflows and tools as confidence changes
- Start simple, then increase capability deliberately.
Delegate effectively to people and agents
- Orchestrate grunt-work.
- Keep accountability human.
In short: think like a solution architect
- Own outcomes, tradeoffs, and verification.

Start Where You Are

If you already have Copilot in VS Code
- Start there and keep your loop tight.
If you already have Claude Code or OpenCode
- Start there and apply the same TDD gates.
If you are new to these tools
- Start with Copilot or Gemini in the browser.
- Learn the prompts and review habits before adding complexity.

Pick a Project You Know

Pick a project you already understand
- Real constraints make the exercise more valuable.
- Familiar domains keep the focus on workflow.
Greenfield or Brownfield
- Create that new project you've been thinking about.
- Or add features to an existing codebase.
Start with a small, testable slice
- Keep scope tight so the loop stays observable.
- Work in a new branch for safety.
If you prefer a safer sandbox
- Use one of the packaged options on the next slide.

Experimentation Options

Greenfield: Scheduled post publisher
- Build a new, small system end-to-end with tests from the start.
- Define the domain and constraints yourself.
Brownfield: Meeting scheduling system
- Refactor a tightly-coupled file-based system.
- Leverage the architecture diagram and starter code.
Choose by comfort level
- Favor the option where you can focus on workflow discipline, not domain discovery.

References