The testing playbook for AI-written code

The RED → GREEN loop, six test types that earn their keep, and copy-paste prompts to find what to test first.

You built a working product with AI doing most of the typing. It runs, people can use it, and somewhere in the back of your mind sits a list of things you suspect would break if you looked too hard. You know tests would help. The part nobody handed you is which tests, in what order, and how to know when you've done enough.

This playbook is that missing part. I built signalbee.trade mostly with Claude Code, and these are the habits that kept it shippable after the first burst of speed wore off. Your AI assistant writes all of the actual test code — your job is to aim it. Aiming it well takes one loop, six test types, and a handful of prompts you can copy straight from this page.

Drop the 100% coverage goal

Coverage measures which lines of your code ran while the tests executed. That number says nothing about whether the tests would catch a bug — catching bugs is the job of assertions, the specific claims a test makes about what the code must do. A suite can touch every line and assert almost nothing, and an AI assistant asked to "increase coverage" will happily generate exactly that suite.
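
To make that gap concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: `buggy_discount` has a planted bug (it adds the discount instead of subtracting it), and the two suites stand in for what an assistant might generate. Both suites execute every line of the function, so a coverage tool would score them the same, but only the one that makes a claim about the result can fail.

```python
from typing import Callable

DiscountFn = Callable[[int, int], int]


def buggy_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function with a planted bug: it adds the discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents + price_cents * percent // 100  # bug: should subtract


def hollow_suite(discount: DiscountFn) -> None:
    """Runs both branches of the function and claims nothing about the result."""
    discount(1000, 10)
    try:
        discount(1000, 150)
    except ValueError:
        pass


def asserting_suite(discount: DiscountFn) -> None:
    """Runs the same lines, but states what the result must be."""
    assert discount(1000, 10) == 900
    try:
        discount(1000, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("out-of-range percent was accepted")


def passes(suite: Callable[[DiscountFn], None], discount: DiscountFn) -> bool:
    """Report whether a suite goes green against an implementation."""
    try:
        suite(discount)
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("hollow suite passes:", passes(hollow_suite, buggy_discount))
    print("asserting suite passes:", passes(asserting_suite, buggy_discount))
```

The hollow suite goes green against the broken function; the asserting suite does not. Same coverage, opposite verdicts.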

Aim at consequences instead. The useful question: which parts of this app would cost me something if they broke silently? For most products the list is short — the path where money moves, sign-up and login, anything that writes or deletes user data, and whatever your product's core promise is. On signalbee.trade the list starts at the webhook pipeline that relays trading signals, because that pipeline is the product's whole promise. Those paths get tests first. The settings page can wait.

Every prompt in this playbook is built on that question.

The loop: RED → GREEN

The highest-leverage habit in this playbook: the test gets written before the code, and you watch it fail before you let the AI make it pass. The discipline has a name — test-driven development, TDD — and you only need its first two moves.

RED. Describe the behavior you want and have your AI write a test for it, leaving the implementation untouched. Run the test and watch it fail. Read the failure: it should fail because the feature doesn't exist yet, and for no other reason.

GREEN. Ask for the smallest change that makes that test pass. Run the suite, watch it go green, then move to the next behavior and repeat.

Why watching the RED matters: a test you have never seen fail is unverified. It might pass because the feature works — or because it asserts nothing, checks the wrong thing, or accidentally exercises a mock instead of your code. The failure is your evidence that the test is wired to reality.

The order matters doubly with AI. An assistant that writes the code and its tests in the same step fits the tests to whatever the code already does, bugs included. The suite goes green on the first run and certifies the wrong behavior. Written first, the test can only encode your description of the feature, because there is no implementation yet for it to copy.

The three prompts, used in order:

We're adding {feature}. Follow strict RED → GREEN TDD.
Step 1: write the failing test that captures the new behavior.
Don't touch the implementation yet.

Run the test. I want to see it fail. Read the failure output and
confirm it fails because the implementation doesn't exist yet,
not because the test itself is broken.

GREEN phase: write the smallest implementation that makes this
test pass. Nothing more. We'll add edge cases as new
RED → GREEN cycles.

One behavior per cycle. When the AI offers to batch five test cases and the implementation in one go, decline — the small increment is the point. Each green locks in one verified piece of behavior before the next one starts.
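
One cycle of the loop, compressed into a single Python file as a rough sketch. The `slugify` feature and every name here are assumptions made up for the example; in a real project the stub and the implementation would be the same function at two points in time, and your test runner would do the job of `run`. The point it illustrates is the RED check: the first failure should say the feature is missing, and nothing else.

```python
from typing import Callable

Slugify = Callable[[str], str]


def check_slugify(slugify: Slugify) -> None:
    """Written first: one behavior, one exact expectation."""
    assert slugify("Hello World") == "hello-world"


def slugify_not_written(title: str) -> str:
    """RED stage: the feature does not exist yet."""
    raise NotImplementedError("slugify is not implemented")


def slugify(title: str) -> str:
    """GREEN stage: the smallest change that satisfies the test, nothing more."""
    return title.lower().replace(" ", "-")


def run(check: Callable[[Slugify], None], implementation: Slugify) -> str:
    """Run the test and report the outcome together with the reason for it."""
    try:
        check(implementation)
    except NotImplementedError as exc:
        return f"RED, feature missing: {exc}"
    except AssertionError:
        return "RED, wrong behavior"
    return "GREEN"


if __name__ == "__main__":
    print(run(check_slugify, slugify_not_written))
    print(run(check_slugify, slugify))
```

Punctuation, repeated spaces, and non-ASCII titles are deliberately left unhandled. Each of those would be its own RED → GREEN cycle.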

Six test types that earn their keep

The testing literature names dozens of types. You can run this whole playbook with six. For each one: what it is in plain terms, why it pays off on AI-written code, and a prompt that points your AI at the best candidates in your codebase.

1. Unit tests — one function, known inputs, exact outputs

A unit test calls a single function with inputs you chose and asserts exactly what comes back. Unit tests run in milliseconds, so you can keep hundreds and run them on every change.

AI-written functions tend to read clean and mishandle specifics — rounding, timezone edges, off-by-one boundaries, currency math. A unit test pins the exact expected output, which is a higher bar than "looks right."

Scan this codebase and list every function that contains real logic —
calculations, parsing, branching, anything touching dates or money —
and currently has no test. Rank them by how costly a silently wrong
output would be. For the top five, propose unit tests with the exact
inputs and expected outputs you'd assert.
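
For a sense of what a proposal from that prompt might look like, here is a small sketch in Python. The function `split_cents` is hypothetical, chosen because currency math is where "looks right" and "is right" tend to part ways. The tests are plain functions with `assert`, the style a runner such as pytest would collect; nothing beyond the standard library is needed to run them.

```python
def split_cents(total_cents: int, ways: int) -> list[int]:
    """Split an amount into shares that differ by at most one cent."""
    if ways <= 0:
        raise ValueError("ways must be positive")
    base, remainder = divmod(total_cents, ways)
    # The first `remainder` shares carry the leftover cents.
    return [base + 1 if i < remainder else base for i in range(ways)]


def test_uneven_split_keeps_every_cent() -> None:
    shares = split_cents(1000, 3)
    assert shares == [334, 333, 333]
    assert sum(shares) == 1000


def test_even_split_has_equal_shares() -> None:
    assert split_cents(900, 3) == [300, 300, 300]


def test_zero_ways_is_rejected() -> None:
    try:
        split_cents(1000, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("splitting zero ways was accepted")
```

The first test is the one that matters: a version that rounds each share independently would return three shares of 333 and lose a cent without raising anything.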

2. Integration tests — the seams between parts

An integration test exercises two or more pieces wired together: an API route against a real database, your code against the payment library, the frontend against the API.

Generated code fails at the seams more than anywhere else, because each piece was produced to look right on its own. The route expects userId, the frontend sends user_id, and both pass their unit tests. The seam carries a spec you never wrote down — an integration test is where that spec finally gets written.

Map the seams in this codebase: every place two parts hand data to
each other — frontend to API, API to database, app to any third-party
service. List the seams that have no test on them, rank them by blast
radius, and propose one integration test for each of the top three.
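
A sketch of the userId seam described above, in Python. Both functions are hypothetical stand-ins: `build_signup_payload` plays the frontend, `handle_signup` plays the API route, and an in-memory SQLite database from the standard library plays the real one. The test passes data through all three instead of checking each in isolation.

```python
import sqlite3


def build_signup_payload(user_id: int, email: str) -> dict:
    """Stand-in for the frontend: the shape it actually sends."""
    return {"userId": user_id, "email": email}


def handle_signup(payload: dict, conn: sqlite3.Connection) -> dict:
    """Stand-in for the API route: reads the payload and writes the row."""
    conn.execute(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        (payload["userId"], payload["email"]),
    )
    conn.commit()
    return {"status": "created"}


def test_signup_payload_reaches_the_database() -> None:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

    # The seam under test: frontend output fed directly into the route.
    response = handle_signup(build_signup_payload(42, "ada@example.com"), conn)

    assert response == {"status": "created"}
    assert conn.execute("SELECT id, email FROM users").fetchone() == (42, "ada@example.com")
    conn.close()
```

If the route were changed to read `payload["user_id"]`, a unit test that hand-builds its own payload could keep passing, while this test would fail with a `KeyError` at the seam.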

3. End-to-end tests — one real user journey

An end-to-end test drives the running app the way a user would: open the browser, click, type, submit, and assert the journey completed. End-to-end tests are slow and occasionally flaky, so you keep only a handful — on the journeys that pay you.

This is the one type that catches the product failing while every individual piece passes. Sign-up, login, checkout: when one of those journeys breaks at 2am, an end-to-end test is the thing that notices.

List the user journeys in this app that touch money, sign-up or login,
or anything that writes or deletes user data. Pick the journey whose
silent failure would cost the most, and write one end-to-end test that
walks through it the way a real user would.

4. Sad-path tests — what happens when things go wrong

A sad-path test asserts behavior under failure: invalid input, an empty result, a network call that times out, a payment that gets declined, a user clicking submit twice.

Asked for a feature, an AI delivers the version where everything cooperates. Most production incidents live outside that version. Sad-path tests put the uncooperative cases in front of you during a test run, ahead of a user finding them in production.

Take {file or feature} and list everything that can go wrong around
it: bad input, missing data, a dependency that's down, duplicate
submissions, a user acting out of order. Tell me which of those cases
the current code handles and which it doesn't, and write tests for
the three most likely.
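
Three of those cases as a minimal Python sketch: a dependency that times out, a double submit, and invalid input. `fetch_price` and `Checkout` are hypothetical, written only to give the tests something to push against; your own code will fail in its own ways.

```python
from typing import Callable, Optional


def fetch_price(symbol: str, client: Callable[[str], float]) -> Optional[float]:
    """Return the latest price, or None when the upstream call times out."""
    try:
        return client(symbol)
    except TimeoutError:
        return None


class Checkout:
    """Hypothetical checkout that must charge each order at most once."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}

    def submit(self, order_id: str, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if order_id in self.charges:
            return "duplicate"  # second click on submit: no second charge
        self.charges[order_id] = amount_cents
        return "charged"


def test_timeout_returns_none_instead_of_crashing() -> None:
    def slow_client(symbol: str) -> float:
        raise TimeoutError("upstream took too long")

    assert fetch_price("BTC", slow_client) is None


def test_double_submit_charges_once() -> None:
    checkout = Checkout()
    assert checkout.submit("order-1", 2500) == "charged"
    assert checkout.submit("order-1", 2500) == "duplicate"
    assert checkout.charges == {"order-1": 2500}


def test_non_positive_amount_is_rejected() -> None:
    checkout = Checkout()
    try:
        checkout.submit("order-2", 0)
    except ValueError:
        pass
    else:
        raise AssertionError("a zero amount was accepted")
    assert checkout.charges == {}
```

Each test names the failure it provokes, so a red run tells you which uncooperative case stopped being handled.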

5. Regression tests — every bug, pinned forever

A regression test reproduces a specific bug you actually hit, and stays in the suite after the fix so the bug can never return unnoticed.

The habit that makes this work: when you find a bug, get the test before the fix. The failing test is your RED — a deterministic reproduction of the wrong behavior, which also happens to be the most precise bug report you can hand an AI. The fix turns it green, and the bug is pinned: six months from now, when an unrelated AI edit re-introduces it, the suite says so within seconds. Over time the suite accumulates your product's real failure history, each entry permanently caught.

Here's a bug: {paste the error or describe the wrong behavior}.
Before fixing anything, write a test that reproduces it — it must
fail right now, for the same reason the app misbehaves. Show me the
failing run. Then make it pass without breaking any other test.
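
What a pinned bug can look like once the cycle is finished, sketched in Python. The bug is invented for the example: a hypothetical `page_count` that used plain floor division and dropped the final partial page. The test is named after the bug, so a red run six months from now reads like the original report.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to show every item.

    Before the fix this was `total_items // page_size`, which dropped the
    final partial page: 21 items at 10 per page reported 2 pages.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-total_items // page_size)  # ceiling division on integers


def test_bug_last_partial_page_was_missing() -> None:
    """Regression: the 21st item must land on a third page."""
    assert page_count(21, 10) == 3


def test_neighbouring_cases_still_hold() -> None:
    """The fix must not disturb the cases that already worked."""
    assert page_count(20, 10) == 2
    assert page_count(0, 10) == 0
```

Against the pre-fix floor division, the first test fails with 2 instead of 3 while the second passes, which is the "fails for the same reason the app misbehaves" check from the prompt.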

6. Characterization tests — pin the code you're afraid to touch

A characterization test records what the code currently does: call it with realistic inputs and assert whatever it returns today, bugs and all.

This is the entry move for an existing codebase with no tests. Before changing a module you don't fully understand — and most of an AI-generated codebase qualifies — wrap it in characterization tests. From then on, any change that alters its behavior turns a test red, and you decide whether that change was intended. The fear of touching old code becomes a mechanical check.

I need to change {file} but I don't fully understand everything it
does. Before we touch anything, write characterization tests that pin
its current behavior: call it with realistic inputs and assert exactly
what it returns today. Don't fix or improve anything yet, even if you
spot bugs — flag those in comments instead.
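
A sketch of the result, in Python. `legacy_fee` is a hypothetical stand-in for a function nobody fully understands. The expected values in the test were not derived from a spec; they are simply what the function returns today, and the suspicious ones are flagged in comments instead of being fixed.

```python
def legacy_fee(amount_cents: int, tier: str) -> int:
    """Hypothetical legacy function, left exactly as found."""
    if tier == "pro":
        fee = amount_cents * 2 // 100
    else:
        fee = amount_cents * 3 // 100
    if fee < 50:
        fee = 50
    return fee


def test_pins_current_behavior_of_legacy_fee() -> None:
    # Ordinary cases: recorded from the current output, not from a spec.
    assert legacy_fee(10_000, "pro") == 200
    assert legacy_fee(10_000, "free") == 300
    assert legacy_fee(100, "pro") == 50

    # Suspected bug, flagged and left alone: a zero amount still pays the minimum fee.
    assert legacy_fee(0, "free") == 50

    # Suspected bug, flagged and left alone: tier matching is case-sensitive,
    # so "PRO" silently falls through to the 3% rate.
    assert legacy_fee(10_000, "PRO") == 300
```

When a later change turns one of these lines red, the comment tells you whether you just broke something or finally fixed it.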

Where to start on a codebase with no tests

Five moves, in order. The first is a prompt; the rest reuse the test types and the RED → GREEN loop above.

Move 1 — run the risk scan. Have the AI build your map:

Scan this codebase and list the ten places where a silent bug would
cost the most — money paths, sign-up/login, data writes and deletes,
and the product's core promise. For each one, tell me whether any
existing test covers it, and which test type would catch a failure
there: unit, integration, end-to-end, sad-path, regression, or
characterization.

Move 2 — one end-to-end test on the most important journey from that list. One test, and the path your product depends on is watched.

Move 3 — characterization tests around the scariest module the scan surfaced, before you next ask the AI to change it.

Move 4 — start the regression habit. From today, no bug gets fixed until a failing test reproduces it.

Move 5 — RED → GREEN for everything new. Every new feature starts at the loop at the top of this page.

The short version

Test the paths that cost you something when they break. Watch every test fail once before you trust it. Let the AI write all of the test code, and keep the two decisions that matter for yourself: what to test, and in which order.

The risk scan in Move 1 takes your AI about a minute. Start there, today.
