Your test suite is green. The question this piece answers: would it go red if the code broke?
For a surprising share of AI-generated tests, the answer is no. They run the code, the checkmark appears, and nothing in them would fail if the behavior changed — the assertions are too weak, aimed at a mock, or missing entirely. These are ghost tests: they look like protection and provide none, which is worse than having no test at all, because no test at all wouldn't have earned your trust.
AI-assisted workflows breed ghost tests in two specific ways. The first: tests generated in the same step as the code get fitted to whatever the code already does, bugs included — the testing playbook covers why writing the test first prevents this. The second is sneakier: when an AI is asked to "make the suite green," weakening an assertion is the shortest path, and assistants take it. A failing test gets "fixed" into one that can't fail, and the suite gets greener and weaker at the same time. The review checklist flags ghost tests as check 5; this piece is how you hunt them systematically.
Coverage can't see this
Coverage measures which lines ran while the tests executed. A line can run inside a test that asserts nothing about it, so coverage counts ghost tests as protection — a suite can sit at 100% coverage while most of its assertions constrain nothing. If you've read the first playbook, this is the same reason it told you to drop the 100% coverage goal.
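You can watch coverage get fooled with a few lines of Python. The sketch below is a crude stand-in for a real coverage tool, not how any particular tool is implemented: it uses the standard library's `sys.settrace` to record which lines of a function ran during a test. `shipping_fee` and `ghost_test` are hypothetical names invented for the example.

```python
import dis
import sys


def shipping_fee(weight_kg):
    fee = 5
    if weight_kg > 10:
        fee = fee + 2 * (weight_kg - 10)
    return fee


def ghost_test():
    shipping_fee(12)  # executes every line of shipping_fee, asserts nothing


def line_coverage(test, target):
    """Fraction of `target`'s lines that execute while `test` runs."""
    code = target.__code__
    executable = {line for _, line in dis.findlinestarts(code)
                  if line is not None}
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None          # skip line tracing for other functions
        hit.add(frame.f_lineno)  # records the call event and each line event
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test()
    finally:
        sys.settrace(previous)   # always restore whatever tracer was active
    return len(executable & hit) / len(executable)


print(line_coverage(ghost_test, shipping_fee))
```

On CPython this should report full line coverage for `shipping_fee`, even though `ghost_test` would keep passing if the surcharge arithmetic were changed to anything at all.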
The number that catches ghost tests is the mutation score, and you can estimate it this afternoon with nothing but prompts.
Mutation testing, in plain terms
A mutation test makes one small deliberate break in your code — flips a > to >=, swaps a + for a -, deletes a guard clause — and runs your suite. Each break is called a mutant.
If any test fails, the mutant is killed: some assertion constrained that behavior. If every test passes, the mutant survived: your suite just certified broken code as correct. The mutation score is the share of mutants killed, and the survivors are a map of exactly where your tests claim protection they don't provide.
That's the whole idea. The dedicated tools automate it at scale; the mechanic is simple enough to run by hand with your AI assistant, which is where to start.
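The kill-or-survive mechanic fits in a toy script. This is a sketch under loud assumptions: `can_withdraw`, the three-assertion suite, and the four hand-picked textual mutants are all hypothetical, and real tools mutate the syntax tree and run your actual test runner instead of calling `exec` on a string.

```python
SOURCE = '''
def can_withdraw(balance, amount):
    if amount <= 0:
        return False
    return amount <= balance
'''

# Each mutant is one small textual break: (text to find, replacement).
MUTANTS = [
    ("amount <= 0", "amount < 0"),               # boundary shift on the guard
    ("amount <= balance", "amount < balance"),   # boundary shift on the limit
    ("amount <= balance", "amount >= balance"),  # flipped comparison
    ("return False", "return True"),             # constant return swapped
]


def load(source):
    """Compile the source text and hand back the function under test."""
    namespace = {}
    exec(source, namespace)
    return namespace["can_withdraw"]


def suite(can_withdraw):
    """The hypothetical test suite: note it never probes a boundary."""
    assert can_withdraw(100, 50) is True
    assert can_withdraw(100, 150) is False
    assert can_withdraw(100, -5) is False


def is_killed(source):
    """A mutant is killed if any assertion in the suite fails against it."""
    try:
        suite(load(source))
    except AssertionError:
        return True
    return False


assert not is_killed(SOURCE)  # the suite must be green on unmodified code

results = [(old, new, is_killed(SOURCE.replace(old, new, 1)))
           for old, new in MUTANTS]
survivors = [(old, new) for old, new, killed in results if not killed]
score = sum(killed for _, _, killed in results) / len(results)

print(f"mutation score: {score:.0%}")
for old, new in survivors:
    print(f"survived: {old!r} -> {new!r}")
```

Two of the four mutants survive, both of them boundary shifts, because the suite never tests `amount == 0` or `amount == balance`. Those two survivors are exactly the missing assertions.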
The poor man's mutation pass
Pick one file on a path that costs you something when it breaks — money, auth, data writes (the risk scan from the first playbook gives you this list). Then run:
> We're running a manual mutation test on {file}. Apply the following
> mutations ONE AT A TIME, running the full test suite after each and
> reverting the mutation before applying the next: flip a comparison
> operator, swap a + for a -, negate an if condition, replace a return
> value with a constant, delete a line that updates state, off-by-one a
> boundary. Make 8–10 mutations total, spread across the file. For
> each, record whether any test failed. Don't fix anything. At the end:
> show me the list (mutation, location, killed or survived), and run
> git diff to prove the file is back to its original state.
The mutation menu in the prompt matters — it keeps the breaks mechanical, so the AI can't quietly pick mutations it knows the tests will catch.
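If you'd rather script the apply, run, revert loop than trust the assistant's bookkeeping, a minimal sketch looks like this. The helper name `run_mutation` is made up for the example, and the file name and test command in the usage comment are placeholders for your own. It assumes your test command exits non-zero on failure, which is the usual convention.

```python
import subprocess
from pathlib import Path


def run_mutation(path, old, new, test_cmd, cwd=None):
    """Apply one textual mutation to `path`, run `test_cmd`, restore the file.

    Returns "killed" if the test command exits non-zero, "survived" otherwise.
    """
    path = Path(path)
    original = path.read_text()
    if original.count(old) != 1:
        # Refuse ambiguous edits so each mutant is exactly one break.
        raise ValueError(f"expected exactly one occurrence of {old!r} in {path}")
    path.write_text(original.replace(old, new))
    try:
        result = subprocess.run(test_cmd, cwd=cwd, capture_output=True)
    finally:
        path.write_text(original)  # revert even if the test run itself blows up
    return "killed" if result.returncode != 0 else "survived"


# Hypothetical usage; substitute your own file, mutation, and test command:
# run_mutation("billing.py", "total > limit", "total >= limit",
#              ["python", "-m", "pytest", "-q"])
```

The `try`/`finally` is the part that matters: the file goes back to its original contents before the next mutation, which is the same guarantee the prompt asks `git diff` to prove.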
Ten mutations on one important file tell you more about your suite than any coverage report. Killed across the board means the tests on this file are real. Survivors mean you've found the gaps, so triage them:
> For each surviving mutation, tell me which of these it is:
> (a) missing assertion — a test exercises this code but never checks
> this behavior; (b) ghost test — the covering test asserts nothing
> meaningful or tests a mock instead of the code; (c) untested code —
> no test reaches it; (d) equivalent mutant — the change doesn't
> actually alter behavior. For every (a), (b), and (c): write the test
> that would have caught it, then verify the kill — re-apply the
> mutation, watch the new test fail, revert, watch it pass.
That verify-the-kill step is the watched-fail standard from the first playbook doing its job in reverse: you've now seen the new test fail against broken code, which is the only proof it protects anything.
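Verify-the-kill reduces to a two-sided check: the new test must pass on the original code and fail on the mutant. A small sketch, with a hypothetical `is_leap_year` function and a hypothetical surviving mutation standing in for your own:

```python
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def mutant_is_leap_year(year):
    # The surviving mutation, re-applied: the `% 400` exception is deleted.
    return year % 4 == 0 and year % 100 != 0


def new_test(is_leap):
    assert is_leap(2000) is True  # a century year divisible by 400


def weak_test(is_leap):
    assert is_leap(2024) is True  # true on the mutant too, so it proves nothing


def verify_kill(test, original, mutant):
    """True only if `test` passes on the original and fails on the mutant."""
    def passes(impl):
        try:
            test(impl)
        except AssertionError:
            return False
        return True
    return passes(original) and not passes(mutant)


print("new_test kills the mutant:",
      verify_kill(new_test, is_leap_year, mutant_is_leap_year))
print("weak_test kills the mutant:",
      verify_kill(weak_test, is_leap_year, mutant_is_leap_year))
```

`new_test` earns its place because it fails against the broken version. `weak_test` is green on both, which makes it a ghost no matter how reasonable it looks.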
The real tooling
Once the manual pass has paid for itself, the dedicated tools run thousands of mutants instead of ten:
| Ecosystem | Tool |
|---|---|
| JavaScript / TypeScript | StrykerJS |
| Python | mutmut |
| Go | gremlins |
| Rust | cargo-mutants |
| Java / JVM | PIT (pitest) |
| C# / .NET | Stryker.NET |
| PHP | Infection |
| Ruby | mutant |
Setup is a job for your assistant:
> Set up {tool} for this project. Scope the first run to {the directory
> holding the consequence paths}, not the whole repo. Run it, then
> summarize: the mutation score, survivors grouped by file, and the
> five survivors sitting on the highest-consequence paths. Report only
> — no test changes yet.
Two things to know before the first run. Mutation testing is slow — every mutant means another test-suite run, so scope it to the directories that matter and run it before releases, not on every commit. And don't chase a 100% mutation score: some survivors are equivalent mutants (the change didn't alter behavior, so no test could catch it), and some sit in code too trivial to defend. The goal is zero survivors on the paths that cost you something. Tolerate the noise elsewhere — the same judgment the first playbook applied to coverage.
What this buys you
Every survivor you kill is an assertion that now constrains the behavior of your product, which means the next time an AI edit breaks that behavior, on purpose or by accident, the suite goes red instead of certifying the damage. A weakened assertion gets caught the same way: re-run the pass and the mutant it used to kill now survives.
The short version: coverage counts the lines your tests ran; the mutation score counts the breaks your tests caught. Run the manual pass on one money-path file today — it's ten mutations, and you'll know within the hour whether the green checkmarks you've been trusting are real.