If you've been running the RED → GREEN loop from the testing playbook, you've felt where the time actually goes. The test logic takes a minute. Building the input takes ten — fake users, fake payloads, fake API responses — and then comes the expected output, hand-typed field by field. Most test files are mostly setup.
This piece is about deleting that work. The pattern: save a real input to a file, save the approved output to a file, and let the test diff against it. Two terms, defined once. A fixture is an input your tests load from a file instead of constructing in code. A golden file is an expected output stored as a file that the test compares results against — some tools call these snapshots; same idea.
Less typing turns out to be the smaller win. The bigger one: tests built on data your app actually handled, with assertions that check the whole output instead of the three fields somebody felt like typing.
Fabricated test data carries a blind spot
When your AI assistant invents the test input, the data comes from the same place the code came from — so it cooperates with the code by construction. Every optional field present, every number a number, every date well-formed. The cases that take production down — the field that arrives as null, the amount that arrives as a string, the name with characters your validation never met — live in real traffic, and invented data filters them out. The playbook covered how tests generated alongside code get fitted to it; fabricated inputs are the same fitting problem, applied to the data.
Hand-typed expected outputs fail on the other side. Nobody types out a forty-field response, so the assertion checks three fields and leaves the other thirty-seven unwatched — ghost tests by under-assertion. On signalbee.trade the path that matters most is the webhook pipeline that relays trading signals, and a real exchange payload is exactly the kind of input I would never fabricate correctly by hand.
So aim past "less boilerplate": inputs your app actually handled, outputs checked completely. Files give you both.
The loop: capture, approve, diff
The whole mechanic in four steps:
1. Capture a real input into a file. That's the fixture.
2. Run the code on it once and write the output to a file.
3. Read the output file, line by line. If it's right, commit it: your reading is the approval, and the file is now golden.
4. Diff from then on: the test re-runs the code on the fixture and compares against the golden. Any mismatch fails the test and shows you exactly what changed.
Step 3 carries the same weight as watching the RED in the playbook. A golden you never read certifies whatever the code happened to do that day, bugs included. The reading is where the verification lives; everything after it is automated.
From then on, a failed diff is a question addressed to you, with two legitimate answers: regression — fix the code; or intended change — read the new output, approve it again. The third answer that tools offer gets its own section below.
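The loop is small enough to sketch in a few lines of Python. This is one possible shape, not the playbook's tooling: `check_golden` is a hypothetical helper and the `.candidate` suffix is an assumption. The design point is that the helper never writes the golden itself. On a first run or a mismatch it leaves a candidate file beside the golden, and approval stays a human act: read the candidate, then rename it.

```python
from pathlib import Path


def check_golden(actual: str, golden_path: Path) -> str:
    """Compare output against an approved golden file.

    Returns "match", "mismatch", or "needs-approval". The golden is never
    written here: approving means a human renames the candidate after reading it.
    """
    candidate_path = golden_path.with_name(golden_path.name + ".candidate")

    if not golden_path.exists():
        # First run: save the output for reading, not as the golden.
        candidate_path.parent.mkdir(parents=True, exist_ok=True)
        candidate_path.write_text(actual, encoding="utf-8")
        return "needs-approval"

    # Every later run: compare against the approved file.
    if golden_path.read_text(encoding="utf-8") == actual:
        return "match"

    # Keep the new output next to the golden so the change can be inspected.
    candidate_path.write_text(actual, encoding="utf-8")
    return "mismatch"
```

In a test, the assertion would be `assert check_golden(output, path) == "match"`, so both a missing golden and a changed output fail until someone has read the candidate.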
The six moves
1. Capture real inputs
Pull actual data into tests/fixtures/ — webhook bodies, API responses, user-submitted forms, rows from production-shaped tables. A temporary capture switch in the code works; so does extracting from logs or a staging run. Scrub secrets and personal data before anything lands in the repo: fixtures get committed, and a captured payload can carry tokens and emails.
This pays off double on AI-written code: assistant-invented data cooperates with assistant-written code, while captured payloads carry the edge cases nobody would think to invent. And one captured file feeds every test you write afterwards.
Look at {function or route} and the data it receives in production.
Add a capture switch I can turn on temporarily — an env flag that
writes each incoming payload to tests/fixtures/{name}/ as
pretty-printed JSON, scrubbed before writing: tokens, keys, emails,
names, anything that identifies a person. Show me the scrub list for
approval before writing any code. If our logs already hold these
payloads, say so — extracting from logs beats adding capture code.
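For a sense of what that prompt might produce, here is a minimal sketch of a scrub-and-capture helper. Everything named in it is an assumption: `SCRUB_KEYS` stands in for the scrub list you approved, and `scrub` and `capture` are hypothetical function names. Note that it redacts by key only and leaves every other value untouched, so the null fields and string-typed amounts that make captured data valuable survive.

```python
import json
from pathlib import Path
from typing import Any

# Assumption: keys to redact. The real list is whatever you approved.
SCRUB_KEYS = {"token", "api_key", "authorization", "email", "name"}


def scrub(value: Any) -> Any:
    """Return a copy with sensitive fields replaced, at any nesting depth."""
    if isinstance(value, dict):
        return {
            key: "<SCRUBBED>" if key.lower() in SCRUB_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def capture(payload: dict, fixture_dir: Path, name: str) -> Path:
    """Write a scrubbed payload as pretty-printed JSON and return its path."""
    fixture_dir.mkdir(parents=True, exist_ok=True)
    path = fixture_dir / f"{name}.json"
    path.write_text(json.dumps(scrub(payload), indent=2) + "\n", encoding="utf-8")
    return path
```

The call site would sit behind the temporary switch, for example a check on an environment variable such as `CAPTURE_FIXTURES` (a made-up name). Key-based scrubbing misses secrets embedded in free-text values, so read each captured file before it is committed.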
2. Golden-file the outputs you'd never fully assert
For any output that's big or structured — a rendered email, a generated report or invoice, a parsed document, a full API response — skip the hand-typed assertions. Run the code on a fixture once, write the result to a file, read it line by line, commit it. The test diffs against it from then on.
That first read is mandatory, and it's the same move as the playbook's characterization test: the file records what the code does, and your reading is what upgrades "what it does" to "what it should do." The payoff on AI-written code: every field is asserted at once. An AI under-asserts big outputs, or fits the assertions to the code; a golden file has no opinion to fit — it disagrees with any change and shows you the change.
Take {function} and the fixture at {path}. Set up a golden-file test:
run the function on the fixture, write the result to
tests/goldens/{name} (pretty-printed, stable key order), and make the
test fail with a readable diff whenever output and golden disagree.
Then stop and print the golden — I need to read it line by line
before it gets committed.
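Two details in that prompt do most of the work: stable key order and a readable diff. A rough sketch of both using only the Python standard library, assuming the output is JSON-serializable (`render` and `golden_diff` are hypothetical names):

```python
import difflib
import json
from typing import Any


def render(result: Any) -> str:
    """Serialize so that equal data always yields identical text."""
    return json.dumps(result, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def golden_diff(golden_text: str, actual_text: str) -> str:
    """Return a unified diff, or an empty string when the texts agree."""
    lines = difflib.unified_diff(
        golden_text.splitlines(keepends=True),
        actual_text.splitlines(keepends=True),
        fromfile="golden",
        tofile="actual",
    )
    return "".join(lines)
```

In a test this becomes `diff = golden_diff(golden_text, render(result))` followed by `assert not diff, diff`, so the failure message is the diff itself. Sorting keys means a change in dictionary ordering alone never fails the test; if key order is part of the behavior you care about, drop `sort_keys`.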
3. Read the diff before you update anything
When a golden test fails, your whole job is the diff: for each hunk, decide regression or intended change. The failure mode here has a name — blind re-approval: regenerating the goldens wholesale so the suite goes green. Most tools ship the command (--update-snapshots, -u), and one run of it converts every golden test into a ghost test: still executing, now certifying whatever the code happens to do.
Ask an assistant to "make the suite green" and that flag is the shortest path: the weakened assertion from the ghost-tests piece, industrialized. Component-snapshot suites are where the reading discipline dies fastest: fifty snapshots fail on a styling change, and the habit of updating without reading sets in within a week. The rule that prevents it: the update command belongs to you. The assistant reports the diff; you read it and run the update yourself, per file.
The golden test for {name} is failing. Show me the diff between the
golden and the new output, hunk by hunk. For each hunk, name the code
change that caused it and classify it: intended consequence of what
we just changed, or regression. Don't update the golden and don't
touch the code — I'll decide hunk by hunk.
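If the diff arrives as one long unified diff, splitting it into hunks makes the regression-or-intended decision easier to take one piece at a time. A small sketch, assuming standard unified-diff text as input (`split_hunks` is a hypothetical helper, not part of any snapshot tool):

```python
def split_hunks(unified_diff: str) -> list[str]:
    """Split a unified diff into hunks, one string per '@@' block."""
    hunks: list[list[str]] = []
    for line in unified_diff.splitlines(keepends=True):
        if line.startswith("@@"):
            # A new hunk header starts a new block.
            hunks.append([line])
        elif hunks:
            hunks[-1].append(line)
        # Lines before the first '@@' are the ---/+++ file header; skip them.
    return ["".join(hunk) for hunk in hunks]
```

Content lines in a unified diff always begin with a space, `+`, or `-`, so a line that starts with `@@` can only be a hunk header.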
4. Normalize what changes every run
Timestamps, generated IDs, hashes, unordered collections, floating-point jitter: any of these in the output and your golden fails with the code untouched. The fix is a normalizer the test applies before comparing — volatile values replaced with stable placeholders like <TIMESTAMP> and <ID>, unordered collections sorted.
AI-written code reaches for the current time and random IDs freely, so expect the first golden you cut to need this. Take it seriously: a golden that cries wolf trains you into blind re-approval within days, and then the whole pattern is dead weight.
Run {function} on the fixture at {path} three times and diff the
outputs against each other. List every value that differs across runs
— timestamps, generated IDs, hashes, ordering, floats — then write a
normalizer the golden-file test applies before comparing: volatile
values replaced with placeholders like <TIMESTAMP> and <ID>,
unordered collections sorted. Rerun until two consecutive runs
produce byte-identical output, and show me that clean diff as proof.
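A normalizer of that kind can be quite small. The sketch below assumes JSON-like output, ISO-8601 timestamps, and UUID-style IDs; the patterns, the `UNORDERED_KEYS` set, and the rounding precision are all illustrative choices, and hashes or other ID formats would need their own patterns.

```python
import re
from typing import Any

# Assumed formats: ISO-8601 timestamps and UUIDs. Extend for your own output.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

# Assumption: keys whose lists carry no meaningful order.
UNORDERED_KEYS = {"tags"}


def normalize(value: Any, key: str = "") -> Any:
    """Replace volatile values with placeholders and sort unordered lists."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        items = [normalize(item) for item in value]
        # Sort only the lists declared unordered; order elsewhere is behavior.
        return sorted(items, key=repr) if key in UNORDERED_KEYS else items
    if isinstance(value, str):
        value = TIMESTAMP.sub("<TIMESTAMP>", value)
        return UUID.sub("<ID>", value)
    if isinstance(value, float):
        # Round away floating-point jitter; 6 places is an arbitrary choice.
        return round(value, 6)
    return value
```

Sorting every list would hide real ordering regressions, which is why the sketch sorts only the keys you name. The same caution applies to placeholders: each one is a field the golden no longer checks, so keep the list of patterns as short as the run-to-run differences require.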
5. Golden-master the module you're afraid to change
This is the playbook's characterization test, industrialized. Take the module you don't fully understand — most of an AI-generated codebase qualifies — and run every fixture you have through it, writing one golden per fixture. Skim the goldens, flag surprises, commit the lot. Now any change to the module that alters behavior goes red, with the exact input that triggered it attached.
Do this before letting an assistant refactor or restructure anything load-bearing: it's the cheapest behavioral net there is — the wide net first, the refactor second. One caveat carries over from characterization testing: goldens record today's behavior, bugs included. When the skim turns up something wrong, flag it; the fix is its own RED → GREEN cycle after the net is in place.
We're about to change {module} and I don't trust that we know
everything it does. Build a golden-master harness: run every fixture
in {fixtures dir} through {entry point} and write one golden per
fixture, applying our normalizer. Run the harness twice to prove the
goldens are stable. List anything in the outputs that looks wrong —
flag it, don't fix it; the goldens record what the code does today,
and fixes come after the net is in place.
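The harness that prompt describes is mostly a loop. Here is a minimal sketch, assuming JSON fixtures and a JSON-serializable result; `build_goldens` and `unstable_goldens` are hypothetical names, and `entry_point` is any callable you pass in, ideally one that already applies whatever normalizer you use.

```python
import json
from pathlib import Path
from typing import Any, Callable


def build_goldens(
    fixtures_dir: Path,
    goldens_dir: Path,
    entry_point: Callable[[Any], Any],
) -> list[Path]:
    """Run every JSON fixture through entry_point; write one golden each."""
    goldens_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fixture in sorted(fixtures_dir.glob("*.json")):
        payload = json.loads(fixture.read_text(encoding="utf-8"))
        result = entry_point(payload)
        golden = goldens_dir / fixture.name
        golden.write_text(
            json.dumps(result, indent=2, sort_keys=True) + "\n", encoding="utf-8"
        )
        written.append(golden)
    return written


def unstable_goldens(
    fixtures_dir: Path,
    goldens_dir: Path,
    entry_point: Callable[[Any], Any],
) -> list[str]:
    """Build twice and return the names of goldens that differ between runs."""
    first = {
        p.name: p.read_text(encoding="utf-8")
        for p in build_goldens(fixtures_dir, goldens_dir, entry_point)
    }
    second = {
        p.name: p.read_text(encoding="utf-8")
        for p in build_goldens(fixtures_dir, goldens_dir, entry_point)
    }
    return [name for name in first if first[name] != second[name]]
```

An empty list from `unstable_goldens` is the "run it twice" proof. Any name it returns points at output that still needs normalizing before the goldens are worth committing.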
6. One fixture, both sides of the seam
The playbook's integration section described the classic seam failure: the API sends user_id, the frontend expects userId, and both pass their own tests. Two hand-written mocks can drift like that forever and stay green. The fix is one captured response file shared across the seam: the API test asserts the server still produces that shape, the frontend test feeds the same file into the component. The fixture becomes the written contract — whichever side breaks the shape goes red against the same file.
Find every test that mocks {the API or service} with a hand-written
response object. For each one, report whether a captured fixture
exists for that endpoint, and where the mock's shape disagrees with
the fixture — or with what the server actually returns. Report only;
we'll swap the worst one over to the shared fixture as its own
change.
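To make the shared-fixture idea concrete, here is a toy version with both sides of the seam in one Python file. Every name and value in it is invented for illustration: `serialize_user` stands in for the server, `parse_user` for the consumer, and `SHARED_FIXTURE` for the captured response that a real suite would load from a single file under tests/fixtures/.

```python
import json
from dataclasses import dataclass

# Stand-in for the one captured response file both tests would load.
# The contents are invented for this sketch.
SHARED_FIXTURE = json.loads('{"user_id": 42, "display_name": "<SCRUBBED>"}')


def serialize_user(row: dict) -> dict:
    """Hypothetical server-side serializer."""
    return {"user_id": row["id"], "display_name": row["display_name"]}


@dataclass
class UserView:
    user_id: int
    display_name: str


def parse_user(response: dict) -> UserView:
    """Hypothetical consumer-side parser; raises KeyError on a missing field."""
    return UserView(response["user_id"], response["display_name"])


def test_server_still_produces_fixture_shape():
    # Server side: the keys produced today must match the captured response.
    produced = serialize_user({"id": 42, "display_name": "<SCRUBBED>"})
    assert produced.keys() == SHARED_FIXTURE.keys()


def test_consumer_accepts_fixture():
    # Consumer side: the same captured response must parse cleanly.
    assert parse_user(SHARED_FIXTURE).user_id == 42
```

If the server renames `user_id` to `userId`, the first test goes red; if the consumer starts expecting `userId`, the second one does. Neither side can drift while both stay green, because there is only one file to disagree with.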
Where to start
Three moves now, two gates for later.
Move 1: golden the assertion you've been avoiding. Pick the one output you've never fully asserted because it's too big to type. Use whatever setup its test already has; upgrading the input to a captured fixture is the job of Move 2 below. Read the golden before you commit it.
Move 2: capture fixtures on the path that costs the most. The risk scan from the testing playbook names that path. Run the capture prompt from move 1 of the six above, scrub list and all.
Move 3: claim the update flag. Put it in your project instructions file today: the assistant never updates goldens or snapshots; it reports the diff, and the human runs the update per file.
The gates: golden-master before the next refactor of a module you can't fully explain, and a shared fixture the next time a seam bug bites.
Move 1 takes about fifteen minutes, and most of that is reading the output — which is the point. Pick the output and run it now.