Before real users — or real money — touch an AI-built product, it deserves one deliberate review pass. This checklist is that pass: the checks I run on AI-written codebases, my own included.
You don't need a security firm or a four-figure audit to catch the findings that matter most. The expensive-sounding problems in AI-written codebases are usually cheap to find — they cluster in the same eight places, and your AI assistant can dig through all eight in about an hour. What it needs from you is the list of questions. That's this page.
Two rules before you start. First, every prompt below asks the AI to report only. A review that fixes as it goes loses the map of what's actually wrong, and half-applied fixes are how reviews break working apps — collect the findings first, fix in severity order after. Second, when you do fix, run each fix through the RED → GREEN loop from the testing playbook: failing test first, then the change.
The eight checks
Run them in order. The first three are the break-in checks; a finding there outranks everything below it.
1. Secrets in the repo
What I'm looking for: API keys, tokens, passwords, and database URLs sitting in the code, in config files, in anything the browser downloads — or in the git history, where a key committed once and deleted later is still fully readable.
AI assistants hardcode keys to make examples run, and "I'll move it to an env var later" survives surprisingly many commits. The fix for a leaked key is rotation — removing it from the code leaves it alive in history and in every clone.
Scan this repo for secrets: API keys, tokens, passwords, private
keys, database URLs — in the code, in config files, in anything
client-side the browser downloads, and in the git history (committed
then deleted still counts). Report each finding with the file and how
exposed it is. Don't fix anything — if a real key has ever been
committed, the fix starts with rotating it.
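To make the shape of a finding concrete, here is a minimal sketch of a pattern scan in Python. The four patterns and the `scan_text` helper are my own illustrative assumptions, nowhere near a complete rule set; a dedicated scanner covers far more formats.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a full rule set).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "url_with_password": re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@/]+@"),
    "hardcoded_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks like a secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

The same function works on history if you feed it the output of `git log -p --all`, which is where the committed-then-deleted keys show up.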
2. Who can touch what
What I'm looking for: the line, on every route that reads or writes user-owned data, where the app checks that this specific user is allowed to touch this specific record.
AI scaffolds login flows well. The per-record ownership rules live in your head, and the AI never saw them — so the classic finding in AI-built apps is a logged-in user who can read anyone's data by changing an ID in the URL. Logged-in is a different question from allowed.
List every API route or server action that reads or writes user-owned
data. For each one, show me the exact line where it checks that the
requesting user may touch that specific record — ownership or role,
beyond merely being logged in. Flag every route where that check is
missing, and tell me what someone could reach by changing an ID in
the request.
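The line this check hunts for tends to look like the one marked below. This is a minimal sketch with a hypothetical `Invoice` record and an in-memory dictionary standing in for the database; your framework and names will differ.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a logged-in user asks for a record they may not touch."""


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    total_cents: int


# Hypothetical in-memory store standing in for the database.
INVOICES = {
    1: Invoice(1, owner_id=10, total_cents=5000),
    2: Invoice(2, owner_id=20, total_cents=900),
}


def get_invoice(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES.get(invoice_id)
    # The ownership check: being authenticated is not enough.
    if invoice is None or invoice.owner_id != current_user_id:
        raise Forbidden(f"invoice {invoice_id} is not available to this user")
    return invoice
```

Treating "missing" and "not yours" the same way is a design choice in this sketch: it avoids telling a stranger which IDs exist.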
3. Where user input ends up
What I'm looking for: every path from a request to a sensitive destination — a database query, a file path, a shell command, rendered HTML — and whether the input is parameterized and escaped along the way, or pasted in as a string.
AI-written code mostly reaches for an ORM and gets this right, and then one raw query slips in for the tricky report page. One is enough.
Trace user input through this codebase. List every place where data
from a request reaches a database query, a file path, a shell
command, or rendered HTML. For each, tell me whether the input is
parameterized/escaped or concatenated in, and flag the ones an
attacker could abuse. Report only — no fixes yet.
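The difference between concatenated and parameterized is easiest to see side by side. The sketch below uses Python's built-in `sqlite3` with a made-up `users` table; the two `find_user_*` functions are hypothetical, and the same contrast applies to any driver that supports placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)


def find_user_unsafe(conn, email):
    # Concatenated: the input becomes part of the SQL text itself.
    query = f"SELECT id, email FROM users WHERE email = '{email}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, email):
    # Parameterized: the driver passes the input as data, never as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # the WHERE clause is now always true
matched = find_user_safe(conn, payload)    # no user has this literal email
```

With the payload above, the unsafe version returns every row in the table and the safe version returns none.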
4. The sad paths
What I'm looking for: the steps in your critical flows that assume everything cooperates — network calls with no failure handling, missing timeouts, nothing guarding an empty result or a double-clicked submit button.
Asked for a feature, an AI delivers the version where everything works. The review question is where that assumption is load-bearing. (Writing the tests that pin these cases down is covered in the testing playbook — this check finds the spots.)
Walk the critical flows in this app — payment, sign-up, anything that
writes data — and list every step that assumes success: network calls
without failure handling, missing timeouts, no handling for empty
results or duplicate submissions. Rank the list by what a real user
would hit first.
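Two of those sad paths, a failed network call and a double-clicked submit button, can be sketched together. The `OrderService` class, the idempotency key, and the injected `charge` callable are all hypothetical names for illustration, and the sketch is single-process and not thread-safe.

```python
class PaymentGatewayError(Exception):
    """Stands in for whatever your payment client raises on failure or timeout."""


class OrderService:
    def __init__(self, charge):
        self._charge = charge   # callable that talks to the gateway and may raise
        self._results = {}      # idempotency key -> result of the first success

    def submit(self, idempotency_key: str, amount_cents: int) -> dict:
        # Duplicate submission: return the earlier result instead of charging twice.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        try:
            receipt = self._charge(amount_cents)
        except PaymentGatewayError as exc:
            # Failure is reported, not cached, so the user can retry.
            return {"status": "failed", "reason": str(exc)}
        result = {"status": "paid", "receipt": receipt}
        self._results[idempotency_key] = result
        return result
```

In a real app the key would be generated once per form render and the results kept in the database. The gateway call itself should also carry an explicit timeout, since many HTTP clients wait indefinitely by default.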
5. Whether the tests are real
What I'm looking for: tests that would actually fail if the code broke. A suite full of green checkmarks can be exercising code while asserting nothing — ghost tests, common in AI-generated suites because the tests were fitted to the code they were generated alongside.
The standard from the testing playbook applies to inherited tests too: a test nobody ever saw fail is unverified.
Audit the test suite. For each test file, tell me what behavior it
actually pins down — which specific assertion would fail if the code
broke. Flag ghost tests: tests that run code but assert nothing
meaningful, test the mock instead of the code, or restate the
implementation. Then list the five most consequential paths in the
app with no real test on them.
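For reference, here is what a ghost test looks like next to a real one. The `apply_discount` function is a hypothetical example, and the tests are plain functions so they run under pytest or on their own.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


def test_discount_ghost():
    # Ghost test: runs the code and asserts nothing.
    # It stays green even if the arithmetic is completely wrong.
    apply_discount(1000, 10)


def test_discount_real():
    # Real test: pins a specific value, so a broken formula turns it red.
    assert apply_discount(1000, 10) == 900


def test_discount_rejects_bad_percent():
    # Real test for the sad path: the error itself is the expected behavior.
    try:
        apply_discount(1000, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

A quick way to verify a suspicious test by hand: break the function on purpose and see whether the test notices.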
6. What the database actually enforces
What I'm looking for: rules that exist only in application code — uniqueness, required fields, relationships — with nothing backing them in the schema, plus multi-step writes that aren't wrapped in a transaction.
AI puts validation in the request handler and leaves the schema wide open, which holds until the first concurrent request, background job, or manual script goes around the handler. Anything involving money or counters needs the transaction question asked explicitly.
Compare what the application code assumes about the data with what
the database actually enforces. List every rule that exists only in
app code — uniqueness, required fields, relationships, allowed values
— with no matching constraint in the schema. Then list every
multi-step write (especially anything touching money or counters)
that isn't wrapped in a transaction.
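Here is a minimal sketch of both halves, constraints in the schema and a multi-step write inside a transaction, using Python's built-in `sqlite3`. The `accounts` table and the `transfer` function are assumptions for illustration; the syntax for constraints varies a little between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id            INTEGER PRIMARY KEY,
        email         TEXT    NOT NULL UNIQUE,
        balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO accounts (email, balance_cents) VALUES (?, ?)",
    [("a@example.com", 1000), ("b@example.com", 0)],
)
conn.commit()


def transfer(conn, from_id: int, to_id: int, amount_cents: int) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the credit below can never survive without the matching debit.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, to_id),
        )
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? WHERE id = ?",
            (amount_cents, from_id),
        )


def balances(conn) -> list[int]:
    rows = conn.execute("SELECT balance_cents FROM accounts ORDER BY id")
    return [row[0] for row in rows]
```

If the debit would push a balance below zero, the CHECK constraint raises `sqlite3.IntegrityError` and the already-applied credit is rolled back with it. The same error stops a duplicate email, no matter which handler, job, or script attempted the insert.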
7. The dependency pile
What I'm looking for: packages that are unused, duplicated, abandoned, or carrying known vulnerabilities — and whether a lockfile is committed so installs are reproducible.
AI assistants add packages freely and never remove them. Most of the pile is harmless weight; the dangerous edge is a package that's abandoned or vulnerable, or occasionally one that never existed until someone published malware under a name AIs tend to invent.
Review the dependencies. List packages that are: unused, duplicated
(two libraries doing the same job), abandoned (no release in 2+
years), or carrying known vulnerabilities — run the ecosystem's audit
tool and summarize the output. Confirm a lockfile is committed. Flag
anything that looks like a near-miss of a popular package name.
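The near-miss part of that prompt can be approximated with the standard library's `difflib`. In this sketch the `POPULAR` list is a tiny hypothetical stand-in for a real list of widely used packages, and the 0.8 cutoff is an arbitrary assumption you would tune.

```python
import difflib

# Hypothetical shortlist; in practice use a real list of your ecosystem's
# most-downloaded packages.
POPULAR = ["requests", "numpy", "pandas", "django", "flask", "cryptography"]


def near_misses(dependencies, popular=POPULAR, cutoff=0.8):
    """Map each dependency that closely resembles a popular name to that name."""
    flagged = {}
    for name in dependencies:
        if name in popular:
            continue  # an exact match is the real package, not a lookalike
        close = difflib.get_close_matches(name, popular, n=1, cutoff=cutoff)
        if close:
            flagged[name] = close[0]
    return flagged
```

A flagged name is a prompt to go look at the package, not proof of malware; plenty of legitimate packages have similar names.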
8. The recovery story
What I'm looking for: answers to two questions. Where is the newest database backup, and has a restore ever actually been run? And if the next deploy is broken, what are the exact steps back to the previous version?
This is the one check about the day something goes wrong anyway. An untested backup is a hope, and a rollback procedure that exists only in someone's memory fails at exactly the moment it's needed.
Tell me the recovery story for this app. Where do database backups
go, how old is the newest one, and has a restore ever been tested?
If the next deploy is broken, what are the exact steps to get back to
the previous version, and how long do they take? List what's missing
to make both answers boring.
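A restore drill can be small. The sketch below assumes a SQLite database with a hypothetical `orders` table, purely so it runs anywhere; for a hosted database, the equivalent drill is restoring the newest backup into a scratch instance and running the same kind of checks.

```python
import sqlite3


def restore_drill(source: sqlite3.Connection, backup_path: str) -> bool:
    """Back up `source` to a file, then open the copy fresh and verify it."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # copies the whole database into the target
    finally:
        target.close()

    # The drill itself: open the backup as if the original were gone.
    restored = sqlite3.connect(backup_path)
    try:
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        restored_rows = restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    finally:
        restored.close()

    source_rows = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return intact and restored_rows == source_rows
```

Comparing a row count is the crudest possible check. It still separates a backup that has been restored once from one that has never been opened.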
After the pass
You'll end with a findings list. Triage it in three buckets:
- Break-in findings (checks 1–3) — fix this week, starting with key rotation, which takes minutes.
- Behavior findings (checks 4–6) — turn each one into a failing test, then fix to green. This is the regression habit from the testing playbook, applied to a review instead of a bug report.
- Hygiene findings (checks 7–8) — schedule them. A tested restore and a written rollback procedure are an afternoon each.
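If the findings list gets long, the three buckets are easy to mechanize. This is a minimal sketch; the `Finding` record and the `triage` helper are hypothetical, and the mapping simply restates the buckets above.

```python
from dataclasses import dataclass

# Check number -> bucket, as laid out in the triage list.
BUCKETS = {
    1: "break-in", 2: "break-in", 3: "break-in",
    4: "behavior", 5: "behavior", 6: "behavior",
    7: "hygiene", 8: "hygiene",
}


@dataclass(frozen=True)
class Finding:
    check: int     # which of the eight checks produced it
    summary: str


def triage(findings):
    """Group findings into the three buckets, lowest check number first."""
    grouped = {"break-in": [], "behavior": [], "hygiene": []}
    for finding in sorted(findings, key=lambda f: f.check):
        grouped[BUCKETS[finding.check]].append(finding)
    return grouped
```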
The prompts are reusable: run the full pass again before any release that matters. Codebases drift, and AI-assisted ones drift fast.
Check 1 takes about five minutes. Run it now.