Quality Gates and Code Review
You get a notification. The agent just opened a PR — 280 lines across six files. The tests pass, the build is green, the diff looks clean. Do you skim it, click approve, and move on?
No. And the reason isn’t that you’re paranoid. It’s that you didn’t write this code, so you actually have to read it. That’s the deal we’ve made: agents handle volume, we handle judgment. Review is where the judgment lives.
Faros AI studied 10,000+ developers and found that high-AI-adoption teams merged nearly twice as many PRs, but spent 91% more time reviewing them. PR sizes grew 150%. Those numbers sound heavy. They’re also the natural consequence of something good — we’re producing more. The review load isn’t a sign that something’s broken. It’s a sign that we need a sharper process for handling it.
Some teams don’t have that process. They respond to the increased volume by reviewing faster, which really means reviewing less. PRs get a quick glance and an “LGTM.” Quality drifts. Bugs ship. Eventually someone asks why the codebase feels worse even though the team is shipping more. That’s not us. Our approach is to get better at review, not to skip it. The rest of this chapter shows how.
Review the plan, not just the code
The Spec-First Workflow introduced the plan-then-execute habit. Here’s why it gets its own section: catching a wrong approach at the plan stage saves you from reviewing hundreds of lines of confidently wrong code. This is the single highest-leverage thing you can do as a reviewer.
In interactive Claude Code sessions, ask for the plan before any code gets written:
> Before making any changes, show me your plan:
> 1. What files will you create or modify?
> 2. What approach will you take?
> 3. What assumptions are you making?
> 4. What tests will you write or update?

Read the plan. Compare it against the spec. Check:
- Does it modify only the files listed in the spec? If the agent wants to touch files you didn’t mention, that’s either a gap in your spec or scope creep. Decide which before letting it proceed.
- Does the approach match our architectural patterns? If you said “use the Repository pattern” and the plan has direct `DbContext` calls, fix it now. Not after 300 lines of code exist.
- Are the assumptions correct? The agent will make assumptions about how our code works. Some will be right. Some won’t. This is where those assumptions surface — before they’re baked into the implementation.
- Are the tests testing behaviour? If the plan says “add a test that calls the method and checks it doesn’t throw,” that’s a test that passes without proving anything. Push for tests that check what happens, not just that something happens.
For background agents (GitHub Actions, Copilot coding agent), configure the agent to post a plan comment on the PR before implementing. The task owner reviews and approves the plan before code generation starts.
Fixing a 5-line plan costs minutes. Fixing 300 lines of generated code that followed the wrong plan costs hours. The return on plan review is enormous — it’s the cheapest quality gate we have.
Tier your review by risk
Not every PR needs the same scrutiny. A documentation fix doesn’t need the same depth as a database migration. Tiering our review effort by risk keeps us focused where it matters — and just as importantly, it prevents review fatigue on changes that don’t need it. That’s Balance in practice: spending energy where it counts rather than burning out by treating every diff the same.
| Risk tier | PR type | Review approach | Depth |
|---|---|---|---|
| Low | Docs, test additions, dependency bumps, formatting, config tweaks | Single reviewer. | Scan |
| Medium | New features following existing patterns, bug fixes with clear scope, UI changes with visual before/after | Single reviewer. | Read |
| High | Security-sensitive code, API contract changes, new architectural patterns, infrastructure changes | Two reviewers (one must be senior). | Trace |
| Critical | Database migrations, authentication/authorisation, deployment config, payment/billing logic | Senior review + manual testing on a running environment. | Prove |
Each level builds on the ones before it:
- Scan — CI green, scope matches spec, no surprise files or dependencies. Move on.
- Read — You’ve read every changed line. Edge cases checked against the spec.
- Trace — You’ve followed at least one happy path and one error path end to end. Verified against ADRs. Side effects checked.
- Prove — You’ve run the code and manually verified the behaviour. You can demonstrate it works, not just argue that it should.
How to assign a tier: Look at what changed, not how much. A 20-line change to an auth middleware is High or Critical. A 200-line change that adds a new CRUD endpoint following an existing pattern is Medium. The risk is in what the code does, not how many lines it touches.
For agent-generated PRs specifically, bump the tier up one level when the change introduces a pattern that doesn’t already exist in the codebase. A new abstraction or a new way of doing something deserves extra eyes, even if the agent seems confident about it.
Let the machines go first
Before a human reviewer looks at anything, CI should have already caught the mechanical problems. This is our automated pre-filtering pipeline — the first quality gate.
The minimum CI checks for every agent PR:
- Build — `dotnet build` / `npm run build` / `yarn build`. Does it compile? This catches hallucinated methods, wrong types, and missing imports. For typed languages, this is remarkably effective — the compiler is our cheapest reviewer.
- Tests — `dotnet test` / `npm test`. Do all existing tests still pass? Did the agent add new tests? Do those new tests actually test something meaningful? (More on that last point below.)
- Linting — ESLint, StyleCop, Prettier, or whatever the project uses. Does the code follow formatting and style rules? This catches convention violations before a human has to mention them.
- Security scan — `dotnet list package --vulnerable`, `npm audit`, or a dedicated scanner. Any known CVEs in new or updated dependencies?
- AI-assisted first pass — tools like GitHub Copilot review can catch common issues in agent-generated code before a human looks at it. They’re not a replacement for human review, but they surface things like unused variables, missing error handling, and inconsistent naming that would otherwise eat your review time.
If any of these fail, the PR goes back before human review begins. Don’t spend your limited review time on code that doesn’t compile or doesn’t pass its own tests. The post-tool hooks from Hooks, Commands, and MCP Servers (like the TypeScript type-check hook) catch some of these during the code session itself. CI catches whatever slipped through.
The hooks and the CI pipeline work together: hooks catch problems during agent work (tight feedback loop, the agent can self-correct), and CI catches problems after (safety net for anything the hooks missed, or for background agents that ran without hooks).
What to actually look at during review
The CI is green. The plan was sound. Now you’re reading the diff. How deep you go depends on the risk tier, but the order is always the same. Start at step 1. Low-risk reviews stop after step 2. Higher tiers keep going.
1. Check scope first
Before reading any logic, answer one question: did the agent stay within scope?
- Open the file list. Are there files that weren’t in the spec?
- Check `package.json`, `.csproj`, or any dependency files. Did the agent add packages you didn’t ask for?
- Look at the commit history. Are there “helpful” refactors mixed in with the actual work?
If the answer to any of these is yes, stop. Don’t review the logic yet. Send it back with a note about what’s out of scope. Reviewing out-of-scope changes is wasted effort — those changes shouldn’t exist.
2. Read the tests before the implementation
This sounds backwards, but it’s the fastest way to understand what the code is supposed to do. The tests are the spec made executable. Reading them first tells you:
- What cases the agent thinks it needs to handle.
- What cases are missing (compare against the acceptance criteria in the spec).
- Whether the tests check behaviour or just check that code runs without crashing.
Good test (tests behaviour): “When I update a profile with a duplicate email, I get a validation error.”

Bad test (tests implementation): “When I call `updateProfile`, the save method is called once.”

The first tells you something useful about correctness. The second is fragile and tells you nothing — if someone refactors the internals, it breaks even when the code is still correct.
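Here’s roughly what that difference looks like as real tests. This is a minimal xUnit-and-Moq sketch, not code from our repos; `ProfileService`, `IProfileRepository`, `FakeProfileRepository`, and `Profile` are illustrative names:

```csharp
using System.Threading.Tasks;
using Moq;
using Xunit;

public class ProfileServiceTests
{
    // Behaviour: assert what happens when the spec's constraint is violated.
    [Fact]
    public async Task UpdateProfile_WithEmailBelongingToAnotherUser_ReturnsValidationError()
    {
        var repository = new FakeProfileRepository();                 // in-memory test double
        repository.Add(new Profile { Id = 1, Email = "taken@example.com" });
        repository.Add(new Profile { Id = 2, Email = "mine@example.com" });
        var service = new ProfileService(repository);

        var result = await service.UpdateProfileAsync(userId: 2, newEmail: "taken@example.com");

        Assert.False(result.Succeeded);                // the duplicate is rejected
        Assert.Contains("email", result.ErrorMessage); // and the error names the field
    }

    // Implementation: assert how the code happens to be wired today.
    // Refactor the internals and this breaks even though behaviour is unchanged.
    [Fact]
    public async Task UpdateProfile_CallsSaveOnce()
    {
        var repository = new Mock<IProfileRepository>();
        var service = new ProfileService(repository.Object);

        await service.UpdateProfileAsync(userId: 2, newEmail: "new@example.com");

        repository.Verify(r => r.SaveAsync(It.IsAny<Profile>()), Times.Once());
    }
}
```

The first test only fails when the behaviour is wrong. The second fails whenever the internals change, which trains people to update tests without reading them.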
Also watch for tests that the agent wrote to match its own code. This is circular — the test passes because it was designed to pass, not because the code is correct. It’s the testing equivalent of grading your own homework. You’ll catch this by comparing the test expectations against the spec’s acceptance criteria, not against the implementation.
3. Trace the logic
Now read the implementation. Don’t just skim the diff — follow the data through the code.
- Pick an input. What happens to it? Where does it go? What transformations does it undergo? Does the output match what the spec says should happen?
- Check error paths. What happens when things go wrong? Does the code handle it, or does it silently swallow exceptions? (There’s a sketch of what swallowing looks like just after this list.)
- Look for hardcoded values, magic numbers, or assumptions that should be configurable or documented.
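The error-path check deserves a picture, because swallowed exceptions are easy to skim past in a large diff. A minimal sketch, with `IEmailSender` and `Order` as hypothetical stand-ins:

```csharp
using System.Threading.Tasks;

public sealed class ReceiptService
{
    private readonly IEmailSender _emailSender;  // hypothetical dependency

    public ReceiptService(IEmailSender emailSender) => _emailSender = emailSender;

    public async Task<bool> SendReceiptAsync(Order order)
    {
        try
        {
            await _emailSender.SendAsync(order.CustomerEmail, $"Receipt for order {order.Id}");
            return true;
        }
        catch
        {
            // The smell: the failure is neither logged nor rethrown, and the method
            // still reports success. Every happy-path test passes; real delivery
            // failures vanish. A reviewed version logs the exception and returns
            // false (or rethrows) so the caller can react.
            return true;
        }
    }
}
```

When you hit a bare catch like this in a diff, the question isn’t whether it works today; it’s what you’ll wish had been logged when it doesn’t.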
If you’ve been coding for a while, this is where your experience pays off. You know what “code smell” feels like even when you can’t immediately articulate why. Trust that instinct. If something feels off, investigate.
If you’re newer, this is genuinely one of the best ways to build your skills. Every agent PR is a window into how problems get solved — sometimes well, sometimes badly. Both are educational. That’s Growth: not just shipping features, but developing judgment with every review you do.
4. Check for silent architectural drift
This is the subtlest problem with agent-generated code. Each individual change looks locally reasonable, but over time, small decisions accumulate into something that no longer respects our architectural boundaries. The agent didn’t mean to break anything. It just didn’t know about the boundary.
For every agent PR, ask:
- Does this change respect our layer boundaries? (e.g., does a frontend PR accidentally contain backend logic? Does a controller contain business logic that belongs in a service?)
- Are there new dependencies or abstractions that weren’t in the spec?
- Does this change conflict with any Architecture Decision Records (ADRs)?
- If you search the codebase for the patterns introduced here, are they consistent with how things are done elsewhere?
This is why the CLAUDE.md conventions from Context Files matter — they’re our defence against drift. And when you spot drift in review, don’t just fix the code: update the CLAUDE.md so the agent stops making the same mistake.
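To make that concrete, here’s a hedged sketch of the most common flavour of drift: a controller quietly absorbing a business rule that the layering says belongs in a service. All the type names are illustrative, not from our codebase:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// Drift: the bulk-discount rule lives in the controller. Locally it looks fine.
// It compiles, it's small, it even has a test. But every other pricing rule
// lives in the service layer, so this one will be missed next time pricing changes.
public class CheckoutController : ControllerBase
{
    private readonly IOrderRepository _orders;   // hypothetical repository
    public CheckoutController(IOrderRepository orders) => _orders = orders;

    [HttpPost("orders/{id}/checkout")]
    public async Task<IActionResult> Checkout(int id)
    {
        var order = await _orders.GetAsync(id);
        if (order.Items.Count >= 10)
            order.Total *= 0.9m;                 // business rule in the wrong layer
        await _orders.SaveAsync(order);
        return Ok(order);
    }
}

// What the layering asks for: the controller coordinates, the service decides.
public class ReviewedCheckoutController : ControllerBase
{
    private readonly ICheckoutService _checkout; // hypothetical service owning pricing rules
    public ReviewedCheckoutController(ICheckoutService checkout) => _checkout = checkout;

    [HttpPost("orders/{id}/checkout")]
    public async Task<IActionResult> Checkout(int id)
        => Ok(await _checkout.CheckoutAsync(id));
}
```

Both versions compile and both pass CI. Nothing automated flags the first; a reviewer who knows where pricing rules live does.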
5. Verify commit hygiene
Good commit history makes reviews easier — and it unlocks git bisect, which you’ll want in your toolkit.
Say something broke and you don’t know which change caused it. git bisect does a binary search through your commit history: it picks a commit halfway between “known good” and “known bad,” you tell it whether the bug exists there, and it narrows the range. Ten commits become five, then three, then one. In a few steps you’re staring at the exact commit that introduced the problem.
The catch: it only works if each commit is a single logical change. If one commit mixes three unrelated things, bisect can point you to it but you still don’t know which of those three things broke. Clean commits make bisect surgical.
- Each commit should represent one logical change with a clear message.
- If the agent produced one monolithic commit, ask it to break it up before you review the logic. Reviewing a 400-line single commit is miserable. Reviewing four 100-line commits with descriptive messages is manageable.
- Commit messages should say what changed and why, not just “implement feature” or “fix bug.”
The silent correctness problem
This deserves its own callout because it’s the failure mode that passing CI won’t catch. Risks and Hard Rules covers why it happens and how to defend upstream. This section is about what to look for when you’re already in the review.
Two examples of what it looks like in practice:
Say the agent just built a profile update endpoint for you. The build passes. The tests pass. The diff looks clean. You’re about to approve when you notice the spec says “validate that email is unique, excluding the current user” (so they don’t get rejected by their own existing email when saving without changing it). The test checks uniqueness — but only with a completely different email address. It never tests what happens when another user already has that email. The code might handle it. Or it might not. The test can’t tell you, because it was never actually exercising the constraint.
Or a different flavour: the spec says “users can book appointments up to 30 days in advance.” The agent writes `if (requestedDate < DateTime.Now.AddDays(30))`. Tests pass. But “up to 30 days” includes day 30, so the check should be `<=`, not `<` — a booking made exactly 30 days out gets rejected. Every test used dates well within the window, so nobody noticed. The logic is wrong by one day, and it’ll stay wrong until a user hits the boundary.
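The cheapest defence is a test that sits exactly on the boundary. Here’s a minimal xUnit sketch; `BookingService`, `FixedClock`, and `CanBook` are hypothetical stand-ins for whatever the real code exposes:

```csharp
using System;
using Xunit;

public class BookingWindowTests
{
    private static readonly DateTime Today = new DateTime(2025, 6, 1, 9, 0, 0);

    [Theory]
    [InlineData(29, true)]    // comfortably inside the window
    [InlineData(30, true)]    // the last day the spec allows; this case catches < vs <=
    [InlineData(31, false)]   // just outside the window
    public void Booking_respects_the_30_day_window(int daysAhead, bool expectedAllowed)
    {
        var service = new BookingService(new FixedClock(Today));  // hypothetical service and clock

        var allowed = service.CanBook(Today.AddDays(daysAhead));

        Assert.Equal(expectedAllowed, allowed);
    }
}
```

If the agent’s own tests only ever use day 5 and day 15, CI will never disagree with the wrong comparison operator. The day-30 and day-31 cases are what make the off-by-one visible.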
Both cases share the same shape: the code compiles, the tests pass, and the diff looks reasonable. But the logic is subtly wrong. Here’s what to look for:
- Compare the acceptance criteria in the spec against the test assertions. Is every criterion covered by at least one test? Are there criteria that the agent interpreted differently than you intended?
- Look for edge cases that the spec mentioned but the tests don’t cover. The email uniqueness example above is typical — the test technically exists, but it doesn’t test the actual constraint.
- For business logic, trace one happy path and one error path end to end. If you can’t explain what happens at each step, you don’t understand the code well enough to approve it.
- If you find yourself thinking “this looks fine” but can’t articulate why it’s correct, that’s a signal to slow down.
When to iterate vs. reject and restart
Sometimes the agent’s output is close enough to fix with comments. Sometimes it’s faster to throw it away and start over with a better spec. Knowing which is which saves a surprising amount of time.
Iterate (send feedback, let the agent fix it) when:
- The approach is correct but the implementation has bugs.
- Naming or formatting doesn’t match conventions.
- Tests are missing edge cases but the existing tests are sound.
- There’s small scope creep that can be trimmed.
- The agent made a reasonable choice that you’d just do differently — that’s a comment, not a rejection.
Reject and restart when:
- The approach is fundamentally wrong (wrong pattern, wrong architecture, wrong abstraction).
- The agent modified files it shouldn’t have, and the changes are interleaved with legitimate work (untangling them would take longer than regenerating).
- The PR is too large to review effectively. This is a signal that the original task needed further decomposition (The Spec-First Workflow).
- The agent hallucinated a dependency, API, or pattern that doesn’t exist.
- You’ve been going back and forth and the output isn’t converging.
That last point deserves a moment. You send the agent feedback. It comes back with a new version that fixes three things but breaks two others. You send it back again. The third version has regressed on something the first version got right. Sound familiar? That’s the “babysitting tax” — and it’s a signal that the spec or the decomposition is the problem, not the execution.
The two-round rule: If you’ve sent the agent feedback twice and the third version still isn’t right, stop iterating. Go back to the spec. Clarify what was ambiguous. Decompose further if the task was too broad. Then start a fresh session.
The review checklist
This is a quick-reference version of everything above. Use it until the habits are automatic. (You’ll find it again in Templates and Quick Reference for quick access.)
- Plan reviewed — Did you review the agent’s plan before it wrote code?
- Scope check — Does the diff touch only the files and packages listed in the spec?
- Tests read first — Do the tests match the acceptance criteria? Do they test behaviour, not implementation?
- Logic traced — Can you explain what happens to an input at each step?
- Edge cases covered — Are the spec’s edge cases represented in both the code and the tests?
- Error paths checked — What happens when things go wrong? Are errors handled, not swallowed?
- Drift check — Does the change respect architectural boundaries and existing patterns?
- Commit hygiene — Are commits small, logical, and well-described?
- CI green — Build, tests, lint, security scan all pass?
- You can explain every line — If you can’t, it’s not ready to merge.
That last one is our standard. Not “it looks fine.” Not “the tests pass.” Can you explain it? That’s Integrity — we don’t ship code we don’t understand. If yes, approve. If no, dig deeper or send it back.
Exercise 05 — Review an agent PR (frontend)
Time to put the review process into practice. You’ll get Claude Code to generate a change, push it as a PR, and then review it properly — not just “looks good to me.”
The core skill here is reviewing, not delegating. Background delegation to cloud agents comes in Parallel Workflows. For now, you’ll use a local Claude Code session to produce the code, then switch hats and review the result as if someone else wrote it. (That last part is easier than it sounds — you didn’t write it.)
Goal: Use Claude Code to generate a small, well-scoped change in the Next.js repo, push it as a PR, then review your own PR using the checklist from this chapter.
Repo: Rokkit200.Website (Next.js)
Steps:
- Your task: “Add `disabled` and `error` states to the Text input component (`src/components/storybook/molecules/text/Text.tsx`). When `disabled` is true, the input should be visually dimmed and non-interactive. When `error` is true, the input should show a red border and display an `errorMessage` string below it. Add Storybook stories demonstrating both states.”
- Write a spec using the template from The Spec-First Workflow. You can use Claude to help draft the spec — give it the task description and the template, then review and refine what it produces. In your spec:
  - Reference the existing component pattern: the Text component already manages an `isActive` focus state with corresponding SCSS classes (`.active`/`.non-active` in `text.module.scss`). Your new states should follow the same pattern.
  - List the files to modify: `Text.tsx` (add props to `TextProps`, add conditional logic), `text.module.scss` (add `.disabled` and `.error` classes), `Text.stories.ts` (add stories for both states).
  - In “Out of scope”: don’t modify the TextArea component, don’t change existing design tokens in `src/components/storybook/tokens/`, don’t install new packages.
  - For tests: this repo doesn’t have a test runner (no Jest/Vitest). Note in the spec that Storybook stories serve as the verification method and that a dedicated story per state is required.
- Create a feature branch and open Claude Code. Switch to plan mode (`/plan`) and paste the spec — this forces Claude to produce a plan without writing any code. Review the plan against the spec before giving the go-ahead.
- Once the agent has produced the code, push the branch and open a PR on GitHub. Now switch hats: you’re the reviewer.
- Review the PR using the checklist above. Go through each item. Write your findings down — not in your head, on paper or in a document. For this repo, adapt the checklist:
  - “Tests read first” becomes “stories read first” — do the Storybook stories cover both new states? Do they show the visual difference clearly?
  - “CI green” means: `yarn build` succeeds, `yarn lint` passes, and `yarn storybook` renders the new stories without errors.
- For each finding, decide: approve, send back for changes, or reject and restart. Write down why for each decision. Pay particular attention to:
  - Did the agent add the HTML `disabled` attribute to the underlying `<input>` element? (Accessibility requirement — without it, the input is only visually dimmed, not actually disabled.)
  - Does the disabled state interact correctly with the existing `isActive` focus state? (A disabled input shouldn’t show the active styling when clicked.)
  - Did the agent use CSS custom properties for the error colour, consistent with how other colours are handled in `text.module.scss`?
- If you’d send it back, try it: give Claude Code feedback in the same session and see if the second pass addresses your concerns. Did new issues appear?
Check your work:
- You reviewed the plan before the agent wrote code
- You read the Storybook stories before the implementation
- You checked scope — the agent only modified files within `src/components/storybook/molecules/text/`
- You traced the logic: what happens when `disabled={true}` and the user clicks the input?
- You checked for drift: does the new code follow the existing SCSS module pattern and TypeScript type conventions (`type` not `interface`)?
- You documented your approval/rejection decisions with reasoning
- `yarn build` and `yarn lint` pass cleanly
Reflect: What would you have needed in the original spec to have caught the issues you found at review time? Most review findings are spec gaps in disguise. If you found yourself thinking “I should have said…”, write that down — it’s how your specs get better.
Stretch goal: Review a teammate’s agent-generated PR using the same checklist. Compare notes. Do you agree on what’s acceptable? Where do you disagree? Those disagreements are worth discussing — they’re where team conventions get refined.
Exercise 05b — Review an agent PR (backend)
Same process, different codebase. Frontend-only devs can attempt this one. Backend and fullstack devs — do both, because the review instincts you build in one stack sharpen the other.
Goal: Use Claude Code to generate a small, well-scoped change in the Alloy CMS 13 repo, push it as a PR, then review it using the checklist.
Repo: Rokkit200.Training.CMS13.Alloy (Optimizely CMS 13)
Steps:
- Your task: “Add a `BackgroundColor` property to `TeaserBlock` (`Models/Blocks/TeaserBlock.cs`) that lets editors choose between White, Light Grey, and Dark via a selection factory. Update the Razor view (`Views/Shared/Blocks/TeaserBlock.cshtml`) to apply the selected colour as a CSS class on the outer `<div>`.”
- Write a spec using the template from The Spec-First Workflow. You can use Claude to help draft the spec — give it the task description and the template, then review and refine what it produces. In your spec:
  - Reference the existing patterns: look at how `TeaserBlock.cs` currently declares properties (the `[Display]`, `[CultureSpecific]`, `[Required]` attributes, `SystemTabNames.Content` group, ordering). Your new property should follow the same pattern.
  - For the selection factory, reference the existing `ContactPageSelectionFactory` in `Business/EditorDescriptors/ContactPageSelectionFactory.cs` — it implements `ISelectionFactory` and uses `[ServiceConfiguration]` for DI registration. Your factory is simpler (static list of options), but it should follow the same structural conventions and live in the same directory.
  - List the files to create or modify: `TeaserBlock.cs` (add property), a new `BackgroundColorSelectionFactory.cs` in `Business/EditorDescriptors/`, and `TeaserBlock.cshtml` (apply the class).
  - In “Out of scope”: don’t modify `Globals.cs`, don’t change the `SiteBlockData` base class, don’t alter existing property GUIDs, don’t change `Startup.cs` or `Program.cs`.
  - For tests: this repo has no test project. Note this in the spec and flag it as a limitation. The build (`dotnet build`) is the primary automated check.
- Create a feature branch and open Claude Code. Switch to plan mode (`/plan`) and paste the spec — this forces Claude to produce a plan without writing any code. Review the plan against the spec before giving the go-ahead.
- Once the agent has produced the code, push the branch and open a PR on GitHub. Now switch hats: you’re the reviewer.
- Review the PR using the checklist. Write your findings down. Pay particular attention to:
  - Did the agent inherit from `SiteBlockData` and use `[SiteContentType]` consistently with the other blocks? Or did it invent a new pattern?
  - Did it put the selection factory in `Business/EditorDescriptors/` (where `ContactPageSelectionFactory` lives), or somewhere else?
  - Does the Razor view apply the CSS class safely? What happens if `BackgroundColor` is null or empty (the editor hasn’t chosen a value yet)?
  - Did the agent modify any files outside the spec? Check for “helpful” changes to `Globals.cs`, `Startup.cs`, or other blocks.
- For each finding, decide: approve, send back for changes, or reject and restart. Write down why.
- If you’d send it back, try it: give Claude Code feedback in the same session and see if the second pass addresses your concerns.
Check your work:
- You reviewed the plan before the agent wrote code
- You checked scope — the agent only modified or created files listed in the spec
- You traced the logic: what CSS class gets applied when an editor selects “Light Grey”? What happens when no colour is selected?
- You checked for architectural drift: does it follow the existing `SiteBlockData` inheritance, `[Display]` attribute pattern, and property group conventions from `Globals.cs`?
- You checked the selection factory: does it follow the `ISelectionFactory` pattern from `ContactPageSelectionFactory.cs`?
- You documented your approval/rejection decisions with reasoning
- The project builds cleanly (`dotnet build Alloy.sln`)
Reflect: The checklist says “read the tests before the implementation” — but this repo has no tests. What did you substitute? Storybook stories in the frontend exercise served as a visual equivalent. For backend code with no test project, what’s the equivalent? (Hint: tracing the logic manually, checking the build, and verifying the Razor view against the CMS editor experience are your fallbacks. Note how much harder review gets without automated tests — that’s a finding worth recording.)
What’s next
You now know how to review what agents produce. The next chapter covers the rules that don’t bend — the non-negotiable guardrails that apply regardless of risk tier, review outcome, or how confident the agent sounds. Some of these you’ve already seen in action (the .env hook from Exercise 04, the branch protection rules). Risks and Hard Rules makes them explicit and explains why each one exists.