AI Agents as Code Reviewers: Give Them the Design Doc

Key takeaways

AI code review is mainstream: 60M+ Copilot reviews, and 44% of developers use AI for review (GitHub Blog, 2025; JetBrains, 2025).

Generic AI review catches style. Spec-aware review, where the agent reads your design doc, catches design violations.

Given the design docs, our agent blocked PRs over a controller-stalling time.Sleep, a wrong RBAC API group, and three auth bugs.

AI review is not a security review: 45% of AI-generated code fails OWASP checks (Veracode, 2025). A human still owns merge.

GitHub logged more than 60 million Copilot code reviews in its first year, and over 1 in 5 reviews on the platform now involve an AI agent (GitHub Blog, 2025). Adoption is settled. The interesting question is no longer whether teams use AI to review code, but what you point it at. Most setups hand the agent a raw diff and ask a vague question: does this look right? We tried something narrower on a real platform-engineering codebase. We gave an AI agent the design documents, then let it submit an actual REQUEST_CHANGES or APPROVE verdict on live pull requests. The results changed how we think about review gates.

How mainstream is AI code review in 2026?

AI code review has crossed from novelty to default. GitHub reports more than 60 million Copilot reviews since launch, with usage up tenfold and over 12,000 organizations now auto-reviewing every pull request (GitHub Blog, 2025). Roughly 44% of developers already use AI for code review specifically (JetBrains, 2025).

The adoption skews toward the teams shipping the most software. Web developers report 52% AI review adoption and DevOps engineers 49%, both ahead of the 44% baseline (JetBrains, 2025). Zoom out to AI tooling in general and the numbers climb higher: some 90% of developers now use at least one AI coding tool regularly (JetBrains, 2026), and 84% use or plan to use AI tools in their workflow (Stack Overflow, 2025).

So here is the gap that matters. Most of that adoption is generic review. The agent reads a diff, checks for obvious smells, and comments on style. That is useful work, and it is not what we set out to test. A diff on its own cannot tell you whether the change does what the team actually agreed to build.

The experiment: letting an agent approve or block real PRs

Most AI reviews grade code against itself. Ours graded code against a spec. With 84% of developers now using or planning to use AI tools (Stack Overflow, 2025), the differentiator is no longer the model. It is the context you feed it. We gave the agent three things: the design docs, the PR diff, and the user-story IDs the change was meant to satisfy.

Then we handed it authority most teams withhold. The agent used the GitHub pull request API to pull the diff and the referenced files, cross-checked the change against the stated requirements in the architecture and user-story docs, and submitted its own review directly to GitHub. Not a summary back to the chat window: inline comments plus an explicit REQUEST_CHANGES or APPROVE verdict, posted as a first-class review on the PR.
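
To make "a first-class review" concrete, here is a minimal Go sketch of the request such an agent could build. It targets GitHub's REST endpoint for creating a pull request review, whose body carries an event (APPROVE, REQUEST_CHANGES, or COMMENT) plus optional inline comments. The type names, the newReviewRequest helper, and the example org, repo, path, and token are all hypothetical. This is not our agent's actual code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ReviewComment is one inline comment anchored to a line of the diff.
type ReviewComment struct {
	Path string `json:"path"`
	Line int    `json:"line"`
	Side string `json:"side"` // "RIGHT" targets the new version of the file
	Body string `json:"body"`
}

// Review mirrors the request body of GitHub's "create a review" endpoint.
type Review struct {
	Body     string          `json:"body"`
	Event    string          `json:"event"` // APPROVE, REQUEST_CHANGES, or COMMENT
	Comments []ReviewComment `json:"comments,omitempty"`
}

// newReviewRequest builds, but does not send, the HTTP request that would
// submit the review. Sending is left to the caller's http.Client.
func newReviewRequest(owner, repo string, number int, token string, r Review) (*http.Request, error) {
	payload, err := json.Marshal(r)
	if err != nil {
		return nil, err
	}
	endpoint := fmt.Sprintf("https://api.github.com/repos/%s/%s/pulls/%d/reviews", owner, repo, number)
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	review := Review{
		Body:  "Blocking call inside Reconcile; see inline comment.",
		Event: "REQUEST_CHANGES",
		Comments: []ReviewComment{{
			Path: "controllers/reconciler.go", // hypothetical path
			Line: 46,
			Side: "RIGHT",
			Body: "time.Sleep parks a reconcile worker; return RequeueAfter instead.",
		}},
	}
	req, err := newReviewRequest("example-org", "example-repo", 1, "TOKEN", review)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Building the request separately from sending it keeps the verdict payload easy to log and inspect before anything reaches the PR.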

For larger batches, the agent fanned out, running one investigation thread per PR so it could read several diffs and codebase areas in parallel. That kept a queue of pull requests moving without a human triaging each one first.
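
The fan-out can be sketched in Go as one goroutine per PR with a cap on how many run at once. This is an illustration only: reviewAll, Verdict, and the stub review function are hypothetical names, and the real agent's threads are model investigations rather than goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// Verdict is the outcome of reviewing one pull request.
type Verdict struct {
	PR    int
	Event string
}

// reviewAll runs review once per PR number with at most limit reviews in
// flight, and returns the verdicts in the same order as the input.
func reviewAll(prs []int, limit int, review func(pr int) string) []Verdict {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would deadlock
	}
	results := make([]Verdict, len(prs))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, pr := range prs {
		wg.Add(1)
		go func(i, pr int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when this review finishes
			// Each goroutine writes only its own index, so no mutex is needed.
			results[i] = Verdict{PR: pr, Event: review(pr)}
		}(i, pr)
	}
	wg.Wait()
	return results
}

func main() {
	// Stub reviewer standing in for a real investigation of each PR.
	stub := func(pr int) string {
		if pr%2 == 1 {
			return "REQUEST_CHANGES"
		}
		return "APPROVE"
	}
	for _, v := range reviewAll([]int{101, 102, 103}, 2, stub) {
		fmt.Printf("PR #%d: %s\n", v.PR, v.Event)
	}
}
```

The cap matters in practice: each review reads diffs and files through the same API, so unbounded parallelism mostly buys rate-limit errors.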

Definition. Spec-aware review means the reviewer holds both the diff and the design doc in the same context, so “does this match section 4 of the architecture doc” becomes a mechanically checkable question. The verdict can cite the exact section a change violates.

What did the agent catch that a human skim would miss?

The agent’s most valuable catches were runtime bugs that pass a visual skim. Across a batch of controller and frontend pull requests, it filed REQUEST_CHANGES on several changes a human reviewer had glanced past under time pressure. Three defects stood out, and none of them look wrong in the diff.

Picture the setting. It is late Friday, the PR touches twelve files, CI is green, and a teammate is waiting on the merge. You skim the diff, the code reads cleanly, so you approve. The blocking bug is on line 40 of file nine. This is exactly the situation where a tireless reviewer that traces semantics, not just lines, earns its keep.

The reconciler that would stall the whole controller

The agent flagged a blocking time.Sleep, combined with a recursive call, inside a Kubernetes reconciler. This is a classic controller-runtime anti-pattern. A reconciler runs on a shared worker pool, so sleeping inside it does not pause one resource. It parks a worker for the full duration and starves every other reconcile waiting on that worker.

// Anti-pattern the agent blocked: sleeping inside Reconcile
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if !resourceReady(ctx) {
        time.Sleep(30 * time.Minute) // stalls a shared worker for 30 minutes
        return r.Reconcile(ctx, req) // recursion instead of a requeue
    }
    // ...
    return ctrl.Result{}, nil
}

The correct shape returns control to the work queue and lets the manager schedule the retry:

    if !resourceReady(ctx) {
        return ctrl.Result{RequeueAfter: time.Minute}, nil
    }

The diff compiled, passed review at a glance, and would have degraded the entire controller in production. Catching it needs someone, or something, tracing execution semantics against controller-runtime’s concurrency model, not just reading the lines.
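
This particular pattern is narrow enough to check mechanically. The sketch below uses Go's standard go/parser and go/ast packages to report every time.Sleep call inside a function or method named Reconcile. It is a hypothetical lint, not what our agent ran, and it is purely syntactic: it assumes the time package is imported under its usual name and does not follow calls into helper functions.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// sample reproduces the anti-pattern; it is parsed, never compiled.
const sample = `package controller

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if !resourceReady(ctx) {
		time.Sleep(30 * time.Minute)
		return r.Reconcile(ctx, req)
	}
	return ctrl.Result{}, nil
}
`

// findSleepInReconcile returns the line numbers of time.Sleep calls found
// inside any function or method named Reconcile in the given Go source.
func findSleepInReconcile(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "reconciler.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != "Reconcile" || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true // keep walking
			}
			sel, ok := call.Fun.(*ast.SelectorExpr)
			if !ok {
				return true
			}
			if pkg, ok := sel.X.(*ast.Ident); ok && pkg.Name == "time" && sel.Sel.Name == "Sleep" {
				lines = append(lines, fset.Position(call.Pos()).Line)
			}
			return true
		})
	}
	return lines, nil
}

func main() {
	lines, err := findSleepInReconcile(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("time.Sleep inside Reconcile at line(s):", lines)
}
```

A check like this catches the literal pattern only. The reasoning about why it is harmful, and what to return instead, still has to come from a reviewer who knows the concurrency model.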

The RBAC group that silently denied all access

The agent caught a wrong RBAC API group: a rule scoped to platform.example.io when the controller actually served labs.platform.example.io. Nothing errors at deploy time. The manifest applies cleanly, because an RBAC rule that names a group nothing serves is still a valid rule. At runtime, the controller is denied every call it makes for that resource, and the failure surfaces as confusing authorization errors far from the typo that caused them.
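
A reviewer with the design docs can turn this into a simple set comparison: every API group named in the RBAC rules should be one the project is known to serve or use. The Go sketch below shows that comparison under stated assumptions. PolicyRule is a cut-down, hypothetical stand-in for the Kubernetes type, the resource name "labs" is invented for illustration, and the served list is assumed to come from the CRD manifests or the architecture doc.

```go
package main

import (
	"fmt"
	"sort"
)

// PolicyRule holds the fields of a Kubernetes RBAC rule that matter here.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// unservedGroups returns, sorted and de-duplicated, every API group named in
// rules that is not in served. The core group ("") and the wildcard ("*")
// are always accepted.
func unservedGroups(rules []PolicyRule, served []string) []string {
	known := map[string]bool{"": true, "*": true}
	for _, g := range served {
		known[g] = true
	}
	seen := map[string]bool{}
	var missing []string
	for _, rule := range rules {
		for _, g := range rule.APIGroups {
			if !known[g] && !seen[g] {
				seen[g] = true
				missing = append(missing, g)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	rules := []PolicyRule{{
		APIGroups: []string{"platform.example.io"}, // the typo from the PR
		Resources: []string{"labs"},                // hypothetical resource name
		Verbs:     []string{"get", "list", "watch"},
	}}
	served := []string{"labs.platform.example.io"}
	fmt.Println("groups with no matching API:", unservedGroups(rules, served))
}
```

The served list has to include any built-in groups the controller legitimately touches, such as apps, or the check will flag them too.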

On a separate PR, the agent found three bugs that together made an OIDC login flow non-functional. First, a middleware file the framework never actually invoked. Then a session token sent as a header instead of forwarded as a cookie. Finally, an identity-provider URL exposed directly to the browser. Each one is a single plausible-looking line. Together they meant nobody could log in.

Why does the design doc change everything?

The design doc reframes the whole question. In March 2026, GitHub moved Copilot review from line-by-line diffing to agentic tool-calling that explores the repository and traces cross-file dependencies (GitHub changelog, 2026). That industry shift and our experiment point the same direction: review quality is bounded by context, not model size.

Generic review answers “does this code look fine.” Spec-aware review answers “does this code do what we agreed.” Those are different questions with different failure modes. A change can be idiomatic, well-tested, and cleanly formatted while quietly implementing the wrong feature. In our batch, the agent blocked a form that targeted the wrong domain object because the user story explicitly required a different one. No amount of style checking finds that. Only the spec does.

The reframe. The design docs stop being passive reference material and become an active gate. A PR is graded against the requirement it claims to satisfy, and the verdict can quote the doc section it breaks. That is a contract, not a vibe check.

Where does AI code review fall short?

AI review is not a security review, and conflating the two is dangerous. So should a clean AI verdict ever stand in for a security scan? No. Veracode found that 45% of AI-generated code fails OWASP Top 10 checks, and 29.1% of generated Python carries a security weakness (Veracode, 2025). The implication for review is direct: an agent trained on the same patterns that produce vulnerable code will not reliably flag that class of vulnerability when reviewing it.

But there is a sharper limit from our own run. The agent only knows what is in the diff and the docs. It cannot run the code. Several verdicts landed as “Request Changes, minor” on PRs that later turned out to have real, blocking bugs, because confirming them required actually executing the change. The agent reasons about intent and semantics; it does not observe runtime behavior.

So treat agent verdicts as an advisory gate, not a merge authority. The human still owns the merge button. In practice that means the agent’s REQUEST_CHANGES blocks nothing on its own: it flags, cites, and routes. A person confirms, runs the code where it matters, and decides. The value is triage and design conformance at scale, not autonomous approval.

How do you set up spec-aware review?

With 90% of developers already using at least one AI coding tool regularly (JetBrains, 2026), the setup cost is low and the tooling is familiar. Four things separate a useful spec-aware review from generic diff commentary: docs in the repo, a structured prompt, an explicit verdict rubric, and CI wiring that stays advisory.

Keep design docs in the repo

The agent can only cite what it can read. Keep architecture docs, user stories, and interface contracts version-controlled next to the code, so the doc the PR is graded against moves with the branch. A spec in a separate wiki drifts out of sync and gives the reviewer nothing to anchor to.

Structure the review prompt

Point the agent at three inputs explicitly: the PR URL, the specific design-doc sections in scope, and the user-story IDs the change should satisfy. Ask it to check the diff against those requirements first, then general quality second. Order matters, because design conformance is the value you are adding over off-the-shelf review.
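
One way to keep that structure from eroding is to assemble the prompt in code and refuse to run when an input is missing. The Go sketch below is an assumption about how such a builder could look, not the prompt we used: ReviewInput, buildPrompt, the wording of the instructions, and the example section and story ID are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ReviewInput carries the three inputs the review prompt must name.
type ReviewInput struct {
	PRURL       string
	DocSections []string // design-doc sections in scope
	StoryIDs    []string // user stories the change should satisfy
}

// buildPrompt renders the prompt with requirements first and general
// quality second. It fails if any of the three inputs is missing.
func buildPrompt(in ReviewInput) (string, error) {
	if in.PRURL == "" || len(in.DocSections) == 0 || len(in.StoryIDs) == 0 {
		return "", fmt.Errorf("prompt needs a PR URL, at least one doc section, and at least one story ID")
	}
	var b strings.Builder
	fmt.Fprintf(&b, "Review pull request: %s\n", in.PRURL)
	b.WriteString("Design-doc sections in scope:\n")
	for _, s := range in.DocSections {
		fmt.Fprintf(&b, "- %s\n", s)
	}
	b.WriteString("User stories the change should satisfy:\n")
	for _, id := range in.StoryIDs {
		fmt.Fprintf(&b, "- %s\n", id)
	}
	b.WriteString("First, check the diff against these requirements and cite the section or story any violation breaks.\n")
	b.WriteString("Second, comment on general code quality.\n")
	return b.String(), nil
}

func main() {
	prompt, err := buildPrompt(ReviewInput{
		PRURL:       "https://github.com/example-org/example-repo/pull/1", // hypothetical
		DocSections: []string{"docs/architecture/controllers.md, section 4"},
		StoryIDs:    []string{"US-12"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(prompt)
}
```

Failing on a missing story ID is deliberate: a review with no stated requirement silently degrades into generic diff commentary.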

Define a verdict rubric

Give the agent a small, unambiguous rubric. REQUEST_CHANGES when a change violates a stated requirement or introduces a runtime-breaking pattern. APPROVE when it satisfies the referenced story and passes quality checks. Comment-only when the concern is stylistic. A clear rubric makes verdicts comparable across PRs and reviewers.
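
The rubric is small enough to express as a pure function, which also makes it testable. The Go sketch below is one hypothetical encoding: the Severity levels and Finding fields are names introduced here, and "no findings" is assumed to mean the referenced story was checked and satisfied. COMMENT is the event name GitHub uses for a comment-only review.

```go
package main

import "fmt"

// Severity classifies a single review finding.
type Severity int

const (
	Stylistic Severity = iota
	RuntimeBreaking
	RequirementViolation
)

// Finding is one issue raised by the reviewer, with the doc section it cites.
type Finding struct {
	Severity Severity
	Cites    string // design-doc section or story ID; may be empty for style
	Note     string
}

// decide maps findings to a verdict: any requirement violation or
// runtime-breaking pattern requests changes, style-only findings produce a
// comment-only review, and no findings at all approves.
func decide(findings []Finding) string {
	verdict := "APPROVE"
	for _, f := range findings {
		switch f.Severity {
		case RequirementViolation, RuntimeBreaking:
			return "REQUEST_CHANGES"
		case Stylistic:
			verdict = "COMMENT"
		}
	}
	return verdict
}

func main() {
	findings := []Finding{
		{Severity: Stylistic, Note: "inconsistent receiver name"},
		{Severity: RuntimeBreaking, Note: "time.Sleep inside Reconcile"},
	}
	fmt.Println(decide(findings))
}
```

Keeping the mapping outside the model means two reviewers, human or agent, who record the same findings always produce the same verdict.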

Wire it into CI, advisory only

Trigger the review on pull request events, post the verdict on the PR as a comment, and never make it a required status check. Keep the human merge gate in place.

name: spec-aware-review
on:
  pull_request:
    types: [opened, synchronize]
permissions:
  contents: read
  pull-requests: write # needed to post the review on the PR
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent review against design docs
        # Verdict is posted as a comment, not a required status check.
        run: ./scripts/review-against-spec.sh
        env:
          GH_TOKEN: ${{ github.token }} # assumption: the script reads its token from GH_TOKEN
          PR_URL: ${{ github.event.pull_request.html_url }}
          DESIGN_DOCS: docs/architecture,docs/user-stories
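
The workflow hands the review step two environment variables, PR_URL and DESIGN_DOCS. The contents of review-against-spec.sh are not shown here, so the following Go sketch is only a guess at the first thing such a step needs to do: turn the PR URL into an owner, repo, and number, and split the doc list into directories. PRRef, parsePRURL, splitDocs, envOr, and the fallback values are all hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
	"strings"
)

// PRRef identifies a pull request on GitHub.
type PRRef struct {
	Owner  string
	Repo   string
	Number int
}

// parsePRURL extracts owner, repo, and number from a URL of the form
// https://github.com/OWNER/REPO/pull/NUMBER.
func parsePRURL(raw string) (PRRef, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return PRRef{}, err
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 4 || parts[2] != "pull" {
		return PRRef{}, fmt.Errorf("not a pull request URL: %q", raw)
	}
	n, err := strconv.Atoi(parts[3])
	if err != nil {
		return PRRef{}, fmt.Errorf("bad pull request number in %q: %w", raw, err)
	}
	return PRRef{Owner: parts[0], Repo: parts[1], Number: n}, nil
}

// splitDocs turns a comma-separated list into trimmed, non-empty entries.
func splitDocs(list string) []string {
	var dirs []string
	for _, d := range strings.Split(list, ",") {
		if d = strings.TrimSpace(d); d != "" {
			dirs = append(dirs, d)
		}
	}
	return dirs
}

// envOr reads an environment variable, falling back to a sample value so the
// sketch also runs outside CI.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

func main() {
	ref, err := parsePRURL(envOr("PR_URL", "https://github.com/example-org/example-repo/pull/1"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	docs := splitDocs(envOr("DESIGN_DOCS", "docs/architecture,docs/user-stories"))
	fmt.Printf("reviewing %s/%s#%d against %v\n", ref.Owner, ref.Repo, ref.Number, docs)
}
```

Validating the inputs up front means a misconfigured workflow fails loudly at the first step instead of producing a review with no spec behind it.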

The takeaway

Generic AI review is already table stakes, with 60 million Copilot reviews and 44% of developers using AI for review on the record (GitHub Blog, 2025; JetBrains, 2025). The differentiator is what you feed the agent. Give it the design doc, and review shifts from “does this look fine” to “does this do what we agreed.” That shift is what surfaced a controller-stalling time.Sleep, a silent RBAC misconfiguration, and three login-breaking auth bugs before merge. Keep the guardrails honest: pair it with real security scanning, and keep a human on the merge button. Used that way, spec-aware review is one of the highest-return additions a platform team can make this year.

Frequently asked questions

Does an AI agent replace human code reviewers?

No. In our experiment the agent acted as an advisory gate, not a merge authority. It cannot run code, so several verdicts on genuinely buggy PRs came back as minor concerns that only a human running the change could confirm. It scales triage and design conformance; a person still owns the merge decision.

What is spec-aware code review?

Spec-aware review gives the agent both the pull request diff and the design documents in the same context. Instead of asking whether the code looks fine, it asks whether the code does what the spec says. The agent can then cite the exact design-doc section a change violates, turning docs into an active review gate.

Is AI code review the same as an AI security scan?

No, and treating them as equivalent is risky. Veracode found 45% of AI-generated code fails OWASP Top 10 checks. A general-purpose review agent is not a substitute for dedicated SAST, dependency scanning, or a security engineer. Run security tooling alongside, never instead of, spec-aware review.

How much of the industry actually uses AI for reviews?

Adoption is now mainstream. GitHub reports over 60 million Copilot reviews and more than 1 in 5 platform reviews involving AI, while 44% of developers use AI for code review specifically. The open question is context quality, not whether teams have started.