
Automating Code Review with AI — Architecture and Honest Results

AI-powered code review integrated into a PR workflow. Here's the architecture, the prompt engineering, and the metrics after 3 months.

March 30, 2026 · 6 min read
ai · code-review · automation · ci-cd · devops

Manual code review is a bottleneck. Senior engineers spend hours daily reviewing PRs, context-switching between their own work and review queues. Integrating AI into a code review workflow — not to replace human reviewers, but to handle the repetitive checks so humans can focus on architecture and logic — is a practical way to reclaim that time.

What AI Reviews Well

After three months, the pattern is clear. AI catches mechanical issues with near-perfect accuracy:

  • Security vulnerabilities — SQL injection, XSS, hardcoded credentials, insecure deserialization
  • Bug patterns — null pointer risks, off-by-one errors, race conditions in obvious cases
  • Style consistency — naming conventions, import ordering, dead code
  • Documentation gaps — public APIs without JSDoc, missing error descriptions
  • Dependency risks — known CVEs in added packages, license incompatibilities

What it doesn't do well: architectural decisions, business logic validation, performance implications in context.

The Architecture

.github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]
 
jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
 
      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr.diff
          echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT
 
      - name: Run AI review
        if: steps.diff.outputs.diff_size < 2000
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: npx tsx scripts/ai-review.ts

AI review is skipped on large diffs (2000+ lines). LLMs lose accuracy on massive context windows, and large PRs should be split anyway.

The Review Script

scripts/ai-review.ts
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";
import { execSync } from "child_process";
 
const client = new Anthropic();
 
const diff = readFileSync("pr.diff", "utf-8");
const baseRef = process.env.GITHUB_BASE_REF ?? "main"; // set automatically in pull_request runs
const changedFiles = execSync(`git diff --name-only origin/${baseRef}...HEAD`)
  .toString()
  .trim()
  .split("\n");
 
const prompt = `You are a senior software engineer reviewing a pull request.
 
## Changed files
${changedFiles.join("\n")}
 
## Diff
${diff}
 
Review this PR for:
1. **Security issues** — injection, XSS, hardcoded secrets, insecure patterns
2. **Bugs** — null/undefined risks, incorrect logic, edge cases
3. **Performance** — obvious N+1 queries, unnecessary re-renders, memory leaks
4. **Best practices** — error handling, naming, code organization
 
Rules:
- Only comment on issues that are clearly wrong or risky
- Do NOT suggest stylistic preferences or minor refactors
- Do NOT comment on unchanged code
- Be specific: reference the file and line number
- If the PR looks good, say so briefly
 
Format each issue as:
**[SEVERITY]** \`file:line\` — description`;
 
const response = await client.messages.create({
  model: "claude-sonnet-4-6-20250414",
  max_tokens: 2000,
  messages: [{ role: "user", content: prompt }],
});
 
const review = response.content[0].type === "text"
  ? response.content[0].text
  : "";
 
// Post as PR comment via the gh CLI. Passing the body on stdin
// ("--body-file -") avoids shell-escaping issues with quotes,
// backticks, and "$" in the review text.
execSync(`gh pr comment ${process.env.PR_NUMBER} --body-file -`, {
  input: review,
});

Prompt Engineering Matters

The prompt went through 15+ iterations before reaching its current state. Key lessons:

Be explicit about what NOT to flag

Without "Do NOT suggest stylistic preferences," the AI generates dozens of comments about variable naming and formatting that clutter the review.

Severity levels reduce noise

**[CRITICAL]** — Must fix before merge (security, data loss)
**[WARNING]**  — Should fix, potential bug or risk
**[INFO]**     — Suggestion, take it or leave it

Developers learned to ignore [INFO] and always address [CRITICAL]. Without severity, every comment felt equally urgent.
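
This format also makes the review output machine-checkable. A minimal sketch of a severity bucketer (the `severityOf` and `countBySeverity` helpers are illustrative, not part of the production script):

```typescript
// Severity tags as emitted by the review prompt:
// "**[SEVERITY]** `file:line` — description"
type Severity = "CRITICAL" | "WARNING" | "INFO";

const SEVERITY_RE = /^\*\*\[(CRITICAL|WARNING|INFO)\]\*\*/;

// Extract the severity tag from a single review line, or null if untagged.
function severityOf(line: string): Severity | null {
  const match = SEVERITY_RE.exec(line.trim());
  return match ? (match[1] as Severity) : null;
}

// Tally tagged lines in a full review body.
function countBySeverity(review: string): Record<Severity, number> {
  const counts: Record<Severity, number> = { CRITICAL: 0, WARNING: 0, INFO: 0 };
  for (const line of review.split("\n")) {
    const sev = severityOf(line);
    if (sev !== null) counts[sev]++;
  }
  return counts;
}
```

A natural extension is to fail the workflow when `countBySeverity(review).CRITICAL > 0`.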

Context window management

For large PRs, the script sends only the diff plus the full content of modified files — not the entire repo. This keeps the context focused and the review relevant:

context-management.ts
function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;
 
  for (const file of changedFiles) {
    const content = readFileSync(file, "utf-8");
    const tokens = Math.ceil(content.length / 4); // rough heuristic: ~4 chars per token
 
    if (estimatedTokens + tokens > maxTokens) break;
 
    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }
 
  return fileContents.join("\n\n");
}
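
A rough usage sketch (`buildContext` is repeated so the snippet runs on its own; the temp files are stand-ins for real changed files, and the 20-token budget is deliberately tiny):

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

// buildContext as above (~4 chars/token heuristic), repeated for self-containment.
function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;

  for (const file of changedFiles) {
    const content = readFileSync(file, "utf-8");
    const tokens = Math.ceil(content.length / 4);

    if (estimatedTokens + tokens > maxTokens) break;

    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }

  return fileContents.join("\n\n");
}

// Stand-in "changed files" in a temp directory.
const dir = mkdtempSync(join(tmpdir(), "ctx-"));
const small = join(dir, "small.ts");
const large = join(dir, "large.ts");
writeFileSync(small, "export const x = 1;\n");    // ~5 tokens
writeFileSync(large, "// padding\n".repeat(500)); // ~1375 tokens

// With a 20-token budget, only the small file fits.
const context = buildContext([small, large], 20);
```

Note the `break`: once the budget is hit, all remaining files are skipped; swapping it for `continue` would still consider later, smaller files.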

Results After 3 Months

| Metric | Before | After |
| --- | --- | --- |
| Avg time to first review | 4.2 hours | 12 minutes (AI) + 2.1 hours (human) |
| Security issues caught in review | ~60% | ~92% |
| Review comments per PR (human) | 4.7 | 2.1 |
| Developer satisfaction with review process | 3.2/5 | 4.1/5 |

The biggest win isn't speed — it's that human reviewers now focus on higher-level feedback because the AI already handled the mechanical checks.

Common Pitfalls

False positives erode trust

If the AI flags non-issues, developers stop reading its comments. Tracking the false positive rate and tuning the prompt when it exceeds 15% is essential to maintaining trust.
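
One lightweight way to track it: have reviewers label each AI comment (e.g. with a reaction or a triage sheet) and compute the rate periodically. A sketch using the 15% threshold from above (the `ReviewedComment` shape is an assumption, not a real API):

```typescript
// A reviewed AI comment: isFalsePositive is true when a human
// judged the flagged issue not to be a real problem.
interface ReviewedComment {
  id: string;
  isFalsePositive: boolean;
}

// Fraction of AI comments judged to be non-issues.
function falsePositiveRate(comments: ReviewedComment[]): number {
  if (comments.length === 0) return 0;
  const flagged = comments.filter((c) => c.isFalsePositive).length;
  return flagged / comments.length;
}

// The 15% default is the trust cutoff described above.
function needsPromptTuning(comments: ReviewedComment[], threshold = 0.15): boolean {
  return falsePositiveRate(comments) > threshold;
}
```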

Cost management

At ~$0.03 per review for average PRs (Sonnet), cost is negligible. But large PRs with full file context can hit $0.50+. The 2000-line diff limit keeps costs predictable:

# Monthly cost tracking: the created filter scopes the count to the current
# month, and per_page=100 avoids truncating at the default 30-item page
echo "Reviews this month: $(gh api "/repos/myorg/myapp/actions/runs?created=>$(date +%Y-%m-01)&per_page=100" \
  --jq '[.workflow_runs[] | select(.name=="AI Code Review")] | length')"
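
Cost can also be estimated up front from the prompt size, using the same ~4 chars/token heuristic. A sketch; the per-million-token prices are placeholders, not current Anthropic pricing:

```typescript
// Rough pre-flight cost estimate for one review call.
// PRICE_PER_MTOK values are assumed USD rates per million tokens.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

function estimateReviewCost(promptChars: number, maxOutputTokens: number): number {
  // ~4 characters per token, matching buildContext's heuristic.
  const inputTokens = Math.ceil(promptChars / 4);
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (maxOutputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}
```

At these assumed rates, a 40,000-character prompt with a 2,000-token output cap comes to about $0.06; actual output is usually shorter, which is how the average lands near the figure above.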

Don't review generated code

Auto-generated files (GraphQL types, Prisma client, lock files) produce noise. Exclude them:

ai-review-config.yml
exclude_patterns:
  - "*.generated.ts"
  - "*.lock"
  - "prisma/migrations/**"
  - "__generated__/**"
  - "*.min.js"

Key Takeaways

  1. AI handles mechanical checks — humans handle judgment — this split is where value lives
  2. Prompt engineering is 80% of the work — the same model with a bad prompt is useless
  3. False positive rate is the critical metric — above 15%, developers ignore the AI entirely
  4. Skip large diffs — AI accuracy degrades on massive PRs
  5. Track cost and adjust — set context limits to keep costs predictable