
Automating Code Review with AI — Architecture and Honest Results

AI-powered code review integrated into a PR workflow. Here's the architecture, the prompt engineering, and the metrics after 3 months.

March 30, 2026 · 6 min read
ai · code-review · automation · ci-cd · devops

Manual code review is a bottleneck. Senior engineers spend hours daily reviewing PRs, context-switching between their own work and review queues. Integrating AI into a code review workflow — not to replace human reviewers, but to handle the repetitive checks so humans can focus on architecture and logic — is a practical way to reclaim that time.

What AI Reviews Well

After three months, the pattern is clear. AI catches mechanical issues with near-perfect accuracy:

  • Security vulnerabilities — SQL injection, XSS, hardcoded credentials, insecure deserialization
  • Bug patterns — null pointer risks, off-by-one errors, race conditions in obvious cases
  • Style consistency — naming conventions, import ordering, dead code
  • Documentation gaps — public APIs without JSDoc, missing error descriptions
  • Dependency risks — known CVEs in added packages, license incompatibilities

What it doesn't do well: architectural decisions, business logic validation, performance implications in context.

The Architecture

.github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]
 
jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
 
      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr.diff
          echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT
 
      - name: Run AI review
        if: steps.diff.outputs.diff_size < 2000
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: npx tsx scripts/ai-review.ts

AI review is skipped on large diffs (2000+ lines). LLMs lose accuracy on massive context windows, and large PRs should be split anyway.

The Review Script

scripts/ai-review.ts
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";
import { execSync } from "child_process";
 
const client = new Anthropic();
 
const diff = readFileSync("pr.diff", "utf-8");
const baseRef = process.env.GITHUB_BASE_REF ?? "main"; // set automatically in pull_request runs
const changedFiles = execSync(`git diff --name-only origin/${baseRef}...HEAD`)
  .toString()
  .trim()
  .split("\n");
 
const prompt = `You are a senior software engineer reviewing a pull request.
 
## Changed files
${changedFiles.join("\n")}
 
## Diff
${diff}
 
Review this PR for:
1. **Security issues** — injection, XSS, hardcoded secrets, insecure patterns
2. **Bugs** — null/undefined risks, incorrect logic, edge cases
3. **Performance** — obvious N+1 queries, unnecessary re-renders, memory leaks
4. **Best practices** — error handling, naming, code organization
 
Rules:
- Only comment on issues that are clearly wrong or risky
- Do NOT suggest stylistic preferences or minor refactors
- Do NOT comment on unchanged code
- Be specific: reference the file and line number
- If the PR looks good, say so briefly
 
Format each issue as:
**[SEVERITY]** \`file:line\` — description`;
 
const response = await client.messages.create({
  model: "claude-sonnet-4-6-20250414",
  max_tokens: 2000,
  messages: [{ role: "user", content: prompt }],
});
 
const review = response.content[0].type === "text"
  ? response.content[0].text
  : "";
 
// Post as PR comment via the gh CLI. Passing the body on stdin
// ("--body-file -") avoids shell-escaping issues with quotes,
// backticks, and "$" in the review text.
execSync(`gh pr comment ${process.env.PR_NUMBER} --body-file -`, {
  input: review,
});

Prompt Engineering Matters

The prompt went through 15+ iterations before reaching its current state. Key lessons:

Be explicit about what NOT to flag

Without "Do NOT suggest stylistic preferences," the AI generates dozens of comments about variable naming and formatting that clutter the review.

Severity levels reduce noise

**[CRITICAL]** — Must fix before merge (security, data loss)
**[WARNING]**  — Should fix, potential bug or risk
**[INFO]**     — Suggestion, take it or leave it

Developers learned to ignore [INFO] and always address [CRITICAL]. Without severity, every comment felt equally urgent.
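
This format also makes the review output machine-checkable. A minimal sketch of a severity bucketer (the `severityOf` and `countBySeverity` helpers are illustrative, not part of the production script):

```typescript
// Severity tags as emitted by the review prompt:
// "**[SEVERITY]** `file:line` — description"
type Severity = "CRITICAL" | "WARNING" | "INFO";

const SEVERITY_RE = /^\*\*\[(CRITICAL|WARNING|INFO)\]\*\*/;

// Extract the severity tag from a single review line, or null if untagged.
function severityOf(line: string): Severity | null {
  const match = SEVERITY_RE.exec(line.trim());
  return match ? (match[1] as Severity) : null;
}

// Tally tagged lines in a full review body.
function countBySeverity(review: string): Record<Severity, number> {
  const counts: Record<Severity, number> = { CRITICAL: 0, WARNING: 0, INFO: 0 };
  for (const line of review.split("\n")) {
    const sev = severityOf(line);
    if (sev !== null) counts[sev]++;
  }
  return counts;
}
```

A natural extension is to fail the workflow when `countBySeverity(review).CRITICAL > 0`.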

Context window management

For large PRs, the script sends only the diff plus the full content of modified files — not the entire repo. This keeps the context focused and the review relevant:

context-management.ts
function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;
 
  for (const file of changedFiles) {
    const content = readFileSync(file, "utf-8");
    const tokens = Math.ceil(content.length / 4); // rough heuristic: ~4 chars per token
 
    if (estimatedTokens + tokens > maxTokens) break;
 
    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }
 
  return fileContents.join("\n\n");
}
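
A rough usage sketch (`buildContext` is repeated so the snippet runs on its own; the temp files are stand-ins for real changed files, and the 20-token budget is deliberately tiny):

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

// buildContext as above (~4 chars/token heuristic), repeated for self-containment.
function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;

  for (const file of changedFiles) {
    const content = readFileSync(file, "utf-8");
    const tokens = Math.ceil(content.length / 4);

    if (estimatedTokens + tokens > maxTokens) break;

    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }

  return fileContents.join("\n\n");
}

// Stand-in "changed files" in a temp directory.
const dir = mkdtempSync(join(tmpdir(), "ctx-"));
const small = join(dir, "small.ts");
const large = join(dir, "large.ts");
writeFileSync(small, "export const x = 1;\n");    // ~5 tokens
writeFileSync(large, "// padding\n".repeat(500)); // ~1375 tokens

// With a 20-token budget, only the small file fits.
const context = buildContext([small, large], 20);
```

Note the `break`: once the budget is hit, all remaining files are skipped; swapping it for `continue` would still consider later, smaller files.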

Results After 3 Months

| Metric | Before | After |
| --- | --- | --- |
| Avg time to first review | 4.2 hours | 12 minutes (AI) + 2.1 hours (human) |
| Security issues caught in review | ~60% | ~92% |
| Review comments per PR (human) | 4.7 | 2.1 |
| Developer satisfaction with review process | 3.2/5 | 4.1/5 |

The biggest win isn't speed — it's that human reviewers now focus on higher-level feedback because the AI already handled the mechanical checks.

Common Pitfalls

False positives erode trust

If the AI flags non-issues, developers stop reading its comments. Tracking the false positive rate and tuning the prompt when it exceeds 15% is essential to maintaining trust.
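
One lightweight way to track it: have reviewers label each AI comment (e.g. with a reaction or a triage sheet) and compute the rate periodically. A sketch using the 15% threshold from above (the `ReviewedComment` shape is an assumption, not a real API):

```typescript
// A reviewed AI comment: isFalsePositive is true when a human
// judged the flagged issue not to be a real problem.
interface ReviewedComment {
  id: string;
  isFalsePositive: boolean;
}

// Fraction of AI comments judged to be non-issues.
function falsePositiveRate(comments: ReviewedComment[]): number {
  if (comments.length === 0) return 0;
  const flagged = comments.filter((c) => c.isFalsePositive).length;
  return flagged / comments.length;
}

// The 15% default is the trust cutoff described above.
function needsPromptTuning(comments: ReviewedComment[], threshold = 0.15): boolean {
  return falsePositiveRate(comments) > threshold;
}
```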

Cost management

At ~$0.03 per review for average PRs (Sonnet), cost is negligible. But large PRs with full file context can hit $0.50+. The 2000-line diff limit keeps costs predictable:

# Monthly cost tracking: the created filter scopes the count to the current
# month, and per_page=100 avoids truncating at the default 30-item page
echo "Reviews this month: $(gh api "/repos/myorg/myapp/actions/runs?created=>$(date +%Y-%m-01)&per_page=100" \
  --jq '[.workflow_runs[] | select(.name=="AI Code Review")] | length')"
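
Cost can also be estimated up front from the prompt size, using the same ~4 chars/token heuristic. A sketch; the per-million-token prices are placeholders, not current Anthropic pricing:

```typescript
// Rough pre-flight cost estimate for one review call.
// PRICE_PER_MTOK values are assumed USD rates per million tokens.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

function estimateReviewCost(promptChars: number, maxOutputTokens: number): number {
  // ~4 characters per token, matching buildContext's heuristic.
  const inputTokens = Math.ceil(promptChars / 4);
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (maxOutputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}
```

At these assumed rates, a 40,000-character prompt with a 2,000-token output cap comes to about $0.06; actual output is usually shorter, which is how the average lands near the figure above.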

Don't review generated code

Auto-generated files (GraphQL types, Prisma client, lock files) produce noise. Exclude them:

ai-review-config.yml
exclude_patterns:
  - "*.generated.ts"
  - "*.lock"
  - "prisma/migrations/**"
  - "__generated__/**"
  - "*.min.js"

Key Takeaways

  1. AI handles mechanical checks — humans handle judgment — this split is where value lives
  2. Prompt engineering is 80% of the work — the same model with a bad prompt is useless
  3. False positive rate is the critical metric — above 15%, developers ignore the AI entirely
  4. Skip large diffs — AI accuracy degrades on massive PRs
  5. Track cost and adjust — set context limits to keep costs predictable