Manual code review is a bottleneck. Senior engineers spend hours daily reviewing PRs, context-switching between their own work and review queues. Integrating AI into a code review workflow — not to replace human reviewers, but to handle the repetitive checks so humans can focus on architecture and logic — is a practical way to reclaim that time.
## What AI Reviews Well
After three months, the pattern is clear. AI catches mechanical issues with near-perfect accuracy:
- Security vulnerabilities — SQL injection, XSS, hardcoded credentials, insecure deserialization
- Bug patterns — null pointer risks, off-by-one errors, race conditions in obvious cases
- Style consistency — naming conventions, import ordering, dead code
- Documentation gaps — public APIs without JSDoc, missing error descriptions
- Dependency risks — known CVEs in added packages, license incompatibilities
What it doesn't do well: architectural decisions, business logic validation, performance implications in context.
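To make the first category concrete, here is the kind of hypothetical snippet the AI flags reliably, alongside the fix it typically suggests (the `db` stub exists only so the example runs standalone; it is not part of the review tooling):

```typescript
// Stub "db" that just records the SQL it receives, so the example runs standalone.
const db = { query: (sql: string, params?: unknown[]) => ({ sql, params }) };

// Flagged: user input interpolated directly into SQL (injection risk).
function findUserUnsafe(email: string) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// Suggested fix: a parameterized query.
function findUserSafe(email: string) {
  return db.query("SELECT * FROM users WHERE email = $1", [email]);
}
```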
## The Architecture
```yaml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr.diff
          echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT

      - name: Run AI review
        if: steps.diff.outputs.diff_size < 2000
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: node scripts/ai-review.js
```

AI review is skipped on large diffs (2,000+ lines). LLMs lose accuracy on massive context windows, and large PRs should be split anyway.
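The same gate is easy to reason about in script form. A minimal sketch (`shouldReview` and the ~4-characters-per-token heuristic are illustrative, not part of the workflow):

```typescript
// Mirrors the workflow's diff_size check: skip review above a line threshold.
function shouldReview(diff: string, maxLines = 2000): boolean {
  return diff.split("\n").length < maxLines;
}

// Rough token estimate (~4 characters per token) to gauge how much context
// a diff of a given size actually consumes. Not an exact tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```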
## The Review Script
```javascript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync, writeFileSync } from "fs";
import { execSync } from "child_process";

const client = new Anthropic();
const diff = readFileSync("pr.diff", "utf-8");
const changedFiles = execSync("git diff --name-only origin/main...HEAD")
  .toString()
  .trim()
  .split("\n");

const prompt = `You are a senior software engineer reviewing a pull request.

## Changed files
${changedFiles.join("\n")}

## Diff
${diff}

Review this PR for:
1. **Security issues** — injection, XSS, hardcoded secrets, insecure patterns
2. **Bugs** — null/undefined risks, incorrect logic, edge cases
3. **Performance** — obvious N+1 queries, unnecessary re-renders, memory leaks
4. **Best practices** — error handling, naming, code organization

Rules:
- Only comment on issues that are clearly wrong or risky
- Do NOT suggest stylistic preferences or minor refactors
- Do NOT comment on unchanged code
- Be specific: reference the file and line number
- If the PR looks good, say so briefly

Format each issue as:
**[SEVERITY]** \`file:line\` — description`;

const response = await client.messages.create({
  model: "claude-sonnet-4-6-20250414",
  max_tokens: 2000,
  messages: [{ role: "user", content: prompt }],
});

const review = response.content[0].type === "text"
  ? response.content[0].text
  : "";

// Post as a PR comment. Writing to a file and using --body-file avoids
// shell-escaping problems (backticks, $, quotes) in the review text.
writeFileSync("review.md", review);
execSync(`gh pr comment ${process.env.PR_NUMBER} --body-file review.md`);
```

## Prompt Engineering Matters
The prompt went through 15+ iterations before reaching its current state. Key lessons:
### Be explicit about what NOT to flag
Without "Do NOT suggest stylistic preferences," the AI generates dozens of comments about variable naming and formatting that clutter the review.
### Severity levels reduce noise
```
**[CRITICAL]** — Must fix before merge (security, data loss)
**[WARNING]** — Should fix, potential bug or risk
**[INFO]** — Suggestion, take it or leave it
```

Developers learned to ignore [INFO] and always address [CRITICAL]. Without severity levels, every comment felt equally urgent.
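Because the tags follow a fixed template, they are also easy to act on mechanically. A sketch (the `countBySeverity` helper is illustrative, not part of the review script):

```typescript
// Count severity tags in the AI's review text. The regex matches the
// "**[SEVERITY]** `file:line` — description" format the prompt requests.
function countBySeverity(review: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const m of review.matchAll(/\*\*\[(CRITICAL|WARNING|INFO)\]\*\*/g)) {
    counts[m[1]] = (counts[m[1]] ?? 0) + 1;
  }
  return counts;
}
```

One option this enables is failing the CI job whenever a [CRITICAL] finding is present, e.g. `if (countBySeverity(review).CRITICAL) process.exit(1);`.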
### Context window management
For large PRs, the script sends only the diff plus the full content of modified files — not the entire repo. This keeps the context focused and the review relevant:
```typescript
function buildContext(changedFiles: string[], maxTokens: number): string {
  const fileContents: string[] = [];
  let estimatedTokens = 0;

  for (const file of changedFiles) {
    const content = readFileSync(file, "utf-8");
    // Rough heuristic: ~4 characters per token
    const tokens = Math.ceil(content.length / 4);
    if (estimatedTokens + tokens > maxTokens) break;
    fileContents.push(`--- ${file} ---\n${content}`);
    estimatedTokens += tokens;
  }

  return fileContents.join("\n\n");
}
```

## Results After 3 Months
| Metric | Before | After |
|---|---|---|
| Avg time to first review | 4.2 hours | 12 minutes (AI) + 2.1 hours (human) |
| Security issues caught in review | ~60% | ~92% |
| Review comments per PR (human) | 4.7 | 2.1 |
| Developer satisfaction with review process | 3.2/5 | 4.1/5 |
The biggest win isn't speed — it's that human reviewers now focus on higher-level feedback because the AI already handled the mechanical checks.
## Common Pitfalls
### False positives erode trust
If the AI flags non-issues, developers stop reading its comments. Tracking the false positive rate and tuning the prompt when it exceeds 15% is essential to maintaining trust.
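What "tracking" looks like depends on the team. A minimal sketch, assuming reviewers label AI comments as valid or bogus somehow (the `ReviewedComment` shape is hypothetical; in practice the label might come from a 👎 reaction fetched via the GitHub API):

```typescript
interface ReviewedComment {
  id: number;
  falsePositive: boolean; // set by a human reviewer, e.g. via a 👎 reaction
}

// Share of AI comments that reviewers judged to be false positives.
function falsePositiveRate(comments: ReviewedComment[]): number {
  if (comments.length === 0) return 0;
  return comments.filter((c) => c.falsePositive).length / comments.length;
}

// The 15% threshold from the text: above it, retune the prompt.
function promptNeedsTuning(comments: ReviewedComment[]): boolean {
  return falsePositiveRate(comments) > 0.15;
}
```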
### Cost management
At ~$0.03 per review for average PRs (Sonnet), cost is negligible. But large PRs with full file context can hit $0.50+. The 2000-line diff limit keeps costs predictable:
```shell
# Monthly cost tracking
echo "Reviews this month: $(gh api /repos/myorg/myapp/actions/runs \
  --jq '[.workflow_runs[] | select(.name=="AI Code Review")] | length')"
```

### Don't review generated code
Auto-generated files (GraphQL types, Prisma client, lock files) produce noise. Exclude them:
```yaml
exclude_patterns:
  - "*.generated.ts"
  - "*.lock"
  - "prisma/migrations/**"
  - "__generated__/**"
  - "*.min.js"
```

## Key Takeaways
- AI handles mechanical checks — humans handle judgment — this split is where value lives
- Prompt engineering is 80% of the work — the same model with a bad prompt is useless
- False positive rate is the critical metric — above 15%, developers ignore the AI entirely
- Skip large diffs — AI accuracy degrades on massive PRs
- Track cost and adjust — set context limits to keep costs predictable