Everyone's talking about AI transforming DevOps. After six months of integrating LLMs into an incident response workflow, the picture is more nuanced: AI is genuinely useful for some parts of incident response and actively harmful for others.
Where LLMs Actually Help
Log Summarization
When a production incident generates thousands of log lines across multiple services, an LLM can summarize the pattern faster than any human:
```python
from openai import OpenAI

client = OpenAI()

def summarize_incident_logs(logs: list[str], context: str) -> str:
    prompt = f"""You are an SRE analyzing a production incident.
Context: {context}

Here are the relevant log entries (newest first):
{chr(10).join(logs[:200])}

Summarize:
1. What service(s) are affected
2. The sequence of events leading to the failure
3. Any error patterns or recurring messages
4. Suggested areas to investigate"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
    )
    return response.choices[0].message.content
```

In practice, this cuts the initial triage phase from 10-15 minutes down to 2-3 minutes. The LLM doesn't need to be right about the root cause — it just needs to point the on-call engineer in the right direction.
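One pre-processing step worth considering before calling the summarizer: collapse duplicate messages so the 200-line budget isn't eaten by a single repeating error. A minimal sketch; the normalization patterns (ISO timestamps, hex ids) are assumptions about the log format, not from the original workflow:

```python
import re
from collections import Counter

def collapse_duplicates(logs: list[str]) -> list[str]:
    """Group repeats of the same message and prefix them with a count."""
    def normalize(line: str) -> str:
        # Strip the variable parts so identical messages compare equal
        # (assumption: timestamps and hex ids are the main variable parts).
        line = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.]+Z?", "<ts>", line)
        return re.sub(r"\b[0-9a-f]{8,}\b", "<id>", line)

    counts = Counter(normalize(line) for line in logs)
    return [f"{n}x {msg}" if n > 1 else msg for msg, n in counts.items()]
```

Feeding `collapse_duplicates(logs)` instead of raw `logs` keeps the prompt focused on distinct events rather than one noisy error repeated hundreds of times.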
Runbook Retrieval
We indexed our runbooks and post-mortems into a vector database. When an alert fires, the system retrieves relevant past incidents:
```python
from chromadb import Client

def find_relevant_runbooks(alert_summary: str, n_results: int = 3):
    db = Client()
    collection = db.get_collection("runbooks")
    results = collection.query(
        query_texts=[alert_summary],
        n_results=n_results,
    )
    # Chroma returns distances, not similarities: lower means a closer match.
    return [
        {
            "title": meta["title"],
            "resolution": meta["resolution"],
            "distance": score,
        }
        for meta, score in zip(results["metadatas"][0], results["distances"][0])
    ]
```

This is where AI shines — pattern matching across hundreds of past incidents that no human could recall on demand.
Change Correlation
When an incident occurs, the LLM cross-references recent deployments, config changes, and infrastructure modifications:
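On the consumption side, the change report can be dropped straight into a correlation prompt. A sketch; `gather_changes.sh` (naming the script below) and `ask_llm` are placeholders, not part of the original setup:

```python
def build_correlation_prompt(alert_summary: str, change_report: str) -> str:
    """Combine the alert and the recent-changes report into one prompt."""
    return (
        "You are an SRE. Given this alert and the changes from the last "
        "4 hours, list which changes plausibly relate to the alert and why.\n"
        f"Alert: {alert_summary}\n"
        f"Recent changes:\n{change_report}"
    )

# Typical wiring (not run here):
# import subprocess
# report = subprocess.run(["bash", "gather_changes.sh"],
#                         capture_output=True, text=True).stdout
# answer = ask_llm(build_correlation_prompt(alert, report))
```

Asking for "plausibly relate" rather than "caused" matters: as discussed below, the model is good at surfacing candidates and bad at proving causation.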
```bash
#!/bin/bash
# Gather recent changes for AI analysis

echo "=== Deployments (last 4 hours) ==="
kubectl get events --field-selector reason=Pulling -A --sort-by='.lastTimestamp' | tail -20

echo "=== Config Changes ==="
git log --oneline --since="4 hours ago" -- "k8s/" "terraform/"

echo "=== Infrastructure Events ==="
aws cloudtrail lookup-events \
    --start-time "$(date -d '4 hours ago' -u +%FT%TZ)" \
    --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances
```

Where LLMs Fail
Automated Remediation
Letting an LLM execute fixes in production is a terrible idea. When tested, the model suggested scaling up a database replica to handle increased load — reasonable in theory, but it didn't account for the storage class limitations that would have caused the new replica to start without persistent storage.
```yaml
# The LLM generated this "fix"
# It would have created a replica with no persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
spec:
  replicas: 3  # was 2 — LLM scaled it up
  # Missing: volumeClaimTemplates
  # Missing: storage class specification
  # Missing: replication configuration
```

Rule: AI suggests, humans execute. Always.
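A cheap guardrail that still keeps humans in the loop: lint any AI-suggested manifest for the exact failure mode above before anyone even reviews it. A sketch operating on the parsed manifest as a plain dict; the checks are illustrative, not exhaustive:

```python
def storage_problems(statefulset: dict) -> list[str]:
    """Flag StatefulSet specs that would start replicas without durable storage."""
    spec = statefulset.get("spec", {})
    problems = []
    claims = spec.get("volumeClaimTemplates", [])
    if not claims:
        problems.append("no volumeClaimTemplates: replicas get no persistent storage")
    for claim in claims:
        if "storageClassName" not in claim.get("spec", {}):
            name = claim.get("metadata", {}).get("name", "<unnamed>")
            problems.append(f"claim {name!r} has no storageClassName")
    return problems
```

Run against the suggested manifest above, this returns a non-empty problem list, so the suggestion never reaches the on-call engineer looking like a ready-to-apply fix.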
Root Cause Analysis
LLMs are confident but often wrong about root causes. They'll correlate events that happen to co-occur and present them as causal relationships. A memory spike that coincides with a deployment doesn't mean the deployment caused the memory spike.
Real-Time Decision Making
During an active incident, you need speed and accuracy. LLM latency (even a few seconds) and hallucination risk make them unsuitable for real-time decisions. Use them for preparation and post-incident analysis, not during the heat of the moment.
The Architecture That Works
After iterating, here's the setup that works in practice:
- Alert fires → PagerDuty notifies on-call
- Automated context gathering → script collects logs, metrics, recent changes
- AI summarization → LLM summarizes the context and suggests investigation areas
- Runbook retrieval → vector search finds relevant past incidents
- Human decision → on-call engineer reads the AI summary and decides what to do
- Post-incident → LLM drafts the post-mortem from the incident timeline
The AI never touches production. It reads, summarizes, and suggests. The human investigates and acts.
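The read-only pipeline above can be sketched as plain function composition. The step names and the `TriageReport` shape are assumptions for illustration; the point is that the AI stages are injected as callables, so they are swappable (and mockable) and nothing in the pipeline can mutate production:

```python
from dataclasses import dataclass, field

@dataclass
class TriageReport:
    summary: str
    runbooks: list[dict] = field(default_factory=list)

def triage(alert: str, gather, summarize, retrieve) -> TriageReport:
    """Read-only pipeline: every stage only produces text for the human on call."""
    context = gather(alert)        # logs, metrics, recent changes
    summary = summarize(context)   # LLM summary + suggested investigation areas
    runbooks = retrieve(summary)   # vector search over past incidents
    return TriageReport(summary=summary, runbooks=runbooks)
```

The human decision and post-incident steps deliberately sit outside this function: the pipeline ends at a report, never at an action.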
Measuring the Impact
After six months:
- MTTR reduced by 34% — mostly from faster initial triage
- On-call cognitive load decreased — engineers report less "where do I start?" anxiety
- Post-mortem quality improved — AI-drafted timelines are more thorough than human-recalled ones
- False positive rate unchanged — AI doesn't help with alert tuning (yet)
Key Takeaways
- AI is best at summarization and retrieval — not decision-making
- Never let AI execute in production — suggest only, humans approve
- Index your post-mortems — they're your most valuable training data
- Measure before and after — "AI-powered" means nothing without metrics
- Start with log summarization — it's the easiest win with the lowest risk