
Using LLMs for Incident Response — What Works and What Doesn't

After integrating AI into an on-call workflow, here's what actually reduced MTTR and what turned out to be expensive noise.

April 1, 2026 · 5 min read
ai · llm · incident-response · devops · observability

Everyone's talking about AI transforming DevOps. After six months of integrating LLMs into an incident response workflow, the picture is more nuanced: AI is genuinely useful for some parts of incident response and actively harmful for others.

Where LLMs Actually Help

Log Summarization

When a production incident generates thousands of log lines across multiple services, an LLM can summarize the pattern faster than any human:

log-summarizer.py
import openai
 
def summarize_incident_logs(logs: list[str], context: str) -> str:
    prompt = f"""You are an SRE analyzing a production incident.
Context: {context}
 
Here are the relevant log entries (newest first):
{chr(10).join(logs[:200])}
 
Summarize:
1. What service(s) are affected
2. The sequence of events leading to the failure
3. Any error patterns or recurring messages
4. Suggested areas to investigate"""
 
    response = openai.chat.completions.create(
        model="gpt-4o",  # match the model to the client library in use
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
    )
    return response.choices[0].message.content

In practice, this cuts the initial triage phase from 10-15 minutes down to 2-3 minutes. The LLM doesn't need to be right about the root cause — it just needs to point the on-call engineer in the right direction.
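In practice the raw stream needs pre-filtering first: a noisy incident can emit the same stack trace thousands of times, and passing 200 arbitrary lines wastes most of the prompt budget on duplicates. A minimal dedup helper (this function is my sketch, not part of the pipeline above):

```python
from collections import Counter

def dedupe_logs(logs: list[str], limit: int = 200) -> list[str]:
    """Collapse duplicate log lines, keeping first-seen order and a repeat
    count, so the summarizer's line budget covers distinct messages."""
    counts = Counter(logs)
    deduped = []
    for line in logs:
        if line in counts:
            n = counts.pop(line)  # pop so repeats are skipped on later passes
            deduped.append(f"{line} (x{n})" if n > 1 else line)
    return deduped[:limit]
```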

Runbook Retrieval

We indexed our runbooks and post-mortems into a vector database. When an alert fires, the system retrieves relevant past incidents:

runbook-search.py
from chromadb import Client
 
def find_relevant_runbooks(alert_summary: str, n_results: int = 3):
    db = Client()
    collection = db.get_collection("runbooks")
 
    results = collection.query(
        query_texts=[alert_summary],
        n_results=n_results,
    )
 
    # note: chroma returns distances, not similarities (lower = closer match)
    return [
        {
            "title": meta["title"],
            "resolution": meta["resolution"],
            "distance": score,
        }
        for meta, score in zip(results["metadatas"][0], results["distances"][0])
    ]

This is where AI shines — pattern matching across hundreds of past incidents that no human could recall on demand.
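For completeness, here is one way the indexing side of that setup might look. The field names ("symptoms", "resolution") and the payload shape are my assumptions about the runbook schema, not code from the post:

```python
def build_index_payload(runbooks: list[dict]) -> dict:
    """Shape runbooks/post-mortems into the ids/documents/metadatas lists
    that chroma's collection.add() expects. The embedded document text is
    title plus symptoms, so alert summaries can match on symptom wording."""
    return {
        "ids": [rb["id"] for rb in runbooks],
        "documents": [f"{rb['title']}\n{rb['symptoms']}" for rb in runbooks],
        "metadatas": [
            {"title": rb["title"], "resolution": rb["resolution"]}
            for rb in runbooks
        ],
    }

# Usage sketch, assuming an in-memory chroma client:
#   collection = Client().get_or_create_collection("runbooks")
#   collection.add(**build_index_payload(runbooks))
```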

Change Correlation

When an incident occurs, the LLM cross-references recent deployments, config changes, and infrastructure modifications:

correlate-changes.sh
#!/bin/bash
# Gather recent changes for AI analysis
echo "=== Deployments (last 4 hours) ==="
# image-pull events are a rough proxy for fresh rollouts
kubectl get events --field-selector reason=Pulling -A --sort-by='.lastTimestamp' | tail -20
 
echo "=== Config Changes ==="
git log --oneline --since="4 hours ago" -- "k8s/" "terraform/"
 
echo "=== Infrastructure Events ==="
# note: GNU date syntax; on macOS/BSD use: date -u -v-4H +%FT%TZ
aws cloudtrail lookup-events \
  --start-time "$(date -d '4 hours ago' -u +%FT%TZ)" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances
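The script only gathers raw material; one way to hand it to the model is to fold each section into a single prompt. The helper name and shape below are my sketch:

```python
def build_correlation_prompt(incident_summary: str, sections: dict[str, str]) -> str:
    """Assemble gathered change context (the output of a script like
    correlate-changes.sh) into one prompt asking the model to flag
    changes that plausibly relate to the incident."""
    parts = [f"Incident: {incident_summary}", ""]
    for heading, body in sections.items():
        parts += [f"=== {heading} ===", body.strip(), ""]
    parts.append(
        "Which of these changes could plausibly relate to the incident, and why? "
        "Treat co-occurrence as a lead, not proof of causation."
    )
    return "\n".join(parts)
```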

Where LLMs Fail

Automated Remediation

Letting an LLM execute fixes in production is a terrible idea. When tested, the model suggested scaling up a database replica to handle increased load — reasonable in theory, but it didn't account for the storage class limitations that would have caused the new replica to start without persistent storage.

what-the-llm-suggested.yml
# The LLM generated this "fix"
# It would have created a replica with no persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
spec:
  replicas: 3  # was 2 — LLM scaled it up
  # Missing: volumeClaimTemplates
  # Missing: storage class specification
  # Missing: replication configuration

Rule: AI suggests, humans execute. Always.

Root Cause Analysis

LLMs are confident but often wrong about root causes. They'll correlate events that happen to co-occur and present them as causal relationships. A memory spike and a deployment happening at the same time doesn't mean the deployment caused the memory spike.

Real-Time Decision Making

During an active incident, you need speed and accuracy. LLM latency (even a few seconds) and hallucination risk make them unsuitable for real-time decisions. Use them for preparation and post-incident analysis, not during the heat of the moment.

The Architecture That Works

After iterating, here's the setup that works in practice:

  1. Alert fires → PagerDuty notifies on-call
  2. Automated context gathering → script collects logs, metrics, recent changes
  3. AI summarization → LLM summarizes the context and suggests investigation areas
  4. Runbook retrieval → vector search finds relevant past incidents
  5. Human decision → on-call engineer reads the AI summary and decides what to do
  6. Post-incident → LLM drafts the post-mortem from the incident timeline

The AI never touches production. It reads, summarizes, and suggests. The human investigates and acts.
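Steps 2 through 4 reduce to a read-only composition. Sketched with the pipeline pieces injected as callables (the wiring below is mine; the individual functions are the kind described earlier):

```python
from typing import Callable

def handle_alert(
    alert: dict,
    gather: Callable[[dict], dict],
    summarize: Callable[[list[str], str], str],
    retrieve: Callable[[str], list[dict]],
) -> dict:
    """Run context gathering, AI summarization, and runbook retrieval,
    then return everything to the on-call engineer (step 5). No branch
    here mutates production state."""
    context = gather(alert)  # logs, metrics, recent changes
    summary = summarize(context["logs"], context["description"])
    return {"summary": summary, "runbooks": retrieve(summary)}
```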

Measuring the Impact

After six months:

  • MTTR reduced by 34% — mostly from faster initial triage
  • On-call cognitive load decreased — engineers report less "where do I start?" anxiety
  • Post-mortem quality improved — AI-drafted timelines are more thorough than human-recalled ones
  • False positive rate unchanged — AI doesn't help with alert tuning (yet)

Key Takeaways

  1. AI is best at summarization and retrieval — not decision-making
  2. Never let AI execute in production — suggest only, humans approve
  3. Index your post-mortems — they're your most valuable training data
  4. Measure before and after — "AI-powered" means nothing without metrics
  5. Start with log summarization — it's the easiest win with the lowest risk