An on-call rotation drowning in noisy alerts, runbooks scattered across wikis, and a backlog of small recurring incidents that ate engineering time. We built an LLM-powered DevOps copilot that triages incoming alerts, correlates them with logs and metrics, proposes a fix, and opens a PR with the patch attached. It cut MTTR by 20% in the first quarter.
The problem
The platform team supported a sprawling AWS estate (multiple EKS clusters, Aurora PostgreSQL, OpenSearch, Lambda, Kinesis) for a multi-tenant subscription platform with consumer-facing and partner-facing workloads. Alert volume was high, and the on-call engineer spent the first 20 minutes of every page on the same recurring detective work: which service, which deployment, what changed, where's the runbook.
Three patterns came up over and over:
- Noisy alerts that needed human correlation. A spike in 5xxs from one service often had its root cause in a different service three hops away. Engineers were good at finding it; the work just wasn't repeatable.
- Recurring fix patterns. Roughly 30% of incidents had a known shape — a config drift, an undersized HPA, an expired secret, a missing IAM permission — that an experienced engineer recognised in seconds.
- Runbook decay. Runbooks existed but were never quite up to date with the latest deployment topology. Engineers stopped trusting them and rebuilt context from scratch each time.
The constraints
Three non-negotiables shaped the design:
- No autonomous changes to production. The copilot proposes; engineers approve. Every action ends in a PR, never a kubectl apply.
- Auditable. Every recommendation cites the logs, metrics and prior incidents it drew on. Hallucinations would destroy trust on day one.
- Cheap when idle. The copilot only spins up when an alert fires or an engineer asks it something. No background polling, no permanent inference cost.
Architecture
The copilot is a Claude-powered agent with a tightly scoped tool surface. When an alert fires from Prometheus or CloudWatch, an EventBridge rule invokes a Lambda that hands the alert to the agent loop. The agent then reasons over the alert and decides which tools to call.
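As a rough illustration of that handoff, here is a minimal sketch of a Lambda entry point, not the production code. It assumes the standard EventBridge envelope (top-level source, time and detail fields); the alertName and service keys inside detail, and the run_agent function, are hypothetical stand-ins.

```python
from typing import Any


def extract_alert(event: dict[str, Any]) -> dict[str, Any]:
    """Pull the fields the agent needs out of an EventBridge event envelope."""
    detail = event.get("detail", {})
    return {
        "source": event.get("source", "unknown"),
        # "alertName" and "service" are assumed keys, not a documented schema.
        "alert_name": detail.get("alertName", "unknown"),
        "service": detail.get("service", "unknown"),
        "fired_at": event.get("time"),
    }


def run_agent(alert: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the agent loop; here it only echoes the alert."""
    return {"status": "queued", "alert": alert}


def handler(event: dict[str, Any], context: Any = None) -> dict[str, Any]:
    """Lambda entry point: normalise the event, then hand it to the agent."""
    return run_agent(extract_alert(event))
```

Because the handler only runs when an event arrives, nothing polls in the background, which matches the cheap-when-idle constraint.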
The tools are deliberately narrow and read-mostly:
- read_logs: scoped OpenSearch queries on the affected service, time-windowed to the alert.
- read_metrics: Prometheus / CloudWatch lookups for the same service plus its known dependencies.
- read_runbook: Confluence search by service tag, with the agent told to quote the runbook in its output rather than rephrase it.
- query_history: Databricks Lakehouse query against past incidents (alert fingerprint, time-to-resolution, fix that worked). This is where Snowflake-governed historical data earns its keep: the copilot is far more useful with three years of past pages to reason from.
- open_pr: only writes to a fork, never to the main branch directly. Includes the proposed change, the reasoning trail and links to the cited evidence.
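To make "tightly scoped" concrete, here is a sketch of how two of these tools might be declared. It uses the name / description / input_schema shape that Claude tool definitions take; the parameter names (service, start, end, fork_repo and so on) and the validate_tool_call helper are assumptions for illustration, not the real schemas.

```python
from typing import Any

# Illustrative tool declarations; every parameter name here is hypothetical.
TOOLS: list[dict[str, Any]] = [
    {
        "name": "read_logs",
        "description": "Query logs for one service inside the alert window.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 timestamp"},
                "end": {"type": "string", "description": "ISO 8601 timestamp"},
            },
            "required": ["service", "start", "end"],
        },
    },
    {
        "name": "open_pr",
        "description": "Open a pull request from a fork with the proposed patch.",
        "input_schema": {
            "type": "object",
            "properties": {
                "fork_repo": {"type": "string"},
                "title": {"type": "string"},
                "body": {"type": "string"},
                "patch": {"type": "string"},
            },
            "required": ["fork_repo", "title", "body", "patch"],
        },
    },
]


def validate_tool_call(name: str, args: dict[str, Any]) -> None:
    """Reject calls to unknown tools or with missing or unexpected arguments."""
    schema = next((t["input_schema"] for t in TOOLS if t["name"] == name), None)
    if schema is None:
        raise ValueError(f"unknown tool: {name}")
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"{name} is missing arguments: {missing}")
    extra = [key for key in args if key not in schema["properties"]]
    if extra:
        raise ValueError(f"{name} got unexpected arguments: {extra}")
```

Checking each call against an allowlist before executing it keeps the tool surface narrow even if the model asks for something outside it.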
The on-call engineer gets a Slack message: "I think this is X because of Y, here's the runbook section, here's a PR if you want it." They can approve, edit, or reject. Every interaction is logged.
Implementation notes
The PR is the unit of trust
The breakthrough wasn't the LLM — it was treating the version-controlled pull request as the contract between the agent and the team. A PR has reviewers, CI checks, deploy gates, rollback. The engineering team already trusts that workflow. We slotted the agent into that workflow rather than asking the team to trust a new one.
Ground every claim in a tool call
The agent is prompted to never make a claim without citing a tool call output. If it can't find evidence, it says so. If it found contradictory evidence, it surfaces both. This is dull, slow agent behaviour — and it's the only kind that engineers trust.
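Prompting alone is hard to verify, so one way to back the rule up is a post-check on the agent's output. The sketch below assumes each claim carries the ids of the tool calls it relies on; the Claim type and the ungrounded_claims function are hypothetical names, not part of the system described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]  # ids of the tool calls this claim cites


def ungrounded_claims(claims: list[Claim], tool_call_ids: set[str]) -> list[Claim]:
    """Return claims that cite nothing, or cite a tool call that never ran."""
    return [
        claim
        for claim in claims
        if not claim.evidence_ids or not set(claim.evidence_ids) <= tool_call_ids
    ]
```

A non-empty result could block the Slack message or send the agent back for another pass; which of those is appropriate is a design choice this sketch does not settle.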
Past incidents are the killer dataset
Most LLM-for-DevOps demos focus on real-time signals. The bigger unlock was historical. Querying Databricks for "last 90 days of alerts matching this fingerprint, plus the resolution that closed them" turned the agent from a clever generalist into a colleague who'd been on-call here before.
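A sketch of what that lookup could look like follows. Everything specific in it is an assumption: the fingerprint recipe, the incidents table, its column names, and the :name parameter-marker style, which should be checked against whichever SQL client is actually in use.

```python
import hashlib


def alert_fingerprint(alert_name: str, service: str, labels: dict[str, str]) -> str:
    """Stable id for 'the same kind of alert', independent of label order."""
    parts = [alert_name, service] + [f"{key}={labels[key]}" for key in sorted(labels)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


# Hypothetical table and columns; the real incident schema is not shown here.
HISTORY_SQL = """
SELECT incident_id, opened_at, minutes_to_resolve, resolution_summary
FROM incidents
WHERE alert_fingerprint = :fingerprint
  AND opened_at >= date_sub(current_date(), :lookback_days)
ORDER BY opened_at DESC
"""


def build_history_query(fingerprint: str, lookback_days: int = 90) -> tuple[str, dict]:
    """Return the SQL text and bound parameters, never interpolated strings."""
    return HISTORY_SQL, {"fingerprint": fingerprint, "lookback_days": lookback_days}
```

Binding parameters instead of formatting them into the string matters here because the alert name and labels come from outside the agent.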
What didn't work
- Auto-applying anything. Tried it on dev clusters with strict guardrails. The political cost outweighed the time saved. Reverted to PR-only.
- Letting the agent propose new runbooks. Hallucinated steps. We restricted runbook output to quoting existing content with citations.
- Open-ended chat. Engineers hated it. Structured output (alert → diagnosis → evidence → proposed fix → PR link) won.
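The structured shape mentioned in the last point can be sketched as a small record type. The field names follow the alert, diagnosis, evidence, proposed fix, PR link sequence; the TriageReport class and its Slack rendering are illustrative assumptions rather than the format actually shipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageReport:
    alert: str
    diagnosis: str
    evidence: list[str]
    proposed_fix: str
    pr_url: Optional[str] = None

    def to_slack_text(self) -> str:
        """Render the report in a fixed order so every message reads the same."""
        if not self.evidence:
            raise ValueError("a report with no evidence should not be sent")
        lines = [
            f"Alert: {self.alert}",
            f"Diagnosis: {self.diagnosis}",
            "Evidence:",
            *[f"  - {item}" for item in self.evidence],
            f"Proposed fix: {self.proposed_fix}",
            f"PR: {self.pr_url or 'none opened'}",
        ]
        return "\n".join(lines)
```

A fixed field order means an engineer paged at 3am always finds the evidence and the PR link in the same place.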
Results
Beyond the MTTR number, three softer wins:
- On-call retention improved. Engineers reported on-call as less exhausting. The first 20 minutes of context-rebuilding was gone.
- Runbook quality improved as a side effect. The copilot's tendency to surface stale runbook sections gave the team a clear backlog of docs to fix.
- Junior engineers ramped faster. The copilot's structured diagnosis became a teaching tool — new joiners read its reasoning to learn the platform.
What I'd do differently
- Start with one alert family. We tried to cover every alert in the first sprint. Should have shipped one — say, deployment-rollback alerts — and proven the loop before generalising.
- Invest in evals earlier. A small held-out set of past incidents with known good fixes would have caught two regressions we found in production.
- Be more aggressive about quoting. Any time the agent paraphrased instead of quoting, trust dropped. Quoting verbatim is dull, but it's safe.
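On the evals point, a held-out set of past incidents does not need much machinery. The sketch below assumes each case stores the alert payload and a label for the fix that resolved it; EvalCase, run_evals and the diagnose callable are hypothetical names standing in for the real agent entry point.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EvalCase:
    incident_id: str
    alert: dict[str, Any]
    expected_fix: str  # label of the fix that actually resolved the incident


def run_evals(
    cases: list[EvalCase],
    diagnose: Callable[[dict[str, Any]], str],
) -> dict[str, Any]:
    """Replay held-out incidents and record where the proposed fix differs."""
    failures = []
    for case in cases:
        proposed = diagnose(case.alert)
        if proposed != case.expected_fix:
            failures.append((case.incident_id, case.expected_fix, proposed))
    total = len(cases)
    return {"total": total, "passed": total - len(failures), "failures": failures}
```

Run before each prompt or tool change, a harness like this turns a regression into a failed check rather than a surprise in production.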
Tools used
Claude · AWS Lambda · EventBridge · OpenSearch · Prometheus · Grafana · CloudWatch · Confluence API · Databricks Lakehouse · Snowflake · ArgoCD · EKS · Terraform · Slack · Python · Bash · Go.
Building something similar?
I work with platform and SRE teams introducing AI-augmented DevOps without breaking trust. Outside IR35, Inside IR35, permanent or fractional.