AI SRE — Autonomous Incident Investigator
The agent that finds root cause before you finish your coffee
✓ Resolve AI is a $1B unicorn used by Coinbase & DoorDash; Datadog Bits AI SRE is GA across thousands of orgs.
How you build it: The open-source HolmesGPT pattern (Apache 2.0, CNCF Sandbox, by Robusta + Microsoft).
This is the flagship build: an autonomous AI Site Reliability Engineer that investigates production incidents the way Datadog Bits AI SRE and Resolve AI do, following the approach that the open-source, CNCF-backed HolmesGPT implements in code. When an alert fires, the agent does more than summarize it. It runs an agentic tool-use loop: it decides which read-only tool to call, queries live metrics, logs, and Kubernetes state, forms candidate root-cause hypotheses, and gathers more evidence to confirm or rule them out. Once it converges on a ranked root cause, it posts a clean RCA with the full evidence trail to Slack and proposes a fix for a human to approve.
What you'll build
- ✓ An alert webhook (Alertmanager / Grafana) that kicks off an investigation with the firing alert as context.
- ✓ A read-only tool layer the agent can call: PromQL metric queries, Loki log search, `kubectl describe/logs`, and recent-deploy lookups.
- ✓ A Claude tool-use loop: plan → call a tool → observe results → reason about hypotheses → repeat under a step budget.
- ✓ A ranked root-cause report with the evidence trail, posted to Slack in a clean, scannable format.
- ✓ Human-in-the-loop guardrails: every mutating action (rollback, scale, restart) is gated behind a Slack approval button.
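As a rough illustration of the first item, the sketch below turns an Alertmanager webhook payload into investigation contexts. The `alerts`, `status`, `labels`, and `startsAt` fields follow Alertmanager's webhook format; the `Investigation` type, the function name, and the 30-minute lookback are assumptions made for this example, not part of the build itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Investigation:
    """Context handed to the agent when an alert fires (hypothetical type)."""
    alert_name: str
    severity: str
    labels: dict
    window_start: datetime
    window_end: datetime


def investigations_from_webhook(payload: dict, lookback_minutes: int = 30) -> list[Investigation]:
    """Build one Investigation per firing alert in an Alertmanager webhook payload."""
    result = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not need an investigation
        labels = alert.get("labels", {})
        # Alertmanager sends RFC 3339 timestamps; rewrite a trailing "Z" so
        # datetime.fromisoformat accepts it on older Python versions too.
        started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        result.append(Investigation(
            alert_name=labels.get("alertname", "unknown"),
            severity=labels.get("severity", "unknown"),
            labels=labels,
            # Assumed window: look back from the moment the alert started firing.
            window_start=started - timedelta(minutes=lookback_minutes),
            window_end=started,
        ))
    return result
```

Timestamps with high-precision fractional seconds may need a more lenient parser than `fromisoformat`, depending on the Python version in use.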
The production architecture
This is the same agentic tool-use loop that powers Datadog Bits and Resolve AI: not a single prompt, but an iterative cycle of plan → act → observe → reason.

1. Trigger: An alert fires (Alertmanager/Grafana) and hits a webhook that opens an investigation with the alert's labels, severity, and timeframe.
2. Plan: The LLM looks at the alert plus the evidence gathered so far and decides which tool to call next, exactly like Datadog Bits' "deep research agent."
3. Observe: Call one read-only tool (a PromQL query, a Loki log search, `kubectl describe`, recent deploys). Append the raw results to the agent's working context.
4. Reason: Form and refine root-cause hypotheses, scoring each against the evidence. Loop back to Plan until confident or the step budget is hit.
5. Report: Write a ranked RCA, most-likely cause first, with the evidence trail and a concrete suggested remediation. Post to Slack.
6. Approve: A human reviews. Any mutating action runs only on explicit approval. This is how every credible production system actually ships.
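The Plan, Observe, and Reason steps above can be sketched as a small driver loop. This is a minimal illustration, not the HolmesGPT or Datadog implementation: the `plan` callable stands in for the LLM call, and the decision dictionary shape (`action`, `tool`, `args`, `root_causes`) is an assumption chosen for the example.

```python
from typing import Callable

# A tool takes keyword arguments and returns text evidence.
Tool = Callable[..., str]


def investigate(alert: dict, plan: Callable[[dict, list], dict],
                tools: dict[str, Tool], max_steps: int = 8) -> dict:
    """Run plan -> act -> observe until the planner concludes or the budget runs out."""
    evidence: list[dict] = []
    for step in range(1, max_steps + 1):
        # Plan: the model sees the alert plus all evidence so far.
        decision = plan(alert, evidence)
        if decision["action"] == "conclude":
            # Stop condition: the model is confident enough to rank causes.
            return {"root_causes": decision["root_causes"],
                    "evidence": evidence, "steps": step - 1}
        name = decision["tool"]
        if name not in tools:
            # Anything outside the registered read-only set is refused, not executed.
            observation = f"error: unknown tool {name!r}"
        else:
            observation = tools[name](**decision.get("args", {}))
        # Observe: append the raw result to the working context.
        evidence.append({"step": step, "tool": name, "result": observation})
    # Step budget exhausted: return what was gathered instead of looping forever.
    return {"root_causes": [], "evidence": evidence, "steps": max_steps}
```

Keeping the planner behind a plain callable means the loop can be exercised with a scripted stub before a real model is wired in.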
The stack
Build milestones
Stage a real incident
Spin up a breakable demo app on kind with Prometheus, Loki, and Alertmanager. Trigger a genuine CrashLoopBackOff or OOMKilled so the agent has something real to solve.
Build the read-only tools
Implement PromQL query, Loki log search, kubectl describe/logs, and recent-deploys as typed, safe tools the LLM can call.
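One way to make a tool "typed and safe" is to pair a JSON Schema definition with server-side validation, so the model's arguments are checked again before anything runs. The sketch below assumes the tool-definition shape used by the Anthropic Messages API (`name`, `description`, `input_schema`); the tool name `kubectl_read` and the helper `build_kubectl_argv` are hypothetical.

```python
READ_ONLY_VERBS = {"describe", "logs"}

# Tool definition: name, description, and a JSON Schema for the inputs.
KUBECTL_TOOL = {
    "name": "kubectl_read",
    "description": "Run a read-only kubectl command against the cluster.",
    "input_schema": {
        "type": "object",
        "properties": {
            "verb": {"type": "string", "enum": sorted(READ_ONLY_VERBS)},
            "resource": {"type": "string"},
            "namespace": {"type": "string"},
        },
        "required": ["verb", "resource", "namespace"],
    },
}


def build_kubectl_argv(verb: str, resource: str, namespace: str) -> list[str]:
    """Validate model-supplied arguments and return an argv list, never a shell string."""
    if verb not in READ_ONLY_VERBS:
        raise ValueError(f"verb {verb!r} is not read-only")
    for value in (resource, namespace):
        # Reject empty values, flags, and whitespace so extra options cannot be smuggled in.
        if not value or value.startswith("-") or any(c.isspace() for c in value):
            raise ValueError(f"unsafe argument {value!r}")
    return ["kubectl", verb, resource, "-n", namespace]
```

The schema's `enum` guides the model, but the allowlist check in code is what actually enforces read-only access. Passing the returned list to `subprocess.run` without `shell=True` avoids shell injection.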
Wire the agentic loop
Plan → call tool → observe → reason → repeat, with a step budget and a stop condition. This is the heart of the build.
Generate the RCA report
Turn the agent's evidence trail into a ranked, human-readable root-cause analysis and post it to Slack.
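A minimal sketch of the ranking and rendering step, assuming the agent emits hypotheses with a confidence between 0 and 1 and a list of evidence strings. The `Hypothesis` type and `render_rca` function are illustrative names, and the output uses Slack's `*bold*` text formatting rather than Block Kit.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    cause: str
    confidence: float  # assumed scale: 0.0 to 1.0, as scored by the agent
    evidence: list[str] = field(default_factory=list)


def render_rca(alert_name: str, hypotheses: list[Hypothesis]) -> str:
    """Render hypotheses as Slack-formatted text, most likely cause first."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    lines = [f"*RCA for {alert_name}*"]
    for rank, h in enumerate(ranked, start=1):
        lines.append(f"{rank}. *{h.cause}* ({h.confidence:.0%} confidence)")
        # Each evidence item sits under its hypothesis so the trail stays scannable.
        lines.extend(f"    • {item}" for item in h.evidence)
    return "\n".join(lines)
```

The resulting string can be sent as the `text` of a Slack message; posting it is left out here because it depends on how the workspace integration is set up.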
Add human-in-the-loop
Put remediation (rollback/scale) behind a Slack approval button — no blind auto-fixes.
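The gate itself can be small: a proposed action carries its own approval state, and the executor refuses to run it until a named human has approved. In this sketch the Slack button handler is assumed to call `approve`; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when a mutating action is executed without human approval."""


@dataclass
class ProposedAction:
    description: str            # e.g. "Roll back deployment api to the previous revision"
    run: Callable[[], str]      # the mutating operation, deferred until approved
    approved_by: Optional[str] = None


def approve(action: ProposedAction, user: str) -> None:
    """Record who approved; a Slack button callback would call this (assumption)."""
    action.approved_by = user


def execute(action: ProposedAction) -> str:
    """Run the action only if a human has explicitly approved it."""
    if action.approved_by is None:
        raise ApprovalRequired(f"not approved: {action.description}")
    return action.run()
```

Because the agent only ever constructs `ProposedAction` objects and never calls `run` directly, the approval check stays in one place and the approver's identity is available for an audit trail.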
Evaluate honestly
Run it against 3–5 seeded incidents. Show where it nails the root cause and where it still needs a human — credibility over hype.
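A simple scoring harness for the seeded incidents might look like the sketch below. It assumes each case pairs an alert with a known root-cause label and that the investigator returns a ranked list of cause labels; exact string matching is a deliberate simplification, and real runs would likely need a human or a rubric to judge matches.

```python
def score_incidents(cases: list[tuple[dict, str]], investigate_fn) -> dict:
    """Compare ranked causes against the seeded root cause for each incident."""
    if not cases:
        raise ValueError("need at least one seeded incident")
    rows = []
    for alert, expected in cases:
        ranked = investigate_fn(alert)
        # Rank of the true cause in the agent's list, or None if it was missed entirely.
        rank = ranked.index(expected) + 1 if expected in ranked else None
        rows.append({"alert": alert, "expected": expected, "rank": rank})
    top1 = sum(1 for r in rows if r["rank"] == 1) / len(rows)
    # Surface the misses explicitly: these are the cases that still need a human.
    return {"top1_accuracy": top1, "misses": [r for r in rows if r["rank"] != 1]}
```

Reporting the misses alongside the accuracy keeps the evaluation focused on where the agent falls short, not just the headline number.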
What you'll learn
⚖️ The production reality (teach this on camera)
Be honest on camera: the real systems gate every mutating action behind human approval, and IBM's ITBench benchmark reports that even top models resolve only about 14% of its SRE scenarios end to end. We're building assisted, evidence-backed investigation: fast root-cause analysis, not blind auto-remediation. That honesty is exactly what earns trust with senior engineers.