← All projects
🚨
Tier 1 · Flagship AnchorBuilding now

AI SRE — Autonomous Incident Investigator

The agent that finds root cause before you finish your coffee

Intermediate~3 hrs
Mirrors real production systemsDatadog Bits AI SREResolve AICleric AI

Resolve AI is a $1B unicorn used by Coinbase & DoorDash; Datadog Bits AI SRE is GA across thousands of orgs.

How you build it: The open-source HolmesGPT pattern (Apache 2.0, CNCF Sandbox, by Robusta + Microsoft).

This is the flagship build: an autonomous AI Site Reliability Engineer that investigates production incidents the way Datadog Bits AI SRE and the $1B unicorn Resolve AI do — and the way the open-source, CNCF-backed HolmesGPT does it in code. When an alert fires, the agent doesn't just summarize it. It runs an agentic tool-use loop: it decides which read-only tool to call, queries live metrics and logs and Kubernetes state, forms candidate root-cause hypotheses, gathers more evidence to confirm or rule them out, and converges on a ranked root cause — then posts a clean RCA with the full evidence trail to Slack and proposes a fix that a human approves.

What you'll build

  • An alert webhook (Alertmanager / Grafana) that kicks off an investigation with the firing alert as context.
  • A read-only tool layer the agent can call: PromQL metric queries, Loki log search, `kubectl describe/logs`, and recent-deploy lookups.
  • A Claude tool-use loop: plan → call a tool → observe results → reason about hypotheses → repeat under a step budget.
  • A ranked root-cause report with the evidence trail, posted to Slack in a clean, scannable format.
  • Human-in-the-loop guardrails: every mutating action (rollback, scale, restart) is gated behind a Slack approval button.

The production architecture

The same agentic tool-use loop that powers Datadog Bits and Resolve AI — not a single prompt, but an iterative cycle of plan → act → observe → reason.

AI SRE — Autonomous Incident Investigator — architecture diagram
The agentic loop: an alert triggers an investigation that queries read-only tools, reasons over the evidence, and reports a root cause — with a human approving any fix.
  1. 1

    Trigger

    An alert fires (Alertmanager/Grafana) and hits a webhook that opens an investigation with the alert's labels, severity, and timeframe.

  2. 2

    Plan

    The LLM looks at the alert plus the evidence gathered so far and decides which tool to call next — exactly like Datadog Bits' 'deep research agent.'

  3. 3

    Observe

    Call one read-only tool (a PromQL query, a Loki log search, `kubectl describe`, recent deploys). Append the raw results to the agent's working context.

  4. 4

    Reason

    Form and refine root-cause hypotheses, scoring each against the evidence. Loop back to Plan until confident or the step budget is hit.

  5. 5

    Report

    Write a ranked RCA — most-likely cause first — with the evidence trail and a concrete suggested remediation. Post to Slack.

  6. 6

    Approve

    A human reviews. Any mutating action runs only on explicit approval. This is how every credible production system actually ships.

The stack

Claude (tool use)
The reasoning engine that drives the agentic loop and chooses tools
Python + FastAPI
The webhook receiver and agent orchestrator
Prometheus + PromQL
Metrics tool — latency, error rate, saturation
Loki (or Elasticsearch)
Logs tool — search around the incident window
Kubernetes API
Cluster-state tool — pod status, events, recent rollouts
Slack API
Delivery of the RCA + the human approval gate
kind + Docker
A local cluster to stage a real, breakable incident for the demo
HolmesGPT (reference)
The open-source CNCF agent whose loop you're modeling

Build milestones

01

Stage a real incident

Spin up a breakable demo app on kind with Prometheus, Loki, and Alertmanager. Trigger a genuine CrashLoopBackOff or OOMKilled so the agent has something real to solve.

02

Build the read-only tools

Implement PromQL query, Loki log search, kubectl describe/logs, and recent-deploys as typed, safe tools the LLM can call.

03

Wire the agentic loop

Plan → call tool → observe → reason → repeat, with a step budget and a stop condition. This is the heart of the build.

04

Generate the RCA report

Turn the agent's evidence trail into a ranked, human-readable root-cause analysis and post it to Slack.

05

Add human-in-the-loop

Put remediation (rollback/scale) behind a Slack approval button — no blind auto-fixes.

06

Evaluate honestly

Run it against 3–5 seeded incidents. Show where it nails the root cause and where it still needs a human — credibility over hype.

What you'll learn

The production agentic tool-use loop — why it beats a single mega-prompt
Designing safe, read-only tools an LLM can call without breaking prod
Grounding an agent in live telemetry: metrics, logs, and Kubernetes state
Human-in-the-loop guardrails for any action that mutates infrastructure
Evaluating an ops agent honestly (the ITBench reality check)

⚖️ The production reality (teach this on camera)

Be honest on camera: the real systems gate every mutating action behind human approval, and IBM's ITBench benchmark shows even top models resolve only ~14% of real incidents end-to-end. We're building assisted, evidence-backed investigation — fast root-cause, not blind auto-remediation. That honesty is exactly what earns trust with senior engineers.

Get this build the moment it ships

Code, video, and write-up — delivered to your inbox. Plus every other production-grade agentic DevOps build.

Subscribe to Newsletter

Get the latest articles and tutorials delivered to your inbox.

We respect your privacy. Unsubscribe at any time.