How Datadog taught an AI to investigate high-severity incidents
Most incident tools are good at collecting evidence.
They’re bad at thinking with it.
If you’ve ever been on call, you know the feeling:
12 dashboards open
Logs screaming
Traces half-useful
And one suspicious metric you can’t ignore
The hard part isn’t access to data.
It’s deciding what to look at next.
That’s the problem Bits AI SRE is actually trying to solve.
This isn’t an AI summarizer (and that matters)
The early wave of “AI for ops” tools made a quiet assumption:
If we gather enough telemetry, the model can summarize its way to the root cause.
That turns out to be wrong.
More data doesn’t make incidents clearer.
It makes them noisier.
Bits AI SRE does something different.
It investigates like a team of human SREs:
Form a hypothesis
Pull targeted evidence
Validate or reject
Go deeper only when the signal earns it
That sounds obvious.
It isn’t.
Most tools still dump everything into context and hope the model figures it out.
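The hypothesis loop above can be sketched in a few lines. This is a hypothetical illustration, not Bits AI's actual implementation (which isn't public): `propose`, `query_telemetry`, and `supports` stand in for the agent's real hypothesis generator, targeted data fetchers, and validation step.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str                  # e.g. "commit latency is driving consumer lag"
    evidence: list = field(default_factory=list)

def investigate(alert, propose, query_telemetry, supports):
    """Hypothesis-driven loop: pull only the evidence each hypothesis needs."""
    frontier = propose(alert)                  # initial hypotheses from the alert
    while frontier:
        h = frontier.pop(0)
        h.evidence = query_telemetry(h)        # targeted query, not a full telemetry dump
        if supports(h):
            frontier.extend(propose(h))        # go deeper only when the signal earns it
            yield h                            # a validated step in the chain
        # rejected hypotheses are dropped, keeping context small

# Toy run with hypothetical data: the alert suggests two causes; only one holds up.
def propose(node):
    tree = {"alert": [Hypothesis("commit latency spiked"),
                      Hypothesis("upstream errors")]}
    key = node if isinstance(node, str) else node.claim
    return list(tree.get(key, []))

def query_telemetry(h):
    return ["p99 commit latency 4s"] if "latency" in h.claim else []

validated = list(investigate("alert", propose, query_telemetry,
                             lambda h: bool(h.evidence)))
print([h.claim for h in validated])  # → ['commit latency spiked']
```

The point of the structure: evidence is fetched per hypothesis, so the model's context only ever contains data that some hypothesis asked for.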
The key shift: causality over correlation
Here’s the most important design decision in this system:
The agent only looks at data that is causally related to a hypothesis.
Not “everything nearby.”
Not “everything noisy.”
Not “everything interesting.”
Just:
Does this explain why the alert fired?
In one real incident:
Kafka lag spiked
Commit latency spiked
Unrelated upstream errors were present
Earlier versions of the agent saw all of it
…and picked the wrong root cause.
The newer version ignored the noise and followed the causal chain:
commit latency → consumer lag → alert
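One way to picture that discipline is a causal-filter check: before a signal is pulled into context, ask whether it sits on a cause-to-effect path that reaches the alert. The sketch below is illustrative only; the edge data and names are hypothetical stand-ins for whatever dependency topology the real system uses.

```python
# Hypothetical cause -> effect edges for the incident described above.
CAUSAL_EDGES = {
    "commit_latency": {"consumer_lag"},
    "consumer_lag": {"alert"},
    "upstream_errors": {"unrelated_service"},  # noisy, but off the causal path
}

def explains_alert(signal, target="alert", edges=CAUSAL_EDGES):
    """Follow cause->effect edges; keep the signal only if a path reaches the alert."""
    seen, stack = set(), [signal]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False

print(explains_alert("commit_latency"))   # → True: commit latency -> consumer lag -> alert
print(explains_alert("upstream_errors"))  # → False: present in the telemetry, not causal
```

Under this filter, the upstream errors from the incident never enter the model's context at all, which is exactly how the newer version avoided the wrong root cause.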
That’s not an LLM trick.
That’s system design discipline.


