How Datadog taught an AI to investigate high-severity incidents
Most incident tools are good at collecting evidence.
They’re bad at thinking with it.
If you’ve ever been on call, you know the feeling:
12 dashboards open
Logs screaming
Traces half-useful
And one suspicious metric you can’t ignore
The hard part isn’t access to data.
It’s deciding what to look at next.
That’s the problem Bits AI SRE is actually trying to solve.
This isn’t an AI summarizer (and that matters)
The early wave of “AI for ops” tools made a quiet assumption:
If we gather enough telemetry, the model can summarize its way to the root cause.
That turns out to be wrong.
More data doesn’t make incidents clearer.
It makes them noisier.
Bits AI SRE does something different.
It investigates like a team of human SREs:
Form a hypothesis
Pull targeted evidence
Validate or reject
Go deeper only when the signal earns it
That sounds obvious.
It isn’t.
Most tools still dump everything into context and hope the model figures it out.
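The hypothesis loop above can be sketched in a few lines. This is a hypothetical illustration, not Bits AI's actual implementation (which isn't public): `propose`, `query_telemetry`, and `supports` stand in for the agent's real hypothesis generator, targeted data fetchers, and validation step.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str                  # e.g. "commit latency is driving consumer lag"
    evidence: list = field(default_factory=list)

def investigate(alert, propose, query_telemetry, supports):
    """Hypothesis-driven loop: pull only the evidence each hypothesis needs."""
    frontier = propose(alert)                  # initial hypotheses from the alert
    while frontier:
        h = frontier.pop(0)
        h.evidence = query_telemetry(h)        # targeted query, not a full telemetry dump
        if supports(h):
            frontier.extend(propose(h))        # go deeper only when the signal earns it
            yield h                            # a validated step in the chain
        # rejected hypotheses are dropped, keeping context small

# Toy run with hypothetical data: the alert suggests two causes; only one holds up.
def propose(node):
    tree = {"alert": [Hypothesis("commit latency spiked"),
                      Hypothesis("upstream errors")]}
    key = node if isinstance(node, str) else node.claim
    return list(tree.get(key, []))

def query_telemetry(h):
    return ["p99 commit latency 4s"] if "latency" in h.claim else []

validated = list(investigate("alert", propose, query_telemetry,
                             lambda h: bool(h.evidence)))
print([h.claim for h in validated])  # → ['commit latency spiked']
```

The point of the structure: evidence is fetched per hypothesis, so the model's context only ever contains data that some hypothesis asked for.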
The key shift: causality over correlation
Here’s the most important design decision in this system:
The agent only looks at data that is causally related to a hypothesis.
Not “everything nearby.”
Not “everything noisy.”
Not “everything interesting.”
Just:
Does this explain why the alert fired?
In one real incident:
Kafka lag spiked
Commit latency spiked
Unrelated upstream errors were present
Earlier versions of the agent saw all of it
…and picked the wrong root cause.
The newer version ignored the noise and followed the causal chain:
commit latency → consumer lag → alert
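One way to picture that discipline is a causal-filter check: before a signal is pulled into context, ask whether it sits on a cause-to-effect path that reaches the alert. The sketch below is illustrative only; the edge data and names are hypothetical stand-ins for whatever dependency topology the real system uses.

```python
# Hypothetical cause -> effect edges for the incident described above.
CAUSAL_EDGES = {
    "commit_latency": {"consumer_lag"},
    "consumer_lag": {"alert"},
    "upstream_errors": {"unrelated_service"},  # noisy, but off the causal path
}

def explains_alert(signal, target="alert", edges=CAUSAL_EDGES):
    """Follow cause->effect edges; keep the signal only if a path reaches the alert."""
    seen, stack = set(), [signal]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False

print(explains_alert("commit_latency"))   # → True: commit latency -> consumer lag -> alert
print(explains_alert("upstream_errors"))  # → False: present in the telemetry, not causal
```

Under this filter, the upstream errors from the incident never enter the model's context at all, which is exactly how the newer version avoided the wrong root cause.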
That’s not an LLM trick.
That’s system design discipline.


