Byte-Sized Design

What Datadog’s Outage Taught Us About Hidden Dependencies

A subtle DNS change triggered cascading failures, but the recovery paved the way for stronger, more fault-tolerant systems

Byte-Sized Design
Mar 11, 2025


🚀 TL;DR

On September 24, 2020, Datadog’s US region suffered a multi-hour outage due to a failure in its service discovery system, a core component that lets services find their dependencies. A routine change to a latency-measuring cluster triggered a thundering herd of DNS requests, overloading the system and breaking service discovery across Datadog’s infrastructure.

📌 The Impact: What Went Down

🔻 Web tier & API – 9+ hours of degraded access
🔻 Logs & monitoring – Outages up to 12 hours
🔻 Alerts & APM – Extended failures, up to 15 hours
🔻 Infrastructure monitoring – Fully recovered after 15+ hours

Despite the disruption, incoming data was still being ingested and processed, but users couldn't access it.

🔍 What Happened?

This outage was not due to a security breach or infrastructure failure. Instead, it was caused by a subtle configuration mistake made a month earlier:

🚨 August: A change moved service discovery queries from a static file (resilient but slow) to a dynamic DNS resolver (fast but failure-prone).
🚨 September 24: A routine restart of a small, low-priority cluster caused a surge of DNS queries that overwhelmed the service discovery system.
🚨 Service discovery failed → Services couldn’t find dependencies → Outage spread rapidly.

🛑 Root Cause: A Tiny Change With Big Consequences

1️⃣ Overloaded Service Discovery

  • Datadog relies on a distributed service discovery cluster for routing dependencies.

  • The DNS resolver wasn’t caching NXDOMAIN (nonexistent domain) responses, so every lookup for a missing service hit the resolver again instead of being answered from cache.

  • A routine restart flooded the system with unnecessary DNS lookups, amplifying failure.
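Negative caching is the standard defense against this failure mode: remember for a short TTL that a name does not exist, so a restarting fleet doesn't re-ask the resolver the same doomed question thousands of times. The sketch below is illustrative, not Datadog's implementation; the names (`NegativeCachingResolver`, `resolve_upstream`) and the 30-second TTL are assumptions.

```python
import time

NXDOMAIN = object()  # sentinel meaning "this name is known not to exist"

class NegativeCachingResolver:
    """Wraps a real DNS lookup and caches both hits and NXDOMAIN misses."""

    def __init__(self, resolve_upstream, negative_ttl=30.0):
        self._resolve_upstream = resolve_upstream  # the actual DNS lookup
        self._negative_ttl = negative_ttl
        self._cache = {}  # name -> (expires_at, result_or_NXDOMAIN)

    def resolve(self, name):
        entry = self._cache.get(name)
        if entry and entry[0] > time.monotonic():
            cached = entry[1]
            if cached is NXDOMAIN:
                # Answer "no such service" from cache, sparing the resolver.
                raise LookupError(f"{name} (cached NXDOMAIN)")
            return cached
        try:
            result = self._resolve_upstream(name)
        except LookupError:
            # Cache the *absence* of the record too, so repeated lookups
            # for a missing service don't hammer the resolver.
            self._cache[name] = (time.monotonic() + self._negative_ttl, NXDOMAIN)
            raise
        self._cache[name] = (time.monotonic() + self._negative_ttl, result)
        return result
```

With this in place, a thousand services restarting and asking for the same nonexistent name generate one upstream query per TTL window instead of a thousand.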

2️⃣ Cascading Failures

  • As more services failed, they kept retrying, creating a feedback loop that brought down the web tier, alerts, and monitoring tools.

  • Many components couldn't start because they relied on dynamic configuration, which was unavailable.
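Breaking this kind of retry feedback loop usually comes down to capped exponential backoff with jitter: failing clients slow down and desynchronize instead of hammering the struggling dependency in lockstep. A minimal sketch, with illustrative parameter values:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Retry `operation`, sleeping a random ("full jitter") delay between
    attempts that grows exponentially up to `max_delay`."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the failure
            # Cap the exponential, then pick a random delay below the cap
            # so retrying clients spread out instead of synchronizing.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment, recreating the thundering herd on each cycle.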

3️⃣ Lack of Fallback Mechanisms

  • Had fallback service discovery mechanisms (like static files) remained in place, some services could have continued operating.
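The fallback idea can be sketched as a discovery client that tries the dynamic resolver first and, when it is unavailable, falls back to a last-known-good static snapshot on disk. The snapshot path and function names here are hypothetical, not Datadog's actual setup:

```python
import json

def discover(service, dynamic_lookup,
             snapshot_path="/etc/discovery/snapshot.json"):
    """Resolve `service` via the dynamic resolver, falling back to a
    periodically refreshed static snapshot if the resolver is down."""
    try:
        return dynamic_lookup(service)
    except Exception:
        # Dynamic discovery unavailable: serve possibly stale endpoints
        # from the last snapshot rather than failing outright.
        with open(snapshot_path) as f:
            snapshot = json.load(f)
        if service not in snapshot:
            raise LookupError(service)
        return snapshot[service]
```

The trade-off is staleness: the snapshot may point at endpoints that have since moved, but a degraded answer keeps dependent services running while the discovery system recovers.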

🤔 Lessons Learned


Continue reading this post for free, courtesy of Byte-Sized Design.
