The AWS October 20th Outage Dissection
15 hours. 113 services. One empty DNS record.
On October 20, 2025, AWS experienced what would become its longest outage in a decade. The thing is, it wasn't a cyberattack, a catastrophic hardware failure, or even a deployment gone wrong. It was a race condition in DNS management, a textbook distributed systems problem that somehow survived years of production load.
If you’re thinking “that could never happen to us,” I’ve got news: it already has, or it will.
A Race Condition Masterclass
The Architecture (Before Everything Broke)
DynamoDB’s DNS management system was elegantly designed with resilience in mind:
DNS Planner: Monitors load balancer health and generates DNS plans
DNS Enactor: Three independent instances across different AZs, each applying plans to Route53
Safety mechanism: Each Enactor checks that its plan is newer before applying
Sound familiar? It should. This is exactly the kind of distributed consensus architecture we all build. And it’s exactly where race conditions love to hide.
The Perfect Storm
Here’s what happened at 11:48 PM PDT on October 19:
Enactor A picks up a DNS plan and starts applying it
Due to unusual delays, Enactor A takes forever to work through endpoints
DNS Planner keeps running, generating newer plans (as designed)
Enactor B grabs a fresh plan and blazes through all endpoints
Enactor B finishes and triggers cleanup, which deletes “old” plans
Here’s the race: Enactor A (still stuck on its ancient plan) finally reaches the regional DynamoDB endpoint and overwrites the new plan with the old one
Enactor B’s cleanup process deletes this now-active old plan
Result: An empty DNS record for dynamodb.us-east-1.amazonaws.com
The staleness check Enactor A performed at the start was now meaningless hours later. The system was designed to handle concurrent updates but not this specific timing scenario.
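The interleaving above can be compressed into a small sketch. Everything here (the `Route53State` class, plan versions as integers) is our illustration of the timing, not AWS's actual code:

```python
# Hypothetical sketch of the race described above, with the steps
# flattened into their unlucky order.

class Route53State:
    """Stands in for the DNS records plus the plan store."""
    def __init__(self):
        self.active_plan = {"version": 1, "records": ["ip-old"]}
        self.plan_store = {1: ["ip-old"], 2: ["ip-new"]}

    def cleanup_older_than(self, version):
        # Enactor B's cleanup: delete plans older than the one it applied.
        for v in list(self.plan_store):
            if v < version:
                del self.plan_store[v]
        # If the *active* plan was just deleted, the record is now empty.
        if self.active_plan["version"] < version:
            self.active_plan = {"version": None, "records": []}

dns = Route53State()

# Enactor A: staleness check passes at the START (plan 1 is the newest
# it has seen), then it stalls for hours.
plan_a = {"version": 1, "records": dns.plan_store[1]}

# Meanwhile Enactor B applies the newer plan 2 and finishes first.
dns.active_plan = {"version": 2, "records": dns.plan_store[2]}

# Enactor A, hours later, applies its stale plan with no re-check.
dns.active_plan = plan_a

# Enactor B's cleanup now deletes plan 1 -- the plan that is active.
dns.cleanup_older_than(2)

print(dns.active_plan["records"])  # [] -- an empty DNS record
```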
The Cascading Failure
When DynamoDB’s DNS disappeared:
Immediate impact: All new connections to DynamoDB failed
EC2’s DropletWorkflow Manager: Uses DynamoDB for lease management. Leases started timing out across the fleet.
Congestive collapse: When DynamoDB recovered, DWFM tried to re-establish leases but entered a death spiral: work took so long that leases timed out before completion, queuing even more work.
Network Manager: Once DWFM recovered, it had a massive backlog of network configurations to propagate. New EC2 instances launched but had no network connectivity.
NLB health checks: Started failing because they were checking instances whose network state hadn’t propagated yet, causing nodes to flap in and out of service.
It took engineers 15 hours to fully untangle this mess.
Design Lessons: What Should Keep You Up at Night
1. Staleness Checks Are Time Bombs
```python
# This pattern is everywhere in code:
if plan.version > current_version:  # Check at START
    # ... hours of processing ...
    apply_plan(plan)                # Apply at END
```

The problem: In distributed systems with high latency, the check and the action are separated by an eternity. Between them, the world changes.
The fix: Version checks must be atomic with the operation. Use compare-and-swap, optimistic locking, or fencing tokens.
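A minimal sketch of that fix, assuming a single-process stand-in where a lock plays the role of the data store's conditional write (in a real system this would be a compare-and-swap in the store itself):

```python
import threading

class PlanStore:
    """Illustrative store where the version check and the write
    are a single atomic step, so a stale plan can never win."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.records = []

    def apply_if_newer(self, plan_version, records):
        """Atomically apply only if this plan is still the newest."""
        with self._lock:
            if plan_version <= self.version:  # re-check at APPLY time
                return False                  # stale -> refuse
            self.version = plan_version
            self.records = records
            return True

store = PlanStore()
assert store.apply_if_newer(2, ["ip-new"]) is True   # fresh plan wins
assert store.apply_if_newer(1, ["ip-old"]) is False  # stale plan rejected
print(store.records)  # ['ip-new']
```

The fencing-token variant works the same way: the token is the version, and the resource rejects any write carrying a token older than the last one it accepted.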
2. Recovery Mechanisms Can Become Attack Mechanisms
DynamoDB’s DNS system was designed to recover from failures. The cleanup process that deleted old plans? That was a feature, not a bug. Until the race condition turned it into the kill switch.
Ask yourself: What happens when your recovery automation runs at the worst possible time? Have you tested your system’s behavior when recovery and failure happen simultaneously?
3. The Lease Pattern Doesn’t Scale Under Stress
EC2’s DropletWorkflow Manager is a classic example of the lease pattern breaking down:
Under normal conditions: ✅ Leases renew before timeout
Under stress: ❌ Processing takes longer than timeout → lease expires → more work queued → congestive collapse
The lesson: Lease-based systems need backpressure mechanisms and circuit breakers for the renewal process itself.
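One way to sketch that lesson, with hypothetical names (`LeaseManager`, `record_renewal`) rather than anything from EC2's actual design: when renewals start consuming too much of the lease TTL, reject new work instead of queuing it.

```python
class LeaseManager:
    """Illustrative lease manager with a circuit breaker on renewal
    latency: shed load instead of spiraling into congestive collapse."""
    def __init__(self, ttl_seconds, max_renew_fraction=0.5):
        self.ttl = ttl_seconds
        self.budget = ttl_seconds * max_renew_fraction
        self.pending = 0
        self.shedding = False

    def record_renewal(self, took_seconds):
        # Breaker trips when renewals eat over half the TTL;
        # it resets once renewals are fast again.
        self.shedding = took_seconds > self.budget

    def submit(self, work_item):
        if self.shedding:
            return False       # backpressure: reject, don't queue
        self.pending += 1
        return True

mgr = LeaseManager(ttl_seconds=10)
mgr.record_renewal(took_seconds=2)               # healthy
assert mgr.submit("configure-droplet") is True
mgr.record_renewal(took_seconds=8)               # renewals now too slow
assert mgr.submit("configure-droplet") is False  # shed instead of spiral
```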
4. Dependencies Create Blast Radius Multipliers
Look at this dependency chain:
DynamoDB DNS → DynamoDB APIs → DWFM → EC2 Launches → Network Manager → NLB Health Checks → Lambda → 100+ AWS services

One empty DNS record brought down 113 services. The original issue was resolved in ~3 hours. Full recovery took 15 hours because of the dependency cascade.
The hard question: Map your critical path dependencies. How many single points of failure do you have? (Hint: It’s more than you think.)
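One way to start answering: model the chain as a graph and compute the transitive impact of a single failure. The graph below is a toy version of the chain above, and the code is illustrative:

```python
from collections import deque

# component -> services that depend on it (toy version of the chain)
dependents = {
    "dynamodb-dns": ["dynamodb-api"],
    "dynamodb-api": ["dwfm", "lambda"],
    "dwfm": ["ec2-launch"],
    "ec2-launch": ["network-manager"],
    "network-manager": ["nlb-health-checks"],
    "nlb-health-checks": ["lambda"],
}

def blast_radius(root):
    """Everything transitively impacted when `root` fails (BFS)."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("dynamodb-dns")))
# One record down -> every service in the chain is in the blast radius.
```

Running this over your real service graph is a cheap way to find the components whose blast radius is "everything."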
5. Multi-Region Doesn’t Mean What You Think It Means
DynamoDB global tables worked fine—in other regions. But here’s the catch: The DNS Planner and Enactor automation? Disabled worldwide after this incident.
A failure in US-EAST-1 triggered a global operational change. Why? Because the underlying bug existed everywhere, it just hadn’t been triggered yet.
Reality check: Your multi-region setup protects against regional failures. It doesn’t protect against architectural bugs that exist in all regions simultaneously.
The Numbers That Matter
15 hours: Total duration from start to full recovery
113 services: Number of AWS services impacted
11 million: Downdetector reports globally
2,500 companies: Affected at peak
3 hours: Time to fix the root cause (DynamoDB DNS)
12 hours: Time to recover from cascading failures
The root cause was resolved relatively quickly. The cascading failures took 4x longer to clean up.
What AWS Is Doing (And How You Can Learn From This)
AWS’s Response:
✅ Disabled the DNS automation globally (immediate)
🔄 Fixing the race condition before re-enabling
🔄 Adding protections to prevent incorrect DNS plan application
🔄 Adding velocity controls to NLB health check failovers
🔄 Building scale tests for DWFM recovery workflows
🔄 Improving rate limiting based on queue depth
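The last item above, rate limiting based on queue depth, can be sketched as a simple linear shed. This is our assumption about the general technique, not AWS's implementation:

```python
def allowed_rate(base_rate, queue_depth, target_depth):
    """Shrink the admitted rate once queue depth exceeds the target,
    inversely proportional to the overload."""
    if queue_depth <= target_depth:
        return base_rate
    return base_rate * target_depth / queue_depth

# Healthy queue: full rate. Overloaded 4x: admit a quarter of the rate,
# so the queue can drain instead of growing without bound.
assert allowed_rate(100, queue_depth=50, target_depth=100) == 100
assert allowed_rate(100, queue_depth=400, target_depth=100) == 25.0
```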
How to learn from this:
This week:
Audit your distributed consensus mechanisms for similar race conditions
Review your staleness checks—are they atomic with the operations they protect?
Map your critical dependency chains and identify blast radius multipliers
This month:
Build chaos tests for your recovery mechanisms
Implement backpressure for any lease-based systems
Test your multi-region setup under partial failure scenarios (not just full region failure)
This quarter:
Review every “this should never happen” assumption in your codebase
Build observability into your consensus protocols
Create runbooks for congestive collapse scenarios
The Uncomfortable Truth
AWS employs some of the best distributed systems engineers on the planet. They have formal verification, chaos engineering, and extensive testing. They still had a race condition in production that survived years of load.
This isn’t a story about AWS failing. This is a story about distributed systems being fundamentally hard.
The question isn’t “could this happen to us?”
The question is: When it happens to you, will you survive it?


