When "Distributed Backup" Isn't Actually Distributed: Lessons From the Coinbase Outage

The May 2026 AWS thermal event, the latency-versus-resilience tradeoff that broke, and what happens when “distributed backup” isn’t actually distributed

May 16, 2026

TLDR

23:50 UTC, Thursday, May 7, 2026. A data center hall in Northern Virginia overheats. Cooling units in availability zone use1-az4 give up. EC2 instances and EBS volumes start losing power inside the hour.

Coinbase goes dark. Seven hours. Trading, Prime, International, derivatives, balance updates, all gone. FanDuel goes down at 9pm ET right as Game 2 of Lakers-Thunder tips off. CME Direct logs go sideways for institutional traders.

Coinbase had multi-AZ on most of their stack. But not the matching engine. That one ran in a single zone, on purpose, for latency. They had a backup. The backup didn’t isolate from the failure the way it was supposed to.

This is a story about a tradeoff that finally got tested in production. And about the difference between high availability and disaster recovery, which is not the same thing, no matter how many architecture diagrams pretend otherwise.

What actually broke

The root cause was not software. It was a building. Multiple chillers in a single data center hall failed. Temperatures climbed. AWS lost power to racks in availability zone use1-az4. EC2 instances and EBS volumes on those racks were physically damaged, not “marked unhealthy,” damaged.

AWS shifted traffic away from the zone, but recovery depended on getting cooling back online before they could safely bring damaged hardware up again. That took more than 20 hours. Cooling was stable at pre-event levels by 13:50 PT (20:50 UTC) Friday afternoon.

This is the part of the story no architecture diagram makes visible. The internet still runs in buildings, and buildings can overheat.

If that framing sounds familiar, it should. The October 20, 2025 AWS outage had the same shape from a different angle: 15 hours, 113 services, traced back to one empty DNS record. Different failure mode, same conclusion: at AWS’s scale, the failures that take down half the internet are rarely the ones you’d architect for.

The Coinbase tradeoff

Here’s the part that matters for engineers.

Coinbase had multi-AZ. They said so plainly: most of their systems are designed to survive a single AZ failing. Most of them did.

The matching engine didn’t.

Rob Witoff, Head of Platform, was honest about why. The exchange runs in a single availability zone by design. Latency. Customer co-location. Real money on the line, measured in microseconds. Spreading it across zones would have meant accepting cross-AZ network hops on every order match. For an exchange that competes on speed, that is not a free trade.

So they made the call. Single AZ for the matching engine. And — this is the part everyone glosses over — they built a backup. A distributed copy of the exchange infrastructure, designed to take over if the primary zone died.

The backup did not work as expected.

Witoff was specific: “backup systems did not work as expected during the incident, extending the outage and forcing engineers to manually execute disaster recovery procedures.” Engineers had to develop, test, deploy, and validate a fix while the production system was on fire.

Kafka made it worse. Coinbase runs partitioned Kafka handling thousands of terabytes a day. That couldn’t fail over automatically either. It needed manual recovery. Balance streams lagged behind until replication caught up.
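One way to make “caught up” concrete is to gate on per-partition consumer lag before trusting downstream balances. The sketch below is illustrative only: the PartitionOffsets type, the offsets, and the max_lag threshold are assumptions, not anything Coinbase has described. In a real system the offsets would come from your Kafka client or monitoring.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartitionOffsets:
    """Hypothetical snapshot of one partition's offsets."""
    partition: int
    end_offset: int        # latest offset written to the partition
    committed_offset: int  # last offset the consumer has processed


def partition_lag(p: PartitionOffsets) -> int:
    """Messages written but not yet processed (never negative)."""
    return max(p.end_offset - p.committed_offset, 0)


def caught_up(partitions: Iterable[PartitionOffsets], max_lag: int = 0) -> bool:
    """True only when every partition is within max_lag of the log end."""
    return all(partition_lag(p) <= max_lag for p in partitions)


# Made-up numbers: partition 1 is still 5,000 messages behind.
snapshot = [
    PartitionOffsets(partition=0, end_offset=10_000, committed_offset=10_000),
    PartitionOffsets(partition=1, end_offset=12_000, committed_offset=7_000),
]
ready = caught_up(snapshot)
```

The check is per partition on purpose. An average across partitions can look healthy while one hot partition is still hours behind.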

No data was lost. But seven hours of downtime is seven hours of downtime.

“Distributed backup” is not the same as “distributed”

This is the lesson worth dwelling on.

A backup that depends on the same failure domain as the primary is not a backup. It is a copy.

Coinbase had a distributed copy of the exchange. But “distributed” without specifying what failure mode it survives is marketing. If your backup database lives in the same AZ as your primary, an AZ outage takes both. If your backup region depends on your primary region’s IAM control plane, a region outage takes both.
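A quick way to pressure-test the word “distributed” is to write down every failure domain each deployment depends on and intersect the two sets. This is a minimal sketch under stated assumptions: the Deployment type, the domain labels, and the example topology are all hypothetical and do not describe Coinbase’s actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of what a deployment depends on."""
    name: str
    domains: frozenset  # e.g. {"az:use1-az4", "region:us-east-1"}


def shared_failure_domains(primary: Deployment, backup: Deployment) -> frozenset:
    """Failure domains whose loss would take out both copies."""
    return primary.domains & backup.domains


def backup_survives(failed_domain: str, backup: Deployment) -> bool:
    """True if the backup does not depend on the failed domain."""
    return failed_domain not in backup.domains


# Illustrative topology: different zones, same region and control plane.
primary = Deployment(
    "exchange-primary",
    frozenset({"az:use1-az4", "region:us-east-1", "control-plane:us-east-1"}),
)
backup = Deployment(
    "exchange-backup",
    frozenset({"az:use1-az6", "region:us-east-1", "control-plane:us-east-1"}),
)

overlap = shared_failure_domains(primary, backup)
zone_ok = backup_survives("az:use1-az4", backup)
region_ok = backup_survives("region:us-east-1", backup)
```

In this made-up example the backup survives the zone failure but not a regional one. The overlap set is the list of failure modes the backup does not cover.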

The principle generalizes brutally: your failover only works against failure modes you actually tested it against. Untested failover is not failover. It is a hypothesis.

GitLab learned this the hard way in 2017. When an engineer accidentally deleted 300GB of production data, they discovered their backup system had been broken for weeks. Five backup mechanisms. None of them worked. Different failure mode, same shape: a safety net that nobody had pulled on recently.

The Coinbase incident is the kinder version of the same story. Their backup existed. Their backup partially worked. But the bits that didn’t work showed up only under the specific conditions of the specific failure that actually happened, which is exactly the moment when you can’t afford to discover them.

If your DR plan has not been exercised in the last 90 days, treat it as untested.
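That 90-day rule is easy to automate. Here is a rough sketch, assuming you keep a record of when each plan was last exercised. The plan names and dates are invented for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

MAX_DRILL_AGE = timedelta(days=90)  # the 90-day threshold from the text


def untested_plans(last_drill: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return plans never drilled, or last drilled more than 90 days ago."""
    stale = [
        name
        for name, drilled_on in last_drill.items()
        if drilled_on is None or today - drilled_on > MAX_DRILL_AGE
    ]
    return sorted(stale)


# Hypothetical drill log.
drills = {
    "matching-engine-failover": date(2026, 1, 10),  # 126 days before "today"
    "kafka-recovery": None,                         # never exercised
    "database-restore": date(2026, 4, 1),           # 45 days before "today"
}
stale = untested_plans(drills, today=date(2026, 5, 16))
```

Run something like this on a schedule and page on a non-empty result. A plan with no recorded drill is treated the same as a stale one.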

High availability and disaster recovery are different problems

Every time something like this happens, the postmortem comment section fills up with “they should have used multi-AZ.” This is half-right and entirely beside the point.

Multi-AZ is high availability. It protects you from a bad day in one zone. Coinbase already had it for most workloads.

Multi-region is disaster recovery. It protects you from a bad day in one region.

AWS treats availability zones as the failure domain for HA. It treats regions as independent on purpose. A thermal event in us-east-1 will not move your data to us-west-2 unless you have explicitly built that path, paid for the replication, tested the failover, and decided in advance who gets to push the button.

Most production systems need both. Most production systems have one.

The reason is cost. Cross-region replication is real money. Hot standby in a second region doubles your compute bill. And the day-to-day value of that spending is invisible, until the day the chillers fail, at which point the conversation stops being about budget and becomes about how fast you can recover.

This is the part where engineering leadership earns its salary. Resilience investments look expensive right up until the moment they look obvious.

How Coinbase actually recovered

The recovery sequence is worth studying because it shows what disciplined incident response looks like under load.

When the matching engine came back, they did not just re-enable trading and let it rip. They staged it:

1. Cancel-only mode: customers can cancel existing orders, but no new trades
2. Auction mode: orders accumulate with no immediate matching, which gives time to verify the books are consistent
3. Live trading: only after product-by-product health checks

This is not glamorous. It is the engineering equivalent of bringing a power plant back online — slow, sequential, fully verified before each step. The temptation in a multi-hour outage is to fix the thing and flip the switch. Coinbase didn’t. They reconciled state first. No data was lost. That outcome is downstream of this discipline, not luck.
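The three stages above map naturally onto a small state machine that only moves forward one step at a time, and only when a health check for the next stage passes. This is a sketch of the idea, not Coinbase’s implementation. The Mode names, the action strings, and the check callback are all assumptions.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    HALTED = "halted"
    CANCEL_ONLY = "cancel_only"
    AUCTION = "auction"
    LIVE = "live"


# Each stage has exactly one successor; LIVE is terminal.
_NEXT_STAGE = {
    Mode.HALTED: Mode.CANCEL_ONLY,
    Mode.CANCEL_ONLY: Mode.AUCTION,
    Mode.AUCTION: Mode.LIVE,
}

# What a product accepts in each stage (hypothetical action names).
_ALLOWED = {
    Mode.HALTED: frozenset(),
    Mode.CANCEL_ONLY: frozenset({"cancel"}),
    Mode.AUCTION: frozenset({"cancel", "place"}),  # orders rest, nothing matches
    Mode.LIVE: frozenset({"cancel", "place", "match"}),
}


def allows(mode: Mode, action: str) -> bool:
    """True if the given action is permitted in this stage."""
    return action in _ALLOWED[mode]


def advance(mode: Mode, check_passed: Callable[[Mode], bool]) -> Mode:
    """Move one stage forward only if the check for the target stage passes."""
    target = _NEXT_STAGE.get(mode)
    if target is None or not check_passed(target):
        return mode  # stay put: no skipping stages, no advancing on a failed check
    return target


# Example: books are verified for cancel-only and auction, but not yet for live.
mode = Mode.HALTED
for _ in range(5):
    mode = advance(mode, lambda target: target is not Mode.LIVE)
```

Run one of these per product and you get the product-by-product rollout for free. The design choice that matters is that there is no path from halted straight to live.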

The public communication followed the same pattern. CEO Brian Armstrong posted within hours: “never acceptable.” Witoff followed with a technical thread giving the actual timeline, the actual root cause, the actual decision being revisited. No PR-speak. No “we are looking into it.” Specific.

That is what a public post-mortem looks like when the engineering org is the one writing it, not legal.

The SLA is not your business continuity plan

One detail from the AWS side worth internalizing. The standard EC2 SLA pays out around 10% of monthly compute spend on impacted instances. That’s the entire compensation.

Lost revenue: not covered. Customer trust: not covered. Regulatory exposure: not covered.

The 2024 ITIC survey put hourly downtime cost above $300,000 for 90% of mid-to-large enterprises. 41% lose between $1M and $5M per hour. Trading and finance are higher still. Coinbase was down for seven hours during a quarter where they had also just announced a 14% workforce reduction and a $394M net loss. The 10% AWS service credit was not going to move that needle.

Your cloud provider’s SLA is a small refund. It is not insurance. It is not a business continuity plan. The continuity plan is something you build, fund, and test yourself, or you don’t have one.
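To put rough numbers on that gap, here is a back-of-the-envelope sketch. The 10% credit, the seven hours, and the $300,000 hourly floor come from the figures above. The $500,000 monthly spend on impacted instances is a made-up input for illustration, not a Coinbase number.

```python
def sla_credit(monthly_spend_on_impacted: int, credit_percent: int = 10) -> float:
    """Service credit as a percentage of monthly spend on impacted instances."""
    return monthly_spend_on_impacted * credit_percent / 100


def downtime_cost(hours: int, cost_per_hour: int) -> int:
    """Simple linear estimate of what the outage cost the business."""
    return hours * cost_per_hour


# Hypothetical spend; substitute your own bill.
credit = sla_credit(500_000)        # 50,000
loss = downtime_cost(7, 300_000)    # 2,100,000 at the survey's floor figure
ratio = loss / credit               # 42x

print(f"credit ${credit:,.0f} vs loss ${loss:,.0f} ({ratio:.0f}x)")
```

Even at the survey’s lowest hourly figure, the estimated loss in this example is 42 times the credit. Plug in your own spend and hourly cost and the ratio rarely gets friendlier.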

Three things to do this week
