Byte-Sized Design

How Discord Automates ScyllaDB Clusters at Scale

The framework that turned a 36-hour database operation into two hours of mostly waiting

Byte-Sized Design
Jun 15, 2026

TLDR

Discord’s Persistence Infrastructure team runs every ScyllaDB cluster behind messages, channels, servers, and most of Discord’s user data. Dozens of clusters. Hundreds of nodes. Seven engineers.

For years, the tooling was a pile of Python and bash scripts that worked but required someone who remembered all the landmines. Standing up a full shadow cluster, a complete replica that mirrors production traffic so you can safely test a new ScyllaDB release, took a day and a half of careful, sequential, do-not-screw-up-step-nine work.

They rebuilt it as the Scylla Control Plane (SCP): composable tasks, YAML-defined workflows, automatic retries, and resumable jobs. The same operation now takes under two hours, and most of that is the engineer doing something else while nodes bootstrap.
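To make "composable tasks, automatic retries, and resumable jobs" concrete, here is a minimal sketch of the general pattern, not SCP's actual design. Everything in it is an assumption for illustration: the task names, the `retries` field, the `run_workflow` helper, and the JSON checkpoint file are hypothetical, and the workflow is written as the Python list a YAML file might parse into rather than in SCP's real schema, which the source does not show.

```python
import json
from pathlib import Path

# Hypothetical workflow definition, shaped like what a YAML file might parse into.
# The task names and the "retries" field are illustrative, not SCP's schema.
WORKFLOW = [
    {"task": "provision_nodes", "retries": 2},
    {"task": "bootstrap_nodes", "retries": 3},
    {"task": "mirror_traffic", "retries": 1},
]


def run_workflow(workflow, tasks, state_path):
    """Run steps in order, retrying failures and skipping steps already done."""
    state_path = Path(state_path)
    # Load the checkpoint from a previous run, if there is one.
    done = json.loads(state_path.read_text()) if state_path.exists() else []
    for step in workflow:
        name = step["task"]
        if name in done:
            continue  # finished in an earlier run, so resume past it
        attempts = step.get("retries", 0) + 1
        for attempt in range(1, attempts + 1):
            try:
                tasks[name]()  # each task is a plain callable, so tasks compose
                break
            except Exception:
                if attempt == attempts:
                    raise  # out of retries; completed steps are already on disk
        done.append(name)
        state_path.write_text(json.dumps(done))  # checkpoint after each step
    return done
```

The point of the sketch is the checkpoint: because progress is written after every completed step, a job that dies at step nine can be rerun and pick up at step nine instead of starting over, which is the property the old scripts lacked.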

The interesting part isn’t that they automated database ops. It’s the specific shape of the automation, and three boring decisions that mattered more than any of the Rust.


Seven Engineers, Hundreds of Nodes

ScyllaDB is Discord’s largest database by scope. We’ve covered how that data layer evolved before: How Discord Indexes Trillions of Messages walked through the search side of the same infrastructure. The team that operates it is seven people.

That ratio sounds fine until you look at what “operating” actually means: rolling restarts after every config change, expanding clusters as servers fill up, rolling OS upgrades across hundreds of nodes with zero downtime, and validating every new ScyllaDB release on a shadow cluster before it touches production.

None of that is fire-and-forget. Each one demands careful sequencing, validation, and someone paying attention the entire time. We talked about why that kind of pre-production validation matters in The Tech Lead’s Guide to Load Testing: shadow clusters are basically load testing with real production traffic and real production stakes.

For years, Discord automated this the way most teams do: incrementally, under pressure, with no long-term plan. A Python script here, a bash script there. It worked. It also required deep institutional knowledge to run safely, and that’s the kind of debt that’s invisible until the one person who understands it is on vacation.


Three Ways the Old Scripts Failed
