Byte-Sized Design


HubSpot's 37-Minute Lesson in Why HTTP 200 Can Lie

The permission check that passed, the users who were locked out, and what monitoring for "availability" actually misses

Byte-Sized Design
Apr 22, 2026

TL;DR

3:43 PM to 4:20 PM Eastern. 37 minutes. Every HubSpot customer lost the ability to click into contact, company, order, or project workflows in the UI. Deal-based and ticket-based workflows still worked. Every backend automation kept firing on schedule. No data lost. No execution missed.

And the whole thing flew under the radar because the endpoint that broke kept returning HTTP 200.

The incident monitoring didn’t catch it. The automated canary checks did, but it didn’t matter: a 60-minute alert threshold meant those failing tests weren’t going to page anyone until well after customers had already flooded support.

This is a textbook case of the thing we keep writing about: your observability is only as good as what you’re actually measuring. If you’re measuring “did the server respond,” you’re going to miss every bug that makes the server respond with the wrong answer. HubSpot’s post-mortem is refreshingly direct about this, and there’s a clean lesson in it for anyone running permission systems, feature flags, or anything else where the shape of a correct response matters more than its existence.

If you’ve been around for the Cloudflare July 2025 outage breakdown or the AWS October 20th dissection, this one will feel familiar. Different blast radius. Same category of failure.


So what actually happened?

HubSpot was rolling out a permissions framework update. The goal was reasonable: replace a broad shared scope with narrower, object-type-specific scopes for contact, company, order, and project workflows. Tighter permissions, better isolation. Standard stuff.

The rollout had two pieces:

  1. Create the new permission scopes.

  2. Promote the user role assignments that map those scopes to the right users.

Piece one made it to production. Piece two didn’t.

The staging environment had both pieces, so staging worked. Production had scopes without role assignments, so production’s access-control system went looking for user-role mappings that didn’t exist. When it couldn’t find them, it did what permission systems are supposed to do: fail closed. Deny access.
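
A minimal sketch of that failure mode, with hypothetical names throughout; the scope strings, the mapping table, and the access levels are illustrative, not HubSpot's actual schema:

```python
# Hypothetical sketch: scope definitions shipped to production,
# but the role assignments that grant them to users did not.

SCOPES = {
    "workflows:contact",
    "workflows:company",
    "workflows:order",
    "workflows:project",
}

# Staging had this populated; production did not.
ROLE_ASSIGNMENTS: dict[str, set[str]] = {}  # user_id -> granted scopes

def access_level(user_id: str, scope: str) -> str:
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    granted = ROLE_ASSIGNMENTS.get(user_id, set())
    # No mapping found: fail closed and deny rather than guess.
    return "full" if scope in granted else "none"

print(access_level("user-123", "workflows:contact"))  # -> "none"
```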

From the access-control system’s perspective, this was correct behavior. Users were asking about permissions the system couldn’t verify, so the system returned a restrictive access level.

From the user’s perspective, their workflows vanished.

The 200 that lied

Here’s the part worth dwelling on. The access endpoint returned HTTP 200 the whole time. The server didn’t crash. It didn’t throw. It didn’t log an error. It just returned a technically valid response that said “this user can barely do anything.” The frontend, doing its job, saw “barely anything” and hid the UI.

Most monitoring treats HTTP status codes as ground truth. 2xx is fine, 4xx is the client’s problem, 5xx pages the on-call. It’s a useful abstraction, and it’s wrong in exactly this scenario. The server is healthy. The payload is garbage.

We covered something very similar in how Twitch caught their invisible failures—streams that terminated “successfully” from the server’s point of view while users saw nothing. Same failure mode, different domain. When correctness lives in the response body rather than the status line, your dashboards need to look inside the response.
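
In practice that means a synthetic check that asserts on the payload, not the status line. A minimal sketch, assuming a hypothetical `/access/{user}` endpoint and a canary account whose granted scopes are known in advance:

```python
# Sketch of a synthetic check that validates the response body, not just
# the status code. The endpoint URL and payload shape are assumptions
# for illustration, not HubSpot's actual API.
import requests

def check_permissions_endpoint(base_url: str, canary_user: str) -> None:
    resp = requests.get(f"{base_url}/access/{canary_user}", timeout=5)

    # Status-code-only monitoring stops here, and would have stayed green.
    resp.raise_for_status()

    body = resp.json()
    # The semantic assertion: a canary user known to have full access
    # should actually be reported as having it.
    expected = {"workflows:contact", "workflows:company"}
    granted = set(body.get("granted_scopes", []))
    missing = expected - granted
    if missing:
        raise AssertionError(f"HTTP 200 but scopes missing: {missing}")
```

The request succeeding is table stakes; the check only passes if the body still says what a healthy system would say.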

Why the canary didn’t save them

HubSpot’s automated test suite did catch failures during the canary window. Those failures fired into a queue that was configured to wait 60 minutes before paging anyone.

Sixty minutes.

The deployment rolled out fully in 33 minutes. The entire incident lasted 37 minutes from first impact to rollback. The alerts would have arrived after the problem was already resolved.

Alert thresholds are a real tradeoff. Too tight and your on-call drowns in noise from flaky tests. Too loose and you get this. The right answer is rarely a single global threshold; it’s a threshold in context. Failures during an active deployment window are categorically different from failures on a quiet Tuesday morning, and HubSpot correctly calls that out in its remediation plan. Correlate the alerts with the deploys. Shrink the window to minutes during rollout.
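
One way to encode that context is to make the paging delay a function of deploy state instead of a constant. A sketch with made-up numbers and a hypothetical deploy-tracker lookup; wire it to whatever deployment-events source your pipeline actually exposes:

```python
# Sketch: alert thresholds as a function of deploy context, not a constant.
from datetime import datetime, timedelta, timezone

QUIET_THRESHOLD = timedelta(minutes=60)   # tolerate flaky tests normally
DEPLOY_THRESHOLD = timedelta(minutes=5)   # page fast during a rollout
DEPLOY_WINDOW = timedelta(minutes=45)     # stay cautious after a deploy

def paging_threshold(last_deploy: datetime | None) -> timedelta:
    now = datetime.now(timezone.utc)
    if last_deploy is not None and now - last_deploy < DEPLOY_WINDOW:
        return DEPLOY_THRESHOLD
    return QUIET_THRESHOLD

def should_page(first_failure: datetime, last_deploy: datetime | None) -> bool:
    age = datetime.now(timezone.utc) - first_failure
    return age >= paging_threshold(last_deploy)
```

Against this incident's timeline, a 33-minute rollout and 37 minutes of impact, the deploy-window threshold pages while a rollback still helps; the flat 60-minute one never would have.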

This is the kind of instrumentation gap that shows up over and over in post-mortems. For more on how to actually write these documents well instead of just surviving them, our tech lead’s guide to writing post-mortems covers the framing that distinguishes a useful post-mortem from a corporate apology.

The split-brain deployment
