marcelsud.me

Cloudflare Outage June/2025: Lessons for Software Engineers - Marcelo Santos (@marcelsud)

6/16/2025 (Updated 7/18/2025)

Excerpt

This obscure storage provider turned out to be the backbone of Cloudflare's Workers Key-Value (KV) service, a critical piece of infrastructure that thousands of applications depended on for everything from user sessions to configuration data. When it stumbled, the dominoes began to fall:

- **91%** of Workers KV requests started failing
- **100%** failure rate on Access logins
- **90%+** error rate on Stream
- Workers AI, Images, Turnstile, and parts of Dashboard also affected
- Thousands of dependent applications around the globe started throwing errors
- Customer support channels lit up like Christmas trees

…

### The Failure Timeline

**Ground Zero (T+0 minutes)**: A third-party storage provider experiences internal issues. Most of the world doesn't notice yet.

**Primary Impact (T+5 minutes)**: Cloudflare's Workers KV service starts timing out. Alert dashboards begin showing yellow warnings that soon turn red.

**Secondary Impact (T+15 minutes)**: Services that depend on Workers KV (Access for corporate authentication, Stream for video delivery, Workers AI for machine learning inference) start failing completely. These aren't graceful degradations; they're hard failures.

**Tertiary Impact (T+30 minutes)**: Customer applications that relied on these services start experiencing outages. E-commerce sites can't authenticate users. Streaming platforms can't deliver content. AI-powered features simply disappear.

**Ecosystem Impact (T+60 minutes)**: The blast radius has now extended to millions of end users who have no idea what "Workers KV" means. They just know their favorite apps aren't working.

This progression reveals something crucial about modern distributed systems: ...

Because of what I call the "invisible dependencies problem." When you're building at scale, you tend to think about your immediate dependencies: the databases you talk to, the APIs you call, the services you integrate with. But you rarely map the dependency tree three or four levels deep.
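
To make the "invisible dependencies problem" concrete, here is a minimal sketch of mapping a dependency tree several levels deep. The graph, the service names (`shop-frontend`, `third-party-storage`, and so on), and the helper functions are all hypothetical and only loosely mirror the cascade described above; they are not taken from Cloudflare's actual architecture.

```python
from collections import deque

# Hypothetical dependency graph: service -> its direct dependencies.
# Names are illustrative assumptions, not a real service inventory.
DEPENDENCIES = {
    "shop-frontend": ["access"],
    "access": ["workers-kv"],
    "stream": ["workers-kv"],
    "workers-kv": ["third-party-storage"],
    "third-party-storage": [],
}


def transitive_dependencies(graph, root):
    """Return {dependency: depth} for everything reachable from root (BFS)."""
    depths = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            # Record each dependency once, at its shallowest depth.
            if dep not in depths and dep != root:
                depths[dep] = depth + 1
                queue.append((dep, depth + 1))
    return depths


def blast_radius(graph, failed):
    """Return the services that depend on `failed`, directly or indirectly."""
    return {svc for svc in graph if failed in transitive_dependencies(graph, svc)}


if __name__ == "__main__":
    print(transitive_dependencies(DEPENDENCIES, "shop-frontend"))
    print(sorted(blast_radius(DEPENDENCIES, "third-party-storage")))
```

In this toy graph, `shop-frontend` declares only one direct dependency, yet the walk surfaces a storage provider three levels down, which is exactly the kind of dependency that tends to stay invisible until it fails.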

Source URL

https://marcelsud.me/en/cloudflare-outage-analysis-june-2025/

Related Pain Points