Reliability lessons from the 2025 Cloudflare outage - Gremlin
Excerpt
At the heart of this outage is a simple configuration change. Cloudflare's Matthew Prince goes into extensive detail in his post-mortem, but in short, Cloudflare engineers implemented a change that caused a configuration file to double in size. This file is used by their Bot Management system, which analyzes every request traversing the Cloudflare network for bot traffic; among other things, it helps Cloudflare and its customers distinguish requests from bots from requests from humans.

The problem is that Bot Management enforces an upper limit on the size of its configuration file. The new file exceeded that limit, which caused Bot Management to fail and return HTTP 5XX error codes. The effect wasn't immediate, though. The file took time to propagate throughout the network, producing spikes in error rates that lasted only a few minutes each. It took around 90 minutes for the bad configuration to propagate fully and create a sustained high error rate.

This issue had a knock-on effect on other Cloudflare services, including Workers KV, Access, Turnstile, and the Dashboard. Because many of these services depend on one another (Workers KV is used by Access, Turnstile by the Dashboard, and so on), a single failure in any one can cascade through the others. This is how a relatively small error in one service became an Internet-spanning outage.

> ... It wasn't until the bad file was fully deployed to all systems that the error rate became steady.
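The failure mode above suggests an obvious safeguard: validate a generated configuration file against the consumer's known size limit *before* propagating it, so a doubled file is rejected at build time rather than crashing the service at load time. A minimal sketch of such a guard, where `MAX_FEATURES`, the line-per-feature format, and the limit value are illustrative assumptions rather than Cloudflare's actual internals:

```python
# Hypothetical pre-deployment guard for a generated feature config.
# MAX_FEATURES and the one-feature-per-line format are assumptions
# made for this sketch, not details of Cloudflare's real system.

MAX_FEATURES = 200  # assumed hard limit enforced by the consuming service


def validate_feature_config(lines: list[str], max_features: int = MAX_FEATURES) -> None:
    """Raise if the config exceeds what the downstream service can load."""
    # Count non-empty, non-comment lines as features.
    features = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    if len(features) > max_features:
        raise ValueError(
            f"config has {len(features)} features, exceeds limit of "
            f"{max_features}; refusing to propagate"
        )


# A doubled file (e.g. from duplicated query rows) fails the check
# at build time instead of erroring inside the consumer:
doubled = [f"feature_{i}" for i in range(2 * MAX_FEATURES)]
try:
    validate_feature_config(doubled)
except ValueError as err:
    print(err)
```

The key design point is that the producer of the file, not just the consumer, knows the limit: the same constant that causes the consumer to fail becomes a deploy-blocking check one step earlier in the pipeline.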
Related Pain Points
Configuration file size limits causing cascading service failures
A configuration change caused a file to double in size, exceeding Bot Management's upper limit, resulting in HTTP 5XX errors that cascaded across dependent services (Workers KV, Access, Turnstile, Dashboard) into an internet-wide outage.
Slow configuration change propagation and delayed error detection
Bad configurations can take an extended time (here, around 90 minutes) to propagate across distributed infrastructure, producing intermittent errors before sustained failure and making rapid detection and rollback difficult.
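One common mitigation for this pain point is a staged rollout: push the change in waves and halt at the first wave whose error rate crosses a threshold, so a bad config never reaches the whole fleet. A minimal sketch, where the wave structure, the 5% threshold, and the `deploy`/`error_rate` hooks are assumptions for illustration, not any real deployment system's API:

```python
# Illustrative staged-rollout gate. The wave sizes, threshold, and
# deploy/error_rate callbacks are assumptions for this sketch.

from typing import Callable

ERROR_THRESHOLD = 0.05  # abort if more than 5% of requests return 5XX


def staged_rollout(
    waves: list[list[str]],
    deploy: Callable[[str], None],
    error_rate: Callable[[], float],
    threshold: float = ERROR_THRESHOLD,
) -> tuple[bool, int]:
    """Deploy wave by wave; return (completed, waves_attempted).

    After each wave, check the observed 5XX rate and stop the rollout
    before the change propagates any further if it looks unhealthy.
    """
    for i, wave in enumerate(waves):
        for host in wave:
            deploy(host)
        if error_rate() > threshold:
            # Halt: the bad config stays confined to the early waves.
            return False, i + 1
    return True, len(waves)
```

Had the propagation described above been gated this way, the intermittent error spikes during the first waves would have stopped the rollout long before the 90-minute mark, instead of the bad file reaching every system.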