
Reliability lessons from the 2025 Cloudflare outage - Gremlin

11/21/2025 (Updated 3/16/2026)

Excerpt

At the heart of this outage is a simple configuration change. Cloudflare’s Matthew Prince goes into extensive detail in his post-mortem, but in short, Cloudflare engineers implemented a change that caused a configuration file to double in size. This configuration file is used by their Bot Management system, which analyzes all requests traversing the Cloudflare network for bot traffic. Among other things, this helps Cloudflare and its customers distinguish between requests from bots and those from humans.

The problem is that Bot Management has an upper limit on the size of its configuration file. The new file became too large, which caused Bot Management to error and return HTTP 5XX error codes. The effect wasn’t immediate, though. The file took some time to propagate throughout their network, resulting in spikes in error rates that lasted only a few minutes each. It took around 90 minutes for the bad configuration to propagate fully and create a sustained high error rate.

This issue had a knock-on effect on other Cloudflare services, including Workers KV, Access, Turnstile, and the Dashboard. Because many of these services are dependent on each other (Workers KV is used for Access, Turnstile is used for Dashboard, etc.), a single failure in any one can have a cascading effect on the others. This is how a relatively small error in one service became an Internet-spanning outage.

> ... It wasn’t until the bad file was fully deployed to all systems that the error rate became steady.
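The failure mode described above can be sketched in a few lines: a consumer enforces a hard upper limit on its configuration and errors out entirely when an oversized file arrives, rather than degrading gracefully. This is a minimal illustrative sketch, not Cloudflare’s actual implementation; the function names and the limit value are assumptions chosen for clarity.

```python
MAX_FEATURES = 200  # hypothetical upper limit baked into the consuming service


def load_config_hard_fail(features: list[str]) -> list[str]:
    # Hard-fail behavior: any oversized configuration file takes the
    # service down, surfacing to clients as HTTP 5XX errors.
    if len(features) > MAX_FEATURES:
        raise RuntimeError("configuration exceeds feature limit")
    return features


def load_config_fail_open(features: list[str]) -> list[str]:
    # Safer alternative: reject the oversized input but keep serving,
    # e.g. by falling back to a last-known-good configuration (here,
    # truncation stands in for that fallback).
    if len(features) > MAX_FEATURES:
        return features[:MAX_FEATURES]
    return features
```

The difference between the two loaders is the difference between a validation error confined to one deploy pipeline and a sustained, network-wide outage: the first turns bad input into downtime, the second into degraded-but-available service.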

Source URL

https://www.gremlin.com/blog/reliability-lessons-from-the-2025-cloudflare-outage
