news.ycombinator.com
Cloudflare outage on November 18, 2025 post mortem
Excerpt
How can you write the proxy without handling a config that contains more than the maximum feature limit you set yourself? How can the database export query not have a limit set if there is a hard limit on the number of features? Why do they make non-critical changes in production before testing them in a staging environment? …

Having a critical application issue ad-hoc commands against the system.* tablespace instead of using a well-tested library is just amateurism and, again, bad engineering. IMO it is good practice to identify all applications with system.* privileges and ensure their querying is completely separate from your application logic. System tables sometimes change, with fields added and/or removed; not planning for this will make future compatibility a nightmare. Not only the problematic query itself, but the whole context screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. ...

The database issue screamed at me: lack of expertise. I don't use CH, but seeing someone mess with a production system and then be surprised, "Oh, it does that?", is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual, or an edge case only discoverable from the source code; it's bread-and-butter knowledge. ...

But at the same time, what value do they add if they:
* Took down the customers' sites due to their bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. ...

Be it management focusing on the wrong things, be it developers not being in the wrong position or annoyed enough to care or something else entirely. However, not doing these things is (likely) a sign that they are currently not in a state to create reliable systems, at least none reliable enough for what they are doing. ...
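The first excerpt's question, what a consumer should do when a config exceeds its own feature limit, can be illustrated with a minimal sketch. It is not Cloudflare's code: `MAX_FEATURES`, `ConfigHolder`, and the other names are hypothetical, and the limit value is picked purely for illustration. The idea is to reject an oversized config and keep serving the last known-good one rather than failing hard.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_FEATURES = 200  # hypothetical hard limit, chosen only for illustration


class ConfigRejected(Exception):
    """Raised when a candidate config violates the feature limit."""


@dataclass(frozen=True)
class FeatureConfig:
    features: Tuple[str, ...]


def validate(feature_names: List[str]) -> FeatureConfig:
    """Return a FeatureConfig, or raise ConfigRejected if it is too large."""
    if len(feature_names) > MAX_FEATURES:
        raise ConfigRejected(
            f"{len(feature_names)} features exceeds limit of {MAX_FEATURES}"
        )
    return FeatureConfig(tuple(feature_names))


class ConfigHolder:
    """Keeps serving the last known-good config when an update is rejected."""

    def __init__(self, initial: FeatureConfig) -> None:
        self.current = initial
        self.rejected_count = 0  # something a monitor could alert on

    def try_update(self, feature_names: List[str]) -> bool:
        try:
            self.current = validate(feature_names)
            return True
        except ConfigRejected:
            # Do not crash: count the rejection and keep the old config.
            self.rejected_count += 1
            return False
```

A producer-side counterpart would be to bound the export query itself (for example with a `LIMIT` one above the maximum, so that an overflow is detectable), which is the second question the commenter raises.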
[1] And should make you adapt the process of analyzing issues, e.g. by making sure config changes are "very loud" in monitoring. It's one of the most easily tracked things that can go wrong, and it can be mapped to a point in time relatively easily compared to many other things. …

That said, I am totally fine with your use case in your application. ...

My worry is that this runtime panic behavior has unwittingly seeped into library code that is beyond our ability and scope to observe. Or that an organization sets a policy, but that the tools don't allow for rigid enforcement.
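One way to make config changes "very loud" and easy to map to a point in time, as the footnote suggests, is to emit a structured event on every change. The sketch below is only an assumption about what that could look like: `record_config_change`, the `config_audit` logger name, and the event fields are all hypothetical.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("config_audit")  # hypothetical logger name

_MISSING = object()  # distinguishes "key absent" from "key set to None"


def config_digest(config: dict) -> str:
    """Short, stable fingerprint of a JSON-serializable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_config_change(old: dict, new: dict, source: str) -> Optional[dict]:
    """Log a timestamped event if the config changed; return the event."""
    old_digest, new_digest = config_digest(old), config_digest(new)
    if old_digest == new_digest:
        return None  # nothing changed, stay quiet
    changed = sorted(
        key
        for key in old.keys() | new.keys()
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    )
    event = {
        "event": "config_change",
        "at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "old_digest": old_digest,
        "new_digest": new_digest,
        "changed_keys": changed,
    }
    # WARNING level so the event stands out next to routine INFO traffic.
    logger.warning(json.dumps(event))
    return event
```

With events like this on the same timeline as error-rate graphs, the question "did a config change land just before this started?" becomes a lookup rather than an investigation.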
Related Pain Points
Production Deployment Without Proper Testing Pipeline
(9) Changes are deployed directly to production without apparent dev/test/staging environments, causing widespread bugs to affect all users simultaneously. The lack of canary deployments and feature flags prevents quick rollback of breaking changes.
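The canary idea mentioned above can be sketched as a staged rollout that widens only while every updated host stays healthy, and rolls everything back otherwise. This is a simplified illustration, not any vendor's deployment system; `staged_rollout` and its callback parameters are hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.1, 1.0),
) -> bool:
    """Apply a change to growing fractions of hosts; roll back on failure.

    Returns True if the change reached every host, False if it was rolled back.
    """
    updated: List[str] = []
    for fraction in stages:
        # Always include at least one host so the first stage is a real canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            apply(host)
            updated.append(host)
        if not all(healthy(host) for host in updated):
            # Undo in reverse order; hosts beyond this stage were never touched.
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

In a real system the health check would wait for a soak period and consult monitoring before each stage; here it is a plain callback to keep the sketch self-contained.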
Lack of expressive data model understanding leads to poor schema design
(6) Development teams unfamiliar with expressive data modeling often fail to apply important constraints like foreign keys, instead relying on familiar application-level patterns. This results in databases without essential integrity constraints.
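As a small illustration of letting the database enforce integrity instead of application code, the sketch below uses Python's built-in `sqlite3` module. The `customers`/`orders` schema and the `open_db`/`add_order` helpers are invented for the example. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a foreign key and a CHECK constraint."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
        )
        """
    )
    return conn


def add_order(conn: sqlite3.Connection, customer_id: int, total_cents: int) -> bool:
    """Insert an order; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
                (customer_id, total_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because the constraints live in the schema, every writer is held to them, including scripts and tools that never pass through the application's own validation.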
Inadequate documentation for Cloudflare developer products
(6) Cloudflare's developer platform and products lack sufficient documentation, making it difficult for developers to understand and implement features effectively.