news.ycombinator.com

Cloudflare outage on December 5, 2025 - Hacker News

12/5/2025 (Updated 1/17/2026)

Excerpt

As to architecture: Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

…

- Centralizing most of the dependency on Cloudflare results in a major outage when something happens at Cloudflare. It is fragile because Cloudflare becomes the single point of failure. Like: Oh Cloudflare is down... oh, none of my SaaS services work anymore.

…

dfex 42 days ago

Putting Cloudflare in front of a site doesn't mean that site's backend suddenly never goes down. Availability will now be worse - you'll have Cloudflare outages* affecting all the sites they proxy for, along with individual site back-end failures which will of course still happen.

* which are still pretty rare

…

The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result. ... It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true. I had a manager who came from defense electronics in the 1980s. ... You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date, and not worth relying on because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.

…

What’s more concerning to me is that now we’ve had AWS, Azure, and Cloudflare (and Cloudflare twice) go down recently. My gut says:

1. Developers and IT are using LLMs in some part of the process, which will not be 100% reliable.
2. Current culture of "I have (some personal activity or problem)" or "we don’t have staff, AI will replace me, f-this."
3. Pandemic after effects.
4. Political climate / war / drugs; all are intermingled.

…

Two minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds, before the changes had even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.

…

Not only that, but their API/pricing is specifically designed to cover edge-cases that will force you to buy a license. For example, they don't expose an API to assign a co-host. You can do that via the UI, manually, but not via the API. Can you share which solution you are moving to?
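To make the "alerted to the spike in 500 errors within seconds" comment above concrete, here is a minimal sketch of a sliding-window error-rate check that could gate an automated rollback. Nothing here reflects Cloudflare's actual tooling: `ErrorRateMonitor`, `first_alert_time`, `simulated_traffic`, the 10-second window, the 5% threshold, and the traffic numbers are all hypothetical choices made for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses over a short sliding time window."""

    def __init__(self, window_seconds=10.0, threshold=0.05, min_requests=100):
        self.window_seconds = window_seconds  # how far back to look
        self.threshold = threshold            # 5xx share that triggers the alert
        self.min_requests = min_requests      # avoid alerting on tiny samples
        self._events = deque()                # (timestamp, is_error), oldest first
        self._errors = 0

    def record(self, timestamp, status_code):
        """Add one response; timestamps are assumed to arrive in order."""
        is_error = 500 <= status_code <= 599
        self._events.append((timestamp, is_error))
        self._errors += int(is_error)
        # Drop responses that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)

    def error_rate(self):
        total = len(self._events)
        return self._errors / total if total else 0.0

    def should_roll_back(self):
        """True once enough traffic has been seen and the 5xx share is too high."""
        enough = len(self._events) >= self.min_requests
        return enough and self.error_rate() > self.threshold


def first_alert_time(monitor, events):
    """Feed (timestamp, status_code) pairs in order; return when the alert first fires."""
    for timestamp, status_code in events:
        monitor.record(timestamp, status_code)
        if monitor.should_roll_back():
            return timestamp
    return None


def simulated_traffic(deploy_at=20.0, duration=30.0, rps=100):
    """Hypothetical traffic: all 200s before the deploy, every second request a 500 after."""
    for i in range(int(duration * rps)):
        timestamp = i / rps
        failing = timestamp >= deploy_at and i % 2 == 0
        yield timestamp, 500 if failing else 200


if __name__ == "__main__":
    alert_at = first_alert_time(ErrorRateMonitor(), simulated_traffic())
    print(f"alert fired at t={alert_at}s (deploy was at t=20.0s)")
```

In this simulated run the check trips about one second after the bad deploy, because half of new requests failing pushes the 10-second window past 5% quickly. A real system would have to trade that speed against false alarms from normal error noise, which is what the window size and `min_requests` are standing in for here.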

Source URL

https://news.ycombinator.com/item?id=46162656

Related Pain Points