A Recap on May/June Stability - Neon
Excerpt
Starting in May, we had a series of feature launches with agentic AI partners that gained far more momentum than we predicted. In two short timespans, the rate of new database creation increased more than 5x, and the rate of branch creation increased more than 50x. While we were humbled by the uptick, the burst in operational load put significant strain on the Neon platform, **manifesting as more incidents over the course of the two months than the entire year before.**

…

### May: Capacity Handling and “Cells”

May incidents were caused by hitting a scaling limit on the number of active databases in US regions before our solution (Cells) was ready. Every active database on Neon is a running pod in a Kubernetes cluster. Our testing of Kubernetes showed service degradation beyond 10,000 concurrent databases: among multiple issues discovered in testing, we approached the EKS etcd memory limit of 8 GB, and pod start times no longer met our targets. In addition, in our us-east-1 cluster, our network configuration limited us to ~12,000 concurrently active databases. In January 2025, we had forecast that we would not hit these limits until the end of the year.

…

We made various configuration changes to keep our oversized regions functional: tuning networking, reducing our usage of the Kubernetes API to avoid EKS (Amazon Elastic Kubernetes Service) rate limits, scaling up our control plane databases, and shedding load where possible. Each of these changes bought us time to complete the Cells project, but also increased the risk of failure and the amount of customer impact when failures occurred, resulting in the incidents you experienced.

…

### June: Metadata Handling

Database operations incidents in June were caused by scaling issues with our control plane database as a result of a 50x increase in database branch creation. A branch in Neon is a cheap operation – there’s no data copy – which makes it fast and easy to create thousands of them. In agentic workloads, customers often use branches as “savepoints” to restore app state as their agent iterates on a codebase.

*Branches created each day by developers (blue) vs agents (yellow)*

…

- Billing and consumption calculations became more expensive, consuming more CPU
- Some queries switched to different Postgres execution plans, and tables became more dependent on aggressive vacuuming
- In several cases, customer-impacting issues were caused by control plane queries that went from taking a few hundred milliseconds to over a minute. We have alerting on slow and degrading query performance – but in this case, the query plan switched quickly, introducing service degradation without warning.

While we are a Postgres company, we experienced classic Postgres failure modes under increased load: query plan drift and slow vacuum. This is humbling, and it will inform our roadmap to help our customers avoid the same.

Our test suites were designed around historical usage patterns that didn’t simulate highly skewed project-to-branch ratios. That meant production workloads diverged significantly from what we tested – leading to issues surfacing in production. In this case, we had tested the system at 5x normal load, but branch creation increased 50x. The system would have continued to function well if we had stronger limits on the number of branches, both per project and per customer. One lesson here is that we need stronger limits on EVERY dimension of the workload. Rate limiting at or before our test boundaries would have saved the day.
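That lesson about limiting every dimension of the workload maps naturally onto layered rate limits in the control plane. As a minimal sketch (not Neon’s implementation; the limits, identifiers, and in-memory token buckets below are all hypothetical), a branch-creation request could be checked against both a per-project and a per-customer budget before the control plane does any metadata work:

```python
import time
from dataclasses import dataclass, field

# Hypothetical limits -- illustrative only, not Neon's actual numbers.
PER_PROJECT_BRANCHES_PER_MIN = 60
PER_CUSTOMER_BRANCHES_PER_MIN = 300


@dataclass
class TokenBucket:
    """Refills `rate` tokens per second, holding at most `capacity` tokens."""
    rate: float
    capacity: float
    tokens: float = 0.0
    updated_at: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class BranchRateLimiter:
    """Enforces branch-creation limits on two dimensions: per project and per customer."""

    def __init__(self) -> None:
        self.by_project: dict[str, TokenBucket] = {}
        self.by_customer: dict[str, TokenBucket] = {}

    @staticmethod
    def _bucket(buckets: dict, key: str, per_minute: int) -> TokenBucket:
        return buckets.setdefault(key, TokenBucket(rate=per_minute / 60.0, capacity=per_minute))

    def allow_branch_create(self, customer_id: str, project_id: str) -> bool:
        # A production limiter would reserve tokens atomically across both dimensions
        # and keep the counters in shared storage rather than process memory;
        # this sketch just checks them independently.
        customer_ok = self._bucket(self.by_customer, customer_id, PER_CUSTOMER_BRANCHES_PER_MIN).allow()
        project_ok = self._bucket(self.by_project, project_id, PER_PROJECT_BRANCHES_PER_MIN).allow()
        return customer_ok and project_ok


if __name__ == "__main__":
    limiter = BranchRateLimiter()
    allowed = sum(limiter.allow_branch_create("cust_1", "proj_1") for _ in range(100))
    print(f"{allowed} of 100 burst requests allowed")  # capped by the per-project budget
```

The key design point is that the limits sit at or below whatever load the test suite has actually exercised, so production can never drift into untested territory on any single dimension.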
…

### Where to go from here

We know the scale of operations is only going to accelerate. Users and agentic AI platforms are now creating more than 40,000 projects every day. The post-mortems, lessons, and patches that have come from these incidents will take us a long way towards a system that can scale exponentially with best-in-class resiliency.
Related Pain Points
Kubernetes scaling limits cause database activation failures
Neon's Kubernetes infrastructure hit scaling limits when database creation increased 5x and branch creation increased 50x in May-June 2025. Testing showed service degradation beyond 10,000 concurrent database pods, and network configuration limited us-east-1 to ~12,000 active databases. IP exhaustion in Kubernetes subnets caused outages where customers couldn't activate or create databases.
Missing rate limiting on branch creation enables runaway workloads
Neon lacked strong limits on the number of branches per project and per customer. Agentic AI systems creating thousands of branches as 'savepoints' overwhelmed the control plane metadata handling system. The absence of rate limiting at test boundaries allowed production workloads to diverge significantly from tested scenarios.
Query plan instability causes unpredictable performance degradation
PostgreSQL query execution plans can become unstable, causing previously fast queries to degrade suddenly. Keeping performance consistent may require tools such as Query Plan Management (QPM) or the pg_hint_plan extension.
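One concrete way to pin a plan with pg_hint_plan is a leading hint comment on the query. Below is a minimal sketch, assuming the extension is available and using a hypothetical `branches` table with an index named `branches_project_id_idx`; detecting the drift in the first place is often done with Postgres's auto_explain module, which logs plans for statements exceeding a duration threshold.

```python
import psycopg2

# Hypothetical DSN, table, and index names -- adjust for your own schema.
conn = psycopg2.connect("dbname=control_plane")
cur = conn.cursor()

# pg_hint_plan can be loaded per session (or via shared_preload_libraries).
cur.execute("LOAD 'pg_hint_plan';")

# The leading /*+ ... */ hint pins the access path, so a shift in planner
# statistics can no longer flip this query onto a slower plan.
cur.execute(
    """
    /*+ IndexScan(branches branches_project_id_idx) */
    SELECT count(*)
    FROM branches
    WHERE project_id = %s
    """,
    ("proj_123",),
)
print(cur.fetchone())
```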
Vacuum and table dependency issues under rapid workload scaling
As agentic workloads caused a 50x increase in branch creation, Neon experienced classic PostgreSQL failure modes including query plan drift and slow vacuum operations. Tables became more dependent on aggressive vacuuming, creating performance bottlenecks that weren't anticipated in the original system design.
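For a hot metadata table, one standard mitigation is to tune autovacuum per table so it triggers on an absolute number of dead rows rather than a fraction of an ever-growing table, and to watch dead-tuple counts to confirm it keeps up. A rough sketch, assuming a hypothetical `branches` table and illustrative thresholds (not Neon's actual settings):

```python
import psycopg2

# Hypothetical DSN and table name -- illustrative only.
conn = psycopg2.connect("dbname=control_plane")
conn.autocommit = True
cur = conn.cursor()

# Trigger autovacuum after a fixed number of dead rows instead of a percentage
# of the (large) table, and give it a bigger I/O budget per run.
cur.execute("""
    ALTER TABLE branches SET (
        autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold = 10000,
        autovacuum_vacuum_cost_limit = 2000
    );
""")

# Dead-tuple counts per table show whether vacuum is keeping up with churn.
cur.execute("""
    SELECT relname, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)
```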
Control plane database CPU exhaustion from billing and consumption calculations
The 50x increase in branch creation caused the control plane database's CPU to become exhausted due to expensive billing and consumption calculations. These operations contributed significantly to the overall control plane degradation and cascading query performance issues.
Test suites fail to capture real-world workload patterns at scale
Neon's test suites were designed around historical usage patterns and didn't simulate highly skewed project-to-branch ratios created by agentic AI workloads. Testing at 5x normal load proved insufficient when production experienced 50x load with different distribution characteristics, leading to issues only surfacing in production.
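One way to make a load test reflect this kind of workload is to draw branch creations from a heavily skewed distribution over projects instead of a uniform one. The sketch below is purely illustrative (the project counts, volumes, and Zipf exponent are made-up parameters):

```python
import random
from collections import Counter

# Hypothetical workload model: a few "agent-driven" projects create the vast
# majority of branches, instead of branches spreading evenly across projects.
NUM_PROJECTS = 1000
NUM_BRANCH_CREATES = 50_000
SKEW = 1.3  # Zipf-like exponent; higher = more concentrated on a few projects


def skewed_branch_workload() -> list[str]:
    weights = [1.0 / (rank ** SKEW) for rank in range(1, NUM_PROJECTS + 1)]
    projects = [f"project-{i}" for i in range(NUM_PROJECTS)]
    return random.choices(projects, weights=weights, k=NUM_BRANCH_CREATES)


workload = skewed_branch_workload()
hottest = Counter(workload).most_common(5)
print("branch creations landing on the 5 hottest projects:", hottest)
```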