www.ilert.com
Excerpt
# Neon: Kubernetes IP exhaustion disrupted services

Neon experienced outages caused by Kubernetes IP exhaustion, impacting service availability. Explore what went wrong, Neon's response, key actions taken, and lessons learned to improve reliability.

...

On May 16 and May 19, 2025, Neon experienced two outages totalling 5.5 hours in the AWS us-east-1 region. Customers were unable to start inactive databases or create new ones, though active databases remained unaffected. The incidents resulted from exhausted IP addresses in Kubernetes subnets, triggered by control plane overload and AWS CNI misconfigurations. Immediate mitigations included reconfiguring IP allocation parameters and scaling prewarmed compute pools.

...

The first incident began at 14:13 UTC on May 16, 2025, when customers started experiencing failures to activate databases. The second incident began at 13:17 UTC on May 19, 2025, triggered when the fixes from the first incident were reverted.

…

## Who was affected by the Neon outage, and how bad was it?

Customers using Neon databases with scale-to-zero configurations in AWS us-east-1 were directly impacted. They could not activate inactive databases or create new ones, disrupting development workflows and CI/CD processes.

…

## What patterns did the Neon outage reveal?

The outage revealed recurring risks in scaled infrastructure systems:

- IP exhaustion acts as a hidden infrastructure bottleneck.
- Configuration regressions were introduced during incident remediation.
- Kubernetes clusters exceeded their designed pod limits under dynamic load.

## Quick summary

On May 16 and May 19, 2025, Neon faced two outages totalling 5.5 hours due to IP exhaustion in Kubernetes subnets in AWS us-east-1. Users were unable to activate databases with scale-to-zero configurations. Neon responded with rapid mitigations and transparent, though brief, communication. The incidents underscored the importance of robust infrastructure safeguards, effective configuration management, and clear, timely updates during critical incidents.
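The excerpt notes that one immediate mitigation was reconfiguring IP allocation parameters. On clusters running the AWS VPC CNI, those parameters are typically environment variables such as `WARM_IP_TARGET` and `MINIMUM_IP_TARGET` on the `aws-node` DaemonSet. The sketch below shows what such a change can look like via the Kubernetes Python client; the specific values, and the assumption that the stock `aws-node` DaemonSet is in use, are illustrative and not taken from Neon's incident report.

```python
# Minimal sketch: tune AWS VPC CNI IP allocation on the aws-node DaemonSet.
# Assumes a standard EKS-style setup; the target values are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

# Strategic-merge patch: containers and env entries are merged by name,
# so only the listed variables change.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "aws-node",
                        "env": [
                            # Keep fewer spare IPs warm per node so free addresses
                            # are not hoarded while the subnet is under pressure.
                            {"name": "WARM_IP_TARGET", "value": "2"},
                            {"name": "MINIMUM_IP_TARGET", "value": "10"},
                        ],
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_daemon_set(
    name="aws-node", namespace="kube-system", body=patch
)
```

The same change is often applied with `kubectl set env daemonset/aws-node -n kube-system WARM_IP_TARGET=2` or through the EKS add-on configuration; the relevant point is that per-node warm-IP behavior directly determines how quickly a subnet's free addresses are consumed.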
Related Pain Points
Production Deployment Without Proper Testing Pipeline
Changes are deployed directly to production without apparent dev/test/staging environments, causing widespread bugs to affect all users simultaneously. The lack of canary deployments and feature flags prevents quick rollback of breaking changes.
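As a purely illustrative sketch of what this pain point refers to (the flag name and allocator functions below are hypothetical, not from Neon's codebase), a feature-flag gate at the call site is what makes a risky change revertible without a redeploy:

```python
# Illustrative feature-flag gate; flag name and allocators are hypothetical.
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a flag from the environment; a real system would use a flag service."""
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")

def legacy_ip_allocator(pod_id: str) -> str:
    return f"10.0.0.{hash(pod_id) % 250 + 1}"   # known-good path kept as fallback

def new_ip_allocator(pod_id: str) -> str:
    return f"10.0.4.{hash(pod_id) % 250 + 1}"   # new behavior kept behind the flag

def allocate_pod_ip(pod_id: str) -> str:
    # Flipping the flag rolls the change back instantly, without a deployment.
    if flag_enabled("USE_NEW_IP_ALLOCATOR"):
        return new_ip_allocator(pod_id)
    return legacy_ip_allocator(pod_id)

if __name__ == "__main__":
    print(allocate_pod_ip("compute-1234"))
```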
Kubernetes scaling limits cause database activation failures
Neon's Kubernetes infrastructure hit scaling limits when database creation increased 5x and branch creation increased 50x in May-June 2025. The platform exceeded the 10,000 concurrent database pod limit in testing, with network configuration limiting us-east-1 to ~12,000 active databases. IP exhaustion in Kubernetes subnets caused outages where customers couldn't activate or create databases.
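To make these numbers concrete, here is a rough back-of-the-envelope capacity check, assuming the default AWS VPC CNI model of one routable subnet IP per pod and no prefix delegation; the /20 subnet size and the instance figures are assumptions for illustration, not Neon's actual network layout.

```python
# Back-of-the-envelope IP capacity check. Assumes one VPC IP per pod (default
# AWS VPC CNI without prefix delegation); the /20 subnet size is illustrative.

def usable_subnet_ips(prefix_length: int) -> int:
    """AWS reserves 5 addresses per subnet (network, router, DNS, future use, broadcast)."""
    return 2 ** (32 - prefix_length) - 5

def max_pods_per_node(enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI ceiling without prefix delegation: ENIs * (IPs per ENI - 1) + 2."""
    return enis * (ips_per_eni - 1) + 2

if __name__ == "__main__":
    demand = 12_000                      # ~active databases cited for us-east-1
    capacity = usable_subnet_ips(20)     # a /20 pod subnet -> 4,091 usable IPs
    print(f"/20 subnet: {capacity} usable IPs for {demand} pods, "
          f"short by {demand - capacity}")
    # e.g. an m5.4xlarge-class node (8 ENIs, 30 IPs each) tops out at 234 pods
    print("per-node pod ceiling:", max_pods_per_node(enis=8, ips_per_eni=30))
```

The gap between subnet capacity and pod demand is the "hidden bottleneck" the article describes: nothing fails until the last free address is handed out.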