www.anthropic.com
A postmortem of three recent issues - Anthropic
The overlapping nature of these bugs made diagnosis particularly challenging. The first bug was introduced on August 5, affecting approximately 0.8% of requests made to Sonnet 4. Two more bugs arose from deployments on August 25 and 26. Although initial impacts were limited, a load balancing change on August 29 started to increase affected traffic. This caused many more users to experience issues while others continued to see normal performance, creating confusing and contradictory reports.

…

### 1. Context window routing error

On August 5, some Sonnet 4 requests were misrouted to servers configured for the upcoming 1M token context window. This bug initially affected 0.8% of requests. On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst impacted hour on August 31, 16% of Sonnet 4 requests were affected.

…

However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.

**Resolution:** We fixed the routing logic to ensure short- and long-context requests were directed to the correct server pools. We deployed the fix on September 4. Rollout to our first-party platform and Google Cloud's Vertex AI was completed by September 16, and to AWS Bedrock by September 18.

### 2. Output corruption

On August 25, we deployed a misconfiguration to the Claude API TPU servers that caused an error during token generation. An issue caused by a runtime performance optimization occasionally assigned a high probability to tokens that should rarely be produced given the context, for example producing Thai or Chinese characters in response to English prompts, or producing obvious syntax errors in code. A small subset of users that asked a question in English might have seen "สวัสดี" in the middle of the response, for example.
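The symptom described above (Thai or Chinese characters appearing in an otherwise English response) is the kind of thing a simple script check can flag. The following is a minimal sketch only, not the check Anthropic deployed: the function name `unexpected_scripts` and the `allowed_prefixes` parameter are hypothetical, and it approximates a character's script by the first word of its Unicode name.

```python
import unicodedata


def unexpected_scripts(text, allowed_prefixes=("LATIN",)):
    """Return script-like name prefixes of letters outside the allowed set.

    Hypothetical helper for illustration. The "script" is approximated by
    the first word of the Unicode character name (e.g. "THAI", "CJK").
    """
    found = set()
    for ch in text:
        # Skip digits, punctuation, whitespace, and combining marks.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        prefix = name.split(" ")[0] if name else "UNKNOWN"
        if prefix not in allowed_prefixes:
            found.add(prefix)
    return found


# An English response containing stray Thai text is flagged.
flagged = unexpected_scripts("The answer is สวัสดี forty-two.")
clean = unexpected_scripts("Hello, world!")
```

A check like this only makes sense when the expected language is known in advance, since legitimate multilingual responses would also be flagged.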
This corruption affected requests made to Opus 4.1 and Opus 4 on August 25–28, and requests to Sonnet 4 on August 25–September 2. Third-party platforms were not affected by this issue.

**Resolution:** We identified the issue and rolled back the change on September 2. We've added detection tests for unexpected character outputs to our deployment process.

### 3. Approximate top-k XLA:TPU miscompilation

On August 25, we deployed code to improve how Claude selects tokens during text generation. This change inadvertently triggered a latent bug in the XLA:TPU^[1]^ compiler, which has been confirmed to affect requests to Claude Haiku 3.5. We also believe this could have impacted a subset of Sonnet 4 and Opus 3 on the Claude API. Third-party platforms were not affected by this issue.

…

This caused a mismatch: operations that should have agreed on the highest probability token were running at different precision levels. The precision mismatch meant they didn't agree on which token had the highest probability. This caused the highest probability token to sometimes disappear from consideration entirely.

On August 26, we deployed a rewrite of our sampling code to fix the precision issues and improve how we handled probabilities at the limit that reach the top-p threshold. But in fixing these problems, we exposed a trickier one. Our fix removed the December workaround because we believed we'd solved the root cause. This led to a deeper bug in the approximate top-k operation, a performance optimization that quickly finds the highest probability tokens.^[3]^ This approximation sometimes returned completely wrong results, but only for certain batch sizes and model configurations. The December workaround had been inadvertently masking this problem.

…

## Why detection was difficult

...

Each bug produced different symptoms on different platforms at different rates.

...
When negative reports spiked on August 29, we didn't immediately make the connection to an otherwise standard load balancing change.
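To make the precision mismatch from section 3 concrete, here is a minimal sketch of how two operations can disagree on the highest-scoring token when they run at different precisions. It is an illustration under stated assumptions, not the sampling code discussed above: IEEE half precision (via the standard library's `struct` format `"e"`) stands in for whatever lower precision was involved, and the scores and helper names are made up.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def argmax(values):
    """Index of the first maximum (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical scores: token 1 is the true winner by a small margin.
scores = [1.0001, 1.0002, 0.5]

# Higher-precision path: the small margin is preserved.
high_precision_pick = argmax(scores)

# Lower-precision path: both leading scores round to the same value,
# so the tie-break selects a different token.
low_precision_pick = argmax([to_float16(s) for s in scores])

print(high_precision_pick, low_precision_pick)
```

If one step selects candidates at one precision and a later step filters them at another, the two can end up working with different "top" tokens, which is the kind of disagreement the postmortem describes.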
Related Pain Points (4)
Context Window Routing Bug Affecting 16% of Requests
(8) A context window routing error caused requests to be misrouted to incorrect server configurations, with sticky routing amplifying the impact. At peak, 16% of Sonnet 4 requests were affected, with fix rollout spanning multiple platforms over weeks.
XLA:TPU Compiler Bug Affecting Token Selection
(8) A precision mismatch bug in the XLA:TPU compiler caused incorrect token selection by the approximate top-k operation, affecting Haiku 3.5 and potentially Sonnet 4 and Opus 3. The issue was masked by a previous workaround that was inadvertently removed.
Complex Debugging Due to Overlapping Production Bugs
(7) Multiple overlapping bugs with different symptoms, affecting different platforms at different rates, made diagnosis and root-cause analysis extremely difficult. Load balancing changes increased affected traffic unexpectedly, creating contradictory user reports.
Output formatting issues and text quality problems
(4) API responses include unwanted formatting artifacts, repeated phrases, extraneous whitespace, and stray newlines. These quality issues require additional post-processing and reduce application reliability.