
Why Anthropic's Claude Code Failed

Anthropic released a post-mortem on Claude Code's failures, revealing the problem wasn't the core model but a combination of three product-level changes. For businesses, this is a direct lesson in AI integration: the entire system around the model, not just the model itself, is a critical point of failure.

Technical Context

I've carefully reviewed Anthropic's post-mortem from April 23, and the most interesting part isn't the bug itself, but how a seemingly stable AI integration fell apart due to several small decisions at once. If you're building AI automation on top of LLMs, this is a very familiar scenario: the model seems the same, but the product suddenly becomes dumber, more forgetful, and less articulate.

Anthropic described three independent changes. The first was made on March 4: Claude Code's default reasoning effort was lowered from high to medium to speed up responses. In internal tests, the quality drop looked moderate, but in real-world use, users got a noticeably weaker code assistant. This was only rolled back on April 7.
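A simple defensive pattern against this class of failure is to pin quality-critical parameters in version control instead of inheriting upstream defaults. A minimal sketch of that idea; the `InferenceConfig` type, the field names, and the allowed effort values are my assumptions for illustration, not Anthropic's actual setup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceConfig:
    """Quality-critical knobs pinned explicitly per release, so a silent
    change to an upstream default cannot alter product behavior."""
    model: str
    reasoning_effort: str  # assumed values: "low" | "medium" | "high"

    def __post_init__(self):
        if self.reasoning_effort not in ("low", "medium", "high"):
            raise ValueError(f"unknown reasoning effort: {self.reasoning_effort}")

# Checked into version control next to the code that uses it; any change
# shows up in review and in the deploy diff.
PROD_CONFIG = InferenceConfig(model="example-code-model", reasoning_effort="high")
```

The point is not the dataclass itself but that an explicit, reviewed value would have turned a silent default change into a visible diff.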

The second change arrived on March 26. The team intended to clear the reasoning cache after an hour of inactivity, but a bug caused it to clear on every subsequent turn of the session. This created the impression that Claude was forgetting context, repeating itself, and acting disoriented. This bug persisted until April 10.

The third change appeared on April 16, after the Opus 4.7 release. To eliminate verbosity and reduce token consumption, Anthropic added constraints to the system prompt. This is where things got particularly bad: the new instruction, combined with other prompt edits, degraded coding quality across several versions, including Sonnet 4.6, Opus 4.6, and Opus 4.7. The rollback was done on April 20.
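This is why prompt edits deserve the same release gate as code changes: run the candidate prompt against a fixed eval suite and block the rollout on regression. A hedged sketch of such a gate; the pass-rate framing and the threshold are my assumptions:

```python
def gate_prompt_change(baseline_score: float, candidate_score: float,
                       max_regression: float = 0.02) -> bool:
    """Allow a prompt rollout only if the candidate's eval pass rate
    does not drop more than `max_regression` below the baseline's."""
    return candidate_score >= baseline_score - max_regression
```

The scores would come from re-running the same coding tasks under old and new system prompts; the gate itself is trivial, and that is the point: the hard part is having the eval suite at all.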

The key takeaway: according to Anthropic, the base model and core API were not broken. The product layer on top of them was. Honestly, this is my favorite and most frustrating type of incident, because the culprit isn't one major release but the sum of "safe" changes to parameters, the prompt layer, and session management.

What This Means for Business and Automation

For teams, this is a very sobering signal: LLM system degradation often comes not from the model but from the surrounding infrastructure. If your AI solution development relies on system prompts, caching, routing, and latency tuning, you need to test the entire orchestra, not just the model.

Who wins? Those with staged rollouts, proper cohort metrics, and fast rollbacks. Who loses? Teams that treat prompts as "not code" and push such changes with little engineering discipline.
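For staged rollouts specifically, deterministic cohort assignment is what makes widening and rolling back predictable: the same user always lands in the same bucket, and reverting is just dialing the percentage to zero. A minimal hash-bucketing illustration; `in_rollout` is a hypothetical helper, not any vendor's API:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_pct: float) -> bool:
    """Deterministically assign a user to a feature's rollout cohort.
    The same (user, feature) pair always maps to the same bucket."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return bucket < rollout_pct

# Rollback = set rollout_pct to 0.0 for that feature; no redeploy needed
# if the percentage lives in a config service.
```

Keying the hash on the feature name as well as the user ID keeps cohorts independent across experiments, so one feature's 10% is not the same users as another's.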

I've long treated the prompt layer as part of the architecture, not just a text file thrown together. At Nahornyi AI Lab, we solve these exact problems for clients: we break down AI architecture into layers, establish observability, and eliminate fragile points that can suddenly tank quality.

If you're already noticing that your assistant is smart one moment and dull the next for no apparent reason, it's usually not magic or "model fatigue." We can systematically analyze your pipeline and build AI automation that relies on engineering guarantees, not luck. If you'd like, I can help you at Nahornyi AI Lab quickly find where your production pipeline is leaking quality.

A related examination of AI vulnerabilities revealed how the Claude self-reflection glitch could be exploited via prompt injection, potentially leading to denial-of-service attacks. Such incidents underscore the critical need for detailed post-mortems and robust security measures in AI deployment.
