The Technical Context
I love stories like this not for the drama, but for how quickly they expose weaknesses in AI architecture. In early April, Claude Code users began reporting en masse that their usual workflow suddenly no longer fit within the 5-hour rate-limit windows. And we're not talking about insane loads here, just standard code generation across a few threads.
The initial complaints paint a grim picture: on the more expensive plan it used to be hard to hit the limit even with heavy use, but after downgrading to a plan roughly five times cheaper, people hit the ceiling almost immediately. One of the most telling cases: after a full session reset, a user sent about 130k tokens to continue a previous context that had already expired from the prompt cache, and almost instantly saw about 5% of their 5-hour limit disappear.
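To see why an expired cache stings this much, here is a back-of-the-envelope sketch. The multipliers mirror Anthropic's published API pricing for prompt caching (cache writes billed at 1.25x the base input rate, cache reads at 0.1x); whether subscription limits weight tokens the same way is an assumption, and the figures are illustrative, not measured.

```python
# Illustrative model of the reported incident. The multipliers follow
# Anthropic's public API pricing for prompt caching; how subscription
# rate limits weight them internally is an assumption.

CACHE_WRITE_MULTIPLIER = 1.25   # writing a prompt into the cache costs extra
CACHE_READ_MULTIPLIER = 0.10    # a cache hit is billed at a steep discount

def weighted_input_cost(tokens: int, cache_hit: bool) -> float:
    """Weight raw input tokens by the assumed cache multiplier."""
    multiplier = CACHE_READ_MULTIPLIER if cache_hit else CACHE_WRITE_MULTIPLIER
    return tokens * multiplier

# The reported case: ~130k tokens re-sent after the cached context expired.
miss = weighted_input_cost(130_000, cache_hit=False)  # full-price re-processing
hit = weighted_input_cost(130_000, cache_hit=True)    # what a cache hit would cost

print(f"cache miss: {miss:,.0f} weighted tokens")  # 162,500
print(f"cache hit:  {hit:,.0f} weighted tokens")   # 13,000
print(f"miss / hit ratio: {miss / hit:.1f}x")      # 12.5x
```

The point of the arithmetic: the same 130k-token context costs over an order of magnitude more the moment it falls out of the cache, which is exactly the kind of counter jump users described.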
Two notes here. First: this doesn't look like a simple case of "people are using it more." Second: the suspicion that junk tokens are generated during caching sounds plausible, since similar counter jumps have been reported before.
The external context also lines up. After late March, Anthropic rolled back some of its more generous policies, including promos with increased limits during off-peak hours, and began to rein in usage much more aggressively amidst a GPU shortage. This creates a double whammy: a real tightening of rate limits on one hand, and possibly a flawed token counting method or poor prompt caching logic on the other.
For those building AI integrations in development, this problem is far from abstract. When your limit is consumed not by useful generation but by internal cache mechanics or re-processing long contexts, the entire economics of your pipeline becomes unreliable.
Impact on Business and Automation
If I'm building an AI solution for a development team, I can't rely on “it seems to be enough.” I need predictability: how much does one task cost, how many parallel sessions can the team maintain, what happens with long agent chains, and how quickly does performance degrade under load?
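As a minimal sketch of that kind of predictability check, the helper below estimates how many parallel sessions fit into one rate-limit window. Every number in it is a placeholder: the window budget, per-task burn, and task rate would come from your own usage logs, not from any published quota.

```python
# Hypothetical capacity check. All numbers are placeholders to be
# replaced with figures measured from your own token-burn logs.

def sessions_supported(window_budget_tokens: int,
                       tokens_per_task: int,
                       tasks_per_session: int) -> int:
    """How many parallel sessions fit inside one rate-limit window."""
    burn_per_session = tokens_per_task * tasks_per_session
    return window_budget_tokens // burn_per_session

# e.g. an assumed 2.6M-token window, 40k tokens per task,
# 10 tasks per session per 5-hour window:
print(sessions_supported(2_600_000, 40_000, 10))  # → 6
```

The value of even a crude model like this is that it turns "it seems to be enough" into a number you can re-check when the provider quietly changes the window.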
And this is where Claude Code currently falls short, especially under heavy usage. Not because the model suddenly got worse, but because the billing-and-limits layer now shapes the real-world UX more than the model's capabilities do. When a developer is afraid to open a second thread or continue a long session, AI automation turns from an accelerator into a lottery.
Who wins? Those with short sessions, simple tasks, and a backup stack of multiple providers. Who loses? Teams that are used to maintaining long engineering contexts, running exploratory branches, and building semi-autonomous coding agents on a subscription basis.
I wouldn't currently bet on a Claude subscription as the sole foundation for internal engineering processes. It's better to design for routing: short tasks to one layer, long code contexts to another, and critical pipelines through an API with separate cost control and actual token burn logging. Otherwise, one unexpected cache recalculation breaks not just the budget, but also the deadlines.
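A minimal sketch of that routing idea, with hypothetical tier names and thresholds (the 50k-token cutoff and model identifiers are stand-ins, not recommendations): route by context size and criticality, and log the actual token burn on every call so budget surprises show up in metrics rather than invoices.

```python
# Sketch of the routing strategy described above. Tier names, model
# identifiers, and the 50k-token threshold are all illustrative.

import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("token-burn")

@dataclass
class Route:
    tier: str    # which access channel handles the request
    model: str   # hypothetical model identifier

def pick_route(prompt_tokens: int, critical: bool) -> Route:
    """Route by criticality first, then by context size."""
    if critical:
        return Route("api-dedicated", "primary-large")  # separate cost control
    if prompt_tokens > 50_000:
        return Route("api-metered", "primary-large")    # long code contexts
    return Route("subscription", "primary-fast")        # short, cheap tasks

def dispatch(prompt_tokens: int, critical: bool = False) -> Route:
    route = pick_route(prompt_tokens, critical)
    # Record actual burn per request so cost drift is visible early.
    log.info("route=%s model=%s prompt_tokens=%d",
             route.tier, route.model, prompt_tokens)
    return route
```

Usage would look like `dispatch(120_000)` landing on the metered API tier while `dispatch(2_000)` stays on the subscription, keeping the cheap channel for the tasks that can tolerate its surprises.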
Anthropic is likely dealing with a mix of two issues here: a shortage of inference capacity and a questionable implementation of rate limiting for real-world coding scenarios. This is survivable, but only if your architecture isn't tied to a single access channel and one nice-looking subscription from the start.
At Nahornyi AI Lab, we specialize in dissecting these kinds of practical bottlenecks: determining where a subscription is fine for a prototype versus where you need a proper AI implementation with model routing, a caching strategy, and protection against sudden rate limits. If your development or support teams are already stumbling over these constraints, we can review your workflow and build an AI automation solution that doesn't depend on someone else's surprises every five hours.