
I Underestimated AI Agents' Progress Too

Ajeya Cotra has revised her AI capability forecasts following new METR results, which show AI agents can handle significantly longer tasks than expected. For businesses, this is crucial: AI automation can now be designed for hour-long, sometimes day-long, tasks, though not yet for a flawless, week-long operation.

Technical Context

I appreciate articles like this not for the hype, but for the moment someone honestly says: okay, I underestimated the pace of progress. This is exactly what happened in Ajeya Cotra's post from March 5, 2026. She reassessed how much autonomous work modern agents can actually handle, and for AI implementation, this is no longer a philosophical point but an architectural one.

I dug into the numbers, and here’s what caught my eye. Previously, the benchmark was roughly this: a top-tier model like Claude Opus 4.5 could maintain about a 5-hour 'temporal horizon' on METR's engineering tasks, meaning it could solve about half the problems a skilled human would take 5 hours to complete.
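The 'temporal horizon' idea can be made concrete with a back-of-envelope version of the metric: fit a logistic curve of success probability against log task duration and read off where it crosses 50%. A minimal sketch, using made-up task outcomes and a from-scratch fit (this is my illustration of the concept, not METR's actual methodology or data):

```python
import math

# Hypothetical (task_duration_hours, solved) outcomes -- illustrative only
outcomes = [(0.5, 1), (1, 1), (2, 1), (4, 1), (4, 0),
            (8, 1), (8, 0), (16, 0), (32, 0)]

# Work in log2(duration) space: success tends to fall off log-linearly
data = [(math.log2(d), y) for d, y in outcomes]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient descent on logistic loss: p(solved) = sigmoid(a + b * x)
a, b = 0.0, 0.0
for _ in range(20000):
    ga = gb = 0.0
    for x, y in data:
        err = sigmoid(a + b * x) - y
        ga += err
        gb += err * x
    a -= 0.5 * ga / len(data)
    b -= 0.5 * gb / len(data)

# The 50% horizon is the duration where the fitted curve crosses 0.5,
# i.e. where a + b * x = 0
horizon_hours = 2 ** (-a / b)
```

With this toy data the fitted horizon lands in the mid-single-digit hours, which is the same shape of claim as 'a 5-hour horizon': half the tasks of that length get solved, not all of them.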

The new shift turned out to be uncomfortably large for anyone making conservative forecasts. According to the data Cotra cited, Opus 4.6 already passed 14 out of 19 tasks longer than 8 hours, and the confidence interval for its horizon has expanded to a wide 5.3-66 hours. This doesn't mean the agent is suddenly 'reliable for three days straight.' It means our old measurement tools are hitting their limits.

And this is where it gets really interesting. Outside of neat benchmarks, agents were already handling multi-week projects like building a browser, a compiler, or large code ports, but not in zero-touch mode. I see this in field cases too: the better the specifications and the more defined the tools, the further an agent can go without intervention. The more open-ended the task, the faster it succumbs to drift, loops, and accumulating simple errors.

What This Changes for Business and Automation

First: I would no longer design AI automation as a 'chatbot next to an employee.' For some processes, it's more sensible to build long, hours-long runbooks with checkpoints, rollbacks, and artifact verification.
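The runbook pattern can be sketched in a few lines: execute steps, snapshot state only after an artifact check passes, and roll back to the last good checkpoint on failure. A minimal sketch with in-memory state; the step names and verifier signature are illustrative, not a real framework API:

```python
import copy

class RunbookRunner:
    """Toy long-run pipeline with checkpoints, artifact
    verification, and rollback to the last known-good state."""

    def __init__(self, steps):
        self.steps = steps        # list of (name, action, verify) triples
        self.state = {}
        self.checkpoints = []     # snapshots of verified state

    def run(self):
        self.checkpoints.append(copy.deepcopy(self.state))
        for name, action, verify in self.steps:
            action(self.state)                    # the agent does the work
            if verify(self.state):                # check the artifact
                self.checkpoints.append(copy.deepcopy(self.state))
            else:                                 # roll back, stop the run
                self.state = copy.deepcopy(self.checkpoints[-1])
                return ("failed_at", name)
        return ("ok", None)

# Hypothetical usage: the second step produces a bad artifact
steps = [
    ("extract", lambda s: s.update(rows=120), lambda s: s["rows"] > 0),
    ("transform", lambda s: s.update(clean=-1), lambda s: s["clean"] >= 0),
]
runner = RunbookRunner(steps)
status = runner.run()
```

After the run, `status` names the failed step and `runner.state` is back to the verified post-extract snapshot, so debugging starts from a known-good point instead of a 12-hour-old mess.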

Second: Teams with well-formalized tasks will win. Those who throw a chaotic production environment and vague requirements at an agent, expecting magic without proper AI integration into a stack with logs, tests, and access controls, will lose.

Third: The cost of an error is now more important than the cost of tokens. If an agent runs for 12 hours and ends up in an incorrect state, the savings can easily turn into expensive debugging.
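The token-cost-versus-error-cost point is easy to put in numbers. A back-of-envelope sketch where every parameter (token throughput, price, failure probability, debugging cost) is an assumption I picked for illustration:

```python
def expected_run_cost(hours, tokens_per_hour, usd_per_mtok,
                      p_bad_state, debug_cost_usd):
    """Expected cost of one long agent run: token spend plus the
    probability-weighted cost of cleaning up a bad end state."""
    token_cost = hours * tokens_per_hour * usd_per_mtok / 1_000_000
    return token_cost + p_bad_state * debug_cost_usd

# A 12-hour run at 200k tokens/hour and $3 per million tokens,
# with a 20% chance of ending in a state that costs $2,000 of
# engineer time to debug (all numbers hypothetical)
cost = expected_run_cost(12, 200_000, 3.0, 0.20, 2000)
```

With these assumptions the token spend is a few dollars while the error term is hundreds: the failure probability, not the model price, dominates the bill.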

At Nahornyi AI Lab, we tackle this challenging layer: deciding where to grant an agent autonomy, where to implement safety nets, and where to prevent it from acting without human oversight. If your processes are already hitting bottlenecks with manual checks and slow engineering cycles, Vadym Nahornyi and I can help you build AI automation that actually offloads your team, rather than producing beautifully formatted chaos.

We previously discussed the emerging 'subprime code crisis', where relying too heavily on AI for development can degrade code quality and inflate total cost of ownership. The same logic applies here: the longer an agent runs unsupervised, the faster those quality and cost risks compound.
