Technical Context
I appreciate articles like this not for the hype, but for the moment someone honestly says: okay, I underestimated the pace of progress. This is exactly what happened in Ajeya Cotra's post from March 5, 2026. She reassessed how much autonomous work modern agents can actually handle, and for AI implementation, this is no longer a philosophical point but an architectural one.
I dug into the numbers, and here’s what caught my eye. Previously, the benchmark was roughly this: a top-tier model like Claude Opus 4.5 could maintain about a 5-hour 'temporal horizon' on METR's engineering tasks, meaning it could solve about half the problems a skilled human would take 5 hours to complete.
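To make the 'temporal horizon' concrete: METR-style horizons come from fitting a success-probability curve against human task length and reading off the point where the model succeeds about half the time. Here is a minimal Python sketch of that idea; the task results below are made-up illustrative numbers, not METR's actual data, and the simple gradient-descent fit is my own stand-in for their statistical machinery:

```python
import math

# Hypothetical task results: (human completion time in hours, agent succeeded?).
# These numbers are illustrative only, NOT METR's data.
tasks = [
    (0.5, True), (1.0, True), (2.0, True), (3.0, True),
    (4.0, True), (5.0, False), (6.0, True), (8.0, False),
    (10.0, False), (16.0, False), (24.0, False), (40.0, False),
]

def fit_logistic(data, lr=0.2, epochs=20000):
    """Fit P(success) = sigmoid(a - b * log2(t)) by gradient descent."""
    a = b = 0.0
    n = len(data)
    for _ in range(epochs):
        ga = gb = 0.0
        for t, ok in data:
            x = math.log2(t)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = p - (1.0 if ok else 0.0)  # dLoss/d(logit) for cross-entropy
            ga += err
            gb += -err * x
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

a, b = fit_logistic(tasks)
# The 50% horizon is the task length t where a - b * log2(t) = 0.
horizon = 2 ** (a / b)
print(f"estimated 50% horizon: ~{horizon:.1f} hours")
```

With this toy data the fitted horizon lands around the 5-hour mark, which is exactly the kind of single number the benchmark reports: a compression of a whole success curve into one length.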
The new shift turned out to be uncomfortably large for anyone making conservative forecasts. According to the data Cotra cited, Opus 4.6 already passed 14 out of 19 tasks longer than 8 hours, and the confidence interval for its horizon has expanded to a wide 5.3–66 hours. This doesn't mean the agent is suddenly 'reliable for three days straight.' It means our old measurement tools are hitting their limits.
And this is where it gets really interesting. Outside of neat benchmarks, agents were already handling multi-week projects like building a browser, a compiler, or large code ports, but not in zero-touch mode. I see the same pattern in the field: the better the specification and the more clearly defined the tools, the further an agent can go without intervention. The more open-ended the task, the faster the agent succumbs to drift, loops, and an accumulation of small errors.
What This Changes for Business and Automation
First: I would no longer design AI automation as a 'chatbot next to an employee.' For some processes, it makes more sense to build multi-hour runbooks with checkpoints, rollbacks, and artifact verification.
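A runbook like that can be sketched as a simple resumable loop. Everything in this Python snippet is a hypothetical placeholder, not a real framework: the step names, the checkpoint file path, and the verification stub would all be replaced by your actual pipeline:

```python
import json
import pathlib

# Hypothetical names: adapt the checkpoint path and step list to your pipeline.
CHECKPOINT = pathlib.Path("runbook_state.json")
STEPS = ["fetch_inputs", "transform", "validate_artifacts", "publish"]

def run_step(name: str, state: dict) -> dict:
    """Stand-in for the real work of a step (an agent call, a script, a deploy)."""
    state[name] = "done"
    return state

def artifacts_ok(state: dict) -> bool:
    """Stand-in verification: in practice, check outputs against
    schemas, test suites, or golden files before moving on."""
    return True

def run(steps):
    # Resume from the last checkpoint if one exists.
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for step in steps:
        if state.get(step) == "done":
            continue  # already checkpointed on a previous run
        state = run_step(step, state)
        if not artifacts_ok(state):
            # Roll back this step's result, persist, and escalate to a human.
            state.pop(step, None)
            CHECKPOINT.write_text(json.dumps(state))
            raise RuntimeError(f"artifact verification failed at step: {step}")
        CHECKPOINT.write_text(json.dumps(state))  # checkpoint after every step
    return state

final_state = run(STEPS)
print(final_state)
```

The point of the shape, not the code: a failed verification stops the run in a known state instead of letting a 12-hour session drift silently, and a restart picks up from the last good checkpoint.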
Second: Teams with well-formalized tasks will win. Those who try to throw a chaotic production environment and vague requirements at an agent, expecting magic without proper AI integration into a stack with logs, tests, and access rights, will lose.
Third: The cost of an error is now more important than the cost of tokens. If an agent runs for 12 hours and ends up in an incorrect state, the savings can easily turn into expensive debugging.
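That third point is just expected-value arithmetic. A tiny sketch with made-up numbers; the token costs, the probability of ending in a bad state, and the human recovery cost are all assumptions for illustration:

```python
def expected_run_cost(token_cost: float, p_bad_state: float, recovery_cost: float) -> float:
    """Expected cost of one long agent run: tokens spent, plus the
    probability-weighted cost of human debugging and recovery."""
    return token_cost + p_bad_state * recovery_cost

# Illustrative numbers only; all three parameters are assumptions.
cheap_but_risky = expected_run_cost(token_cost=15.0, p_bad_state=0.30, recovery_cost=800.0)
with_safeguards = expected_run_cost(token_cost=40.0, p_bad_state=0.05, recovery_cost=800.0)
print(cheap_but_risky)   # → 255.0
print(with_safeguards)   # → 80.0
```

Even tripling token spend on checkpoints and verification passes comes out cheaper in expectation, because the recovery term dominates the token term once runs get long.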
At Nahornyi AI Lab, we tackle this challenging layer: deciding where to grant an agent autonomy, where to implement safety nets, and where to prevent it from acting without human oversight. If your processes are already hitting bottlenecks with manual checks and slow engineering cycles, Vadym Nahornyi and I can help you build AI automation that actually offloads your team, rather than producing beautifully formatted chaos.