June 13, 2026 · 3 min read

autoresearch: When the Model Hires an ML Engineer

Tags: autoresearch, ML engineering, AI automation

Andrej Karpathy introduced autoresearch, an open-source loop in which a model edits the training code, runs a short training pass, measures the result, and reverts ideas that don't help. It matters as an early but highly practical blueprint for AI automation in ML engineering: it shows how to delegate routine experimentation to an agent.

Technical Context

I love things like this not for the hype but for the shape of the loop. In autoresearch, Karpathy assembled a very down-to-earth cycle: the agent reads the repository and program.md, edits the training script, runs a short pass, checks the metric, and either commits the change or rolls it back via git.
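The cycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Karpathy's code: `Attempt`, `run_loop`, and the callback names are hypothetical, the scores in the usage note are made up, and the two git helpers show only one plausible way to commit or roll back.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Attempt:
    idea: str
    metric: float
    kept: bool


def git_commit(message: str) -> None:
    # Assumption: we are inside a git repo and the agent edited tracked files.
    subprocess.run(["git", "commit", "-am", message], check=True)


def git_rollback() -> None:
    # Assumption: throwing away all uncommitted edits is acceptable here.
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)


def run_loop(
    ideas: Iterable[str],
    score: Callable[[str], float],   # edit + short run + evaluation, as one step
    baseline: float,
    keep: Callable[[str], None],     # e.g. git_commit
    discard: Callable[[], None],     # e.g. git_rollback
) -> List[Attempt]:
    best = baseline
    log: List[Attempt] = []
    for idea in ideas:
        metric = score(idea)
        kept = metric < best         # lower is better, as with val_bpb
        if kept:
            keep(idea)
            best = metric            # the next idea must beat the new best
        else:
            discard()
        log.append(Attempt(idea, metric, kept))
    return log
```

With invented scores `{"bigger lr": 1.02, "warmup": 0.97, "no dropout": 0.99}` and a baseline of 1.00, only "warmup" is kept: it beats the baseline, and "no dropout" then fails to beat the new best of 0.97. In a real repository, `keep=git_commit` and `discard=git_rollback` would be the natural callbacks.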

Essentially, this is no longer a "code helper" but a blueprint for AI automation in an ML team. A person sets the goal and constraints, while the model handles the mechanical part of the work: hypothesis, edit, run, check, rollback.

What really caught my attention is that the control interface is not a heavy dashboard but a markdown spec. You don't tweak train.py by hand every time; instead you describe what counts as success, what may be touched, what the experiment budget is, and how attempts are logged.
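To make the spec-as-interface idea concrete, here is a hypothetical sketch of those four items (success criterion, editable files, budget, logging) as a small Python structure. The class, field names, and values are invented for illustration; they are not the contents or format of the real program.md.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class ExperimentSpec:
    metric: str                  # what counts as success
    lower_is_better: bool
    editable: Tuple[str, ...]    # files the agent may touch
    budget_seconds: int          # wall-clock budget per run
    log_path: str                # where each attempt is recorded

    def may_edit(self, path: str) -> bool:
        # Normalise "./train.py" to "train.py" before checking the allow-list.
        return PurePosixPath(path).as_posix() in self.editable


# Hypothetical values; only train.py, val_bpb and "about 5 minutes"
# come from the article itself.
spec = ExperimentSpec(
    metric="val_bpb",
    lower_is_better=True,
    editable=("train.py",),
    budget_seconds=300,
    log_path="experiments.log",
)
```

A harness built this way could call `spec.may_edit(path)` before applying any agent edit and refuse anything outside the allow-list.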

The current public loop is quite rigid: each run gets a short budget of about 5 minutes, the main metric is val_bpb (lower is better), and every comparison happens under identical conditions. This is crucial: the agent doesn't "magically train a model"; it iterates on changes inside a formalized sandbox.
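For readers unfamiliar with the metric: bits per byte is commonly computed by converting the summed negative log-likelihood on the validation text from nats to bits and dividing by the number of bytes of that text, which keeps the number comparable across tokenizers. The sketch below assumes that standard definition; the exact computation inside autoresearch may differ, and the function names and the `min_delta` noise margin are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed negative log-likelihood (in nats) -> bits per byte of text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def is_improvement(candidate_bpb: float, best_bpb: float, min_delta: float = 0.0) -> bool:
    # Lower is better. min_delta is an assumed margin so that run-to-run
    # noise is not mistaken for a real gain.
    return candidate_bpb < best_bpb - min_delta
```

The comparison is only meaningful when both numbers come from the same validation data and the same run budget, which is exactly why the identical-conditions rule matters.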

Judging by the published results, the idea works not as one big leap but as a series of small hits. Dozens or hundreds of runs yield only a few real improvements, and those are what gradually push quality or training speed forward.

And yes, secondary metrics can easily slip. If you optimize a single KPI, the agent will push on exactly that. Without a decent set of guardrails, such a system will find a bad local optimum just as quickly as a good move.
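One simple form of guardrail is an acceptance rule that requires the primary metric to improve while capping how far each secondary metric may regress. The sketch below is an assumption about how such a check could look, not part of autoresearch: the function name, the secondary metric `peak_mem_gb`, and the tolerances are hypothetical, and every metric is treated as lower-is-better.

```python
from typing import Dict, Optional


def accept(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    primary: str = "val_bpb",
    max_regression: Optional[Dict[str, float]] = None,
) -> bool:
    """Keep a change only if the primary metric improves and no guarded
    secondary metric gets worse by more than its tolerance."""
    if candidate[primary] >= baseline[primary]:
        return False                      # no gain on the KPI: reject
    for name, tolerance in (max_regression or {}).items():
        if candidate[name] - baseline[name] > tolerance:
            return False                  # gain bought at too high a price
    return True
```

For example, with a baseline of `{"val_bpb": 1.00, "peak_mem_gb": 10.0}` and a tolerance of 1.0 GB, a candidate at 0.98 and 10.5 GB passes, while one at 0.95 and 14.0 GB is rejected despite the better headline number.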

What This Changes for Business and Automation

The first effect is simple: the experiment cycle gets cheaper. If your team spends hours on repetitive runs, this pattern can be embedded as an internal loop in R&D, letting people focus on experiment design rather than routine.

The second point is about architecture. Teams that break training into short, measurable iterations with a clear metric will benefit. Projects where everything hinges on long runs, fuzzy KPIs, and handshake agreements in chat will struggle.

The third point seems the most important to me: this is not a replacement for an ML engineer but an amplifier of good engineering discipline. At Nahornyi AI Lab, we solve such tasks for clients regularly: first we define an objective metric and constraints, and only then do we build the automation; otherwise the agent simply automates chaos.

If your model training, prompt tuning, or internal experiments are bogged down in manual repetition, we can take this apart at the process level. At Nahornyi AI Lab, I will help you build an AI solution around your real workflow, so that the agent doesn't just play at science but saves people weeks of work.

We have already covered the Simple Self-Distillation method, which improves code generation quality by using the model's own predictions without external verifiers or complex reinforcement learning. That approach shows in practice how AI can autonomously optimize its own results, which is exactly the idea Karpathy scales up in autoresearch.
