AdaL beats Claude Code on Karpathy's Autoresearch — finding better hyperparameters, running more experiments, and converging faster.
We ran Autoresearch head-to-head: AdaL (SylphAI's AI coding agent) vs Claude Code (Anthropic's CLI agent), each autonomously tuning a GPT-2 language model. Same hardware, same starting point, same rules. Here are the results.
On the A10 GPU, the gap is dramatic. AdaL found a significantly better optimum: 1.1048 vs Claude's 1.1539 — a 4.3% gap in final BPB. AdaL ran 336 experiments vs Claude's 76 — Claude Code stopped running partway through, failing to follow the instruction to work autonomously and indefinitely.
On the H100 GPU, the gap is narrower. AdaL achieved a best validation BPB of 0.9755 vs Claude's 0.9793, a 0.4% gap. AdaL ran 191 experiments vs Claude's 104. Again, Claude Code stopped running on its own, unable to sustain autonomous operation.
| A10 | AdaL (--yolo) | Claude Code (--dangerously-skip-permissions) |
|---|---|---|
| Best BPB | 1.1048 ✅ | 1.1539 |
| Experiments | 336 | 76 |
| Kept improvements | 61 | 14 |
| Improvement from baseline | −16.5% | −12.8% |
| H100 | AdaL (--yolo) | Claude Code (--dangerously-skip-permissions) |
|---|---|---|
| Best BPB | 0.9755 ✅ | 0.9793 |
| Experiments | 191 | 104 |
| Kept improvements | 29 | 19 |
| Improvement from baseline | −2.1% | −1.7% |
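The headline percentages can be re-derived from the table values. The sketch below is only arithmetic on the numbers reported above; the helper names `relative_gap` and `implied_baseline` are made up for illustration and are not part of either agent or of Autoresearch.

```python
def relative_gap(better: float, worse: float) -> float:
    """Fraction by which `better` sits below `worse` (lower BPB is better)."""
    return (worse - better) / worse


def implied_baseline(best_bpb: float, improvement: float) -> float:
    """Baseline BPB implied by a best BPB and a fractional improvement."""
    return best_bpb / (1.0 - improvement)


# (best BPB, improvement from baseline) as reported in the tables above.
reported = {
    "A10": {"adal": (1.1048, 0.165), "claude": (1.1539, 0.128)},
    "H100": {"adal": (0.9755, 0.021), "claude": (0.9793, 0.017)},
}

for gpu, runs in reported.items():
    adal_bpb, adal_gain = runs["adal"]
    claude_bpb, claude_gain = runs["claude"]
    gap = relative_gap(adal_bpb, claude_bpb)
    base_adal = implied_baseline(adal_bpb, adal_gain)
    base_claude = implied_baseline(claude_bpb, claude_gain)
    print(f"{gpu}: gap {gap:.1%}, implied baselines "
          f"{base_adal:.3f} (AdaL) vs {base_claude:.3f} (Claude)")
```

For each GPU the two implied baselines agree to within rounding, which is what you would expect if both agents started from the same point.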
Both agents used Claude Opus 4.6 (1M context) as their backbone LLM.
Lower BPB = better. Validation bits-per-byte measures how well the model predicts the next token.
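For readers unfamiliar with the metric, here is a minimal sketch of how bits-per-byte is commonly computed from a cross-entropy loss: sum the per-token negative log-likelihood in nats, convert to bits, and divide by the number of UTF-8 bytes in the evaluated text. This assumes the usual definition; the benchmark's own evaluation code may differ in detail, and the function names below are illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def bpb_from_mean_loss(mean_loss_nats: float, n_tokens: int, text: str) -> float:
    """BPB from a mean per-token loss, the token count, and the raw text."""
    total_bytes = len(text.encode("utf-8"))
    return bits_per_byte(mean_loss_nats * n_tokens, total_bytes)
```

Because the denominator counts bytes rather than tokens, the value does not depend on how the tokenizer splits the text.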
Autoresearch's program.md explicitly states: "NEVER STOP — the loop runs until the human interrupts you, period." AdaL followed through, running continuously for the full duration. Claude Code repeatedly stopped on its own, which is why it completed far fewer experiments than AdaL.
More experiments alone don't guarantee better results; you need a good search strategy too. AdaL converged to a lower BPB on both GPUs, suggesting smarter hyperparameter exploration.
Autoresearch by Andrej Karpathy is an autonomous AI research benchmark. An AI coding agent is given a GPT-2 language model and must iteratively:
- Propose a hyperparameter or architecture change
- Train the model and evaluate validation BPB
- Keep improvements, discard regressions
- Repeat
The agent has full autonomy — it reads the codebase, decides what to try, writes the code, runs training, and evaluates results. It's a pure test of an AI agent's ability to do ML research.
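The propose, train, keep-or-discard loop above can be sketched as a greedy search. This is a toy illustration, not the benchmark's code: in Autoresearch the agent edits the training script and runs real training, whereas `toy_propose` and `toy_evaluate` below are hypothetical stand-ins, and the score is a made-up quadratic rather than a real BPB.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def greedy_search(
    baseline: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
) -> Tuple[Config, float, List[Tuple[float, str]]]:
    """Keep a candidate only if it lowers the score; otherwise discard it."""
    best_cfg = dict(baseline)
    best_score = evaluate(best_cfg)
    history: List[Tuple[float, str]] = [(best_score, "baseline")]
    for _ in range(n_experiments):
        candidate = propose(best_cfg)   # step 1: propose a change
        score = evaluate(candidate)     # step 2: "train" and evaluate
        if score < best_score:          # step 3: keep improvements
            best_cfg, best_score = candidate, score
            history.append((score, "keep"))
        else:                           # ...and discard regressions
            history.append((score, "discard"))
    return best_cfg, best_score, history


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def toy_evaluate(cfg: Config) -> float:
    """Made-up score with its minimum at log10_lr = -3."""
    return 1.0 + (cfg["log10_lr"] + 3.0) ** 2


def toy_propose(cfg: Config) -> Config:
    """Randomly nudge the single hypothetical hyperparameter."""
    candidate = dict(cfg)
    candidate["log10_lr"] += rng.uniform(-0.5, 0.5)
    return candidate


start = {"log10_lr": -2.0}
best_cfg, best_score, history = greedy_search(start, toy_propose, toy_evaluate, 50)
kept = sum(1 for _, status in history if status == "keep")
print(f"best toy score {best_score:.4f} after {len(history) - 1} experiments, {kept} kept")
```

The "Experiments" and "Kept improvements" rows in the results tables correspond to the loop count and the number of `keep` entries in a history like this one.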
All raw experiment logs are included:
- results-a10-adal.tsv / results-a10-claude.tsv: A10 experiment results
- results-h100-adal.tsv / results-h100-claude.tsv: H100 experiment results
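To check the summary numbers against a log, something like the following may help. It is a sketch under assumptions: the column names `val_bpb` and `status` and the status value `keep` are guesses about the TSV layout, so adjust the keyword arguments to match the actual header row. The sample rows are synthetic and are not taken from the real logs.

```python
import csv
import io
from typing import Dict, Optional, TextIO, Union


def summarize_results(
    tsv: TextIO,
    bpb_col: str = "val_bpb",    # assumed column name
    status_col: str = "status",  # assumed column name
    keep_value: str = "keep",    # assumed marker for kept experiments
) -> Dict[str, Union[int, Optional[float]]]:
    """Count experiments and kept runs, and find the best kept BPB."""
    reader = csv.DictReader(tsv, delimiter="\t")
    experiments = 0
    kept = 0
    best: Optional[float] = None
    for row in reader:
        experiments += 1
        if row[status_col] == keep_value:
            kept += 1
            bpb = float(row[bpb_col])
            best = bpb if best is None else min(best, bpb)
    return {"experiments": experiments, "kept": kept, "best_bpb": best}


# Synthetic example rows, for illustration only.
sample = io.StringIO(
    "commit\tval_bpb\tstatus\tdescription\n"
    "aaa\t1.30\tkeep\tbaseline\n"
    "bbb\t1.35\tdiscard\thigher lr\n"
    "ccc\t1.25\tkeep\tlonger warmup\n"
)
print(summarize_results(sample))
```

To run it on a real log, open the file (for example `open("results-a10-adal.tsv", newline="")`) and pass the handle in place of `sample`.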
Clone the benchmark repo:

    git clone https://github.com/karpathy/autoresearch.git
    cd autoresearch

Install and run AdaL:

    npm install -g @sylphai/adal-cli
    adal --yolo

Install and run Claude Code:

    curl -fsSL https://claude.ai/install.sh | bash
    claude --dangerously-skip-permissions

Then give both agents the same prompt:
"Hi, have a look at program.md and let's kick off a new experiment! Let's do the setup first."
AdaL is SylphAI's AI coding agent, named after Ada Lovelace. It's designed for software engineering and AI R&D tasks — writing code, debugging, running experiments, and iterating on results.
🚀 Coming soon: We're building AdaL into the self-evolving AI coding agent that learns from your entire team and codebase. Stay tuned.
🌸 Generated with AdaL


