Hi r/MachineLearning! I'm in the US transit sector and recently dove headfirst into AI and ML. When I came across Andrej Karpathy's autoresearch system, I found it really fascinating. For this experiment, I reused the same transit data from a previous GPT-2 XL fine-tuning but trained an 80M-parameter model entirely from scratch. Since autoresearch is built for pretraining — not fine-tuning — I launched a standalone project instead of modifying the GPT-2 XL work. I'd really appreciate your feedback on a few things:
My Motivation

Karpathy's autoresearch is essentially a self-driving research cycle: an agent tweaks one training script, runs a short 5-minute training job on a fixed dataset, and either keeps or rolls back the change based on a single performance number. It was developed and tuned on FineWeb, an enormous internet-scale text corpus. My work, by contrast, uses a much smaller, industry-specific dataset. Reading Karpathy's documentation, I wondered whether the core loop (the 5-minute training cap, the single-metric pass/fail gate, and the all-or-nothing ratchet) could still drive meaningful perplexity gains on a limited dataset. So I cloned the framework, fed it my transit corpus (~33M tokens covering traffic analyses, train routing plans, and regulatory Q&A), and focused on two questions:

Question #1: Can autoresearch work on a dataset roughly a million times smaller than the one it was designed for?

Question #2: What improvements does the autonomous agent discover that I wouldn't have thought of myself?

To be clear, this wasn't about building a production-ready chatbot; it was a test of the methodology. I wanted to see whether the mechanics (overnight autonomous runs, a single numeric gate, git-based experiment tracking) still function when data is narrow and scarce.
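For readers unfamiliar with the framework, here's a minimal sketch of the keep-or-rollback loop as I understand it. Everything in it (function names, the git calls, the threshold value) is my own illustration, not the actual autoresearch API:

```python
import random
import subprocess

NOISE_THRESHOLD = 0.01  # set from repeated baseline runs (more on this below)

def propose_edit(path: str) -> None:
    """Hypothetical stand-in: the agent rewrites one thing in the script."""

def run_training(budget_seconds: int) -> float:
    """Hypothetical stand-in: train for a fixed wall-clock budget and
    return validation loss."""
    return random.uniform(3.0, 4.0)

def autoresearch_step(best: float) -> float:
    """One ratchet iteration: edit, commit, train 5 minutes, keep or roll back."""
    propose_edit("train.py")                      # one change at a time
    subprocess.run(["git", "commit", "-am", "experiment"], check=True)
    score = run_training(budget_seconds=300)      # the 5-minute cap
    if score < best - NOISE_THRESHOLD:            # lower loss = better,
        return score                              # and it must clear the noise floor
    subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
    return best                                   # roll back; the ratchet holds
```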
Constraints of the Project

Early on, I hit several compatibility issues. The autoresearch framework makes three assumptions that didn't fit my setup: FlashAttention-3 kernel support, the "one change at a time" architecture rule, and validation-set robustness against overfitting. I had to work around each of these.
Design Decisions and Rationale

Beyond those workarounds, a few other adjustments. I divided the transit dataset into four topic-separated splits (train, dev, val_public, and test_private) so that no single document bleeds across any boundary, preventing leakage between training, the agent's working set, the commit gate, and milestone checkpoints. I also built a custom tokenizer so that 65 common transit acronyms (FTA, MBTA, NTD, IIJA, etc.) each map to a single token rather than being broken into subword fragments (a tokenizer sketch appears at the end of this section). Before starting the agent loop, I trained the baseline five times with different random seeds to measure the inherent variance, giving me a reliable noise threshold for separating genuine improvements from random fluctuation later.

Results

The single biggest improvement surprised me. The agent cut the batch size in half twice, from 524K tokens per step down to 131K, which packed 3.6 times more gradient updates into the same five-minute window (118 steps → 427 steps). Each individual step was noisier, but the Muon optimizer absorbed the noise. I would have dismissed this idea in a code review, since conventional wisdom says larger batches give steadier training. The agent didn't share that assumption and found it on experiment 13, after eight failed architecture tweaks.

The model-size sweep settled the capacity question: 80M parameters was the clear sweet spot. 30M and 50M underfit, while 100M and 150M couldn't complete enough optimizer steps in five minutes to keep up (the 150M model squeezed in only 84 steps before the timer expired).

The gating protocol caught two false positives: experiments that improved the agent's internal dev metric but didn't generalize to the held-out validation set. Without the blind gate, both would have been kept; instead, both were rolled back.

My rigor check was humbling. When I repeated the winning late-stage configurations at a fresh seed (INIT_SEED=43), the language-modeling gains held firm (variation within ±0.005 across four runs: two architectures × two seeds), but two apparent accuracy boosts evaporated. Terminology accuracy jumped around by 9 points between seeds, and regulatory citation accuracy swung by 15 points. Proper statistical tests on the domain accuracy benchmarks (terminology, Q&A, regulatory citations) confirmed that only 1 of 8 head-to-head comparisons reached significance. The takeaway was unavoidable: the language-modeling improvement is genuine (~20× above the noise floor and replicated at a new seed), but the apparent domain-accuracy wins were random noise given benchmark sizes of 100–250 items.
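On the tokenizer: here's a minimal sketch of how the acronym-preserving build can be done with the Hugging Face `tokenizers` library. The acronym list is truncated, and the file paths and vocab size are placeholders, not my exact configuration:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A few of the 65 transit acronyms; each becomes one atomic token.
TRANSIT_ACRONYMS = ["FTA", "MBTA", "NTD", "IIJA", "WMATA", "GTFS"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[PAD]"] + TRANSIT_ACRONYMS,  # never split these
)
tokenizer.train(files=["transit_corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("transit_tokenizer.json")

# "FTA" and "IIJA" now encode to single ids instead of subword fragments.
print(tokenizer.encode("FTA reporting under IIJA").tokens)
```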
Key Takeaways
Here are five insights from this project that I intend to apply to any future work involving autoresearch on limited datasets:
- The autoresearch framework is effective on small, specialized datasets, but only if you build in your own checks. The Transit Language Model score rose by approximately 14%. That said, two of the experiments appeared successful but failed to hold up on unseen data; without proper safeguards, you risk reporting misleading results.
- The most impactful improvement came from changing how frequently the model was updated, not from altering its architecture. Cutting the batch size in half twice allowed 3.6× more training steps within the same 5-minute window (118 → 427 updates), which was responsible for the 13.8% gain. In a traditional code review I would have rejected this change, believing that "larger batches lead to more stable training." The autoresearch agent wasn't limited by that assumption, and the Muon optimizer handled the extra gradient noise gracefully (see the step-count arithmetic after this list).
- Run the baseline several times with different random seeds before handing control to the agent. Five baseline runs took about 30 minutes, but they revealed how much each metric fluctuates by chance alone. Without this step, it's impossible to distinguish genuine progress from random variation (a noise-floor sketch follows the list).
- Re-validate every apparent improvement at a different random seed before finalizing results. Two ~6-minute reruns revealed that two of the late-stage accuracy "improvements" failed to hold up; they were fortunate seed selections, not meaningful gains. Two seeds appeared sufficient as a first-pass filter for unreliable outcomes.
- Prevent the agent from accessing the held-out score directly; provide only a pass/fail signal. An agent can't exploit information it doesn't have. This blind gate flagged two apparent "wins" during the project that wouldn't have translated to new, unseen data (sketched below).
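To make the batch-size takeaway concrete, here's the back-of-the-envelope arithmetic behind the 3.6× figure. The throughput interpretation is my own inference from the reported step counts:

```python
# Step counts and batch sizes as reported above; everything else is derived.
BIG_BATCH = 524_288      # tokens per step before the agent's change
SMALL_BATCH = 131_072    # tokens per step after halving twice
steps_big = 118
steps_small = 427

tokens_in_budget = steps_big * BIG_BATCH            # ~61.9M tokens in 5 minutes

# If tokens/sec were identical at both batch sizes, the small batch would fit:
ideal_small_steps = tokens_in_budget / SMALL_BATCH  # ~472 steps
# Observed 427, i.e. ~10% fewer: per-step overhead and reduced parallelism
# eat into throughput, but the update count still grows 3.6x.
print(ideal_small_steps, steps_small / steps_big)   # 472.0  3.618...
```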
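Next, the two statistical checks from takeaways three and four, as a sketch. The loss values are illustrative, and `fisher_exact` is just one reasonable test for accuracy benchmarks of 100–250 items; I'm not claiming it's the exact test I ran:

```python
import statistics
from scipy.stats import fisher_exact

# 1) Noise floor: run the untouched baseline at several seeds first.
baseline_losses = [3.412, 3.405, 3.418, 3.409, 3.414]  # illustrative values
noise = statistics.stdev(baseline_losses)
threshold = 2 * noise  # require improvements to clear ~2 sigma before keeping
print(f"noise floor: {noise:.4f}, keep-threshold: {threshold:.4f}")

# 2) Significance on a small accuracy benchmark (say 150 items):
# is 118/150 -> 127/150 correct a real improvement, or seed noise?
table = [[118, 150 - 118],   # baseline: correct, incorrect
         [127, 150 - 127]]   # candidate: correct, incorrect
_, p = fisher_exact(table)
print(f"p = {p:.3f}")        # well above 0.05 -> not significant
```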
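Finally, the blind gate from the last takeaway. The design point is that the only function with access to val_public returns a boolean; the raw score goes to a log the agent can't read. All names here are my own, not the framework's:

```python
def evaluate(ckpt: str, split: str) -> float:
    """Hypothetical stand-in: compute the checkpoint's loss on `split`."""
    return 3.40  # placeholder

def blind_gate(ckpt: str, best_loss: float) -> bool:
    """Score on val_public, but hand the agent only PASS/FAIL."""
    loss = evaluate(ckpt, split="val_public")
    passed = loss < best_loss
    # Audit trail for humans; kept outside the agent's working directory.
    with open("gate_log_private.txt", "a") as f:
        f.write(f"{ckpt} {loss:.6f} {'PASS' if passed else 'FAIL'}\n")
    return passed  # the agent never sees `loss`, so there is no score to climb
```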
What’s Next?
Honestly, I'm uncertain about the best path forward. Several directions seem valuable, and I'd welcome the ML community's perspective on which is most compelling. The three options I'm considering:
1. Replicate the project with fresh random seeds. Re-run the complete Phase 5 + Phase 7 pipeline with two or three new seeds to see whether similar improvements surface, and whether the same false positives reappear. The core question: "Is this methodology reliable, or was I just fortunate in a different way?"
2. Run autoresearch as designed on a general-purpose dataset. Clone Karpathy's original repository, without my AutoTransit modifications, and run it on a slice of FineWeb, the kind of data the framework was built for. Comparing those results with the ones from my small, specialized dataset will help clarify which findings apply to autoresearch broadly and which are unique to limited data.
3. Compare my from-scratch training with domain-adaptive pretraining (DAPT). Take a similarly sized off-the-shelf pretrained model (Pythia-160M, already trained on web text) and continue training it on my transit dataset, keeping the data, evaluation method, and overall approach identical. The central question is whether training from random weights can keep up with this straightforward alternative; from what I understand, most existing research suggests it can't. If my from-scratch result remains competitive, that's the genuinely fascinating outcome; if it doesn't, there's still valuable insight to be gained. (A rough sketch of the DAPT baseline follows.)
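For option 3, the DAPT baseline would be short to set up with Hugging Face transformers. This is only a sketch of what I have in mind; the corpus path and every hyperparameter are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.pad_token = tokenizer.eos_token  # Pythia ships without a pad token

# Placeholder path; same 33M-token transit corpus as the from-scratch runs.
ds = load_dataset("text", data_files={"train": "transit_corpus.txt"})["train"]
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-pythia-160m",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=1e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # then evaluate with the same transit benchmarks as before
```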
THANK YOU if you made it this far down!! Lol. Please share your thoughts: Where did I go wrong? What do you find interesting? What would you try next?
submitted by /u/MarsPassenger