Hi r/MachineLearning! I'm in the US transit sector and recently dove headfirst into AI and ML. When I came across Andrej Karpathy's autoresearch system, I found it really fascinating. For this experiment, I reused the same transit data from a previous GPT-2 XL fine-tuning but trained an 80M-parameter model entirely from scratch. Since autoresearch is built for pretraining — not fine-tuning — I launched a standalone project instead of modifying the GPT-2 XL work. I'd really appreciate your feedback on a few things:
My Motivation

Karpathy's autoresearch is essentially a self-driving research cycle: an agent tweaks one training script, runs a short 5-minute training job on a fixed dataset, and either keeps or rolls back the change based on a single performance number. It was developed and tuned on FineWeb, an enormous internet-scale text corpus. My work, by contrast, uses a much smaller, industry-specific dataset. Reading Karpathy's documentation, I wondered whether the core loop (the 5-minute training cap, the single-metric pass/fail gate, and the all-or-nothing ratchet) could still drive meaningful perplexity gains on a limited dataset. So I cloned the framework, fed it my transit corpus (~33M tokens covering traffic analyses, train routing plans, and regulatory Q&A), and focused on two questions:

Question #1: Can autoresearch work on a dataset roughly a million times smaller than the one it was designed for?

Question #2: What improvements does the autonomous agent discover that I wouldn't have thought of myself?

To be clear, this wasn't about building a production-ready chatbot; it was a test of the methodology. I wanted to see whether the mechanics (overnight autonomous runs, a single numeric gate, git-based experiment tracking) still function when data is narrow and scarce.
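For readers unfamiliar with the framework, here's a minimal sketch of the keep-or-rollback loop as I understand it. Everything in it (function names, the git calls, the threshold value) is my own illustration, not the actual autoresearch API:

```python
import random
import subprocess

NOISE_THRESHOLD = 0.01  # set from repeated baseline runs (more on this below)

def propose_edit(path: str) -> None:
    """Hypothetical stand-in: the agent rewrites one thing in the script."""

def run_training(budget_seconds: int) -> float:
    """Hypothetical stand-in: train for a fixed wall-clock budget and
    return validation loss."""
    return random.uniform(3.0, 4.0)

def autoresearch_step(best: float) -> float:
    """One ratchet iteration: edit, commit, train 5 minutes, keep or roll back."""
    propose_edit("train.py")                      # one change at a time
    subprocess.run(["git", "commit", "-am", "experiment"], check=True)
    score = run_training(budget_seconds=300)      # the 5-minute cap
    if score < best - NOISE_THRESHOLD:            # lower loss = better,
        return score                              # and it must clear the noise floor
    subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
    return best                                   # roll back; the ratchet holds
```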
Constraints of the Project

Early on, I hit several compatibility issues. The autoresearch framework makes three assumptions that didn't fit my setup: FlashAttention-3 kernel support, the "one change at a time" architecture rule, and validation-set robustness against overfitting. I had to work around each of these.
Design Decisions and Rationale

Beyond those workarounds, a few other adjustments. I divided the transit dataset into four topic-separated splits (train, dev, val_public, and test_private) so that no single document bleeds across any boundary, preventing leakage between training, the agent's working set, the commit gate, and milestone checkpoints. I also built a custom tokenizer so that 65 common transit acronyms (FTA, MBTA, NTD, IIJA, etc.) each map to a single token rather than being broken into subword fragments (a tokenizer sketch appears at the end of this section). Before starting the agent loop, I trained the baseline five times with different random seeds to measure the inherent variance, giving me a reliable noise threshold for separating genuine improvements from random fluctuation later.

Results

The single biggest improvement surprised me. The agent cut the batch size in half twice, from 524K tokens per step down to 131K, which packed 3.6 times more gradient updates into the same five-minute window (118 steps → 427 steps). Each individual step was noisier, but the Muon optimizer absorbed the noise. I would have dismissed this idea in a code review, since conventional wisdom says larger batches give steadier training. The agent didn't share that assumption and found it on experiment 13, after eight failed architecture tweaks.

The model-size sweep settled the capacity question: 80M parameters was the clear sweet spot. 30M and 50M underfit, while 100M and 150M couldn't complete enough optimizer steps in five minutes to keep up (the 150M model squeezed in only 84 steps before the timer expired).

The gating protocol caught two false positives: experiments that improved the agent's internal dev metric but didn't generalize to the held-out validation set. Without the blind gate, both would have been kept; instead, both were rolled back.

My rigor check was humbling. When I repeated the winning late-stage configurations at a fresh seed (INIT_SEED=43), the language-modeling gains held firm (variation within ±0.005 across four runs: two architectures × two seeds), but two apparent accuracy boosts evaporated. Terminology accuracy jumped around by 9 points between seeds, and regulatory citation accuracy swung by 15 points. Proper statistical tests on the domain accuracy benchmarks (terminology, Q&A, regulatory citations) confirmed that only 1 of 8 head-to-head comparisons reached significance. The takeaway was unavoidable: the language-modeling improvement is genuine (~20× above the noise floor and replicated at a new seed), but the apparent domain-accuracy wins were random noise given benchmark sizes of 100–250 items.
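On the tokenizer: here's a minimal sketch of how the acronym-preserving build can be done with the Hugging Face `tokenizers` library. The acronym list is truncated, and the file paths and vocab size are placeholders, not my exact configuration:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A few of the 65 transit acronyms; each becomes one atomic token.
TRANSIT_ACRONYMS = ["FTA", "MBTA", "NTD", "IIJA", "WMATA", "GTFS"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[PAD]"] + TRANSIT_ACRONYMS,  # never split these
)
tokenizer.train(files=["transit_corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("transit_tokenizer.json")

# "FTA" and "IIJA" now encode to single ids instead of subword fragments.
print(tokenizer.encode("FTA reporting under IIJA").tokens)
```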
Key Takeaways
Here are five insights from this project that I intend to apply to any future work involving autoresearch on limited datasets:
- The autoresearch framework is effective on small, specialized datasets, but only if you build in your own checks. The Transit Language Model score rose by approximately 14%. That said, two of the experiments appeared successful but failed to hold up on unseen data; without proper safeguards, you risk reporting misleading results.
- The most impactful improvement came from changing how frequently the model was updated, not from altering its architecture. Cutting the batch size in half twice allowed 3.6× more training steps within the same 5-minute window (118 → 427 updates), which was responsible for the 13.8% gain. In a traditional code review I would have rejected this change, believing that "larger batches lead to more stable training." The autoresearch agent wasn't limited by that assumption, and the Muon optimizer handled the extra gradient noise gracefully (see the step-count arithmetic after this list).
- Run the baseline several times with different random seeds before handing control to the agent. Five baseline runs took about 30 minutes, but they revealed how much each metric fluctuates by chance alone. Without this step, it's impossible to distinguish genuine progress from random variation (a noise-floor sketch follows the list).
- Re-validate every apparent improvement at a different random seed before finalizing results. Two ~6-minute reruns revealed that two of the late-stage accuracy "improvements" failed to hold up; they were fortunate seed selections, not meaningful gains. Two seeds appeared sufficient as a first-pass filter for unreliable outcomes.
- Prevent the agent from accessing the held-out score directly; provide only a pass/fail signal. An agent can't exploit information it doesn't have. This blind gate flagged two apparent "wins" during the project that wouldn't have translated to new, unseen data (sketched below).
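To make the batch-size takeaway concrete, here's the back-of-the-envelope arithmetic behind the 3.6× figure. The throughput interpretation is my own inference from the reported step counts:

```python
# Step counts and batch sizes as reported above; everything else is derived.
BIG_BATCH = 524_288      # tokens per step before the agent's change
SMALL_BATCH = 131_072    # tokens per step after halving twice
steps_big = 118
steps_small = 427

tokens_in_budget = steps_big * BIG_BATCH            # ~61.9M tokens in 5 minutes

# If tokens/sec were identical at both batch sizes, the small batch would fit:
ideal_small_steps = tokens_in_budget / SMALL_BATCH  # ~472 steps
# Observed 427, i.e. ~10% fewer: per-step overhead and reduced parallelism
# eat into throughput, but the update count still grows 3.6x.
print(ideal_small_steps, steps_small / steps_big)   # 472.0  3.618...
```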
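Next, the two statistical checks from takeaways three and four, as a sketch. The loss values are illustrative, and `fisher_exact` is just one reasonable test for accuracy benchmarks of 100–250 items; I'm not claiming it's the exact test I ran:

```python
import statistics
from scipy.stats import fisher_exact

# 1) Noise floor: run the untouched baseline at several seeds first.
baseline_losses = [3.412, 3.405, 3.418, 3.409, 3.414]  # illustrative values
noise = statistics.stdev(baseline_losses)
threshold = 2 * noise  # require improvements to clear ~2 sigma before keeping
print(f"noise floor: {noise:.4f}, keep-threshold: {threshold:.4f}")

# 2) Significance on a small accuracy benchmark (say 150 items):
# is 118/150 -> 127/150 correct a real improvement, or seed noise?
table = [[118, 150 - 118],   # baseline: correct, incorrect
         [127, 150 - 127]]   # candidate: correct, incorrect
_, p = fisher_exact(table)
print(f"p = {p:.3f}")        # well above 0.05 -> not significant
```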
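Finally, the blind gate from the last takeaway. The design point is that the only function with access to val_public returns a boolean; the raw score goes to a log the agent can't read. All names here are my own, not the framework's:

```python
def evaluate(ckpt: str, split: str) -> float:
    """Hypothetical stand-in: compute the checkpoint's loss on `split`."""
    return 3.40  # placeholder

def blind_gate(ckpt: str, best_loss: float) -> bool:
    """Score on val_public, but hand the agent only PASS/FAIL."""
    loss = evaluate(ckpt, split="val_public")
    passed = loss < best_loss
    # Audit trail for humans; kept outside the agent's working directory.
    with open("gate_log_private.txt", "a") as f:
        f.write(f"{ckpt} {loss:.6f} {'PASS' if passed else 'FAIL'}\n")
    return passed  # the agent never sees `loss`, so there is no score to climb
```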
What’s Next?
Honestly, I'm uncertain about the best path forward. Several directions seem valuable, and I'd welcome the ML community's perspective on which is most compelling. The three options I'm considering:
1. Replicate the project with fresh random seeds. Re-run the complete Phase 5 + Phase 7 pipeline with two or three new seeds to see whether similar improvements surface, and whether the same false positives reappear. The core question: "Is this methodology reliable, or was I just fortunate in a different way?"
2. Run autoresearch as designed on a general-purpose dataset. Clone Karpathy's original repository, without my AutoTransit modifications, and run it on a slice of FineWeb, the kind of data the framework was built for. Comparing those results with the ones from my small, specialized dataset will help clarify which findings apply to autoresearch broadly and which are unique to limited data.
3. Compare my from-scratch training with domain-adaptive pretraining (DAPT). Take a similarly sized off-the-shelf pretrained model (Pythia-160M, already trained on web text) and continue training it on my transit dataset, keeping the data, evaluation method, and overall approach identical. The central question is whether training from random weights can keep up with this straightforward alternative; from what I understand, most existing research suggests it can't. If my from-scratch result remains competitive, that's the genuinely fascinating outcome; if it doesn't, there's still valuable insight to be gained. (A rough sketch of the DAPT baseline follows.)
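For option 3, the DAPT baseline would be short to set up with Hugging Face transformers. This is only a sketch of what I have in mind; the corpus path and every hyperparameter are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.pad_token = tokenizer.eos_token  # Pythia ships without a pad token

# Placeholder path; same 33M-token transit corpus as the from-scratch runs.
ds = load_dataset("text", data_files={"train": "transit_corpus.txt"})["train"]
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-pythia-160m",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=1e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # then evaluate with the same transit benchmarks as before
```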
THANK YOU if you made it this far down!! Lol. Please share your thoughts: Where did I go wrong? What do you find interesting? What would you try next?
submitted by /u/MarsPassenger