Have you ever been in a situation where you have lots of ideas for improving your product, but no time to test them all? I bet you have.
What if I told you that you no longer have to do it all on your own: you can delegate it to AI. It can run dozens (or even hundreds) of experiments for you, discard ideas that don’t work, and iterate on the ones that actually move the needle.
Sounds amazing. And that’s exactly the idea behind autoresearch, where an LLM operates in a loop, repeatedly experimenting, measuring impact, and iterating from there. The approach sounded compelling, and many of my colleagues have already seen benefits from it. So I decided to try it out myself.
For this, I picked a practical analytical task: marketing budget optimisation with a bunch of constraints. Let’s see whether an autonomous loop can reach the same results as we did.
Background
Let’s start with some background to set the context. Autoresearch was developed by Andrej Karpathy. As he wrote in his repository:
Once upon a time, frontier AI research was done by meat computers in between eating, sleeping, having other fun, and synchronizing every now and then using sound wave interconnect in the ritual of “team meeting”. That era is long gone. Research is now solely the domain of autonomous swarms of AI agents operating across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.
The idea behind autoresearch is to let an LLM operate on its own in an environment where it can continuously run experiments. It changes the code, trains the model, evaluates whether performance improves, and then either keeps or discards each change before repeating the loop. Eventually, you come back and (hopefully) find a better model than you started with. Using this approach, Andrej was able to significantly improve nanochat.
The original implementation was focused on optimising an ML model. However, a similar approach can be applied to any task with a clear objective (from reducing website load time to minimising errors when scraping with Playwright). Shopify later open-sourced an extension of the original autoresearch, pi-autoresearch. It builds on pi, a minimal open-source terminal coding harness.
It follows a similar loop to the original autoresearch, with a few key steps:
- Define the metric you want to improve, along with any constraints.
- Measure the baseline.
- Hypothesis testing: in each iteration, the agent proposes an idea, writes it down, and tests it. There are three possible outcomes: it doesn’t work (discard), it worsens the metric (discard), or it improves the target (keep it and iterate from there).
- Repeat: the loop continues until you stop it, improvements plateau, or it reaches a predefined iteration limit.
So the core idea is to define a clear objective and let the agent try bold ideas and learn from them. This approach can uncover potential improvements to your KPIs by testing ideas your team simply never had the time to explore. Conceptually, the loop looks something like the sketch below. It definitely sounds exciting, so let’s try it out.
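Here’s a minimal, runnable Python sketch of that loop. The helper functions are hypothetical stubs just to show the control flow; in the real harness, the LLM proposes code changes and your benchmark produces the metric:

```python
import random

MAX_ITERATIONS = 30

def propose_idea(history):
    # Stub: in reality, the LLM suggests a change based on past results
    return f"idea-{len(history)}"

def run_benchmark(idea):
    # Stub: in reality, this applies the change and runs the real
    # evaluation, parsing the metric from its output
    return random.random()

best_score = 0.0  # baseline metric, measured before the loop starts
history = []
for _ in range(MAX_ITERATIONS):
    idea = propose_idea(history)
    score = run_benchmark(idea)
    if score > best_score:
        best_score = score  # improvement: keep the change and build on it
    # otherwise the change is reverted and discarded
    history.append((idea, score))  # log results so ideas aren't retried blindly

print(f"Best metric after {MAX_ITERATIONS} iterations: {best_score:.4f}")
```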
Task
I wanted to test this approach on an analytical task, since in analytical day-to-day work we often have clear objectives and need to iterate multiple times to reach an optimal solution. So, I went through all the posts I’ve written for Towards Data Science over the years and found a task around optimising marketing campaigns, which we discussed in the article “Linear Optimisations in Product Analytics”.
The task is quite common. Imagine you work as a marketing analyst and need to plan marketing activities for the next month. Your goal is to maximise revenue within a limited marketing budget ($30M).
You have a set of potential marketing campaigns, along with projections for each of them. For each campaign, we know the following:
- `country` and `channel` — the country and marketing channel,
- `marketing_spending` — investment required for this activity,
- `revenue` — expected revenue from acquired customers over the next 12 months (our target metric).
We also have some additional information, such as the number of acquired users and the number of customer support contacts. We’ll use these to iterate on the initial task and make it progressively harder by adding extra constraints.

It’s helpful to give the agent a baseline approach so it has something to start from. So, let’s put one together. One simple solution for this optimisation is to focus on the top-performing segments by revenue per dollar spent. We can sort all campaigns by this metric and select the ones that fit within the budget. Of course, this approach is quite naive and can definitely be improved, but it provides a good starting point.
```python
import pandas as pd

df = pd.read_csv('marketing_campaign_estimations.csv', sep='\t')

# --- Baseline: greedy by revenue-per-dollar ---
df['revenue_per_spend'] = df.revenue / df.marketing_spending
df = df.sort_values('revenue_per_spend', ascending=False)
df['spend_cumulative'] = df.marketing_spending.cumsum()
selected_df = df[df.spend_cumulative <= 30_000_000]

total_spend = selected_df.marketing_spending.sum()
revenue_millions = selected_df.revenue.sum() / 1_000_000
assert total_spend <= 30_000_000, f"Budget violated: {total_spend}"
print(f"METRIC revenue_millions={revenue_millions:.4f}")
print(f"Segments={len(selected_df)} spend={total_spend/1e6:.2f}M")
```

I put this code in optimise.py in the repository.
If we run the baseline, we see that the resulting revenue is 107.9M USD, while the total spend is 29.2M.

```bash
python3 optimise.py
# METRIC revenue_millions=107.9158
# Segments=48 spend=29.23M
```

Setting up
Before moving on to the actual experiment, we first need to install pi-autoresearch. We start by setting up pi itself, following the instructions from pi.dev. Luckily, it can be installed with a single command, giving you a pi coding harness up and running locally that you can already use to help with coding tasks.

```bash
npm install -g @mariozechner/pi-coding-agent # install pi
pi # start pi
/login # select a provider and specify the API key
```

However, as mentioned earlier, our goal is to try the pi-autoresearch extension on top of pi, so let’s install that as well.
```bash
pi install
```

I also wanted some guardrails in place, so I created an autoresearch.config.json file in the root of my repo to define the maximum number of iterations. This helps limit how many iterations the agent can run and, in turn, keeps token costs under control during experiments. You can also set a per-API-key spending limit with your LLM provider for even tighter control.
```json
{
  "maxIterations": 30
}
```

You can find all the details on configuration in the docs.
That’s it. The setup is done, and we’re ready to start the experiment.
Experiments
Finally, it’s time to start using the autoresearch approach to decide which marketing campaigns we should run. I’m pretty sure our initial approach isn’t optimal, so let’s see whether autoresearch can improve it. Let the journey begin.
I started autoresearch by calling the skill.

```
/skill:autoresearch-create
```

After that, autoresearch tries to infer the optimisation goal, and if it fails, it asks for more details.
In my case, it simply inspected the code we implemented in optimise.py and created an autoresearch.md file summarising the task. Here’s what we got (a pretty solid summary, considering it only saw our baseline optimisation function). We can see that it clearly defined the metrics and constraints. I also liked that it explicitly highlighted that changing the input data isn’t allowed. That’s a good guardrail.
```markdown
# Autoresearch: maximize marketing campaign revenue under budget

## Objective
Improve `optimise.py` so it selects a set of campaign segments with **maximum total revenue** while respecting the fixed marketing budget of **30,000,000**. The current implementation is a greedy heuristic: it sorts by revenue-per-spend, takes a cumulative prefix, and stops once the next item would exceed budget. That means it can leave budget unused and never consider cheaper profitable items later in the sorted list.

The workload is tiny (62 rows), so higher-quality combinatorial optimization approaches are likely practical. We should prefer exact or near-exact selection logic over fragile heuristics as long as the runtime stays fast.

## Metrics
- **Primary**: `revenue_millions` (millions, higher is better) - total selected revenue divided by 1,000,000
- **Secondary**:
  - `spend_millions` - total selected spend divided by 1,000,000
  - `budget_slack_millions` - unused budget in millions
  - `segment_count` - number of selected segments

## How to Run
`./autoresearch.sh` - runs a quick syntax pre-check, then `optimise.py`, which must emit `METRIC name=number` lines.

## Files in Scope
- `optimise.py` - campaign-selection logic and metric output
- `autoresearch.sh` - benchmark harness and pre-checks
- `autoresearch.md` - session memory / findings
- `autoresearch.ideas.md` - backlog for promising deferred ideas

## Off Limits
- `marketing_campaign_estimations.csv` - input data; do not edit
- Git history / branch structure outside the autoresearch workflow

## Constraints
- Must keep spend `<= 30_000_000`
- Must keep the script runnable with `python3 optimise.py`
- No dataset changes
- Keep the solution simple and explainable unless extra complexity yields materially better revenue
- Runtime should remain fast enough for many autoresearch iterations

## What's Been Tried
- Baseline code sorts by `revenue / marketing_spending`, computes cumulative spend, and keeps only the sorted prefix under budget.
```

After defining the task, it immediately started the loop. It can run for a while, but you still retain visibility. You can see both its reasoning and some key stats in the widget (such as the current iteration, best objective value, and improvement over the baseline), which is quite helpful.

As it iterates, it also writes an autoresearch.jsonl file with full details of each experiment and the resulting objective metric. This log is very useful both for reviewing what has been tried and for the model itself to keep track of which hypotheses it has already tested.
In my case, despite the configured limit of 30 iterations, it decided to stop after just 5. The agent explored several different strategies: exact knapsack optimisation, search-space pruning, and a Pareto-frontier dynamic programming approach. Let’s go through the details:
- Iteration 1: Reproduced our baseline approach. The prefix-greedy strategy (revenue/spend) reached 107.9M but stopped early once items didn’t fit, missing better downstream combinations. No breakthrough here, just a sanity check of the baseline.
- Iteration 2: Exact knapsack solver. The agent switched to a branch-and-bound (0/1 knapsack) approach and reached 110.16M revenue (+2.25M uplift), which is a clear improvement. A strong gain already in the second iteration.
- Iteration 3: Dominance pruning. This iteration tried to shrink the search space by removing pairwise dominated segments (i.e., segments worse in both spend and revenue than another). While intuitive, this assumption doesn’t hold in the 0/1 knapsack setting: a “dominating” segment may already be selected, while a “dominated” one can still be useful in combination with others. As a result, this approach dropped revenue to 95.9M and was discarded. A good example of trial and error: we tested it, it didn’t work, and we immediately moved on.
- Iteration 4: Dynamic programming frontier. The agent switched to a Pareto-frontier dynamic programming approach (sketched after this list), but it achieved the same result as iteration 2. From an analyst’s perspective, this is still useful: it confirms we’ve likely reached the optimum.
- Iteration 5: Integer accounting. This iteration converted all monetary values from floats to integer cents to improve numerical stability and reproducibility, but again produced the same final value. It makes sense that the agent stopped there.
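For reference, here’s a minimal illustrative sketch of the Pareto-frontier technique from iteration 4 (my own version, not the agent’s actual code). The idea is to keep only partial solutions that no other state beats on both spend and revenue, which remains exact for the 0/1 knapsack:

```python
def knapsack_pareto(items, budget):
    """Exact 0/1 knapsack via a Pareto frontier of (spend, revenue) states."""
    # Each state: (total_spend, total_revenue, chosen_item_indices)
    frontier = [(0, 0.0, [])]
    for i, (spend, revenue) in enumerate(items):
        extended = [
            (s + spend, r + revenue, chosen + [i])
            for s, r, chosen in frontier
            if s + spend <= budget
        ]
        # Sort by spend (cheapest first, ties broken by higher revenue) and
        # keep only states that strictly improve revenue, i.e. drop every
        # state dominated by a cheaper-or-equal one
        pruned, best_revenue = [], float('-inf')
        for s, r, chosen in sorted(frontier + extended, key=lambda t: (t[0], -t[1])):
            if r > best_revenue:
                pruned.append((s, r, chosen))
                best_revenue = r
        frontier = pruned
    return max(frontier, key=lambda t: t[1])

# Usage: spend, revenue, selected = knapsack_pareto(
#     list(zip(df.marketing_spending, df.revenue)), 30_000_000)
```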
So in the end, the optimal solution was already found in the second iteration, and it matches the solution we found in my article with linear programming. The agent still tried a few other ideas, but kept ending up with the same result and eventually stopped (instead of burning even more tokens).
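The same selection problem can also be written as a tiny integer program. Here’s a minimal sketch using PuLP (my own formulation for comparison; the agent implemented its own branch-and-bound solver instead):

```python
import pandas as pd
import pulp

df = pd.read_csv('marketing_campaign_estimations.csv', sep='\t')

prob = pulp.LpProblem('campaign_selection', pulp.LpMaximize)
# One binary decision variable per campaign segment: selected or not
x = {i: pulp.LpVariable(f'x_{i}', cat='Binary') for i in df.index}

# Objective: maximise total revenue of the selected segments
prob += pulp.lpSum(x[i] * df.loc[i, 'revenue'] for i in df.index)
# Constraint: total spend must stay within the 30M budget
prob += pulp.lpSum(x[i] * df.loc[i, 'marketing_spending'] for i in df.index) <= 30_000_000

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = df[[x[i].value() > 0.5 for i in df.index]]
print(f"revenue={selected.revenue.sum() / 1e6:.4f}M "
      f"spend={selected.marketing_spending.sum() / 1e6:.2f}M")
```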
Now we can finish the research by running the /skill:autoresearch-finalize command, which commits and pushes everything to GitHub. As a result, it created a new branch with a PR, saving both the changes to the optimise.py code and the intermediate reasoning files. This way, we can easily trace what happened throughout the process.
The agent easily solved our initial task. Next, let’s try making it more realistic by adding extra constraints from the Operations team. Assume we realised that we also need to ensure there are no more than 5K incremental customer support tickets (so the Ops team can handle the load), and that the overall customer contact rate stays below 4.2%, since this is one of our system health checks. This makes the problem harder, as it adds extra constraints and forces the agent to revisit the solution space and search for a new optimum.
To kick this off, I simply restarted the /skill:autoresearch-create process, providing the additional constraints.

```
/skill:autoresearch-create I have additional constraints for our CS contacts to ensure that our Operations
team can handle the demand in a healthy way:
- The number of additional CS contacts ≤ 5K
- Contact rate (CS contacts/users) ≤ 0.042
```

This time, it picked up exactly where we left off. It already had full context from the previous run, including everything we had done so far. After the task update, the agent revised the autoresearch.md file to include the new constraints.
```markdown
## Constraints
- Must keep spend `<= 30_000_000`
- Must keep additional CS contacts `<= 5_000`
- Must keep contact rate `<= 0.042`
- Must keep the script runnable with `python3 optimise.py`
- No dataset changes
- Keep the solution simple and explainable unless extra complexity yields materially better revenue
- Runtime should remain fast enough for many autoresearch iterations
```

It ran 8 more iterations and converged to the following solution (again matching what we had seen previously):
- Revenue: $109.87M,
- Budget spent: $29.9981M (under $30M),
- Customer support contacts: 3,218 (under 5K),
- Contact rate: 0.038 (under 0.042).
After introducing the new constraints, the agent reformulated the problem and switched to an exact MILP solver. It quickly found the optimal solution, reaching 109.87M revenue while satisfying all the constraints. Most of the later iterations didn’t really change the result; they just cleaned things up: removed fallback logic, reduced dependencies, and improved runtime. So, once the problem was well-defined, the agent stopped “searching” and started “engineering”. What’s even more interesting is that it knew when to stop optimising and didn’t run all the way to the 30-iteration limit.
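In MILP terms, the new requirements are just two extra linear constraints on top of the budget one. Extending the PuLP sketch from above (with hypothetical column names cs_contacts and users, since the exact schema isn’t shown here), note how the ratio constraint is rewritten so the model stays linear:

```python
# Assumed (hypothetical) column names: cs_contacts, users
# Cap on incremental customer support tickets
prob += pulp.lpSum(x[i] * df.loc[i, 'cs_contacts'] for i in df.index) <= 5_000

# Contact rate: sum(contacts) / sum(users) <= 0.042 is a ratio of decision-
# dependent sums, so rewrite it as sum(contacts) - 0.042 * sum(users) <= 0
prob += pulp.lpSum(
    x[i] * (df.loc[i, 'cs_contacts'] - 0.042 * df.loc[i, 'users'])
    for i in df.index
) <= 0
```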
Finally, I asked the agent to finalise the research. This time, for some reason, /skill:autoresearch-finalize didn’t push all the changes, so I had to manually ask pi to create two PRs: one with the clean code changes, and another with the reasoning and supporting files. You can go through the PRs if you want to see more details about what the agent tried.
That’s all for the experiments. We got great results and were able to see the capabilities of autoresearch. So, it’s time to wrap things up.
Summary
That was a really interesting experiment. The agent was able to reach the same optimal solution we previously found, completely on its own. While it didn’t push the result further (which isn’t surprising given how well-studied problems like knapsack are), it was impressive to see how an LLM can iteratively explore solutions and converge to a solid outcome without manual steering.
I believe this approach has strong potential across multiple domains (from training ML models and solving analytical tasks to more engineering-heavy problems like optimising system performance or loading times). In many teams, we simply don’t have the time to test all possible ideas, or we dismiss some of them too early. An autonomous loop like this can systematically try different approaches and validate them against actual metrics.
At the same time, this is definitely not a silver bullet. There will be cases where the agent finds “optimal” solutions that aren’t feasible in practice, for example, improving website loading speed at the cost of breaking the user experience. That’s where human supervision becomes critical: not just to validate results, but to make sure the solution makes sense holistically.
From what I’ve seen, this approach works best when you have a clear objective, well-defined constraints, and something measurable to optimise. It’s much harder to apply it to more ambiguous problems, like making a product more user-friendly, where success is less clearly defined.
Overall, I’d definitely recommend trying out pi-autoresearch or similar tools on your own problems. It’s a powerful way to test ideas you wouldn’t normally have time to explore and see what actually works in practice. And there’s something almost magical about your product improving while you sleep.
Disclaimer: I work at Shopify, but this post is independent of my work there and reflects my personal views.



