Imagine you have a whole lot of ideas for improving your product but no time to test them all. Sound familiar? It probably does.
What if you could hand that work off to AI? It can experiment with dozens or even hundreds of ideas at the same time, toss out the ones that don’t help and keep refining the ones that are making a difference.
That is the idea behind autoresearch. In this approach, a large language model (LLM) works on its own in a loop. It runs tests, measures the results and decides what to try next. Having heard colleagues praise the approach, I decided to test it myself.
I chose a classic analytical problem for my test: optimizing a marketing budget under a set of constraints. The question was whether an autonomous loop could deliver the same quality of results as our team.
Background
To understand where this is coming from, let’s start at the beginning. Autoresearch was created by a researcher named Andrej Karpathy. In his project’s documentation, he wrote:
At one point, advanced AI research was done by people working between meals, sleep, or fun, syncing occasionally in meetings. Those days are long gone. Research now belongs to fleets of AI agents running across huge computing systems in the sky. They say we’re on generation 10,205 of the code, but no one can verify, since the “code” is a self-repairing system beyond human understanding. This project is the story of how it all began. -@karpathy, March 2026.
The core idea is to let an LLM work on its own in an environment where it can run experiments over and over. It tweaks the code, checks the result, evaluates the change in performance, and decides whether to keep or undo each change before repeating the cycle. Ultimately, the goal is to come back to a model that performs better than the original. This approach led to significant improvements in an earlier project called nanochat.
While the original version focused on improving machine learning models, the same approach can be applied to any quantitative process, such as cutting a website’s load time or reducing errors when scraping web pages. Later, a team at Shopify released an open-source version called pi-autoresearch, built on top of a coding tool called pi.
This new tool follows a similar cycle with four main steps:
- The first step is to define the number you want to improve as well as any limits the solution must meet.
- Next, measure your starting point precisely.
- Ideas are tested one at a time, with the agent suggesting a change, implementing it, and checking the results. It either discards ideas that fail or hurt the metric, or keeps those that make things better and continues from there.
- The cycle continues until you stop it, gains level off, or a limit is reached.
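The four steps above can be sketched as a small keep-or-discard loop. This is only an illustration under stated assumptions, not how pi-autoresearch is implemented: `run_loop`, `propose`, `evaluate`, `is_valid`, and `patience` are hypothetical names, and the "agent" here is just a random perturbation of a single number.

```python
import random

def run_loop(baseline, propose, evaluate, is_valid, max_iterations=30, patience=10):
    """Keep a candidate only if it respects the limits and improves the metric."""
    # Steps 1 and 2: the metric is `evaluate`, the limits are `is_valid`,
    # and the starting point is measured before anything is changed.
    best, best_score = baseline, evaluate(baseline)
    history = [("baseline", best_score)]
    stale = 0
    for _ in range(max_iterations):          # step 4: hard iteration limit
        candidate = propose(best)            # step 3: suggest one change
        score = evaluate(candidate) if is_valid(candidate) else None
        if score is not None and score > best_score:
            best, best_score, stale = candidate, score, 0
            history.append(("keep", best_score))
        else:
            stale += 1                       # failed or worse: discard it
            history.append(("discard", best_score))
        if stale >= patience:                # step 4: gains have levelled off
            break
    return best, best_score, history

# Toy problem (hypothetical): maximize -(x - 3)^2 while keeping 0 <= x <= 10.
rng = random.Random(0)
best, best_score, history = run_loop(
    baseline=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: -(x - 3.0) ** 2,
    is_valid=lambda x: 0.0 <= x <= 10.0,
)
print(f"best={best:.3f} score={best_score:.3f} iterations={len(history) - 1}")
```

Because a candidate is kept only when it beats the current best, the recorded score can never go down, which is what makes it safe to leave such a loop running unattended.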
The driving force is a simple one: set a clear target and the agent, not inhibited by the limits and biases of human experience, can explore new paths and learn from them. This method can uncover improvements to your KPIs that your team simply hadn’t had the time to consider. It not only sounds exciting, but is worth trying.
The Task
I wanted to see how this might work on data problems, since we often have analytic goals that take multiple attempts to reach the best possible result. I looked for a fitting use case in the articles I had previously written for Towards Data Science. I chose a project I called “Linear Optimisations in Product Analytics,” which focused on optimizing advertising campaigns.
This example is quite common. Imagine you’re a marketing analyst making next month’s ad plans. The goal is to earn as much as possible from a fixed budget of $30M.
You’re given a list of potential campaigns with data for each:
- country and channel,
- marketing_spending: the investment each campaign needs,
- revenue: the expected earnings from new customers over the coming year (the main number we want to maximize).
We also have extra details such as the number of new users and the number of support contacts. These will help us extend the problem by adding more constraints as we go.

It’s helpful to give an agent something to start with. Let’s prepare a really simple approach for this task. One basic strategy is to pick the campaigns offering the highest revenue per dollar spent. We rank them all and select the ones that fit within the budget. Of course, this is pretty basic, but it gives us a firm starting point.
import pandas as pd

# Campaign estimates are stored as a tab-separated file
df = pd.read_csv('marketing_campaign_estimations.csv', sep='\t')

# --- Baseline Strategy: Greedy by Revenue-per-Dollar ---
df['revenue_per_spend'] = df.revenue / df.marketing_spending
df = df.sort_values('revenue_per_spend', ascending=False)
df['spend_cumulative'] = df.marketing_spending.cumsum()
# Keep the ranked prefix whose cumulative spend still fits the budget
selected_df = df[df.spend_cumulative <= 30_000_000]

total_spend = selected_df.marketing_spending.sum()
revenue_millions = selected_df.revenue.sum() / 1_000_000
assert total_spend <= 30_000_000, f"Budget violated: {total_spend}"
print(f"METRIC revenue_millions={revenue_millions:.4f}")
print(f"Segments={len(selected_df)} spend={total_spend/1e6:.2f}M")
I saved this code as optimise.py in our repository.
Running the baseline script gives us a total revenue of 107.9M USD with spending of 29.2M.
python3 optimise.py
# METRIC revenue_millions=107.9158
Getting Organized
Before diving into the actual experiment, we first need to install pi-autoresearch. We begin by setting up pi itself, following the instructions from pi.dev. Fortunately, it can be installed with a single command, which gives you a local pi coding harness that you can already use to assist with coding tasks.
npm install -g @mariozechner/pi-coding-agent # install pi
pi # start pi
/login # choose provider and specify API key
However, as mentioned earlier, our goal is to try the pi-autoresearch extension on top of pi, so let’s install that as well.
pi install
I also wanted some guardrails in place, so I created an autoresearch.config.json file in the root of my repo to define the maximum number of iterations. This caps how many iterations the agent can run and, in turn, keeps token costs under control during experiments. You can also set a per-API-key spending limit with your LLM provider for even tighter control.
{
  "maxIterations": 30
}
You can find all the details on configuration in the docs.
That’s it. The setup is done, and we’re ready to start the experiment.
Experiments
Finally, it’s time to start using the autoresearch strategy to figure out which advertising campaigns we should run. I’m fairly sure our initial approach isn’t optimal, so let’s see whether autoresearch can improve it. Let the journey begin.
I started autoresearch by calling the command.
/skill:autoresearch-create
After that, autoresearch tries to infer the optimization objective, and if it fails, it asks for more details.
In my case, it simply inspected the code we implemented in optimise.py and created an autoresearch.md file summarizing the task. Here’s what we got (a pretty solid summary, considering it only saw our baseline optimization function). We can see that it clearly defined the metrics and constraints. I also appreciated that it explicitly highlighted that changing the input data is not allowed. That’s a good guardrail.
# Autoresearch: maximize advertising campaign revenue under budget
## Goal
Improve `optimise.py` so it selects a set of campaign segments with **maximum total revenue** while respecting the fixed advertising budget of **30,000,000**. The current implementation is a greedy heuristic: it sorts by revenue-per-spend, takes a cumulative prefix, and stops once the next item would exceed budget. This means it can leave budget unused and never consider cheaper profitable items later in the sorted list.
The workload is tiny (62 rows), so higher-quality combinatorial optimization techniques are likely practical. We should favor exact or near-exact selection logic over fragile heuristics when the runtime remains fast.
## Metrics
- **Primary**: `revenue_millions` (millions, higher is better) - total selected revenue divided by 1,000,000
- **Secondary**:
- `spend_millions` - total selected spend divided by 1,000,000
- `budget_slack_millions` - unused budget in millions
- `segment_count` - number of selected segments
## How to Run
`./autoresearch.sh` - runs a quick syntax pre-check, then `optimise.py`, which should emit `METRIC name=value` lines.
## Files in Scope
- `optimise.py` - campaign-selection logic and metric output
- `autoresearch.sh` - benchmark harness and pre-checks
- `autoresearch.md` - session memory / findings
- `autoresearch.ideas.md` - backlog for promising deferred ideas
## Off Limits
- `marketing_campaign_estimations.csv` - input data; do not edit
- Git history / branch structure outside the autoresearch workflow
## Constraints
- Must keep spend `<= 30_000_000`
- Must keep the script runnable with `python3 optimise.py`
- No dataset changes
- Keep the solution simple and explainable unless further complexity yields materially higher revenue
- Runtime should remain fast enough for many autoresearch iterations
## What Has Been Tried
- Baseline code sorts by `revenue / marketing_spending`, computes cumulative spend, and keeps only the sorted prefix under budget.
After defining the task, it immediately started the loop. It can run for a while, but you still retain visibility. You can see both its reasoning and some key stats in the widget (such as the current iteration, best objective value, and improvement over the baseline), which is quite helpful.

As it iterates, it also writes an autoresearch.jsonl file with full details of each experiment and the resulting objective metric. This log is very useful both for reviewing what has been tried and for the model itself to keep track of which hypotheses it has already tested.
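A log like this is easy to review programmatically. The sketch below is a hedged illustration only: the field names (`iteration`, `status`, `metric`, `note`) and the sample values are assumptions made up for the example, not the documented schema of autoresearch.jsonl, so check a real log line before reusing it.

```python
import json

# Hypothetical log lines. Field names and values are invented for illustration
# and are NOT the documented pi-autoresearch schema.
sample_log = """\
{"iteration": 1, "status": "keep", "metric": 100.0, "note": "baseline"}
{"iteration": 2, "status": "keep", "metric": 104.5, "note": "idea A"}
{"iteration": 3, "status": "discard", "metric": 91.2, "note": "idea B"}
"""

def summarize(lines):
    """Parse one JSON object per line and report the best kept experiment."""
    records = [json.loads(line) for line in lines if line.strip()]
    kept = [r for r in records if r["status"] == "keep"]
    best = max(kept, key=lambda r: r["metric"])
    return {
        "experiments": len(records),
        "kept": len(kept),
        "discarded": len(records) - len(kept),
        "best_iteration": best["iteration"],
        "best_metric": best["metric"],
    }

summary = summarize(sample_log.splitlines())
print(summary)
```

For a real run, the same function could be fed the lines of the actual file once the field names are adjusted to match it.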
In my case, despite the configured limit of 30 iterations, it decided to stop after just 5. The agent explored several different approaches: exact knapsack optimization, search-space pruning, and a Pareto-frontier dynamic programming approach. Let’s go through the details:
- Iteration 1: Reproduced our baseline approach. The prefix-greedy method (revenue/spend) reached 107.9M, but stopped early when items didn’t fit, missing better downstream combinations. No breakthrough here, just a sanity check of the baseline.
- Iteration 2: Exact knapsack solver. The agent switched to a branch-and-bound (0/1 knapsack) approach and reached 110.16M revenue (+2.25M uplift), which is a clear improvement. A strong gain already in the second iteration.
- Iteration 3: Dominance pruning. This iteration tried to shrink the search space by removing pairwise dominated segments (i.e., segments worse in both spend and revenue than another). While intuitive, this assumption doesn’t hold in the 0/1 knapsack setting: a “dominating” segment may already be chosen, while a “dominated” one can still be useful in combination with others. As a result, this approach failed and dropped to 95.9M revenue, and was discarded. A good example of trial and error. We tested it, it didn’t work, and we immediately moved on.
- Iteration 4: Dynamic programming frontier. The agent switched to a Pareto-frontier dynamic programming approach, yet it arrived at the same result as iteration 2. From an analyst’s standpoint, this was still useful. It confirms we have likely hit the optimal point.
- Iteration 5: Integer accounting. This iteration converted all financial values from floats to integers representing cents, aiming to boost numerical stability and reproducibility. However, it again produced the same end result. It makes sense that the agent concluded its search at this stage.
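To make the first three iterations concrete, here is a toy illustration with three invented segments and a budget of 10. It is a sketch under stated assumptions, not the agent's code: the exact solver below is plain brute-force enumeration (feasible only for tiny inputs) rather than branch-and-bound, and the helper names are hypothetical.

```python
from itertools import combinations

BUDGET = 10
# Toy segments as (name, spend, revenue); the values are invented.
segments = [("A", 6, 12.0), ("B", 5, 9.0), ("C", 5, 8.5)]

def revenue(combo):
    return sum(s[2] for s in combo)

def prefix_greedy(items, budget):
    """Rank by revenue per spend and keep the prefix that fits the budget."""
    ranked = sorted(items, key=lambda s: s[2] / s[1], reverse=True)
    chosen, spent = [], 0
    for seg in ranked:
        if spent + seg[1] > budget:
            break  # stops at the first item that does not fit
        chosen.append(seg)
        spent += seg[1]
    return chosen

def exact(items, budget):
    """Brute-force 0/1 knapsack: try every subset, keep the best feasible one."""
    best = ()
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            if sum(s[1] for s in combo) <= budget and revenue(combo) > revenue(best):
                best = combo
    return list(best)

def drop_dominated(items):
    """Remove segments that cost at least as much and earn no more than another."""
    def dominated(s):
        return any(
            o is not s and o[1] <= s[1] and o[2] >= s[2]
            and (o[1] < s[1] or o[2] > s[2])
            for o in items
        )
    return [s for s in items if not dominated(s)]

greedy_plan = prefix_greedy(segments, BUDGET)
exact_plan = exact(segments, BUDGET)
pruned_plan = exact(drop_dominated(segments), BUDGET)
print("greedy:", [s[0] for s in greedy_plan], revenue(greedy_plan))
print("exact: ", [s[0] for s in exact_plan], revenue(exact_plan))
print("pruned:", [s[0] for s in pruned_plan], revenue(pruned_plan))
```

In this toy case the greedy prefix takes only A, while the exact search finds that B and C together use the full budget and earn more. Pruning removes C because B dominates it pairwise, yet C is exactly what the best combination needs, which mirrors why iteration 3 lost revenue.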
In the end, the optimal solution was already identified during the second iteration, matching the answer we previously arrived at using linear programming. The agent did try a couple of alternative methods, but repeatedly returned to the same outcome before deciding to stop (rather than consuming more tokens).
With that, we can finalize the process by running the /skill:autoresearch-finalize command. This command commits everything and pushes it to GitHub, creating a new branch and pull request that captures both the modifications to the optimise.py code and the intermediate reasoning logs. This provides a clear, trackable record of the entire process.
The agent successfully completed the initial task. Next, let’s make it more realistic by incorporating additional constraints from the Operations team. Suppose we discover we also need to ensure no more than 5K additional customer support tickets (to keep the Ops team’s workload manageable), and that the overall customer contact rate stays below 4.2%, as part of our system health checks. This complicates the problem by adding new constraints, forcing the agent to re-explore the solution space and find a new optimum.
To start this phase, I simply re-launched the /skill:autoresearch-create process with the new constraints provided.
/skill:autoresearch-create I have extra constraints for our CS contacts to ensure our Operations
team can manage the workload effectively:
- Additional CS contacts ≤ 5K
- Contact rate (CS contacts/customers) ≤ 0.042
This time, it seamlessly resumed from where it left off, with full context from the previous run, including all prior progress. After updating the task, the agent revised the autoresearch.md file to incorporate the new constraints.
## Constraints
- Spend must remain `<= 30_000_000`
- Additional CS contacts must remain `<= 5_000`
- Contact rate must remain `<= 0.042`
- The script must remain runnable with `python3 optimise.py`
- No changes to the dataset
- Keep the solution simple and explainable unless added complexity yields significantly better revenue
- Runtime should stay fast enough for numerous autoresearch iterations
It conducted 8 additional iterations and ultimately converged on the following solution (again aligning with our earlier results):
- Revenue: $109.87M,
- Budget spent: $29.9981M (under $30M),
- Customer support contacts: 3,218 (under 5K),
- Contact rate: 0.038 (under 0.042).
After incorporating the new constraints, the agent restructured the problem and transitioned to a precise MILP solver. It quickly identified the optimal solution, reaching $109.87M in revenue while meeting all conditions. Most of the later iterations didn’t alter the outcome; they focused mainly on refinement: removing fallback logic, reducing dependencies, and speeding up execution. Essentially, once the problem was clearly defined, the agent shifted from “searching” to “engineering.” Notably, it recognized when to cease optimizing and didn’t exhaust the full 30-iteration limit.
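One detail worth spelling out is how a ratio such as the contact rate fits into a linear solver: contacts / customers ≤ 0.042 can be rewritten as contacts ≤ 0.042 × customers, which is linear in the selection. The sketch below illustrates that on invented toy data. It uses brute-force enumeration instead of a MILP solver, and all names, values, and scaled-down limits are assumptions for the example rather than the agent's code.

```python
from itertools import combinations

BUDGET = 10                    # toy scale; the article uses 30_000_000
MAX_CONTACTS = 50              # toy scale; the article uses 5_000
RATE_NUM, RATE_DEN = 42, 1000  # contact rate cap of 0.042 as an exact fraction

# Toy segments as (name, spend, revenue, customers, cs_contacts); invented values.
segments = [
    ("A", 6, 12.0, 1000, 30),
    ("B", 5, 9.0, 500, 30),
    ("C", 5, 8.5, 600, 18),
    ("D", 4, 5.0, 400, 10),
]

def within_budget(combo):
    return sum(s[1] for s in combo) <= BUDGET

def feasible(combo):
    """Budget, contact cap, and the linearized contact-rate constraint."""
    customers = sum(s[3] for s in combo)
    contacts = sum(s[4] for s in combo)
    # contacts / customers <= 0.042 rewritten without division, in integers.
    rate_ok = RATE_DEN * contacts <= RATE_NUM * customers
    return within_budget(combo) and contacts <= MAX_CONTACTS and rate_ok

def best_plan(items, check):
    """Enumerate every subset and return the highest-revenue one passing `check`."""
    candidates = (
        combo
        for k in range(len(items) + 1)
        for combo in combinations(items, k)
        if check(combo)
    )
    return max(candidates, key=lambda combo: sum(s[2] for s in combo))

budget_only = best_plan(segments, within_budget)
constrained = best_plan(segments, feasible)
print("budget only:", [s[0] for s in budget_only], sum(s[2] for s in budget_only))
print("constrained:", [s[0] for s in constrained], sum(s[2] for s in constrained))
```

In this toy data the budget-only optimum (B and C) breaks the contact-rate cap, so the constrained optimum shifts to A and D at slightly lower revenue. The article's results show the same pattern: revenue moved from 110.16M to 109.87M once the Operations constraints were added.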
Finally, to conclude the research, I had the agent finalize. However, for some reason, the /skill:autoresearch-finalize command didn’t push all the changes this time, so I needed to manually request two PRs: one containing the clean code changes and another with the reasoning and supporting files. Feel free to review these PRs if you’re interested in seeing more details about the agent’s attempts.
That concludes our experiments. We achieved excellent results and gained valuable insights into the capabilities of autoresearch. Now, it’s time to bring it all together.
Summary
This was a truly fascinating experiment. The agent managed to independently reach the same optimal solution we had previously identified. While it didn’t surpass the result (which is expected for well-studied problems like the knapsack), it was impressive to observe how an LLM could iteratively explore solutions and converge on a stable outcome without manual intervention.
I think this approach holds significant promise across various domains, from training ML models and tackling analytical challenges to more engineering-centric issues like optimizing system performance or load times. In many teams, we often lack the time to test every possible idea or discard some prematurely. An autonomous loop like this could systematically try different approaches and validate them against concrete metrics.
However, this is certainly not a universal solution. There will be cases where the agent arrives at “optimal” solutions that are impractical in reality—for example, improving website load times at the expense of a good user experience. This highlights the importance of human oversight: not just to verify results, but to ensure the solution makes sense holistically.
Based on my observations, this method works best when you have a clear objective, well-defined constraints, and a quantifiable metric to optimize. It’s much harder to apply to more ambiguous goals, like making a product more user-friendly, where success is less tangible.
Overall, I’d definitely recommend trying pi-autoresearch or similar tools on your own challenges. It’s a powerful way to test ideas you wouldn’t normally have the bandwidth for and discover what truly works in practice. And there’s something uniquely satisfying about your product improving overnight while you rest.
Disclaimer: I work at Shopify, but this post is independent of my work there and reflects my personal views.



