Hexo Labs Unveils SIA: The Self-Improving Agent That Rewrites Its Own Code And Model Weights

Once a human stops fine-tuning them, most AI agents stop getting better. The underlying model stays the same, and so does the infrastructure built around it. Hexo Labs aims to evolve both simultaneously. This week, the company open-sourced SIA (Self-Improving AI) under the MIT license.

The central — and deliberately narrow — claim behind SIA is this: within a single self-improvement loop, it can modify both the agent’s surrounding infrastructure and the model’s internal weights.

What is SIA (Self-Improving AI)

SIA divides a task-specific agent into two layers. The first layer is the harness (sometimes called the scaffold), which includes the system prompt, tool-routing logic, retry rules, and answer-parsing code. The second layer is the model’s weights themselves.
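To make the harness layer concrete, here is a minimal Python sketch that bundles the four pieces listed above into one object. The `Scaffold` class, its field names, and the `ANSWER:` output convention are hypothetical and are not taken from the SIA codebase; the model is simply any callable that maps a prompt to text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Scaffold:
    """Hypothetical container for the harness layer (not SIA's real class)."""

    system_prompt: str
    tools: Dict[str, Callable[[str], str]]  # tool name -> callable
    max_retries: int = 2

    def route_tool(self, name: str, arg: str) -> str:
        # Tool-routing logic: dispatch by name, fail loudly on unknown tools.
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](arg)

    def parse_answer(self, text: str) -> Optional[str]:
        # Answer-parsing code: take whatever follows "ANSWER:" (assumed format).
        match = re.search(r"ANSWER:\s*(.+)", text)
        return match.group(1).strip() if match else None

    def run(self, model: Callable[[str], str], question: str) -> Optional[str]:
        # Retry rule: re-query until an answer parses or retries run out.
        for _ in range(self.max_retries + 1):
            answer = self.parse_answer(model(f"{self.system_prompt}\n{question}"))
            if answer is not None:
                return answer
        return None
```

In this framing, a scaffold edit changes any of these fields or methods while the `model` callable stays exactly the same.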

Three large-language-model components power the loop. A Meta-Agent generates an initial scaffold from a task description and any available reference code. A Task-Specific Agent then runs the job while logging every step of the process. After completion, a Feedback-Agent reviews that full trajectory and determines what adjustments need to be made.

This decision-making step is the core innovation. Following each run, the Feedback-Agent chooses one of two paths: it can revise the scaffold while keeping the model weights frozen, or it can apply a weight update while leaving the scaffold unchanged.
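The control flow described above can be sketched roughly as follows. Every name here (`AgentState`, `self_improve`, `run_task`, `choose_action`) is a placeholder invented for illustration. In SIA the choice is made by an LLM reading the full trajectory, which this sketch replaces with a plain callable that sees only the reward.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical snapshot: which scaffold and which weights are live."""

    scaffold_version: int = 0
    weights_version: int = 0
    history: List[str] = field(default_factory=list)


def self_improve(
    run_task: Callable[[AgentState], float],
    choose_action: Callable[[AgentState, float], str],
    generations: int,
) -> AgentState:
    # Stands in for the Meta-Agent's initial scaffold on the base model.
    state = AgentState()
    for _ in range(generations):
        reward = run_task(state)               # Task-Specific Agent runs the job
        action = choose_action(state, reward)  # Feedback-Agent picks one lever
        if action == "scaffold":
            state.scaffold_version += 1        # edit harness, weights frozen
        elif action == "weights":
            state.weights_version += 1         # update weights, harness frozen
        else:
            raise ValueError(f"unknown action: {action}")
        state.history.append(action)
    return state


# Toy policy: edit the scaffold until it plateaus, then switch to weights.
final = self_improve(
    run_task=lambda s: float(s.scaffold_version),
    choose_action=lambda s, r: "scaffold" if r < 2 else "weights",
    generations=4,
)
print(final.history)
```

The point of the sketch is the shape of the loop: exactly one lever moves per generation, and the order is decided at run time rather than fixed in advance.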

The base model is openai/gpt-oss-120b. For weight updates, SIA uses LoRA, a low-rank adapter method, at rank 32. Both the Meta-Agent and the Feedback-Agent run on Claude Sonnet 4.6. Training runs on H100 GPUs via Modal, which the team uses as its reinforcement-learning platform.
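For a sense of scale, a rank-r LoRA adapter on a weight matrix of shape d_out × d_in trains two small factors with r × (d_in + d_out) parameters in total, instead of the full d_in × d_out. The sketch below works that out for rank 32. The 4096 × 4096 layer size is an illustrative assumption, not a gpt-oss-120b dimension taken from the source.

```python
from typing import Tuple


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (full matrix parameters, LoRA adapter parameters)."""
    full = d_in * d_out
    # LoRA learns B (d_out x rank) and A (rank x d_in); the base matrix is frozen.
    adapter = rank * (d_in + d_out)
    return full, adapter


# Hypothetical layer size, chosen only to make the ratio easy to see.
full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=32)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4f}")
```

For this assumed layer the adapter holds under 2% as many trainable parameters as the full matrix, which is why adapter updates are cheap enough to run inside an iterative loop.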

The team defines two operating modes: SIA-H and SIA-W+H. SIA-H applies harness changes only. SIA-W+H layers weight updates on top of harness edits.

The Benchmark Case

The team evaluated SIA across three deliberately diverse domains. One pattern held in all three: weight updates delivered gains beyond what scaffold modifications alone achieved. In the table, “Initial” is the Meta-Agent’s first scaffold running against the base model, before any feedback loop has fired.

Task	Initial	Prev. SOTA	SIA-H (harness only)	SIA-W+H (harness + weights)
LawBench (top-1 acc)	13.5%	45.0%	50.0%	70.1%
AlphaEvolve TriMul (reward)	0.105	1.292	0.120	1.475
Denoising (mse_norm)	0.048	0.240	0.241	0.289

On LawBench, the challenge is a 191-class Chinese criminal-charge classification task. Repeated harness iterations constructed a TF-IDF plus LinearSVC pipeline and leveled off at 50.0% accuracy. Subsequent weight updates through PPO pushed performance up to 70.1% — a 20.1 percentage-point improvement over the harness-only ceiling.

The TriMul task requires writing a custom CUDA kernel for an H100 GPU. The kernel implements a core operation in AlphaFold2’s Evoformer module. Scaffold editing alone achieved a 1.14× speedup over the baseline. Weight updates then slashed runtime from 12,483 µs to 1,017 µs — a 91.9% reduction from the harness-only peak.
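The TriMul figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the variable names are ours.

```python
harness_only_us = 12_483  # runtime at the harness-only peak, in microseconds
with_weights_us = 1_017   # runtime after weight updates, in microseconds
harness_speedup = 1.14    # harness-only speedup over the baseline

# Fractional runtime reduction from the harness-only peak.
reduction = 1 - with_weights_us / harness_only_us

# Speedup contributed by the weight updates alone.
weight_speedup = harness_only_us / with_weights_us

# Speedups compose multiplicatively: baseline -> harness -> harness + weights.
combined_speedup = harness_speedup * weight_speedup

print(f"reduction: {reduction:.1%}")
print(f"weight-update speedup: {weight_speedup:.2f}x")
print(f"combined speedup: {combined_speedup:.2f}x")
```

The reduction comes out at about 91.9%, matching the text. The combined speedup lands just under 14×, close to the reported 14.02×; the small gap is presumably rounding in the 1.14× figure.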

The results chart includes an important qualification: the coding assistant Claude Code achieved a 1.50× speedup on TriMul without any help, outpacing SIA-H’s 1.14× result. SIA-W+H still delivered the best overall outcome at 14.02×.

For the denoising task, the agent optimizes MAGIC, a single-cell RNA imputation method. Harness sweeps across hyperparameters settled at an mse_norm of 0.241. After the first weight-update checkpoint, the agent introduced a two-line step that no scaffold iteration had generated: it rounded imputed counts to non-negative integers, lifting the score to 0.289.
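A post-processing step of that kind might look like the sketch below. This is our reconstruction from the one-sentence description, not the agent's actual code: the function name, the clip-then-round order, and the nested-list representation are all assumptions (a real pipeline would more likely operate on arrays).

```python
from typing import List


def to_counts(imputed: List[List[float]]) -> List[List[int]]:
    """Map imputed expression values back to valid count data."""
    # Step 1: clip negatives to zero. Step 2: round to the nearest integer.
    return [[int(round(max(0.0, value))) for value in row] for row in imputed]


# Toy cells-by-genes matrix with a negative and several fractional values.
print(to_counts([[-0.3, 0.4, 2.6], [1.2, 7.9, 0.0]]))
```

The appeal of the step is that raw scRNA-seq counts are non-negative integers, so snapping imputed values back onto that grid encodes a domain constraint rather than another hyperparameter setting.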

How the Feedback-Agent Picks Its Move

SIA doesn’t rely on a single fixed RL recipe. The Feedback-Agent chooses a training algorithm based on the reward signal it observes.

On LawBench, since the reward was a clear, outcome-based numerical value, the team employed PPO with GAE. On TriMul, where most kernels did not compile successfully, they applied entropic advantage weighting—a technique that gives more importance to uncommon high-reward outcomes. For the denoising task, they adopted GRPO, which removes the need for a value network altogether.
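The difference between the first and last of these can be seen in how each computes advantages. The sketch below gives textbook forms of GAE (as used with PPO) and GRPO's group-relative advantage; it is not SIA's training code, and the function names, default hyperparameters, and the use of population standard deviation are our choices.

```python
from statistics import mean, pstdev
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """GAE for one episode; `values` has len(rewards) + 1 entries (last is the bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: needs value estimates, i.e. a learned value network.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO: score each sample against the other samples for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # The group mean plays the role of the baseline, so no value network is needed.
    return [(r - mu) / (sigma + eps) for r in group_rewards]


print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

GAE depends on `values`, which a critic must learn alongside the policy, while the GRPO baseline comes for free from sampling several completions per prompt.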

The researchers also reference REINFORCE with KL-to-base, DPO, and best-of-N behavioral cloning. Each method aligns with a different reward structure and carries its own set of potential failure risks.
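Of these, DPO is the one that learns from preference pairs rather than scalar rewards. As a minimal sketch, the standard DPO loss for a single (chosen, rejected) pair is shown below; the function and argument names are ours, `beta=0.1` is just a common default, and the source does not say how SIA configures it.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy favours chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# The loss falls as the policy shifts probability toward the chosen response.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

Because the reference log-probabilities appear inside the margin, the loss already includes a pull toward the base model, a role the explicit KL term plays in REINFORCE with KL-to-base.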

Strengths and Points to Consider

Strengths:

According to the authors’ comparison table, this is the first system to modify both the agent scaffold and model weights within a single loop.
Delivers consistent improvements over previous state-of-the-art methods across three distinct and unrelated domains.
Open source under the MIT license, installable as sia-agent, and comes with four pre-packaged tasks.
The choice of algorithm adapts based on the rewards observed, rather than following a predetermined sequence.

Points to Consider:

The study reports results from three tasks; wider algorithm-selection findings are reserved for future work.
Both optimization levers target the same fixed verifier, which introduces the risk of tightly coupled Goodhart effects.
The researchers caution that the joint fixed point may prove fragile if subjected to outside disturbance.

Marktechpost’s Visual Guide

Hexo Labs · Open Source (MIT)

SIA: Self-Improving AI

Scaffold + Weight Updates

A self-enhancing cycle that modifies both an agent’s underlying structure and its model parameters—no additional human adjustment required.

gpt-oss-120b
LoRA rank 32
3 benchmarks
Claude Sonnet 4.6 agents

The Gap

Two separate camps, working independently

Scaffold-focused approach

Modify the framework

A meta-agent revises prompts, tools, and retry logic while keeping model parameters unchanged.

Test-time training

Modify the weights

An RL pipeline adjusts the model based on task feedback, leaving the overall framework untouched.

SIA bridges this gap by operating both levers together inside one unified loop.

Anatomy

What SIA truly represents

Scaffold (harness): encompasses the system prompt, tool-dispatch rules, retry mechanisms, and answer-extraction routines.
Weights: refers to the model’s internal parameters, fine-tuned using LoRA at rank 32.
Three LLM modules power the loop: a Meta-Agent, a Task-Specific Agent, and a Feedback-Agent.

The Loop

Single loop, dual levers

Following each execution, the Feedback-Agent examines the complete trajectory and selects one action.

Action A

Scaffold update

Revise the framework; model weights remain unchanged.

Action B

Weight update

Train LoRA parameters; the framework stays as-is.

The two levers alternate freely instead of being confined to rigid sequential stages.

Evidence

Results across benchmarks

Task	Initial	Prev. SOTA	SIA-H	SIA-W+H
LawBench (top-1 acc)	13.5%	45.0%	50.0%	70.1%
AlphaEvolve TriMul (reward)	0.105	1.292	0.120	1.475
Denoising (mse_norm)	0.048	0.240	0.241	0.289

SIA-W+H (scaffold + weights) outperformed SIA-H (scaffold only) across all three benchmarks.

Mechanism

How the Feedback-Agent decides what to do

LawBench: given the straightforward outcome-based reward, PPO with GAE was chosen, leading to 70.1% accuracy.
TriMul: since most kernels failed to compile, entropic advantage weighting was applied, achieving a runtime of 1,017 µs.
Denoising: GRPO was used, bypassing the value network entirely, and the score improved to 0.289.
Also supported: REINFORCE + KL-to-base, DPO, and best-of-N behavioral cloning.

RQ2

How each lever drives change

Scaffold

External improvements

Software-engineering refinements: new tools, more robust parsers, improved retry handling.

Weights

Internal knowledge gains

Domain-specific insights that no prompt alone can convey: H100 kernel patterns, an integer-rounding step.

The scaffold defines how the agent explores; weight updates reshape what the model understands.

The Honest Read

Limitations worth noting

Both levers optimize toward the same fixed verifier, raising the possibility of a co-evolutionary Goodhart effect.
Fixed points may appear strong on the verifier yet remain fragile if disturbed.
The paper documents three tasks; more extensive algorithm-selection results will come later.
A separate 350× superintelligence claim found in launch coverage is not present in the actual paper.

Get Started

Try it yourself

Available as open source under MIT at hexo-ai/sia. Built on gpt-oss-120b with LoRA rank 32.

# install the Claude backend
pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="..."

# execute 5 rounds of self-improvement on a bundled task
sia --task lawbench --max_gen 5 --run_id 1

Four tasks are included out of the box: gpqa, lawbench, longcot-chess, spaceship-titanic.
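To try all four bundled tasks with the same settings, the command above can be generated per task. The sketch below only builds and prints the command lines; it does not run them. It assumes each bundled task name is passed to `--task` exactly as listed, which the source confirms only for `lawbench`.

```python
import shlex
from typing import List

# Task names as listed above; passing them verbatim to --task is an assumption.
BUNDLED_TASKS = ["gpqa", "lawbench", "longcot-chess", "spaceship-titanic"]


def sia_command(task: str, max_gen: int = 5, run_id: int = 1) -> List[str]:
    """Build the argv for one run, using only the flags shown in the example."""
    return ["sia", "--task", task, "--max_gen", str(max_gen), "--run_id", str(run_id)]


for task in BUNDLED_TASKS:
    print(shlex.join(sia_command(task)))
```

Each printed line can be pasted into a shell once `sia-agent` is installed and the API key is set.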


Key Takeaways

According to its authors, SIA is the first self-improving system that edits both an agent’s scaffold and its model weights within a single loop.
After each run, a Feedback-Agent reviews the entire trajectory and decides between a scaffold rewrite or a weight update.
Using both levers together surpassed scaffold-only optimization on every task tested: LawBench, TriMul kernels, and scRNA-seq denoising.
Scaffold edits contribute software-engineering rigor; weight updates unlock domain knowledge inaccessible through prompting alone.
Open source under MIT (hexo-ai/sia), running on gpt-oss-120b with LoRA rank 32.

