Webwright: Microsoft's Terminal-Native Web Agent Framework Boosts Performance From 33.5% To 60.1% On Odysseys Benchmark

Most web agents today operate by performing one browser action at a time. The model takes in the current page state — whether as a screenshot or DOM text — and predicts the next click, keypress, or scroll. This step-by-step approach was reasonable when language models had limited reasoning power. But as models have grown more skilled at writing and debugging code, that rigid loop has turned from a helpful structure into a bottleneck.

Researchers at Microsoft’s AI Frontiers lab developed an alternative. Their new open-source framework, Webwright, gives the agent a terminal rather than a stateful browser session. The agent writes Playwright code to control browsers, executes bash commands, examines logs, and iteratively improves scripts. Playwright is an open-source browser automation library, also from Microsoft, that enables programmatic control of Chromium, Firefox, and WebKit browsers.

What Webwright Does Differently

Webwright decouples the agent from the browser and treats the browser as something the agent can launch, inspect, and discard while building a program. The lasting artifact is not the browser session but the code and logs stored in the local workspace.

This mirrors how a developer writes an RPA (Robotic Process Automation) script. Instead of manually clicking through a site each time, they write a script once. That script can be rerun, modified, and shared. Webwright brings this same approach to LLM-powered agents.

The system consists of three core components: a Runner, a Model Endpoint, and a terminal Environment. The runner is roughly 150 lines of code, the model interface about 550 lines, and the environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy — just a single agent loop.

All intermediate code, logs, screenshots, and results are saved in the workspace, making each run straightforward to inspect.

The Agent Loop

The Runner sends the current context to the model. The model responds with a thinking block and a shell command. That command executes in the Environment, which returns terminal output, logs, screenshots, or error tracebacks. These observations feed back into the context, and the loop repeats.
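The loop described above can be sketched in a few lines. This is a minimal illustration, not Webwright's actual code: the names ModelReply, run_command, agent_loop, and scripted_model are hypothetical, and the scripted model stands in for a real LLM endpoint so the example runs offline.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelReply:
    thinking: str       # free-form reasoning returned by the model
    command: str        # shell command to execute next
    done: bool = False  # set when the model decides to stop


def run_command(command: str, cwd: str = ".", timeout: int = 60) -> str:
    """Environment: run a shell command and return what the terminal showed."""
    result = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"


def agent_loop(
    task: str, model: Callable[[list[str]], ModelReply], max_steps: int = 100
) -> list[str]:
    """Runner: alternate between model calls and command execution."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model(context)
        if reply.done:
            break
        observation = run_command(reply.command)
        # Observations feed back into the context for the next model call.
        context.append(f"THINKING: {reply.thinking}")
        context.append(f"COMMAND: {reply.command}")
        context.append(f"OBSERVATION: {observation}")
    return context


def scripted_model(context: list[str]) -> ModelReply:
    """Stand-in for a real LLM call: issues one command, then stops."""
    if any(line.startswith("OBSERVATION:") for line in context):
        return ModelReply(thinking="Output received.", command="", done=True)
    return ModelReply(thinking="Check that the shell works.", command="echo hello")


if __name__ == "__main__":
    for line in agent_loop("print a greeting", scripted_model):
        print(line)
```

Swapping scripted_model for a function that calls a hosted model is the only change needed to turn this sketch into a live loop; the Runner and Environment halves stay the same.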

Instead of issuing one primitive action at a time, a coding agent can naturally express multi-step interactions — like picking a date or completing an entire form — as a compact program. Loops, functions, and abstractions let the agent generalize across similar tasks without repeatedly predicting the same sequences of low-level steps.
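As a rough illustration of that point, the sketch below fills a whole form in one function call. The helper fill_and_submit, the PageLike protocol, the demo function, and the HTML and selectors are all hypothetical; only page.fill, page.click, page.set_content, and page.title are real methods of Playwright's Python sync API. The demo function additionally assumes Playwright and Chromium are installed.

```python
from typing import Protocol


class PageLike(Protocol):
    """The two Playwright Page methods this sketch relies on."""

    def fill(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...


def fill_and_submit(page: PageLike, fields: dict[str, str], submit_selector: str) -> None:
    """Fill every field, then submit: one program instead of one predicted action per field."""
    for selector, value in fields.items():
        page.fill(selector, value)
    page.click(submit_selector)


def demo() -> str:
    """Run the helper against a throwaway in-memory page (hypothetical HTML)."""
    # Imported lazily so fill_and_submit itself has no hard Playwright dependency.
    from playwright.sync_api import sync_playwright

    html = """
    <form onsubmit="document.title =
        document.getElementById('origin').value + '-' +
        document.getElementById('dest').value; return false">
      <input id="origin"><input id="dest">
      <button id="go" type="submit">Search</button>
    </form>
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.set_content(html)
        fill_and_submit(page, {"#origin": "SEA", "#dest": "JFK"}, "#go")
        title = page.title()
        browser.close()
    return title
```

Because the field map is plain data, the same function covers any form on any site; a step-by-step agent would instead have to predict each fill and click separately.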

Two Engineering Challenges

Premature ‘done’ and context explosion are the two main issues. With open-ended bash actions, the model must self-report completion and often declares success without actually finishing. The team added a gate: the agent must generate a self-reflection config, run a final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgment that outputs success or failure before emitting done: true. Otherwise, the flag is dropped and it retries.
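One way to picture the completion gate is as a check that refuses a done claim unless evidence backs it. The sketch below is an assumption-laden illustration, not the framework's implementation: FinalRun and accept_done are hypothetical names, and the zero exit code requirement is an added assumption that the article does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinalRun:
    """Evidence gathered when the final script is rerun in a fresh folder."""

    exit_code: int            # assumption: a clean rerun exits with 0
    log_files: list[str]      # logs written during the rerun
    screenshots: list[str]    # screenshots written during the rerun
    reflection_verdict: str   # "success" or "failure" from self-reflection


def accept_done(claimed_done: bool, final_run: Optional[FinalRun]) -> bool:
    """Return True only when the completion claim is backed by a verified rerun."""
    if not claimed_done or final_run is None:
        return False
    has_evidence = bool(final_run.log_files) and bool(final_run.screenshots)
    return (
        final_run.exit_code == 0
        and has_evidence
        and final_run.reflection_verdict == "success"
    )


if __name__ == "__main__":
    verified = FinalRun(0, ["run.log"], ["final.png"], "success")
    print(accept_done(True, verified))  # claim accepted
    print(accept_done(True, None))      # no rerun evidence: flag dropped, agent retries
```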

Long coding trajectories also quickly exceed context limits. To keep them in check, every 20 steps the agent automatically compresses its history into a single summary, keeping the context window compact and reducing token usage.
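A minimal sketch of periodic compression, assuming a simple list of text entries as the history. The names maybe_compress and naive_summarize are hypothetical, and the summarizer here merely keeps the last three entries where a real system would call the model.

```python
from typing import Callable

COMPRESS_EVERY = 20  # step interval stated in the article


def maybe_compress(
    history: list[str], step: int, summarize: Callable[[list[str]], str]
) -> list[str]:
    """Collapse the whole history into one summary entry at every 20th step."""
    if step > 0 and step % COMPRESS_EVERY == 0:
        return [f"SUMMARY(steps 1-{step}): {summarize(history)}"]
    return history


def naive_summarize(entries: list[str]) -> str:
    """Placeholder for an LLM summarization call: keeps only the last three entries."""
    return " | ".join(entries[-3:])


history: list[str] = []
for step in range(1, 46):
    history.append(f"step {step} output")
    history = maybe_compress(history, step, naive_summarize)

# After 45 steps: one summary covering steps 1-40 plus five raw entries.
print(len(history))  # prints 6
```

The context therefore never holds more than 20 entries plus one summary, regardless of how long the trajectory runs.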

Benchmark Results

Webwright was tested on two benchmarks: Online-Mind2Web and Odysseys.

Online-Mind2Web features 300 tasks spread across 136 popular websites and relies on an automated LLM-as-a-Judge evaluation setup. With a 100-step limit, Webwright running GPT-5.4 scored 86.67% overall accuracy, the highest result among all open-source harness recipes in the benchmark's AutoEval category. Claude Opus 4.7 achieved 84.7% overall but edged ahead on hard tasks at the same 100-step limit, reaching 80.5% compared to GPT-5.4’s 76.6%.

The team also recreated a GPT-5.4 baseline using a traditional screenshot-based agent approach, where the model guesses x,y coordinates for click and typing actions. With the identical underlying model, Webwright delivers major improvements across all three difficulty levels, demonstrating the advantage of a code-driven terminal-based method over predicting coordinates one step at a time.

Odysseys assesses long-horizon browsing tasks that span multiple websites, with instructions averaging 272.3 words. On the April 2026 leaderboard, the top-performing model was Opus 4.6 with a peak score of 44.5. Webwright powered by GPT-5.4 hits 60.1% — a 35.1% relative gain over the prior state of the art. Against the base GPT-5.4 score of 33.5%, this represents a 79.4% relative boost, or 26.6 absolute points.
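The relative and absolute gains quoted above follow directly from the three scores. The short check below reproduces them; relative_gain is a helper name introduced here for illustration.

```python
def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100


webwright = 60.1   # Webwright with GPT-5.4 on Odysseys
prior_best = 44.5  # Opus 4.6, April 2026 leaderboard
base_gpt = 33.5    # base GPT-5.4

print(f"vs. prior best:   {relative_gain(webwright, prior_best):.1f}% relative")  # 35.1%
print(f"vs. base GPT-5.4: {relative_gain(webwright, base_gpt):.1f}% relative")    # 79.4%
print(f"absolute gain:    {webwright - base_gpt:.1f} points")                     # 26.6
```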

Cost Analysis

Claude Opus 4.7 requires fewer steps per task on average (mean 21.9) than GPT-5.4 (mean 26.3). However, its pricing is considerably higher ($5 versus $2.50 per 1M input tokens, and $25 versus $15.00 per 1M output tokens as of April 2026), making the average cost per task greater ($6.09 versus $2.37). The first 50 steps already capture 82% accuracy, while the next 50 steps add just 3–4 more points.
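Dividing the reported per-task cost by the reported mean step count gives a rough per-step figure. This is only a back-of-the-envelope estimate from the averages above (a ratio of means, not a measured per-step cost), and cost_per_step is a helper introduced here.

```python
# Figures as reported in the article.
opus = {"mean_steps": 21.9, "cost_per_task": 6.09}
gpt = {"mean_steps": 26.3, "cost_per_task": 2.37}


def cost_per_step(model: dict[str, float]) -> float:
    """Approximate dollars per step: average task cost over average step count."""
    return model["cost_per_task"] / model["mean_steps"]


task_ratio = opus["cost_per_task"] / gpt["cost_per_task"]

print(f"Opus 4.7: ${cost_per_step(opus):.2f} per step")   # $0.28
print(f"GPT-5.4:  ${cost_per_step(gpt):.2f} per step")    # $0.09
print(f"Opus costs {task_ratio:.2f}x as much per task")   # 2.57x
```

So although Opus 4.7 finishes in fewer steps, each step costs roughly three times as much, which is why its per-task cost ends up higher.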

Small Model Performance

The researchers also evaluated Qwen3.5-9B on the hard subset of Online-Mind2Web. When tasks were enhanced with pre-built reusable tool scripts, Qwen3.5-9B reached 66.2% on Online-Mind2Web sites that offered more than five tools. This indicates that smaller, more affordable models can tackle complex web tasks effectively when supported by a pre-built tool library.

Marktechpost’s Visual Explainer

Webwright
Quick Start Guide

01 / 05 — Overview
What Is Webwright?
Webwright is an open-source, terminal-native web agent framework built by Microsoft Research. Rather than guessing one browser click at a time, the agent writes Playwright code, executes bash commands, and saves reusable scripts in a local workspace.

~1,000 lines of harness code spread across 3 modules — no hidden orchestration layer
Single agent loop: Runner, Model Endpoint, and terminal Environment
86.7% on Online-Mind2Web | 60.1% on Odysseys with GPT-5.4
Backends: OpenAI, Anthropic, OpenRouter
Scripts work in Claude Code, Codex, OpenClaw

# GitHub repository
github.com/microsoft/Webwright

02 / 05 — Prerequisites
What You Need Before Installing
Make sure the following are set up before running any install commands.

Python 3.10+ — minimum required runtime
Chromium — installed via Playwright in the next step
API key — OpenAI, Anthropic, or OpenRouter
Git — to clone the repository

# Check your Python version
python --version
# Must return Python 3.10 or higher

03 / 05 — Installation
Clone and Install Webwright
Clone the repo, install in editable mode, then set up Chromium for Playwright browser control.

# 1. Clone the repository
git clone https://github.com/microsoft/Webwright
cd Webwright

# 2. Install the package in editable mode
pip install -e .

# 3. Install Chromium for Playwright
playwright install chromium

The -e flag means changes to local source take effect immediately without reinstalling.

04 / 05 — Running a Task
Run Your First Web Task
Export your API key, then provide a task instruction and start URL via the CLI.

# Export your key
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a task
python -m webwright.run.cli \
  -c base.yaml -c model_openai.yaml \
  -t "Find cheapest economy flight SEA to JFK on 2026-05-15" \
  --start-url "<START_URL>" \
  --task-id demo_openai \
  -o outputs/default

Flag	Description
-c	Config file from src/webwright/config/ — stackable
-t	Task instruction in plain English
--start-url	Starting URL for the browser session
--task-id	Name of the output subfolder
-o	Root output directory for logs and scripts

05 / 05 — Claude Code Integration
Use Webwright as a Claude Code Skill
Webwright includes a built-in Claude Code skill. No separate LLM API key is required beyond your Claude Code subscription. Claude Code handles PNG screenshots natively.

# Project-scoped (inside this repo only)
mkdir -p .claude/skills .claude/commands
ln -s "$PWD/skills/webwright" .claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" .claude/commands/webwright

# User-scoped (all projects)
mkdir -p ~/.claude/skills ~/.claude/commands
ln -s "$PWD/skills/webwright" ~/.claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" ~/.claude/commands/webwright

Restart Claude Code after installing, then use slash commands:

# One-shot task
/webwright:run search Google Flights SEA to JFK 2026-05-15

# Reusable parameterized CLI tool
/webwright:craft search a ticket from LAX to SFO depart June 7

Key Takeaways

Webwright runs a terminal loop where the agent writes and executes Playwright code instead of predicting individual browser actions.
GPT-5.4 achieved 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odysseys — 26.6 points above the base GPT-5.4 score of 33.5%.
The harness spans ~1,000 lines across three modules with no multi-agent orchestration.
Qwen3.5-9B hit 66.2% on hard-split Online-Mind2Web sites that offered more than five pre-built tool scripts.
Task scripts are packaged as reusable CLIs, working across Claude Code, Codex, and OpenClaw.
