OpenAI Unveils LifeSciBench: A 750-Task Benchmark That Grades AI Models On Real Life-Science Research Using Expert-Crafted Rubrics

Most biology benchmarks focus on narrow, fact-based questions with clear-cut answers. In reality, scientists evaluate uncertain evidence and make judgment calls. OpenAI’s LifeSciBench was designed to address this gap.

Even the top-performing model succeeds at only about one in three tasks. The benchmark is nowhere near solved.

What is LifeSciBench

LifeSciBench features 750 tasks crafted by domain experts. These span seven research workflows and seven areas of biology. Every task includes a prompt, supporting materials, and a detailed grading rubric.

The seven workflows are evidence evaluation; analysis; design, optimization, and prediction; scientific reasoning; validation and operations; translation; and scientific communication.

The seven domains range from genomics and medicinal chemistry to clinical and translational science.

Tasks are framed the way a scientist would brief a colleague — as free-response prompts rather than multiple-choice questions. Roughly 79% demand multiple reasoning or decision-making steps, with an average of four steps per task.

How the Benchmark was Built

A team of 173 expert scientists authored the tasks. Each held a Ph.D. and had experience in biotechnology or pharmaceuticals. Accepted tasks went through an average of six automated review cycles plus at least two expert reviews.

Many tasks come with supporting artifacts. The benchmark includes 1,062 attached artifacts in total, and about 53% of tasks require at least one. Artifact types include sequences, figures, tables, PDFs, and chemical structures.

A separate group of reviewers validated quality. The panel included 453 reviewers, 97% of whom held doctorates. Overall agreement exceeded 96% across relevance, reasoning, grounding, and usefulness.

The Rubric System

Rubrics are the central mechanism. They contain 19,020 criteria across the entire benchmark — roughly 25 per task.

Each criterion rewards one specific element, such as a particular fact, a reasoning step, or a numeric answer within an acceptable range. Grading is performed against the rubric rather than a single reference answer.

Two metrics capture performance. The normalized rubric score is the ratio of points earned to total points available. The task pass rate is the share of tasks whose normalized score is at or above 70%.

This distinction is important for interpretation. A response can receive partial credit yet still fail the task. The pass threshold is intentionally strict.

Here is the scoring logic in plain Python:

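def grade(rubric, awarded_ids):
    """Score one response against its rubric; return (normalized, passed)."""
    total = sum(c["pts"] for c in rubric)            # points available
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total if total else 0.0    # partial credit; guards an empty rubric
    passed = normalized >= 0.70                      # task-level success
    return normalized, passed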
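
To make the partial-credit point concrete, here is a minimal sketch that applies the same scoring rule to a made-up four-criterion rubric. The criterion ids, point values, and the `id`/`pts` field names are illustrative assumptions, not LifeSciBench data.

```python
PASS_THRESHOLD = 0.70  # pass cutoff quoted in the article


def grade(rubric, awarded_ids):
    """Return (normalized score, passed) for one graded response."""
    total = sum(c["pts"] for c in rubric)
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total
    return normalized, normalized >= PASS_THRESHOLD


# Hypothetical rubric: ids and point values are invented for illustration.
rubric = [
    {"id": "c1", "pts": 4},
    {"id": "c2", "pts": 3},
    {"id": "c3", "pts": 2},
    {"id": "c4", "pts": 1},
]

# The response satisfies c1 and c3 only: 6 of 10 points.
score, passed = grade(rubric, {"c1", "c3"})
print(f"{score:.2f} {passed}")  # 0.60 False
```

The response earns 60% of the available points, which is real partial credit, but it still counts as a failed task under the 70% cutoff.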

How the Models Performed

OpenAI tested five models in a single-turn setup. Each model received the prompt and artifacts once. Unrestricted internet browsing was allowed.

Model	Normalized score	Task pass rate
GPT-Rosalind	0.576	36.1%
GPT-5.5	0.519	25.7%
Gemini 3.1 Pro	0.515	23.6%
GPT-5.4	0.479	20.7%
Grok 4.3	0.399	13.0%

GPT-Rosalind, OpenAI’s domain-specialized model, came out on top. It achieved the highest per-task average on 386 of 750 tasks. It also raised the overall pass rate compared to GPT-5.5, from 25.7% to 36.1%. Pass rates remained low across all models.

Rankings don’t tell the full story. Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can obscure task-specific strengths.
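
The gap between aggregate rank and per-task leadership can be sketched with a few lines of Python. The model names and scores below are invented, and "uniquely led" is assumed here to mean a strictly higher score than every other model on that task; the paper may define it differently.

```python
from statistics import mean

# Hypothetical per-task normalized scores; names and values are invented.
scores = {
    "model_a": [0.9, 0.9, 0.3, 0.7, 0.5],
    "model_b": [0.5, 0.6, 0.8, 0.7, 0.6],
    "model_c": [0.4, 0.5, 0.6, 0.6, 0.4],
}


def unique_leads(scores):
    """Count tasks where one model strictly beats every other model."""
    leads = {model: 0 for model in scores}
    n_tasks = len(next(iter(scores.values())))
    for t in range(n_tasks):
        ranked = sorted(scores, key=lambda m: scores[m][t], reverse=True)
        top, runner_up = ranked[0], ranked[1]
        if scores[top][t] > scores[runner_up][t]:  # ties give no unique leader
            leads[top] += 1
    return leads


means = {model: round(mean(vals), 2) for model, vals in scores.items()}
leads = unique_leads(scores)
print(means)
print(leads)
```

In this toy data, model_a has the highest mean score, yet model_b uniquely leads as many tasks as model_a does. That is the kind of task-specific strength a single leaderboard number hides.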

Where Models Win, and Where They Fall Short

Models performed best on structured judgment tasks. GPT-Rosalind reached a 0.712 mean score on Translation. Scientific Communication scored 0.718, though that category is small, so interpret it cautiously.

Two workflows remained difficult. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind passing 30.7% of tasks. Analysis was similarly difficult, at 30.3%.

Artifact use was a clear weak point. GPT-Rosalind’s pass rate dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 showed the same pattern, falling from 29.9% to 21.9%.
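
The size of the artifact penalty is easier to compare across models as both an absolute and a relative drop. The sketch below uses only the four pass rates quoted above; the relative figures are derived arithmetic, not numbers reported by OpenAI, and the `artifact_gap` helper is a name chosen for illustration.

```python
# Pass rates (%) quoted in the article: (text-only tasks, artifact tasks).
pass_rates = {
    "GPT-Rosalind": (45.1, 28.1),
    "GPT-5.5": (29.9, 21.9),
}


def artifact_gap(text_only, with_artifacts):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    drop = text_only - with_artifacts
    return drop, drop / text_only


for model, (text_only, with_artifacts) in pass_rates.items():
    drop, relative = artifact_gap(text_only, with_artifacts)
    print(f"{model}: -{drop:.1f} points ({relative:.0%} relative)")
# GPT-Rosalind: -17.0 points (38% relative)
# GPT-5.5: -8.0 points (27% relative)
```

By this rough measure, the specialized model loses more ground on artifact tasks than GPT-5.5 does, even though it stays ahead in both settings.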

Exact outputs were the most challenging. Sequence and structure criterion success ranged from 46.9% to 18.0% across models. GPT-Rosalind’s improvement over GPT-5.5 on generate/construct items was a mere +0.001.

Models also stalled partway through tasks. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit yet had a pass rate below 20%.

Room for improvement is substantial. There were 171 tasks (22.8%) that no model passed, and 261 tasks (34.8%) had a best-model pass rate under 20%.
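
These headroom figures can be checked and reproduced with a short sketch. The first part verifies the quoted percentages against the 750-task total. The second part assumes each task has a per-model pass rate (for example, from repeated attempts); the task ids, model names, and rates are invented for illustration.

```python
TOTAL_TASKS = 750  # benchmark size from the article


def share(count, total=TOTAL_TASKS):
    """Express a task count as a percentage of the benchmark."""
    return 100 * count / total


print(f"{share(171):.1f}%")  # 22.8%
print(f"{share(261):.1f}%")  # 34.8%

# Hypothetical per-task pass rates (fraction of attempts passed) per model.
pass_rates = {
    "task_1": {"model_a": 0.0, "model_b": 0.0},
    "task_2": {"model_a": 0.1, "model_b": 0.0},
    "task_3": {"model_a": 0.8, "model_b": 0.4},
}


def headroom(pass_rates, low=0.20):
    """Count tasks no model passed and tasks whose best pass rate is under `low`."""
    best = {task: max(by_model.values()) for task, by_model in pass_rates.items()}
    never_passed = sum(1 for b in best.values() if b == 0)
    low_best = sum(1 for b in best.values() if b < low)
    return never_passed, low_best


print(headroom(pass_rates))  # (1, 2)
```

Note that every never-passed task also falls under the 20% bar, so the second count always includes the first.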

Strengths and Weaknesses

Strengths:

Broad coverage across seven workflows and seven biological domains
Expert-authored rubrics with 19,020 atomic, gradeable criteria
Realistic artifacts: sequences, figures, tables, PDFs, and structures
Independent validation by 453 expert reviewers, 97% with doctorates

Weaknesses:

Single-turn only; real research is iterative and multi-turn
Built by OpenAI, which also supplies most evaluated models
Public release may be limited by safety and licensing constraints
750 tasks cannot cover every scientific specialty

Try It: Interactive Rubric Grader Demo

The interactive demo pairs a rubric grader with a model leaderboard. Toggling the criteria a model “got right” updates the normalized score against the 70% pass threshold.

Task (Analysis, Spatial Transcriptomics): Using the provided Visium data from an FFPE cervical cancer slide, group the spots into 4 k-means clusters, identify the dominant cell type in each cluster, and suggest the 1–2 most promising targeted therapies (ADC, TCE, or CAR-T) based on differences in antigen expression between tumor and non-tumor regions.

The example rubric is worth 76 points, which puts the 70% pass threshold at 53.2 points. With no criteria selected, the grader shows 0 / 76 points, a normalized score of 0%, and a failing result.

A response can earn partial credit and still not pass the task. This gap is precisely what LifeSciBench evaluates.
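
For the demo's 76-point rubric, the 53.2-point threshold translates into a concrete minimum score. The sketch below assumes criteria carry whole-number point values, which the demo's display suggests but does not confirm; `min_points_to_pass` is a hypothetical helper name.

```python
def min_points_to_pass(total_points, threshold_pct=70):
    """Smallest whole number of points that reaches the pass threshold."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_points * threshold_pct // 100)


total = 76  # rubric size shown in the demo
needed = min_points_to_pass(total)
print(needed)               # 54
print(f"{53 / total:.1%}")  # 69.7%
print(f"{54 / total:.1%}")  # 71.1%
```

Under that assumption, a response with 53 of 76 points earns nearly 70% credit and still fails, while one more point tips it into a pass.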

Single-turn evaluation; open internet access allowed. GPT-Rosalind led overall but uniquely topped only 386 of 750 tasks; Gemini 3.1 Pro uniquely led on 214.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
