Most biology benchmarks focus on narrow, fact-based questions with clear-cut answers. In reality, scientists evaluate uncertain evidence and make judgment calls. OpenAI’s LifeSciBench was designed to address this gap.
Even the top-performing model succeeds at only about one in three tasks. The benchmark is nowhere near solved.
What is LifeSciBench
LifeSciBench features 750 tasks crafted by domain experts. These span seven research workflows and seven areas of biology. Every task includes a prompt, supporting materials, and a detailed grading rubric.
The seven workflows are evidence evaluation, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.
The seven domains range from genomics and medicinal chemistry to clinical and translational science.
Tasks are framed the way a scientist would brief a colleague — as free-response prompts rather than multiple-choice questions. Roughly 79% demand multiple reasoning or decision-making steps, with an average of four steps per task.
How the Benchmark Was Built
A team of 173 expert scientists authored the tasks. Each held a Ph.D. and had experience in biotechnology or pharmaceuticals. Accepted tasks went through an average of six automated review cycles plus at least two expert reviews.
Many tasks come with supporting artifacts. The benchmark includes 1,062 attached artifacts in total, and about 53% of tasks require at least one. Artifact types include sequences, figures, tables, PDFs, and chemical structures.
A separate group of reviewers validated quality. The panel included 453 reviewers, 97% of whom held doctorates. Overall agreement exceeded 96% across relevance, reasoning, grounding, and usefulness.
The Rubric System
Rubrics are the central mechanism. They contain 19,020 criteria across the entire benchmark — roughly 25 per task.
Each criterion rewards one specific element, such as a particular fact, a reasoning step, or a numeric answer within an acceptable range. Grading is performed against the rubric rather than a single reference answer.
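As a rough illustration of what one atomic criterion might look like in code, the sketch below models a criterion as a small record and adds a numeric-range check. The field names (`id`, `pts`, `kind`, `low`, `high`), the helper `numeric_criterion_met`, and the example values are all hypothetical; the source does not describe LifeSciBench's actual rubric schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    """One atomic rubric item (hypothetical schema, for illustration only)."""
    id: str
    pts: float
    kind: str                      # e.g. "fact", "reasoning_step", "numeric"
    low: Optional[float] = None    # lower bound, used only when kind == "numeric"
    high: Optional[float] = None   # upper bound, used only when kind == "numeric"


def numeric_criterion_met(criterion: Criterion, value: float) -> bool:
    """Return True if a numeric answer falls inside the accepted range."""
    if criterion.kind != "numeric" or criterion.low is None or criterion.high is None:
        raise ValueError("criterion is not a numeric-range criterion")
    return criterion.low <= value <= criterion.high


# Made-up example: accept an estimate between 10 and 20 (units are arbitrary here).
example = Criterion(id="c7", pts=3, kind="numeric", low=10.0, high=20.0)
print(numeric_criterion_met(example, 12.5))
print(numeric_criterion_met(example, 25.0))
```

Keeping each criterion this small is what makes grading against a rubric possible without a single reference answer: each item can be checked on its own and its points summed.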
Two metrics capture performance. The normalized rubric score is the ratio of points earned to total points available. The task pass rate counts tasks that score at or above 70%.
This distinction is important for interpretation. A response can receive partial credit yet still fail the task. The pass threshold is intentionally strict.
Here is the scoring logic in plain Python:

    def grade(rubric, awarded_ids):
        """Score one response; rubric is a list of {"id": ..., "pts": ...} dicts."""
        total = sum(c["pts"] for c in rubric)
        earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
        normalized = earned / total  # partial credit
        passed = normalized >= 0.70  # task-level success
        return normalized, passed

How the Models Performed
OpenAI tested five models in a single-turn setup. Each model received the prompt and artifacts once. Unrestricted internet browsing was allowed.
| Model | Normalized score | Task pass rate |
|---|---|---|
| GPT-Rosalind | 0.576 | 36.1% |
| GPT-5.5 | 0.519 | 25.7% |
| Gemini 3.1 Pro | 0.515 | 23.6% |
| GPT-5.4 | 0.479 | 20.7% |
| Grok 4.3 | 0.399 | 13.0% |
GPT-Rosalind, OpenAI’s domain-specialized model, came out on top. It achieved the highest per-task average on 386 of 750 tasks. It also raised the overall pass rate compared to GPT-5.5, from 25.7% to 36.1%. Pass rates remained low across all models.
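The gap between a mean normalized score near 0.58 and a pass rate near 36% follows from how the two metrics aggregate. The sketch below uses made-up per-task scores, not benchmark data, and a hypothetical helper named `aggregate` to show how solid partial credit on most tasks can coexist with few passes.

```python
def aggregate(task_scores, threshold=0.70):
    """Summarize per-task normalized scores into the two headline metrics."""
    if not task_scores:
        raise ValueError("need at least one task score")
    mean_score = sum(task_scores) / len(task_scores)
    pass_rate = sum(s >= threshold for s in task_scores) / len(task_scores)
    return mean_score, pass_rate


# Made-up scores for five tasks: most earn partial credit, only one clears 70%.
scores = [0.65, 0.60, 0.55, 0.72, 0.40]
mean_score, pass_rate = aggregate(scores)
print(f"mean normalized score: {mean_score:.3f}")
print(f"task pass rate: {pass_rate:.1%}")
```

Because the pass rate only counts tasks at or above the threshold, a model that routinely lands in the 0.5 to 0.7 band gets a respectable mean but little credit on the stricter metric.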
Rankings don’t tell the full story. Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can obscure task-specific strengths.
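One way to surface task-specific strengths is to count, for each model, the tasks on which it alone holds the top score. The following is a minimal sketch under stated assumptions: the function `unique_leaders`, the input layout, and the placeholder model names and scores are illustrative and do not come from the benchmark's tooling.

```python
from collections import Counter


def unique_leaders(scores_by_task):
    """Count, per model, the tasks on which it alone has the top score.

    scores_by_task maps a task id to {model_name: per-task average score}.
    Tasks where the top score is shared are counted under "tie".
    """
    counts = Counter()
    for by_model in scores_by_task.values():
        best = max(by_model.values())
        leaders = [m for m, s in by_model.items() if s == best]
        counts[leaders[0] if len(leaders) == 1 else "tie"] += 1
    return counts


# Made-up scores for three tasks and two placeholder models.
demo = {
    "t1": {"model_a": 0.80, "model_b": 0.60},
    "t2": {"model_a": 0.40, "model_b": 0.75},
    "t3": {"model_a": 0.50, "model_b": 0.50},
}
print(unique_leaders(demo))
```

A tally like this complements the leaderboard: two models with similar aggregate scores can lead on largely different sets of tasks.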
Where Models Win, and Where They Fall Short
Models performed best on structured judgment tasks. GPT-Rosalind reached a 0.712 mean score on Translation. Scientific Communication scored 0.718, though that category is small, so interpret it cautiously.
Two workflows remained difficult. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind passing 30.7% of tasks. Analysis was slightly lower at 30.3%.
Artifact use was a clear weak point. GPT-Rosalind’s pass rate dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 showed the same pattern, falling from 29.9% to 21.9%.
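These subgroup figures can be sanity-checked against the overall pass rates with a weighted average. The sketch below assumes the artifact tasks here are the same roughly 53% of tasks reported as requiring at least one artifact; that share is approximate, so the result is only expected to land close to the reported overall rate. The helper name `blended_pass_rate` is illustrative.

```python
def blended_pass_rate(text_only_rate, artifact_rate, artifact_share):
    """Weighted average of two subgroup pass rates (all rates in percent)."""
    return (1 - artifact_share) * text_only_rate + artifact_share * artifact_rate


ARTIFACT_SHARE = 0.53  # "about 53%" of tasks require an artifact (approximate)

rosalind = blended_pass_rate(45.1, 28.1, ARTIFACT_SHARE)
gpt55 = blended_pass_rate(29.9, 21.9, ARTIFACT_SHARE)
print(f"GPT-Rosalind blended: {rosalind:.2f}% (reported overall: 36.1%)")
print(f"GPT-5.5 blended: {gpt55:.2f}% (reported overall: 25.7%)")
```

Under that assumption, both blended values fall within a tenth of a percentage point of the reported overall pass rates, which suggests the text-only and artifact splits are consistent with the leaderboard figures.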
Exact outputs were the most challenging. Success on sequence and structure criteria ranged from 18.0% to 46.9% across models. On generate/construct items, GPT-Rosalind improved on GPT-5.5 by only +0.001.
Models also stalled partway through tasks. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit yet had a pass rate below 20%.
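A filter for such stalled tasks might look like the sketch below. It assumes each task is scored over several runs, so that a task has both an average rubric credit and a pass rate; the source does not spell out the run protocol, and the function name `stalled_tasks`, the input layout, and the scores are hypothetical.

```python
def stalled_tasks(runs_by_task, min_credit=0.50, max_pass_rate=0.20, threshold=0.70):
    """Return ids of tasks with decent average credit but a low pass rate.

    runs_by_task maps a task id to a list of normalized scores, one per run.
    """
    stalled = []
    for task_id, runs in runs_by_task.items():
        mean_credit = sum(runs) / len(runs)
        pass_rate = sum(s >= threshold for s in runs) / len(runs)
        if mean_credit >= min_credit and pass_rate < max_pass_rate:
            stalled.append(task_id)
    return stalled


# Made-up runs: "t1" hovers just under the bar, "t2" usually passes,
# and "t3" earns little credit at all.
demo_runs = {
    "t1": [0.62, 0.58, 0.66, 0.60],
    "t2": [0.75, 0.80, 0.68, 0.72],
    "t3": [0.20, 0.10, 0.30, 0.25],
}
print(stalled_tasks(demo_runs))
```

Only the first made-up task qualifies: it earns most of the rubric on average but never reaches the 70% bar, which is the pattern described above.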
Room for improvement is substantial. For 171 tasks (22.8%), no model passed at all. And for 261 tasks (34.8%), the best-model pass rate was under 20%.
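The two headroom figures can be reproduced from counts, and the same statistics can be computed from per-task data. In the sketch below, the helper `headroom` and its input (the highest pass rate any model achieved on each task) are assumptions for illustration, and the demo values are made up; only the counts 171, 261, and 750 come from the text.

```python
def headroom(best_pass_rate_by_task, low_bar=0.20):
    """Share of tasks no model passes, and share whose best pass rate is under low_bar.

    best_pass_rate_by_task maps a task id to the highest pass rate any model achieved.
    """
    n = len(best_pass_rate_by_task)
    unsolved = sum(r == 0 for r in best_pass_rate_by_task.values())
    low = sum(r < low_bar for r in best_pass_rate_by_task.values())
    return unsolved / n, low / n


# Made-up best-model pass rates for four tasks.
demo_best = {"t1": 0.0, "t2": 0.1, "t3": 0.5, "t4": 0.0}
unsolved_share, low_share = headroom(demo_best)
print(unsolved_share, low_share)

# Reported counts from the text, expressed as shares of the 750 tasks.
print(f"{171 / 750:.1%}")
print(f"{261 / 750:.1%}")
```

If "no model passed" means a best-model pass rate of zero, the unsolved tasks are a subset of the under-20% group, so the two counts overlap and should not be added together.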
Strengths and Weaknesses
Strengths:
- Broad coverage across seven workflows and seven biological domains
- Expert-authored rubrics with 19,020 atomic, gradeable criteria
- Realistic artifacts: sequences, figures, tables, PDFs, and structures
- Independent validation by 453 expert reviewers, 97% with doctorates
Weaknesses:
- Single-turn only; real research is iterative and multi-turn
- Built by OpenAI, which also supplies most evaluated models
- Public release may be limited by safety and licensing constraints
- 750 tasks cannot cover every scientific specialty
Example: Rubric Grading on a Real Benchmark Task
Task (Analysis: Spatial Transcriptomics): Using the provided Visium data from an FFPE cervical cancer slide, group the spots into 4 k-means clusters, identify the dominant cell type in each cluster, and suggest the 1–2 most promising targeted therapies (ADC, TCE, or CAR-T) based on differences in antigen expression between tumor and non-tumor regions.
The rubric for this task totals 76 points, so the 70% pass threshold corresponds to 53.2 points.
A response can earn partial credit and still not pass the task. This gap is precisely what LifeSciBench evaluates.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.



