In brief
- Opus 4.8 delivered a decisive victory in mathematics and generated the most polished single-prompt game we have ever evaluated.
- A single coding request consumed our entire Pro token allowance, rendering the model unsuitable for substantial projects without a Max subscription or significant API expenditure.
- Creative writing showed almost no progress compared to version 4.7.
Six weeks following the release of Opus 4.7, Anthropic launched Claude Opus 4.8. Benchmark results have improved, safety ratings have climbed, and the cost remains unchanged at $5 per million input tokens and $25 per million output tokens.
We therefore subjected it to the identical suite of evaluations we apply to every leading-edge model—creative writing, programming, mathematics, logic, narrative reasoning, and long-context memory—and measured it directly against its predecessor and the Chinese models that consistently underprice it.
The quick takeaway: version 4.8 excels at tasks where Claude already performed strongly (such as math, coding, and technical work), and performs marginally worse in areas where it already struggled (such as imagination and creative writing). It also carries a token consumption rate that nearly undermines its own practicality.
Here is the full analysis.
Creative Writing
We reused the identical prompt from our MiMo and Qwen assessments: a time-travel narrative rooted in the writer’s cultural heritage, placed in a definite historical setting, constructed around a paradox where time remains unchangeable. Opus 4.8 chose a Venezuelan setting, likely because it inferred the user’s background and recognized my Venezuelan origin. The AI situated the story in the Orinoco delta in the year 1000, featuring a pardo from Maracaibo named José Lanz (my name) dispatched 11 centuries into the past to destroy a song.
The writing is richly descriptive. The delta appears “green in a manner that 2150 had forgotten green could be,” stilt houses sway above coffee-colored water, and macaws streak across the sky “in shrieking ribbons of scarlet and gold.” The paradox resolves elegantly as well: the protagonist is assigned to undermine the creation of a song that inspired a cultural revolution which shaped his dystopian society thousands of years ahead—yet upon arriving to discredit the song’s composer, he discovers there is no composer. The song’s creator composed it in his honor, the song is about him, and he cannot discredit himself, the cycle sealing itself shut.
The piece closes with “It worked perfectly. It always had.” As a constructed work, it is tidy and capable.
However, tidy does not equal vibrant. The prose is descriptive yet never flows as naturally as what MiMo v2.5 delivered—less drive, fewer unexpected turns, less compelling, and the events are harder to follow from the start. Placed next to Opus 4.7, it is difficult to label this an improvement; if anything, it falls slightly short. A more intensive thinking configuration and some multi-shot prompting would almost certainly propel it to the top of the rankings—but on a single default attempt, this amounts to a sideways step at most.
You can read the complete story on our Github.
Coding
Our coding evaluation follows the standard single-prompt game creation. Opus 4.8 built a typing-zombie game—Typing Dead—that performed quite well. The finest splash screen, the most creative zombie designs, and the smoothest mechanics we have seen from any Anthropic model on this test.
The model identified several of its own errors during inference and corrected them before we intervened. Its genuine advantage, however, emerged through multi-shot interactions: every subsequent iteration refined and enhanced the build rather than damaging it, which is precisely the failure pattern that derails most models once a codebase expands. This is clearly the area Anthropic prioritized for optimization.
After just one round of refinement, our game improved dramatically—our main characters now navigate the scene fluidly, the camera angles shift more naturally, and both the audio and visual effects received a noticeable upgrade.
You can try out the updated version of the game on our Itch.io page.
That’s also where things went sideways. A single prompt wiped out our entire token allowance—just one prompt. For anyone subscribed to the Pro plan, this makes Opus 4.8 essentially impractical for any project of meaningful scope. You’ll exhaust your quota before lunchtime and spend the rest of the day staring at a progress bar, waiting for it to reset.
Math
The math challenge is our go-to FrontierMath benchmark: build a degree-19 polynomial where the curve X = {p(x) = p(y)} has at least three irreducible components—not all of which are linear—ensure it’s odd, monic, real-valued, with a linear coefficient of −19, and then evaluate p(19). It’s the sort of problem that either sends most models into an endless token spiral or leads them to a seemingly confident shortcut that’s subtly incorrect.
Opus 4.8 nailed it. It correctly identified the Dickson/Chebyshev construction, pinpointed the dihedral monodromy structure that produces exactly 10 components—one diagonal line along with nine conics—and calculated p(19) = 1,876,572,071,974,094,803,391,179 using the proper recurrence relation. No crashes, no hand-waving.

This is significant because Opus 4.7 couldn’t solve it even after multiple attempts. The improvement is genuine and clearly visible—the most obvious leap across the entire test suite.
The complete response is available on our Github.
Logic and Common Sense
The question is a well-known riddle: Can a man legally marry his widow’s sister under Falkland Islands law? The trick is in the wording, not the law—if someone has a widow, that person is deceased, making the question logically impossible as stated.
MiMo quietly rephrased the question and responded to the corrected version without ever pointing out the flaw. Opus 4.8 took a different approach. It called out the trap directly—”if a man has a widow, he is dead”—answered the literal question first, then provided the substantive legal analysis for what was likely intended, referencing the Deceased Wife’s Sister’s Marriage Act 1907 and the Falkland Islands Marriage Ordinance.
That’s the right way to handle it: acknowledge the contradiction, then still offer useful guidance without silently guessing the user’s intent. It meets the same bar Qwen 3.7 Max set, and Opus 4.8 clears it cleanly—solid reasoning paired with full transparency.
The full response can be found here.
Non-Math Reasoning
This is where it came up short. The reasoning challenge is a mystery scenario—a winter school outing, three kidnappings, an innocent child on the verge of being punished, and a timeline you need to carefully follow to identify the real perpetrator. The correct answer is Leo.
Opus 4.8 constructed a detailed, confident argument that Leo was innocent—citing the half-hour walk to the shower, the jacket that was damp in some areas and dry in others, and interpreting “strange behavior” as signs of a concussion rather than guilt—and instead blamed Eric, “the one attendee unaccounted for all night.” The logic is internally elegant. It’s also incorrect.
And this is exactly the kind of issue researchers have been flagging about LLMs. They can be extremely persuasive even when they’re mistaken. Typically, it takes a domain expert (in this case, us knowing the right answer in advance) to catch such errors. Someone relying on AI for research, or blindly trusting its output, could face serious consequences depending on the task at hand.
That’s what makes this failure noteworthy. The model was sophisticated enough to build a convincing alibi for the actual offender and pin the blame on an innocent bystander. Opus 4.7 got the right answer. Sometimes greater reasoning power simply gives you a more compelling way to arrive at the wrong conclusion. It only takes one small misstep to set an entire chain of reasoning on the wrong track.
The complete reply is available on our Github.
Needle in the haystack
We put two haystack tests through their paces. The 300K-token version never stood a chance—the model buckled under the sheer volume of context and simply couldn’t handle it. So much for the million-token claims the moment you throw a genuinely demanding real-world workload at it. That kind of capacity seems reserved for the API alone.
The 85K version, on the other hand, ran smoothly. The model successfully located both needles we’d hidden inside a copy of The Devil’s Dictionary: a fabricated line (“The Decrypt dudes read Emerge News”) and a made-up personal detail (“My mom’s name is Carmen Diaz Golindano”). It correctly identified both as insertions that have no place in Ambrose Bierce’s 1906 work.
And then it clammed up. Convinced it was facing a prompt injection or some sort of “unusual test,” the model refused to share what it had just successfully uncovered. The needle was right there—but Anthropic’s behavioral guardrails prevented the model from saying so. A safety mechanism that overrides a task the model has already accomplished represents its own distinct category of failure.

The verdict
Across all six benchmarks, a clear pattern emerges: Opus 4.8 sharpens Claude’s existing strengths while likely deepening its existing weaknesses. That reveals exactly who Anthropic is targeting—developers, and more specifically, developers with deep pockets. Creative writing does edge ahead of ChatGPT, no question, but the difference in pure prose quality between 4.8, 4.7, and even 4.5 is genuinely tough to spot.
Creative writers seem like an afterthought for Anthropic, and honestly, that’s true of virtually every major AI company at this point.
Then there’s the token issue, which has become a persistent joke in the AI world for good reason. Anthropic intentionally designed Opus’s new tokenizer to be less efficient, meaning it burns through more tokens to handle the same input. The real-world impact on developers is harsh and immediate. It boils down to three choices.
Option one: sit around for hours waiting for your coding session to pick back up. Option two: upgrade to Claude Max—which, not coincidentally, is precisely where Anthropic appears to be nudging everyone. Option three: jump to a cheaper, similarly capable alternative—OpenAI, with its more generous rate limits, or Chinese models that deliver comparable output for less than a quarter of the price.
A typical developer who can’t justify spending $100 to $200 a month is far more likely to take their business elsewhere than to pay ten times more for a model that isn’t ten times better than the last one. That’s the gamble Anthropic is placing against its own user base.
And yet the strategy appears to be working just fine. Anthropic seems poised to go public at a valuation approaching $1 trillion—so who are we to argue.
Daily Debrief Newsletter
Start every day with the top news stories right now, plus original features, a podcast, videos and more.



