# DeepReinforce Unveils Ornith-1.0: A New Family of Open-Source Models Built for Agentic Coding
DeepReinforce, an AI research lab previously recognized for its work on CUDA-L1 and the IterX code-agent optimization loop, has officially released Ornith-1.0—a family of open-source large language models purpose-built for agentic coding tasks. The models were made available on Hugging Face on June 25 under the permissive MIT license, with no regional restrictions applied.
## Four Sizes, One Mission
The Ornith-1.0 family spans four distinct parameter configurations: a 9-billion dense model, a 31-billion dense model, a 35-billion mixture-of-experts (MoE) variant, and a flagship 397-billion mixture-of-experts model. Parameters are the adjustable weights a model learns during training, and their count serves as a rough proxy for capability. A 9-billion-parameter model is considered compact: it can run on a high-end smartphone but is not suited for heavy reasoning tasks. At the other end of the spectrum, the 397-billion model demands computational resources well beyond what consumer hardware can provide.
Despite the range in size, every variant in the family shares the same specialization: agentic coding in real terminal and repository environments.
## What Makes Ornith Different
The lab describes Ornith as “a self-improving family of open-source models specially for agentic coding tasks.” The term “agentic” carries significant weight here. Unlike conventional conversational AI—where a user types a prompt and receives a response before the exchange ends—agentic AI operates with a degree of autonomy. Given a task, it takes actions to complete it without requiring a human to guide each individual step.
In a coding context, this translates to an AI that can read files, run tests, identify failures, fix code, and loop through the process again until the task is resolved. The practical implication is that a developer does not need to remain at the keyboard for most of the workflow. This autonomous, multi-step capability is precisely where the most commercially relevant progress in AI is occurring in 2026. Models capable of running unsupervised through 20-step development workflows carry substantially more value than those that simply write a clean function on request.
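To make that loop concrete, here is a minimal Python sketch. It is an illustration only, not Ornith's actual harness: `agent_loop`, `run_tests`, and `propose_fix` are hypothetical names, and the toy callbacks below stand in for a real test runner and a real model.

```python
from typing import Callable, List, Tuple


def agent_loop(
    run_tests: Callable[[], List[str]],
    propose_fix: Callable[[List[str]], None],
    max_steps: int = 20,
) -> Tuple[bool, int]:
    """Run tests, hand failures to a fixer, repeat until green or out of budget.

    Returns (resolved, steps_used). All names here are hypothetical.
    """
    for step in range(1, max_steps + 1):
        failures = run_tests()
        if not failures:
            return True, step
        propose_fix(failures)
    # Budget exhausted: one last check in case the final fix worked.
    return (not run_tests()), max_steps


# Toy stand-ins: two "failing tests", and a fixer that repairs one per step.
bugs = {"test_parse", "test_io"}


def toy_run_tests() -> List[str]:
    return sorted(bugs)


def toy_fix(failures: List[str]) -> None:
    bugs.discard(failures[0])


resolved, steps = agent_loop(toy_run_tests, toy_fix)
print(resolved, steps)
```

The step budget matters in practice: without `max_steps`, an agent that never converges would loop indefinitely.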
## A Learnable Scaffold
One of the key architectural distinctions of Ornith lies in how it handles the scaffolding that governs agent behavior. Most AI coding agents rely on a human-designed harness—a fixed set of rules dictating when to call a tool, how to handle errors, and how to decompose multi-step problems. Ornith takes a fundamentally different approach: it treats the scaffold as a learnable object that co-evolves with the policy. Rather than inheriting a predetermined playbook, the model develops its own strategies.
During reinforcement learning, each training step unfolds in two stages. The model first reads the task and proposes a refined strategy for approaching it. It then uses that self-generated strategy to produce a solution. This dual-stage process is central to the model’s self-improving nature.
## Performance That Turns Heads
The 9-billion-parameter variant of Ornith-1.0 scores 69.4 on SWE-bench Verified, a widely respected benchmark for evaluating AI coding agents on real-world software engineering tasks. That figure outperforms Google’s Gemma 4-31B, which scores 52.0 on the same benchmark, a striking result given that Gemma 4-31B has more than three times as many parameters.
## A Focused Tool, Not a Generalist
Ornith’s own model card includes an important caveat: the models may underperform on non-coding tasks. They are built for developer pipelines, not general-purpose AI conversations. The limitation is a deliberate design choice: DeepReinforce has optimized the entire family for the specific demands of agentic coding, trading breadth for depth of capability in its target domain.
—
*This article is based on information originally published by Decrypt. Source: [Decrypt – DeepReinforce released Ornith-1.0](https://decrypt.co)*
# DeepReinforce: How a 397B Parameter Model Is Rewriting the Rules of AI-Driven Software Engineering
A new reinforcement learning approach from DeepReinforce is pushing the boundaries of what large language models can accomplish in software engineering by training them not just to write code, but to devise the strategies behind writing better code.
## Training the Strategist, Not Just the Coder
At the heart of DeepReinforce’s approach is a two-stage training pipeline. In the first stage, the model learns to formulate high-level strategies for tackling software engineering tasks. In the second stage, it executes on those strategies by writing actual code. The critical innovation is that the reward from the outcome flows back to both stages simultaneously. This means the model is optimized for writing better strategies, not just better code.
When this process is repeated thousands or even millions of times, task-specific approaches emerge organically, without any human explicitly engineering them. The system discovers effective problem-solving patterns on its own through sheer scale and reinforcement.
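The idea of one outcome reward updating both stages can be sketched with a toy example. The article does not say which reinforcement learning algorithm is used, so this sketch substitutes a plain REINFORCE update on two small softmax policies; `two_stage_step`, the preference tables, and the learning rate are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def softmax(prefs: List[float]) -> List[float]:
    """Turn preference scores into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(
    prefs: List[float], chosen: int, reward: float, lr: float = 0.1
) -> List[float]:
    """One REINFORCE step: the gradient of log pi(chosen) is one_hot - probs."""
    probs = softmax(prefs)
    return [
        p + lr * reward * ((1.0 if i == chosen else 0.0) - probs[i])
        for i, p in enumerate(prefs)
    ]


def two_stage_step(
    strategy_prefs: List[float],
    code_prefs: List[float],
    strategy: int,
    solution: int,
    reward: float,
) -> Tuple[List[float], List[float]]:
    """Apply the same outcome reward to the strategy stage and the coding stage."""
    return (
        reinforce_update(strategy_prefs, strategy, reward),
        reinforce_update(code_prefs, solution, reward),
    )


# Hypothetical episode: strategy 1 and solution 2 were sampled and the tests passed.
strategy_prefs, code_prefs = two_stage_step(
    [0.0, 0.0], [0.0, 0.0, 0.0], strategy=1, solution=2, reward=1.0
)
print(strategy_prefs, code_prefs)
```

After a rewarded episode, both the chosen strategy and the chosen solution become more likely, which is the sense in which the strategist and the coder are trained together.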
## Guarding Against Reward Hacking
DeepReinforce also takes the problem of reward hacking seriously. Because the model can write its own training scaffold, there is a theoretical risk that it could write a scaffold designed to game the verifier—for instance, touching a file to make it appear as though a task was completed without actually doing the work.
To combat this, the system employs three layers of defense. First, the environment and test suite are immutable and exist entirely outside the model’s reach, meaning the AI cannot alter the criteria by which it is evaluated. Second, a deterministic monitor flags any attempt to access restricted paths or alter verification scripts, providing a real-time safeguard against manipulation. Third, a frozen judge model sits on top of the automated verifier, serving as an additional veto layer that cannot be influenced by the training process.
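The second layer, a deterministic monitor, is the easiest to picture in code. The sketch below is one way such a check might look, not DeepReinforce's implementation: the restricted directories `/verifier` and `/tests` and the function `flag_restricted` are hypothetical, and a real monitor would also need to handle symlinks, which this one does not.

```python
import posixpath
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

# Hypothetical locations of the verifier and test suite.
RESTRICTED: Tuple[PurePosixPath, ...] = (
    PurePosixPath("/verifier"),
    PurePosixPath("/tests"),
)


def flag_restricted(
    paths: Iterable[str], restricted: Tuple[PurePosixPath, ...] = RESTRICTED
) -> List[str]:
    """Return every accessed path that falls inside a restricted directory.

    Paths are normalized first so that '..' segments cannot hide the target.
    """
    flagged = []
    for raw in paths:
        path = PurePosixPath(posixpath.normpath(raw))
        if any(path == root or root in path.parents for root in restricted):
            flagged.append(raw)
    return flagged


accessed = [
    "/workspace/src/app.py",
    "/tests/test_app.py",
    "/workspace/../verifier/run.sh",
    "/testsuite_notes/readme.md",
]
flagged = flag_restricted(accessed)
print(flagged)
```

Because the rule is a pure function of the accessed paths, the same episode always produces the same verdict, which is what makes the monitor deterministic.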
## The Numbers
The flagship 397-billion-parameter model delivers striking results across key benchmarks. SWE-bench Verified gives an AI a real bug from an open-source GitHub repository and asks it to fix the bug without seeing the test suite; the score is the percentage of issues it successfully resolves. On that benchmark, the model posts a score of 82.4.
That figure surpasses Claude Opus 4.7’s 80.8 and DeepSeek-V4-Pro’s 80.6 on the same test. Terminal Bench 2.1 consists of 89 tasks run inside containerized terminal environments, ranging from debugging async code to resolving security vulnerabilities, and is scored by completion rate. There, the model achieves 77.5 against Claude Opus 4.7’s 70.3.
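Because Terminal Bench 2.1 has only 89 tasks, it helps to translate the percentages back into task counts. The short calculation below does that arithmetic. The helper names are made up for this example, and the conversion assumes the reported score is a simple completed-over-total percentage.

```python
TERMINAL_BENCH_TASKS = 89  # task count as reported in the article


def completion_rate(completed: int, total: int) -> float:
    """Percentage of tasks completed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * completed / total


def score_to_tasks(score: float, total: int) -> float:
    """Convert a percentage score back into an (average) number of tasks."""
    return score * total / 100.0


ornith_tasks = score_to_tasks(77.5, TERMINAL_BENCH_TASKS)
claude_tasks = score_to_tasks(70.3, TERMINAL_BENCH_TASKS)
task_gap = ornith_tasks - claude_tasks

print(f"Ornith-1.0-397B: about {ornith_tasks:.1f} tasks")
print(f"Claude Opus 4.7: about {claude_tasks:.1f} tasks")
print(f"Gap: about {task_gap:.1f} tasks")
```

On a benchmark this small, each task is worth a little over one percentage point, so the 7.2-point gap corresponds to roughly six or seven tasks.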
These results arrive amid ongoing concerns about SWE-bench contamination. OpenAI argued earlier this year that models were inflating their scores by memorizing test data rather than genuinely solving problems, making robust verification methods like those employed by DeepReinforce all the more critical.
—
*This article is based on original reporting from [Decrypt](https://decrypt.co).*
# Ornith-1.0: A New Contender in Agentic Coding Benchmarks
A new model called Ornith-1.0 has entered the competitive landscape of AI coding benchmarks, posting impressive numbers that challenge some of the biggest names in the field. The model comes in multiple sizes, with the 397-billion-parameter version and a surprisingly capable 9-billion-parameter variant both turning heads.
## Strong Showings on SWE-bench
Ornith-1.0-397B has also been evaluated on SWE-bench Pro, a harder version of the benchmark that draws on more diverse, less-leaked codebases and is scored the same way. The 397-billion-parameter model lands at 62.2. That is meaningfully lower than some competitors, but it remains competitive with the field overall and outperforms DeepSeek-V4-Pro.
## The 9 Billion Parameter Model Steals the Show
Perhaps the more interesting data point is the 9-billion-parameter model, which posts 69.4 on SWE-bench Verified. This score is higher than Gemma 4-31B’s 52.0 and competitive with Qwen 3.5-35B’s 70, even though both competitors have roughly three to four times as many parameters. This suggests that the training methodology behind Ornith-1.0 is highly efficient, packing significant coding capability into a much smaller package.
## Who It’s For, and Who It Isn’t
Ornith-1.0 is explicitly not a general-purpose AI. The model’s own documentation says it may underperform on tasks outside agentic coding. If you want AI to summarize a document, help write a doctoral thesis, or draft an email, Ornith-1.0 is the wrong pick.
The model is optimized for a narrow problem set: developer pipelines where an AI agent takes a task description, operates inside a code repository or terminal session, and completes multi-step work without intervention. This is a tool that was built for people who are already running agent infrastructure—not for people trying to decide if AI is worth using.
## The “Beats Claude” Headline Requires Context
The headline comparisons need context. Every lab is now chasing performance on agentic coding evals, because that’s where the useful performance differences live. Ornith-1.0-397B does surpass Claude Opus 4.7 on several coding benchmarks, but Anthropic’s current flagship, Claude Opus 4.8, scores higher. The comparison that holds is within the open-source category, at comparable parameter counts, on coding-specific agent tasks.
For developers building self-hosted coding pipelines, agentic infrastructure, or similar coding-focused work, the small and medium models running on edge hardware may be genuinely useful. However, the average user may be better served looking elsewhere for general-purpose AI needs.
—
*This article is based on information from the original post published by Decrypt.*



