Researchers from China have introduced AntAngelMed, a large-scale open-source language model tailored for the medical field. According to the team, it is currently the largest and most capable open-source medical language model available.
What Is AntAngelMed?
AntAngelMed is a medical-focused language model with 103 billion total parameters. However, it does not use all of them at once during inference. It relies on a Mixture-of-Experts (MoE) architecture with a 1/32 activation ratio, meaning only 6.1 billion parameters are engaged at any given moment when handling a query.
To understand how MoE works: in a traditional dense model, every parameter is involved in processing each token. In an MoE setup, the network is split into multiple “expert” sub-networks, and a routing system picks only a small group of them for each input. This approach allows for a massive total parameter count — which generally means greater knowledge capacity — while keeping the actual computational cost tied to the much smaller number of active parameters.
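To make that routing concrete, here is a minimal, self-contained PyTorch sketch of a top-k MoE layer. It illustrates the general pattern (gate, select, mix) rather than AntAngelMed's actual architecture; the dimensions, the softmax gate, and k=2 are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Minimal top-k MoE layer: a router scores experts per token, and only
    the selected experts run, so compute tracks active (not total) params."""

    def __init__(self, dim=64, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts)  # one score per expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():  # each selected expert runs once
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64]); ~2/32 experts ran per token
```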
AntAngelMed builds on Ling-flash-2.0, a base model created by inclusionAI and shaped by what the team refers to as Ling Scaling Laws. Key optimizations added on top include: refined expert granularity, an adjusted shared expert ratio, attention balance mechanisms, sigmoid routing without auxiliary loss, an MTP (Multi-Token Prediction) layer, QK-Norm, and Partial-RoPE (where Rotary Position Embedding is applied to only some attention heads rather than all of them). The research team states that these combined design choices enable small-activation MoE models to achieve up to 7× greater efficiency compared to dense models of similar size — meaning AntAngelMed, with just 6.1B active parameters, can deliver performance comparable to a roughly 40B dense model. Additionally, as output length increases during inference, the speed advantage can also reach 7× or more over similarly sized dense models.
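One of those choices, sigmoid routing, is easy to illustrate in isolation: instead of a softmax competition across all experts, each expert receives an independent sigmoid score, and the top-k scores are renormalized for mixing. The snippet below is a hedged sketch of that scoring step only (the shapes and k are arbitrary), not the model's actual router, which additionally balances expert load without an auxiliary loss term.

```python
import torch

def sigmoid_route(gate_logits: torch.Tensor, top_k: int = 2):
    """Sigmoid routing sketch: score experts independently in [0, 1], keep the
    top-k per token, and renormalize so the mixed output stays well-scaled."""
    scores = torch.sigmoid(gate_logits)          # no softmax competition
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    weights = top_scores / top_scores.sum(dim=-1, keepdim=True)
    return weights, top_idx

w, idx = sigmoid_route(torch.randn(4, 32))       # 4 tokens, 32 experts
print(w.sum(dim=-1))                             # per-token weights sum to 1
```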

Training Pipeline
AntAngelMed follows a three-stage training process that layers broad language understanding with deep medical domain expertise.
In the first stage, the model undergoes continual pre-training on extensive medical corpora, including encyclopedias, web content, and academic publications. This phase starts from the Ling-flash-2.0 checkpoint, ensuring the model has a solid foundation in general reasoning before medical specialization begins.
The second stage involves Supervised Fine-Tuning (SFT) using a multi-source instruction dataset. This dataset blends general reasoning tasks — such as math, programming, and logic — to maintain chain-of-thought abilities, along with medical scenarios like doctor–patient Q&A, diagnostic reasoning, and safety and ethics cases.
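As a rough picture of what such a blend looks like in practice, here is a tiny sampler over source weights. The source names and ratios below are illustrative placeholders; the team's actual SFT mixture proportions are not stated in the source.

```python
import random

# Hypothetical SFT mixture: names and weights are illustrative placeholders,
# not the team's published data recipe.
SFT_MIX = {
    "medical_qa": 0.35, "diagnostic_reasoning": 0.20, "safety_ethics": 0.10,
    "math": 0.15, "code": 0.10, "logic": 0.10,
}

def sample_source(rng=random):
    """Pick which instruction source the next training example comes from."""
    names, weights = zip(*SFT_MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]

print([sample_source() for _ in range(5)])  # e.g. ['medical_qa', 'math', ...]
```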
The third stage applies Reinforcement Learning through the GRPO (Group Relative Policy Optimization) algorithm, paired with task-specific reward models. GRPO, first introduced in the DeepSeekMath paper, is a PPO variant that estimates baselines from group scores instead of a separate critic model, reducing computational overhead. Here, reward signals are crafted to guide the model toward empathy, structured clinical responses, safety boundaries, and evidence-based reasoning — all aimed at minimizing hallucinations on medical questions.
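The group-relative baseline at the heart of GRPO is compact enough to show directly. Below is a minimal sketch of the advantage computation alone, assuming scalar reward-model scores per sampled response; clipping, KL regularization, and the policy-gradient update itself are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: for each prompt, a group of responses is
    sampled, and the group's own reward mean/std serves as the baseline,
    replacing a separate learned critic. rewards: (prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # standardized within each group

# Toy reward-model scores for 4 sampled answers to each of 2 prompts.
scores = torch.tensor([[0.9, 0.2, 0.5, 0.8],
                       [0.1, 0.7, 0.3, 0.4]])
print(grpo_advantages(scores))
```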
Inference Performance
Running on NVIDIA H20 GPUs, AntAngelMed achieves over 200 tokens per second, which the team reports is roughly 3× faster than a 36-billion-parameter dense model. With YaRN (Yet Another RoPE extensioN) extrapolation, it supports a 128K context window, sufficient for processing full clinical documents, lengthy patient histories, or multi-turn medical conversations.
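For context, YaRN-style extension is typically exposed as a `rope_scaling` entry in a model's configuration. Below is a hedged sketch in the Hugging Face `transformers` convention; the scaling factor and native window are assumptions for illustration, not this model's published values.

```python
# Assumed values: a 4x YaRN factor over a 32K native window would yield the
# reported 128K context; the model's real config may use different numbers.
rope_scaling = {
    "rope_type": "yarn",                        # transformers' YaRN RoPE variant
    "factor": 4.0,                              # extension multiple over native length
    "original_max_position_embeddings": 32768,  # assumed native window
}
```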
The team has also released an FP8 quantized version of the model.
FP8 Quantization Combined with EAGLE3 Speculative Decoding
When integrated with EAGLE3 speculative decoding, inference throughput at a concurrency level of 32 shows substantial gains over FP8 alone: 71% on HumanEval, 45% on GSM8K, and 94% on Math-500. While these benchmarks assess coding and math reasoning rather than medical tasks, they indicate how consistently the speculative-decoding speedup holds across different output types.
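For intuition about why speculative decoding helps, here is a heavily simplified draft-and-verify round in plain Python. Real EAGLE3 drafts with a trained feature-level head and verifies all proposals in a single parallel forward pass; the sequential check and the toy token functions below are illustrative assumptions only.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One draft-and-verify round (the idea behind speculative decoding,
    heavily simplified): a cheap draft model proposes k tokens, the target
    model checks them (in practice, in one parallel pass), and the longest
    agreeing prefix is accepted. draft_next/target_next map a sequence to
    its next token; both are hypothetical stand-ins for real models."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):                      # draft k tokens autoregressively
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:                      # target verifies each drafted token
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))       # target always contributes one token
    return accepted

# Toy "models" over integer tokens: the draft mostly agrees with the target,
# so several tokens are usually accepted per target-model call.
target = lambda seq: (sum(seq) + 1) % 7
draft = lambda seq: (sum(seq) + 1) % 7 if len(seq) % 3 else 0
print(speculative_step(draft, target, [1, 2]))
```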
Benchmark Results
In HealthBench — OpenAI’s open-source medical evaluation benchmark that uses simulated multi-turn medical conversations to assess real-world clinical performance — AntAngelMed holds the top position among all open-source models, outperforming several leading proprietary models as well, with its greatest advantage on the HealthBench-Hard subset.
In MedAIBench, an evaluation platform managed by China’s National Artificial Intelligence Medical Industry Pilot Facility, AntAngelMed places among the highest performers, especially excelling in medical knowledge Q&A and medical ethics and safety domains.
In MedBench, a Chinese healthcare LLM benchmark encompassing 36 independently compiled datasets and roughly 700,000 samples spanning five areas (medical knowledge question answering, medical language understanding, medical language generation, complex medical reasoning, and safety and ethics), AntAngelMed ranks first overall.
Key Takeaways
- AntAngelMed is a 103-billion-parameter open-source medical large language model that activates only 6.1 billion parameters during inference, thanks to a 1/32 activation-ratio Mixture-of-Experts architecture inherited from Ling-flash-2.0.
- It follows a three-stage training pipeline: continual pre-training on medical corpora, supervised fine-tuning with a blend of general and clinical instruction data, and GRPO-based reinforcement learning to improve safety and diagnostic reasoning.
- On H20 hardware, the model delivers over 200 tokens per second and supports a 128K context window via YaRN extrapolation — roughly three times faster than a comparable 36-billion-parameter dense model.
- AntAngelMed claims the top spot among open-source models on OpenAI's HealthBench, outperforming several proprietary models, ranks among the top performers on MedAIBench, and places first overall on MedBench.
- The model is accessible on Hugging Face, ModelScope, and GitHub; model weights are licensed under Apache 2.0, code under MIT, and an FP8 quantized version has also been released.
Check out the Model Weights on Hugging Face, GitHub Repository, and Technical Details.



