NVIDIA Open-Sources Nemotron-Cascade 2: A 30B MoE Model with 3B Active Parameters Achieves Gold Medal IMO Performance

NVIDIA released Nemotron-Cascade 2, an open-weight 30B Mixture-of-Experts model with 3B active parameters. It achieves Gold Medal-level performance on the 2025 International Mathematical Olympiad and outperforms Qwen3.5-35B-A3B on key reasoning benchmarks.

Ggentic.news Editorial · via marktechpost

NVIDIA has released Nemotron-Cascade 2, an open-weight 30B Mixture-of-Experts (MoE) model with 3B activated parameters. The model is designed to maximize "intelligence density," delivering advanced reasoning capabilities at a fraction of the parameter scale of frontier models. It is the second open-weight LLM to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals.

Targeted Performance and Strategic Trade-offs

The primary value proposition of Nemotron-Cascade 2 is its specialized performance in mathematical reasoning, coding, alignment, and instruction following. The source material notes that it is "surely not a 'blanket win' across all benchmarks," but that it excels in these targeted categories.

Key benchmark comparisons against Qwen3.5-35B-A3B (released Feb 2026) and the larger Nemotron-3-Super-120B-A12B:

  • Mathematical reasoning: AIME 2025, 92.4 vs 91.9; HMMT Feb 2025, 94.6 vs 89.0
  • Coding: LiveCodeBench v6, 87.2 vs 74.6; IOI 2025, 439.28 vs 348.6+
  • Alignment and instruction following: ArenaHard v2, 83.5 vs 65.4+; IFBench, 82.9 vs 70.2

(Scores listed as Nemotron-Cascade 2 vs Qwen3.5-35B-A3B.)

Technical Architecture: Cascade RL and Multi-domain On-Policy Distillation

The model's reasoning capabilities stem from a post-training pipeline starting from the Nemotron-3-Nano-30B-A3B-Base model.

1. Supervised Fine-Tuning (SFT)

During SFT, NVIDIA's research team utilized a meticulously curated dataset where samples were packed into sequences of up to 256K tokens. The dataset included:

  • 1.9M Python reasoning traces and 1.3M Python tool-calling samples for competitive coding.
  • 816K samples for mathematical natural language proofs.
  • A specialized Software Engineering (SWE) blend consisting of 125K agentic and 389K agentless samples.
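The 256K-token sequence packing mentioned above can be sketched as a simple greedy bin-packer. This is an illustrative example under assumed inputs (pre-tokenized samples as lists of token IDs); `pack_samples` and its signature are hypothetical, not NVIDIA's actual data pipeline:

```python
# Illustrative first-fit-decreasing packer: combines tokenized SFT samples
# into training sequences of at most `max_len` tokens (e.g. 256K).
def pack_samples(samples: list[list[int]], max_len: int = 256_000) -> list[list[int]]:
    bins: list[list[int]] = []
    # Sorting longest-first reduces wasted space at the end of each sequence.
    for sample in sorted(samples, key=len, reverse=True):
        if len(sample) > max_len:
            continue  # a real pipeline would truncate or split instead
        for b in bins:
            if len(b) + len(sample) <= max_len:
                b.extend(sample)  # first bin with room
                break
        else:
            bins.append(list(sample))  # no bin fits; open a new sequence
    return bins
```

In practice, packing also requires attention masking (or position-ID resets) at sample boundaries so packed samples do not attend to each other; that detail is omitted here.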

2. Cascade Reinforcement Learning

Following SFT, the model underwent Cascade RL, which applies sequential, domain-wise training. This prevents catastrophic forgetting by allowing hyperparameters to be tailored to specific domains without destabilizing others. The pipeline includes stages for instruction-following (IF-RL), multi-domain RL, RLHF, long-context RL, and specialized Code and SWE RL.
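The domain-sequential schedule can be sketched as a simple loop over stage configurations. The stage names follow the article; the hyperparameter fields and values, and the `run_rl_stage` callback, are illustrative placeholders, not NVIDIA's actual training API:

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    lr: float                 # illustrative per-domain learning rate
    rollouts_per_prompt: int  # illustrative per-domain sampling budget

# Stage order as described in the article; hyperparameters are made up.
CASCADE_STAGES = [
    StageConfig("IF-RL", lr=1e-6, rollouts_per_prompt=8),
    StageConfig("multi-domain RL", lr=2e-6, rollouts_per_prompt=16),
    StageConfig("RLHF", lr=1e-6, rollouts_per_prompt=8),
    StageConfig("long-context RL", lr=5e-7, rollouts_per_prompt=4),
    StageConfig("Code/SWE RL", lr=1e-6, rollouts_per_prompt=16),
]

def cascade_rl(model, run_rl_stage):
    # Each stage starts from the previous stage's checkpoint, so each domain
    # gets its own hyperparameters without destabilizing earlier stages.
    for stage in CASCADE_STAGES:
        model = run_rl_stage(model, stage)
    return model
```

The point of the structure is that tuning, say, the long-context stage never touches the instruction-following stage's settings.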

3. Multi-Domain On-Policy Distillation (MOPD)

A critical innovation in Nemotron-Cascade 2 is the integration of MOPD into the Cascade RL process. MOPD uses the best-performing intermediate checkpoints (themselves derived from the same SFT initialization) as "teacher" models, which provide a dense token-level distillation advantage, defined mathematically as:

$$a_{t}^{\mathrm{MOPD}} = \log \pi^{\mathrm{domain}_{t}}(y_{t} \mid s_{t}) - \log \pi^{\mathrm{train}}(y_{t} \mid s_{t})$$

The research team's approach leverages these intermediate checkpoints to distill knowledge back into the main training model, enhancing performance across the targeted domains without requiring separate, full-sized teacher models.
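Under the formula above, the advantage is simply the per-token log-probability gap between the domain teacher and the training policy. A minimal numeric sketch, with illustrative function names and probabilities not taken from the release:

```python
import math

def mopd_advantages(teacher_logprobs: list[float],
                    train_logprobs: list[float]) -> list[float]:
    """Per-token distillation advantage: positive wherever the domain teacher
    assigns the sampled token more probability than the training policy."""
    return [lt - ls for lt, ls in zip(teacher_logprobs, train_logprobs)]

# Toy example: the teacher is more confident on tokens 0 and 2,
# and agrees with the training policy on token 1.
teacher = [math.log(0.6), math.log(0.2), math.log(0.5)]
student = [math.log(0.3), math.log(0.2), math.log(0.1)]
adv = mopd_advantages(teacher, student)
```

Because the signal is dense (one value per token rather than one reward per rollout), it can be folded into the same policy-gradient machinery as the RL stages.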

Availability and Implications

Nemotron-Cascade 2 is released as an open-weight model. Its architecture—a 30B parameter MoE with only 3B active parameters—prioritizes efficiency for inference and deployment while targeting state-of-the-art performance in reasoning-intensive tasks. The model's performance profile suggests it is optimized for applications requiring strong mathematical reasoning, competitive programming, and precise instruction following, rather than general-purpose chat.
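The efficiency claim can be checked with back-of-the-envelope arithmetic: per-token decoder compute scales roughly with twice the number of parameters actually used in the forward pass (the standard 2N multiply-accumulate estimate, which is an approximation that ignores attention and routing overhead):

```python
def flops_per_token(active_params: float) -> float:
    # Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
    return 2.0 * active_params

moe = flops_per_token(3e9)     # Nemotron-Cascade 2's 3B active path
dense = flops_per_token(30e9)  # a hypothetical dense 30B model
ratio = dense / moe            # the MoE is ~10x cheaper per token
```

All 30B parameters must still fit in memory, so the saving is in compute and latency, not in the memory footprint.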

AI Analysis

Nemotron-Cascade 2 represents a focused iteration in the efficient frontier of reasoning models. The 30B total / 3B active parameter MoE architecture is a direct play for practical deployment, offering a compelling trade-off: near-frontier reasoning performance (as evidenced by IMO Gold Medal status) at a drastically lower computational cost for inference than dense 70B+ class models or larger MoEs like the 120B-parameter Nemotron-3-Super.

The technical core is the Cascade RL + MOPD pipeline. Cascade RL's domain-sequential training is a pragmatic solution to the multi-objective optimization problem of building a 'generalist' reasoning model, mitigating catastrophic forgetting. The integration of MOPD is more novel; using intermediate checkpoints from the *same* training run as teachers for token-level distillation is a clever way to bootstrap performance without external supervision or massive ensemble costs. The mathematical formulation of the advantage `a_t^MOPD` suggests they are effectively measuring and reinforcing the policy shift towards domain-specific optimal responses during training.

Practitioners should note the model's specialized nature. The benchmarks show clear dominance in math (AIME, HMMT), coding (LiveCodeBench, IOI), and alignment (ArenaHard, IFBench) over a comparable model like Qwen3.5-35B-A3B. This makes it a strong candidate for tool-integrated agents, coding assistants, and math solvers. However, the source's caveat that it's not a 'blanket win' implies potential weaknesses in broader knowledge or multilingual tasks not highlighted in the release. The choice of a 256K context window during SFT on specialized data also signals a design for deep, complex reasoning chains rather than broad document processing.
Original source: marktechpost.com
