NVIDIA Releases Nemotron-Cascade 2: A 30B MoE Model with 3B Active Parameters, Achieves Gold Medal on IMO 2025

NVIDIA has open-sourced Nemotron-Cascade 2, a 30B parameter Mixture-of-Experts model with 3B active parameters. It achieves Gold Medal-level performance on the 2025 International Mathematical Olympiad and leads in coding benchmarks like LiveCodeBench v6.

1d ago · 3 min read · via marktechpost

NVIDIA has released Nemotron-Cascade 2, an open-weight large language model (LLM) with a Mixture-of-Experts (MoE) architecture. The model totals 30 billion parameters but activates only 3 billion during inference, a design focused on maximizing "intelligence density" for reasoning and agentic tasks. It is the second open-weight LLM to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals.
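The efficiency claim rests on sparse activation: a router scores all experts per token but only the top-k actually run, so active parameters stay a small fraction of the total. A minimal sketch of top-k MoE gating follows; the expert count, gating scheme, and toy expert functions are illustrative assumptions, not NVIDIA's published architecture details.

```python
import math

def route_top_k(gate_scores, k):
    """Return indices of the k experts with the highest gate scores."""
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

def moe_forward(x, experts, gate_scores, k):
    """Combine outputs of only the top-k experts, weighted by softmax gates.

    Experts outside the top-k are never evaluated, which is what keeps the
    active parameter count (3B) far below the total (30B)."""
    chosen = route_top_k(gate_scores, k)
    weights = [math.exp(gate_scores[i]) for i in chosen]
    z = sum(weights)
    return sum(w / z * experts[i](x) for i, w in zip(chosen, weights))

# Example: 8 toy experts, only 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]  # stand-in expert fns
out = moe_forward(2.0, experts,
                  gate_scores=[0.1, 2.0, 0.3, 1.5, 0, 0, 0, 0], k=2)
```

With k=2 of 8 experts active, only a quarter of the expert parameters participate in any single forward pass, mirroring the 3B-of-30B ratio at toy scale.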

Targeted Performance and Strategic Trade-offs

The model is engineered for specialized performance in mathematical reasoning, coding, alignment, and instruction following. According to the source, it is not a "blanket win" across all benchmarks but excels in targeted categories compared to recent models like Qwen3.5-35B-A3B (released February 2026) and the larger Nemotron-3-Super-120B-A12B.

Key benchmark results (Nemotron-Cascade 2 score first, followed by the strongest comparison score reported) include:

  • AIME 2025: 92.4 vs. 91.9
  • HMMT Feb 2025: 94.6 vs. 89.0
  • LiveCodeBench v6: 87.2 vs. 74.6
  • IOI 2025: 439.28 vs. 348.6+
  • ArenaHard v2: 83.5 vs. 65.4+
  • IFBench: 82.9 vs. 70.2

Technical Architecture: Cascade RL and Multi-domain On-Policy Distillation (MOPD)

The model's capabilities stem from a post-training pipeline applied to the Nemotron-3-Nano-30B-A3B-Base model.

1. Supervised Fine-Tuning (SFT)
The SFT phase used a curated dataset packed into sequences of up to 256K tokens. This dataset included:

  • 1.9 million Python reasoning traces and 1.3 million Python tool-calling samples for competitive coding.
  • 816,000 samples for mathematical natural language proofs.
  • A specialized Software Engineering (SWE) blend of 125,000 agentic and 389,000 agentless samples.
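Packing samples into 256K-token sequences means concatenating many shorter examples into each fixed-length training sequence. The first-fit strategy below is an illustrative assumption; the article does not specify the actual packing algorithm.

```python
MAX_LEN = 256_000  # 256K-token sequences, per the article

def pack_sequences(sample_lengths, max_len=MAX_LEN):
    """First-fit packing: place each sample in the first bin with room.

    Each bin is a list of sample lengths whose sum stays within max_len;
    a new bin is opened only when no existing bin can fit the sample."""
    bins = []
    for length in sample_lengths:
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

# Example: six samples of varying length packed into 256K-token sequences.
packed = pack_sequences([200_000, 100_000, 56_000, 150_000, 30_000, 90_000])
```

Packing keeps long-context training batches dense: no sequence slot is wasted padding out a short sample to 256K tokens.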

2. Cascade Reinforcement Learning
Following SFT, the model underwent Cascade RL, a sequential, domain-wise training process designed to prevent catastrophic forgetting. This pipeline includes stages for:

  • Instruction-following (IF-RL)
  • Multi-domain RL
  • RLHF
  • Long-context RL
  • Specialized Code and SWE RL
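The sequential structure of Cascade RL can be sketched as a simple loop that threads one checkpoint through the stages in order, so each stage starts from the skills the previous one consolidated. Stage names follow the article; the trainer callable is a hypothetical placeholder.

```python
# Stage order as listed in the article.
STAGES = [
    "instruction-following (IF-RL)",
    "multi-domain RL",
    "RLHF",
    "long-context RL",
    "code and SWE RL",
]

def cascade_rl(initial_checkpoint, train_stage):
    """Run each RL stage in sequence, threading the checkpoint through.

    train_stage(checkpoint, stage) stands in for one domain's RL run and
    returns the updated checkpoint."""
    checkpoint = initial_checkpoint
    history = []
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)  # one domain at a time
        history.append((stage, checkpoint))
    return checkpoint, history

# Example with a stub trainer that just tags the checkpoint name.
final, history = cascade_rl("sft-ckpt",
                            lambda ckpt, s: f"{ckpt}+{s.split()[0]}")
```

The point of the sequencing is that later stages fine-tune a checkpoint that already encodes earlier domains, rather than training all domains jointly or restarting from the SFT model each time.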

3. Multi-Domain On-Policy Distillation (MOPD)
A key innovation is the integration of MOPD during Cascade RL. This technique uses the best-performing intermediate "teacher" models—derived from the same SFT initialization—to provide a dense token-level distillation advantage. The advantage is defined mathematically as:

$$a_{t}^{\mathrm{MOPD}} = \log \pi^{\mathrm{domain}}(y_{t} \mid s_{t}) - \log \pi^{\mathrm{train}}(y_{t} \mid s_{t})$$
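The MOPD advantage is just the per-token log-probability gap between the domain teacher and the policy being trained. A minimal sketch, assuming each policy is given as a token-to-probability table (the real implementation would operate on model logits):

```python
import math

def mopd_advantage(teacher_probs, student_probs, token):
    """a_t = log pi_domain(y_t | s_t) - log pi_train(y_t | s_t)."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

# A token the teacher rates more likely than the student gets a positive
# advantage, nudging the student toward the teacher's distribution.
adv = mopd_advantage(teacher_probs={"x": 0.8},
                     student_probs={"x": 0.4},
                     token="x")
```

Because the signal exists at every token rather than only at the end of a rollout, it is far denser than a scalar episode reward, which is the efficiency argument for mixing distillation into the RL stages.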

Availability and Context

The model has been released and open-sourced on the Hugging Face Hub. This announcement follows recent NVIDIA news, including CEO Jensen Huang's mandate for engineers to spend 50% of their salary on AI inference tokens to drive productivity and the announcement of OpenClaw software.

AI Analysis

Nemotron-Cascade 2 represents a focused engineering effort to push the performance-per-parameter frontier in specialized domains, particularly reasoning and coding. The 30B total / 3B active parameter MoE design is a direct counter to the trend of ever-larger dense models, offering a more efficient inference profile for targeted high-stakes tasks. The reported benchmarks suggest it has carved out a leading position among similarly sized open models in its chosen domains.

The technical pipeline is notable for its complexity and domain-specific tuning. The Cascade RL approach, which applies reinforcement learning sequentially across different task families, is a pragmatic method to build a multi-talented model without catastrophic forgetting. The integration of Multi-domain On-Policy Distillation (MOPD) is an interesting technical detail; it essentially allows the model to distill knowledge from its own best-performing checkpoints during training, creating a more efficient self-improvement loop. Practitioners should examine whether this specialized training pipeline, rather than just the architecture, is the primary driver of its benchmark success.

For the open-source community and developers, the release provides a high-performance, efficient model for agentic and reasoning workloads. Its strong showing on LiveCodeBench and IOI makes it a compelling candidate for coding assistants and competitive programming tools. However, the source's caveat that it is not a "blanket win" is crucial; its general conversational or knowledge-based performance may lag behind more balanced models of similar size. The true test will be its performance in integrated, real-world agent systems beyond isolated benchmarks.
Original source: marktechpost.com
