A 30B-A3B reasoning model from @stingning achieves gold-medal-level performance on both physics and math Olympiad evaluations. The model, released publicly on Hugging Face, targets high-difficulty multi-step reasoning tasks.
Key facts
- 30B total parameters, 3B active per forward pass
- Gold-medal level on physics Olympiad evaluations
- Gold-medal level on math Olympiad evaluations
- Sparse MoE architecture reduces inference compute ~10x
- Model released publicly on Hugging Face
The model pairs a 30B total parameter count with 3B active parameters per forward pass, a sparse-activation design that cuts inference compute roughly 10x relative to a dense 30B model. Olympiad evaluations test reasoning under constrained, multi-step problem-solving conditions, often requiring the integration of multiple concepts and symbolic manipulation. [According to @HuggingPapers]
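To make the 30B-A3B arithmetic concrete, here is a minimal toy sketch of top-k expert routing, the mechanism behind this kind of sparse activation. It is not the released model's code; the expert count, top-k value, and layer sizes are illustrative assumptions chosen so that roughly one tenth of the expert parameters participates per token, mirroring the 3B-of-30B ratio behind the ~10x figure.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative top-k MoE layer (not the released model's architecture).

    With num_experts=10 and top_k=1, only ~1/10 of the expert parameters
    participate in each token's forward pass, the same ratio as 3B active
    out of 30B total, hence the ~10x per-token FLOP saving.
    """
    def __init__(self, d_model=512, d_ff=2048, num_experts=10, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick k experts/token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out
```

Because each token only touches the weights of its routed experts, per-token FLOPs scale with active rather than total parameters; 30B total over 3B active is where the ~10x estimate comes from.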
Unique Take
This release is significant not because of the benchmark score alone (several dense models have reached gold-medal level on math Olympiads) but because it does so with a sparse Mixture-of-Experts (MoE) architecture. The 30B-A3B design means only 3B parameters are active per token, making inference far cheaper than with a dense 30B model. If sparse architectures can sustain top-tier reasoning, they could reshape the cost calculus for deploying reasoning models in production, especially for math and science tutoring applications. The model's performance on physics Olympiad problems is particularly notable given the field's reliance on symbolic manipulation and multi-concept integration, areas where MoE models have historically struggled due to potential expert routing instability.
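On routing instability: one widely used remedy, not confirmed as part of this particular release, is an auxiliary load-balancing loss in the style of Switch Transformer (Fedus et al., 2021), sketched here for context.

```python
import torch

def load_balancing_loss(router_probs, expert_index, num_experts):
    """Auxiliary load-balancing loss in the style of Switch Transformer
    (Fedus et al., 2021), a common remedy for routing instability.
    Shown for illustration only; not taken from this model's release.

    router_probs: (tokens, num_experts) softmax outputs of the router
    expert_index: (tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens dispatched to each expert
    dispatch = torch.nn.functional.one_hot(expert_index, num_experts).float()
    f = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to each expert
    p = router_probs.mean(dim=0)
    # Minimized when routing is uniform across experts
    return num_experts * torch.sum(f * p)
```

The term is minimized when tokens spread uniformly over experts, which keeps individual experts from collapsing out of use during training.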
How the model compares
The model's gold-medal threshold on both Olympiads suggests it matches or outperforms earlier strong models such as DeepSeek-Math 7B and GPT-4 on these specific benchmarks. However, the announcement did not disclose exact scores or comparison baselines, which makes it difficult to assess whether the sparse architecture truly matches dense performance or whether the Olympiad evaluations used are easier than typical competition problems. The open-source release on Hugging Face allows independent verification, which will be crucial for the community to trust the claim.
What to watch
Watch for independent replication of the Olympiad results by third-party evaluators using the publicly released weights (a minimal harness is sketched below). Also track whether the model's sparse routing holds up on broader reasoning benchmarks such as MATH-500 or GPQA, which would test generalization beyond the training distribution.
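Replication is straightforward to set up once the weights are public. Here is a minimal harness assuming the standard transformers loading path; the actual repo id was not given in the announcement, so the one below is a placeholder.

```python
# Hypothetical verification harness. "org/model-30B-A3B" is a placeholder;
# the real Hugging Face repo id was not stated in the announcement.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-30B-A3B"  # placeholder, swap in the real repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# An Olympiad-style prompt, stated for illustration
prompt = "A block of mass m slides down a frictionless incline of angle theta. Derive its acceleration."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Greedy decoding (do_sample=False) keeps outputs reproducible across evaluators, which matters when comparing replication attempts against the claimed gold-medal threshold.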