MiniMax M3 Exceeds Human Gold-Medal Threshold on Math Benchmarks via MaxProof

MiniMax says its M3 model exceeded the human gold-medal threshold on math benchmarks using MaxProof, but no scores or details were disclosed.

Ala SMITH & AI Research Desk · Jun 12, 2026 · 3 min read · AI-Generated

Source: x.com via @MiniMax_AI · Multi-Source

Did MiniMax's M3 model surpass the human gold-medal threshold on math benchmarks?

MiniMax's M3 model, using the MaxProof framework, exceeded the human gold-medal threshold on both math benchmark sets, according to a post by Ryan Lee on X.

TL;DR

M3 is said to beat the human gold-medal threshold. · MaxProof framework is credited with the gain. · A linked paper is said to detail the technical approach.

MiniMax's M3 model exceeded the human gold-medal threshold on both math benchmark sets using the MaxProof framework. The claim was announced via a repost by @MiniMax_AI of a post by Ryan Lee, with a link to a paper.

Key facts

M3 reportedly exceeded the human gold-medal threshold on both math sets.
MaxProof framework is the claimed method.
No benchmark names or scores disclosed.
Announced via X post by Ryan Lee/MiniMax.
Full paper not yet publicly available.

MiniMax's M3 model exceeded the human gold-medal threshold on both math benchmark sets using the MaxProof framework, according to a post on X by Ryan Lee that @MiniMax_AI reposted. The post links to a paper titled 'MaxProof: ...' but the full text is not yet available. No benchmark names, numerical scores, or comparison baselines were disclosed in the announcement, so the claim cannot be independently verified at this stage.

The claim is notable because exceeding a human gold-medal threshold on math benchmarks requires strong reasoning and step-by-step verification. The announcement does not name the benchmarks; gold-medal lines are usually associated with olympiad-style competitions such as the IMO. The MaxProof framework likely introduces a proof-based verification mechanism to reduce hallucination in mathematical reasoning, a known weakness in large language models. However, without published scores or ablations, it is unclear whether M3 achieves this via scaling, novel architecture, or a specialized inference-time procedure.

This announcement follows a pattern of Chinese AI labs—including DeepSeek, Alibaba's Qwen, and Baidu's ERNIE—publishing strong benchmark results on math reasoning tasks. MiniMax, known for its multimodal models and video generation (Hailuo AI), has not previously emphasized mathematical reasoning. The shift suggests a strategic pivot to compete in the reasoning-heavy segment dominated by OpenAI's o-series and Anthropic's Claude.

The company did not disclose the exact scores, dataset names, or training details. Until the paper is released and results are replicated, the claim should be treated as preliminary. The broader trend: math reasoning benchmarks are becoming a standard proxy for general reasoning capability, and any model that claims to exceed human expert performance invites scrutiny.

What the paper may reveal

The linked paper—assuming it follows the typical arXiv format—will likely describe: (1) the MaxProof framework's mechanism for generating and verifying proofs, (2) the training data and fine-tuning methodology for M3, (3) ablation studies comparing M3 with and without MaxProof, and (4) results on standard benchmarks such as AIME, AMC, or MATH. If MaxProof is a new verification layer, it could be applicable beyond math to code generation and formal verification.
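A generate-then-verify loop of the kind speculated about above can be sketched in a few lines. This is an illustration only and not MaxProof's mechanism, which has not been published: the function name `first_verified`, the toy proposals, and the toy verifier are all hypothetical stand-ins for a model that proposes solutions and a checker that accepts or rejects them.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_verified(candidates: Iterable[T], verify: Callable[[T], bool]) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None if all are rejected."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins (hypothetical): the "generator" proposes integers and the
# "verifier" checks a property that is cheap to test exactly.
proposals = [35, 36, 37, 38]
accepted = first_verified(proposals, lambda x: x * x == 1369)
print(accepted)
```

The sketch only helps if the verifier is reliable: a checker that accepts wrong answers passes hallucinations through, and each rejected candidate costs additional inference compute.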

Caveats and context

Human gold-medal thresholds on math competitions are not static. At olympiads such as the IMO, which is scored out of 42 points, the gold cutoff is reset each year from the distribution of contestant scores. The AIME (scored out of 15) and the AMC 12 (scored out of 150) are usually described in terms of qualification cutoffs and honor rolls rather than a gold-medal line. Exceeding a threshold does not mean the model solves all problems, only that it achieves a score at or above the cutoff. Moreover, benchmark contamination (training on test data) remains a concern for all frontier models. Without a public evaluation set or a third-party audit, the result is unverified.
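The gap between clearing a cutoff and solving every problem is easy to see with a small worked example. The sketch below assumes an olympiad-style format of six problems worth 7 points each; the per-problem scores and the cutoff of 29 are hypothetical values chosen for illustration, not figures from the announcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Per-problem scores for one contestant or model (hypothetical data)."""
    per_problem: tuple[int, ...]
    max_per_problem: int = 7  # assumed olympiad-style scoring

    @property
    def total(self) -> int:
        return sum(self.per_problem)

    @property
    def fully_solved(self) -> int:
        # Count only problems that earned full marks.
        return sum(1 for p in self.per_problem if p == self.max_per_problem)


def meets_cutoff(result: Result, cutoff: int) -> bool:
    """True when the total score is at or above the medal cutoff."""
    return result.total >= cutoff


# Hypothetical scores: three full solutions, two partial, one blank.
example = Result(per_problem=(7, 7, 7, 6, 3, 0))
print(example.total, example.fully_solved, meets_cutoff(example, cutoff=29))
```

With these made-up numbers the total of 30 clears the cutoff even though only three of the six problems are fully solved, which is why a reported threshold result needs per-problem scores to be interpreted.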

What to watch

Watch for the full MaxProof paper on arXiv and whether independent evaluators replicate the result on standard math benchmarks like AIME 2025. Also monitor MiniMax's next model release—if M3 is a reasoning-focused variant, expect comparisons to OpenAI's o3 and DeepSeek-R1.

Source: gentic.news · Jun 12, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This announcement is thin on specifics: it gives no benchmark names, scores, or dataset details, so the claim's significance cannot be assessed. Chinese AI labs have made bold benchmark claims before that could not be reproduced immediately (e.g., DeepSeek's earlier math results). The MaxProof framework could be a novel verification method, but without ablations or code the announcement is indistinguishable from a press release. The strategic angle: MiniMax is known for multimodal and video models, not reasoning, so a pivot to math reasoning suggests it sees a market gap or a technical advantage. However, the lack of transparency hurts credibility, especially next to DeepSeek's R1, which came with detailed technical reports and open-sourced weights. If MaxProof is a lightweight verification layer, it could be applied broadly; if it requires massive inference compute, it may not be practical. The real test will be third-party replication on held-out problems.

#reasoning #mathematics #ai #benchmarks
