Minimax M2.7 Achieves 56.2% on SWE-Pro, Features Self-Evolving Training with 100+ Autonomous Optimization Loops

Minimax has released M2.7, a model that reportedly used autonomous optimization loops during RL training to achieve a 30% internal improvement. It scores 56.2% on SWE-Pro, which the company describes as near Claude Opus 4.6, and reportedly ties Gemini 3.1 on MLE Bench Lite.

3h ago · 2 min read · via @kimmonismus

What Happened

Minimax has announced the release of its M2.7 model. The announcement, made via a post on X (formerly Twitter), highlights several claimed performance benchmarks and a novel training approach.

The key claims are:

  • Self-Evolving Training: Described as the "first model that helped build itself," M2.7 reportedly ran over 100 autonomous optimization loops during its own reinforcement learning (RL) training phase, leading to a claimed 30% "internal improvement."
  • Coding Performance: The model is said to score 56.2% on SWE-Pro (described as "near Opus 4.6") and 55.6% on VIBE-Pro. For production debugging tasks, it is claimed to reduce resolution time to under 3 minutes.
  • ML Research Agent Capability: On MLE Bench Lite, the model allegedly achieves a 66.6% medal rate, tying the performance of Gemini 3.1.
  • Office & Analytical Work: The company claims the highest Elo among open-source models (1495) on GDPval-AA with 97% skill adherence, and says the model can handle end-to-end analyst workflows including reports, models, and presentations.
  • New Features: The release includes native multi-agent support and a new open-source interactive character demo called OpenRoom.

Context

Minimax is a Chinese AI company known for its large language models. The "self-evolving" claim, if substantiated, would represent a significant technical narrative, suggesting a move towards more automated and iterative model improvement processes. The benchmark comparisons position M2.7 against leading models from Anthropic (Claude Opus 4.6) and Google (Gemini 3.1) in specific domains like coding and ML research.

It is crucial to note that this information comes from a promotional social media post. Independent verification of the reported benchmarks (SWE-Pro, VIBE-Pro, MLE Bench Lite, GDPval-AA) and the "self-evolving" training methodology is not available in the source material. The details of the autonomous optimization loops and the nature of the 30% improvement are not specified.

AI Analysis

The most technically provocative claim is the "self-evolving" training process involving "100+ autonomous optimization loops." This suggests an RL training regime where the model itself, or an adjacent system, is actively proposing and testing improvements to its own training objective, architecture, or data sampling strategy. A 30% "internal improvement" is vague—it could refer to reward model scores, loss reduction, or performance on a held-out validation set. Without a detailed paper, this remains a compelling but unverified narrative about automating hyperparameter optimization and curriculum learning at scale.

The benchmark results are strategically chosen. A 56.2% score on SWE-Pro, if accurate, places M2.7 in the upper tier of coding models, directly competing with the best-in-class. Similarly, tying Gemini 3.1 on MLE Bench Lite targets the research and development audience. The GDPval-AA claim for office work attempts to carve out a practical, enterprise-ready niche.

The release appears to be a broadside across multiple fronts: coding, research, and productivity, wrapped in a novel training story. Practitioners should look for independent evaluations on these specific benchmarks before drawing firm conclusions about its relative standing.
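Absent a technical report, one plausible reading of "100+ autonomous optimization loops" is a simple outer loop wrapped around RL training: propose a change to the training configuration, run a short training segment, score the result on held-out tasks, and keep the change only if the score improves. The sketch below illustrates that pattern only; `run_rl_segment`, `propose_change`, and the hyperparameter names are hypothetical stand-ins, not Minimax's actual method.

```python
import random

def run_rl_segment(config: dict) -> float:
    """Simulate a short RL training run and return a held-out score.

    Placeholder: scores a config by its closeness to an optimum the
    loop cannot see, plus evaluation noise. A real system would train
    the model under `config` and evaluate it on held-out tasks.
    """
    target = {"lr": 3e-4, "kl_coef": 0.05, "curriculum_mix": 0.7}
    dist = sum((config[k] - target[k]) ** 2 / target[k] ** 2 for k in target)
    return max(0.0, 1.0 - dist) + random.gauss(0, 0.01)

def propose_change(config: dict) -> dict:
    """Propose a small multiplicative perturbation to one hyperparameter."""
    key = random.choice(list(config))
    candidate = dict(config)
    candidate[key] *= random.uniform(0.8, 1.25)
    return candidate

def self_optimize(config: dict, loops: int = 100) -> tuple[dict, list[float]]:
    """Greedy outer loop: keep a proposed change only if the score improves."""
    best_score = run_rl_segment(config)
    history = [best_score]
    for _ in range(loops):
        candidate = propose_change(config)
        score = run_rl_segment(candidate)
        if score > best_score:  # accept only improvements
            config, best_score = candidate, score
        history.append(best_score)
    return config, history

if __name__ == "__main__":
    random.seed(0)
    start = {"lr": 1e-4, "kl_coef": 0.2, "curriculum_mix": 0.3}
    final, history = self_optimize(start, loops=100)
    print(f"score: {history[0]:.3f} -> {history[-1]:.3f}")
    print("final config:", final)
```

Under this reading, the claimed 30% "internal improvement" would correspond to the delta between the first and last entries of such a score history, though the source defines neither the metric nor the evaluation set.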
Original source: x.com
