Step-3.5-Flash: 196B Open-Source MoE Model Activates Only 11B Parameters, Outperforms Kimi K2.5 and Claude Opus 4.5 on Key Benchmarks

Shanghai-based StepFun's Step-3.5-Flash, a 196B parameter sparse mixture-of-experts model that activates only 11B parameters per token, achieves top scores on AIME 2025 (97.3) and LiveCodeBench-V6 (86.4) while costing 18.9x less to run than Kimi K2.5.

Ala SMITH & AI Research Desk · Mar 24, 2026 · 7 min read · AI-Generated

Source: pub.towardsai.net via towards_ai

While mainstream AI discourse focuses on announcements from Anthropic, OpenAI, and Qwen, a Shanghai-based lab called StepFun quietly published a technical report in February 2026 for a model called Step-3.5-Flash. The 196B parameter open-source model achieves state-of-the-art performance on multiple reasoning, coding, and agentic benchmarks while activating only 11B parameters per token during inference—resulting in dramatically lower operational costs compared to competitors.

What StepFun Built: A Sparse Mixture-of-Experts Architecture

Step-3.5-Flash uses a Sparse Mixture-of-Experts (MoE) backbone that decouples total parameter count from per-token inference cost. The model holds 196B parameters in total but activates only 11B per token during generation. A router selects the top 8 of 288 routed experts in each layer, and one shared expert always fires alongside them. The remaining routed experts stay idle for that token.
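The routing step described above can be sketched in a few lines of Python. This is a minimal illustration, not StepFun's implementation: the expert counts (288 routed, top 8, one shared) come from the article, while the function names, the softmax over the selected logits, and the toy scalar experts are assumptions made for the example.

```python
import math
import random

NUM_ROUTED_EXPERTS = 288  # routed experts per layer (from the article)
TOP_K = 8                 # routed experts selected per token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only. That normalization
    is an assumption; the article does not say how weights are computed.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token, router_logits, routed_experts, shared_expert):
    """Add the always-on shared expert to the weighted top-k routed experts."""
    output = shared_expert(token)
    for index, weight in route_token(router_logits):
        output += weight * routed_experts[index](token)
    return output


# Toy usage: scalar "tokens" and hypothetical experts that only scale their input.
rng = random.Random(0)
experts = [lambda x, s=rng.uniform(-1, 1): s * x for _ in range(NUM_ROUTED_EXPERTS)]
logits = [rng.gauss(0, 1) for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(logits)
result = moe_layer(2.0, logits, experts, shared_expert=lambda x: 0.5 * x)
print(len(selected), "routed experts active out of", NUM_ROUTED_EXPERTS)
```

Only the eight selected experts and the shared expert are ever called for a token, which is why compute tracks the active parameter count rather than the total.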

This architectural choice accounts for part of the cost differential: Kimi K2.5 (a 1 trillion parameter model) activates 32B parameters per token, while Step-3.5-Flash activates just 11B, roughly one-third the active parameters per token.
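As a quick check on the figures above, the following snippet computes the two ratios from the parameter counts quoted in the article. The variable names are illustrative only.

```python
TOTAL_PARAMS_B = 196        # Step-3.5-Flash total parameters, in billions
ACTIVE_PARAMS_B = 11        # Step-3.5-Flash active parameters per token
KIMI_ACTIVE_PARAMS_B = 32   # Kimi K2.5 active parameters per token

# Share of Step-3.5-Flash's own weights that participate in each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Step-3.5-Flash active parameters relative to Kimi K2.5's.
relative_to_kimi = ACTIVE_PARAMS_B / KIMI_ACTIVE_PARAMS_B

print(f"Active fraction of own weights: {active_fraction:.1%}")    # 5.6%
print(f"Active parameters vs. Kimi K2.5: {relative_to_kimi:.1%}")  # 34.4%
```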

To handle long contexts efficiently, Step-3.5-Flash implements a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) with Full Attention at a 3:1 ratio. Three SWA layers (with linear cost scaling) process local context for every one full-attention layer (with quadratic scaling). This enables a 256K context window without the typical inference cost explosion associated with full attention across long sequences.
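To make the 3:1 interleaving concrete, the sketch below builds a layer schedule and counts the causal query-key pairs each layer would score. The 3:1 ratio and the 256K context come from the article. The window size (512), the layer count (8), and the choice to place the full-attention layer last in each group of four are hypothetical values picked for illustration.

```python
def layer_schedule(num_layers, swa_per_full=3):
    """Label layers in repeating groups: swa_per_full SWA layers, then one full layer."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def attended_pairs(seq_len, kind, window):
    """Count causal query-key pairs one layer scores over seq_len tokens."""
    if kind == "full" or seq_len <= window:
        # Each token attends to itself and every earlier token: quadratic growth.
        return seq_len * (seq_len + 1) // 2
    # Sliding window: the first `window` tokens ramp up, then every later
    # token attends to exactly `window` positions: linear growth.
    return window * (window + 1) // 2 + (seq_len - window) * window


SEQ_LEN = 256 * 1024  # 256K context (from the article)
WINDOW = 512          # hypothetical window size, not from the article
NUM_LAYERS = 8        # hypothetical depth, not from the article

schedule = layer_schedule(NUM_LAYERS)
hybrid = sum(attended_pairs(SEQ_LEN, kind, WINDOW) for kind in schedule)
all_full = NUM_LAYERS * attended_pairs(SEQ_LEN, "full", WINDOW)

print(schedule)
print(f"hybrid / all-full attention work: {hybrid / all_full:.3f}")
```

Under these assumptions the hybrid stack does a little over a quarter of the attention work of an all-full-attention stack, because at long sequence lengths the full-attention layers dominate and only one layer in four is full.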

Key Benchmark Results: Performance vs. Cost

StepFun's technical report includes comprehensive benchmarking against leading open-source and proprietary models. The results show Step-3.5-Flash leading or competing at the top across multiple categories while maintaining significantly lower inference costs.

Coding Performance

Benchmark | Step-3.5-Flash | Kimi K2.5 | (unlabeled) | (unlabeled)
LiveCodeBench-V6 | 86.4 | 85.0 | 83.3 | N/A
SWE-bench Verified | 74.4 | 76.8 | 73.1 | N/A
Terminal-Bench 2.0 | 51.0 | N/A | N/A | N/A

Step-3.5-Flash achieves the highest score among open-source models on LiveCodeBench-V6 (86.4). It posts 51.0 on Terminal-Bench 2.0, which tests long-horizon command-line agent tasks, though no comparison scores are listed for that benchmark. On SWE-bench Verified, which measures real GitHub issue resolution, it scores a competitive 74.4, close to Kimi K2.5's 76.8.

Mathematical Reasoning

Benchmark | Step-3.5-Flash | (unlabeled) | (unlabeled)
AIME 2025 | 97.3 | N/A | N/A
IMOAnswerBench | 85.4 | N/A | N/A
HMMT 2025 | 96.2 | N/A | N/A

On AIME 2025—considered one of the hardest math competition benchmarks—Step-3.5-Flash scores 97.3, the highest among open-source models in the comparison. The model maintains strong performance across IMOAnswerBench (85.4) and HMMT 2025 (96.2), demonstrating consistent reasoning capability across different problem structures.

Agentic Capabilities

Benchmark | Step-3.5-Flash | (unlabeled) | (unlabeled)
τ²-Bench | 88.2 | 85.4 | 85.2
GAIA | 84.5 | N/A | N/A
ResearchRubrics | 65.3 | N/A | N/A

τ²-Bench measures real-world tool use across web, code, and file environments, where Step-3.5-Flash leads at 88.2. Most notably, on ResearchRubrics—which evaluates long-form deep research quality using a ReAct agent loop—Step-3.5-Flash scores 65.3, outperforming both Gemini DeepResearch (60.7) and OpenAI DeepResearch (60.7).

The Efficiency Advantage: 1.0x vs. 18.9x Cost

The most striking aspect of Step-3.5-Flash is its inference cost relative to performance. According to StepFun's benchmarks conducted at 128K context on Hopper GPUs with MTP-3 inference and EP8 settings:

Step-3.5-Flash: 1.0x cost (baseline)
Kimi K2.5: 18.9x cost
GLM-4.7: 18.9x cost
DeepSeek V3.2: 6.0x cost
MiniMax M2.1: 3.9x cost

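This means Kimi K2.5 costs 18.9 times more per token while scoring lower on several coding and reasoning benchmarks. Part of the efficiency comes from the MoE design activating only 11B of the model's 196B total parameters during inference. The active-parameter gap with Kimi K2.5 (32B vs. 11B, about 2.9x) does not by itself account for an 18.9x difference, so at 128K context the hybrid attention design and the inference setup presumably contribute as well.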
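To see what these multipliers mean for a budget, the snippet below scales a baseline spend by each reported ratio. The multipliers are the ones listed above. The baseline amount and the helper name `workload_cost` are hypothetical, and the result inherits whatever conditions StepFun used for its measurement (128K context, Hopper GPUs).

```python
# Relative inference cost, as reported in the article (Step-3.5-Flash = 1.0).
RELATIVE_COST = {
    "Step-3.5-Flash": 1.0,
    "Kimi K2.5": 18.9,
    "GLM-4.7": 18.9,
    "DeepSeek V3.2": 6.0,
    "MiniMax M2.1": 3.9,
}


def workload_cost(baseline_cost, model):
    """Scale a Step-3.5-Flash baseline cost by the model's reported multiplier."""
    return baseline_cost * RELATIVE_COST[model]


BASELINE_SPEND = 1000.0  # hypothetical spend on Step-3.5-Flash, arbitrary units

# Print models from cheapest to most expensive for the same workload.
for model in sorted(RELATIVE_COST, key=RELATIVE_COST.get):
    print(f"{model:15s} {workload_cost(BASELINE_SPEND, model):>10,.0f}")
```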

How It Works: Technical Implementation Details

The Step-3.5-Flash architecture combines several advanced techniques:

Sparse MoE with Expert Routing: The model uses 288 experts per layer with a top-8 routing strategy plus one always-active shared expert. This creates a total parameter count of 196B while limiting active parameters to 11B per token.
Hybrid Attention for Long Context: By interleaving Sliding Window Attention and Full Attention layers at a 3:1 ratio, the model keeps linear scaling in 75% of its attention layers while preserving global context awareness through the remaining full-attention layers.
Optimized Inference Stack: The technical report cites benchmarks on Hopper GPUs using MTP-3 and EP8. These labels most plausibly refer to multi-token prediction with three prediction heads and expert parallelism across eight devices, rather than a protocol or a numeric precision, though specific implementation details are limited in the available source material.
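If EP8 is read as expert parallelism across eight devices, the 288 routed experts in each layer divide evenly into 36 per device. The sketch below shows one possible placement. The contiguous-block layout and both function names are assumptions for illustration; the source does not describe how StepFun actually shards experts.

```python
NUM_ROUTED_EXPERTS = 288  # routed experts per layer (from the article)
EP_DEGREE = 8             # "EP8" read as 8-way expert parallelism (assumption)


def experts_for_rank(rank, num_experts=NUM_ROUTED_EXPERTS, ep_degree=EP_DEGREE):
    """Return the contiguous block of expert indices held by one device."""
    if num_experts % ep_degree != 0:
        raise ValueError("this sketch assumes experts divide evenly across devices")
    per_rank = num_experts // ep_degree
    return range(rank * per_rank, (rank + 1) * per_rank)


def rank_for_expert(expert_index, num_experts=NUM_ROUTED_EXPERTS, ep_degree=EP_DEGREE):
    """Return the device that holds a given expert under the same layout."""
    return expert_index // (num_experts // ep_degree)


for rank in range(EP_DEGREE):
    block = experts_for_rank(rank)
    print(f"rank {rank}: experts {block.start}-{block.stop - 1}")
```

With this layout a token's eight selected experts may sit on several devices, so the token's activations have to be sent to each of them and the results gathered back.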

Limitations and Deployment Considerations

The source material acknowledges that Step-3.5-Flash has "real limitations—deployment requirements, known stability issues in specific conditions, and areas where proprietary models still pull ahead." Specific limitations mentioned include:

Specialized deployment requirements that may not be trivial to implement
Stability issues under certain conditions (unspecified)
Areas where proprietary models maintain advantages (though these are not detailed in the available source)

gentic.news Analysis

Step-3.5-Flash represents a significant milestone in the efficiency frontier of large language models, demonstrating that parameter count alone is a poor proxy for either capability or cost. The model's architecture—combining sparse MoE with hybrid attention—shows how careful system design can dramatically improve the performance-to-cost ratio. What's particularly notable is that this efficiency doesn't come at the expense of capability: the model competes at or near the top across coding, reasoning, and agentic benchmarks.

The technical approach here validates a growing trend in the industry: the decoupling of knowledge storage (total parameters) from inference computation (active parameters). While MoE architectures are not new, Step-3.5-Flash's implementation appears particularly effective at maintaining strong performance across diverse tasks while minimizing active parameters. This suggests that future model improvements may come less from simply scaling parameter counts and more from architectural innovations that make better use of existing parameters.

For practitioners, the most immediate implication is cost. A model that delivers competitive performance at roughly one-nineteenth the reported inference cost of Kimi K2.5 or GLM-4.7 represents a substantial operational advantage. However, the real test will be independent verification of these benchmarks and real-world deployment experience. The model's reported stability issues in specific conditions warrant caution, and the specialized deployment requirements may limit accessibility for some teams.

Frequently Asked Questions

What is Step-3.5-Flash?

Step-3.5-Flash is a 196B parameter open-source large language model developed by Shanghai-based lab StepFun. It uses a sparse mixture-of-experts (MoE) architecture that activates only 11B parameters per token during inference, making it significantly more cost-efficient than models with similar capabilities.

How does Step-3.5-Flash achieve such low inference costs?

The model achieves low inference costs through its MoE architecture, which stores 196B total parameters but activates only 11B per token. A router selects the top 8 experts from 288 per layer plus one shared expert, while all other parameters remain dormant. This means the computational cost is based on the 11B active parameters rather than the full 196B parameter count.

What benchmarks does Step-3.5-Flash excel at?

According to StepFun's technical report, Step-3.5-Flash achieves top scores among open-source models on AIME 2025 (97.3), LiveCodeBench-V6 (86.4), τ²-Bench (88.2), and GAIA (84.5). It also outperforms both Gemini DeepResearch and OpenAI DeepResearch on ResearchRubrics (65.3 vs. 60.7).

Is Step-3.5-Flash actually better than Claude Opus 4.5?

The source material claims Step-3.5-Flash beats Claude Opus 4.5 on "multiple agentic benchmarks," but it provides no comparative numbers for Claude Opus 4.5. The model does outperform both Gemini DeepResearch and OpenAI DeepResearch on ResearchRubrics, which suggests strong agentic capabilities. A full head-to-head comparison with Claude Opus 4.5 is not available in the source material.

Source: gentic.news · Mar 24, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

Step-3.5-Flash represents a compelling case study in architectural efficiency over brute-force scaling. The model's sparse MoE design demonstrates that total parameter count has become an increasingly misleading metric for both capability and cost. What matters practically is the active parameter count during inference, and StepFun's architecture optimizes precisely for this metric while maintaining competitive performance.

Technically, the hybrid attention mechanism deserves particular attention. By implementing a 3:1 ratio of sliding window to full attention, Step-3.5-Flash addresses one of the fundamental challenges in long-context modeling: the quadratic scaling of attention costs. This approach suggests that future models may increasingly adopt heterogeneous attention strategies rather than uniform architectures, with different layers optimized for different aspects of context processing.

The benchmark results, if independently verified, indicate that the AI community may be approaching diminishing returns on pure parameter scaling. Step-3.5-Flash achieves top-tier results with approximately one-third the active parameters of Kimi K2.5 per token. This efficiency advantage could reshape the economics of AI deployment, particularly for applications requiring high-volume inference. However, the specialized deployment requirements and mentioned stability issues suggest that these efficiency gains come with implementation complexity that may limit adoption outside of well-resourced teams.
