moe

30 articles about moe in AI news

MoEngage Buys Aampe for Tens of Millions, Bets AI Agents Replace Campaigns

MoEngage acquired Aampe for tens of millions to embed per-customer AI agents, targeting migrations from Salesforce and Adobe Marketing Cloud.

Jun 23, 202671% relevant

Chinese Lab's Free MoE Model Matches GPT-5.5 on Agentic Coding

A Chinese lab released an Apache-2.0 open-weights MoE model matching GPT-5.5 on agentic coding. This free model challenges proprietary AI's lead with sparse MoE architecture.

Jun 12, 2026100% relevant

dMoE Cuts Active Experts from 69.5 to 14.6, Retains 99.11% Performance

dMoE reduces active experts from 69.5 to 14.6 in diffusion LLMs, retaining 99.11% performance while cutting memory 80% and speeding inference 1.66×.

Jun 7, 202685% relevant

JetBrains Open-Sources Mellum2: 12B MoE at 2.5B Active Params

JetBrains open-sourced Mellum2, a 12B MoE model with 2.5B active params, trained from scratch for code and reasoning.

Jun 6, 202690% relevant

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

May 19, 202690% relevant

New MoE Framework Tames User Interest Shifts in Long-Sequence Recommendations

Researchers propose MoS, a model-agnostic MoE approach that handles long user sequences by detecting session hopping – where user interests shift across sessions. The theme-aware routing mechanism filters irrelevant sessions, while multi-scale fusion captures global and local patterns. Results show SOTA on benchmarks with fewer FLOPs than alternatives.

Apr 24, 202694% relevant

Fine-Tuning OpenAI's GPT-OSS 20B: A Practitioner's Guide to LoRA on MoE Models

A technical guide details the practical challenges and solutions for fine-tuning OpenAI's 20-billion parameter GPT-OSS model using LoRA. This is crucial for efficiently adapting large, complex MoE models to specific business domains.

Mar 22, 2026100% relevant

Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming

Developer Dan reportedly runs the 209GB Qwen 3.5 397B-A17B MoE model on an M3 Mac at ~5.7 tokens per second using only 5.5GB of active memory by quantizing and streaming weights from SSD.

Mar 18, 202685% relevant

ELDR: Expert-Locality Decode Routing Cuts MoE TPOT by 13.9%

ELDR uses prefill expert signatures to route decode requests, cutting median TPOT by 5.9–13.9% in vLLM at scale.

Jul 2, 202685% relevant

Qwen3.6-27B: How to Run a 17GB Local Model That Beats 397B MoE on Coding Tasks

Qwen3.6-27B delivers flagship-level coding performance in a 55.6GB model that can be quantized to 16.8GB, making high-quality local coding assistance accessible.

Apr 22, 2026100% relevant

NVIDIA Nemotron 3 Super: 120B Hybrid Mamba-Transformer MoE with 1M Context

NVIDIA has released Nemotron 3 Super, a 120B parameter open hybrid Mamba-Transformer Mixture of Experts model with 12B active parameters and 1M token context length. The company claims it delivers up to 7.5x higher throughput than similar open models.

Apr 18, 202695% relevant

Alibaba Qwen3.6-35B-A3B: 3B-Active Sparse MoE Hits 73.4% on SWE-Bench

Alibaba released Qwen3.6-35B-A3B, a sparse mixture-of-experts model with 35B total but only 3B active parameters. It shows significant gains over its predecessor, scoring 73.4% on SWE-bench Verified and beating Claude 3.5 Sonnet on several vision tasks.

Apr 16, 202697% relevant

Cursor AI Claims 1.84x Faster MoE Inference on NVIDIA Blackwell GPUs

Cursor AI announced a rebuilt inference engine for Mixture-of-Experts models on NVIDIA's new Blackwell GPUs, resulting in a claimed 1.84x speedup and improved output accuracy.

Apr 6, 202685% relevant

Stanford Releases Free LLM & Transformer Cheatsheets Covering LoRA, RAG, MoE

Stanford University has released a free, open-source collection of cheatsheets covering core LLM concepts from self-attention to RAG and LoRA. This provides a consolidated technical reference for engineers and researchers.

Apr 6, 202691% relevant

Google Releases Gemma 4 Family Under Apache 2.0, Featuring 2B to 31B Models with MoE and Multimodal Capabilities

Google has released the Gemma 4 family of open-weight models, derived from Gemini 3 technology. The four models, ranging from 2B to 31B parameters and including a Mixture-of-Experts variant, are available under a permissive Apache 2.0 license and feature multimodal processing.

Apr 2, 2026100% relevant

Kimi 2.5's 1T Parameter MoE Model Runs on 96GB Mac Hardware via SSD Streaming

Developers have demonstrated that Kimi 2.5's 1 trillion parameter Mixture-of-Experts model can run on Mac hardware with just 96GB RAM by streaming expert weights from SSD, with only 32B parameters active per token.

Mar 24, 202685% relevant

Step-3.5-Flash: 196B Open-Source MoE Model Activates Only 11B Parameters, Outperforms Kimi K2.5 and Claude Opus 4.5 on Key Benchmarks

Shanghai-based StepFun's Step-3.5-Flash, a 196B parameter sparse mixture-of-experts model that activates only 11B parameters per token, achieves top scores on AIME 2025 (97.3) and LiveCodeBench-V6 (86.4) while costing 18.9x less to run than Kimi K2.5.

Mar 24, 202695% relevant

NVIDIA Releases Nemotron-Cascade 2: A 30B MoE Model with 3B Active Parameters

NVIDIA has open-sourced Nemotron-Cascade 2, a 30B parameter Mixture-of-Experts model that activates only 3B parameters per token. It claims 'gold medal performance' on IMO and IOI 2025 benchmarks.

Mar 20, 202695% relevant

The Hidden Cost of Mixture-of-Experts: New Research Reveals Why MoE Models Struggle at Inference

A groundbreaking paper introduces the 'qs inequality,' revealing how Mixture-of-Experts architectures suffer a 'double penalty' during inference that can make them 4.5x slower than dense models. The research shows training efficiency doesn't translate to inference performance, especially with long contexts.

Mar 11, 202675% relevant

Beyond Homogenization: How Expert Divergence Learning Unlocks MoE's True Potential

Researchers have developed Expert Divergence Learning, a novel pre-training strategy that combats expert homogenization in Mixture-of-Experts language models. By encouraging functional specialization through domain-aware routing, the method improves performance across benchmarks with minimal computational overhead.

Mar 3, 202675% relevant

Google Lyria 3 Pro Music AI Demoed: Generates '1990s Boy Band' Version of Rilke Poetry

A researcher gained early access to Google's Lyria 3 Pro music generation AI, demonstrating its ability to transform Rainer Maria Rilke's 'First Elegy' into a 1990s boy band track. The demo highlights rapid stylistic remixing capabilities not yet publicly available.

Mar 25, 202685% relevant

Nemotron 3 Ultra matches GPT-5.5 on physics test at 10X lower cost

Nemotron 3 Ultra matched GPT-5.5 on a physics test at 10X lower cost ($0.051 vs $0.57), highlighting MoE efficiency.

Jun 5, 202685% relevant

Microsoft Unveils MAI-Thinking-1: 35B Active, 1T Parameters, 97% on AIME 2025

Microsoft's MAI-Thinking-1 hits 97% on AIME 2025 with 35B active params in a 1T MoE model, trained on 30T human tokens without distillation.

Jun 2, 2026100% relevant

Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell

Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.

May 6, 202687% relevant

MIT Hackathon Team Builds Wearable AI for Physical Movement Guidance

MIT hackathon team builds wearable AI for real-time physical movement guidance via sensors and on-device inference, demoed by @kimmonismus.

May 5, 202677% relevant

Expert Pyramid Tuning: A New Parameter-Efficient Fine-Tuning Architecture for Multi-Task LLMs

Researchers propose Expert Pyramid Tuning (EPT), a novel PEFT method that uses multi-scale feature pyramids to better handle tasks of varying complexity. It outperforms existing MoE-LoRA variants while using fewer parameters, offering more efficient multi-task LLM deployment.

Mar 16, 202679% relevant

NVIDIA's Nemotron 3 Super: The Efficiency-First AI Model Redefining Performance Benchmarks

NVIDIA unveils Nemotron 3 Super, a 120B parameter model with only 12B active parameters using hybrid Mamba-Transformer MoE architecture. It achieves 1M token context, beats GPT-OSS-120B on intelligence metrics, and offers configurable reasoning modes for optimal compute efficiency.

Mar 11, 202695% relevant

Qwen's 9B Base Model Breaks Language Barriers with 1M Context Window

Alibaba's Qwen team has released Qwen3.5-9B-Base, a multimodal foundation model supporting 201 languages with a massive 1 million token context window. The model features a hybrid DeltaNet-MoE architecture designed for efficient inference.

Mar 5, 202695% relevant

Liquid Cooling Hits 15kW: CoolIT Coldplate Quadruples Capacity for AI

CoolIT demoed a 15kW single-phase coldplate, quadrupling capacity, while Vertiv, Accelsius, and LiquidStack launched products targeting scalable AI cooling deployment.

Jun 5, 202660% relevant

NVIDIA Blackwell Cuts DeepSeek V4 Token Costs 5x in One Month

NVIDIA claims Blackwell inference stack cut DeepSeek V4 token costs 5x in one month, per a newly published report shared by @rohanpaul_ai.

Jun 30, 2026100% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety