gentic.news — AI News Intelligence Platform

moe

30 articles about moe in AI news

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

90% relevant

New MoE Framework Tames User Interest Shifts in Long-Sequence Recommendations

Researchers propose MoS, a model-agnostic MoE approach that handles long user sequences by detecting session hopping, where user interests shift across sessions. The theme-aware routing mechanism filters irrelevant sessions, while multi-scale fusion captures global and local patterns. The reported results are state of the art on benchmarks, with fewer FLOPs than alternatives.

94% relevant

Fine-Tuning OpenAI's GPT-OSS 20B: A Practitioner's Guide to LoRA on MoE Models

A technical guide details the practical challenges of, and solutions for, fine-tuning OpenAI's 20-billion-parameter GPT-OSS model with LoRA. The guide is aimed at efficiently adapting large, complex MoE models to specific business domains.

100% relevant

Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming

Developer Dan reportedly runs the 209GB Qwen 3.5 397B-A17B MoE model on an M3 Mac at ~5.7 tokens per second using only 5.5GB of active memory by quantizing and streaming weights from SSD.

85% relevant

Qwen3.6-27B: How to Run a 17GB Local Model That Beats 397B MoE on Coding Tasks

Qwen3.6-27B delivers flagship-level coding performance in a 55.6GB model that can be quantized to 16.8GB, making high-quality local coding assistance accessible.

100% relevant

NVIDIA Nemotron 3 Super: 120B Hybrid Mamba-Transformer MoE with 1M Context

NVIDIA has released Nemotron 3 Super, a 120B parameter open hybrid Mamba-Transformer Mixture of Experts model with 12B active parameters and 1M token context length. The company claims it delivers up to 7.5x higher throughput than similar open models.

95% relevant

Alibaba Qwen3.6-35B-A3B: 3B-Active Sparse MoE Hits 73.4% on SWE-Bench

Alibaba released Qwen3.6-35B-A3B, a sparse mixture-of-experts model with 35B total but only 3B active parameters. It shows significant gains over its predecessor, scoring 73.4% on SWE-bench Verified and beating Claude 3.5 Sonnet on several vision tasks.

97% relevant

Cursor AI Claims 1.84x Faster MoE Inference on NVIDIA Blackwell GPUs

Cursor AI announced a rebuilt inference engine for Mixture-of-Experts models on NVIDIA's new Blackwell GPUs, resulting in a claimed 1.84x speedup and improved output accuracy.

85% relevant

Stanford Releases Free LLM & Transformer Cheatsheets Covering LoRA, RAG, MoE

Stanford University has released a free, open-source collection of cheatsheets covering core LLM concepts from self-attention to RAG and LoRA. This provides a consolidated technical reference for engineers and researchers.

91% relevant

Google Releases Gemma 4 Family Under Apache 2.0, Featuring 2B to 31B Models with MoE and Multimodal Capabilities

Google has released the Gemma 4 family of open-weight models, derived from Gemini 3 technology. The four models, ranging from 2B to 31B parameters and including a Mixture-of-Experts variant, are available under a permissive Apache 2.0 license and feature multimodal processing.

100% relevant

Kimi 2.5's 1T Parameter MoE Model Runs on 96GB Mac Hardware via SSD Streaming

Developers have demonstrated that Kimi 2.5's 1-trillion-parameter Mixture-of-Experts model can run on Mac hardware with just 96GB of RAM by streaming expert weights from SSD, with only 32B parameters active per token.

85% relevant

Step-3.5-Flash: 196B Open-Source MoE Model Activates Only 11B Parameters, Outperforms Kimi K2.5 and Claude Opus 4.5 on Key Benchmarks

Shanghai-based StepFun's Step-3.5-Flash, a 196B parameter sparse mixture-of-experts model that activates only 11B parameters per token, achieves top scores on AIME 2025 (97.3) and LiveCodeBench-V6 (86.4) while costing 18.9x less to run than Kimi K2.5.

95% relevant

NVIDIA Releases Nemotron-Cascade 2: A 30B MoE Model with 3B Active Parameters

NVIDIA has open-sourced Nemotron-Cascade 2, a 30B parameter Mixture-of-Experts model that activates only 3B parameters per token. It claims 'gold medal performance' on IMO and IOI 2025 benchmarks.

95% relevant

The Hidden Cost of Mixture-of-Experts: New Research Reveals Why MoE Models Struggle at Inference

A new paper introduces the 'qs inequality,' describing how Mixture-of-Experts architectures suffer a 'double penalty' during inference that can make them 4.5x slower than dense models. The research shows that training efficiency doesn't translate to inference performance, especially with long contexts.

75% relevant

Beyond Homogenization: How Expert Divergence Learning Unlocks MoE's True Potential

Researchers have developed Expert Divergence Learning, a novel pre-training strategy that combats expert homogenization in Mixture-of-Experts language models. By encouraging functional specialization through domain-aware routing, the method improves performance across benchmarks with minimal computational overhead.

75% relevant

Google Lyria 3 Pro Music AI Demoed: Generates '1990s Boy Band' Version of Rilke Poetry

A researcher gained early access to Google's Lyria 3 Pro music generation AI, demonstrating its ability to transform Rainer Maria Rilke's 'First Elegy' into a 1990s boy band track. The demo highlights rapid stylistic remixing capabilities not yet publicly available.

85% relevant

Palantir's AI Platform Demoed by US DoD Director, Showcasing Real-Time Military Analysis

The US Department of Defense's Director of AI demonstrated Palantir's AI system, highlighting real-time analysis capabilities that contribute to the company's surging valuation.

85% relevant

Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell

Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.

87% relevant

MIT Hackathon Team Builds Wearable AI for Physical Movement Guidance

An MIT hackathon team built a wearable AI that gives real-time physical movement guidance using sensors and on-device inference, demoed by @kimmonismus.

77% relevant

Expert Pyramid Tuning: A New Parameter-Efficient Fine-Tuning Architecture for Multi-Task LLMs

Researchers propose Expert Pyramid Tuning (EPT), a novel PEFT method that uses multi-scale feature pyramids to better handle tasks of varying complexity. It outperforms existing MoE-LoRA variants while using fewer parameters, offering more efficient multi-task LLM deployment.

79% relevant

NVIDIA's Nemotron 3 Super: The Efficiency-First AI Model Redefining Performance Benchmarks

NVIDIA unveils Nemotron 3 Super, a 120B parameter model with only 12B active parameters that uses a hybrid Mamba-Transformer MoE architecture. It achieves a 1M token context, beats GPT-OSS-120B on intelligence metrics, and offers configurable reasoning modes for optimal compute efficiency.

95% relevant

Qwen's 9B Base Model Breaks Language Barriers with 1M Context Window

Alibaba's Qwen team has released Qwen3.5-9B-Base, a multimodal foundation model supporting 201 languages with a massive 1 million token context window. The model features a hybrid DeltaNet-MoE architecture designed for efficient inference.

95% relevant

Claude Code /goal Uses Haiku Evaluator, Runs Unattended Until Condition Met

Claude Code /goal runs unattended until a condition is met, using a Haiku evaluator. Agent View manages multiple background sessions. Requires v2.1.139.

90% relevant

Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test

A cascaded LLM framework for e-commerce storefront generation lifted cart adds by +2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.

100% relevant

30B-A3B Reasoning Model Hits Gold Medal on Physics, Math Olympiads

A 30B-A3B reasoning model from @stingning achieves gold-medal-level results on physics and math Olympiads and has been released on Hugging Face.

85% relevant

Anthropic Leases xAI's Colossus 1 After Mixed-Architecture Flaw Blocked

Anthropic leased xAI's 220K-GPU Colossus 1 after the cluster's mixed architecture failed to train Grok. Musk is building a Blackwell-only Colossus 2 for training and an IPO.

100% relevant

Google TPU 'Broadfly' Topology Scales Pod to 1,152 Chips

Google unveiled a Broadfly TPU topology at Cloud Next, scaling pods to 1,152 chips, 4.5x larger than Ironwood, with a maximum of 7 hops. This inference-first design challenges NVIDIA's NVLink on scale and latency.

94% relevant

MiniMax M2.7 Hits 400 TPS on SambaNova Hardware

MiniMax M2.7 reaches 400 TPS on SambaNova hardware, making latency imperceptible. Model size and batch size were not disclosed.

75% relevant

Mistral Medium Model Launch Teased by European AI Company

Mistral AI teased an upcoming model called Mistral Medium on X, signaling continued expansion of its model lineup. The announcement comes amid growing competition in the open-weight LLM space.

86% relevant

Pyptx: Write Nvidia PTX Kernels in Python for Hopper and Blackwell

Pyptx lets developers write and launch hand-tuned Nvidia PTX kernels directly from Python, supporting Hopper (sm_90a) and Blackwell (sm_100a). It provides explicit control over registers, shared memory, and advanced features like WGMMA and TMA, with dispatch through JAX, PyTorch eager, and torch.compile.

91% relevant