autoregressive models
30 articles about autoregressive models in AI news
Google Open-Sources DiffusionGemma, 26B Model Hits 1K Tokens/Sec on H100
Google open-sourced DiffusionGemma, a 26B-parameter diffusion text model hitting 1,000 tokens/sec on H100 — 4x faster than autoregressive models, but with lower quality.
Luma Labs Launches Uni-1: An Autoregressive Transformer for Image Generation with a Pre-Generation Reasoning Phase
Luma Labs has released Uni-1, a foundational image model that uses an autoregressive transformer to reason about user intent before generating pixels. It aims to address the 'intent gap' common in diffusion models by adding a structured reasoning step.
Mercury 2: The End of Autoregressive Thinking in AI Reasoning
Mercury 2 represents a paradigm shift in AI reasoning architecture, moving beyond traditional autoregressive generation to create native reasoning models that process information simultaneously rather than sequentially.
Tencent Hunyuan GEAR: 10× Faster Autoregressive Image Gen
Tencent Hunyuan's GEAR jointly trains VQ tokenizers and AR generators end-to-end, achieving 10× faster autoregressive image generation while outperforming LlamaGen-REPA.
MinerU-Diffusion: A 2.5B Parameter Diffusion Model for OCR Achieves 3.2x Speedup Over Autoregressive Methods
Researchers introduced MinerU-Diffusion, a 2.5B parameter diffusion model for OCR that replaces autoregressive decoding with parallel block-wise diffusion. It achieves up to 3.2x faster inference while improving robustness on complex documents with tables and formulas.
Evo LLM Unifies Autoregressive and Diffusion AI, Achieving New Balance in Language Generation
Researchers introduce Evo, a novel large language model architecture that bridges autoregressive and diffusion-based text generation. By treating language creation as a continuous evolutionary flow, Evo adaptively balances confident refinement with exploratory planning, achieving state-of-the-art results across 15 benchmarks while maintaining fast inference speeds.
PSAD: A New Framework for Efficient Personalized Reranking in Recommender Systems
Researchers propose PSAD, a novel reranking framework using semi-autoregressive generation and online knowledge distillation to balance ranking quality with low-latency inference. It addresses key deployment challenges for generative reranking models in production systems.
ByteDance GenLIP: ViT Predicts Language Tokens Directly with 8B Samples
ByteDance's GenLIP trains ViTs to predict language tokens directly with a single autoregressive objective, outperforming baselines on 8B samples.
MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes
Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.
dLLM Framework Unifies Diffusion Language Models, Opening New Frontiers in AI Text Generation
Researchers have introduced dLLM, a unified framework that standardizes training, inference, and evaluation for diffusion language models. This breakthrough enables conversion of existing models like BERT into diffusion architectures and facilitates reproduction of cutting-edge models like LLaDA and Dream.
Survey Paper 'The Latent Space' Maps Evolution from Token Generation to Latent Computation in Language Models
Researchers have published a comprehensive survey charting the evolution of language model architectures from token-level autoregression to methods that perform computation in continuous latent spaces. This work provides a unified framework for understanding recent advances in reasoning, planning, and long-context modeling.
ByteDance iLLaDA: 8B Diffusion LM Matches Qwen2.5 Base, Lags on Instruct
ByteDance iLLaDA, an 8B diffusion LM trained on 12T tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails 10 points after instruction tuning, revealing the alignment gap for diffusion models.
UniRec: A New Generative Recommendation Model Bridges the 'Expressive Gap'
A new paper introduces UniRec, a generative recommendation model that closes the performance gap with traditional discriminative models by prefixing item sequences with structured attributes like category and brand. It achieved a +22.6% improvement in offline metrics and significant online gains in CTR and GMV when deployed on Shopee.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
OpenVoice v2: Complete Voice Cloning Directory Launches on GitHub
A developer has compiled and released a comprehensive directory of open-source voice cloning tools and resources on GitHub. This centralizes access to models, datasets, and training code, lowering the barrier to entry for AI audio development.
Kuaishou's Dual-Rerank: A New Industrial Framework for High-Stakes
Researchers from Kuaishou introduce Dual-Rerank, a framework designed for industrial-scale generative reranking. It addresses the dual dilemma of structural trade-offs (AR vs. NAR models) and optimization gaps (SL vs. RL) through Sequential Knowledge Distillation and List-wise Decoupled Reranking Optimization. A/B tests on production traffic show significant improvements in user satisfaction and watch time with reduced latency.
Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x
A new Microsoft paper shows language models can learn to compress their reasoning steps on-the-fly, slashing memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information in KV cache after explicit reasoning is erased.
Microsoft Open-Sources VALL-E 2: A Zero-Shot TTS Model Achieving Human Parity in Speech Naturalness
Microsoft Research has open-sourced VALL-E 2, a neural codec language model for text-to-speech that achieves human parity in naturalness. It uses a novel 'Repetition-Aware Sampling' method to eliminate word repetition, a common failure mode in prior models.
GLM-5.1 Released by Zhipu AI, Claiming Performance Close to GPT-4o and Claude 3.5
Zhipu AI has released GLM-5.1, its latest large language model series. The company claims its top-tier model, GLM-5.1-9B/1M, achieves performance close to GPT-4o and Claude 3.5 Sonnet, narrowing the gap with leading Western models.
Luma AI Launches Uni-1, a Unified Image Model Priced at $0.09 per 2K Image, Challenging Google Nano Banana
Luma AI released Uni-1, a single transformer model for image understanding and generation. It ranks first in human preference tests for style/editing and reference tasks, and is priced lower than Google's Nano Banana models.
OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency
Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.
New Research Diagnoses LLMs' Struggle with Multiple Knowledge Updates in Context
A new arXiv paper reveals a persistent bias in LLMs when facts are updated multiple times within a long context. Models increasingly favor the earliest version, failing to track the latest state—a critical flaw for dynamic knowledge tasks.
Sam Altman Envisions AI That Thinks for Days: The Dawn of Super-Long-Term Reasoning
OpenAI CEO Sam Altman predicts future AI models will perform "super long-term reasoning," spending days or weeks analyzing complex, high-stakes problems. This represents a fundamental shift from today's rapid-response systems toward deliberate, extended cognitive processes.
CausalTimePrior: The Missing Link for AI That Understands Time and Cause
Researchers have introduced CausalTimePrior, a new framework to generate synthetic time series data with known interventions. This breakthrough addresses a critical gap in training AI models to understand causality over time, paving the way for foundation models in time series analysis.
Beyond Words: Fei-Fei Li Joins Growing Chorus Questioning LLMs' World Understanding
AI pioneer Dr. Fei-Fei Li highlights a fundamental limitation of Large Language Models, arguing they lack true understanding of the physical world because they are trained solely on language, a 'purely generated signal.' Her critique aligns with Yann LeCun's vision for more grounded, embodied AI.
Diffusion Architecture Breaks Speed Barrier: Inception's Mercury 2 Hits 1,000 Tokens/Second
Inception's Mercury 2 achieves unprecedented text generation speeds of 1,000 tokens per second using diffusion architecture borrowed from image AI. This represents a 10x speed advantage over leading models like Claude 4.5 Haiku and GPT-5 Mini without requiring custom hardware.
MultiHashFormer Brings Hash-Based Autoregression to Causal LMs
MultiHashFormer brings hash-based autoregression to causal LMs, slashing embedding memory and outperforming standard Transformers from 100M to 3B parameters.
BioMatrix: A single decoder reads proteins, molecules, language on 304B tokens
BioMatrix, a decoder-only biological foundation model, achieves SOTA on 77 of 80 tasks after training on 304B tokens of sequences, structures, and language.
Pareto LoRA Boosts Image Quality 44.9% vs Vanilla LoRA on Emu2
Pareto LoRA reformulates multimodal instruction tuning as bi-objective optimization, achieving up to 44.9% image quality gains on Emu2 while maintaining text performance.
llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU
llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.