gentic.news — AI News Intelligence Platform

autoregressive models

30 articles about autoregressive models in AI news

Google Open-Sources DiffusionGemma, 26B Model Hits 1K Tokens/Sec on H100

Google open-sourced DiffusionGemma, a 26B-parameter diffusion text model that reaches 1,000 tokens/sec on an H100, 4x faster than autoregressive models but with lower quality.

100% relevant

Luma Labs Launches Uni-1: An Autoregressive Transformer for Image Generation with a Pre-Generation Reasoning Phase

Luma Labs has released Uni-1, a foundational image model that uses an autoregressive transformer to reason about user intent before generating pixels. It aims to address the 'intent gap' common in diffusion models by adding a structured reasoning step.

88% relevant

Mercury 2: The End of Autoregressive Thinking in AI Reasoning

Mercury 2 represents a paradigm shift in AI reasoning architecture, moving beyond traditional autoregressive generation to create native reasoning models that process information simultaneously rather than sequentially.

85% relevant

Tencent Hunyuan GEAR: 10× Faster Autoregressive Image Gen

Tencent Hunyuan's GEAR jointly trains VQ tokenizers and AR generators end-to-end, achieving 10× faster autoregressive image generation while outperforming LlamaGen-REPA.

85% relevant

MinerU-Diffusion: A 2.5B Parameter Diffusion Model for OCR Achieves 3.2x Speedup Over Autoregressive Methods

Researchers introduced MinerU-Diffusion, a 2.5B parameter diffusion model for OCR that replaces autoregressive decoding with parallel block-wise diffusion. It achieves up to 3.2x faster inference while improving robustness on complex documents with tables and formulas.

85% relevant

Evo LLM Unifies Autoregressive and Diffusion AI, Achieving New Balance in Language Generation

Researchers introduce Evo, a novel large language model architecture that bridges autoregressive and diffusion-based text generation. By treating language creation as a continuous evolutionary flow, Evo adaptively balances confident refinement with exploratory planning, achieving state-of-the-art results across 15 benchmarks while maintaining fast inference speeds.

75% relevant

PSAD: A New Framework for Efficient Personalized Reranking in Recommender Systems

Researchers propose PSAD, a novel reranking framework using semi-autoregressive generation and online knowledge distillation to balance ranking quality with low-latency inference. It addresses key deployment challenges for generative reranking models in production systems.

85% relevant

ByteDance GenLIP: ViT Predicts Language Tokens Directly with 8B Samples

ByteDance's GenLIP trains ViTs to predict language tokens directly with a single autoregressive objective, outperforming baselines on 8B samples.

85% relevant

MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes

Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.

85% relevant

dLLM Framework Unifies Diffusion Language Models, Opening New Frontiers in AI Text Generation

Researchers have introduced dLLM, a unified framework that standardizes training, inference, and evaluation for diffusion language models. This breakthrough enables conversion of existing models like BERT into diffusion architectures and facilitates reproduction of cutting-edge models like LLaDA and Dream.

85% relevant

Survey Paper 'The Latent Space' Maps Evolution from Token Generation to Latent Computation in Language Models

Researchers have published a comprehensive survey charting the evolution of language model architectures from token-level autoregression to methods that perform computation in continuous latent spaces. This work provides a unified framework for understanding recent advances in reasoning, planning, and long-context modeling.

85% relevant

ByteDance iLLaDA: 8B Diffusion LM Matches Qwen2.5 Base, Lags on Instruct

ByteDance's iLLaDA, an 8B diffusion LM trained on 12T tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails by 10 points after instruction tuning, revealing an alignment gap for diffusion models.

93% relevant

UniRec: A New Generative Recommendation Model Bridges the 'Expressive Gap'

A new paper introduces UniRec, a generative recommendation model that closes the performance gap with traditional discriminative models by prefixing item sequences with structured attributes like category and brand. It achieved a +22.6% improvement in offline metrics and significant online gains in CTR and GMV when deployed on Shopee.

94% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

85% relevant

OpenVoice v2: Complete Voice Cloning Directory Launches on GitHub

A developer has compiled and released a comprehensive directory of open-source voice cloning tools and resources on GitHub. This centralizes access to models, datasets, and training code, lowering the barrier to entry for AI audio development.

85% relevant

Kuaishou's Dual-Rerank: A New Industrial Framework for High-Stakes

Researchers from Kuaishou introduce Dual-Rerank, a framework designed for industrial-scale generative reranking. It addresses the dual dilemma of structural trade-offs (AR vs. NAR models) and optimization gaps (SL vs. RL) through Sequential Knowledge Distillation and List-wise Decoupled Reranking Optimization. A/B tests on production traffic show significant improvements in user satisfaction and watch time with reduced latency.

82% relevant

Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x

A new Microsoft paper shows language models can learn to compress their reasoning steps on the fly, cutting memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information that remains in the KV cache after the explicit reasoning is erased.

95% relevant

Microsoft Open-Sources VALL-E 2: A Zero-Shot TTS Model Achieving Human Parity in Speech Naturalness

Microsoft Research has open-sourced VALL-E 2, a neural codec language model for text-to-speech that achieves human parity in naturalness. It uses a novel 'Repetition-Aware Sampling' method to eliminate word repetition, a common failure mode in prior models.

95% relevant

GLM-5.1 Released by Zhipu AI, Claiming Performance Close to GPT-4o and Claude 3.5

Zhipu AI has released GLM-5.1, its latest large language model series. The company claims its top-tier model, GLM-5.1-9B/1M, achieves performance close to GPT-4o and Claude 3.5 Sonnet, narrowing the gap with leading Western models.

85% relevant

Luma AI Launches Uni-1, a Unified Image Model Priced at $0.09 per 2K Image, Challenging Google Nano Banana

Luma AI released Uni-1, a single transformer model for image understanding and generation. It ranks first in human preference tests for style/editing and reference tasks, and is priced lower than Google's Nano Banana models.

95% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.

92% relevant

New Research Diagnoses LLMs' Struggle with Multiple Knowledge Updates in Context

A new arXiv paper reveals a persistent bias in LLMs when facts are updated multiple times within a long context. Models increasingly favor the earliest version and fail to track the latest state, a critical flaw for dynamic knowledge tasks.

78% relevant

Sam Altman Envisions AI That Thinks for Days: The Dawn of Super-Long-Term Reasoning

OpenAI CEO Sam Altman predicts future AI models will perform "super long-term reasoning," spending days or weeks analyzing complex, high-stakes problems. This represents a fundamental shift from today's rapid-response systems toward deliberate, extended cognitive processes.

85% relevant

CausalTimePrior: The Missing Link for AI That Understands Time and Cause

Researchers have introduced CausalTimePrior, a new framework to generate synthetic time series data with known interventions. This breakthrough addresses a critical gap in training AI models to understand causality over time, paving the way for foundation models in time series analysis.

95% relevant

Beyond Words: Fei-Fei Li Joins Growing Chorus Questioning LLMs' World Understanding

AI pioneer Dr. Fei-Fei Li highlights a fundamental limitation of Large Language Models, arguing they lack true understanding of the physical world because they are trained solely on language, a 'purely generated signal.' Her critique aligns with Yann LeCun's vision for more grounded, embodied AI.

85% relevant

Diffusion Architecture Breaks Speed Barrier: Inception's Mercury 2 Hits 1,000 Tokens/Second

Inception's Mercury 2 achieves text generation speeds of 1,000 tokens per second using a diffusion architecture borrowed from image AI. This represents a 10x speed advantage over leading models like Claude 4.5 Haiku and GPT-5 Mini without requiring custom hardware.

95% relevant

MultiHashFormer Brings Hash-Based Autoregression to Causal LMs

MultiHashFormer brings hash-based autoregression to causal LMs, slashing embedding memory and outperforming standard Transformers from 100M to 3B parameters.

85% relevant

BioMatrix: A single decoder reads proteins, molecules, language on 304B tokens

BioMatrix, a decoder-only biological foundation model, achieves SOTA on 77 of 80 tasks after training on 304B tokens of sequences, structures, and language.

95% relevant

Pareto LoRA Boosts Image Quality 44.9% vs Vanilla LoRA on Emu2

Pareto LoRA reformulates multimodal instruction tuning as bi-objective optimization, achieving up to 44.9% image quality gains on Emu2 while maintaining text performance.

90% relevant

llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU

llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.

84% relevant