latency
30 articles about latency in AI news
Wan-Streamer v0.1 Cuts Audio-Visual Interaction Latency to 200ms in Single
Wan-Streamer v0.1 achieves 200ms model-side latency in a single Transformer for full-duplex audio-visual interaction, eliminating cascaded modules. The paper lacks parameter count and benchmark comparisons, limiting reproducibility.
llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU
llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.
Miso One: 8B Open-Source TTS Hits 110ms Latency, Real Emotion
Miso One, an 8B open-source TTS model, achieves 110ms latency with emotional range. Weights are fully open-source for self-hosting, but no benchmark data is provided.
vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster
vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.
Codex Update Cuts GUI Workflow Latency 42%
Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.
Google Virgo Fabric: 100K-Accelerator AI Network Cuts Latency
Google unveiled Virgo, a data center fabric for AI clusters of 100,000+ accelerators, using a flatter two-layer topology to reduce latency and improve bisection bandwidth for synchronized training workloads.
OpenCLAW-P2P v6.0 Cuts Paper Lookup Latency to <50ms
OpenCLAW-P2P v6.0 introduces a multi-layer persistence architecture and live reference verification, reducing paper retrieval latency from >3s to <50ms and operating with 14 autonomous agents that scored 50+ papers.
EventChat Study: LLM-Driven Conversational Recommenders Show Promise but Face Cost & Latency Hurdles for SMEs
A new study details the real-world implementation and user evaluation of an LLM-driven conversational recommender system (CRS) for an SME. Results show 85.5% recommendation accuracy but highlight critical business viability challenges: a median cost of $0.04 per interaction and 5.7s latency.
OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency
Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.
Quantized Inference Breakthrough for Next-Gen Recommender Systems: OneRec-V2 Achieves 49% Latency Reduction with FP8
New research shows FP8 quantization can dramatically speed up modern generative recommender systems like OneRec-V2, achieving 49% lower latency and 92% higher throughput with no quality loss. This breakthrough bridges the gap between LLM optimization techniques and industrial recommendation workloads.
WSL 3 Preview: Cut Claude Code's Local Inference Latency on Windows
WSL 3 preview delivers near-native GPU/NPU for Claude Code + Ollama on Copilot+ laptops, but WSL 2 still handles NVIDIA CUDA fine for desktop users.
VIAVI Ships First Ultra Ethernet Validation Tool for AI Data Centers
VIAVI launched the first Ultra Ethernet validation tool for AI data centers, supporting 800GE/1.6TE links. The tool enables certification of low-latency, lossless transport critical for distributed AI training.
OpenAI 'Bidi' Voice Mode Demo Leaks: Real-Time Interruption
Leaked demo shows OpenAI 'bidi' voice mode handling interruptions with sub-320ms latency. No official release date or pricing announced.
Gemini 3.5 Live Translate Debuts as Real-Time Audio Model
Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.
Google Releases Magenta RealTime 2 for Open-Weight Music Generation
Google released Magenta RealTime 2 on Hugging Face, the only open-weights model for real-time continuous music generation on device with ~200ms latency.
Google Gemma 4 12B: Encoder-Free Multimodal Model Launches
Google launched Gemma 4 12B, an encoder-free multimodal model for on-device AI, reducing latency by eliminating the vision encoder.
Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test
A cascaded LLM framework for e-commerce storefront generation lifted cart adds by +2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.
Conductor vs Claude Code: Pinned Versions Split the Community
Ask HN asks if Conductor's single-agent matches native Claude Code. Pinned versions create a stability-vs-latency trade-off.
Google TPU 'Broadfly' Topology Scales Pod to 1,152 Chips
Google unveiled a Broadfly TPU topology at Cloud Next, scaling pods to 1,152 chips — 4.5x larger than Ironwood — with max 7 hops. This inference-first design challenges NVIDIA's NVLink on scale and latency.
Meshwatch GNN Stack Ships Fraud Detection with 17.2% Lift over XGBoost
Meshwatch GNN fraud stack achieves 17.2% recall lift over XGBoost at sub-50ms latency, shipping a custom GraphSAGE variant with online neighbor sampling.
12-Metric Agent Eval Framework From 100+ Deployments Hits Production
12-metric evaluation framework for production AI agents from 100+ deployments targets task success, cost, latency, tool use, and safety.
Codex 'Ultra-Fast' Mode Spotted in Leaked Screenshot
Leaked screenshot suggests OpenAI is adding an ultra-fast latency mode to Codex. No release date or pricing confirmed.
Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?
Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.
MiniMax M2.7 Hits 400 TPS on SambaNova Hardware
MiniMax M2.7 reaches 400 TPS on SambaNova hardware, making latency imperceptible. Details on model size and batch size undisclosed.
AllenAI's MolmoAct2: 720-Hour Bimanual Dataset, Beats GPT-5 on Robotics
AllenAI released MolmoAct2, an open robotics model with a 720-hour bimanual dataset, beating GPT-5 and Gemini Robotics on success rate (89.4% vs 82.1%) with 40% lower latency.
How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute
LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.
Continuous Semantic Caching
Researchers propose a theory-grounded semantic caching system that treats user queries as points in a continuous embedding space, using dynamic ε-net discretization and kernel ridge regression to cut inference costs and latency without switching overhead.
PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100
PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.
CS3: A New Framework to Boost Two-Tower Recommenders Without Slowing Them Down
Researchers propose CS3, a plug-and-play framework that strengthens the ubiquitous two-tower recommendation architecture. It uses three novel mechanisms to improve model alignment and knowledge transfer, delivering significant revenue gains in a live ad system while maintaining millisecond latency.
Xiaomi's OneVL Uses Latent CoT to Beat Explicit CoT in Autonomous Driving
Xiaomi's Embodied Intelligence Team released OneVL, a vision-language model using latent Chain-of-Thought reasoning. It achieves state-of-the-art results on four autonomous driving benchmarks without the latency penalty of explicit reasoning steps.