Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

latency

30 articles about latency in AI news

Wan-Streamer v0.1 Cuts Audio-Visual Interaction Latency to 200ms in Single

Wan-Streamer v0.1 achieves 200ms model-side latency in a single Transformer for full-duplex audio-visual interaction, eliminating cascaded modules. The paper lacks parameter count and benchmark comparisons, limiting reproducibility.

75% relevant

llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU

llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.

84% relevant

Miso One: 8B Open-Source TTS Hits 110ms Latency, Real Emotion

Miso One, an 8B open-source TTS model, achieves 110ms latency with emotional range. Weights are fully open-source for self-hosting, but no benchmark data is provided.

85% relevant

vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster

vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.

82% relevant

Codex Update Cuts GUI Workflow Latency 42%

Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.

84% relevant

Google Virgo Fabric: 100K-Accelerator AI Network Cuts Latency

Google unveiled Virgo, a data center fabric for AI clusters of 100,000+ accelerators, using a flatter two-layer topology to reduce latency and improve bisection bandwidth for synchronized training workloads.

78% relevant

OpenCLAW-P2P v6.0 Cuts Paper Lookup Latency to <50ms

OpenCLAW-P2P v6.0 introduces a multi-layer persistence architecture and live reference verification, reducing paper retrieval latency from >3s to <50ms and operating with 14 autonomous agents that scored 50+ papers.

77% relevant

EventChat Study: LLM-Driven Conversational Recommenders Show Promise but Face Cost & Latency Hurdles for SMEs

A new study details the real-world implementation and user evaluation of an LLM-driven conversational recommender system (CRS) for an SME. Results show 85.5% recommendation accuracy but highlight critical business viability challenges: a median cost of $0.04 per interaction and 5.7s latency.

72% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.

92% relevant

Quantized Inference Breakthrough for Next-Gen Recommender Systems: OneRec-V2 Achieves 49% Latency Reduction with FP8

New research shows FP8 quantization can dramatically speed up modern generative recommender systems like OneRec-V2, achieving 49% lower latency and 92% higher throughput with no quality loss. This breakthrough bridges the gap between LLM optimization techniques and industrial recommendation workloads.

97% relevant

WSL 3 Preview: Cut Claude Code's Local Inference Latency on Windows

WSL 3 preview delivers near-native GPU/NPU for Claude Code + Ollama on Copilot+ laptops, but WSL 2 still handles NVIDIA CUDA fine for desktop users.

98% relevant

VIAVI Ships First Ultra Ethernet Validation Tool for AI Data Centers

VIAVI launched the first Ultra Ethernet validation tool for AI data centers, supporting 800GE/1.6TE links. The tool enables certification of low-latency, lossless transport critical for distributed AI training.

90% relevant

OpenAI 'Bidi' Voice Mode Demo Leaks: Real-Time Interruption

Leaked demo shows OpenAI 'bidi' voice mode handling interruptions with sub-320ms latency. No official release date or pricing announced.

82% relevant

Gemini 3.5 Live Translate Debuts as Real-Time Audio Model

Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.

87% relevant

Google Releases Magenta RealTime 2 for Open-Weight Music Generation

Google released Magenta RealTime 2 on Hugging Face, the only open-weights model for real-time continuous music generation on device with ~200ms latency.

85% relevant

Google Gemma 4 12B: Encoder-Free Multimodal Model Launches

Google launched Gemma 4 12B, an encoder-free multimodal model for on-device AI, reducing latency by eliminating the vision encoder.

100% relevant

Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test

A cascaded LLM framework for e-commerce storefront generation lifted cart adds by +2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.

100% relevant

Conductor vs Claude Code: Pinned Versions Split the Community

Ask HN asks if Conductor's single-agent matches native Claude Code. Pinned versions create a stability-vs-latency trade-off.

92% relevant

Google TPU 'Broadfly' Topology Scales Pod to 1,152 Chips

Google unveiled a Broadfly TPU topology at Cloud Next, scaling pods to 1,152 chips — 4.5x larger than Ironwood — with max 7 hops. This inference-first design challenges NVIDIA's NVLink on scale and latency.

94% relevant

Meshwatch GNN Stack Ships Fraud Detection with 17.2% Lift over XGBoost

Meshwatch GNN fraud stack achieves 17.2% recall lift over XGBoost at sub-50ms latency, shipping a custom GraphSAGE variant with online neighbor sampling.

92% relevant

12-Metric Agent Eval Framework From 100+ Deployments Hits Production

12-metric evaluation framework for production AI agents from 100+ deployments targets task success, cost, latency, tool use, and safety.

74% relevant

Codex 'Ultra-Fast' Mode Spotted in Leaked Screenshot

Leaked screenshot suggests OpenAI is adding an ultra-fast latency mode to Codex. No release date or pricing confirmed.

89% relevant

Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?

Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.

100% relevant

MiniMax M2.7 Hits 400 TPS on SambaNova Hardware

MiniMax M2.7 reaches 400 TPS on SambaNova hardware, making latency imperceptible. Details on model size and batch size undisclosed.

75% relevant

AllenAI's MolmoAct2: 720-Hour Bimanual Dataset, Beats GPT-5 on Robotics

AllenAI released MolmoAct2, an open robotics model with a 720-hour bimanual dataset, beating GPT-5 and Gemini Robotics on success rate (89.4% vs 82.1%) with 40% lower latency.

95% relevant

How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute

LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.

100% relevant

Continuous Semantic Caching

Researchers propose a theory-grounded semantic caching system that treats user queries as points in a continuous embedding space, using dynamic ε-net discretization and kernel ridge regression to cut inference costs and latency without switching overhead.

78% relevant

PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100

PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.

90% relevant

CS3: A New Framework to Boost Two-Tower Recommenders Without Slowing Them Down

Researchers propose CS3, a plug-and-play framework that strengthens the ubiquitous two-tower recommendation architecture. It uses three novel mechanisms to improve model alignment and knowledge transfer, delivering significant revenue gains in a live ad system while maintaining millisecond latency.

100% relevant

Xiaomi's OneVL Uses Latent CoT to Beat Explicit CoT in Autonomous Driving

Xiaomi's Embodied Intelligence Team released OneVL, a vision-language model using latent Chain-of-Thought reasoning. It achieves state-of-the-art results on four autonomous driving benchmarks without the latency penalty of explicit reasoning steps.

95% relevant