speech recognition

30 articles about speech recognition in AI news

Cohere Releases Arabic Speech Recognition Model Under Apache 2.0

Cohere released Arabic speech recognition model under Apache 2.0, claiming world's most accurate open-source version. No benchmark data or training details disclosed.

Jul 7, 2026100% relevant

AI Model Decodes Silent Speech from Phone Sensors, No Microphone Needed

A new AI model can reconstruct speech by analyzing imperceptible facial movements captured by smartphone sensors, effectively enabling silent speech recognition without a microphone. This represents a significant leap in sensor fusion and on-device AI.

Apr 8, 202685% relevant

Developer Achieves 395x RTFx on M5 Max with Fastest Parakeet v3 for Apple ANE

Developer @mweinbach has optimized the Parakeet v3 speech recognition model for Apple's Neural Engine, achieving a 395x real-time factor on an M5 Max chip. This represents a significant performance leap for on-device AI inference on Apple Silicon.

Apr 22, 202687% relevant

Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS

Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.

Apr 6, 202697% relevant

Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test

A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.

Apr 3, 202685% relevant

Cohere Transcribe: 2B-Parameter Open-Source ASR Model Achieves 5.42% WER, Topping Hugging Face Leaderboard

Cohere released Transcribe, a 2B-parameter open-source speech recognition model. It claims a 5.42% average word error rate, beating OpenAI Whisper v3 and topping the Hugging Face Open ASR Leaderboard.

Mar 27, 202695% relevant

Microsoft's VibeVoice-ASR Shatters Transcription Limits with 60-Minute Single-Pass Processing

Microsoft has released VibeVoice-ASR on Hugging Face, a revolutionary speech recognition model that transcribes 60-minute audio in one pass with speaker diarization, timestamps, and multilingual support across 50+ languages without configuration.

Mar 2, 202685% relevant

Sabicap Develops Brain Wearable to Decode Imagined Speech into Text

Sabicap is developing a brain wearable with tens of thousands of sensors to decode imagined speech into text. The company, backed by Vinod Khosla, aims to create a system that works across users with minimal calibration for broad adoption.

Apr 16, 202695% relevant

Typeless v1.0 Launches for Windows, Claims 220 WPM Speech-to-Text with Local Processing

Typeless has launched v1.0 for Windows, claiming its local AI speech-to-text tool delivers polished text at 220 words per minute—4x faster than typing—with zero cloud retention.

Mar 23, 202685% relevant

GPT-5.2-Based Smart Speaker Achieves 100% Resident ID Accuracy in Care Home Safety Evaluation

Researchers evaluated a voice-enabled smart speaker for care homes using Whisper and RAG, achieving 100% resident identification and 89.09% reminder recognition with GPT-5.2. The safety-focused framework highlights remaining challenges in converting informal speech to calendar events (84.65% accuracy).

Mar 26, 202677% relevant

Cohere Open-Sources Three AI Models Under Apache 2.0

Cohere released three open-source AI models under Apache 2.0 in 2025, expanding its enterprise portfolio with speech, language, and code capabilities.

Jul 26, 202685% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

Apr 5, 202675% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.

Mar 29, 202699% relevant

OpenClaw Voice Interface Demo Shows Real-Time AI Assistant with Push-to-Talk Hardware

A developer demonstrated a custom hardware rig that uses a push-to-talk button to transcribe speech, query the OpenClaw AI model, and stream responses back in real-time. The setup provides a tangible, hands-free interface for interacting with open-source AI assistants.

Mar 23, 202685% relevant

Whisper's Real-Time Translation Demo Shows Practical Progress Toward Universal Translation

OpenAI's Whisper model demonstrated real-time translation from English to Spanish, showcasing progress toward practical universal translation tools. The demo highlights incremental but meaningful improvements in speech-to-speech translation latency and quality.

Mar 18, 202685% relevant

Qualcomm's Arduino Ventuno Q: A Powerhouse Single-Board Computer for the Next Wave of Physical AI

Qualcomm and Arduino have launched the Ventuno Q, a high-performance single-board computer designed specifically for robotics and physical AI applications. Powered by the Dragonwing IQ8 processor with a dedicated NPU and paired with a low-latency microcontroller, it enables complex, offline AI tasks like object tracking and gesture recognition for systems that interact with the real world.

Mar 9, 202680% relevant

Typeless AI Redefines Voice-to-Text: From Transcription to Native-Level Rewriting

Typeless AI has introduced a revolutionary voice-to-text tool that doesn't just transcribe speech but rewrites it with native-level fluency, grammar correction, and tone adjustment across multiple languages, potentially eliminating manual typing for many professional tasks.

Mar 1, 202685% relevant

Anthropic's Claude Code Gets Voice Mode: The Next Frontier in AI-Assisted Programming

Anthropic has introduced voice mode for Claude Code, allowing developers to interact with the AI coding assistant through natural speech. This marks a significant evolution in how programmers can collaborate with AI tools, potentially transforming development workflows.

Mar 3, 202685% relevant

GigaAM Multilingual: 600M Conformer cuts WER 32-86% vs Whisper on 70+ languages

GigaAM Multilingual open-sources two Conformer encoders (220M/600M params) pre-trained on 2M hours across 70+ languages, achieving 32–86% relative WER reduction vs Whisper-large-v3 on Central Asian languages.

Jul 26, 202692% relevant

China Builds First Phase-Change Memristor Neural Chip

Chinese researchers built the first phase-change memristor neural chip, claiming 10,000x energy efficiency but releasing no benchmarks.

Jul 22, 2026100% relevant

NVIDIA Drops 30B Nemotron Audex Audio Model with MoE

NVIDIA released Nemotron Audex 30B-A3B, a 30B-parameter MoE audio model unifying ASR, understanding, and TTS with 3B active parameters.

Jul 12, 202691% relevant

GoldBean MCP Server: Pay Per Call with USDC — No API Keys, No Subscriptions

GoldBean MCP server lets Claude Code call 47 Baidu AI endpoints with per-call USDC micropayments via x402 — no API keys, no subscriptions. Agents pay $0.01-$0.05 per call on Base L2.

Jul 9, 202654% relevant

Amazon Designs Custom AI Silicon for Future Devices, Panay Says

Amazon hardware chief Panos Panay confirmed Amazon is designing its own end-to-end silicon for some devices, signaling a strategic push into custom AI hardware.

Jul 5, 202685% relevant

Wan-Streamer v0.1 Cuts Audio-Visual Interaction Latency to 200ms in Single

Wan-Streamer v0.1 achieves 200ms model-side latency in a single Transformer for full-duplex audio-visual interaction, eliminating cascaded modules. The paper lacks parameter count and benchmark comparisons, limiting reproducibility.

Jun 25, 202675% relevant

Gemini 3.5 Live Translate Debuts as Real-Time Audio Model

Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.

Jun 9, 202687% relevant

NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text

NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.

Apr 28, 202693% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

Apr 19, 202695% relevant

Seven Voice AI Architectures That Actually Work in Production

An engineer shares seven voice agent architectures that have survived production, detailing their components, latency improvements, and failure modes. This is a practical guide for building real-time, interruptible, and scalable voice AI.

Apr 12, 202678% relevant

MLX Enables Local Grounded Reasoning for Satellite, Security, Robotics AI

Apple's MLX framework is enabling 'local grounded reasoning' for AI applications in satellite imagery, security systems, and robotics, moving complex tasks from the cloud to on-device processing.

Apr 11, 202685% relevant

OpenClaw Voice Interface Demo Shows Real-Time AI Assistant Hardware

A developer showcased a custom hardware rig that integrates a push-button voice interface with the OpenClaw AI model, streaming responses in real-time. This demonstrates a tangible, open-source alternative to proprietary voice assistants like Amazon Alexa.

Apr 9, 202675% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety