gentic.news — AI News Intelligence Platform
speech ai

30 articles about speech ai in AI news

OpenBMB Launches VoxCPM 2, an Open-Source TTS Model Rivaling Qwen3-TTS

OpenBMB has launched VoxCPM 2, an open-source text-to-speech AI model from China. The release is positioned as a direct competitor to Alibaba's Qwen3-TTS, expanding the open-source TTS landscape.

91% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.

99% relevant

Sabicap Develops Brain Wearable to Decode Imagined Speech into Text

Sabicap is developing a brain wearable with tens of thousands of sensors to decode imagined speech into text. The company, backed by Vinod Khosla, aims to create a system that works across users with minimal calibration for broad adoption.

95% relevant

AI Model Decodes Silent Speech from Phone Sensors, No Microphone Needed

A new AI model can reconstruct speech by analyzing imperceptible facial movements captured by smartphone sensors, effectively enabling silent speech recognition without a microphone. This represents a significant leap in sensor fusion and on-device AI.

85% relevant

Microsoft Expands AI Portfolio with New Speech and Voice Models

Microsoft has released MAI-Transcribe-1, a new speech-to-text model, and made its in-house MAI-Voice-1 and MAI-Image-2 models available. This expansion represents Microsoft's continued diversification beyond its OpenAI partnership, strengthening its position in the competitive AI market.

80% relevant

Typeless v1.0 Launches for Windows, Claims 220 WPM Speech-to-Text with Local Processing

Typeless has launched v1.0 for Windows, claiming its local AI speech-to-text tool delivers polished text at 220 words per minute (4x faster than typing) with zero cloud retention.

85% relevant

Sabi Cap: 100k-Sensor EEG Hat Decodes Internal Speech at 30 WPM

Sabi released the Sabi Cap, a wearable EEG beanie with 70k-100k biosensors and a brain foundation model trained on 100k hours of neural data. It decodes internal speech to text at ~30 WPM and enables cursor control via intention.

97% relevant

Google Launches Gemini 3.1 Flash TTS with Prompt-Controlled Speech

Google has launched Gemini 3.1 Flash TTS, a text-to-speech model featuring prompt-based voice control and support for over 70 languages. This release expands Google's multimodal AI offerings directly to developers.

93% relevant

Microsoft Open-Sources VALL-E 2: A Zero-Shot TTS Model Achieving Human Parity in Speech Naturalness

Microsoft Research has open-sourced VALL-E 2, a neural codec language model for text-to-speech that achieves human parity in naturalness. It uses a novel 'Repetition-Aware Sampling' method to eliminate word repetition, a common failure mode in prior models.

95% relevant

Text-to-Speech Cost Plummets from $0.15/Word to Free Local Models Using 3GB RAM

Within 12 months, high-quality text-to-speech has shifted from a $0.15-per-word cloud service to free local models requiring only 3GB of RAM, signaling a broader price collapse in AI inference.

85% relevant

Microsoft’s VibeVoice: Open-Source Speech-to-Text with Diarization

Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion on an M5 MacBook, transcribing 1 hour of audio in ~9 minutes using ~60GB RAM.

85% relevant

Fish Audio S2 Enables Word-Level Speech Control with Positional Tags, Beats GPT-4o in Human Preference Tests

Fish Audio S2 is a 100% open-source TTS model that uses inline positional tags for word-level vocal control, achieving 8/10 wins against GPT-4o and Gemini in human preference tests while generating audio nearly 5x faster than real time.

95% relevant

Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS

Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.

97% relevant

Neuralink & ElevenLabs Demo AI Voice Restoration for Brain Implant User

Neuralink and voice AI firm ElevenLabs demonstrated a system that generates speech for a Neuralink patient who lost their voice. In the demo, a brain-computer interface decodes intended speech into a synthetic voice in real time.

85% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

75% relevant

Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test

A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.

85% relevant

Mistral AI Releases Voxtral TTS: 4B-Parameter Open-Weight Model Clones Voices from 3-Second Audio in 9 Languages

Mistral AI has launched Voxtral TTS, its first open-weight text-to-speech model. The 4B-parameter model clones voices from three seconds of reference audio across nine languages, with a latency of 70ms, and scored higher on naturalness than ElevenLabs Flash v2.5 in human tests.

95% relevant

Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5

Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.

95% relevant

OpenClaw Voice Interface Demo Shows Real-Time AI Assistant with Push-to-Talk Hardware

A developer demonstrated a custom hardware rig that uses a push-to-talk button to transcribe speech, query the OpenClaw AI model, and stream responses back in real time. The setup provides a tangible, hands-free interface for interacting with open-source AI assistants.

85% relevant

OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations

OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.

95% relevant

Typeless AI Redefines Voice-to-Text: From Transcription to Native-Level Rewriting

Typeless AI has introduced a revolutionary voice-to-text tool that doesn't just transcribe speech but rewrites it with native-level fluency, grammar correction, and tone adjustment across multiple languages, potentially eliminating manual typing for many professional tasks.

85% relevant

The Uncanny Valley of Truth: How AI Avatars Are Blurring Reality's Edge

AI avatars now replicate human speech patterns, facial expressions, and gestures with unsettling accuracy, creating synthetic personas indistinguishable from real people. This technological leap raises urgent questions about authenticity, trust, and the future of digital communication.

85% relevant

Anthropic's Claude Code Gets Voice Mode: The Next Frontier in AI-Assisted Programming

Anthropic has introduced voice mode for Claude Code, allowing developers to interact with the AI coding assistant through natural speech. This marks a significant evolution in how programmers can collaborate with AI tools, potentially transforming development workflows.

85% relevant

Developer Achieves 395x RTFx on M5 Max with Fastest Parakeet v3 for Apple ANE

Developer @mweinbach has optimized the Parakeet v3 speech recognition model for Apple's Neural Engine, achieving a 395x real-time factor on an M5 Max chip. This represents a significant performance leap for on-device AI inference on Apple Silicon.

87% relevant

ElevenLabs Voice Cloning API Priced from $5 to $1,320/Month

ElevenLabs' AI voice cloning service has published pricing tiers from $5 to $1,320 per month. This formalizes the cost structure for developers and businesses integrating synthetic speech.

87% relevant

Cohere Transcribe: 2B-Parameter Open-Source ASR Model Achieves 5.42% WER, Topping Hugging Face Leaderboard

Cohere released Transcribe, a 2B-parameter open-source speech recognition model. It claims a 5.42% average word error rate, beating OpenAI Whisper v3 and topping the Hugging Face Open ASR Leaderboard.

95% relevant

Qwen3-TTS Added to mlx-tune, Enabling Full Qwen Model Fine-Tuning on Apple Silicon Macs

The mlx-tune library now supports Qwen3-TTS, making the entire Qwen model stack, including the new text-to-speech model, fine-tunable on Apple Silicon Macs. This expands local AI development options for researchers and developers.

85% relevant

GPT-5.2-Based Smart Speaker Achieves 100% Resident ID Accuracy in Care Home Safety Evaluation

Researchers evaluated a voice-enabled smart speaker for care homes using Whisper and RAG, achieving 100% resident identification and 89.09% reminder recognition with GPT-5.2. The safety-focused framework highlights remaining challenges in converting informal speech to calendar events (84.65% accuracy).

77% relevant

Whisper's Real-Time Translation Demo Shows Practical Progress Toward Universal Translation

OpenAI's Whisper model demonstrated real-time translation from English to Spanish, showcasing progress toward practical universal translation tools. The demo highlights incremental but meaningful improvements in speech-to-speech translation latency and quality.

85% relevant

OpenBMB's VoxCPM 2: 2B-Param Open-Source TTS for Multilingual Voice

OpenBMB launched VoxCPM 2, a 2-billion-parameter open-source text-to-speech model. It generates multilingual, emotionally expressive speech from text descriptions and runs on consumer-grade hardware.

97% relevant