gentic.news — AI News Intelligence Platform

speech

30 articles about speech in AI news

Microsoft’s VibeVoice: Open-Source Speech-to-Text with Diarization

Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion on an M5 MacBook, transcribing 1 hour of audio in ~9 minutes using ~60GB RAM.

85% relevant

Sabi Cap: 100k-Sensor EEG Hat Decodes Internal Speech at 30 WPM

Sabi released the Sabi Cap, a wearable EEG beanie with 70k-100k biosensors and a brain foundation model trained on 100k hours of neural data. It decodes internal speech to text at ~30 WPM and enables cursor control via intention.

97% relevant

Sabicap Develops Brain Wearable to Decode Imagined Speech into Text

Sabicap is developing a brain wearable with tens of thousands of sensors to decode imagined speech into text. The company, backed by Vinod Khosla, aims to create a system that works across users with minimal calibration for broad adoption.

95% relevant

Google Launches Gemini 3.1 Flash TTS with Prompt-Controlled Speech

Google has launched Gemini 3.1 Flash TTS, a text-to-speech model featuring prompt-based voice control and support for over 70 languages. This release expands Google's multimodal AI offerings directly to developers.

93% relevant

AI Model Decodes Silent Speech from Phone Sensors, No Microphone Needed

A new AI model can reconstruct speech by analyzing imperceptible facial movements captured by smartphone sensors, effectively enabling silent speech recognition without a microphone. This represents a significant leap in sensor fusion and on-device AI.

85% relevant

Microsoft Expands AI Portfolio with New Speech and Voice Models

Microsoft has released MAI-Transcribe-1, a new speech-to-text model, and made its in-house MAI-Voice-1 and MAI-Image-2 models available. This expansion represents Microsoft's continued diversification beyond its OpenAI partnership, strengthening its position in the competitive AI market.

80% relevant

Microsoft Open-Sources VALL-E 2: A Zero-Shot TTS Model Achieving Human Parity in Speech Naturalness

Microsoft Research has open-sourced VALL-E 2, a neural codec language model for text-to-speech that achieves human parity in naturalness. It uses a novel 'Repetition-Aware Sampling' method to eliminate word repetition, a common failure mode in prior models.

95% relevant

Text-to-Speech Cost Plummets from $0.15/Word to Free Local Models Using 3GB RAM

In 12 months, high-quality text-to-speech has shifted from a $0.15-per-word cloud service to free local models that need only 3GB of RAM, signaling a broader price collapse in AI inference.

85% relevant

Typeless v1.0 Launches for Windows, Claims 220 WPM Speech-to-Text with Local Processing

Typeless has launched v1.0 for Windows, claiming its local AI speech-to-text tool delivers polished text at 220 words per minute—4x faster than typing—with zero cloud retention.

85% relevant

Fish Audio S2 Enables Word-Level Speech Control with Positional Tags, Beats GPT-4o in Human Preference Tests

Fish Audio S2 introduces a 100% open-source TTS model that uses inline positional tags for word-level vocal control, achieving 8/10 wins against GPT-4o and Gemini in human preference tests while generating audio nearly 5x faster than real-time.

95% relevant

Developer Achieves 395x RTFx on M5 Max with Fastest Parakeet v3 for Apple ANE

Developer @mweinbach has optimized the Parakeet v3 speech recognition model for Apple's Neural Engine, reaching an RTFx (inverse real-time factor) of 395 on an M5 Max chip, meaning it transcribes audio roughly 395 times faster than real time. This represents a significant performance leap for on-device AI inference on Apple Silicon.

87% relevant
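
For readers unfamiliar with the metric, RTFx is commonly defined as audio duration divided by processing time, so higher is faster. The sketch below is only an illustration of that arithmetic: the function names `rtfx` and `processing_time` are hypothetical, and the inputs are the rounded figures quoted in the summaries on this page (395x here, and 1 hour of audio in about 9 minutes for the VibeVoice MLX test).

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Seconds of compute needed for a clip at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    one_hour = 60 * 60

    # 1 hour of audio transcribed in ~9 minutes (figure quoted on this page).
    print(round(rtfx(one_hour, 9 * 60), 2))        # 6.67

    # Time to transcribe 1 hour of audio at the reported 395x.
    print(round(processing_time(one_hour, 395), 2))  # 9.11 (seconds)
```

Under these assumptions, a 395x result would put an hour of audio at roughly nine seconds of compute, against roughly nine minutes for the ~6.7x run.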

OpenBMB's VoxCPM 2: 2B-Param Open-Source TTS for Multilingual Voice

OpenBMB launched VoxCPM 2, a 2-billion-parameter open-source text-to-speech model. It generates multilingual, emotionally expressive speech from text descriptions and runs on consumer-grade hardware.

97% relevant

ElevenLabs Voice Cloning API Priced from $5 to $1,320/Month

ElevenLabs' AI voice cloning service has published pricing tiers from $5 to $1,320 per month. This formalizes the cost structure for developers and businesses integrating synthetic speech.

87% relevant

OpenBMB Launches VoxCPM 2, an Open-Source TTS Model Rivaling Qwen3-TTS

OpenBMB has launched VoxCPM 2, an open-source text-to-speech AI model from China. The release is positioned as a direct competitor to Alibaba's Qwen3-TTS, expanding the open-source TTS landscape.

91% relevant

Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS

Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.

97% relevant

Neuralink & ElevenLabs Demo AI Voice Restoration for Brain Implant User

Neuralink and voice AI firm ElevenLabs demonstrated a system that generates speech for a Neuralink patient who lost their voice. The demo shows a brain-computer interface decoding intended speech into a synthetic voice in real time.

85% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

75% relevant

Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test

A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.

85% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.

99% relevant

Cohere Transcribe: 2B-Parameter Open-Source ASR Model Achieves 5.42% WER, Topping Hugging Face Leaderboard

Cohere released Transcribe, a 2B-parameter open-source speech recognition model. It claims a 5.42% average word error rate, beating OpenAI Whisper v3 and topping the Hugging Face Open ASR Leaderboard.

95% relevant
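
Word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below illustrates that standard definition only. It is not the leaderboard's or Cohere's evaluation code, the function name `word_error_rate` is hypothetical, and it skips the text normalization (casing, punctuation) that real benchmarks typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

On this scale, a 5.42% average WER corresponds to a little over five word errors per 100 reference words.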

Qwen3-TTS Added to mlx-tune, Enabling Full Qwen Model Fine-Tuning on Apple Silicon Macs

The mlx-tune library now supports Qwen3-TTS, making the entire Qwen model stack—including the new text-to-speech model—fine-tunable on Apple Silicon Macs. This expands local AI development options for researchers and developers.

85% relevant

Mistral AI Releases Voxtral TTS: 4B-Parameter Open-Weight Model Clones Voices from 3-Second Audio in 9 Languages

Mistral AI has launched Voxtral TTS, its first open-weight text-to-speech model. The 4B-parameter model clones voices from three seconds of reference audio across nine languages, with a latency of 70ms, and scored higher on naturalness than ElevenLabs Flash v2.5 in human tests.

95% relevant

Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5

Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.

95% relevant

GPT-5.2-Based Smart Speaker Achieves 100% Resident ID Accuracy in Care Home Safety Evaluation

Researchers evaluated a voice-enabled smart speaker for care homes using Whisper and RAG, achieving 100% resident identification and 89.09% reminder recognition with GPT-5.2. The safety-focused framework highlights remaining challenges in converting informal speech to calendar events (84.65% accuracy).

77% relevant

OpenClaw Voice Interface Demo Shows Real-Time AI Assistant with Push-to-Talk Hardware

A developer demonstrated a custom hardware rig that uses a push-to-talk button to transcribe speech, query the OpenClaw AI model, and stream responses back in real time. The setup provides a tangible, hands-free interface for interacting with open-source AI assistants.

85% relevant

Whisper's Real-Time Translation Demo Shows Practical Progress Toward Universal Translation

OpenAI's Whisper model demonstrated real-time translation from English to Spanish, showcasing progress toward practical universal translation tools. The demo highlights incremental but meaningful improvements in speech-to-speech translation latency and quality.

85% relevant

LuxTTS Democratizes Voice Cloning: High-Quality Synthesis Now Runs on Consumer Hardware

LuxTTS, a new open-source text-to-speech model, enables realistic voice cloning from just 3 seconds of audio using only 1GB of VRAM. The system operates 150x faster than real-time and produces 48kHz audio, challenging proprietary solutions like ElevenLabs.

95% relevant

OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations

OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.

95% relevant

Microsoft's VibeVoice-ASR Shatters Transcription Limits with 60-Minute Single-Pass Processing

Microsoft has released VibeVoice-ASR on Hugging Face, a speech recognition model that transcribes 60 minutes of audio in one pass, with speaker diarization, timestamps, and support for 50+ languages without configuration.

85% relevant

Typeless AI Redefines Voice-to-Text: From Transcription to Native-Level Rewriting

Typeless AI has introduced a voice-to-text tool that goes beyond transcription: it rewrites speech with native-level fluency, grammar correction, and tone adjustment across multiple languages, potentially eliminating manual typing for many professional tasks.

85% relevant