audio processing
30 articles about audio processing in AI news
DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness
Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.
Microsoft's MarkItDown Library Revolutionizes Document Processing for AI Applications
Microsoft's AutoGen team has released MarkItDown, an open-source Python library that converts diverse document formats into clean Markdown for LLM consumption. This tool eliminates complex preprocessing pipelines and supports over 10 file types including PDFs, Office documents, images, and audio.
Gemini 3.5 Live Translate Debuts as Real-Time Audio Model
Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.
mlx-audio v0.4.3 Ships 6 New TTS Models, Slimmer Deps
mlx-audio v0.4.3 adds 6 TTS models, server concurrency, and slims dependencies, targeting Apple Silicon developers.
NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text
NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.
NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning
NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.
Microsoft's 'Markdownify' Converts PDFs, Audio, Video to Clean LLM Markdown
Microsoft launched 'Markdownify', a Python tool that converts PDFs, Word docs, Excel, PowerPoint, audio, and YouTube URLs into clean Markdown. This addresses a major pain point in AI pipelines where raw file parsing breaks context and structure.
Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS
Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.
Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2
A new open-source CLI tool called Insanely Fast Whisper achieves 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching with no quality loss.
Waves Audio Launches Lightning V3.1: 10-Second Voice Cloning with 44.1kHz Studio Quality
Waves Audio released Lightning V3.1, a voice cloning model that creates studio-quality voice replicas from just 10 seconds of audio with under 100ms latency. The update supports over 50 languages and targets real-time applications.
OpenHome Launches Open-Source Voice Assistant Platform with Full Local Processing
OpenHome has launched an open-source voice assistant platform that processes all audio and commands locally on-device, positioning itself as a privacy-focused alternative to cloud-based services like Amazon Alexa.
OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency
Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.
OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations
OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.
Microsoft's VibeVoice-ASR Shatters Transcription Limits with 60-Minute Single-Pass Processing
Microsoft has released VibeVoice-ASR on Hugging Face, a revolutionary speech recognition model that transcribes 60-minute audio in one pass with speaker diarization, timestamps, and multilingual support across 50+ languages without configuration.
JAEGER Breaks the 2D Barrier: How 3D Audio-Visual AI Could Transform Robotics and AR
Researchers introduce JAEGER, a framework that extends audio-visual large language models into 3D space using RGB-D and spatial audio. This breakthrough enables AI to understand and reason about physical environments with unprecedented spatial awareness.
OpenAI's Audio Revolution: New Voice Models Signal Major AI Advancements
OpenAI appears poised to release new audio models that could significantly enhance voice interaction capabilities. This development follows recent trademark filings and suggests major improvements to voice mode technology.
Typeless v1.0 Launches for Windows, Claims 220 WPM Speech-to-Text with Local Processing
Typeless has launched v1.0 for Windows, claiming its local AI speech-to-text tool delivers polished text at 220 words per minute—4x faster than typing—with zero cloud retention.
OpenVoice v2: Complete Voice Cloning Directory Launches on GitHub
A developer has compiled and released a comprehensive directory of open-source voice cloning tools and resources on GitHub. This centralizes access to models, datasets, and training code, lowering the barrier to entry for AI audio development.
Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS
Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.
Google Releases Gemma 4 Family Under Apache 2.0, Featuring 2B to 31B Models with MoE and Multimodal Capabilities
Google has released the Gemma 4 family of open-weight models, derived from Gemini 3 technology. The four models, ranging from 2B to 31B parameters and including a Mixture-of-Experts variant, are available under a permissive Apache 2.0 license and feature multimodal processing.
OpenHome Launches Local-Only Smart Speaker Dev Kit with OpenClaw AI Agents
OpenHome has released a smart speaker development kit that runs AI agents entirely on local hardware, processing all voice data locally. This provides an open-source alternative to cloud-dependent assistants like Alexa, with no vendor lock-in.
Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications
Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.
ATLAS: Pioneering Lifelong Learning for AI That Sees and Hears
Researchers introduce the first continual learning benchmark for audio-visual segmentation, addressing how AI systems can adapt to evolving real-world environments without forgetting previous knowledge. The ATLAS framework uses audio-guided conditioning and low-rank anchoring to maintain performance across dynamic scenarios.
Google's Gemini Embedding 2 Unifies All Media Types in Single AI Framework
Google has launched Gemini Embedding 2, its first fully multimodal embedding model that maps text, images, video, audio, and documents into a single shared vector space. The breakthrough supports 100+ languages and flexible vector sizing for optimized performance.
DeepMind's Diffusion Breakthrough: Training Better Latents for Superior AI Generation
Google DeepMind researchers have developed new techniques for training latent representations in diffusion models, potentially leading to more efficient, higher-quality AI-generated content across images, audio, and video domains.
No-Code Revolution: How AI-Powered Platforms Are Democratizing Software Development
AI-powered no-code platforms are enabling non-technical professionals to build complex software applications in record time. From construction procurement platforms to specialized audiobook apps, these tools are breaking down traditional barriers to software development.
OpenAI's WebSocket Revolution: The End of AI Voice Lag and What It Means for Human-Computer Interaction
OpenAI has introduced WebSocket mode for its API, dramatically reducing latency in voice AI interactions. This technical breakthrough enables near-real-time conversations by eliminating the sequential processing bottlenecks that plagued previous voice AI systems.
Beyond the Token Limit: How Claude Opus 4.6's Architectural Breakthrough Enables True Long-Context Reasoning
Anthropic's Claude Opus 4.6 represents a fundamental shift in large language model architecture, moving beyond simple token expansion to create genuinely autonomous reasoning systems. The breakthrough enables practical use of million-token contexts through novel memory management and hierarchical processing.
Apple AFM Core Advanced: Sparse, Multimodal, iPhone 17 Pro Only
Apple AFM Core Advanced is sparse, multimodal, and exclusive to iPhone 17 Pro, M3+ Mac, M4+ iPad, while AFM Core is dense for other devices.
Google Gemma 4 12B: Encoder-Free Multimodal Model Launches
Google launched Gemma 4 12B, an encoder-free multimodal model for on-device AI, reducing latency by eliminating the vision encoder.