audio processing

30 articles about audio processing in AI news

DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness

Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.

Mar 20, 2026 · 99% relevant

Microsoft's MarkItDown Library Revolutionizes Document Processing for AI Applications

Microsoft's AutoGen team has released MarkItDown, an open-source Python library that converts diverse document formats into clean Markdown for LLM consumption. This tool eliminates complex preprocessing pipelines and supports over 10 file types including PDFs, Office documents, images, and audio.

Feb 28, 2026 · 92% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft has open-sourced VibeVoice, a family of speech AI models that process up to 60 minutes of audio without chunking. The models deliver structured transcriptions with speaker diarization and generate 90-minute multi-speaker speech in one pass.

Mar 29, 2026 · 99% relevant

Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2

A new open-source CLI tool called Insanely Fast Whisper achieves a 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching, with no reported quality loss.

Mar 27, 2026 · 97% relevant
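
As a quick sanity check on the figures in the teaser above, the arithmetic can be sketched in a few lines of Python. Only the three input numbers come from the teaser; the variable and function names are illustrative, and the implied baseline assumes the 19x figure refers to the same audio on the same hardware, which the teaser does not state.

```python
# Figures quoted in the teaser; everything else here is an illustrative assumption.
AUDIO_MINUTES = 150        # length of the transcribed audio
WALL_CLOCK_SECONDS = 98    # reported transcription time
CLAIMED_SPEEDUP = 19       # reported speedup over standard Whisper large-v3


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


audio_seconds = AUDIO_MINUTES * 60
throughput = real_time_factor(audio_seconds, WALL_CLOCK_SECONDS)

# Assumption: the speedup was measured on the same audio and hardware.
implied_baseline_seconds = WALL_CLOCK_SECONDS * CLAIMED_SPEEDUP

print(f"Throughput: about {throughput:.1f}x real time")
print(f"Implied baseline: about {implied_baseline_seconds / 60:.1f} minutes")
```

Under those assumptions, the tool runs at roughly 92 times real time, and the standard pipeline would need about 31 minutes for the same file.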

Waves Audio Launches Lightning V3.1: 10-Second Voice Cloning with 44.1kHz Studio Quality

Waves Audio released Lightning V3.1, a voice cloning model that creates studio-quality voice replicas from just 10 seconds of audio with under 100ms latency. The update supports over 50 languages and targets real-time applications.

Mar 25, 2026 · 87% relevant

OpenHome Launches Open-Source Voice Assistant Platform with Full Local Processing

OpenHome has launched an open-source voice assistant platform that processes all audio and commands locally on-device, positioning itself as a privacy-focused alternative to cloud-based services like Amazon Alexa.

Mar 17, 2026 · 85% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.

Mar 16, 2026 · 92% relevant

OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations

OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.

Mar 5, 2026 · 95% relevant

Microsoft's VibeVoice-ASR Shatters Transcription Limits with 60-Minute Single-Pass Processing

Microsoft has released VibeVoice-ASR on Hugging Face, a revolutionary speech recognition model that transcribes 60-minute audio in one pass with speaker diarization, timestamps, and multilingual support across 50+ languages without configuration.

Mar 2, 2026 · 85% relevant

JAEGER Breaks the 2D Barrier: How 3D Audio-Visual AI Could Transform Robotics and AR

Researchers introduce JAEGER, a framework that extends audio-visual large language models into 3D space using RGB-D and spatial audio. This breakthrough enables AI to understand and reason about physical environments with unprecedented spatial awareness.

Feb 24, 2026 · 70% relevant

OpenAI's Audio Revolution: New Voice Models Signal Major AI Advancements

OpenAI appears poised to release new audio models that could significantly enhance voice interaction capabilities. This development follows recent trademark filings and suggests major improvements to voice mode technology.

Feb 23, 2026 · 85% relevant

Typeless v1.0 Launches for Windows, Claims 220 WPM Speech-to-Text with Local Processing

Typeless has launched v1.0 for Windows, claiming its local AI speech-to-text tool delivers polished text at 220 words per minute—4x faster than typing—with zero cloud retention.

Mar 23, 2026 · 85% relevant

Google Releases Gemma 4 Family Under Apache 2.0, Featuring 2B to 31B Models with MoE and Multimodal Capabilities

Google has released the Gemma 4 family of open-weight models, derived from Gemini 3 technology. The four models, ranging from 2B to 31B parameters and including a Mixture-of-Experts variant, are available under a permissive Apache 2.0 license and feature multimodal processing.

Apr 2, 2026 · 100% relevant

OpenHome Launches Local-Only Smart Speaker Dev Kit with OpenClaw AI Agents

OpenHome has released a smart speaker development kit that runs AI agents entirely on local hardware, processing all voice data locally. This provides an open-source alternative to cloud-dependent assistants like Alexa, with no vendor lock-in.

Mar 16, 2026 · 85% relevant

Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications

Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.

Mar 13, 2026 · 77% relevant

ATLAS: Pioneering Lifelong Learning for AI That Sees and Hears

Researchers introduce the first continual learning benchmark for audio-visual segmentation, addressing how AI systems can adapt to evolving real-world environments without forgetting previous knowledge. The ATLAS framework uses audio-guided conditioning and low-rank anchoring to maintain performance across dynamic scenarios.

Mar 11, 2026 · 75% relevant

Google's Gemini Embedding 2 Unifies All Media Types in Single AI Framework

Google has launched Gemini Embedding 2, its first fully multimodal embedding model that maps text, images, video, audio, and documents into a single shared vector space. The breakthrough supports 100+ languages and flexible vector sizing for optimized performance.

Mar 10, 2026 · 100% relevant

DeepMind's Diffusion Breakthrough: Training Better Latents for Superior AI Generation

Google DeepMind researchers have developed new techniques for training latent representations in diffusion models, potentially leading to more efficient, higher-quality AI-generated content across images, audio, and video domains.

Feb 26, 2026 · 85% relevant

No-Code Revolution: How AI-Powered Platforms Are Democratizing Software Development

AI-powered no-code platforms are enabling non-technical professionals to build complex software applications in record time. From construction procurement platforms to specialized audiobook apps, these tools are breaking down traditional barriers to software development.

Feb 25, 2026 · 85% relevant

OpenAI's WebSocket Revolution: The End of AI Voice Lag and What It Means for Human-Computer Interaction

OpenAI has introduced WebSocket mode for its API, dramatically reducing latency in voice AI interactions. This technical breakthrough enables near-real-time conversations by eliminating the sequential processing bottlenecks that plagued previous voice AI systems.

Feb 23, 2026 · 75% relevant

Beyond the Token Limit: How Claude Opus 4.6's Architectural Breakthrough Enables True Long-Context Reasoning

Anthropic's Claude Opus 4.6 represents a fundamental shift in large language model architecture, moving beyond simple token expansion to create genuinely autonomous reasoning systems. The breakthrough enables practical use of million-token contexts through novel memory management and hierarchical processing.

Feb 15, 2026 · 70% relevant

NemoVideo AI Automates Video Editing Based on Text Prompts

A video creator states NemoVideo AI now automates complex editing tasks like cuts and transitions from simple text descriptions, reducing a 5-hour manual process to a prompt-driven workflow.

Apr 5, 2026 · 85% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

Apr 5, 2026 · 75% relevant

Building a Memory Layer for a Voice AI Agent: A Developer's Blueprint

A developer shares a technical case study on building a voice-first journal app, focusing on the critical memory layer. The article details using Redis Agent Memory Server for working/long-term memory and key latency optimizations like streaming APIs and parallel fetches to meet voice's strict responsiveness demands.

Apr 4, 2026 · 74% relevant

OpenAI's GPT-Image-2 Model Reportedly Achieves Photorealistic Video Generation, Surpassing Prior Map-Generation Flaws

A social media user claims OpenAI's GPT-Image-2 model now produces video indistinguishable from reality, a significant leap from its predecessor's documented failure to generate coherent world maps.

Apr 4, 2026 · 85% relevant

Google's AICore Beta Enables On-Device Gemini Nano 4 Downloads for Android Phones

A new beta of Google's AICore system service enables users to download Gemini Nano 4 Full and Gemini Nano 4 Fast models directly onto compatible Android phones, including those with Snapdragon 8 Elite Gen 5 chips. This moves beyond pre-installed AI to user-initiated model management.

Apr 3, 2026 · 85% relevant

Alibaba Launches Qwen3.6-Plus with 1M-Token Context, Targeting AI Agent and Coding Workloads

Alibaba Cloud has launched Qwen3.6-Plus, a new multimodal large language model featuring a 1 million-token context length. The release is a strategic move to capture developer mindshare in the competitive AI agent and coding assistant market.

Apr 3, 2026 · 100% relevant

Typeless Launches AI Voice-to-Text Tool Claiming 4x Speed Boost Over Typing

Typeless, a new AI tool, converts spoken voice into polished, formatted text directly within any application. The company claims it operates 4x faster than manual typing.

Apr 1, 2026 · 85% relevant

OpenClaw Skill Automatically Converts YouTube Links into 10 Ready-to-Post Shorts

A developer has created an OpenClaw skill that automatically processes any YouTube link, generating 10 formatted Shorts with captions and centered subjects. This tool aims to streamline content repurposing for social media creators.

Mar 30, 2026 · 87% relevant

Text-to-Speech Cost Plummets from $0.15/Word to Free Local Models Using 3GB RAM

Over the past 12 months, high-quality text-to-speech has shifted from a $0.15-per-word cloud service to free local models requiring only 3GB of RAM, signaling a broader price collapse in AI inference.

Mar 30, 2026 · 85% relevant
