audio
30 articles about audio in AI news
NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text
NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.
Pretrained Audio Models Underperform in Music Recommendation, New Research Shows
A new study evaluates nine pretrained audio models for music recommendation, finding a significant performance gap between traditional MIR tasks and both hot- and cold-start recommendation scenarios.
NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning
NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.
ByteDance's OmniShow Unifies Text, Image, Audio, Pose for Video Gen
ByteDance introduced OmniShow, a unified multimodal framework for video generation that accepts text, reference images, audio, and pose inputs simultaneously. It claims state-of-the-art performance across diverse conditioning settings.
PixVerse V6 Launches: 15-Second 1080P Video with Full Audio
AI video startup PixVerse launched its V6 model, capable of generating 15-second, 1080p videos with full audio from text prompts. This marks a significant upgrade in output length and quality for the platform.
Microsoft's 'Markdownify' Converts PDFs, Audio, Video to Clean LLM Markdown
Microsoft launched 'Markdownify', a Python tool that converts PDFs, Word docs, Excel, PowerPoint, audio, and YouTube URLs into clean Markdown. This addresses a major pain point in AI pipelines where raw file parsing breaks context and structure.
Qwen3.5-Omni Demonstrates 'Audio-Visual Vibe Coding' as an Emergent Ability
Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific training. This suggests a significant leap in multimodal reasoning for a model already positioned as a strong GPT-4 competitor.
Alibaba's Qwen3.5-Omni Launches with Script-Level Captioning, Audio-Visual Vibe Coding, and Real-Time Web Search
Alibaba's Qwen team has released Qwen3.5-Omni, a multimodal model focused on interpreting images, audio, and video with new capabilities like script-level captioning and 'vibe coding'. It's open-access on Hugging Face but does not generate media.
Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS
Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.
Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2
A new open-source CLI tool called Insanely Fast Whisper achieves a 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching, with no quality loss.
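The headline numbers imply a real-time factor well above the quoted 19x relative speedup: 150 minutes of audio in 98 seconds is roughly 92x faster than real time. A quick sanity check of the arithmetic, using only the figures reported above:

```python
# Sanity-check the quoted throughput figures for Insanely Fast Whisper.
audio_seconds = 150 * 60        # 2.5 hours of input audio
wall_clock_seconds = 98         # reported transcription time

# Real-time factor: seconds of audio processed per second of wall-clock time.
rtf = audio_seconds / wall_clock_seconds
print(f"real-time factor: {rtf:.1f}x")   # ~91.8x

# The claimed 19x speedup is relative to standard Whisper large-v3,
# which would put the baseline runtime at roughly:
baseline_minutes = wall_clock_seconds * 19 / 60
print(f"implied baseline: {baseline_minutes:.0f} minutes")  # ~31 minutes
```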
Mistral AI Releases Voxtral TTS: 4B-Parameter Open-Weight Model Clones Voices from 3-Second Audio in 9 Languages
Mistral AI has launched Voxtral TTS, its first open-weight text-to-speech model. The 4B-parameter model clones voices from three seconds of reference audio across nine languages, with a latency of 70ms, and scored higher on naturalness than ElevenLabs Flash v2.5 in human tests.
Waves Audio Launches Lightning V3.1: 10-Second Voice Cloning with 44.1kHz Studio Quality
Waves Audio released Lightning V3.1, a voice cloning model that creates studio-quality voice replicas from just 10 seconds of audio with under 100ms latency. The update supports over 50 languages and targets real-time applications.
DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness
Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.
Fish Audio S2 Enables Word-Level Speech Control with Positional Tags, Beats GPT-4o in Human Preference Tests
Fish Audio S2 is a fully open-source TTS model that uses inline positional tags for word-level vocal control, winning 8 of 10 matchups against GPT-4o and Gemini in human preference tests while generating audio nearly 5x faster than real time.
OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency
Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.
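Taken together, the reported figures suggest what the offline baseline looks like: at ~25 FPS with a 35x speedup, the bidirectional teacher model would generate well under one frame per second. A back-of-the-envelope check using only the numbers above:

```python
# Back-of-the-envelope check on the reported OmniForcing figures.
streaming_fps = 25   # reported throughput of the distilled causal generator
speedup = 35         # reported speedup over the offline diffusion teacher

offline_fps = streaming_fps / speedup
print(f"implied offline throughput: {offline_fps:.2f} FPS")     # ~0.71 FPS
print(f"i.e. about {1 / offline_fps:.1f} s per frame offline")  # ~1.4 s/frame
```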
OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations
OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.
JAEGER Breaks the 2D Barrier: How 3D Audio-Visual AI Could Transform Robotics and AR
Researchers introduce JAEGER, a framework that extends audio-visual large language models into 3D space using RGB-D and spatial audio. This breakthrough enables AI to understand and reason about physical environments with unprecedented spatial awareness.
OpenAI's Audio Revolution: New Voice Models Signal Major AI Advancements
OpenAI appears poised to release new audio models that could significantly enhance voice interaction capabilities. This development follows recent trademark filings and suggests major improvements to voice mode technology.
Microsoft’s VibeVoice: Open-Source Speech-to-Text with Diarization
Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion on an M5 MacBook, transcribing 1 hour of audio in ~9 minutes using ~60GB RAM.
How to Automate Meeting Notes and Action Items with Read AI's MCP Server
Integrate Read AI's MCP server with Claude Code to transform meeting audio into structured notes, decisions, and code-ready tasks without leaving your IDE.
OpenVoice v2: Complete Voice Cloning Directory Launches on GitHub
A developer has compiled and released a comprehensive directory of open-source voice cloning tools and resources on GitHub. This centralizes access to models, datasets, and training code, lowering the barrier to entry for AI audio development.
Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS
Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.
Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5
Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.
Halter's Solar-Powered Cattle Collars Hit $2B Valuation, Using AI to Replace Physical Fences
Livestock tech company Halter reached a $2 billion valuation by replacing physical fences with solar-powered collars that herd cattle using AI-driven vibrations and audio cues. The system turns cows into data streams managed through a smartphone app.
Open-Source Web UI 'LLM Studio' Enables Local Fine-Tuning of 500+ Models, Including GGUF and Multimodal
LLM Studio, a free and open-source web interface, allows users to fine-tune over 500 large language models locally on their own hardware. It supports GGUF-quantized models, vision, audio, and embedding models across Mac, Windows, and Linux.
Single Pane: The Terminal-First Workspace Built for Claude Code
A new macOS app consolidates your terminal, file manager, and markdown editor into one window, with native hooks for Claude Code audio notifications.
OpenHome Launches Open-Source Voice Assistant Platform with Full Local Processing
OpenHome has launched an open-source voice assistant platform that processes all audio and commands locally on-device, positioning itself as a privacy-focused alternative to cloud-based services like Amazon Alexa.
Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications
Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.
LuxTTS Democratizes Voice Cloning: High-Quality Synthesis Now Runs on Consumer Hardware
LuxTTS, a new open-source text-to-speech model, enables realistic voice cloning from just 3 seconds of audio using only 1GB of VRAM. The system operates 150x faster than real-time and produces 48kHz audio, challenging proprietary solutions like ElevenLabs.
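A 150x real-time factor means synthesis time for a clip is its duration divided by 150. A quick illustration of what that claim implies for typical clip lengths, using only the figure quoted above:

```python
# What a 150x real-time factor implies for LuxTTS synthesis times.
REAL_TIME_FACTOR = 150  # reported: audio generated 150x faster than playback

for clip_seconds in (10, 60, 600):
    synth_seconds = clip_seconds / REAL_TIME_FACTOR
    print(f"{clip_seconds:>4}s of audio -> ~{synth_seconds:.2f}s to synthesize")
```

At that rate a full 10-minute narration would take about four seconds to generate, which is what makes the consumer-hardware claim notable.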
ATLAS: Pioneering Lifelong Learning for AI That Sees and Hears
Researchers introduce the first continual learning benchmark for audio-visual segmentation, addressing how AI systems can adapt to evolving real-world environments without forgetting previous knowledge. The ATLAS framework uses audio-guided conditioning and low-rank anchoring to maintain performance across dynamic scenarios.