Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

audio ai

30 articles about audio ai in AI news

Pretrained Audio Models Underperform in Music Recommendation, New Research Shows

A new study evaluates nine pretrained audio models for music recommendation, finding significant performance disparity between traditional MIR tasks and both hot and cold-start recommendation scenarios.

80% relevant

Mistral AI Releases Voxtral TTS: 4B-Parameter Open-Weight Model Clones Voices from 3-Second Audio in 9 Languages

Mistral AI has launched Voxtral TTS, its first open-weight text-to-speech model. The 4B-parameter model clones voices from three seconds of reference audio across nine languages, with a latency of 70ms, and scored higher on naturalness than ElevenLabs Flash v2.5 in human tests.

95% relevant

DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness

Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.

99% relevant

OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations

OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.

95% relevant

JAEGER Breaks the 2D Barrier: How 3D Audio-Visual AI Could Transform Robotics and AR

Researchers introduce JAEGER, a framework that extends audio-visual large language models into 3D space using RGB-D and spatial audio. This breakthrough enables AI to understand and reason about physical environments with unprecedented spatial awareness.

70% relevant

OpenAI's Audio Revolution: New Voice Models Signal Major AI Advancements

OpenAI appears poised to release new audio models that could significantly enhance voice interaction capabilities. This development follows recent trademark filings and suggests major improvements to voice mode technology.

85% relevant

NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text

NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.

93% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

95% relevant

ByteDance's OmniShow Unifies Text, Image, Audio, Pose for Video Gen

ByteDance introduced OmniShow, a unified multimodal framework for video generation that accepts text, reference images, audio, and pose inputs simultaneously. It claims state-of-the-art performance across diverse conditioning settings.

85% relevant

PixVerse V6 Launches: 15-Second 1080P Video with Full Audio

AI video startup PixVerse launched its V6 model, capable of generating 15-second, 1080p videos with full audio from text prompts. This marks a significant upgrade in output length and quality for the platform.

89% relevant

Microsoft's 'Markdownify' Converts PDFs, Audio, Video to Clean LLM Markdown

Microsoft launched 'Markdownify', a Python tool that converts PDFs, Word docs, Excel, PowerPoint, audio, and YouTube URLs into clean Markdown. This addresses a major pain point in AI pipelines where raw file parsing breaks context and structure.

85% relevant

Qwen3.5-Omni Demonstrates 'Audio-Visual Vibe Coding' as an Emergent Ability

Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific training. This suggests a significant leap in multimodal reasoning for a model already positioned as a strong GPT-4 competitor.

85% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.

99% relevant

Fish Audio S2 Enables Word-Level Speech Control with Positional Tags, Beats GPT-4o in Human Preference Tests

Fish Audio S2 introduces a 100% open-source TTS model that uses inline positional tags for word-level vocal control, achieving 8/10 wins against GPT-4o and Gemini in human preference tests while generating audio nearly 5x faster than real-time.

95% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.

92% relevant

mlx-audio v0.4.3 Ships 6 New TTS Models, Slimmer Deps

mlx-audio v0.4.3 adds 6 TTS models, server concurrency, and slims dependencies, targeting Apple Silicon developers.

85% relevant

Alibaba's Qwen3.5-Omni Launches with Script-Level Captioning, Audio-Visual Vibe Coding, and Real-Time Web Search

Alibaba's Qwen team has released Qwen3.5-Omni, a multimodal model focused on interpreting images, audio, and video with new capabilities like script-level captioning and 'vibe coding'. It's open-access on Hugging Face but does not generate media.

85% relevant

Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2

A new open-source CLI tool called Insanely Fast Whisper achieves 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching with no quality loss.

97% relevant

Waves Audio Launches Lightning V3.1: 10-Second Voice Cloning with 44.1kHz Studio Quality

Waves Audio released Lightning V3.1, a voice cloning model that creates studio-quality voice replicas from just 10 seconds of audio with under 100ms latency. The update supports over 50 languages and targets real-time applications.

87% relevant

Meta Trains Coding AI on Engineers' Work Traces as 8K Jobs Cut

Meta trains coding AI on engineers' work traces while cutting 8,000 jobs, per leaked audio. The behavior cloning strategy uses internal problem-solving steps as training data.

100% relevant

How to Automate Meeting Notes and Action Items with Read AI's MCP Server

Integrate Read AI's MCP server with Claude Code to transform meeting audio into structured notes, decisions, and code-ready tasks without leaving your IDE.

88% relevant

Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS

Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.

97% relevant

Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5

Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.

95% relevant

Halter's Solar-Powered Cattle Collars Hit $2B Valuation, Using AI to Replace Physical Fences

Livestock tech company Halter reached a $2 billion valuation by replacing physical fences with solar-powered collars that herd cattle using AI-driven vibrations and audio cues. The system turns cows into data streams managed through a smartphone app.

85% relevant

Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications

Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.

77% relevant

ATLAS: Pioneering Lifelong Learning for AI That Sees and Hears

Researchers introduce the first continual learning benchmark for audio-visual segmentation, addressing how AI systems can adapt to evolving real-world environments without forgetting previous knowledge. The ATLAS framework uses audio-guided conditioning and low-rank anchoring to maintain performance across dynamic scenarios.

75% relevant

DeepMind's Diffusion Breakthrough: Training Better Latents for Superior AI Generation

Google DeepMind researchers have developed new techniques for training latent representations in diffusion models, potentially leading to more efficient, higher-quality AI-generated content across images, audio, and video domains.

85% relevant

No-Code Revolution: How AI-Powered Platforms Are Democratizing Software Development

AI-powered no-code platforms are enabling non-technical professionals to build complex software applications in record time. From construction procurement platforms to specialized audiobook apps, these tools are breaking down traditional barriers to software development.

85% relevant

Google's Lyria3: The Next Evolution in AI-Generated Music Composition

Google has unveiled Lyria3, its latest AI music generation model that promises unprecedented audio quality and creative control. This advancement represents a significant leap in musical AI capabilities with potential implications for creators and the music industry.

85% relevant

Google's Gemini Embedding 2 Unifies All Media Types in Single AI Framework

Google has launched Gemini Embedding 2, its first fully multimodal embedding model that maps text, images, video, audio, and documents into a single shared vector space. The breakthrough supports 100+ languages and flexible vector sizing for optimized performance.

95% relevant