gentic.news — AI News Intelligence Platform
audio

30 articles about audio in AI news

Wan-Streamer v0.1 Cuts Audio-Visual Interaction Latency to 200ms in Single

Wan-Streamer v0.1 achieves 200ms model-side latency in a single Transformer for full-duplex audio-visual interaction, eliminating cascaded modules. The paper lacks parameter count and benchmark comparisons, limiting reproducibility.

75% relevant

Gemini 3.5 Live Translate Debuts as Real-Time Audio Model

Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.

87% relevant

mlx-audio v0.4.3 Ships 6 New TTS Models, Slimmer Deps

mlx-audio v0.4.3 adds 6 TTS models, server concurrency, and slims dependencies, targeting Apple Silicon developers.

85% relevant

NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text

NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.

93% relevant

Pretrained Audio Models Underperform in Music Recommendation, New Research Shows

A new study evaluates nine pretrained audio models for music recommendation, finding a significant performance gap between traditional MIR tasks and both hot-start and cold-start recommendation scenarios.

80% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

95% relevant

ByteDance's OmniShow Unifies Text, Image, Audio, Pose for Video Gen

ByteDance introduced OmniShow, a unified multimodal framework for video generation that accepts text, reference images, audio, and pose inputs simultaneously. It claims state-of-the-art performance across diverse conditioning settings.

85% relevant

PixVerse V6 Launches: 15-Second 1080P Video with Full Audio

AI video startup PixVerse launched its V6 model, capable of generating 15-second, 1080p videos with full audio from text prompts. This marks a significant upgrade in output length and quality for the platform.

89% relevant

Microsoft's 'Markdownify' Converts PDFs, Audio, Video to Clean LLM Markdown

Microsoft launched 'Markdownify', a Python tool that converts PDFs, Word docs, Excel, PowerPoint, audio, and YouTube URLs into clean Markdown. This addresses a major pain point in AI pipelines where raw file parsing breaks context and structure.

85% relevant

Qwen3.5-Omni Demonstrates 'Audio-Visual Vibe Coding' as an Emergent Ability

Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific training. This suggests a significant leap in multimodal reasoning for a model already positioned as a strong GPT-4 competitor.

85% relevant

Alibaba's Qwen3.5-Omni Launches with Script-Level Captioning, Audio-Visual Vibe Coding, and Real-Time Web Search

Alibaba's Qwen team has released Qwen3.5-Omni, a multimodal model focused on interpreting images, audio, and video with new capabilities like script-level captioning and 'vibe coding'. It's open-access on Hugging Face but does not generate media.

85% relevant

Microsoft's VibeVoice Family Processes 60-Minute Audio in Single Pass, Eliminates Chunking for ASR & TTS

Microsoft open-sourced VibeVoice, a family of speech AI models that processes up to 60 minutes of audio without chunking. It delivers structured transcriptions with speaker diarization and generates 90-minute multi-speaker speech in one pass.

99% relevant

Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2

A new open-source CLI tool called Insanely Fast Whisper achieves a 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching, with no reported quality loss.

97% relevant

Mistral AI Releases Voxtral TTS: 4B-Parameter Open-Weight Model Clones Voices from 3-Second Audio in 9 Languages

Mistral AI has launched Voxtral TTS, its first open-weight text-to-speech model. The 4B-parameter model clones voices from three seconds of reference audio across nine languages, with a latency of 70ms, and scored higher on naturalness than ElevenLabs Flash v2.5 in human tests.

95% relevant

Waves Audio Launches Lightning V3.1: 10-Second Voice Cloning with 44.1kHz Studio Quality

Waves Audio released Lightning V3.1, a voice cloning model that creates studio-quality voice replicas from just 10 seconds of audio with under 100ms latency. The update supports over 50 languages and targets real-time applications.

87% relevant

DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness

Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.

99% relevant

Fish Audio S2 Enables Word-Level Speech Control with Positional Tags, Beats GPT-4o in Human Preference Tests

Fish Audio S2 introduces a 100% open-source TTS model that uses inline positional tags for word-level vocal control, achieving 8/10 wins against GPT-4o and Gemini in human preference tests while generating audio nearly 5x faster than real-time.

95% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models, while maintaining multimodal fidelity.

92% relevant

OpenAI's Bidirectional Audio Breakthrough: The End of Awkward AI Conversations

OpenAI is developing a bidirectional audio model that processes speech continuously, allowing AI to adapt instantly to interruptions. This could revolutionize voice assistants and customer support by making conversations feel truly natural.

95% relevant

JAEGER Breaks the 2D Barrier: How 3D Audio-Visual AI Could Transform Robotics and AR

Researchers introduce JAEGER, a framework that extends audio-visual large language models into 3D space using RGB-D and spatial audio. This breakthrough enables AI to understand and reason about physical environments with unprecedented spatial awareness.

70% relevant

OpenAI's Audio Revolution: New Voice Models Signal Major AI Advancements

OpenAI appears poised to release new audio models that could significantly enhance voice interaction capabilities. This development follows recent trademark filings and suggests major improvements to voice mode technology.

85% relevant

Meta Trains Coding AI on Engineers' Work Traces as 8K Jobs Cut

Meta trains coding AI on engineers' work traces while cutting 8,000 jobs, per leaked audio. The behavior cloning strategy uses internal problem-solving steps as training data.

100% relevant

Microsoft’s VibeVoice: Open-Source Speech-to-Text with Diarization

Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion on an M5 MacBook, transcribing 1 hour of audio in ~9 minutes using ~60GB RAM.

85% relevant

How to Automate Meeting Notes and Action Items with Read AI's MCP Server

Integrate Read AI's MCP server with Claude Code to transform meeting audio into structured notes, decisions, and code-ready tasks without leaving your IDE.

88% relevant

OpenVoice v2: Complete Voice Cloning Directory Launches on GitHub

A developer has compiled and released a comprehensive directory of open-source voice cloning tools and resources on GitHub. This centralizes access to models, datasets, and training code, lowering the barrier to entry for AI audio development.

85% relevant

Google Launches AI Edge Eloquent: Free, Offline-First Dictation App on iOS

Google has quietly launched AI Edge Eloquent, a free, subscription-less dictation app for iOS. It uses a Gemma-based speech recognition model to process audio locally, removing filler words and self-corrections to produce cleaner text.

97% relevant

Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5

Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.

95% relevant

Open-Source Web UI 'LLM Studio' Enables Local Fine-Tuning of 500+ Models, Including GGUF and Multimodal

LLM Studio, a free and open-source web interface, allows users to fine-tune over 500 large language models locally on their own hardware. It supports GGUF-quantized models, vision, audio, and embedding models across Mac, Windows, and Linux.

85% relevant

Single Pane: The Terminal-First Workspace Built for Claude Code

A new macOS app consolidates your terminal, file manager, and markdown editor into one window, with native hooks for Claude Code audio notifications.

95% relevant

OpenHome Launches Open-Source Voice Assistant Platform with Full Local Processing

OpenHome has launched an open-source voice assistant platform that processes all audio and commands locally on-device, positioning itself as a privacy-focused alternative to cloud-based services like Amazon Alexa.

85% relevant