generative audio

30 articles about generative audio in AI news

ByteDance's OmniShow Unifies Text, Image, Audio, Pose for Video Gen

ByteDance introduced OmniShow, a unified multimodal framework for video generation that accepts text, reference images, audio, and pose inputs simultaneously. It claims state-of-the-art performance across diverse conditioning settings.

Apr 14, 202685% relevant

Microsoft's 'Markdownify' Converts PDFs, Audio, Video to Clean LLM Markdown

Microsoft launched 'Markdownify', a Python tool that converts PDFs, Word docs, Excel, PowerPoint, audio, and YouTube URLs into clean Markdown. This addresses a major pain point in AI pipelines where raw file parsing breaks context and structure.

Apr 8, 202685% relevant

Alibaba's Qwen3.5-Omni Launches with Script-Level Captioning, Audio-Visual Vibe Coding, and Real-Time Web Search

Alibaba's Qwen team has released Qwen3.5-Omni, a multimodal model focused on interpreting images, audio, and video with new capabilities like script-level captioning and 'vibe coding'. It's open-access on Hugging Face but does not generate media.

Mar 30, 202685% relevant

OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency

Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.

Mar 16, 202692% relevant

Google DeepMind's Unified Latents Framework: Solving Generative AI's Core Trade-Off

Google DeepMind introduces Unified Latents (UL), a novel framework that jointly trains diffusion priors and decoders to optimize latent space representation. This approach addresses the fundamental trade-off between reconstruction quality and learnability in generative AI models.

Feb 28, 202675% relevant

Game Studios Show Wide Variance in AI Adoption, Wharton Report Finds

A Wharton School report, based on interviews at 20 game studios, finds a wide spectrum of organizational approaches to adopting generative AI tools, from aggressive integration to active resistance.

Apr 10, 202685% relevant

Spotify's AI Music Boom Redirects Millions in Royalties from Human Artists, Report Claims

A report indicates the surge in AI-generated music on Spotify is redirecting millions of dollars in royalty payments away from human artists and toward AI content creators. This highlights the immediate financial impact of generative AI on creative industries.

Apr 2, 202685% relevant

Mistral AI Launches Voxtral TTS: 3B-Parameter Open-Source Model Claims 63% Win Rate Over ElevenLabs Flash v2.5

Mistral AI released Voxtral TTS, a 3-billion-parameter open-weights text-to-speech model. It reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, runs on 3 GB RAM, and clones voices from 5 seconds of audio.

Mar 26, 202695% relevant

DeepMind's Diffusion Breakthrough: Training Better Latents for Superior AI Generation

Google DeepMind researchers have developed new techniques for training latent representations in diffusion models, potentially leading to more efficient, higher-quality AI-generated content across images, audio, and video domains.

Feb 26, 202685% relevant

The AI Music Revolution: How Google and Apple Are Democratizing Music Creation

Google and Apple are integrating generative AI music features into their core platforms, allowing users to create custom 30-second tracks from text, photos, or video prompts. This move signals AI's transition from experimental tools to mainstream consumer applications.

Feb 18, 202670% relevant

Google Launches $0.034 Image Model, Video API for Gemini

Google launched Nano Banana 2 Lite ($0.034/image, 4-second generation) and Gemini Omni Flash ($0.10/second video API), targeting high-throughput developer pipelines.

Jun 30, 202687% relevant

Google Releases Magenta RealTime 2 for Open-Weight Music Generation

Google released Magenta RealTime 2 on Hugging Face, the only open-weights model for real-time continuous music generation on device with ~200ms latency.

Jun 3, 202685% relevant

MiniMax AI Powers Wati's Astra Voice 2.0 for WhatsApp Business

MiniMax AI is providing its voice technology to power Wati's Astra Voice 2.0 platform, enabling businesses to deploy conversational voice AI on WhatsApp in multiple languages.

Apr 16, 202685% relevant

IBM Demonstrates Extreme Scale for Content-Aware Storage with 100-Billion

IBM Research announced a breakthrough in vector database technology, achieving storage capacity of 100 billion vectors. This enables content-aware storage systems that can understand and retrieve data based on semantic meaning rather than just metadata.

Apr 13, 202682% relevant

OpenBMB's VoxCPM 2: 2B-Param Open-Source TTS for Multilingual Voice

OpenBMB launched VoxCPM 2, a 2-billion-parameter open-source text-to-speech model. It generates multilingual, emotionally expressive speech from text descriptions and runs on consumer-grade hardware.

Apr 13, 202697% relevant

OpenMontage: Open-Source Agentic Video Production System Costs $0.69 Per Ad

OpenMontage, an open-source agentic video production system, has been released. It orchestrates 11 pipelines and 49 tools across multiple AI providers to autonomously script, generate assets, edit, and render videos from a plain language prompt.

Apr 11, 202699% relevant

NemoVideo AI Automates Video Editing Based on Text Prompts

A video creator states NemoVideo AI now automates complex editing tasks like cuts and transitions from simple text descriptions, reducing a 5-hour manual process to a prompt-driven workflow.

Apr 5, 202685% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

Apr 5, 202675% relevant

OpenAI's GPT-Image-2 Model Reportedly Achieves Photorealistic Video Generation, Surpassing Prior Map-Generation Flaws

A social media user claims OpenAI's GPT-Image-2 model now produces video indistinguishable from reality, a significant leap from its predecessor's documented failure to generate coherent world maps.

Apr 4, 202685% relevant

Diffusion Recommender Models Fail Reproducibility Test: Study Finds 'Illusion of Progress' in Top-N Recommendation Research

A reproducibility study of nine recent diffusion-based recommender models finds only 25% of reported results are reproducible. Well-tuned simpler baselines outperform the complex models, revealing a conceptual mismatch and widespread methodological flaws in the field.

Mar 30, 202682% relevant

Geometric Latent Diffusion (GLD) Achieves SOTA Novel View Synthesis, Trains 4.4× Faster Than VAE

GLD repurposes features from geometric foundation models like Depth Anything 3 as a latent space for multi-view diffusion. It trains significantly faster than VAE-based approaches and achieves state-of-the-art novel view synthesis without text-to-image pretraining.

Mar 28, 202695% relevant

Google Lyria 3 Pro Music AI Demoed: Generates '1990s Boy Band' Version of Rilke Poetry

A researcher gained early access to Google's Lyria 3 Pro music generation AI, demonstrating its ability to transform Rainer Maria Rilke's 'First Elegy' into a 1990s boy band track. The demo highlights rapid stylistic remixing capabilities not yet publicly available.

Mar 25, 202685% relevant

VHS: Latent Verifier Cuts Diffusion Model Verification Cost by 63.3%, Boosts GenEval by 2.7%

Researchers propose Verifier on Hidden States (VHS), a verifier operating directly on DiT generator features, eliminating costly pixel-space decoding. It reduces joint generation-and-verification time by 63.3% and improves GenEval performance by 2.7% versus MLLM verifiers.

Mar 25, 202695% relevant

DiffGraph: An Agent-Driven Graph Framework for Automated Merging of Online Text-to-Image Expert Models

Researchers propose DiffGraph, a framework that automatically organizes and merges specialized online text-to-image models into a scalable graph. It dynamically activates subgraphs based on user prompts to combine expert capabilities without manual intervention.

Mar 24, 202695% relevant

Minimax to Release Open Weights in Two Weeks, Highlighting Chinese Startup Momentum

Chinese AI startup Minimax announced it will release open weights within two weeks. This follows a pattern of rapid open-source releases from Chinese firms, contrasting with Meta's more controlled approach.

Mar 22, 202685% relevant

8 AI Model Architectures Visually Explained: From Transformers to CNNs and VAEs

A visual guide maps eight foundational AI model architectures, including Transformers, CNNs, and VAEs, providing a clear reference for understanding specialized models beyond LLMs.

Mar 21, 202685% relevant

Tongyi Lab Releases World's First Open-Source Multi-Speaker AI Dubbing Model

Alibaba's Tongyi Lab has released the first open-source AI model capable of dubbing multi-speaker conversations, addressing one of the hardest problems in AI video generation. The model synchronizes voice with lip movements across multiple speakers in a single pass.

Mar 17, 202685% relevant

ElevenLabs Unleashes 'Flows': The Unified AI Creative Suite That Could Revolutionize Content Production

ElevenLabs has launched Flows, a groundbreaking AI platform that seamlessly integrates image, video, voice, music, and sound effects generation into a single visual pipeline. This eliminates tool-switching and re-exporting, potentially transforming creative workflows.

Mar 13, 202685% relevant

AI Video Generation Goes Mainstream: Text-to-Video Assistant Skill Emerges

A new AI skill called Medeo Video Skill for OpenClaw allows users to generate complete videos through simple text commands. Users can request videos on any topic, and the AI handles the entire creation process automatically.

Mar 13, 202689% relevant

CostRouter Emerges as Smart AI Gateway, Cutting API Expenses by 60% Through Intelligent Model Routing

A new API gateway called CostRouter analyzes request complexity and automatically routes queries to the cheapest capable AI model, saving developers up to 60% on API costs while maintaining quality thresholds.

Mar 12, 202679% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety