Wan-Streamer v0.1, a single Transformer from Lianghua Huang et al., achieves 200ms model-side latency for full-duplex audio-visual interaction. The model jointly learns perception, reasoning, and generation across text, audio, and video without external VAD, ASR, or TTS modules.
Key facts
- 200ms model-side response latency
- 550ms total interaction latency with 350ms network delay
- 160ms streaming units at 25 fps
- Single Transformer handles text, audio, video I/O
- No external VAD, ASR, TTS, or video modules
The paper Wan-Streamer v0.1, submitted to arXiv on June 23, 2026, introduces a native-streaming foundation model that unifies language, audio, and video as both input and output within a single Transformer architecture. The key architectural innovation is block-causal attention, which enables incremental streaming by interleaving visual, audio, and text tokens in a single sequence while maintaining causality for real-time generation.
Wan-Streamer's design eliminates the traditional cascaded pipeline—voice activity detection (VAD), automatic speech recognition (ASR), language model, text-to-speech (TTS), audio-driven animation, and video generation modules—replacing it with a single end-to-end model. The authors report that this consolidation reduces pipeline latency and error accumulation, as cross-modal synchronization is learned jointly rather than engineered through separate components.
The model's streaming unit is 160ms at 25 fps, enabled by causal encoders and decoders, block-causal attention, and a low-latency multimodal token scheduler. Total interaction latency is approximately 550ms when combining 200ms model-side latency with 350ms bidirectional network latency. This positions Wan-Streamer for sub-second duplex audio-visual communication, a benchmark that previous cascaded systems have struggled to meet consistently.
However, the paper does not disclose the model's parameter count, training dataset size, or compute budget—details that would allow comparison with other multimodal models like GPT-4o or Gemini. The authors also do not provide benchmark results on standard multimodal tasks (e.g., visual question answering, speech recognition accuracy, or video captioning), making it difficult to assess trade-offs between latency and task performance. The claim of "end-to-end" learning is strong, but without ablation studies showing the contribution of each component, it remains unclear whether the unified approach sacrifices quality for speed.
Wan-Streamer's architecture shares conceptual lineage with block-causal attention used in models like FlashAttention-4, but the paper does not cite or compare against prior streaming Transformer work, such as Google's StreamingLLM or Meta's Efficient Streaming Language Model. The absence of such comparisons weakens the novelty claim.
What the numbers mean
The 200ms model-side latency is impressive for a system generating both audio and video output. For context, typical cascaded systems incur 300-500ms just for ASR+TTS, before any video generation. Wan-Streamer's 550ms total latency under 350ms network delay suggests the model could support real-time conversational AI with visual avatars—a use case that has been a long-standing goal for companies like Meta, Apple, and Tencent.
The hidden trade-off
The paper's silence on parameter count and dataset size is telling. A unified model that handles text, audio, and video simultaneously likely requires a very large model, which would increase inference cost despite the latency improvements. The authors do not discuss model efficiency in terms of FLOPs or memory bandwidth, which are critical for deployment on edge devices or in data centers at scale.
Key Takeaways
- Wan-Streamer v0.1 achieves 200ms model-side latency in a single Transformer for full-duplex audio-visual interaction, eliminating cascaded modules.
- The paper lacks parameter count and benchmark comparisons, limiting reproducibility.
What to watch
Watch for open-source release of model weights or code, which would allow independent verification of the 200ms latency claim. Also monitor for follow-up papers disclosing parameter count, training data, and benchmark comparisons against GPT-4o and Gemini on multimodal tasks.

Source: arxiv.org









