Miso One, an 8B-parameter open-source text-to-speech model, reportedly delivers 110ms latency with real emotional range. Its weights are openly available, so teams can self-host and fine-tune without sending data to a third party.
Key Facts
- 8B-parameter TTS model.
- 110ms claimed latency, below typical human reaction time (~200ms).
- Weights fully open-source for self-hosting.
- Purpose-built for shorts, podcasts, education.
- No disclosed training data or benchmark scores.
Miso One is an 8B-parameter text-to-speech model that targets a persistent gap in open-source TTS: natural emotional expression. According to @omarsar0, the model captures warmth, hesitation, and excitement rather than the flat delivery typical of prior open-source TTS systems. At 110ms latency, it undercuts human reaction time (~200ms), making it viable for real-time voiceover in shorts, podcasts, and educational content.
The release is notable for its openness. Weights are fully available on GitHub, allowing developers to clone, self-host, and fine-tune without sending data to a third party. This contrasts with proprietary TTS APIs from ElevenLabs or OpenAI, which charge per character and require input text to be sent to the provider's servers. The 8B parameter count places Miso One between lightweight on-device models and the 13B+ class reportedly used by commercial TTS providers, suggesting a deliberate trade-off between generation quality and inference cost.
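For a sense of what self-hosting 8B parameters involves, the arithmetic below estimates weight storage at a few common numeric precisions. It is a rough sketch: the source does not say which precision the released weights use, and the figures cover weights only, not activations, caches, or framework overhead.

```python
# Rough weights-only memory estimate for an 8B-parameter model.
# Assumption: every parameter is stored at the listed precision.
# Activations, caches, and framework overhead are NOT included.
PARAMS = 8_000_000_000

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for precision, size in BYTES_PER_PARAM.items():
        print(f"{precision:>10}: {weight_memory_gb(PARAMS, size):.0f} GB")
```

At 16-bit precision the weights alone come to about 16 GB, so a single GPU with less memory than that would need quantized weights or offloading.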
What the source doesn't disclose: training data provenance, supported languages, or benchmark comparisons against other open-source TTS models like Bark, XTTS-v2, or CosyVoice. Without standardized evaluation (e.g., mean opinion scores on LibriTTS or naturalness ratings), claims of "real emotional range" remain subjective. The model's architecture (transformer, diffusion, or hybrid) is also unspecified, which matters for GPU memory requirements and integration complexity.
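A mean opinion score (MOS) is simply the average of listener ratings on a 1-5 scale, usually reported with a confidence interval. The sketch below shows that calculation using a normal approximation for the 95% interval. The ratings are made up for illustration and are not measurements of Miso One or any other model.

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-5 listener ratings.

    Uses a normal approximation (z = 1.96), which is only reasonable
    for larger rating sets.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = float(statistics.mean(ratings))
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * stderr


if __name__ == "__main__":
    # Made-up ratings for illustration only; not measurements of any model.
    example_ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
    mos, half_width = mean_opinion_score(example_ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Two systems are only meaningfully different on this metric when their intervals are clearly separated, which is why a headline naturalness claim without ratings is hard to assess.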
The release arrives amid a wave of open-source audio models. In the past 90 days, Meta released AudioCraft 2 with improved consistency, and Suno launched an open-source voice cloning kit. Miso One differentiates on latency and emotion focus, but the lack of a research paper or third-party audit means the community will need to validate performance independently.
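Independent validation of the latency claim could start with a small timing harness like the one below. This is a minimal sketch under stated assumptions: `fake_synthesize` is a hypothetical placeholder, not Miso One's API (which the source does not describe), and the harness times a full call, whereas the source does not say whether 110ms refers to total synthesis time or time to first audio.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency_ms(
    synthesize: Callable[[str], bytes], text: str, runs: int = 20
) -> dict[str, float]:
    """Time repeated calls to `synthesize` and summarize latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in backend: sleeps briefly, returns silent bytes."""
    time.sleep(0.005)
    return b"\x00" * 16


if __name__ == "__main__":
    # Swap fake_synthesize for a real model call to benchmark it.
    print(measure_latency_ms(fake_synthesize, "Hello, world.", runs=10))
```

Reporting the median alongside a tail percentile matters for real-time use, since occasional slow responses are more noticeable to listeners than the average.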
Key Takeaways
- Miso One, an 8B open-source TTS model, achieves 110ms latency with emotional range.
- Weights are fully open-source for self-hosting, but no benchmark data is provided.
What to Watch

The key metric is adoption velocity: how quickly the GitHub repo accumulates stars, forks, and third-party integrations (e.g., Hugging Face Spaces, Ollama support). Watch for a follow-up paper or blog post disclosing training data, MOS results, and supported languages; without those, enterprise adoption will likely stall. Also track whether ElevenLabs or Play.ht respond with open-weight tiers of their own.