What Happened
Fish Audio has released S2, a fully open-source text-to-speech (TTS) model that enables precise, word-level control over vocal delivery, a capability the company says was previously unavailable in commercial or open-source systems. The model uses a novel training approach in which audio transcripts are annotated with inline vocal tags at exact word positions, allowing granular control over emotions, breaths, laughs, and other vocal effects.
How It Works: Positional Tagging and Unified Reward Design
Traditional TTS systems rely on coarse, global style labels (e.g., "angry clip," "whisper clip") that apply to entire sentences. Fish Audio S2 breaks from this paradigm by training on millions of hours of audio annotated with positional tags inserted directly into the transcript text.
For example, instead of labeling a clip as "angry," the transcript reads:
"I can't believe [angry] you did that [inhale] right in front of everyone."
The model learns that vocal control is local, precise, and tied to specific word positions.
The training pipeline incorporates a dual-purpose transcription model that serves both as the annotation source and as a reward signal during reinforcement learning (RL). This eliminates the typical mismatch between separate training and evaluation reward models. The RL setup uses three concurrent rewards to prevent gaming:
- Semantic accuracy: Correct words delivered with appropriate phrasing
- Acoustic quality: Clean audio output without artifacts
- Timbre similarity: Consistency with the reference speaker's voice characteristics
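The anti-gaming property of the three concurrent rewards can be sketched as follows. The scoring functions and the multiplicative combination here are illustrative assumptions (the announcement does not specify how the rewards are aggregated); the point is that optimizing one channel while ignoring the others yields little total reward.

```python
def combined_reward(semantic: float, acoustic: float, timbre: float) -> float:
    """Combine three judge scores, each assumed to lie in [0, 1].

    A multiplicative combination means a sample that games one reward,
    e.g. clean audio with the wrong words, still scores near zero overall.
    """
    for score in (semantic, acoustic, timbre):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be in [0, 1]")
    return semantic * acoustic * timbre

# Gaming a single channel collapses the total:
low = combined_reward(0.1, 0.95, 0.9)   # wrong words, clean audio → ~0.086
high = combined_reward(0.9, 0.9, 0.9)   # balanced quality → 0.729
```

An additive combination would let a policy trade one reward off against another; multiplying them forces all three criteria to be satisfied simultaneously.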
Performance Results
According to the announcement, Fish Audio S2 demonstrates significant advantages over existing models:
| Benchmark | Result | Notes |
| --- | --- | --- |
| Human preference (vs. GPT-4o & Gemini) | Wins 8 out of 10 times | Direct head-to-head evaluation |
| Human vs. AI deception | Fooled humans more often than not | GPT-4o "barely registers" on the same test |
| Vocal effects (breaths, laughs, hesitations) | Beat every model tested | Includes closed-source competitors |
| Inference speed | Nearly 5× faster than real time | First audio in <0.1 seconds |

Technical Implementation and Availability
The model weights, fine-tuning code, and full inference engine are 100% open-source, available through the project's repository. This contrasts with most advanced TTS systems from major AI labs, which remain either closed-source or available only through restricted APIs.
The architecture enables fine-grained control through simple text annotations, making it accessible for developers to implement precise vocal delivery without complex parameter tuning.
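As a hypothetical illustration of that workflow, the snippet below builds a tagged prompt programmatically. The `synthesize` call at the end stands in for whatever inference entry point the open-source engine exposes; its name and signature are assumptions, not the project's documented API.

```python
def tag(word: str, effect: str) -> str:
    """Prefix a word with an inline vocal tag, e.g. '[laugh] really'."""
    return f"[{effect}] {word}"

# Word-level delivery is specified entirely in text, with no parameter tuning.
prompt = " ".join([
    "I can't believe",
    tag("you", "angry"),
    "did that",
    tag("right", "inhale"),
    "in front of everyone.",
])
# audio = synthesize(prompt, reference_voice="speaker.wav")  # assumed API
```

The practical appeal is that the control surface is just a string: any system that can manipulate text, including an LLM writing dialogue, can direct the vocal performance.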
Context
Current state-of-the-art TTS systems from OpenAI (GPT-4o), Google (Gemini), and ElevenLabs offer impressive voice quality but lack granular temporal control. Users can select overall voice styles or emotions, but cannot specify exactly when those vocal characteristics should occur within a sentence. Fish Audio S2 addresses this limitation directly through its positional tagging approach.
Previous open-source TTS models like Coqui TTS, Tortoise-TTS, and XTTS have focused primarily on voice cloning and basic emotion control, without the word-level precision demonstrated by S2.



