Essay 04 · interaction · the brain

The end of turn-taking.

On May 11, 2026, Thinking Machines Lab broke an 18-month silence to propose a 200 ms micro-turn as the unit of computation. Five days later, OpenBMB open-sourced MiniCPM-o 4.5, a 9B-parameter full-duplex omni-modal model that runs on a MacBook with under 12 GB of RAM.

Both point at the same architectural shift: text, audio, video, and speech generation collapsing onto a single timeline at the model level. This brief covers what the architecture actually is, why "omni" is doing a lot of work in the marketing, and where the next bottleneck moved — to teaching the model when not to speak.

tl;dr · 60 seconds

01 200 ms micro-turn replaces request/response. Thinking Machines proposes time-aligned audio + video + text on a single timeline at the model level.
02 MiniCPM-o 4.5 = open-source proof. 9B parameters, full-duplex on a MacBook in <12 GB RAM. 77.6 on OpenCompass beats GPT-4o on vision-language.
03 The blueprint is Moshi (2024). Low-frame-rate codec (12.5 Hz Mimi) + hierarchical transformer + Inner Monologue. Coarse tokens at ~10 Hz = 3 min audio in 2K context.
04 vLLM-Omni is the production substrate the field was missing. Up to 91.4% JCT reduction on Qwen3-Omni vs baseline.
05 The new bottleneck is semantic backchanneling. The "mhm / yeah / right" cadence. Nobody has shipped it. Whoever cracks it owns voice UX for the next two years.

two anchor projects

One closed-source, one open. Published five days apart.

The architectural thesis is the same. The marketing is different.

May 11, 2026 · Thinking Machines Lab — Mira Murati

TML-Interaction-Small

Real-time interactivity is a native model capability, not scaffolding bolted around a turn-based LLM. The unit of computation is the 200 ms micro-turn. The system splits into a co-trained interaction model that stays live with the user and a background model that runs reasoning + tools asynchronously.
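The interaction/background split can be pictured with a toy simulation. This is a minimal sketch, not Thinking Machines' implementation: `BackgroundJob`, `run_interaction`, and the tick counts are hypothetical names and values chosen for illustration, and real concurrency is replaced by a simple "ready at tick N" check so the behavior is deterministic. Only the 200 ms micro-turn length comes from the text.

```python
from dataclasses import dataclass

MICRO_TURN_MS = 200  # micro-turn length from the text


@dataclass
class BackgroundJob:
    """Hypothetical stand-in for the background model's slow reasoning."""
    started_at: int  # tick at which the job was launched
    duration: int    # ticks needed before the result is ready
    result: str

    def ready(self, tick: int) -> bool:
        return tick >= self.started_at + self.duration


def run_interaction(n_ticks: int, job: BackgroundJob) -> list[tuple[int, str]]:
    """Simulate the live loop: one decision per micro-turn, never blocking."""
    log = []
    delivered = False
    for tick in range(n_ticks):
        t_ms = tick * MICRO_TURN_MS
        if not delivered and job.ready(tick):
            # The background result is surfaced at the first micro-turn
            # after it becomes available.
            log.append((t_ms, f"speak: {job.result}"))
            delivered = True
        else:
            # The interaction model stays live instead of waiting.
            log.append((t_ms, "listen"))
    return log


log = run_interaction(6, BackgroundJob(started_at=0, duration=3, result="answer"))
for t_ms, action in log:
    print(f"{t_ms:>5} ms  {action}")
```

The point of the sketch is the control flow: the live loop makes a decision every micro-turn regardless of whether the slow path has finished.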

01 400 ms end-to-end response latency
02 77.8 on FD-bench v1.5 vs 46–54 for GPT-Realtime and Gemini Live
03 GPT-Realtime-2 minimal: 2.9% on cued-response timing, 0% on temporal action localization
04 99.0% Harmbench refusal rate

Thinking Machines Lab, May 11 Interaction Models announcement→

May 16, 2026 · OpenBMB / Tsinghua

MiniCPM-o 4.5

End-to-end omni-modal model — vision, audio in, LLM core, speech out — all densely connected via hidden states, no cascade. The unifying mechanism is Omni-Flow: a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis.

01 9B parameters: SigLip2 + Whisper-medium + Qwen3-8B + CosyVoice2
02 77.6 average on OpenCompass; beats GPT-4o, approaches Gemini 2.5 Flash
03 3.3× fewer parameters than Qwen3-Omni-30B-A3B for similar omni understanding
04 Runs full-duplex on edge devices with <12 GB RAM (MacBook demo)
05 16 quantized variants; vLLM, SGLang, llama.cpp, Ollama, FlagOS supported

arXiv:2604.27393, May 16 2026→

how "same timeline" actually works

The Moshi blueprint.

The most thoroughly documented blueprint is Kyutai's Moshi (October 2024). Everything Thinking Machines and OpenBMB describe in 2026 is recognizably descended from its three-component recipe.

Low-frame-rate neural audio codec

Mimi runs at 12.5 Hz with Split RVQ: codebook 1 is semantic (distilled from WavLM), the rest are acoustic. At frame rates this low (roughly 10 Hz), a 2,048-token context covers about 3 minutes of audio, enough for minute-long full-duplex without context blow-up.
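The context arithmetic is easy to check. The sketch below assumes one token per frame along the time axis, which is how the text frames it; `context_seconds` is a hypothetical helper name.

```python
def context_seconds(context_tokens: int, frame_rate_hz: float) -> float:
    """Seconds of audio covered when each frame costs one context token."""
    return context_tokens / frame_rate_hz


# 10 Hz is the round figure used in the text, 12.5 Hz is Mimi's rate,
# and 75 Hz is EnCodec's rate, included for contrast.
for rate_hz in (10.0, 12.5, 75.0):
    secs = context_seconds(2048, rate_hz)
    print(f"{rate_hz:>5} Hz -> {secs:6.1f} s ({secs / 60:.1f} min)")
```

Under this assumption a 2,048-token window spans about 3.4 minutes at 10 Hz and about 2.7 minutes at 12.5 Hz, but under half a minute at 75 Hz, which is why the frame rate dominates the budget.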

Hierarchical Transformer

A large Temporal Transformer (7B) runs along the time axis; a small Depth Transformer runs along the codebook axis at each step. At every 80 ms frame the model jointly predicts a text token, a semantic audio token, several acoustic audio tokens — and the user's audio tokens in parallel. Two streams run simultaneously; turn boundaries disappear.

Inner Monologue

Moshi predicts time-aligned text tokens before audio tokens at each frame. Unsexy but load-bearing: without it, end-to-end speech-to-speech tends toward fluent gibberish. The text-first trick is what makes the speech factually correct, not just acoustically natural.
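The per-frame ordering can be written out as a list of token slots. This is an illustrative sketch, not Moshi's code: the slot names are made up, the count of seven acoustic tokens per stream is an assumption (the text says only "several"), and giving the user's stream its own semantic slot and placing it after the model's stream are also assumptions. What comes from the text is the 80 ms frame and the rule that the text token precedes the audio tokens.

```python
FRAME_MS = 80    # frame period from the text (12.5 Hz)
N_ACOUSTIC = 7   # ASSUMPTION: the text says only "several" acoustic tokens


def frame_slots(n_acoustic: int = N_ACOUSTIC) -> list[str]:
    """Return the token slots predicted within one frame, in order."""
    # Inner Monologue: the text token comes first, then semantic, then acoustic.
    model = ["model/text", "model/semantic"]
    model += [f"model/acoustic_{i}" for i in range(1, n_acoustic + 1)]
    # ASSUMPTION: the user's stream mirrors the audio layout and comes last.
    user = ["user/semantic"]
    user += [f"user/acoustic_{i}" for i in range(1, n_acoustic + 1)]
    return model + user


def frame_timeline(n_frames: int) -> list[tuple[int, list[str]]]:
    """Pair each frame's start time (ms) with its slot list."""
    return [(i * FRAME_MS, frame_slots()) for i in range(n_frames)]


for t_ms, slots in frame_timeline(2):
    print(t_ms, slots[:3], "...", len(slots), "slots")
```

Read this way, the large model advances once per frame along the time axis, and the small one walks the slot list within each frame.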

The math of "same timeline" is not exotic. It is a shared positional embedding clock across modalities, plus cross-modal attention with relative time. The hard work was the codec, the training data, and the decision to throw away the turn boundary as an abstraction.
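The "shared clock" idea reduces to giving every modality the same integer position when its events fall in the same frame. The sketch below shows only that indexing step, under assumptions: the 80 ms frame is taken from the Moshi description, while `align`, the event tuples, and the payload strings are hypothetical. It does not model embeddings or attention.

```python
from collections import defaultdict

FRAME_MS = 80  # shared frame period, taken from the Moshi description


def align(events: list[tuple[int, str, str]]) -> dict[int, dict[str, list[str]]]:
    """Group (timestamp_ms, modality, payload) events by shared frame index."""
    frames: dict[int, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for t_ms, modality, payload in events:
        # Integer division puts every modality on the same clock.
        frames[t_ms // FRAME_MS][modality].append(payload)
    return {idx: dict(by_mod) for idx, by_mod in sorted(frames.items())}


# Hypothetical events: audio every 80 ms, video at irregular times, one word.
events = [
    (0, "audio", "a0"),
    (33, "video", "v0"),
    (80, "audio", "a1"),
    (100, "video", "v1"),
    (120, "text", "hi"),
    (167, "video", "v2"),
]

aligned = align(events)
for idx, by_mod in aligned.items():
    print(idx, by_mod)
```

Events that share a frame index would receive the same position on the timeline, whatever their modality or native sampling rate.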

the codec race

Frame rate matters more than parameter count.

SNAC's insight is decisive: coarse tokens at ~10 Hz mean a 2,048-token context covers about 3 minutes of consistent audio structure — the same math that lets Moshi and MiniCPM-o 4.5 do minute-long full-duplex without context blow-up.

Codec	Vendor	Year	RVQ Structure	Frame Rate	Bitrate
SoundStream	Google	2021	Fixed-rate RVQ	single rate	3–18 kbps
EnCodec	Meta	2022	Fixed-rate RVQ	75 Hz @ 24 kHz	~6 kbps
SNAC	Siuzdak	2024	Multi-scale RVQ	12 / 23 / 47 Hz	0.98 kbps
Mimi	Kyutai	2024	Split RVQ + WavLM distillation	12.5 Hz	~1.1 kbps
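The bitrate column follows from frame rate, codebook count, and codebook size. The sketch below is a worked check under assumptions that are not in the table: 8 codebooks of 2,048 entries for Mimi and 8 codebooks of 1,024 entries for EnCodec at its ~6 kbps setting. Treat those configurations as illustrative inputs; `bitrate_bps` is a hypothetical helper.

```python
import math


def bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a fixed-rate RVQ codec: frames * codebooks * bits."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


# ASSUMPTION: codebook counts and sizes are illustrative, not from the table.
mimi = bitrate_bps(12.5, n_codebooks=8, codebook_size=2048)
encodec = bitrate_bps(75.0, n_codebooks=8, codebook_size=1024)

print(f"Mimi:    {mimi / 1000:.1f} kbps")
print(f"EnCodec: {encodec / 1000:.1f} kbps")
```

With those inputs the formula gives 1.1 kbps and 6.0 kbps, in line with the table. With the same number of codebooks, the 6× gap in frame rate accounts for most of the difference in bitrate.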

the skeptic's brief

Three things in the 2026 narrative deserve a contrarian read.

"Native" is doing a lot of work

Moshi and MiniCPM-o 4.5 are emphatically end-to-end. But TML-Interaction-Small's split into "interaction model + background model" IS a cascade, just one operating below the user's perceptual threshold. The marketing claim that competitors are "scaffolding" while TML is "native" is rhetorical; the architectural difference is one of degree, not kind.

FD-bench v1.5 is graded by a player with a horse in the race

Thinking Machines is the loudest proponent of full-duplex benchmarking. Until a third party (HELM, Stanford CRFM, or MLCommons) replicates the 2.9% score reported for GPT-Realtime-2, treat it as advocacy rather than a finding. When the metric becomes the product, the metric stops being trustworthy.

The 12 GB RAM demo dodges the hard cases

OpenBMB's own paper concedes unstable speech in omni mode, mixed English-Chinese drift, and high web-demo latency. Moshi's 200 ms latency assumes an L4 GPU; CPU inference numbers are not advertised. The consumer experience of full-duplex on phones is still gated by accelerator availability, not algorithm design.

production substrate

vLLM-Omni — the inference engine the field was missing.

Released November 30 2025, currently at v0.18.0. The headline: an OmniStage abstraction that lets any-to-any architectures be decomposed into a graph of stages (encoder / prefill / decode / generation). Each stage is independently served, batched, and GPU-allocated, with unified inter-stage connectors routing audio waveforms, image patches, and text tokens through a single API.
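The stage-graph idea can be sketched generically. This is not vLLM-Omni's API: `Stage`, `run_pipeline`, and the lambda bodies are hypothetical, the graph is reduced to a linear chain, and the per-stage `max_batch` and `device` fields are plain metadata that this toy records but does not act on. Only the four stage names come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    """Hypothetical stage: a named function plus its own serving settings."""
    name: str
    fn: Callable[[Any], Any]
    max_batch: int = 1    # illustrative per-stage batching limit
    device: str = "gpu:0"  # illustrative per-stage placement


def run_pipeline(stages: list[Stage], request: Any) -> tuple[Any, list[str]]:
    """Pass one request through each stage in order and record the route."""
    trace = []
    value = request
    for stage in stages:
        value = stage.fn(value)
        trace.append(stage.name)
    return value, trace


# Toy stage bodies; real stages would run model components.
stages = [
    Stage("encoder", lambda wav: [round(x, 1) for x in wav], max_batch=8),
    Stage("prefill", lambda feats: {"kv": feats}, max_batch=4, device="gpu:1"),
    Stage("decode", lambda state: ["tok"] * len(state["kv"]), device="gpu:1"),
    Stage("generation", lambda toks: " ".join(toks), device="gpu:2"),
]

output, trace = run_pipeline(stages, [0.11, 0.52, 0.93])
print(trace)
print(output)
```

The design point is that each stage carries its own batching and placement settings, so a slow stage can be scaled without touching the others.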

The single result that matters: up to 91.4% reduction in job completion time on Qwen3-Omni versus baseline. This is the only credible production path for self-hosting omni-modal models in 2026.
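A percentage reduction in completion time converts to a speedup factor with one line of arithmetic. The helper name below is hypothetical; the 91.4% figure is the best case quoted above.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


best_case = speedup_from_reduction(0.914)
print(f"{best_case:.1f}x faster at best")
```

Since the source says "up to", roughly 11.6× is a ceiling for that workload, not a typical figure.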

vllm-project/vllm-omni — GitHub →

where the bottleneck moved

The new frontier is semantic backchanneling.

If you grant that 2026 has solved latency, modality coverage, and codec efficiency — and the data above suggests we are within striking distance — the next bottleneck is no longer about modeling.

Humans produce roughly three "mhm / yeah / right" events per minute that signal understanding without taking the turn. None of the published systems — Moshi, MiniCPM-o 4.5, GPT-4o Realtime, TML-Interaction-Small — demonstrate this convincingly in third-party recordings. They can listen while speaking; they cannot yet acknowledge while listening with the cadence humans expect.
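The cadence can be put in model-time units. This back-of-envelope sketch assumes backchannels are evenly spaced, which real conversation is not; the three-per-minute rate, the 200 ms micro-turn, and the 80 ms frame are taken from earlier in the text, and the variable names are illustrative.

```python
MICRO_TURN_MS = 200       # TML micro-turn
FRAME_MS = 80             # Moshi frame
BACKCHANNELS_PER_MIN = 3  # human rate quoted above

# Average gap between backchannels, assuming even spacing.
gap_ms = 60_000 // BACKCHANNELS_PER_MIN

# How many model decisions pass between one backchannel and the next.
micro_turns_per_backchannel = gap_ms // MICRO_TURN_MS
frames_per_backchannel = gap_ms // FRAME_MS

print(f"one backchannel every {gap_ms / 1000:.0f} s")
print(f"= 1 in {micro_turns_per_backchannel} micro-turns")
print(f"= 1 in {frames_per_backchannel} frames")
```

On these numbers the right action at about 99 of every 100 micro-turns is to stay quiet, which makes the rare well-timed "mhm" a sparse target to learn.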

This is the unsolved UX problem of the next 24 months. Whoever cracks it — and there is no public hint that anyone has — will own the next phase of voice interaction the way OpenAI owned text chat for the previous three years.

The corollary: the consumer-relationship winner of full-duplex AI is probably not the lab with the best benchmark. It is the lab with the deepest dialogue-corpus access and the most rigorous treatment of conversational pragmatics. That favors Anthropic and Google over OpenAI and Thinking Machines — and leaves a real opening for open-source. MiniCPM-o 4.5's permissive license + 12 GB RAM footprint mean the first viable backchanneling implementation could come from a research lab, not a hyperscaler.

what to watch

Six milestones over the next nine months.

They will tell us whether 2026 was the inflection year or another false dawn.

Q3–Q4 2026

TML-Interaction-Large

Murati confirmed scaling to a larger pretrained base is a 2026 project. The current model is intentionally small; the next tests whether the architecture holds at frontier scale.

Quarterly

MiniCPM-o 5 / Qwen-Omni-Next

OpenBMB ships every quarter. The next generation will likely close the English-Chinese drift gap that the current paper acknowledges.

Imminent

vLLM-Omni 1.0

Currently at 0.18.0 with diffusion + audio + RL execution. The 1.0 milestone unlocks production self-hosting of omni models without GPU oversubscription games.

June 2026

Apple WWDC + on-device omni

The 12 GB RAM threshold is now within reach on iPhone 17 Pro / M5 MacBook. Apple Intelligence's omni-modal posture signals whether the consumer relationship belongs to the platform or the model lab.

Open question

FD-bench v2 governance

If FD-bench moves to a neutral host (MLCommons, HELM), the field gets a credible metric. If it stays under Thinking Machines, the 77.8 score reads as marketing.

The big one

First credible semantic-backchanneling demo

None has shipped publicly. Humans produce ~3 "mhm / yeah / right" events per minute that signal understanding without taking the turn. Whoever cracks this owns the next two years of voice UX.

summary line

2026 is not the year omni-modal interaction became possible — it became possible in 2024 with Moshi.

It is the year the architecture finalized, the open-source baseline got serious, and the production serving substrate (vLLM-Omni) arrived. The remaining frontier is no longer about modeling. It is about pragmatics — about teaching the model when not to speak.

primary sources