
Mira Murati presents a chart comparing Thinking Machines' error rate and inference cost against frontier models…

Mira Murati's Thinking Machines beats frontier models by 29.8% with Bridgewater's expert judgments

Thinking Machines made 29.8% fewer errors than the best frontier model by training on Bridgewater's expert judgments, at 13.8x lower inference cost.

Ala SMITH & AI Research Desk · 11h ago · 3 min read · AI-Generated

Source: x.com via @rohanpaul_ai (corroborated)

How did Mira Murati's Thinking Machines make 29.8% fewer errors than frontier models?

Mira Murati's Thinking Machines, using Bridgewater's private expert judgments, beat the best frontier model with 29.8% fewer mistakes and 13.8x lower inference cost. The model learned expert taste in filtering finance documents, not just surface language.

TL;DR

Thinking Machines makes 29.8% fewer errors than the best frontier model.
Bridgewater's private expert judgment was made trainable.
Inference cost is 13.8x lower than the best frontier model's.

Mira Murati's Thinking Machines made 29.8% fewer errors than the best frontier model by using Bridgewater's private expert judgments. The system achieved this at 13.8x lower inference cost by training on expert-labeled finance documents.

Key facts

29.8% fewer errors vs best frontier model.
13.8x lower inference cost.
Naive prompts yield 46-50% accuracy, expert prompts 74-78%.
Training used CISPO loss, proposed by MiniMax in 2025.
Bridgewater provided high-quality expert labels.

According to @rohanpaul_ai, Mira Murati's Thinking Machines made Bridgewater's private expert judgment trainable, beating frontier models with 29.8% fewer errors. With naive prompts, all tested models sit around coin-flip accuracy, roughly 46% to 50%. Expert prompts lift them sharply, to about 74% to 78% average accuracy.

The workflow was filtering finance articles, reports, central-bank documents, and emails to decide what investors should read. The result is a serious signal for enterprise AI: on this task, bringing private expert judgment into the training loop beat general intelligence.

The taste problem

The problem was not reading finance documents, because frontier LLMs can already read them. The harder task was deciding which facts deserve attention inside an investor's workflow. A tariff headline can move markets, while another geopolitical headline may add no signal.

The breakthrough came from replacing written rules with high-quality labels from expert investors. Non-expert labels failed because the task depends on taste, not surface financial language. Bridgewater cleaned those labels by sending model-disputed cases back to experts for review. The model then learned patterns that experts could recognize, but could not fully verbalize.
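The label-cleaning loop described above can be sketched roughly as follows. This is a minimal illustration, not Bridgewater's actual pipeline: the `LabeledDoc` type, the `disputed_cases` function, and the `margin` threshold are all hypothetical names and values. The source says only that model-disputed cases were sent back to experts for review.

```python
from dataclasses import dataclass


@dataclass
class LabeledDoc:
    """Hypothetical record: one document with an expert label and a model score."""
    doc_id: str
    expert_label: bool  # True means "an investor should read this"
    model_prob: float   # model's estimated probability that the label is True


def disputed_cases(docs, margin=0.3):
    """Return docs where the model confidently contradicts the expert label.

    `margin` is an assumed confidence threshold: the model's probability must
    be at least `margin` away from 0.5 before a disagreement counts as a dispute.
    """
    flagged = []
    for doc in docs:
        model_says_read = doc.model_prob >= 0.5
        confident = abs(doc.model_prob - 0.5) >= margin
        if confident and model_says_read != doc.expert_label:
            flagged.append(doc)
    return flagged


# Illustrative data only; none of these values come from the source.
docs = [
    LabeledDoc("a", True, 0.90),   # model agrees with the expert
    LabeledDoc("b", True, 0.10),   # model confidently disagrees
    LabeledDoc("c", False, 0.60),  # model disagrees, but not confidently
    LabeledDoc("d", False, 0.95),  # model confidently disagrees
]
review_queue = [doc.doc_id for doc in disputed_cases(docs)]
```

Only the confidently disputed items go back to experts, which keeps review effort focused on the labels most likely to be wrong or most informative.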

Training architecture

Training used interleaved batches, CISPO loss, and on-policy distillation from stronger teacher checkpoints. Interleaving helped the model share judgment across tasks without blending them into noise. CISPO controlled policy updates, so learning stayed aggressive without drifting into brittle shortcuts. (CISPO is a reinforcement-learning loss that caps how strongly each generated token can update the model, improving training stability while keeping useful rare tokens active. It was first proposed by the MiniMax team in 2025.) On-policy distillation penalized moves away from better teachers, then promoted stronger checkpoints.
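To make the "caps each token's update while keeping rare tokens active" idea concrete, here is a minimal sketch that compares per-token gradient weights under a CISPO-style loss and under the standard PPO clipped surrogate. It assumes the commonly described CISPO form, in which the clipped importance-sampling weight is treated as a constant (stop-gradient) multiplying the advantage and the token log-probability. The clipping values and sample numbers are illustrative assumptions. The source does not disclose Thinking Machines' actual configuration.

```python
import math


def clipped_weight(logp_new, logp_old, eps_low, eps_high):
    """Importance ratio pi_new / pi_old, clipped to [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def cispo_grad_coeffs(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """Per-token d(loss)/d(logp_new) for a CISPO-style loss.

    Loss = -mean(stop_grad(clipped_weight) * advantage * logp_new).
    The clipped weight acts as a constant, so every token keeps a
    bounded, nonzero gradient (assumed eps values, for illustration).
    """
    n = len(logp_new)
    return [
        -clipped_weight(ln, lo, eps_low, eps_high) * adv / n
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ]


def ppo_grad_coeffs(logp_new, logp_old, advantages, eps=0.2):
    """Per-token d(loss)/d(logp_new) for the PPO clipped surrogate.

    Loss = -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A)).
    When the clipped branch is selected, that token's gradient is zero.
    """
    n = len(logp_new)
    coeffs = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        if r * adv <= clipped * adv:
            coeffs.append(-r * adv / n)  # unclipped branch: d(r*A)/d(logp) = r*A
        else:
            coeffs.append(0.0)           # clipped branch: token is dropped
    return coeffs


# Two tokens: the first is unchanged, the second doubled in probability.
logp_old = [math.log(0.5), math.log(0.2)]
logp_new = [math.log(0.5), math.log(0.4)]
advantages = [1.0, 1.0]

cispo = cispo_grad_coeffs(logp_new, logp_old, advantages)
ppo = ppo_grad_coeffs(logp_new, logp_old, advantages)
```

In this toy case the second token's ratio exceeds the upper clip bound. PPO zeroes its gradient, while the CISPO-style weight is capped but still contributes, which is the behavior the parenthetical above describes.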

The result beat the best frontier model, with 29.8% fewer mistakes and 13.8x lower inference cost. The company did not disclose the exact model architecture or parameter count.
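The two headline figures are relative, not absolute: "29.8% fewer mistakes" scales the baseline's error rate, and "13.8x lower cost" divides the baseline's cost. The short sketch below works through that arithmetic. The baseline accuracy of 76% is a hypothetical placeholder, chosen as the midpoint of the reported expert-prompt range. The source does not state the frontier baseline's actual accuracy, so the derived numbers are illustrative only.

```python
RELATIVE_ERROR_REDUCTION = 0.298  # "29.8% fewer mistakes" (from the article)
COST_RATIO = 13.8                 # "13.8x lower inference cost" (from the article)


def specialized_error(baseline_error):
    """Error rate after a 29.8% relative reduction from the baseline."""
    return baseline_error * (1.0 - RELATIVE_ERROR_REDUCTION)


def specialized_cost(baseline_cost):
    """Inference cost when it is 13.8x lower than the baseline."""
    return baseline_cost / COST_RATIO


# HYPOTHETICAL baseline: 76% accuracy is an assumption, not a reported figure.
baseline_accuracy = 0.76
baseline_error = 1.0 - baseline_accuracy

model_error = specialized_error(baseline_error)
model_accuracy = 1.0 - model_error
model_cost = specialized_cost(1.0)  # cost relative to a baseline of 1.0 unit

print(f"error: {baseline_error:.3f} -> {model_error:.3f}")
print(f"accuracy: {baseline_accuracy:.3f} -> {model_accuracy:.3f}")
print(f"relative cost: 1.000 -> {model_cost:.3f}")
```

Under that assumed baseline, a 29.8% relative error reduction corresponds to a gain of about seven percentage points of accuracy, not 29.8 points.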

What to watch


Watch for Thinking Machines' next benchmark results on enterprise-specific reasoning tasks, and whether other hedge funds adopt similar expert-in-the-loop training for proprietary workflows. A public paper or blog post detailing CISPO integration and model architecture would be the next signal.

Source: gentic.news · 11h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

This result challenges the assumption that frontier general intelligence is the ceiling for enterprise AI. Instead, it shows that private, expert-curated judgment (taste, not just language) can be made trainable and yield superior results at dramatically lower cost. The 13.8x inference cost advantage is particularly striking: it suggests that smaller, specialized models fine-tuned on expert labels can outperform massive general models on specific workflows.

The use of CISPO loss, originally proposed by MiniMax in 2025, indicates that Thinking Machines is leveraging recent advances in reinforcement learning stability. This is a departure from standard SFT or RLHF, which might overfit to surface patterns. The interleaved batch training also hints at a multi-task approach where shared judgment signals are preserved across domains.

However, the source lacks detail on model size, training compute, and exact benchmark methodology. The claim of 'best frontier model' is vague: was this GPT-4o, Claude 3.5, Gemini Ultra, or something else? Without a named baseline, the result is harder to evaluate rigorously. Still, the direction is clear: enterprise AI's moat may be proprietary data and expert taste, not raw scale.

#finance #enterprise-ai #startups #reinforcement-learning



