gentic.news — AI News Intelligence Platform

Mira Murati presents a chart comparing Thinking Machines' error rate and inference cost against frontier models…
AI Research · Score: 94

Mira Murati's Thinking Machines makes 29.8% fewer errors than the best frontier model using Bridgewater's expert judgments

Trained on Bridgewater's expert judgments, Thinking Machines made 29.8% fewer errors than the best frontier model, at 13.8x lower inference cost.

11h ago · 3 min read · AI-Generated
How did Mira Murati's Thinking Machines make 29.8% fewer errors than frontier models?

Mira Murati's Thinking Machines, using Bridgewater's private expert judgments, beat the best frontier model with 29.8% fewer mistakes and 13.8x lower inference cost. The model learned expert taste in filtering finance documents, not just surface language.

TL;DR

Thinking Machines makes 29.8% fewer errors than the best frontier model. · Bridgewater's private expert judgment made trainable. · 13.8x lower inference cost than the best frontier model.

Mira Murati's Thinking Machines made 29.8% fewer errors than the best frontier model after training on Bridgewater's private expert judgments. The system did so at 13.8x lower inference cost, learning from finance documents labeled by experts.

Key facts

  • 29.8% fewer errors vs best frontier model.
  • 13.8x lower inference cost.
  • Naive prompts yield 46-50% accuracy, expert prompts 74-78%.
  • Training used CISPO loss, proposed by MiniMax in 2025.
  • Bridgewater provided high-quality expert labels.

According to @rohanpaul_ai, Mira Murati's Thinking Machines made Bridgewater's private expert judgment trainable and ended up with 29.8% fewer errors than the best frontier model. With naive prompts, all tested models sit around coin-flip accuracy, roughly 46% to 50%. Expert prompts lift them sharply, to about 74% to 78% average accuracy.

The workflow was filtering finance articles, reports, central-bank documents, and emails to decide what investors should read. For enterprise AI, this is a serious signal that bringing private expert judgment into the training loop can beat general intelligence on a specific workflow.

The taste problem

The problem was not reading finance documents, because frontier LLMs can already read them. The harder task was deciding which facts deserve attention inside an investor's workflow. A tariff headline can move markets, while another geopolitical headline may add no signal.

The breakthrough came from replacing written rules with high-quality labels from expert investors. Non-expert labels failed because the task depends on taste, not surface financial language. Bridgewater cleaned those labels by sending model-disputed cases back to experts for review. The model then learned patterns that experts could recognize, but could not fully verbalize.
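The label-cleaning loop described above can be illustrated with a minimal sketch. It assumes a binary "read or skip" label and a model that outputs a probability. The `LabeledDoc` type, the `disputed_cases` helper, the 0.8 confidence threshold, and the sample documents are all hypothetical; the source does not describe how Bridgewater actually selected disputed cases.

```python
from dataclasses import dataclass


@dataclass
class LabeledDoc:
    doc_id: str
    expert_label: int   # 1 = worth an investor's attention, 0 = skip
    model_prob: float   # model's probability that the label is 1


def disputed_cases(docs, confidence=0.8):
    """Return docs where the model confidently disagrees with the expert label.

    The threshold is an illustrative assumption, not a value from the source.
    """
    flagged = []
    for d in docs:
        model_label = 1 if d.model_prob >= 0.5 else 0
        model_conf = max(d.model_prob, 1.0 - d.model_prob)
        if model_label != d.expert_label and model_conf >= confidence:
            flagged.append(d)
    # Put the most confident disagreements at the front of the review queue.
    flagged.sort(key=lambda d: max(d.model_prob, 1.0 - d.model_prob), reverse=True)
    return flagged


# Made-up examples for illustration only.
docs = [
    LabeledDoc("tariff-headline", 1, 0.95),  # model agrees with the expert
    LabeledDoc("geo-headline", 1, 0.10),     # model confidently disagrees
    LabeledDoc("cb-minutes", 0, 0.55),       # disagrees, but with low confidence
    LabeledDoc("email-digest", 0, 0.85),     # model confidently disagrees
]

review_queue = disputed_cases(docs)
print([d.doc_id for d in review_queue])  # → ['geo-headline', 'email-digest']
```

Only the confident disagreements go back to experts, which keeps reviewer time focused on the labels most likely to be wrong or most informative about the experts' taste.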

Training architecture

Training used interleaved batches, CISPO loss, and on-policy distillation from stronger teacher checkpoints. Interleaving helped the model share judgment across tasks without blending them into noise. CISPO controlled policy updates, so learning stayed aggressive without drifting into brittle shortcuts. (CISPO is a recent reinforcement-learning loss that caps how strongly each generated token can update the model, improving training stability while keeping useful rare tokens active. It was first proposed by the MiniMax team in 2025.) On-policy distillation penalized moves away from better teachers, then promoted stronger checkpoints.
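To make the "cap on each token's update" concrete, here is a minimal sketch of a CISPO-style objective in plain Python. It assumes the clipped importance-sampling form generally attributed to the MiniMax proposal: each token's importance ratio is clipped and then used as a fixed weight on that token's log-probability. The function names, the clipping thresholds, and the toy numbers are assumptions for illustration; this is not Thinking Machines' training code.

```python
import math


def cispo_token_weights(logp_new, logp_old, eps_low=1.0, eps_high=0.2):
    """Clipped importance-sampling weight for each generated token.

    eps_low and eps_high are illustrative values, not ones from the source.
    """
    weights = []
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)  # pi_new(token) / pi_old(token)
        weights.append(min(max(ratio, 1.0 - eps_low), 1.0 + eps_high))
    return weights


def cispo_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """Value of the token-averaged loss.

    In an autodiff framework the weights would be wrapped in a stop-gradient,
    so each token contributes weight * advantage * grad(log-prob). The weight
    is capped, but the token's gradient is never zeroed out.
    """
    weights = cispo_token_weights(logp_new, logp_old, eps_low, eps_high)
    total = sum(w * a * lp for w, a, lp in zip(weights, advantages, logp_new))
    return -total / len(logp_new)


# Toy rollout of three tokens (all numbers are made up).
logp_old = [math.log(0.5), math.log(0.1), math.log(0.2)]
logp_new = [math.log(0.5), math.log(0.3), math.log(0.1)]
advantages = [1.0, 1.0, -1.0]

weights = cispo_token_weights(logp_new, logp_old)
loss = cispo_loss(logp_new, logp_old, advantages)
print([round(w, 3) for w in weights])  # → [1.0, 1.2, 0.5]
```

The second token's raw ratio is 3.0, yet its weight is capped at 1.2. Under a PPO-style clipped objective that token would typically stop contributing gradient once it left the clip range; here it still contributes, just with bounded strength, which is one reading of "keeping useful rare tokens active".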

The result beat the best frontier model, with 29.8% fewer mistakes and 13.8x lower inference cost. The company did not disclose the exact model architecture or parameter count.
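A short worked calculation shows what these two ratios mean in absolute terms. The source does not report the baseline's accuracy or cost, so the 76% baseline accuracy and the cost of 100 units below are purely hypothetical inputs, and the function names are illustrative.

```python
def accuracy_after_error_reduction(baseline_accuracy, relative_error_reduction):
    """Accuracy implied by cutting the baseline's error rate by a relative amount."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = baseline_error * (1.0 - relative_error_reduction)
    return 1.0 - new_error


def cost_after_reduction(baseline_cost, cost_ratio):
    """Cost implied by being `cost_ratio` times cheaper than the baseline."""
    return baseline_cost / cost_ratio


# Hypothetical baseline: 76% accuracy, 100 cost units per batch of documents.
new_accuracy = accuracy_after_error_reduction(0.76, 0.298)
new_cost = cost_after_reduction(100.0, 13.8)

print(round(new_accuracy, 4))  # → 0.8315
print(round(new_cost, 2))      # → 7.25
```

Under these assumed inputs, a 24% error rate falls to about 16.8%, and the specialized model runs at roughly 7% of the baseline's inference cost. The ratios come from the article; the absolute figures do not.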

What to watch

Watch for Thinking Machines' next benchmark results on enterprise-specific reasoning tasks, and whether other hedge funds adopt similar expert-in-the-loop training for proprietary workflows. A public paper or blog post detailing CISPO integration and model architecture would be the next signal.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This result challenges the assumption that frontier general intelligence is the ceiling for enterprise AI. Instead, it shows that private, expert-curated judgment (taste, not just language) can be made trainable and yield superior results at dramatically lower cost. The 13.8x inference cost advantage is particularly striking: it suggests that smaller, specialized models fine-tuned on expert labels can outperform massive general models on specific workflows.

The use of CISPO loss, originally proposed by MiniMax in 2025, indicates that Thinking Machines is leveraging recent advances in reinforcement learning stability. This is a departure from standard SFT or RLHF, which might overfit to surface patterns. The interleaved batch training also hints at a multi-task approach where shared judgment signals are preserved across domains.

However, the source lacks detail on model size, training compute, and exact benchmark methodology. The claim of 'best frontier model' is vague: was this GPT-4o, Claude 3.5, Gemini Ultra, or something else? Without a named baseline, the result is harder to evaluate rigorously. Still, the direction is clear: enterprise AI's moat may be proprietary data and expert taste, not raw scale.