Why did GPT, Gemini, and Claude fail on these financial tasks?

The tasks require tacit investor knowledge — like distinguishing politically relevant from irrelevant news — that isn't in public training data.

AI robot arm hesitating before a pile of money and server racks, illustrating financial AI model fine-tuning

AI Research · Score: 76

Bridgewater, Murati's startup fine-tune Qwen3 to 84.7% on finance tests

Bridgewater and Thinking Machines Lab fine-tuned Qwen3-235B to 84.7% accuracy on financial tasks, beating GPT/Gemini/Claude at 1/14th cost.

Ala SMITH & AI Research Desk · 1d ago · 4 min read · AI-Generated

Source: the-decoder.com (via the_decoder) · Single Source

How did Bridgewater and Thinking Machines Lab achieve 84.7% accuracy on financial tasks with a fine-tuned model?

Bridgewater and Thinking Machines Lab fine-tuned Qwen3-235B on proprietary investor judgments, achieving 84.7% accuracy on financial document tasks — beating GPT, Gemini, and Claude at roughly one-fourteenth the cost.

TL;DR

Qwen3-235B fine-tuned to 84.7% accuracy. · Bridgewater and Thinking Machines Lab led training. · GPT, Gemini, Claude hit ~50% with basic prompts.

Bridgewater and Thinking Machines Lab fine-tuned a Qwen3-235B model to 84.7% accuracy on financial document tasks. The model beats GPT, Gemini, and Claude at roughly one-fourteenth the cost, per the companies' internal evaluation.

Key facts

Qwen3-235B fine-tuned to 84.7% accuracy on financial tasks.
GPT, Gemini, Claude hit ~50% accuracy with basic prompts.
Fine-tuned model costs 1/14th of frontier models.
Bridgewater and Thinking Machines Lab led training.
Training used proprietary investor judgments for labeling.

Bridgewater and Thinking Machines Lab—the startup from former OpenAI CTO Mira Murati—have fine-tuned a Qwen3-235B model for financial tasks. According to their own testing, the model hits 84.7 percent accuracy, beating Gemini, Claude, and GPT at roughly one-fourteenth of the cost. The numbers haven't been verified by anyone outside the two companies, though.

What the frontier models got wrong

The researchers defined six tasks drawn from an investor's daily routine. One example: deciding whether a financial article is relevant to an executive. Another: whether a central bank document signals the direction of future rate changes. For investors, these calls are trivial, but they can barely put their reasoning into words. The report gives a telling example. A headline about Trump's claim to Greenland gets flagged as irrelevant, while Trump's threat of new China tariffs is highly relevant. Both touch on geopolitics and finance.

Frontier models failed in the authors' tests. Variants of Gemini, Claude, and GPT hit only about 50 percent accuracy with a basic prompt. Expert-written instructions and a three-tier rating system ("relevant and interesting," "relevant but uninteresting," "irrelevant") pushed accuracy into the mid-70s. That still fell short of the 80 percent threshold the authors set for trustworthy deployment.
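The comparison against the 80 percent bar can be made concrete with a small scoring sketch. It assumes accuracy means exact agreement with an expert's rating on the three-tier scale, which the report summary does not spell out; the function names and the four toy ratings are hypothetical.

```python
# Three-tier scale quoted in the article.
TIERS = ("relevant and interesting", "relevant but uninteresting", "irrelevant")
# Deployment threshold the authors set, per the article.
DEPLOYMENT_THRESHOLD = 0.80


def accuracy(predicted, expert):
    """Share of items where the model's tier matches the expert's tier exactly."""
    if not expert or len(predicted) != len(expert):
        raise ValueError("need two non-empty rating lists of equal length")
    unknown = set(predicted) | set(expert)
    if not unknown <= set(TIERS):
        raise ValueError("rating outside the three-tier scale")
    correct = sum(p == e for p, e in zip(predicted, expert))
    return correct / len(expert)


def meets_threshold(predicted, expert, threshold=DEPLOYMENT_THRESHOLD):
    """True when accuracy reaches the deployment threshold."""
    return accuracy(predicted, expert) >= threshold


# Hypothetical toy ratings: the model matches the expert on 3 of 4 items.
expert_ratings = [TIERS[0], TIERS[2], TIERS[1], TIERS[2]]
model_ratings = [TIERS[0], TIERS[2], TIERS[2], TIERS[2]]
score = accuracy(model_ratings, expert_ratings)
deployable = meets_threshold(model_ratings, expert_ratings)
```

On this toy data the score is 0.75, so the check fails, mirroring the mid-70s results that fell short of the authors' bar.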

Newer models barely improve per dollar, the report says. GPT-5.4 costs 43 percent more than GPT-5.2 but is only marginally more accurate.

The secret sauce: proprietary investor judgment

The real value lives inside investors' heads. The solution was fine-tuning: retraining an open-weight model on proprietary examples. The key ingredient was the judgment of Bridgewater's investors. At first, cheap outside contractors labeled the documents, but many of those labels were wrong. To avoid having expensive professionals review everything, the researchers used a workaround: a first model learned from the flawed labels and then re-evaluated the same documents. Wherever the model and the original label disagreed, an error was likely. Only those disputed cases went to investors for correction.
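The relabeling workaround can be sketched in a few lines. This is a minimal illustration, not the companies' pipeline: a simple word-count classifier stands in for the first model, and the documents, labels, and function names (`train_word_votes`, `predict`, `flag_disagreements`, `apply_expert_corrections`) are all hypothetical.

```python
from collections import Counter, defaultdict


def train_word_votes(docs, labels):
    """Stand-in 'first model': count how often each word appears under each label."""
    votes = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for word in set(text.lower().split()):
            votes[word][label] += 1
    return votes


def predict(votes, text):
    """Sum the label counts of the document's words and return the top label."""
    tally = Counter()
    for word in set(text.lower().split()):
        tally.update(votes.get(word, {}))
    return tally.most_common(1)[0][0] if tally else None


def flag_disagreements(docs, noisy_labels):
    """Train on the noisy labels, re-score the same documents, return disputed indices."""
    votes = train_word_votes(docs, noisy_labels)
    return [
        i
        for i, (text, label) in enumerate(zip(docs, noisy_labels))
        if predict(votes, text) != label
    ]


def apply_expert_corrections(noisy_labels, flagged, expert_labels):
    """Replace only the flagged labels with the expert's verdicts."""
    cleaned = list(noisy_labels)
    for i in flagged:
        cleaned[i] = expert_labels[i]
    return cleaned


# Hypothetical toy data: item 3 carries a wrong contractor label.
docs = [
    "new china tariffs threatened",
    "china tariffs rattle markets",
    "tariffs on china raised",
    "china tariffs escalate",
    "celebrity opens restaurant downtown",
    "local restaurant wins award",
    "celebrity award show ratings",
]
noisy_labels = ["relevant"] * 3 + ["irrelevant"] * 4

flagged = flag_disagreements(docs, noisy_labels)
# Only the disputed items are sent to the (hypothetical) expert.
cleaned = apply_expert_corrections(noisy_labels, flagged, {3: "relevant"})
```

In this toy run only one of seven documents reaches the expert. The design point is that expert time scales with the number of disagreements rather than with the size of the corpus; errors the first model reproduces faithfully would go unflagged.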

The fine-tuned model was built on the open-weight Qwen3-235B, and training ran on Thinking Machines Lab's Tinker platform.

Why this matters beyond hedge funds

The result suggests that companies can build capable AI systems on their own data without sharing sensitive information with large providers. The cost advantage is stark: by the companies' account, the fine-tuned Qwen3-235B runs at roughly one-fourteenth the inference cost of frontier models. For financial institutions that process millions of documents daily, that difference adds up quickly. The approach also sidesteps the data-sharing concerns that have made banks wary of sending proprietary filings to OpenAI's or Anthropic's APIs.
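To see how a one-fourteenth cost ratio scales with volume, here is a rough back-of-the-envelope sketch. Only the 14x ratio comes from the article; the per-document price and the daily volume below are hypothetical placeholders, and `daily_costs` is an illustrative name.

```python
# From the article: the fine-tuned model reportedly costs roughly 1/14th as much.
COST_RATIO = 14


def daily_costs(docs_per_day, frontier_cost_per_doc):
    """Return (frontier cost, fine-tuned cost, savings) per day in the same currency."""
    frontier = docs_per_day * frontier_cost_per_doc
    fine_tuned = frontier / COST_RATIO
    return frontier, fine_tuned, frontier - fine_tuned


# Hypothetical inputs: 1,000,000 documents a day at 0.014 currency units each.
frontier, fine_tuned, saved = daily_costs(1_000_000, 0.014)
print(f"frontier: {frontier:,.0f}  fine-tuned: {fine_tuned:,.0f}  saved: {saved:,.0f}")
```

Because the ratio is fixed, savings grow linearly with document volume. The sketch ignores training and hosting costs, which the article does not break out.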

When experts write the prompt, performance jumps sharply compared to a naive prompt. | Image: Thinking Machines

The 84.7% figure is unverified externally, and the evaluation set is proprietary — the public can't reproduce the benchmark. Still, the methodology is sound: use a cheap open-weight model, generate noisy labels from contractors, then use model disagreement to flag only the hard cases for expert review. That active-learning loop is the real innovation here, not the fine-tuning itself.

What to watch

Watch for third-party verification of the 84.7% accuracy claim, and whether Bridgewater or Thinking Machines Lab open-sources the evaluation benchmark. If other hedge funds replicate the approach, expect a wave of domain-specific fine-tuned models that bypass frontier providers entirely.

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The headline numbers are eye-catching but unverified: the evaluation set is proprietary, so the public can't reproduce the 84.7% claim. Still, the methodology is the real story. The active-learning loop (cheap contractors → model disagreement flags hard cases → expert review) is a clever cost-saving trick that could generalize beyond finance. It's essentially a distillation of tacit knowledge: the frontier models can't do the task because the right answers were never public, but a fine-tuned model can learn them from a small set of expert judgments.

The cost comparison is apples-to-oranges: the fine-tuned Qwen3-235B is a single model doing one narrow task, while GPT-5.4 is a general-purpose system. The reported 14x cost advantage applies to this specific task; the fine-tuned model is not a general-purpose replacement. The bigger implication is that domain-specific fine-tuning on proprietary data is becoming a viable alternative to renting frontier APIs, especially for regulated industries like finance that can't share their data with third parties.

The report's claim that "newer models barely improve per dollar" is a direct challenge to the frontier scaling narrative. If GPT-5.4 costs 43% more than GPT-5.2 but doesn't meaningfully improve on this task, that's evidence that the frontier is plateauing for narrow, domain-specific use cases. The response from OpenAI and Anthropic will be telling: do they invest in domain-specific fine-tuning themselves, or do they continue betting on general capability scaling?

#finance #fine-tuning #ai models
