Bridgewater and Thinking Machines Lab fine-tuned a Qwen3-235B model to 84.7% accuracy on financial document tasks. The model beats GPT, Gemini, and Claude at roughly one-fourteenth the cost, per the companies' internal evaluation.
Key facts
- Qwen3-235B fine-tuned to 84.7% accuracy on financial tasks.
- GPT, Gemini, Claude hit ~50% accuracy with basic prompts.
- Fine-tuned model costs 1/14th of frontier models.
- Bridgewater and Thinking Machines Lab led training.
- Training used proprietary investor judgments for labeling.
Bridgewater and Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, have fine-tuned a Qwen3-235B model for financial tasks. According to their own testing, the model reaches 84.7 percent accuracy, beating Gemini, Claude, and GPT at roughly one-fourteenth of the cost. No one outside the two companies has verified the numbers, though.
What the frontier models got wrong
The researchers defined six tasks drawn from an investor's daily routine. One example: deciding whether a financial article is relevant to an executive. Another: deciding whether a central bank document signals the direction of future rate changes. Investors make these calls effortlessly, but they can rarely put their reasoning into words. The report gives a telling example. A headline about Trump's claim to Greenland is flagged as irrelevant, while Trump's threat of new China tariffs is rated highly relevant, even though both touch on geopolitics and finance.
Frontier models fell short in the authors' tests. Variants of Gemini, Claude, and GPT reached only about 50 percent accuracy with a basic prompt. Expert-written instructions and a three-tier rating system ("relevant and interesting," "relevant but uninteresting," "irrelevant") pushed accuracy into the mid-70 percent range. That still fell short of the 80 percent threshold the authors set for trustworthy deployment.
Newer models barely improve per dollar, the report says. GPT 5.4 costs 43 percent more than GPT 5.2 but is only marginally more accurate.
The secret sauce: proprietary investor judgment
The real value lives inside investors' heads. The solution was fine-tuning: retraining an open-weight model on proprietary examples. The key ingredient was the judgment of Bridgewater's investors. At first, cheap outside contractors labeled the documents, but many of those labels were wrong. To avoid having expensive professionals review everything, the researchers used a workaround. A first model was trained on the flawed labels and then re-evaluated the same documents. Wherever the model's prediction and the original label disagreed, an error was likely. Only those disputed cases went to investors for correction.
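
The disagreement-based review step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the companies' implementation: the token-vote classifier is a toy stand-in for the first fine-tuned model, and every name here (LabeledDoc, train_first_model, flag_disagreements, apply_expert_corrections) and all the sample documents are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class LabeledDoc:
    doc_id: str
    text: str
    label: str  # contractor-provided label, possibly wrong


def train_first_model(docs):
    """Toy stand-in for the first model: count contractor labels per token."""
    token_votes = defaultdict(Counter)
    for doc in docs:
        for token in doc.text.lower().split():
            token_votes[token][doc.label] += 1

    def predict(text):
        totals = Counter()
        for token in text.lower().split():
            totals.update(token_votes.get(token, Counter()))
        # Return None when the text contains no token seen in training.
        return totals.most_common(1)[0][0] if totals else None

    return predict


def flag_disagreements(docs, predict):
    """Keep only documents where the model disagrees with the original label."""
    return [doc for doc in docs if predict(doc.text) != doc.label]


def apply_expert_corrections(docs, corrections):
    """Overwrite labels for the disputed documents an expert has reviewed."""
    return [
        LabeledDoc(d.doc_id, d.text, corrections.get(d.doc_id, d.label))
        for d in docs
    ]


# Hypothetical contractor-labeled data; "d3" carries a deliberate labeling error.
docs = [
    LabeledDoc("d1", "tariffs on china imports", "relevant"),
    LabeledDoc("d2", "new china tariffs threat", "relevant"),
    LabeledDoc("d3", "tariffs on china exporters", "irrelevant"),
    LabeledDoc("d4", "greenland purchase claim", "irrelevant"),
    LabeledDoc("d5", "greenland claim repeated", "irrelevant"),
]

predict = train_first_model(docs)
disputed = flag_disagreements(docs, predict)
print([d.doc_id for d in disputed])  # ['d3']

# Only the disputed case goes to the (simulated) investor for correction.
corrections = {"d3": "relevant"}
cleaned = apply_expert_corrections(docs, corrections)
```

In this toy run the expert reviews one document out of five. The design point is that expert time scales with the number of disagreements, not with the size of the dataset.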

Training ran on Tinker, Thinking Machines Lab's training platform, with the open-weight Qwen3-235B as the base model.
Why this matters beyond hedge funds
The result suggests that companies can develop powerful AI solutions using their own data without having to share sensitive information with large providers. The cost advantage is stark: the fine-tuned Qwen3-235B operates at roughly one-fourteenth the inference cost of frontier models. For financial institutions that process millions of documents daily, that gap adds up to substantial savings. The approach also sidesteps the data-sharing concerns that have made banks wary of sending proprietary filings to the APIs of OpenAI or Anthropic.
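
To make the cost ratio concrete, here is a small worked calculation. Only the ratio of 14 comes from the report. The daily spend figure is a purely hypothetical placeholder, and the function name savings_at_cost_ratio is invented for this sketch.

```python
def savings_at_cost_ratio(frontier_daily_cost, cost_ratio=14.0):
    """Return (fine-tuned daily cost, amount saved, share of spend saved)."""
    fine_tuned_daily_cost = frontier_daily_cost / cost_ratio
    saved = frontier_daily_cost - fine_tuned_daily_cost
    return fine_tuned_daily_cost, saved, saved / frontier_daily_cost


# Hypothetical: a firm spending 14,000 USD per day on frontier-model inference.
fine_tuned, saved, share = savings_at_cost_ratio(14_000.0)
print(fine_tuned)       # 1000.0
print(saved)            # 13000.0
print(f"{share:.1%}")   # 92.9%
```

Whatever the absolute spend, a one-fourteenth cost ratio removes about 93 percent of the inference bill. The calculation ignores the one-time cost of training and expert labeling.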

The 84.7% figure has not been verified externally, and the evaluation set is proprietary, so the public can't reproduce the benchmark. Still, the methodology is sound: collect cheap but noisy labels from contractors, train an open-weight model on them, then use disagreements between the model and the labels to flag only the likely errors for expert review. That disagreement-driven review loop, akin to active learning, is the real innovation here, not the fine-tuning itself.
What to watch
Watch for third-party verification of the 84.7% accuracy claim, and whether Bridgewater or Thinking Machines Lab open-sources the evaluation benchmark. If other hedge funds replicate the approach, expect a wave of domain-specific fine-tuned models that bypass frontier providers entirely.
Source: the-decoder.com








