
Bridgewater Associates says fine-tuned Qwen3-235B beat proprietary LLMs on financial workflows at 1/14th the cost

Model disagreement metrics flag ambiguous cases for human review.


Original post

Aakash Gupta@aakashgupta

Bridgewater just published numbers that should make every frontier lab nervous.

The world's largest hedge fund tested Gemini, Claude, and GPT on six document filtering tasks its investors do every day. Naive prompts scored around 50%. A coin flip. Expert-written prompts pushed accuracy to 78%. Investors needed 80% before they'd trust the system in their workflow, and no frontier model cleared it. GPT 5.4 cost 43% more than 5.2 and was barely more accurate.

So they fine-tuned Qwen3-235B on Tinker instead. 84.7% accuracy. 29.8% fewer mistakes than the best frontier model. At 1/14th the inference cost.
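A quick consistency check on those figures, as a sketch only. It assumes "29.8% fewer mistakes" means a relative reduction in error rate on the same evaluation set, which the post does not state. Variable names are illustrative.

```python
# Figures quoted in the post.
tuned_accuracy = 0.847            # fine-tuned Qwen3-235B
relative_error_reduction = 0.298  # "29.8% fewer mistakes" (assumed: relative error-rate reduction)

# Error rate of the fine-tuned model.
tuned_error = 1 - tuned_accuracy

# If tuned_error = frontier_error * (1 - reduction), solve for frontier_error.
implied_frontier_error = tuned_error / (1 - relative_error_reduction)
implied_frontier_accuracy = 1 - implied_frontier_error

print(f"{implied_frontier_accuracy:.1%}")  # prints 78.2%
```

Under that assumption the implied frontier accuracy is about 78.2%, which lines up with the 78% quoted for expert-written prompts.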

The smartest part is buried in the middle of the paper. Their vendor-labeled training data was riddled with wrong labels, and expert labeling costs too much to run on everything. Their fix: train a model on the noisy dataset, then run it back over its own training data. Any example the model disagreed with got routed to senior investors, because either the example was genuinely hard or the label was wrong. The model's own confusion became a detector for bad labels.
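The loop described above can be sketched on toy data. This is a minimal illustration and not Bridgewater's pipeline: a nearest-centroid classifier stands in for the fine-tuned model, the single numeric feature and the "keep"/"drop" labels are invented, and every function name is hypothetical.

```python
from statistics import mean

def train_centroids(examples):
    """Fit a toy nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for feature, label in examples:
        by_label.setdefault(label, []).append(feature)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, feature):
    """Return the label whose centroid is closest to the feature."""
    return min(centroids, key=lambda label: abs(feature - centroids[label]))

def flag_disagreements(examples):
    """Train on the noisy labels, re-score the same data, and return the
    indices where the model's prediction differs from the given label."""
    centroids = train_centroids(examples)
    return [
        i for i, (feature, label) in enumerate(examples)
        if predict(centroids, feature) != label
    ]

# Invented toy data: (relevance score, vendor label).
noisy = [
    (0.10, "drop"), (0.20, "drop"), (0.15, "drop"),
    (0.90, "keep"), (0.80, "keep"), (0.85, "keep"),
    (0.88, "drop"),  # likely mislabeled: sits among the "keep" examples
    (0.50, "keep"),  # genuinely ambiguous: sits between the two groups
]

review_queue = flag_disagreements(noisy)
print(review_queue)  # prints [6, 7]
```

Both flagged items would go to an expert reviewer. The sketch cannot tell a wrong label from a hard case, which matches the post's point that disagreement signals one or the other.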

Prompting hit a ceiling for a structural reason. A prompt captures only the judgment an expert can put into words. Twenty years of taste about which central bank memo actually signals a rate move doesn't compress into instructions. It transfers through labeled examples.

Every institution sitting on decades of expert decisions just learned that those archives can train a model that beats the frontier at their specific job. The alpha was in the filing cabinet the whole time.

Mira Murati@miramurati

Bridgewater used their unique financial knowledge and partnered with us on @tinkerapi to fine-tune a model that helps their analysts focus on what's important. Experts improving AI that empowers experts. https://thinkingmachines.ai/news/learning-to-replicate-expert-judgment-in-financial-tasks/

12:34 PM · Jul 2, 2026 · 184.9K Views

Sentiment

Positive commenters highlight that Bridgewater's fine-tuned Qwen3-235B beat frontier models on document tasks at far lower cost, while negative commenters mock the prompting skills of those relying on closed models.

Positive: 57.1% · Negative: 42.9% (7 comments with sentiment)


Via Thinking Machines Lab

Posts from X


Aakash Gupta@aakashgupta

To get all my takes without an algorithmic filter, subscribe to my newsletter:

http://www.aibyaakash.com


Stephanie So@ComplicatedIsOK

I'd like to run this premise by you.

Data, or availability of data, is plentiful.

An AI model can be trained and retrained ad infinitum to generate labels, so that's not the scarce resource either.

The bottleneck to scaling is moving toward the costs of re-deriving the conditions that ought to constrain the outputs. That's why expert judgment is needed.

So much effort has gone into expanding the possibilities.

Time to focus more on invoking hard restrictions - and then making that repeatable.


Mike Lin@mikelin1789

@aakashgupta This should not be a surprise. All the benchmarks around the last human test or whatever random puzzle test are useless for professionals and enterprises that have very nuanced jobs.


Bob O@BobO58076053213

@aakashgupta yeah, they just fucking suck at prompting and harnessing. You do not need to fine-tune a model for doc scanning lmfao

it’s like asking a bunch of horse carriage owners to test your new gas motor, and they use it to power an automated horse whipper and complain about its efficiency LOL


HundredDollarBillz@Jhawkbill

@aakashgupta Or you could just ask the old guy in the corner


Ilman Shazhaev@shzhv13

@aakashgupta This proves that hyper-targeted local weights completely erase the premium value of generalized frontier endpoints.


GP@eiregp

@aakashgupta This post is literally written by AI about AI, and on first read it does not articulate very well what it’s trying to say. I’m a paid subscriber, so I’m hoping there’s a far better explanation on your Substack…


joncelery@johncelery

@aakashgupta So what you are saying is that current frontier models with good prompting and no fine tuning are sufficiently accurate for investors to trust the system in their workflows? Idk why this should make frontier labs nervous.
