Bridgewater just published numbers that should make every frontier lab nervous.
The world's largest hedge fund tested Gemini, Claude, and GPT on six document filtering tasks its investors do every day. Naive prompts scored around 50%. A coin flip. Expert-written prompts pushed accuracy to 78%. Investors needed 80% before they'd trust the system in their workflow, and no frontier model cleared it. GPT 5.4 cost 43% more than 5.2 and was barely more accurate.
So they fine-tuned Qwen3-235B on Tinker instead. 84.7% accuracy. 29.8% fewer mistakes than the best frontier model. At 1/14th the inference cost.
The smartest part is buried in the middle of the paper. Their vendor-labeled training data was riddled with wrong labels, and expert labeling costs too much to run on everything. Their fix: train a model on the noisy dataset, then run it back over its own training data. Any example where the model's prediction disagreed with the vendor label got routed to senior investors, because either the example was genuinely hard or the label was wrong. The model's own confusion became a detector for bad labels.
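Here is a rough sketch of that disagreement trick in Python. It is an illustration under loud assumptions, not Bridgewater's pipeline: a toy nearest-centroid classifier stands in for the fine-tuned model, the data is made up, and every name (`fit_centroids`, `predict`, `flag_for_review`) is hypothetical.

```python
from statistics import mean

def fit_centroids(xs, labels):
    """Train a toy nearest-centroid classifier: one mean feature value per label.
    (Assumption: this stands in for the real fine-tuned model.)"""
    return {
        c: mean([x for x, y in zip(xs, labels) if y == c])
        for c in set(labels)
    }

def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def flag_for_review(xs, noisy_labels):
    """Train on the noisy labels, re-score the same data, and return the
    indices where the model's prediction disagrees with the given label."""
    centroids = fit_centroids(xs, noisy_labels)
    return [
        i for i, (x, y) in enumerate(zip(xs, noisy_labels))
        if predict(centroids, x) != y
    ]

# Made-up data: low values are truly class 0, high values are truly class 1.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
# Hypothetical vendor labels with two mistakes, at index 3 and index 7.
noisy_labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

flagged = flag_for_review(xs, noisy_labels)
print(flagged)  # [3, 7]: the two examples to send to an expert
```

Only the flagged examples go to the expensive reviewers; everything else keeps its cheap label. One caveat: a large model can memorize its training labels and stop disagreeing with them, so a common variant scores each example with a model that never saw it (held-out or cross-validated predictions).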
Prompting hit a ceiling for a structural reason. A prompt captures only the judgment an expert can put into words. Twenty years of taste about which central bank memo actually signals a rate move doesn't compress into instructions. It transfers through labeled examples.
Every institution sitting on decades of expert decisions just learned that those archives can train a model that beats the frontier at their specific job. The alpha was in the filing cabinet the whole time.
Bridgewater used their unique financial knowledge and partnered with us on @tinkerapi to fine-tune a model that helps their analysts focus on what's important. Experts improving AI that empowers experts. https://thinkingmachines.ai/news/learning-to-replicate-expert-judgment-in-financial-tasks/