"You don’t need frontier scale to reach frontier quality" in specialized domains, you need the right expert feedback loop.
Heidi says it matched Sonnet 4.6 in clinical search with a much smaller model trained on clinician preferences instead of raw scale.
Heidi Evidence is a clinical search tool where doctors ask medical questions and get sourced answers.
Here, clinicians were shown the same medical question with 2 anonymous answers, one from Heidi’s smaller model and one from Sonnet 4.6, and they picked Heidi’s answer 49.9% of the time.
In medicine specifically, the hard problem is knowing when to search, what to cite, how much to say, and when a vague answer is worse than no answer.
There’s been debate in the last couple days about whether general models beat specialized medical AI. It's the wrong question. This is an argument about how to measure.
You don't need frontier scale to reach frontier quality. Six weeks ago we matched the best frontier model in Heidi Evidence with a model of our own, a fraction of the size.
Here's how. 🧵