Great work from @stochasticdoggo !
The verdict is in! Frontier models can pass the bar, yet they struggle with comprehensive legal research
Today we're releasing Legal Research Bench, a benchmark that measures models’ ability to solve realistic legal research tasks across eight areas of U.S. law
Instead of awarding partial credit, Legal Research Bench measures whether a model can conduct exhaustive legal analysis. We grade against a strict, all-pass rubric written by practicing lawyers. A model receives full credit only if every required legal element is correct
Claude Opus 4.8 leads with 43.8% all-pass accuracy, followed by GPT 5.5 (40.4%) and Claude Sonnet 4.6 (38.5%). While top models score around 80% with partial credit, none exceed 44% when every required legal element must be correct
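For intuition on why the two numbers diverge so sharply, here is a minimal sketch of the two scoring modes. It assumes partial credit is the mean fraction of rubric elements passed per task; the function names and the data are made up for illustration and are not the benchmark's actual code or results

```python
from typing import List

# Each task is a list of pass/fail results, one per required rubric element.
# This representation is a hypothetical stand-in for the real rubric format.

def partial_credit(tasks: List[List[bool]]) -> float:
    """Mean fraction of rubric elements passed, averaged over tasks (assumed definition)."""
    return sum(sum(t) / len(t) for t in tasks) / len(tasks)

def all_pass(tasks: List[List[bool]]) -> float:
    """Fraction of tasks where every required element passed."""
    return sum(all(t) for t in tasks) / len(tasks)

# Illustrative data only: four tasks, four required elements each.
tasks = [
    [True, True, True, True],    # everything correct
    [True, True, True, False],   # one miss
    [True, True, False, True],   # one miss
    [True, False, True, True],   # one miss
]

print(partial_credit(tasks))  # 0.8125
print(all_pass(tasks))        # 0.25
```

In this toy case a single missed element per task barely moves the partial-credit score but zeroes out the task under all-pass grading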
The gap between partial and all-pass accuracy shows how difficult it remains for AI to produce complete, reliable legal research. We hope that Legal Research Bench helps better measure, and ultimately close, that gap
Lots of exciting work happening in Legal AI from @harvey and @crosbylegal. Excited for the legal research benchmarks ahead!