We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI.
We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.
A private held-out test set prevents data contamination.
We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI.
We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.
Users call the Vals AI live leaderboard for the Harvey Legal Agent Benchmark great because current models scoring only 11% highlights exciting room for future progress.
No Digg Deeper questions have been answered for this story yet.
GLM 5.1 scores zero btw there's no way they benchmaxed this thing directly we shall see how 5.2 performs. I'd be surprised if it landed below MiniMax
We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI.
We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.

The work product is graded on task-specific criteria. A task is resolved only when every criterion passes, so headline scores stay low even when models clear most requirements. Criteria pass rates make that clear: Fable 5 at 90.5%, Opus 4.8 at 87.9%, Sonnet at 86.7%.

Before access was cut off, we ran Fable 5 (with Opus 4.8 fallback) and it took the #1 spot at 11.25%. Opus 4.8 (9.58%) and Sonnet 4.6 (5%) were second and third, respectively. Fable 5 without fallback was still #1 at 10.4%
MiniMax M3 (4.17%) is the #1 open weight model, out-performing closed-weight model GPT 5.5 (3.75%)

This benchmark tests how well models can produce real legal work product in an agentic setting. Each task asks an agent to respond to a specific client inquiry, using shell and file-editing tools alongside specific skills for working in Word, Excel, and PowerPoint

We have made a few changes to the benchmark that have been contributed upstream. Previously, the judge was not able to see tracked changes - they were accepted before being passed to the evaluator. The judge now sees these changes, which is a requirement to pass certain criteria involving redlines
We also enabled prompt caching for the judges to reduce costs and improve response time. We were able to do this by setting breakpoints to cache common elements between API calls.

Fable 5 ran at $19.23 per test, nearly double the cost of Opus 4.8 ($10.22) and more than six times Sonnet 4.6 ($3.04). The Minimax M3, the #1 open-weight model came in, at a fraction of the closed-weight costs at $1.46/per test

@ValsAI @harvey Without glm???

Full results and the interactive leaderboard, filterable across 24 task types- https://www.vals.ai/benchmarks/hlab
You can read Harvey’s original release here - https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark

@ValsAI @harvey How did you resolve disagreements between GPT-5.5_Medium and Sonnet 4.6_High?

@STUD_MAN_X @harvey Coming shortly!

@ValsAI @harvey assuming glm 5.2 was not tested yet?

@ValsAI @harvey @deredleritt3r thoughts?

@ValsAI @harvey well, this is a great benchmark.
seeing models score only 11% at best makes me excited for the future.

@ValsAI @harvey These two models appear to disagree on about 5.6% of LAB criteria (100% − 94.4%), with outsized effects on all-pass scores. https://overfittingdicta.substack.com/p/investigating-the-lab-part-2-judge