/Tech13h ago

Vals AI launches a third-party leaderboard for Harvey's Legal Agent Benchmark, where GLM 5.1 scored zero

A private held-out test set prevents data contamination.

1115363219.3K

#501

Original post

Vals AI@ValsAI

We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI.

We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.

11:00 AM · Jun 17, 2026 · 15.3K Views

Sentiment

Users call the Vals AI live leaderboard for the Harvey Legal Agent Benchmark great because current models scoring only 11% highlights exciting room for future progress.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Posts from X

Most Activity

VIEWS4.3KBOOKMARKS5LIKES28RETWEETS1REPLIES1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

GLM 5.1 scores zero btw there's no way they benchmaxed this thing directly we shall see how 5.2 performs. I'd be surprised if it landed below MiniMax

Vals AI@ValsAI

We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI.

We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.

3h4.3K285

Vals AI@ValsAI

The work product is graded on task-specific criteria. A task is resolved only when every criterion passes, so headline scores stay low even when models clear most requirements. Criteria pass rates make that clear: Fable 5 at 90.5%, Opus 4.8 at 87.9%, Sonnet at 86.7%.

14h6657

Vals AI@ValsAI

Before access was cut off, we ran Fable 5 (with Opus 4.8 fallback) and it took the #1 spot at 11.25%. Opus 4.8 (9.58%) and Sonnet 4.6 (5%) were second and third, respectively. Fable 5 without fallback was still #1 at 10.4%

MiniMax M3 (4.17%) is the #1 open weight model, out-performing closed-weight model GPT 5.5 (3.75%)

14h5445

Vals AI@ValsAI

This benchmark tests how well models can produce real legal work product in an agentic setting. Each task asks an agent to respond to a specific client inquiry, using shell and file-editing tools alongside specific skills for working in Word, Excel, and PowerPoint

14h4755

Vals AI@ValsAI

We have made a few changes to the benchmark that have been contributed upstream. Previously, the judge was not able to see tracked changes - they were accepted before being passed to the evaluator. The judge now sees these changes, which is a requirement to pass certain criteria involving redlines

We also enabled prompt caching for the judges to reduce costs and improve response time. We were able to do this by setting breakpoints to cache common elements between API calls.

14h3934

Vals AI@ValsAI

Fable 5 ran at $19.23 per test, nearly double the cost of Opus 4.8 ($10.22) and more than six times Sonnet 4.6 ($3.04). The Minimax M3, the #1 open-weight model came in, at a fraction of the closed-weight costs at $1.46/per test

14h3924

Satyam Kumar ☄️@STUD_MAN_X

@ValsAI @harvey Without glm???

12h671

Vals AI@ValsAI

Full results and the interactive leaderboard, filterable across 24 task types- https://www.vals.ai/benchmarks/hlab

You can read Harvey’s original release here - https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark

14h3623

Overfitting Dicta@Overfit_Dicta

@ValsAI @harvey How did you resolve disagreements between GPT-5.5_Medium and Sonnet 4.6_High?

8h48

Vals AI@ValsAI

@STUD_MAN_X @harvey Coming shortly!

12h622

retto@rettooooo

@ValsAI @harvey assuming glm 5.2 was not tested yet?

13h721

Zach Huston@HueyHuston

@ValsAI @harvey @deredleritt3r thoughts?

13h97

lost in latency@lostinlatencyX

@ValsAI @harvey well, this is a great benchmark.

seeing models score only 11% at best makes me excited for the future.

12h55

Overfitting Dicta@Overfit_Dicta

@ValsAI @harvey These two models appear to disagree on about 5.6% of LAB criteria (100% − 94.4%), with outsized effects on all-pass scores. https://overfittingdicta.substack.com/p/investigating-the-lab-part-2-judge

8h26