Harvey's Legal Agent Benchmark finds frontier AI models complete less than 10% of complex legal tasks end-to-end

Applied Compute's Yash Patil recommends using multi-model strategies.

Original post

Alex Ratner · #1531

Gabe Pereyra @gabepereyra · #1852 in Tech

http://x.com/i/article/2059284537503285248

10:08 AM · May 26, 2026 · 101.5K Views

Sentiment

Many users praise Harvey's legal agent benchmark for clarifying that frontier models, while capable of impressive work, complete few long-horizon legal tasks end-to-end, and for shifting attention to data and evaluations.

Pos: 100.0% · Neg: 0.0%

4 comments with sentiment.

Posts from X

Most Activity

Views 16.6K · Bookmarks 54 · Likes 76 · Retweets 10 · Replies 5

Harvey@harvey

We evaluated frontier models on LAB, our long-horizon legal agent benchmark.

Three findings stood out: 1) Legal work is far from saturated by frontier models. 2) Model performance varies sharply by practice area. 3) Cost and latency rise at the frontier.

Gabe Pereyra@gabepereyra

http://x.com/i/article/2059284537503285248

34d · 16.6K views · 76 likes · 54 bookmarks

Yash Patil@ypatil125

"What this means in practice is that no single model is a silver bullet for legal work today. Maximizing agent performance on a real legal workload requires understanding which model family best matches the task at hand. The strongest production agent deployments will be multi-model from the start."

Lots of headroom! Great analysis by the @harvey team!

Gabe Pereyra@gabepereyra

http://x.com/i/article/2059284537503285248

34d7.5K4619

Snorkel AI@SnorkelAI

Initial LAB results from Harvey put a number on something we see across specialized AI work: under rigorous all-pass standards, frontier models complete fewer than 10% of long-horizon legal tasks, and no single model leads across practice areas.

General capability isn't sufficient for high-stakes professional work. Closing that gap takes domain-grounded data, evaluation, and post-training, which is exactly the research we're excited to do with the Harvey team next.

34d4.9K3015

👩‍💻 Paige Bailey@DynamicWebPaige

📊 Evaluating agents is hard! And in large part, the industry has been stuck on short-horizon QA, but real-world work doesn't look like that.

@Harvey just open-sourced their Legal Agent Benchmark (LAB). The agent is dropped into a messy file system with a loose instruction and has to output a final deliverable... 👩‍⚖️

a.k.a., much more similar to the real world than a simple prompt -- and the grading is intense: "all-pass" against 75k expert criteria. Which means that if you match 9/10 M&A risks? You fail, because in real life, the 10th risk blows up the deal.

Coding agents started working when SWE-bench crystallized a multi-step, complex goal. This feels similar, but for the legal domain. Take a look! 👇

33d2K165
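
The "9/10 M&A risks" example in the post above can be made concrete with a small sketch. It contrasts all-pass grading with partial credit on a list of per-criterion results. The function names and the sample data are hypothetical illustrations, not part of LAB's published harness.

```python
from typing import Sequence


def all_pass(criteria: Sequence[bool]) -> bool:
    """A task passes only if it has criteria and every one of them is met."""
    return len(criteria) > 0 and all(criteria)


def partial_credit(criteria: Sequence[bool]) -> float:
    """Fraction of criteria met; shown only for contrast with all-pass."""
    return sum(criteria) / len(criteria) if criteria else 0.0


def completion_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Share of tasks that pass under the all-pass rule."""
    if not tasks:
        return 0.0
    return sum(all_pass(t) for t in tasks) / len(tasks)


# Hypothetical task: nine of ten expert criteria satisfied.
nine_of_ten = [True] * 9 + [False]

print(all_pass(nine_of_ten))        # False: one miss fails the task
print(partial_credit(nine_of_ten))  # 0.9
```

Under this rule a model can score highly on individual criteria and still post a low completion rate, which is the gap several replies in this thread ask about.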

Harvey@harvey

Legal work spans dozens of sub-domains, from corporate and regulatory to IP, tax, and employment.

Model performance varies sharply across these practice areas.

The same model can lead in one area and lag in another, and no single model leads across every practice area.

34d42461
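
One way to act on "no single model leads across every practice area" is a per-area routing table. The sketch below is a minimal illustration under stated assumptions: the model names and scores are made-up placeholders, not LAB results, and `build_router` and `route` are hypothetical helpers.

```python
from typing import Dict

# Hypothetical all-pass rates per practice area. These are NOT LAB numbers.
SCORES: Dict[str, Dict[str, float]] = {
    "corporate": {"model_a": 0.08, "model_b": 0.05},
    "tax": {"model_a": 0.03, "model_b": 0.06},
}


def build_router(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the highest-scoring model for each practice area."""
    return {
        area: max(by_model, key=lambda name: by_model[name])
        for area, by_model in scores.items()
    }


def route(area: str, router: Dict[str, str], default: str) -> str:
    """Return the routed model, falling back to a default for unseen areas."""
    return router.get(area, default)


router = build_router(SCORES)
print(route("tax", router, default="model_a"))
print(route("employment", router, default="model_a"))
```

A real deployment would presumably also weigh cost and latency, which this sketch ignores.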

Harvey@harvey

Expert performance is costly.

Opus 4.7, the highest-performing model by all-pass score, costs $50.90 and 22 minutes of wall-clock time per task.

On cost, GPT-5.5 is approximately 3x cheaper. On latency, Gemini 3.5 Flash returns a draft in under six minutes.

34d2953

Harvey@harvey

Legal work is far from saturated by frontier models.

Under LAB's all-pass standard, Opus 4.7 leads at just 7.1% completion.

All-pass grading reflects how high-stakes legal work is reviewed in practice: there is no partial credit for catching most of the issues.

34d5765
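
Combining the two Harvey posts above gives a rough cost per fully passing task. The sketch below is back-of-the-envelope arithmetic, not a figure Harvey reports. It assumes that failed attempts cost as much as successful ones and that repeated attempts pass independently at the same rate. The GPT-5.5 number is only an estimate read off "approximately 3x cheaper".

```python
OPUS_COST_USD = 50.90   # per task, from Harvey's cost post
OPUS_ALL_PASS = 0.071   # 7.1% all-pass completion, from Harvey's post


def cost_per_completed_task(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per all-pass task, assuming independent retries."""
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return cost_per_attempt / pass_rate


# Assumption: "approximately 3x cheaper" read as one third of the Opus 4.7 cost.
gpt_cost_estimate = OPUS_COST_USD / 3

opus_cost_per_pass = cost_per_completed_task(OPUS_COST_USD, OPUS_ALL_PASS)
print(f"Opus 4.7: about ${opus_cost_per_pass:.0f} per all-pass task")
print(f"GPT-5.5: about ${gpt_cost_estimate:.2f} per attempt (estimate)")
```

The thread gives no per-model pass rate for GPT-5.5, so the same per-pass calculation is not attempted for it here.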

Harvey@harvey

We additionally analyzed model behavior over the course of their work, and found common patterns that affect end-to-end legal performance.

Models that spent substantial time verifying and revising their work performed best on LAB's task suite.

34d3094

Harvey@harvey

Read our full blog:

34d3405

Vitor Baptista@vitorbaptista

@gabepereyra That's extremely helpful, thank you! Do you plan on publishing the results using your internal harness? I'm curious about how much of this accuracy can be improved by using a better harness.

34d4252

Deepak Kumar@deepakdk3478

@ypatil125 I think the same applies for the finance work too.

34d552

Devansh: chocolate milk cult leader@Machine01776819

@gabepereyra Any plans to share your criteria pass rate? Will provide even richer comparisons on where models won or lost?

34d211

Connecticut Yankee@EmbeddingSpace

@gabepereyra Which GPT-5.5 model did you use: Thinking or Pro?

34d186

AInthusiast@ainthusiast

@harvey GPT 5.5 thinking? Pro?

34d531

Viraj@tunedgradient

@ypatil125 Indeed. Spent a few years practicing in the legal space after CS. Tasks need a mix of deep reasoning (with legal procedure), retrieval creativity. Varies by domain and risk profile. Multi-model matters, but so does multimodal as legal work often spans video depos, exhibits etc.

34d431

SJ@sj_nyc

@SnorkelAI reinforcement learning is the answer

34d26

Devansh: chocolate milk cult leader@Machine01776819

@harvey Why only all pass? Doesn't make sense since partial answers can also be very useful to the users.

34d11

George Filippakis@GeoFilippakis

@gabepereyra Really nice work. The sub-10% all-pass result is less discouraging than clarifying. Frontier models can produce impressive legal work, but reliable end-to-end legal agents need process discipline. Retrieval, validation, revision, grounding. Not just a bigger model.

34d5

George Filippakis@GeoFilippakis

@SnorkelAI Legal AI is becoming a great example of why the model is necessary but perhaps not (yet?) sufficient. The moat moves to the data, evals, workflow traces, and post-training that turn raw capability into reliable professional output.

34d4

Majo@originalmagneto

@gabepereyra @deredleritt3r 🤔 you seen this?

34d2