http://x.com/i/article/2059284537503285248
Harvey's Legal Agent Benchmark finds frontier AI models complete fewer than 10% of complex legal tasks end-to-end
Applied Compute's Yash Patil recommends using multi-model strategies.
Many users praise Harvey's Legal Agent Benchmark for clarifying that frontier models, while capable of producing impressive legal work, complete few long-horizon legal tasks end-to-end, and for shifting the focus to data and evaluations.
We evaluated frontier models on LAB, our long-horizon legal agent benchmark.
Three findings stood out: 1) Legal work is far from saturated by frontier models. 2) Model performance varies sharply by practice area. 3) Cost and latency rise at the frontier.
Read more:
http://x.com/i/article/2059284537503285248
"What this means in practice is that no single model is a silver bullet for legal work today. Maximizing agent performance on a real legal workload requires understanding which model family best matches the task at hand. The strongest production agent deployments will be multi-model from the start."
Lots of headroom! Great analysis by the @harvey team!
Initial LAB results from Harvey put a number on something we see across specialized AI work: under rigorous all-pass standards, frontier models complete fewer than 10% of long-horizon legal tasks, and no single model leads across practice areas.
General capability isn't sufficient for high-stakes professional work. Closing that gap takes domain-grounded data, evaluation, and post-training, which is exactly the research we're excited to do with the Harvey team next.
📊 Evaluating agents is hard! And in large part, the industry has been stuck on short-horizon QA, but real-world work doesn't look like that.
@Harvey just open-sourced their Legal Agent Benchmark (LAB). The agent is dropped into a messy file system with a loose instruction and has to output a final deliverable... 👩⚖️
a.k.a., much more similar to the real world than a simple prompt -- and the grading is intense: "all-pass" against 75k expert criteria. Which means that if you match 9/10 M&A risks? You fail, because in real life, the 10th risk blows up the deal.
Coding agents started working when SWE-bench crystallized a multi-step, complex goal. This feels similar, but for the legal domain. Take a look! 👇

Legal work spans dozens of sub-domains, from corporate and regulatory to IP, tax, and employment.
Model performance varies sharply across these practice areas.
The same model can lead in one area and lag in another, and no single model leads across every practice area.
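
One way to act on this is to route each task to whichever model scored best in its practice area. The snippet below is a minimal sketch of that idea, not Harvey's method: the function name `route_by_practice_area`, the model labels `model_a` and `model_b`, and every score in the example table are hypothetical placeholders, since the thread does not give per-area numbers.

```python
from typing import Dict


def route_by_practice_area(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the top-scoring model for each practice area.

    scores[area][model] is that model's all-pass rate in that area.
    Areas with no scores are skipped rather than guessed at.
    """
    return {
        area: max(by_model, key=by_model.get)
        for area, by_model in scores.items()
        if by_model
    }


# Hypothetical placeholder scores, NOT results from LAB.
example_scores = {
    "corporate": {"model_a": 0.08, "model_b": 0.05},
    "tax": {"model_a": 0.02, "model_b": 0.06},
}

routing = route_by_practice_area(example_scores)
print(routing)
```

With a table like this, no single model wins every row, so the routing map ends up naming more than one model, which is the multi-model setup the thread describes.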

Expert performance is costly.
Opus 4.7, the highest-performing model by all-pass score, costs $50.90 and 22 minutes of wall-clock time per task.
On cost, GPT-5.5 is approximately 3x cheaper. On latency, Gemini 3.5 Flash returns a draft in under six minutes.
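
To put these figures side by side, here is a rough back-of-the-envelope sketch. It uses only numbers quoted in this thread ($50.90 per task, "approximately 3x cheaper", and the 7.1% all-pass rate). The helper names are hypothetical, and the "cost per completed task" figure assumes every attempt costs the same and that failed attempts are simply retried, which the benchmark itself does not claim.

```python
OPUS_COST_USD = 50.90      # per-task cost quoted in the thread
OPUS_ALL_PASS_RATE = 0.071  # 7.1% all-pass completion quoted in the thread
CHEAPER_FACTOR = 3          # "approximately 3x cheaper", so the result is approximate


def approx_cost(reference_cost: float, factor: float) -> float:
    """Estimate a cost described only as 'factor times cheaper' than a reference."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return reference_cost / factor


def cost_per_completed_task(cost_per_attempt: float, all_pass_rate: float) -> float:
    """Expected spend per fully passing task, assuming identical, independent attempts."""
    if not 0 < all_pass_rate <= 1:
        raise ValueError("all_pass_rate must be in (0, 1]")
    return cost_per_attempt / all_pass_rate


gpt_cost_estimate = approx_cost(OPUS_COST_USD, CHEAPER_FACTOR)
opus_cost_per_completion = cost_per_completed_task(OPUS_COST_USD, OPUS_ALL_PASS_RATE)

print(f"Approximate GPT-5.5 cost per task: ${gpt_cost_estimate:.2f}")
print(f"Opus 4.7 expected spend per all-pass task: ${opus_cost_per_completion:.2f}")
```

Under those assumptions the per-attempt price understates the real cost of a usable deliverable, because most attempts do not clear the all-pass bar.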

Legal work is far from saturated by frontier models.
Under LAB's all-pass standard, Opus 4.7 leads at just 7.1% completion.
All-pass grading reflects how high-stakes legal work is reviewed in practice: there is no partial credit for catching most of the issues.
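
The difference between all-pass and partial-credit grading can be sketched in a few lines. This is an illustrative sketch only, not LAB's actual grader: the `TaskResult` type and the function names are hypothetical, and each criterion is assumed to reduce to a simple pass/fail flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    task_id: str
    criteria_passed: List[bool]  # one pass/fail flag per expert criterion


def all_pass(result: TaskResult) -> bool:
    """A task counts only if every criterion passes (and there is at least one)."""
    return bool(result.criteria_passed) and all(result.criteria_passed)


def partial_credit(result: TaskResult) -> float:
    """Fraction of criteria passed, shown for contrast with all-pass."""
    if not result.criteria_passed:
        return 0.0
    return sum(result.criteria_passed) / len(result.criteria_passed)


def all_pass_rate(results: List[TaskResult]) -> float:
    """Share of tasks that cleared every criterion."""
    if not results:
        return 0.0
    return sum(all_pass(r) for r in results) / len(results)


# A task that catches 9 of 10 risks scores 0.9 on partial credit but fails all-pass.
nine_of_ten = TaskResult("m_and_a_review", [True] * 9 + [False])
perfect = TaskResult("nda_markup", [True] * 4)

print(all_pass(nine_of_ten), partial_credit(nine_of_ten))
print(all_pass_rate([nine_of_ten, perfect]))
```

This is why a model can look strong on criterion-level pass rates while its all-pass completion stays in the single digits: one missed criterion zeroes out the whole task.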

We additionally analyzed model behavior over the course of their work, and found common patterns that affect end-to-end legal performance.
Models that spent substantial time verifying and revising their work performed best on LAB's task suite.

Read our full blog:

@gabepereyra That's extremely helpful, thank you! Do you plan on publishing the results using your internal harness? I'm curious about how much of this accuracy can be improved by using a better harness.

@ypatil125 I think the same applies for the finance work too.

@gabepereyra Any plans to share your criteria pass rate? Will provide even richer comparisons on where models won or lost?

@gabepereyra Which GPT-5.5 model did you use: Thinking or Pro?

@harvey GPT 5.5 thinking? Pro?

@ypatil125 Indeed. Spent a few years practicing in the legal space after CS. Tasks need a mix of deep reasoning (with legal procedure), retrieval creativity. Varies by domain and risk profile. Multi-model matters, but so does multimodal as legal work often spans video depos, exhibits etc.

@SnorkelAI reinforcement learning is the answer

@harvey Why only all pass? Doesn't make sense since partial answers can also be very useful to the users.

@gabepereyra Really nice work. The sub-10% all-pass result is less discouraging than clarifying. Frontier models can produce impressive legal work, but reliable end-to-end legal agents need process discipline. Retrieval, validation, revision, grounding. Not just a bigger model.

@SnorkelAI Legal AI is becoming a great example of why the model is necessary but perhaps not (yet?) sufficient. The moat moves to the data, evals, workflow traces, and post-training that turn raw capability into reliable professional output.

@gabepereyra @deredleritt3r 🤔 you seen this?