http://x.com/i/article/2059666894781554691
Harvey and Baseten's post-trained open-weight AI agents match closed-source frontier models on the Legal Agent Benchmark
The evaluation used over 1,200 long-horizon tasks.
Some users praise Harvey and Baseten's post-trained open-weight legal agents for matching frontier performance and showing strong momentum, while others dismiss the company's prospects by betting it will fail or be outpaced.
Fascinating. Legal, accounting and investment banking should thrive as vertical applications of AI, especially if you run on http://Factory.ai too.
Today we're sharing our first research collaboration with @baseten on open-weight legal agents.
Using signal from LAB (our Legal Agent Benchmark of 1,200+ tasks across 24 practice areas), we post-trained an open-weight model to match closed-source frontier performance.
Training open-weight agents for legal creates three major advantages for Harvey:
1) Cost and latency improvements:
The best-performing closed-source frontier models took an average of 22 minutes to complete each LAB task, with $50 per-task average inference cost.
Open-weight inference is substantially faster and cheaper.
2) Reasoning visibility:
In a high-stakes domain like legal, it's incredibly important to get visibility into agents' internal reasoning states - for audit and governance and also as a lever for Harvey to improve agent performance.
Closed-source foundation model providers avoid exposing raw reasoning tokens via API to prevent model distillation. For Harvey, that visibility is a major advantage.
3) Custom training:
Owning the model weights lets us customize training and modify architecture. One example: our blog’s final note on training a custom KV cache compactor.
More to come on our research collaboration with Baseten.
Post-training your own frontier model has become the new default for the leading AI companies. Baseten's research team partnered with the brilliant team at @harvey and showed that post-trained open models can compete at the frontier on LAB. Post-training not only enables high-quality legal agents to be more accessible, but also allows more specialization for the workflows firms actually care about.

@gabepereyra Great teamwork 💚

@AgiBeerus will definitely be around. dominating.

@rabois Factory won’t be around in a year unless acquired by Cognition or Anthropic but not sure why they would.

@rabois What’s their revenue? Anthropic $50 billion, Cursor $3 billion, Cognition $500 million.
Factory $30 million at best, probably negative margin?

@AgiBeerus much much higher and accelerating faster than Cursor and likely Cognition.

@rabois Is $upst still investable?

@gabepereyra Which practice areas saw the biggest gains from post-training? I'd guess the corporate or commercial ones. Also, for tasks where the model failed, was it because of bad reasoning or bad retrieval? That's where I am seeing the failures on my end.

@DannieHerz @harvey so cool.

Post-training can make a model sound more lawyerly. It does not make it current, source-faithful, or procedurally reliable.
Legal agents fail where the law changes, sources conflict, jurisdictions diverge, or facts require verified retrieval. That needs a different approach, not just better weights.

@rabois Oh shit just noticed cognition mogged you guys today

@rabois How about you and @cognition play I’ll show you mine if you show me yours
I’ll bet 100k factory is not at $500 million 12 months from now.