
Harvey and Fireworks AI hybrid agent routing outperforms Claude Opus 4.7 on legal benchmarks at 61% lower cost

The hybrid setup beat GPT-5.5, which scored 11% at $560.
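The cost claims can be sanity-checked with simple arithmetic. The sketch below uses only the dollar totals quoted in this story ($368 for the hybrid, $954 for Opus 4.7, $560 for GPT-5.5, $84 for the SFT Kimi 2.6 run, each across 100 tasks); the helper names are hypothetical and chosen for illustration.

```python
def cost_reduction(new_usd: float, baseline_usd: float) -> float:
    """Fraction saved when spending new_usd instead of baseline_usd."""
    return 1 - new_usd / baseline_usd


def cost_ratio(baseline_usd: float, new_usd: float) -> float:
    """How many times more expensive the baseline is than the new setup."""
    return baseline_usd / new_usd


# Totals across the same 100 tasks, as quoted in the story.
HYBRID_USD, OPUS_USD, GPT_USD, KIMI_SFT_USD = 368, 954, 560, 84

print(f"hybrid vs Opus 4.7: {cost_reduction(HYBRID_USD, OPUS_USD):.0%} lower cost")
print(f"hybrid vs GPT-5.5:  {cost_reduction(HYBRID_USD, GPT_USD):.0%} lower cost")
print(f"SFT Kimi 2.6 vs Opus 4.7: {cost_ratio(OPUS_USD, KIMI_SFT_USD):.1f}x cheaper")
print(f"per task: hybrid ${HYBRID_USD / 100:.2f}, Opus 4.7 ${OPUS_USD / 100:.2f}")
```

The first line reproduces the 61% in the headline, and the third matches the "~11x cheaper" figure in the post.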

Harvey@harvey

We partnered with @FireworksAI_HQ to train open-source models for legal. Here's what we found:

1) Hybrid legal agents can beat frontier models on quality and cost by routing selectively to a frontier advisor.

We tested a hybrid setup where GLM 5.1 served as the primary worker, routing tasks to Opus 4.7 as an advisor when needed.

GLM invoked Opus sparingly, just 0.83 times per task on average.

The hybrid setup beat Opus on both quality and cost: 18% all-pass vs 14%, at $368 vs $954 across the same 100 tasks.

2) Post-training can push open models to frontier-level legal performance.

On a 100-task slice of our Legal Agent Benchmark (LAB), SFT moved Kimi 2.6's all-pass rate from 11% to 15%, beating Opus' 14%.

But the cost gap was even more striking: $84 vs $954 across the same 100 tasks, or ~11x cheaper.

We're excited to continue working with @FireworksAI_HQ on the next generation of open-source legal agents.

10:03 AM · Jun 3, 2026 · 46.5K Views
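The post does not say how the worker decides to call the advisor, so the following is only a minimal sketch of one way a worker-first loop with an optional advisor call could look. Everything in it is an assumption made for illustration: the `run_hybrid` function, the "worker returns a wants-advice flag" protocol, the stub models, and the per-call prices are all hypothetical and not taken from Harvey or Fireworks AI.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical per-call prices in USD, for illustration only.
WORKER_COST_USD = 0.10
ADVISOR_COST_USD = 1.00


@dataclass
class TaskRun:
    answer: str = ""
    cost_usd: float = 0.0
    advisor_calls: int = 0


def run_hybrid(
    task: str,
    worker: Callable[[str, Optional[str]], Tuple[str, bool]],
    advisor: Callable[[str], str],
    max_advisor_calls: int = 2,
) -> TaskRun:
    """Let the cheap worker do the task; consult the advisor only on request."""
    run = TaskRun()
    advice: Optional[str] = None
    while True:
        answer, wants_advice = worker(task, advice)
        run.cost_usd += WORKER_COST_USD
        # Stop when the worker is satisfied or the advisor budget is spent.
        if not wants_advice or run.advisor_calls >= max_advisor_calls:
            run.answer = answer
            return run
        advice = advisor(task)
        run.cost_usd += ADVISOR_COST_USD
        run.advisor_calls += 1


def stub_worker(task: str, advice: Optional[str]) -> Tuple[str, bool]:
    """Stand-in for the open model: asks for help once on 'hard' tasks."""
    if "hard" in task and advice is None:
        return "", True
    suffix = " (advised)" if advice else ""
    return f"draft for {task!r}{suffix}", False


def stub_advisor(task: str) -> str:
    """Stand-in for the frontier model acting as advisor."""
    return f"guidance for {task!r}"


tasks = ["hard indemnity clause review", "simple NDA summary"]
runs = [run_hybrid(t, stub_worker, stub_advisor) for t in tasks]
avg_advisor_calls = sum(r.advisor_calls for r in runs) / len(runs)
total_cost = sum(r.cost_usd for r in runs)
print(f"advisor calls per task: {avg_advisor_calls:.2f}, total cost: ${total_cost:.2f}")
```

Averaging `advisor_calls` over a task set yields the same kind of metric as the "0.83 times per task" figure in the post, and the cost tally shows why a sparingly used advisor keeps the total well below an all-frontier run.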
Posts from X
Sean Cai@SeanZCai

Adding Harvey to the list of app-layer companies (Ramp, Sierra, Decagon, etc.) that are devoting dedicated effort to joining Cursor in approaching positive gross margins by decoupling themselves from frontier model providers as the marginal cost of post-training goes down.

The marginal unit of intelligence from the next model release, while valuable, is being supplanted by three things: harness-level differences in model performance; a growing ability to traverse the performance-cost-latency Pareto curve thanks to post-training infra advancements; and the simple fact that when one runs practical benchmarks (and not all the toy benchmarks around nowadays), GLM and the latest MiniMax are at parity with or exceed frontier models on an absolute basis on many tasks (more on this in state of data May, a bit delayed).

Ofc ik Harvey has been toying with RLaaS vendors for a while, and their fine-tuning efforts before the big RL wave weren't incredibly well received, but I generally find that most app-layer AI companies with some elite engineering talent will be seriously exploring post-training their own small models, at least in conjunction with systems that use frontier models as above-head orchestrators.

That some of them reach out to me for advice on procuring RL datasets from RL env companies is reinforcing evidence of that.

2h · Views 16.8K · Likes 118 · Bookmarks 98
Retweets 10 · Replies 17
clem@ClementDelangue

Routing and post-training open-source models won't only give you more accurate systems but also meaningfully faster and cheaper systems as most companies are currently learning (in addition to giving you more control and privacy).

The idea that a "frontier" model (by frontier we mean slightly more accurate on a few very limited benchmarks) will be better for all domains, all tasks, all setups just doesn't hold up! It's marketing for making you pay more!

2h · Views 7.7K · Likes 92 · Bookmarks 27