
DeepSeek V4 Flash Matches Opus Verifier Performance At 1000x Lower Cost

Harrison Chase

nice post from @Harvey and @LangChain Labs, worth a read. improve agent feedback loop without setting $$ on fire

Harvey (@harvey)

Can we design legal agent verifiers that are up to 1,000x cheaper?

Verifiers are LLM judges that check an agent’s work against rubric criteria: they're used both in agent benchmarking and as a reward signal in post-training.

But verifiers can be a bottleneck at scale.

For example, our Legal Agent Benchmark (LAB), comprising 1,200+ legal tasks across 24 different practice areas, requires grading an average of 50+ rubric criteria per answer.

We partnered with @LangChain Labs to design more efficient verifiers for LAB, comparing batch vs per-criterion scoring and open/cost-efficient models against Opus 4.7.
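The difference between the two scoring modes can be sketched as follows. This is a minimal illustration, assuming per-criterion scoring means one judge call per rubric criterion and batch scoring means one call covering all criteria for an answer; the `Answer` type, the prompt wording, and the helper names are hypothetical and not taken from the Harvey or LangChain work.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical container for one agent answer and its rubric."""
    task_id: str
    text: str
    criteria: list[str]


def per_criterion_prompts(answer: Answer) -> list[str]:
    # Assumed per-criterion mode: one judge prompt (one LLM call) per criterion.
    return [
        f"Answer:\n{answer.text}\n\nCriterion: {c}\nReply PASS or FAIL."
        for c in answer.criteria
    ]


def batch_prompt(answer: Answer) -> str:
    # Assumed batch mode: a single judge prompt listing every criterion.
    numbered = "\n".join(
        f"{i}. {c}" for i, c in enumerate(answer.criteria, start=1)
    )
    return (
        f"Answer:\n{answer.text}\n\nCriteria:\n{numbered}\n"
        "Reply with one PASS or FAIL per numbered criterion."
    )


# Illustrative answer with 50 placeholder criteria (the post cites 50+ per answer).
answer = Answer(
    task_id="demo",
    text="The clause is enforceable because ...",
    criteria=[f"criterion {i}" for i in range(1, 51)],
)

calls_per_criterion = len(per_criterion_prompts(answer))  # one call per criterion
calls_batch = 1                                           # one call per answer
```

Under these assumptions, batch mode sends the answer text to the judge once instead of once per criterion, which is one plausible source of the cost gap between the two modes.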

The results were surprising:

DeepSeek v4 Flash preserved much of the Opus 4.7 verifier signal, with 94-96% agreement across batch and per-criterion modes.
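The post does not say how agreement was computed. A minimal sketch, assuming it means the fraction of rubric criteria on which the two verifiers return the same pass/fail verdict; the function name and the verdict lists are made up for illustration and are not data from the benchmark.

```python
def agreement_rate(reference: list[bool], candidate: list[bool]) -> float:
    """Fraction of criteria where both verifiers give the same verdict."""
    if len(reference) != len(candidate):
        raise ValueError("verdict lists must cover the same criteria")
    if not reference:
        raise ValueError("need at least one criterion")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


# Made-up verdicts for 10 criteria (True = criterion satisfied).
reference_verdicts = [True, True, False, True, False, True, True, False, True, True]
candidate_verdicts = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(reference_verdicts, candidate_verdicts)
print(f"agreement: {rate:.0%}")
```

Raw agreement of this kind does not correct for chance or class imbalance, so it may differ from whatever metric the authors actually report.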

This came with a massive reduction in cost: 18x cheaper on per-criterion verification, and ~1,000x cheaper on batch verification.

In an RL setting with 3,200 rollouts, the cost of verification drops from $18,000 to $18.
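The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted in the post; the per-rollout costs are derived values that assume cost is spread evenly across rollouts.

```python
ROLLOUTS = 3_200
COST_BEFORE_USD = 18_000  # verification cost quoted before the change
COST_AFTER_USD = 18       # verification cost quoted after the change

# Overall reduction factor implied by the two totals.
reduction = COST_BEFORE_USD / COST_AFTER_USD

# Derived per-rollout verification cost, assuming an even split.
per_rollout_before = COST_BEFORE_USD / ROLLOUTS
per_rollout_after = COST_AFTER_USD / ROLLOUTS

print(f"reduction: {reduction:,.0f}x")
print(f"per rollout: ${per_rollout_before:.3f} -> ${per_rollout_after:.6f}")
```

The implied factor is 1,000x, which lines up with the batch-verification figure quoted in the post.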

5:17 PM · Jun 3, 2026 · 1.7K Views