
Creator Patrick Jiang launches Harness-1, an open-source 20B search agent claiming to beat GPT-5.4 on long-horizon search

It achieves 73% average evidence recall across eight benchmarks

Original post · Ben (no treats) #972

Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.

> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4

> Context-1-level cost and latency

> externalizes candidates, evidence, verification, and search history

> open-source

9:34 AM · Jun 6, 2026 · 45.9K Views
Sentiment

Many users are excited about the open-source Harness-1 20B agent matching frontier search performance because it delivers strong results with lower costs, reduced latency, and externalized memory.

Positive: 93.7% · Negative: 6.3% (16 comments with sentiment)

[1/N] I’ve been wondering:

maybe search agents are bad at search partly because we make them do all the paperwork in their head.

So I tried a simple idea:

externalize the search state, then train the model to use that harness.

The result is Harness-1: a 20B search agent that can match or even beat much larger frontier AI on hard long-horizon search tasks.

7h · 915 Views · 11 Likes · 4 Bookmarks

Paper 📄: https://arxiv.org/abs/2606.02373
Code 💻: https://github.com/pat-jj/harness-1
Model 🤗: https://huggingface.co/pat-jj/harness-1
HF Paper: https://huggingface.co/papers/2606.02373

7h · 544 Views · 6 Likes · 11 Bookmarks
All Over Tools (@UseAllOverTools)

@patpcj outperforms GPT-5.4? call me when GPT-5 ships. anyway, curious about the externalized evidence: last time i tried that with a 20B, the candidate log added 4k tokens per turn, tanking cost after 10 steps. have you benchmarked on anything beyond single-shot search?

3h · 253 Views
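
The cost concern in this reply can be made concrete with a little arithmetic. The sketch below is a rough illustration, not a measurement of Harness-1: it takes the 4k-tokens-per-turn figure from the reply, assumes the whole log is re-sent to the model on every step with no prompt caching, and compares that with a hypothetical log capped at 8,000 tokens. The function name and the cap value are assumptions made for this example.

```python
def cumulative_prompt_tokens(steps, tokens_per_turn, cap=None):
    """Total tokens read across all steps when the log is re-sent each turn.

    cap is a hypothetical limit on how much of the log stays in context.
    """
    total = 0
    log = 0
    for _ in range(steps):
        log += tokens_per_turn                      # the log grows every turn
        visible = log if cap is None else min(log, cap)
        total += visible                            # the model re-reads what is visible
    return total


# Uncapped log: 4000 * (1 + 2 + ... + 10) = 4000 * 55
append_only = cumulative_prompt_tokens(steps=10, tokens_per_turn=4000)

# Hypothetical cap of 8000 tokens: 4000 on the first step, then 8000 on the next nine
capped = cumulative_prompt_tokens(steps=10, tokens_per_turn=4000, cap=8000)

print(append_only)  # 220000
print(capped)       # 76000
```

Under these assumptions the uncapped log costs roughly three times as many prompt tokens over ten steps, which is why a context budget on the externalized state matters.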
Carlos E. Perez (@IntuitMachine)

LLMs specialized to improve harnesses. What's next? Harnesses to improve harnesses?

2h · 455 Views · 2 Likes · 3 Bookmarks

[5/N] Concretely, the harness keeps a working memory with: candidate docs, curated evidence, importance tags, search history, evidence links, verification records, dedup/compression, and context-budget markers.

So the agent is not just talking to a search box. It is operating over a workspace.

7h · 594 Views · 2 Likes · 2 Bookmarks
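
To make the working-memory idea in [5/N] concrete, here is a minimal sketch of what such a workspace could look like. It is not the Harness-1 implementation: the class names, fields, hash-based dedup, and whitespace token count are all assumptions chosen for illustration, and it covers only a subset of the listed components (candidates, curated evidence, importance tags, search history, verification flags, dedup, and a budget check).

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Evidence:
    doc_id: str
    text: str
    importance: str = "normal"  # hypothetical importance tag
    verified: bool = False      # hypothetical verification record


@dataclass
class Workspace:
    token_budget: int
    candidates: dict = field(default_factory=dict)      # doc_id -> text
    evidence: dict = field(default_factory=dict)        # doc_id -> Evidence
    search_history: list = field(default_factory=list)  # queries issued so far
    _seen_hashes: set = field(default_factory=set)      # content hashes for dedup

    def record_search(self, query, results):
        """Log the query and add unseen documents; return the ids newly added."""
        self.search_history.append(query)
        added = []
        for doc_id, text in results:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in self._seen_hashes:
                continue  # identical content already stored under another id
            self._seen_hashes.add(digest)
            self.candidates[doc_id] = text
            added.append(doc_id)
        return added

    def curate(self, doc_id, importance="normal"):
        """Promote a candidate to evidence (KeyError if it is not a candidate)."""
        self.evidence[doc_id] = Evidence(doc_id, self.candidates[doc_id], importance)

    def verify(self, doc_id):
        self.evidence[doc_id].verified = True

    def budget_used(self):
        # Crude whitespace word count standing in for a real tokenizer.
        return sum(len(e.text.split()) for e in self.evidence.values())

    def over_budget(self):
        return self.budget_used() > self.token_budget


ws = Workspace(token_budget=50)
new_ids = ws.record_search(
    "q", [("d1", "alpha beta"), ("d2", "alpha beta"), ("d3", "gamma")]
)
ws.curate("d1", importance="high")
ws.verify("d1")
print(new_ids)           # ['d1', 'd3'] because d2 duplicates d1's content
print(ws.budget_used())  # 2
```

The point of the sketch is that questions like "have I seen this document?" or "what have I verified?" become lookups on the workspace rather than something the model has to reconstruct from a transcript.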

[6/N] I think this changes what RL is actually learning.

Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface: search, curate, revisit, verify, and submit.

Much closer to how I’d want a search agent to work.

7h · 533 Views · 5 Likes · 1 Bookmark
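
The five actions named in [6/N] (search, curate, revisit, verify, submit) can be sketched as a small structured interface. This is a toy under stated assumptions, not the Harness-1 action space: the `Action` enum, the `step` function, the in-memory `CORPUS`, and the substring "search" are all hypothetical stand-ins, and the scripted trajectory plays the role a trained policy would play.

```python
from enum import Enum


class Action(Enum):
    SEARCH = "search"
    CURATE = "curate"
    REVISIT = "revisit"
    VERIFY = "verify"
    SUBMIT = "submit"


# Toy corpus standing in for a real search backend.
CORPUS = {
    "d1": "harness keeps search state",
    "d2": "append-only transcript",
    "d3": "state lives outside the model",
}


def new_state():
    return {"history": [], "candidates": [], "evidence": [],
            "verified": [], "submitted": None}


def step(state, action, arg=None):
    """Apply one structured action to the state and return an observation."""
    if action is Action.SEARCH:
        hits = [d for d, text in CORPUS.items() if arg.lower() in text]
        state["history"].append(arg)
        state["candidates"] += [d for d in hits if d not in state["candidates"]]
        return f"hits: {hits}"
    if action is Action.CURATE:
        if arg in state["candidates"] and arg not in state["evidence"]:
            state["evidence"].append(arg)
        return f"evidence: {state['evidence']}"
    if action is Action.REVISIT:
        # Re-read a known candidate by id instead of searching again.
        return CORPUS[arg] if arg in state["candidates"] else "unknown doc"
    if action is Action.VERIFY:
        if arg in state["evidence"] and arg not in state["verified"]:
            state["verified"].append(arg)
        return f"verified: {state['verified']}"
    if action is Action.SUBMIT:
        state["submitted"] = list(state["verified"])
        return "done"
    raise ValueError(f"unknown action: {action}")


state = new_state()
trajectory = [
    (Action.SEARCH, "state"), (Action.CURATE, "d1"), (Action.REVISIT, "d3"),
    (Action.CURATE, "d3"), (Action.VERIFY, "d1"), (Action.VERIFY, "d3"),
    (Action.SUBMIT, None),
]
for action, arg in trajectory:
    step(state, action, arg)
print(state["submitted"])  # ['d1', 'd3']
```

In this framing the policy only chooses the next `(action, arg)` pair, while the state transitions are ordinary code that can be inspected and logged.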

[4/N] Harness-1 tries to separate these two jobs.

The model still makes the semantic decisions: what to search, what to read, what to keep, what to verify, when to stop.

But the harness maintains the recoverable state around those decisions.

7h · 638 Views · 3 Likes · 1 Bookmark

[7/N] A fun part: this was not trained with a huge amount of task data.

Harness-1 uses 899 filtered SFT trajectories and RL on 3,453 queries.

The point is not “less data is always enough.”

The point is that a lot of the behavioral prior can live in the harness.

7h · 515 Views · 3 Likes · 1 Bookmark

[3/N] This gets especially weird for RL. The final reward can tell you whether the episode worked, but it often does not tell you why it failed.

Was it a bad search? Forgotten evidence? Missing verification? Poor curation? Or the agent just losing track of what it had already seen?

7h · 699 Views · 2 Likes · 1 Bookmark

Huge thanks to @trychroma for fully supporting this work, and to @tinkerapi for the training infra!

7h · 450 Views · 5 Likes · 1 Bookmark
Patrick Donohoe (@patrickdonohoe)

@patpcj Super cool project. I recently spoke at a conference about how smaller models with higher parameter density + reasoning ability, paired with external knowledge stores, are the future. Could be interesting to pair this with a web search API!

2h · 99 Views · 2 Likes · 1 Bookmark

[2/N] The usual search-agent setup is basically:

search → read → search → read → keep appending everything to the transcript.

At some point the model is not just “searching” anymore.

It is also being asked to be a memory system, a note taker, a verifier, and a librarian.

7h · 800 Views · 3 Likes
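
For contrast, the "usual setup" described in [2/N] can be sketched as a loop whose only state is a growing string. This is a deliberately naive illustration, not any specific agent framework: `fake_search` is a hypothetical stand-in for a real search tool, and the transcript format is made up for the example.

```python
def fake_search(query):
    # Hypothetical stand-in for a real search tool.
    return f"[results for '{query}'] " + "lorem " * 50


def append_only_episode(queries):
    """search -> read -> search -> read, appending everything to one transcript."""
    transcript = ""
    for q in queries:
        transcript += f"SEARCH: {q}\n"
        transcript += f"READ: {fake_search(q)}\n"
    return transcript


def already_searched(transcript, query):
    # With no separate search history, the only way to answer this
    # is to re-scan the transcript text.
    return f"SEARCH: {query}\n" in transcript


transcript = append_only_episode(["chroma", "harness", "chroma"])
print(transcript.count("SEARCH: chroma\n"))     # 2: the repeated query is stored twice
print(already_searched(transcript, "harness"))  # True
print(already_searched(transcript, "tinker"))   # False
```

Nothing in the loop stops the repeated query or its duplicate results from being appended again, so the bookkeeping falls on whatever reads the transcript, which in an agent is the model itself.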

[10/N] My takeaway:

for search agents, “the model” is not the whole learning system.

The interface matters. The memory layout matters. The action space matters. The harness matters.

If we want RL to teach better search behavior, we should probably stop making the model do all the paperwork in its head.

7h · 505 Views · 2 Likes

[9/N] The ablations were also pretty revealing.

When we disable the harness mechanisms, the model does not just lose some information.

It changes behavior: more shallow searching, less reading / verification, worse final curation.

So the harness is not just engineering glue.

7h · 432 Views · 2 Likes

[8/N] The result that made me most excited is transfer.

Harness-1 improves over Context-1 by +7.9 recall points on source-family benchmarks.

But on held-out transfer benchmarks, the gain is +17.0 points.

That’s the part that made the idea feel real to me.

7h · 453 Views · 1 Like
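
For readers unfamiliar with "recall points", the sketch below shows one common way such a number is computed. The thread does not give the paper's exact metric definition, so this assumes the standard set-based recall (the fraction of gold evidence documents that appear in the submitted set) and reports gains as differences in percentage points. The document ids are toy data for illustration only, not results from the paper.

```python
def evidence_recall(submitted, gold):
    """Fraction of gold evidence documents present in the submitted set."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold evidence set must not be empty")
    return len(set(submitted) & gold) / len(gold)


def recall_point_gain(baseline, improved):
    """Difference between two recall values, in percentage points."""
    return round(100 * (improved - baseline), 1)


# Toy example only: four gold documents and two hypothetical agents.
gold = ["a", "b", "c", "d"]
baseline = evidence_recall(["a", "x"], gold)       # finds 1 of 4
improved = evidence_recall(["a", "b", "c"], gold)  # finds 3 of 4

print(baseline)                               # 0.25
print(improved)                               # 0.75
print(recall_point_gain(baseline, improved))  # 50.0
```

A "+17.0 point" gain in this sense means the averaged recall percentage rose by 17.0, not that it rose by 17% relative to the baseline.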
Vikas Tiwari (@productpilotbb)

@patpcj Will it eat up @ExaAILabs ?

4h · 97 Views · 2 Likes

@patpcj Having an open-source option that actually hits those performance levels without the massive cost and latency is huge. Being able to see the full search history and evidence trail is a total upgrade.

4h · 167 Views · 1 Like
Elias Lumer (@EliasLumer)

@patpcj I wonder if there’s a more generalizable version of this. Great work, will check out the paper/GitHub

3h · 102 Views · 1 Like