Tech · 15h ago

Santiago Valdarrama says enabling reasoning in the Cline harness boosted GLM 5.2 coding performance by 11.2 percentage points

Coding task success rose from 57.3% to 68.5%.
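The two figures above can be compared in two ways, and replies in the thread mix them up. As a minimal sketch (the function name `compare_success_rates` is hypothetical and used only for illustration), the following computes the absolute gain in percentage points and the relative gain in percent from the two reported success rates:

```python
def compare_success_rates(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    # Absolute gain: plain difference between the two success rates.
    absolute_pp = after_pct - before_pct
    # Relative gain: the difference expressed as a share of the starting rate.
    relative_pct = absolute_pp / before_pct * 100
    return round(absolute_pp, 1), round(relative_pct, 1)


# Figures reported in the post: 57.3% (reasoning off) and 68.5% (reasoning up).
absolute_pp, relative_pct = compare_success_rates(57.3, 68.5)
print(f"{absolute_pp} percentage points, {relative_pct}% relative")
# prints "11.2 percentage points, 19.5% relative"
```

The 11.2 figure is therefore a gain in percentage points; relative to the 57.3% baseline, the same change is roughly a 19.5% improvement.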


#1192

Original post

Santiago@svpino · #1889 in Tech

Harnesses matter way more than people think.

Cline ran a couple of experiments on a set of coding tasks using GLM 5.2:

• 57.3% using their harness with reasoning turned off.
• 68.5% with their harness with reasoning turned up.

That's a difference of 11.2 percentage points! Same model, same set of problems. The difference stemmed from how the model was driven by the harness.

Current open-weight models are way more capable than we think. They aren't the bottleneck anymore.

We need better harnesses.

Cline@cline

We’ve been impressed with GLM-5.2 and so are introducing a $9.99/month subscription to give you 2-5x discounted access to it and other open weight models like DeepSeek, Kimi, MiniMax, Mimo, Qwen.

Use it on Cline CLI & IDE with $1.99 special promo if sign up via: npm i -g cline

9:14 AM · Jun 29, 2026 · 31.8K Views

Sentiment

Positive commenters cite Cline's experiments, in which GLM-5.2 coding success rose by 11.2 percentage points, as proof that harness design matters more than the raw model. Negative commenters call the tests unconvincing or just another excuse.

Positive: 64.3%

Negative: 35.7%

14 comments with sentiment.
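The percentages above can be converted back into comment counts. This is a rough sketch that assumes every one of the 14 comments was labeled either positive or negative (the two shares sum to 100%); the helper name `sentiment_counts` is hypothetical:

```python
def sentiment_counts(total: int, positive_pct: float) -> tuple[int, int]:
    """Recover (positive, negative) comment counts from a rounded percentage."""
    # Round to the nearest whole comment, since the published share is rounded.
    positive = round(total * positive_pct / 100)
    # Assumption: every comment is either positive or negative.
    return positive, total - positive


positive, negative = sentiment_counts(14, 64.3)
print(positive, negative)  # prints "9 5"
```

Under that assumption the split is 9 positive and 5 negative comments, which rounds back to the 64.3% and 35.7% shown.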

Posts from X

Most Activity

VIEWS 1.3K · LIKES 11

Cline@cline

@svpino We are dedicated to improving the harness and making open-weight models absolutely amazing to use.

15h1.3K111

BOOKMARKS 1

kenzo@codewithkenzo

@svpino for the people of Pi, you can still use your Cline sub models in Pi with https://github.com/codewithkenzo/pi-clinepass . Supports reasoning, auto-discovers models, and follows the Pi OAuth flow through the browser

11h16711

REPLIES 1

Santiago@svpino

@Akumunokokoro I can see how my wording is very confusing. Sorry about that.

12h122

Akumu@Akumunokokoro

@svpino Terrible comparison. Either keep reasoning turned on, or turned off, but use the same model with different harnesses. This reads like a cheap AI model wrote up the post without understanding anything about scientific rigor.

14h5568

Hugo@HugoPodw

@svpino 😂 What, lmao? Are you telling me Cline had something to do with reasoning being turned off and then turned on?

14h2013

Akumu@Akumunokokoro

@svpino Your post says they used the same harness, same model, and reasoning turned on or off. That's not a useful test when you're validating the harness itself.

13h943

Joao Marcos@joaomviso

@svpino I don’t get it. The only difference was turning reasoning on.

14h2646

Santiago@svpino

@Akumunokokoro The experiment uses the same model with different harnesses, as you suggest.

13h4951

nbtb@nbtb_lab

@svpino What is the best way to use it / test it?

15h118

Rahul Raj@rahulrajsahay

@svpino Share the link to the article. Your post just shows that GLM 5.2 and Cline work better with reasoning turned on.

That's true of almost all reasoning models.

DeepSeek V4 Flash with max reasoning outperforms DeepSeek V4 Pro with a lower reasoning budget.

https://benchlm.ai/models/deepseek-v4-flash

12h67

Leon Cvetkovski@leon_cvetkovski

@svpino I think you need to work on the harness you used to write this post

13h1703

Hova@Hovavayo

@cline @svpino Hey @grok, does this company create models?

15h35

Walter Sobchak@WalterS05043553

@svpino Couldn't agree more. If governments are shutting down mythos and fable on proprietary harnesses like Claude and OpenAI, then there's a huge market in open source models with better harnesses. Cline is ahead of the game. Game over if they integrate images & other tools

15h445

Alpha Batcher@alphabatcher

@svpino Well, I should focus on Cline and their ClinePass

15h287

Surtur@Surtur

@svpino So now when the magic 8-ball gives us a response that’s incorrect we can say “you’re using the wrong harness” instead of “you’re prompting it wrong.”

15h205

Vandos ❓@__vandos__

This tracks with what http://Z.ai themselves show — GLM-5.2’s own benchmarks jump meaningfully between ‘high’ and ‘max’ reasoning effort settings. The harness point is real: a model that’s 6x cheaper than GPT-5.5 only delivers that value if the harness actually knows how to drive it, otherwise you’re just paying less for worse output.

14h143

tan@captainntan

@svpino that 11% jump is actually wild to see.

15h122

Aditya@adityavg13

@svpino We also need harnesses that are fun to look at and have a theme!

15h111

Silas Burke@silasburke

@svpino 11 points just from flipping reasoning is a weird amount

8h107