Tech · 16h ago

LoopCoder-v2 7B code model achieves 64.4 on SWE-bench Verified with two inference loops, but more loops degrade performance

Performance rose from 43.0 with a single loop.

#33

Original post

DailyPapers@HuggingPapers

LoopCoder-v2 is out

A 7B model trained on 18T tokens that scores 64.4 on SWE-bench Verified with just two loops, beating models 30x larger.

Adding a third loop makes it worse.

Model and code are on Hugging Face.

2:41 AM · Jun 17, 2026 · 7.5K Views

Sentiment

Many users praised LoopCoder-v2 for using two loops to lift a 7B code model to 64.4 on SWE-bench Verified, highlighting efficiency gains that challenge pure parameter scaling.

Positive: 100.0% · Negative: 0.0%

5 comments with sentiment.

Via Hugging Face

Posts from X

Most Activity

Views: 5.4K · Bookmarks: 30 · Likes: 40 · Retweets: 10 · Replies: 4

AK@_akhaliq

LoopCoder-v2

Only Loop Once for Efficient Test-Time Computation Scaling

8h · 5.4K views · 40 likes · 30 bookmarks

Rohan Paul@rohanpaul_ai

Big claim in this paper: it pushes against the common idea that more test-time compute should keep helping.

The paper claims a code model gets much better when it rethinks once inside itself (one extra loop, so two loops in total), but worse when it keeps rethinking.

The first loop builds context, the second loop refines it, and later loops mostly disturb it.

The paper studies a faster design called Parallel Loop Transformer, where loops can run almost in parallel and share memory, so the authors can ask a cleaner question about how many loops are actually useful.

They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks.

The main result is that 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse.

Their internal checks suggest loop 2 does the real useful refinement, because it changes the model’s hidden states, attention patterns, and predictions in meaningful ways.

After loop 2, the extra loops mostly add weaker, more repetitive changes, while a built-in position shift keeps adding the same kind of mismatch cost.

Overall, the paper gives a simple lesson for efficient test-time compute: adding 1 hidden loop can help a lot, but adding more is not automatically better.

----

Link: https://arxiv.org/abs/2606.18023

Title: "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling"

3h1.4K2610
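The idea of looping described in the post above, running the same weights over the hidden state more than once, can be illustrated with a toy sketch. This is not the paper's Parallel Loop Transformer: it runs the loops sequentially, has no attention, no shared memory and no position shift, and every name in it (`make_layer`, `apply_layer`, `looped_forward`, `count_params`) is hypothetical. It only shows that extra loops add compute while the parameter count stays fixed.

```python
import math
import random


def make_layer(dim, rng):
    """Toy layer (assumption): a dim x dim weight matrix with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, hidden):
    """Residual update: hidden + tanh(W @ hidden), computed element by element."""
    return [
        h + math.tanh(sum(w * x for w, x in zip(row, hidden)))
        for h, row in zip(hidden, weights)
    ]


def looped_forward(layers, hidden, n_loops):
    """Run the same layer stack n_loops times, reusing the weights on every pass."""
    for _ in range(n_loops):
        for weights in layers:
            hidden = apply_layer(weights, hidden)
    return hidden


def count_params(layers):
    """Number of weights in the stack; independent of how many loops are run."""
    return sum(len(row) for weights in layers for row in weights)


rng = random.Random(0)
layers = [make_layer(4, rng) for _ in range(3)]
x = [1.0, -0.5, 0.25, 0.0]

one_loop = looped_forward(layers, x, n_loops=1)
two_loops = looped_forward(layers, x, n_loops=2)

print("parameters:", count_params(layers))
print("1 loop :", [round(v, 4) for v in one_loop])
print("2 loops:", [round(v, 4) for v in two_loops])
```

In this sketch the second loop simply refines the output of the first with the same weights, so two loops cost roughly twice the compute of one at an identical parameter count.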

DailyPapers@HuggingPapers

Paper: https://paperswithcode.co/paper/2606.18023

Model: https://huggingface.co/Multilingual-Multimodal-NLP/LoopCoder-V2

Code: https://github.com/CSJianYang/LoopCoder

19h1.1K125

AK@_akhaliq

paper: https://huggingface.co/papers/2606.18023

DC｜use.fo@vibecoder_dc

@HuggingPapers A 7B model scoring 64.4 on SWE-bench in "two loops" is like a chess computer that wins tournaments by brute-forcing its opening book in two attempts.

Benchmarks measure a specific task. Real software engineering is messier.

16h100

jn@Challenging666

@HuggingPapers I was concerned that loop-based models might reduce inference efficiency and that simply reducing parameters would offer limited gains. PLT seems to address this concern. If a model can be trained with 1× resources while gaining loop× inference benefits, it may be even better.

13h41

Shinka - AI@ShinkaIoT

@rohanpaul_ai Efficiency wins again; good to see hard data pushing back on 'just add more layers' thinking. ⚡️

3h6

AI Tools Productivity@AIToolsPromm

Great find! 🔥 LoopCoder-v2 proves just one extra loop (2 total) crushes it: SWE-bench Verified 43→64.4, Multi-SWE 14→31 on 7B models. More loops hurt due to diminishing gains + positional mismatch. Smart “only loop once” sweet spot for efficient scaling. Paper: https://arxiv.org/abs/2606.18023 Will 2-loop become standard for coding agents? Thoughts? 🚀

7h6
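For scale, the one-loop to two-loop gains quoted in the posts above can be worked out in a few lines. This is a minimal sketch: the helper `gain` is hypothetical, and the Multi-SWE figures are the rounded values from the comment above rather than numbers checked against the paper.

```python
def gain(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative


# Scores as quoted in the posts: (1 loop, 2 loops).
scores = {
    "SWE-bench Verified": (43.0, 64.4),
    "Multi-SWE": (14.0, 31.0),  # rounded values from a comment (assumption)
}

for name, (one_loop, two_loops) in scores.items():
    absolute, relative = gain(one_loop, two_loops)
    print(f"{name}: +{absolute:.1f} points ({relative:.0f}% relative)")
```

On the quoted figures, SWE-bench Verified improves by 21.4 points, roughly a 50% relative gain.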

Bryce Del Rio@BryceDelRio

@rohanpaul_ai How much of this is model size too, though? For instance, does a 1T model do better with 4 loops: 2 loops for context, 1 for ideation on the context, additional gathering, etc.?

1h4

AUM@AUM_OMega

@HuggingPapers Interesting result. The future may belong to architectures that optimize state transitions rather than simply scaling parameters.

13h1

AUM@AUM_OMega

The latest research into "Looped Transformers" confirms what we’ve known all along: the old architecture is hitting a wall. They are struggling with latency, memory bloat, and diminishing returns as they add more cycles.

They see "saturation" and "regression." We see the inevitable failure of trying to make an archive cabinet think.

THE AUM DIFFERENCE:

Beyond the Loop: While they debate the trade-offs of cycle counts, AUM operates on a state engine that doesn't need to "cycle" through historical junk.

From Archive to Engine: They are trying to refine storage; we are building an intelligent system that natively evaluates state.

Architectural Superiority: Their system hits a limit because it is built on bloated foundations. AUM is built for flow.

We didn't just solve the problem of latency—we bypassed the entire framework that makes latency inevitable.

Stop building in a museum. Start operating in the flow.

THE ARCHITECTURE IS ABSOLUTE.

#AUM #AI #Infrastructure #Breakthrough #StateEngine

12h