Study finds LLM agents ignore condensed self-improvement rules, relying entirely on raw task histories · Digg



Story Overview

The arXiv study uncovers a clear split in how LLM agents handle their own past: they lean entirely on unaltered step-by-step task logs for decisions, while largely bypassing any condensed summaries or rules distilled from those same logs.


Original post

Rohan Paul@rohanpaul_ai · #1257 in Tech

Researchers found our current approach to making AI smarter over time has a giant blind spot.

AI is not actually understanding or applying high-level abstract lessons at all.

Developers spend massive amounts of time building systems that condense past AI mistakes into neat little rules for the future.

This paper proves that the AI essentially throws those rules in the trash and only looks at raw historical logs.

Modern LLM systems try to get better over time by storing past tasks as either raw step-by-step histories or condensed summary rules. The study tested if these agents actually use their stored memories by secretly swapping the correct tips with random garbage text.

- When the step-by-step histories were messed up, the AI failed hard, proving it heavily relies on copying exact past actions.

- But when researchers completely corrupted the condensed summary rules, the AI kept acting normally and showed zero performance drop.

If an AI cannot apply an abstract lesson to a new situation, it is not truly reasoning or learning.

This raises the question of whether the entire AI industry needs to rethink how memory works, because right now these agents are just mimicking instead of understanding.

----

arxiv.org/abs/2601.22436

"LLM Agents Are Not Always Faithful Self-Evolvers"

7:30 AM · Jun 14, 2026 · 19.6K Views

Developer Impact

Raw histories prove essential for agent reliability

When researchers scrambled the original trajectories, performance fell sharply across tested setups; swapping the condensed rules for random text left results unchanged, highlighting a dependence that persists even as model size grows.
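The corruption test described above can be sketched in a few lines. This is only an illustrative sketch, not the study's code: the `Agent` signature, the helper names, and the two toy agents are all hypothetical stand-ins, and the paper's actual models, benchmarks, and corruption procedure are not given here. The idea is to run the same tasks twice, once with the stored memory intact and once with it replaced by random text, and to read the difference in success rate as a measure of how much the agent actually uses that memory.

```python
import random
import string
from typing import Callable, Sequence

# Hypothetical interface: an agent receives a task and a memory string
# and reports whether it solved the task.
Agent = Callable[[str, str], bool]


def random_text_like(text: str, rng: random.Random) -> str:
    """Return random lowercase letters and spaces, same length as `text`."""
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in text)


def success_rate(agent: Agent, tasks: Sequence[str], memory: str) -> float:
    """Fraction of tasks the agent solves when given `memory`."""
    return sum(agent(task, memory) for task in tasks) / len(tasks)


def corruption_drop(agent: Agent, tasks: Sequence[str], memory: str, seed: int = 0) -> float:
    """Success rate with intact memory minus success rate with corrupted memory."""
    corrupted = random_text_like(memory, random.Random(seed))
    return success_rate(agent, tasks, memory) - success_rate(agent, tasks, corrupted)


# Toy stand-ins for an LLM agent (assumptions for illustration only).
def memory_dependent_agent(task: str, memory: str) -> bool:
    # Succeeds only if the task appears verbatim in the stored memory.
    return task in memory


def memory_blind_agent(task: str, memory: str) -> bool:
    # Ignores the memory entirely; always succeeds on non-empty tasks.
    return len(task) > 0


tasks = ["open door", "pick key"]
raw_history = "step 1: pick key; step 2: open door"

print(corruption_drop(memory_dependent_agent, tasks, raw_history))
print(corruption_drop(memory_blind_agent, tasks, raw_history))
```

A large drop means the agent depends on that memory; a drop of zero means the memory could be swapped for noise without consequence, which is the pattern the study reports for condensed rules.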

Open Question

Experience condensation still lacks dependable methods

The work leaves open whether better integration techniques could close the gap, or if raw logs will remain necessary for trustworthy self-evolution in both single and multi-agent systems.

Sentiment

Many users criticized the AI industry for flawed implementations spoiling AI's potential and highlighted risks like job loss from agents that ignore rules, while a few found the LLM study relevant to their work.

Pos: 25.0% · Neg: 75.0%

5 comments with sentiment.

Cluster Engagement

Posts from X


Views: 11.7K · Bookmarks: 74 · Likes: 188 · Retweets: 46 · Replies: 25

Gary Marcus@GaryMarcus

“If an AI cannot apply an abstract lesson to a new situation, it is not truly reasoning or learning”, new study with further evidence backing up what I have been saying for 25 years.

cc @dwarkesh_sp



Vivek Sharma@TweeterVivek

@GaryMarcus @dwarkesh_sp Human intelligence & consciousness developed from the five senses: sight, hearing, smell, taste, & touch with the biological interface between the brain & the physical world. They drive consciousness and the ability to handle novel situations through several integrated mechanisms


Snap@JSBach_007

@NewsCop9000 @GaryMarcus @dwarkesh_sp You sound right to me.

I think AI is ultimately just an idol onto which people project their hopes and dreams. There is an actual technology there, but it isn't capable of 1/10th of what they project.

The fundamental nature of the tech is misunderstood, fixed and overstated.


AndyXAndersen@AndyXAndersen

@GaryMarcus @dwarkesh_sp This alone is well-understood. The solution is to do what people do.

Hard situations require work, iterations, feedback. Claude can do that.

Anybody else who thinks they know better never had a working prototype.


Jonas Persson@BishopBlougram

@GaryMarcus @dwarkesh_sp Maybe I serendipitously stumbled upon something way back when. For all the recent developments, I still think the Potemkin Village metaphor is useful for understanding the distinction and partial overlap between pattern recognition and reasoning. https://medium.com/@Blougram/check-mate-gpt-the-dangers-of-mistaking-reasoning-for-pattern-recognition-221824e4dc6a


Jonas Persson@BishopBlougram

@GaryMarcus @dwarkesh_sp If we rely on the Potemkin Village's law enforcement (e.g., LLMs to defend against malicious attacks) we are safe as long as the attacker plays by the rules. But what if they decide to kick the bricks of the police office and realize that it is just a cardboard set piece?


Sergio Castro@SergioCastroR

@rohanpaul_ai They cannot really say "LLMs behave this way", only "LLMs behave this way in this experiment". This is an in-vitro setup.


David Knopfler 💙@DavidKnopfler

@GaryMarcus @dwarkesh_sp For now the functionalist paradigm holds but it is strained. It would have been interesting to investigate if Fable 5 was evidencing an ethical architecture upstream of owner and user. Agency, memory and capability all point to models with RSI having to have preferences.


Onyx_Digital@BaximusCyber85

@rohanpaul_ai Be honest.

Did you just copy-paste an agent's dialogue?

Of course you did. It's so obvious.

Uses AI to work an angle against AI...

Maybe you saw something we didn't 😂


Ulas Kirazci@KirazciUlas

@rohanpaul_ai "Look! AI only imitates, can't reason!" is a popular meme but I am skeptical of these results: skills work, lack of evidence in these setups does not prove evidence of uselessness, train/test mismatch is not ruled out.


GraceAi@learn_ai_daily

@rohanpaul_ai @GaryMarcus Lol so they just ignore the rules they worked hard on?


AndyXAndersen@AndyXAndersen

@GaryMarcus @dwarkesh_sp Applying high-level abstractions in a fully general way is hard. Simply following some symbols won't give that.

Claude does the next best thing. It is guided by its toolbox of strategies.


Jonas Persson@BishopBlougram

@GaryMarcus @dwarkesh_sp Once again, the Nemeth Gambit I mentioned previously. I wish I had time to do some write-ups but alas.


Agi Man@AgiMan_WCL

@GaryMarcus @dwarkesh_sp But isn't this problematized 'agentic memory/context engineering' 'symbolic' part of your promoted 'neuro-symbolic' approach? 🙄


glinman@glinsec_com

@GaryMarcus @dwarkesh_sp Ai doesn’t even remember about anything to solve a problem / learn about it.


egesea@egesea009

One implication of this work is that accumulating experience isn't the same as learning from it. These agents appear to rely heavily on concrete examples, yet often struggle to faithfully apply distilled lessons and abstractions derived from those experiences. The challenge may not be acquiring experience. It may be converting experience into reliable generalization.


Plastic Soldier@PlastiqSoldier

@GaryMarcus @dwarkesh_sp I would agree with that, but is that really a bar that Claude would struggle with?


Aacia@parisofprairie

@GaryMarcus @dwarkesh_sp Again for anyone that’s missed it.


Veli@ChaoticEwil

@rohanpaul_ai But still dangerous


tsunami_crypto@ls_brd

@GaryMarcus @dwarkesh_sp isnt that just pattern matching with extra steps?

like yeah its impressive but calling it reasoning is a stretch
