/Tech · 3h ago

Long-Horizon RL Seen As Key Fix For LLM Taste And Experience Gaps

#770

Original post

Lisan al Gaib @scaling01 · #770 in Tech

long-horizon RL is honestly the answer to pretty much everything

humans acquire taste through experience

but humans are also cheeky little continuously learning beings, where the effective ratio of RL to pre-training is off the charts

meanwhile most LLMs are probably below a ratio of 10:1

Lisan al Gaib @scaling01

after using GPT-5.5-xhigh for the past week for my research project I'm much less bullish on RSI

models are not opinionated and have 0 taste, like they just return training eigenslop

3:54 PM · Jun 8, 2026 · 6.3K Views

Cluster Engagement

Posts from X

Most Activity

Views 1.2K · Bookmarks 3 · Likes 6 · Replies 1

Lisan al Gaib @scaling01

humans never stop doing RL which is why old people are wise

they have learned a tremendous amount through RL. they literally have 80 years of rollouts

even the Wikipedia definition of wisdom includes experiential knowledge

3h · 1.2K · 6 · 3

haro @harobuilds

@scaling01 the ratio framing is interesting but taste isn't just accumulated RL signal. it's also knowing when to stop updating. models that overfit to feedback lose the thing you were trying to train in the first place

3h231

Lunari @0x_lun

@scaling01 the real problem is you're trying to give a frozen snapshot a sense of taste

humans cheat by just living longer

3h25

Rugbist @rugbist_

@scaling01 the framing of RL-to-pretraining ratio being off the charts for humans vs models is the part that sticks

what's the floor for a model to even start developing taste?

3h19

X Girls @thesoragirls

@scaling01 Just say "lol" at the end of prompts and the model will start being more creative/original. This should be peer-reviewed because it's extremely effective. Even when doing intensive coding tasks with 5.5-xhigh

3h17