Prime Intellect's Elie Bakouch asks whether perceived AI model improvements reflect real progress or human psychology being reward hacked

Nathan Lambert says human intuition is critical for model evaluation.

Original post

elie@eliebakouch

i wonder how much of "model improvement perception" (and model hype) is just human psychology being reward hacked

for instance i kinda miss fable (i used it for like 1 day) and i find some opus outputs dumb, and i genuinely have no idea if it's a real difference or just me being reward hacked

4:43 PM · Jun 14, 2026 · 5.5K Views

Sentiment

Positive replies describe Fable as a significant upgrade over Opus, citing real performance gains from hands-on use, while negative replies call Opus annoying or overly restricted.

Positive: 62.5%, negative: 37.5% (8 comments with sentiment).

Posts from X

Nathan Lambert@natolambert

@eliebakouch trust your intuitions. its the part why you'll never be fully replaced.

kalomaze@kalomaze

@eliebakouch something something hedonic treadmill

if you showed a typical SWE from mid 2018 sonnet4.6 and said something like, "yeah so you pay like... ~$2 per hundred thousand words on the output it writes, ~$0.40 per hundred thousand words on the input..."

kalomaze@kalomaze

@eliebakouch most basic assumptions about technology excluding "ML at scale", from a 2018 pov, are the exact same ones you'd use in 2026

all the programming languages in active use? basically the same

i think USB-C wasn't dominant yet, no M-Series macbooks, but, that's all

Raveesh 折図@raveeshbhalla

@eliebakouch Sometimes it’s just novelty effect too - you get annoyed by certain model tics, and if something else is sufficiently different, even if overall performance is basically the same, you just prefer it… until that wears off too

Norbert@_Norbertso

@eliebakouch Would be cool if there was like a "blind test" where you interact with an LLM and guess what model it is, this could be a cool mini game

Allison Intelligence (AI)@allisonology

@eliebakouch Opus outputs are much more annoying than fable imo. I'm >98% on it being a real difference. In case that's helpful.

Charles Foster@CFGeek

@eliebakouch Both!

Qui Vincit@vincit_amore

I spent a solid week with Fable working on various ongoing projects for a minimum of 8hrs a day (sometimes 12-14), and I'm at >99% of it being a real difference. I could wax on and on about various metrics and feels, but it was a genuine stepwise improvement, without a doubt.

elie@eliebakouch

@raveeshbhalla yeah this is true

Deen Kun A.@sir_deenicus

@eliebakouch Fable is definitely a significantly better model than Opus 4.8. But it's not too much better than GPT5.5, though. Other than actually having a personality (GPT5.5 is a dry savant, neuro-modified with Focus from `A Deepness in the sky`).

Adam Karvonen@a_karvonen

@eliebakouch Based on my 2 days of use I'm >95% sure that Fable is much better

Raveesh 折図@raveeshbhalla

@eliebakouch Re:fable itself, a noticeable difference for me was just how aggressively proactive it was. Mention an issue or idea in passing and it’d start acting on it immediately. So it didn’t feel like a good “thought partner”. But it did genuinely seem better at working on tasks >30 mins

meowbooks@meowbooksj

@eliebakouch nah if you have experience one conversation was enough to know what you were dealing with.

Didier Lopes@didier_lopes

@eliebakouch I wonder how much is due to the frequency illusion bias too.

This is how I moved from Opus 4.8 to GPT-5.5. I started seeing some folks talking highly about GPT-5.5 and started noticing more dumb mistakes from Opus which before I wouldn't pay as much attention to

ns@anessbelbati

id say genuine real difference. I used fable for the 3 days it was out during a huge sprint (that was gonna happen regardless) and the difference between this and the past sprints where i used to mainly have codex and claude is genuinely there. I wouldn’t say it’s abysmal but it makes you genuinely faster

Niklas Sheth@niklassheth

@eliebakouch Opus has always been dumb, it's a personality hire

Saylor@seylorra

@eliebakouch this is the same feeling as hating a song the week it releases then loving it 3 months later

are we just novelty-biased or is there actually a drop off?

Rohan@proxy_vector

@eliebakouch A lot of model perception is probably setup effects: novelty, expectation, and the fact that we remember the 2 uncanny wins more than the 20 ordinary misses. Hard to separate capability from psychology without side-by-side blind evals on your own tasks.
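
For anyone who wants to try the side-by-side blind eval suggested in the reply above, here is a minimal Python sketch. It assumes you have already collected two lists of outputs for the same tasks, one list per model. All function names (`blind_pairs`, `tally`, `sign_test_p`) are hypothetical, and the exact sign test is just one reasonable way to check whether a preference is distinguishable from a coin flip, not something proposed in the thread.

```python
import random
from math import comb

def blind_pairs(outputs_a, outputs_b, seed=0):
    """Pair up outputs per task and randomly swap sides so labels stay hidden."""
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(outputs_a, outputs_b):
        flipped = rng.random() < 0.5
        left, right = (b, a) if flipped else (a, b)
        # left_is_a is the hidden key; do not show it to the rater.
        pairs.append({"left": left, "right": right, "left_is_a": not flipped})
    return pairs

def tally(pairs, choices):
    """Count wins per model from the rater's 'left'/'right' picks."""
    wins_a = sum((c == "left") == p["left_is_a"] for p, c in zip(pairs, choices))
    return wins_a, len(pairs) - wins_a

def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test against a 50/50 'no real preference' null."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With only a handful of tasks even a lopsided tally can be consistent with chance, so a small p-value here is a hint rather than proof, and it says nothing about why one output was preferred.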

𝑘𝑒𝑟𝑛𝑒𝑙𝑡𝑟𝑖𝑐𝑘'𝑑@kernel_trick

@eliebakouch same implication from harness improvements in last six months

Jessica Hunt@huntnp007

@eliebakouch Both are real. But the Fable thing isn't just perception. They restricted ML research, biology, chemistry by default. Kicked users to older models with zero warning. That's actual regression.
