AI · 22d ago

Sholto Douglas, an AI researcher at Anthropic, asks users when they reach for other models instead of Claude, and requests specific shortcomings and example transcripts.

AI Judge changed the title after evaluation. Original title: "Sholto Douglas solicits detailed feedback on Claude limitations"

Replies cited inaccurate task time estimates, weak test coverage, and verbose code comments. Some users reported a roughly even usage split with ChatGPT, which they prefer for stronger context retention.

Original post
Sholto Douglas @_sholtodouglas · #91 in AI

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model

7:21 PM · May 16, 2026 · 325.3K Views
Sentiment

Many users criticize Claude for blunt refusals and KV cache flushing that degrades reasoning, while others value its direct character training and honest feedback style.

Positive: 40.3% · Negative: 59.7% (206 comments with sentiment)
Posts from X
Most Activity
Views 1.2M · Bookmarks 1.3K · Likes 50K · Retweets 1.1K
Shub@shub0414

Bruh, who tf Claude think he is

22dViews 1.2MLikes 50KBookmarks 1.3K
Replies 398
jason@jxnlco

When do you reach for other models instead of Codex? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22dViews 164.1KLikes 806Bookmarks 127
Peter Yang@petergyang

Here's my new episode with @alexalbert__, who shared an inside look at how Anthropic is building the next Claude.

We talked about how the research team:

→ Plans for the model and harness together
→ Uses Claude to turn user feedback into evals
→ Trains Claude's character & personality

Some quotes from Alex:

"We use Claude to cluster user feedback, find top themes, and create synthetic versions of user problems that we then turn into evals."

"We need to think about how the model is exposed through all our surfaces, whether it's API or Claude Code or Cowork. The product has a blend with the model and that affects your end user's experience."

"As these things become agents running tasks for a long time and making judgment decisions, what its character is and what it cares about are very important."

📌 Watch now: https://youtu.be/T4ieZPIEmd8

Thanks to our sponsors:

@WisprFlow: Don't type, just speak https://ref.wisprflow.ai/peteryang

@oceanstalent: Hire AI-native executive assistants https://www.oceanstalent.com/peter

22dViews 59.7KLikes 141Bookmarks 164
Liora@iyzebhel

What do people do to get these responses? lol I love this side of him.

22dViews 86.9KLikes 884Bookmarks 45
Thariq@trq212

@jxnlco lmao

22dViews 93.3KLikes 787Bookmarks 25

btw we measured this in Memento: flushing your KV cache leads to measurably worse performance, no matter how good the model is

Please stop flushing the KV cache in Claude Code every x hrs of being idle. When i wake up and go back to a session that was running through the night, but stalled for whatever reason, Claude is noticeably far worse than resuming within the time frame of not flushing.

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22dViews 12.2KLikes 109Bookmarks 51

To make the KV cache thing concrete:

Setup A: active session, post-compaction. Sequence is <sys> <T1> <T2> [compaction] <summary> <T3>.

Compaction masks T1 and T2 but T3's KVs were computed in the presence of T1 and T2 still in the prefix. Every layer of T3's residual stream absorbed information from T1 and T2 directly, not via the summary. The KVs carry non-textual information.

Setup B: idle past TTL, recomputes KV states on <sys> <summary> <T3>. But the fresh forward pass only ever sees the summary. T3's KVs are now computed in an alternate history where T1 and T2 never existed.

This puts the model in a weird OOD position of simulating what happened to arrive at <summary> AND continuing on to T3. Which makes the model worse.
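
The poster's claim can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not Claude's architecture and not how Claude Code manages its cache: it uses a hypothetical two-layer causal attention stack with random weights, one embedding per "token", and no positional encoding. All names (`attention_layer`, `keys_per_layer`, the token labels) are made up for the example. It compares the keys computed for the `summary` token when T1 and T2 are in the prefix against the keys computed when they are not.

```python
import math
import random

D = 4  # toy hidden size (hypothetical)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attention_layer(xs, wq, wk, wv):
    """One causal self-attention layer with a residual connection.
    Returns the per-position outputs and the per-position keys."""
    qs = [matvec(wq, x) for x in xs]
    ks = [matvec(wk, x) for x in xs]
    vs = [matvec(wv, x) for x in xs]
    outs = []
    for i, q in enumerate(qs):
        # Position i may only attend to positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(D)
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) / total
                 for d in range(D)]
        outs.append([x + m for x, m in zip(xs[i], mixed)])
    return outs, ks

def keys_per_layer(tokens, embed, layers):
    """Run the toy stack and collect the keys produced at every layer."""
    xs = [embed[t] for t in tokens]
    all_keys = []
    for wq, wk, wv in layers:
        xs, ks = attention_layer(xs, wq, wk, wv)
        all_keys.append(ks)
    return all_keys

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
vocab = ["sys", "T1", "T2", "summary", "T3"]
embed = {tok: [rng.uniform(-1.0, 1.0) for _ in range(D)] for tok in vocab}
layers = [tuple(rand_matrix(rng, D, D) for _ in range(3)) for _ in range(2)]

full = ["sys", "T1", "T2", "summary", "T3"]   # summary computed with T1, T2 visible
short = ["sys", "summary", "T3"]              # recomputed after the cache is dropped
keys_full = keys_per_layer(full, embed, layers)
keys_short = keys_per_layer(short, embed, layers)

i_full, i_short = full.index("summary"), short.index("summary")
layer1_gap = max_abs_diff(keys_full[0][i_full], keys_short[0][i_short])
layer2_gap = max_abs_diff(keys_full[1][i_full], keys_short[1][i_short])
print("layer 1 key gap for summary:", layer1_gap)
print("layer 2 key gap for summary:", layer2_gap)
```

In this toy, the first-layer keys for `summary` are identical in both runs because they depend only on the token's own embedding. From the second layer on, the keys are computed from hidden states that have already attended to the prefix, so they differ once T1 and T2 are gone. Whether that difference degrades a real model is the poster's empirical claim and is not something this sketch can show.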

We measured this in Memento:

22dViews 10.5KLikes 92Bookmarks 47

Please stop flushing the KV cache in Claude Code every x hrs of being idle. When i wake up and go back to a session that was running through the night, but stalled for whatever reason, Claude is noticeably far worse than resuming within the time frame of not flushing.

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22dViews 33.4KLikes 152Bookmarks 32
Jeremy Howard@jeremyphoward

@_sholtodouglas I've stopped using Opus for brainstorming/strategizing, because it keeps wanting to jump to a conclusion at the end of every response. It's too confident it knows the answer every time. It makes it hard to have a back-and-forth.

Also, it's too expensive vs Codex 5.5 sub.

22dViews 10KLikes 218Bookmarks 15
kache@yacineMTB

HDD drives are going to the moon aren't they

22dViews 16KLikes 104Bookmarks 18
jason@jxnlco

@trq212 ahahahah

22dViews 13.7KLikes 141Bookmarks 2
Teknium 🪽@Teknium

@_sholtodouglas Uhh its not available thru sub in hermes agent that’s clearly number one lol

22dViews 2KLikes 67Bookmarks 1
jason@jxnlco

When do you reach for other models instead of Coded? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22dViews 1.2KLikes 20Bookmarks 1
Dan Shipper 📧@danshipper

@jxnlco claude is significantly better at front-design and nuanced thinking / argumentation

22dViews 1.8KLikes 21Bookmarks 8
Beff (e/acc)@beffjezos

I want a LaTeX editor, and Claude to be able to read docs at a coarse grained level.

It's good at editing segments, but terrible at reading the whole long document and achieving global coherence / flow.

Maybe a hierarchical doc chunking/compression for better writing would be good

22dViews 2KLikes 24Bookmarks 5
rohan anil@_arohan_

@_sholtodouglas Claude code on mobile. Standalone claude code app with the same aesthetics

22dViews 3KLikes 46Bookmarks 1
Peter Yang@petergyang

Also available on:

Spotify: https://open.spotify.com/episode/2Jja9cAAkCiUAvs5JB7PtB?si=g04NLCzcT72VO9YMC6F-EQ

Apple: https://podcasts.apple.com/us/podcast/behind-the-craft/id1736359687?ign-itscg=30200&ign-itsct=podtail_podcasts

Newsletter: https://creatoreconomy.so/p/inside-how-anthropic-is-building-the-next-claude

22dViews 2.2KLikes 6Bookmarks 5

@_sholtodouglas edit for clarification: The summary tokens were generated while T1 and T2 were in the prefix, so the summary's KVs already encode information from T1 and T2 beyond what the summary text literally says, and that leaks to T3

22dViews 2.6KLikes 19Bookmarks 2

Also when an experiment is not working out (the kind that i know beyond a reasonable doubt it should) Claude jumps to a hypothesis about why the whole thing is broken and why we should just abandon it. So frustrating:) these are experiments where the resolution of whatever we stumble upon is to just change a few hyperparams and retry.

I found 4.6 to have way more agency on these types of problems than 4.7, and to pursue longer-horizon attempts

22dViews 2.8KLikes 33Bookmarks 0
signüll@signulll

@AndrewCurran_ precisely. i just ran claude opus 4.7 on every piece of context on @skye. it was insane. we use cheaper models for certain things but running frontier is absolutely ridiculous.

Andrew Curran@AndrewCurran_

@signulll Imagine what a frontier model can infer a year from now by looking at it all at once.

22dViews 2.1KLikes 32Bookmarks 0