/AI · 22d ago

Sholto Douglas, an AI researcher at Anthropic, solicits detailed user feedback on cases where people prefer other models over Claude, along with specific shortcomings and example transcripts.

AI Judge changed title after evaluation, original title: "Sholto Douglas solicits detailed feedback on Claude limitations"

Replies cited inaccurate task time estimates, weak test coverage, verbose code comments, and roughly even usage splits with ChatGPT, which some users credit with stronger context retention.

--0--

Original post

Sholto Douglas@_sholtodouglas · #91 in AI

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

7:21 PM · May 16, 2026 · 325.3K Views

Sentiment

Many users criticize Claude for blunt refusals and KV cache flushing that degrades reasoning, while others value its direct character training and honest feedback style.

Pos: 40.3% · Neg: 59.7%

206 comments with sentiment.

Posts from X

Most Activity

Views 1.2M · Bookmarks 1.3K · Likes 50K · Retweets 1.1K

Shub@shub0414

Bruh, who tf Claude think he is

22d · 1.2M · 50K · 1.3K

Replies 398

jason@jxnlco

When do you reach for other models instead of Codex? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d164.1K806127

Peter Yang@petergyang

Here's my new episode with @alexalbert__, who shared an inside look at how Anthropic is building the next Claude.

We talked about how the research team:

→ Plans for the model and harness together
→ Uses Claude to turn user feedback into evals
→ Trains Claude's character & personality

Some quotes from Alex:

"We use Claude to cluster user feedback, find top themes, and create synthetic versions of user problems that we then turn into evals."

"We need to think about how the model is exposed through all our surfaces, whether it's API or Claude Code or Cowork. The product has a blend with the model and that affects your end user's experience."

"As these things become agents running tasks for a long time and making judgment decisions, what its character is and what it cares about are very important."

📌 Watch now: https://youtu.be/T4ieZPIEmd8

Thanks to our sponsors:

@WisprFlow: Don't type, just speak https://ref.wisprflow.ai/peteryang

@oceanstalent: Hire AI-native executive assistants https://www.oceanstalent.com/peter

22d59.7K141164

Liora@iyzebhel

What do people do to get these responses? lol I love this side of him.

Shub@shub0414

Bruh, who tf Claude think he is

22d86.9K88445

Thariq@trq212

@jxnlco lmao

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d93.3K78725

Dimitris Papailiopoulos@DimitrisPapail

btw we measured this in Memento: flushing your KV cache leads to measurably worse performance, no matter how good the model is

Dimitris Papailiopoulos@DimitrisPapail

Please stop flushing the KV cache in Claude Code every x hrs of being idle. When i wake up and go back to a session that was running through the night, but stalled for whatever reason, Claude is noticeably far worse than resuming within the time frame of not flushing.

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22d12.2K10951

Dimitris Papailiopoulos@DimitrisPapail

To make the KV cache thing concrete:

Setup A: active session, post-compaction. Sequence is <sys> <T1> <T2> [compaction] <summary> <T3>.

Compaction masks T1 and T2 but T3's KVs were computed in the presence of T1 and T2 still in the prefix. Every layer of T3's residual stream absorbed information from T1 and T2 directly, not via the summary. The KVs carry non-textual information.

Setup B: idle past TTL, recomputes KV states on <sys> <summary> <T3>. But the fresh forward pass only ever sees the summary. T3's KVs are now computed in an alternate history where T1 and T2 never existed.

This puts the model in a weird OOD position of simulating what was happed to arrive at <summary> AND continue on to T3. Which makes the model worse.

We measured this in Memento:
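Editorial illustration (not the Memento measurement, and not how Claude Code or any production model is implemented): a minimal toy sketch of the Setup A vs. Setup B distinction described in the post. It assumes a single attention layer with identity projections, so each token's output stands in for the state its next-layer keys and values would be derived from. The token vectors and the `causal_layer` helper are hypothetical, chosen only to show that the `<summary>` state depends on which prefix was present when it was computed.

```python
import math


def causal_layer(states):
    """Toy attention layer: each token attends to itself and all earlier tokens.

    Projections are the identity (an assumption for brevity), so a token's
    output doubles as the input its next-layer K/V would be computed from.
    """
    out = []
    for i, query in enumerate(states):
        prefix = states[: i + 1]
        scores = [sum(a * b for a, b in zip(query, key)) for key in prefix]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, prefix)) / total
            for d in range(len(query))
        ]
        # Residual connection: original state plus attention output.
        out.append([a + b for a, b in zip(query, mixed)])
    return out


# Hypothetical 2-d token vectors standing in for <sys>, <T1>, <T2>, <summary>.
SYS, T1, T2, SUMMARY = [1.0, 0.0], [0.0, 1.0], [0.5, -1.0], [1.0, 1.0]

# Setup A: the summary was processed while T1 and T2 were still in the prefix.
cached = causal_layer([SYS, T1, T2, SUMMARY])
summary_a = cached[3]

# Setup B: after the cache expires, the summary is recomputed with only <sys> before it.
recomputed = causal_layer([SYS, SUMMARY])
summary_b = recomputed[1]

# Largest per-dimension gap between the cached and recomputed summary states.
drift = max(abs(a - b) for a, b in zip(summary_a, summary_b))
print("summary state, Setup A:", summary_a)
print("summary state, Setup B:", summary_b)
print("max drift:", drift)
```

In this toy, the `<sys>` state is identical in both setups because it only ever attends to itself, while the `<summary>` state differs because in Setup A it mixed in T1 and T2. Anything attending to the summary afterwards (such as T3) therefore sees different inputs in the two setups, even though the visible text is the same.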

Dimitris Papailiopoulos@DimitrisPapail

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22d10.5K9247

Dimitris Papailiopoulos@DimitrisPapail

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d33.4K15232

Jeremy Howard@jeremyphoward

@_sholtodouglas I've stopped using Opus for brainstorming/strategizing, because it keeps wanting to jump to a conclusion and the end of every response. It's too confident it knows the answer every time. It makes it hard to have a back-and-forth.

Also, it's too expensive vs Codex 5.5 sub.

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d10K21815

kache@yacineMTB

HDD drives are going to the moon aren't they

Dimitris Papailiopoulos@DimitrisPapail

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22d16K10418

jason@jxnlco

@trq212 ahahahah

Thariq@trq212

@jxnlco lmao

22d13.7K1412

Teknium 🪽@Teknium

@_sholtodouglas Uhh its not available thru sub in hermes agent that’s clearly number one lol

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d2K671

jason@jxnlco

When do you reach for other models instead of Coded? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d1.2K201

Dan Shipper 📧@danshipper

@jxnlco claude is significantly better at front-design and nuanced thinking / argumentation

jason@jxnlco

When do you reach for other models instead of Codex? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d1.8K218

Beff (e/acc)@beffjezos

I want a LaTeX editor, and Claude to be able to read docs at a coarse grained level.

It's good at editing segments, but terrible at reading the whole long document and achieving global coherence / flow.

Maybe a hierarchical doc chunking/compression for better writing would be good

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d2K245

rohan anil@_arohan_

@_sholtodouglas Claude code on mobile. Standalone claude code app with the same aesthetics

Sholto Douglas@_sholtodouglas

When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open.

If you can give me detail (e.g. specifics/transcipts) - it'll help a lot in finding out exactly what we need to do to improve the next model

22d3K461

Peter Yang@petergyang

Also available on:

Spotify: https://open.spotify.com/episode/2Jja9cAAkCiUAvs5JB7PtB?si=g04NLCzcT72VO9YMC6F-EQ

Apple: https://podcasts.apple.com/us/podcast/behind-the-craft/id1736359687?ign-itscg=30200&ign-itsct=podtail_podcasts

Newsletter: https://creatoreconomy.so/p/inside-how-anthropic-is-building-the-next-claude

22d2.2K65

Dimitris Papailiopoulos@DimitrisPapail

@_sholtodouglas edit fro clarification: The summary tokens were generated while T1 and T2 were in the prefix so the summary's KVs already encode information from T1 and T2 beyond what the summary text literally says that leaks to T3

Dimitris Papailiopoulos@DimitrisPapail

To make the KV cache thing concrete:

Setup A: active session, post-compaction. Sequence is <sys> <T1> <T2> [compaction] <summary> <T3>.

This puts the model in a weird OOD position of simulating what was happed to arrive at <summary> AND continue on to T3. Which makes the model worse.

We measured this in Memento:

22d2.6K192

Dimitris Papailiopoulos@DimitrisPapail

Also when an experiment is not working out (the kind that i know beyond a reasonable doubt it should) Claude jumps to a hypothesis why the whole thing is broken and we why should just abandon it. So frustrating:) these are experiments where the resolution of whatever we stumble upon is to just change a few hyperparams and retry.

I found 4.6 to have way more agency on these types of problems than 4.7 and pursuing a longer horizon attempt

Dimitris Papailiopoulos@DimitrisPapail

Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.

22d2.8K330

signüll@signulll

@AndrewCurran_ precisely. i just ran claude opus 4.7 on every piece of context on @skye. it was insane. we use cheaper models for certain things but running frontier is absolutely ridiculous.

Andrew Curran@AndrewCurran_

@signulll Imagine what a frontier model can infer a year from now by looking at it all at once.

22d2.1K320