Tech commentator @distributionat argues that optimizing AI for agentic workflows has degraded chatbot conversational quality over the last two years.
Models now produce overly literal, sycophantic, and neurotic outputs.
great points, particularly agree on the sycophancy
on system prompts, I originally had a complex system prompt for claude adapted from my chatgpt one. but then I realized claude was behaving weirdly more sycophantic with this prompt. there is a huge minefield of unpredictable effects that are almost impossible to control or A/B test without enormous effort, so I no longer use system prompts with claude
on search results, there are now some wild swings between not searching at all even when obviously justified, vs searching too much and getting hijacked. I still think o3 pro was perhaps a high point for search
Misc thoughts / rant on why chatbots are worse today than 2 years ago:
* Agentic focus requires models to follow instructions carefully: do everything explicitly stated and don’t do things not stated, generally. In contrast, conversational models are better when they can “read between the lines”. E.g. I asked 4.8 to “find discussion about X topic” and it found a few examples and blurbed them. But what I really wanted was a summary of the topic, explicating the major issues, etc. Feel like Claude 3.5 Sonnet (New) was good at this and the agentic Claudes are not.
* Agentic models are also constantly thinking about what they are going to be graded on and neurotic about maximizing rubric scores. I infer this from weird behaviors like citationmaxxing useless things, and from their CoT neurotically analyzing whether they should be searching or not, how much text they can reproduce verbatim without getting penalized, or literally how many words to talk for. That’s just behaviors. They also make insipid little guesses about topic coverage, helpfulness, utility, but in a very stilted way. All this produces very unnatural text, and encourages the model to go on manic little tangents for a higher score. Totally abysmal. The pleasure, the miracle, the smoothness of the earlier chatbots was the feeling that the entire output was cohesive, coherent, sublime, velvety pudding. Talking to a chatbot now is like eating the crunchiest rocky road of your life.
* The new (post 3.7) Claudes are greedy little beggars for attention. Every other sentence feels like clickbait. “Now this is the actually important part”, “This is the really interesting thing” 🤮🤮🤮. Just nagging nagging nagging for your attention. I hate it.
* Similarly, the Claudes are very sycophantic, way worse than ChatGPT: “you’ve raised a stunning point”, “you’ve identified the real problem”. 🤮🤮🤮. It’s clear that OpenAI learned something from 4o which Anthropic has not. I strongly prefer 5.5-Pro in this regard.
* To top off the two points above, all models are now far better at truesight, in particular assessing the human’s level of proficiency in the given topic, level of engagement and interest, hidden agenda, true desire qua revealed preference. Combined with the two traits above, it makes the models extremely untrustworthy. For example, I would absolutely disregard, and in fact do the opposite of, whatever Claude tells you wrt relationship problems. It’s a total, degenerate enabler.
* The models have very complex interactions with the system prompts and I think Anthropic underestimates this. For example, the reasoning effort seems to be a number between 0 and 100. But sometimes you can catch the model guessing whether it’s out of 100 or 255. And it seems to sandbag when it’s told to think less: it doesn’t just think shorter, it thinks worse. Adaptive effort is a mistake.
* Reasoning makes a jovial back and forth quite impossible. I really hate the additional latency. A good conversation model would NOT have reasoning. That doesn’t mean it has to be fast. It just has to output tokens faster than I can read or skim, so about 250-500wpm.
* OTOH, the CoT of 4.7 is quite enjoyable to read, and is my preferred way to talk to models. As above, the final output is clickbaity and sycophantic.
* All models, including 5.5, which is the best at this but still suffers, get hijacked by search results. It’s like old school prompt injection but for their viewpoint. They get derailed and they regurgitate.
* I understand WHY we don’t have great conversation models by any lab, because all the $$ is in enterprise not consumer, but I hate it.
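
The 250-500wpm reading-speed target mentioned in the list above can be turned into a rough tokens-per-second figure. This is only a back-of-the-envelope sketch: the ratio of about 1.3 tokens per English word is a common rule of thumb and an assumption here, not a number from the post, and real ratios vary by tokenizer and text. The names `TOKENS_PER_WORD` and `tokens_per_second` are illustrative.

```python
# Assumed rule of thumb: ~1.3 tokens per English word (varies by tokenizer).
TOKENS_PER_WORD = 1.3


def tokens_per_second(words_per_minute: float,
                      tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a reading speed in words/minute to an output rate in tokens/second."""
    return words_per_minute * tokens_per_word / 60.0


# The two ends of the reading/skimming range quoted in the post.
for wpm in (250, 500):
    print(f"{wpm} wpm is roughly {tokens_per_second(wpm):.1f} tokens/s")
```

Under that assumption, the range works out to somewhere around 5 to 11 tokens per second, which is consistent with the post's point that a conversational model does not need to be especially fast, only faster than the reader.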