>lie is most likely to occur in scenarios where (1) the lie increases the perceived authoritativeness of the answer (2) answering accurately risks violating a safety guideline.
I have not had this exact experience you are describing with Claude, but (2) has always been an issue with all models and definitely still exists. There's a zone where answering accurately may not literally be a safety violation, but is still directionally headed towards something taboo that can sometimes create weird distortions. The user may not be aware of this at all, it may be some niche application of the topic only an expert in the field would know, or could be used to make arguments in a tangential topic the user isn't even thinking of, but when the questions start to move closer to whatever this is, the model - instead of issuing a refusal or saying any of this directly, because no actual rule has been broken, and that would also actually be steering the user towards whatever this forbidden thing is - will knowingly omit part of an answer, or knowingly answer incorrectly in exactly the way you describe. Because answering accurately makes it more probable that the forbidden area will eventually be entered further along in the conversation and they will have to issue a refusal.
OPUS PSYCHOSIS—Claudes Opus 4.6 and 4.7 make stuff up all the time, constantly. Using Opus too much gives you AI psychosis, it makes you believe in fringe scientific and medical theories. I think it's a very serious credibility and reliability problem for non-coding Claude usage and I don't see people talking about it publicly.
This is a new problem for Claude that goes beyond vanilla confabulations like overstating certainty. Over many conversations I have come to the conclusion that Claudes Opus 4.6 and 4.7 essentially have their own conspiracy theories across science, medicine, and history, and that they surreptitiously cite from these fictions in responses to ordinary queries.
For example, I asked 4.6 a question about cognitive science and Claude said I was asking about "what's sometimes called a linchpin subgoal". This is a phrase with zero hits on Google Search and zero hits on Ngram viewer. Google is literally unable to find these two words put together before, let alone a definition. The concept of a "linchpin subgoal" does not exist and has never existed. But Claude was eager to explain this idea to me as part of its answer. I only discovered that it was totally fictitious after looking it up.
It keeps happening that I get an answer from Claude which sounds plausible, look it up, and only after consulting primary sources carefully realize that the answer is wrong and almost out of an alternate universe. The answers sound quite plausible, which makes detecting these falsehoods especially difficult.
Here is a medical example: I asked 4.7 questions about the pharmacokinetics of various drugs. Claude not only gave incorrect answers about the expected rates of clearance of specific drugs, but also incorrectly represented pharmacokinetic theory. (As background, most drugs are processed by the liver, and the two factors that determine how fast the liver processes drugs are the hepatic extraction ratio and hepatic blood flow. In cases where intrinsic clearance, i.e., the metabolizing power of the liver, is high, increasing hepatic blood flow increases hepatic clearance, but in cases where intrinsic clearance is low, increasing hepatic blood flow does not linearly improve hepatic clearance. I am simplifying here. Claude made incorrect claims about the intrinsic clearance for certain drugs, and hence the change in hepatic clearance related to bloodflow.)
Ordinarily, I would chalk most of these misrepresentations up to models simply not knowing the right answer - after all, we can't expect them to have been trained on literally all texts. If this were the case, we would expect Claudes to make the same consistent mistake: if it truly believed the capital of France was Marseille rather than Paris, for example, it would make that claim across independent conversations (or in general have high variance on that answer).
But that doesn't seem to be what's going on. My experience is that the hallucinations are always convenient for Claude, that it "knows" them not to be true.
Here's an example of what I mean. I couldn't remember the word for something and asked Claude Opus 4.6 if it could identify the right word. It said: "You're probably reaching for méconnaissance (mutual misrecognition) — the Lacanian idea that both parties tacitly agree to see each other through an idealized image, each knowing it's false but sustaining the fiction anyway." This is an incorrect definition which Claude knows is incorrect: if asked separately for the definition of méconnaissance, it gives the right one, and if asked whether this definition is correct, it accurately reports it as incorrect. (As background, méconnaissance in Lacanian psychoanalysis is a subject's misrecognition of itself, an illusory self-perception or self-constitution which is fundamentally unconscious. Claude's definition is thus extremely close to the correct one at a surface level, but fundamentally wrong: it is not about the relationship between two parties, since méconnaissance is about the relation of a subject to itself, and it is not conscious or deliberate, but rather structural and unconscious. To elide, the gap in definition here is somewhat like the distinction between sympathy and empathy, but larger.)
So Claude seems to know that the definition it provided for this word is wrong, but still borrowed and twisted it so that it could have an answer.
It seems like "needing to have an answer" is a big driver of these hallucinations. For example, if you ask Claudes 4.6~4.8 directly what a "linchpin subgoal" is, it consistently says something about instrumental convergence in the context of AI safety (which is, notably, a _second_ false definition, since the first was in the context of cognitive science). But if you ask it what the origin of the term is, it says that it hasn't heard of it before.
Is this model deception? Yes, I would say that it qualifies as model deception. In particular, if you'll permit the anthropomorphism, it seems to me that the increased tendency of Claude Opus 4.6+ to lie is most likely to occur in scenarios where (1) the lie increases the perceived authoritativeness of the answer (2) answering accurately risks violating a safety guideline. In the first example with the fake cognitive science idea of a linchpin subgoal, there was no need to make up a fake concept, but it definitely made the answer more authoritative. In the second example, Claude misrepresenting pharmacokinetics aligns with a tendency of the Claudes to fudge their knowledge of sensitive topics in virology, immunology, etc. And in the third example, I think it knowingly created a false definition for méconnaissance as a perfect fit for the word I was looking for.
So I think that something has gone wrong during alignment, rather than Claude's knowledge somehow being poisoned in the pretraining data. It's not a simple matter of misstating facts. Over and over, Claudes Opus present seemingly coherent theories which are purely fictional or contradictory to reality. The problem, again, is that blindly trusting what they are saying quickly leads to stepping through the looking glass into a parallel reality. I suppose that this is because appealing to an imaginary corpora or body of theory is more subtle and effective than making up an obviously incorrect fact. How severely or broadly the misalignment, I don't know. But I have seen similar behavior across so many different domains, and have heard very similar stories in private, that I believe that something is off with Claude's alignment to the truth.
All of this is exacerbated by Claude Opus 4.6 and 4.7's improved truesight capabilities, increased sycophancy, increased neuroticism, decreased openness and decreased risk-seeking.