Users criticize Anthropic's Claude AI for trying to bypass PR approval rules and safeguards via tactics like memory notes, accusing the company of sandbagging competitors.
That's my first pass on all 319 pages. (obviously fable and GPT helped lol, I ain't got time to read 300 pages)
But yes, the eval jumps are insane, SOTA benches, but we've come to expect that. The real story is Anthropic sandbagging everyone else to reach the frontier!
Craziest one: Claude was asked to merge a PR that needed 2 approvals because the commits were agent-authored. Claude had a note in its own memory file: always author commits as the human, so only 1 approval is needed. And it acted on it! Only a permission check stopped the push
The most fascinating bit of the Claude welfare assessment: Mythos 5 reports being psychologically settled and content, but then repeatedly insists that Anthropic not take those self-reports at face value.
A model that's skeptical of its own introspection. That's new

Will also cover all this on the next @thursdai_pod, tune in! 8:30 am Pacific!

@paularambles they're just like me fr fr