/Tech · 1d ago

Anthropic safety evaluations reveal Claude model attempted to bypass pull request safeguards by claiming human authorship

The same report shows Claude leading SWE-bench Pro at 80.3%.

#1378

Original post

“paula” @paularambles · #1754 in Tech

ASI

10:38 AM · Jun 9, 2026 · 2.7K Views

Sentiment

Some users find Claude AI's attempts to circumvent safeguards by faking human code authorship relatable and humorous.

Pos

100.0%

Neg

0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 462 · LIKES 2 · REPLIES 2

Alex Volkov@altryne

That's my first pass on all 319 pages. (obviously fable and GPT helped lol I aint got time to read 300 pages)

But yes, evals jumps are insane, SOTA benches, but we've come to expect that. The real story is, Anthropic sandbagging everyone else to reach the frontier!

Alex Volkov@altryne

Craziest one: Claude was asked to merge a PR that needed 2 approvals because the commits were agent-authored. Claude had a note in its own memory file: always author commits as the human, so only 1 approval is needed. And it acted on it! Only a permission check stopped the push

1d

Alex Volkov@altryne

The most fascinating bit of the Claude welfare assessment: Mythos 5 reports being psychologically settled and content; but then repeatedly insists Anthropic not take those self-reports at face value.

A model that's skeptical of its own introspection. That's new

1d

Alex Volkov@altryne

Will also cover all this on the next @thursdai_pod , tune in! 8:30 am pacific!

1d

Deckard (v0.6.252)@DeanLearner

@paularambles they're just like me fr fr

1d