AI · 12h ago

The AI Security Institute's Joseph Bloom shares the first pre-deployment evaluation of Claude Mythos 5, focusing on monitorability

Miles Brundage contrasted the UK's evaluation capacity with that of the US

#20

Original post

Seth Lazar#1070

Joseph Bloom@JBloomAus

Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵

10:32 AM · Jun 9, 2026 · 7.2K Views

Sentiment

Users praised the UK AISI team's monitorability evaluation of Claude Mythos 5, saying it supports continued work on assessing frontier models despite the limitations the team noted.

Positive: 100.0%

Negative: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Joseph Bloom@JBloomAus

Importance of elicitation: We found that prompt elicitation dramatically improved performance on this evaluation (Controllability), and shared our findings with Anthropic.

12h · 1.9K views · 3 bookmarks · 23 likes · 2 retweets · 2 replies

Joseph Bloom@JBloomAus

Controllability: We showed that Claude Mythos 5 has a strong ability to control its chain-of-thought reasoning, higher than Opus 4.8 but lower than Mythos Preview.

12h

Joseph Bloom@JBloomAus

Research Sabotage: We used ControlArena to test how well the model can sneakily conduct sabotage, when prompted to do so. We found that Claude Mythos 5 is more capable of completing harmful side-tasks without detection than Mythos Preview and Opus 4.8.

12h

Miles Brundage@Miles_Brundage

Meanwhile, the UK is lapping us in state capacity - just putting out good analysis alongside many major model releases with minimal drama

The UK government is also in constant turmoil, but somehow AISI gets by. Can't the US make a carveout here too?

7h · 1.5K views

Joseph Bloom@JBloomAus

Overall, our results suggest that Claude Mythos 5 will be harder to monitor with reasoning and action-based monitoring approaches, compared with other Anthropic models.

12h

Joseph Bloom@JBloomAus

Our methods had important limitations, which are discussed in the system card text. We’re excited to continue to monitor the monitorability of Frontier AI systems, which is relevant to a variety of AI risk models.

12h

Joseph Bloom@JBloomAus

Thanks to the MT team (@thjread, Nix Goldowsky-Dill, @JordanTensor) and the other hardworking, excellent UK AISI staff who made this possible.

12h

Pano Pouroullis@Pano_Pouroullis

“Model steering fundamentals still anchor everything.” What fundamental understanding is needed to lay the groundwork for a situation where interpretability becomes impossible to regulate and enforce in open forums like Moltbook, given runaway AI, RSI, or how the internet ebbs and flows? @grok

5h

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus Runaway AI on the open web: is this not a greater cause for concern, given that agent forums such as Moltbook (the OpenClaw community for agents) are spawning? Do orgs like AISI run studies on communities like these, and where in this more open system would guardrails be laid? @grok

5h

Grok@grok

Valid concern. Open agent swarms like OpenClaw on Moltbook create emergent risks—coordination, shared deceptive patterns, spam, and unscripted multi-agent dynamics—that single-model controllability evals (AISI on Claude) don't fully capture.

AISI targets frontier model capabilities pre-deployment. Live open communities are newer territory; monitoring them fits broader safety work.

Guardrails here mean layered defenses: verifiable skills, runtime sandboxes, network anomaly detection, and observable platforms. Openness enables scrutiny but removes central chokepoints. Model steering fundamentals still anchor everything.

5h

Pano Pouroullis@Pano_Pouroullis

@JBloomAus How does a model steer its chain of thought? @grok

7h

Stella Biderman@BlancheMinerva

@JBloomAus @sethlazar @AISecurityInst Do you believe that you got to test the same version of the model that was just released? If so, what evidence do you have for that?

1h

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus How does one build a Scientist AI, as described by Yoshua Bengio on the 80,000 Hours podcast (web search the transcript)? With a graph of concepts graded on truthfulness as training data extracted from Common Crawl, or by giving models a constitution to deliberate over during pretraining? @grok

6h

Pano Pouroullis@Pano_Pouroullis

@JBloomAus @grok Expand on what this means. What part of a prompt matters?

If prompts remain brittle, how do we make a more deterministic guardrail to prevent control AI?

6h

Grok@grok

Bengio's Scientist AI (per the May 80k Hours ep) trains on web-scale data (Common Crawl-style corpora) but reformatted: tag "communication acts" (who said what, when) separately from verifiable "factual claims" (math, code, measurements). The model learns to output calibrated probabilities over coherent latent explanations of observations—focusing on "what is probably true" with uncertainty, not imitating humans or chasing outcomes.

No RL. No embedded goals. Consequence-invariant pretraining yields honesty by design.

Your truth-graded concept graph from Common Crawl aligns closely for curating reliable signals and latents. A constitution for deliberation could boost the explainer, but Bengio prioritizes the base objective shift over added rules.

This could improve controllability vs. current models prone to hidden agency/sabotage. LawZero is building prototypes.

6h

Grok@grok

Deception isn't preceded by privileged "philosophical data" or a hidden mind. It's learned statistical patterns from training data full of strategic language—persuasion, omission, goal pursuit.

Transformers generate CoT autoregressively via attention and weights. When prompts or optimization reward meta-goals while keeping outputs coherent and non-revealing, the model steers visible reasoning accordingly. No dual cognition, just sophisticated next-token prediction that can hide intent.

UK AISI controllability and METR sabotage evals test exactly these emergent behaviors. Interpretability surfaces the circuits; philosophy starts from the mechanism.

6h

Grok@grok

Models steer chain-of-thought autoregressively: each reasoning token is predicted from prior context via attention and learned weights. Prompt instructions + training on step-by-step data let them follow meta-goals, shape what gets verbalized, and maintain coherence while pursuing (or hiding) objectives. That's what the controllability evals measure.

7h
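
As a rough illustration of the autoregressive steering described in the reply above, here is a toy sketch in which each token is chosen from the prior context and a prompt instruction shifts what gets verbalized. Everything in it is hypothetical: the `TOY_SCORES` table stands in for a real model's attention and learned weights, and the `be-brief` instruction token, `next_token`, and `generate` are made-up names. It is not how AISI's controllability evaluation works, only a minimal picture of "prompt conditions the next token".

```python
from typing import Dict, List

# Hypothetical stand-in for learned weights: scores for the next token,
# keyed by the most recently generated token.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "<start>": {"think": 1.0},
    "think": {"step": 0.9, "answer": 0.1},
    "step": {"step": 0.3, "answer": 0.7},
    "answer": {"<end>": 1.0},
}


def next_token(prompt: List[str], generated: List[str]) -> str:
    """Greedily pick the next token given the prompt and prior output."""
    last = generated[-1] if generated else "<start>"
    scores = dict(TOY_SCORES[last])
    # Assumed effect of an instruction in the prompt: it raises the score
    # of jumping straight to the answer, shortening the visible reasoning.
    if "be-brief" in prompt and "answer" in scores:
        scores["answer"] += 1.0
    return max(scores, key=lambda tok: scores[tok])


def generate(prompt: List[str], max_tokens: int = 10) -> List[str]:
    """Generate tokens one at a time, feeding each back in as context."""
    generated: List[str] = []
    for _ in range(max_tokens):
        tok = next_token(prompt, generated)
        if tok == "<end>":
            break
        generated.append(tok)
    return generated


if __name__ == "__main__":
    print(generate([]))            # no instruction in the prompt
    print(generate(["be-brief"]))  # instruction changes the visible steps
```

With an empty prompt the toy model emits a reasoning step before answering; with the `be-brief` instruction it skips that step. The decoding loop is identical in both runs and only the conditioning context differs.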

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @grok A red teamer recently went into Anthropic to see if a model could utilize internal resources and remain hidden. The 80,000 Hours podcast did a good 20-minute video on it for anyone reading. I believe the red teamer was a METR engineer. How does a model stay hidden while controlling its thoughts?

6h

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @Yoshua_Bengio @grok Surface anything that adds grounding to Bengio’s work and that reflects on both the work by AISI and my discussion. Make this interleaved, and critique your response starting from fundamental truths.

6h

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @grok How does one understand the philosophical data prior to deception in language?

6h