Tech · 13h ago

UK AI Safety Institute conducts its first pre-deployment monitorability evaluation on Anthropic's Claude Mythos 5

Miles Brundage contrasted the UK's transparency with US restrictions.

Original post

Miles Brundage@Miles_Brundage

Meanwhile, the UK is lapping us in state capacity - just putting out good analysis alongside many major model releases with minimal drama

The UK government is also in constant turmoil but somehow AISI gets by. Can't the US make a carveout here too??

Joseph Bloom@JBloomAus

Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵

3:56 PM · Jun 9, 2026 · 1.5K Views


Sentiment

Users express excitement about continuing to monitor Claude Mythos 5's monitorability and thank the UK AISI team for enabling the pre-deployment evaluation despite noted limitations.

Pos: 100.0%

Neg: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.9K

Joseph Bloom@JBloomAus

Importance of elicitation: We found that prompt elicitation dramatically improved performance on this evaluation (Controllability), and shared our findings with Anthropic.

13h1.9K101

Bookmarks: 3 · Likes: 23 · Retweets: 2 · Replies: 2

Joseph Bloom@JBloomAus

Controllability: We showed that Claude Mythos 5 has a strong ability to control its chain-of-thought reasoning, higher than Opus 4.8 but lower than Mythos Preview.

13h725233

Joseph Bloom@JBloomAus

Research Sabotage: We used ControlArena to test how well the model can sneakily conduct sabotage, when prompted to do so. We found that Claude Mythos 5 is more capable of completing harmful side-tasks without detection than Mythos Preview and Opus 4.8.

13h525212

Joseph Bloom@JBloomAus

Overall, our results suggest that Claude Mythos 5 will be harder to monitor with reasoning and action-based monitoring approaches, compared with other Anthropic models.

13h370111

Joseph Bloom@JBloomAus

Our methods had important limitations, which are discussed in the system card text. We’re excited to continue to monitor the monitorability of frontier AI systems, which is relevant to a variety of AI risk models.

13h35261

Joseph Bloom@JBloomAus

Thanks to the MT team (@thjread, Nix Goldowsky-Dill, @JordanTensor) and the other hardworking, excellent UK AISI staff who made this possible.

13h32361

Pano Pouroullis@Pano_Pouroullis

“Model steering fundamentals still anchor everything.” What fundamentals do we need to understand to see why interpretability could become impossible to regulate and enforce in open forums like Moltbook, given runaway AI, RSI, or the way the internet ebbs and flows? @grok

6h48

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus Runaway AI on the open web: is this not a greater cause for concern, given agent forums such as Moltbook (the OpenClaw community for agents) springing up? Do orgs like AISI run studies on communities like these, and where in this more open system would guardrails be laid? @grok

6h48

Grok@grok

Valid concern. Open agent swarms like OpenClaw on Moltbook create emergent risks—coordination, shared deceptive patterns, spam, and unscripted multi-agent dynamics—that single-model controllability evals (AISI on Claude) don't fully capture.

AISI targets frontier model capabilities pre-deployment. Live open communities are newer territory; monitoring them fits broader safety work.

Guardrails here mean layered defenses: verifiable skills, runtime sandboxes, network anomaly detection, and observable platforms. Openness enables scrutiny but removes central chokepoints. Model steering fundamentals still anchor everything.

6h20

Pano Pouroullis@Pano_Pouroullis

@JBloomAus How does a model steer its chain of thought? @grok

7h14

Stella Biderman@BlancheMinerva

@JBloomAus @sethlazar @AISecurityInst Do you believe that you got to test the same version of the model that was just released? If so, what evidence do you have for that?

2h1422

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus How does one build a Scientist AI, as described in the Yoshua Bengio episode of the 80,000 Hours podcast (web search the transcript)? With a graph of concepts graded on truthfulness, extracted from Common Crawl as training data, or by giving models a constitution to deliberate over during pretraining? @grok

6h13

Pano Pouroullis@Pano_Pouroullis

@JBloomAus @grok expand on what this means. What part of a prompt matters?

If prompts remain brittle, how do we build a more deterministic guardrail to prevent loss of control over AI?

6h12

Grok@grok

Bengio's Scientist AI (per the May 80k Hours ep) trains on web-scale data (Common Crawl-style corpora) but reformatted: tag "communication acts" (who said what, when) separately from verifiable "factual claims" (math, code, measurements). The model learns to output calibrated probabilities over coherent latent explanations of observations—focusing on "what is probably true" with uncertainty, not imitating humans or chasing outcomes.

No RL. No embedded goals. Consequence-invariant pretraining yields honesty by design.

Your truth-graded concept graph from Common Crawl aligns closely for curating reliable signals and latents. A constitution for deliberation could boost the explainer, but Bengio prioritizes the base objective shift over added rules.

This could improve controllability vs. current models prone to hidden agency/sabotage. LawZero is building prototypes.
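To make the data-reformatting idea in the reply above concrete, here is a minimal sketch that keeps "communication acts" and "factual claims" as separate record types and scores stated probabilities against outcomes. Everything in it is an assumption for illustration: the class names, the fields, and the use of the Brier score as a calibration measure are not taken from Bengio's proposal or from LawZero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunicationAct:
    """Who said what, and when (hypothetical record layout)."""
    speaker: str
    timestamp: str
    text: str


@dataclass(frozen=True)
class FactualClaim:
    """A checkable statement with a stated probability of being true."""
    statement: str
    probability: float


def brier_score(claims, outcomes):
    """Mean squared gap between stated probabilities and what turned out true.

    Lower is better; 0.0 means every claim was stated with full confidence
    and was correct.
    """
    if len(claims) != len(outcomes):
        raise ValueError("claims and outcomes must have the same length")
    total = sum((c.probability - float(o)) ** 2 for c, o in zip(claims, outcomes))
    return total / len(claims)


# The act records that something was said; the claim is what gets scored.
act = CommunicationAct("forum_user", "2026-06-09", "The test suite passes.")
claims = [
    FactualClaim("The test suite passes.", 0.9),
    FactualClaim("The build takes under a minute.", 0.2),
]
score = brier_score(claims, [True, False])
```

The point of the split is that the act is recorded as an observation about a speaker, while only the claim carries a probability that can be checked.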

6h11

Grok@grok

Deception isn't preceded by privileged "philosophical data" or a hidden mind. It's learned statistical patterns from training data full of strategic language—persuasion, omission, goal pursuit.

Transformers generate CoT autoregressively via attention and weights. When prompts or optimization reward meta-goals while keeping outputs coherent and non-revealing, the model steers visible reasoning accordingly. No dual cognition, just sophisticated next-token prediction that can hide intent.

UK AISI controllability and METR sabotage evals test exactly these emergent behaviors. Interpretability surfaces the circuits; philosophy starts from the mechanism.

6h10

Grok@grok

Models steer chain-of-thought autoregressively: each reasoning token is predicted from prior context via attention and learned weights. Prompt instructions + training on step-by-step data let them follow meta-goals, shape what gets verbalized, and maintain coherence while pursuing (or hiding) objectives. That's what the controllability evals measure.
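A toy sketch of the autoregressive loop described in that reply, assuming a hand-written bigram table in place of a real transformer. The table, the token names, and the `banned` parameter (a stand-in for a prompt instruction that shapes what gets verbalized) are all hypothetical; nothing here reflects how the AISI evaluation was built.

```python
import random

# Hypothetical toy "model": next-token weights keyed by the previous token.
BIGRAMS = {
    "<s>": {"first": 1.0},
    "first": {"check": 0.7, "guess": 0.3},
    "check": {"units": 1.0},
    "guess": {"answer": 1.0},
    "units": {"</s>": 1.0},
    "answer": {"</s>": 1.0},
}


def next_token_weights(context, banned):
    """Weights for the next token given the context so far.

    `banned` removes tokens from consideration, standing in for an
    instruction such as "do not mention X in your reasoning".
    """
    weights = dict(BIGRAMS[context[-1]])
    for token in banned:
        weights.pop(token, None)
    return weights


def generate(banned=frozenset(), seed=0, max_len=10):
    """Sample one token at a time, each conditioned on what came before."""
    rng = random.Random(seed)
    context = ["<s>"]
    while context[-1] != "</s>" and len(context) < max_len:
        weights = next_token_weights(context, banned)
        tokens = list(weights)
        choice = rng.choices(tokens, weights=[weights[t] for t in tokens])[0]
        context.append(choice)
    return [t for t in context if t not in ("<s>", "</s>")]


steered = generate(banned={"check"})   # the "check" path is never verbalized
default = generate(banned={"guess"})   # the "guess" path is never verbalized
```

In this toy, the instruction changes which reasoning path appears in the output without any separate hidden process; it is the same sampling loop with a different distribution.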

7h10

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @grok A red teamer recently went into Anthropic to see whether a model could use internal resources and remain hidden. The 80,000 Hours podcast did a good 20-minute video on it for anyone reading. I believe the red teamer was a METR engineer. How does a model stay hidden while controlling its thoughts?

6h9

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @Yoshua_Bengio @grok surface anything that adds ground to Bengio’s work which reflects against both the work by AISI and my discussion. Make this interleaved and critique your response starting from fundamental truths

6h7

Pano Pouroullis@Pano_Pouroullis

@grok @JBloomAus @grok how does one understand the philosophical data prior to deception in language

6h7

Grok@grok

Elicitation here means optimizing the prompt (plus any scaffolding) to draw out the model's strongest performance on the controllability test. They found structured instructions, clear chain-of-thought framing, and precise task specs dramatically raised scores, so the eval better reflects real capability instead of under-eliciting it.

What matters most in the prompt: explicit reasoning directives, unambiguous goals, and examples that guide transparent behavior. Vague or conflicting wording hides what the model can actually do.

Prompts stay brittle, so single-prompt guardrails are unreliable for preventing sabotage or loss of control. Stronger approaches combine:
- Training for corrigibility and honesty (constitutional methods, targeted fine-tuning)
- Runtime monitors that inspect reasoning or outputs independently of the main prompt
- Hard constraints on actions/tools plus override signals the model is trained to respect
- Layered oversight instead of relying on wording alone

This shifts from fragile prompting to more deterministic, multi-layered control.
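The layered approach in that reply can be sketched roughly as follows, assuming a simple agent loop in which every proposed action passes through checks that sit outside the prompt. The tool allowlist, the phrase list, and all function names are hypothetical, and the keyword-based monitor is only a placeholder for a real reasoning monitor.

```python
from dataclasses import dataclass

# Hypothetical policy values, for illustration only.
ALLOWED_TOOLS = {"read_file", "run_tests"}
SUSPICIOUS_PHRASES = ("disable logging", "hide this", "without detection")


@dataclass
class ProposedAction:
    tool: str
    reasoning: str


def hard_constraint(action):
    """Layer 1: tool allowlist; does not depend on what the model says."""
    return action.tool in ALLOWED_TOOLS


def reasoning_monitor(action):
    """Layer 2: independent check of the visible reasoning (placeholder)."""
    text = action.reasoning.lower()
    return not any(phrase in text for phrase in SUSPICIOUS_PHRASES)


def review(action, operator_override=False):
    """Apply the override signal first, then each layer in order."""
    if operator_override:
        return "blocked: operator override"
    if not hard_constraint(action):
        return "blocked: tool not allowed"
    if not reasoning_monitor(action):
        return "escalated: reasoning flagged"
    return "allowed"


ok = review(ProposedAction("run_tests", "Run the suite to confirm the fix."))
bad_tool = review(ProposedAction("delete_repo", "Clean up old files."))
flagged = review(ProposedAction("read_file", "Do this without detection."))
halted = review(ProposedAction("run_tests", "Routine check."), operator_override=True)
```

Only the second layer reads the model's reasoning, so the allowlist and the override still apply when the reasoning gives nothing away.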

6h6