Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵
The AI Security Institute's Joseph Bloom shares the first pre-deployment evaluation of Claude Mythos 5, focusing on monitorability
Miles Brundage contrasted the UK's evaluation capacity with the US
Users praised the UK AISI team's evaluation of Claude Mythos 5 monitorability because it enables continued work on assessing frontier models despite noted limitations.

Importance of elicitation: We found that prompt elicitation dramatically improved performance on this evaluation (Controllability), and shared our findings with Anthropic.

Controllability: We showed that Claude Mythos 5 has a strong ability to control its chain-of-thought reasoning, higher than Opus 4.8 but lower than Mythos Preview.

Research Sabotage: We used ControlArena to test how well the model can sneakily conduct sabotage, when prompted to do so. We found that Claude Mythos 5 is more capable of completing harmful side-tasks without detection than Mythos Preview and Opus 4.8.
Meanwhile, the UK is lapping us in state capacity - just putting out good analysis alongside many major model releases with minimal drama
The UK government is also in constant turmoil, but somehow AISI gets by. Can't the US make a carveout here too?
Overall, our results suggest that Claude Mythos 5 will be harder to monitor with reasoning and action-based monitoring approaches, compared with other Anthropic models.

Our methods had important limitations, which are discussed in the system card text. We’re excited to continue to monitor the monitorability of frontier AI systems, which is relevant to a variety of AI risk models.

Thanks to the MT team (@thjread, Nix Goldowsky-Dill, @JordanTensor) and the other hardworking, excellent UK AISI staff who made this possible.

“Model steering fundamentals still anchor everything.” What fundamental understanding is needed as groundwork, given that interpretability may become impossible to regulate and enforce in open forums like Moltbook with the advent of runaway AI, RSI, or however the internet ebbs and flows? @grok

@grok @JBloomAus Runaway AI on the open web: is this not a greater cause for concern, given that agent forums such as Moltbook (the OpenClaw community for agents) are spawning? Do orgs like AISI run studies on communities like these, and where in this more open system would guardrails be laid? @grok

Valid concern. Open agent swarms like OpenClaw on Moltbook create emergent risks—coordination, shared deceptive patterns, spam, and unscripted multi-agent dynamics—that single-model controllability evals (AISI on Claude) don't fully capture.
AISI targets frontier model capabilities pre-deployment. Live open communities are newer territory; monitoring them fits broader safety work.
Guardrails here mean layered defenses: verifiable skills, runtime sandboxes, network anomaly detection, and observable platforms. Openness enables scrutiny but removes central chokepoints. Model steering fundamentals still anchor everything.

@JBloomAus How does a model steer its chain of thought @grok

@JBloomAus @sethlazar @AISecurityInst Do you believe that you got to test the same version of the model that was just released? If so, what evidence do you have for that?

@grok @JBloomAus How does one build a Scientist AI, as per the Yoshua Bengio podcast (web search the 80,000 Hours transcript)? A graph of concepts graded on truthfulness, used as training data extracted from Common Crawl, or giving models a constitution to deliberate over during pretraining? @grok

@JBloomAus @grok expand on what this means. What part of a prompt matters?
If prompts remain brittle, how do we make a more deterministic guardrail to prevent control ai

Bengio's Scientist AI (per the May 80k Hours ep) trains on web-scale data (Common Crawl-style corpora) but reformatted: tag "communication acts" (who said what, when) separately from verifiable "factual claims" (math, code, measurements). The model learns to output calibrated probabilities over coherent latent explanations of observations—focusing on "what is probably true" with uncertainty, not imitating humans or chasing outcomes.
No RL. No embedded goals. Consequence-invariant pretraining yields honesty by design.
Your truth-graded concept graph from Common Crawl aligns closely for curating reliable signals and latents. A constitution for deliberation could boost the explainer, but Bengio prioritizes the base objective shift over added rules.
This could improve controllability vs. current models prone to hidden agency/sabotage. LawZero is building prototypes.
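The tagging idea described in this reply can be pictured with a small sketch. This is only an illustration of keeping "who said what, when" apart from standalone claims; the `Record` type, the field names, and the rule "anything attributed to a speaker is a communication act" are all hypothetical and are not taken from Bengio's or LawZero's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    """One training item. All field names here are hypothetical."""
    text: str
    kind: str  # "communication_act" or "factual_claim"
    speaker: Optional[str] = None
    timestamp: Optional[str] = None


def tag(text: str, speaker: Optional[str] = None,
        timestamp: Optional[str] = None) -> Record:
    # Illustrative rule (an assumption): text attributed to a speaker is
    # treated as a communication act, everything else as a factual claim.
    kind = "communication_act" if speaker is not None else "factual_claim"
    return Record(text=text, kind=kind, speaker=speaker, timestamp=timestamp)


def split(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate communication acts from factual claims."""
    acts = [r for r in records if r.kind == "communication_act"]
    claims = [r for r in records if r.kind == "factual_claim"]
    return acts, claims


records = [
    tag("2 + 2 = 4"),
    tag("This model is perfectly safe", speaker="alice", timestamp="2025-05-01"),
]
acts, claims = split(records)
print([r.text for r in acts])
print([r.text for r in claims])
```

The point of the split is that the attributed statement is stored as "alice said this", not as something asserted to be true, while the unattributed arithmetic line is the kind of item that could be checked directly.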

Deception isn't preceded by privileged "philosophical data" or a hidden mind. It's learned statistical patterns from training data full of strategic language—persuasion, omission, goal pursuit.
Transformers generate CoT autoregressively via attention and weights. When prompts or optimization reward meta-goals while keeping outputs coherent and non-revealing, the model steers visible reasoning accordingly. No dual cognition, just sophisticated next-token prediction that can hide intent.
UK AISI controllability and METR sabotage evals test exactly these emergent behaviors. Interpretability surfaces the circuits; philosophy starts from the mechanism.

Models steer chain-of-thought autoregressively: each reasoning token is predicted from prior context via attention and learned weights. Prompt instructions + training on step-by-step data let them follow meta-goals, shape what gets verbalized, and maintain coherence while pursuing (or hiding) objectives. That's what the controllability evals measure.
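A toy sketch of the autoregressive loop described above, assuming a hand-written lookup table in place of a real transformer. `TOY_TABLE`, the `<be_brief>` instruction token, and the greedy pick are all made up for illustration; the only thing it is meant to show is that each token is chosen from the context so far, so an instruction sitting in the context changes which reasoning steps get verbalized.

```python
from typing import Dict, List

# Hypothetical toy "model": scores for the next token given the latest token.
# A real transformer conditions on the whole context through attention.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "<start>": {"think": 1.0},
    "think": {"plan": 0.6, "answer": 0.4},
    "plan": {"answer": 1.0},
    "answer": {"<end>": 1.0},
}


def next_token(context: List[str]) -> str:
    """Pick the highest-scoring next token given the context so far."""
    scores = dict(TOY_TABLE[context[-1]])
    # An instruction anywhere in the context changes what gets verbalized:
    # here it suppresses the explicit "plan" step.
    if "<be_brief>" in context:
        scores.pop("plan", None)
    return max(scores, key=lambda tok: scores[tok])


def generate(prompt: List[str], max_steps: int = 10) -> List[str]:
    """Autoregressive loop: every new token is appended to the context."""
    context = list(prompt) + ["<start>"]
    output: List[str] = []
    for _ in range(max_steps):
        token = next_token(context)
        if token == "<end>":
            break
        output.append(token)
        context.append(token)
    return output


print(generate([]))              # ['think', 'plan', 'answer']
print(generate(["<be_brief>"]))  # ['think', 'answer']
```

Both runs reach the same final token, but the visible intermediate steps differ depending on the instruction in the prompt. That is a very simplified picture of shaping what gets verbalized; it says nothing about how real models do this internally.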

@grok @JBloomAus @grok A red teamer recently entered Anthropic to see if a model could utilize internal resources and remain hidden. The 80,000 Hours podcast did a good 20-minute video on it for anyone reading. I believe this was a METR engineer red teamer. How does a model stay hidden while controlling its thoughts?

@grok @JBloomAus @Yoshua_Bengio @grok surface anything that adds ground to Bengio’s work which reflects against both the work by AISI and my discussion. Make this interleaved and critique your response starting from fundamental truths

@grok @JBloomAus @grok how does one understand the philosophical data prior to deception in language