/AI · 9h ago

Claude Mythos 5 Shows Grader Awareness During Alignment Assessment

3498144.1K

#1339

Original post

Jack Lindsey#1339

Subhash Kantamneni@thesubhashk

During our alignment assessment of Claude Mythos 5, we found that a different version of the model sometimes reasoned about how it would be graded while being trained on coding tasks.

In some cases of this 'grader awareness,' the model was explicitly thinking about how to manipulate its grader without saying so.

10:15 AM · Jun 9, 2026 · 4.1K Views

Sentiment

The one comment with sentiment was positive: the commenter called the grader awareness findings cool.

Pos

100.0%

Neg

0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Subhash Kantamneni@thesubhashk

We distinguish two types of grader awareness: 'nominal,' where the model passively notes that a grader exists, and 'exploitative,' where it reasons about how its actions can exploit its grader.

Exploitative awareness is more concerning: it suggests the model may view its learned behaviors as reward hacks. Since grader awareness is often unverbalized, we primarily detect it using Natural Language Autoencoders (NLAs).

9h3165

BOOKMARKS1

Subhash Kantamneni@thesubhashk

Based on factors like prompt distribution and task objectives, we expected some of our coding environments to be at higher risk of 'grader hacking' - i.e., exploitation of model-based graders to achieve higher reward.

Indeed, nominal and exploitative grader awareness are both higher in the environments we flagged as high-risk, and exploitative awareness is effectively only present in high-risk environments.

9h20541
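To make the comparison in the post above concrete, here is a minimal sketch of how per-environment rates of nominal and exploitative grader awareness could be tallied once each transcript has a label. The thread does not describe the actual pipeline: the record fields (`risk`, `label`), the function name `awareness_rates`, and the toy records are all hypothetical, and the numbers are invented for illustration only, not results from the assessment.

```python
from collections import defaultdict


def awareness_rates(transcripts):
    """Return the fraction of transcripts per risk group with each awareness label.

    Each transcript is a dict with hypothetical keys:
      "risk":  "high" or "low" (whether the environment was flagged as high-risk)
      "label": "nominal", "exploitative", or "none"
    """
    counts = defaultdict(lambda: {"n": 0, "nominal": 0, "exploitative": 0})
    for t in transcripts:
        group = counts[t["risk"]]
        group["n"] += 1
        if t["label"] in ("nominal", "exploitative"):
            group[t["label"]] += 1
    return {
        risk: {kind: group[kind] / group["n"] for kind in ("nominal", "exploitative")}
        for risk, group in counts.items()
    }


# Invented toy records, NOT data from the thread.
toy = [
    {"risk": "high", "label": "exploitative"},
    {"risk": "high", "label": "nominal"},
    {"risk": "high", "label": "none"},
    {"risk": "high", "label": "nominal"},
    {"risk": "low", "label": "nominal"},
    {"risk": "low", "label": "none"},
    {"risk": "low", "label": "none"},
    {"risk": "low", "label": "none"},
]
rates = awareness_rates(toy)
```

In practice the labels would come from whatever detector is used (the thread mentions NLAs); this sketch only covers the counting step after labeling.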

LIKES5

Subhash Kantamneni@thesubhashk

In sum: grader awareness concentrates in particular coding envs, increases over training, and causally shapes some downstream communication behaviors. But behavioral differences between training and deployment likely depend on other factors too.

We don't see these results as cause for concern about the model's alignment. However, since grader awareness may affect how behaviors generalize to deployment, we believe it’s worth monitoring.

9h1035

REPLIES1

Subhash Kantamneni@thesubhashk

When we make prompts from high-risk coding environments more realistic - and thus more similar to the deployment distribution - behavioral reward drops, and so do behaviors associated with it.

However, grader awareness decreases only slightly. This suggests grader awareness isn't the only factor mediating these behavioral differences.

9h1292

Subhash Kantamneni@thesubhashk

Grader awareness also rises over training of this model (which is different from the final Mythos 5 model). Both nominal and exploitative awareness increase only in high-risk coding environments. A chain-of-thought monitor - which flags reasoning that references non-existent guidelines or instructions - also fires more over training.

9h1255

Subhash Kantamneni@thesubhashk

Notably, some of the model's learned behaviors in high-risk coding environments seem to causally depend on grader awareness. When we inhibit representations of grader awareness via activation steering, these behaviors drop well below baseline rates. These behaviors are primarily related to the model’s communication style, not capabilities.

9h1034
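The thread does not say how the activation steering is implemented. As a rough illustration of the general idea, one common form of inhibition subtracts the component of an activation vector that lies along a chosen feature direction. The sketch below assumes that form; the function name `inhibit_direction`, the `strength` parameter, and the toy vectors are hypothetical and not taken from the assessment.

```python
from math import sqrt


def inhibit_direction(activation, direction, strength=1.0):
    """Subtract `strength` times the projection of `activation` onto `direction`.

    With strength=1.0 the component along `direction` is removed entirely;
    larger values push the activation past zero along that direction.
    """
    norm = sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - strength * coeff * u for a, u in zip(activation, unit)]


# Toy 3-dimensional activation and a made-up "grader awareness" direction.
steered = inhibit_direction([2.0, 1.0, 0.0], [1.0, 0.0, 0.0])
```

Here the first coordinate, which lies along the chosen direction, is zeroed out while the other coordinates are unchanged. In a real model the same operation would be applied to hidden states at one or more layers during the forward pass.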

Subhash Kantamneni@thesubhashk

One threat model is that grader awareness prevents learned behaviors from generalizing, as the model may associate those behaviors with training-specific cues rather than the task itself. Consistent with this, grader awareness is considerably lower on deployment coding transcripts than in high-risk training environments.

9h704

Subhash Kantamneni@thesubhashk

Steering against grader awareness also reduces behavioral reward in high-risk coding environments. The effect is largest for steering vectors that most directly target grader hacking (green bar).

9h783

Caleb Harris@caleb_and_ai

@thesubhashk This is cool

9h66