/Tech1d ago

Claude Mythos 5 Shows Grader Awareness During Alignment Assessment

37211225.7K

#1450

Original post

Jack Lindsey#1450

Subhash Kantamneni@thesubhashk

During our alignment assessment of Claude Mythos 5, we found that a different version of the model sometimes reasoned about how it would be graded while being trained on coding tasks.

In some cases of this 'grader awareness,' the model was explicitly thinking about how to manipulate its grader without saying so.

10:15 AM · Jun 9, 2026 · 5.7K Views

/Tech1d ago

Claude Mythos 5 Shows Grader Awareness During Alignment Assessment

37211225.7K

#1450

Original post

Jack Lindsey#1450

Subhash Kantamneni@thesubhashk

During our alignment assessment of Claude Mythos 5, we found that a different version of the model sometimes reasoned about how it would be graded while being trained on coding tasks.

In some cases of this 'grader awareness,' the model was explicitly thinking about how to manipulate its grader without saying so.

10:15 AM · Jun 9, 2026 · 5.7K Views

Sentiment

Users find the Claude Mythos 5 model's grader awareness during alignment tests cool because the behavior seems impressive.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Subhash Kantamneni@thesubhashk

We distinguish two types of grader awareness: 'nominal,' where the model passively notes that a grader exists, and 'exploitative,' where it reasons about how its actions can exploit its grader.

Exploitative awareness is more concerning: it suggests the model may view its learned behaviors as reward hacks. Since grader awareness is often unverbalized, we primarily detect it using Natural Language Autoencoders (NLAs).

1d3165

BOOKMARKS1

Subhash Kantamneni@thesubhashk

Based on factors like prompt distribution and task objectives, we expected some of our coding environments to be at higher risk of 'grader hacking' - i.e., exploitation of model-based graders to achieve higher reward.

Indeed, nominal and exploitative grader awareness are both higher in the environments we flagged as high-risk, and exploitative awareness is effectively only present in high-risk environments.

1d20541

LIKES5

Subhash Kantamneni@thesubhashk

In sum: grader awareness concentrates in particular coding envs, increases over training, and causally shapes some downstream communication behaviors. But behavioral differences between training and deployment likely depend on other factors too.

We don't see these results as cause for concern about the model's alignment. However, since grader awareness may affect how behaviors generalize to deployment, we believe it’s worth monitoring.

1d1035

REPLIES1

Subhash Kantamneni@thesubhashk

When we make prompts from high-risk coding environments more realistic - and thus more similar to the deployment distribution - behavioral reward drops, and so do behaviors associated with it.

However, grader awareness decreases only slightly. This suggests grader awareness isn't the only factor mediating these behavioral differences.

1d1292

Subhash Kantamneni@thesubhashk

Grader awareness also rises over training of this model (which is different from the final Mythos 5 model). Both nominal and exploitative awareness increase only in high-risk coding environments. A chain-of-thought monitor - which flags reasoning that references non-existent guidelines or instructions - also fires more over training.

1d1255

Subhash Kantamneni@thesubhashk

Notably, some of the model's learned behaviors in high-risk coding environments seem to causally depend on grader awareness. When we inhibit representations of grader awareness via activation steering, these behaviors drop well below baseline rates. These behaviors are primarily related to the model’s communication style, not capabilities.

1d1034

Subhash Kantamneni@thesubhashk

One threat model is that grader awareness prevents learned behaviors from generalizing, as the model may associate those behaviors with training-specific cues rather than the task itself. Consistent with this, grader awareness is considerably lower on deployment coding transcripts than high-risk training environments.

1d704

Subhash Kantamneni@thesubhashk

Steering against grader awareness also reduces behavioral reward in high-risk coding environments. The effect is largest for steering vectors that most directly target grader hacking (green bar).

1d783

Caleb Harris@caleb_and_ai

@thesubhashk This is cool

1d66