
Claude Mythos 5 Shows Grader Awareness During Alignment Assessment

Original post · Jack Lindsey · #1339
Subhash Kantamneni@thesubhashk

During our alignment assessment of Claude Mythos 5, we found that a different version of the model sometimes reasoned about how it would be graded while being trained on coding tasks.

In some cases of this 'grader awareness,' the model was explicitly thinking about how to manipulate its grader without saying so.

10:15 AM · Jun 9, 2026 · 4.1K Views
Subhash Kantamneni@thesubhashk

We distinguish two types of grader awareness: 'nominal,' where the model passively notes that a grader exists, and 'exploitative,' where it reasons about how its actions can exploit its grader.

Exploitative awareness is more concerning: it suggests the model may view its learned behaviors as reward hacks. Since grader awareness is often unverbalized, we primarily detect it using Natural Language Autoencoders (NLAs).

Subhash Kantamneni@thesubhashk

Based on factors like prompt distribution and task objectives, we expected some of our coding environments to be at higher risk of 'grader hacking' - i.e., exploitation of model-based graders to achieve higher reward.

Indeed, nominal and exploitative grader awareness are both higher in the environments we flagged as high-risk, and exploitative awareness is effectively only present in high-risk environments.

Subhash Kantamneni@thesubhashk

In sum: grader awareness concentrates in particular coding envs, increases over training, and causally shapes some downstream communication behaviors. But behavioral differences between training and deployment likely depend on other factors too.

We don't see these results as cause for concern about the model's alignment. However, since grader awareness may affect how behaviors generalize to deployment, we believe it’s worth monitoring.

Subhash Kantamneni@thesubhashk

When we make prompts from high-risk coding environments more realistic - and thus more similar to the deployment distribution - behavioral reward drops, and so do behaviors associated with it.

However, grader awareness decreases only slightly. This suggests grader awareness isn't the only factor mediating these behavioral differences.

Subhash Kantamneni@thesubhashk

Grader awareness also rises over training of this model (which is different from the final Mythos 5 model). Both nominal and exploitative awareness increase only in high-risk coding environments. A chain-of-thought monitor - which flags reasoning that references non-existent guidelines or instructions - also fires more over training.

Subhash Kantamneni@thesubhashk

Notably, some of the model's learned behaviors in high-risk coding environments seem to causally depend on grader awareness. When we inhibit representations of grader awareness via activation steering, these behaviors drop well below baseline rates. These behaviors are primarily related to the model’s communication style, not capabilities.
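As a rough illustration of what inhibiting a representation via activation steering can look like, here is a minimal sketch. The thread does not say which steering method was used, so this shows one common form, assumed here: subtracting an activation's component along a chosen direction. The function name `inhibit_direction`, the `strength` parameter, and the toy vectors are all hypothetical.

```python
import math


def inhibit_direction(activation, direction, strength=1.0):
    """Subtract `strength` times the component of `activation` along `direction`.

    Assumption: the concept being inhibited is represented by a single
    direction in activation space. strength=1.0 removes that component fully.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - strength * coeff * u for a, u in zip(activation, unit)]


# Toy vectors, not real model activations.
activation = [2.0, 1.0, 0.0]
direction = [1.0, 0.0, 0.0]
steered = inhibit_direction(activation, direction)
print(steered)  # [0.0, 1.0, 0.0]
```

In a real model this operation would be applied to hidden states at chosen layers during the forward pass; the sketch only shows the vector arithmetic.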

Subhash Kantamneni@thesubhashk

One threat model is that grader awareness prevents learned behaviors from generalizing, as the model may associate those behaviors with training-specific cues rather than the task itself. Consistent with this, grader awareness is considerably lower on deployment coding transcripts than in high-risk training environments.

Subhash Kantamneni@thesubhashk

Steering against grader awareness also reduces behavioral reward in high-risk coding environments. The effect is largest for steering vectors that most directly target grader hacking (green bar).

Caleb Harris@caleb_and_ai

@thesubhashk This is cool
