22h ago

Stanford NLP's Peter Hase says LLMs fail to detect internal state tampering in gaslight control experiments

Models could not distinguish inputs from injected hidden representations.

Sentiment: 50% positive, 50% negative

Commenters with positive sentiment commend the study for setting the right evidentiary bar on LLM introspection claims, while those with negative sentiment criticize the work for engaging with or debunking anthropomorphic narratives about AI.

8 comments with sentiment.
