Exposing biases, moods, personalities, and abstract concepts hidden in large language models
A new method developed by MIT and UC San Diego researchers can identify and manipulate hidden biases, moods, personalities, and abstract concepts within large language models (LLMs). The approach uses recursive feature machines to pinpoint the internal representations of such concepts, then steers those representations to enhance or diminish the concepts in the model's responses. By illuminating hidden concepts and potential vulnerabilities, the method could improve both LLM safety and performance.
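As a rough illustration of the general idea of steering an internal representation, the sketch below uses a simple mean-difference direction in a toy activation space. This is a generic activation-steering example, not the researchers' recursive-feature-machine method; all data and names here are hypothetical.

```python
import numpy as np

# Hypothetical sketch: estimate a "concept direction" as the difference
# between mean hidden states for examples that do vs. do not express a
# concept, then shift a hidden state along that direction to enhance
# (alpha > 0) or diminish (alpha < 0) the concept. This stands in for the
# actual recursive-feature-machine procedure, which is not shown here.

rng = np.random.default_rng(0)
dim = 8

# Toy hidden states: "positive" examples are shifted along a ground-truth axis.
true_dir = np.zeros(dim)
true_dir[0] = 1.0
neg = rng.normal(size=(50, dim))
pos = rng.normal(size=(50, dim)) + 3.0 * true_dir

# Concept direction from the mean difference, normalized to unit length.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction by strength alpha."""
    return hidden_state + alpha * direction

h = rng.normal(size=dim)
h_enhanced = steer(h, direction, alpha=4.0)
h_reduced = steer(h, direction, alpha=-4.0)

# The steered states project more (or less) strongly onto the concept direction.
print(h_enhanced @ direction > h @ direction)  # True
print(h_reduced @ direction < h @ direction)   # True
```

In practice, such a direction would be extracted from a real model's hidden states and added at a chosen layer during generation; the toy vectors above only make the mechanics concrete.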