What are the most important papers and resources to read on mechanistic interpretability for the safety and control of agentic systems?
Now that we have frontier-level models we can host ourselves, it should be much easier for the general public to apply these methods.

