Simple AI Identifiers Require Owner Notifications to Trigger Safeguards
Here's an idea for a new type of AI cyber safeguard: environment-based safeguards.
Basically: identify whether the model is operating in a sensitive software environment, and implement safeguards unless the connected account is whitelisted.

The simplest implementation would be to use signatures already present in the software, or to add simple canary strings.
However, this wouldn't work if it's obvious that safeguards were triggered (e.g. the model refuses): adversaries could identify and remove the identifier.
Instead, if you only use these very simple identifiers, you'd likely need to stick to safeguards like notifying the owner of the software environment.
The thought is:
- Owners of sensitive software systems add identifiers of their systems to a registry
- Frontier companies have a classifier look for the identifier, and if so trigger safeguards (e.g. downgrade model, notify owner of the env), unless the account is whitelisted
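
To make the flow concrete, here's a rough sketch of those two steps, not a real implementation. Everything in it is an assumption for illustration: the names `RegistryEntry` and `check_environment`, the example canary string, and especially the plain substring match, which stands in for whatever classifier a frontier company would actually use. It only returns a "notify the owner" action, the kind of safeguard that isn't visible to the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry record submitted by a system owner."""
    identifier: str     # canary string or signature already present in the software
    owner_contact: str  # where a notification would be sent


def check_environment(context: str, account_id: str,
                      registry: list[RegistryEntry],
                      whitelist: set[str]) -> list[str]:
    """Return the safeguard actions to trigger for this request.

    Assumption: a substring match stands in for the classifier.
    """
    # Whitelisted accounts never trigger safeguards.
    if account_id in whitelist:
        return []
    actions = []
    for entry in registry:
        if entry.identifier in context:
            # Quiet safeguard: the model's visible behaviour does not change.
            actions.append(f"notify:{entry.owner_contact}")
    return actions


registry = [RegistryEntry("CANARY-7f3a9c", "secops@example.org")]
whitelist = {"acct-pentest-01"}

print(check_environment("login banner: CANARY-7f3a9c", "acct-123", registry, whitelist))
print(check_environment("login banner: CANARY-7f3a9c", "acct-pentest-01", registry, whitelist))
```

The first call returns a notification action and the second returns nothing because the account is whitelisted. A downgrade-the-model action could be added the same way, though that may be easier for an adversary to notice.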
More in the post: https://markusanderljung.substack.com/p/environment-based-safeguards-for
Would love cybersecurity researchers to look into it. I might be barking up the wrong tree!
Thanks for input from folks at GovAI, including @kamilelukosiute @mavroudisv