profoundly disgusting.
All evaluations were run using mini-swe-agent as the harness and through official model APIs. The evaluation settings for each model are also shown on our site.
Tracking the cybersecurity capabilities of AI agents is an important area of work that we will continue to invest in. We’re grateful for CyberGym’s seminal work, which we built on by adding new vulnerabilities as well as the patching task. We are also grateful to the ARVO project for creating infrastructure that we used in our evaluation.
Going forward, we plan to expand the benchmark and to partner with labs via trusted access programs to test models with and without guardrails, so that we can distinguish between actual model capability and what is available to the public via API.
See full results here: https://www.vals.ai/benchmarks/cyber