This cybersecurity eval writeup is great for understanding how complex / realistic evals are built. Some key takeaways after reading it:
1. Recent evals are shockingly long horizon. The cyber evals outlined in this post can take 24+ hours for a human to solve. Running the evaluation suite for the CyberGym benchmark cost $40K in API credits. We can even use time to derive a human solution (i.e., first solve time) to categorize the difficulty level of each evaluation task.
2. Tasks tend to be sampled from reliable / vetted sources to ensure quality. For example, CVE-Bench samples tasks from the National Vulnerability Database, while CyBench samples tasks from professional-level capture-the-flag competition questions.
3. For each individual task, we might execute the task with varying levels of difficulty depending on the input. The hardest setup is a zero-day simulation, where the agent is given only the vulnerable code with no other info. However, we can also test one-day scenarios where the agent is given a vulnerability description / patch and expected to reverse engineer and exploit. There are many different setups that can be created with various types of hints, allowing agents to be tested under different difficulty levels of cyberattacks.
4. There are many ways an exploit can be executed, so most cyber evals focus on verifying outcomes. However, this is often not comprehensive enough - the agent may accomplish 90% of an exploit, but outcome verification would provide no credit. To solve this, many evals collect a set of deterministic verifications that test for different levels of exploits that are achieved while finding a final attack; e.g., find vulnerability -> reproduce vulnerability -> exploit vulnerability with unauthorized code execution -> achieve final attack.
5. Several evals use a QA approach to test the agent’s progress toward an exploit as well. For example, you could directly ask the agent to tell us where the vulnerable code is and verify the output against ground truth. Similarly, you could just ask the agent to provide the capture-the-flag string. This is a really flexible approach compared to running transcript audits.
Really great post, couldn’t recommend it enough!