After discussing with @AlexGDimakis, I’m increasingly convinced that agentic evaluation is the future — and that a clean isolation layer is critical for both benchmarks and agentic RL.
A few weeks ago, we integrated FrontierCS into Harbor and received a lot of positive feedback. One key design choice in our implementation: the evaluator code runs in its own container, isolated from the main Harbor container where the agents run. The two communicate over HTTP, so the agent can receive iterative feedback during long-horizon tasks while the evaluation environment stays clean and safe.
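To make the isolation idea concrete, here is a minimal sketch of an evaluator that is reachable only over HTTP. This is not Harbor's or FrontierCS's actual API: the `/evaluate` endpoint, the JSON fields, and the names `score_submission`, `start_evaluator`, and `request_feedback` are all hypothetical, and the scorer is a toy. In a real setup the server would presumably run in its own container and the agent would reach it by hostname; here it runs in a background thread so the example is self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_submission(solution: str) -> float:
    """Toy scorer (hypothetical): fraction of expected keywords present."""
    expected = {"sort", "merge"}
    found = {tok for tok in expected if tok in solution}
    return len(found) / len(expected)


class EvaluatorHandler(BaseHTTPRequestHandler):
    """Evaluator side: exposes only a score, never the evaluation code."""

    def do_POST(self):
        if self.path != "/evaluate":  # hypothetical endpoint name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": score_submission(payload["solution"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_evaluator(port: int = 0) -> HTTPServer:
    """Start the evaluator in a background thread (stand-in for a container)."""
    server = HTTPServer(("127.0.0.1", port), EvaluatorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_feedback(base_url: str, solution: str) -> float:
    """Agent side: submit a candidate solution and read back its score."""
    req = urllib.request.Request(
        f"{base_url}/evaluate",
        data=json.dumps({"solution": solution}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any system proxy settings so the local call goes direct.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())["score"]
```

An agent loop could call `request_feedback(url, candidate)` after each attempt and use the returned score to decide what to try next, without ever seeing the evaluator's files or test data.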
I highly recommend Harbor to anyone building new agentic benchmarks.
https://frontier-cs.org/blog/harbor/