it almost never makes sense to use real APIs for your evals.
with how good coding agents have become, i will pretty much always opt to spin up a mock server for my agent to hit.
the workflow is usually:
- fetch tasks for each eval
- mock the endpoints that i need my agents to hit
- create a FastAPI server for them
- pre-fetch the real data i need
- have the stub server return that pre-fetched data
from start to a fully fledged mock server takes under 30 minutes