What are Online Evals?
Most agent evals run "offline": a premade dataset of inputs goes through the agent, and an intermediate step or final output gets scored. They answer "is this version better than the last?"
Online evals answer a different question: "is the agent still working?" Instead of scoring a fixed dataset, they score one dimension of the agent's behavior on live traffic and track that score over time. Let's discuss how they work and how to set them up 👇
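To make the idea concrete, here is a minimal sketch of an online eval in Python: each live run gets scored as it happens, and a rolling average shows how that score moves over time. Everything named here (`OnlineEval`, `non_empty_answer`, the window size) is a hypothetical illustration rather than any particular library's API, and the scorer is deliberately trivial.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class OnlineEval:
    """Scores each live agent run and keeps a rolling window of scores.

    Hypothetical helper for illustration only.
    """
    # Assumed scorer signature: (agent_input, agent_output) -> score in [0, 1]
    scorer: Callable[[str, str], float]
    window: int = 100
    scores: Deque[float] = field(default_factory=deque)

    def record(self, agent_input: str, agent_output: str) -> float:
        """Score one live run and add it to the rolling window."""
        score = self.scorer(agent_input, agent_output)
        self.scores.append(score)
        # Drop the oldest score once the window is full.
        if len(self.scores) > self.window:
            self.scores.popleft()
        return score

    def rolling_mean(self) -> float:
        """Average score over the most recent runs (NaN if none yet)."""
        if not self.scores:
            return float("nan")
        return sum(self.scores) / len(self.scores)


def non_empty_answer(agent_input: str, agent_output: str) -> float:
    """Toy scorer (an assumption): 1.0 if the agent returned any text."""
    return 1.0 if agent_output.strip() else 0.0


# Simulated live traffic: (input, output) pairs arriving one at a time.
monitor = OnlineEval(scorer=non_empty_answer, window=3)
for user_input, agent_output in [("q1", "a"), ("q2", ""), ("q3", "b"), ("q4", "c")]:
    monitor.record(user_input, agent_output)

print(monitor.rolling_mean())
```

In practice the scorer would be whatever dimension you care about, and the rolling value would feed a dashboard or an alert instead of a `print`.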

