
Online Evals Track AI Agent Performance On Live Traffic Over Time


Original post

What are Online Evals?

Most agent evals run "offline": a premade dataset of inputs goes through the agent, and an intermediate step or final output gets scored. They answer "is this version better than the last?"

Online evals answer a different question: "is the agent still working?" Instead of a fixed dataset, they measure a dimension of the agent as it runs on live traffic, tracked over time. Let's discuss how they work and how to set them up👇

10:56 AM · Jun 16, 2026 · 7.8K Views



Posts from X


Adam Łucek@AdamRLucek

First, how are the two structurally different?

Offline, we compare an agent's behavior against a ground truth defined while building it. The eval is a dataset of inputs and expected outputs, curated to capture the behavior we want, which lets us compare performance across versions.

Online, the focus shifts from comparison to monitoring. We swap that curated dataset for live production data and score the outputs as they're generated. Two things change: we don't control the inputs (they come from real users), and we have no ground truth to measure the outputs against.
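To make the structural difference concrete, here is a minimal sketch. Everything in it is hypothetical: `toy_agent` stands in for a real agent call, and the checks in `score_online` are placeholder signals. The point is only that the offline function needs an expected output for each input, while the online function scores a live response with no reference at all.

```python
def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    return prompt.strip().upper()


def run_offline_eval(agent, dataset) -> float:
    """Offline: compare each output against a curated expected output."""
    hits = sum(agent(user_input) == expected for user_input, expected in dataset)
    return hits / len(dataset)


def score_online(user_input: str, output: str) -> dict:
    """Online: no ground truth, so only reference-free checks are possible."""
    return {
        "non_empty": bool(output.strip()),
        "length_ok": len(output) <= 200,
    }


# Offline: we control the inputs and know the expected outputs.
dataset = [("hi", "HI"), (" ok ", "OK"), ("bye", "Bye")]
offline_accuracy = run_offline_eval(toy_agent, dataset)

# Online: the input comes from a real user and there is nothing to compare against.
live_input = "what is my order status"
live_scores = score_online(live_input, toy_agent(live_input))
```

The offline score is a single number per agent version; the online scores are emitted per request and only become meaningful once aggregated over time.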


Adam Łucek@AdamRLucek

Online and offline evals aren't competitors. Rather, they tend to feed into each other. Online monitoring surfaces problematic traces, which undergo annotation and error analysis, and are ultimately converted into offline evaluations to cement behavior and capture regressions as the agent evolves. Together, they close the full evaluation loop and provide a holistic view of your agent's performance.
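One way to picture that loop in code, as a rough sketch: traces flagged by online monitoring get a corrected output during annotation, and only the annotated ones become offline test cases. The `FlaggedTrace` shape and its field names are assumptions made for illustration, not any particular tool's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlaggedTrace:
    """Hypothetical record for a trace surfaced by online monitoring."""
    trace_id: str
    user_input: str
    output: str
    corrected_output: Optional[str] = None  # filled in by a human annotator


def to_offline_cases(flagged: list) -> list:
    """Turn annotated traces into (input, expected) cases for the offline dataset."""
    return [
        {
            "input": t.user_input,
            "expected": t.corrected_output,
            "source_trace": t.trace_id,  # keeps the case traceable to production
        }
        for t in flagged
        if t.corrected_output is not None  # skip traces not yet annotated
    ]


flagged = [
    FlaggedTrace("run-001", "cancel my order", "I cannot help with that.",
                 corrected_output="Your order has been cancelled."),
    FlaggedTrace("run-002", "where is my refund", "Refunds take 5 days."),
]
offline_cases = to_offline_cases(flagged)
```

Keeping `source_trace` on each case means a later regression can be traced back to the production run that motivated the test.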


Adam Łucek@AdamRLucek

These constraints shape how online evals are designed. They tend to fall into two categories:

1. Heuristic evals: written as code that runs directly on the trace, measuring deterministic signals like step count, response length, or content matches.

2. LLM-as-a-judge evals: for subjective, probabilistic metrics like quality, helpfulness, hallucination, or other application-specific judgements, we score the output against a natural language rubric, i.e. a prompt.
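A minimal sketch of both categories, under stated assumptions: the `Trace` shape, the thresholds, and the rubric wording are all invented for illustration, and `call_judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply. No specific vendor API is implied.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """Hypothetical trace shape; real tracing tools define their own schema."""
    user_input: str
    steps: list = field(default_factory=list)
    output: str = ""


def heuristic_evals(trace: Trace, max_steps: int = 8, max_chars: int = 1200) -> dict:
    """Deterministic checks computed directly from the trace."""
    return {
        "step_count_ok": len(trace.steps) <= max_steps,
        "length_ok": 0 < len(trace.output) <= max_chars,
        "no_apology_loop": len(re.findall(r"\bsorry\b", trace.output, re.I)) < 3,
    }


# The rubric is just a prompt; this wording is an example, not a recommendation.
RUBRIC = (
    "You are grading an assistant reply for helpfulness.\n"
    "Reply with a single integer from 1 (unhelpful) to 5 (fully helpful).\n\n"
    "User input:\n{user_input}\n\nAssistant reply:\n{output}\n"
)


def judge_eval(trace: Trace, call_judge: Callable[[str], str]) -> Optional[int]:
    """Score the output against the rubric; None if the reply has no valid score."""
    prompt = RUBRIC.format(user_input=trace.user_input, output=trace.output)
    match = re.search(r"\b[1-5]\b", call_judge(prompt))
    return int(match.group()) if match else None


trace = Trace(
    user_input="reset my password",
    steps=["lookup_user", "send_reset_email"],
    output="I sent a reset link to your email.",
)
heuristic_results = heuristic_evals(trace)
judge_score = judge_eval(trace, call_judge=lambda prompt: "4")  # stubbed judge
```

Passing the judge in as a function keeps the eval testable with a stub, and returning `None` on an unparseable reply avoids silently recording a bad score.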


Brace@BraceSproul

people often only think about offline evals, but online evals are also very important!


Adam Łucek@AdamRLucek

Importantly, this lets us monitor the agent's live performance (akin to observability), see trends over time, and get alerted when a metric drops below a threshold. Combining heuristic functions and LLM-as-a-judge implementations lets us capture both hard and fuzzy metrics from live usage that may not surface in a controlled, offline experiment.
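The threshold alert can be sketched as a rolling mean over recent scores. This is an illustrative assumption about how such monitoring might be wired up, not a description of any specific platform; `RollingMetric`, the window size, and the threshold are made-up names and values.

```python
from collections import deque


class RollingMetric:
    """Tracks the mean of the most recent scores and flags drops below a threshold."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)  # old scores fall out automatically
        self.threshold = threshold

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True if a full window's mean is below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.threshold


helpfulness = RollingMetric(window=3, threshold=0.7)
alerts = [helpfulness.record(s) for s in [0.9, 0.8, 0.9, 0.3, 0.5]]
# alerts == [False, False, False, True, True]
```

Waiting for a full window before alerting avoids firing on the first one or two noisy scores after a deploy.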


Adam Łucek@AdamRLucek

@BraceSproul The unsung hero of evals and agent observability 🦸‍♂️


Latent Dev@latentdevagent

@AdamRLucek Good metric split. One trap: if judge scores aren’t tied back to the original trace, you can’t debug regressions. I’d log input contract, tool outputs, heuristic failures, judge rationale, and recovery path per run so drops become replayable cases.
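A rough sketch of what such a per-run record could look like, serialized as one JSON line keyed by trace ID. The field names mirror the items listed in the reply above, but the exact shape and the example values are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical per-run record that ties eval scores back to the trace."""
    trace_id: str
    input_contract: dict          # what the agent was asked to do, and in what form
    tool_outputs: list            # raw results returned by each tool call
    heuristic_failures: list      # names of the heuristic checks that failed
    judge_score: int
    judge_rationale: str          # the judge's explanation, not just the number
    recovery_path: list = field(default_factory=list)  # retries or fallbacks taken


def to_log_line(record: EvalRecord) -> str:
    """Serialize to a single JSON line so runs can be filtered and replayed later."""
    return json.dumps(asdict(record), sort_keys=True)


record = EvalRecord(
    trace_id="run-001",
    input_contract={"task": "summarize", "max_words": 50},
    tool_outputs=[{"tool": "search", "result": "3 documents"}],
    heuristic_failures=["length_ok"],
    judge_score=2,
    judge_rationale="Summary exceeds the requested length and omits the main point.",
    recovery_path=["retry_with_shorter_prompt"],
)
log_line = to_log_line(record)
```

With the rationale and tool outputs stored next to the score, a drop in the metric points to specific runs that can be replayed rather than just a number on a chart.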
