23h ago

Researcher Explores Evals Strategies For Pre-Production AI Startups

Original post

I am particularly curious about evals for startups at stages where they don't have traces at all yet, unlike examples where you can evaluate conversations the AI has already held. This could mean pre-production products, or apps where the nature of the response is completely different from chat. What would an evals solution look like for a startup that is still deciding the model, parameters, prompt, and context? Surely, different decisions here can yield radically different outputs. The obvious solution that comes to my mind is to generate synthetic or manually written representative cases and run a configuration tournament across model + params + context combinations, while accounting for position bias and similar statistical effects. Is there a better way to think about evals before real traces exist? Curious how @HamelHusain and @sh_reya would think about this.

2:58 AM · May 24, 2026
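
To make the "configuration tournament" idea from the post concrete, here is a minimal sketch, not a recommended implementation. Everything in it is an assumption for illustration: the model names, the prompt labels, the `generate` stub (standing in for a real model call), and the `judge` stub (standing in for an LLM or human judge, with a deliberately built-in preference for whichever output it sees first). The one technique it shows is judging every pair in both orders, so a verdict that flips with presentation order is scored as a tie rather than a win.

```python
import itertools
from collections import defaultdict

# Hypothetical configuration grid; these are placeholders, not real model IDs.
MODELS = ["model-a", "model-b"]
TEMPERATURES = [0.0, 0.7]
PROMPTS = ["terse", "detailed"]

# Hand-written or synthetic representative cases (stand-ins for real traces).
CASES = [
    "summarize a refund policy",
    "extract dates from an invoice",
    "draft a status update",
]


def generate(config, case):
    """Stub for a model call: returns a fake output string for (config, case)."""
    model, temperature, prompt = config
    return f"{model}|{temperature}|{prompt}|{case}"


def quality(output):
    """Toy hidden quality score so the sketch is deterministic."""
    model, temperature, prompt, _ = output.split("|")
    return (model == "model-b") * 2 + (prompt == "detailed") * 1 - float(temperature)


def judge(first, second, position_bias=0.5):
    """Stub pairwise judge that unfairly favors whichever output is shown first."""
    return "first" if quality(first) + position_bias >= quality(second) else "second"


def run_tournament(configs, cases):
    """Round-robin over config pairs; each pair is judged in both orders."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b in itertools.combinations(configs, 2):
        for case in cases:
            out_a, out_b = generate(a, case), generate(b, case)
            verdict_ab = judge(out_a, out_b)  # a shown first
            verdict_ba = judge(out_b, out_a)  # b shown first
            # 2 = a won in both orders, 0 = b won in both,
            # 1 = the verdict depended on order, so score it as a tie.
            a_wins = (verdict_ab == "first") + (verdict_ba == "second")
            wins[a] += a_wins / 2
            wins[b] += 1 - a_wins / 2
            games[a] += 1
            games[b] += 1
    return {c: wins[c] / games[c] for c in configs}


configs = list(itertools.product(MODELS, TEMPERATURES, PROMPTS))
win_rates = run_tournament(configs, CASES)
ranking = sorted(configs, key=lambda c: win_rates[c], reverse=True)
for config in ranking:
    print(config, round(win_rates[config], 3))
```

In a real setup the two stubs would be replaced by actual model calls and an actual judge, and a full round-robin grows quadratically with the number of configurations, so a larger grid would likely need sampling or an elimination bracket instead.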