I am particularly curious about evals for startups at a stage where they don't have any traces yet, unlike the usual examples where you evaluate conversations the AI has already had.
This could mean pre-production products, or apps where the nature of the response is completely different from chat.
What would an evals solution look like for a startup that is still deciding on the model, params, prompt, and context?
Surely, different decisions here can yield radically different outputs.
The obvious solution that comes to mind is to generate synthetic or hand-written representative cases and run a configuration tournament across model + params + context combinations, while controlling for position bias and similar statistical pitfalls.
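To make the tournament idea concrete, here is a rough sketch of what I have in mind. Everything in it is a placeholder: the model names, temperatures, prompt labels, and test cases are made up, and `generate` and `judge` are stubs standing in for a real model call and a real LLM or human judge. The only part that matters is the shape: every pair of configs is judged in both orders, and a win only counts when both orders agree, which is one simple way to blunt position bias.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical search space; these names are placeholders, not recommendations.
MODELS = ["model-a", "model-b"]
TEMPERATURES = [0.0, 0.7]
PROMPTS = ["terse", "detailed"]

# Hypothetical synthetic or hand-written cases.
CASES = ["refund request", "ambiguous question", "long multi-part task"]


def generate(config, case):
    """Stub for a real model call (assumption): returns a fake output string."""
    model, temperature, prompt = config
    return f"{model}|{temperature}|{prompt}|{case}"


def judge(case, first, second):
    """Stub pairwise judge (assumption): 0 if `first` wins, 1 if `second` wins.

    Toy rule: the longer output wins, and ties go to the first slot,
    which mimics a judge with position bias.
    """
    return 0 if len(first) >= len(second) else 1


def compare(case, out_a, out_b):
    """Judge both orderings; a side wins only if both verdicts agree."""
    a_shown_first = judge(case, out_a, out_b)  # 0 means a wins
    b_shown_first = judge(case, out_b, out_a)  # 0 means b wins
    if a_shown_first == 0 and b_shown_first == 1:
        return "a"
    if a_shown_first == 1 and b_shown_first == 0:
        return "b"
    return "tie"  # verdict flipped with the order, so treat it as noise


def tournament(configs, cases):
    """Round-robin over all config pairs for every case; returns win counts."""
    wins = Counter({config: 0 for config in configs})
    for case in cases:
        outputs = {config: generate(config, case) for config in configs}
        for a, b in combinations(configs, 2):
            result = compare(case, outputs[a], outputs[b])
            if result == "a":
                wins[a] += 1
            elif result == "b":
                wins[b] += 1
    return wins


configs = list(product(MODELS, TEMPERATURES, PROMPTS))
wins = tournament(configs, CASES)
for config, count in wins.most_common():
    print(config, count)
```

A full round-robin grows quadratically with the number of configs, so with a bigger grid you would probably sample pairs or prune weak configs early. Raw win counts are also only a starting point; with a handful of cases the gaps between configs can easily be noise.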
Is there a better way to think about evals before real traces exist? Curious how @HamelHusain and @sh_reya would think about this.