You could make a living by helping companies fix their evals.
You wouldn't need anything else:
1. Show me how you are evaluating your product
2. This is how you can make it better
I'm always hearing the same story:
• Someone picks a benchmark early on
• Everyone becomes obsessed with optimizing against it
The minute you deploy your application, that benchmark becomes useless.
People using your application don't send the prompts you tested against.
They phrase things you didn't think of. They paste in formats you never saw. They ask in languages you didn't evaluate.
The solution is to start using your inference logs and traces. Look at the following:
• Prompts
• Responses
• Where the model refused to answer
• Where the model got the format wrong
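A first pass over those logs can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: the record fields (`prompt`, `response`, `expects_json`) and the refusal phrases are assumptions, so swap in whatever your own logs and model actually produce.

```python
import json
import re

# Hypothetical refusal phrases; tune these to your own model's wording.
REFUSAL_PATTERNS = re.compile(
    r"\b(i can't|i cannot|i'm unable to|i am unable to|i won't)\b",
    re.IGNORECASE,
)


def classify(record: dict) -> str:
    """Label one log record as 'refusal', 'bad_format', or 'ok'."""
    response = record["response"]
    if REFUSAL_PATTERNS.search(response):
        return "refusal"
    # 'expects_json' is an assumed flag saying the caller wanted JSON back.
    if record.get("expects_json"):
        try:
            json.loads(response)
        except json.JSONDecodeError:
            return "bad_format"
    return "ok"


def find_failures(records: list[dict]) -> list[dict]:
    """Return only the failed records, each tagged with its failure type."""
    failures = []
    for record in records:
        label = classify(record)
        if label != "ok":
            failures.append({**record, "failure": label})
    return failures
```

Keyword matching will miss some refusals and flag some false positives, so treat the output as a shortlist to read, not a final verdict.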
Logs and traces aren't for compliance only; they are the highest-signal dataset you have access to.
Nebius shipped a workspace for this called Data Lab. It lives inside Token Factory.
You can use it to find any failure cases and turn them into an evaluation and fine-tuning dataset.
This is a better training set than anything you could buy or collect: it will contain real use cases where your model failed.
Here is the loop you should aim toward:
• Read your logs
• Find the failures
• Build a dataset from them
• Use this for evaluation
• Fine-tune your model
• Deploy
• Repeat
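The "build a dataset" step of that loop can be sketched like this. It is a rough illustration under stated assumptions: each failure is a dict with a `prompt` field, and the 20% eval share is an arbitrary default. It does not use any Data Lab API. Hashing the prompt keeps the split stable across runs, so the same prompt never ends up in both the evaluation set and the fine-tuning set.

```python
import hashlib
import json


def split_failures(failures: list[dict], eval_fraction: float = 0.2):
    """Deterministically assign each failure to the eval or fine-tuning set."""
    eval_set, train_set = [], []
    for record in failures:
        digest = hashlib.sha256(record["prompt"].encode("utf-8")).digest()
        bucket = digest[0] / 256  # stable value in [0, 1) for this prompt
        if bucket < eval_fraction:
            eval_set.append(record)
        else:
            train_set.append(record)
    return eval_set, train_set


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The fine-tuning half still needs a corrected response for each prompt before you train on it; the logged response is the failure, not the target.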
Here is a link to a blog post with more information: https://nebius.com/services/token-factory/data-lab
Thanks to the Nebius team for partnering with me on this post.