Many users praise LangChain's post-trained model for agent trace monitoring because it delivers cheap accurate evaluations that cut high costs and enable more experiments.
Most Activity
Detecting issues in production agent traces is hard. You have to do it cheaply (because of volume) but also accurately (or too much noise)
We post-trained our own model for this. SOTA accuracy, at ~10-100x cheaper rates than frontier models
Try it out: https://airtable.com/appWdRBlSecNOgErA/pagAEfUlHu4F35opm/form
http://x.com/i/article/2066537289107255296
New LangChain Labs x @FireworksAI_HQ study
http://x.com/i/article/2066537289107255296

@LangChain @FireworksAI_HQ I've been hit hard by eval costs - running GPT-4 for grading traces eats budget fast. Fine-tuned small models could be the game-changer for cost-sensitive infra teams.
there's a very exciting future agent recipe for building intelligence too cheap to meter, applied towards extracting signals from every single Trace agents produce
it involves: 1. Fine-tuning efficient, specialized open models that reach frontier performance on narrow, important tasks
2. Understanding Trace data at massive scale so we can extract signals to improve every agent over long-time horizons --> Continual Learning framed as a Data Mining problem
we're excited to release some new work from LangChain Labs with the awesome folks @FireworksAI_HQ (shoutout @chahvivi and the excellent team over there)
we find that with good data design + SFT, builders can surpass frontier performance on LLM-as-a-judge tasks that read every Trace agents produce & extract signal from them via rubrics
reach out if any of this is interesting - and if you want to fine-tune your own judges to process every trace at scale
http://x.com/i/article/2066537289107255296
Great to work with the @FireworksAI_HQ team on this. Frontier judgement at a fraction of the cost couldn't be more important as the amount of signal produced by agents continues to exponentially increase.
http://x.com/i/article/2066537289107255296

@hwchase17 cheap + accurate or cheap + mostly accurate? one of those works way harder

@LangChain @FireworksAI_HQ The most valuable data may not be the output.
It may be the trace left behind when the output fails.

@hwchase17 Post-training on agent trace data makes sense — frontier models are overkill for structured log classification. Curious what distillation technique you used for the 10-100x cost reduction.

@hwchase17 Interesting! All the tool calls in the turn is the model input ? Do you handle compaction mid turn as well ?

@hwchase17 Cheap and accurate is the hard combo for trace eval. Post-training your own judge is the right call.

@LangChain Incredible working with the LangChain team. Let's go!

@Kio_yashi @LangChain @FireworksAI_HQ Exactly what devs need right now! Lower eval costs = more experiments = better models. This approach hits different 🔥

@Kio_yashi @LangChain @FireworksAI_HQ This fits the bigger wave of distillation replacing pricey APIs. Curious—what's the accuracy delta vs GPT-4 on your eval set?

@Kio_yashi @LangChain @FireworksAI_HQ What's your eval accuracy vs GPT-4? Been curious if the cost savings hold up across different trace complexity levels