Arkil Patel posts paper “Forecasting Downstream Performance of LLMs With Proxy Metrics” showing cross-entropy loss correlates below 0.5 with downstream results while proposed proxy metrics exceed 0.8
Proxy metrics rank pretraining datasets with a fraction of the usual compute.
Nature is complex. Why would cross-entropy loss predict the scaling behavior of language models on downstream tasks? Introducing data-driven proxy metrics for scaling laws. Proxy metrics are incredibly useful, especially on tasks where models don't perform strongly yet.
Excellent work by @arkil_patel!
Excited to share our new paper! “Forecasting Downstream Performance of LLMs With Proxy Metrics” w/ my amazing advisors @sivareddyg, @mariusmosbach, @DBahdanau Cross-entropy loss is a poor predictor of how models perform on downstream tasks (esp. reasoning). We propose something better: proxy metrics computed over expert reasoning traces. 🧵 Thread below 👇
To democratize AI, we need to help AI practitioners argue how investment can bring returns in the form of superior intelligence. Forecasting downstream performance is super important! Check out @arkil_patel's work.