Netflix research scientist Cameron R. Wolfe argues static LLM benchmarks must evolve to avoid saturation as models improve

He uses MMLU-Pro and MMLU-Redux to illustrate refinement strategies.

Original post

Evaluations should not be static. We need to evolve evaluation sets / benchmarks over time so that they remain relevant and unsaturated. There are three main ways we can refine our evals to make them better:

- Difficulty-based refinement: curating more difficult tasks or data to use for evaluation within a benchmark.
- Quality-based refinement: identifying and fixing issues in the benchmark (e.g., mislabeled data, vague or unrealistic questions, poor format, etc.).
- Diversity-based refinement: expanding the scope of questions and topics covered by a particular benchmark.

There are many ways to accomplish this, but here are a few concrete examples…

MMLU-Pro extends MMLU by making it more accurate, difficult and discriminative. Easy questions are removed by using model-based difficulty filtering, where we take a pool of eight models and remove questions that the majority of models get correct. More difficult questions are sourced from a variety of public datasets. All new and remaining questions undergo an extensive quality audit using a combination of human and LLM oversight.

MMLU-Redux takes a different approach of sampling ~100 questions per MMLU category and performing an extensive human quality audit. All questions are categorized into a pre-defined error taxonomy and modified by humans to form a more accurate benchmark. Around 7% of MMLU questions are found to contain errors, but the ratio varies by category.

BIG-Bench Extra Hard is constructed by replacing each task in BIG-Bench Hard with a corresponding task that tests a similar category of reasoning capabilities but is significantly more difficult. Tasks are sourced from a variety of existing reasoning benchmarks and manually chosen according to their topic and difficulty. Model-based filtering (i.e., testing a few models on tasks to see where they fail) is also used to inform the selection process. Benchmark authors prioritize longer problems that cannot be solved by cheating or random guessing.
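As a rough illustration of the model-based difficulty filtering described for MMLU-Pro, the sketch below drops any question that a strict majority of a model pool answers correctly. It is a minimal sketch under stated assumptions, not the benchmark authors' pipeline: the function name `keep_hard_questions`, the `max_correct` parameter, and the toy results are all hypothetical.

```python
from typing import Dict, List, Optional


def keep_hard_questions(
    results: Dict[str, List[bool]],
    max_correct: Optional[int] = None,
) -> List[int]:
    """Return indices of questions that at most `max_correct` models solve.

    `results` maps a model name to one bool per question (True = correct).
    By default, a question is dropped when a strict majority of models
    answer it correctly.
    """
    models = list(results)
    if not models:
        return []
    n_questions = len(results[models[0]])
    if any(len(results[m]) != n_questions for m in models):
        raise ValueError("every model needs one result per question")
    if max_correct is None:
        # Keep a question only if no more than half of the pool solves it.
        max_correct = len(models) // 2
    kept = []
    for q in range(n_questions):
        n_correct = sum(results[m][q] for m in models)
        if n_correct <= max_correct:
            kept.append(q)
    return kept


# Hypothetical per-question correctness for four models on five questions.
toy_results = {
    "model_a": [True, True, False, True, False],
    "model_b": [True, False, False, True, False],
    "model_c": [True, True, False, False, False],
    "model_d": [True, False, True, False, False],
}
hard_indices = keep_hard_questions(toy_results)
print(hard_indices)
```

Here question 0 is solved by all four models and is removed, while the rest survive. Lowering `max_correct` makes the filter stricter; the threshold is a design choice that trades benchmark size against difficulty.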
RealMath and MathArena are both continually evolving math benchmarks. RealMath automatically updates with new problems derived from newly-published research papers and discussion forums. MathArena evaluates LLMs on math competition problems only within a short time window after their release to avoid contamination risk and updates frequently with new problems that become available.

DatBench refines a wide variety of benchmarks for vision language models (VLMs) using a combination of data filtering / selection techniques:

- Converting multiple choice to generative-style questions.
- Removing questions that can be solved with no vision info.
- Performing model-based quality filtering to find questions with quality issues that are then further filtered by a more powerful model.
- Selecting the most discriminative examples (i.e., meaning they differentiate between the performance of different models) using item-response theory.
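To make "most discriminative examples" concrete, here is a minimal sketch that ranks items by how well they separate stronger models from weaker ones. Note the substitution: instead of fitting an item-response theory model, it uses the item-rest correlation from classical test theory as a simple stand-in for an IRT discrimination parameter. The function names and the toy response matrix are hypothetical and are not taken from DatBench.

```python
import math
from typing import List


def item_discrimination(responses: List[List[int]]) -> List[float]:
    """Item-rest correlation for each item (a proxy, not an IRT fit).

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    Each item's column is correlated with every model's score on the
    remaining items. Items or rest scores with no variance score 0.0.
    """
    n_models = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    scores = []
    for i in range(n_items):
        x = [row[i] for row in responses]
        y = [totals[m] - x[m] for m in range(n_models)]  # score without item i
        mean_x = sum(x) / n_models
        mean_y = sum(y) / n_models
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        if var_x > 0 and var_y > 0:
            scores.append(cov / math.sqrt(var_x * var_y))
        else:
            scores.append(0.0)
    return scores


def top_k_items(responses: List[List[int]], k: int) -> List[int]:
    """Indices of the k items with the highest discrimination score."""
    scores = item_discrimination(responses)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical responses: rows are models, columns are items.
toy_responses = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
toy_scores = item_discrimination(toy_responses)
best_item = top_k_items(toy_responses, 1)
```

In this toy matrix, item 0 is answered correctly by every model, so it cannot distinguish between them and scores 0.0, while item 3 correlates negatively with overall performance. A full IRT treatment would instead estimate per-item difficulty and discrimination jointly with per-model ability.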

10:12 AM · May 29, 2026

@cwolferesearch Very cool that you continue to write these excellent pieces to help researchers. :)

Cameron R. Wolfe, Ph.D. (@cwolferesearch)


5:12 PM · May 29, 2026 · 5.2K Views
7:38 PM · May 30, 2026 · 1.1K Views