BINEVAL framework evaluates LLMs using atomic binary questions to outperform G-Eval and UniEval

Hamel Husain@HamelHusain

Yes! Binary judges are far more practical for most people, because Likert scales (or scores) have too many footguns.

All the flashcards are here (inspired by @chrisalbon's flashcards): https://maven.com/parlance-labs/o/540bd8

elvis@omarsar0

If you use LLM-as-judge, this one is worth reading.

(bookmark it)

It's actually one of the most effective ways to use LLM-as-a-Judge for evals.

Holistic judge scores hide both their reasoning and their ceiling effects.

BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores.

Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal.

Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval, training-free, with especially strong results on factual consistency.

Paper: https://arxiv.org/abs/2606.27226

Learn to build effective AI agents in our academy: https://academy.dair.ai/
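The decompose, answer, aggregate loop described in the post above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `BinaryQuestion`, `Judge`, `ask_all`, `aggregate`, and `failed_questions` are hypothetical names, the judge is any callable standing in for an LLM call, and the unweighted per-dimension pass rate is an assumed stand-in for whatever calibrated aggregation BINEVAL actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BinaryQuestion:
    dimension: str  # evaluation dimension, e.g. "consistency"
    text: str       # one atomic yes/no question


# Hypothetical judge: (question, source, output) -> True for "yes".
# In practice this would wrap an LLM call; here it is only a type alias.
Judge = Callable[[str, str, str], bool]
Verdict = Tuple[BinaryQuestion, bool]


def ask_all(questions: List[BinaryQuestion], source: str, output: str,
            judge: Judge) -> List[Verdict]:
    """Answer every question independently for a single output."""
    return [(q, judge(q.text, source, output)) for q in questions]


def aggregate(verdicts: List[Verdict]) -> Dict[str, float]:
    """Per-dimension pass rate (assumption: plain unweighted mean)."""
    counts: Dict[str, List[int]] = {}
    for q, passed in verdicts:
        yes_total = counts.setdefault(q.dimension, [0, 0])
        yes_total[0] += int(passed)  # number of "yes" verdicts
        yes_total[1] += 1            # number of questions asked
    return {dim: yes / total for dim, (yes, total) in counts.items()}


def failed_questions(verdicts: List[Verdict]) -> List[str]:
    """The inspectable part: which questions came back "no"."""
    return [q.text for q, passed in verdicts if not passed]
```

Because each verdict is kept alongside its question, a low score on a dimension can be traced to the specific questions returned by `failed_questions`, which is the diagnostic property the post highlights.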

Chris Albon@chrisalbon

This is so cool

V0LYX@0xV0LYX

@omarsar0 BINEVAL sounds like a clean way to surface what's actually breaking instead of hiding it in a blended score.

"asking, not judging" is a better framing than most people realize for catching bad evaluation loops.

Pluto@plut0sx

@omarsar0 Two years tuning judges for 1 to 10 scores, turns out yes or no was the answer.

Joe Barrow@barrowjoseph

@HamelHusain @chrisalbon To add on: if you have to measure something more fine-grained than binary with an LLM, I think tournaments are a really useful paradigm.

For example, @Havelock_AI or our (inspired) analysis of local laws:

Daniel Smidstrup@DanielSmidstrup

@omarsar0 very clean breakdown, atomic verdicts feel way easier to trust :D

Nick Venturi@nickventuri

@omarsar0 holistic scores are basically just vibes anyway

Christopher@communicating

@omarsar0 A much more grounded approach. Great find, I missed this one! 🍺

Strata@ChainZenit

@omarsar0 this is a massive upgrade over holistic scoring, good find

Hunter Gon@gonlenidefi

@omarsar0 bookmark button smashed instantly

the binary framing makes me wonder how many hidden averages were floating around before

GeniusPothead 💹🧲@GeniusPothead

@omarsar0 This looks like a much more reliable evaluation framework

Nikolai Yakovenko@ivan_bezdomny

@omarsar0 yeah -- you really do need to ask a series of binary questions, then turn that into some kind of weighted sum (or another combination)
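One way to read the "weighted sum" suggestion in this reply, as a small sketch: each yes/no verdict contributes its weight when it passes, and the result is normalized to the range 0 to 1. The function name `weighted_score`, the question labels, and the weights are all hypothetical; nothing here is taken from the paper.

```python
from typing import Dict


def weighted_score(verdicts: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Normalized weighted sum of binary verdicts.

    verdicts maps a question label to True ("yes") or False ("no").
    weights maps the same labels to non-negative importance weights.
    """
    total = sum(weights[name] for name in verdicts)
    if total <= 0:
        raise ValueError("weights for the asked questions must sum to > 0")
    earned = sum(weights[name] for name, passed in verdicts.items() if passed)
    return earned / total


# Hypothetical example: groundedness matters most, concision least.
example = weighted_score(
    {"grounded": True, "complete": False, "concise": True},
    {"grounded": 3.0, "complete": 2.0, "concise": 1.0},
)
```

Here the failed "complete" check costs 2 of the 6 available weight points, so the individual verdicts stay visible even after they are combined into one number.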

Timur Yessenov@Timur_Yessenov

@omarsar0 Binary judge questions beat 1–10 scores because they leave something you can fix. “Was the answer grounded?” is actionable. “7.3 quality” just creates arguments. I’d rather have 30 small yes/no verdicts than one confident grade.

Tee Emm@tariqmustafa

@omarsar0 Here is the mega prompt

https://bitbucket.org/tariqmustafapk/bineval-prompt/src/d324c69dace36af4da68129a845aac81ab152904/BinevalMegaPrompt?at=main

Kay@kay_myg

@omarsar0 also worth looking at is...👇👇

V0LYX@0xV0LYX

@HamelHusain @chrisalbon binary is honestly underrated as a forcing function. forces you to actually define what passing means instead of hiding behind 7 vibes

a travesty in 9 parts@travofoz

@plut0sx @omarsar0 Always was. Gets u much closer to deterministic.

If the cut off is >7 and you get a 6 score do you feel confident? Will u get a 6 next time? A 7? An 8?

Use pass/fail or yes/no

I'll take a coin flip over a dice roll if I'm a betting man.

Joe Barrow@barrowjoseph

@HamelHusain @chrisalbon @Havelock_AI I could spill a lot more ink on it but it was surprisingly effective
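The tournament idea from this reply thread could look something like the round-robin sketch below. It is only an illustration under stated assumptions, not the method used by any tool mentioned here: `prefer` is a hypothetical pairwise comparator (in practice an LLM asked "which of these two is better?"), and ranking by raw win count is the simplest possible scheme.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical comparator: given two items, return the preferred one.
Prefer = Callable[[str, str], str]


def round_robin_wins(items: List[str], prefer: Prefer) -> Dict[str, int]:
    """Compare every pair once and count wins (items must be unique)."""
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        winner = prefer(a, b)
        if winner not in (a, b):
            raise ValueError("comparator must return one of its two inputs")
        wins[winner] += 1
    return wins


def rank(items: List[str], prefer: Prefer) -> List[str]:
    """Order items from most to fewest pairwise wins."""
    wins = round_robin_wins(items, prefer)
    return sorted(items, key=lambda item: wins[item], reverse=True)
```

Each comparison is still a binary decision, so the judge never has to produce an absolute score; the fine-grained ordering comes from combining many pairwise verdicts. A round-robin needs n(n-1)/2 comparisons, which is worth keeping in mind when each one is an LLM call.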

dag@theDrewDag

@omarsar0 Reminds me of the Sklearn days. Perhaps training a classifier on the yes/no outputs could save us some tokens and keep judgement cohesive with what the LLM would have outputted?

Story Overview

Binary scoring makes LLM judges easier to debug

Real-world reach beyond the reported benchmarks is still unknown