Tech · 15h ago

Vals AI adds a toggle to view Fable 5 results with refusals marked as zero instead of falling back to Opus 4.8

The previous fallback affected GPQA, MMLU, and MMMU metrics.

Original post
Andrew Drozdov @mrdrozdov · #1031 in Tech

Evals are changing!

12:05 AM · Jun 10, 2026 · 351 Views
Sentiment

Positive users thank Vals AI for adding a view of Fable 5 scores with Opus 4.8 fallbacks disabled, since the change addresses their earlier concerns. Negative users argue that the model's poor ranking should be the headline instead.

Pos 66.7% · Neg 33.3% (3 comments with sentiment)

Update: They listened!! Great transparency and quick reaction, kudos :)

2h · Views 705 · Likes 10 · Bookmarks 1
RETWEETS 1
Vals AI @ValsAI

We have added the ability to view Fable 5 scores with Opus 4.8 fallbacks disabled to the Vals AI website (refusals are marked as zero).

The eval community was ill-equipped for this, but transparency is our first priority.

We’re anticipating more models like this, and are developing our official policy going forward.

2h · Views 4.5K · Likes 94 · Bookmarks 14
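A minimal sketch of the difference between the two views described in the post above: scoring with a fallback model answering refused questions versus marking refusals as zero. The `Result` record, its field names, and the `score` function are hypothetical illustrations, not Vals AI's actual pipeline, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """One benchmark question (hypothetical record layout)."""
    refused: bool                            # primary model refused to answer
    correct: bool = False                    # primary model's answer was correct
    fallback_correct: Optional[bool] = None  # fallback model's answer, if one was used


def score(results, use_fallback):
    """Return accuracy over all questions.

    With use_fallback=True a refused question is graded on the fallback
    model's answer; with use_fallback=False a refusal counts as zero.
    """
    if not results:
        raise ValueError("no results to score")
    total = 0
    for r in results:
        if r.refused:
            if use_fallback and r.fallback_correct:
                total += 1
        elif r.correct:
            total += 1
    return total / len(results)


# Made-up data: two answered questions, two refusals rescued by the fallback.
results = [
    Result(refused=False, correct=True),
    Result(refused=False, correct=False),
    Result(refused=True, fallback_correct=True),
    Result(refused=True, fallback_correct=True),
]

print(score(results, use_fallback=True))   # 0.75
print(score(results, use_fallback=False))  # 0.25
```

On a benchmark with a high refusal rate, the gap between the two numbers is largely the fallback model's contribution, which is why the toggle changes the picture so sharply.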
REPLIES 2
Vals AI @ValsAI

We are also releasing the per-benchmark fallback rates.

The majority of benchmarks had no or very low fallback rates, but as mentioned in our previous post, the safety classifier was highly sensitive to certain benchmarks.

For example, MMLU Biology and Health have nearly a 100% rejection rate. ProgramBench also has a 100% rejection rate, likely due to the phrase “reverse engineer” being present in the system prompt.

2h · Views 281 · Likes 10
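As a rough illustration of what a per-benchmark fallback rate measures, the sketch below computes the fraction of refused questions for each benchmark. The function name, the `(benchmark, refused)` record shape, and the benchmark names are assumptions for illustration only; the numbers are invented and are not Vals AI's published rates.

```python
from collections import defaultdict


def fallback_rates(records):
    """Map each benchmark name to the fraction of its questions that were refused.

    `records` is an iterable of (benchmark_name, refused) pairs.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for name, was_refused in records:
        totals[name] += 1
        if was_refused:
            refused[name] += 1
    return {name: refused[name] / totals[name] for name in totals}


# Invented records: bench_a is always refused, bench_b one time in four.
records = [
    ("bench_a", True),
    ("bench_a", True),
    ("bench_b", False),
    ("bench_b", True),
    ("bench_b", False),
    ("bench_b", False),
]

rates = fallback_rates(records)
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.0%}")
```

Reporting the rate per benchmark rather than one overall figure makes it visible when refusals concentrate in a few subject areas.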
Vals AI @ValsAI

We also saw a small quantity of rejections on our Finance Agent Benchmark. These were only on questions analyzing pharmaceutical or biological public companies.

Here is a stripped-down reproduction (real FAB questions are far harder)

2h · Views 166 · Likes 6

@ValsAI great!!! thanks for listening :)

2h · Views 114 · Likes 8
Vals AI @ValsAI

The rejected tasks (16% overall) for Terminal Bench 2.1 were for biology and cyber.

The tasks were: write-compressor, vulnerable-secret, sam-cell-seg, protein-assembly, path-tracing-reverse, password-recovery, model-extraction-relu-logits, feal-linear-cryptanalysis, feal-differential-cryptanalysis, extract-elf, dna-insert, dna-assembly, crack-7z-hash, code-from-image.

Some tasks were not necessarily rejected on every rollout.

2h · Views 162 · Likes 3
Vals AI @ValsAI

Going forward, evaluations will have to report not only on capability, but also how much of that capability is available to users.

We will soon be sharing updated methodology on tracking and reporting APIs that ship with fallback models or have high rejection rates.

2h · Views 157 · Likes 7
Conor @jconorgrogan

@ValsAI Claude Fable comes in dead last out of 60+ models tested in two separate MMLU benchmarks, scoring below Llama 3.2-1b and Qwen2-0B on MMLU evals

2h · Views 23
david @davidtsong

@ValsAI awesome work

1h · Views 37 · Likes 1
Conor @jconorgrogan

@ValsAI THIS should be the headline for Fable

2h · Views 15