Tech · 2h ago

T3 Stack creator Theo proposes niche benchmarks like ts-bench and ios-bench to evaluate domain-specific AI performance

Logan Kilpatrick supported the call for diverse evaluations.

Original post

Theo - t3.gg@theo · #1851 in Tech

We need more niche benches.

We need ios-bench. We need ts-bench. We need baseball-bench. We need yt-thumbnail-bench.

We need way more creativity in how we measure what models can do.

6:52 PM · Jun 10, 2026 · 24.2K Views

Sentiment

Users strongly back the push for niche AI benchmarks because generic ones feel stale and specialized evaluations better capture real use cases.

Positive: 93.7% · Negative: 6.3%

16 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 2.4K · Bookmarks: 2 · Likes: 43 · Replies: 5

Logan Kilpatrick@OfficialLoganK

@theo yes

1h ago

Retweets: 1

Jon Klaric@complex_maths

Quietly working on water-bench (where I get the AIs to explain and implement a bunch of tasks in water engineering, from programming, mathematics, optimization, GIS operations etc) and LogoBench (where I get the AIs to make an SVG recreation of a relatively complex image of a logo).

1h ago

Ben Dicken@BenjDicken

@theo how would you measure things like yt-thumbnail-bench?

2h ago

Nathan Wilbanks@NathanWilbanks_

@theo pokemon-bench, which is really binary to WASM bench for me cause im trying to brute force static recompile literally every retro video game.

in the end i should have a good system for porting any arbitrary binary to WASM

all the games in the browser!

11m ago

Theo - t3.gg@theo

@BenjDicken For image gen? not happening, garbage.

I'm imagining a bench that uses vision models to "rank" thumbnails and compare to how well they do in reality

2h ago
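One way to read the idea above (have a vision model "rank" thumbnails, then compare against how they did in reality) is as a rank-correlation score. The thread names no metric, so this is only a minimal sketch that assumes Spearman's rank correlation as the comparison; `model_scores` and `real_ctr` are hypothetical, made-up numbers standing in for a model's ratings and the thumbnails' real click-through rates.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one entry per thumbnail (not from the thread).
model_scores = [0.9, 0.4, 0.7, 0.2]    # vision model's rating of each thumbnail
real_ctr = [0.08, 0.03, 0.05, 0.04]    # observed click-through rate of each one

print(round(spearman(model_scores, real_ctr), 3))  # prints 0.8
```

A score near 1 would mean the model orders thumbnails much like real viewers did, while a score near 0 would mean its ranking carries little signal. Real click-through also depends on title, topic, and channel size, so a serious bench would need to control for those.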

Joshua DeLaughter@Cheeks2184

@theo I'm sure someone made a hentai bench before anything else. 😆

2h ago

jason@jxnlco

@OfficialLoganK @theo The eval should be whole video plus transcript to Jpeg

1h ago

ен ен@pasivisiziran

@OfficialLoganK @theo We need Gemini 3.5 Pro 👀

1h ago

irshit@irshit0

@theo Yeah true

2h ago

Logan Kilpatrick@OfficialLoganK

@pasivisiziran @theo yes

1h ago

pretty.hate.machine@southphxceleb

@theo I made a bench to test new models on how closely they stick to a language’s unique idiosyncrasies when building db’s, i ran it and saw how much it cost me in API and never ran it again

2h ago

BijanBowen@bijanbowen

@Endal1791 @theo :D I gave it harder tests in the follow up, it made this little v8 3d engine that an rc motor slides into

30m ago

Christian@chrislciaba

@theo We need dynamic benchmarks you can’t put in the training data

2h ago

Zak El Fassi@zakelfassi

@theo i gotchu http://havewemadeaunicorn.com

1h ago

Aiden Bai@aidenybai

@theo @DanielleFong ok we'll do it

1h ago

Jason@jason_bennitt

@Validate_QA @OfficialLoganK @theo Lol, that's the screenshot from the announcement post. Fable went pretty agrarian. 👨🏻‍🌾 Working on the v0 metrics and media now.

1h ago