
T3 Stack creator Theo proposes niche benchmarks like ts-bench and ios-bench to evaluate domain-specific AI performance

Logan Kilpatrick supported the call for diverse evaluations.

Original post
Theo - t3.gg (@theo) · #1851 in Tech

We need more niche benches.

We need ios-bench. We need ts-bench. We need baseball-bench. We need yt-thumbnail-bench.

We need way more creativity in how we measure what models can do.

6:52 PM · Jun 10, 2026 · 24.2K Views
Sentiment

Users strongly back the push for niche AI benchmarks because generic ones feel stale and specialized evaluations better capture real use cases.

Positive: 93.7% · Negative: 6.3% (16 comments with sentiment)
Cluster Engagement
Posts from X
Views 2.4K · Bookmarks 2 · Likes 43 · Replies 5
Logan Kilpatrick (@OfficialLoganK)

@theo yes

1h · Views 2.4K · Likes 43 · Bookmarks 2
Retweets 1
Jon Klaric (@complex_maths)

Quietly working on water-bench (where I get the AIs to explain and implement a bunch of tasks in water engineering, from programming, mathematics, optimization, GIS operations etc) and LogoBench (where I get the AIs to make an SVG recreation of a relatively complex image of a logo).

1h · Views 9 · Likes 1
Ben Dicken (@BenjDicken)

@theo how would you measure things like yt-thumbnail-bench?

2h · Views 657 · Likes 6
Nathan Wilbanks (@NathanWilbanks_)

@theo pokemon-bench, which is really binary to WASM bench for me cause im trying to brute force static recompile literally every retro video game.

in the end i should have a good system for porting any arbitrary binary to WASM

all the games in the browser!

11m · Views 38 · Likes 2 · Bookmarks 1

@BenjDicken For image gen? not happening, garbage.

I'm imagining a bench that uses vision models to "rank" thumbnails and compare to how well they do in reality

2h · Views 333 · Likes 2
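
One way to read the idea in the post above: have a vision model score a set of thumbnails, then check how closely that ordering agrees with real-world performance. A minimal Python sketch, assuming Spearman's rank correlation as the agreement measure (the post does not name one) and using made-up model scores and click-through rates purely for illustration:

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks for `values`, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: vision-model scores and observed CTRs for five thumbnails.
model_scores = [0.91, 0.40, 0.75, 0.62, 0.15]
observed_ctr = [0.081, 0.032, 0.055, 0.060, 0.012]
rho = spearman(model_scores, observed_ctr)
print(f"Spearman rho: {rho:.2f}")
```

A rho near 1 would mean the model's ranking tracks real performance closely; a value near 0 would mean its preferences say little about what actually gets clicked.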

@theo I'm sure someone made a hentai bench before anything else. 😆

2h · Views 361 · Likes 5
jason (@jxnlco)

@OfficialLoganK @theo The eval should be whole video plus transcript to Jpeg

1h · Views 153 · Likes 3
ен ен (@pasivisiziran)

@OfficialLoganK @theo We need Gemini 3.5 Pro 👀

1h · Views 63 · Likes 2
irshit (@irshit0)

@theo Yeah true

2h · Views 137 · Likes 1
pretty.hate.machine (@southphxceleb)

@theo I made a bench to test new models on how closely they stick to a language’s unique idiosyncrasies when building db’s, i ran it and saw how much it cost me in API and never ran it again

2h · Views 94 · Likes 4
BijanBowen (@bijanbowen)

@Endal1791 @theo :D I gave it harder tests in the follow up, it made this little v8 3d engine that an rc motor slides into

30m · Views 15 · Likes 1
Christian (@chrislciaba)

@theo We need dynamic benchmarks you can’t put in the training data

2h · Views 289 · Likes 2
Zak El Fassi (@zakelfassi)

@theo i gotchu http://havewemadeaunicorn.com

1h · Views 14
Aiden Bai (@aidenybai)

@theo @DanielleFong ok we'll do it

1h · Views 141 · Likes 2
Jason (@jason_bennitt)

@Validate_QA @OfficialLoganK @theo Lol, that's the screenshot from the announcement post. Fable went pretty agrarian. 👨🏻‍🌾 Working on the v0 metrics and media now.

1h · Views 11

@theo just write your own for your agent: https://trackbaseline.com

longitudinal baselines are important too - track how agents do day by day, since providers are rug pulling capacity and intelligence.

1h · Views 10
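
The day-by-day tracking described in the post above could look something like the following. This is a minimal Python sketch under stated assumptions: the function name, the rolling-window rule, the `window` and `tolerance` defaults, and the daily pass rates are all hypothetical, and none of it reflects how the linked tool works.

```python
from statistics import mean


def flag_regressions(daily_scores, window=3, tolerance=0.05):
    """Return indices of days whose score falls more than `tolerance`
    below the mean of the preceding `window` days."""
    flagged = []
    for day in range(window, len(daily_scores)):
        baseline = mean(daily_scores[day - window:day])
        if baseline - daily_scores[day] > tolerance:
            flagged.append(day)
    return flagged


# Hypothetical daily pass rates for one agent on a fixed task set.
scores = [0.82, 0.80, 0.81, 0.79, 0.70, 0.72, 0.80]
regressions = flag_regressions(scores)
print("Days flagged as regressions:", regressions)
```

Comparing each day against a short rolling baseline, rather than against a single launch-day score, is one simple way to notice a sudden drop without being thrown off by gradual drift.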
Endal (@Endal1791)

@bijanbowen @theo Was saving it for my drive home! I can't wait to see!

29m · Views 2 · Likes 1