We need more niche benches.
We need ios-bench. We need ts-bench. We need baseball-bench. We need yt-thumbnail-bench.
We need way more creativity in how we measure what models can do.
Logan Kilpatrick supported the call for diverse evaluations.
Users strongly back the push for niche AI benchmarks because generic ones feel stale and specialized evaluations better capture real use cases.
@theo yes

Quietly working on water-bench (where I get the AIs to explain and implement a bunch of tasks in water engineering, from programming, mathematics, optimization, GIS operations etc) and LogoBench (where I get the AIs to make an SVG recreation of a relatively complex image of a logo).

@theo how would you measure things like yt-thumbnail-bench?

@theo pokemon-bench, which is really binary to WASM bench for me cause im trying to brute force static recompile literally every retro video game.
in the end i should have a good system for porting any arbitrary binary to WASM
all the games in the browser!

@BenjDicken For image gen? not happening, garbage.
I'm imagining a bench that uses vision models to "rank" thumbnails and compare to how well they do in reality
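
One way to read the "rank thumbnails and compare to reality" idea is as a rank-correlation score: a vision model scores each thumbnail, and the bench reports how well that ordering agrees with real-world performance. The sketch below is only an illustration under that assumption. The function names, the use of Spearman correlation, and the choice of click-through rate as the "reality" signal are all hypothetical and not something the thread specifies.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (spread_x * spread_y)


def thumbnail_bench_score(model_scores, real_ctr):
    """Spearman correlation between model scores and observed performance.

    model_scores: hypothetical per-thumbnail scores from a vision model.
    real_ctr: hypothetical observed click-through rates, same order.
    """
    if len(model_scores) != len(real_ctr):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(model_scores), average_ranks(real_ctr))
```

A score near 1 would mean the model orders thumbnails the way viewers do, near 0 means no relationship, and a negative score means it prefers the thumbnails that perform worse.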

@theo I'm sure someone made a hentai bench before anything else. 😆

@OfficialLoganK @theo The eval should be whole video plus transcript to Jpeg

@OfficialLoganK @theo We need Gemini 3.5 Pro 👀

@theo Yeah true

@pasivisiziran @theo yes

@theo I made a bench to test new models on how closely they stick to a language’s unique idiosyncrasies when building db’s, i ran it and saw how much it cost me in API and never ran it again

@Endal1791 @theo :D I gave it harder tests in the follow up, it made this little v8 3d engine that an rc motor slides into

@theo We need dynamic benchmarks you can’t put in the training data

@theo i gotchu http://havewemadeaunicorn.com

@theo @DanielleFong ok we'll do it

@Validate_QA @OfficialLoganK @theo Lol, that's the screenshot from the announcement post. Fable went pretty agrarian. 👨🏻‍🌾 Working on the v0 metrics and media now.

@theo just write your own for your agent: https://trackbaseline.com
longitudinal baselines are important too - track how agents do day by day, since providers are rug pulling capacity and intelligence.
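
A longitudinal baseline of this kind could be as simple as comparing each day's score with the average of the days before it. The sketch below assumes one aggregate score per day (for example, a pass rate between 0 and 1). The function name, the 7-day window, and the 0.05 drop threshold are illustrative assumptions, and it does not reflect how trackbaseline.com works.

```python
from statistics import mean


def flag_regressions(daily_scores, window=7, drop=0.05):
    """Return indices of days that fall more than `drop` below the
    mean of the preceding `window` days.

    daily_scores: hypothetical list of one score per day, oldest first.
    """
    flagged = []
    for day in range(window, len(daily_scores)):
        baseline = mean(daily_scores[day - window:day])
        if baseline - daily_scores[day] > drop:
            flagged.append(day)
    return flagged
```

The first `window` days are never flagged because there is not enough history to form a baseline for them.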

@bijanbowen @theo Was saving it for my drive home! I can't wait to see!

@theo yes

@theo ts+bun+hono bench pls.