
Emad Mostaque suggested naming the benchmarking-focused firm "atp+".
Many users view AI model benchmarking as the future of deep-tech venture investing because it offers superior due diligence and an information edge in tracking model improvements.

only a matter of time until all the top firms are doing this

or put a different way, every venture firm should have an evals / benchmarks team putting the models through their paces, building new benchmarks, etc

@OfficialLoganK Definitely. Have you considered that eval alpha decays fast? Public benchmarks saturate and get gamed. The edge has to live in proprietary evals, which means running a small research lab inside the firm: compute, evaluators, infra. That’s a cost structure most VCs don’t carry.

@alex_peys :)

@OfficialLoganK @grok explain to me what Logan means like I’m 12

@EMostaque :)

@OfficialLoganK there's only so much benchmarks can prove
vibes >>> benchmarks

@OfficialLoganK Could call it atp+

@OfficialLoganK We do this at http://gertlabs.com/rankings
Did you know Gemini 3.5 Flash is a frontier model in real-time simulations and spatial reasoning, but drops the ball in abstract environments with constraints it wouldn't have seen in its training data?

@Curlh1 @OfficialLoganK @grok @grok explain to me what Logan means like I’m 20 with detailed explanations with examples

Agree partially, but some blockers are:
a) Decisive shifts in model ability can look emergent: hard performance metrics on an eval stay low while the model is weak on just a few of the sub-abilities that compose most of its tasks, but once those weaknesses are fixed, performance on the eval sort of breaks out...
b) Benchmarks / evals are themselves built somewhat reactively, and also a bit serendipitously: the people building them (e.g. @OfirPress et al. with SWE-Bench) see the frontier and get inspired or "catharsized" enough to put in the effort to make an eval [which is itself a function of model ability], and it then goes viral enough to be adopted. For long-term prediction based on benchmarking, this kind of randomized evolution is unstable to predict or marginalize over (as is the continual adding and removing of evals in the basket that the prediction is based on).
c) Also, as some comments here note, saturation, for various reasons, renders well-known benchmarks useless as signals of growing model ability after a while, apart from their use in checking for non-regression...
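
Point (a) above can be illustrated with a toy model. This is only a sketch under assumptions that are not from the thread: each eval task succeeds only if every step succeeds, steps are independent, and one weak sub-ability is exercised several times per task. The function name and all parameter values are hypothetical.

```python
def eval_score(weak, strong=0.99, n_strong=4, weak_uses=8):
    """Toy eval score: probability that one task is solved end to end.

    Assumptions (illustrative only):
    - a task needs `n_strong` steps the model is already good at,
      each succeeding with probability `strong`;
    - it also needs `weak_uses` steps that rely on one weak sub-ability,
      each succeeding with probability `weak`;
    - all steps are independent, so the probabilities multiply.
    """
    return (strong ** n_strong) * (weak ** weak_uses)


if __name__ == "__main__":
    # Sweep the weak sub-ability upward while everything else stays fixed.
    for weak in (0.50, 0.65, 0.80, 0.95):
        print(f"weak sub-ability = {weak:.2f} -> eval score = {eval_score(weak):.3f}")
```

In this toy setup the score stays near zero while the weak sub-ability is mediocre, then climbs steeply as it approaches the level of the others, which is the "breakout" shape the comment describes. Real evals need not factor this cleanly.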

@OfficialLoganK Can I also implement it in an existing company?

@OfficialLoganK kinda feels like we're entering the era where the best investors are just eval readers with a thesis
how do u avoid getting wrecked by a model release though?

@OfficialLoganK During the singularity? VC investment periods are too long for that.
If you had foresight about the need for agentic harnesses, what would you have done? Invested in Devin and Replit?
It's the big three labs and that's all.

@OfficialLoganK When is gemini 3.5 pro?
Focus on video and audio understanding and less censorship
Give up coding

@OfficialLoganK Model evals are becoming the new due diligence. The firms that can consistently identify where models are meaningfully improving (or plateauing) will have an information edge that traditional financial analysis can’t match.

@OfficialLoganK benchmark-driven thesis works only if you can hold the position between eval lead and revenue catching up. that gap is usually a few quarters and most LPs do not have patience for it.

The catch is the same one that breaks every quant signal: a benchmark edge is only an edge until it's priced in. Public evals get arbitraged away fast, so the alpha isn't in benchmarking deeply; it's in benchmarking what the market hasn't priced yet: a capability overhang the consensus can't see, or a trajectory read that's not obvious. The moment the eval is legible to everyone, the edge decays.

@OfficialLoganK benchmarks saturate way before the capability does though. half the interesting gaps never show up in public evals until someone ships a product into them. private task-specific evals feel like the actual alpha here

Doing this already. Distribution is still the key. Been working BTS with companies on product builds, model refinement, and evals, so I know how deep the industry already runs. Gotta position yourself well and pitch even better. P.S. been trying to get in on AI Studio too and share my views 👀