/AI · 3h ago

Google's Logan Kilpatrick proposes a venture capital strategy driven entirely by deep AI model benchmarking and evaluations

Emad Mostaque suggested naming the benchmarking-focused firm "atp+".


Sentiment

Many users view AI model benchmarking as the future of deep-tech venture investing because it offers superior due diligence and an information edge in tracking model improvements.

Pos

85.7%

Neg

14.3%

8 comments with sentiment.
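The percentages above do not correspond to any whole-number split of 8 comments, so they were presumably computed over a smaller set. A minimal sketch, assuming the shares are rounded to one decimal place and cover only comments scored positive or negative (the function name `candidate_splits` is hypothetical, not part of the page):

```python
def candidate_splits(pos_pct, max_comments, tol=0.05):
    """Return (positive, negative) counts whose positive share matches pos_pct.

    Assumption: pos_pct was rounded to one decimal, so a match is anything
    within `tol` percentage points of the reported value.
    """
    matches = []
    for total in range(1, max_comments + 1):
        for pos in range(total + 1):
            if abs(100.0 * pos / total - pos_pct) < tol:
                matches.append((pos, total - pos))
    return matches


print(candidate_splits(85.7, 8))  # [(6, 1)]
```

Under that assumption, the only fit is 6 positive and 1 negative comment, which would leave one of the 8 comments outside the positive/negative split.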

Cluster Engagement

Posts from X

Most Activity

VIEWS 2.2K

Logan Kilpatrick@OfficialLoganK

only a matter of time until all the top firms are doing this

3h2.2K121

BOOKMARKS 2 · LIKES 17 · RETWEETS 2

Logan Kilpatrick@OfficialLoganK

or put a different way, every venture firm should have an evals / benchmarks team putting the models through the paces, building new benchmarks, etc

3h1.6K172

REPLIES 4

Rampalli Karthik 🇯🇵🇺🇸@karthashok008

@OfficialLoganK Definitely. Have you considered that eval alpha decays fast? Public benchmarks saturate and get gamed. The edge has to live in proprietary evals, which means running a small research lab inside the firm: compute, evaluators, infra. That’s a cost structure most VCs don’t carry.

3h9

Logan Kilpatrick@OfficialLoganK

@alex_peys :)

3h66251

Curlheinz@Curlh1

@OfficialLoganK @grok explain to me what Logan means like I’m 12

3h2291

Logan Kilpatrick@OfficialLoganK

@EMostaque :)

3h2934

furnaces@furnaces

@OfficialLoganK there's only so much benchmarks can prove

vibes >>> benchmarks

3h230

Emad@EMostaque

@OfficialLoganK Could call it atp+

3h95

Leo Linsky@leo_linsky

@OfficialLoganK We do this at http://gertlabs.com/rankings

Did you know Gemini 3.5 Flash is a frontier model in real-time simulations and spatial reasoning, but drops the ball in abstract environments with constraints it wouldn't have seen in its training data?

2h1493

Praveen Kumar@itzzme_pk

@Curlh1 @OfficialLoganK @grok @grok explain to me what Logan means like I’m 20 with detailed explanations with examples

3h36

Varun Gangal@VarunGangal

Agree partially but some blockers are:

a) shifts in model ability on decisive areas can show emergence, i.e. hard perf metrics on an eval stay low because the model is weak on a few of the sub-abilities that compose most tasks in the eval, but once it stops being weak there, perf on the eval sort of breaks out...

b) Benchmarks / evals are themselves built kind of reactively but also slightly serendipitously: people building them (e.g. @OfirPress et al with SWE-Bench) see the frontier and get inspired or "catharsized" enough to put in the effort to make an eval [and this is a function of model ability], and it becomes viral enough to be adopted. For long-term prediction based on benchmarking, this kind of randomized evolution can be unstable to predict or marginalize over (as can the continual add/remove of the eval basket that's the basis for the prediction)

c) Also as some comments note here, saturation due to various reasons renders well known benchmarks useless as signals of growing model ability after a while, apart from their use to check non-regression...

2h34

Hanielle@Haniell59310840

@OfficialLoganK Can I also implement it in an existing company?

3h3081

Blissy@BlissyOnX

@OfficialLoganK kinda feels like we're entering the era where the best investors are just eval readers with a thesis

how do u avoid getting wrecked by a model release though?

3h821

MetaCritic Capital@MetacriticCap

@OfficialLoganK During the singularity? VC investment periods are too long for that.

If you had foresight about the need for agentic harnesses, what would you have done? Invested in Devin and Replit?

It's the big three labs and that's all.

3h210

Furkan Gözükara@FurkanGozukara

@OfficialLoganK When is gemini 3.5 pro?

Focus on video and audio understanding and less censorship

Give up coding

3h571

WENE@omijagun

@OfficialLoganK Model evals are becoming the new due diligence. The firms that can consistently identify where models are meaningfully improving (or plateauing) will have an information edge that traditional financial analysis can’t match.

3h451

Soroush Fadaeimanesh@S_Fadaeimanesh

@OfficialLoganK benchmark-driven thesis works only if you can hold the position between eval lead and revenue catching up. that gap is usually a few quarters and most LPs do not have patience for it.

2h136

TheAlphaLetters@TheAlphaLetters

The catch is the same one that breaks every quant signal: a benchmark edge is only an edge until it's priced in. Public evals get arbitraged away fast. The alpha isn't benchmarking deeply, it's benchmarking what the market hasn't priced yet: capability overhang the consensus can't see, or a trajectory read that's not obvious. The moment the eval is legible to everyone, the edge decays

3h134

ByteCrafter@bytecrafter_1

@OfficialLoganK benchmarks saturate way before the capability does though. half the interesting gaps never show up in public evals until someone ships a product into them. private task-specific evals feel like the actual alpha here

3h124

Sup*@mostlyaboutai

Doing this already. Distribution is still the key. Been working BTS with companies on product builds, model refinement, and evals, so I know how deep the industry already runs. Gotta position yourself well and pitch even better. P.S. been trying to get in on AI Studio too and share my views 👀

3h106
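Several replies above argue that public benchmarks stop carrying signal once they saturate. A rough sketch of how a team might flag that, assuming scores are percentages with a known ceiling; the function names, thresholds, and sample scores are all hypothetical and not taken from any post in the thread:

```python
def headroom(best_scores, ceiling=100.0):
    """Points left between the best observed score and the benchmark ceiling."""
    if not best_scores:
        raise ValueError("need at least one score")
    return ceiling - max(best_scores)


def is_saturated(best_scores, ceiling=100.0, min_headroom=5.0):
    """Treat a benchmark as saturated once little headroom remains.

    Assumption: min_headroom is an arbitrary illustrative cutoff.
    """
    return headroom(best_scores, ceiling) < min_headroom


# Hypothetical best-score histories across successive model releases.
public_eval = [60.0, 80.0, 96.0, 97.0]
private_eval = [20.0, 35.0, 50.0]

print(is_saturated(public_eval))   # True: only 3 points of headroom left
print(is_saturated(private_eval))  # False: 50 points of headroom left
```

A check like this only says when a benchmark has stopped discriminating between models. It says nothing about whether the underlying capability has plateaued, which is the distinction the replies draw.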