/AI · 8h ago

Alex Peysakhovich, Sutter Hill Ventures partner, says Codex failed as an autonomous researcher, burning $400 on Modal

The agent generated generic, ineffective machine learning ideas.

#143

Original post

alex peysakhovich @alex_peys · #1680 in AI

i gave codex a /goal to improve an ml training pipeline i am working on while i went for a hike.

during the hike i had an idea. which i came back and (codex) implemented and it worked to bump things up a bit.

in the meantime /goal spent $400 on modal and a lot of tokens to achieve nothing. i went through the ideas it had come up with and they were decent generic ml ideas (eg try this normalization) but terrible for the thing i was working on.

so… coding assistant? very good. even jr researcher? not yet.

10:24 AM · Jun 8, 2026 · 16.7K Views

Sentiment

Some users praised Codex for aiding custom autoresearch in bio modeling while many others criticized it for failing at basic research tasks like training runs and wasting tokens on complex frameworks.

Pos

37.5%

Neg

62.5%

8 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 974 · Bookmarks 2 · Likes 12 · Replies 1

Stanislav Fort@stanislavfort

This matches my recent experience with AI agents as research assistants. Amazing at coding, sub-good-masters-student at navigating the space of ideas and updating based on experimental results.

3h974122

Ravid Shwartz Ziv@ziv_ravid

@alex_peys But did you tell it not to make any mistakes?

3h21880

M@init_malachi

@alex_peys yay for us nature idea enjoyers

2h1813

alex peysakhovich@alex_peys

@cs_serdar they are fine for some things, if i was trying to squeeze the last 5% out of my architecture by throwing known tricks at the wall and seeing what sticks this is totally doable by a coding agent.

for figuring out that one of my training datasets has a subtle issue, not so much

4h2761

serdarml@cs_serdar

@alex_peys I fell into the autoresearch hype today and tried to get codex to do some basic stuff like run a default training run and eval it on some pipeline. The models are simply not good enough for agentic research, they do plenty of stupid things. This is with 5.5.

4h2681

alex peysakhovich@alex_peys

@init_malachi maybe codex needs a /touch_grass command

2h1691

JJ Schultz@jjschultz

@alex_peys hmm - over a year ago I did this using a manually scripted loop + gemini (for the big context window). I defined a goal (ie decrease loss) and fed in the logs of prev runs in the context. it tweaked the hyperparams and architecture.

and it worked AMAZING!

4h130
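
The loop described in the post above (define a goal such as lower loss, feed the logs of previous runs back in, let the model tweak hyperparameters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's actual script: `toy_train` and `toy_propose` are hypothetical stand-ins for a real training run and a real model call, and `max_runs` is an added hard cap on the number of runs so the loop cannot spend without limit.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
History = List[Tuple[Params, float]]  # (hyperparameters, resulting loss) per run


def toy_train(params: Params) -> float:
    """Hypothetical stand-in for a training run; loss is lowest at lr = 0.1."""
    return (params["lr"] - 0.1) ** 2


def toy_propose(history: History) -> Params:
    """Hypothetical stand-in for the model call.

    Reads the full log of previous runs and lowers the learning rate of the
    best run so far by a step that halves with every run already logged.
    """
    best_params, _ = min(history, key=lambda run: run[1])
    step = 0.5 ** len(history)
    return {"lr": best_params["lr"] - step}


def run_loop(
    train: Callable[[Params], float],
    propose: Callable[[History], Params],
    initial: Params,
    max_runs: int,
) -> History:
    """Run at most `max_runs` proposals, feeding the whole history back each time."""
    history: History = [(initial, train(initial))]
    for _ in range(max_runs):
        params = propose(history)  # the proposer sees every previous run
        history.append((params, train(params)))
    return history


if __name__ == "__main__":
    runs = run_loop(toy_train, toy_propose, {"lr": 1.0}, max_runs=5)
    best_params, best_loss = min(runs, key=lambda run: run[1])
    print(best_params, best_loss)
```

In a real setup the proposer would be a model prompted with the run logs, and the cap would more likely be expressed in dollars or GPU hours than in run count.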

josepha.mayo@josepha_mayo

this happens to me anytime im working on new frameworks or trying to get an output that's not easily feasible. I put 30 mins every day into walking to think, and i get all the solutions in this period and write them in a note. codex woulda wasted tokens pushing through without getting it, so imma just give it the solution to implement

i gave it a prompt - search and think deeply as long as possible, bro didn't use 2 mins and just followed current methods with patches

2h591

serdarml@cs_serdar

@alex_peys It's definitely not useless, especially if the environment is set up perfectly. But it's prone to making slight mistakes it won't notice and blame other things/the experiment itself. The more "agentic" the workflow, the more likely it is for the result to be garbage.

4h361

alex peysakhovich@alex_peys

@jjschultz

4h85

Abhirama / ಅಭಿರಾಮ@AbhiRaama22

@alex_peys Can you share exactly what was the problem and what was your idea? If you can, that is.

2h80

MrDee@SOG🫡@sog_on_bird_app

@alex_peys Def not researcher but implementer yes

2h59

Paras@buildwithparas

@alex_peys there's no real ceiling on that $400, you set the goal and find out what it cost when you get back from the hike

2h51

Pulkit@puhlkit

In my mind, /goal exists to make sure the agent actually completes the task. There are many situations where models regularly fail:

1. Tell it to do 10 things. It will do some, say “I’ll do rest next”. Use /goal to get it to do all 10.
2. Tell it to do something repeatedly. It will stop without finishing. /goal makes sure it finishes. e.g.

3h23

infrecursion@infrecursion1

@alex_peys This is a horrible way to use goal. Goal is not for generating ideas, use gpt 5.5 xhigh or better gpt 5.5 pro for that. Goal is for implementation. You don't even have the minimum knowledge of how these things work and yet you're making claims about their capabilities, ironic.

10m61

Yash Raj@yraj__

@alex_peys I created this autoresearch skill for improving my bio related models, it has helped me a lot and has a lot of guardrails for agents to keep them in check. You can try it https://github.com/yashraj59/autoresearch-bio

59m61

Alex Shev@AlexshevPm

@stanislavfort That matches my experience too. Agents are much better when the task has a feedback loop: code runs, tests pass or fail, logs explain what happened. Open-ended research still needs a human steering the taste.

2h13

sakanade@0xsakanade

@alex_peys Maybe it’s a you thing

3h11

M@init_malachi

@alex_peys sounds within critical range of many pseud attractors for it. sincerity and slop often entangle when projected

2h10