AI · 8h ago

Alex Peysakhovich, Sutter Hill Ventures partner, says Codex failed as an autonomous researcher, burning $400 on Modal

The agent generated generic, ineffective machine learning ideas.

Original post
alex peysakhovich @alex_peys · #1680 in AI

i gave codex a /goal to improve an ml training pipeline i am working on while i went for a hike.

during the hike i had an idea. which i came back and (codex) implemented and it worked to bump things up a bit.

in the meantime /goal spent $400 on modal and a lot of tokens to achieve nothing. i went through the ideas it had come up with and they were decent generic ml ideas (eg try this normalization) but terrible for the thing i was working on.

so… coding assistant? very good. even jr researcher? not yet.

10:24 AM · Jun 8, 2026 · 16.7K Views
Sentiment

Some users praised Codex for aiding custom autoresearch in bio modeling while many others criticized it for failing at basic research tasks like training runs and wasting tokens on complex frameworks.

Positive: 37.5% · Negative: 62.5% (8 comments with sentiment)
Cluster Engagement
Posts from X
Stanislav Fort @stanislavfort

This matches my recent experience with AI agents as research assistants. Amazing at coding, sub-good-masters-student at navigating the space of ideas and updating based on experimental results.

3h · 974 Views · 12 Likes · 2 Bookmarks · 1 Reply

@alex_peys But did you tell it not to make any mistakes?

3h · 218 Views · 8 Likes · 0 Bookmarks
M @init_malachi

@alex_peys yay for us nature idea enjoyers

2h · 181 Views · 3 Likes

@cs_serdar they are fine for some things, if i was trying to squeeze the last 5% out of my architecture by throwing known tricks at the wall and seeing what sticks this is totally doable by a coding agent.

for figuring out that one of my training datasets has a subtle issue, not so much

4h · 276 Views · 1 Like
serdarml @cs_serdar

@alex_peys I fell into the autoresearch hype today and tried to get codex to do some basic stuff like run a default training run and eval it on some pipeline. The models are simply not good enough for agentic research, they do plenty of stupid things. This is with 5.5.

4h · 268 Views · 1 Like

@init_malachi maybe codex needs a /touch_grass command

2h · 169 Views · 1 Like
JJ Schultz @jjschultz

@alex_peys hmm - over a year ago I did this using a manually scripted loop + gemini (for the big context window). I defined a goal (ie decrease loss) and fed in the logs of prev runs in the context. it tweaked the hyperparams and architecture.

and it worked AMAZING!

4h · 130 Views
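
The scripted loop described in the post above (define a goal such as lower loss, feed the logs of previous runs back in, let the model tweak hyperparameters) could look roughly like the sketch below. Everything here is an assumption for illustration: `propose` is a hypothetical stand-in for the model call (no real Gemini or Codex API is used), `toy_train` is a toy loss function rather than a real training run, and the hard `budget` cap is an added safeguard, not something the post mentions.

```python
import random
from dataclasses import dataclass


@dataclass
class Run:
    params: dict
    loss: float


def format_history(history):
    """Render previous runs as text, the way logs would be pasted into a model's context."""
    return "\n".join(
        f"run {i}: params={r.params} loss={r.loss:.6f}" for i, r in enumerate(history)
    )


def tuning_loop(train, propose, initial_params, max_runs, cost_per_run, budget):
    """Train, log, ask for new params, repeat; stop at max_runs or when the budget is spent."""
    history = []
    spent = 0.0
    params = dict(initial_params)
    # Only start a run if its cost still fits inside the budget.
    while len(history) < max_runs and spent + cost_per_run <= budget:
        loss = train(params)
        spent += cost_per_run
        history.append(Run(params, loss))
        best = min(history, key=lambda r: r.loss)
        # The proposer sees the full run log plus the best params so far.
        params = propose(format_history(history), best.params)
    best = min(history, key=lambda r: r.loss) if history else None
    return best, history, spent


def toy_train(params):
    # Hypothetical stand-in for a training run: loss is lowest at lr = 0.01.
    return (params["lr"] - 0.01) ** 2


_rng = random.Random(0)


def random_propose(history_text, best_params):
    # Hypothetical stand-in for the model call: perturb the best learning rate.
    # A real proposer would read history_text; this one ignores it.
    return {"lr": best_params["lr"] * _rng.uniform(0.5, 1.5)}


best, history, spent = tuning_loop(
    toy_train, random_propose, {"lr": 0.1},
    max_runs=100, cost_per_run=1.0, budget=5.0,
)
print(format_history(history))
print(f"best={best.params} spent={spent}")
```

The loop only automates search over knobs the proposer already knows about; it would not surface something like a subtle dataset issue, which is the gap the original post points at.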
josepha.mayo @josepha_mayo

this happens to me anytime i'm working on new frameworks or trying to get an output that's not easily feasible. i put 30 mins every day into walking to think, i get all the solutions in this period and write them in a note. codex woulda wasted tokens pushing through without getting it, so imma just give it the solution to implement

i gave it a prompt - search and think deeply as long as possible, bro didn't use 2 mins and just followed current methods with patches

2h · 59 Views · 1 Like
serdarml @cs_serdar

@alex_peys It's definitely not useless, especially if the environment is set up perfectly. But it's prone to making slight mistakes it won't notice and blame other things/the experiment itself. The more "agentic" the workflow, the more likely it is for the result to be garbage.

4h · 36 Views · 1 Like

@alex_peys Can you share exactly what was the problem and what was your idea? If you can, that is.

2h · 80 Views
MrDee@SOG🫡 @sog_on_bird_app

@alex_peys Def not researcher but implementer yes

2h · 59 Views
Paras @buildwithparas

@alex_peys there's no real ceiling on that $400, you set the goal and find out what it cost when you get back from the hike

2h · 51 Views
Pulkit @puhlkit

In my mind, /goal exists to make sure the agent actually completes the task. There are many situations where models regularly fail:

1. Tell it to do 10 things. It will do some, say “I’ll do rest next”. Use /goal to get it to do all 10.
2. Tell it to do something repeatedly. It will stop without finishing. /goal makes sure it finishes. e.g.

3h · 23 Views
infrecursion @infrecursion1

@alex_peys This is a horrible way to use goal. Goal is not for generating ideas; use gpt 5.5 xhigh or, better, gpt 5.5 pro for that. Goal is for implementation. You don't even have the minimum knowledge of how these things work and yet you're making claims about their capabilities, ironic.

10m · 6 Views · 1 Like
Yash Raj @yraj__

@alex_peys I created this autoresearch skill for improving my bio-related models. It has helped me a lot and has a lot of guardrails to keep agents in check. You can try it: https://github.com/yashraj59/autoresearch-bio

59m · 6 Views · 1 Like
Alex Shev @AlexshevPm

@stanislavfort That matches my experience too. Agents are much better when the task has a feedback loop: code runs, tests pass or fail, logs explain what happened. Open-ended research still needs a human steering the taste.

2h · 13 Views
sakanade @0xsakanade

@alex_peys Maybe it’s a you thing

3h · 11 Views
M @init_malachi

@alex_peys sounds within critical range of many pseud attractors for it. sincerity and slop often entangle when projected

2h · 10 Views