Alex Peysakhovich, Sutter Hill Ventures partner, says Codex failed as an autonomous researcher, burning $400 on Modal

The agent generated generic, ineffective machine learning ideas.

Original post
alex peysakhovich @alex_peys · #1830 in Tech

i gave codex a /goal to improve an ml training pipeline i am working on while i went for a hike.

during the hike i had an idea. which i came back and (codex) implemented and it worked to bump things up a bit.

in the meantime /goal spent $400 on modal and a lot of tokens to achieve nothing. i went through the ideas it had come up with and they were decent generic ml ideas (eg try this normalization) but terrible for the thing i was working on.

so… coding assistant? very good. even jr researcher? not yet.

10:24 AM · Jun 8, 2026 · 43.5K Views
Sentiment

Some users celebrated custom autoresearch techniques for improving models, while many criticized Codex for failing basic training runs and evaluations as a research agent.

Positive: 37.5%
Negative: 62.5%
8 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 2.9K · Bookmarks 2 · Likes 21 · Retweets 2 · Replies 1
Stanislav Fort @stanislavfort

This matches my recent experience with AI agents as research assistants. Amazing at coding, sub-good-masters-student at navigating the space of ideas and updating based on experimental results.

1d · Views 2.9K · Likes 21 · Bookmarks 2

@init_malachi maybe codex needs a /touch_grass command

1d · Views 506 · Likes 5

@alex_peys But did you tell it not to make any mistakes?

1d · Views 592 · Likes 9 · Bookmarks 0
M @init_malachi

@alex_peys yay for us nature idea enjoyers

1d · Views 521 · Likes 4

@cs_serdar they are fine for some things, if i was trying to squeeze the last 5% out of my architecture by throwing known tricks at the wall and seeing what sticks this is totally doable by a coding agent.

for figuring out that one of my training datasets has a subtle issue, not so much

2d · Views 276 · Likes 1
serdarml @cs_serdar

@alex_peys I fell into the autoresearch hype today and tried to get codex to do some basic stuff like run a default training run and eval it on some pipeline. The models are simply not good enough for agentic research, they do plenty of stupid things. This is with 5.5.

2d · Views 268 · Likes 1
JJ Schultz @jjschultz

@alex_peys hmm - over a year ago I did this using a manually scripted loop + gemini (for the big context window). I defined a goal (ie decrease loss) and fed in the logs of prev runs in the context. it tweaked the hyperparams and architecture.

and it worked AMAZING!

2d · Views 130
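
The manually scripted loop described in the post above (define a goal such as lower loss, feed the logs of previous runs back in as context, let the model tweak hyperparameters and architecture) might look roughly like the sketch below. This is not the poster's script. `run_training` is a toy stand-in for a real training job, `propose_config` is a hypothetical placeholder for the LLM call (it only perturbs the best config so the example runs offline), and the config keys `lr` and `hidden` are made up for illustration.

```python
import random


def run_training(config):
    """Toy stand-in for a real training run: returns a fake loss for a config."""
    # Assumed loss surface with its minimum at lr=0.01, hidden=256.
    return (config["lr"] - 0.01) ** 2 * 1e4 + abs(config["hidden"] - 256) / 256


def format_history(history):
    """Render previous runs as plain-text log lines for the model's context."""
    return "\n".join(
        f"run {i}: config={cfg} loss={loss:.4f}"
        for i, (cfg, loss) in enumerate(history)
    )


def propose_config(history_text, best_config, rng):
    """Hypothetical placeholder for the LLM call.

    A real loop would send history_text to a model and parse its reply.
    Here we just perturb the best config seen so far.
    """
    return {
        "lr": max(1e-5, best_config["lr"] * rng.choice([0.5, 1.0, 2.0])),
        "hidden": max(16, best_config["hidden"] + rng.choice([-64, 0, 64])),
    }


def search(initial_config, n_rounds=20, seed=0):
    """Run the propose -> train -> log loop and return (best_run, history)."""
    rng = random.Random(seed)
    history = [(initial_config, run_training(initial_config))]
    for _ in range(n_rounds):
        best_config, _ = min(history, key=lambda run: run[1])
        candidate = propose_config(format_history(history), best_config, rng)
        history.append((candidate, run_training(candidate)))
    return min(history, key=lambda run: run[1]), history
```

Calling `search({"lr": 0.1, "hidden": 64})` returns the best `(config, loss)` pair plus the full run history. The best loss can never be worse than the starting one, since the initial run stays in the history. Note that a loop like this only searches the knobs it is given, which is the kind of tweak the thread describes as easy for agents.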
infrecursion @infrecursion1

@alex_peys This is a horrible way to use goal. Goal is not for generating ideas, use gpt 5.5 xhigh or better gpt 5.5 pro for that. Goal is for implementation. You don't even have the minimum knowledge of how these things work and yet you're making claims about their capabilities, ironic.

1d · Views 142 · Likes 3

so the main goal was roughly: here is a model i am training, it trains on X and evaluates on Y, investigate everything in the pipeline and figure out how to improve performance on Y. you can train small versions using the modal script below.

the biggest issue turned out to be a data related thing (as usual). the thing that codex spent all of its time on was adding layer norms in places, switching optimizer hyper parameters, seeing if it could add layers to the base model, if i let it keep going im sure it would have kept trying these tweaks. which, yea, they are useful in a particular problem like the gpt speedrun

once i said, hey i think its the following data thing, lets check it and fix it, it obviously did it correctly but at that point it wasn’t a “research” problem anymore

interesting! what were the failure modes? I wonder how much of this can be alleviated by more context/direction in the original goal itself

personally, have had massive success by letting codex create hypotheses and put them in a markdown file and set goal along with it!

you can also ask codex to use set_goal tool to set goal accordingly

1d · Views 157 · Likes 1 · Bookmarks 0
josepha.mayo @josepha_mayo

this happens to me anytime im working on new frameworks or trying to get an output that's not easily feasible. I put 30 mins every day into walking to think. i get all the solutions in this period and write them in a note. codex woulda wasted tokens pushing through without getting it, so imma just give it the solution to implement

i gave it a prompt - search and think deeply as long as possible, bro didn't use 2 mins and just followed current methods with patches

1d · Views 123 · Likes 1

@alex_peys Can you share exactly what was the problem and what was your idea? If you can, that is.

1d · Views 180
Paras @buildwithparas

@alex_peys there's no real ceiling on that $400, you set the goal and find out what it cost when you get back from the hike

1d · Views 124
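
One way to put a ceiling on that kind of spend is to have the experiment launcher check a running total before each job starts. The sketch below is a generic illustration, not a Codex or Modal feature: `SpendGuard`, `BudgetExceeded`, and `run_experiments` are hypothetical names, and the GPU hours and hourly rates are made-up inputs rather than real prices.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a planned job would push spend past the cap."""


class SpendGuard:
    """Tracks estimated spend and refuses jobs that would exceed the cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, gpu_hours, usd_per_gpu_hour):
        """Record the estimated cost of a job, or raise if it breaks the cap."""
        cost = gpu_hours * usd_per_gpu_hour
        if self.spent_usd + cost > self.cap_usd:
            raise BudgetExceeded(
                f"job costs ${cost:.2f}, only "
                f"${self.cap_usd - self.spent_usd:.2f} left"
            )
        self.spent_usd += cost
        return cost


def run_experiments(planned_runs, guard):
    """Launch runs in order and stop at the first one that breaks the budget.

    planned_runs is a list of (name, gpu_hours, usd_per_gpu_hour) tuples.
    Actually launching the job is left out of this sketch.
    """
    completed = []
    for name, gpu_hours, rate in planned_runs:
        try:
            guard.charge(gpu_hours, rate)
        except BudgetExceeded:
            break
        completed.append(name)
    return completed
```

With a $50 cap and three planned runs estimated at $20 each, the first two go ahead and the third is refused, so the loop stops at $40 instead of running until someone gets back from a hike. The estimates are only as good as the rates fed in, so a real setup would also want a hard limit on the provider side where one is available.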
serdarml @cs_serdar

@alex_peys It's definitely not useless, especially if the environment is set up perfectly. But it's prone to making slight mistakes it won't notice and blame other things/the experiment itself. The more "agentic" the workflow, the more likely it is for the result to be garbage.

2d · Views 36 · Likes 1
MrDee@SOG🫡 @sog_on_bird_app

@alex_peys Def not researcher but implementer yes

1d · Views 110
Yash Raj @yraj__

@alex_peys I created this autoresearch skill for improving my bio related models, it has helped me a lot and has a lot of guardrails for agents to keep them in check. You can try it https://github.com/yashraj59/autoresearch-bio

1d · Views 23 · Likes 1
Alex Shev @AlexshevPm

@stanislavfort That matches my experience too. Agents are much better when the task has a feedback loop: code runs, tests pass or fail, logs explain what happened. Open-ended research still needs a human steering the taste.

1d · Views 31
M @init_malachi

@alex_peys sounds within critical range of many pseud attractors for it. sincerity and slop often entangle when projected

1d · Views 27
Pulkit @puhlkit

In my mind, /goal exists to make sure the agent actually completes the task. There are many situations where models regularly fail:

1. Tell it to do 10 things. It will do some, say “I’ll do rest next”. Use /goal to get it to do all 10.
2. Tell it to do something repeatedly. It will stop without finishing. /goal makes sure it finishes. e.g.

1d · Views 23
sakanade @0xsakanade

@alex_peys Maybe it’s a you thing

1d · Views 11