Tech · 2d ago

Alex Peysakhovich, Sutter Hill Ventures partner, says Codex failed as an autonomous researcher, burning $400 on Modal

The agent generated generic, ineffective machine learning ideas.

Original post

alex peysakhovich @alex_peys

i gave codex a /goal to improve an ml training pipeline i am working on while i went for a hike.

during the hike i had an idea. which i came back and (codex) implemented and it worked to bump things up a bit.

in the meantime /goal spent $400 on modal and a lot of tokens to achieve nothing. i went through the ideas it had come up with and they were decent generic ml ideas (eg try this normalization) but terrible for the thing i was working on.

so… coding assistant? very good. even jr researcher? not yet.

10:24 AM · Jun 8, 2026 · 43.5K Views

Sentiment

Some users celebrated custom autoresearch techniques for improving models, while many criticized Codex for failing basic training runs and evaluations as a research agent.

Positive: 37.5% · Negative: 62.5% (8 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 2.9K · Bookmarks: 2 · Likes: 21 · Retweets: 2 · Replies: 1

Stanislav Fort @stanislavfort

This matches my recent experience with AI agents as research assistants. Amazing at coding, sub-good-masters-student at navigating the space of ideas and updating based on experimental results.

1d ago

alex peysakhovich @alex_peys

@init_malachi maybe codex needs a /touch_grass command

1d ago

Ravid Shwartz Ziv @ziv_ravid

@alex_peys But did you tell it not to make any mistakes?

1d ago

M @init_malachi

@alex_peys yay for us nature idea enjoyers

1d ago

alex peysakhovich @alex_peys

@cs_serdar they are fine for some things, if i was trying to squeeze the last 5% out of my architecture by throwing known tricks at the wall and seeing what sticks this is totally doable by a coding agent.

for figuring out that one of my training datasets has a subtle issue, not so much

2d ago

serdarml @cs_serdar

@alex_peys I fell into the autoresearch hype today and tried to get codex to do some basic stuff like run a default training run and eval it on some pipeline. The models are simply not good enough for agentic research, they do plenty of stupid things. This is with 5.5.

2d ago

JJ Schultz @jjschultz

@alex_peys hmm - over a year ago I did this using a manually scripted loop + gemini (for the big context window). I defined a goal (ie decrease loss) and fed in the logs of prev runs in the context. it tweaked the hyperparams and architecture.

and it worked AMAZING!
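
The loop described above (define a goal, feed in the logs of previous runs, let the model tweak hyperparameters) might be sketched roughly as follows. This is a minimal illustration, not the poster's actual script: `train` is a toy stand-in for a real training run, and `propose` is a hypothetical placeholder for the model call, which here just perturbs the best learning rate seen so far instead of querying Gemini.

```python
import random
from dataclasses import dataclass


@dataclass
class Run:
    params: dict
    loss: float


def train(params):
    """Toy stand-in for a training run: loss is minimized at lr = 0.01."""
    return (params["lr"] - 0.01) ** 2


def propose(history, rng):
    """Hypothetical placeholder for the model call.

    A real loop would send the logs in `history` as context and parse the
    reply into new settings. Here we only perturb the best run so far.
    """
    best = min(history, key=lambda run: run.loss)
    return {"lr": best.params["lr"] * rng.uniform(0.5, 1.5)}


def search(n_rounds=20, seed=0):
    """Run the propose -> train -> log loop for a fixed number of rounds."""
    rng = random.Random(seed)
    start = {"lr": 0.1}
    history = [Run(start, train(start))]
    for _ in range(n_rounds):
        params = propose(history, rng)
        history.append(Run(params, train(params)))
    return history


if __name__ == "__main__":
    runs = search()
    best = min(runs, key=lambda run: run.loss)
    print(f"best lr={best.params['lr']:.4f} loss={best.loss:.6f}")
```

The fixed round count matters here: the loop stops after `n_rounds` whether or not the loss improved, so it only ever searches the neighborhood of settings it has already tried.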

2d ago

infrecursion @infrecursion1

@alex_peys This is a horrible way to use goal. Goal is not for generating ideas, use gpt 5.5 xhigh or better gpt 5.5 pro for that. Goal is for implementation. You don't even have the minimum knowledge of how these things work and yet you're making claims about their capabilities, ironic.

1d ago

alex peysakhovich @alex_peys

so the main goal was roughly: here is a model i am training, it trains on X and evaluates on Y, investigate everything in the pipeline and figure out how to improve performance on Y. you can train small versions using the modal script below.

the biggest issue turned out to be a data related thing (as usual). the thing that codex spent all of its time on was adding layer norms in places, switching optimizer hyper parameters, seeing if it could add layers to the base model, if i let it keep going im sure it would have kept trying these tweaks. which, yea, they are useful in a particular problem like the gpt speedrun

once i said, hey i think its the following data thing, lets check it and fix it, it obviously did it correctly but at that point it wasn’t a “research” problem anymore

Vaibhav (VB) Srivastav @reach_vb

interesting! what were the failure modes? I wonder how much of this can be alleviated by more context/direction in the original goal itself

personally, have had massive success by letting codex create hypotheses and put them in a markdown file and set goal along with it!

you can also ask codex to use set_goal tool to set goal accordingly

1d ago

josepha.mayo @josepha_mayo

this happens to me anytime im working on new frameworks or trying to get an output that's not easily feasible. I put in 30 mins every day walking to think, and i get all the solutions in this period and write them in a note. codex woulda wasted tokens pushing through without getting it, so imma just give it the solution to implement

i gave it a prompt - search and think deeply as long as possible, bro didn't use 2 mins and just followed current methods with patches

1d ago

Abhirama / ಅಭಿರಾಮ @AbhiRaama22

@alex_peys Can you share exactly what was the problem and what was your idea? If you can, that is.

1d ago

Paras @buildwithparas

@alex_peys there's no real ceiling on that $400, you set the goal and find out what it cost when you get back from the hike
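
One way to put a ceiling on an unattended run is to check a spending limit before each experiment launches. The following is a minimal sketch, assuming per-experiment cost estimates are known up front. `Budget`, `BudgetExceeded`, and `run_experiments` are hypothetical names for illustration, not part of Codex or Modal.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the limit."""


class Budget:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        # Refuse the charge up front so the limit is never overshot.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"{cost_usd:.2f} would exceed the {self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost_usd


def run_experiments(estimated_costs, budget):
    """Launch experiments in order and stop at the first unaffordable one."""
    completed = 0
    for cost in estimated_costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break
        completed += 1  # a real loop would launch the job here
    return completed


if __name__ == "__main__":
    budget = Budget(limit_usd=50.0)
    done = run_experiments([20.0, 20.0, 20.0], budget)
    print(done, budget.spent_usd)
```

Charging before launch rather than after means the cap holds only as well as the estimates do; a real setup would also want a hard limit on the provider side.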

1d ago

serdarml @cs_serdar

@alex_peys It's definitely not useless, especially if the environment is set up perfectly. But it's prone to making slight mistakes it won't notice and blame other things/the experiment itself. The more "agentic" the workflow, the more likely it is for the result to be garbage.

2d ago

MrDee@SOG🫡 @sog_on_bird_app

@alex_peys Def not researcher but implementer yes

1d ago

Yash Raj @yraj__

@alex_peys I created this autoresearch skill for improving my bio related models, it has helped me a lot and has a lot of guardrails for agents to keep them in check. You can try it https://github.com/yashraj59/autoresearch-bio

1d ago

Alex Shev @AlexshevPm

@stanislavfort That matches my experience too. Agents are much better when the task has a feedback loop: code runs, tests pass or fail, logs explain what happened. Open-ended research still needs a human steering the taste.

1d ago

M @init_malachi

@alex_peys sounds within critical range of many pseud attractors for it. sincerity and slop often entangle when projected

1d ago

Pulkit @puhlkit

In my mind, /goal exists to make sure the agent actually completes the task. There are many situations where models regularly fail:

1. Tell it to do 10 things. It will do some, say “I’ll do rest next”. Use /goal to get it to do all 10.
2. Tell it to do something repeatedly. It will stop without finishing. /goal makes sure it finishes. e.g.

1d ago

sakanade @0xsakanade

@alex_peys Maybe it’s a you thing

1d ago