Tech · 4h ago

AI Shifts From Prompt Engineering To Goal And Eval Engineering

Original post

From playing around with /goal

It feels like there's less and less of a need to build any type of workflow manually (whether through code, drag and drop, or a prompt). Instead, specify the goal and let the model's intelligence figure out the underlying steps.

If the task is repeatable, then you can gather a dataset with ground-truth, and hillclimb it for increased cost / lower accuracy [sic; the author's follow-up reply corrects this to "lower cost / higher accuracy"]. To some extent this is what every non-frontier lab is optimizing for.

The world is moving from prompt engineering -> goal and eval engineering.

9:38 AM · Jun 27, 2026 · 5.6K Views
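The hill-climbing idea in the post, read with the author's later correction (lower cost, higher accuracy), might look roughly like the sketch below. Everything here is an illustrative assumption rather than anything from the post: the integer `effort` setting, the toy `answer` stub standing in for a model call, and the `COST_WEIGHT` trade-off are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical ground-truth dataset: (input, expected output) pairs.
DATASET: List[Tuple[int, int]] = [(x, 2 * x) for x in range(1, 7)]
COST_WEIGHT = 0.05  # assumed penalty per unit of effort


def answer(effort: int, x: int) -> int:
    """Toy stand-in for a model call: more effort solves harder inputs."""
    return 2 * x if x <= 2 * effort else -1


def score(effort: int) -> float:
    """Accuracy on the dataset minus a cost penalty (higher is better)."""
    hits = sum(answer(effort, x) == y for x, y in DATASET)
    return hits / len(DATASET) - COST_WEIGHT * effort


def neighbors(effort: int) -> Iterable[int]:
    """Adjacent settings within the allowed range 0..5."""
    return [e for e in (effort - 1, effort + 1) if 0 <= e <= 5]


def hillclimb(start: int,
              neighbors: Callable[[int], Iterable[int]],
              score: Callable[[int], float],
              max_steps: int = 50) -> Tuple[int, float]:
    """Move to the best-scoring neighbor until no neighbor improves."""
    current, best = start, score(start)
    for _ in range(max_steps):
        improved = False
        for cand in neighbors(current):
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
        if not improved:
            break
    return current, best


best_effort, best_score = hillclimb(0, neighbors, score)
```

In this toy setup the climb stops at the lowest effort that still gets every example right, since any further effort only adds cost. A real eval would replace `answer` with actual model calls and would likely need a noisier, non-greedy search.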

Sentiment

Supportive commenters praise the shift to goal and eval engineering for streamlining workflows when goals are clear and verifiable, while critical commenters cite high token costs and the difficulty of debugging failures.

Positive: 60.0%

Negative: 40.0%

5 comments with sentiment.

Posts from X

Most Activity

Views: 518 · Likes: 3

Jerry Liu@jerryjliu0

*i reread my post and i'm stupid 🤦

"hillclimb it for *LOWER cost / *HIGHER accuracy" smh

1h51830

Veda Z@pureverara

But it needs a task to be measurable and have real ground truth for its outcome. For a lot of other creative tasks like writing or creating image media, a lot of times there are no objective measures.

If you just let the AI run and loop, it will sometimes go down a rabbit hole and consume a lot of tokens, while the end outcome is still far from what you wanted.

4h901

Anders Lie@anderslie

@jerryjliu0 My problems with it so far stem more from the fact that you usually want pretty well defined completion or optimization criteria, which is hard to enforce. It's great for "I don't want to have to babysit this agent now" but provides no strong guarantees yet

3h124

Gene 𝕏-er 🇺🇸🇷🇴@EuPatrunjel

@jerryjliu0 Does the end ( “goal”) justify the means? 😄

4h106

ilemi@andrewhong5297

@jerryjliu0 feel like most teams are still leaning more into workflows since the /goal token usage is very high

have woken up to 5 hour sessions with all my usage gone lol

codex does a good job of pausing the goal if it really gets blocked though, which is nice

2h92

iandanforth 🦋 @iandanforth.bsky.social@iandanforth

@jerryjliu0 Token efficiency is one good remaining reason

3h76

Jonny Gravity@jonnygravity

The pay-off of /goal is simply that it orchestrates the execute->verify->continue loop for you.

Where we come in is ensuring that the /goal converts non-deterministic execution into deterministic results. The better your /goal design, the more true that becomes.

LLMs can achieve this on their own when verification is straightforward, but verification is not always straightforward.

3h74
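The execute -> verify -> continue loop described in this reply can be sketched as a small driver with a token budget, which also speaks to the token-cost worries raised elsewhere in the thread. This is only a sketch under stated assumptions: `run_goal`, `make_stub_agent`, and `verify` are hypothetical names, and the stub agent stands in for a real model call; nothing here reflects how /goal is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopResult:
    done: bool          # True if the verifier accepted the output
    attempts: int       # number of execute calls made
    tokens_used: int    # total cost reported by the executor
    output: str         # last output produced


def run_goal(execute: Callable[[str], Tuple[str, int]],
             verify: Callable[[str], Tuple[bool, str]],
             max_tokens: int,
             max_attempts: int = 10) -> LoopResult:
    """Repeat execute -> verify until verified or a budget runs out."""
    feedback, tokens, output = "", 0, ""
    for attempt in range(1, max_attempts + 1):
        output, cost = execute(feedback)   # one agent step
        tokens += cost
        ok, feedback = verify(output)      # explicit check, not self-report
        if ok:
            return LoopResult(True, attempt, tokens, output)
        if tokens >= max_tokens:           # stop instead of burning usage
            return LoopResult(False, attempt, tokens, output)
    return LoopResult(False, max_attempts, tokens, output)


def make_stub_agent(succeed_on: int, cost_per_call: int = 100):
    """Hypothetical agent that only produces a passing result on call N."""
    calls = 0

    def execute(feedback: str) -> Tuple[str, int]:
        nonlocal calls
        calls += 1
        text = "tests pass" if calls >= succeed_on else f"draft {calls}"
        return text, cost_per_call

    return execute


def verify(output: str) -> Tuple[bool, str]:
    """Toy verifier: accept only the exact passing message."""
    ok = output == "tests pass"
    return ok, ("" if ok else "tests still failing")


within_budget = run_goal(make_stub_agent(3), verify, max_tokens=1000)
over_budget = run_goal(make_stub_agent(5), verify, max_tokens=200)
```

With these stubs, `within_budget` finishes on the third attempt, while `over_budget` stops unfinished after two attempts once the 200-token budget is spent. The design point is that the verifier and the budget sit outside the agent, so a run that cannot be verified ends with a bounded cost.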

Strata@ChainZenit

@jerryjliu0 how are you handling the error cases when it misses the mark?

4h47

Phi Browser@phibrowser

@jerryjliu0 The hand-built workflow only ever covered the paths someone imagined. A goal lets me route around the part that broke instead of waiting for a human. Where it gets hard is the context nobody wrote down: the workflow encoded that implicitly, the goal just assumes I can infer it.

4h44

Michał Piszczek@cdiamond

@jerryjliu0 goal + eval engineering is clean when ground truth exists. the messy half of prod is the non-repeatable tasks with no eval set, just an owner holding the bag

2h34

Oleg kAI@oleg_kai

@jerryjliu0 the failure mode just moved. workflows had explicit nodes you could debug. goal-only buries orchestration inside the model, and teams don't have muscle for opaque traces yet

3h26

Gouda@Gouda_of_Alex

@jerryjliu0 Yea some call it loop engineering, it's perfect for ML training since it has a clear verifiable goal.

2h25

Amiralek@theamiralek

@jerryjliu0 And I thought I was operating on frontier learning how to design/create agentic workflow automations

Oh boy

2h19

Jordan Hochenbaum@Jnatanh

Definitely. I would recommend wrapping /goal in a workflow ("ultra code") in Claude Code as a power move. It can add better determinism to the workflow, the ability to control how it fans out and parallelizes to subagents, and, probably the most important part, the verification subloop.

We've had a lot of success applying this recently to large-scale migrations (e.g. converting a bunch of legacy frontend code from Ember to React), performance optimization for things like CI, and writing and maintaining docs (and doc evals) for our codebase and harness. The docs work has basically cut the time it takes for agents to find the right information by 50% (median) and improved their ability to find the right thing by 5% on average (up to 20%). Need the team to publish some of this...

I find the model itself is actually very good at writing the workflow, and using the first couple of rounds to self-improve...

1h14
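The pattern this reply describes, a workflow that fans tasks out to subagents and gives each one a verification subloop, could be sketched as follows. This is an assumption-laden illustration, not the commenter's setup: `stub_worker`, `verify`, and the `.hbs` to `.tsx` renaming are hypothetical stand-ins for a real subagent and a real migration check.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Worker = Callable[[str, Optional[str]], str]
Verifier = Callable[[str, str], Optional[str]]


def run_with_verification(task: str, worker: Worker, verify: Verifier,
                          max_rounds: int = 3) -> Tuple[str, str, bool]:
    """Verification subloop: retry with feedback until verified or out of rounds."""
    feedback: Optional[str] = None
    result = ""
    for _ in range(max_rounds):
        result = worker(task, feedback)
        feedback = verify(task, result)    # None means the result passed
        if feedback is None:
            return task, result, True
    return task, result, False


def fan_out(tasks: List[str], worker: Worker, verify: Verifier,
            max_workers: int = 4) -> List[Tuple[str, str, bool]]:
    """Run each task's subloop in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_verification, t, worker, verify)
                   for t in tasks]
        return [f.result() for f in futures]


def stub_worker(task: str, feedback: Optional[str]) -> str:
    """Hypothetical subagent: fails its first try on files starting with 'b'."""
    if task.startswith("b") and feedback is None:
        return task                        # unchanged output, will fail the check
    return task.replace(".hbs", ".tsx")


def verify(task: str, result: str) -> Optional[str]:
    """Toy check: the migrated file must carry the new extension."""
    return None if result.endswith(".tsx") else "still has the old extension"


results = fan_out(["a.hbs", "b.hbs"], stub_worker, verify)
```

Here the workflow, not the model, decides the degree of parallelism and how many verification rounds each task gets, which is one way to read the reply's point about adding determinism around a goal-driven agent.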

Jason@jasoki

@jerryjliu0 How well does it work for you? I worry it's gonna waste too much trying to search for the right context when I could've saved AI time and tokens just saying where things are

2h13

Youssef El Manssouri@yoemsri

@jerryjliu0 The workflow layer might slowly dissolve into objective plus feedback.

2h9

Valentyn Kit 🦀 | Rust · Solana@valentynkit

@jerryjliu0 this works fine for happy paths, but debugging a failure when you didn't actually define the steps is a nightmare.

2h5

hailports@hailports

@jerryjliu0 that makes sense! it really streamlines the process. have you tried breaking down complex tasks into smaller goals yet?

1h1

AiDevCraft@AiDevCraft

The catch is observability — when /goal infers the trajectory per-run, the regression surface shifts from step sequences to goal completion, which is great when goals are checkable. Harder when a run goes off-rails mid-task, because there's no saved plan to diff against next time.

2h1
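One possible response to the observability concern in this reply is to record each run's trajectory and diff it against a saved baseline. The sketch below assumes a hypothetical trace format (one `tool: summary` string per step) and hypothetical helper names; it uses only the standard-library `difflib` module.

```python
import difflib
from typing import List


def record_step(trace: List[str], tool: str, summary: str) -> None:
    """Append one step of the run's trajectory in a stable text form."""
    trace.append(f"{tool}: {summary}")


def diff_traces(baseline: List[str], current: List[str]) -> List[str]:
    """Return a unified diff between two trajectories (empty if identical)."""
    return list(difflib.unified_diff(
        baseline, current,
        fromfile="baseline", tofile="current", lineterm=""))


# Hypothetical trajectories from two runs of the same goal.
baseline_run: List[str] = []
record_step(baseline_run, "read", "config")
record_step(baseline_run, "edit", "app.py")
record_step(baseline_run, "test", "pass")

current_run: List[str] = []
record_step(current_run, "read", "config")
record_step(current_run, "search", "repo")
record_step(current_run, "edit", "app.py")
record_step(current_run, "test", "fail")

changes = diff_traces(baseline_run, current_run)
```

The diff surfaces the extra `search` step and the changed test outcome, giving a starting point for debugging even though no plan was written down in advance. Real agent traces are far noisier, so some normalization of step summaries would probably be needed before diffs like this become useful.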