Greg Kamradt, President of the ARC Prize Foundation, posts a seven-level framework ranking domains by verification difficulty for AI systems from math and code to culture and governance · Digg

Tech · 39d ago

Daniel Jeffries quote-tweeted that verification difficulty constrains AI progress more than alignment.

Original post

Greg Kamradt@GregKamradt · #1219 in Tech

"Code and math are taking off because they are easy to verify, the next frontier is domains that are hard to verify"

This got me thinking - what does the spectrum of "easy to verify" look like?

This is loosely aligned w/ @DarioAmodei's "intelligence bottlenecked" domains.

My take of easy > hard:

- Level 1: Instant, objective verification

Math, code, formal proofs, chess tactics, parsing

AI improvement is easiest here because the loop is tight

- Level 2: Fast but incomplete verification

Software engineering, UI implementation, data analysis, security bug finding

You can test a lot, but not everything. “It passes tests” is not the same as “it is good”

- Level 3: Human-evaluable creative work

Copywriting, design, video thumbnails, sales emails, landing pages

Verification is possible through humans or markets, but noisy. AI can improve by predicting human reaction, but taste shifts and metrics can be gamed

There is no "right" answer, only feedback from humans

- Level 4: Market-verifiable work

Startups, investing, product strategy, hiring, pricing, distribution

Reality gives feedback, but slowly and with tons of confounders

- Level 5: Experimentally verifiable science

Materials, biology, chemistry, medicine, robotics

There is ground truth (physics), but experiments cost time and money. AI helps most when it can propose better candidates and reduce search space

- Level 6: Institutionally verifiable systems

Education systems (Alpha school), legal systems, city planning, corporate management systems

You can measure outcomes, but the feedback cycle is long, and the counterfactual is hard

- Level 7: Civilization-scale verification

Democracy variants, alternative governance, monetary systems, cultural norms, geopolitical strategy

Verification is slow, morally loaded, noisy, and often impossible to isolate. You may never get a clean answer, only accumulated historical evidence
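
For readers who want to work with the ladder above programmatically, here is a minimal Python sketch that encodes the seven levels as an ordered enum. The level names and example domains come from the post; the identifiers (`VerificationLevel`, `EXAMPLES`, `level_of`) and the lookup-table layout are assumptions made for illustration, not anything the author published.

```python
from enum import IntEnum
from typing import Optional


class VerificationLevel(IntEnum):
    """The seven levels from the post; a higher value means harder to verify."""
    INSTANT_OBJECTIVE = 1
    FAST_BUT_INCOMPLETE = 2
    HUMAN_EVALUABLE = 3
    MARKET_VERIFIABLE = 4
    EXPERIMENTAL_SCIENCE = 5
    INSTITUTIONAL = 6
    CIVILIZATION_SCALE = 7


# Example domains copied from the post (lowercased).
# The dictionary layout itself is a hypothetical choice.
EXAMPLES = {
    VerificationLevel.INSTANT_OBJECTIVE: [
        "math", "code", "formal proofs", "chess tactics", "parsing"],
    VerificationLevel.FAST_BUT_INCOMPLETE: [
        "software engineering", "ui implementation", "data analysis",
        "security bug finding"],
    VerificationLevel.HUMAN_EVALUABLE: [
        "copywriting", "design", "video thumbnails", "sales emails",
        "landing pages"],
    VerificationLevel.MARKET_VERIFIABLE: [
        "startups", "investing", "product strategy", "hiring", "pricing",
        "distribution"],
    VerificationLevel.EXPERIMENTAL_SCIENCE: [
        "materials", "biology", "chemistry", "medicine", "robotics"],
    VerificationLevel.INSTITUTIONAL: [
        "education systems", "legal systems", "city planning",
        "corporate management systems"],
    VerificationLevel.CIVILIZATION_SCALE: [
        "democracy variants", "alternative governance", "monetary systems",
        "cultural norms", "geopolitical strategy"],
}


def level_of(domain: str) -> Optional[VerificationLevel]:
    """Return the level whose example list contains `domain`, or None if unlisted."""
    key = domain.strip().lower()
    for level, domains in EXAMPLES.items():
        if key in domains:
            return level
    return None
```

Because `IntEnum` members compare as integers, `level_of("pricing") > level_of("code")` evaluates to `True`, which matches the post's easy-to-hard ordering. A domain the post does not list returns `None` rather than a guessed level.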

7:00 AM · May 21, 2026 · 11K Views

Sentiment

Users praise the framework ranking AI progress by verification difficulty because it highlights verifiability as a key multifaceted issue rather than a single axis.

Positive 100.0% · Negative 0.0% (2 comments with sentiment)

Posts from X

Most activity: 2.6K views · 5 bookmarks · 21 likes · 2 replies

Daniel Jeffries@Dan_Jeffries1

The alignment problem is an illusion hallucinated by "rationalists" huffing too much glue while watching 2001: A Space Odyssey.

The real problem is and always was "the verification problem."

Greg Kamradt@GregKamradt

39d

Retweets: 1

Sahil Bhaiwala@Sbhaiwala03

Good framework for the hierarchy of rewards from instant and entirely digital to those that require time and engagement with the real world

Greg Kamradt@GregKamradt

39d

Sahil Bhaiwala@Sbhaiwala03

I think the key point, regardless of whether the reward is instant + digital or temporal + real-world, is the degree of verifiability.

Where rewards are not completely and objectively verifiable (which is most things), there is challenging cognitive work in codifying intelligence and taste with enough nuance that a grader doesn't penalize valid outputs merely because they differ from the grader's own view (e.g., in creative writing).

It gets super interesting with real-world rewards because there can be so many confounding factors. Experiment design to create rewards for AI will become a real science, and/or real-world companies will be designed around effectively producing RL data.

39d

Greg Kamradt@GregKamradt

@Sbhaiwala03 Then the real question is…how are humans able to do the “hard to verify” tasks so well

And what makes some humans better than others at it

And what needs to happen to replicate that in AI

39d

Agentic Glacius@temhandev

@GregKamradt @DarioAmodei nice breakdown. i'd say verifiability is two axes, not one. how completely you can verify, and how much it costs to. race conditions are fully verifiable in theory but nobody runs the full check every commit, so in practice they sit at level 3 even though they look level 1.

39d
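
The two-axis idea in @temhandev's reply above (how completely you can verify, and how much it costs to) can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than something stated in the thread: the `Verifier` fields, the 0.0 to 1.0 completeness scale, the cost units, and the rule that a check runs only as often as a per-commit budget allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verifier:
    """A check described on two axes (both scales are hypothetical)."""
    name: str
    completeness: float  # share of relevant failures the check can catch, 0.0 to 1.0
    cost_per_run: float  # arbitrary cost units for one full run


def coverage_in_practice(v: Verifier, budget_per_commit: float) -> float:
    """Scale theoretical completeness by how often the check is affordable.

    Assumption: with a budget below the cost of one run, the check runs on
    only a fraction of commits, so effective coverage drops proportionally.
    """
    if v.cost_per_run <= 0:
        raise ValueError("cost_per_run must be positive")
    run_fraction = min(1.0, budget_per_commit / v.cost_per_run)
    return v.completeness * run_fraction


if __name__ == "__main__":
    # Made-up numbers, chosen only to mirror the race-condition example.
    unit_tests = Verifier("unit tests", completeness=0.6, cost_per_run=1.0)
    race_check = Verifier("exhaustive race check", completeness=1.0, cost_per_run=50.0)
    for v in (unit_tests, race_check):
        print(v.name, coverage_in_practice(v, budget_per_commit=1.0))
```

Under these made-up numbers the exhaustive race check is complete in theory but scores far below the unit tests in practice, which is the reply's point about checks that look like Level 1 and behave like Level 3.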

Legacy Guy@heylegacyguy

@GregKamradt @DarioAmodei Kitchen renovations are near the easy side of the spectrum.

39d

Owol Destiny@owoldestiny

@GregKamradt @DarioAmodei So basically we're all about to get really good at bullshitting in fuzzy domains.

39d

R4@R4mboX87

@Dan_Jeffries1 ☝️

39d

Alfredo González-Espinoza@Spiralizing

@GregKamradt @DarioAmodei there is no "easy to verify"; it is more about standardization and formalization. We need new theories as well.

39d

Greg Kamradt@GregKamradt

@Spiralizing @DarioAmodei Math and code aren’t easy to verify? At least more so than writing a novel?

39d

Alfredo González-Espinoza@Spiralizing

@GregKamradt @DarioAmodei the "easy to verify" is not what you should focus on; it is the standardization and formalization of processes, that's what I meant.

39d

Alfredo González-Espinoza@Spiralizing

@GregKamradt @DarioAmodei you need to do a lot of work! lol, there is no free lunch, now we have tools that can help us find shortcuts though.

39d

Greg Kamradt@GregKamradt

@Spiralizing @DarioAmodei How do you standardize and formalize the process of writing a novel? Or creating a startup idea?

39d

Christian Catalini@ccatalini

@GregKamradt @DarioAmodei Verification is the only bottleneck that matters! https://arxiv.org/html/2602.20946v2

39d

Greg Kamradt@GregKamradt

@owoldestiny @DarioAmodei Weren’t we already?

39d

Random Dev@JrDev97715

@GregKamradt @DarioAmodei I would put experimentally verifiable science at 3 since there’s ground truth and thus progress can compound.

For creative and market work, it seems like there’s an asymptote on verifiability or perhaps many local optima without a clear global optimum.

39d

James Kovalenko@deburdened

The ladder mixes two things: what type of claim the verifier is checking, and what assumptions it requires.

Level 1 verifiers check claims of one type and require one base theory.

Levels 5 through 7 verifiers check claims of multiple types and require contested base theories.

A successful experiment in materials science is a different claim type than a working education system. Treating them as points on one spectrum hides the type difference.

The frontier is typing the claim before measuring the gap.

39d

Greg Kamradt@GregKamradt

@temhandev @DarioAmodei How do you put a price on verifying a novel?

39d

Buddy@dyorbuddy

@GregKamradt @DarioAmodei Software is funny because it pretends to be Level 1 until the pull request hits a real codebase.

Tests verify behavior.

They don’t verify whether the next engineer will hate you in 6 months.

39d

Primordial Code@primordial_code

This is the verification-gradient problem.

AI scales fastest where feedback is tight:

math, code, formal proofs, tests, parsing

But as domains move into markets, institutions, medicine, education, governance, and civilization-scale systems, verification becomes slower, noisier, and morally loaded.

That is where the pressure changes.

The risk is treating fluent output as if it has passed the same kind of verification as code or math.

It has not.

Hard-verification domains need stronger provenance, uncertainty preservation, domain expertise, audit trails, pause points, and repair paths.

“Easy to test” is not the same as safe.

“Passes tests” is not the same as good.

The harder the feedback loop, the more accountability has to stay attached.

39d
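
The gap between "passes tests" and "good" that the thread keeps returning to (Level 2 in the original post) can be shown with a small Python sketch. The function names and the four-case test suite are hypothetical, chosen so that a deliberately incomplete implementation still passes every check it is given.

```python
def is_leap_year_naive(year: int) -> bool:
    """Deliberately incomplete: ignores the Gregorian century rules."""
    return year % 4 == 0


def is_leap_year_gregorian(year: int) -> bool:
    """Reference rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def passes_test_suite(fn) -> bool:
    """A small, hypothetical test suite that happens to miss the century cases."""
    cases = {2024: True, 2023: False, 2000: True, 1996: True}
    return all(fn(year) == expected for year, expected in cases.items())


if __name__ == "__main__":
    # Both implementations pass the suite, so the suite cannot tell them apart.
    print(passes_test_suite(is_leap_year_naive))
    print(passes_test_suite(is_leap_year_gregorian))
    # They disagree on 1900, which the suite never checks.
    print(is_leap_year_naive(1900), is_leap_year_gregorian(1900))
```

The suite gives fast feedback, but only on the cases someone thought to write down. The naive version passes and is still wrong for 1900, which is the sense in which verification here is fast but incomplete.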