Researchers introduce ProgramBench benchmark for binary-to-repository reconstruction

1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch. 2) Top accuracy is 0% 3) We will be making the benchmark harder.

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views

3:01 PM · May 5, 2026 · 121.7K Views

REPLY

@ChrSzegedy We have a mode in there that requires the agent to implement the program in a *different* lang than it was in originally. And we can easily develop a private version of this for closed-source programs to weed out any memorization issues.

Christian Szegedy@ChrSzegedy

Great new benchmark for LLM memorization 🤣

4:05 AM · May 6, 2026 · 12.8K Views

1:02 AM · May 7, 2026 · 123 Views

QUOTE POST

Apparently everyone is post-training with a lot of Python- we show in the paper that models substantially prefer reimplementing programs in Python, even though Python is the actual src lang in 0 of our tasks.

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views

5:01 PM · May 6, 2026 · 6.1K Views

REPLY

@yoavgo ya def something we've thought of. timing tests are super hard to do well though, it took a lot of effort in SWE-fficiency and the infra here is much more complex.

(((ل()(ل() 'yoav))))👾@yoavgo

@OfirPress add wall-clock runtime checks to the tests ;)

6:49 PM · May 6, 2026 · 219 Views

7:17 PM · May 6, 2026 · 99 Views

QUOTE POST

@yoavgo

Ofir Press@OfirPress

Coding models are already much faster and I'm pretty sure more capable than most humans. Benchmarks have to pave the way towards the next frontier- if we stuck to human level benchmarks we would not further improve. So now we're entering the stage of super-human benchmarking.

11:00 AM · May 6, 2026 · 8K Views

12:57 AM · May 7, 2026 · 129 Views

QUOTE POST

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views
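
To make the arithmetic in the post above concrete, here is a minimal sketch of how an average pass rate (APR) can look healthy while the strict "resolved" metric stays at 0%. The task names, test outcomes, and the 0.9 "almost solved" threshold are all made up for illustration; the benchmark's actual definitions may differ.

```python
from typing import Dict, List

# Hypothetical per-task test outcomes (True = test passed).
# task_a mimics "implement 10% of the program, pass the easy 60% of tests".
results: Dict[str, List[bool]] = {
    "task_a": [True] * 6 + [False] * 4,
    "task_b": [True] * 19 + [False] * 1,
}


def pass_fraction(outcomes: List[bool]) -> float:
    """Fraction of tests passed within a single task."""
    return sum(outcomes) / len(outcomes)


def average_pass_rate(res: Dict[str, List[bool]]) -> float:
    """Mean of the per-task pass fractions."""
    return sum(pass_fraction(o) for o in res.values()) / len(res)


def solved_rate(res: Dict[str, List[bool]], threshold: float = 1.0) -> float:
    """Fraction of tasks whose pass fraction reaches the threshold.

    threshold=1.0 is the strict "resolved" metric (every test passes).
    A lower threshold gives an "almost solved" style metric; the value
    0.9 used below is an assumption, not the benchmark's cutoff.
    """
    hits = sum(1 for o in res.values() if pass_fraction(o) >= threshold)
    return hits / len(res)


apr = average_pass_rate(results)
resolved = solved_rate(results)          # strict: all tests must pass
almost = solved_rate(results, 0.9)       # assumed "almost solved" cutoff

print(f"APR={apr:.3f} resolved={resolved:.2f} almost_solved={almost:.2f}")
```

In this toy setup the APR is well above 70% even though neither task is fully resolved, which is the gap the post is pointing at.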

REPLY

@teortaxesTex no matter what metric you pick there's gonna be some amount of information loss. but I think going with "50% pass" here would be much much more misleading. the end result here is that the agents didn't manage to fully replicate any binaries...

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@OfirPress "0% resolved" is still more misleading

8:11 AM · May 6, 2026 · 407 Views

10:46 AM · May 6, 2026 · 276 Views

REPLY

@teortaxesTex that's just for our main metric. we have a lot of analysis in the paper, including this plot, showing the partial solve rates. it's clear that agents are on the cusp of being able to do well on this task, and that the current trajectory is pretty positive.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@OfirPress The interesting metric for me is how they move closer to completion. If you regard those tests as stripes on a non-sequential download progress bar, then it becomes pretty obvious which way the wind blows.

10:52 AM · May 6, 2026 · 116 Views

10:58 AM · May 6, 2026 · 151 Views

REPLY

Yes. We've also tried really hard to manually / agentically go through the test sets to find "impossible tests" and couldn't find any. In the future we can remove any that are found too.

In addition, we have the 'almost solved' metric so even if a few tests are bad we can still hillclimb on that metric.

(((ل()(ل() 'yoav))))👾@yoavgo

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

8:35 PM · May 5, 2026 · 2.8K Views

12:55 AM · May 7, 2026 · 102 Views
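
The fuzzing idea in the quoted post can be sketched as a small differential-testing loop: generate random inputs, run both the reference and the reimplementation, and collect the inputs where they disagree. This is only an illustration under stated assumptions. The toy reference and candidate functions, the input generator, and the `run_binary` helper are hypothetical and are not the benchmark's actual harness.

```python
import random
import string
import subprocess
from typing import Callable, List


def run_binary(path: str, data: str) -> str:
    """Hypothetical helper: feed `data` on stdin to an executable and
    return its stdout. Not called below; shown for the real-binary case."""
    proc = subprocess.run([path], input=data, capture_output=True, text=True)
    return proc.stdout


def random_input(rng: random.Random, max_len: int = 12) -> str:
    """Random string of letters, digits, and spaces (an assumed input format)."""
    alphabet = string.ascii_letters + string.digits + " "
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def differential_fuzz(
    reference: Callable[[str], str],
    candidate: Callable[[str], str],
    trials: int = 200,
    seed: int = 0,
) -> List[str]:
    """Return the generated inputs on which the two programs disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        data = random_input(rng)
        if reference(data) != candidate(data):
            mismatches.append(data)
    return mismatches


# Toy stand-ins: the "reference binary" upper-cases its input, and the
# buggy reimplementation forgets to do so for inputs of 8+ characters.
def toy_reference(text: str) -> str:
    return text.upper()


def buggy_candidate(text: str) -> str:
    return text.upper() if len(text) < 8 else text


def fixed_candidate(text: str) -> str:
    return text.upper()


bad = differential_fuzz(toy_reference, buggy_candidate)
good = differential_fuzz(toy_reference, fixed_candidate)
print(f"buggy mismatches: {len(bad)}, fixed mismatches: {len(good)}")
```

With a real executable, `reference` could be something like `lambda s: run_binary("./reference_binary", s)`, where the path is a placeholder. Passing self-generated tests only gives probabilistic confidence, which matches the point in the post about the non-adversarial setting.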

REPLY

@OfirPress are the harder tests hard to pass, or hard to discover?

Ofir Press@OfirPress

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views

8:52 PM · May 5, 2026 · 401 Views

REPLY

@OfirPress add wall-clock runtime checks to the tests ;)

Ofir Press@OfirPress

Apparently everyone is post-training with a lot of Python- we show in the paper that models substantially prefer reimplementing programs in Python, even though Python is the actual src lang in 0 of our tasks.

5:01 PM · May 6, 2026 · 6.1K Views

6:49 PM · May 6, 2026 · 219 Views

POST

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views

REPLY

i would argue that the task is beyond very hard, it is actually impossible, for the same reason concept learning from only positive examples and queries is impossible.

(((ل()(ل() 'yoav))))👾@yoavgo

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views

5:02 PM · May 5, 2026 · 1.3K Views

QUOTE POST

(((ل()(ل() 'yoav))))👾@yoavgo

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

8:35 PM · May 5, 2026 · 2.8K Views

8:39 PM · May 5, 2026 · 938 Views

QUOTE POST

Delip Rao e/σ@DELIPRAO

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

(((ل()(ل() 'yoav))))👾@yoavgo

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views

8:35 PM · May 5, 2026 · 2.8K Views

REPLY

@OfirPress I agree, but in that case, why use the APR to create a rank list? The ranking, by itself, implies APR is the primary metric.

Ofir Press@OfirPress

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views

1:17 AM · May 6, 2026 · 340 Views

QUOTE POST

@deedydas called it. in 1998:

Gary Marcus@GaryMarcus

Some things never change. If you don’t understand this one, you don’t understand what’s happening in AI. Marcus, 1998: neural nets have trouble generalizing far beyond the data. Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data. Apple, 2025: neural nets have trouble generalizing far beyond the data. Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

6:00 PM · May 5, 2026 · 290.7K Views

6:08 PM · May 5, 2026 · 4.5K Views

QUOTE POST

@deedydas Seen this move so much I made a name for it: the AI bait and switch:

2:39 PM · May 6, 2026 · 180 Views

QUOTE POST

The last sentence in this abstract is really important, in a way that professional programmers will immediately recognize: the models favored big single files rather than breaking things into modules.

That means that the code these systems write is going to be really hard to maintain.

AI code might get written quickly, but especially in new, complex projects, fixing it will be hell.

Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

2:41 PM · May 6, 2026 · 23.4K Views

QUOTE POST

The last sentence in this abstract is really important: models favor big single files rather than breaking things into modules.

Any good programmer will immediately appreciate that the code these systems write is going to be really hard to maintain.

AI code might get written quickly, but especially in new, complex projects, fixing it will be hell.

Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

2:38 PM · May 6, 2026 · 774 Views

QUOTE POST

Some things never change. If you don’t understand this one, you don’t understand what’s happening in AI.

Marcus, 1998: neural nets have trouble generalizing far beyond the data.

Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data.

Apple, 2025: neural nets have trouble generalizing far beyond the data.

Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

6:00 PM · May 5, 2026 · 290.7K Views

REPLY

Alex Dimakis@ALEXGDIMAKIS

@ChrSzegedy honestly this cheap shot is beneath you, Christian

and the truth is that many many people have moved to my side re the critical importance of distribution shift

Christian Szegedy@ChrSzegedy

Some things never change: Gary just doesn't fit in distribution.

9:48 AM · May 6, 2026 · 13.6K Views

2:12 PM · May 6, 2026 · 2.6K Views

QUOTE POST

Very cool work. Asking agents to build big coding projects from scratch is a great way to create long-horizon tasks.

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views

1:26 AM · May 21, 2026 · 1.4K Views

QUOTE POST

Gabriel Synnaeve@SYHW

"Coding is [0%] solved".

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views

3:08 PM · May 5, 2026 · 172.5K Views

QUOTE POST

François Fleuret@FRANCOISFLEURET

Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

7:28 AM · May 6, 2026 · 20.3K Views

POST

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on.

ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?

We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

REPLY

Source: https://programbench.com/

Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

3:23 PM · May 5, 2026 · 34.6K Views

REPLY

A lot of critique around "how is memorizing ffmpeg software engineering?".

Well, every benchmark can be overfit to and memorized. You can memorize all the bugs in SWE-Bench too. ARC AGI might solve this by having a hidden set of games you can't look at. Getting 100% on ProgramBench does not mean we've achieved AGI.

However, in practice, most good models will regress in other obvious ways if they try to brute force memorize these programs. In practice, this is not how frontier models are built. We can also trivially test for memorization by comparing it to the source implementation.

The bet here is: a bottoms-up implementation of a real-world tool is a very long horizon high utility task. If models can reason through building them, it probably generalizes to many more such tasks.

Deedy@deedydas

Source: https://programbench.com/

3:23 PM · May 5, 2026 · 34.6K Views

5:32 PM · May 5, 2026 · 30.9K Views

REPLY

The other critique which is more baffling to me is "well, humans can't do this."

So? Humans can't do a lot of things LLMs can do today. The goal of benchmarks is to hillclimb on intelligence far above the average human.

Deedy@deedydas

A lot of critique around "how is memorizing ffmpeg software engineering?". Well, every benchmark can be overfit to and memorized. You can memorize all the bugs in SWE-Bench too. ARC AGI might solve this by having a hidden set of games you can't look at. Getting 100% on ProgramBench does not mean we've achieved AGI. However, in practice, most good models will regress in other obvious ways if they try to brute force memorize these programs. In practice, this is not how frontier models are built. We can also trivially test for memorization by comparing it to the source implementation. The bet here is: a bottoms-up implementation of a real-world tool is a very long horizon high utility task. If models can reason through building them, it probably generalizes to many more such tasks.

5:32 PM · May 5, 2026 · 30.9K Views

5:33 PM · May 5, 2026 · 28.4K Views

REPLY