15d ago

Researchers introduce ProgramBench benchmark for binary-to-repository reconstruction

Researchers behind SWE-Bench launched ProgramBench, a benchmark with 200 tasks challenging large language models to generate complete software repositories from executable binaries alone. In a cleanroom with no starter code, documentation, or internet, models target real-world projects like FFmpeg, SQLite, ripgrep, and PHP compilers. All evaluated LLMs, including frontier models, scored 0%.

0
Original post

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

7:56 AM · May 5, 2026 View on X
Reposted by

Great new benchmark for LLM memoization 🤣

DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
4:05 AM · May 6, 2026 · 12.8K Views

Some things never change: Gary just doesn't fit in distribution.

Gary MarcusGary Marcus@GaryMarcus

Some things never change. If you don’t understand this one, you don’t understand what’s happening AI. Marcus, 1998: neural nets have trouble generalizing far beyond the data. Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data. Apple, 2025: neural nets have trouble generalizing far beyond the data. Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

6:00 PM · May 5, 2026 · 290.7K Views
9:48 AM · May 6, 2026 · 13.6K Views

1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch. 2) Top accuracy is 0% 3) We will be making the benchmark harder.

John YangJohn Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views
3:01 PM · May 5, 2026 · 121.7K Views

@ChrSzegedy We have a mode in there that requires the agent to implement the program in a *different* lang than it was in originally. And we can easily develop a private version of this for closed-source programs to weed out any memorization issues.

Christian SzegedyChristian Szegedy@ChrSzegedy

Great new benchmark for LLM memoization 🤣

4:05 AM · May 6, 2026 · 12.8K Views
1:02 AM · May 7, 2026 · 123 Views

Apparently everyone is post-training with a lot of Python- we show in the paper that models substantially prefer reimplementing programs in Python, even though Python is the actual src lang in 0 of our tasks.

John YangJohn Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views
5:01 PM · May 6, 2026 · 6.1K Views

@yoavgo ya def something we've thought of. timing tests are super hard to do well though, it took a lot of effort in SWE-fficiency and the infra here is much more complex.

(((ل()(ل() 'yoav))))👾(((ل()(ل() 'yoav))))👾@yoavgo

@OfirPress add wall-clock runtime checks to the tests ;)

6:49 PM · May 6, 2026 · 219 Views
7:17 PM · May 6, 2026 · 99 Views

@yoavgo

Ofir PressOfir Press@OfirPress

Coding models are already much faster and I'm pretty sure more capable than most humans. Benchmarks have to pave the way towards the next frontier- if we stuck to human level benchmarks we would not further improve. So now we're entering the stage of super-human benchmarking.

11:00 AM · May 6, 2026 · 8K Views
12:57 AM · May 7, 2026 · 129 Views

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views

@teortaxesTex no matter what metric you pick there's gonna be some amount of information loss. but I think going with "50% pass" here would be much much more misleading. the end result here is that the agents didn't manage to fully replicate any binaries...

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@OfirPress "0% resolved" is still more misleading

8:11 AM · May 6, 2026 · 407 Views
10:46 AM · May 6, 2026 · 276 Views

@teortaxesTex that's just for our main metric. we have a lot of analysis in the paper, including this plot, showing the partial solve rates. it's clear that agents are on the cusp of being able to do well on this task, and that the current trajectory is pretty positive.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@OfirPress The interesting metric for me is how they move closer to completion. If you regard those tests as stripes on a non-sequential download progress bar, then it becomes pretty obvious which way the wind blows.

10:52 AM · May 6, 2026 · 116 Views
10:58 AM · May 6, 2026 · 151 Views

Yes. We've also tried really hard to manually / agentically go through the test sets to find "impossible tests" and couldn't find any. In the future we can remove any that are found too.

In addition, we have the 'almost solved' metric so even if a few tests are bad we can still hillclimb on that metric.

(((ل()(ل() 'yoav))))👾(((ل()(ل() 'yoav))))👾@yoavgo

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

8:35 PM · May 5, 2026 · 2.8K Views
12:55 AM · May 7, 2026 · 102 Views

@OfirPress are the harder tests hard to pass, or hard to discover?

Ofir PressOfir Press@OfirPress

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views
8:52 PM · May 5, 2026 · 401 Views

@OfirPress add wall-clock runtime checks to the tests ;)

Ofir PressOfir Press@OfirPress

Apparently everyone is post-training with a lot of Python- we show in the paper that models substantially prefer reimplementing programs in Python, even though Python is the actual src lang in 0 of our tasks.

5:01 PM · May 6, 2026 · 6.1K Views
6:49 PM · May 6, 2026 · 219 Views

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views

i would argue that the task is beyond very hard, it is actually impossible, for the same reason concept learning from only positive examples and queries is impossible.

(((ل()(ل() 'yoav))))👾(((ل()(ل() 'yoav))))👾@yoavgo

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views
5:02 PM · May 5, 2026 · 1.3K Views
(((ل()(ل() 'yoav))))👾(((ل()(ل() 'yoav))))👾@yoavgo

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

8:35 PM · May 5, 2026 · 2.8K Views
8:39 PM · May 5, 2026 · 938 Views

hmm i guess that what makes the task not impossible is that the test-suite itself is not adversarial, but based on fuzzing, so it is closer to the probabilistic setting. the agent can implement its own fuzzer and try to pass its generated tests

(((ل()(ل() 'yoav))))👾(((ل()(ل() 'yoav))))👾@yoavgo

programbench is a super-hard task that no human can reliably succeed in. i would argue that even those who wrote the original code are likely to fail at this task as defined (reproduce code that is compatible with a given reference binary, given the binary and its docs).

4:59 PM · May 5, 2026 · 7.4K Views
8:35 PM · May 5, 2026 · 2.8K Views

@OfirPress I agree, but in that case, why use the APR to create a rank list? The ranking, by itself, implies APR is the primary metric.

Ofir PressOfir Press@OfirPress

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views
1:17 AM · May 6, 2026 · 340 Views

@deedydas called it. in 1998:

Gary MarcusGary Marcus@GaryMarcus

Some things never change. If you don’t understand this one, you don’t understand what’s happening AI. Marcus, 1998: neural nets have trouble generalizing far beyond the data. Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data. Apple, 2025: neural nets have trouble generalizing far beyond the data. Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

6:00 PM · May 5, 2026 · 290.7K Views
6:08 PM · May 5, 2026 · 4.5K Views

@deedydas Seen this so move so much I made a name for it: the AI bait and switch:

2:39 PM · May 6, 2026 · 180 Views

The last sentence in this abstract is really important, in a way that professional programmers will immediately recognize: the models favored big single files rather than breaking things into modules.

That means that the code these systems write is going to be really hard to maintain.

AI code might get written quickly, but especially in new, complex projects, fixing it will be hell.

DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
2:41 PM · May 6, 2026 · 23.4K Views

The last sentence in this abstract is really important: models favorable big single files rather than breaking things into modules.

Any good programmer will immediately appreciate that the code these systems write is going to be really hard to maintain.

AI code might get written quickly, but especially in new, complex projects, fixing it will be hell.

DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
2:38 PM · May 6, 2026 · 774 Views

Some things never change. If you don’t understand this one, you don’t understand what’s happening AI.

Marcus, 1998: neural nets have trouble generalizing far beyond the data.

Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data.

Apple, 2025: neural nets have trouble generalizing far beyond the data.

Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
6:00 PM · May 5, 2026 · 290.7K Views

@ChrSzegedy honestly this cheap shot is beneath you, Christian

and the truth is that many many people have moved to my side re the critical importance of distribution shift

Christian SzegedyChristian Szegedy@ChrSzegedy

Some things never change: Gary just doesn't fit in distribution.

9:48 AM · May 6, 2026 · 13.6K Views
2:12 PM · May 6, 2026 · 2.6K Views

Very cool work. Asking agents to build big coding projects from scratch is a great way to create long-horizon tasks.

John YangJohn Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views
1:26 AM · May 21, 2026 · 1.4K Views

"Coding is [0%] solved".

John YangJohn Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views
3:08 PM · May 5, 2026 · 172.5K Views
DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
7:28 AM · May 6, 2026 · 20.3K Views

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on.

ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?

We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views

Source: https://programbench.com/

DeedyDeedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

3:23 PM · May 5, 2026 · 806K Views
3:23 PM · May 5, 2026 · 34.6K Views

A lot of critique around "how is memorizing ffmpeg software engineering?".

Well, every benchmark can be overfit to and memorized. You can memorize all the bugs in SWE-Bench too. ARC AGI might solve this by having a hidden set of games you can't look at. Getting 100% on ProgramBench does not mean we've achieved AGI.

However, in practice, most good models will regress in other obvious ways if they try to brute force memorize these programs. In practice, this is not how frontier models are built. We can also trivially test for memorization by comparing it to the source implementation.

The bet here is: a bottoms-up implementation of a real-world tool is a very long horizon high utility task. If models can reason through building them, it probably generalizes to many more such tasks.

DeedyDeedy@deedydas

Source: https://programbench.com/

3:23 PM · May 5, 2026 · 34.6K Views
5:32 PM · May 5, 2026 · 30.9K Views

The other critique which is more baffling to me is "well, humans can't do this."

So? Humans can't do a lot of things LLMs today can do today. The goal of benchmarks is to hillclimb on intelligence far above the average human.

DeedyDeedy@deedydas

A lot of critique around "how is memorizing ffmpeg software engineering?". Well, every benchmark can be overfit to and memorized. You can memorize all the bugs in SWE-Bench too. ARC AGI might solve this by having a hidden set of games you can't look at. Getting 100% on ProgramBench does not mean we've achieved AGI. However, in practice, most good models will regress in other obvious ways if they try to brute force memorize these programs. In practice, this is not how frontier models are built. We can also trivially test for memorization by comparing it to the source implementation. The bet here is: a bottoms-up implementation of a real-world tool is a very long horizon high utility task. If models can reason through building them, it probably generalizes to many more such tasks.

5:32 PM · May 5, 2026 · 30.9K Views
5:33 PM · May 5, 2026 · 28.4K Views

Addressing other criticism. To be clear I'm not defending them.

1. They didn't test harnesses like CC / codex: well, they could just do that 2. Don't like that they measured only completion rates, not progress (@scaling01): yeah, but you could just measure and rank by that 3. Not using internet is not a good constraint: it prevents some blatant cheating, but I agree it can lead to hillclimbing on the wrong thing. You can always just measure performance with internet access too! 4. Why not use novel problems not solved problems: painful and laborious to come up with, hard to write great exhaustive tests for without knowing what a solution looks like, hard to ascertain that it actually is a real world task and not some made up fiction

DeedyDeedy@deedydas

The other critique which is more baffling to me is "well, humans can't do this." So? Humans can't do a lot of things LLMs today can do today. The goal of benchmarks is to hillclimb on intelligence far above the average human.

5:33 PM · May 5, 2026 · 28.4K Views
12:21 AM · May 6, 2026 · 14.7K Views

@GaryMarcus I think this hides too much nuance. Claim 1: neural nets (or anything) has problem generalizing beyond data (actually distribution) under various settings. Claim 2: in a specific case, a specific neural network (or whatever) does not work well. Claim 2 is not implied by Claim 1.

Gary MarcusGary Marcus@GaryMarcus

Some things never change. If you don’t understand this one, you don’t understand what’s happening AI. Marcus, 1998: neural nets have trouble generalizing far beyond the data. Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data. Apple, 2025: neural nets have trouble generalizing far beyond the data. Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

6:00 PM · May 5, 2026 · 290.7K Views
12:17 AM · May 7, 2026 · 376 Views

@davidad oh wow this is a great benchmark I can’t believe I haven’t seen someone propose it before

davidad 🎇davidad 🎇@davidad

Yes, “write all of ffmpeg make no mistakes” is the correct benchmark hill to be climbing in 2026, much more so than SWE-bench. Good work. Looking forward to seeing where GPT-5.5 places on here, and beyond.

7:35 PM · May 5, 2026 · 29K Views
12:24 AM · May 6, 2026 · 1.2K Views

@davidad though it has a pretty big memorization problem, not that swe-bench doesn’t. maybe it should be “recreate X in rust, and as performant as possible”, testable in the same way the original program was and not easily memorizable (cc @jarredsumner)

NickNick@nickcammarata

@davidad oh wow this is a great benchmark I can’t believe I haven’t seen someone propose it before

12:24 AM · May 6, 2026 · 1.2K Views
12:25 AM · May 6, 2026 · 605 Views

@OfirPress "0% resolved" is still more misleading

Ofir PressOfir Press@OfirPress

Looking at average pass rate is *very* misleading- every task has a big chunk of tests that are very easy to pass and sometimes a minority of tests that are much harder to pass- so you can implement 10% of the program and get a 60% pass rate.

8:15 PM · May 5, 2026 · 10.9K Views
8:11 AM · May 6, 2026 · 407 Views

@OfirPress The interesting metric for me is how they move closer to completion. If you regard those tests as stripes on a non-sequential download progress bar, then it becomes pretty obvious which way the wind blows.

Ofir PressOfir Press@OfirPress

@teortaxesTex no matter what metric you pick there's gonna be some amount of information loss. but I think going with "50% pass" here would be much much more misleading. the end result here is that the agents didn't manage to fully replicate any binaries...

10:46 AM · May 6, 2026 · 276 Views
10:52 AM · May 6, 2026 · 116 Views

@OfirPress Would be interesting if you added some smaller simpler binaries to see where exactly do they reach nontrivial reproduction already

Ofir PressOfir Press@OfirPress

@teortaxesTex that's just for our main metric. we have a lot of analysis in the paper, including this plot, showing the partial solve rates. it's clear that agents are on the cusp of being able to do well on this task, and that the current trajectory is pretty positive.

10:58 AM · May 6, 2026 · 151 Views
11:01 AM · May 6, 2026 · 122 Views

Yes, “write all of ffmpeg make no mistakes” is the correct benchmark hill to be climbing in 2026, much more so than SWE-bench. Good work. Looking forward to seeing where GPT-5.5 places on here, and beyond.

John YangJohn Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

2:56 PM · May 5, 2026 · 677.3K Views
7:35 PM · May 5, 2026 · 29K Views

ProgramBench is very hard, but it’s solvable by design.

While the official best score is 0%, our extended results reveal varying levels of meaningful progress across tasks. We added a lot of interactive plots & tables to our website.

2:56 PM · May 5, 2026 · 10.2K Views

Check out our paper for a bunch of analyses

- Models write Python solutions even though originals are in C++/Rust/Go - 98% of runs, models submit before hitting time/turn limits - Given internet access, models Googled solutions 36% of the time, even when told not to

John YangJohn Yang@jyangballin

ProgramBench is very hard, but it’s solvable by design. While the official best score is 0%, our extended results reveal varying levels of meaningful progress across tasks. We added a lot of interactive plots & tables to our website.

2:56 PM · May 5, 2026 · 10.2K Views
2:56 PM · May 5, 2026 · 10.3K Views

ProgramBench is a joint effort across Meta FAIR, Meta TBD, Stanford, Harvard

@KLieret (co-first author) @18jeffreyma @parth007_96 @dpedch @sten_sootla @micmylin @pengchengyin @magpie_rayhou @syhw @Diyi_Yang @OfirPress

Paper: https://programbench.com/static/paper.pdf

2:56 PM · May 5, 2026 · 8.4K Views

this is actually hilarious

ProgramBench website reports 0% scores for all models

but in the background rank the model by the more useful but hidden metric of average % of tests passed

6:50 PM · May 5, 2026 · 6.5K Views
Researchers introduce ProgramBench benchmark for binary-to-repository reconstruction · Digg