
ProgramBench evaluates language models on reconstructing complete codebases from only compiled binaries and documentation; John Yang is asking for task suggestions for v2

John Yang seeks input on which CLI tools, executables, local apps, and websites models should try building from scratch.

Original post

guys, you are my new METR time horizon I need to see Gemini 3.5 Flash results

8:37 AM · May 20, 2026

Thinking about what new tasks to put in programbench v2.

What software programs (CLI tool/executables? Local apps? Websites?) would u wanna see models try building from scratch?

2:00 PM · May 21, 2026 · 625 Views