ProgramBench is the first whole-repository-generation benchmark that also allows agents to pick *which* language they're going to use and *how* they're going to implement the given program. w/ @jyangballin @KLieret @18jeffreyma
Users responding positively single out the fact that ProgramBench lets agents choose their own language for whole-repository code generation, arguing that implementation freedom reveals more about an agent's capabilities.
@jyangballin @KLieret @18jeffreyma Full ProgramBench Q&A: https://youtube.com/watch?v=blxN5jYWe8U Benchmark at https://programbench.com

@OfirPress @jyangballin @KLieret @18jeffreyma Letting agents pick the language is the interesting part. Implementation freedom probably reveals more about the agent's reasoning architecture than any fixed-language benchmark ever could.