ProgramBench is the first whole-repository-generation benchmark that also allows agents to pick *which* language they're going to use and *how* they're going to implement the given program. w/ @jyangballin @KLieret @18jeffreyma
Users single out the agent's freedom to choose its own language as the interesting part of ProgramBench, since that implementation freedom should reveal more about flexible whole-repository code generation.
@jyangballin @KLieret @18jeffreyma Full ProgramBench Q&A: https://youtube.com/watch?v=blxN5jYWe8U Benchmark at https://programbench.com

@OfirPress @jyangballin @KLieret @18jeffreyma Letting agents pick the language is the interesting part. Implementation freedom probably reveals more about the agent's reasoning architecture than any fixed-language benchmark ever could.