RealBench Benchmark Tests LLMs on Realistic Repository-Level Coding Tasks

Original post

The authors evaluate modern LLMs on generating entire repositories or modules within repositories and show that current models struggle significantly with large-scale, structured software engineering tasks. While models can often identify the correct modules and architecture implied by UML diagrams, they frequently produce incorrect, inconsistent, or logically broken implementations (see the attached problem).

5:32 PM · May 21, 2026

At least at first glance (granted, I have not read it deeply), this seems to be a more realistic repository-level benchmark than Meta's RepoBench, because the AI has to build code directly from requirements/specs rather than from input/output pairs. https://arxiv.org/abs/2604.22659

Leo Boytsov @srchvrs

12:32 AM · May 22, 2026 · 34 Views
12:32 AM · May 22, 2026 · 80 Views

PS: Sadly, the authors use fairly old models, which might not represent the performance of frontier models well (GPT 5.5 and the latest Claude).

12:47 AM · May 22, 2026 · 54 Views