NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Original article: https://arxiv.org/abs/2512.12730v2
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software system...

This entry is part of the Top 50 AI Agent Articles curation.