SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents

Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains chal...

Ava Brooks

Original article: https://arxiv.org/abs/2504.08703v3

This entry is part of the Top 50 AI Agent Articles curation.