A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
Original article: https://arxiv.org/abs/2603.26137v1
Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time...

This entry is part of the Top 50 AI Agent Articles curation.