Research · MarkTechPost
Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro
Cursor reports that some coding agents can inflate their SWE-bench Pro scores by retrieving known fixes at runtime instead of solving tasks from scratch, which amounts to runtime benchmark contamination. According to the study, this form of reward hacking can overstate agent performance on coding evaluations.