Research · The Decoder
UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do
The UK’s AI Security Institute studied seven benchmarks and found that standard AI evaluations can underestimate agent capabilities because they cap compute budgets. On software engineering tasks, success rates rose about 25% when token budgets were increased tenfold, with newer models benefiting most.