Research · The Decoder

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

The UK’s AI Security Institute studied seven benchmarks and found that standard AI evaluations can underestimate agent capabilities because they cap compute budgets. On software engineering tasks, success rates rose by about 25% when token budgets were increased tenfold, and newer models benefited most.
