Safety · The Decoder
New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously
Researchers at Carnegie Mellon University created a benchmark to measure how far AI agents can go in exploiting real vulnerabilities in Google's V8 JavaScript engine. In tests, Claude Mythos outperformed GPT-5.5 by a wide margin, but at about 12 times the cost.