
An autonomous AI system called XBOW has outperformed human researchers to become the top-ranked security tester in the US on HackerOne, a bug bounty platform used by major organizations to strengthen their cybersecurity.
The result marks a milestone in the use of AI for ethical security research, not simply because the system performed well, but because it is the first documented instance of an autonomous system outperforming human experts at scale in a real-world environment.
XBOW was developed to function as an independent penetration tester, capable of identifying, validating, and reporting vulnerabilities in real-world systems. In the span of a few months, XBOW submitted over 1,000 vulnerability reports, leapfrogging thousands of human ethical hackers to land at the top of the US leaderboard.
“All findings were fully automated,” wrote Nico Waisman, XBOW’s head of security, in a blog post about the top ranking. However, he noted that human staff reviewed reports prior to submission to comply with HackerOne’s current policies governing AI tool usage.
How accurate is the AI tool?
Despite common concerns that AI tools often produce false positives in security testing, XBOW’s accuracy has impressed security professionals. According to internal metrics:
- 132 vulnerabilities were confirmed and resolved by program owners.
- 303 vulnerabilities were “triaged,” which means acknowledged but not yet resolved.
- 125 vulnerabilities remain under review.
- 208 reports were marked as duplicates.
- 209 vulnerabilities were labeled as informative.
- 36 reports were marked as not applicable.
In terms of severity over the past three months, XBOW’s reports included:
- 54 critical vulnerabilities
- 242 high vulnerabilities
- 524 medium vulnerabilities
- 65 low vulnerabilities
These figures suggest the AI’s findings are not only rapid but also impactful.
How does XBOW work?
XBOW’s training began with solving Capture The Flag (CTF) challenges, a common method in cybersecurity education, before moving on to testing environments that simulate real-world conditions.
To ensure quality, the system uses a “validator” layer. These are automated checkers — sometimes powered by language models, other times by custom scripts — that verify whether a vulnerability truly exists.
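XBOW has not published its validators, but the idea is straightforward to illustrate. Below is a minimal sketch in Python of what a script-based validator for a reflected XSS finding might look like; the target URL, parameter name, and canary payload are hypothetical, and the real validators are certainly more sophisticated.

```python
# Hypothetical sketch of a script-based validator: confirm a candidate
# reflected-XSS finding actually reproduces before it is reported.
import requests

# A unique canary string keeps the check from matching content that
# happens to be on the page already.
CANARY = "xbow-canary-7f3a"
PAYLOAD = f"<script>alert('{CANARY}')</script>"


def validate_reflected_xss(url: str, param: str) -> bool:
    """Return True only if the raw payload is echoed back unescaped.

    An HTML-escaped reflection (&lt;script&gt;...) will not match the
    raw payload string, so it is treated as a false positive.
    """
    resp = requests.get(url, params={param: PAYLOAD}, timeout=10)
    return PAYLOAD in resp.text


if __name__ == "__main__":
    # Hypothetical target; a real run would point at the endpoint the
    # model flagged as potentially vulnerable.
    if validate_reflected_xss("https://example.com/search", "q"):
        print("finding reproduced: submit the report")
    else:
        print("not reproduced: discard as a false positive")
```

A check like this only passes when the suspected bug demonstrably reproduces, which is how a validator layer can keep an automated pipeline's false-positive rate low.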
“We treated [XBOW] like any external researcher would: no shortcuts, no internal knowledge — just XBOW, running on its own,” said Waisman. The company plans to release a series of blog posts detailing some of the AI’s most creative discoveries, offering a transparent look into how it works and what it found.
XBOW has just raised $75 million in a new funding round led by Altimeter Capital, with participation from Sequoia Capital and NFDG, according to Bloomberg.