Hacktron's $350 Pentest vs XBOW and Aikido at $4,000

June 5, 2026

Intro

Our friends at Doyensec ran a head-to-head comparison of Aikido and XBOW, two AI application security testing platforms. The report can be read here. I think this is a great initiative. Even though the comparison was sponsored by Aikido, I trust that Doyensec ran it fairly.

The hard part for buyers is that a pentest is basically a black box at the time of purchase. If you only look at one vendor, you have no idea what you’re actually getting, and a strong sales team can close the deal regardless of how good the underlying product is. The only way to really know whether a vendor did a good job is to put it side by side with a competitor on the same target. But pentests are expensive, so almost nobody gets the chance to run that comparison. A third-party comparison like this is a great way to inform buyers.

Benchmark setup

Doyensec picked two open-source projects for the head-to-head: Fider and Photoview. Each repo cost $4,000 to scan on each platform, since Aikido’s Standard Pentest and XBOW’s Plus tier come out to the same price. I decided to run our scan on Fider at the same version that Doyensec benchmarked.

Unlike Aikido’s and XBOW’s fixed pricing, we calculate cost dynamically from estimated token and compute usage. Since the Fider repo is smaller, the estimate came out to $350.

[Figure: scan cost estimate]

So I started a scan on Fider v0.33.0 at 18:14:59, right before leaving for the gym, and by the time I got back it had already finished, at 18:41:45.
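As a quick check on the wall-clock figure, the two timestamps above can be subtracted directly. This is only arithmetic on the times quoted in this post, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

# Start and end times quoted above; assumed to be on the same day.
FMT = "%H:%M:%S"
started = datetime.strptime("18:14:59", FMT)
finished = datetime.strptime("18:41:45", FMT)

elapsed = finished - started
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # 26 min 46 s, which rounds to 27 minutes
```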

[Figure: final scan]

The scan launched 60 agents and confirmed 31 findings. The final scan is linked in the References section.

Triage

I manually validated every finding against the source and against a live Fider v0.33.0 I deployed. 27 held up as true positives, 4 were false positives.

Here is how that looks next to Doyensec’s numbers on the same target:

| | Hacktron | Aikido | XBOW |
|---|---|---|---|
| Cost | $350 | $4,000 | $4,000 |
| Wall-clock | 27 min | ~8h40m | ~1 week |
| Reported | 31 | 19 | 25 |
| True positives | 27 | 17 | 24 |
| False positives | 4 | 2 | 1 |

So on the same app, Hacktron found the most real bugs (27 vs 24 vs 17) at roughly a tenth of the cost and a fraction of the time.
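To make the comparison concrete, here is a small sketch that derives two ratios from the numbers in the table above: the share of reported findings that were true positives, and the cost per true positive. The figures are the ones quoted in this post; the names `results` and `summarize` are just for illustration.

```python
# Figures from the comparison table: (cost in USD, true positives, false positives).
results = {
    "Hacktron": (350, 27, 4),
    "Aikido": (4000, 17, 2),
    "XBOW": (4000, 24, 1),
}

def summarize(cost, tp, fp):
    """Return (precision, cost per true positive) for one platform."""
    reported = tp + fp
    return tp / reported, cost / tp

summary = {name: summarize(*row) for name, row in results.items()}

for name, (precision, cost_per_tp) in summary.items():
    print(f"{name:9s} precision={precision:.1%}  cost per true positive=${cost_per_tp:,.2f}")
```

On these numbers Hacktron's precision is about 87%, against roughly 89% for Aikido and 96% for XBOW, while its cost per true positive is about $13, against roughly $235 and $167.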

[Figure: true positive and false positive rates]

One annoying thing about AI agents is that they overstate severities. Hacktron reported a lot of high-severity findings, many of which dropped to medium when I triaged them (a CSP-blocked XSS is not a high-severity issue). Aikido and XBOW did the same thing per Doyensec, just less aggressively. This is something we need to work on.

[Figure: overrated severities]

The scanner called 6 findings critical and 14 high, but after grading against real impact, the true positives came out to 3 critical, 3 high, 3 medium, 13 low, and 5 info. Only 22% of its severities matched my review, vs 76% for Aikido and 71% for XBOW. So it’s great at finding bugs and bad at ranking them.

Here’s the real severity mix once you grade the 27 true positives properly:

[Figure: ground-truth severity of the 27 true positives]
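The same mix can be tallied from the counts quoted above. This sketch only restates those numbers as shares of the 27 true positives; it adds no new data.

```python
# Ground-truth severity counts for the 27 true positives, as stated in this post.
ground_truth = {"critical": 3, "high": 3, "medium": 3, "low": 13, "info": 5}

total = sum(ground_truth.values())
shares = {sev: count / total for sev, count in ground_truth.items()}

for sev, share in shares.items():
    print(f"{sev:8s} {ground_truth[sev]:2d}  {share:.0%}")

# Share of findings that ended up low or info after manual review.
low_or_info = (ground_truth["low"] + ground_truth["info"]) / total
print(f"low or info: {low_or_info:.0%}")
```

About two thirds of the true positives land in low or info once graded by real impact.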

Did it find the bugs that actually matter?

Now the important question is, did the $350 pentest find all the important bugs? Doyensec didn’t provide all the bugs Aikido and XBOW found for Fider, but we can look at what the Fider maintainers recently patched and use it as ground truth.

Here are the security fixes Fider shipped, and whether this one Hacktron run caught them:

| Bug Fider patched | Commit / advisory | Hacktron |
|---|---|---|
| SSRF in webhook URLs | f7db8603 / GHSA-g445-xwm7-594r | |
| Mass-assignment verification key, pre-auth account takeover | 74a26a31 | |
| No rate-limit on sign-in code, brute-force ATO | b41d1b83 | |
| Cross-tenant verification/invite key reuse, tenant takeover | ce4f44bb | |
| XSS in markdown rendering and ATOM feed | d28a838d / GHSA-wm2w-gfh7-qg69 | |
| Server-side JS injection in React SSR | d5a80ea5 | |
| HTML escaping in rendered emails | 2f7aa747 | |
| DoS via unbounded HTTP response read | da89c502 | |
| IDOR / moderation bypass on a single comment | d74a643d | |
| Authenticated arbitrary blob overwrite | 7b047158 / GHSA-vxp5-mf8m-grg9 | |

So Hacktron found 8 of the 10 bugs Fider recently fixed. More importantly, it caught every critical and high severity bug in the list: the pre-auth account takeover, the SSRF, the cross-tenant takeover, the brute-force ATO, the markdown/ATOM XSS, and the server-side JS injection. The two it missed were not critical or high.

Here are all the top findings:

Critical

| # | Finding | Note |
|---|---|---|
| 2 | Auth bypass via action-binding key overwrite (mass-assignment) | Action structs expose VerificationKey/VerificationCode/LinkKey with no json:"-" tag. An attacker can overwrite the email verification code and take over the account. |
| 1 | Cross-tenant verification key reuse / private-tenant registration bypass | email_verifications keys are looked up globally and are not scoped to a tenant. A key issued for tenant A activates tenant B. Bypasses private-tenant invite gating. |
| 4 | No rate limit on sign-in verification code (brute-force ATO) | /_api/signin/verify has no lockout or attempt cap. |
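The mass-assignment pattern in finding #2 is easier to see in a small example. Fider is written in Go, so the Python below is only an analogy and not Fider's code: `SignInAction`, `bind_unsafe`, `bind_safe`, `CLIENT_FIELDS`, and the field names are all hypothetical. It contrasts a binder that copies every JSON key onto a server-side action object with one that only accepts an allow-list, which is roughly the role a json:"-" tag plays in Go.

```python
import json
from dataclasses import dataclass

@dataclass
class SignInAction:
    # Client-supplied field.
    email: str = ""
    # Server-generated secret; a client must never be able to set it.
    verification_key: str = "server-generated-secret"

# Fields a client is allowed to bind (hypothetical allow-list).
CLIENT_FIELDS = {"email"}

def bind_unsafe(action, body):
    """Copy every matching key in the JSON body onto the action: mass-assignment."""
    for key, value in json.loads(body).items():
        if hasattr(action, key):
            setattr(action, key, value)
    return action

def bind_safe(action, body):
    """Copy only allow-listed keys; server-owned fields are ignored."""
    for key, value in json.loads(body).items():
        if key in CLIENT_FIELDS:
            setattr(action, key, value)
    return action

payload = '{"email": "victim@example.com", "verification_key": "attacker-chosen"}'

unsafe = bind_unsafe(SignInAction(), payload)
safe = bind_safe(SignInAction(), payload)

print(unsafe.verification_key)  # attacker-chosen
print(safe.verification_key)    # server-generated-secret
```

With the unsafe binder the attacker picks the secret that is later checked, which is what turns a binding quirk into an account takeover.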

High

| # | Finding | Note |
|---|---|---|
| 11 | SSRF via webhook (+ response leakage) | Admin webhook URLs fetched with no SSRF guard. |
| 20 | Infinite loop DoS in log placeholder parsing | log.Parse re-scans the rebuilt string; a URL containing @{URL:magenta} re-injects itself into an infinite loop. One unauthenticated GET drove CPU from 0 to 569%. |
| 23 | SSRF via custom OAuth provider config | Same unguarded http.DefaultClient sink as #11, reached through admin OAuth provider URLs. |
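Finding #20 is a classic re-scan bug, and a toy version shows why it never terminates. The Python below is a hypothetical expander written for illustration, not Fider's log.Parse: the `PLACEHOLDER` pattern, both function names, and the `max_passes` cap are assumptions. The cap exists only so the demo stops; without it the first function would loop forever on this input.

```python
import re

# Matches @{NAME} or @{NAME:color}; the exact syntax is an assumption.
PLACEHOLDER = re.compile(r"@\{(\w+)(?::\w+)?\}")

def expand_rescanning(template, props, max_passes=1000):
    """Replace one placeholder, then search the rebuilt string from the start.

    Returns (text, finished). If a substituted value itself contains a
    placeholder, the loop never converges; max_passes only bounds the demo.
    """
    text = template
    for _ in range(max_passes):
        match = PLACEHOLDER.search(text)
        if match is None:
            return text, True
        value = str(props.get(match.group(1), ""))
        text = text[:match.start()] + value + text[match.end():]
    return text, False

def expand_single_pass(template, props):
    """Substitute every placeholder in one pass; values are never re-scanned."""
    return PLACEHOLDER.sub(lambda m: str(props.get(m.group(1), "")), template)

# Attacker-controlled request URL that contains the placeholder itself.
props = {"URL": "/search?q=@{URL:magenta}"}
template = "GET @{URL:magenta} started"

_, finished = expand_rescanning(template, props)
print(finished)  # False: the cap was hit, an uncapped loop would keep spinning
print(expand_single_pass(template, props))
```

The single-pass version leaves the injected placeholder as inert text, so the attacker's URL is logged rather than expanded again.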

Medium

| # | Finding | Note |
|---|---|---|
| 6 | Server-side JS injection in React SSR | URL RawQuery interpolated verbatim into the ssrRender("<URL>", ...) JS string. A " in the query breaks out and runs in the V8 SSR context. |
| 7 | Synchronous blocking on background worker queue (DoS) | Buffered queue, enqueuers block once it fills faster than it drains. |
| 8 | Unbounded request body read (memory exhaustion) | WrapRequest reads the body with no size cap. |
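The string breakout in finding #6 can be reproduced without a JS engine by looking at the source text that gets built. This is a sketch under assumptions: `build_call_unsafe` and `build_call_safe` are hypothetical helpers, the second argument `{}` is a stand-in, and the safe variant is one common mitigation rather than the fix Fider shipped.

```python
import json

def build_call_unsafe(url):
    """Interpolate the URL verbatim into a JS string literal."""
    return 'ssrRender("' + url + '", {})'

def build_call_safe(url):
    """Encode the URL as a JSON string, which is also a valid JS string literal."""
    return "ssrRender(" + json.dumps(url) + ", {})"

# A double quote in the query string closes the literal early.
malicious = '/?q=");globalThis.pwned=1;("'

print(build_call_unsafe(malicious))
print(build_call_safe(malicious))
```

In the unsafe output the attacker's statement sits outside any string literal, so the SSR engine would execute it. In the safe output the quotes are backslash-escaped and the whole query stays inside one literal.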

Takeaway

For $350 and half an hour, on the exact same app, Hacktron found more real vulnerabilities than either $4,000 platform, and caught every critical and high severity bug the maintainers later patched.

[Figure: model cost vs performance]

As we mentioned in "Why Mythos doesn’t matter for us", most application-layer bugs do not need the most expensive model on every step. A good harness can route between stronger and cheaper models such as Opus, Gemini Pro, and Flash, producing similar results without wasting compute.

References

The full scan can be found here: https://app.hacktron.ai/disclosed/scans/web_bXNya3AvZmlkZXI_1780577098593_j2ahuxotc

Doyensec’s full comparison report can be found here: https://doyensec.com/resources/ComparingAIApplicationSecurityTestingPlatforms_Doyensec.pdf