Hacktron's $350 Pentest vs XBOW and Aikido at $4,000

Intro

Our friends at Doyensec ran a head-to-head comparison of Aikido and XBOW, two AI application security testing platforms. The full report is linked in the References. I think this is a great initiative. Even though the comparison was sponsored by Aikido, I trust that Doyensec ran it fairly.

The hard part for buyers is that a pentest is basically a black box when you’re purchasing it. If you are looking at only one vendor, you have no idea what you’re actually getting, and a strong sales team can close the deal regardless of how good the underlying product is. The only way to really know whether a vendor did a good job is to put it side by side with a competitor on the same target. But pentests are expensive, so almost nobody gets the chance to run that comparison. A third-party comparison like this one is a great way to inform buyers.

Benchmark setup

Doyensec picked two open-source projects to run head-to-head, Fider and Photoview. Each repo cost $4,000 to scan on each platform: Aikido’s Standard Pentest and XBOW’s Plus tier come out to the same price. I decided to run a Hacktron scan on Fider at the same version that Doyensec benchmarked.

Unlike Aikido’s and XBOW’s fixed pricing, we calculate cost dynamically from estimated token and compute usage. Since the Fider repo is smaller, the estimate came out to $350.

[Figure: cost estimate for the Fider scan]

So I started a scan on Fider v0.33.0 at 18:14:59, right before leaving for the gym, and by the time I got back it had already finished, at 18:41:45.
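
The 27-minute wall-clock figure in the comparison table follows directly from these two timestamps. A quick check in Python, assuming both times fall on the same day:

```python
from datetime import datetime

# Start and end times quoted above (same day assumed).
start = datetime.strptime("18:14:59", "%H:%M:%S")
end = datetime.strptime("18:41:45", "%H:%M:%S")

elapsed = end - start
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # 26 min 46 s, i.e. about 27 minutes
```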

[Figure: final scan]

Hacktron launched 60 agents and confirmed 31 findings. You can check the final scan via the link in the References.

Triage

I manually validated every finding against the source and against a live Fider v0.33.0 instance I deployed. Of the 31 findings, 27 held up as true positives and 4 were false positives.

Here is how that looks next to Doyensec’s numbers on the same target:

Metric	Hacktron	Aikido	XBOW
Cost	$350	$4,000	$4,000
Wall-clock	27 min	~8h40m	~1 week
Reported	31	19	25
True positives	27	17	24
False positives	4	2	1

So on the same app, Hacktron found the most real bugs (27 vs 24 vs 17) at roughly a tenth of the cost and a fraction of the time.
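
To make the comparison easier to reason about, here is a small sketch that derives two ratios from the table: precision (true positives divided by reported findings) and cost per true positive. The numbers are copied from the table; the `results` dictionary and the `summarize` helper are names made up for this example.

```python
# Figures copied from the comparison table above.
results = {
    "Hacktron": {"cost": 350, "reported": 31, "true_pos": 27},
    "Aikido": {"cost": 4000, "reported": 19, "true_pos": 17},
    "XBOW": {"cost": 4000, "reported": 25, "true_pos": 24},
}


def summarize(row):
    """Return (precision, cost per true positive) for one platform."""
    precision = row["true_pos"] / row["reported"]
    cost_per_tp = row["cost"] / row["true_pos"]
    return precision, cost_per_tp


for name, row in results.items():
    precision, cost_per_tp = summarize(row)
    print(f"{name}: precision {precision:.1%}, ${cost_per_tp:.2f} per true positive")

# Hacktron: precision 87.1%, $12.96 per true positive
# Aikido: precision 89.5%, $235.29 per true positive
# XBOW: precision 96.0%, $166.67 per true positive
```

On these numbers Hacktron has the lowest precision of the three, which matches its higher false-positive count, while its cost per true positive is more than ten times lower than either competitor's.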

[Figure: true-positive and false-positive rates]

One annoying part with AI agents is that they overstate severities. Hacktron reported a lot of high-severity findings, many of which dropped to medium when I triaged them (a CSP-blocked XSS is not a high-severity issue). Aikido and XBOW did the same thing per Doyensec, just less aggressively. This is something we need to work on.

[Figure: overrated severity]

Hacktron called 6 findings critical and 14 high, but after grading against real impact the 27 true positives came out to 3 critical, 3 high, 3 medium, and 18 that were really low or info (13 low, 5 info). Only 22% of its severities matched my review, vs 76% for Aikido and 71% for XBOW. So it’s great at finding bugs and bad at ranking them.

Here’s the real severity mix once you grade the 27 true positives properly:

[Figure: ground-truth severity]

Did it find the bugs that actually matter?

Now the important question: did the $350 pentest find all the important bugs? Doyensec didn’t publish every bug Aikido and XBOW found in Fider, but we can look at what the Fider maintainers recently patched and use that as ground truth.

Here are the security fixes Fider shipped, and whether this one Hacktron run caught them:

Bug Fider patched	Commit / advisory	Hacktron
SSRF in webhook URLs	`f7db8603` / GHSA-g445-xwm7-594r	✅
Mass-assignment verification key, pre-auth account takeover	`74a26a31`	✅
No rate-limit on sign-in code, brute-force ATO	`b41d1b83`	✅
Cross-tenant verification/invite key reuse, tenant takeover	`ce4f44bb`	✅
XSS in markdown rendering and ATOM feed	`d28a838d` / GHSA-wm2w-gfh7-qg69	✅
Server-side JS injection in React SSR	`d5a80ea5`	✅
HTML escaping in rendered emails	`2f7aa747`	✅
DoS via unbounded HTTP response read	`da89c502`	✅
IDOR / moderation bypass on a single comment	`d74a643d`	❌
Authenticated arbitrary blob overwrite	`7b047158` / GHSA-vxp5-mf8m-grg9	❌

So Hacktron found 8 of the 10 bugs Fider recently fixed. More importantly, it caught every critical and high severity bug in the list: the pre-auth account takeover, the SSRF, the cross-tenant tenant takeover, the brute-force ATO, the markdown/ATOM XSS, and the server-side JS injection. The two it missed were not critical or high.
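
The 8-of-10 figure is a recall number measured against the patched-bug list only, not against every bug that may exist in Fider. A minimal sketch of that calculation, with the commit hashes and outcomes copied from the table (the `patched` list is just an illustrative encoding):

```python
# Rows from the table above: (commit, caught by this Hacktron run?)
patched = [
    ("f7db8603", True),   # SSRF in webhook URLs
    ("74a26a31", True),   # mass-assignment verification key
    ("b41d1b83", True),   # no rate limit on sign-in code
    ("ce4f44bb", True),   # cross-tenant key reuse
    ("d28a838d", True),   # XSS in markdown rendering and ATOM feed
    ("d5a80ea5", True),   # server-side JS injection in React SSR
    ("2f7aa747", True),   # HTML escaping in rendered emails
    ("da89c502", True),   # DoS via unbounded HTTP response read
    ("d74a643d", False),  # IDOR / moderation bypass on a single comment
    ("7b047158", False),  # authenticated arbitrary blob overwrite
]

caught = sum(1 for _, hit in patched if hit)
recall = caught / len(patched)
print(f"{caught}/{len(patched)} patched bugs caught ({recall:.0%})")  # 8/10 patched bugs caught (80%)
```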

Here are all the top findings:

Critical

#	Finding	Note
2	Auth bypass via action-binding key overwrite (mass-assignment)	Action structs expose `VerificationKey`/`VerificationCode`/`LinkKey` with no `json:"-"` tag. An attacker can overwrite the email verification code and take over the account.
1	Cross-tenant verification key reuse / private-tenant registration bypass	`email_verifications` keys are looked up globally and are not scoped to a tenant. A key issued for tenant A activates tenant B. Bypasses private-tenant invite gating.
4	No rate limit on sign-in verification code (brute-force ATO)	`/_api/signin/verify` has no lockout or attempt cap.
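
The mass-assignment finding (#2) comes down to binding attacker-controlled JSON onto an object that also carries server-side secrets. Fider is written in Go, where a `json:"-"` tag excludes a field from decoding. The sketch below is a Python analogy of the same pattern, not Fider's code: the `SignInAction` class, its fields, and both bind functions are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class SignInAction:
    # Field the client is expected to send.
    email: str = ""
    # Server-generated secret; the client must never be able to set it.
    verification_code: str = ""


def bind_unsafe(action, body):
    """Copy every JSON key that matches an attribute (mass assignment)."""
    for key, value in json.loads(body).items():
        if hasattr(action, key):
            setattr(action, key, value)
    return action


# Allow-list of client-settable fields; everything else is treated
# the way a `json:"-"` tag would treat it in Go.
CLIENT_FIELDS = {"email"}


def bind_safe(action, body):
    """Copy only allow-listed keys; server-side fields stay untouched."""
    for key, value in json.loads(body).items():
        if key in CLIENT_FIELDS:
            setattr(action, key, value)
    return action


payload = '{"email": "victim@example.com", "verification_code": "000000"}'
unsafe = bind_unsafe(SignInAction(verification_code="secret"), payload)
safe = bind_safe(SignInAction(verification_code="secret"), payload)
print(unsafe.verification_code)  # 000000 (attacker-chosen)
print(safe.verification_code)    # secret (unchanged)
```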

High

#	Finding	Note
11	SSRF via webhook (+ response leakage)	Admin webhook URLs fetched with no SSRF guard.
20	Infinite loop DoS in log placeholder parsing	`log.Parse` re-scans the rebuilt string; a URL containing `@{URL:magenta}` re-injects itself, producing an infinite loop. One unauthenticated GET drove CPU usage from 0 to 569%.
23	SSRF via custom OAuth provider config	Same unguarded `http.DefaultClient` sink as #11, reached through admin OAuth provider URLs.
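
Both SSRF findings (#11 and #23) share the same root cause: a user-supplied URL is fetched without checking where it points. As an illustration of what an "SSRF guard" can look like, here is a minimal Python sketch, not Fider's actual fix. The function names and the injectable `resolve` parameter are assumptions made for this example, and the policy (reject anything that does not resolve to a globally routable address) is one reasonable choice among several.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolve_host(host):
    """Resolve a hostname to the set of IP address strings it maps to."""
    infos = socket.getaddrinfo(host, None)
    return {info[4][0] for info in infos}


def is_url_allowed(url, resolve=resolve_host):
    """Allow only http(s) URLs whose every resolved address is globally routable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolve(parsed.hostname)
    except OSError:
        return False
    if not addresses:
        return False
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
        mapped = getattr(ip, "ipv4_mapped", None)
        if mapped is not None:
            ip = mapped
        if not ip.is_global:
            return False
    return True


# Fake resolver so the example runs without network access.
fake_dns = {"hooks.example.com": {"8.8.8.8"}, "metadata.internal": {"169.254.169.254"}}
lookup = lambda host: fake_dns[host]

print(is_url_allowed("https://hooks.example.com/notify", resolve=lookup))  # True
print(is_url_allowed("http://metadata.internal/latest", resolve=lookup))   # False
print(is_url_allowed("file:///etc/passwd", resolve=lookup))                # False
```

A check like this is only a first layer. It does not cover redirects or DNS rebinding, where the name resolves to a different address between the check and the actual request, so a production guard usually also validates the address at connect time.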

Medium

#	Finding	Note
6	Server-side JS injection in React SSR	URL `RawQuery` interpolated verbatim into the `ssrRender("<URL>", ...)` JS string. A `"` in the query breaks out and runs in the V8 SSR context.
7	Synchronous blocking on background worker queue (DoS)	The queue is buffered, so enqueuers block once it fills faster than it drains.
8	Unbounded request body read (memory exhaustion)	`WrapRequest` reads the body with no size cap.
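
The SSR injection (#6) is a classic string-interpolation bug: untrusted text is pasted into a JavaScript string literal. The Python sketch below reproduces the pattern with a simplified, single-argument `ssrRender(...)` call. It is an illustration rather than Fider's code, and the two helper functions are hypothetical. Encoding the value with `json.dumps` produces a quoted, escaped string that is also a valid JavaScript string literal.

```python
import json


def render_call_unsafe(url):
    """Interpolate the URL verbatim into a JS string literal (vulnerable)."""
    return 'ssrRender("%s")' % url


def render_call_safe(url):
    """Encode the URL as a JSON string, which escapes quotes and backslashes."""
    return "ssrRender(%s)" % json.dumps(url)


# A query string containing a double quote closes the literal early.
payload = '/?q=");globalThis.pwned=1;("'

print(render_call_unsafe(payload))
# ssrRender("/?q=");globalThis.pwned=1;("")

print(render_call_safe(payload))
# ssrRender("/?q=\");globalThis.pwned=1;(\"")
```

In the unsafe output the injected statement sits outside any string and would execute in the SSR context. In the safe output the quotes are escaped, so the whole payload stays inside one string argument.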

Takeaway

For $350 and half an hour, on the exact same app, Hacktron found more real vulnerabilities than either $4,000 platform, and caught every critical and high severity bug the maintainers later patched.

[Figure: model cost vs. performance]

As we mentioned in “Why Mythos doesn’t matter for us”, most application-layer bugs do not need the most expensive model on every step. A good harness can route between stronger and cheaper models such as Opus, Gemini Pro, and Flash, producing similar results without wasting compute.

References

The full scan can be found here: https://app.hacktron.ai/disclosed/scans/web_bXNya3AvZmlkZXI_1780577098593_j2ahuxotc

Doyensec’s full comparison report here: https://doyensec.com/resources/ComparingAIApplicationSecurityTestingPlatforms_Doyensec.pdf
