Research
Your Team is Drowning in Bug Reports. Now What?
Mar 11, 2026

AI models have become good enough to scan a codebase, identify a potential vulnerability, write a proof of concept, and submit it to a bounty platform with clean formatting. The barrier to producing a finding has dropped to nearly zero. Someone with limited security experience can now produce submissions that look, on the surface, like the work of a seasoned auditor.
This has overwhelmed the pipeline at every level.
Bounty platforms are responding with rate limits, restricting how many findings can be submitted within a given time window. These measures make sense as a short-term response, but they carry a cost. Rate limits compress volume; they do not distinguish between submissions based on the quality of the work behind them. Quantity is treated as the problem, and for now that may be the most practical lever platforms have. But quantity was never the actual problem. Unproven quantity is.
For protocol teams, triage has gone from a background task to a primary time sink. Teams that used to spend an hour a week reviewing bounty submissions now spend entire weeks on it. And the findings themselves have a particular character that makes them hard to dismiss quickly. They are often just plausible enough to require real analysis, even when they turn out to be invalid.
This would be manageable if the submissions were clearly invalid, but they are not. Many of them sit in a middle ground: the logic is partially sound, the issue described seems real, but the finding has not been carried far enough to confirm whether it constitutes an actual threat.
Finding vs. Proving
Consider what a typical AI-assisted submission looks like in practice.
A model scans a protocol's contracts. It identifies a function where an invariant can be broken under certain conditions. It generates a report, assigns a severity, and includes a code snippet that demonstrates the logic flaw. The report is well-structured. The reasoning appears sound.
But the finding has not been proven. It has been asserted.
Proving a vulnerability requires demonstrating two things. First, that the invariant actually breaks when tested against a fork of the live protocol with real state, not in an isolated or simplified environment. Second, that the preconditions needed to trigger the break are achievable. Can an attacker actually construct this scenario? What does it cost? What liquidity conditions need to exist? What sequence of transactions is required to get the protocol into the vulnerable state?
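The two conditions above can be sketched as a simple check. This is a hypothetical illustration, not a real platform API: the `Finding` fields and the `is_proven` helper are invented names standing in for whatever a proving harness would actually record.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two proving conditions. Field names are
# illustrative; a real harness would derive them from executing the PoC
# against forked production state.

@dataclass
class Finding:
    breaks_invariant_on_fork: bool   # condition 1: reproduced against real state
    preconditions_achievable: bool   # condition 2: attacker can reach the setup
    attack_cost: float               # cost to construct the attack scenario
    expected_damage: float           # measurable impact if the attack lands

def is_proven(f: Finding) -> bool:
    """Proven means: executes against real state, preconditions are
    reachable, and the attack is economically rational."""
    return (
        f.breaks_invariant_on_fork
        and f.preconditions_achievable
        and f.expected_damage > f.attack_cost
    )

# Sound logic that was never run against a fork: asserted, not proven.
asserted = Finding(False, True, attack_cost=1_000.0, expected_damage=50_000.0)
# The same issue, reproduced on forked production state: proven.
proven = Finding(True, True, attack_cost=1_000.0, expected_damage=50_000.0)
```

The economic-viability term is what most raw AI submissions never address: even a reproducible invariant break is not a threat if constructing it costs more than it yields.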
Most raw AI outputs do not fully satisfy either condition. They identify the logic flaw correctly and sometimes demonstrate it in a simplified test, but they do not run it against production state. And they almost never address whether the preconditions required to trigger it are realistic. The gap between what these submissions assert and what they actually prove is where the triage burden now concentrates.
A pattern has become common across bounty programs. Submissions arrive describing real logic issues that the protocol team is already aware of and made deliberate design decisions around. The team may have mitigated the issue through mechanisms outside the contract layer, or they may have determined that the preconditions required to exploit it are unreachable or economically irrational given the actual deployment parameters. The vulnerability exists in the abstract. It does not exist in practice.
Other submissions identify a real issue but attribute it to the wrong root cause. Without deep protocol context, the report reads as convincing. With context, it falls apart. But reaching that conclusion still takes hours of careful review before any action can be taken.
What Separates Signal From Noise
A well-proven vulnerability satisfies both conditions described above, and one more: the invariant breaks on a fork with production state, the preconditions to reach it are achievable, and the outcome produces measurable damage. That is what separates a confirmed exploit from a theoretical finding.
The bulk of the volume fails on proving. The invariant breaks under conditions that would never arise in practice. The PoC runs against a setup that does not reflect the live deployment. The severity is inflated because the model classified it as critical or high without considering whether the attack is economically viable. Or the finding is a duplicate of something already submitted, reframed in different language but describing the same underlying issue.
The Triage Problem is a Proving Problem
The broader conversation right now tends to frame this as a judging problem: most discussions end up in the same place, building faster ways to sort through submissions, and a reasonable instinct is to automate the triage layer itself. There is some truth to that. Automated triaging for duplicates and low-quality submissions would relieve pressure on human judges and protocol teams. But the filter that actually matters is the proving step itself. If a finding cannot be confirmed through execution, it does not pass. This eliminates the category of submissions that currently consumes the most triage time: the ones plausible enough to require evaluation but never actually demonstrated to be real.
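A proving-step filter could be sketched in a few lines. Everything here is hypothetical: `run_poc_on_fork` stands in for whatever execution harness a platform would actually plug in (a forked-state test runner, for instance), and the toy submissions exist only to show the split.

```python
# Hypothetical triage filter: a submission passes only if its PoC can be
# confirmed through execution. `run_poc_on_fork` is an assumed hook, not a
# real platform API.

def triage(submissions, run_poc_on_fork):
    confirmed, rejected = [], []
    for sub in submissions:
        # The proving step: execute the PoC against forked production state.
        if run_poc_on_fork(sub):
            confirmed.append(sub)
        else:
            rejected.append(sub)
    return confirmed, rejected

# Toy data: one finding that reproduces on the fork, one that is merely
# plausible on paper.
subs = [
    {"id": "A", "poc_breaks_invariant": True},
    {"id": "B", "poc_breaks_invariant": False},
]
confirmed, rejected = triage(subs, lambda s: s["poc_breaks_invariant"])
```

The point of the sketch is where the work happens: the expensive judgment call is pushed into an automated execution step, so only submission "A" ever reaches a human reviewer.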
Right now, the proving work falls on the judges and the protocol teams. It does not have to. This is where AI could be applied. An AI system that can fork a protocol, execute an attack sequence, verify that preconditions are achievable under realistic deployment conditions, and confirm measurable impact is doing fundamentally different work than one that scans code and flags patterns. The first confirms whether an exploit is real. The second generates more submissions.
The gap is that most AI workflows in security today are oriented around finding, not proving. When that changes, the triage problem resolves itself. Not only because the volume decreases, but because everything that cannot be confirmed through the proving step gets filtered out before it ever reaches anyone who has to evaluate it thoroughly.
The Bigger Picture
The proving gap extends beyond bounty programs. A finding without reproducible proof of exploitability (where applicable) creates the same burden whether it comes from a bounty researcher, a contest participant, an audit report, a security tool, or an internal review. Protocols should be asking whether their security solutions are delivering actionable answers or producing more questions to address.
The ability to identify potential vulnerabilities is becoming table stakes. AI models will only get better at scanning code, pattern-matching, and generating reports. What remains scarce is the ability to take a potential finding and prove whether it constitutes a real threat under actual deployment conditions. That is where the highest-value security work is heading, and it will only become more important as the models get better at the easy part.
Anyone can find a bug now. The value is in proving it is real.
Shepherd is built around this distinction. Let us help you get your engineering time back.