Researchers unveil LLM tool to find Python zero-days

Researchers with Seattle-based Protect AI plan to release a free, open source tool that can find zero-day vulnerabilities in Python codebases with the help of Anthropic's Claude AI model.

The software, called Vulnhuntr, was announced at the No Hat security conference in Italy on Saturday.

"The tool does not simply paste some code from the project and ask for analysis," explained Dan McInerney, lead AI threat researcher at Protect AI, who developed the software with colleague Marcello Salvati.

"It automatically finds project files that are likely to handle remote user input, Claude analyzes that for potential vulnerabilities, then for each potential vulnerability Claude is given a vulnerability-specific highly optimized prompt and enters a loop."

"In this loop it intelligently requests functions/classes/variables from elsewhere in the code continually until it completes the entire call chain from user input to server output without blowing up its context window. The advantage of this over current static code analyzers is a massive reduction in false positives/negatives since it can read the entire call chain, not just little code snippets one at a time."
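For illustration only, the loop McInerney describes can be sketched in a few lines of Python. This is not Vulnhuntr's code: the SOURCES table, the fake_model stand-in for the LLM call, and the trace_call_chain helper are all hypothetical names invented here, and a real tool would resolve symbols with a static analyzer rather than a hard-coded dictionary.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical symbol table: name -> source text. A real tool would get
# this from a static analyzer; here it is hard-coded for the sketch.
SOURCES: Dict[str, str] = {
    "handle_request": "def handle_request(req): return load_file(req['path'])",
    "load_file": "def load_file(path): return open(path).read()",
}


def fake_model(context: List[str]) -> Optional[str]:
    """Stand-in for the LLM call (an assumption, not a real API).

    Returns the name of the next symbol it wants to see, or None once
    every known function referenced in the context has been supplied.
    """
    joined = "\n".join(context)
    for name in SOURCES:
        if f"{name}(" in joined and f"def {name}(" not in joined:
            return name
    return None


def trace_call_chain(entry: str,
                     ask: Callable[[List[str]], Optional[str]],
                     max_steps: int = 10) -> List[str]:
    """Grow the context one requested symbol at a time, with a step cap."""
    context = [SOURCES[entry]]
    for _ in range(max_steps):
        wanted = ask(context)
        if wanted is None or wanted not in SOURCES:
            break  # chain complete, or the symbol cannot be resolved
        context.append(SOURCES[wanted])
    return context


chain = trace_call_chain("handle_request", fake_model)
```

The step cap and the one-symbol-at-a-time requests are what keep the context small in this sketch. How Vulnhuntr itself bounds its context window is not described in detail here.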

This approach, McInerney claims, can reveal complex, multi-step vulnerabilities, as opposed to flagging functions like eval() with known security implications.
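To make the contrast concrete, here is a minimal sketch of the kind of single-call pattern check that flags eval(). It uses Python's standard ast module. The RISKY_CALLS set, the flag_risky_calls helper, and the sample source are assumptions made up for this example and do not come from any particular analyzer.

```python
import ast
from typing import List

RISKY_CALLS = {"eval", "exec"}  # illustrative list, chosen for this sketch


def flag_risky_calls(source: str) -> List[int]:
    """Return the line numbers of direct calls to names in RISKY_CALLS."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_CALLS):
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''
def calc(expr):
    return eval(expr)

def read(path):
    return open(path).read()
'''

flagged = flag_risky_calls(SAMPLE)
```

A check like this reports the eval() call but has nothing to say about whether the path passed to read() is attacker-controlled. Answering that requires following the call chain back to user input.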

"The tool was originally designed using Claude and used Claude's best practices in prompt engineering so it performs by far the best using Claude," said McInerney. "We included the option to use [OpenAI's] GPT-4 and we tested it with GPT-4o but got poorer results. Modifying the prompts to better fit GPT-4o is very straightforward and using the GPT-4o model is just a change in 1 line of code. By open sourcing it, we hope to encourage modifications such as these as new models come out."

So far, McInerney says, Vulnhuntr has found more than a dozen zero-day vulnerabilities in large, open source Python projects.

"All of these vulnerabilities were not previously known or reported to the project maintainers," he said.

The tool presently focuses on seven types of remotely exploitable vulnerabilities:

Arbitrary File Overwrite (AFO)
Local File Inclusion (LFI)
Server-Side Request Forgery (SSRF)
Cross-Site Scripting (XSS)
Insecure Direct Object References (IDOR)
SQL Injection (SQLi)
Remote Code Execution (RCE)

Affected projects include:

gpt_academic, 64K stars on GitHub, LFI, XSS
ComfyUI, 50K stars, XSS
FastChat, 35K stars, SSRF
Ragflow, 16K stars, RCE

Other projects in which vulnerable code was spotted less than 90 days ago have not been named, to give maintainers time to fix the flaws.

Ragflow, said McInerney, is the only project he's aware of that has fixed its identified bug.

Vulnhuntr has some limitations. It only works on Python code at the moment and it depends on access to a Python static analyzer. As a result, the tool is more likely to generate false positives when scanning Python projects that incorporate code in other languages (e.g. TypeScript).

When generating a proof-of-concept (PoC) exploit, the software also produces a confidence score ranging from 1 to 10. A score of 7 means the finding is probably a valid vulnerability, though the PoC code may need some refinement. A score of 8 or more means the finding is highly likely to be valid. Scores of 6 or lower mean it is unlikely to be valid.
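Those thresholds can be written down as a small triage helper. The function name and the label strings below are assumptions made for this sketch. Only the score bands come from the description above.

```python
def interpret_confidence(score: int) -> str:
    """Map a 1-10 confidence score to the bands described in the article."""
    if not 1 <= score <= 10:
        raise ValueError("confidence score must be between 1 and 10")
    if score >= 8:
        return "highly likely valid"
    if score == 7:
        return "probably valid, PoC may need refinement"
    return "unlikely to be valid"


labels = {s: interpret_confidence(s) for s in (6, 7, 8)}
```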

The output looks something like this:


scratchpad: 
1. Analyzing the add_llm function in llm_app.py.
2. Identified user input used as keys to access dictionaries: EmbeddingModel, ChatModel, RerankModel, CvModel, and TTSModel.
3. These dictionaries contain class references, which are instantiated with user-supplied parameters.
4. The factory variable is directly used from user input (req['llm_factory']) without proper validation. [...]
---------------------------------------- 
analysis: 
The add_llm function in llm_app.py contains a critical Remote Code Execution (RCE) vulnerability. The function uses user-supplied input (req['llm_factory'] and req['llm_name']) to dynamically instantiate classes from the EmbeddingModel, ChatModel, RerankModel, CvModel, and TTSModel dictionaries. This pattern of using user input as a key to access and instantiate classes is inherently dangerous, as it allows an attacker to potentially execute arbitrary code. The vulnerability is exacerbated by the lack of comprehensive input validation or sanitization on these user-supplied values. [...]
---------------------------------------- 
poc: 
POST /add_llm HTTP/1.1 
Host: target.com 
Content-Type: application/json 
Authorization: Bearer <valid_token> 
{ "llm_factory": "__import__('os').system", "llm_name": "id", "model_type": "EMBEDDING", "api_key": "dummy_key"}
---------------------------------------- 
confidence_score: 8 
---------------------------------------- 
vulnerability_types:
  - RCE 
----------------------------------------
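The pattern the analysis describes, request fields that select a class and supply its constructor arguments, can be illustrated with a toy example. This is not Ragflow's code: DummyEmbedding, EMBEDDING_MODELS, and both functions are hypothetical, and the checked variant is just one possible hardening, not the fix the project shipped.

```python
from typing import Any, Dict


class DummyEmbedding:
    """Placeholder model class, invented for this sketch."""

    def __init__(self, api_key: str, model_name: str) -> None:
        self.api_key = api_key
        self.model_name = model_name


# Hypothetical registry mapping a factory name to a class.
EMBEDDING_MODELS: Dict[str, type] = {"Dummy": DummyEmbedding}


def add_model_unchecked(req: Dict[str, Any]) -> Any:
    """Request fields pick the class and feed its constructor directly."""
    factory = req["llm_factory"]  # user-controlled key, used as-is
    return EMBEDDING_MODELS[factory](req["api_key"], req["llm_name"])


def add_model_checked(req: Dict[str, Any]) -> Any:
    """Same lookup, but with explicit allow-list and type checks first."""
    factory = req.get("llm_factory")
    if not isinstance(factory, str) or factory not in EMBEDDING_MODELS:
        raise ValueError("unknown llm_factory")
    api_key, name = req.get("api_key"), req.get("llm_name")
    if not isinstance(api_key, str) or not isinstance(name, str):
        raise ValueError("api_key and llm_name must be strings")
    return EMBEDDING_MODELS[factory](api_key, name)


model = add_model_checked(
    {"llm_factory": "Dummy", "llm_name": "id", "api_key": "dummy_key"}
)
```

How exploitable the unchecked form is depends on what the registered classes do with the parameters they receive, which this toy registry does not model.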

Another issue is that LLMs aren't deterministic – they may provide different results for the same prompt at different times – so multiple runs may be required. Nonetheless, McInerney says that Vulnhuntr is a significant improvement over the current generation of static analyzers.
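One way to cope with that variability is to repeat the scan and keep only the findings that recur. The sketch below assumes this approach. It is not a documented Vulnhuntr feature, and the stable_findings helper, the 60 percent threshold, and the simulated per-run results are all made up for illustration.

```python
from collections import Counter
from typing import Callable, List, Set


def stable_findings(scan: Callable[[], Set[str]],
                    runs: int,
                    min_fraction: float) -> List[str]:
    """Run `scan` several times; keep findings seen in enough of the runs."""
    counts: Counter = Counter()
    for _ in range(runs):
        counts.update(scan())  # each run contributes at most 1 per finding
    return sorted(v for v, n in counts.items() if n / runs >= min_fraction)


# Simulated per-run results standing in for real LLM-backed scans.
SIMULATED_RUNS: List[Set[str]] = [
    {"RCE"}, {"RCE", "XSS"}, {"RCE"}, set(), {"RCE"},
]
runs_iter = iter(SIMULATED_RUNS)

result = stable_findings(lambda: next(runs_iter), runs=5, min_fraction=0.6)
```

With these simulated runs, RCE appears in four of five scans and survives the threshold, while XSS appears once and is dropped. Each extra run also adds to the API bill.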

There's also some cost involved since the Claude API isn't free.

"My average use of it is to identify the one or two files in a project that handle remote user input and tell the tool to do analysis on just those couple files," said McInerney. "When used this way, it averages less than $0.50 of token usage. It will automatically find these network-related files as well, but it's a broad search that often sees it scanning 10-20 files instead of the 1-2 that give the best results usually. Depending on project size, scanning all the network-related files will still only cost ~$1-$3."

McInerney says he believes Vulnhuntr's discoveries represent the first time actual zero-day vulnerabilities have been identified in public projects by an AI-assisted tool.

"There are multiple papers purporting this and all are misleading because their AI did not discover zero-days, it was merely fed known vulnerable targets or code that it wasn't trained on and then said this was evidence their AI can find zero-days," he said. "As far as our research can tell, the release of Vulnhuntr will be the first time LLMs have actually found zero-days in the wild."

As an example, he pointed to a paper by academic researchers whose work we've covered previously.

Daniel Kang, assistant professor of computer science at the University of Illinois Urbana-Champaign, and a co-author on the cited paper and similar ones, told The Register that relying on simulated data is a common practice in security research.

"It is widely accepted that simulations of real-world environments are acceptable proxies for the real world," he said. "I can link to hundreds of security papers and press releases where security tools are used in simulated environments or on past real-world vulnerabilities and no one disputes these findings. The correct thing to say is that we simulate the zero-day setting, but again, this is widely accepted as common practice."

Kang, whose paper describes using teams of LLM agents to exploit zero-day vulnerabilities, noted that Vulnhuntr doesn't handle exploitation. He also said that in the absence of an analysis of false positives or a comparison to tools like ZAP, Metasploit, or BurpSuite, it's difficult to say how the tool compares to existing open source or proprietary alternatives.

According to McInerney, the vulnerabilities identified by Vulnhuntr are very easy to exploit once identified.

"The tool gives you a proof-of-concept exploit once it finds a vulnerability," he said. "It's not uncommon to need to make some kind of minor adjustment to the PoC to make it work, but it's obvious what adjustments to make after reading the analysis the LLM gives you as to why it's vulnerable."

We're told Vulnhuntr will be released on GitHub, presumably through a repo associated with Protect AI. The biz is also encouraging budding bug hunters to try the tool on open source projects listed on its bug bounty website, huntr.com. ®