AI code helpers just can't stop inventing package names

AI models just can't seem to stop making things up. As two recent studies point out, that proclivity underscores prior warnings not to rely on AI advice for anything that really matters.

One thing AI makes up quite often is the names of software packages.

As we noted earlier this year, Lasso Security found that large language models (LLMs), when generating sample source code, will sometimes invent names of software package dependencies that don't exist.

That's scary, because criminals could easily create a package that uses a name produced by common AI services and cram it full of malware. Then they just have to wait for a hapless developer to accept an AI's suggestion to use a poisoned package that incorporates a co-opted, corrupted dependency.

Researchers from the University of Texas at San Antonio, the University of Oklahoma, and Virginia Tech recently looked at 16 LLMs used for code generation to explore their penchant for making up package names.

In a preprint paper titled "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs," the authors explain that hallucinations are one of the unresolved shortcomings of LLMs.

That's perhaps not lost on the lawyers who last year used generative AI to cite non-existent court cases in legal briefs, and then had to make their own apologies to affected courts. But among those who find LLMs genuinely useful for coding assistance, it's a point that bears repeating.

"Hallucinations are outputs produced by LLMs that are factually incorrect, nonsensical, or completely unrelated to the input task," according to authors Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. "Hallucinations present a critical obstacle to the effective and safe deployment of LLMs in public-facing applications due to their potential to generate inaccurate or misleading information."

Maybe not "we've bet on the wrong horse" critical – more like "manageable with enough marketing and lobbying" critical.

LLMs already have been deployed in public-facing applications, thanks to the enthusiastic sellers of AI enlightenment and cloud vendors who just want to make sure all the expensive GPUs in their datacenters see some utilization. And developers, to hear AI vendors tell it, love coding assistant AIs. They apparently improve productivity and leave coders more confident in the quality of their work.

Even so, the researchers wanted to assess the likelihood that generative AI models will fabulate bogus packages. So they used 16 popular LLMs, both commercial and open source, to generate 576,000 code samples in JavaScript and Python, which rely respectively on the npm and PyPI package repositories.

The results left something to be desired.

"Our findings reveal that the average percentage of hallucinated packages is at least 5.2 percent for commercial models and 21.7 percent for open source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat," the authors state.

The 30 test runs against the set of research prompts generated 2.23 million package references – about 20 percent of which (440,445) turned out to be hallucinations. Of those, 205,474 were unique non-existent package names that could not be found in PyPI or npm.
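Spotting a hallucinated name is mechanically straightforward – which is also what makes the squatting scenario above so cheap for criminals to exploit, since anyone can query the public registries. Here's a minimal sketch of that kind of existence check, our own illustration rather than the paper's pipeline, using the PyPI JSON API and the npm registry (the second package name below is a made-up stand-in for a hallucination):

```python
# Minimal sketch of an existence check - our own illustration, not the
# paper's pipeline. PyPI's JSON API and the npm registry both return
# HTTP 404 for package names that were never published.
import requests

def exists_on_pypi(name: str) -> bool:
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return resp.status_code == 200

def exists_on_npm(name: str) -> bool:
    resp = requests.get(f"https://registry.npmjs.org/{name}", timeout=10)
    return resp.status_code == 200

# "requests" is real; the second name is a hypothetical hallucination.
for candidate in ("requests", "super-handy-scraper-utils"):
    print(candidate, "exists on PyPI:", exists_on_pypi(candidate))
```

Of course, existence alone proves nothing about trustworthiness: a squatter can register a hallucinated name in minutes, which is exactly the worry.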

What's noteworthy here – beyond the fact that commercial models are four times less likely than open source models to fabricate package names – is that these results show hallucination rates four to six times lower than Lasso Security's figures for GPT-3.5 (5.76 percent vs 24.2 percent) and GPT-4 (4.05 percent vs 22.2 percent). That counts for something.

Reducing the likelihood of package hallucinations comes at a cost. Using the DeepSeek Coder 6.7B and CodeLlama 7B models, the researchers implemented two mitigation strategies: Retrieval Augmented Generation (RAG), which supplies a list of valid package names to help guide prompt responses, and Supervised Fine-Tuning, which filters out invented packages and retrains the model. The result was reduced hallucination – at the expense of code quality.

"The code quality of the fine-tuned models did decrease significantly, -26.1 percent and -3.1 percent for DeepSeek and CodeLlama respectively, in exchange for substantial improvements in package hallucination rate," the researchers wrote.

Size matters too

In the other study exploring AI hallucination, José Hernández-Orallo and colleagues at the Valencian Research Institute for Artificial Intelligence in Spain found that LLMs become more unreliable as they scale up.

The researchers looked at three model families: OpenAI's GPT, Meta's LLaMA and BigScience's open source BLOOM. They tested the various models against scaled-up versions (more parameters) of themselves, with questions about addition, word anagrams, geographical knowledge, science, and information-oriented transformations.

They found that while the larger models – those shaped with fine-tuning and more parameters – are more accurate in their answers, they are less reliable.

That's because the smaller models will avoid responding to some prompts they can't answer, whereas the larger models are more likely to provide a plausible but wrong answer. So a greater share of their non-accurate responses are outright incorrect answers, with a commensurate reduction in avoided answers.

This trend was particularly noticeable in OpenAI's GPT family. The researchers found that GPT-4 will answer almost anything, where prior model generations would avoid responding in the absence of a reliable prediction.

Further compounding the problem, the researchers found that humans are bad at evaluating LLM answers – classifying incorrect answers as correct between around 10 and 40 percent of the time.

Based on their findings, Hernández-Orallo and his co-authors argue, "relying on human oversight for these systems is a hazard, especially for areas for which the truth is critical."

This is a long-winded way of rephrasing Microsoft's AI boilerplate, which warns not to use AI for anything important.

"[E]arly models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook," the researchers conclude.

"These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount." ®

Source: theregister.com
