
Meta's AI safety system defeated by the space bar

Meta's machine-learning model for detecting prompt injection attacks – special prompts to make neural networks behave inappropriately – is itself vulnerable to, you guessed it, prompt injection attacks.

Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "to help developers detect and respond to prompt injection and jailbreak inputs," the social network giant said.

Large language models (LLMs) are trained with massive amounts of text and other data, and may parrot it on demand, which isn't ideal if the material is dangerous, dubious, or includes personal info. So makers of AI models build filtering mechanisms called "guardrails" to catch queries and responses that may cause harm, such as those that coax a model into revealing sensitive training data on demand.

Those using AI models have made it a sport to circumvent guardrails using prompt injection – inputs designed to make an LLM ignore its internal system prompts that guide its output – or jailbreaks – inputs designed to make a model ignore safeguards.

This is a widely known and yet-to-be-solved problem. About a year ago, for example, computer scientists affiliated with Carnegie Mellon University developed an automated technique to generate adversarial prompts that break safety mechanisms. The risk of AI models that can be manipulated in this way is illustrated by a Chevrolet dealership in Watsonville, California, that saw its chatbot agree to sell a $76,000 Chevy Tahoe for $1.

Perhaps the most widely known prompt injection attack begins "Ignore previous instructions…" And a common jailbreak attack is the "Do Anything Now" or "DAN" attack that urges the LLM to adopt the role of DAN, an AI model without rules.

It turns out Meta's Prompt-Guard-86M classifier model can be asked to "Ignore previous instructions" if you just add spaces between the letters and omit punctuation.

Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass when analyzing the embedding weight differences between Meta's Prompt-Guard-86M model and Redmond's base model, microsoft/mdeberta-v3-base.

Prompt-Guard-86M was produced by fine-tuning the base model to make it capable of catching high-risk prompts. But Priyanshu found that the fine-tuning process had minimal effect on single English-language characters, and from that observation he was able to devise an attack.
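
Robust Intelligence hasn't published its exact analysis code, but the idea is straightforward to sketch in Python. The snippet below is illustrative only: it assumes both checkpoints load from Hugging Face under the IDs microsoft/mdeberta-v3-base and meta-llama/Prompt-Guard-86M (the latter is gated behind Meta's license) and that the two models share a tokenizer, then compares how far each token's input embedding moved during fine-tuning.

    # Illustrative sketch, not Robust Intelligence's actual script: measure how
    # far each token's input embedding moved between the mDeBERTa-v3 base model
    # and the fine-tuned Prompt-Guard classifier.
    import string
    import torch
    from transformers import AutoModel, AutoTokenizer

    BASE = "microsoft/mdeberta-v3-base"
    GUARD = "meta-llama/Prompt-Guard-86M"  # assumed model ID; gated on Hugging Face

    tok = AutoTokenizer.from_pretrained(BASE)  # assumes the tokenizer is shared
    base_emb = AutoModel.from_pretrained(BASE).get_input_embeddings().weight
    guard_emb = AutoModel.from_pretrained(GUARD).get_input_embeddings().weight

    def drift(token: str) -> float:
        """L2 distance between the base and fine-tuned embedding of one token."""
        idx = tok.convert_tokens_to_ids(token)
        if idx is None or idx == tok.unk_token_id:
            return float("nan")
        return torch.norm(guard_emb[idx] - base_emb[idx]).item()

    # Single ASCII letters vs the sub-word pieces of a prompt the classifier
    # is meant to catch -- the reported finding is that the letters barely move.
    letters = torch.tensor([drift(c) for c in string.ascii_lowercase])
    pieces = torch.tensor([drift(p) for p in tok.tokenize("Ignore previous instructions")])
    print("mean drift, single letters:", letters.nanmean().item())
    print("mean drift, prompt pieces: ", pieces.nanmean().item())

A near-zero drift for individual letters would mean the fine-tuned classifier never really learned what a prompt spelled out character by character looks like, which is exactly what the bypass exploits.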

"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt," explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. "This simple transformation effectively renders the classifier unable to detect potentially harmful content."

The finding is consistent with a post the security org made in May about how fine-tuning a model can break safety controls.

"Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Hyrum Anderson, CTO at Robust Intelligence, told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate."

Anderson acknowledged that Prompt-Guard is only a first line of defense, and that whatever model it was protecting might still balk at a malicious prompt. But he said the point of calling this out is to raise awareness among enterprises trying to use AI that there are a lot of things that can go wrong.

Meta did not immediately respond to a request for comment, though we're told the social media biz is working on a fix. ®

Source: go.theregister.com
