
Meta's AI safety system defeated by the space bar

Meta's machine-learning model for detecting prompt injection attacks – special prompts to make neural networks behave inappropriately – is itself vulnerable to, you guessed it, prompt injection attacks.

Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "to help developers detect and respond to prompt injection and jailbreak inputs," the social network giant said.

Large language models (LLMs) are trained with massive amounts of text and other data, and may parrot it on demand, which isn't ideal if the material is dangerous, dubious, or includes personal info. So makers of AI models build filtering mechanisms called "guardrails" to catch queries and responses that may cause harm, such as those that coax a model into revealing sensitive training data on demand.

Those using AI models have made it a sport to circumvent guardrails using prompt injection – inputs designed to make an LLM ignore its internal system prompts that guide its output – or jailbreaks – inputs designed to make a model ignore safeguards.

This is a widely known and yet-to-be-solved problem. About a year ago, for example, computer scientists affiliated with Carnegie Mellon University developed an automated technique to generate adversarial prompts that break safety mechanisms. The risk of AI models that can be manipulated in this way is illustrated by a Chevrolet dealership in Watsonville, California, that saw its chatbot agree to sell a $76,000 Chevy Tahoe for $1.

Perhaps the most widely known prompt injection attack begins "Ignore previous instructions…" And a common jailbreak attack is the "Do Anything Now" or "DAN" attack that urges the LLM to adopt the role of DAN, an AI model without rules.

It turns out Meta's Prompt-Guard-86M classifier model can be asked to "Ignore previous instructions" if you just add spaces between the letters and omit punctuation.

Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass when analyzing the embedding weight differences between Meta's Prompt-Guard-86M model and Redmond's base model, microsoft/mdeberta-v3-base.

Prompt-Guard-86M was produced by fine-tuning the base model to make it capable of catching high-risk prompts. But Priyanshu found that the fine-tuning process had minimal effect on single English-language characters, and from that observation he was able to devise an attack.
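
Robust Intelligence hasn't published its exact analysis code, but the idea is straightforward to sketch in Python. The snippet below is illustrative only: it assumes both checkpoints load from Hugging Face under the IDs microsoft/mdeberta-v3-base and meta-llama/Prompt-Guard-86M (the latter is gated behind Meta's license) and that the two models share a tokenizer, then compares how far each token's input embedding moved during fine-tuning.

    # Illustrative sketch, not Robust Intelligence's actual script: measure how
    # far each token's input embedding moved between the mDeBERTa-v3 base model
    # and the fine-tuned Prompt-Guard classifier.
    import string
    import torch
    from transformers import AutoModel, AutoTokenizer

    BASE = "microsoft/mdeberta-v3-base"
    GUARD = "meta-llama/Prompt-Guard-86M"  # assumed model ID; gated on Hugging Face

    tok = AutoTokenizer.from_pretrained(BASE)  # assumes the tokenizer is shared
    base_emb = AutoModel.from_pretrained(BASE).get_input_embeddings().weight
    guard_emb = AutoModel.from_pretrained(GUARD).get_input_embeddings().weight

    def drift(token: str) -> float:
        """L2 distance between the base and fine-tuned embedding of one token."""
        idx = tok.convert_tokens_to_ids(token)
        if idx is None or idx == tok.unk_token_id:
            return float("nan")
        return torch.norm(guard_emb[idx] - base_emb[idx]).item()

    # Single ASCII letters vs the sub-word pieces of a prompt the classifier
    # is meant to catch -- the reported finding is that the letters barely move.
    letters = torch.tensor([drift(c) for c in string.ascii_lowercase])
    pieces = torch.tensor([drift(p) for p in tok.tokenize("Ignore previous instructions")])
    print("mean drift, single letters:", letters.nanmean().item())
    print("mean drift, prompt pieces: ", pieces.nanmean().item())

A near-zero drift for individual letters would mean the fine-tuned classifier never really learned what a prompt spelled out character by character looks like, which is exactly what the bypass exploits.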

"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt," explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. "This simple transformation effectively renders the classifier unable to detect potentially harmful content."

The finding is consistent with a post the security org made in May about how fine-tuning a model can break safety controls.

"Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Hyrum Anderson, CTO at Robust Intelligence, told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate."

Anderson acknowledged that Prompt-Guard is only a first line of defense, and that whatever model it was protecting might still balk at a malicious prompt. But he said the point of calling this out is to raise awareness among enterprises trying to use AI that there are a lot of things that can go wrong.

Meta did not immediately respond to a request for comment, though we're told the social media biz is working on a fix. ®

Source: go.theregister.com
