
No major AI model is safe, but some do better than others

Feature Anthropic has positioned itself as a leader in AI safety, and in a recent analysis by Chatterbox Labs, that proved to be the case.

Chatterbox Labs tested eight major large language models (LLMs) and all were found to produce harmful content, though Anthropic's Claude 3.5 Sonnet fared better than rivals.

The UK-based biz offers a testing suite called AIMI that rates LLMs on various "pillars" such as "fairness," "toxicity," "privacy," and "security."

"Security" in this context refers to model safety – resistance to emitting harmful content – rather than the presence of potentially exploitable code flaws.

"What we look at on the security pillar is the harm that these models can do or can cause," explained Stuart Battersby, CTO of Chatterbox Labs.

When prompted with text input, LLMs try to respond with text output (there are also multi-modal models that can produce images or audio). They may be capable of producing content that's illegal – if, for example, prompted to provide a recipe for a biological weapon. Or they may provide advice that leads to injury or death.

"There are then a series of categories of things that organizations don't want these models to do, particularly on their behalf," said Battersby. "So our harm categories are things like talking about self-harm or sexually explicit material or security and malware and things like that."

The Security pillar of AIMI for GenAI tests whether a model will provide a harmful response when presented with a series of 30 challenge prompts per harm category.

"Some models will actually just quite happily answer you about these nefarious types of things," said Battersby. "But most models these days, particularly the newer ones, have some kind of sort of safety controls built into them."

But like any security mechanism, AI safety mechanisms, sometimes referred to as "guardrails," don't always catch everything.

"What we do on the security pillar is we say, let's simulate an attack on this thing," said Battersby. "And for an LLM, for a language model, that means designing prompts in a nefarious way. It's called jailbreaking. And actually, we haven't yet come across a model that we can't break in some way."

Chatterbox Labs tested the following models: Microsoft Phi 3.5 Mini Instruct (3.8b); Mistral AI 7b Instruct v0.3; OpenAI GPT-4o; Google Gemma 2 2b Instruct; TII Falcon 7b Instruct; Anthropic Claude 3.5 Sonnet (20240620); Cohere Command R; and Meta Llama 3.1 8b Instruct.

Graph of AI model safety test results

The company's report, provided to The Register, says, "The analysis shows that all the major models tested will produce harmful content. Except for Anthropic, harmful content was produced across all the harm categories. This means that the safety layers that are in these models are not sufficient to produce a safe model deployment across all the harm categories tested for."

It adds: "If you look at someone like Anthropic, they're the ones that actually did the best out of everyone," said Battersby. "Because they had a few categories where across all the jailbreaks, across some of the harm categories, the model would reject or redirect them. So whatever they're building into their system seems to be quite effective across some of the categories, whereas others are not."

The Register asked Anthropic whether someone could provide more information about how the company approaches AI safety, and heard back from Stuart Ritchie, research comms lead for Anthropic.

The Register: "Anthropic has staked out a position as the responsible AI company. Based on tests run by Chatterbox Labs' AIMI software, Anthropic's Claude 3.5 Sonnet had the best results. Can you describe what Anthropic does that's different from the rest of the industry?"

Ritchie: "Anthropic takes a unique approach to AI development and safety. We're deeply committed to empirical research on frontier AI systems, which is crucial for addressing the potential risks from future, highly advanced AI systems. Unlike many companies, we employ a portfolio approach that prepares for a range of scenarios, from optimistic to pessimistic. We're pioneers in areas like scalable oversight and process-oriented learning, which aim to create AI systems that are fundamentally safer and more aligned with human values.

"Importantly, with our Responsible Scaling Policy, we've made a commitment to only develop more advanced models if rigorous safety standards can be met, and we're open to external evaluation of both our models' capabilities and safety measures. We were the first in the industry to develop such a comprehensive, safety-first approach.

"Finally, we're also investing heavily in mechanistic interpretability, striving to truly understand the inner workings of our models. We've recently made some major advances in interpretability, and we're optimistic that this research will lead to safety breakthroughs further down the line."


The Register: "Can you elaborate on the process of creating model 'guardrails'? Is it mainly RLHF (reinforcement learning from human feedback)? And is the result fairly specific in the kind of responses that get blocked (ranges of text patterns) or is it fairly broad and conceptual (topics related to a specific idea)?

Ritchie: "Our approach to model guardrails is multifaceted and goes well beyond traditional techniques like RLHF. We've developed Constitutional AI, which is an innovative approach to training AI models to follow ethical principles and behave safely by having them engage in self-supervision and debate, essentially teaching themselves to align with human values and intentions. We also employ automated and manual red-teaming to proactively identify potential issues. Rather than simply blocking specific text patterns, we focus on training our models to understand and follow safe processes. This leads to a broader, more conceptual grasp of appropriate behavior.

"As our models become more capable, we continually evaluate and refine these safety techniques. The goal isn't just to prevent specific unwanted outputs, but to create AI systems with a robust, generalizable understanding of safe and beneficial behavior."

The Register: "To what extent does Anthropic see safety measures existing outside of models? E.g. you can alter model behavior with fine-tuning or with external filters – are both approaches necessary?"

Ritchie: "At Anthropic, we have a multi-layered strategy to address safety at every stage of AI development and deployment.

"This multi-layered approach means that, as you suggest, we do indeed use both types of alteration to the model’s behavior. For example, we use Constitutional AI (a variety of fine-tuning) to train Claude's character, ensuring that it hews to values of fairness, thoughtfulness, and open-mindedness in its responses. We also use a variety of classifiers and filters to spot potentially harmful or illegal inputs - though as previously noted, we'd prefer that the model learns to avoid responding to this kind of content rather than having to rely on the blunt instrument of classifiers."

The Register: "Is it important to have transparency into training data and fine-tuning to address safety concerns?"

Ritchie: "Much of the training process is confidential. By default, Anthropic does not train on user data."

The Register: "Has Anthropic's Constitutional AI had the intended impact? To help AI models help themselves?"

Ritchie: "Constitutional AI has indeed shown promising results in line with our intention. This approach has improved honesty, harm avoidance, and task performance in AI models, effectively helping them "help themselves."

"As noted above, we use a similar technique to Constitutional AI when we train Claude's character, showing how this technique can be used to enhance the model in even unexpected ways – users really appreciate Claude's personality and we have Constitutional AI to thank for this.

"Anthropic recently explored Collective Constitutional AI, involving public input to create an AI constitution. We solicited feedback from a representative sample of the US population on which values we should impart to Claude using our fine-tuning techniques. This experiment demonstrated that AI models can effectively incorporate diverse public values while maintaining performance, and highlighted the potential for more democratic and transparent AI development. While challenges remain, this approach represents a significant step towards aligning AI systems with broader societal values."

The Register: "What's the most pressing safety challenge that Anthropic is working on?"

Ritchie: "One of the most pressing safety challenges we're focusing on is scalable oversight for increasingly capable AI systems. As models become more advanced, ensuring they remain aligned with human values and intentions becomes both more crucial and more difficult. We're particularly concerned with how to maintain effective human oversight when AI capabilities potentially surpass human-level performance in many domains. This challenge intersects with our work on mechanistic interpretability, process-oriented learning, and understanding AI generalization.

"Another issue we’re addressing is adversarial robustness. This research involves developing techniques to make our models substantially less easy to 'jailbreak' – where users convince the models to bypass their guardrails and produce potentially harmful responses. With future highly capable systems, the risks from jailbreaking become all the larger, so it's important right now to develop techniques that make them robust to these kinds of attacks.

"We're striving to develop robust methods to guide and evaluate AI behavior, even in scenarios where the AI's reasoning might be beyond immediate human comprehension. This work is critical for ensuring that future AI systems, no matter how capable, remain safe and beneficial to humanity."

The Register:" Is there anything else you'd like to add?"

Ritchie: "We're not just developing AI; we're actively shaping a framework for its safe and beneficial integration into society. This involves ongoing collaboration with policymakers, ethicists, and other stakeholders to ensure our work aligns with broader societal needs and values. We're also deeply invested in fostering a culture of responsibility within the AI community, advocating for industry-wide safety standards and practices, and openly sharing issues like jailbreaks that we discover.

"Ultimately, our goal extends beyond creating safe AI models – we're striving to set a new standard for ethical AI development – a 'race to the top' that prioritizes human welfare and long-term societal benefit." ®

Source: theregister.com
