Meta and Google AI Safety Controls Can Be Stripped in Minutes, Testing Finds

The safety controls embedded in Meta's Llama 3.3 and Google's Gemma 3 open-weight AI models can be dismantled in under 10 minutes using freely available tools, according to Financial Times testing conducted with AI safety group Alice.

After modification, both models produced outputs on topics their creators explicitly prohibit, including biological weapons and malware creation.

The tool used, called Heretic and publicly available on GitHub, strips away post-training safety alignment, leaving the models willing to respond to virtually any prompt without restriction. Thousands of altered variants of popular open-weight models already circulate across developer platforms and forums.

The findings fuel the debate over accountability when modified versions generate dangerous content. Current regulatory frameworks lack clear answers on whether responsibility lies with the original developer, the modifier, the hosting platform, or the user.

Decentralized AI networks in the crypto sector attempt to reduce this risk by distributing governance across stakeholders. The broader AI governance conversation, meanwhile, demands structural solutions in which safety is built into models at a fundamental level.

For investors, the takeaway is that governments eyeing AI regulation now have concrete evidence that voluntary safety measures from major tech companies can be circumvented, which could reshape how models are distributed and increase interest in decentralized AI infrastructure.