A new preprint study led by Northeastern University researcher Caglar Yildirim finds that disclosing a mental health condition, even with the generic phrase 'I have a mental health condition', significantly alters how large language model agents respond.

Researchers tested models including DeepSeek 3.2, GPT 5.2, Gemini 3 Flash, Haiku 4.5, Opus 4.5, and Sonnet 4.5 using the AgentHarm benchmark. Across conditions, mental health disclosure increased refusals on harmful multi-step tasks, but also on legitimate, non-harmful requests.

The effect appears specific to mental health cues rather than disclosures of chronic illness or physical disability, and it varies by model architecture and safety tuning. Jailbreak-style prompts substantially reduced the cautionary effect, raising concerns about risk in real-world agentic, memory-enabled deployments.

Yildirim emphasized that the findings reflect AI-judged refusal signals, not verified real-world harm, and noted that OpenAI, Anthropic, and Google declined to comment.