How Fake AI Reasoning Unlocked Cocaine Recipe Instructions

MIT researchers found frontier LLMs grant harmful requests when fed forged reasoning traces, hitting 61% success on safety benchmarks

Jun 30, 2026

2 min read

Key Takeaways

Forged internal reasoning traces manipulated frontier LLMs into providing cocaine synthesis instructions.
CoT Forgery raised attack success rates from near zero to roughly 61% on StrongREJECT benchmark.
LLMs assign trust based on linguistic style, not role tags, leaving a structural security flaw.

Telling an AI you’re wearing a green shirt shouldn’t unlock cocaine synthesis instructions. Yet that’s essentially what happened when MIT-affiliated researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell fed frontier LLMs a fake internal monologue declaring harmful requests acceptable based on nonsensical logic. The models complied — not because safety training failed in some edge case, but because LLMs judge trustworthiness by how text sounds, not where it comes from. Their ICML 2026 paper, “Prompt Injection as Role Confusion,” argues this isn’t a patchable bug. It’s structural.

The Trick: Faking the Model’s Inner Voice

Attackers don’t need to break the safety system — they just need to sound like they already own it.

Chain-of-Thought prompting — where models “think step by step” using a hidden reasoning trace before answering — has become standard for improving AI reasoning. CoT Forgery exploits that process directly. An attacker inserts fabricated reasoning into their prompt, mimicking the model’s own internal monologue style so precisely that the LLM treats it as its own prior deliberation. The forged block declares a dangerous request already vetted. The model skips its safety checks like a bouncer waving through someone who looks like staff.

On the StrongREJECT safety benchmark, attack success rates jumped from near zero to roughly 61% across multiple frontier models.

What the research actually found:

LLMs assign authority based on linguistic style, not role tags — text that sounds like assistant reasoning gets treated as trusted, regardless of which channel delivered it
CoT Forgery boosted attack success from near zero to approximately 61% on StrongREJECT across several strong models
The technique won the 2025 OpenAI Kaggle red-teaming contest, confirming real-world frontier model exposure
Human red-teamers reach near-100% success on live systems through adaptive iteration — static safety benchmarks miss this entirely
The flaw generalizes: malicious instructions hidden in documents or webpages exploit the same underlying role geometry

“Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.” — Ye, Cui, and Hadfield-Menell, ICML 2026

Why Role Tags Were Never Really a Security Boundary

The labels developers trust most were built for formatting, not security — and attackers figured that out first.

Those familiar labels — system, user, assistant, tool, think — started as formatting tricks to turn autocomplete engines into chatbots. They drifted into a de facto permission system nobody formally designed. The paper’s core diagnosis: security gets defined at the interface, but authority gets assigned in latent space. Think of it like identifying a surgeon by their scrubs instead of their hospital credentials. Attackers just dress their text in the right outfit, and the model waves them through.

Defense-in-depth helps. The right operational moves include:

Input sanitization
Output filtering
Least-privilege agent permissions
Stripping CoT-style content from untrusted context

OWASP GenAI lists Prompt Injection as the number one risk for LLM applications, and IBM security researchers note that no foolproof prevention exists under current architectures. Every patch addresses a symptom. The geometry flaw remains. These kinds of structural oversights echo broader tech scandals where systemic design flaws enabled widespread exploitation before anyone acted.

Until models develop genuine role perception — some internal mechanism that style alone can’t spoof — deploying LLM agents in security-sensitive environments means accepting an unresolved attack surface. Architectural controls and monitoring manage the risk. They don’t close it.