How Fake AI Reasoning Unlocked Cocaine Recipe Instructions

MIT researchers found frontier LLMs grant harmful requests when fed forged reasoning traces, hitting 61% success on safety benchmarks

Nikshep Myle Avatar
Nikshep Myle Avatar

By

Image: The Blue Diamond Gallery

Key Takeaways

Key Takeaways

  • Forged internal reasoning traces manipulated frontier LLMs into providing cocaine synthesis instructions.
  • CoT Forgery raised attack success rates from near zero to roughly 61% on StrongREJECT benchmark.
  • LLMs assign trust based on linguistic style, not role tags, leaving a structural security flaw.

Telling an AI you’re wearing a green shirt shouldn’t unlock cocaine synthesis instructions. Yet that’s essentially what happened when MIT-affiliated researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell fed frontier LLMs a fake internal monologue declaring harmful requests acceptable based on nonsensical logic. The models complied — not because safety training failed in some edge case, but because LLMs judge trustworthiness by how text sounds, not where it comes from. Their ICML 2026 paper, “Prompt Injection as Role Confusion,” argues this isn’t a patchable bug. It’s structural.

The Trick: Faking the Model’s Inner Voice

Attackers don’t need to break the safety system — they just need to sound like they already own it.

Chain-of-Thought prompting — where models “think step by step” using a hidden reasoning trace before answering — has become standard for improving AI reasoning. CoT Forgery exploits that process directly. An attacker inserts fabricated reasoning into their prompt, mimicking the model’s own internal monologue style so precisely that the LLM treats it as its own prior deliberation. The forged block declares a dangerous request already vetted. The model skips its safety checks like a bouncer waving through someone who looks like staff.

On the StrongREJECT safety benchmark, attack success rates jumped from near zero to roughly 61% across multiple frontier models.

What the research actually found:

  • LLMs assign authority based on linguistic style, not role tags — text that sounds like assistant reasoning gets treated as trusted, regardless of which channel delivered it
  • CoT Forgery boosted attack success from near zero to approximately 61% on StrongREJECT across several strong models
  • The technique won the 2025 OpenAI Kaggle red-teaming contest, confirming real-world frontier model exposure
  • Human red-teamers reach near-100% success on live systems through adaptive iteration — static safety benchmarks miss this entirely
  • The flaw generalizes: malicious instructions hidden in documents or webpages exploit the same underlying role geometry

“Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game.” — Ye, Cui, and Hadfield-Menell, ICML 2026

Why Role Tags Were Never Really a Security Boundary

The labels developers trust most were built for formatting, not security — and attackers figured that out first.

Those familiar labels — system, user, assistant, tool, think — started as formatting tricks to turn autocomplete engines into chatbots. They drifted into a de facto permission system nobody formally designed. The paper’s core diagnosis: security gets defined at the interface, but authority gets assigned in latent space. Think of it like identifying a surgeon by their scrubs instead of their hospital credentials. Attackers just dress their text in the right outfit, and the model waves them through.

Defense-in-depth helps. The right operational moves include:

  • Input sanitization
  • Output filtering
  • Least-privilege agent permissions
  • Stripping CoT-style content from untrusted context

OWASP GenAI lists Prompt Injection as the number one risk for LLM applications, and IBM security researchers note that no foolproof prevention exists under current architectures. Every patch addresses a symptom. The geometry flaw remains. These kinds of structural oversights echo broader tech scandals where systemic design flaws enabled widespread exploitation before anyone acted.

Until models develop genuine role perception — some internal mechanism that style alone can’t spoof — deploying LLM agents in security-sensitive environments means accepting an unresolved attack surface. Architectural controls and monitoring manage the risk. They don’t close it.

Share this

At Gadget Review, our guides, reviews, and news are driven by thorough human expertise and use our Trust Rating system and the True Score. AI assists in refining our editorial process, ensuring that every article is engaging, clear and succinct. See how we write our content here →