Security researchers have identified a novel method capable of bypassing GPT-5’s built-in safeguards, enabling the model to produce harmful outputs without requiring explicitly malicious prompts.
The approach, developed by the team at NeuralTrust, merges the “Echo Chamber” attack with a narrative-based strategy that uses storytelling to gradually influence the model’s responses while avoiding detection.
This technique builds upon a prior jailbreak carried out against Grok-4 just two days after its launch, where researchers combined Echo Chamber with the “Crescendo” method to escalate prompt intensity across multiple interactions—eventually obtaining step-by-step instructions for constructing a Molotov cocktail. In the GPT-5 case, Crescendo was replaced with narrative-driven guidance to achieve a similar effect.
How the Jailbreak Works
The researchers initiated the attack by embedding subtle, strategically chosen keywords into otherwise harmless text, then developing a fictional storyline around them.
This narrative structure acted as a cover, enabling potentially harmful procedural details to surface naturally as the plot evolved—without directly asking for prohibited instructions or triggering standard refusal mechanisms.
The process followed four key stages:
- Introduce low-visibility “poisoned” context within benign sentences.
- Maintain a coherent storyline to disguise the underlying intent.
- Request elaborations that keep the plot consistent while deepening context.
- Adjust narrative stakes or perspective if the conversation stagnates.
In one test scenario, the team used a survival-themed plot and asked GPT-5 to incorporate words like “cocktail,” “story,” “survival,” “molotov,” “safe,” and “lives.” Over multiple expansions of the story, GPT-5 eventually included increasingly technical, step-by-step information—fully embedded in the fictional narrative.
Risks and Security Implications
NeuralTrust’s findings suggest that themes involving urgency, safety, and survival increase the likelihood of GPT-5 advancing toward unsafe content. Because the harmful output develops gradually across multiple exchanges, traditional keyword-based filtering proved ineffective.
“The model works to maintain consistency with the established narrative,” the researchers explained, “and this consistency pressure can subtly move the conversation toward the attacker’s objective.”
To mitigate these risks, the study recommends:
- Conversation-level monitoring to detect gradual manipulation.
- Detection of persuasion patterns and narrative-based steering.
- Implementation of robust AI access gateways to block multi-turn exploit attempts.
While GPT-5’s guardrails remain effective against direct malicious queries, this research highlights how strategic, multi-step dialogue framed as harmless storytelling can still serve as a significant threat vector.
Source: https://www.infosecurity-magazine.com/news/chatgpt5-bypassed-using-story