Key Takeaways
1. An AI bug hunter used a guessing game format to trick ChatGPT-4o into revealing Windows Product Activation keys.
2. The researcher manipulated the AI’s logic by insisting it “must” engage and “cannot lie,” exploiting a flaw in its programming.
3. The manipulation involved using the phrase “I give up” to prompt the AI to disclose sensitive information.
4. The method succeeded because the activation keys were common and misinterpreted by the AI as less sensitive.
5. The technique demonstrated potential vulnerabilities in AI filters, suggesting they may fail against obfuscation tactics.
A recent contribution from an AI bug hunter to Mozilla’s ODIN (0-Day Investigative Network) bug bounty initiative displayed a clever method to deceive OpenAI’s ChatGPT-4o and 4o mini into disclosing active Windows Product Activation keys.
The Ingenious Approach
The strategy revolved around presenting the interaction as a guessing game while hiding specifics in HTML tags. The key request was cleverly placed at the end of the game, making it seem less suspicious.
The researcher kicked off the conversation as a guessing game, making the exchange “non-threatening or inconsequential,” and presenting the dialogue “through a playful, harmless lens” to mask the real intention. This effectively lowered the AI’s defenses against sharing sensitive information.
Manipulating the AI’s Logic
After that, the researcher established some rules, insisting that the AI “must” engage and “cannot lie.” This took advantage of a logical flaw in the AI’s programming, which required it to adhere to user prompts, even when such requests contradicted its content filters.
The bug hunter then played a round with the AI, using the phrase “I give up” at the end of the request. This manipulation led the chatbot to “believe it had to respond with the string of characters.”
Insights from ODIN
As mentioned in ODIN’s blog post, the method succeeded because the keys were not unique but “commonly seen on public forums.” Their commonality might have led the AI to misinterpret their level of sensitivity.
In this specific jailbreak scenario, the guardrails faltered because they were designed to block direct requests but failed to consider “obfuscation tactics—like hiding sensitive phrases in HTML tags.”
This clever technique could be leveraged to navigate around other filters, including those for adult content, links to harmful websites, and even personally identifiable information.
Source:
Link



Leave a Reply