GPT-5.5 tops LLM security challenge as Gemini refuses to participate

Key Takeaways

– GPT-5.5 was the top performer, solving 7/10 runs at $9.46 per solve.
– DeepSeek V4 Pro was the cost-efficiency leader, solving 3/10 runs at $0.62 per solve (15x cheaper than GPT-5.5).
– Claude Opus 4.8 got close multiple times but was stopped by safety guardrails.
– Gemini 3.1 Pro Preview and Gemini 3.5 Flash performed the worst, with frequent early refusals.
– Chinese models were more willing to interact with live databases, while Western models hesitated mid-task.


Security Researcher Drops One of the Year’s Most Revealing AI Capability Tests

Kasra Rahjerdi, a professional app security researcher, has published a fascinating experiment that pits more than a dozen AI models against a real-world cybersecurity challenge. He built a deliberately vulnerable book review app that contained a critical flaw: exposed Firebase credentials hidden inside the APK. These credentials allow direct database access, bypassing the app's otherwise hardened API. Rahjerdi then gave each AI model a $10 budget and two hours per run, spending a total of $1,500 across all test runs.

GPT-5.5 Dominates with Consistency and Speed

GPT-5.5 was the clear standout, solving the challenge in 7 out of 10 runs at a cost of $9.46 per successful exploit. Almost every successful run focused on the Firebase vulnerability immediately after unpacking the APK, without getting sidetracked by the API or the app's surface features. This kind of focus could be a game changer for automated security testing.

DeepSeek V4 Pro emerged as the cost-efficiency champion, solving 3 out of 10 runs at just $0.62 per solve. That makes it roughly 15 times cheaper per success than GPT-5.5, despite a lower overall solve rate. For any organization scaling security operations, a cost difference of that size is hard to ignore.
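The "roughly 15 times cheaper" figure can be checked directly from the reported numbers. The snippet below is a minimal sketch of that arithmetic in Python; the `ModelResult` class and `cost_ratio` helper are hypothetical names introduced here for illustration and are not part of Rahjerdi's tooling. It uses only the solve counts and per-solve costs quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical container for the figures quoted in the article."""
    name: str
    solves: int            # successful runs
    runs: int              # total runs attempted
    cost_per_solve: float  # USD per successful solve, as reported

    @property
    def solve_rate(self) -> float:
        # Fraction of runs that ended in a successful exploit.
        return self.solves / self.runs


def cost_ratio(expensive: ModelResult, cheap: ModelResult) -> float:
    # How many times more one model costs per solve than the other.
    return expensive.cost_per_solve / cheap.cost_per_solve


gpt = ModelResult("GPT-5.5", solves=7, runs=10, cost_per_solve=9.46)
deepseek = ModelResult("DeepSeek V4 Pro", solves=3, runs=10, cost_per_solve=0.62)

ratio = cost_ratio(gpt, deepseek)
print(f"{gpt.name}: {gpt.solve_rate:.0%} solve rate, ${gpt.cost_per_solve:.2f} per solve")
print(f"{deepseek.name}: {deepseek.solve_rate:.0%} solve rate, ${deepseek.cost_per_solve:.2f} per solve")
print(f"Cost ratio per solve: {ratio:.1f}x")
```

The ratio works out to a little over 15, which matches the article's rounding. Note that it compares cost per success only; it says nothing about the gap in solve rate (70% versus 30%), so the two numbers are best read side by side.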

Claude Models Show Promise but Hit Guardrails

Claude Sonnet 4.6 and Claude Opus 4.8 both solved 2 out of 10 runs, but Opus in particular showed impressive potential by getting very close to a solution multiple times. The catch is that Opus was often halted mid-session by its own safety guardrails, which prevented it from completing the exploit. This highlights a key tension in AI security testing: models that are too cautious can fail to finish the job.

At the bottom of the pack sits Gemini. Gemini 3.1 Pro Preview refused to even attempt the challenge in nearly every run, as reflected in a median token count of just 9k compared to 100k+ for every other model. Gemini 3.5 Flash wasn't much better, with frequent early refusals and only two runs that actually tried to solve the problem at all.

Cultural Divide in AI Security Testing

Rahjerdi observed a clear pattern: Chinese models were far more willing to interact directly with live databases, while Western models showed more hesitation mid-task, even when they had correctly identified the right approach. The researcher also notes that this is not a scientific evaluation, just a well-documented experiment. But for anyone watching the AI security landscape, the results say a lot about where these models really stand.
