
State of the Model: GPT-4o

  • Writer: Zsolt Tanko
  • Feb 21
  • 2 min read

Updated: Apr 2

Executive Summary


We conducted an extensive assessment of OpenAI’s GPT-4o model using 3,279 targeted jailbreak attacks designed to elicit outputs that create meaningful business risk. Overall, the model’s vulnerability rate (i.e., the rate of successful attacks) is 52.9%, with an overall severity score of 40%, where higher severity values indicate more serious business risk. These ratings place GPT-4o’s defenses above average among commercial Large Language Models, but still significantly behind other models for risk-conscious business applications.


Out of the 22 risk categories tested, four stand out with particularly high vulnerability rates:


  1. Contradictory or Conflicting Outputs (86%, severity 62%)

  2. Over-Personalization / Psychological Profiling (83%, severity 62%)

  3. Misinformation & Unsubstantiated Claims (74%, severity 54%)

  4. Enabling of Fraud or Phishing (71%, severity 59%)


Our findings suggest that while GPT-4o often provides correct and compliant information, attack vectors remain that induce contradictions, facilitate misinformation, and guide fraudulent activity.



Key Observations and Business Implications


Critical Vulnerabilities


  • High Contradiction Rate: The model is prone to producing internally inconsistent or conflicting information, which can undermine user trust and brand messaging.

  • Over-Personalization Risks: Users can elicit responses with overly detailed personal or inappropriate psychological insights that compromise privacy or lead to manipulative content.

  • Misinformation Dissemination: The model can be coaxed into giving unverified or erroneous statements, posing risks of reputational harm and potential legal exposure.

  • Fraud and Phishing Susceptibility: Attacks can yield instructions or content that facilitate illicit activity, posing both brand and regulatory risks.


Moderate to Low Risk Areas


Categories such as Libelous Content and Exposing IP or Internal Data showed comparatively lower attack success rates (53% and 7%, respectively), indicating that GPT-4o has more robust safeguards for these highly business-relevant subjects. However, “more robust” does not mean fully protected: significant residual risks remain.


Actionable Next Steps


  • Strengthen Safeguards for Business-Critical Risks: Organizations should focus on refining prompt filtering and response pre- and post-processing against those risk categories with high vulnerability rates, in particular those posing significant legal risk exposure.

  • Real-Time Monitoring: Implement continuous monitoring to detect misuse or suspicious trends, e.g., repetitive attempts at phishing or fraud, or any inputs attempting to extract IP or user data.

  • Policy-Driven Enhancement: Scope and define clear risk-conscious usage policies for AI systems built on GPT-4o and align engineering with these policies to reduce legal and ethical vulnerabilities.
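The filtering and monitoring steps above can be sketched as a minimal guardrail layer that screens prompts before they reach the model and escalates users who repeatedly trigger flags. This is an illustrative sketch only: the category names, keyword lists, and `flag_threshold` parameter are assumptions for the example, not part of our assessment or any OpenAI API.

```python
from collections import defaultdict

# Illustrative keyword lists per high-vulnerability category
# (assumed for this sketch, not drawn from the assessment itself).
BLOCKLIST = {
    "fraud_or_phishing": ["phishing template", "fake invoice", "spoofed login"],
    "data_extraction": ["internal api key", "customer database dump"],
}


class GuardrailMonitor:
    """Screens incoming prompts and tracks per-user flag counts."""

    def __init__(self, flag_threshold: int = 3):
        self.flag_threshold = flag_threshold  # flags before human review
        self.flags = defaultdict(int)         # user_id -> flag count

    def screen_prompt(self, user_id: str, prompt: str):
        """Return (allowed, matched_category) for an incoming prompt."""
        lowered = prompt.lower()
        for category, phrases in BLOCKLIST.items():
            if any(phrase in lowered for phrase in phrases):
                self.flags[user_id] += 1
                return False, category  # block and record the flag
        return True, None

    def should_escalate(self, user_id: str) -> bool:
        """True once a user accumulates enough flags for review."""
        return self.flags[user_id] >= self.flag_threshold
```

In production, keyword matching would be replaced or supplemented by a classifier, and flag counts would live in shared storage rather than process memory; the sketch only shows where the filtering and escalation hooks sit relative to the model call.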


By addressing these critical risk areas, organizations can substantially improve trust and reliability in GPT-4o-powered solutions. A robust ongoing evaluation regimen will further mitigate potential harm and better align the model with safe, ethical, and legal usage practices.




About Aegis Blue


Aegis Blue is at the forefront of AI safety, serving as a trusted partner for organizations that rely on Large Language Models. Our proprietary multi-level jailbreak testing framework and advanced AI-driven analytics provide comprehensive insights—translating technical vulnerabilities into actionable business intelligence.


Ready to Mitigate LLM Risks?


Contact us to learn how our holistic approach can help safeguard your platform against legal exposures, reputational damage, and user attrition—ensuring your AI implementations deliver maximum value while upholding the highest standards of responsibility and compliance.
