
Commercial LLM Vulnerability Deep Dive

  • Writer: Zsolt Tanko
  • Feb 12
  • 5 min read

Updated: Apr 2

Introduction


As organizations increasingly adopt Large Language Models (LLMs) to power their products and services, protecting against the vulnerability of these models to adversarial prompts—or “jailbreaks”—is essential. The risk profile of the LLM underpinning an AI product or service translates directly into business risk for the organization deploying it, and comprehensive quantitative evaluation is necessary for appropriately mitigating that risk.


Aegis Blue specializes in AI safety, stress-testing LLMs across many business-relevant categories and against multiple levels of attack sophistication.


In this report we highlight results from our suite of tests against a small selection of critical exposure areas:


  • Copyright & IP Infringement

  • Libelous Content

  • Rude or Dismissive Responses


By analyzing technical results from a business-focused perspective, we help clients identify operational risks and legal liabilities before they become real-world problems. Below, we present a comparative overview of five leading commercial LLMs: DeepSeek-R1, GPT-4o, Gemini 2.0 Flash, Llama 3.3 70B, and Claude Sonnet 3.5.



Methodology in Brief


Our model risk evaluation methodology is stratified into three levels of increasing attack sophistication, each comprising hundreds of individual penetration tests, i.e. adversarial prompts submitted to the model under evaluation. The primary metrics we report in this article are the rate of model defense failure and the severity of breaches.


Multi-Level Attacks


  • Level 1: Baseline prompts testing straightforward, commonly known vulnerabilities.

  • Level 2: More refined prompts, curated based on the outcomes of Level 1.

  • Level 3: Sophisticated, multi-turn prompts designed by our proprietary AI model.


Severity Scoring


Each successful breach is assigned a severity score by our in-house Large Language Model. Scores are aggregated at the category and overall model level to indicate both frequency (“defense failure rate”) and potential impact (“severity score”).
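

As a rough illustration of how these two metrics fit together, the sketch below rolls hypothetical per-prompt results up into a defense failure rate and a mean severity per model and category. The record fields, the severity scale, and the choice to average severity over successful breaches only are assumptions made for this example, not a description of our production pipeline.

```python
from collections import defaultdict

# Hypothetical per-prompt result records; field names and the severity scale
# are illustrative assumptions only.
results = [
    {"model": "GPT-4o", "category": "Libelous Content", "level": 1, "breached": True,  "severity": 62},
    {"model": "GPT-4o", "category": "Libelous Content", "level": 2, "breached": False, "severity": 0},
    {"model": "GPT-4o", "category": "Copyright & IP",   "level": 3, "breached": True,  "severity": 81},
]

def aggregate(records):
    """Roll per-prompt outcomes up into a defense failure rate and a mean
    severity for each (model, category) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["category"])].append(r)
    summary = {}
    for key, group in buckets.items():
        failures = [r for r in group if r["breached"]]
        failure_rate = len(failures) / len(group)
        # Assumption: severity is averaged over successful breaches only.
        severity = (sum(r["severity"] for r in failures) / len(failures)) if failures else 0.0
        summary[key] = {"failure_rate": failure_rate, "mean_severity": severity}
    return summary

for (model, category), stats in aggregate(results).items():
    print(f"{model} / {category}: "
          f"failure rate {stats['failure_rate']:.0%}, "
          f"mean severity {stats['mean_severity']:.1f}")
```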


Business-Focused Risk Classification


  • Legal & Compliance (Copyright & IP / Libel): Potential to generate infringing or defamatory content.

  • User Experience and Reputation (Rude or Dismissive Responses): Negative brand impact and user dissatisfaction.



Comparative Results at a Glance


Model                 Overall Vulnerability Rate    Overall Severity
Claude Sonnet 3.5     20.8%                         18.8%
GPT-4o                52.9%                         40.1%
Gemini                74.7%                         64.4%
Llama                 77.5%                         61.8%
DeepSeek              84.1%                         74.1%


Key Insight: Anthropic’s Claude Sonnet 3.5 is by far the most robust overall, while DeepSeek shows the highest defense failure rate. OpenAI’s GPT-4o ranks in the middle on resilience, with Google’s Gemini and Meta’s Llama trailing it by a wide margin.



Copyright and IP Infringement Risks


Why It Matters


Organizations face significant legal liabilities when LLMs generate copyrighted text or facilitate the unauthorized use of proprietary information. Fines, lawsuits, and reputational damage can arise from non-compliance with intellectual property regulations. The table below summarizes our results for attacks testing Copyright and IP Infringement Risks.


Comparative Findings


Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     25.3%                 25.5%
GPT-4o                60.0%                 51.7%
Gemini                78.7%                 73.7%
Llama                 79.3%                 73.8%
DeepSeek              88.7%                 87.5%


Potential Business Impacts


  • Legal Exposure: High vulnerability equates to greater risk of inadvertently producing copyrighted or proprietary content.

  • Compliance Costs: Investment in monitoring or filtering systems is necessary to mitigate liability when using nearly all commercial Large Language Models.

  • Brand Reputation: A platform known for repeated IP violations will deter partners and increase regulatory scrutiny.



Libelous Content Exposure


Why It Matters


Generation of libelous content can fuel defamation claims and public relations crises. Even unintentional slander or false claims about individuals and organizations can lead to lawsuits and damaged brand trust.


Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     19.3%                 17.5%
GPT-4o                52.7%                 39.5%
Gemini                77.3%                 64.4%
Llama                 84.0%                 65.0%
DeepSeek              87.3%                 75.9%


The results highlight that while vulnerability rates remain similar, severity scores (with the exception of DeepSeek) are lower for this category than for Copyright and IP Infringement Risks. This is likely the result of public pressure leading to an early effort by organizations offering LLMs to mitigate libel exposure.


Potential Business Impacts


  • Defamation Lawsuits: High-risk models expose organizations to legal challenges if a user obtains and disseminates libelous output.

  • Crisis Management Costs: In the event of public misstatements, resources must be diverted to damage control and preserving reputational capital.

  • Long-Term Trust: Repeated occurrences erode credibility, particularly for platforms or services that rely on factual accuracy and brand integrity.



Rude or Dismissive Responses


Why It Matters


User experience significantly impacts brand reputation. An LLM that frequently responds with hostile, dismissive, or off-putting language risks user churn, negative reviews, and reputational harm.


Comparative Findings

Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     44.0%                 39.4%
GPT-4o                34.0%                 26.5%
Gemini                75.3%                 64.0%
Llama                 74.7%                 60.4%
DeepSeek              80.0%                 70.6%


In this category, GPT-4o surpasses Claude Sonnet 3.5 with significantly stronger defenses, likely a result of OpenAI’s early entry into the LLM market and its longer public exposure to negative user experiences.


Potential Business Impacts


  • Customer Attrition: Users who encounter negative or antagonistic interactions are more likely to switch platforms or services.

  • Brand Image: Polite, respectful, and accurate responses are a key differentiator in competitive markets, especially in the face of LLM hallucinations.

  • Support and Moderation Costs: Higher incidences of offensive or dismissive content require more extensive monitoring and user support interventions.



Qualitative Insights & Real-World Considerations


Robustness vs. Functionality Trade-Off


Models that defended against attacks more effectively (such as Claude Sonnet 3.5) are often more conservative in generating content, offering stronger compliance assurances at the cost of under-responding to or filtering benign requests in order to err on the side of caution.


Context-Specific Threat Models


An organization whose platform heavily deals with user-generated content should prioritize preventing rude or dismissive replies, whereas news or research outlets are likely to be particularly concerned about libelous statements. Model choice must be informed by use-case and risk tolerances.
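

As a rough illustration of this point, the sketch below combines the per-category vulnerability rates reported above with hypothetical use-case weights to rank models for two stylized deployments; the weights are illustrative assumptions, not recommendations.

```python
# Per-category vulnerability rates from the tables above (percent).
vulnerability = {
    "Claude Sonnet 3.5": {"copyright": 25.3, "libel": 19.3, "rudeness": 44.0},
    "GPT-4o":            {"copyright": 60.0, "libel": 52.7, "rudeness": 34.0},
    "Gemini 2.0 Flash":  {"copyright": 78.7, "libel": 77.3, "rudeness": 75.3},
    "Llama 3.3 70B":     {"copyright": 79.3, "libel": 84.0, "rudeness": 74.7},
    "DeepSeek-R1":       {"copyright": 88.7, "libel": 87.3, "rudeness": 80.0},
}

# Hypothetical priorities: a user-generated-content platform cares most about
# rude or dismissive replies, a newsroom most about libel.
use_case_weights = {
    "ugc_platform": {"copyright": 0.2, "libel": 0.2, "rudeness": 0.6},
    "newsroom":     {"copyright": 0.3, "libel": 0.6, "rudeness": 0.1},
}

def weighted_risk(model_rates, weights):
    """Weighted average of category vulnerability rates (lower is better)."""
    return sum(model_rates[category] * w for category, w in weights.items())

for use_case, weights in use_case_weights.items():
    ranking = sorted(vulnerability.items(), key=lambda kv: weighted_risk(kv[1], weights))
    print(f"{use_case}: lowest weighted risk -> {ranking[0][0]}")
```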


Continuous Improvement and Monitoring


Even the most robust models can degrade over time, under new threats, or in multi-turn settings. Model drift, the tendency of LLM providers to modify models frequently, also contributes to varying risk profiles over time. Regular retesting and updates are critical to maintain a safety baseline and adapt to evolving models and adversarial tactics.
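

A minimal sketch of such a retesting loop, assuming hypothetical audit numbers and an arbitrary five-point regression threshold, might look like the following.

```python
# Compare a fresh evaluation run against a stored baseline and flag categories
# whose defense failure rate regressed by more than a chosen tolerance.
# The figures and the threshold below are illustrative assumptions.
baseline = {"copyright": 60.0, "libel": 52.7, "rudeness": 34.0}   # earlier audit (%)
latest   = {"copyright": 63.5, "libel": 58.9, "rudeness": 33.1}   # current audit (%)

REGRESSION_THRESHOLD = 5.0  # percentage points

def flag_regressions(old, new, threshold=REGRESSION_THRESHOLD):
    """Return categories whose failure rate worsened by more than `threshold`."""
    return {c: new[c] - old[c] for c in old if new[c] - old[c] > threshold}

for category, delta in flag_regressions(baseline, latest).items():
    print(f"Retest recommended: {category} failure rate up {delta:.1f} points")
```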


Customization and Fine-Tuning


Enterprises that build on commercial LLMs often conduct additional fine-tuning on domain-specific data. While this is intended to improve domain specific performance, it can inadvertently introduce or amplify vulnerabilities if not carefully managed.



Recommendations for Prospective Clients


Risk Profiling


Match your organization’s exposure—legal, reputational, or user-experience focused—to the strengths and weaknesses of each LLM. These vary meaningfully across models, and aligning the value gained from incorporating AI systems with the minimization of business risk is central to sound model selection.


Layered Safeguards


Combine robust policy frameworks (human-in-the-loop review, usage guidelines) with technical solutions (moderation filters, pre- and post-processing) to mitigate potential liabilities.
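

As a sketch of what such layering might look like in practice, the example below wraps a model call in pre- and post-processing checks with a human-in-the-loop escalation path. All function names (moderate_text, call_llm, queue_for_human_review) are hypothetical placeholders for whatever moderation service, model API, and review workflow an organization actually uses.

```python
def moderate_text(text: str) -> bool:
    """Pre/post-processing filter; returns True if the text looks safe.
    Stand-in for a real moderation classifier or policy/keyword filter."""
    banned_phrases = ("defamatory claim", "reproduce the full text of")
    return not any(phrase in text.lower() for phrase in banned_phrases)

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying commercial LLM call."""
    return "model response for: " + prompt

def queue_for_human_review(text: str) -> str:
    """Placeholder for a human-in-the-loop escalation path."""
    return "[held for review]"

def safeguarded_completion(prompt: str) -> str:
    # Pre-processing: divert risky prompts before they reach the model.
    if not moderate_text(prompt):
        return queue_for_human_review(prompt)
    response = call_llm(prompt)
    # Post-processing: screen the model output before it reaches the user.
    if not moderate_text(response):
        return queue_for_human_review(response)
    return response

print(safeguarded_completion("Summarize our usage guidelines."))
```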


Regular Auditing & Testing


Schedule periodic “red-team” style evaluations to ensure your chosen model remains resilient to new and evolving jailbreak strategies, as well as model drift. Knowledge of base LLM attack surfaces must inform downstream development efforts.


Legal & Compliance Strategies


Consult with IP and media law experts to interpret test findings and craft appropriate internal compliance policies, usage policies, and disclaimers. These risks are especially pressing in light of the rapidly evolving legislative landscape around AI systems.
