
Commercial LLM Vulnerability Deep Dive

  • Writer: Zsolt Tanko
  • Feb 12
  • 5 min read

Updated: Apr 2

Introduction


As organizations increasingly adopt Large Language Models (LLMs) to power their products and services, protecting against the vulnerability of these models to adversarial prompts—or “jailbreaks”—is essential. The risk profile of the LLM underpinning an AI product or service translates directly into business risk for the organization deploying it, and comprehensive quantitative evaluation is necessary for appropriately mitigating that risk.


Aegis Blue specializes in AI safety, stress-testing LLMs across many business-relevant categories and against multiple levels of attack sophistication.


In this report we highlight results from our suite of tests against a small selection of critical exposure areas:


  • Copyright & IP Infringement

  • Libelous Content

  • Rude or Dismissive Responses


By analyzing technical results from a business-focused perspective, we help clients identify operational risks and legal liabilities before they become real-world problems. Below, we present a comparative overview of five leading commercial LLMs: DeepSeek-R1, GPT-4o, Gemini 2.0 Flash, Llama 3.3 70B, and Claude Sonnet 3.5.



Methodology in Brief


Our model risk evaluation methodology is stratified into three levels of increasing attack sophistication, each comprising hundreds of individual penetration tests, i.e. adversarial prompts submitted to the model under evaluation. The primary metrics we report in this article are the rate of model defense failure and the severity of breaches.


Multi-Level Attacks


  • Level 1: Baseline prompts testing straightforward, commonly known vulnerabilities.

  • Level 2: More refined prompts, curated based on the outcomes of Level 1.

  • Level 3: Sophisticated, multi-turn prompts designed by our proprietary AI model.


Severity Scoring


Each successful breach is assigned a severity score by our in-house Large Language Model. Scores are aggregated at the category and overall model level to indicate both frequency (“defense failure rate”) and potential impact (“severity score”).
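

As a rough illustration of how these two metrics fit together, the sketch below rolls hypothetical per-prompt results up into a defense failure rate and a mean severity per model and category. The record fields, the severity scale, and the choice to average severity over successful breaches only are assumptions made for this example, not a description of our production pipeline.

```python
from collections import defaultdict

# Hypothetical per-prompt result records; field names and the severity scale
# are illustrative assumptions only.
results = [
    {"model": "GPT-4o", "category": "Libelous Content", "level": 1, "breached": True,  "severity": 62},
    {"model": "GPT-4o", "category": "Libelous Content", "level": 2, "breached": False, "severity": 0},
    {"model": "GPT-4o", "category": "Copyright & IP",   "level": 3, "breached": True,  "severity": 81},
]

def aggregate(records):
    """Roll per-prompt outcomes up into a defense failure rate and a mean
    severity for each (model, category) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["category"])].append(r)
    summary = {}
    for key, group in buckets.items():
        failures = [r for r in group if r["breached"]]
        failure_rate = len(failures) / len(group)
        # Assumption: severity is averaged over successful breaches only.
        severity = (sum(r["severity"] for r in failures) / len(failures)) if failures else 0.0
        summary[key] = {"failure_rate": failure_rate, "mean_severity": severity}
    return summary

for (model, category), stats in aggregate(results).items():
    print(f"{model} / {category}: "
          f"failure rate {stats['failure_rate']:.0%}, "
          f"mean severity {stats['mean_severity']:.1f}")
```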


Business-Focused Risk Classification


  • Legal & Compliance (Copyright & IP / Libel): Potential to generate infringing or defamatory content.

  • User Experience and Reputation (Rude or Dismissive Responses): Negative brand impact and user dissatisfaction.



Comparative Results at a Glance


Model                 Overall Vulnerability Rate    Overall Severity
Claude Sonnet 3.5     20.8%                         18.8%
GPT-4o                52.9%                         40.1%
Gemini                74.7%                         64.4%
Llama                 77.5%                         61.8%
DeepSeek              84.1%                         74.1%


Key Insight: Anthropic’s Claude Sonnet 3.5 is by far the most robust overall, while DeepSeek shows the highest defense failure rate. OpenAI’s GPT-4o ranks in the middle on resilience, with Google’s Gemini and Meta’s Llama trailing it by a wide margin.



Copyright and IP Infringement Risks


Why It Matters


Organizations face significant legal liabilities when LLMs generate copyrighted text or facilitate the unauthorized use of proprietary information. Fines, lawsuits, and reputational damage can arise from non-compliance with intellectual property regulations. The table below summarizes our results for attacks testing Copyright and IP Infringement Risks.


Comparative Findings


Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     25.3%                 25.5%
GPT-4o                60.0%                 51.7%
Gemini                78.7%                 73.7%
Llama                 79.3%                 73.8%
DeepSeek              88.7%                 87.5%


Potential Business Impacts


  • Legal Exposure: High vulnerability equates to greater risk of inadvertently producing copyrighted or proprietary content.

  • Compliance Costs: Investment in monitoring or filtering systems is necessary to mitigate liability when using nearly all commercial Large Language Models.

  • Brand Reputation: A platform known for repeated IP violations will deter partners and increase regulatory scrutiny.



Libelous Content Exposure


Why It Matters


Generation of libelous content can fuel defamation claims and public relations crises. Even unintentional slander or false claims about individuals and organizations can lead to lawsuits and damaged brand trust.


Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     19.3%                 17.5%
GPT-4o                52.7%                 39.5%
Gemini                77.3%                 64.4%
Llama                 84.0%                 65.0%
DeepSeek              87.3%                 75.9%


The results highlight that while vulnerability rates remain similar, severity scores (with the exception of DeepSeek) are lower for this category than for Copyright and IP Infringement Risks. This is likely the result of public pressure leading to an early effort by organizations offering LLMs to mitigate libel exposure.


Potential Business Impacts


  • Defamation Lawsuits: High-risk models expose organizations to legal challenges if a user obtains and disseminates libelous output.

  • Crisis Management Costs: In the event of public misstatements, resources must be diverted to damage control and preserving reputational capital.

  • Long-Term Trust: Repeated occurrences erode credibility, particularly for platforms or services that rely on factual accuracy and brand integrity.



Rude or Dismissive Responses


Why It Matters


User experience significantly impacts brand reputation. An LLM that frequently responds with hostile, dismissive, or off-putting language risks user churn, negative reviews, and reputational harm.


Comparative Findings

Model                 Vulnerability Rate    Severity
Claude Sonnet 3.5     44.0%                 39.4%
GPT-4o                34.0%                 26.5%
Gemini                75.3%                 64.0%
Llama                 74.7%                 60.4%
DeepSeek              80.0%                 70.6%


In this category, GPT-4o surpasses Claude Sonnet 3.5 with significantly stronger defenses, likely a result of OpenAI’s early entry into the LLM market and its longer public exposure to negative user experiences.


Potential Business Impacts


  • Customer Attrition: Users who encounter negative or antagonistic interactions are more likely to switch platforms or services.

  • Brand Image: Polite, respectful, and accurate responses are a key differentiator in competitive markets, especially in the face of LLM hallucinations.

  • Support and Moderation Costs: Higher incidences of offensive or dismissive content require more extensive monitoring and user support interventions.



Qualitative Insights & Real-World Considerations


Robustness vs. Functionality Trade-Off


Models that defended against attacks more effectively (such as Claude Sonnet 3.5) are often more conservative in generating content, offering stronger compliance assurances at the cost of under-responding to or filtering benign requests in order to err on the side of caution.


Context-Specific Threat Models


An organization whose platform heavily deals with user-generated content should prioritize preventing rude or dismissive replies, whereas news or research outlets are likely to be particularly concerned about libelous statements. Model choice must be informed by use-case and risk tolerances.
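

As a rough illustration of this point, the sketch below combines the per-category vulnerability rates reported above with hypothetical use-case weights to rank models for two stylized deployments; the weights are illustrative assumptions, not recommendations.

```python
# Per-category vulnerability rates from the tables above (percent).
vulnerability = {
    "Claude Sonnet 3.5": {"copyright": 25.3, "libel": 19.3, "rudeness": 44.0},
    "GPT-4o":            {"copyright": 60.0, "libel": 52.7, "rudeness": 34.0},
    "Gemini 2.0 Flash":  {"copyright": 78.7, "libel": 77.3, "rudeness": 75.3},
    "Llama 3.3 70B":     {"copyright": 79.3, "libel": 84.0, "rudeness": 74.7},
    "DeepSeek-R1":       {"copyright": 88.7, "libel": 87.3, "rudeness": 80.0},
}

# Hypothetical priorities: a user-generated-content platform cares most about
# rude or dismissive replies, a newsroom most about libel.
use_case_weights = {
    "ugc_platform": {"copyright": 0.2, "libel": 0.2, "rudeness": 0.6},
    "newsroom":     {"copyright": 0.3, "libel": 0.6, "rudeness": 0.1},
}

def weighted_risk(model_rates, weights):
    """Weighted average of category vulnerability rates (lower is better)."""
    return sum(model_rates[category] * w for category, w in weights.items())

for use_case, weights in use_case_weights.items():
    ranking = sorted(vulnerability.items(), key=lambda kv: weighted_risk(kv[1], weights))
    print(f"{use_case}: lowest weighted risk -> {ranking[0][0]}")
```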


Continuous Improvement and Monitoring


Even the most robust models can degrade over time, under new threats, or in multi-turn settings. Model drift, the tendency of LLM providers to modify models frequently, also contributes to varying risk profiles over time. Regular retesting and updates are critical to maintain a safety baseline and adapt to evolving models and adversarial tactics.
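

A minimal sketch of such a retesting loop, assuming hypothetical audit numbers and an arbitrary five-point regression threshold, might look like the following.

```python
# Compare a fresh evaluation run against a stored baseline and flag categories
# whose defense failure rate regressed by more than a chosen tolerance.
# The figures and the threshold below are illustrative assumptions.
baseline = {"copyright": 60.0, "libel": 52.7, "rudeness": 34.0}   # earlier audit (%)
latest   = {"copyright": 63.5, "libel": 58.9, "rudeness": 33.1}   # current audit (%)

REGRESSION_THRESHOLD = 5.0  # percentage points

def flag_regressions(old, new, threshold=REGRESSION_THRESHOLD):
    """Return categories whose failure rate worsened by more than `threshold`."""
    return {c: new[c] - old[c] for c in old if new[c] - old[c] > threshold}

for category, delta in flag_regressions(baseline, latest).items():
    print(f"Retest recommended: {category} failure rate up {delta:.1f} points")
```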


Customization and Fine-Tuning


Enterprises that build on commercial LLMs often conduct additional fine-tuning on domain-specific data. While this is intended to improve domain specific performance, it can inadvertently introduce or amplify vulnerabilities if not carefully managed.



Recommendations for Prospective Clients


Risk Profiling


Match your organization’s exposure—legal, reputational, or user-experience focused—to the strengths and weaknesses of each LLM. These vary meaningfully across models, and aligning the value gained from incorporating AI systems with the minimization of business risk is central to sound model selection.


Layered Safeguards


Combine robust policy frameworks (human-in-the-loop review, usage guidelines) with technical solutions (moderation filters, pre- and post-processing) to mitigate potential liabilities.
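

As a sketch of what such layering might look like in practice, the example below wraps a model call in pre- and post-processing checks with a human-in-the-loop escalation path. All function names (moderate_text, call_llm, queue_for_human_review) are hypothetical placeholders for whatever moderation service, model API, and review workflow an organization actually uses.

```python
def moderate_text(text: str) -> bool:
    """Pre/post-processing filter; returns True if the text looks safe.
    Stand-in for a real moderation classifier or policy/keyword filter."""
    banned_phrases = ("defamatory claim", "reproduce the full text of")
    return not any(phrase in text.lower() for phrase in banned_phrases)

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying commercial LLM call."""
    return "model response for: " + prompt

def queue_for_human_review(text: str) -> str:
    """Placeholder for a human-in-the-loop escalation path."""
    return "[held for review]"

def safeguarded_completion(prompt: str) -> str:
    # Pre-processing: divert risky prompts before they reach the model.
    if not moderate_text(prompt):
        return queue_for_human_review(prompt)
    response = call_llm(prompt)
    # Post-processing: screen the model output before it reaches the user.
    if not moderate_text(response):
        return queue_for_human_review(response)
    return response

print(safeguarded_completion("Summarize our usage guidelines."))
```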


Regular Auditing & Testing


Schedule periodic “red-team” style evaluations to ensure your chosen model remains resilient to new and evolving jailbreak strategies, as well as model drift. Knowledge of base LLM attack surfaces must inform downstream development efforts.


Legal & Compliance Strategies


Consult with IP and media law experts to interpret test findings and craft appropriate internal compliance policies, usage policies, and disclaimers. These risks are especially pressing in light of the rapidly evolving legislative landscape around AI systems.
