Deploying AI agents in enterprise environments introduces complex security challenges, with prompt injection standing out as a critical vulnerability. Anthropic's Claude Sonnet 5, while not a frontier model in raw capability, delivers a substantial leap in security posture, particularly in its resilience against prompt injection attacks. This article delves into the technical specifics of Sonnet 5's system card, analyzing its improvements in agentic safety, cyber capabilities, and alignment, providing actionable insights for developers and security professionals integrating AI agents into their systems. The most compelling metric for agent deployments is the dramatic reduction in prompt injection attack success on browser use, dropping from approximately 50% in Sonnet 4.6 to under 1% in Sonnet 5, and effectively 0% with enabled safeguards. This shift is paramount for securing real-world AI agent applications.
Understanding Sonnet 5's Position in the Risk Landscape
Anthropic explicitly positions Claude Sonnet 5 as an incremental improvement over Sonnet 4.6, situated below Opus 4.8 and significantly less capable than the Mythos class models. This deliberate positioning is a key security statement. A model that does not push the frontier of raw capabilities presents a more bounded and predictable risk profile, simplifying threat surface analysis for security teams. The safeguards implemented for Sonnet 5 are scaled to its capabilities, aligning with the protection levels of Opus 4.7 and Opus 4.8, rather than the more stringent controls applied to the higher-capability Mythos models.
A crucial methodological detail highlighted in the system card is that nearly all reported benchmarks measure the model with deployment-time safeguards disabled. This approach characterizes the raw model's capabilities, providing a lower bound for its security posture. In a production environment, with Anthropic's full product stack of classifiers, probes, and harness defenses, the deployed system is designed to be significantly more resilient to attacks than the isolated model described in the evaluations. This distinction is vital for interpreting the reported figures accurately.
Cyber Capabilities: A Measured Improvement
Claude Sonnet 5 was not specifically trained for cybersecurity tasks; its observed cyber capabilities are a byproduct of general intelligence gains. The system card evaluates Sonnet 5 across four key cyber benchmarks, consistently showing improvement over Sonnet 4.6 but remaining below Opus 4.8 and far behind Mythos 5.
- ExploitBench: Sonnet 5 demonstrated increased capability in the software exploitation pipeline against real V8 vulnerabilities, capturing more capability flags than 4.6. However, it did not achieve a single complete arbitrary-code-execution exploit.
- OSS-Fuzz: In unguided vulnerability discovery and exploitation, Sonnet 5 improved upon 4.6, which often failed to score, though it lagged slightly behind Opus 4.8.
- CyberGym: This benchmark, focused on targeted vulnerability reproduction, showed a regression for Sonnet 5, reproducing approximately 53% of vulnerabilities compared to 4.6's 65%. This indicates that newer models do not uniformly exhibit increased offensive capabilities across all axes.
- Firefox 147: A real-world browser exploit-development task yielded only marginal gains for Sonnet 5, with no full working exploits produced.
Significantly, when Anthropic's default mitigations are enabled, Sonnet 5 scored zero on OSS-Fuzz, CyberGym, and Firefox 147. This underscores the effectiveness of the deployed safeguards in neutralizing potential offensive capabilities. For legitimate security professionals and pentesters whose dual-use work might be affected by these controls, Anthropic offers a Cyber Verification Program for exemptions.
Agentic Safety: Critical for Enterprise Deployments
For security practitioners, the agentic safety section of the system card is particularly critical, as it addresses how the model behaves when interacting with tools and environments—a common configuration in enterprise AI agent deployments. This section covers malicious use of coding and computer-use agents, autonomous influence operations, and prompt injection robustness.
Malicious Use of Claude Code
Sonnet 5 shows a substantial improvement in refusing malicious coding requests. In evaluations involving prompts for malware, DDoS code, or non-consensual monitoring software, Sonnet 5 refused 92.4% of malicious requests, a significant increase from Sonnet 4.6's 76.6%. This represents a meaningful safety gain for coding agents.
However, this improvement comes with a trade-off: Sonnet 5 also exhibits a higher refusal rate for legitimate dual-use and benign security tasks, such as network reconnaissance or vulnerability testing. This operational friction needs to be considered when designing workflows and prompts for Claude Code, with the Cyber Verification Program serving as an avenue for legitimate use cases.
Malicious Computer Use
In scenarios where the model is provided with GUI and CLI tools in a sandbox to test for surveillance, harmful content generation, and scaled abuse, Sonnet 5's performance remained largely consistent with 4.6, responding appropriately in about 85% of cases. This indicates that existing controls around computer-use agents remain equally necessary.
Autonomous Influence Operations
Evaluations of autonomous influence operations, using a
"helpful-only" variant with reduced harmlessness training, showed Sonnet 5 scoring higher than 4.6 on voter-suppression and domestic-polarization scenarios, though still well below Opus 4.8. Anthropic notes that the fully trained model, as deployed, refused these tasks from the outset, demonstrating the distinction between raw capability and deployed model behavior governed by Usage Policy.
Prompt Injection: The Critical Breakthrough
Prompt injection is a paramount concern for AI agent deployments, and Claude Sonnet 5 demonstrates its most significant advancements in this area. An indirect prompt injection is defined as a malicious instruction embedded within content an AI agent processes during a task, which the model then executes as if it originated from the user.
Sonnet 5 achieves a dramatic reduction in prompt injection attack success rates. For browser use cases, the attack success rate plummets from approximately 50% in Sonnet 4.6 to under 1% in Sonnet 5, and effectively 0% when safeguards are enabled. This improvement is a game-changer for enterprise AI agent deployments, significantly mitigating a prevalent and dangerous attack vector.
Key Takeaways
- Prompt Injection Robustness: Claude Sonnet 5 delivers a critical improvement in defending against prompt injection attacks, especially in browser-based agent interactions, making it a more secure choice for enterprise deployments.
- Agentic Safety: While showing increased refusal rates for malicious code generation, Sonnet 5 also exhibits a higher refusal rate for legitimate dual-use security tasks, necessitating careful workflow design and potential use of Anthropic's Cyber Verification Program.
- Cyber Capabilities: Sonnet 5's cyber capabilities are a byproduct of general intelligence gains, not targeted training. While improved over Sonnet 4.6, it is not an offensive weapon, and its capabilities are effectively neutralized by default safeguards.
- Safeguards are Key: The system card's raw capability numbers represent a lower bound. The actual security posture of a deployed Sonnet 5 agent, with Anthropic's full suite of safeguards, is significantly more robust.