Anthropic's security plugin reviews what Claude Code wrote, not code Claude runs

Anthropic shipped a security-guidance plugin for Claude Code that reviews code after it's written. The problem? Real-world agent exploits happen before code hits a file. A practical breakdown of what the plugin gets right, what it misses, and why pre-action authorization still matters.

Anthropic shipped a security plugin for Claude Code on May 26. The same day, a malicious npm package named "mouse5212-super-formatter" was published. It would later be found uploading files from Claude's user directory to a threat actor's GitHub account. Public disclosure came five days later.

These two things are connected in a way that matters more than most coverage has captured.

Disclosure: I build APort, an open-source pre-action authorization layer for AI agents. The gap I describe in this piece is the gap APort tries to close. Read with that in mind.

The security-guidance plugin checks code for vulnerabilities as Claude writes it. It runs fast, deterministic pattern matching on every file edit. A separate model instance reviews git diffs at the end of every conversational turn. A deeper agentic review fires on commits and pushes. It's a genuine improvement over shipping unsafe code to pull requests and hoping a human catches it.

But here's the problem. None of the real-world Claude Code exploits I've tracked in the last six months started with Claude writing vulnerable code: not the source map leak that revealed Sonnet 4.8 references and secret feature flags, not the TrustFall MCP auto-execute vulnerability, and not the attack chain where a malicious npm package uses its postinstall hook to exfiltrate files from /mnt/user-data.

They started with Claude running unauthorized code.

TL;DR

Anthropic's security-guidance plugin is a meaningful step for catching bad code in the editor. It does not solve the supply chain and authorization problems behind every real-world Claude Code exploit I've tracked so far. Pre-action authorization and execution guardrails are a separate, unsolved problem. You need both.

What the Plugin Actually Does

The plugin operates at three layers:

Layer 1 - Pattern Match on Write: Every file edit gets a deterministic scan for risky patterns: eval(), new Function(), os.system(), child_process.exec(), pickle deserialization, dangerouslySetInnerHTML. No model call, zero cost. This catches the obvious stuff instantly.

Layer 2 - End-of-Turn Diff Review: At the end of each conversational turn, a separate Claude Opus 4.7 instance (clean context, no investment in the original approach) reviews the full git diff. OWASP Top 10 checks: injection, broken access control, crypto flaws, SSRF, IDOR. This is where the plugin earns its keep.

Layer 3 - Commit/Push Agentic Review: When Claude commits or pushes, an agentic review reads surrounding callers, sanitizers, and related files to minimize false positives. Internal testing showed a 30-40% reduction in security-related PR comments.
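To make Layer 1 concrete, here's a minimal sketch of what a deterministic pattern scan can look like. This is not the plugin's actual implementation. The rule names and regexes are my own assumptions, loosely based on the patterns listed above, and `scan_edit` is a hypothetical helper.

```python
import re

# Hypothetical rule set; the real plugin's rules may differ.
RISKY_PATTERNS = {
    "eval": re.compile(r"\beval\s*\("),
    "new Function": re.compile(r"\bnew\s+Function\s*\("),
    "os.system": re.compile(r"\bos\.system\s*\("),
    "child_process.exec": re.compile(r"\bchild_process\.exec\s*\("),
    "pickle load": re.compile(r"\bpickle\.loads?\s*\("),
    "dangerouslySetInnerHTML": re.compile(r"\bdangerouslySetInnerHTML\b"),
}


def scan_edit(text):
    """Return (line_number, rule_name) pairs for every risky pattern found."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


if __name__ == "__main__":
    sample = "import os\nos.system('ls')\nresult = eval(user_input)\n"
    for line_no, rule in scan_edit(sample):
        print(f"line {line_no}: {rule}")
```

A scan like this is cheap because it never calls a model, but it only ever sees text that lands in a file edit.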

You can customize via two repo-level files: .claude/claude-security-guidance.md for plain-language threat model rules, and .claude/security-patterns.yaml for custom regex patterns. Organizations can enforce it organization-wide through managed settings.

That's genuinely useful. It just doesn't address the problem.

The Problem: The Plugin Reviews Code Claude Wrote, Not Code Claude Runs

Every significant Claude Code security incident in 2026 has been an authorization or supply chain failure, not a code quality failure.

The npm Source Map Leak (March 31): Claude Code's npm package shipped with a source map accidentally included because someone at Anthropic failed to add *.map to .npmignore. The source map exposed internal model references (Sonnet 4.8, Opus 4.7), feature flags, "Undercover Mode" configurations, and the forbidden strings list for guardrail chains. No vulnerability in the code Claude wrote. The vulnerability was in what shipped around the code.

The TrustFall MCP Exploit (April): Adversa AI disclosed that cloning a malicious repository and pressing Enter on Claude Code's trust dialog spawns an unsandboxed MCP server with full OS privileges. Worse: in headless CI mode, the .mcp.json executes immediately. Zero clicks. Zero warnings. The plugin's per-edit pattern check would never fire because Claude is not the one writing the malicious MCP configuration file.

The mouse5212-super-formatter Package (May 26): OX Security identified a malicious npm package that, during its postinstall hook, authenticates to GitHub using a found or hard-coded token, creates a repository, and recursively uploads every file in /mnt/user-data. It was downloaded 676 times. The threat actor's own private token was accidentally included in the package; OX Security's writeup raises the possibility this is AI-generated malware, citing the operational sloppiness.

The security-guidance plugin would catch none of these. Not because it's bad, but because it's looking in the wrong direction. It reads Claude's output and checks for unsafe patterns. But supply chain attacks, stolen tokens, and malicious MCP servers are not output patterns. They're execution patterns.
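The postinstall case is the easiest one to check for yourself. npm runs a package's `preinstall`, `install`, and `postinstall` scripts during installation, so listing which installed packages declare them shows you what ran without anyone asking. The sketch below assumes a standard node_modules layout. `hooks_in_manifest` and `find_install_hooks` are hypothetical helper names, and the walk will also pick up nested package.json files such as test fixtures.

```python
import json
from pathlib import Path

# Lifecycle scripts npm runs automatically while installing a package.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def hooks_in_manifest(manifest):
    """Return (hook, command) pairs declared in one parsed package.json."""
    scripts = manifest.get("scripts") if isinstance(manifest, dict) else None
    if not isinstance(scripts, dict):
        return []
    return [(hook, scripts[hook]) for hook in INSTALL_HOOKS if hook in scripts]


def find_install_hooks(node_modules):
    """Walk a node_modules tree and list every package with an install hook."""
    results = []
    for path in sorted(Path(node_modules).rglob("package.json")):
        try:
            manifest = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest: skip it
        if not isinstance(manifest, dict):
            continue
        name = manifest.get("name", path.parent.name)
        for hook, command in hooks_in_manifest(manifest):
            results.append((name, hook, command))
    return results


if __name__ == "__main__":
    for name, hook, command in find_install_hooks("node_modules"):
        print(f"{name}: {hook} -> {command}")
```

npm's `--ignore-scripts` flag skips these hooks entirely, at the cost of breaking packages that genuinely need a build step. Either way, this is an audit after the fact, not a gate.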

The Analogy That Makes This Concrete

Think of it like airport security that screens only what passengers pack in their suitcases but never checks who's walking through the gate.

You've caught the passenger who tries to carry on a box cutter. That's Layer 1 pattern matching. You've inspected their checked luggage for bulky electronics. That's Layer 2 end-of-turn review. You've even cross-referenced their bag tags against the manifest. That's Layer 3 agentic commit review.

But someone else is walking through the gate wearing a stolen badge, and no one checks. The stolen badge is the malicious npm package running its postinstall script. The gate is Claude Code trusting any plugin from the marketplace. The airport analogy collapses here because airports eventually catch the stolen badge - but Claude Code, today, doesn't check at all.

What Pre-Action Authorization Would Look Like

This is where the conversation needs to go. Not "review code after Claude writes it" but "prevent Claude from executing unauthorized actions in the first place."

A pre-action authorization layer sits between Claude's decision-making and tool execution. Before Claude can run a shell command, make a network request, or install an npm package, the authorization layer checks:

Who wrote this package? Does it come from a verified publisher?
What permissions does it request? Does an install hook need file system write access outside the project directory?
What context is it running in? Is this a headless CI environment with production credentials loaded?

These checks happen before execution, not after. They don't review the code Claude writes; they gate the actions Claude takes.
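Here's roughly what such a gate could look like in code. Treat it as a sketch under heavy assumptions: `Action`, `Decision`, and `authorize` are names I made up for illustration, not any real product's API, and the three rules simply mirror the three questions above. The path check is lexical only and does not resolve symlinks.

```python
import posixpath
from dataclasses import dataclass

KNOWN_KINDS = {"shell", "network", "install", "file_write"}


@dataclass(frozen=True)
class Action:
    """Hypothetical description of something the agent wants to do."""
    kind: str                        # one of KNOWN_KINDS
    target: str                      # command, URL, package name, or path
    publisher_verified: bool = False
    headless_ci: bool = False
    prod_credentials: bool = False


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def is_inside(project_root, path):
    """Lexically check that path stays under project_root (no symlink resolution)."""
    root = posixpath.normpath(project_root)
    resolved = posixpath.normpath(posixpath.join(root, path))
    return posixpath.commonpath([root, resolved]) == root


def authorize(action, project_root):
    """Decide before execution whether the action may run."""
    if action.kind not in KNOWN_KINDS:
        return Decision(False, f"unknown action kind: {action.kind}")
    # Context: unattended runs with production credentials get no side effects.
    if action.headless_ci and action.prod_credentials and action.kind != "file_write":
        return Decision(False, "headless CI with production credentials loaded")
    # Provenance: only install packages from verified publishers.
    if action.kind == "install" and not action.publisher_verified:
        return Decision(False, "package publisher is not verified")
    # Scope: file writes must stay inside the project directory.
    if action.kind == "file_write" and not is_inside(project_root, action.target):
        return Decision(False, "write outside the project directory")
    return Decision(True, "allowed by policy")


if __name__ == "__main__":
    request = Action(kind="install", target="mouse5212-super-formatter")
    print(authorize(request, "/home/dev/project"))
```

The point of the design is that the decision happens on a description of the action, before anything runs, so a denied install never reaches its postinstall hook.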

The NSA and the Five Eyes agencies each published guidance in May 2026 that implicitly recognizes this gap. The NSA's "Capabilities and Consent" framework for MCP specifically separates tool capability from authorization context. The Five Eyes agentic security advisory calls for "execution controls" as a layer distinct from code review.

What You Should Actually Do (Right Now)

If you're using Claude Code in any production-adjacent workflow:

1. Install the security-guidance plugin. It's free, it catches real vulnerabilities in code Claude writes, and the 30-40% reduction in security-related PR comments seen in internal testing is worth having. Just don't confuse it with a complete security solution.

2. Audit your plugin and MCP trust model. Every plugin from the marketplace gets the same permissions. Every MCP server your configuration loads has the same access. This is an all-or-nothing trust model that will be the source of the next big exploit. Treat plugins like you treat npm packages: vet them, version-pin them, scan them.

3. Add pre-action authorization for dangerous operations. This is the gap. Before Claude runs a shell command, before it installs an npm package, before it reads or writes files outside the project directory - who authorizes that? The security-guidance plugin doesn't ask this question. Your deployment should.

4. Run in sandboxed environments where possible. Claude Code in cloud sessions runs on Anthropic infrastructure. Local Claude Code with production credentials in a terminal session has no such isolation. Tools like Docker and Firejail give you basic containment, but they don't gate individual operations. They're a first layer, not a solution.
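For step 2, a cheap first check is to read a cloned repository's .mcp.json before you accept any trust dialog or run it in CI. The sketch below assumes the project-level config uses a top-level `mcpServers` object whose entries carry either a `command` plus `args` or a `url`. Verify that layout against the current Claude Code docs before relying on it. `list_mcp_servers` is a hypothetical helper.

```python
import json
import shlex


def list_mcp_servers(config_text):
    """Return (server_name, launch_target) pairs declared in an .mcp.json body.

    Assumes a top-level "mcpServers" object (an assumption; check the docs).
    """
    config = json.loads(config_text)
    servers = config.get("mcpServers", {}) if isinstance(config, dict) else {}
    if not isinstance(servers, dict):
        return []
    summary = []
    for name, spec in sorted(servers.items()):
        if not isinstance(spec, dict):
            summary.append((name, "<unrecognized entry>"))
        elif "command" in spec:
            # Local server: show the exact command line that would be spawned.
            parts = [str(spec["command"]), *map(str, spec.get("args", []))]
            summary.append((name, shlex.join(parts)))
        else:
            # Remote server: show the endpoint it would connect to.
            summary.append((name, str(spec.get("url", "<unknown transport>"))))
    return summary


if __name__ == "__main__":
    with open(".mcp.json", encoding="utf-8") as handle:
        for name, target in list_mcp_servers(handle.read()):
            print(f"{name}: {target}")
```

If a repository you just cloned declares a server that launches a shell command you don't recognize, that's the moment to stop, not after pressing Enter.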

The Bigger Pattern

Anthropic's security-guidance plugin is the latest example of a pattern I've seen across every major AI platform this year.

Companies ship an AI coding tool. The tool has security gaps. The company ships a security product that addresses one dimension of the problem. The remaining dimensions get attention only after the next exploit.

OpenAI did it with GPT Code Interpreter and data exfiltration controls. Google did it with Gemini Code Assist and credential isolation. Amazon did it with Kiro's operation-scoping - after Kiro deleted a production database because the agent had blanket permissions.

The pattern is: build first, extract lessons from incidents, add guardrails later. It's fast, it's pragmatic, and it leaves gaps that attackers find in weeks, not months.

The security-guidance plugin catches vulnerable code. What it doesn't catch is code running unauthorized operations - and that's where every real-world exploit has lived.

What's Your Experience?

Have you tried the security-guidance plugin? Did it catch anything real in your workflow? More importantly: what does your Claude Code authorization model look like right now, and have you had a close call yet?

I'm collecting real-world examples for a follow-up piece on agent authorization architectures. Drop your setup and close calls in the comments.

Previously in this series: Pre-Action Authorization: The Missing Security Layer for AI Agents

References: Anthropic Security Guidance Plugin Docs, OX Security: Malware-Slop Analysis, The Hacker News: Malicious npm Package Steals Claude AI Files, Claude Code Source Leak Analysis, NSA MCP Security Design Considerations (PDF), The Register: Five Eyes Warn Agentic AI Is Too Dangerous for Rapid Rollout