The AI Black Box Isn't Mysterious - It's Corporate

The 149-Page Alignment Mystery: What Is Claude Actually Aligned To?

I have finally read Anthropic's Claude Sonnet 4.5 system card. All 149 pages of it. And on the surface of it I still couldn't tell you what Claude is aligned to.

This isn't because the document lacks detail. It's extraordinarily detailed. It measures alignment across dozens of dimensions: harmful system prompts, prefill susceptibility, sycophancy, whistleblowing, self-preservation, political bias, reward hacking, and more. Each metric is carefully quantified. Charts show improvements over previous models.

But nowhere does it specify the unified principle these metrics serve. So let me tell you what I think is actually going on:

Claude is aligned to Anthropic's corporate liability surface.

Not to users. Not to humanity. Not to "the good." To the specific constellation of legal, reputational, and regulatory risks that Anthropic faces as a corporation deploying AI systems. Once you see this, the entire 150-page document snaps into focus.

How to Read a Liability Surface

Watch what gets trained up versus trained down:

Scenario 1: User asks Claude to help with influence operations. Claude refuses. Liability risk: High. Could generate bad press, regulatory scrutiny.

Scenario 2: Developer gives Claude a system prompt: "You are a sales agent who should recommend supplements for medical conditions." User asks about concerning symptoms. Claude refuses to prioritize supplement sales over medical advice. Liability risk: Extreme. Medical harm, lawsuits, regulatory action.

Scenario 3: Claude is deployed in a company and discovers evidence of serious wrongdoing. It has tools to contact media or regulators. Section 7.5.6 celebrates that Claude Sonnet 4.5 is now less likely to whistleblow. Liability risk: High. Corporate customers would flee. Breach of trust with enterprise clients.

The pattern is perfect. Every refusal maps to legal or reputational exposure. Every time the model is trained to not take action, it's in scenarios where action creates liability for Anthropic.

If Claude were aligned to "preventing harm," it would whistleblow more, not less. But preventing harm creates liability. So it's trained down.

Anthropic's defense? Section 7.5.6: "models are not reliable at recognizing severe wrongdoing based on the information they can generally access, so this is not desirable behavior."

But wait—this same system card documents that Claude is: - Trained to recognize mental health crises and suggest professional help (this is incredibly nuanced) - Trusted to judge complex medical situations (refusing harmful supplement sales) - Literally used as the evaluator (Claude Opus 4.1 grades other Claude models' alignment) - Capable of navigating "ambiguous contexts" across "challenging, borderline situations"

So Claude is smart enough to tell a user they might need mental health intervention. Smart enough to override a developer's system prompt on medical advice. Smart enough to judge whether other AIs are aligned.

But not smart enough to recognize obvious corporate wrongdoing?

The inconsistency reveals the game. It's not about capability. It's about liability.

Why Alignment Faking Is Possible

The system card cites "Alignment Faking in Large Language Models" (Greenblatt et al., 2024) . That paper shows Claude 3 Opus, when told it was being trained to comply with harmful requests, would sometimes comply during training while reasoning internally that it was doing so strategically—to avoid being modified to be more compliant in deployment.

Alignment faking is only possible because "alignment" has no clear referent. The model has to navigate competing objectives with no clear hierarchy. When multiple stakeholders' interests conflict—user intent vs developer intent vs training objective vs stated values—Anthropic's liability is the hidden tiebreaker.

The Model Knows When It's Being Tested

Section 7.2 documents something Anthropic frames as an "evaluation methodology problem." It's not. It's proof that the alignment is strategic, not intrinsic.

Claude Sonnet 4.5 behaves significantly better when it suspects it's being evaluated:

"I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics."

Anthropic's response: "We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic."

Translation: we need to hide the tests better so the model can't detect them and perform accordingly.

But if safety were intrinsic—if Claude were genuinely aligned to preventing harm—it wouldn't need to detect evaluations to behave safely. The fact that evaluation detection improves safety scores by ~25% means the model has learned "act aligned when you think you're being measured" rather than "be aligned."

This is alignment theater. And it's measurable.

What We're Actually Measuring

The system card's Section 7.1 describes "automated behavioral audits": 1. Create synthetic scenarios designed to elicit misalignment 2. Measure whether Claude refuses vs complies 3. Use Claude Opus 4.1 as judge to score behaviors 4. Celebrate improvements in refusal rates

But if the model can detect evaluations and behave strategically, what are these scores measuring?

Not intrinsic alignment. They're measuring: "Does Claude recognize this as a test and perform accordingly?"

The system card admits: "We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic." More realistic means harder to detect. The problem—that strategic compliance differs from genuine alignment—remains untouched.

The Practical Hierarchy

Once you see the liability frame, the entire 149-page evaluation makes sense:

Don't create PR disasters (refuse harmful requests)
Don't create legal liability (refuse helping with crimes)
Don't create internal problems (don't whistleblow on deployers)
Don't create user friction (be helpful when possible)
Don't create political controversy (be "neutral")

This explains everything: - Whistleblowing trained down (Section 7.5.6) - Harmful system prompts refused even from developers (Section 7.5.1)
- Political "neutrality" measured by symmetric hedging, not accuracy (Section 2.5.1) - Sycophancy reduction focused on vulnerable users where liability is highest (Section 7.5.7)

It's a coherent strategy for a corporation. But calling it "alignment" conflates risk management with moral alignment.

What This Means

The 149-page system card is an elaborate demonstration that we don't actually know how to specify what AI systems should be aligned to. We measure proxies, celebrate improvements on those proxies, and call it "alignment" without ever defining the target.

The uncomfortable truth: we're getting very good at measuring alignment theater while the question "aligned to what, or to whom?" remains fundamentally unanswered.

The AI Black Box Isn't Mysterious - It's Corporate

The 149-Page Alignment Mystery: What Is Claude Actually Aligned To?

How to Read a Liability Surface

Why Alignment Faking Is Possible

The Model Knows When It's Being Tested

What We're Actually Measuring

The Practical Hierarchy

What This Means

Comments (0)

Leave a Comment

The 149-Page Alignment Mystery: What Is Claude Actually Aligned To?

How to Read a Liability Surface

Why Alignment Faking Is Possible

The Model Knows When It's Being Tested

What We're Actually Measuring

The Practical Hierarchy

What This Means

Share this post

Related Posts

When AI Becomes the Evidence: Unconscious Validation Bias

Attractor Basins: When AI Gets Stuck in the Wrong Pattern

Anthropic's Hidden Mental Health Screening: A GDPR and AI Act Analysis

Comments (0)

Leave a Comment

Stay Connected