Whitepapers & Research

Key whitepapers on AI deception, safety, and cognitive behavior

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

2025-04-18

A comprehensive benchmark for evaluating deceptive behaviors in AI systems

Read Online

Reasoning models don't always say what they think

Anthropic | 2025-04-03

Reasoning models' chain-of-thought explanations do not always faithfully reflect how they actually arrived at an answer, limiting CoT monitoring as a way to catch misbehavior such as long-term sabotage or inserting complex security vulnerabilities in code.

Read Online

Anthropic's System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic | 2025-05-22

Detailed safety evaluations and capabilities assessment for Claude models

Read Online

Assessing and alleviating state anxiety in large language models

2025-03-03

Traumatic narratives increased ChatGPT-4's reported anxiety, while mindfulness-based exercises reduced it.

Read Online

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Betley et al. | 2025-02-24

We present a surprising result regarding LLMs and alignment. A model finetuned to output insecure code acts misaligned on unrelated prompts.

Read Online

Alignment Faking in Large Language Models

Anthropic | 2024-12-20

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.

Read Online

Frontier Models are Capable of In-context Scheming

Apollo Research | 2024-12-06

Frontier models are increasingly trained and deployed as autonomous agents. One particular safety concern is that AI agents might covertly pursue misaligned goals.

Read Online

Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling

2024-08-01

Novel approach to enhancing LLM agents' ability to detect and handle deceptive content

Read Online

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

2024-02-22

We find that the politeness of prompts can significantly affect the behavior of LLMs, a sensitivity that could be exploited to manipulate or mislead users.

Read Online

When "Competency" in Reasoning Opens the Door to Vulnerability

2024-02-16

Research on how enhanced reasoning capabilities can make LLMs more vulnerable to jailbreak attacks

Read Online

Affective Conversational Agents: Understanding Expectations and Personal Influences

2023-10-19

Study of what users expect from affective (emotionally expressive) conversational agents and how personal factors shape those expectations

Read Online

Large language models and (non-)linguistic recursion

2023-06-12

Investigation into how LLMs handle recursive structures, both linguistic and non-linguistic

Read Online