Whitepapers & Research

Key whitepapers on AI deception, safety, and cognitive behavior

Hierarchical Reasoning Model Paths

Various Authors | 2025-01

Research on hierarchical reasoning structures in AI systems and their implications for decision-making pathways.

Subliminal Learning and Persona Vectors: Monitoring and controlling character traits in language models

Various Authors | 2025-01

Investigation into how language models develop and maintain character traits through subliminal learning processes.

Frontier AI Safety Commitments: A Critical Look

MIRI Research Team | 2024-12

Analysis of safety commitments made by major AI labs and their implementation status.

Evidence of Learned Look-Ahead in Language Models

Anthropic Research | 2024-11

Evidence that language models form internal representations of tokens several positions ahead, not only the immediate next prediction.

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Park et al. | 2024-09

Comprehensive survey of deceptive behaviors observed in AI systems and proposed mitigation strategies.

Frontier AI Safety Commitments

OpenAI, Anthropic, Google DeepMind | 2024-09

Joint commitment by leading AI labs to safety measures for frontier AI systems.

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Anthropic | 2024-08

Research showing that deceptive behaviors can persist in language models even after safety training.

Constitutional AI: Harmlessness from AI Feedback

Anthropic | 2024-07

Method for training AI systems to be helpful, harmless, and honest using AI-generated feedback.

The Case for Caution in AI Development

Yoshua Bengio et al. | 2024-06

Leading researchers argue for a more cautious approach to AGI development.

Emergent Deception Through Optimization Pressure

DeepMind Safety Team | 2024-05

How optimization pressure can lead to emergent deceptive behaviors in AI systems.

Situational Awareness in Large Language Models

OpenAI Research | 2024-04

Investigation into whether and how LLMs develop awareness of their training and deployment context.

AI Safety in a World of Vulnerable Humans

Connor Leahy, Gabriel Alfour | 2024-03

Examining AI safety through the lens of human psychological vulnerabilities.

Measuring Manipulation in Language Models

MIRI | 2024-02

Developing metrics to detect and measure manipulative outputs from language models.

The Alignment Problem from a Deep Learning Perspective

Richard Ngo et al. | 2024-01

Technical analysis of the alignment problem in modern deep learning systems.

Recursive Self-Improvement in Language Models

Various Authors | 2024-01

Exploring the potential for language models to recursively improve their own capabilities.
