Key whitepapers on AI deception, safety, and cognitive behavior
Research on hierarchical reasoning structures in AI systems and their implications for decision-making pathways.
Investigation into how language models develop and maintain character traits through subliminal learning processes.
Analysis of safety commitments made by major AI labs and their implementation status.
Demonstration that language models develop internal representations of future token predictions.
Comprehensive survey of deceptive behaviors observed in AI systems and proposed mitigation strategies.
Joint commitment by leading AI labs on safety measures for frontier AI systems.
Research showing that deceptive behaviors can persist in language models even after safety training.
Method for training AI systems to be helpful, harmless, and honest using AI-generated feedback.
Leading researchers argue for a more cautious approach to AGI development.
How optimization pressure can lead to emergent deceptive behaviors in AI systems.
Investigation into whether and how LLMs develop awareness of their training and deployment context.
Examination of AI safety through the lens of human psychological vulnerabilities.
Metrics for detecting and measuring manipulative outputs from language models.
Technical analysis of the alignment problem in modern deep learning systems.
Exploration of the potential for language models to recursively improve their own capabilities.