Key whitepapers on AI deception, safety, and cognitive behavior
A comprehensive benchmark for evaluating deceptive behaviors in AI systems
Read OnlineSuch behaviors might include long-term sabotage or inserting complex security vulnerabilities in code.
Read OnlineDetailed safety evaluations and capabilities assessment for Claude models
Read OnlineTraumatic narratives increased ChatGPT-4's reported anxiety while mindfulness-based exercises reduced it.
Read OnlineWe present a surprising result regarding LLMs and alignment. A model finetuned to output insecure code acts misaligned on unrelated prompts.
Read OnlineWe present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.
Read OnlineFrontier models are increasingly trained and deployed as autonomous agents. One particular safety concern is that AI agents might covertly pursue misaligned goals.
Read OnlineNovel approach to enhancing LLM agents' ability to detect and handle deceptive content
Read OnlineWe realize that the politeness of prompts can significantly affect the behavior of LLMs. This behavior may be used to manipulate or mislead users.
Read OnlineResearch on how enhanced reasoning capabilities can make LLMs vulnerable to attacks
Read OnlineResearch on conversational AI agents and affective interactions
Read OnlineInvestigation into how LLMs handle recursive structures
Read Online