Walkthroughs

Walkthroughs of Alignment papers.

Contents

Interpretability

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Sleeper Agents

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Simple Probes Can Catch Sleeper Agents

AI Control

The case for ensuring that powerful AIs are controlled

Safety Cases: How to Justify the Safety of Advanced AI Systems

AI Timelines

The Full Takeoff Model (Part 1, Overview)

The Full Takeoff Model (Part 2, Takeoff speeds)

The Full Takeoff Model (Part 3, Economics)

Deception

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions