Walkthroughs

Walkthroughs of AI alignment papers.

Contents

Interpretability
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sleeper Agents
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Simple Probes Can Catch Sleeper Agents
AI Control
- The case for ensuring that powerful AIs are controlled
- Safety Cases: How to Justify the Safety of Advanced AI Systems
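AI Timelines
- The Full Takeoff Model (Part 1, Overview)
- The Full Takeoff Model (Part 2, Takeoff speeds)
- The Full Takeoff Model (Part 3, Economics)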
Deception
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Interpretability

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Sleeper Agents

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Simple Probes Can Catch Sleeper Agents

AI Control

The case for ensuring that powerful AIs are controlled

Safety Cases: How to Justify the Safety of Advanced AI Systems

AI Timelines

The Full Takeoff Model (Part 1, Overview)

The Full Takeoff Model (Part 2, Takeoff speeds)

The Full Takeoff Model (Part 3, Economics)

Deception

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions