Walkthroughs
Walkthroughs of alignment papers.
Contents
Interpretability
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Sleeper Agents
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Simple Probes Can Catch Sleeper Agents
AI Control
The case for ensuring that powerful AIs are controlled
Safety Cases: How to Justify the Safety of Advanced AI Systems
AI Timelines
The Full Takeoff Model (Part 1, Overview)
The Full Takeoff Model (Part 2, Takeoff speeds)
The Full Takeoff Model (Part 3, Economics)
Deception