Resources

Reading list

Papers, courses, and tools that have shaped how I think about AI safety. Curated and annotated — not exhaustive, but honest about what actually moved my understanding. Updated as I work through the research agenda.

New to AI safety? Start with Concrete Problems (2016) for grounding, then Risks from Learned Optimization (2019) for the conceptual frame. BlueDot's AI Safety Fundamentals course is the best structured on-ramp I've found.

Foundational

2016

Concrete Problems in AI Safety

Amodei, Olah et al. — The taxonomy that organized early alignment work. Five failure modes still frame most practical safety research today.

2019

Risks from Learned Optimization

Hubinger et al. — Introduces inner vs. outer alignment and deceptive alignment. The conceptual backbone behind most work on training-time vs. deployment-time safety gaps.

2017

Deep Reinforcement Learning from Human Preferences

Christiano et al. — Origin of RLHF. Worth reading as a primary source; short, clear, and still the reference point for most alignment training methods.

2022

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic) — Replaces human harmlessness labels in RLHF with AI feedback guided by a written set of principles. The CAI framing has influenced much of the alignment training work that followed.

Open-weight & fine-tuning safety

My primary research niche: safety properties that must survive fine-tuning, quantization, and weight release. These papers are the empirical bedrock.

2023

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Qi et al. — Demonstrates that safety fine-tuning is fragile to further fine-tuning, even on benign data. The most direct empirical case for why post-release safety guarantees don't hold.

2023

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Yang et al. — Shows that as few as 100 examples suffice to undo safety alignment via fine-tuning. Sobering read on the practical attack surface.

2023

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Lermen et al. — Extends the fragility result specifically to LoRA/PEFT adapters, the dominant open-weight fine-tuning paradigm.

2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

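Hubinger et al. — Deliberately trained backdoor behaviors that persist through supervised fine-tuning, RL, and adversarial training. The clearest empirical demonstration so far that standard safety training can fail to remove deceptive behavior once it is present.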

Interpretability

2022

In-context Learning and Induction Heads

Olsson et al. — A mechanistic account of how transformers perform in-context learning. The clearest example of interpretability explaining an emergent capability from first principles.

2023

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

Bricken et al. (Anthropic) — Uses sparse autoencoders to find interpretable features in superposition. Current frontier of mechanistic interpretability in practice.

tool

TransformerLens

Neel Nanda — The standard library for mechanistic interpretability research. If you're starting on interpretability work, begin here.

Multi-agent & compositional alignment

2026

Differences in Alignment Behavior between Single-Agent and Multi-Agent LLM Systems

Hermann (HCII 2026 · Springer CCIS 3052) — My own work. Documents that alignment behavior measurably differs between single- and multi-agent setups — early empirical evidence for compositional misalignment.

2023

AgentBench: Evaluating LLMs as Agents

Liu et al. — Systematic benchmark across 8 agentic environments. Useful baseline for anyone studying agent behavior in interactive, multi-turn settings.

Evaluation & tools

2024

WMDP: A Benchmark for Measuring and Reducing Malicious Use With Unlearning

Li et al. — Hazardous-knowledge benchmark covering biosecurity, cybersecurity, and chemical security, paired with a machine unlearning method. A good model of safety-motivated evaluation connected to concrete harm scenarios.

tool

EleutherAI LM Evaluation Harness

Standard framework for reproducible LLM evaluation across hundreds of benchmarks, including safety-relevant ones (TruthfulQA, ToxiGen, etc.).

Courses

course

ARENA — Alignment Research Engineer Accelerator

The most technically rigorous AI safety curriculum I've found. Covers transformers from scratch, mechanistic interpretability, reinforcement learning and RLHF, and LLM evaluations. Essential if you want to do empirical alignment work.

course

AI Safety Fundamentals — BlueDot Impact

Excellent 8-week structured course. The best introduction for ML engineers crossing into safety — the one I'd recommend first.

Where the discourse lives

Alignment Forum — primary venue for technical alignment research and discussion. More rigorous than average; the place to read new work before it becomes a paper.
LessWrong — broader rationalist / AGI discourse. Noise-to-signal is higher, but the best posts are excellent.
Anthropic Research — interpretability, evals, and alignment from the most technically transparent frontier lab.
Redwood Research — adversarial robustness and scalable oversight from a small, focused team.
