Resources
Reading list
Papers, courses, and tools that have shaped how I think about AI safety. Curated and annotated — not exhaustive, but honest about what actually moved my understanding. Updated as I work through the research agenda.
New to AI safety? Start with Concrete Problems (2016) for grounding, then Risks from Learned Optimization (2019) for the conceptual frame. BlueDot's AI Safety Fundamentals course is the best structured on-ramp I've found.
Foundational
- 2016
- 2019
- 2017
- 2022
Open-weight & fine-tuning safety
My primary research niche: safety properties that must survive fine-tuning, quantization, and weight release. These papers are the empirical bedrock.
- 2023
- 2023
- 2023
- 2024
Interpretability
- 2022
- 2023
- tool
Multi-agent & compositional alignment
- 2026 Differences in Alignment Behavior between Single-Agent and Multi-Agent LLM Systems
- 2023
Evaluation & tools
- 2024
- tool
Courses
- course
- course
Where the discourse lives
- Alignment Forum — primary venue for technical alignment research and discussion. More rigorous than average; the place to read new work before it becomes a paper.
- LessWrong — broader rationalist / AGI discourse. Noise-to-signal is higher, but the best posts are excellent.
- Anthropic Research — interpretability, evals, and alignment from the most technically transparent frontier lab.
- Redwood Research — adversarial robustness and scalable oversight from a small, focused team.