Research

Research agenda

One concern runs through everything I work on: safety is a property of a deployment, not of a checkpoint. A lab can sign off on the weights it releases; it cannot sign off on the fine-tune, the quantization, or the agent scaffold someone wraps around them a week later. My research builds the measurement tooling for that gap.

Threads

Open-weight & post-deployment safety

Once weights are public, a safety property either survives everything the world does to it (fine-tuning, quantization, composition) or it doesn't hold at all. There is no recall. I want probes that run against any open-weight model to detect whether post-hoc training has stripped specific safety behavior, and training methods that reinforce the properties that actually survive.

Read the threat model →

Compositional misalignment

Alignment is tested on single models; the world deploys compositions. Alignment does not simply compose: a system built from individually aligned models is not guaranteed to behave as an aligned whole. My HCII 2026 paper documents this empirically for multi-agent LLM systems, where single-agent and multi-agent setups show measurably different alignment behavior, and it's the thread I most want to pull on next.

Mechanistic interpretability & evaluation

Measurement is load-bearing regardless of which intervention wins. I'm building fluency in interpretability by doing it — currently a causal, intra-trace analysis of why inverse scaling happens: what goes wrong inside a single reasoning chain when more thinking makes a model worse.

Inoculation against model poisoning

With Safe AI Germany, I'm testing whether data-level "antidote" datasets can contain emergent misalignment — letting a model absorb a narrow bad behavior without it generalizing into broad misalignment — and how that compares to representation- and weight-level defenses.