Why does inverse scaling happen? A research log

Research question: What mechanistic or behavioral explanation accounts for inverse scaling — the phenomenon where more capability, or more test-time reasoning, degrades performance rather than improving it?

This is an open research project, written up as a log rather than a result. I'd welcome pushback.

The gap

Two recent papers set up the puzzle:

Inverse Scaling in Test-Time Compute (Gema et al., 2025) shows that extended reasoning can reduce accuracy. Models get distracted by irrelevant details, overfit misleading framings, latch onto spurious correlations, or lose track of constraints over long deductions.
The Hot Mess of AI (McKee et al., 2026) shows that as models scale, their remaining errors shift from systematic to random: the incoherence ratio — variance over total error — rises with scale.
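
To make the incoherence ratio concrete, here is a minimal sketch under an assumption of mine: answers are numeric and error is measured as squared error, so total error splits into squared bias plus variance across repeated samples. The paper's exact definition may differ, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def incoherence_ratio(samples, target):
    """Share of mean squared error that comes from variance across samples.

    Assumes numeric answers and a squared-error decomposition:
    total error = bias^2 + variance. Returns 0.0 when there is no error.
    """
    mean = fmean(samples)
    bias_sq = (mean - target) ** 2             # systematic part of the error
    variance = pvariance(samples, mu=mean)     # random part of the error
    total = bias_sq + variance
    return variance / total if total > 0 else 0.0
```

A ratio near 0 means the samples agree with each other and are wrong in the same way; a ratio near 1 means they are right on average but scattered.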

Both papers carefully document what happens. Neither fully explains why. And there's a specific blind spot: the existing work focuses on inter-trace variance, where different sampled traces give different answers. As far as I can tell, far less attention has gone to intra-trace dynamics on these tasks: what happens inside a single chain of thought that sends it off the rails.

The plan

Phase 1 — Intra-trace causal mapping. For tasks that exhibit inverse scaling, find which tokens within a reasoning trace causally determine the final answer. Borrow the causal-attribution machinery from the Thought Anchors and Thought Branches work: ablate or counterfactually replace tokens and watch whether the final answer flips. Label each trace with its correctness, its anchor tokens, its branch points, and its commitment point — the earliest token after which the answer no longer changes under perturbation.
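
A rough sketch of how the commitment point could be computed once perturbation results exist. Everything here is illustrative: `final_answer` and `answer_with_perturbation` are hypothetical callables standing in for whatever model calls and resampling procedure end up being used, and a "step" can be a token or a sentence.

```python
from typing import Callable, List, Sequence


def perturbation_flips(
    steps: Sequence[str],
    final_answer: Callable[[Sequence[str]], str],
    answer_with_perturbation: Callable[[Sequence[str], int], str],
) -> List[bool]:
    """For each position, record whether perturbing it changes the final answer.

    Both callables are placeholders (assumptions) for real model calls.
    """
    base = final_answer(steps)
    return [answer_with_perturbation(steps, i) != base for i in range(len(steps))]


def commitment_point(flips: Sequence[bool]) -> int:
    """Smallest index i such that no perturbation at a position >= i flips the answer.

    Returns len(flips) if the last position still flips the answer,
    meaning the trace never commits before it ends.
    """
    point = len(flips)
    for i in range(len(flips) - 1, -1, -1):
        if flips[i]:
            break
        point = i
    return point
```

Scanning from the end keeps the definition strict: a single late flip pushes the commitment point past it, even if earlier positions looked stable.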

Phase 2 — Pattern discovery. Characterize the turning points. Are anchor tokens reasoning steps, restated premises, hedges, self-corrections? Do attention patterns reorganize at branch points — a "lost in the middle" effect within a trace? Does the model's internal representation of the problem drift as the chain grows?

Phase 3 — Mechanistic hypotheses. Two candidates I want to test first:

Attention dilution. Longer traces dilute attention over the original constraints, so the model loses the problem. Prediction: failures cluster at the steps where the attention mass on key constraint tokens falls below some threshold.
Representation drift. The internal representation of the problem drifts step by step until it crosses into the basin of a wrong answer. Prediction: a characteristic "drift then snap" in residual-stream similarity for failed traces.
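
Both predictions can be turned into simple per-trace statistics. The sketch below is one possible operationalization, not a settled method: it assumes attention has already been aggregated into one row of weights per generated step (how to pool heads and layers is left open), that a "state" is one residual-stream vector per step, and that `threshold` and `min_drop` are free parameters I made up for illustration.

```python
import math
from typing import List, Optional, Sequence


def constraint_mass(attn_row: Sequence[float], constraint_positions: Sequence[int]) -> float:
    """Total attention weight that one step places on the constraint tokens."""
    return sum(attn_row[p] for p in constraint_positions)


def first_dilution_step(
    attn_rows: Sequence[Sequence[float]],
    constraint_positions: Sequence[int],
    threshold: float,
) -> Optional[int]:
    """First step whose attention mass on the constraints falls below threshold."""
    for step, row in enumerate(attn_rows):
        if constraint_mass(row, constraint_positions) < threshold:
            return step
    return None


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def drift_profile(states: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of each step's state to the state at the first step."""
    anchor = states[0]
    return [cosine(anchor, s) for s in states]


def snap_step(similarities: Sequence[float], min_drop: float) -> Optional[int]:
    """First step where similarity falls by more than min_drop in a single step."""
    for step in range(1, len(similarities)):
        if similarities[step - 1] - similarities[step] > min_drop:
            return step
    return None
```

If the hypotheses hold, `first_dilution_step` or `snap_step` should land near the commitment point on failed traces and return `None`, or land much later, on successful ones.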

Phase 4 — Intervention. If a mechanism shows up, test whether targeted interventions — steering vectors, selective token masking, trace truncation — can reduce the effect.

Why I care

As we hand more consequential tasks to AI, we need models that get more reliable with more thinking, not less. Inverse scaling is a small, sharp counterexample to the assumption that more compute is always safer. Understanding the mechanism is a prerequisite to fixing it — and it's a clean setting to practice the interpretability methods I most want to get good at.