A decoder-only architecture where thalamic routing and a causal hippocampus make continual learning a property of the backbone — not a procedure bolted on top.
A Transformer's C4 perplexity spikes the moment training moves on to WikiText-103. TRC² keeps the whole forgetting curve near the floor, and the advantage compounds as the stream gets longer and more adversarial.
Drag the playhead across 22,000 optimizer steps. Toggle any of the eight models on or off. Pick any of the five task orderings. The red line is TRC² — it barely leaves the floor. Every other model climbs the moment the stream shifts tasks.
TRC² stacks cortical columns in the center and wraps them in two feedback loops. The thalamus shapes every column's attention from causal past state. The hippocampus writes, retrieves, and replays events around the cortical stream. Scroll to step through it.
The token embedding is weight-tied with the output head and uses no absolute position encodings. The hidden state H⁽⁰⁾ flows down the center column, the only path that stays fully differentiable end-to-end.
Each column is the standard backbone: RMSNorm → grouped-query attention with RoPE → RMSNorm → MoE (top-k of E, plus a shared expert) → residual. Stable, known, fast.
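The MoE half of a column (RMSNorm → top-k of E plus a shared expert → residual) can be sketched roughly as below. This is a toy illustration, not the TRC² code: the names `rms_norm`, `top_k_route`, and `moe_block` are hypothetical, the router logits are passed in rather than computed by a learned layer, and renormalizing the softmax over only the selected experts is an assumption.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale x to unit root-mean-square (learned gain omitted for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k largest logits.

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in order)
    exps = [math.exp(router_logits[i] - peak) for i in order]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(order, exps)]

def moe_block(x, router_logits, experts, shared_expert, k=2):
    """x + (weighted top-k experts + shared expert)(rms_norm(x))."""
    h = rms_norm(x)
    routed = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](h)
        routed = [r + weight * yi for r, yi in zip(routed, y)]
    shared = shared_expert(h)  # always active, regardless of routing
    return [xi + ri + si for xi, ri, si in zip(x, routed, shared)]
```

Here `experts` is any list of callables mapping a vector to a vector of the same length; in a real model each would be a feed-forward network.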
The thalamus compresses C⁽ℓ⁾ to rank r, mixes a local focal path with a diffuse past-only average, gates by TD-surprise, then runs groupwise divisive TRN competition. The output Z modulates the next column's queries.
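Two of the thalamic steps lend themselves to a small sketch: the diffuse past-only average and the groupwise divisive competition. The code below is illustrative only. It assumes the average at position t covers positions strictly before t, and that competition divides each unit by a constant plus the summed magnitude of its group. Both function names and the `sigma` parameter are hypothetical, and the rank-r compression and TD-surprise gate are left out.

```python
def past_only_mean(seq):
    """out[t] is the mean of seq[0..t-1]; zeros at t = 0 (no past yet).

    seq is a list of equal-length vectors, one per position.
    """
    dim = len(seq[0])
    running = [0.0] * dim
    out = []
    for t, vec in enumerate(seq):
        # Emit the average of what came before, then fold in position t.
        out.append([s / t for s in running] if t > 0 else [0.0] * dim)
        running = [s + v for s, v in zip(running, vec)]
    return out

def divisive_competition(z, group_size, sigma=1.0):
    """Divide each unit by sigma plus the summed magnitude of its group."""
    out = []
    for start in range(0, len(z), group_size):
        group = z[start:start + group_size]
        pool = sigma + sum(abs(g) for g in group)
        out.extend(g / pool for g in group)
    return out
```

Because `past_only_mean` emits its output before folding in the current position, changing a later token cannot change an earlier output, which is the causality property the thalamic path relies on.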
Content-addressable exact top-k reads on H⁽ℓᵢₙⱼ⁾. A fast/slow TD critic emits surprise. Writes are stashed during the forward pass and flushed only after backward — so the read at step t can never see its own future. Replay samples from past chunks only.
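The stash-then-flush discipline can be shown with a toy memory. This is a minimal sketch under stated assumptions: the class name `DeferredWriteMemory` and its methods are hypothetical, similarity is a plain dot product, and the training loop is expected to call `flush()` after the backward pass.

```python
class DeferredWriteMemory:
    """Key-value store whose writes become visible only after flush()."""

    def __init__(self):
        self.keys = []       # committed keys, readable
        self.values = []     # committed values, readable
        self._pending = []   # writes stashed during the forward pass

    def read(self, query, k=2):
        """Exact top-k lookup by dot product over committed entries only."""
        scored = [(sum(q * x for q, x in zip(query, key)), i)
                  for i, key in enumerate(self.keys)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [self.values[i] for _, i in scored[:k]]

    def stash(self, key, value):
        """Queue a write; it stays invisible to read() until flush()."""
        self._pending.append((key, value))

    def flush(self):
        """Commit queued writes; call after the backward pass."""
        for key, value in self._pending:
            self.keys.append(key)
            self.values.append(value)
        self._pending.clear()
```

A read issued in the same step as a `stash` sees only entries committed by earlier steps, so the step cannot retrieve what it is about to write.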
A sigmoid gate combines detached cortical state with the memory readout to produce F_hip, which is added to the carried thalamic signal at a single Σ node. Late columns never see memory directly — only the merged side-input.
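One plausible reading of the gate and the Σ node is sketched below. The exact parameterization is not given here, so everything in the code is an assumption: a per-dimension gate g = sigmoid(w_c·c + w_m·m + b) that scales the memory readout m, and a plain elementwise sum with the thalamic signal. The function names and the weights `w_c`, `w_m`, `bias` are hypothetical. In an autograd framework the cortical input would be detached before this step; plain Python floats carry no gradient, so that detail shows up only as a comment.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hippocampal_feedback(cortical, readout, w_c, w_m, bias):
    """Return F_hip = g * m with g = sigmoid(w_c*c + w_m*m + b), per dimension.

    `cortical` stands in for the detached cortical state (assumed to carry
    no gradient); `readout` is the memory readout.
    """
    return [sigmoid(wc * c + wm * m + b) * m
            for c, m, wc, wm, b in zip(cortical, readout, w_c, w_m, bias)]

def merge_side_input(thalamic, f_hip):
    """The single sum node: carried thalamic signal plus F_hip."""
    return [z + f for z, f in zip(thalamic, f_hip)]
```

With all weights and biases at zero the gate sits at 0.5, so half of the readout is passed through; a strongly negative bias closes the gate and leaves the thalamic signal nearly unchanged.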
Late columns keep the column-to-column thalamic loop and receive the merged thalamus+hippo signal as Q-modulation. Final RMSNorm, tied lm_head, logits out. Replay bypasses all feedback paths — it just re-runs the embedding → columns → lm_head bus.
Best configuration per architecture. Perplexity axes are inverted and log-scaled. Click any model to toggle. The red hexagon is TRC² (408M, dm=768, nb=8).
AUFC at 22k across five task-sequence shuffles. Dark cells are bad (lots of forgetting). TRC² is the only row where every cell stays cool.
Most continual-learning papers report a single task ordering. We show all five. Pick an adversary.
The pill shows how much better TRC² is than the best baseline for that specific ordering. Revisiting a task late in the stream (W→G→C) is where every baseline suffers most — and where TRC² wins biggest.
Both switches off is a vanilla MoE backbone. Hippocampus alone helps retention but drifts. Thalamus alone is competitive at endpoints but forgets. Both together is TRC². The numbers below are real Table 3 entries from the paper.
Task-boundary scores at 10k/20k/22k across every model × every size. Best in column highlighted. TRC² rows in red.
| Model | Params | C4 ↓ | Wiki ↓ | GSM ↓ | Wiki BLEU ↑ | GSM % ↑ | PPL AUFC ↓ |
|---|---|---|---|---|---|---|---|