TRC²
arXiv ↗ · cs.LG

The brain
doesn't forget.
Neither does TRC².

A decoder-only architecture where thalamic routing and a causal hippocampus make continual learning a property of the backbone — not a procedure bolted on top.

Anonymous Author
0.08
TokAcc AUFC
0.44
PPL AUFC
2.48×
best-shuffle advantage
2.88B
tokens · 4× V100
◆ TRC² dominates 5/5 task shuffles ◆ PPL AUFC 0.44 vs best baseline 1.11 ◆ 60% forgetting reduction on W→G→C ◆ GSM8K accuracy 59.0% ◆ Thalamic router + hippocampus = part of the forward pass ◆ 7 baselines · 3 corpora · 22k steps
The headline

Same stream. Every baseline blows up.
TRC² stays flat.

A Transformer's perplexity on C4 spikes the moment it starts training on WikiText-103. TRC² keeps the whole forgetting curve near the floor, and the advantage compounds as the stream gets longer and more adversarial.

PPL AUFC @ 22k · lower is better
1.11
best baseline
(MoE 1024)
0.44
TRC² · ours
(226M, d768-l4)
A 60% cut in cumulative perplexity forgetting, at half the parameters of the best baseline.
BLEU AUFC
0.12
vs 0.74 for the best baseline. Token-level structure survives task switches.
TokAcc AUFC
0.08
vs 0.13 for the best baseline. Fine-grained accuracy is retained end-to-end.
GSM8K accuracy
59.0%
408M TRC² (d768-l8) beats every baseline at any scale up to 545M.
Best-shuffle advantage
2.48×
On the adversarial W→G→C stream, TRC²'s AUFC is 2.48× lower than the best baseline's.
Shuffle range (min–max)
0.29
TRC²'s AUFC barely moves across 5 random task orders. Every baseline doubles.
Live simulation

Watch it happen.
Scrub the 22k-step training stream.

Drag the playhead across 22,000 optimizer steps. Toggle any of the eight models on or off. Pick any of the five task orderings. The red line is TRC² — it barely leaves the floor. Every other model climbs the moment the stream shifts tasks.

Cumulative forgetting · AUFC(t)
x-axis: training step (0, 5k, 10k, 15k, 20k, 22k)
Step
11,000
Current task
WikiText-103
TRC² AUFC
0.14
Best baseline AUFC
0.68
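This page reports AUFC values without spelling out how the metric is computed. As a rough sketch only, assume AUFC(t) is the area under a per-step forgetting curve, integrated with the trapezoid rule and divided by the elapsed steps. The function name `cumulative_aufc` and the toy curve below are hypothetical illustrations, not the paper's definition or data.

```python
def cumulative_aufc(steps, forgetting):
    """Running area under a forgetting curve, normalised by elapsed steps.

    ASSUMPTION: AUFC(t) = (1 / (t - t0)) * integral of forgetting over [t0, t],
    approximated with the trapezoid rule. The paper may define it differently.
    """
    if len(steps) != len(forgetting) or len(steps) < 2:
        raise ValueError("need two or more aligned samples")
    out = [0.0]
    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        area += 0.5 * (forgetting[i] + forgetting[i - 1]) * width
        out.append(area / (steps[i] - steps[0]))
    return out


# Toy curve, made up for illustration: no forgetting until a task switch at 10k.
steps = [0, 5000, 10000, 15000, 20000]
forgetting = [0.0, 0.0, 0.0, 1.0, 1.0]
curve = cumulative_aufc(steps, forgetting)
```

Under this reading, a model whose forgetting stays near zero after each task switch keeps its cumulative curve flat, which is the behaviour the plot attributes to TRC².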
Architecture

Three loops.
One backbone.

TRC² stacks cortical columns in the center and wraps them in two feedback loops. The thalamus shapes every column's attention from causal past state. The hippocampus writes, retrieves, and replays events around the cortical stream. Scroll to step through it.

STAGE 01 · INPUT

Tokens enter a single embedding.

Weight-tied with the output head, no absolute position encodings. The hidden state H⁽⁰⁾ flows down the center column — the only path that stays fully differentiable end-to-end.

H⁽⁰⁾ = Dropout(E[x₁:T]) ∈ ℝᴮˣᵀˣᵈ
STAGE 02 · EARLY CORTEX

Columns compute. RoPE-GQA + routed SwiGLU MoE.

Each column is the standard backbone: RMSNorm → grouped-query attention with RoPE → RMSNorm → MoE (top-k of E, plus a shared expert) → residual. Stable, known, fast.

H⁽ℓ⁾ = H⁽ℓ⁻¹⁾ + A(U) + M(V)
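The column update can be sketched on a single token vector. This assumes a sequential pre-norm reading of the formula, with U = RMSNorm(H⁽ℓ⁻¹⁾) and V = RMSNorm(H⁽ℓ⁻¹⁾ + A(U)); the source does not define U and V here. The attention function is a stand-in for RoPE-GQA, the router logits are fixed rather than learned, and the softmax over only the selected experts is one common choice, not necessarily the paper's.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def top_k_moe(v, router_logits, experts, shared_expert, k=2):
    """Send v to the k highest-scoring experts, then add a shared expert.

    ASSUMPTION: gate weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    weights = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(weights)
    out = shared_expert(v)
    for w, i in zip(weights, top):
        y = experts[i](v)
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out


def column(h_prev, attention, moe):
    """One column: H_l = H_{l-1} + A(U) + M(V)."""
    u = rms_norm(h_prev)
    a = attention(u)
    v = rms_norm([h + ai for h, ai in zip(h_prev, a)])
    m = moe(v)
    return [h + ai + mi for h, ai, mi in zip(h_prev, a, m)]


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: [0.5 * xi for xi in x]
identity_attention = lambda u: u  # stand-in for RoPE-GQA
moe = lambda v: top_k_moe(v, [0.0, 1.0, 1.0, -1.0], experts, shared, k=2)
h_next = column([1.0, -1.0, 1.0, -1.0], identity_attention, moe)
```

In the toy call, experts 1 and 2 tie for the top score and each receives weight 0.5, so the routed output is 2.5·V plus the shared expert's 0.5·V.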
STAGE 03 · THALAMIC ROUTER

Deep-layer activity becomes causal feedback.

The thalamus compresses C⁽ℓ⁾ to rank r, mixes a local focal path with a diffuse past-only average, gates by TD-surprise, then runs groupwise divisive TRN competition. The output Z modulates the next column's queries.

Z⁽ℓ⁾ = TRN(G_focal · F_local + G_diffuse · F_past)
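Two pieces of this stage lend themselves to a small sketch: the past-only average, which must never include the current position, and the groupwise divisive competition. The code below assumes the input is already compressed to rank r, treats the focal path as the current feature vector, and uses fixed scalar gates in place of TD-surprise gating. All function names are hypothetical.

```python
def past_only_mean(seq):
    """F_past[t] = mean(seq[0..t-1]); zeros at t = 0, so position t never sees itself."""
    r = len(seq[0])
    running = [0.0] * r
    out = []
    for t, c in enumerate(seq):
        out.append([s / t for s in running] if t > 0 else [0.0] * r)
        running = [s + ci for s, ci in zip(running, c)]
    return out


def divisive_competition(z, group_size, sigma=1.0):
    """Groupwise divisive normalisation: z_i / (sigma + sum_j |z_j|) within each group."""
    out = []
    for start in range(0, len(z), group_size):
        group = z[start:start + group_size]
        denom = sigma + sum(abs(v) for v in group)
        out.extend(v / denom for v in group)
    return out


def thalamic_signal(seq, g_focal, g_diffuse, group_size=2):
    """Z[t] = TRN(g_focal * F_local[t] + g_diffuse * F_past[t]) for every position t."""
    past = past_only_mean(seq)
    return [
        divisive_competition(
            [g_focal * c + g_diffuse * p for c, p in zip(ct, pt)], group_size
        )
        for ct, pt in zip(seq, past)
    ]


seq = [[1.0, 0.0, 2.0, 2.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
z = thalamic_signal(seq, g_focal=1.0, g_diffuse=1.0)
```

Because the diffuse path at position t is built only from positions before t, editing a later token leaves every earlier Z unchanged, which is the causality property the stage relies on.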
STAGE 04 · HIPPOCAMPAL MEMORY

Read now. Write later. Never cheat causality.

Content-addressable exact top-k reads on H⁽ℓᵢₙⱼ⁾. A fast/slow TD critic emits surprise. Writes are stashed during the forward pass and flushed only after backward — so the read at step t can never see its own future. Replay samples from past chunks only.

write(t) = δ_fast − δ_slow, flush_after_backward()
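The deferred-write rule can be illustrated with a tiny key-value store. This is a sketch under assumptions, not the paper's memory module: the class name `DeferredWriteMemory`, the `stash` method, the dot-product scoring, and the surprise threshold are all hypothetical. Only the idea of queuing writes and committing them in `flush_after_backward` comes from the description above.

```python
class DeferredWriteMemory:
    """Key-value memory whose writes become visible only after flush_after_backward()."""

    def __init__(self):
        self.keys = []
        self.values = []
        self._pending = []

    def read(self, query, k=1):
        """Exact top-k read by dot-product score, over committed slots only."""
        ranked = sorted(
            range(len(self.keys)),
            key=lambda i: sum(q * x for q, x in zip(query, self.keys[i])),
            reverse=True,
        )
        return [self.values[i] for i in ranked[:k]]

    def stash(self, key, value, delta_fast, delta_slow, threshold=0.0):
        """Queue a write when surprise = delta_fast - delta_slow exceeds a threshold.

        ASSUMPTION: a simple threshold on the fast/slow difference decides what is stored.
        """
        if delta_fast - delta_slow > threshold:
            self._pending.append((key, value))

    def flush_after_backward(self):
        """Commit queued writes; call once the backward pass has finished."""
        for key, value in self._pending:
            self.keys.append(key)
            self.values.append(value)
        self._pending.clear()


mem = DeferredWriteMemory()
mem.stash([1.0, 0.0], "event-A", delta_fast=0.9, delta_slow=0.2)
before = mem.read([1.0, 0.0])  # nothing committed yet
mem.flush_after_backward()
after = mem.read([1.0, 0.0])
```

Keeping the pending queue separate from the committed slots means a read issued in the same step as a write cannot retrieve that write.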
STAGE 05 · FEEDBACK GATE

Two signals merge. The late stack reads one input.

A sigmoid gate combines detached cortical state with the memory readout to produce F_hip, which is added to the carried thalamic signal at a single Σ node. Late columns never see memory directly — only the merged side-input.

F_hip = σ(a) · W_hip→thal(G ⊙ M)
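A minimal numeric sketch of the gate, assuming G is an elementwise sigmoid of a linear map of the cortical state and that `a` is a learned scalar; the source gives only the formula, so `w_gate`, `hippocampal_feedback`, and `merge` are hypothetical names. Plain lists carry no gradients, so the "detached" part is noted in a comment rather than implemented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def hippocampal_feedback(cortex_state, memory_readout, w_gate, w_hip_to_thal, a):
    """F_hip = sigmoid(a) * W_hip_to_thal (G * M), with * elementwise inside the brackets."""
    # ASSUMPTION: G = sigmoid(w_gate @ cortex_state). In an autograd framework
    # cortex_state would be detached before this point.
    g = [sigmoid(v) for v in matvec(w_gate, cortex_state)]
    gated = [gi * mi for gi, mi in zip(g, memory_readout)]
    scale = sigmoid(a)
    return [scale * v for v in matvec(w_hip_to_thal, gated)]


def merge(z_thal, f_hip):
    """Single summation node: late columns read only this merged side-input."""
    return [z + f for z, f in zip(z_thal, f_hip)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
f_hip = hippocampal_feedback([1.0, -1.0], [2.0, 4.0], zeros, identity, a=0.0)
side_input = merge([1.0, 1.0], f_hip)
```

With a zero gate matrix G is 0.5 everywhere, and with a = 0 the outer scale is also 0.5, so the memory readout [2, 4] reaches the summation node as [0.5, 1.0].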
STAGE 06 · LATE CORTEX → LOGITS

Main hidden state continues. Side-input shapes attention.

Late columns keep the column-to-column thalamic loop and receive the merged thalamus+hippo signal as Q-modulation. Final RMSNorm, tied lm_head, logits out. Replay bypasses all feedback paths — it just re-runs the embedding → columns → lm_head bus.

logits = RMSNorm(H⁽ᴸ⁾) · Eᵀ
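Weight tying means the same table E serves both the input lookup from Stage 01 and the output projection here. The sketch below shows that shared use on plain lists; it omits dropout and the learned RMSNorm scale, and the function names are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def embed(token_ids, embedding):
    """Input side: H0[t] is the embedding row for token t (dropout omitted)."""
    return [embedding[t] for t in token_ids]


def tied_logits(h_final, embedding):
    """Output side: logits[v] = <RMSNorm(h_final), E[v]>, reusing the same table."""
    h = rms_norm(h_final)
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in embedding]


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vocabulary of 3 tokens, d = 2
h0 = embed([0, 2], E)
logits = tied_logits([3.0, 0.0], E)
```

Since no separate output matrix exists, any update to E changes both how tokens enter the model and how hidden states are scored against the vocabulary.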
Multi-metric comparison

Outer is better.
TRC² owns the hull.

Best configuration per architecture. Perplexity axes are inverted and log-scaled. Click any model to toggle. The red hexagon is TRC² (408M, dm=768, nb=8).

Retention across shuffles

Pick any ordering.
The red row still wins.

AUFC at 22k across five task-sequence shuffles. Dark cells are bad (lots of forgetting). TRC² is the only row where every cell stays cool.

Perplexity AUFC · lower is better
Order-robustness lab

Shuffle the stream.
Pick a stress test.

Most continual-learning papers report a single task ordering. We show all five. Pick an adversary.

Choose a task shuffle

The pill shows how much better TRC² is than the best baseline for that specific ordering. Revisiting a task late in the stream (W→G→C) is where every baseline suffers most — and where TRC² wins biggest.

Head-to-head · PPL AUFC @ 22k
TRC² · ours: 0.46
Best baseline · FALCON: 0.91
1.98×
49% less forgetting than the strongest baseline
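The ratio and percentage on this card follow directly from the two AUFC values, and the same arithmetic reproduces the headline 60% figure from 0.44 vs 1.11. A small check, with `advantage` as a hypothetical helper name:

```python
def advantage(ours, baseline):
    """Return (ratio, relative_reduction) for a lower-is-better metric."""
    return baseline / ours, 1.0 - ours / baseline


ratio, cut = advantage(0.46, 0.91)            # ordering shown on this card
head_ratio, head_cut = advantage(0.44, 1.11)  # headline PPL AUFC

print(f"{ratio:.2f}x, {cut:.0%} less forgetting")  # 1.98x, 49% less forgetting
print(f"{head_cut:.0%} headline reduction")        # 60% headline reduction
```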
Ablation lab

Turn the knobs.
See what each loop actually does.

Both switches off is a vanilla MoE backbone. Hippocampus alone helps retention but drifts. Thalamus alone is competitive at endpoints but forgets. Both together is TRC². The numbers below are real Table 3 entries from the paper.

Components
Thalamic router
Causal Q-modulation · TRN competition
Hippocampal memory
TD surprise · deferred write · replay
Full TRC². Best retention profile across all three AUFC metrics, at the cost of the worst throughput.
AUFC @ 22k · thal ✓ · hippo ✓
PPL AUFC
0.44
BLEU AUFC
0.12
TokAcc AUFC
0.08
Full results

The raw numbers.
No spin.

Task-boundary scores at 10k/20k/22k across every model × every size. Best in column highlighted. TRC² rows in red.

Model | Params | C4 PPL ↓ | Wiki PPL ↓ | GSM PPL ↓ | Wiki BLEU ↑ | GSM % ↑ | PPL AUFC ↓