TRC² · Thalamically Routed Cortical Columns

The headline

Same stream. Every baseline blows up.
TRC² stays flat.

A Transformer's perplexity on C4 collapses the moment it touches WikiText-103. TRC² keeps the whole forgetting curve near the floor — and the advantage compounds as the stream gets longer and more adversarial.

PPL AUFC @ 22k · lower is better

1.11

best baseline
(MoE 1024)

→

0.44

TRC² · ours
(226M, d768-l4)

A 60% cut in cumulative perplexity forgetting, at half the parameters of the best baseline.

BLEU AUFC

0.12

vs 0.74 for the best baseline. Token-level structure survives task switches.

TokAcc AUFC

0.08

vs 0.13 for the best baseline. Fine-grained accuracy is retained end-to-end.

GSM8K accuracy

59.0%

408M TRC² (d768-l8) beats every baseline at any scale up to 545M.

Best-shuffle advantage

2.48×

On the adversarial W→G→C stream, TRC²'s AUFC is 2.48× better than the best baseline.

Shuffle range (min–max)

0.29

TRC²'s AUFC barely moves across 5 random task orders. Every baseline doubles.

Architecture

Three loops.
One backbone.

TRC² stacks cortical columns in the center and wraps them in two feedback loops. The thalamus shapes every column's attention from causal past state. The hippocampus writes, retrieves, and replays events around the cortical stream. Scroll to step through it.

STAGE 01 · INPUT

Tokens enter a single embedding.

Weight-tied with the output head, no absolute position encodings. The hidden state H⁽⁰⁾ flows down the center column — the only path that stays fully differentiable end-to-end.

H⁽⁰⁾ = Dropout(E[x₁:T]) ∈ ℝᴮˣᵀˣᵈ

STAGE 02 · EARLY CORTEX

Columns compute. RoPE-GQA + routed SwiGLU MoE.

Each column is the standard backbone: RMSNorm → grouped-query attention with RoPE → RMSNorm → MoE (top-k of E, plus a shared expert) → residual. Stable, known, fast.

H⁽ℓ⁾ = H⁽ℓ⁻¹⁾ + A(U) + M(V)

STAGE 03 · THALAMIC ROUTER

Deep-layer activity becomes causal feedback.

The thalamus compresses C⁽ℓ⁾ to rank r, mixes a local focal path with a diffuse past-only average, gates by TD-surprise, then runs groupwise divisive TRN competition. The output Z modulates the next column's queries.

Z⁽ℓ⁾ = TRN(G_focal · F_local + G_diffuse · F_past)

STAGE 04 · HIPPOCAMPAL MEMORY

Read now. Write later. Never cheat causality.

Content-addressable exact top-k reads on H⁽ℓᵢₙⱼ⁾. A fast/slow TD critic emits surprise. Writes are stashed during the forward pass and flushed only after backward — so the read at step t can never see its own future. Replay samples from past chunks only.

write(t) = δ_fast − δ_slow · flush_after_backward()

STAGE 05 · FEEDBACK GATE

Two signals merge. The late stack reads one input.

A sigmoid gate combines detached cortical state with the memory readout to produce F_hip, which is added to the carried thalamic signal at a single Σ node. Late columns never see memory directly — only the merged side-input.

F_hip = σ(a) · W_hip→thal(G ⊙ M)

STAGE 06 · LATE CORTEX → LOGITS

Main hidden state continues. Side-input shapes attention.

Late columns keep the column-to-column thalamic loop and receive the merged thalamus+hippo signal as Q-modulation. Final RMSNorm, tied lm_head, logits out. Replay bypasses all feedback paths — it just re-runs the embedding → columns → lm_head bus.

logits = RMSNorm(H⁽ᴸ⁾) · Eᵀ

Order-robustness lab

Shuffle the stream.
Pick a stress test.

Every continual-learning paper shows one ordering. We show all five. Pick an adversary.

Choose a task shuffle

The pill shows how much better TRC² is than the best baseline for that specific ordering. Revisiting a task late in the stream (W→G→C) is where every baseline suffers most — and where TRC² wins biggest.

Head-to-head · PPL AUFC @ 22k

TRC² · ours0.46

0.46

Best baseline · FALCON0.91

0.91

1.98×

49% less forgetting than the strongest baseline

Ablation lab

Turn the knobs.
See what each loop actually does.

Both switches off is a vanilla MoE backbone. Hippocampus alone helps retention but drifts. Thalamus alone is competitive at endpoints but forgets. Both together is TRC². The numbers below are real Table 3 entries from the paper.

Components

Thalamic router

Causal Q-modulation · TRN competition

Hippocampal memory

TD surprise · deferred write · replay

Full TRC². Best retention profile across all three AUFC metrics — worst throughput.

AUFC @ 22k · thal ✓ · hippo ✓

PPL AUFC

0.44

BLEU AUFC

0.12

TokAcc AUFC

0.08

The brain
doesn't forget.
Neither does TRC2.

Same stream. Every baseline blows up.
TRC² stays flat.

Watch it happen.
Scrub the 22k-step training stream.

Three loops.
One backbone.

Tokens enter a single embedding.

Columns compute. RoPE-GQA + routed SwiGLU MoE.

Deep-layer activity becomes causal feedback.

Read now. Write later. Never cheat causality.

Two signals merge. The late stack reads one input.

Main hidden state continues. Side-input shapes attention.

Outer is better.
TRC² owns the hull.

Pick any ordering.
The red row still wins.

Shuffle the stream.
Pick a stress test.

Turn the knobs.
See what each loop actually does.

The raw numbers.
No spin.

Same stream. Every baseline blows up.TRC² stays flat.

Watch it happen.Scrub the 22k-step training stream.

Three loops.One backbone.

Tokens enter a single embedding.

Columns compute. RoPE-GQA + routed SwiGLU MoE.

Deep-layer activity becomes causal feedback.

Read now. Write later. Never cheat causality.

Two signals merge. The late stack reads one input.

Main hidden state continues. Side-input shapes attention.

Outer is better.TRC² owns the hull.

Pick any ordering.The red row still wins.

Shuffle the stream.Pick a stress test.

Turn the knobs.See what each loop actually does.

The raw numbers.No spin.

Same stream. Every baseline blows up.
TRC² stays flat.

Watch it happen.
Scrub the 22k-step training stream.

Three loops.
One backbone.

Outer is better.
TRC² owns the hull.

Pick any ordering.
The red row still wins.

Shuffle the stream.
Pick a stress test.

Turn the knobs.
See what each loop actually does.

The raw numbers.
No spin.