arXiv:2602.22479v1 · cs.LG · Feb 2026

TRC²: Thalamically Routed Cortical Columns

Efficient Continual Learning in Language Models via
Biologically Grounded Sparse Routing & Fast Correction

Afshin Khadangi · SnT, University of Luxembourg
~2.88B tokens trained

Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. We present TRC² (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual adaptation a property of the backbone itself.

TRC² combines stacked cortical columns—each performing grouped-query attention and routed mixture-of-experts computation—with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation.

We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential stream over C4, WikiText-103, and GSM8K, TRC² consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines.

Three Key Contributions

Architecture

TRC² integrates thalamic modulation and hippocampal episodic memory into the decoder backbone, making selective communication and online consolidation part of the forward computation.

Cortex · Thalamus · Hippocampus

Causal Memory

Hippocampal writes are retrospective and surprise-based. Replay is sampled only from past stored chunks. An adaptive controller adjusts consolidation strength online.

TD-surprise · Deferred Write · Replay
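The card above compresses the causal memory-update scheme into three keywords; a minimal sketch of how a TD-style surprise signal could gate deferred writes, with a tiny adaptive-replay update, is given below. The two-timescale critics, the `alpha_fast`/`alpha_slow` rates, the write threshold, and the exponential controller are illustrative assumptions, not the paper's exact equations.

```python
import math

def td_surprise(loss_now, fast_pred, slow_pred, alpha_fast=0.3, alpha_slow=0.03):
    """One step of a TD-style surprise signal: two exponential moving
    averages track the loss at fast and slow timescales, and surprise is
    how far the fast critic sits above the slow baseline."""
    fast_pred = (1 - alpha_fast) * fast_pred + alpha_fast * loss_now
    slow_pred = (1 - alpha_slow) * slow_pred + alpha_slow * loss_now
    surprise = max(0.0, fast_pred - slow_pred)
    return surprise, fast_pred, slow_pred

def deferred_writes(chunk_losses, threshold=0.1):
    """Select chunk indices to commit to memory *after* the backward pass:
    only chunks whose surprise exceeded the threshold are written, so the
    write never influences the step that produced it."""
    fast = slow = chunk_losses[0]
    to_write = []
    for i, loss in enumerate(chunk_losses):
        s, fast, slow = td_surprise(loss, fast, slow)
        if s > threshold:
            to_write.append(i)
    return to_write

def update_replay_strength(lmbda, forgetting, target=0.05, lr=0.5):
    """Adaptive controller (illustrative): raise the replay coefficient
    when measured forgetting exceeds a target, lower it otherwise."""
    return max(0.0, lmbda * math.exp(lr * (forgetting - target)))
```

Gating writes on surprise keeps the store small and event-selective, and deferring them until after backward is what makes the update causal with respect to the current prediction.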

Empirical Study

Controlled comparisons against Transformer, Mamba, MoE, and DeepSeek baselines under a shared task stream, plus ablations isolating each component's contribution.

C4 · WikiText-103 · GSM8K
Three Interacting Subsystems

TRC² is organized into three interacting subsystems. The cortical columns perform causal sequence modeling. The thalamic router transforms deep-layer activity into a causal modulatory signal. The hippocampal memory provides event-selective retrieval, surprise-based writing, and replay-driven consolidation.

Cortical Columns

Each column: RMSNorm → GQA attention (with thalamic Q-modulation + RoPE) → residual → RMSNorm → routed MoE (SwiGLU, top-k of E experts + shared expert) → residual → L5 projection. Columns are stacked L layers deep and split into early and late stacks at the injection layer ℓ_inj.

GQA · MoE · SwiGLU · RoPE
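The per-column recipe can be illustrated with a NumPy sketch of its two most distinctive pieces, RMSNorm and top-k expert routing. Shapes, the renormalized softmax gate, and the use of plain linear maps as "experts" are simplifying assumptions, not the paper's implementation; the real blocks use SwiGLU experts plus a shared expert, which are omitted here for brevity.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: scale by root-mean-square only, no mean-centering (unlike LayerNorm).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def top_k_route(x, gate_w, k=2):
    """Top-k MoE routing for a single token vector x.
    gate_w: (num_experts, d) router weights. Returns the chosen expert ids
    and softmax weights renormalized over the top-k logits."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:][::-1]          # indices of the k largest logits
    w = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    w /= w.sum()
    return top, w

def moe_ffn(x, experts, gate_w, k=2):
    """Mix the outputs of the k selected experts (here each 'expert' is just
    a weight matrix; the shared expert of the full architecture is omitted)."""
    ids, w = top_k_route(x, gate_w, k)
    return sum(wi * (experts[i] @ x) for i, wi in zip(ids, w))
```

Because only k of E expert matrices are applied per token, compute grows with k while capacity grows with E, which is the usual motivation for routed MoE layers.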

Thalamic Router

Transforms the L5 output C(ℓ) into causal feedback F_thal for the next column's Q. Compresses to rank r, forms local (focal) + diffuse (past-mean) paths gated by surprise, then applies TRN groupwise divisive competition.

Local · Diffuse · TRN · Surprise

Hippocampal Memory

Content-addressable episodic store with chunked exact top-k retrieval. TD critic with fast/slow predictors. Deferred causal writes after backward. Two replay buffers (recent ring + long-term reservoir). Adaptive controller adjusts λ_rep, B_R, ρ_long online.

TD · Deferred Write · Replay · Controller
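A minimal sketch of the two storage mechanisms named above: content-addressable exact top-k retrieval and a long-term reservoir buffer. The class name, capacity, and cosine-similarity keying are illustrative assumptions; the paper's store operates on chunk representations rather than raw vectors.

```python
import numpy as np

class EpisodicStore:
    """Content-addressable episodic store (sketch): unit-normalized keys
    with exact top-k cosine retrieval, plus a fixed-size long-term
    reservoir that keeps a uniform sample over the whole write stream."""
    def __init__(self, cap=1024, seed=0):
        self.keys, self.vals = [], []
        self.reservoir, self.cap = [], cap
        self.n_seen = 0
        self.rng = np.random.default_rng(seed)

    def write(self, key, val):
        self.keys.append(key / (np.linalg.norm(key) + 1e-8))
        self.vals.append(val)
        # Reservoir sampling: every chunk ever written survives in the
        # long-term buffer with equal probability cap / n_seen.
        self.n_seen += 1
        if len(self.reservoir) < self.cap:
            self.reservoir.append(val)
        else:
            j = self.rng.integers(self.n_seen)
            if j < self.cap:
                self.reservoir[j] = val

    def retrieve(self, query, k=4):
        q = query / (np.linalg.norm(query) + 1e-8)
        sims = np.array([k_ @ q for k_ in self.keys])
        top = np.argsort(sims)[-k:][::-1]    # exact top-k, most similar first
        return [self.vals[i] for i in top]
```

Pairing a recent ring buffer with a reservoir like this gives replay both short-horizon coverage and an unbiased long-horizon sample; only the reservoir half is sketched here.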
Animated Token Flow Through TRC²

[Animated figure: token particles flowing through the complete TRC² architecture, color-coded by subsystem: cortex, thalamus, hippocampus, feedback, gate, replay.]

Results

TRC² is evaluated under a task-sequential language-modeling stream: C4 → WikiText-103 → GSM8K (22k optimizer steps total, 4× V100 GPUs). All baselines share the same pipeline.

- 33.5 C4 PPL (best baseline: 68.7)
- 21.7 Wiki PPL (best baseline: 28.9)
- 29.9 GSM PPL (best baseline: 1.7k)
- 0.44 PPL AUFC (best baseline: 1.11)
- 0.12 BLEU AUFC (best baseline: 0.74)

Table 1: Task-Boundary Performance

Scores at task boundaries (steps 10k, 20k, 22k). BLEU at C4/Wiki boundaries, token accuracy on GSM8K.

| Model | Params | d_model | Layers | C4↓ | Wiki↓ | GSM↓ | C4 txt↑ | Wiki txt↑ | GSM %↑ | Tok/s↑ | GB·h↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transformer | 144M | 512 | 28 | 93.17 | 33.83 | 1.7k | 7.90 | 16.15 | 51.42 | 93k | 157 |
| Transformer | 228M | 768 | 20 | 76.27 | 29.94 | 2.8k | 8.06 | 16.78 | 53.09 | 84k | 202 |
| Transformer | 321M | 1024 | 16 | 70.68 | 28.90 | 2.8k | 7.78 | 16.69 | 53.41 | 73k | 266 |
| MoE | 345M | 512 | 16 | 99.28 | 35.71 | 2.3k | 7.32 | 15.96 | 51.01 | 76k | 238 |
| MoE | 441M | 768 | 10 | 94.32 | 34.47 | 6.4k | 6.83 | 14.91 | 50.09 | 78k | 235 |
| MoE | 493M | 1024 | 6 | 136.04 | 36.75 | 18k | 4.70 | 13.69 | 47.46 | 91k | 211 |
| Mamba | 151M | 512 | 28 | 110.15 | 37.31 | 15k | 6.51 | 14.60 | 42.49 | 77k | 229 |
| Mamba | 202M | 768 | 18 | 97.53 | 34.50 | 49k | 5.85 | 14.52 | 43.59 | 76k | 255 |
| Mamba | 245M | 1024 | 12 | 95.25 | 33.01 | 92k | 5.92 | 15.41 | 44.01 | 84k | 226 |
| DeepSeek | 338M | 512 | 14 | 71.59 | 33.09 | 1.8k | 8.54 | 16.58 | 54.20 | 88k | 220 |
| DeepSeek | 308M | 768 | 6 | 80.84 | 32.75 | 6.9k | 7.53 | 15.75 | 52.80 | 118k | 142 |
| DeepSeek | 545M | 1024 | 6 | 68.71 | 33.22 | 8.3k | 7.39 | 16.73 | 53.28 | 87k | 246 |
| TRC² (ours) | 149M | 256 | 24 | 52.48 | 27.59 | 56.36 | 8.31 | 17.44 | 55.49 | 29k | 515 |
| TRC² (ours) | 254M | 512 | 10 | 37.43 | 23.09 | 42.99 | 9.76 | 19.31 | 57.90 | 61k | 259 |
| TRC² (ours) | 226M | 768 | 4 | 38.06 | 21.73 | 31.28 | 9.28 | 18.11 | 57.19 | 103k | 134 |
| TRC² (ours) | 408M | 768 | 8 | 34.29 | 25.04 | 35.70 | 9.84 | 19.16 | 59.03 | 59k | 349 |
| TRC² (ours) | 560M | 1024 | 6 | 33.49 | 26.95 | 29.87 | 9.55 | 18.65 | 58.80 | 60k | 211 |

Table 2: Continual-Learning Retention (AUFC)

Area-under-forgetting-curve at steps 20k and 22k. Lower = better retention.

| Model | Params | PPL 20k↓ | PPL 22k↓ | BLEU 20k↓ | BLEU 22k↓ | TokAcc 20k↓ | TokAcc 22k↓ |
|---|---|---|---|---|---|---|---|
| Transformer 512 | 144M | 1.16 | 1.15 | 2.21 | 1.33 | 0.21 | 0.15 |
| Transformer 768 | 228M | 1.23 | 1.30 | 2.16 | 1.30 | 0.20 | 0.15 |
| MoE 1024 | 493M | 1.11 | 1.22 | 1.22 | 0.74 | 0.20 | 0.14 |
| Mamba 1024 | 245M | 1.12 | 1.34 | 1.50 | 0.93 | 0.21 | 0.15 |
| DeepSeek 1024 | 545M | 1.21 | 1.40 | 1.75 | 1.08 | 0.18 | 0.13 |
| TRC² 256 | 149M | 0.71 | 0.49 | 0.18 | 0.12 | 0.13 | 0.08 |
| TRC² 768-l4 | 226M | 0.63 | 0.44 | 0.20 | 0.12 | 0.13 | 0.08 |
| TRC² 1024 | 560M | 0.70 | 0.48 | 0.18 | 0.12 | 0.13 | 0.08 |

Table 3: Ablation Study (d768-l4)

Upper: last-step scores at 22k. Lower: AUFC retention at 20k and 22k.

| Thal. | Hippo. | C4 PPL | Wiki PPL | GSM PPL | C4 txt | Wiki txt | GSM % | Tok/s | GB·h |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 990 | 68.7 | 31.3 | 4.05 | 15.0 | 57.2 | 103k | 134 |
| ✗ | ✓ | 1285 | 55.3 | 30.9 | 3.23 | 15.5 | 57.2 | 109k | 131 |
| ✓ | ✗ | 2016 | 613 | 27.4 | 1.82 | 3.28 | 57.7 | 147k | 95 |
| ✗ | ✗ | 2050 | 550 | 26.8 | 2.30 | 4.50 | 58.5 | 154k | 87 |

| Thal. | Hippo. | PPL 20k↓ | PPL 22k↓ | BLEU 20k↓ | BLEU 22k↓ | TokAcc 20k↓ | TokAcc 22k↓ |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 0.63 | 0.44 | 0.20 | 0.12 | 0.13 | 0.08 |
| ✗ | ✓ | 0.86 | 0.56 | 0.20 | 0.13 | 0.12 | 0.08 |
| ✓ | ✗ | 1.19 | 0.76 | 0.26 | 0.19 | 0.20 | 0.15 |
| ✗ | ✗ | 1.23 | 0.78 | 0.26 | 0.20 | 0.21 | 0.16 |

The full model (✓/✓) achieves the best retention profile. Removing hippocampal memory causes the largest degradation in AUFC. Removing both increases throughput but sharply degrades retention.

Making Interference Control Part of the Forward Pass

TRC² improves the task-boundary quality frontier: the strongest variants dominate baselines on WikiText-103 and GSM8K while remaining competitive on C4. The improvement is mirrored by stronger boundary BLEU and token accuracy, suggesting gains reflect better retained representational structure.

The retention story is even clearer. TRC² consistently achieves the lowest forgetting area, with its largest margin in perplexity AUFC. This supports the central premise: separating fast adaptation from the main cortical pathway through thalamic modulation, episodic memory, and replay leads to less destructive interference over the task stream.

The ablation clarifies where the advantage comes from. Removing either component can improve some endpoint metrics, but these simplifications degrade cumulative retention. The global modules materially alter how plasticity is distributed across training.

The broader conclusion: continual language modeling benefits from an explicit architectural separation between stable computation and fast, localized plasticity. TRC² shows this can be realized in a modern decoder, improving both quality and retention under a realistic training pipeline.