Efficient Continual Learning in Language Models via
Biologically Grounded Sparse Routing & Fast Correction
Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. We present TRC² (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual adaptation a property of the backbone itself.
TRC² combines stacked cortical columns—each performing grouped-query attention and routed mixture-of-experts computation—with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation.
We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential stream over C4, WikiText-103, and GSM8K, TRC² consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines.
TRC² integrates thalamic modulation and hippocampal episodic memory into the decoder backbone, making selective communication and online consolidation part of the forward computation.
Cortex · Thalamus · Hippocampus
Hippocampal writes are retrospective and surprise-based. Replay is sampled only from past stored chunks. An adaptive controller adjusts consolidation strength online.
TD-surprise · Deferred Write · Replay
Controlled comparisons against Transformer, Mamba, MoE, and DeepSeek baselines under a shared task stream, plus ablations isolating each component's contribution.
C4 · WikiText-103 · GSM8K
TRC² is organized into three interacting subsystems. The cortical columns perform causal sequence modeling. The thalamic router transforms deep-layer activity into a causal modulatory signal. The hippocampal memory provides event-selective retrieval, surprise-based writing, and replay-driven consolidation.
Each column: RMSNorm → GQA Attention (with thalamic Q-modulation + RoPE) → residual → RMSNorm → Routed MoE (SwiGLU, top-k of E experts + shared) → residual → L5 projection. Stacked L deep, split into early and late stacks at ℓinj.
GQA · MoE · SwiGLU · RoPE
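To make the column layout concrete, here is a minimal PyTorch sketch of one column's forward pass. It is a sketch under stated assumptions, not the released implementation: GQA and RoPE are replaced by standard causal multi-head attention for brevity, the thalamic Q-modulation is modeled as a learned multiplicative gate, and all names (CorticalColumn, SwiGLU, f_thal, the default expert counts) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block used as one MoE expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class CorticalColumn(nn.Module):
    """One column: RMSNorm -> attention with thalamic Q-modulation -> residual
    -> RMSNorm -> routed MoE (top-k of E experts + shared expert) -> residual
    -> L5 projection. GQA and RoPE are elided for brevity."""
    def __init__(self, d_model, n_heads, n_experts=8, top_k=2, d_ff=1024):
        super().__init__()
        self.norm_attn = nn.RMSNorm(d_model)   # requires PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_mod = nn.Linear(d_model, d_model, bias=False)
        self.norm_moe = nn.RMSNorm(d_model)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLU(d_model, d_ff) for _ in range(n_experts))
        self.shared = SwiGLU(d_model, d_ff)
        self.top_k = top_k
        self.l5_proj = nn.Linear(d_model, d_model)

    def forward(self, x, f_thal):
        # Attention sub-block: thalamic feedback gates the queries multiplicatively.
        h = self.norm_attn(x)
        q = h * (1.0 + torch.tanh(self.q_mod(f_thal)))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1), device=x.device)
        attn_out, _ = self.attn(q, h, h, attn_mask=causal)
        x = x + attn_out
        # MoE sub-block: per-token top-k routing over E experts plus a shared expert.
        h = self.norm_moe(x)
        weights, idx = self.router(h).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = self.shared(h)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(h)
        x = x + out
        # The L5 projection is what the next column's thalamic router consumes.
        return x, self.l5_proj(x)
```

The only structural commitments in the sketch are the ones named in the card above: pre-norm ordering, query-side thalamic modulation, top-k routing with a shared expert, and a separate L5 projection feeding the next column's router.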
Transforms L5 output C(ℓ) into causal feedback Fthal for the next column's Q. Compresses to rank r, forms local (focal) + diffuse (past-mean) paths gated by surprise, then applies TRN groupwise divisive competition.
Local · Diffuse · TRN · Surprise
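A compact sketch of that pathway under similar assumptions: the low-rank compression, the causal past-mean used for the diffuse path, the sigmoid surprise gate, and the groupwise divisive normalization standing in for TRN competition are illustrative choices, and names such as ThalamicRouter, rank, and n_groups are not from the released code.

```python
import torch
import torch.nn as nn

class ThalamicRouter(nn.Module):
    """Turns the L5 output C(l) into causal feedback F_thal for the next column's
    queries: compress to rank r, mix a local (focal) path with a diffuse
    (past-mean) path under a surprise gate, then apply groupwise divisive
    competition as a TRN stand-in. All concrete forms are assumptions."""
    def __init__(self, d_l5, d_model, rank=32, n_groups=4, eps=1e-6):
        super().__init__()
        self.compress = nn.Linear(d_l5, rank, bias=False)
        self.expand = nn.Linear(rank, d_model, bias=False)
        self.surprise_gate = nn.Linear(rank, 1)
        self.n_groups, self.eps = n_groups, eps

    def forward(self, c_l5):
        z = self.compress(c_l5)                               # (B, T, r) low-rank code
        # Diffuse path: mean over strictly past positions (causal by construction).
        past_count = torch.arange(z.size(1), device=z.device).view(1, -1, 1)
        past_mean = (torch.cumsum(z, dim=1) - z) / past_count.clamp(min=1)
        # Surprise gate: deviation from the past mean shifts weight to the local path.
        s = torch.sigmoid(self.surprise_gate(z - past_mean))
        mixed = s * z + (1.0 - s) * past_mean
        # TRN-style divisive competition within groups of the rank dimension.
        g = mixed.view(*mixed.shape[:-1], self.n_groups, -1)
        g = g / (g.abs().sum(dim=-1, keepdim=True) + self.eps)
        return self.expand(g.flatten(-2))                     # F_thal for the next column
```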
Content-addressable episodic store with chunked exact top-k retrieval. TD critic with fast/slow predictors. Deferred causal writes after backward. Two replay buffers (recent ring + long-term reservoir). Adaptive controller adjusts λrep, BR, ρlong online.
TD · Deferred Write · Replay · Controller
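The retrieval and write discipline in this card can be sketched as follows, with explicit assumptions: exact top-k dot-product retrieval computed chunk by chunk, writes staged during the forward pass and committed only after backward when a surprise score exceeds a threshold, and reservoir sampling for the long-term replay buffer. The TD critic and the recent ring buffer are omitted, and every name (EpisodicMemory, tau, stage, commit) is a placeholder.

```python
import torch

class EpisodicMemory:
    """Content-addressable episodic store with chunked exact top-k retrieval,
    deferred surprise-gated writes, and a long-term reservoir for replay.
    Illustrative sketch; the TD critic and recent ring buffer are omitted."""
    def __init__(self, d_key, capacity=4096, reservoir_size=1024, tau=0.5):
        self.keys = torch.empty(0, d_key)
        self.values = torch.empty(0, d_key)
        self.capacity, self.tau = capacity, tau
        self.reservoir, self.reservoir_size = [], reservoir_size
        self.pending, self.seen = [], 0     # writes staged during the forward pass

    def retrieve(self, queries, k=4, chunk=1024):
        """Exact top-k over all stored keys, scored chunk by chunk to bound memory."""
        if self.keys.size(0) == 0:
            return torch.zeros_like(queries)
        scores = torch.cat([queries @ self.keys[i:i + chunk].T
                            for i in range(0, self.keys.size(0), chunk)], dim=-1)
        w, idx = scores.topk(min(k, scores.size(-1)), dim=-1)
        return (w.softmax(-1).unsqueeze(-1) * self.values[idx]).sum(dim=-2)

    def stage(self, key, value, surprise):
        """Forward pass only records candidates; nothing is visible to retrieval yet."""
        self.pending.append((key.detach(), value.detach(), float(surprise)))

    def commit(self):
        """After backward: write surprising events and update the replay reservoir."""
        for key, value, surprise in self.pending:
            if surprise > self.tau and self.keys.size(0) < self.capacity:
                self.keys = torch.cat([self.keys, key[None]])
                self.values = torch.cat([self.values, value[None]])
            # Reservoir sampling keeps a uniform sample of committed events.
            self.seen += 1
            if len(self.reservoir) < self.reservoir_size:
                self.reservoir.append((key, value))
            else:
                j = int(torch.randint(self.seen, (1,)))
                if j < self.reservoir_size:
                    self.reservoir[j] = (key, value)
        self.pending.clear()
```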
[Interactive figure: token flow through the complete TRC² architecture, color-coded by subsystem: cortex, thalamus, hippocampus, feedback gate, replay.]
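The adaptive controller mentioned in the hippocampal card closes the loop between measured forgetting and consolidation strength. Below is a minimal sketch, assuming forgetting is measured as the rise of held-out probe losses above their best values so far; the gain, the bounds, and the names standing in for λrep, BR, and ρlong are placeholders, not the paper's schedule.

```python
class ReplayController:
    """Online consolidation controller: turns measured forgetting on held-out
    probes from past tasks into updates of the replay weight (lambda_rep), the
    replay batch size (B_R), and the long-term sampling ratio (rho_long).
    Update rule, gain, and bounds are illustrative assumptions."""
    def __init__(self, lam=0.1, batch=8, rho=0.25, gain=0.5):
        self.lam, self.batch, self.rho, self.gain = lam, batch, rho, gain
        self.best = {}                      # best probe loss observed per past task

    def update(self, probe_losses):
        """probe_losses maps a past task name to its current held-out loss."""
        forgetting = 0.0
        for task, loss in probe_losses.items():
            self.best[task] = min(self.best.get(task, loss), loss)
            forgetting += max(0.0, loss - self.best[task])
        forgetting /= max(len(probe_losses), 1)
        # More measured forgetting -> stronger consolidation pressure.
        self.lam = min(1.0, self.lam * (1.0 + self.gain * forgetting))
        self.batch = min(64, max(4, round(self.batch * (1.0 + forgetting))))
        self.rho = min(0.9, self.rho + self.gain * 0.1 * forgetting)
        return self.lam, self.batch, self.rho
```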
TRC² is evaluated under a task-sequential language-modeling stream: C4 → WikiText-103 → GSM8K (22k optimizer steps total, 4× V100 GPUs). All baselines share the same pipeline.
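As a rough picture of that schedule, the loop below switches dataloaders at the boundary steps used for evaluation; the function, the loader interface, and the boundary dictionary are illustrative, and the actual dataloading and optimizer details are not shown.

```python
def task_sequential_stream(loaders, boundaries):
    """Yield (step, task, batch) for a task-sequential schedule such as
    C4 -> WikiText-103 -> GSM8K. `loaders` maps task name to a batch iterator;
    `boundaries` gives each task's final optimizer step, e.g.
    {"c4": 10_000, "wikitext103": 20_000, "gsm8k": 22_000}."""
    step = 0
    for task, last_step in boundaries.items():
        batches = loaders[task]
        while step < last_step:
            yield step, task, next(batches)
            step += 1
```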
Scores at the task boundaries (steps 10k, 20k, 22k): perplexity on each task, BLEU at the C4 and WikiText-103 boundaries, and token accuracy at the GSM8K boundary.
| Model | Params | d_model | Blocks | C4 PPL↓ | Wiki PPL↓ | GSM PPL↓ | C4 BLEU↑ | Wiki BLEU↑ | GSM Acc%↑ | Tok/s↑ | GB·h↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transformer | 144M | 512 | 28 | 93.17 | 33.83 | 1.7k | 7.90 | 16.15 | 51.42 | 93k | 157 |
| Transformer | 228M | 768 | 20 | 76.27 | 29.94 | 2.8k | 8.06 | 16.78 | 53.09 | 84k | 202 |
| Transformer | 321M | 1024 | 16 | 70.68 | 28.90 | 2.8k | 7.78 | 16.69 | 53.41 | 73k | 266 |
| MoE | 345M | 512 | 16 | 99.28 | 35.71 | 2.3k | 7.32 | 15.96 | 51.01 | 76k | 238 |
| MoE | 441M | 768 | 10 | 94.32 | 34.47 | 6.4k | 6.83 | 14.91 | 50.09 | 78k | 235 |
| MoE | 493M | 1024 | 6 | 136.04 | 36.75 | 18k | 4.70 | 13.69 | 47.46 | 91k | 211 |
| Mamba | 151M | 512 | 28 | 110.15 | 37.31 | 15k | 6.51 | 14.60 | 42.49 | 77k | 229 |
| Mamba | 202M | 768 | 18 | 97.53 | 34.50 | 49k | 5.85 | 14.52 | 43.59 | 76k | 255 |
| Mamba | 245M | 1024 | 12 | 95.25 | 33.01 | 92k | 5.92 | 15.41 | 44.01 | 84k | 226 |
| DeepSeek | 338M | 512 | 14 | 71.59 | 33.09 | 1.8k | 8.54 | 16.58 | 54.20 | 88k | 220 |
| DeepSeek | 308M | 768 | 6 | 80.84 | 32.75 | 6.9k | 7.53 | 15.75 | 52.80 | 118k | 142 |
| DeepSeek | 545M | 1024 | 6 | 68.71 | 33.22 | 8.3k | 7.39 | 16.73 | 53.28 | 87k | 246 |
| TRC² (ours) | 149M | 256 | 24 | 52.48 | 27.59 | 56.36 | 8.31 | 17.44 | 55.49 | 29k | 515 |
| TRC² (ours) | 254M | 512 | 10 | 37.43 | 23.09 | 42.99 | 9.76 | 19.31 | 57.90 | 61k | 259 |
| TRC² (ours) | 226M | 768 | 4 | 38.06 | 21.73 | 31.28 | 9.28 | 18.11 | 57.19 | 103k | 134 |
| TRC² (ours) | 408M | 768 | 8 | 34.29 | 25.04 | 35.70 | 9.84 | 19.16 | 59.03 | 59k | 349 |
| TRC² (ours) | 560M | 1024 | 6 | 33.49 | 26.95 | 29.87 | 9.55 | 18.65 | 58.80 | 60k | 211 |
Area under the forgetting curve (AUFC) at steps 20k and 22k; lower values indicate better retention. A computation sketch follows the table.
| Model | Params | PPL AUFC 20k↓ | PPL AUFC 22k↓ | BLEU AUFC 20k↓ | BLEU AUFC 22k↓ | TokAcc AUFC 20k↓ | TokAcc AUFC 22k↓ |
|---|---|---|---|---|---|---|---|
| Transformer 512 | 144M | 1.16 | 1.15 | 2.21 | 1.33 | 0.21 | 0.15 |
| Transformer 768 | 228M | 1.23 | 1.30 | 2.16 | 1.30 | 0.20 | 0.15 |
| MoE 1024 | 493M | 1.11 | 1.22 | 1.22 | 0.74 | 0.20 | 0.14 |
| Mamba 1024 | 245M | 1.12 | 1.34 | 1.50 | 0.93 | 0.21 | 0.15 |
| DeepSeek 1024 | 545M | 1.21 | 1.40 | 1.75 | 1.08 | 0.18 | 0.13 |
| TRC² 256 | 149M | 0.71 | 0.49 | 0.18 | 0.12 | 0.13 | 0.08 |
| TRC² 768-l4 | 226M | 0.63 | 0.44 | 0.20 | 0.12 | 0.13 | 0.08 |
| TRC² 1024 | 560M | 0.70 | 0.48 | 0.18 | 0.12 | 0.13 | 0.08 |
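For reference, here is a minimal sketch of an area-under-forgetting-curve computation, assuming forgetting at each checkpoint is the gap between the current score and the best score seen so far for that task, averaged over checkpoints; the exact normalization used in the table may differ.

```python
def aufc(history, lower_is_better=True):
    """Area under the forgetting curve for one task and one metric.

    `history` holds the evaluation scores for a task at every checkpoint after
    the task was learned. Forgetting at each checkpoint is the gap to the best
    score seen so far; AUFC averages that gap over checkpoints."""
    best, gaps = None, []
    for score in history:
        if best is None:
            best = score
        best = min(best, score) if lower_is_better else max(best, score)
        gaps.append(score - best if lower_is_better else best - score)
    return sum(gaps) / len(gaps) if gaps else 0.0

# Hypothetical C4 perplexities probed after the C4 phase ends:
print(aufc([37.4, 41.0, 45.2, 43.8]))   # drift above the best value raises AUFC
```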
Ablation of the thalamic and hippocampal pathways. Upper table: last-step scores at 22k. Lower table: AUFC retention at 20k and 22k.
| Thal. | Hippo. | C4 PPL↓ | Wiki PPL↓ | GSM PPL↓ | C4 BLEU↑ | Wiki BLEU↑ | GSM Acc%↑ | Tok/s↑ | GB·h↓ |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 990 | 68.7 | 31.3 | 4.05 | 15.0 | 57.2 | 103k | 134 |
| ✗ | ✓ | 1285 | 55.3 | 30.9 | 3.23 | 15.5 | 57.2 | 109k | 131 |
| ✓ | ✗ | 2016 | 613 | 27.4 | 1.82 | 3.28 | 57.7 | 147k | 95 |
| ✗ | ✗ | 2050 | 550 | 26.8 | 2.30 | 4.50 | 58.5 | 154k | 87 |
| Thal. | Hippo. | PPL AUFC 20k↓ | PPL AUFC 22k↓ | BLEU AUFC 20k↓ | BLEU AUFC 22k↓ | TokAcc AUFC 20k↓ | TokAcc AUFC 22k↓ |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 0.63 | 0.44 | 0.20 | 0.12 | 0.13 | 0.08 |
| ✗ | ✓ | 0.86 | 0.56 | 0.20 | 0.13 | 0.12 | 0.08 |
| ✓ | ✗ | 1.19 | 0.76 | 0.26 | 0.19 | 0.20 | 0.15 |
| ✗ | ✗ | 1.23 | 0.78 | 0.26 | 0.20 | 0.21 | 0.16 |
The full model (✓/✓) achieves the best retention profile. Removing hippocampal memory causes the largest degradation in AUFC. Removing both increases throughput but sharply degrades retention.
TRC² improves the task-boundary quality frontier: the strongest variants dominate baselines on WikiText-103 and GSM8K while remaining competitive on C4. The improvement is mirrored by stronger boundary BLEU and token accuracy, suggesting gains reflect better retained representational structure.
The retention story is even clearer. TRC² consistently achieves the lowest forgetting area, with its largest margin in perplexity AUFC. This supports the central premise: separating fast adaptation from the main cortical pathway through thalamic modulation, episodic memory, and replay leads to less destructive interference over the task stream.
The ablation clarifies where the advantage comes from. Removing either component can improve some endpoint metrics, but these simplifications degrade cumulative retention. The global modules materially alter how plasticity is distributed across training.
The broader conclusion: continual language modeling benefits from an explicit architectural separation between stable computation and fast, localized plasticity. TRC² shows this can be realized in a modern decoder, improving both quality and retention under a realistic training pipeline.