arXiv:2602.22479v1 · cs.LG · Feb 2026

TRC²
Thalamically Routed
Cortical Columns

Efficient Continual Learning in Language Models via
Biologically Grounded Sparse Routing & Fast Correction

Afshin Khadangi · SnT, University of Luxembourg
Working Paper · 169M params
~2.88B tokens trained

Continual learning is a core requirement for deployed language models, yet standard training pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while stability-improving methods frequently increase latency and memory in ways that don't scale.

We introduce TRC² (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC² combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback — together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters.

The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. Across language modeling and continual learning benchmarks, TRC² improves the stability–plasticity tradeoff at comparable compute.

Core thesis: Continual learning should be an architectural property, not a bolt-on procedure. Plasticity should be localized in fast mechanisms while slower representational structures remain stable.
Three Key Contributions
01

The TRC² Architecture

A decoder-only backbone combining sparse thalamic top-k routing over cortical columns with biologically grounded mechanisms for modulation, prediction, memory, feedback, and fast correction.

02

Sparse Chunk-Parallel Implementation

Topology-aware routing, chunk-level computation, and memory-aware execution with optional activation checkpointing for efficient training on modern accelerators.

03

Continual Learning Evaluation Stack

Distributed multi-GPU training, standardized logging, and task-wise evaluations tracking forgetting and forward transfer under streaming domain shifts.

Seven Brain-Inspired Subsystems

Each subsystem is independently toggleable and draws from a distinct neuroscience principle. Together they form a looped layer structure that routes, modulates, predicts, remembers, and corrects.

DA · ACh · NE

Neuromodulator Controller

Estimates dopamine, acetylcholine, and norepinephrine signals from running input statistics. Controls routing temperature, top-down/bottom-up balance, and global cortical gain.

Modulation
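As a rough illustration of this controller, the sketch below derives three sigmoid control signals from running deviation statistics. The feature choices, weight shapes, and function name are assumptions, not the paper's implementation.

```python
import numpy as np

def modulator_controller(u, ema_mean, ema_var, w1, w2, decay=0.99):
    """Hypothetical sketch of the neuromodulator controller: EMA
    deviation statistics of the layer input feed a tiny 2-layer MLP
    whose sigmoid outputs stand in for s_route, s_pred, s_gain in
    [0, 1]. All shapes and features here are illustrative assumptions."""
    # Update exponential moving averages of input statistics.
    m, v = u.mean(), u.var()
    ema_mean = decay * ema_mean + (1 - decay) * m
    ema_var = decay * ema_var + (1 - decay) * v
    # Deviation features: how far this chunk departs from the running stats.
    feats = np.array([m - ema_mean, v - ema_var, np.abs(u).mean()])
    # 2-layer MLP -> three control signals squashed to (0, 1).
    h = np.tanh(feats @ w1)
    s = 1.0 / (1.0 + np.exp(-(h @ w2)))
    return s, ema_mean, ema_var
```

Per the description above, such signals would then set the routing temperature, the prediction-error blend, and the global cortical gain.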
ε = U − P̂

Predictive Cortex

Implements Rao & Ballard predictive coding. A causal convolution generates prior predictions; the cortex processes prediction errors rather than raw signals. ACh dynamically weights the blend.

Prediction
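A minimal sketch of the error blend, assuming a hand-rolled causal convolution as the predictor: s_pred interpolates between passing the raw signal and passing the pure prediction error, matching Û = U − (1 − s_pred)·P̃. Kernel and shapes are illustrative only.

```python
import numpy as np

def predictive_blend(u, kernel, s_pred):
    """Sketch of the Rao-Ballard-style error blend: a causal convolution
    P̃ predicts each position from strictly earlier ones, and the cortex
    input is Û = U − (1 − s_pred)·P̃, so the ACh-like s_pred interpolates
    between raw input (s_pred = 1) and pure prediction error (s_pred = 0)."""
    T, d = u.shape
    p = np.zeros_like(u)
    for t in range(T):
        for j, w in enumerate(kernel, start=1):  # tap j looks back j steps
            if t - j >= 0:
                p[t] += w * u[t - j]
    return u - (1.0 - s_pred) * p
```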
top-k

Thalamic Router

Chunk-level sparse top-k routing with a topology-aware 2D prior. Encourages temporal continuity in column selection, reducing parameter interference during streaming updates.

Routing
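The routing step can be sketched as follows. This is an assumed minimal form: tokens are mean-pooled into chunks, columns are scored by Q·Kᵀ plus an additive bias standing in for the 2D topology prior, and only the top-k columns per chunk receive softmax weight.

```python
import numpy as np

def route_chunks(u, chunk, wq, keys, topo_prior, k):
    """Sketch of chunk-level top-k thalamic routing (shapes assumed).
    The real 2D topology prior is not modeled; any additive bias of
    shape (n_chunks, n_cols) stands in for it here."""
    T, d = u.shape
    n_chunks = T // chunk
    pooled = u[: n_chunks * chunk].reshape(n_chunks, chunk, d).mean(axis=1)
    logits = (pooled @ wq) @ keys.T + topo_prior   # (n_chunks, n_cols)
    idx = np.argsort(-logits, axis=1)[:, :k]       # top-k column indices
    top = np.take_along_axis(logits, idx, axis=1)
    top -= top.max(axis=1, keepdims=True)          # numerically stable softmax
    w = np.exp(top)
    w /= w.sum(axis=1, keepdims=True)
    return idx, w
```

Because routing is decided per chunk rather than per token, neighboring tokens share column assignments, which is the temporal-continuity property the prior is meant to encourage.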
E I

Cortical Columns

Compact microcircuits with selective state-space updates, excitatory/inhibitory gating (SST·PV·VIP interneurons), adaptive membrane filtering, and causal within-chunk processing.

Computation
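A toy sketch of the excitatory/inhibitory gating motif, assuming a simple multiplicative form: a rectified excitatory drive is suppressed by a sigmoid inhibitory gate. The weight names and exact interaction are assumptions, loosely standing in for the SST·PV·VIP circuit described above.

```python
import numpy as np

def ei_gate(x, w_exc, w_inh):
    """Toy sketch of E/I gating inside a column: rectified excitatory
    drive is multiplicatively scaled down by a sigmoid inhibitory gate.
    The exact form is an illustrative assumption."""
    e = np.maximum(x @ w_exc, 0.0)             # excitatory drive (ReLU)
    i = 1.0 / (1.0 + np.exp(-(x @ w_inh)))     # inhibitory gate in (0, 1)
    return e * (1.0 - i)                       # inhibition scales E down
```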
Hopfield Net

Associative Memory

Modern Hopfield Network with learned engram patterns. Chunk-level queries retrieve content-addressable long-range context, providing unlimited-range binding beyond lateral convolutions.

Memory
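The retrieval step can be sketched as a single softmax-attention update over stored patterns, the standard modern-Hopfield form (Ramsauer et al.-style); the inverse temperature beta and shapes are assumptions.

```python
import numpy as np

def hopfield_retrieve(queries, patterns, beta=8.0):
    """Sketch of modern-Hopfield retrieval over learned engram patterns
    Ξ: one softmax-attention update returns, for each chunk query, a
    content-addressable mixture of stored patterns. With large beta the
    retrieval snaps to the nearest stored pattern."""
    scores = beta * queries @ patterns.T           # (n_q, n_patterns)
    scores -= scores.max(axis=1, keepdims=True)    # stable softmax
    a = np.exp(scores)
    a /= a.sum(axis=1, keepdims=True)
    return a @ patterns
```

Because retrieval addresses stored content directly, its range is independent of sequence distance, which is what "unlimited-range binding" refers to above.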
apical basal

Dendritic Readout

Two-compartment neuron model: basal compartment receives bottom-up cortical drive while apical receives top-down hippocampal context. A sigmoid plateau potential gates interaction — matching Larkum et al. 1999.

Feedback
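A minimal sketch of the two-compartment gating, assuming a multiplicative 1 + plateau form (the precise interaction in the paper may differ): bottom-up drive is amplified when top-down context activates the sigmoid plateau.

```python
import numpy as np

def dendritic_readout(basal, apical, w_plateau):
    """Sketch of the two-compartment readout: bottom-up (basal) drive is
    gated by a sigmoid 'plateau potential' computed from top-down
    (apical) context, echoing the Larkum et al. (1999) coincidence
    mechanism. The 1 + plateau form is an illustrative assumption."""
    plateau = 1.0 / (1.0 + np.exp(-(apical @ w_plateau)))  # gate in (0, 1)
    return basal * (1.0 + plateau)  # apical agreement amplifies basal drive
```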
fast Δ · rank-r · V·Uᵀ

Cerebellar Corrector

Low-rank fast-weight corrective pathway. Computes rank-r residuals from both the normalized input and cortex output, enabling rapid online adjustment without rewriting slow cortical parameters.

Correction
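The correction Δ = SiLU(Z)·V·Uᵀ can be sketched as below; the shapes of the fast factors and the construction of Z from the concatenated input and cortex output are assumptions. The key property is that the residual is confined to rank r, so online adaptation touches only the small factors.

```python
import numpy as np

def cerebellar_delta(z, V, U):
    """Sketch of the low-rank fast-weight correction Δ = SiLU(Z)·V·Uᵀ
    (shapes assumed): Z stacks features of the normalized input and the
    cortex output; V ∈ ℝ^{p×r} and U ∈ ℝ^{d×r} are the fast factors, so
    online updates adjust only (p + d)·r parameters rather than the
    slow cortical weights."""
    s = z / (1.0 + np.exp(-z))   # SiLU(Z) = Z · sigmoid(Z)
    return s @ V @ U.T           # rank-r residual, shape (T, d)
```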
The TRC² Block: Signal Flow

Each TRC² layer follows a structured forward pass: normalize, modulate, predict, route, compute in parallel columns, refine routing, correct, and merge via residual connections.

Input
X ∈ ℝ^{B×T×d}
Token + positional embeddings → RMSNorm
① Neuromodulator
ModCtrl(U) → s_route, s_pred, s_gain
EMA deviation stats → 2-layer MLP → 3 control signals ∈ [0,1]
② Predictive Coding
Û = U − (1 − s_pred)·P̃
Causal conv prediction → error blend → aux ℒ_pred
③ Thalamic Router
I, R, S = TopK(L, k)
Chunk-pool → Q·Kᵀ + topology prior → top-k softmax
④ Associative Memory
C_mem = HopfieldRetrieve(Ū, Ξ)
Chunk queries → normalized Hopfield → content-addressable context
⑤ Parallel Cortex
Y = Cortex(Û, I, R, C_mem)
Dense projection → E/I gating → adaptive membrane → causal conv → dendritic readout → routed mixture
⑥ Routing Refinement
R′ = softmax(S + α_fb·S_fb)
Cortex output → feedback projection → refine weights → 2nd cortex pass
⑦ Cerebellar Correction
Δ = SiLU(Z)·V·Uᵀ
Low-rank fast-weight residual from [Û; Y] → rank-r correction
Output
X̃ = X + Drop(g_gain⊙Y + Δ)
Global gain modulation → residual + dropout → RMSNorm → SwiGLU FFN → residual
U = RMSNorm(X),   (s_route, s_pred, s_gain) = ModCtrl(U),   Û = U − (1−s_pred)·P̃
(I, R, S, ℒ_route) = Router(Û),   C_mem = AssocMem(Ū),   Y = Cortex(Û, I, R, C_mem)
X̃ = X + Drop(g_gain⊙Y + Δ),   X⁺ = X̃ + Drop(SwiGLU(RMSNorm(X̃)))
ℒ_train = ℒ_CE + 0.1·Σ_ℓ ℒ_pred^(ℓ) + Σ_ℓ ℒ_route^(ℓ)
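The layer equations above can be tied together in a toy end-to-end skeleton. Every subsystem is replaced by a trivial stand-in (tanh cortex, one-step causal "prediction", scaled-identity Δ, ReLU FFN); only the order of operations and the residual bookkeeping reflect the block, not the real modules.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def trc2_block(x, s_pred=0.5, s_gain=0.5):
    """Toy skeleton of the TRC² layer equations with stand-in modules.
    Illustrates signal flow and residual structure only."""
    u = rmsnorm(x)                      # U = RMSNorm(X)
    p = np.roll(u, 1, axis=0)
    p[0] = 0.0                          # causal prediction P̃ (stand-in)
    u_hat = u - (1.0 - s_pred) * p      # Û = U − (1 − s_pred)·P̃
    y = np.tanh(u_hat)                  # Cortex(Û, I, R, C_mem) stand-in
    delta = 0.01 * u_hat                # cerebellar Δ stand-in
    x_tilde = x + s_gain * y + delta    # X̃ = X + Drop(g_gain⊙Y + Δ)
    ffn = np.maximum(rmsnorm(x_tilde), 0.0)   # SwiGLU FFN stand-in
    return x_tilde + ffn                # X⁺ = X̃ + FFN(RMSNorm(X̃))
```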
Animated Token Flow Through TRC² Block

[Interactive animation on the project page: token particles flow through the complete TRC² block — from input normalization through thalamic routing, into active cortical columns with E/I gating, past the Hopfield associative memory, through dendritic readout and cortico-thalamic feedback, and finally the cerebellar fast-weight corrector merging into the residual stream.]
Results

Evaluated on C4, WikiText-103, and LAMBADA with 4×V100 GPUs, 2.88B tokens, against parameter-matched Transformer and Mamba baselines. TRC² achieves dramatically lower perplexity and higher BLEU, with substantially reduced continual-learning forgetting.

Table 1 — Evaluation Performance & Efficiency

| Model | Params | d_model | PPL C4 ↓ | PPL Wiki ↓ | PPL LAM ↓ | BLEU C4 ↑ | BLEU Wiki ↑ | BLEU LAM ↑ | Tok/s ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | 162M | 768 | 60.70 | 215.18 | 105.72 | 8.12 | 8.23 | 5.09 | ~127k |
| Mamba | 176M | 768 | 70.45 | 357.67 | 116.73 | 6.90 | 2.87 | 3.97 | ~108k |
| TRC² (ours) | 169M | 512 | 2.00 | 2.56 | 2.02 | 71.66 | 66.57 | 70.07 | ~57k |
[Bar chart — Perplexity on C4 (lower is better): Transformer, Mamba, TRC² (ours)]
[Bar chart — BLEU on C4 (higher is better): Transformer, Mamba, TRC² (ours)]

Table 2 — Continual Learning: Average Forgetting

| Model | Last-Step PPL ↓ | Last-Step TokAcc ↓ | Last-Step BLEU ↓ | AUC PPL ↓ | AUC TokAcc ↓ | AUC BLEU ↓ |
|---|---|---|---|---|---|---|
| Transformer | 0.0000 | 0.0014 | 0.3757 | 0.0669 | 0.0008 | 0.1684 |
| Mamba | 0.0000 | 0.0006 | 0.0900 | 0.3371 | 0.0011 | 0.1957 |
| TRC² (ours) | 0.0110 | 0.0010 | 0.0435 | 0.0018 | 0.0008 | 0.0981 |
Key insight: TRC² shows 37× lower normalized PPL forgetting AUC than Transformer and 187× lower than Mamba — the model retains earlier behavior consistently over the full stream, not just at the final step.
Making Interference Control Part of the Forward Pass

The results support TRC²'s core design claim: continual learning improves when plasticity is allocated to a small, explicit pathway while keeping most representational structure stable. The normalized forgetting AUC — tracking behavior over the full training stream — shows markedly lower forgetting than both baselines.

TRC² trades raw throughput for structured sparsity and online-correctable computation. The chunked routing scheme amortizes router overhead, but end-to-end performance remains sensitive to kernel fusion and memory layout. The favorable scaling regime appears when routing decisions are stable across neighboring tokens and column-local scans stay contiguous in memory.

Several mechanisms likely contribute. Topology-aware routing encourages temporal continuity in column selection, reducing parameter interference; excitatory–inhibitory gating suppresses unstable activations before they propagate through the residual stream; and the cerebellar corrector provides fast, stream-driven adjustment without rewriting slower parameters.

Future work should extend evaluation to larger scales and longer contexts, study router stability under harder non-stationary streams, and couple the corrective pathway with deployment-time constraints for bounded, interpretable, and reversible adaptation.