Efficient Continual Learning in Language Models via
Biologically Grounded Sparse Routing & Fast Correction
Continual learning is a core requirement for deployed language models, yet standard training pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while stability-improving methods frequently increase latency and memory in ways that don't scale.
We introduce TRC² (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC² combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback — together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters.
The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. Across language modeling and continual learning benchmarks, TRC² improves the stability–plasticity tradeoff at comparable compute.
- A decoder-only backbone combining sparse thalamic top-k routing over cortical columns with biologically grounded mechanisms for modulation, prediction, memory, feedback, and fast correction.
- Topology-aware routing, chunk-level computation, and memory-aware execution with optional activation checkpointing for efficient training on modern accelerators.
- Distributed multi-GPU training, standardized logging, and task-wise evaluations tracking forgetting and forward transfer under streaming domain shifts.
Each subsystem is independently toggleable and draws from a distinct neuroscience principle. Together they form a looped layer structure that routes, modulates, predicts, remembers, and corrects.
**Modulation:** Estimates dopamine, acetylcholine, and norepinephrine signals from running input statistics. Controls routing temperature, top-down/bottom-up balance, and global cortical gain.
**Prediction:** Implements Rao & Ballard-style predictive coding. A causal convolution generates prior predictions; the cortex processes prediction errors rather than raw signals. ACh dynamically weights the blend.
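A toy version of this predictive-coding input transform can make the idea concrete. The sketch below uses a scalar per-lag kernel for clarity (the real pathway would presumably use a learned depthwise causal convolution); function and parameter names are illustrative, not from the implementation:

```python
import numpy as np

def causal_conv_prediction(x, kernel):
    """Predict each timestep's features from strictly past timesteps."""
    T, d = x.shape
    k = kernel.shape[0]
    pred = np.zeros_like(x)
    for t in range(T):
        for j in range(1, k + 1):      # look back up to k past steps
            if t - j >= 0:
                pred[t] += kernel[j - 1] * x[t - j]
    return pred

def prediction_error_input(x, kernel, ach=0.7):
    """Blend raw input with prediction error; ach in [0, 1] weights the error term."""
    err = x - causal_conv_prediction(x, kernel)
    return ach * err + (1.0 - ach) * x
```

At the first timestep there is no past context, so the prediction is zero and the error reduces to the raw input.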
**Routing:** Chunk-level sparse top-k routing with a topology-aware 2D prior. Encourages temporal continuity in column selection, reducing parameter interference during streaming updates.
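A minimal sketch of such a router, assuming columns are laid out on a 2D grid and the prior biases each chunk toward columns near those selected for the previous chunk (all names, the grid layout, and the distance-based bonus are illustrative assumptions, not the released implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_chunks(chunk_feats, router_w, prev_idx=None, grid=(4, 4),
                 top_k=2, prior_strength=1.0, temperature=1.0):
    """Top-k column selection per chunk, biased toward columns close
    (on a hypothetical 2D grid) to the previously active columns."""
    n_cols = router_w.shape[1]
    logits = chunk_feats @ router_w / temperature          # (chunks, n_cols)
    if prev_idx is not None:
        # 2D coordinates of each column on the grid
        coords = np.stack(np.unravel_index(np.arange(n_cols), grid), axis=1)
        # distance from each column to the nearest previously active column
        d = np.linalg.norm(coords[:, None, :] - coords[prev_idx][None, :, :],
                           axis=-1).min(axis=1)
        logits = logits - prior_strength * d               # nearby columns get a bonus
    top = np.argsort(logits, axis=-1)[..., -top_k:]        # indices of top-k columns
    gates = softmax(np.take_along_axis(logits, top, axis=-1))
    return top, gates
```

Because the prior only shifts logits, it nudges rather than forces continuity: a strong enough feature match can still activate a distant column.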
**Computation:** Compact microcircuits with selective state-space updates, excitatory/inhibitory gating (SST, PV, and VIP interneurons), adaptive membrane filtering, and causal within-chunk processing.
**Memory:** Modern Hopfield network with learned engram patterns. Chunk-level queries retrieve content-addressable long-range context, providing unlimited-range binding beyond lateral convolutions.
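The retrieval step of a modern Hopfield network fits in a few lines. This is a minimal version of the standard Ramsauer et al. one-step update; the learned engram patterns and chunk-level queries are assumed inputs here:

```python
import numpy as np

def hopfield_retrieve(queries, patterns, beta=1.0):
    """One-step modern Hopfield update: softmax(beta * Q P^T) P
    retrieves the stored pattern(s) most similar to each query."""
    attn = queries @ patterns.T * beta
    attn = attn - attn.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(attn)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ patterns
```

With a high inverse temperature `beta`, a query that matches a stored pattern retrieves it almost exactly; lower `beta` yields a soft mixture of patterns.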
**Feedback:** Two-compartment neuron model: the basal compartment receives bottom-up cortical drive while the apical compartment receives top-down hippocampal context. A sigmoid plateau potential gates their interaction, following Larkum et al. (1999).
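A minimal sketch of such a two-compartment readout, assuming a learned projection `w_plateau` (a hypothetical name) produces the plateau gate from the apical input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dendritic_readout(basal, apical, w_plateau, threshold=0.0):
    """Two-compartment readout: bottom-up basal drive always passes through,
    while top-down apical context is admitted only when a sigmoid 'plateau
    potential' opens (a simple parametrization of Larkum-style coupling)."""
    plateau = sigmoid(apical @ w_plateau - threshold)   # (T, 1) gate in (0, 1)
    return basal + plateau * apical
```

Driving `threshold` high silences the apical pathway entirely (output equals the basal drive); driving it low admits the full top-down context.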
**Correction:** Low-rank fast-weight corrective pathway. Computes rank-r residuals from both the normalized input and the cortex output, enabling rapid online adjustment without rewriting slow cortical parameters.
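One plausible form of such a rank-r corrector is sketched below (parameter names and initialization are illustrative assumptions; zero-initializing the up-projection makes the pathway a no-op until its factors are updated online):

```python
import numpy as np

class FastWeightCorrector:
    """Rank-r corrective pathway (sketch). Computes a low-rank residual
    from the normalized input and the cortex output; only the small
    U, V factors need online updates while slow weights stay frozen."""

    def __init__(self, d_model, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(scale=0.02, size=(2 * d_model, rank))  # down-projection
        self.V = np.zeros((rank, d_model))                         # up-projection, zero-init

    def __call__(self, x_norm, cortex_out):
        z = np.concatenate([x_norm, cortex_out], axis=-1)  # (T, 2*d_model)
        return cortex_out + (z @ self.U) @ self.V          # rank-r residual
```

The rank-r bottleneck bounds how much the fast pathway can deviate per step, which is what keeps rapid adaptation from destabilizing the slow parameters.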
Each TRC² layer follows a structured forward pass: normalize, modulate, predict, route, compute in parallel columns, refine routing, correct, and merge via residual connections.
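The step ordering above can be expressed as a structural sketch, with each subsystem abstracted as a callable (this shows only the composition and data flow, not the actual implementation):

```python
import numpy as np

def trc2_layer_forward(x, norm, modulate, predict, route, columns,
                       refine, correct):
    """Hypothetical composition of one TRC² layer in the order described
    above; each argument is a callable standing in for a subsystem."""
    h = norm(x)
    h, mods = modulate(h)       # neuromodulatory gains, routing temperature
    h = predict(h, mods)        # replace raw signal with prediction error
    assign = route(h, mods)     # sparse top-k column assignment per chunk
    h = columns(h, assign)      # parallel column computation with E/I gating
    assign = refine(assign, h)  # cortico-thalamic feedback pass
    h = correct(x, h)           # fast-weight residual correction
    return x + h                # merge into the residual stream
```

Keeping the correction and merge at the end of the loop means the fast pathway only ever adjusts the layer's residual contribution, leaving the slow subsystems' outputs untouched.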
A token's path through the complete TRC² block runs from input normalization through thalamic routing, into active cortical columns with E/I gating, past the Hopfield associative memory, through dendritic readout and cortico-thalamic feedback, and finally through the cerebellar fast-weight corrector, which merges into the residual stream.
Evaluated on C4, WikiText-103, and LAMBADA (4×V100 GPUs, 2.88B training tokens) against parameter-matched Transformer and Mamba baselines. TRC² achieves markedly lower perplexity and higher BLEU, with substantially reduced continual-learning forgetting.
| Model | Params | d_model | PPL C4 ↓ | PPL Wiki ↓ | PPL LAM ↓ | BLEU C4 ↑ | BLEU Wiki ↑ | BLEU LAM ↑ | Tok/s ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | 162M | 768 | 60.70 | 215.18 | 105.72 | 8.12 | 8.23 | 5.09 | ~127k |
| Mamba | 176M | 768 | 70.45 | 357.67 | 116.73 | 6.90 | 2.87 | 3.97 | ~108k |
| TRC² (ours) | 169M | 512 | 2.00 | 2.56 | 2.02 | 71.66 | 66.57 | 70.07 | ~57k |
Continual-learning forgetting per metric, measured at the last training step and as a normalized AUC over the full stream (lower is better):

| Model | Last PPL ↓ | Last TokAcc ↓ | Last BLEU ↓ | AUC PPL ↓ | AUC TokAcc ↓ | AUC BLEU ↓ |
|---|---|---|---|---|---|---|
| Transformer | 0.0000 | 0.0014 | 0.3757 | 0.0669 | 0.0008 | 0.1684 |
| Mamba | 0.0000 | 0.0006 | 0.0900 | 0.3371 | 0.0011 | 0.1957 |
| TRC² (ours) | 0.0110 | 0.0010 | 0.0435 | 0.0018 | 0.0008 | 0.0981 |
The results support TRC²'s core design claim: continual learning improves when plasticity is allocated to a small, explicit pathway while keeping most representational structure stable. The normalized forgetting AUC — tracking behavior over the full training stream — shows markedly lower forgetting than both baselines.
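For concreteness, one plausible way to compute such a normalized forgetting AUC for a higher-is-better score (e.g. token accuracy or BLEU) is sketched below; the paper's exact normalization is not specified here, so this is an assumption:

```python
import numpy as np

def normalized_forgetting_auc(history):
    """history: (steps, tasks) array of a per-task score recorded after each
    training step. Forgetting at a step is the drop from that task's best
    score so far, normalized by the best; the AUC averages this over the
    whole stream (one plausible reading of the metric)."""
    best_so_far = np.maximum.accumulate(history, axis=0)
    forgetting = best_so_far - history                 # >= 0, per step and task
    denom = np.maximum(best_so_far, 1e-12)             # avoid division by zero
    return float((forgetting / denom).mean())
```

A model that never regresses on any earlier task scores exactly zero; transient dips during the stream are penalized even if the final checkpoint recovers, which is what distinguishes the AUC from the last-step number.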
TRC² trades raw throughput for structured sparsity and online-correctable computation. The chunked routing scheme amortizes router overhead, but end-to-end performance remains sensitive to kernel fusion and memory layout. The favorable scaling regime appears when routing decisions are stable across neighboring tokens and column-local scans stay contiguous in memory.
Several mechanisms likely contribute: topology-aware routing encourages temporal continuity, reducing parameter interference. Excitatory-inhibitory gating suppresses unstable activations before residual propagation. The cerebellar corrector provides fast stream-driven adjustment without rewriting slower parameters.
Future work should extend evaluation to larger scales and longer contexts, study router stability under harder non-stationary streams, and couple the corrective pathway with deployment-time constraints for bounded, interpretable, and reversible adaptation.