Efficient Continual Learning in Language Models via
Biologically Grounded Sparse Routing & Fast Correction
Continual learning is a core requirement for deployed language models, yet standard training pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while stability-improving methods frequently increase latency and memory in ways that don't scale.
We introduce TRC² (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC² combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback — together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters.
The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. Across language modeling and continual learning benchmarks, TRC² improves the stability–plasticity tradeoff at comparable compute.
- A decoder-only backbone combining sparse thalamic top-k routing over cortical columns with biologically grounded mechanisms for modulation, prediction, memory, feedback, and fast correction.
- Topology-aware routing, chunk-level computation, and memory-aware execution with optional activation checkpointing for efficient training on modern accelerators.
- Distributed multi-GPU training, standardized logging, and task-wise evaluations tracking forgetting and forward transfer under streaming domain shifts.
Each subsystem is independently toggleable and draws from a distinct neuroscience principle. Together they form a looped layer structure that routes, modulates, predicts, remembers, and corrects.
**Modulation:** Estimates dopamine, acetylcholine, and norepinephrine signals from running input statistics. Controls routing temperature, top-down/bottom-up balance, and global cortical gain.
**Prediction:** Implements Rao & Ballard-style predictive coding. A causal convolution generates prior predictions; the cortex processes prediction errors rather than raw signals. ACh dynamically weights the blend.
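A toy version of this predictive-coding input transform can make the idea concrete. The sketch below uses a scalar per-lag kernel for clarity (the real pathway would presumably use a learned depthwise causal convolution); function and parameter names are illustrative, not from the implementation:

```python
import numpy as np

def causal_conv_prediction(x, kernel):
    """Predict each timestep's features from strictly past timesteps."""
    T, d = x.shape
    k = kernel.shape[0]
    pred = np.zeros_like(x)
    for t in range(T):
        for j in range(1, k + 1):      # look back up to k past steps
            if t - j >= 0:
                pred[t] += kernel[j - 1] * x[t - j]
    return pred

def prediction_error_input(x, kernel, ach=0.7):
    """Blend raw input with prediction error; ach in [0, 1] weights the error term."""
    err = x - causal_conv_prediction(x, kernel)
    return ach * err + (1.0 - ach) * x
```

At the first timestep there is no past context, so the prediction is zero and the error reduces to the raw input.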
**Routing:** Chunk-level sparse top-k routing with a topology-aware 2D prior. Encourages temporal continuity in column selection, reducing parameter interference during streaming updates.
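A minimal sketch of such a router, assuming columns are laid out on a 2D grid and the prior biases each chunk toward columns near those selected for the previous chunk (all names, the grid layout, and the distance-based bonus are illustrative assumptions, not the released implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_chunks(chunk_feats, router_w, prev_idx=None, grid=(4, 4),
                 top_k=2, prior_strength=1.0, temperature=1.0):
    """Top-k column selection per chunk, biased toward columns close
    (on a hypothetical 2D grid) to the previously active columns."""
    n_cols = router_w.shape[1]
    logits = chunk_feats @ router_w / temperature          # (chunks, n_cols)
    if prev_idx is not None:
        # 2D coordinates of each column on the grid
        coords = np.stack(np.unravel_index(np.arange(n_cols), grid), axis=1)
        # distance from each column to the nearest previously active column
        d = np.linalg.norm(coords[:, None, :] - coords[prev_idx][None, :, :],
                           axis=-1).min(axis=1)
        logits = logits - prior_strength * d               # nearby columns get a bonus
    top = np.argsort(logits, axis=-1)[..., -top_k:]        # indices of top-k columns
    gates = softmax(np.take_along_axis(logits, top, axis=-1))
    return top, gates
```

Because the prior only shifts logits, it nudges rather than forces continuity: a strong enough feature match can still activate a distant column.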
**Computation:** Compact microcircuits with selective state-space updates, excitatory/inhibitory gating (SST, PV, and VIP interneurons), adaptive membrane filtering, and causal within-chunk processing.
**Memory:** Modern Hopfield network with learned engram patterns. Chunk-level queries retrieve content-addressable long-range context, providing unlimited-range binding beyond lateral convolutions.
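The retrieval step of a modern Hopfield network fits in a few lines. This is a minimal version of the standard Ramsauer et al. one-step update; the learned engram patterns and chunk-level queries are assumed inputs here:

```python
import numpy as np

def hopfield_retrieve(queries, patterns, beta=1.0):
    """One-step modern Hopfield update: softmax(beta * Q P^T) P
    retrieves the stored pattern(s) most similar to each query."""
    attn = queries @ patterns.T * beta
    attn = attn - attn.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(attn)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ patterns
```

With a high inverse temperature `beta`, a query that matches a stored pattern retrieves it almost exactly; lower `beta` yields a soft mixture of patterns.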
**Feedback:** Two-compartment neuron model: the basal compartment receives bottom-up cortical drive while the apical compartment receives top-down hippocampal context. A sigmoid plateau potential gates their interaction, following Larkum et al. (1999).
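A minimal sketch of such a two-compartment readout, assuming a learned projection `w_plateau` (a hypothetical name) produces the plateau gate from the apical input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dendritic_readout(basal, apical, w_plateau, threshold=0.0):
    """Two-compartment readout: bottom-up basal drive always passes through,
    while top-down apical context is admitted only when a sigmoid 'plateau
    potential' opens (a simple parametrization of Larkum-style coupling)."""
    plateau = sigmoid(apical @ w_plateau - threshold)   # (T, 1) gate in (0, 1)
    return basal + plateau * apical
```

Driving `threshold` high silences the apical pathway entirely (output equals the basal drive); driving it low admits the full top-down context.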
**Correction:** Low-rank fast-weight corrective pathway. Computes rank-r residuals from both the normalized input and the cortex output, enabling rapid online adjustment without rewriting slow cortical parameters.
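One plausible form of such a rank-r corrector is sketched below (parameter names and initialization are illustrative assumptions; zero-initializing the up-projection makes the pathway a no-op until its factors are updated online):

```python
import numpy as np

class FastWeightCorrector:
    """Rank-r corrective pathway (sketch). Computes a low-rank residual
    from the normalized input and the cortex output; only the small
    U, V factors need online updates while slow weights stay frozen."""

    def __init__(self, d_model, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(scale=0.02, size=(2 * d_model, rank))  # down-projection
        self.V = np.zeros((rank, d_model))                         # up-projection, zero-init

    def __call__(self, x_norm, cortex_out):
        z = np.concatenate([x_norm, cortex_out], axis=-1)  # (T, 2*d_model)
        return cortex_out + (z @ self.U) @ self.V          # rank-r residual
```

The rank-r bottleneck bounds how much the fast pathway can deviate per step, which is what keeps rapid adaptation from destabilizing the slow parameters.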
Each TRC² layer follows a structured forward pass: normalize, modulate, predict, route, compute in parallel columns, refine routing, correct, and merge via residual connections.
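The step ordering above can be expressed as a structural sketch, with each subsystem abstracted as a callable (this shows only the composition and data flow, not the actual implementation):

```python
import numpy as np

def trc2_layer_forward(x, norm, modulate, predict, route, columns,
                       refine, correct):
    """Hypothetical composition of one TRC² layer in the order described
    above; each argument is a callable standing in for a subsystem."""
    h = norm(x)
    h, mods = modulate(h)       # neuromodulatory gains, routing temperature
    h = predict(h, mods)        # replace raw signal with prediction error
    assign = route(h, mods)     # sparse top-k column assignment per chunk
    h = columns(h, assign)      # parallel column computation with E/I gating
    assign = refine(assign, h)  # cortico-thalamic feedback pass
    h = correct(x, h)           # fast-weight residual correction
    return x + h                # merge into the residual stream
```

Keeping the correction and merge at the end of the loop means the fast pathway only ever adjusts the layer's residual contribution, leaving the slow subsystems' outputs untouched.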
A token's path through the complete TRC² block runs from input normalization through thalamic routing, into active cortical columns with E/I gating, past the Hopfield associative memory, through dendritic readout and cortico-thalamic feedback, and finally through the cerebellar fast-weight corrector, which merges into the residual stream.
Evaluated on C4, WikiText-103, and LAMBADA (4×V100 GPUs, 2.88B training tokens) against parameter-matched Transformer and Mamba baselines. TRC² achieves markedly lower perplexity and higher BLEU, with substantially reduced continual-learning forgetting.
| Model | Params | d_model | PPL C4 ↓ | PPL Wiki ↓ | PPL LAM ↓ | BLEU C4 ↑ | BLEU Wiki ↑ | BLEU LAM ↑ | Tok/s ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | 162M | 768 | 60.70 | 215.18 | 105.72 | 8.12 | 8.23 | 5.09 | ~127k |
| Mamba | 176M | 768 | 70.45 | 357.67 | 116.73 | 6.90 | 2.87 | 3.97 | ~108k |
| TRC² (ours) | 169M | 512 | 2.00 | 2.56 | 2.02 | 71.66 | 66.57 | 70.07 | ~57k |
Continual-learning forgetting per metric, measured at the last training step and as a normalized AUC over the full stream (lower is better):

| Model | Last PPL ↓ | Last TokAcc ↓ | Last BLEU ↓ | AUC PPL ↓ | AUC TokAcc ↓ | AUC BLEU ↓ |
|---|---|---|---|---|---|---|
| Transformer | 0.0000 | 0.0014 | 0.3757 | 0.0669 | 0.0008 | 0.1684 |
| Mamba | 0.0000 | 0.0006 | 0.0900 | 0.3371 | 0.0011 | 0.1957 |
| TRC² (ours) | 0.0110 | 0.0010 | 0.0435 | 0.0018 | 0.0008 | 0.0981 |
The results support TRC²'s core design claim: continual learning improves when plasticity is allocated to a small, explicit pathway while keeping most representational structure stable. The normalized forgetting AUC — tracking behavior over the full training stream — shows markedly lower forgetting than both baselines.
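For concreteness, one plausible way to compute such a normalized forgetting AUC for a higher-is-better score (e.g. token accuracy or BLEU) is sketched below; the paper's exact normalization is not specified here, so this is an assumption:

```python
import numpy as np

def normalized_forgetting_auc(history):
    """history: (steps, tasks) array of a per-task score recorded after each
    training step. Forgetting at a step is the drop from that task's best
    score so far, normalized by the best; the AUC averages this over the
    whole stream (one plausible reading of the metric)."""
    best_so_far = np.maximum.accumulate(history, axis=0)
    forgetting = best_so_far - history                 # >= 0, per step and task
    denom = np.maximum(best_so_far, 1e-12)             # avoid division by zero
    return float((forgetting / denom).mean())
```

A model that never regresses on any earlier task scores exactly zero; transient dips during the stream are penalized even if the final checkpoint recovers, which is what distinguishes the AUC from the last-step number.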
TRC² trades raw throughput for structured sparsity and online-correctable computation. The chunked routing scheme amortizes router overhead, but end-to-end performance remains sensitive to kernel fusion and memory layout. The favorable scaling regime appears when routing decisions are stable across neighboring tokens and column-local scans stay contiguous in memory.
Several mechanisms likely contribute: topology-aware routing encourages temporal continuity, reducing parameter interference. Excitatory-inhibitory gating suppresses unstable activations before residual propagation. The cerebellar corrector provides fast stream-driven adjustment without rewriting slower parameters.
Future work should extend evaluation to larger scales and longer contexts, study router stability under harder non-stationary streams, and couple the corrective pathway with deployment-time constraints for bounded, interpretable, and reversible adaptation.