Semantic Crystallization
Status: Research & Planning
Priority: Experimental / High Scientific Value
Objective: Pioneer cognitive architecture with emergent internal language through persistent memory and motif crystallization
Created: December 8, 2025
Type: Novel Research - Potential Breakthrough
Companion Documents:
- Mathematical Foundations: PERSISTENT_TRACES_APPENDIX_MATHEMATICAL_FOUNDATIONS.md - Detailed derivations, proofs, complexity analysis
- Cognitive Science Perspective: PERSISTENT_TRACES_COGNITIVE_SCIENCE.md - Emergent language theory, philosophical implications
- Quick Reference: PERSISTENT_TRACES_QUICK_REFERENCE.md - Config templates, FAQ, decision trees
- Vector Store Integration: data-backends-vector-stores.md (PF-005) - Persistent storage backend for cross-session memory
Executive Summary
The Vision
Current language models process each input independently, computing attention and routing decisions from scratch every forward pass. This is computationally wasteful and biologically implausible. Persistent Attention Traces and Semantic Crystallization aim to create a model that develops its own efficient internal "thought language" by:
- Remembering high-salience reasoning pathways that recur across contexts (attention traces)
- Consolidating frequently-used expert routing patterns into reusable symbolic primitives (crystallization)
- Evolving an internal cognitive lexicon optimized for dense, rapid computation rather than human readability
This is analogous to how biological neural networks consolidate episodic experiences into semantic knowledge through synaptic strengthening and pruning.
The Problem
- Attention is ephemeral: Valuable reasoning patterns discovered during training are recomputed from scratch every forward pass
- Routing is redundant: MoE models repeatedly discover the same expert pathways for similar inputs
- No cognitive compression: Models cannot develop efficient "shorthand" for complex conceptual patterns
- Memory inefficiency: Full attention matrices are O(L²) per layer - infeasible to persist
The Solution
Two-component cognitive enhancement system:
Component 1: Persistent Attention Traces
- Capture sparse high-salience attention edges (top 0.1%) that consistently strengthen across sequences
- Store in compact coordinate format: (layer, head, query_idx, key_idx, salience, age) → 12 bytes per trace
- Apply exponential decay to unused traces (biological forgetting)
- Bias future attention using accumulated trace memory → faster convergence to proven reasoning pathways
- Memory footprint: ~24 MB for a 32-layer model with per-head quotas (vs. ~68 GB for full attention storage)
Component 2: Semantic Crystallization
- Track MoE routing paths: sequences of expert activations [E₂ → E₇ → E₃ → E₁] across layers
- Build a suffix tree of frequently-traversed motifs with utility scoring
- Freeze high-utility motifs into specialized computational units (new "thought symbols")
- Competitive dynamics ensure only genuinely useful patterns survive
- Result: Model develops a hierarchical vocabulary of reusable reasoning primitives
Expected Outcomes
Baseline Model:
- Recomputes attention from scratch: 100% cost every forward pass
- Discovers useful routing path: forgotten immediately after sequence ends
- Long-horizon reasoning: struggles with dependencies beyond context window
Enhanced Model:
- Persistent traces accelerate attention: 20-40% speedup on familiar patterns
- Crystallized motifs become reusable "concepts": 15-30% FLOP reduction
- Emergent internal language: hierarchical reasoning primitives
- Long-term memory: consolidates knowledge across training lifetime
This is pioneering work - no existing implementation combines persistent attention biasing with routing-path crystallization at this level of integration.
Theoretical Foundation
Biological Inspiration
Long-Term Potentiation (LTP)
In biological neural networks, synaptic connections that repeatedly fire together strengthen through LTP - the cellular basis of learning and memory. Our persistent attention traces implement a computational analog:
- Hebbian principle: "Neurons that fire together wire together"
- Synaptic consolidation: Episodic memories (hippocampus) consolidate into semantic knowledge (neocortex) during sleep
- Structural plasticity: Frequently-used neural pathways develop stronger connections; unused pathways prune
Transfer to Transformers:
- Attention weights → synaptic strengths
- Persistent traces → consolidated synaptic weights
- Trace decay → synaptic pruning
- Crystallized motifs → semantic concepts in neocortex
Mathematical Formulation
1. Persistent Attention Traces
Notation:
- $A_{i,j}^{(l,h)}(t)$ = attention weight from query position $i$ to key position $j$, layer $l$, head $h$, timestep $t$
- $M_{i,j}^{(l,h)}$ = persistent trace memory (sparse matrix)
- $\mathcal{L}$ = task loss
- $\lambda \in [0,1]$ = trace retention rate (momentum)
- $\gamma \in [0,1)$ = decay rate for unused traces
- $\theta_{sal}$ = salience threshold for trace capture
- $\alpha \in [0,1]$ = bias injection strength
Salience Score (determines which attention edges persist): $$ S_{i,j}^{(l,h)}(t) = A_{i,j}^{(l,h)}(t) \cdot \left|\frac{\partial \mathcal{L}}{\partial A_{i,j}^{(l,h)}}\right| \cdot \text{persist}(i,j,t) $$
where $\text{persist}(i,j,t)$ measures how often edge $(i,j)$ has appeared in recent history.
Trace Memory Update (sparse EMA with competitive decay): $$ M_{i,j}^{(l,h)} \leftarrow \begin{cases} \lambda \cdot M_{i,j}^{(l,h)} + (1-\lambda) \cdot S_{i,j}^{(l,h)} & \text{if } S_{i,j}^{(l,h)} > \theta_{sal} \\ \gamma \cdot M_{i,j}^{(l,h)} & \text{otherwise} \end{cases} $$
Biased Attention Computation (inject memory into standard attention): $$ A'^{(l,h)} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}} + \alpha \cdot M^{(l,h)}\right) $$
Note: $M^{(l,h)}$ is sparse; zero entries are not stored.
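A minimal PyTorch sketch of the two rules above, using dense tensors for a single (layer, head) pair for readability; the real system would operate on the sparse COO traces described later, and all function names and default values here are illustrative.

```python
import torch
import torch.nn.functional as F

def update_trace_memory(trace_memory: torch.Tensor,
                        salience: torch.Tensor,
                        retention_rate: float = 0.95,     # lambda
                        decay_rate: float = 0.98,         # gamma
                        salience_threshold: float = 0.05) -> torch.Tensor:
    """Sparse EMA with competitive decay (dense tensors for clarity)."""
    reinforced = salience > salience_threshold
    return torch.where(
        reinforced,
        retention_rate * trace_memory + (1 - retention_rate) * salience,
        decay_rate * trace_memory,        # unused edges decay toward zero
    )

def biased_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                     trace_memory: torch.Tensor, bias_strength: float = 0.1):
    """softmax(Q K^T / sqrt(d_k) + alpha * M) V for a single head."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    scores = scores + bias_strength * trace_memory   # inject persistent trace bias
    attn = F.softmax(scores, dim=-1)
    return attn @ value, attn
```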
Memory Budget Per Layer: $$ \begin{align} N_{\text{edges}}^{\text{full}} &= T^2 \quad \text{(sequence length squared)} \\ N_{\text{edges}}^{\text{sparse}} &\approx k_{\text{sparse}} \cdot T^2 \quad \text{where } k_{\text{sparse}} \approx 0.001 \\ \text{Bytes per trace} &= 1 + 1 + 2 + 2 + 4 + 2 = 12 \text{ (layer:1B, head:1B, i:2B, j:2B, salience:4B, age:2B)} \\ B_{\text{mem}}^{(l,h)} &= N_{\text{edges}}^{\text{sparse}} \cdot 12 \text{ bytes} \end{align} $$
Example: For $T=4096$, $L=32$ layers, $H=32$ heads:
- Full attention storage: $32 \times 32 \times 4096^2 \times 4 \approx 68$ GB
- Sparse trace storage ($k=0.001$): $32 \times 32 \times (0.001 \times 4096^2) \times 12 \approx 206$ MB, reduced to $\approx 24$ MB once per-head quotas are enforced (see the implementation section)
Reduction factor: ~330× from sparsity alone, ~2,700× with quotas
2. Semantic Crystallization
Notation:
- $\pi(x) = [e_1^{(1)}, e_2^{(2)}, \ldots, e_L^{(L)}]$ = routing path for input $x$, where $e_l \in \{1, \ldots, K\}$
- $f(\pi)$ = frequency count of path $\pi$
- $U(\pi)$ = conditional utility of path $\pi$
- $H(\pi)$ = routing entropy of path $\pi$
Path Frequency (tracked via suffix tree): $$ f(\pi) = \sum_{t=1}^{T} \mathbb{1}[\pi(x_t) = \pi] $$
Conditional Utility (performance improvement when using path $\pi$): $$ U(\pi) = \mathbb{E}[\text{reward} \mid \pi] - \mathbb{E}[\text{reward}] = \frac{1}{f(\pi)} \sum_{x:\pi(x)=\pi} r(x) - \bar{r} $$
where $r(x)$ is task-specific reward (e.g., negative loss, accuracy).
Routing Entropy (measure of path stability): $$ H(\pi) = -\sum_{l=1}^{L} \sum_{k=1}^{K} p(e_l = k \mid \pi_{1:l-1}) \log p(e_l = k \mid \pi_{1:l-1}) $$
Low entropy → deterministic, stable routing → good crystallization candidate.
Crystallization Criterion (all conditions must hold): $$ \text{Crystallize}(\pi) \Leftrightarrow \begin{cases} f(\pi) > f_{\min} & \text{(sufficient frequency)} \\ U(\pi) > U_{\min} & \text{(positive utility)} \\ H(\pi) < H_{\max} & \text{(low entropy/stable)} \\ \text{age}(\pi) > \tau & \text{(temporal stability)} \end{cases} $$
Frozen Motif as New Expert: $$ E_{\text{new}}(x) = \prod_{l=1}^{L_{\text{motif}}} E_{e_l}^{(l)}(h_{l-1}) $$
The crystallized expert becomes a single atomic operation, callable like any other expert.
Competitive Consolidation (limited motif budget): $$ \text{Evict lowest-utility motif if } |\mathcal{M}| > M_{\max} \text{ and } \exists \pi': U(\pi') < \min_{\pi \in \mathcal{M}} U(\pi) $$
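A sketch of the crystallization gate and the competitive eviction rule, assuming motif statistics are tracked in a small record; `MotifStats`, `should_crystallize`, and `maybe_evict` are illustrative names, and the thresholds mirror the config defaults given later.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MotifStats:
    frequency: int      # f(pi)
    utility: float      # U(pi)
    entropy: float      # H(pi)
    age: int            # steps since first observation

def should_crystallize(m: MotifStats,
                       f_min: int = 100, u_min: float = 0.05,
                       h_max: float = 1.0, min_age: int = 500) -> bool:
    """All four criteria must hold simultaneously."""
    return (m.frequency > f_min and m.utility > u_min
            and m.entropy < h_max and m.age > min_age)

def maybe_evict(registry: Dict[str, MotifStats], candidate: MotifStats,
                max_motifs: int = 512) -> None:
    """Evict the lowest-utility motif only if the budget is full and the
    candidate beats it (competitive consolidation)."""
    if len(registry) < max_motifs:
        return
    worst_key = min(registry, key=lambda k: registry[k].utility)
    if candidate.utility > registry[worst_key].utility:
        del registry[worst_key]
```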
Memory-Efficient Implementation Design
Critical Constraints
- VRAM budget: User config allows ~90% GPU memory for training (train_cuda_mem_pct: 90)
- Typical VRAM: 12-24 GB consumer GPUs (RTX 4090, 3090)
- Model size: HRM-ACTV1 ranges from 150M to 1B+ parameters
- Must not OOM: Memory overhead must be negligible compared to model weights and activations
Trace Storage Data Structure
Sparse COO (Coordinate) Format:
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import torch


@dataclass
class AttentionTrace:
    """Single persistent attention edge."""
    layer_id: np.uint8    # 1 byte (max 255 layers)
    head_id: np.uint8     # 1 byte (max 255 heads)
    query_idx: np.uint16  # 2 bytes (max 65K sequence length)
    key_idx: np.uint16    # 2 bytes
    salience: np.float32  # 4 bytes (accumulated strength)
    age: np.uint16        # 2 bytes (timesteps since last reinforcement)
    # Total: 12 bytes per trace


@dataclass
class TraceMemoryLayer:
    """Sparse trace storage for one layer."""
    traces: List[AttentionTrace]  # Sparse edges only
    max_traces: int               # Quota per layer/head
    max_seq_len: int              # Bias matrix size (from config)
    decay_rate: float             # γ parameter
    retention_rate: float         # λ parameter
    bias_strength: float          # α parameter

    def to_sparse_matrix(self, device: torch.device) -> Optional[torch.Tensor]:
        """Convert traces to a sparse PyTorch tensor for bias injection."""
        if len(self.traces) == 0:
            return None
        indices = torch.tensor(
            [[t.query_idx, t.key_idx] for t in self.traces],
            dtype=torch.long
        ).t()
        values = torch.tensor(
            [t.salience for t in self.traces],
            dtype=torch.float32
        )
        size = (self.max_seq_len, self.max_seq_len)
        return torch.sparse_coo_tensor(indices, values, size, device=device)
Memory Budget Calculation:
# Configuration
L = 32 # layers
H = 32 # heads per layer
T = 4096 # max sequence length
k_sparse = 0.001 # sparsity (capture top 0.1% edges)
bytes_per_trace = 12
# Per-layer calculation
edges_per_layer_head = k_sparse * (T ** 2) # ~16,777 edges
traces_per_layer = H * edges_per_layer_head # ~536,870 traces
bytes_per_layer = traces_per_layer * bytes_per_trace # ~6.4 MB
# Full model
total_traces = L * traces_per_layer # ~17M traces
total_memory_mb = L * bytes_per_layer / (1024**2) # ~205 MB
# BUT: We enforce stricter quota to be safe
quota_per_head = 2048 # max traces per head
actual_traces = L * H * quota_per_head # ~2M traces
actual_memory_mb = actual_traces * bytes_per_trace / (1024**2) # ~24 MB
Result: ~24 MB total overhead for trace memory - completely negligible on modern GPUs.
Routing Path Storage (Suffix Tree)
Data Structure:
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Thresholds mirror the crystallization config defaults below.
MIN_MOTIF_LENGTH = 3
FREQ_THRESHOLD = 100
UTILITY_THRESHOLD = 0.05
ENTROPY_THRESHOLD = 1.0


@dataclass
class RoutingNode:
    """Node in routing path suffix tree."""
    expert_id: int                      # Expert ID at this layer
    layer: int                          # Layer depth
    count: int = 0                      # Traversal frequency
    total_reward: float = 0.0           # Cumulative task reward
    children: Dict[int, RoutingNode] = field(default_factory=dict)  # Next-layer expert transitions

    def utility(self, global_mean_reward: float = 0.0) -> float:
        """Conditional utility score (mean reward minus global baseline)."""
        if self.count == 0:
            return 0.0
        return (self.total_reward / self.count) - global_mean_reward

    def entropy(self) -> float:
        """Routing entropy at this node."""
        if not self.children:
            return 0.0
        total = sum(c.count for c in self.children.values())
        probs = [c.count / total for c in self.children.values()]
        return -sum(p * math.log(p + 1e-10) for p in probs)


@dataclass
class RoutingPathTree:
    """Suffix tree tracking all expert routing paths."""
    root: RoutingNode = field(default_factory=lambda: RoutingNode(expert_id=-1, layer=-1))
    motif_registry: Dict[str, "CrystallizedMotif"] = field(default_factory=dict)
    max_motifs: int = 512  # Hard limit on crystallized primitives

    def record_path(self, path: List[int], reward: float) -> None:
        """Update tree with new routing path observation."""
        node = self.root
        for layer, expert_id in enumerate(path):
            if expert_id not in node.children:
                node.children[expert_id] = RoutingNode(expert_id=expert_id, layer=layer)
            node = node.children[expert_id]
            node.count += 1
            node.total_reward += reward

    def find_crystallization_candidates(self) -> List[Tuple[List[int], float]]:
        """DFS to find high-utility stable paths."""
        candidates: List[Tuple[List[int], float]] = []

        def dfs(node: RoutingNode, path: List[int]) -> None:
            if len(path) >= MIN_MOTIF_LENGTH:
                if (node.count > FREQ_THRESHOLD and
                        node.utility() > UTILITY_THRESHOLD and
                        node.entropy() < ENTROPY_THRESHOLD):
                    candidates.append((path.copy(), node.utility()))
            for expert_id, child in node.children.items():
                path.append(expert_id)
                dfs(child, path)
                path.pop()

        dfs(self.root, [])
        return sorted(candidates, key=lambda x: x[1], reverse=True)
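A hypothetical usage sketch for the tree above; the path, the reward value, and the printing are placeholders for whatever the training loop actually supplies.

```python
# Hypothetical usage inside the training loop (values are illustrative).
tree = RoutingPathTree()

# After each sequence: path = one chosen expert per layer, reward = e.g. -loss.
tree.record_path(path=[2, 7, 3, 1], reward=0.12)
tree.record_path(path=[2, 7, 3, 5], reward=0.03)

# Every prune_interval steps: inspect the most useful stable paths.
for expert_path, utility in tree.find_crystallization_candidates()[:10]:
    print(f"candidate motif {expert_path} with utility {utility:.3f}")
```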
Memory Budget:
- Each node: ~48 bytes (int:4, int:4, int:8, float:8, dict_overhead:24)
- Max motif length: ~8 layers
- Branching factor: ~8 experts per layer on average
- Max tree size: $8^8 = 16.7$M nodes worst case → ~800 MB worst case
- Realistic case (pruned/sparse): ~100K nodes → ~5 MB
Total Persistent Memory:
- RAM (required): ~30 MB (traces + routing tree) - active working memory
- Vector Store (optional): unlimited capacity - cross-session persistence via Qdrant/LanceDB (see PF-005)
Vector Store Integration (Optional Enhancement)
Overview
While persistent traces and crystallized motifs operate effectively in RAM (~30 MB), optional integration with vector stores (see data-backends-vector-stores.md PF-005) enables:
- Cross-session persistence: Traces/motifs survive training restarts
- Unlimited capacity: Billions of traces vs 2M RAM limit
- Multi-model sharing: Transfer learned motifs between models
- Scalable retrieval: O(log n) ANN search vs O(n) RAM scan
Architecture
Three-tier memory hierarchy:
┌────────────────────────────────────────────────────┐
│ LAYER 3: Cognitive Components (this document)      │
│   ├─ TraceManager (24 MB RAM hot storage)          │
│   └─ RoutingPathTree (5 MB RAM motif tree)         │
└─────────────────┬──────────────────────────────────┘
                  │ optional sync
┌─────────────────┴──────────────────────────────────┐
│ LAYER 2: Specialized Wrappers                      │
│   ├─ TraceVectorStore (trace embedding + metadata) │
│   └─ MotifVectorStore (motif embedding + utility)  │
└─────────────────┬──────────────────────────────────┘
                  │ uses
┌─────────────────┴──────────────────────────────────┐
│ LAYER 1: VectorStoreClient (PF-005)                │
│   Backend: Qdrant or LanceDB                       │
│   Methods: upsert(), query(), delete()             │
└────────────────────────────────────────────────────┘
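The LAYER 2 wrappers are not specified elsewhere in this document; the sketch below shows one plausible shape for `TraceVectorStore`, written against an assumed `VectorStoreClient` interface. Only `upsert()`, `query()`, and `delete()` are named by PF-005 here, so the keyword arguments and collection naming are guesses and the real signatures may differ.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

import numpy as np


class TraceVectorStore:
    """Thin wrapper mapping traces onto the PF-005 VectorStoreClient.

    Sketch only: assumes the client exposes upsert()/query() roughly as
    listed in the diagram above; the actual PF-005 API may differ.
    """

    def __init__(self, client: Any, collection: str = "aios_memory_traces"):
        self.client = client
        self.collection = collection
        self.enabled = client is not None

    def upsert(self, ids: Sequence[str], embeddings: Sequence[np.ndarray],
               metadata: Sequence[Dict[str, Any]]) -> None:
        if not self.enabled:
            return
        # Assumed keyword names; adapt to the real VectorStoreClient signature.
        self.client.upsert(collection=self.collection, ids=list(ids),
                           vectors=[e.tolist() for e in embeddings],
                           payloads=list(metadata))

    def query(self, embedding: np.ndarray, top_k: int = 100,
              filter: Optional[Dict[str, Any]] = None
              ) -> List[Tuple[str, float, Dict[str, Any]]]:
        if not self.enabled:
            return []
        return self.client.query(collection=self.collection,
                                 vector=embedding.tolist(),
                                 top_k=top_k, filter=filter)
```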
Trace Embedding Specification
Challenge: Convert sparse coordinate trace to dense vector for similarity search.
Solution: Compositional embedding
import numpy as np
import torch
import torch.nn as nn


class TraceEmbedder:
    """Converts AttentionTrace to dense embedding."""

    def __init__(self, embed_dim: int = 128):
        # Learned embeddings for trace components
        self.layer_embed = nn.Embedding(256, embed_dim // 4)  # max 256 layers
        self.head_embed = nn.Embedding(256, embed_dim // 4)   # max 256 heads
        self.position_encoder = SinusoidalPositionalEncoding(embed_dim // 2)

    def embed(self, trace: AttentionTrace) -> np.ndarray:
        """Convert trace to 128D vector."""
        # Component embeddings (nn.Embedding expects index tensors)
        layer_vec = self.layer_embed(torch.tensor(int(trace.layer_id)))  # 32D
        head_vec = self.head_embed(torch.tensor(int(trace.head_id)))     # 32D
        pos_vec = self.position_encoder(
            int(trace.query_idx), int(trace.key_idx))  # 64D (encodes spatial relation)
        # Weighted by salience
        full_vec = torch.cat([layer_vec, head_vec, pos_vec], dim=-1)  # 128D
        return (full_vec * float(trace.salience)).detach().cpu().numpy()
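`SinusoidalPositionalEncoding` is referenced above but not defined in this document; one plausible sketch encodes the (query_idx, key_idx) pair as fixed sinusoidal features, half for the relative offset and half for the absolute query position. The exact design is an open choice.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Sketch: encode a (query_idx, key_idx) pair as fixed sinusoidal features."""

    def __init__(self, dim: int, max_period: float = 10000.0):
        super().__init__()
        assert dim % 4 == 0, "dim must be divisible by 4"
        self.dim = dim
        half = dim // 4
        freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
        self.register_buffer("freqs", freqs)

    def forward(self, query_idx: int, key_idx: int) -> torch.Tensor:
        feats = []
        for pos in (key_idx - query_idx, query_idx):  # relative offset, then absolute
            angles = pos * self.freqs
            feats.append(torch.cat([angles.sin(), angles.cos()]))
        return torch.cat(feats)  # shape [dim]
```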
Metadata storage:
trace_metadata = {
    "layer": trace.layer_id,
    "head": trace.head_id,
    "query_idx": trace.query_idx,
    "key_idx": trace.key_idx,
    "salience": trace.salience,
    "age": trace.age,
    "training_step": current_step,
    "task_tag": "QA",  # optional; e.g. "QA", "generation", "classification"
}
Motif Embedding Specification
Challenge: Embed variable-length expert sequences [E₂ → E₇ → E₃ → E₁] into a fixed-size vector.
Solution: Sequence encoding with utility weighting
import numpy as np
import torch
import torch.nn as nn


class MotifEmbedder:
    """Converts crystallized motif to dense embedding."""

    def __init__(self, embed_dim: int = 256, max_experts: int = 64):
        self.expert_embed = nn.Embedding(max_experts, embed_dim)
        self.sequence_encoder = nn.LSTM(
            input_size=embed_dim,
            hidden_size=embed_dim,
            num_layers=2,
            batch_first=True
        )

    def embed(self, motif: CrystallizedMotif) -> np.ndarray:
        """Convert motif expert sequence to 256D vector."""
        # Embed each expert in sequence
        expert_ids = torch.tensor(motif.expert_sequence)  # [L]
        expert_vecs = self.expert_embed(expert_ids)        # [L, 256]
        # Encode sequence
        _, (h_n, _) = self.sequence_encoder(expert_vecs.unsqueeze(0))
        motif_vec = h_n[-1].squeeze(0)                     # Final hidden state [256]
        # Weight by utility
        return (motif_vec * motif.utility).detach().cpu().numpy()
Metadata storage:
motif_metadata = {
    "motif_id": motif.id,
    "expert_sequence": motif.expert_sequence,  # [2, 7, 3, 1]
    "frequency": motif.count,
    "utility": motif.utility,
    "entropy": motif.entropy,
    "age": motif.age,
    "task_tags": ["retrieval", "QA"],  # tasks where motif is useful
}
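`CrystallizedMotif` is referenced by the routing tree and `MotifEmbedder` but never specified here; a minimal record mirroring the metadata fields above might look like the following (field names are assumptions, not a finalized schema).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrystallizedMotif:
    """Sketch of the record behind a frozen routing motif."""
    id: str                       # e.g. "motif_0007"
    expert_sequence: List[int]    # e.g. [2, 7, 3, 1], one expert per layer
    count: int = 0                # observed frequency f(pi)
    utility: float = 0.0          # conditional utility U(pi)
    entropy: float = 0.0          # routing entropy H(pi)
    age: int = 0                  # steps since crystallization
    task_tags: List[str] = field(default_factory=list)
```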
Synchronization Protocol
TraceManager persistence:
import logging

logger = logging.getLogger(__name__)


class TraceManager:
    def __init__(self, config, vector_store_client=None):
        self.traces = []  # RAM storage
        self.embedder = TraceEmbedder(config.embedding_dim)
        self.vector_store = TraceVectorStore(vector_store_client) if vector_store_client else None
        self.sync_interval = config.trace_sync_interval  # e.g., 1000 steps
        self.last_sync_step = 0

    def sync_to_vector_store(self, current_step: int):
        """Persist recent traces to vector DB."""
        if not self.vector_store or not self.vector_store.enabled:
            return
        # Convert traces to embeddings
        embeddings = [self.embedder.embed(t) for t in self.traces]
        ids = [f"trace_{t.layer_id}_{t.head_id}_{t.query_idx}_{t.key_idx}" for t in self.traces]
        metadata = [self._trace_to_metadata(t, current_step) for t in self.traces]
        # Upsert to vector store
        self.vector_store.upsert(ids, embeddings, metadata)
        self.last_sync_step = current_step
        logger.info(f"Synced {len(self.traces)} traces to vector store at step {current_step}")

    def load_from_vector_store(self, task_tag=None, top_k=10000):
        """Warm-start from previous training session."""
        if not self.vector_store or not self.vector_store.enabled:
            return
        # Query for relevant traces (if task_tag specified)
        filter_dict = {"task_tag": task_tag} if task_tag else None
        results = self.vector_store.query_all(top_k=top_k, filter=filter_dict)
        # Reconstruct traces from metadata
        for trace_id, score, metadata in results:
            trace = AttentionTrace(
                layer_id=metadata["layer"],
                head_id=metadata["head"],
                query_idx=metadata["query_idx"],
                key_idx=metadata["key_idx"],
                salience=metadata["salience"],
                age=metadata["age"],
            )
            self.traces.append(trace)
        logger.info(f"Loaded {len(results)} traces from vector store")
RoutingPathTree persistence (similar pattern for motifs).
Configuration Integration
Updated config schema (unified under memory namespace):
memory:
  # Vector store backend (PF-005)
  vector_store:
    backend: "qdrant"  # qdrant|lancedb|disabled
    qdrant:
      host: "localhost"
      port: 6333
      collection_prefix: "aios_memory"
    lancedb:
      path: "artifacts/memory/lancedb"

  # Persistent attention traces
  persistent_traces:
    enabled: false
    sparsity: 0.001
    quota_per_head: 2048
    salience_threshold: 0.05
    retention_rate: 0.95
    decay_rate: 0.98
    bias_strength: 0.1
    update_interval: 100
    warmup_steps: 1000
    # Vector store integration (optional)
    persist_to_vector_store: false  # Enable cross-session persistence
    trace_sync_interval: 1000       # Steps between DB syncs
    embedding_dim: 128              # Trace embedding size
    warm_start: false               # Load traces from DB on training start
    task_tag: null                  # Filter traces by task type

  # Semantic crystallization
  semantic_crystallization:
    enabled: false
    min_frequency: 100
    min_utility: 0.05
    max_entropy: 1.0
    min_age: 500
    max_motifs: 512
    # Vector store integration (optional)
    persist_motifs: false       # Auto-save crystallized motifs
    motif_embedding_dim: 256    # Motif embedding size
    share_across_models: false  # Allow other models to query motifs
Benefits of Integration
Without vector store (baseline):
- ✅ Works standalone, no external dependencies
- ✅ Fast (pure RAM)
- ❌ Limited to ~2M traces
- ❌ Lost on training restart
- ❌ Cannot share between models

With vector store (enhanced):
- ✅ Unlimited capacity (billions of traces)
- ✅ Persistent across sessions
- ✅ Multi-model collaboration
- ✅ Fast retrieval (ANN indexing)
- ❌ Requires Qdrant/LanceDB service
- ❌ Slight I/O overhead during sync
Recommendation: Start without vector store, enable after validating core functionality.
Architecture Integration (HRM-ACTV1 Specific)
Current Architecture Analysis
Files involved:
- src/aios/core/hrm_models/impl/hrm_act_v1.py - Main model architecture
- src/aios/core/hrm_models/impl/layers.py - Attention and MoE layers
- src/aios/core/hrm_models/moe_layer.py - MoE base implementation
- src/aios/core/hrm_models/dynamic_moe/dynamic_layer.py - Dynamic expert loading
- src/aios/cli/hrm_hf/training_loop.py - Training orchestration
Current attention implementation (layers.py line ~164):
def forward(self, cos_sin: CosSin, hidden_states: torch.Tensor) -> torch.Tensor:
    # ... qkv projection ...
    # Standard scaled dot-product attention
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)
    attn_weights = F.softmax(attn_weights, dim=-1)
    # Apply attention
    attn_output = torch.matmul(attn_weights, value)
Current MoE router (moe_layer.py line ~66):
def forward(self, hidden_states, top_k):
    logits = self.gate(hidden_states)  # [batch, seq, num_experts]
    top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
    top_k_weights = F.softmax(top_k_logits, dim=-1)
    return top_k_weights, top_k_indices, logits
Integration Points
Hook 1: Attention Trace Capture (Post-Attention)
Location: src/aios/core/hrm_models/impl/layers.py - Attention.forward()
Modification:
class Attention(nn.Module):
    def __init__(self, ...):
        # ... existing init ...
        self.trace_manager = None  # Set by model if tracing enabled

    def forward(self, cos_sin: CosSin, hidden_states: torch.Tensor) -> torch.Tensor:
        # ... existing QKV projection and RoPE ...

        # Compute attention weights
        attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # INTEGRATION POINT 1: Inject persistent trace bias
        if self.trace_manager is not None and self.trace_manager.enabled:
            trace_bias = self.trace_manager.get_bias_for_layer_head(
                layer_id=self.layer_id,
                head_id=head_id,  # iterate over heads
                device=attn_weights.device
            )
            if trace_bias is not None:
                attn_weights = attn_weights + self.trace_manager.bias_strength * trace_bias

        attn_weights = F.softmax(attn_weights, dim=-1)

        # INTEGRATION POINT 2: Capture high-salience traces (training only)
        if self.training and self.trace_manager is not None:
            # Detach to avoid interfering with backprop
            self.trace_manager.register_attention_for_update(
                layer_id=self.layer_id,
                head_id=head_id,
                attn_weights=attn_weights.detach(),
                query_idx=...,  # derived from position info
                key_idx=...,
            )

        # ... rest of attention computation ...
Gradient capture (for salience score):
# In training loop, after loss.backward()
if trace_manager.enabled:
    for layer_id, layer in enumerate(model.layers):
        attn_grad = layer.self_attn.attn_weights.grad  # Need to retain_grad()
        trace_manager.update_traces_with_gradients(layer_id, attn_grad)
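For the `.grad` access above to work, the attention module must keep a reference to its (non-leaf) attention-weight tensor and call `retain_grad()` on it before `backward()`; a hedged sketch of that capture path, with illustrative attribute names, extending the snippet above:

```python
# Inside Attention.forward(), before the softmax output is consumed (sketch):
#     if self.training and self.trace_manager is not None:
#         attn_weights.retain_grad()          # keep grad for a non-leaf tensor
#         self.attn_weights = attn_weights    # reference read by the training loop

# In the training loop, after loss.backward():
for layer_id, layer in enumerate(model.layers):
    attn = getattr(layer.self_attn, "attn_weights", None)
    if attn is not None and attn.grad is not None:
        # Salience ~ |A| * |dL/dA|, accumulated into the sparse trace memory.
        trace_manager.update_traces_with_gradients(layer_id, attn.grad.detach())
        layer.self_attn.attn_weights = None   # free memory promptly
```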
Hook 2: Routing Path Logging (Post-Router)
Location: src/aios/core/hrm_models/moe_layer.py - TopKRouter.forward()
Modification:
class TopKRouter(nn.Module):
    def __init__(self, ...):
        # ... existing init ...
        self.crystallization_tracker = None  # Set by model

    def forward(self, hidden_states, top_k):
        logits = self.gate(hidden_states)
        top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
        top_k_weights = F.softmax(top_k_logits, dim=-1)

        # INTEGRATION POINT 3: Log routing paths for crystallization
        if self.training and self.crystallization_tracker is not None:
            # Log chosen experts per token
            self.crystallization_tracker.record_routing(
                layer_id=self.layer_id,
                expert_indices=top_k_indices.detach(),
                routing_weights=top_k_weights.detach(),
                sequence_ids=...,  # from batch metadata
            )

        return top_k_weights, top_k_indices, logits
Hook 3: Auxiliary Loss Computation
Location: src/aios/cli/hrm_hf/training_loop.py - loss computation
Addition:
def compute_loss_with_auxiliary(model, batch, trace_manager, crystallization_tracker):
    # Standard forward pass
    outputs = model(batch)
    base_loss = outputs.loss

    # Existing MoE load balancing loss
    load_balance_loss = 0.0
    for layer in model.layers:
        if hasattr(layer, 'mlp') and hasattr(layer.mlp, 'last_router_logits'):
            load_balance_loss += compute_load_balance_loss(layer.mlp.last_router_logits)

    # NEW: Trace reuse auxiliary loss (encourage using persistent traces)
    trace_reuse_loss = 0.0
    if trace_manager.enabled:
        trace_reuse_loss = trace_manager.compute_trace_utilization_loss()

    # NEW: Crystallization stability loss (low entropy for stable paths)
    crystallization_loss = 0.0
    if crystallization_tracker.enabled:
        crystallization_loss = crystallization_tracker.compute_entropy_regularization()

    # Combined loss
    total_loss = (
        base_loss +
        0.01 * load_balance_loss +     # existing
        0.005 * trace_reuse_loss +     # new
        0.002 * crystallization_loss   # new
    )

    return total_loss, {
        'base_loss': float(base_loss),
        'load_balance_loss': float(load_balance_loss),
        'trace_reuse_loss': float(trace_reuse_loss),
        'crystallization_loss': float(crystallization_loss),
    }
Flash Attention Compatibility
Challenge: Flash Attention is a fused CUDA kernel - cannot intercept intermediate attention weights.
Solution: Dual-mode attention
class Attention(nn.Module):
    def forward(self, ...):
        # Trace capture requires explicit attention matrices
        if self.training and self.trace_manager is not None and self.trace_manager.capture_mode:
            # Use standard attention (slower but exposes weights)
            return self._forward_standard_with_traces(...)
        else:
            # Use Flash Attention (faster)
            return self._forward_flash_attn(...)
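One way the fast branch could be written is to delegate to PyTorch's `F.scaled_dot_product_attention` and pass the frozen trace bias as an additive attention mask; a sketch of that method body, assuming the sparse bias is densified per layer, and noting that supplying an explicit mask may cause PyTorch to fall back to a non-Flash kernel:

```python
import torch.nn.functional as F

def _forward_flash_attn(self, query, key, value):
    """Fused attention; persistent traces enter only as a precomputed bias (sketch)."""
    attn_bias = None
    if self.trace_manager is not None and self.trace_manager.enabled:
        sparse_bias = self.trace_manager.get_bias_for_layer_head(
            layer_id=self.layer_id, head_id=None, device=query.device)  # None = all heads (assumption)
        if sparse_bias is not None:
            # Dense [T, T] additive mask, broadcast over batch and heads.
            attn_bias = self.trace_manager.bias_strength * sparse_bias.to_dense()
    # PyTorch dispatches to Flash / memory-efficient kernels when it can;
    # an explicit attn_mask may force the math kernel on some versions.
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_bias)
```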
Training strategy:
- Phase 1 (warmup): Use standard attention, accumulate traces (0-5% of training)
- Phase 2 (main training): Use Flash Attention with frozen trace bias (95% of training)
- Phase 3 (periodic trace updates): Briefly switch to standard attention every N steps to refresh traces
Gradient Checkpointing Compatibility
Challenge: Gradient checkpointing discards activations, including attention weights.
Solution: Trace updates happen in separate forward-only passes
# Every TRACE_UPDATE_INTERVAL steps (e.g., 100)
if step % TRACE_UPDATE_INTERVAL == 0:
    with torch.no_grad():
        # Forward-only pass with trace capture enabled
        _ = model(batch, trace_capture_mode=True)
        # trace_manager accumulates observations

    # Update trace memory based on accumulated stats
    trace_manager.consolidate_traces()
Training Protocol
Loss Function Components
Total training loss: $$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta_1 \mathcal{L}_{\text{balance}} + \beta_2 \mathcal{L}_{\text{trace}} + \beta_3 \mathcal{L}_{\text{crystal}} $$
where:
- $\mathcal{L}_{\text{task}}$ = standard next-token prediction loss
- $\mathcal{L}_{\text{balance}}$ = MoE load balancing (existing)
- $\mathcal{L}_{\text{trace}}$ = trace utilization regularizer (new)
- $\mathcal{L}_{\text{crystal}}$ = crystallization stability (new)
1. Trace Utilization Loss
Goal: Encourage model to reuse persistent traces (exploration-exploitation balance).
$$ \mathcal{L}_{\text{trace}} = -\frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} \frac{\text{trace\_hits}^{(l,h)}}{\text{total\_attention\_ops}^{(l,h)}} $$
Penalizes models that ignore their accumulated trace memory.
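A sketch of how `TraceManager.compute_trace_utilization_loss()` could realize this term, assuming per-(layer, head) hit counters are accumulated during the forward pass; the counter names are illustrative.

```python
import torch

def compute_trace_utilization_loss(trace_hits: torch.Tensor,
                                   total_attention_ops: torch.Tensor) -> torch.Tensor:
    """L_trace = -(1 / (L*H)) * sum over (l, h) of the trace hit-rate.

    trace_hits, total_attention_ops: [L, H] counters gathered this step.
    """
    hit_rate = trace_hits / total_attention_ops.clamp_min(1)
    return -hit_rate.mean()  # mean over L*H entries == (1/(L*H)) * sum
```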
2. Crystallization Stability Loss
Goal: Encourage deterministic routing for high-utility paths (low entropy).
$$ \mathcal{L}_{\text{crystal}} = \frac{1}{|\mathcal{M}|} \sum_{\pi \in \mathcal{M}} H(\pi) \cdot \mathbb{1}[U(\pi) > U_{\min}] $$
Only penalize entropy for high-utility motifs (we want them stable).
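A sketch of the corresponding `compute_entropy_regularization()`, reusing the illustrative `MotifStats` record from the crystallization sketch earlier; as written it scores tracked statistics only, and a differentiable variant would recompute $H(\pi)$ from the router's softmax probabilities instead.

```python
from typing import Iterable

def compute_entropy_regularization(motifs: Iterable["MotifStats"],
                                   u_min: float = 0.05) -> float:
    """L_crystal: mean routing entropy over high-utility motifs only."""
    selected = [m.entropy for m in motifs if m.utility > u_min]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)
```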
Hyperparameters
Trace management:
persistent_traces:
  enabled: true
  sparsity: 0.001           # Top 0.1% of attention edges
  quota_per_head: 2048      # Max traces per head
  salience_threshold: 0.05  # Minimum salience to capture
  retention_rate: 0.95      # λ (momentum)
  decay_rate: 0.98          # γ (forgetting)
  bias_strength: 0.1        # α (injection strength)
  update_interval: 100      # Steps between trace consolidation
  warmup_steps: 1000        # Standard attention for trace accumulation
Crystallization:
semantic_crystallization:
  enabled: true
  min_frequency: 100     # f_min
  min_utility: 0.05      # U_min (5% improvement over baseline)
  max_entropy: 1.0       # H_max (low entropy = stable)
  min_age: 500           # Temporal stability requirement
  max_motifs: 512        # Hard limit on crystallized primitives
  motif_min_length: 3    # Minimum layers in motif
  motif_max_length: 8    # Maximum layers in motif
  prune_interval: 1000   # Steps between motif pruning
Loss coefficients:
loss_weights:
  task: 1.0
  load_balance: 0.01        # Existing MoE
  trace_utilization: 0.005  # New
  crystallization: 0.002    # New
Training Curriculum
Phase 1: Foundation (0-20% of training)
- Standard training, no trace/crystallization
- Model learns basic patterns
- Establishes baseline performance

Phase 2: Trace Accumulation (20-30% of training)
- Enable trace capture (standard attention mode)
- No bias injection yet
- Build initial trace memory

Phase 3: Trace-Biased Training (30-70% of training)
- Enable bias injection
- Switch to Flash Attention for speed
- Periodic trace updates every 100 steps
- Model learns to leverage persistent memory

Phase 4: Crystallization Discovery (70-85% of training)
- Enable routing path tracking
- Identify high-utility motifs
- No freezing yet (observation only)

Phase 5: Motif Crystallization (85-100% of training)
- Freeze top motifs into specialized experts
- Fine-tune with crystallized primitives
- Competitive pruning of low-utility motifs

A minimal phase-scheduler sketch for these switches follows below.
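The sketch maps training progress onto the feature switches above; the boundaries are the curriculum percentages and would normally live in config rather than be hard-coded.

```python
def curriculum_flags(progress: float) -> dict:
    """Map training progress in [0, 1] to feature switches for the five phases (sketch)."""
    return {
        "trace_capture":    progress >= 0.20,  # Phase 2+
        "trace_bias":       progress >= 0.30,  # Phase 3+
        "flash_attention":  progress >= 0.30,  # Phase 3+ (capture phases use standard attention)
        "routing_tracking": progress >= 0.70,  # Phase 4+
        "motif_freezing":   progress >= 0.85,  # Phase 5
    }

# Example: at 50% of training, traces are active but nothing is frozen yet.
assert curriculum_flags(0.5) == {
    "trace_capture": True, "trace_bias": True, "flash_attention": True,
    "routing_tracking": False, "motif_freezing": False,
}
```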
Stability Safeguards
Elastic Weight Consolidation (EWC):
- Protect important parameters supporting high-salience traces
- Fisher information matrix computed over trace-heavy pathways
- Penalty: $\mathcal{L}_{\text{EWC}} = \sum_i \frac{\lambda_i}{2} F_i (\theta_i - \theta_i^*)^2$ (a code sketch follows this section)

Router Collapse Prevention:
- Maintain minimum load balancing even with crystallization
- Enforce capacity constraints per expert
- Periodic entropy injection (exploration)

Memory Overflow Protection:
- Hard quota enforcement: evict lowest-salience traces when full
- Automatic decay rate adjustment if memory pressure detected
- Fallback to standard attention if trace system fails
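A sketch of the EWC penalty above, assuming a diagonal Fisher estimate and a parameter snapshot are maintained elsewhere; argument names are illustrative.

```python
import torch

def ewc_penalty(model: torch.nn.Module,
                fisher: dict, anchor: dict, lam: float = 1.0) -> torch.Tensor:
    """L_EWC = sum_i (lambda / 2) * F_i * (theta_i - theta_i*)^2.

    fisher: per-parameter Fisher diagonal estimated over trace-heavy pathways.
    anchor: parameter snapshot theta* taken at consolidation time.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (lam / 2.0) * (fisher[name] * (param - anchor[name]) ** 2).sum()
    return penalty
```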
Evaluation Framework
Metrics
Primary Metrics (Task Performance)
- Perplexity on validation set (lower is better)
- Long-range dependency accuracy (copy tasks, induction heads)
- Multi-hop reasoning (HotpotQA, MMLU)
- Inference latency (ms per token)
Trace-Specific Metrics
- Trace coverage: % of attention operations using persistent bias
- Trace stability: Correlation of traces across epochs
- Salience distribution: Histogram of trace strengths
- Memory efficiency: Actual bytes used vs. quota
Crystallization Metrics
- Motif count: Number of crystallized primitives over training
- Motif utility: Average $U(\pi)$ for crystallized motifs
- Routing entropy: Average $H(\pi)$ for active motifs
- FLOP reduction: % compute saved by crystallized experts
Emergent Language Metrics (Exploratory)
- Hierarchical structure: Tree depth of motif dependencies
- Compositionality: Frequency of motif combinations
- Semantic coherence: Clustering of motifs by task type
Baselines
- Standard HRM-ACTV1: No traces, no crystallization
- Compressive Transformer: External memory baseline
- Memorizing Transformer: kNN retrieval baseline
- Mixture-of-Experts (vanilla): MoE without crystallization
Ablation Studies
Test each component independently:
| Experiment | Traces | Crystallization | Expected Outcome |
|---|---|---|---|
| Baseline | ✗ | ✗ | Standard performance |
| Traces Only | ✓ | ✗ | Faster convergence, better long-range |
| Crystal Only | ✗ | ✓ | FLOP reduction, expert specialization |
| Full System | ✓ | ✓ | Synergistic improvements |
Ablation parameters:
- Trace sparsity: {0.0001, 0.001, 0.01}
- Bias strength $\alpha$: {0.01, 0.05, 0.1, 0.2}
- Crystallization threshold $U_{\min}$: {0.01, 0.05, 0.1}
- Motif length: {2, 4, 6, 8}
Evaluation Tasks
Short-term reasoning:
- HellaSwag (commonsense)
- PIQA (physical reasoning)
- WinoGrande (coreference)

Long-term reasoning:
- bAbI tasks (require multi-hop)
- SQuAD (reading comprehension)
- Custom copy/induction tasks

Expert utilization:
- Track which motifs activate for which task types
- Visualize routing path trees
- Measure task-motif specialization
Risk Analysis & Mitigation
Risk 1: Catastrophic Forgetting
Description: Frozen motifs become stale as data distribution shifts.
Likelihood: High
Impact: Critical (model performance degrades)
Mitigation:
- Drift detection: Monitor per-motif utility over sliding window
- Adaptive unfreezing: Unfreeze motifs if $U(\pi)$ drops below threshold
- Rehearsal buffer: Store examples that activated each motif, replay periodically
- Conditional crystallization: Only freeze motifs validated on diverse tasks

Risk 2: Router Collapse
Description: Crystallization causes all tokens to route through same few motifs.
Likelihood: Medium
Impact: High (loss of model capacity)
Mitigation:
- Strong load balancing: Maintain $\beta_1 \geq 0.01$ throughout
- Capacity limits: Enforce maximum tokens per expert per batch
- Diversity bonus: Add entropy bonus to routing loss during crystallization phase
- Progressive crystallization: Freeze motifs gradually (top-1, then top-5, etc.)

Risk 3: Memory Overflow
Description: Trace memory exceeds quota, causing OOM.
Likelihood: Low (hard limits in place)
Impact: Critical (training crash)
Mitigation:
- Hard quotas: Strictly enforce quota_per_head limits
- Competitive eviction: Lowest-salience traces evicted first
- Automatic decay tuning: Increase $\gamma$ if memory pressure detected
- Graceful degradation: Disable tracing if memory allocation fails
Risk 4: Flash Attention Incompatibility
Description: Trace capture breaks with fused kernels.
Likelihood: Low (dual-mode designed)
Impact: Medium (slower training)
Mitigation:
- Dual-mode attention: Standard for capture, Flash for speed
- Sparse capture schedule: Only enable standard attention every 100 steps
- Post-hoc approximation: Estimate salience from gradients without explicit weights

Risk 5: Gradient Instability
Description: Auxiliary losses destabilize training.
Likelihood: Medium
Impact: Medium (slower convergence)
Mitigation:
- Conservative coefficients: Start with $\beta_2, \beta_3 < 0.01$
- Gradual ramp-up: Linearly increase loss weights over training
- Gradient clipping: Clip auxiliary loss gradients separately
- Ablation-based tuning: Empirically find stable coefficient ranges

Risk 6: Emergent Language Failure
Description: Model doesn't develop meaningful internal primitives.
Likelihood: Medium (pioneering work)
Impact: Low (graceful degradation to baseline)
Mitigation:
- Baseline guarantees: System degrades to standard transformer if crystallization fails
- Interpretability tools: Visualize motif activations to detect failure early
- Manual seeding: Option to manually specify useful routing patterns
- Curriculum design: Carefully design tasks that encourage motif discovery
Implementation Roadmap
Phase 0: Infrastructure (Week 1-2)
Deliverables:
- [ ] Create src/aios/core/hrm_models/cognitive/ module
- [ ] Implement TraceManager class with sparse storage
- [ ] Implement RoutingPathTree suffix tree
- [ ] Add configuration schemas to config/default.yaml
- [ ] Unit tests for data structures
Files:
src/aios/core/hrm_models/cognitive/
    __init__.py
    trace_manager.py
    routing_tree.py
    crystallization.py
    config.py

tests/core/hrm_models/cognitive/
    test_trace_manager.py
    test_routing_tree.py
Phase 1: Attention Trace Capture (Week 3-4)
Deliverables:
- [ ] Modify Attention class to support trace hooks
- [ ] Implement salience score computation
- [ ] Add gradient-based trace updates
- [ ] Integration with existing training loop
- [ ] Memory profiling and optimization
Modified Files:
- src/aios/core/hrm_models/impl/layers.py
- src/aios/cli/hrm_hf/training_loop.py
Tests:
- Memory footprint < 50 MB for 32-layer model
- Trace capture doesn't slow training > 5%
- Salience scores correlate with task loss gradients
Phase 2: Persistent Bias Injection (Week 5-6)
Deliverables:
- [ ] Implement sparse trace → attention bias conversion
- [ ] Add dual-mode attention (standard/Flash)
- [ ] Trace update scheduling
- [ ] Exponential decay mechanism
- [ ] Visualization tools for traces

Tests:
- Bias injection improves convergence on copy tasks
- Flash Attention speedup maintained (> 90% of time)
- Trace stability across training runs

Phase 2.5: Vector Store Integration (Week 6.5-7.5) [OPTIONAL]
Prerequisites: PF-005 vector store implementation complete (Weeks 1-6)
Deliverables:
- [ ] Implement TraceEmbedder class (sparse trace → dense vector)
- [ ] Implement MotifEmbedder class (expert sequence → dense vector)
- [ ] Create TraceVectorStore wrapper around VectorStoreClient
- [ ] Create MotifVectorStore wrapper around VectorStoreClient
- [ ] Add sync_to_vector_store() and load_from_vector_store() to TraceManager
- [ ] Add persist_motifs() to RoutingPathTree
- [ ] Update configuration schema with vector store integration flags
- [ ] Integration tests for trace/motif persistence and retrieval
Modified Files:
- src/aios/core/hrm_models/cognitive/trace_manager.py
- src/aios/core/hrm_models/cognitive/routing_tree.py
- src/aios/core/hrm_models/cognitive/embedders.py (new)
- src/aios/core/hrm_models/cognitive/vector_wrappers.py (new)
- config/default.yaml
Tests:
- Trace embedding preserves similarity (nearby traces have high cosine similarity)
- Motif embedding distinguishes task types (retrieval vs generation motifs cluster separately)
- Sync/load cycle preserves trace salience within 1% error
- Cross-session warm-start improves initial perplexity vs cold start
- Memory overhead < 5 MB additional (for embedding models)

Acceptance Criteria:
- TraceManager can persist 100K traces and reload with < 2% information loss
- MotifVectorStore correctly retrieves top-10 similar motifs for given input
- Integration works with both Qdrant and LanceDB backends
- Disabling vector store (config flag) causes graceful fallback to RAM-only mode
Notes:
- This phase can be skipped if vector store integration not needed
- All subsequent phases work with or without vector store enabled
- See data-backends-vector-stores.md for VectorStoreClient API details
Phase 3: Routing Path Logging (Week 7-8)
Deliverables:
- [ ] Modify TopKRouter to log expert selections
- [ ] Build suffix tree from routing paths
- [ ] Compute path frequency and utility
- [ ] Implement competitive motif discovery
- [ ] Path visualization tools
Modified Files:
- src/aios/core/hrm_models/moe_layer.py
- src/aios/core/hrm_models/dynamic_moe/dynamic_layer.py
Tests:
- Suffix tree correctly tracks all paths
- Utility scores correlate with task performance
- Tree memory < 10 MB

Phase 4: Semantic Crystallization (Week 9-11)
Deliverables:
- [ ] Implement motif freezing mechanism
- [ ] Create distillation process for crystallized experts
- [ ] Add competitive pruning
- [ ] Integrate with expert registry
- [ ] Motif specialization analysis tools

Tests:
- Crystallized motifs reduce FLOPs
- Frozen motifs maintain performance
- Pruning doesn't cause catastrophic forgetting

Phase 5: Auxiliary Losses (Week 12-13)
Deliverables:
- [ ] Implement trace utilization loss
- [ ] Implement crystallization entropy loss
- [ ] Add EWC for stability
- [ ] Hyperparameter tuning
- [ ] Loss curve analysis

Tests:
- Auxiliary losses don't destabilize training
- Combined loss converges faster than baseline
- Ablations validate each component

Phase 6: Evaluation & Analysis (Week 14-16)
Deliverables:
- [ ] Run full baseline comparisons
- [ ] Long-range dependency benchmarks
- [ ] FLOP reduction measurements
- [ ] Emergent language analysis
- [ ] Documentation and research writeup

Benchmarks:
- bAbI tasks, SQuAD, HellaSwag, MMLU
- Custom induction head tasks
- Latency profiling

Phase 7: Production Hardening (Week 17-18)
Deliverables:
- [ ] Memory overflow safeguards
- [ ] Gradient checkpointing full compatibility
- [ ] Distributed training support (DeepSpeed)
- [ ] Configuration validation
- [ ] User documentation

Tests:
- Multi-GPU training stability
- Graceful recovery from OOM
- Configuration validation catches errors
References & Prior Art
Foundational Papers
Memory-Augmented Transformers:
1. Dai et al. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." arXiv:1901.02860
2. Rae et al. (2019). "Compressive Transformers for Long-Range Sequence Modelling." arXiv:1911.05507
3. Wu et al. (2022). "Memorizing Transformers." arXiv:2203.08913
4. Borgeaud et al. (2022). "Improving language models by retrieving from trillions of tokens." arXiv:2112.04426 (RETRO)

Mixture of Experts:
5. Shazeer et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538
6. Lepikhin et al. (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668
7. Fedus et al. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961

Continual Learning & Consolidation:
8. Kirkpatrick et al. (2017). "Overcoming catastrophic forgetting in neural networks." arXiv:1612.00796 (EWC)
9. Zenke et al. (2017). "Continual Learning Through Synaptic Intelligence." arXiv:1703.04200
10. Schwarz et al. (2018). "Progress & Compress: A scalable framework for continual learning." arXiv:1805.06370

Neural Architecture Search & Plasticity:
11. Frankle & Carbin (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." arXiv:1803.03635
12. Evci et al. (2020). "Rigging the Lottery: Making All Tickets Winners." arXiv:1911.11134 (RigL)
13. Liu et al. (2019). "DARTS: Differentiable Architecture Search." arXiv:1806.09055

Neuroscience Foundations:
14. Bliss & Lømo (1973). "Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path." Journal of Physiology.
15. McClelland et al. (1995). "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory." Psychological Review.
Repositories
- KRONOS Memory-Augmented Transformer: https://github.com/agentic-labs/KRONOS
- Hyperbolic Neural Networks: https://github.com/HazyResearch/hgcn
- PyTorch Sparse Training (RigL): https://github.com/google-research/rigl
- Memorizing Transformer: https://github.com/lucidrains/memorizing-transformers-pytorch
Success Criteria
Minimum Viable Product (MVP)
- [ ] Trace memory overhead < 50 MB
- [ ] Training slowdown < 10%
- [ ] Inference speedup ≥ 5% on familiar patterns
- [ ] No catastrophic forgetting on standard benchmarks
Target Goals
- [ ] Perplexity improvement ≥ 5% on long-context tasks
- [ ] FLOP reduction ≥ 15% from crystallized motifs
- [ ] Trace coverage ≥ 30% of attention operations
- [ ] ≥ 50 stable crystallized motifs after full training
Stretch Goals
- [ ] Emergent hierarchical motif structure (≥ 3 levels deep)
- [ ] Task-specific motif specialization (interpretable)
- [ ] Zero-shot transfer of motifs to new tasks
- [ ] Publishable research contribution (novel technique)
Open Research Questions
- Trace Generalization: Do traces learned on one domain transfer to others?
- Optimal Sparsity: What is the Pareto frontier of trace density vs. performance?
- Crystallization Depth: What is the optimal motif length for different task types?
- Emergent Compositionality: Can motifs combine to form higher-order primitives?
- Cross-Attention Traces: Do traces in cross-attention (encoder-decoder) behave differently?
- Adaptive Decay: Can decay rates be learned per-trace rather than global?
- Motif Transfer Learning: Can crystallized motifs be transferred between models?
- Interpretability: What do crystallized motifs "mean" in human-understandable terms?
Notes & Caveats
This is pioneering research - no guarantees of success. The system is designed with graceful degradation: if traces/crystallization fail to improve performance, it degrades to standard transformer behavior.
Computational cost: Initial implementation may be slower due to dual-mode attention and logging overhead. Optimizations will come in later phases.
Hyperparameter sensitivity: Many new hyperparameters introduced. Extensive tuning required.
Interpretability: Emergent internal language may be completely alien to human understanding - visualization tools critical.
Distributed training: Initial implementation focuses on single-GPU. Multi-GPU support requires careful synchronization of trace/motif state.
Production readiness: This is a research feature. Full productionization (reliability, edge cases, monitoring) will require significant additional work.
Status: Ready for Phase 0 implementation
Next Steps: Create infrastructure module and unit tests
Owner: AI-OS Core Team
Timeline: ~18 weeks for full implementation