Semantic Crystallization
Status: Research & Planning
Priority: Experimental / High Scientific Value
Objective: Pioneer cognitive architecture with emergent internal language through persistent memory and motif crystallization
Created: December 8, 2025
Type: Novel Research - Potential Breakthrough
Companion Documents:
- Mathematical Foundations: PERSISTENT_TRACES_APPENDIX_MATHEMATICAL_FOUNDATIONS.md - Detailed derivations, proofs, complexity analysis
- Cognitive Science Perspective: PERSISTENT_TRACES_COGNITIVE_SCIENCE.md - Emergent language theory, philosophical implications
- Quick Reference: PERSISTENT_TRACES_QUICK_REFERENCE.md - Config templates, FAQ, decision trees
- Vector Store Integration: data-backends-vector-stores.md (PF-005) - Persistent storage backend for cross-session memory
Executive Summary
The Vision
Current language models process each input independently, computing attention and routing decisions from scratch every forward pass. This is computationally wasteful and biologically implausible. Persistent Attention Traces and Semantic Crystallization aim to create a model that develops its own efficient internal "thought language" by:
- Remembering high-salience reasoning pathways that recur across contexts (attention traces)
- Consolidating frequently-used expert routing patterns into reusable symbolic primitives (crystallization)
- Evolving an internal cognitive lexicon optimized for dense, rapid computation rather than human readability
This is analogous to how biological neural networks consolidate episodic experiences into semantic knowledge through synaptic strengthening and pruning.
The Problem
- Attention is ephemeral: Valuable reasoning patterns discovered during training are recomputed from scratch every forward pass
- Routing is redundant: MoE models repeatedly discover the same expert pathways for similar inputs
- No cognitive compression: Models cannot develop efficient "shorthand" for complex conceptual patterns
- Memory inefficiency: Full attention matrices are O(L²) per layer - infeasible to persist
The Solution
Two-component cognitive enhancement system:
Component 1: Persistent Attention Traces
- Capture sparse high-salience attention edges (top 0.1%) that consistently strengthen across sequences
- Store in compact coordinate format: (layer, head, query_idx, key_idx, salience, age) → 12 bytes per trace
- Apply exponential decay to unused traces (biological forgetting)
- Bias future attention using accumulated trace memory → faster convergence to proven reasoning pathways
- Memory footprint: ~24 MB for a 32-layer model with per-head quotas (vs. ~68 GB for full attention storage)
Component 2: Semantic Crystallization
- Track MoE routing paths: sequences of expert activations [E₂ → E₇ → E₃ → E₁] across layers
- Build a suffix tree of frequently-traversed motifs with utility scoring
- Freeze high-utility motifs into specialized computational units (new "thought symbols")
- Competitive dynamics ensure only genuinely useful patterns survive
- Result: Model develops a hierarchical vocabulary of reusable reasoning primitives
Expected Outcomes
Baseline Model:
- Recomputes attention from scratch: 100% cost every forward pass
- Discovers useful routing path: forgotten immediately after sequence ends
- Long-horizon reasoning: struggles with dependencies beyond context window
Enhanced Model:
- Persistent traces accelerate attention: 20-40% speedup on familiar patterns
- Crystallized motifs become reusable "concepts": 15-30% FLOP reduction
- Emergent internal language: hierarchical reasoning primitives
- Long-term memory: consolidates knowledge across training lifetime
This is pioneering work - no existing implementation combines persistent attention biasing with routing-path crystallization at this level of integration.
Theoretical Foundation
Biological Inspiration
Long-Term Potentiation (LTP)
In biological neural networks, synaptic connections that repeatedly fire together strengthen through LTP - the cellular basis of learning and memory. Our persistent attention traces implement a computational analog:
- Hebbian principle: "Neurons that fire together wire together"
- Synaptic consolidation: Episodic memories (hippocampus) consolidate into semantic knowledge (neocortex) during sleep
- Structural plasticity: Frequently-used neural pathways develop stronger connections; unused pathways prune
Transfer to Transformers:
- Attention weights → synaptic strengths
- Persistent traces → consolidated synaptic weights
- Trace decay → synaptic pruning
- Crystallized motifs → semantic concepts in neocortex
Mathematical Formulation
1. Persistent Attention Traces
Notation:
- $A_{i,j}^{(l,h)}(t)$ = attention weight from query position $i$ to key position $j$, layer $l$, head $h$, timestep $t$
- $M_{i,j}^{(l,h)}$ = persistent trace memory (sparse matrix)
- $\mathcal{L}$ = task loss
- $\lambda \in [0,1]$ = trace retention rate (momentum)
- $\gamma \in [0,1)$ = decay rate for unused traces
- $\theta_{sal}$ = salience threshold for trace capture
- $\alpha \in [0,1]$ = bias injection strength
Salience Score (determines which attention edges persist): $$ S_{i,j}^{(l,h)}(t) = A_{i,j}^{(l,h)}(t) \cdot \left|\frac{\partial \mathcal{L}}{\partial A_{i,j}^{(l,h)}}\right| \cdot \text{persist}(i,j,t) $$
where $\text{persist}(i,j,t)$ measures how often edge $(i,j)$ has appeared in recent history.
Trace Memory Update (sparse EMA with competitive decay): $$ M_{i,j}^{(l,h)} \leftarrow \begin{cases} \lambda \cdot M_{i,j}^{(l,h)} + (1-\lambda) \cdot S_{i,j}^{(l,h)} & \text{if } S_{i,j}^{(l,h)} > \theta_{sal} \\ \gamma \cdot M_{i,j}^{(l,h)} & \text{otherwise} \end{cases} $$
Biased Attention Computation (inject memory into standard attention): $$ A'^{(l,h)} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}} + \alpha \cdot M^{(l,h)}\right) $$
Note: $M^{(l,h)}$ is sparse; zero entries are not stored.
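A minimal PyTorch sketch of the two rules above, using dense tensors for a single (layer, head) pair for readability; the real system would operate on the sparse COO traces described later, and all function names and default values here are illustrative.

```python
import torch
import torch.nn.functional as F

def update_trace_memory(trace_memory: torch.Tensor,
                        salience: torch.Tensor,
                        retention_rate: float = 0.95,     # lambda
                        decay_rate: float = 0.98,         # gamma
                        salience_threshold: float = 0.05) -> torch.Tensor:
    """Sparse EMA with competitive decay (dense tensors for clarity)."""
    reinforced = salience > salience_threshold
    return torch.where(
        reinforced,
        retention_rate * trace_memory + (1 - retention_rate) * salience,
        decay_rate * trace_memory,        # unused edges decay toward zero
    )

def biased_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                     trace_memory: torch.Tensor, bias_strength: float = 0.1):
    """softmax(Q K^T / sqrt(d_k) + alpha * M) V for a single head."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    scores = scores + bias_strength * trace_memory   # inject persistent trace bias
    attn = F.softmax(scores, dim=-1)
    return attn @ value, attn
```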
Memory Budget Per Layer: $$ \begin{align} N_{\text{edges}}^{\text{full}} &= T^2 \quad \text{(sequence length squared)} \\ N_{\text{edges}}^{\text{sparse}} &\approx k_{\text{sparse}} \cdot T^2 \quad \text{where } k_{\text{sparse}} \approx 0.001 \\ \text{Bytes per trace} &= 1 + 1 + 2 + 2 + 4 + 2 = 12 \text{ (layer:1B, head:1B, i:2B, j:2B, salience:4B, age:2B)} \\ B_{\text{mem}}^{(l,h)} &= N_{\text{edges}}^{\text{sparse}} \cdot 12 \text{ bytes} \end{align} $$
Example: For $T=4096$, $L=32$ layers, $H=32$ heads:
- Full attention storage: $32 \times 32 \times 4096^2 \times 4 \approx 68$ GB
- Sparse trace storage ($k=0.001$): $32 \times 32 \times (0.001 \times 4096^2) \times 12 \approx 206$ MB, reduced to $\approx 24$ MB once per-head quotas are enforced (see the implementation section)
Reduction factor: ~330× from sparsity alone, ~2,700× with quotas
2. Semantic Crystallization
Notation:
- $\pi(x) = [e_1^{(1)}, e_2^{(2)}, \ldots, e_L^{(L)}]$ = routing path for input $x$, where $e_l \in \{1, \ldots, K\}$
- $f(\pi)$ = frequency count of path $\pi$
- $U(\pi)$ = conditional utility of path $\pi$
- $H(\pi)$ = routing entropy of path $\pi$
Path Frequency (tracked via suffix tree): $$ f(\pi) = \sum_{t=1}^{T} \mathbb{1}[\pi(x_t) = \pi] $$
Conditional Utility (performance improvement when using path $\pi$): $$ U(\pi) = \mathbb{E}[\text{reward} \mid \pi] - \mathbb{E}[\text{reward}] = \frac{1}{f(\pi)} \sum_{x:\pi(x)=\pi} r(x) - \bar{r} $$
where $r(x)$ is task-specific reward (e.g., negative loss, accuracy).
Routing Entropy (measure of path stability): $$ H(\pi) = -\sum_{l=1}^{L} \sum_{k=1}^{K} p(e_l = k \mid \pi_{1:l-1}) \log p(e_l = k \mid \pi_{1:l-1}) $$
Low entropy → deterministic, stable routing → good crystallization candidate.
Crystallization Criterion (all conditions must hold): $$ \text{Crystallize}(\pi) \Leftrightarrow \begin{cases} f(\pi) > f_{\min} & \text{(sufficient frequency)} \\ U(\pi) > U_{\min} & \text{(positive utility)} \\ H(\pi) < H_{\max} & \text{(low entropy/stable)} \\ \text{age}(\pi) > \tau & \text{(temporal stability)} \end{cases} $$
Frozen Motif as New Expert: $$ E_{\text{new}}(x) = \prod_{l=1}^{L_{\text{motif}}} E_{e_l}^{(l)}(h_{l-1}) $$
The crystallized expert becomes a single atomic operation, callable like any other expert.
Competitive Consolidation (limited motif budget): $$ \text{Evict lowest-utility motif if } |\mathcal{M}| > M_{\max} \text{ and } \exists \pi': U(\pi') < \min_{\pi \in \mathcal{M}} U(\pi) $$
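A sketch of the crystallization gate and the competitive eviction rule, assuming motif statistics are tracked in a small record; `MotifStats`, `should_crystallize`, and `maybe_evict` are illustrative names, and the thresholds mirror the config defaults given later.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MotifStats:
    frequency: int      # f(pi)
    utility: float      # U(pi)
    entropy: float      # H(pi)
    age: int            # steps since first observation

def should_crystallize(m: MotifStats,
                       f_min: int = 100, u_min: float = 0.05,
                       h_max: float = 1.0, min_age: int = 500) -> bool:
    """All four criteria must hold simultaneously."""
    return (m.frequency > f_min and m.utility > u_min
            and m.entropy < h_max and m.age > min_age)

def maybe_evict(registry: Dict[str, MotifStats], candidate: MotifStats,
                max_motifs: int = 512) -> None:
    """Evict the lowest-utility motif only if the budget is full and the
    candidate beats it (competitive consolidation)."""
    if len(registry) < max_motifs:
        return
    worst_key = min(registry, key=lambda k: registry[k].utility)
    if candidate.utility > registry[worst_key].utility:
        del registry[worst_key]
```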
Memory-Efficient Implementation Design
Critical Constraints
- VRAM budget: User config allows ~90% GPU memory for training (train_cuda_mem_pct: 90)
- Typical VRAM: 12-24 GB consumer GPUs (RTX 4090, 3090)
- Model size: HRM-ACTV1 ranges from 150M to 1B+ parameters
- Must not OOM: Memory overhead must be negligible compared to model weights and activations
Trace Storage Data Structure
Sparse COO (Coordinate) Format:
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import torch


@dataclass
class AttentionTrace:
    """Single persistent attention edge."""
    layer_id: np.uint8    # 1 byte (max 255 layers)
    head_id: np.uint8     # 1 byte (max 255 heads)
    query_idx: np.uint16  # 2 bytes (max 65K sequence length)
    key_idx: np.uint16    # 2 bytes
    salience: np.float32  # 4 bytes (accumulated strength)
    age: np.uint16        # 2 bytes (timesteps since last reinforcement)
    # Total: 12 bytes per trace


@dataclass
class TraceMemoryLayer:
    """Sparse trace storage for one layer."""
    traces: List[AttentionTrace]  # Sparse edges only
    max_traces: int               # Quota per layer/head
    max_seq_len: int              # Bias matrix size (from config)
    decay_rate: float             # γ parameter
    retention_rate: float         # λ parameter
    bias_strength: float          # α parameter

    def to_sparse_matrix(self, device: torch.device) -> Optional[torch.Tensor]:
        """Convert traces to a sparse PyTorch tensor for bias injection."""
        if len(self.traces) == 0:
            return None
        indices = torch.tensor(
            [[t.query_idx, t.key_idx] for t in self.traces],
            dtype=torch.long
        ).t()
        values = torch.tensor(
            [t.salience for t in self.traces],
            dtype=torch.float32
        )
        size = (self.max_seq_len, self.max_seq_len)
        return torch.sparse_coo_tensor(indices, values, size, device=device)
Memory Budget Calculation:
# Configuration
L = 32 # layers
H = 32 # heads per layer
T = 4096 # max sequence length
k_sparse = 0.001 # sparsity (capture top 0.1% edges)
bytes_per_trace = 12
# Per-layer calculation
edges_per_layer_head = k_sparse * (T ** 2) # ~16,777 edges
traces_per_layer = H * edges_per_layer_head # ~536,870 traces
bytes_per_layer = traces_per_layer * bytes_per_trace # ~6.4 MB
# Full model
total_traces = L * traces_per_layer # ~17M traces
total_memory_mb = L * bytes_per_layer / (1024**2) # ~205 MB
# BUT: We enforce stricter quota to be safe
quota_per_head = 2048 # max traces per head
actual_traces = L * H * quota_per_head # ~2M traces
actual_memory_mb = actual_traces * bytes_per_trace / (1024**2) # ~24 MB
Result: ~24 MB total overhead for trace memory - completely negligible on modern GPUs.
Routing Path Storage (Suffix Tree)
Data Structure:
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Thresholds mirror the crystallization config defaults below.
MIN_MOTIF_LENGTH = 3
FREQ_THRESHOLD = 100
UTILITY_THRESHOLD = 0.05
ENTROPY_THRESHOLD = 1.0


@dataclass
class RoutingNode:
    """Node in routing path suffix tree."""
    expert_id: int                      # Expert ID at this layer
    layer: int                          # Layer depth
    count: int = 0                      # Traversal frequency
    total_reward: float = 0.0           # Cumulative task reward
    children: Dict[int, RoutingNode] = field(default_factory=dict)  # Next-layer expert transitions

    def utility(self, global_mean_reward: float = 0.0) -> float:
        """Conditional utility score (mean reward minus global baseline)."""
        if self.count == 0:
            return 0.0
        return (self.total_reward / self.count) - global_mean_reward

    def entropy(self) -> float:
        """Routing entropy at this node."""
        if not self.children:
            return 0.0
        total = sum(c.count for c in self.children.values())
        probs = [c.count / total for c in self.children.values()]
        return -sum(p * math.log(p + 1e-10) for p in probs)


@dataclass
class RoutingPathTree:
    """Suffix tree tracking all expert routing paths."""
    root: RoutingNode = field(default_factory=lambda: RoutingNode(expert_id=-1, layer=-1))
    motif_registry: Dict[str, "CrystallizedMotif"] = field(default_factory=dict)
    max_motifs: int = 512  # Hard limit on crystallized primitives

    def record_path(self, path: List[int], reward: float) -> None:
        """Update tree with new routing path observation."""
        node = self.root
        for layer, expert_id in enumerate(path):
            if expert_id not in node.children:
                node.children[expert_id] = RoutingNode(expert_id=expert_id, layer=layer)
            node = node.children[expert_id]
            node.count += 1
            node.total_reward += reward

    def find_crystallization_candidates(self) -> List[Tuple[List[int], float]]:
        """DFS to find high-utility stable paths."""
        candidates: List[Tuple[List[int], float]] = []

        def dfs(node: RoutingNode, path: List[int]) -> None:
            if len(path) >= MIN_MOTIF_LENGTH:
                if (node.count > FREQ_THRESHOLD and
                        node.utility() > UTILITY_THRESHOLD and
                        node.entropy() < ENTROPY_THRESHOLD):
                    candidates.append((path.copy(), node.utility()))
            for expert_id, child in node.children.items():
                path.append(expert_id)
                dfs(child, path)
                path.pop()

        dfs(self.root, [])
        return sorted(candidates, key=lambda x: x[1], reverse=True)
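A hypothetical usage sketch for the tree above; the path, the reward value, and the printing are placeholders for whatever the training loop actually supplies.

```python
# Hypothetical usage inside the training loop (values are illustrative).
tree = RoutingPathTree()

# After each sequence: path = one chosen expert per layer, reward = e.g. -loss.
tree.record_path(path=[2, 7, 3, 1], reward=0.12)
tree.record_path(path=[2, 7, 3, 5], reward=0.03)

# Every prune_interval steps: inspect the most useful stable paths.
for expert_path, utility in tree.find_crystallization_candidates()[:10]:
    print(f"candidate motif {expert_path} with utility {utility:.3f}")
```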
Memory Budget:
- Each node: ~48 bytes (int:4, int:4, int:8, float:8, dict_overhead:24)
- Max motif length: ~8 layers
- Branching factor: ~8 experts per layer on average
- Max tree size: $8^8 = 16.7$M nodes worst case → ~800 MB worst case
- Realistic case (pruned/sparse): ~100K nodes → ~5 MB
Total Persistent Memory:
- RAM (required): ~30 MB (traces + routing tree) - active working memory
- Vector Store (optional): unlimited capacity - cross-session persistence via Qdrant/LanceDB (see PF-005)
Vector Store Integration (Optional Enhancement)
Overview
While persistent traces and crystallized motifs operate effectively in RAM (~30 MB), optional integration with vector stores (see data-backends-vector-stores.md PF-005) enables:
- Cross-session persistence: Traces/motifs survive training restarts
- Unlimited capacity: Billions of traces vs 2M RAM limit
- Multi-model sharing: Transfer learned motifs between models
- Scalable retrieval: O(log n) ANN search vs O(n) RAM scan
Architecture
Three-tier memory hierarchy:
┌────────────────────────────────────────────────────┐
│ LAYER 3: Cognitive Components (this document)      │
│   ├─ TraceManager (24 MB RAM hot storage)          │
│   └─ RoutingPathTree (5 MB RAM motif tree)         │
└─────────────────┬──────────────────────────────────┘
                  │ optional sync
┌─────────────────┴──────────────────────────────────┐
│ LAYER 2: Specialized Wrappers                      │
│   ├─ TraceVectorStore (trace embedding + metadata) │
│   └─ MotifVectorStore (motif embedding + utility)  │
└─────────────────┬──────────────────────────────────┘
                  │ uses
┌─────────────────┴──────────────────────────────────┐
│ LAYER 1: VectorStoreClient (PF-005)                │
│   Backend: Qdrant or LanceDB                       │
│   Methods: upsert(), query(), delete()             │
└────────────────────────────────────────────────────┘
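The LAYER 2 wrappers are not specified elsewhere in this document; the sketch below shows one plausible shape for `TraceVectorStore`, written against an assumed `VectorStoreClient` interface. Only `upsert()`, `query()`, and `delete()` are named by PF-005 here, so the keyword arguments and collection naming are guesses and the real signatures may differ.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

import numpy as np


class TraceVectorStore:
    """Thin wrapper mapping traces onto the PF-005 VectorStoreClient.

    Sketch only: assumes the client exposes upsert()/query() roughly as
    listed in the diagram above; the actual PF-005 API may differ.
    """

    def __init__(self, client: Any, collection: str = "aios_memory_traces"):
        self.client = client
        self.collection = collection
        self.enabled = client is not None

    def upsert(self, ids: Sequence[str], embeddings: Sequence[np.ndarray],
               metadata: Sequence[Dict[str, Any]]) -> None:
        if not self.enabled:
            return
        # Assumed keyword names; adapt to the real VectorStoreClient signature.
        self.client.upsert(collection=self.collection, ids=list(ids),
                           vectors=[e.tolist() for e in embeddings],
                           payloads=list(metadata))

    def query(self, embedding: np.ndarray, top_k: int = 100,
              filter: Optional[Dict[str, Any]] = None
              ) -> List[Tuple[str, float, Dict[str, Any]]]:
        if not self.enabled:
            return []
        return self.client.query(collection=self.collection,
                                 vector=embedding.tolist(),
                                 top_k=top_k, filter=filter)
```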
Trace Embedding Specification
Challenge: Convert sparse coordinate trace to dense vector for similarity search.
Solution: Compositional embedding
import numpy as np
import torch
import torch.nn as nn


class TraceEmbedder:
    """Converts AttentionTrace to dense embedding."""

    def __init__(self, embed_dim: int = 128):
        # Learned embeddings for trace components
        self.layer_embed = nn.Embedding(256, embed_dim // 4)  # max 256 layers
        self.head_embed = nn.Embedding(256, embed_dim // 4)   # max 256 heads
        self.position_encoder = SinusoidalPositionalEncoding(embed_dim // 2)

    def embed(self, trace: AttentionTrace) -> np.ndarray:
        """Convert trace to 128D vector."""
        # Component embeddings (nn.Embedding expects index tensors)
        layer_vec = self.layer_embed(torch.tensor(int(trace.layer_id)))  # 32D
        head_vec = self.head_embed(torch.tensor(int(trace.head_id)))     # 32D
        pos_vec = self.position_encoder(
            int(trace.query_idx), int(trace.key_idx))  # 64D (encodes spatial relation)
        # Weighted by salience
        full_vec = torch.cat([layer_vec, head_vec, pos_vec], dim=-1)  # 128D
        return (full_vec * float(trace.salience)).detach().cpu().numpy()
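`SinusoidalPositionalEncoding` is referenced above but not defined in this document; one plausible sketch encodes the (query_idx, key_idx) pair as fixed sinusoidal features, half for the relative offset and half for the absolute query position. The exact design is an open choice.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Sketch: encode a (query_idx, key_idx) pair as fixed sinusoidal features."""

    def __init__(self, dim: int, max_period: float = 10000.0):
        super().__init__()
        assert dim % 4 == 0, "dim must be divisible by 4"
        self.dim = dim
        half = dim // 4
        freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
        self.register_buffer("freqs", freqs)

    def forward(self, query_idx: int, key_idx: int) -> torch.Tensor:
        feats = []
        for pos in (key_idx - query_idx, query_idx):  # relative offset, then absolute
            angles = pos * self.freqs
            feats.append(torch.cat([angles.sin(), angles.cos()]))
        return torch.cat(feats)  # shape [dim]
```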
Metadata storage:
trace_metadata = {
    "layer": trace.layer_id,
    "head": trace.head_id,
    "query_idx": trace.query_idx,
    "key_idx": trace.key_idx,
    "salience": trace.salience,
    "age": trace.age,
    "training_step": current_step,
    "task_tag": "QA",  # optional; e.g. "QA", "generation", "classification"
}
Motif Embedding Specification
Challenge: Embed variable-length expert sequences [E₂ → E₇ → E₃ → E₁] into a fixed-size vector.
Solution: Sequence encoding with utility weighting
import numpy as np
import torch
import torch.nn as nn


class MotifEmbedder:
    """Converts crystallized motif to dense embedding."""

    def __init__(self, embed_dim: int = 256, max_experts: int = 64):
        self.expert_embed = nn.Embedding(max_experts, embed_dim)
        self.sequence_encoder = nn.LSTM(
            input_size=embed_dim,
            hidden_size=embed_dim,
            num_layers=2,
            batch_first=True
        )

    def embed(self, motif: CrystallizedMotif) -> np.ndarray:
        """Convert motif expert sequence to 256D vector."""
        # Embed each expert in sequence
        expert_ids = torch.tensor(motif.expert_sequence)  # [L]
        expert_vecs = self.expert_embed(expert_ids)        # [L, 256]
        # Encode sequence
        _, (h_n, _) = self.sequence_encoder(expert_vecs.unsqueeze(0))
        motif_vec = h_n[-1].squeeze(0)                     # Final hidden state [256]
        # Weight by utility
        return (motif_vec * motif.utility).detach().cpu().numpy()
Metadata storage:
motif_metadata = {
    "motif_id": motif.id,
    "expert_sequence": motif.expert_sequence,  # [2, 7, 3, 1]
    "frequency": motif.count,
    "utility": motif.utility,
    "entropy": motif.entropy,
    "age": motif.age,
    "task_tags": ["retrieval", "QA"],  # tasks where motif is useful
}
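`CrystallizedMotif` is referenced by the routing tree and `MotifEmbedder` but never specified here; a minimal record mirroring the metadata fields above might look like the following (field names are assumptions, not a finalized schema).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrystallizedMotif:
    """Sketch of the record behind a frozen routing motif."""
    id: str                       # e.g. "motif_0007"
    expert_sequence: List[int]    # e.g. [2, 7, 3, 1], one expert per layer
    count: int = 0                # observed frequency f(pi)
    utility: float = 0.0          # conditional utility U(pi)
    entropy: float = 0.0          # routing entropy H(pi)
    age: int = 0                  # steps since crystallization
    task_tags: List[str] = field(default_factory=list)
```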
Synchronization Protocol
TraceManager persistence:
import logging

logger = logging.getLogger(__name__)


class TraceManager:
    def __init__(self, config, vector_store_client=None):
        self.traces = []  # RAM storage
        self.embedder = TraceEmbedder(config.embedding_dim)
        self.vector_store = TraceVectorStore(vector_store_client) if vector_store_client else None
        self.sync_interval = config.trace_sync_interval  # e.g., 1000 steps
        self.last_sync_step = 0

    def sync_to_vector_store(self, current_step: int):
        """Persist recent traces to vector DB."""
        if not self.vector_store or not self.vector_store.enabled:
            return
        # Convert traces to embeddings
        embeddings = [self.embedder.embed(t) for t in self.traces]
        ids = [f"trace_{t.layer_id}_{t.head_id}_{t.query_idx}_{t.key_idx}" for t in self.traces]
        metadata = [self._trace_to_metadata(t, current_step) for t in self.traces]
        # Upsert to vector store
        self.vector_store.upsert(ids, embeddings, metadata)
        self.last_sync_step = current_step
        logger.info(f"Synced {len(self.traces)} traces to vector store at step {current_step}")

    def load_from_vector_store(self, task_tag=None, top_k=10000):
        """Warm-start from previous training session."""
        if not self.vector_store or not self.vector_store.enabled:
            return
        # Query for relevant traces (if task_tag specified)
        filter_dict = {"task_tag": task_tag} if task_tag else None
        results = self.vector_store.query_all(top_k=top_k, filter=filter_dict)
        # Reconstruct traces from metadata
        for trace_id, score, metadata in results:
            trace = AttentionTrace(
                layer_id=metadata["layer"],
                head_id=metadata["head"],
                query_idx=metadata["query_idx"],
                key_idx=metadata["key_idx"],
                salience=metadata["salience"],
                age=metadata["age"],
            )
            self.traces.append(trace)
        logger.info(f"Loaded {len(results)} traces from vector store")
RoutingPathTree persistence (similar pattern for motifs).
Configuration Integration
Updated config schema (unified under memory namespace):
memory:
  # Vector store backend (PF-005)
  vector_store:
    backend: "qdrant"  # qdrant|lancedb|disabled
    qdrant:
      host: "localhost"
      port: 6333
      collection_prefix: "aios_memory"
    lancedb:
      path: "artifacts/memory/lancedb"

  # Persistent attention traces
  persistent_traces:
    enabled: false
    sparsity: 0.001
    quota_per_head: 2048
    salience_threshold: 0.05
    retention_rate: 0.95
    decay_rate: 0.98
    bias_strength: 0.1
    update_interval: 100
    warmup_steps: 1000
    # Vector store integration (optional)
    persist_to_vector_store: false  # Enable cross-session persistence
    trace_sync_interval: 1000       # Steps between DB syncs
    embedding_dim: 128              # Trace embedding size
    warm_start: false               # Load traces from DB on training start
    task_tag: null                  # Filter traces by task type

  # Semantic crystallization
  semantic_crystallization:
    enabled: false
    min_frequency: 100
    min_utility: 0.05
    max_entropy: 1.0
    min_age: 500
    max_motifs: 512
    # Vector store integration (optional)
    persist_motifs: false       # Auto-save crystallized motifs
    motif_embedding_dim: 256    # Motif embedding size
    share_across_models: false  # Allow other models to query motifs
Benefits of Integration
Without vector store (baseline):
- ✅ Works standalone, no external dependencies
- ✅ Fast (pure RAM)
- ❌ Limited to ~2M traces
- ❌ Lost on training restart
- ❌ Cannot share between models

With vector store (enhanced):
- ✅ Unlimited capacity (billions of traces)
- ✅ Persistent across sessions
- ✅ Multi-model collaboration
- ✅ Fast retrieval (ANN indexing)
- ❌ Requires Qdrant/LanceDB service
- ❌ Slight I/O overhead during sync
Recommendation: Start without vector store, enable after validating core functionality.
Architecture Integration (HRM-ACTV1 Specific)
Current Architecture Analysis
Files involved:
- src/aios/core/hrm_models/impl/hrm_act_v1.py - Main model architecture
- src/aios/core/hrm_models/impl/layers.py - Attention and MoE layers
- src/aios/core/hrm_models/moe_layer.py - MoE base implementation
- src/aios/core/hrm_models/dynamic_moe/dynamic_layer.py - Dynamic expert loading
- src/aios/cli/hrm_hf/training_loop.py - Training orchestration
Current attention implementation (layers.py line ~164):
def forward(self, cos_sin: CosSin, hidden_states: torch.Tensor) -> torch.Tensor:
    # ... qkv projection ...
    # Standard scaled dot-product attention
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)
    attn_weights = F.softmax(attn_weights, dim=-1)
    # Apply attention
    attn_output = torch.matmul(attn_weights, value)
Current MoE router (moe_layer.py line ~66):
def forward(self, hidden_states, top_k):
    logits = self.gate(hidden_states)  # [batch, seq, num_experts]
    top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
    top_k_weights = F.softmax(top_k_logits, dim=-1)
    return top_k_weights, top_k_indices, logits
Integration Points
Hook 1: Attention Trace Capture (Post-Attention)
Location: src/aios/core/hrm_models/impl/layers.py - Attention.forward()
Modification:
class Attention(nn.Module):
    def __init__(self, ...):
        # ... existing init ...
        self.trace_manager = None  # Set by model if tracing enabled

    def forward(self, cos_sin: CosSin, hidden_states: torch.Tensor) -> torch.Tensor:
        # ... existing QKV projection and RoPE ...

        # Compute attention weights
        attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # INTEGRATION POINT 1: Inject persistent trace bias
        if self.trace_manager is not None and self.trace_manager.enabled:
            trace_bias = self.trace_manager.get_bias_for_layer_head(
                layer_id=self.layer_id,
                head_id=head_id,  # iterate over heads
                device=attn_weights.device
            )
            if trace_bias is not None:
                attn_weights = attn_weights + self.trace_manager.bias_strength * trace_bias

        attn_weights = F.softmax(attn_weights, dim=-1)

        # INTEGRATION POINT 2: Capture high-salience traces (training only)
        if self.training and self.trace_manager is not None:
            # Detach to avoid interfering with backprop
            self.trace_manager.register_attention_for_update(
                layer_id=self.layer_id,
                head_id=head_id,
                attn_weights=attn_weights.detach(),
                query_idx=...,  # derived from position info
                key_idx=...,
            )

        # ... rest of attention computation ...
Gradient capture (for salience score):
# In training loop, after loss.backward()
if trace_manager.enabled:
    for layer_id, layer in enumerate(model.layers):
        attn_grad = layer.self_attn.attn_weights.grad  # Need to retain_grad()
        trace_manager.update_traces_with_gradients(layer_id, attn_grad)
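For the `.grad` access above to work, the attention module must keep a reference to its (non-leaf) attention-weight tensor and call `retain_grad()` on it before `backward()`; a hedged sketch of that capture path, with illustrative attribute names, extending the snippet above:

```python
# Inside Attention.forward(), before the softmax output is consumed (sketch):
#     if self.training and self.trace_manager is not None:
#         attn_weights.retain_grad()          # keep grad for a non-leaf tensor
#         self.attn_weights = attn_weights    # reference read by the training loop

# In the training loop, after loss.backward():
for layer_id, layer in enumerate(model.layers):
    attn = getattr(layer.self_attn, "attn_weights", None)
    if attn is not None and attn.grad is not None:
        # Salience ~ |A| * |dL/dA|, accumulated into the sparse trace memory.
        trace_manager.update_traces_with_gradients(layer_id, attn.grad.detach())
        layer.self_attn.attn_weights = None   # free memory promptly
```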
Hook 2: Routing Path Logging (Post-Router)
Location: src/aios/core/hrm_models/moe_layer.py - TopKRouter.forward()
Modification:
class TopKRouter(nn.Module):
    def __init__(self, ...):
        # ... existing init ...
        self.crystallization_tracker = None  # Set by model

    def forward(self, hidden_states, top_k):
        logits = self.gate(hidden_states)
        top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
        top_k_weights = F.softmax(top_k_logits, dim=-1)

        # INTEGRATION POINT 3: Log routing paths for crystallization
        if self.training and self.crystallization_tracker is not None:
            # Log chosen experts per token
            self.crystallization_tracker.record_routing(
                layer_id=self.layer_id,
                expert_indices=top_k_indices.detach(),
                routing_weights=top_k_weights.detach(),
                sequence_ids=...,  # from batch metadata
            )

        return top_k_weights, top_k_indices, logits
Hook 3: Auxiliary Loss Computation
Location: src/aios/cli/hrm_hf/training_loop.py - loss computation
Addition:
def compute_loss_with_auxiliary(model, batch, trace_manager, crystallization_tracker):
    # Standard forward pass
    outputs = model(batch)
    base_loss = outputs.loss

    # Existing MoE load balancing loss
    load_balance_loss = 0.0
    for layer in model.layers:
        if hasattr(layer, 'mlp') and hasattr(layer.mlp, 'last_router_logits'):
            load_balance_loss += compute_load_balance_loss(layer.mlp.last_router_logits)

    # NEW: Trace reuse auxiliary loss (encourage using persistent traces)
    trace_reuse_loss = 0.0
    if trace_manager.enabled:
        trace_reuse_loss = trace_manager.compute_trace_utilization_loss()

    # NEW: Crystallization stability loss (low entropy for stable paths)
    crystallization_loss = 0.0
    if crystallization_tracker.enabled:
        crystallization_loss = crystallization_tracker.compute_entropy_regularization()

    # Combined loss
    total_loss = (
        base_loss +
        0.01 * load_balance_loss +     # existing
        0.005 * trace_reuse_loss +     # new
        0.002 * crystallization_loss   # new
    )

    return total_loss, {
        'base_loss': float(base_loss),
        'load_balance_loss': float(load_balance_loss),
        'trace_reuse_loss': float(trace_reuse_loss),
        'crystallization_loss': float(crystallization_loss),
    }
Flash Attention Compatibility
Challenge: Flash Attention is a fused CUDA kernel - cannot intercept intermediate attention weights.
Solution: Dual-mode attention
class Attention(nn.Module):
    def forward(self, ...):
        # Trace capture requires explicit attention matrices
        if self.training and self.trace_manager is not None and self.trace_manager.capture_mode:
            # Use standard attention (slower but exposes weights)
            return self._forward_standard_with_traces(...)
        else:
            # Use Flash Attention (faster)
            return self._forward_flash_attn(...)
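One way the fast branch could be written is to delegate to PyTorch's `F.scaled_dot_product_attention` and pass the frozen trace bias as an additive attention mask; a sketch of that method body, assuming the sparse bias is densified per layer, and noting that supplying an explicit mask may cause PyTorch to fall back to a non-Flash kernel:

```python
import torch.nn.functional as F

def _forward_flash_attn(self, query, key, value):
    """Fused attention; persistent traces enter only as a precomputed bias (sketch)."""
    attn_bias = None
    if self.trace_manager is not None and self.trace_manager.enabled:
        sparse_bias = self.trace_manager.get_bias_for_layer_head(
            layer_id=self.layer_id, head_id=None, device=query.device)  # None = all heads (assumption)
        if sparse_bias is not None:
            # Dense [T, T] additive mask, broadcast over batch and heads.
            attn_bias = self.trace_manager.bias_strength * sparse_bias.to_dense()
    # PyTorch dispatches to Flash / memory-efficient kernels when it can;
    # an explicit attn_mask may force the math kernel on some versions.
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_bias)
```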
Training strategy:
- Phase 1 (warmup): Use standard attention, accumulate traces (0-5% of training)
- Phase 2 (main training): Use Flash Attention with frozen trace bias (95% of training)
- Phase 3 (periodic trace updates): Briefly switch to standard attention every N steps to refresh traces
Gradient Checkpointing Compatibility
Challenge: Gradient checkpointing discards activations, including attention weights.
Solution: Trace updates happen in separate forward-only passes
# Every TRACE_UPDATE_INTERVAL steps (e.g., 100)
if step % TRACE_UPDATE_INTERVAL == 0:
    with torch.no_grad():
        # Forward-only pass with trace capture enabled
        _ = model(batch, trace_capture_mode=True)
        # trace_manager accumulates observations

    # Update trace memory based on accumulated stats
    trace_manager.consolidate_traces()
Training Protocol
Loss Function Components
Total training loss: $$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta_1 \mathcal{L}_{\text{balance}} + \beta_2 \mathcal{L}_{\text{trace}} + \beta_3 \mathcal{L}_{\text{crystal}} $$
where:
- $\mathcal{L}_{\text{task}}$ = standard next-token prediction loss
- $\mathcal{L}_{\text{balance}}$ = MoE load balancing (existing)
- $\mathcal{L}_{\text{trace}}$ = trace utilization regularizer (new)
- $\mathcal{L}_{\text{crystal}}$ = crystallization stability (new)
1. Trace Utilization Loss
Goal: Encourage model to reuse persistent traces (exploration-exploitation balance).
$$ \mathcal{L}_{\text{trace}} = -\frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} \frac{\text{trace\_hits}^{(l,h)}}{\text{total\_attention\_ops}^{(l,h)}} $$
Penalizes models that ignore their accumulated trace memory.
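A sketch of how `TraceManager.compute_trace_utilization_loss()` could realize this term, assuming per-(layer, head) hit counters are accumulated during the forward pass; the counter names are illustrative.

```python
import torch

def compute_trace_utilization_loss(trace_hits: torch.Tensor,
                                   total_attention_ops: torch.Tensor) -> torch.Tensor:
    """L_trace = -(1 / (L*H)) * sum over (l, h) of the trace hit-rate.

    trace_hits, total_attention_ops: [L, H] counters gathered this step.
    """
    hit_rate = trace_hits / total_attention_ops.clamp_min(1)
    return -hit_rate.mean()  # mean over L*H entries == (1/(L*H)) * sum
```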
2. Crystallization Stability Loss
Goal: Encourage deterministic routing for high-utility paths (low entropy).
$$ \mathcal{L}_{\text{crystal}} = \frac{1}{|\mathcal{M}|} \sum_{\pi \in \mathcal{M}} H(\pi) \cdot \mathbb{1}[U(\pi) > U_{\min}] $$
Only penalize entropy for high-utility motifs (we want them stable).
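A sketch of the corresponding `compute_entropy_regularization()`, reusing the illustrative `MotifStats` record from the crystallization sketch earlier; as written it scores tracked statistics only, and a differentiable variant would recompute $H(\pi)$ from the router's softmax probabilities instead.

```python
from typing import Iterable

def compute_entropy_regularization(motifs: Iterable["MotifStats"],
                                   u_min: float = 0.05) -> float:
    """L_crystal: mean routing entropy over high-utility motifs only."""
    selected = [m.entropy for m in motifs if m.utility > u_min]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)
```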
Hyperparameters
Trace management:
persistent_traces:
  enabled: true
  sparsity: 0.001           # Top 0.1% of attention edges
  quota_per_head: 2048      # Max traces per head
  salience_threshold: 0.05  # Minimum salience to capture
  retention_rate: 0.95      # λ (momentum)
  decay_rate: 0.98          # γ (forgetting)
  bias_strength: 0.1        # α (injection strength)
  update_interval: 100      # Steps between trace consolidation
  warmup_steps: 1000        # Standard attention for trace accumulation
Crystallization:
semantic_crystallization:
  enabled: true
  min_frequency: 100     # f_min
  min_utility: 0.05      # U_min (5% improvement over baseline)
  max_entropy: 1.0       # H_max (low entropy = stable)
  min_age: 500           # Temporal stability requirement
  max_motifs: 512        # Hard limit on crystallized primitives
  motif_min_length: 3    # Minimum layers in motif
  motif_max_length: 8    # Maximum layers in motif
  prune_interval: 1000   # Steps between motif pruning
Loss coefficients:
loss_weights:
  task: 1.0
  load_balance: 0.01        # Existing MoE
  trace_utilization: 0.005  # New
  crystallization: 0.002    # New
Training Curriculum
Phase 1: Foundation (0-20% of training)
- Standard training, no trace/crystallization
- Model learns basic patterns
- Establishes baseline performance

Phase 2: Trace Accumulation (20-30% of training)
- Enable trace capture (standard attention mode)
- No bias injection yet
- Build initial trace memory

Phase 3: Trace-Biased Training (30-70% of training)
- Enable bias injection
- Switch to Flash Attention for speed
- Periodic trace updates every 100 steps
- Model learns to leverage persistent memory

Phase 4: Crystallization Discovery (70-85% of training)
- Enable routing path tracking
- Identify high-utility motifs
- No freezing yet (observation only)

Phase 5: Motif Crystallization (85-100% of training)
- Freeze top motifs into specialized experts
- Fine-tune with crystallized primitives
- Competitive pruning of low-utility motifs

A minimal phase-scheduler sketch for these switches follows below.
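The sketch maps training progress onto the feature switches above; the boundaries are the curriculum percentages and would normally live in config rather than be hard-coded.

```python
def curriculum_flags(progress: float) -> dict:
    """Map training progress in [0, 1] to feature switches for the five phases (sketch)."""
    return {
        "trace_capture":    progress >= 0.20,  # Phase 2+
        "trace_bias":       progress >= 0.30,  # Phase 3+
        "flash_attention":  progress >= 0.30,  # Phase 3+ (capture phases use standard attention)
        "routing_tracking": progress >= 0.70,  # Phase 4+
        "motif_freezing":   progress >= 0.85,  # Phase 5
    }

# Example: at 50% of training, traces are active but nothing is frozen yet.
assert curriculum_flags(0.5) == {
    "trace_capture": True, "trace_bias": True, "flash_attention": True,
    "routing_tracking": False, "motif_freezing": False,
}
```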
Stability Safeguards
Elastic Weight Consolidation (EWC):
- Protect important parameters supporting high-salience traces
- Fisher information matrix computed over trace-heavy pathways
- Penalty: $\mathcal{L}_{\text{EWC}} = \sum_i \frac{\lambda_i}{2} F_i (\theta_i - \theta_i^*)^2$ (a code sketch follows this section)

Router Collapse Prevention:
- Maintain minimum load balancing even with crystallization
- Enforce capacity constraints per expert
- Periodic entropy injection (exploration)

Memory Overflow Protection:
- Hard quota enforcement: evict lowest-salience traces when full
- Automatic decay rate adjustment if memory pressure detected
- Fallback to standard attention if trace system fails
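A sketch of the EWC penalty above, assuming a diagonal Fisher estimate and a parameter snapshot are maintained elsewhere; argument names are illustrative.

```python
import torch

def ewc_penalty(model: torch.nn.Module,
                fisher: dict, anchor: dict, lam: float = 1.0) -> torch.Tensor:
    """L_EWC = sum_i (lambda / 2) * F_i * (theta_i - theta_i*)^2.

    fisher: per-parameter Fisher diagonal estimated over trace-heavy pathways.
    anchor: parameter snapshot theta* taken at consolidation time.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (lam / 2.0) * (fisher[name] * (param - anchor[name]) ** 2).sum()
    return penalty
```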
Evaluation Framework
Metrics
Primary Metrics (Task Performance)
- Perplexity on validation set (lower is better)
- Long-range dependency accuracy (copy tasks, induction heads)
- Multi-hop reasoning (HotpotQA, MMLU)
- Inference latency (ms per token)
Trace-Specific Metrics
- Trace coverage: % of attention operations using persistent bias
- Trace stability: Correlation of traces across epochs
- Salience distribution: Histogram of trace strengths
- Memory efficiency: Actual bytes used vs. quota
Crystallization Metrics
- Motif count: Number of crystallized primitives over training
- Motif utility: Average $U(\pi)$ for crystallized motifs
- Routing entropy: Average $H(\pi)$ for active motifs
- FLOP reduction: % compute saved by crystallized experts
Emergent Language Metrics (Exploratory)
- Hierarchical structure: Tree depth of motif dependencies
- Compositionality: Frequency of motif combinations
- Semantic coherence: Clustering of motifs by task type
Baselines
- Standard HRM-ACTV1: No traces, no crystallization
- Compressive Transformer: External memory baseline
- Memorizing Transformer: kNN retrieval baseline
- Mixture-of-Experts (vanilla): MoE without crystallization
Ablation Studies
Test each component independently:
| Experiment | Traces | Crystallization | Expected Outcome |
|---|---|---|---|
| Baseline | ✗ | ✗ | Standard performance |
| Traces Only | ✓ | ✗ | Faster convergence, better long-range |
| Crystal Only | ✗ | ✓ | FLOP reduction, expert specialization |
| Full System | ✓ | ✓ | Synergistic improvements |
Ablation parameters:
- Trace sparsity: {0.0001, 0.001, 0.01}
- Bias strength $\alpha$: {0.01, 0.05, 0.1, 0.2}
- Crystallization threshold $U_{\min}$: {0.01, 0.05, 0.1}
- Motif length: {2, 4, 6, 8}
Evaluation Tasks
Short-term reasoning:
- HellaSwag (commonsense)
- PIQA (physical reasoning)
- WinoGrande (coreference)

Long-term reasoning:
- bAbI tasks (require multi-hop)
- SQuAD (reading comprehension)
- Custom copy/induction tasks

Expert utilization:
- Track which motifs activate for which task types
- Visualize routing path trees
- Measure task-motif specialization
Risk Analysis & Mitigation
Risk 1: Catastrophic Forgetting
Description: Frozen motifs become stale as data distribution shifts.
Likelihood: High
Impact: Critical (model performance degrades)
Mitigation:
- Drift detection: Monitor per-motif utility over sliding window
- Adaptive unfreezing: Unfreeze motifs if $U(\pi)$ drops below threshold
- Rehearsal buffer: Store examples that activated each motif, replay periodically
- Conditional crystallization: Only freeze motifs validated on diverse tasks

Risk 2: Router Collapse
Description: Crystallization causes all tokens to route through same few motifs.
Likelihood: Medium
Impact: High (loss of model capacity)
Mitigation:
- Strong load balancing: Maintain $\beta_1 \geq 0.01$ throughout
- Capacity limits: Enforce maximum tokens per expert per batch
- Diversity bonus: Add entropy bonus to routing loss during crystallization phase
- Progressive crystallization: Freeze motifs gradually (top-1, then top-5, etc.)

Risk 3: Memory Overflow
Description: Trace memory exceeds quota, causing OOM.
Likelihood: Low (hard limits in place)
Impact: Critical (training crash)
Mitigation:
- Hard quotas: Strictly enforce quota_per_head limits
- Competitive eviction: Lowest-salience traces evicted first
- Automatic decay tuning: Increase $\gamma$ if memory pressure detected
- Graceful degradation: Disable tracing if memory allocation fails
Risk 4: Flash Attention Incompatibility
Description: Trace capture breaks with fused kernels.
Likelihood: Low (dual-mode designed)
Impact: Medium (slower training)
Mitigation:
- Dual-mode attention: Standard for capture, Flash for speed
- Sparse capture schedule: Only enable standard attention every 100 steps
- Post-hoc approximation: Estimate salience from gradients without explicit weights

Risk 5: Gradient Instability
Description: Auxiliary losses destabilize training.
Likelihood: Medium
Impact: Medium (slower convergence)
Mitigation:
- Conservative coefficients: Start with $\beta_2, \beta_3 < 0.01$
- Gradual ramp-up: Linearly increase loss weights over training
- Gradient clipping: Clip auxiliary loss gradients separately
- Ablation-based tuning: Empirically find stable coefficient ranges

Risk 6: Emergent Language Failure
Description: Model doesn't develop meaningful internal primitives.
Likelihood: Medium (pioneering work)
Impact: Low (graceful degradation to baseline)
Mitigation:
- Baseline guarantees: System degrades to standard transformer if crystallization fails
- Interpretability tools: Visualize motif activations to detect failure early
- Manual seeding: Option to manually specify useful routing patterns
- Curriculum design: Carefully design tasks that encourage motif discovery
Implementation Roadmap
Phase 0: Infrastructure (Week 1-2)
Deliverables:
- [ ] Create src/aios/core/hrm_models/cognitive/ module
- [ ] Implement TraceManager class with sparse storage
- [ ] Implement RoutingPathTree suffix tree
- [ ] Add configuration schemas to config/default.yaml
- [ ] Unit tests for data structures
Files:
src/aios/core/hrm_models/cognitive/
    __init__.py
    trace_manager.py
    routing_tree.py
    crystallization.py
    config.py

tests/core/hrm_models/cognitive/
    test_trace_manager.py
    test_routing_tree.py
Phase 1: Attention Trace Capture (Week 3-4)
Deliverables:
- [ ] Modify Attention class to support trace hooks
- [ ] Implement salience score computation
- [ ] Add gradient-based trace updates
- [ ] Integration with existing training loop
- [ ] Memory profiling and optimization
Modified Files:
- src/aios/core/hrm_models/impl/layers.py
- src/aios/cli/hrm_hf/training_loop.py
Tests:
- Memory footprint < 50 MB for 32-layer model
- Trace capture doesn't slow training > 5%
- Salience scores correlate with task loss gradients
Phase 2: Persistent Bias Injection (Week 5-6)
Deliverables:
- [ ] Implement sparse trace → attention bias conversion
- [ ] Add dual-mode attention (standard/Flash)
- [ ] Trace update scheduling
- [ ] Exponential decay mechanism
- [ ] Visualization tools for traces

Tests:
- Bias injection improves convergence on copy tasks
- Flash Attention speedup maintained (> 90% of time)
- Trace stability across training runs

Phase 2.5: Vector Store Integration (Week 6.5-7.5) [OPTIONAL]
Prerequisites: PF-005 vector store implementation complete (Weeks 1-6)
Deliverables:
- [ ] Implement TraceEmbedder class (sparse trace → dense vector)
- [ ] Implement MotifEmbedder class (expert sequence → dense vector)
- [ ] Create TraceVectorStore wrapper around VectorStoreClient
- [ ] Create MotifVectorStore wrapper around VectorStoreClient
- [ ] Add sync_to_vector_store() and load_from_vector_store() to TraceManager
- [ ] Add persist_motifs() to RoutingPathTree
- [ ] Update configuration schema with vector store integration flags
- [ ] Integration tests for trace/motif persistence and retrieval
Modified Files:
- src/aios/core/hrm_models/cognitive/trace_manager.py
- src/aios/core/hrm_models/cognitive/routing_tree.py
- src/aios/core/hrm_models/cognitive/embedders.py (new)
- src/aios/core/hrm_models/cognitive/vector_wrappers.py (new)
- config/default.yaml
Tests:
- Trace embedding preserves similarity (nearby traces have high cosine similarity)
- Motif embedding distinguishes task types (retrieval vs generation motifs cluster separately)
- Sync/load cycle preserves trace salience within 1% error
- Cross-session warm-start improves initial perplexity vs cold start
- Memory overhead < 5 MB additional (for embedding models)

Acceptance Criteria:
- TraceManager can persist 100K traces and reload with < 2% information loss
- MotifVectorStore correctly retrieves top-10 similar motifs for given input
- Integration works with both Qdrant and LanceDB backends
- Disabling vector store (config flag) causes graceful fallback to RAM-only mode
Notes:
- This phase can be skipped if vector store integration not needed
- All subsequent phases work with or without vector store enabled
- See data-backends-vector-stores.md for VectorStoreClient API details
Phase 3: Routing Path Logging (Week 7-8)
Deliverables:
- [ ] Modify TopKRouter to log expert selections
- [ ] Build suffix tree from routing paths
- [ ] Compute path frequency and utility
- [ ] Implement competitive motif discovery
- [ ] Path visualization tools
Modified Files:
- src/aios/core/hrm_models/moe_layer.py
- src/aios/core/hrm_models/dynamic_moe/dynamic_layer.py
Tests:
- Suffix tree correctly tracks all paths
- Utility scores correlate with task performance
- Tree memory < 10 MB

Phase 4: Semantic Crystallization (Week 9-11)
Deliverables:
- [ ] Implement motif freezing mechanism
- [ ] Create distillation process for crystallized experts
- [ ] Add competitive pruning
- [ ] Integrate with expert registry
- [ ] Motif specialization analysis tools

Tests:
- Crystallized motifs reduce FLOPs
- Frozen motifs maintain performance
- Pruning doesn't cause catastrophic forgetting

Phase 5: Auxiliary Losses (Week 12-13)
Deliverables:
- [ ] Implement trace utilization loss
- [ ] Implement crystallization entropy loss
- [ ] Add EWC for stability
- [ ] Hyperparameter tuning
- [ ] Loss curve analysis

Tests:
- Auxiliary losses don't destabilize training
- Combined loss converges faster than baseline
- Ablations validate each component

Phase 6: Evaluation & Analysis (Week 14-16)
Deliverables:
- [ ] Run full baseline comparisons
- [ ] Long-range dependency benchmarks
- [ ] FLOP reduction measurements
- [ ] Emergent language analysis
- [ ] Documentation and research writeup

Benchmarks:
- bAbI tasks, SQuAD, HellaSwag, MMLU
- Custom induction head tasks
- Latency profiling

Phase 7: Production Hardening (Week 17-18)
Deliverables:
- [ ] Memory overflow safeguards
- [ ] Gradient checkpointing full compatibility
- [ ] Distributed training support (DeepSpeed)
- [ ] Configuration validation
- [ ] User documentation

Tests:
- Multi-GPU training stability
- Graceful recovery from OOM
- Configuration validation catches errors
References & Prior Art
Foundational Papers
Memory-Augmented Transformers:
1. Dai et al. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." arXiv:1901.02860
2. Rae et al. (2019). "Compressive Transformers for Long-Range Sequence Modelling." arXiv:1911.05507
3. Wu et al. (2022). "Memorizing Transformers." arXiv:2203.08913
4. Borgeaud et al. (2022). "Improving language models by retrieving from trillions of tokens." arXiv:2112.04426 (RETRO)

Mixture of Experts:
5. Shazeer et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538
6. Lepikhin et al. (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668
7. Fedus et al. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961

Continual Learning & Consolidation:
8. Kirkpatrick et al. (2017). "Overcoming catastrophic forgetting in neural networks." arXiv:1612.00796 (EWC)
9. Zenke et al. (2017). "Continual Learning Through Synaptic Intelligence." arXiv:1703.04200
10. Schwarz et al. (2018). "Progress & Compress: A scalable framework for continual learning." arXiv:1805.06370

Neural Architecture Search & Plasticity:
11. Frankle & Carbin (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." arXiv:1803.03635
12. Evci et al. (2020). "Rigging the Lottery: Making All Tickets Winners." arXiv:1911.11134 (RigL)
13. Liu et al. (2019). "DARTS: Differentiable Architecture Search." arXiv:1806.09055

Neuroscience Foundations:
14. Bliss & Lømo (1973). "Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path." Journal of Physiology.
15. McClelland et al. (1995). "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory." Psychological Review.
Repositories
- KRONOS Memory-Augmented Transformer: https://github.com/agentic-labs/KRONOS
- Hyperbolic Neural Networks: https://github.com/HazyResearch/hgcn
- PyTorch Sparse Training (RigL): https://github.com/google-research/rigl
- Memorizing Transformer: https://github.com/lucidrains/memorizing-transformers-pytorch
Success Criteria
Minimum Viable Product (MVP)
- [ ] Trace memory overhead < 50 MB
- [ ] Training slowdown < 10%
- [ ] Inference speedup ≥ 5% on familiar patterns
- [ ] No catastrophic forgetting on standard benchmarks
Target Goals
- [ ] Perplexity improvement ≥ 5% on long-context tasks
- [ ] FLOP reduction ≥ 15% from crystallized motifs
- [ ] Trace coverage ≥ 30% of attention operations
- [ ] ≥ 50 stable crystallized motifs after full training
Stretch Goals
- [ ] Emergent hierarchical motif structure (≥ 3 levels deep)
- [ ] Task-specific motif specialization (interpretable)
- [ ] Zero-shot transfer of motifs to new tasks
- [ ] Publishable research contribution (novel technique)
Open Research Questions
- Trace Generalization: Do traces learned on one domain transfer to others?
- Optimal Sparsity: What is the Pareto frontier of trace density vs. performance?
- Crystallization Depth: What is the optimal motif length for different task types?
- Emergent Compositionality: Can motifs combine to form higher-order primitives?
- Cross-Attention Traces: Do traces in cross-attention (encoder-decoder) behave differently?
- Adaptive Decay: Can decay rates be learned per-trace rather than global?
- Motif Transfer Learning: Can crystallized motifs be transferred between models?
- Interpretability: What do crystallized motifs "mean" in human-understandable terms?
Notes & Caveats
This is pioneering research - no guarantees of success. The system is designed with graceful degradation: if traces/crystallization fail to improve performance, it degrades to standard transformer behavior.
Computational cost: Initial implementation may be slower due to dual-mode attention and logging overhead. Optimizations will come in later phases.
Hyperparameter sensitivity: Many new hyperparameters introduced. Extensive tuning required.
Interpretability: Emergent internal language may be completely alien to human understanding - visualization tools critical.
Distributed training: Initial implementation focuses on single-GPU. Multi-GPU support requires careful synchronization of trace/motif state.
Production readiness: This is a research feature. Full productionization (reliability, edge cases, monitoring) will require significant additional work.
Status: Ready for Phase 0 implementation
Next Steps: Create infrastructure module and unit tests
Owner: AI-OS Core Team
Timeline: ~18 weeks for full implementation