Mathematical Foundations¶
Companion Document to: PERSISTENT_TRACES_SEMANTIC_CRYSTALLIZATION.md
Status: Theoretical Analysis
Created: December 8, 2025
📐 Detailed Mathematical Derivations¶
1. Trace Memory Update Algorithm¶
1.1 Formal Problem Statement¶
Given:
- Attention weights $A^{(l,h)}_{i,j}(t) \in [0,1]$ at timestep $t$
- Loss gradients $\nabla_A \mathcal{L}$
- Persistence history $p(t)$ tracking edge recurrence
- Memory budget $B$ (max traces per head)
Find:
- A sparse trace memory $M^{(l,h)}$ that maximizes expected future utility
- An update rule that balances plasticity and stability
1.2 Salience Score Derivation¶
Intuition: Attention edges that are both strong AND important for the loss should persist.
Components:
1. Attention magnitude: $A^{(l,h)}_{i,j}$ - the raw attention weight
2. Gradient importance: $\left|\frac{\partial \mathcal{L}}{\partial A^{(l,h)}_{i,j}}\right|$ - how much the loss depends on this edge
3. Temporal consistency: $p_{i,j}(t) = \sum_{\tau=t-w}^{t} \mathbb{1}[A^{(l,h)}_{i,j}(\tau) > \epsilon]$ - edge frequency
Combined salience: $$ S^{(l,h)}_{i,j}(t) = \underbrace{A^{(l,h)}_{i,j}(t)}_{\text{strength}} \cdot \underbrace{\left|\frac{\partial \mathcal{L}}{\partial A^{(l,h)}_{i,j}}\right|}_{\text{importance}} \cdot \underbrace{\log(1 + p_{i,j}(t))}_{\text{recurrence bonus}} $$
The $\log(1 + p)$ term provides diminishing returns - edges don't need to recur infinitely to be valuable.
1.3 Exponential Moving Average (EMA) Update¶
Standard EMA: $$ M^{(l,h)}_{i,j}(t+1) = \lambda \cdot M^{(l,h)}_{i,j}(t) + (1-\lambda) \cdot S^{(l,h)}_{i,j}(t) $$
Problem: Doesn't handle sparsity - memory grows unbounded.
Solution: Conditional update with competitive eviction.
Sparse EMA with Decay: $$ M^{(l,h)}_{i,j}(t+1) = \begin{cases} \lambda \cdot M^{(l,h)}_{i,j}(t) + (1-\lambda) \cdot S^{(l,h)}_{i,j}(t) & \text{if } S^{(l,h)}_{i,j}(t) > \theta_{\text{sal}} \\ \gamma \cdot M^{(l,h)}_{i,j}(t) & \text{otherwise} \\ 0 & \text{if } M^{(l,h)}_{i,j}(t+1) < \epsilon_{\text{prune}} \end{cases} $$
Competitive Eviction (when $|M^{(l,h)}| > B$): $$ \text{evict} = \arg\min_{(i,j) \in M^{(l,h)}} M^{(l,h)}_{i,j}(t) $$
Remove lowest-salience trace to make room for new high-salience trace.
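A minimal sketch of Sections 1.2-1.3 in Python, assuming traces for one head are kept in a plain dict keyed by edge $(i, j)$; the function names and the default values of $\lambda$, $\gamma$, $\theta_{\text{sal}}$, $\epsilon_{\text{prune}}$, and $B$ are illustrative placeholders, not part of any existing codebase.

```python
import math

def compute_salience(attn, grad, persistence):
    """Salience S = A * |dL/dA| * log(1 + p), as in Section 1.2."""
    return attn * abs(grad) * math.log(1.0 + persistence)

def update_traces(traces, salience, *, lam=0.95, gamma=0.99,
                  theta_sal=0.1, eps_prune=1e-4, budget=1024):
    """Sparse EMA with decay, pruning, and competitive eviction (Section 1.3).

    traces:   dict mapping edge (i, j) -> trace strength M[i, j]
    salience: dict mapping edge (i, j) -> salience S[i, j] at the current step
    """
    for edge in set(traces) | set(salience):
        s = salience.get(edge, 0.0)
        m = traces.get(edge, 0.0)
        if s > theta_sal:
            m = lam * m + (1.0 - lam) * s      # reinforce salient edges
        else:
            m = gamma * m                      # decay the rest
        if m < eps_prune:
            traces.pop(edge, None)             # prune near-zero traces
        else:
            traces[edge] = m

    while len(traces) > budget:                # competitive eviction at quota
        del traces[min(traces, key=traces.get)]
    return traces
```

A dict keeps the memory genuinely sparse: only reinforced edges ever hold an entry, and eviction is a minimum over at most $B$ keys.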
1.4 Steady-State Analysis¶
Question: Does the trace memory converge to a stable distribution?
Assumptions:
- The attention distribution is stationary: $A^{(l,h)}_{i,j}(t) \sim \mathcal{D}_A$
- The salience distribution is stationary: $S^{(l,h)}_{i,j}(t) \sim \mathcal{D}_S$
Steady state: $\mathbb{E}[M_{i,j}(t+1)] = \mathbb{E}[M_{i,j}(t)] = M^*_{i,j}$
For edges receiving reinforcement: $$ M^*_{i,j} = \frac{(1-\lambda)\,\mathbb{E}[S_{i,j}]}{1 - \lambda} = \mathbb{E}[S_{i,j}] $$
For edges not receiving reinforcement: $$ M^*_{i,j} = 0 \quad \text{(geometric decay drives to zero)} $$
Convergence rate: - Reinforced edges converge in $O(\frac{1}{1-\lambda})$ steps - Decaying edges vanish in $O(\frac{1}{1-\gamma})$ steps
Memory occupancy: $$ |M^{(l,h)}| \approx \min\left(B, \text{Pr}[S > \theta_{\text{sal}}] \cdot T^2\right) $$
If threshold is calibrated such that $\text{Pr}[S > \theta_{\text{sal}}] = \frac{B}{T^2}$, memory stays at quota.
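This calibration can be done empirically by setting $\theta_{\text{sal}}$ to a quantile of recently observed salience values. A small sketch, assuming a buffer `recent_saliences` of samples is available (the buffer and function name are illustrative):

```python
import numpy as np

def calibrate_salience_threshold(recent_saliences, budget, seq_len):
    """Choose theta_sal so that roughly B of the T^2 edges exceed it."""
    target_fraction = min(1.0, budget / float(seq_len ** 2))   # Pr[S > theta] = B / T^2
    # The (1 - target_fraction) quantile of observed salience is the threshold.
    return float(np.quantile(recent_saliences, 1.0 - target_fraction))
```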
2. Attention Bias Injection¶
2.1 Modified Attention Mechanism¶
Standard attention: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$
Trace-biased attention: $$ \text{Attention}_{\text{biased}}(Q, K, V, M) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \alpha \cdot M\right) V $$
2.2 Bias Strength Analysis¶
Question: How does $\alpha$ affect attention distribution?
Softmax sensitivity: For logits $z$ and perturbation $\delta$: $$ \frac{\partial}{\partial \delta_i} \text{softmax}(z + \delta)_i = \text{softmax}(z)_i (1 - \text{softmax}(z)_i) $$
Effective bias: $$ \Delta A_{i,j} \approx \alpha \cdot M_{i,j} \cdot A_{i,j} (1 - A_{i,j}) $$
Maximum effect when $A_{i,j} \approx 0.5$ (uncertain attention).
Recommended $\alpha$ calibration: $$ \alpha = \frac{1}{\max_{i,j} M_{i,j}} \cdot \beta $$
where $\beta \in [0.1, 0.3]$ controls overall bias strength. This normalizes bias relative to trace strength.
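A sketch of the biased attention forward pass from Sections 2.1-2.2, assuming PyTorch and, for readability, a dense trace matrix `M` (in practice `M` would be the sparse trace memory scattered into the logit shape); `beta` plays the role of $\beta$ above and its default is illustrative.

```python
import torch
import torch.nn.functional as F

def trace_biased_attention(Q, K, V, M, beta=0.2):
    """softmax(Q K^T / sqrt(d_k) + alpha * M) V, with alpha = beta / max(M)."""
    d_k = Q.size(-1)
    logits = Q @ K.transpose(-2, -1) / d_k ** 0.5     # [..., T, T] attention logits
    alpha = beta / M.max().clamp_min(1e-8)            # normalize bias to trace scale
    attn = F.softmax(logits + alpha * M, dim=-1)      # inject trace bias before softmax
    return attn @ V, attn
```

Normalizing $\alpha$ by the maximum trace keeps the added logits in a fixed range regardless of how large individual traces have grown.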
2.3 Information-Theoretic Analysis¶
Question: Does trace bias reduce attention entropy?
Attention entropy (before bias): $$ H(A) = -\sum_{j} A_{i,j} \log A_{i,j} $$
Attention entropy (after bias): $$ H(A') = -\sum_{j} A'_{i,j} \log A'_{i,j} $$
Theorem: If $M_{i,j} > 0$ is concentrated on a subset $\mathcal{S} \subset \{1, \ldots, T\}$, then $H(A') < H(A)$.
Proof: Bias increases probability mass on $\mathcal{S}$, reducing uncertainty.
Entropy reduction: $$ \Delta H = H(A) - H(A') \approx \alpha \sum_{j \in \mathcal{S}} M_{i,j} \log\left(\frac{T}{|\mathcal{S}|}\right) $$
Stronger bias and higher concentration → greater entropy reduction → faster convergence.
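An empirical check of this claim is straightforward; a small sketch assuming PyTorch attention tensors (names illustrative):

```python
import torch

def attention_entropy(attn, eps=1e-12):
    """Row-wise Shannon entropy H(A) = -sum_j A_ij log A_ij."""
    return -(attn * (attn + eps).log()).sum(dim=-1)

# Usage: delta_h = attention_entropy(attn_plain) - attention_entropy(attn_biased)
# A positive mean delta_h over rows touched by traces supports the theorem.
```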
3. Routing Path Crystallization¶
3.1 Suffix Tree Construction¶
Input: Sequence of routing paths $\{\pi_1, \pi_2, \ldots, \pi_N\}$ where $\pi_i = [e_1^{(1)}, e_2^{(2)}, \ldots, e_L^{(L)}]$
Output: Suffix tree $\mathcal{T}$ with nodes representing path prefixes.
Construction Algorithm:
class Node:
    """One prefix of a routing path, with its usage statistics."""
    def __init__(self, expert_id=None, layer=None):
        self.expert_id = expert_id
        self.layer = layer
        self.children = {}          # expert_id -> child Node
        self.count = 0              # number of paths passing through this prefix
        self.total_reward = 0.0     # accumulated reward over those paths

def build_routing_tree(paths, reward):
    """Insert each routing path into the tree; reward(path) returns its scalar reward."""
    root = Node()
    for path in paths:                        # path[l] = expert chosen at layer l
        node = root
        for layer, expert_id in enumerate(path):
            if expert_id not in node.children:
                node.children[expert_id] = Node(expert_id, layer=layer)
            node = node.children[expert_id]
            node.count += 1
            node.total_reward += reward(path)
    return root
Complexity:
- Time: $O(N \cdot L)$ for $N$ paths of length $L$
- Space: $O(|V|)$ where $|V|$ is the number of unique path prefixes
  - Worst case: $O(K^L)$ for $K$ experts per layer (all paths unique)
  - Best case: $O(L)$ (all paths identical)
  - Typical case: $O(K \cdot L)$ (bounded branching)
3.2 Utility Estimation¶
Conditional utility: $$ U(\pi) = \mathbb{E}[r \mid \pi] - \mathbb{E}[r] $$
Sample-based estimator: $$ \hat{U}(\pi) = \frac{1}{f(\pi)} \sum_{i: \pi(x_i) = \pi} r_i - \frac{1}{N} \sum_{i=1}^{N} r_i $$
Variance: $$ \text{Var}[\hat{U}(\pi)] = \frac{\sigma_r^2}{f(\pi)} + \frac{\sigma_r^2}{N} $$
Minimum frequency requirement: To ensure $\text{Var}[\hat{U}] < \epsilon^2$: $$ f_{\min} > \frac{\sigma_r^2}{\epsilon^2} $$
For $\sigma_r = 1$ and $\epsilon = 0.1$ (10% precision), need $f_{\min} > 100$.
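A minimal sketch of this estimator over the tree from Section 3.1, assuming each node carries the `count` and `total_reward` fields used there; the minimum-frequency guard mirrors the bound above.

```python
def estimate_utility(node, global_mean_reward):
    """U_hat(pi) = mean reward given the path prefix minus the global mean reward."""
    if node.count == 0:
        return 0.0
    return node.total_reward / node.count - global_mean_reward

def frequency_is_sufficient(node, sigma_r=1.0, eps=0.1):
    """Trust U_hat only once f(pi) exceeds sigma_r^2 / eps^2 samples."""
    return node.count > (sigma_r / eps) ** 2
```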
3.3 Crystallization Decision¶
Multi-criteria optimization: Find motifs that maximize: $$ \text{Score}(\pi) = w_1 \cdot \log f(\pi) + w_2 \cdot U(\pi) - w_3 \cdot H(\pi) + w_4 \cdot \text{age}(\pi) $$
Thresholding: $$ \text{Crystallize}(\pi) \Leftrightarrow \text{Score}(\pi) > \theta_{\text{crystal}} $$
Pareto frontier: Alternatively, find motifs that are Pareto-optimal across $(f, U, -H)$.
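A direct transcription of the threshold rule; the weights and $\theta_{\text{crystal}}$ below are illustrative placeholders that would need tuning.

```python
import math

def crystallization_score(freq, utility, entropy, age,
                          w1=1.0, w2=1.0, w3=0.5, w4=0.1):
    """Score(pi) = w1 log f + w2 U - w3 H + w4 age (Section 3.3)."""
    return w1 * math.log(max(freq, 1)) + w2 * utility - w3 * entropy + w4 * age

def should_crystallize(freq, utility, entropy, age, theta_crystal=2.0):
    """Crystallize(pi) iff Score(pi) > theta_crystal."""
    return crystallization_score(freq, utility, entropy, age) > theta_crystal
```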
3.4 Frozen Motif Implementation¶
Conceptual: Crystallized motif $\pi = [e_1, e_2, \ldots, e_k]$ becomes single expert $E_{\pi}$.
Forward pass: $$ E_{\pi}(h_0) = E_{e_k} \circ E_{e_{k-1}} \circ \cdots \circ E_{e_1}(h_0) $$
Computational savings:
- Standard MoE: $k$ router calls + $k \cdot \text{top_k}$ expert evaluations
- Crystallized: 0 router calls + 1 frozen-expert evaluation (a deterministic pass through the motif's $k$ expert blocks)
FLOP reduction: $$ \text{Savings} = k \cdot (\text{router_FLOPs} + (\text{top_k} - 1) \cdot \text{expert_FLOPs}) $$
For $k=4$ layers, $\text{top_k}=2$, router = 10% expert cost: $$ \text{Savings} \approx 4 \cdot (0.1 + 1) = 4.4 \text{ expert-equivalents} $$
Speedup: routed execution costs $4 \cdot (0.1 + 2) = 8.4$ expert-equivalents versus $4$ for the frozen path, so this motif runs roughly $2.1\times$ faster.
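To make the frozen path concrete, a minimal sketch assuming PyTorch; `FrozenMotif` is a hypothetical module that simply chains the motif's experts with gradients disabled (a production version might fuse or distill the composition instead).

```python
import torch.nn as nn

class FrozenMotif(nn.Module):
    """E_pi = E_{e_k} o ... o E_{e_1}: a crystallized path executed without routing."""
    def __init__(self, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)    # experts along the motif, in layer order
        for p in self.parameters():
            p.requires_grad_(False)              # frozen: no further adaptation

    def forward(self, h):
        for expert in self.experts:              # deterministic path, zero router calls
            h = expert(h)
        return h
```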
4. Computational Complexity Analysis¶
4.1 Trace Management Overhead¶
Per-forward-pass costs:
| Operation | Standard | With Traces | Overhead |
|---|---|---|---|
| Attention (Flash) | $O(T \cdot d)$ | $O(T \cdot d)$ | 0% (no trace) |
| Attention (Standard) | $O(T^2 \cdot d)$ | $O(T^2 \cdot d + B \cdot \log B)$ | +$O(B \log B)$ sparse add |
| Trace update | - | $O(T^2 + B \log B)$ | Once per 100 steps |
Sparse matrix addition (bias injection):
- Dense attention logits: $T^2$ elements
- Sparse trace matrix: $B$ elements
- Addition: $O(B)$ (iterate over the sparse entries)
- Negligible compared to the $O(T^2 \cdot d)$ attention computation
Trace consolidation (periodic):
- Sort by salience: $O(B \log B)$
- Decay all traces: $O(B)$
- Total: $O(B \log B)$ once per 100 steps → amortized $O(B \log B / 100)$ per step
Verdict: < 1% overhead when using Flash Attention + sparse capture scheduling.
4.2 Crystallization Overhead¶
Per-forward-pass costs:
| Operation | Cost |
|---|---|
| Log routing path | $O(L)$ |
| Update suffix tree | $O(L \log K)$ |
| Per-batch total | $O(B_{\text{batch}} \cdot L \log K)$ |
For batch size 32, 32 layers, 8 experts: $$ \text{Cost} = 32 \cdot 32 \cdot \log(8) = 3072 \text{ ops} \approx 0.0001\% \text{ of forward pass} $$
Motif discovery (periodic):
- Traverse the suffix tree: $O(|V|)$ where $|V| \approx K \cdot L$
- Compute utilities: $O(|V|)$
- Sort by score: $O(|V| \log |V|)$
- Total: $O(K \cdot L \log (K \cdot L))$ once per 1000 steps
Verdict: Negligible overhead (< 0.01%).
4.3 Memory Access Patterns¶
Trace memory (random access):
- Structure: Hash map (layer, head, i, j) → salience
- Access pattern: Random (depends on attention sparsity)
- Cache efficiency: Poor (each trace access likely cache miss)
- Optimization: Batch trace updates to improve locality
Routing tree (sequential access):
- Structure: Tree with pointer-based children
- Access pattern: Depth-first traversal (sequential)
- Cache efficiency: Good (children stored contiguously)
5. Convergence Guarantees¶
5.1 Trace Memory Convergence¶
Theorem 1: Under stationary salience distribution, trace memory converges to steady state.
Proof sketch:
- The EMA update is a contraction mapping for $\lambda \in (0,1)$
- Fixed point: $M^* = \mathbb{E}[S]$ for reinforced edges
- Lyapunov function: $V(M) = \sum_{i,j} (M_{i,j} - M^*_{i,j})^2$
- $\mathbb{E}[V(M(t+1))] \approx \lambda^2 V(M(t))$ → exponential convergence
Convergence rate: $O(\lambda^t)$
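A quick numerical check of Theorem 1 for a single reinforced edge, assuming a stationary (here, arbitrarily chosen exponential) salience distribution; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, steps = 0.95, 500
salience = rng.exponential(scale=0.2, size=steps)   # stationary salience samples

m, history = 0.0, []
for s in salience:
    m = lam * m + (1 - lam) * s                     # EMA update for a reinforced edge
    history.append(m)

# The trace settles near E[S] within a few hundred steps (rate ~ lambda^t).
print(f"final trace {history[-1]:.3f} vs mean salience {salience.mean():.3f}")
```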
5.2 Crystallization Stability¶
Theorem 2: Crystallized motifs remain stable if underlying routing distribution is stationary.
Proof sketch:
- Motif frequency $f(\pi)$ is a sample mean of Bernoulli trials
- By the LLN: $f(\pi) \to p(\pi)$ as $N \to \infty$
- Utility $U(\pi)$ is a sample mean of rewards
- By the LLN: $U(\pi) \to \mathbb{E}[r \mid \pi]$ as $f(\pi) \to \infty$
- If $p(\pi) > 0$ and $\mathbb{E}[r \mid \pi] > \mathbb{E}[r]$, the motif will eventually crystallize
Instability risk: If routing distribution is non-stationary (distribution shift), motifs may become stale.
Mitigation: Periodic motif revalidation.
🧮 Experimental Design for Emergent Language Analysis¶
Hypothesis Testing¶
H1: Hierarchical Structure
Crystallized motifs form hierarchical dependencies (motifs call other motifs).
Measurement:
- Build a call graph of motif activations
- Compute graph depth: $d_{\max} = \max_{\pi} \text{depth}(\pi)$
- Measure the branching factor: $b = \frac{|\text{edges}|}{|\text{nodes}|}$
Test: Compare to random graph baseline. Expect $d_{\max} > d_{\text{random}}$ and structured branching.
H2: Semantic Coherence
Motifs cluster by task type (e.g., arithmetic motifs vs. language motifs).
Measurement:
- Embed motifs in feature space: $\phi(\pi) = \text{avg}(\text{activations when } \pi \text{ fires})$
- Cluster using k-means or spectral clustering
- Compute cluster purity w.r.t. task labels
Test: Silhouette score > 0.5 indicates meaningful clustering.
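A sketch of the H2 measurement, assuming motif embeddings are stacked as rows of an array and scikit-learn is available; the cluster count is a placeholder.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def motif_cluster_quality(motif_embeddings, n_clusters=8, seed=0):
    """Cluster motif embeddings; silhouette > 0.5 suggests meaningful structure."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = km.fit_predict(motif_embeddings)
    return labels, silhouette_score(motif_embeddings, labels)
```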
H3: Compositionality
Motifs combine to form higher-order concepts.
Measurement:
- Track co-activation patterns: $P(\pi_i, \pi_j)$
- Mutual information: $MI(\pi_i, \pi_j) = \sum P(\pi_i, \pi_j) \log \frac{P(\pi_i, \pi_j)}{P(\pi_i)P(\pi_j)}$
- Find high-MI pairs → compositional primitives
Test: High-MI pairs should correspond to semantic composition (e.g., "question" + "retrieval" = "QA").
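A sketch of the pairwise MI estimate, assuming each motif's activity over a shared evaluation set is recorded as a boolean indicator vector (names illustrative).

```python
import numpy as np

def motif_mutual_information(a, b, eps=1e-12):
    """MI (in bits) between two binary motif-activation indicator vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    mi = 0.0
    for va in (False, True):
        for vb in (False, True):
            p_joint = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_joint > eps:
                mi += p_joint * np.log2(p_joint / (p_a * p_b + eps))
    return mi
```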
H4: Efficiency
Internal language is more efficient than token-level processing.
Measurement:
- Bits per concept: $\log_2(|\text{motif vocabulary}|)$
- Compare to bits per token: $\log_2(|\text{token vocabulary}|)$
- Compression ratio: $\frac{\text{avg tokens per concept} \times \text{bits per token}}{\text{bits per motif}}$
Test: Motifs should encode multi-token concepts in fewer "symbols".
Visualization Techniques¶
1. Motif Activation Heatmap
   - Rows: tasks; Columns: motifs; Color: activation frequency
   - Reveals task-motif specialization
2. Routing Path Sankey Diagram
   - Nodes: experts per layer; Edges: routing transitions; Width: frequency
   - Shows dominant pathways
3. Motif Dependency Graph
   - Nodes: crystallized motifs; Edges: co-activation (high MI)
   - Layout: hierarchical (depth by layer); Clusters: semantic groups
4. t-SNE/UMAP of Motif Embeddings
   - Each motif embedded by its average activation pattern, reduced to 2D
   - Color by task performance; reveals semantic topology
📊 Theoretical Limits¶
Information-Theoretic Bounds¶
Question: What is the maximum compression achievable through crystallization?
Token-level model:
- Vocabulary size: $|V_{\text{token}}| \approx 50,000$
- Bits per token: $\log_2(50,000) \approx 15.6$ bits
Motif-level model:
- Motif vocabulary: $|V_{\text{motif}}| \approx 500$
- Bits per motif: $\log_2(500) \approx 8.97$ bits
- Average tokens per motif: $\approx 5$ tokens
Compression ratio: $$ C = \frac{5 \cdot 15.6}{8.97} \approx 8.7\times $$
Theoretical limit (Shannon entropy): $$ H_{\max} = \log_2(K^L) $$
where $K$ experts, $L$ layers. For $K=8$, $L=8$: $$ H_{\max} = 8 \cdot 3 = 24 \text{ bits per motif} $$
Practical limit (accounting for motif frequency distribution): $$ H_{\text{eff}} = -\sum_{\pi} p(\pi) \log_2 p(\pi) $$
With Zipfian distribution over motifs, expect $H_{\text{eff}} \approx 10$ bits.
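A quick way to compute $H_{\text{eff}}$ under an assumed Zipfian motif distribution; the vocabulary size and exponent below are placeholders, not measured values.

```python
import numpy as np

def zipf_entropy_bits(vocab_size, exponent=1.0):
    """Entropy (bits) of a truncated Zipf distribution over a motif vocabulary."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    p = ranks ** (-exponent)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

# ~6.9 bits for a 500-motif vocabulary at exponent 1.0, versus the
# log2(500) ~ 9.0-bit uniform bound: skew pulls the effective entropy below H_max.
print(zipf_entropy_bits(500))
```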
Scaling Laws¶
Question: How does performance scale with model size, trace budget, and motif count?
Hypothesized scaling laws:
1. Trace effectiveness: $$ \text{Perplexity improvement} \propto \log(B / B_{\min}) $$ Diminishing returns - doubling the trace budget yields smaller gains.
2. Motif utility: $$ \text{FLOP reduction} \propto \sqrt{M} $$ where $M$ is the number of crystallized motifs. Sublinear due to motif overlap.
3. Training time: $$ \text{Convergence steps} \propto \frac{1}{(1 - \lambda)(1 - \gamma)} $$ Faster decay → faster convergence but less stability.
Empirical validation needed - these are testable predictions!
🔬 Advanced Topics¶
1. Multi-Task Crystallization¶
Challenge: Different tasks may benefit from different motifs.
Solution: Task-conditional crystallization
- Maintain separate motif registries per task type
- The router learns to select the appropriate registry based on the input
- Motifs specialize to task domains
Extension: Meta-learning over motif selection (learning to route to motifs).
2. Cross-Model Motif Transfer¶
Question: Can crystallized motifs transfer between models?
Approach:
1. Train model A and crystallize its motifs
2. Extract the motif expert weights
3. Initialize model B with the transferred motifs
4. Fine-tune routing so model B uses the transferred motifs
Hypothesis: High-level motifs (reasoning patterns) should transfer better than low-level (token-specific).
3. Adaptive Crystallization Depth¶
Observation: Some concepts need short motifs (2-3 layers), others need deep motifs (6-8 layers).
Solution: Variable-length crystallization
- Terminate a motif when its utility plateaus
- Allow motifs to "call" other motifs (recursive composition)
- Discover the optimal depth per concept automatically
4. Continuous Crystallization¶
Current: Discrete crystallization (freeze or don't freeze)
Alternative: Soft crystallization with learned "solidification" parameter $$ \text{Expert output} = \sigma(\pi) \cdot E_{\text{frozen}}(h) + (1 - \sigma(\pi)) \cdot E_{\text{routed}}(h) $$
where $\sigma(\pi) \in [0,1]$ is learned crystallization strength.
Allows gradual transition and adaptive unfreezing.
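A minimal sketch of this soft gate, assuming PyTorch; parameterizing $\sigma(\pi)$ as a single learned logit per motif is one possible choice, not a prescribed design.

```python
import torch
import torch.nn as nn

class SoftCrystallizedExpert(nn.Module):
    """Blend frozen and routed outputs: sigma(pi)*E_frozen(h) + (1 - sigma(pi))*E_routed(h)."""
    def __init__(self, frozen_expert, routed_expert):
        super().__init__()
        self.frozen = frozen_expert
        self.routed = routed_expert
        self.solidification = nn.Parameter(torch.zeros(1))   # learned logit behind sigma(pi)

    def forward(self, h):
        gate = torch.sigmoid(self.solidification)            # sigma(pi) in [0, 1]
        return gate * self.frozen(h) + (1.0 - gate) * self.routed(h)
```

Starting the logit at zero gives $\sigma(\pi) = 0.5$, so training can push a motif toward full crystallization or back toward routing, which supports the gradual transition and adaptive unfreezing described above.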
Status: Theoretical foundation complete
Next: Implement and validate experimentally
Questions: Document open research questions for community