# Dynamic Subbrains (Mixture of Experts)
Purpose: Sparse expert routing for efficiency and specialization. Includes expert metadata/registry, expert-only training, and goal-aware routing hooks.
Status: Implemented core MoE in ACTv1 + expert registry and expert-only training. Goal-aware biasing and full GUI management are WIP.
Key files:
- MoE layer and routing stats: src/aios/core/hrm_models/moe_layer.py
- Goal-aware router (biasing by goals): src/aios/core/hrm_models/goal_aware_router.py
- Expert metadata and registry: src/aios/core/hrm_models/expert_metadata.py
- ACTv1 model uses MoE by default: src/aios/core/hrm_models/impl/hrm_act_v1.py and src/aios/core/brains/hf_brain.py
- Training CLI flags (MoE and experts): src/aios/cli/hrm_hf_cli.py
- Expert-only training implementation: src/aios/cli/hrm_hf/expert_training.py
- Metrics logging (load balancing + expert usage): src/aios/cli/hrm_hf/training_logic/train_epoch.py
- GUI Subbrains Manager (WIP): src/aios/gui/components/subbrains_manager_panel/
See also:
- Core training: CORE_TRAINING.md
- Memory optimization (8-bit optimizer, AMP): MEMORY_OPTIMIZATION.md
- Multi-GPU/Windows-friendly parallel training: MULTI_GPU_DISTRIBUTED.md, PARALLEL_TRAINING_BLOCK_CHUNK_SYSTEM.md
- Goals CLI (link goals to experts): CLI_COMMANDS.md (Goals section)
## What you get
- Sparse MoE with top-k expert routing per token: only the selected experts run, so with the defaults (8 experts, top-2) roughly 75% of the expert FFN compute is skipped while total parameter capacity grows.
- Automatic auxiliary load-balancing loss to prevent collapse.
- Periodic expert-usage metrics in your log file (routing probabilities, token counts).
- Expert-only training mode that produces standalone expert checkpoints and updates a persistent registry.
- Goal-aware router module (WIP hookup) to bias expert selection by active goals.
## Commands (PowerShell, Windows-first)
1) Train ACTv1 with MoE (default enabled)
- Flags come from aios hrm-hf train-actv1. MoE-related flags:
- --use-moe/--no-moe (default: --use-moe)
- --num-experts <int> (default: 8)
- --num-experts-per-tok <int> (top-k, default: 2)
- --moe-capacity-factor <float> (default: 1.25)
- --auto-adjust-lr/--no-auto-adjust-lr (default: on; reduces LR for MoE stability)
Example (small dry-run, logs expert usage):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 20 --batch-size 8 `
  --use-moe --num-experts 8 --num-experts-per-tok 2 --moe-capacity-factor 1.25 `
  --log-file artifacts/brains/actv1/metrics.jsonl
```
Disable MoE (train dense FFN instead):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 20 --batch-size 8 `
  --no-moe `
  --log-file artifacts/brains/actv1/metrics.jsonl
```
Tips:
- Lower --num-experts-per-tok to 1 to reduce active compute/memory on very constrained GPUs.
- Keep --auto-adjust-lr enabled unless you know what you’re doing; MoE routers can be unstable at higher LR.
2) Train a standalone expert only (writes artifacts/experts and updates registry)
- Trigger by passing --expert-id <string> to train-actv1.
- Uses a lightweight FeedForward expert, saves as .safetensors, and writes/updates artifacts/experts/registry.json.
Example (quick expert build):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 3 --batch-size 2 `
  --expert-id test-expert-004 `
  --default-goal "Improve summarization quality" `
  --log-file artifacts/experts/test-expert-004/metrics.jsonl
```
Outputs:
- artifacts/experts/test-expert-004/expert.safetensors
- artifacts/experts/registry.json (created or updated with metadata including expert_id, name, goals, checkpoint_path, is_active/is_frozen, hierarchy fields)
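If you want to inspect the registry outside the GUI, a minimal Python sketch like the one below works, assuming registry.json is a JSON mapping from expert IDs to the metadata fields listed above; the authoritative schema lives in src/aios/core/hrm_models/expert_metadata.py, so adjust the keys if the layout differs:

```python
# Minimal sketch: list registered experts from artifacts/experts/registry.json.
# Assumption: the file maps expert IDs to metadata dicts with the fields
# documented above (name, goals, checkpoint_path, is_active, ...).
import json
from pathlib import Path

registry_path = Path("artifacts/experts/registry.json")
registry = json.loads(registry_path.read_text(encoding="utf-8"))

for expert_id, meta in registry.items():
    print(expert_id, meta.get("checkpoint_path"), meta.get("goals"))
```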
3) Link goals to experts (biasing signal for router)
- Goals live in the directives DB and can be associated with an expert.
- Commands are under aios goals-*.
Examples:
```powershell
# Add a goal and link it to an expert immediately
aios goals-add "Improve summarization quality" --expert-id test-expert-004

# Link an existing goal to an expert
aios goals-link-expert 42 test-expert-004

# List active goals
aios goals-list

# List goals for an expert
aios goals-list-for-expert test-expert-004
```
Notes:
- The GoalAwareRouter module supports biasing toward experts linked to active goals. Integration into the default training/inference loop is in progress; track src/aios/core/hrm_models/goal_aware_router.py.
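For intuition, goal-aware routing amounts to adding a positive bias to the router logits of experts whose linked goals are currently active, before the usual top-k selection. The sketch below is conceptual only and does not reflect the actual GoalAwareRouter API; the function and parameter names are illustrative:

```python
# Conceptual sketch of goal-aware routing bias (not the GoalAwareRouter API):
# experts linked to active goals get a positive bias on their router logits
# before top-k selection, nudging tokens toward goal-relevant experts.
import torch

def bias_router_logits(logits, goal_linked_expert_ids, bias=1.0):
    """logits: [tokens, num_experts]; goal_linked_expert_ids: list of expert indices."""
    biased = logits.clone()
    biased[:, goal_linked_expert_ids] += bias
    return biased

logits = torch.randn(4, 8)                           # 4 tokens, 8 experts
biased = bias_router_logits(logits, [3], bias=1.0)   # expert 3 is linked to an active goal
print(biased.topk(2, dim=-1).indices)                # top-2 experts per token after biasing
```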
## Inputs and Outputs
Inputs (training flags relevant to MoE/experts):
- --use-moe, --num-experts, --num-experts-per-tok, --moe-capacity-factor, --auto-adjust-lr
- Standard training knobs: --max-seq-len, --batch-size, --steps, --lr, --amp, --gradient-checkpointing, etc.
- Expert-only mode: --expert-id <id> plus optional --default-goal to seed goal linkage.
Outputs (files and metrics):
- Brain training logs: when MoE is enabled, your --log-file JSONL includes (a small parsing sketch follows this list):
  - lb_loss: load-balancing loss value (applied internally; coefficient ~0.05)
  - Periodic expert_usage events with:
    - avg_routing_prob: average routing probability per expert
    - token_counts: tokens routed to each expert
    - total_tokens: total tokens seen when sampled
- Expert-only training:
  - artifacts/experts/<expert-id>/expert.safetensors
  - artifacts/experts/registry.json with entries like:
    - expert_id, name, description, category, goals, timestamps
    - is_active, is_frozen, parent_expert_id, child_expert_ids
    - checkpoint_path: e.g., artifacts\\experts\\<expert-id>\\expert.safetensors
    - training_config: hidden/intermediate sizes, steps, batch size, etc.
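To sanity-check routing from the log file, a short script like the following can summarize lb_loss and the periodic expert_usage events. It is a minimal sketch that assumes one JSON object per line carrying the fields listed above; adjust the key names if your log layout differs:

```python
# Minimal sketch: summarize MoE metrics from a --log-file JSONL.
# Assumption: each line is a JSON object; training records carry lb_loss and
# expert_usage events carry avg_routing_prob, token_counts, total_tokens.
import json
from pathlib import Path

log_path = Path("artifacts/brains/actv1/metrics.jsonl")
lb_losses, usage_events = [], []

for line in log_path.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    record = json.loads(line)
    if "lb_loss" in record:
        lb_losses.append(record["lb_loss"])
    if "expert_usage" in record:
        usage_events.append(record["expert_usage"])

if lb_losses:
    print(f"mean lb_loss: {sum(lb_losses) / len(lb_losses):.4f}")
if usage_events:
    latest = usage_events[-1]
    print("latest avg_routing_prob:", latest.get("avg_routing_prob"))
    print("latest token_counts:", latest.get("token_counts"))
```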
## How routing works (high level)
- Each MoE layer computes router logits over N experts and activates the top-k experts per token (--num-experts-per-tok).
- An auxiliary load-balancing loss is added to spread traffic across experts and avoid collapse.
- Metrics include lb_loss and a moe_layers count; expert_usage entries in the logs let you validate router health and specialization during training.
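For intuition, the routing step looks roughly like the sketch below: score each token against every expert, keep the top-k, and penalize uneven traffic. This is a minimal illustration, not the moe_layer.py implementation (capacity handling, dispatch, and the loss coefficient differ):

```python
# Minimal, self-contained sketch of top-k expert routing with an auxiliary
# load-balancing loss. Illustrative only; see src/aios/core/hrm_models/moe_layer.py
# for the real implementation.
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts_per_tok=2):
    """hidden: [tokens, d_model]; router_weight: [num_experts, d_model]."""
    logits = hidden @ router_weight.t()                    # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(num_experts_per_tok, dim=-1)

    # Auxiliary load-balancing loss (Switch-Transformer style): fraction of
    # tokens dispatched to each expert times mean routing prob per expert.
    num_experts = router_weight.shape[0]
    dispatch = F.one_hot(topk_idx[:, 0], num_experts).float().mean(dim=0)
    mean_prob = probs.mean(dim=0)
    lb_loss = num_experts * (dispatch * mean_prob).sum()   # ~1.0 when traffic is uniform
    return topk_probs, topk_idx, lb_loss

# Toy usage: 16 tokens, d_model=32, 8 experts, top-2 routing.
tokens = torch.randn(16, 32)
router = torch.randn(8, 32)
probs, idx, lb = route_tokens(tokens, router)
print(idx.shape, lb.item())
```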
## GUI: Subbrains Manager (WIP)
- The panel shows the expert registry with counts and hierarchy and can refresh from disk:
  - Code: src/aios/gui/components/subbrains_manager_panel/
  - Data loader: data_manager.py reads artifacts/experts/registry.json
- Actions like create/delete/freeze are currently placeholders that print "CLI command needed". Use the CLI for expert training and goal linking.
- As features land, the panel will manage expert lifecycle and goal associations directly.
## Troubleshooting
- Training is unstable (NaNs/Inf) with MoE:
  - Keep --auto-adjust-lr enabled (default); it reduces the LR for MoE automatically.
  - Lower the base --lr and/or --num-experts-per-tok.
  - Ensure AMP/precision settings are stable (--amp by default; try --model-dtype bf16 on supported GPUs).
- VRAM pressure with many experts:
  - Reduce --num-experts and/or set --num-experts-per-tok 1.
  - Use --amp, --gradient-checkpointing, and --use-8bit-optimizer (requires bitsandbytes).
- No expert usage metrics in log:
  - Ensure --use-moe is on.
  - expert_usage logs are periodic (every ~100 steps) and sampled from early layers; short runs may not emit them.
- Can't find the expert registry:
  - Path: artifacts/experts/registry.json. It's created on first expert-only training.
## Try it quickly
Minimal MoE run with metrics:

```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 30 --batch-size 4 `
  --use-moe --num-experts 8 --num-experts-per-tok 2 `
  --log-file artifacts/brains/actv1/metrics.jsonl
```

Train one tiny expert and link a goal:

```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 3 --batch-size 2 `
  --expert-id demo-expert-001 `
  --default-goal "Focus on troubleshooting clarity" `
  --log-file artifacts/experts/demo-expert-001/metrics.jsonl
aios goals-list-for-expert demo-expert-001
```
## Notes and next steps
- Goal-aware router module exists and exposes bias controls; full wiring to training/inference loops and GUI controls is in progress.
- The GUI Subbrains Manager will gain create/delete/freeze operations backed by CLI endpoints.
- We’ll expose advanced router knobs (e.g., load-balance loss coef) once stabilized.
Back to Feature Index: COMPLETE_FEATURE_INDEX.md • Back to Guide Index: ../INDEX.MD