# Dynamic Subbrains (Mixture of Experts)
Purpose: Sparse expert routing for efficiency and specialization. Includes expert metadata/registry, expert-only training, and goal-aware routing hooks.
Status: Implemented core MoE in ACTv1 + expert registry and expert-only training. Goal-aware biasing and full GUI management are WIP.
Key files:
- MoE layer and routing stats: src/aios/core/hrm_models/moe_layer.py
- Goal-aware router (biasing by goals): src/aios/core/hrm_models/goal_aware_router.py
- Expert metadata and registry: src/aios/core/hrm_models/expert_metadata.py
- ACTv1 model uses MoE by default: src/aios/core/hrm_models/impl/hrm_act_v1.py and src/aios/core/brains/hf_brain.py
- Training CLI flags (MoE and experts): src/aios/cli/hrm_hf_cli.py
- Expert-only training implementation: src/aios/cli/hrm_hf/expert_training.py
- Metrics logging (load balancing + expert usage): src/aios/cli/hrm_hf/training_logic/train_epoch.py
- GUI Subbrains Manager (WIP): src/aios/gui/components/subbrains_manager_panel/
See also:
- Core training: CORE_TRAINING.md
- Memory optimization (8-bit optimizer, AMP): MEMORY_OPTIMIZATION.md
- Multi-GPU/Windows-friendly parallel training: MULTI_GPU_DISTRIBUTED.md, PARALLEL_TRAINING_BLOCK_CHUNK_SYSTEM.md
- Goals CLI (link goals to experts): CLI_COMMANDS.md (Goals section)
## What you get
- Sparse MoE with top-k expert routing per token: only the selected experts run, so with the defaults (8 experts, top-2) roughly 75% of the expert FFN compute is skipped while total parameter capacity grows.
- Automatic auxiliary load-balancing loss to prevent collapse.
- Periodic expert-usage metrics in your log file (routing probabilities, token counts).
- Expert-only training mode that produces standalone expert checkpoints and updates a persistent registry.
- Goal-aware router module (WIP hookup) to bias expert selection by active goals.
## Commands (PowerShell, Windows-first)
1) Train ACTv1 with MoE (default enabled)
- Flags come from aios hrm-hf train-actv1. MoE-related flags:
- --use-moe/--no-moe (default: --use-moe)
- --num-experts <int> (default: 8)
- --num-experts-per-tok <int> (top-k, default: 2)
- --moe-capacity-factor <float> (default: 1.25)
- --auto-adjust-lr/--no-auto-adjust-lr (default: on; reduces LR for MoE stability)
Example (small dry-run, logs expert usage):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 20 --batch-size 8 `
  --use-moe --num-experts 8 --num-experts-per-tok 2 --moe-capacity-factor 1.25 `
  --log-file artifacts/brains/actv1/metrics.jsonl
```
Disable MoE (train dense FFN instead):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 20 --batch-size 8 `
  --no-moe `
  --log-file artifacts/brains/actv1/metrics.jsonl
```
Tips:
- Lower --num-experts-per-tok to 1 to reduce active compute/memory on very constrained GPUs.
- Keep --auto-adjust-lr enabled unless you know what you’re doing; MoE routers can be unstable at higher LR.
2) Train a standalone expert only (writes artifacts/experts and updates registry)
- Trigger by passing --expert-id <string> to train-actv1.
- Uses a lightweight FeedForward expert, saves as .safetensors, and writes/updates artifacts/experts/registry.json.
Example (quick expert build):
```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 3 --batch-size 2 `
  --expert-id test-expert-004 `
  --default-goal "Improve summarization quality" `
  --log-file artifacts/experts/test-expert-004/metrics.jsonl
```
Outputs:
- artifacts/experts/test-expert-004/expert.safetensors
- artifacts/experts/registry.json (created or updated with metadata including expert_id, name, goals, checkpoint_path, is_active/is_frozen, hierarchy fields)
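If you want to inspect the registry outside the GUI, a minimal Python sketch like the one below works, assuming registry.json is a JSON mapping from expert IDs to the metadata fields listed above; the authoritative schema lives in src/aios/core/hrm_models/expert_metadata.py, so adjust the keys if the layout differs:

```python
# Minimal sketch: list registered experts from artifacts/experts/registry.json.
# Assumption: the file maps expert IDs to metadata dicts with the fields
# documented above (name, goals, checkpoint_path, is_active, ...).
import json
from pathlib import Path

registry_path = Path("artifacts/experts/registry.json")
registry = json.loads(registry_path.read_text(encoding="utf-8"))

for expert_id, meta in registry.items():
    print(expert_id, meta.get("checkpoint_path"), meta.get("goals"))
```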
3) Link goals to experts (biasing signal for router)
- Goals live in the directives DB and can be associated with an expert.
- Commands are under aios goals-*.
Examples:
```powershell
# Add a goal and link it to an expert immediately
aios goals-add "Improve summarization quality" --expert-id test-expert-004

# Link an existing goal to an expert
aios goals-link-expert 42 test-expert-004

# List active goals
aios goals-list

# List goals for an expert
aios goals-list-for-expert test-expert-004
```
Notes:
- The GoalAwareRouter module supports biasing toward experts linked to active goals. Integration into the default training/inference loop is in progress; track src/aios/core/hrm_models/goal_aware_router.py.
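For intuition, goal-aware routing amounts to adding a positive bias to the router logits of experts whose linked goals are currently active, before the usual top-k selection. The sketch below is conceptual only and does not reflect the actual GoalAwareRouter API; the function and parameter names are illustrative:

```python
# Conceptual sketch of goal-aware routing bias (not the GoalAwareRouter API):
# experts linked to active goals get a positive bias on their router logits
# before top-k selection, nudging tokens toward goal-relevant experts.
import torch

def bias_router_logits(logits, goal_linked_expert_ids, bias=1.0):
    """logits: [tokens, num_experts]; goal_linked_expert_ids: list of expert indices."""
    biased = logits.clone()
    biased[:, goal_linked_expert_ids] += bias
    return biased

logits = torch.randn(4, 8)                           # 4 tokens, 8 experts
biased = bias_router_logits(logits, [3], bias=1.0)   # expert 3 is linked to an active goal
print(biased.topk(2, dim=-1).indices)                # top-2 experts per token after biasing
```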
## Inputs and Outputs
Inputs (training flags relevant to MoE/experts):
- --use-moe, --num-experts, --num-experts-per-tok, --moe-capacity-factor, --auto-adjust-lr
- Standard training knobs: --max-seq-len, --batch-size, --steps, --lr, --amp, --gradient-checkpointing, etc.
- Expert-only mode: --expert-id <id> plus optional --default-goal to seed goal linkage.
Outputs (files and metrics):
- Brain training logs: when MoE is enabled, your --log-file JSONL includes (a small parsing sketch follows this list):
  - lb_loss: load-balancing loss value (applied internally; coefficient ~0.05)
  - Periodic expert_usage events with:
    - avg_routing_prob: average routing probability per expert
    - token_counts: tokens routed to each expert
    - total_tokens: total tokens seen when sampled
- Expert-only training:
  - artifacts/experts/<expert-id>/expert.safetensors
  - artifacts/experts/registry.json with entries like:
    - expert_id, name, description, category, goals, timestamps
    - is_active, is_frozen, parent_expert_id, child_expert_ids
    - checkpoint_path: e.g., artifacts\\experts\\<expert-id>\\expert.safetensors
    - training_config: hidden/intermediate sizes, steps, batch size, etc.
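To sanity-check routing from the log file, a short script like the following can summarize lb_loss and the periodic expert_usage events. It is a minimal sketch that assumes one JSON object per line carrying the fields listed above; adjust the key names if your log layout differs:

```python
# Minimal sketch: summarize MoE metrics from a --log-file JSONL.
# Assumption: each line is a JSON object; training records carry lb_loss and
# expert_usage events carry avg_routing_prob, token_counts, total_tokens.
import json
from pathlib import Path

log_path = Path("artifacts/brains/actv1/metrics.jsonl")
lb_losses, usage_events = [], []

for line in log_path.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    record = json.loads(line)
    if "lb_loss" in record:
        lb_losses.append(record["lb_loss"])
    if "expert_usage" in record:
        usage_events.append(record["expert_usage"])

if lb_losses:
    print(f"mean lb_loss: {sum(lb_losses) / len(lb_losses):.4f}")
if usage_events:
    latest = usage_events[-1]
    print("latest avg_routing_prob:", latest.get("avg_routing_prob"))
    print("latest token_counts:", latest.get("token_counts"))
```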
## How routing works (high level)
- Each MoE layer computes router logits over N experts and activates the top-k experts per token (--num-experts-per-tok).
- An auxiliary load-balancing loss is added to spread traffic across experts and avoid collapse.
- Metrics include lb_loss and a moe_layers count; expert_usage entries in the logs let you validate router health and specialization during training.
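For intuition, the routing step looks roughly like the sketch below: score each token against every expert, keep the top-k, and penalize uneven traffic. This is a minimal illustration, not the moe_layer.py implementation (capacity handling, dispatch, and the loss coefficient differ):

```python
# Minimal, self-contained sketch of top-k expert routing with an auxiliary
# load-balancing loss. Illustrative only; see src/aios/core/hrm_models/moe_layer.py
# for the real implementation.
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts_per_tok=2):
    """hidden: [tokens, d_model]; router_weight: [num_experts, d_model]."""
    logits = hidden @ router_weight.t()                    # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(num_experts_per_tok, dim=-1)

    # Auxiliary load-balancing loss (Switch-Transformer style): fraction of
    # tokens dispatched to each expert times mean routing prob per expert.
    num_experts = router_weight.shape[0]
    dispatch = F.one_hot(topk_idx[:, 0], num_experts).float().mean(dim=0)
    mean_prob = probs.mean(dim=0)
    lb_loss = num_experts * (dispatch * mean_prob).sum()   # ~1.0 when traffic is uniform
    return topk_probs, topk_idx, lb_loss

# Toy usage: 16 tokens, d_model=32, 8 experts, top-2 routing.
tokens = torch.randn(16, 32)
router = torch.randn(8, 32)
probs, idx, lb = route_tokens(tokens, router)
print(idx.shape, lb.item())
```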
## GUI: Subbrains Manager (WIP)
- The panel shows the expert registry with counts and hierarchy and can refresh from disk:
  - Code: src/aios/gui/components/subbrains_manager_panel/
  - Data loader: data_manager.py reads artifacts/experts/registry.json
- Actions like create/delete/freeze are currently placeholders that print "CLI command needed". Use the CLI for expert training and goal linking.
- As features land, the panel will manage expert lifecycle and goal associations directly.
## Troubleshooting
- Training is unstable (NaNs/Inf) with MoE:
  - Keep --auto-adjust-lr enabled (default); it reduces the LR for MoE automatically.
  - Lower the base --lr and/or --num-experts-per-tok.
  - Ensure AMP/precision settings are stable (--amp by default; try --model-dtype bf16 on supported GPUs).
- VRAM pressure with many experts:
  - Reduce --num-experts and/or set --num-experts-per-tok 1.
  - Use --amp, --gradient-checkpointing, and --use-8bit-optimizer (requires bitsandbytes).
- No expert usage metrics in log:
  - Ensure --use-moe is on.
  - expert_usage logs are periodic (every ~100 steps) and sampled from early layers; short runs may not emit them.
- Can't find the expert registry:
  - Path: artifacts/experts/registry.json. It's created on first expert-only training.
## Try it quickly
Minimal MoE run with metrics:

```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 30 --batch-size 4 `
  --use-moe --num-experts 8 --num-experts-per-tok 2 `
  --log-file artifacts/brains/actv1/metrics.jsonl
```

Train one tiny expert and link a goal:

```powershell
aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --steps 3 --batch-size 2 `
  --expert-id demo-expert-001 `
  --default-goal "Focus on troubleshooting clarity" `
  --log-file artifacts/experts/demo-expert-001/metrics.jsonl
aios goals-list-for-expert demo-expert-001
```
## Notes and next steps
- Goal-aware router module exists and exposes bias controls; full wiring to training/inference loops and GUI controls is in progress.
- The GUI Subbrains Manager will gain create/delete/freeze operations backed by CLI endpoints.
- We’ll expose advanced router knobs (e.g., load-balance loss coef) once stabilized.
Back to Feature Index: COMPLETE_FEATURE_INDEX.md • Back to Guide Index: ../INDEX.MD