Core Training¶
Generated: December 12, 2025
Purpose: Architecture and training engine for HRM
Status: Implemented
Key Files¶
- src/aios/core/hrm_training/training_config.py – TrainingConfig (874+ lines)
- src/aios/cli/hrm_hf/train_actv1.py – Training loop (~2000 lines)
- src/aios/core/hrm_engine.py – Engine utilities
Overview¶
The core training flow is exposed via the aios hrm-hf train-actv1 command. It loads a base model and tokenizer, applies HRM training logic, logs metrics, writes checkpoints, and maintains a brain bundle directory under artifacts/brains/actv1/.
Training Configuration¶
- Single source of truth for parameters
- Validation, type checking, CLI arg conversion, defaults, serialization
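As a rough illustration of the responsibilities listed above, the sketch below shows a small dataclass that validates its fields, converts itself to CLI arguments, and round-trips through JSON. The class name MiniTrainingConfig, its fields, and its methods are hypothetical; the real TrainingConfig in training_config.py is far larger and may be organized differently. Only the flag names are taken from the CLI examples in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MiniTrainingConfig:
    """Hypothetical, trimmed-down stand-in for TrainingConfig."""

    model: str = "gpt2"
    dataset_file: str = "training_data/curated_datasets/test_sample.txt"
    steps: int = 1000
    batch_size: int = 2
    halt_max_steps: int = 1

    def __post_init__(self) -> None:
        # Validation: reject values that cannot describe a training run.
        for name in ("steps", "batch_size", "halt_max_steps"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

    def to_cli_args(self) -> list:
        # CLI arg conversion: field_name becomes --field-name <value>.
        args = []
        for key, value in asdict(self).items():
            args += ["--" + key.replace("_", "-"), str(value)]
        return args

    def to_json(self) -> str:
        # Serialization: sorted keys keep saved configs easy to diff.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "MiniTrainingConfig":
        return cls(**json.loads(text))
```

Keeping defaults, validation, and conversion in one class is what lets a single object act as the source of truth: every caller builds the same object and gets the same checks.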
Training Loop Features¶
- Gradient accumulation, loss/optimizer/scheduler
- Checkpoints, metrics logging, OOM handling, graceful stop file
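To make the gradient accumulation item concrete, here is a toy sketch in plain Python. It is not the code in train_actv1.py: the one-parameter model, the function names, and the data are all assumptions chosen so the arithmetic can be checked by hand. It shows that averaging the gradients of equal-sized micro-batches gives the same update as one large batch.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, micro_batches, lr):
    """Apply one optimizer step after accumulating over all micro-batches."""
    accum = 0.0
    for micro in micro_batches:
        # Divide by the number of micro-batches so the sum is an average.
        accum += mse_grad(w, micro) / len(micro_batches)
    # The weight is updated once, after every micro-batch has contributed.
    return w - lr * accum


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full_batch = 0.0 - 0.1 * mse_grad(0.0, data)
two_micro = accumulated_step(0.0, [data[:2], data[2:]], lr=0.1)
```

Here full_batch and two_micro agree up to floating-point rounding, which is why accumulation lets a small --batch-size stand in for a larger effective batch when memory is tight.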
Brain Bundle System¶
Directory structure:
artifacts/brains/actv1/<brain-name>/
├─ config.json
├─ model.safetensors
├─ tokenizer.json
├─ metadata.json
├─ training_args.json
└─ checkpoints/
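A quick way to check a bundle against the layout above is sketched below. The helper name missing_bundle_entries is hypothetical and not part of the aios package; the file names come straight from the directory listing, and the sketch assumes every listed entry is expected to exist after training.

```python
from pathlib import Path

# Names copied from the directory structure shown above.
EXPECTED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "metadata.json",
    "training_args.json",
)


def missing_bundle_entries(bundle_dir):
    """Return the expected bundle entries that are absent from bundle_dir."""
    root = Path(bundle_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not (root / "checkpoints").is_dir():
        missing.append("checkpoints/")
    return missing
```

For example, missing_bundle_entries("artifacts/brains/actv1/<brain-name>") returns an empty list when the bundle is complete.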
Commands (CLI syntax)¶
Training can be run either via the CLI entry point or through Python's module interface (python -m). On Windows, prefer the PowerShell examples below.
a) Direct CLI (named flags)¶
aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 1000 --batch-size 2 --halt-max-steps 1 --eval-batches 2 --log-file artifacts/brains/actv1/metrics.jsonl
b) Module invocation (explicit model flag)¶
.venv\Scripts\python.exe -m aios.cli.aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 1000 --batch-size 2 --halt-max-steps 1 --eval-batches 2 --log-file artifacts/brains/actv1/metrics.jsonl
Key parameters (selection)¶
- Model selection: --model <name_or_path>
- Dataset: --dataset-file <path> (txt/jsonl); optional --ascii-only
- Steps and batching: --steps <int>, --batch-size <int>
- Halting: --halt-max-steps <int> (controls ACT halting behavior)
- Evaluation: --eval-file <path>, --eval-batches <int>
- Logging: --log-file <path> (JSONL)
- Iteration control: --iterate, --stop-file <path>
- Brain bundle: --brain-name <str>, --bundle-dir <path>
- Architecture knobs: --h-layers, --l-layers, --hidden-size, --expansion, --num-heads, --h-cycles, --l-cycles, --window-size, --pos-encodings
- Memory: --gradient-checkpointing | --no-gradient-checkpointing, --amp | --no-amp, --use-8bit-optimizer
- Multi-GPU: --ddp, --cuda-ids <list>, --world-size <int>
- DeepSpeed: --zero-stage <none|zero1|zero2|zero3> (uses configs in config/)
- Experts: --expert-id <id> (train/freeze expert-specific components)
Notes:
- Paths are relative to repo root unless absolute. PowerShell accepts forward slashes (/) in Python paths.
- For Windows shells, escape backslashes or quote paths with spaces.
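When launching training from a script, the quoting and path concerns in the notes above can be sidestepped by building the command as a list. The sketch below is one possible approach, not something shipped with aios: build_train_command is a hypothetical helper, and it only knows how to emit bare switches and flag/value pairs.

```python
import sys


def build_train_command(options):
    """Build an argv list for the module invocation of train-actv1."""
    # sys.executable is the interpreter running this script, so the command
    # uses the currently active virtual environment on any operating system.
    argv = [sys.executable, "-m", "aios.cli.aios", "hrm-hf", "train-actv1"]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            argv.append(flag)  # bare switch such as --iterate
        elif value is False or value is None:
            continue  # omitted; this sketch never emits the --no-* forms
        else:
            argv += [flag, str(value)]
    return argv


command = build_train_command(
    {
        "model": "gpt2",
        "dataset_file": "training_data/curated_datasets/test_sample.txt",
        "steps": 1,
        "batch_size": 2,
        "halt_max_steps": 1,
    }
)
```

The resulting list can be passed to subprocess.run(command, check=True) from the repo root. Because each argument is its own list element, paths containing spaces need no extra quoting.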
Iterate Mode¶
- --iterate: restarts training after each completion with a new shuffle of the dataset; can be ended gracefully with a stop file (--stop-file <path>)
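The control flow described above can be sketched roughly as follows. This is an assumption about the general shape of the loop, not the actual implementation: iterate_passes, its parameters, and the max_passes cap (added only so the sketch terminates) are hypothetical.

```python
import random
from pathlib import Path


def iterate_passes(samples, stop_file, max_passes, seed=0):
    """Return the sample order used on each pass until the stop file appears."""
    rng = random.Random(seed)
    passes = []
    for _ in range(max_passes):
        # Check the stop file before each pass so a running pass can finish.
        if Path(stop_file).exists():
            break
        order = list(samples)
        rng.shuffle(order)  # each restart sees the data in a new order
        passes.append(order)
    return passes
```

Checking for the stop file between passes rather than killing the process is what makes the stop graceful: the current pass completes and its checkpoint can be written first.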
Evaluation During Training¶
- --eval-file, --eval-batches
- Periodic eval, validation loss, perplexity, history
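Validation loss and perplexity are related by a simple formula, sketched below. This assumes the validation loss is a mean cross-entropy per token measured in nats, in which case perplexity is its exponential; the document does not state how the trainer aggregates loss across eval batches, so the token-weighted average and both function names are assumptions.

```python
import math


def token_weighted_loss(batches):
    """Average per-token loss over (mean_loss, token_count) pairs."""
    total_tokens = sum(count for _, count in batches)
    if total_tokens == 0:
        raise ValueError("no tokens were evaluated")
    # Weight each batch by its token count so short batches do not dominate.
    return sum(loss * count for loss, count in batches) / total_tokens


def perplexity(mean_loss):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(mean_loss)
```

A loss of 0 gives a perplexity of 1 (a perfect model), and a loss of ln 4 gives a perplexity of 4, meaning the model is about as uncertain as a uniform choice among four tokens.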
Expert Training Mode¶
- --expert-id <id>: train individual expert, freeze base, save under artifacts/experts/<id>/
- Related: Dynamic Subbrains/MoE
Inputs¶
- Dataset file(s): training_data/curated_datasets/*.txt (example set provided)
- Optional eval file: training_data/eval_test_dataset.txt
- Base model: HuggingFace hub id or local path (e.g., gpt2 or artifacts/hf_implant/base_model)
- Tokenizer: auto-resolved from model or artifacts/hf_implant/tokenizers
Outputs¶
- Brain bundle under artifacts/brains/actv1/<brain-name>/
- Metrics log (JSONL): default/explicit artifacts/brains/actv1/metrics.jsonl
- Checkpoints under the bundle's checkpoints/ directory
- Optional evaluation summaries in metrics/logs
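Because the metrics log is JSONL (one JSON object per line), it can be inspected with a few lines of Python. The sketch below is only a reading aid: the helper names are hypothetical, and the record keys are not documented here, so any key passed to latest_value (such as "loss") is an assumption to verify against an actual log.

```python
import json


def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def latest_value(records, key):
    """Return the most recent value logged under key, or None if absent."""
    for record in reversed(records):
        if key in record:
            return record[key]
    return None
```

For example, read_jsonl("artifacts/brains/actv1/metrics.jsonl") loads the whole log; printing the first record is a quick way to discover which keys the trainer actually writes.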
Try it: quick dry-run examples¶
These mirror VS Code tasks configured in this repo and are safe to run. Ensure your venv is active.
Option 1: Direct CLI dry-run¶
aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 1 --batch-size 2 --halt-max-steps 1 --eval-batches 1 --log-file artifacts/brains/actv1/metrics.jsonl
Option 2: Module invocation¶
.venv\Scripts\python.exe -m aios.cli.aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 1 --batch-size 2 --halt-max-steps 1 --eval-batches 1 --log-file artifacts/brains/actv1/metrics.jsonl
Option 3: Use VS Code Task¶
- Run: Tasks → "Run brief HRM CLI dry-run" or "Run HRM dry-run (module)"
- Expected outputs:
  - Metrics JSONL at artifacts/brains/actv1/metrics.jsonl
  - Brain bundle directories under artifacts/brains/actv1/
  - Console logs including training/eval step counts
Usage Notes¶
- Use AMP and gradient checkpointing for memory savings
- Use 8-bit optimizer for larger models when bitsandbytes is available
Related: Memory Optimization, Model Architecture, Datasets, Tokenizers
Troubleshooting¶
- OOM (out of memory): lower --batch-size, --max-seq-len, or --dataset-chunk-size; enable --gradient-checkpointing and --amp; consider --use-8bit-optimizer if bitsandbytes is installed.
- FlashAttention: ensure --use-flash-attn is set and that your GPU supports it; otherwise it will fall back to SDPA.
- Multi-GPU on Windows: prefer --parallel-independent with --cuda-ids; DDP often fails on Windows. If you need DDP, set $env:AIOS_DDP_SPAWN = "1" before running with --ddp.
- Resume: when using parallel mode, chunk_tracker_state.json in the brain bundle enables resume; delete it if you want a fresh start.
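The advice to lower --batch-size after an OOM can also be automated around a launch script. The sketch below shows one such backoff pattern and is not a description of the trainer's built-in OOM handling: run_with_backoff and train_fn are hypothetical, and Python's built-in MemoryError stands in for whatever out-of-memory error the GPU framework raises.

```python
def run_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after each OOM."""
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except MemoryError:
            # Give up once the smallest allowed batch size has also failed.
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_train(batch_size):
    """Stand-in for a training run that only fits batches of 2 or fewer."""
    if batch_size > 2:
        raise MemoryError("simulated OOM")
    return "ok"


result = run_with_backoff(fake_train, batch_size=16)
```

Starting from 16, the sketch retries with 8, 4, and then 2 before succeeding, so result holds the batch size that finally fit together with the run's return value.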
See also:
- Parallel Training Block/Chunk System
- Multi-GPU & Distributed