Advanced Training Features¶
Last Updated: December 12, 2025
Purpose: Practical guide to advanced HRM-HF training flags, best‑known‑good combinations, platform notes, and troubleshooting.
Status: Implemented
Overview¶
This guide consolidates all advanced knobs for the HRM-HF trainer (aios hrm-hf train-actv1) into one place. It groups options by function (attention/positional, memory & performance, precision/quantization, dataset streaming & chunking, distributed, MoE, PEFT, hot‑reload inference) and calls out compatibility constraints on Windows vs Linux.
Source of truth (CLI): src/aios/cli/hrm_hf_cli.py → command train-actv1
Configuration model: src/aios/core/hrm_training/training_config.py
See also:
- Memory & VRAM: MEMORY_OPTIMIZATION.md
- FlashAttention: FLASH_ATTENTION.md and FLASH_ATTENTION_VS_CHUNKING.md
- Multi‑GPU & streaming: MULTI_GPU_DISTRIBUTED.md, PARALLEL_TRAINING_BLOCK_CHUNK_SYSTEM.md
- MoE: DYNAMIC_SUBBRAINS_MOE.md
- PEFT/LoRA: LORA_PEFT.md
- Core training entry: CORE_TRAINING.md
Prerequisites¶
- Windows PowerShell (pwsh), repo venv activated
- GPU and drivers installed. CUDA recommended on NVIDIA. DirectML supported for inference; training support varies.
- For 8‑bit optimizer and INT8/INT4 quantization: bitsandbytes with a CUDA GPU. Windows support is limited—prefer Linux for 8/4‑bit.
- For FlashAttention 2: Ampere+ GPU; a dedicated FA2 build is required on most setups. Not typically available on Windows—falls back to SDPA.
- For DeepSpeed ZeRO: DeepSpeed installation (Linux recommended). ZeRO often not supported or unstable on Windows.
Commands (CLI syntax)¶
a) Direct CLI¶
aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 200 --batch-size 8
b) Module invocation (exact venv)¶
.venv\Scripts\python.exe -m aios.cli.aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --steps 200 --batch-size 8
Key option groups (with notes)¶
Attention & positional¶
- --use-flash-attn/--no-flash-attn: Enable FlashAttention 2. Requires Ampere+ and a proper install. On Windows it is usually unavailable and will fall back to SDPA.
- --window-size <int|None>: Sliding‑window attention. Use 256–512 for extreme contexts (50–100k tokens) to reduce memory. Works with or without FlashAttention.
- --pos-encodings rope|learned: Choose positional encoding. Default: rope.
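For example, a long‑context run might combine sliding‑window attention with RoPE as in the sketch below (the window size and batch size are illustrative, not tuned; on Windows, --use-flash-attn simply falls back to SDPA):
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-flash-attn --window-size 512 --pos-encodings rope `
--steps 200 --batch-size 4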
Memory & performance¶
- --gradient-checkpointing/--no-gradient-checkpointing: Reduces VRAM by ~30–50% at roughly 20% speed cost. Default: enabled.
- --amp/--no-amp: Mixed‑precision activations (FP16/BF16). Big savings with minimal quality impact. Default: enabled.
- --cpu-offload/--no-cpu-offload: Offload carry states to CPU between chunks for ultra‑long contexts (>500k tokens). Slower; requires sufficient system RAM.
- --use-8bit-optimizer: Use the bitsandbytes 8‑bit optimizer (~75% optimizer memory reduction). Requires CUDA + bitsandbytes.
- --dataset-chunk-size <int>: Samples per training cycle in iterate mode. Smaller uses less memory; larger is faster.
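A hedged example stacking the main memory savers (values are illustrative; drop --use-8bit-optimizer if bitsandbytes is unavailable, e.g. on Windows):
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--amp --gradient-checkpointing --use-8bit-optimizer `
--steps 200 --batch-size 4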
Precision & quantization¶
- --model-dtype fp32|fp16|bf16: Weight precision when loading full‑precision models. Separate from AMP.
- --load-in-8bit: INT8 weight loading (75% memory reduction). Requires bitsandbytes + CUDA.
- --load-in-4bit: INT4 (QLoRA‑style) loading (≈87.5% memory reduction). Strongly pair with PEFT.
Notes:
- When --load-in-8bit or --load-in-4bit is set, the base weights load quantized; AMP still controls activation precision. On Windows, bitsandbytes support is limited—prefer Linux.
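For instance, a sketch that loads weights in BF16 while AMP continues to govern activation precision (assumes a GPU with BF16 support; values are illustrative):
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--model-dtype bf16 --amp `
--steps 200 --batch-size 8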
Dataset streaming & chunked training¶
- --use-chunked-training: Split sequences into chunks to fit memory for long contexts.
- --chunk-size <tokens>: 1024–4096 is typical. Smaller = less VRAM, slower.
- --linear-dataset/--no-linear-dataset: Linear order (default) enables progress tracking and resume.
- --dataset-start-offset <int>: Resume index for linear mode.
- --iterate: Repeat generate→train cycles until stopped.
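A sketch of resuming a linear‑order chunked run from a saved position (the chunk size and start offset are illustrative placeholders):
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-chunked-training --chunk-size 2048 `
--linear-dataset --dataset-start-offset 5000 `
--steps 200 --batch-size 4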
Distributed & multi‑GPU¶
- --ddp: Enable torch.distributed (CUDA only). Best on Linux.
- --world-size <int>: Number of processes/GPUs for DDP.
- --cuda-ids "0,1": Pin devices explicitly.
- --parallel-independent: Windows‑friendly multi‑GPU alternative. Trains separate data blocks on different GPUs sequentially, then merges checkpoints. Bypasses DDP.
- --zero-stage none|zero1|zero2|zero3: DeepSpeed ZeRO. Requires DeepSpeed; Linux recommended.
- --strict: Disallow device fallbacks; fail fast on mismatches.
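A minimal Linux‑oriented two‑GPU DDP sketch (kept on one line so it works in any shell; on Windows, prefer --parallel-independent as in example 2 below):
aios hrm-hf train-actv1 --model gpt2 --dataset-file training_data/curated_datasets/test_sample.txt --ddp --world-size 2 --cuda-ids "0,1" --steps 200 --batch-size 8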
MoE (Mixture of Experts)¶
- --use-moe/--no-moe (default: enabled)
- --num-experts <int>: Total experts (capacity vs VRAM trade‑off).
- --num-experts-per-tok <int>: Top‑k experts per token; lower = faster/less memory.
- --moe-capacity-factor <float>: Load‑balancing headroom.
- --auto-adjust-lr/--no-auto-adjust-lr: Automatically reduce LR for MoE stability.
Tips: Start with 8 experts, top‑k=2, capacity 1.25; raise gradually.
PEFT (parameter‑efficient fine‑tuning)¶
- --use-peft/--no-peft and --peft-method lora|adalora|ia3|loha|lokr
- --lora-r, --lora-alpha, --lora-dropout, --lora-target-modules "q_proj,v_proj"
Best practice: With --load-in-4bit, enable PEFT and tune adapters only to keep memory low.
Inference hot‑reload during training¶
- --inference-device cuda:N: Use a dedicated GPU for inference while training on another.
- --hot-reload-steps <int>: How often to reload the inference model from checkpoints.
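A sketch that trains on the default GPU while a second GPU serves inference and periodically reloads from checkpoints (the device ID and reload interval are illustrative assumptions):
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--inference-device cuda:1 --hot-reload-steps 100 `
--steps 200 --batch-size 8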
Auto optimization¶
- --optimize: Auto‑find a stable combination of context length (up to ~100k) and batch size based on available VRAM. May override --max-seq-len and --batch-size.
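A minimal sketch; note that --optimize may override any explicit --max-seq-len/--batch-size you pass:
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--optimize --steps 200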
Try it: minimal safe examples¶
1) Quick dry‑run (single GPU)¶
Runs 1 step with a tiny batch to validate the pipeline and logging.
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--steps 1 --batch-size 2 `
--halt-max-steps 1 `
--eval-batches 1 `
--log-file artifacts/brains/actv1/metrics.jsonl
Expected outputs:
- Metrics log appended at artifacts/brains/actv1/metrics.jsonl
- Checkpoints under training_data/actv1 (default --save-dir)
2) Windows multi‑GPU without DDP¶
Use parallel‑independent with chunked training to reduce VRAM pressure.
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--parallel-independent `
--use-chunked-training --chunk-size 2048 `
--amp --gradient-checkpointing `
--steps 200 --batch-size 4
3) QLoRA‑style PEFT on a single GPU (very low VRAM)¶
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--load-in-4bit --use-peft --peft-method lora `
--lora-r 16 --lora-alpha 32 --lora-dropout 0.05 `
--lora-target-modules "q_proj,v_proj" `
--amp --gradient-checkpointing `
--steps 200 --batch-size 8
4) MoE enabled with conservative routing¶
aios hrm-hf train-actv1 --model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe --num-experts 8 --num-experts-per-tok 2 `
--moe-capacity-factor 1.25 --auto-adjust-lr `
--amp --gradient-checkpointing `
--steps 200 --batch-size 8
Compatibility, constraints, and tips¶
- FlashAttention 2:
  - Works best on Linux with an Ampere+ GPU and a proper FA2 install.
  - On Windows, expect fallback to SDPA; the performance gain may be limited.
- DeepSpeed ZeRO:
  - Requires DeepSpeed; Linux recommended. ZeRO is not typically supported on Windows.
- DDP:
  - Best on Linux. On Windows, prefer --parallel-independent.
- Quantization:
  - --load-in-4bit pairs best with --use-peft. Keep the base LM frozen; train adapters.
  - If bitsandbytes isn’t available, omit --load-in-8bit/--load-in-4bit and consider --use-8bit-optimizer only.
- Heads and shapes:
  - Ensure --num-heads divides --hidden-size. The validator will error otherwise.
- Chunked training:
  - For very long contexts, combine --use-chunked-training, --chunk-size 1024–2048, --gradient-checkpointing, and optionally --cpu-offload.
- Logging & resume:
  - Use --log-file for JSONL metrics; pair with --linear-dataset and --dataset-start-offset to resume deterministically.
Troubleshooting¶
- “Configuration error …” on launch:
  - Check incompatible shapes (e.g., num_heads must divide hidden_size).
  - Remove --use-flash-attn if FA2 isn’t installed; it will fall back, but explicit removal helps isolate issues.
  - If using ZeRO on Windows, remove --zero-stage.
- “bitsandbytes not found” or CUDA errors:
  - Remove --load-in-8bit/--load-in-4bit and/or --use-8bit-optimizer, or switch to Linux with CUDA.
- DDP hang on Windows:
  - Switch to --parallel-independent. Verify --cuda-ids and drivers.
- OOM during long‑context training:
  - Lower --chunk-size, enable --gradient-checkpointing, ensure --amp is on, and reduce --batch-size.
References¶
- CLI entry: src/aios/cli/hrm_hf_cli.py (train-actv1)
- Config: src/aios/core/hrm_training/training_config.py
- Related docs: see the “See also” links in the Overview above.
Back to Feature Index: COMPLETE_FEATURE_INDEX.md • Back to Guide Index: ../INDEX.MD