
Feature Combination Matrix

Last Updated: December 12, 2025
Purpose: Feature compatibility reference - which combinations are verified and which are experimental

Note for v1.0.0: This matrix documents the current testing status of feature combinations. Items marked as "EXPERIMENTAL" or with TODO notes represent experimental combinations that may work but haven't been comprehensively tested. Use with appropriate caution.


📊 Status Legend

| Status | Meaning |
| --- | --- |
| ✅ VERIFIED | Tested and confirmed working |
| ⚠️ EXPERIMENTAL | Should work but not comprehensively tested |
| ❌ INCOMPATIBLE | Known to be incompatible |
| ❓ UNTESTED | Status unclear, use with caution |
| 🚧 PARTIAL | Partially works with known limitations |

🔬 Memory Optimization Combinations

Gradient Checkpointing + AMP

Status: ✅ VERIFIED WORKING
Benefit: ~60-70% memory reduction
Speed Impact: ~20% slower
Recommended: Yes, for most training

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --gradient-checkpointing `
  --amp `
  --steps 100

Test Results:
- ✅ Trains successfully
- ✅ Memory reduction confirmed
- ✅ No quality loss observed
- ✅ Works on single GPU
- ⚠️ Multi-GPU not tested


Gradient Checkpointing + AMP + 8-bit Optimizer

Status: ✅ VERIFIED WORKING
Benefit: ~70-80% memory reduction
Speed Impact: ~25% slower
Recommended: Yes, for large models (>100M params)

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --gradient-checkpointing `
  --amp `
  --use-8bit-optimizer `
  --steps 100

Test Results:
- ✅ Trains successfully
- ✅ Massive memory reduction
- ✅ Quality maintained
- ✅ Works with bitsandbytes 0.48.1
- ⚠️ Multi-GPU not tested

Requirements:
- bitsandbytes installed
- CUDA-capable GPU (Linux preferred)


Gradient Checkpointing + Long Context

Status: ⚠️ EXPERIMENTAL
Expected: Should work
Use Case: Train with longer sequences on limited VRAM

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --gradient-checkpointing `
  --max-seq-len 2048 `
  --batch-size 1 `
  --steps 100

Expected Behavior:
- ✅ Should enable 2K-4K context on an 11GB GPU
- ⚠️ Will be slower due to checkpointing
- ⚠️ Batch size must be very small

Note: Not extensively tested with contexts above 2048 tokens. Start with smaller contexts and increase gradually.


All Memory Optimizations Combined

Status: 🚧 PARTIAL
Features: Gradient Checkpointing + AMP + 8-bit + Chunking
Expected: Maximum memory efficiency
Use Case: Train very large models or very long contexts

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --gradient-checkpointing `
  --amp `
  --use-8bit-optimizer `
  --use-chunked-training --chunk-size 1024 `
  --max-seq-len 8192 `
  --batch-size 1 `
  --steps 100

Notes:
- ✅ Chunked training is implemented (--use-chunked-training, --chunk-size)
- ⚠️ Expect slower throughput at very small chunk sizes

TODO:
1. Test with various chunk sizes
2. Measure actual memory usage


🚀 Multi-GPU Combinations

DDP + Gradient Checkpointing

Status: โ“ EXPERIMENTAL
Expected: Fast distributed training with memory efficiency

Example (Linux recommended):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --ddp `
  --cuda-ids "0,1" `
  --world-size 2 `
  --gradient-checkpointing `
  --steps 100

Issues: - โ“ DDP implementation not verified - โ“ Does _maybe_spawn function exist? - โ“ Gradient sync working?

Windows tip: Prefer --parallel-independent instead of DDP.
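A hedged sketch of that Windows path, assuming --parallel-independent takes the place of --ddp/--world-size and otherwise reuses the flags from the example above:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --parallel-independent `
  --cuda-ids "0,1" `
  --gradient-checkpointing `
  --steps 100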


DDP + AMP

Status: โ“ EXPERIMENTAL
Expected: Fast training with mixed precision across GPUs

Example (Linux recommended):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --ddp `
  --cuda-ids "0,1" `
  --world-size 2 `
  --amp `
  --steps 100

Note: Whether AMP works correctly with DDP has not been extensively tested.


DDP + All Memory Optimizations

Status: โ“ EXPERIMENTAL
Expected: Maximum efficiency across multiple GPUs

Example (Linux recommended):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --ddp `
  --cuda-ids "0,1" `
  --world-size 2 `
  --gradient-checkpointing `
  --amp `
  --use-8bit-optimizer `
  --steps 100



Questions:
- Does the 8-bit optimizer work with DDP?
- Are optimizer states synchronized?
- Is there communication overhead?

TODO: Comprehensive multi-GPU testing


🧠 DeepSpeed Combinations

DeepSpeed ZeRO-1 + Gradient Checkpointing

Status: โ“ EXPERIMENTAL
Expected: Optimizer state partitioning + activation checkpointing

Example (Linux + DeepSpeed):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --zero-stage zero1 `
  --gradient-checkpointing `
  --cuda-ids "0,1" `
  --steps 100

TODO:
1. Verify DeepSpeed is actually initialized
2. Test the ZeRO-1 stage
3. Measure memory reduction


DeepSpeed ZeRO-2 + AMP

Status: โ“ EXPERIMENTAL
Expected: Gradient partitioning + mixed precision

Example (Linux + DeepSpeed):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --zero-stage zero2 `
  --amp `
  --cuda-ids "0,1" `
  --steps 100

Note: Not extensively tested; actual memory savings have not been measured.


DeepSpeed ZeRO-3 (Maximum Memory Reduction)

Status: โ“ EXPERIMENTAL
Expected: Parameter partitioning for massive models

Example (Linux + DeepSpeed):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --zero-stage zero3 `
  --gradient-checkpointing `
  --amp `
  --cuda-ids "0,1" `
  --steps 100

Note: The ZeRO-3 stage has not been extensively tested.


DeepSpeed + 8-bit Optimizer

Status: โ“ COMPATIBILITY UNKNOWN
Question: Can DeepSpeed work with bitsandbytes?

Example (Compat unknown):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --zero-stage zero2 `
  --use-8bit-optimizer `
  --cuda-ids "0,1" `
  --steps 100

Potential Issue: DeepSpeed has its own optimizer management - may conflict with bitsandbytes

Note: Compatibility has not been tested.


🧩 MoE / Dynamic Subbrains Combinations

MoE + Gradient Checkpointing

Status: ⚠️ UNTESTED
Expected: Should work
Use Case: Train models with experts efficiently

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --use-moe `
  --num-experts 4 `
  --gradient-checkpointing `
  --steps 100

Note: MoE combined with gradient checkpointing has not been tested.


MoE + AMP + 8-bit

Status: ⚠️ UNTESTED
Expected: Memory-efficient expert training

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --use-moe `
  --num-experts 8 `
  --gradient-checkpointing `
  --amp `
  --use-8bit-optimizer `
  --steps 100

Note: Expert training with these optimizations has not been tested.


Expert Training + Memory Optimizations

Status: ⚠️ UNTESTED
Expected: Efficient single expert training

Example:

aios hrm-hf train-actv1 `
  --model artifacts/hf_implant/base_model `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --expert-id "python_expert" `
  --gradient-checkpointing `
  --amp `
  --use-8bit-optimizer `
  --steps 100

Note: The expert-only training mode has not been extensively tested.


MoE + Multi-GPU

Status: โ“ EXPERIMENTAL
Expected: Expert parallelism across GPUs

Example (Linux recommended):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --use-moe `
  --num-experts 8 `
  --ddp `
  --cuda-ids "0,1" `
  --world-size 2 `
  --steps 100

Questions:
- How are experts distributed across GPUs?
- Is expert selection synchronized?
- What's the communication pattern?

Note: Expert parallelism has not been tested or documented.


MoE + DeepSpeed

Status: โ“ EXPERIMENTAL
Expected: Expert partitioning with ZeRO

Example (Linux + DeepSpeed):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --use-moe `
  --num-experts 16 `
  --zero-stage zero3 `
  --cuda-ids "0,1" `
  --steps 100

Note: DeepSpeed combined with MoE has not been tested.


📚 Context Length Combinations

Long Context + Chunking

Status: ✅ SUPPORTED
Expected: Enable 10K+ contexts by chunking

Example:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --max-seq-len 10000 `
  --use-chunked-training `
  --chunk-size 1024 `
  --gradient-checkpointing `
  --amp `
  --steps 100

Questions:
- How does chunking split sequences?
- What's the memory impact?

TODO:
1. Test with various context lengths: 8K, 16K, 32K
2. Measure actual memory usage


Long Context + Multi-GPU

Status: ⚠️ UNTESTED
Expected: Distribute long sequences across GPUs

Command:

aios hrm-hf train-actv1 \
  --model gpt2 \
  --dataset-file data.txt \
  --max-seq-len 8192 \
  --ddp \
  --cuda-ids "0,1" \
  --world-size 2 \
  --gradient-checkpointing \
  --batch-size 1 \
  --steps 1000

Note: Long context combined with DDP has not been tested.


FlashAttention + Memory/Chunking

Status: ⚠️ PLATFORM-DEPENDENT
Notes:
- --use-flash-attn is supported by the CLI and will enable FA2 when installed and compatible (Ampere+).
- On Windows, FA2 is commonly unavailable; training falls back to PyTorch SDPA.
- Combine with --window-size for extreme contexts when FA2 is not available.
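An untested sketch of the Linux/Ampere path, reusing flags shown elsewhere in this document; the only FlashAttention-specific flag is --use-flash-attn, and on Windows you would typically drop it and pass --window-size instead:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --use-flash-attn `
  --amp `
  --gradient-checkpointing `
  --max-seq-len 2048 `
  --steps 100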


🔤 Tokenizer Combinations

Custom Tokenizer + Training

Status: ⚠️ UNTESTED (except GPT-2)
Expected: Should work with any HuggingFace tokenizer

Verified:
- ✅ GPT-2 tokenizer

Needs Testing:
- ⚠️ Qwen 2.5
- ⚠️ Mistral
- ⚠️ Code Llama
- ⚠️ DeepSeek-Coder V2
- ⚠️ StarCoder2
- ⚠️ Phi-3
- ⚠️ Llama 3 (requires HF auth)

Note: Each of these tokenizers still needs to be verified with a basic training run.
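A hedged example for checking one of the untested tokenizers, assuming the tokenizer is pulled from whatever HuggingFace checkpoint is passed to --model (the Qwen model ID below is illustrative, not a verified configuration):

aios hrm-hf train-actv1 `
  --model "Qwen/Qwen2.5-0.5B" `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --amp `
  --steps 100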


Large Vocabulary + Memory Optimizations

Status: ⚠️ UNTESTED
Use Case: Tokenizers with 100K+ tokens (DeepSeek, Qwen, Llama 3)

Command:

aios hrm-hf train-actv1 \
  --model "deepseek-ai/deepseek-coder-v2-base" \
  --dataset-file data.txt \
  --gradient-checkpointing \
  --amp \
  --use-8bit-optimizer \
  --steps 1000

Considerations:
- Large vocabulary = larger embedding layer
- More memory needed for embeddings
- May need aggressive optimizations

Note: Not extensively tested with large-vocab tokenizers


📊 Dataset Format Combinations

Streaming Dataset + Linear Mode

Status: ✅ SUPPORTED
Features:
- Linear progression with resume via --dataset-start-offset
- Iterate mode for long-running cycles via --iterate
- ✅ Infinite streaming
- ✅ Shuffle support
- ✅ Caching
- ✅ Memory-efficient
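A sketch combining the flags above; the offset value is illustrative and would normally come from the point where a previous run stopped:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --dataset-start-offset 5000 `
  --iterate `
  --steps 100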


Large Dataset + Multi-GPU

Status: ⚠️ UNTESTED
Expected: Distributed dataset loading

Command:

aios hrm-hf train-actv1 \
  --model gpt2 \
  --dataset-file large_dataset.txt \
  --ddp \
  --cuda-ids "0,1" \
  --world-size 2 \
  --steps 10000

Questions:
- Is the dataset split across workers?
- Is shuffling consistent?
- What's the I/O pattern?

Note: Multi-GB datasets have not been tested.


Archive Dataset + Training

Status: ⚠️ PARTIALLY TESTED
Supported Formats: .tar, .tar.gz, .tar.bz2, .zip

Known Issues:
- ⚠️ Large archives may hang (BUG-002)
- ⚠️ Many small files may be slow

Note: Archive loading performance has not been extensively tested.
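A hedged example, assuming an archive in one of the supported formats can be passed straight to --dataset-file (the archive path is illustrative):

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/code_corpus.tar.gz `
  --gradient-checkpointing `
  --steps 100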


🎮 GUI Feature Combinations

GUI + Background Training

Status: ⚠️ UNTESTED
Expected: GUI should remain responsive during training

Note: GUI responsiveness during training has not been tested.


GUI + Multi-GPU

Status: โ“ EXPERIMENTAL
Question: Does GUI support multi-GPU configuration?

Note: The GUI's multi-GPU controls have not been tested.


GUI + Long Training

Status varies by machine. For multi-day runs, prefer CLI logging to --log-file and view metrics separately.
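For such runs, a CLI invocation along these lines keeps metrics on disk; --log-file is the flag referenced above and the path is illustrative:

aios hrm-hf train-actv1 `
  --model gpt2 `
  --dataset-file training_data/curated_datasets/test_sample.txt `
  --log-file logs/long_run.log `
  --steps 10000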


🧪 Testing Recommendations

High Priority Tests:

  1. DDP Verification (3 tests)
     - DDP + basic training
     - DDP + memory optimizations
     - DDP + MoE

  2. DeepSpeed Verification (3 tests)
     - ZeRO-1 basic
     - ZeRO-2 with AMP
     - ZeRO-3 maximum reduction

  3. Chunking Verification (3 tests)
     - Verify implementation exists
     - Test 8K context
     - Test 16K context

  4. Tokenizer Testing (7 tests)
     - Test each "supported" tokenizer

  5. MoE Combinations (3 tests)
     - MoE + memory opts
     - MoE + multi-GPU
     - MoE + long context

Medium Priority Tests:

  1. Long Context (3 tests)
     - 2K, 4K, 8K without chunking
     - Measure actual limits

  2. Dataset Formats (3 tests)
     - Large CSV
     - Large archive
     - Many small files

  3. Feature Interactions (5 tests)
     - All memory opts combined
     - Multi-GPU + all opts
     - MoE + all opts

Low Priority Tests:

  1. GUI (3 tests)
     - Long training responsiveness
     - Multi-GPU controls
     - All panels working

  2. Edge Cases (5 tests)
     - Very small models
     - Very large models
     - Very long contexts
     - Very large batches
     - Very small batches

📋 Compatibility Matrix

Quick Reference Table

| Feature 1 | Feature 2 | Status | Notes |
| --- | --- | --- | --- |
| Gradient Checkpointing | AMP | ✅ Verified | ~60–70% memory reduction |
| Gradient Checkpointing | 8-bit Optimizer | ✅ Supported | Requires bitsandbytes + CUDA |
| AMP | 8-bit Optimizer | ✅ Supported | Common combo |
| All Memory Opts | Combined | ⚠️ Partial | Chunking + AMP + Checkpointing + 8-bit supported; tune chunk size |
| DDP (Linux) | Gradient Checkpointing | ✅ Supported | Use --ddp + --world-size |
| DDP (Linux) | AMP | ✅ Supported | |
| DDP (Linux) | 8-bit Optimizer | ❓ Unknown | May conflict with BnB; test on your setup |
| Parallel-Independent (Windows) | Chunking | ✅ Supported | Windows-friendly multi-GPU |
| DeepSpeed (Linux) | Gradient Checkpointing | ✅ Supported | Requires DeepSpeed install |
| DeepSpeed (Linux) | AMP | ✅ Supported | |
| DeepSpeed (Linux) | 8-bit Optimizer | ❓ Unknown | DeepSpeed optimizer mgmt may conflict |
| MoE | Memory Opts | ✅ Supported | Start conservative: k=2, capacity 1.25 |
| MoE | DDP/DeepSpeed | ❓ Needs Verify | Routing/load-balance interactions |
| Chunking | Long Context | ✅ Supported | Use 1024–2048 chunk sizes |
| FlashAttention (Linux) | AMP | ✅ Supported | When FA2 installed; falls back to SDPA otherwise |
| FlashAttention (Windows) | Any | ⚠️ Platform | Often unavailable; rely on SDPA + window-size |

🎯 Action Items

Immediate (Week 1):

  1. ✅ Document all known combinations
  2. ⏳ Verify DDP implementation
  3. ⏳ Verify DeepSpeed implementation
  4. ⏳ Verify chunking implementation

Short-term (Week 2-3):

  1. Test all memory optimization combinations
  2. Test DDP with various configurations
  3. Test DeepSpeed stages
  4. Test tokenizers

Medium-term (Week 4-6):

  1. Test MoE combinations
  2. Test long context scenarios
  3. Test dataset formats
  4. Create automated combination tests

Long-term (Month 2+):

  1. Create CI/CD for combination testing
  2. Add performance benchmarks
  3. Document optimal combinations for different use cases
  4. Create combination recommendation tool


Matrix Version: 1.0
Last Updated: October 18, 2025
Maintained By: Testing Team

Status: 🔄 In Progress - Many combinations need verification