Feature Combination Matrix¶
Last Updated: December 12, 2025
Purpose: Feature compatibility reference - which combinations are verified and which are experimental
Note for v1.0.0: This matrix documents the current testing status of feature combinations. Items marked "EXPERIMENTAL" or carrying TODO notes may work but have not been comprehensively tested; use them with appropriate caution.
Status Legend¶
| Status | Meaning |
|---|---|
| ✅ VERIFIED | Tested and confirmed working |
| ⚠️ EXPERIMENTAL | Should work but not comprehensively tested |
| ❌ INCOMPATIBLE | Known to be incompatible |
| ❓ UNTESTED | Status unclear, use with caution |
| 🚧 PARTIAL | Partially works with known limitations |
Memory Optimization Combinations¶
Gradient Checkpointing + AMP¶
Status: ✅ VERIFIED WORKING
Benefit: ~60-70% memory reduction
Speed Impact: ~20% slower
Recommended: Yes, for most training
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--steps 100
Test Results:
- ✅ Trains successfully
- ✅ Memory reduction confirmed
- ✅ No quality loss observed
- ✅ Works on single GPU
- ⚠️ Multi-GPU not tested
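As a point of reference, the sketch below shows how gradient checkpointing and AMP compose in plain PyTorch. This is illustrative only, not the project's trainer code; the toy `blocks`/`head` modules are hypothetical stand-ins for a real model.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical toy model standing in for a transformer block stack.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(4)]
).cuda()
head = nn.Linear(256, 10).cuda()
optimizer = torch.optim.AdamW(
    list(blocks.parameters()) + list(head.parameters()), lr=1e-4
)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients from underflowing

x = torch.randn(8, 256, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):  # AMP forward
    h = x
    for block in blocks:
        # Checkpointing discards activations here and recomputes them during
        # backward: the memory/speed trade-off quoted above.
        h = checkpoint(block, h, use_reentrant=False)
    loss = nn.functional.cross_entropy(head(h), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```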
Gradient Checkpointing + AMP + 8-bit Optimizer¶
Status: ✅ VERIFIED WORKING
Benefit: ~70-80% memory reduction
Speed Impact: ~25% slower
Recommended: Yes, for large models (>100M params)
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Test Results:
- ✅ Trains successfully
- ✅ Massive memory reduction
- ✅ Quality maintained
- ✅ Works with bitsandbytes 0.48.1
- ⚠️ Multi-GPU not tested
Requirements:
- bitsandbytes installed
- CUDA-capable GPU (Linux preferred)
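For reference, the underlying technique is a drop-in optimizer swap via bitsandbytes. A minimal sketch (not the project's trainer code), assuming `transformers` and `bitsandbytes` are installed:

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
# AdamW8bit stores optimizer state (momentum/variance) in 8 bits, cutting
# optimizer memory roughly 4x versus fp32 Adam state; weights and gradients
# keep their usual precision.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
```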
Gradient Checkpointing + Long Context¶
Status: ⚠️ EXPERIMENTAL
Expected: Should work
Use Case: Train with longer sequences on limited VRAM
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--max-seq-len 2048 `
--batch-size 1 `
--steps 100
Expected Behavior:
- ✅ Should enable 2K-4K context on 11GB GPU
- ⚠️ Will be slower due to checkpointing
- ⚠️ Batch size must be very small
Note: Not extensively tested with contexts above 2048 tokens. Start with smaller contexts and increase gradually.
All Memory Optimizations Combined¶
Status: 🚧 PARTIAL
Features: Gradient Checkpointing + AMP + 8-bit + Chunking
Expected: Maximum memory efficiency
Use Case: Train very large models or very long contexts
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--use-chunked-training --chunk-size 1024 `
--max-seq-len 8192 `
--batch-size 1 `
--steps 100
Notes:
- ✅ Chunked training is implemented (--use-chunked-training, --chunk-size)
- ⚠️ Expect slower throughput at very small chunk sizes
TODO:
1. Test with various chunk sizes
2. Measure actual memory usage
Multi-GPU Combinations¶
DDP + Gradient Checkpointing¶
Status: ⚠️ EXPERIMENTAL
Expected: Fast distributed training with memory efficiency
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--gradient-checkpointing `
--steps 100
Issues:
- ❓ DDP implementation not verified
- ❓ Does the _maybe_spawn function exist?
- ❓ Gradient sync working?
Windows tip: Prefer --parallel-independent instead of DDP.
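For context, this is the standard PyTorch recipe that a flag like `--ddp` would typically wrap; a hedged sketch only, since the project's own DDP wiring is unverified:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

# Launched as one process per GPU, e.g. with `torchrun --nproc_per_node=2 ...`.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("gpt2").to(local_rank)
model.gradient_checkpointing_enable()        # HF models expose this directly
model = DDP(model, device_ids=[local_rank])  # gradients all-reduced in backward
```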
DDP + AMP¶
Status: ⚠️ EXPERIMENTAL
Expected: Fast training with mixed precision across GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--amp `
--steps 100
Note: Whether AMP works correctly with DDP has not been verified.
DDP + All Memory Optimizations¶
Status: ⚠️ EXPERIMENTAL
Expected: Maximum efficiency across multiple GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Questions:
- Does the 8-bit optimizer work with DDP?
- Are optimizer states synchronized?
- Is there communication overhead?
TODO: Comprehensive multi-GPU testing
DeepSpeed Combinations¶
DeepSpeed ZeRO-1 + Gradient Checkpointing¶
Status: ⚠️ EXPERIMENTAL
Expected: Optimizer state partitioning + activation checkpointing
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero1 `
--gradient-checkpointing `
--cuda-ids "0,1" `
--steps 100
TODO:
1. Verify DeepSpeed is actually initialized
2. Test ZeRO-1 stage
3. Measure memory reduction
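For reference, a standard DeepSpeed initialization looks like the sketch below (an illustrative assumption, not verified project code); whether the `--zero-stage` flag ends up in an equivalent call is exactly what the TODO above should confirm:

```python
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},  # ZeRO-1: partition optimizer state
}
# deepspeed.initialize returns an engine that owns the optimizer and handles
# backward/step; activation checkpointing is enabled on the model as usual.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```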
DeepSpeed ZeRO-2 + AMP¶
Status: ⚠️ EXPERIMENTAL
Expected: Gradient partitioning + mixed precision
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero2 `
--amp `
--cuda-ids "0,1" `
--steps 100
Note: Not extensively tested; throughput and memory impact still need to be measured.
DeepSpeed ZeRO-3 (Maximum Memory Reduction)¶
Status: ⚠️ EXPERIMENTAL
Expected: Parameter partitioning for massive models
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero3 `
--gradient-checkpointing `
--amp `
--cuda-ids "0,1" `
--steps 100
Note: The ZeRO-3 stage has not been extensively tested.
DeepSpeed + 8-bit Optimizer¶
Status: ❓ COMPATIBILITY UNKNOWN
Question: Can DeepSpeed work with bitsandbytes?
Example (Compat unknown):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero2 `
--use-8bit-optimizer `
--cuda-ids "0,1" `
--steps 100
Potential Issue: DeepSpeed manages its own optimizer state and may conflict with bitsandbytes.
Note: This compatibility has not been tested.
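One plausible way to attempt the combination is to hand DeepSpeed a client-provided bitsandbytes optimizer, as sketched below. This is an untested assumption: ZeRO's state partitioning may not preserve the 8-bit optimizer state correctly.

```python
import bitsandbytes as bnb
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
client_optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
# DeepSpeed accepts an external optimizer and wraps it for ZeRO-2, but whether
# the wrapped 8-bit state behaves correctly under partitioning is unverified.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,
    config={
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 2},
    },
)
```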
MoE / Dynamic Subbrains Combinations¶
MoE + Gradient Checkpointing¶
Status: ❓ UNTESTED
Expected: Should work
Use Case: Train models with experts efficiently
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 4 `
--gradient-checkpointing `
--steps 100
Note: MoE with checkpointing has not been extensively tested.
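For orientation, top-k routing is the core mechanism that `--use-moe`/`--num-experts` presumably configure. A generic sketch of the technique follows; the project's actual router is not shown here:

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Generic top-k gating: each token picks its k highest-scoring experts."""

    def __init__(self, hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden) -> per-token routing weights and expert indices
        scores = self.gate(x).softmax(dim=-1)           # (tokens, num_experts)
        weights, experts = scores.topk(self.k, dim=-1)  # (tokens, k) each
        return weights, experts

router = TopKRouter(hidden=256, num_experts=4)
w, idx = router(torch.randn(10, 256))
```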
MoE + AMP + 8-bit¶
Status: ❓ UNTESTED
Expected: Memory-efficient expert training
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 8 `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Note: Expert training with these optimizations has not been extensively tested.
Expert Training + Memory Optimizations¶
Status: ❓ UNTESTED
Expected: Efficient single expert training
Example:
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/base_model `
--dataset-file training_data/curated_datasets/test_sample.txt `
--expert-id "python_expert" `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Note: The expert-only training mode has not been extensively tested.
MoE + Multi-GPU¶
Status: ⚠️ EXPERIMENTAL
Expected: Expert parallelism across GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 8 `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--steps 100
Questions:
- How are experts distributed across GPUs?
- Is expert selection synchronized?
- What's the communication pattern?
Note: Expert parallelism has not been extensively tested or documented.
MoE + DeepSpeed¶
Status: ⚠️ EXPERIMENTAL
Expected: Expert partitioning with ZeRO
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 16 `
--zero-stage zero3 `
--cuda-ids "0,1" `
--steps 100
Note: DeepSpeed with MoE has not been extensively tested.
Context Length Combinations¶
Long Context + Chunking¶
Status: ✅ SUPPORTED
Expected: Enable 10K+ contexts by chunking
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--max-seq-len 10000 `
--use-chunked-training `
--chunk-size 1024 `
--gradient-checkpointing `
--amp `
--steps 100
Questions:
- How does chunking split sequences?
- What's the memory impact?
TODO:
1. Test with various context lengths: 8K, 16K, 32K
2. Measure actual memory usage
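As background, the sketch below shows one generic way chunked training can work: split a long sequence into fixed-size pieces, backprop each piece separately, and scale the losses so the accumulated gradient matches a full-sequence average. The project's actual --use-chunked-training logic may differ (for example, it may carry state across chunks).

```python
import torch
import torch.nn.functional as F

def chunked_backward(model, input_ids, chunk_size=1024):
    """input_ids: (batch, seq_len) token tensor for a causal LM."""
    n_targets = input_ids.size(1) - 1            # total next-token predictions
    for start in range(0, n_targets, chunk_size):
        piece = input_ids[:, start : start + chunk_size + 1]
        logits = model(piece[:, :-1]).logits     # (batch, piece_len-1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), piece[:, 1:].reshape(-1)
        )
        # Weight by this chunk's share of targets, then free its graph.
        (loss * (piece.size(1) - 1) / n_targets).backward()
```

Because each chunk's forward pass is independent, peak activation memory is bounded by the chunk size rather than the full sequence length.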
Long Context + Multi-GPU¶
Status: ❓ UNTESTED
Expected: Distribute long sequences across GPUs
Command:
aios hrm-hf train-actv1 \
--model gpt2 \
--dataset-file data.txt \
--max-seq-len 8192 \
--ddp \
--cuda-ids "0,1" \
--world-size 2 \
--gradient-checkpointing \
--batch-size 1 \
--steps 1000
Note: Long context with DDP has not been extensively tested.
FlashAttention + Memory/Chunking¶
Status: ⚠️ PLATFORM-DEPENDENT
Notes:
- --use-flash-attn is supported by the CLI and will enable FA2 when installed and compatible (Ampere+).
- On Windows, FA2 is commonly unavailable; training falls back to PyTorch SDPA.
- Combine with --window-size for extreme contexts when FA2 is not available.
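To see what the SDPA fallback amounts to, here is the PyTorch call that stands in for FlashAttention-2 when it is unavailable (SDPA itself picks the fastest kernel it can):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) in fp16 on GPU; SDPA dispatches to a fused
# flash-style kernel when one is available, else to an efficient fallback.
q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```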
Tokenizer Combinations¶
Custom Tokenizer + Training¶
Status: ❓ UNTESTED (except GPT-2)
Expected: Should work with any HuggingFace tokenizer
Verified:
- ✅ GPT-2 tokenizer
Needs Testing:
- ⚠️ Qwen 2.5
- ⚠️ Mistral
- ⚠️ Code Llama
- ⚠️ DeepSeek-Coder V2
- ⚠️ StarCoder2
- ⚠️ Phi-3
- ⚠️ Llama 3 (requires HF auth)
Note: Each tokenizer still needs at least a basic training smoke test.
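A quick pre-flight smoke test could look like the sketch below; the Hub model names are illustrative examples only, and some tokenizers (e.g. Llama 3) require authentication:

```python
from transformers import AutoTokenizer

# Illustrative names only; substitute the tokenizer you intend to train with.
for name in ["gpt2", "Qwen/Qwen2.5-0.5B", "bigcode/starcoder2-3b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("def hello():\n    print('hi')")["input_ids"]
    print(f"{name}: vocab={len(tok)} tokens={len(ids)}")
```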
Large Vocabulary + Memory Optimizations¶
Status: ❓ UNTESTED
Use Case: Tokenizers with 100K+ tokens (DeepSeek, Qwen, Llama 3)
Command:
aios hrm-hf train-actv1 \
--model "deepseek-ai/deepseek-coder-v2-base" \
--dataset-file data.txt \
--gradient-checkpointing \
--amp \
--use-8bit-optimizer \
--steps 1000
Considerations:
- Large vocabulary = larger embedding layer
- More memory needed for embeddings
- May need aggressive optimizations
Note: Large-vocab tokenizers have not been extensively tested.
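A back-of-envelope estimate makes the embedding cost concrete; assuming, for illustration, a 100K-token vocabulary and hidden size 4096 in fp16:

```python
vocab, hidden, bytes_per_param = 100_000, 4096, 2  # fp16
embed_gib = vocab * hidden * bytes_per_param / 1024**3
print(f"~{embed_gib:.2f} GiB for the input embedding alone")  # ~0.76 GiB
# A tied output head avoids paying this twice; optimizer state multiplies it
# further unless an 8-bit optimizer is used.
```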
Dataset Format Combinations¶
Streaming Dataset + Linear Mode¶
Status: ✅ SUPPORTED
Features:
- Linear progression with resume via --dataset-start-offset
- Iterate mode for long-running cycles via --iterate
- ✅ Infinite streaming
- ✅ Shuffle support
- ✅ Caching
- ✅ Memory-efficient
Large Dataset + Multi-GPU¶
Status: ❓ UNTESTED
Expected: Distributed dataset loading
Command:
aios hrm-hf train-actv1 \
--model gpt2 \
--dataset-file large_dataset.txt \
--ddp \
--cuda-ids "0,1" \
--world-size 2 \
--steps 10000
Questions:
- Is the dataset split across workers?
- Is shuffling consistent?
- What's the I/O pattern?
Note: Multi-GB datasets have not been extensively tested.
Archive Dataset + Training¶
Status: ⚠️ PARTIALLY TESTED
Supported Formats: .tar, .tar.gz, .tar.bz2, .zip
Known Issues:
- ⚠️ Large archives may hang (BUG-002)
- ⚠️ Many small files may be slow
Note: Archive loading performance has not been extensively tested.
GUI Feature Combinations¶
GUI + Background Training¶
Status: ❓ UNTESTED
Expected: GUI should remain responsive during training
Note: GUI responsiveness during training has not been extensively tested.
GUI + Multi-GPU¶
Status: ⚠️ EXPERIMENTAL
Question: Does GUI support multi-GPU configuration?
Note: GUI multi-GPU controls have not been extensively tested.
GUI + Long Training¶
Status varies by machine. For multi-day runs, prefer CLI logging to --log-file and view metrics separately.
Testing Recommendations¶
High Priority Tests:¶
- DDP Verification (3 tests)
  - DDP + basic training
  - DDP + memory optimizations
  - DDP + MoE
- DeepSpeed Verification (3 tests)
  - ZeRO-1 basic
  - ZeRO-2 with AMP
  - ZeRO-3 maximum reduction
- Chunking Verification (3 tests)
  - Verify implementation exists
  - Test 8K context
  - Test 16K context
- Tokenizer Testing (7 tests)
  - Test each "supported" tokenizer
- MoE Combinations (3 tests)
  - MoE + memory opts
  - MoE + multi-GPU
  - MoE + long context
Medium Priority Tests:¶
- Long Context (3 tests)
  - 2K, 4K, 8K without chunking
  - Measure actual limits
- Dataset Formats (3 tests)
  - Large CSV
  - Large archive
  - Many small files
- Feature Interactions (5 tests)
  - All memory opts combined
  - Multi-GPU + all opts
  - MoE + all opts
Low Priority Tests:¶
- GUI (3 tests)
  - Long training responsiveness
  - Multi-GPU controls
  - All panels working
- Edge Cases (5 tests)
  - Very small models
  - Very large models
  - Very long contexts
  - Very large batches
  - Very small batches
Compatibility Matrix¶
Quick Reference Table¶
| Feature 1 | Feature 2 | Status | Notes |
|---|---|---|---|
| Gradient Checkpointing | AMP | ✅ Verified | ~60-70% memory reduction |
| Gradient Checkpointing | 8-bit Optimizer | ✅ Supported | Requires bitsandbytes + CUDA |
| AMP | 8-bit Optimizer | ✅ Supported | Common combo |
| All Memory Opts | Combined | 🚧 Partial | Chunking + AMP + Checkpointing + 8-bit supported; tune chunk size |
| DDP (Linux) | Gradient Checkpointing | ✅ Supported | Use --ddp + --world-size |
| DDP (Linux) | AMP | ✅ Supported | |
| DDP (Linux) | 8-bit Optimizer | ❓ Unknown | May conflict with BnB; test on your setup |
| Parallel-Independent (Windows) | Chunking | ✅ Supported | Windows-friendly multi-GPU |
| DeepSpeed (Linux) | Gradient Checkpointing | ✅ Supported | Requires DeepSpeed install |
| DeepSpeed (Linux) | AMP | ✅ Supported | |
| DeepSpeed (Linux) | 8-bit Optimizer | ❓ Unknown | DeepSpeed optimizer mgmt may conflict |
| MoE | Memory Opts | ✅ Supported | Start conservative: k=2, capacity 1.25 |
| MoE | DDP/DeepSpeed | ❓ Needs Verify | Routing/load-balance interactions |
| Chunking | Long Context | ✅ Supported | Use 1024-2048 chunk sizes |
| FlashAttention (Linux) | AMP | ✅ Supported | When FA2 installed; falls back to SDPA otherwise |
| FlashAttention (Windows) | Any | ⚠️ Platform | Often unavailable; rely on SDPA + window-size |
Action Items¶
Immediate (Week 1):¶
- ✅ Document all known combinations
- ⏳ Verify DDP implementation
- ⏳ Verify DeepSpeed implementation
- ⏳ Verify chunking implementation
Short-term (Week 2-3):¶
- Test all memory optimization combinations
- Test DDP with various configurations
- Test DeepSpeed stages
- Test tokenizers
Medium-term (Week 4-6):¶
- Test MoE combinations
- Test long context scenarios
- Test dataset formats
- Create automated combination tests
Long-term (Month 2+):¶
- Create CI/CD for combination testing
- Add performance benchmarks
- Document optimal combinations for different use cases
- Create combination recommendation tool
Related Documents¶
- COMPLETE_FEATURE_INDEX.md - Complete feature list
- FLASH_ATTENTION.md • FLASH_ATTENTION_VS_CHUNKING.md
- PARALLEL_TRAINING_BLOCK_CHUNK_SYSTEM.md
- MULTI_GPU_DISTRIBUTED.md
- LORA_PEFT.md
- DYNAMIC_SUBBRAINS_MOE.md
Matrix Version: 1.0
Last Updated: October 18, 2025
Maintained By: Testing Team
Status: In Progress - Many combinations need verification