Feature Combination Matrix¶
Last Updated: December 12, 2025
Purpose: Feature compatibility reference - which combinations are verified and which are experimental
Note for v1.0.0: This matrix documents the current testing status of feature combinations. Items marked "EXPERIMENTAL" or carrying TODO notes may work but have not been comprehensively tested; use them with appropriate caution.
Status Legend¶
| Status | Meaning |
|---|---|
| ✅ VERIFIED | Tested and confirmed working |
| ⚠️ EXPERIMENTAL | Should work but not comprehensively tested |
| ❌ INCOMPATIBLE | Known to be incompatible |
| ❓ UNTESTED | Status unclear, use with caution |
| 🚧 PARTIAL | Partially works with known limitations |
Memory Optimization Combinations¶
Gradient Checkpointing + AMP¶
Status: ✅ VERIFIED WORKING
Benefit: ~60-70% memory reduction
Speed Impact: ~20% slower
Recommended: Yes, for most training
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--steps 100
Test Results:
- ✅ Trains successfully
- ✅ Memory reduction confirmed
- ✅ No quality loss observed
- ✅ Works on single GPU
- ⚠️ Multi-GPU not tested
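As a point of reference, the sketch below shows how gradient checkpointing and AMP compose in plain PyTorch. This is illustrative only, not the project's trainer code; the toy `blocks`/`head` modules are hypothetical stand-ins for a real model.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical toy model standing in for a transformer block stack.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(4)]
).cuda()
head = nn.Linear(256, 10).cuda()
optimizer = torch.optim.AdamW(
    list(blocks.parameters()) + list(head.parameters()), lr=1e-4
)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients from underflowing

x = torch.randn(8, 256, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):  # AMP forward
    h = x
    for block in blocks:
        # Checkpointing discards activations here and recomputes them during
        # backward: the memory/speed trade-off quoted above.
        h = checkpoint(block, h, use_reentrant=False)
    loss = nn.functional.cross_entropy(head(h), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```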
Gradient Checkpointing + AMP + 8-bit Optimizer¶
Status: ✅ VERIFIED WORKING
Benefit: ~70-80% memory reduction
Speed Impact: ~25% slower
Recommended: Yes, for large models (>100M params)
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Test Results:
- ✅ Trains successfully
- ✅ Massive memory reduction
- ✅ Quality maintained
- ✅ Works with bitsandbytes 0.48.1
- ⚠️ Multi-GPU not tested
Requirements:
- bitsandbytes installed
- CUDA-capable GPU (Linux preferred)
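For reference, the underlying technique is a drop-in optimizer swap via bitsandbytes. A minimal sketch (not the project's trainer code), assuming `transformers` and `bitsandbytes` are installed:

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
# AdamW8bit stores optimizer state (momentum/variance) in 8 bits, cutting
# optimizer memory roughly 4x versus fp32 Adam state; weights and gradients
# keep their usual precision.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
```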
Gradient Checkpointing + Long Context¶
Status: ⚠️ EXPERIMENTAL
Expected: Should work
Use Case: Train with longer sequences on limited VRAM
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--max-seq-len 2048 `
--batch-size 1 `
--steps 100
Expected Behavior:
- ✅ Should enable 2K-4K context on 11GB GPU
- ⚠️ Will be slower due to checkpointing
- ⚠️ Batch size must be very small
Note: Not extensively tested with contexts above 2048 tokens. Start with smaller contexts and increase gradually.
All Memory Optimizations Combined¶
Status: 🚧 PARTIAL
Features: Gradient Checkpointing + AMP + 8-bit + Chunking
Expected: Maximum memory efficiency
Use Case: Train very large models or very long contexts
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--use-chunked-training --chunk-size 1024 `
--max-seq-len 8192 `
--batch-size 1 `
--steps 100
Notes:
- ✅ Chunked training is implemented (--use-chunked-training, --chunk-size)
- ⚠️ Expect slower throughput at very small chunk sizes
TODO:
1. Test with various chunk sizes
2. Measure actual memory usage
Multi-GPU Combinations¶
DDP + Gradient Checkpointing¶
Status: ⚠️ EXPERIMENTAL
Expected: Fast distributed training with memory efficiency
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--gradient-checkpointing `
--steps 100
Issues:
- ❓ DDP implementation not verified
- ❓ Does the _maybe_spawn function exist?
- ❓ Gradient sync working?
Windows tip: Prefer --parallel-independent instead of DDP.
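For context, this is the standard PyTorch recipe that a flag like `--ddp` would typically wrap; a hedged sketch only, since the project's own DDP wiring is unverified:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

# Launched as one process per GPU, e.g. with `torchrun --nproc_per_node=2 ...`.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("gpt2").to(local_rank)
model.gradient_checkpointing_enable()        # HF models expose this directly
model = DDP(model, device_ids=[local_rank])  # gradients all-reduced in backward
```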
DDP + AMP¶
Status: ⚠️ EXPERIMENTAL
Expected: Fast training with mixed precision across GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--amp `
--steps 100
Note: Whether AMP works correctly with DDP has not been verified.
DDP + All Memory Optimizations¶
Status: ⚠️ EXPERIMENTAL
Expected: Maximum efficiency across multiple GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Questions:
- Does the 8-bit optimizer work with DDP?
- Are optimizer states synchronized?
- Is there communication overhead?
TODO: Comprehensive multi-GPU testing
DeepSpeed Combinations¶
DeepSpeed ZeRO-1 + Gradient Checkpointing¶
Status: ⚠️ EXPERIMENTAL
Expected: Optimizer state partitioning + activation checkpointing
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero1 `
--gradient-checkpointing `
--cuda-ids "0,1" `
--steps 100
TODO:
1. Verify DeepSpeed is actually initialized
2. Test ZeRO-1 stage
3. Measure memory reduction
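For reference, a standard DeepSpeed initialization looks like the sketch below (an illustrative assumption, not verified project code); whether the `--zero-stage` flag ends up in an equivalent call is exactly what the TODO above should confirm:

```python
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},  # ZeRO-1: partition optimizer state
}
# deepspeed.initialize returns an engine that owns the optimizer and handles
# backward/step; activation checkpointing is enabled on the model as usual.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```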
DeepSpeed ZeRO-2 + AMP¶
Status: ⚠️ EXPERIMENTAL
Expected: Gradient partitioning + mixed precision
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero2 `
--amp `
--cuda-ids "0,1" `
--steps 100
Note: Not extensively tested; throughput and memory impact still need to be measured.
DeepSpeed ZeRO-3 (Maximum Memory Reduction)¶
Status: ⚠️ EXPERIMENTAL
Expected: Parameter partitioning for massive models
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero3 `
--gradient-checkpointing `
--amp `
--cuda-ids "0,1" `
--steps 100
Note: The ZeRO-3 stage has not been extensively tested.
DeepSpeed + 8-bit Optimizer¶
Status: ❓ COMPATIBILITY UNKNOWN
Question: Can DeepSpeed work with bitsandbytes?
Example (Compat unknown):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--zero-stage zero2 `
--use-8bit-optimizer `
--cuda-ids "0,1" `
--steps 100
Potential Issue: DeepSpeed manages its own optimizer state and may conflict with bitsandbytes.
Note: This compatibility has not been tested.
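One plausible way to attempt the combination is to hand DeepSpeed a client-provided bitsandbytes optimizer, as sketched below. This is an untested assumption: ZeRO's state partitioning may not preserve the 8-bit optimizer state correctly.

```python
import bitsandbytes as bnb
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
client_optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
# DeepSpeed accepts an external optimizer and wraps it for ZeRO-2, but whether
# the wrapped 8-bit state behaves correctly under partitioning is unverified.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,
    config={
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 2},
    },
)
```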
MoE / Dynamic Subbrains Combinations¶
MoE + Gradient Checkpointing¶
Status: ❓ UNTESTED
Expected: Should work
Use Case: Train models with experts efficiently
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 4 `
--gradient-checkpointing `
--steps 100
Note: MoE with checkpointing has not been extensively tested.
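For orientation, top-k routing is the core mechanism that `--use-moe`/`--num-experts` presumably configure. A generic sketch of the technique follows; the project's actual router is not shown here:

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Generic top-k gating: each token picks its k highest-scoring experts."""

    def __init__(self, hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden) -> per-token routing weights and expert indices
        scores = self.gate(x).softmax(dim=-1)           # (tokens, num_experts)
        weights, experts = scores.topk(self.k, dim=-1)  # (tokens, k) each
        return weights, experts

router = TopKRouter(hidden=256, num_experts=4)
w, idx = router(torch.randn(10, 256))
```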
MoE + AMP + 8-bit¶
Status: ❓ UNTESTED
Expected: Memory-efficient expert training
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 8 `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Note: Expert training with these optimizations has not been extensively tested.
Expert Training + Memory Optimizations¶
Status: ❓ UNTESTED
Expected: Efficient single expert training
Example:
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/base_model `
--dataset-file training_data/curated_datasets/test_sample.txt `
--expert-id "python_expert" `
--gradient-checkpointing `
--amp `
--use-8bit-optimizer `
--steps 100
Note: The expert-only training mode has not been extensively tested.
MoE + Multi-GPU¶
Status: ⚠️ EXPERIMENTAL
Expected: Expert parallelism across GPUs
Example (Linux recommended):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 8 `
--ddp `
--cuda-ids "0,1" `
--world-size 2 `
--steps 100
Questions:
- How are experts distributed across GPUs?
- Is expert selection synchronized?
- What's the communication pattern?
Note: Expert parallelism has not been extensively tested or documented.
MoE + DeepSpeed¶
Status: ⚠️ EXPERIMENTAL
Expected: Expert partitioning with ZeRO
Example (Linux + DeepSpeed):
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--use-moe `
--num-experts 16 `
--zero-stage zero3 `
--cuda-ids "0,1" `
--steps 100
Note: DeepSpeed with MoE has not been extensively tested.
Context Length Combinations¶
Long Context + Chunking¶
Status: ✅ SUPPORTED
Expected: Enable 10K+ contexts by chunking
Example:
aios hrm-hf train-actv1 `
--model gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--max-seq-len 10000 `
--use-chunked-training `
--chunk-size 1024 `
--gradient-checkpointing `
--amp `
--steps 100
Questions:
- How does chunking split sequences?
- What's the memory impact?
TODO:
1. Test with various context lengths: 8K, 16K, 32K
2. Measure actual memory usage
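As background, the sketch below shows one generic way chunked training can work: split a long sequence into fixed-size pieces, backprop each piece separately, and scale the losses so the accumulated gradient matches a full-sequence average. The project's actual --use-chunked-training logic may differ (for example, it may carry state across chunks).

```python
import torch
import torch.nn.functional as F

def chunked_backward(model, input_ids, chunk_size=1024):
    """input_ids: (batch, seq_len) token tensor for a causal LM."""
    n_targets = input_ids.size(1) - 1            # total next-token predictions
    for start in range(0, n_targets, chunk_size):
        piece = input_ids[:, start : start + chunk_size + 1]
        logits = model(piece[:, :-1]).logits     # (batch, piece_len-1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), piece[:, 1:].reshape(-1)
        )
        # Weight by this chunk's share of targets, then free its graph.
        (loss * (piece.size(1) - 1) / n_targets).backward()
```

Because each chunk's forward pass is independent, peak activation memory is bounded by the chunk size rather than the full sequence length.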
Long Context + Multi-GPU¶
Status: ❓ UNTESTED
Expected: Distribute long sequences across GPUs
Command:
aios hrm-hf train-actv1 \
--model gpt2 \
--dataset-file data.txt \
--max-seq-len 8192 \
--ddp \
--cuda-ids "0,1" \
--world-size 2 \
--gradient-checkpointing \
--batch-size 1 \
--steps 1000
Note: Long context with DDP has not been extensively tested.
FlashAttention + Memory/Chunking¶
Status: ⚠️ PLATFORM-DEPENDENT
Notes:
- --use-flash-attn is supported by the CLI and will enable FA2 when installed and compatible (Ampere+).
- On Windows, FA2 is commonly unavailable; training falls back to PyTorch SDPA.
- Combine with --window-size for extreme contexts when FA2 is not available.
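To see what the SDPA fallback amounts to, here is the PyTorch call that stands in for FlashAttention-2 when it is unavailable (SDPA itself picks the fastest kernel it can):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) in fp16 on GPU; SDPA dispatches to a fused
# flash-style kernel when one is available, else to an efficient fallback.
q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```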
Tokenizer Combinations¶
Custom Tokenizer + Training¶
Status: ❓ UNTESTED (except GPT-2)
Expected: Should work with any HuggingFace tokenizer
Verified:
- ✅ GPT-2 tokenizer
Needs Testing:
- ⚠️ Qwen 2.5
- ⚠️ Mistral
- ⚠️ Code Llama
- ⚠️ DeepSeek-Coder V2
- ⚠️ StarCoder2
- ⚠️ Phi-3
- ⚠️ Llama 3 (requires HF auth)
Note: Each tokenizer still needs at least a basic training smoke test.
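A quick pre-flight smoke test could look like the sketch below; the Hub model names are illustrative examples only, and some tokenizers (e.g. Llama 3) require authentication:

```python
from transformers import AutoTokenizer

# Illustrative names only; substitute the tokenizer you intend to train with.
for name in ["gpt2", "Qwen/Qwen2.5-0.5B", "bigcode/starcoder2-3b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("def hello():\n    print('hi')")["input_ids"]
    print(f"{name}: vocab={len(tok)} tokens={len(ids)}")
```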
Large Vocabulary + Memory Optimizations¶
Status: ❓ UNTESTED
Use Case: Tokenizers with 100K+ tokens (DeepSeek, Qwen, Llama 3)
Command:
aios hrm-hf train-actv1 \
--model "deepseek-ai/deepseek-coder-v2-base" \
--dataset-file data.txt \
--gradient-checkpointing \
--amp \
--use-8bit-optimizer \
--steps 1000
Considerations:
- Large vocabulary = larger embedding layer
- More memory needed for embeddings
- May need aggressive optimizations
Note: Large-vocab tokenizers have not been extensively tested.
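A back-of-envelope estimate makes the embedding cost concrete; assuming, for illustration, a 100K-token vocabulary and hidden size 4096 in fp16:

```python
vocab, hidden, bytes_per_param = 100_000, 4096, 2  # fp16
embed_gib = vocab * hidden * bytes_per_param / 1024**3
print(f"~{embed_gib:.2f} GiB for the input embedding alone")  # ~0.76 GiB
# A tied output head avoids paying this twice; optimizer state multiplies it
# further unless an 8-bit optimizer is used.
```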
Dataset Format Combinations¶
Streaming Dataset + Linear Mode¶
Status: ✅ SUPPORTED
Features:
- Linear progression with resume via --dataset-start-offset
- Iterate mode for long-running cycles via --iterate
- ✅ Infinite streaming
- ✅ Shuffle support
- ✅ Caching
- ✅ Memory-efficient
Large Dataset + Multi-GPU¶
Status: ❓ UNTESTED
Expected: Distributed dataset loading
Command:
aios hrm-hf train-actv1 \
--model gpt2 \
--dataset-file large_dataset.txt \
--ddp \
--cuda-ids "0,1" \
--world-size 2 \
--steps 10000
Questions:
- Is the dataset split across workers?
- Is shuffling consistent?
- What's the I/O pattern?
Note: Multi-GB datasets have not been extensively tested.
Archive Dataset + Training¶
Status: ⚠️ PARTIALLY TESTED
Supported Formats: .tar, .tar.gz, .tar.bz2, .zip
Known Issues:
- ⚠️ Large archives may hang (BUG-002)
- ⚠️ Many small files may be slow
Note: Archive loading performance has not been extensively tested.
GUI Feature Combinations¶
GUI + Background Training¶
Status: ❓ UNTESTED
Expected: GUI should remain responsive during training
Note: GUI responsiveness during training has not been extensively tested.
GUI + Multi-GPU¶
Status: ⚠️ EXPERIMENTAL
Question: Does GUI support multi-GPU configuration?
Note: GUI multi-GPU controls have not been extensively tested.
GUI + Long Training¶
Status varies by machine. For multi-day runs, prefer CLI logging to --log-file and view metrics separately.
Testing Recommendations¶
High Priority Tests:¶
- DDP Verification (3 tests)
  - DDP + basic training
  - DDP + memory optimizations
  - DDP + MoE
- DeepSpeed Verification (3 tests)
  - ZeRO-1 basic
  - ZeRO-2 with AMP
  - ZeRO-3 maximum reduction
- Chunking Verification (3 tests)
  - Verify implementation exists
  - Test 8K context
  - Test 16K context
- Tokenizer Testing (7 tests)
  - Test each "supported" tokenizer
- MoE Combinations (3 tests)
  - MoE + memory opts
  - MoE + multi-GPU
  - MoE + long context
Medium Priority Tests:¶
- Long Context (3 tests)
  - 2K, 4K, 8K without chunking
  - Measure actual limits
- Dataset Formats (3 tests)
  - Large CSV
  - Large archive
  - Many small files
- Feature Interactions (5 tests)
  - All memory opts combined
  - Multi-GPU + all opts
  - MoE + all opts
Low Priority Tests:¶
- GUI (3 tests)
  - Long training responsiveness
  - Multi-GPU controls
  - All panels working
- Edge Cases (5 tests)
  - Very small models
  - Very large models
  - Very long contexts
  - Very large batches
  - Very small batches
Compatibility Matrix¶
Quick Reference Table¶
| Feature 1 | Feature 2 | Status | Notes |
|---|---|---|---|
| Gradient Checkpointing | AMP | ✅ Verified | ~60-70% memory reduction |
| Gradient Checkpointing | 8-bit Optimizer | ✅ Supported | Requires bitsandbytes + CUDA |
| AMP | 8-bit Optimizer | ✅ Supported | Common combo |
| All Memory Opts | Combined | 🚧 Partial | Chunking + AMP + Checkpointing + 8-bit supported; tune chunk size |
| DDP (Linux) | Gradient Checkpointing | ✅ Supported | Use --ddp + --world-size |
| DDP (Linux) | AMP | ✅ Supported | |
| DDP (Linux) | 8-bit Optimizer | ❓ Unknown | May conflict with BnB; test on your setup |
| Parallel-Independent (Windows) | Chunking | ✅ Supported | Windows-friendly multi-GPU |
| DeepSpeed (Linux) | Gradient Checkpointing | ✅ Supported | Requires DeepSpeed install |
| DeepSpeed (Linux) | AMP | ✅ Supported | |
| DeepSpeed (Linux) | 8-bit Optimizer | ❓ Unknown | DeepSpeed optimizer mgmt may conflict |
| MoE | Memory Opts | ✅ Supported | Start conservative: k=2, capacity 1.25 |
| MoE | DDP/DeepSpeed | ❓ Needs Verify | Routing/load-balance interactions |
| Chunking | Long Context | ✅ Supported | Use 1024-2048 chunk sizes |
| FlashAttention (Linux) | AMP | ✅ Supported | When FA2 installed; falls back to SDPA otherwise |
| FlashAttention (Windows) | Any | ⚠️ Platform | Often unavailable; rely on SDPA + window-size |
Action Items¶
Immediate (Week 1):¶
- ✅ Document all known combinations
- ⏳ Verify DDP implementation
- ⏳ Verify DeepSpeed implementation
- ⏳ Verify chunking implementation
Short-term (Week 2-3):¶
- Test all memory optimization combinations
- Test DDP with various configurations
- Test DeepSpeed stages
- Test tokenizers
Medium-term (Week 4-6):¶
- Test MoE combinations
- Test long context scenarios
- Test dataset formats
- Create automated combination tests
Long-term (Month 2+):¶
- Create CI/CD for combination testing
- Add performance benchmarks
- Document optimal combinations for different use cases
- Create combination recommendation tool
Related Documents¶
- COMPLETE_FEATURE_INDEX.md - Complete feature list
- FLASH_ATTENTION.md • FLASH_ATTENTION_VS_CHUNKING.md
- PARALLEL_TRAINING_BLOCK_CHUNK_SYSTEM.md
- MULTI_GPU_DISTRIBUTED.md
- LORA_PEFT.md
- DYNAMIC_SUBBRAINS_MOE.md
Matrix Version: 1.0
Last Updated: October 18, 2025
Maintained By: Testing Team
Status: In Progress - Many combinations need verification