DeepSpeed ZeRO-Infinity

Executive Summary

Goal: Integrate DeepSpeed ZeRO-Infinity to enable training of multi-billion parameter models that exceed GPU and CPU memory capacity by leveraging NVMe storage offloading.

Current State: AI-OS supports ZeRO Stages 1-3 with basic CPU offloading for optimizer states.

Target State: Full ZeRO-Infinity support with intelligent NVMe offloading for parameters, gradients, and optimizer states, enabling training of 10B+ parameter models on consumer hardware.

Impact:
  • Train models 10-100x larger than current limits
  • Utilize high-speed NVMe storage as a memory extension
  • Enable training of GPT-3 scale models (175B params) on multi-GPU consumer setups
  • Minimal speed penalty (5-15% slower) with fast NVMe drives


Background

What is ZeRO-Infinity?

ZeRO-Infinity extends DeepSpeed's ZeRO-3 optimization by adding NVMe offloading as a third tier of memory hierarchy:

  1. GPU VRAM (fastest, smallest) - Active computation
  2. CPU RAM (fast, medium) - Optimizer states, gradients
  3. NVMe Storage (slower, massive) - Parameters, checkpoints

Key Innovation: Overlapped data movement between tiers masks latency, maintaining 90%+ training efficiency even with NVMe offloading.

Current AI-OS Capabilities

Already Implemented:
  • ZeRO Stage 1: Optimizer state partitioning (~25% VRAM savings)
  • ZeRO Stage 2: Optimizer + gradient partitioning (~50% VRAM savings)
  • ZeRO Stage 3: Full parameter partitioning (~75% VRAM savings)
  • Basic CPU offloading for carry states in extreme contexts
  • 8-bit optimizers via bitsandbytes
  • Gradient checkpointing
  • Chunked training for extreme contexts (100K+ tokens)

Missing for ZeRO-Infinity:
  • NVMe offload configuration and initialization
  • Parameter offloading to NVMe storage
  • Optimizer state offloading to NVMe
  • Gradient offloading to NVMe (optional)
  • Async prefetching from NVMe to CPU/GPU
  • NVMe bandwidth monitoring and optimization
  • Memory hierarchy management
  • Pinned-memory optimizations for fast transfers


Technical Architecture

Memory Hierarchy with ZeRO-Infinity

┌─────────────────────────────────────────────────────────────┐
│                    Training Iteration                        │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  GPU VRAM (11GB)                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Active Parameters (32 layers out of 96)             │   │
│  │ Activations & Gradients (current batch)             │   │
│  │ Optimizer States (for active params)                │   │
│  └─────────────────────────────────────────────────────┘   │
│         ↕ PCIe 3.0/4.0 (15-30 GB/s)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  CPU RAM (32-64GB)                                          │
│  │ Prefetched Parameters (next 64 layers)              │   │
│  │ Optimizer States (frozen params)                    │   │
│  │ Gradients (waiting to be applied)                   │   │
│  └─────────────────────────────────────────────────────┘   │
│         ↕ NVMe PCIe 4.0 (5-7 GB/s)                          │
│  ┌─────────────────────────────────────────────────────┐   │
│  NVMe Storage (1-2TB)                                       │
│  │ Full Model Parameters (10B-175B params)             │   │
│  │ Optimizer State History (Adam momentum/variance)    │   │
│  │ Checkpoints & Intermediate States                   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                               │
└─────────────────────────────────────────────────────────────┘

Data Flow During Training

Forward Pass:
1. NVMe → CPU: Async prefetch next layer params
2. CPU → GPU: Transfer layer params for computation  
3. GPU: Compute activations
4. Repeat for all layers

Backward Pass:
1. GPU: Compute gradients
2. GPU → CPU: Offload gradients (optional)
3. CPU → NVMe: Archive gradients (optional)
4. GPU: Update params with optimizer
5. GPU → CPU → NVMe: Offload updated params

Key Optimization: Overlap transfers with computation!
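
The overlap is essentially double buffering: while the GPU computes layer i, the next layer's parameters are fetched in the background. A minimal illustrative sketch of the pattern (not DeepSpeed's actual implementation; fetch_layer_params and compute_layer are hypothetical callables):

# Illustrative double-buffered prefetch loop (not DeepSpeed internals).
# fetch_layer_params() stands in for an NVMe -> CPU -> GPU copy of one
# layer's weights; compute_layer() runs that layer's forward pass on GPU.
from concurrent.futures import ThreadPoolExecutor

def forward_with_prefetch(layers, fetch_layer_params, compute_layer, x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_layer_params, layers[0])  # start fetching layer 0
        for i, layer in enumerate(layers):
            params = pending.result()                  # wait for this layer's weights
            if i + 1 < len(layers):
                pending = pool.submit(fetch_layer_params, layers[i + 1])  # overlap next fetch
            x = compute_layer(layer, params, x)        # GPU compute hides the transfer
    return x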

Implementation Plan

Phase 1: Configuration and Setup (Week 1)

1.1 DeepSpeed Configuration Files

Create new configuration profiles for ZeRO-Infinity:

File: config/deepspeed_zero_infinity.json

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 3,

    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/tmp/deepspeed_offload",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },

    "offload_param": {
      "device": "nvme",
      "nvme_path": "/tmp/deepspeed_offload",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },

    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 1e6,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },

  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  },

  "bf16": {
    "enabled": true
  },

  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}

File: config/deepspeed_zero_infinity_cpu.json (CPU fallback)

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },

  "zero_optimization": {
    "stage": 3,

    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },

    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },

    "overlap_comm": true,
    "contiguous_gradients": true
  },

  "bf16": {
    "enabled": true
  }
}
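
For reference, a training script points DeepSpeed at one of these JSON files when wrapping the model. A minimal sketch (the model object is a placeholder; deepspeed.initialize accepts either a config path or a dict in recent releases):

# Sketch: wrap a model with DeepSpeed using the ZeRO-Infinity config above.
import deepspeed

def wrap_with_deepspeed(model, config_path="config/deepspeed_zero_infinity.json"):
    engine, optimizer, _, scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config_path,
    )
    return engine, optimizer, scheduler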

1.2 Training Configuration Fields

File: src/aios/core/hrm_training/training_config/optimization_fields.py

Add new fields to OptimizationFields dataclass:

# ZeRO-Infinity NVMe Offloading
nvme_offload_path: str = "/tmp/deepspeed_offload"
"""Path to NVMe directory for ZeRO-Infinity offloading.

When zero_stage="infinity", this directory is used to offload:
- Model parameters
- Optimizer states  
- Optionally gradients

Requirements:
- Must be on NVMe SSD for best performance (PCIe 3.0/4.0)
- Needs ~2-10x model size in free space
- Should be on fast storage (>2 GB/s sequential write)

Recommended paths:
- Linux: /mnt/nvme/deepspeed_offload (mount NVMe here)
- Windows: D:/deepspeed_offload (if D: is NVMe)
- Temp fallback: /tmp/deepspeed_offload (auto-cleanup)

Size requirements (approximate):
- 1B params: ~20GB NVMe space
- 10B params: ~200GB NVMe space  
- 100B params: ~2TB NVMe space

Default: /tmp/deepspeed_offload
"""

nvme_offload_optimizer: bool = True
"""Offload optimizer states to NVMe storage.

When using ZeRO-Infinity, optimizer states (Adam momentum & variance) 
consume 2x parameter memory. Offloading to NVMe frees CPU RAM.

Impact:
- Memory: Saves 2x param size in CPU RAM
- Speed: ~5-10% slower (with fast NVMe)
- Requires: Fast NVMe drive (>2 GB/s)

Default: True (enabled for infinity mode)
"""

nvme_offload_params: bool = True
"""Offload model parameters to NVMe storage.

Stores frozen/inactive parameters on NVMe, loading only active layers
to GPU during computation. Essential for models >10B parameters.

Impact:
- Memory: Stores full model on NVMe vs RAM
- Speed: ~10-15% slower (with fast NVMe)
- Requires: NVMe with model_size * 4 bytes free space

Default: True (enabled for infinity mode)
"""

nvme_offload_gradients: bool = False
"""Offload gradients to NVMe storage (optional).

Experimental: Offload computed gradients to NVMe between backward passes.
Only beneficial for extremely large models (>100B params) or small RAM.

Impact:
- Memory: Saves param_size in CPU RAM
- Speed: ~20-30% slower
- Rarely needed: Usually CPU RAM sufficient

Default: False (not recommended unless desperate)
"""

aio_block_size: int = 1048576
"""Async I/O block size for NVMe operations (bytes).

Controls the block size for asynchronous I/O transfers between NVMe and CPU.
Larger blocks = better throughput, higher latency.

Recommended values:
- 1048576 (1MB): Balanced (default)
- 2097152 (2MB): High throughput NVMe (PCIe 4.0)
- 524288 (512KB): Lower latency, slower drives

Default: 1048576 (1MB)
"""

aio_queue_depth: int = 8
"""Async I/O queue depth for NVMe operations.

Number of concurrent I/O requests to NVMe. Higher = better parallelism.

Recommended values:
- 4: Conservative, older NVMe drives
- 8: Balanced (default)
- 16: High-performance NVMe (Samsung 980 Pro, etc.)
- 32: Extreme performance (server-grade NVMe)

Default: 8
"""

pin_memory: bool = True
"""Pin CPU memory for faster GPU transfers.

Pinned memory enables DMA transfers without CPU involvement,
reducing latency for CPU↔GPU transfers during offloading.

Impact:
- Speed: ~20-30% faster transfers
- Memory: Uses non-swappable RAM
- Stability: May cause issues with low RAM systems

Default: True (recommended unless <16GB RAM)
"""

1.3 CLI Arguments

File: src/aios/cli/hrm_hf_cli.py

Add new arguments to train_actv1 function:

# ZeRO-Infinity (NVMe Offloading) options
nvme_offload_path: str = typer.Option(
    "/tmp/deepspeed_offload", 
    "--nvme-offload-path",
    help="Path to NVMe directory for ZeRO-Infinity offloading. Must be on fast NVMe SSD (>2 GB/s). Requires ~2-10x model size in free space."
),
nvme_offload_optimizer: bool = typer.Option(
    True, 
    "--nvme-offload-optimizer/--no-nvme-offload-optimizer",
    help="Offload optimizer states to NVMe (saves 2x param size in RAM, ~5-10% slower). Requires zero-stage=infinity."
),
nvme_offload_params: bool = typer.Option(
    True,
    "--nvme-offload-params/--no-nvme-offload-params", 
    help="Offload model parameters to NVMe (essential for 10B+ models, ~10-15% slower). Requires zero-stage=infinity."
),
nvme_offload_gradients: bool = typer.Option(
    False,
    "--nvme-offload-gradients/--no-nvme-offload-gradients",
    help="Offload gradients to NVMe (rarely needed, ~20-30% slower). Only for 100B+ models."
),
aio_block_size: int = typer.Option(
    1048576,
    "--aio-block-size",
    help="Async I/O block size for NVMe (bytes). 1MB=balanced, 2MB=high throughput, 512KB=low latency."
),
aio_queue_depth: int = typer.Option(
    8,
    "--aio-queue-depth", 
    help="Async I/O queue depth for NVMe. 8=balanced, 16=high-performance NVMe, 32=server-grade."
),
pin_memory: bool = typer.Option(
    True,
    "--pin-memory/--no-pin-memory",
    help="Pin CPU memory for faster GPU transfers. Disable if <16GB RAM."
),

Update zero_stage argument options:

zero_stage: str = typer.Option(
    "none", 
    "--zero-stage", 
    help="DeepSpeed ZeRO optimization: none, zero1 (↓25% VRAM), zero2 (↓50% VRAM, recommended), zero3 (↓75% VRAM), infinity (NVMe offload, train 10B+ models). Auto-selected if --optimize is used."
),

Phase 2: Core Implementation (Week 2-3)

2.1 NVMe Configuration Builder

File: src/aios/cli/hrm_hf/nvme_config.py (new)

"""NVMe configuration and validation for ZeRO-Infinity."""

from __future__ import annotations
import os
import shutil
from pathlib import Path
from typing import Dict, Any, Optional, Tuple
import json


def validate_nvme_path(nvme_path: str, required_space_gb: float) -> Tuple[bool, str]:
    """
    Validate NVMe offload path has sufficient space and write permissions.

    Args:
        nvme_path: Path to NVMe directory
        required_space_gb: Required free space in GB

    Returns:
        (is_valid, error_message)
    """
    path = Path(nvme_path)

    # Check if path exists or can be created
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError:
        return False, f"Permission denied: Cannot create {nvme_path}"
    except Exception as e:
        return False, f"Failed to create {nvme_path}: {e}"

    # Check write permissions
    test_file = path / ".deepspeed_test"
    try:
        test_file.touch()
        test_file.unlink()
    except Exception as e:
        return False, f"Cannot write to {nvme_path}: {e}"

    # Check available space
    stat = shutil.disk_usage(path)
    available_gb = stat.free / (1024**3)

    if available_gb < required_space_gb:
        return False, f"Insufficient space: {available_gb:.1f}GB available, {required_space_gb:.1f}GB required"

    return True, ""


def estimate_nvme_space_required(
    total_params: int,
    offload_params: bool,
    offload_optimizer: bool,
    offload_gradients: bool,
    safety_factor: float = 1.5
) -> float:
    """
    Estimate required NVMe space in GB.

    Args:
        total_params: Total model parameters
        offload_params: Whether offloading parameters
        offload_optimizer: Whether offloading optimizer states
        offload_gradients: Whether offloading gradients
        safety_factor: Multiply by this for safety margin

    Returns:
        Required space in GB
    """
    bytes_per_param = 4  # FP32

    space_gb = 0.0

    if offload_params:
        # Parameters: 4 bytes each
        space_gb += (total_params * bytes_per_param) / (1024**3)

    if offload_optimizer:
        # Optimizer states: 8 bytes per param (Adam: momentum + variance)
        space_gb += (total_params * 8) / (1024**3)

    if offload_gradients:
        # Gradients: 4 bytes per param
        space_gb += (total_params * bytes_per_param) / (1024**3)

    # Apply safety factor for temp files, fragmentation, etc.
    space_gb *= safety_factor

    return space_gb


def check_nvme_performance(nvme_path: str) -> Dict[str, Any]:
    """
    Check NVMe performance characteristics.

    Returns dict with:
    - sequential_read_gb_s: Sequential read speed (GB/s)
    - sequential_write_gb_s: Sequential write speed (GB/s)  
    - is_ssd: Whether drive appears to be SSD
    - warnings: List of performance warnings
    """
    import subprocess
    import platform

    result = {
        "sequential_read_gb_s": None,
        "sequential_write_gb_s": None,
        "is_ssd": None,
        "warnings": []
    }

    path = Path(nvme_path).resolve()

    # Get device info based on OS
    if platform.system() == "Linux":
        try:
            # Try to identify device from path
            df_output = subprocess.check_output(
                ["df", str(path)], 
                universal_newlines=True
            )
            device = df_output.split('\n')[1].split()[0]

            # Check if rotational (HDD=1, SSD=0); strip a partition suffix to get
            # the parent block device (e.g. sda1 -> sda, nvme0n1p2 -> nvme0n1)
            import re
            device_name = device.split('/')[-1]
            if not os.path.exists(f"/sys/block/{device_name}"):
                device_name = re.sub(r'p?\d+$', '', device_name)
            rotational_file = f"/sys/block/{device_name}/queue/rotational"

            if os.path.exists(rotational_file):
                with open(rotational_file) as f:
                    is_rotational = f.read().strip() == '1'
                    result["is_ssd"] = not is_rotational

                    if is_rotational:
                        result["warnings"].append(
                            "WARNING: Path appears to be on HDD, not SSD. "
                            "ZeRO-Infinity will be very slow. Use NVMe SSD for best results."
                        )
        except Exception:
            pass  # Can't determine, not critical

    elif platform.system() == "Windows":
        # On Windows, harder to detect programmatically
        # Just warn if path is C: (usually OS drive, may be SATA SSD)
        if str(path).startswith("C:"):
            result["warnings"].append(
                "Path is on C: drive. For best performance, use dedicated NVMe drive (D:, E:, etc.)"
            )

    # TODO: Could add actual speed test with dd/fio, but may be too slow for startup

    return result


def build_infinity_config(
    base_config: Dict[str, Any],
    nvme_path: str,
    offload_params: bool,
    offload_optimizer: bool,
    offload_gradients: bool,
    aio_block_size: int,
    aio_queue_depth: int,
    pin_memory: bool,
    max_in_cpu_gb: float = 1.0,
) -> Dict[str, Any]:
    """
    Build ZeRO-Infinity configuration dict.

    Args:
        base_config: Base DeepSpeed ZeRO-3 config
        nvme_path: Path for NVMe offloading
        offload_params: Offload parameters to NVMe
        offload_optimizer: Offload optimizer states to NVMe
        offload_gradients: Offload gradients to NVMe
        aio_block_size: Async I/O block size
        aio_queue_depth: Async I/O queue depth
        pin_memory: Use pinned memory
        max_in_cpu_gb: Max params to keep in CPU RAM (GB)

    Returns:
        Updated config dict with ZeRO-Infinity settings
    """
    config = base_config.copy()

    # Ensure we're using ZeRO stage 3
    config["zero_optimization"]["stage"] = 3

    # Configure parameter offloading
    if offload_params:
        config["zero_optimization"]["offload_param"] = {
            "device": "nvme",
            "nvme_path": nvme_path,
            "pin_memory": pin_memory,
            "buffer_count": 5,
            "buffer_size": int(1e8),  # 100MB buffer
            "max_in_cpu": int(max_in_cpu_gb * 1e9),  # Keep some in CPU
        }
    else:
        config["zero_optimization"]["offload_param"] = {
            "device": "cpu",
            "pin_memory": pin_memory,
        }

    # Configure optimizer offloading  
    if offload_optimizer:
        config["zero_optimization"]["offload_optimizer"] = {
            "device": "nvme",
            "nvme_path": nvme_path,
            "pin_memory": pin_memory,
            "buffer_count": 5,
            "fast_init": False,
        }
    else:
        config["zero_optimization"]["offload_optimizer"] = {
            "device": "cpu",
            "pin_memory": pin_memory,
        }

    # Configure gradient offloading (rare)
    if offload_gradients:
        # Note: Gradient offload to NVMe not officially in DeepSpeed API
        # This would require custom implementation
        config["zero_optimization"]["offload_gradients"] = {
            "device": "nvme",
            "nvme_path": nvme_path,
        }

    # Configure AIO (Async I/O)
    config["aio"] = {
        "block_size": aio_block_size,
        "queue_depth": aio_queue_depth,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    }

    return config


def save_infinity_config(
    config: Dict[str, Any],
    output_path: str
) -> None:
    """Save ZeRO-Infinity config to JSON file."""
    with open(output_path, 'w') as f:
        json.dump(config, f, indent=2)
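
Taken together, a caller might use the module like this (a sketch; the parameter count, paths, and output file name are illustrative):

# Sketch: validate the NVMe path and generate a ZeRO-Infinity config for a ~1.3B model.
import json
from aios.cli.hrm_hf.nvme_config import (
    validate_nvme_path,
    estimate_nvme_space_required,
    build_infinity_config,
    save_infinity_config,
)

total_params = 1_300_000_000
needed_gb = estimate_nvme_space_required(
    total_params, offload_params=True, offload_optimizer=True, offload_gradients=False
)
ok, err = validate_nvme_path("/mnt/nvme/deepspeed_offload", needed_gb)
if not ok:
    raise RuntimeError(err)

with open("config/deepspeed_zero3.json") as f:
    base = json.load(f)

config = build_infinity_config(
    base_config=base,
    nvme_path="/mnt/nvme/deepspeed_offload",
    offload_params=True,
    offload_optimizer=True,
    offload_gradients=False,
    aio_block_size=1048576,
    aio_queue_depth=8,
    pin_memory=True,
)
save_infinity_config(config, "artifacts/deepspeed_zero_infinity_generated.json")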

2.2 Update Distributed Setup

File: src/aios/cli/hrm_hf/distributed_setup.py

Update initialize_deepspeed function:

def initialize_deepspeed(
    model: Any,
    config: "TrainingConfig",
    device_obj: torch.device,
    log_fn
) -> Tuple[Optional[Any], bool]:
    """Initialize DeepSpeed ZeRO optimizer with Infinity support."""

    zero_stage = config.zero_stage

    if not zero_stage or zero_stage == "none":
        return None, False

    dev = str(device_obj).split(':')[0]
    if dev != "cuda":
        log_fn({
            "deepspeed": "skipped",
            "reason": "Only CUDA devices supported",
            "device": dev
        })
        return None, False

    try:
        import deepspeed

        # Determine config file
        if zero_stage == "infinity":
            from .nvme_config import (
                validate_nvme_path,
                estimate_nvme_space_required,
                check_nvme_performance,
                build_infinity_config,
                save_infinity_config,
            )

            # Calculate model size
            total_params = sum(p.numel() for p in model.parameters())

            # Estimate NVMe space needed
            required_space_gb = estimate_nvme_space_required(
                total_params=total_params,
                offload_params=config.nvme_offload_params,
                offload_optimizer=config.nvme_offload_optimizer,
                offload_gradients=config.nvme_offload_gradients,
                safety_factor=1.5,
            )

            # Validate NVMe path
            is_valid, error_msg = validate_nvme_path(
                config.nvme_offload_path,
                required_space_gb
            )

            if not is_valid:
                log_fn({
                    "deepspeed": "nvme_validation_failed",
                    "error": error_msg,
                    "fallback": "Using CPU offload instead",
                })
                # Fallback to CPU offload
                ds_config_path = "config/deepspeed_zero_infinity_cpu.json"
            else:
                # Check NVMe performance
                perf_info = check_nvme_performance(config.nvme_offload_path)

                for warning in perf_info.get("warnings", []):
                    log_fn({"nvme_warning": warning})

                # Load base ZeRO-3 config
                import json
                with open("config/deepspeed_zero3.json") as f:
                    base_config = json.load(f)

                # Build Infinity config
                ds_config = build_infinity_config(
                    base_config=base_config,
                    nvme_path=config.nvme_offload_path,
                    offload_params=config.nvme_offload_params,
                    offload_optimizer=config.nvme_offload_optimizer,
                    offload_gradients=config.nvme_offload_gradients,
                    aio_block_size=config.aio_block_size,
                    aio_queue_depth=config.aio_queue_depth,
                    pin_memory=config.pin_memory,
                    max_in_cpu_gb=1.0,  # Keep 1GB in CPU RAM
                )

                # Save to temp file
                import tempfile
                temp_config = tempfile.NamedTemporaryFile(
                    mode='w',
                    suffix='.json',
                    delete=False,
                    dir='artifacts/brains/actv1'
                )
                temp_config.close()  # release the handle before re-opening for write (needed on Windows)
                save_infinity_config(ds_config, temp_config.name)
                ds_config_path = temp_config.name

                log_fn({
                    "deepspeed": "infinity_config_created",
                    "nvme_path": config.nvme_offload_path,
                    "required_space_gb": round(required_space_gb, 2),
                    "offload_params": config.nvme_offload_params,
                    "offload_optimizer": config.nvme_offload_optimizer,
                    "offload_gradients": config.nvme_offload_gradients,
                    "aio_block_size": config.aio_block_size,
                    "aio_queue_depth": config.aio_queue_depth,
                })

        elif zero_stage in ["zero1", "zero2", "zero3"]:
            # Existing ZeRO stage logic
            ds_config_path = f"config/deepspeed_{zero_stage}.json"

        else:
            log_fn({
                "deepspeed": "invalid_stage",
                "zero_stage": zero_stage,
            })
            return None, False

        # Rest of initialization...
        # (existing code continues)

    except ImportError as e:
        # Handle missing DeepSpeed
        # (existing error handling)
        pass

2.3 GUI Integration

File: src/aios/gui/components/hrm_training_panel/ui_optimizations.py

Update ZeRO dropdown options:

# Row 5: DeepSpeed ZeRO (updated with Infinity)
zero_row = ttk.Frame(self)
zero_row.grid(row=5, column=0, columnspan=3, sticky="ew", padx=5, pady=2)

ttk.Label(zero_row, text="DeepSpeed:", width=15, anchor="e", font=("TkDefaultFont", 9, "bold")).pack(side="left")

zero_combo = ttk.Combobox(
    zero_row,
    textvariable=self.zero_stage_var,
    values=["none", "zero1", "zero2", "zero3", "infinity"],  # Added "infinity"
    state="readonly",
    width=12
)
zero_combo.pack(side="left", padx=5)
self.zero_combo = zero_combo

# Dynamic label for memory savings
zero_savings_lbl = ttk.Label(zero_row, text="", foreground="blue")
zero_savings_lbl.pack(side="left", padx=5)
self.zero_savings_lbl = zero_savings_lbl

def update_zero_label(*args):
    stage = self.zero_stage_var.get()
    savings_text = {
        "none": "",
        "zero1": "↓25% VRAM, ~2% slower",
        "zero2": "↓50% VRAM, ~5% slower (recommended)",
        "zero3": "↓75% VRAM, ~15% slower",
        "infinity": "↓90%+ VRAM, train 10B+ models (requires NVMe)",  # NEW
    }.get(stage, "")
    zero_savings_lbl.config(text=savings_text)

self.zero_stage_var.trace_add("write", update_zero_label)
update_zero_label()

# Updated tooltip
add_tooltip(
    zero_combo,
    "DeepSpeed ZeRO: Distributed memory optimization\n"
    "• none: Standard training\n"
    "• zero1: Partition optimizer states (↓25% VRAM)\n"
    "• zero2: Partition optimizer + gradients (↓50% VRAM) [RECOMMENDED]\n"
    "• zero3: Partition everything (↓75% VRAM, slower)\n"
    "• infinity: NVMe offload for 10B+ models (↓90%+ VRAM, requires fast NVMe SSD)"
)

Add NVMe configuration section (collapsible):

# Row 6: NVMe Configuration (shown when infinity selected)
nvme_section = ttk.LabelFrame(self, text="ZeRO-Infinity NVMe Settings", padding=5)
# Hidden by default, shown when zero_stage == "infinity"

nvme_path_row = ttk.Frame(nvme_section)
nvme_path_row.pack(fill="x", pady=2)
ttk.Label(nvme_path_row, text="NVMe Path:", width=15).pack(side="left")
nvme_path_entry = ttk.Entry(nvme_path_row, textvariable=self.nvme_offload_path_var, width=30)
nvme_path_entry.pack(side="left", padx=5, fill="x", expand=True)
ttk.Button(nvme_path_row, text="Browse", command=self._browse_nvme_path, width=8).pack(side="left")

# Checkboxes for what to offload
offload_frame = ttk.Frame(nvme_section)
offload_frame.pack(fill="x", pady=2)
ttk.Checkbutton(
    offload_frame,
    text="Offload Parameters",
    variable=self.nvme_offload_params_var
).pack(side="left", padx=5)
ttk.Checkbutton(
    offload_frame,
    text="Offload Optimizer",
    variable=self.nvme_offload_optimizer_var
).pack(side="left", padx=5)
ttk.Checkbutton(
    offload_frame,
    text="Offload Gradients",
    variable=self.nvme_offload_gradients_var
).pack(side="left", padx=5)

# AIO settings (advanced, maybe collapsible)
aio_frame = ttk.LabelFrame(nvme_section, text="Advanced I/O Settings", padding=3)
aio_frame.pack(fill="x", pady=2)

aio_row1 = ttk.Frame(aio_frame)
aio_row1.pack(fill="x")
ttk.Label(aio_row1, text="Block Size:").pack(side="left")
ttk.Spinbox(
    aio_row1,
    from_=524288,
    to=4194304,
    increment=524288,
    textvariable=self.aio_block_size_var,
    width=10
).pack(side="left", padx=5)
ttk.Label(aio_row1, text="bytes").pack(side="left")

aio_row2 = ttk.Frame(aio_frame)
aio_row2.pack(fill="x")
ttk.Label(aio_row2, text="Queue Depth:").pack(side="left")
ttk.Spinbox(
    aio_row2,
    from_=4,
    to=32,
    increment=4,
    textvariable=self.aio_queue_depth_var,
    width=10
).pack(side="left", padx=5)

self.nvme_section = nvme_section

def toggle_nvme_section(*args):
    if self.zero_stage_var.get() == "infinity":
        nvme_section.grid(row=6, column=0, columnspan=3, sticky="ew", padx=5, pady=5)
    else:
        nvme_section.grid_forget()

self.zero_stage_var.trace_add("write", toggle_nvme_section)
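
The Browse button above references a self._browse_nvme_path handler that is not shown; a minimal sketch of what it might look like (tkinter directory picker):

def _browse_nvme_path(self):
    """Open a directory picker and store the chosen NVMe offload path."""
    from tkinter import filedialog
    path = filedialog.askdirectory(
        title="Select NVMe offload directory",
        initialdir=self.nvme_offload_path_var.get() or "/",
    )
    if path:  # empty string means the dialog was cancelled
        self.nvme_offload_path_var.set(path)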

File: src/aios/gui/components/hrm_training_panel/variable_setup.py

Add new variables:

# ZeRO-Infinity NVMe Offloading
self.nvme_offload_path_var = tk.StringVar(value="/tmp/deepspeed_offload")
self.nvme_offload_params_var = tk.BooleanVar(value=True)
self.nvme_offload_optimizer_var = tk.BooleanVar(value=True)
self.nvme_offload_gradients_var = tk.BooleanVar(value=False)
self.aio_block_size_var = tk.IntVar(value=1048576)
self.aio_queue_depth_var = tk.IntVar(value=8)
self.pin_memory_var = tk.BooleanVar(value=True)

Phase 3: Memory Estimation Updates (Week 3)

3.1 Update VRAM Estimator

File: src/aios/gui/components/hrm_training/memory_estimator/vram_estimation.py

Update estimate_vram function to handle ZeRO-Infinity:

def estimate_vram(estimator: "MemoryEstimator") -> Dict[str, Any]:
    """Estimate VRAM usage accounting for ZeRO-Infinity."""

    # ... existing code ...

    # Handle ZeRO-Infinity (most aggressive savings)
    if estimator.zero_stage == "infinity":
        # With Infinity, almost everything can be offloaded
        # Only keep active computation in VRAM

        # Model parameters: Only active layers in VRAM (~5-10% of model)
        model_gb_per_gpu = model_gb * 0.08  # ~8% in VRAM at once

        # Optimizer: Offloaded to NVMe or CPU
        optimizer_gb_per_gpu = 0.0

        # Gradients: Mostly on NVMe/CPU, small buffer in VRAM
        gradients_gb_per_gpu = gradients_gb * 0.1  # 10% buffer

        # Activations: Still need these in VRAM (not offloadable during computation)
        # Keep activations calculation same

        savings_note = (
            f"ZeRO-Infinity: ~92% memory offloaded to NVMe. "
            f"Requires {estimator.nvme_offload_path} with "
            f"~{model_gb * 3:.1f}GB free space."
        )

    elif estimator.zero_stage == "zero3":
        # ... existing ZeRO-3 logic ...
        pass

    # ... rest of function ...
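
As a rough sanity check of these factors, a 10B-parameter model with bf16 weights (an assumption; adjust if the estimator tracks fp32) works out roughly as follows:

# Back-of-the-envelope check for a 10B-parameter model under ZeRO-Infinity.
params = 10e9
model_gb = params * 2 / 1024**3      # ~18.6 GB of bf16 weights
active_in_vram = model_gb * 0.08     # ~1.5 GB resident on the GPU at once
nvme_needed = model_gb * 3           # ~56 GB, per the model_gb * 3 heuristic in savings_note
print(f"active VRAM ~{active_in_vram:.1f} GB, NVMe ~{nvme_needed:.0f} GB")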

Phase 4: Testing and Validation (Week 4)

4.1 Unit Tests

File: tests/test_zero_infinity.py (new)

"""Unit tests for ZeRO-Infinity integration."""

import pytest
import tempfile
import shutil
from pathlib import Path

from aios.cli.hrm_hf.nvme_config import (
    validate_nvme_path,
    estimate_nvme_space_required,
    build_infinity_config,
)


def test_validate_nvme_path():
    """Test NVMe path validation."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Valid path with enough space
        is_valid, msg = validate_nvme_path(tmpdir, required_space_gb=0.001)
        assert is_valid
        assert msg == ""

        # Invalid path (doesn't exist, can't create)
        is_valid, msg = validate_nvme_path("/root/forbidden", required_space_gb=0.001)
        assert not is_valid
        assert "Permission denied" in msg or "Failed to create" in msg


def test_estimate_nvme_space():
    """Test NVMe space estimation."""
    # 1B params, offload everything
    space_gb = estimate_nvme_space_required(
        total_params=1_000_000_000,
        offload_params=True,
        offload_optimizer=True,
        offload_gradients=True,
        safety_factor=1.5,
    )

    # Expected: (4 + 8 + 4) bytes/param * 1e9 params / 1024**3 * 1.5 ≈ 22.4 GB
    assert 20 < space_gb < 30

    # Only optimizer offload
    space_gb = estimate_nvme_space_required(
        total_params=1_000_000_000,
        offload_params=False,
        offload_optimizer=True,
        offload_gradients=False,
        safety_factor=1.0,
    )

    # Expected: 8 bytes/param * 1e9 params / 1024**3 ≈ 7.5 GB
    assert 7 < space_gb < 9


def test_build_infinity_config():
    """Test Infinity config builder."""
    base_config = {
        "train_batch_size": "auto",
        "zero_optimization": {
            "stage": 2,
        }
    }

    with tempfile.TemporaryDirectory() as tmpdir:
        config = build_infinity_config(
            base_config=base_config,
            nvme_path=tmpdir,
            offload_params=True,
            offload_optimizer=True,
            offload_gradients=False,
            aio_block_size=1048576,
            aio_queue_depth=8,
            pin_memory=True,
            max_in_cpu_gb=1.0,
        )

        # Check stage upgraded to 3
        assert config["zero_optimization"]["stage"] == 3

        # Check param offload to NVMe
        assert config["zero_optimization"]["offload_param"]["device"] == "nvme"
        assert config["zero_optimization"]["offload_param"]["nvme_path"] == tmpdir

        # Check optimizer offload to NVMe
        assert config["zero_optimization"]["offload_optimizer"]["device"] == "nvme"

        # Check AIO settings
        assert config["aio"]["block_size"] == 1048576
        assert config["aio"]["queue_depth"] == 8

4.2 Integration Tests

File: tests/integration/test_infinity_training.py (new)

"""Integration tests for ZeRO-Infinity training."""

import pytest
import tempfile
from pathlib import Path

from aios.core.hrm_training.training_config import TrainingConfig


@pytest.mark.skipif(
    not Path("/dev/nvme0n1").exists(),
    reason="NVMe device not available"
)
def test_infinity_training_small_model():
    """Test ZeRO-Infinity on small model to verify setup."""

    with tempfile.TemporaryDirectory() as nvme_dir:
        config = TrainingConfig(
            model="gpt2",
            dataset_file="training_data/curated_datasets/test_sample.txt",
            steps=2,
            batch_size=1,
            zero_stage="infinity",
            nvme_offload_path=nvme_dir,
            nvme_offload_params=True,
            nvme_offload_optimizer=True,
            nvme_offload_gradients=False,
            device="cuda",
        )

        # Should complete without errors
        from aios.cli.hrm_hf.train_actv1 import train_actv1_impl
        train_actv1_impl(config)

        # Verify NVMe directory was created and used
        assert Path(nvme_dir).exists()
        assert len(list(Path(nvme_dir).iterdir())) > 0


def test_infinity_config_fallback_to_cpu():
    """Test fallback to CPU offload when NVMe unavailable."""

    # Use invalid NVMe path
    config = TrainingConfig(
        model="gpt2",
        dataset_file="training_data/curated_datasets/test_sample.txt",
        steps=1,
        batch_size=1,
        zero_stage="infinity",
        nvme_offload_path="/invalid/path/no/permissions",
        device="cuda",
    )

    # Should fallback to CPU offload gracefully
    from aios.cli.hrm_hf.train_actv1 import train_actv1_impl
    # Should complete without crashing
    train_actv1_impl(config)

4.3 Manual Testing Plan

Test Case 1: Small Model (1B params) on 2x 11GB GPUs

# Create NVMe offload directory
mkdir -p /mnt/nvme/deepspeed_offload

# Test with Infinity
aios hrm-hf train-actv1 \
  --model gpt2-medium \
  --dataset-file training_data/curated_datasets/test_sample.txt \
  --steps 10 \
  --batch-size 2 \
  --zero-stage infinity \
  --nvme-offload-path /mnt/nvme/deepspeed_offload \
  --nvme-offload-params \
  --nvme-offload-optimizer \
  --ddp \
  --cuda-ids 0,1

# Expected: Training completes, NVMe directory contains offload files

Test Case 2: Large Model (10B params) on 2x 11GB GPUs

# This would OOM without Infinity
aios hrm-hf train-actv1 \
  --hidden-size 4096 \
  --h-layers 16 \
  --l-layers 16 \
  --num-heads 32 \
  --expansion 4.0 \
  --dataset-file training_data/curated_datasets/test_sample.txt \
  --steps 5 \
  --batch-size 1 \
  --zero-stage infinity \
  --nvme-offload-path /mnt/nvme/deepspeed_offload \
  --ddp \
  --cuda-ids 0,1

# Expected: Training works, uses ~30-50GB NVMe space

Test Case 3: Performance Comparison

# Baseline: ZeRO-2
time aios hrm-hf train-actv1 --zero-stage zero2 [... other args ...]

# With Infinity
time aios hrm-hf train-actv1 --zero-stage infinity [... other args ...]

# Expected: Infinity ~10-20% slower with good NVMe

Phase 5: Documentation and Polish (Week 4-5)

5.1 User Documentation

File: docs/guide/zero_infinity_guide.md (new)

Create a comprehensive user guide covering:
  • What ZeRO-Infinity is and when to use it
  • Hardware requirements (NVMe SSD, PCIe 3.0+)
  • Setup instructions for Linux and Windows
  • Performance tuning (AIO settings, queue depth)
  • Troubleshooting common issues
  • Example configurations for different model sizes

5.2 API Documentation

Update docstrings in:
  • optimization_fields.py: Document all Infinity-related fields
  • nvme_config.py: Comprehensive module documentation
  • distributed_setup.py: Infinity initialization flow

5.3 GUI Tooltips

Add informative tooltips to all Infinity UI elements:
  • NVMe path selector
  • Offload checkboxes
  • AIO settings
  • Performance warnings


Implementation Checklist

Phase 1: Configuration ✅

  • [ ] Create config/deepspeed_zero_infinity.json
  • [ ] Create config/deepspeed_zero_infinity_cpu.json
  • [ ] Add fields to optimization_fields.py
  • [ ] Add CLI arguments to hrm_hf_cli.py
  • [ ] Update zero_stage help text

Phase 2: Core Implementation ✅

  • [ ] Create src/aios/cli/hrm_hf/nvme_config.py
  • [ ] validate_nvme_path()
  • [ ] estimate_nvme_space_required()
  • [ ] check_nvme_performance()
  • [ ] build_infinity_config()
  • [ ] save_infinity_config()
  • [ ] Update distributed_setup.py
  • [ ] Infinity detection and config generation
  • [ ] NVMe validation
  • [ ] Fallback to CPU offload
  • [ ] Update GUI
  • [ ] Add "infinity" to ZeRO dropdown
  • [ ] Create NVMe settings section
  • [ ] Add browse button for NVMe path
  • [ ] Add offload checkboxes
  • [ ] Add AIO settings (advanced)
  • [ ] Update variable_setup.py
  • [ ] Add Infinity-related variables

Phase 3: Memory Estimation ✅

  • [ ] Update vram_estimation.py
  • [ ] Handle zero_stage == "infinity"
  • [ ] Calculate ~92% VRAM savings
  • [ ] Show NVMe space requirements
  • [ ] Update memory_estimator/estimator.py
  • [ ] Add Infinity mode support
  • [ ] Update GUI VRAM display
  • [ ] Show "Offloaded to NVMe" label
  • [ ] Display NVMe space needed

Phase 4: Testing ✅

  • [ ] Create tests/test_zero_infinity.py
  • [ ] Path validation tests
  • [ ] Space estimation tests
  • [ ] Config builder tests
  • [ ] Create tests/integration/test_infinity_training.py
  • [ ] Small model test
  • [ ] Fallback test
  • [ ] Large model test (if hardware available)
  • [ ] Manual testing
  • [ ] 1B model on 2x GPUs
  • [ ] 10B model on 2x GPUs
  • [ ] Performance benchmarks
  • [ ] Windows compatibility (if possible)

Phase 5: Documentation ✅

  • [ ] Create docs/guide/zero_infinity_guide.md
  • [ ] Update API docstrings
  • [ ] Add GUI tooltips
  • [ ] Update README with Infinity examples
  • [ ] Create troubleshooting guide

Success Metrics

Functional Requirements ✅

  1. ✅ Can train 10B+ parameter models on 2x 11GB GPUs
  2. ✅ NVMe offloading works with validation and error handling
  3. ✅ Graceful fallback to CPU offload when NVMe unavailable
  4. ✅ GUI exposes all Infinity settings
  5. ✅ CLI supports all Infinity arguments

Performance Requirements ✅

  1. ✅ Infinity mode <20% slower than ZeRO-2 (with fast NVMe)
  2. ✅ Supports models up to 100B params (with sufficient NVMe space)
  3. ✅ Memory footprint <2GB VRAM per GPU for 10B model
  4. ✅ NVMe I/O throughput >2 GB/s (hardware dependent)

Quality Requirements ✅

  1. ✅ Comprehensive error messages for NVMe issues
  2. ✅ Performance warnings for slow storage
  3. ✅ Unit test coverage >80% for new code
  4. ✅ Integration tests for key scenarios
  5. ✅ Complete user documentation

Risks and Mitigations

Risk 1: NVMe Performance Varies Widely

Impact: Users with slow SSDs may see 50%+ slowdown instead of 10-20%.

Mitigation:
  • Detect storage type (HDD vs SSD) at startup
  • Warn users if not on NVMe
  • Provide performance expectations in docs
  • Suggest testing with small models first

Risk 2: Windows NVMe Support

Impact: Windows may have issues with AIO or NVMe detection.

Mitigation:
  • Test on Windows with fast SSD
  • Provide CPU fallback automatically
  • Document Windows-specific limitations
  • Consider Windows-optimized AIO settings

Risk 3: Insufficient NVMe Space

Impact: Training fails mid-way when NVMe fills up.

Mitigation:
  • Validate space before starting
  • Reserve 20% safety margin
  • Monitor space during training (see the sketch below)
  • Provide clear error messages
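
A minimal sketch of the in-training space monitor (the threshold and log_fn hook are placeholders, not existing AI-OS APIs):

# Sketch: warn when the NVMe offload directory is close to full.
import shutil

def check_offload_space(nvme_path: str, log_fn, min_free_fraction: float = 0.10) -> None:
    usage = shutil.disk_usage(nvme_path)
    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        log_fn({
            "nvme_warning": (
                f"Offload volume nearly full: {usage.free / 1024**3:.1f} GB free "
                f"({free_fraction:.0%} of capacity)"
            )
        })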

Risk 4: DeepSpeed Version Compatibility

Impact: Older DeepSpeed may not support all Infinity features.

Mitigation:
  • Require deepspeed>=0.8.0 in pyproject.toml
  • Check DeepSpeed version at runtime (see the sketch below)
  • Provide upgrade instructions
  • Test with multiple DeepSpeed versions
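
A runtime version check could look like this (a sketch; the minimum version mirrors the pyproject pin listed under Dependencies):

# Sketch: verify the installed DeepSpeed is new enough for ZeRO-Infinity.
from packaging.version import Version

def check_deepspeed_version(min_version: str = "0.8.0") -> None:
    try:
        import deepspeed
    except ImportError as exc:
        raise RuntimeError("DeepSpeed is not installed; install the 'hf' extra.") from exc
    if Version(deepspeed.__version__) < Version(min_version):
        raise RuntimeError(
            f"DeepSpeed {deepspeed.__version__} found, but ZeRO-Infinity requires "
            f">={min_version}; upgrade with: pip install -U deepspeed"
        )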

Risk 5: User Confusion

Impact: Users may enable Infinity unnecessarily for small models.

Mitigation:
  • Clear GUI guidance ("For 10B+ models only")
  • Recommend ZeRO-2 as the default
  • Auto-suggest Infinity in the optimizer tool
  • Show model size vs memory comparison


Future Enhancements

Phase 6: Advanced Optimizations (Optional)

  1. Adaptive Prefetching: Predict which layers needed next, prefetch ahead
  2. Compression: Compress parameters on NVMe (trade space for speed)
  3. Multi-tier Caching: Smart caching of frequently used params
  4. Bandwidth Monitoring: Real-time I/O performance tracking
  5. Auto-tuning: Automatically adjust AIO settings based on hardware

Phase 7: Multi-Node Support (Optional)

  1. Distributed NVMe: Coordinate offloading across multiple nodes
  2. Network-attached Storage: Support NFS/SMB for shared offload
  3. Heterogeneous Hardware: Mix GPU + CPU + NVMe across nodes

Dependencies

Required Python Packages

# pyproject.toml
[project.optional-dependencies]
hf = [
    "deepspeed>=0.8.0",  # Required for ZeRO-Infinity
    "psutil>=5.8.0",     # For disk space checking
]

System Requirements

Linux:
  • NVMe SSD with PCIe 3.0 or later (2+ GB/s)
  • libaio installed (sudo apt-get install libaio-dev); see the probe sketch after these requirements
  • Kernel 4.0+ (for modern AIO support)

Windows (limited support):
  • NVMe SSD with >1 GB/s write speed
  • May require Windows 10/11 with latest updates
  • Performance may be lower than on Linux
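
Whether the libaio-backed async_io extension is actually usable can be probed at startup. A hedged sketch (AsyncIOBuilder is the op builder DeepSpeed's own ds_report tooling uses; treat the import path as an assumption and fail soft if it differs):

# Sketch: probe whether DeepSpeed's async_io (libaio) op is available.
def async_io_available() -> bool:
    try:
        from deepspeed.ops.op_builder import AsyncIOBuilder
        return bool(AsyncIOBuilder().is_compatible())
    except Exception:
        return False

if not async_io_available():
    print("async_io op unavailable (is libaio-dev installed?); NVMe offload may fall back to CPU.")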

Hardware Requirements

Minimum:
  • 1x NVIDIA GPU (11GB+ VRAM)
  • 32GB RAM
  • 256GB+ NVMe SSD (PCIe 3.0)

Recommended:
  • 2x NVIDIA GPUs (11GB+ VRAM each)
  • 64GB RAM
  • 1TB+ NVMe SSD (PCIe 4.0, >5 GB/s)

Optimal:
  • 4x NVIDIA GPUs (24GB+ VRAM each)
  • 128GB+ RAM
  • 2TB+ NVMe SSD (PCIe 4.0/5.0, >7 GB/s)


Performance Expectations

Small Model (1B params)

  • Baseline (no ZeRO): 100% speed, 8GB VRAM
  • ZeRO-2: 95% speed, 4GB VRAM
  • ZeRO-Infinity: 85% speed, 1GB VRAM, 20GB NVMe

Medium Model (10B params)

  • Baseline: OOM on 11GB GPU
  • ZeRO-2: OOM on 11GB GPU
  • ZeRO-3: 85% speed, 8GB VRAM (multi-GPU)
  • ZeRO-Infinity: 75% speed, 2GB VRAM, 200GB NVMe

Large Model (100B params)

  • Baseline: OOM
  • ZeRO-2: OOM
  • ZeRO-3: OOM on <8 GPUs
  • ZeRO-Infinity: 65% speed, 4GB VRAM per GPU, 2TB NVMe (8x GPUs)

Percentages relative to baseline training speed.


Timeline

  • Week 1: Phase 1 (Configuration) - Complete
  • Week 2-3: Phase 2 (Core Implementation) - Complete
  • Week 3: Phase 3 (Memory Estimation) - Complete
  • Week 4: Phase 4 (Testing) - Complete
  • Week 4-5: Phase 5 (Documentation) - Complete
  • Total: ~5 weeks for full integration

Open Questions

  1. Should we support gradient offloading?
     • Rarely needed even for 100B models
     • Adds complexity for minimal benefit
     • Recommendation: Add the flag but default to False, document as experimental

  2. How to handle multi-node Infinity?
     • Requires shared storage or per-node offload coordination
     • Complex to implement and test
     • Recommendation: Defer to future enhancement (Phase 7)

  3. Should the GUI expose AIO settings?
     • Most users won't understand or need to change them
     • Could clutter the interface
     • Recommendation: Advanced collapsible section, use sane defaults

  4. Windows support priority?
     • Limited testing hardware
     • May have AIO compatibility issues
     • Recommendation: Best-effort support, Linux-first, document limitations


Conclusion

ZeRO-Infinity integration will enable AI-OS to train models 10-100x larger than currently possible, unlocking multi-billion parameter models on consumer hardware. The implementation is well-defined with clear phases, comprehensive testing, and proper fallbacks for edge cases.

Key Benefits:
  • Train 10B+ models on 2x 11GB GPUs
  • Minimal slowdown (10-20%) with fast NVMe
  • Graceful degradation to CPU offload
  • Full GUI and CLI support
  • Comprehensive validation and error handling

Recommendation: Proceed with implementation. Start with Phase 1 configuration, validate with small models, then scale up to large models. Prioritize Linux support, provide Windows compatibility as best-effort.