Gradient Accumulation¶
Status: Planned
Priority: High
Objective: Fix loss instability from small batch sizes by implementing gradient accumulation
Created: 2025-10-23
Executive Summary¶
The Problem¶
Current training exhibits severe loss instability due to small batch sizes:
Step 448: Loss = 7.0073
Step 449: Loss = 8.1779  ← Wild 17% jump
Step 450: Loss = 7.3505
Step 451: Loss = 8.1017
Step 452: Loss = 8.7440  ← Peak instability
Step 453: Loss = 7.5572
Root Cause: Batch size too small (2-8 samples) → noisy gradient estimates → unstable training
The Solution¶
Gradient Accumulation: Accumulate gradients over N batches before updating the weights.
- Effective batch size = physical_batch_size × gradient_accumulation_steps
- VRAM usage = same as physical_batch_size (no increase!)
- Training stability = equivalent to large-batch training
Expected Results¶
# With batch=8, gradient_accumulation_steps=4
Step 112: Loss = 7.0073 (effective_batch=32)
Step 113: Loss = 6.8234  ← Smooth 2.6% decline
Step 114: Loss = 6.6512
Step 115: Loss = 6.4891
Step 116: Loss = 6.3201  ← Stable convergence
Technical Background¶
How Gradient Accumulation Works¶
Standard Training (current):
for batch in dataloader:
    loss = model(batch)
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights immediately
    optimizer.zero_grad()
With Gradient Accumulation:
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps  # ⚠️ CRITICAL: Scale loss!
    loss.backward()  # Gradients accumulate in model.parameters()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update weights every N batches
        optimizer.zero_grad()
Why Loss Scaling is Critical¶
# ❌ WRONG - gradients will sum instead of average
loss.backward()

# ✅ CORRECT - scale the loss so gradients average
loss = loss / gradient_accumulation_steps
loss.backward()
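The effect is easy to verify in isolation. Below is a minimal standalone sketch (not from the codebase; the toy model and data are made up) showing that accumulating scaled micro-batch losses reproduces the gradient of one full batch:

# Toy check: scaled-loss accumulation over micro-batches matches a full-batch gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data, target = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# One full batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Same 8 samples as 4 micro-batches of 2, with the loss scaled by accumulation_steps
accumulation_steps = 4
model.zero_grad()
for x, y in zip(data.chunk(accumulation_steps), target.chunk(accumulation_steps)):
    (loss_fn(model(x), y) / accumulation_steps).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True: gradients match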
Memory Usage Comparison¶
Model: 1.3B parameters
| Batch Size | Accum | Effective | VRAM Usage |
|-----------:|------:|----------:|:-----------|
| 2 | 1 | 2 | ~6 GB |
| 2 | 16 | 32 | ~6 GB ← Same! |
| 8 | 1 | 8 | ~10 GB |
| 8 | 4 | 32 | ~10 GB ← Same! |
| 32 | 1 | 32 | ~24 GB |
Conclusion: Accumulation doesn't increase VRAM!
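If you want to confirm this on your own hardware rather than trust the table, peak allocation can be read directly from PyTorch. A standalone sketch (the model, batches, and loss function are whatever you choose to profile; requires a CUDA device):

# Sketch: report peak CUDA memory for one pass with a given accumulation setting.
import torch

def peak_vram_gb(model, batches, loss_fn, accumulation_steps):
    torch.cuda.reset_peak_memory_stats()
    model.zero_grad()
    for i, (x, y) in enumerate(batches):
        (loss_fn(model(x), y) / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            model.zero_grad()  # stand-in for optimizer.step() + zero_grad()
    return torch.cuda.max_memory_allocated() / 1024**3

# Peak memory is driven by the physical batch size, so the returned value should be
# roughly identical for accumulation_steps=1 and accumulation_steps=16.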
Implementation Architecture¶
Files to Modify¶
src/aios/core/hrm_training/training_config/
├── optimization_fields.py   [ADD] gradient_accumulation_steps field
└── config_main.py           [MODIFY] to_cli_args() method
src/aios/cli/
└── hrm_hf_cli.py             [ADD] --gradient-accumulation-steps option
src/aios/cli/hrm_hf/training_logic/
└── train_epoch.py            [MODIFY] backward/step logic (MAIN CHANGE)
src/aios/gui/components/hrm_training_panel/
├── ui_optimizations.py       [ADD] UI controls for gradient accumulation
├── variable_setup.py         [ADD] gradient_accumulation_var
├── config_builder.py         [MODIFY] include in config build
└── state_management.py       [MODIFY] save/load state
Implementation Steps¶
Phase 1: Configuration Layer¶
Step 1.1: Add Configuration Field¶
File: src/aios/core/hrm_training/training_config/optimization_fields.py
Location: After line 118 (after load_in_4bit field)
Code to Add:
# ============================================================================
# Gradient Accumulation
# ============================================================================
gradient_accumulation_steps: int = 1
"""Number of batches to accumulate gradients before updating weights.
Enables training with larger effective batch sizes without increasing VRAM.
The effective batch size is: physical_batch_size × gradient_accumulation_steps
Benefits:
- Fixes loss instability from small batch sizes
- No VRAM increase (memory usage stays at physical batch size)
- Smoother training dynamics
- Better gradient estimates
Example:
- batch_size=8, gradient_accumulation_steps=4 → effective_batch_size=32
- VRAM usage: ~10GB (for batch=8)
- Training stability: equivalent to batch=32
Recommended values:
- 1: No accumulation (default, update every batch)
- 2-4: Mild accumulation for slightly smoother training
- 4-8: Moderate accumulation (recommended for most cases)
- 8-16: High accumulation for very small batch sizes
- 16+: Extreme accumulation (use when batch=1-2 required)
How to choose:
1. Start with current batch_size and desired effective_batch_size
2. Calculate: gradient_accumulation_steps = effective_batch_size / batch_size
3. Test and adjust based on loss stability
Memory impact:
- Gradients: +1× model size (same as normal training)
- Activations: Only for physical batch size
- Total overhead: Negligible (<5% of total memory)
Performance impact:
- Slightly slower due to more forward passes
- ~5-15% overhead depending on accumulation_steps
- Worth it for stability improvement
Compatibility:
- Works with: AMP, gradient checkpointing, DeepSpeed ZeRO, PEFT/LoRA
- Works across: DDP, parallel independent, single-GPU modes
- Scheduler: Automatically adjusted to step with weight updates
"""
Step 1.2: Update CLI Args Conversion¶
File: src/aios/core/hrm_training/training_config/config_main.py
Location: In to_cli_args() method, after batch_size (around line 109)
Find:
Replace with:
args.extend(["--batch-size", str(self.batch_size)])
args.extend(["--steps", str(self.steps)])
# Gradient accumulation
if self.gradient_accumulation_steps > 1:
    args.extend(["--gradient-accumulation-steps", str(self.gradient_accumulation_steps)])
Phase 2: CLI Integration¶
Step 2.1: Add CLI Option¶
File: src/aios/cli/hrm_hf_cli.py
Location: In train_actv1 function, after batch_size parameter (around line 137)
Find:
Replace with:
batch_size: int = typer.Option(8, "--batch-size"),
gradient_accumulation_steps: int = typer.Option(
    1,
    "--gradient-accumulation-steps",
    help="Accumulate gradients over N batches before updating weights. "
         "Effective batch size = batch_size × gradient_accumulation_steps. "
         "Use to train with larger effective batches without increasing VRAM. "
         "Example: batch=8, accum=4 → effective_batch=32",
),
steps: int = typer.Option(200, "--steps"),
Phase 3: Training Loop Modification (CRITICAL)¶
Step 3.1: Modify Training Logic¶
File: src/aios/cli/hrm_hf/training_logic/train_epoch.py
Location: Around line 350-380 (backward/step section)
Find (the entire backward pass section):
# Backward pass
if deepspeed_engine is not None:
    deepspeed_engine.backward(loss)
    deepspeed_engine.step()
elif use_amp and scaler is not None and dev == "cuda":
    scaler.scale(loss).backward()
    scaler.unscale_(opt)
    average_gradients_if_distributed(model_student, is_distributed=ddp_actually_working, world_sz=world_sz)
    grad_norm = torch.nn.utils.clip_grad_norm_(model_student.parameters(), 0.5)
    if torch.isnan(grad_norm) or torch.isinf(grad_norm):
        opt.zero_grad(set_to_none=True)
        scaler.update()
    else:
        scaler.step(opt)
        scaler.update()
else:
    loss.backward()
    average_gradients_if_distributed(model_student, is_distributed=ddp_actually_working, world_sz=world_sz)
    grad_norm = torch.nn.utils.clip_grad_norm_(model_student.parameters(), 0.5)
    if torch.isnan(grad_norm) or torch.isinf(grad_norm):
        opt.zero_grad(set_to_none=True)
    else:
        opt.step()
steps_done += 1
Replace with:
# Get gradient accumulation config
gradient_accumulation_steps = getattr(config, 'gradient_accumulation_steps', 1)
accumulation_steps = max(1, int(gradient_accumulation_steps))

# Scale loss for gradient accumulation
# CRITICAL: This ensures gradients average instead of sum
scaled_loss = loss / accumulation_steps

# Backward pass (gradients accumulate automatically)
if deepspeed_engine is not None:
    # DeepSpeed handles accumulation internally
    deepspeed_engine.backward(scaled_loss)
    # Only step optimizer every N batches
    if (batch_idx + 1) % accumulation_steps == 0:
        deepspeed_engine.step()
elif use_amp and scaler is not None and dev == "cuda":
    # AMP with gradient accumulation
    scaler.scale(scaled_loss).backward()
    # Only update weights every N batches
    if (batch_idx + 1) % accumulation_steps == 0:
        scaler.unscale_(opt)
        average_gradients_if_distributed(model_student, is_distributed=ddp_actually_working, world_sz=world_sz)
        grad_norm = torch.nn.utils.clip_grad_norm_(model_student.parameters(), 0.5)
        if torch.isnan(grad_norm) or torch.isinf(grad_norm):
            opt.zero_grad(set_to_none=True)
            scaler.update()
        else:
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
else:
    # Standard mode with gradient accumulation
    scaled_loss.backward()
    # Only update weights every N batches
    if (batch_idx + 1) % accumulation_steps == 0:
        average_gradients_if_distributed(model_student, is_distributed=ddp_actually_working, world_sz=world_sz)
        grad_norm = torch.nn.utils.clip_grad_norm_(model_student.parameters(), 0.5)
        if torch.isnan(grad_norm) or torch.isinf(grad_norm):
            opt.zero_grad(set_to_none=True)
        else:
            opt.step()
            opt.zero_grad(set_to_none=True)

# Increment step counter (count actual weight updates, not batches)
if (batch_idx + 1) % accumulation_steps == 0:
    steps_done += 1
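One edge case worth deciding on: if the number of batches in an epoch is not a multiple of accumulation_steps, the final few micro-batches accumulate gradients that never trigger an optimizer step. Whether a flush is needed depends on how train_epoch.py ends its loop; a hedged sketch for the non-DeepSpeed, non-AMP path (the AMP path would additionally need the scaler calls shown above):

# Sketch: flush leftover accumulated gradients after the batch loop (names follow the code above).
leftover = (batch_idx + 1) % accumulation_steps
if leftover != 0 and deepspeed_engine is None:
    average_gradients_if_distributed(model_student, is_distributed=ddp_actually_working, world_sz=world_sz)
    grad_norm = torch.nn.utils.clip_grad_norm_(model_student.parameters(), 0.5)
    if torch.isnan(grad_norm) or torch.isinf(grad_norm):
        opt.zero_grad(set_to_none=True)
    else:
        opt.step()
        opt.zero_grad(set_to_none=True)
        steps_done += 1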
Step 3.2: Update Logging¶
File: src/aios/cli/hrm_hf/training_logic/train_epoch.py
Location: Around line 400 (logging section)
Find:
if steps_done % 5 == 0 or steps_done == 1:
    try:
        import torch
        if dev == "cuda" and torch.cuda.is_available():
            torch.cuda.synchronize()
            allocated_gb = torch.cuda.memory_allocated(device_obj) / 1024**3
            reserved_gb = torch.cuda.memory_reserved(device_obj) / 1024**3
            max_allocated_gb = torch.cuda.max_memory_allocated(device_obj) / 1024**3
            total_gb = torch.cuda.get_device_properties(device_obj).total_memory / 1024**3
            write_jsonl({
                "step": steps_done,
                "memory_gb": round(allocated_gb, 3),
Replace with:
if steps_done % 5 == 0 or steps_done == 1:
    try:
        import torch
        if dev == "cuda" and torch.cuda.is_available():
            torch.cuda.synchronize()
            allocated_gb = torch.cuda.memory_allocated(device_obj) / 1024**3
            reserved_gb = torch.cuda.memory_reserved(device_obj) / 1024**3
            max_allocated_gb = torch.cuda.max_memory_allocated(device_obj) / 1024**3
            total_gb = torch.cuda.get_device_properties(device_obj).total_memory / 1024**3
            # Get accumulation info
            gradient_accumulation_steps = getattr(config, 'gradient_accumulation_steps', 1)
            accumulation_steps = max(1, int(gradient_accumulation_steps))
            write_jsonl({
                "step": steps_done,
                "batch_idx": batch_idx + 1,
                "loss": float(loss.item()),  # Unscaled loss for logging
                "gradient_accumulation_steps": accumulation_steps,
                "effective_batch_size": batch_size * accumulation_steps,
                "physical_batch_size": batch_size,
                "memory_gb": round(allocated_gb, 3),
Phase 4: GUI Integration¶
Step 4.1: Add Variable¶
File: src/aios/gui/components/hrm_training_panel/variable_setup.py
Location: In setup_variables() function, after existing optimization variables (around line 60)
Find:
panel.use_8bit_optimizer_var = tk.BooleanVar(value=False)
panel.use_cpu_offload_var = tk.BooleanVar(value=False)
Add after:
panel.use_8bit_optimizer_var = tk.BooleanVar(value=False)
panel.use_cpu_offload_var = tk.BooleanVar(value=False)
# Gradient accumulation
panel.gradient_accumulation_var = tk.StringVar(value="1")
Step 4.2: Add UI Controls¶
File: src/aios/gui/components/hrm_training_panel/ui_optimizations.py
Location: After Row 1 (Memory Optimizations), around line 38
Find:
Add between:
cpu_offload_btn.pack(side="left", padx=(8, 0))
# Row 1.5: Gradient Accumulation
grad_accum_row = ttk.Frame(opt_frame)
grad_accum_row.pack(fill="x", pady=2)
ttk.Label(grad_accum_row, text="Batch Scaling:", width=15, anchor="e", font=("TkDefaultFont", 9, "bold")).pack(side="left")
ttk.Label(grad_accum_row, text="Accum Steps:").pack(side="left", padx=(0, 2))
accum_combo = ttk.Combobox(grad_accum_row, textvariable=panel.gradient_accumulation_var, width=8, state="readonly")
accum_combo['values'] = ('1', '2', '4', '8', '16', '32')
accum_combo.pack(side="left")
ttk.Label(grad_accum_row, text="→").pack(side="left", padx=4)
panel.effective_batch_lbl = ttk.Label(grad_accum_row, text="Effective Batch: 8", foreground="blue")
panel.effective_batch_lbl.pack(side="left")
# Row 2: PEFT (Parameter-Efficient Fine-Tuning)
Step 4.3: Add Tooltip and Update Callback¶
File: src/aios/gui/components/hrm_training_panel/ui_optimizations.py
Location: In tooltip section, around line 100
Find:
add_tooltip(cpu_offload_btn, "CPU Offload: Moves optimizer states to system RAM\nSaves VRAM • ~30% slower training")
add_tooltip(peft_enable_btn, "Enable PEFT: Use Low-Rank Adaptation (LoRA) for efficient fine-tuning\n↓95-99% trainable parameters (87M → 500K-2M)")
Add after:
add_tooltip(cpu_offload_btn, "CPU Offload: Moves optimizer states to system RAM\nSaves VRAM • ~30% slower training")
add_tooltip(accum_combo,
    "Gradient Accumulation: Accumulate gradients over N batches\n"
    "before updating weights.\n\n"
    "Benefits:\n"
    "• Fixes loss instability from small batches\n"
    "• No VRAM increase\n"
    "• Smoother training dynamics\n\n"
    "Effective Batch = Physical Batch × Accum Steps\n"
    "Example: batch=8, accum=4 → effective=32\n\n"
    "Recommended:\n"
    "• 1: No accumulation (default)\n"
    "• 4: Balanced (recommended)\n"
    "• 8-16: High stability (small batches)\n"
    "• 32: Maximum stability (batch=1-2)")
add_tooltip(peft_enable_btn, "Enable PEFT: Use Low-Rank Adaptation (LoRA) for efficient fine-tuning\n↓95-99% trainable parameters (87M → 500K-2M)")
Step 4.4: Add Update Callback Function¶
File: src/aios/gui/components/hrm_training_panel/ui_optimizations.py
Location: After tooltips section, before the end of build_optimizations_section() function
Add:
# Setup callback to update effective batch label
def update_effective_batch_label(*args):
    try:
        batch = int(panel.batch_var.get() or 8)
        accum = int(panel.gradient_accumulation_var.get() or 1)
        effective = batch * accum
        panel.effective_batch_lbl.config(text=f"Effective Batch: {effective}")
    except Exception:
        pass

panel.gradient_accumulation_var.trace_add("write", update_effective_batch_label)
panel.batch_var.trace_add("write", update_effective_batch_label)

# Initial update
update_effective_batch_label()
Step 4.5: Update Config Builder¶
File: src/aios/gui/components/hrm_training_panel/config_builder.py
Location: In build_training_config() function, after batch_size (around line 25)
Find:
Replace with:
batch_size=int(panel.batch_var.get() or 8),
gradient_accumulation_steps=int(panel.gradient_accumulation_var.get() or 1),
steps=int(panel.steps_var.get() or 200),
Step 4.6: Update State Management¶
File: src/aios/gui/components/hrm_training_panel/state_management.py
Location: In get_state() function, around line 30
Find:
Add after:
"batch_size": panel.batch_var.get(),
"gradient_accumulation_steps": panel.gradient_accumulation_var.get(),
"steps": panel.steps_var.get(),
Location: In set_state() function, around line 100
Find:
if "batch_size" in state:
panel.batch_var.set(str(state["batch_size"]))
if "steps" in state:
panel.steps_var.set(str(state["steps"]))
Add after:
if "batch_size" in state:
panel.batch_var.set(str(state["batch_size"]))
if "gradient_accumulation_steps" in state:
panel.gradient_accumulation_var.set(str(state["gradient_accumulation_steps"]))
if "steps" in state:
panel.steps_var.set(str(state["steps"]))
Testing Plan¶
Test 1: Basic Functionality¶
Objective: Verify gradient accumulation works correctly
# Terminal test
cd /path/to/AI-OS # Replace with your AI-OS directory
.\.venv\Scripts\Activate.ps1
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--steps 20 `
--log-file artifacts/test_grad_accum.jsonl
# Check log file
cat artifacts/test_grad_accum.jsonl | Select-String "gradient_accumulation"
Expected Output:
Test 2: Loss Stability Comparison¶
Objective: Demonstrate loss stability improvement
Test 2a: Without accumulation (baseline)
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 2 `
--gradient-accumulation-steps 1 `
--steps 100 `
--log-file artifacts/test_without_accum.jsonl
Test 2b: With accumulation
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 2 `
--gradient-accumulation-steps 16 `
--steps 100 `
--log-file artifacts/test_with_accum.jsonl
Analysis Script:
import json
import matplotlib.pyplot as plt
# Load logs
with open('artifacts/test_without_accum.jsonl') as f:
    without = [json.loads(line) for line in f if 'loss' in line]
with open('artifacts/test_with_accum.jsonl') as f:
    with_accum = [json.loads(line) for line in f if 'loss' in line]

# Plot comparison
plt.figure(figsize=(12, 6))
plt.plot([x['step'] for x in without], [x['loss'] for x in without],
         label='Without accumulation (batch=2)', alpha=0.7)
plt.plot([x['step'] for x in with_accum], [x['loss'] for x in with_accum],
         label='With accumulation (batch=2, accum=16, effective=32)', alpha=0.7)
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Loss Stability: With vs Without Gradient Accumulation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('artifacts/gradient_accumulation_comparison.png')
print("Chart saved to artifacts/gradient_accumulation_comparison.png")
Test 3: VRAM Usage Verification¶
Objective: Confirm VRAM usage doesn't increase with accumulation
# Test A: batch=2, accum=1
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 2 `
--gradient-accumulation-steps 1 `
--steps 10 `
--log-file artifacts/vram_test_no_accum.jsonl
# Test B: batch=2, accum=16 (should use same VRAM!)
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 2 `
--gradient-accumulation-steps 16 `
--steps 10 `
--log-file artifacts/vram_test_with_accum.jsonl
Verification:
# Compare peak VRAM usage
$noAccum = Get-Content artifacts/vram_test_no_accum.jsonl | ConvertFrom-Json | Where-Object {$_.peak_gb} | Select-Object -Last 1
$withAccum = Get-Content artifacts/vram_test_with_accum.jsonl | ConvertFrom-Json | Where-Object {$_.peak_gb} | Select-Object -Last 1
Write-Host "No accumulation VRAM: $($noAccum.peak_gb) GB"
Write-Host "With accumulation VRAM: $($withAccum.peak_gb) GB"
Write-Host "Difference: $(($withAccum.peak_gb - $noAccum.peak_gb)) GB (should be ~0)"
Test 4: GUI Integration¶
Objective: Verify GUI controls work correctly
Steps:
1. Launch GUI: aios gui
2. Navigate to HRM Training panel
3. Verify new controls appear:
- "Batch Scaling:" label
- "Accum Steps:" dropdown
- "โ Effective Batch: X" label
4. Test interactions:
- Set Batch Size: 8
- Set Accum Steps: 4
- Verify label shows "Effective Batch: 32"
- Change Batch Size to 16
- Verify label updates to "Effective Batch: 64"
5. Start training and verify:
- Training runs without errors
- Log shows correct accumulation settings
- Loss curve is smoother than without accumulation
Test 5: Compatibility Tests¶
Test 5a: With AMP
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--amp `
--steps 20
Test 5b: With Gradient Checkpointing
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--gradient-checkpointing `
--steps 20
Test 5c: With DeepSpeed ZeRO-2
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--zero-stage zero2 `
--steps 20
Test 5d: With PEFT/LoRA
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--use-peft `
--lora-r 16 `
--steps 20
Test 5e: Parallel Independent Mode
aios hrm-hf train-actv1 `
--model artifacts/hf_implant/gpt2 `
--dataset-file training_data/curated_datasets/test_sample.txt `
--batch-size 4 `
--gradient-accumulation-steps 8 `
--parallel-independent `
--cuda-ids 0,1 `
--steps 20
Implementation Checklist¶
Phase 1: Configuration Layer¶
- [ ] Add gradient_accumulation_steps field to optimization_fields.py
- [ ] Update to_cli_args() in config_main.py
- [ ] Test: Import TrainingConfig and verify new field exists
Phase 2: CLI Integration¶
- [ ] Add --gradient-accumulation-steps option to hrm_hf_cli.py
- [ ] Test: aios hrm-hf train-actv1 --help shows new option
Phase 3: Training Loop¶
- [ ] Modify backward/step logic in train_epoch.py
- [ ] Add loss scaling
- [ ] Add conditional optimizer step
- [ ] Update logging to show accumulation metrics
- [ ] Test: Run basic training with accumulation
Phase 4: GUI Integration¶
- [ ] Add gradient_accumulation_var in variable_setup.py
- [ ] Add UI controls in ui_optimizations.py
- [ ] Add tooltip
- [ ] Add update callback for effective batch label
- [ ] Update config_builder.py
- [ ] Update state_management.py (get_state)
- [ ] Update state_management.py (set_state)
- [ ] Test: Launch GUI and verify controls appear
Phase 5: Testing¶
- [ ] Test 1: Basic functionality
- [ ] Test 2: Loss stability comparison
- [ ] Test 3: VRAM usage verification
- [ ] Test 4: GUI integration
- [ ] Test 5a: AMP compatibility
- [ ] Test 5b: Gradient checkpointing compatibility
- [ ] Test 5c: DeepSpeed ZeRO compatibility
- [ ] Test 5d: PEFT/LoRA compatibility
- [ ] Test 5e: Parallel independent mode compatibility
Phase 6: Documentation¶
- [ ] Update CLI help text (done via option definition)
- [ ] Add tooltip to GUI (done in Phase 4)
- [ ] Update docs/guide/features/TRAINING_OPTIMIZATIONS.md (if exists)
- [ ] Add note to CHANGELOG.md
Success Criteria¶
Functional Requirements¶
- ✅ Gradient accumulation configurable via CLI and GUI
- ✅ Loss stability improves with accumulation enabled
- ✅ VRAM usage remains constant regardless of accumulation steps
- ✅ Compatible with all existing optimizations (AMP, checkpointing, ZeRO, PEFT)
- ✅ Works in all training modes (single-GPU, DDP, parallel independent)
Performance Requirements¶
- ✅ Training slowdown < 15% for accumulation_steps ≤ 16
- ✅ Loss variance reduced by ≥ 50% compared to small-batch baseline
- ✅ No memory leaks during extended training
User Experience Requirements¶
- ✅ GUI controls intuitive and self-documenting
- ✅ CLI help text clear and complete
- ✅ Logging shows accumulation status and effective batch size
- ✅ State persistence works (save/load GUI state)
Quick Start (After Implementation)¶
Via CLI¶
# Basic usage
aios hrm-hf train-actv1 \
--model artifacts/hf_implant/gpt2 \
--dataset-file training_data/curated_datasets/my_dataset.txt \
--batch-size 8 \
--gradient-accumulation-steps 4 \
--steps 1000
# High stability (small batch)
aios hrm-hf train-actv1 \
--batch-size 2 \
--gradient-accumulation-steps 16 \
--steps 1000
Via GUI¶
- Open HRM Training panel
- Set Batch Size: 8
- Set Accum Steps: 4
- Verify: "Effective Batch: 32"
- Click "Start Training"
References¶
Research¶
- PyTorch gradient accumulation patterns (from pytorch/examples)
- Standard deep learning practice for memory-constrained training
- User's current loss instability pattern: 7.0→8.7→7.3 (analyzed)
Codebase Components¶
- Main training loop: src/aios/cli/hrm_hf/training_logic/train_epoch.py
- Config system: src/aios/core/hrm_training/training_config/
- GUI panel: src/aios/gui/components/hrm_training_panel/
- CLI: src/aios/cli/hrm_hf_cli.py
Related Features¶
- Automatic Mixed Precision (AMP)
- Gradient Checkpointing
- DeepSpeed ZeRO
- PEFT/LoRA
- Parallel Independent Training
Future Enhancements¶
Potential Improvements¶
- Auto-calculate accumulation: GUI button to automatically determine optimal accumulation steps based on a target effective batch size (a sketch of the calculation follows this list)
- Adaptive accumulation: Dynamically adjust accumulation based on VRAM usage
- Loss-based tuning: Automatically increase accumulation if loss variance exceeds threshold
- Multi-GPU load balancing: Different accumulation steps per GPU based on VRAM capacity
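For the auto-calculate idea, the calculation itself is trivial; a sketch of the hypothetical helper such a GUI button would call:

# Hypothetical helper: smallest accumulation_steps that reaches the target effective batch.
import math

def suggest_accumulation_steps(physical_batch_size: int, target_effective_batch: int) -> int:
    return max(1, math.ceil(target_effective_batch / max(1, physical_batch_size)))

print(suggest_accumulation_steps(8, 32))  # 4
print(suggest_accumulation_steps(2, 32))  # 16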
Not in Scope (For Now)¶
- Dynamic gradient accumulation (changing during training)
- Per-layer gradient accumulation
- Gradient accumulation with different batch sizes per step
Support & Troubleshooting¶
Common Issues¶
Q: Loss not improving with accumulation
A: Verify loss scaling is applied (loss / accumulation_steps). Check logs for scaled_loss in backward pass.
Q: VRAM usage increased
A: Check that batch_size in config matches physical batch, not effective batch. Accumulation shouldn't increase VRAM.
Q: Training slower than expected
A: Normal with high accumulation steps. Trade-off between speed and stability. Try reducing accumulation_steps.
Q: Gradients not accumulating
A: Ensure optimizer.zero_grad() only called after weight update, not every batch.
Q: GUI not showing new controls
A: Restart GUI. Check that all Phase 4 files were modified correctly.
Document Version: 1.0
Last Updated: 2025-10-23
Author: AI-OS Development Team
Status: Ready for Implementation