
Training Resume and Checkpoint Selection

Overview

Enhance the GUI “Start Training / Resume” flow so the checkpoint selection screen always opens, even for a brand-new run. Add a new option that lets the user choose which block and chunk to start training from.

This is a UX/planning feature focused on:

  • making “resume” behavior explicit and predictable
  • enabling controlled partial re-training / targeted continuation
  • improving debuggability when chunk-tracking state is confusing

Goals

  • Always show the checkpoint/resume screen when the user starts training.
  • Provide a simple way to choose a starting point: block_id and chunk_id.
  • Work for both:
    • new runs (no prior checkpoint)
    • resume runs (checkpoint exists)
  • Keep behavior consistent across single-GPU and parallel-independent multi-GPU.

Non-Goals

  • Do not add complex browsing/search/filtering of dataset contents.
  • Do not add per-sample selection or editing.
  • Do not change the adaptive LR auto-mode logic (auto stays limited to built-in modes).

Proposed UX

Checkpoint Screen Behavior

When the user clicks “Start Training”:

  • Always open the checkpoint screen.
  • The screen provides two primary choices:
    1) Start New Run (no checkpoint required)
    2) Resume From Checkpoint (choose an existing checkpoint if available)
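As a point of reference, here is a minimal sketch of what the checkpoint screen could hand back to the training launcher; the RunMode and CheckpointScreenResult names are hypothetical and not part of the existing codebase:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class RunMode(Enum):
    """The two primary choices offered by the checkpoint screen."""
    START_NEW_RUN = auto()
    RESUME_FROM_CHECKPOINT = auto()


@dataclass
class CheckpointScreenResult:
    """What the GUI returns when the checkpoint screen closes (hypothetical shape)."""
    mode: RunMode
    checkpoint_path: Optional[str] = None  # only set for RESUME_FROM_CHECKPOINT
    start_block_id: int = 0                # from the Start Position selector below
    start_chunk_id: int = 0
```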

Start Position Selector (New)

Add a section:

  • Start Position
    • Block: numeric input (integer, >= 0)
    • Chunk: numeric input (integer, >= 0)

Behavior:

  • Defaults to Block=0, Chunk=0.
  • Applies to both “Start New Run” and “Resume From Checkpoint”.
  • If the chosen block/chunk is out of range (unknown until block discovery), validate as early as possible and fail with a clear message.

Interaction With Existing Controls

  • If the user chooses a start position that conflicts with the existing chunk_tracker_state.json, prompt them to either:
    • respect the tracker (skip already-trained chunks), or
    • force training from the selected position (equivalent to the existing --force-train escape hatch).
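A rough sketch of how the conflict check and the two prompt outcomes could map onto chunk claiming; start_position_conflicts and effective_skip_set are illustrative names, not existing functions:

```python
from typing import Set, Tuple

ChunkKey = Tuple[int, int]  # (block_id, chunk_id)


def start_position_conflicts(completed: Set[ChunkKey], start: ChunkKey) -> bool:
    """True if the tracker already marks the selected start chunk (or anything
    after it) as completed, i.e. the GUI should show the respect-vs-force prompt."""
    return any(chunk >= start for chunk in completed)


def effective_skip_set(completed: Set[ChunkKey], force_train: bool) -> Set[ChunkKey]:
    """Chunks to skip during claiming, depending on the user's choice:
    respect the tracker (skip completed chunks) or force training (skip nothing)."""
    return set() if force_train else set(completed)
```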

Backend / Plumbing

TrainingConfig

Add fields:

  • start_block_id: int = 0
  • start_chunk_id: int = 0

Add CLI args (HRM HF):

  • --start-block-id <int>
  • --start-chunk-id <int>

These should be independent of --resume.
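A sketch of the config and CLI plumbing, assuming a dataclass-based TrainingConfig and an argparse-style CLI; only the two fields and two flags come from this spec, the surrounding structure is illustrative:

```python
import argparse
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingConfig:
    # ...existing fields elided; only the new ones are shown.
    start_block_id: int = 0
    start_chunk_id: int = 0

    def to_cli_args(self) -> List[str]:
        """Map the new fields onto the HRM HF CLI flags."""
        return [
            "--start-block-id", str(self.start_block_id),
            "--start-chunk-id", str(self.start_chunk_id),
        ]


def add_start_position_args(parser: argparse.ArgumentParser) -> None:
    """Register the new options; both are independent of --resume."""
    parser.add_argument("--start-block-id", type=int, default=0)
    parser.add_argument("--start-chunk-id", type=int, default=0)
```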

ChunkTracker / BlockManager Integration

Implement the start position as the initial cursor for chunk claiming:

  • For a new run:
    • start claiming chunks beginning at (start_block_id, start_chunk_id).
  • For a resume run:
    • load the checkpoint and chunk tracker state as normal,
    • then apply the start-position override as the minimum starting point.

Important rule: the start position is a lower bound; chunks earlier than it should not be claimed.
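A minimal sketch of the lower-bound claiming cursor, assuming chunks can be ordered by (block_id, chunk_id); claimable_chunks is an illustrative helper, not an existing ChunkTracker method:

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

ChunkKey = Tuple[int, int]  # (block_id, chunk_id)


def claimable_chunks(
    all_chunks: Iterable[ChunkKey],
    start: ChunkKey,
    completed: Optional[Set[ChunkKey]] = None,
) -> Iterator[ChunkKey]:
    """Yield chunks in order, treating `start` as the lower bound.

    For a new run `completed` is empty; for a resume run it comes from the
    loaded chunk tracker state (unless the user chose force-train).
    """
    completed = completed or set()
    for chunk in sorted(all_chunks):
        if chunk < start:        # below the lower bound: never claimed
            continue
        if chunk in completed:   # tracker says it is done (respect-tracker path)
            continue
        yield chunk
```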

Parallel-Independent Multi-GPU

When --parallel-independent is enabled:

  • all workers should share the same start-position lower bound
  • the distribution logic should skip earlier chunks consistently
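One way the distributor could enforce this, sketched as a simple round-robin split; the actual distribution strategy may differ, and distribute_chunks is an illustrative name:

```python
from typing import Dict, List, Sequence, Tuple

ChunkKey = Tuple[int, int]  # (block_id, chunk_id)


def distribute_chunks(
    all_chunks: Sequence[ChunkKey],
    start: ChunkKey,
    num_workers: int,
) -> Dict[int, List[ChunkKey]]:
    """Assign chunks to workers for --parallel-independent.

    The start-position lower bound is applied once, before distribution,
    so every worker skips the same earlier chunks.
    """
    eligible = sorted(c for c in all_chunks if c >= start)
    assignment: Dict[int, List[ChunkKey]] = {w: [] for w in range(num_workers)}
    for i, chunk in enumerate(eligible):
        assignment[i % num_workers].append(chunk)
    return assignment
```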

Edge Cases / Validation

  • If the dataset has fewer blocks/chunks than the requested start position:
    • fail fast with a readable error (don’t silently run 0 steps)
  • If the start position points to an already-completed chunk:
    • behavior depends on the “respect tracker” vs. “force train” choice
  • If the user selects Resume but no checkpoint exists:
    • show a warning and allow Start New Run
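A sketch of the fail-fast validation, assuming block and chunk counts are known after block discovery (in practice chunk counts may vary per block); validate_start_position is an illustrative helper:

```python
from typing import Tuple

ChunkKey = Tuple[int, int]  # (block_id, chunk_id)


def validate_start_position(start: ChunkKey, num_blocks: int, chunks_in_block: int) -> None:
    """Raise a readable error instead of silently running 0 steps."""
    block_id, chunk_id = start
    if block_id >= num_blocks:
        raise ValueError(
            f"start_block_id={block_id} is out of range: dataset has only {num_blocks} block(s)."
        )
    if chunk_id >= chunks_in_block:
        raise ValueError(
            f"start_chunk_id={chunk_id} is out of range: block {block_id} has only "
            f"{chunks_in_block} chunk(s)."
        )
```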

Telemetry / Observability

Emit JSONL events at training start:

  • training_start_position_selected, including:
    • start_block_id, start_chunk_id
    • whether the override is active
    • whether force_train is active

This makes “why did it train / why did it skip” easy to diagnose.
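A sketch of the emitter; the listed fields come from this spec, while the timestamp field and the append-to-file transport are assumptions about the JSONL telemetry setup:

```python
import json
import time


def emit_start_position_event(
    log_path: str,
    start_block_id: int,
    start_chunk_id: int,
    override_active: bool,
    force_train: bool,
) -> None:
    """Append a training_start_position_selected event to the JSONL telemetry log."""
    event = {
        "event": "training_start_position_selected",
        "timestamp": time.time(),  # assumption: events carry a timestamp
        "start_block_id": start_block_id,
        "start_chunk_id": start_chunk_id,
        "override_active": override_active,
        "force_train": force_train,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```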

Rollout Plan

1) GUI: always show the checkpoint screen; add the start-position inputs.
2) TrainingConfig: add the new fields + to_cli_args() mapping.
3) HRM HF CLI: add the new options + plumb them into the config.
4) ChunkTracker / parallel distributor: enforce lower-bound claiming.
5) Add a small diagnostics scenario to verify that:
   • a new run starts at the specified chunk
   • a resume run starts at the specified chunk
   • parallel-independent mode respects the start position
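A self-contained sketch of what the diagnostics scenario needs to assert, boiled down to the lower-bound invariant; the real scenario would drive the actual trainer rather than plain list comprehensions:

```python
def test_start_position_scenarios() -> None:
    all_chunks = [(b, c) for b in range(2) for c in range(3)]  # 2 blocks x 3 chunks
    start = (1, 1)
    completed = {(0, 0), (1, 1)}  # pretend tracker state for the resume case

    # new run: claim everything at/after the start position
    new_run = [c for c in sorted(all_chunks) if c >= start]
    assert new_run == [(1, 1), (1, 2)]

    # resume run (respect tracker): additionally skip completed chunks
    resume_run = [c for c in new_run if c not in completed]
    assert resume_run == [(1, 2)]

    # parallel-independent: every worker's share respects the same lower bound
    workers = {w: [c for i, c in enumerate(new_run) if i % 2 == w] for w in range(2)}
    assert all(c >= start for chunks in workers.values() for c in chunks)


if __name__ == "__main__":
    test_start_position_scenarios()
    print("start-position diagnostics passed")
```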