Dataset Download & Training Data Flow

This document describes how datasets are downloaded, stored, and loaded during training in AI-OS.

Overview

AI-OS uses a hierarchical data structure optimized for memory efficiency and parallel training:

Dataset (Full dataset, e.g., 10M samples)
    └── Blocks (100k samples each, stored on disk)
        └── Chunks (4k samples each, loaded into RAM)
            └── Batches (8 samples, sent to GPU)
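As a quick sanity check on the hierarchy above, the counts at each level can be worked out from the sizes shown in the diagram. This is only illustrative arithmetic; the constant names are chosen for this sketch and are not AI-OS identifiers.

```python
import math

# Sizes taken from the diagram above; names are illustrative only.
DATASET_SAMPLES = 10_000_000
SAMPLES_PER_BLOCK = 100_000
SAMPLES_PER_CHUNK = 4_000
BATCH_SIZE = 8

blocks = math.ceil(DATASET_SAMPLES / SAMPLES_PER_BLOCK)
chunks_per_block = math.ceil(SAMPLES_PER_BLOCK / SAMPLES_PER_CHUNK)
batches_per_chunk = math.ceil(SAMPLES_PER_CHUNK / BATCH_SIZE)

print(blocks, chunks_per_block, batches_per_chunk)  # 100 25 500
```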

Download System

Streaming Block Writer

When downloading datasets, AI-OS streams samples and automatically writes them to disk in blocks of 100k samples:

1. Stream samples from HuggingFace Hub (or other sources)
2. Buffer samples until 100k are collected
3. Flush to disk as a JSONL block file
4. Repeat until download complete

This provides:

- Memory efficiency: Only holds 100k samples at a time, not entire dataset
- Resumability: Blocks are saved progressively; partial downloads preserve completed blocks
- Training-ready format: Blocks match the training system's expected format
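The buffer-and-flush loop described above can be sketched roughly as follows. This is a minimal illustration, not the AI-OS implementation: `write_blocks` and its parameters are hypothetical names, and the sketch assumes each sample is a JSON-serializable dict.

```python
import json
from pathlib import Path
from typing import Iterable


def write_blocks(samples: Iterable[dict], out_dir: Path, block_size: int = 100_000) -> list:
    """Buffer samples and flush each full buffer to a JSONL block file.

    Hypothetical helper for illustration; returns the paths of the blocks written.
    """
    blocks_dir = out_dir / "blocks"
    blocks_dir.mkdir(parents=True, exist_ok=True)
    written = []  # paths of completed block files
    buffer = []   # samples waiting to be flushed

    def flush() -> None:
        # Block index is the number of blocks already written (block_00000, ...).
        path = blocks_dir / f"block_{len(written):05d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for sample in buffer:
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
        written.append(path)
        buffer.clear()

    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= block_size:
            flush()
    if buffer:
        # The final block may hold fewer than block_size samples.
        flush()
    return written
```

Because `samples` is consumed lazily, any iterator works as input, for example a dataset opened with the Hugging Face `datasets` library using `load_dataset(..., streaming=True)`. At most `block_size` samples are held in memory at once.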

Block Structure on Disk

dataset_name/
├── blocks/
│   ├── block_00000.jsonl  (100k samples)
│   ├── block_00001.jsonl  (100k samples)
│   ├── block_00002.jsonl  (up to 100k samples)
│   └── ...
└── block_manifest.json    (metadata about all blocks)

Post-Processing

For datasets not downloaded in block format, use process_raw_dataset_to_blocks():

from pathlib import Path

from aios.gui.components.dataset_download_panel.block_processor import (
    process_raw_dataset_to_blocks,
)

# Convert a raw dataset directory into JSONL blocks of 100k samples each
block_info = process_raw_dataset_to_blocks(
    input_path=Path("raw_dataset/"),
    output_dir=Path("blocked_dataset/"),
    dataset_name="my_dataset",
    block_size=100000,  # samples per block
)

Training Data Flow

1. BlockManager

The BlockManager class handles loading blocks during training:

block_manager = BlockManager(
    dataset_path="hf://dataset_name",
    samples_per_block=100000,     # 100k samples per block
    dataset_chunk_size=4000,      # 4k samples per chunk
)

Key operations:

- get_block(block_id) - Returns block metadata (not data!)
- get_chunk(block_id, chunk_id, chunk_size) - Loads specific chunk into RAM
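To illustrate what loading a single chunk from a JSONL block can look like, here is a minimal sketch that reads only the requested slice of lines. `read_chunk` is a hypothetical stand-in, not the actual `BlockManager.get_chunk()` implementation, and it assumes one JSON object per line.

```python
import json
from itertools import islice
from pathlib import Path


def read_chunk(block_path: Path, chunk_id: int, chunk_size: int = 4000) -> list:
    """Return the samples of one chunk from a JSONL block file.

    Only lines [chunk_id * chunk_size, (chunk_id + 1) * chunk_size) are parsed;
    earlier lines are skipped without being decoded as JSON.
    """
    start = chunk_id * chunk_size
    with block_path.open("r", encoding="utf-8") as f:
        lines = islice(f, start, start + chunk_size)
        return [json.loads(line) for line in lines if line.strip()]
```

The last chunk of a block may be shorter than `chunk_size`, and a `chunk_id` past the end of the file yields an empty list.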

2. Memory Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│ DISK: Full dataset (blocks/block_*.jsonl)                       │
│   - All blocks stored as JSONL files                            │
│   - Only metadata loaded initially                              │
└─────────────────────────────────────────────────────────────────┘
                              ↓ On-demand loading
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM RAM: Current chunk (4k samples, ~few MB)                 │
│   - BlockManager.get_chunk() loads one chunk at a time          │
│   - Chunk cache stores last 10 chunks for potential reuse       │
│   - Aggressive cache eviction to prevent memory growth          │
└─────────────────────────────────────────────────────────────────┘
                              ↓ Batching & tokenization
┌─────────────────────────────────────────────────────────────────┐
│ GPU VRAM: Current batch only (8 samples, tokenized tensors)     │
│   - Tokenized sequences moved to GPU                            │
│   - Forward/backward pass computed                              │
│   - Tensors released after each step                            │
└─────────────────────────────────────────────────────────────────┘
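The RAM tier above describes a bounded chunk cache feeding small batches to the GPU. One common way to build such a cache is a least-recently-used (LRU) policy; the sketch below assumes LRU eviction, which the source does not confirm. `ChunkCache` and `iter_batches` are hypothetical names, and tokenization and device transfer are left out.

```python
from collections import OrderedDict
from typing import Callable, Iterator


class ChunkCache:
    """Keep at most max_chunks loaded chunks, evicting the least recently used."""

    def __init__(self, loader: Callable[[int, int], list], max_chunks: int = 10):
        self._loader = loader        # loads a chunk given (block_id, chunk_id)
        self._max = max_chunks
        self._cache = OrderedDict()  # (block_id, chunk_id) -> list of samples

    def get(self, block_id: int, chunk_id: int) -> list:
        key = (block_id, chunk_id)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        chunk = self._loader(block_id, chunk_id)
        self._cache[key] = chunk
        while len(self._cache) > self._max:
            self._cache.popitem(last=False)  # drop the oldest entry
        return chunk

    def __len__(self) -> int:
        return len(self._cache)


def iter_batches(chunk: list, batch_size: int = 8) -> Iterator[list]:
    """Yield consecutive batches from a chunk; the last batch may be smaller."""
    for i in range(0, len(chunk), batch_size):
        yield chunk[i:i + batch_size]
```

With the default sizes, a 4k-sample chunk yields 500 batches of 8, so only one batch at a time needs to be tokenized and moved to the GPU.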

3. Parallel GPU Training

For multi-GPU training, each GPU processes unique chunks:

Block 0 (100k samples, 25 chunks of 4k each):
├── Chunk 0  → GPU 0 claims and trains
├── Chunk 1  → GPU 1 claims and trains
├── Chunk 2  → GPU 0 claims and trains (after finishing Chunk 0)
├── Chunk 3  → GPU 1 claims and trains
└── ...

The ChunkTracker ensures:

- No duplicate training (each chunk trained exactly once)
- Progress persistence (resume from any point)
- Epoch detection (when all blocks/chunks processed)
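The claim-based distribution described above can be sketched with a lock-protected queue of pending chunks. This is an illustrative stand-in, not the real `ChunkTracker`: `SimpleChunkTracker` and its methods are hypothetical, workers are modeled as threads in one process, and progress persistence is omitted.

```python
import threading
from collections import deque


class SimpleChunkTracker:
    """Hand out each (block_id, chunk_id) pair exactly once per epoch."""

    def __init__(self, num_blocks: int, chunks_per_block: int):
        self._lock = threading.Lock()
        self._pending = deque(
            (b, c) for b in range(num_blocks) for c in range(chunks_per_block)
        )
        self._claimed = {}  # (block_id, chunk_id) -> gpu_id that claimed it

    def claim_next(self, gpu_id: int):
        """Atomically claim the next unclaimed chunk, or return None when done."""
        with self._lock:
            if not self._pending:
                return None
            key = self._pending.popleft()
            self._claimed[key] = gpu_id
            return key

    def epoch_complete(self) -> bool:
        """True once every chunk has been claimed."""
        with self._lock:
            return not self._pending
```

Each worker calls `claim_next()` in a loop until it returns `None`. Faster GPUs simply claim more chunks, which matches the pattern in the diagram where GPU 0 picks up Chunk 2 after finishing Chunk 0.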

4. Long Sequence Handling

For very long sequences (>10k tokens), chunked training splits sequences further:

chunked_segment_rollout(
    model=model,
    batch=batch,              # Contains tokenized sequence
    max_segments=5,
    chunk_size=2048,          # Process 2k tokens at a time
)

This enables training on 100k+ token contexts with limited VRAM by:

- Processing sequence in 2k-token chunks
- Maintaining carry state across chunks
- Accumulating gradients
- Optional CPU offloading for extreme contexts
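The splitting and carry-state pattern can be illustrated without a model. The sketch below is not `chunked_segment_rollout`; `split_tokens` and `rollout_with_carry` are hypothetical helpers, the `step` callback stands in for a forward/backward pass, and gradient accumulation and CPU offloading are omitted.

```python
from typing import Any, Callable, Sequence


def split_tokens(token_ids: Sequence[int], chunk_size: int = 2048) -> list:
    """Split a token sequence into consecutive segments of at most chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def rollout_with_carry(
    token_ids: Sequence[int],
    step: Callable[[Sequence[int], Any], Any],
    initial_state: Any,
    chunk_size: int = 2048,
) -> Any:
    """Process segments in order, passing the carry state from one to the next."""
    state = initial_state
    for segment in split_tokens(token_ids, chunk_size):
        # Only one segment is "live" at a time, which bounds peak memory.
        state = step(segment, state)
    return state


# A 100k-token sequence becomes 49 segments of up to 2048 tokens each.
num_segments = len(split_tokens([0] * 100_000, chunk_size=2048))
```

In a real training loop the carry state would be the model's recurrent or cached state, and `step` would also accumulate gradients before the optimizer update.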

Modality Filtering

The dataset search panel supports filtering by data modality:

Modality	HF Filter Tag	Training Support
Text	`modality:text`	✅ Full support
Audio	`modality:audio`	❌ Not yet
Image	`modality:image`	❌ Not yet
Video	`modality:video`	❌ Not yet
Tabular	`modality:tabular`	❌ Not yet
Document	`modality:document`	⚠️ Text extraction needed
Geospatial	`modality:geospatial`	❌ Not yet
Time-series	`modality:timeseries`	❌ Not yet
3D	`modality:3d`	❌ Not yet

Note: Only "Text" datasets are currently supported for model training. Users can browse and download other modalities, but they won't work with the training system.

Configuration Options

Block/Chunk Sizes

# In training config
samples_per_block: 100000   # Samples per downloaded block
dataset_chunk_size: 4000    # Samples loaded into RAM at once

# Recommendations by RAM:
# - 16GB RAM: dataset_chunk_size=2000-3000
# - 32GB RAM: dataset_chunk_size=4000 (default)
# - 64GB+ RAM: dataset_chunk_size=8000+

Sequence Chunking (for long contexts)

use_chunked_training: true  # Enable for long sequences
chunk_size: 2048           # Tokens per GPU chunk

# Recommendations by VRAM:
# - 10GB VRAM: chunk_size=1024
# - 20GB VRAM: chunk_size=2048 (default)
# - 24GB+ VRAM: chunk_size=4096

Summary

The data flow in AI-OS is designed for:

- Efficient Downloads: Streaming with progressive block saves
- Memory Efficiency: Only load what's needed (chunks, not blocks)
- Parallel Training: Unique chunk distribution across GPUs
- Long Context Support: Chunked training for extreme sequence lengths
- Resumability: Progress saved at block and chunk level