
Model Serving and Inference API

Summary

Provide a clean serving path for HRM and HF baselines: ship a minimal FastAPI server for HRM inference, document how to deploy HF baselines on vLLM/TGI, and outline a Triton backend path for production-grade GPU fleets.

Motivation

  • Benchmark and A/B test HRM vs standard transformer baselines.
  • Enable downstream apps to use a stable /generate and /loglikelihood API.

Scope

In scope:
  • Minimal FastAPI server for HRM with batch generation and scoring endpoints.
  • Configuration to run under Docker.
  • Baseline serving docs for vLLM and TGI (HF models only).

Out of scope (for this PF):
  • Triton custom backend implementation (design notes only).

Integration points (repo)

  • New module: src/aios/serve/hrm_server.py (FastAPI, uvicorn)
  • Dockerfile target: --target hrm-serve (multi-stage)
  • Scripts: scripts/run_hrm_server.ps1 (Windows), scripts/run_hrm_server.sh (Linux)

API contract (v1)

  • POST /generate
  • Input: { prompts: string[], max_tokens?: number, temperature?: number, top_p?: number }
  • Output: { completions: string[], usage?: { prompt_tokens, completion_tokens } }

  • POST /loglikelihood

  • Input: { prompts: string[], continuations: string[] }
  • Output: { ll: number[] }

Comprehensive guide and checklist

This section expands PF-003 into an actionable implementation guide with CLI and GUI elements, deployment paths, and validation checklists.

Deliverables

  • HRM FastAPI server module: src/aios/serve/hrm_server.py
  • Optional minimal UI for manual testing (Gradio or Streamlit)
  • CLI launcher: aios serve hrm (via existing aios.cli.aios entrypoint)
  • Dockerfile target: hrm-serve and helper scripts for Windows/Linux
  • Baseline serving notes for vLLM and TGI
  • Acceptance tests and operational runbooks

Architecture overview

  • Client (CLI/UI/SDK) → FastAPI app → HRM inference wrapper (tokenizer + model) → Torch device (CPU/GPU)
  • Optional: request batcher (FIFO), configurable max batch size and max new tokens per request
  • Observability: structured logs + optional Prometheus metrics endpoint
  • Healthchecks: /healthz (process up), /readyz (model loaded); see the skeleton sketched below this list
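
As a concrete reference for the loading flow and healthchecks above, a minimal FastAPI skeleton might look like the following; the load_tokenizer/load_model placeholders stand in for the real HRM loading code and are not existing helpers.

# hrm_server.py skeleton (sketch): lifespan-based loading plus /healthz and /readyz
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.responses import JSONResponse

state = {"ready": False}

def load_tokenizer():  # placeholder; real code reuses the training tokenizer loader
    return object()

def load_model():      # placeholder; real code loads the HRM checkpoint
    return object()

@asynccontextmanager
async def lifespan(app: FastAPI):
    state["tokenizer"] = load_tokenizer()
    state["model"] = load_model()
    state["ready"] = True        # /readyz reports ready only after loading succeeds
    yield
    state["ready"] = False

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

@app.get("/readyz")
def readyz():
    if not state["ready"]:
        return JSONResponse({"status": "loading"}, status_code=503)
    return {"status": "ready"}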

Data models and API details

  • POST /generate
  • Request model (Pydantic; a full sketch follows this list):
    • prompts: List[str] (1..N)
    • max_tokens: int = 128 (cap at server max, e.g., 1024)
    • temperature: float = 0.7 (range [0, 2])
    • top_p: float = 1.0 (range (0, 1])
    • seed: Optional[int] (optional for reproducibility)
    • stop: Optional[List[str]] (optional stop strings)
  • Response:
    • completions: List[str]
    • usage?: { prompt_tokens: int, completion_tokens: int, total_tokens: int }
  • Errors:

    • 400: validation (e.g., empty prompts, invalid ranges)
    • 429: rate limited (optional)
    • 500: internal error
  • POST /loglikelihood

  • Request:
    • prompts: List[str]
    • continuations: List[str] (must match length of prompts)
  • Response:
    • ll: List[float] (sum of token log-probs of each continuation given prompt)
  • Errors: as above

  • GET /healthz: { status: "ok" }

  • GET /readyz: { status: "ready" } once model/tokenizer loaded
  • Optional future: /generate_stream using server-sent events (not in v1 scope)
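
To make the field list above concrete, the request/response bodies could be modelled with Pydantic roughly as follows; the numeric bounds mirror the ranges listed above, and the class names are illustrative.

# Pydantic models (sketch) for the v1 contract; class names are illustrative
from typing import List, Optional
from pydantic import BaseModel, Field

class GenerateRequest(BaseModel):
    prompts: List[str]                         # 1..N prompts; emptiness checked in the endpoint
    max_tokens: int = Field(128, ge=1, le=1024)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(1.0, gt=0.0, le=1.0)
    seed: Optional[int] = None
    stop: Optional[List[str]] = None

class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class GenerateResponse(BaseModel):
    completions: List[str]
    usage: Optional[Usage] = None

class LoglikelihoodRequest(BaseModel):
    prompts: List[str]
    continuations: List[str]                   # must match len(prompts); checked in the endpoint

class LoglikelihoodResponse(BaseModel):
    ll: List[float]                            # sum of continuation token log-probs per item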

HRM inference wrapper design

  • Artifacts and paths
  • Tokenizer: reuse loading logic from training; prefer registry paths under artifacts/hf_implant/tokenizers/ or configured --tokenizer path
  • Model weights: e.g., artifacts/brains/actv1/actv1_student.pt or artifacts/hf_implant/q_head.pt + base model
  • Config file (optional): config/default.yaml overrides

  • Batch tokenize prompts with truncation to model context window; return attention masks

  • Decoding

  • Modes: greedy (temperature=0, argmax selection) and sampled (temperature>0, optionally with top_p/top_k filtering)
  • Respect max_tokens and stop strings; early stop if EOS token encountered
  • Use torch.no_grad(); optionally enable AMP (torch.cuda.amp.autocast) when CUDA is available

  • Batching

  • Pad to max prompt length in batch; maintain mapping to return completions in order
  • Configurable max_batch_size to protect memory

  • Device selection

  • Auto-detect CUDA/ROCm/MPS; environment variable AIOS_DEVICE to force: cpu|cuda|mps

  • Pseudocode outline (a fuller sketch follows this list)

  • load_tokenizer() → load_model() → set eval, move to device
  • For /generate: encode → loop new tokens → decode to strings → postprocess stop strings
  • For /loglikelihood: teacher-forcing forward over prompt + continuation and sum log-probs at continuation positions
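
A compact sketch of the two code paths outlined above, assuming an HF-style tokenizer and a causal-LM-style model whose forward returns logits of shape (batch, seq, vocab); HRM-specific loading, halting, and stop-string handling would replace the simplified parts.

# Inference wrapper sketch; model/tokenizer interfaces assumed HF-style, HRM specifics omitted
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, tokenizer, prompts, max_tokens=128, temperature=0.7, device="cpu"):
    # NOTE: assumes left padding so position -1 is each sequence's true last token
    enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
    ids = enc["input_ids"]
    prompt_len = ids.shape[1]
    for _ in range(max_tokens):
        logits = model(input_ids=ids).logits[:, -1, :]        # next-token logits
        if temperature == 0:
            next_ids = logits.argmax(dim=-1, keepdim=True)     # greedy
        else:
            probs = F.softmax(logits / temperature, dim=-1)    # sampled (top_p filtering omitted)
            next_ids = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_ids], dim=-1)
        # per-sequence EOS / stop-string early exit omitted for brevity
    return tokenizer.batch_decode(ids[:, prompt_len:], skip_special_tokens=True)

@torch.no_grad()
def loglikelihood(model, tokenizer, prompts, continuations, device="cpu"):
    lls = []
    for prompt, cont in zip(prompts, continuations):
        p_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
        c_ids = tokenizer(cont, add_special_tokens=False, return_tensors="pt")["input_ids"].to(device)
        ids = torch.cat([p_ids, c_ids], dim=-1)
        logits = model(input_ids=ids).logits                   # (1, seq, vocab)
        logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)    # predicts tokens 1..seq-1
        token_ll = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        lls.append(token_ll[:, -c_ids.shape[1]:].sum().item())  # keep only continuation positions
    return lls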

Configuration model

Precedence: CLI flags > Env vars > YAML config defaults.

  • CLI flags (examples):
  • --host 0.0.0.0 --port 8000
  • --model-path artifacts/brains/actv1/actv1_student.pt
  • --tokenizer-path artifacts/hf_implant/tokenizers/<name>
  • --device cpu|cuda|mps
  • --max-batch-size 16 --max-new-tokens 256 --context-window 2048
  • --enable-cors --cors-origins *
  • --log-level info (honors logging.yaml if present)

  • Env vars:

  • AIOS_MODEL_PATH, AIOS_TOKENIZER_PATH, AIOS_DEVICE, AIOS_MAX_BATCH_SIZE, AIOS_MAX_NEW_TOKENS, AIOS_CORS_ORIGINS

  • YAML (optional): serve.hrm section in config/default.yaml (a resolution sketch follows below)
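
One way to implement the CLI > env > YAML precedence, assuming a serve.hrm block in config/default.yaml; the key names and fallback values here are illustrative.

# Config resolution sketch: CLI flags override env vars, which override YAML defaults
import os
import yaml

def load_serve_config(cli_args: dict, yaml_path: str = "config/default.yaml") -> dict:
    cfg = {"host": "0.0.0.0", "port": 8000, "device": "cpu",
           "max_batch_size": 16, "max_new_tokens": 256}             # hard-coded fallbacks
    try:
        with open(yaml_path) as f:                                   # optional YAML layer
            cfg.update((yaml.safe_load(f) or {}).get("serve", {}).get("hrm", {}))
    except FileNotFoundError:
        pass
    env_map = {"AIOS_MODEL_PATH": "model_path", "AIOS_TOKENIZER_PATH": "tokenizer_path",
               "AIOS_DEVICE": "device", "AIOS_MAX_BATCH_SIZE": "max_batch_size",
               "AIOS_MAX_NEW_TOKENS": "max_new_tokens"}
    for env, key in env_map.items():
        if os.environ.get(env):
            cfg[key] = os.environ[env]
    cfg.update({k: v for k, v in cli_args.items() if v is not None})  # CLI flags win last
    return cfg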

CLI: serve commands

  • Primary: aios serve hrm (wired via existing aios.cli.aios)
  • Example (Windows PowerShell):
    • aios serve hrm --host 0.0.0.0 --port 8000 --model-path artifacts/brains/actv1/actv1_student.pt --tokenizer-path artifacts/hf_implant/tokenizers/base --device cpu
  • Example (module invocation):

    • .venv\Scripts\python.exe -m aios.serve.hrm_server --host 0.0.0.0 --port 8000
  • Admin helpers (optional):

  • aios serve hrm --print-config
  • aios serve hrm --dry-run (load-only, no HTTP)
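
The module invocation above could be backed by a small argparse + uvicorn launcher; a sketch under those assumptions (flag names follow the configuration section, and the app import path is hypothetical):

# Launcher sketch for `python -m aios.serve.hrm_server`; app import path is hypothetical
import argparse
import uvicorn

def main():
    parser = argparse.ArgumentParser(description="HRM FastAPI server")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model-path")
    parser.add_argument("--tokenizer-path")
    parser.add_argument("--device", choices=["cpu", "cuda", "mps"])
    parser.add_argument("--dry-run", action="store_true", help="load artifacts, then exit without serving")
    args = parser.parse_args()
    if args.dry_run:
        # load tokenizer/model here to validate paths, then return without binding a port
        return
    uvicorn.run("aios.serve.hrm_server:app", host=args.host, port=args.port)

if __name__ == "__main__":
    main()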

Minimal GUI (optional) for manual testing

  • Choice: Gradio (simpler) or Streamlit
  • Usage intent: quick sanity checks by product/QA without curl or code
  • Proposed: aios serve hrm --ui opens a small panel:
  • Input textarea for prompt, sliders for max_tokens, temperature, top_p
  • Buttons "Generate" and "Score loglikelihood"
  • Display output text and usage metrics

  • Local run (example; a Gradio app launches directly via python, a Streamlit app via streamlit run):

  • python -m aios.serve.hrm_ui --server-url http://localhost:8000

Note: The UI is helpful but not required for acceptance of PF-003. If code is deferred, include only docs and a future task for hrm_ui.py.
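
If the UI is built, a Gradio variant of hrm_ui.py that simply forwards to the HTTP API could look roughly like this; the server URL default and widget layout are illustrative, and a Streamlit version would be equally valid.

# hrm_ui.py sketch: thin Gradio panel over the /generate endpoint
import gradio as gr
import requests

SERVER_URL = "http://localhost:8000"   # in the real module this would come from --server-url

def generate(prompt, max_tokens, temperature, top_p):
    resp = requests.post(f"{SERVER_URL}/generate", json={
        "prompts": [prompt], "max_tokens": int(max_tokens),
        "temperature": temperature, "top_p": top_p})
    resp.raise_for_status()
    body = resp.json()
    return body["completions"][0], body.get("usage", {})

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt", lines=4)
    max_tokens = gr.Slider(1, 1024, value=128, step=1, label="max_tokens")
    temperature = gr.Slider(0.0, 2.0, value=0.7, label="temperature")
    top_p = gr.Slider(0.01, 1.0, value=1.0, label="top_p")
    out = gr.Textbox(label="Completion")
    usage = gr.JSON(label="Usage")
    gr.Button("Generate").click(generate, [prompt, max_tokens, temperature, top_p], [out, usage])

demo.launch()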

Docker and containerization

  • Dockerfile target: hrm-serve (multi-stage). Example build and run on Windows PowerShell:
# Build
docker build -t aios/hrm-serve:local --target hrm-serve .

# Run (CPU)
docker run --rm -p 8000:8000 `
  -e AIOS_MODEL_PATH=artifacts/brains/actv1/actv1_student.pt `
  -e AIOS_TOKENIZER_PATH=artifacts/hf_implant/tokenizers/base `
  aios/hrm-serve:local

# Run (CUDA) – requires NVIDIA Container Toolkit
docker run --rm -p 8000:8000 --gpus all `
  -e AIOS_DEVICE=cuda `
  aios/hrm-serve:local
  • docker-compose service snippet:
services:
  hrm:
    image: aios/hrm-serve:local
    ports: ["8000:8000"]
    environment:
      AIOS_MODEL_PATH: artifacts/brains/actv1/actv1_student.pt
      AIOS_TOKENIZER_PATH: artifacts/hf_implant/tokenizers/base
      AIOS_DEVICE: cpu
    # optional GPU reservation; set AIOS_DEVICE: cuda when enabling it
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  • Helper scripts to add:
  • scripts/run_hrm_server.ps1
  • scripts/run_hrm_server.sh

Baseline serving (A/B) with vLLM and TGI

  • vLLM (HF checkpoint):
docker run --rm -p 8001:8000 `
  -v ${PWD}\artifacts\hf_implant\base_model:/model `
  vllm/vllm-openai:latest `
  --model /model

Hit using OpenAI-compatible API:

curl -s http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
  "model":"/model",
  "prompt":"Hello",
  "max_tokens":16
}' | jq .
  • Text Generation Inference (TGI):
docker run --rm -p 8002:80 `
  -v ${PWD}\artifacts\hf_implant\base_model:/data `
  ghcr.io/huggingface/text-generation-inference:latest `
  --model-id /data

Hit using TGI API:

curl -s http://localhost:8002/generate -H "Content-Type: application/json" -d '{
  "inputs": "Hello",
  "parameters": {"max_new_tokens": 16, "temperature": 0.7, "top_p": 0.9}
}' | jq .
  • Mapping to our contract
  • Our /generate → vLLM /v1/completions or OpenAI Chat; TGI /generate
  • For A/B, normalize fields: max_tokens ↔ max_new_tokens; temperature and top_p map directly (see the adapter sketch below)
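
A small adapter illustrating that normalization for A/B harnesses; the request dict follows our /generate contract, while the defaults and the vLLM model name (matching the docker example above) are assumptions.

# Request adapters for A/B runs: our /generate payload -> vLLM (OpenAI) and TGI payloads
def to_vllm(req: dict, model: str = "/model") -> dict:
    # vLLM's OpenAI-compatible /v1/completions; one prompt per call in this sketch
    return {"model": model, "prompt": req["prompts"][0],
            "max_tokens": req.get("max_tokens", 128),
            "temperature": req.get("temperature", 0.7),
            "top_p": req.get("top_p", 1.0)}

def to_tgi(req: dict) -> dict:
    # TGI's /generate uses max_new_tokens instead of max_tokens
    return {"inputs": req["prompts"][0],
            "parameters": {"max_new_tokens": req.get("max_tokens", 128),
                           "temperature": req.get("temperature", 0.7),
                           "top_p": req.get("top_p", 1.0)}}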

Observability and security

  • Logging: use logging.yaml if present; otherwise default to INFO with request/response IDs (omit bodies in prod logs)
  • Metrics: optional /metrics (Prometheus) for request counts/latency/tokens
  • Tracing: optional OpenTelemetry if configured
  • CORS: --enable-cors and whitelist origins
  • Auth (optional): bearer token via Authorization: Bearer <token>; deny when missing (see the middleware sketch below)
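
A sketch of how request IDs and the optional bearer-token check could be combined in one FastAPI middleware; the AIOS_API_TOKEN variable name and the log format are assumptions, and real code would use the configured logger rather than print.

# FastAPI middleware sketch: request IDs on every call, optional bearer-token auth
import os
import time
import uuid
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def request_context(request: Request, call_next):
    request_id = str(uuid.uuid4())
    expected = os.environ.get("AIOS_API_TOKEN")          # auth enforced only if the token is set
    if expected and request.headers.get("Authorization") != f"Bearer {expected}":
        return JSONResponse({"detail": "unauthorized"}, status_code=401)
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    # log method, path, status, and latency; omit bodies in production logs
    print(f"{request_id} {request.method} {request.url.path} "
          f"{response.status_code} {1000 * (time.perf_counter() - start):.1f}ms")
    return response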

Testing, acceptance, and checklists

Functional quick test (CPU):

# 1) Start server (example)
aios serve hrm --device cpu --port 8000 --model-path artifacts/brains/actv1/actv1_student.pt --tokenizer-path artifacts/hf_implant/tokenizers/base

# 2) Generate two prompts
curl -s http://localhost:8000/generate -H "Content-Type: application/json" -d '{
  "prompts": ["Hello", "Once upon a time"],
  "max_tokens": 8,
  "temperature": 0.7,
  "top_p": 0.9
}' | jq .

# 3) Loglikelihood
curl -s http://localhost:8000/loglikelihood -H "Content-Type: application/json" -d '{
  "prompts": ["Hello"],
  "continuations": [" world"]
}' | jq .

Readiness checklist

  • [ ] Server starts and /readyz returns ready within 60s
  • [ ] /generate returns completions for 2 prompts under 2s on CPU (sample model)
  • [ ] /loglikelihood returns values and shape matches inputs
  • [ ] Handles batch of 16 prompts without OOM; memory stable over 5 runs
  • [ ] Error cases return 400/422 with clear message
  • [ ] Logs include request IDs and timing
  • [ ] Docker image builds and runs locally
  • [ ] Optional UI launches and can call server
  • [ ] A/B against vLLM/TGI produces comparable outputs with same seeds

Load and reliability

  • [ ] Sustains 10 RPS with batch size 8 on CPU sample model (target; adjust per hardware)
  • [ ] Backpressure: requests rejected with 429 when queue is full (if enabled)
  • [ ] Graceful shutdown drains in-flight requests

Production runbook (starter)

  • Startup
  • Verify drivers (CUDA) or plan CPU
  • Warmup: send a 1-token request to trigger kernel/JIT compilation up front
  • Confirm /readyz and sample /generate

  • Scaling

  • Horizontal: run multiple replicas behind a load balancer; sticky sessions not required
  • Vertical: tune max_batch_size, max_new_tokens, and context window to fit memory

  • Troubleshooting

  • 500 at startup: verify paths for model/tokenizer; run with --dry-run
  • CUDA OOM: reduce batch size or max_new_tokens; ensure torch.cuda.empty_cache() between runs if needed
  • Slow responses: disable AMP on CPU; pin threads (OMP_NUM_THREADS)
  • CORS blocked: set --enable-cors --cors-origins * for dev only

Triton backend (design notes, out of scope)

  • Shape the HRM inference as a stateless backend with request batching and token cache
  • Inputs: token IDs, attention mask, decode params; Outputs: next tokens/logprobs
  • Consider KV cache management and paged attention for large contexts

Implementation steps

1) HRM inference wrapper
  • Reuse tokenizer loading from train_actv1.py helpers.
  • Load actv1_student.pt and implement a simple forward for greedy/sampled decoding.
  • Add batching and torch.no_grad(); AMP optional.

2) FastAPI app
  • Create endpoints per contract; validate inputs; return JSON.
  • Add health endpoint and basic logging.

3) Packaging
  • Add new Docker target with a slim runtime (CUDA base optional).
  • Provide PowerShell and Bash helpers to run locally.

4) Baseline docs
  • Document how to start vLLM/TGI for a matching HF model and how to hit similar endpoints for A/B.

Testing and acceptance criteria

  • Local run: Start server, send /generate with 2 prompts, receive 2 completions within reasonable latency on CPU/GPU.
  • Error handling: Invalid inputs return 400 with clear messages.
  • Load: Handle batch of 16 prompts without crash; memory stable.

Risks and mitigations

  • HRM decoding path may need custom halting logic → start with simple decode, iterate.
  • Windows GPU drivers: Recommend WSL2 or use CPU fallback for local dev.

Milestones

  • M1 (1–2 days): Minimal server + local run; sample client.
  • M2 (1 day): Docker target and docs; baseline serving notes.