# Model Serving and Inference API

## Summary
Provide a clean serving path for HRM and HF baselines. Ship a minimal FastAPI server for HRM inference and document how to deploy HF baselines on vLLM/TGI. For production-grade GPU fleets, outline a Triton backend path.
## Motivation
- Benchmark and A/B test HRM vs standard transformer baselines.
- Enable downstream apps to use a stable `/generate` and `/loglikelihood` API.
## Scope

In scope:

- Minimal FastAPI server for HRM with batch generation and scoring endpoints.
- Configuration to run under Docker.
- Baseline serving docs for vLLM and TGI (HF models only).

Out of scope (for this PF):

- Triton custom backend implementation (design notes only).
## Integration points (repo)

- New module: `src/aios/serve/hrm_server.py` (FastAPI, uvicorn)
- Dockerfile target: `--target hrm-serve` (multi-stage)
- Scripts: `scripts/run_hrm_server.ps1` (Windows), `scripts/run_hrm_server.sh` (Linux)
## API contract (v1)

- POST `/generate`
  - Input: `{ prompts: string[], max_tokens?: number, temperature?: number, top_p?: number }`
  - Output: `{ completions: string[], usage?: { prompt_tokens, completion_tokens } }`
- POST `/loglikelihood`
  - Input: `{ prompts: string[], continuations: string[] }`
  - Output: `{ ll: number[] }`
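For reference, a tiny Python sample client exercising both endpoints might look like the following (a sketch that assumes the contract above and a server listening on `localhost:8000`):

```python
# Hypothetical sample client for the v1 contract (field names as specified above).
import requests

BASE_URL = "http://localhost:8000"  # assumption: local HRM server

gen = requests.post(
    f"{BASE_URL}/generate",
    json={"prompts": ["Hello", "Once upon a time"], "max_tokens": 16, "temperature": 0.7},
    timeout=60,
)
gen.raise_for_status()
print(gen.json()["completions"])

ll = requests.post(
    f"{BASE_URL}/loglikelihood",
    json={"prompts": ["Hello"], "continuations": [" world"]},
    timeout=60,
)
ll.raise_for_status()
print(ll.json()["ll"])
```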
## Comprehensive guide and checklist
This section expands PF-003 into an actionable implementation guide with CLI and GUI elements, deployment paths, and validation checklists.
## Deliverables

- HRM FastAPI server module: `src/aios/serve/hrm_server.py`
- Optional minimal UI for manual testing (Gradio or Streamlit)
- CLI launcher: `aios serve hrm` (via the existing `aios.cli.aios` entrypoint)
- Dockerfile target: `hrm-serve` and helper scripts for Windows/Linux
- Baseline serving notes for vLLM and TGI
- Acceptance tests and operational runbooks
## Architecture overview

- Client (CLI/UI/SDK) → FastAPI app → HRM inference wrapper (tokenizer + model) → Torch device (CPU/GPU)
- Optional: request batcher (FIFO) with configurable max batch size and max new tokens per request
- Observability: structured logs + optional Prometheus metrics endpoint
- Healthchecks: `/healthz` (process up), `/readyz` (model loaded)
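A minimal sketch of the app wiring and healthchecks described above; `HRMEngine` and its module path are placeholders for the inference wrapper detailed in the later sections, not existing code:

```python
# Sketch of src/aios/serve/hrm_server.py wiring (request/response models are covered
# in the "Data models and API details" section below).
from fastapi import FastAPI, HTTPException

app = FastAPI(title="HRM Inference API")
engine = None  # set once the model and tokenizer are loaded


@app.on_event("startup")
def load_engine() -> None:
    global engine
    # Hypothetical loader that reads AIOS_MODEL_PATH / AIOS_TOKENIZER_PATH / AIOS_DEVICE.
    from aios.serve.engine import HRMEngine  # placeholder module path
    engine = HRMEngine.from_env()


@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}  # process is up


@app.get("/readyz")
def readyz() -> dict:
    if engine is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "ready"}
```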
## Data models and API details

- POST `/generate`
  - Request model (Pydantic):
    - `prompts: List[str]` (1..N)
    - `max_tokens: int = 128` (capped at a server max, e.g., 1024)
    - `temperature: float = 0.7` (range [0, 2])
    - `top_p: float = 1.0` (range (0, 1])
    - `seed: Optional[int]` (optional, for reproducibility)
    - `stop: Optional[List[str]]` (optional stop strings)
  - Response:
    - `completions: List[str]`
    - `usage?: { prompt_tokens: int, completion_tokens: int, total_tokens: int }`
  - Errors:
    - 400: validation (e.g., empty prompts, invalid ranges)
    - 429: rate limited (optional)
    - 500: internal error
- POST `/loglikelihood`
  - Request:
    - `prompts: List[str]`
    - `continuations: List[str]` (must match the length of `prompts`)
  - Response:
    - `ll: List[float]` (sum of token log-probs of each continuation given its prompt)
  - Errors: as above
- GET `/healthz`: `{ status: "ok" }`
- GET `/readyz`: `{ status: "ready" }` once the model/tokenizer are loaded
- Optional future: `/generate_stream` using server-sent events (not in v1 scope)
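A possible Pydantic (v1-style) rendering of the request models above; the constraints mirror the listed ranges, and the length check enforces the prompts/continuations pairing:

```python
# Request models matching the /generate and /loglikelihood contracts (constraints illustrative).
from typing import List, Optional

from pydantic import BaseModel, Field, validator


class GenerateRequest(BaseModel):
    prompts: List[str] = Field(..., min_items=1)
    max_tokens: int = Field(128, ge=1, le=1024)        # server-side cap, e.g. 1024
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(1.0, gt=0.0, le=1.0)
    seed: Optional[int] = None
    stop: Optional[List[str]] = None


class LoglikelihoodRequest(BaseModel):
    prompts: List[str] = Field(..., min_items=1)
    continuations: List[str]

    @validator("continuations")
    def lengths_match(cls, v, values):
        # Contract: one continuation per prompt.
        if "prompts" in values and len(v) != len(values["prompts"]):
            raise ValueError("continuations must match the length of prompts")
        return v
```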
## HRM inference wrapper design

- Artifacts and paths
  - Tokenizer: reuse loading logic from training; prefer registry paths under `artifacts/hf_implant/tokenizers/` or a configured `--tokenizer` path
  - Model weights: e.g., `artifacts/brains/actv1/actv1_student.pt` or `artifacts/hf_implant/q_head.pt` + base model
  - Config file (optional): `config/default.yaml` overrides
- Batch tokenize `prompts` with truncation to the model context window; return attention masks
- Decoding
  - Modes: greedy (temperature=0 or top_p=1, top_k=None), sampled (temperature>0 and/or top_p<1)
  - Respect `max_tokens` and `stop` strings; stop early if the EOS token is encountered
  - Use `torch.no_grad()`; AMP optional (`torch.cuda.amp.autocast`) when CUDA is available
- Batching
  - Pad to the max prompt length in the batch; maintain a mapping so completions are returned in order
  - Configurable `max_batch_size` to protect memory
- Device selection
  - Auto-detect CUDA/ROCm/MPS; environment variable `AIOS_DEVICE` to force `cpu|cuda|mps`
- Pseudocode outline
  - `load_tokenizer()` → `load_model()` → set eval, move to device
  - For `/generate`: encode → loop over new tokens → decode to strings → postprocess stop strings
  - For `/loglikelihood`: teacher-forcing forward over prompt+continuation and sum log-probs at the continuation positions
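To make the `/loglikelihood` step concrete, here is an illustrative teacher-forcing scorer. It assumes an HF-style tokenizer and a forward pass that exposes `logits` of shape `[batch, seq, vocab]`; the actual HRM forward signature may differ.

```python
# Illustrative loglikelihood scoring via teacher forcing (sums continuation log-probs).
import torch
import torch.nn.functional as F


@torch.no_grad()
def loglikelihood(model, tokenizer, prompts, continuations, device="cpu"):
    scores = []
    for prompt, cont in zip(prompts, continuations):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        cont_ids = tokenizer(cont, add_special_tokens=False, return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, cont_ids], dim=1)

        logits = model(input_ids).logits                      # assumption: HF-style output with .logits
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # position t predicts token t+1
        targets = input_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

        cont_len = cont_ids.shape[1]
        scores.append(token_lp[0, -cont_len:].sum().item())   # sum over continuation positions only
    return scores
```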
## Configuration model

Precedence: CLI flags > env vars > YAML config defaults.

- CLI flags (examples):
  - `--host 0.0.0.0 --port 8000`
  - `--model-path artifacts/brains/actv1/actv1_student.pt`
  - `--tokenizer-path artifacts/hf_implant/tokenizers/<name>`
  - `--device cpu|cuda|mps`
  - `--max-batch-size 16 --max-new-tokens 256 --context-window 2048`
  - `--enable-cors --cors-origins *`
  - `--log-level info` (honors `logging.yaml` if present)
- Env vars: `AIOS_MODEL_PATH`, `AIOS_TOKENIZER_PATH`, `AIOS_DEVICE`, `AIOS_MAX_BATCH_SIZE`, `AIOS_MAX_NEW_TOKENS`, `AIOS_CORS_ORIGINS`
- YAML (optional): `config/default.yaml` → `serve.hrm` section
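For illustration, the optional `serve.hrm` block in `config/default.yaml` could take roughly this shape (key names are assumptions, not a finalized schema):

```yaml
serve:
  hrm:
    host: 0.0.0.0
    port: 8000
    model_path: artifacts/brains/actv1/actv1_student.pt
    tokenizer_path: artifacts/hf_implant/tokenizers/base
    device: cpu              # cpu | cuda | mps
    max_batch_size: 16
    max_new_tokens: 256
    context_window: 2048
    cors_origins: ["*"]      # dev only
```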
## CLI: serve commands

- Primary: `aios serve hrm` (wired via the existing `aios.cli.aios` entrypoint)
- Example (Windows PowerShell): `aios serve hrm --host 0.0.0.0 --port 8000 --model-path artifacts/brains/actv1/actv1_student.pt --tokenizer-path artifacts/hf_implant/tokenizers/base --device cpu`
- Example (module invocation): `.venv\Scripts\python.exe -m aios.serve.hrm_server --host 0.0.0.0 --port 8000`
- Admin helpers (optional): `aios serve hrm --print-config`, `aios serve hrm --dry-run` (load-only, no HTTP)
## Minimal GUI (optional) for manual testing

- Choice: Gradio (simpler) or Streamlit
- Usage intent: quick sanity checks by product/QA without curl or code
- Proposed: `aios serve hrm --ui` opens a small panel:
  - Input textarea for the prompt; sliders for `max_tokens`, `temperature`, `top_p`
  - Buttons "Generate" and "Score loglikelihood"
  - Display of output text and usage metrics
- Local run (Streamlit example): `python -m aios.serve.hrm_ui --server-url http://localhost:8000`

Note: The UI is helpful but not required for acceptance of PF-003. If code is deferred, include only docs and a future task for `hrm_ui.py`.
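If the Gradio option is chosen, a minimal panel could look like the sketch below (illustrative only; the real `hrm_ui.py` may use Streamlit instead, as noted above, and the server URL is an assumption):

```python
# Minimal Gradio panel calling the HRM server over HTTP (names and URL are illustrative).
import gradio as gr
import requests

SERVER_URL = "http://localhost:8000"  # assumption: HRM server running locally


def generate(prompt: str, max_tokens: int, temperature: float, top_p: float) -> str:
    resp = requests.post(
        f"{SERVER_URL}/generate",
        json={"prompts": [prompt], "max_tokens": int(max_tokens),
              "temperature": temperature, "top_p": top_p},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["completions"][0]


demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt", lines=4),
        gr.Slider(1, 1024, value=128, step=1, label="max_tokens"),
        gr.Slider(0.0, 2.0, value=0.7, label="temperature"),
        gr.Slider(0.05, 1.0, value=1.0, label="top_p"),
    ],
    outputs=gr.Textbox(label="Completion"),
    title="HRM manual test panel",
)

if __name__ == "__main__":
    demo.launch()
```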
## Docker and containerization

- Dockerfile target: `hrm-serve` (multi-stage). Example build and run (Windows PowerShell):

```powershell
# Build
docker build -t aios/hrm-serve:local --target hrm-serve .

# Run (CPU)
docker run --rm -p 8000:8000 `
  -e AIOS_MODEL_PATH=artifacts/brains/actv1/actv1_student.pt `
  -e AIOS_TOKENIZER_PATH=artifacts/hf_implant/tokenizers/base `
  aios/hrm-serve:local

# Run (CUDA) – requires the NVIDIA Container Toolkit
docker run --rm -p 8000:8000 --gpus all `
  -e AIOS_DEVICE=cuda `
  aios/hrm-serve:local
```
- docker-compose service snippet:

```yaml
services:
  hrm:
    image: aios/hrm-serve:local
    ports: ["8000:8000"]
    environment:
      AIOS_MODEL_PATH: artifacts/brains/actv1/actv1_student.pt
      AIOS_TOKENIZER_PATH: artifacts/hf_implant/tokenizers/base
      AIOS_DEVICE: cpu
    # GPU reservation (only needed for CUDA runs; drop on CPU-only hosts)
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
- Helper scripts to add: `scripts/run_hrm_server.ps1`, `scripts/run_hrm_server.sh`
## Baseline serving (A/B) with vLLM and TGI

- vLLM (HF checkpoint):

```powershell
docker run --rm -p 8001:8000 `
  -v $PWD\artifacts\hf_implant\base_model:/model `
  vllm/vllm-openai:latest `
  --model /model
```

Hit it using the OpenAI-compatible API (the served model name defaults to the `--model` path):

```
curl -s http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
  "model": "/model",
  "prompt": "Hello",
  "max_tokens": 16
}' | jq .
```
- Text Generation Inference (TGI):

```powershell
docker run --rm -p 8002:80 `
  -v $PWD\artifacts\hf_implant\base_model:/data `
  ghcr.io/huggingface/text-generation-inference:latest `
  --model-id /data
```

Hit it using the TGI API:

```
curl -s http://localhost:8002/generate -H "Content-Type: application/json" -d '{
  "inputs": "Hello",
  "parameters": {"max_new_tokens": 16, "temperature": 0.7, "top_p": 0.9}
}' | jq .
```
- Mapping to our contract
  - Our `/generate` → vLLM `/v1/completions` (or OpenAI Chat); TGI `/generate`
  - For A/B, normalize fields: `max_tokens ↔ max_new_tokens`, `temperature`, `top_p`
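A small adapter sketch for normalizing sampling fields across the three backends during A/B runs (helper names are illustrative; payload shapes follow the examples above):

```python
# Illustrative request adapters: the same sampling knobs, mapped onto each backend's field names.
def to_hrm(prompt: str, max_tokens: int, temperature: float, top_p: float) -> dict:
    return {"prompts": [prompt], "max_tokens": max_tokens,
            "temperature": temperature, "top_p": top_p}


def to_vllm(prompt: str, max_tokens: int, temperature: float, top_p: float,
            model: str = "/model") -> dict:
    # OpenAI-compatible /v1/completions payload.
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens,
            "temperature": temperature, "top_p": top_p}


def to_tgi(prompt: str, max_tokens: int, temperature: float, top_p: float) -> dict:
    # TGI /generate payload: max_tokens maps to max_new_tokens.
    return {"inputs": prompt,
            "parameters": {"max_new_tokens": max_tokens,
                           "temperature": temperature, "top_p": top_p}}
```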
## Observability and security

- Logging: use `logging.yaml` if present; otherwise default to INFO with request/response IDs (omit bodies in prod logs)
- Metrics: optional `/metrics` (Prometheus) for request counts, latency, and tokens
- Tracing: optional OpenTelemetry if configured
- CORS: `--enable-cors` with a whitelist of origins
- Auth (optional): bearer token via `Authorization: Bearer <token>`; deny requests when it is missing
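A possible shape for the optional bearer-token check, implemented as a FastAPI dependency (the `AIOS_API_TOKEN` env var is an assumption, not part of the contract):

```python
# Optional bearer-token guard; requests are denied when the token is missing or wrong.
import os

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()


def require_token(request: Request) -> None:
    expected = os.environ.get("AIOS_API_TOKEN")  # assumed env var; auth disabled if unset
    if not expected:
        return
    auth = request.headers.get("Authorization", "")
    if auth != f"Bearer {expected}":
        raise HTTPException(status_code=401, detail="missing or invalid bearer token")


@app.post("/generate", dependencies=[Depends(require_token)])
def generate(req: dict):
    # Request model elided in this sketch; see the data models above.
    return {"completions": []}
```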
## Testing, acceptance, and checklists

Functional quick test (CPU):

```
# 1) Start the server (example)
aios serve hrm --device cpu --port 8000 --model-path artifacts/brains/actv1/actv1_student.pt --tokenizer-path artifacts/hf_implant/tokenizers/base

# 2) Generate for two prompts
curl -s http://localhost:8000/generate -H "Content-Type: application/json" -d '{
  "prompts": ["Hello", "Once upon a time"],
  "max_tokens": 8,
  "temperature": 0.7,
  "top_p": 0.9
}' | jq .

# 3) Loglikelihood
curl -s http://localhost:8000/loglikelihood -H "Content-Type: application/json" -d '{
  "prompts": ["Hello"],
  "continuations": [" world"]
}' | jq .
```
Readiness checklist

- [ ] Server starts and `/readyz` returns ready within 60 s
- [ ] `/generate` returns completions for 2 prompts in under 2 s on CPU (sample model)
- [ ] `/loglikelihood` returns values and the output shape matches the inputs
- [ ] Handles a batch of 16 prompts without OOM; memory stable over 5 runs
- [ ] Error cases return 400/422 with a clear message
- [ ] Logs include request IDs and timing
- [ ] Docker image builds and runs locally
- [ ] Optional UI launches and can call the server
- [ ] A/B against vLLM/TGI produces comparable outputs with the same seeds

Load and reliability

- [ ] Sustains 10 RPS with batch size 8 on a CPU sample model (target; adjust per hardware)
- [ ] Backpressure: requests are rejected with 429 when the queue is full (if enabled)
- [ ] Graceful shutdown drains in-flight requests
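One way to realize the optional backpressure and 429 behavior is an in-flight request cap; the sketch below uses an asyncio semaphore (the limit and helper names are assumptions):

```python
# Illustrative backpressure guard: reject with 429 once all in-flight slots are taken.
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_IN_FLIGHT = 32                      # assumed limit; tune alongside max_batch_size
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)


@app.post("/generate")
async def generate(req: dict):
    if _slots.locked():                 # no free slot -> queue considered full
        raise HTTPException(status_code=429, detail="server busy, retry later")
    async with _slots:
        # Run the blocking inference call in a worker thread so the event loop stays responsive.
        return await asyncio.to_thread(run_inference, req)


def run_inference(req: dict) -> dict:
    ...  # placeholder for the HRM inference wrapper call
    return {"completions": []}
```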
## Production runbook (starter)

- Startup
  - Verify drivers (CUDA) or plan for CPU
  - Warmup: send a 1-token request to pre-JIT kernels
  - Confirm `/readyz` and a sample `/generate`
- Scaling
  - Horizontal: run multiple replicas behind a load balancer; sticky sessions are not required
  - Vertical: tune `max_batch_size`, `max_new_tokens`, and the context window to fit memory
- Troubleshooting
  - 500 at startup: verify the model/tokenizer paths; run with `--dry-run`
  - CUDA OOM: reduce the batch size or `max_new_tokens`; call `torch.cuda.empty_cache()` between runs if needed
  - Slow responses: disable AMP on CPU; pin threads (`OMP_NUM_THREADS`)
  - CORS blocked: set `--enable-cors --cors-origins *` for dev only
## Triton backend (design notes, out of scope)
- Shape the HRM inference as a stateless backend with request batching and token cache
- Inputs: token IDs, attention mask, decode params; Outputs: next tokens/logprobs
- Consider KV cache management and paged attention for large contexts
## Implementation steps

1) HRM inference wrapper
   - Reuse tokenizer loading from `train_actv1.py` helpers.
   - Load `actv1_student.pt` and implement a simple forward pass for greedy/sampled decoding.
   - Add batching and `torch.no_grad()`; AMP optional.
2) FastAPI app
   - Create endpoints per the contract; validate inputs; return JSON.
   - Add health endpoints and basic logging.
3) Packaging
   - Add a new Docker target with a slim runtime (CUDA base optional).
   - Provide PowerShell and Bash helpers to run locally.
4) Baseline docs
   - Document how to start vLLM/TGI for a matching HF model and how to hit comparable endpoints for A/B.
## Testing and acceptance criteria

- Local run: start the server, send `/generate` with 2 prompts, and receive 2 completions within reasonable latency on CPU/GPU.
- Error handling: invalid inputs return 400 with clear messages.
- Load: handles a batch of 16 prompts without crashing; memory stays stable.
## Risks and mitigations
- HRM decoding path may need custom halting logic → start with simple decode, iterate.
- Windows GPU drivers: Recommend WSL2 or use CPU fallback for local dev.
## Milestones
- M1 (1–2 days): Minimal server + local run; sample client.
- M2 (1 day): Docker target and docs; baseline serving notes.