Multimodal and Specialized Tokenizers
⚠️ CRITICAL NOTICE - NOT YET IMPLEMENTED ⚠️
STATUS: 🚧 PLANNED FEATURE - NOT CURRENTLY AVAILABLE 🚧
The multimodal and specialized tokenizers described in this document are NOT YET IMPLEMENTED in AI-OS. This document describes planned future functionality.
What works NOW: - ✅ GPT-2, Qwen 2.5, Mistral 7B, Code Llama, DeepSeek-Coder V2, StarCoder2, Phi-3 Mini
What does NOT work yet: - ❌ CLIP, LLaVA, SigLIP (Vision/Multimodal) - ❌ BioBERT, SciBERT, Legal-BERT, FinBERT (Specialized domains) - ❌ Image processing pipeline - ❌ Vision-language model support
Last Verified: October 13, 2025
For currently supported tokenizers, see:
docs/SUPPORTED_TOKENIZERS.md (to be created)
Date: October 13, 2025
Major Update: Added Vision, Multimodal, and Domain-Specialized Tokenizers
⚠️ Status: DOCUMENTATION ONLY - IMPLEMENTATION PENDING
What's New (PLANNED)
Vision & Multimodal Tokenizers (3)
Train AI models that understand images, videos, and vision-language tasks:
| Tokenizer | Vocab | Best For | Downloaded |
|---|---|---|---|
| CLIP (Vision-Language) 🖼️ | 49,408 | Image-text understanding, zero-shot vision | ✅ Yes |
| LLaVA 1.5 (Vision Chat) 💬 | 32,000 | Visual Q&A, image understanding, multimodal chat | ✅ Yes |
| SigLIP 🎯 | 32,000 | Advanced vision-language, better than CLIP | ✅ Yes |
🔬 Specialized Domain Tokenizers (4)
Train AI models for specific professional domains:
| Tokenizer | Vocab | Best For | Downloaded |
|---|---|---|---|
| BioBERT 🧬 | 28,996 | Biomedical, healthcare, medical research | ✅ Yes |
| SciBERT 🔬 | 31,090 | Scientific papers, academic research | ✅ Yes |
| Legal-BERT ⚖️ | 30,522 | Legal documents, contracts, compliance | ⚠️ No |
| FinBERT 💰 | 30,873 | Financial analysis, trading, markets | ⚠️ No |
Complete Tokenizer Collection
Total: 16 Tokenizers across all categories!
General Purpose (4)
- ✅ Qwen 2.5 - 151K vocab (Best overall, 2025)
- Llama 3 - 128K vocab (Requires auth)
- Mistral 7B - 32K vocab
- GPT-2 - 50K vocab (Legacy)
Code Specialized (3)
- ✅ DeepSeek-Coder V2 - 100K vocab (Best for code, 2025)
- StarCoder2 - 49K vocab (600+ languages)
- Code Llama - 32K vocab (Stable)
Vision & Multimodal (3)
- CLIP (Vision-Language) - 49K vocab (Image-text)
- LLaVA 1.5 - 32K vocab (Visual chat)
- SigLIP - 32K vocab (Advanced vision)
Specialized Domains (4)
- BioBERT - 29K vocab (Biomedical)
- SciBERT - 31K vocab (Scientific)
- Legal-BERT - 31K vocab (Legal)
- FinBERT - 31K vocab (Financial)
Compact & Efficient (2)
- Phi-3 Mini - 32K vocab
- GPT-2 Base Model - 50K vocab (Backward compatible)
Vision & Multimodal Use Cases
CLIP (Vision-Language)
What it does: - Understands relationships between images and text - Zero-shot image classification - Image-text retrieval - Image generation guidance
Training Examples: - Image captioning datasets - Visual question answering - Image-text matching - Multimodal embeddings
Best for: - Building image search engines - Visual understanding systems - Text-to-image generation - Multimodal AI assistants
Model Architecture:
Input: Image + Text
↓
CLIP Tokenizer (49,408 vocab)
↓
Dual Encoders (Vision + Language)
↓
Shared Embedding Space
↓
Output: Similarity Scores / Embeddings
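As a concrete illustration of this flow, here is a minimal sketch using the Hugging Face transformers CLIP classes; the checkpoint id "openai/clip-vit-base-patch32" and file names are assumptions for illustration, not the AI-OS wiring.

```python
# Minimal sketch (assumes the Hugging Face checkpoint "openai/clip-vit-base-patch32";
# AI-OS integration may differ).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("fox.jpg")
texts = ["a red fox in a forest", "a city skyline at night"]

# Tokenize the text (49,408-token vocab) and preprocess the image in one call
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = better image-text match (zero-shot classification)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```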
LLaVA 1.5 (Vision Chat)
What it does: - Visual question answering - Image understanding and description - Multimodal conversation - Visual reasoning
Training Examples: - VQA datasets (Visual Question Answering) - Image description pairs - Visual instruction following - Multimodal dialogue
Best for: - Visual chatbots - Image analysis assistants - Accessibility tools (image description) - Educational AI tutors
Model Architecture:
Input: Image + Question
↓
Vision Encoder (ViT) + LLaVA Tokenizer (32K)
↓
Multimodal Fusion
↓
Language Model Decoder
↓
Output: Text Answer
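For reference, a hedged sketch of how LLaVA 1.5 inputs are usually assembled with transformers; the checkpoint id "llava-hf/llava-1.5-7b-hf" and the prompt template are assumptions about the upstream release, not AI-OS behavior.

```python
# Minimal sketch (assumes the transformers checkpoint "llava-hf/llava-1.5-7b-hf";
# the exact prompt template and processor wiring in AI-OS may differ).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# The <image> placeholder is replaced by vision-encoder patch tokens at fusion time
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
inputs = processor(text=prompt, images=Image.open("chart.png"), return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```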
SigLIP (Sigmoid Loss Vision)
What it does: - Improved vision-language alignment - Better zero-shot classification - More efficient training (sigmoid loss vs softmax) - Enhanced image understanding
Training Examples: - Large-scale image-text pairs - Visual classification tasks - Image retrieval datasets - Cross-modal learning
Best for: - Advanced vision-language models - Efficient multimodal training - Zero-shot vision tasks - Production vision systems
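The "sigmoid loss vs softmax" point can be made concrete with a short sketch: instead of normalizing scores across the whole batch, SigLIP scores every image-text pair independently. A minimal PyTorch version (tensor names and the temperature/bias defaults are illustrative):

```python
# SigLIP-style pairwise sigmoid loss sketch (PyTorch; names illustrative).
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * temperature + bias
    # +1 on the diagonal (matched pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```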
🔬 Specialized Domain Use Cases
BioBERT (Biomedical) 🧬
What it does: - Understanding medical terminology - Biomedical named entity recognition - Clinical text analysis - Drug-disease relationship extraction
Training Examples: - PubMed abstracts - Clinical notes - Medical research papers - Drug interaction databases - Disease symptom datasets
Best for: - Medical AI assistants - Clinical decision support - Biomedical research tools - Healthcare chatbots - Drug discovery AI
Pre-trained on: - 4.5B words from PubMed abstracts - 13.5B words from PMC full-text articles
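To see the domain vocabulary in action, a quick sketch with the transformers tokenizer; the checkpoint id "dmis-lab/biobert-base-cased-v1.1" is an assumption and may differ from the id used by scripts/download_tokenizer.py.

```python
# Minimal sketch (assumes the Hugging Face checkpoint "dmis-lab/biobert-base-cased-v1.1").
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

sentence = "Metformin reduces hepatic gluconeogenesis in type 2 diabetes patients."
tokens = tok.tokenize(sentence)

print(len(tok))    # vocabulary size (28,996)
print(tokens)      # WordPiece pieces; biomedical terms split into few fragments
```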
SciBERT (Scientific) 🔬
What it does: - Understanding scientific terminology - Research paper analysis - Citation relationship extraction - Scientific entity recognition
Training Examples: - Academic papers (1.14M papers) - Scientific abstracts - Research datasets - Technical documentation - Lab reports
Best for: - Research paper summarization - Literature review automation - Scientific question answering - Academic writing assistance - Grant proposal analysis
Pre-trained on: - 1.14M papers from Semantic Scholar - 18% computer science papers - 82% biomedical papers
Legal-BERT (Legal) ⚖️
What it does: - Understanding legal terminology - Contract analysis - Legal document classification - Case law reasoning
Training Examples: - Legal contracts - Court opinions - Legislation text - Legal briefs - Regulatory documents
Best for: - Contract review automation - Legal research assistants - Compliance checking - Document classification - Legal chatbots
Pre-trained on: - Legal corpora - Case law databases - Legislation texts
FinBERT (Financial) 💰
What it does: - Financial sentiment analysis - Market trend prediction - Financial entity recognition - Risk assessment
Training Examples: - Financial news articles - Earnings reports - Market analysis - Trading data - Economic indicators
Best for: - Trading algorithms - Financial news analysis - Risk assessment tools - Investment research - Financial chatbots
Pre-trained on: - Financial news and reports - Corporate filings - Market commentary
🎯 Choosing the Right Tokenizer
For Image Understanding & Generation
Use: CLIP 🖼️ - Large vocabulary (49K) - Industry-standard vision-language model - Zero-shot capabilities - Compatible with Stable Diffusion and other image models
Example Projects: - Image search engines - Content moderation - Visual similarity matching - Text-to-image applications
For Visual Question Answering
Use: LLaVA 1.5 💬 - Optimized for visual chat - Instruction following - Multimodal conversation - Image understanding
Example Projects: - Visual assistants - Image description tools - Educational tutors - Accessibility applications
For Biomedical AI
Use: BioBERT 🧬 - Medical terminology expertise - Clinical text understanding - Biomedical entity recognition
Example Projects: - Clinical decision support - Medical record analysis - Drug discovery - Healthcare chatbots
For Scientific Research
Use: SciBERT 🔬 - Scientific vocabulary - Research paper understanding - Technical documentation
Example Projects: - Literature review tools - Research assistants - Paper summarization - Citation analysis
For Legal Applications
Use: Legal-BERT ⚖️ - Legal terminology - Contract understanding - Case law analysis
Example Projects: - Contract review - Legal research tools - Compliance checking - Document classification
For Financial Analysis
Use: FinBERT 💰 - Financial terminology - Market sentiment - Economic understanding
Example Projects: - Trading algorithms - Market analysis - Risk assessment - Financial news processing
Quick Start Examples
Creating a Vision AI Brain
# 1. Download CLIP tokenizer
python scripts/download_tokenizer.py clip-vit
# 2. Open AI-OS GUI
# 3. HRM Training → Create New
# 4. Configure:
Preset: 10M (or preferred size)
Name: vision-understanding-v1
Goal: Understand and describe images, answer visual questions
Tokenizer: CLIP (Vision-Language) (49,408 tokens)
# 5. Create & Train on image-text datasets!
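As a quick sanity check after step 1, the tokenizer can be loaded and its vocabulary size compared against the GUI entry; this sketch assumes the download step caches the standard Hugging Face CLIP tokenizer (adjust the id or local path to match your setup).

```python
# Sanity check for the downloaded CLIP tokenizer (checkpoint id is an assumption).
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tok.vocab_size)                               # expected: 49408, matching the GUI entry
print(tok.tokenize("a red fox in a misty forest"))  # BPE pieces used during training
```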
Creating a Medical AI Brain
# 1. Download BioBERT tokenizer
python scripts/download_tokenizer.py biobert
# 2. Configure brain:
Name: medical-assistant-v1
Goal: Analyze medical texts and provide clinical insights
Tokenizer: BioBERT (Biomedical) (28,996 tokens)
# 3. Train on medical datasets (PubMed, clinical notes, etc.)
Creating a Scientific Research Brain
# 1. Download SciBERT tokenizer
python scripts/download_tokenizer.py scibert
# 2. Configure brain:
Name: research-assistant-v1
Goal: Understand scientific papers and answer research questions
Tokenizer: SciBERT (Scientific) (31,090 tokens)
# 3. Train on scientific papers and technical docs
Training Dataset Recommendations
For Vision Models (CLIP, LLaVA, SigLIP)
Public Datasets: - COCO - Image captioning (330K images) - Flickr30k - Image descriptions (31K images) - Visual Genome - Dense annotations (108K images) - Conceptual Captions - 3.3M image-text pairs - VQA v2 - Visual question answering (1M+ questions)
Custom Data: - Product images + descriptions - Screenshots + instructions - Medical images + diagnoses - Satellite imagery + labels
For Biomedical (BioBERT)
Public Datasets: - PubMed Abstracts - Medical research - MIMIC-III - Clinical notes (requires approval) - PMC-OA - Full-text biomedical articles - ChEMBL - Drug-target interactions - UMLS - Medical terminology
For Scientific (SciBERT)
Public Datasets: - ArXiv - Scientific papers - Semantic Scholar - Academic papers - PubMed Central - Biomedical literature - CiteSeer - Computer science papers - IEEE Xplore - Engineering papers
For Legal (Legal-BERT)
Public Datasets: - EDGAR - SEC filings - Court Listener - Court opinions - EUR-Lex - EU legal documents - CaseLaw Access Project - US case law
For Financial (FinBERT)
Public Datasets: - Financial PhraseBank - Sentiment analysis - StockTwits - Market sentiment - SEC Filings - Corporate documents - Reuters Financial News - Bloomberg News Archives
Multimodal Training Pipeline
Vision-Language Training Flow
1. Prepare Dataset
   ├── Images (.jpg, .png)
   ├── Captions/Labels (.txt, .json)
   └── Pair them correctly
2. Choose Tokenizer
   └── CLIP, LLaVA, or SigLIP
3. Create Brain
   └── Select vision tokenizer in GUI
4. Configure Training
   ├── Image preprocessing (resize, normalize)
   ├── Text tokenization
   └── Batch pairing (image + text)
5. Train Model
   ├── Vision encoder learns image features
   ├── Language encoder learns text features
   └── Contrastive loss aligns both spaces
6. Evaluate
   ├── Image retrieval accuracy
   ├── Text retrieval accuracy
   └── Zero-shot classification
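The "contrastive loss aligns both spaces" step can be sketched as a symmetric CLIP-style InfoNCE objective; the tensor names below are illustrative, not AI-OS code.

```python
# Minimal CLIP-style contrastive loss sketch (PyTorch; names illustrative).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; the diagonal holds the matched pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image→text and text→image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```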
🧪 Testing Results
All New Tokenizers Verified ✅
Vision & Multimodal
─────────────────────────────
✅ CLIP: Downloaded & Tested
✅ LLaVA 1.5: Downloaded & Tested
✅ SigLIP: Downloaded (verification warning)
Specialized Domains
─────────────────────────────
✅ BioBERT: Downloaded & Tested
✅ SciBERT: Downloaded & Tested
⚠️ Legal-BERT: Not yet downloaded
⚠️ FinBERT: Not yet downloaded
Test Brains Created
| Brain Name | Tokenizer | Use Case |
|---|---|---|
| CLIP-(Vision-Language)-test | clip-vit | Image-text understanding |
| LLaVA-1.5-(Vision-Chat)-test | llava-1.5 | Visual Q&A |
| BioBERT-(Biomedical)-test | biobert | Medical text |
| SciBERT-(Scientific)-test | scibert | Research papers |
💡 Advanced Use Cases
Robotics Training (Future)
While we don't have an RT-2 tokenizer yet, you can prepare for robotics:
Using Current Tokenizers: 1. Code tokenizers (DeepSeek-Coder V2, StarCoder2) for action sequences 2. Vision tokenizers (CLIP, LLaVA) for visual perception 3. Qwen 2.5 for multimodal reasoning
Action Tokenization Concept:
# Represent robot actions as tokens
robot_actions = {
"move_forward": 1001,
"move_backward": 1002,
"turn_left": 1003,
"turn_right": 1004,
"grasp": 1005,
"release": 1006,
# ... etc
}
# Train on: Vision → Action Sequences
Input: [image_tokens] + [instruction_tokens]
Output: [action_token_1, action_token_2, ..., action_token_n]
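One way to realize this concept on top of an existing text tokenizer is to reserve dedicated action tokens; the sketch below uses transformers' add_tokens and is purely illustrative (the token names and base checkpoint are assumptions, not an AI-OS API).

```python
# Illustrative sketch: reserve action tokens on top of an existing text tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
action_tokens = ["<move_forward>", "<move_backward>", "<turn_left>",
                 "<turn_right>", "<grasp>", "<release>"]
tok.add_tokens(action_tokens, special_tokens=True)

# Instruction text tokenizes normally; actions map to single reserved ids
prompt_ids = tok.encode("pick up the red block")
action_ids = tok.convert_tokens_to_ids(["<grasp>", "<move_forward>", "<release>"])
print(prompt_ids, action_ids)  # the model would learn to emit action_ids after prompt_ids
```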
Video Understanding
Using Vision Tokenizers: - CLIP or LLaVA can process video frames - Tokenize each frame individually - Temporal modeling at model level
Workflow:
Video → Extract Frames (e.g., 1 fps)
↓
Each Frame → Vision Tokenizer
↓
Sequence of Frame Tokens
↓
Temporal Model (Transformer)
↓
Video Understanding
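The "extract frames" step might look like this with OpenCV; the paths and the 1 fps rate are illustrative.

```python
# Frame-extraction sketch for the workflow above (OpenCV; paths illustrative).
import os
import cv2

def extract_frames(video_path: str, out_dir: str, every_n_seconds: float = 1.0) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_seconds), 1)

    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

extract_frames("clip.mp4", "frames")  # each saved frame then goes through the vision tokenizer
```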
Audio-Visual Models
Multimodal Combination:
Audio: Use text tokenizer for transcribed speech
Vision: Use CLIP/LLaVA for visual frames
Fusion: Combine at embedding level
Input: [audio_transcript_tokens] + [video_frame_tokens]
Output: Multimodal understanding
📦 Installation Summary
Downloaded & Ready (13 Tokenizers)
✅ gpt2, gpt2-base-model
✅ qwen2.5-7b (Best overall)
✅ mistral-7b
✅ deepseek-coder-v2 (Best for code)
✅ starcoder2
✅ codellama
✅ phi3-mini
✅ clip-vit (Vision-language)
✅ llava-1.5 (Vision chat)
✅ siglip (Advanced vision)
✅ biobert (Biomedical)
✅ scibert (Scientific)
Available to Download (3 Tokenizers)
⚠️ Llama 3 (Requires authentication)
⚠️ Legal-BERT (Legal)
⚠️ FinBERT (Financial)
Learning Resources
Vision-Language Models
- CLIP Paper: https://arxiv.org/abs/2103.00020
- LLaVA Paper: https://arxiv.org/abs/2304.08485
- SigLIP: https://arxiv.org/abs/2303.15343
Domain-Specific Models
- BioBERT: https://arxiv.org/abs/1901.08746
- SciBERT: https://arxiv.org/abs/1903.10676
- Legal-BERT: https://arxiv.org/abs/2010.02559
- FinBERT: https://arxiv.org/abs/1908.10063
Robotics & Embodied AI
- RT-2: https://robotics-transformer2.github.io/
- PaLM-E: https://palm-e.github.io/
FAQ
Q: Can I train image generation models with CLIP?
A: CLIP is for understanding, but can guide generation (like in Stable Diffusion). For generation, you'd need image decoder models.
Q: Do I need special hardware for vision models?
A: Vision models need more VRAM. Recommend 8GB+ GPU for small models, 24GB+ for larger ones.
Q: Can I mix tokenizers (e.g., vision + text)?
A: Not directly. Choose one tokenizer per brain. For multimodal, use tokenizers designed for both (CLIP, LLaVA).
Q: Are specialized tokenizers better than general ones?
A: For their specific domains, YES! BioBERT understands medical terms much better than GPT-2.
Q: Can I train on video datasets?
A: Yes! Extract frames, tokenize each frame, and model temporal relationships at the architecture level.
Q: How do I prepare medical/scientific datasets?
A: Start with public datasets (PubMed, ArXiv), then add your own domain-specific data.
Q: Will you add RT-2 for robotics?
A: RT-2 requires special action tokenization. We're exploring options for embodied AI support!
Summary
Once implemented, AI-OS will provide 16 specialized tokenizers covering:
- ✅ General AI (Qwen 2.5, Llama 3, Mistral)
- ✅ Code (DeepSeek-Coder V2, StarCoder2, CodeLlama)
- ✅ Vision & Multimodal (CLIP, LLaVA, SigLIP)
- ✅ Biomedical (BioBERT)
- ✅ Scientific (SciBERT)
- ✅ Legal (Legal-BERT - available)
- ✅ Financial (FinBERT - available)
Ready to train specialized AI for any domain!
Last Updated: October 13, 2025
Vision and specialized tokenizer support documented; implementation pending
🚧 Planned Expansion: Image & Video Generation (Design Spec)
Status: PLANNED - Architecture and UX defined below. No runtime support in main yet.
This addendum extends the tokenizer roadmap to full multimodal generation and scientific-output workflows. It defines the architecture, data contracts, and UI spec required to support:
- Text-to-image, image-to-image, inpainting/outpainting, ControlNet-guided image generation
- Image-to-video and text-to-video (short clips) generation
- Scientific outputs in chat: rich plots, tables, LaTeX, HTML reports, downloadable files
- A new Chat Window v2 that can render most multimodal outputs
High-level Phases
1) Foundations: artifact store, rich message schema, viewers (images/tables/plots/latex)
2) Image generation (inference-first): SDXL/LCM, ControlNet, prompt/seed reproducibility
3) Video generation (inference-first): SVD (image→video), lightweight text→video integration
4) Scientific outputs: plot/table/LaTeX/HTML/file attachments, export flows
5) Optional training: dreambooth/LoRA for images; fine-tuning where feasible for video
🖼️ Image Generation (Planned)
Supported (Planned) pipelines and modalities:
- Text→Image: SD 1.5 / SDXL via Diffusers; optional LCM for fast steps
- Image→Image: style transfer/variations with strength control
- Inpainting/Outpainting: masked editing
- Conditioning: ControlNet (canny, depth, pose), IP-Adapter (face/style), LoRA adapters
Implementation notes: - Runtime: Hugging Face Diffusers (+ transformers, accelerate, safetensors, xformers) - Precision: fp16/bf16 on GPU; CPU fallback (slow) for dev - Reproducibility: seed, guidance_scale, scheduler stored with artifact metadata - Safety filter hooks: optional NSFW classifier toggle (user-controlled)
CLI (planned examples):
aios gen image --model sdxl --prompt "a red fox in a misty forest, photorealistic" \
--steps 20 --seed 42 --size 1024x1024 --guidance-scale 7.5 \
--out artifacts/outputs/images/
aios gen image2image --model sd15 --init-image path/to/img.png --strength 0.6 \
--prompt "studio portrait, soft lighting" --lora face_finetune.safetensors
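Under the hood, the planned command would likely wrap a Diffusers pipeline; a hedged sketch of the text→image path (the checkpoint id and output path are assumptions, not an existing AI-OS API):

```python
# What the planned `aios gen image` command might wrap (Diffusers sketch).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # seed stored with artifact metadata
image = pipe(
    "a red fox in a misty forest, photorealistic",
    num_inference_steps=20,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("artifacts/outputs/images/fox.png")
```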
Video Generation (Planned)
Initial scope focuses on inference and short clips:
- Image→Video: Stable Video Diffusion (SVD), 14-24 fps, 2-4 s clips
- Text→Image→Video: generate a keyframe with SDXL, then animate via SVD
- Text→Video (experimental): integrate an open model (e.g., CogVideoX small) when stable
Key controls: - fps (e.g., 8-24), duration (seconds), resolution (e.g., 576p/720p), motion strength - Seed and scheduler captured for reproducibility
CLI (planned examples):
aios gen video --from-image artifacts/outputs/images/fox.jpg --model svd --fps 14 --seconds 3
aios gen video --prompt "drone shot over snowy mountains at sunrise" \
--pipeline txt2img+svd --fps 12 --seconds 4 --seed 123
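The image→video path would similarly wrap Stable Video Diffusion in Diffusers; a hedged sketch (checkpoint id, paths, and parameters are assumptions, not an existing AI-OS API):

```python
# What the planned `aios gen video` image→video path might wrap (Diffusers SVD sketch).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

image = load_image("artifacts/outputs/images/fox.jpg")
generator = torch.Generator("cuda").manual_seed(123)   # captured for reproducibility
frames = pipe(image, num_frames=25, fps=14, generator=generator).frames[0]
export_to_video(frames, "artifacts/outputs/videos/fox.mp4", fps=14)
```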
Dataset prep (future training): frame extraction to frames/, JSON pairs with timing; ffmpeg helpers.
🧪 Scientific Output Interfaces (Planned)
The assistant will produce structured scientific artifacts alongside text:
- Plots: Plotly JSON spec (preferred for interactivity) or Matplotlib PNG fallback
- Tables: column schema + row data; optional CSV/Parquet attachment for download
- Equations: LaTeX rendered with KaTeX in UI
- Reports: minimal HTML (sanitized) and Markdown export
- Files: CSV/JSON/NetCDF/HDF5/ZIP as downloadable artifacts
- 3D/Geo (future): GLB/PLY previews; GeoJSON/tiles via map component (opt-in)
Server responsibilities: - Validate payload sizes and types; store artifacts with content hash - Generate preview thumbnails/posters where applicable - Attach metadata (units, provenance, seeds, code refs)
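To illustrate the intended payloads, a tool might assemble a plot part and a table part as sketched below; the field names follow the concept schemas in the next two sections and are not final.

```python
# Sketch of a tool producing a Plotly-JSON plot part plus a table part
# (field names mirror the planned message-part concept; not a final API).
import json
import plotly.graph_objects as go

fig = go.Figure(data=[go.Scatter(x=[0, 1, 2], y=[1.0, 0.7, 0.9], name="signal")])
fig.update_layout(title="Detector response", xaxis_title="time (s)", yaxis_title="amplitude")

plot_part = {"type": "plot", "spec": json.loads(fig.to_json())}
table_part = {
    "type": "table",
    "columns": [{"name": "time", "type": "number", "unit": "s"},
                {"name": "amplitude", "type": "number", "unit": ""}],
    "rows": [[0, 1.0], [1, 0.7], [2, 0.9]],
}
```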
🧩 Chat Window v2 - Rich Message Schema (Planned)
The chat UI will display multi-part messages. Each message may contain text plus a list of parts. Parts reference inline payloads or stored artifacts.
Message (concept):
{
"id": "msg_01",
"role": "assistant",
"parts": [
{ "type": "text", "text": "Here is the generated image." },
{ "type": "image", "source": { "type": "artifact", "id": "art_abc" }, "alt": "red fox" },
{ "type": "plot", "spec": { "version": 2, "data": [], "layout": {} } },
{ "type": "table", "columns": [{"name":"time","type":"number"}], "rows": [[0],[1]],
"download": { "artifactId": "art_tbl" } },
{ "type": "latex", "code": "E=mc^2" },
{ "type": "file", "artifactId": "art_zip", "name": "results.zip" }
]
}
Artifact (concept):
{
"id": "art_abc",
"kind": "image",
"mime": "image/png",
"path": "artifacts/outputs/images/fox.png",
"sha256": "...",
"width": 1024,
"height": 1024,
"createdAt": "2025-10-15T12:34:56Z",
"metadata": {
"prompt": "a red fox in a misty forest",
"seed": 42,
"steps": 20,
"guidance_scale": 7.5,
"scheduler": "EulerA"
}
}
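A minimal sketch of how the artifact store could derive the content hash and write a metadata sidecar, mirroring the concept above (the function name and sidecar layout are assumptions, not a final API):

```python
# Content-hash + metadata-sidecar sketch for the planned artifact store.
import hashlib
import json
from pathlib import Path

def register_artifact(path: str, kind: str, metadata: dict) -> dict:
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    record = {
        "id": "art_" + digest[:12],
        "kind": kind,
        "path": path,
        "sha256": digest,
        "metadata": metadata,
    }
    # Write a JSON sidecar next to the artifact for later lookup
    Path(path + ".json").write_text(json.dumps(record, indent=2))
    return record

register_artifact("artifacts/outputs/images/fox.png", "image", {"seed": 42, "steps": 20})
```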
UI Behaviors: - Text streams first; media parts render when artifacts are ready - Image/video gallery with zoom; video player with poster/thumb - Plotly interactive viewer; CSV download from table; LaTeX inline via KaTeX - Side "Artifacts" panel lists recent outputs with search and filters - Tool-run logs collapsible; copy-to-clipboard for prompts/configs
Security & Limits: - Sanitize HTML; disallow inline scripts; CSP headers - Size limits per part and per message; chunked uploads for large files
Architecture Addendum (Planned)
Logical components:
1) Generation Runtimes
- Diffusers pipelines: SD 1.5, SDXL, SVD (+ ControlNet, IP-Adapter, LoRA)
- Scheduler registry; VRAM-aware autotune
2) Artifact Store
- Path: artifacts/outputs/{images|videos|plots|tables|files}/ (content-hashed)
- Metadata JSON sidecars; thumbnail/poster derivation jobs
3) Messaging API
- Returns messages with parts; emits events: text-delta, artifact-pending, artifact-ready
4) Chat UI v2
- React components: Text, Image, Video, Plotly, Table, LaTeX, File
- View toggles: compact/threaded/gallery
5) CLI/HRM Integrations
- aios gen image|video commands; HRM tool to attach outputs to runs
Deployment considerations:
- Windows, Linux GPU; optional CPU fallback for dev
- CUDA/cuDNN versions pinned; xformers optional
- Model weights cached in training_data/hf_cache/
Data Contracts (Planned)
MessagePart types: - text, image, video, audio (future), plot, table, latex, html (sanitized), file
Common fields:
- type, source (inline|artifact), mime (when applicable), metadata
Table contract (sketch):
{
"type": "table",
"columns": [ {"name":"col","type":"string","unit":""} ],
"rows": [ ["value"] ],
"download": { "artifactId": "art_csv" }
}
Plotly contract: store full spec and optional screenshot thumbnail.
🧪 Acceptance Criteria (Milestones)
M1 - Foundations: Render images, tables (CSV download), Plotly, LaTeX in Chat v2; artifact store with hashing + metadata; events wired
M2 - Image Generation (Inference): SDXL text→image end-to-end via CLI and Chat tool; reproducible artifacts (seed/scheduler stored); ControlNet optional
M3 - Video Generation (Inference): SVD image→video; keyframe path (txt→img→vid) exposed in CLI and Chat; video player in UI with fps/duration metadata
M4 - Scientific Outputs: Plot/table/LaTeX produced by tools; downloadable CSV/ZIP; simple HTML report (sanitized) preview and export
M5 - UX Polish & Reliability: Gallery view, artifact panel, retries, error toasts, rate limits
🛠️ Dependencies (Planned)
Python - diffusers, transformers, accelerate, torch, safetensors, xformers (optional) - pillow, opencv-python, ffmpeg-python (or system ffmpeg)
Frontend - KaTeX, Plotly.js, React Player (or video element), MIME renderers
⚠️ Risks & Considerations
- VRAM constraints: add auto-downscale and low-VRAM schedulers
- Large artifacts: background upload and streaming; disk quotas
- Content safety: optional NSFW filters; user control and transparency
- Licensing: check weights and dataset licenses; document restrictions
🧪 Try-it (Future, subject to change)
# Text → Image
aios gen image --model sdxl --prompt "scientific diagram of DNA helix, vector style" --size 1024x1024
# Image → Video
aios gen video --from-image artifacts/outputs/images/diagram.png --model svd --fps 12 --seconds 3
# Scientific Plot in Chat (tool)
aios tools plot --spec path/to/plotly.json --attach
This specification updates the planned feature set to include multimodal generation and scientific outputs with a concrete data and UI contract, without claiming availability today.