embedding (or embed) command. Inference runs entirely on-device — no external API required.
Supported Models
| Model | HuggingFace Repo | Parameters | Dimensions | Context | Quantizations |
|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | Fastiraz/Qwen3-Embedding-0.6B-GGUF | 0.6B | up to 1024 (configurable 32–1024) | 32K tokens | Q8_0 (639 MB), F16 (1.2 GB) |
Quickstart
1. Pull the embedding modelOptions
| Flag | Type | Default | Description |
|---|---|---|---|
-m <MODEL> | string | (required) | Path or filename of the GGUF embedding model |
--ngl <N> | i32 | 99 | Number of layers to offload to GPU (0 = CPU-only) |
--verbose <LEVEL> | string | error | Log verbosity: none, debug, info, warn, error |
<INPUT>... | string | (required) | One or more texts to embed |
Examples
Single inputOutput Format
The output is a JSON array of float arrays:NONE pooling, per-token embeddings are mean-pooled and re-normalized into a single vector per input.
Notes
Model details
Model details
Qwen3-Embedding-0.6B supports a 32K token context window and produces embeddings up to 1024 dimensions. It covers 100+ languages and supports instruction-aware embedding to boost retrieval accuracy by 1–5% on task-specific queries.
Context and batch limits
Context and batch limits
Each input is limited to 512 tokens internally by Akio’s batch size. The model itself supports up to 32K tokens per sequence. Chunk long documents before embedding them.
GPU acceleration
GPU acceleration
GPU offloading (
--ngl) works the same as akio run. Requires Akio to be compiled with CUDA or Metal support. Use --ngl 0 to force CPU inference.Comparing embeddings
Comparing embeddings
Because all vectors are L2-normalized, cosine similarity between two vectors equals their dot product. You can compute it directly without extra normalization.