Akio supports local text embedding using the embedding (or embed) command. Inference runs entirely on-device — no external API required.

Supported Models

| Model | HuggingFace Repo | Parameters | Dimensions | Context | Quantizations |
| --- | --- | --- | --- | --- | --- |
| Qwen3-Embedding-0.6B | Fastiraz/Qwen3-Embedding-0.6B-GGUF | 0.6B | up to 1024 (configurable 32–1024) | 32K tokens | Q8_0 (639 MB), F16 (1.2 GB) |

Quickstart

1. Pull the embedding model
akio pull Fastiraz/Qwen3-Embedding-0.6B-GGUF
2. Embed some text
akio embedding -m Fastiraz/Qwen3-Embedding-0.6B-GGUF "Hello, world!"
Akio prints a JSON array of L2-normalized float vectors — one per input.
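
To use the vectors from a script, the command can be wrapped in a small helper. The following is a minimal Python sketch, not part of Akio: the names build_command, parse_vectors, and embed are hypothetical, and it assumes that stdout carries only the JSON array (with log messages going elsewhere), which this page does not state explicitly.

```python
import json
import subprocess

# Model ID used throughout this page.
MODEL = "Fastiraz/Qwen3-Embedding-0.6B-GGUF"


def build_command(texts, model=MODEL, ngl=None):
    """Build the argv list for `akio embed` (hypothetical helper)."""
    if not texts:
        raise ValueError("at least one input text is required")
    cmd = ["akio", "embed", "-m", model]
    if ngl is not None:
        # --ngl is described under Options; 0 forces CPU-only inference.
        cmd += ["--ngl", str(ngl)]
    return cmd + list(texts)


def parse_vectors(stdout, expected_count):
    """Parse the JSON array of float arrays and check one vector per input."""
    vectors = json.loads(stdout)
    if len(vectors) != expected_count:
        raise ValueError(f"expected {expected_count} vectors, got {len(vectors)}")
    return vectors


def embed(texts, model=MODEL, ngl=None):
    """Run akio and return one vector per input text, in input order.

    Assumption: stdout contains nothing but the JSON array.
    """
    result = subprocess.run(
        build_command(texts, model, ngl),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_vectors(result.stdout, len(texts))
```

Calling embed(["Hello, world!"]) would then return a list containing a single vector, provided akio is on the PATH and the model has been pulled. Passing the texts as separate argv entries avoids shell quoting issues, though an input that begins with a dash may still be read as a flag.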

Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| -m <MODEL> | string | (required) | Path or filename of the GGUF embedding model |
| --ngl <N> | i32 | 99 | Number of layers to offload to GPU (0 = CPU-only) |
| --verbose <LEVEL> | string | error | Log verbosity: none, debug, info, warn, error |
| <INPUT>... | string | (required) | One or more texts to embed |

Examples

Single input
akio embedding \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  "The quick brown fox jumps over the lazy dog"
Multiple inputs in one call
akio embed \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  "The cat sat on the mat" \
  "A quick brown fox" \
  "Akio runs inference locally"
Returns a JSON array with one vector per input, in the same order.
CPU-only
akio embed \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  --ngl 0 \
  "embed this on CPU"

Output Format

The output is a JSON array of float arrays:
[
  [0.021, -0.045, 0.113, ...],
  [0.004, 0.078, -0.032, ...]
]
Each inner array is L2-normalized (unit length). For models that use NONE pooling, per-token embeddings are mean-pooled and re-normalized into a single vector per input.
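
The pooling step described above can be illustrated in a few lines of Python. This is a sketch of the general technique (mean pooling followed by L2 normalization), not Akio's actual implementation; all function names are hypothetical, and the tolerance in is_unit_length is an arbitrary choice to allow for rounding in printed floats.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def mean_pool(token_vectors):
    """Average per-token vectors component-wise into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[i] for tok in token_vectors) / count for i in range(dim)]


def pool_and_normalize(token_vectors):
    """Mean-pool per-token embeddings, then re-normalize the result."""
    return l2_normalize(mean_pool(token_vectors))


def is_unit_length(vec, tol=1e-3):
    """Check that a vector's L2 norm is 1 within a tolerance."""
    return abs(math.sqrt(sum(x * x for x in vec)) - 1.0) <= tol
```

For example, pooling the two token vectors [3.0, 0.0] and [3.0, 8.0] gives the mean [3.0, 4.0], which has length 5 and normalizes to approximately [0.6, 0.8]. The is_unit_length check can serve as a sanity test on vectors parsed from Akio's output.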

Notes

Qwen3-Embedding-0.6B has a 32K-token context window and produces embeddings of up to 1024 dimensions. It covers 100+ languages and supports instruction-aware embedding, which can improve retrieval accuracy by 1–5% on task-specific queries.
Although the model itself supports up to 32K tokens per sequence, Akio's internal batch size limits each input to 512 tokens. Chunk long documents before embedding them.
GPU offloading (--ngl) works the same as in akio run and requires Akio to be compiled with CUDA or Metal support. Use --ngl 0 to force CPU inference.
Because all vectors are L2-normalized, cosine similarity between two vectors equals their dot product. You can compute it directly without extra normalization.
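
The chunking and similarity notes can be combined into a simple retrieval flow: split a long document, embed each chunk, then rank chunks against a query by dot product. The Python sketch below makes several assumptions. It splits on whitespace, so word count is only a rough proxy for the tokenizer's token count; the default of 300 words is a conservative guess to stay under the 512-token limit, not a value taken from Akio; and all function names are hypothetical.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into overlapping windows of whitespace-separated words.

    Word count only approximates token count, so max_words is kept
    well below the 512-token per-input limit (an assumed safety margin).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, chunk_vecs):
    """Return (chunk_index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Each string returned by chunk_words would be passed to akio embed as a separate input. Because the returned vectors are already L2-normalized, rank_by_similarity needs no division by vector norms. Text in languages without whitespace word boundaries would need a different splitting strategy.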