Embeddings - 0xcat Akio

Akio supports local text embedding using the embedding (or embed) command. Inference runs entirely on-device — no external API required.

Supported Models

Model	HuggingFace Repo	Parameters	Dimensions	Context	Quantizations
Qwen3-Embedding-0.6B	`Fastiraz/Qwen3-Embedding-0.6B-GGUF`	0.6B	up to 1024 (configurable 32–1024)	32K tokens	Q8_0 (639 MB), F16 (1.2 GB)

Quickstart

1. Pull the embedding model

akio pull Fastiraz/Qwen3-Embedding-0.6B-GGUF

2. Embed some text

akio embedding -m Fastiraz/Qwen3-Embedding-0.6B-GGUF "Hello, world!"

Akio prints a JSON array of L2-normalized float vectors — one per input.

Options

Flag	Type	Default	Description
`-m <MODEL>`	`string`	(required)	Path or filename of the GGUF embedding model
`--ngl <N>`	`i32`	`99`	Number of layers to offload to GPU (`0` = CPU-only)
`--verbose <LEVEL>`	`string`	`error`	Log verbosity: `none`, `debug`, `info`, `warn`, `error`
`<INPUT>...`	`string`	(required)	One or more texts to embed

Examples

Single input

akio embedding \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  "The quick brown fox jumps over the lazy dog"

Multiple inputs in one call

akio embed \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  "The cat sat on the mat" \
  "A quick brown fox" \
  "Akio runs inference locally"

Returns a JSON array with one vector per input, in the same order. CPU-only

akio embed \
  -m Fastiraz/Qwen3-Embedding-0.6B-GGUF \
  --ngl 0 \
  "embed this on CPU"

Output Format

The output is a JSON array of float arrays:

[
  [0.021, -0.045, 0.113, ...],
  [0.004, 0.078, -0.032, ...]
]

Each inner array is L2-normalized (unit length). For models that use NONE pooling, per-token embeddings are mean-pooled and re-normalized into a single vector per input.

Notes

Model details

Qwen3-Embedding-0.6B supports a 32K token context window and produces embeddings up to 1024 dimensions. It covers 100+ languages and supports instruction-aware embedding to boost retrieval accuracy by 1–5% on task-specific queries.

Context and batch limits

Each input is limited to 512 tokens internally by Akio’s batch size. The model itself supports up to 32K tokens per sequence. Chunk long documents before embedding them.

GPU acceleration

GPU offloading (--ngl) works the same as akio run. Requires Akio to be compiled with CUDA or Metal support. Use --ngl 0 to force CPU inference.

Comparing embeddings

Because all vectors are L2-normalized, cosine similarity between two vectors equals their dot product. You can compute it directly without extra normalization.

​Supported Models

​Quickstart

​Options

​Examples

​Output Format

​Notes