Chat & completion

Two runtime tools cover interactive and one-shot generation.

Interactive chat

basert chat <model> [options]

Starts an interactive REPL. Type /clear to reset the conversation or quit to exit. After each response, basert prints prefill and decode throughput.

Flag                  Default  Purpose
--temp <F>            0.7      Sampling temperature.
--top-k <N>           40       Top-K candidates.
--top-p <F>           0.9      Nucleus sampling threshold.
--min-p <F>           0.0      Min-P filter.
--repeat-penalty <F>  1.1      Repetition penalty.
--max-tokens <N>      256      Max tokens per response.
--max-context <N>     4096     Context window size.
--think / --no-think  on       Show / hide <think> blocks (Qwen3 thinking mode).
--kv-bits 4|8|16      auto     KV cache element width.
--paged-kv            off      Paged KV cache + block-table dispatch.
--max-batch-size <N>  1        Max concurrent sequences for batched decode.

basert chat Qwen/Qwen3-4B --temp 0.8 --top-p 0.95 --max-context 8192 --no-think

One-shot completion

basert complete <model> --prompt <text> [options]

Runs a single prompt and prints the result, which makes it a good fit for scripts and pipelines.

Flag                  Default  Purpose
--prompt <text>       -        Prompt text (required).
--system <text>       -        System prompt (only with --chat).
--chat                off      Wrap --prompt with the model's chat template.
--image <path>        -        Attach an image (multimodal models).
--audio <path>        -        Attach audio (multimodal models).
--max-tokens <N>      256      Max tokens to generate.
--temp <F>            0.0      Temperature (0 = greedy).
--top-k <N>           40       Top-K.
--top-p <F>           0.9      Top-P.
--repeat-penalty <F>  1.0      Repetition penalty.
--kv-bits 4|8|16      auto     KV cache element width.
--paged-kv            off      Paged KV cache.

# plain completion
basert complete Qwen/Qwen3-4B --prompt "List three Metal GPU tips." --max-tokens 128

# chat-templated, with a system prompt
basert complete Qwen/Qwen3-4B --chat \
  --system "You are concise." --prompt "Why is RoPE useful?"

# multimodal
basert complete <vlm-model> --image ./photo.png --prompt "Describe this image."
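
Because basert complete handles one prompt per invocation, a batch job is just a loop around it. The sketch below is one possible way to do that, not a documented workflow: the BASERT override variable, the complete_each function, and the prompts.txt / results.txt file names are all hypothetical names introduced here for illustration. Only the flags shown in the table above are used.

```shell
# Hypothetical helper: run `basert complete` once per input line.
# BASERT is an illustrative override (e.g. BASERT=echo gives a dry run);
# it is not a variable that basert itself reads.
: "${BASERT:=basert}"
MODEL="Qwen/Qwen3-4B"

complete_each() {
  # Read one prompt per line from stdin; also handle a final line
  # that has no trailing newline.
  while IFS= read -r prompt || [ -n "$prompt" ]; do
    # Skip blank lines.
    [ -z "$prompt" ] && continue
    # </dev/null keeps the child process from consuming the loop's input.
    "$BASERT" complete "$MODEL" --prompt "$prompt" --max-tokens 128 </dev/null
  done
}

# Usage (file names are placeholders):
#   complete_each < prompts.txt > results.txt
```

Setting BASERT=echo prints the commands instead of running them, which is a cheap way to check quoting before spending GPU time.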

KV-cache width

--kv-bits trades memory for precision. 16 (f16) gives the highest quality; 8 and 4 shrink the cache so that longer contexts or more concurrent sequences fit in memory. The default (auto) is chosen per model. Pair it with --paged-kv for block-table-based allocation.
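
To get a feel for the savings, the following back-of-the-envelope sketch estimates raw KV cache size for each element width. It assumes the usual transformer layout (one K and one V tensor per layer, each holding kv_heads x head_dim elements per token) and ignores any per-block scales or metadata a quantized cache may store, so treat the results as lower bounds. The kv_cache_bytes function and the model dimensions (36 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values reported by basert.

```shell
# kv_cache_bytes LAYERS KV_HEADS HEAD_DIM CONTEXT BITS
# Hypothetical estimator; the subshell body keeps its variables local.
kv_cache_bytes() (
  layers=$1 kv_heads=$2 head_dim=$3 context=$4 bits=$5
  # 2 tensors (K and V) per layer, one element per head dimension per token.
  echo $(( 2 * layers * kv_heads * head_dim * context * bits / 8 ))
)

# Assumed model shape at an 8192-token context, one sequence.
for bits in 16 8 4; do
  bytes=$(kv_cache_bytes 36 8 128 8192 "$bits")
  echo "kv-bits=$bits: $(( bytes / 1048576 )) MiB"
done
# prints:
#   kv-bits=16: 1152 MiB
#   kv-bits=8: 576 MiB
#   kv-bits=4: 288 MiB
```

Under these assumptions, halving the element width halves the cache, so the same memory budget holds roughly twice the context or twice as many concurrent sequences.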