Chat & completion
Two runtime tools cover interactive and one-shot generation.
Interactive chat
basert chat <model> [options]
Drops you into a REPL. /clear resets the conversation; quit exits. Each
response prints prefill/decode throughput.
| Flag | Default | Purpose |
|---|---|---|
--temp <F> | 0.7 | Sampling temperature. |
--top-k <N> | 40 | Top-K candidates. |
--top-p <F> | 0.9 | Nucleus sampling threshold. |
--min-p <F> | 0.0 | Min-P filter. |
--repeat-penalty <F> | 1.1 | Repetition penalty. |
--max-tokens <N> | 256 | Max tokens per response. |
--max-context <N> | 4096 | Context window size. |
--think / --no-think | on | Show / hide <think> blocks (Qwen3 thinking mode). |
--kv-bits 4|8|16 | auto | KV cache element width. |
--paged-kv | off | Paged KV cache + block-table dispatch. |
--max-batch-size <N> | 1 | Max concurrent sequences for batched decode. |
basert chat Qwen/Qwen3-4B --temp 0.8 --top-p 0.95 --max-context 8192 --no-think
One-shot completion
basert complete <model> --prompt <text> [options]
Runs a single prompt and prints the result — ideal for scripts and pipelines.
| Flag | Default | Purpose |
|---|---|---|
--prompt <text> | — | Prompt text (required). |
--system <text> | — | System prompt (only with --chat). |
--chat | off | Wrap --prompt with the model's chat template. |
--image <path> | — | Attach an image (multimodal models). |
--audio <path> | — | Attach audio (multimodal models). |
--max-tokens <N> | 256 | Max tokens to generate. |
--temp <F> | 0.0 | Temperature (0 = greedy). |
--top-k <N> | 40 | Top-K. |
--top-p <F> | 0.9 | Top-P. |
--repeat-penalty <F> | 1.0 | Repetition penalty. |
--kv-bits 4|8|16 | auto | KV cache element width. |
--paged-kv | off | Paged KV cache. |
# plain completion basert complete Qwen/Qwen3-4B --prompt "List three Metal GPU tips." --max-tokens 128 # chat-templated, with a system prompt basert complete Qwen/Qwen3-4B --chat \ --system "You are concise." --prompt "Why is RoPE useful?" # multimodal basert complete <vlm-model> --image ./photo.png --prompt "Describe this image."
KV-cache width
--kv-bits trades memory for precision: 16 (f16) is highest quality, 8 and
4 shrink the cache so you can fit longer contexts or more concurrent
sequences. The default is chosen per model. Pair with --paged-kv for
block-table-based allocation.