Serving an API

basert serve starts an OpenAI-compatible HTTP server backed by the engine.

basert serve --model Qwen/Qwen3-4B --api-key "$(uuidgen)" --port 8080

You can load several models at once by repeating --model; clients pick one via the model field in the request.

basert serve --model Qwen/Qwen3-4B --model Qwen/Qwen3-0.6B --api-key "$KEY"

See the full list of endpoints in the Server API reference.

Core flags

Flag	Default	Purpose
`--model <path>`	—	Model file or hub id. Repeatable.
`--host <addr>`	`127.0.0.1`	Bind address.
`--port <N>`	`8080`	Bind port.
`--api-key <key>`	—	Require `Authorization: Bearer <key>`.
`--max-context <N>`	`4096`	Context window; sizes the KV cache up front.
`--max-tokens <N>`	`2048`	Default max generation tokens.
`--metallib <path>`	auto	Path to `baseRT.metallib` (auto-detected next to the binary).

TIP

Set --max-context for long-context models

The KV cache is allocated up front from --max-context. Models trained for long context (Gemma 4 / Qwen3 MoE = 32k+) need this set explicitly; requests exceeding it are rejected with context_length_exceeded.

Throughput: batching & caching

Flag	Purpose
`--kv-bits 4\|8\|16`	KV cache element width (default: per-model auto).
`--paged-kv`	Paged KV cache + block-table dispatch.
`--max-batch-size <N>`	Max concurrent sequences for batched decode (default 1).
`--continuous-batching [N]`	Decode concurrent requests through one shared forward pass (implies `--paged-kv`; `N` = max in-flight lanes, default 8).
`--prefix-cache`	Share KV of common prompt prefixes across requests (implies `--paged-kv`; most effective with `--continuous-batching`).
`--prefix-cache-file <path>`	Persist the prefix cache per model (`<path>.<model_id>`), loaded on startup and saved on shutdown.
`--prefix-cache-save-interval <sec>`	Re-save the prefix cache periodically.

A typical high-throughput configuration:

basert serve --model Qwen/Qwen3-4B \
  --continuous-batching 8 --prefix-cache \
  --max-context 8192 --api-key "$KEY"

Operability

Flag	Purpose
`--rate-limit <N>`	Requests per minute per client (0 = unlimited).
`--idle-timeout <N>`	Auto-unload idle models after N seconds (0 = disabled).
`--request-timeout <ms>`	Abort generation longer than N ms (0 = disabled). Recommended 60000–300000 for unattended agents.
`--drain-timeout <N>`	Seconds to wait for in-flight requests on `SIGUSR1` (default 60).
`--log-file <path>`	Redirect stderr (access log + diagnostics).
`--files-dir <path>`	Enable `/v1/files` (+ `/v1/batches`) rooted at this directory.
`--files-max-bytes <N>`	Reject uploads that would exceed total stored bytes.
`--files-expiry <N>` / `--files-sweep <N>`	Auto-remove old files; sweep cadence.

Rolling restarts

SIGUSR1 stops accepting connections and drains in-flight requests (up to --drain-timeout); pair it with an external supervisor for zero-downtime restarts. SIGINT/SIGTERM exit immediately.

Logs

--log-file writes the access log + diagnostics to a file. Use external rotation — logrotate copytruncate (Linux) or newsyslog with the F flag (macOS).

CAUTION

Before exposing beyond localhost

Always set --api-key, bind carefully with --host, and read Security. The server is intended for trusted environments.

Quick test

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-4B","messages":[{"role":"user","content":"Hello!"}],"stream":true}'