Server API

basert serve exposes an OpenAI-compatible HTTP API. Point any OpenAI client at the server's base URL (http://127.0.0.1:8080/v1 in the examples on this page) and set the client's API key to your --api-key value. See Serving an API for launch flags.

Authentication

If --api-key is set, every request must send:

Authorization: Bearer <api-key>
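
As a rough illustration, the header can be attached from Python's standard library as sketched here. The helper name build_request and the key "example-key" are placeholders invented for this example, not part of basert. The URL assumes the 127.0.0.1:8080 address used elsewhere on this page.

```python
import json
import urllib.request
from typing import Optional


def build_request(url: str, payload: dict, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request, adding the bearer header only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_request(
    "http://127.0.0.1:8080/v1/chat/completions",
    {"model": "Qwen3-4B", "messages": [{"role": "user", "content": "Hi"}]},
    api_key="example-key",  # placeholder; substitute your --api-key value
)
# urllib.request.urlopen(req) would send the request to a running server.
```

Omitting api_key leaves the Authorization header out, which matches a server started without --api-key.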

Endpoints

Method · Path                                  Purpose
POST /v1/chat/completions                      Chat completions (streaming via "stream": true). Tool/function calling supported.
POST /v1/completions                           Text completions.
POST /v1/embeddings                            Embedding vectors.
POST /v1/rerank                                Rerank documents against a query.
POST /v1/audio/transcriptions                  Whisper-class transcription.
POST /v1/tokenize                              Tokenize text (count/inspect tokens).
GET /v1/models                                 List loaded models.
POST /v1/models/load · POST /v1/models/unload  Load/unload a model at runtime.
POST /v1/lora/load · POST /v1/lora/unload      Manage LoRA adapters.
POST /v1/files · GET /v1/files/...             File storage (requires --files-dir).
POST /v1/batches · GET /v1/batches/...         Batch jobs (requires --files-dir).
GET /health                                    Liveness probe.
GET /metrics · GET /v1/metrics                 Server metrics.

Chat completions

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-4B",
        "messages": [
          {"role": "system", "content": "You are concise."},
          {"role": "user", "content": "What is RoPE?"}
        ],
        "temperature": 0.7,
        "max_tokens": 256
      }'

Streaming

Set "stream": true to receive Server-Sent Events (data: {…} chunks ending with data: [DONE]):

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-4B","messages":[{"role":"user","content":"Hi"}],"stream":true}'
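
For clients that read the stream by hand, a minimal parsing sketch follows. It assumes each chunk follows the OpenAI chat-completion chunk shape (choices[0].delta.content). The function names and the sample lines are invented for illustration and were not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)


def collect_text(chunks: Iterable[dict]) -> str:
    """Concatenate the content deltas of the first choice in each chunk."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample stream, assumed to mirror the OpenAI chunk format.
sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
streamed_text = collect_text(iter_sse_chunks(sample_lines))
```

In practice the lines would come from the HTTP response body read incrementally rather than from a list.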

Tool calling

Pass tools (OpenAI function-calling schema) in the request body; the model returns tool_calls in the response, streamed incrementally when "stream": true is set.
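
A sketch of what such a request body and the handling of a non-streaming response might look like, assuming the OpenAI function-calling schema applies unchanged. The get_weather tool, the helper extract_tool_calls, and the sample response are all hypothetical.

```python
import json
from typing import List, Tuple

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Body to POST to /v1/chat/completions.
request_body = {
    "model": "Qwen3-4B",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
}


def extract_tool_calls(response: dict) -> List[Tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI schema, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written sample response, not real server output.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
                    }
                ],
            }
        }
    ]
}
tool_calls = extract_tool_calls(sample_response)
```

The caller would then run the named function and send its result back as a message with role "tool", as in the OpenAI flow.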

Embeddings

curl http://127.0.0.1:8080/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"my-embed-model","input":["hello world","second doc"]}'
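
A common next step is comparing the returned vectors. The sketch below assumes the response follows the OpenAI embeddings shape (a data list whose items carry index and embedding). The helper names and the two-dimensional sample vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def embeddings_from_response(response: dict) -> List[List[float]]:
    """Return the vectors ordered by their index field, matching input order."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written sample response with toy vectors, deliberately out of order.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]
}
vectors = embeddings_from_response(sample_response)
similarity = cosine_similarity(vectors[0], vectors[1])
```

Sorting by index guards against a server returning items in a different order than the inputs.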

Using the OpenAI SDKs

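import os

from openai import OpenAI

# Read the key from the environment; Python does not expand a literal "$API_KEY" string.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key=os.environ["API_KEY"])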
resp = client.chat.completions.create(
    model="Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

NOTE

The exact request/response fields follow the OpenAI schema. Endpoints like /v1/files, /v1/batches, and LoRA management depend on server flags (--files-dir) and the loaded model's capabilities.