Quantization

BaseRT quantizes weights with affine (MLX-style) quantization and an optional AWQ calibration pass. The on-disk dtypes and their defaults are defined by the canonical spec; this page is an overview.

Canonical spec

The authoritative quantization specification lives in the repo:

Dtypes

Dtype                Bits   Typical use
base_q2 … base_q8    2–8    Quantized weights (affine, grouped).
bf16 / f16           16     Sensitive tensors (norms, routers, embeddings).
f32                  32     Full precision where needed.

Each quantized tensor has:

  • group_size: elements sharing one scale (per-bit-width default, e.g. 64 for q4; overridable).
  • scale_dtype: bf16 | f16 | e8m0 | e4m3 (e4m3 is q8-only).
  • symmetric: default false (asymmetric, MLX-affine style).
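
To make the three fields above concrete, here is a minimal pure-Python sketch of grouped affine quantization. It is an illustration under stated assumptions, not BaseRT's packing code: the function names (quantize_group, dequantize_group, quantize_tensor) are hypothetical, scales and biases are kept as Python floats rather than cast to a scale_dtype, and the asymmetric branch follows the usual MLX-style convention that a value is restored as code * scale + bias with bias equal to the group minimum. The canonical spec remains the authority on the real on-disk layout.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, scale, bias)


def quantize_group(values: List[float], bits: int, symmetric: bool = False) -> Group:
    """Quantize one group so that value ~= code * scale + bias."""
    if symmetric:
        # Symmetric: signed codes centered on zero, no bias term.
        qmax = (1 << (bits - 1)) - 1
        amax = max(abs(v) for v in values)
        scale = amax / qmax if amax > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
        return codes, scale, 0.0
    # Asymmetric (affine): unsigned codes spanning [min, max] of the group.
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [max(0, min(qmax, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, bias: float) -> List[float]:
    """Restore approximate values from codes and the group's scale and bias."""
    return [c * scale + bias for c in codes]


def quantize_tensor(weights: List[float], bits: int = 4, group_size: int = 64,
                    symmetric: bool = False) -> List[Group]:
    """Split a flat weight list into groups and quantize each independently."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits, symmetric)
        for i in range(0, len(weights), group_size)
    ]
```

With round-to-nearest, the per-element reconstruction error in each group is at most half of that group's scale, which is why a smaller group_size (a tighter min/max range per scale) generally lowers error at the cost of storing more scales.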

Choosing precision

  • default-q4 — a good general default: ~4-bit weights, strong size/quality balance.
  • default-q8 — higher quality, larger files; good for sensitive models or when memory allows.
  • The *-f16scale / *-bf16 variants control the scale dtype.
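
The size side of this trade-off can be estimated with simple arithmetic. The sketch below assumes that each group stores one scale and, in asymmetric mode, one bias of the same width as the scale; that storage layout is an assumption for illustration and the canonical spec may differ. The helper names bits_per_weight and estimate_gib are hypothetical.

```python
# Bit widths of the scale dtypes named on this page.
SCALE_BITS = {"bf16": 16, "f16": 16, "e8m0": 8, "e4m3": 8}


def bits_per_weight(bits: int, group_size: int, scale_dtype: str = "bf16",
                    symmetric: bool = False) -> float:
    """Effective storage per weight: code bits plus amortized per-group metadata."""
    overhead = SCALE_BITS[scale_dtype]
    if not symmetric:
        # Assumption: the bias is stored at the same width as the scale.
        overhead *= 2
    return bits + overhead / group_size


def estimate_gib(n_params: int, bpw: float) -> float:
    """Approximate size in GiB of n_params quantized weights at bpw bits each."""
    return n_params * bpw / 8 / 2**30
```

Under these assumptions, 4-bit asymmetric weights with group_size 64 and a bf16 scale cost 4 + 32/64 = 4.5 bits per weight, while 8-bit weights with the same grouping cost 8.5, so a q8 file is a little under twice the size of its q4 counterpart.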

Per-tensor control comes from a quant profile.

AWQ calibration

Activation-aware weight quantization reduces low-bit error by scaling salient channels before packing. Provide calibration text (or a precomputed activation-stats sidecar) at convert time — see Converting models. Tuned, model-specific quality is delivered through the catalog as pre-converted artifacts.
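
The idea of scaling salient channels can be sketched as follows. This is a simplified illustration of the general AWQ technique, not BaseRT's calibration pass: the function names are hypothetical, the exponent alpha is fixed rather than searched per layer, and the normalization shown is one common choice. The input act_mean_abs stands in for per-input-channel activation statistics such as those a calibration run or stats sidecar would provide.

```python
from typing import List


def awq_channel_scales(act_mean_abs: List[float], alpha: float = 0.5) -> List[float]:
    """Per-input-channel scales s_j = mean|x_j| ** alpha, normalized so that
    sqrt(max(s) * min(s)) == 1. Channels with larger activations get larger scales."""
    eps = 1e-8  # guard against channels that never activate
    raw = [max(a, eps) ** alpha for a in act_mean_abs]
    norm = (max(raw) * min(raw)) ** 0.5
    return [r / norm for r in raw]


def apply_awq_scales(weight_rows: List[List[float]], scales: List[float]) -> List[List[float]]:
    """Multiply input channel j of every output row by s_j before quantizing.
    The matching 1 / s_j is applied to the activations (typically folded into
    the preceding layer), so the unquantized layer output is unchanged."""
    return [[w * s for w, s in zip(row, scales)] for row in weight_rows]
```

Because salient channels are enlarged before rounding, their relative quantization error shrinks, while the compensating 1/s_j on the activation side keeps the layer mathematically equivalent before quantization.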