Quantization
BaseRT quantizes weights with affine (MLX-style) quantization and an optional AWQ calibration pass. The on-disk dtypes and their defaults are defined by the canonical spec; this page is an overview.
Canonical spec
The authoritative quantization specification lives in the repo:
base-convert/CANONICAL_QUANT_SPEC.md: dtypes, group sizes, scale dtypes, symmetric/asymmetric rules, and the per-tensor header fields.
Dtypes
| Dtype | Bits | Typical use |
|---|---|---|
| base_q2 … base_q8 | 2–8 | Quantized weights (affine, grouped). |
bf16 / f16 | 16 | Sensitive tensors (norms, routers, embeddings). |
f32 | 32 | Full precision where needed. |
Each quantized tensor has:
- group_size: number of elements sharing one scale (per-bit-width default, e.g. 64 for q4; overridable).
- scale_dtype: bf16 | f16 | e8m0 | e4m3 (e4m3 is q8-only).
- symmetric: default false (asymmetric, MLX-affine style).
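To make the grouped affine scheme concrete, here is a minimal pure-Python sketch of asymmetric quantization with one scale and one offset per group. It is an illustration under stated assumptions, not BaseRT's packing code: the function names are hypothetical, the codes are left unpacked, and the scale is kept as a Python float rather than rounded to a scale_dtype. The canonical spec remains the reference for the real layout.

```python
def quantize_group(values, bits=4):
    """Quantize one group to integer codes in [0, 2**bits - 1] (asymmetric)."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # One scale per group; fall back to 1.0 when every value in the group is equal.
    scale = (hi - lo) / qmax or 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # lo acts as the per-group offset (bias)


def quantize_affine(weights, bits=4, group_size=64):
    """Split a flat weight list into groups and quantize each independently."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize_affine(groups):
    """Reconstruct approximate weights: w ~= code * scale + offset."""
    return [c * scale + bias for codes, scale, bias in groups for c in codes]
```

Within each group the reconstruction error is at most half a scale step, which is why smaller groups (a tighter min/max range per scale) tend to lower error at the cost of more scale metadata.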
Choosing precision
- default-q4: a good general default, with ~4-bit weights and a strong size/quality balance.
- default-q8: higher quality, larger files; good for sensitive models or when memory allows.
- The *-f16scale / *-bf16 variants control the scale dtype.
Per-tensor control comes from a quant profile.
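When weighing these presets, it can help to estimate the storage cost per weight, including per-group metadata. The sketch below assumes each asymmetric group stores one scale and one offset, both in the scale dtype, and that symmetric groups store only a scale; that accounting and the function name are assumptions for illustration, and header overhead is ignored. Check the canonical spec for the actual layout.

```python
# Storage width of each scale dtype, in bits.
SCALE_BITS = {"bf16": 16, "f16": 16, "e8m0": 8, "e4m3": 8}


def effective_bits_per_weight(bits, group_size, scale_dtype="bf16", symmetric=False):
    """Weight bits plus per-group metadata amortized over the group.

    Assumption: asymmetric groups carry a scale and an offset, each stored
    in scale_dtype; symmetric groups carry only a scale.
    """
    metadata_bits = SCALE_BITS[scale_dtype] * (1 if symmetric else 2)
    return bits + metadata_bits / group_size
```

Under these assumptions, 4-bit weights with a group size of 64 and bf16 scales cost 4 + 32/64 = 4.5 bits per weight, and 8-bit weights with the same group size cost 8.5 bits. For a hypothetical 7-billion-parameter model, 4.5 bits per weight comes to roughly 3.9 GB of quantized tensor data.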
AWQ calibration
Activation-aware weight quantization reduces low-bit error by scaling salient channels before packing. Provide calibration text (or a precomputed activation-stats sidecar) at convert time — see Converting models. Tuned, model-specific quality is delivered through the catalog as pre-converted artifacts.
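The channel-scaling idea can be sketched in a few lines. Each input channel of a weight matrix is multiplied by a scale derived from activation statistics, and the matching activation is divided by the same scale, so the layer output is unchanged before quantization while salient channels occupy more of the quantization range. This is a simplified illustration, not BaseRT's calibration pass: the function names are hypothetical, and the fixed exponent alpha is an assumption (AWQ as published searches for it per layer).

```python
def awq_channel_scales(act_abs_mean, alpha=0.5, eps=1e-8):
    """Per-input-channel scales s_j = mean|x_j| ** alpha (alpha assumed fixed here)."""
    return [max(a, eps) ** alpha for a in act_abs_mean]


def scale_weights(weight_rows, scales):
    """Multiply column j of the weight matrix (rows = outputs) by scales[j]."""
    return [[w * s for w, s in zip(row, scales)] for row in weight_rows]


def matvec(rows, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def scaled_forward(scaled_rows, scales, x):
    """Apply the inverse scales to the input so the output matches the original."""
    return matvec(scaled_rows, [xi / s for xi, s in zip(x, scales)])
```

In a real pipeline the scaled weights are what get quantized, and the inverse scales are typically folded into the preceding operation rather than applied at run time.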