Benchmarking
basert bench measures prefill and decode throughput for a model.
basert bench <model> [-p N] [-n N] [-r N] [-w N] [--paged-kv]
| Flag | Purpose |
|---|---|
| -p N | Prefill length (prompt tokens). |
| -n N | Number of tokens to generate (decode). |
| -r N | Repetitions (averaged). |
| -w N | Warm-up iterations. |
| --paged-kv | Benchmark the paged KV cache path. |
basert bench Qwen/Qwen3-4B -p 512 -n 128 -r 3
The output reports prefill tokens/sec (prompt processing) and decode tokens/sec (generation). Prefill throughput largely determines the time to the first token, while decode throughput sets how fast tokens are produced once generation starts.
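As a rough illustration of how these two figures relate to latency, the sketch below computes tokens/sec from a token count and an elapsed time. The function name and the timings are hypothetical, chosen only to make the arithmetic easy to follow; they are not output from basert bench, and the tool's exact timing boundaries are not described here.

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as tokens processed divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# Hypothetical timings for a run with -p 512 -n 128 (not measured results).
prefill_tps = tokens_per_second(512, 0.25)  # prompt processing
decode_tps = tokens_per_second(128, 4.0)    # token-by-token generation

print(f"prefill: {prefill_tps:.1f} tok/s, decode: {decode_tps:.1f} tok/s")
```

Under these assumed timings, the 512-token prompt costs about 0.25 s before the first generated token, and each decoded token takes about 1/32 s.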
Reproducible benchmark scripts
The benchmarks/ directory has scripts that drive basert bench across models and context lengths, plus reference results under benchmarks/results/. They expect the engine bundle unpacked into build/ (so build/basert-bench exists); see Installation. Start with benchmarks/scripts/ and the benchmarks/README.md.
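To give a sense of what a sweep across models and context lengths involves, here is a minimal Python sketch that only builds the command lines. It is not one of the scripts in benchmarks/scripts/, which may be organized quite differently; build_bench_commands and its defaults are hypothetical, and the only flags used are the ones documented in the table above.

```python
import itertools
import shlex


def build_bench_commands(models, prefill_lengths, decode_tokens=128, reps=3, warmup=1):
    """Return one `basert bench` argument list per (model, prefill length) pair."""
    commands = []
    for model, prefill in itertools.product(models, prefill_lengths):
        commands.append([
            "basert", "bench", model,
            "-p", str(prefill),        # prefill length (prompt tokens)
            "-n", str(decode_tokens),  # tokens to generate
            "-r", str(reps),           # repetitions (averaged)
            "-w", str(warmup),         # warm-up iterations
        ])
    return commands


cmds = build_bench_commands(["Qwen/Qwen3-4B"], [512, 2048])
for cmd in cmds:
    # Print rather than execute, so the sketch runs without the engine installed.
    print(shlex.join(cmd))
```

Each argument list can be passed to subprocess.run to execute it. Keeping the commands as lists rather than strings avoids shell-quoting problems with model names.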
Methodology notes
- Use controlled token counts (-p/-n) and several repetitions (-r) with a warm-up (-w) so the first-call allocation cost doesn't skew results.
- Prefill and decode are reported separately; they stress different parts of the engine (compute-bound matmuls vs. memory-bound single-token steps).
- For server-style throughput, benchmark under basert serve --continuous-batching with concurrent clients rather than the single-stream bench tool.
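The warm-up and repetition advice above can be sketched as a small, generic timing harness. This is an illustration of the idea, not a description of how basert bench is implemented; timed_mean and its parameters are hypothetical names chosen to mirror -w and -r, and the workload is a stand-in.

```python
import statistics
import time


def timed_mean(fn, warmup=1, reps=3):
    """Run fn `warmup` times untimed, then return (mean, samples) over `reps` timed runs."""
    for _ in range(warmup):
        fn()  # discarded: absorbs one-time costs such as first-call allocation
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


def workload():
    # Stand-in for a benchmarked operation.
    return sum(range(100_000))


mean_seconds, samples = timed_mean(workload, warmup=2, reps=3)
print(f"mean over {len(samples)} runs: {mean_seconds:.6f} s")
```

Reporting the individual samples alongside the mean makes it easier to spot a run that was disturbed by other load on the machine.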