- Repository: `qdrant/qdrant`
- Commit (SHA1): `bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5`
- Scope: Implementation-level research for scalar `int8` and `binary` quantization, with search-performance tuning guidance for `float32`/`float16`/`uint8` vector storage.

Qdrant implements two different approximate-vector compression families relevant to this request:
- Scalar `int8` quantization (internally stored as `u8` codes + correction terms)
  - Core: linear min-max quantization with optional quantile clipping.
  - Distance restoration: per-query/per-vector offsets + multiplier restore score ordering for Dot/L1/L2.
  - Fast path: AVX/SSE/NEON kernels.
- Binary quantization (1-bit / 1.5-bit / 2-bit)
  - Core: sign/statistics-based bit encoding + XOR/popcount similarity.
  - Query side can be asymmetric (`binary`, `scalar4bits`, `scalar8bits`) for speed/accuracy tradeoff.
  - Fast path: AVX512/AVX2/SSE/NEON popcount kernels.

Search quality/speed is controlled mainly by oversampling and rescore; Qdrant also auto-chooses a default rescoring policy (default true for binary, false for scalar/PQ).
- Scalar encoder/scorer:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs
- Scalar mode enum (`Int8`):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L25
- Quantile helper used by scalar:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/quantile.rs#L28
- Binary encoder/scorer:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs
- Encoding modes (`OneBit`, `TwoBits`, `OneAndHalfBits`):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L32
- Query encoding modes (`SameAsStorage`, `Scalar4bits`, `Scalar8bits`):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L46
- Per-dimension stats (Welford) used by multi-bit binary modes:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/vector_stats.rs#L30
- Quantized storage orchestration (`create`/`load`, RAM/Mmap/ChunkedMmap):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/vector_storage/quantized/quantized_vectors.rs
- Search-time quantized controls (`oversampling`, `rescore`, `ignore`):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs
- Quantization config schema types:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/types.rs#L498
- Query preprocessing before quantized scoring:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/vector_storage/quantized/quantized_query_scorer.rs#L29

`MetadataInt8::encode_value` maps each float value into a code:
`code = round(clamp((value - offset) / alpha, 0..127))`
Implementation:
- Encode value:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L93
- Alpha/offset from min/max:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L494
Notes:
- Although named `Int8`, this path stores codes in `u8` and clamps to `0..127`.
- Vector bytes contain an extra leading `f32` offset value (`ADDITIONAL_CONSTANT_SIZE`) per vector.

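
The mapping above can be sketched in a few lines of Python. This is an illustrative sketch, not the Rust implementation: the function names `alpha_offset` and `decode_value` are hypothetical, and the choice `alpha = (max - min) / 127`, `offset = min` is an assumption consistent with the `0..127` code range described above.

```python
def alpha_offset(min_value: float, max_value: float) -> tuple[float, float]:
    """Linear map parameters so that min_value -> code 0 and max_value -> code 127.

    Assumption: alpha = (max - min) / 127 and offset = min.
    """
    if max_value <= min_value:
        raise ValueError("expected max_value > min_value")
    alpha = (max_value - min_value) / 127.0
    offset = min_value
    return alpha, offset


def encode_value(value: float, alpha: float, offset: float) -> int:
    """code = round(clamp((value - offset) / alpha, 0..127))."""
    scaled = (value - offset) / alpha
    clamped = min(max(scaled, 0.0), 127.0)
    return int(round(clamped))


def decode_value(code: int, alpha: float, offset: float) -> float:
    """Approximate inverse of encode_value (hypothetical helper)."""
    return code * alpha + offset


alpha, offset = alpha_offset(-1.0, 1.0)
codes = [encode_value(v, alpha, offset) for v in (-3.0, -1.0, 1.0, 5.0)]
```

Values outside the `[min, max]` interval saturate at codes `0` and `127`, which is why clipping the interval (see the quantile option) trades outlier fidelity for finer resolution in the bulk of the distribution.
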
When ScalarQuantizationConfig.quantile is set, scalar quantization can use a trimmed min/max interval from sampled vectors instead of raw global min/max.
- Scalar path call site:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L190
- Quantile interval finder (sampled, tail-trimmed):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/quantile.rs#L28
- Sampling constants (`SAMPLE_SIZE=5000`):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/quantile.rs#L9

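
A sampled, tail-trimmed interval of this kind can be sketched as follows. The function name `quantile_interval`, the symmetric split of the trimmed mass across both tails, and the use of a seeded random sample are assumptions made for illustration; only the sample size of 5000 comes from the source.

```python
import random


def quantile_interval(values, quantile, sample_size=5000, seed=0):
    """Return (low, high) covering roughly `quantile` of the sampled values.

    Assumption: the trimmed mass (1 - quantile) is split evenly between tails.
    """
    rng = random.Random(seed)
    if len(values) > sample_size:
        sample = rng.sample(values, sample_size)  # subsample large inputs
    else:
        sample = list(values)
    sample.sort()
    cut = int(len(sample) * (1.0 - quantile) / 2.0)  # elements dropped per tail
    return sample[cut], sample[len(sample) - 1 - cut]


# Two extreme outliers no longer define the quantization range.
data = [float(i) for i in range(100)] + [1e6, -1e6]
low, high = quantile_interval(data, 0.9)
```

With `quantile=1.0` nothing is trimmed and the interval falls back to the raw sampled min/max.
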
For Dot/L1/L2, Qdrant stores and applies correction terms (multiplier, query offset, vector offset, shift) so quantized integer math preserves usable ranking.
- Multiplier derivation by distance:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L206
- Shift compensation logic (`(x-a)(y-a)` term):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L114
- Query encoding with offset handling:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L552

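
To see why a multiplier plus per-vector and per-query offsets are enough for the dot product, expand the decoded values `x ≈ alpha * cx + offset` and `y ≈ alpha * cy + offset`. The sketch below is an illustration of that algebra under those assumptions, not a transcription of the Rust code; the function name `quantized_dot` is hypothetical.

```python
def quantized_dot(cx, cy, alpha, offset):
    """Recover dot(x, y) from integer codes.

    dot(x, y) = alpha^2 * sum(cx*cy)                 (integer kernel * multiplier)
              + alpha * offset * (sum(cx) + sum(cy)) (per-vector / per-query terms)
              + dim * offset^2                       (constant shift)
    """
    dim = len(cx)
    int_dot = sum(a * b for a, b in zip(cx, cy))  # the SIMD-friendly part
    vector_term = alpha * offset * sum(cx)        # can be precomputed per stored vector
    query_term = alpha * offset * sum(cy)         # computed once per query
    return alpha * alpha * int_dot + vector_term + query_term + dim * offset * offset


# Values chosen so the codes are exact: x = alpha * cx + offset.
alpha, offset = 0.5, -1.0
cx = [0, 2, 5]   # x = [-1.0, 0.0, 1.5]
cy = [3, 6, 0]   # y = [ 0.5, 2.0, -1.0]
score = quantized_dot(cx, cy, alpha, offset)
```

Only the integer sum depends on both vectors, so the hot loop stays in integer arithmetic while the correction terms are added once per comparison.
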
Scalar scoring dispatches to architecture-optimized routines if available:
- AVX path:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L446
- SSE path:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L404
- NEON path:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_u8.rs#L362

Binary quantization supports three encodings:
- `OneBit`: sign threshold (`value > 0`) per dimension.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L542
- `TwoBits`: 2-bit ternary-style state around `±SIGMAS` z-score bands.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L608
- `OneAndHalfBits`: packed variant derived from 2-bit states.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L575

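
A rough sketch of the first two encodings follows. The one-bit rule matches the sign threshold described above. The two-bit layout is an assumption: it uses a "thermometer" code in which the three z-score bands map to `00`, `10`, and `11`, so that Hamming distance between states equals their ordinal distance. The exact bit assignment in the Rust code may differ, and the `sigmas` parameter here is a hypothetical stand-in for the `SIGMAS` constant, whose value is not stated in this document.

```python
def encode_one_bit(vec):
    """One bit per dimension: 1 if the value is strictly positive."""
    return [1 if v > 0.0 else 0 for v in vec]


def encode_two_bits(vec, means, stddevs, sigmas):
    """Two bits per dimension from per-dimension z-scores (assumed layout).

    z <= -sigmas          -> 0, 0
    -sigmas < z <= sigmas -> 1, 0
    z > sigmas            -> 1, 1
    """
    bits = []
    for v, mean, std in zip(vec, means, stddevs):
        z = (v - mean) / std if std > 0.0 else 0.0
        bits.append(1 if z > -sigmas else 0)
        bits.append(1 if z > sigmas else 0)
    return bits


one_bit = encode_one_bit([0.5, -0.2, 0.0])
# sigmas=1.0 is an arbitrary illustrative value.
two_bits = encode_two_bits([-3.0, 0.0, 3.0], [0.0] * 3, [1.0] * 3, sigmas=1.0)
```
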
TwoBits/OneAndHalfBits depend on per-dimension mean/stddev statistics:
- Stats build (Welford):
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/vector_stats.rs#L30
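
For reference, Welford's algorithm accumulates mean and variance in one pass without storing the data. The sketch below is the textbook form of the algorithm; the class name `RunningStats` is hypothetical, and whether the Rust code uses the population or sample variance is not stated here, so both are exposed.

```python
import math


class RunningStats:
    """Single-pass mean/variance accumulator (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    def population_stddev(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0

    def sample_stddev(self) -> float:
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


# Per-dimension statistics: one accumulator per vector component.
vectors = [[2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
stats = [RunningStats() for _ in range(2)]
for vec in vectors:
    for acc, value in zip(stats, vec):
        acc.update(value)
```
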
Binary queries can be encoded in multiple ways:
- `SameAsStorage` (`Binary`)
- `Scalar4bits`
- `Scalar8bits`

Implementation:
- Query-encoding enum:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L46
- Scalar query encoding with transposed bit layout:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L706

The code comment cites arXiv 2405.12497 for this transposition optimization.
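
The general idea behind a transposed (bit-plane) query layout can be illustrated as below. This is a sketch of the generic technique under stated assumptions, not Qdrant's kernel: it uses AND plus popcount to compute the inner product between a multi-bit query and a 1-bit stored vector, whereas the actual implementation may use a different bitwise operation and score mapping. All function names are hypothetical.

```python
def pack_bits(bits) -> int:
    """Pack a list of 0/1 values into one integer, bit i = bits[i]."""
    word = 0
    for i, bit in enumerate(bits):
        if bit:
            word |= 1 << i
    return word


def transpose_query(query_codes, num_bits):
    """Bit plane k holds the k-th bit of every query component."""
    return [pack_bits([(q >> k) & 1 for q in query_codes]) for k in range(num_bits)]


def bitplane_dot(planes, stored_word) -> int:
    """sum_i q_i * x_i = sum_k 2^k * popcount(plane_k AND x)."""
    total = 0
    for k, plane in enumerate(planes):
        total += (1 << k) * bin(plane & stored_word).count("1")
    return total


query_codes = [3, 0, 5, 15]         # 4-bit quantized query components
stored = pack_bits([1, 1, 0, 1])    # 1-bit stored vector
score = bitplane_dot(transpose_query(query_codes, 4), stored)
```

The transposition turns one multi-bit dot product into `num_bits` popcount passes over machine words, which is what makes the asymmetric modes cheaper than full-precision scoring while retaining more query information than a 1-bit query.
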
Core metric is XOR-popcount based; final score mapping depends on distance and invert flag.
- Metric computation:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L760
- SIMD/scalar popcount backends (`u8`/`u128`, AVX512/AVX2/SSE/NEON):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L156
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/encoded_vectors_binary.rs#L285

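
A minimal scalar version of the XOR-popcount metric looks like this. The mapping from Hamming distance to a similarity (`dim - 2 * hamming`, i.e. the dot product of the corresponding ±1 vectors) is one common choice and is an assumption here; as noted above, the real mapping depends on the distance and the invert flag. Function names are hypothetical.

```python
def pack_bits(bits) -> int:
    """Pack a list of 0/1 values into one integer, bit i = bits[i]."""
    word = 0
    for i, bit in enumerate(bits):
        if bit:
            word |= 1 << i
    return word


def hamming(a: int, b: int) -> int:
    """Number of differing bits: popcount(a XOR b)."""
    return bin(a ^ b).count("1")


def sign_dot(a: int, b: int, dim: int) -> int:
    """Dot product of the +/-1 vectors encoded by a and b (assumed mapping)."""
    return dim - 2 * hamming(a, b)


a = pack_bits([1, 0, 1, 1])
b = pack_bits([1, 1, 0, 1])
distance = hamming(a, b)
similarity = sign_dot(a, b, dim=4)
```

The SIMD backends listed above perform the same XOR and popcount over wide registers instead of one Python integer.
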
Quantized scorer builder chooses metric implementation by VectorStorageDatatype:
- `Float32`, `Float16`, `Uint8` dispatch:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/vector_storage/quantized/quantized_scorer_builder.rs#L57
- At request/update side, vectors are preprocessed per distance and datatype:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/data_types/named_vectors.rs#L309
- During quantized query scoring, the query is preprocessed again through metric + quantization preprocess:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/vector_storage/quantized/quantized_query_scorer.rs#L29

`PrimitiveVectorElement::quantization_preprocess` behavior:
- `float32`: passthrough as `f32`.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/data_types/primitive.rs#L40
- `float16`: converted to `f32`.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/data_types/primitive.rs#L74
- `uint8`: for binary quantization specifically, centered by `-127`; otherwise distance preprocessing path is used.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/data_types/primitive.rs#L124

Quantized search is used when:
- quantized storage exists,
- `ignore` is false,
- `exact` is false.

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs#L15
- Oversampled candidate count: `oversampling * top` when `oversampling` > 1.0.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs#L27
- Re-scoring with original vectors happens in postprocess if enabled/defaulted.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs#L48
- Default rescoring policy by quantization type: binary defaults to `true`, scalar/PQ to `false`.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/vector_storage/quantized/quantized_vectors.rs#L194

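
The interaction of these controls can be sketched with a brute-force stand-in for the index. Everything here is a simplified illustration: the function name, the callback-based scorers, and the truncating `int(...)` on the oversampled limit are assumptions, and the real search walks an HNSW graph rather than sorting all ids.

```python
def quantized_search(ids, quantized_score, exact_score, top,
                     oversampling=None, rescore=None, default_rescore=False):
    """Pick candidates with quantized scores, then optionally rescore exactly.

    default_rescore models the per-quantization default
    (True for binary, False for scalar/PQ, per the list above).
    """
    limit = top
    if oversampling is not None and oversampling > 1.0:
        limit = int(oversampling * top)  # rounding mode is an assumption

    # Stage 1: cheap candidate selection on quantized scores.
    candidates = sorted(ids, key=quantized_score, reverse=True)[:limit]

    # Stage 2: optional re-ranking with original vectors.
    do_rescore = default_rescore if rescore is None else rescore
    if do_rescore:
        candidates = sorted(candidates, key=exact_score, reverse=True)
    return candidates[:top]


exact = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.1}
approx = {0: 0.5, 1: 0.9, 2: 0.6, 3: 0.7}  # noisy quantized scores
fast = quantized_search(exact, approx.get, exact.get, top=2)
accurate = quantized_search(exact, approx.get, exact.get, top=2,
                            oversampling=2.0, rescore=True)
```

In this toy data the quantized ranking misses the true best point (id `0`) unless the candidate pool is widened and rescored, which is the recall/latency tradeoff the two parameters control.
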
HNSW exact mode forcibly disables quantization (sets ignore=true).
Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/hnsw.rs#L1360
Strict mode can cap quantization.oversampling.
Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/collection/src/operations/verification/mod.rs#L261
These are implementation-derived recommendations (not dataset-agnostic guarantees).
- For `float32`/`float16` with scalar int8
  - Start with scalar int8 (`type=int8`) and `quantile` around `0.95-0.99` when embeddings have outliers.
  - Keep `rescore=false` for lowest latency.
  - If recall drops, set `oversampling` to `1.5-3.0`, then optionally enable `rescore=true`.
- For binary quantization
  - Choose `encoding=one_bit` for max speed/memory reduction.
  - If quality is insufficient, try `two_bits` or `one_and_half_bits`.
  - For better recall at some CPU cost, use asymmetric query encoding (`scalar8bits` first, then `scalar4bits` as cheaper fallback).
  - Keep `rescore=true` (Qdrant default for binary) unless latency budget is extremely tight.
- For `uint8` source vectors
  - Binary path uses centering (`value - 127`) before thresholding; avoid assuming raw uint8 sign semantics.
  - Validate recall separately for `dot` vs `cosine` because test thresholds differ significantly by distance/query type.
- RAM vs disk
  - `always_ram=true` can reduce IO-latency variance when original vectors/quantized blobs are on disk.
  - Binary is the only quantization mode that supports appendable/mutable quantized storage in this revision.

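
As a starting point, the guidance above might translate into request payloads like the following, written as Python dictionaries. The field names follow the parameter names used in this document (`type`, `quantile`, `always_ram`, `encoding`, `query_encoding`, `oversampling`, `rescore`, `ignore`), but the exact nesting and spelling are assumptions and should be checked against the API schema of the Qdrant version in use. The numeric values are illustrative picks from the suggested ranges.

```python
# Collection-level config for float32/float16 vectors with outliers (assumed shape).
scalar_quantization_config = {
    "scalar": {
        "type": "int8",
        "quantile": 0.99,    # within the 0.95-0.99 range suggested above
        "always_ram": True,  # keep quantized data in RAM
    }
}

# Collection-level config for binary quantization with asymmetric queries (assumed shape).
binary_quantization_config = {
    "binary": {
        "encoding": "two_bits",           # step up from one_bit if quality is low
        "query_encoding": "scalar8bits",  # fall back to scalar4bits if CPU-bound
        "always_ram": True,
    }
}

# Per-request search params: lowest latency first, then a recall-oriented variant.
low_latency_params = {"quantization": {"ignore": False, "rescore": False}}
high_recall_params = {
    "quantization": {"ignore": False, "rescore": True, "oversampling": 2.0}
}
```
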
Relevant config definitions:
- Quantization search params:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/types.rs#L498
- Scalar config:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/types.rs#L759
- Binary config:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/types.rs#L841
- Appendable support (`Binary` only):
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/types.rs#L880
- Oversampling improves quality in integration tests (`oversampling=4.0` is not worse than baseline).
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/tests/integration/hnsw_quantized_search_test.rs#L231
- Rescoring behavior is explicitly tested.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/tests/integration/hnsw_quantized_search_test.rs#L290
- Byte/half storage + quantization combinations are exercised with different minimum-accuracy thresholds.
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/tests/integration/byte_storage_quantization_test.rs#L67
- Binary scalar-query benchmark cases exist (`Scalar8bits`).
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/benches/binary.rs#L159

Qdrant also has GPU-side quantization adapters (binary/scalar/PQ) with matching shader-side postprocessing.
- GPU adapter layer:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/gpu/gpu_vector_storage/gpu_quantization.rs
- Binary shader:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/gpu/shaders/vector_storage_bq.comp
- Scalar shader:
  https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/gpu/shaders/vector_storage_sq.comp
- The P-square quantile estimator utility exists (`p_square.rs`) but is not currently wired into scalar/binary encode paths in this commit.
  Utility code: https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/quantization/src/p_square.rs
- Scalar/PQ immutable limitation: mutable quantized storage is rejected for these modes; binary supports the appendable path.