Investigation of QAT-related implementation in qdrant

Qdrant Quantization Research Report (int8 / binary)

  • Repository: qdrant/qdrant
  • Commit (SHA1): bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5
  • Scope: Implementation-level research for scalar int8 and binary quantization, with search-performance tuning guidance for float32 / float16 / uint8 vector storage.

1. Executive Summary

Qdrant implements two vector-compression (quantization) families relevant to this report:

  1. Scalar int8 quantization (internally stored as u8 codes + correction terms)
  • Core: linear min-max quantization with optional quantile clipping.
  • Distance restoration: per-query/per-vector offsets + multiplier restore score ordering for Dot/L1/L2.
  • Fast path: AVX/SSE/NEON kernels.
  2. Binary quantization (1-bit / 1.5-bit / 2-bit)
  • Core: sign/statistics-based bit encoding + XOR/popcount similarity.
  • Query side can be asymmetric (binary, scalar4bits, scalar8bits) for a speed/accuracy tradeoff.
  • Fast path: AVX512/AVX2/SSE/NEON popcount kernels.

Search quality/speed is controlled mainly by oversampling and rescore; Qdrant also auto-selects a rescoring policy when none is given: rescore defaults to true for binary and false for scalar/PQ.

2. Implementation Map (commit-pinned links)

2.1 Scalar int8 core

2.2 Binary core

2.3 Segment integration and query-time controls

3. How Scalar int8 Quantization Works

3.1 Encoding formula

MetadataInt8::encode_value maps each float value into a code:

  • code = round(clamp((value - offset) / alpha, 0..127))

Implementation:

Notes:

  • Although named Int8, this path stores codes in u8 and clamps to 0..127.
  • Vector bytes contain an extra leading f32 offset value (ADDITIONAL_CONSTANT_SIZE) per vector.
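
A minimal sketch of the encode formula above, assuming `alpha`/`offset` are derived from the (possibly quantile-clipped) value range; the helper names are illustrative, not Qdrant's actual API:

```rust
/// Illustrative min-max int8 encoder matching the formula above.
/// `alpha` and `offset` come from the observed (or clipped) value range.
fn encode_vector(values: &[f32], alpha: f32, offset: f32) -> Vec<u8> {
    values
        .iter()
        // code = round(clamp((value - offset) / alpha, 0..127))
        .map(|&v| ((v - offset) / alpha).clamp(0.0, 127.0).round() as u8)
        .collect()
}

/// Map a value range onto the 0..=127 code range.
fn params_from_range(min: f32, max: f32) -> (f32, f32) {
    ((max - min) / 127.0, min)
}
```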

3.2 Optional quantile clipping

When ScalarQuantizationConfig.quantile is set, scalar quantization can use a trimmed min/max interval from sampled vectors instead of raw global min/max.
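
A hedged sketch of how a trimmed interval can be computed from a sample; the symmetric two-tail trim is an assumption, and Qdrant's exact trimming may differ:

```rust
/// Illustrative quantile clipping: keep the central `quantile` mass of a
/// sorted sample and use its endpoints as the quantization range.
/// Assumes a reasonably large sample.
fn clipped_range(sample: &mut [f32], quantile: f32) -> (f32, f32) {
    sample.sort_by(|a, b| a.total_cmp(b));
    let n = sample.len();
    // e.g. quantile = 0.99 trims 0.5% from each tail.
    let cut = (((1.0 - quantile) / 2.0) * n as f32) as usize;
    (sample[cut], sample[n - 1 - cut])
}
```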

3.3 Distance restoration and correction terms

For Dot/L1/L2, Qdrant stores and applies correction terms (multiplier, query offset, vector offset, shift) so quantized integer math preserves usable ranking.
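
To see why integer math can preserve ranking, here is a hedged reconstruction for Dot, assuming both sides use the same `alpha`/`offset`; the grouping of terms into per-query and per-vector corrections is illustrative and may not match Qdrant's exact layout:

```rust
/// With value ≈ alpha * code + offset, the true dot product expands to
///   alpha^2 * Σ cq*cv       (fast integer dot)
/// + alpha * offset * Σ cq   (per-query correction)
/// + alpha * offset * Σ cv   (per-vector correction, precomputable)
/// + dim * offset^2          (constant shift)
fn restored_dot(codes_q: &[u8], codes_v: &[u8], alpha: f32, offset: f32) -> f32 {
    let int_dot: u32 = codes_q
        .iter()
        .zip(codes_v)
        .map(|(&a, &b)| a as u32 * b as u32)
        .sum();
    let sum_q: u32 = codes_q.iter().map(|&c| c as u32).sum();
    let sum_v: u32 = codes_v.iter().map(|&c| c as u32).sum();
    alpha * alpha * int_dot as f32
        + alpha * offset * (sum_q + sum_v) as f32
        + codes_q.len() as f32 * offset * offset
}
```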

3.4 SIMD kernels

Scalar scoring dispatches to architecture-optimized routines if available:
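
A hedged sketch of the dispatch pattern; the kernel names are illustrative stubs, and the real kernels use AVX/SSE/NEON intrinsics:

```rust
/// Runtime dispatch: use an AVX2 kernel when the CPU supports it,
/// otherwise fall back to plain scalar code.
fn dot_u8(a: &[u8], b: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { dot_u8_avx2(a, b) };
    }
    dot_u8_scalar(a, b)
}

fn dot_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| x as u32 * y as u32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    // Stub: a real kernel would use _mm256 integer intrinsics here.
    dot_u8_scalar(a, b)
}
```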

4. How Binary Quantization Works

4.1 Storage encodings

Binary quantization supports three encodings:

TwoBits/OneAndHalfBits depend on per-dimension mean/stddev statistics:
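
A hedged sketch of the encodings: one-bit thresholds on sign, and the two-bit bucketing below uses mean ± stddev thresholds, which matches the statistics dependence stated above but is an illustrative reconstruction, not Qdrant's exact code assignment:

```rust
/// One-bit encoding: one sign bit per dimension, packed into bytes.
/// (Bit order within a byte is illustrative.)
fn encode_one_bit(values: &[f32]) -> Vec<u8> {
    let mut packed = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        if v > 0.0 {
            packed[i / 8] |= 1 << (i % 8);
        }
    }
    packed
}

/// Two-bit bucketing for a single dimension: three thresholds derived from
/// per-dimension statistics cut the axis into four 2-bit codes.
fn encode_two_bits_dim(value: f32, mean: f32, stddev: f32) -> u8 {
    match value {
        v if v < mean - stddev => 0b00,
        v if v < mean => 0b01,
        v if v < mean + stddev => 0b10,
        _ => 0b11,
    }
}
```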

4.2 Query encoding (symmetric/asymmetric)

Binary queries can be encoded in multiple ways:

  • SameAsStorage (Binary)
  • Scalar4bits
  • Scalar8bits

Implementation:

The code comment cites arXiv 2405.12497 for this transposition optimization.
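
A hedged sketch of the asymmetric idea (without the transposed memory layout from the cited paper): the stored vector stays 1-bit, interpreted as ±1, while the query keeps higher precision, so the dot product reduces to a sum over set bits. A real query would use scalar 4/8-bit codes; plain f32 is used here for clarity:

```rust
/// Asymmetric dot: with stored bits b ∈ {0,1} read as s = 2b - 1 ∈ {-1,+1},
///   dot(q, s) = 2 * Σ_{b=1} q_i - Σ q_i,
/// so scoring one vector only needs the sum of query entries at set bits.
fn asymmetric_dot(query: &[f32], storage_bits: &[u8]) -> f32 {
    let total: f32 = query.iter().sum();
    let mut set_sum = 0.0f32;
    for (i, &q) in query.iter().enumerate() {
        if storage_bits[i / 8] & (1 << (i % 8)) != 0 {
            set_sum += q;
        }
    }
    2.0 * set_sum - total
}
```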

4.3 Scoring function

The core metric is XOR-popcount based; the final score mapping depends on the distance metric and an invert flag.
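
The symmetric case reduces to a Hamming distance over the packed codes, as in this minimal sketch:

```rust
/// XOR/popcount core: count differing bits between two packed bit vectors.
/// Qdrant then maps this raw count to a score based on the distance metric
/// and the invert flag.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}
```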

5. How float32 / float16 / uint8 Inputs Reach Quantization

5.1 Datatype-specific path

The quantized scorer builder chooses the metric implementation according to VectorStorageDatatype:

5.2 Vector/query preprocessing chain

PrimitiveVectorElement::quantization_preprocess behavior:
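
The one preprocessing detail asserted elsewhere in this report is the uint8 centering used by the binary path (Section 7); a minimal sketch of that step, with an illustrative function name:

```rust
/// Center uint8 inputs around zero so the sign threshold used by 1-bit
/// encoding is meaningful: raw u8 values are never negative.
fn preprocess_u8_for_binary(values: &[u8]) -> Vec<f32> {
    values.iter().map(|&v| v as f32 - 127.0).collect()
}
```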

6. Search-Phase Controls That Determine Performance

6.1 Quantized search enablement

Quantized search is used when:

  • quantized storage exists,
  • ignore is false,
  • exact is false.

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs#L15
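
The condition above as a boolean sketch (parameter names illustrative, not Qdrant's actual fields):

```rust
fn quantized_search_enabled(has_quantized_storage: bool, ignore: bool, exact: bool) -> bool {
    has_quantized_storage && !ignore && !exact
}
```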

6.2 Oversampling and rescoring
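
With oversampling > 1.0, the quantized index retrieves roughly limit × oversampling candidates; with rescore=true those candidates are re-scored against the original vectors before the result is truncated back to limit. A hedged sketch of that flow, where the two helper functions are illustrative stubs, not Qdrant internals:

```rust
/// Oversample-then-rescore: fetch extra candidates from the quantized
/// index, optionally re-rank them with exact scores, then truncate.
fn search_with_oversampling(
    query: &[f32],
    limit: usize,
    oversampling: f32,
    rescore: bool,
) -> Vec<(u32, f32)> {
    // Fetch more candidates than requested from the quantized index.
    let candidate_limit = ((limit as f32 * oversampling).ceil() as usize).max(limit);
    let mut hits = quantized_index_search(query, candidate_limit);
    if rescore {
        // Re-score candidates with the original float vectors, then re-rank.
        for hit in hits.iter_mut() {
            hit.1 = original_vector_score(query, hit.0);
        }
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    }
    hits.truncate(limit);
    hits
}

// Stubs so the sketch is self-contained.
fn quantized_index_search(_query: &[f32], _limit: usize) -> Vec<(u32, f32)> {
    Vec::new()
}
fn original_vector_score(_query: &[f32], _id: u32) -> f32 {
    0.0
}
```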

6.3 Exact search interaction

HNSW exact mode forcibly disables quantization (sets ignore=true).

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/hnsw.rs#L1360

6.4 Strict mode guardrails

Strict mode can cap quantization.oversampling.

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/collection/src/operations/verification/mod.rs#L261

7. Practical Guidance for Best Performance with Good Recall

These are implementation-derived recommendations (not dataset-agnostic guarantees).

  1. For float32 / float16 with scalar int8
  • Start with scalar int8 (type=int8) and quantile around 0.95-0.99 when embeddings have outliers.
  • Keep rescore=false for lowest latency.
  • If recall drops, set oversampling to 1.5-3.0, then optionally enable rescore=true (see the config sketch after this list).
  2. For binary quantization
  • Choose encoding=one_bit for max speed/memory reduction.
  • If quality is insufficient, try two_bits or one_and_half_bits.
  • For better recall at some CPU cost, use asymmetric query encoding (scalar8bits first, then scalar4bits as cheaper fallback).
  • Keep rescore=true (Qdrant default for binary) unless latency budget is extremely tight.
  3. For uint8 source vectors
  • Binary path uses centering (value - 127) before thresholding; avoid assuming raw uint8 sign semantics.
  • Validate recall separately for dot vs cosine because test thresholds differ significantly by distance/query type.
  4. RAM vs disk
  • always_ram=true can reduce IO-latency variance when original vectors/quantized blobs are on disk.
  • Binary is the only quantization mode that supports appendable/mutable quantized storage in this revision.
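
A hedged configuration sketch tying these recommendations together, using the option names cited in this report; the struct shapes are illustrative, not Qdrant's actual Rust types:

```rust
/// Illustrative shapes only; field names mirror the options discussed above.
struct ScalarQuantizationConfig {
    r#type: &'static str,  // "int8"
    quantile: Option<f32>, // e.g. Some(0.99) to clip outliers
    always_ram: bool,      // keep quantized codes in RAM
}

struct QuantizationSearchParams {
    ignore: bool,      // true bypasses quantized storage entirely
    rescore: bool,     // re-score candidates with original vectors
    oversampling: f32, // fetch limit * oversampling candidates first
}

/// Recommendation 1 as a starting point: int8 with quantile clipping and
/// lowest-latency search settings; raise `oversampling` to 1.5-3.0 and
/// flip `rescore` on if recall is insufficient.
fn scalar_int8_starting_point() -> (ScalarQuantizationConfig, QuantizationSearchParams) {
    (
        ScalarQuantizationConfig {
            r#type: "int8",
            quantile: Some(0.99),
            always_ram: true,
        },
        QuantizationSearchParams {
            ignore: false,
            rescore: false,
            oversampling: 1.0,
        },
    )
}
```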

Relevant config definitions:

8. Validation Clues from Tests/Benches

9. Optional: GPU Quantization Path

Qdrant also has GPU-side quantization adapters (binary/scalar/PQ) with matching shader-side postprocessing.

10. Important Nuances in This Revision
