Investigation of QAT-related implementation in qdrant

Qdrant Quantization Research Report (int8 / binary)

  • Repository: qdrant/qdrant
  • Commit (SHA1): bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5
  • Scope: Implementation-level research for scalar int8 and binary quantization, with search-performance tuning guidance for float32 / float16 / uint8 vector storage.

1. Executive Summary

Qdrant implements two vector-compression (quantization) families relevant to this report:

  1. Scalar int8 quantization (internally stored as u8 codes + correction terms)
  • Core: linear min-max quantization with optional quantile clipping.
  • Distance restoration: per-query/per-vector offsets + multiplier restore score ordering for Dot/L1/L2.
  • Fast path: AVX/SSE/NEON kernels.
  2. Binary quantization (1-bit / 1.5-bit / 2-bit)
  • Core: sign/statistics-based bit encoding + XOR/popcount similarity.
  • Query side can be asymmetric (binary, scalar4bits, scalar8bits) for a speed/accuracy tradeoff.
  • Fast path: AVX512/AVX2/SSE/NEON popcount kernels.

Search quality/speed is controlled mainly by oversampling and rescore; Qdrant also auto-selects a rescoring policy when none is given: rescore defaults to true for binary and false for scalar/PQ.

2. Implementation Map (commit-pinned links)

2.1 Scalar int8 core

2.2 Binary core

2.3 Segment integration and query-time controls

3. How Scalar int8 Quantization Works

3.1 Encoding formula

MetadataInt8::encode_value maps each float value into a code:

  • code = round(clamp((value - offset) / alpha, 0..127))

Implementation:

Notes:

  • Although named Int8, this path stores codes in u8 and clamps to 0..127.
  • Vector bytes contain an extra leading f32 offset value (ADDITIONAL_CONSTANT_SIZE) per vector.
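
A minimal sketch of the encode formula above, assuming `alpha`/`offset` are derived from the (possibly quantile-clipped) value range; the helper names are illustrative, not Qdrant's actual API:

```rust
/// Illustrative min-max int8 encoder matching the formula above.
/// `alpha` and `offset` come from the observed (or clipped) value range.
fn encode_vector(values: &[f32], alpha: f32, offset: f32) -> Vec<u8> {
    values
        .iter()
        // code = round(clamp((value - offset) / alpha, 0..127))
        .map(|&v| ((v - offset) / alpha).clamp(0.0, 127.0).round() as u8)
        .collect()
}

/// Map a value range onto the 0..=127 code range.
fn params_from_range(min: f32, max: f32) -> (f32, f32) {
    ((max - min) / 127.0, min)
}
```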

3.2 Optional quantile clipping

When ScalarQuantizationConfig.quantile is set, scalar quantization can use a trimmed min/max interval from sampled vectors instead of raw global min/max.
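
A hedged sketch of how a trimmed interval can be computed from a sample; the symmetric two-tail trim is an assumption, and Qdrant's exact trimming may differ:

```rust
/// Illustrative quantile clipping: keep the central `quantile` mass of a
/// sorted sample and use its endpoints as the quantization range.
/// Assumes a reasonably large sample.
fn clipped_range(sample: &mut [f32], quantile: f32) -> (f32, f32) {
    sample.sort_by(|a, b| a.total_cmp(b));
    let n = sample.len();
    // e.g. quantile = 0.99 trims 0.5% from each tail.
    let cut = (((1.0 - quantile) / 2.0) * n as f32) as usize;
    (sample[cut], sample[n - 1 - cut])
}
```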

3.3 Distance restoration and correction terms

For Dot/L1/L2, Qdrant stores and applies correction terms (multiplier, query offset, vector offset, shift) so quantized integer math preserves usable ranking.
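
To see why integer math can preserve ranking, here is a hedged reconstruction for Dot, assuming both sides use the same `alpha`/`offset`; the grouping of terms into per-query and per-vector corrections is illustrative and may not match Qdrant's exact layout:

```rust
/// With value ≈ alpha * code + offset, the true dot product expands to
///   alpha^2 * Σ cq*cv       (fast integer dot)
/// + alpha * offset * Σ cq   (per-query correction)
/// + alpha * offset * Σ cv   (per-vector correction, precomputable)
/// + dim * offset^2          (constant shift)
fn restored_dot(codes_q: &[u8], codes_v: &[u8], alpha: f32, offset: f32) -> f32 {
    let int_dot: u32 = codes_q
        .iter()
        .zip(codes_v)
        .map(|(&a, &b)| a as u32 * b as u32)
        .sum();
    let sum_q: u32 = codes_q.iter().map(|&c| c as u32).sum();
    let sum_v: u32 = codes_v.iter().map(|&c| c as u32).sum();
    alpha * alpha * int_dot as f32
        + alpha * offset * (sum_q + sum_v) as f32
        + codes_q.len() as f32 * offset * offset
}
```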

3.4 SIMD kernels

Scalar scoring dispatches to architecture-optimized routines if available:
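
A hedged sketch of the dispatch pattern; the kernel names are illustrative stubs, and the real kernels use AVX/SSE/NEON intrinsics:

```rust
/// Runtime dispatch: use an AVX2 kernel when the CPU supports it,
/// otherwise fall back to plain scalar code.
fn dot_u8(a: &[u8], b: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { dot_u8_avx2(a, b) };
    }
    dot_u8_scalar(a, b)
}

fn dot_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| x as u32 * y as u32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    // Stub: a real kernel would use _mm256 integer intrinsics here.
    dot_u8_scalar(a, b)
}
```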

4. How Binary Quantization Works

4.1 Storage encodings

Binary quantization supports three encodings:

TwoBits/OneAndHalfBits depend on per-dimension mean/stddev statistics:
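
A hedged sketch of the encodings: one-bit thresholds on sign, and the two-bit bucketing below uses mean ± stddev thresholds, which matches the statistics dependence stated above but is an illustrative reconstruction, not Qdrant's exact code assignment:

```rust
/// One-bit encoding: one sign bit per dimension, packed into bytes.
/// (Bit order within a byte is illustrative.)
fn encode_one_bit(values: &[f32]) -> Vec<u8> {
    let mut packed = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        if v > 0.0 {
            packed[i / 8] |= 1 << (i % 8);
        }
    }
    packed
}

/// Two-bit bucketing for a single dimension: three thresholds derived from
/// per-dimension statistics cut the axis into four 2-bit codes.
fn encode_two_bits_dim(value: f32, mean: f32, stddev: f32) -> u8 {
    match value {
        v if v < mean - stddev => 0b00,
        v if v < mean => 0b01,
        v if v < mean + stddev => 0b10,
        _ => 0b11,
    }
}
```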

4.2 Query encoding (symmetric/asymmetric)

Binary queries can be encoded in multiple ways:

  • SameAsStorage (Binary)
  • Scalar4bits
  • Scalar8bits

Implementation:

The code comment cites arXiv 2405.12497 for this transposition optimization.
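
A hedged sketch of the asymmetric idea (without the transposed memory layout from the cited paper): the stored vector stays 1-bit, interpreted as ±1, while the query keeps higher precision, so the dot product reduces to a sum over set bits. A real query would use scalar 4/8-bit codes; plain f32 is used here for clarity:

```rust
/// Asymmetric dot: with stored bits b ∈ {0,1} read as s = 2b - 1 ∈ {-1,+1},
///   dot(q, s) = 2 * Σ_{b=1} q_i - Σ q_i,
/// so scoring one vector only needs the sum of query entries at set bits.
fn asymmetric_dot(query: &[f32], storage_bits: &[u8]) -> f32 {
    let total: f32 = query.iter().sum();
    let mut set_sum = 0.0f32;
    for (i, &q) in query.iter().enumerate() {
        if storage_bits[i / 8] & (1 << (i % 8)) != 0 {
            set_sum += q;
        }
    }
    2.0 * set_sum - total
}
```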

4.3 Scoring function

The core metric is XOR-popcount based; the final score mapping depends on the distance metric and an invert flag.
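
The symmetric case reduces to a Hamming distance over the packed codes, as in this minimal sketch:

```rust
/// XOR/popcount core: count differing bits between two packed bit vectors.
/// Qdrant then maps this raw count to a score based on the distance metric
/// and the invert flag.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}
```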

5. How float32 / float16 / uint8 Inputs Reach Quantization

5.1 Datatype-specific path

The quantized scorer builder chooses the metric implementation according to VectorStorageDatatype:

5.2 Vector/query preprocessing chain

PrimitiveVectorElement::quantization_preprocess behavior:
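
The one preprocessing detail asserted elsewhere in this report is the uint8 centering used by the binary path (Section 7); a minimal sketch of that step, with an illustrative function name:

```rust
/// Center uint8 inputs around zero so the sign threshold used by 1-bit
/// encoding is meaningful: raw u8 values are never negative.
fn preprocess_u8_for_binary(values: &[u8]) -> Vec<f32> {
    values.iter().map(|&v| v as f32 - 127.0).collect()
}
```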

6. Search-Phase Controls That Determine Performance

6.1 Quantized search enablement

Quantized search is used when:

  • quantized storage exists,
  • ignore is false,
  • exact is false.

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/vector_index_search_common.rs#L15
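
The condition above as a boolean sketch (parameter names illustrative, not Qdrant's actual fields):

```rust
fn quantized_search_enabled(has_quantized_storage: bool, ignore: bool, exact: bool) -> bool {
    has_quantized_storage && !ignore && !exact
}
```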

6.2 Oversampling and rescoring
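
With oversampling > 1.0, the quantized index retrieves roughly limit × oversampling candidates; with rescore=true those candidates are re-scored against the original vectors before the result is truncated back to limit. A hedged sketch of that flow, where the two helper functions are illustrative stubs, not Qdrant internals:

```rust
/// Oversample-then-rescore: fetch extra candidates from the quantized
/// index, optionally re-rank them with exact scores, then truncate.
fn search_with_oversampling(
    query: &[f32],
    limit: usize,
    oversampling: f32,
    rescore: bool,
) -> Vec<(u32, f32)> {
    // Fetch more candidates than requested from the quantized index.
    let candidate_limit = ((limit as f32 * oversampling).ceil() as usize).max(limit);
    let mut hits = quantized_index_search(query, candidate_limit);
    if rescore {
        // Re-score candidates with the original float vectors, then re-rank.
        for hit in hits.iter_mut() {
            hit.1 = original_vector_score(query, hit.0);
        }
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    }
    hits.truncate(limit);
    hits
}

// Stubs so the sketch is self-contained.
fn quantized_index_search(_query: &[f32], _limit: usize) -> Vec<(u32, f32)> {
    Vec::new()
}
fn original_vector_score(_query: &[f32], _id: u32) -> f32 {
    0.0
}
```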

6.3 Exact search interaction

HNSW exact mode forcibly disables quantization (sets ignore=true).

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/segment/src/index/hnsw_index/hnsw.rs#L1360

6.4 Strict mode guardrails

Strict mode can cap quantization.oversampling.

Implementation:
https://github.com/qdrant/qdrant/blob/bdd4bb5180f4a4fb378dd3dedf5c307e8a8b74e5/lib/collection/src/operations/verification/mod.rs#L261

7. Practical Guidance for Best Performance with Good Recall

These are implementation-derived recommendations (not dataset-agnostic guarantees).

  1. For float32 / float16 with scalar int8
  • Start with scalar int8 (type=int8) and quantile around 0.95-0.99 when embeddings have outliers.
  • Keep rescore=false for lowest latency.
  • If recall drops, set oversampling to 1.5-3.0, then optionally enable rescore=true (see the config sketch after this list).
  2. For binary quantization
  • Choose encoding=one_bit for max speed/memory reduction.
  • If quality is insufficient, try two_bits or one_and_half_bits.
  • For better recall at some CPU cost, use asymmetric query encoding (scalar8bits first, then scalar4bits as cheaper fallback).
  • Keep rescore=true (Qdrant default for binary) unless latency budget is extremely tight.
  3. For uint8 source vectors
  • Binary path uses centering (value - 127) before thresholding; avoid assuming raw uint8 sign semantics.
  • Validate recall separately for dot vs cosine because test thresholds differ significantly by distance/query type.
  4. RAM vs disk
  • always_ram=true can reduce IO-latency variance when original vectors/quantized blobs are on disk.
  • Binary is the only quantization mode that supports appendable/mutable quantized storage in this revision.
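
A hedged configuration sketch tying these recommendations together, using the option names cited in this report; the struct shapes are illustrative, not Qdrant's actual Rust types:

```rust
/// Illustrative shapes only; field names mirror the options discussed above.
struct ScalarQuantizationConfig {
    r#type: &'static str,  // "int8"
    quantile: Option<f32>, // e.g. Some(0.99) to clip outliers
    always_ram: bool,      // keep quantized codes in RAM
}

struct QuantizationSearchParams {
    ignore: bool,      // true bypasses quantized storage entirely
    rescore: bool,     // re-score candidates with original vectors
    oversampling: f32, // fetch limit * oversampling candidates first
}

/// Recommendation 1 as a starting point: int8 with quantile clipping and
/// lowest-latency search settings; raise `oversampling` to 1.5-3.0 and
/// flip `rescore` on if recall is insufficient.
fn scalar_int8_starting_point() -> (ScalarQuantizationConfig, QuantizationSearchParams) {
    (
        ScalarQuantizationConfig {
            r#type: "int8",
            quantile: Some(0.99),
            always_ram: true,
        },
        QuantizationSearchParams {
            ignore: false,
            rescore: false,
            oversampling: 1.0,
        },
    )
}
```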

Relevant config definitions:

8. Validation Clues from Tests/Benches

9. Optional: GPU Quantization Path

Qdrant also has GPU-side quantization adapters (binary/scalar/PQ) with matching shader-side postprocessing.

10. Important Nuances in This Revision
