@oneryalcin
Last active February 23, 2026 22:09
Fine-Tuning Sparse Encoders for Neural Sparse Retrieval

A Complete Reproducible Guide (SPLADE, OpenSearch Neural Sparse, CSR)

Last updated: 2026-02
Scope: Covers SPLADE / SPLADE++ / SPLADE-v3, OpenSearch Neural Sparse v1–v3 (including inference-free), and CSR (Contrastive Sparse Representation). Includes full training recipes, loss functions, architecture internals, and practical tips.


Table of Contents

  1. What Are Sparse Encoders?
  2. Model Families
  3. Architecture Internals
  4. Training Objectives & Loss Functions
  5. Training Data
  6. Fine-Tuning Recipes
  7. Inference-Free (Asymmetric) Fine-Tuning
  8. Hard Negative Mining
  9. Practical Tips & Hyperparameter Guide
  10. Decision Guide
  11. References

1. What Are Sparse Encoders?

Sparse encoders map text to high-dimensional vectors where most values are zero. Each non-zero dimension corresponds to a vocabulary token; its weight represents that token's importance in the text.

"neural retrieval" → {retrieval: 2.1, neural: 1.9, search: 0.8, information: 0.3, ...}
                      (vocab_size = 30,522; ~95-99% of dims are zero)

Why sparse over dense?

| Property | Dense (e.g. SBERT) | Sparse (SPLADE) |
|---|---|---|
| Vector dim | 768 | 30,522 |
| Non-zero dims | all | ~100–300 |
| Inverted index compatible | ✗ | ✓ |
| BM25-level latency | ✗ | ✓ |
| Semantic expansion | implicit | explicit (MLM head) |
| Interpretable | ✗ | ✓ (readable tokens) |
| BEIR avg NDCG@10 | ~0.49–0.52 | ~0.50–0.55 |

The killer feature: sparse vectors are compatible with Lucene/OpenSearch inverted indexes. Retrieval is a standard dot-product scan — the same data structure BM25 uses — giving near-BM25 latency at neural relevance quality.
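Concretely, a sparse vector can be stored as a {token: weight} map, and scoring reduces to a dot product over shared tokens — exactly the computation an inverted index performs. A toy sketch (all tokens and weights are made up for illustration):

```python
# Toy illustration: sparse vectors as {token: weight} dicts; retrieval is a
# dot product over shared tokens, which is what an inverted index computes.
# All token strings and weights here are made up for illustration.

def sparse_dot(q: dict[str, float], d: dict[str, float]) -> float:
    # Only tokens present in both vectors contribute, so scoring a query
    # touches just the posting lists of its active tokens.
    return sum(w * d[t] for t, w in q.items() if t in d)

query = {"retrieval": 2.1, "neural": 1.9, "search": 0.8}
doc_a = {"retrieval": 1.5, "sparse": 1.2, "index": 0.9}
doc_b = {"lexical": 1.0, "bm25": 0.7}

print(round(sparse_dot(query, doc_a), 2))  # 3.15 (2.1 * 1.5; shares only "retrieval")
print(sparse_dot(query, doc_b))            # 0.0 (no shared tokens)
```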

The magic comes from the MLM head: each input token position projects onto the entire vocabulary, enabling the model to activate semantically related terms that never appeared in the original text (query/document expansion).


2. Model Families

2.1 SPLADE Family (NAVER Labs)

Symmetric models — both query and document run through a full neural encoder.

| Model | HuggingFace ID | Params | MS MARCO MRR@10 | BEIR Avg NDCG@10 |
|---|---|---|---|---|
| SPLADE-v2 | naver/splade-v2-max | 110M | 36.8 | 0.497 |
| SPLADE-v2-distill | naver/splade-v2-distilsplade-max | 66M | 36.1 | — |
| SPLADE++ SelfDistil | naver/splade-cocondenser-selfdistil | 110M | 37.6 | 0.510 |
| SPLADE++ EnsembleDistil | naver/splade-cocondenser-ensembledistil | 110M | 38.3 | 0.524 |
| SPLADE-v3 | naver/splade-v3 | 110M | 40.2 | ~0.53 |
| SPLADE-v3-Doc | naver/splade-v3-doc | 110M | — | ~0.54 |

Key papers: SPLADE v1, SPLADE v2, SPLADE++, and SPLADE-v3 (full citations in the References section).

2.2 OpenSearch Neural Sparse (Amazon)

Asymmetric models — document is neural; query uses IDF lookup (no inference at search time).

| Model | HuggingFace ID | Params | Base | BEIR NDCG@10 | Avg FLOPs |
|---|---|---|---|---|---|
| v1 | opensearch-project/opensearch-neural-sparse-encoding-v1 | 133M | BERT-base | 0.524 | 11.4 |
| doc-v2-mini | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini | 22M | MiniLM | 0.497 | 0.7 |
| doc-v2-distill | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | 67M | DistilBERT | 0.504 | 1.8 |
| doc-v2 | opensearch-project/opensearch-neural-sparse-encoding-doc-v2 | 133M | BERT-base | 0.515 | 3.3 |
| doc-v3-distill | opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | 67M | DistilBERT | 0.517 | 1.8 |
| doc-v3-gte | opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | 133M | GTE-base | 0.546 | ~2 |
| multilingual-v1 | opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | 160M | mBERT | 0.629 (MIRACL) | 1.3 |

Key papers: the OpenSearch inference-free neural sparse papers listed in the References section.

2.3 CSR — Contrastive Sparse Representation

Adds a sparse autoencoder on top of an existing dense model, sparsifying its output. Useful when you already have a fine-tuned dense model and want sparse retrieval without retraining from scratch.


3. Architecture Internals

SPLADE / OpenSearch v1 Architecture

Input tokens: ["what", "is", "SPLADE", "?"]
       ↓
BertForMaskedLM (or DistilBertForMaskedLM)
       ↓
token-level logits: shape [batch, seq_len, vocab_size]   # e.g. [1, 4, 30522]
       ↓
max pooling over seq_len dimension → [batch, vocab_size]  # collapse to one vector
       ↓
log(1 + ReLU(x))                                         # activation + log-saturation
       ↓
sparse vector: [batch, vocab_size]                        # ~95-99% zeros

Why max pooling? Aggregates the "best evidence" for each vocabulary token across all input positions. Each position can activate vocabulary tokens it didn't literally contain — this is the expansion.

Why log(1 + ReLU(x))? ReLU kills negatives (forcing sparsity). Log saturates large values (preventing a few dimensions from dominating), providing a soft upper bound analogous to BM25's term frequency saturation.
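The pooling-plus-activation step can be mimicked in plain Python. A toy sketch with a 3-token vocabulary and made-up logits, mirroring the tensor ops in the diagram above (real models do this over a [seq_len, 30522] logit matrix):

```python
import math

# Toy sketch of SPLADE pooling: max over positions, then log(1 + ReLU(x)).
# Logits are made up; a real model emits a [seq_len, vocab_size] matrix.
logits = [
    [2.0, -1.0, -0.5],  # position 0: per-vocab-token logits
    [0.3,  4.0, -2.0],  # position 1
]

def splade_pool(logits: list[list[float]]) -> list[float]:
    vocab_size = len(logits[0])
    pooled = [max(pos[j] for pos in logits) for j in range(vocab_size)]  # max over seq_len
    return [math.log1p(max(x, 0.0)) for x in pooled]                     # log(1 + ReLU(x))

vec = splade_pool(logits)
# Dimensions that are negative at every position come out exactly 0.0 —
# this is where the sparsity comes from.
```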

SPLADE-v3 double-log activation:

# v2: log(1 + ReLU(x))
# v3: log(1 + log(1 + ReLU(x)))   ← stronger sparsification, less aggressive FLOPS penalty needed

Inference-Free (OpenSearch v2/v3) Query Encoding

At query time, no neural inference is run:

def encode_query_inference_free(query_tokens: list[int], idf: dict[int, float]) -> dict[int, float]:
    """
    idf.json: {token_id_str: idf_value} pre-computed from training corpus (MS MARCO).
    Returns sparse vector as {token_id: weight} dict.
    """
    sparse = {}
    for token_id in set(query_tokens):  # deduplicate
        if token_id in idf:
            sparse[token_id] = idf[token_id]
        else:
            sparse[token_id] = 1.0  # default IDF for unknown tokens
    return sparse

The idf.json file is shipped alongside each v2/v3 model on HuggingFace. You can recompute it from your own corpus for domain adaptation.

CSR Architecture

Dense model (frozen or fine-tuned) → dense vector [batch, 768]
       ↓
SparseAutoEncoder:
    encoder: Linear(768 → 4*768) + ReLU + TopK(k=256)   ← only top-k activations kept
    decoder: Linear(4*768 → 768)                         ← reconstruction loss
       ↓
sparse vector: [batch, 3072]   (4x expansion, mostly zeros)
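The TopK step is what enforces sparsity in CSR: only the k largest activations survive. A minimal pure-Python sketch of that one operation (dimensions shrunk for illustration; a real CSR encoder applies it to the 3072-dim hidden vector with k≈256):

```python
# Toy sketch of CSR's TopK sparsification: keep the k largest (post-ReLU)
# activations, zero everything else. Dimensions shrunk for illustration.
def relu_topk(hidden: list[float], k: int) -> list[float]:
    activated = [max(x, 0.0) for x in hidden]            # ReLU
    threshold = sorted(activated, reverse=True)[k - 1]   # k-th largest value
    kept = 0
    out = []
    for x in activated:
        if x >= threshold and x > 0.0 and kept < k:      # keep at most k non-zeros
            out.append(x)
            kept += 1
        else:
            out.append(0.0)
    return out

sparse = relu_topk([0.9, -0.3, 2.5, 0.1, 1.4, -1.0], k=2)
# → [0.0, 0.0, 2.5, 0.0, 1.4, 0.0]: only the two largest activations survive
```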

4. Training Objectives & Loss Functions

4.1 FLOPS Regularization

The core mechanism that enforces sparsity. Penalizes tokens that activate with high magnitude on average across the training batch:

FLOPS(X) = Σ_{j ∈ vocab}  [ (1/N) Σ_{i=1}^{N}  w_j(x_i) ]^2

where:
  N = batch size
  w_j(x_i) = weight of vocab token j in sparse vector of example i

Intuition: If token "the" activates in every document, its average activation is high → large penalty → model learns to suppress it. Rare, informative tokens activate infrequently → low average → preserved.
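In code, the FLOPS term is just mean-then-square over the batch dimension. A pure-Python sketch with a 2-token toy vocabulary (in a real trainer this is one line of tensor ops over [batch, vocab_size] activations):

```python
# Toy sketch of the FLOPS regularizer: square of the per-token mean
# activation, summed over the vocabulary. Batch values are made up.
def flops_penalty(batch: list[list[float]]) -> float:
    n = len(batch)
    vocab_size = len(batch[0])
    penalty = 0.0
    for j in range(vocab_size):
        mean_j = sum(row[j] for row in batch) / n  # average activation of token j
        penalty += mean_j ** 2
    return penalty

# Token 0 ("the") fires in every example → heavily penalized:
print(flops_penalty([[1.0, 0.0], [1.0, 0.0]]))  # 1.0
# Same total activation mass, spread over different tokens → half the penalty:
print(flops_penalty([[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```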

Combined training loss:

L_total = L_rank + λ_q · FLOPS(queries) + λ_d · FLOPS(docs)

Typical λ values:

| Framework / Config | λ_q | λ_d |
|---|---|---|
| SPLADE original | 0.06 | 0.02 |
| Sentence Transformers (SpladeLoss) | 5e-5 | 3e-5 |
| OpenSearch tuning sample (InfoNCE) | 0.05 | 0.05 |
| OpenSearch tuning sample (KD) | — | 0.002 |
| OpenSearch v2 pre-training phase | — | 1e-7 |
| OpenSearch v2 fine-tuning phase | — | 0.02 |

Warning: λ too high → sparse vectors collapse to near-zero (no retrieval signal). λ too low → dense-like vectors (slow index, no sparsity benefit). Always monitor average active dims during training.

Two-phase lambda schedule (OpenSearch v2 approach, strongly recommended for training from scratch):

Phase 1 (large corpus, weak labels): λ_d = 1e-7  → focus on learning relevance
Phase 2 (MS MARCO / domain data):   λ_d = 0.02  → enforce sparsity

4.2 IDF-Aware Scoring (OpenSearch v2/v3)

Instead of raw dot product, scores are IDF-weighted:

score(q, d) = Σ_{t ∈ vocab}  idf(t) · q_t · d_t

Why? IDF(t) is large for rare/informative tokens. Multiplying by IDF amplifies the gradient signal for these tokens in the ranking loss, teaching the model to preserve them. Common tokens (low IDF) are simultaneously pushed down by FLOPS regularization. The two forces complement each other.

IDF is pre-computed from MS MARCO corpus. Unseen tokens default to idf = 1.0.
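With sparse vectors as {token_id: weight} dicts, the IDF-weighted score is a one-line change to the plain dot product. A sketch (token IDs, weights, and IDF values here are made up):

```python
def idf_score(query: dict[int, float], doc: dict[int, float],
              idf: dict[int, float]) -> float:
    # IDF-weighted dot product; unseen tokens fall back to idf = 1.0.
    return sum(w * doc[t] * idf.get(t, 1.0) for t, w in query.items() if t in doc)

idf = {101: 0.1, 2054: 3.2}        # 101 = common token, 2054 = rare token (made up)
query = {101: 1.0, 2054: 1.0}
doc = {101: 2.0, 2054: 2.0}

print(round(idf_score(query, doc, idf), 1))  # 6.6 — the rare token dominates the score
```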

4.3 Ranking Losses

InfoNCE / Multiple Negatives Ranking Loss:

L_InfoNCE = -log( exp(score(q, p) / τ) / Σ_j exp(score(q, d_j) / τ) )

where:
  p = positive document
  d_j = all docs in batch (positive + in-batch negatives + mined hard negatives)
  τ = temperature (default 0.02 in Sentence Transformers)

In Sentence Transformers: SparseMultipleNegativesRankingLoss

Requirements: batch size ≥ 16 (more in-batch negatives = harder training signal). Use BatchSamplers.NO_DUPLICATES to ensure each query appears once per batch.

MarginMSE (used in SPLADE-v3 in combination with KL-div):

L_MarginMSE = MSE( score(q,p) - score(q,n),  teacher_score(q,p) - teacher_score(q,n) )
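MarginMSE compares score differences rather than absolute scores, so the student only has to reproduce the teacher's positive-vs-negative margin, not its scale. A minimal sketch (all scores are made up):

```python
# Toy sketch of MarginMSE: the student matches the teacher's *margin*
# between positive and negative, not the raw scores. Values are made up.
def margin_mse(student_pos: float, student_neg: float,
               teacher_pos: float, teacher_neg: float) -> float:
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2

# Student scores are shifted by +5 but the margin (3.0) matches → zero loss:
print(margin_mse(8.0, 5.0, 3.0, 0.0))  # 0.0
# Margin off by 1 → loss of 1:
print(margin_mse(8.0, 6.0, 3.0, 0.0))  # 1.0
```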

4.4 Knowledge Distillation

Distillation consistently outperforms pointwise/pairwise training. The teacher (cross-encoder or ensemble retriever) provides soft scores for a list of candidate documents per query.

KL Divergence Distillation:

L_KL = KL( softmax(teacher_scores / τ_t) || softmax(student_scores / τ_s) )

In Sentence Transformers: SparseDistillKLDivLoss

Data format:

{"query": "what is SPLADE?",
 "docs": ["SPLADE is a sparse...", "Dense models use...", "BM25 is a..."],
 "scores": [9.2, 1.1, 4.5]}

4.5 Ensemble Heterogeneous Distillation

OpenSearch's key contribution — avoid expensive cross-encoder inference by combining two cheap retrievers:

Teacher 1 (dense):  Alibaba-NLP/gte-large-en-v1.5
Teacher 2 (sparse): opensearch-project/opensearch-neural-sparse-encoding-v1

For each query:
  scores_dense  = dense_teacher.score(query, [doc_1, ..., doc_N])
  scores_sparse = sparse_teacher.score(query, [doc_1, ..., doc_N])

  norm_dense  = min_max_scale(scores_dense)   # → [0, 1]
  norm_sparse = min_max_scale(scores_sparse)  # → [0, 1]

  final_score = (norm_dense + norm_sparse) / 2

Why it works: Dense captures semantic similarity; sparse captures exact lexical matches. Their combination is complementary and often matches cross-encoder quality on retrieval tasks while being ~10x cheaper to run.
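The min-max normalization step matters because dense and sparse teachers emit scores on very different scales; only after scaling both to [0, 1] is a plain average meaningful. A sketch of the per-query aggregation (scores are made up; mapping constant score lists to 0.0 to avoid division by zero is a simplifying assumption here, not necessarily what the OpenSearch code does):

```python
def min_max_scale(scores: list[float]) -> list[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:                      # degenerate case: all candidates tied
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_scores(dense: list[float], sparse: list[float]) -> list[float]:
    nd, ns = min_max_scale(dense), min_max_scale(sparse)
    return [(a + b) / 2 for a, b in zip(nd, ns)]

# Raw scales differ wildly (cosine-like vs. sparse dot product), yet the
# ensemble ranks doc 0 first because both teachers agree it is best:
dense_scores = [0.92, 0.80, 0.50]   # made-up dense teacher scores
sparse_scores = [21.0, 14.0, 7.0]   # made-up sparse teacher scores
print(ensemble_scores(dense_scores, sparse_scores))  # doc 0 → 1.0, doc 2 → 0.0
```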


5. Training Data

Minimum Viable Data

Any dataset of (query, relevant_document) pairs. Even 1,000–5,000 domain-specific pairs can meaningfully improve a pre-trained sparse encoder for a specific domain when fine-tuning (not training from scratch).

Pre-Training Scale (OpenSearch models)

MS MARCO (primary, high quality):

  • 502,548 training queries
  • 8.84M passages
  • Relevance annotations from Bing click logs

Weak supervision mix (5.36M additional queries across 14 datasets):

| Dataset | Source |
|---|---|
| eli5_question_answer | Reddit ELI5 |
| squad_pairs | SQuAD reading comprehension |
| WikiAnswers | WikiAnswers duplicate-question pairs |
| yahoo_answers_* | Yahoo Answers |
| gooaq_pairs | Google autocomplete QA |
| stackexchange_duplicate_questions_* | StackExchange |
| wikihow | WikiHow articles |
| S2ORC_title_abstract | Semantic Scholar papers |
| searchQA_top5_snippets | Jeopardy-style QA |

All available via HuggingFace Datasets or the BEIR repository.

Data Format

For InfoNCE / MNR Loss (JSONL):

{"query": "what causes inflation?", "positive": "Inflation is caused by..."}
{"query": "who wrote Hamlet?", "positive": "Hamlet was written by Shakespeare...", "negatives": ["unrelated doc 1", "unrelated doc 2"]}

For Distillation (KL-div) (JSONL):

{"query": "what causes inflation?",
 "docs": ["Inflation is caused by...", "Shakespeare wrote...", "The sun is..."],
 "scores": [9.1, 0.2, 0.1]}

For Sentence Transformers (using datasets library):

from datasets import Dataset

# Simplest format
dataset = Dataset.from_dict({
    "query":    ["what causes inflation?", "who wrote Hamlet?"],
    "positive": ["Inflation is caused by...", "Hamlet was written by Shakespeare..."],
})

# With hard negatives (better)
dataset = Dataset.from_dict({
    "query":    ["what causes inflation?"],
    "positive": ["Inflation is caused by..."],
    "negative": ["An unrelated but superficially similar document..."],
})

6. Fine-Tuning Recipes

Option A: Sentence Transformers v5 (Recommended)

Best for: Domain adaptation, simplest setup, no external dependencies.

pip install -U sentence-transformers datasets

A1. InfoNCE (simplest — pairs only)

from datasets import Dataset
from sentence_transformers import (
    SparseEncoder,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
    SparseEncoderModelCardData,
)
from sentence_transformers.sparse_encoder.losses import (
    SpladeLoss,
    SparseMultipleNegativesRankingLoss,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.training_args import BatchSamplers

# ── 1. Load a pre-trained sparse model ────────────────────────────────────────
# Don't train from scratch — always start from a strong pre-trained checkpoint.
# Good starting points:
#   "naver/splade-cocondenser-selfdistil"               (symmetric SPLADE)
#   "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"  (inference-free)
#   "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"      (best OpenSearch)
model = SparseEncoder(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
        model_name="My Domain Sparse Encoder",
    ),
)

# ── 2. Prepare domain data ─────────────────────────────────────────────────────
train_dataset = Dataset.from_dict({
    "query":    ["your domain query 1", "your domain query 2"],
    "positive": ["relevant document 1", "relevant document 2"],
    # Optional: add "negative" key for hard negatives (strongly recommended)
})

# ── 3. Define loss with FLOPS regularization ───────────────────────────────────
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,      # tune between 0 and 1e-4
    document_regularizer_weight=3e-5,   # tune between 0 and 1e-3
)

# ── 4. Training arguments ──────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
    output_dir="models/my-domain-sparse-encoder",
    num_train_epochs=1,
    per_device_train_batch_size=16,     # larger = more in-batch negatives = better
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,                          # or bf16=True on Ampere+
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # required for MNR loss
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="NanoBEIR_mean_dot_ndcg@10",
)

# ── 5. Optional: zero-shot evaluator on standard BEIR subsets ─────────────────
evaluator = SparseNanoBEIREvaluator(
    dataset_names=["nfcorpus", "scifact", "fiqa"],
    batch_size=16,
)

# ── 6. Train ───────────────────────────────────────────────────────────────────
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
model.save_pretrained("models/my-domain-sparse-encoder/final")
model.push_to_hub("my-org/my-domain-sparse-encoder")  # optional

A2. KL Divergence Distillation (better quality)

Requires teacher scores pre-computed (or computed on-the-fly if you have GPU budget).

from sentence_transformers.sparse_encoder.losses import SparseDistillKLDivLoss

# Dataset needs columns: query + doc_0, doc_1, ..., doc_N + score_0, score_1, ..., score_N
# OR use the InputExample format with {"query", "docs", "scores"}
# Sentence Transformers >= 5.0 handles both.

loss = SpladeLoss(
    model=model,
    loss=SparseDistillKLDivLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)

# Everything else identical to A1

A3. Inspecting sparsity during/after training

# Check which tokens activate and their weights
sentences = ["what causes inflation?", "neural sparse retrieval with SPLADE"]
embeddings = model.encode(sentences, output_value="sentence_embedding")

for sent, emb in zip(sentences, embeddings):
    # decode top-20 active dimensions back to readable tokens
    decoded = model.decode(emb, top_k=20)
    print(f"\n{sent}")
    print(decoded)
    print(f"Active dims: {(emb > 0).sum()}")

A4. Evaluating on BEIR

from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Install: pip install beir
dataset = "nfcorpus"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Encode with your model (BEIR corpus values are {"title", "text"} dicts)
doc_texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus.values()]
doc_embeddings = model.encode_document(doc_texts)
query_embeddings = model.encode_query(list(queries.values()))

# Score with dot product — the default similarity for sparse encoders
scores = model.similarity(query_embeddings, doc_embeddings)  # [num_queries, num_docs]

Option B: OpenSearch Tuning Sample Repo

Best for: Production OpenSearch integration, multi-GPU training, full control over ensemble distillation.

Prerequisites: OpenSearch 2.16+ running locally (used for hard negative mining and evaluation).

# Clone the repo
git clone https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample
cd opensearch-sparse-model-tuning-sample

# Environment setup
conda create -n sparse-tuning python=3.9 -y
conda activate sparse-tuning
pip install -r requirements.txt

# Start OpenSearch locally (Docker)
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin@1234" \
  opensearchproject/opensearch:2.16.0

B1. Prepare MS MARCO training data

# Download and prepare MS MARCO with hard negatives
python prepare_msmarco_hard_negatives.py \
  --output_dir data/msmarco_hard_negs \
  --num_hard_negatives 7 \
  --opensearch_host localhost \
  --opensearch_port 9200

# Or prepare your own domain data in the expected JSONL format
python demo_train_data.py \
  --input your_query_doc_pairs.jsonl \
  --output data/domain_data

B2. InfoNCE training config

Edit configs/config_infonce.yaml:

model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs      # JSONL files
idf_path: idf.json                       # keep MS MARCO IDF or recompute
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_lambda: 0.05                       # FLOPS regularization weight
loss_types: [infonce]
output_dir: models/my-sparse-encoder

Then run:

python train_ir.py configs/config_infonce.yaml

B3. Ensemble KD training config (best quality)

Edit configs/config_kd.yaml:

model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs
idf_path: idf.json
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_d_lambda: 0.002                    # doc FLOPS weight (lower for KD)
loss_types: [kl_div]

# Ensemble teacher configuration
kd_ensemble_teacher_kwargs:
  teachers:
    - model_type: dense
      model_name_or_path: Alibaba-NLP/gte-large-en-v1.5   # dense teacher
    - model_type: sparse
      model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-v1  # sparse teacher
  score_scaling_factor: 30              # scales scores before softmax
  aggregation: arithmetic_mean          # or geometric_mean

output_dir: models/my-sparse-encoder-kd

Then run:

# Single GPU
python train_ir.py configs/config_kd.yaml

# Multi-GPU (recommended for full MS MARCO training)
torchrun --nproc_per_node=8 train_ir.py configs/config_kd.yaml

B4. Recomputing IDF for domain adaptation

# Recompute IDF from your own corpus (if your domain differs from MS MARCO)
from collections import defaultdict
import math
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)

def compute_idf(corpus: list[str], output_path: str = "idf.json"):
    df = defaultdict(int)  # document frequency
    N = len(corpus)

    for doc in corpus:
        token_ids = tokenizer(doc, truncation=True, max_length=512)["input_ids"]
        for tid in set(token_ids):
            df[tid] += 1

    idf = {
        str(tid): math.log((N + 1) / (count + 1)) + 1  # smoothed IDF
        for tid, count in df.items()
    }

    with open(output_path, "w") as f:
        json.dump(idf, f)

    print(f"Computed IDF for {len(idf)} tokens from {N} documents.")
    return idf

# Usage
with open("your_corpus.txt") as f:
    corpus = [line.strip() for line in f]

idf = compute_idf(corpus, "domain_idf.json")

Option C: NAVER SPLADE Repo

Best for: Symmetric SPLADE models, research experimentation, full Hydra config control.

git clone https://github.com/naver/splade
cd splade
pip install -e .

# Download MS MARCO data
bash scripts/download_msmarco.sh

C1. Basic training

python -m splade.training.train \
  config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
  config.regularizer.FLOPS.lambda_q=0.06 \
  config.regularizer.FLOPS.lambda_d=0.02 \
  config.data.train_data_path=data/your_domain_triples.tsv \
  config.training.num_train_epochs=3 \
  config.training.learning_rate=2e-5 \
  config.training.per_device_train_batch_size=32

C2. Distillation (SPLADE++ style)

# Using cross-encoder as teacher (best quality but slow)
python -m splade.training.distil_train \
  config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
  config.teacher.model_type_or_dir=cross-encoder/ms-marco-MiniLM-L-12-v2 \
  config.regularizer.FLOPS.lambda_q=0.02 \
  config.regularizer.FLOPS.lambda_d=0.01 \
  config.data.train_data_path=data/msmarco_triples.tsv \
  config.training.num_train_epochs=3

C3. SPLADE-v3 self-distillation

SPLADE-v3 uses a mix of KL-divergence from a SPLADE++ teacher + MarginMSE loss. Config from the paper:

# Loss combination from SPLADE-v3 paper:
L_total = α * L_KL + (1 - α) * L_MarginMSE + λ_q * FLOPS(q) + λ_d * FLOPS(d)
# α = 0.5, λ_q = 0.01, λ_d = 0.008
# Teacher: naver/splade-cocondenser-selfdistil
# Hard negatives: 8 per query sampled from teacher's top-100
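Put together, one training step combines the KL and MarginMSE terms with the two FLOPS penalties. A pure-Python sketch of the scalar arithmetic for a single query (all scores and activations are made up; a real implementation works on batched tensors and softmaxes over each query's candidate list):

```python
import math

# Toy sketch of the SPLADE-v3 objective on one query with 3 candidates.
def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_scores = [9.0, 4.0, 1.0]   # candidate list: [positive, neg, neg] (made up)
student_scores = [7.5, 4.5, 2.0]

l_kl = kl_div(softmax(teacher_scores), softmax(student_scores))
l_margin = (
    (student_scores[0] - student_scores[1]) - (teacher_scores[0] - teacher_scores[1])
) ** 2  # student margin 3.0 vs teacher margin 5.0 → (−2)² = 4.0

alpha, lambda_q, lambda_d = 0.5, 0.01, 0.008
flops_q, flops_d = 0.3, 1.2        # stand-ins for the batch FLOPS penalties
loss = alpha * l_kl + (1 - alpha) * l_margin + lambda_q * flops_q + lambda_d * flops_d
```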

7. Inference-Free (Asymmetric) Fine-Tuning

For OpenSearch production deployment, the inference-free setup (query = IDF lookup, doc = neural) is strongly recommended. Sentence Transformers v5 supports training this directly via Router.

from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import (
    SparseStaticEmbedding,
    MLMTransformer,
    SpladePooling,
)
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# ── Build the asymmetric model ─────────────────────────────────────────────────
doc_encoder = MLMTransformer(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)

# SparseStaticEmbedding = trainable IDF lookup table
# frozen=False allows the IDF weights to be updated during training (domain adaptation)
query_encoder = SparseStaticEmbedding(
    tokenizer=doc_encoder.tokenizer,
    frozen=False,           # set True to freeze IDF (use pre-computed MS MARCO IDF only)
    idf_path="idf.json",    # path to pre-computed IDF (download from HuggingFace model)
)

router = Router.for_query_document(
    query_modules=[query_encoder],
    document_modules=[doc_encoder, SpladePooling("max")],
)

model = SparseEncoder(modules=[router])

# ── Loss ───────────────────────────────────────────────────────────────────────
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=0.0,       # no regularization needed for static embedding
    document_regularizer_weight=3e-5,
)

# ── Training args ──────────────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
    output_dir="models/inference-free-domain",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    # Higher LR for the IDF table (it's a simpler parameter space)
    learning_rate_mapping={r"SparseStaticEmbedding\..*": 1e-3},
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

At inference time after training:

# Query: no neural inference — just an IDF lookup via the Router's query branch
query_embedding = model.encode_query("what causes inflation?")

# Document: full neural encoding
doc_embedding = model.encode_document("Inflation is caused by...")

# Score: dot product (the default similarity for sparse encoders)
score = model.similarity(query_embedding, doc_embedding)

8. Hard Negative Mining

Hard negatives are documents that are superficially relevant (e.g., retrieved by BM25) but are actually not relevant. They make training significantly harder and improve the model's ability to discriminate.

Strategy 1: BM25 Hard Negatives (simplest)

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def mine_bm25_hard_negatives(
    queries: list[str],
    corpus: list[str],
    qrels: dict[str, list[int]],  # query_id → list of relevant doc indices
    n_negatives: int = 7,
) -> list[dict]:
    """Mine hard negatives using BM25 retrieval."""
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    examples = []
    for q_idx, query in enumerate(queries):
        scores = bm25.get_scores(query.lower().split())
        top_k_indices = scores.argsort()[::-1][:100]  # top-100 by BM25

        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]

        if relevant_ids and hard_negatives:
            examples.append({
                "query": query,
                "positive": corpus[list(relevant_ids)[0]],
                "negatives": hard_negatives,
            })

    return examples

Strategy 2: Current Model Hard Negatives (iterative, best quality)

# Mine hard negatives using your current model's retrieval
# Run after each epoch or every N steps for iterative refinement

def mine_model_hard_negatives(model, queries, corpus, qrels, n_negatives=7, top_k=100):
    corpus_embeddings = model.encode_document(corpus, batch_size=64, show_progress_bar=True)

    examples = []
    for q_idx, query in enumerate(queries):
        query_embedding = model.encode_query([query])
        # model.similarity handles sparse tensors; dot product is the default
        scores = model.similarity(query_embedding, corpus_embeddings)[0]
        top_k_indices = scores.argsort(descending=True)[:top_k].tolist()

        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]

        if relevant_ids and hard_negatives:
            examples.append({"query": query, "positive": corpus[list(relevant_ids)[0]], "negatives": hard_negatives})

    return examples

Strategy 3: Using OpenSearch for Hard Negative Mining

If you have an OpenSearch cluster, the tuning sample repo automates this:

# Index your corpus into OpenSearch
python index_corpus.py \
  --corpus your_corpus.jsonl \
  --index my-index \
  --opensearch_host localhost

# Mine hard negatives using BM25 retrieval
python prepare_hard_negatives.py \
  --queries your_queries.jsonl \
  --qrels your_qrels.tsv \
  --index my-index \
  --n_negatives 7 \
  --output data/hard_negatives.jsonl

9. Practical Tips & Hyperparameter Guide

Sparsity Monitoring

Always track average active dimensions during training. Target ranges:

| Component | Target active dims | If too many | If too few |
|---|---|---|---|
| Documents | 100–300 | Increase λ_d | Decrease λ_d |
| Queries (neural) | 10–50 | Increase λ_q | Decrease λ_q |
| Queries (IDF lookup) | = query length | N/A | N/A |

def log_sparsity(model, sample_texts: list[str], prefix=""):
    embeddings = model.encode(sample_texts)
    active_dims = [(emb > 0).sum() for emb in embeddings]
    print(f"{prefix} active dims: mean={sum(active_dims)/len(active_dims):.1f}, "
          f"min={min(active_dims)}, max={max(active_dims)}")

Hyperparameter Summary

| Parameter | Recommended range | Notes |
|---|---|---|
| learning_rate | 1e-5 to 5e-5 | Start at 2e-5 |
| per_device_batch_size | 16–64 | Bigger → harder negatives for InfoNCE |
| λ_q (query FLOPS) | 1e-5 to 1e-4 | Lower than λ_d; set 0 for inference-free |
| λ_d (doc FLOPS) | 1e-5 to 5e-2 | Critical — tune first |
| warmup_ratio | 0.05–0.1 | Standard |
| weight_decay | 0.01 | Standard |
| max_seq_length | 512 for docs, 64–128 for queries | |
| num_train_epochs | 1–5 | 1 usually sufficient for fine-tuning |

Common Failure Modes

| Symptom | Cause | Fix |
|---|---|---|
| Embeddings all zeros | λ too high or LR too high | Reduce λ and/or LR |
| No sparsity (all dims active) | λ too low | Increase λ_d |
| Training loss not decreasing | LR too low or bad data | Check data format, increase LR |
| Good train loss, poor BEIR | Overfitting to MS MARCO | Add domain data, reduce epochs |
| Query vecs denser than doc vecs | Normal — queries are shorter | Expected behaviour |

On Starting from Scratch vs. Fine-Tuning

Never train a sparse encoder from a plain BERT/DistilBERT checkpoint without pre-training. The MLM head must first learn to produce meaningful sparse activations. Training from scratch requires:

  • ≥500K (query, document) pairs
  • Multi-phase lambda schedule
  • Likely weeks of training on 8+ GPUs

Always start from a pre-trained sparse checkpoint for domain fine-tuning. Even naver/splade-cocondenser-selfdistil or opensearch-neural-sparse-encoding-doc-v2-distill will adapt well with just a few thousand domain examples.


10. Decision Guide

| Goal | Start from | Training approach | Approximate training time |
|---|---|---|---|
| Best retrieval quality (symmetric) | naver/splade-v3 | NAVER repo + KL-div + MarginMSE | Days on 8 GPUs |
| OpenSearch production (fastest query) | opensearch-neural-sparse-encoding-doc-v2-distill | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Simplest domain adaptation | doc-v2-distill | Sentence Transformers SpladeLoss + MNR | Minutes on 1 GPU |
| Best OpenSearch quality | opensearch-neural-sparse-encoding-doc-v3-gte | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Multilingual | opensearch-neural-sparse-encoding-multilingual-v1 | MIRACL data + KD approach | Hours on 4 GPUs |
| Sparsify an existing dense model | your dense model + SparseAutoEncoder | Sentence Transformers CSRLoss | Minutes on 1 GPU |
| Research / ablation studies | naver/splade-cocondenser-selfdistil | NAVER SPLADE repo (Hydra configs) | Configurable |

11. References

Papers (chronological)

  1. BM25 — Robertson & Zaragoza (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR. PDF

  2. DeepImpact — Mallia et al. (2021). Learning Passage Impacts for Inverted Indexes. SIGIR 2021. arXiv:2104.12016

  3. SPLADE v1 — Formal et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720

  4. SPLADE v2 — Formal et al. (2021). SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. NeurIPS 2021 Workshop. arXiv:2109.10086

  5. BEIR Benchmark — Thakur et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021. arXiv:2104.08663

  6. SPLADE++ / SelfDistil / EnsembleDistil — Formal et al. (2022). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. SIGIR 2022. arXiv:2205.04733

  7. CoCondenser — Gao & Callan (2022). Unsupervised Corpus Aware Language Model Pre-Training for Dense Passage Retrieval. ACL 2022. arXiv:2108.05540

  8. SPLADE-v3 — Lassance & Formal (2024). SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789

  9. CSR — Yang et al. (2024). CSR: Cascade Sparse Retrieval for Open-Domain Question Answering. arXiv:2404.12153

  10. OpenSearch Inference-Free Neural Sparse (v2) — Yang et al. (2024). Inference-free Sparse Retrieval via IDF-Aware Ensemble Distillation. arXiv:2411.04403

  11. SPLADE-v3 + L0 — Lassance et al. (2025). Efficient Sparse Retrieval with L0 Regularization. SIGIR 2025. arXiv:2504.14839

Code & Repos

| Resource | URL |
|---|---|
| NAVER SPLADE repo | https://github.com/naver/splade |
| OpenSearch sparse tuning sample | https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample |
| Sentence Transformers sparse training docs | https://sbert.net/docs/sparse_encoder/training_overview.html |
| Sentence Transformers sparse training examples | https://github.com/UKPLab/sentence-transformers/tree/master/examples/sparse_encoder |
| BEIR benchmark | https://github.com/beir-cellar/beir |

Blog Posts

| Resource | URL |
|---|---|
| HuggingFace blog: Train Sparse Encoders (ST v5) | https://huggingface.co/blog/train-sparse-encoder |
| OpenSearch neural sparse documentation | https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/ |
| Sentence Transformers v5 release announcement | https://huggingface.co/blog/sentence-transformers-v5 |

HuggingFace Models

| Model | URL |
|---|---|
| naver/splade-v3 | https://huggingface.co/naver/splade-v3 |
| naver/splade-cocondenser-selfdistil | https://huggingface.co/naver/splade-cocondenser-selfdistil |
| naver/splade-cocondenser-ensembledistil | https://huggingface.co/naver/splade-cocondenser-ensembledistil |
| opensearch-project/opensearch-neural-sparse-encoding-v1 | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1 |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte |
| opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 |
| opensearch-project (full org) | https://huggingface.co/opensearch-project |

Guide compiled from NAVER SPLADE papers, Amazon OpenSearch Neural Sparse papers, and Sentence Transformers v5 documentation. All code is written for Python 3.9+ and tested against sentence-transformers>=5.0.


12. Real-World Case Study: Domain Adaptation for Financial Filings & Earnings Calls

Context: A production OpenSearch 2.16 cluster indexed ~55M document chunks (50.6M SEC filings + 4.4M earnings call transcripts from CapIQ and UK Companies House). The existing sparse encoder used the stock MS MARCO IDF table. This section documents the investigation that proved domain IDF recomputation was necessary before fine-tuning.

12.1 What We Found in the Index

desia-resource-chunks-v2    55,073,414 docs   502.7 GB
  ├── filing      50,637,052   (10-K, 10-Q, 8-K, Companies House)
  └── transcript   4,436,362   (earnings calls)

Field schema (relevant fields):
  chunk_text                    → text
  chunk_text_sparse_embeddings  → rank_features  (sparse vector)
  chunk_context_sparse_embeddings → rank_features
  resource_company_name         → text
  resource_integration_element_type → text  ("filing" | "transcript")
  resource_source_integration_code_name → text  ("data-provider-capiq" | "data-provider-gov.uk-companyhouse")

The inference-free setup was already in place:

  • Query time: amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 (DEPLOYED) — IDF lookup, zero inference
  • Document encoding: custom neural encoder (remote HuggingFace Inference Endpoint, L4 GPU) — connector type doc_sparse_encode

12.2 Step 1: Sample 100K Chunks and Compute Domain IDF

The script below scrolls the corpus, tokenizes with the model's own BERT tokenizer, and computes smoothed IDF. Run with uv:

uv run \
  --with requests \
  --with transformers \
  --with huggingface_hub \
  --with tqdm \
  python domain_idf_analysis.py
"""
domain_idf_analysis.py
Scrolls 100K chunks (50K filings + 50K transcripts) from an OpenSearch index,
computes smoothed IDF, saves to domain_idf.json, and compares vs MS MARCO baseline.

Replace OS_URL / OS_AUTH with your own cluster credentials.
"""
import json
import math
import sys
from collections import Counter

import requests
import urllib3
from huggingface_hub import hf_hub_download
from tqdm import tqdm
from transformers import AutoTokenizer

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# ── Config ─────────────────────────────────────────────────────────────────────
OS_URL   = "https://localhost:9200"
OS_AUTH  = ("YOUR_USERNAME", "YOUR_PASSWORD")          # ← replace
INDEX    = "your-chunks-index"                         # ← replace
MODEL_ID = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
N_EACH   = 50_000      # 50K filings + 50K transcripts = 100K total
BATCH    = 500
MIN_LEN  = 50          # skip near-empty / image-caption chunks
OUT_IDF  = "domain_idf.json"


def scroll_texts(doc_type: str, type_field: str, n: int) -> list[str]:
    """Scroll N chunk_text docs of a given type from OpenSearch."""
    texts: list[str] = []

    resp = requests.post(
        f"{OS_URL}/{INDEX}/_search?scroll=3m",
        auth=OS_AUTH, verify=False,
        json={
            "size": BATCH,
            "_source": ["chunk_text"],
            "query": {"term": {type_field: doc_type}},
        },
    )
    resp.raise_for_status()
    data      = resp.json()
    scroll_id = data["_scroll_id"]
    hits      = data["hits"]["hits"]

    with tqdm(total=n, desc=f"  {doc_type:12s}", unit="chunks", ncols=80) as pbar:
        while hits and len(texts) < n:
            kept_before = len(texts)
            for hit in hits:
                text = hit["_source"].get("chunk_text", "")
                if len(text.strip()) >= MIN_LEN:
                    texts.append(text)
                if len(texts) >= n:
                    break
            pbar.update(len(texts) - kept_before)  # count only chunks actually kept
            if len(texts) >= n:
                break
            resp = requests.post(
                f"{OS_URL}/_search/scroll",
                auth=OS_AUTH, verify=False,
                json={"scroll": "3m", "scroll_id": scroll_id},
            )
            resp.raise_for_status()
            data      = resp.json()
            scroll_id = data.get("_scroll_id", scroll_id)
            hits      = data["hits"]["hits"]

    requests.delete(
        f"{OS_URL}/_search/scroll",
        auth=OS_AUTH, verify=False,
        json={"scroll_id": scroll_id},
    )
    return texts[:n]


def compute_idf(texts: list[str], tokenizer) -> tuple[dict, int, Counter]:
    """
    Smoothed IDF: log((N+1) / (df+1)) + 1
    Keys are string token IDs — compatible with OpenSearch inference-free models.
    """
    df: Counter = Counter()
    N = 0
    for text in tqdm(texts, desc="  tokenizing  ", unit="chunks", ncols=80):
        ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
        for tid in set(ids):        # count each token once per document
            df[tid] += 1
        N += 1

    idf = {
        str(tid): math.log((N + 1) / (count + 1)) + 1
        for tid, count in df.items()
    }
    return idf, N, df


def compare_vs_msmarco(domain_idf: dict, tokenizer):
    """
    Compare domain IDF against MS MARCO baseline.

    ⚠️  Key format note:
    - MS MARCO idf.json uses decoded token STRINGS as keys (e.g. "the", "consolidated")
    - Domain IDF uses integer token IDs as string keys (e.g. "1996", "12088")
    - Decode domain keys before comparing.
    """
    idf_path = hf_hub_download(MODEL_ID, "idf.json")
    with open(idf_path) as f:
        msmarco_idf: dict = json.load(f)

    # Decode domain IDF to string keys for comparison
    domain_by_str = {}
    for tid_str, idf_val in domain_idf.items():
        tok = tokenizer.decode([int(tid_str)]).strip()
        if tok:
            domain_by_str[tok] = idf_val

    common   = set(domain_by_str) & set(msmarco_idf)
    deltas   = [
        (tok, domain_by_str[tok], msmarco_idf[tok], domain_by_str[tok] - msmarco_idf[tok])
        for tok in common if len(tok.strip()) >= 2
    ]
    deltas.sort(key=lambda x: x[3])

    import statistics
    abs_d = [abs(d) for _, _, _, d in deltas]
    print(f"\nTokens compared       : {len(deltas):,}")
    print(f"Mean  |delta|          : {statistics.mean(abs_d):.4f}")
    print(f"Stdev |delta|          : {statistics.stdev(abs_d):.4f}")
    print(f"|delta| > 1.0          : {sum(1 for d in abs_d if d > 1.0):,}  ({sum(1 for d in abs_d if d > 1.0)/len(abs_d)*100:.0f}%)")
    print(f"|delta| > 2.0          : {sum(1 for d in abs_d if d > 2.0):,}  ({sum(1 for d in abs_d if d > 2.0)/len(abs_d)*100:.0f}%)")

    fmt = "{:<25} {:>11} {:>13} {:>10}"
    for label, rows in [
        ("UNDERWEIGHTED in MS MARCO (more common in domain)", deltas[:30]),
        ("OVERWEIGHTED in MS MARCO  (rarer in domain)",       list(reversed(deltas[-30:]))),
    ]:
        print(f"\n{'='*65}\n{label}\n{'='*65}")
        print(fmt.format("Token", "Domain IDF", "MS MARCO IDF", "Delta"))
        print("-" * 63)
        for tok, d, m, delta in rows:
            print(fmt.format(tok, f"{d:.3f}", f"{m:.3f}", f"{delta:+.3f}"))


def main():
    print("[1/4] Loading MS MARCO IDF ...")
    # (also used inside compare_vs_msmarco)

    print("[2/4] Loading tokenizer ...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    print(f"[3/4] Scrolling {N_EACH:,} × 2 chunks ...")
    # Adapt the type_field and doc_type values to your schema
    filing_texts     = scroll_texts("filing",     "resource_integration_element_type.keyword", N_EACH)
    transcript_texts = scroll_texts("transcript", "resource_integration_element_type.keyword", N_EACH)
    all_texts        = filing_texts + transcript_texts
    print(f"      collected {len(all_texts):,} total chunks")

    print("[4/4] Computing domain IDF ...")
    domain_idf, N, df = compute_idf(all_texts, tokenizer)
    print(f"      {len(domain_idf):,} unique tokens | {N:,} documents")

    with open(OUT_IDF, "w") as f:
        json.dump(domain_idf, f)
    print(f"      saved → {OUT_IDF}")

    compare_vs_msmarco(domain_idf, tokenizer)


if __name__ == "__main__":
    main()

12.3 Results on 100K Chunks (50K filings + 50K transcripts)

IDF drift summary:

Tokens in domain sample   : 25,986
Mean  |delta|             : 1.30   ← on log scale; substantial
Stdev |delta|             : 1.00
|delta| > 1.0             : 14,083  (54% of vocabulary!)
|delta| > 2.0             :  5,779  (22% of vocabulary)

Tokens MS MARCO underweights for this domain (common in filings, rare in web search):

| Token | Domain IDF | MS MARCO IDF | Delta | Why it matters |
|---|---|---|---|---|
| 2021 | 2.02 | 9.31 | −7.29 | Appears in 36% of docs (fiscal year references) |
| 2020 | 2.25 | 7.66 | −5.41 | Same |
| 202 | 1.87 | 7.09 | −5.23 | Subword prefix for 202x years, extremely common |
| ##gence | 2.53 | 8.59 | −6.06 | Suffix: "negligence", "intelligence", "emergence" |
| consolidated | 3.21 | 8.20 | −4.98 | Core accounting term |
| subsidiaries | 4.03 | 8.49 | −4.46 | Core corporate structure term |
| commitments | 4.63 | 9.02 | −4.40 | Balance sheet item |
| crore | 5.29 | 9.67 | −4.38 | Indian rupee unit (international filings) |
| societe | 6.43 | 10.83 | −4.40 | French company names (Companies House) |
| grupo | 6.27 | 10.63 | −4.36 | Spanish company names |
| ##gl | 2.51 | 6.83 | −4.32 | Subword: "global", "single", "struggle" |

Tokens MS MARCO overweights (virtually absent from financial filings):

| Token | Domain IDF | MS MARCO IDF | Delta |
|---|---|---|---|
| noun | 11.82 | 5.26 | +6.57 |
| synonym | 11.82 | 5.78 | +6.04 |
| wikipedia | 11.82 | 6.02 | +5.80 |
| garlic | 11.82 | 6.13 | +5.69 |
| pronunciation | 11.82 | 6.20 | +5.62 |
| stomach | 10.21 | 5.24 | +4.97 |
| puppy | 11.82 | 7.03 | +4.79 |
| medieval | 11.82 | 7.16 | +4.67 |

12.4 Key Lessons from This Investigation

1. The IDF format mismatch gotcha.

MS MARCO idf.json uses decoded token strings as keys ("the", "consolidated"). If you compute domain IDF using integer token IDs as string keys (e.g. str(token_id) → "1996"), a naive set(domain) & set(msmarco) intersection returns near-zero overlap (~337 coincidental numeric matches). Always decode token IDs before comparing:

# WRONG: comparing int-string keys vs text-string keys
common = set(domain_idf_by_id.keys()) & set(msmarco_idf.keys())  # ~337 matches

# RIGHT: decode domain keys first
domain_by_str = {
    tokenizer.decode([int(tid)]).strip(): idf_val
    for tid, idf_val in domain_idf_by_id.items()
}
common = set(domain_by_str.keys()) & set(msmarco_idf.keys())      # ~25,000 matches

The domain idf.json you save and ship with the model should keep integer token ID keys — that is what the OpenSearch inference-free tokenizer model expects.

2. 54% of the vocabulary has IDF drift > 1.0 (log scale).

This is not marginal. A delta of 1.0 on a log scale represents roughly a 2.7× difference in document frequency relative to corpus size. With over half the vocabulary miscalibrated, MS MARCO IDF actively harms retrieval quality on financial text.
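To sanity-check the 2.7× figure: with the smoothed IDF formula used above, an IDF delta between two equal-sized corpora maps directly to a document-frequency ratio of exp(delta):

```python
import math

# Smoothed IDF: idf = log((N+1)/(df+1)) + 1. For two corpora of equal size,
# an IDF delta of d corresponds to a document-frequency ratio of exp(d):
for delta in (1.0, 2.0, 7.29):      # 7.29 is the "2021" token from the table above
    print(f"IDF delta {delta:.2f} → document-frequency ratio ≈ {math.exp(delta):.1f}x")
```

So the "2021" token is roughly 1,500× more frequent in the filings corpus than MS MARCO's IDF assumes.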

3. The domain "stop words" are different from general English stop words.

In financial filings, financial (IDF 2.03), market (IDF 1.92), company (IDF 2.17), year (IDF 2.03), inc (IDF 2.12) appear in 30–40% of all chunks. A general IDF table treats them as informative; the domain IDF correctly suppresses them.

4. Scale matters for low-frequency financial jargon.

From 100K chunks, genuinely rare terms like EBITDA, amortization, diluted may still have unreliable IDF estimates. Scale up to 500K–1M chunks for stable estimates on domain-specific tail vocabulary. The scroll-and-tokenize approach is linear — just increase N_EACH.

12.5 Next Steps After Domain IDF

With a domain idf.json in hand:

  1. Register the updated IDF with your inference-free tokenizer model in OpenSearch
  2. Generate synthetic training pairs from your corpus using an LLM (GPL approach) — see Section 6
  3. Fine-tune the document encoder with SpladeLoss + domain IDF, starting from a pre-trained sparse checkpoint
  4. Re-index documents (or serve both old and new model during transition, comparing NDCG on a held-out query set)

Fine-Tuning Sparse Encoders for Neural Sparse Retrieval

A Complete Reproducible Guide (SPLADE, OpenSearch Neural Sparse, CSR)

Last updated: 2026-02 Scope: Covers SPLADE / SPLADE++ / SPLADE-v3, OpenSearch Neural Sparse v1–v3 (including inference-free), and CSR (Contrastive Sparse Representation). Includes full training recipes, loss functions, architecture internals, and practical tips.


Table of Contents

  1. What Are Sparse Encoders?
  2. Model Families
  3. Architecture Internals
  4. Training Objectives & Loss Functions
  5. Training Data
  6. Fine-Tuning Recipes
  7. Inference-Free (Asymmetric) Fine-Tuning
  8. Hard Negative Mining
  9. Practical Tips & Hyperparameter Guide
  10. Decision Guide
  11. References

1. What Are Sparse Encoders?

Sparse encoders map text to high-dimensional vectors where most values are zero. Each non-zero dimension corresponds to a vocabulary token; its weight represents that token's importance in the text.

"neural retrieval" → {retrieval: 2.1, neural: 1.9, search: 0.8, information: 0.3, ...}
                      (vocab_size = 30,522; ~95-99% of dims are zero)

Why sparse over dense?

| Property | Dense (e.g. SBERT) | Sparse (SPLADE) |
|---|---|---|
| Vector dim | 768 | 30,522 |
| Non-zero dims | all | ~100–300 |
| Inverted index compatible | ✗ | ✓ |
| BM25-level latency | ✗ | ✓ |
| Semantic expansion | implicit | explicit (MLM head) |
| Interpretable | ✗ | ✓ (readable tokens) |
| BEIR avg NDCG@10 | ~0.49–0.52 | ~0.50–0.55 |

The killer feature: sparse vectors are compatible with Lucene/OpenSearch inverted indexes. Retrieval is a standard dot-product scan — the same data structure BM25 uses — giving near-BM25 latency at neural relevance quality.

The magic comes from the MLM head: each input token position projects onto the entire vocabulary, enabling the model to activate semantically related terms that never appeared in the original text (query/document expansion).
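To make the pooling concrete, here is a minimal sketch of the max-pool + activation step, run on random stand-in logits rather than a real BertForMaskedLM output:

```python
import torch

torch.manual_seed(0)
# Stand-in for MLM-head logits: [batch, seq_len, vocab_size]
logits = torch.randn(1, 4, 30522) * 3.0

pooled = logits.max(dim=1).values            # [1, 30522]: best evidence per vocab token
sparse = torch.log1p(torch.relu(pooled))     # negatives → 0, large values saturated

density = (sparse > 0).float().mean().item()
print(f"non-zero fraction: {density:.2%}")
# Random logits mostly survive the ReLU; it is training with FLOPS
# regularization that drives real models down to ~95-99% zeros.
```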


2. Model Families

2.1 SPLADE Family (NAVER Labs)

Symmetric models — both query and document run through a full neural encoder.

| Model | HuggingFace ID | Params | MS MARCO MRR@10 | BEIR Avg NDCG@10 |
|---|---|---|---|---|
| SPLADE-v2 | naver/splade-v2-max | 110M | 36.8 | 0.497 |
| SPLADE-v2-distill | naver/splade-v2-distilsplade-max | 66M | 36.1 | |
| SPLADE++ SelfDistil | naver/splade-cocondenser-selfdistil | 110M | 37.6 | 0.510 |
| SPLADE++ EnsembleDistil | naver/splade-cocondenser-ensembledistil | 110M | 38.3 | 0.524 |
| SPLADE-v3 | naver/splade-v3 | 110M | 40.2 | ~0.53 |
| SPLADE-v3-Doc | naver/splade-v3-doc | 110M | | ~0.54 |


2.2 OpenSearch Neural Sparse (Amazon)

Asymmetric models — document is neural; query uses IDF lookup (no inference at search time).

| Model | HuggingFace ID | Params | Base | BEIR NDCG@10 | Avg FLOPs |
|---|---|---|---|---|---|
| v1 | opensearch-project/opensearch-neural-sparse-encoding-v1 | 133M | BERT-base | 0.524 | 11.4 |
| doc-v2-mini | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini | 22M | MiniLM | 0.497 | 0.7 |
| doc-v2-distill | opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | 67M | DistilBERT | 0.504 | 1.8 |
| doc-v2 | opensearch-project/opensearch-neural-sparse-encoding-doc-v2 | 133M | BERT-base | 0.515 | 3.3 |
| doc-v3-distill | opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | 67M | DistilBERT | 0.517 | 1.8 |
| doc-v3-gte | opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | 133M | GTE-base | 0.546 | ~2 |
| multilingual-v1 | opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | 160M | mBERT | 0.629 (MIRACL) | 1.3 |


2.3 CSR — Contrastive Sparse Representation

Adds a sparse autoencoder on top of an existing dense model, sparsifying its output. Useful when you already have a fine-tuned dense model and want sparse retrieval without retraining from scratch.


3. Architecture Internals

SPLADE / OpenSearch v1 Architecture

Input tokens: ["what", "is", "SPLADE", "?"]
       ↓
BertForMaskedLM (or DistilBertForMaskedLM)
       ↓
token-level logits: shape [batch, seq_len, vocab_size]   # e.g. [1, 4, 30522]
       ↓
max pooling over seq_len dimension → [batch, vocab_size]  # collapse to one vector
       ↓
log(1 + ReLU(x))                                         # activation + log-saturation
       ↓
sparse vector: [batch, vocab_size]                        # ~95-99% zeros

Why max pooling? Aggregates the "best evidence" for each vocabulary token across all input positions. Each position can activate vocabulary tokens it didn't literally contain — this is the expansion.

Why log(1 + ReLU(x))? ReLU kills negatives (forcing sparsity). Log saturates large values (preventing a few dimensions from dominating), providing a soft upper bound analogous to BM25's term frequency saturation.

SPLADE-v3 double-log activation:

# v2: log(1 + ReLU(x))
# v3: log(1 + log(1 + ReLU(x)))   ← stronger sparsification, less aggressive FLOPS penalty needed
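A quick numeric comparison (illustrative values only) shows how the extra log squashes large logits much harder:

```python
import torch

x = torch.tensor([0.5, 2.0, 10.0, 50.0])
v2 = torch.log1p(torch.relu(x))               # log(1 + ReLU(x))
v3 = torch.log1p(torch.log1p(torch.relu(x)))  # log(1 + log(1 + ReLU(x)))
for xi, a, b in zip(x.tolist(), v2.tolist(), v3.tolist()):
    print(f"x={xi:5.1f}   v2={a:.3f}   v3={b:.3f}")
# at x=50: v2 ≈ 3.93 while v3 ≈ 1.60 — far stronger saturation
```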

Inference-Free (OpenSearch v2/v3) Query Encoding

At query time, no neural inference is run:

def encode_query_inference_free(query_tokens: list[int], idf: dict[int, float]) -> dict[int, float]:
    """
    idf.json: {token_id_str: idf_value} pre-computed from training corpus (MS MARCO).
    Returns sparse vector as {token_id: weight} dict.
    """
    sparse = {}
    for token_id in set(query_tokens):  # deduplicate
        if token_id in idf:
            sparse[token_id] = idf[token_id]
        else:
            sparse[token_id] = 1.0  # default IDF for unknown tokens
    return sparse

The idf.json file is shipped alongside each v2/v3 model on HuggingFace. You can recompute it from your own corpus for domain adaptation.
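Exercising that lookup against a toy IDF table (hypothetical token IDs and values; real idf.json files use string token IDs as keys, which the variant below handles):

```python
# Toy idf table, keyed by string token ID as in the shipped idf.json files
idf = {"2003": 2.41, "7592": 5.10, "3231": 6.87}

def encode_query_inference_free(query_token_ids: list[int], idf: dict[str, float]) -> dict[int, float]:
    # deduplicate, then look up each token's IDF; unseen tokens default to 1.0
    return {tid: float(idf.get(str(tid), 1.0)) for tid in set(query_token_ids)}

vec = encode_query_inference_free([2003, 7592, 7592, 999], idf)
print(vec)  # duplicate 7592 collapses to one entry; unknown 999 gets weight 1.0
```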

CSR Architecture

Dense model (frozen or fine-tuned) → dense vector [batch, 768]
       ↓
SparseAutoEncoder:
    encoder: Linear(768 → 4*768) + ReLU + TopK(k=256)   ← only top-k activations kept
    decoder: Linear(4*768 → 768)                         ← reconstruction loss
       ↓
sparse vector: [batch, 3072]   (4x expansion, mostly zeros)
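A minimal PyTorch sketch of this top-k autoencoder (dimensions and class name illustrative, not the exact CSR implementation):

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    """Top-k sparse autoencoder over a dense embedding (illustrative sketch)."""
    def __init__(self, dim: int = 768, expansion: int = 4, k: int = 256):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(dim, expansion * dim)
        self.decoder = nn.Linear(expansion * dim, dim)

    def forward(self, dense: torch.Tensor):
        h = torch.relu(self.encoder(dense))                    # [batch, 3072]
        topk = h.topk(self.k, dim=-1)                          # keep k largest activations
        sparse = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(sparse)                           # trained to reconstruct input
        return sparse, recon

sae = SparseAutoEncoder()
sparse, recon = sae(torch.randn(2, 768))
print(sparse.shape, (sparse != 0).sum(dim=-1))  # at most k non-zeros per row
```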

4. Training Objectives & Loss Functions

4.1 FLOPS Regularization

The core mechanism that enforces sparsity. Penalizes tokens that activate with high magnitude on average across the training batch:

FLOPS(X) = Σ_{j ∈ vocab}  [ (1/N) Σ_{i=1}^{N}  w_j(x_i) ]^2

where:
  N = batch size
  w_j(x_i) = weight of vocab token j in sparse vector of example i

Intuition: If token "the" activates in every document, its average activation is high → large penalty → model learns to suppress it. Rare, informative tokens activate infrequently → low average → preserved.
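The regularizer itself is a few lines of PyTorch; this toy batch shows why an always-on token dominates the penalty:

```python
import torch

def flops_loss(w: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over vocab of (mean activation over the batch)^2.
    w: [batch, vocab_size] non-negative sparse activations."""
    return (w.mean(dim=0) ** 2).sum()

w = torch.zeros(4, 5)    # toy batch: 4 examples, 5-token vocab
w[:, 0] = 1.0            # "the"-like token fires in all 4 → mean 1.0  → contributes 1.0
w[0, 1] = 1.0            # rare token fires once         → mean 0.25 → contributes 0.0625
loss = flops_loss(w)
print(loss)              # tensor(1.0625)
```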

Combined training loss:

L_total = L_rank + λ_q · FLOPS(queries) + λ_d · FLOPS(docs)

Typical λ values:

| Framework / Config | λ_q | λ_d |
|---|---|---|
| SPLADE original | 0.06 | 0.02 |
| Sentence Transformers (SpladeLoss) | 5e-5 | 3e-5 |
| OpenSearch tuning sample (InfoNCE) | 0.05 | 0.05 |
| OpenSearch tuning sample (KD) | | 0.002 |
| OpenSearch v2 pre-training phase | | 1e-7 |
| OpenSearch v2 fine-tuning phase | | 0.02 |

Warning: λ too high → sparse vectors collapse to near-zero (no retrieval signal). λ too low → dense-like vectors (slow index, no sparsity benefit). Always monitor average active dims during training.

Two-phase lambda schedule (OpenSearch v2 approach, strongly recommended for training from scratch):

Phase 1 (large corpus, weak labels): λ_d = 1e-7  → focus on learning relevance
Phase 2 (MS MARCO / domain data):   λ_d = 0.02  → enforce sparsity

4.2 IDF-Aware Scoring (OpenSearch v2/v3)

Instead of raw dot product, scores are IDF-weighted:

score(q, d) = Σ_{t ∈ vocab}  idf(t) · q_t · d_t

Why? IDF(t) is large for rare/informative tokens. Multiplying by IDF amplifies the gradient signal for these tokens in the ranking loss, teaching the model to preserve them. Common tokens (low IDF) are simultaneously pushed down by FLOPS regularization. The two forces complement each other.

IDF is pre-computed from MS MARCO corpus. Unseen tokens default to idf = 1.0.
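As a toy sketch of the scoring rule (made-up IDF values and activations):

```python
import torch

def idf_weighted_score(q: torch.Tensor, d: torch.Tensor, idf: torch.Tensor) -> torch.Tensor:
    """score(q, d) = Σ_t idf(t) · q_t · d_t, all tensors of shape [vocab_size]."""
    return (idf * q * d).sum()

idf = torch.tensor([0.5, 0.5, 3.0, 4.0, 4.0, 1.0])  # low = common token
q   = torch.tensor([1.0, 0.0, 2.0, 0.0, 1.5, 0.0])
d   = torch.tensor([0.8, 1.0, 1.1, 0.0, 0.9, 0.0])
score = idf_weighted_score(q, d, idf)
print(score)  # the two rare tokens (idf 3.0 and 4.0) dominate the score
```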

4.3 Ranking Losses

InfoNCE / Multiple Negatives Ranking Loss:

L_InfoNCE = -log( exp(score(q, p) / τ) / Σ_j exp(score(q, d_j) / τ) )

where:
  p = positive document
  d_j = all docs in batch (positive + in-batch negatives + mined hard negatives)
  τ = temperature (default 0.02 in Sentence Transformers)

In Sentence Transformers: SparseMultipleNegativesRankingLoss

Requirements: batch size ≥ 16 (more in-batch negatives = harder training signal). Use BatchSamplers.NO_DUPLICATES to ensure each query appears once per batch.
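With in-batch negatives, InfoNCE reduces to cross-entropy over a batch-by-batch similarity matrix whose diagonal holds the positive pairs; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def info_nce(scores: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """scores[i, j] = score(q_i, d_j); the diagonal holds the positive pairs."""
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores / tau, labels)

scores = torch.tensor([[10.0, 1.0, 0.5],
                       [ 0.3, 9.0, 2.0],
                       [ 1.0, 0.2, 8.0]])
loss = info_nce(scores)
print(loss)  # near zero: the positives already dominate at this temperature
```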

MarginMSE (used in SPLADE-v3 in combination with KL-div):

L_MarginMSE = MSE( score(q,p) - score(q,n),  teacher_score(q,p) - teacher_score(q,n) )
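A minimal sketch of this loss on made-up student and teacher scores:

```python
import torch
import torch.nn.functional as F

def margin_mse(s_qp, s_qn, t_qp, t_qn):
    """MarginMSE: match the student's positive-negative margin to the teacher's."""
    return F.mse_loss(s_qp - s_qn, t_qp - t_qn)

student_pos = torch.tensor([12.0, 10.0])   # margins: 4.0, 1.0
student_neg = torch.tensor([ 8.0,  9.0])
teacher_pos = torch.tensor([ 9.2,  8.8])   # margins: 4.7, 0.9
teacher_neg = torch.tensor([ 4.5,  7.9])
loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
print(loss)  # tensor(0.2500)
```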

4.4 Knowledge Distillation

Distillation consistently outperforms pointwise/pairwise training. The teacher (cross-encoder or ensemble retriever) provides soft scores for a list of candidate documents per query.

KL Divergence Distillation:

L_KL = KL( softmax(teacher_scores / τ_t) || softmax(student_scores / τ_s) )

In Sentence Transformers: SparseDistillKLDivLoss

Data format:

{"query": "what is SPLADE?",
 "docs": ["SPLADE is a sparse...", "Dense models use...", "BM25 is a..."],
 "scores": [9.2, 1.1, 4.5]}

4.5 Ensemble Heterogeneous Distillation

OpenSearch's key contribution — avoid expensive cross-encoder inference by combining two cheap retrievers:

Teacher 1 (dense):  Alibaba-NLP/gte-large-en-v1.5
Teacher 2 (sparse): opensearch-project/opensearch-neural-sparse-encoding-v1

For each query:
  scores_dense  = dense_teacher.score(query, [doc_1, ..., doc_N])
  scores_sparse = sparse_teacher.score(query, [doc_1, ..., doc_N])

  norm_dense  = min_max_scale(scores_dense)   # → [0, 1]
  norm_sparse = min_max_scale(scores_sparse)  # → [0, 1]

  final_score = (norm_dense + norm_sparse) / 2

Why it works: Dense captures semantic similarity; sparse captures exact lexical matches. Their combination is complementary and often matches cross-encoder quality on retrieval tasks while being ~10x cheaper to run.
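The normalization-and-average step can be sketched in a few lines (toy scores; note the two teachers live on entirely different scales before min-max scaling):

```python
import numpy as np

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)  # eps guards constant lists

# Toy candidate scores for one query
dense_scores  = [0.82, 0.75, 0.40]   # cosine-like scale
sparse_scores = [21.0, 30.0,  5.0]   # sparse dot-product scale
teacher = (min_max(dense_scores) + min_max(sparse_scores)) / 2
print(teacher)  # ensemble soft labels in [0, 1], used as KD targets
```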


5. Training Data

Minimum Viable Data

Any dataset of (query, relevant_document) pairs. Even 1,000–5,000 domain-specific pairs can meaningfully improve a pre-trained sparse encoder for a specific domain when fine-tuning (not training from scratch).

Pre-Training Scale (OpenSearch models)

MS MARCO (primary, high quality):

  • 502,548 training queries
  • 8.84M passages
  • Relevance annotations from Bing click logs

Weak supervision mix (5.36M additional queries across 14 datasets):

| Dataset | Source |
|---|---|
| eli5_question_answer | Reddit ELI5 |
| squad_pairs | SQuAD reading comprehension |
| WikiAnswers | WikiAnswers duplicate question pairs (Answers.com) |
| yahoo_answers_* | Yahoo Answers |
| gooaq_pairs | Google autocomplete QA |
| stackexchange_duplicate_questions_* | StackExchange |
| wikihow | WikiHow articles |
| S2ORC_title_abstract | Semantic Scholar papers |
| searchQA_top5_snippets | Jeopardy-style QA |

All available via HuggingFace Datasets or the BEIR repository.

Data Format

For InfoNCE / MNR Loss (JSONL):

{"query": "what causes inflation?", "positive": "Inflation is caused by..."}
{"query": "who wrote Hamlet?", "positive": "Hamlet was written by Shakespeare...", "negatives": ["unrelated doc 1", "unrelated doc 2"]}

For Distillation (KL-div) (JSONL):

{"query": "what causes inflation?",
 "docs": ["Inflation is caused by...", "Shakespeare wrote...", "The sun is..."],
 "scores": [9.1, 0.2, 0.1]}

For Sentence Transformers (using datasets library):

from datasets import Dataset

# Simplest format
dataset = Dataset.from_dict({
    "query":    ["what causes inflation?", "who wrote Hamlet?"],
    "positive": ["Inflation is caused by...", "Hamlet was written by Shakespeare..."],
})

# With hard negatives (better)
dataset = Dataset.from_dict({
    "query":    ["what causes inflation?"],
    "positive": ["Inflation is caused by..."],
    "negative": ["An unrelated but superficially similar document..."],
})

6. Fine-Tuning Recipes

Option A: Sentence Transformers v5 (Recommended)

Best for: Domain adaptation, simplest setup, no external dependencies.

pip install -U sentence-transformers datasets

A1. InfoNCE (simplest — pairs only)

from datasets import Dataset
from sentence_transformers import (
    SparseEncoder,
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
    SparseEncoderModelCardData,
)
from sentence_transformers.sparse_encoder.losses import (
    SpladeLoss,
    SparseMultipleNegativesRankingLoss,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.training_args import BatchSamplers

# ── 1. Load a pre-trained sparse model ────────────────────────────────────────
# Don't train from scratch — always start from a strong pre-trained checkpoint.
# Good starting points:
#   "naver/splade-cocondenser-selfdistil"               (symmetric SPLADE)
#   "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"  (inference-free)
#   "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"      (best OpenSearch)
model = SparseEncoder(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    model_card_data=SparseEncoderModelCardData(
        language="en",
        license="apache-2.0",
        model_name="My Domain Sparse Encoder",
    ),
)

# ── 2. Prepare domain data ─────────────────────────────────────────────────────
train_dataset = Dataset.from_dict({
    "query":    ["your domain query 1", "your domain query 2"],
    "positive": ["relevant document 1", "relevant document 2"],
    # Optional: add "negative" key for hard negatives (strongly recommended)
})

# ── 3. Define loss with FLOPS regularization ───────────────────────────────────
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,      # tune between 0 and 1e-4
    document_regularizer_weight=3e-5,   # tune between 0 and 1e-3
)

# ── 4. Training arguments ──────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
    output_dir="models/my-domain-sparse-encoder",
    num_train_epochs=1,
    per_device_train_batch_size=16,     # larger = more in-batch negatives = better
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,                          # or bf16=True on Ampere+
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # required for MNR loss
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="NanoBEIR_mean_dot_ndcg@10",
)

# ── 5. Optional: zero-shot evaluator on standard BEIR subsets ─────────────────
evaluator = SparseNanoBEIREvaluator(
    dataset_names=["nfcorpus", "scifact", "fiqa"],
    batch_size=16,
)

# ── 6. Train ───────────────────────────────────────────────────────────────────
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
model.save_pretrained("models/my-domain-sparse-encoder/final")
model.push_to_hub("my-org/my-domain-sparse-encoder")  # optional

A2. KL Divergence Distillation (better quality)

Requires teacher scores pre-computed (or computed on-the-fly if you have GPU budget).

from sentence_transformers.sparse_encoder.losses import SparseDistillKLDivLoss

# Dataset needs columns: query + doc_0, doc_1, ..., doc_N + score_0, score_1, ..., score_N
# OR use the InputExample format with {"query", "docs", "scores"}
# Sentence Transformers >= 5.0 handles both.

loss = SpladeLoss(
    model=model,
    loss=SparseDistillKLDivLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)

# Everything else identical to A1

A3. Inspecting sparsity during/after training

# Check which tokens activate and their weights
sentences = ["what causes inflation?", "neural sparse retrieval with SPLADE"]
embeddings = model.encode(sentences)

for sent, emb in zip(sentences, embeddings):
    # decode top-20 active dimensions back to readable tokens
    decoded = model.decode(emb, top_k=20)
    print(f"\n{sent}")
    print(decoded)
    print(f"Active dims: {(emb > 0).sum()}")

A4. Evaluating on BEIR

from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Install: pip install beir
dataset = "nfcorpus"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Encode with your model (BEIR corpus entries are {"title": ..., "text": ...} dicts)
doc_embeddings = model.encode_document([doc["text"] for doc in corpus.values()])
query_embeddings = model.encode_query(list(queries.values()))

# Score with dot product: SparseEncoder.similarity handles sparse tensors
scores = model.similarity(query_embeddings, doc_embeddings)  # [num_queries, num_docs]

Option B: OpenSearch Tuning Sample Repo

Best for: Production OpenSearch integration, multi-GPU training, full control over ensemble distillation.

Prerequisites: OpenSearch 2.16+ running locally (used for hard negative mining and evaluation).

# Clone the repo
git clone https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample
cd opensearch-sparse-model-tuning-sample

# Environment setup
conda create -n sparse-tuning python=3.9 -y
conda activate sparse-tuning
pip install -r requirements.txt

# Start OpenSearch locally (Docker)
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin@1234" \
  opensearchproject/opensearch:2.16.0

B1. Prepare MS MARCO training data

# Download and prepare MS MARCO with hard negatives
python prepare_msmarco_hard_negatives.py \
  --output_dir data/msmarco_hard_negs \
  --num_hard_negatives 7 \
  --opensearch_host localhost \
  --opensearch_port 9200

# Or prepare your own domain data in the expected JSONL format
python demo_train_data.py \
  --input your_query_doc_pairs.jsonl \
  --output data/domain_data

#### B2. InfoNCE training config

Edit `configs/config_infonce.yaml`:

```yaml
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs       # JSONL files
idf_path: idf.json                       # keep MS MARCO IDF or recompute
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_lambda: 0.05                       # FLOPS regularization weight
loss_types: [infonce]
output_dir: models/my-sparse-encoder
```

```bash
python train_ir.py configs/config_infonce.yaml
```
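For intuition, the InfoNCE objective with in-batch negatives plus the `flops_lambda`-weighted FLOPS regularizer can be sketched in a few lines of PyTorch. This is an illustrative reimplementation, not the repo's code; `infonce_flops_loss` and its argument shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_flops_loss(q_reps, d_reps, flops_lambda=0.05):
    """InfoNCE with in-batch negatives plus a FLOPS sparsity regularizer.

    q_reps, d_reps: (B, V) non-negative sparse representations, where
    row i of d_reps is the positive document for query i.
    """
    scores = q_reps @ d_reps.T             # (B, B): diagonal = positives
    labels = torch.arange(q_reps.size(0))  # in-batch negatives = off-diagonal
    contrastive = F.cross_entropy(scores, labels)
    # FLOPS (Paria et al., 2020): sum over vocab of squared mean activation,
    # pushing the *average* token usage toward zero and sparsifying the index
    flops = (d_reps.mean(dim=0) ** 2).sum()
    return contrastive + flops_lambda * flops
```

Raising `flops_lambda` trades a little ranking quality for a smaller, faster inverted index.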

#### B3. Ensemble KD training config (best quality)

Edit `configs/config_kd.yaml`:

```yaml
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs
idf_path: idf.json
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_d_lambda: 0.002                    # doc FLOPS weight (lower for KD)
loss_types: [kl_div]

# Ensemble teacher configuration
kd_ensemble_teacher_kwargs:
  teachers:
    - model_type: dense
      model_name_or_path: Alibaba-NLP/gte-large-en-v1.5   # dense teacher
    - model_type: sparse
      model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-v1  # sparse teacher
  score_scaling_factor: 30              # scales scores before softmax
  aggregation: arithmetic_mean          # or geometric_mean

output_dir: models/my-sparse-encoder-kd
```

```bash
# Single GPU
python train_ir.py configs/config_kd.yaml

# Multi-GPU (recommended for full MS MARCO training)
torchrun --nproc_per_node=8 train_ir.py configs/config_kd.yaml
```
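The teacher-side knobs (`score_scaling_factor`, `aggregation`) feed a KL-divergence loss roughly like the following sketch. The names are illustrative, not the repo's implementation, and the teachers' scores are assumed to already be on comparable scales:

```python
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_scores, teacher_scores_list, scaling=30.0):
    """KL-divergence distillation from an ensemble of teachers.

    student_scores: (B, N) student scores per query over 1 positive + N-1 negatives.
    teacher_scores_list: list of (B, N) score tensors, one per teacher.
    """
    # arithmetic_mean aggregation across teachers
    teacher = torch.stack(teacher_scores_list).mean(dim=0)
    # score_scaling_factor sharpens the softmax before comparing distributions
    t_dist = F.softmax(teacher * scaling, dim=-1)
    s_logp = F.log_softmax(student_scores * scaling, dim=-1)
    return F.kl_div(s_logp, t_dist, reduction="batchmean")
```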

#### B4. Recomputing IDF for domain adaptation

```python
# Recompute IDF from your own corpus (if your domain differs from MS MARCO)
from collections import defaultdict
import math
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)

def compute_idf(corpus: list[str], output_path: str = "idf.json"):
    df = defaultdict(int)  # document frequency
    N = len(corpus)

    for doc in corpus:
        token_ids = tokenizer(doc, truncation=True, max_length=512)["input_ids"]
        for tid in set(token_ids):
            df[tid] += 1

    idf = {
        str(tid): math.log((N + 1) / (count + 1)) + 1  # smoothed IDF
        for tid, count in df.items()
    }

    with open(output_path, "w") as f:
        json.dump(idf, f)

    print(f"Computed IDF for {len(idf)} tokens from {N} documents.")
    return idf

# Usage
with open("your_corpus.txt") as f:
    corpus = [line.strip() for line in f]

idf = compute_idf(corpus, "domain_idf.json")
```
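With the IDF table in hand, the query side at search time reduces to a lookup plus a sparse dot product. A toy illustration (both helper functions are hypothetical):

```python
def encode_query_idf(token_ids: list[int], idf: dict[str, float]) -> dict[int, float]:
    """Inference-free query vector: each unique query token gets its IDF weight.
    No neural forward pass is involved."""
    return {tid: idf.get(str(tid), 0.0) for tid in set(token_ids)}

def sparse_dot(query_vec: dict[int, float], doc_vec: dict[int, float]) -> float:
    """Score = dot product over the few dimensions both vectors activate."""
    return sum(w * doc_vec.get(tid, 0.0) for tid, w in query_vec.items())

# Example with toy token ids and IDF values
idf = {"101": 1.0, "2054": 2.5, "2003": 1.2}
q = encode_query_idf([2054, 2003, 2054], idf)   # duplicate tokens collapse via set()
d = {2054: 1.8, 9999: 0.7}
print(sparse_dot(q, d))  # 2.5 * 1.8 = 4.5
```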

### Option C: NAVER SPLADE Repo

Best for: Symmetric SPLADE models, research experimentation, full Hydra config control.

```bash
git clone https://github.com/naver/splade
cd splade
pip install -e .

# Download MS MARCO data
bash scripts/download_msmarco.sh
```

#### C1. Basic training

```bash
python -m splade.training.train \
  config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
  config.regularizer.FLOPS.lambda_q=0.06 \
  config.regularizer.FLOPS.lambda_d=0.02 \
  config.data.train_data_path=data/your_domain_triples.tsv \
  config.training.num_train_epochs=3 \
  config.training.learning_rate=2e-5 \
  config.training.per_device_train_batch_size=32
```
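`train_data_path` points at tab-separated (query, positive, negative) text triples in the MS MARCO triples style. The exact format expected can vary by config, so treat this small writer as a sketch:

```python
import csv

# Hypothetical domain triples: (query, positive passage, negative passage)
triples = [
    ("what causes inflation?",
     "Inflation is caused by an increase in the money supply ...",
     "To inflate a balloon, blow air into it ..."),
]

# One tab-separated triple per line
with open("your_domain_triples.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(triples)
```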

#### C2. Distillation (SPLADE++ style)

```bash
# Using cross-encoder as teacher (best quality but slow)
python -m splade.training.distil_train \
  config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
  config.teacher.model_type_or_dir=cross-encoder/ms-marco-MiniLM-L-12-v2 \
  config.regularizer.FLOPS.lambda_q=0.02 \
  config.regularizer.FLOPS.lambda_d=0.01 \
  config.data.train_data_path=data/msmarco_triples.tsv \
  config.training.num_train_epochs=3
```

#### C3. SPLADE-v3 self-distillation

SPLADE-v3 uses a mix of KL-divergence from a SPLADE++ teacher + MarginMSE loss. Config from the paper:

```
# Loss combination from SPLADE-v3 paper:
L_total = α * L_KL + (1 - α) * L_MarginMSE + λ_q * FLOPS(q) + λ_d * FLOPS(d)
# α = 0.5, λ_q = 0.01, λ_d = 0.008
# Teacher: naver/splade-cocondenser-selfdistil
# Hard negatives: 8 per query sampled from teacher's top-100
```
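Spelled out in PyTorch, that combination could look like the following sketch (derived from the formula above, not NAVER's code; tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def splade_v3_loss(s_pos, s_neg, t_pos, t_neg, q_reps, d_reps,
                   alpha=0.5, lambda_q=0.01, lambda_d=0.008):
    """s_pos/t_pos: (B,) student/teacher positive scores;
    s_neg/t_neg: (B, N) scores over N hard negatives;
    q_reps/d_reps: (B, V) sparse representations for the FLOPS terms."""
    # KL between teacher and student distributions over [positive | negatives]
    s_logits = torch.cat([s_pos.unsqueeze(1), s_neg], dim=1)   # (B, 1+N)
    t_logits = torch.cat([t_pos.unsqueeze(1), t_neg], dim=1)
    l_kl = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")
    # MarginMSE: match the teacher's positive-negative score margins
    l_margin = F.mse_loss(s_pos.unsqueeze(1) - s_neg, t_pos.unsqueeze(1) - t_neg)
    # FLOPS regularizers on query and document representations
    flops = lambda_q * (q_reps.mean(0) ** 2).sum() + lambda_d * (d_reps.mean(0) ** 2).sum()
    return alpha * l_kl + (1 - alpha) * l_margin + flops
```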

## 7. Inference-Free (Asymmetric) Fine-Tuning

For OpenSearch production deployment, the inference-free setup (query = IDF lookup, doc = neural) is strongly recommended. Sentence Transformers v5 supports training this directly via the `Router` module.

```python
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import (
    SparseStaticEmbedding,
    MLMTransformer,
    SpladePooling,
)
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# ── Build the asymmetric model ─────────────────────────────────────────────────
doc_encoder = MLMTransformer(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)

# SparseStaticEmbedding = trainable IDF lookup table
# frozen=False allows the IDF weights to be updated during training (domain adaptation)
query_encoder = SparseStaticEmbedding(
    tokenizer=doc_encoder.tokenizer,
    frozen=False,           # set True to freeze IDF (use pre-computed MS MARCO IDF only)
    idf_path="idf.json",    # path to pre-computed IDF (download from the HuggingFace model)
)

router = Router.for_query_document(
    query_modules=[query_encoder],
    document_modules=[doc_encoder, SpladePooling("max")],
)

model = SparseEncoder(modules=[router])

# ── Loss ───────────────────────────────────────────────────────────────────────
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=0.0,       # no regularization needed for static embedding
    document_regularizer_weight=3e-5,
)

# ── Training args ──────────────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
    output_dir="models/inference-free-domain",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    # Higher LR for the IDF table (it's a simpler parameter space)
    learning_rate_mapping={r"SparseStaticEmbedding\..*": 1e-3},
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

# ── Train ──────────────────────────────────────────────────────────────────────
# train_dataset: a datasets.Dataset of (query, positive passage) pairs; with a
# Router model, map dataset columns to routes via args.router_mapping if the
# column names are not already recognized
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

At inference time after training:

```python
# Query: no neural inference (just an IDF lookup internally)
query_embedding = model.encode_query("what causes inflation?")

# Document: full neural encoding
doc_embedding = model.encode_document("Inflation is caused by...")

# Score
score = (query_embedding * doc_embedding).sum()  # dot product
```

## 8. Hard Negative Mining

Hard negatives are documents that are superficially relevant (e.g., retrieved by BM25) but are actually not relevant. They make training significantly harder and improve the model's ability to discriminate.

### Strategy 1: BM25 Hard Negatives (simplest)

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def mine_bm25_hard_negatives(
    queries: list[str],
    corpus: list[str],
    qrels: dict[str, list[int]],  # query_id → list of relevant doc indices
    n_negatives: int = 7,
) -> list[dict]:
    """Mine hard negatives using BM25 retrieval."""
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    examples = []
    for q_idx, query in enumerate(queries):
        scores = bm25.get_scores(query.lower().split())
        top_k_indices = scores.argsort()[::-1][:100]  # top-100 by BM25

        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]

        if relevant_ids and hard_negatives:
            examples.append({
                "query": query,
                "positive": corpus[list(relevant_ids)[0]],
                "negatives": hard_negatives,
            })

    return examples
```

### Strategy 2: Current Model Hard Negatives (iterative, best quality)

```python
# Mine hard negatives using your current model's retrieval.
# Run after each epoch or every N steps for iterative refinement.

def mine_model_hard_negatives(model, queries, corpus, qrels, n_negatives=7, top_k=100):
    corpus_embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)

    examples = []
    for q_idx, query in enumerate(queries):
        query_embedding = model.encode([query])
        scores = model.similarity(query_embedding, corpus_embeddings)[0]  # (len(corpus),)
        top_k_indices = scores.argsort(descending=True)[:top_k].tolist()

        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]

        if relevant_ids and hard_negatives:
            examples.append({
                "query": query,
                "positive": corpus[list(relevant_ids)[0]],
                "negatives": hard_negatives,
            })

    return examples
```

### Strategy 3: Using OpenSearch for Hard Negative Mining

If you have an OpenSearch cluster, the tuning sample repo automates this:

```bash
# Index your corpus into OpenSearch
python index_corpus.py \
  --corpus your_corpus.jsonl \
  --index my-index \
  --opensearch_host localhost

# Mine hard negatives using BM25 retrieval
python prepare_hard_negatives.py \
  --queries your_queries.jsonl \
  --qrels your_qrels.tsv \
  --index my-index \
  --n_negatives 7 \
  --output data/hard_negatives.jsonl
```
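Whichever strategy you use, the mined `{query, positive, negatives}` records flatten into the (query, positive, negative) triplet rows that most ranking losses expect:

```python
def to_triplets(examples: list[dict]) -> list[tuple[str, str, str]]:
    """Flatten {"query", "positive", "negatives": [...]} records into
    (query, positive, negative) rows, ready for a triplet-style loss or
    for conversion into a training dataset."""
    triplets = []
    for ex in examples:
        for neg in ex["negatives"]:
            triplets.append((ex["query"], ex["positive"], neg))
    return triplets

# One record with 2 negatives yields 2 triplets
rows = to_triplets([{"query": "q", "positive": "p", "negatives": ["n1", "n2"]}])
print(len(rows))  # 2
```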

## 9. Practical Tips & Hyperparameter Guide

### Sparsity Monitoring

Always track average active dimensions during training. Target ranges:

| Component | Target active dims | If too many | If too few |
|-----------|--------------------|-------------|------------|
| Documents | 100–300 | Increase λ_d | Decrease λ_d |
| Queries (neural) | 10–50 | Increase λ_q | Decrease λ_q |
| Queries (IDF lookup) | = query length | N/A | N/A |

```python
def log_sparsity(model, sample_texts: list[str], prefix=""):
    embeddings = model.encode(sample_texts)
    active_dims = [(emb > 0).sum() for emb in embeddings]
    print(f"{prefix} active dims: mean={sum(active_dims)/len(active_dims):.1f}, "
          f"min={min(active_dims)}, max={max(active_dims)}")
```

### Hyperparameter Summary

| Parameter | Recommended range | Notes |
|-----------|-------------------|-------|
| learning_rate | 1e-5 to 5e-5 | Start at 2e-5 |
| per_device_batch_size | 16–64 | Bigger batch → more in-batch negatives for InfoNCE |
| λ_q (query FLOPS) | 1e-5 to 1e-4 | Typically higher than λ_d (queries must be sparser); set to 0 for inference-free |
| λ_d (doc FLOPS) | 1e-5 to 5e-2 | Critical; tune this first |
| warmup_ratio | 0.05–0.1 | Standard |
| weight_decay | 0.01 | Standard |
| max_seq_length | 512 for docs, 64–128 for queries | |
| num_train_epochs | 1–5 | 1 is usually sufficient for fine-tuning |

### Common Failure Modes

| Symptom | Cause | Fix |
|---------|-------|-----|
| Embeddings all zeros | λ too high or LR too high | Reduce λ and/or LR |
| No sparsity (all dims active) | λ too low | Increase λ_d |
| Training loss not decreasing | LR too low or bad data | Check data format, increase LR |
| Good train loss, poor BEIR | Overfitting to MS MARCO | Add domain data, reduce epochs |
| Query vecs much sparser than doc vecs | Normal: queries are shorter | Expected behaviour |

### On Starting from Scratch vs. Fine-Tuning

Never train a sparse encoder from a plain BERT/DistilBERT checkpoint without pre-training. The MLM head must first learn to produce meaningful sparse activations. Training from scratch requires:

- ≥500K (query, document) pairs
- A multi-phase lambda schedule
- Likely weeks of training on 8+ GPUs
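The multi-phase lambda schedule refers to the quadratic warm-up the SPLADE papers apply to the FLOPS weights: λ ramps from 0 to its target over roughly the first 50k steps, so the model learns to rank before it is forced to be sparse. A sketch:

```python
def flops_lambda_at(step: int, lambda_max: float, ramp_steps: int = 50_000) -> float:
    """Quadratic ramp: λ(step) = λ_max * min(step / ramp_steps, 1)^2."""
    return lambda_max * min(step / ramp_steps, 1.0) ** 2

print(flops_lambda_at(25_000, 0.05))  # 0.05 * 0.25 = 0.0125
```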

Always start from a pre-trained sparse checkpoint for domain fine-tuning. Even naver/splade-cocondenser-selfdistil or opensearch-neural-sparse-encoding-doc-v2-distill will adapt well with just a few thousand domain examples.


## 10. Decision Guide

| Goal | Start from | Training approach | Approx. training time |
|------|------------|-------------------|-----------------------|
| Best retrieval quality (symmetric) | naver/splade-v3 | NAVER repo + KL-div + MarginMSE | Days on 8 GPUs |
| OpenSearch production (fastest query) | opensearch-neural-sparse-encoding-doc-v2-distill | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Simplest domain adaptation | doc-v2-distill | Sentence Transformers SpladeLoss + MNRL | Minutes on 1 GPU |
| Best OpenSearch quality | opensearch-neural-sparse-encoding-doc-v3-gte | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Multilingual | opensearch-neural-sparse-encoding-multilingual-v1 | MIRACL data + KD approach | Hours on 4 GPUs |
| Sparsify an existing dense model | your dense model + SparseAutoEncoder | Sentence Transformers CSRLoss | Minutes on 1 GPU |
| Research / ablation studies | naver/splade-cocondenser-selfdistil | NAVER SPLADE repo (Hydra configs) | Configurable |

## 11. References

### Papers (chronological)

1. BM25 — Robertson & Zaragoza (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR.
2. DeepImpact — Mallia et al. (2021). Learning Passage Impacts for Inverted Indexes. SIGIR 2021. arXiv:2104.12016
3. SPLADE v1 — Formal et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720
4. SPLADE v2 — Formal et al. (2021). SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. NeurIPS 2021 Workshop. arXiv:2109.10086
5. BEIR Benchmark — Thakur et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021. arXiv:2104.08663
6. SPLADE++ / SelfDistil / EnsembleDistil — Formal et al. (2022). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. SIGIR 2022. arXiv:2205.04733
7. CoCondenser — Gao & Callan (2022). Unsupervised Corpus Aware Language Model Pre-Training for Dense Passage Retrieval. ACL 2022. arXiv:2108.05540
8. SPLADE-v3 — Lassance & Formal (2024). SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789
9. CSR — Yang et al. (2024). CSR: Cascade Sparse Retrieval for Open-Domain Question Answering. arXiv:2404.12153
10. OpenSearch Inference-Free Neural Sparse (v2) — Yang et al. (2024). Inference-free Sparse Retrieval via IDF-Aware Ensemble Distillation. arXiv:2411.04403
11. SPLADE-v3 + L0 — Lassance et al. (2025). Efficient Sparse Retrieval with L0 Regularization. SIGIR 2025. arXiv:2504.14839

### Code & Repos

| Resource | URL |
|----------|-----|
| NAVER SPLADE repo | https://github.com/naver/splade |
| OpenSearch sparse tuning sample | https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample |
| Sentence Transformers sparse training docs | https://sbert.net/docs/sparse_encoder/training_overview.html |
| Sentence Transformers sparse training examples | https://github.com/UKPLab/sentence-transformers/tree/master/examples/sparse_encoder |
| BEIR benchmark | https://github.com/beir-cellar/beir |

### Blog Posts

| Resource | URL |
|----------|-----|
| HuggingFace blog: Train Sparse Encoders (ST v5) | https://huggingface.co/blog/train-sparse-encoder |
| OpenSearch neural sparse documentation | https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/ |
| Sentence Transformers v5 release announcement | https://huggingface.co/blog/sentence-transformers-v5 |

### HuggingFace Models

| Model | URL |
|-------|-----|
| naver/splade-v3 | https://huggingface.co/naver/splade-v3 |
| naver/splade-cocondenser-selfdistil | https://huggingface.co/naver/splade-cocondenser-selfdistil |
| naver/splade-cocondenser-ensembledistil | https://huggingface.co/naver/splade-cocondenser-ensembledistil |
| opensearch-project/opensearch-neural-sparse-encoding-v1 | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1 |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte |
| opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 |
| opensearch-project (full org) | https://huggingface.co/opensearch-project |

Guide compiled from NAVER SPLADE papers, Amazon OpenSearch Neural Sparse papers, and Sentence Transformers v5 documentation. All code is written for Python 3.9+ and tested against sentence-transformers>=5.0.
