Fine-Tuning Sparse Encoders for Neural Sparse Retrieval: Complete Reproducible Guide (SPLADE, OpenSearch Neural Sparse, CSR)
Last updated: 2026-02
Scope: Covers SPLADE / SPLADE++ / SPLADE-v3, OpenSearch Neural Sparse v1–v3 (including inference-free), and CSR (Contrastive Sparse Representation). Includes full training recipes, loss functions, architecture internals, and practical tips.
- What Are Sparse Encoders?
- Model Families
- Architecture Internals
- Training Objectives & Loss Functions
- Training Data
- Fine-Tuning Recipes
- Inference-Free (Asymmetric) Fine-Tuning
- Hard Negative Mining
- Practical Tips & Hyperparameter Guide
- Decision Guide
- References
Sparse encoders map text to high-dimensional vectors where most values are zero. Each non-zero dimension corresponds to a vocabulary token; its weight represents that token's importance in the text.
"neural retrieval" → {retrieval: 2.1, neural: 1.9, search: 0.8, information: 0.3, ...}
(vocab_size = 30,522; ~95-99% of dims are zero)
Why sparse over dense?
| Property | Dense (e.g. SBERT) | Sparse (SPLADE) |
|---|---|---|
| Vector dim | 768 | 30,522 |
| Non-zero dims | all | ~100–300 |
| Inverted index compatible | ✗ | ✓ |
| BM25-level latency | ✗ | ✓ |
| Semantic expansion | implicit | explicit (MLM head) |
| Interpretable | ✗ | ✓ (readable tokens) |
| BEIR avg NDCG@10 | ~0.49–0.52 | ~0.50–0.55 |
The killer feature: sparse vectors are compatible with Lucene/OpenSearch inverted indexes. Retrieval is a standard dot-product scan — the same data structure BM25 uses — giving near-BM25 latency at neural relevance quality.
The magic comes from the MLM head: each input token position projects onto the entire vocabulary, enabling the model to activate semantically related terms that never appeared in the original text (query/document expansion).
Symmetric models — both query and document run through a full neural encoder.
| Model | HuggingFace ID | Params | MS MARCO MRR@10 | BEIR Avg NDCG@10 |
|---|---|---|---|---|
| SPLADE-v2 | `naver/splade-v2-max` | 110M | 36.8 | 0.497 |
| SPLADE-v2-distill | `naver/splade-v2-distilsplade-max` | 66M | 36.1 | — |
| SPLADE++ SelfDistil | `naver/splade-cocondenser-selfdistil` | 110M | 37.6 | 0.510 |
| SPLADE++ EnsembleDistil | `naver/splade-cocondenser-ensembledistil` | 110M | 38.3 | 0.524 |
| SPLADE-v3 | `naver/splade-v3` | 110M | 40.2 | ~0.53 |
| SPLADE-v3-Doc | `naver/splade-v3-doc` | 110M | — | ~0.54 |
Key papers:
- SPLADE v1: Formal et al., 2021 — SIGIR 2021
- SPLADE v2: Formal et al., 2021b — NeurIPS 2021 Workshop
- SPLADE++ / SelfDistil / EnsembleDistil: Formal et al., 2022 — SIGIR 2022
- SPLADE-v3: Lassance & Formal, 2024
- SPLADE-v3 with L0 regularization: Lassance et al., 2025 — SIGIR 2025
Asymmetric models — document is neural; query uses IDF lookup (no inference at search time).
| Model | HuggingFace ID | Params | Base | BEIR NDCG@10 | Avg FLOPs |
|---|---|---|---|---|---|
| v1 | `opensearch-project/opensearch-neural-sparse-encoding-v1` | 133M | BERT-base | 0.524 | 11.4 |
| doc-v2-mini | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini` | 22M | MiniLM | 0.497 | 0.7 |
| doc-v2-distill | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill` | 67M | DistilBERT | 0.504 | 1.8 |
| doc-v2 | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2` | 133M | BERT-base | 0.515 | 3.3 |
| doc-v3-distill | `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill` | 67M | DistilBERT | 0.517 | 1.8 |
| doc-v3-gte | `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte` | 133M | GTE-base | 0.546 | ~2 |
| multilingual-v1 | `opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1` | 160M | mBERT | 0.629 (MIRACL) | 1.3 |
Key papers:
- OpenSearch inference-free (v2): Yang et al., 2024
- BEIR benchmark used for evaluation: Thakur et al., 2021
CSR (Contrastive Sparse Representation) adds a sparse autoencoder on top of an existing dense model, sparsifying its output. Useful when you already have a fine-tuned dense model and want sparse retrieval without retraining from scratch.
- Paper: Yang et al., 2024 — CSR
- Available in Sentence Transformers v5 via `SparseAutoEncoder` + `CSRLoss`
Input tokens: ["what", "is", "SPLADE", "?"]
↓
BertForMaskedLM (or DistilBertForMaskedLM)
↓
token-level logits: shape [batch, seq_len, vocab_size] # e.g. [1, 4, 30522]
↓
max pooling over seq_len dimension → [batch, vocab_size] # collapse to one vector
↓
log(1 + ReLU(x)) # activation + log-saturation
↓
sparse vector: [batch, vocab_size] # ~95-99% zeros
Why max pooling? Aggregates the "best evidence" for each vocabulary token across all input positions. Each position can activate vocabulary tokens it didn't literally contain — this is the expansion.
Why log(1 + ReLU(x))? ReLU kills negatives (forcing sparsity). Log saturates large values (preventing a few dimensions from dominating), providing a soft upper bound analogous to BM25's term frequency saturation.
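The pooling-and-activation step is small enough to sketch directly. Below is a minimal numpy version (the helper name `splade_pool` is hypothetical; in a real model the logits come from a `BertForMaskedLM` forward pass):

```python
import numpy as np

def splade_pool(logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """logits: [batch, seq_len, vocab] MLM logits; attention_mask: [batch, seq_len].
    Returns [batch, vocab]: max over positions of log(1 + ReLU(x))."""
    w = np.log1p(np.maximum(logits, 0.0))   # log(1 + ReLU(x)), elementwise
    w = w * attention_mask[..., None]       # zero out padding positions
    return w.max(axis=1)                    # max pool over seq_len
```

Negative logits map to exactly zero, which is where the sparsity comes from: only vocabulary tokens some position positively activates survive the pooling.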
SPLADE-v3 double-log activation:
# v2: log(1 + ReLU(x))
# v3: log(1 + log(1 + ReLU(x)))  ← stronger sparsification, less aggressive FLOPS penalty needed

At query time, no neural inference is run:
def encode_query_inference_free(query_tokens: list[int], idf: dict[int, float]) -> dict[int, float]:
    """
    idf.json: {token_id_str: idf_value} pre-computed from training corpus (MS MARCO).
    Returns sparse vector as {token_id: weight} dict.
    """
    sparse = {}
    for token_id in set(query_tokens):  # deduplicate
        if token_id in idf:
            sparse[token_id] = idf[token_id]
        else:
            sparse[token_id] = 1.0  # default IDF for unknown tokens
    return sparse

The idf.json file is shipped alongside each v2/v3 model on HuggingFace. You can recompute it from your own corpus for domain adaptation.
Dense model (frozen or fine-tuned) → dense vector [batch, 768]
↓
SparseAutoEncoder:
encoder: Linear(768 → 4*768) + ReLU + TopK(k=256) ← only top-k activations kept
decoder: Linear(4*768 → 768) ← reconstruction loss
↓
sparse vector: [batch, 3072] (4x expansion, mostly zeros)
The core mechanism that enforces sparsity. Penalizes tokens that activate with high magnitude on average across the training batch:
FLOPS(X) = Σ_{j ∈ vocab} [ (1/N) Σ_{i=1}^{N} w_j(x_i) ]^2
where:
N = batch size
w_j(x_i) = weight of vocab token j in sparse vector of example i
Intuition: If token "the" activates in every document, its average activation is high → large penalty → model learns to suppress it. Rare, informative tokens activate infrequently → low average → preserved.
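The formula is easy to compute directly from a batch of sparse vectors; a minimal numpy sketch (the helper name `flops_penalty` is hypothetical):

```python
import numpy as np

def flops_penalty(W: np.ndarray) -> float:
    """W: [batch, vocab] batch of sparse vectors.
    FLOPS = sum over vocab of (mean activation of that token across the batch)^2."""
    return float((W.mean(axis=0) ** 2).sum())
```

A token like "the" that fires in every row keeps a high mean and pays the full squared penalty; a rare token that fires in one row out of N contributes only (w/N)².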
Combined training loss:
L_total = L_rank + λ_q · FLOPS(queries) + λ_d · FLOPS(docs)
Typical λ values:
| Framework / Config | λ_q | λ_d |
|---|---|---|
| SPLADE original | 0.06 | 0.02 |
| Sentence Transformers (SpladeLoss) | 5e-5 | 3e-5 |
| OpenSearch tuning sample (InfoNCE) | 0.05 | 0.05 |
| OpenSearch tuning sample (KD) | — | 0.002 |
| OpenSearch v2 pre-training phase | — | 1e-7 |
| OpenSearch v2 fine-tuning phase | — | 0.02 |
Warning: λ too high → sparse vectors collapse to near-zero (no retrieval signal). λ too low → dense-like vectors (slow index, no sparsity benefit). Always monitor average active dims during training.
Two-phase lambda schedule (OpenSearch v2 approach, strongly recommended for training from scratch):
Phase 1 (large corpus, weak labels): λ_d = 1e-7 → focus on learning relevance
Phase 2 (MS MARCO / domain data): λ_d = 0.02 → enforce sparsity
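On top of the phase schedule, the original SPLADE training code also ramps λ up quadratically over the first ~50K steps, so the model learns relevance before the sparsity pressure fully applies. A sketch of that ramp (hypothetical helper name, assuming the 50K-step warmup from the SPLADE papers):

```python
def lambda_at_step(step: int, lambda_target: float, warmup_steps: int = 50_000) -> float:
    """Quadratic ramp from 0 to lambda_target over warmup_steps, then constant."""
    if step >= warmup_steps:
        return lambda_target
    return lambda_target * (step / warmup_steps) ** 2
```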
Instead of raw dot product, scores are IDF-weighted:
score(q, d) = Σ_{t ∈ vocab} idf(t) · q_t · d_t
Why? IDF(t) is large for rare/informative tokens. Multiplying by IDF amplifies the gradient signal for these tokens in the ranking loss, teaching the model to preserve them. Common tokens (low IDF) are simultaneously pushed down by FLOPS regularization. The two forces complement each other.
IDF is pre-computed from MS MARCO corpus. Unseen tokens default to idf = 1.0.
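With dense vocab-sized arrays the IDF-weighted score is a one-liner; a minimal numpy sketch (hypothetical helper name):

```python
import numpy as np

def idf_weighted_score(q: np.ndarray, d: np.ndarray, idf: np.ndarray) -> float:
    """q, d, idf are all [vocab]-sized; unseen tokens carry idf = 1.0."""
    return float((idf * q * d).sum())
```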
InfoNCE / Multiple Negatives Ranking Loss:
L_InfoNCE = -log( exp(score(q, p) / τ) / Σ_j exp(score(q, d_j) / τ) )
where:
p = positive document
d_j = all docs in batch (positive + in-batch negatives + mined hard negatives)
τ = temperature (default 0.02 in Sentence Transformers)
In Sentence Transformers: SparseMultipleNegativesRankingLoss
Requirements: batch size ≥ 16 (more in-batch negatives = harder training signal). Use BatchSamplers.NO_DUPLICATES to ensure each query appears once per batch.
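For intuition, here is the loss for a single query against its candidate list, written out in numpy (hypothetical helper name; the positive sits at index 0, matching the formula above):

```python
import numpy as np

def info_nce(scores, positive_idx=0, tau=0.02):
    """scores: similarities of one query against all candidate docs in the batch."""
    logits = np.asarray(scores, dtype=float) / tau
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[positive_idx]))
```

With uniform scores the loss is log(n_docs); it falls toward zero as the positive pulls ahead of the negatives.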
MarginMSE (used in SPLADE-v3 in combination with KL-div):
L_MarginMSE = MSE( score(q,p) - score(q,n), teacher_score(q,p) - teacher_score(q,n) )
Distillation consistently outperforms pointwise/pairwise training. The teacher (cross-encoder or ensemble retriever) provides soft scores for a list of candidate documents per query.
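The per-triple term is just a squared margin difference; a minimal sketch (hypothetical helper name — the full loss averages this over all (q, p, n) triples in the batch):

```python
def margin_mse(s_qp: float, s_qn: float, t_qp: float, t_qn: float) -> float:
    """Squared error between the student margin and the teacher margin."""
    return ((s_qp - s_qn) - (t_qp - t_qn)) ** 2
```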
KL Divergence Distillation:
L_KL = KL( softmax(teacher_scores / τ_t) || softmax(student_scores / τ_s) )
In Sentence Transformers: SparseDistillKLDivLoss
Data format:
{"query": "what is SPLADE?",
"docs": ["SPLADE is a sparse...", "Dense models use...", "BM25 is a..."],
"scores": [9.2, 1.1, 4.5]}OpenSearch's key contribution — avoid expensive cross-encoder inference by combining two cheap retrievers:
Teacher 1 (dense): Alibaba-NLP/gte-large-en-v1.5
Teacher 2 (sparse): opensearch-project/opensearch-neural-sparse-encoding-v1
For each query:
scores_dense = dense_teacher.score(query, [doc_1, ..., doc_N])
scores_sparse = sparse_teacher.score(query, [doc_1, ..., doc_N])
norm_dense = min_max_scale(scores_dense) # → [0, 1]
norm_sparse = min_max_scale(scores_sparse) # → [0, 1]
final_score = (norm_dense + norm_sparse) / 2
Why it works: Dense captures semantic similarity; sparse captures exact lexical matches. Their combination is complementary and often matches cross-encoder quality on retrieval tasks while being ~10x cheaper to run.
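The normalize-and-average step above can be sketched in a few lines of numpy (hypothetical helper names; scaling is done per query over that query's candidate list):

```python
import numpy as np

def min_max_scale(x) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def ensemble_teacher_scores(scores_dense, scores_sparse) -> np.ndarray:
    """Average of the two min-max-normalized teacher score lists → soft labels in [0, 1]."""
    return (min_max_scale(scores_dense) + min_max_scale(scores_sparse)) / 2
```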
Any dataset of (query, relevant_document) pairs. Even 1,000–5,000 domain-specific pairs can meaningfully improve a pre-trained sparse encoder for a specific domain when fine-tuning (not training from scratch).
MS MARCO (primary, high quality):
- 502,548 training queries
- 8.84M passages
- Relevance annotations from Bing click logs
Weak supervision mix (5.36M additional queries across 14 datasets):
| Dataset | Source |
|---|---|
| `eli5_question_answer` | Reddit ELI5 |
| `squad_pairs` | SQuAD reading comprehension |
| `WikiAnswers` | WikiAnswers duplicate questions |
| `yahoo_answers_*` | Yahoo Answers |
| `gooaq_pairs` | Google autocomplete QA |
| `stackexchange_duplicate_questions_*` | StackExchange |
| `wikihow` | WikiHow articles |
| `S2ORC_title_abstract` | Semantic Scholar papers |
| `searchQA_top5_snippets` | Jeopardy-style QA |
All available via HuggingFace Datasets or the BEIR repository.
For InfoNCE / MNR Loss (JSONL):
{"query": "what causes inflation?", "positive": "Inflation is caused by..."}
{"query": "who wrote Hamlet?", "positive": "Hamlet was written by Shakespeare...", "negatives": ["unrelated doc 1", "unrelated doc 2"]}For Distillation (KL-div) (JSONL):
{"query": "what causes inflation?",
"docs": ["Inflation is caused by...", "Shakespeare wrote...", "The sun is..."],
"scores": [9.1, 0.2, 0.1]}For Sentence Transformers (using datasets library):
from datasets import Dataset
# Simplest format
dataset = Dataset.from_dict({
"query": ["what causes inflation?", "who wrote Hamlet?"],
"positive": ["Inflation is caused by...", "Hamlet was written by Shakespeare..."],
})
# With hard negatives (better)
dataset = Dataset.from_dict({
"query": ["what causes inflation?"],
"positive": ["Inflation is caused by..."],
"negative": ["An unrelated but superficially similar document..."],
})

Best for: Domain adaptation, simplest setup, no external dependencies.
pip install -U sentence-transformers datasets

from datasets import Dataset
from sentence_transformers import (
SparseEncoder,
SparseEncoderTrainer,
SparseEncoderTrainingArguments,
SparseEncoderModelCardData,
)
from sentence_transformers.sparse_encoder.losses import (
SpladeLoss,
SparseMultipleNegativesRankingLoss,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.training_args import BatchSamplers
# ── 1. Load a pre-trained sparse model ────────────────────────────────────────
# Don't train from scratch — always start from a strong pre-trained checkpoint.
# Good starting points:
# "naver/splade-cocondenser-selfdistil" (symmetric SPLADE)
# "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill" (inference-free)
# "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte" (best OpenSearch)
model = SparseEncoder(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
model_card_data=SparseEncoderModelCardData(
language="en",
license="apache-2.0",
model_name="My Domain Sparse Encoder",
),
)
# ── 2. Prepare domain data ─────────────────────────────────────────────────────
train_dataset = Dataset.from_dict({
"query": ["your domain query 1", "your domain query 2"],
"positive": ["relevant document 1", "relevant document 2"],
# Optional: add "negative" key for hard negatives (strongly recommended)
})
# ── 3. Define loss with FLOPS regularization ───────────────────────────────────
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=5e-5, # tune between 0 and 1e-4
document_regularizer_weight=3e-5, # tune between 0 and 1e-3
)
# ── 4. Training arguments ──────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
output_dir="models/my-domain-sparse-encoder",
num_train_epochs=1,
per_device_train_batch_size=16, # larger = more in-batch negatives = better
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True, # or bf16=True on Ampere+
batch_sampler=BatchSamplers.NO_DUPLICATES, # required for MNR loss
eval_strategy="steps",
eval_steps=500,
save_steps=500,
logging_steps=50,
load_best_model_at_end=True,
metric_for_best_model="NanoBEIR_mean_dot_ndcg@10",
)
# ── 5. Optional: zero-shot evaluator on standard BEIR subsets ─────────────────
evaluator = SparseNanoBEIREvaluator(
dataset_names=["nfcorpus", "scifact", "fiqa"],
batch_size=16,
)
# ── 6. Train ───────────────────────────────────────────────────────────────────
trainer = SparseEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
model.save_pretrained("models/my-domain-sparse-encoder/final")
model.push_to_hub("my-org/my-domain-sparse-encoder")  # optional

Requires teacher scores pre-computed (or computed on-the-fly if you have GPU budget).
from sentence_transformers.sparse_encoder.losses import SparseDistillKLDivLoss
# Dataset needs columns: query + doc_0, doc_1, ..., doc_N + score_0, score_1, ..., score_N
# OR use the InputExample format with {"query", "docs", "scores"}
# Sentence Transformers >= 5.0 handles both.
loss = SpladeLoss(
model=model,
loss=SparseDistillKLDivLoss(model=model),
query_regularizer_weight=5e-5,
document_regularizer_weight=3e-5,
)
# Everything else identical to A1

# Check which tokens activate and their weights
sentences = ["what causes inflation?", "neural sparse retrieval with SPLADE"]
embeddings = model.encode(sentences, output_value="sentence_embedding")
for sent, emb in zip(sentences, embeddings):
# decode top-20 active dimensions back to readable tokens
decoded = model.decode(emb, top_k=20)
print(f"\n{sent}")
print(decoded)
print(f"Active dims: {(emb > 0).sum()}")from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
# Install: pip install beir
dataset = "nfcorpus"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
# Encode with your model
doc_embeddings = model.encode_corpus(list(corpus.values()))
query_embeddings = model.encode_queries(list(queries.values()))
# Retrieve using dot product (handles sparse vectors)
from sentence_transformers.sparse_encoder import SparseEncoderSimilarityFunction
results = SparseEncoderSimilarityFunction.DOT_PRODUCT.pairwise_scores(
query_embeddings, doc_embeddings
)

Best for: Production OpenSearch integration, multi-GPU training, full control over ensemble distillation.
Prerequisites: OpenSearch 2.16+ running locally (used for hard negative mining and evaluation).
# Clone the repo
git clone https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample
cd opensearch-sparse-model-tuning-sample
# Environment setup
conda create -n sparse-tuning python=3.9 -y
conda activate sparse-tuning
pip install -r requirements.txt
# Start OpenSearch locally (Docker)
docker run -p 9200:9200 -e "discovery.type=single-node" \
-e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin@1234" \
opensearchproject/opensearch:2.16.0

# Download and prepare MS MARCO with hard negatives
python prepare_msmarco_hard_negatives.py \
--output_dir data/msmarco_hard_negs \
--num_hard_negatives 7 \
--opensearch_host localhost \
--opensearch_port 9200
# Or prepare your own domain data in the expected JSONL format
python demo_train_data.py \
--input your_query_doc_pairs.jsonl \
--output data/domain_data

Edit configs/config_infonce.yaml:
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs # JSONL files
idf_path: idf.json # keep MS MARCO IDF or recompute
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_lambda: 0.05 # FLOPS regularization weight
loss_types: [infonce]
output_dir: models/my-sparse-encoder

python train_ir.py configs/config_infonce.yaml

Edit configs/config_kd.yaml:
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs
idf_path: idf.json
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_d_lambda: 0.002 # doc FLOPS weight (lower for KD)
loss_types: [kl_div]
# Ensemble teacher configuration
kd_ensemble_teacher_kwargs:
teachers:
- model_type: dense
model_name_or_path: Alibaba-NLP/gte-large-en-v1.5 # dense teacher
- model_type: sparse
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-v1 # sparse teacher
score_scaling_factor: 30 # scales scores before softmax
aggregation: arithmetic_mean # or geometric_mean
output_dir: models/my-sparse-encoder-kd

# Single GPU
python train_ir.py configs/config_kd.yaml
# Multi-GPU (recommended for full MS MARCO training)
torchrun --nproc_per_node=8 train_ir.py configs/config_kd.yaml

# Recompute IDF from your own corpus (if your domain differs from MS MARCO)
from collections import defaultdict
import math
import json
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)
def compute_idf(corpus: list[str], output_path: str = "idf.json"):
    df = defaultdict(int)  # document frequency
    N = len(corpus)
    for doc in corpus:
        token_ids = tokenizer(doc, truncation=True, max_length=512)["input_ids"]
        for tid in set(token_ids):
            df[tid] += 1
    idf = {
        str(tid): math.log((N + 1) / (count + 1)) + 1  # smoothed IDF
        for tid, count in df.items()
    }
    with open(output_path, "w") as f:
        json.dump(idf, f)
    print(f"Computed IDF for {len(idf)} tokens from {N} documents.")
    return idf
# Usage
with open("your_corpus.txt") as f:
corpus = [line.strip() for line in f]
idf = compute_idf(corpus, "domain_idf.json")

Best for: Symmetric SPLADE models, research experimentation, full Hydra config control.
git clone https://github.com/naver/splade
cd splade
pip install -e .
# Download MS MARCO data
bash scripts/download_msmarco.sh

python -m splade.training.train \
config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
config.regularizer.FLOPS.lambda_q=0.06 \
config.regularizer.FLOPS.lambda_d=0.02 \
config.data.train_data_path=data/your_domain_triples.tsv \
config.training.num_train_epochs=3 \
config.training.learning_rate=2e-5 \
config.training.per_device_train_batch_size=32

# Using cross-encoder as teacher (best quality but slow)
python -m splade.training.distil_train \
config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
config.teacher.model_type_or_dir=cross-encoder/ms-marco-MiniLM-L-12-v2 \
config.regularizer.FLOPS.lambda_q=0.02 \
config.regularizer.FLOPS.lambda_d=0.01 \
config.data.train_data_path=data/msmarco_triples.tsv \
config.training.num_train_epochs=3

SPLADE-v3 uses a mix of KL-divergence from a SPLADE++ teacher + MarginMSE loss. Config from the paper:
# Loss combination from SPLADE-v3 paper:
L_total = α * L_KL + (1 - α) * L_MarginMSE + λ_q * FLOPS(q) + λ_d * FLOPS(d)
# α = 0.5, λ_q = 0.01, λ_d = 0.008
# Teacher: naver/splade-cocondenser-selfdistil
# Hard negatives: 8 per query sampled from teacher's top-100

For OpenSearch production deployment, the inference-free setup (query = IDF lookup, doc = neural) is strongly recommended. Sentence Transformers v5 supports training this directly via Router.
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import (
SparseStaticEmbedding,
MLMTransformer,
SpladePooling,
)
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
# ── Build the asymmetric model ─────────────────────────────────────────────────
doc_encoder = MLMTransformer(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)
# SparseStaticEmbedding = trainable IDF lookup table
# frozen=False allows the IDF weights to be updated during training (domain adaptation)
query_encoder = SparseStaticEmbedding(
tokenizer=doc_encoder.tokenizer,
frozen=False, # set True to freeze IDF (use pre-computed MS MARCO IDF only)
idf_path="idf.json", # path to pre-computed IDF (download from HuggingFace model)
)
router = Router.for_query_document(
query_modules=[query_encoder],
document_modules=[doc_encoder, SpladePooling("max")],
)
model = SparseEncoder(modules=[router])
# ── Loss ───────────────────────────────────────────────────────────────────────
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=0.0, # no regularization needed for static embedding
document_regularizer_weight=3e-5,
)
# ── Training args ──────────────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
output_dir="models/inference-free-domain",
num_train_epochs=1,
per_device_train_batch_size=16,
learning_rate=2e-5,
# Higher LR for the IDF table (it's a simpler parameter space)
learning_rate_mapping={r"SparseStaticEmbedding\..*": 1e-3},
fp16=True,
batch_sampler=BatchSamplers.NO_DUPLICATES,
)

At inference time after training:
# Query: no neural inference
query_embedding = model.encode("what causes inflation?", prompt_name="query")
# ↑ this just does IDF lookup internally
# Document: full neural encoding
doc_embedding = model.encode("Inflation is caused by...", prompt_name="passage")
# Score
score = (query_embedding * doc_embedding).sum()  # dot product

Hard negatives are documents that are superficially relevant (e.g., retrieved by BM25) but are actually not relevant. They make training significantly harder and improve the model's ability to discriminate.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def mine_bm25_hard_negatives(
    queries: list[str],
    corpus: list[str],
    qrels: dict[str, list[int]],  # query_id → list of relevant doc indices
    n_negatives: int = 7,
) -> list[dict]:
    """Mine hard negatives using BM25 retrieval."""
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    examples = []
    for q_idx, query in enumerate(queries):
        scores = bm25.get_scores(query.lower().split())
        top_k_indices = scores.argsort()[::-1][:100]  # top-100 by BM25
        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]
        if relevant_ids and hard_negatives:
            examples.append({
                "query": query,
                "positive": corpus[list(relevant_ids)[0]],
                "negatives": hard_negatives,
            })
    return examples

# Mine hard negatives using your current model's retrieval
# Run after each epoch or every N steps for iterative refinement
def mine_model_hard_negatives(model, queries, corpus, qrels, n_negatives=7, top_k=100):
    corpus_embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)
    examples = []
    for q_idx, query in enumerate(queries):
        query_embedding = model.encode([query])
        scores = (query_embedding * corpus_embeddings).sum(axis=1)
        top_k_indices = scores.argsort()[::-1][:top_k]
        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]
        if relevant_ids and hard_negatives:
            examples.append({"query": query, "positive": corpus[list(relevant_ids)[0]], "negatives": hard_negatives})
    return examples

If you have an OpenSearch cluster, the tuning sample repo automates this:
# Index your corpus into OpenSearch
python index_corpus.py \
--corpus your_corpus.jsonl \
--index my-index \
--opensearch_host localhost
# Mine hard negatives using BM25 retrieval
python prepare_hard_negatives.py \
--queries your_queries.jsonl \
--qrels your_qrels.tsv \
--index my-index \
--n_negatives 7 \
--output data/hard_negatives.jsonl

Always track average active dimensions during training. Target ranges:
| Component | Target active dims | If too many | If too few |
|---|---|---|---|
| Documents | 100–300 | Increase λ_d | Decrease λ_d |
| Queries (neural) | 10–50 | Increase λ_q | Decrease λ_q |
| Queries (IDF lookup) | = query length | N/A | N/A |
def log_sparsity(model, sample_texts: list[str], prefix=""):
    embeddings = model.encode(sample_texts)
    active_dims = [(emb > 0).sum() for emb in embeddings]
    print(f"{prefix} active dims: mean={sum(active_dims)/len(active_dims):.1f}, "
          f"min={min(active_dims)}, max={max(active_dims)}")

| Parameter | Recommended range | Notes |
|---|---|---|
| `learning_rate` | 1e-5 to 5e-5 | Start at 2e-5 |
| `per_device_batch_size` | 16–64 | Bigger → harder negatives for InfoNCE |
| `λ_q` (query FLOPS) | 1e-5 to 1e-4 | Lower than λ_d; set 0 for inference-free |
| `λ_d` (doc FLOPS) | 1e-5 to 5e-2 | Critical — tune first |
| `warmup_ratio` | 0.05–0.1 | Standard |
| `weight_decay` | 0.01 | Standard |
| `max_seq_length` | 512 for docs, 64–128 for queries | |
| `num_train_epochs` | 1–5 | 1 usually sufficient for fine-tuning |
| Symptom | Cause | Fix |
|---|---|---|
| Embeddings all zeros | λ too high or LR too high | Reduce λ and/or LR |
| No sparsity (all dims active) | λ too low | Increase λ_d |
| Training loss not decreasing | LR too low or bad data | Check data format, increase LR |
| Good train loss, poor BEIR | Overfitting to MS MARCO | Add domain data, reduce epochs |
| Query vecs denser than doc vecs | Normal — queries shorter | Expected behaviour |
Never train a sparse encoder from a plain BERT/DistilBERT checkpoint without pre-training. The MLM head must first learn to produce meaningful sparse activations. Training from scratch requires:
- ≥500K (query, document) pairs
- Multi-phase lambda schedule
- Likely weeks of training on 8+ GPUs
Always start from a pre-trained sparse checkpoint for domain fine-tuning. Even naver/splade-cocondenser-selfdistil or opensearch-neural-sparse-encoding-doc-v2-distill will adapt well with just a few thousand domain examples.
| Goal | Start from | Training approach | Approximate training time |
|---|---|---|---|
| Best retrieval quality (symmetric) | `naver/splade-v3` | NAVER repo + KL-div + MarginMSE | Days on 8 GPUs |
| OpenSearch production (fastest query) | `opensearch-neural-sparse-encoding-doc-v2-distill` | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Simplest domain adaptation | `doc-v2-distill` | Sentence Transformers SpladeLoss + MNR | Minutes on 1 GPU |
| Best OpenSearch quality | `opensearch-neural-sparse-encoding-doc-v3-gte` | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Multilingual | `opensearch-neural-sparse-encoding-multilingual-v1` | MIRACL data + KD approach | Hours on 4 GPUs |
| Sparsify an existing dense model | your dense model + `SparseAutoEncoder` | Sentence Transformers CSRLoss | Minutes on 1 GPU |
| Research / ablation studies | `naver/splade-cocondenser-selfdistil` | NAVER SPLADE repo (Hydra configs) | Configurable |
- BM25 — Robertson & Zaragoza (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR.
- DeepImpact — Mallia et al. (2021). Learning Passage Impacts for Inverted Indexes. SIGIR 2021. arXiv:2104.12016
- SPLADE v1 — Formal et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720
- SPLADE v2 — Formal et al. (2021). SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. NeurIPS 2021 Workshop. arXiv:2109.10086
- BEIR Benchmark — Thakur et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021. arXiv:2104.08663
- SPLADE++ / SelfDistil / EnsembleDistil — Formal et al. (2022). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. SIGIR 2022. arXiv:2205.04733
- CoCondenser — Gao & Callan (2022). Unsupervised Corpus Aware Language Model Pre-Training for Dense Passage Retrieval. ACL 2022. arXiv:2108.05540
- SPLADE-v3 — Lassance & Formal (2024). SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789
- CSR — Yang et al. (2024). CSR: Cascade Sparse Retrieval for Open-Domain Question Answering. arXiv:2404.12153
- OpenSearch Inference-Free Neural Sparse (v2) — Yang et al. (2024). Inference-free Sparse Retrieval via IDF-Aware Ensemble Distillation. arXiv:2411.04403
- SPLADE-v3 + L0 — Lassance et al. (2025). Efficient Sparse Retrieval with L0 Regularization. SIGIR 2025. arXiv:2504.14839
| Resource | URL |
|---|---|
| NAVER SPLADE repo | https://github.com/naver/splade |
| OpenSearch sparse tuning sample | https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample |
| Sentence Transformers sparse training docs | https://sbert.net/docs/sparse_encoder/training_overview.html |
| Sentence Transformers sparse training examples | https://github.com/UKPLab/sentence-transformers/tree/master/examples/sparse_encoder |
| BEIR benchmark | https://github.com/beir-cellar/beir |
| Resource | URL |
|---|---|
| HuggingFace blog: Train Sparse Encoders (ST v5) | https://huggingface.co/blog/train-sparse-encoder |
| OpenSearch neural sparse documentation | https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/ |
| Sentence Transformers v5 release announcement | https://huggingface.co/blog/sentence-transformers-v5 |
Guide compiled from NAVER SPLADE papers, Amazon OpenSearch Neural Sparse papers, and Sentence Transformers v5 documentation. All code is written for Python 3.9+ and tested against sentence-transformers>=5.0.
Context: A production OpenSearch 2.16 cluster indexed ~55M document chunks (50.6M SEC filings + 4.4M earnings call transcripts from CapIQ and UK Companies House). The existing sparse encoder used the stock MS MARCO IDF table. This section documents the investigation that proved domain IDF recomputation was necessary before fine-tuning.
desia-resource-chunks-v2 55,073,414 docs 502.7 GB
├── filing 50,637,052 (10-K, 10-Q, 8-K, Companies House)
└── transcript 4,436,362 (earnings calls)
Field schema (relevant fields):
chunk_text → text
chunk_text_sparse_embeddings → rank_features (sparse vector)
chunk_context_sparse_embeddings → rank_features
resource_company_name → text
resource_integration_element_type → text ("filing" | "transcript")
resource_source_integration_code_name → text ("data-provider-capiq" | "data-provider-gov.uk-companyhouse")
The inference-free setup was already in place:
- Query time: `amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1` (DEPLOYED) — IDF lookup, zero inference
- Document encoding: custom neural encoder (remote HuggingFace Inference Endpoint, L4 GPU) — connector type `doc_sparse_encode`
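At query time the deployed tokenizer model is referenced from a `neural_sparse` clause. A sketch of the request body builder (the model ID is a placeholder for your deployed tokenizer model; with a tokenizer-type model, OpenSearch expands the query via IDF lookup only, with no neural inference on the query path):

```python
def neural_sparse_query(query_text: str, model_id: str) -> dict:
    """Build a neural_sparse search body for the inference-free setup."""
    return {
        "query": {
            "neural_sparse": {
                "chunk_text_sparse_embeddings": {
                    "query_text": query_text,
                    "model_id": model_id,  # placeholder: your deployed model ID
                }
            }
        }
    }

# POST this body to {OS_URL}/{INDEX}/_search, e.g. with requests as in the
# scroll script in this section.
body = neural_sparse_query("goodwill impairment disclosures", "YOUR_MODEL_ID")
```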
The script below scrolls the corpus, tokenizes with the model's own BERT tokenizer, and computes smoothed IDF. Run with uv:
uv run \
--with requests \
--with transformers \
--with huggingface_hub \
--with tqdm \
  python domain_idf_analysis.py
"""
domain_idf_analysis.py
Scrolls 100K chunks (50K filings + 50K transcripts) from an OpenSearch index,
computes smoothed IDF, saves to domain_idf.json, and compares vs MS MARCO baseline.
Replace OS_URL / OS_AUTH with your own cluster credentials.
"""
import json
import math
import sys
from collections import Counter
import requests
import urllib3
from huggingface_hub import hf_hub_download
from tqdm import tqdm
from transformers import AutoTokenizer
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# ── Config ─────────────────────────────────────────────────────────────────────
OS_URL = "https://localhost:9200"
OS_AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD") # ← replace
INDEX = "your-chunks-index" # ← replace
MODEL_ID = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
N_EACH = 50_000 # 50K filings + 50K transcripts = 100K total
BATCH = 500
MIN_LEN = 50 # skip near-empty / image-caption chunks
OUT_IDF = "domain_idf.json"
def scroll_texts(doc_type: str, type_field: str, n: int) -> list[str]:
"""Scroll N chunk_text docs of a given type from OpenSearch."""
texts: list[str] = []
resp = requests.post(
f"{OS_URL}/{INDEX}/_search?scroll=3m",
auth=OS_AUTH, verify=False,
json={
"size": BATCH,
"_source": ["chunk_text"],
"query": {"term": {type_field: doc_type}},
},
)
resp.raise_for_status()
data = resp.json()
scroll_id = data["_scroll_id"]
hits = data["hits"]["hits"]
with tqdm(total=n, desc=f" {doc_type:12s}", unit="chunks", ncols=80) as pbar:
while hits and len(texts) < n:
            for hit in hits:
                text = hit["_source"].get("chunk_text", "")
                if len(text.strip()) >= MIN_LEN:
                    texts.append(text)
                    pbar.update(1)  # one tick per kept chunk
                if len(texts) >= n:
                    break
if len(texts) >= n:
break
resp = requests.post(
f"{OS_URL}/_search/scroll",
auth=OS_AUTH, verify=False,
json={"scroll": "3m", "scroll_id": scroll_id},
)
resp.raise_for_status()
data = resp.json()
scroll_id = data.get("_scroll_id", scroll_id)
hits = data["hits"]["hits"]
requests.delete(
f"{OS_URL}/_search/scroll",
auth=OS_AUTH, verify=False,
json={"scroll_id": scroll_id},
)
return texts[:n]
def compute_idf(texts: list[str], tokenizer) -> tuple[dict, int, Counter]:
"""
Smoothed IDF: log((N+1) / (df+1)) + 1
Keys are string token IDs — compatible with OpenSearch inference-free models.
"""
df: Counter = Counter()
N = 0
for text in tqdm(texts, desc=" tokenizing ", unit="chunks", ncols=80):
ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
for tid in set(ids): # count each token once per document
df[tid] += 1
N += 1
idf = {
str(tid): math.log((N + 1) / (count + 1)) + 1
for tid, count in df.items()
}
return idf, N, df
def compare_vs_msmarco(domain_idf: dict, tokenizer):
"""
Compare domain IDF against MS MARCO baseline.
⚠️ Key format note:
- MS MARCO idf.json uses decoded token STRINGS as keys (e.g. "the", "consolidated")
- Domain IDF uses integer token IDs as string keys (e.g. "1996", "12088")
- Decode domain keys before comparing.
"""
idf_path = hf_hub_download(MODEL_ID, "idf.json")
with open(idf_path) as f:
msmarco_idf: dict = json.load(f)
# Decode domain IDF to string keys for comparison
domain_by_str = {}
for tid_str, idf_val in domain_idf.items():
tok = tokenizer.decode([int(tid_str)]).strip()
if tok:
domain_by_str[tok] = idf_val
common = set(domain_by_str) & set(msmarco_idf)
deltas = [
(tok, domain_by_str[tok], msmarco_idf[tok], domain_by_str[tok] - msmarco_idf[tok])
for tok in common if len(tok.strip()) >= 2
]
deltas.sort(key=lambda x: x[3])
import statistics
abs_d = [abs(d) for _, _, _, d in deltas]
print(f"\nTokens compared : {len(deltas):,}")
print(f"Mean |delta| : {statistics.mean(abs_d):.4f}")
print(f"Stdev |delta| : {statistics.stdev(abs_d):.4f}")
print(f"|delta| > 1.0 : {sum(1 for d in abs_d if d > 1.0):,} ({sum(1 for d in abs_d if d > 1.0)/len(abs_d)*100:.0f}%)")
print(f"|delta| > 2.0 : {sum(1 for d in abs_d if d > 2.0):,} ({sum(1 for d in abs_d if d > 2.0)/len(abs_d)*100:.0f}%)")
fmt = "{:<25} {:>11} {:>13} {:>10}"
for label, rows in [
("UNDERWEIGHTED in MS MARCO (more common in domain)", deltas[:30]),
("OVERWEIGHTED in MS MARCO (rarer in domain)", list(reversed(deltas[-30:]))),
]:
print(f"\n{'='*65}\n{label}\n{'='*65}")
print(fmt.format("Token", "Domain IDF", "MS MARCO IDF", "Delta"))
print("-" * 63)
for tok, d, m, delta in rows:
print(fmt.format(tok, f"{d:.3f}", f"{m:.3f}", f"{delta:+.3f}"))
def main():
    print("[1/4] MS MARCO IDF ... (downloaded lazily inside compare_vs_msmarco)")
print("[2/4] Loading tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(f"[3/4] Scrolling {N_EACH:,} × 2 chunks ...")
# Adapt the type_field and doc_type values to your schema
filing_texts = scroll_texts("filing", "resource_integration_element_type.keyword", N_EACH)
transcript_texts = scroll_texts("transcript", "resource_integration_element_type.keyword", N_EACH)
all_texts = filing_texts + transcript_texts
print(f" collected {len(all_texts):,} total chunks")
print("[4/4] Computing domain IDF ...")
domain_idf, N, df = compute_idf(all_texts, tokenizer)
print(f" {len(domain_idf):,} unique tokens | {N:,} documents")
with open(OUT_IDF, "w") as f:
json.dump(domain_idf, f)
print(f" saved → {OUT_IDF}")
compare_vs_msmarco(domain_idf, tokenizer)
if __name__ == "__main__":
    main()

IDF drift summary:
Tokens in domain sample : 25,986
Mean |delta| : 1.30 ← on log scale; substantial
Stdev |delta| : 1.00
|delta| > 1.0 : 14,083 (54% of vocabulary!)
|delta| > 2.0 : 5,779 (22% of vocabulary)
Tokens MS MARCO underweights for this domain (common in filings, rare in web search):
| Token | Domain IDF | MS MARCO IDF | Delta | Why it matters |
|---|---|---|---|---|
| 2021 | 2.02 | 9.31 | −7.29 | Appears in 36% of docs (fiscal year references) |
| 2020 | 2.25 | 7.66 | −5.41 | Same |
| 202 | 1.87 | 7.09 | −5.23 | Subword prefix for 202x years, extremely common |
| ##gence | 2.53 | 8.59 | −6.06 | Suffix: "negligence", "intelligence", "emergence" |
| consolidated | 3.21 | 8.20 | −4.98 | Core accounting term |
| subsidiaries | 4.03 | 8.49 | −4.46 | Core corporate structure term |
| commitments | 4.63 | 9.02 | −4.40 | Balance sheet item |
| crore | 5.29 | 9.67 | −4.38 | Indian rupee unit (international filings) |
| societe | 6.43 | 10.83 | −4.40 | French company names (Companies House) |
| grupo | 6.27 | 10.63 | −4.36 | Spanish company names |
| ##gl | 2.51 | 6.83 | −4.32 | Subword: "global", "single", "struggle" |
Tokens MS MARCO overweights (virtually absent from financial filings):
| Token | Domain IDF | MS MARCO IDF | Delta |
|---|---|---|---|
| noun | 11.82 | 5.26 | +6.57 |
| synonym | 11.82 | 5.78 | +6.04 |
| wikipedia | 11.82 | 6.02 | +5.80 |
| garlic | 11.82 | 6.13 | +5.69 |
| pronunciation | 11.82 | 6.20 | +5.62 |
| stomach | 10.21 | 5.24 | +4.97 |
| puppy | 11.82 | 7.03 | +4.79 |
| medieval | 11.82 | 7.16 | +4.67 |
1. The IDF format mismatch gotcha.
MS MARCO idf.json uses decoded token strings as keys ("the", "consolidated"). If you compute domain IDF using integer token IDs as keys (e.g. str(token_id) → "1996"), a naive set(domain) & set(msmarco) intersection returns near-zero overlap (~337 coincidental numeric matches). Always decode token IDs before comparing:
# WRONG: comparing int-string keys vs text-string keys
common = set(domain_idf_by_id.keys()) & set(msmarco_idf.keys()) # ~337 matches
# RIGHT: decode domain keys first
domain_by_str = {
tokenizer.decode([int(tid)]).strip(): idf_val
for tid, idf_val in domain_idf_by_id.items()
}
common = set(domain_by_str.keys()) & set(msmarco_idf.keys())  # ~25,000 matches

The domain idf.json you save and ship with the model should keep integer token ID keys — that is what the OpenSearch inference-free tokenizer model expects.
2. 54% of the vocabulary has IDF drift > 1.0 (log scale).
This is not marginal. Because the IDF here uses the natural log, a delta of 1.0 corresponds to roughly an e ≈ 2.7× difference in relative document frequency. With over half the vocabulary miscalibrated, the stock MS MARCO IDF actively harms retrieval quality on financial text.
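To make the log-scale claim concrete, here is the arithmetic with the smoothed-IDF formula from the script (the corpus size and document frequencies are illustrative):

```python
import math

def smoothed_idf(df: int, n: int) -> float:
    # Same formula as compute_idf above: log((N+1) / (df+1)) + 1
    return math.log((n + 1) / (df + 1)) + 1

N = 100_000
# A token in ~1,000 docs vs one in ~2,718 docs (an e-fold, i.e. ~2.7x
# difference in document frequency) differs by almost exactly 1.0 in IDF:
delta = smoothed_idf(1_000, N) - smoothed_idf(2_718, N)
print(f"{delta:.2f}")  # ~1.00
```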
3. The domain "stop words" are different from general English stop words.
In financial filings, financial (IDF 2.03), market (IDF 1.92), company (IDF 2.17), year (IDF 2.03), inc (IDF 2.12) appear in 30–40% of all chunks. A general IDF table treats them as informative; the domain IDF correctly suppresses them.
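The document-frequency fractions quoted above can be recovered by inverting the smoothed-IDF formula; a quick sanity check, assuming the 100K-chunk sample size:

```python
import math

def df_fraction(idf: float, n: int = 100_000) -> float:
    # Invert idf = ln((N+1)/(df+1)) + 1  =>  df = (N+1) * e^(1-idf) - 1
    return ((n + 1) * math.exp(1 - idf) - 1) / n

# IDF values taken from the domain table above
for token, idf in [("financial", 2.03), ("market", 1.92), ("company", 2.17)]:
    print(f"{token:10s} ~{df_fraction(idf):.0%} of chunks")
```

For these IDF values the recovered fractions all fall in the 30–40% range quoted above.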
4. Scale matters for low-frequency financial jargon.
From 100K chunks, genuinely rare terms like EBITDA, amortization, diluted may still have unreliable IDF estimates. Scale up to 500K–1M chunks for stable estimates on domain-specific tail vocabulary. The scroll-and-tokenize approach is linear — just increase N_EACH.
With a domain idf.json in hand:
- Register the updated IDF with your inference-free tokenizer model in OpenSearch
- Generate synthetic training pairs from your corpus using an LLM (GPL approach) — see Section 6
- Fine-tune the document encoder with
SpladeLoss+ domain IDF, starting from a pre-trained sparse checkpoint - Re-index documents (or serve both old and new model during transition, comparing NDCG on a held-out query set)