Fine-Tuning Sparse Encoders for Neural Sparse Retrieval: Complete Reproducible Guide (SPLADE, OpenSearch Neural Sparse, CSR)
Last updated: 2026-02
Scope: Covers SPLADE / SPLADE++ / SPLADE-v3, OpenSearch Neural Sparse v1–v3 (including inference-free), and CSR (Contrastive Sparse Representation). Includes full training recipes, loss functions, architecture internals, and practical tips.
- What Are Sparse Encoders?
- Model Families
- Architecture Internals
- Training Objectives & Loss Functions
- Training Data
- Fine-Tuning Recipes
- Inference-Free (Asymmetric) Fine-Tuning
- Hard Negative Mining
- Practical Tips & Hyperparameter Guide
- Decision Guide
- References
Sparse encoders map text to high-dimensional vectors where most values are zero. Each non-zero dimension corresponds to a vocabulary token; its weight represents that token's importance in the text.
"neural retrieval" → {retrieval: 2.1, neural: 1.9, search: 0.8, information: 0.3, ...}
(vocab_size = 30,522; ~95-99% of dims are zero)
Why sparse over dense?
| Property | Dense (e.g. SBERT) | Sparse (SPLADE) |
|---|---|---|
| Vector dim | 768 | 30,522 |
| Non-zero dims | all | ~100–300 |
| Inverted index compatible | ✗ | ✓ |
| BM25-level latency | ✗ | ✓ |
| Semantic expansion | implicit | explicit (MLM head) |
| Interpretable | ✗ | ✓ (readable tokens) |
| BEIR avg NDCG@10 | ~0.49–0.52 | ~0.50–0.55 |
The killer feature: sparse vectors are compatible with Lucene/OpenSearch inverted indexes. Retrieval is a standard dot-product scan — the same data structure BM25 uses — giving near-BM25 latency at neural relevance quality.
The magic comes from the MLM head: each input token position projects onto the entire vocabulary, enabling the model to activate semantically related terms that never appeared in the original text (query/document expansion).
Symmetric models — both query and document run through a full neural encoder.
| Model | HuggingFace ID | Params | MS MARCO MRR@10 | BEIR Avg NDCG@10 |
|---|---|---|---|---|
| SPLADE-v2 | `naver/splade-v2-max` | 110M | 36.8 | 0.497 |
| SPLADE-v2-distill | `naver/splade-v2-distilsplade-max` | 66M | 36.1 | — |
| SPLADE++ SelfDistil | `naver/splade-cocondenser-selfdistil` | 110M | 37.6 | 0.510 |
| SPLADE++ EnsembleDistil | `naver/splade-cocondenser-ensembledistil` | 110M | 38.3 | 0.524 |
| SPLADE-v3 | `naver/splade-v3` | 110M | 40.2 | ~0.53 |
| SPLADE-v3-Doc | `naver/splade-v3-doc` | 110M | — | ~0.54 |
Key papers:
- SPLADE v1: Formal et al., 2021 — SIGIR 2021
- SPLADE v2: Formal et al., 2021b — NeurIPS 2021 Workshop
- SPLADE++ / SelfDistil / EnsembleDistil: Formal et al., 2022 — SIGIR 2022
- SPLADE-v3: Lassance & Formal, 2024
- SPLADE-v3 with L0 regularization: Lassance et al., 2025 — SIGIR 2025
Asymmetric models — document is neural; query uses IDF lookup (no inference at search time).
| Model | HuggingFace ID | Params | Base | BEIR NDCG@10 | Avg FLOPs |
|---|---|---|---|---|---|
| v1 | `opensearch-project/opensearch-neural-sparse-encoding-v1` | 133M | BERT-base | 0.524 | 11.4 |
| doc-v2-mini | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini` | 22M | MiniLM | 0.497 | 0.7 |
| doc-v2-distill | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill` | 67M | DistilBERT | 0.504 | 1.8 |
| doc-v2 | `opensearch-project/opensearch-neural-sparse-encoding-doc-v2` | 133M | BERT-base | 0.515 | 3.3 |
| doc-v3-distill | `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill` | 67M | DistilBERT | 0.517 | 1.8 |
| doc-v3-gte | `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte` | 133M | GTE-base | 0.546 | ~2 |
| multilingual-v1 | `opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1` | 160M | mBERT | 0.629 (MIRACL) | 1.3 |
Key papers:
- OpenSearch inference-free (v2): Yang et al., 2024
- BEIR benchmark used for evaluation: Thakur et al., 2021
CSR (Contrastive Sparse Representation) adds a sparse autoencoder on top of an existing dense model, sparsifying its output. Useful when you already have a fine-tuned dense model and want sparse retrieval without retraining from scratch.
- Paper: Yang et al., 2024 — CSR
- Available in Sentence Transformers v5 via `SparseAutoEncoder` + `CSRLoss`
Input tokens: ["what", "is", "SPLADE", "?"]
↓
BertForMaskedLM (or DistilBertForMaskedLM)
↓
token-level logits: shape [batch, seq_len, vocab_size] # e.g. [1, 4, 30522]
↓
max pooling over seq_len dimension → [batch, vocab_size] # collapse to one vector
↓
log(1 + ReLU(x)) # activation + log-saturation
↓
sparse vector: [batch, vocab_size] # ~95-99% zeros
Why max pooling? Aggregates the "best evidence" for each vocabulary token across all input positions. Each position can activate vocabulary tokens it didn't literally contain — this is the expansion.
Why log(1 + ReLU(x))? ReLU kills negatives (forcing sparsity). Log saturates large values (preventing a few dimensions from dominating), providing a soft upper bound analogous to BM25's term frequency saturation.
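The pooling-and-activation step is small enough to sketch directly. Below is a minimal numpy version (the helper name `splade_pool` is hypothetical; in a real model the logits come from a `BertForMaskedLM` forward pass):

```python
import numpy as np

def splade_pool(logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """logits: [batch, seq_len, vocab] MLM logits; attention_mask: [batch, seq_len].
    Returns [batch, vocab]: max over positions of log(1 + ReLU(x))."""
    w = np.log1p(np.maximum(logits, 0.0))   # log(1 + ReLU(x)), elementwise
    w = w * attention_mask[..., None]       # zero out padding positions
    return w.max(axis=1)                    # max pool over seq_len
```

Negative logits map to exactly zero, which is where the sparsity comes from: only vocabulary tokens some position positively activates survive the pooling.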
SPLADE-v3 double-log activation:
# v2: log(1 + ReLU(x))
# v3: log(1 + log(1 + ReLU(x)))  ← stronger sparsification, less aggressive FLOPS penalty needed

At query time, no neural inference is run:
def encode_query_inference_free(query_tokens: list[int], idf: dict[int, float]) -> dict[int, float]:
    """
    idf.json: {token_id_str: idf_value} pre-computed from training corpus (MS MARCO).
    Returns sparse vector as {token_id: weight} dict.
    """
    sparse = {}
    for token_id in set(query_tokens):  # deduplicate
        if token_id in idf:
            sparse[token_id] = idf[token_id]
        else:
            sparse[token_id] = 1.0  # default IDF for unknown tokens
    return sparse

The idf.json file is shipped alongside each v2/v3 model on HuggingFace. You can recompute it from your own corpus for domain adaptation.
Dense model (frozen or fine-tuned) → dense vector [batch, 768]
↓
SparseAutoEncoder:
encoder: Linear(768 → 4*768) + ReLU + TopK(k=256) ← only top-k activations kept
decoder: Linear(4*768 → 768) ← reconstruction loss
↓
sparse vector: [batch, 3072] (4x expansion, mostly zeros)
The core mechanism that enforces sparsity. Penalizes tokens that activate with high magnitude on average across the training batch:
FLOPS(X) = Σ_{j ∈ vocab} [ (1/N) Σ_{i=1}^{N} w_j(x_i) ]^2
where:
N = batch size
w_j(x_i) = weight of vocab token j in sparse vector of example i
Intuition: If token "the" activates in every document, its average activation is high → large penalty → model learns to suppress it. Rare, informative tokens activate infrequently → low average → preserved.
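The formula is easy to compute directly from a batch of sparse vectors; a minimal numpy sketch (the helper name `flops_penalty` is hypothetical):

```python
import numpy as np

def flops_penalty(W: np.ndarray) -> float:
    """W: [batch, vocab] batch of sparse vectors.
    FLOPS = sum over vocab of (mean activation of that token across the batch)^2."""
    return float((W.mean(axis=0) ** 2).sum())
```

A token like "the" that fires in every row keeps a high mean and pays the full squared penalty; a rare token that fires in one row out of N contributes only (w/N)².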
Combined training loss:
L_total = L_rank + λ_q · FLOPS(queries) + λ_d · FLOPS(docs)
Typical λ values:
| Framework / Config | λ_q | λ_d |
|---|---|---|
| SPLADE original | 0.06 | 0.02 |
| Sentence Transformers (SpladeLoss) | 5e-5 | 3e-5 |
| OpenSearch tuning sample (InfoNCE) | 0.05 | 0.05 |
| OpenSearch tuning sample (KD) | — | 0.002 |
| OpenSearch v2 pre-training phase | — | 1e-7 |
| OpenSearch v2 fine-tuning phase | — | 0.02 |
Warning: λ too high → sparse vectors collapse to near-zero (no retrieval signal). λ too low → dense-like vectors (slow index, no sparsity benefit). Always monitor average active dims during training.
Two-phase lambda schedule (OpenSearch v2 approach, strongly recommended for training from scratch):
Phase 1 (large corpus, weak labels): λ_d = 1e-7 → focus on learning relevance
Phase 2 (MS MARCO / domain data): λ_d = 0.02 → enforce sparsity
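On top of the phase schedule, the original SPLADE training code also ramps λ up quadratically over the first ~50K steps, so the model learns relevance before the sparsity pressure fully applies. A sketch of that ramp (hypothetical helper name, assuming the 50K-step warmup from the SPLADE papers):

```python
def lambda_at_step(step: int, lambda_target: float, warmup_steps: int = 50_000) -> float:
    """Quadratic ramp from 0 to lambda_target over warmup_steps, then constant."""
    if step >= warmup_steps:
        return lambda_target
    return lambda_target * (step / warmup_steps) ** 2
```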
Instead of raw dot product, scores are IDF-weighted:
score(q, d) = Σ_{t ∈ vocab} idf(t) · q_t · d_t
Why? IDF(t) is large for rare/informative tokens. Multiplying by IDF amplifies the gradient signal for these tokens in the ranking loss, teaching the model to preserve them. Common tokens (low IDF) are simultaneously pushed down by FLOPS regularization. The two forces complement each other.
IDF is pre-computed from MS MARCO corpus. Unseen tokens default to idf = 1.0.
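With dense vocab-sized arrays the IDF-weighted score is a one-liner; a minimal numpy sketch (hypothetical helper name):

```python
import numpy as np

def idf_weighted_score(q: np.ndarray, d: np.ndarray, idf: np.ndarray) -> float:
    """q, d, idf are all [vocab]-sized; unseen tokens carry idf = 1.0."""
    return float((idf * q * d).sum())
```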
InfoNCE / Multiple Negatives Ranking Loss:
L_InfoNCE = -log( exp(score(q, p) / τ) / Σ_j exp(score(q, d_j) / τ) )
where:
p = positive document
d_j = all docs in batch (positive + in-batch negatives + mined hard negatives)
τ = temperature (default 0.02 in Sentence Transformers)
In Sentence Transformers: SparseMultipleNegativesRankingLoss
Requirements: batch size ≥ 16 (more in-batch negatives = harder training signal). Use BatchSamplers.NO_DUPLICATES to ensure each query appears once per batch.
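For intuition, here is the loss for a single query against its candidate list, written out in numpy (hypothetical helper name; the positive sits at index 0, matching the formula above):

```python
import numpy as np

def info_nce(scores, positive_idx=0, tau=0.02):
    """scores: similarities of one query against all candidate docs in the batch."""
    logits = np.asarray(scores, dtype=float) / tau
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[positive_idx]))
```

With uniform scores the loss is log(n_docs); it falls toward zero as the positive pulls ahead of the negatives.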
MarginMSE (used in SPLADE-v3 in combination with KL-div):
L_MarginMSE = MSE( score(q,p) - score(q,n), teacher_score(q,p) - teacher_score(q,n) )
Distillation consistently outperforms pointwise/pairwise training. The teacher (cross-encoder or ensemble retriever) provides soft scores for a list of candidate documents per query.
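The per-triple term is just a squared margin difference; a minimal sketch (hypothetical helper name — the full loss averages this over all (q, p, n) triples in the batch):

```python
def margin_mse(s_qp: float, s_qn: float, t_qp: float, t_qn: float) -> float:
    """Squared error between the student margin and the teacher margin."""
    return ((s_qp - s_qn) - (t_qp - t_qn)) ** 2
```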
KL Divergence Distillation:
L_KL = KL( softmax(teacher_scores / τ_t) || softmax(student_scores / τ_s) )
In Sentence Transformers: SparseDistillKLDivLoss
Data format:
{"query": "what is SPLADE?",
"docs": ["SPLADE is a sparse...", "Dense models use...", "BM25 is a..."],
"scores": [9.2, 1.1, 4.5]}OpenSearch's key contribution — avoid expensive cross-encoder inference by combining two cheap retrievers:
Teacher 1 (dense): Alibaba-NLP/gte-large-en-v1.5
Teacher 2 (sparse): opensearch-project/opensearch-neural-sparse-encoding-v1
For each query:
scores_dense = dense_teacher.score(query, [doc_1, ..., doc_N])
scores_sparse = sparse_teacher.score(query, [doc_1, ..., doc_N])
norm_dense = min_max_scale(scores_dense) # → [0, 1]
norm_sparse = min_max_scale(scores_sparse) # → [0, 1]
final_score = (norm_dense + norm_sparse) / 2
Why it works: Dense captures semantic similarity; sparse captures exact lexical matches. Their combination is complementary and often matches cross-encoder quality on retrieval tasks while being ~10x cheaper to run.
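The normalize-and-average step above can be sketched in a few lines of numpy (hypothetical helper names; scaling is done per query over that query's candidate list):

```python
import numpy as np

def min_max_scale(x) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def ensemble_teacher_scores(scores_dense, scores_sparse) -> np.ndarray:
    """Average of the two min-max-normalized teacher score lists → soft labels in [0, 1]."""
    return (min_max_scale(scores_dense) + min_max_scale(scores_sparse)) / 2
```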
Any dataset of (query, relevant_document) pairs. Even 1,000–5,000 domain-specific pairs can meaningfully improve a pre-trained sparse encoder for a specific domain when fine-tuning (not training from scratch).
MS MARCO (primary, high quality):
- 502,548 training queries
- 8.84M passages
- Relevance annotations from Bing click logs
Weak supervision mix (5.36M additional queries across 14 datasets):
| Dataset | Source |
|---|---|
| `eli5_question_answer` | Reddit ELI5 |
| `squad_pairs` | SQuAD reading comprehension |
| `WikiAnswers` | WikiAnswers duplicate questions |
| `yahoo_answers_*` | Yahoo Answers |
| `gooaq_pairs` | Google autocomplete QA |
| `stackexchange_duplicate_questions_*` | StackExchange |
| `wikihow` | WikiHow articles |
| `S2ORC_title_abstract` | Semantic Scholar papers |
| `searchQA_top5_snippets` | Jeopardy-style QA |
All available via HuggingFace Datasets or the BEIR repository.
For InfoNCE / MNR Loss (JSONL):
{"query": "what causes inflation?", "positive": "Inflation is caused by..."}
{"query": "who wrote Hamlet?", "positive": "Hamlet was written by Shakespeare...", "negatives": ["unrelated doc 1", "unrelated doc 2"]}For Distillation (KL-div) (JSONL):
{"query": "what causes inflation?",
"docs": ["Inflation is caused by...", "Shakespeare wrote...", "The sun is..."],
"scores": [9.1, 0.2, 0.1]}For Sentence Transformers (using datasets library):
from datasets import Dataset
# Simplest format
dataset = Dataset.from_dict({
"query": ["what causes inflation?", "who wrote Hamlet?"],
"positive": ["Inflation is caused by...", "Hamlet was written by Shakespeare..."],
})
# With hard negatives (better)
dataset = Dataset.from_dict({
"query": ["what causes inflation?"],
"positive": ["Inflation is caused by..."],
"negative": ["An unrelated but superficially similar document..."],
})

Best for: Domain adaptation, simplest setup, no external dependencies.
pip install -U sentence-transformers datasets

from datasets import Dataset
from sentence_transformers import (
SparseEncoder,
SparseEncoderTrainer,
SparseEncoderTrainingArguments,
SparseEncoderModelCardData,
)
from sentence_transformers.sparse_encoder.losses import (
SpladeLoss,
SparseMultipleNegativesRankingLoss,
)
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.training_args import BatchSamplers
# ── 1. Load a pre-trained sparse model ────────────────────────────────────────
# Don't train from scratch — always start from a strong pre-trained checkpoint.
# Good starting points:
# "naver/splade-cocondenser-selfdistil" (symmetric SPLADE)
# "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill" (inference-free)
# "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte" (best OpenSearch)
model = SparseEncoder(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
model_card_data=SparseEncoderModelCardData(
language="en",
license="apache-2.0",
model_name="My Domain Sparse Encoder",
),
)
# ── 2. Prepare domain data ─────────────────────────────────────────────────────
train_dataset = Dataset.from_dict({
"query": ["your domain query 1", "your domain query 2"],
"positive": ["relevant document 1", "relevant document 2"],
# Optional: add "negative" key for hard negatives (strongly recommended)
})
# ── 3. Define loss with FLOPS regularization ───────────────────────────────────
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=5e-5, # tune between 0 and 1e-4
document_regularizer_weight=3e-5, # tune between 0 and 1e-3
)
# ── 4. Training arguments ──────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
output_dir="models/my-domain-sparse-encoder",
num_train_epochs=1,
per_device_train_batch_size=16, # larger = more in-batch negatives = better
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=True, # or bf16=True on Ampere+
batch_sampler=BatchSamplers.NO_DUPLICATES, # required for MNR loss
eval_strategy="steps",
eval_steps=500,
save_steps=500,
logging_steps=50,
load_best_model_at_end=True,
metric_for_best_model="NanoBEIR_mean_dot_ndcg@10",
)
# ── 5. Optional: zero-shot evaluator on standard BEIR subsets ─────────────────
evaluator = SparseNanoBEIREvaluator(
dataset_names=["nfcorpus", "scifact", "fiqa"],
batch_size=16,
)
# ── 6. Train ───────────────────────────────────────────────────────────────────
trainer = SparseEncoderTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
evaluator=evaluator,
)
trainer.train()
model.save_pretrained("models/my-domain-sparse-encoder/final")
model.push_to_hub("my-org/my-domain-sparse-encoder")  # optional

Requires teacher scores pre-computed (or computed on-the-fly if you have GPU budget).
from sentence_transformers.sparse_encoder.losses import SparseDistillKLDivLoss
# Dataset needs columns: query + doc_0, doc_1, ..., doc_N + score_0, score_1, ..., score_N
# OR use the InputExample format with {"query", "docs", "scores"}
# Sentence Transformers >= 5.0 handles both.
loss = SpladeLoss(
model=model,
loss=SparseDistillKLDivLoss(model=model),
query_regularizer_weight=5e-5,
document_regularizer_weight=3e-5,
)
# Everything else identical to A1

# Check which tokens activate and their weights
sentences = ["what causes inflation?", "neural sparse retrieval with SPLADE"]
embeddings = model.encode(sentences, output_value="sentence_embedding")
for sent, emb in zip(sentences, embeddings):
# decode top-20 active dimensions back to readable tokens
decoded = model.decode(emb, top_k=20)
print(f"\n{sent}")
print(decoded)
print(f"Active dims: {(emb > 0).sum()}")from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
# Install: pip install beir
dataset = "nfcorpus"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
# Encode with your model
doc_embeddings = model.encode_corpus(list(corpus.values()))
query_embeddings = model.encode_queries(list(queries.values()))
# Retrieve using dot product (handles sparse vectors)
from sentence_transformers.sparse_encoder import SparseEncoderSimilarityFunction
results = SparseEncoderSimilarityFunction.DOT_PRODUCT.pairwise_scores(
query_embeddings, doc_embeddings
)

Best for: Production OpenSearch integration, multi-GPU training, full control over ensemble distillation.
Prerequisites: OpenSearch 2.16+ running locally (used for hard negative mining and evaluation).
# Clone the repo
git clone https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample
cd opensearch-sparse-model-tuning-sample
# Environment setup
conda create -n sparse-tuning python=3.9 -y
conda activate sparse-tuning
pip install -r requirements.txt
# Start OpenSearch locally (Docker)
docker run -p 9200:9200 -e "discovery.type=single-node" \
-e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin@1234" \
opensearchproject/opensearch:2.16.0

# Download and prepare MS MARCO with hard negatives
python prepare_msmarco_hard_negatives.py \
--output_dir data/msmarco_hard_negs \
--num_hard_negatives 7 \
--opensearch_host localhost \
--opensearch_port 9200
# Or prepare your own domain data in the expected JSONL format
python demo_train_data.py \
--input your_query_doc_pairs.jsonl \
--output data/domain_data

Edit configs/config_infonce.yaml:
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs # JSONL files
idf_path: idf.json # keep MS MARCO IDF or recompute
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_lambda: 0.05 # FLOPS regularization weight
loss_types: [infonce]
output_dir: models/my-sparse-encoder

python train_ir.py configs/config_infonce.yaml

Edit configs/config_kd.yaml:
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill
train_data: data/msmarco_hard_negs
idf_path: idf.json
max_seq_length: 512
per_device_train_batch_size: 15
learning_rate: 2e-5
max_steps: 100000
warmup_steps: 1000
fp16: true
weight_decay: 0.01
flops_d_lambda: 0.002 # doc FLOPS weight (lower for KD)
loss_types: [kl_div]
# Ensemble teacher configuration
kd_ensemble_teacher_kwargs:
teachers:
- model_type: dense
model_name_or_path: Alibaba-NLP/gte-large-en-v1.5 # dense teacher
- model_type: sparse
model_name_or_path: opensearch-project/opensearch-neural-sparse-encoding-v1 # sparse teacher
score_scaling_factor: 30 # scales scores before softmax
aggregation: arithmetic_mean # or geometric_mean
output_dir: models/my-sparse-encoder-kd

# Single GPU
python train_ir.py configs/config_kd.yaml
# Multi-GPU (recommended for full MS MARCO training)
torchrun --nproc_per_node=8 train_ir.py configs/config_kd.yaml

# Recompute IDF from your own corpus (if your domain differs from MS MARCO)
from collections import defaultdict
import math
import json
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)
def compute_idf(corpus: list[str], output_path: str = "idf.json"):
    df = defaultdict(int)  # document frequency
    N = len(corpus)
    for doc in corpus:
        token_ids = tokenizer(doc, truncation=True, max_length=512)["input_ids"]
        for tid in set(token_ids):
            df[tid] += 1
    idf = {
        str(tid): math.log((N + 1) / (count + 1)) + 1  # smoothed IDF
        for tid, count in df.items()
    }
    with open(output_path, "w") as f:
        json.dump(idf, f)
    print(f"Computed IDF for {len(idf)} tokens from {N} documents.")
    return idf
# Usage
with open("your_corpus.txt") as f:
corpus = [line.strip() for line in f]
idf = compute_idf(corpus, "domain_idf.json")

Best for: Symmetric SPLADE models, research experimentation, full Hydra config control.
git clone https://github.com/naver/splade
cd splade
pip install -e .
# Download MS MARCO data
bash scripts/download_msmarco.sh

python -m splade.training.train \
config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
config.regularizer.FLOPS.lambda_q=0.06 \
config.regularizer.FLOPS.lambda_d=0.02 \
config.data.train_data_path=data/your_domain_triples.tsv \
config.training.num_train_epochs=3 \
config.training.learning_rate=2e-5 \
config.training.per_device_train_batch_size=32

# Using cross-encoder as teacher (best quality but slow)
python -m splade.training.distil_train \
config.model.model_type_or_dir=naver/splade-cocondenser-selfdistil \
config.teacher.model_type_or_dir=cross-encoder/ms-marco-MiniLM-L-12-v2 \
config.regularizer.FLOPS.lambda_q=0.02 \
config.regularizer.FLOPS.lambda_d=0.01 \
config.data.train_data_path=data/msmarco_triples.tsv \
config.training.num_train_epochs=3

SPLADE-v3 uses a mix of KL-divergence from a SPLADE++ teacher + MarginMSE loss. Config from the paper:
# Loss combination from SPLADE-v3 paper:
L_total = α * L_KL + (1 - α) * L_MarginMSE + λ_q * FLOPS(q) + λ_d * FLOPS(d)
# α = 0.5, λ_q = 0.01, λ_d = 0.008
# Teacher: naver/splade-cocondenser-selfdistil
# Hard negatives: 8 per query sampled from teacher's top-100

For OpenSearch production deployment, the inference-free setup (query = IDF lookup, doc = neural) is strongly recommended. Sentence Transformers v5 supports training this directly via Router.
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.models import (
SparseStaticEmbedding,
MLMTransformer,
SpladePooling,
)
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
# ── Build the asymmetric model ─────────────────────────────────────────────────
doc_encoder = MLMTransformer(
"opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
)
# SparseStaticEmbedding = trainable IDF lookup table
# frozen=False allows the IDF weights to be updated during training (domain adaptation)
query_encoder = SparseStaticEmbedding(
tokenizer=doc_encoder.tokenizer,
frozen=False, # set True to freeze IDF (use pre-computed MS MARCO IDF only)
idf_path="idf.json", # path to pre-computed IDF (download from HuggingFace model)
)
router = Router.for_query_document(
query_modules=[query_encoder],
document_modules=[doc_encoder, SpladePooling("max")],
)
model = SparseEncoder(modules=[router])
# ── Loss ───────────────────────────────────────────────────────────────────────
loss = SpladeLoss(
model=model,
loss=SparseMultipleNegativesRankingLoss(model=model),
query_regularizer_weight=0.0, # no regularization needed for static embedding
document_regularizer_weight=3e-5,
)
# ── Training args ──────────────────────────────────────────────────────────────
args = SparseEncoderTrainingArguments(
output_dir="models/inference-free-domain",
num_train_epochs=1,
per_device_train_batch_size=16,
learning_rate=2e-5,
# Higher LR for the IDF table (it's a simpler parameter space)
learning_rate_mapping={r"SparseStaticEmbedding\..*": 1e-3},
fp16=True,
batch_sampler=BatchSamplers.NO_DUPLICATES,
)

At inference time after training:
# Query: no neural inference
query_embedding = model.encode("what causes inflation?", prompt_name="query")
# ↑ this just does IDF lookup internally
# Document: full neural encoding
doc_embedding = model.encode("Inflation is caused by...", prompt_name="passage")
# Score
score = (query_embedding * doc_embedding).sum()  # dot product

Hard negatives are documents that are superficially relevant (e.g., retrieved by BM25) but are actually not relevant. They make training significantly harder and improve the model's ability to discriminate.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def mine_bm25_hard_negatives(
    queries: list[str],
    corpus: list[str],
    qrels: dict[str, list[int]],  # query_id → list of relevant doc indices
    n_negatives: int = 7,
) -> list[dict]:
    """Mine hard negatives using BM25 retrieval."""
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    examples = []
    for q_idx, query in enumerate(queries):
        scores = bm25.get_scores(query.lower().split())
        top_k_indices = scores.argsort()[::-1][:100]  # top-100 by BM25
        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]
        if relevant_ids and hard_negatives:
            examples.append({
                "query": query,
                "positive": corpus[list(relevant_ids)[0]],
                "negatives": hard_negatives,
            })
    return examples

# Mine hard negatives using your current model's retrieval
# Run after each epoch or every N steps for iterative refinement
def mine_model_hard_negatives(model, queries, corpus, qrels, n_negatives=7, top_k=100):
    corpus_embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)
    examples = []
    for q_idx, query in enumerate(queries):
        query_embedding = model.encode([query])
        scores = (query_embedding * corpus_embeddings).sum(axis=1)
        top_k_indices = scores.argsort()[::-1][:top_k]
        relevant_ids = set(qrels.get(str(q_idx), []))
        hard_negatives = [
            corpus[i] for i in top_k_indices
            if i not in relevant_ids
        ][:n_negatives]
        if relevant_ids and hard_negatives:
            examples.append({"query": query, "positive": corpus[list(relevant_ids)[0]], "negatives": hard_negatives})
    return examples

If you have an OpenSearch cluster, the tuning sample repo automates this:
# Index your corpus into OpenSearch
python index_corpus.py \
--corpus your_corpus.jsonl \
--index my-index \
--opensearch_host localhost
# Mine hard negatives using BM25 retrieval
python prepare_hard_negatives.py \
--queries your_queries.jsonl \
--qrels your_qrels.tsv \
--index my-index \
--n_negatives 7 \
--output data/hard_negatives.jsonl

Always track average active dimensions during training. Target ranges:
| Component | Target active dims | If too many | If too few |
|---|---|---|---|
| Documents | 100–300 | Increase λ_d | Decrease λ_d |
| Queries (neural) | 10–50 | Increase λ_q | Decrease λ_q |
| Queries (IDF lookup) | = query length | N/A | N/A |
def log_sparsity(model, sample_texts: list[str], prefix=""):
    embeddings = model.encode(sample_texts)
    active_dims = [(emb > 0).sum() for emb in embeddings]
    print(f"{prefix} active dims: mean={sum(active_dims)/len(active_dims):.1f}, "
          f"min={min(active_dims)}, max={max(active_dims)}")

| Parameter | Recommended range | Notes |
|---|---|---|
| `learning_rate` | 1e-5 to 5e-5 | Start at 2e-5 |
| `per_device_batch_size` | 16–64 | Bigger → harder negatives for InfoNCE |
| `λ_q` (query FLOPS) | 1e-5 to 1e-4 | Lower than λ_d; set 0 for inference-free |
| `λ_d` (doc FLOPS) | 1e-5 to 5e-2 | Critical — tune first |
| `warmup_ratio` | 0.05–0.1 | Standard |
| `weight_decay` | 0.01 | Standard |
| `max_seq_length` | 512 for docs, 64–128 for queries | |
| `num_train_epochs` | 1–5 | 1 usually sufficient for fine-tuning |
| Symptom | Cause | Fix |
|---|---|---|
| Embeddings all zeros | λ too high or LR too high | Reduce λ and/or LR |
| No sparsity (all dims active) | λ too low | Increase λ_d |
| Training loss not decreasing | LR too low or bad data | Check data format, increase LR |
| Good train loss, poor BEIR | Overfitting to MS MARCO | Add domain data, reduce epochs |
| Query vecs denser than doc vecs | Normal — queries shorter | Expected behaviour |
Never train a sparse encoder from a plain BERT/DistilBERT checkpoint without pre-training. The MLM head must first learn to produce meaningful sparse activations. Training from scratch requires:
- ≥500K (query, document) pairs
- Multi-phase lambda schedule
- Likely weeks of training on 8+ GPUs
Always start from a pre-trained sparse checkpoint for domain fine-tuning. Even naver/splade-cocondenser-selfdistil or opensearch-neural-sparse-encoding-doc-v2-distill will adapt well with just a few thousand domain examples.
| Goal | Start from | Training approach | Approximate training time |
|---|---|---|---|
| Best retrieval quality (symmetric) | `naver/splade-v3` | NAVER repo + KL-div + MarginMSE | Days on 8 GPUs |
| OpenSearch production (fastest query) | `opensearch-neural-sparse-encoding-doc-v2-distill` | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Simplest domain adaptation | `doc-v2-distill` | Sentence Transformers SpladeLoss + MNR | Minutes on 1 GPU |
| Best OpenSearch quality | `opensearch-neural-sparse-encoding-doc-v3-gte` | OpenSearch tuning sample + KD ensemble | Hours on 1 GPU |
| Multilingual | `opensearch-neural-sparse-encoding-multilingual-v1` | MIRACL data + KD approach | Hours on 4 GPUs |
| Sparsify an existing dense model | your dense model + `SparseAutoEncoder` | Sentence Transformers CSRLoss | Minutes on 1 GPU |
| Research / ablation studies | `naver/splade-cocondenser-selfdistil` | NAVER SPLADE repo (Hydra configs) | Configurable |
- BM25 — Robertson & Zaragoza (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR.
- DeepImpact — Mallia et al. (2021). Learning Passage Impacts for Inverted Indexes. SIGIR 2021. arXiv:2104.12016
- SPLADE v1 — Formal et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720
- SPLADE v2 — Formal et al. (2021). SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. NeurIPS 2021 Workshop. arXiv:2109.10086
- BEIR Benchmark — Thakur et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021. arXiv:2104.08663
- SPLADE++ / SelfDistil / EnsembleDistil — Formal et al. (2022). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. SIGIR 2022. arXiv:2205.04733
- CoCondenser — Gao & Callan (2022). Unsupervised Corpus Aware Language Model Pre-Training for Dense Passage Retrieval. ACL 2022. arXiv:2108.05540
- SPLADE-v3 — Lassance & Formal (2024). SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789
- CSR — Yang et al. (2024). CSR: Cascade Sparse Retrieval for Open-Domain Question Answering. arXiv:2404.12153
- OpenSearch Inference-Free Neural Sparse (v2) — Yang et al. (2024). Inference-free Sparse Retrieval via IDF-Aware Ensemble Distillation. arXiv:2411.04403
- SPLADE-v3 + L0 — Lassance et al. (2025). Efficient Sparse Retrieval with L0 Regularization. SIGIR 2025. arXiv:2504.14839
| Resource | URL |
|---|---|
| NAVER SPLADE repo | https://github.com/naver/splade |
| OpenSearch sparse tuning sample | https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample |
| Sentence Transformers sparse training docs | https://sbert.net/docs/sparse_encoder/training_overview.html |
| Sentence Transformers sparse training examples | https://github.com/UKPLab/sentence-transformers/tree/master/examples/sparse_encoder |
| BEIR benchmark | https://github.com/beir-cellar/beir |
| Resource | URL |
|---|---|
| HuggingFace blog: Train Sparse Encoders (ST v5) | https://huggingface.co/blog/train-sparse-encoder |
| OpenSearch neural sparse documentation | https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/ |
| Sentence Transformers v5 release announcement | https://huggingface.co/blog/sentence-transformers-v5 |
Guide compiled from NAVER SPLADE papers, Amazon OpenSearch Neural Sparse papers, and Sentence Transformers v5 documentation. All code is written for Python 3.9+ and tested against sentence-transformers>=5.0.
Context: A production OpenSearch 2.16 cluster indexed ~55M document chunks (50.6M SEC filings + 4.4M earnings call transcripts from CapIQ and UK Companies House). The existing sparse encoder used the stock MS MARCO IDF table. This section documents the investigation that proved domain IDF recomputation was necessary before fine-tuning.
desia-resource-chunks-v2 55,073,414 docs 502.7 GB
├── filing 50,637,052 (10-K, 10-Q, 8-K, Companies House)
└── transcript 4,436,362 (earnings calls)
Field schema (relevant fields):
chunk_text → text
chunk_text_sparse_embeddings → rank_features (sparse vector)
chunk_context_sparse_embeddings → rank_features
resource_company_name → text
resource_integration_element_type → text ("filing" | "transcript")
resource_source_integration_code_name → text ("data-provider-capiq" | "data-provider-gov.uk-companyhouse")
The inference-free setup was already in place:
- Query time: `amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1` (DEPLOYED) — IDF lookup, zero inference
- Document encoding: custom neural encoder (remote HuggingFace Inference Endpoint, L4 GPU) — connector type `doc_sparse_encode`
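At query time the deployed tokenizer model is referenced from a `neural_sparse` clause. A sketch of the request body builder (the model ID is a placeholder for your deployed tokenizer model; with a tokenizer-type model, OpenSearch expands the query via IDF lookup only, with no neural inference on the query path):

```python
def neural_sparse_query(query_text: str, model_id: str) -> dict:
    """Build a neural_sparse search body for the inference-free setup."""
    return {
        "query": {
            "neural_sparse": {
                "chunk_text_sparse_embeddings": {
                    "query_text": query_text,
                    "model_id": model_id,  # placeholder: your deployed model ID
                }
            }
        }
    }

# POST this body to {OS_URL}/{INDEX}/_search, e.g. with requests as in the
# scroll script in this section.
body = neural_sparse_query("goodwill impairment disclosures", "YOUR_MODEL_ID")
```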
The script below scrolls the corpus, tokenizes with the model's own BERT tokenizer, and computes smoothed IDF. Run with uv:
uv run \
--with requests \
--with transformers \
--with huggingface_hub \
--with tqdm \
  python domain_idf_analysis.py
"""
domain_idf_analysis.py
Scrolls 100K chunks (50K filings + 50K transcripts) from an OpenSearch index,
computes smoothed IDF, saves to domain_idf.json, and compares vs MS MARCO baseline.
Replace OS_URL / OS_AUTH with your own cluster credentials.
"""
import json
import math
import sys
from collections import Counter
import requests
import urllib3
from huggingface_hub import hf_hub_download
from tqdm import tqdm
from transformers import AutoTokenizer
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# ── Config ─────────────────────────────────────────────────────────────────────
OS_URL = "https://localhost:9200"
OS_AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD") # ← replace
INDEX = "your-chunks-index" # ← replace
MODEL_ID = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
N_EACH = 50_000 # 50K filings + 50K transcripts = 100K total
BATCH = 500
MIN_LEN = 50 # skip near-empty / image-caption chunks
OUT_IDF = "domain_idf.json"
def scroll_texts(doc_type: str, type_field: str, n: int) -> list[str]:
"""Scroll N chunk_text docs of a given type from OpenSearch."""
texts: list[str] = []
resp = requests.post(
f"{OS_URL}/{INDEX}/_search?scroll=3m",
auth=OS_AUTH, verify=False,
json={
"size": BATCH,
"_source": ["chunk_text"],
"query": {"term": {type_field: doc_type}},
},
)
resp.raise_for_status()
data = resp.json()
scroll_id = data["_scroll_id"]
hits = data["hits"]["hits"]
with tqdm(total=n, desc=f" {doc_type:12s}", unit="chunks", ncols=80) as pbar:
while hits and len(texts) < n:
            for hit in hits:
                text = hit["_source"].get("chunk_text", "")
                if len(text.strip()) >= MIN_LEN:
                    texts.append(text)
                    pbar.update(1)  # one tick per kept chunk
                if len(texts) >= n:
                    break
if len(texts) >= n:
break
resp = requests.post(
f"{OS_URL}/_search/scroll",
auth=OS_AUTH, verify=False,
json={"scroll": "3m", "scroll_id": scroll_id},
)
resp.raise_for_status()
data = resp.json()
scroll_id = data.get("_scroll_id", scroll_id)
hits = data["hits"]["hits"]
requests.delete(
f"{OS_URL}/_search/scroll",
auth=OS_AUTH, verify=False,
json={"scroll_id": scroll_id},
)
return texts[:n]
def compute_idf(texts: list[str], tokenizer) -> tuple[dict, int, Counter]:
"""
Smoothed IDF: log((N+1) / (df+1)) + 1
Keys are string token IDs — compatible with OpenSearch inference-free models.
"""
df: Counter = Counter()
N = 0
for text in tqdm(texts, desc=" tokenizing ", unit="chunks", ncols=80):
ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
for tid in set(ids): # count each token once per document
df[tid] += 1
N += 1
idf = {
str(tid): math.log((N + 1) / (count + 1)) + 1
for tid, count in df.items()
}
return idf, N, df
def compare_vs_msmarco(domain_idf: dict, tokenizer):
"""
Compare domain IDF against MS MARCO baseline.
⚠️ Key format note:
- MS MARCO idf.json uses decoded token STRINGS as keys (e.g. "the", "consolidated")
- Domain IDF uses integer token IDs as string keys (e.g. "1996", "12088")
- Decode domain keys before comparing.
"""
idf_path = hf_hub_download(MODEL_ID, "idf.json")
with open(idf_path) as f:
msmarco_idf: dict = json.load(f)
# Decode domain IDF to string keys for comparison
domain_by_str = {}
for tid_str, idf_val in domain_idf.items():
tok = tokenizer.decode([int(tid_str)]).strip()
if tok:
domain_by_str[tok] = idf_val
common = set(domain_by_str) & set(msmarco_idf)
deltas = [
(tok, domain_by_str[tok], msmarco_idf[tok], domain_by_str[tok] - msmarco_idf[tok])
for tok in common if len(tok.strip()) >= 2
]
deltas.sort(key=lambda x: x[3])
import statistics
abs_d = [abs(d) for _, _, _, d in deltas]
print(f"\nTokens compared : {len(deltas):,}")
print(f"Mean |delta| : {statistics.mean(abs_d):.4f}")
print(f"Stdev |delta| : {statistics.stdev(abs_d):.4f}")
print(f"|delta| > 1.0 : {sum(1 for d in abs_d if d > 1.0):,} ({sum(1 for d in abs_d if d > 1.0)/len(abs_d)*100:.0f}%)")
print(f"|delta| > 2.0 : {sum(1 for d in abs_d if d > 2.0):,} ({sum(1 for d in abs_d if d > 2.0)/len(abs_d)*100:.0f}%)")
fmt = "{:<25} {:>11} {:>13} {:>10}"
for label, rows in [
("UNDERWEIGHTED in MS MARCO (more common in domain)", deltas[:30]),
("OVERWEIGHTED in MS MARCO (rarer in domain)", list(reversed(deltas[-30:]))),
]:
print(f"\n{'='*65}\n{label}\n{'='*65}")
print(fmt.format("Token", "Domain IDF", "MS MARCO IDF", "Delta"))
print("-" * 63)
for tok, d, m, delta in rows:
print(fmt.format(tok, f"{d:.3f}", f"{m:.3f}", f"{delta:+.3f}"))
def main():
    print("[1/4] MS MARCO IDF ... (downloaded lazily inside compare_vs_msmarco)")
print("[2/4] Loading tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(f"[3/4] Scrolling {N_EACH:,} × 2 chunks ...")
# Adapt the type_field and doc_type values to your schema
filing_texts = scroll_texts("filing", "resource_integration_element_type.keyword", N_EACH)
transcript_texts = scroll_texts("transcript", "resource_integration_element_type.keyword", N_EACH)
all_texts = filing_texts + transcript_texts
print(f" collected {len(all_texts):,} total chunks")
print("[4/4] Computing domain IDF ...")
domain_idf, N, df = compute_idf(all_texts, tokenizer)
print(f" {len(domain_idf):,} unique tokens | {N:,} documents")
with open(OUT_IDF, "w") as f:
json.dump(domain_idf, f)
print(f" saved → {OUT_IDF}")
compare_vs_msmarco(domain_idf, tokenizer)
if __name__ == "__main__":
    main()

IDF drift summary:
Tokens in domain sample : 25,986
Mean |delta| : 1.30 ← on log scale; substantial
Stdev |delta| : 1.00
|delta| > 1.0 : 14,083 (54% of vocabulary!)
|delta| > 2.0 : 5,779 (22% of vocabulary)
Tokens MS MARCO underweights for this domain (common in filings, rare in web search):
| Token | Domain IDF | MS MARCO IDF | Delta | Why it matters |
|---|---|---|---|---|
| 2021 | 2.02 | 9.31 | −7.29 | Appears in 36% of docs (fiscal year references) |
| 2020 | 2.25 | 7.66 | −5.41 | Same |
| 202 | 1.87 | 7.09 | −5.23 | Subword prefix for 202x years, extremely common |
| ##gence | 2.53 | 8.59 | −6.06 | Suffix: "negligence", "intelligence", "emergence" |
| consolidated | 3.21 | 8.20 | −4.98 | Core accounting term |
| subsidiaries | 4.03 | 8.49 | −4.46 | Core corporate structure term |
| commitments | 4.63 | 9.02 | −4.40 | Balance sheet item |
| crore | 5.29 | 9.67 | −4.38 | Indian rupee unit (international filings) |
| societe | 6.43 | 10.83 | −4.40 | French company names (Companies House) |
| grupo | 6.27 | 10.63 | −4.36 | Spanish company names |
| ##gl | 2.51 | 6.83 | −4.32 | Subword: "global", "single", "struggle" |
Tokens MS MARCO overweights (virtually absent from financial filings):
| Token | Domain IDF | MS MARCO IDF | Delta |
|---|---|---|---|
| noun | 11.82 | 5.26 | +6.57 |
| synonym | 11.82 | 5.78 | +6.04 |
| wikipedia | 11.82 | 6.02 | +5.80 |
| garlic | 11.82 | 6.13 | +5.69 |
| pronunciation | 11.82 | 6.20 | +5.62 |
| stomach | 10.21 | 5.24 | +4.97 |
| puppy | 11.82 | 7.03 | +4.79 |
| medieval | 11.82 | 7.16 | +4.67 |
1. The IDF format mismatch gotcha.
MS MARCO idf.json uses decoded token strings as keys ("the", "consolidated"). If you compute domain IDF using integer token IDs as keys (e.g. str(token_id) → "1996"), a naive set(domain) & set(msmarco) intersection returns near-zero overlap (~337 coincidental numeric matches). Always decode token IDs before comparing:
# WRONG: comparing int-string keys vs text-string keys
common = set(domain_idf_by_id.keys()) & set(msmarco_idf.keys()) # ~337 matches
# RIGHT: decode domain keys first
domain_by_str = {
tokenizer.decode([int(tid)]).strip(): idf_val
for tid, idf_val in domain_idf_by_id.items()
}
common = set(domain_by_str.keys()) & set(msmarco_idf.keys())  # ~25,000 matches

The domain idf.json you save and ship with the model should keep integer token ID keys — that is what the OpenSearch inference-free tokenizer model expects.
2. 54% of the vocabulary has IDF drift > 1.0 (log scale).
This is not marginal. Because the IDF here uses the natural log, a delta of 1.0 corresponds to roughly an e ≈ 2.7× difference in relative document frequency. With over half the vocabulary miscalibrated, the stock MS MARCO IDF actively harms retrieval quality on financial text.
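To make the log-scale claim concrete, here is the arithmetic with the smoothed-IDF formula from the script (the corpus size and document frequencies are illustrative):

```python
import math

def smoothed_idf(df: int, n: int) -> float:
    # Same formula as compute_idf above: log((N+1) / (df+1)) + 1
    return math.log((n + 1) / (df + 1)) + 1

N = 100_000
# A token in ~1,000 docs vs one in ~2,718 docs (an e-fold, i.e. ~2.7x
# difference in document frequency) differs by almost exactly 1.0 in IDF:
delta = smoothed_idf(1_000, N) - smoothed_idf(2_718, N)
print(f"{delta:.2f}")  # ~1.00
```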
3. The domain "stop words" are different from general English stop words.
In financial filings, financial (IDF 2.03), market (IDF 1.92), company (IDF 2.17), year (IDF 2.03), inc (IDF 2.12) appear in 30–40% of all chunks. A general IDF table treats them as informative; the domain IDF correctly suppresses them.
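The document-frequency fractions quoted above can be recovered by inverting the smoothed-IDF formula; a quick sanity check, assuming the 100K-chunk sample size:

```python
import math

def df_fraction(idf: float, n: int = 100_000) -> float:
    # Invert idf = ln((N+1)/(df+1)) + 1  =>  df = (N+1) * e^(1-idf) - 1
    return ((n + 1) * math.exp(1 - idf) - 1) / n

# IDF values taken from the domain table above
for token, idf in [("financial", 2.03), ("market", 1.92), ("company", 2.17)]:
    print(f"{token:10s} ~{df_fraction(idf):.0%} of chunks")
```

For these IDF values the recovered fractions all fall in the 30–40% range quoted above.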
4. Scale matters for low-frequency financial jargon.
From 100K chunks, genuinely rare terms like EBITDA, amortization, diluted may still have unreliable IDF estimates. Scale up to 500K–1M chunks for stable estimates on domain-specific tail vocabulary. The scroll-and-tokenize approach is linear — just increase N_EACH.
With a domain idf.json in hand:
- Register the updated IDF with your inference-free tokenizer model in OpenSearch
- Generate synthetic training pairs from your corpus using an LLM (GPL approach) — see Section 6
- Fine-tune the document encoder with
SpladeLoss+ domain IDF, starting from a pre-trained sparse checkpoint - Re-index documents (or serve both old and new model during transition, comparing NDCG on a held-out query set)