Authorship Note: This document was compiled during an interactive exploration session simulating a "Feynman Lab" environment. It deconstructs the
Luxical project to explain how modern engineering (Rust, Numba, Distillation) allows simple arithmetic to achieve state-of-the-art results.
- The Problem: The Efficiency Gap
- The Solution: Lexical-Dense Embeddings
- Deep Dive: The Tokenization Engine (Rust)
- Deep Dive: The Feature Extraction (Numba)
- Deep Dive: The Vocabulary Builder (Space-Saving Algo)
- The Core Mathematics: Sparse-to-Dense Projection
- Training: The Art of Knowledge Distillation
- Performance Characteristics & Limits
- Practical Engineering: Usage & Fine-Tuning
In the current landscape of NLP, we have a massive bifurcation:
- The "Smart but Slow" (Transformers): Models like BERT, RoBERTa, and E5.
  - Mechanism: Self-Attention ($O(N^2)$ complexity). Every token looks at every other token.
  - Pros: Deep semantic understanding. Knows that "bank" means "river" when near "water".
  - Cons: Expensive. Hard to run on CPU at scale. Impossible to train on trillions of tokens without massive clusters.
- The "Fast but Dumb" (BM25, FastText):
  - Mechanism: Keyword matching or simple averaging ($O(N)$ complexity).
  - Pros: Blazing fast. Constant memory.
  - Cons: Semantic blindness. "Car" and "Automobile" are totally different features unless explicitly mapped.
Luxical attempts to bridge this gap. It asks:
- Can we keep the $O(N)$ speed of FastText?
- But gain the semantic understanding of BERT?
The answer lies in Knowledge Distillation. We don't change the architecture of the fast model (it stays simple); we change its weights by teaching it to mimic a smart model.
Luxical is not a Neural Network in the deep sense. It is a Feature-Based Linear Model.
- Text: "The quick brown fox"
- Tokens: [101, 200, 300, 400] (subwords)
- N-Grams:
  - 1-grams: [101], [200], ...
  - 2-grams: [101, 200], [200, 300], ...
  - ... up to 5-grams.
- Hashing: Map each N-Gram to a generic ID (0 to 2,000,000).
- Projection: Look up a learned vector for each ID and sum them up.
This pipeline is entirely linear. There are no activation functions (like ReLU or GELU) between the input and the summation (though there is a final normalization). This means inference cost scales linearly with input length.
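To make the pipeline concrete, here is a minimal NumPy sketch of the same steps, with toy sizes, a generic hash as a stand-in for FNV-1a, and illustrative variable names (this is not the Luxical API):

```python
import numpy as np

# Toy sizes; Luxical uses ~2,000,000 features and 192 dimensions.
rng = np.random.default_rng(0)
NUM_FEATURES, DIM = 1_000, 8
W = rng.normal(size=(NUM_FEATURES, DIM)).astype(np.float32)  # learned projection rows

tokens = np.array([101, 200, 300, 400], dtype=np.uint64)  # "The quick brown fox"

# N-grams + hashing: every 1- to 3-gram becomes one integer feature ID.
feature_ids = []
for n in range(1, 4):
    for i in range(len(tokens) - n + 1):
        feature_ids.append(hash(tokens[i : i + n].tobytes()) % NUM_FEATURES)

# Projection: look up one row per feature ID and sum -- purely linear, no activations.
embedding = W[feature_ids].sum(axis=0)

# Final normalization to unit length.
embedding /= np.linalg.norm(embedding)
print(embedding.shape)  # (8,)
```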
The first bottleneck in any high-performance NLP system is string processing. Python's str object is heavy.
Luxical solves this by offloading the critical path to Rust.
The library uses a custom Rust extension that interfaces with:
- Hugging Face Tokenizers: The industry standard for BPE/WordPiece algorithms in Rust.
- Apache Arrow: A cross-language development platform for in-memory data.
Key Design Choice: Zero-Copy Memory
Instead of passing Python lists of strings (which requires serialization/pickling), Luxical passes Arrow Arrays. Arrow defines a memory layout that both Python (via pyarrow) and Rust can read without copying bytes.
Inside arrow_tokenize/src/lib.rs:

```rust
// The Parallel Iterator (Rayon)
let results: PyResult<Vec<Option<Vec<u32>>>> = (0..string_array.len())
    .into_par_iter() // <--- Parallel execution across all CPU cores
    .map(|i| {
        // ... get text ...
        self.tokenizer.encode_fast(text, add_special_tokens) // <--- HF Tokenizer
    })
    .collect();
```

Why this matters:
- GIL Release: Rust releases the Python Global Interpreter Lock (GIL). This allows true parallelism.
- Batch Processing: It processes thousands of documents at once.
- Memory Efficiency: It returns a LargeListArray (Arrow format), which flows directly into the next step (Numba) without conversion overhead.
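A small pyarrow sketch of why that matters downstream: the token IDs live in one flat buffer plus offsets, and NumPy (and therefore Numba) can view both without copying. The toy data here stands in for the Rust extension's output.

```python
import pyarrow as pa

# Two documents' token IDs as a LargeListArray: one flat values buffer + int64 offsets.
token_ids = pa.array([[101, 200, 300], [101, 400]], type=pa.large_list(pa.uint32()))

# Zero-copy views for the downstream Numba kernels.
flat = token_ids.values.to_numpy(zero_copy_only=True)      # all token IDs, flattened
offsets = token_ids.offsets.to_numpy(zero_copy_only=True)  # document boundaries

doc0 = flat[offsets[0]:offsets[1]]  # token IDs of document 0: [101, 200, 300]
```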
Once we have integers (Token IDs), we need to generate features (N-grams).
The Challenge:
Generating every 1-gram through 5-gram for a document of length $N$ with a plain Python for loop (for i in range(len(tokens)): ...) is far too slow.
The Solution: Numba
Luxical uses @numba.njit to compile this logic into machine code.
Inside luxical/ngrams.py, the function sparse_count_ngram_in_document does the heavy lifting:
```python
@numba.njit(nogil=True)
def sparse_count_ngram_in_document(...):
    # Iterate over lengths 1 to 5
    for ngram_length in range(1, max_ngram_length + 1):
        # Sliding window
        for i in range(len(tokens) - ngram_length + 1):
            # Extract window
            ng[:ngram_length] = tokens[i : i + ngram_length]
            # Hash
            ngh = fnv1a_hash_array_to_int64(ng)
            # Count
            if ngh in ngram_hash_to_idx:
                ...
```

Why use hashing?
We need to map a sequence [101, 7592] to a single unique identifier (Feature ID).
Luxical implements the Fowler–Noll–Vo (FNV-1a) hash function manually in Numba:
```python
FNV_OFFSET_BASIS_64 = np.uint64(14695981039346656037)
FNV_PRIME_64 = np.uint64(1099511628211)

for byte_val in byte_view:
    hash_val ^= np.uint64(byte_val)  # XOR
    hash_val *= FNV_PRIME_64         # Multiply
```

Why FNV-1a?
- Speed: It uses only XOR and Multiply, which are among the cheapest CPU instructions. It is vastly faster than cryptographic hashes like SHA-256 or MD5.
- Distribution: It has excellent avalanche properties for short keys (like n-grams).
- Simplicity: It fits in 10 lines of code and has no dependencies.
This hashing allows Luxical to treat "The cat" (feature) just like a "Word" (feature). To the model, they are just Index 42 and Index 99.
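A self-contained (non-Numba) sketch of the same idea, showing how an n-gram of token IDs collapses to a single integer feature ID. The function name and exact byte handling are illustrative, not copied from the Luxical source:

```python
import numpy as np

FNV_OFFSET_BASIS_64 = 14695981039346656037
FNV_PRIME_64 = 1099511628211
MASK_64 = (1 << 64) - 1

def fnv1a_hash_ngram(ngram) -> int:
    """FNV-1a over the raw bytes of a token-ID array (illustrative helper)."""
    hash_val = FNV_OFFSET_BASIS_64
    for byte_val in np.asarray(ngram, dtype=np.uint64).tobytes():
        hash_val ^= byte_val                             # XOR in one byte
        hash_val = (hash_val * FNV_PRIME_64) & MASK_64   # multiply, keep 64 bits
    return hash_val

print(fnv1a_hash_ngram([7592]))       # a single "word" feature
print(fnv1a_hash_ngram([101, 7592]))  # the bigram "The cat" -- just another integer ID
```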
This is perhaps the most impressive "Systems" component. Goal: Find the top 2,000,000 most frequent n-grams in the FineWeb dataset (trillions of tokens).
Constraint: You cannot store a counter for every unique n-gram. There are quadrillions of possible combinations. You would run out of RAM instantly.
Luxical uses the Space-Saving Algorithm (Metwally et al., 2005). It is a "Heavy Hitters" algorithm.
Mechanism:
- Initialize a fixed map of size $K$ (e.g., 2 million).
- For every incoming n-gram $x$:
  - Case A: $x$ is in the map. -> Increment its count.
  - Case B: $x$ is NOT in the map, and the map has space. -> Add $x$ with count 1.
  - Case C: $x$ is NOT in the map, and the map is FULL.
    - Find the element $y$ with the minimum count ($min$).
    - Evict $y$.
    - Insert $x$.
    - Set Count of $x$ = $min + 1$.
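A compact, dictionary-based sketch of this update rule (a toy illustration, not the Luxical implementation, which runs over hashed n-grams with a ~2M-entry capacity in compiled code):

```python
def space_saving_update(counts: dict, x, capacity: int) -> None:
    """Apply one Space-Saving step for incoming item x against a fixed-size counter map."""
    if x in counts:                          # Case A: already tracked
        counts[x] += 1
    elif len(counts) < capacity:             # Case B: still room
        counts[x] = 1
    else:                                    # Case C: full -> evict the current minimum
        y = min(counts, key=counts.get)      # O(K) scan here; real code keeps a min-structure
        min_count = counts.pop(y)            # evict y
        counts[x] = min_count + 1            # insert x, inheriting min + 1

# Usage: stream items through a map with capacity 3; the heavy hitter survives.
counts = {}
for item in ["the", "the", "cat", "sat", "the", "on", "mat", "the"]:
    space_saving_update(counts, item, capacity=3)
print(counts)
```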
Why $min + 1$?
This is the survival mechanism.
If we reset new items to 1, they would be immediately evicted by the next item. The bottom of the list would become a revolving door where nothing accumulates enough count to survive.
By inheriting the count of the evicted item, we are saying: "Assume this new item $x$ might have appeared before while we weren't looking. Give it a fighting chance equal to the item it replaced."
Over time, true heavy hitters accumulate enormous counts (into the billions), while rare items stagnate near the $min$ threshold and are evicted.
Question: What if "giraffe" appears for the very first time at the very end of the stream?
Answer: It will replace the minimum item and enter the list with count $min + 1$.
Result: The final list might technically contain a rare item.
Fix: Luxical performs a post-processing step. It calculates a keep_threshold based on the minimum count. Items too close to the "eviction floor" are discarded as noise.
After tokenization and hashing, we have a Sparse Vector $x$:
- Dimension: 2,000,000.
- Values: Mostly 0. A few 1s (counts).
We want a Dense Vector $E$:
- Dimension: 192.
$$ E = x \cdot W $$
Where $W$ is the learned projection matrix of shape $2{,}000{,}000 \times 192$ (one row per n-gram feature).
Multiplying a sparse vector by a dense matrix is inefficient if you do it blindly (touching all 2,000,000 rows even though almost every input entry is zero). Instead, only the rows of features actually present in the document are touched.
In Python/Numba terms:
- Get the indices of active n-grams: [Idx1, Idx2, Idx3...]
- Get the weights (TF-IDF): [w1, w2, w3...]
- Slice the matrix: Rows = W[[Idx1, Idx2, ...]]
- Weighted Sum: Output = (Rows * Weights).sum(axis=0)
This operation costs $O(k \cdot d)$, where $k$ is the number of active n-grams in the document and $d$ is the output dimension (192), rather than scaling with the full 2,000,000-row matrix.
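In NumPy terms, the gather-and-sum looks roughly like this (toy shapes, illustrative variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2_000, 192)).astype(np.float32)    # projection matrix (toy row count)

active_idx = np.array([42, 99, 1337])                    # feature IDs present in the document
weights = np.array([1.2, 0.4, 2.7], dtype=np.float32)    # per-feature weights (e.g., TF-IDF)

# Gather only the active rows and take a weighted sum: cost scales with k, not 2,000,000.
embedding = (W[active_idx] * weights[:, None]).sum(axis=0)
print(embedding.shape)  # (192,)
```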
Not all n-grams are equal.
- "The": High frequency, low information.
- "Quantum": Low frequency, high information.
Luxical learns/calculates an IDF vector during the Space-Saving phase. $$ \text{IDF}(t) = \log(\frac{\text{Total N-Grams}}{\text{Count}(t)}) $$
This weight multiplies each n-gram's row before the summation, so informative phrases dominate the final embedding while stopwords contribute little.
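As a tiny worked example of the formula above (numbers invented for illustration):

```python
import numpy as np

total_ngrams = 1_000_000_000
ngram_counts = np.array([50_000_000, 120], dtype=np.float64)  # e.g., "The" vs. "Quantum"

idf = np.log(total_ngrams / ngram_counts)
print(idf)  # ~[3.0, 15.9]: the rare n-gram earns roughly 5x the weight of the stopword
```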
How do we fill the matrix $W$ with useful values? Through Knowledge Distillation.
- Teacher: snowflake-arctic-embed-m (Transformer).
  - Input: "The movie was not good."
  - Output: Vector $V_T$ (captures negative sentiment).
- Student: Luxical (Bag of N-grams).
  - Input: "The", "movie", "not", "good", "not good"...
  - Output: Vector $V_S$ (initially random).
We minimize the distance between $V_S$ and $V_T$.
The Magic of N-Grams: The student cannot understand syntax. It doesn't know "not" negates "good" via grammar. But it does have a feature for the bigram "not good".
During training:
- Teacher says: "Vector must be NEGATIVE."
- Student sums: Vec("not") + Vec("good") + Vec("not good").
- Gradient Descent: "The only unique feature here is 'not good'. I will make its vector extremely NEGATIVE to fix the error."
Thus, the student "memorizes" the semantic result of the Teacher's attention mechanism into the static weights of the n-gram.
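A minimal distillation loop, sketched in PyTorch under stated assumptions: the student is an EmbeddingBag (one row per hashed n-gram, summed per document) and the loss is 1 minus cosine similarity. This is the shape of the idea, not the Luxical training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes; the real student has ~2M rows and matches the teacher's output width.
NUM_FEATURES, DIM = 10_000, 192
student = nn.EmbeddingBag(NUM_FEATURES, DIM, mode="sum")
optim = torch.optim.Adam(student.parameters(), lr=1e-3)

def train_step(feature_ids, offsets, teacher_vecs):
    """One step: push the student's summed n-gram vectors toward the teacher's vectors."""
    v_s = F.normalize(student(feature_ids, offsets), dim=-1)
    v_t = F.normalize(teacher_vecs, dim=-1)
    loss = (1 - (v_s * v_t).sum(dim=-1)).mean()   # 1 - cosine similarity
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Fake batch: two documents' hashed feature IDs (flattened) and precomputed teacher vectors.
ids = torch.tensor([5, 17, 42, 7, 99])      # doc 0 = ids[0:3], doc 1 = ids[3:]
offsets = torch.tensor([0, 3])
teacher_vecs = torch.randn(2, DIM)          # e.g., from snowflake-arctic-embed-m
print(train_step(ids, offsets, teacher_vecs))
```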
- BERT / Transformers: $O(N^2)$.
  - Doubling text length = $4\times$ compute.
  - Hard limit (e.g., 512 tokens) due to memory.
- Luxical: $O(N)$.
  - Doubling text length = $2\times$ compute.
  - No hard architectural limit.
While Luxical can process 10,000 words, it shouldn't. Because it relies on Summation, all vectors get averaged.
The result is a vector that points nowhere specific (the centroid of all topics).
Rule of Thumb: Use Luxical for Passage Retrieval (chunks of 50-500 words). If you have a book, chunk it first.
For very short text (< 5 words), Bag-of-Words struggles with word order.
- "Man bites Dog" vs "Dog bites Man".
- Unigrams are identical.
- The only differentiation comes from N-grams: [Man bites] vs [Dog bites].
- If the model hasn't seen those specific bigrams in the vocabulary, it sees them as identical.
Rule of Thumb: Avoid using Luxical for extremely short, order-dependent queries (1-3 words) unless the phrases are common idioms.
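A quick toy illustration of this failure mode, using whitespace tokens instead of subword IDs:

```python
def bag_of_ngrams(text: str, max_n: int = 2) -> set:
    """Collect 1- and 2-gram features from a whitespace-tokenized string (toy sketch)."""
    toks = text.split()
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            feats.add(tuple(toks[i : i + n]))
    return feats

a, b = bag_of_ngrams("Man bites Dog"), bag_of_ngrams("Dog bites Man")
print(a & b)  # the unigram features are identical
print(a ^ b)  # only the bigrams ('Man', 'bites') vs ('Dog', 'bites') etc. tell them apart
```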
Since Luxical relies on a Rust kernel, you cannot just pip install a pure Python wheel (unless pre-built).
```bash
# 1. Install Rust (cargo)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# 2. Compile Luxical
git clone https://github.com/datologyai/luxical
cd luxical
maturin develop --release
```

If you use Luxical "out of the box" on Enterprise data (e.g., Legal, Medical), it may fail on jargon.
The Strategy:
- Vocabulary Expansion:
  - Run the SpaceSavingNgramSummary on your enterprise corpus.
  - Identify top terms (e.g., "Section 404(b)").
  - Add them to the 2M vocabulary if missing.
- Fine-Tuning:
  - Run a Teacher (BERT) on your corpus to generate target vectors.
  - Train the Luxical projection layer (Student).
  - Tip: Freeze the rows of the "General English" terms to prevent catastrophic forgetting. Only train the new rows or use a very low learning rate for the old ones. (A sketch follows below.)
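One way to implement the freezing tip, sketched in PyTorch under the assumption that the projection is stored as an embedding matrix: a gradient hook zeroes updates for the original "General English" rows so only the newly added jargon rows move.

```python
import torch
import torch.nn as nn

# Toy sizes: existing general-vocabulary rows plus newly appended enterprise n-gram rows.
NUM_OLD, NUM_NEW, DIM = 10_000, 500, 192
emb = nn.EmbeddingBag(NUM_OLD + NUM_NEW, DIM, mode="sum")

# Mask: 0 for frozen (old) rows, 1 for trainable (new) rows.
row_mask = torch.zeros(NUM_OLD + NUM_NEW, 1)
row_mask[NUM_OLD:] = 1.0

# Zero the gradients of the frozen rows after every backward pass.
emb.weight.register_hook(lambda grad: grad * row_mask)
```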
| Feature | BERT / Transformers | Luxical | BM25 / Keyword |
|---|---|---|---|
| Speed | Slow ($O(N^2)$) | Very Fast ($O(N)$) | Instant |
| Semantics | Deep, Contextual | Shallow, Phrase-based | None (Exact Match) |
| Vocabulary | Fixed (~30k) | Massive (~2M N-grams) | Infinite |
| Training | Heavy (TPUs) | Moderate (Distillation) | None |
| Usage | Re-Ranking, QA | First-Stage Retrieval | Keyword Search |
This document serves as a comprehensive reference for the engineering principles behind Luxical. It demonstrates that high-performance AI is not just about bigger matrices, but about smarter algorithms and systems programming.