Information Retrieval Flashcards (based on Leonie Monigatti's "37 Things I Learned About Information Retrieval in Two Years at a Vector Database Company")

Information Retrieval Flashcards

Leonie Monigatti, one of the best and clearest voices on information retrieval, published this great list of the most essential things to know about information retrieval (that's the "R" in "RAG"): 37 Things I Learned About Information Retrieval in Two Years at a Vector Database Company. It's excellent, go read it.

And because these are things I never want to forget, I created flashcards to add to my collection (using CardCraft). Maybe they will be useful to you too.

Formats:

  • Anki
  • Mochi
  • CSV (back formatted in markdown)
  • JSON (back formatted in markdown)
  • Markdown (cards separated by ===, back and front separated by ---)
BM25 Is A Strong Baseline For Search BM25 is a **keyword search** algorithm. It is recommended to start with simple baselines like BM25 before moving to more complex methods like vector search. >Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.
Why should you start with BM25 before vector search? You should start with BM25 because it is a **simple** and **strong baseline** for keyword search. This pragmatic approach helps establish a foundational search system before introducing the complexity of vector search. >Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.
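
To make the baseline concrete, here is a minimal, self-contained BM25 scorer in Python (an illustrative sketch only, using the common defaults `k1=1.5`, `b=0.75`; in practice you would reach for the BM25 implementation built into your search engine or a library):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}  # document frequency per term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [d.lower().split() for d in [
    "how to fix a leaking kitchen faucet",
    "where to buy a kitchen faucet",
    "best hiking trails in the alps",
]]
print(bm25_scores("fix faucet".split(), docs))  # the first document should score highest
```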
Vector Search In Vector Databases Is Approximate And Not Exact Vector search is **approximate** because brute-force exact k-nearest neighbor (KNN) computations do not scale well. Vector databases use **Approximate Nearest Neighbor (ANN)** algorithms to achieve speed at scale, trading off a small amount of accuracy. >In theory, you could run a brute-force search to compute distances between a query vector and every vector in the database using exact k-nearest neighbors (KNN). But this doesn’t scale well. That’s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.
What algorithms enable fast, approximate vector search? Vector databases use **Approximate Nearest Neighbor (ANN)** algorithms to speed up search at scale. Examples include: - **HNSW** (Hierarchical Navigable Small World) - **IVF** (Inverted File Index) - **ScaNN** (Scalable Nearest Neighbors) These algorithms introduce a small trade-off in accuracy for significant speed gains. >That’s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.
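
For contrast, exact brute-force KNN is easy to write yourself; the problem is purely that it compares the query against every stored vector, and that full scan is what ANN indexes like HNSW, IVF, and ScaNN avoid. A minimal NumPy sketch (with a `top_k` parameter, which the later cards come back to):

```python
import numpy as np

def knn_exact(query, vectors, top_k=5):
    """Exact brute-force KNN by cosine similarity: compares the query to every vector."""
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                        # similarity to every stored vector: O(N * d)
    idx = np.argsort(-sims)[:top_k]     # this full scan and sort is what ANN indexes avoid
    return idx, sims[idx]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 384))    # stand-ins for stored embeddings
query = rng.normal(size=384)
print(knn_exact(query, vectors, top_k=3))
```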
Vector Databases Don’t Only Store Embeddings Besides embeddings, vector databases also store: - The **original object** (e.g., the text from which embeddings were generated) - **Metadata** This enables features beyond just vector search, such as metadata filtering and hybrid search. >They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.
What additional features are supported by vector databases due to storing original objects and metadata? By storing original objects and metadata, vector databases can support features such as: - **Metadata filtering** - **Keyword search** - **Hybrid search** These capabilities extend their utility beyond purely vector-based search. >They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.
Vector Databases’ Main Application Is Not In Generative AI The **primary application** of vector databases is **search**. Although finding relevant context for Large Language Models (LLMs) is a form of search, their utility extends beyond only generative AI. >It’s in search. But finding relevant context for LLMs is ‘search’. That’s why vector databases and LLMs go together like cookies and cream.
You Have To Specify How Many Results You Want To Retrieve When performing a vector search, it is crucial to define the maximum number of results you want. Without parameters like `limit` or `top_k`, vector search would return _all_ objects stored in the database, sorted by the distance to your query. >When I think back, I almost have to laugh because this was such a big “aha” moment when I realized that you need to define the maximum number of results you want to retrieve. It’s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren’t a `limit` or `top_k` parameter.
What parameters are used to limit vector search results? To control the number of results retrieved in a vector search, you must specify a maximum limit using parameters such as: - `limit` - `top_k` Without these, the search would return _all_ objects, sorted by distance. >It’s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren’t a `limit` or `top_k` parameter.
There Are Many Different Types Of Embeddings Beyond the commonly visualized **dense vectors**, there are other types of embeddings, including: - **Sparse vectors** (e.g., `[0, 2, 0, ..., 1]`) - **Binary vectors** (e.g., `[0, 1, 1, ..., 0]`) - **Multi-vector embeddings** (e.g., `[[-0.9837, ...], [0.1044, ...]]`) Each type serves different purposes in information retrieval. >When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That’s called a dense vector, and it is the most commonly used type of vector embedding. But there’s also many other types of vectors, such as sparse ([0, 2, 0, …, 1]), binary ([0, 1, 1, …, 0]), and multi-vector embeddings ([[-0.9837, …, -0.2049], [ 0.1044, …, 0.0090], …, [-0.0937, …, 0.5044]]), which can be used for different purposes.
What is the most commonly used type of vector embedding? The most commonly used type of vector embedding is the **dense vector**, often visualized as a continuous array of floating-point numbers, such as `[-0.9837, 0.1044, ..., -0.2049]`. >When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That’s called a dense vector, and it is the most commonly used type of vector embedding.
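
As a rough illustration (values made up), the different embedding types are simply different shapes of data:

```python
import numpy as np

dense = np.array([-0.9837, 0.1044, 0.0090, -0.2049])   # one float per dimension
sparse = {3: 2.0, 17: 1.0}                              # only non-zero dimensions stored (index -> weight)
binary = np.array([0, 1, 1, 0], dtype=np.uint8)         # one bit per dimension
multi_vector = np.array([[-0.9837, -0.2049],            # several vectors per object
                         [ 0.1044,  0.0090]])           # (e.g. one per token or image patch)
```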
Fantastic Embedding Models And Where To Find Them To find fantastic embedding models, you should check: - The **Massive Text Embedding Benchmark (MTEB)** leaderboard, which covers various tasks like classification, clustering, and retrieval. - **BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models)**, specifically for information retrieval evaluation. >The first place to go is the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval. If you’re focused on information retrieval, you might want to check out [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://github.com/beir-cellar/beir).
The Majority Of Embedding Models On MTEB Are English While the **Massive Text Embedding Benchmark (MTEB)** leaderboard contains many excellent models, most are designed for **English**. For multilingual or non-English applications, **MMTEB** is a better resource. >If you’re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1).
Where can you find multilingual embedding models? If you are working with multilingual or non-English languages, you should explore **MMTEB (Massive Multilingual Text Embedding Benchmark)**, as the majority of models on the MTEB leaderboard are English-focused. >If you’re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1).
A Little History On Vector Embeddings Before modern **contextual embeddings** (e.g., BERT), there were **static embeddings** (e.g., Word2Vec, GloVe). Static embeddings give each word a fixed representation, while contextual embeddings generate different representations based on the surrounding text, making them more expressive. >Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.
What is the difference between static and contextual embeddings? - **Static embeddings** (like Word2Vec) assign a _fixed representation_ to each word, regardless of its context. - **Contextual embeddings** (like BERT) generate _different representations_ for the same word based on its surrounding context, making them more expressive. Static embeddings can still be useful in computationally _restrained environments_ as they can be looked up from pre-computed tables. >Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.
Fantastic Embedding Models And Where To Find Them To find fantastic embedding models, the primary resource is the **Massive Text Embedding Benchmark (MTEB)** leaderboard. It covers a wide range of tasks for embedding models, including: - Classification - Clustering - Retrieval >The first place to go is the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval.
What is BEIR and what is its focus? BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models) is a benchmark specifically focused on **information retrieval**. >If you’re focused on information retrieval, you might want to check out [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://github.com/beir-cellar/beir).
The Majority Of Embedding Models On MTEB Are English Most embedding models found on MTEB are designed for English. For working with **multilingual or non-English languages**, it is recommended to check out the **MMTEB (Massive Multilingual Text Embedding Benchmark)**. >If you’re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1).
A Little History On Vector Embeddings Before the advent of today's **contextual embeddings** (e.g., BERT), there were **static embeddings** such as Word2Vec and GloVe. >Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe).
What is the difference between static and contextual embeddings? - **Static embeddings**: Provide a _fixed_ representation for each word. - **Contextual embeddings**: Generate _different_ representations for the same word based on its _surrounding context_. - Static embeddings can still be useful in _computationally restrained environments_ because they can be looked up from pre-computed tables. >They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.
Don’t Confuse Sparse Vectors And Sparse Embeddings - **Sparse vectors** can be generated in two main ways: - By applying _statistical scoring functions_ like TF-IDF or BM25 to term frequencies. - With _neural sparse embedding models_ like SPLADE. - A **sparse embedding** _is_ a type of sparse vector, but _not all_ sparse vectors are necessarily sparse embeddings. >It took me a while until I understood that sparse vectors can be generated in different ways: Either by applying statistical scoring functions like TF-IDF or BM25 to term frequencies (often retrieved via inverted indexes), or with neural sparse embedding models like SPLADE. That means a sparse embedding is a sparse vector, but not all sparse vectors are necessarily sparse embeddings.
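
For the statistical route, scikit-learn's `TfidfVectorizer` produces this kind of sparse vector; neural sparse embeddings like SPLADE require a model instead. A small example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "how to fix a leaking kitchen faucet",
    "where to buy a kitchen faucet",
    "best hiking trails in the alps",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # scipy sparse matrix: most entries are zero
print(X.shape)                         # (3 documents, vocabulary size)
print(X[0])                            # only the non-zero (term index, weight) pairs are stored
```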
Embed All The Things Embeddings are not exclusively for text. You can embed various data types, including: - Images - PDFs as images (e.g., ColPali) - Graphs This capability enables **multimodal vector search** over diverse data. >Embeddings aren’t just for text. You can embed images, PDFs as images (see [ColPali](https://arxiv.org/abs/2407.01449)), graphs, etc. And that means you can do vector search over multimodal data. It’s pretty incredible. You should try it sometime.
The Economics Of Vector Embeddings - The **vector dimensions** directly impact the required _storage cost_. - For example, choosing a model with 1536 dimensions over one with 768 dimensions can _double_ your storage requirements. - While more dimensions capture _more semantic nuances_, a very high number of dimensions may not always be necessary for common tasks like "chat with your docs". >This shouldn’t be a surprise, but the vector dimensions will impact the required storage cost. So, consider whether it is worth it before you choose an embedding model with 1536 dimensions over one with 768 dimensions and risk doubling your storage requirements. Yes, more dimensions capture more semantic nuances.
What is Matryoshka Representation Learning? Matryoshka Representation Learning is a technique that allows you to _shorten_ vector embeddings. This is beneficial for environments with _less computational resources_, while aiming to maintain _minimal performance losses_. >Some models actually use [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) to allow you to shorten vector embeddings for environments with less computational resources, with minimal performance losses.
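
A sketch of how a Matryoshka-trained embedding is typically used: keep only the leading dimensions and re-normalize. This only works if the model was trained with Matryoshka Representation Learning; truncating an arbitrary embedding is not guaranteed to behave well.

```python
import numpy as np

def shorten(embedding, dims=256):
    """Truncate a Matryoshka-trained embedding and re-normalize it."""
    short = np.asarray(embedding)[:dims]
    return short / np.linalg.norm(short)

full = np.random.default_rng(0).normal(size=1536)   # stand-in for a model output
print(shorten(full, dims=256).shape)                # (256,) -> roughly 6x less storage
```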
“Chat With Your Docs” Tutorials Are The “Hello World” Programs Of Generative AI "Chat with your docs" tutorials are the "Hello world" programs of Generative AI: the foundational, introductory examples of the field. >“Chat with your docs” tutorials are the “Hello world” programs of Generative AI.
You Need To Call The Embedding Model A LOT Calling the embedding model is a frequent necessity, not just during initial data ingestion. It is required: - Every time a **search query** is run (the query must be embedded). - When **adding new objects** later on (they need embedding and indexing). - If you **change the embedding model** (all existing data must be re-embedded and re-indexed). >Just because you embedded your documents during the ingestion stage, doesn’t mean you’re done calling the embedding model. Every time you run a search query, the query must also be embedded (if you’re not using a cache). If you’re adding objects later on, those must also be embedded (and indexed). If you’re changing the embedding model, you must also re-embed (and re-index) everything.
Similar Does Not Necessarily Mean Relevant - Vector search identifies objects based on their _similarity_ to a query, measured by proximity in vector space. - However, _similarity_ does not always equate to _relevance_ to the user's intent. - For example, "How to fix a faucet" and "Where to buy a kitchen faucet" might be similar in vector space but not relevant to each other. >Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., “How to fix a faucet” and “Where to buy a kitchen faucet”) does not mean they are relevant to each other.
Cosine Similarity And Cosine Distance Are Not The Same Thing - Cosine similarity and cosine distance are _related_ but distinct concepts. - They are _inverses_ of each other (cosine distance = 1 - cosine similarity). - If two vectors are exactly the same: - Their similarity is **1**. - The distance between them is **0**. >But they are related to each other (cosine distance = 1 - cosine similarity). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0.
What is the relationship between cosine similarity and dot product for normalized vectors? - For **normalized vectors**, cosine similarity and dot product are _mathematically equivalent_. - In this scenario, the **dot product** is generally _more efficient_ for computation. >Because mathematically, they are the same. For the calculation, dot product is more efficient.
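
Both points are easy to verify numerically: cosine distance is the complement of cosine similarity, and for normalized vectors the dot product gives the same value as cosine similarity.

```python
import numpy as np

a = np.array([0.3, 0.8, 0.5])
b = np.array([0.1, 0.9, 0.4])

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1 - cos_sim                      # distance and similarity are complements

a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(cos_sim, a_n @ b_n)       # for normalized vectors, dot product == cosine similarity
print(cos_sim, cos_dist)
```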
Common Misconception: The R In RAG Stands For ‘Vector Search’ - The **"R"** in RAG (Retrieval-Augmented Generation) stands for **'retrieval'**, not specifically 'vector search'. - Retrieval can be accomplished through various methods beyond just vector search. >It doesn’t. It stands for ‘retrieval’. And retrieval can be done in many different ways (see following bullets).
Vector Search Is Just One Tool In The Retrieval Toolbox - Vector search is only _one_ tool within a broader **retrieval toolbox**. - Other essential tools include: - **Keyword-based search** - **Filtering** - **Reranking** - Combining these different tools is crucial for building effective and robust retrieval systems. >There’s also keyword-based search, filtering, and reranking. It’s not one over the other. To build something great, you will need to combine it with different tools.
Similar Does Not Necessarily Mean Relevant - Vector search returns objects based on their _proximity_ in vector space, which signifies _similarity_. >Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., “How to fix a faucet” and “Where to buy a kitchen faucet”) does not mean they are relevant to each other.
For normalized vectors, which is more computationally efficient: cosine similarity or dot product? - For _normalized vectors_, both cosine similarity and dot product are mathematically the **same**. - However, the **dot product** is generally _more efficient_ for computation. >If you’re working with normalized vectors, it doesn’t matter whether you’re using cosine similarity or dot product for the similarity measure. Because mathematically, they are the same. For the calculation, dot product is more efficient.
Cosine Similarity And Cosine Distance Are Not The Same Thing - Cosine similarity and cosine distance are _related_ but are _not identical_. - They are _inverses_ of each other (cosine distance = 1 - cosine similarity): if two vectors are exactly the same, their similarity is 1, and their distance is 0. >But they are related to each other (cosine distance = 1 - cosine similarity). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0.
How can hybrid search combine keyword and vector search using parameters? - In some implementations (e.g., Weaviate), a hybrid search function allows you to combine keyword-based and vector-based search. - The `alpha` parameter can then be used to _adjust the weighting_ from pure keyword-based search, to a mix of both, or to pure vector search. >In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the `alpha` parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search.
If You’re Working With Normalized Vectors, It Doesn’t Matter Whether You’re Using Cosine Similarity Or Dot Product For The Similarity Measure - For _normalized vectors_, it does _not matter_ whether you use cosine similarity or dot product. - Mathematically, they are the **same** calculation in this context. - The **dot product** is generally _more efficient_ for computation. >Because mathematically, they are the same. For the calculation, dot product is more efficient.
What analogy describes the issue of losing semantic meaning with large chunk sizes and mean pooling? - The issue is like creating a **movie poster** by _overlaying every single frame_ of the movie. - While all the information is technically present, the resulting image is _unintelligible_, and the overall meaning of the movie is lost. >I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.
Common Misconception: The R In RAG Stands For ‘Vector Search’ - The 'R' in RAG stands for **'retrieval'**, not specifically 'vector search'. - _Retrieval_ encompasses various methods, including but not limited to vector search. >It doesn’t. It stands for ‘retrieval’. And retrieval can be done in many different ways (see following bullets).
Vector Search Is Just One Tool In The Retrieval Toolbox - Vector search is only _one component_ in a comprehensive retrieval system. - Effective retrieval often requires combining vector search with other techniques like **keyword search**, **filtering**, and **reranking**. >There’s also keyword-based search, filtering, and reranking. It’s not one over the other. To build something great, you will need to combine it with different tools.
When To Use Keyword-Based Search Vs. Vector-Based Search - Use **vector-based search** for matching _semantics_ and _synonyms_ (e.g., “pastel colors” vs. “light pink”). - Use **keyword-based search** for _exact keywords_ (e.g., “A-line skirt”, “peplum dress”). - For use cases requiring **both**, **hybrid search** is beneficial. >Does your use case require mainly matching semantics and synonyms (e.g., “pastel colors” vs. “light pink”) or exact keywords (e.g., “A-line skirt”, “peplum dress”)? If it requires both (e.g., “pastel colored A-line skirt”), you might benefit from combining both and using hybrid search. In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the `alpha` parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search.
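
The general shape of the `alpha` weighting looks roughly like the toy sketch below; real engines such as Weaviate normalize and fuse keyword and vector scores in their own specific ways, so treat this only as an illustration of the knob:

```python
def hybrid_score(keyword_score, vector_score, alpha=0.5):
    """Blend keyword and vector scores: alpha=0 -> pure keyword, alpha=1 -> pure vector.
    Toy sketch only; assumes both scores are already normalized to [0, 1]."""
    return (1 - alpha) * keyword_score + alpha * vector_score

print(hybrid_score(keyword_score=0.9, vector_score=0.2, alpha=0.0))  # keyword-only
print(hybrid_score(keyword_score=0.9, vector_score=0.2, alpha=0.5))  # mix of both
print(hybrid_score(keyword_score=0.9, vector_score=0.2, alpha=1.0))  # vector-only
```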
Hybrid Search Can Be A Hybrid Of Different Search Techniques - While the term most often refers to the combination of keyword-based and vector-based search, 'hybrid' is _broader_. - It can also refer to combining **vector-based search** with **search over structured data** (metadata filtering). >Most often, when you hear people talk about hybrid search, they mean the combination of keyword-based search and vector-based search. But the term ‘hybrid’ doesn’t specify which techniques to combine. So, sometimes you might hear people talk about hybrid search, meaning the combination of vector-based search and search over structured data (often referred to as metadata filtering).
Misconception: Filtering Makes Vector Search Faster - This is a misconception because filtering does _not always improve_ search latency. - _Pre-filtering_ can disrupt underlying index structures (e.g., HNSW graph connectivity). - _Post-filtering_ may lead to an empty result set. - Vector databases employ _complex methods_ to address this. >Intuitively, you’d think using a filter should speed up search latency because you’re reducing the number of candidates to search through. But in practice, pre-filtering candidates can, for example, break the graph connectivity in HNSW, and post-filtering can leave you with no results at all. Vector databases have different, sophisticated techniques to handle this challenge.
Two-Stage Retrieval Pipelines Aren’t Only For Recommendation Systems - Two-stage pipelines are common in **recommendation systems** but are also applicable to **RAG pipelines**. - The _first stage_ uses a simpler, faster process (e.g., vector search) to _reduce candidates_. - The _second stage_ uses a more compute-intensive but _more accurate_ reranking process. >Recommendation systems often have a first retrieval stage that uses a simpler retrieval process (e.g., vector search) to reduce the number of potential candidates, which is followed by a second retrieval stage with a more compute-intensive but more accurate reranking stage. You can apply this to your RAG pipeline as well.
How Vector Search Differs From Reranking - **Vector search** _retrieves_ a small portion of results from the _entire database_. - **Reranking** _re-orders_ an _already provided list_ of items. >Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list.
Finding The Right Chunk Size To Embed Is Not Trivial - Selecting the correct chunk size is _challenging_: - **Too small** means losing _important context_. - **Too big** means losing _semantic meaning_ due to averaging (mean pooling). - Many embedding models use **mean pooling**, which can dilute meaning in large chunks. >Too small, and you’ll lose important context. Too big, and you’ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.
Vector Indexing Libraries Are Different From Vector Databases - Both are fast for vector search and useful for tutorials. - **Vector databases** offer additional _data management features_ such as built-in persistence, CRUD support, metadata filtering, and hybrid search. - **Vector indexing libraries** _lack_ these comprehensive features. >Both are incredibly fast for vector search. Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.
RAG Has Been Dying Since The Release Of The First Long-Context LLM - The claim that RAG is "dead" emerges whenever new LLMs with longer context windows are released. - Despite these claims, **RAG continues to be relevant** and useful. >Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is…
How Vector Search Differs From Reranking Vector search and reranking serve different purposes in information retrieval: - **Vector search** returns a _small portion_ of results from the entire database. - **Reranking** takes an _existing list_ of items and returns that list in a _re-ordered_ sequence. >Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list.
Finding the Right Chunk Size to Embed Is Not Trivial Determining the optimal chunk size for embedding is a challenge because: - Chunks that are **too small** can lead to a _loss of important context_. - Chunks that are **too big** can result in a _loss of semantic meaning_. Many models use mean pooling, averaging token embeddings into a single vector, which can make large chunks semantically unclear, even if technically embeddable. >Too small, and you’ll lose important context. Too big, and you’ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.
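
Mean pooling itself is just an average over token embeddings, which is exactly why a very large chunk averages its meaning away (see the movie-poster analogy in the next card). A minimal sketch with random stand-in embeddings:

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average all token embeddings into one chunk embedding (what many models do)."""
    return np.mean(token_embeddings, axis=0)

rng = np.random.default_rng(0)
small_chunk = rng.normal(size=(50, 384))     # 50 tokens
huge_chunk = rng.normal(size=(8000, 384))    # a whole document's worth of tokens
print(mean_pool(small_chunk).shape, mean_pool(huge_chunk).shape)  # same shape either way:
# the longer the chunk, the more the individual token meanings get averaged out
```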
What is a helpful analogy for understanding chunk size for embedding? A useful analogy for understanding the challenge of chunk size is thinking of it like creating a movie poster by overlaying every single frame of the movie. - All the original information from the movie is technically present. - However, you still won't be able to understand the movie's plot or meaning from such an aggregated image. >I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.
Vector Indexing Libraries Are Different From Vector Databases While both vector indexing libraries and vector databases excel at fast vector search and are useful for tutorials, vector databases offer additional data management features: - Built-in **persistence** - **CRUD support** (Create, Read, Update, Delete) - **Metadata filtering** - **Hybrid search** capabilities >Both are incredibly fast for vector search. Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.
What data management features do vector databases offer beyond vector indexing libraries? Vector databases provide several essential data management features that vector indexing libraries typically lack: - They include **built-in persistence** for data. - They support **CRUD operations** (Create, Read, Update, Delete). - They enable **metadata filtering** and **hybrid search**. >Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.
RAG Has Been Dying Since the Release of the First Long-Context LLM Despite claims that Retrieval-Augmented Generation (RAG) is becoming obsolete with the release of Large Language Models (LLMs) featuring longer context windows, **RAG has not died**. Each time a new LLM with expanded context is released, the claim resurfaces, but RAG continues to be a relevant and effective technique. >Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is…
You Can Throw Out 97% of the Information and Still Retrieve (Somewhat) Accurately This phenomenon is known as **vector quantization**. For instance, with binary quantization, a 32-bit float vector can be converted into a 1-bit binary vector, achieving a **32x storage reduction**. Surprisingly, retrieval accuracy can remain quite good in some use cases despite this significant reduction in data. >It’s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you’ll be surprised how well retrieval will remain to work (in some use cases).
What is vector quantization? Vector quantization is a technique used to reduce the storage size of vector embeddings. - It converts high-precision vectors (like 32-bit floats) into a more compressed format (e.g., 1-bit binary vectors). - This can lead to significant storage reduction (e.g., 32x) while surprisingly maintaining effective retrieval accuracy for certain applications. >It’s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you’ll be surprised how well retrieval will remain to work (in some use cases).
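
A small sketch of binary quantization and Hamming-distance retrieval (thresholding each dimension at zero; production systems typically combine this with rescoring on the full-precision vectors):

```python
import numpy as np

def binarize(vectors):
    """Binary quantization: keep only the sign of each dimension (32-bit float -> 1 bit)."""
    return (np.asarray(vectors) > 0).astype(np.uint8)

def hamming_search(query_bits, db_bits, top_k=3):
    """Rank stored items by Hamming distance (number of differing bits)."""
    dists = np.count_nonzero(db_bits != query_bits, axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))
q = db[42] + rng.normal(scale=0.1, size=128)      # a query close to item 42
print(hamming_search(binarize(q), binarize(db)))   # item 42 should rank near the top
```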
Vector Search Is Not Robust To Typos Vector search is generally **not robust to typos** for a key reason: - It's highly improbable that all possible typographical errors of a word are sufficiently represented in the training data of embedding models. - While some simple typos might be handled, vector search cannot be relied upon to correct or robustly handle a wide range of misspellings. >For a while, I thought that vector search was robust to typos because these large corpora of text surely must contain a lot of typos and therefore help the embedding model learn these typos as well. But if you think about it, there’s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle _some_ typos, you can’t really say it is robust to them.
Why is vector search not robust to typos? Vector search is not robust to typos primarily because: - The extensive training datasets used by embedding models are unlikely to contain sufficient examples of *all possible* typos for every word. - This limitation means that while vector search might tolerate _some_ minor typos, it cannot reliably correct or retrieve results for significant or uncommon misspellings. >But if you think about it, there’s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle _some_ typos, you can’t really say it is robust to them.
Knowing When to Use Which Metric to Evaluate Search Results Choosing the right metric for evaluating search results depends on the use case: - **NDCG@k** is a prominent metric often seen in academic benchmarks like BEIR. - Simpler metrics such as **precision** and **recall** are often well-suited for many practical applications. >There are many different metrics to evaluate search results. Looking at academic benchmarks, like BEIR, you’ll notice that NDCG@k is prominent. But simpler metrics like precision and recall are a great fit for many use cases.
The Precision-Recall Trade-Off The precision-recall trade-off illustrates the inverse relationship between these two metrics: - **High precision** means most returned results are relevant, but you might miss many existing relevant items (low recall). - **High recall** means you found most, if not all, relevant items, but you might also return many irrelevant items (low precision). It's a balance between returning only relevant items and returning all relevant items. >is often depicted with a fisherman’s analogy of casting a net, but this e-commerce analogy made it click better for me: Imagine you have a webshop with 100 books, out of which 10 are ML-related. Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that’s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). And also, that’s not so good for your business. Maybe the user didn’t liked that one ML-related book you returned. On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted… That’s **perfect recall** because you returned all relevant results. It’s just that you also returned a bunch of irrelevant results, which can be measured by how **bad the precision** is.
How is precision defined in search evaluation? **Precision** measures the proportion of _returned search results_ that are actually relevant. - For example, if you return 1 book and it's relevant, your precision is perfect. - It answers: "Of the items I returned, how many were correct?" >Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant).
How is recall defined in search evaluation? **Recall** measures the proportion of _all existing relevant items_ that were successfully retrieved by the search system. - For example, if there are 10 relevant books and you only return 1, your recall is low. - If you return all 10 relevant books (even among many irrelevant ones), your recall is perfect. - It answers: "Of all the correct items out there, how many did I find?" >But that’s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). ...That’s **perfect recall** because you returned all relevant results.
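
The bookshop example from these cards, computed in a few lines (illustrative only):

```python
def precision_at_k(retrieved, relevant, k):
    """Of the k results returned, how many were relevant?"""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant results that exist, how many did we return?"""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

relevant = {f"ml_book_{i}" for i in range(10)}      # 10 relevant ML books in the shop
one_result = ["ml_book_0"]                          # return just one of them
print(precision_at_k(one_result, relevant, k=1))    # 1.0 -> perfect precision
print(recall_at_k(one_result, relevant, k=1))       # 0.1 -> bad recall
```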
There Are Metrics That Include the Order The order of search results can be crucial in certain use cases, similar to a Google search. - Metrics like **precision** and **recall** do _not_ take the order of results into account. - If rank is important for your use case, you should choose **rank-aware metrics** such as: - **MRR@k** (Mean Reciprocal Rank) - **MAP@k** (Mean Average Precision) - **NDCG@k** (Normalized Discounted Cumulative Gain) >When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don’t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k.
Tokenizers Matter Tokenizers are a critical component affecting search performance, not just in the context of Transformer models. - They are essential for the performance of **keyword search**. - Since hybrid search often combines keyword search with vector search, the tokenizer's impact on keyword performance directly affects **hybrid search performance** as well. >If you’ve been in the Transformer’s bubble too long, you’ve probably forgotten that other tokenizers exist next to Byte-Pair-Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.
There Are Metrics That Include The Order - Metrics like **precision** and **recall** _do not_ consider the order of search results. - If search result rank is important, use **rank-aware metrics** such as: - MRR@k (Mean Reciprocal Rank at k) - MAP@k (Mean Average Precision at k) - NDCG@k (Normalized Discounted Cumulative Gain at k) >When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don’t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k.
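
MRR@k is arguably the simplest of these to compute by hand: the reciprocal rank of the first relevant result, averaged over queries. A small sketch:

```python
def mrr_at_k(ranked_results_per_query, relevant_per_query, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit (0 if none in top k)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_results_per_query, relevant_per_query):
        for rank, doc in enumerate(retrieved[:k], start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results_per_query)

print(mrr_at_k([["b", "a", "c"], ["x", "y"]], [{"a"}, {"x"}]))   # (1/2 + 1/1) / 2 = 0.75
```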
Which search metrics _do not_ consider the order of results? - **Precision** and **recall** are search metrics that do not take into account the _order_ in which search results are returned. - They evaluate relevance based on the set of results, regardless of their position. >Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that’s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books).
What is the impact on precision and recall when only one relevant item is returned from many? - Returning only **one relevant item** (e.g., one ML book out of ten) results in: - **Perfect precision**: All returned results are relevant. - **Bad recall**: Only a small fraction of the total relevant results are returned. >Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that’s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books).
How does returning _all_ available items, including many irrelevant ones, affect recall and precision? - If you return your **entire selection of items** (e.g., all 100 books, unsorted), it leads to: - **Perfect recall**: All relevant results that exist are returned. - **Bad precision**: Many irrelevant results are also returned alongside the relevant ones. >On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted… That’s **perfect recall** because you returned all relevant results. It’s just that you also returned a bunch of irrelevant results, which can be measured by how **bad the precision** is.
Tokenizers Matter - **Tokenizers** are crucial beyond Byte-Pair-Encoding (BPE) which is common in Transformer models. - They are essential for: - **Keyword search** performance. - **Hybrid search** performance, as it relies on keyword search. >If you’ve been in the Transformer’s bubble too long, you’ve probably forgotten that other tokenizers exist next to Byte-Pair-Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.
Why are tokenizers important for search performance? - Tokenizers are important because they directly impact the performance of: - **Keyword search**: How text is broken down affects matching relevant terms. - **Hybrid search**: Since hybrid search often incorporates keyword-based methods, tokenizer performance directly influences its overall effectiveness. >Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.
Out-Of-Domain Is Not The Same As Out-Of-Vocabulary - There is a key distinction between **out-of-domain (OOD)** and **out-of-vocabulary (OOV)** terms. - Earlier embedding models failed on OOV terms, but modern models handle them graciously with smart tokenization. - However, OOD terms result in **meaningless vector embeddings** even if they look like proper embeddings. >Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of “Labubu”, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.
How do modern embedding models handle out-of-vocabulary terms? - Modern embedding models, using **smart tokenization**, can handle **out-of-vocabulary (OOV)** terms _graciously_. - Unlike earlier models that would fail, unseen OOV terms can now be processed without errors. >Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of “Labubu”, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.
What is the consequence of an out-of-domain term for vector embeddings? - When a term is **out-of-domain (OOD)** for an embedding model, its generated vector embedding, while appearing like a proper embedding, is **meaningless**. - This occurs even if the term is handled without error by smart tokenization. >With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.
Query Optimizations - Users have historically learned to **optimize search queries** for **keyword search** (e.g., using keywords like "longest river africa" instead of full questions). - A similar learning curve is now necessary to **optimize queries for vector search**. >You know how you’ve learned to type “longest river africa” into Google’s search bar, instead of “What is the name of the longest river in Africa?”. You’ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?). Similarly, we now need to learn how to optimize our search queries for vector search now.
How have users historically optimized search queries for keyword search? - Users have learned to optimize queries for keyword search by typing **concise keywords** rather than full questions. - An example is typing "**longest river africa**" instead of "What is the name of the longest river in Africa?". >You know how you’ve learned to type “longest river africa” into Google’s search bar, instead of “What is the name of the longest river in Africa?”. You’ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?).
What new challenge arises in optimizing search queries for modern systems? - Just as users learned to optimize queries for keyword search, there is now a need to learn how to **optimize search queries specifically for vector search**. - This involves understanding how to phrase queries to get the most relevant vector embeddings. >Similarly, we now need to learn how to optimize our search queries for vector search now.
What Comes After Vector Search? - Search technologies have evolved through distinct stages: 1. **Keyword-based search**: The initial approach. 2. **Vector search**: Enabled by Machine Learning models. 3. **Reasoning-based retrieval**: The current frontier, enabled by Large Language Models (LLMs) with reasoning capabilities. >First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.
Describe the historical progression of search technologies. - The progression of search technologies has been: - Starting with **keyword-based search**. - Evolving to **vector search**, facilitated by Machine Learning models. - Now moving towards **reasoning-based retrieval**, powered by LLMs with reasoning abilities. >First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.
Information Retrieval Is So Hot Right Now - Information Retrieval (IR) is an **exciting field** to work in, as it continuously evolves. - While working with LLMs is popular, **providing the best information** _for_ LLMs is equally important and falls within the field of retrieval. >I feel fortunate to get to work in this exciting space. Although working on and with LLMs seems to be the cool thing now, figuring out how to provide the best information for them is equally exciting. And that’s the field of retrieval.
What enduring aspect of information retrieval remains crucial despite new trends? - The **importance of finding the best information** to provide to a Large Language Model (LLM) so it can generate the best possible answer remains crucial. - This fundamental role of retrieval persists even with new developments like RAG or "context engineering." >When I joined Weaviate, vector databases were the hot new thing. Then came RAG. Now, we’re talking about “context engineering”. _But what hasn’t changed is the importance of finding the best information to give the LLM so it can provide the best possible answer._
{
"name": "37 Things About Information Retrieval - Leonie Monigatti",
"cards": [
{
"front": "BM25 Is A Strong Baseline For Search",
"back": "BM25 is a **keyword search** algorithm. It is recommended to start with simple baselines like BM25 before moving to more complex methods like vector search.\n\n>Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search."
},
{
"front": "Why should you start with BM25 before vector search?",
"back": "You should start with BM25 because it is a **simple** and **strong baseline** for keyword search. This pragmatic approach helps establish a foundational search system before introducing the complexity of vector search.\n\n>Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search."
},
{
"front": "Vector Search In Vector Databases Is Approximate And Not Exact",
"back": "Vector search is **approximate** because brute-force exact k-nearest neighbor (KNN) computations do not scale well. Vector databases use **Approximate Nearest Neighbor (ANN)** algorithms to achieve speed at scale, trading off a small amount of accuracy.\n\n>In theory, you could run a brute-force search to compute distances between a query vector and every vector in the database using exact k-nearest neighbors (KNN). But this doesn\u2019t scale well. That\u2019s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale."
},
{
"front": "What algorithms enable fast, approximate vector search?",
"back": "Vector databases use **Approximate Nearest Neighbor (ANN)** algorithms to speed up search at scale. Examples include:\n- **HNSW** (Hierarchical Navigable Small World)\n- **IVF** (Inverted File Index)\n- **ScaNN** (Scalable Nearest Neighbors)\nThese algorithms introduce a small trade-off in accuracy for significant speed gains.\n\n>That\u2019s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale."
},
{
"front": "Vector Databases Don\u2019t Only Store Embeddings",
"back": "Besides embeddings, vector databases also store:\n- The **original object** (e.g., the text from which embeddings were generated)\n- **Metadata**\nThis enables features beyond just vector search, such as metadata filtering and hybrid search.\n\n>They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search."
},
{
"front": "What additional features are supported by vector databases due to storing original objects and metadata?",
"back": "By storing original objects and metadata, vector databases can support features such as:\n- **Metadata filtering**\n- **Keyword search**\n- **Hybrid search**\nThese capabilities extend their utility beyond purely vector-based search.\n\n>They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search."
},
{
"front": "Vector Databases\u2019 Main Application Is Not In Generative AI",
"back": "The **primary application** of vector databases is **search**. Although finding relevant context for Large Language Models (LLMs) is a form of search, their utility extends beyond only generative AI.\n\n>It\u2019s in search. But finding relevant context for LLMs is \u2018search\u2019. That\u2019s why vector databases and LLMs go together like cookies and cream."
},
{
"front": "You Have To Specify How Many Results You Want To Retrieve",
"back": "When performing a vector search, it is crucial to define the maximum number of results you want. Without parameters like `limit` or `top_k`, vector search would return _all_ objects stored in the database, sorted by the distance to your query.\n\n>When I think back, I almost have to laugh because this was such a big \u201caha\u201d moment when I realized that you need to define the maximum number of results you want to retrieve. It\u2019s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren\u2019t a `limit` or `top_k` parameter."
},
{
"front": "What parameters are used to limit vector search results?",
"back": "To control the number of results retrieved in a vector search, you must specify a maximum limit using parameters such as:\n- `limit`\n- `top_k`\nWithout these, the search would return _all_ objects, sorted by distance.\n\n>It\u2019s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren\u2019t a `limit` or `top_k` parameter."
},
{
"front": "There Are Many Different Types Of Embeddings",
"back": "Beyond the commonly visualized **dense vectors**, there are other types of embeddings, including:\n- **Sparse vectors** (e.g., `[0, 2, 0, ..., 1]`)\n- **Binary vectors** (e.g., `[0, 1, 1, ..., 0]`)\n- **Multi-vector embeddings** (e.g., `[[-0.9837, ...], [0.1044, ...]]`)\nEach type serves different purposes in information retrieval.\n\n>When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, \u2026, -0.2049]. That\u2019s called a dense vector, and it is the most commonly used type of vector embedding. But there\u2019s also many other types of vectors, such as sparse ([0, 2, 0, \u2026, 1]), binary ([0, 1, 1, \u2026, 0]), and multi-vector embeddings ([[-0.9837, \u2026, -0.2049], [ 0.1044, \u2026, 0.0090], \u2026, [-0.0937, \u2026, 0.5044]]), which can be used for different purposes."
},
{
"front": "What is the most commonly used type of vector embedding?",
"back": "The most commonly used type of vector embedding is the **dense vector**, often visualized as a continuous array of floating-point numbers, such as `[-0.9837, 0.1044, ..., -0.2049]`.\n\n>When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, \u2026, -0.2049]. That\u2019s called a dense vector, and it is the most commonly used type of vector embedding."
},
{
"front": "Fantastic Embedding Models And Where To Find Them",
"back": "To find fantastic embedding models, you should check:\n- The **Massive Text Embedding Benchmark (MTEB)** leaderboard, which covers various tasks like classification, clustering, and retrieval.\n- **BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models)**, specifically for information retrieval evaluation.\n\n>The first place to go is the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval. If you\u2019re focused on information retrieval, you might want to check out [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://github.com/beir-cellar/beir)."
},
{
"front": "The Majority Of Embedding Models On MTEB Are English",
"back": "While the **Massive Text Embedding Benchmark (MTEB)** leaderboard contains many excellent models, most are designed for **English languages**. For multilingual or non-English applications, **MMTEB** is a better resource.\n\n>If you\u2019re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1)."
},
{
"front": "Where can you find multilingual embedding models?",
"back": "If you are working with multilingual or non-English languages, you should explore **MMTEB (Massive Multilingual Text Embedding Benchmark)**, as the majority of models on the MTEB leaderboard are English-focused.\n\n>If you\u2019re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1)."
},
{
"front": "A Little History On Vector Embeddings",
"back": "Before modern **contextual embeddings** (e.g., BERT), there were **static embeddings** (e.g., Word2Vec, GloVe). Static embeddings give each word a fixed representation, while contextual embeddings generate different representations based on the surrounding text, making them more expressive.\n\n>Before there were today\u2019s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today\u2019s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables."
},
{
"front": "What is the difference between static and contextual embeddings?",
"back": "- **Static embeddings** (like Word2Vec) assign a _fixed representation_ to each word, regardless of its context.\n- **Contextual embeddings** (like BERT) generate _different representations_ for the same word based on its surrounding context, making them more expressive.\nStatic embeddings can still be useful in computationally _restrained environments_ as they can be looked up from pre-computed tables.\n\n>Before there were today\u2019s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today\u2019s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables."
},
{
"front": "Fantastic Embedding Models And Where To Find Them",
"back": "To find fantastic embedding models, the primary resource is the **Massive Text Embedding Benchmark (MTEB)** leaderboard. It covers a wide range of tasks for embedding models, including:\n- Classification\n- Clustering\n- Retrieval\n\n>The first place to go is the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval."
},
{
"front": "What is BEIR and what is its focus?",
"back": "BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models) is a benchmark specifically focused on **information retrieval**.\n\n>If you\u2019re focused on information retrieval, you might want to check out [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://github.com/beir-cellar/beir)."
},
{
"front": "The Majority Of Embedding Models On MTEB Are English",
"back": "Most embedding models found on MTEB are designed for English. For working with **multilingual or non-English languages**, it is recommended to check out the **MMTEB (Massive Multilingual Text Embedding Benchmark)**.\n\n>If you\u2019re working with multilingual or non-English languages, it might be worth checking out [MMTEB (Massive Multilingual Text Embedding Benchmark)](https://arxiv.org/html/2502.13595v1)."
},
{
"front": "A Little History On Vector Embeddings",
"back": "Before the advent of today's **contextual embeddings** (e.g., BERT), there were **static embeddings** such as Word2Vec and GloVe.\n\n>Before there were today\u2019s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe)."
},
{
"front": "What is the difference between static and contextual embeddings?",
"back": "- **Static embeddings**: Provide a _fixed_ representation for each word.\n- **Contextual embeddings**: Generate _different_ representations for the same word based on its _surrounding context_.\n- Static embeddings can still be useful in _computationally restrained environments_ because they can be looked up from pre-computed tables.\n\n>They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today\u2019s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables."
},
{
"front": "Don\u2019t Confuse Sparse Vectors And Sparse Embeddings",
"back": "- **Sparse vectors** can be generated in two main ways:\n - By applying _statistical scoring functions_ like TF-IDF or BM25 to term frequencies.\n - With _neural sparse embedding models_ like SPLADE.\n- A **sparse embedding** _is_ a type of sparse vector, but _not all_ sparse vectors are necessarily sparse embeddings.\n\n>It took me a while until I understood that sparse vectors can be generated in different ways: Either by applying statistical scoring functions like TF-IDF or BM25 to term frequencies (often retrieved via inverted indexes), or with neural sparse embedding models like SPLADE. That means a sparse embedding is a sparse vector, but not all sparse vectors are necessarily sparse embeddings."
},
{
"front": "Embed All The Things",
"back": "Embeddings are not exclusively for text. You can embed various data types, including:\n- Images\n- PDFs as images (e.g., ColPali)\n- Graphs\nThis capability enables **multimodal vector search** over diverse data.\n\n>Embeddings aren\u2019t just for text. You can embed images, PDFs as images (see [ColPali](https://arxiv.org/abs/2407.01449)), graphs, etc. And that means you can do vector search over multimodal data. It\u2019s pretty incredible. You should try it sometime."
},
{
"front": "The Economics Of Vector Embeddings",
"back": "- The **vector dimensions** directly impact the required _storage cost_.\n- For example, choosing a model with 1536 dimensions over one with 768 dimensions can _double_ your storage requirements.\n- While more dimensions capture _more semantic nuances_, a very high number of dimensions may not always be necessary for common tasks like \"chat with your docs\".\n\n>This shouldn\u2019t be a surprise, but the vector dimensions will impact the required storage cost. So, consider whether it is worth it before you choose an embedding model with 1536 dimensions over one with 768 dimensions and risk doubling your storage requirements. Yes, more dimensions capture more semantic nuances."
},
{
"front": "What is Matryoshka Representation Learning?",
"back": "Matryoshka Representation Learning is a technique that allows you to _shorten_ vector embeddings. This is beneficial for environments with _less computational resources_, while aiming to maintain _minimal performance losses_.\n\n>Some models actually use [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) to allow you to shorten vector embeddings for environments with less computational resources, with minimal performance losses."
},
{
"front": "\u201cChat With Your Docs\u201d Tutorials Are The \u201cHello World\u201d Programs Of Generative AI",
"back": "The phrase \"chat with your docs\" tutorials are considered the \"Hello world\" programs of Generative AI. This means they are foundational, basic, or introductory examples in the field.\n\n>\u201cChat with your docs\u201d tutorials are the \u201cHello world\u201d programs of Generative AI."
},
{
"front": "You Need To Call The Embedding Model A LOT",
"back": "Calling the embedding model is a frequent necessity, not just during initial data ingestion. It is required:\n- Every time a **search query** is run (the query must be embedded).\n- When **adding new objects** later on (they need embedding and indexing).\n- If you **change the embedding model** (all existing data must be re-embedded and re-indexed).\n\n>Just because you embedded your documents during the ingestion stage, doesn\u2019t mean you\u2019re done calling the embedding model. Every time you run a search query, the query must also be embedded (if you\u2019re not using a cache). If you\u2019re adding objects later on, those must also be embedded (and indexed). If you\u2019re changing the embedding model, you must also re-embed (and re-index) everything."
},
{
"front": "Similar Does Not Necessarily Mean Relevant",
"back": "- Vector search identifies objects based on their _similarity_ to a query, measured by proximity in vector space.\n- However, _similarity_ does not always equate to _relevance_ to the user's intent.\n- For example, \"How to fix a faucet\" and \"Where to buy a kitchen faucet\" might be similar in vector space but not relevant to each other.\n\n>Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., \u201cHow to fix a faucet\u201d and \u201cWhere to buy a kitchen faucet\u201d) does not mean they are relevant to each other."
},
{
"front": "Cosine Similarity And Cosine Distance Are Not The Same Thing",
"back": "- Cosine similarity and cosine distance are _related_ but distinct concepts.\n- They are _inverses_ of each other.\n- If two vectors are exactly the same:\n - Their similarity is **1**.\n - The distance between them is **0**.\n\n>But they are related to each other (). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0."
},
{
"front": "What is the relationship between cosine similarity and dot product for normalized vectors?",
"back": "- For **normalized vectors**, cosine similarity and dot product are _mathematically equivalent_.\n- In this scenario, the **dot product** is generally _more efficient_ for computation.\n\n>Because mathematically, they are the same. For the calculation, dot product is more efficient."
},
{
"front": "Common Misconception: The R In RAG Stands For \u2018Vector Search\u2019",
"back": "- The **\"R\"** in RAG (Retrieval-Augmented Generation) stands for **'retrieval'**, not specifically 'vector search'.\n- Retrieval can be accomplished through various methods beyond just vector search.\n\n>It doesn\u2019t. It stands for \u2018retrieval\u2019. And retrieval can be done in many different ways (see following bullets)."
},
{
"front": "Vector Search Is Just One Tool In The Retrieval Toolbox",
"back": "- Vector search is only _one_ tool within a broader **retrieval toolbox**.\n- Other essential tools include:\n - **Keyword-based search**\n - **Filtering**\n - **Reranking**\n- Combining these different tools is crucial for building effective and robust retrieval systems.\n\n>There\u2019s also keyword-based search, filtering, and reranking. It\u2019s not one over the other. To build something great, you will need to combine it with different tools."
},
{
"front": "Similar Does Not Necessarily Mean Relevant",
"back": "- Vector search returns objects based on their _proximity_ in vector space, which signifies _similarity_.\n\n>Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., \u201cHow to fix a faucet\u201d and \u201cWhere to buy a kitchen faucet\u201d) does not mean they are relevant to each other."
},
{
"front": "For normalized vectors, which is more computationally efficient: cosine similarity or dot product?",
"back": "- For _normalized vectors_, both cosine similarity and dot product are mathematically the **same**.- However, the **dot product** is generally _more efficient_ for computation.\n\n>If you\u2019re working with normalized vectors, it doesn\u2019t matter whether you\u2019re using cosine similarity or dot product for the similarity measure. Because mathematically, they are the same. For the calculation, dot product is more efficient."
},
{
"front": "Cosine Similarity And Cosine Distance Are Not The Same Thing",
"back": "- Cosine similarity and cosine distance are _related_ but are _not identical_.- They are _inverses_ of each other: if two vectors are exactly the same, their similarity is 1, and their distance is 0.\n\n>But they are related to each other (). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0."
},
{
"front": "How can hybrid search combine keyword and vector search using parameters?",
"back": "- In some implementations (e.g., Weaviate), a hybrid search function allows you to combine keyword-based and vector-based search.- The `alpha` parameter can then be used to _adjust the weighting_ from pure keyword-based search, to a mix of both, or to pure vector search.\n\n>In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the `alpha` parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search."
},
{
"front": "If You\u2019re Working With Normalized Vectors, It Doesn\u2019t Matter Whether You\u2019re Using Cosine Similarity Or Dot Product For The Similarity Measure",
"back": "- For _normalized vectors_, it does _not matter_ whether you use cosine similarity or dot product.- Mathematically, they are the **same** calculation in this context.- The **dot product** is generally _more efficient_ for computation.\n\n>Because mathematically, they are the same. For the calculation, dot product is more efficient."
},
{
"front": "What analogy describes the issue of losing semantic meaning with large chunk sizes and mean pooling?",
"back": "- The issue is like creating a **movie poster** by _overlaying every single frame_ of the movie.- While all the information is technically present, the resulting image is _unintelligible_, and the overall meaning of the movie is lost.\n\n>I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won\u2019t understand what the movie is about."
},
{
"front": "Common Misconception: The R In RAG Stands For \u2018Vector Search\u2019",
"back": "- The 'R' in RAG stands for **'retrieval'**, not specifically 'vector search'.- _Retrieval_ encompasses various methods, including but not limited to vector search.\n\n>It doesn\u2019t. It stands for \u2018retrieval\u2019. And retrieval can be done in many different ways (see following bullets)."
},
{
"front": "Vector Search Is Just One Tool In The Retrieval Toolbox",
"back": "- Vector search is only _one component_ in a comprehensive retrieval system.- Effective retrieval often requires combining vector search with other techniques like **keyword search**, **filtering**, and **reranking**.\n\n>There\u2019s also keyword-based search, filtering, and reranking. It\u2019s not one over the other. To build something great, you will need to combine it with different tools."
},
{
"front": "When To Use Keyword-Based Search Vs. Vector-Based Search",
"back": "- Use **vector-based search** for matching _semantics_ and _synonyms_ (e.g., \u201cpastel colors\u201d vs.\u201clight pink\u201d).- Use **keyword-based search** for _exact keywords_ (e.g., \u201cA-line skirt\u201d, \u201cpeplum dress\u201d).- For use cases requiring **both**, **hybrid search** is beneficial.\n\n>Does your use case require mainly matching semantics and synonyms (e.g., \u201cpastel colors\u201d vs.\u201clight pink\u201d) or exact keywords (e.g., \u201cA-line skirt\u201d, \u201cpeplum dress\u201d)? If it requires both (e.g., \u201cpastel colored A-line skirt\u201d), you might benefit from combining both and using hybrid search. In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the `alpha` parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search."
},
{
"front": "Hybrid Search Can Be A Hybrid Of Different Search Techniques",
"back": "- While often meaning keyword and vector search combination, 'hybrid' is _broader_.- It can also refer to combining **vector-based search** with **search over structured data** (metadata filtering).\n\n>Most often, when you hear people talk about hybrid search, they mean the combination of keyword-based search and vector-based search. But the term \u2018hybrid\u2019 doesn\u2019t specify which techniques to combine. So, sometimes you might hear people talk about hybrid search, meaning the combination of vector-based search and search over structured data (often referred to as metadata filtering)."
},
{
"front": "Misconception: Filtering Makes Vector Search Faster",
"back": "- This is a misconception because filtering does _not always improve_ search latency.- _Pre-filtering_ can disrupt underlying index structures (e.g., HNSW graph connectivity).- _Post-filtering_ may lead to an empty result set.- Vector databases employ _complex methods_ to address this.\n\n>Intuitively, you\u2019d think using a filter should speed up search latency because you\u2019re reducing the number of candidates to search through. But in practice, pre-filtering candidates can, for example, break the graph connectivity in HNSW, and post-filtering can leave you with no results at all. Vector databases have different, sophisticated techniques to handle this challenge."
},
{
"front": "Two-Stage Retrieval Pipelines Aren\u2019t Only For Recommendation Systems",
"back": "- Two-stage pipelines are common in **recommendation systems** but are also applicable to **RAG pipelines**.- The _first stage_ uses a simpler, faster process (e.g., vector search) to _reduce candidates_.- The _second stage_ uses a more compute-intensive but _more accurate_ reranking process.\n\n>Recommendation systems often have a first retrieval stage that uses a simpler retrieval process (e.g., vector search) to reduce the number of potential candidates, which is followed by a second retrieval stage with a more compute-intensive but more accurate reranking stage. You can apply this to your RAG pipeline as well."
},
{
"front": "How Vector Search Differs From Reranking",
"back": "- **Vector search** _retrieves_ a small portion of results from the _entire database_.- **Reranking** _re-orders_ an _already provided list_ of items.\n\n>Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list."
},
{
"front": "Finding The Right Chunk Size To Embed Is Not Trivial",
"back": "- Selecting the correct chunk size is _challenging_: **Too small** means losing _important context_.- **Too big** means losing _semantic meaning_ due to averaging (mean pooling).- Many embedding models use **mean pooling**, which can dilute meaning in large chunks.\n\n>Too small, and you\u2019ll lose important context. Too big, and you\u2019ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won\u2019t understand what the movie is about."
},
{
"front": "Vector Indexing Libraries Are Different From Vector Databases",
"back": "- Both are fast for vector search and useful for tutorials.- **Vector databases** offer additional _data management features_ such as built-in persistence, CRUD support, metadata filtering, and hybrid search.- **Vector indexing libraries** _lack_ these comprehensive features.\n\n>Both are incredibly fast for vector search. Both work really well to showcase vector search in \u201cchat with your docs\u201d-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search."
},
{
"front": "RAG Has Been Dying Since The Release Of The First Long-Context LLM",
"back": "- The claim that RAG is \"dead\" emerges whenever new LLMs with longer context windows are released.- Despite these claims, **RAG continues to be relevant** and useful.\n\n>Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is\u2026"
},
{
"front": "How Vector Search Differs From Reranking",
"back": "Vector search and reranking serve different purposes in information retrieval:\n- **Vector search** returns a _small portion_ of results from the entire database.\n- **Reranking** takes an _existing list_ of items and returns that list in a _re-ordered_ sequence.\n\n>Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list."
},
{
"front": "Finding the Right Chunk Size to Embed Is Not Trivial",
"back": "Determining the optimal chunk size for embedding is a challenge because:\n- Chunks that are **too small** can lead to a _loss of important context_.\n- Chunks that are **too big** can result in a _loss of semantic meaning_.\nMany models use mean pooling, averaging token embeddings into a single vector, which can make large chunks semantically unclear, even if technically embeddable.\n\n>Too small, and you\u2019ll lose important context. Too big, and you\u2019ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won\u2019t understand what the movie is about."
},
{
"front": "What is a helpful analogy for understanding chunk size for embedding?",
"back": "A useful analogy for understanding the challenge of chunk size is thinking of it like creating a movie poster by overlaying every single frame of the movie.\n- All the original information from the movie is technically present.\n- However, you still won't be able to understand the movie's plot or meaning from such an aggregated image.\n\n>I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won\u2019t understand what the movie is about."
},
{
"front": "Vector Indexing Libraries Are Different From Vector Databases",
"back": "While both vector indexing libraries and vector databases excel at fast vector search and are useful for tutorials, vector databases offer additional data management features:\n- Built-in **persistence**\n- **CRUD support** (Create, Read, Update, Delete)\n- **Metadata filtering**\n- **Hybrid search** capabilities\n\n>Both are incredibly fast for vector search. Both work really well to showcase vector search in \u201cchat with your docs\u201d-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search."
},
{
"front": "What data management features do vector databases offer beyond vector indexing libraries?",
"back": "Vector databases provide several essential data management features that vector indexing libraries typically lack:\n- They include **built-in persistence** for data.\n- They support **CRUD operations** (Create, Read, Update, Delete).\n- They enable **metadata filtering** and **hybrid search**.\n\n>Both work really well to showcase vector search in \u201cchat with your docs\u201d-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search."
},
{
"front": "RAG Has Been Dying Since the Release of the First Long-Context LLM",
"back": "Despite claims that Retrieval-Augmented Generation (RAG) is becoming obsolete with the release of Large Language Models (LLMs) featuring longer context windows, **RAG has not died**.\nEach time a new LLM with expanded context is released, the claim resurfaces, but RAG continues to be a relevant and effective technique.\n\n>Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is\u2026"
},
{
"front": "You Can Throw Out 97% of the Information and Still Retrieve (Somewhat) Accurately",
"back": "This phenomenon is known as **vector quantization**.\nFor instance, with binary quantization, a 32-bit float vector can be converted into a 1-bit binary vector, achieving a **32x storage reduction**.\nSurprisingly, retrieval accuracy can remain quite good in some use cases despite this significant reduction in data.\n\n>It\u2019s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, \u2026, -0.2049] into [0, 1, 1, \u2026, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you\u2019ll be surprised how well retrieval will remain to work (in some use cases)."
},
{
"front": "What is vector quantization?",
"back": "Vector quantization is a technique used to reduce the storage size of vector embeddings.\n- It converts high-precision vectors (like 32-bit floats) into a more compressed format (e.g., 1-bit binary vectors).\n- This can lead to significant storage reduction (e.g., 32x) while surprisingly maintaining effective retrieval accuracy for certain applications.\n\n>It\u2019s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, \u2026, -0.2049] into [0, 1, 1, \u2026, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you\u2019ll be surprised how well retrieval will remain to work (in some use cases)."
},
{
"front": "Vector Search Is Not Robust To Typos",
"back": "Vector search is generally **not robust to typos** for a key reason:\n- It's highly improbable that all possible typographical errors of a word are sufficiently represented in the training data of embedding models.\n- While some simple typos might be handled, vector search cannot be relied upon to correct or robustly handle a wide range of misspellings.\n\n>For a while, I thought that vector search was robust to typos because these large corpora of text surely must contain a lot of typos and therefore help the embedding model learn these typos as well. But if you think about it, there\u2019s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle _some_ typos, you can\u2019t really say it is robust to them."
},
{
"front": "Why is vector search not robust to typos?",
"back": "Vector search is not robust to typos primarily because:\n- The extensive training datasets used by embedding models are unlikely to contain sufficient examples of *all possible* typos for every word.\n- This limitation means that while vector search might tolerate _some_ minor typos, it cannot reliably correct or retrieve results for significant or uncommon misspellings.\n\n>But if you think about it, there\u2019s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle _some_ typos, you can\u2019t really say it is robust to them."
},
{
"front": "Knowing When to Use Which Metric to Evaluate Search Results",
"back": "Choosing the right metric for evaluating search results depends on the use case:\n- **NDCG@k** is a prominent metric often seen in academic benchmarks like BEIR.\n- Simpler metrics such as **precision** and **recall** are often well-suited for many practical applications.\n\n>There are many different metrics to evaluate search results. Looking at academic benchmarks, like BEIR, you\u2019ll notice that NDCG@k is prominent. But simpler metrics like precision and recall are a great fit for many use cases."
},
{
"front": "The Precision-Recall Trade-Off",
"back": "The precision-recall trade-off illustrates the inverse relationship between these two metrics:\n- **High precision** means most returned results are relevant, but you might miss many existing relevant items (low recall).\n- **High recall** means you found most, if not all, relevant items, but you might also return many irrelevant items (low precision).\nIt's a balance between returning only relevant items and returning all relevant items.\n\n>is often depicted with a fisherman\u2019s analogy of casting a net, but this e-commerce analogy made it click better for me: Imagine you have a webshop with 100 books, out of which 10 are ML-related. Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that\u2019s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). And also, that\u2019s not so good for your business. Maybe the user didn\u2019t liked that one ML-related book you returned. On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted\u2026 That\u2019s **perfect recall** because you returned all relevant results. It\u2019s just that you also returned a bunch of irrelevant results, which can be measured by how **bad the precision** is."
},
{
"front": "How is precision defined in search evaluation?",
"back": "**Precision** measures the proportion of _returned search results_ that are actually relevant.\n- For example, if you return 1 book and it's relevant, your precision is perfect.\n- It answers: \"Of the items I returned, how many were correct?\"\n\n>Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant)."
},
{
"front": "How is recall defined in search evaluation?",
"back": "**Recall** measures the proportion of _all existing relevant items_ that were successfully retrieved by the search system.\n- For example, if there are 10 relevant books and you only return 1, your recall is low.\n- If you return all 10 relevant books (even among many irrelevant ones), your recall is perfect.\n- It answers: \"Of all the correct items out there, how many did I find?\"\n\n>But that\u2019s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). ...That\u2019s **perfect recall** because you returned all relevant results."
},
{
"front": "There Are Metrics That Include the Order",
"back": "The order of search results can be crucial in certain use cases, similar to a Google search.\n- Metrics like **precision** and **recall** do _not_ take the order of results into account.\n- If rank is important for your use case, you should choose **rank-aware metrics** such as:\n - **MRR@k** (Mean Reciprocal Rank)\n - **MAP@k** (Mean Average Precision)\n - **NDCG@k** (Normalized Discounted Cumulative Gain)\n\n>When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don\u2019t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k."
},
{
"front": "Tokenizers Matter",
"back": "Tokenizers are a critical component affecting search performance, not just in the context of Transformer models.\n- They are essential for the performance of **keyword search**.\n- Since hybrid search often combines keyword search with vector search, the tokenizer's impact on keyword performance directly affects **hybrid search performance** as well.\n\n>If you\u2019ve been in the Transformer\u2019s bubble too long, you\u2019ve probably forgotten that other tokenizers exist next to Byte-Pair-Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance."
},
{
"front": "There Are Metrics That Include The Order",
"back": "- Metrics like **precision** and **recall** _do not_ consider the order of search results.- If search result rank is important, use **rank-aware metrics** such as:- MRR@k (Mean Reciprocal Rank at k)- MAP@k (Mean Average Precision at k)- NDCG@k (Normalized Discounted Cumulative Gain at k)\n\n>When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don\u2019t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k."
},
{
"front": "Which search metrics _do not_ consider the order of results?",
"back": "- **Precision** and **recall** are search metrics that do not take into account the _order_ in which search results are returned.- They evaluate relevance based on the set of results, regardless of their position.\n\n>Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that\u2019s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books)."
},
{
"front": "What is the impact on precision and recall when only one relevant item is returned from many?",
"back": "- Returning only **one relevant item** (e.g., one ML book out of ten) results in:- **Perfect precision**: All returned results are relevant.- **Bad recall**: Only a small fraction of the total relevant results are returned.\n\n>Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have **perfect precision** (out of the k=1 results returned, how many were relevant). But that\u2019s **bad recall** (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books)."
},
{
"front": "How does returning _all_ available items, including many irrelevant ones, affect recall and precision?",
"back": "- If you return your **entire selection of items** (e.g., all 100 books, unsorted), it leads to:- **Perfect recall**: All relevant results that exist are returned.- **Bad precision**: Many irrelevant results are also returned alongside the relevant ones.\n\n>On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted\u2026 That\u2019s **perfect recall** because you returned all relevant results. It\u2019s just that you also returned a bunch of irrelevant results, which can be measured by how **bad the precision** is."
},
{
"front": "Tokenizers Matter",
"back": "- **Tokenizers** are crucial beyond Byte-Pair-Encoding (BPE) which is common in Transformer models.- They are essential for:- **Keyword search** performance.- **Hybrid search** performance, as it relies on keyword search.\n\n>If you\u2019ve been in the Transformer\u2019s bubble too long, you\u2019ve probably forgotten that other tokenizers exist next to Byte-Pair-Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance."
},
{
"front": "Why are tokenizers important for search performance?",
"back": "- Tokenizers are important because they directly impact the performance of:- **Keyword search**: How text is broken down affects matching relevant terms.- **Hybrid search**: Since hybrid search often incorporates keyword-based methods, tokenizer performance directly influences its overall effectiveness.\n\n>Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance."
},
{
"front": "Out-Of-Domain Is Not The Same As Out-Of-Vocabulary",
"back": "- There is a key distinction between **out-of-domain (OOD)** and **out-of-vocabulary (OOV)** terms.- Earlier embedding models failed on OOV terms, but modern models handle them graciously with smart tokenization.- However, OOD terms result in **meaningless vector embeddings** even if they look like proper embeddings.\n\n>Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of \u201cLabubu\u201d, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless."
},
{
"front": "How do modern embedding models handle out-of-vocabulary terms?",
"back": "- Modern embedding models, using **smart tokenization**, can handle **out-of-vocabulary (OOV)** terms _graciously_.- Unlike earlier models that would fail, unseen OOV terms can now be processed without errors.\n\n>Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of \u201cLabubu\u201d, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless."
},
{
"front": "What is the consequence of an out-of-domain term for vector embeddings?",
"back": "- When a term is **out-of-domain (OOD)** for an embedding model, its generated vector embedding, while appearing like a proper embedding, is **meaningless**.- This occurs even if the term is handled without error by smart tokenization.\n\n>With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless."
},
{
"front": "Query Optimizations",
"back": "- Users have historically learned to **optimize search queries** for **keyword search** (e.g., using keywords like \"longest river africa\" instead of full questions).- A similar learning curve is now necessary to **optimize queries for vector search**.\n\n>You know how you\u2019ve learned to type \u201clongest river africa\u201d into Google\u2019s search bar, instead of \u201cWhat is the name of the longest river in Africa?\u201d. You\u2019ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?). Similarly, we now need to learn how to optimize our search queries for vector search now."
},
{
"front": "How have users historically optimized search queries for keyword search?",
"back": "- Users have learned to optimize queries for keyword search by typing **concise keywords** rather than full questions.- An example is typing \"**longest river africa**\" instead of \"What is the name of the longest river in Africa?\".\n\n>You know how you\u2019ve learned to type \u201clongest river africa\u201d into Google\u2019s search bar, instead of \u201cWhat is the name of the longest river in Africa?\u201d. You\u2019ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?)."
},
{
"front": "What new challenge arises in optimizing search queries for modern systems?",
"back": "- Just as users learned to optimize queries for keyword search, there is now a need to learn how to **optimize search queries specifically for vector search**.- This involves understanding how to phrase queries to get the most relevant vector embeddings.\n\n>Similarly, we now need to learn how to optimize our search queries for vector search now."
},
{
"front": "What Comes After Vector Search?",
"back": "- Search technologies have evolved through distinct stages:1. **Keyword-based search**: The initial approach.2. **Vector search**: Enabled by Machine Learning models.3. **Reasoning-based retrieval**: The current frontier, enabled by Large Language Models (LLMs) with reasoning capabilities.\n\n>First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval."
},
{
"front": "Describe the historical progression of search technologies.",
"back": "- The progression of search technologies has been:- Starting with **keyword-based search**.- Evolving to **vector search**, facilitated by Machine Learning models.- Now moving towards **reasoning-based retrieval**, powered by LLMs with reasoning abilities.\n\n>First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval."
},
{
"front": "Information Retrieval Is So Hot Right Now",
"back": "- Information Retrieval (IR) is an **exciting field** to work in, as it continuously evolves.- While working with LLMs is popular, **providing the best information** _for_ LLMs is equally important and falls within the field of retrieval.\n\n>I feel fortunate to get to work in this exciting space. Although working on and with LLMs seems to be the cool thing now, figuring out how to provide the best information for them is equally exciting. And that\u2019s the field of retrieval."
},
{
"front": "What enduring aspect of information retrieval remains crucial despite new trends?",
"back": "- The **importance of finding the best information** to provide to a Large Language Model (LLM) so it can generate the best possible answer remains crucial.- This fundamental role of retrieval persists even with new developments like RAG or \"context engineering.\"\n\n>When I joined Weaviate, vector databases were the hot new thing. Then came RAG. Now, we\u2019re talking about \u201ccontext engineering\u201d. _But what hasn\u2019t changed is the importance of finding the best information to give the LLM so it can provide the best possible answer._"
}
]
}

BM25 Is A Strong Baseline For Search

BM25 is a keyword search algorithm. It is recommended to start with simple baselines like BM25 before moving to more complex methods like vector search.

Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.
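
A minimal baseline sketch, assuming the `rank_bm25` package (my addition, not part of the original article); any BM25 implementation would do:

```python
# Minimal BM25 keyword-search baseline (assumes `pip install rank-bm25`).
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a strong baseline for keyword search",
    "Vector search uses approximate nearest neighbor algorithms",
    "Hybrid search combines keyword and vector search",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized_corpus)
query = "keyword search baseline".lower().split()

print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 documents for the query
```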

===

Why should you start with BM25 before vector search?

You should start with BM25 because it is a simple and strong baseline for keyword search. This pragmatic approach helps establish a foundational search system before introducing the complexity of vector search.

Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.

===

Vector Search In Vector Databases Is Approximate And Not Exact

Vector search is approximate because brute-force exact k-nearest neighbor (KNN) computations do not scale well. Vector databases use Approximate Nearest Neighbor (ANN) algorithms to achieve speed at scale, trading off a small amount of accuracy.

In theory, you could run a brute-force search to compute distances between a query vector and every vector in the database using exact k-nearest neighbors (KNN). But this doesn’t scale well. That’s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.


===

What algorithms enable fast, approximate vector search?

Vector databases use Approximate Nearest Neighbor (ANN) algorithms to speed up search at scale. Examples include:

  • HNSW (Hierarchical Navigable Small World)
  • IVF (Inverted File Index)
  • ScaNN (Scalable Nearest Neighbors)

These algorithms introduce a small trade-off in accuracy for significant speed gains.

That’s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.
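
To make the trade-off tangible, here is a small sketch using the `hnswlib` library (an illustrative choice, not something the article prescribes) that builds an HNSW index and answers approximate queries; exact brute-force KNN over the same data would have to compare the query against all 10,000 vectors:

```python
# Approximate nearest-neighbor search with an HNSW index (assumes `pip install hnswlib`).
import hnswlib
import numpy as np

dim, n = 128, 10_000
vectors = np.random.random((n, dim)).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # HNSW build parameters
index.add_items(vectors, np.arange(n))
index.set_ef(50)  # higher ef = better recall, slower queries

labels, distances = index.knn_query(vectors[:1], k=5)  # approximate top-5 neighbors
```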

===

Vector Databases Don’t Only Store Embeddings

Besides embeddings, vector databases also store:

  • The original object (e.g., the text from which embeddings were generated)
  • Metadata

This enables features beyond just vector search, such as metadata filtering and hybrid search.

They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.

===

What additional features are supported by vector databases due to storing original objects and metadata?

By storing original objects and metadata, vector databases can support features such as:

  • Metadata filtering
  • Keyword search
  • Hybrid search

These capabilities extend their utility beyond purely vector-based search.

They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.

===

Vector Databases’ Main Application Is Not In Generative AI

The primary application of vector databases is search. Finding relevant context for Large Language Models (LLMs) is itself a form of search, which is why vector databases pair so well with generative AI, but their utility extends beyond it.

It’s in search. But finding relevant context for LLMs is ‘search’. That’s why vector databases and LLMs go together like cookies and cream.

===

You Have To Specify How Many Results You Want To Retrieve

When performing a vector search, it is crucial to define the maximum number of results you want. Without parameters like limit or top_k, vector search would return all objects stored in the database, sorted by the distance to your query.

When I think back, I almost have to laugh because this was such a big “aha” moment when I realized that you need to define the maximum number of results you want to retrieve. It’s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren’t a limit or top_k parameter.
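
A brute-force NumPy sketch (illustrative only) shows why the cut-off matters: drop the final slice and you get every stored vector back, sorted by distance.

```python
import numpy as np

def vector_search(query: np.ndarray, vectors: np.ndarray, top_k: int = 5):
    """Brute-force cosine-distance search; real databases use ANN indexes instead."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - v @ q
    order = np.argsort(distances)                    # every stored object, sorted by distance
    return order[:top_k], distances[order[:top_k]]   # the limit / top_k cut-off

store = np.random.random((1_000, 384)).astype(np.float32)   # stand-in for stored embeddings
query = np.random.random(384).astype(np.float32)            # stand-in for an embedded query
ids, dists = vector_search(query, store, top_k=3)
```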

===

What parameters are used to limit vector search results?

To control the number of results retrieved in a vector search, you must specify a maximum limit using parameters such as:

  • limit
  • top_k

Without these, the search would return all objects, sorted by distance.

It’s a little oversimplified, but vector search would return all the objects, stored in the database sorted by the distance to your query vector, if there weren’t a limit or top_k parameter.

===

There Are Many Different Types Of Embeddings

Beyond the commonly visualized dense vectors, there are other types of embeddings, including:

  • Sparse vectors (e.g., [0, 2, 0, ..., 1])
  • Binary vectors (e.g., [0, 1, 1, ..., 0])
  • Multi-vector embeddings (e.g., [[-0.9837, ...], [0.1044, ...]])

Each type serves different purposes in information retrieval.

When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That’s called a dense vector, and it is the most commonly used type of vector embedding. But there’s also many other types of vectors, such as sparse ([0, 2, 0, …, 1]), binary ([0, 1, 1, …, 0]), and multi-vector embeddings ([[-0.9837, …, -0.2049], [ 0.1044, …, 0.0090], …, [-0.0937, …, 0.5044]]), which can be used for different purposes.
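
Toy arrays (made-up values, not real model output) that illustrate the shapes involved:

```python
import numpy as np

dense  = np.array([-0.9837, 0.1044, 0.0090, -0.2049])   # one float per dimension
sparse = np.array([0, 2, 0, 0, 1])                       # mostly zeros, e.g. term weights
binary = np.array([0, 1, 1, 0], dtype=np.uint8)          # one bit per dimension
multi  = np.array([[-0.9837, -0.2049],                   # one vector per token/patch
                   [ 0.1044,  0.0090],
                   [-0.0937,  0.5044]])
```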

===

What is the most commonly used type of vector embedding?

The most commonly used type of vector embedding is the dense vector, often visualized as a continuous array of floating-point numbers, such as [-0.9837, 0.1044, ..., -0.2049].

When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That’s called a dense vector, and it is the most commonly used type of vector embedding.

===

Fantastic Embedding Models And Where To Find Them

To find fantastic embedding models, you should check:

  • The Massive Text Embedding Benchmark (MTEB) leaderboard, which covers various tasks like classification, clustering, and retrieval.
  • BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models), specifically for information retrieval evaluation.

The first place to go is the Massive Text Embedding Benchmark (MTEB). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval. If you’re focused on information retrieval, you might want to check out BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

===

The Majority Of Embedding Models On MTEB Are English

While the Massive Text Embedding Benchmark (MTEB) leaderboard contains many excellent models, most are designed for English. For multilingual or non-English applications, MMTEB is a better resource.

If you’re working with multilingual or non-English languages, it might be worth checking out MMTEB (Massive Multilingual Text Embedding Benchmark).

===

Where can you find multilingual embedding models?

If you are working with multilingual or non-English languages, you should explore MMTEB (Massive Multilingual Text Embedding Benchmark), as the majority of models on the MTEB leaderboard are English-focused.

If you’re working with multilingual or non-English languages, it might be worth checking out MMTEB (Massive Multilingual Text Embedding Benchmark).

===

A Little History On Vector Embeddings

Before modern contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). Static embeddings give each word a fixed representation, while contextual embeddings generate different representations based on the surrounding text, making them more expressive.

Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.
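
A rough sketch of the difference, assuming the `gensim` and `sentence-transformers` packages are installed (the model names are illustrative choices, not prescribed here):

```python
# Static vs. contextual embeddings.
import gensim.downloader as api
from sentence_transformers import SentenceTransformer

glove = api.load("glove-wiki-gigaword-50")       # static: one fixed vector per word,
static_bank = glove["bank"]                      # looked up from a pre-computed table

model = SentenceTransformer("all-MiniLM-L6-v2")  # contextual model
river_bank = model.encode("She sat on the river bank.")
money_bank = model.encode("She deposited cash at the bank.")
# The two encodings differ because the context differs, while `static_bank` never changes.
```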

===

What is the difference between static and contextual embeddings?

  • Static embeddings (like Word2Vec) assign a fixed representation to each word, regardless of its context.
  • Contextual embeddings (like BERT) generate different representations for the same word based on its surrounding context, making them more expressive.

Static embeddings can still be useful in computationally restrained environments as they can be looked up from pre-computed tables.

Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.

===

Fantastic Embedding Models And Where To Find Them

To find fantastic embedding models, the primary resource is the Massive Text Embedding Benchmark (MTEB) leaderboard. It covers a wide range of tasks for embedding models, including:

  • Classification
  • Clustering
  • Retrieval

The first place to go is the Massive Text Embedding Benchmark (MTEB). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval.

===

What is BEIR and what is its focus?

BEIR (A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models) is a benchmark specifically focused on information retrieval.

If you’re focused on information retrieval, you might want to check out BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

===

The Majority Of Embedding Models On MTEB Are English

Most embedding models found on MTEB are designed for English. For working with multilingual or non-English languages, it is recommended to check out the MMTEB (Massive Multilingual Text Embedding Benchmark).

If you’re working with multilingual or non-English languages, it might be worth checking out MMTEB (Massive Multilingual Text Embedding Benchmark).

===

A Little History On Vector Embeddings

Before the advent of today's contextual embeddings (e.g., BERT), there were static embeddings such as Word2Vec and GloVe.

Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe).

===

What is the difference between static and contextual embeddings?

  • Static embeddings: Provide a fixed representation for each word.
  • Contextual embeddings: Generate different representations for the same word based on its surrounding context.
  • Static embeddings can still be useful in computationally restrained environments because they can be looked up from pre-computed tables.

They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally restrained environments because they can be looked up from pre-computed tables.

===

Don’t Confuse Sparse Vectors And Sparse Embeddings

  • Sparse vectors can be generated in two main ways:
    • By applying statistical scoring functions like TF-IDF or BM25 to term frequencies.
    • With neural sparse embedding models like SPLADE.
  • A sparse embedding is a type of sparse vector, but not all sparse vectors are necessarily sparse embeddings.

It took me a while until I understood that sparse vectors can be generated in different ways: Either by applying statistical scoring functions like TF-IDF or BM25 to term frequencies (often retrieved via inverted indexes), or with neural sparse embedding models like SPLADE. That means a sparse embedding is a sparse vector, but not all sparse vectors are necessarily sparse embeddings.
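
A sketch of the statistical route, assuming scikit-learn; producing SPLADE-style sparse embeddings would additionally require a neural model and is not shown here:

```python
# One way to get sparse vectors: statistical TF-IDF weighting over term frequencies.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "sparse vectors from term statistics",
    "neural sparse embeddings come from models like SPLADE",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # scipy sparse matrix, one row per document

print(tfidf_matrix.shape)   # (2, vocabulary_size)
print(tfidf_matrix[0].nnz)  # number of non-zero weights in the first document
```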

===

Embed All The Things

Embeddings are not exclusively for text. You can embed various data types, including:

  • Images
  • PDFs as images (e.g., ColPali)
  • Graphs

This capability enables multimodal vector search over diverse data.

Embeddings aren’t just for text. You can embed images, PDFs as images (see ColPali), graphs, etc. And that means you can do vector search over multimodal data. It’s pretty incredible. You should try it sometime.
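
A hedged multimodal sketch, assuming `sentence-transformers` with a CLIP checkpoint and Pillow; `cat.jpg` is a placeholder path:

```python
# Embed an image and a text query into the same vector space.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # CLIP model exposed via sentence-transformers

image_embedding = model.encode(Image.open("cat.jpg"))
text_embedding = model.encode("a photo of a cat")

print(util.cos_sim(image_embedding, text_embedding))  # cross-modal similarity
```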

===

The Economics Of Vector Embeddings

  • The vector dimensions directly impact the required storage cost.
  • For example, choosing a model with 1536 dimensions over one with 768 dimensions can double your storage requirements.
  • While more dimensions capture more semantic nuances, a very high number of dimensions may not always be necessary for common tasks like "chat with your docs".

This shouldn’t be a surprise, but the vector dimensions will impact the required storage cost. So, consider whether it is worth it before you choose an embedding model with 1536 dimensions over one with 768 dimensions and risk doubling your storage requirements. Yes, more dimensions capture more semantic nuances.
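
A back-of-the-envelope estimate (float32, 4 bytes per dimension) that makes the doubling concrete:

```python
# Rough storage estimate for float32 embeddings, ignoring index overhead.
def embedding_storage_gb(num_vectors: int, dimensions: int, bytes_per_dim: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_dim / 1e9

print(embedding_storage_gb(10_000_000, 768))    # ~30.7 GB
print(embedding_storage_gb(10_000_000, 1536))   # ~61.4 GB, i.e. double the storage
```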

===

What is Matryoshka Representation Learning?

Matryoshka Representation Learning is a technique that allows you to shorten vector embeddings. This is beneficial for environments with limited computational resources, while keeping performance losses minimal.

Some models actually use Matryoshka Representation Learning to allow you to shorten vector embeddings for environments with less computational resources, with minimal performance losses.
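
A minimal numpy sketch of what shortening an embedding looks like in practice (my own illustration). It assumes the model was trained with Matryoshka Representation Learning, so that the leading dimensions carry most of the information:

```python
import numpy as np

def shorten_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` dimensions and re-normalize.

    Only works well if the model was trained with Matryoshka Representation
    Learning, where the leading dimensions are the most informative ones.
    """
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.rand(1536).astype(np.float32)  # stand-in for a real embedding
short = shorten_embedding(full, 256)            # 6x smaller vector
print(short.shape)  # (256,)
```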

===

“Chat With Your Docs” Tutorials Are The “Hello World” Programs Of Generative AI

The phrase "chat with your docs" tutorials are considered the "Hello world" programs of Generative AI. This means they are foundational, basic, or introductory examples in the field.

“Chat with your docs” tutorials are the “Hello world” programs of Generative AI.

===

You Need To Call The Embedding Model A LOT

Calling the embedding model is a frequent necessity, not just during initial data ingestion. It is required:

  • Every time a search query is run (the query must be embedded).
  • When adding new objects later on (they need embedding and indexing).
  • If you change the embedding model (all existing data must be re-embedded and re-indexed).

Just because you embedded your documents during the ingestion stage, doesn’t mean you’re done calling the embedding model. Every time you run a search query, the query must also be embedded (if you’re not using a cache). If you’re adding objects later on, those must also be embedded (and indexed). If you’re changing the embedding model, you must also re-embed (and re-index) everything.
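
Because identical queries come up again and again, a small cache can save many of these embedding calls. A hedged sketch, where `embed` is a placeholder for whatever embedding API or local model you actually call:

```python
from functools import lru_cache

def embed(text: str) -> tuple[float, ...]:
    """Placeholder for a real embedding call (API or local model)."""
    return tuple(float(ord(c)) for c in text[:8])  # dummy vector, illustration only

@lru_cache(maxsize=10_000)
def embed_query(text: str) -> tuple[float, ...]:
    # Identical queries are embedded once; repeats are served from the cache.
    return embed(text)

embed_query("how to fix a faucet")  # calls the (placeholder) model
embed_query("how to fix a faucet")  # cache hit, no second call
```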

===

Similar Does Not Necessarily Mean Relevant

  • Vector search identifies objects based on their similarity to a query, measured by proximity in vector space.
  • However, similarity does not always equate to relevance to the user's intent.
  • For example, "How to fix a faucet" and "Where to buy a kitchen faucet" might be similar in vector space but not relevant to each other.

Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., “How to fix a faucet” and “Where to buy a kitchen faucet”) does not mean they are relevant to each other.

===

Cosine Similarity And Cosine Distance Are Not The Same Thing

  • Cosine similarity and cosine distance are related but distinct concepts.
  • They are inverses of each other: cosine distance = 1 − cosine similarity.
  • If two vectors are exactly the same:
    • Their similarity is 1.
    • The distance between them is 0.

But they are related to each other (cosine distance = 1 − cosine similarity). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0.

===

What is the relationship between cosine similarity and dot product for normalized vectors?

  • For normalized vectors, cosine similarity and dot product are mathematically equivalent.
  • In this scenario, the dot product is generally more efficient for computation.

Because mathematically, they are the same. For the calculation, dot product is more efficient.
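
A small numpy sketch tying the last two cards together (toy vectors, my own illustration): cosine distance is one minus cosine similarity, and once the vectors are normalized, the plain dot product gives the same number more cheaply:

```python
import numpy as np

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.4, 0.9, 0.3])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim  # cosine distance = 1 - cosine similarity

# Normalize once; then the dot product IS the cosine similarity (and cheaper to compute).
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_norm, b_norm), cos_sim)

print(cos_sim, cos_dist)  # identical vectors would give similarity 1.0 and distance 0.0
```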

===

Common Misconception: The R In RAG Stands For ‘Vector Search’

  • The "R" in RAG (Retrieval-Augmented Generation) stands for 'retrieval', not specifically 'vector search'.
  • Retrieval can be accomplished through various methods beyond just vector search.

It doesn’t. It stands for ‘retrieval’. And retrieval can be done in many different ways (see following bullets).

===

Vector Search Is Just One Tool In The Retrieval Toolbox

  • Vector search is only one tool within a broader retrieval toolbox.
  • Other essential tools include:
    • Keyword-based search
    • Filtering
    • Reranking
  • Combining these different tools is crucial for building effective and robust retrieval systems.

There’s also keyword-based search, filtering, and reranking. It’s not one over the other. To build something great, you will need to combine it with different tools.

===

For normalized vectors, which is more computationally efficient: cosine similarity or dot product?

  • For normalized vectors, both cosine similarity and dot product are mathematically the same.
  • However, the dot product is generally more efficient for computation.

If you’re working with normalized vectors, it doesn’t matter whether you’re using cosine similarity or dot product for the similarity measure. Because mathematically, they are the same. For the calculation, dot product is more efficient.

===

How can hybrid search combine keyword and vector search using parameters?

  • In some implementations (e.g., Weaviate), a hybrid search function allows you to combine keyword-based and vector-based search.
  • The alpha parameter can then be used to adjust the weighting from pure keyword-based search, to a mix of both, to pure vector search.

In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the alpha parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search.
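
A generic sketch of the alpha idea (my own illustration, not Weaviate's actual fusion logic; real engines normalize and fuse scores with their own strategies):

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float) -> float:
    """alpha = 0 -> pure keyword search, alpha = 1 -> pure vector search."""
    return (1 - alpha) * keyword_score + alpha * vector_score

# Toy, already-normalized scores for a single document:
bm25_score, cosine_score = 0.8, 0.3
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_score(bm25_score, cosine_score, alpha))
```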

===

If You’re Working With Normalized Vectors, It Doesn’t Matter Whether You’re Using Cosine Similarity Or Dot Product For The Similarity Measure

  • For normalized vectors, it does not matter whether you use cosine similarity or dot product.
  • Mathematically, they are the same calculation in this context.
  • The dot product is generally more efficient for computation.

Because mathematically, they are the same. For the calculation, dot product is more efficient.

===

What analogy describes the issue of losing semantic meaning with large chunk sizes and mean pooling?

  • The issue is like creating a movie poster by overlaying every single frame of the movie.
  • While all the information is technically present, the resulting image is unintelligible, and the overall meaning of the movie is lost.

I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.

===

When To Use Keyword-Based Search Vs. Vector-Based Search

  • Use vector-based search for matching semantics and synonyms (e.g., “pastel colors” vs. “light pink”).
  • Use keyword-based search for exact keywords (e.g., “A-line skirt”, “peplum dress”).
  • For use cases requiring both, hybrid search is beneficial.

Does your use case require mainly matching semantics and synonyms (e.g., “pastel colors” vs. “light pink”) or exact keywords (e.g., “A-line skirt”, “peplum dress”)? If it requires both (e.g., “pastel colored A-line skirt”), you might benefit from combining both and using hybrid search. In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the alpha parameter to change the weighting from pure keyword-based search, a mix of both, to pure vector search.

===

Hybrid Search Can Be A Hybrid Of Different Search Techniques

  • While 'hybrid search' most often means the combination of keyword-based and vector-based search, the term is broader.
  • It can also refer to combining vector-based search with search over structured data (metadata filtering).

Most often, when you hear people talk about hybrid search, they mean the combination of keyword-based search and vector-based search. But the term ‘hybrid’ doesn’t specify which techniques to combine. So, sometimes you might hear people talk about hybrid search, meaning the combination of vector-based search and search over structured data (often referred to as metadata filtering).

===

Misconception: Filtering Makes Vector Search Faster

  • This is a misconception because filtering does not always improve search latency.
  • Pre-filtering can disrupt underlying index structures (e.g., HNSW graph connectivity).
  • Post-filtering may lead to an empty result set.
  • Vector databases employ complex methods to address this.

Intuitively, you’d think using a filter should speed up search latency because you’re reducing the number of candidates to search through. But in practice, pre-filtering candidates can, for example, break the graph connectivity in HNSW, and post-filtering can leave you with no results at all. Vector databases have different, sophisticated techniques to handle this challenge.
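
A toy sketch (brute force, no real vector database) of why naive post-filtering can leave you empty-handed: the filter is applied after the search has already been truncated to the top-k. Real vector databases solve this with smarter filtered-search strategies inside the index:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
in_stock = rng.random(1000) < 0.01  # only ~1% of objects pass the filter

query = vectors[0]
scores = vectors @ query
top_k = np.argsort(-scores)[:10]  # "ANN" search (brute force here) for the top 10

# Post-filtering: filter AFTER retrieving top-k, which can easily end up with nothing.
post_filtered = [i for i in top_k if in_stock[i]]
print(len(post_filtered))  # often 0

# Pre-filtering: restrict candidates first. Works here, but inside a real ANN
# index (e.g., HNSW) restricting nodes like this can break graph connectivity.
candidates = np.where(in_stock)[0]
pre_filtered = candidates[np.argsort(-scores[candidates])[:10]]
print(len(pre_filtered))
```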

===

Two-Stage Retrieval Pipelines Aren’t Only For Recommendation Systems

  • Two-stage pipelines are common in recommendation systems but are also applicable to RAG pipelines.
  • The first stage uses a simpler, faster process (e.g., vector search) to reduce candidates.
  • The second stage uses a more compute-intensive but more accurate reranking process.

Recommendation systems often have a first retrieval stage that uses a simpler retrieval process (e.g., vector search) to reduce the number of potential candidates, which is followed by a second retrieval stage with a more compute-intensive but more accurate reranking stage. You can apply this to your RAG pipeline as well.
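
A toy two-stage sketch (my own illustration): a cheap vector search narrows 10,000 documents down to 100 candidates, and the expensive but more accurate reranker only has to score those 100. Here, `rerank_score` is a stand-in for something like a cross-encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
doc_vectors = rng.normal(size=(10_000, 128))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def rerank_score(query_id: int, doc_id: int) -> float:
    """Stand-in for an expensive, more accurate model (e.g., a cross-encoder)."""
    return float(rng.random())  # dummy score, illustration only

query = doc_vectors[42]

# Stage 1: fast, cheap retrieval over the whole collection (vector search).
candidates = np.argsort(-(doc_vectors @ query))[:100]

# Stage 2: slow, accurate reranking over the 100 candidates only.
reranked = sorted(candidates, key=lambda doc_id: rerank_score(42, doc_id), reverse=True)[:10]
print(reranked)
```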

===

How Vector Search Differs From Reranking

Vector search and reranking serve different purposes in information retrieval:

  • Vector search returns a small portion of results from the entire database.
  • Reranking takes an existing list of items and returns that list in a re-ordered sequence.

Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list.

===

Finding the Right Chunk Size to Embed Is Not Trivial

Determining the optimal chunk size for embedding is a challenge because:

  • Chunks that are too small can lead to a loss of important context.
  • Chunks that are too big can result in a loss of semantic meaning.

Many embedding models use mean pooling, averaging all token embeddings into a single vector, which can make large chunks semantically unclear even if they are technically embeddable.

Too small, and you’ll lose important context. Too big, and you’ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.

===

What is a helpful analogy for understanding chunk size for embedding?

A useful analogy for understanding the challenge of chunk size is thinking of it like creating a movie poster by overlaying every single frame of the movie.

  • All the original information from the movie is technically present.
  • However, you still won't be able to understand the movie's plot or meaning from such an aggregated image.

I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.

===

Vector Indexing Libraries Are Different From Vector Databases

While both vector indexing libraries and vector databases excel at fast vector search and are useful for tutorials, vector databases offer additional data management features:

  • Built-in persistence
  • CRUD support (Create, Read, Update, Delete)
  • Metadata filtering
  • Hybrid search capabilities

Both are incredibly fast for vector search. Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.

===

What data management features do vector databases offer beyond vector indexing libraries?

Vector databases provide several essential data management features that vector indexing libraries typically lack:

  • They include built-in persistence for data.
  • They support CRUD operations (Create, Read, Update, Delete).
  • They enable metadata filtering and hybrid search.

Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.

===

RAG Has Been Dying Since the Release of the First Long-Context LLM

Despite claims that Retrieval-Augmented Generation (RAG) is becoming obsolete with the release of Large Language Models (LLMs) featuring longer context windows, RAG has not died. Each time a new LLM with expanded context is released, the claim resurfaces, but RAG continues to be a relevant and effective technique.

Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is…

===

You Can Throw Out 97% of the Information and Still Retrieve (Somewhat) Accurately

This is enabled by vector quantization. For instance, with binary quantization, each 32-bit float in a vector is reduced to a single bit, a 32x storage reduction. Surprisingly, retrieval accuracy can remain quite good in some use cases despite this drastic compression.

It’s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you’ll be surprised how well retrieval will remain to work (in some use cases).
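
A minimal numpy sketch of binary quantization (my own illustration): keep only the sign of each dimension, pack the bits, and compare with Hamming distance. Real systems usually rescore the top candidates with the full-precision vectors afterwards:

```python
import numpy as np

rng = np.random.default_rng(7)
vectors = rng.normal(size=(5, 8)).astype(np.float32)  # toy 8-dim float32 embeddings

binary = (vectors > 0).astype(np.uint8)  # keep only the sign: 1 bit per dimension
packed = np.packbits(binary, axis=1)     # 8 dims per byte -> the 32x saving vs float32

print(vectors[0])  # e.g., [-0.98, 0.10, 0.009, ...]
print(binary[0])   # e.g., [0, 1, 1, ...]

# Retrieval over binary codes typically uses Hamming distance:
query = binary[0]
hamming = np.count_nonzero(binary != query, axis=1)
print(hamming)  # 0 to itself, larger for dissimilar vectors
```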

===

What is vector quantization?

Vector quantization is a technique used to reduce the storage size of vector embeddings.

  • It converts high-precision vectors (like 32-bit floats) into a more compressed format (e.g., 1-bit binary vectors).
  • This can lead to significant storage reduction (e.g., 32x) while surprisingly maintaining effective retrieval accuracy for certain applications.

It’s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction from 32-bit float to 1-bit) and you’ll be surprised how well retrieval will remain to work (in some use cases).

===

Vector Search Is Not Robust To Typos

Vector search is generally not robust to typos for a key reason:

  • It's highly improbable that all possible typographical errors of a word are sufficiently represented in the training data of embedding models.
  • While some simple typos might be handled, vector search cannot be relied upon to correct or robustly handle a wide range of misspellings.

For a while, I thought that vector search was robust to typos because these large corpora of text surely must contain a lot of typos and therefore help the embedding model learn these typos as well. But if you think about it, there’s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle some typos, you can’t really say it is robust to them.

===

Why is vector search not robust to typos?

Vector search is not robust to typos primarily because:

  • The extensive training datasets used by embedding models are unlikely to contain sufficient examples of all possible typos for every word.
  • This limitation means that while vector search might tolerate some minor typos, it cannot reliably correct or retrieve results for significant or uncommon misspellings.

But if you think about it, there’s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle some typos, you can’t really say it is robust to them.

===

Knowing When to Use Which Metric to Evaluate Search Results

Choosing the right metric for evaluating search results depends on the use case:

  • NDCG@k is a prominent metric often seen in academic benchmarks like BEIR.
  • Simpler metrics such as precision and recall are often well-suited for many practical applications.

There are many different metrics to evaluate search results. Looking at academic benchmarks, like BEIR, you’ll notice that NDCG@k is prominent. But simpler metrics like precision and recall are a great fit for many use cases.

===

The Precision-Recall Trade-Off

The precision-recall trade-off illustrates the inverse relationship between these two metrics:

  • High precision means most returned results are relevant, but you might miss many existing relevant items (low recall).
  • High recall means you found most, if not all, relevant items, but you might also return many irrelevant items (low precision). It's a balance between returning only relevant items and returning all relevant items.

The precision-recall trade-off is often depicted with a fisherman’s analogy of casting a net, but this e-commerce analogy made it click better for me: Imagine you have a webshop with 100 books, out of which 10 are ML-related. Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant). But that’s bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). And also, that’s not so good for your business. Maybe the user didn’t like that one ML-related book you returned. On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted… That’s perfect recall because you returned all relevant results. It’s just that you also returned a bunch of irrelevant results, which can be measured by how bad the precision is.
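
The webshop example above, worked as a small sketch (doc IDs 0 to 9 stand in for the 10 ML books out of 100):

```python
def precision_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

relevant = set(range(10))     # the 10 ML books in a 100-book shop
one_result = [0]              # return a single ML book
all_books = list(range(100))  # return the entire, unsorted catalog

print(precision_at_k(one_result, relevant, k=1), recall_at_k(one_result, relevant, k=1))    # 1.0, 0.1
print(precision_at_k(all_books, relevant, k=100), recall_at_k(all_books, relevant, k=100))  # 0.1, 1.0
```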

===

How is precision defined in search evaluation?

Precision measures the proportion of returned search results that are actually relevant.

  • For example, if you return 1 book and it's relevant, your precision is perfect.
  • It answers: "Of the items I returned, how many were correct?"

Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant).

===

How is recall defined in search evaluation?

Recall measures the proportion of all existing relevant items that were successfully retrieved by the search system.

  • For example, if there are 10 relevant books and you only return 1, your recall is low.
  • If you return all 10 relevant books (even among many irrelevant ones), your recall is perfect.
  • It answers: "Of all the correct items out there, how many did I find?"

But that’s bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). ...That’s perfect recall because you returned all relevant results.

===

There Are Metrics That Include the Order

The order of search results can be crucial in certain use cases, similar to a Google search.

  • Metrics like precision and recall do not take the order of results into account.
  • If rank is important for your use case, you should choose rank-aware metrics such as:
    • MRR@k (Mean Reciprocal Rank)
    • MAP@k (Mean Average Precision)
    • NDCG@k (Normalized Discounted Cumulative Gain)

When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don’t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k.
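
Minimal binary-relevance sketches of two of these rank-aware metrics (my own illustration; NDCG also supports graded relevance, which is not shown here):

```python
import math

def mrr_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank  # reciprocal rank of the first relevant hit
    return 0.0

def ndcg_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

relevant = {1, 2}
print(mrr_at_k([9, 1, 2], relevant, k=3))   # 0.5  (first relevant result at rank 2)
print(ndcg_at_k([1, 2, 9], relevant, k=3))  # 1.0  (relevant results ranked first)
print(ndcg_at_k([9, 1, 2], relevant, k=3))  # ~0.69 (same result set, worse order)
```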

===

Tokenizers Matter

Tokenizers are a critical component affecting search performance, not just in the context of Transformer models.

  • They are essential for the performance of keyword search.
  • Since hybrid search often combines keyword search with vector search, the tokenizer's impact on keyword performance directly affects hybrid search performance as well.

If you’ve been in the Transformer’s bubble too long, you’ve probably forgotten that other tokenizers exist next to Byte-Pair-Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.

===

Which search metrics do not consider the order of results?

  • Precision and recall are search metrics that do not take into account the order in which search results are returned.
  • They evaluate relevance based on the set of results, regardless of their position.

Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant). But that’s bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books).

===

What is the impact on precision and recall when only one relevant item is returned from many?

  • Returning only one relevant item (e.g., one ML book out of ten) results in:
    • Perfect precision: all returned results are relevant.
    • Bad recall: only a small fraction of the total relevant results are returned.

Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant). But that’s bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books).

===

How does returning all available items, including many irrelevant ones, affect recall and precision?

  • If you return your entire selection of items (e.g., all 100 books, unsorted), it leads to:
    • Perfect recall: all relevant results that exist are returned.
    • Bad precision: many irrelevant results are returned alongside the relevant ones.

On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted… That’s perfect recall because you returned all relevant results. It’s just that you also returned a bunch of irrelevant results, which can be measured by how bad the precision is.

===

Why are tokenizers important for search performance?

  • Tokenizers are important because they directly impact the performance of:
    • Keyword search: how text is broken down affects matching relevant terms.
    • Hybrid search: since hybrid search often incorporates keyword-based methods, the tokenizer directly influences its overall effectiveness.

Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts the keyword-based search performance, it also impacts the hybrid search performance.

===

Out-Of-Domain Is Not The Same As Out-Of-Vocabulary

  • There is a key distinction between out-of-domain (OOD) and out-of-vocabulary (OOV) terms.
  • Earlier embedding models failed on OOV terms, but modern models handle them graciously with smart tokenization.
  • OOD terms, however, result in meaningless vector embeddings, even if they look like proper embeddings.

Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of “Labubu”, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.

===

How do modern embedding models handle out-of-vocabulary terms?

  • Modern embedding models, using smart tokenization, can handle out-of-vocabulary (OOV) terms graciously.
  • Unlike earlier models that would fail, unseen OOV terms can now be processed without errors.

Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of “Labubu”, it would have just run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.

===

What is the consequence of an out-of-domain term for vector embeddings?

  • When a term is out-of-domain (OOD) for an embedding model, its generated vector embedding, while appearing like a proper embedding, is meaningless.
  • This occurs even if the term is handled without error by smart tokenization.

With smart tokenization, unseen out-of-vocabulary terms can be handled graciously, but the issue is that they are still out-of-domain terms, and therefore, their vector embeddings look like a proper embedding, but they are meaningless.

===

Query Optimizations

  • Users have historically learned to optimize search queries for keyword search (e.g., typing keywords like "longest river africa" instead of full questions).
  • A similar learning curve is now necessary to optimize queries for vector search.

You know how you’ve learned to type “longest river africa” into Google’s search bar, instead of “What is the name of the longest river in Africa?”. You’ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?). Similarly, we now need to learn how to optimize our search queries for vector search.

===

How have users historically optimized search queries for keyword search?

  • Users have learned to optimize queries for keyword search by typing concise keywords rather than full questions.
  • An example is typing "longest river africa" instead of "What is the name of the longest river in Africa?".

You know how you’ve learned to type “longest river africa” into Google’s search bar, instead of “What is the name of the longest river in Africa?”. You’ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?).

===

What new challenge arises in optimizing search queries for modern systems?

  • Just as users learned to optimize queries for keyword search, there is now a need to learn how to optimize search queries specifically for vector search.
  • This involves understanding how to phrase queries so that their embeddings retrieve the most relevant results.

Similarly, we now need to learn how to optimize our search queries for vector search.

===

What Comes After Vector Search?

  • Search technologies have evolved through distinct stages:
    1. Keyword-based search: the initial approach.
    2. Vector search: enabled by Machine Learning models.
    3. Reasoning-based retrieval: the current frontier, enabled by Large Language Models (LLMs) with reasoning capabilities.

First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.

===

Describe the historical progression of search technologies.

  • The progression of search technologies has been:
    • Starting with keyword-based search.
    • Evolving to vector search, facilitated by Machine Learning models.
    • Now moving towards reasoning-based retrieval, powered by LLMs with reasoning abilities.

First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.

===

Information Retrieval Is So Hot Right Now

  • Information Retrieval (IR) is an exciting field to work in, as it continuously evolves.
  • While working with LLMs is popular, providing the best information for LLMs is equally important and falls within the field of retrieval.

I feel fortunate to get to work in this exciting space. Although working on and with LLMs seems to be the cool thing now, figuring out how to provide the best information for them is equally exciting. And that’s the field of retrieval.

===

What enduring aspect of information retrieval remains crucial despite new trends?

  • The importance of finding the best information to provide to a Large Language Model (LLM) so it can generate the best possible answer remains crucial.
  • This fundamental role of retrieval persists even with new developments like RAG or "context engineering."

When I joined Weaviate, vector databases were the hot new thing. Then came RAG. Now, we’re talking about “context engineering”. But what hasn’t changed is the importance of finding the best information to give the LLM so it can provide the best possible answer.

===
