@scastillo
Last active February 24, 2023 16:30
"""
In this example, we're using the spacy library to load a pre-trained word embedding model (en_core_web_md), which includes word vectors for many common English words. We define a list of documents to search through, and a search function that takes a query string and computes the cosine similarity between the query vector and each document vector. Finally, we test the search function with a query string and print the most similar document. Note that this is a simple example, and in a real-world application you may want to use a more sophisticated search algorithm or pre-process the documents differently to achieve better results.
"""
import numpy as np
import spacy
# Load the pre-trained word embedding model
nlp = spacy.load("en_core_web_md")
# Define a list of documents to search through
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A stitch in time saves nine.",
    "The early bird catches the worm.",
    "An apple a day keeps the doctor away.",
    "Actions speak louder than words.",
]
# Define a search function that takes a query string and returns the index of the most similar document
def search(query, documents):
    # Average the word vectors of the query into a single query vector
    query_vec = np.mean([nlp(word).vector for word in query.lower().split()], axis=0)
    # Average the word vectors of each document into one vector per document
    document_vecs = np.array([np.mean([nlp(word).vector for word in doc.lower().split()], axis=0) for doc in documents])
    # Compute the cosine similarity between the query vector and each document vector
    similarities = np.dot(document_vecs, query_vec) / (np.linalg.norm(document_vecs, axis=1) * np.linalg.norm(query_vec))
    # Return the index of the document with the highest similarity score
    return np.argmax(similarities)
# Test the search function
query = "The lazy dog jumps over the quick brown fox."
most_similar_doc_index = search(query, documents)
most_similar_doc = documents[most_similar_doc_index]
print("Most similar document to query:", most_similar_doc)
@scastillo
Author

Word embeddings can be a useful tool for implementing text search in your web app because they allow you to represent words in a high-dimensional space where words with similar meanings are located close together. Here's a basic overview of how you could use word embeddings to implement text search:

  • Choose a pre-trained word embedding model: There are many pre-trained word embedding models available online, such as Word2Vec, GloVe, or FastText. Choose one that suits your needs and download it.
  • Convert your documents to embeddings: Once you have a word embedding model, you can use it to convert each document in your corpus to a single embedding vector. This can be done by averaging the embeddings of all the words in the document, or by using a more sophisticated approach, such as Doc2Vec.
  • Implement a search algorithm: To perform a search, you first need to convert the query text into an embedding vector using the same approach you used for the documents. Then, you can compute the cosine similarity between the query vector and each document vector in your corpus. The documents with the highest cosine similarity scores are the ones that are most similar to the query text.
  • Optimize for speed: Depending on the size of your corpus, comparing the query against every document vector may be slow. To speed things up, you can use an approximate nearest neighbor search library, such as Annoy or FAISS, which can quickly find the closest document vectors to your query vector.
  • Integrate with your web app: Finally, you can integrate the text search functionality into your web app by allowing users to enter a search query and displaying the most relevant documents. You can also provide additional features such as highlighting the search terms in the documents, or allowing users to filter the search results by date or category.
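
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a production design: `TOY_VECTORS` is a made-up 3-dimensional lookup table standing in for a real embedding model, and `embed`, `normalize_rows`, `build_index`, and `search_top_k` are hypothetical helper names. The idea it shows is that document vectors can be computed and normalized once, so each query only needs one matrix-vector product to get cosine similarities, and the top `k` results can be returned instead of a single best match.

```python
import numpy as np

# Hypothetical toy vectors standing in for a real pre-trained embedding model.
TOY_VECTORS = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "puppy": np.array([0.9, 0.1, 0.0]),
    "apple": np.array([0.0, 1.0, 0.0]),
    "fruit": np.array([0.1, 0.9, 0.0]),
    "doctor": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        # No known words: fall back to a zero vector.
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)


def normalize_rows(matrix):
    """Scale each row to unit length, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.divide(matrix, norms, out=np.zeros_like(matrix), where=norms > 0)


def build_index(documents):
    """Embed and normalize every document once, ahead of query time."""
    return normalize_rows(np.array([embed(doc) for doc in documents]))


def search_top_k(query, index, k=3):
    """Return (document index, cosine similarity) pairs, best match first."""
    query_vec = normalize_rows(embed(query)[np.newaxis, :])[0]
    # Rows and query are unit length, so the dot product is the cosine similarity.
    scores = index @ query_vec
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), float(scores[i])) for i in order]


docs = [
    "The puppy chased the dog.",
    "An apple is a fruit.",
    "The doctor is in.",
]
index = build_index(docs)
results = search_top_k("dog", index, k=2)
for doc_index, score in results:
    print(f"{score:.3f}  {docs[doc_index]}")
```

With a real model you would swap `embed` for a call into that model and keep the rest unchanged. If the corpus grows large enough that the exact comparison becomes too slow, the precomputed `index` matrix is also the kind of input an approximate nearest neighbor library would typically be built from.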
