Awesome. I’ll begin the deep-dive into architecture, embeddings, indexing strategies, search techniques, and VSCode extension design for a local-first semantic search tool.
I’ll include:
- Comparative analysis of embedding models (OpenAI, Hugging Face, BGE, Instructor, etc.) for code and natural language
- Indexing strategies (chunking, AST, metadata) optimized for Python and JavaScript
- Vector store comparison with a focus on LanceDB, Qdrant, and other developer-friendly local options
- VSCode extension patterns, referencing tools like Sourcegraph Cody and Continue.dev
- Search techniques like hybrid search and context-aware retrieval
- Design notes on local-first performance, privacy, and future support for remote repo indexing
I'll organize this into a markdown guide with clear headers, diagrams if needed, and practical examples. I'll let you know when it's ready!
Setting up a local-first code search pipeline involves designing an indexing system that can efficiently handle codebases on a developer’s machine and serve semantic search queries with high relevance. This guide explores the architecture of such a system – to be implemented as a VS Code extension – using local vector databases (e.g. LanceDB, Qdrant) and various embedding models (OpenAI and open-source alternatives from Hugging Face). We’ll cover pipeline design (with incremental indexing and file watching), compare embedding models and vector stores, discuss code-specific indexing (chunking, AST parsing, metadata), outline search techniques (hybrid search, re-ranking, context-aware retrieval), and review best practices for building a performant VS Code extension. We also address privacy considerations of local vs cloud models and patterns for extending the approach to remote repositories.
A robust architecture for semantic code search should support scalable indexing of the codebase, real-time updates as code changes, and modular components that can be extended. At a high level, the system will parse source files into semantic chunks (e.g. functions, classes), generate embeddings for each chunk, store them in a local vector database, and provide a query interface that retrieves relevant code snippets via similarity search. Crucially, the index should update incrementally as the developer edits code.
Figure: high-level pipeline for codebase indexing (source: "An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2"). The codebase is parsed (using an AST parser like Tree-sitter) to extract semantic chunks (functions, classes, docstrings). Optionally, an LLM can generate summary comments for each chunk. Embeddings for chunks are generated and stored in a local vector database (e.g. LanceDB) along with metadata (file paths, references). Code reference links (where functions/classes are used) can also be gathered and stored as metadata for use during retrieval.
To avoid reprocessing the entire repository on every change, incremental indexing is employed. The extension can leverage VS Code's file system APIs (such as `vscode.workspace.createFileSystemWatcher`) to watch for file changes (saves, creations, deletions). When a file changes, only its affected chunks need re-indexing, rather than re-embedding everything. For example, if a single function is edited, the system can locate the corresponding embedding entry (by a unique ID or metadata) and update or replace it in the vector store.
Under the hood, an indexing daemon (which could be a background thread or process spawned by the extension) maintains the vector store. On startup, it performs an initial full index of all files. Then, file watcher events trigger targeted re-indexing: e.g. re-parse the changed file’s AST, regenerate embeddings for changed functions or new code, and update the vector database entry for those chunks. This design ensures the search index stays in sync with the codebase in near real-time without incurring huge re-indexing costs for every edit.
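To make the incremental flow concrete, here is a minimal sketch of the per-file re-index step. The `parse_chunks`, `embed`, and `store` helpers are placeholders for the AST, embedding, and vector-store components described below; in the real extension this function would be triggered by the file-watcher events.

```python
import hashlib
from pathlib import Path

def chunk_id(path: str, name: str) -> str:
    """Stable ID so an edited chunk overwrites its previous vector."""
    return hashlib.sha1(f"{path}::{name}".encode()).hexdigest()

def reindex_file(path: str, parse_chunks, embed, store):
    """Re-embed only the chunks of one changed file and upsert them.

    parse_chunks(code) -> [{"name": ..., "code": ...}]  (e.g. Tree-sitter based)
    embed(texts)       -> list of vectors
    store.delete/upsert -> thin wrappers around the vector DB of choice
    """
    code = Path(path).read_text(encoding="utf8")
    chunks = parse_chunks(code)
    vectors = embed([c["code"] for c in chunks])
    # Drop stale entries for this file, then write the fresh ones.
    store.delete(where={"file": path})
    store.upsert([
        {"id": chunk_id(path, c["name"]), "vector": v,
         "file": path, "function": c["name"], "code": c["code"]}
        for c, v in zip(chunks, vectors)
    ])
```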
Modularity and Extensibility: The pipeline should be modular – separate components for parsing, embedding, storage, and retrieval – so that each part can be swapped or extended. For instance, the embedding module might start with OpenAI’s API but later be replaced by a local model; the vector store could be LanceDB initially, but the system should allow switching to Qdrant or another store with minimal changes. A clean separation (e.g. via interfaces or an adapter pattern) between the VS Code extension front-end and a back-end indexing service can make it easier to integrate new features like remote repository support in the future.
Rather than splitting code arbitrarily (e.g. fixed-size tokens or lines), we leverage the inherent structure of source code. Abstract Syntax Tree (AST) parsing produces a structured representation of code, which we use to extract logical units such as class definitions, function/method definitions, and possibly docstrings or comments attached to them (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2) (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2). Each such unit becomes a “chunk” with a standalone meaning.
For a given file, the AST parser (e.g. [Tree-sitter](https://tree-sitter.github.io)) can enumerate all functions and classes. We then chunk the code at the method or class level. This preserves semantic integrity – a function is indexed as one piece, which is more meaningful for search than random 200-line blocks (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2). It also keeps related code together so that a search result can show the whole function if it's relevant, rather than a snippet cut off mid-function.
Using an AST also allows attaching metadata to each chunk: e.g. the function name, class name, file path, and maybe the signature or docstring. This metadata is stored in the vector database alongside the embedding. It enables rich search filtering (e.g. restrict search to a certain file or function name) and more informative results display (showing the file and function name with the snippet). For instance, a chunk's metadata might look like: `{ "file": "src/utils.py", "function": "parse_config", "class": null, "docstring": "Parses the configuration file..." }`.
Additionally, AST traversal can help gather cross-references (where functions or classes are used). For example, one could augment the metadata with a list of symbol references (function calls, subclass relationships, etc.) found in the codebase (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2). This way, if a search result is a function definition, the system could also know which other files invoke that function, providing useful context to the developer.
Example: Using Tree-sitter to parse and chunk a Python file:
```python
# One convenient way to load the grammar is the tree-sitter-languages package
# (an assumption here; any properly built Tree-sitter Python grammar works).
from tree_sitter_languages import get_parser

parser = get_parser("python")
code = open("example.py").read()          # source file to chunk

tree = parser.parse(bytes(code, "utf8"))
root = tree.root_node

# Traverse the AST and collect function definitions
functions = []

def visit(node):
    if node.type == "function_definition":
        name_node = node.child_by_field_name("name")
        func_name = name_node.text.decode() if name_node else "<anonymous>"
        func_code = node.text.decode()    # entire function source
        functions.append({"name": func_name, "code": func_code})
    # recurse into children
    for child in node.children:
        visit(child)

visit(root)
print(f"Found {len(functions)} functions:", [f["name"] for f in functions])
```
The above (using Tree-sitter’s Python grammar) would extract each function’s code and name. Similar logic applies for classes or other constructs. These extracted code chunks are then fed to the embedding step.
Once code chunks are identified, we transform them into vector embeddings using an LLM-based embedding model. The choice of embedding model critically impacts search quality (semantic relevance of results) as well as performance (embedding generation speed and vector dimensionality). Here we compare several embeddings – both OpenAI’s proprietary models and open-source alternatives – with a focus on how well they handle code and technical text.
What makes a good code embedding? An embedding model for code search needs to handle not just natural language descriptions but also programming language syntax and semantics. Ideally, it should capture functional similarity (two code snippets that do similar things should be near in vector space) and understand code-specific tokens and structures. Models trained or fine-tuned on code tend to outperform generic text models on code search tasks (What embedding models work best for code and technical content? - Zilliz Vector Database) (What embedding models work best for code and technical content? - Zilliz Vector Database). For example, models like CodeBERT and UniXcoder are pre-trained on code and can better encode the structure and naming patterns in code (What embedding models work best for code and technical content? - Zilliz Vector Database). However, some general models (like OpenAI’s) also perform surprisingly well on code due to broad training data that included GitHub repos (What embedding models work best for code and technical content? - Zilliz Vector Database).
Below is a comparative overview of popular embedding models for code and text, highlighting their characteristics:
Model | Type | Dimensionality | Trained On | Strengths | Considerations
---|---|---|---|---|---
OpenAI Ada-002 | Closed-source (API) | 1536 | Web text + code (OpenAI) | High-quality universal embeddings; handles code well (OpenAI's data includes GitHub) (Zilliz) | API only (code must be sent to the cloud); cost per request; 8192-token context limit
OpenAI Embedding v3 | Closed-source (API) | 1536 (small) / 3072 (large) (OpenAI docs) | Multilingual text + code | Newer generation with improved multilingual support (Le Borgne, TDS Archive) | API only; privacy and cost considerations similar to Ada-002
BGE Large v1.5 | Open-source (BAAI) | 1024 | English text (incl. technical) | Top-ranked on embedding benchmarks (MTEB) (BAAI/bge-large-en, Hugging Face); excels at semantic search tasks; no API needed (local use) | Large model; requires a GPU for reasonable speed; primarily English-focused
Instructor-XL | Open-source (HKUST) | 768 | Text (with instructions) | Instruction-tuned: can understand prompts like "Represent the code for…"; good all-rounder for text/code due to broad training | Still fairly large (≈1B parameters); needs GPU for fast operation; must prepend instructions to input (slight complexity)
E5-large (MS) | Open-source (Microsoft) | 1024 | Multilingual text + code | Strong multilingual embeddings (Le Borgne, TDS Archive) |
CodeBERT | Open-source (Microsoft) | 768 | Code + NL (6 languages) | Bi-modal model trained on paired code and descriptions (Zilliz); good at linking code to comments; effective for code search and documentation queries | Older (Transformer encoder from 2020); may lag behind newer models on pure semantic similarity; limited to languages seen in training
UniXcoder | Open-source (Microsoft) | 768 | Code + NL (unified) | Jointly encodes code and comments in the same model (Zilliz); captures cross-modal links (e.g. API usage to docs); understands code structure (pretrained with code data flow) | More complex model (supports generation too); not just an encoder – may need specific use to get embeddings; model size and requirements similar to CodeBERT
all-MiniLM-L6-v2 | Open-source (Sentence-Transformers) | 384 | General text (incl. some technical) | Very lightweight (~22M params) and fast; can run on CPU easily; decent quality for simple semantic similarity | Not specifically trained on code (so may miss nuances of code syntax); lower-dimensional embedding (faster but slightly less expressive)
Nomic Embed (text-v1) | Open-source (Nomic) | 768 | Web text (open data) | Small (~550MB) model claiming parity with Ada-002 on many tasks (Le Borgne, TDS Archive) |
Notes: All the above models output a fixed-size dense vector per input chunk. They are generally compatible with any vector database that supports the given dimensionality. For example, Qdrant and LanceDB can store vectors of size 384 just as easily as 1536 – the main impact is on storage and search speed (larger vectors = more data to index). If using a local GPU, model size and throughput are key factors: smaller models like MiniLM or Instructor-base will embed faster but might sacrifice some accuracy; larger models like BGE or E5 require more resources but yield very high-quality embeddings (often matching or beating OpenAI’s, as seen on benchmarks (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2) (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2)).
Importantly, embedding models should ideally be trained on code if you want the best results for code search (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). For instance, OpenAI’s models include code in training, and specialized models like CodeBERT explicitly learned from code corpora. When selecting an open-source model, prefer ones where the documentation or research paper indicates exposure to programming languages (or consider further fine-tuning on a code dataset). Tools like LlamaIndex provide utilities to fine-tune Hugging Face models on your own data to improve domain-specific performance (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2).
Once a model is chosen, the extension’s back-end will generate embeddings for each code chunk and upsert them into the vector store. This can be done in Python (using libraries like Hugging Face Transformers or SentenceTransformers for local models, or OpenAI’s API for remote), or directly in Node.js via an API call or a WASM binding for local models. In a local-first scenario, using a Python subprocess for heavy embedding computation is common – the VS Code extension (Node) can spawn a Python script that loads the model and listens (e.g. on a socket or stdio) for chunk texts to embed, returning vectors. This keeps the VS Code UI thread free and allows using Python’s rich ML ecosystem.
To optimize indexing:
- Batch embeddings: Many models (and OpenAI's API) support batch processing. Collect a batch of, say, 20 chunks and embed them together to amortize overhead (see the sketch below).
- Incremental updates: For real-time indexing, maintain a small queue of “dirty” chunks that need re-embedding (due to file changes) and re-index them during idle moments or at a controlled rate, so as not to overwhelm resources while the user is actively coding.
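For example, a batched embedding step with sentence-transformers might look like the sketch below (the model name is just one reasonable choice; any local embedding model with a compatible interface works):

```python
from sentence_transformers import SentenceTransformer

# Any local embedding model works here; BGE is used purely as an example.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_chunks(chunks, batch_size=20):
    """Embed code chunks in batches to amortize per-call overhead."""
    texts = [c["code"] for c in chunks]
    vectors = model.encode(
        texts,
        batch_size=batch_size,
        normalize_embeddings=True,   # cosine similarity becomes a plain dot product
        show_progress_bar=False,
    )
    return [
        {**chunk, "vector": vec.tolist()}
        for chunk, vec in zip(chunks, vectors)
    ]
```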
The choice of the vector store determines how embeddings are stored and queried. Key considerations include: local deployability (no external cloud dependency), performance on typical code search workloads, support for metadata and filtering, and ease of integration in a VS Code extension context. Below, we compare some popular developer-friendly vector databases for local use.
Vector DB | Deployment Model | Index Type | Supports Metadata Filtering | Language Bindings | Notes
---|---|---|---|---|---
LanceDB | Embedded library (Python/Rust) (LanceDB vs Qdrant, Medium) | IVF-PQ (inverted file with product quantization) (LanceDB vs Qdrant, Medium) | Yes (SQL-style filters on stored columns) | Python, JavaScript/TypeScript, Rust | Serverless and embedded; persists to disk by default, so the index survives across sessions. Well suited to in-process use from the extension's back end.
Qdrant | Client-server (local or remote) | HNSW (graph) (LanceDB vs Qdrant, Medium) | Yes (filters, payload support) (The Data Quarry) | REST/gRPC API; Python, JS/TS, and other clients | Lightweight Rust service; payloads map naturally to code metadata. Requires running a separate process, but is easy to start locally.
Chroma | Embedded library (Python) | HNSW (uses FAISS or similar under the hood) | Yes (stores documents with metadata) | Python API (LangChain integration) | Easy integration: just pip install and use in Python. Meant for local RAG prototyping. Persists data (uses SQLite/DuckDB). However, no native JS client – would need a Python bridge for VS Code.
Weaviate | Client-server (local or cloud) | HNSW + optional BM25 (hybrid) | Yes (GraphQL filtering, BM25 text search) | REST/GraphQL API; Python & JS clients | Full-featured semantic + keyword search engine. Heavier to run (separate server, typically via container). Good docs and community (The Data Quarry). Possibly overkill for single-developer use due to setup complexity.
FAISS (library) | Embedded (C++/Python) | Various (Flat, HNSW, IVF) | No (embedding-only; metadata handled separately by the app) | Python (via faiss), C++; (JS via WASM possible) | Facebook's library for similarity search. High performance, but it's a low-level tool – you manage your own data structures. Often used under the hood by others. Not ideal unless you want to build a custom store.
Milvus | Client-server (C++ core) | HNSW, IVF, DiskANN | Yes (filters, etc.) | REST, Python, Java clients | Enterprise-grade system, designed for massive scale (billion+ vectors). Can run locally but requires more resources. Many features but higher complexity. For local-first small projects, lighter options suffice.
SQLite + extensions | Embedded (C) | Various (via vector-search extensions) | Limited (SQL-based filtering) | SQLite (JS via better-sqlite3, etc.) | An unconventional but viable route: use SQLite with a vector extension. This keeps everything in one file and one process. However, performance is typically lower and setup can be fiddly.
In our local-first VS Code extension scenario, LanceDB and Qdrant stand out:
- LanceDB can be bundled with the extension's back-end (if using Python, just install the package). It stores data on disk by default (LanceDB vs Qdrant, Medium), so the index persists across sessions. Being embedded, it avoids network overhead – embedding vectors are added and queried in-process, which is efficient for large vectors where serialization cost is non-trivial. LanceDB's use of IVF-PQ means it's optimized for disk and memory efficiency, at the cost of a small reduction in recall (due to quantization) – suitable for handling bigger codebases without running out of RAM.
- Qdrant requires running a service, but it's quite developer-friendly (you can embed it by running a background process, or even start it programmatically via a subprocess). It uses HNSW, which tends to be very fast and high-recall in-memory (LanceDB vs Qdrant, Medium). Qdrant supports storing arbitrary payloads with vectors, which maps well to our metadata needs (file names, etc., can be stored and filtered in queries) (The Data Quarry) – see the sketch below. The absence of an in-process mode means every query goes through an HTTP/gRPC call – for many use cases this overhead is negligible (the vector search dominates time), but for extremely low-latency needs or very frequent queries, an embedded DB might be preferable.
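To make the payload point concrete, here is a minimal sketch of upserting and filtering code chunks with the qdrant-client Python package. The collection name, payload fields, and placeholder vectors are illustrative, not part of the original design:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")     # locally running Qdrant service

# One-time collection setup; the vector size must match the embedding model.
client.create_collection(
    collection_name="code_chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector = [0.0] * 1024                                  # placeholder for a real embedding
client.upsert(
    collection_name="code_chunks",
    points=[PointStruct(id=1, vector=vector, payload={
        "file": "src/utils.py", "function": "parse_config", "language": "python",
    })],
)

# Search restricted to Python chunks via a payload filter.
hits = client.search(
    collection_name="code_chunks",
    query_vector=vector,
    query_filter=Filter(must=[FieldCondition(key="language", match=MatchValue(value="python"))]),
    limit=5,
)
```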
Other notes:
- Hybrid search support: Some vector DBs have built-in hybrid search (combining keyword and vector queries). Weaviate, for example, can index a text field and allow BM25 queries alongside vector similarity in a single query. Qdrant has recently added a new Query API enabling hybrid search workflows (Hybrid Search Revamped - Building with Qdrant's Query API) (though it might involve storing sparse embeddings or using a special payload field for keywords). If hybrid search is not directly supported, you can always implement it at the application layer (as discussed in the Search Techniques section).
- Resource usage: Embedded databases like LanceDB and Chroma run inside the extension's process (or its Python child process), which means they share memory with your environment. They are optimized for local use, but keep an eye on memory if indexing a huge codebase – HNSW indices in particular can be memory-hungry (though they can often be persisted to disk). In contrast, an external DB (Qdrant, Weaviate) might consume a few GB of RAM but it's in a separate process, which could be advantageous on multi-core machines.
Integration tip: For a VS Code extension, an embedded DB accessible via Python is straightforward – the extension can call Python scripts for queries. For a pure Node implementation, an external service (Qdrant/Weaviate) accessible via HTTP might be easier, unless a Node native vector search library is used. Developer experience is key: LanceDB's Python API, for example, allows you to do `tbl.search(vector).limit(5).to_df()` to get results quickly. Qdrant's REST API allows filtering by metadata in the query payload JSON. Both approaches can work; many projects start with an embedded DB for simplicity and may move to a client-server DB if scaling up or splitting workload across machines.
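As a concrete illustration of the embedded-DB workflow, here is a minimal LanceDB sketch. The table name, fields, and placeholder vectors are illustrative; treat it as a sketch of the documented Python interface rather than a drop-in implementation:

```python
import lancedb

db = lancedb.connect("./.semantic-index")        # on-disk index inside the workspace

vec = [0.0] * 1024                               # placeholder embeddings; real vectors
query_vector = [0.0] * 1024                      # come from the embedding step

# Create (or overwrite) a table from chunk records; the schema is inferred from the data.
table = db.create_table(
    "code_chunks",
    data=[{"vector": vec, "file": "src/utils.py", "function": "parse_config", "code": "..."}],
    mode="overwrite",
)

# Later: append freshly embedded chunks (same record shape).
table.add([{"vector": vec, "file": "src/auth.py", "function": "validate_token", "code": "..."}])

# Query: nearest neighbours, optionally filtered by metadata.
results = (
    table.search(query_vector)
         .where("file = 'src/auth.py'")          # SQL-style metadata filter
         .limit(5)
         .to_pandas()
)
print(results[["file", "function"]])
```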
We touched on AST-based chunking above; here we dive deeper into techniques for indexing source code effectively:
- Optimal Chunk Sizing: If chunks are too large (e.g. whole files), the embeddings may dilute and searches become less precise (because a file may contain many unrelated functions). Too small (e.g. every line separately) and you lose context and overwhelm the system with too many vectors. Function- or method-level granularity is often a sweet spot (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2). For very large functions or classes, you might further break them down (perhaps by logical blocks or paragraphs of code), but keep the break aligned to code structure – e.g. split a long function into chunks by its internal comment sections or blank lines. Always ensure each chunk “makes sense” on its own and can answer some potential question.
- Including Documentation: If the code has docstrings or comments, these are extremely valuable for semantic search. You have a few options: (a) Append docstrings/comments to the code chunk text before embedding – this way the embedding contains both what the code does and what the comments say, improving searchability by natural language queries. (b) Index docstrings as separate chunks linked to their function. For example, the docstring could be one vector (good for answering “what does function X do?”) and the code is another. A hybrid approach is also possible: use multi-field embeddings where code and docs are combined but possibly with some demarcation.
- Metadata Tagging: As mentioned, store identifiers like function name, class name, and file path as metadata. This not only helps in result display but can be leveraged for queries – e.g., if a user query contains something that looks like a function name or file name, the search system can explicitly filter or boost those matches. For instance, if the query is “function parse_config”, the system might first attempt a direct symbol lookup (exact match for a function named “parse_config”) to show as a top result, in addition to semantic results. Metadata can also store language info (if it's a multi-language repo), so you could filter to only Python code if needed.
- Language-Specific Handling: Different languages might need slight tweaks. For example, in Java or C#, method names might not be unique unless qualified by class; so storing the fully qualified name (“ClassName.method”) could help. In languages like JavaScript, where code can be more free-form, it might be useful to chunk at the level of individual exportable functions or even analyze the call graph (if building advanced features to find usage references).
- Indexing Frequency: For active development, continuously updating the index is crucial. You might implement a debounce or rate-limit on indexing – e.g., wait until a file hasn't been edited for 5 seconds before re-indexing it, to avoid embedding on every keystroke (see the debounce sketch after this list). The VS Code extension can also provide a manual “Reindex Project” command in case something goes out of sync or for initial indexing.
- Scalability Considerations: For a moderately large repo (say, a few thousand files), the number of chunks could be tens of thousands. A vector DB like Qdrant or LanceDB can handle this in memory easily. If you anticipate extremely large codebases, consider strategies like sharding the index by directory or module, or building a hierarchical search (first search for relevant files via file-level embeddings, then within those files search for specific functions). However, these complexities are usually not needed until hitting very large scales (100k+ chunks).
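A minimal, framework-agnostic sketch of the debounce idea; in the real extension the `notify_change` calls would come from VS Code's file-watcher events relayed to the Python back end:

```python
import threading

class DebouncedIndexer:
    """Re-index a file only after it has been quiet for `delay` seconds."""

    def __init__(self, reindex_file, delay=5.0):
        self.reindex_file = reindex_file   # e.g. the per-file re-index function sketched earlier
        self.delay = delay
        self._timers = {}
        self._lock = threading.Lock()

    def notify_change(self, path: str):
        """Call this on every save/change event for `path`."""
        with self._lock:
            timer = self._timers.get(path)
            if timer:
                timer.cancel()             # file is still being edited: push back the deadline
            timer = threading.Timer(self.delay, self._run, args=(path,))
            self._timers[path] = timer
            timer.start()

    def _run(self, path: str):
        with self._lock:
            self._timers.pop(path, None)
        self.reindex_file(path)
```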
With the index in place (chunks + embeddings + metadata), the extension can answer user queries by retrieving relevant code snippets. Here we focus on improving the search quality through various techniques: dense semantic search combined with sparse keyword search (hybrid), result re-ranking with more expensive models, and context-aware querying that leverages additional information (like the user’s current context or conversational history).
A pure dense vector search means: take the user’s query (which could be natural language like “how is the config file parsed?” or a code snippet), encode it with the same embedding model, then find nearest neighbor vectors in the code embedding index. This will find code that is semantically similar to the query. However, solely relying on embeddings can sometimes miss exact matches for specific keywords (e.g. variable or function names) or give slightly off results if the query is ambiguous. That’s where hybrid search comes in: combining dense search with traditional keyword (lexical) search.
Hybrid search approaches:
- Parallel search: Do a vector search to get N candidates, and also do a keyword search (e.g. using an inverted index or even VS Code's built-in text search) to get N candidates, then merge the results. This can catch cases where the query uses specific terminology that might not be captured in the same way by the embedding. For example, searching for “database URL configuration” might benefit from one method that explicitly has `database_url` in code (keyword hit) and another that is semantically similar but named differently (vector hit).
- Sparse embeddings: Another approach is to convert the query (and documents) into a sparse vector as well (like a bag-of-words or TF-IDF vector) and store that in the vector DB as metadata. Some vector DBs (like Qdrant with its BM25 or SPLADE-style features (BM42: New Baseline for Hybrid Search - Qdrant)) support storing a sparse embedding (for text keywords) alongside the dense one, and you can query both. Essentially, this uses a transformer like SPLADE to generate sparse keyword importance, enabling a single query to retrieve by semantic and lexical signals combined (Reranking in Hybrid Search - Qdrant).
- Application-level hybrid: If the vector DB doesn't directly support hybrid queries, you can implement it by scoring. For each candidate from vector search, compute a BM25 score (using an offline index or a library like Lunr in JS or Whoosh in Python) for the query vs the candidate's code text. Combine the scores (e.g. weighted sum or rank fusion) to reorder results. Conversely, you can take top keyword hits and use the embedding model to check their semantic similarity to the query, then merge. A fusion sketch follows this list.
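As an illustration of application-level hybrid search, the sketch below fuses a dense ranking with a BM25 ranking using Reciprocal Rank Fusion (RRF). It assumes the `rank_bm25` package, chunk records carrying an `id` field, and a `vector_search` callable provided by whichever vector store is in use; `rrf_k=60` is the conventional constant:

```python
from rank_bm25 import BM25Okapi

def hybrid_search(query, query_vector, chunks, vector_search, k=10, rrf_k=60):
    """Merge dense and keyword rankings with Reciprocal Rank Fusion."""
    # Dense side: `vector_search` returns chunk IDs ordered by similarity.
    dense_ids = vector_search(query_vector, limit=k)

    # Sparse side: BM25 over the raw chunk text (tokenised naively here).
    corpus = [c["code"].split() for c in chunks]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.split())
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    sparse_ids = [chunks[i]["id"] for i in order]

    # RRF: a chunk's fused score is the sum of 1/(rrf_k + rank) over both lists.
    fused = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranking):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)[:k]
```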
In practice, hybrid search often yields better results than either alone (What embedding models work best for code and technical content? - Zilliz Vector Database). Especially in code, certain searches are highly precise (you want where a specific function name appears) while others are conceptual (“where do we handle user authentication?”). The combination ensures you don’t miss either case (What embedding models work best for code and technical content? - Zilliz Vector Database). Tools like Weaviate have demonstrated the effectiveness of combining BM25 with vector similarity for code search (What embedding models work best for code and technical content? - Zilliz Vector Database).
Even after hybrid retrieval, the top results might not always be perfectly ordered by relevance. Re-ranking is a step where a more expensive but more accurate model is used to sort the candidate results. Typically, you take the top-k (say 5 or 10) results from the initial search and feed each (with the query) into a cross-encoder model, which scores relevance. A cross-encoder (like `cross-encoder/ms-marco-MiniLM-L-12-v2`, or late-interaction models like ColBERT) takes a query and a document as input together and produces a relevance score (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). Because it looks at the pair with full attention, it can be more precise than embeddings that were produced independently.
In our context, a cross-encoder can be an NL-to-NL model (if your chunks have docstrings or comments, it might focus on those) or NL-to-code model if available. If none specifically exist, using a general cross-encoder (trained on MSMARCO or similar) still helps, as it will pick up literal query-term matches and context overlap better.
Re-ranking improves precision at the top: you ensure the very first result is the best it can be, at the cost of some extra computation. Since k is small, this is usually fine to do on the fly in the extension (even with a moderately large model). Sankalp Shubham notes in his code QA project that “embedding search is not as effective [alone] and one needs to follow additional steps in the retrieval pipeline like ... re-ranking” (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2) (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). Cross-encoders are a common choice for that second-stage ranker (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2), as they often significantly boost accuracy albeit with added latency.
Implementation: The extension can either call a local cross-encoder model (perhaps via the same Python process used for embeddings, if it can load two models, or a separate one) or call an API (e.g. Cohere Rerank API or OpenAI if such service available) to score the results. Each result (code chunk) along with the query is input, and we get a score. We then sort by this score to reorder the final list shown to the developer.
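A minimal local re-ranking sketch using the sentence-transformers `CrossEncoder` class; the model name is one common publicly available cross-encoder, and any query/passage relevance model could be swapped in:

```python
from sentence_transformers import CrossEncoder

# Small general-purpose cross-encoder trained on MS MARCO-style relevance data.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, candidates, top_k=5):
    """Re-order the top candidates from the first-stage retrieval."""
    pairs = [(query, c["code"]) for c in candidates]
    scores = reranker.predict(pairs)          # one relevance score per (query, chunk) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```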
A powerful feature for a code assistant is to use context from the developer’s environment to improve search results:
- Active file context: If the developer invokes a search while working in a particular file or function, that context can hint at what they need. For example, if they are editing `auth.py` and search for “validate token”, the system might boost results from the same module or results that are related to authentication, under the assumption they want code relevant to their current work. Technically, this could be done by incorporating the file name or some keywords from the open file into the query embedding (e.g. concatenate a few important tokens from the file with the query).
- Query classification (code vs language): Determine if the query looks like code (e.g. a snippet or error message) or a natural language question. If it's code, you might want to search by code similarity (perhaps even using a different embedding model optimized for code-to-code similarity). If it's a question, use the NL-to-code embedding model. Some pipelines route queries differently based on heuristics or use multi-modal embeddings.
- Conversational context: If this extension is part of a chat assistant (like GitHub Copilot's “Chat” or Cody), then previous questions or answers could refine the search. For example, if previously the user asked about class `UserManager`, and now just asks “Where is the update method implemented?”, the system can recall that context and search for “update” within the `UserManager` class. This requires some state management in the extension or integration with the chat LLM to disambiguate queries.
- Function call context: A very useful context is if the user highlights or right-clicks a function call and asks, “Where is this defined?” or “Find references of this.” The extension can intercept that and directly use language server or AST info for an exact answer (this is more deterministic than semantic search). However, for “fuzzy” questions like “how is this function used throughout the codebase?”, semantic search with embeddings can complement the exact reference search by finding usage patterns that might not literally match the name (e.g., via interface implementations, etc.).
In practice, combining symbolic analysis with semantic search yields the best developer experience. Use the precise tools when you can (like VS Code’s built-in “Go to Definition” or ripgrep for plain text search) and fall back to semantic search when the query is high-level or spans multiple concepts. The VS Code extension can expose multiple commands: e.g. “Semantic Search in Codebase” (which uses all the fancy embedding stuff) and still let developers use traditional search when needed. Over time, as confidence in semantic results grows, the two can even be merged into one interface.
Finally, you can employ query transformations like [HyDE (Hypothetical Document Embeddings)](https://arxiv.org/abs/2212.10496) (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2), where the system first asks an LLM to hallucinate an answer or an explanation for the query, then uses that as the embedding query to the vector DB. This can enrich the query representation (especially for natural language questions). For example, for “How do we validate JWT tokens?”, a HyDE step might produce a pseudo-answer “The code likely decodes the JWT and checks the signature using a secret key”, which when embedded could better match the actual code (which might contain `decode` and `verifySignature` methods). This two-stage retrieval (LLM → vector search → then maybe another LLM for final answer) is advanced but can boost recall significantly for complex queries. It's an optional add-on if you integrate an LLM in the loop.
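A sketch of the HyDE step, with `generate` standing in for whatever LLM call (local or API) the extension integrates, and `embed`/`vector_search` being the same components used for normal queries:

```python
def hyde_search(question, generate, embed, vector_search, k=5):
    """HyDE: embed a hypothetical answer instead of the raw question.

    `generate(prompt)` is a placeholder for any LLM call (local or API);
    `embed` and `vector_search` are the same components used elsewhere.
    """
    prompt = (
        "Write a short, plausible code-level explanation that would answer: "
        f"{question}"
    )
    hypothetical = generate(prompt)   # e.g. "The code decodes the JWT and verifies the signature..."
    return vector_search(embed([hypothetical])[0], limit=k)
```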
Constructing a VS Code extension for this involves bridging the gap between the editor UI and the back-end logic we’ve described. We want a smooth developer experience: minimal intrusion, quick results, and useful presentation of code snippets. Let’s outline some best practices and approaches:
If using local models (for embeddings or re-rankers), the extension should manage these without blocking the UI. Running heavy computations in the extension’s main thread is a no-go (it would freeze the editor). Instead, consider one of these:
- Separate Process: The extension can spawn a child process (e.g. a Python script or a Node subprocess) to handle indexing and querying. Communication can be via IPC, sockets, or simple file I/O. For example, on activation, the extension launches `index_server.py`, which loads the embedding model and vector store. When the user issues a search, the extension sends the query to this server (perhaps via an HTTP call to `localhost` if the server runs an HTTP API, or over the child process's stdin/stdout). This isolates the heavy CPU/RAM usage.
- WebAssembly: In some cases, small models can be run via WASM in Node. Projects like `onnxruntime-web` or `tensorflow.js` could run a distilled embedding model directly in Node. This avoids any external dependency but is limited to relatively small models unless the user has a powerful machine – and even then, Python tooling is generally ahead.
- Lazy Loading: Don't initialize models until needed. On extension activation, you might begin indexing (which requires the embedding model). But if the model is huge and the user hasn't explicitly used the feature yet, you might delay loading until the first search command is invoked, informing the user of a one-time setup delay.
Also, provide settings for the user to choose the embedding model or vector DB. For instance, a setting `"semanticSearch.embeddingModel": "openai-ada"` vs `"huggingface-BGE"`. The extension can respect this and either call OpenAI's API (requiring the user to input an API key in settings) or load the local model accordingly.
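On the back-end side, the setting could select the embedding backend roughly like this (a sketch assuming the OpenAI Python client and sentence-transformers; the setting values mirror the examples above and the chosen model names are illustrative):

```python
from typing import Callable, List

def make_embedder(setting: str) -> Callable[[List[str]], List[List[float]]]:
    """Pick an embedding backend based on the 'semanticSearch.embeddingModel' setting."""
    if setting == "openai-ada":
        from openai import OpenAI                  # requires OPENAI_API_KEY in the environment
        client = OpenAI()

        def embed(texts):
            resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
            return [d.embedding for d in resp.data]
        return embed

    # Default: a local Hugging Face model, e.g. the "huggingface-BGE" option.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    return lambda texts: model.encode(texts, normalize_embeddings=True).tolist()
```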
Leverage VS Code’s events:
- Use `FileSystemWatcher` to track changes. For each event (create, change, delete), schedule an index update. On delete, remove entries from the vector DB; on create or change, schedule re-embedding of that file's content. Also handle file renames (which usually arrive as create+delete events, so you may lose the connection between old and new entries unless you handle them atomically).
- Use the extension's state to store an indexing queue. Possibly maintain an in-memory map of file paths to “last indexed timestamp” vs “last modified timestamp”, so you know which files are outdated.
- If the repo is huge, initial indexing might take a while. It's good to inform the user (e.g. show a progress notification or a status bar item “Indexing codebase… 42%”). VS Code has an API for progress notifications that you can use to report status as you index file by file. After initial indexing, updates are quick.
- For modularity, the extension's indexing could be abstracted into a library that others can reuse. This ties back to design – maybe the core logic is not written inside the extension code but in a separate module that could also be run standalone (for indexing remote repos on a server, for example).
How will the developer interact with this search? Common patterns:
- Command Palette: Provide a command (e.g. “Semantic Code Search: Ask a question”) that opens an input box. The user types a natural language query, presses Enter, and the results appear in a custom view or panel.
- Tree View in Sidebar: You can register a TreeDataProvider that shows search results as a tree (or list). Each result could be a tree item representing a code snippet – with a label such as `function name – file path`, and a tooltip or description with a preview of the code. Clicking the item opens the file at the relevant location. VS Code allows selecting a range on open (using `window.showTextDocument` with a selection range to highlight the snippet).
- Webview Panel: For more elaborate UI (syntax highlighting of snippets, buttons to copy code, etc.), a Webview could be used. The extension would launch a webview that displays the query bar and the results nicely formatted (perhaps even with diff-like context). This gives more control over styling at the cost of more complexity (you're basically writing a mini webpage within VS Code).
- Inline UI: Another idea is to integrate with VS Code's Search UI, but that is more tailored to text search. A custom panel is more flexible.
Make sure to present metadata clearly – e.g., include the file path for each snippet result (to avoid confusion if multiple functions named `init` show up). If possible, show a few lines of context around the snippet. Some implementations show a few leading/trailing lines or allow expanding the snippet in place.
Interactivity: Provide actions like “Open File”, “Copy snippet”, or even “Ask follow-up” if integrated with an AI assistant. For instance, you might allow the user to right-click a result and trigger a secondary action, like “Explain this code” which then sends the snippet to an LLM for explanation.
Performance: Ensure the UI remains responsive. If a query is taking a while (say, loading model or doing re-rank), provide feedback (spinner or message “Searching…”). If using a webview, you can send partial results as they come (e.g. stream in results as they are found and embed the snippet gradually).
Since code search is often iterative (user may refine query), allow quickly running a new query. Perhaps keep the last few queries in a history or allow the user to modify the query text from a results view and rerun.
Lastly, integrate with VS Code’s native features where possible. For example, if a search result is essentially “find references” or “go to definition”, you might leverage the existing definitions provider or references provider if available. Conversely, your extension could register a VS Code “Text Search Provider” so that it integrates into the native search sidebar (this is an advanced integration where VS Code allows extensions to provide search results).
One major reason for a local-first design is to keep source code private. Developers (or companies) may be uncomfortable sending their code to an external API for embedding or search. By using local models and databases, code stays on the machine. As Sourcegraph’s team noted when evaluating embeddings for their Cody assistant, relying on an API has downsides: “Your code has to be sent to a 3rd party (OpenAI) for processing, and not everyone wants their code relayed this way” (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). They also highlight the maintenance burden of keeping embeddings up-to-date and the storage cost as code scale grows (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2).
From a privacy perspective, local is clearly better. However, local models can be large and slower. A sensible approach is to offer both options:
- Default to local embedding for privacy (especially if you can ship a reasonably small but good model or leverage what the user has).
- Optionally allow using a cloud API for improved quality or speed, if the user opts in with an API key. This could be behind a setting like `"semanticSearch.useOpenAI": true` with clear warnings. The user can decide query-by-query as well (maybe a checkbox “Use OpenAI for this query”).
- Some orgs might have an internal embedding service – the extension could be configured to call a self-hosted model on a server (still not leaving the company's domain).
Performance considerations:
- Speed: OpenAI's embeddings are actually very fast (their infrastructure is optimized), but you pay network latency. A local model on CPU might be significantly slower for large inputs. Using a GPU, if available, is recommended – you might allow configuring the device, or even detect whether the user has a CUDA-enabled GPU and use it. Tools like ONNX Runtime or GGML can sometimes run smaller models on CPU with decent speed by using int8 quantization.
- Memory: Running a 1B+ parameter model locally can consume several GB of RAM or GPU VRAM. If that's not feasible, consider using a smaller model or offloading heavy tasks to when the user is idle (e.g. do initial indexing overnight or when VS Code is not busy).
- Index size: Each code chunk produces a vector. A 1536-dim vector of float32 is about 6 KB, so 10k such vectors is ~60 MB – not bad. But if a repository has 1M chunks (a very large monorepo), that's ~6 GB, which is problematic for memory. In such cases, strategies like product quantization (as LanceDB uses by default) or storing vectors in float16/bfloat16 can halve or quarter the size at some accuracy cost. You can also periodically prune the index (remove rarely used parts, or compress older embeddings). For most local projects, index size is not a big issue, but it's worth noting that naive embeddings can become heavy at extreme scales (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2).
- Updating vs re-computing: If using a cloud API, every code change means calling the API again (incurring cost each time). This could discourage frequent updates. A local model might be slower per call but “free” to run unlimited times. One compromise is to use a cloud embedding for the initial full index (fast and done once), then use a smaller local model for incremental updates. However, mixing embeddings from different models is generally not recommended (the vectors live in different spaces). So it is better to stick to one embedding model if possible for consistency, and to design indexing such that updates are batch processed (to amortize cost if using an API).
- Alternative approaches: The Sourcegraph blog mentioned they moved away from embeddings for Cody at large scale (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). While embeddings are powerful for semantic search, very large enterprises might use other techniques (like on-the-fly search with an LLM scanning code upon query). For a local developer tool, though, pre-indexing with embeddings is still a very practical solution given the codebase is of moderate size and hardware is limited.
In summary, local vs cloud is a trade-off: Local ensures privacy and control, while cloud can offer turnkey quality and less local resource usage. This extension’s philosophy is local-first, so we prioritize using open models and on-disk indexes. In cases where developers want the absolute best results and are willing to send code out, we make it an explicit choice. As a developer building this tool, be transparent about what data leaves the machine (if any). For instance, if using OpenAI API, have a toggle that by default is off, and log or display when code is being sent for embedding.
While our focus is a single local repository, the architecture can be extended to remote repositories or a multi-repo environment:
- GitHub/GitLab Integration: You could allow the extension to connect to a user's GitHub account and index a repo that isn't cloned locally. VS Code's GitHub Repositories extension (which opens repos virtually) could be a starting point – once the virtual file system is available, you can index it similarly. Alternatively, use the GitHub API to fetch repository contents (the GraphQL API can list files and fetch file content). The indexing pipeline (chunking, embedding) remains the same, but you'd likely run it in a separate process (maybe on a server if the repo is large or to avoid tying up VS Code). After indexing, the vector store could either live locally (in which case you download all content to index – effectively cloning the repo anyway), or live on a server (allowing you to query without pulling everything).
- Scalability for multiple repos: If a developer wants to search across all their projects, the index could be partitioned by project and stored perhaps in a single vector DB with a metadata field for “repo name”. Queries could then either search all repos or filter to one. This is similar to how Sourcegraph Cody aims to search across many repositories. One challenge is keeping remote indexes updated – you might use webhooks from Git providers to notify of changes, or periodically re-sync the repo.
- Server-side indexing service: One pattern is to separate the VS Code extension (which handles UI and querying) from the indexing service (which could run on a local or remote server). The extension then becomes a client to this service. For instance, a company could run an internal service that indexes all company repositories; the developer's VS Code extension queries that service for results (keeping computation centralized). In our local-first approach, you might still run that service locally (for your own large monorepo, for example).
- Authentication and access: If integrating with GitHub, leverage VS Code's authentication APIs or ask the user for a PAT (personal access token) to fetch private repo content. Always respect rate limits and privacy (don't accidentally push their code anywhere).
- Index transfer: If the user has the repo locally but also wants to use a remote index (say, an index precomputed on a powerful machine), you might allow importing/exporting the vector index (LanceDB and others can export to files). This way, an organization could precompute embeddings for a repo and share the index file with team members, who then load it in the extension – saving each developer from doing heavy indexing themselves.
- Real-time collaboration: In remote scenarios, consider whether multiple users are updating the code. An index on a server could update with each git push, and clients fetch updates. This goes beyond the single-user case but is a natural extension if this tool becomes widely used in a team.
Example (GitHub remote repo): A possible workflow – the extension offers an “Add Remote Repository for Search” input. The user enters “owner/repo” and an API token. The extension then either:
- Clones the repo in a temp location and indexes it just like a local one (then possibly deletes the clone). This is simple but requires pulling the whole repo.
- Or calls an API to list files and fetches each file's content via the API (subject to a size limit). This saves bandwidth if the user only queries a subset. However, for embeddings you do need the actual content at least once in order to embed it.
Given that, cloning might actually be easiest, using Git itself or GitHub’s REST to get a zip of the repo.
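A sketch of the clone-and-reuse approach, assuming `index_directory` is the same entry point the extension already uses for the local workspace (token handling is simplified here; a credential helper would be preferable in practice):

```python
import subprocess
import tempfile
from pathlib import Path

def index_remote_repo(repo: str, index_directory, token: str | None = None):
    """Shallow-clone 'owner/repo' into a temp dir and reuse the local pipeline.

    `index_directory(path)` is the same chunk -> embed -> store entry point
    used for the local workspace.
    """
    url = (f"https://{token}@github.com/{repo}.git" if token
           else f"https://github.com/{repo}.git")
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", url, tmp], check=True)
        index_directory(Path(tmp))   # index exactly as for local code
```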
The main point is that modularity in design will pay off here. If your indexing pipeline is cleanly separated, you could have one implementation for local FS and another for remote (that perhaps uses a different file reader interface). The rest (chunk → embed → store) doesn’t change.
If planning for remote, it might even make sense to choose a vector DB that can be accessed both locally and via network. For example, Qdrant could run on a central server with all indexes for all repos, and the extension queries it over HTTP. In contrast, LanceDB is local file-based – you’d need to distribute those files or have a syncing mechanism.
Security note: If searching remote code (especially open-source), the user might bring in indexes of code they didn’t write. Be mindful of not executing or trusting that code. The search itself should be safe, but if you did any analysis beyond parsing (like running test cases for a query), that could pose risks.
Building a local-first semantic code search tool in VS Code is a challenging but rewarding endeavor. By carefully designing the system to use code structure (AST-based chunking) and modern retrieval techniques (vector embeddings with hybrid search and re-ranking), we can empower developers to quickly find and understand code within their projects.
Key takeaways for developers implementing such a system:
- Leverage code semantics: Use AST parsing (via Tree-sitter or language-specific libraries) to chunk code at logical boundaries (functions, classes) (An attempt to build cursor's @codebase feature - RAG on codebases - part 1/2). This yields more meaningful embeddings and search results than arbitrary chunks.
- Choose the right embedding model: There are excellent open-source models that rival proprietary ones for embedding code and technical text (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2). Evaluate models on code-like data; consider dimensions and inference speed in addition to raw accuracy.
- Opt for local vector stores for privacy: Tools like LanceDB or Qdrant can be run locally and integrated into VS Code, keeping code secure on the developer's machine. LanceDB's embedded design avoids network overhead (LanceDB vs Qdrant, Medium), while Qdrant offers a lean Rust service with powerful search capabilities (Vector databases (1): What makes each one different? • The Data Quarry).
- Implement hybrid search: Combine dense embeddings with keyword search to improve relevance (What embedding models work best for code and technical content? - Zilliz Vector Database). Many code queries benefit from exact token matches – don't rely solely on vectors if you can incorporate lexical cues.
- Use re-ranking for precision: A lightweight cross-encoder model to re-rank top results can significantly boost the quality of the first results (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2), ensuring the developer sees the best answer at the top.
- Design the extension for responsiveness: Offload heavy tasks to background processes, and update the index incrementally. Provide clear UI feedback during indexing/search, and make the search results actionable (open file at location, etc.).
- Respect privacy and scale considerations: Default to not sending code to the cloud (An attempt to build cursor's @codebase feature - RAG on codebases - part 2/2); if offering that option, be transparent. Plan for large projects by using efficient indexing (IVF/HNSW) and possibly sharding or summarizing extremely large files.
- Prepare for future extensions: With a solid local system in place, extending to remote repositories or a team-wide index is mostly a matter of changing the data source. The core search logic remains the same, which is a testament to a good modular design.
By following these practices, you can create a VS Code extension that acts as a smart “code librarian,” retrieving the right snippet or answer from your codebase almost as naturally as a search engine fetches answers on the web – all while keeping the workflow fast, local, and secure. Happy coding and searching!
Sources:
- Sergei Petrov – LanceDB vs Qdrant (Medium, Dec 2023): Differences in design (embedded vs service) and indexing algorithms (IVF-PQ vs HNSW).
- Zilliz – What embedding models work best for code and technical content?: Highlights code-trained models (CodeBERT, UniXcoder) and their advantages; suggests hybrid search with BM25 + vectors for best results.
- Sankalp Shubham – An attempt to build cursor's @codebase feature - RAG on codebases, parts 1/2 and 2/2 (LanceDB Blog, Nov 2024): Describes the codebase Q&A system “CodeQA” using AST chunking and LanceDB, and explains why structured chunking beats naive splitting. Part 2 covers improvements like HyDE, BM25, and re-ranking, plus embedding choices; it also quotes Sourcegraph's perspective on embeddings and privacy.
- OpenAI Documentation – Vector embeddings: States embedding dimensions for the newer models (1536 for small, 3072 for large) and the 8192-token input limit (for Ada-002 and others).
- Qdrant Tech Blog – Hybrid Search Revamped: Describes Qdrant's approach to hybrid search via their Query API, combining vector and keyword search.
- The Data Quarry – Vector databases (1): What makes each one different? (2023): Notes on LanceDB's embedded Rust design and focus on on-disk performance; Qdrant's developer experience and Rust performance; mentions hybrid search feature progress in Qdrant.
- Sourcegraph Blog – How Cody Understands Your Codebase (Feb 2024): Describes large-scale code search strategies. Notably, they initially used embeddings for multi-repo context but then moved away due to privacy and scaling issues, favoring on-demand techniques. (Cody is Sourcegraph's AI coding assistant.)
- Hugging Face – Model cards and MTEB leaderboard: BAAI's BGE models rank high on MTEB (embedding benchmark); Nomic's embedding model claims to match Ada-002 in some respects while being open. These inform the model comparisons above.
- VS Code Extension API Reference: Documentation on file system watchers and tree views (used for implementation details, not directly cited above).