How to set up a Meilisearch instance for Hackerverse

Installation

You'll need multiple simple tools to refine the raw arrow files and NDJSON files.

Meilisearch

A lightning-fast search engine API bringing AI-powered hybrid search to your sites and applications. It is also one of the most starred search engines on GitHub, just behind Elasticsearch.

You can download it from the releases page.

jq

jq is a lightweight and flexible command-line JSON processor. It will be used to transform the NDJSON streams we extract from the raw arrow files. You can install it using your package manager or download it from the releases page.

meilisearch-importer

meilisearch-importer is a tool to import data into Meilisearch. We will use it to send the refined NDJSON streams to the instance in batches. You can install it via the cargo install meilisearch-importer command or download it from the releases page.

arrow2ndjson-lite

arrow2ndjson-lite is a tool to convert arrow files to NDJSON files. You can install it via the cargo install arrow2ndjson-lite command.

hackerverse-refining

hackerverse-refining is a tool to refine the raw arrow files of hackerverse into NDJSON files. You can install it following these instructions:

git clone https://github.com/Kerollmops/hackerverse-refining
cd hackerverse-refining
cargo install --path .
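
The import commands later in this guide read $MEILISEARCH_URL and $MEILISEARCH_API_KEY. Here is a minimal sketch for defining them, assuming a local instance on Meilisearch's default address and a placeholder key that you should replace. Both default values and the meili_health helper name are assumptions, not part of the Hackerverse setup.

```shell
# Assumed defaults: adjust them to match your own instance.
export MEILISEARCH_URL="${MEILISEARCH_URL:-http://localhost:7700}"
export MEILISEARCH_API_KEY="${MEILISEARCH_API_KEY:-change-me-master-key}"

# Start the server in another terminal with the same key, for example:
#   meilisearch --master-key "$MEILISEARCH_API_KEY"

# Hypothetical helper: succeeds once the instance answers on its health endpoint.
meili_health() {
  curl --silent --show-error --fail "$MEILISEARCH_URL/health"
}

# Usage: meili_health && echo "Meilisearch is up"
```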

Set up the Hackerverse indexes

We need to set up two indexes for Hackerverse. You can find the dataset on GitHub. I recommend reading the incredible blog post.

The posts index

We will call it hackerverse-posts and change its settings to support hybrid search (embeddings + keyword search). Top posts, which receive the larger bge-m3 embeddings, are the posts with a score >= 10.

{
  "searchableAttributes": [
    "title",
    "text",
    "url_title",
    "url_description",
    "url_snippet"
  ],
  "filterableAttributes": [
    "deleted",
    "dead",
    "score",
    "author",
    "timestamp",
    "emb_missing_page",
    "url_fetched_timestamp",
    "url_fetched_timestamp_modified",
    "url_fetch_err",
    "url_fetched_via",
    "url_description",
    "url_lang",
    "url_timestamp",
    "url_timestamp_modified"
  ],
  "sortableAttributes": ["score"],
  "proximityPrecision": "byWord",
  "facetSearch": false,
  "embedders": {
    "bge-m3": {
      "source": "ollama",
      "url": "http://localhost:11434/api/embed",
      "model": "bge-m3",
      "dimensions": 1024
    },
    "jina-v2-small": {
      "source": "ollama",
      "url": "http://localhost:11434/api/embed",
      "model": "jina/jina-embeddings-v2-small-en",
      "dimensions": 512
    }
  }
}
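
One way to apply a settings document like the one above is a PATCH request on the index's settings route. The sketch below assumes the JSON was saved to a file. The apply_settings helper and the posts-settings.json file name are hypothetical, and the environment variables are the ones used by the import commands in this guide.

```shell
# Hypothetical helper: PATCH a JSON settings file onto an index.
#   $1 = index name, $2 = path to the settings file
apply_settings() {
  curl --silent --show-error --fail \
    -X PATCH "$MEILISEARCH_URL/indexes/$1/settings" \
    -H "Authorization: Bearer $MEILISEARCH_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary "@$2"
}

# Usage, assuming the settings were saved as posts-settings.json:
#   apply_settings hackerverse-posts posts-settings.json
```

The request only enqueues a task, so the settings become effective once Meilisearch has processed it.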

The comments index

{
  "searchableAttributes": [
    "text"
  ],
  "filterableAttributes": [
    "deleted",
    "dead",
    "score",
    "parent",
    "author",
    "timestamp",
    "post"
  ],
  "sortableAttributes": ["score"],
  "proximityPrecision": "byAttribute",
  "facetSearch": false,
  "embedders": {
    "jina-v2-small": {
      "source": "ollama",
      "url": "http://localhost:11434/api/embed",
      "model": "jina-v2-small",
      "binaryQuantized": true,
      "dimensions": 512
    }
  }
}

Refining and streaming the data

Note that we will use the add-or-update operation because the different document components (title, text, metadata, and embeddings) all come from different sources. Meilisearch will merge all of these components into a single document, matching them by id.
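
To get a feel for what this merging produces, here is a purely local emulation with jq. It is only an illustration and not how Meilisearch implements it: partial documents sharing an id are grouped and their top-level fields combined. The merge_partials name and the sample records are made up.

```shell
# Local illustration only: group partial documents by id and merge
# their top-level fields into one document per id.
merge_partials() {
  jq -c -s 'group_by(.id) | map(add) | .[]'
}

# Usage with made-up records coming from three different "sources":
#   printf '%s\n' \
#     '{"id":1,"title":"Show HN: a demo"}' \
#     '{"id":2,"title":"Ask HN: a question"}' \
#     '{"id":1,"score":42}' \
#     '{"id":1,"author":"alice"}' \
#     | merge_partials
```

Post 1 ends up as a single document carrying its title, score, and author, which is the shape we want in the index after all the import steps have run.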

The posts embeddings

We will have to change some JSON field names to make them match the schema. You must download the post_titles.arrow file from the Hackerverse repository.

Once ready, let's extract the arrow content and convert it to NDJSON while renaming the text field to title. We pipe the result into meilisearch-importer, which sends the content in batches. Because the embeddings arrive in a later step, each document also sets regenerate to false for both embedders, which explicitly asks Meilisearch not to generate embeddings itself.

arrow2ndjson-lite post_titles.arrow \
  | jq '. | { id: .id, title: .text, _vectors: { "bge-m3": { "embeddings": null, "regenerate": false }, "jina-v2-small": { "embeddings": null, "regenerate": false } } }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -

Let's send the embeddings for the top posts. We will need the toppost-embs-ids.mat and toppost-embs-data.mat files from the Hackerverse repository.

embs2ndjson toppost-embs-ids.mat toppost-embs-data.mat --float-precision 6 --dimensions 1024 \
  | jq '{ id: .id, _vectors: { "bge-m3": { embeddings: .embeddings, regenerate: false } } }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -
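
Document additions are processed asynchronously, so a batch can be accepted and still fail later. A small sketch for spotting that, assuming the same environment variables as above and using the tasks route filtered by index and status (the failed_tasks helper name is hypothetical):

```shell
# Hypothetical helper: list the failed tasks of one index.
#   $1 = index name
failed_tasks() {
  curl --silent --show-error --fail \
    -H "Authorization: Bearer $MEILISEARCH_API_KEY" \
    "$MEILISEARCH_URL/tasks?indexUids=$1&statuses=failed"
}

# Usage: show only the task ids and their errors.
#   failed_tasks hackerverse-posts | jq '.results[] | { uid, error }'
```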

Let's send the smaller embeddings for the other posts. We will need the post-embs-ids.mat and post-embs-data.mat files from the Hackerverse repository.

embs2ndjson post-embs-ids.mat post-embs-data.mat --float-precision 6 --dimensions 512 \
  | jq '{ id: .id, _vectors: { "jina-v2-small": { embeddings: .embeddings, regenerate: false } } }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -
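
Once titles and embeddings are indexed, a hybrid query can be tried against the posts index. The sketch below is one possible shape, not a tuned setup: the hybrid_search helper name, the limit, and the semanticRatio of 0.5 are arbitrary assumptions. Since the query text is embedded at search time, it also assumes the Ollama server from the embedder settings is running with the matching model.

```shell
# Hypothetical helper: run a hybrid (keyword + semantic) search on the posts index.
#   $1 = query text, $2 = embedder name (defaults to bge-m3)
hybrid_search() {
  # Build the JSON body with jq so the query text is escaped correctly.
  body=$(jq -n -c --arg q "$1" --arg embedder "${2:-bge-m3}" \
    '{ q: $q, limit: 5, hybrid: { embedder: $embedder, semanticRatio: 0.5 } }')
  curl --silent --show-error --fail \
    -X POST "$MEILISEARCH_URL/indexes/hackerverse-posts/search" \
    -H "Authorization: Bearer $MEILISEARCH_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary "$body"
}

# Usage:
#   hybrid_search "open source search engine" | jq '.hits[] | { id, title }'
#   hybrid_search "open source search engine" jina-v2-small
```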

The posts metadata

We would like to have more metadata about the posts we just indexed. We will need the posts.arrow file from the Hackerverse repository.

Note that we refine the metadata a bit:

We rename the ts field to timestamp.
We do not send the author (u32) field yet, as we will send the author's name later.
We do not send the url (u32) field yet, as we will send the actual url later.

arrow2ndjson-lite posts.arrow \
  | jq '{ id: .id, deleted: .deleted, dead: .dead, score: .score, timestamp: .ts, emb_missing_page: .emb_missing_page }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -

The posts' author names

This part will be used to send the authors' names to Meilisearch. We will need the users.arrow file from the Hackerverse repository.

arrow2ndjson-lite users.arrow > users.ndjson

arrow2ndjson-lite posts.arrow \
  | jq '{ id: .id, author: .author }' \
  | author4post2ndjson --usernames users.ndjson \
  | jq '{ id: .id, author: .author }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -

The posts' URL metadata

This part will be used to send the URL metadata (title, description, snippet, language, and timestamps) to Meilisearch. We will need the url_metas.arrow file from the Hackerverse repository.

arrow2ndjson-lite posts.arrow > posts.ndjson

arrow2ndjson-lite url_metas.arrow \
  | url4post2ndjson --posts posts.ndjson \
  | jq '{ id: .id, url_title: .title, url_description: .description, url_snippet: .snippet, image_url: .image_url, url_lang: .lang, url_timestamp: .timestamp, url_timestamp_modified: .timestamp_modified }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -

The posts' URLs

This part will be used to send the posts' actual URLs to Meilisearch. We will need the urls.arrow file from the Hackerverse repository. It features the url, url_fetched_timestamp, url_fetched_timestamp_modified, url_fetch_err, and url_fetched_via fields.

arrow2ndjson-lite posts.arrow > posts.ndjson

arrow2ndjson-lite urls.arrow \
  | url4post2ndjson --posts posts.ndjson \
  | jq '{ id: .id, url: (.proto + "//" + .url), url_fetch_err: (if .fetched_err == "" then null else .fetched_err end), url_fetched_timestamp: .fetched, url_fetched_timestamp_modified: .fetched_modified, url_fetched_via: (if .fetched_via == "" then null else .fetched_via end) }' \
  | meilisearch-importer \
  --jobs 10 \
  --url $MEILISEARCH_URL \
  --api-key $MEILISEARCH_API_KEY \
  --index hackerverse-posts \
  --upload-operation add-or-update \
  --format ndjson \
  --files -
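
The jq filter in this step turns empty strings into null in two places. As a sketch of an equivalent, slightly shorter formulation, the conversion can be factored into a small jq function. The blank_to_null and url_filter names are made up for illustration, and the sample record in the usage comment is invented (including the assumption that proto looks like "https:").

```shell
# Equivalent sketch of the URL filter, with the empty-string handling
# factored into a jq helper function.
url_filter() {
  jq -c '
    def blank_to_null: if . == "" then null else . end;
    {
      id: .id,
      url: (.proto + "//" + .url),
      url_fetch_err: (.fetched_err | blank_to_null),
      url_fetched_timestamp: .fetched,
      url_fetched_timestamp_modified: .fetched_modified,
      url_fetched_via: (.fetched_via | blank_to_null)
    }'
}

# Usage on an invented record, handy for checking the output shape before importing:
#   printf '%s\n' '{"id":7,"proto":"https:","url":"example.com/a","fetched_err":"","fetched":1700000000,"fetched_modified":1700000001,"fetched_via":"direct"}' \
#     | url_filter
```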

The comments' metadata

We would like to have the metadata about the comments. We will need the comments.arrow file from the Hackerverse repository.