You'll need multiple simple tools to refine the raw arrow files and NDJSON files.
Meilisearch is a lightning-fast search engine API bringing AI-powered hybrid search to your sites and applications. It is also one of the most starred search engines on GitHub, just behind Elasticsearch.
You can download it from the releases page.
jq is a lightweight and flexible command-line JSON processor. It will be used to reshape the NDJSON streams we extract from the raw arrow files. You can install it using your package manager or download it from the releases page.
meilisearch-importer is a tool to import data into Meilisearch. We will use it to send the NDJSON documents to Meilisearch in batches.
You can install it via the cargo install meilisearch-importer command or download it from the releases page.
arrow2ndjson-lite is a tool to convert arrow files to NDJSON files.
You can install it via the cargo install arrow2ndjson-lite command.
hackerverse-refining is a tool to refine the raw arrow files of hackerverse into NDJSON files. You can install it following these instructions:
git clone https://github.com/Kerollmops/hackerverse-refining
cd hackerverse-refining
cargo install --path .
We need to set up two indexes for Hackerverse. You can find the dataset on GitHub. I recommend reading the incredible blog post.
We will call the first index hackerverse-posts and change the settings to support hybrid search (embeddings + keyword search).
Top posts are only the posts with score >= 10.
{
"searchableAttributes": [
"title",
"text",
"url_title",
"url_description",
"url_snippet"
],
"filterableAttributes": [
"deleted",
"dead",
"score",
"author",
"timestamp",
"emb_missing_page",
"url_fetched_timestamp",
"url_fetched_timestamp_modified",
"url_fetch_err",
"url_fetched_via",
"url_description",
"url_lang",
"url_timestamp",
"url_timestamp_modified"
],
"sortableAttributes": ["score"],
"proximityPrecision": "byWord",
"facetSearch": false,
"embedders": {
"bge-m3": {
"source": "ollama",
"url": "http://localhost:11434/api/embed",
"model": "bge-m3",
"dimensions": 1024
},
"jina-v2-small": {
"source": "ollama",
"url": "http://localhost:11434/api/embed",
"model": "jina/jina-embeddings-v2-base-en",
"dimensions": 512
}
}
}
We will call the second index hackerverse-comments and give it the following settings.
{
"searchableAttributes": [
"text"
],
"filterableAttributes": [
"deleted",
"dead",
"score",
"parent",
"author",
"timestamp",
"post"
],
"sortableAttributes": ["score"],
"proximityPrecision": "byAttribute",
"facetSearch": false,
"embedders": {
"jina-v2-small": {
"source": "ollama",
"url": "http://localhost:11434/api/embed",
"model": "jina-v2-small",
"binaryQuantized": true,
"dimensions": 512
}
}
}
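The two settings objects above can be sent to Meilisearch through its settings route (PATCH /indexes/{index_uid}/settings). The following is a minimal sketch, assuming the objects were saved as posts-settings.json and comments-settings.json (hypothetical file names) and that MEILISEARCH_URL and MEILISEARCH_API_KEY are exported. The apply_settings helper and its DRY_RUN switch are made up for this example.

```shell
# Sketch: push a settings file to an index through the settings route.
# Helper names, the DRY_RUN switch and the file names are assumptions for this example.

settings_endpoint() {
  # $1 = index name; a trailing slash on MEILISEARCH_URL is dropped
  printf '%s/indexes/%s/settings\n' "${MEILISEARCH_URL%/}" "$1"
}

apply_settings() {
  # $1 = index name, $2 = path to a JSON settings file
  endpoint="$(settings_endpoint "$1")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print what would be sent, without contacting the server
    echo "PATCH $endpoint < $2"
    return 0
  fi
  curl --fail --silent --show-error \
    -X PATCH "$endpoint" \
    -H "Authorization: Bearer $MEILISEARCH_API_KEY" \
    -H 'Content-Type: application/json' \
    --data-binary "@$2"
}

# Usage (file names are hypothetical):
# apply_settings hackerverse-posts posts-settings.json
# apply_settings hackerverse-comments comments-settings.json
```

Running with DRY_RUN=1 first is a cheap way to check that the URL and index name are the expected ones before changing anything.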
Note that we will use the add-or-update operation to update existing documents, because the different document components (title, text, metadata, and embeddings) all come from different sources. Meilisearch will merge all of the components into a single document.
We will have to rename some JSON fields to make them match the schema.
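To make the merge more concrete, here is a small local simulation with jq. It is not Meilisearch itself, only an illustration of partial documents sharing an id being combined field by field. The merge_by_id helper and the sample documents are made up for this example.

```shell
# Sketch: simulate locally how partial documents with the same id end up as one document.
merge_by_id() {
  # Reads NDJSON on stdin and writes one merged object per id (compact NDJSON).
  jq -s -c 'group_by(.id) | map(reduce .[] as $doc ({}; . + $doc)) | .[]'
}

# Three partial documents, two of them describing the same post (made-up values).
printf '%s\n' \
  '{"id": 1, "title": "Show HN: a tiny search engine"}' \
  '{"id": 1, "score": 42, "timestamp": 1700000000}' \
  '{"id": 2, "title": "Ask HN: favourite CLI tools?"}' \
  | merge_by_id
```

The output contains one line per id, and the line for id 1 carries the title, score, and timestamp fields together.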
You must download the post_titles.arrow file from the Hackerverse repository.
Once ready, let's extract the arrow content and convert it to NDJSON while renaming the text field to title.
We will also pipe this into meilisearch-importer to send the content in batches. Before that, we must explicitly ask Meilisearch not to generate embeddings, which is why the _vectors field sets embeddings to null and regenerate to false for both embedders.
arrow2ndjson-lite post_titles.arrow \
| jq '. | { id: .id, title: .text, _vectors: { "bge-m3": { "embeddings": null, "regenerate": false }, "jina-v2-small": { "embeddings": null, "regenerate": false } } }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
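Before running the full import, the same jq filter can be tried on a single hand-written record to inspect the document that would be sent. This is only a sketch: the sample record is made up, and the rename_title helper is a name introduced for this example.

```shell
# Sketch: run the title filter on one made-up record to inspect the resulting document.
title_filter='{ id: .id, title: .text, _vectors: { "bge-m3": { "embeddings": null, "regenerate": false }, "jina-v2-small": { "embeddings": null, "regenerate": false } } }'

rename_title() {
  # Reads NDJSON on stdin, writes compact NDJSON with text renamed to title
  jq -c "$title_filter"
}

echo '{"id": 123, "text": "Show HN: Hackerverse"}' | rename_title
```

The resulting document has a title field instead of text, plus a _vectors entry per embedder telling Meilisearch not to compute embeddings for it.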
Let's send the embeddings for the top posts. We will need the toppost-embs-ids.mat and toppost-embs-data.mat files from the Hackerverse repository.
embs2ndjson toppost-embs-ids.mat toppost-embs-data.mat --float-precision 6 --dimensions 1024 \
| jq '{ id: .id, _vectors: { "bge-m3": { embeddings: .embeddings, regenerate: false } } }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
Let's send the smaller embeddings for the other posts. We will need the post-embs-ids.mat and post-embs-data.mat files from the Hackerverse repository.
embs2ndjson post-embs-ids.mat post-embs-data.mat --float-precision 6 --dimensions 512 \
| jq '{ id: .id, _vectors: { "jina-v2-small": { embeddings: .embeddings, regenerate: false } } }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
We would like to have more metadata about the posts we just indexed. We will need the posts.arrow file from the Hackerverse repository.
Note that we refine the metadata a bit:
- We rename the ts field to timestamp.
- We do not send the author (u32) field yet, as we will send the author's name later.
- We do not send the url (u32) field yet, as we will send the actual URL later.
arrow2ndjson-lite posts.arrow \
| jq '{ id: .id, deleted: .deleted, dead: .dead, score: .score, timestamp: .ts, emb_missing_page: .emb_missing_page }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
This part will be used to send the authors' names to Meilisearch. We will need the users.arrow file from the Hackerverse repository.
arrow2ndjson-lite users.arrow > users.ndjson
arrow2ndjson-lite posts.arrow \
| jq '{ id: .id, author: .author }' \
| author4post2ndjson --usernames users.ndjson \
| jq '{ id: .id, author: .author }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
This part will be used to send the metadata of the linked pages (title, description, snippet, image, language, and timestamps) to Meilisearch. We will need the url_metas.arrow file from the Hackerverse repository.
arrow2ndjson-lite posts.arrow > posts.ndjson
arrow2ndjson-lite url_metas.arrow \
| url4post2ndjson --posts posts.ndjson \
| jq '{ id: .id, url_title: .title, url_description: .description, url_snippet: .snippet, image_url: .image_url, url_lang: .lang, url_timestamp: .timestamp, url_timestamp_modified: .timestamp_modified }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
This part will be used to send the URLs of the posts to Meilisearch. We will need the urls.arrow file from the Hackerverse repository.
This step populates the url, url_fetched_timestamp, url_fetched_timestamp_modified, url_fetch_err, and url_fetched_via fields.
arrow2ndjson-lite posts.arrow > posts.ndjson
arrow2ndjson-lite urls.arrow \
| url4post2ndjson --posts posts.ndjson \
| jq '{ id: .id, url: (.proto + "//" + .url), url_fetch_err: (if .fetched_err == "" then null else .fetched_err end), url_fetched_timestamp: .fetched, url_fetched_timestamp_modified: .fetched_modified, url_fetched_via: (if .fetched_via == "" then null else .fetched_via end) }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-posts \
--upload-operation add-or-update \
--format ndjson \
--files -
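The jq filter in this step does two things: it rebuilds the full URL and turns empty strings into null. A quick way to check this is to run it on one made-up record. This sketch assumes that the proto field holds the scheme with its trailing colon (for example "https:"), which is what the concatenation in the filter suggests. The sample values and the normalize_url helper are invented for illustration.

```shell
# Sketch: try the URL filter on one made-up record.
# Assumption: .proto holds a scheme with its trailing colon (for example "https:").
url_filter='{ id: .id, url: (.proto + "//" + .url), url_fetch_err: (if .fetched_err == "" then null else .fetched_err end), url_fetched_timestamp: .fetched, url_fetched_timestamp_modified: .fetched_modified, url_fetched_via: (if .fetched_via == "" then null else .fetched_via end) }'

normalize_url() {
  # Reads NDJSON on stdin, writes compact NDJSON with the url_* fields
  jq -c "$url_filter"
}

echo '{"id": 1, "proto": "https:", "url": "example.com/a", "fetched_err": "", "fetched": 1700000000, "fetched_modified": 1700000100, "fetched_via": "direct"}' \
  | normalize_url
```

Sending null rather than an empty string keeps url_fetch_err and url_fetched_via from holding an empty value for the posts that have nothing to report.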
We would like to have the metadata about the comments. We will need the comments.arrow file from the Hackerverse repository.
Note that we refine the metadata a bit:
- We rename the ts field to timestamp.
- We do not send the author (u32) field yet, as we will send the author's name later.
arrow2ndjson-lite comments.arrow \
| jq '{ id: .id, deleted: .deleted, dead: .dead, score: .score, parent: .parent, post: .post, timestamp: .ts, _vectors: { "jina-v2-small": { "embeddings": null, "regenerate": false } } }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-comments \
--upload-operation add-or-update \
--format ndjson \
--files -
Let's send the embeddings for the comments. We will need the comment-embs-ids.mat and comment-embs-data.mat files from the Hackerverse repository.
embs2ndjson comment-embs-ids.mat comment-embs-data.mat --float-precision 2 --dimensions 512 \
| jq '{ id: .id, _vectors: { "jina-v2-small": { embeddings: .embeddings, regenerate: false } } }' \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-comments \
--upload-operation add-or-update \
--format ndjson \
--files -
Finally, let's send the text of the comments. We will need the comment_texts.arrow file from the Hackerverse repository.
arrow2ndjson-lite comment_texts.arrow \
| meilisearch-importer \
--jobs 10 \
--url $MEILISEARCH_URL \
--api-key $MEILISEARCH_API_KEY \
--index hackerverse-comments \
--upload-operation add-or-update \
--format ndjson \
--files -
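Meilisearch processes document additions asynchronously, so it can be useful to look at the index statistics once the imports are done. The sketch below queries the stats route (GET /indexes/{index_uid}/stats), assuming MEILISEARCH_URL and MEILISEARCH_API_KEY are exported as in the commands above. The helper names are made up for this example.

```shell
# Sketch: query the stats route of an index to see how many documents it holds.
# Helper names are assumptions for this example.

stats_endpoint() {
  # $1 = index name; a trailing slash on MEILISEARCH_URL is dropped
  printf '%s/indexes/%s/stats\n' "${MEILISEARCH_URL%/}" "$1"
}

index_stats() {
  # $1 = index name; prints the JSON stats object returned by the server
  curl --fail --silent --show-error \
    -H "Authorization: Bearer $MEILISEARCH_API_KEY" \
    "$(stats_endpoint "$1")"
}

# Usage:
# index_stats hackerverse-posts | jq '.numberOfDocuments'
# index_stats hackerverse-comments | jq '.isIndexing'
```

While isIndexing is still true, the document count may keep growing, so it is worth checking again later.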