Benchmarks, tricky queries, etc.
-- Export Transactions from Firefly-III Database
-- This query produces a reasonable CSV export of transactions from a Firefly-III database.
-- Adjust the user_id if you have multiple users.
-- Tested on Postgres.
-- The from/join clauses assume the standard Firefly-III schema, where each
-- transaction journal has one positive (credit) and one negative (debit)
-- row in the transactions table.
select cast(tj.date as date) as date,
       tj.description as description,
       round(tcredit.amount, 2) as credit_amount,
       acredit.name as credit_account,
       round(tdebit.amount, 2) as debit_amount,
       adebit.name as debit_account
from transaction_journals tj
join transactions tcredit on tcredit.transaction_journal_id = tj.id and tcredit.amount > 0
join transactions tdebit on tdebit.transaction_journal_id = tj.id and tdebit.amount < 0
join accounts acredit on acredit.id = tcredit.account_id
join accounts adebit on adebit.id = tdebit.account_id
where tj.user_id = 1
order by tj.date;
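If you want the CSV directly from a script rather than from psql, here's a minimal sketch, assuming psycopg2 is installed; the DSN, the query file name, and the output path are placeholders of mine, not part of the original gist.

# Run the export query above and stream the rows into a CSV file.
# The DSN and file names below are placeholder assumptions.
import csv
import psycopg2

def export_transactions(dsn, query, out_path):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row from column names
            writer.writerows(cur)  # the cursor iterates over result tuples

export_transactions(
    dsn='postgresql://firefly:firefly@localhost:5432/firefly',
    query=open('export_transactions.sql').read(),
    out_path='transactions.csv',
)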
This document describes how to run Elastiknn on the big-ann-benchmarks challenge. Admittedly it's a little late in the game: IIRC the deadline is October 22, 2021, and I'm writing this on October 15. But hey, the neighbors aren't gonna find themselves, and we can still use this as an opportunity to improve Elastiknn.
The setup is currently pretty experimental, so bring your elbow grease.
Part 1: Set up the Elastiknn project
- Clone the alexklibisz/elastiknn repo and check out the elastiknn-278-lucene-benchmarks branch. That's where I've been working on the big-ann-benchmarks integration and improvements.
git clone git@github.com:alexklibisz/elastiknn.git
cd elastiknn
git fetch --all
git checkout elastiknn-278-lucene-benchmarks
package com.klibisz.elastiknn.search;

/**
 * Min heap where the values are shorts. Useful for tracking top counts for a query.
 * Based on the Python std. lib. implementation: https://docs.python.org/3.8/library/heapq.html#module-heapq
 */
public class ShortMinHeap {
    private short[] heap;
    private int size;
    private final int capacity;
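Since the javadoc points at Python's heapq, here's the pattern this class is for, sketched with heapq itself; the function and variable names are mine, not from the Java code.

# Track the k largest counts seen so far by evicting the smallest:
# the same top-k pattern ShortMinHeap implements for shorts.
import heapq

def top_k_counts(counts, k):
    heap = []  # min-heap; heap[0] is always the smallest retained count
    for c in counts:
        if len(heap) < k:
            heapq.heappush(heap, c)
        elif c > heap[0]:
            heapq.heapreplace(heap, c)  # pop the min and push c in one step
    return sorted(heap, reverse=True)

print(top_k_counts([3, 1, 4, 1, 5, 9, 2, 6], k=3))  # prints [9, 6, 5]

A fixed-capacity short[] in the Java version avoids the boxing and list overhead of a general-purpose heap, which matters when this sits in a hot query loop.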
This is a very rough first-pass Scala implementation of the PDCI (Prioritized Dynamic Continuous Indexing) algorithm for nearest neighbor search.
It's quite an interesting algorithm, but I found it difficult to implement efficiently on the JVM. I'm pretty sure this will compile and run, but I last touched the code in March 2019, so who knows :).
The build.sbt includes some unnecessary dependencies, as this was pulled out of a private repo containing other experiments which eventually became Elastiknn.
Example of setting up Grafana to read from Firefly. See https://www.reddit.com/r/FireflyIII/comments/nogrl5 for context.
import os
import sys
import requests
from pprint import pprint
from datetime import datetime
from dataclasses import dataclass
from time import time

@dataclass
class Transaction:
    # Fields below assume the Firefly-III API's transaction-split attributes.
    date: datetime
    description: str
    amount: float
    source_name: str
    destination_name: str
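The snippet cuts off at the dataclass. As a rough sketch of how a script like this talks to the Firefly-III API, reusing the imports above: the env var names are my assumptions, while the data/attributes/transactions response shape and the split field names follow the Firefly-III API docs.

# Fetch one page of transactions from the Firefly-III REST API.
# FIREFLY_URL and FIREFLY_TOKEN are assumed environment variables.
def fetch_transactions(page=1):
    resp = requests.get(
        f"{os.environ['FIREFLY_URL']}/api/v1/transactions",
        headers={'Authorization': f"Bearer {os.environ['FIREFLY_TOKEN']}"},
        params={'page': page},
    )
    resp.raise_for_status()
    txns = []
    for item in resp.json()['data']:
        for split in item['attributes']['transactions']:  # each journal holds one or more splits
            txns.append(Transaction(
                date=datetime.fromisoformat(split['date']),
                description=split['description'],
                amount=float(split['amount']),
                source_name=split['source_name'],
                destination_name=split['destination_name'],
            ))
    return txns

pprint(fetch_transactions())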
------------------------------------------------------
    _______            __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
import os
from concurrent.futures.thread import ThreadPoolExecutor
from pprint import pprint
from time import time

import boto3
import botocore
import shell
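The snippet ends at the imports, but they suggest the usual pattern of parallel S3 reads. A minimal sketch under that assumption, reusing the imports above; the bucket and keys are hypothetical placeholders, not from the original script.

# Download several S3 objects in parallel and time the batch.
s3 = boto3.client('s3')
BUCKET = 'my-benchmark-results'  # hypothetical bucket name
keys = ['results/run1.json', 'results/run2.json', 'results/run3.json']

def fetch(key):
    return s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()

t0 = time()
with ThreadPoolExecutor(max_workers=8) as pool:
    bodies = list(pool.map(fetch, keys))
pprint({k: len(b) for k, b in zip(keys, bodies)})
print('fetched %d objects in %.2fs' % (len(keys), time() - t0))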
#!/bin/sh
set -e

# Get the total physical memory, in whole gigabytes, using some hacky python.
MAXMEM=$(python -c "import os; print('%dg' % (os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024 ** 3)))")

# Define the Elasticsearch java options: set the JVM heap to that amount.
export ES_JAVA_OPTS="-Xms$MAXMEM -Xmx$MAXMEM"

# Increase memory setting for Elasticsearch.
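For reference, the one-liner works because sysconf exposes the page size and the physical page count, whose product is total RAM in bytes; formatting the GiB value with %d rounds it down, e.g. to 31g on a 32 GiB machine. Spelled out:

# The same computation as the shell one-liner above, unpacked.
import os

page_size = os.sysconf('SC_PAGE_SIZE')   # bytes per page, typically 4096
num_pages = os.sysconf('SC_PHYS_PAGES')  # total physical pages
total_bytes = page_size * num_pages
print('%dg' % (total_bytes / (1024 ** 3)))  # e.g. '31g' on a 32 GiB machine

Note that Elastic's general guidance is to cap the heap at about half of physical RAM so the OS page cache gets the rest; giving the heap everything, as here, presumably assumes a machine or container dedicated to the benchmark.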