Benchmarks, tricky queries, etc.
-- Export Transactions from Firefly-III Database
-- This query produces a reasonable CSV export of transactions from a Firefly-III database.
-- Adjust the user_id if you have multiple users.
-- Tested on Postgres.
-- The from/join clauses assume the standard Firefly-III schema, where each
-- transaction journal has one positive (credit) and one negative (debit)
-- row in the transactions table.
select cast(tj.date as date) as date,
       tj.description as description,
       round(tcredit.amount, 2) as credit_amount,
       acredit.name as credit_account,
       round(tdebit.amount, 2) as debit_amount,
       adebit.name as debit_account
from transaction_journals tj
join transactions tcredit on tcredit.transaction_journal_id = tj.id and tcredit.amount > 0
join transactions tdebit on tdebit.transaction_journal_id = tj.id and tdebit.amount < 0
join accounts acredit on acredit.id = tcredit.account_id
join accounts adebit on adebit.id = tdebit.account_id
where tj.user_id = 1
order by tj.date;
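If you want the CSV directly from a script rather than from psql, here's a minimal sketch, assuming psycopg2 is installed; the DSN, the query file name, and the output path are placeholders of mine, not part of the original gist.

# Run the export query above and stream the rows into a CSV file.
# The DSN and file names below are placeholder assumptions.
import csv
import psycopg2

def export_transactions(dsn, query, out_path):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row from column names
            writer.writerows(cur)  # the cursor iterates over result tuples

export_transactions(
    dsn='postgresql://firefly:firefly@localhost:5432/firefly',
    query=open('export_transactions.sql').read(),
    out_path='transactions.csv',
)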
This document describes how to run Elastiknn on the big-ann-benchmarks challenge. Admittedly it's a little late in the game: IIRC the deadline is October 22, 2021, and I'm writing this on October 15. But hey, the neighbors aren't gonna find themselves, and we can still use this as an opportunity to improve Elastiknn.
The setup is currently pretty experimental, so bring your elbow grease.
Part 1: Set up the Elastiknn project
- Clone the alexklibisz/elastiknn repo and check out the elastiknn-278-lucene-benchmarks branch. That's where I've been working on the big-ann-benchmarks integration and improvements.
git clone git@github.com:alexklibisz/elastiknn.git
cd elastiknn
git fetch --all
git checkout elastiknn-278-lucene-benchmarks
package com.klibisz.elastiknn.search;

/**
 * Min heap where the values are shorts. Useful for tracking top counts for a query.
 * Based on the Python std. lib. implementation: https://docs.python.org/3.8/library/heapq.html#module-heapq
 */
public class ShortMinHeap {
    private short[] heap;
    private int size;
    private final int capacity;
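Since the javadoc points at Python's heapq, here's the pattern this class is for, sketched with heapq itself; the function and variable names are mine, not from the Java code.

# Track the k largest counts seen so far by evicting the smallest:
# the same top-k pattern ShortMinHeap implements for shorts.
import heapq

def top_k_counts(counts, k):
    heap = []  # min-heap; heap[0] is always the smallest retained count
    for c in counts:
        if len(heap) < k:
            heapq.heappush(heap, c)
        elif c > heap[0]:
            heapq.heapreplace(heap, c)  # pop the min and push c in one step
    return sorted(heap, reverse=True)

print(top_k_counts([3, 1, 4, 1, 5, 9, 2, 6], k=3))  # prints [9, 6, 5]

A fixed-capacity short[] in the Java version avoids the boxing and list overhead of a general-purpose heap, which matters when this sits in a hot query loop.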
This is a very rough first-pass Scala implementation of the PDCI (Prioritized Dynamic Continuous Indexing) algorithm for nearest neighbor search.
It's quite an interesting algorithm, but I found it difficult to implement efficiently on the JVM. I'm pretty sure this will compile and run, but I last touched the code in March 2019, so who knows :).
The build.sbt includes some unnecessary dependencies, as this was pulled out of a private repo containing other experiments which eventually became Elastiknn.
Example of setting up Grafana to read from Firefly. See https://www.reddit.com/r/FireflyIII/comments/nogrl5 for context.
import os
import sys
import requests
from pprint import pprint
from datetime import datetime
from dataclasses import dataclass
from time import time

@dataclass
class Transaction:
    # Fields below assume the Firefly-III API's transaction-split attributes.
    date: datetime
    description: str
    amount: float
    source_name: str
    destination_name: str
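The snippet cuts off at the dataclass. As a rough sketch of how a script like this talks to the Firefly-III API, reusing the imports above: the env var names are my assumptions, while the data/attributes/transactions response shape and the split field names follow the Firefly-III API docs.

# Fetch one page of transactions from the Firefly-III REST API.
# FIREFLY_URL and FIREFLY_TOKEN are assumed environment variables.
def fetch_transactions(page=1):
    resp = requests.get(
        f"{os.environ['FIREFLY_URL']}/api/v1/transactions",
        headers={'Authorization': f"Bearer {os.environ['FIREFLY_TOKEN']}"},
        params={'page': page},
    )
    resp.raise_for_status()
    txns = []
    for item in resp.json()['data']:
        for split in item['attributes']['transactions']:  # each journal holds one or more splits
            txns.append(Transaction(
                date=datetime.fromisoformat(split['date']),
                description=split['description'],
                amount=float(split['amount']),
                source_name=split['source_name'],
                destination_name=split['destination_name'],
            ))
    return txns

pprint(fetch_transactions())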
------------------------------------------------------
    _______            __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
import os
from concurrent.futures.thread import ThreadPoolExecutor
from pprint import pprint
from time import time

import boto3
import botocore
import shell
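The snippet ends at the imports, but they suggest the usual pattern of parallel S3 reads. A minimal sketch under that assumption, reusing the imports above; the bucket and keys are hypothetical placeholders, not from the original script.

# Download several S3 objects in parallel and time the batch.
s3 = boto3.client('s3')
BUCKET = 'my-benchmark-results'  # hypothetical bucket name
keys = ['results/run1.json', 'results/run2.json', 'results/run3.json']

def fetch(key):
    return s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()

t0 = time()
with ThreadPoolExecutor(max_workers=8) as pool:
    bodies = list(pool.map(fetch, keys))
pprint({k: len(b) for k, b in zip(keys, bodies)})
print('fetched %d objects in %.2fs' % (len(keys), time() - t0))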
#!/bin/sh
set -e

# Get the total physical memory, in whole gigabytes, using some hacky python.
MAXMEM=$(python -c "import os; print('%dg' % (os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024 ** 3)))")

# Define the Elasticsearch java options: set the JVM heap to that amount.
export ES_JAVA_OPTS="-Xms$MAXMEM -Xmx$MAXMEM"

# Increase memory setting for Elasticsearch.
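For reference, the one-liner works because sysconf exposes the page size and the physical page count, whose product is total RAM in bytes; formatting the GiB value with %d rounds it down, e.g. to 31g on a 32 GiB machine. Spelled out:

# The same computation as the shell one-liner above, unpacked.
import os

page_size = os.sysconf('SC_PAGE_SIZE')   # bytes per page, typically 4096
num_pages = os.sysconf('SC_PHYS_PAGES')  # total physical pages
total_bytes = page_size * num_pages
print('%dg' % (total_bytes / (1024 ** 3)))  # e.g. '31g' on a 32 GiB machine

Note that Elastic's general guidance is to cap the heap at about half of physical RAM so the OS page cache gets the rest; giving the heap everything, as here, presumably assumes a machine or container dedicated to the benchmark.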