Skip to content

Instantly share code, notes, and snippets.

@okal
Last active June 8, 2020 21:43
Show Gist options
  • Save okal/5e61df20edde718d98597f7fb782be35 to your computer and use it in GitHub Desktop.
Save okal/5e61df20edde718d98597f7fb782be35 to your computer and use it in GitHub Desktop.
Storage

Introduction

I'm currently reading the Amazon Dynamo paper. As an aid to internalising and retaining the concepts, I will attempt to implement them in a piecemeal fashion. The first of these will be a very simple key-value store that I've named Thamani, the Swahili word for "value". It implements a simple API over TCP, consisting of two methods

  1. put(key: byte array, value: byte array) -> timestamp: 64 bit unix timestamp
  2. get(key: byte array) -> value: byte array, timestamp: 64 bit unix timestamp

Design Notes

  1. The keys and values are simply byte arrays. Thamani does not impose any sort of structure beyond that. Semantics are entirely determined by the client.
  2. Requests are not authenticated.
  3. Data is stored on disk.
  4. The server can handle multiple clients at a time. I haven't yet determined a concurrency model.
  5. While I haven't fully settled on an implementation language, I'll likely use Kotlin.
  6. Both put and get operations should be constant-time.
  7. Multiple calls to put will overwrite previous versions. Last write wins.

Notes

04/06/2020 20:34

I'm currently leaning towards an on-disk hash table implementation, but also exploring other associated array implementation possibilities. Is it naive to implement the persistence as a directory of single files named for the key? It's only now hitting me that I can't make any performance guarantees with this implementation, so I'll have to suspend the O(1) requirement for reads. If we pretend file access is O(1), then it still stands. Thamani will then basically have outsourced performance to the file system. I'll likely revisit this decision once I gain a deeper understanding of the appropriate data structures.

04/06/2020 22:42

So, the current implementation uses the key as a filename, with the value as the only contents. Unfortunately, it appears that ByteArrays may contain characters that aren't accepted as filenames 😭

04/06/2020 23:03

OK. Unanticipated hiccup. How does one deal with input created using different character sets?

@okal
Copy link
Author

okal commented Jun 5, 2020

@okal
Copy link
Author

okal commented Jun 5, 2020

Possibly interesting: http://www.mapdb.org/

@okal
Copy link
Author

okal commented Jun 8, 2020

So it turns out that character encodings are a pain! I'm revising the spec so that it's UTF-8 encoded keys mapping to UTF-8 encoded values. Any non-string values can be serialized. Semantics still remain under the control of the client. Also going to use this as a chance to learn more about hash tables rather than lean on the file system. Just to learn more stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment