AYB12

Alvin Richards: MongoDB

Trade-off: scale vs. functionality. MongoDB tries to have good functionality and good scalability.
Auto-sharding to maintain equilibrium between shards
Scalable datastore != scalable application: use of datastore may still be non-scalable (e.g. many queries across all shards)
Get low latency by ensuring shard data is always in memory: datastore then becomes a cache with persistence
Replica sets: auto-election of new primary node on failure, plus automatic recovery once failed node is back online
Async replication between nodes in a replica set (eventual consistency)
Auto TTL for messages, and can update on read operations
Tunable data consistency before a write is "complete", from "none" (fire and forget: assume it's going to get there eventually) to "full" (includes remote replication to other geographies)
Data model of RDBMS enforces relational model which can limit ability to scale that system. Data locality ("which server is my record on?") becomes an issue
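The "scalable datastore != scalable application" point above can be illustrated with a toy router. This is only a sketch, not how MongoDB routes queries: the `ShardedStore` class, the hash-based placement on a `user_id` shard key, and the "shards touched" counter are all assumptions made for illustration.

```python
import hashlib


class ShardedStore:
    """Toy hash-sharded store (hypothetical; not MongoDB's API)."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.sha256(str(key).encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def insert(self, user_id, doc):
        self.shards[self._shard_for(user_id)][user_id] = doc

    def find_by_key(self, user_id):
        """Targeted query: the shard key says which single shard to ask."""
        shard = self.shards[self._shard_for(user_id)]
        return shard.get(user_id), 1  # (result, shards touched)

    def find_by_field(self, field, value):
        """Scatter-gather query: no shard key, so every shard is asked."""
        hits = [doc for shard in self.shards for doc in shard.values()
                if doc.get(field) == value]
        return hits, len(self.shards)


store = ShardedStore(num_shards=4)
for uid in range(100):
    city = "London" if uid % 2 else "Paris"
    store.insert(uid, {"user_id": uid, "city": city})

doc, touched_by_key = store.find_by_key(42)
hits, touched_by_field = store.find_by_field("city", "London")
```

The keyed lookup touches one shard however many shards exist; the field query touches all of them, so adding shards does not make it cheaper.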

Luca Garulli: OrientDB

Biggest issue with switching from RDBMS: what about the data model?
KV, column-based, document DBs ... and graph DBs
Property graph model: vertices and edges can have properties, edges are directional, edges connect vertices, vertices can have one or more incoming + outgoing edges
In RDBMS, every time you traverse a relationship, you perform an expensive JOIN. Indexes can speed up reads, but slow down writes
Index lookups are generally based on balanced trees. More entries == more lookup steps == slower JOIN
"A graph DB is any storage system that provides index-free adjacency"
A graph DB treats relationships as physical links assigned to the record when the edge is created; RDBMS computes the same relationship every time you perform a JOIN
Lookup time moves from O(log N) to near O(1), and does not increase with DB size
NuvolaBase.com: REST-based graph DB service
Difficult to create distributed graph DBs. Scaling is basically a case of using client-side hashing.
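The property graph model and the index-free adjacency claim above can be sketched in a few lines. This is a toy in-memory illustration, not OrientDB's storage format: `Vertex`, `Edge`, `connect` and `index_lookup_steps` are hypothetical names, and a binary search over sorted keys stands in for a balanced-tree index.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, which avoids recursing through
# the vertex <-> edge cycles when objects are compared.
@dataclass(eq=False)
class Vertex:
    name: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # direct links, no index
    in_edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    label: str
    source: Vertex
    target: Vertex
    properties: dict = field(default_factory=dict)


def connect(source, target, label, **properties):
    """Create a directed edge and store the link on both vertices."""
    edge = Edge(label, source, target, properties)
    source.out_edges.append(edge)
    target.in_edges.append(edge)
    return edge


def index_lookup_steps(sorted_keys, key):
    """Count the comparisons a binary search makes looking for key."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps


alice, bob = Vertex("alice"), Vertex("bob")
connect(alice, bob, "knows", since=2012)

# Traversal follows the stored link: one hop regardless of graph size.
friends = [edge.target.name for edge in alice.out_edges]

# The index-based lookup needs more steps as the key set grows.
small = index_lookup_steps(list(range(1_000)), 999)
large = index_lookup_steps(list(range(1_000_000)), 999_999)
```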

Dale Harvey: PouchDB

CouchDB for JavaScript environments, mainly for browsers (but also works in Node.js)
Multi-master replication, supports disconnected sync
"Ground computing" -- like cloud computing, but provides offline behaviour with on-demand sync
Designed for building applications that need to work well offline, and that need to sync data
Would simplify something like multi-app SimpleNote-type system?
Offline is a fact: the more mobile devices, the more people are offline. No reception, data limits, slow / unstable connections etc
Sync is hard: Things took 2 years to develop sync
Bad connections + retries, transfer overhead and moving deltas (mobile access might not want total sync), master-master scenarios, conflict resolution
[CP]ouchDB has good, simple conflict resolution, but sometimes you need to tell it what to resolve (based on your app usage)
Requires CouchDB on the server for sync
Safari + Opera support in progress, so not production-ready yet
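The "moving deltas" concern above can be sketched as checkpoint-based replication: the target remembers the last sequence number it has seen from a source and pulls only later changes. This is a toy illustration only; `Replica`, `changes_since` and `pull` are hypothetical names, not the PouchDB or CouchDB API, and conflict handling is left out entirely.

```python
class Replica:
    """Toy document store with an append-only change log."""

    def __init__(self):
        self.docs = {}
        self.log = []          # entries are (seq, doc_id, doc)
        self.checkpoints = {}  # id(source) -> last seq pulled from it

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.log.append((len(self.log) + 1, doc_id, doc))

    def changes_since(self, seq):
        return [entry for entry in self.log if entry[0] > seq]

    def pull(self, source):
        """Copy only the changes made on source since the last pull."""
        last = self.checkpoints.get(id(source), 0)
        delta = source.changes_since(last)
        for seq, doc_id, doc in delta:
            self.put(doc_id, doc)
            last = seq
        self.checkpoints[id(source)] = last
        return len(delta)


server, phone = Replica(), Replica()
server.put("note-1", {"text": "buy milk"})
server.put("note-2", {"text": "call Dale"})

first = phone.pull(server)    # both notes are transferred
server.put("note-3", {"text": "ship it"})
second = phone.pull(server)   # only the new note is transferred
third = phone.pull(server)    # nothing new, nothing transferred
```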

Matt Heitzenroder: Eventual Consistency

Brewer's Conjecture (2000): CAP -- you can only have two
"Life is full of tradeoffs" as is engineering
Amazon's Dynamo paper: tradeoff between C & A -- they chose A
Financial systems already dealing with eventual consistency: trading banks closing and reconciling, network partitions between cash point and centralised bank etc
Riak uses vnodes in a ring topology (ketama-style)
Writes go to hashed node + the next two (i.e. three copies on separate nodes)
Read Repair: out-of-date copies of data on vnodes are detected automatically on read, and the stale nodes are updated to the logical descendant (e.g. v1 -> v2)
Read Repair etc. means that internally three objects are requested and checked for consistency. This can be tuned via quorum settings, e.g. single-read for speed
There can be divergent object versions, a.k.a. siblings: after a network partition, two operations can have altered object state at the same time. Riak returns both versions
Per-application, you can define a "conflict resolver" as part of the Riak client to define how sibling resolution is handled
Common use-cases are: pick one based on some property, or perform a set union of the data
Probabilistically Bounded Staleness
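The ring placement ("hashed node + the next two") and the set-union resolver mentioned above can be sketched as follows. This is a minimal ketama-style illustration, not Riak's implementation or client API: `Ring`, `preference_list`, `resolve_by_union`, the node names and the vnode count are all assumptions.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with several vnodes per physical node."""

    def __init__(self, nodes, vnodes_per_node=8):
        points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.hashes = [h for h, _ in points]
        self.owners = [n for _, n in points]

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        start = bisect.bisect(self.hashes, _hash(key))
        chosen = []
        for offset in range(len(self.owners)):
            node = self.owners[(start + offset) % len(self.owners)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen


def resolve_by_union(siblings):
    """Sibling resolver: merge divergent set-valued versions by set union."""
    merged = set()
    for sibling in siblings:
        merged |= sibling
    return merged


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
replicas = ring.preference_list("user:42")

# Two versions of a shopping cart written on either side of a partition.
cart = resolve_by_union([{"book", "pen"}, {"book", "lamp"}])
```

Set union suits data where losing an addition is worse than resurrecting a removal; picking one sibling by a property such as a timestamp is the other common choice noted above.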

Monty Widenius: MySQL-MariaDB

MySQL named after Monty's daughter, My (MaxDB released later, named after his son, Max)
Original MySQL devs started focussing on MariaDB in 2009 with the impending purchase of Sun by Oracle
Chose to use dual-license to be able to work full-time on MySQL: took 2 months to become profitable
Don't go to investors when you need their money. Wait for them to come to you when you don't need their money, and you won't have to give up so much of your company
Monty Program Ab: new company (using Hacker Business Model) to focus on MariaDB, with most of the original MySQL developers
Aim to keep MySQL dev talent together, always have an open-source version of MySQL. More important after Oracle purchase of Sun
MariaDB is a drop-in replacement for MySQL. "No reason to use MySQL anymore: MariaDB is better in all cases"
Big JOIN and subquery performance is an order of magnitude (or more) faster than MySQL
"SQL doesn't solve all common problems" e.g. arbitrary attributes (shop item sizes, colours etc). Dynamic columns introduced in MariaDB 5.3. As a POC, created a storage engine for Cassandra with MariaDB 10
Any closed-source features that Oracle has added to MySQL have been added to MariaDB as open-source features
5.5 introduces a new thread pool (instead of thread-per-connection)
Full merge of MySQL 5.6 into MariaDB 5.6 is a year-long project due to broken features and new bugs, over-complicated code, lack of understanding of existing code etc
Did such a good job of getting the MySQL name out there, changing everyone over to MariaDB is going to be a tough job!
Though creating a dev community is easier as Oracle is not working with the community
Aim of MariaDB: make MySQL obsolete
Free MariaDB + MySQL knowledgebase available at askmonty.org

Brandon Keepers: Git: the NoSQL DB

Let's start with "Git's amazing ... what else can we do with it?"
"NoSQL is marketing bollocks" -- people mean non-relational and schemaless, and anything else gets lumped in to NoSQL
git calls itself "the stupid content tracker" (see the man page)
git has three "object types": blobs, trees and commits, plus symbolic "references" on top, all managed by the git command line tool
There are libraries to work with this (Grit, libgit2), plus ORMs built on top, such as ToyStore
NoSQL allows us to question RDBMS design, including big design up-front: schemaless allows us to be much more agile with our data model
git can handle transactions in both short-lived (one commit with multiple changes) and long-lived (branches) forms
Replication handled by the fact that all git repos are full clones
git doesn't have any of the features that make a great DB: querying, concurrency (it's filesystem based), merge conflict resolution, scale
Scale: filesystem based, and problems with git at scale. Someone tested with a very large repo: 4m commits, 1.3m files, 15GB repo ... git-add took 7 seconds etc...
Think about how you can abuse your tools to get more out of them
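The "stupid content tracker" idea above boils down to a content-addressed key-value store: the key of a blob is the SHA-1 of a small header plus the content. The sketch below reproduces git's blob ID calculation; the in-memory `BlobStore` class is a hypothetical stand-in for the `.git/objects` directory, and trees, commits and references are left out.

```python
import hashlib
import zlib


def encode_object(obj_type, content):
    """Prefix content with git's NUL-terminated '<type> <size>' header."""
    return f"{obj_type} {len(content)}".encode() + b"\x00" + content


def object_id(obj_type, content):
    """Return the 40-character hex SHA-1 that git uses as the object's key."""
    return hashlib.sha1(encode_object(obj_type, content)).hexdigest()


class BlobStore:
    """Toy in-memory object store keyed by git-style blob IDs."""

    def __init__(self):
        self._objects = {}

    def put(self, content):
        oid = object_id("blob", content)
        # git also zlib-compresses loose objects before writing them.
        self._objects[oid] = zlib.compress(encode_object("blob", content))
        return oid

    def get(self, oid):
        raw = zlib.decompress(self._objects[oid])
        _header, _, content = raw.partition(b"\x00")
        return content


store = BlobStore()
key = store.put(b'{"name": "Monty"}')
same_key = store.put(b'{"name": "Monty"}')  # same content, same key
empty_id = object_id("blob", b"")
```

Because the key is derived from the content, storing the same document twice is a no-op, which is also why every clone can verify the data it receives.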

Peter Cooper: Redis, Steady, Go!

Peter's a Rubyist, and wants his languages and tools to be "beautiful," which he considers Redis to be
Redis: remote [data structure] server -- no tables, no SQL, no enforced relationships, lots of working with primitives. The Redis manifesto calls it a DSL for abstract data types
Like memcached but with more commands, more persistence, more data types
Three big use cases: database, messaging (pub/sub, queueing), or as a cache. Also: fast live stats logging (why Redis was created in the first place), rate limiting (using automatic key expiry), scoreboarding (using sorted sets), IPC, session storage
YouP*rn.com uses Redis as their primary datastore (~100 Alexa ranking)
Redis is single-threaded and event-driven (apart from background saving etc). Single-threading means individual operations are atomic
Python library redis_wrap means you can use normal Python data types, backed by Redis
Recent additions: scripting with Lua, plus PostgreSQL foreign data wrapper
5 data types: strings, lists, sets, sorted sets, hashes
Abstract data type example: queueing using a list, with LPOP and RPUSH. Priority queues implemented by using a BLPOP with multiple list names
Set operations are available such as intersection, union, difference. Also provides the ability to store intermediary results in new keys.
Hashes don't allow storage of other data types: strings only
Supports transactions using MULTI ... EXEC to run all queued commands in one go
Master/slave replication is simple with the SLAVEOF ... command
Other updates and versions include Redis Sentinel (in development to provide automated failover), Redis Cluster (in development for fault tolerance of a subset of Redis commands) and a Windows version
Have a play with a "live demo" within the redis.io documentation
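The queueing example above (RPUSH to enqueue, LPOP to dequeue, BLPOP over several lists for priorities) can be sketched without a server. `ToyRedis` is an in-memory stand-in written for this illustration, not a Redis client: its method names mirror the Redis commands, the key names are made up, and the blocking behaviour of BLPOP is not reproduced (it returns `None` instead of waiting).

```python
from collections import defaultdict, deque


class ToyRedis:
    """In-memory imitation of a few Redis list commands."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def rpush(self, key, *values):
        """Append values to the tail of the list; return the new length."""
        self._lists[key].extend(values)
        return len(self._lists[key])

    def lpop(self, key):
        """Remove and return the head of the list, or None if empty."""
        items = self._lists.get(key)
        return items.popleft() if items else None

    def blpop(self, *keys):
        """Pop from the first non-empty list, checking keys in the given order."""
        for key in keys:
            value = self.lpop(key)
            if value is not None:
                return key, value
        return None


r = ToyRedis()
r.rpush("jobs:low", "resize-image")
r.rpush("jobs:high", "send-receipt", "charge-card")

# Listing the high-priority key first means it is always drained first.
order = []
while True:
    popped = r.blpop("jobs:high", "jobs:low")
    if popped is None:
        break
    order.append(popped[1])
```

Pushing on the right and popping on the left gives FIFO order within each list; the order of key names passed to `blpop` gives the priority between lists.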

Lisa Phillips: MySQL @Twitter

MySQL plus friends has enabled Twitter to still use MySQL (5.0 and 5.5) as its primary datastore, with an average of 400 million tweets per day (4,629/s average), with a peak at 25,088/s (about a Japanese anime film!)
8 full-time DBAs (recently up from 6) managing thousands of MySQL instances, supporting 100s of developers. All DBAs have at-scale experience, and most developers are familiar with MySQL
The Twitter DBAs manage from the bare-metal up, including operating system, software, monitoring etc
Engineering in Twitter is about pragmatism: use commodity hardware and software, queues and async processing, eventual consistency, some delay tolerance (measured in seconds)
"Build new awesome tools (and open source them) if you need to"
They use "deciders": feature flags to enable roll-out to small volumes of people to gauge impact on the DB servers (plus other parts of the infrastructure)
Twitter don't roll back, either code or DB changes: they roll out slowly and iterate on any fixes
Replication (usually) works. Have seen replication break in lots of different ways so many times, so can now quickly fix any problems.
Bad points of MySQL: at-scale ID generation, graphs, replication inefficiencies and lag
"If you're using replication, make sure you can tolerate lag in your code. If you can't tolerate lag, don't use MySQL"
MySQL great for HA, "smaller" datasets (<1.5TB)
Challenges: MySQL version diversity, single DBA, upgrades without HA solution, no load-balancing for reads
In 2012, they used a sharded master-slave setup using temporal sharding. New shards were hot, old shards not. New DB clusters being built every week, and DBA time became limiting factor
Snowflake used for unique ID generation
Gizzard created for sharding as a replacement for the temporal sharding (stores and replicates tweets, interest, social graph) and replaces native MySQL replication (disabling native replication improves performance)
Gizzard handles 6m SELECTs per second at peak, and creating more than 3b records per day
Other apps built on top of Gizzard: Flock, TBird, TFlock -- all of these are backed by MySQL
Still using traditional master-slave clusters (3-100 machines in a cluster) for non-tweet data such as user metadata, old Rails models
One Twitter employee is an ex-MySQL developer who now just works on MySQL for Twitter
Working on better logging and auditing support, real-time monitoring, performance and response metrics, row-based replication pre-fetching
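The Snowflake note above addresses the "at-scale ID generation" weak point: each generator builds IDs locally, with no central counter. The sketch below assumes the bit layout and epoch from Snowflake's open-source release (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence), none of which were stated in the talk. `IdGenerator` is a hypothetical, single-threaded Python illustration, not Twitter's code.

```python
import time

EPOCH_MS = 1288834974657   # custom epoch (assumption, see note above)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    """Generates roughly time-ordered 64-bit IDs without coordination."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & MAX_SEQUENCE
            if self.sequence == 0:
                # Sequence exhausted for this millisecond: wait for the next.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        timestamp = (now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS)
        return timestamp | (self.worker_id << SEQUENCE_BITS) | self.sequence


generator = IdGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(5000)]
worker_from_id = (ids[0] >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
```

Because the timestamp occupies the high bits, IDs from one generator sort by creation time, and distinct worker IDs keep generators on different machines from colliding.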

timblair/ayb12.md
