
@timblair
Created November 23, 2012 12:56

Revisions

  1. timblair revised this gist Nov 23, 2012. 1 changed file with 104 additions and 1 deletion.
    105 changes: 104 additions & 1 deletion ayb12.md
    @@ -240,4 +240,107 @@
    performance and response metrics, row-based replication pre-fetching

    [snowflake]: https://github.com/twitter/snowflake
    [gizzard]: https://github.com/twitter/gizzard

    ## Brian LeRoux: Mobile Web Persistence

    - [Lawnchair][lawnchair]: like CouchDB but smaller and outside
    - "Xcode is Eclipse that looks like iTunes, and it just as slow"
    - Cookies: need to be online, 4Kb storage, but there's a [handy
    hack][cookieimg] for serving up responsive images
    - [Can I use Web SQL Database?][caniuse] (oh, and [JS is dumb][wtfjs])
    - SQL in the browser: SQLite (probably). Started off as Google Gears,
    but now improved. http://caniuse.com/sql-storage SQLite is an
    implementation, not a standard, and Mozilla had issues with it. Isn't
    everywhere and doesn't necessarily work.
    - LocalStorage: quite nice, can store up to 5Mb, but has a synchronous
    (blocking) API, plus misses complex types, and you can't query it.
    Supported almost everywhere (e.g. even Opera Mini)
    - WebSimpleDB: solve ALL the problems! Renamed Indexed DB. Has
    querying etc, but is heavy on the code required because it's a
    versioned DB. Not supported in lots of places (yet), but could be
    polyfilled.
    - Lawnchair wraps up all the above in one sane API
    - Hack: store unlimited data on all browsers, accessible from any
    domain, using `window.name`
    - Web sockets means we could have a web page open a database connection
    - WebRTC PeerChannel and DataConnection APIs are also around
    - Strong indication of first-class File APIs coming to browsers.
    Currently split into two specs: File API and Directories and System
    API. [filer.js][filerjs] tries to make this saner
    - Mozilla working on Archive API in Firefox OS

    [lawnchair]: http://brian.io/lawnchair/
    [cookieimg]: http://blog.keithclark.co.uk/responsive-images-using-cookies/
    [caniuse]: http://caniuse.com/sql-storage
    [wtfjs]: http://wtfjs.com/
    [filerjs]: https://github.com/ebidel/filer.js

    ## Craig Kerstiens: Postgres Demystified

    - [postgresapp.com][pgapp] -- simplified running of Postgres on OS X
    - "It's the emacs of databases": more of an OS for your data
    - `psql` is powerful command-line client
    - 30+ datatypes including IPs, MAC addresses, geospatial, arrays
    - Native arrays give the power of custom fields without a join
    - Loads of extensions such as `hstore`: a KVP store inside a column,
    with queryability
    - Simple native JSON type recently added, but PLV8 allows embedding of a
    JS engine (can open up JS-injection attacks!)
    - Range types include from+to within a single column, and can have
    checks on those (e.g. an exclusion check that no two entries overlap
    in time)
    - Light geospatial stuff is built-in; PostGIS provides full geospatial
    capabilities
    - Sequential scans are bad (most of the time). Indexes are good (most
    of the time)
    - Postgres has multiple types of indexes. B-Tree is the default, and
    you usually want it; GIN is used with multiple values in one column
    (arrays, hstore); GiST for full-text search and GIS
    - Aim for all queries being <= 10ms
    - Can create indexes concurrently without locking the whole table
    - Create indexes on certain conditions (e.g. active things only)
    - PG internal metrics can provide things like cache and index hit ratios
    - Window functions permit partitioning (sub-grouping) data while querying
    - Fuzzy string matching using `soundex()`
    - Move data around using `\copy` or `dblink`, not `SELECT` + `INSERT`
    - Foreign storage adapters such as Redis: in this case can `JOIN` across
    PG and Redis
    - Common table expressions allow naming of common queries, which can
    then be reused in subsequent queries
    - Extras: Listen/notify (pub/sub within the DB), per-transaction
    synchronous replication, `SELECT ... FOR UPDATE`
    - Replication introduced in 9.0, multi-master expected for 9.4
    - References: [Postgres Guide][pgguide] and a [presentation][notyourjob]

    [pgapp]: http://postgresapp.com/
    [pgguide]: http://www.postgresguide.com/
    [notyourjob]: http://thebuild.com/blog/2012/06/04/postgresql-when-its-not-your-job-at-djangocon-europe/
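The window-function bullet above is easiest to see by example: `PARTITION BY` sub-groups rows while still returning every row (unlike `GROUP BY`). A plain-Python sketch of what `rank() OVER (PARTITION BY dept ORDER BY salary DESC)` computes, using hypothetical data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; in Postgres these would come from a table
rows = [
    {"dept": "eng",   "name": "ada",   "salary": 120},
    {"dept": "eng",   "name": "grace", "salary": 130},
    {"dept": "sales", "name": "carl",  "salary": 90},
]

# Partition by dept, rank within each partition by salary (descending),
# keeping every row rather than collapsing each group to one row
ranked = []
for dept, group in groupby(sorted(rows, key=itemgetter("dept", "salary")),
                           key=itemgetter("dept")):
    for rank, row in enumerate(
            sorted(group, key=itemgetter("salary"), reverse=True), 1):
        ranked.append({**row, "rank": rank})

print(ranked)
```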

    ## Tim Moreton: Apache Cassandra and BASE

    - Facebook took bits of BigTable data model and Dynamo distribution to
    create Cassandra, used to power their inbox search. Open sourced it
    in 2008, top-level Apache project as of 2010. Now pretty prevalent
    - Multi-master (no SPOF), tunable consistency (multi-DC aware),
    optimised for writes (do more up front to gain on select time), atomic
    counters
    - Data model is a set of nested, sorted dictionaries. Columns are
    effectively just labels, and can be *very* wide
    - Reads are fast within a single row (across columns) but much slower
    between rows, because rows are spread around the cluster
    - Uses timestamp-based reconciliation for conflict resolution across
    the cluster
    - Tunable consistency for both writes and reads: one, quorum, all
    - Use case: session store. Read dominated, updates to existing items,
    probably fits in RAM, distribute for availability, challenge:
    atomicity
    - Use case: real-time analytics. Write dominated, updates rare, read
    "results" mostly, distribute for availability + performance + capacity,
    challenge: complex querying
    - Twitter's promoted tweets dashboard just used Cassandra counters,
    denormalising into buckets on writes, so the grouping etc is already
    done for reading (no need for separate counting, grouping etc)
    - Relies on up-front knowledge of the use of the data to be able to
    optimise for reading
    - Acunu Analytics: materialised views of data to provide better
    queryability on top of Cassandra data
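The write-time denormalisation pattern above can be sketched in plain Python (a dict stands in for Cassandra counter columns; the campaign name and bucket formats are illustrative):

```python
from collections import defaultdict
from datetime import datetime

# Stand-in for Cassandra counter columns: bump a counter in every bucket
# the read side will want, so reads are plain lookups with no grouping step
counters = defaultdict(int)

def record_impression(campaign, ts):
    # Write once per pre-aggregated bucket (daily and hourly here)
    for bucket in (ts.strftime("%Y-%m-%d"), ts.strftime("%Y-%m-%d %H:00")):
        counters[(campaign, bucket)] += 1

record_impression("acme", datetime(2012, 11, 23, 12, 56))
record_impression("acme", datetime(2012, 11, 23, 13, 5))

# Reading the daily total is a single lookup: the grouping already happened
print(counters[("acme", "2012-11-23")])  # 2
```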
  2. timblair revised this gist Nov 23, 2012. 1 changed file with 115 additions and 3 deletions.
    118 changes: 115 additions & 3 deletions ayb12.md
    @@ -1,6 +1,6 @@
    # AYB12

    ## Alvin: MongoDB
    ## Alvin Richards: MongoDB

    - Trade-off: scale vs. functionality. MongoDB tries to have good
    functionality *and* good scalability.
    @@ -63,7 +63,7 @@
    - Requires CouchDB on the server for sync
    - Safari + Opera support in progress, so not production-ready yet

    ## Matt: Eventual Consistency
    ## Matt Heitzenroder: Eventual Consistency

    - Brewer's Conjecture (2000): CAP -- you can only have two
    - "Life is full of tradeoffs" as is engineering
    @@ -128,4 +128,116 @@
    - Free MariaDB + MySQL knowledgebase available at
    [askmonty.org][askmonty]

    [askmonty]: http://askmonty.org/

    ## Brandon Keepers: Git: the NoSQL DB

    - Let's start with "Git's amazing ... what else can we do with it?"
    - "NoSQL is marketing bollocks" -- people mean non-relational and
    schemaless, and anything else gets lumped in to NoSQL
    - git calls itself "the stupid content tracker" (see the man page)
    - git has three "object types": blobs, trees and commits, plus symbolic
    "references" on top, all managed by the `git` command line tool
    - There are libraries to work with this (Grit, libgit2), plus ORMs built
    on top, such as ToyStore
    - NoSQL allows us to question RDBMS design, including big design
    up-front: schemaless allows us to be much more agile with our data
    model
    - git can handle transactions in both short-lived (one commit with
    multiple changes) and long-lived (branches) forms
    - Replication handled by the fact that all git repos are full clones
    - git doesn't have any of the features that makes a great DB: querying,
    concurrency (it's filesystem based), merge conflict resolution, scale
    - Scale: filesystem based, and problems with git at scale. Someone
    tested with a very large repo -- 4m commits, 1.3m files, 15Gb repo --
    and `git add` took 7 seconds, etc.
    - Think about how you can abuse your tools to get more out of them
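The content-addressable storage underlying those object types is easy to demonstrate: a blob's name is the SHA-1 of a small header plus the content. A minimal sketch:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    # git hashes "blob <size>\0" followed by the raw content
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo "hello world" | git hash-object --stdin`
print(git_blob_sha(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the name is derived from the content, identical blobs dedupe for free, which is part of what makes git usable as a KV store.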

    ## Peter Cooper: Redis, Steady, Go!

    - Peter's a Rubyist, and wants his languages and tools to be "beautiful,"
    which he considers Redis to be
    - Redis: remote [data structure] server -- no tables, no SQL, no
    enforced relationships, lots of working with primitives. The [Redis
    manifesto][rmanifesto] calls it a DSL for abstract data types
    - Like memcached but with more commands, more persistence, more data
    types
    - Three big use cases: database, messaging (pub/sub, queueing), or as
    a cache. Also: fast live stats logging (why Redis was created in the
    first place), rate limiting (using automatic key expiry),
    scoreboarding (using sorted sets), IPC, session storage
    - YouP*rn.com uses Redis as their primary datastore (~100 Alexa ranking)
    - Redis is single-threaded and event-driven (apart from background
    saving etc). Single-threading means individual operations are atomic
    - Python library redis_wrap means you can use normal Python data types,
    backed by Redis
    - Recent additions: scripting with Lua, plus a PostgreSQL foreign data wrapper
    - Five data types: strings, lists, sets, sorted sets, hashes
    - Abstract data type example: queueing using a list, with `LPOP` and
    `RPUSH`. Priority queues implemented by using a `BLPOP` with multiple
    list names
    - Set operations are available such as intersection, union, difference.
    Also provides the ability to store intermediary results in new keys.
    - Hashes don't allow storage of other data types: strings only
    - Supports transactions using `MULTI ... EXEC` to run all queued
    commands in one go
    - Master/slave replication is simple with the `SLAVEOF ...` command
    - Other updates and versions include Redis Sentinel (in development to
    provide automated failover), Redis Cluster (in development for fault
    tolerance of a subset of Redis commands) and a Windows version
    - Have a play with a "live demo" within the [redis.io][redisio]
    documentation

    [rmanifesto]: http://oldblog.antirez.com/post/redis-manifesto.html
    [redisio]: http://redis.io/
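The queueing pattern above can be modelled in memory (function names mirror the Redis commands, but this is a sketch, not a Redis client -- a real `BLPOP` also blocks waiting for data to arrive):

```python
from collections import deque

lists = {}  # stand-in for the Redis keyspace

def rpush(key, value):
    # Enqueue at the tail of the list
    lists.setdefault(key, deque()).append(value)

def lpop(key):
    # Dequeue from the head; None when empty (Redis returns nil)
    q = lists.get(key)
    return q.popleft() if q else None

def blpop_once(*keys):
    # BLPOP checks the given lists in order, so earlier keys act as
    # higher-priority queues
    for key in keys:
        value = lpop(key)
        if value is not None:
            return key, value
    return None

rpush("jobs:low", "reindex")
rpush("jobs:high", "charge-card")
print(blpop_once("jobs:high", "jobs:low"))  # ('jobs:high', 'charge-card')
```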

    ## Lisa Phillips: MySQL @Twitter

    - MySQL *plus friends* has enabled Twitter to still use MySQL (5.0 and
    5.5) as its primary datastore, with an average of 400 million tweets
    per day, 4,629/s average, with a peak at 25,088/s (during the broadcast
    of a Japanese anime film!)
    - 8 full-time DBAs (recently up from 6) managing thousands of MySQL
    instances, supporting 100s of developers. All DBAs have at-scale
    experience, and most developers are familiar with MySQL
    - The Twitter DBAs manage from the bare-metal up, including operating
    system, software, monitoring etc
    - Engineering in Twitter is about pragmatism: use commodity hardware
    and software, queues and async processing, eventual consistency,
    some delay tolerance (measured in seconds)
    - "Build new awesome tools (and open source them) *if you need to*"
    - They use "deciders": feature flags to enable roll-out to small volumes
    of people to gauge impact on the DB servers (plus other parts of the
    infrastructure)
    - Twitter don't roll back, either code or DB changes: they roll out
    slowly and iterate on any fixes
    - Replication (usually) works. Have seen replication break in lots of
    different ways so many times, so can now quickly fix any problems.
    - Bad points of MySQL: at-scale ID generation, graphs, replication
    inefficiencies and lag
    - "If you're using replication, make sure you can tolerate lag in your
    code. If you can't tolerate lag, don't use MySQL"
    - MySQL great for HA, "smaller" datasets (<1.5Tb)
    - Challenges: MySQL version diversity, single DBA, upgrades without HA
    solution, no load-balancing for reads
    - In 2012, they used a sharded master-slave setup using temporal
    sharding. New shards were hot, old shards not. New DB clusters being
    built every week, and DBA time became limiting factor
    - [Snowflake][snowflake] used for unique ID generation
    - [Gizzard][gizzard] created for sharding as a replacement for the
    temporal sharding (stores and replicates tweets, interest, social
    graph) and *replaces native MySQL replication* (disabling native
    replication improves performance)
    - Gizzard handles 6m `SELECT`s per second at peak, and creating more
    than 3b records per day
    - Other apps built on top of Gizzard: Flock, TBird, TFlock -- all of
    these are backed by MySQL
    - Still using traditional master-slave clusters (3-100 machines in a
    cluster) for non-tweet data such as user metadata, old Rails models
    - One Twitter employee is an ex-MySQL developer who now just works on
    MySQL for Twitter
    - Working on better logging and auditing support, real-time monitoring,
    performance and response metrics, row-based replication pre-fetching

    [snowflake]: https://github.com/twitter/snowflake
    [gizzard]: https://github.com/twitter/gizzard
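Snowflake's approach to at-scale ID generation can be sketched as composing a 64-bit ID from a millisecond timestamp, a worker id and a per-millisecond sequence, giving coordination-free, roughly time-sortable IDs. A simplified sketch (real Snowflake uses a custom epoch and splits the worker bits into datacenter + worker):

```python
import threading
import time

class IdGenerator:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                # Same millisecond: bump the 12-bit sequence
                self.sequence = (self.sequence + 1) & 0xFFF
            else:
                self.sequence = 0
                self.last_ms = now
            # 41 bits of time | 10 bits of worker | 12 bits of sequence
            return (now << 22) | (self.worker_id << 12) | self.sequence

gen = IdGenerator(worker_id=1)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # True: later IDs sort after earlier ones
```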
  3. timblair created this gist Nov 23, 2012.
    131 changes: 131 additions & 0 deletions ayb12.md
    @@ -0,0 +1,131 @@
    # AYB12

    ## Alvin: MongoDB

    - Trade-off: scale vs. functionality. MongoDB tries to have good
    functionality *and* good scalability.
    - Auto-sharding to maintain equilibrium between shards
    - Scalable datastore != scalable application: use of datastore may still
    be non-scalable (e.g. many queries across all shards)
    - Get low latency by ensuring shard data is always in memory: datastore
    then becomes a cache with persistence
    - Replica sets: auto-election of new primary node on failure, plus
    automatic recovery once failed node is back online
    - Async replication between nodes in a replica set (eventual
    consistency)
    - Auto TTL for messages, and can update on read operations
    - Tunable data consistency before write is "complete" from "none": fire
    and forget, assume it's going to get there eventually, to "full":
    includes remote replication to other geographies
    - Data model of an RDBMS enforces a relational model which can limit
    the ability to scale that system. Data locality ("which server is my
    record on?") becomes an issue
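The sharding bookkeeping touched on above can be sketched as range-based routing: each shard owns a contiguous range of the shard key, and a router maps a key to its owner (split points and shard names here are hypothetical, not MongoDB's API):

```python
import bisect

# Chunk boundaries on the shard key; keys < "g" go to shard0,
# "g" <= key < "p" to shard1 (boundary keys land on the right side), etc.
split_points = ["g", "p"]
shards = ["shard0", "shard1", "shard2"]

def shard_for(key):
    # bisect finds which range the key falls into in O(log n)
    return shards[bisect.bisect_right(split_points, key)]

print(shard_for("alvin"))   # shard0
print(shard_for("mongo"))   # shard1
print(shard_for("query"))   # shard2
```

Auto-sharding then amounts to splitting and migrating ranges to keep shards in equilibrium.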

    ## Luca Garulli: OrientDB

    - Biggest issue with switching from RDBMS: what about the data model?
    - KV, column-based, document DBs ... and graph DBs
    - Property graph model: vertices and edges can have properties, edges
    are directional, edges connect vertices, vertices can have one or more
    incoming + outgoing edges
    - In RDBMS, every time you traverse a relationship, you perform an
    expensive JOIN. Indexes can speed up reads, but slow down writes
    - Index lookups are generally based on balanced trees. More entries ==
    more lookup steps == slower JOIN
    - "A graph DB is any storage system that provides index-free adjacency"
    - A graph DB treats relationships as physical links assigned to the
    record when the edge is created; RDBMS computes the same relationship
    every time you perform a JOIN
    - Lookup time moves from O(log N) to O(1), and does not increase
    with DB size
    - NuvolaBase.com: REST-based graph DB service
    - Difficult to create distributed graph DBs. Scaling is basically a
    case of using client-side hashing.
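Index-free adjacency, as defined above, can be sketched in a few lines (a toy model, not OrientDB's API): each vertex holds direct references to its edges, so following a relationship is a pointer hop rather than an index lookup per JOIN.

```python
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []  # direct references, assigned at edge creation

def connect(a, b, label):
    # The "physical link" a graph DB stores with the record
    a.out_edges.append((label, b))

alice, bob = Vertex("alice"), Vertex("bob")
connect(alice, bob, "follows")

# Traversal never consults an index, so cost is independent of DB size
names = [v.name for label, v in alice.out_edges if label == "follows"]
print(names)  # ['bob']
```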

    ## Dale Harvey: PouchDB

    - CouchDB for JavaScript environments, mainly for browsers (but also
    works in Node.js)
    - Multi-master replication, supports disconnected sync
    - "Ground computing" -- like cloud computing, but provides offline
    behaviour with on-demand sync
    - Designed for building applications that need to work well offline, and
    that need to sync data
    - Would simplify something like multi-app SimpleNote-type system?
    - Offline is a fact: the more mobile devices, the more people are
    offline. No reception, data limits, slow / unstable connections etc
    - Sync is hard: the Things app took *2 years* to develop sync
    - Bad connections + retries, transfer overhead and moving deltas (mobile
    access might not want total sync), master-master scenarios, conflict
    resolution
    - [CP]ouchDB has good, simple conflict resolution, but sometimes you
    need to tell it what to resolve (based on your app usage)
    - Requires CouchDB on the server for sync
    - Safari + Opera support in progress, so not production-ready yet
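The "good, simple conflict resolution" above works by keeping both sides' revisions after a partition and picking a winner deterministically, so every replica agrees without coordination. A sketch (a simplification of CouchDB's actual algorithm, which compares revision-tree depth and then revision ids):

```python
def pick_winner(revs):
    # Each rev is (depth, rev_id, doc); deterministic ordering means all
    # replicas independently pick the same winner
    return max(revs, key=lambda r: (r[0], r[1]))

# Two revisions of the same doc, diverged during a partition
conflicting = [
    (2, "b77", {"title": "notes", "body": "edited offline"}),
    (2, "a12", {"title": "notes", "body": "edited online"}),
]
winner = pick_winner(conflicting)
print(winner[2]["body"])  # edited offline
```

The losing revision isn't discarded: it stays available as a conflict, which is what lets your app "tell it what to resolve" based on usage.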

    ## Matt: Eventual Consistency

    - Brewer's Conjecture (2000): CAP -- you can only have two
    - "Life is full of tradeoffs" as is engineering
    - [Amazon's Dynamo paper][dynamo]: tradeoff between C & A -- they chose A
    - Financial systems already dealing with eventual consistency: trading
    banks closing and reconciling, network partitions between cash point
    and centralised bank etc
    - Riak uses vnodes in a ring topology (ketama-style)
    - Writes go to hashed node + the next two (i.e. three copies on separate
    nodes)
    - Read Repair: handle out-of-date copies of data on vnodes automatically
    on read and update out-of-date nodes to logical descendants (e.g. v1
    -> v2)
    - Read Repair etc means internally three objects are requested and
    checked for consistency. This can be tuned via quorum, single-read
    for speed etc
    - There can be divergent object versions, a.k.a. siblings: after a
    network partition, two operations can have altered object state at the
    same time. Riak returns *both* versions
    - Per-application, you can supply a "conflict resolver" as part of the
    Riak client to define how to handle sibling resolution
    - Common use-cases are: pick one based on some property, or perform a
    set union of the data
    - [Probabilistically Bounded Staleness][pbs]

    [dynamo]: http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
    [pbs]: http://pbs.cs.berkeley.edu/
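The vnode ring and "hashed node + the next two" placement can be sketched as a ketama-style hash ring (node names and the points-per-node figure here are arbitrary, and real Riak partitions the ring differently):

```python
import hashlib
from bisect import bisect

NODES = ["riak1", "riak2", "riak3", "riak4"]

def ring_position(value):
    # Map any string to a position on the ring
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Each node claims several points on the ring (ketama-style)
ring = sorted((ring_position(f"{n}-{i}"), n) for n in NODES for i in range(8))

def preference_list(key, n=3):
    # Walk clockwise from the key's position, collecting n distinct nodes:
    # the owning vnode plus the next two, so three copies land on
    # separate nodes
    start = bisect(ring, (ring_position(key),)) % len(ring)
    chosen = []
    for pos, node in ring[start:] + ring[:start]:
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

replicas = preference_list("user:42")
print(replicas)  # three distinct nodes
```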

    ## Monty Widenius: MySQL-MariaDB

    - MySQL named after Monty's daughter, My (MaxDB released later, named
    after his son, Max)
    - Original MySQL devs started focussing on MariaDB in 2009 with the
    impending purchase of Sun by Oracle
    - Chose to use dual-license to be able to work full-time on MySQL: took
    2 months to become profitable
    - Don't go to investors when you need their money. Wait for them to
    come to you when you *don't* need their money, and you won't have to
    give up so much of your company
    - Monty Program Ab: new company (using Hacker Business Model) to focus
    on MariaDB, with most of the original MySQL developers
    - Aim to keep MySQL dev talent together, always have an open-source
    version of MySQL. More important after Oracle purchase of Sun
    - MariaDB is a drop-in replacement for MySQL. "No reason to use MySQL
    anymore: MariaDB is better in all cases"
    - Big JOIN and subquery performance is an order of magnitude (or more)
    faster than MySQL
    - "SQL doesn't solve all common problems" e.g. arbitrary attributes
    (shop item sizes, colours etc). Dynamic columns introduced in MariaDB
    5.3. As a POC, created a storage engine for Cassandra with MariaDB 10
    - Any close-sourced features that Oracle has added to MySQL have been
    added to MariaDB as open-source features
    - 5.5 introduces a new thread pool (instead of thread-per-connection)
    - Full merge of MySQL 5.6 into MariaDB 5.6 is a year-long project due to
    broken features and new bugs, over-complicated code, lack of
    understanding of existing code etc
    - Did such a good job of getting the MySQL name out there, changing
    everyone over to MariaDB is going to be a tough job!
    - Though creating a dev community is easier as Oracle is not working
    with the community
    - Aim of MariaDB: make MySQL obsolete
    - Free MariaDB + MySQL knowledgebase available at
    [askmonty.org][askmonty]

    [askmonty]: http://askmonty.org/