Probabilistic M2M relationships using Bloom filters

Here’s an idea that’s been kicking around inside my head recently.

A standard M2M relationship, as represented in SQL, looks like this:

CREATE TABLE movie (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255)
);

CREATE TABLE person (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255)
);

CREATE TABLE movies_people (
  movie_id INTEGER REFERENCES movie,
  person_id INTEGER REFERENCES person
);

To find the people for a given movie (including the details of the movie itself):

SELECT *
FROM
  movie
  INNER JOIN movie_people ON (movie.id = movie_people.movie_id)
  INNER JOIN person ON (movie_people.person_id = person.id)
WHERE movie.id = $MOVIE_ID;

Finding the movies for a given person just involves changing the WHERE predicate to filter for person.id instead.

Using a junction table for a sparse or small data set (where there are not many associations between movies and people) gives acceptable space and time consumption properties. But for denser association matrices (which may grow over time), the upper bound on the size of the junction table is O(n(movies) * n(people)), and the upper bound on the time taken to join all three tables will be the square of that. So what optimizations and trade-offs can be made in such a situation?

Well, we can use a bloom filter on each side of the M2M relationship and do away with the junction table altogether. Here’s what the SQL (for Postgres) looks like:

CREATE TABLE movie (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  person_filter BIT($PERSON_FILTER_LENGTH),
  hash BIT($MOVIE_FILTER_LENGTH)
);

CREATE TABLE person (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  movie_filter BIT($MOVIE_FILTER_LENGTH),
  hash BIT($PERSON_FILTER_LENGTH)
);

I haven’t calibrated these filters yet, so I’ve yet to decide how long to make each one. I’m also doing something different compared to the normal explanation of a bloom filter. Typically each element is expressed as the set of results of k hash functions, each mapping to an index in a bit array of length m. I prefer to think of a single hash function with an m-bit output and a popcount guaranteed to be less than or equal to k. This is effectively identical, but it helps you think of the filters themselves in a different way: as a union of a set of hash outputs. All of a sudden, these filters seem less daunting—they’re just fancy bit arrays.

zacharyvoase/prob-m2m.md