Skip to content

Instantly share code, notes, and snippets.

@tdfirth
Last active April 8, 2024 03:27
Show Gist options
  • Save tdfirth/49e56c1043f8a4ed1d931e7225a30f62 to your computer and use it in GitHub Desktop.
Save tdfirth/49e56c1043f8a4ed1d931e7225a30f62 to your computer and use it in GitHub Desktop.
Asof joins in duckdb
How to think about asof queries.
id user_id timestamp
1 1 2/21/2024
2 1 2/17/2024
3 2 2/20/2024
4 3 2/19/2024
5 3 2/15/2024
6 4 2/18/2024
id user_id post timestamp
1 1 1 2/22/2024
2 1 2 2/20/2024
3 1 3 2/18/2024
4 1 4 2/16/2024
5 1 5 2/14/2024
6 2 1 2/22/2024
7 2 2 2/20/2024
8 2 3 2/18/2024
9 2 4 2/16/2024
10 2 5 2/14/2024
11 3 1 2/22/2024
12 3 2 2/20/2024
13 3 3 2/18/2024
14 3 4 2/16/2024
15 3 5 2/14/2024
16 4 1 2/22/2024
17 4 2 2/20/2024
18 4 3 2/18/2024
19 4 4 2/16/2024
20 4 5 2/14/2024
21 5 1 2/22/2024
22 5 2 2/20/2024
23 5 3 2/18/2024
24 5 4 2/16/2024
25 5 5 2/14/2024
with candidates as (
select
o.user_id as user_id,
o.id as order_id,
p.id as post_id,
o.timestamp as order_time,
p.timestamp as post_view_time,
row_number() over (partition by o.id order by p.timestamp desc) rn
from "orders.csv" o
inner join "post_views.csv" p
on o.user_id = p.user_id and o.timestamp >= p.timestamp
)
select * from candidates
where rn = 1
order by order_time desc;
user_id order_id post_id order_time post_view_time rn
1 1 2 2024-02-21 2024-02-20 1
2 3 7 2024-02-20 2024-02-20 1
3 4 13 2024-02-19 2024-02-18 1
4 6 18 2024-02-18 2024-02-18 1
1 2 4 2024-02-17 2024-02-16 1
3 5 15 2024-02-15 2024-02-14 1
select
o.user_id as user_id,
o.id as order_id,
p.id as post_id,
o.timestamp as order_time,
p.timestamp as post_view_time
from "orders.csv" o
asof join "post_views.csv" p
on o.user_id = p.user_id
and o.timestamp >= p.timestamp
order by order_time desc;
user_id order_id post_id order_time post_view_time
1 1 2 2024-02-21 2024-02-20
2 3 7 2024-02-20 2024-02-20
3 4 13 2024-02-19 2024-02-18
4 6 18 2024-02-18 2024-02-18
1 2 4 2024-02-17 2024-02-16
3 5 15 2024-02-15 2024-02-14
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ candidates.order_time DESC│
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ user_id │
│ order_id │
│ post_id │
│ order_time │
│ post_view_time │
│ rn │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
│ #2 │
│ #4 │
│ #5 │
│ #6 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (rn = 1) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 23 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ WINDOW │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ROW_NUMBER() OVER │
│(PARTITION BY id ORDER... │
│ DESC NULLS LAST) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ INNER │
│ user_id = user_id ├──────────────┐
│ timestamp <= timestamp │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ EC: 23 │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│READ_CSV_AUTO (MULTI-T... ││READ_CSV_AUTO (MULTI-T... │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ user_id ││ user_id │
│ timestamp ││ timestamp │
│ id ││ id │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 23 ││ EC: 7 │
└───────────────────────────┘└───────────────────────────┘
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ o."timestamp" DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ user_id │
│ order_id │
│ post_id │
│ order_time │
│ post_view_time │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ASOF_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ INNER │
│ user_id = user_id ├──────────────┐
│ timestamp >= timestamp │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ EC: 23 │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│READ_CSV_AUTO (MULTI-T... ││READ_CSV_AUTO (MULTI-T... │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ user_id ││ user_id │
│ timestamp ││ timestamp │
│ id ││ id │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 7 ││ EC: 23 │
└───────────────────────────┘└───────────────────────────┘
@tdfirth
Copy link
Author

tdfirth commented Feb 22, 2024

Asof SQL queries

I've seen a few people get confused about asof queries recently, and they're starting to land in a couple of mainstream platforms! Snowflake added support very recently (it's still in closed beta) and duckdb added them a little while back.

They've been around for a while though, the first time I encountered them was with a system called q/kdb+, which is commonly used in hedge funds and other financial organizations. Matching up data on timestamps is extremely common in that domain, so it's no surprise that tooling there has made this first class for some time.

What are they for?

The use case is pretty simple really. You have two tables, and both tables hold time series data. Your goal is to want to find items in one table relative to the other in time.

Here are some concrete examples:

  • You want to find the latest price for a stock given an order at a certain time.
  • You want to find the last blog post someone viewed before they made an order.
  • You want to find the first action that somebody too after they did another action.

A worked example

I've attached a couple of simple csvs and a couple of queries to this gist. The queries are for duckdb so if you download the csvs then the queries will just work (on duckdb v0.9.2 at least...)!

The data set has two tables, one is order data, and the other is post views. Our goal is to try and find the post that somebody read before they made an order (the content team wants to know).

The old fashioned way

This is actually not that hard to solve without asof joins. You can use a CTE and a window function, or a self join - there are quite a few ways to do it really. I'm going to use the classic 'CTE and row_number' here, because you see it very often in the wild.

See the file 3_window_query.sql for the query. The general idea is:

  • Join post_views on to orders on the order id and where the order timestamp is greater than or equal to the view timestamp.
  • At this point, we have every post view that came before the order joined, but we want to narrow it down to a single order.
  • So we partition by order id, order by post timestamp, and output the row number.
  • We then filter for the rows that have row number = 1.
  • We've now got the post that was most recently read before each order.

The asof way

Now that we've worked through it with a CTE and a window function, we should have a pretty good grasp for the kind of thing that the asof query is doing. You've probably done it a hundred times.

Now, let's do the exact same thing but with asof. See the file 5_asof_query.sql.

In short, it's identical, except that we replaced the window function and the rn = 1 constraint with an asof join.

I've included the output of both queries so that you can see they're identical (but run the query yourself and have a play!)

So why asof?

Ok this all makes sense, but why do we need a fancy join operator for it?

Ease of use

In short, syntax does matter. In some domains this is a very common query pattern, and so it is more convenient and less error prone to have a language feature to support it.

The syntactic convenience is amplified when you get more complex too. What if I'm joining three tables in this way? The window function method starts to get gnarly pretty fast, but with the asof join it's actually pretty straight forward.

Performance

I can't speak as to whether duckdb (or any other system) optimizes this operator in any other way, but the fact that we are using an operator instead of a query pattern provides an opportunity for optimization.

To demonstrate this, I've included two duckdb query plans. The first is the window query plan (7_window_plan.txt) and the second is the asof query plan (8_asof_plan.txt).

You can see that there are far fewer plan nodes in the asof query. This doesn't necessarily mean that it's faster, but it does mean that duckdb is free to use a specialized implementation for the ASOF_JOIN physical plan node type. This provides opportunities for optimization that are simply not possible otherwise. If you're familiar with deep learning framework terminology, you could think of it as a fused kernel.

I mentioned q/kdb+ earlier, which in many ways is built entirely for this kind of operation. Data is typically laid out in timestamp order on disk, and so these kinds of asof joins are extremely fast (just a binary search and taking a single value). When you're dealing with NYSE scale tick data (which is many terabytes per day), this is extremely important.

PS: If you haven't done it before, you can get these query plans just by sticking the keyword explain in front of any query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment