Last active
April 8, 2024 03:27
-
-
Save tdfirth/49e56c1043f8a4ed1d931e7225a30f62 to your computer and use it in GitHub Desktop.
Asof joins in duckdb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
How to think about asof queries. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
id | user_id | timestamp | |
---|---|---|---|
1 | 1 | 2/21/2024 | |
2 | 1 | 2/17/2024 | |
3 | 2 | 2/20/2024 | |
4 | 3 | 2/19/2024 | |
5 | 3 | 2/15/2024 | |
6 | 4 | 2/18/2024 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
id | user_id | post | timestamp | |
---|---|---|---|---|
1 | 1 | 1 | 2/22/2024 | |
2 | 1 | 2 | 2/20/2024 | |
3 | 1 | 3 | 2/18/2024 | |
4 | 1 | 4 | 2/16/2024 | |
5 | 1 | 5 | 2/14/2024 | |
6 | 2 | 1 | 2/22/2024 | |
7 | 2 | 2 | 2/20/2024 | |
8 | 2 | 3 | 2/18/2024 | |
9 | 2 | 4 | 2/16/2024 | |
10 | 2 | 5 | 2/14/2024 | |
11 | 3 | 1 | 2/22/2024 | |
12 | 3 | 2 | 2/20/2024 | |
13 | 3 | 3 | 2/18/2024 | |
14 | 3 | 4 | 2/16/2024 | |
15 | 3 | 5 | 2/14/2024 | |
16 | 4 | 1 | 2/22/2024 | |
17 | 4 | 2 | 2/20/2024 | |
18 | 4 | 3 | 2/18/2024 | |
19 | 4 | 4 | 2/16/2024 | |
20 | 4 | 5 | 2/14/2024 | |
21 | 5 | 1 | 2/22/2024 | |
22 | 5 | 2 | 2/20/2024 | |
23 | 5 | 3 | 2/18/2024 | |
24 | 5 | 4 | 2/16/2024 | |
25 | 5 | 5 | 2/14/2024 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
with candidates as ( | |
select | |
o.user_id as user_id, | |
o.id as order_id, | |
p.id as post_id, | |
o.timestamp as order_time, | |
p.timestamp as post_view_time, | |
row_number() over (partition by o.id order by p.timestamp desc) rn | |
from "orders.csv" o | |
inner join "post_views.csv" p | |
on o.user_id = p.user_id and o.timestamp >= p.timestamp | |
) | |
select * from candidates | |
where rn = 1 | |
order by order_time desc; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
user_id | order_id | post_id | order_time | post_view_time | rn | |
---|---|---|---|---|---|---|
1 | 1 | 2 | 2024-02-21 | 2024-02-20 | 1 | |
2 | 3 | 7 | 2024-02-20 | 2024-02-20 | 1 | |
3 | 4 | 13 | 2024-02-19 | 2024-02-18 | 1 | |
4 | 6 | 18 | 2024-02-18 | 2024-02-18 | 1 | |
1 | 2 | 4 | 2024-02-17 | 2024-02-16 | 1 | |
3 | 5 | 15 | 2024-02-15 | 2024-02-14 | 1 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
select | |
o.user_id as user_id, | |
o.id as order_id, | |
p.id as post_id, | |
o.timestamp as order_time, | |
p.timestamp as post_view_time | |
from "orders.csv" o | |
asof join "post_views.csv" p | |
on o.user_id = p.user_id | |
and o.timestamp >= p.timestamp | |
order by order_time desc; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
user_id | order_id | post_id | order_time | post_view_time | |
---|---|---|---|---|---|
1 | 1 | 2 | 2024-02-21 | 2024-02-20 | |
2 | 3 | 7 | 2024-02-20 | 2024-02-20 | |
3 | 4 | 13 | 2024-02-19 | 2024-02-18 | |
4 | 6 | 18 | 2024-02-18 | 2024-02-18 | |
1 | 2 | 4 | 2024-02-17 | 2024-02-16 | |
3 | 5 | 15 | 2024-02-15 | 2024-02-14 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
┌─────────────────────────────┐ | |
│┌───────────────────────────┐│ | |
││ Physical Plan ││ | |
│└───────────────────────────┘│ | |
└─────────────────────────────┘ | |
┌───────────────────────────┐ | |
│ ORDER_BY │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ ORDERS: │ | |
│ candidates.order_time DESC│ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ PROJECTION │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ user_id │ | |
│ order_id │ | |
│ post_id │ | |
│ order_time │ | |
│ post_view_time │ | |
│ rn │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ PROJECTION │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ #0 │ | |
│ #1 │ | |
│ #2 │ | |
│ #4 │ | |
│ #5 │ | |
│ #6 │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ FILTER │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ (rn = 1) │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ EC: 23 │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ WINDOW │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ ROW_NUMBER() OVER │ | |
│(PARTITION BY id ORDER... │ | |
│ DESC NULLS LAST) │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ HASH_JOIN │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ INNER │ | |
│ user_id = user_id ├──────────────┐ | |
│ timestamp <= timestamp │ │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │ | |
│ EC: 23 │ │ | |
└─────────────┬─────────────┘ │ | |
┌─────────────┴─────────────┐┌─────────────┴─────────────┐ | |
│READ_CSV_AUTO (MULTI-T... ││READ_CSV_AUTO (MULTI-T... │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ user_id ││ user_id │ | |
│ timestamp ││ timestamp │ | |
│ id ││ id │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ EC: 23 ││ EC: 7 │ | |
└───────────────────────────┘└───────────────────────────┘ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
┌─────────────────────────────┐ | |
│┌───────────────────────────┐│ | |
││ Physical Plan ││ | |
│└───────────────────────────┘│ | |
└─────────────────────────────┘ | |
┌───────────────────────────┐ | |
│ ORDER_BY │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ ORDERS: │ | |
│ o."timestamp" DESC │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ PROJECTION │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ user_id │ | |
│ order_id │ | |
│ post_id │ | |
│ order_time │ | |
│ post_view_time │ | |
└─────────────┬─────────────┘ | |
┌─────────────┴─────────────┐ | |
│ ASOF_JOIN │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ INNER │ | |
│ user_id = user_id ├──────────────┐ | |
│ timestamp >= timestamp │ │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │ | |
│ EC: 23 │ │ | |
└─────────────┬─────────────┘ │ | |
┌─────────────┴─────────────┐┌─────────────┴─────────────┐ | |
│READ_CSV_AUTO (MULTI-T... ││READ_CSV_AUTO (MULTI-T... │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ user_id ││ user_id │ | |
│ timestamp ││ timestamp │ | |
│ id ││ id │ | |
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ | |
│ EC: 7 ││ EC: 23 │ | |
└───────────────────────────┘└───────────────────────────┘ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Asof SQL queries
I've seen a few people get confused about asof queries recently, and they're starting to land in a couple of mainstream platforms! Snowflake added support very recently (it's still in closed beta) and duckdb added them a little while back.
They've been around for a while though, the first time I encountered them was with a system called q/kdb+, which is commonly used in hedge funds and other financial organizations. Matching up data on timestamps is extremely common in that domain, so it's no surprise that tooling there has made this first class for some time.
What are they for?
The use case is pretty simple really. You have two tables, and both tables hold time series data. Your goal is to want to find items in one table relative to the other in time.
Here are some concrete examples:
A worked example
I've attached a couple of simple csvs and a couple of queries to this gist. The queries are for duckdb so if you download the csvs then the queries will just work (on duckdb v0.9.2 at least...)!
The data set has two tables, one is order data, and the other is post views. Our goal is to try and find the post that somebody read before they made an order (the content team wants to know).
The old fashioned way
This is actually not that hard to solve without asof joins. You can use a CTE and a window function, or a self join - there are quite a few ways to do it really. I'm going to use the classic 'CTE and row_number' here, because you see it very often in the wild.
See the file
3_window_query.sql
for the query. The general idea is:The asof way
Now that we've worked through it with a CTE and a window function, we should have a pretty good grasp for the kind of thing that the asof query is doing. You've probably done it a hundred times.
Now, let's do the exact same thing but with asof. See the file
5_asof_query.sql
.In short, it's identical, except that we replaced the window function and the rn = 1 constraint with an asof join.
I've included the output of both queries so that you can see they're identical (but run the query yourself and have a play!)
So why asof?
Ok this all makes sense, but why do we need a fancy join operator for it?
Ease of use
In short, syntax does matter. In some domains this is a very common query pattern, and so it is more convenient and less error prone to have a language feature to support it.
The syntactic convenience is amplified when you get more complex too. What if I'm joining three tables in this way? The window function method starts to get gnarly pretty fast, but with the asof join it's actually pretty straight forward.
Performance
I can't speak as to whether duckdb (or any other system) optimizes this operator in any other way, but the fact that we are using an operator instead of a query pattern provides an opportunity for optimization.
To demonstrate this, I've included two duckdb query plans. The first is the window query plan (
7_window_plan.txt
) and the second is the asof query plan (8_asof_plan.txt
).You can see that there are far fewer plan nodes in the asof query. This doesn't necessarily mean that it's faster, but it does mean that duckdb is free to use a specialized implementation for the
ASOF_JOIN
physical plan node type. This provides opportunities for optimization that are simply not possible otherwise. If you're familiar with deep learning framework terminology, you could think of it as a fused kernel.I mentioned q/kdb+ earlier, which in many ways is built entirely for this kind of operation. Data is typically laid out in timestamp order on disk, and so these kinds of asof joins are extremely fast (just a binary search and taking a single value). When you're dealing with NYSE scale tick data (which is many terabytes per day), this is extremely important.
PS: If you haven't done it before, you can get these query plans just by sticking the keyword
explain
in front of any query.