@jkittner
Created January 18, 2025 22:17

Implementing gapfilled materialized views with timescale

intro

TimescaleDB's continuous aggregates and materialized views are quite limited. One could think that just replacing time_bucket with time_bucket_gapfill would be enough to enable gap filling in a materialized view, but it is simply not allowed in the context of a continuous aggregate. Then one could think that sticking with time_bucket and implementing the gap-filling logic yourself would be enough, but CTEs are not allowed in timescale materialized views either. The next logical thought would be to use subqueries instead, but those are not allowed either. The only remaining option is to completely ditch timescaledb.continuous. This has a few downsides, but mainly for very large views, which might not be the case for a lot of use cases, so for those this is a viable option.
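For illustration, this is a sketch of the first naive attempt described above (assuming data were a hypertable; the exact error wording varies by TimescaleDB version). Timescale rejects it:

-- rejected: time_bucket_gapfill is not supported in continuous aggregates
CREATE MATERIALIZED VIEW data_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket_gapfill('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement)
FROM data
GROUP BY hourly_bucket, id;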

solution

Let's create an example setup with multiple stations, each having some data.

-- define two tables that are connected via a foreign key
CREATE TABLE stations(
    id BIGINT PRIMARY KEY
);

CREATE TABLE data(
    date TIMESTAMPTZ,
    id BIGINT REFERENCES stations(id),
    measurement NUMERIC,
    PRIMARY KEY(date, id)
);

Now fill the tables with some example data, intentionally leaving gaps that differ between the stations with id 1 and id 2.

-- add some example data to showcase the result
INSERT INTO stations(id) VALUES (1), (2);

INSERT INTO data(date, id, measurement) VALUES
    ('2025-01-01 09:15', 1, 0),
    ('2025-01-01 10:15', 1, 2),
    ('2025-01-01 10:45', 1, 4),
    ('2025-01-01 12:15', 1, 6),
    ('2025-01-01 12:45', 1, 8),
    ('2025-01-01 10:15', 2, 3),
    ('2025-01-01 10:45', 2, 6),
    ('2025-01-01 13:15', 2, 9),
    ('2025-01-01 13:45', 2, 12),
    ('2025-01-01 14:15', 2, 15);

Finally, let's create the materialized view in a multi-step query.

-- define the gap-filled materialized view
CREATE MATERIALIZED VIEW data_hourly AS
-- get the start and end date of measurements per station,
-- so we don't extrapolate the gapfilling, but only fill between
-- existing values
WITH data_bounds AS (
    SELECT
        id,
        MIN(date) AS start_time,
        MAX(date) AS end_time
    FROM data
    GROUP BY id
),
-- generate a complete time series that contains all possible dates for any
-- station present
gapfiller_time_series AS (
    SELECT generate_series(
        (SELECT MIN(date) FROM data),
        (SELECT MAX(date) FROM data),
        '1 hour'::INTERVAL
    ) AS date
),
-- based on the gapfiller time series, generate potential fillers for each
-- id in station. Take care that only filler data is generated between the
-- earliest and latest measurement per individual station
time_station_combinations AS (
    SELECT
        date,
        stations.id AS id,
        start_time,
        end_time
    FROM gapfiller_time_series
    CROSS JOIN stations
    JOIN data_bounds ON data_bounds.id = stations.id
    WHERE
        gapfiller_time_series.date >= data_bounds.start_time AND
        gapfiller_time_series.date <= data_bounds.end_time
),
-- now combine both the actual measurements and the filler data, intentionally
-- creating duplicates, which will be eliminated by the aggregation in the final step
filled_data AS (
    (
        SELECT
            date,
            id,
            NULL AS measurement
        FROM time_station_combinations
    )
    UNION ALL
    (
        SELECT
            date,
            id,
            measurement
        FROM data
    )
)
-- Now use the regular time_bucket function to calculate hourly averages.
-- This will produce rows of NULL where there are no values; otherwise
-- the filler NULLs are ignored when the aggregations contain actual values
SELECT
    time_bucket('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement)
FROM filled_data
GROUP BY hourly_bucket, id
ORDER BY hourly_bucket, id DESC;

To be able to refresh the view concurrently, we need a unique index:

CREATE UNIQUE INDEX ON data_hourly(id, hourly_bucket);
REFRESH MATERIALIZED VIEW CONCURRENTLY data_hourly;
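Unlike a continuous aggregate, this view has no automatic refresh policy. A sketch of scheduling periodic refreshes, assuming the pg_cron extension is installed (the job name and interval are placeholders):

SELECT cron.schedule(
    'refresh-data-hourly',  -- hypothetical job name
    '*/30 * * * *',         -- every 30 minutes
    'REFRESH MATERIALIZED VIEW CONCURRENTLY data_hourly'
);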

results

With this setup the gaps are filled consistently between available data points, avoiding extrapolation for stations that either started late or stopped early with their measurements.

SELECT * FROM data_hourly ORDER BY id, hourly_bucket;
     hourly_bucket      | id |          avg
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

This is improved behavior compared to plain time_bucket_gapfill.

One could of course drop that behavior by simply removing the data_bounds logic and its conditional join after the CROSS JOIN, so the behavior is consistent with time_bucket_gapfill. Or just use time_bucket_gapfill directly, in a plain materialized view without timescaledb.continuous:

SELECT
    time_bucket_gapfill('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement) AS measurement
FROM data
WHERE date BETWEEN '2025-01-01 09:15' AND '2025-01-01 14:15'
GROUP BY hourly_bucket, id
ORDER BY id, hourly_bucket;
     hourly_bucket      | id |      measurement
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 13:00:00+00 |  1 |
 2025-01-01 14:00:00+00 |  1 |
 2025-01-01 09:00:00+00 |  2 |
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

performance

Performance-wise, the workaround will likely perform better than the timescale materialized view for views with fewer rows, but for very large views performance will likely be worse. So this is only a solution for views up to a few hundred thousand rows.

conclusions

I hope this will be useful for people who don't need all the features of a timescale-native materialized view. The correct solution would of course be for timescale to simply support time_bucket_gapfill in materialized views.
