Skip to content

Instantly share code, notes, and snippets.

View Nipsuli's full-sized avatar

Niko Korvenlaita Nipsuli

View GitHub Profile
@Nipsuli
Nipsuli / stg___jaffle__orders.sql
Created August 21, 2024 07:32
data load with meta columns example
SELECT
/* Meta columns to have same pattern across sources
* ___hash for data deduplication and creating unique id that's stable
* ___source_pk to be able to PARTITION BY to get the correct data per object
* ___as_of telling the business validity. In case something is a range already in the source system, then split into two rows
* ___loaded_at to indicate when this particular table was loaded
* ___original_loaded_at (optional) in case we have some indicator when the data originally has arrived into our system
*/
TO_HEX(SHA256(TO_JSON_STRING(my_table))) AS ___hash,
my_table.id as ___source_pk,
@Nipsuli
Nipsuli / second_login.sql
Created August 18, 2022 07:19
Example on how to get timestamp for second login. Used to educate about Non-Equi JOIN in SQL
WITH
base AS (
SELECT
SAFE_CAST (
JSON_EXTRACT_SCALAR (payload, '$.timestamp') AS timestamp
) AS event_time,
JSON_EXTRACT_SCALAR (payload, '$.action') AS action_type,
JSON_EXTRACT_SCALAR (payload, '$.actor_id') AS actor_id,
JSON_EXTRACT_SCALAR (payload, '$.actor_username') AS actor_username,
FROM
@Nipsuli
Nipsuli / weighted_std.sql
Created May 22, 2022 06:47
Numerically stable parallel safe standard deviation in PostgreSQL
CREATE OR REPLACE FUNCTION _wstd_state(state numeric[3], val numeric, weight numeric)
RETURNS numeric[3] AS $$
DECLARE
s_n_1 CONSTANT numeric NOT NULL := state[1];
mu_n_1 CONSTANT numeric NOT NULL := state[2];
w_n_1 CONSTANT numeric NOT NULL := state[3];
s_n numeric;
mu_n numeric;
w_n numeric;
BEGIN
@Nipsuli
Nipsuli / two_pass_shuffle.py
Last active November 10, 2021 19:06
Two pass shuffle implementation for algorithm described: in https://blog.janestreet.com/how-to-shuffle-a-big-dataset/
import contextlib
import tempfile
import random
def two_pass_shuffle(input_files, output_files, temp_file_count, header_lines=0):
"""
two_pass_shuffle
Suffle data larger that can be shuffled in memory.
Implementation based on: