Created July 12, 2021 08:54
Mutual information in SQL between discrete variables
-- Select the two discrete columns to compare
with t1 as (
  select
    column1 as x,
    column2 as y
  from your_table
),
-- Drop rows with a missing observation in either column
t as (
  select x, y
  from t1
  where x is not null and y is not null
),
-- Total, marginal and joint counts (cast to real to avoid integer division)
n as ( select count(*)::real as n from t ),
x as ( select x, count(*)::real as cx from t group by 1 ),
y as ( select y, count(*)::real as cy from t group by 1 ),
xy as ( select x, y, count(*)::real as cxy from t group by 1, 2 ),
-- Mutual information I(X;Y), in nats because ln() is the natural log
ixy as (
  select sum(cxy/n * (ln(n)+ln(cxy)-ln(cx)-ln(cy))) as ixy
  from xy join x on xy.x = x.x join y on xy.y = y.y, n
),
-- Joint entropy H(X,Y), in nats
hxy as ( select -sum(cxy/n * (ln(cxy)-ln(n))) as hxy from xy, n )
--select 1 - ixy/hxy from hxy, ixy; -- Jaccard distance
select ixy from ixy; -- Mutual information: I(X;Y)
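For reference, the sums in the query above can be read as the standard definitions with probabilities estimated from counts. This is a sketch of that correspondence, assuming p(x,y) = c_xy/n, p(x) = c_x/n and p(y) = c_y/n, where the symbols c_xy, c_x, c_y and n stand for the columns `cxy`, `cx`, `cy` and `n`:

```latex
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y)\,\ln\frac{p(x,y)}{p(x)\,p(y)}
        = \sum_{x,y} \frac{c_{xy}}{n}\,\bigl(\ln n + \ln c_{xy} - \ln c_x - \ln c_y\bigr) \\
H(X,Y) &= -\sum_{x,y} p(x,y)\,\ln p(x,y)
        = -\sum_{x,y} \frac{c_{xy}}{n}\,\bigl(\ln c_{xy} - \ln n\bigr) \\
D(X,Y) &= 1 - \frac{I(X;Y)}{H(X,Y)}
\end{aligned}
```

Because the query uses the natural logarithm, both I(X;Y) and H(X,Y) come out in nats. D(X,Y) is the quantity in the commented-out line labelled "Jaccard distance".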
Removes records where `x` or `y` is NULL (i.e. missing observations). To keep these as known "other" values rather than missing, replace the `t` CTE with one that infills NULLs with a sensible value; this approach works for various data types.
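The replacement CTE itself did not survive in this copy of the comment. One possible shape, offered as a sketch and not as the author's original, is to cast both columns to text and use `coalesce` so that NULL becomes its own category. The toy `values` rows and the `'(missing)'` label are assumptions made for illustration; pick a label that cannot collide with a real value in your data.

```sql
-- Sketch only: toy rows stand in for your_table (PostgreSQL names them column1, column2).
with t1 as (
  select
    column1 as x,
    column2 as y
  from (values ('a', 'p'), ('a', 'p'), ('b', null), (null, 'q')) as v
),
t as (
  -- Cast to text so one placeholder works whatever the column type is,
  -- then keep NULL rows as an explicit category instead of filtering them out.
  -- '(missing)' is a hypothetical label, not taken from the original gist.
  select
    coalesce(x::text, '(missing)') as x,
    coalesce(y::text, '(missing)') as y
  from t1
)
-- Inspect the joint counts that the rest of the query would consume.
select x, y, count(*) as cxy
from t
group by 1, 2
order by 1, 2;
```

With this `t` in place, the `n`, `x`, `y`, `xy`, `ixy` and `hxy` CTEs can stay unchanged, since they only depend on `t` exposing columns `x` and `y`.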