cameres / compute_correlation_matrix.py
Last active November 22, 2022 14:19
Compute Pandas Correlation Matrix of a Spark Data Frame
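To make the target of the conversion concrete, here is a minimal sketch, without Spark, of the kind of object the description refers to: a square pandas DataFrame of pairwise correlations labelled by column name on both axes. The column names and values are hypothetical, and NumPy's `corrcoef` stands in for Spark's `Statistics.corr` purely for illustration; it is not the gist's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values, standing in for a small numeric Spark DataFrame.
columns = ["a", "b", "c"]
rows = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 8.0],
    [4.0, 8.0, 1.0],
])

# np.corrcoef treats rows as variables by default;
# rowvar=False makes each column a variable instead.
corr = np.corrcoef(rows, rowvar=False)

# Label both axes with the column names so the matrix is readable.
corr_df = pd.DataFrame(corr, columns=columns, index=columns)
print(corr_df.round(3))
```

A frame of this shape can be handed to `seaborn.heatmap(corr_df, annot=True)` to plot it.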
from pyspark.mllib.stat import Statistics
import pandas as pd
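# The result can be passed to seaborn's heatmap.
def compute_correlation_matrix(df, method='pearson'):
    # Wrapper around the approach described at
    # https://forums.databricks.com/questions/3092/how-to-calculate-correlation-matrix-with-all-colum.html
    # Convert each Row to a plain tuple of its values; all columns must be numeric.
    df_rdd = df.rdd.map(lambda row: row[0:])
    # Statistics.corr returns a NumPy matrix; method is 'pearson' or 'spearman'.
    corr_mat = Statistics.corr(df_rdd, method=method)
    corr_mat_df = pd.DataFrame(corr_mat,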