@knthls
Created October 17, 2017 12:00
remove correlated columns in pandas DataFrame
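import numpy as np
import pandas as pd

# rna_data and clinical_data are assumed to be pandas DataFrames loaded earlier
# remove low variance columns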
rna_data = rna_data.loc[:, (rna_data.var() > 0.5)]
# remove correlated columns
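variances = rna_data.var()  # renamed from `vars`, which shadows the Python builtin
tbd = []
# from each pair of highly correlated columns, remove the column with lower variance
# (only positive correlations above 0.5 are matched; negative ones are left alone)
for i, j in zip(*np.where(np.corrcoef(rna_data.values.T) > 0.5)):
    if i < j:  # visit each pair once and skip the diagonal
        if variances.iloc[i] < variances.iloc[j]:
            tbd.append(i)
        else:
            tbd.append(j)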
rna_data = rna_data.drop(rna_data.columns[tbd], axis=1)
# scale and center values
rna_data -= rna_data.mean()
rna_data /= rna_data.std()
# attach the tumor_site label; join='inner' keeps only samples present in both tables
data = pd.concat((clinical_data.loc[:, 'tumor_site'], rna_data), axis=1, join='inner')
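
The snippet above depends on `rna_data` and `clinical_data`, which are not defined in this gist. As a rough, self-contained sketch of the same idea, the filtering steps can be wrapped in a function and tried on synthetic data. The name `drop_correlated`, its parameters `var_min` and `corr_max`, and the toy columns `a` to `d` are all hypothetical and chosen only for illustration. It uses `np.triu_indices` to walk each column pair once and a boolean mask instead of an index list, which is one possible variation rather than the original approach.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, var_min=0.5, corr_max=0.5):
    """Hypothetical helper: drop low-variance columns, then drop the
    lower-variance column of every pair correlated above corr_max."""
    df = df.loc[:, df.var() > var_min]
    variances = df.var()
    # atleast_2d guards the single-column case, where corrcoef returns a scalar
    corr = np.atleast_2d(np.corrcoef(df.values.T))
    keep = np.ones(df.shape[1], dtype=bool)
    # triu_indices with k=1 yields each unordered pair (i < j) exactly once
    for i, j in zip(*np.triu_indices(corr.shape[0], k=1)):
        if corr[i, j] > corr_max:
            loser = i if variances.iloc[i] < variances.iloc[j] else j
            keep[loser] = False
    return df.loc[:, keep]


# synthetic stand-in for rna_data (assumption: rows are samples, columns are features)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": 2.0 * base,                               # high variance
    "b": base + rng.normal(scale=0.1, size=200),   # strongly correlated with "a"
    "c": rng.normal(scale=2.0, size=200),          # independent of the others
    "d": rng.normal(scale=0.1, size=200),          # low variance
})

reduced = drop_correlated(toy)
# scale and center, as in the snippet above
reduced = (reduced - reduced.mean()) / reduced.std()
print(list(reduced.columns))
```

Here `d` should fall to the variance filter and `b` should be dropped in favour of `a`, since the two are nearly collinear and `a` has the larger variance.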