Stuart Lynn (sllynn) · GitHub gists
sllynn / unzipper.scala · Created April 27, 2022 17:50
Unzip a lot of zip files from DBFS
import org.apache.spark.sql.functions._
import spark.implicits._
import sys.process._

// Build a (path, root) DataFrame for every zip archive in the upload
// directory, rewriting dbfs:/ URIs to their /dbfs/ FUSE-mount equivalents
val paths = dbutils.fs.ls("/FileStore/shared_uploads/[email protected]/shapefiles/").toDF
  .select("path", "name")
  .where(col("path").endsWith(".zip"))
  .withColumn("path", regexp_replace($"path", "dbfs:/", "/dbfs/"))
  .withColumn("root", regexp_replace($"path", $"name", lit("")))
  .drop("name")
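The preview stops before the extraction itself; given the sys.process import, a plausible continuation (a sketch, not part of the gist) shells out to unzip once per archive:

paths.as[(String, String)].collect.foreach { case (path, root) =>
  Seq("unzip", "-o", path, "-d", root).!!  // overwrite, extract next to the archive
}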
sllynn / module_reload.py · Created August 19, 2021 11:55
Reload a Python module
import sys

# Drop any cached copy so the next import re-executes the module;
# the default argument makes pop() a no-op if the module was never loaded
sys.modules.pop("tests.test_advanced", None)
from tests.test_advanced import AdvancedTestSuite
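The standard-library alternative, importlib.reload, re-executes the module while keeping the same module object; a minimal sketch:

import importlib
import tests.test_advanced

importlib.reload(tests.test_advanced)
from tests.test_advanced import AdvancedTestSuite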
sllynn / plot_graphviz.py · Created August 18, 2020 17:57
Display a graphviz Graph or Digraph in a notebook
def plot_graphviz(graph):
    """Render a graph object to PNG and show it inline via displayHTML."""
    from tempfile import NamedTemporaryFile
    from base64 import b64encode
    with NamedTemporaryFile(suffix=".png") as fh:
        graph.plot(to_file=fh.name)  # assumes the object exposes plot(to_file=...)
        img = b64encode(fh.read()).decode("UTF-8")
    displayHTML(f"<img src='data:image/png;base64,{img}'>")
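graphviz's own Graph and Digraph classes expose pipe() rather than plot(); for those, a variant can skip the temp file entirely (a sketch with a toy two-node graph, still relying on Databricks' displayHTML):

from base64 import b64encode
from graphviz import Digraph

g = Digraph()
g.edge("a", "b")
img = b64encode(g.pipe(format="png")).decode("UTF-8")
displayHTML(f"<img src='data:image/png;base64,{img}'>")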
sllynn / vectorToArray.scala · Created June 15, 2020 15:48
Convert VectorUDT to ArrayType (Scala UDF)
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// UDF that flattens an ML VectorUDT column into ArrayType(DoubleType)
val toArray = udf { v: Vector => v.toArray }
spark.sqlContext.udf.register("toArray", toArray)
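A usage sketch (df, its features column, and the predictions view are assumed for illustration); note that Spark 3.0+ also ships a built-in equivalent, org.apache.spark.ml.functions.vector_to_array:

// DataFrame API
val flattened = df.withColumn("features_arr", toArray($"features"))
// ...or in SQL, since the UDF is registered:
// spark.sql("SELECT toArray(features) FROM predictions")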
sllynn / from_xltime.py · Created June 15, 2020 15:18
Pandas UDF for converting Excel dates to Spark timestamps
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("timestamp", PandasUDFType.SCALAR)
def from_xltime(x):
    import pandas as pd
    import datetime as dt
    # Excel serial dates count days from the 1899-12-30 epoch
    return (pd.TimedeltaIndex(x, unit='d') + dt.datetime(1899, 12, 30)).to_series()
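A usage sketch, assuming a DataFrame df with a column xl_date of Excel serial numbers (both names are illustrative):

from pyspark.sql.functions import col

df = df.withColumn("event_ts", from_xltime(col("xl_date")))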
sllynn / pandas_multi.py · Created October 2, 2019 15:37
Example of selecting groups of columns in a pandas DataFrame using NumPy slicing tools
import numpy as np

# Columns 5 through the last, followed by column 1, in that order
data.iloc[:, np.r_[5:data.columns.size, 1]]
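np.r_ concatenates slice expressions into one integer index array, which iloc then applies column-wise. A toy illustration (the frame here is invented):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(24).reshape(3, 8), columns=list("abcdefgh"))
np.r_[5:data.columns.size, 1]                # array([5, 6, 7, 1])
data.iloc[:, np.r_[5:data.columns.size, 1]]  # columns f, g, h, then b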
sllynn / mlflow-pyfunc-wrapper.py · Created September 20, 2019 07:36
Custom mlflow pyfunc wrapper for Keras models
import mlflow.pyfunc
import mlflow.keras

class KerasWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, keras_model_name):
        self.keras_model_name = keras_model_name

    def load_context(self, context):
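        # (The gist preview truncates here. What follows is a sketch of a
        # plausible completion, assuming the Keras model was logged as an
        # artifact keyed by keras_model_name; it is not the author's body.)
        self.model = mlflow.keras.load_model(context.artifacts[self.keras_model_name])

    def predict(self, context, model_input):
        # Hypothetical scoring method for a pandas DataFrame input
        return self.model.predict(model_input.values)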
sllynn / custom-kinesis-writer.py · Last active September 20, 2019 07:42
Kinesis writer (includes some other logic relevant to multiclass classification of documents)
import boto3
import json
import numpy as np
import pandas as pd
from math import ceil

class KinesisWriter:
    def __init__(self, region, stream, classes):
        self.kinesis_client = None
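        # (Preview truncated; presumably the constructor also stores its
        # arguments, assumed here so the sketch below is self-contained.)
        self.region = region
        self.stream = stream
        self.classes = classes

    def write(self, records):
        # A sketch of a plausible write path, not the author's code: create
        # the boto3 client lazily, then send records in batches of 500, the
        # per-call limit of the Kinesis put_records API.
        if self.kinesis_client is None:
            self.kinesis_client = boto3.client("kinesis", region_name=self.region)
        for i in range(ceil(len(records) / 500)):
            batch = records[i * 500:(i + 1) * 500]
            self.kinesis_client.put_records(
                StreamName=self.stream,
                Records=[
                    {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(i)}
                    for r in batch
                ],
            )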
sllynn / parallel-notebooks.py · Created September 19, 2019 08:29
Run multiple notebooks in parallel as ephemeral jobs using Python threads
from threading import Thread

def producer_method():
    dbutils.notebook.run(
        path="./kinesis-producer",
        timeout_seconds=600,
        arguments={
            "kinesisRegion": KINESIS_REGION,
            "inputStream": INPUT_STREAM,
            "newsgroupDataLocation": NEWSGROUP_DATA_PATH,
        },
    )
sllynn / sparklyr-display.R · Created September 17, 2019 14:05
Equivalent to `display()` for sparklyr dataframes
library(dplyr)

# Collect a 1000-row sample to the driver and hand it to Databricks' display()
sdisplay <- function(x) {
  x %>% sample_n(1000) %>% collect() %>% display()
}
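A usage sketch, assuming a Databricks-attached sparklyr connection (the copied table is invented for illustration):

library(sparklyr)
sc <- spark_connect(method = "databricks")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
sdisplay(mtcars_tbl)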