Stuart Lynn (sllynn) · GitHub gists
sllynn / unzipper.scala · Created April 27, 2022 17:50
Unzip a lot of zip files from DBFS
import org.apache.spark.sql.functions._
import spark.implicits._
import sys.process._

// Build a (path, root) DataFrame for every zip archive in the upload
// directory, rewriting dbfs:/ URIs to their /dbfs/ FUSE-mount equivalents
val paths = dbutils.fs.ls("/FileStore/shared_uploads/[email protected]/shapefiles/").toDF
  .select("path", "name")
  .where(col("path").endsWith(".zip"))
  .withColumn("path", regexp_replace($"path", "dbfs:/", "/dbfs/"))
  .withColumn("root", regexp_replace($"path", $"name", lit("")))
  .drop("name")
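The preview stops before the extraction itself; given the sys.process import, a plausible continuation (a sketch, not part of the gist) shells out to unzip once per archive:

paths.as[(String, String)].collect.foreach { case (path, root) =>
  Seq("unzip", "-o", path, "-d", root).!!  // overwrite, extract next to the archive
}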
sllynn / module_reload.py · Created August 19, 2021 11:55
Reload a Python module
import sys

# Drop any cached copy so the next import re-executes the module;
# the default argument makes pop() a no-op if the module was never loaded
sys.modules.pop("tests.test_advanced", None)
from tests.test_advanced import AdvancedTestSuite
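The standard-library alternative, importlib.reload, re-executes the module while keeping the same module object; a minimal sketch:

import importlib
import tests.test_advanced

importlib.reload(tests.test_advanced)
from tests.test_advanced import AdvancedTestSuite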
sllynn / plot_graphviz.py · Created August 18, 2020 17:57
Display a graphviz Graph or Digraph in a notebook
def plot_graphviz(graph):
    """Render a graph object to PNG and show it inline via displayHTML."""
    from tempfile import NamedTemporaryFile
    from base64 import b64encode
    with NamedTemporaryFile(suffix=".png") as fh:
        graph.plot(to_file=fh.name)  # assumes the object exposes plot(to_file=...)
        img = b64encode(fh.read()).decode("UTF-8")
    displayHTML(f"<img src='data:image/png;base64,{img}'>")
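graphviz's own Graph and Digraph classes expose pipe() rather than plot(); for those, a variant can skip the temp file entirely (a sketch with a toy two-node graph, still relying on Databricks' displayHTML):

from base64 import b64encode
from graphviz import Digraph

g = Digraph()
g.edge("a", "b")
img = b64encode(g.pipe(format="png")).decode("UTF-8")
displayHTML(f"<img src='data:image/png;base64,{img}'>")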
sllynn / vectorToArray.scala · Created June 15, 2020 15:48
Convert VectorUDT to ArrayType (Scala UDF)
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// UDF that flattens an ML VectorUDT column into ArrayType(DoubleType)
val toArray = udf { v: Vector => v.toArray }
spark.sqlContext.udf.register("toArray", toArray)
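A usage sketch (df, its features column, and the predictions view are assumed for illustration); note that Spark 3.0+ also ships a built-in equivalent, org.apache.spark.ml.functions.vector_to_array:

// DataFrame API
val flattened = df.withColumn("features_arr", toArray($"features"))
// ...or in SQL, since the UDF is registered:
// spark.sql("SELECT toArray(features) FROM predictions")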
sllynn / from_xltime.py · Created June 15, 2020 15:18
Pandas UDF for converting Excel dates to Spark timestamps
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("timestamp", PandasUDFType.SCALAR)
def from_xltime(x):
    import pandas as pd
    import datetime as dt
    # Excel serial dates count days from the 1899-12-30 epoch
    return (pd.TimedeltaIndex(x, unit='d') + dt.datetime(1899, 12, 30)).to_series()
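A usage sketch, assuming a DataFrame df with a column xl_date of Excel serial numbers (both names are illustrative):

from pyspark.sql.functions import col

df = df.withColumn("event_ts", from_xltime(col("xl_date")))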
sllynn / pandas_multi.py · Created October 2, 2019 15:37
Example of selecting groups of columns in a pandas DataFrame using NumPy slicing tools
import numpy as np

# Columns 5 through the last, followed by column 1, in that order
data.iloc[:, np.r_[5:data.columns.size, 1]]
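np.r_ concatenates slice expressions into one integer index array, which iloc then applies column-wise. A toy illustration (the frame here is invented):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(24).reshape(3, 8), columns=list("abcdefgh"))
np.r_[5:data.columns.size, 1]                # array([5, 6, 7, 1])
data.iloc[:, np.r_[5:data.columns.size, 1]]  # columns f, g, h, then b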
sllynn / mlflow-pyfunc-wrapper.py · Created September 20, 2019 07:36
Custom mlflow pyfunc wrapper for Keras models
import mlflow.pyfunc
import mlflow.keras

class KerasWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, keras_model_name):
        self.keras_model_name = keras_model_name

    def load_context(self, context):
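        # (The gist preview truncates here. What follows is a sketch of a
        # plausible completion, assuming the Keras model was logged as an
        # artifact keyed by keras_model_name; it is not the author's body.)
        self.model = mlflow.keras.load_model(context.artifacts[self.keras_model_name])

    def predict(self, context, model_input):
        # Hypothetical scoring method for a pandas DataFrame input
        return self.model.predict(model_input.values)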
sllynn / custom-kinesis-writer.py · Last active September 20, 2019 07:42
Kinesis writer (includes some other logic relevant to multiclass classification of documents)
import boto3
import json
import numpy as np
import pandas as pd
from math import ceil

class KinesisWriter:
    def __init__(self, region, stream, classes):
        self.kinesis_client = None
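        # (Preview truncated; presumably the constructor also stores its
        # arguments, assumed here so the sketch below is self-contained.)
        self.region = region
        self.stream = stream
        self.classes = classes

    def write(self, records):
        # A sketch of a plausible write path, not the author's code: create
        # the boto3 client lazily, then send records in batches of 500, the
        # per-call limit of the Kinesis put_records API.
        if self.kinesis_client is None:
            self.kinesis_client = boto3.client("kinesis", region_name=self.region)
        for i in range(ceil(len(records) / 500)):
            batch = records[i * 500:(i + 1) * 500]
            self.kinesis_client.put_records(
                StreamName=self.stream,
                Records=[
                    {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(i)}
                    for r in batch
                ],
            )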
sllynn / parallel-notebooks.py · Created September 19, 2019 08:29
Run multiple notebooks in parallel as ephemeral jobs using Python threads
from threading import Thread

def producer_method():
    dbutils.notebook.run(
        path="./kinesis-producer",
        timeout_seconds=600,
        arguments={
            "kinesisRegion": KINESIS_REGION,
            "inputStream": INPUT_STREAM,
            "newsgroupDataLocation": NEWSGROUP_DATA_PATH,
        },
    )
sllynn / sparklyr-display.R · Created September 17, 2019 14:05
Equivalent to `display()` for sparklyr dataframes
library(dplyr)

# Collect a 1000-row sample to the driver and hand it to Databricks' display()
sdisplay <- function(x) {
  x %>% sample_n(1000) %>% collect() %>% display()
}
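A usage sketch, assuming a Databricks-attached sparklyr connection (the copied table is invented for illustration):

library(sparklyr)
sc <- spark_connect(method = "databricks")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
sdisplay(mtcars_tbl)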