- What use cases are a good fit for Apache Spark? How to work with Spark?
- create RDDs, transform them, and execute actions to get the result of a computation
- All computations in memory = "memory is cheap" (we do need enough memory to fit all the data in)
- the fewer disk operations, the faster (you do know that, don't you?)
- You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
- Data usually lives on a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
- Data mining = analysis / insights / analytics
  - log mining
  - Twitter stream processing
- What's an RDD?
  - RDD = resilient distributed dataset
  - different types, e.g. key-value RDDs
  - RDD API: `collect`, `take`, `count`, `reduce`, `reduceByKey`, `groupByKey`, `sortByKey`, `saveAs...`
- What's `SparkContext`?
  - main entry point to Spark
  - `sc` under the Spark shell
  - create your own in standalone apps (imports + `SparkContext`)
  - master: `local` (1 thread) or `local[n]` where `n` is the number of threads to use (use `*` for the max)
  - on a cluster: the cluster URL `spark://host:port`
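A minimal sketch of creating your own `SparkContext` in a standalone Scala app (the app name and master URL are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // local[*] = use as many threads as cores; swap in spark://host:port on a cluster
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    // ...create RDDs, transform them, run actions...
    sc.stop()
  }
}
```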
- How to create RDDs?
  - `sc.textFile("hdfs://...")`
  - `sc.parallelize`
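For instance (the HDFS path below is just a placeholder):

```scala
// From a file: one RDD element per line
val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")

// From a local collection, mostly for experiments and tests
val fromSeq = sc.parallelize(Seq(1, 2, 3, 4, 5))
```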
- What are transformations?
  - they're lazy = nothing happens at the time of calling transformations
  - `filter`, `map`
  - knowing how to write applications using functional programming concepts helps a lot
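A quick sketch of that laziness (assuming `sc` from the shell or the snippet above):

```scala
val nums   = sc.parallelize(1 to 10)
val evens  = nums.filter(_ % 2 == 0) // lazy: only records the operation
val scaled = evens.map(_ * 10)       // lazy: still nothing computed
// Nothing has run so far; an action (see below) triggers the whole pipeline
```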
- What are actions?
  - `count`
  - `first`
- `cache` method
- Stages of execution
  - laziness and computation graph optimization
- `unpersist`
- Eviction (LRU by default)
- Cache API
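A small sketch tying actions and caching together (the log path is a placeholder):

```scala
val lines  = sc.textFile("/tmp/app.log")
val errors = lines.filter(_.contains("ERROR"))

errors.cache()          // keep the result in memory once computed
println(errors.count()) // action: runs the pipeline, fills the cache
println(errors.first()) // action: answered from the cache
errors.unpersist()      // drop it explicitly; otherwise LRU eviction applies under memory pressure
```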
- Spark Cluster
  - a Driver
  - Workers
  - Tasks
  - Executors
  - Tasks run locally or on the cluster
  - Clusters using Mesos, YARN or standalone mode
  - Local mode uses local threads
- Task scheduling
  - task graph
  - aware of data locality & partitions to avoid shuffles (see the sketch below)
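A sketch of controlling partitioning so that later keyed operations can avoid a shuffle (the data is made up):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Co-locate all values of a key in one partition up front;
// subsequent keyed ops on `byKey` can reuse this layout instead of reshuffling
val byKey = pairs.partitionBy(new HashPartitioner(4)).cache()
println(byKey.partitions.length) // 4
```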
- Supported storage systems
  - using the Hadoop InputFormat API
  - HBase
  - HDFS
  - S3
  - to save the contents of an RDD use `rdd.saveAs...`
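`saveAsTextFile` is one concrete variant (the output path is a placeholder):

```scala
val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
counts.saveAsTextFile("/tmp/counts") // writes one part-NNNNN file per partition
```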
- Data locality awareness
- Fault recovery
  - lineage (to recompute lost data)
  - partitions
- Advanced features
  - controllable partitioning
  - controllable storage formats
  - shared variables: broadcasts & accumulators
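A sketch of both kinds of shared variables (names and data are made up; `longAccumulator` is the Spark 2.x API):

```scala
// Broadcast: read-only lookup data shipped once to each executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: a counter that tasks add to and the driver reads back
val misses = sc.longAccumulator("misses")

val keys   = sc.parallelize(Seq("a", "b", "x"))
val values = keys.map { k =>
  lookup.value.getOrElse(k, { misses.add(1); -1 })
}
println(values.collect().toSeq) // the action triggers the accumulator updates
println(misses.value)           // 1
```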
- Writing Spark applications
  - define a dependency on Spark
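With sbt, for example (the artifact coordinates are real, but the version numbers are assumptions; match them to your cluster):

```scala
// build.sbt
name := "spark-notes"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"
```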
- Running Spark applications
  - standalone applications
  - interactive shell
    - the fastest approach to master Spark
    - How to run it: `./bin/spark-shell` (FIXME what are the available options)
    - Tab completion
- Exercises (use Docker whenever possible)
  - Getting started using a simple line count example (see the sketch after this list)
  - Prepare an environment to measure the computation time of different pipelines
  - Read data from Hadoop HDFS, Twitter, Cassandra, Kafka, Redis
  - Setting up a Spark cluster
  - Start spark-shell against the cluster and see how the UI shows it under Running Applications
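As a starting point for the line count exercise, inside `spark-shell` where `sc` is predefined (the file path is a placeholder):

```scala
val lines = sc.textFile("README.md")
println(s"line count: ${lines.count()}")
```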
- Administration
  - UI: standalone master at `:8080`
- Level of Parallelism
  - All the pair RDD ops take an optional second parameter for the number of tasks
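For example (hypothetical data):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// The second argument sets the number of reduce tasks (and output partitions)
val counts = pairs.reduceByKey(_ + _, 8)
println(counts.partitions.length) // 8
```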
- Shuffle service