- What use cases are a good fit for Apache Spark? How to work with Spark?
- create RDDs, transform them, and execute actions to get the result of a computation
- All computations in memory = "memory is cheap" (we do need enough memory to fit all the data in)
- the fewer disk operations, the faster (you do know that, don't you?)
- You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
- Data usually lives on a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
- Data mining = analysis / insights / analytics
  - log mining
  - Twitter stream processing
- What's an RDD?
  - RDD = resilient distributed dataset
  - different types, e.g. key-value RDDs
  - RDD API: `collect`, `take`, `count`, `reduce`, `reduceByKey`, `groupByKey`, `sortByKey`, `saveAs...`
- What's `SparkContext`?
  - main entry point to Spark
  - `sc` under the Spark shell
  - create your own in standalone apps (imports + `SparkContext`)
  - master: `local` (1 thread) or `local[n]` where `n` is the number of threads to use (use `*` for the max)
  - on a cluster: the cluster URL `spark://host:port`
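A minimal sketch of creating your own `SparkContext` in a standalone Scala app (the app name and master URL are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // local[*] = use as many threads as cores; swap in spark://host:port on a cluster
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    // ...create RDDs, transform them, run actions...
    sc.stop()
  }
}
```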
- How to create RDDs?
  - `sc.textFile("hdfs://...")`
  - `sc.parallelize`
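For instance (the HDFS path below is just a placeholder):

```scala
// From a file: one RDD element per line
val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")

// From a local collection, mostly for experiments and tests
val fromSeq = sc.parallelize(Seq(1, 2, 3, 4, 5))
```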
- What are transformations?
  - they're lazy = nothing happens at the time of calling transformations
  - `filter`, `map`
  - knowing how to write applications using functional programming concepts helps a lot
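A quick sketch of that laziness (assuming `sc` from the shell or the snippet above):

```scala
val nums   = sc.parallelize(1 to 10)
val evens  = nums.filter(_ % 2 == 0) // lazy: only records the operation
val scaled = evens.map(_ * 10)       // lazy: still nothing computed
// Nothing has run so far; an action (see below) triggers the whole pipeline
```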
- What are actions?
  - `count`
  - `first`
- `cache` method
- Stages of execution
  - laziness and computation graph optimization
- `unpersist`
- Eviction (LRU by default)
- Cache API
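A small sketch tying actions and caching together (the log path is a placeholder):

```scala
val lines  = sc.textFile("/tmp/app.log")
val errors = lines.filter(_.contains("ERROR"))

errors.cache()          // keep the result in memory once computed
println(errors.count()) // action: runs the pipeline, fills the cache
println(errors.first()) // action: answered from the cache
errors.unpersist()      // drop it explicitly; otherwise LRU eviction applies under memory pressure
```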
- Spark Cluster
  - a Driver
  - Workers
  - Tasks
  - Executors
  - Tasks run locally or on the cluster
  - Clusters using Mesos, YARN or standalone mode
  - Local mode uses local threads
- Task scheduling
  - task graph
  - aware of data locality & partitions to avoid shuffles (see the sketch below)
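A sketch of controlling partitioning so that later keyed operations can avoid a shuffle (the data is made up):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Co-locate all values of a key in one partition up front;
// subsequent keyed ops on `byKey` can reuse this layout instead of reshuffling
val byKey = pairs.partitionBy(new HashPartitioner(4)).cache()
println(byKey.partitions.length) // 4
```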
- Supported storage systems
  - using the Hadoop InputFormat API
  - HBase
  - HDFS
  - S3
  - to save the contents of an RDD use `rdd.saveAs...`
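`saveAsTextFile` is one concrete variant (the output path is a placeholder):

```scala
val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
counts.saveAsTextFile("/tmp/counts") // writes one part-NNNNN file per partition
```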
- Data locality awareness
- Fault recovery
  - lineage (to recompute lost data)
  - partitions
- Advanced features
  - controllable partitioning
  - controllable storage formats
  - shared variables: broadcasts & accumulators
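A sketch of both kinds of shared variables (names and data are made up; `longAccumulator` is the Spark 2.x API):

```scala
// Broadcast: read-only lookup data shipped once to each executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: a counter that tasks add to and the driver reads back
val misses = sc.longAccumulator("misses")

val keys   = sc.parallelize(Seq("a", "b", "x"))
val values = keys.map { k =>
  lookup.value.getOrElse(k, { misses.add(1); -1 })
}
println(values.collect().toSeq) // the action triggers the accumulator updates
println(misses.value)           // 1
```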
- Writing Spark applications
  - define a dependency on Spark
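With sbt, for example (the artifact coordinates are real, but the version numbers are assumptions; match them to your cluster):

```scala
// build.sbt
name := "spark-notes"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"
```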
- Running Spark applications
  - standalone applications
  - interactive shell
    - the fastest approach to master Spark
    - How to run it: `./bin/spark-shell` (FIXME what are the available options)
    - Tab completion
- Exercises (use Docker whenever possible)
  - Getting started using a simple line count example (see the sketch after this list)
  - Prepare an environment to measure the computation time of different pipelines
  - Read data from Hadoop HDFS, Twitter, Cassandra, Kafka, Redis
  - Setting up a Spark cluster
  - Start spark-shell against the cluster and see how the UI shows it under Running Applications
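As a starting point for the line count exercise, inside `spark-shell` where `sc` is predefined (the file path is a placeholder):

```scala
val lines = sc.textFile("README.md")
println(s"line count: ${lines.count()}")
```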
- Administration
  - UI: standalone master at `:8080`
- Level of Parallelism
  - All the pair RDD ops take an optional second parameter for the number of tasks
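For example (hypothetical data):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// The second argument sets the number of reduce tasks (and output partitions)
val counts = pairs.reduceByKey(_ + _, 8)
println(counts.partitions.length) // 8
```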
- Shuffle service