Big Data Engineering - PySpark
Spark
What is Apache Spark
● Ability to efficiently execute streaming, machine-learning or SQL workloads which require fast, iterative access to data sets
● Can run on top of Apache Hadoop YARN, Mesos & Kubernetes
● Includes MLlib
● Spark processes data in memory, while MR persists back to disk after a MapReduce job, so Spark should outperform MR.
● Nonetheless, Spark needs a lot of memory; MR kills its job as soon as it's done.
● If the data is too big to fit in memory, then there will be major performance degradation for Spark.
● The RDD is the primary abstraction in Spark and is the core of Apache Spark.
● One could compare RDDs to collections in programming: an RDD is computed on many JVMs, while a collection lives on a single JVM.
● An RDD is an immutable and partitioned collection of records.
● RDDs can only be created by reading data from stable storage like HDFS or by transformations on existing RDDs.
Features of RDD
From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other
Hadoop InputFormat, and directory or glob wildcard: /data/201404*
In other words, an RDD operation that returns a value of any type except RDD[T]
is an action
Only actions can materialize the entire processing pipeline with real data
Actions are one of two ways to send data from executors to the driver (the other
being accumulators)
Actions
Getting Data Out of RDDs
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a * b)
>>> rdd.take(2)
>>> rdd.collect()
>>> rdd = sc.parallelize([5, 3, 1, 2])
>>> rdd.takeOrdered(3, lambda s: -1 * s)
lines = sc.textFile("...", 4)
print(lines.count())
count() causes Spark to: read the data, sum within partitions, and combine the sums in the driver
saveAsTextFile(path) - Save this RDD as a text file, using string representations of elements.
RDD
Transformations
Transformations are lazy operations on an RDD that create one
or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup,
randomSplit.
They are functions that take an RDD as input and produce one
or many RDDs as output. They do not change the input
RDD (since RDDs are immutable), but always produce one or more new RDDs.
>>> rdd2 = sc.parallelize([1, 4, 2, 2, 3])
>>> rdd2.distinct()
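A minimal sketch of the laziness this implies (using the rdd2 above): distinct() alone only builds a new RDD; the action at the end is what triggers a job.

>>> deduped = rdd2.distinct()     # transformation: no job runs yet
>>> sorted(deduped.collect())     # action: triggers the job and returns the results
[1, 2, 3, 4]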
SparkSession &
SparkContext
SparkContext
● It is the entry point to Spark Core
– the heart of the Spark application
● Spark context sets up internal
services and establishes a
connection to a Spark execution
environment
● Once a SparkContext is created
you can use it to create RDDs,
accumulators and broadcast
variables, access Spark services and
run jobs (until SparkContext is
stopped)
● A Spark context is essentially a
client of Spark’s execution
environment and acts as the master
of your Spark application (don’t get
confused with the other meaning of
Master in Spark, though)
SparkContext & SparkSession
Prior to Spark 2.0.0, a SparkContext was constructed from a configuration
object, created like
val conf = new SparkConf()
● SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming
Spark with Dataframe and Dataset APIs. All the functionality available with sparkContext are also available in
sparkSession.
● In order to use the SQL, Hive and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all
of these APIs.
● Once the SparkSession is instantiated, we can configure Spark’s run-time config properties
●You can create a SparkContext instance with or
without creating a SparkConf object first
>>> spark = SparkSession.builder \
...     .master("local") \
...     .getOrCreate()
OR
>>> c = SparkConf()
>>> sc = SparkContext(conf=c)
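A minimal PySpark sketch of what you can do once the session exists (the application name and config value are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("example-app").getOrCreate()

# Run-time config properties can be set on the instantiated session
spark.conf.set("spark.sql.shuffle.partitions", "8")

# The underlying SparkContext is still available for RDD work
sc = spark.sparkContext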
From here
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
Spark Caching
Caching
● It is one mechanism to speed up applications that access the same RDD multiple times.
● An RDD that is not cached, nor checkpointed, is re-evaluated again each time an action
is invoked on that RDD.
● There are two function calls for caching an RDD: cache() and persist(level:
StorageLevel).
● The difference between them is that cache() will cache the RDD into memory, whereas
persist(level) can cache in memory, on disk, or off-heap memory according to the
caching strategy specified by level.
● persist() without an argument is equivalent to cache(). Freeing up space from the
Storage memory is performed by unpersist().
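A short sketch of cache()/persist()/unpersist() in use (the input path is illustrative):

from pyspark import StorageLevel

rdd = sc.textFile("/data/events").map(lambda line: line.split(","))

rdd.cache()                                  # for RDDs, same as persist(StorageLevel.MEMORY_ONLY)
# rdd.persist(StorageLevel.MEMORY_AND_DISK)  # alternative: spill partitions that don't fit in memory to disk

rdd.count()       # first action computes and caches the partitions
rdd.count()       # second action reads from the cache instead of recomputing

rdd.unpersist()   # free the cached blocks from storage memory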
When to use caching
Don’t spill to disk unless the functions that computed your datasets are expensive, or
they filter a large amount of the data. Otherwise, recomputing a partition may be as
fast as reading it from disk.
Spark Pair RDD
Spark Key-Value RDDs
>>> rdd = sc.parallelize([(1, 2), (3, 4)])
Spark Key-Value RDDs
●partitions() - Gets the array of partitions of this RDD, taking into account whether
the RDD is checkpointed or not
●getNumPartitions() - Returns the number of partitions of this RDD.
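An illustrative key-value RDD sketch (the data and partition count are made up):

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 3)], 2)

rdd.reduceByKey(lambda x, y: x + y).collect()   # [('a', 4), ('b', 1)]  (order may vary)
rdd.mapValues(lambda v: v * 10).collect()       # [('a', 10), ('b', 10), ('a', 30)]
rdd.groupByKey().mapValues(list).collect()      # [('a', [1, 3]), ('b', [1])]  (order may vary)

rdd.getNumPartitions()                          # 2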
COALESCE AND REPARTITION
●The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly. Let's
create a DataFrame with the numbers from 1 to 12.
●coalesce uses existing partitions to minimize the amount of data that's shuffled. repartition creates new
partitions and does a full shuffle. coalesce results in partitions with different amounts of data (sometimes partitions
that have much different sizes) and repartition results in roughly equal sized partitions.
●Is coalesce or repartition faster?
●coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than
equal sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found
repartition to be faster overall because Spark is built to work with equal sized partitions.
●repartition - it is recommended when increasing the number of partitions, because it involves shuffling all of
the data.
●coalesce - it is recommended when reducing the number of partitions. For example, if you have 3
partitions and you want to reduce it to 2, coalesce will move the 3rd partition's data to partitions 1 and 2. Partitions 1
and 2 will remain in the same container. On the other hand, repartition will shuffle data across all the partitions,
therefore the network usage between the executors will be high and it will impact performance.
●coalesce performs better than repartition while reducing the number of partitions.
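A small sketch of the two calls, using the 1-to-12 example mentioned above (the initial partition count depends on your local setup):

df = spark.range(1, 13)          # DataFrame with the numbers 1 to 12

df.rdd.getNumPartitions()        # e.g. 4

df2 = df.repartition(6)          # full shuffle, 6 roughly equal partitions
df3 = df.coalesce(2)             # merges existing partitions, no full shuffle

df2.rdd.getNumPartitions()       # 6
df3.rdd.getNumPartitions()       # 2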
Spark Core Concepts, Internals & Architecture
Spark Core Concepts
● Any data processing workflow can be defined as reading a data source,
applying a set of transformations, and materializing the result in different ways.
● Transformations create dependencies between RDDs.
● The dependencies are usually classified as "narrow" and "wide".
Let Us Analyze a DAG
• Narrow (pipelineable)
• each partition of the parent RDD is used by at most
one partition of the child RDD
• Wide (shuffle)
• multiple child partitions may depend on one parent
partition
• Spark stages are created by breaking the RDD graph at shuffle boundaries
LIST OF NARROW VS WIDE TRANSFORMS
Transformations with (usually) Narrow dependencies:
•map
•mapValues
•flatMap
•filter
•mapPartitions
•mapPartitionsWithIndex
Transformations with (usually) Wide dependencies: (might cause a
shuffle)
•cogroup
•groupWith
•join
•leftOuterJoin
•rightOuterJoin
•groupByKey
•reduceByKey
•combineByKey
•distinct
•intersection
•repartition
•coalesce
Splitting DAG Into Stages
• RDD operations with "narrow" dependencies, like map() and filter(), are pipelined
together into one set of tasks in each stage; operations with shuffle dependencies
require multiple stages (one to write a set of map output files, and another to read
those files after a barrier).
• In the end, every stage will have only shuffle dependencies on other stages, and may
compute multiple operations inside it. The actual pipelining of these operations
happens in the RDD.compute() functions of various RDDs
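A rough sketch of where a stage boundary appears; toDebugString() prints the lineage, and the indentation change marks the shuffle (in PySpark it returns UTF-8 bytes, hence the decode):

rdd = sc.parallelize(range(100), 4)
narrow = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] > 5)   # narrow deps: pipelined into one stage
wide = narrow.reduceByKey(lambda a, b: a + b)                          # wide dep: shuffle boundary, new stage

print(wide.toDebugString().decode("utf-8"))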
Spark Components
• Spark driver
• separate process to execute user applications
• Executors
• Run tasks scheduled by driver
• store computation results in memory, on disk or off-heap
• interact with storage systems
• Cluster Manager
• Mesos
• YARN
• Spark Standalone
Spark Components
• DAGScheduler
• computes a DAG of stages for each job and submits them to TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle files
locations) and finds minimum schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures,
and mitigating stragglers (slowness)
Spark Components
• BlockManager
• provides interfaces for putting and retrieving blocks both local and external
Memory Management
• Execution Memory
• storage for data needed during tasks execution
• shuffle-related data
• Storage Memory
• storage of cached RDDs and broadcast variables
• safeguard value is 50% of Spark Memory when cached blocks are immune
to eviction
• User Memory
• user data structures and internal metadata in Spark
• Reserved Memory
• memory needed for running executor itself and not strictly related to
Spark
Spark Dataframes
Dataframe
DataFrames introduced in Spark 1.3 as extension to RDDs
Distributed collection of data organized into named columns
» Equivalent to Pandas and R DataFrame, but distributed
Types of columns inferred from values
Easy to convert between Pandas and pySpark
» Note: pandas DataFrame must fit in driver
# Convert a Spark DataFrame to pandas
pandas_df = spark_df.toPandas()
# Create a Spark DataFrame from pandas
spark_df = spark.createDataFrame(pandas_df)
createDataFrame(another_rdd, schemaofstructtype)
STRUCTTYPE
●StructType and StructField belong to the org.apache.spark.sql.types package:
import org.apache.spark.sql.types.StructType
schemaUntyped = new StructType()
  .add("a", "int")
  .add("b", "string")
●reader = spark.read.parquet("/path/to/file.parquet")
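For PySpark, a rough equivalent sketch (the column names and file path are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
])

df = spark.createDataFrame([(1, "x"), (2, "y")], schema)
reader = spark.read.schema(schema).json("/path/to/files")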
QUERY and Processing DATAFRAME
Commonly used transformations are as below:
●select
●filter
●groupBy
●union
●explode
●where clause
●withColumn
●partitionBy
●orderBy
●rank etc
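A hedged sketch of a few of these on a toy DataFrame (the column names and data are made up):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("alice", "IN", 34), ("bob", "US", 29), ("carol", "IN", 41)],
    ["name", "country", "age"])

df.select("name", "age").show()
df.filter(F.col("age") > 30).show()
df.groupBy("country").agg(F.count("*").alias("cnt"), F.avg("age").alias("avg_age")).show()
df.withColumn("age_plus_one", F.col("age") + 1).orderBy(F.col("age").desc()).show()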
Writing Dataframes
●We can write the dataframes back to disk as files in any of the
supported formats using the syntax below
● dataframe.write.<format>("/location")
● df.write.mode(SaveMode.Overwrite).format("<format>").save("/target/path")
Spark SQL
Why Spark SQL?
• Spark SQL has its roots in Shark, an effort to run Apache Hive on top of Spark, and is now
integrated with the Spark stack. Spark SQL was built to overcome the
drawbacks of Hive listed below and to replace Apache Hive.
• Limitations with Hive:
• Hive launches MapReduce jobs internally for executing the ad-hoc queries. MapReduce
lags in the performance when it comes to the analysis of medium sized datasets (10 to
200 GB).
• Hive has no resume capability. This means that if the processing dies in the middle of a
workflow, you cannot resume from where it got stuck.
• Hive cannot drop encrypted databases in cascade when trash is enabled and leads to
an execution error. To overcome this, users have to use Purge option to skip trash
instead of drop.
Architecture of Spark SQL
SQL - A language for Relational DBs
SQL = Structured Query Language
Supported by pySpark DataFrames (SparkSQL)
Some of the functionality SQL provides:
» Create, modify, delete relations
» Add, modify, remove tuples
» Specify queries to find tuples matching criteria
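A small sketch of running SQL over a DataFrame via a temporary view (the file path, table and column names are illustrative):

df = spark.read.json("/data/people.json")
df.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()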
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Project Tungsten
Optimization Features
https://community.cloudera.com/t5/Community-Articles/What-is-Tungsten-for-Apache-Spark/ta-p/248445
RDD vs
Dataframe vs
Dataset
RDD vs DataFrame vs Dataset
When to use RDD:
● you want low-level transformation and actions and control on your dataset;
● your data is unstructured, such as media streams or streams of text;
● you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
● you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name
or column; and
● you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and
semi-structured data.
DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns,
like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to
impose a structure onto a distributed collection of data, allowing higher-level abstraction
Datasets
Starting in Spark 2.0, Dataset takes on two distinct APIs characteristics: a strongly-typed API and an untyped API. Conceptually,
consider DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object.
Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
Parquet vs ORC
vs Avro
ORC FORMAT
●ORC stands for Optimized Row Columnar, which means it can store data in a
more optimized way than the other file formats. ORC reduces the size of the
original data by up to 75% (e.g. a 100 GB file becomes 25 GB). As a result, the
speed of data processing also increases. ORC shows better performance than
Text, Sequence and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer.
The ORC format improves performance when Hive is processing the data.
●An ORC file contains groups of row data called stripes, along with auxiliary
information in a file footer. At the end of the file a postscript holds
compression parameters and the size of the compressed footer.
●The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads
from HDFS.
●The file footer contains a list of stripes in the file, the number of rows per stripe,
and each column's data type. It also contains column-level aggregates count,
min, max, and sum.
●This diagram illustrates the ORC file structure:
PARQUET FORMAT
●Parquet is an open source file format available to any project in
the Hadoop ecosystem. Apache Parquet is designed for efficient as
well as performant flat columnar storage format of data compared
to row based files like CSV or TSV files.
●Columnar storage like Apache Parquet is designed to bring
efficiency compared to row-based files like CSV. When querying columnar
storage, you can skip over the non-relevant data very
quickly. As a result, aggregation queries are less time consuming
compared to row-oriented databases. This way of storage has
translated into hardware savings and minimized latency for
accessing data.
●Compression ratio is 60-70%.
AVRO FORMAT
●.avro
●Compression ratio 50-55%
●Schema is stored separately in avsc (avro schema) file
●Schema evolution advantage – even if one column's data doesn't
appear, Hive will automatically default it to some value.
●The default value is specified in the avsc file.
●Name , address, phone
●1,blr,99168
●2,Noida, 9999
●3,nocityfound,9777
(Diagram: raw row data compared with the ORC columnar layout of the row values)
• In this session on Spark Streaming, we stream text data received over a TCP
socket connection. Besides sockets, the StreamingContext API provides
methods for creating streams from files and Akka actors as input sources.
• File Streams: for reading data from files on any file system compatible
with the HDFS API (that is, HDFS, S3, NFS, etc.)
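A minimal DStream sketch of the socket case (host, port and batch interval are illustrative):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)          # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)      # text stream over a TCP socket
lines.pprint()

ssc.start()
ssc.awaitTermination()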
Receiver Reliability
• There are two kinds of data sources based on their reliability.
• Sources like Kafka and Flume allow the transferred data to be
acknowledged.
• If the system receiving data from these reliable sources acknowledges the
received data correctly, it can be ensured that no data will be lost due to any
kind of failure. This leads to two kinds of receiver:
• Reliable Receiver – Correctly sends acknowledgment to a reliable source when the data
has been received and stored in Spark with Replication
• Unreliable Receiver – Doesn’t send acknowledgement to a source. This can be used for
sources that do not support acknowledgement, or even for reliable sources when one
does not want or need to go into the complexity of acknowledgement.
Performance Tuning
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "fastcars")
.option("checkpointLocation", "/tmp/sparkcheckpoint/")
.queryName("kafka spark streaming kafka")
.outputMode("update")
.trigger(Trigger.Continuous("10 seconds"))  // 10 seconds is the checkpoint interval
.start()
MODEL DETAILS
●Conceptually, Structured Streaming treats all the data arriving as
an unbounded input table. Each new item in the stream is like a
row appended to the input table. We won’t actually retain all the
input, but our results will be equivalent to having all of it and
running a batch job.
READ , PROCESS, WRITE
EVENT TIME & PROCESSING TIME
EventTime is the time at which an event is generated at its source, whereas a ProcessingTime is the time at which that
event is processed by the system. There is also one more time which some stream processing systems account for, that
is IngestionTime - the time at which event/message was ingested into the System. It is important to understand the
difference between EventTime and ProcessingTime.
The red dot in the above image is the message, which originates from the vehicle, then flows through the Kafka topic to
Spark’s Kafka source and then reaches executor during task execution. There could be a slight delay (or maybe a long
delay if there is any network connectivity issue) between these points. The time at the source is what is called an
EventTime, the time at the executor is what is called the ProcessingTime. You can think of the ingestion time as the
time at when it was first read into the system at the Kafka source (IngestionTime is not relevant for spark).
TUMBLING WINDOW & SLIDING WINDOW
A tumbling window is a non-overlapping window that tumbles over every "window-size". E.g., for a tumbling window of size 4
seconds, there could be windows for [00:00 to 00:04), [00:04 to 00:08), [00:08 to 00:12) etc. (ignoring day, hour etc. here). If an
incoming event has EventTime 00:05, that event will be assigned the window [00:04 to 00:08).
A SlidingWindow is a window of a given size(say 4 seconds) that slides every given interval (say 2 seconds). That means a sliding
window could overlap with another window. For a window of size 4 seconds that slides every 2 seconds, there could be windows
[00:00 to 00:04), [00:02 to 00:06), [00:04 to 00:08) etc. Notice that the windows 1 and 2 are overlapping here. If an event with
EventTime 00:05 comes in, that event will belong to the windows [00:02 to 00:06) and [00:04 to 00:08).
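A hedged PySpark sketch of both window types over an event-time column (the streaming DataFrame events and its columns are assumptions):

from pyspark.sql import functions as F

# events is assumed to be a streaming DataFrame with an eventTime timestamp column
tumbling = events.groupBy(
    F.window("eventTime", "4 seconds")).count()                # non-overlapping windows

sliding = events.groupBy(
    F.window("eventTime", "4 seconds", "2 seconds")).count()   # overlapping, slides every 2 seconds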
OUTPUT MODES
The last part of the model is output modes. Each time the
result table is updated, the developer wants to write the
changes to an external system, such as S3, HDFS, or a
database. We usually want to write output incrementally.
For this purpose, Structured Streaming provides three
output modes:
•Append: Only the new rows appended to the result
table since the last trigger will be written to the external
storage. This is applicable only on queries where existing
rows in the result table cannot change (e.g. a map on an
input stream).
•Complete: The entire updated result table will be written
to external storage.
•Update: Only the rows that were updated in the result
table since the last trigger will be changed in the
external storage. This mode works for output sinks that
can be updated in place, such as a MySQL table.
WATERMARK
In Spark, Watermark is used to decide when to clear a state based on current maximum event time. Based on the delay you
specify, the watermark lags behind the maximum event time seen so far. E.g., if the delay is 3 seconds and the current max event time is
10:00:45, then the watermark is at 10:00:42. This means that Spark will clear the state of windows whose end time is less than
10:00:42.
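A hedged sketch of declaring a watermark together with a window (the events DataFrame and column name are assumptions):

from pyspark.sql import functions as F

# events is assumed to be a streaming DataFrame with an eventTime timestamp column
windowed = (events
    .withWatermark("eventTime", "5 seconds")         # state of windows older than the watermark is cleared
    .groupBy(F.window("eventTime", "4 seconds"))
    .count())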
Watermark delay is 5 seconds here
http://vishnuviswanath.com/spark_structured_streaming.html
Spark MLLib machine learning
● Connect to batch/ real time streaming sources
● Data to be cleansed and transformed into a stream, stored in
memory
● Models may or may not be built in real time, depending on the
requirement and the availability of libraries
● Do predictions in batch/real time
● Receive, store and communicate with external system
Spark ML Lib
● Packages – spark.mllib, spark.ml
● Data types – Local vector, Labelled point
● Local vector – vector of double values – Dense and Sparse
● e.g, a vector (1.0, 0.0, 3.0) can be represented in dense format as
[1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0])
● Labeled point – contains the target variable (outcome) and the list
of features (predictors); see the sketch after this list
● Process - Load data into RDD -> Transform RDD – Filter, Type
conversion, centering, Scaling, etc. -> Convert to labeled point ->
Split training and testing -> Create model -> Tune -> Perform
predictions
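A minimal sketch of the local vector and labeled point types (the values are made up):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

dense = Vectors.dense([1.0, 0.0, 3.0])            # dense representation
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])    # size, indices, values

lp = LabeledPoint(1.0, sparse)                    # label (outcome) + feature vector (predictors)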
ML Pipelines
● Standard APIs for machine learning algorithms to make it easier to
combine multiple algorithms into a single pipeline, or workflow.
● Inspired by scikit-learn, where multiple pre-processing,
feature-engineering and modelling steps are combined to
produce an output using a single line of code.
● Three major parts:
1. DataFrame – to hold the dataset
2. Transformer – Apply Functions/Models to datasets (may add columns),
.transform()
3. Estimator – Converts a dataset into a Transformer, e.g. training the datasets
using .fit()
● Sometimes pipelines may also contain params which are the args passed to a
transformer.
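A hedged sketch of a small pipeline with two Transformers and one Estimator (the DataFrames training_df/test_df and their text/label columns are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")         # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")     # Transformer
lr = LogisticRegression(maxIter=10, labelCol="label")             # Estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training_df)         # fit() turns the pipeline (an Estimator) into a PipelineModel
predictions = model.transform(test_df)    # transform() applies every stage to the test data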
Feature engineering in Spark
● Extraction: Extracting features from “raw” data.
● TF-IDF - Feature vectorization method widely used in text mining to reflect
the importance of a term to a document in the corpus.
● TF: Both HashingTF and CountVectorizer can be used to generate the
term frequency vectors.
● HashingTF is a Transformer which takes sets of terms and converts those
sets into fixed-length feature vectors.
● IDF is an Estimator which is fit on a dataset and produces an IDFModel.
The IDFModel takes feature vectors (generally created from HashingTF or
CountVectorizer) and scales each column.
● Word2Vec -Estimator which takes sequences of words representing
documents and trains a Word2VecModel. The model maps each word to a
unique fixed-size vector. The Word2VecModel transforms each document
into a vector using the average of all words in the document
● CountVectorizer -CountVectorizer and CountVectorizerModel aim to help
convert a collection of text documents to vectors of token counts.
https://spark.apache.org/docs/2.4.0/ml-features.html#pca
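A short sketch of the TF-IDF flow described above (the documents are made up):

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "spark streaming and spark sql")],
    ["id", "text"])

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 10).transform(words)
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)   # IDF is an Estimator
tfidf = idf_model.transform(tf)
tfidf.select("id", "features").show(truncate=False)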
Transformation: Scaling, converting, or modifying features
● Tokenizer
● StopWordsRemover
● n-gram
● Normalizer
● StandardScaler
● Bucketizer
● SQLTransformer
● Imputer
Selection: Selecting a subset from a larger set of features
● Hypothesis testing
● Determine whether a result is statistically significant,
whether this result occurred by chance or not.
Linear regression
● Estimate value of dependent variables from the values
of independent variables with some correlation.
● Draw the best line to fit plotted points.
Example -
● Convert the input strings to vectors & drop columns which are not
needed
● Create an RDD of dense vectors, e.g. Vectors.dense(...)
● Create labelled vectors & drop low-correlation features (they should be
continuous)
● Run the LR model on the training data with lr.fit(trainingdata); print the
coefficients, intercepts and summary
● Predict
● Evaluate using MSE (the average of (predicted value – actual value) ** 2,
which should be low); R-squared should be high, as close to 1 as possible
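A hedged pyspark.ml sketch of this flow (raw_df and its column names are assumptions):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
data = assembler.transform(raw_df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

print(model.coefficients, model.intercept)
print(model.summary.rootMeanSquaredError, model.summary.r2)

predictions = model.transform(test)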
K-Means Clustering
● Attempts to split data into K- groups that are closest to K
centroids
● Unsupervised learning – uses only the position of each data point
● Randomly pick K centroids -> Assign each data point to the
centroid it’s closest to
● Re-compute centroids based on average position of each
centroid’s points
● Iterate until points stop changing assignment to centroids
● If you want to predict the cluster for new points, just find the
centroid they are closest to
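A short pyspark.ml sketch (feature_df with a vector column and the choice of k are assumptions):

from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, seed=1, featuresCol="features")
model = kmeans.fit(feature_df)

centers = model.clusterCenters()          # learned centroids
clustered = model.transform(feature_df)   # adds a "prediction" column with the assigned cluster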
Classification
● Binary Classification is the task of predicting a binary label. E.g., is an
email spam or not spam?
● Supervised learning – learns from labelled data points
● For binary classification problems, the algorithm outputs a binary logistic
regression model. Given a new data point, denoted by x, the model
makes predictions by applying the logistic function
Model selection and hyper parameter
tuning
● An important task in ML is model selection, or using data to find
the best model or parameters for a given task.
● Estimator: algorithm or Pipeline to tune
● Set of ParamMaps : parameters to choose from, sometimes
called a “parameter grid” to search over
-Split the input data into separate training and test datasets.
-For each (training, test) pair, they iterate through the set of
ParamMaps
-Identify the best ParamMap, CrossValidator finally re-fits the
Estimator using the best ParamMap and the entire dataset.
● Evaluator: metric to measure how well a fitted Model does on
held-out test data i.e evaluate the Model’s performance using the
Evaluator.
● RegressionEvaluator , a BinaryClassificationEvaluator for binary
data, or a MulticlassClassificationEvaluator
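A hedged sketch of grid search with CrossValidator, reusing the pipeline/lr/hashingTF objects from the earlier pipeline sketch (training_df is an assumption):

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(hashingTF.numFeatures, [1 << 10, 1 << 14])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv_model = cv.fit(training_df)
best_model = cv_model.bestModel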
Spark Streaming
Use Cases:
Spark Streaming v/s DStreams
Thank you!