BDA Unit-6
Introduction to Spark
What is Spark?
Spark extends the popular MapReduce model to efficiently support more types of
computations, such as interactive queries and stream processing.
Speed is important when working with large datasets, because it is the difference
between exploring data interactively and waiting minutes or hours.
One of Spark's key features for speed is its ability to run computations in memory, but
the system is also more efficient than MapReduce for complex applications running on disk.
Spark is designed to be highly accessible: it offers simple APIs in Python, Java, Scala,
and SQL, along with rich built-in libraries. It is also tightly integrated with other big data tools.
In particular, Spark can run on Hadoop clusters and access any Hadoop data source,
including Cassandra.
Spark Stack
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more.
Spark Core also hosts an API that defines Resilient Distributed Datasets (RDDs), the main
programming abstraction for Spark.
RDD represents a collection of elements that are distributed across many compute nodes
and can be processed in parallel.
Spark Core provides many APIs for creating and manipulating these collections.
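As a minimal sketch (assuming a local PySpark installation; the application name is arbitrary), an RDD can be created from an in-memory collection and manipulated through the Spark Core API:

from pyspark import SparkConf, SparkContext

# Hypothetical local configuration for illustration only.
conf = SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create an RDD from an in-memory collection and process it in parallel.
nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)   # transformation
print(squares.collect())              # action: [1, 4, 9, 16, 25]

sc.stop()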
Spark SQL
Spark SQL is Spark’s package for working with structured data.
It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive
Query Language (HQL)—and it supports many sources of data, including Hive tables,
Parquet, and JSON.
Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
This tight integration with the rich computing environment provided by Spark makes Spark
SQL unlike any other open-source data warehouse tool.
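As an illustrative sketch of this intermixing (the file name and column names here are hypothetical), a SQL query can be combined with programmatic manipulation of its result in one application:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Hypothetical JSON source with "name" and "age" fields.
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# A SQL query...
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

# ...intermixed with programmatic manipulation of the result.
upper_names = adults.rdd.map(lambda row: row.name.upper())
print(upper_names.collect())

spark.stop()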
Spark SQL was added to Spark in version 1.0.
Shark was an older SQL-on-Spark project out of the University of California, Berkeley, that
modified Apache Hive to run on Spark.
It has now been replaced by Spark SQL to provide better integration with the Spark engine
and language APIs.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
Spark Streaming provides an API for manipulating data streams that closely matches the
Spark Core’s RDD API, making it easy for programmers to learn the project and move
between applications that manipulate data stored in memory, on disk, or arriving in real
time.
Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
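A minimal sketch of the Spark Streaming API (assuming a text stream arrives on a hypothetical local socket) shows how closely it mirrors the RDD API:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# DStream of text lines arriving on a (hypothetical) local socket.
lines = ssc.socketTextStream("localhost", 9999)

# The DStream API mirrors the RDD API: flatMap, map, reduceByKey, ...
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()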
MLlib
Spark comes with a library containing common machine learning (ML) functionality, called
MLlib.
MLlib provides multiple types of machine learning algorithms, including classification,
regression, clustering, and collaborative filtering, as well as supporting functionality such as
model evaluation and data import.
It also provides some lower-level ML primitives, including a generic gradient descent
optimization algorithm. All these methods are designed to scale out across a cluster.
GraphX
GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and
performing graph-parallel computations.
Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create
a directed graph with arbitrary properties attached to each vertex and edge.
GraphX also provides various operators for manipulating graphs (e.g., subgraph and
mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle
counting).
Often it is a different person or team that leads the process of productizing the work of the
data scientists, and that person is often an engineer.
Figure 2 - RDDs
Features of RDD
Resilient
RDDs track data lineage information to recover lost data automatically on failure. This is
also called fault tolerance.
Distributed
Data present in the RDD resides on multiple nodes. It is distributed across different nodes of
a cluster.
Lazy Evaluation
Data is not loaded into an RDD when it is defined. Transformations are computed only
when you call an action, such as count() or collect(), or save the output to a file system.
Immutability
Data stored in an RDD is read-only: you cannot edit the data present in the RDD.
However, you can create new RDDs by performing transformations on the existing RDDs.
In-memory Computation:
RDDs store intermediate data in memory (RAM) rather than on disk, which provides
faster access.
Partitioning:
Existing RDDs can be partitioned into logical parts that are distributed across the cluster.
You can change the partitioning by applying transformations to existing RDDs.
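A short sketch (assuming an active SparkContext named sc) that illustrates lazy evaluation and immutability: transformations define new RDDs without changing the source, and nothing is computed until an action is called.

nums = sc.parallelize([1, 2, 3, 4])         # source RDD, never modified
evens = nums.filter(lambda x: x % 2 == 0)   # new RDD; not computed yet (lazy)
print(evens.count())                        # the action triggers computation: 2
print(nums.collect())                       # source RDD is unchanged: [1, 2, 3, 4]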
Operations of RDD
There are two basic operations which can be done on RDDs. They are:
1. Transformations
2. Actions
Transformations
Transformations are functions that accept existing RDDs as input and output one or more
new RDDs. The data in the existing RDDs does not change, since RDDs are immutable.
Some of the transformation operations are shown below:
Functions and their descriptions:
map(): Returns a new RDD by applying the function to each data element
filter(): Returns a new RDD formed by selecting those elements of the source on which the function returns true
reduceByKey(): Used to aggregate the values of a key using a function
groupByKey(): Used to convert a (key, value) pair into a (key, <iterable value>) pair
union(): Returns a new RDD that contains all elements of the source RDD and its argument RDD
intersection(): Returns a new RDD that contains the intersection of the elements in the two datasets
Transformations are evaluated lazily: they are executed only when an action is invoked.
Every time a transformation is applied, a new RDD is created.
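A short sketch of some of these transformations (assuming an active SparkContext named sc); note that none of them triggers any computation by itself:

words = sc.parallelize(["spark", "hadoop", "spark", "hive"])
pairs = words.map(lambda w: (w, 1))                     # map()
spark_only = words.filter(lambda w: w == "spark")       # filter()
counts = pairs.reduceByKey(lambda a, b: a + b)          # reduceByKey()
grouped = pairs.groupByKey()                            # groupByKey()
more_words = words.union(sc.parallelize(["pig"]))       # union()
common = words.intersection(sc.parallelize(["hive"]))   # intersection()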
Actions:
Actions in Spark are functions that return the result of RDD computations. Spark uses the
lineage graph to load the data into the RDD and compute it in the required order.
After all transformations are done, actions return the final result to the Spark driver. Actions
are operations that produce non-RDD values.
Some of the common actions used in Spark are:
Functions and their descriptions:
count(): Gets the number of data elements in the RDD
collect(): Gets all the data elements in the RDD as an array
reduce(): Aggregates the data elements of the RDD using a function that takes two arguments and returns one
take(n): Fetches the first n elements of the RDD
foreach(operation): Executes the operation on each data element in the RDD
first(): Retrieves the first data element of the RDD
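A short sketch of these actions (assuming an active SparkContext named sc):

nums = sc.parallelize([3, 1, 4, 1, 5])
print(nums.count())                       # 5
print(nums.collect())                     # [3, 1, 4, 1, 5]
print(nums.reduce(lambda a, b: a + b))    # 14
print(nums.take(2))                       # [3, 1]
print(nums.first())                       # 3
nums.foreach(lambda x: None)              # runs the operation on each element (on the executors)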
Figure 5 - Key-Value Pairs
Java users also need to call special versions of Spark's functions when creating pair RDDs.
For instance, the mapToPair() function should be used in place of the basic map() function.
The following Java snippet creates a pair RDD using the first word of each line as the key.
PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2<String, String>(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
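In Python, by contrast, no special function is needed: a pair RDD is simply an RDD whose elements are two-element tuples. A one-line sketch of the same keying step (assuming lines is an RDD of strings):

pairs = lines.map(lambda x: (x.split(" ")[0], x))  # key = first word of each line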
Transformations on Pair RDDs
Aggregations
When datasets are described in terms of key/value pairs, it is common to want to aggregate
statistics across all elements with the same key. Spark has a set of operations that combine
values that share the same key. These operations return RDDs and are thus transformations
rather than actions, e.g., reduceByKey(), foldByKey(), and combineByKey().
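For example, reduceByKey() can total the values for each key (a sketch with made-up data, assuming an active SparkContext named sc):

sales = sc.parallelize([("panda", 3), ("pink", 4), ("panda", 1)])
totals = sales.reduceByKey(lambda a, b: a + b)
print(totals.collect())   # [('panda', 4), ('pink', 4)] (order may vary)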
Grouping Data
With keyed data, a common use case is grouping our data sets by a predefined key, for
example, viewing all of a customer's orders together.
If our data is already keyed in the way we want, groupByKey() will group our data using
the key in our RDD.
On an RDD consisting of keys of type K and values of type V, we get back an RDD
of type [K, Iterable[V]].
groupBy() works on unpaired data, or on data where we want to use a different condition
besides equality on the current key.
It takes a function that it applies to every element in the source RDD and uses the result
to determine the key.
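A sketch of both grouping operations on made-up data (assuming an active SparkContext named sc):

orders = sc.parallelize([("alice", "book"), ("bob", "pen"), ("alice", "lamp")])

# groupByKey(): group an already-keyed RDD by its key.
by_customer = orders.groupByKey()          # RDD of (K, Iterable[V])
print([(k, list(v)) for k, v in by_customer.collect()])

# groupBy(): derive the key with a function on unpaired data.
nums = sc.parallelize([1, 2, 3, 4, 5])
by_parity = nums.groupBy(lambda x: x % 2)
print([(k, list(v)) for k, v in by_parity.collect()])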
Joins
Some of the most useful operations we get with keyed data come from using it together
with other keyed data.
Joining datasets together is probably one of the most common operations you will perform
on a pair RDD.
Inner join: Only keys that are present in both pair RDDs appear in the output.
leftOuterJoin(): The resulting pair RDD has entries for each key in the source RDD. The
value associated with each key in the result is a tuple of the value from the source RDD and
an Option for the value from the other pair RDD.
rightOuterJoin(): Almost identical to leftOuterJoin(), except that the key must be present
in the other RDD and the tuple has an Option for the value from the source RDD rather than
the other RDD.
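A sketch of these joins on two small pair RDDs with made-up data (assuming an active SparkContext named sc):

address = sc.parallelize([("ritual", "1026 Valencia St"), ("philz", "748 Van Ness Ave")])
rating = sc.parallelize([("ritual", 4.9), ("starbucks", 3.8)])

print(address.join(rating).collect())            # inner join: only "ritual" survives
print(address.leftOuterJoin(rating).collect())   # "philz" is paired with None
print(address.rightOuterJoin(rating).collect())  # "starbucks" is paired with None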
Sorting Data
We can sort an RDD of key/value pairs provided there is an ordering defined on the keys.
Once we have sorted our data, any subsequent call on the sorted data to collect()
or save() will return an ordered dataset.
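For instance, sortByKey() sorts a pair RDD by its keys (a sketch assuming an active SparkContext named sc):

pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
print(pairs.sortByKey().collect())                 # [(1, 'a'), (2, 'b'), (3, 'c')]
print(pairs.sortByKey(ascending=False).collect())  # [(3, 'c'), (2, 'b'), (1, 'a')]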
What are dataframes?
A DataFrame is a newer API for Apache Spark. It is basically a distributed collection of data
organised into named columns, that is, a dataset arranged like a table. A DataFrame is
equivalent to a table in a relational database, except that it comes with richer optimizations.
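A minimal sketch of creating a DataFrame from in-memory data (the column names and rows here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Made-up rows; a DataFrame is a distributed collection organised into named columns.
df = spark.createDataFrame([("Acme", 1200), ("Globex", 800)], ["Title", "Employees"])
df.printSchema()
df.show()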
Basics of MLlib
MLlib is short for Machine Learning Library. Machine learning in PySpark is easy to use and
scalable, and it works on distributed systems.
We use machine learning in PySpark for data analysis.
We get the benefit of various machine learning algorithms, such as regression and
classification, because of MLlib in Apache Spark.
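A minimal sketch of one such algorithm through PySpark's DataFrame-based ML API (the training data here is made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Made-up training data: one feature column "x" and a label column "y".
train = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(train)

model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)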
PySpark
Real-time Computation
PySpark provides real-time computation on large amounts of data because it focuses on in-
memory processing. It offers low latency.
Swift Processing
PySpark allows us to achieve a high data processing speed, about 100 times faster
in memory and 10 times faster on disk.
Lambda: Lambda is a regularization parameter.
Blocks: Blocks are used to parallelize the number of computations. The default value
for this is -1.
Input: In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
Input: In [2]:
company_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv')
company_df.take(1)
You can choose the number of rows you want to view while displaying the data of a
dataframe.
I have displayed the first row only.
Output: Out[2]:
[Row(Rank=1, Title='Walmart', Website='http://www.walmart.com', Employees=2300000, Sector='retailing')]
Data exploration:
To cache the DataFrame in memory and to print its schema in a tree format (which shows the
data type of every column), you can use the following commands, respectively.
Input: In [3]:
company_df.cache()
company_df.printSchema()
Output:
root
 |-- Rank: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Website: string (nullable = true)
 |-- Employees: integer (nullable = true)
 |-- Sector: string (nullable = true)
Input: In [4]:
# Summary statistics (count, mean, stddev, min, max), converted to a local pandas DataFrame for easier viewing.
company_df.describe().toPandas().transpose()