
Unit-6: Spark

Introduction of Spark
What is Spark?
 Spark extends the popular MapReduce model to efficiently support more types of
computations, such as interactive queries and stream processing.
 Speed is important when working with large datasets, because it is the difference
between exploring data interactively and waiting minutes or hours.
 One of the key speed features of Spark is its ability to perform computations in memory, but
the system is also more efficient than MapReduce for complex applications running on disk.
 Spark is designed for high accessibility: it offers simple APIs in Python, Java, Scala, and SQL,
and extensive built-in libraries. It is also tightly integrated with other big data tools.
 In particular, Spark can run on Hadoop clusters and access any Hadoop data source,
including Cassandra.

Spark Stack
Spark Core
 Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more.

Figure 1 - Spark Stack

 Spark Core also hosts an API that defines Resilient Distributed Datasets (RDDs), the main
programming abstraction for Spark.
 RDD represents a collection of elements that are distributed across many compute nodes
and can be processed in parallel.
 Spark Core provides many APIs for creating and manipulating these collections.

Spark SQL
 Spark SQL is Spark’s package for working with structured data.
 It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive
Query Language (HQL)—and it supports many sources of data, including Hive tables,
Parquet, and JSON.
 Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
 This tight integration with the rich computing environment provided by Spark makes Spark
SQL unlike any other open-source data warehouse tool.
 Spark SQL was added to Spark in version 1.0.
 Shark was an older SQL-on-Spark project out of the University of California, Berkeley, that
modified Apache Hive to run on Spark.
 It has now been replaced by Spark SQL to provide better integration with the Spark engine
and language APIs.

Spark Streaming
 Spark Streaming is a Spark component that enables processing of live streams of data.
 Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
 Spark Streaming provides an API for manipulating data streams that closely matches the
Spark Core’s RDD API, making it easy for programmers to learn the project and move
between applications that manipulate data stored in memory, on disk, or arriving in real
time.
 Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.

MLlib
 Spark comes with a library containing common machine learning (ML) functionality, called
MLlib.
 MLlib provides multiple types of machine learning algorithms, including classification,
regression, clustering, and collaborative filtering, as well as supporting functionality such as
model evaluation and data import.
 It also provides some lower-level ML primitives, including a generic gradient descent
optimization algorithm. All these methods are designed to scale out across a cluster.

GraphX
 GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and
performing graph-parallel computations.

 Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to
create a directed graph with arbitrary properties attached to each vertex and edge.
 GraphX also provides various operators for manipulating graphs (e.g., subgraph and
mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle
counting).

Data Analysis with Spark


Data Science Tasks
 Data science, a discipline that has been emerging over the past few years, centers on
analyzing data.
 While there is no standard definition, for our purposes a data scientist is somebody whose
main task is to analyze and model data.
 Data scientists may have experience with SQL, statistics, predictive modelling (machine
learning), and programming, usually in Python, MATLAB, or R.
 Data scientists also have experience with techniques necessary to transform data into
formats that can be analyzed for insights (sometimes referred to as data wrangling).
 Data scientists use their skills to analyze data with the goal of answering a question or
discovering insights.
 Oftentimes, their workflow involves ad hoc analysis, so they use interactive shells (versus
building complex applications) that let them see results of queries and snippets of code in
the least amount of time.
 Spark's speed and simple APIs shine for this purpose, and its built-in libraries mean that
many algorithms are available out of the box.
 Spark supports the different tasks of data science with several components. The Spark shell
makes it easy to do interactive data analysis using Python or Scala.
 Spark SQL also has a separate SQL shell that can be used to do data exploration using SQL,
or Spark SQL can be used as part of a regular Spark program or in the Spark shell.
 Machine learning and data analysis are supported through the MLlib library. In addition,
there is support for calling out to external programs in MATLAB or R.
 Spark enables data scientists to tackle problems with larger data sizes than they could
before with tools like R or Pandas.
 Sometimes, after the initial exploration phase, the work of a data scientist will be
“productized,” or extended, hardened (i.e., made fault-tolerant), and tuned to become a
production data processing application, which itself is a component of a business
application.
 For example, the initial investigation of a data scientist might lead to the creation of a
production recommender system that is integrated into a web application and used to
generate product suggestions to users.

 Often it is a different person or team that leads the process of productizing the work of the
data scientists, and that person is often an engineer.

Data Processing Applications


 The other major use case for Spark can be described in the context of an engineer persona.
For our purposes here, we consider engineers to be a large class of software developers who
use Spark to build production computing applications.
 These developers usually understand the principles of software engineering, such as
encapsulation, interface design, and object-oriented programming. They often have a
degree in computer science.

Resilient Distributed Datasets (RDDs)


 RDDs are the main logical data unit in Spark.
 They are a distributed collection of objects, which are stored in memory or on disks of
different machines of a cluster.
 A single RDD can be divided into multiple logical partitions so that these partitions can be
stored and processed on different machines of a cluster.
 RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can
create new RDDs by performing coarse-grained operations, like transformations, on an
existing RDD.

Figure 2 - RDDs

Features of RDD

Resilient
 RDDs track data lineage information to recover lost data automatically on failure. This is
also called fault tolerance.

Distributed
 Data present in the RDD resides on multiple nodes. It is distributed across different nodes of
a cluster.

Lazy Evaluation
 Data is not loaded into an RDD when we define it. Transformations are computed lazily,
only when you call an action, such as count() or collect(), or save the output to a file system.

Figure 3 - Features of RDD

Immutability
 Data stored in an RDD is read-only: you cannot edit the data that is present in the RDD.
However, you can create new RDDs by performing transformations on the existing RDDs.

In-memory Computation
 RDDs store intermediate data in memory (RAM) rather than on disk, so that access is
faster.

Partitioning
 Any existing RDD can be divided into logical partitions that are distributed across the
cluster. The partitioning can be changed by applying transformations that produce new
RDDs from existing ones.

Operations of RDD
 There are two basic operations which can be done on RDDs. They are:
1. Transformations
2. Actions

Figure 4 - Operations of RDD

Transformations
 These are functions which accept existing RDDs as input and output one or more new RDDs.
The data in the existing RDDs does not change, as RDDs are immutable. Some of the
transformation operations are shown in the table given below:
Functions        Description
map()            Returns a new RDD by applying the function to each data element
filter()         Returns a new RDD formed by selecting those elements of the source on which
                 the function returns true
reduceByKey()    Used to aggregate the values of a key using a function
groupByKey()     Used to convert a (key, value) pair to a (key, <iterable value>) pair
union()          Returns a new RDD that contains all elements from the source RDD and the
                 argument RDD
intersection()   Returns a new RDD that contains the intersection of elements in the two datasets
 Transformations are evaluated lazily: each time a transformation is applied a new RDD is
defined, but the computation runs only when an action is invoked.
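 A minimal PySpark sketch of these transformations, assuming a SparkContext named sc is
already available (as in the PySpark shell); the sample data is made up for illustration:
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)            # map(): apply a function to each element
evens = nums.filter(lambda x: x % 2 == 0)      # filter(): keep elements where the function returns True

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
sums = pairs.reduceByKey(lambda a, b: a + b)   # reduceByKey(): aggregate the values of each key
groups = pairs.groupByKey()                    # groupByKey(): (key, value) -> (key, iterable of values)

other = sc.parallelize([3, 4, 5])
both = nums.union(other)                       # union(): elements of the source RDD and the argument RDD
common = nums.intersection(other)              # intersection(): elements present in both RDDs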

Actions
 Actions in Spark are functions which return the result of RDD computations. An action uses
the lineage graph to load the data into the RDD and apply the intermediate transformations
in the required order.
 After all transformations are done, actions return the final result to the Spark Driver. Actions
are operations which produce non-RDD values.
 Some of the common actions used in Spark are:
Functions             Description
count()               Gets the number of data elements in an RDD
collect()             Gets all the data elements of an RDD as an array
reduce()              Aggregates the data elements of an RDD by taking two arguments and
                      returning one
take(n)               Used to fetch the first n elements of the RDD
foreach(operation)    Used to execute the operation for each data element in the RDD
first()               Retrieves the first data element of the RDD
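 A short sketch of these actions, again assuming a SparkContext sc and made-up data; the
values in the comments are what each call would return:
nums = sc.parallelize([1, 2, 3, 4])
nums.count()                          # 4: number of elements
nums.collect()                        # [1, 2, 3, 4]: all elements as a list on the driver
nums.reduce(lambda a, b: a + b)       # 10: aggregate two elements at a time
nums.take(2)                          # [1, 2]: first two elements
nums.first()                          # 1: first element
nums.foreach(lambda x: print(x))      # run the operation on each element (output appears on the executors)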

Creating Pair RDDs


 Spark provides special types of operations on RDDs containing key/value pairs. These RDDs
are called pair RDDs.
 Pair RDDs are a useful building block in many programs, as they expose operations that
allow you to act on each key in parallel or to regroup data across the network.
 Pair RDDs can be created by running a map() function that returns key/value pairs.
 The procedure to build key-value RDDs differs by language. In Python, for the functions on
keyed data to work, we need to return an RDD composed of tuples.
 Creating a pair RDD using the first word as the key in Python:
pairs = lines.map(lambda x: (x.split(" ")[0], x))

Figure 5 - Key-Value Pairs

 Java users also need to call special versions of Spark's functions when creating pair RDDs.
 For instance, the mapToPair() function should be used in place of the basic map() function.
 Creating a pair RDD using the first word as the key in Java:
PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
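 For illustration, the Python version above could be exercised as follows (the input lines are
made up for this sketch):
lines = sc.parallelize(["holden likes coffee", "panda likes long strings"])
pairs = lines.map(lambda x: (x.split(" ")[0], x))   # first word becomes the key
pairs.collect()
# [('holden', 'holden likes coffee'), ('panda', 'panda likes long strings')]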
Transformations on Pair RDDs

Aggregations
 When datasets are described in terms of key/value pairs, it is common to want to aggregate
statistics across all elements with the same key. Spark has a set of operations that combine
values that share the same key. These operations return RDDs and thus are transformations
rather than actions, e.g., reduceByKey(), foldByKey(), and combineByKey().
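 A small hedged sketch of per-key aggregation with reduceByKey(), computing a per-key
average (the data is made up for illustration):
ratings = sc.parallelize([("spark", 4), ("spark", 5), ("hadoop", 3)])
# pair each value with a count of 1, then sum values and counts per key
sum_count = ratings.mapValues(lambda v: (v, 1)) \
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = sum_count.mapValues(lambda t: t[0] / t[1])
averages.collect()   # e.g. [('spark', 4.5), ('hadoop', 3.0)]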

Grouping Data
 With keyed data, a common use case is grouping our dataset by a predefined key value, for
example, viewing all of a customer's orders together.
 If our data is already keyed in the way we want, groupByKey() will group the data using the
key in our RDD.
 On an RDD consisting of keys of type K and values of type V, we get back an RDD of type
[K, Iterable[V]].
 groupBy() works on unpaired data, or on data where we want to use a different condition
besides equality on the current key.
 It takes a function that it applies to every element in the source RDD and uses the result to
determine the key.
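 A brief sketch of groupByKey() and groupBy() (the customer/order data is made up for
illustration):
orders = sc.parallelize([("alice", "order1"), ("bob", "order2"), ("alice", "order3")])
by_customer = orders.groupByKey()                         # RDD of (K, Iterable[V])
by_customer.mapValues(list).collect()
# e.g. [('alice', ['order1', 'order3']), ('bob', ['order2'])]

nums = sc.parallelize([1, 2, 3, 4, 5])
nums.groupBy(lambda x: x % 2).mapValues(list).collect()   # key computed by a function, not taken from the data
# e.g. [(0, [2, 4]), (1, [1, 3, 5])]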

Joins
 Some of the most useful operations on keyed data come from using it together with other
keyed data.
 Joining datasets together is probably one of the most common operations you will perform
on a pair RDD.
 Inner join (join()): only keys that are present in both pair RDDs appear in the output.
 leftOuterJoin(): the resulting pair RDD has entries for each key in the source RDD. The value
associated with each key in the result is a tuple of the value from the source RDD and an
optional value (None in Python if absent) from the other pair RDD.
 rightOuterJoin(): almost identical to leftOuterJoin(), except the key must be present in the
other RDD and the tuple has an optional value for the source RDD rather than the other
RDD.
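 A hedged sketch of the three joins on two small pair RDDs (the store data is made up for
illustration); in Python the missing side of an outer join appears as None:
address = sc.parallelize([("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave")])
rating = sc.parallelize([("Ritual", 4.9), ("Starbucks", 4.1)])

address.join(rating).collect()            # inner join: keys present in both RDDs
# [('Ritual', ('1026 Valencia St', 4.9))]

address.leftOuterJoin(rating).collect()   # one entry per key of the source RDD
# [('Ritual', ('1026 Valencia St', 4.9)), ('Philz', ('748 Van Ness Ave', None))]

address.rightOuterJoin(rating).collect()  # one entry per key of the other RDD
# [('Ritual', ('1026 Valencia St', 4.9)), ('Starbucks', (None, 4.1))]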

Sorting Data
 We can sort an RDD with key/value pairs provided that there is an ordering defined on the
key. Once we have sorted our data, any subsequent call on the sorted data to collect() or
save() will return an ordered dataset.
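 A minimal sketch of sorting a pair RDD with sortByKey() (data made up):
pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
pairs.sortByKey().collect()                   # [(1, 'a'), (2, 'b'), (3, 'c')]
pairs.sortByKey(ascending=False).collect()    # [(3, 'c'), (2, 'b'), (1, 'a')]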

Machine Learning with MLlib


 Apache Spark comes with a library named MLlib for performing machine learning tasks
using the Spark framework.
 Since there is a Python API for Apache Spark, namely PySpark, we can also use this library
in PySpark.
 MLlib contains many algorithms and machine learning utilities.
 Machine learning is one of the many applications of artificial intelligence (AI), where the
primary aim is to enable computers to learn automatically without human assistance.
 With the help of machine learning, computers can tackle tasks that were, until now, only
handled and carried out by people.
 It is basically a process of teaching a system how to make accurate predictions when fed
the right data.
 It provides the ability to learn and improve from experience without being specifically
programmed for that task.
 Machine learning mainly focuses on developing computer programs and algorithms that
make predictions and learn from the provided data.

What are dataframes?
 A dataframe is a newer API in Apache Spark. It is essentially a distributed collection of data
organised into named columns, i.e., a dataset with a schema. A dataframe is equivalent to a
table in a relational database, but with richer optimization options.

How to create dataframes


 There are multiple ways to create dataframes in Apache Spark (a short sketch of each
follows this list):
o Dataframes can be created from an existing RDD.
o You can create a dataframe by loading a CSV file directly.
o You can also programmatically specify a schema to create a dataframe.
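 A hedged sketch of the three approaches, using the SQLContext-style API that appears later
in this unit (the column names, sample rows, and CSV path are placeholders; newer Spark
versions would normally use a SparkSession instead):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext()
sqlContext = SQLContext(sc)

# 1. From an existing RDD of tuples (illustrative values)
rdd = sc.parallelize([("Walmart", 2300000), ("Apple", 116000)])
df1 = sqlContext.createDataFrame(rdd, ["Title", "Employees"])

# 2. By loading a CSV file directly (built-in csv reader in Spark 2.x; path is a placeholder)
df2 = sqlContext.read.format("csv").options(header="true", inferSchema="true").load("companies.csv")

# 3. By programmatically specifying a schema
schema = StructType([
    StructField("Title", StringType(), True),
    StructField("Employees", IntegerType(), True),
])
df3 = sqlContext.createDataFrame(rdd, schema)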

Basic of MLlib
 MLlib is short for machine learning library. Machine learning in PySpark is easy to use and
scalable. It works on distributed systems.
 We use machine learning in PySpark for data analysis.
 We get the benefit of various machine learning algorithms, such as regression and
classification, because of MLlib in Apache Spark.

PySpark

Real-time Computation
 PySpark provides real-time computation on large amounts of data because it focuses on
in-memory processing, which gives it low latency.

Support Multiple Language


 The Spark framework supports various programming languages, such as Scala, Java, Python,
and R. This compatibility makes it a preferable framework for processing huge datasets.

Caching and disk persistence

 The PySpark framework provides powerful caching and good disk persistence.

Swift Processing
 PySpark allows us to achieve a high data processing speed, about 100 times faster in
memory and 10 times faster on disk than traditional MapReduce processing.

Works well with RDD


 The Python programming language is dynamically typed, which helps when working with
RDDs (covered earlier in this unit).

Parameters in PySpark MLlib


 Some of the main parameters of PySpark MLlib are listed below (they correspond to the
parameters of the ALS collaborative-filtering API; a usage sketch follows this list):
o Ratings: used to create an RDD of ratings, i.e., rows or tuples of (user, product, rating).
o Rank: the number of features (latent factors) to compute.
o Lambda: the regularization parameter.
o Blocks: the number of blocks used to parallelize the computation. The default value
of -1 lets Spark choose automatically.
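 A minimal hedged sketch of how these parameters might be used with ALS in
pyspark.mllib.recommendation (the ratings data is made up, and an existing SparkContext sc
is assumed):
from pyspark.mllib.recommendation import ALS, Rating

# An RDD of Rating(user, product, rating) tuples (illustrative data)
ratings = sc.parallelize([Rating(1, 101, 5.0), Rating(1, 102, 3.0), Rating(2, 101, 4.0)])

# rank = number of latent features, lambda_ = regularization, blocks = parallelism (-1: let Spark choose)
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01, blocks=-1)
model.predict(2, 102)   # predicted rating of product 102 by user 2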

Performing Linear Regression on a real-world Dataset


 Let's understand machine learning better by implementing code to perform linear
regression on a dataset of the top 5 Fortune 500 companies in the year 2017.

Loading the data:


 As mentioned above, we are going to use a dataframe that we have created directly from a
CSV file.
 Following are the commands to load the data into a dataframe and to view the loaded data.

Input: In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)

Input: In [2]:
company_df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('C:/Users/intellipaat/Downloads/spark-2.3.2-binhadoop2.7/Fortune5002017.csv')
company_df.take(1)
 You can choose the number of rows you want to view while displaying the data of a
dataframe.
 I have displayed the first row only.

Output: Out[2]:
[Row(Rank=1, Title='Walmart', Website='http://www.walmart.com', Employees=2300000, Sector='retailing')]

Data exploration:
 To check the datatype of every column of a dataframe and print the schema of the
dataframe in a tree format, you can use the following commands respectively.

Input: In [3]:
company_df.cache()
company_df.printSchema()

Output: Out [3]:


DataFrame[Rank: int, Title: string, Website: string, Employees: int, Sector: string]
root
 |-- Rank: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Website: string (nullable = true)

 |-- Employees: integer (nullable = true)
 |-- Sector: string (nullable = true)

Performing Descriptive Analysis:

Input: In [4]:
company_df.describe().toPandas().transpose()

Output: Out [4]:


            0       1          2                    3                4
Summary     count   mean       stddev               min              max
Rank        5       3.0        1.581138830084       1                5
Title       5       None       None                 Apple            Walmart
Website     5       None       None                 www.apple.com    www.walmart.com
Employees   5       584880.0   966714.2168190142    68000            2300000
Sector      5       None       None                 Energy           Wholesalers
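
 The steps above stop after the descriptive analysis. A hedged sketch of the remaining linear
regression step might look like the following; the choice of feature column (Rank) and label
column (Employees) is an illustrative assumption, not something specified above:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the chosen feature column(s) into a single vector column
assembler = VectorAssembler(inputCols=["Rank"], outputCol="features")
train_df = assembler.transform(company_df).select("features", "Employees")

# Fit a linear regression model predicting Employees from the features
lr = LinearRegression(featuresCol="features", labelCol="Employees")
model = lr.fit(train_df)
print(model.coefficients, model.intercept)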

Machine learning in Industry


 Computer systems with the ability to learn from given data and improve themselves without
having to be reprogrammed used to be only a dream, but in recent years this has been made
possible using machine learning.
 Machine learning is now one of the most widely used branches of artificial intelligence,
adopted by big industries to benefit their businesses.
 Following are some of the organisations where machine learning has various use cases:
o PayPal: PayPal uses machine learning to detect suspicious activity.
o IBM: IBM has patented a machine learning technology that helps decide when to
hand over control of a self-driving vehicle between the vehicle's control processor
and a human driver.
o Google: Machine learning is used to gather information from users, which is then
used to improve search engine results.
o Walmart: Machine learning is used at Walmart to improve its efficiency.
o Amazon: Machine learning is used to design and implement personalised product
recommendations.
o Facebook: Machine learning is used to filter out poor-quality content.
