Big Data Tools 2 - Apache Spark With PySpark


Crafted by:

Fiqri Wicaksono

Training series number


xxx.xxx.xx.xxx.xx
Big Data Tools 2
Apache Spark with PySpark

Updated
Q2.2020

Proprietary Document of IYKRA, 2020 Data Fellowship 3


Hi, I’m Fiqri Wicaksono!

My experience so far:


- Project Engineer (2018-2019)
- Data Fellowship II (2019)
- Data Tech Specialist (2020 - ???)



Objectives

● Understand Apache Spark concepts that apply to Big Data.


● Write basic PySpark programs.
● Run PySpark programs on small datasets with your local machine.



Outline

1. Introduction to Apache Spark


2. Resilient Distributed Dataset
3. Getting Started with PySpark



Sub Chapter 1
Introduction to Apache Spark



Background

● The main concern with Hadoop is maintaining speed when processing large datasets, both the waiting time between queries and the waiting time to run a program.
● Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
● Spark can use Hadoop in two ways – one for storage and the other for processing. Since Spark has its own cluster-management computation, it typically uses Hadoop for storage only.



What is Apache Spark?

● Apache Spark is a lightning-fast cluster-computing technology, designed for fast computation.
● It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing.
● The main feature of Spark is its in-memory cluster computing that increases
the processing speed of an application.



Apache Spark Features

● Speed − Spark helps run applications on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk.
● Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages.
● Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.



Apache Spark Components

● Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
● Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.



Apache Spark Components

● Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
● MLlib (Machine Learning Library)
MLlib is a distributed machine-learning framework on top of Spark, taking advantage of Spark's distributed, memory-based architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).



Apache Spark Components

● GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.



Sub Chapter 2
Resilient Distributed Datasets



Resilient Distributed Datasets

● Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
● Formally, an RDD is a read-only, partitioned collection of records. RDDs can be
created through deterministic operations on either data on stable storage or other
RDDs. RDD is a fault-tolerant collection of elements that can be operated on in
parallel.
● RDDs use lazy execution: transformations are not computed until an action requires a result.



Example: A File-based RDD
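
Below is a minimal sketch of a file-based RDD, assuming an existing SparkContext sc and an illustrative local text file named purplecow.txt (the file name is an assumption, not from the original slide):

mydata = sc.textFile("purplecow.txt")              # create an RDD from a text file
mydata_uc = mydata.map(lambda line: line.upper())  # transformation: upper-case each line
print(mydata.count())                              # action: number of lines in the file
print(mydata_uc.take(2))                           # action: first two upper-cased lines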



Example: A File-based RDD

Two types of RDD Operations:

● Actions: return values
● Transformations: define a new RDD based on the current one



RDD Operations: Actions

Some Common Actions:


● count() - Return the number of elements
● take(n) - Return an array of the first n elements
● collect() - Return an array of all elements
● saveAsTextFile(file) - Save to text file
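
A minimal sketch of these actions, assuming an existing SparkContext sc (the output path is illustrative):

nums = sc.parallelize([1, 2, 3, 4])   # small example RDD
print(nums.count())                   # 4
print(nums.take(2))                   # [1, 2]
print(nums.collect())                 # [1, 2, 3, 4]
nums.saveAsTextFile("nums_output")    # writes a directory of part files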



RDD Operations: Transformation

● Transformations create a new RDD from an existing one.
● RDDs are immutable.
Data in an RDD is never changed; instead, transformations are applied in sequence to derive the data you need.
● Some common transformations:
- map(function): creates a new RDD by performing a function on each record in the base RDD.
- filter(function): creates a new RDD by including or excluding each record in the base RDD according to a Boolean function.



Example: Map and Filter Transformations
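
A minimal sketch of map and filter, assuming an existing SparkContext sc:

nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)        # map: apply a function to every record
evens = nums.filter(lambda x: x % 2 == 0)  # filter: keep records matching a Boolean condition
print(doubled.collect())                   # [2, 4, 6, 8, 10]
print(evens.collect())                     # [2, 4]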



Sub Chapter 3
Getting Started with PySpark



Installing PySpark

● To install Spark, make sure you have Java 8 or higher installed on your computer.
● To install PySpark, make sure you already have Python installed on your computer.
● Then, you can install PySpark using pip.
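
For example (the exact command may differ depending on your Python environment):

pip install pyspark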



Installing PySpark

● Open Jupyter Notebook and check whether PySpark works. In a new notebook, paste the following code:

import pyspark
from pyspark import SparkContext
sc = SparkContext()

● If an error is shown, it is likely that Java is not installed on your machine. On macOS, open the terminal and run java -version; if a Java version is reported, make sure it is 1.8. On Windows, check your installed programs for a Java folder. If there is a Java folder, check that Java 1.8 is installed.



Spark Context

● SparkContext is the internal engine that manages the connection to the cluster. If you want to run an operation, you need a SparkContext.
● Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.

>>> nums = sc.parallelize([1, 2, 3, 4])
>>> nums.take(1)
[1]



Spark Context

● You can apply a transformation to the data with a lambda function. In the example
below, you return the square of nums. It is a map transformation.

squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print('%i ' % (num))





SQLContext

● A more convenient way to work with structured data is the DataFrame. The SparkContext is already set, so you can use it to create a DataFrame. You also need to declare an SQLContext.
● SQLContext connects the engine with different data sources and is used to initiate the functionality of Spark SQL.

from pyspark.sql import Row
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)



SQLContext

1. Create the list of tuples with the information
2. Build an RDD
3. Convert the tuples to Rows
4. Create a DataFrame

list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
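
To check the result, you can display the DataFrame (a quick sanity check, not part of the original slide):

DF_ppl.show()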



SQLContext

● If you want to access the type of each feature, you can use printSchema()

DF_ppl.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)



Basic Operation with PySpark

● Now, let’s get our hands dirty: open your notebook and run the following programs to understand basic data operations with PySpark.
● First of all, you need to initialize the SQLContext if it is not already initiated.

from pyspark.sql import SQLContext
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/adult_data.csv"
sc.addFile(url)
sqlContext = SQLContext(sc)



Basic Operation with PySpark

● Then, you can read the CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to guess the type of each column automatically. By default, it is set to False.

df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)

● Check the data type using printSchema()

df.printSchema()



Basic Operation with PySpark

● You can view the data with show().

df.show(5, truncate = False)

● Select Columns

df.select('age','fnlwgt').show(5)

● Count by group

df.groupBy("education").count().sort("count",ascending=True).show()



Basic Operation with PySpark

● To get summary statistics of the data, you can use describe().

df.describe().show()

● Crosstab Computation

df.crosstab('age', 'label').sort("age_label").show()

● Drop column

df.drop('education_num').columns



Wrapping up

● Now you’re able to do some basic data manipulation with PySpark. However, it won’t stick unless you practice, so why don’t you rebuild your machine-learning practice case using PySpark?
● Hint: You can look into MLlib (see the sketch below).
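
As a starting point, here is a minimal sketch of an MLlib (pyspark.ml) pipeline on the df DataFrame from the previous slides. The choice of feature columns and the income column being named 'label' are assumptions for illustration:

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Index the string income column into a numeric label (column name 'label' is assumed)
indexer = StringIndexer(inputCol='label', outputCol='label_idx')
# Assemble a couple of numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=['age', 'fnlwgt'], outputCol='features')

data = indexer.fit(df).transform(df)
data = assembler.transform(data).select('features', 'label_idx')

lr = LogisticRegression(featuresCol='features', labelCol='label_idx')
model = lr.fit(data)
model.transform(data).select('label_idx', 'prediction').show(5)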



Thank you!

