
📊 7. Introduction to Big Data with Spark and Hadoop
Week 1
Course introduction
What is big data?
The four V's
Parallel Processing, Scaling, and Data Parallelism
Parallel processing
Data scaling
Big Data Tools and Ecosystem
Key technologies include:
Analytics and visualization:
Business intelligence:
Cloud providers
Programming tools
Open Source and Big Data
Beyond the Hype
Where does big data come from?
Big Data Use Cases
Summary & Highlights
Week 2
Introduction to Hadoop
Intro to MapReduce
Why MapReduce?



Hadoop Ecosystem
Ingest data
Store data
Analyze data
Access data
HDFS
Blocks
Nodes
Replication
Read and write
HDFS architecture
Hive
Hive architecture
Apache HBASE
HBase architecture
Concepts
Hands-on Lab: Hadoop MapReduce
Summary & Highlights
Week 3
Why use Apache Spark?
Parallel Programming using Resilient Distributed Datasets
RDDs
Creating an RDD in Spark
Parallel programming
Scale out / Data Parallelism in Apache Spark
Components
Spark core
Scaling big data in Spark
Dataframes and SparkSQL
LAB → notebook sparkintro
Summary & Highlights
Practice quiz
Week 4
RDDs in Parallel Programming and Spark
Transformations
Actions
Directed Acyclic Graph (DAG)
Transformations and actions
Data-frames and Datasets
ETL with DataFrames
Read the data
Analyze the data



Transform the data
Loading or exporting the data
Real-world usage of SparkSQL
Summary & Highlights
Week 5
Apache Spark Architecture
Overview of Apache Spark Cluster Modes
How to Run an Apache Spark Application
Spark submit
Spark shell
Summary & Highlights
Using Apache Spark on IBM Cloud
Why use Spark on IBM Cloud
What is AIOps
IBM Spectrum Conductor
Setting Apache Spark Configuration
Running Spark on Kubernetes
Summary & Highlights
Week 6
The Apache Spark User Interface
Monitoring Application Progress
Debugging Apache Spark Application Issues
Understanding Memory Resources
Understanding Processor Resources
Summary & Highlights
Course Final Exam

Week 1
Course introduction
The latest statistics report that the accumulated world’s data will grow from 4.4
zettabytes to 44 zettabytes, with much of that data classified as Big Data. Revenues
based on Big Data analytics are projected to increase to $103 billion by 2027.
Understandably, organizations across industries want to harness the competitive
advantages of Big Data analytics. This course provides you with the foundational
knowledge and hands-on lab experience you need to understand what Big Data is
and learn how organizations use Apache Hadoop, Apache Spark, including Apache
Spark SQL, and Kubernetes to expedite and optimize Big Data processing.



What is big data?
Small data:

small enough for human inference

accumulates slowly

relatively consistent and structured

data usually stored in known forms such as JSON and XML

mostly located in storage systems within enterprises or data centers

Big data:

generated in huge volumes and could be structured, semi-structured or unstructured

needs processing to generate insights for human consumption

arrives continuously at enormous speed from multiple sources

comprises any form of data, including video, photos, and more

distributed on the cloud and server farms

Life cycle:
business case → data collection → data modeling → data processing → data
visualization

1024 GB → 1 Terabyte

1024 TB → 1 Petabyte
1024 PB → 1 Exabyte
1024 EB → 1 Zettabyte

1024 ZB → 1 Yottabyte

The four V's


VELOCITY:

Description: speed at which data arrives

Attributes:

batch

close to real time



streaming

Drivers:

improved connectivity and hardware

rapid response times

VOLUME:

Description: increase in the amount of data stored over time

Attributes:

petabytes

exabytes

zettabytes

Drivers:

increase in data sources

higher resolution sensors

scalable infrastructure

VARIETY:

Description: many forms of data exist and need to be stored

Attributes:

structure

complexity

origin

Drivers:

mobile tech

scalable infrastructure

resilience

fault recovery

efficient storage and retrieval

VERACITY:



Description: the certainty of data

Attributes:

consistency and completeness

integrity

ambiguity

Drivers:

cost and traceability

robust ingestion

ETL mechanisms

→ The fifth V is VALUE ($)

Parallel Processing, Scaling, and Data Parallelism


Parallel processing
Linear processing → instructions are executed sequentially, and if there's an error the
whole process starts again
→ suited for minor computing tasks

Parallel processing:

The problem is also divided into instructions, but each instruction goes to a
separate node with equal processing capacity and they are executed in parallel.
Failed instructions can be re-executed locally without affecting the other instructions

faster

less memory and compute requirements

flexibility: execution nodes can be added and removed depending on need



Data scaling
→ scaling UP: enlarging an existing node when its storage is full

HORIZONTAL SCALING:

adding additional nodes with the same capacity

this collection of nodes is known as a cluster

if one process fails it doesn't affect the others and can be easily re-run

Fault tolerance:

if node one has partitions P1, P2 and P3, and it fails, we can easily add a new
node and recover these partitions from the copies they had in other nodes

Big Data Tools and Ecosystem


Key technologies include:
hadoop

HDFS

spark

mapReduce

cloudera

databricks

Analytics and visualization:


tableau

palantir

SAS

pentaho

teradata



Business intelligence:
BI offers a range of tools that provide a quick and easy way to transform data into
actionable insights

cognos

oracle

powerBI

business objects

hyperion

Cloud providers
ibm

aws

oracle

Programming tools
R

python

scala

julia

Open Source and Big Data


Hadoop plays a major role in open source Big Data projects.

Its three main components are:

Hadoop MapReduce

framework that allows code to be written to run at scale on a hadoop cluster



less used than apache spark

Hadoop Distributed File System (HDFS)

file system that stores and manages big data files

manages issues around large and distributed datasets

resilience and partitioning

still used in the industry → 70% of the world's big data resides on HDFS

S3 is also coming into use

Yet Another Resource Negotiator (YARN)

resource manager

default resource manager for many big data apps, including hive and spark

kubernetes is slowly becoming the new de facto standard, but YARN is still
widely used

Beyond the Hype


FACT: more data has been created in the past two years than in the entire previous
history of humankind

Where does big data come from?


social data

social media

images

video

comments

machine data

iot

sensors

transactional data

invoices



payment orders

storage records

delivery receipts

Big Data Use Cases


retail

price analytics

sentiment analysis

what consumers think of the product

insurance

fraud analytics

risk assessment

telecom

improve network security

contextualized location-based promotions

real-time network analytics

optimized pricing

manufacturing

predictive maintenance

example: for machines

production optimization

automotive industry

predictive support

connected self-driven cars

finance

customer segmentation

algorithmic trading



Summary & Highlights
Personal assistants like Siri, Alexa and Google Now, use Big Data and IoT to
gather data and devise answers.

Big Data Analytics helps companies gain insights from the data collected by IoT
devices.

Big Data requires parallel processing on account of massive volumes of data


that are too large to fit on any one computer.

"Embarrassingly parallel” calculations are the kinds of workloads that can easily
be divided and run independently of one another. If any single process fails, that
process has no impact on the other processes and can simply be re-run.

Open-source projects, which are free and completely transparent, run the world
of Big Data and include the Hadoop project and big data tools like Apache Hive
and Apache Spark.

The Big Data tool ecosystem includes the following six main tooling categories:
data technologies, analytics and visualization, business intelligence, cloud
providers, NoSQL databases, and programming tools.

Week 2
Introduction to Hadoop
open-source program to process large data sets

servers run applications on clusters

handles parallel jobs or processes. Not a database but an ecosystem

structured, unstructured and semi-structured data

core components

hadoop common: essential part of the framework that refers to the


collection of common utilities and libraries that support other hadoop
modules

HDFS: handles large data sets running on commodity hardware, that is, low-
specifications industry-grade hardware. HDFS scales a single hadoop
cluster to thousands of nodes



MapReduce: processes data by splitting the data into smaller units. It was
the first method to query data stored in HDFS

YARN: prepares RAM and CPU in Hadoop to run batch processes, stream,
interactive and graph processing

hadoop is NOT good for:

processing transactions (lack of random access)

when work cannot be parallelized

when there are dependencies in the data

low latency data access

processing lots of small files

intensive calculations with little data

Intro to MapReduce
programming model used in hadoop for processing big data

processing technique for distributed computing

based on java

two tasks

MAP

input file

processes data into key-value pairs

further data sorting and organizing

REDUCE

aggregates and computes a set of results and produces a final output

MAPREDUCE

keeps track of its task by creating a unique key
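A minimal word count sketch of the two tasks, written as Hadoop Streaming-style Python scripts (the file names mapper.py and reducer.py and the streaming setup are illustrative assumptions, not part of the course notes):

#mapper.py: the MAP task emits a (word, 1) key-value pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#reducer.py: the REDUCE task aggregates the counts per key (its input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")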



Why MapReduce?
parallel computing

divide → run tasks → done

flexibility

process tabular and non-tabular forms, such as videos

support for multiple languages

platform for analysis and data warehousing

Common use cases:

social media

recommendations (netflix recommendation algorithm)

financial industries

advertisement (google ads)

Hadoop Ecosystem
made of components that support one another for big data processing

INGEST DATA (ex: Flume and Sqoop) → STORE DATA (ex: HDFS and HBase) →
PROCESS AND ANALYZE DATA (ex: Pig and Hive) → ACCESS DATA (ex: Impala
and Hue)

Ingest data
Flume



collects, aggregates and transfers big data

has a simple and flexible architecture based on streaming data flows

uses a simple extensible data model that allows for online analytic
application

Sqoop

designed to transfer data between relational database systems and hadoop

accesses the database to understand the schema of the data

generates a MapReduce application to import or export the data

Store data
HBase

non-relational database that runs on top of HDFS

provides real time wrangling on data

stores data as indexes to allow for random and faster access to data

Cassandra

scalable, noSQL database designed to have no single point of failure

Analyze data
Pig

analyzes large amounts of data

operates on the client side of a cluster

procedural data flow language that follows an order and a set of commands

Hive

creating reports

operates on the server side of the cluster

declarative programming language, which means it allows users to express
which data they wish to receive

Access data
Impala



scalable and easy to use platform for everyone

no programming skills required

Hue

stands for hadoop user experience

allows you to upload, browse, and query data

runs pig jobs and workflows

provides editors for several SQL query languages like Hive and MySQL

HDFS
Hadoop distributed file system

storage layer of hadoop

splits the files into blocks, creates replicas of the blocks, and stores them on
different machines

command line interface to interact with hadoop

provides access to streaming data

this means that HDFS provides a constant bitrate when transferring data,
rather than transferring the data in waves

Key features

cost efficient

large amounts of data

replication

fault tolerant

scalable

portable

Blocks
minimum amount of data that can be read or written



provides fault tolerance

default size is 64MB or 128 MB

each stored file doesn't have to take up the configured block size

Nodes
a node is a single system that is responsible for storing and processing data

primary node = name node

regulates file access for the clients and maintains, manages and assigns
tasks to the secondary node

secondary node = data node

the actual workers

take instructions from the primary

rack awareness in HDFS

choosing data node racks that are closest to each other

improves cluster performance by reducing network traffic

name node keeps the rack ID information

replication can be done through rack awareness

(a rack is the collection of about 40 to 50 data nodes using the same


network switch)

Replication
creating a copy of the data block

copies are created for backup purposes



replication factor: the number of times the data block was copied

Read and write


allows write-once, read-many operations

read

client will send a request to the primary node to get the location of the data
nodes containing blocks

client will read files closest to the data nodes

a client fulfills a user's request by interacting with the name and data nodes

write

name node makes sure that the file doesn't already exist

if the file exists, the client gets an IOException message

if the file doesn't exist, the client is given access to start writing files

HDFS architecture

Hive
data warehouse software within hadoop that is designed for reading, writing and
managing tabular-type datasets and data analysis

scalable, fast and easy to use



Hive Query Language (HiveQL) is inspired by SQL

supports data cleansing and filtering depending on users' requirements

file formats supported:

flat and text files

sequence file (binary key value pairs)

record columnar files (columns of a table stored in a columnar database)

Traditional RDBMS vs. Hive:

Traditional RDBMS:

used to maintain a database

uses SQL

suited for real-time/dynamic data analysis, like data from sensors

designed to read and write as many times as it needs

maximum data size it can handle is terabytes

enforces that the schema must verify loading data before it can proceed

may not always have built-in support for data partitioning

Hive:

used to maintain a data warehouse

uses Hive Query Language (HiveQL)

suited for static data analysis, like a text file containing names

designed on the methodology of write once, read many

maximum data size it can handle is petabytes

doesn't enforce the schema to verify loading data

supports partitioning

Hive architecture



Hive clients

JDBC client allows java apps to connect to hive

ODBC client allows apps based on ODBC protocol to connect to hive

Hive services

hive server to enable queries

the driver receives query statements

the optimizer is used to split tasks efficiently

the executor executes tasks after the optimizer

metastore stores the metadata information about the tables

Apache HBASE
column-oriented non-relational database management system

runs on top of hdfs

provides a fault-tolerant way of storing sparse datasets

works well with real-time data and random read and write access to big data

used for write-heavy applications

linearly and modularly scalable

backup support for MapReduce

provides consistent reads and writes

no fixed column schema



easy-to-use java api for client access

provides data replication across clusters

predefine table schema and specify column families

new columns can be added to column families at any time

schema is very flexible

has a master node to manage the cluster and region servers to perform the work

HBase vs. HDFS:

HBase:

stores data in the form of columns and rows in a table

allows dynamic changes

suitable for random writes and reads of data stored in HDFS

allows for storing and processing of big data

HDFS:

stores data in a distributed manner across the different nodes of the network

has a rigid architecture that doesn't allow changes

suited for write once, read many times

for storing only

HBase architecture

Concepts



HMaster

monitors the region server instances

assigns regions to region servers

manages any changes that are made to the schema

Region Servers

communicates directly with the client

receives and assigns requests to regions

responsible for managing regions

Region

smallest unit of HBase cluster

contains multiple stores

two components:

HFile

Memstore

Zookeeper

maintains healthy links between nodes

provides distributed sync

tracks server failure

Hands-on Lab: Hadoop MapReduce


The steps outlined in this lab use the Dockerized single-node Hadoop Version 3.2.1.
Hadoop is most useful when deployed in a fully distributed mode on a large cluster
of networked servers sharing a large volume of data. However, for basic
understanding, we will configure Hadoop on a single node.

#Clone the repository.
git clone https://github.com/ibm-developer-skills-network/ooxwv-docker_hadoop.git

#Compose the docker application.
#Compose is a tool for defining and running multi-container Docker applications. It uses
#a YAML file to configure the services and enables us to create and start all the
#services from a single configuration file.
docker-compose up -d

#Run the namenode as a mounted drive on bash.
docker exec -it namenode /bin/bash

The Hadoop environment is configured by editing a set of configuration files:

hadoop-env.sh: serves as a master file to configure YARN, HDFS, MapReduce,


and Hadoop-related project settings

core-site.xml: defines HDFS and Hadoop core properties

hdfs-site.xml: governs the location for storing node metadata, fsimage file and
log file

mapred-site.xml: lists the parameters to MapReduce configuration

yarn-site.xml: defines settings relevant to YARN. It contains configurations for the


Node Manager, Resource Manager, Containers, and Application Master.

#For the docker image, these xml files have been configured already. You can see them
#in the directory /opt/hadoop-3.2.1/etc/hadoop/ by running:
ls /opt/hadoop-3.2.1/etc/hadoop/*.xml

Set up for MapReduce

#In the HDFS, create a directory named user.


hdfs dfs -mkdir /user

#In the HDFS, under user, create a directory named root.


hdfs dfs -mkdir /user/root

#Under /user/root, create an input directory.


hdfs dfs -mkdir /user/root/input

#Copy all the hadoop configuration xml files into the input directory.
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/root/input

#Create a data.txt file in the current directory.
curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/data.txt > data.txt

#Copy data.txt into the /user/root directory to pass it into the wordcount problem.
hdfs dfs -put data.txt /user/root

#Check if the file has been copied into the HDFS by viewing its content.
hdfs dfs -cat /user/root/data.txt



MapReduce word count

#Run the MapReduce wordcount application on data.txt and store the output in /user/root/output.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount data.txt /user/root/output

#Once the word count runs successfully, run the following command to see the output file it has generated.
hdfs dfs -ls /user/root/output/
#While it is still processing, you may only see '_temporary' listed in the output directory.
#Wait a couple of minutes and run the command again until the output appears.

#Run the following command to see the word count output.
hdfs dfs -cat /user/root/output/part-r-00000

Summary & Highlights


Hadoop is an open-source framework for Big Data that faces challenges when there are
dependencies in the data or a need for low-latency data access.

MapReduce, a parallel computing framework used in parallel computing, is


flexible for all data types, addresses parallel processing needs for multiple
industries and contains two major tasks, “map” and “reduce.”

The four main stages of the Hadoop Ecosystem are Ingest, Store, Process and
Analyze, and Access.

Key HDFS benefits include its cost efficiency, scalability, data storage expansion
and data replication capabilities. Rack awareness helps reduce the network
traffic and improve cluster performance. HDFS enables “write once, read
many” operations.

Suited for static data analysis and built to handle petabytes of data, Hive is a
data warehouse software for reading, writing, and managing datasets. Hive is
based on the “write once, read many” methodology, doesn’t enforce the schema
to verify loading data and has built-in partitioning support.

Linearly scalable and highly efficient, HBase is a column-oriented non-relational


database management system that runs on HDFS and provides an easy-to-use
Java API for client access. HBase architecture consists of HMaster, Region
servers, Region, Zookeeper and HDFS. A key difference between HDFS and
HBase is that HBase allows dynamic changes compared to the rigid architecture
of HDFS.

Week 3
Why use Apache Spark?
open source in-memory application framework for distributed data processing
and iterative analysis on massive data volumes

written in Scala

runs in java virtual machines

distributed computing

easy-to-use python, scala and java APIs

Different!!

→ PARALLEL COMPUTING: processors access shared memory

→ DISTRIBUTED COMPUTING: processors usually have their own private or distributed memory

Distributed computing benefits

scalability and modular growth

fault tolerance and redundancy

Apache Spark vs. MapReduce:

Apache Spark:

keeps more data in-memory with a new distributed execution engine

MapReduce:

creates MapReduce jobs for complex jobs, interactive query, and online event-hub processing

involves lots of (slow) disk I/O



Data engineering:

core spark engine

clusters and executors

cluster management

sparkSQL

catalyst, tungsten, dataFrames

Data science and machine learning:

sparkML

dataFrames

streaming

Parallel Programming using Resilient Distributed Datasets
spark parallelizes computations using the lambda calculus (functional programming)

RDDs
A resilient distributed dataset is:

spark's primary data abstraction

a fault-tolerant collection of elements

partitioned across the nodes of the cluster

capable of accepting parallel operations

immutable

Supported file types:

text

sequenceFiles

Avro

Parquet

Hadoop input formats

Supported storage systems:

local

cassandra



HBase

amazon S3

others

sql and nosql databases

→ Spark applications consist of a driver program that runs the user's main
functions and multiple parallel operations on a cluster

Creating an RDD in Spark


Option 1: use an external or local file from hadoop-supported file system such as

HDFS

cassandra

HBase

amazon s3

Option 2: create from a list in Scala or Python using the APIs

data = [1, 2, 3, 4]
distData = sc.parallelize(data)

Option 3: apply a transformation on an existing RDD to create a new RDD
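A minimal sketch of option 3, continuing the distData RDD created above (the variable name squared is just illustrative):

#a transformation on an existing RDD returns a new RDD; distData itself is unchanged
squared = distData.map(lambda x: x * x)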

Parallel programming
simultaneous use of multiple compute resources to solve a computational
problem

breaks problems into discrete parts that can be solved concurrently

runs simultaneous instructions on multiple processors

employs an overall control/coordination mechanism

you can create an RDD by parallelizing an array of objects, or by splitting a


dataset into partitions



spark runs one task for each partition of the cluster

Resilient distributed datasets

are always recoverable as they are immutable

can persist or cache datasets in memory across operations, which speeds


iterative operations

persistence and cache

Scale out / Data Parallelism in Apache Spark


Components
DATA STORAGE: HDFS and other formats

COMPUTE INTERFACE: APIs for scala, java and python

MANAGEMENT: distributed computing with Standalone, Mesos, YARN or Kubernetes

Spark core
is a base engine

is fault-tolerant

performs large scale parallel and distributed data processing

manages memory

schedules tasks

houses APIs that define RDDs

contains a distributed collection of elements that are parallelized across the


cluster

Scaling big data in Spark



The Spark Application consists of the driver program and the executor program.
Executor programs run on worker nodes.

Spark can start additional executor processes on a worker if there is enough


memory and cores available. Similarly, executors can also take multiple cores for
multithreaded calculations. Spark distributes RDDs among executors.

Communication occurs among the driver and the executors.

The driver contains the Spark jobs that the application needs to run and splits
the jobs into tasks submitted to the executors. The driver receives the task
results when the executors complete the tasks. If Apache Spark were a large
organization or company, the driver code would be the executive management of
that company that makes decisions about allocating work, obtaining capital, and
more. The junior employees are the executors who do the jobs assigned to them
with the resources provided.

The worker nodes correspond to the physical office space that the employees
occupy. You can add additional worker nodes to scale big data processing
incrementally.

Dataframes and SparkSQL


sparksql is a spark module for STRUCTURED DATA PROCESSING

used to query structured data inside spark programs, using either sql or a
familiar DataFrame API

usable in java, scala, python and R



runs SQL queries over imported data and existing RDDs independently of API or
programming language

#example sql query in python
results = spark.sql("SELECT * FROM people")

#PySpark DataFrames have no .map; map over the underlying RDD to extract a column
names = results.rdd.map(lambda p: p.name)

Benefits of SparkSQL:

includes a cost-based optimizer, columnar storage, and code generation to make


queries fast

scales to thousands of nodes and multi-hour queries using the spark engine,
which provides full mid-query fault tolerance

provides a programming abstraction called DataFrames and can also act as a


distributed SQL query engine

DataFrames:

Distributed collection of data organized into named columns

conceptually equivalent to a table in a relational database or a data frame in


R/Python, but with richer optimizations

built on top of the RDD API

uses RDD

performs relational queries

benefits

scale from KBs on a single laptop to petabytes on a large cluster

support for a wide array of data formats and storage systems

state-of-the-art optimization and code generation through the spark sql


catalyst optimizer

seamless integration with all big data tooling and infrastructure via spark

APIs for python, java, scala, and R (R support is in development via SparkR)



#read from json file and create dataframe
df = spark.read.json("people.json")
df.show()
df.printSchema()

#register the dataframe as a SQL temporary view
df.createTempView("people")

#sql query
spark.sql("SELECT * FROM people").show()

#same with dataframe


df.select("name").show()
df.select(df["name"]).show()

#sql query
spark.sql("SELECT age, name FROM people WHERE age > 21").show()

#same with dataframe python api


df.filter(df["age"]>21).show()

LAB → notebook sparkintro


When running the second cell I had this error:

"Couldn't find Spark, make sure SPARK_HOME env is set"


" or Spark is in an expected location (e.g. from homebrew
installation)."

This line should get everything done and installed, but it doesn't:

import findspark
findspark.init()

To solve it:

1. Install Spark from the web page:
https://www.apache.org/dyn/closer.lua/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

2. Set the environment variable. In this case for Linux (Fedora):
there's a hidden file in the home directory called .bashrc
where you need to add this line:

export SPARK_HOME="/path/spark-3.2.0-bin-hadoop3.2"

3. Reboot

4. Check that the environment variable has been created:

printenv SPARK_HOME

5. Problem solved

Summary & Highlights


Spark is an open source in-memory application framework for distributed data
processing and iterative analysis on massive data volumes. Both distributed
systems and Apache Spark are inherently scalable and fault tolerant. Apache
Spark solves the problems encountered with MapReduce by keeping a
substantial portion of the data required in-memory, avoiding expensive and time-
consuming disk I/O.

Functional programming follows a declarative programming model that


emphasizes “what” instead of “how to” and uses expressions.

Lambda functions or operators are anonymous functions that enable functional


programming. Spark parallelizes computations using the lambda calculus and all
functional Spark programs are inherently parallel.

Resilient distributed datasets, or RDDs, are Spark’s primary data abstraction


consisting of a fault-tolerant collection of elements partitioned across the nodes
of the cluster, capable of accepting parallel operations. You can create an RDD
using an external or local Hadoop-supported file, from a collection, or from
another RDD. RDDs are immutable and always recoverable, providing resilience
in Apache Spark. RDDs can persist or cache datasets in memory across
operations, which speeds iterative operations in Spark.

Apache Spark architecture consists of three components: data storage, compute interface, and


management. The fault-tolerant Spark Core base engine performs large-scale
Big Data worthy parallel and distributed data processing jobs, manages memory,
schedules tasks, and houses APIs that define RDDs.



Spark SQL provides a programming abstraction called DataFrames and can also
act as a distributed SQL query engine. Spark DataFrames are conceptually
equivalent to a table in a relational database or a data frame in R/Python,
but with richer optimizations.

Practice quiz
1. Benefits of working with Spark:
it is an open-source, in-memory application framework for distributed data
processing

2. What are the features or characteristics of a functional programming language


such as Scala?
- it treats functions as first-class citizens, for example functions can be passed
as arguments to other functions
- it follows a declarative programming model
- the emphasis is in the "what" not in the "how to"

3. Which of the following statements are true of Resilient Distributed Datasets


(RDDs)?
- An RDD is a distributed collection of elements parallelized across the cluster
- RDDs are persistent and speed up interactions because they can reuse the
same partition in other actions on a dataset
- RDDs enable Apache Spark to reconstruct transformations

4. How is the Spark Application Architecture configured to scale big data?


- spark application consists of the driver program and the executor program
- executor programs run on worker nodes and you can add additional worker
nodes to scale big data processing incrementally.
- apache spark architecture consists of three main components: data storage,
compute interface and management

5. Which SQL query options would display the names column from this DataFrame
example?
- df.select( df["name"] ).show()
- spark.sql("SELECT name FROM people").show()
- df.select("name").show()

Week 4



RDDs in Parallel Programming and Spark
RDDs are spark's primary data abstraction

they are partitioned across the nodes of the cluster

dataset is partitioned

partitions are stored in worker memory

Transformations
create a new RDD from existing one

are "lazy" because the results are only computed when evaluated by actions

the map transformation passes each element of a dataset through a function


and returns a new RDD

Actions
actions return a value to driver program after running a computation

reduce()

an action that aggregates all RDD elements
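A minimal PySpark sketch of a lazy transformation followed by an action, assuming a SparkContext sc is available as in the earlier examples:

rdd = sc.parallelize([1, 2, 3, 4, 5])

#transformation: lazily defines a new RDD, nothing is computed yet
doubled = rdd.map(lambda x: x * 2)

#action: triggers the computation and returns a value to the driver program
total = doubled.reduce(lambda a, b: a + b)  #30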

Directed Acyclic Graph (DAG)


How do transformations and actions happen?

Spark uses a unique data structure called the DAG and an associated DAG Scheduler
to perform RDD operations.

it's a graphical data structure with edges and vertices

a new edge is generated only from an existing vertex

in apache spark DAG, the vertices represent RDDs and the edges represent
operations such as transformations or actions

if a node goes down, spark replicates the DAG and restores the node

Transformations and actions


1. Spark creates the DAG when creating an RDD

2. Spark enables the DAG Scheduler to perform a transformation and updates the
DAG



3. The DAG now points to the new RDD

4. The pointer that transforms RDD is returned to the Spark driver program

5. If there is an action, the driver program that calls the action evaluates the DAG
only after Spark completes the action

A transformation is only a map operation. We need actions to return computed


values to the driver program

Data-frames and Datasets


DATASETS

provide an API to access a distributed data collection



collection of strongly typed Java Virtual Machine objects

provides the combined benefits of both RDDs and Spark SQL

features:

immutable, cannot be deleted or lost

encoder that converts JVM objects to a tabular representation

extend DataFrame type-safe and object-oriented API capabilities

work with both Scala and Java APIs

dynamically typed languages, such as Python and R, do NOT support


dataset APIs

benefits

provide compile-time type safety

compute faster than RDDs

offer the benefits of Spark SQL and DataFrames

optimize queries using Catalyst and Tungsten

enable improved memory usage and caching

use dataset API functions for aggregate operations including sum,


average, join and group by.

creating a dataset in Scala

val ds = Seq("Alpha", "Beta", "Gamma").toDS()

//from text file


val ds = spark.read.text("/text_folder/file.txt").as[String]

case class Customer(name: String, id: Int, phone: Double)


val ds_cust = spark.read.json("/customer.json").as[Customer]

DataFrames

not typesafe

use APIs in Java, Scala, Python and R



built on top of RDDs and added in earlier Spark versions

Datasets

strongly-typed

use unified Java and Scala APIs

built on top of DataFrames and the latest data abstraction added to Spark

Catalyst and Tungsten


Goals of Spark SQL optimization:

reduce query time

reduce memory consumption

save organizations time and money

Catalyst
spark SQL's built-in rule-based query optimizer

based on functional programming constructs in Scala

supports the addition of new optimization techniques and features

enables developers to add data source-specific rules and support new data
types

tree data structure and a set of rules

four major phases of query execution

analysis

catalyst analyzes the query, the DataFrame, the unresolved
logical plan and the Catalog to create a logical plan

logical optimization

the logical plan evolves into an Optimized Logical Plan. This is


the rule-based optimization step of Spark SQL and rules such
as folding, pushdown and pruning are applied here

physical planning



describes computation on datasets with specific definitions on
how to conduct the computation. A cost model then chooses the
physical plan with the least cost. This is the cost-based
optimization step.

code generation

Catalyst applies the selected physical plan and generates Java
bytecode to run on each node.

Rule-based optimization → defines how to run the query


is the table indexed?
does the query contain only the required columns?

Cost-based optimization → equals time + memory a query consumes


what are the best paths for multiple datasets to use when querying data?
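One way to see what Catalyst produces is the DataFrame explain() method, which prints the parsed and analyzed plans, the optimized logical plan, and the physical plan. A quick sketch using the people DataFrame from the earlier SparkSQL examples (purely illustrative):

df = spark.read.json("people.json")
df.filter(df["age"] > 21).select("name").explain(True)
#prints the Parsed Logical Plan, Analyzed Logical Plan, Optimized Logical Plan and Physical Plan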

Tungsten
Spark's cost-based optimizer that maximizes CPU and memory performance

features

manages memory explicitly and does not rely on the JVM object model
or garbage collection

enables cache-friendly computation of algorithms and data structures
using STRIDE-based memory access

supports on-demand JVM byte code generation

does not generate virtual function dispatches

places intermediate data in CPU registers

ETL with DataFrames


basic DF operations

read the data



analyze the data

transform the data

load data into database

write data back to disk

There's also ELT

here the data resides in data lakes

some companies use a mixture between both ETL and ELT

Read the data


create a dataframe

create a dataframe from an existing dataframe

import pandas as pd
mtcars = pd.read_csv('mtcars.csv')
sdf = spark.createDataFrame(mtcars)

Analyze the data


view the schema and take note of data types

sdf.printSchema()

sdf.show(5)

sdf.select('mpg').show(5)

Transform the data


keep only relevant data

apply filters, joins, sources and tables, column operations, grouping and
aggregations and other functions

apply domain-specific data augmentation processes

sdf.filter(sdf['mpg'] < 18).show(5)  #mpg is a column

car_counts = sdf.groupby(['cyl']).agg({"wt": "count"}).sort("count(wt)", ascending=False)
car_counts.show(5)



Loading or exporting the data
final step of ETL pipeline

export to another database

export to disk as JSON files

save the data to a Postgres database

use an API to export data
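A minimal load/export sketch for the sdf DataFrame above; the output path, table name, and connection properties are illustrative assumptions:

#export to disk as JSON files
sdf.write.mode("overwrite").json("output/mtcars_json")

#save to a Postgres database over JDBC (requires the Postgres JDBC driver on the Spark classpath)
sdf.write.jdbc(url="jdbc:postgresql://localhost:5432/mydb",
               table="mtcars",
               mode="overwrite",
               properties={"user": "username", "password": "password"})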

Real-world usage of SparkSQL


SparkSQL:

spark module for structured data processing

runs SQL queries on Spark DataFrames

usable in java, scala, python and R

First step: create a table view

creating a table view is required to run SQL queries programmatically on a


DataFrame

a view is a temporary table to run SQL queries

a temporary view provides local scope within the current spark session.

a global temporary view provides global scope within the spark application.
Useful for sharing

df = spark.read.json("people.json")

df.createTempView("people")
spark.sql("SELECT * FROM people").show()

df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()

Aggregating data



DataFrames contain inbuilt common aggregation functions

count()

countDistinct()

avg()

max()

min()

others

alternatively, aggregate using SQL queries and tableviews

import pandas as pd

mtcars = pd.read_csv("mtcars.csv")
sdf = spark.createDataFrame(mtcars)

sdf.select('mpg').show(5)

#option 1: DataFrame API
car_counts = sdf.groupby(['cyl']).agg({"wt": "count"}).sort("count(wt)", ascending=False)
car_counts.show(5)

#option 2: SQL query on a table view
sdf.createTempView("cars")
spark.sql("SELECT cyl, COUNT(*) FROM cars GROUP BY cyl ORDER BY 2 DESC").show(5)

Data sources supported

parquet files

supports reading and writing, and preserving data schema

spark sql can also run queries without loading the file

JSON datasets

spark infers the schema and loads the dataset as a DataFrame

Hive tables

spark supports reading and writing data stored in Apache Hive
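A short sketch of these data sources; the file paths and table name are illustrative assumptions:

#parquet: write and read back, preserving the schema
sdf.write.mode("overwrite").parquet("output/mtcars.parquet")
parquet_df = spark.read.parquet("output/mtcars.parquet")

#run a query on a parquet file directly, without loading it into a DataFrame first
spark.sql("SELECT * FROM parquet.`output/mtcars.parquet`").show(5)

#JSON: spark infers the schema and loads the dataset as a DataFrame
people_df = spark.read.json("people.json")

#Hive: reading a Hive table works when the SparkSession is created with Hive support enabled
#spark.sql("SELECT * FROM my_hive_table").show()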

Summary & Highlights



RDDs are Spark's primary data abstraction partitioned across the nodes of the
cluster. Transformations leave existing RDDs intact and create new RDDs based
on the transformation function. With a variety of available options, apply
functions to transformations perform operations. Next, actions return computed
values to the driver program. Transformations undergo lazy evaluation, meaning
they are only evaluated when the driver function calls an action.

A dataset is a distributed collection of data that provides the combined benefits


of both RDDs and SparkSQL. Consisting of strongly typed JVM objects, datasets
make use of DataFrame typesafe capabilities and extend object-oriented API
capabilities. Datasets work with both Scala and Java APIs. DataFrames are not
typesafe. You can use APIs in Java, Scala, Python. Datasets are Spark's latest
data abstraction.

The primary goal of Spark SQL Optimization is to improve the run-time


performance of a SQL query, by reducing the query’s time and memory
consumption, saving organizations time and money. Catalyst is the Spark SQL
built-in rule-based query optimizer. Catalyst performs analysis, logical
optimization, physical planning, and code generation. Tungsten is the Spark
built-in cost-based optimizer for CPU and memory usage that enables cache-
friendly computation of algorithms and data structures.

Basic DataFrame operations are reading, analysis, transformation, loading, and


writing. You can use a Pandas DataFrame in Python to load a dataset and apply
the print schema, select function, or show function for data analysis. For
transform tasks, keep only relevant data and apply functions such as filters,
joins, column operations, grouping and aggregations, and other functions.

Spark SQL consists of Spark modules for structured data processing that can
run SQL queries on Spark DataFrames and are usable in Java, Scala, Python
and R. Spark SQL supports both temporary views and global temporary views.
Use a DataFrame function or an SQL Query + Table View for data aggregation.
Spark SQL supports Parquet files, JSON datasets and Hive tables.

Week 5
Apache Spark Architecture
two main processes



driver program

The driver program runs as one process per application.

The driver process can be run on a cluster node or another machine as
a client to the cluster. The driver runs the application’s user code,
creates work and sends it to the cluster.

executors

work independently

There can be many throughout a cluster and one or more per node,
depending on configuration.

Spark context

The Spark Context starts when the application launches and must be
created in the driver before DataFrames or RDDs. Any DataFrames or RDDs created under the
context are tied to it and the context must remain active for the life of them. The
driver program creates work from the user code called “Jobs” (or computations
that can be performed in parallel). The Spark Context in the driver divides the
jobs into tasks to be executed on the cluster. Tasks from a given job operate on
different data subsets, called Partitions. This means tasks can run in parallel in
the Executors. A Spark Worker is a cluster node that performs work.
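A minimal sketch of creating the Spark Context in the driver through a SparkSession; the application name and master URL are illustrative assumptions:

from pyspark.sql import SparkSession

#the driver creates one SparkSession (and its SparkContext) per application
spark = (SparkSession.builder
         .appName("my-spark-app")      #illustrative name
         .master("local[4]")           #or a cluster manager URL
         .getOrCreate())
sc = spark.sparkContext                #DataFrames and RDDs created later are tied to this context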

Spark executor

A Spark Executor utilizes a set portion of local resources as memory and


compute cores, running one task per available core. Each executor
manages its data caching as dictated by the driver. In general, increasing
executors and available cores increases the cluster’s parallelism. Tasks run
in separate threads until all cores are used. When a task finishes, the
executor puts the results in a new RDD partition or transfers them back to
the driver. Ideally, limit utilized cores to total cores available per node. For
instance, an 8-core node could have 1 executor process using 8 cores.

Stage

A “stage” in a Spark job represents a set of tasks an executor can complete on the
current data partition. When a task requires other data partitions, Spark must
perform a “shuffle.” A shuffle marks the boundary between stages. Subsequent
tasks in later stages must wait for that stage to be completed before beginning
execution, creating a dependency from one stage to the next. Shuffles are costly
as they require data serialization, disk and network I/O. This is because they
enable tasks to “pass over” other dataset partitions outside the current partition.
An example would be a “groupby” with a given key that requires scanning each
partition to find matching records. When Spark performs a shuffle, it redistributes
the dataset across the cluster.

For example, consider two stages separated by a shuffle. In Stage 1, a
transformation (such as a map) is applied on dataset “a,” which has 2 partitions
(“1a” and “2b”). This creates dataset “b”. The next operation requires a shuffle
(such as a “groupby”). Key values could exist in any other partition, so to group
keys of equal value together, tasks must scan each partition to pick out the
matching records. Transformation results are placed in Stage 2. Here results have
the same number of partitions, but this depends on the operation.

Final results are sent to the driver program as an action, such as collect. NOTE:
It is not advised to perform a collection to the driver on a large data set as it
could easily use up the driver process memory. If the data set is large, apply a
reduction before collection.
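A small sketch of a job with a shuffle boundary: the reduceByKey step triggers a shuffle, so the tasks before and after it fall into separate stages. The collect here is safe only because the result is tiny:

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

#stage 1: a narrow transformation, no other partitions needed
pairs = words.map(lambda w: (w, 1))

#reduceByKey needs matching keys from every partition, so Spark shuffles and a new stage begins
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())  #e.g. [('a', 3), ('b', 2), ('c', 1)]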

Overview of Apache Spark Cluster Modes


The Spark Cluster Manager communicates with a cluster to acquire resources for an
application to run. It runs as a service outside the application and abstracts the
cluster type. While an application is running, the Spark Context creates tasks and
communicates to the cluster manager what resources are needed. Then the cluster
manager reserves executor cores and memory resources. Once the resources are
reserved, tasks can be transferred to the executor processes to run

Spark has built-in support for several cluster managers:

Standalone manager is included


no additional dependencies required

two main components:

workers

run on cluster nodes. They start an executor process with one or more reserved cores



master

There must be one master available which can run on any cluster
node. It connects workers to the cluster and keeps track of them with
heartbeat polling. However, if the master is together with a worker, do
not reserve all the node’s cores and memory for the worker.

To manually set up a Spark Standalone cluster, start the Master. The Master is
assigned a URL with a hostname and port number. After the master is up, you
can use the Master URL to start workers on any node using bi-directional
communication with the master. Once the master and the workers are running,
you can launch a Spark application on the cluster by specifying the master URL
as an argument to connect them.

Apache Hadoop YARN


general-purpose cluster manager

supports other frameworks besides spark

has its own dependencies

To run Spark on an existing YARN cluster, use the ‘--master’ option with the
keyword YARN.

Apache mesos
general-purpose cluster manager

dynamic partitioning between Spark and other big data frameworks and
scalable partitioning between multiple Spark instances

may require additional set up

Kubernetes
runs containerized applications

This makes Spark applications more portable and helps automate


deployment, simplify dependency management and scale the cluster as
needed.

launch application on kubernetes


./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  <additional configuration>

Local mode
spark can also be run in local mode

not connect to cluster, easy to get started

runs in the same process that calls spark-submit

uses threads for running executor tasks

useful for testing

runs locally within a single process which can limit performance

to run:

#launch spark in local mode with 8 cores
./bin/spark-submit \
  --master local[8] \
  <additional configuration>

How to Run an Apache Spark Application


Spark submit
Unified interface for submitting applications

found in the bin/ directory

easily switches from local to cluster

1. parse command line arguments and options

2. read additional configuration specified in 'conf/spark-defaults.conf'

3. connect to the cluster manager specified with the '--master' argument or run in
local mode

4. transfer applications (JARs or python files) and any additional files specified to
be distributed and run in the cluster



Example launch python SparkPi to a Spark Standalone cluster. Estimate Pi with
1000 samples:

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py 1000

Spark shell
simple way to learn spark api

powerful tool to analyze data interactively

use in local mode or with a cluster

can initiate in scala and python

environment

SparkContext is automatically initialized and available as 'sc'

SparkSession is automatically available as 'spark'

expressions are entered in the shell and then evaluated in the driver to
become jobs that are scheduled as tasks for the cluster

//spark shell example in scala
//expr is not imported by default in spark-shell
import org.apache.spark.sql.functions.expr

val df = spark.range(10)
//shell output: df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

df.withColumn("mod", expr("id % 2")).show(4)



Summary & Highlights
Spark Architecture has driver and executor processes, coordinated by the Spark
Context in the Driver.

The Driver creates jobs and the Spark Context splits jobs into tasks which can
be run in parallel in the executors on the cluster. Stages are a set of tasks that
are separated by a data shuffle. Shuffles are costly, as they require data
serialization, disk and network I/O. The driver program can be run in either client
Mode (connecting the driver outside the cluster) or cluster mode (running the
driver in the cluster).

Cluster managers acquire resources and run as an abstracted service outside


the application. Spark can run on Spark Standalone, Apache Hadoop
YARN, Apache Mesos or Kubernetes cluster managers, with specific set-up
requirements. Choosing a cluster manager depends on your data ecosystem
and factors such as ease of configuration, portability, deployment, or data
partitioning needs. Spark can also run using local mode, which is useful for
testing or debugging an application.

'spark-submit’ is a unified interface to submit the Spark application, no matter the


cluster manager or application language. Mandatory options include telling
Spark which cluster manager to connect to; other options set driver deploy mode
or executor resourcing. To manage dependencies, application projects or
libraries must be accessible for driver and executor processes, for example by
creating a Java or Scala uber-JAR. Spark Shell simplifies working with data by
automatically initializing the SparkContext and SparkSession variables and
providing Spark API access.

Using Apache Spark on IBM Cloud


Why use Spark on IBM Cloud
cloud benefits

streamline deployment with less configuration

easily scale up to increase compute power

enterprise grade security



tie into existing IBM big data solutions for AIOps and apps for IBM Watson
and IBM Analytics Engine

What is AIOps
the application of artificial intelligence to automate or enhance IT operations

helps collect, aggregate and work with large volumes of operations data

helps identify events and patterns in infrastructure systems

diagnose root causes and report or fix them automatically

IBM Spectrum Conductor


run multiple spark apps and versions together, on a single large cluster

manage and share cluster resources as needed

provide enterprise grade security

Setting Apache Spark Configuration


Configuration types:

Properties: adjust and control application behavior

Environment variables: adjust settings on a per-machine basis

Logging: control how logging is output using 'conf/log4j-defaults.properties'



Property precedence (from highest to lowest): configuration set programmatically,
then spark-submit configuration, and lastly configuration set in spark-defaults.conf

Static configuration usually doesn't change, because changing it would require changing
the app

Dynamic configuration avoids hard-coding values such as the number of cores or
reserved memory

Environment variables are loaded from: conf/spark-env.sh

common usage: ensure all cluster nodes use the same python version via the
PYSPARK_PYTHON environment variable
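A minimal sketch of setting properties programmatically, which takes the highest precedence; the property values are illustrative assumptions:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("configured-app")              #static: tied to the application itself
         .config("spark.executor.memory", "4g")  #dynamic: tuned per deployment
         .config("spark.executor.cores", "2")
         .getOrCreate())

print(spark.conf.get("spark.executor.memory"))   #verify the effective value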

Running Spark on Kubernetes



Also abbreviated as k8s, it runs containerized applications on a cluster and is

open source

highly scalable

provides flexible, automated deployments

portable, so it can be run the same way whether in the cloud or on-premises

Use to manage containers that run distributed systems in a more resilient and
flexible way, with benefits including:

network service discovery

cluster load balancing

automated scale up and down

orchestrating storage

a local Kubernetes cluster can run on a machine using tools such as minikube

Running Spark on Kubernetes

containerization

better resource sharing

multiple spark apps can run concurrently and in isolation

./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode client \
  --class <application-main-class> \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.driver.pod.name=<pod-name> \
  local:///path/to/application.jar

Summary & Highlights


Running Spark on IBM Cloud provides enterprise security and easily ties in IBM
big data solutions for AIOps, IBM Watson and IBM Analytics Engine. Spark’s big
data processing capabilities work well with AIOps tools, using machine learning
to identify events or patterns and help report or fix issues. IBM Spectrum
Conductor manages and deploys Spark resources dynamically on a single
cluster and provides enterprise security. IBM Watson helps you focus on Spark’s
machine learning capabilities by creating automated production-ready
environments for AI. IBM Analytics Engine separates storage and compute to
create a scalable analytics solution alongside Spark’s data processing
capabilities.

You can set Spark configuration using properties (to control application
behavior), environment variables (to adjust settings on a per-machine basis) or
logging properties (to control logging outputs). Spark property configuration
follows a precedence order, with the highest being configuration set
programmatically, then spark-submit configuration and lastly configuration set in
the spark-defaults.conf file. Use Static configuration options for values that don’t
change from run to run or properties related to the application, such as the
application name. Use Dynamic configuration options for values that change or
need tuning when deployed, such as master location, executor memory or core
settings.

Use Kubernetes to run containerized applications on a cluster, to manage


distributed systems such as Spark with more flexibility and resilience. You can
run Kubernetes as a deployment environment, which is useful for trying out
changes before deploying to clusters in the cloud. Kubernetes can be hosted on
private or hybrid clouds, and set up using existing tools to bootstrap clusters, or
using turnkey options from certified providers. While you can use Kubernetes
with Spark launched either in client or cluster mode, when using Client mode,
executors must be able to connect with the driver and pod cleanup settings are
required.

Semana 6
The Apache Spark User Interface
Connect to the UI with the URL:

http://<driver-node>:4040

The Jobs tab displays the application’s jobs, including job status

The Stages tab reports the state of tasks within a stage.

The Storage tab shows the size of RDDs or DataFrames that persisted to
memory or disk.

The Environment tab information includes any environment variables and system
properties for Spark or the JVM.

The Executors tab displays a summary that shows memory and disk usage for
any executors in use.

If the application runs SQL queries, select the SQL tab and the Description
hyperlink to display the query’s details.

Monitoring Application Progress


Benefits:

quickly identify failed jobs and tasks

fast access to locate inefficient operations

Multiple related jobs:

from different sources

one or more DataFrames

actions applied to the DataFrames

Workflows can include:

jobs created by the SparkContext in the driver program

jobs in progress running as tasks in the executors

completed jobs transferring results back to the driver or writing to disk

How do jobs progress?

1. Spark jobs divide into stages, which connect as a Directed Acyclic Graph, or
DAG

2. Tasks for the current stage are scheduled on the cluster

3. When the stage completes all of its tasks, the next dependent stage in the DAG
begins.

4. The job progresses through the DAG until all stages are completed

→ If any task within a stage fails after several attempts, Spark marks the task,
stage, and job as failed and stops the application
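
As a rough illustration of how a single action becomes a job whose stages are
separated by a shuffle, here is a minimal PySpark sketch (the data and column
names are made up); the groupBy forces a shuffle, so the resulting job appears in
the UI as two stages:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-progress-demo").getOrCreate()

# Tiny illustrative dataset; any columnar data behaves the same way.
df = spark.createDataFrame(
    [("web", 10), ("mobile", 5), ("web", 7)], ["channel", "visits"]
)

# groupBy introduces a shuffle, which marks a stage boundary in the DAG.
totals = df.groupBy("channel").sum("visits")

# The action triggers the job; watch it on the Jobs and Stages tabs of the
# Spark UI at http://<driver-node>:4040 while the application is running.
totals.show()

spark.stop()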

History server (enable event logging in conf/spark-defaults.conf):

spark.eventLog.enabled true
spark.eventLog.dir <path-for-log-files>

# URL to connect to the Spark application UI history server


http://<host-url>:18080

#start history server


./sbin/start-history-server.sh
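
Once started, the history server reads the event logs written to
spark.eventLog.dir and serves the same tabs as the live application UI, but for
completed applications.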

Debugging Apache Spark Application Issues


Common app issues:

user code

driver program

configuration

app dependencies

app files

source files: Python script, Java JAR, required data files

app libraries

dependencies must be available on all nodes of the cluster (see the sketch after this list)

resource allocation

CPU and memory resources must be available for all tasks to run

any worker with free resources can start processes

Spark retries until a worker is free

network communication
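
One way to address the dependency and app-files issues above is to ship the
required files with the application so every executor can see them. A minimal
sketch, assuming a hypothetical helper module helpers.py and data file lookup.csv
sitting next to the driver script (both names are placeholders):

from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("ship-dependencies-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical files that tasks on every node will need.
sc.addPyFile("helpers.py")   # Python dependency, importable inside tasks
sc.addFile("lookup.csv")     # plain data file, copied to every executor

def use_dependency(value):
    import helpers                        # works on executors because it was shipped above
    path = SparkFiles.get("lookup.csv")   # local path to the shipped copy on this node
    return (value, path)

print(sc.parallelize([1, 2, 3]).map(use_dependency).collect())
spark.stop()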

Understanding Memory Resources


Executor memory:

processing

caching

excessive caching leads to issues

Driver memory:

loads data, broadcasts variables

handles results, such as collections

Data persistence or cache:

store intermediate calculations

persist to memory/disk

less computation
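
A minimal sketch of these ideas in PySpark (the memory sizes are illustrative, not
recommendations); note that driver memory normally has to be given on submit, for
example with --driver-memory, because it must be set before the driver JVM starts:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = (
    SparkSession.builder
    .appName("memory-demo")
    .config("spark.executor.memory", "4g")   # executor memory: task processing plus caching
    .getOrCreate()
)

df = spark.range(1_000_000)

# Persist an intermediate result so later actions avoid recomputation;
# MEMORY_AND_DISK spills to disk if the executors run short of cache space.
df.persist(StorageLevel.MEMORY_AND_DISK)

print(df.count())   # first action computes and caches the data
print(df.count())   # second action is served from the cache
df.unpersist()
spark.stop()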

Understanding Processor Resources


Spark assigns CPU cores to driver and executor processes

parallelism is limited by the number of cores available

executors process tasks in parallel, up to the number of cores assigned to the
application

after processing, CPU cores become available for future tasks

workers in the cluster contain a limited number of cores

if no cores are available to an app, the application must wait for currently running
tasks to finish

Spark queues tasks and waits for available executors and cores to maximize
parallel processing

parallel processing tasks mainly depend on the number of data partitions and
operations

app settings will override default behavior

Core utilization example:
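
A minimal PySpark sketch of core utilization (local mode, made-up sizes) showing
how the cores available to the application and the number of partitions bound
parallelism; on a real cluster, the per-executor core count would come from
settings such as --executor-cores, shown in the summary below:

from pyspark.sql import SparkSession

# Local master with 4 cores: at most 4 tasks of a stage run at the same time.
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("core-utilization-demo")
    .getOrCreate()
)

rdd = spark.sparkContext.parallelize(range(100), numSlices=8)

# 8 partitions -> 8 tasks; with 4 cores they run in roughly two waves,
# each core picking up a new task as soon as it finishes the previous one.
print(rdd.getNumPartitions())
print(rdd.map(lambda x: x * x).sum())   # the action that actually schedules the tasks

spark.stop()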

Summary & Highlights

To connect to the Apache Spark user interface web server, start your application
and connect to the application UI using the
following URL: http://<driver-node>:4040

The Spark application UI centralizes critical information, including status
information into the Jobs, Stages, Storage, Environment and Executors
tabbed regions. You can quickly identify failures, then drill down to the lowest
levels of the application to discover their root causes. If the application runs SQL
queries, select the SQL tab and the Description hyperlink to display the query’s
details.

The Spark application workflow includes jobs created by the Spark Context in
the driver program, jobs in progress running as tasks in the executors, and
completed jobs transferring results back to the driver or writing to disk.

Common reasons for application failure on a cluster include user code, system
and application configurations, missing dependencies, improper resource
allocation, and network communications. Application log files, located in the
Spark installation directory, will often show the complete details of a failure.

User code specific errors include syntax, serialization, and data validation
errors. Related errors can also happen outside the code. If a task fails due to an
error, Spark can attempt to rerun tasks for a set number of retries. If all
attempts to run a task fail, Spark reports an error to the driver and terminates
the application. The cause of an application failure can usually be found in the
driver event log.

Spark enables configurable memory for executor and driver processes. Executor
memory and Storage memory share a region that can be tuned.

Setting data persistence by caching data is one technique used to improve
application performance.

The following code example illustrates configuration of executor memory on
submit for a Spark Standalone cluster:

$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<spark-master-URL>:7077 \
    --executor-memory 10G \
    /path/to/examples.jar \
    1000

The following code example illustrates setting Spark Standalone worker memory
and core parameters:

# Start a standalone worker with a maximum of 10 GB memory and 8 cores
$ ./sbin/start-worker.sh \
    spark://<spark-master-URL> \
    --memory 10G --cores 8

Spark assigns processor cores to driver and executor processes during
application processing. Executors process tasks in parallel according to the
number of cores available or as assigned by the application.

You can apply the argument ‘--executor-cores 8’ on submit to set the number of
cores per executor. This example specifies eight cores.

You can specify the total executor cores for the application on a Spark standalone
cluster using the argument ‘--total-executor-cores’ followed by the number of
cores. For example, ‘--total-executor-cores 50’ specifies 50 cores.

When starting a worker manually in a Spark standalone cluster, you can specify
the number of cores the application uses by using the argument ‘--cores‘
followed by the number of cores. Spark’s default behavior is to use all available
cores.

Course Final Exam


Which of the following Apache Spark benefits helps manage big data processing?

Unified Framework

The three Apache Spark components are data storage, compute interface, and
cluster management framework. In which order does data flow through these
components?

the data from a Hadoop file system flows into the compute interface or
API, which then flows into different nodes to perform
distributed/parallel tasks.

Select the characteristics of datasets.


Strongly typed; use unified Scala and Java APIs; built on top of DataFrames; are
the latest data abstraction added to Spark

Which of the following features belong to Tungsten?

Manages memory explicitly and does not rely on the JVM object model or
garbage collector

Places intermediate data in CPU registers

How does IBM Spectrum Conductor help avoid downtime when running Spark?

Cluster resources divided dynamically

Spark dependencies require driver and cluster executor processes to be able to
access the application project. Java and Scala applications provide this access with
what?
uber-JAR

Which command specifies the number of executor cores for a Spark standalone
cluster for the application?

--total-executor-cores

Identify common areas where Spark application issues can happen.

User code, configuration, app dependencies, resource allocation, network communication

Select the answer that identifies the main components that describe the dimensions
of Big Data.

velocity, volume, variety, veracity

What is Data Scaling?

technique to manage, store and process the overflow of data

What is the current projected yearly growth rate for data?

40%

Which of the following Hadoop core components prepares the RAM and CPU for
Hadoop to run data in batch, stream, interactive, and graph processing?

YARN

What happens when Spark performs a shuffle? Select all that apply.

boundaries between stages are marked

datasets redistributed across cluster

What is the Spark property configuration that follows a precedence order, with the
highest being configuration set programmatically, then spark-submit configuration
and lastly configuration set in the spark-defaults.conf file?

Setting how many cores are used → this task configuration could change so
dynamic configuration handles it well

What are the required additional considerations when deploying Spark applications
on top of Kubernetes using client mode? Select all that apply.

the executors must be able to communicate and connect with the driver
program

use the driver's pod name to set spark.kubernetes.driver.pod.name

Select the answer that identifies the licensing types available for open-source
software.

Public domain, Copyleft, Permissive, Lesser General Public License

How does MapReduce keep track of its tasks?


unique keys

Which of the following characteristics are part of Hive rather than a traditional
relational database?

designed on the methodology of write once, read many

can handle petabytes of data

Select the option that most closely matches the steps associated with the Spark
Application Workflow.

The application creates a job. Spark divides the job into one or more stages. The
first stage starts tasks. The tasks run and, as one stage completes, the next
stage starts. When all tasks and stages complete, the next job can begin.
