Introduction To Big Data With Spark and Hadoop
Week 1
Course introduction
The latest statistics report that the accumulated world’s data will grow from 4.4
zettabytes to 44 zettabytes, with much of that data classified as Big Data. Revenues
based on Big Data analytics are projected to increase to $103 billion by 2027.
Understandably, organizations across industries want to harness the competitive
advantages of Big Data analytics. This course provides you with the foundational
knowledge and hands-on lab experience you need to understand what Big Data is
and learn how organizations use Apache Hadoop, Apache Spark, including Apache
Spark SQL, and Kubernetes to expedite and optimize Big Data processing.
Life cycle:
business case → data collection → data modeling → data processing → data
visualization
1024 GB → 1 Terabyte
1024 TB → 1 Petabyte
1024 PB → 1 Exabyte
1024 EB → 1 Zettabyte
1024 ZB → 1 Yottabyte
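A quick arithmetic check of the ladder above, expressed in GB (illustrative only, not course code):
GB = 1
TB = 1024 * GB
PB = 1024 * TB
EB = 1024 * PB
ZB = 1024 * EB
print(f"1 ZB = {ZB:,} GB")   # 1 ZB = 1,099,511,627,776 GB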
VELOCITY:
Attributes:
batch
Drivers:
VOLUME:
Attributes:
petabytes
exabytes
zettabytes
Drivers:
scalable infrastructure
VARIETY:
Attributes:
structure
complexity
origin
Drivers:
mobile tech
scalable infrastructure
resilience
fault recovery
VERACITY:
Attributes:
integrity
ambiguity
Drivers:
robust ingestion
ETL mechanisms
Parallel processing:
The problem is also divided into instructions, but each instruction goes to a
separate node with equal processing capacity and they are executed in parallel.
Errors can be re-executed locally without affecting other instructions
faster
HORIZONTAL SCALING:
if one process fails, it doesn't affect the others and can be easily re-run
Fault tolerance:
if node one has partitions P1, P2 and P3, and it fails, we can easily add a new
node and recover those partitions from the copies stored on other nodes
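A toy sketch of this idea (not course code; the partition-to-node mapping is made up):
# HDFS-style replication: each partition is stored on more than one node
blocks = {"P1": {"node1", "node2"}, "P2": {"node1", "node3"}, "P3": {"node1", "node4"}}
failed = "node1"
# every partition still has at least one surviving replica to copy onto a new node
recoverable = all(len(replicas - {failed}) >= 1 for replicas in blocks.values())
print(recoverable)  # True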
Data technologies
HDFS
spark
mapReduce
cloudera
databricks
Analytics and visualization
palantir
SAS
pentaho
teradata
Business intelligence
cognos
oracle
powerBI
business objects
hyperion
Cloud providers
ibm
aws
oracle
Programming tools
R
python
scala
julia
Hadoop MapReduce
still used in the industry → 70% of the world's big data resides on HDFS
resource manager (YARN)
default resource manager for many big data apps, including Hive and Spark
Kubernetes is slowly becoming the new de facto standard, but YARN is still
widely used
social media
images
video
comments
machine data
iot
sensors
transactional data
invoices
storage records
delivery receipts
price analytics
sentiment analysis
insurance
fraud analytics
risk assessment
telecom
optimized pricing
manufacturing
predictive maintenance
production optimization
automotive industry
predictive support
finance
customer segmentation
algorithmic trading
Big Data Analytics helps companies gain insights from the data collected by IoT
devices.
"Embarrassingly parallel” calculations are the kinds of workloads that can easily
be divided and run independently of one another. If any single process fails, that
process has no impact on the other processes and can simply be re-run.
Open-source projects, which are free and completely transparent, run the world
of Big Data and include the Hadoop project and big data tools like Apache Hive
and Apache Spark.
The Big Data tool ecosystem includes the following six main tooling categories:
data technologies, analytics and visualization, business intelligence, cloud
providers, NoSQL databases, and programming tools.
Week 2
Introduction to Hadoop
open-source program to process large data sets
core components
HDFS: handles large data sets running on commodity hardware, that is, low-
specifications industry-grade hardware. HDFS scales a single Hadoop
cluster to as many as thousands of nodes
YARN: prepares the RAM and CPU in Hadoop to run batch, stream, interactive,
and graph processing workloads
Intro to MapReduce
programming model used in hadoop for processing big data
based on java
two tasks (see the word-count sketch at the end of this section)
MAP
input file
REDUCE
MAPREDUCE
flexibility
social media
financial industries
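A minimal sketch of the word-count idea in plain Python (the real Hadoop implementation is in Java; this only illustrates the map, shuffle/group, and reduce steps):
from collections import defaultdict

def map_phase(line):
    # MAP: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # REDUCE: sum all counts collected for one word
    return (word, sum(counts))

lines = ["hello world", "hello hadoop"]
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)          # shuffle/sort: group values by key

print([reduce_phase(w, c) for w, c in grouped.items()])
# [('hello', 2), ('world', 1), ('hadoop', 1)]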
Hadoop Ecosystem
made of components that support one another for big data processing
INGEST DATA (ex: Flume and Sqoop) → STORE DATA (ex: HDFS and HBase) →
PROCESS AND ANALYZE DATA (ex: Pig and Hive) → ACCESS DATA (ex: Impala
and Hue)
Ingest data
Flume
uses a simple extensible data model that allows for online analytic
application
Sqoop
Store data
HBase
stores data as indexes to allow for random and faster access to data
Cassandra
Analyze data
Pig
procedural data flow language that follows an order and a set of commands
Hive
creating reports
Access data
Impala
Hue
provides editors for several SQL query languages like Hive and MySQL
HDFS
Hadoop distributed file system
splits the files into blocks, creates replicas of the blocks, and stores them on
different machines
this means that HDFS provides a constant bitrate when transferring data
rather than having the data being transferred in waves
Key features
cost efficient
replication
fault tolerant
scalable
portable
Blocks
minimum amount of data that can be read or written
each stored file doesn't have to take up the full configured block size
Nodes
a node is a single system which is responsible for storing and processing data
regulates file access to the clients and maintains, manages, and assigns
tasks to the secondary node
Replication
creating a copy of the data block
read
client will send a request to the primary node to get the location of the data
nodes containing blocks
a client fulfills a user's request by interacting with the name and data nodes
write
if the file doesn't exist, the client is given access to start writing files
HDFS architecture
Hive
data warehouse software within hadoop that is designed for reading, writing and
managing tabular-type datasets and data analysis
Traditional RDBMS vs Hive:
RDBMS: suited for real-time/dynamic data analysis, like data from sensors;
Hive: suited for static data analysis, like a text file containing names
RDBMS: maximum data size it can handle is terabytes;
Hive: maximum data size it can handle is petabytes
RDBMS: enforces that the schema must verify loading data before it can proceed;
Hive: doesn't enforce the schema to verify loading data
RDBMS: may not always have built-in support for data partitioning;
Hive: supports partitioning
Hive architecture
Hive services
Apache HBASE
column-oriented non-relational database management system
works well with real-time data and random read and write access to big data
has a master node to manage the cluster and region servers to perform the work
HBase vs HDFS
HBase architecture
Concepts
Region Servers
Region
two components:
HFile
Memstore
Zookeeper
hdfs-site.xml: governs the location for storing node metadata, fsimage file and
log file
#For the docker image, these xml files have been configured already. You can see these
#in the directory /opt/hadoop-3.2.1/etc/hadoop/ by running
ls /opt/hadoop-3.2.1/etc/hadoop/*.xml

#Copy all the hadoop configuration xml files into the input directory.
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/root/input

#Copy the data.txt into the /user/root directory to pass it into the wordcount problem.
hdfs dfs -put data.txt /user/root

#Check if the file has been copied into HDFS by viewing its content.
hdfs dfs -cat /user/root/data.txt

#Run the MapReduce application for wordcount on data.txt and store the output in /user/root/output.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount data.txt /user/root/output

#Once the word count runs successfully, you can run the following command to see the output file it has generated.
hdfs dfs -ls /user/root/output/

#While it is still processing, you may only see '_temporary' listed in the output directory. Wait for a couple of minutes and run the command again till you see the output files.
The four main stages of the Hadoop Ecosystem are Ingest, Store, Process and
Analyze, and Access.
Key HDFS benefits include its cost efficiency, scalability, data storage expansion
and data replication capabilities. Rack awareness helps reduce the network
traffic and improve cluster performance. HDFS enables “write once, read
many” operations.
Suited for static data analysis and built to handle petabytes of data, Hive is a
data warehouse software for reading, writing, and managing datasets. Hive is
based on the “write once, read many” methodology, doesn’t enforce the schema
to verify loading data and has built-in partitioning support.
Week 3
Why use Apache Spark?
open source in-memory application framework for distributed data processing
and iterative analysis on massive data volumes
written in Scala
distributed computing
different from PARALLEL COMPUTING, where processors access shared memory
sparkSQL
RDDs
A resilient distributed dataset is:
immutable
text
sequenceFiles
Avro
Parquet
local
cassandra
amazon S3
others
→ Spark applications consist of a driver program that runs the user's main
functions and multiple parallel operations on a cluster
HDFS
cassandra
HBase
amazon s3
data = [1, 2, 3, 4]
distData = sc.parallelize(data)
Parallel programming
simultaneous use of multiple compute resources to solve a computational
problem
Spark core
is a base engine
is fault-tolerant
manages memory
schedules tasks
The driver contains the Spark jobs that the application needs to run and splits
the jobs into tasks submitted to the executors. The driver receives the task
results when the executors complete the tasks. If Apache Spark were a large
organization or company, the driver code would be the executive management of
that company that makes decisions about allocating work, obtaining capital, and
more. The junior employees are the executors who do the jobs assigned to them
with the resources provided.
The worker nodes correspond to the physical office space that the employees
occupy. You can add additional worker nodes to scale big data processing
incrementally.
Spark SQL
used to query structured data inside Spark programs, using either SQL or a
familiar DataFrame API
Benefits of SparkSQL:
scales to thousands of nodes and multi-hour queries using the spark engine,
which provides full mid-query fault tolerance
DataFrames:
uses RDD
benefits
seamless integration with all big data tooling and infrastructure via Spark
APIs for Python, Java, Scala, and R (R support is in development via SparkR)
#sql query
spark.sql("SELECT * FROM people").show()
#sql query
spark.sql("SELECT age, name FROM people WHERE age > 21").show()
This snippet should get everything set up, but it doesn't always work:
import findspark
findspark.init()
To solve it:
1. Set SPARK_HOME to the Spark installation path, e.g.
SPARK_HOME="/path/spark-3.2.0-bin-hadoop3.2"
2. Reboot
3. Verify with: printenv SPARK_HOME
4. Problem solved
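A sketch of the same fix from inside a notebook, assuming the installation path above; findspark.init() also accepts the Spark home path directly:
import os
import findspark

os.environ["SPARK_HOME"] = "/path/spark-3.2.0-bin-hadoop3.2"   # assumed install path
findspark.init()   # or: findspark.init("/path/spark-3.2.0-bin-hadoop3.2")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()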
Practice quiz
1. Benefits of working with Spark:
it is an open-source, in-memory application framework for distributed data
processing
5. Which SQL query options would display the names column from this DataFrame
example?
- df.select( df["name"] ).show()
- spark.sql("SELECT name FROM people").show()
- df.select("name").show()
Week 4
dataset is partitioned
Transformations
create a new RDD from an existing one
are "lazy" because the results are only computed when evaluated by actions
Actions
actions return a value to the driver program after running a computation (see the sketch after this list)
reduce()
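A minimal sketch of lazy transformations followed by an action, assuming a SparkContext sc is already available:
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)              # transformation: lazy, nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)    # another lazy transformation
total = evens.reduce(lambda a, b: a + b)        # action: triggers the actual computation
print(total)  # 120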
Spark uses a unique data structure called a DAG and an associated DAG Scheduler
to perform RDD operations.
in the Apache Spark DAG, the vertices represent RDDs and the edges represent
operations such as transformations or actions
if a node goes down, Spark replicates the DAG and restores the node
2. Spark enables the DAG Scheduler to perform a transformation and updates the
DAG
4. The pointer that transforms RDD is returned to the Spark driver program
5. If there is an action, the driver program that calls the action evaluates the DAG
only after Spark completes the action
features:
benefits
DataFrames
not typesafe
Datasets
strongly-typed
built on top of DataFrames and the latest data abstraction added to Spark
Catalyst
Spark SQL built-in rule-based query optimizer
enables developers to add data source-specific rules and support new data
types
analysis
logical optimization
physical planning
code generation
Tungsten
Spark's cost-based optimizer that maximizes CPU and memory performance
features
manages memory explicitly and does not rely on the JVM object model
or garbage collection
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()   # Spark session used by the examples below
mtcars = pd.read_csv('mtcars.csv')
sdf = spark.createDataFrame(mtcars)          # convert the pandas DataFrame to a Spark DataFrame
sdf.printSchema()
sdf.show(5)
sdf.select('mpg').show(5)
apply filters, joins, sources and tables, column operations, grouping and
aggregations and other functions
a temporary view provides local scope within the current Spark session.
a global temporary view provides global scope within the Spark application.
Useful for sharing data across different Spark sessions.
df = spark.read.json("people.json")
df.createTempView("people")
spark.sql("SELECT * FROM people").show()
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
Aggregating data
count()
countDistinct()
avg()
max()
min()
others
import pandas as pd
mtcars = pd.read_csv("mtcars.csv")
sdf = spark.createDataFrame(mtcars)
sdf.select('mpg').show(5)
#option 1: DataFrame API
car_counts = sdf.groupby(['cyl'])\
    .agg({"wt": "count"})\
    .sort("count(wt)", ascending=False)
car_counts.show(5)
#option 2: SQL query on a temporary view
sdf.createTempView("cars")
spark.sql("SELECT cyl, COUNT(*) FROM cars GROUP BY cyl ORDER BY 2 DESC").show(5)
parquet files
Spark SQL can also run queries directly on the file without loading it into a
DataFrame first (see the sketch after this list)
JSON datasets
Hive tables
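A minimal sketch of querying a Parquet file in place, without loading it into a DataFrame first (users.parquet is a hypothetical file):
# Spark SQL can address the file directly with the parquet.`path` syntax
df = spark.sql("SELECT * FROM parquet.`users.parquet`")
df.show()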
Spark SQL consists of Spark modules for structured data processing that can
run SQL queries on Spark DataFrames and are usable in Java, Scala, Python
and R. Spark SQL supports both temporary views and global temporary views.
Use a DataFrame function or an SQL Query + Table View for data aggregation.
Spark SQL supports Parquet files, JSON datasets and Hive tables.
Week 5
Apache Spark Architecture
two main processes
executors
work independently
There can be many throughout a cluster and one or more per node,
depending on configuration.
Spark context
The Spark Context starts when the application launches and must be
created in the driver
Spark executor
Stage
a set of tasks within a job that can be completed on the current data partition.
When a task requires other data partitions, Spark must perform a "shuffle." A
shuffle marks the boundary between stages; subsequent stages must wait for the
previous stage to complete before they can begin.
Final results are sent to the driver program as an action, such as collect. NOTE:
It is not advised to perform a collection to the driver on a large data set as it
could easily use up the driver process memory. If the data set is large, apply a
reduction before collection.
workers
There must be one master available which can run on any cluster
node. It connects workers to the cluster and keeps track of them with
heartbeat polling. However, if the master is together with a worker, do
not reserve all the node’s cores and memory for the worker.
To manually set up a Spark Standalone cluster, start the Master. The Master is
assigned a URL with a hostname and port number. After the master is up, you
can use the Master URL to start workers on any node using bi-directional
communication with the master. Once the master and the workers are running,
you can launch a Spark application on the cluster by specifying the master URL
as an argument to connect them.
To run Spark on an existing YARN cluster, use the ‘--master’ option with the
keyword YARN.
Apache mesos
general-purpose cluster manager
dynamic partitioning between Spark and other big data frameworks and
scalable partitioning between multiple Spark instances
Kubernetes
runs containerized applications
Local mode
spark can also be run in local mode
to run:
#launch spark in local mode with 8 cores
./bin/spark-submit \
  --master local[8] \
  <additional configuration>
3. connect to the cluster manager specified with the '--master' argument or run in
local mode
4. transfer applications (JARs or python files) and any additional files specified to
be distributed and run in the cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
Spark shell
simple way to learn spark api
environment
expressions are entered in the shell and then evaluated in the driver to
become jobs that are scheduled as tasks for the cluster
The Driver creates jobs and the Spark Context splits jobs into tasks which can
be run in parallel in the executors on the cluster. Stages are a set of tasks that
are separated by a data shuffle. Shuffles are costly, as they require data
serialization, disk and network I/O. The driver program can be run in either client
mode (connecting the driver outside the cluster) or cluster mode (running the
driver in the cluster).
What is AIOps
the application of artificial intelligence to automate or enhance IT operations
helps collect, aggregate and work with large volumes of operations data
common usage: ensure all cluster nodes use the same python version
PYSPARK_PYTHON environment variable
open source
highly scalable
portable, so it can be run in the same way whether in the cloud or on-premises
Use to manage containers that run distributed systems in a more resilient and
flexible way, with benefits including:
orchestrating storage
containerization
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode client \
  --class <application-main-class> \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.driver.pod.name=<pod-name> \
  local:///path/to/application.jar
You can set Spark configuration using properties (to control application
behavior), environment variables (to adjust settings on a per-machine basis) or
logging properties (to control logging outputs). Spark property configuration
follows a precedence order, with the highest being configuration set
programmatically, then spark-submit configuration and lastly configuration set in
the spark-defaults.conf file. Use Static configuration options for values that don’t
change from run to run or properties related to the application, such as the
application name. Use Dynamic configuration options for values that change or
need tuning when deployed, such as master location, executor memory or core
settings.
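A minimal sketch of the highest-precedence option, programmatic configuration (the property values here are illustrative):
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName("config-example")           # static: application name
        .setMaster("local[4]")                   # typically dynamic: master location
        .set("spark.executor.memory", "2g"))     # typically dynamic: executor memory

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.executor.memory"))  # 2g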
Week 6
The Apache Spark User Interface
Connect to the UI with the URL:
http://<driver-node>:4040
the Storage tab shows the size of RDDs or DataFrames that persisted to
memory or disk.
the Environment tab information includes any environment variables and system
properties for Spark or the JVM.
the Executors tab displays a summary that shows memory and disk usage for
any executors in use
If the application runs SQL queries, select the SQL tab and the Description
hyperlink to display the query’s details.
1. Spark jobs divide into stages, which connect as a Directed Acyclic Graph, or
DAG
3. When the stage completes all of its tasks, the next dependent stage in the DAG
begins.
4. The job progresses through the DAG until all stages are completed
→ If any of the tasks within a stage fail, after several attempts, Spark marks the task,
stage, and job as failed and stops the application
History server:
spark.eventLog.enabled true
spark.eventLog.dir <path-for-log-files>
user code
driver program
app dependencies
app files
app libraries
resource allocation
CPU and memory resources must be available for all tasks to run
network communication
processing
caching
Driver memory:
persist to memory/disk
less computation
if no cores are available to an app, the application must wait for currently running
tasks to finish
spark queues tasks and waits for available executors and cores for maximized
parallel processing
parallel processing tasks mainly depend on the number of data partitions and
operations
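A small sketch of checking and changing the number of partitions, which bounds how many tasks can run in parallel (assuming a SparkContext sc; the numbers are illustrative):
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())   # 8 partitions -> up to 8 parallel tasks
rdd4 = rdd.repartition(4)       # fewer partitions -> fewer, larger parallel tasks
print(rdd4.getNumPartitions())  # 4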
The Spark application workflow includes jobs created by the Spark Context in
the driver program, jobs in progress running as tasks in the executors, and
completed jobs transferring results back to the driver or writing to disk.
Common reasons for application failure on a cluster include user code, system
and application configurations, missing dependencies, improper resource
allocation, and network communications. Application log files, located in the
Spark installation directory, will often show the complete details of a failure.
User code specific errors include syntax, serialization, and data validation errors. Related
errors can happen outside the code. If a task fails due to an error, Spark can
attempt to rerun tasks for a set number of retries. If all attempts to run a task fail,
Spark reports an error to the driver and terminates the application. The cause of
an application failure can usually be found in the driver event log.
Spark enables configurable memory for executor and driver processes. Executor
memory and Storage memory share a region that can be tuned.
Spark Standalone worker memory and core parameters can be set when starting the
worker, for example with the --memory and --cores options (or the
SPARK_WORKER_MEMORY and SPARK_WORKER_CORES environment variables).
You can specify the executor cores for a Spark standalone cluster for the
application using the argument '--total-executor-cores' followed by the
number of cores, for example '--total-executor-cores 50' to specify 50 cores.
When starting a worker manually in a Spark standalone cluster, you can specify
the number of cores the application uses by using the argument ‘--cores‘
followed by the number of cores. Spark’s default behavior is to use all available
cores.
Unified Framework
The three Apache Spark components are data storage, compute interface, and
cluster management framework. Which order does the data flow through these
components?
the data from a Hadoop file system flows into the compute interface or
API, which then flows into different nodes to perform
distributed/parallel tasks.
Manages memory explicitly and does not rely on the JVM object model or
garbage collector
How does IBM Spectrum Conductor help avoid downtime when running Spark?
Which command specifies the number of executor cores for a Spark standalone
cluster for the application?
--total-executor-cores
Select the answer that identifies the main components that describe the dimensions
of Big Data.
40%
YARN
What happens when Spark performs a shuffle? Select all that apply.
What is the Spark property configuration that follows a precedence order, with the
highest being configuration set programmatically, then spark-submit configuration
and lastly configuration set in the spark-defaults.conf file?
Setting how many cores are used → this task configuration could change so
dynamic configuration handles it well
What are the required additional considerations when deploying Spark applications
on top Kubernetes using client mode? Select all that apply.
the executors must be able to communicate and connect with the driver
program
Select the answer that identifies the licensing types available for open-source
software.
Which of the following characteristics are part of Hive rather than a traditional
relational database?
Select the option that most closely matches the steps associated with the Spark
Application Workflow.
The application creates a job. Spark divides the job into one or more stages. The
first stage starts tasks. The tasks run and, as one stage completes, the next
stage starts. When all tasks and stages complete, the next job can begin.