Big Data Computing: Working With Data Models and Big Data Processing
Day 4
Working with data models and
Big Data processing
In this session…
• Streaming data: what it is and why it is different
• Data Lakes
• DBMS‐based and non‐DBMS‐based approaches to Big data
• Big Data Management Systems
• Retrieving Big Data: MongoDB
• Apache Spark
– The Spark Stack, including SparkSQL, Spark Streaming, MLlib, GraphX
– Architecture and basic concepts
– Programming in Spark: processing RDDs, transformations and actions
• Hands‐on activities
1. Exploring streaming Twitter data
2. Querying semi‐structured data with MongoDB
3. WordCount in Spark
4. Exploring SparkSQL and Spark DataFrames
5. Sensor Data with Spark Streaming
Streaming data: what it is and why it is different
Streaming data
• We mentioned previously that one of the Big Data challenges
was the velocity of data, which arrives at varying rates.
• For some applications this presents the need to process data
as it is generated, or in other words, as it streams.
• We call these types of applications:
Streaming Data Processing Applications
• This terminology refers to a constant stream of data flowing from a source.
– E.g. data from a sensory machine or data from social media.
Streaming data – example application
• FlightStats, https://www.flightstats.com
– It processes ~60 million flight events per week coming into its data acquisition system, and turns them into real‐time
intelligence for airlines and millions of travellers, daily.
Data stream – challenges
• Conventional data management architectures are built primarily on the concept of
persistent, static data collections.
• Streams pose very difficult challenges for these conventional data management architectures
– as most often we have only one chance to look at, and process, streaming data before receiving
more.
• Streaming data management systems cannot be separated from real‐time processing of data.
– Managing and processing data in motion is a typical capability of streaming data systems.
Streaming Data Systems – characteristics
Streaming data systems:
• are designed to manage relatively simple computations
• process the data stream one record at a time, or over a small time window of data
• perform computations in near‐real‐time – sometimes in memory
• perform computations that are independent of one another
• often subscribe to the stream source non‐interactively
– this means they send nothing back to the source, nor do they establish any interaction with it
The sheer size, variety and velocity of big data add further challenges.
Data stream – dynamic steering
• The concept of dynamic steering involves dynamically changing the next steps or direction
of an application through an ongoing computational process over the streaming data.
– Dynamic steering is often a part of streaming data management and processing.
Example of dynamic steering application: Self‐driving car
Streaming Data Systems
• Examples of big data streaming systems:
Why is Streaming Data different?
Data‐at‐rest:
• Mostly static data from one or more sources
• Collected prior to analysis
• Analysis of data‐at‐rest is called batch or static processing
Data‐in‐motion:
• Analysed as it is generated
– E.g. sensor data processing in a plane or a self‐driving car
• Analysis of data‐in‐motion is called stream processing
Data processing algorithms
Static / Batch Processing – size determines time and space:
– The run time and memory usage of most algorithms is usually dependent on the data size, which can
easily be calculated from files or databases.
Stream Processing – unbounded size, but finite time and space:
– The size of the data is unbounded, and this changes the types of algorithms that can be used.
– Algorithms that require iterating or looping over the whole data set are not possible, since with
stream data you never get to the end (a small sketch follows).
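To make the contrast concrete, here is a minimal sketch (illustrative only, not from the original slides; the function names are made up): a batch computation needs the complete data set, whereas a streaming computation keeps a small running state that is updated one element at a time.

# Batch (data-at-rest): the whole collection must be available before we can compute.
def batch_mean(values):
    return sum(values) / len(values)      # iterates over the complete data set

# Streaming (data-in-motion): keep a small running state, update it per element.
def make_running_mean():
    state = {"count": 0, "total": 0.0}
    def update(x):
        state["count"] += 1
        state["total"] += x
        return state["total"] / state["count"]   # current estimate after each element
    return update

update = make_running_mean()
for reading in [3.0, 5.0, 4.0]:           # stands in for an unbounded stream
    current_mean = update(reading)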
Streaming Data Management and Processing
• Streaming Data Management and Processing should enable:
1. Computations on one data element or a small window
of data elements at a time.
– These computations can update metrics, monitor and
plot statistics on the streaming data.
2. Relatively fast and simple computations
– Since computations need to be completed in real time, the tasks processing the streaming data should
complete faster than (or not much slower than) the rate at which the data streams in (the data velocity).
3. No interactions with the data source
– In most streaming systems, the management and processing system subscribes to the data source,
but does not send anything back to the stream source in terms of feedback or interactions.
Streaming Data
• These requirements for streaming data processing are quite different from those of batch processing.
• In batch processing, the analytical steps have access to (often) all data and can take more
time to complete a complex analytical task with less pressure on the completion time of
individual data management and processing tasks.
• Most organisations today use a hybrid architecture for processing streaming and batch jobs
at the same time, which is sometimes referred to as the lambda architecture.
Streaming Data – Lambda architecture
• Lambda architecture is a data‐processing architecture designed to handle massive
quantities of data by taking advantage of both batch‐ and stream‐processing methods.
(Timeline diagram: successive periods of data, each handled by both a batch layer and a real‐time layer, running up to “Now”; a small sketch of the idea follows.)
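As a rough illustration only (the names batch_view, realtime_view and merge are hypothetical, not from the slides): a query against a lambda architecture combines a periodically recomputed batch view with a continuously updated real‐time view.

# Batch layer: recomputed periodically over all historical data.
batch_view = {"#worldcup": 1_000_000}     # e.g. counts up to the last batch run

# Speed (real-time) layer: updated continuously from the stream since that run.
realtime_view = {"#worldcup": 4_250}

def merge(key):
    # Serving layer: answer queries by combining both views.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(merge("#worldcup"))                 # 1,004,250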
Streaming data – challenges
• Scalability
– To accommodate rapid growth in traffic and data volume (scaling up)
– To adapt to decreases in demand (scaling down)
• Data replication and durability
Data availability vs. data durability – they are not the same thing:
– Data availability refers to system uptime, i.e. the storage system is operational and can deliver data
upon request.
– Data durability refers to long‐term data protection, i.e. the stored data does not suffer from bit rot,
degradation or other corruption. It is concerned with data redundancy rather than hardware
redundancy, so that data is never lost or compromised.
Streaming data – challenges
• Among many, there are two main challenges that need to be overcome to avoid data loss
and to enable real‐time analytical tasks:
Streaming data – changes in size and frequency
The size and frequency of the stream data can change significantly over time:
• Size – e.g. streaming data found on social networks can increase in volume during holidays, sports
matches, or major news events.
• Frequency – the changes can be periodic or sporadic.
Streaming data – periodic changes
Data changes may be periodic
Periodic: evenings,
weekends, etc.
Example:
People may post messages on social
media more in the evenings.
Streaming data – sporadic changes
Data changes may be sporadic
Sporadic: major
events.
Examples:
There can be an increase in data size
and frequency during major events,
sport matches, etc.
Streaming data – extreme changes example
• Example of extreme data fluctuation:
Average Tweets / Second: 6,000
During the first 10 years of Twitter, the record for most tweets per minute was set during
Germany's victory over Argentina in the 2014 World Cup.
There were 618,725 tweets in
the 60 seconds after the final
whistle was blown
Data Lakes
Data Lakes
• With big data streaming from different sources in
varying formats, models, and speeds, we need to be
able to ingest this data into a fast and scalable
storage system that is flexible enough to serve many
current and future analytical processes.
• This is when traditional data warehouses with strict
data models and data formats do not fit the big data
challenges for streaming and batch applications.
• The concept of a data lake was created in response
to these big data storage and processing challenges.
What is a Data Lake?
• A data lake is a part of a big data infrastructure that many streams can flow into and get
stored for processing in their original form.
• We can think of it as a massive storage depository with huge processing power and the ability to
handle a very large number of concurrent data management and analytical tasks.
How do Data lakes work?
• The concept can be compared to a water body, a lake, where water flows in, fills up a reservoir, and
flows out.
How do Data lakes work? – in more detail
• In a Data Lake:
– The data gets loaded from its source, stored in its native format until it is needed, at which time,
the applications can freely read the data and add structure to it.
– This is called ‘schema‐on‐read’.
– This approach ensures all data is stored for a potentially unknown use at a later time.
• In contrast, in a Data Warehouse:
– The data is loaded into the warehouse after transforming it into a well‐defined and structured
format.
– This is called ‘schema‐on‐write’.
– Any application using the data needs to know this format in order to retrieve and use the data.
– In this approach, data is not loaded into the warehouse unless there is a use for it.
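A minimal schema‐on‐read sketch in PySpark (illustrative only; the path and field names are assumptions): the raw JSON records sit in the lake untouched, and the structure is declared by the application at read time.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: the records were landed in the data lake in their native format;
# the structure is imposed only now, by the application that reads them.
schema = StructType([
    StructField("user", StringType()),
    StructField("text", StringType()),
    StructField("timestamp", LongType()),
])
tweets = spark.read.schema(schema).json("/datalake/raw/tweets/")   # hypothetical path

# A data warehouse would instead require transforming the data into a well-defined
# structure before loading it at all (schema-on-write).
tweets.select("user", "text").show()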
Data warehouse vs. Data lake
Data warehouse:
• Stores data in a hierarchical file system with a well‐defined structure.
Data lake:
• Stores data as flat files, each with a unique identifier.
• This often gets referred to as object storage in big data systems.
Data lake object storage
Data lake
• Each data item is stored as a Binary Large
Object (BLOB) and is assigned a unique
identifier.
• Each data object is tagged with a
number of metadata tags.
• The data can be searched using these
metadata tags to retrieve it.
• In Hadoop data architectures, data is loaded into HDFS and processed using the appropriate data management and
analytical systems on commodity clusters.
• The selection of the tools is based on the nature of the problem being solved, and the data format being accessed.
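The sketch below is a toy illustration of the object‐storage idea in plain Python (not a real object store; the helper names are made up): each BLOB gets a unique identifier plus metadata tags, and retrieval is a search over the tags.

import uuid

object_store = {}   # object id -> {"blob": ..., "tags": {...}}

def put(blob, tags):
    oid = str(uuid.uuid4())                      # unique identifier for the object
    object_store[oid] = {"blob": blob, "tags": tags}
    return oid

def find_by_tags(**criteria):
    # Return the ids of all objects whose metadata tags match the criteria.
    return [oid for oid, obj in object_store.items()
            if all(obj["tags"].get(k) == v for k, v in criteria.items())]

put(b"...sensor readings...", {"source": "sensor-42", "format": "csv", "date": "2014-07-13"})
matching_ids = find_by_tags(source="sensor-42")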
Retrieving Big Data
SQL
• Example
– A Beer Drinkers Club owns many bars, and each bar sells beer.
– Not every bar sells the same brands of beer, and even when they do, they may have different prices.
– It keeps information about the regular member customers.
– It also knows which member visits which bars, and which beer each member likes.
• Database schema
More example queries
• Find expensive beers
• Which businesses have a temporary license
(starts with 32) in San Diego?
Querying JSON data with MongoDB
From structured to semi‐structured data
[
{
_id: 1,
name: "sue",
age: 19,
type: 1,
status: "P",
favourites: { artist: "Picasso", food: "pizza" },
finished: [ 17, 3 ],
badges: [ "blue", "black" ],
points: [
{ points: 85, bonus: 20 },
{ points: 75, bonus: 10 }
]
},
{
_id: 2,
name: "john",
age: 21
}
]
Notes on the JSON structure:
• Atomic values (e.g. name: "sue") are key‐value pairs; the basic operation is: given a key, return its value.
• favourites is a named tuple – key‐value pairs structured as a tuple (artist: "Picasso", food: "pizza") with the name favourites. As tuples can be thought of as relational records, operations include projection over an attribute or selection over a set of tuples.
• badges is a named array (name: badges, elements "blue" and "black"). Operations: find the position of an element, or, given a position, retrieve the value.
• points is a named array (name: points) of unnamed tuples – the two tuples can be addressed by their position.
• JSON has nesting: a mini structure can be embedded within another structure.
MongoDB
• As JSON has nesting, we need operations that will let us navigate from one structure to any of its
embedded structures.
• “MongoDB is a document database with the scalability and flexibility that you want with the querying and
indexing that you need” ‐ www.mongodb.com
• MongoDB stores data in flexible, JSON‐like documents
– meaning fields can vary from document to document and data structure can be changed over time.
• The document model maps to the objects in your application code, making data easy to work with.
• Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyse data.
• It is a distributed database at its core
– so high availability, horizontal scaling, and geographic distribution are built in and easy to use.
• MongoDB is free and open‐source
SQL Select and MongoDB find()
• A basic SQL query states:
– which parts of which records from one or more tables should be reported
• MongoDB query states:
– which parts of which documents from a document collection should be returned
• MongoDB:
– The primary query is expressed as a find function, which contains two arguments and an optional
qualifier:
db.collection.find( <query filter>, <projection> ).<cursor modifier>
Some simple queries
Example 1:
• SQL
SELECT * FROM Beers
• MongoDB
db.Beers.find()    – empty: no query conditions and no projection clauses in it
Example 2:
• SQL
SELECT beer, price FROM Sells
• MongoDB
db.Sells.find( {} , {beer:1, price:1} )
a) The query filter is not needed: the empty query condition is denoted by {}.
b) The projection clauses are specifically identified:
• 1 if an attribute is output and 0 if it is not.
By default, every query will return the _id of the document. When it is not needed: { beer: 1, price: 1, _id: 0 }
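The same queries can be issued from Python with the pymongo driver (a sketch; it assumes a local MongoDB instance and an arbitrary database name, here club, holding the Beers and Sells collections):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["club"]    # hypothetical connection/database

# Example 1: SELECT * FROM Beers
for beer in db.Beers.find():
    print(beer)

# Example 2: SELECT beer, price FROM Sells
# Empty filter {}; the projection keeps beer and price and suppresses _id.
for row in db.Sells.find({}, {"beer": 1, "price": 1, "_id": 0}):
    print(row)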
Adding query conditions
Example 3:
• SQL
SELECT manf FROM Beers WHERE name = 'Heineken'
• MongoDB
db.Beers.find( {name: 'Heineken'} , {manf:1, _id:0} )
Example 4:
• SQL
SELECT DISTINCT beer, price FROM Sells WHERE price > 15
• MongoDB
db.Sells.find( {price: {$gt: 15}} , {beer:1, price:1, _id:0} )    $gt ‐> greater than
Note: find() may return duplicate (beer, price) pairs. MongoDB's distinct() helper takes a single field and an optional filter, e.g. db.Sells.distinct("beer", {price: {$gt: 15}}), so a true DISTINCT over several fields needs the aggregation framework (see the sketch below).
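If a genuine multi‐field DISTINCT is needed, one way is the aggregation pipeline; a pymongo sketch (reusing the hypothetical club database from the earlier pymongo example):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["club"]    # hypothetical connection/database

# Distinct (beer, price) pairs with price > 15.
pipeline = [
    {"$match": {"price": {"$gt": 15}}},
    {"$group": {"_id": {"beer": "$beer", "price": "$price"}}},
]
for doc in db.Sells.aggregate(pipeline):
    print(doc["_id"])        # each result is one distinct {beer, price} pair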
Some operators of MongoDB
Comparison operators:
$eq – Matches values that are equal to a specified value.
$gt – Matches values that are greater than a specified value.
$gte – Matches values that are greater than or equal to a specified value.
$lt – Matches values that are less than a specified value.
$lte – Matches values that are less than or equal to a specified value.
$ne – Matches all values that are not equal to a specified value.
Array operators:
$in – Matches any of the values specified in an array.
$nin – Matches none of the values specified in an array.
Logical operators:
$or – Joins query clauses with a logical OR.
$and – Joins query clauses with a logical AND.
$not – Inverts the effect of a query expression.
$nor – Joins query clauses with a logical NOR.
https://docs.mongodb.com/manual/reference/operator/query/
Regular expressions
Example 5:
• Count the number of manufacturers whose names have the partial string ‘am’ in it – case
insensitive
db.Beers.find( {name: {$regex: /am/i}} ).count()    i ‐> case insensitive
Example 6:
• Same, but the name starts with ‘Am’
db.Beers.find( {name: {$regex: /^Am/}} ).count()    ^ ‐> anchors at the beginning
• Same, but starts with ‘Am’ and ends with ‘corp’
db.Beers.count( {name: {$regex: /^Am.*corp$/}} )    . ‐> any character; * ‐> zero or more occurrences;
$ ‐> ‘corp’ must be at the end
Array operations
Consider the document:
{ _id: 1, item: "bud", qty: 10, tags: [ "popular", "summer", "Japanese" ], rating: "good" }
Example 7:
• Find items which are tagged as "popular" or "organic"
db.inventory.find( {tags: {$in: ["popular", "organic"]}} )    $in ‐> matches any of the listed values
• Find items which are tagged as neither "popular" nor "organic"
db.inventory.find( {tags: {$nin: ["popular", "organic"]}} )    $nin ‐> matches none of the listed values
Example 8:
• Find the 2nd and 3rd elements of tags
db.inventory.find( {} , { tags: {$slice: [1, 2]} } )    $slice: [skip count, number of elements to return]
db.inventory.find( {} , { tags: {$slice: -2} } )    returns the last two elements
Array operations
Using the same inventory document as above:
Example 9:
• Find documents whose 2nd element in tags is "summer"
db.inventory.find( {"tags.1": "summer"} )
tags.0 ‐> 1st element (in this example, "popular")
tags.1 ‐> 2nd element (in this example, "summer")
Compound statements
Consider the document:
{ _id: 1, item: "bud", qty: 10, tags: [ "popular", "summer", "Japanese" ], rating: "good", price: 3.99 }
Example 10:
• MongoDB query
db.inventory.find( {
  $and : [
    { $or : [ { price: 3.99 }, { price: 4.99 } ] },
    { $or : [ { rating: "good" }, { qty: { $lt: 20 } } ] },
    { item: { $ne: "Coors" } }
  ]
} )
• Equivalent SQL query:
SELECT * FROM inventory
WHERE ((price = 3.99) OR (price = 4.99)) AND
      ((rating = 'good') OR (qty < 20)) AND
      (item != 'Coors')
Queries over nested elements
Consider the users collection:
{ _id: 1, points: [ { points: 96, bonus: 20 }, { points: 35, bonus: 10 } ] }
{ _id: 2, points: [ { points: 53, bonus: 20 }, { points: 64, bonus: 12 } ] }
{ _id: 3, points: [ { points: 81, bonus: 8 }, { points: 95, bonus: 20 } ] }
Example 11:
• db.users.find( {"points.points" : {$lte:80} } )
Retrieves _id:1 and _id:2 and does not retrieve _id:3: the first two each have at least one element with
points <= 80, whereas _id:3 does not.
Example 12:
• db.users.find( {"points.0.points" : {$lte:80} } )
Retrieves _id:2 only, as it is the only one in which the first (position 0) element has points <= 80.
Queries over nested elements
Example 13 (using the same users collection as above):
• db.users.find( { "points.0.points" : {$lte:81}, "points.bonus": 20 } )
It looks for documents where the points value of the first element is at most 81, and where some
element has a bonus of exactly 20.
It retrieves _id:2 and _id:3: the condition {"points.0.points" : {$lte:81}} (similar to Example 12) now also
includes 81, and {"points.bonus": 20} is also met in both documents (notice that the position of the
bonus: 20 element does not matter, as it was not specified).
The first document is the only one that does not qualify, as its first 'points' element is greater than 81,
even though it has a bonus: 20.
Queries over nested elements
Example 14 (Example 13 modified, same users collection as above):
• db.users.find( { "points.0.points" : {$lte:81}, "points.0.bonus" : {$lte:10} } )
Similar to before, but now we also restrict the bonus in the first position to be <= 10.
This time it only retrieves _id:3: the first condition "points.0.points" : {$lte:81} matches documents 2 and 3,
but only document 3 also meets the second condition "points.0.bonus" : {$lte:10}.
Queries over nested elements
Example 15 (same users collection as above):
Here, we modify Example 14 to retrieve the documents that meet at least one of the two conditions:
• db.users.find( {$or: [ {"points.0.points" : {$lte:81}}, {"points.0.bonus" : {$lte:10}} ]} )
This retrieves _id:2 and _id:3: _id:2 satisfies the first condition, and _id:3 satisfies both.
Queries over nested elements
You can replicate these examples using the MongoDB shell in https://mws.mongodb.com/?version=3.6
Create a collection with some data:
db.users.insertMany([
{_id: 1, points: [{ points: 96, bonus: 20 }, { points: 35, bonus: 10 } ]},
{_id: 2, points: [{ points: 53, bonus: 20 }, { points: 64, bonus: 12 } ]},
{_id: 3, points: [{ points: 81, bonus: 8 }, { points: 95, bonus: 20} ]}
]);
Perform the following queries:
db.users.find( {'points.points' : {$lte:80} } ) ‐> Example 11
db.users.find( {'points.0.points' : {$lte:80} } ) ‐> Example 12
db.users.find( { 'points.0.points' : {$lte:81}, 'points.bonus':20} ) ‐> Example 13
db.users.find( {'points.0.points' : {$lte:81}, 'points.0.bonus':{$lte:10}} ) ‐> Example 14
db.users.find( {$or:[{'points.0.points' : {$lte:81}}, {'points.0.bonus':{$lte:10}}]} ) ‐> Example 15
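The same collection and queries can also be driven from Python with pymongo (a sketch; it assumes a local MongoDB instance, and the database name test is arbitrary):

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["test"]["users"]
users.drop()                                    # start from a clean collection
users.insert_many([
    {"_id": 1, "points": [{"points": 96, "bonus": 20}, {"points": 35, "bonus": 10}]},
    {"_id": 2, "points": [{"points": 53, "bonus": 20}, {"points": 64, "bonus": 12}]},
    {"_id": 3, "points": [{"points": 81, "bonus": 8},  {"points": 95, "bonus": 20}]},
])

print(list(users.find({"points.points": {"$lte": 80}})))       # Example 11 -> _id 1 and 2
print(list(users.find({"points.0.points": {"$lte": 80}})))     # Example 12 -> _id 2
print(list(users.find({"$or": [{"points.0.points": {"$lte": 81}},
                               {"points.0.bonus": {"$lte": 10}}]})))   # Example 15 -> _id 2 and 3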
Introduction to Apache Spark
Apache Spark
• Spark was initiated at UC Berkeley in 2009 and was transferred to Apache Software
Foundation in 2013.
• Since then, Spark has become a top‐level project with many users and contributors
worldwide.
• It is one of the most successful projects in the Apache Software Foundation.
Why Spark?
Hadoop MapReduce shortcomings:
• Only for Map and Reduce based computations
• Relies on reading data from HDFS
– a problem for iterative algorithms
• Native support for Java only
• No interactive shell support
• No support for streaming
Spark came out of the need to extend the MapReduce framework and provide an
expressive cluster computing environment with:
• interactive querying
• efficient iterative analytics
• streaming data processing
Basics of Data Analysis with Spark
Expressive programming model
• Provides more than 20 highly efficient distributed operations or transformations.
• Pipe‐lining any of these steps in Spark simply takes a few lines of code
In‐memory processing
• It runs these computations in memory.
− Its ability to cache and process data in memory makes it significantly faster for iterative applications.
Support for diverse workloads
• Spark provides support for batch and streaming workloads at once.
Interactive shell
• Spark provides simple APIs for Python, Scala, Java and SQL programming through an interactive shell to
accomplish analytical tasks through both external libraries and its built‐in libraries.
The Spark Stack
• The Spark layer diagram, also called Stack, consists of components that build on top of the Spark
computational engine.
– This engine distributes and monitors tasks across the nodes of a commodity cluster.
– The components built on top of this engine are designed to interact and communicate through this
common engine.
– Any improvement to the underlying engine becomes an improvement in the other components, thanks to
this close interaction.
(Spark stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX sit on top of Spark Core.)
The Spark Stack – Spark core
• The Spark Core is where the core capabilities of the Spark Framework are implemented, including:
– support for distributed scheduling, memory management and fault tolerance
– interaction with different schedulers (e.g. YARN and Mesos) and various NoSQL storage systems (e.g.
HBase)
– the APIs for defining resilient distributed datasets (RDD)
• RDDs are the main programming abstraction in Spark, which carry data across many computing nodes in
parallel, and transform it.
The Spark Stack – Spark SQL
• Spark SQL is the component of Spark that provides querying of structured and semi‐structured data
through a common query language.
• It can connect to many data sources, and provides APIs to convert query results to RDDs in Python, Scala
and Java programs.
• Architecture of Spark SQL: the DataFrame DSL and Spark SQL/HQL sit on top of the DataFrame API,
which uses the Data Source API to access sources such as CSV, JSON and JDBC.
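A minimal Spark SQL sketch in PySpark (the file path and column names are assumptions): a DataFrame is built from a semi‐structured source, registered as a temporary view, queried with SQL, and the result is also available as an RDD.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-example").getOrCreate()

sells = spark.read.json("/data/sells.json")        # hypothetical semi-structured source
sells.createOrReplaceTempView("Sells")

cheap = spark.sql("SELECT beer, price FROM Sells WHERE price < 5")
cheap.show()
print(cheap.rdd.take(3))                           # query results can be used as an RDD too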
The Spark Stack – Spark streaming
• Spark Streaming is where manipulations of streaming data take place in Spark.
• Although not a native real‐time interface to data streams, Spark Streaming enables creating small
aggregates of the data coming from streaming data ingestion systems.
– These aggregate datasets (micro‐batches) can be converted into RDDs in Spark Streaming for
processing.
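A minimal sketch of micro‐batching with a sliding window in PySpark's DStream API (assumptions: a text stream on localhost:9999 and a 2‐second batch interval; newer applications may prefer Structured Streaming):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-example")
ssc = StreamingContext(sc, 2)                      # micro-batches every 2 seconds
ssc.checkpoint("checkpoint")                       # required for windowed/stateful operations

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical streaming source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,    # add counts entering the window
                                     lambda a, b: a - b,    # subtract counts leaving it
                                     windowDuration=30, slideDuration=10))
counts.pprint()                                    # each micro-batch result is an RDD

ssc.start()
ssc.awaitTermination()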
The Spark Stack – MLlib
• MLlib is Spark’s native library for machine learning
algorithms as well as model evaluation.
• Spark MLlib provides the following tools:
– ML Algorithms: Include common learning algorithms such as classification, regression, clustering, etc.
– Featurization: includes feature extraction, transformation, dimensionality reduction and selection.
– Pipelines: provide tools for constructing, evaluating and tuning ML Pipelines.
– Persistence: helps in saving and loading algorithms, models and Pipelines.
– Utilities: utilities for linear algebra, statistics and data handling.
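A short sketch of these pieces with the PySpark MLlib Pipeline API (the toy data and column names are made up): featurization stages and an ML algorithm are chained into a Pipeline, fitted, and the fitted model is persisted.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Toy training data: text documents with a binary label.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # featurization
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)                          # ML algorithm

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)      # pipeline
model.write().overwrite().save("/tmp/lr-model")              # persistence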
The Spark Stack – GraphX
• GraphX is the graph analytics library of Spark.
• It enables the vertex‐and‐edge data model of graphs to be converted into RDDs.
• It provides scalable implementations of graph processing algorithms.
The Spark Stack
Through these layers Spark provides diverse, scalable
interactive management and analyses of big data.
Getting started with Spark:
The Architecture and Basic Concepts
MapReduce
Spark
• Spark allows for immediate results of transformations in different stages of the pipeline in memory,
like MAP and REDUCE here:
– The outputs of MAP operations are shared with REDUCE operations without being written to
the disk, resulting in much faster operations.
Spark
• The containers where the data gets stored in memory are called Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets
• RDDs are how Spark distributes data and computations across the nodes of a commodity cluster.
• Thanks to this abstraction, Spark has proven to be 100 times faster for some applications.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• The datasets that RDDs distribute come from batch data storage (like HDFS), NoSQL databases,
text files, streaming data ingestion systems (like Kafka), or even from the local hard drive.
• When Spark reads data from these sources, it generates RDDs for them.
• The Spark operations can transform RDDs into other RDDs like any other data.
– Important to mention: RDDs are immutable.
• This means that they cannot be changed partially.
• However, you can create new RDDs by performing one or more transformations.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• RDDs distribute partitioned data collections and computations on clusters, even across a large number of
machines.
• Computations are a diverse set of transformations of RDDs such as map, filter and join; and also
actions on the RDDs, such as counting and saving them persistently on disk.
• The partitioning of data can be changed dynamically to optimise Spark's performance.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• The resilient element refers to the ability to recover from failures without losing any work already
done.
• For fault tolerance in such situations, Spark tracks the history of each partition, keeping a lineage of
RDDs over time
– So, for every point in the calculations, Spark knows which are the partitions needed to recreate the partition
in case it gets lost.
– And if that happens, then Spark automatically figures out where it can start to recompute from and optimises
the amount of processing needed to recover from the failure.
Spark Architecture
• Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
1. Master Daemon – (Master/Driver Process)
2. Worker Daemon – (Slave Process)
• A Spark cluster has a single Master and
any number of Slaves/Workers.
– There is a driver that talks to a single
coordinator called master that manages
workers in which executors run.
Programming in Spark
Programming in Spark
• In a typical Spark program we create RDDs from
external storage or from local collections like lists.
• Then we apply transformations to these RDDs,
like filter, map, and reduceByKey.
• These transformations get lazily evaluated until an
action is performed.
• Actions are performed both for local and parallel
computation to generate results.
(Workflow: create RDDs → apply transformations → perform actions; a minimal sketch follows.)
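A minimal PySpark sketch of that workflow (the data is made up): an RDD is created from a local list, transformations are declared but evaluated lazily, and the pipeline only runs when an action is called.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-pipeline")

# 1. Create an RDD from a local collection (could equally be sc.textFile(...)).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# 2. Apply transformations: these are only recorded, not yet executed.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# 3. Perform actions: only now is the whole pipeline actually computed.
print(squares.collect())                    # [4, 16, 36]
print(squares.reduce(lambda a, b: a + b))   # 56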
Processing RDDs
(Diagram: an RDD is transformed into a new RDD, which can itself be transformed again.)
Processing RDDs
(Diagram: a chain of transformations such as map is applied to an RDD, and the pipeline ends with an action.)
Processing RDDs – WordCount example
(Diagram: WordCount as a pipeline – lines are split into words, mapped to (word, 1) tuples, and aggregated with reduceByKey; a sketch follows.)
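A PySpark WordCount along those lines (a sketch; the HDFS paths are placeholders to be adapted to your cluster):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

lines = sc.textFile("hdfs:///user/hands-on/words.txt")       # hypothetical input path
words = lines.flatMap(lambda line: line.split())             # split lines into words
tuples = words.map(lambda word: (word, 1))                   # (word, 1) pairs
counts = tuples.reduceByKey(lambda a, b: a + b)              # sum the counts per word
counts.saveAsTextFile("hdfs:///user/hands-on/wordcount")     # action: triggers execution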
Transformations – map
map: applies a function to each element of the RDD (independently within each RDD partition).
Example:
def lower(line):
return line.lower()
lower_text_RDD = text_RDD.map(lower)
Transformations – flatMap
flatMap: first map, then flatten the output.
It is similar to Map, but flatMap allows returning
0, 1 or more elements from map function.
Example:
def split_words(line):
return line.split()
words_RDD = text_RDD.flatMap(split_words)
words_RDD.collect()
Transformations – filter
filter: keeps only the elements for which the function returns true.
Example:
def starts_with_a(word):
return word.lower().startswith("a")
words_RDD.filter(starts_with_a).collect()
Transformations – coalesce
coalesce: reduces the number of RDD partitions.
Narrow vs. wide transformations
• Narrow transformation refers to the processing where the processing logic depends only on data that is
already residing in the partition and data shuffling is not necessary.
• In wide transformation operations, processing depends on data residing in multiple partitions distributed
across worker nodes and this requires data shuffling over the network to bring related datasets together.
(Diagram: map is a narrow transformation – each (apple, 1) record is processed within its own partition; groupByKey is a wide transformation – (apple, 1) records from different partitions are shuffled together into (apple, [1, 1]). A small sketch follows.)
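A small illustration of the difference (a sketch with made‐up data): map produces each output record from a single input record in the same partition, while groupByKey has to shuffle records with the same key across partitions.

from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")

pairs = sc.parallelize([("apple", 1), ("pear", 1), ("apple", 1)], 2)   # 2 partitions

# Narrow: each record is transformed in place, no shuffle needed.
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

# Wide: values for the same key may live in different partitions, so a
# shuffle over the network is required to bring them together.
grouped = pairs.groupByKey().mapValues(list)    # e.g. ("apple", [1, 1])

print(grouped.collect())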
More transformations
• Full list of transformations at:
https://spark.apache.org/docs/2.2.0/rdd‐programming‐guide.html#transformations
Actions
• A few common actions:
Action Usage
collect() Copy all elements to the driver
take(n) Copy first n elements
reduce(func) Aggregate elements with func (takes 2 arguments, returns 1)
saveAsTextFile(path) Save to local file or HDFS
More actions
• Full list of actions at:
https://spark.apache.org/docs/2.2.0/rdd‐programming‐guide.html#actions
Big Data Landscape
http://mattturck.com/bigdata2018/
Hands‐on activity:
1. Exploring Streaming Twitter Data
Details of the hands‐on activity: Exploring Streaming Twitter Data
• View the text of Twitter data streaming in real‐time containing specific words.
• Create plots of the frequency of streaming Twitter data to see how popular a word is.
Hands‐on activity:
2. Querying documents in MongoDB
Details of the hands‐on activity: Querying documents in MongoDB
• Find documents in MongoDB with specific field values.
• Filter the results returned by MongoDB queries.
• Count documents in a MongoDB collection and returned by queries.
Hands‐on activity:
3. WordCount in Spark
Details of the hands‐on activity: WordCount in Spark
• Read and write text files to HDFS with Spark
• Perform WordCount with Spark Python
Hands‐on activity:
4. Exploring SparkSQL and Spark DataFrames
Details of the hands‐on activity: SparkSQL and Spark DataFrames
• Access Postgres database tables with SparkSQL
• Filter rows and columns of a Spark DataFrame
• Group and perform aggregate functions on columns in a Spark DataFrame
• Join two Spark Dataframes on a single column
Hands‐on activity:
5. Sensor Data with Spark Streaming
Details of the hands‐on activity: Sensor Data with Spark Streaming
• Read streaming data into Spark
• Create and apply computations over a sliding window of data
Conclusions
Summary
We have covered today:
1. The key characteristics of a data stream and identified the requirements of streaming data
systems.
2. How Data Lakes enable batch processing of streaming data, and explained the difference
between ‘schema‐on‐write’ and ‘schema‐on‐read’.
3. How to retrieve relational data and semi‐structured data with PostgreSQL and MongoDB,
respectively.
4. The Spark stack as a layer diagram as well as the functionality of the components in the Spark
stack.
5. How Spark does in‐memory processing using the RDD abstraction, and explained the inner
workings of the Spark architecture.
6. The steps to create a Spark program and how to interpret a Spark program as a pipeline of
transformations and actions.