Big Data Computing: Working With Data Models and Big Data Processing
Day 4
Working with data models and
Big Data processing
In this session…
• Streaming data: what it is and why it is different
• Data Lakes
• DBMS‐based and non‐DBMS‐based approaches to Big data
• Big Data Management Systems
• Retrieving Big Data: MongoDB
• Apache Spark
– The Spark Stack, including SparkSQL, Spark Streaming, MLlib, GraphX
– Architecture and basic concepts
– Programming in Spark: processing RDDs, transformations and actions
• Hands‐on activities
1. Exploring streaming Twitter data
2. Querying semi‐structured data with MongoDB
3. WordCount in Spark
4. Exploring SparkSQL and Spark DataFrames
5. Sensor Data with Spark Streaming
Streaming data: what it is and why it is different
Streaming data
• We mentioned previously that one of the Big Data challenges
was the velocity of data, which arrives at varying rates.
• For some applications this presents the need to process data
as it is generated, or in other words, as it streams.
• We call these types of applications:
Streaming Data Processing Applications
• This terminology refers to a constant stream of data flowing from a source.
– E.g. data from a sensory machine or data from social media.
Streaming data – example application
• FlightStats, https://www.flightstats.com
– It processes ~60 million flight events per week coming into its data acquisition system, and turns them into real‐time
intelligence for airlines and millions of travellers, daily.
Data stream – challenges
• Conventional data management architectures are built primarily on the concept of
persistent, static data collections.
• Streams pose very difficult challenges for these conventional data management architectures
– as most often we have only one chance to look at, and process, streaming data before receiving
more.
• Streaming data management systems cannot be separated from real‐time processing of data.
– Managing and processing data in motion is a typical capability of streaming data systems.
Streaming Data Systems – characteristics
Streaming data systems:
• are designed to manage relatively simple computations
• process the data stream one record at a time, or over a small time window of data
• perform computations in near‐real‐time – sometimes in memory
• perform computations that are independent of one another
• often subscribe to the stream source non‐interactively
– this means they send nothing back to the source, nor do they establish any interaction with it
The sheer size, variety and velocity of big data add further challenges.
Data stream – dynamic steering
• The concept of dynamic steering involves dynamically changing the next steps or direction
of an application through an ongoing computational process over the streaming data.
– Dynamic steering is often a part of streaming data management and processing.
Example of dynamic steering application: Self‐driving car
Streaming Data Systems
• Examples of big data streaming systems:
Why is Streaming Data different?
Data‐at‐rest:
• Mostly static data from one or more sources
• Collected prior to analysis
• Analysis of data‐at‐rest is called batch or static processing
Data‐in‐motion:
• Analysed as it is generated
– E.g. sensor data processing in a plane or a self‐driving car
• Analysis of data‐in‐motion is called stream processing
Data processing algorithms
Static / Batch Processing – size determines time and space:
– The run time and memory usage of most algorithms is usually dependent on the data size, which can
easily be calculated from files or databases.
Stream Processing – unbounded size, but finite time and space:
– The size of the data is unbounded, and this changes the types of algorithms that can be used.
– Algorithms that require iterating or looping over the whole data set are not possible, since with
stream data you never get to the end (a small sketch follows).
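To make the contrast concrete, here is a minimal sketch (illustrative only, not from the original slides; the function names are made up): a batch computation needs the complete data set, whereas a streaming computation keeps a small running state that is updated one element at a time.

# Batch (data-at-rest): the whole collection must be available before we can compute.
def batch_mean(values):
    return sum(values) / len(values)      # iterates over the complete data set

# Streaming (data-in-motion): keep a small running state, update it per element.
def make_running_mean():
    state = {"count": 0, "total": 0.0}
    def update(x):
        state["count"] += 1
        state["total"] += x
        return state["total"] / state["count"]   # current estimate after each element
    return update

update = make_running_mean()
for reading in [3.0, 5.0, 4.0]:           # stands in for an unbounded stream
    current_mean = update(reading)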
Streaming Data Management and Processing
• Streaming Data Management and Processing should enable:
1. Computations on one data element or a small window
of data elements at a time.
– These computations can update metrics, monitor and
plot statistics on the streaming data.
2. Relatively fast and simple computations
– Since computations need to be completed in real time, the tasks processing the streaming data should
complete faster than (or not much slower than) the rate at which the data streams in (the data velocity).
3. No interactions with the data source
– In most streaming systems, the management and processing system subscribes to the data source,
but does not send anything back to the stream source in terms of feedback or interactions.
Streaming Data
• These requirements for streaming data processing are quite different from those of batch processing.
• In batch processing, the analytical steps have access to (often) all data and can take more
time to complete a complex analytical task with less pressure on the completion time of
individual data management and processing tasks.
• Most organisations today use a hybrid architecture for processing streaming and batch jobs
at the same time, which is sometimes referred to as the lambda architecture.
Streaming Data – Lambda architecture
• Lambda architecture is a data‐processing architecture designed to handle massive
quantities of data by taking advantage of both batch‐ and stream‐processing methods.
(Timeline diagram: successive periods of data, each handled by both a batch layer and a real‐time layer, running up to “Now”; a small sketch of the idea follows.)
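As a rough illustration only (the names batch_view, realtime_view and merge are hypothetical, not from the slides): a query against a lambda architecture combines a periodically recomputed batch view with a continuously updated real‐time view.

# Batch layer: recomputed periodically over all historical data.
batch_view = {"#worldcup": 1_000_000}     # e.g. counts up to the last batch run

# Speed (real-time) layer: updated continuously from the stream since that run.
realtime_view = {"#worldcup": 4_250}

def merge(key):
    # Serving layer: answer queries by combining both views.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(merge("#worldcup"))                 # 1,004,250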
Streaming data – challenges
• Scalability
– To accommodate rapid growth in traffic and data volume (scaling up)
– To adapt to decreases in demand (scaling down)
• Data replication and durability
Data availability vs. data durability – they are not the same thing:
– Data availability refers to system uptime, i.e. the storage system is operational and can deliver data
upon request.
– Data durability refers to long‐term data protection, i.e. the stored data does not suffer from bit rot,
degradation or other corruption. It is concerned with data redundancy rather than hardware
redundancy, so that data is never lost or compromised.
Streaming data – challenges
• Among many, there are two main challenges that need to be overcome to avoid data loss
and to enable real‐time analytical tasks:
Streaming data – changes in size and frequency
The size and frequency of the stream data can change significantly over time:
• Size – e.g. streaming data found on social networks can increase in volume during holidays, sports
matches, or major news events.
• Frequency – the changes can be periodic or sporadic.
Streaming data – periodic changes
Data changes may be periodic
Periodic: evenings,
weekends, etc.
Example:
People may post messages on social
media more in the evenings.
Streaming data – sporadic changes
Data changes may be sporadic
Sporadic: major
events.
Examples:
There can be an increase in data size
and frequency during major events,
sport matches, etc.
Streaming data – extreme changes example
• Example of extreme data fluctuation:
Average Tweets / Second: 6,000
During the first 10 years of Twitter, the record for most tweets per minute was set during
Germany's victory over Argentina in the 2014 World Cup.
There were 618,725 tweets in
the 60 seconds after the final
whistle was blown
Data Lakes
Data Lakes
• With big data streaming from different sources in
varying formats, models, and speeds, we need to be
able to ingest this data into a fast and scalable
storage system that is flexible enough to serve many
current and future analytical processes.
• This is when traditional data warehouses with strict
data models and data formats do not fit the big data
challenges for streaming and batch applications.
• The concept of a data lake was created in response
to these big data storage and processing challenges.
What is a Data Lake?
• A data lake is a part of a big data infrastructure that many streams can flow into and get
stored for processing in their original form.
• We can think of it as a massive storage depository with huge processing power and the ability to
handle a very large number of concurrent data management and analytical tasks.
How do Data lakes work?
• The concept can be compared to a water body, a lake, where water flows in, fills up a reservoir, and
flows out.
How do Data lakes work? – in more detail
• In a Data Lake:
– The data gets loaded from its source, stored in its native format until it is needed, at which time,
the applications can freely read the data and add structure to it.
– This is called ‘schema‐on‐read’.
– This approach ensures all data is stored for a potentially unknown use at a later time.
• In contrast, in a Data Warehouse:
– The data is loaded into the warehouse after transforming it into a well‐defined and structured
format.
– This is called ‘schema‐on‐write’.
– Any application using the data needs to know this format in order to retrieve and use the data.
– In this approach, data is not loaded into the warehouse unless there is a use for it.
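A minimal schema‐on‐read sketch in PySpark (illustrative only; the path and field names are assumptions): the raw JSON records sit in the lake untouched, and the structure is declared by the application at read time.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: the records were landed in the data lake in their native format;
# the structure is imposed only now, by the application that reads them.
schema = StructType([
    StructField("user", StringType()),
    StructField("text", StringType()),
    StructField("timestamp", LongType()),
])
tweets = spark.read.schema(schema).json("/datalake/raw/tweets/")   # hypothetical path

# A data warehouse would instead require transforming the data into a well-defined
# structure before loading it at all (schema-on-write).
tweets.select("user", "text").show()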
Data warehouse vs. Data lake
Data warehouse:
• Stores data in a hierarchical file system with a well‐defined structure.
Data lake:
• Stores data as flat files, each with a unique identifier.
• This often gets referred to as object storage in big data systems.
Data lake object storage
Data lake
• Each data item is stored as a Binary Large
Object (BLOB) and is assigned a unique
identifier.
• Each data object is tagged with a
number of metadata tags.
• The data can be searched using these
metadata tags to retrieve it.
• In Hadoop data architectures, data is loaded into HDFS and processed using the appropriate data management and
analytical systems on commodity clusters.
• The selection of the tools is based on the nature of the problem being solved, and the data format being accessed.
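The sketch below is a toy illustration of the object‐storage idea in plain Python (not a real object store; the helper names are made up): each BLOB gets a unique identifier plus metadata tags, and retrieval is a search over the tags.

import uuid

object_store = {}   # object id -> {"blob": ..., "tags": {...}}

def put(blob, tags):
    oid = str(uuid.uuid4())                      # unique identifier for the object
    object_store[oid] = {"blob": blob, "tags": tags}
    return oid

def find_by_tags(**criteria):
    # Return the ids of all objects whose metadata tags match the criteria.
    return [oid for oid, obj in object_store.items()
            if all(obj["tags"].get(k) == v for k, v in criteria.items())]

put(b"...sensor readings...", {"source": "sensor-42", "format": "csv", "date": "2014-07-13"})
matching_ids = find_by_tags(source="sensor-42")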
Retrieving Big Data
SQL
• Example
– A Beer Drinkers Club owns many bars, and each bar sells beer.
– Not every bar sells the same brands of beer, and even when they do, they may have different prices.
– It keeps information about the regular member customers.
– It also knows which member visits which bars, and which beer each member likes.
• Database schema
More example queries
• Find expensive beers
• Which businesses have a temporary license
(starts with 32) in San Diego?
Querying JSON data with MongoDB
From structured to semi‐structured data
[
{
_id: 1,
name: "sue",
age: 19,
type: 1,
status: "P",
favourites: { artist: "Picasso", food: "pizza" },
finished: [ 17, 3 ],
badges: [ "blue", "black" ],
points: [
{ points: 85, bonus: 20 },
{ points: 75, bonus: 10 }
]
},
{
_id: 2,
name: "john",
age: 21
}
]
Notes on the JSON structure:
• Atomic values (e.g. name: "sue") are key‐value pairs; the basic operation is: given a key, return its value.
• favourites is a named tuple – key‐value pairs structured as a tuple (artist: "Picasso", food: "pizza") with the name favourites. As tuples can be thought of as relational records, operations include projection over an attribute or selection over a set of tuples.
• badges is a named array (name: badges, elements "blue" and "black"). Operations: find the position of an element, or, given a position, retrieve the value.
• points is a named array (name: points) of unnamed tuples – the two tuples can be addressed by their position.
• JSON has nesting: a mini structure can be embedded within another structure.
MongoDB
• As JSON has nesting, we need operations that will let us navigate from one structure to any of its
embedded structures.
• “MongoDB is a document database with the scalability and flexibility that you want with the querying and
indexing that you need” ‐ www.mongodb.com
• MongoDB stores data in flexible, JSON‐like documents
– meaning fields can vary from document to document and data structure can be changed over time.
• The document model maps to the objects in your application code, making data easy to work with.
• Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyse data.
• It is a distributed database at its core
– so high availability, horizontal scaling, and geographic distribution are built in and easy to use.
• MongoDB is free and open‐source
SQL Select and MongoDB find()
• A basic SQL query states:
– which parts of which records from one or more tables should be reported
• MongoDB query states:
– which parts of which documents from a document collection should be returned
• MongoDB:
– The primary query is expressed as a find function, which contains two arguments and an optional
qualifier:
db.collection.find( <query filter>, <projection> ).<cursor modifier>
Some simple queries
Example 1:
• SQL
SELECT * FROM Beers
• MongoDB
db.Beers.find()    – empty: no query conditions and no projection clauses in it
Example 2:
• SQL
SELECT beer, price FROM Sells
• MongoDB
db.Sells.find( {} , {beer:1, price:1} )
a) The query filter is not needed: the empty query condition is denoted by {}.
b) The projection clauses are specifically identified:
• 1 if an attribute is output and 0 if it is not.
By default, every query will return the _id of the document. When it is not needed: { beer: 1, price: 1, _id: 0 }
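The same queries can be issued from Python with the pymongo driver (a sketch; it assumes a local MongoDB instance and an arbitrary database name, here club, holding the Beers and Sells collections):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["club"]    # hypothetical connection/database

# Example 1: SELECT * FROM Beers
for beer in db.Beers.find():
    print(beer)

# Example 2: SELECT beer, price FROM Sells
# Empty filter {}; the projection keeps beer and price and suppresses _id.
for row in db.Sells.find({}, {"beer": 1, "price": 1, "_id": 0}):
    print(row)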
Adding query conditions
Example 3:
• SQL
SELECT manf FROM Beers WHERE name = 'Heineken'
• MongoDB
db.Beers.find( {name: 'Heineken'} , {manf:1, _id:0} )
Example 4:
• SQL
SELECT DISTINCT beer, price FROM Sells WHERE price > 15
• MongoDB
db.Sells.find( {price: {$gt: 15}} , {beer:1, price:1, _id:0} )    $gt ‐> greater than
Note: find() may return duplicate (beer, price) pairs. MongoDB's distinct() helper takes a single field and an optional filter, e.g. db.Sells.distinct("beer", {price: {$gt: 15}}), so a true DISTINCT over several fields needs the aggregation framework (see the sketch below).
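If a genuine multi‐field DISTINCT is needed, one way is the aggregation pipeline; a pymongo sketch (reusing the hypothetical club database from the earlier pymongo example):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["club"]    # hypothetical connection/database

# Distinct (beer, price) pairs with price > 15.
pipeline = [
    {"$match": {"price": {"$gt": 15}}},
    {"$group": {"_id": {"beer": "$beer", "price": "$price"}}},
]
for doc in db.Sells.aggregate(pipeline):
    print(doc["_id"])        # each result is one distinct {beer, price} pair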
Some operators of MongoDB
Comparison operators:
$eq – Matches values that are equal to a specified value.
$gt – Matches values that are greater than a specified value.
$gte – Matches values that are greater than or equal to a specified value.
$lt – Matches values that are less than a specified value.
$lte – Matches values that are less than or equal to a specified value.
$ne – Matches all values that are not equal to a specified value.
Array operators:
$in – Matches any of the values specified in an array.
$nin – Matches none of the values specified in an array.
Logical operators:
$or – Joins query clauses with a logical OR.
$and – Joins query clauses with a logical AND.
$not – Inverts the effect of a query expression.
$nor – Joins query clauses with a logical NOR.
https://docs.mongodb.com/manual/reference/operator/query/
Regular expressions
Example 5:
• Count the number of manufacturers whose names have the partial string ‘am’ in it – case
insensitive
db.Beers.find( {name: {$regex: /am/i}} ).count()    i ‐> case insensitive
Example 6:
• Same, but the name starts with ‘Am’
db.Beers.find( {name: {$regex: /^Am/}} ).count()    ^ ‐> anchors at the beginning
• Same, but starts with ‘Am’ and ends with ‘corp’
db.Beers.count( {name: {$regex: /^Am.*corp$/}} )    . ‐> any character; * ‐> zero or more occurrences;
$ ‐> ‘corp’ must be at the end
Array operations
Consider the document:
{ _id: 1, item: "bud", qty: 10, tags: [ "popular", "summer", "Japanese" ], rating: "good" }
Example 7:
• Find items which are tagged as "popular" or "organic"
db.inventory.find( {tags: {$in: ["popular", "organic"]}} )    $in ‐> matches any of the listed values
• Find items which are tagged as neither "popular" nor "organic"
db.inventory.find( {tags: {$nin: ["popular", "organic"]}} )    $nin ‐> matches none of the listed values
Example 8:
• Find the 2nd and 3rd elements of tags
db.inventory.find( {} , { tags: {$slice: [1, 2]} } )    $slice: [skip count, number of elements to return]
db.inventory.find( {} , { tags: {$slice: -2} } )    returns the last two elements
Array operations
Using the same inventory document as above:
Example 9:
• Find documents whose 2nd element in tags is "summer"
db.inventory.find( {"tags.1": "summer"} )
tags.0 ‐> 1st element (in this example, "popular")
tags.1 ‐> 2nd element (in this example, "summer")
Compound statements
Consider the document:
{ _id: 1, item: "bud", qty: 10, tags: [ "popular", "summer", "Japanese" ], rating: "good", price: 3.99 }
Example 10:
• MongoDB query
db.inventory.find( {
  $and : [
    { $or : [ { price: 3.99 }, { price: 4.99 } ] },
    { $or : [ { rating: "good" }, { qty: { $lt: 20 } } ] },
    { item: { $ne: "Coors" } }
  ]
} )
• Equivalent SQL query:
SELECT * FROM inventory
WHERE ((price = 3.99) OR (price = 4.99)) AND
      ((rating = 'good') OR (qty < 20)) AND
      (item != 'Coors')
Queries over nested elements
Consider the users collection:
{ _id: 1, points: [ { points: 96, bonus: 20 }, { points: 35, bonus: 10 } ] }
{ _id: 2, points: [ { points: 53, bonus: 20 }, { points: 64, bonus: 12 } ] }
{ _id: 3, points: [ { points: 81, bonus: 8 }, { points: 95, bonus: 20 } ] }
Example 11:
• db.users.find( {"points.points" : {$lte:80} } )
Retrieves _id:1 and _id:2 and does not retrieve _id:3: the first two each have at least one element with
points <= 80, whereas _id:3 does not.
Example 12:
• db.users.find( {"points.0.points" : {$lte:80} } )
Retrieves _id:2 only, as it is the only one in which the first (position 0) element has points <= 80.
Queries over nested elements
Example 13 (using the same users collection as above):
• db.users.find( { "points.0.points" : {$lte:81}, "points.bonus": 20 } )
It looks for documents where the points value of the first element is at most 81, and where some
element has a bonus of exactly 20.
It retrieves _id:2 and _id:3: the condition {"points.0.points" : {$lte:81}} (similar to Example 12) now also
includes 81, and {"points.bonus": 20} is also met in both documents (notice that the position of the
bonus: 20 element does not matter, as it was not specified).
The first document is the only one that does not qualify, as its first 'points' element is greater than 81,
even though it has a bonus: 20.
Queries over nested elements
Example 14 (Example 13 modified, same users collection as above):
• db.users.find( { "points.0.points" : {$lte:81}, "points.0.bonus" : {$lte:10} } )
Similar to before, but now we also restrict the bonus in the first position to be <= 10.
This time it only retrieves _id:3: the first condition "points.0.points" : {$lte:81} matches documents 2 and 3,
but only document 3 also meets the second condition "points.0.bonus" : {$lte:10}.
Queries over nested elements
Example 15 (same users collection as above):
Here, we modify Example 14 to retrieve the documents that meet at least one of the two conditions:
• db.users.find( {$or: [ {"points.0.points" : {$lte:81}}, {"points.0.bonus" : {$lte:10}} ]} )
This retrieves _id:2 and _id:3: _id:2 satisfies the first condition, and _id:3 satisfies both.
Queries over nested elements
You can replicate these examples using the MongoDB shell in https://mws.mongodb.com/?version=3.6
Create a collection with some data:
db.users.insertMany([
{_id: 1, points: [{ points: 96, bonus: 20 }, { points: 35, bonus: 10 } ]},
{_id: 2, points: [{ points: 53, bonus: 20 }, { points: 64, bonus: 12 } ]},
{_id: 3, points: [{ points: 81, bonus: 8 }, { points: 95, bonus: 20} ]}
]);
Perform the following queries:
db.users.find( {'points.points' : {$lte:80} } ) ‐> Example 11
db.users.find( {'points.0.points' : {$lte:80} } ) ‐> Example 12
db.users.find( { 'points.0.points' : {$lte:81}, 'points.bonus':20} ) ‐> Example 13
db.users.find( {'points.0.points' : {$lte:81}, 'points.0.bonus':{$lte:10}} ) ‐> Example 14
db.users.find( {$or:[{'points.0.points' : {$lte:81}}, {'points.0.bonus':{$lte:10}}]} ) ‐> Example 15
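The same collection and queries can also be driven from Python with pymongo (a sketch; it assumes a local MongoDB instance, and the database name test is arbitrary):

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["test"]["users"]
users.drop()                                    # start from a clean collection
users.insert_many([
    {"_id": 1, "points": [{"points": 96, "bonus": 20}, {"points": 35, "bonus": 10}]},
    {"_id": 2, "points": [{"points": 53, "bonus": 20}, {"points": 64, "bonus": 12}]},
    {"_id": 3, "points": [{"points": 81, "bonus": 8},  {"points": 95, "bonus": 20}]},
])

print(list(users.find({"points.points": {"$lte": 80}})))       # Example 11 -> _id 1 and 2
print(list(users.find({"points.0.points": {"$lte": 80}})))     # Example 12 -> _id 2
print(list(users.find({"$or": [{"points.0.points": {"$lte": 81}},
                               {"points.0.bonus": {"$lte": 10}}]})))   # Example 15 -> _id 2 and 3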
Introduction to Apache Spark
Apache Spark
• Spark was initiated at UC Berkeley in 2009 and was transferred to Apache Software
Foundation in 2013.
• Since then, Spark has become a top‐level project with many users and contributors
worldwide.
• It is one of the most successful projects in the Apache Software Foundation.
Why Spark?
Hadoop MapReduce shortcomings:
• Only for Map and Reduce based computations
• Relies on reading data from HDFS
– a problem for iterative algorithms
• Native support for Java only
• No interactive shell support
• No support for streaming
Spark came out of the need to extend the MapReduce framework and provide an
expressive cluster computing environment with:
• interactive querying
• efficient iterative analytics
• streaming data processing
Basics of Data Analysis with Spark
Expressive programming model
• Provides more than 20 highly efficient distributed operations or transformations.
• Pipe‐lining any of these steps in Spark simply takes a few lines of code
In‐memory processing
• It runs these computations in memory.
− Its ability to cache and process data in memory makes it significantly faster for iterative applications.
Support for diverse workloads
• Spark provides support for batch and streaming workloads at once.
Interactive shell
• Spark provides simple APIs for Python, Scala, Java and SQL programming through an interactive shell to
accomplish analytical tasks through both external libraries and its built‐in libraries.
The Spark Stack
• The Spark layer diagram, also called Stack, consists of components that build on top of the Spark
computational engine.
– This engine distributes and monitors tasks across the nodes of a commodity cluster.
– The components built on top of this engine are designed to interact and communicate through this
common engine.
– Any improvement to the underlying engine becomes an improvement in the other components, thanks to
this close interaction.
(Spark stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX sit on top of Spark Core.)
The Spark Stack – Spark core
• The Spark Core is where the core capabilities of the Spark Framework are implemented, including:
– support for distributed scheduling, memory management and fault tolerance
– interaction with different schedulers (e.g. YARN and Mesos) and various NoSQL storage systems (e.g.
HBase)
– the APIs for defining resilient distributed datasets (RDD)
• RDDs are the main programming abstraction in Spark, which carry data across many computing nodes in
parallel, and transform it.
The Spark Stack – Spark SQL
• Spark SQL is the component of Spark that provides querying of structured and semi‐structured data
through a common query language.
• It can connect to many data sources, and provides APIs to convert query results to RDDs in Python, Scala
and Java programs.
• Architecture of Spark SQL: the DataFrame DSL and Spark SQL/HQL sit on top of the DataFrame API,
which uses the Data Source API to access sources such as CSV, JSON and JDBC.
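A minimal Spark SQL sketch in PySpark (the file path and column names are assumptions): a DataFrame is built from a semi‐structured source, registered as a temporary view, queried with SQL, and the result is also available as an RDD.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-example").getOrCreate()

sells = spark.read.json("/data/sells.json")        # hypothetical semi-structured source
sells.createOrReplaceTempView("Sells")

cheap = spark.sql("SELECT beer, price FROM Sells WHERE price < 5")
cheap.show()
print(cheap.rdd.take(3))                           # query results can be used as an RDD too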
The Spark Stack – Spark streaming
• Spark Streaming is where manipulations of streaming data take place in Spark.
• Although not a native real‐time interface to data streams, Spark Streaming enables creating small
aggregates of the data coming from streaming data ingestion systems.
– These aggregate datasets (micro‐batches) can be converted into RDDs in Spark Streaming for
processing.
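A minimal sketch of micro‐batching with a sliding window in PySpark's DStream API (assumptions: a text stream on localhost:9999 and a 2‐second batch interval; newer applications may prefer Structured Streaming):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-example")
ssc = StreamingContext(sc, 2)                      # micro-batches every 2 seconds
ssc.checkpoint("checkpoint")                       # required for windowed/stateful operations

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical streaming source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,    # add counts entering the window
                                     lambda a, b: a - b,    # subtract counts leaving it
                                     windowDuration=30, slideDuration=10))
counts.pprint()                                    # each micro-batch result is an RDD

ssc.start()
ssc.awaitTermination()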
The Spark Stack – MLlib
• MLlib is Spark’s native library for machine learning
algorithms as well as model evaluation.
• Spark MLlib provides the following tools:
– ML Algorithms: Include common learning algorithms such as classification, regression, clustering, etc.
– Featurization: includes feature extraction, transformation, dimensionality reduction and selection.
– Pipelines: provide tools for constructing, evaluating and tuning ML Pipelines.
– Persistence: helps in saving and loading algorithms, models and Pipelines.
– Utilities: utilities for linear algebra, statistics and data handling.
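A short sketch of these pieces with the PySpark MLlib Pipeline API (the toy data and column names are made up): featurization stages and an ML algorithm are chained into a Pipeline, fitted, and the fitted model is persisted.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Toy training data: text documents with a binary label.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # featurization
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)                          # ML algorithm

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)      # pipeline
model.write().overwrite().save("/tmp/lr-model")              # persistence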
The Spark Stack – GraphX
• GraphX is the graph analytics library of Spark.
• It enables the vertex‐and‐edge data model of graphs to be converted into RDDs.
• It provides scalable implementations of graph processing algorithms.
The Spark Stack
Through these layers Spark provides diverse, scalable
interactive management and analyses of big data.
Getting started with Spark:
The Architecture and Basic Concepts
MapReduce
Spark
• Spark allows for immediate results of transformations in different stages of the pipeline in memory,
like MAP and REDUCE here:
– The outputs of MAP operations are shared with REDUCE operations without being written to
the disk, resulting in much faster operations.
Spark
• The containers where the data gets stored in memory are called Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets
• RDDs are how Spark distributes data and computations across the nodes of a commodity cluster.
• Thanks to this abstraction, Spark has proven to be 100 times faster for some applications.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• The datasets that RDDs distribute come from batch data storage (like HDFS), NoSQL databases,
text files, streaming data ingestion systems (like Kafka), or even from the local hard drive.
• When Spark reads data from these sources, it generates RDDs for them.
• The Spark operations can transform RDDs into other RDDs like any other data.
– Important to mention: RDDs are immutable.
• This means that they cannot be changed partially.
• However, you can create new RDDs by performing one or more transformations.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• RDDs distribute partitioned data collections and computations on clusters, even across a large number of
machines.
• Computations are a diverse set of transformations of RDDs such as map, filter and join; and also
actions on the RDDs, such as counting and saving them persistently on disk.
• The partitioning of data can be changed dynamically to optimise Spark's performance.
Resilient Distributed Datasets – what it means?
Resilient Distributed Datasets
• The resilient element refers to the ability to recover from failures without losing any work already
done.
• For fault tolerance in such situations, Spark tracks the history of each partition, keeping a lineage of
RDDs over time
– So, for every point in the calculations, Spark knows which are the partitions needed to recreate the partition
in case it gets lost.
– And if that happens, then Spark automatically figures out where it can start to recompute from and optimises
the amount of processing needed to recover from the failure.
Spark Architecture
• Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
1. Master Daemon – (Master/Driver Process)
2. Worker Daemon – (Slave Process)
• A Spark cluster has a single Master and
any number of Slaves/Workers.
– There is a driver that talks to a single
coordinator called master that manages
workers in which executors run.
Programming in Spark
Programming in Spark
• In a typical Spark program we create RDDs from
external storage or from local collections like lists.
• Then we apply transformations to these RDDs,
like filter, map, and reduceByKey.
• These transformations get lazily evaluated until an
action is performed.
• Actions are performed both for local and parallel
computation to generate results.
(Workflow: create RDDs → apply transformations → perform actions; a minimal sketch follows.)
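A minimal PySpark sketch of that workflow (the data is made up): an RDD is created from a local list, transformations are declared but evaluated lazily, and the pipeline only runs when an action is called.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-pipeline")

# 1. Create an RDD from a local collection (could equally be sc.textFile(...)).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# 2. Apply transformations: these are only recorded, not yet executed.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# 3. Perform actions: only now is the whole pipeline actually computed.
print(squares.collect())                    # [4, 16, 36]
print(squares.reduce(lambda a, b: a + b))   # 56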
Processing RDDs
(Diagram: an RDD is transformed into a new RDD, which can itself be transformed again.)
Processing RDDs
(Diagram: a chain of transformations such as map is applied to an RDD, and the pipeline ends with an action.)
Processing RDDs – WordCount example
(Diagram: WordCount as a pipeline – lines are split into words, mapped to (word, 1) tuples, and aggregated with reduceByKey; a sketch follows.)
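A PySpark WordCount along those lines (a sketch; the HDFS paths are placeholders to be adapted to your cluster):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

lines = sc.textFile("hdfs:///user/hands-on/words.txt")       # hypothetical input path
words = lines.flatMap(lambda line: line.split())             # split lines into words
tuples = words.map(lambda word: (word, 1))                   # (word, 1) pairs
counts = tuples.reduceByKey(lambda a, b: a + b)              # sum the counts per word
counts.saveAsTextFile("hdfs:///user/hands-on/wordcount")     # action: triggers execution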
Transformations – map
map: applies a function to each element of the RDD (independently within each RDD partition).
Example:
def lower(line):
return line.lower()
lower_text_RDD = text_RDD.map(lower)
Transformations – flatMap
flatMap: first map, then flatten the output.
It is similar to Map, but flatMap allows returning
0, 1 or more elements from map function.
Example:
def split_words(line):
return line.split()
words_RDD = text_RDD.flatMap(split_words)
words_RDD.collect()
Transformations – filter
filter: keeps only the elements for which the function returns true.
Example:
def starts_with_a(word):
return word.lower().startswith("a")
words_RDD.filter(starts_with_a).collect()
Transformations – coalesce
coalesce: reduces the number of RDD partitions.
Narrow vs. wide transformations
• Narrow transformation refers to the processing where the processing logic depends only on data that is
already residing in the partition and data shuffling is not necessary.
• In wide transformation operations, processing depends on data residing in multiple partitions distributed
across worker nodes and this requires data shuffling over the network to bring related datasets together.
(Diagram: map is a narrow transformation – each (apple, 1) record is processed within its own partition; groupByKey is a wide transformation – (apple, 1) records from different partitions are shuffled together into (apple, [1, 1]). A small sketch follows.)
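A small illustration of the difference (a sketch with made‐up data): map produces each output record from a single input record in the same partition, while groupByKey has to shuffle records with the same key across partitions.

from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")

pairs = sc.parallelize([("apple", 1), ("pear", 1), ("apple", 1)], 2)   # 2 partitions

# Narrow: each record is transformed in place, no shuffle needed.
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

# Wide: values for the same key may live in different partitions, so a
# shuffle over the network is required to bring them together.
grouped = pairs.groupByKey().mapValues(list)    # e.g. ("apple", [1, 1])

print(grouped.collect())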
More transformations
• Full list of transformations at:
https://spark.apache.org/docs/2.2.0/rdd‐programming‐guide.html#transformations
Actions
• A few common actions:
Action Usage
collect() Copy all elements to the driver
take(n) Copy first n elements
reduce(func) Aggregate elements with func (takes 2 arguments, returns 1)
saveAsTextFile(path) Save to local file or HDFS
More actions
• Full list of actions at:
https://spark.apache.org/docs/2.2.0/rdd‐programming‐guide.html#actions
Big Data Landscape
http://mattturck.com/bigdata2018/
Hands‐on activity:
1. Exploring Streaming Twitter Data
Details of the hands‐on activity: Exploring Streaming Twitter Data
• View the text of Twitter data streaming in real‐time containing specific words.
• Create plots of the frequency of streaming Twitter data to see how popular a word is.
Hands‐on activity:
2. Querying documents in MongoDB
Details of the hands‐on activity: Querying documents in MongoDB
• Find documents in MongoDB with specific field values.
• Filter the results returned by MongoDB queries.
• Count documents in a MongoDB collection and returned by queries.
Hands‐on activity:
3. WordCount in Spark
Details of the hands‐on activity: WordCount in Spark
• Read and write text files to HDFS with Spark
• Perform WordCount with Spark Python
Hands‐on activity:
4. Exploring SparkSQL and Spark DataFrames
Details of the hands‐on activity: SparkSQL and Spark DataFrames
• Access Postgres database tables with SparkSQL
• Filter rows and columns of a Spark DataFrame
• Group and perform aggregate functions on columns in a Spark DataFrame
• Join two Spark Dataframes on a single column
Hands‐on activity:
5. Sensor Data with Spark Streaming
Details of the hands‐on activity: Sensor Data with Spark Streaming
• Read streaming data into Spark
• Create and apply computations over a sliding window of data
Conclusions
Summary
We have covered today:
1. The key characteristics of a data stream and identified the requirements of streaming data
systems.
2. How Data Lakes enable batch processing of streaming data, and explained the difference
between ‘schema‐on‐write’ and ‘schema‐on‐read’.
3. How to retrieve relational data and semi‐structured data with PostgreSQL and MongoDB,
respectively.
4. The Spark stack as a layer diagram as well as the functionality of the components in the Spark
stack.
5. How Spark does in‐memory processing using the RDD abstraction, and explained the inner
workings of the Spark architecture.
6. The steps to create a Spark program and how to interpret a Spark program as a pipeline of
transformations and actions.