
Principles of Distributed Database Systems
TS. Phan Thị Hà

© 2020, M.T. Özsu & P. Valduriez 1
Outline
◼ Introduction
◼ Distributed and Parallel Database Design
◼ Distributed Data Control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multidatabase Systems
◼ Parallel Database Systems
◼ Peer-to-Peer Data Management
◼ Big Data Processing
◼ NoSQL, NewSQL and Polystores
◼ Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
◼ Big Data Processing
❑ Distributed storage systems
❑ Processing platforms
❑ Stream data management
❑ Graph analytics
❑ Data lake

© 2020, M.T. Özsu & P. Valduriez 3


Four Vs

◼ Volume
❑ Increasing data size: petabytes (10^15 bytes) to zettabytes (10^21 bytes)
◼ Variety
❑ Multimodal data: structured, images, text, audio, video
❑ 90% of currently generated data unstructured
◼ Velocity
❑ Streaming data at high speed
❑ Real-time processing
◼ Veracity
❑ Data quality

© 2020, M.T. Özsu & P. Valduriez 4


Big Data Software Stack

© 2020, M.T. Özsu & P. Valduriez 5


Outline
◼ Big Data Processing
❑ Distributed storage systems

© 2020, M.T. Özsu & P. Valduriez 6


Distributed Storage System

Storing and managing data across the nodes of a shared-nothing cluster
◼ Object-based
❑ Object = ⟨oid, data, metadata⟩
❑ Metadata can be different for different objects
❑ Easy to move
❑ Flat object space → billions/trillions of objects
❑ Easily accessed through REST-based API (get/put)
❑ Good for high number of small objects (photos, mail attachments)
◼ File-based
❑ Data in files of fixed- or variable-length records
❑ Metadata-per-file stored separately from file
❑ For large data, a file needs to be partitioned and distributed
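
To make the object abstraction concrete, here is a minimal in-memory sketch (not any real object-store API) of the ⟨oid, data, metadata⟩ model with the flat get/put interface that a REST-based object store exposes; all names are illustrative.

```python
# Minimal sketch of an object store: objects as (oid, data, metadata) in a
# flat object space with a get/put interface. Illustrative only.
import uuid

class ObjectStore:
    def __init__(self):
        self._objects = {}            # flat object space keyed by oid

    def put(self, data, metadata=None):
        oid = str(uuid.uuid4())       # system-generated object id
        self._objects[oid] = (data, metadata or {})
        return oid

    def get(self, oid):
        return self._objects[oid]     # returns (data, metadata)

store = ObjectStore()
oid = store.put(b"...jpeg bytes...", {"type": "photo", "owner": "user1"})
data, meta = store.get(oid)
```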

© 2020, M.T. Özsu & P. Valduriez 7


Google File System (GFS)

◼ Targets shared-nothing clusters of thousands of machines


◼ Targets applications with the following characteristics:
❑ Very large files (several gigabytes)

❑ Mostly read and append workloads
❑ High throughput more important than low latency

◼ Interface: create, open, read, write, close, delete, snapshot, record append

© 2020, M.T. Özsu & P. Valduriez 8


Outline
◼ Big Data Processing

❑ Processing platforms

© 2020, M.T. Özsu & P. Valduriez 9


Big Data Processing Platforms

◼ Applications that do not need full DBMS functionality


❑ Data analysis of very large data sets
❑ Highly dynamic, irregular, schemaless, …
◼ “Embarrassingly parallel problems”
◼ MapReduce/Spark
◼ Advantages
❑ Flexibility
❑ Scalability
❑ Efficiency
❑ Fault-tolerance
◼ Disadvantages
❑ Reduced functionality
❑ Increased programming effort

© 2020, M.T. Özsu & P. Valduriez 10


MapReduce Basics

◼ Simple programming model


❑ Data structured as (key, value) pairs
◼ E.g. (doc-id, content); (word, count)
❑ Functional programming style with two functions
◼ map(k1, v1) → list(k2, v2)
◼ reduce(k2, list(v2)) → list(v3)
◼ Implemented on a distributed file system (e.g. Google
File System) on very large clusters
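
As an illustration of the model, here is a minimal word-count sketch in Python (not the Hadoop API); the run() driver only simulates the grouping-by-key that the MapReduce library performs between the two phases.

```python
# Word count in the MapReduce style (illustrative sketch, not the Hadoop API).
from collections import defaultdict

def map_fn(doc_id, content):             # map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in content.split()]

def reduce_fn(word, counts):             # reduce(k2, list(v2)) -> list(v3)
    return [(word, sum(counts))]

# The MapReduce library groups intermediate values by key between the phases:
def run(documents):
    intermediate = defaultdict(list)
    for doc_id, content in documents.items():
        for k, v in map_fn(doc_id, content):
            intermediate[k].append(v)
    return [out for k, vs in intermediate.items() for out in reduce_fn(k, vs)]

print(run({"d1": "big data big ideas", "d2": "big clusters"}))
# [('big', 3), ('data', 1), ('ideas', 1), ('clusters', 1)]
```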

© 2020, M.T. Özsu & P. Valduriez 11


map Function
◼ User-defined function
❑ Processes input (key, value) pairs
❑ Produces a set of intermediate (key, value) pairs
❑ Executes on multiple machines (called mapper)
◼ map function I/O
❑ Input: read a chunk from distributed file system (DFS)
❑ Output: Write to intermediate file on local disk
◼ MapReduce library
❑ Execute map function
❑ Groups together all intermediate values with same key
❑ Passes these lists to reduce function
◼ Effect of map function
❑ Processes and partitions input data
❑ Builds a distributed map (transparent to user)
❑ Similar to “group by” operator in SQL

© 2020, M.T. Özsu & P. Valduriez 12


reduce Function

◼ User-defined function
❑ Accepts one intermediate key and a set of values for that key
(i.e. a list)
❑ Merges these values together to form a (possibly) smaller set
❑ Computes the reduce function generating, typically, zero or one
output per invocation
❑ Executes on multiple machines (called reducer)
◼ reduce function I/O
❑ Input: read from intermediate files using remote reads on local
files of corresponding mappers
❑ Output: Write result back to DFS
◼ Effect of reduce function
❑ Similar to aggregation function in SQL

© 2020, M.T. Özsu & P. Valduriez 13


Example

Consider EMP(ENO,ENAME,TITLE,CITY)

SELECT CITY, COUNT(*)
FROM EMP
WHERE ENAME LIKE "%Smith"
GROUP BY CITY

map (Input: (TID, EMP tuple), Output: (CITY, 1))
    if EMP.ENAME like "%Smith" return (CITY, 1)

reduce (Input: (CITY, list(1)), Output: (CITY, SUM(list)))
    return (CITY, SUM(list))

© 2020, M.T. Özsu & P. Valduriez 14


MapReduce Processing

© 2020, M.T. Özsu & P. Valduriez 15


Hadoop Stack

© 2020, M.T. Özsu & P. Valduriez 16


Master-Worker Architecture

© 2020, M.T. Özsu & P. Valduriez 17


Execution Flow

From: J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters, Comm. ACM, 51(1), 2008.

© 2020, M.T. Özsu & P. Valduriez 18


High-Level MapReduce Languages

◼ Declarative
❑ HiveQL
❑ Tenzing
❑ JAQL
◼ Data flow
❑ Pig Latin
◼ Procedural
❑ Sawzall
◼ Java Library
❑ FlumeJava

© 2020, M.T. Özsu & P. Valduriez 19


MapReduce Implementations of DB Ops

◼ Select and Project can be easily implemented in the


map function
◼ Aggregation is not difficult (see next slide)
◼ Join requires more work

© 2020, M.T. Özsu & P. Valduriez 20


Aggregation

© 2020, M.T. Özsu & P. Valduriez 21


𝜃-Join

Baseline implementation of R(A,B) ⨝ S(B,C)


1) Partition R (and, similarly, S) and assign each partition to mappers
2) Each mapper takes ⟨a,b⟩ tuples of R and converts them to a
list of key-value pairs of the form (b, ⟨a,R⟩); S tuples ⟨b,c⟩ become (b, ⟨c,S⟩)
3) Each reducer pulls the pairs with the same key
4) Each reducer joins tuples of R with tuples of S
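
A minimal Python simulation of this baseline repartition join (illustrative only, not the book's code): map tags each tuple with its source relation, the simulated shuffle groups by the join attribute B, and each reduce call joins the R and S tuples that share a key.

```python
# Repartition (equi-)join R(A,B) ⨝ S(B,C) in the MapReduce style (sketch only).
from collections import defaultdict
from itertools import product

def map_R(a, b):
    return (b, ('R', a))              # key = join attribute B, value tagged with source

def map_S(b, c):
    return (b, ('S', c))

def reduce_join(b, tagged_values):
    r_vals = [v for tag, v in tagged_values if tag == 'R']
    s_vals = [v for tag, v in tagged_values if tag == 'S']
    return [(a, b, c) for a, c in product(r_vals, s_vals)]

# Simulated shuffle: group all tagged values by key, then reduce per key.
R = [(1, 'x'), (2, 'y')]
S = [('x', 10), ('x', 20), ('z', 30)]
groups = defaultdict(list)
for a, b in R:
    k, v = map_R(a, b); groups[k].append(v)
for b, c in S:
    k, v = map_S(b, c); groups[k].append(v)
result = [t for k, vs in groups.items() for t in reduce_join(k, vs)]
print(result)    # [(1, 'x', 10), (1, 'x', 20)]
```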

© 2020, M.T. Özsu & P. Valduriez 22


𝜃-Join (𝜃 is =)

◼ Repartition join

© 2020, M.T. Özsu & P. Valduriez 23


𝜃-Join (𝜃 is ≠)

© 2020, M.T. Özsu & P. Valduriez 24


MapReduce Iterative Computation

© 2020, M.T. Özsu & P. Valduriez 25


Problems with Iteration

◼ MapReduce workflow model is acyclic


❑ Iteration: Intermediate results have to be written to HDFS after
each iteration and read again
◼ At each iteration, no guarantee that the same job is
assigned to the same compute node
❑ Invariant files cannot be locally cached
◼ Check for fixpoint
❑ At the end of each iteration, another job is needed

© 2020, M.T. Özsu & P. Valduriez 26


Spark

◼ Addresses MapReduce shortcomings


◼ Data sharing abstraction: Resilient Distributed Dataset
(RDD)
1) Cache working set (i.e. RDDs) so no writing-to/reading-
from HDFS
2) Assign partitions to the same machine across iterations
3) Maintain lineage for fault-tolerance
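
A short PySpark sketch of points 1)–3); the file path and the update logic are made up for illustration. The invariant input is cached once and then reused by every iteration instead of being re-read from HDFS.

```python
# PySpark sketch: cache the invariant working set once and reuse it across
# iterations instead of re-reading from HDFS each time (path is illustrative).
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Invariant input: read and parsed once, then cached on the workers so that
# later iterations neither re-read HDFS nor re-parse the data.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split()]) \
           .cache()

w = 1.0                                    # some iteratively refined parameter
for i in range(10):
    # Each pass is a new job over the same cached RDD; lineage is kept for
    # fault tolerance, so lost partitions can be recomputed from HDFS.
    grad = points.map(lambda p: p[0] * w).sum()
    w -= 0.01 * grad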

© 2020, M.T. Özsu & P. Valduriez 27


Spark Programming Model
[Zaharia et al., 2010, 2012]

(Flowchart) Programming flow: read input from HDFS and create an RDD
(created from HDFS files or parallelized arrays; partitioned across worker
machines; may be made persistent lazily). Optionally cache the RDD. Each
transform generates a new RDD that may also be cached or processed; when no
further transforms are needed, process the RDD and write results back to HDFS.

© 2020, M.T. Özsu & P. Valduriez 28


Outline
◼ Big Data Processing

❑ Stream data management


© 2020, M.T. Özsu & P. Valduriez 29


Traditional DBMS vs Streaming
DBMS                     Streaming
Transient queries        Transient data
Persistent data          Persistent queries
One-time results         Continuous results

◼ Other differences (on the streaming side)
❑ Push-based (data-driven)
❑ Persistent queries
❑ Unbounded stream
❑ System conditions may not be stable

© 2020, M.T. Özsu & P. Valduriez 30


History

◼ Data Stream Management System (DSMS)


❑ Typical DBMS functionality, primarily query language
❑ Earlier systems: STREAM, Gigascope, TelegraphCQ, Aurora,
Borealis
❑ Mostly single machine (except Borealis)
◼ Data Stream Processing System (DSPS)
❑ Do not embody DBMS functionality
❑ Later systems: Apache Storm, Heron, Spark Streaming, Flink,
MillWheel, TimeStream
❑ Almost all are distributed/parallel systems
◼ Use Data Stream System (DSS) when the distinction is
not important
© 2020, M.T. Özsu & P. Valduriez 31
DSMS Architecture

© 2020, M.T. Özsu & P. Valduriez 32


Stream Data Model

◼ Standard def: An append-only sequence of timestamped


items that arrive in some order
◼ Relaxations
❑ Revision tuples
❑ Sequence of events that are reported continually
(publish/subscribe systems)
❑ Sequence of sets of elements (bursty arrivals)
◼ Typical arrival:
〈timestamp, payload〉
❑ Payload changes based on system
◼ Relational: tuple
◼ Graph: edge
◼ …

© 2020, M.T. Özsu & P. Valduriez 33


Processing Models

◼ Continuous
❑ Each new arrival is processed as soon as it arrives in the
system.
❑ Examples: Apache Storm, Heron
◼ Windowed
❑ Arrivals are batched in windows and executed as a batch.
❑ For the user, recently arrived data may be more interesting and
useful.
❑ Examples: Aurora, STREAM, Spark Streaming

© 2020, M.T. Özsu & P. Valduriez 34


Window Definition

◼ According to the direction of endpoint movement


❑ Fixed window: both endpoints are fixed
❑ Sliding window: both endpoints can slide (backward or forward)
❑ Landmark window: one endpoint fixed, the other sliding
◼ According to definition of window size
❑ Logical window (time-based) – window length measured in time
units
❑ Physical window (count-based) – window length measured in
number of data items
❑ Partitioned window: split a window into multiple count-based
windows
❑ Predicate window: arbitrary predicate defines the contents of the
window
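
A small Python sketch of a logical (time-based) sliding window with explicit size and slide parameters; illustrative only, since real systems express this in their query language.

```python
# Sketch of a logical (time-based) sliding window: size = 10 s, slide = 5 s.
from collections import deque

class SlidingWindow:
    def __init__(self, size, slide):
        self.size, self.slide = size, slide
        self.next_emit = slide
        self.items = deque()               # (timestamp, payload) pairs

    def insert(self, ts, payload):
        self.items.append((ts, payload))
        results = []
        while ts >= self.next_emit:        # the window endpoints slide forward
            lo = self.next_emit - self.size
            while self.items and self.items[0][0] < lo:
                self.items.popleft()       # expire items that fell out of the window
            results.append([p for t, p in self.items if t <= self.next_emit])
            self.next_emit += self.slide
        return results                     # one result per window that closed

w = SlidingWindow(size=10, slide=5)
for ts in [1, 3, 6, 9, 12]:
    out = w.insert(ts, f"item@{ts}")
    if out:
        print(ts, out)
```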

© 2020, M.T. Özsu & P. Valduriez 35


Stream Query Models

◼ Queries are typically persistent


◼ They may be monotonic or non-monotonic
◼ Monotonic: result set always grows
❑ Results can be updated incrementally
❑ Answer is continuous, append-only stream of results
❑ Results may be removed from the answer only by explicit
deletions (if allowed)
◼ Non-monotonic: some answers in the result set become
invalid with new arrivals
❑ Recomputation may be necessary

© 2020, M.T. Özsu & P. Valduriez 36


Stream Query Languages

◼ Declarative
❑ SQL-like syntax, stream-specific semantics
❑ Examples: CQL, GSQL, StreaQuel
◼ Procedural
❑ Construct queries by defining an acyclic graph of operators
❑ Example: Aurora
◼ Windowed languages
❑ size: window length
❑ slide: how frequently the window moves
❑ E.g.: size=10min, slide=5sec
◼ Monotonic vs non-monotonic

© 2020, M.T. Özsu & P. Valduriez 37


Streaming Operators

◼ Stateless operators are no problem: e.g., selection, projection
◼ Stateful operators (e.g., nested loop join) are blocking
❑ You need to see the entire inner operand
◼ For some blocking operators, non-blocking versions exist
(symmetric hash join)
◼ Otherwise: windowed execution
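
A minimal sketch of the non-blocking symmetric hash join mentioned above: each arriving tuple is inserted into its side's hash table and immediately probed against the other side, so results are produced incrementally. Illustrative Python, not a particular DSMS operator.

```python
# Symmetric hash join sketch: non-blocking, produces results incrementally as
# tuples arrive on either input (no need to see an entire operand first).
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self):
        self.left_table = defaultdict(list)     # join key -> left tuples
        self.right_table = defaultdict(list)    # join key -> right tuples

    def insert_left(self, key, tup):
        self.left_table[key].append(tup)
        # probe the other side's hash table and emit new matches immediately
        return [(tup, r) for r in self.right_table[key]]

    def insert_right(self, key, tup):
        self.right_table[key].append(tup)
        return [(l, tup) for l in self.left_table[key]]

j = SymmetricHashJoin()
print(j.insert_left('b1', ('r1', 'b1')))    # [] - no right match yet
print(j.insert_right('b1', ('b1', 's1')))   # [(('r1', 'b1'), ('b1', 's1'))]
```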

© 2020, M.T. Özsu & P. Valduriez 38


Query Processing over Streams

◼ Similar to relational, except


❑ persistent queries: registered to the system and continuously
running
❑ data pushed through the query plan, not pulled
◼ Stream query plan

© 2020, M.T. Özsu & P. Valduriez 39


Query Processing Issues

◼ Continuous execution
❑ Each new arrival is processed as soon as the system gets it
❑ E.g. Apache Storm, Heron
◼ Windowed execution
❑ Arrivals are batched and processed as a batch
❑ E.g. Aurora, STREAM, Spark Streaming
◼ More opportunities for multi-query optimization
❑ E.g. Easier to determine shared subplans

© 2020, M.T. Özsu & P. Valduriez 40


Windowed Query Execution

◼ Two events need to be managed


❑ Arrivals
❑ Expirations
◼ System actions depend on operators
❑ E.g. Join generates new result, negation removes previous result
◼ Window movement also affects results
❑ As window moves, some items in the window move out
❑ What to do with the results?
❑ If monotonic, nothing; if non-monotonic, two options
◼ Direct approach
◼ Negative tuple approach

© 2020, M.T. Özsu & P. Valduriez 41


Load Management

◼ Stream arrival rate > processing capability


◼ Load shedding
❑ Random
❑ Semantic
◼ Early drop
❑ All of the downstream operators will benefit
❑ Accuracy may be negatively affected
◼ Late drop
❑ May not reduce the system load much
❑ Allows the shared subplans to be evaluated

© 2020, M.T. Özsu & P. Valduriez 42


Out-of-Order Processing

◼ Assumption: arrivals are in timestamp order


◼ May not hold
❑ Arrival order may not match generation order
❑ Late arrivals → are there no more items, or are they just late?
❑ Multiple sources
◼ Approaches
❑ Built-in slack
❑ Punctuations

© 2020, M.T. Özsu & P. Valduriez 43


Multiquery Optimization

◼ More opportunity since the persistent queries are known


beforehand
❑ Aggregate queries over different window lengths or with different
slide intervals
❑ State and computation may be shared (usual)

© 2020, M.T. Özsu & P. Valduriez 44


Parallel Data Stream Processing

1) Partitioning the incoming stream


2) Execution of the operation on the partition
3) (Optionally) aggregation of the results from multiple
machines

© 2020, M.T. Özsu & P. Valduriez 45


Stream Partitioning

◼ Shuffle (round-robin) partitioning

◼ Hash partitioning
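
A tiny sketch contrasting the two strategies for routing stream items to k parallel operator instances; k and the key extractor are illustrative.

```python
# Sketch: routing stream items to k parallel operator instances.
import itertools

k = 4
rr = itertools.count()

def shuffle_partition(item):
    return next(rr) % k                # round-robin: balances load, ignores keys

def hash_partition(item, key_of=lambda x: x[0]):
    return hash(key_of(item)) % k      # same key always goes to the same instance
```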

© 2020, M.T. Özsu & P. Valduriez 46


Parallel Stream Query Plan

© 2020, M.T. Özsu & P. Valduriez 47


Outline
◼ Big Data Processing

❑ Graph analytics

© 2020, M.T. Özsu & P. Valduriez 48


Property Graph

◼ Graph G=(V, E, DV, DE) where V is set of vertices, E is


set of edges, DV is set of vertex properties, DE is set of
edge properties
◼ Vertices represent entities, edges relationships among
them.
◼ Multigraph: multiple edges between a pair of vertices
◼ Weighted graph: edges have weights
◼ Directed vs undirected

© 2020, M.T. Özsu & P. Valduriez 49


Graph Workloads

Analytical
◼ Multiple iterations
◼ Process each vertex at each iteration
◼ Examples
❑ PageRank
❑ Clustering
❑ Connected components
❑ Machine learning tasks

Online
◼ No iteration
◼ Usually access a portion of the graph
◼ Examples
❑ Reachability
❑ Single-source shortest path
❑ Subgraph matching

© 2020, M.T. Özsu & P. Valduriez 50


PageRank as Analytical Example

A web page is important if it is pointed at by other important web pages.

    PR(P_i) = (1 - d) + d \sum_{P_j \in B_{P_i}} \frac{PR(P_j)}{|F_{P_j}|}

    B_{P_i}: in-neighbors of P_i
    F_{P_i}: out-neighbors of P_i

Let d = 0.85:

    PR(P_2) = 0.15 + 0.85 \left( \frac{PR(P_1)}{2} + \frac{PR(P_3)}{3} \right)

Recursive!...
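
A direct Python sketch of the iterative computation implied by the formula above; the graph, the damping factor d = 0.85, and the iteration count are illustrative.

```python
# Iterative PageRank sketch following the formula above (d = 0.85).
def pagerank(out_neighbors, d=0.85, iterations=20):
    vertices = list(out_neighbors)
    pr = {v: 1.0 for v in vertices}
    # in-neighbors B(P_i) derived from the out-neighbor lists F(P_j)
    in_neighbors = {v: [u for u in vertices if v in out_neighbors[u]]
                    for v in vertices}
    for _ in range(iterations):
        pr = {v: (1 - d) + d * sum(pr[u] / len(out_neighbors[u])
                                   for u in in_neighbors[v])
              for v in pr}
    return pr

# Tiny example: P1 -> {P2, P3}, P2 -> {P3}, P3 -> {P1}
print(pagerank({"P1": {"P2", "P3"}, "P2": {"P3"}, "P3": {"P1"}}))
```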
© 2020, M.T. Özsu & P. Valduriez 51
Graph Partitioning

◼ Graph partitioning is more difficult than relational


partitioning because of edges
◼ Two approaches
❑ Edge-cut or vertex-disjoint
◼ Each vertex assigned to one partition, edges may be replicated
❑ Vertex-cut or edge-disjoint
◼ Each edge is assigned to one partition, vertices may be replicated
◼ Objectives
❑ Allocate each vertex/edge to partitions such that partitions are
mutually exclusive
❑ Partitions are balanced
❑ Minimize edge-/vertex-cuts to minimize communication

© 2020, M.T. Özsu & P. Valduriez 52


Graph Partitioning Formalization

minimize C(P)        (total communication cost due to partitioning)
subject to:

    w(P_i) \le \beta \cdot \frac{\sum_{j=1}^{k} w(P_j)}{k}, \quad \forall i \in \{1, \dots, k\}

where w(P_i) is the abstract cost of processing partition P_i and \beta is the
slackness allowed for unbalanced partitioning.

◼ C(P) and w(P_i) differ for different partitionings

© 2020, M.T. Özsu & P. Valduriez 53


Vertex-Disjoint (Edge-Cut)

◼ Objective is to minimize the number of edge cuts
◼ Objective function

    C(P) = \frac{\sum_{i=1}^{k} |e(P_i, V \setminus P_i)|}{|E|}
    where e(P_i, P_j) = number of edges between P_i and P_j

◼ w(P_i) is defined in terms of the number of vertices per partition

© 2020, M.T. Özsu & P. Valduriez 54


METIS Vertex-Disjoint Partitioning

◼ METIS is a family of algorithms


◼ Usually the gold standard for comparison

Given an initial graph G0 = (V, E):


1) Produce a hierarchy of successively coarsened graphs
G1, …, Gn such that |V(Gi)| > |V(Gj)| for i < j
2) Partition Gn using some partitioning algorithm
❑ Small enough that it won’t matter what algorithm is used
3) Iteratively uncoarsen Gn back to G0, and at each step
a) Project the partitioning solution of graph Gj onto graph Gj-1
b) Improve (refine) the partitioning of Gj-1

© 2020, M.T. Özsu & P. Valduriez 55


Vertex-Disjoint Partitioning Example

© 2020, M.T. Özsu & P. Valduriez 56


Edge-Disjoint Partitioning

◼ Vertex-disjoint partitioning performs
❑ well for graphs with low-degree vertices
❑ poorly on power-law graphs causing many edge-cuts
◼ Edge-disjoint (vertex-cut) better for these
❑ Put each edge in one partition
❑ Vertices may need to be replicated – minimize these
◼ Objective function

    C(P) = \frac{\sum_{v \in V} |A(v)|}{|V|}
    where A(v) \subseteq \{P_1, \dots, P_k\} is the set of partitions in which v exists

◼ w(P_i) is the number of edges in partition P_i

© 2020, M.T. Özsu & P. Valduriez 57


Edge-Disjoint Alternatives

◼ Hashing (on the ids of the two vertices incident on edge)


❑ Fast and highly parallelizable
❑ Gives good balance
❑ But may lead to high vertex replication
◼ Heuristics cognizant of graph characteristics
❑ Greedy: decide on the allocation of edge i+1 based on the allocation of
the previous i edges, to minimize vertex replication
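
A sketch of the hashing alternative (illustrative Python): each edge is assigned to exactly one partition by hashing its endpoint ids, while A(v) tracks where each vertex is replicated, giving the replication measure used in the objective function above.

```python
# Hash-based edge-disjoint (vertex-cut) partitioning sketch: each edge goes to
# exactly one partition; A(v) records the partitions where vertex v is replicated.
from collections import defaultdict

def hash_partition_edges(edges, k):
    partitions = defaultdict(list)         # partition id -> edges
    A = defaultdict(set)                   # vertex -> set of partitions holding it
    for u, v in edges:
        p = hash((u, v)) % k               # hash on the ids of both endpoints
        partitions[p].append((u, v))
        A[u].add(p)
        A[v].add(p)
    return partitions, A

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
parts, A = hash_partition_edges(edges, k=2)
replication = sum(len(ps) for ps in A.values()) / len(A)   # i.e. C(P) = Σ|A(v)| / |V|
```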

© 2020, M.T. Özsu & P. Valduriez 58


Edge-Disjoint Partitioning Example

© 2020, M.T. Özsu & P. Valduriez 59


Can MapReduce be Used for Graph Analytics?
◼ map and reduce functions can be written for graph
analytics
❑ There are works that have done this
◼ Graph analytics tasks are iterative
❑ Recall: MapReduce is not good for iterative tasks
◼ Spark improves MapReduce (e.g., Hadoop) for iterative
tasks
❑ GraphX on top of Spark
❑ Edge-disjoint partitioning
❑ Vertex table & edge table as RDDs on each worker
❑ Join vertex & edge tables
❑ Perform an aggregation
© 2020, M.T. Özsu & P. Valduriez 60
Special-Purpose Graph Analytics Systems

(Classification figure of special-purpose graph analytics systems; ??? marks categories for which systems do not exist)


© 2020, M.T. Özsu & P. Valduriez 61
Vertex-Centric Model

◼ Computation on a vertex
is the focus
◼ “Think like a vertex”
◼ Vertex computation
depends on its own state
+ states of its neighbors
◼ Compute(vertex v)
◼ GetValue(),
WriteValue()

© 2020, M.T. Özsu & P. Valduriez 62


Partition-centric (Block-centric) Model

◼ Computation on an entire
partition is specified
◼ “Think like a block” or
“Think like a graph”
◼ Aim is to reduce the
communication cost
among vertices

© 2020, M.T. Özsu & P. Valduriez 63


Edge-centric Model

◼ Computation is specified on each edge rather than on each vertex or block
◼ Compute(edge e)

© 2020, M.T. Özsu & P. Valduriez 64


Bulk Synchronous Parallel (BSP) Model

Each machine performs computation on its graph partition; at the end of each
superstep, results are pushed to other workers.

© 2020, M.T. Özsu & P. Valduriez 65


Asynchronous Parallel (AP) Model

◼ Supersteps, but no
communication barriers
◼ Uses the most recent values
◼ Computation in step k may be
based on neighbor states of
step k-1 (if received late) or of
step k
◼ Consistency issues → requires
distributed locking

Consider vertex-centric

© 2020, M.T. Özsu & P. Valduriez 66


Gather-Apply-Scatter (GAS) Model

◼ Similar to BSP, but pull-based


◼ Gather: pull state
◼ Apply: Compute function
◼ Scatter: Update state
◼ Updates of states separated from scheduling

© 2020, M.T. Özsu & P. Valduriez 67


Vertex-Centric BSP Systems

◼ “Think like a vertex”


◼ Compute(vertex v)
◼ BSP Computation – push
state to neighbor vertices
at the end of each
superstep
◼ Continue until all vertices
are inactive
◼ Vertex state machine
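
A hedged Python sketch of this model for connected components (the hash-min label propagation shown on the next slide): Compute() keeps the smallest component id seen and sends it to neighbors only when it changes, while a simple synchronous loop stands in for the BSP runtime. This is not any particular system's API.

```python
# Vertex-centric connected components (hash-min label propagation), with a toy
# synchronous loop standing in for the BSP runtime; not any system's real API.
def compute(vertex_id, state, messages):
    """Keep the smallest component id seen; notify neighbors only if it changed."""
    new_label = min([state["label"]] + messages)
    changed = new_label < state["label"]
    state["label"] = new_label
    targets = state["neighbors"] if changed else []    # otherwise vote to halt
    return targets, changed

def run_bsp(adjacency):
    states = {v: {"label": v, "neighbors": nbrs} for v, nbrs in adjacency.items()}
    # superstep 0: every vertex sends its own label to its neighbors
    inbox = {v: [] for v in adjacency}
    for v, nbrs in adjacency.items():
        for t in nbrs:
            inbox[t].append(v)
    active = set(adjacency)
    while active:                                       # one superstep per pass
        outbox = {v: [] for v in adjacency}
        next_active = set()
        for v in active:
            targets, changed = compute(v, states[v], inbox[v])
            for t in targets:
                outbox[t].append(states[v]["label"])
            if changed:
                next_active.update(targets)
        inbox, active = outbox, next_active
    return {v: s["label"] for v, s in states.items()}

print(run_bsp({1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```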

© 2020, M.T. Özsu & P. Valduriez 68


Connected Components: Vertex-Centric BSP

© 2020, M.T. Özsu & P. Valduriez 69


Vertex-Centric AP Systems

◼ “Think like a vertex”


◼ Compute(vertex v)
◼ Supersteps exist along with
synchronization barriers, but …
◼ Compute(vertex v) can
see msgs sent in the same
superstep or previous one
◼ Consistency of vertex states:
distributed locking
◼ Consistency issues: no
guarantee about input to
Compute()

© 2020, M.T. Özsu & P. Valduriez 70


Connected Components: Vertex-Centric AP

© 2020, M.T. Özsu & P. Valduriez 71


Barrierless Asynchronous Parallel (BAP)

◼ Divides each global superstep into logical supersteps separated by


very lightweight local barriers
◼ Compute() can be executed multiple times in each superstep
(once-per-logical superstep)
◼ Synchronizes at global barriers as in AP

© 2020, M.T. Özsu & P. Valduriez 72


Partition- (Block-)Centric BSP Systems

◼ “Think like a block”; also “think like a graph”


◼ Reduces communication
◼ Exploit the partitioning of the graph
❑ Message exchanges only among blocks; BSP in this case
❑ Within a block, run a serial in-memory algorithm

© 2020, M.T. Özsu & P. Valduriez 73


Connected Components: Block-Centric BSP

© 2020, M.T. Özsu & P. Valduriez 74


Edge-Centric BSP Systems
◼ Dual of vertex-centric BSP
◼ “Think like an edge”
◼ Compute(edge e)
◼ BSP Computation – push
state to neighbor vertices at
the end of each superstep
◼ Continue until all vertices are
inactive
◼ Number of edges ≫ number
of vertices
❑ Fewer msgs, more computation
❑ No random reads

© 2020, M.T. Özsu & P. Valduriez 75


Data Lake

◼ Collection of raw data in native format


❑ Each element has a unique identifier and metadata
❑ For each business question, you can find the relevant data set to
analyze it
◼ Originally based on Hadoop
❑ Enterprise Hadoop

© 2020, M.T. Özsu & P. Valduriez 76


Advantages of a Data Lake

◼ Schema on read
❑ Write the data as they are, read them according to a schema
(e.g., the code of a map function)
❑ More flexibility, multiple views of the same data
◼ Multi-workload data processing
❑ Different types of processing on the same data
❑ Interactive, batch, real time
◼ Cost-effective data architecture
❑ Excellent cost/performance and ROI ratio with SN cluster and
open source technologies

© 2020, M.T. Özsu & P. Valduriez 77


Principles of a Data Lake

◼ Collect all useful data


❑ Raw data, transformed data
◼ Dive from anywhere
❑ Users from different business units can explore and enrich the
data
◼ Flexible access
❑ Different access paths to shared infrastructure
◼ Batch, interactive (OLAP and BI), real-time, search, …

© 2020, M.T. Özsu & P. Valduriez 78


Main Functions

◼ Data management, to store and process large amounts


of data
◼ Data access: interactive, batch, real time, streaming
◼ Governance: load data easily, and manage it according
to a policy implemented by the data steward
◼ Security: authentication, access control, data protection
◼ Platform management: provision, monitoring and
scheduling of tasks (in a cluster)

© 2020, M.T. Özsu & P. Valduriez 79


Data Lake Architecture

A collection of multi-modal data stored in their raw formats

© 2020, M.T. Özsu & P. Valduriez 80


Data Lake vs Data Warehouse

Data Lake
◼ Shorter development process
◼ Schema-on-read
◼ Multiworkload processing
◼ Cost-effective architecture

Data Warehouse
◼ Long development process
◼ Schema-on-write
◼ OLAP workloads
◼ Complex development with ETL

© 2020, M.T. Özsu & P. Valduriez 81
