
Principles of Distributed Database Systems
TS. Phan Thị Hà

© 2020, M.T. Özsu & P. Valduriez 1
Outline
◼ Introduction
◼ Distributed and Parallel Database Design
◼ Distributed Data Control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multidatabase Systems
◼ Parallel Database Systems
◼ Peer-to-Peer Data Management
◼ Big Data Processing
◼ NoSQL, NewSQL and Polystores
◼ Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
◼ Big Data Processing
❑ Distributed storage systems
❑ Processing platforms
❑ Stream data management
❑ Graph analytics
❑ Data lake

© 2020, M.T. Özsu & P. Valduriez 3


Four Vs

◼ Volume
❑ Increasing data size: petabytes (10^15 bytes) to zettabytes (10^21 bytes)
◼ Variety
❑ Multimodal data: structured, images, text, audio, video
❑ 90% of currently generated data unstructured
◼ Velocity
❑ Streaming data at high speed
❑ Real-time processing
◼ Veracity
❑ Data quality

© 2020, M.T. Özsu & P. Valduriez 4


Big Data Software Stack

© 2020, M.T. Özsu & P. Valduriez 5


Outline
◼ Big Data Processing
❑ Distributed storage systems

© 2020, M.T. Özsu & P. Valduriez 6


Distributed Storage System

Storing and managing data across the nodes of a shared-nothing cluster
◼ Object-based
❑ Object = ⟨oid, data, metadata⟩
❑ Metadata can be different for different objects
❑ Easy to move
❑ Flat object space → billions/trillions of objects
❑ Easily accessed through REST-based API (get/put)
❑ Good for high number of small objects (photos, mail attachments)
◼ File-based
❑ Data in files of fixed- or variable-length records
❑ Metadata-per-file stored separately from file
❑ For large data, a file needs to be partitioned and distributed
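
To make the object abstraction concrete, here is a minimal in-memory sketch (not any real object-store API) of the ⟨oid, data, metadata⟩ model with the flat get/put interface that a REST-based object store exposes; all names are illustrative.

```python
# Minimal sketch of an object store: objects as (oid, data, metadata) in a
# flat object space with a get/put interface. Illustrative only.
import uuid

class ObjectStore:
    def __init__(self):
        self._objects = {}            # flat object space keyed by oid

    def put(self, data, metadata=None):
        oid = str(uuid.uuid4())       # system-generated object id
        self._objects[oid] = (data, metadata or {})
        return oid

    def get(self, oid):
        return self._objects[oid]     # returns (data, metadata)

store = ObjectStore()
oid = store.put(b"...jpeg bytes...", {"type": "photo", "owner": "user1"})
data, meta = store.get(oid)
```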

© 2020, M.T. Özsu & P. Valduriez 7


Google File System (GFS)

◼ Targets shared-nothing clusters of thousands of machines


◼ Targets applications with the following characteristics:
❑ Very large files (several gigabytes)

❑ Mostly read and append workloads
❑ High throughput more important than low latency

◼ Interface: create, open, read, write, close, delete, snapshot, record append

© 2020, M.T. Özsu & P. Valduriez 8


Outline
◼ Big Data Processing

❑ Processing platforms

© 2020, M.T. Özsu & P. Valduriez 9


Big Data Processing Platforms

◼ Applications that do not need full DBMS functionality


❑ Data analysis of very large data sets
❑ Highly dynamic, irregular, schemaless, …
◼ “Embarrassingly parallel problems”
◼ MapReduce/Spark
◼ Advantages
❑ Flexibility
❑ Scalability
❑ Efficiency
❑ Fault-tolerance
◼ Disadvantages
❑ Reduced functionality
❑ Increased programming effort

© 2020, M.T. Özsu & P. Valduriez 10


MapReduce Basics

◼ Simple programming model


❑ Data structured as (key, value) pairs
◼ E.g. (doc-id, content); (word, count)
❑ Functional programming style with two functions
◼ map(k1, v1) → list(k2, v2)
◼ reduce(k2, list(v2)) → list(v3)
◼ Implemented on a distributed file system (e.g. Google
File System) on very large clusters
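
As an illustration of the model, here is a minimal word-count sketch in Python (not the Hadoop API); the run() driver only simulates the grouping-by-key that the MapReduce library performs between the two phases.

```python
# Word count in the MapReduce style (illustrative sketch, not the Hadoop API).
from collections import defaultdict

def map_fn(doc_id, content):             # map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in content.split()]

def reduce_fn(word, counts):             # reduce(k2, list(v2)) -> list(v3)
    return [(word, sum(counts))]

# The MapReduce library groups intermediate values by key between the phases:
def run(documents):
    intermediate = defaultdict(list)
    for doc_id, content in documents.items():
        for k, v in map_fn(doc_id, content):
            intermediate[k].append(v)
    return [out for k, vs in intermediate.items() for out in reduce_fn(k, vs)]

print(run({"d1": "big data big ideas", "d2": "big clusters"}))
# [('big', 3), ('data', 1), ('ideas', 1), ('clusters', 1)]
```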

© 2020, M.T. Özsu & P. Valduriez 11


map Function
◼ User-defined function
❑ Processes input (key, value) pairs
❑ Produces a set of intermediate (key, value) pairs
❑ Executes on multiple machines (called mapper)
◼ map function I/O
❑ Input: read a chunk from distributed file system (DFS)
❑ Output: Write to intermediate file on local disk
◼ MapReduce library
❑ Execute map function
❑ Groups together all intermediate values with same key
❑ Passes these lists to reduce function
◼ Effect of map function
❑ Processes and partitions input data
❑ Builds a distributed map (transparent to user)
❑ Similar to “group by” operator in SQL

© 2020, M.T. Özsu & P. Valduriez 12


reduce Function

◼ User-defined function
❑ Accepts one intermediate key and a set of values for that key
(i.e. a list)
❑ Merges these values together to form a (possibly) smaller set
❑ Computes the reduce function generating, typically, zero or one
output per invocation
❑ Executes on multiple machines (called reducer)
◼ reduce function I/O
❑ Input: read from intermediate files using remote reads on local
files of corresponding mappers
❑ Output: Write result back to DFS
◼ Effect of reduce function
❑ Similar to aggregation function in SQL

© 2020, M.T. Özsu & P. Valduriez 13


Example

Consider EMP(ENO,ENAME,TITLE,CITY)

SELECT CITY, COUNT(*)
FROM EMP
WHERE ENAME LIKE "%Smith"
GROUP BY CITY

map (Input: (TID, EMP tuple), Output: (CITY, 1))
    if EMP.ENAME like "%Smith" return (CITY, 1)

reduce (Input: (CITY, list(1)), Output: (CITY, SUM(list)))
    return (CITY, SUM(list))

© 2020, M.T. Özsu & P. Valduriez 14


MapReduce Processing

© 2020, M.T. Özsu & P. Valduriez 15


Hadoop Stack

© 2020, M.T. Özsu & P. Valduriez 16


Master-Worker Architecture

© 2020, M.T. Özsu & P. Valduriez 17


Execution Flow

From: J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters, Comm. ACM, 51(1), 2008.

© 2020, M.T. Özsu & P. Valduriez 18


High-Level MapReduce Languages

◼ Declarative
❑ HiveQL
❑ Tenzing
❑ JAQL
◼ Data flow
❑ Pig Latin
◼ Procedural
❑ Sawzall
◼ Java Library
❑ FlumeJava

© 2020, M.T. Özsu & P. Valduriez 19


MapReduce Implementations of DB Ops

◼ Select and Project can be easily implemented in the


map function
◼ Aggregation is not difficult (see next slide)
◼ Join requires more work

© 2020, M.T. Özsu & P. Valduriez 20


Aggregation

© 2020, M.T. Özsu & P. Valduriez 21


𝜃-Join

Baseline implementation of R(A,B) ⨝ S(B,C)


1) Partition R (and, similarly, S) and assign each partition to mappers
2) Each mapper takes ⟨a,b⟩ tuples of R and converts them to a
list of key-value pairs of the form (b, ⟨a,R⟩); S tuples ⟨b,c⟩ become (b, ⟨c,S⟩)
3) Each reducer pulls the pairs with the same key
4) Each reducer joins tuples of R with tuples of S
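
A minimal Python simulation of this baseline repartition join (illustrative only, not the book's code): map tags each tuple with its source relation, the simulated shuffle groups by the join attribute B, and each reduce call joins the R and S tuples that share a key.

```python
# Repartition (equi-)join R(A,B) ⨝ S(B,C) in the MapReduce style (sketch only).
from collections import defaultdict
from itertools import product

def map_R(a, b):
    return (b, ('R', a))              # key = join attribute B, value tagged with source

def map_S(b, c):
    return (b, ('S', c))

def reduce_join(b, tagged_values):
    r_vals = [v for tag, v in tagged_values if tag == 'R']
    s_vals = [v for tag, v in tagged_values if tag == 'S']
    return [(a, b, c) for a, c in product(r_vals, s_vals)]

# Simulated shuffle: group all tagged values by key, then reduce per key.
R = [(1, 'x'), (2, 'y')]
S = [('x', 10), ('x', 20), ('z', 30)]
groups = defaultdict(list)
for a, b in R:
    k, v = map_R(a, b); groups[k].append(v)
for b, c in S:
    k, v = map_S(b, c); groups[k].append(v)
result = [t for k, vs in groups.items() for t in reduce_join(k, vs)]
print(result)    # [(1, 'x', 10), (1, 'x', 20)]
```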

© 2020, M.T. Özsu & P. Valduriez 22


𝜃-Join (𝜃 is =)

◼ Repartition join

© 2020, M.T. Özsu & P. Valduriez 23


𝜃-Join (𝜃 is ≠)

© 2020, M.T. Özsu & P. Valduriez 24


MapReduce Iterative Computation

© 2020, M.T. Özsu & P. Valduriez 25


Problems with Iteration

◼ MapReduce workflow model is acyclic


❑ Iteration: Intermediate results have to be written to HDFS after
each iteration and read again
◼ At each iteration, no guarantee that the same job is
assigned to the same compute node
❑ Invariant files cannot be locally cached
◼ Check for fixpoint
❑ At the end of each iteration, another job is needed

© 2020, M.T. Özsu & P. Valduriez 26


Spark

◼ Addresses MapReduce shortcomings


◼ Data sharing abstraction: Resilient Distributed Dataset
(RDD)
1) Cache working set (i.e. RDDs) so no writing-to/reading-
from HDFS
2) Assign partitions to the same machine across iterations
3) Maintain lineage for fault-tolerance
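
A short PySpark sketch of points 1)–3); the file path and the update logic are made up for illustration. The invariant input is cached once and then reused by every iteration instead of being re-read from HDFS.

```python
# PySpark sketch: cache the invariant working set once and reuse it across
# iterations instead of re-reading from HDFS each time (path is illustrative).
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Invariant input: read and parsed once, then cached on the workers so that
# later iterations neither re-read HDFS nor re-parse the data.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split()]) \
           .cache()

w = 1.0                                    # some iteratively refined parameter
for i in range(10):
    # Each pass is a new job over the same cached RDD; lineage is kept for
    # fault tolerance, so lost partitions can be recomputed from HDFS.
    grad = points.map(lambda p: p[0] * w).sum()
    w -= 0.01 * grad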

© 2020, M.T. Özsu & P. Valduriez 27


Spark Programming Model
[Zaharia et al., 2010, 2012]

(Flowchart) Programming flow: read input from HDFS and create an RDD
(created from HDFS files or parallelized arrays; partitioned across worker
machines; may be made persistent lazily). Optionally cache the RDD. Each
transform generates a new RDD that may also be cached or processed; when no
further transforms are needed, process the RDD and write results back to HDFS.

© 2020, M.T. Özsu & P. Valduriez 28


Outline
◼ Big Data Processing

❑ Stream data management


© 2020, M.T. Özsu & P. Valduriez 29


Traditional DBMS vs Streaming
DBMS                     Streaming
Transient queries        Transient data
Persistent data          Persistent queries
One-time results         Continuous results

◼ Other differences (on the streaming side)
❑ Push-based (data-driven)
❑ Persistent queries
❑ Unbounded stream
❑ System conditions may not be stable

© 2020, M.T. Özsu & P. Valduriez 30


History

◼ Data Stream Management System (DSMS)


❑ Typical DBMS functionality, primarily query language
❑ Earlier systems: STREAM, Gigascope, TelegraphCQ, Aurora,
Borealis
❑ Mostly single machine (except Borealis)
◼ Data Stream Processing System (DSPS)
❑ Do not embody DBMS functionality
❑ Later systems: Apache Storm, Heron, Spark Streaming, Flink,
MillWheel, TimeStream
❑ Almost all are distributed/parallel systems
◼ Use Data Stream System (DSS) when the distinction is
not important
© 2020, M.T. Özsu & P. Valduriez 31
DSMS Architecture

© 2020, M.T. Özsu & P. Valduriez 32


Stream Data Model

◼ Standard def: An append-only sequence of timestamped


items that arrive in some order
◼ Relaxations
❑ Revision tuples
❑ Sequence of events that are reported continually
(publish/subscribe systems)
❑ Sequence of sets of elements (bursty arrivals)
◼ Typical arrival:
〈timestamp, payload〉
❑ Payload changes based on system
◼ Relational: tuple
◼ Graph: edge
◼ …

© 2020, M.T. Özsu & P. Valduriez 33


Processing Models

◼ Continuous
❑ Each new arrival is processed as soon as it arrives in the
system.
❑ Examples: Apache Storm, Heron
◼ Windowed
❑ Arrivals are batched in windows and executed as a batch.
❑ For the user, recently arrived data may be more interesting and
useful.
❑ Examples: Aurora, STREAM, Spark Streaming

© 2020, M.T. Özsu & P. Valduriez 34


Window Definition

◼ According to the direction of endpoint movement


❑ Fixed window: both endpoints are fixed
❑ Sliding window: both endpoints can slide (backward or forward)
❑ Landmark window: one endpoint fixed, the other sliding
◼ According to definition of window size
❑ Logical window (time-based) – window length measured in time
units
❑ Physical window (count-based) – window length measured in
number of data items
❑ Partitioned window: split a window into multiple count-based
windows
❑ Predicate window: arbitrary predicate defines the contents of the
window
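
A small Python sketch of a logical (time-based) sliding window with explicit size and slide parameters; illustrative only, since real systems express this in their query language.

```python
# Sketch of a logical (time-based) sliding window: size = 10 s, slide = 5 s.
from collections import deque

class SlidingWindow:
    def __init__(self, size, slide):
        self.size, self.slide = size, slide
        self.next_emit = slide
        self.items = deque()               # (timestamp, payload) pairs

    def insert(self, ts, payload):
        self.items.append((ts, payload))
        results = []
        while ts >= self.next_emit:        # the window endpoints slide forward
            lo = self.next_emit - self.size
            while self.items and self.items[0][0] < lo:
                self.items.popleft()       # expire items that fell out of the window
            results.append([p for t, p in self.items if t <= self.next_emit])
            self.next_emit += self.slide
        return results                     # one result per window that closed

w = SlidingWindow(size=10, slide=5)
for ts in [1, 3, 6, 9, 12]:
    out = w.insert(ts, f"item@{ts}")
    if out:
        print(ts, out)
```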

© 2020, M.T. Özsu & P. Valduriez 35


Stream Query Models

◼ Queries are typically persistent


◼ They may be monotonic or non-monotonic
◼ Monotonic: result set always grows
❑ Results can be updated incrementally
❑ Answer is continuous, append-only stream of results
❑ Results may be removed from the answer only by explicit
deletions (if allowed)
◼ Non-monotonic: some answers in the result set become
invalid with new arrivals
❑ Recomputation may be necessary

© 2020, M.T. Özsu & P. Valduriez 36


Stream Query Languages

◼ Declarative
❑ SQL-like syntax, stream-specific semantics
❑ Examples: CQL, GSQL, StreaQuel
◼ Procedural
❑ Construct queries by defining an acyclic graph of operators
❑ Example: Aurora
◼ Windowed languages
❑ size: window length
❑ slide: how frequently the window moves
❑ E.g.: size=10min, slide=5sec
◼ Monotonic vs non-monotonic

© 2020, M.T. Özsu & P. Valduriez 37


Streaming Operators

◼ Stateless operators are no problem: e.g., selection, projection
◼ Stateful operators (e.g., nested loop join) are blocking
❑ You need to see the entire inner operand
◼ For some blocking operators, non-blocking versions exist
(symmetric hash join)
◼ Otherwise: windowed execution
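
A minimal sketch of the non-blocking symmetric hash join mentioned above: each arriving tuple is inserted into its side's hash table and immediately probed against the other side, so results are produced incrementally. Illustrative Python, not a particular DSMS operator.

```python
# Symmetric hash join sketch: non-blocking, produces results incrementally as
# tuples arrive on either input (no need to see an entire operand first).
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self):
        self.left_table = defaultdict(list)     # join key -> left tuples
        self.right_table = defaultdict(list)    # join key -> right tuples

    def insert_left(self, key, tup):
        self.left_table[key].append(tup)
        # probe the other side's hash table and emit new matches immediately
        return [(tup, r) for r in self.right_table[key]]

    def insert_right(self, key, tup):
        self.right_table[key].append(tup)
        return [(l, tup) for l in self.left_table[key]]

j = SymmetricHashJoin()
print(j.insert_left('b1', ('r1', 'b1')))    # [] - no right match yet
print(j.insert_right('b1', ('b1', 's1')))   # [(('r1', 'b1'), ('b1', 's1'))]
```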

© 2020, M.T. Özsu & P. Valduriez 38


Query Processing over Streams

◼ Similar to relational, except


❑ persistent queries: registered to the system and continuously
running
❑ data pushed through the query plan, not pulled
◼ Stream query plan

© 2020, M.T. Özsu & P. Valduriez 39


Query Processing Issues

◼ Continuous execution
❑ Each new arrival is processed as soon as the system gets it
❑ E.g. Apache Storm, Heron
◼ Windowed execution
❑ Arrivals are batched and processed as a batch
❑ E.g. Aurora, STREAM, Spark Streaming
◼ More opportunities for multi-query optimization
❑ E.g. Easier to determine shared subplans

© 2020, M.T. Özsu & P. Valduriez 40


Windowed Query Execution

◼ Two events need to be managed


❑ Arrivals
❑ Expirations
◼ System actions depend on operators
❑ E.g. Join generates new result, negation removes previous result
◼ Window movement also affects results
❑ As window moves, some items in the window move out
❑ What to do with the results?
❑ If monotonic, nothing; if non-monotonic, two options
◼ Direct approach
◼ Negative tuple approach

© 2020, M.T. Özsu & P. Valduriez 41


Load Management

◼ Stream arrival rate > processing capability


◼ Load shedding
❑ Random
❑ Semantic
◼ Early drop
❑ All of the downstream operators will benefit
❑ Accuracy may be negatively affected
◼ Late drop
❑ May not reduce the system load much
❑ Allows the shared subplans to be evaluated

© 2020, M.T. Özsu & P. Valduriez 42


Out-of-Order Processing

◼ Assumption: arrivals are in timestamp order


◼ May not hold
❑ Arrival order may not match generation order
❑ Late arrivals → are there no more items, or are they just late?
❑ Multiple sources
◼ Approaches
❑ Built-in slack
❑ Punctuations

© 2020, M.T. Özsu & P. Valduriez 43


Multiquery Optimization

◼ More opportunity since the persistent queries are known


beforehand
❑ Aggregate queries over different window lengths or with different
slide intervals
❑ State and computation may be shared (usual)

© 2020, M.T. Özsu & P. Valduriez 44


Parallel Data Stream Processing

1) Partitioning the incoming stream


2) Execution of the operation on the partition
3) (Optionally) aggregation of the results from multiple
machines

© 2020, M.T. Özsu & P. Valduriez 45


Stream Partitioning

◼ Shuffle (round-robin) partitioning

◼ Hash partitioning
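
A tiny sketch contrasting the two strategies for routing stream items to k parallel operator instances; k and the key extractor are illustrative.

```python
# Sketch: routing stream items to k parallel operator instances.
import itertools

k = 4
rr = itertools.count()

def shuffle_partition(item):
    return next(rr) % k                # round-robin: balances load, ignores keys

def hash_partition(item, key_of=lambda x: x[0]):
    return hash(key_of(item)) % k      # same key always goes to the same instance
```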

© 2020, M.T. Özsu & P. Valduriez 46


Parallel Stream Query Plan

© 2020, M.T. Özsu & P. Valduriez 47


Outline
◼ Big Data Processing

❑ Graph analytics

© 2020, M.T. Özsu & P. Valduriez 48


Property Graph

◼ Graph G=(V, E, DV, DE) where V is set of vertices, E is


set of edges, DV is set of vertex properties, DE is set of
edge properties
◼ Vertices represent entities, edges relationships among
them.
◼ Multigraph: multiple edges between a pair of vertices
◼ Weighted graph: edges have weights
◼ Directed vs undirected

© 2020, M.T. Özsu & P. Valduriez 49


Graph Workloads

Analytical
◼ Multiple iterations
◼ Process each vertex at each iteration
◼ Examples
❑ PageRank
❑ Clustering
❑ Connected components
❑ Machine learning tasks

Online
◼ No iteration
◼ Usually access a portion of the graph
◼ Examples
❑ Reachability
❑ Single-source shortest path
❑ Subgraph matching

© 2020, M.T. Özsu & P. Valduriez 50


PageRank as Analytical Example

A web page is important if it is pointed at by other important web pages.

    PR(P_i) = (1 - d) + d \sum_{P_j \in B_{P_i}} \frac{PR(P_j)}{|F_{P_j}|}

    B_{P_i}: in-neighbors of P_i
    F_{P_i}: out-neighbors of P_i

Let d = 0.85:

    PR(P_2) = 0.15 + 0.85 \left( \frac{PR(P_1)}{2} + \frac{PR(P_3)}{3} \right)

Recursive!...
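
A direct Python sketch of the iterative computation implied by the formula above; the graph, the damping factor d = 0.85, and the iteration count are illustrative.

```python
# Iterative PageRank sketch following the formula above (d = 0.85).
def pagerank(out_neighbors, d=0.85, iterations=20):
    vertices = list(out_neighbors)
    pr = {v: 1.0 for v in vertices}
    # in-neighbors B(P_i) derived from the out-neighbor lists F(P_j)
    in_neighbors = {v: [u for u in vertices if v in out_neighbors[u]]
                    for v in vertices}
    for _ in range(iterations):
        pr = {v: (1 - d) + d * sum(pr[u] / len(out_neighbors[u])
                                   for u in in_neighbors[v])
              for v in pr}
    return pr

# Tiny example: P1 -> {P2, P3}, P2 -> {P3}, P3 -> {P1}
print(pagerank({"P1": {"P2", "P3"}, "P2": {"P3"}, "P3": {"P1"}}))
```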
© 2020, M.T. Özsu & P. Valduriez 51
Graph Partitioning

◼ Graph partitioning is more difficult than relational


partitioning because of edges
◼ Two approaches
❑ Edge-cut or vertex-disjoint
◼ Each vertex assigned to one partition, edges may be replicated
❑ Vertex-cut or edge-disjoint
◼ Each edge is assigned to one partition, vertices may be replicated
◼ Objectives
❑ Allocate each vertex/edge to partitions such that partitions are
mutually exclusive
❑ Partitions are balanced
❑ Minimize edge-/vertex-cuts to minimize communication

© 2020, M.T. Özsu & P. Valduriez 52


Graph Partitioning Formalization

minimize C(P)        (total communication cost due to partitioning)
subject to:

    w(P_i) \le \beta \cdot \frac{\sum_{j=1}^{k} w(P_j)}{k}, \quad \forall i \in \{1, \dots, k\}

where w(P_i) is the abstract cost of processing partition P_i and \beta is the
slackness allowed for unbalanced partitioning.

◼ C(P) and w(P_i) differ for different partitionings

© 2020, M.T. Özsu & P. Valduriez 53


Vertex-Disjoint (Edge-Cut)

◼ Objective is to minimize the number of edge cuts
◼ Objective function

    C(P) = \frac{\sum_{i=1}^{k} |e(P_i, V \setminus P_i)|}{|E|}
    where e(P_i, P_j) = number of edges between P_i and P_j

◼ w(P_i) is defined in terms of the number of vertices per partition

© 2020, M.T. Özsu & P. Valduriez 54


METIS Vertex-Disjoint Partitioning

◼ METIS is a family of algorithms


◼ Usually the gold standard for comparison

Given an initial graph G0 = (V, E):


1) Produce a hierarchy of successively coarsened graphs
G1, …, Gn such that |V(Gi)| > |V(Gj)| for i < j
2) Partition Gn using some partitioning algorithm
❑ Small enough that it won’t matter what algorithm is used
3) Iteratively uncoarsen Gn back to G0, and at each step
a) Project the partitioning solution of graph Gj onto graph Gj-1
b) Improve (refine) the partitioning of Gj-1

© 2020, M.T. Özsu & P. Valduriez 55


Vertex-Disjoint Partitioning Example

© 2020, M.T. Özsu & P. Valduriez 56


Edge-Disjoint Partitioning

◼ Vertex-disjoint partitioning performs
❑ well for graphs with low-degree vertices
❑ poorly on power-law graphs causing many edge-cuts
◼ Edge-disjoint (vertex-cut) better for these
❑ Put each edge in one partition
❑ Vertices may need to be replicated – minimize these
◼ Objective function

    C(P) = \frac{\sum_{v \in V} |A(v)|}{|V|}
    where A(v) \subseteq \{P_1, \dots, P_k\} is the set of partitions in which v exists

◼ w(P_i) is the number of edges in partition P_i

© 2020, M.T. Özsu & P. Valduriez 57


Edge-Disjoint Alternatives

◼ Hashing (on the ids of the two vertices incident on edge)


❑ Fast and highly parallelizable
❑ Gives good balance
❑ But may lead to high vertex replication
◼ Heuristics cognizant of graph characteristics
❑ Greedy: decide on the allocation of edge i+1 based on the allocation of
the previous i edges, to minimize vertex replication
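
A sketch of the hashing alternative (illustrative Python): each edge is assigned to exactly one partition by hashing its endpoint ids, while A(v) tracks where each vertex is replicated, giving the replication measure used in the objective function above.

```python
# Hash-based edge-disjoint (vertex-cut) partitioning sketch: each edge goes to
# exactly one partition; A(v) records the partitions where vertex v is replicated.
from collections import defaultdict

def hash_partition_edges(edges, k):
    partitions = defaultdict(list)         # partition id -> edges
    A = defaultdict(set)                   # vertex -> set of partitions holding it
    for u, v in edges:
        p = hash((u, v)) % k               # hash on the ids of both endpoints
        partitions[p].append((u, v))
        A[u].add(p)
        A[v].add(p)
    return partitions, A

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
parts, A = hash_partition_edges(edges, k=2)
replication = sum(len(ps) for ps in A.values()) / len(A)   # i.e. C(P) = Σ|A(v)| / |V|
```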

© 2020, M.T. Özsu & P. Valduriez 58


Edge-Disjoint Partitioning Example

© 2020, M.T. Özsu & P. Valduriez 59


Can MapReduce be Used for Graph Analytics?
◼ map and reduce functions can be written for graph
analytics
❑ There are works that have done this
◼ Graph analytics tasks are iterative
❑ Recall: MapReduce is not good for iterative tasks
◼ Spark improves MapReduce (e.g., Hadoop) for iterative
tasks
❑ GraphX on top of Spark
❑ Edge-disjoint partitioning
❑ Vertex table & edge table as RDDs on each worker
❑ Join vertex & edge tables
❑ Perform an aggregation
© 2020, M.T. Özsu & P. Valduriez 60
Special-Purpose Graph Analytics Systems

(Classification figure of special-purpose graph analytics systems; ??? marks categories for which systems do not exist)


© 2020, M.T. Özsu & P. Valduriez 61
Vertex-Centric Model

◼ Computation on a vertex
is the focus
◼ “Think like a vertex”
◼ Vertex computation
depends on its own state
+ states of its neighbors
◼ Compute(vertex v)
◼ GetValue(),
WriteValue()

© 2020, M.T. Özsu & P. Valduriez 62


Partition-centric (Block-centric) Model

◼ Computation on an entire
partition is specified
◼ “Think like a block” or
“Think like a graph”
◼ Aim is to reduce the
communication cost
among vertices

© 2020, M.T. Özsu & P. Valduriez 63


Edge-centric Model

◼ Computation is specified on each edge rather than on each vertex or block
◼ Compute(edge e)

© 2020, M.T. Özsu & P. Valduriez 64


Bulk Synchronous Parallel (BSP) Model

Each machine performs computation on its graph partition; at the end of each
superstep, results are pushed to other workers.

© 2020, M.T. Özsu & P. Valduriez 65


Asynchronous Parallel (AP) Model

◼ Supersteps, but no
communication barriers
◼ Uses the most recent values
◼ Computation in step k may be
based on neighbor states of
step k-1 (if received late) or of
step k
◼ Consistency issues → requires
distributed locking

Consider vertex-centric

© 2020, M.T. Özsu & P. Valduriez 66


Gather-Apply-Scatter (GAS) Model

◼ Similar to BSP, but pull-based


◼ Gather: pull state
◼ Apply: Compute function
◼ Scatter: Update state
◼ Updates of states separated from scheduling

© 2020, M.T. Özsu & P. Valduriez 67


Vertex-Centric BSP Systems

◼ “Think like a vertex”


◼ Compute(vertex v)
◼ BSP Computation – push
state to neighbor vertices
at the end of each
superstep
◼ Continue until all vertices
are inactive
◼ Vertex state machine
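
A hedged Python sketch of this model for connected components (the hash-min label propagation shown on the next slide): Compute() keeps the smallest component id seen and sends it to neighbors only when it changes, while a simple synchronous loop stands in for the BSP runtime. This is not any particular system's API.

```python
# Vertex-centric connected components (hash-min label propagation), with a toy
# synchronous loop standing in for the BSP runtime; not any system's real API.
def compute(vertex_id, state, messages):
    """Keep the smallest component id seen; notify neighbors only if it changed."""
    new_label = min([state["label"]] + messages)
    changed = new_label < state["label"]
    state["label"] = new_label
    targets = state["neighbors"] if changed else []    # otherwise vote to halt
    return targets, changed

def run_bsp(adjacency):
    states = {v: {"label": v, "neighbors": nbrs} for v, nbrs in adjacency.items()}
    # superstep 0: every vertex sends its own label to its neighbors
    inbox = {v: [] for v in adjacency}
    for v, nbrs in adjacency.items():
        for t in nbrs:
            inbox[t].append(v)
    active = set(adjacency)
    while active:                                       # one superstep per pass
        outbox = {v: [] for v in adjacency}
        next_active = set()
        for v in active:
            targets, changed = compute(v, states[v], inbox[v])
            for t in targets:
                outbox[t].append(states[v]["label"])
            if changed:
                next_active.update(targets)
        inbox, active = outbox, next_active
    return {v: s["label"] for v, s in states.items()}

print(run_bsp({1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```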

© 2020, M.T. Özsu & P. Valduriez 68


Connected Components: Vertex-Centric BSP

© 2020, M.T. Özsu & P. Valduriez 69


Vertex-Centric AP Systems

◼ “Think like a vertex”


◼ Compute(vertex v)
◼ Supersteps exist along with
synchronization barriers, but …
◼ Compute(vertex v) can
see msgs sent in the same
superstep or previous one
◼ Consistency of vertex states:
distributed locking
◼ Consistency issues: no
guarantee about input to
Compute()

© 2020, M.T. Özsu & P. Valduriez 70


Connected Components: Vertex-Centric AP

© 2020, M.T. Özsu & P. Valduriez 71


Barrierless Asynchronous Parallel (BAP)

◼ Divides each global superstep into logical supersteps separated by


very lightweight local barriers
◼ Compute() can be executed multiple times in each superstep
(once-per-logical superstep)
◼ Synchronizes at global barriers as in AP

© 2020, M.T. Özsu & P. Valduriez 72


Partition- (Block-)Centric BSP Systems

◼ “Think like a block”; also “think like a graph”


◼ Reduces communication
◼ Exploit the partitioning of the graph
❑ Message exchanges only among blocks; BSP in this case
❑ Within a block, run a serial in-memory algorithm

© 2020, M.T. Özsu & P. Valduriez 73


Connected Components: Block-Centric BSP

© 2020, M.T. Özsu & P. Valduriez 74


Edge-Centric BSP Systems
◼ Dual of vertex-centric BSP
◼ “Think like an edge”
◼ Compute(edge e)
◼ BSP Computation – push
state to neighbor vertices at
the end of each superstep
◼ Continue until all vertices are
inactive
◼ Number of edges ≫ number
of vertices
❑ Fewer msgs, more computation
❑ No random reads

© 2020, M.T. Özsu & P. Valduriez 75


Data Lake

◼ Collection of raw data in native format


❑ Each element has a unique identifier and metadata
❑ For each business question, you can find the relevant data set to
analyze it
◼ Originally based on Hadoop
❑ Enterprise Hadoop

© 2020, M.T. Özsu & P. Valduriez 76


Advantages of a Data Lake

◼ Schema on read
❑ Write the data as they are, read them according to a schema
(e.g., the code of a map function)
❑ More flexibility, multiple views of the same data
◼ Multi-workload data processing
❑ Different types of processing on the same data
❑ Interactive, batch, real time
◼ Cost-effective data architecture
❑ Excellent cost/performance and ROI ratio with SN cluster and
open source technologies

© 2020, M.T. Özsu & P. Valduriez 77


Principles of a Data Lake

◼ Collect all useful data


❑ Raw data, transformed data
◼ Dive from anywhere
❑ Users from different business units can explore and enrich the
data
◼ Flexible access
❑ Different access paths to shared infrastructure
◼ Batch, interactive (OLAP and BI), real-time, search, …

© 2020, M.T. Özsu & P. Valduriez 78


Main Functions

◼ Data management, to store and process large amounts


of data
◼ Data access: interactive, batch, real time, streaming
◼ Governance: load data easily, and manage it according
to a policy implemented by the data steward
◼ Security: authentication, access control, data protection
◼ Platform management: provision, monitoring and
scheduling of tasks (in a cluster)

© 2020, M.T. Özsu & P. Valduriez 79


Data Lake Architecture

A collection of multi-modal data stored in their raw formats

© 2020, M.T. Özsu & P. Valduriez 80


Data Lake vs Data Warehouse

Data Lake
◼ Shorter development process
◼ Schema-on-read
◼ Multiworkload processing
◼ Cost-effective architecture

Data Warehouse
◼ Long development process
◼ Schema-on-write
◼ OLAP workloads
◼ Complex development with ETL

© 2020, M.T. Özsu & P. Valduriez 81
