10 - Big Data Systems (Nhom7)
TS. Phan Thị Hà
◼ Volume
❑ Increasing data size: from petabytes (10¹⁵ bytes) to zettabytes (10²¹ bytes)
◼ Variety
❑ Multimodal data: structured, images, text, audio, video
❑ 90% of currently generated data unstructured
◼ Velocity
❑ Streaming data at high speed
❑ Real-time processing
◼ Veracity
❑ Data quality
❑ Processing platforms
◼ User-defined function
❑ Accepts one intermediate key and a set of values for that key (i.e., a list)
❑ Merges these values together to form a (possibly) smaller set
❑ Computes the reduce function, typically generating zero or one output value per invocation
❑ Executes on multiple machines (each called a reducer)
◼ reduce function I/O
❑ Input: read from intermediate files using remote reads on the local files of the corresponding mappers
❑ Output: write results back to the DFS
◼ Effect of reduce function
❑ Similar to an aggregate function in SQL (a minimal sketch follows below)
Consider EMP(ENO,ENAME,TITLE,CITY)
From: J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Comm. ACM, 51(1), 2008.
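To make the SQL analogy concrete, here is a minimal in-memory Python sketch (illustrative, not the paper's code; the single-process driver merely stands in for the framework's shuffle) that counts EMP employees per CITY, the MapReduce analog of SELECT CITY, COUNT(*) FROM EMP GROUP BY CITY:

    from collections import defaultdict

    # map: emit (CITY, 1) for each EMP tuple (ENO, ENAME, TITLE, CITY)
    def map_fn(emp_tuple):
        eno, ename, title, city = emp_tuple
        yield (city, 1)

    # reduce: merge the list of values of one intermediate key into
    # (typically) a single output, like SQL's COUNT(*) ... GROUP BY CITY
    def reduce_fn(city, counts):
        yield (city, sum(counts))

    def run_mapreduce(records):
        groups = defaultdict(list)        # stands in for the shuffle phase
        for rec in records:
            for key, val in map_fn(rec):
                groups[key].append(val)
        out = []
        for key, vals in groups.items():  # each call is one reducer invocation
            out.extend(reduce_fn(key, vals))
        return out

    emp = [(1, "Ann", "Eng", "Paris"), (2, "Bob", "Mgr", "Lyon"),
           (3, "Cal", "Eng", "Paris")]
    print(run_mapreduce(emp))             # [('Paris', 2), ('Lyon', 1)]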
◼ Declarative
❑ HiveQL
❑ Tenzing
❑ JAQL
◼ Data flow
❑ Pig Latin
◼ Procedural
❑ Sawzall
◼ Java Library
❑ FlumeJava
◼ Repartition join
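In a repartition join, the map phase tags each tuple with its source relation and repartitions both inputs on the join key, so all tuples sharing a key meet at the same reducer, which joins them. A minimal Python sketch, assuming a hypothetical ASG(ENO, PNAME) relation joined with EMP on ENO:

    from collections import defaultdict

    def map_phase(relation_name, records, key_index):
        # tag each tuple with its source relation; the join key becomes the
        # intermediate key, so both relations are repartitioned the same way
        for rec in records:
            yield (rec[key_index], (relation_name, rec))

    def reduce_phase(key, tagged):
        # one reducer sees every tuple sharing the join key; split by source
        # relation and emit the joined tuples
        left = [r for (name, r) in tagged if name == "EMP"]
        right = [r for (name, r) in tagged if name == "ASG"]
        for l in left:
            for r in right:
                yield l + r

    emp = [(1, "Ann"), (2, "Bob")]                    # (ENO, ENAME)
    asg = [(1, "ProjA"), (1, "ProjB"), (2, "ProjC")]  # (ENO, PNAME)

    shuffle = defaultdict(list)                       # simulated repartitioning
    for k, v in list(map_phase("EMP", emp, 0)) + list(map_phase("ASG", asg, 0)):
        shuffle[k].append(v)
    for k, tagged in shuffle.items():
        print(list(reduce_phase(k, tagged)))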
[Figure: repeated job pipelines — read from HDFS, Transform, check "RDD?" (reuse cached RDD or not), Process, write back to HDFS]
© M. Tamer Özsu, VLDB Summer School (2015-07-27/28)
◼ Other differences
❑ Push-based (data-driven)
❑ Persistent queries
❑ Unbounded stream
❑ System conditions may not be stable
◼ Continuous
❑ Each new arrival is processed as soon as it arrives in the system.
❑ Examples: Apache Storm, Heron
◼ Windowed
❑ Arrivals are batched into windows and executed as a batch.
❑ For the user, recently arrived data may be more interesting and useful.
❑ Examples: Aurora, STREAM, Spark Streaming
◼ Declarative
❑ SQL-like syntax, stream-specific semantics
❑ Examples: CQL, GSQL, StreaQuel
◼ Procedural
❑ Construct queries by defining an acyclic graph of operators
❑ Example: Aurora
◼ Windowed languages
❑ size: window length
❑ slide: how frequently the window moves
❑ E.g.: size=10min, slide=5sec
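A minimal Python sketch of time-based size/slide semantics (illustrative only, not any particular engine's API): every slide seconds it reports an aggregate over the arrivals of the last size seconds.

    from collections import deque

    def windowed_avg(stream, size, slide):
        """stream yields (timestamp, value) pairs in timestamp order;
        size and slide are in seconds."""
        window, next_fire = deque(), None
        for ts, val in stream:
            window.append((ts, val))
            if next_fire is None:
                next_fire = ts + slide
            while ts >= next_fire:                   # the window slides
                while window and window[0][0] <= next_fire - size:
                    window.popleft()                 # expire old tuples
                vals = [v for (t, v) in window if t <= next_fire]
                if vals:
                    yield (next_fire, sum(vals) / len(vals))
                next_fire += slide

    ticks = [(0, 10), (1, 20), (3, 30), (6, 40), (11, 50)]
    print(list(windowed_avg(ticks, size=10, slide=5)))  # [(5, 20.0), (10, 30.0)]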
◼ Monotonic vs non-monotonic queries (monotonic results only grow as new tuples arrive; non-monotonic results may need retraction)
◼ Windowed execution offers more opportunities for multi-query optimization than continuous execution
❑ E.g., easier to determine shared subplans
◼ Shuffle (round-robin) partitioning
◼ Hash partitioning
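The difference in one illustrative Python sketch: round-robin balances load regardless of tuple content, while hash partitioning routes all tuples with the same key to the same partition, as partitioned joins and aggregates require.

    import itertools

    def shuffle_partition(stream, n):
        # round-robin: even load, but tuples with equal keys scatter
        rr = itertools.cycle(range(n))
        for tup in stream:
            yield (next(rr), tup)

    def hash_partition(stream, n, key):
        # hash: same key -> same partition (deterministic within one run)
        for tup in stream:
            yield (hash(key(tup)) % n, tup)

    stream = [("Paris", 1), ("Lyon", 2), ("Paris", 3)]
    print(list(shuffle_partition(stream, 2)))
    print(list(hash_partition(stream, 2, key=lambda t: t[0])))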
Graph Analytics
◼ Analytical
❑ Multiple iterations
❑ Process each vertex at each iteration
❑ Examples: PageRank, Clustering, Connected components, Machine learning tasks
◼ Online
❑ No iteration
❑ Usually access a portion of the graph
❑ Examples: Reachability, Single-source shortest path, Subgraph matching
$B_{P_i}$: in-neighbors of $P_i$
$F_{P_i}$: out-neighbors of $P_i$
$PR(P_i) = (1 - d) + d \sum_{P_j \in B_{P_i}} \frac{PR(P_j)}{|F_{P_j}|}$
Let $d = 0.85$; then, for example,
$PR(P_2) = 0.15 + 0.85\left(\frac{PR(P_1)}{2} + \frac{PR(P_3)}{3}\right)$
Recursive!
© 2020, M.T. Özsu & P. Valduriez
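The recursion is resolved by iterating to a fixpoint. A minimal Python sketch with d = 0.85; the 4-page graph is illustrative, chosen so that P1 and P3 have out-degrees 2 and 3 as in the example above:

    def pagerank(out_links, d=0.85, iters=20):
        """out_links: vertex -> list of out-neighbors (F_Pi).
        Iterates PR(Pi) = (1-d) + d * sum_{Pj in B_Pi} PR(Pj)/|F_Pj|."""
        pr = {v: 1.0 for v in out_links}
        for _ in range(iters):
            new = {v: 1 - d for v in out_links}
            for src, targets in out_links.items():
                share = pr[src] / len(targets) if targets else 0.0
                for dst in targets:
                    new[dst] += d * share   # each out-neighbor gets an equal share
            pr = new
        return pr

    graph = {"P1": ["P2", "P3"], "P2": ["P1"],
             "P3": ["P1", "P2", "P4"], "P4": ["P3"]}
    print(pagerank(graph))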
Graph Partitioning
$C(P) = \frac{\sum_{i=1}^{k} |e(P_i, V \setminus P_i)|}{|E|}$, where $e(P_i, P_j)$ is the set of edges between $P_i$ and $P_j$
◼ Vertex-disjoint partitioning performs
❑ well for graphs with low-degree vertices
❑ poorly on power-law graphs, causing many edge-cuts
◼ Edge-disjoint (vertex-cut) partitioning is better for these
❑ Put each edge in exactly one partition
❑ Vertices may need to be replicated – minimize these replicas
◼ Objective function
$C(P) = \frac{\sum_{v \in V} |A(v)|}{|V|}$, where $A(v) \subseteq \{P_1, \dots, P_k\}$ is the set of partitions in which $v$ exists
◼ $w(P_i)$ is the number of edges in partition $P_i$
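A small Python sketch computing both objective functions on a toy graph (the partition assignments are illustrative): the edge-cut ratio of a vertex-disjoint partitioning and the average replication factor |A(v)| of an edge-disjoint one.

    def edge_cut_ratio(edges, part_of):
        # vertex-disjoint: C(P) = (# edges crossing partitions) / |E|
        cut = sum(1 for (u, v) in edges if part_of[u] != part_of[v])
        return cut / len(edges)

    def replication_factor(edge_parts):
        # edge-disjoint (vertex-cut): C(P) = sum_v |A(v)| / |V|, where A(v)
        # is the set of partitions holding at least one edge incident to v
        A = {}
        for (u, v), p in edge_parts.items():
            A.setdefault(u, set()).add(p)
            A.setdefault(v, set()).add(p)
        return sum(len(ps) for ps in A.values()) / len(A)

    edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
    print(edge_cut_ratio(edges, {"a": 0, "b": 0, "c": 1, "d": 1}))  # 0.5
    print(replication_factor({("a", "b"): 0, ("b", "c"): 0,
                              ("c", "a"): 1, ("c", "d"): 1}))       # 1.5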
◼ Computation on a vertex is the focus
◼ “Think like a vertex”
◼ Vertex computation depends on its own state plus the states of its neighbors
◼ Compute(vertex v)
◼ GetValue(), WriteValue()
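A hedged Python sketch of this model: Compute and the superstep loop follow the Pregel-style API named above, while the driver, the message passing, and the connected-components example are illustrative assumptions, not a specific system's code.

    class Vertex:
        """Connected components by min-label propagation: a vertex sees only
        its own value plus messages from its neighbors."""
        def __init__(self, vid, neighbors):
            self.id, self.value, self.neighbors = vid, vid, neighbors
            self.active = True

        def compute(self, messages, send, superstep):
            new_value = min([self.value] + messages)
            if superstep == 0 or new_value < self.value:
                self.value = new_value
                for n in self.neighbors:     # communicate only along edges
                    send(n, self.value)
            else:
                self.active = False          # vote to halt

    def run(vertices, max_steps=30):
        inbox = {vid: [] for vid in vertices}
        for step in range(max_steps):        # one iteration = one superstep
            outbox = {vid: [] for vid in vertices}
            send = lambda dst, msg: outbox[dst].append(msg)
            for v in vertices.values():
                if v.active or inbox[v.id]:  # messages reactivate a vertex
                    v.active = True
                    v.compute(inbox[v.id], send, step)
            inbox = outbox
            if not any(inbox.values()):      # no messages in flight: done
                break
        return {vid: v.value for vid, v in vertices.items()}

    graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    print(run({vid: Vertex(vid, nbrs) for vid, nbrs in graph.items()}))
    # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}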
◼ Computation on an entire partition is specified
◼ “Think like a block” or “Think like a graph”
◼ Aim is to reduce the communication cost among vertices
◼ Computation is specified on each edge rather than on each vertex or block
◼ Compute(edge e)
◼ Supersteps, but no communication barriers
◼ Uses the most recent values
◼ Computation in step k may be based on neighbor states from step k-1 (if received late) or from step k
◼ Consistency issues → requires distributed locking
◼ Consider vertex-centric computation
◼ Schema on read
❑ Write the data as they are; read them according to a schema (e.g., the code of the map function)
❑ More flexibility, multiple views of the same data
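Schema-on-read in miniature (illustrative Python): the raw lines are stored untouched, and each reader imposes its own schema at read time inside its map function, yielding multiple views of the same data.

    import json

    raw = ['{"eno": 1, "city": "Paris", "title": "Eng"}',
           '{"eno": 2, "city": "Lyon"}']      # stored as-is, no upfront schema

    def map_city(line):
        rec = json.loads(line)                # schema applied at read time
        return rec["eno"], rec.get("city", "unknown")

    def map_title(line):                      # a different view of the same bytes
        rec = json.loads(line)
        return rec["eno"], rec.get("title", "n/a")

    print([map_city(l) for l in raw])         # [(1, 'Paris'), (2, 'Lyon')]
    print([map_title(l) for l in raw])        # [(1, 'Eng'), (2, 'n/a')]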
◼ Multi-workload data processing
❑ Different types of processing on the same data
❑ Interactive, batch, real time
◼ Cost-effective data architecture
❑ Excellent cost/performance and ROI with a shared-nothing (SN) cluster and open-source technologies