Big Data Analytics
Vu Pham
What is Big Data?
Big data is a collection of data sets so large and
complex that they become difficult to process using
traditional relational database management systems.
[Infographic: drivers of big data — 25+ TB of log data generated every day; hundreds of millions of GPS-enabled devices sold annually; 2+ billion people on the Web by end of 2011; 76 million smart meters in 2009, 200M by 2014.]
Each row can have its own set of column values. NoSQL
gives better performance in storing massive amounts of
data.
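To make the schema-flexibility point concrete, here is a minimal plain-Python sketch (not tied to any particular NoSQL product; the row keys and column names are made up) of rows that each carry their own set of column values.

# Minimal sketch in plain Python (not a real NoSQL client):
# each row, identified by a row key, carries only the columns it actually has.
rows = {
    "user:1001": {"name": "Asha", "email": "asha@example.com"},
    "user:1002": {"name": "Ravi", "city": "Mumbai", "last_login": "2014-03-01"},
}

for row_key, columns in rows.items():
    print(row_key, columns)   # the two rows expose different column sets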
Master-Slave design
Master Node
Single NameNode for managing metadata
Slave Nodes
Multiple DataNodes for storing data
Other
Secondary NameNode as a backup
Master
Manages filesystem namespace
Maintains the filesystem tree and metadata persistently in
two files: the namespace image and the edit log
Stores locations of blocks, but not persistently
Metadata – inode data and the list of blocks of each
file
Master node
Also known as the NameNode in HDFS
Stores metadata
Might be replicated
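As an illustration of what this metadata looks like, here is a small plain-Python sketch (hypothetical structures, not real HDFS code) of a namespace image with per-file block lists, an append-only edit log, and block locations kept only in memory.

# Hypothetical structures illustrating the NameNode's metadata: the filesystem
# tree with each file's block list is persisted in the namespace image plus the
# edit log, while block locations come from DataNode block reports at runtime.
namespace_image = {
    "/logs/2014-01-01.log": {"replication": 3, "blocks": ["blk_1", "blk_2"]},
    "/logs/2014-01-02.log": {"replication": 3, "blocks": ["blk_3"]},
}
edit_log = []          # append-only record of namespace changes since the last image
block_locations = {}   # blk_id -> list of DataNode hosts, rebuilt at runtime

def create_file(path, blocks, replication=3):
    namespace_image[path] = {"replication": replication, "blocks": blocks}
    edit_log.append(("create", path, blocks))

create_file("/logs/2014-01-03.log", ["blk_4"])
print(namespace_image["/logs/2014-01-03.log"])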
Word count example

Input to each map task: <filename, file text>

MAP TASK 1
Input:
Welcome Everyone
Hello Everyone
Output (key, value) pairs:
(Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)

MAP TASK 2
Input:
Why are you here
I am also here
They are also here
Yes, it's THEM!
The same people we were thinking of
Output (key, value) pairs:
(Why, 1) (Are, 1) (You, 1) (Here, 1) …

REDUCE
Intermediate pairs, grouped by key:
(Welcome, [1]) (Everyone, [1, 1]) (Hello, [1])
Output:
(Welcome, 1) (Everyone, 2) (Hello, 1)
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
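The pseudocode above can be run end to end on a single machine. Below is a small Python rendering of it (illustrative only: a real MapReduce framework distributes the map and reduce tasks and performs the shuffle itself; the sample documents are made up).

# Single-machine Python rendering of the word-count pseudocode above.
from collections import defaultdict

def map_phase(filename, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # sum all counts emitted for this word
    return (word, sum(counts))

documents = {
    "doc1": "Welcome Everyone Hello Everyone",
    "doc2": "Why are you here I am also here",
}

grouped = defaultdict(list)          # stands in for the shuffle/group-by-key step
for name, text in documents.items():
    for word, one in map_phase(name, text):
        grouped[word].append(one)

for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))   # e.g. ('Everyone', 2)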
• Advantages:
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated relationships
• Disadvantages:
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
Row Key. Each row has a unique key, which is a unique identifier for that
row.
Column. Each column contains a name, a value, and a timestamp.
Name. This is the name of the name/value pair.
Value. This is the value of the name/value pair.
Timestamp. This provides the date and time that the data was inserted.
This can be used to determine the most recent version of data.
Examples: BigTable, Cassandra, SimpleDB
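A minimal sketch of such a row, using plain Python rather than a real BigTable/Cassandra/SimpleDB client, with made-up row key and column names:

import time

# One row of a column-family store, modelled with plain dictionaries:
# a unique row key, and columns that each hold a name, a value, and a timestamp.
row = {
    "row_key": "user:1001",
    "columns": {
        "email": {"value": "asha@example.com", "timestamp": time.time()},
        "city":  {"value": "Mumbai",           "timestamp": time.time()},
    },
}

def read_column(row, name):
    # timestamps can be compared across versions to pick the most recent value
    column = row["columns"][name]
    return column["value"], column["timestamp"]

print(read_column(row, "email"))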
• A collection of documents
• Data in this model is stored inside documents.
• A document is a key-value collection where the key
allows access to its value.
• Documents are not typically forced to have a schema
and therefore are flexible and easy to change.
• Documents are stored in collections in order to group
different kinds of data.
• Documents can contain many different key-value pairs,
or key-array pairs, or even nested documents.
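A short sketch of what such a document might look like, expressed as a Python dict (JSON-like); the field names are invented and no particular document database is assumed.

# A single document: a key-value collection that may nest other documents and arrays.
order_document = {
    "_id": "order-7841",                              # key that gives access to the document
    "customer": {"name": "Ravi", "city": "Delhi"},    # nested document
    "items": [                                        # key-array pair
        {"sku": "A12", "qty": 2},
        {"sku": "B07", "qty": 1},
    ],
    "status": "shipped",
}
# Another document in the same collection could omit "items" or add new fields,
# since documents are not forced to share a schema.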
[Diagram: data sharing in MapReduce vs. Spark — each MapReduce step reads its input from HDFS and writes its output back to HDFS, while Spark reads once and keeps intermediate results in an in-memory cache.]
Solution: Resilient Distributed Datasets (RDDs)
Fault Recovery?
Lineage!
Log the coarse-grained operations applied to a
partitioned dataset
Simply recompute the lost partition if failure occurs!
No cost if no failure
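A hedged PySpark sketch of this idea, using a made-up HDFS path and filter condition: the RDD records the coarse-grained transformations as lineage, so a lost cached partition can be recomputed rather than restored from a replica.

from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

# Coarse-grained transformations are recorded as lineage: textFile -> map -> filter.
lines  = sc.textFile("hdfs:///logs/app.log")                 # made-up path
errors = lines.map(lambda l: l.strip()) \
              .filter(lambda l: l.startswith("ERROR"))       # made-up condition
errors.cache()                                               # keep partitions in memory

print(errors.count())            # triggers the computation
print(errors.toDebugString())    # prints the recorded lineage used for recovery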
Introduction to Spark
Control
Partitioning: Spark also gives you control over how your
RDDs are partitioned.
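A small PySpark sketch of explicit partitioning; the data and the choice of four partitions are illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="partitioning-demo")

# Pair RDDs can be partitioned explicitly, so records with the same key
# end up in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(4)          # 4 partitions, default hash partitioner

print(partitioned.getNumPartitions())       # -> 4
print(partitioned.glom().collect())         # see which keys landed in which partition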
Sliding Interval
• Number of blocks (partitions) by which the window slides
forward each time a windowed RDD is computed
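A hedged Spark Streaming sketch, assuming a socket source on localhost:9999 as a placeholder: a 30-second window that slides forward every 10 seconds over 10-second batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc  = SparkContext(appName="window-demo")
ssc = StreamingContext(sc, batchDuration=10)          # 10-second batches

lines    = ssc.socketTextStream("localhost", 9999)    # placeholder source
windowed = lines.window(windowDuration=30, slideDuration=10)   # 3-batch window, slides by 1 batch
windowed.count().pprint()                             # count of records in each window

ssc.start()
ssc.awaitTermination()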
[Diagram: graph applications (web pages, recommendation) and the triplets view — Gather at a vertex expressed as a Group-By over triplets, Apply as a Map, and Scatter as a Join.]
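A conceptual plain-Python sketch of the Gather-Apply-Scatter idea on a triplets view (this is not the GraphX API; the graph, vertex values, and update rule are made up).

from collections import defaultdict

# (src, dst, weight) edges and per-vertex values; the numbers are arbitrary.
edges    = [("B", "A", 1.0), ("C", "A", 2.0), ("A", "C", 0.5)]
vertices = {"A": 0.0, "B": 1.0, "C": 1.0}

def triplets(edges, vertices):
    # a triplet = an edge plus the current values of its source and destination vertices
    return [(src, vertices[src], dst, vertices[dst], w) for src, dst, w in edges]

# Gather: group incoming triplets by destination vertex and combine them
gathered = defaultdict(float)
for src, src_val, dst, dst_val, w in triplets(edges, vertices):
    gathered[dst] += src_val * w

# Apply: map the gathered sums onto new vertex values (keep the old value if no messages)
vertices = {v: gathered.get(v, val) for v, val in vertices.items()}

# Scatter: join the updated vertex values back onto the edges to form new triplets
print(triplets(edges, vertices))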
https://www.guru99.com/cassandra-tutorial.html
http://labs.google.com/papers/mapreduce.html
Big Data
Cloud Computing and Analytics
Distributed Systems