Chapter 3: Overview of Big Data Ecosystem
Yijie Zhang
New Jersey Institute of Technology
Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin
at Columbia University
Big Data
Forecast revenue of the big data market worldwide, 2011-2026
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
5 Key Big Data Use Case Categories – IBM’s Perspective
Key Computing Resources for Big Data
HDFS (Storage)
Name node + Data nodes
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
Why Big Data now?
Definition and Characteristics of Big Data
Comparison of Approaches in Adopting High-Performance Capabilities
Spark
Apache Hadoop
The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than relying on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of which may
be prone to failures.
Big Data (Hadoop) Ecosystem
Big Data Applications/Domains
(Healthcare, insurance, finance, social networks,
transportation, sciences, etc.)
Big Data Analytics
(Methods: AI, machine learning, visualization, etc.
Modules: Pig, Hive, Mahout, etc.)
Big Data Computing
(MapReduce, Spark, Storm, Oozie, etc.)
Resource Management and Scheduling
(YARN, Kubernetes, Mesos)
Big Data Management
(Models: RDBMS, NoSQL (Key-Value, Document, Graph), etc.
Systems: SQL, MongoDB, HBase, Cassandra, etc.)
Big Data Storage
(HDFS)
Big Data Networking
(HPN, SDN, etc.)
Hadoop Distributed File System (HDFS)
• HDFS is a Java-based file system that provides scalable, fault-tolerant, cost-efficient
storage for big data
• The file content is split into large blocks (typically 128 megabytes), each of which is independently
replicated at multiple DataNodes
• The NameNode maintains the namespace tree (in RAM) and the mapping of blocks to DataNodes
http://hortonworks.com/hadoop/hdfs/
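The block splitting and replication described above can be illustrated with a small sketch. It is a toy model, not HDFS code: the function `plan_blocks`, the DataNode names, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy), and the replication factor of 3 is the HDFS default rather than something stated on the slide.

```python
import math

BLOCK_SIZE_MB = 128   # typical block size cited in the bullets above
REPLICATION = 3       # assumption: the HDFS default replication factor


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a toy NameNode-style map: block index -> (size in MB, replica DataNodes)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    mapping = {}
    for b in range(num_blocks):
        # Every block is full-sized except possibly the last one.
        size = min(block_size_mb, file_size_mb - b * block_size_mb)
        # Round-robin placement: a simplification of the real placement policy.
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        mapping[b] = (size, replicas)
    return mapping


# Hypothetical 4-node cluster storing a 235 MB file.
block_map = plan_blocks(235, ["dn1", "dn2", "dn3", "dn4"])
for block_id, (size, replicas) in block_map.items():
    print(block_id, size, replicas)
```

In this model the dictionary plays the role of the NameNode's block-to-DataNode mapping, while the file bytes themselves would live only on the DataNodes.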
Word Counting: “Hello World” in MapReduce
http://www.alex-hanna.com
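The word-counting flow can be sketched in plain Python as three phases: map, shuffle, and reduce. This is a single-process illustration of the programming model under simplifying assumptions (whitespace tokenization, lowercasing); it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["Hello World", "Hello Hadoop"])
print(counts)
```

On a real cluster the mappers and reducers run in parallel on different nodes; only the shape of the computation is the same.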
Set Up the Hadoop Environment
Setting Up the Hadoop Environment – Pseudo-distributed mode
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
Set Up the Hadoop Environment – Pseudo-distributed mode
Word Count Problem: Hands-on MapReduce Programming Guide
-Configuration
Reducer Function:
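A minimal sketch of a word-count reducer, written in Python in the style of Hadoop Streaming rather than with the Java API. It assumes the input is a sequence of `word<TAB>count` lines already sorted by word, which is how a streaming reducer receives mapper output; the function name and sample data are hypothetical.

```python
from itertools import groupby


def reducer(sorted_lines):
    """Sum the counts per word from 'word<TAB>count' lines sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # groupby only merges adjacent equal keys, so the sorted order matters.
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical mapper output after the framework's sort step.
mapper_output = ["hadoop\t1", "hello\t1", "hello\t1", "world\t1"]
reduced = list(reducer(mapper_output))
for word, total in reduced:
    print(f"{word}\t{total}")
```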
Word Count Problem
-MapReduce Program
Main Function:
Word Count Problem
-Compile
Compile.
Word Count Problem
-Compile
Export JAR.
Word Count Problem
-Input
Upload the input data from local to the master node.
The input size is 235MB.
Word Count Problem
-Input
Create the input directory in HDFS and place the input file in it.
Word Count Problem
-Execution
Execute JAR file on the cluster.
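The input and execution steps can be summarized as a short list of shell commands. The sketch below only builds and prints the command lines so that it runs without a Hadoop installation. The subcommands (`hdfs dfs -mkdir -p`, `hdfs dfs -put`, `hadoop jar`, `hdfs dfs -ls`) are standard Hadoop CLI calls, but every path, the JAR name, and the main class name are hypothetical placeholders.

```python
import shlex


def wordcount_commands(local_file, hdfs_input_dir, hdfs_output_dir, jar_path, main_class):
    """Build the command lines for loading input into HDFS and running the job."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_input_dir],            # create input directory
        ["hdfs", "dfs", "-put", local_file, hdfs_input_dir],        # copy local file into HDFS
        ["hadoop", "jar", jar_path, main_class,
         hdfs_input_dir, hdfs_output_dir],                           # run the exported JAR
        ["hdfs", "dfs", "-ls", hdfs_output_dir],                     # list the job output
    ]


# All names below are placeholders, not values from the slides.
commands = wordcount_commands(
    "input.txt", "/user/demo/input", "/user/demo/output",
    "wordcount.jar", "WordCount",
)
for cmd in commands:
    print(shlex.join(cmd))
```

The output directory must not exist before the job starts; MapReduce refuses to overwrite an existing output path.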
Word Count Problem
-Execution
Execute successfully.
Word Count Problem
-Output
❑ Standalone Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Configuration
❑ Pseudo-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
❑ Fully-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
The following tutorial shows in detail how to use Eclipse to write, compile, and execute the word
counting program in Hadoop and export it as a .jar file:
https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial
Apache Spark Built on top of HDFS
Big Data Visualization
• Graph Database
• Visual Analytics
Tree of Life by Dr. Yifan Hu (76,425 species)
The information diffusion graph of the death of Osama bin Laden by Gilad Lotan (14.8 million tweets)
Facebook friendship graph by Paul Butler (500 million users)
Challenging Task:
Visualization Key Challenges
Platform Dependent Graphical Models
• Homogeneous multicore processors
Intel Xeon E5335 (Clovertown)
AMD Opteron 2347 (Barcelona)
Netezza (FPGA, multicore)
• Homogeneous manycore processors
Sun UltraSPARC T2 (Niagara 2), GPGPU
• Heterogeneous multicore processors
Cell Broadband Engine
• Clusters
HPCC, DataStar, BlueGene, etc.
Graph Workload Types
[Figure: conversion of a network to a junction tree; node and clique labels omitted]
Large-scale graph benchmark – Graph 500
Common Use Cases for Big Data in Hadoop
D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Big Data Analytics Example Use Cases
Use Case 1: Social Network Analysis in Enterprise for Productivity
Production Live System used by IBM GBS since 2009 – verified ~$100M contribution
15,000 contributors in 76 countries; 92,000 annual unique IBM users
25,000,000+ emails & SameTime messages (incl. content features)
1,000,000+ Learning clicks; 14M KnowledgeView, SalesOne, …, access data
1,000,000+ Lotus Connections (blogs, file sharing, bookmark) data
200,000 people’s consulting project & earning data
Dynamic networks of 400,000+ IBMers
Graph analytics: Shortest Paths, Centralities, Graph Search, Social Capital, Bridges, Hubs, Expertise Search, Graph Recommendation
– On BusinessWeek four times, including being the Top Story of the Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributing about 1/3 of GBS Practitioner Portal $228.5 million savings and benefits
– APQC (WW leader in Knowledge Practice) April 2013:
“The Industry Leader and Best Practice in Expertise Location”
Use Case 2: Recommendation
item
user
Use Case 3: Recommendation for Commerce
[Figure: Precision Comparison (Number of Triggered Users = 1, Propagation Steps = 1); precision and recall vs. number of recommended users (1 to 4); legend: CF, CF + SP, EABIF, TEABIF, IF, TIF; labels: Network Info Flow, Innovators, Early adopters]
– 1 month
– 586 new docs
– 1,170 users
▪ Data Source:
– Relationships among 7,594 companies, obtained by data mining from NYT, 1981 ~ 2009
Graph Communities
Use Case 8: Visualization for Navigation and Exploration
Use Case 9: Graph Search
[Figure labels: Info-Socio networks, Graph analysis, query context, ranking, re-ranking, Interest / social network based content recommendations]
Use Case 10: Anomaly Detection at Multiple Scales
[Figure: inputs (Emails, Instant Messaging, Web Access, Click streams, Executed Processes, Feed subscription, Printing, Copying, Database access, Log On/Off); Social sensors capturer; Graph analysis, Behavior analysis, Semantics analysis, Psychological analysis; Multimodality Analysis; Detection, Prediction & Exploration Interface]
Infrastructure + ~ 70 Analytics
Use Case 11: Fraud Detection for Bank
Network Info Flow; Ego Net Features
Normal: (1) Clique-like, (2) Two-way links
Attacker: Near-Star
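The contrast between clique-like, two-way ego nets and near-star ego nets can be made concrete with two simple features. The sketch below is an illustration only: the feature definitions (directed density and reciprocity), the function name, and the toy graphs are assumptions, not the features or data used in the original study.

```python
def ego_net_features(edges, ego):
    """Compute density and reciprocity of the ego net around `ego` in a directed graph."""
    edge_set = set(edges)
    # Neighbors are nodes that send to or receive from the ego.
    neighbors = ({v for u, v in edge_set if u == ego}
                 | {u for u, v in edge_set if v == ego})
    nodes = neighbors | {ego}
    ego_edges = {(u, v) for u, v in edge_set if u in nodes and v in nodes}
    n = len(nodes)
    possible = n * (n - 1)  # number of ordered pairs of distinct nodes
    density = len(ego_edges) / possible if possible else 0.0
    mutual = sum(1 for u, v in ego_edges if (v, u) in ego_edges)
    reciprocity = mutual / len(ego_edges) if ego_edges else 0.0
    return {"density": density, "reciprocity": reciprocity}


# Toy "normal" account: four accounts that all transact with each other both ways.
normal_edges = [(u, v) for u in "abcd" for v in "abcd" if u != v]
# Toy "attacker" account: one hub sending one-way links to four unrelated accounts.
attacker_edges = [("x", t) for t in "pqrs"]

normal = ego_net_features(normal_edges, "a")
attacker = ego_net_features(attacker_edges, "x")
print(normal, attacker)
```

A clique-like ego net scores high on both features, while a near-star ego net scores low on both, so a simple threshold on either one separates the two toy cases.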
Use Case 12: Detecting Cyber Attacks
Network Info Flow; Ego Net Features
Detecting DoS attack
Use Case 13: Smarter another Planet
Goal: The Atmospheric Radiation Measurement (ARM) climate research
facility provides 24x7 continuous field observations of cloud, aerosol,
and radiative processes. Graphical models can automate the
validation with improved efficiency and performance.
Bayesian Network
* 3 timesteps * 63 variables
* 3.9 avg states * 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
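The table sizes quoted above follow from counting rule: a node's conditional probability table (CPT) has one entry per combination of its own states and its parents' states, and a clique's potential table (PT) has one entry per joint state of the clique's variables. The sketch below applies these rules to a hypothetical three-variable network; the variable names, state counts, and structure are invented for illustration and are unrelated to the ARM model.

```python
from math import prod


def cpt_entries(states, parents):
    """Total CPT entries: for each node, its states times the product of parent states."""
    return sum(states[v] * prod(states[p] for p in parents[v]) for v in states)


def clique_table_entries(states, cliques):
    """Total potential-table entries: product of state counts within each clique."""
    return sum(prod(states[v] for v in clique) for clique in cliques)


# Hypothetical network: A -> B, A -> C, B -> C.
states = {"A": 2, "B": 3, "C": 4}
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
# Its junction tree has a single clique containing all three variables.
cliques = [("A", "B", "C")]

total_cpt = cpt_entries(states, parents)
total_pt = clique_table_entries(states, cliques)
print(total_cpt, total_pt)
```

Clique tables grow with the product of all variables in a clique, which is why the junction tree above has far more PT entries than the Bayesian network has CPT entries.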
Use Case 14: Cellular Network Analytics in Telco Operation
Goal: Efficiently and uniquely identify internal state of
Cellular/Telco networks (e.g., performance and load of
network elements/links) using probes between monitors
placed at selected network elements & endhosts
▪ Applied Graph Analytics to telco network analytics
based on CDRs (call detail records): estimate
traffic load on CSP network with low monitoring
overhead
– CDRs, already collected for billing purposes, contain
information about voice/data calls
– Traditional NMS* and EMS** typically lack end-to-end
visibility and topology across vendors
– Employ graph algorithms to analyze network
elements which are not reported by the usage data
from CDR information
▪ Approach
– Cellular network comprises a hierarchy of network
elements
[Figure labels: Network load level report, Graph Analysis, Network topology]
Use Case 15: Monitoring Large Cloud
Goal: Monitoring technology that can track the time-varying state
(e.g., causality relationships between KPIs) of a large Cloud when
the processing power of the monitoring system cannot keep up with the
scale of the system & the rate of change
[Figure labels: Network KPIs, Server KPIs]
• Causality relationships (e.g., Granger causality) are crucial in
performance monitoring & root cause analysis
• Challenge: pairwise relationships are easy to test, but multivariate
relationships (e.g., among a large number of KPIs) are hard to test
Graph application
Graph objects
Use Case 17: Smart Navigation Utilizing Real-time Road Information
Goal: Enable an unprecedented level of accuracy in traffic scheduling (for a fleet of
transportation vehicles) and navigation of individual cars, utilizing dynamic real-time
information on changing road conditions and predictive analysis of the data
Our approach: Querying over dynamic graph + predictive analytics on graph properties
[Figure labels: Historical data, Predictive results, Predictive analytics for graphs, Dynamic Graph query problem, Query & response, Graph store, Real-time update]
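Querying over a dynamic graph can be illustrated with a route query that is re-run after a real-time update changes an edge weight. The sketch uses Dijkstra's algorithm as a stand-in; the slides do not say which query algorithm the system uses, and the road network, travel times, and function name below are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over {node: {neighbor: travel_time}}; returns (cost, path)."""
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical road graph with travel times in minutes.
roads = {"A": {"B": 5, "C": 10}, "B": {"D": 5}, "C": {"D": 2}, "D": {}}
before = shortest_path(roads, "A", "D")

# Real-time update: congestion raises the travel time on road A -> B.
roads["A"]["B"] = 20
after = shortest_path(roads, "A", "D")
print(before, after)
```

A predictive variant would replace the current travel times with forecast values before running the same query.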
Use Case 18: Graph Analysis for Image and Video Analysis
[Figure: Vertex Correspondence and Attribute Transformation between ARG s (Ys) and ARG t (Yt)]
Use Case 19: Graph Matching for Genomic Medicine
• Ongoing discussions
Use Case 20: Data Curation for Enterprise Data Management
Use Case 21: Understanding Brain Network
Use Case 22: Planet Security
• Big Data on Large-Scale Sky Monitoring
https://www.nbcnews.com/video/nasa-s-dart-spacecraft-crashes-into-asteroid-149320773570
Questions?