Chap3_OverviewOfBigDataEcosystem

Chapter 3 of DS 644 provides an overview of the Big Data ecosystem, highlighting the rapid growth of the Big Data market and its key use cases, including data exploration and operational analysis. It discusses the essential computing resources, techniques, and frameworks like Apache Hadoop and Spark that facilitate Big Data processing and analytics. The chapter also addresses the challenges of data visualization and the need for effective management of large datasets across various applications.


DS 644: Introduction to Big Data

Chapter 3. Overview of Big Data Ecosystem

Yijie Zhang
New Jersey Institute of Technology

Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin
at Columbia University

Big Data

Forecast revenue big data market worldwide 2011-2026

Big data market size revenue forecast worldwide from 2011 to 2026 (in billion U.S. dollars)
The Big Data market is exploding, not only in terms of marketing hype, but also in real revenue

Note: Worldwide; 2014 to 2016

Source: Wikibon; ID 254266


Revenue from big data and business analytics
worldwide from 2015 to 2022 (in billion U.S. dollars)

2022: ~4 times more than predicted!


Note(s): Worldwide; 2015 to 2021
Source(s): IDC; ID 551501
Big Data Revenue By Type

http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017

5 Key Big Data Use Case Categories – IBM’s Perspective

• Big Data Exploration: Find, visualize, and understand all big data to improve decision making
• Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Operations Analysis: Analyze a variety of machine data for improved business results
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency

Key Computing Resources for Big Data

• Processing capability: CPU, multi/many-core processor, or node


• Memory
• Storage
• Network

MapReduce, Spark (Computing)

HDFS (Storage)
Name node + Data nodes

“Big Data Analytics”, David Loshin, 2013


Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

➔ These techniques have existed for years to decades. Why did Big Data become hot now?

Why Big Data now?

• More data are being collected and stored
• Open-source code
• Commodity hardware
• Successful applications of data-driven AI and ML techniques, such as the recent GPTs.

The driving force behind big data is quantification of information.

• In the past, you would just go for a morning jog.
• Today, you know it was 7.6km long, you took 11,341 steps and burned 612 calories because of it.

Definition and Characteristics of Big Data

“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
– Gartner, Inc.

which was derived from:

“While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult. E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety. In 2001/02, IT organizations must compile a variety of approaches to have at their disposal for dealing with each.”
– Doug Laney

Comparison of Approaches in Adopting High-Performance Capabilities

“Big Data Analytics”, David Loshin, 2013


Comparison of Data Analytics and Computing Ecosystems

Java, Python, Scala

Spark

Apache Hadoop

The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than relying on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of which may
be prone to failures.

The project includes these modules:

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
• Hadoop YARN (starting from the 2nd generation): A framework for job scheduling and cluster resource management.

http://hadoop.apache.org

Hadoop-related Apache Projects: Hadoop Ecosystem

• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and ability to view MapReduce, Pig and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.

Reading Reference

Big Data (Hadoop) Ecosystem
Big Data Applications/Domains
(Healthcare, insurance, finance, social networks, transportation, sciences, etc.)
Big Data Analytics
(Methods: AI, machine learning, visualization, etc.
Modules: Pig, Hive, Mahout, etc.)
Big Data Computing
(MapReduce, Spark, Storm, Oozie, etc.)
Resource Management and Scheduling
(YARN, Kubernetes, Mesos)
Big Data Management
(RDBMS; NoSQL: Key-Value, Document, Graph, etc.
Systems: SQL, MongoDB, HBase, Cassandra, etc.)
Big Data Storage
(HDFS)
Big Data Networking
(HPN, SDN, etc.)

Hadoop Distributed File System (HDFS)
• HDFS is a Java-based file system that provides scalable, fault-tolerant, cost-efficient storage for big data
• The file content is split into large blocks (typically 128 megabytes), each of which is independently replicated at multiple DataNodes
• The NameNode maintains the namespace tree (in RAM) and the mapping of blocks to DataNodes

http://hortonworks.com/hadoop/hdfs/

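As a rough illustration of the HDFS bullets above, the following Python sketch splits a file into fixed-size blocks and records, in the style of a NameNode, which DataNodes hold each replica. The function names, the DataNode names, the 300 MB file, the replication factor of 3, and the round-robin placement are all assumptions made for this sketch; real HDFS uses its own placement policy and is not implemented this way.

```python
BLOCK_SIZE_MB = 128  # typical HDFS block size mentioned on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return sizes


def place_blocks(num_blocks, datanodes, replication=3):
    """Map block index -> list of DataNodes holding a replica.

    Round-robin placement is a simplification chosen for this sketch.
    """
    replication = min(replication, len(datanodes))
    block_map = {}
    for b in range(num_blocks):
        block_map[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return block_map


sizes = split_into_blocks(300)  # hypothetical 300 MB file
block_map = place_blocks(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
print(sizes)      # sizes of the individual blocks in MB
print(block_map)  # the block-to-DataNode mapping a NameNode would keep
```

The dictionary plays the role of the block-to-DataNode mapping; losing one DataNode still leaves at least two replicas of every block in this toy setup.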
Word Counting: “Hello World” in MapReduce

Basic data structure: (key, value)

http://www.alex-hanna.com

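To make the (key, value) data structure concrete, here is a minimal single-process Python sketch of the word-count flow: map, shuffle, reduce. It only imitates the programming model; it does not use Hadoop, and the function names and sample input are assumptions chosen for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hello World", "Hello MapReduce"])
print(counts)
```

In a real cluster the mapper and reducer calls run in parallel on different nodes and the framework performs the shuffle; the data flow is the same.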
Set Up the Hadoop Environment

• Local (standalone) mode
• Pseudo-distributed mode
• Fully-distributed mode

Setting Up the Hadoop Environment – Pseudo-distributed mode

On the SSH server
authorized_keys: used by the SSH server to store the public keys of clients for client authentication

On the SSH client
known_hosts: used by the SSH client to store the public keys of servers for server authentication

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html

Set Up the Hadoop Environment – Pseudo-distributed mode

Word Count Problem: Hands-on MapReduce Programming Guide
-Configuration
Version: Hadoop 1.2.1
Mode: Pseudo-Distributed Mode
IDE: Eclipse

Word Count Problem
-Input
Locally stored file: SampleTextFile_1000kb.txt

Word Count Problem
-Create a MapReduce Project

Word Count Problem
-Create a class

Word Count Problem
-MapReduce Program
Mapper Function:

Word Count Problem
-MapReduce Program
Reducer Function:

Word Count Problem
-MapReduce Program
Main Function:

Word Count Problem
-Execute MapReduce Program on Eclipse

Word Count Problem
-Execute MapReduce Program on Eclipse
“input” is the parameter that indicates the input directory;
“output” is the parameter that indicates the output directory.

Word Count Problem
-Execute MapReduce Program on Eclipse
When the program is successfully executed:

Word Count Problem
-Check the output

Word Count Problem
-Configuration
Version: Hadoop 1.2.1
Mode: Fully-Distributed Mode
Cloud: Amazon Web Services

Word Count Problem
-Configuration
Three homogeneous VM instances: one master node, and two slave nodes.

Word Count Problem
-Input
Create the input directory in HDFS and upload the input data from local to this directory. The input size is 108MB.

Word Count Problem
-Export JAR file with Eclipse
We use Eclipse to test the program locally (Stand-alone mode or Pseudo-Distributed mode).

If we want to run a MapReduce program in Fully-Distributed Mode on a Hadoop cluster, for example, in a public cloud environment, we can upload the JAR file to the master node of the cluster and execute the program by using the following command:

$ bin/hadoop jar WordCount.jar WordCount /user/user_name/wordcount/input /user/user_name/wordcount/output

Word Count Problem
-Export JAR file with Eclipse

Word Count Problem
-Execution
Execute JAR on the cluster.

Word Count Problem
-Output
Download the output folder from HDFS to the master node.
Find out the size of the output, which is 130MB.

Word Count Problem
-Output Location
Note that by default the data block size is 64MB in Hadoop 1.2.1, and 128MB in Hadoop 2.
Check slave node 1: there is only one block stored on this node.
By checking the first 10 lines of the data block’s contents, we see that the file stored on slave node 1 contains the mapping keys.
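The effect of the default block size can be checked with simple ceiling arithmetic. The sketch below uses the input and output sizes reported in these slides (108MB and 130MB for the Hadoop 1.2.1 run, 235MB and 287MB for the Hadoop 2.6.0 run) and treats each as one file, which is an assumption: a job's output may be written as several part files, and the slides do not state the replication factor, so the counts are not guaranteed to match the per-node listings.

```python
import math


def num_blocks(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed for one file (before replication)."""
    return math.ceil(file_size_mb / block_size_mb)


# (label, file size in MB, default block size in MB) taken from the slides
cases = [
    ("Hadoop 1.2.1 input", 108, 64),
    ("Hadoop 1.2.1 output", 130, 64),
    ("Hadoop 2.6.0 input", 235, 128),
    ("Hadoop 2.6.0 output", 287, 128),
]
block_counts = {label: num_blocks(size, block) for label, size, block in cases}
for label, count in block_counts.items():
    print(f"{label}: {count} block(s)")
```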

Word Count Problem
-Output Location
Check slave node 2: there are two data blocks stored on this node.
By checking the first 10 lines of these two data blocks’ contents, we see that the data blocks stored on slave node 2 contain the output.

Word Count Problem
-Configuration
Version: Hadoop 2.6.0
Mode: Fully-Distributed Mode
Cloud: Amazon Web Services

Word Count Problem
-Configuration
Three homogeneous Virtual Machine instances:
One master node
Two slave nodes

Word Count Problem
-MapReduce Program
Create a WordCount Java program.

Word Count Problem
-MapReduce Program
Mapper Function:

Word Count Problem
-MapReduce Program
Reducer Function:

Word Count Problem
-MapReduce Program
Main Function:

Word Count Problem
-Compile
Compile.

Word Count Problem
-Compile
Export JAR.

Word Count Problem
-Input
Upload the input data from local to the master node.
The input size is 235MB.

Word Count Problem
-Input
Create the input directory in HDFS and place the input file in it.

Word Count Problem
-Execution
Execute JAR file on the cluster.

Word Count Problem
-Execution
Execute successfully.

Word Count Problem
-Output
Find out the size of the output, which is 287MB.

Word Count Problem
-Output Location
Two data blocks on slave 1 (160.42MB).
Three data blocks on slave 2 (365.80MB).

Word Count Problem
-Output Location
Check slave node 1: There are two data blocks stored on this node. The data block size is 128MB.
By checking the first 10 lines of the two data blocks’ contents, we see that both data blocks stored on slave node 1 are the output.

Word Count Problem
-Output Location
Check slave node 2: There are three data blocks stored on this node. The data block size is 128MB.
By checking the first 10 lines of the three data blocks’ contents, we see that the third data block stored on slave node 2 is the output, while the other two store the keys.

Hadoop configuration

❑ Standalone Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Configuration

❑ Pseudo-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

❑ Fully-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

How to install Hadoop on Amazon AWS (step by step):
https://www.youtube.com/watch?v=a-DXDkK1i08

Another useful tutorial:
https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster

Java programming

MapReduce programming with Apache Hadoop:
https://www.javaworld.com/article/2077907/open-source-tools/mapreduce-programming-with-apache-hadoop.html

If you need more details, the following book helps:
https://eecs.wsu.edu/~yinghui/mat/courses/fall%202015/resources/Hadoop%20the%20definitive%20guide.pdf

The following tutorial shows in detail how to use Eclipse to write, compile, execute, and export a .jar file for the word counting problem in Hadoop:
https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial

Apache Spark Built on top of HDFS

Big Data Visualization
• Graph Database
• Visual Analytics

Tree of Life by Dr. Yifan Hu: 76,425 species
The information diffusion graph of the death of Osama bin Laden by Gilad Lotan: 14.8 million tweets
Facebook friendship graph by Paul Butler: 500 million users

Challenging Task:
Squeezing millions and even billions of records into millions of pixels (1600 × 1200 ≈ 2 million pixels)

Visualization Key Challenges

Visual clutter: How can we encode the information intuitively?
Performance issues: How can we render the huge datasets in real time with rich interactions?
Cognition: How can users understand the visual representation when the information is overwhelming?

Platform Dependent Graphical Models
• Homogeneous multicore processors
Intel Xeon E5335 (Clovertown)
AMD Opteron 2347 (Barcelona)
Netezza (FPGA, multicore)
• Homogeneous manycore processors
Sun UltraSPARC T2 (Niagara 2), GPGPU
• Heterogeneous multicore processors
Cell Broadband Engine
• Clusters
HPCC, DataStar, BlueGene, etc.

Graph Workload Types

⚫ Type 1: Computations on graph structures / topologies
  ⚫ Example → converting Bayesian network into junction tree, graph traversal (BFS/DFS), etc.
  ⚫ Characteristics → Poor locality, irregular memory access, limited numeric operations
  [Figure: Bayesian network to junction tree]

⚫ Type 2: Computations on graphs with rich properties
  ⚫ Example → Belief propagation: diffuse information through a graph using statistical models
  ⚫ Characteristics
    ⚫ Locality and memory access pattern depend on vertex models
    ⚫ Typically a lot of numeric operations
    ⚫ Hybrid workload

⚫ Type 3: Computations on dynamic graphs
  ⚫ Example → streaming graph clustering, incremental k-core, etc.
  [Figure: 3-core subgraph]
  ⚫ Characteristics
    ⚫ Poor locality, irregular memory access
    ⚫ Operations to update a model (e.g., cluster, sub-graph)
    ⚫ Hybrid workload
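To give the k-core example some shape, the sketch below computes the k-core of a small static undirected graph by repeatedly peeling vertices whose degree falls below k. It is the plain batch version, not the incremental variant named above, and the graph and function name are hypothetical.

```python
def k_core(adjacency, k):
    """Return the vertex set of the k-core by peeling low-degree vertices."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    removed = set()
    queue = [v for v, d in degree.items() if d < k]
    while queue:
        v = queue.pop()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers the degree of each remaining neighbor
        for u in adjacency[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return set(adjacency) - removed


# Hypothetical undirected graph: a 4-clique (a, b, c, d) with a tail a - e - f
adjacency = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
core3 = k_core(adjacency, 3)
print(sorted(core3))
```

The irregular, neighbor-driven updates in the inner loop are a small-scale view of the poor locality these workloads are known for.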

Large-scale graph benchmark – Graph 500

June 2024, Breadth-First Search (GTEPS)
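GTEPS stands for giga traversed edges per second. The following Python sketch runs a breadth-first search on a tiny hypothetical graph and derives a traversal rate from the number of adjacency entries scanned. The official benchmark defines its own graph generator, edge-counting, and timing rules, none of which are reproduced here; the numbers from this toy run are not comparable to published results.

```python
import time
from collections import deque


def bfs(adjacency, source):
    """Breadth-first search; returns (parent map, adjacency entries scanned)."""
    parent = {source: source}
    edges_scanned = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for u in adjacency[v]:
            edges_scanned += 1
            if u not in parent:
                parent[u] = v
                frontier.append(u)
    return parent, edges_scanned


# Small hypothetical undirected graph stored as adjacency lists
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

start = time.perf_counter()
parent, edges = bfs(graph, 0)
elapsed = time.perf_counter() - start

teps = edges / elapsed if elapsed > 0 else float("inf")
print(f"visited {len(parent)} vertices, scanned {edges} adjacency entries")
print(f"rate on this toy input: {teps / 1e9:.6f} GTEPS")
```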

Common Use Cases for Big Data in Hadoop

• Log Data Analysis
– most common; fits the HDFS scenario perfectly: write once, read often.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond

D. deRoos et al., Hadoop for Dummies, John Wiley & Sons, 2014

Big Data Analytics Example Use Cases

1. Social Network Analysis
2. Recommendation
3. Commerce
4. Financial Analysis
5. Social Media Monitoring
6. Telco Customer Analysis
7. Watson
8. Data Exploration and Visualization
9. Personalized Search
10. Anomaly Detection (Espionage, Sabotage, etc.)
11. Fraud Detection
12. Cybersecurity
13. Sensor Monitoring (Smarter another Planet)
14. Cellular Network Monitoring
15. Cloud Monitoring
16. Code Life Cycle Management
17. Traffic Navigation
18. Image and Video Semantic Understanding
19. Genomic Medicine
20. Brain Network Analysis
21. Data Curation
22. Near Earth Object Analysis

Use Case 1: Social Network Analysis in Enterprise for Productivity
Production Live System used by IBM GBS since 2009 – verified ~$100M contribution
15,000 contributors in 76 countries; 92,000 annual unique IBM users
25,000,000+ emails & SameTime messages (incl. Content features)
1,000,000+ Learning clicks; 14M KnowledgeView, SalesOne, …, access data
1,000,000+ Lotus Connections (blogs, file sharing, bookmark) data
200,000 people’s consulting project & earning data

Dynamic networks of 400,000+ IBMers

[Figure labels: Shortest Paths; Centralities; Graph Search; Social Capital; Bridges; Hubs; Expertise Search; Graph Recomm.]

– On BusinessWeek four times, including being the Top Story of Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributing about 1/3 of GBS Practitioner Portal $228.5 million savings and benefits
– APQC (WW leader in Knowledge Practice) April 2013: “The Industry Leader and Best Practice in Expertise Location”

Use Case 2: Recommendation

item

user

Use Case 3: Recommendation for Commerce
[Figure: Precision Comparison (Number of Triggered Users = 1, Propagation Steps = 1); x-axis: number of recommended users (1 to 4); y-axis: precision (0 to 0.6); legend entries include CF, CF + SP, IF, TIF, EABIF, TEABIF]

[Figure: Recall Comparison (Number of Triggered Users = 1, Propagation Steps = 1); x-axis: number of recommended users (1 to 4); y-axis: recall (0 to 0.14); legend entries include CF, CF + SP, IF, TIF, EABIF, TEABIF]

[Figure labels: Network Info Flow; Innovators; Early adopters; Early majority; Late majority; Laggards; Early adopter; Late adopter; People with similar tastes]

Tests:
– 1 month
– 586 new docs
– 1,170 users

IF: Graphical Information Flow Model
TIF: Joint Topic Detection + Information Flow Model

→ Comparing to Collaborative Filtering (CF) + Similar People
Precision: IF is 91% better, TIF is 108% better
Recall: IF is 87% better, TIF is 113% better

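Precision and recall, the two metrics compared on this slide, can be computed for a ranked list of recommended users as in the sketch below. The user IDs and the ground-truth set are hypothetical, and the function is a generic top-k definition, not necessarily the exact evaluation protocol behind the reported numbers.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked recommendations and ground-truth adopters
recommended = ["u7", "u2", "u9", "u4"]
relevant = {"u2", "u4", "u5"}

for k in range(1, 5):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Precision divides the hits by the number of users recommended, while recall divides them by the number of users who actually adopted, so the two can move in opposite directions as k grows.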
Use Case 4: Graph Analytics for Financial Analysis
Goal: Injecting Network Graph Effects for Financial Analysis. Estimating company performance considering correlated companies, network properties and evolutions, causal parameter analysis, etc.

[Figures: company relationship networks, IBM 2003 and IBM 2009]

▪ Data Source:
– Relationships among 7594 companies, data mining from NYT 1981 ~ 2009

Targets: 20 Fortune companies’ normalized Profits
Goal: Learn from previous 5 years, and predict next year
Model: Support Vector Regression (RBF kernel)

Network feature:
s (current year network feature),
t (temporal network feature),
d (delta value of network feature)
Financial feature:
p (historical profits and revenues)

[Figure: profit (R^2 mean) for different feature sets (s, t, p, d, st, sp, std, tp, dp, stdp); y-axis from 0 to 0.5]

Profit prediction by joint network and financial analysis outperforms network-only by 130% and financial-only by 33%.

Use Case 5: Social Media Monitoring

[Figure labels: IBM CIO monitoring categories; Monitoring filter; Real-Time Translation, Locations, Live Tweets, Sentiment, Keywords; Dynamic Graphs; Zooming / Panning; Top Retweets]

Use Case 6: Customer Social Analysis for Telco
Goal: Extract customer social network behaviors to enable Call Detail Records (CDRs) data monetization for Telco.

▪ Applications based on the extracted social profiles
− Personalized advertisement (beyond the scope of traditional campaign in Telco)
− High value customer identification and targeting
− Viral marketing campaign
▪ Approach
− Construct social graphs from CDRs based on {caller, callee, call time, call duration}
− Extract customer social features (e.g., influence, communities, etc.) from the constructed social graph as customer social profiles
− Build analytics applications (e.g., personalized advertisement) based on the extracted customer social profiles

PoCs with Chinese and Indian telecom companies

[Figure labels: Applications (Personalized Advertisement; High Value Customer Identification & targeting; Viral marketing campaign); Customer Profiles (influence, community, etc.); System G Analysis (Degree Centrality, Weakly Connected Component, Maximal Cliques, Community Detection, Pagerank, K-core); BigInsights; CDR]

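The approach for the Telco use case (build a social graph from {caller, callee, call time, call duration} records, then extract features such as degree centrality and PageRank) can be sketched in a few lines of Python. The records, names, and parameter values below are hypothetical, and the plain power-iteration PageRank is a textbook version, not the System G or BigInsights implementation.

```python
# Hypothetical CDRs: (caller, callee, call time, call duration in seconds)
cdrs = [
    ("alice", "bob", "2013-01-01T09:00", 120),
    ("alice", "carol", "2013-01-01T10:30", 45),
    ("bob", "carol", "2013-01-02T14:10", 300),
    ("dave", "carol", "2013-01-03T08:20", 60),
]


def build_graph(records):
    """Directed call graph: caller -> set of callees."""
    graph = {}
    for caller, callee, _time, _duration in records:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # callees are vertices too
    return graph


def degree_centrality(graph):
    """(in-degree + out-degree) / (n - 1) for every vertex."""
    n = len(graph)
    degree = {v: len(out) for v, out in graph.items()}
    for out in graph.values():
        for u in out:
            degree[u] += 1
    return {v: d / max(n - 1, 1) for v, d in degree.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling mass is spread uniformly."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        dangling = sum(rank[v] for v, out in graph.items() if not out)
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in graph}
        for v, out in graph.items():
            for u in out:
                new_rank[u] += damping * rank[v] / len(out)
        rank = new_rank
    return rank


graph = build_graph(cdrs)
centrality = degree_centrality(graph)
rank = pagerank(graph)
for user in sorted(graph):
    print(f"{user}: degree centrality={centrality[user]:.2f}, "
          f"pagerank={rank[user]:.3f}")
```

A customer profile in this sketch would simply be the pair of scores per user; call time and duration are ignored here but could serve as edge weights.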
Use Case 7: Graph Analytics and Visualization for Watson
[Figure labels: Graph Matching (Query; Matches): headache, chill, migraine, high fever, stomachache, cough; Graph Communities]

Use Case 8: Visualization for Navigation and Exploration

Cluster based huge graph visualization
Query based huge graph visualization

Use Case 9: Graph Search

[Figure labels: existing search engine (query, index, ranking); Graph Search (re-ranking, improved search results); Interest / social network based content recommendations; Info-Socio networks; Graph analysis; query context]

Use Case 10: Anomaly Detection at Multiple Scales

Based on President Executive Order 13587

Goal: System for Detecting and Predicting Abnormal Behaviors in Organization, through large-scale social network & cognitive analytics and data mining, to decrease insider threats such as espionage, sabotage, colleague-shooting, suicide, etc.

“Enterprise Information Leakage Impacted economy and jobs” Feb 2013

“What's emerged is a multibillion dollar detective industry” npr Jan 10, 2013

[Figure labels: Emails; Instant Messaging; Web Access; Executed Processes; Printing; Copying; Log On/Off; Click streams; Feed subscription; Database access; Social sensors capturer; Graph analysis; Behavior analysis; Semantics analysis; Psychological analysis; Multimodality Analysis; Detection, Prediction & Exploration Interface]

Infrastructure + ~ 70 Analytics

Use Case 11: Fraud Detection for Bank
Network Info Flow; Ego Net Features

Ponzi scheme Detection

Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star
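One way to turn the contrast between clique-like ego nets with two-way links and near-star ego nets into numbers is sketched below: the density of links among an account's neighbors, and the share of neighbors linked in both directions. Both features, the function name, and the two toy graphs are assumptions for illustration; the slide does not specify the actual feature set.

```python
def ego_net_features(graph, ego):
    """Return (density among neighbors, fraction of two-way ego links).

    `graph` maps each vertex to the set of vertices it links to (directed).
    """
    out_nbrs = graph.get(ego, set())
    in_nbrs = {v for v, out in graph.items() if ego in out}
    neighbors = (out_nbrs | in_nbrs) - {ego}

    # Undirected links among the neighbors themselves
    links = set()
    for v in neighbors:
        for u in graph.get(v, set()):
            if u in neighbors and u != v:
                links.add(frozenset((u, v)))
    k = len(neighbors)
    possible = k * (k - 1) / 2
    density = len(links) / possible if possible else 0.0

    two_way = sum(1 for v in neighbors if v in out_nbrs and v in in_nbrs)
    reciprocity = two_way / k if k else 0.0
    return density, reciprocity


# Hypothetical "normal" account: everyone links to everyone, both ways
normal = {v: {u for u in "abcd" if u != v} for v in "abcd"}
# Hypothetical "attacker" account: one-way links to otherwise unrelated nodes
star = {"x": {"p", "q", "r", "s"}, "p": set(), "q": set(),
        "r": set(), "s": set()}

normal_features = ego_net_features(normal, "a")
star_features = ego_net_features(star, "x")
print(normal_features, star_features)
```

High density and high reciprocity correspond to the clique-like pattern; values near zero correspond to the near-star pattern.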

Use Case 12: Detecting Cyber Attacks
Network Info Flow; Ego Net Features
Detecting DoS attack

Use Case 13: Smarter another Planet
Goal: Atmospheric Radiation Measurement (ARM) climate research facility provides 24x7 continuous field observations of cloud, aerosol and radiative processes. Graphical models can automate the validation with improved efficiency and performance.

Approach: BN is built to represent the dependence among sensors and replicated across timesteps. BN parameters are learned from over 15 years of ARM climate data to support distributed climate sensor validation. Inference validates sensors in the connected instruments.

Bayesian Network
* 3 timesteps * 63 variables
* 3.9 avg states * 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
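The size of a conditional probability table (CPT) grows with the number of states of a variable and of its parents: one entry per combination. The sketch below counts CPT entries for a tiny hypothetical three-variable network; the variable names and state counts are made up and do not reproduce the figures listed above.

```python
from math import prod


def cpt_entries(num_states, parents):
    """Total CPT entries: own states times the product of parent states."""
    total = 0
    for var, states in num_states.items():
        parent_states = prod(num_states[p] for p in parents.get(var, []))
        total += states * parent_states
    return total


# Hypothetical sensors: temp and humidity both influence cloud
num_states = {"temp": 3, "humidity": 4, "cloud": 2}
parents = {"cloud": ["temp", "humidity"]}

total_entries = cpt_entries(num_states, parents)
print(total_entries)  # 3 + 4 + 2 * (3 * 4)
```

Because the parent terms multiply, a few extra parents or states per variable inflate the tables quickly, which is one reason junction-tree potential tables can be far larger than the CPTs they are built from.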

Use Case 14: Cellular Network Analytics in Telco Operation
Goal: Efficiently and uniquely identify internal state of Cellular/Telco networks (e.g., performance and load of network elements/links) using probes between monitors placed at selected network elements & endhosts

▪ Applied Graph Analytics to telco network analytics based on CDRs (call detail records): estimate traffic load on CSP network with low monitoring overhead
– CDRs, already collected for billing purposes, contain information about voice/data calls
– Traditional NMS* and EMS** typically lack end-to-end visibility and topology across vendors
– Employ graph algorithms to analyze network elements which are not reported by the usage data from CDR information
▪ Approach
– Cellular network comprises a hierarchy of network elements
– Map CDR onto network topology and infer load on each network element using graph analysis
– Estimate network load and localize potential problems

[Figure labels: CDR; Network topology; Graph Analysis; Network load level report]

Use Case 15: Monitoring Large Cloud
Goal: Monitoring technology that can track the time-varying state (e.g., causality relationships between KPIs) of a large Cloud when the processing power of monitoring system cannot keep up with the scale of the system & the rate of change
• Causality relationships (e.g., Granger causality) are crucial in performance monitoring & root cause analysis
• Challenge: easy to test pairwise relationship, but hard to test multi-variate relationship (e.g., a large number of KPIs)

[Figure labels: Network KPIs; Server KPIs; KPI time series (e.g., server performance/load, network performance/load); Causality analyzer; KPI (a time series); (potential) pairwise relationship (e.g., causality); Varying over time]

Our approach: Probabilistic monitoring via sampling & estimation
[Figure labels: Basic analytics engine (e.g., pairwise granger causality); Link sampling & estimation]
Select KPI pairs (sampling) → Test link existence → Estimate unsampled links based on history → Overall graph

Use Case 16: Code Life Cycle Improvement

[Figure: Graph DB model: Graph application, Graph objects, Graph DB. Traditional (relational) model: Graph application, Graph objects, convert to relational / convert from relational, Relational DB]

⚫ Advantages of working directly with graph DB for graph applications
  ⚫ Smaller and simpler code
  ⚫ Flexible schema → easy schema evolution
  ⚫ Code is easier and faster to write, debug and manage
  ⚫ Code and data are easier to transfer and maintain

Use Case 17: Smart Navigation Utilizing Real-time Road Information
Goal: Enable unprecedented level of accuracy in traffic scheduling (for a fleet of transportation vehicles) and navigation of individual cars utilizing the dynamic real-time information of changing road condition and predictive analysis on the data

• Dynamic graph algorithms implemented in System G provide highly efficient graph query computation (e.g., shortest path computation) on time-varying graphs (orders of magnitude improvement over existing solutions)

• High-throughput real-time predictive analytics on graph makes it possible to estimate the future traffic condition on the route to make sure that the decision taken now is optimal overall
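A shortest-path query on a time-varying graph can be illustrated with a small Python sketch: the road network is a weighted graph, a real-time update changes one edge weight, and the query is answered again. The sketch simply reruns Dijkstra's algorithm from scratch after each update, which is an assumption made for brevity; it is not the dynamic algorithm used in System G, and the road names and travel times are hypothetical.

```python
import heapq


def shortest_path_cost(graph, source, target):
    """Dijkstra's algorithm over the current edge weights."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            return d
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in graph.get(v, {}).items():
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return float("inf")


# Hypothetical road network: travel time in minutes per directed segment
roads = {
    "A": {"B": 5, "C": 10},
    "B": {"D": 5},
    "C": {"D": 3},
    "D": {},
}
before = shortest_path_cost(roads, "A", "D")  # best route goes through B
roads["B"]["D"] = 20                           # real-time update: congestion
after = shortest_path_cost(roads, "A", "D")   # best route now goes through C
print(before, after)
```

Recomputing from scratch costs a full search per update; dynamic graph algorithms aim to avoid exactly that by repairing only the affected part of the previous answer.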

Our approach: Querying over dynamic graph + predictive analytics on graph properties

[Figure labels: Historical data; Predictive results; Predictive analytics for graphs; Dynamic Graph query problem; Query & response; Graph store; Real-time update]

Use Case 18: Graph Analysis for Image and Video Analysis

[Figure labels: Vertex Correspondence; Attribute Transformation; ARG s (Ys); ARG t (Yt)]

Use Case 19: Graph Matching for Genomic Medicine

• Ongoing discussions

Use Case 20: Data Curation for Enterprise Data Management

Use Case 21: Understanding Brain Network

Use Case 22: Planet Security
• Big Data on Large-Scale Sky Monitoring

NASA’s DART Mission Hits Asteroid in First-Ever Planetary Defense Test

https://www.nbcnews.com/video/nasa-s-dart-spacecraft-crashes-into-asteroid-149320773570

Questions?
