Chap3 OverviewOfBigDataEcosystem

Chapter 3 of DS 644 provides an overview of the Big Data ecosystem, highlighting the rapid growth of the Big Data market and its key use cases, including data exploration and operational analysis. It discusses the essential computing resources, techniques, and frameworks like Apache Hadoop and Spark that facilitate Big Data processing and analytics. The chapter also addresses the challenges of data visualization and the need for effective management of large datasets across various applications.

Uploaded by

Pavan Frustum

DS 644: Introduction to Big Data

Chapter 3. Overview of Big Data Ecosystem

Yijie Zhang
New Jersey Institute of Technology

Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin
at Columbia University

1
Big Data

2
Forecast revenue big data market worldwide 2011-2026

Big data market size revenue forecast worldwide from 2011 to 2026 (in billion U.S. dollars)
The Big Data market is exploding, not only in terms of marketing hype, but also in real revenue

Note: Worldwide; 2014 to 2016

Source: Wikibon; ID 254266

3
Revenue from big data and business analytics
worldwide from 2015 to 2022 (in billion U.S. dollars)

2022: ~4 times more than predicted!

Note(s): Worldwide; 2015 to 2021
Source(s): IDC; ID 551501
Big Data Revenue By Type

http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017

5
5 Key Big Data Use Case Categories – IBM’s Perspective

• Big Data Exploration: Find, visualize, and understand all big data to improve decision making
• Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Operations Analysis: Analyze a variety of machine data for improved business results
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency

6
Key Computing Resources for Big Data

• Processing capability: CPU, multi/many-core processor, or node

• Memory
• Storage
• Network

MapReduce, Spark (Computing)

HDFS (Storage)
Name node + Data nodes

“Big Data Analytics”, David Loshin, 2013

7
Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

➔ These techniques have existed for years or even decades. Why did Big Data become hot now?

8
Why Big Data now?

• More data are being collected and stored

• Open-source code
• Commodity hardware
• Successful applications of data-driven AI and
ML techniques, such as the recent GPTs.

The driving force behind big data is quantification of information.

• In the past, you would just go for a morning jog.
• Today, you know it was 7.6km long, you took 11,341 steps and burned 612 calories because of it.

9
Definition and Characteristics of Big Data

“Big data is high-volume, high-velocity and/or high-variety information assets

that demand cost-effective, innovative forms of information processing that
enable enhanced insight, decision making, and process automation.”
– Gartner, Inc.

which was derived from:

“While enterprises struggle to consolidate systems and collapse redundant

databases to enable greater operational, analytical, and collaborative
consistencies, changing economic conditions have made this job more difficult.
E-commerce, in particular, has exploded data management challenges along
three dimensions: volumes, velocity and variety. In 2001/02, IT organizations
must compile a variety of approaches to have at their disposal for dealing with
each.”
– Doug Laney

10
Comparison of Approaches in Adopting High-Performance Capabilities

“Big Data Analytics”, David Loshin, 2013

11
Comparison of Data Analytics and Computing Ecosystems

Java, Python, Scala

Spark

12
Apache Hadoop

The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than relying on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of which may
be prone to failures.

The project includes these modules:

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
• Hadoop YARN (starting from the 2nd generation): A framework for job scheduling and
cluster resource management.
http://hadoop.apache.org
13
Hadoop-related Apache Projects: Hadoop Ecosystem

• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and ability to view
MapReduce, Pig and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for
large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad
hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN,
which provides a powerful and flexible engine to execute an arbitrary DAG of tasks
to process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.
14
Reading Reference

15
Big Data (Hadoop) Ecosystem
Big Data Applications/Domains
(Healthcare, insurance, finance, social networks,
transportation, sciences, etc.)
Big Data Analytics
(Methods: AI, machine learning, visualization, etc.
Modules: Pig, Hive, Mahout, etc.)
Big Data Computing
(MapReduce, Spark, Storm, Oozie, etc.)
Resource Management and Scheduling
(YARN, Kubernetes, Mesos)
Big Data Management
(Models: RDBMS; NoSQL: Key-Value, Document, Graph, etc.
Systems: SQL, MongoDB, HBase, Cassandra, etc.)
Big Data Storage
(HDFS)
Big Data Networking
(HPN, SDN, etc.)
16
Hadoop Distributed File System (HDFS)
• HDFS is a Java-based file system that provides scalable, fault-tolerant, cost-efficient
storage for big data
• The file content is split into large blocks (typically 128 megabytes), each of which is independently
replicated at multiple DataNodes
• The NameNode maintains the namespace tree (in RAM) and the mapping of blocks to DataNodes

http://hortonworks.com/hadoop/hdfs/
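The block splitting and the NameNode's block-to-DataNode mapping described above can be illustrated with a small Python sketch. It is a toy model, not HDFS code: the replication factor of 3, the DataNode names, the 300 MB file size, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware).

```python
from itertools import cycle

BLOCK_SIZE_MB = 128  # typical block size quoted on the slide
REPLICATION = 3      # assumed replication factor; not stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is split into."""
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * int(n_full)
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_blocks(n_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode table: block id -> DataNodes holding a replica.

    Round-robin placement is a simplification chosen for this sketch.
    """
    replication = min(replication, len(datanodes))
    ring = cycle(datanodes)
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(n_blocks)
    }


# Hypothetical 300 MB file on a hypothetical four-node cluster.
sizes = split_into_blocks(300)
block_map = place_blocks(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
print(sizes)
print(block_map)
```

The dictionary plays the role of the in-memory mapping the NameNode keeps: a client asks it which DataNodes hold a block and then reads the block from one of them.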
17
Word Counting: “Hello World” in MapReduce

Basic data structure: (key, value)

http://www.alex-hanna.com
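The (key, value) flow of word counting can be simulated in a few lines of plain Python. This is an in-memory sketch of the map, shuffle, and reduce steps, not Hadoop code. The function names and the two sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hello World", "Hello Hadoop"])
print(counts)
```

In a real cluster the mapper calls run in parallel on different input splits, and the shuffle moves pairs across the network so that each reducer sees all values for its keys.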
18
Set Up the Hadoop Environment

• Local (standalone) mode

• Pseudo-distributed mode
• Fully-distributed mode

19
Setting Up the Hadoop Environment – Pseudo-distributed mode

On the SSH server

authorized_keys:
used by the SSH server to store
the public keys of clients for client
authentication

On the SSH client

known_hosts:
used by the SSH client to store
the public keys of servers for
server authentication

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html

20
Set Up the Hadoop Environment – Pseudo-distributed mode

21
Word Count Problem: Hands-on MapReduce Programming Guide
-Configuration
Version: Hadoop 1.2.1
Mode: Pseudo-Distributed Mode
IDE: Eclipse

Word Count Problem
-Input
Locally stored file: SampleTextFile_1000kb.txt

Word Count Problem
-Create a MapReduce Project

Word Count Problem
-Create a class

Word Count Problem
-MapReduce Program
Mapper Function:

Word Count Problem
-MapReduce Program
Reducer Function:

Word Count Problem
-MapReduce Program
Main Function:

Word Count Problem
-Execute MapReduce Program on Eclipse

Word Count Problem
-Execute MapReduce Program on Eclipse
“input” is the parameter that indicates the input directory;
“output” is the parameter that indicates the output directory.

Word Count Problem
-Execute MapReduce Program on Eclipse
When the program is successfully executed:

Word Count Problem
-Check the output
Word Count Problem
-Configuration
Version: Hadoop 1.2.1
Mode: Fully-Distributed Mode
Cloud: Amazon Web Services

Word Count Problem
-Configuration
Three homogeneous VM instances: one master node and two slave nodes.

Word Count Problem
-Input
Create the input directory in HDFS and upload the input data from the local file system to this directory. The
input size is 108MB.

Word Count Problem
-Export JAR file with Eclipse
We use Eclipse to test the program locally (Standalone mode or Pseudo-Distributed mode).
If we want to run a MapReduce program in Fully-Distributed Mode on a Hadoop cluster,
for example, in a public cloud environment, we can upload the JAR file to the master node
of the cluster and execute the program by using the following command:

$ bin/hadoop jar WordCount.jar WordCount /user/user_name/wordcount/input /user/user_name/wordcount/output

Word Count Problem
-Export JAR file with Eclipse

Word Count Problem
-Execution
Execute JAR on the cluster.

Word Count Problem
-Output
Download the output folder from HDFS to the master node.
Find the size of the output, which is 130MB.

Word Count Problem
-Output Location
Note that by default the data block size is 64MB in Hadoop 1.2.1, and 128MB in Hadoop 2.
Check slave node 1: there is only one block stored on this node.
By checking the first 10 lines of the data block’s contents, we see that the file stored on slave
node 1 contains the mapping keys.

Word Count Problem
-Output Location
Check slave node 2: there are two data blocks stored on this node.
By checking the first 10 lines of these two data blocks’ contents, we see that the data blocks stored on slave
node 2 contain the output.
Word Count Problem
-Configuration
Version: Hadoop 2.6.0
Mode: Fully-Distributed Mode
Cloud: Amazon Web Services

Word Count Problem
-Configuration
Three homogeneous Virtual Machine instances:
One master node
Two slave nodes

Word Count Problem
-MapReduce Program
Create a WordCount Java program.

Word Count Problem
-MapReduce Program
Mapper Function:

Word Count Problem
-MapReduce Program
Reducer Function:

Word Count Problem
-MapReduce Program
Main Function:

Word Count Problem
-Compile
Compile.

Word Count Problem
-Compile
Export JAR.

Word Count Problem
-Input
Upload the input data from the local machine to the master node.
The input size is 235MB.

Word Count Problem
-Input
Create the input directory in HDFS and place the input file in it.

Word Count Problem
-Execution
Execute JAR file on the cluster.

Word Count Problem
-Execution
Execute successfully.

Word Count Problem
-Output
Find the size of the output, which is 287MB.

Word Count Problem
-Output Location
Two data blocks on slave 1 (160.42MB).
Three data blocks on slave 2 (365.80MB).

Word Count Problem
-Output Location
Check slave node 1: There are two data blocks stored on this node. The data block size is 128MB.
By checking the first 10 lines of the two data blocks’ contents, we see that both data blocks stored on slave
node 1 are the output.

Word Count Problem
-Output Location
Check slave node 2: There are three data blocks stored on this node. The data block size is 128MB.
By checking the first 10 lines of the three data blocks’ contents, we see that the third data block stored on slave
node 2 is the output, while the other two store the keys.
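The block counts in the two fully-distributed runs can be sanity-checked with simple arithmetic: a file of size S stored with block size B occupies ceil(S / B) blocks. The sketch below applies this to the input and output sizes quoted in the slides. It assumes each size refers to a single file and ignores replication, neither of which the slides state.

```python
import math


def block_count(size_mb, block_size_mb):
    """Number of HDFS blocks needed for one file of the given size."""
    return math.ceil(size_mb / block_size_mb)


# (label, size in MB, default block size in MB) taken from the slides.
runs = [
    ("Hadoop 1.2.1 input", 108, 64),
    ("Hadoop 1.2.1 output", 130, 64),
    ("Hadoop 2.6.0 input", 235, 128),
    ("Hadoop 2.6.0 output", 287, 128),
]

counts = {label: block_count(size, block) for label, size, block in runs}
for label, n in counts.items():
    print(f"{label}: {n} block(s)")
```

For the Hadoop 2.6.0 run this gives 2 + 3 = 5 blocks, which agrees with the five data blocks listed on the two slave nodes if every block is stored once. That reading is an inference, since the replication factor is not given.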
Hadoop configuration

❑ Standalone Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Configuration

❑ Pseudo-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

❑ Fully-Distributed Mode
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

How to install Hadoop on Amazon AWS (step by step):

https://www.youtube.com/watch?v=a-DXDkK1i08

Another useful tutorial:

https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster

Java programming

MapReduce programming with Apache Hadoop:

https://www.javaworld.com/article/2077907/open-source-tools/mapreduce-programming-with-apache-hadoop.html

If you need more details, the following book helps:

https://eecs.wsu.edu/~yinghui/mat/courses/fall%202015/resources/Hadoop%20the%20definitive%20guide.pdf

The following tutorial shows in detail how to use Eclipse to write, compile, execute, and export a .jar file for
the word counting problem in Hadoop:

https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial
Apache Spark Built on top of HDFS

61
Big Data Visualization
• Graph Database
• Visual Analytics
[Figures: Tree of Life by Dr. Yifan Hu (76,425 species); the information diffusion graph of the death of Osama bin Laden by Gilad Lotan (14.8 million tweets); Facebook friendship graph by Paul Butler (500 million users)]

Challenging task: squeezing millions and even billions of records into a couple of million pixels (1600 × 1200 ≈ 2 million pixels)

62
Visualization Key Challenges

• Visual clutter: How can we encode the information intuitively?
• Performance issues: How can we render the huge datasets in real time with rich interactions?
• Cognition: How can users understand the visual representation when the information is overwhelming?

63
Platform Dependent Graphical Models
• Homogeneous multicore processors
Intel Xeon E5335 (Clovertown)
AMD Opteron 2347 (Barcelona)
Netezza (FPGA, multicore)
• Homogeneous manycore processors
Sun UltraSPARC T2 (Niagara 2), GPGPU
• Heterogeneous multicore processors
Cell Broadband Engine
• Clusters
HPCC, DataStar, BlueGene, etc.

64
Graph Workload Types

⚫ Type 1: Computations on graph structures / topologies

⚫ Example → converting Bayesian network into junction tree, graph traversal (BFS/DFS), etc.
⚫ Characteristics → Poor locality, irregular memory access, limited numeric operations
[Figure: a Bayesian network converted step by step into a junction tree]
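Breadth-first search is one of the Type 1 traversals named above. A minimal Python sketch on an adjacency-list graph follows. The example graph is hypothetical. The neighbour lookups jump around memory according to the graph's shape, which is the irregular access pattern the slide refers to.

```python
from collections import deque


def bfs_order(graph, source):
    """Return vertices in the order breadth-first search visits them."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for neighbor in graph.get(vertex, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical directed graph given as adjacency lists.
example_graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(bfs_order(example_graph, 1))
```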

⚫ Type 2: Computations on graphs with rich properties

⚫ Example → Belief propagation: diffuse information through a graph using statistical models
⚫ Characteristics
⚫ Locality and memory access pattern
depend on vertex models
⚫ Typically a lot of numeric operations
⚫ Hybrid workload

⚫ Type 3: Computations on dynamic graphs

⚫ Example → streaming graph clustering, incremental k-core, etc.
⚫ Characteristics
⚫ Poor locality, irregular memory access
⚫ Operations to update a model (e.g., cluster, sub-graph)
⚫ Hybrid workload
[Figure: 3-core subgraph]
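To make the k-core notion concrete, here is a sketch of the standard static peeling procedure: repeatedly delete vertices whose degree is below k. It is not the incremental algorithm the slide mentions. An incremental variant would update the result as edges arrive instead of recomputing it. The example graph is hypothetical.

```python
def k_core(adjacency, k):
    """Return the vertex set of the k-core of an undirected graph.

    adjacency maps each vertex to the set of its neighbours (symmetric).
    """
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    alive = set(adjacency)
    stack = [v for v in alive if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already peeled via an earlier stack entry
        alive.remove(v)
        for u in adjacency[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


# Hypothetical graph: a 4-clique {a, b, c, d} with a tail a - e - f.
example = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
print(sorted(k_core(example, 3)))
```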

65
Large-scale graph benchmark – Graph 500

June 2024, Breadth-First Search (GTEPS)

66
Common Use Cases for Big Data in Hadoop

• Log Data Analysis

– most common, fits perfectly for HDFS scenario: Write once & Read
often.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond

D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
67
Big Data Analytics Example Use Cases

68
Use Case 1: Social Network Analysis in Enterprise for Productivity
Production live system used by IBM GBS since 2009 – verified ~$100M contribution
15,000 contributors in 76 countries; 92,000 annual unique IBM users
25,000,000+ emails & SameTime messages (incl. content features)
1,000,000+ Learning clicks; 14M KnowledgeView, SalesOne, …, access data
1,000,000+ Lotus Connections (blogs, file sharing, bookmark) data
200,000 people’s consulting project & earning data

Dynamic networks of 400,000+ IBMers

[Figure labels: Shortest Paths, Centralities, Graph Search, Social Capital, Bridges, Hubs, Expertise Search, Graph Recomm.]

– On BusinessWeek four times, including being the Top Story of Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributing about 1/3 of GBS Practitioner Portal $228.5 million savings and benefits
– APQC (WW leader in Knowledge Practice) April 2013: “The Industry Leader and Best Practice in Expertise Location”
69
Use Case 2: Recommendation

item

user

70
Use Case 3: Recommendation for Commerce
[Figure: Precision comparison (number of triggered users = 1, propagation steps = 1). x-axis: number of recommended users (1 to 4); y-axis: precision (0 to 0.6). Legend entries include CF, CF + SP, IF, TIF and EABIF.]

[Figure: Recall comparison (number of triggered users = 1, propagation steps = 1). x-axis: number of recommended users (1 to 4); y-axis: recall (0 to 0.14). Legend entries include CF, CF + SP, IF, TIF and EABIF.]

[Figure: adoption curve with Innovators, Early adopters, Early majority, Late majority and Laggards]

Tests:
– 1 month
– 586 new docs
– 1,170 users

IF: Graphical Information Flow Model
TIF: Joint Topic Detection + Information Flow Model

→ Compared to Collaborative Filtering (CF) + Similar People:
Precision: IF is 91% better, TIF is 108% better
Recall: IF is 87% better, TIF is 113% better
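The comparison above rests on precision and recall of the top-k recommended users, plus a relative improvement over a baseline. A small Python sketch of these three quantities follows. The user ids and scores are made up for illustration and are not the experiment's data.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations.

    recommended: ranked list of candidates; relevant: set of true adopters.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def relative_improvement(new_score, baseline_score):
    """How much better new_score is than baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Illustrative values only.
ranked = ["u1", "u2", "u3", "u4"]
adopters = {"u2", "u4", "u9"}
p2, r2 = precision_recall_at_k(ranked, adopters, k=2)
p4, r4 = precision_recall_at_k(ranked, adopters, k=4)
print(p2, r2, p4, r4)
print(relative_improvement(0.3, 0.2))
```

Read this way, a statement such as "IF is 91% better" corresponds to a relative improvement of 0.91 over the CF + Similar People baseline.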
71
Use Case 4: Graph Analytics for Financial Analysis
Goal: Injecting Network Graph Effects for Financial Analysis. Estimating company performance considering
correlated companies, network properties and evolutions, causal parameter analysis, etc.

[Figures: company relationship networks, IBM 2003 and IBM 2009]

▪ Data Source: relationships among 7594 companies, data mining from NYT 1981 ~ 2009
▪ Targets: 20 Fortune companies’ normalized profits
▪ Goal: learn from the previous 5 years and predict the next year
▪ Model: Support Vector Regression (RBF kernel)
▪ Network features: s (current year network feature), t (temporal network feature), d (delta value of network feature)
▪ Financial feature: p (historical profits and revenues)

[Figure: profit prediction (mean R^2, axis from 0 to 0.5) for different feature sets: s, t, p, d, st, sp, std, tp, dp, stdp]

Profit prediction by joint network and financial analysis outperforms network-only by 130% and financial-only by 33%.
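Two ingredients of the setup above can be sketched in plain Python: turning a yearly series into (previous 5 years, next year) training pairs, and the RBF kernel that the regression model uses to compare feature vectors. This is not the study's code. The sample series and the gamma value are assumptions, and a full Support Vector Regression would need an optimisation library on top.

```python
import math


def make_windows(series, window=5):
    """Pair each run of `window` consecutive values with the value that follows."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical normalized yearly profits for one company.
profits = [1, 2, 3, 4, 5, 6, 7]
pairs = make_windows(profits)
print(pairs)
print(rbf_kernel(pairs[0][0], pairs[1][0], gamma=0.1))
```

Identical windows have kernel value 1, and the value decays towards 0 as two windows grow apart, so the model weights years with similar history more heavily.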
72
Use Case 5: Social Media Monitoring

[Screenshot labels: IBM CIO monitoring categories; Monitoring filter; Real-Time Translation; Locations; Live Tweets; Sentiment; Keywords; Dynamic Graphs; Zooming / Panning; Top Retweets]

73
Use Case 6: Customer Social Analysis for Telco
Goal: Extract customer social network behaviors to enable Call Detail Records (CDRs) data monetization for Telco.

▪ Applications based on the extracted social profiles
− Personalized advertisement (beyond the scope of traditional campaign in Telco)
− High value customer identification and targeting
− Viral marketing campaign
▪ Approach
− Construct social graphs from CDRs based on {caller, callee, call time, call duration}
− Extract customer social features (e.g., influence, communities, etc.) from the constructed social graph as customer social profiles
− Build analytics applications (e.g., personalized advertisement) based on the extracted customer social profiles

[Figure labels: CDR; BigInsights; System G Analysis (Degree Centrality, Weakly Connected Component, Maximal Cliques, Community Detection, Pagerank, K-core); Customer Profiles (influence, community, etc.); Applications (Personalized Advertisement, High Value Customer Identification & targeting, Viral marketing campaign)]

PoCs with Chinese and Indian telecom companies
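The first two steps of the approach above, building a social graph from CDRs and extracting a simple social feature, can be sketched as follows. The records are made up, the graph is treated as undirected, and degree centrality is just one of the features named in the figure. None of this is System G code.

```python
from collections import defaultdict


def build_call_graph(cdrs):
    """Build an undirected graph from (caller, callee, call_time, duration) records."""
    graph = defaultdict(set)
    for caller, callee, _call_time, _duration in cdrs:
        if caller == callee:
            continue  # ignore self-calls
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def degree_centrality(graph):
    """Fraction of the other customers each customer is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {v: 0.0 for v in graph}
    return {v: len(neighbors) / (n - 1) for v, neighbors in graph.items()}


# Hypothetical call detail records.
records = [
    ("A", "B", "2024-01-01T09:00", 60),
    ("A", "C", "2024-01-01T09:30", 30),
    ("B", "A", "2024-01-02T18:15", 10),
]
call_graph = build_call_graph(records)
centrality = degree_centrality(call_graph)
print(centrality)
```

Call time and duration are ignored here. They could be used as edge weights, for example so that long or frequent calls count more.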

74
Use Case 7: Graph Analytics and Visualization for Watson
[Figure labels: Graph Matching (Query, Matches); symptom nodes: headache, chill, migraine, high fever, stomachache, cough; Graph Communities]

75
Use Case 8: Visualization for Navigation and Exploration

Cluster based huge graph visualization

Query based huge graph visualization

76
Use Case 9: Graph Search

[Figure labels: existing search engine (query, index, ranking); Graph Search (Graph analysis of Info-Socio networks, query context, re-ranking); Improved search results; Interest / social network based content recommendations]

77
Use Case 10: Anomaly Detection at Multiple Scales

Based on Presidential Executive Order 13587

Goal: System for Detecting and Predicting Abnormal Behaviors in Organization, through
large-scale social network & cognitive analytics and data mining, to decrease insider threats such
as espionage, sabotage, colleague-shooting, suicide, etc.

“Enterprise Information Leakage Impacted economy and jobs” Feb 2013

“What's emerged is a multibillion dollar detective industry” npr Jan 10, 2013

[Figure labels: Emails; Instant Messaging; Web Access; Executed Processes; Printing; Copying; Log On/Off; Click streams capturer; Feed subscription; Database access; Social sensors; Graph analysis; Behavior analysis; Semantics analysis; Psychological analysis; Multimodality Analysis; Detection, Prediction & Exploration Interface]

Infrastructure + ~ 70 Analytics
78
Use Case 11: Fraud Detection for Bank
[Figure panels: Network Info Flow; Ego Net Features]

Ponzi scheme Detection

Normal: (1) Clique-like, (2) Two-way links
Attacker: Near-Star
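The two ego-net cues named above can be turned into numbers: how densely an account's neighbours are connected to each other (clique-like versus near-star), and what share of its links are two-way. The sketch below is an illustration under assumed definitions, not the detection method used in the slide. The account names and both example networks are hypothetical.

```python
def ego_net_features(out_links, ego):
    """Neighbour density and link reciprocity of one account's ego net.

    out_links maps each account to the set of accounts it sends to.
    """
    sends_to = out_links.get(ego, set())
    neighbors = set(sends_to)
    for source, targets in out_links.items():
        if ego in targets:
            neighbors.add(source)
    neighbors.discard(ego)

    # Density among neighbours: near 1 for a clique, 0 for a pure star.
    ordered = sorted(neighbors)
    pairs = linked = 0
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            pairs += 1
            if v in out_links.get(u, set()) or u in out_links.get(v, set()):
                linked += 1
    density = linked / pairs if pairs else 0.0

    # Reciprocity: share of neighbours linked with the ego in both directions.
    two_way = sum(
        1 for n in neighbors
        if n in sends_to and ego in out_links.get(n, set())
    )
    reciprocity = two_way / len(neighbors) if neighbors else 0.0
    return {"neighbor_density": density, "reciprocity": reciprocity}


# Hypothetical near-star: a, b, c each pay x, nothing flows back.
star = {"x": set(), "a": {"x"}, "b": {"x"}, "c": {"x"}}
# Hypothetical clique with two-way links among p, q, r, s.
clique = {n: {"p", "q", "r", "s"} - {n} for n in "pqrs"}
print(ego_net_features(star, "x"))
print(ego_net_features(clique, "p"))
```

Low values on both features would flag an account as star-like with one-way flows. Any threshold for raising an alert would have to be chosen from labelled data.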

79
Use Case 12: Detecting Cyber Attacks
[Figure panels: Network Info Flow; Ego Net Features]
Detecting DoS attack

80
Use Case 13: Smarter another Planet
Goal: Atmospheric Radiation Measurement (ARM) climate research
facility provides 24x7 continuous field observations of cloud, aerosol
and radiative processes. Graphical models can automate the
validation with improved efficiency and performance.

Approach: BN is built to represent the dependence among sensors

and replicated across timesteps. BN parameters are learned from
over 15 years of ARM climate data to support distributed climate
sensor validation. Inference validates sensors in the connected
instruments.

Bayesian Network
* 3 timesteps * 63 variables
* 3.9 avg states * 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques

81
Use Case 14: Cellular Network Analytics in Telco Operation
Goal: Efficiently and uniquely identify the internal state of
Cellular/Telco networks (e.g., performance and load of
network elements/links) using probes between monitors
placed at selected network elements & end hosts

▪ Applied Graph Analytics to telco network analytics based on CDRs (call detail records): estimate traffic load on CSP network with low monitoring overhead
– CDRs, already collected for billing purposes, contain information about voice/data calls
– Traditional NMS* and EMS** typically lack end-to-end visibility and topology across vendors
– Employ graph algorithms to analyze network elements which are not reported by the usage data from CDR information
▪ Approach
– Cellular network comprises a hierarchy of network elements
– Map CDR onto network topology and infer load on each network element using graph analysis
– Estimate network load and localize potential problems

[Figure labels: CDR; Network topology; Graph Analysis; Network load level report]

82
Use Case 15: Monitoring Large Cloud
Goal: Monitoring technology that can track the time-varying state
(e.g., causality relationships between KPIs) of a large Cloud when
the processing power of monitoring system cannot keep up with the
scale of the system & the rate of change
• Causality relationships (e.g., Granger causality) are crucial in
performance monitoring & root cause analysis
• Challenge: easy to test pairwise relationship, but hard to test
multi-variate relationship (e.g., a large number of KPIs)

[Figure: KPI time series (e.g., server performance/load, network performance/load) feed a causality analyzer. Each node is a KPI (a time series); each edge is a (potential) pairwise relationship (e.g., causality) that varies over time. Labels: Network KPIs, Server KPIs.]

Our approach: Probabilistic monitoring via sampling & estimation
• Basic analytics engine (e.g., pairwise Granger causality)
• Link sampling & estimation: Select KPI pairs (sampling) → Test link existence → Estimate unsampled links based on history → Overall graph

83
Use Case 16: Code Life Cycle Improvement

[Figure: Traditional (relational) model: graph application and graph objects, converted to and from a relational DB. Graph DB model: graph application and graph objects stored directly in a graph DB.]

⚫ Advantages of working directly with graph DB for graph applications

⚫ Smaller and simpler code
⚫ Flexible schema → easy schema evolution
⚫ Code is easier and faster to write, debug and manage
⚫ Code and data are easier to transfer and maintain

84
Use Case 17: Smart Navigation Utilizing Real-time Road Information
Goal: Enable unprecedented level of accuracy in traffic scheduling (for a fleet of
transportation vehicles) and navigation of individual cars utilizing the dynamic real-time
information of changing road condition and predictive analysis on the data

• Dynamic graph algorithms implemented in
System G provide highly efficient graph query
computation (e.g., shortest path computation) on
time-varying graphs (orders of magnitude
improvement over existing solutions)

• High-throughput real-time predictive analytics
on graphs makes it possible to estimate the future
traffic condition on the route to make sure that the
decision taken now is optimal overall
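As a baseline for the shortest-path queries mentioned above, the sketch below runs Dijkstra's algorithm on a small road graph and re-runs it after one travel time changes. Recomputing from scratch after every update is the naive approach. It is shown only to make the query concrete and is not the dynamic algorithm implemented in System G. The road network and travel times are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm: return (total cost, path) from source to target.

    graph maps each node to a dict {neighbor: non-negative travel time}.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, weight in graph.get(u, {}).items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical road network with travel times in minutes.
roads = {"A": {"B": 5, "C": 2}, "C": {"B": 1}, "B": {"D": 1}, "D": {}}
before = shortest_path(roads, "A", "D")

roads["C"]["B"] = 10  # real-time update: congestion on the C -> B segment
after = shortest_path(roads, "A", "D")
print(before, after)
```

The route switches from A, C, B, D to A, B, D once the update arrives, which is the kind of decision change that real-time road information is meant to trigger.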

Our approach: Querying over dynamic graph + predictive analytics on graph properties

[Figure labels: Historical data; Predictive results; Predictive analytics for graphs; Dynamic Graph query problem; Query & response; Graph store; Real-time update]
85
Use Case 18: Graph Analysis for Image and Video Analysis

[Figure labels: Vertex Correspondence; Attribute Transformation; ARG s; ARG t; Ys; Yt]
86
Use Case 19: Graph Matching for Genomic Medicine

• Ongoing discussions

87
Use Case 20: Data Curation for Enterprise Data Management

88
Use Case 21: Understanding Brain Network

89
Use Case 22: Planet Security
• Big Data on Large-Scale Sky Monitoring

NASA’s DART Mission Hits Asteroid in First-Ever Planetary Defense Test

https://www.nbcnews.com/video/nasa-s-dart-spacecraft-crashes-into-asteroid-149320773570

90
Questions?

No ratings yet
No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations
15 pages
Introduction To Mobile Computing
No ratings yet
Introduction To Mobile Computing
3 pages
إدخال الكود السعودى لتصميم الطرق داخل برنامج السيفيل ثرى دى
No ratings yet
إدخال الكود السعودى لتصميم الطرق داخل برنامج السيفيل ثرى دى
11 pages
Week 4 CC
No ratings yet
Week 4 CC
7 pages
Apache HTTP
No ratings yet
Apache HTTP
46 pages
HTB Academy Report Template
No ratings yet
HTB Academy Report Template
24 pages
Excel Fundamentals Manual 46
No ratings yet
Excel Fundamentals Manual 46
1 page
Cellular and Mobile Communication: Handoff Strategies Prioritizing Handoff Practical Handoff Considerations
No ratings yet
Cellular and Mobile Communication: Handoff Strategies Prioritizing Handoff Practical Handoff Considerations
20 pages
Other Script2
No ratings yet
Other Script2
8 pages
MI Sample CyberSecurity Incident Response Plan
No ratings yet
MI Sample CyberSecurity Incident Response Plan
26 pages



