
CIT650

Introduction to Big Data

Principles of Big Data

1
Today’s Agenda

Big Data Phenomena

Big Data 1.0 Systems

Big Data 2.0 Systems

2
Part I

Big Data Phenomena

3
Big Data
Data is a key resource in the modern world.
According to IBM, we are currently creating 2.5 quintillion bytes of data every day.
IDC predicts that the worldwide volume of data will reach 40 zettabytes by 2020.
The radical expansion and integration of computation, networking, digital devices, and data storage have provided a robust platform for the explosion in big data.

4
Big Data

5
On the Verge of a Disruptive Century: Breakthroughs

Gene Sequencing and Biotechnology
Ubiquitous Computing
Smaller, Faster, Cheaper Sensors
Faster Communication
6
Big Data Applications are Everywhere

7
Big Data: What Happens on the Internet in a Minute?

8
Data Generation and Consumption Model is Changing

9
Data Generation and Consumption Model is Changing

Old Model: Few companies (producers) are generating data; all others are consuming data.

New Model: All of us are generating data, and all of us are consuming data.

10
Big Data

11
Big Data

Data generation and consumption are becoming a main part of people's daily lives, especially with the pervasive availability and usage of Internet technology and applications.

12
Your Smartphone is Now Very Smart

13
Internet of Things (IoT)
A network of devices that connect directly with each other to capture, share, and monitor vital data automatically through a secure (SSL) connection to a central command-and-control server in the cloud.
Enabling communication between devices, people, and processes to exchange useful information and knowledge that creates value for humans.
A global network infrastructure linking physical and virtual objects.
Infrastructure: Internet and network developments.
Specific object identification, sensing, and connection capability.

14
Big Data: Internet of Things

15
Prediction of IoT Usage1

1 https://www.ericsson.com/
16
Why is the IoT Opportunity Growing Now?

Affordable hardware: Costs of actuators and sensors have been cut in half over the last 10 years.
Smaller, more powerful hardware: Form factors of hardware have shrunk to millimeter or even nanometer levels.
Ubiquitous and cheap mobility: Costs for mobile devices, bandwidth, and data processing have declined over the last 10 years.
Availability of supporting tools: Big data tools and cloud-based infrastructure have become widely available.

17
Smart X Phenomena

18
What Does It All Produce?

Data ... Data ... Data

19
Big Data: Activity Data
Simple activities like listening to music or reading a book now generate data.
Digital music players and eBooks collect data on our activities.
Your smartphone collects data on how you use it, and your web browser collects information on what you are searching for.
Your credit card company collects data on where you shop, and your shop collects data on what you buy.
It is hard to imagine any activity that does not generate data.

20
Big Data

The cost of sequencing one human genome has fallen from $100 million in 2001 to $1,000 in 2015.

21
New Types of Data

22
New Types of Data

23
The Data Structure Evolution Over the Years

24
What Does Big Data Mean?

25
Big Data (3V)

The three Vs: Volume, Velocity, and Variety.
26
Big Data (5V)

The five Vs add Veracity and Value to Volume, Velocity, and Variety.
27
Big Data

28
Big Data Definition

The McKinsey Global Institute report described big data as the next frontier for innovation and competition.
The report defined big data as "data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value."

29
Big Data Revolution

30
IBM 5MB Hard Disk ;-)

31
Recent Advances in Computational Power

Cheaper, larger, and faster disk storage
You can now put your entire large database on disk
Cheaper, larger, and faster memory
You may even be able to accommodate it all in memory
Cheaper, more capable, and faster processors
Parallel computing architectures:
Operate on large datasets in reasonable time
Try exhaustive searches and brute-force solutions

32
Big Data

Moore's Law: The information density on silicon integrated circuits doubles every 18 to 24 months.
Users expect more sophisticated information.

33
Your Pocket-Size Terabyte Hard Disk

34
Hardware Advancements Enable Big Data Processing

35
Scale Up vs. Scale Out

Scale up: add power (CPU, memory, disk) to a single machine. Scale out: add more commodity machines that work in parallel.

36
The Data Overload Problem

37
The Data Overload Problem

Data is growing at a phenomenal rate. It has become massive, operational, and opportunistic.
We are drowning in data but starving for knowledge.
The hidden information and knowledge in these mountains of data are what is really most useful.

38
The Data Overload Problem

39
Fourth Paradigm

Jim Gray, a database pioneer, described the big data phenomenon as the Fourth Paradigm and called for a paradigm shift in computing architecture and large-scale data processing mechanisms.
The first three paradigms were experimental, theoretical, and, more recently, computational science.

40
Fourth Paradigm

41
Fourth Paradigm

Thousands of years ago - Experimental Science
Description of natural phenomena

Last few hundred years - Theoretical Science
Newton's laws, Maxwell's equations, ...

Last few decades - Computational Science
Simulations of complex phenomena

Today - Data-Intensive Science
Scientists overwhelmed with datasets from many different sources:
Data captured by instruments
Data generated by simulations
Data generated by sensor networks

42
Computing Clusters

Many racks of computers, thousands of machines per cluster.
Limited bisection bandwidth between racks.

43
Data Centers

44
Big Data is a Competitive Advantage

45
Big Data is a Competitive Advantage

"It's not who has the best algorithm that wins. It's who has the most data."

Andrew Ng

46
Data is the new Oil/Gold

47
Big Data Processing Systems
Big Data is the New Oil
and
Big Data Processing Systems are the Machinery

48
Part II

Big Data 1.0 Systems: The Hadoop Decade

49
A Little History: Two Seminal Contributions

"The Google File System"2
Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications that runs on inexpensive commodity hardware and delivers high aggregate performance.

"MapReduce: Simplified Data Processing on Large Clusters"3
Describes a simple programming model and an implementation for processing large data sets on computing clusters.
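To make the programming model concrete, below is a minimal word-count sketch in plain Python that mimics the map, shuffle, and reduce phases. It is an illustration only, not the Hadoop API: the function names are invented, and the in-memory dictionary stands in for the distributed shuffle a real cluster performs.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input record.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts grouped under one key.
    return word, sum(counts)

def mapreduce_wordcount(documents):
    # Shuffle: group intermediate pairs by key. In a real cluster the
    # framework does this across machines; here it is a local dict.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(mapreduce_wordcount(["big data big systems", "data everywhere"]))
# -> {'big': 2, 'data': 2, 'systems': 1, 'everywhere': 1}
```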

2 S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP 2003.
3 J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
50
Hadoop4: A Star is Born

Hadoop is an open-source software framework that supports data-intensive distributed applications and clones Google's MapReduce framework.
It is designed to process very large amounts of unstructured and complex data.
It is designed to run on a large number of machines that don't share any memory or disks.
It is designed to run on a cluster of machines that can be put together at relatively low cost and with easy maintenance.

4 http://hadoop.apache.org/
51
Key Aspects of Hadoop

52
Hadoop’s Success

Big Data 1.0 = Hadoop

53
Hadoop’s Success5

Big Data 1.0 = Hadoop

5 https://www.google.com/trends/
54
The Eternal Dilemma: Does One Size Fit All?!

55
Big Data 2.0 Processing Systems

Big Data 2.0 != Hadoop

Domain-specific, optimized, and vertically focused systems:

[Timeline of systems, 2004-2015: Google MapReduce, Hadoop, Apache Hive, Google Pregel, Apache Giraph, Apache Spark, Apache Storm, GraphLab, PowerGraph, GraphX, Apache Flink, Trinity, Apache S4, Apache Samza, Cloudera Impala, Facebook Presto, Apache Phoenix, Apache Tajo, IBM Big SQL, Apache Tez]

56
Big Graphs

Google estimates that the total number of web pages exceeds 1 trillion; experimental graphs of the World Wide Web contain more than 20 billion nodes and 160 billion edges.

Facebook reportedly consisted of more than a billion users (nodes) and more than 140 billion friendship relationships (edges) in 2012.

The LinkedIn network contains almost 260 million nodes and billions of edges.

Linked Data contains about 31 billion triples.

57
Hadoop for Big Graphs?!

Popular graph query/analysis operations such as PageRank, pattern matching, shortest path, clustering (e.g., max clique, triangle closure), community detection, etc., are iterative in nature.
The MapReduce programming model does not directly support iterative data analysis. Programmers may implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution using a driver program, which wastes I/O, network bandwidth, and CPU resources, as in the sketch below.
It is not intuitive to think of graphs as key/value pairs or matrices.
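A hedged sketch of that driver-program pattern in Python. The submit_mapreduce_job and max_rank_change helpers are hypothetical placeholders, not a real Hadoop API; the point is that every iteration is a full job whose output is materialized before the next one starts.

```python
def submit_mapreduce_job(mapper, reducer, input_path, output_path):
    # Placeholder: in real life this would launch a full Hadoop job and
    # block until its entire output is materialized on disk.
    print(f"running {mapper}/{reducer}: {input_path} -> {output_path}")

def max_rank_change(previous_path, current_path):
    # Placeholder: the driver must read data back from storage just to
    # decide whether another iteration is needed.
    return 1.0  # pretend we never converge, so every iteration runs

def run_iterative_graph_job(input_path, max_iterations=30, epsilon=1e-4):
    # Orchestrate an iterative analysis (e.g., PageRank) as a chain of
    # MapReduce jobs. Every iteration re-reads and re-writes the whole
    # graph, which is where the wasted I/O, bandwidth, and CPU come from.
    current = input_path
    for i in range(max_iterations):
        output = f"{input_path}/iter_{i}"
        submit_mapreduce_job("rank_mapper", "rank_reducer", current, output)
        if max_rank_change(current, output) < epsilon:
            return output
        current = output
    return current

run_iterative_graph_job("hdfs://graph/input", max_iterations=3)
```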
58
Pregel/Giraph6

In 2010, Google introduced the Pregel system as a scalable platform for implementing graph algorithms.
Pregel relies on a vertex-centric approach and is inspired by the Bulk Synchronous Parallel (BSP) model.
In 2012, Apache Giraph was launched as an open-source project that clones the concepts of Pregel and leverages the Hadoop infrastructure.
Other projects: Spark GraphX (Apache), GoldenOrb (Apache), GraphLab (CMU), and Signal/Collect (UZH).
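A minimal single-machine sketch of the vertex-centric, BSP-style model (for illustration only; this is not the actual Pregel or Giraph API). In each superstep, every vertex combines the messages it received, updates its own value, and sends new messages along its out-edges, which are delivered at the next superstep.

```python
def pagerank_superstep(ranks, edges, incoming, damping=0.85):
    # One BSP superstep: compute a new value from received messages,
    # then send the new rank, split evenly, along the out-edges.
    outgoing = {v: [] for v in ranks}
    for v in ranks:
        ranks[v] = (1 - damping) / len(ranks) + damping * sum(incoming[v])
        for neighbor in edges[v]:
            outgoing[neighbor].append(ranks[v] / len(edges[v]))
    return outgoing  # becomes `incoming` at the next superstep

# Tiny 3-vertex cycle; every vertex starts with rank 1/3.
edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = {v: 1 / 3 for v in edges}
messages = {v: [ranks[u] / len(edges[u]) for u in edges if v in edges[u]]
            for v in edges}
for _ in range(30):          # one synchronization barrier per superstep
    messages = pagerank_superstep(ranks, edges, messages)
print(ranks)                 # stays 1/3 each on this symmetric cycle
```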
6 https://giraph.apache.org/
59
Big Streaming Data

Every day, Twitter generates more than 12 TB of tweets.

The New York Stock Exchange captures 1 TB of trade information.

About 30 billion radio-frequency identification (RFID) tags are created every day.

Hundreds of millions of GPS devices are sold every year.

60
Static Data Computation vs Streaming Data Computation

61
Hadoop for Big Streams?!

From the stream-processing point of view, the main limitation of the original implementation of the MapReduce framework is that it was designed so that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage.
This materialization step enables the implementation of a simple and elegant checkpoint/restart fault-tolerance mechanism, but it causes significant delay for jobs with real-time processing requirements.
Some Hadoop-based attempts include MapReduce Online and Incoop.
62
Twitter Storm7

An open-source project developed by Nathan Marz and acquired by Twitter in 2012.
Storm is a distributed stream-processing system with the following key design features: horizontal scalability, guaranteed reliable communication between the processing nodes, fault tolerance, and programming-language agnosticism.
A Storm cluster is superficially similar to a Hadoop cluster. One key difference is that a MapReduce job eventually finishes, whereas a Storm job processes messages forever (or until the user kills it).
Other projects: Flink, Apex, Spark Streaming, and Kafka Streams.
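A hedged sketch of Storm's spout/bolt idea using plain Python generators (Storm's real API is different; these names are illustrative). The contrast with MapReduce is visible in the code: tuples flow through the stages one at a time, and nothing waits for the input to end, because it never does.

```python
import itertools
import random

def tweet_spout():
    # Spout: an unbounded source that emits one tuple at a time, forever.
    words = ["big", "data", "stream", "storm"]
    while True:
        yield random.choice(words)

def counting_bolt(stream):
    # Bolt: consumes tuples as they arrive and emits running counts,
    # with no materialization barrier between stages.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# The "topology": spout -> bolt, processing messages until killed
# (here we just take the first 10 results for demonstration).
for word, count in itertools.islice(counting_bolt(tweet_spout()), 10):
    print(f"{word}: {count}")
```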
7 http://storm.apache.org/
63
Massively Parallel Processing (MPP) Optimized SQL Query Engines

64


NoSQL Databases8

NoSQL database systems represent a new generation of low-cost, high-performance database software that is increasingly gaining popularity.
These systems promise to simplify administration, be fault-tolerant, and scale out on commodity hardware, as the sketch below illustrates.
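One common mechanism behind the scale-out claim is hash partitioning of keys across commodity nodes. Below is a minimal sketch under stated assumptions: the node names, the modulo placement, and the replication factor are invented for illustration, and real systems typically use consistent hashing so that adding a node moves only a small fraction of the keys.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3", "node-4"]  # commodity servers

def owners(key, nodes=NODES, replicas=2):
    # Hash the key to pick a home node, then place extra copies on the
    # following nodes for fault tolerance. Capacity grows by adding
    # nodes (scale out) rather than buying bigger hardware (scale up).
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(owners("user:42"))  # e.g., ['node-3', 'node-4']
```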

8 http://nosql-database.org/
65
Data Storage Options

66
Big Data Landscape

67
Big Data Landscape

68
Big Data Market Size9

9 https://www.statista.com/
69
The End

70
