Module 1 - Introduction

The document provides an introduction to Big Data, defining it as large and complex data sets that require innovative processing technologies. It discusses the characteristics of Big Data, known as the 4V's (Volume, Velocity, Variety, and Veracity), and highlights common issues faced in managing and analyzing such data. Additionally, it presents case studies and applications of Big Data across various industries, emphasizing its importance and the growing job market in this field.

Introduction to Big Data

Presented by: Le Ngoc Thanh

Tutor: Le Ngoc Thanh


Department of Computer Science, FIT, HCMUS
Big data: The trending term
o Big data has been among the trending search terms in recent
years.

Source: Google Trends, updated 02/2020


©lnthanh
Outline
o What is Big Data?
• Definitions of Big Data
• The V’s characteristics of Big Data
• Common Issues in Big Data
o Big Data Case Studies
• Applications of Big Data
• Big Data Projects in practice
o Motivations and Opportunities

What is Big Data?
It’s not big. It’s just bigger…

Definitions of Big data
o A variety of Big data definitions are available worldwide.
Big data is a term used to refer to the study and applications of data sets
that are so big and complex that traditional data-processing application
software is inadequate to deal with them. – Wikipedia.

Big data is high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision making, and process
automation. – Gartner, 2001.

Big data refers to the dynamic, large and disparate volumes of data being
created by people, tools and machines; it requires new, innovative and
scalable technology to collect, host and analytically process the vast
amount of data gathered in order to derive real-time business insights
that relate to consumers, risk, profit, performance, productivity
management and enhanced shareholder value. – Ernst & Young, 2014.
Definitions of Big data

Big data is a term that describes at least three separate, but
interrelated, trends:
• Capturing and managing lots of information
• Working with many new types of data
• Exploiting these masses of information and new data types with new
styles of applications

– Hadoop for Dummies, Special Ed. 2012
Small data vs. Big data
o “Big data” is similar to “small data” but bigger.
o Handling bigger data requires different approaches
(i.e., techniques, tools and architecture, etc.).
o The goal is to solve new problems, or existing problems in a better
way.

Figure: small data fits a small computer; bigger data needs either a
bigger computer or more small computers.
Technologies in Big data
o Not a single technology but a combination of old
and new technologies that helps companies gain
actionable insight

o Capability to manage a huge volume of disparate
data, at the right speed, and within the right time
frame → allows for real-time analysis and reaction
Characteristics of Big data
o Big data is characterized by the 4V’s: Volume, Velocity, Variety, and Veracity.
The 4V’s: Velocity
o Description: Data is generated extremely fast, in a
process that never stops; velocity also covers the speed
at which data is transformed into insight
o Attributes: Batch; near/real-time; streams
o Drivers: Improved connectivity; competitive
advantage; precomputed information
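
The difference between the batch and stream attributes above can be sketched with a simple average. This is only an illustration: the function names and the sensor readings are hypothetical, and real streaming systems add windowing, ordering, and fault tolerance on top of this idea.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch style: the whole data set is available before computing."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream style: update the mean as each value arrives, in O(1) memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
running = list(streaming_mean(readings))
print(running)               # one up-to-date answer per arriving value
print(batch_mean(readings))  # a single answer once all data is in
```

The streaming version has an answer available after every reading, which is what near/real-time analysis needs; the batch version answers once, after the data has stopped arriving.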

Real-time and/or fast data

• Mobile devices (tracking all objects all the time)
• Scientific instruments (collecting all sorts of data)
• Social media and networks (all of us are generating data)
• Sensor technology and networks (measuring all kinds of data)

o Innovation and progress are no longer hindered by the ability to
collect data, but by the ability to manage, analyze, summarize, visualize, and
discover knowledge from the collected data in a timely manner and in a
scalable fashion.
Real-time analytics/decision requirement

o Real-time decisions that influence customer behavior:
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in
time to counter
• Friend invitations to join a game or activity that expands
business
• Improving the marketing effectiveness of a promotion while it
is still in play
• Preventing fraud as it is occurring, and preventing more
proactively
The 4V’s: Volume
o Description: The amount of data generated is vast
compared to traditional data sources
o Attributes: Exabytes, zettabytes, yottabytes, etc.
o Drivers: Increase in data sources, higher resolution
sensors, scalable infrastructure

The growth of data
o The data volume is increasing exponentially.
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB

Exponential increase in
collected/generated data
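
As a quick check of the growth figures on this slide, the sketch below derives the implied yearly growth rate. It assumes smooth compound growth between 2009 and 2020, which is a simplification; only the two end points come from the slide.

```python
import math

start_zb, end_zb = 0.8, 35.0   # zettabytes in 2009 and 2020, from the slide
years = 2020 - 2009

growth_factor = end_zb / start_zb                  # overall multiplier
annual_rate = growth_factor ** (1 / years) - 1     # compound annual growth rate
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"overall growth: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.0%}")
print(f"doubling time: {doubling_years:.1f} years")
```

Under that assumption the volume grows by roughly 40% per year, which means it doubles about every two years.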
The growth of data

Source: http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/, updated 05/04/2015.

Source: https://www.raconteur.net/infographics/a-day-in-data/ (2019)
Examples of Data Volume

CERN’s Large Hadron Collider (LHC) generates 30 PB of data per year
The EarthScope project
o Explore the structure and evolution of the North American
continent
o Understand the processes controlling earthquakes
and volcanoes
o Funded by the National Science Foundation (NSF),
and the data produced is publicly accessible in real time
o Since 2003, its more than 4,000 instruments have
amassed 67 TB of data.
Source: http://www.earthscope.org/
The growth of data: Other statistics
o The New York Stock Exchange generates ~4−5 TB of
data per day.
o Facebook hosts more than 240 billion photos,
growing by 7 PB of data per month.
o The genealogy site Ancestry.com stores ~10 PB of
data.
o The Internet Archive stores around 18.5 PB of data.

Source: Hadoop: The Definitive Guide, 4th edition, 2015
Why does data become big now?
o Key enablers of the appearance and growth of big data are:
• Increase of storage capacity
• Availability of data
• Increase of processing power
The 4V’s: Variety
o Description: Data comes from different sources
(machines, people, processes), both outside and
inside the organization
o Attributes: Degree of structure; complexity
o Drivers: Mobile; social media; video; genomics; IoT

A single view of the customer

o Data sources combined into one view of our customer:
• Social media
• Banking and finance
• Gaming
• Known history
• Entertainment
• Purchases
The 4V’s: Veracity
o Description: Quality and origin of data
o Attributes: Consistency; completeness; integrity;
ambiguity
o Drivers: Cost; need of traceability and justification
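
Two of the attributes above, completeness and consistency, can be measured with very little code. The sketch below is a minimal illustration: the record layout, the field names, and the `quality_report` helper are all hypothetical, and it treats a repeated `id` as a consistency problem.

```python
def quality_report(records, required_fields):
    """Return simple veracity indicators for a list of dict records."""
    if not records:
        return {"completeness": 0.0, "duplicate_ids": 0}
    # Completeness: share of records where every required field is filled in.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )
    # Consistency: the same id should not appear more than once.
    ids = [r.get("id") for r in records]
    duplicates = len(ids) - len(set(ids))
    return {"completeness": complete / len(records), "duplicate_ids": duplicates}


# Hypothetical customer records with typical quality problems.
records = [
    {"id": 1, "name": "An", "city": "Hanoi"},
    {"id": 2, "name": "", "city": "Hue"},       # missing name
    {"id": 2, "name": "Binh", "city": None},    # repeated id, missing city
    {"id": 3, "name": "Chi", "city": "Da Nang"},
]
report = quality_report(records, ["name", "city"])
print(report)
```

At big data volumes such indicators are usually computed on samples or per partition rather than by inspecting every record by hand.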

The emerging V - Value
o The ability and need to turn data into value
o Value is not only profit but also medical or social
benefits, or personal satisfaction (customer,
employee, etc.).

Common issues related to the 4V’s
o As the data volume increases, the value of different
data records will decrease in proportion to age, type,
richness, and quantity among other factors.
o It is hard to handle complex data with existing
traditional analytic systems.
• Relational databases and statistics/visualization
packages have difficulty handling big data.
• Big data may instead call for massively parallel software
running on tens, hundreds, or even thousands of servers.
• Data analytics must work with data that is constantly in motion.
Common issues related to the 4V’s
o There is a considerable gap between business
leaders and IT professionals
• Business leaders are concerned with adding value to their
business and increasing profit, while IT
leaders care only about the technicalities of storage and
processing.
Issues of storage and transport
o Current technologies limit the disk size to about 4
TB (1 TB = 10^12 bytes) → 1 exabyte (10^18 bytes) would require
250,000 disks.
• A single computer system would be unable to directly
attach the requisite number of disks
o Access to that data overwhelms current
communication networks

A 1 Gbit/second network with an effective sustainable transfer
rate of 80% has a sustainable bandwidth of about 100 MB/second.
At that rate, transferring 1 petabyte takes about 2,800 hours,
and 1 exabyte takes about 1,000 times longer.
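
The storage and transfer figures on this slide can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes) and reads the link speed as 1 gigabit per second, since that is what makes 80% efficiency come out to roughly 100 MB/s. Under those assumptions, about 2,800 hours corresponds to moving one petabyte; a full exabyte takes about 1,000 times longer.

```python
TB = 10**12
PB = 10**15
EB = 10**18

# Storage: how many 4 TB disks hold one exabyte?
disk_size = 4 * TB
disks_needed = EB // disk_size

# Transport: assumed 1 gigabit/s link at 80% sustained efficiency.
link_bits_per_s = 10**9
efficiency = 0.8
bytes_per_s = link_bits_per_s * efficiency / 8   # about 100 MB/s


def transfer_hours(n_bytes: float) -> float:
    """Hours needed to move n_bytes at the sustained rate."""
    return n_bytes / bytes_per_s / 3600


hours_pb = transfer_hours(PB)
hours_eb = transfer_hours(EB)

print(f"disks for 1 EB: {disks_needed:,}")
print(f"1 PB transfer: {hours_pb:,.0f} hours")
print(f"1 EB transfer: {hours_eb:,.0f} hours (about {hours_eb / 8760:,.0f} years)")
```

Either way the conclusion of the slide holds: data at this scale cannot practically be moved over a single link, so processing has to go to where the data is stored.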

Issues of data management
o Possibly the most difficult problem
o Issues of access, utilization, updating, governance,
and reference (in publication) are major stumbling
blocks.
• Data sources vary in size, format, and method of
collection.
• It must be known what, when, where, by whom, why and how the
data was collected.
• Given the volume, it is impractical to validate every data
item.
Issues of processing power
o Extensive parallel processing and new analytics
algorithms are required.

Assume that an exabyte of data needs to be processed and that it is chunked into blocks of 8 words (1
exabyte = 1K petabytes).
Assuming a processor expends 100 instructions on one block at 5 gigahertz (about 20 nanoseconds per block) → 1K petabytes would require
a processing time of roughly 635 years.
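
The estimate above can be reproduced as a back-of-the-envelope calculation. The sketch assumes one instruction per clock cycle, so 100 instructions at 5 GHz take 20 ns per block. The result depends strongly on the block size, which the slide does not pin down in bytes: counting every byte as one block (an assumption made here) lands close to the 635-year figure, while 64-byte blocks (8 words of 8 bytes, also an assumption) give about ten years.

```python
CLOCK_HZ = 5e9                      # 5 GHz processor
INSTRUCTIONS_PER_BLOCK = 100
SECONDS_PER_YEAR = 365 * 24 * 3600
EB = 10**18                         # bytes in one exabyte

# Assumes one instruction per clock cycle: 100 / 5e9 s = 20 ns per block.
seconds_per_block = INSTRUCTIONS_PER_BLOCK / CLOCK_HZ


def years_to_process(total_bytes, bytes_per_block, processors=1):
    """Years to process the data, split evenly across processors."""
    blocks = total_bytes / bytes_per_block
    return blocks * seconds_per_block / SECONDS_PER_YEAR / processors


years_1_byte_blocks = years_to_process(EB, 1)
years_64_byte_blocks = years_to_process(EB, 64)
years_on_1000_cpus = years_to_process(EB, 1, processors=1000)

print(f"1-byte blocks, 1 CPU:     {years_1_byte_blocks:.0f} years")
print(f"64-byte blocks, 1 CPU:    {years_64_byte_blocks:.1f} years")
print(f"1-byte blocks, 1000 CPUs: {years_on_1000_cpus:.2f} years")
```

In every variant a single processor is impractical, and the time falls in proportion to the number of processors only if the work parallelizes perfectly, which is why new parallel algorithms are needed.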

Big Data Case Studies
The more data, the better decisions, and then the better outcomes…

Big Data use case categories

Big data analytics
o Big data is more real-time in nature than traditional
data warehouse applications.
• Traditional architectures (e.g. Exadata, Teradata) are not
well-suited for big data apps.
• Shared-nothing, massively parallel processing, scale-out
architectures are well-suited for big data apps.
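
The shared-nothing, scale-out idea can be sketched on one machine: the data is split into disjoint partitions, each worker processes only its own partition, and the partial results are merged at the end. This is a toy illustration, not how any particular product works; threads stand in for separate nodes, and the function names and log lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_workers):
    """Split records into n_workers disjoint partitions (round-robin)."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_words(part):
    """Work done locally by one node, using only its own partition."""
    counts = Counter()
    for line in part:
        counts.update(line.lower().split())
    return counts


def scale_out_word_count(records, n_workers=3):
    parts = partition(records, n_workers)
    # Each worker sees one partition only; nothing is shared while counting.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)   # merge step combines the partial results
    return total


logs = ["big data is big", "data in motion", "big clusters scale out"]
totals = scale_out_word_count(logs)
print(totals.most_common(2))
```

Adding capacity in this model means adding workers and partitions rather than buying a bigger machine, which is the contrast with scale-up architectures drawn on this slide.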

Examples of Big Data Analytics

Practical cases of Big data analytics
Challenges in handling Big data

o The bottleneck is in technology
• New architecture, algorithms, techniques are needed.
o Also in technical skills
• Lack of experts in using the new technology and dealing with big data.
IBM Watson: AI services

Big data in Healthcare
o 80% of medical data is unstructured and clinically
relevant.
o Data resides in multiple places
• Individual EMRs, labs and imaging systems, physician
notes, medical correspondence, etc.
o Leveraging big data may help us to
• Build sustainable healthcare systems.
• Collaborate to improve care and outcomes.
• Increase access to healthcare.

Vestas: optimizes turbine placement

KTH: Reducing traffic congestion

Source: https://www.slideshare.net/SwissHUG/ibm-big-data-platform-nov-2012, updated 11/2012

IBM MobileFirst Connected Car

Source: http://m2m.demos.ibm.com/
Sentiment analysis on Twitter data
o Real-time sentiment analysis on Twitter data to
predict debate winners and changes in candidate
popularity.
o Tweets related to the topic are collected through
the Twitter firehose and processed by a Twitter-specific
NLP tool.
Source: http://www.socialmediatoday.com/technology-data/using-social-media-data-predict-result-2016-us-presidential-election
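
As a rough illustration of the pipeline on this slide, the sketch below scores tweets with a tiny word list and sums the scores per candidate. It is a deliberately simplified stand-in, not the Twitter-specific NLP tool used in the project: the word lists, the candidate names, and the tweets are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sentiment lexicon; real tools use far richer models.
POSITIVE = {"great", "strong", "won", "good"}
NEGATIVE = {"weak", "lost", "bad", "poor"}


def score(text):
    """Positive words count +1, negative words count -1."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def net_sentiment(tweets, candidates):
    """Sum tweet scores for each candidate mentioned in the tweet."""
    totals = defaultdict(int)
    for tweet in tweets:
        s = score(tweet)
        lowered = tweet.lower()
        for name in candidates:
            if name.lower() in lowered:
                totals[name] += s
    return dict(totals)


tweets = [
    "Smith was great tonight, strong answers",
    "Jones looked weak and lost the room",
    "Good night for Smith",
]
result = net_sentiment(tweets, ["Smith", "Jones"])
print(result)
```

In a real-time setting the same per-tweet scoring would run continuously on the incoming stream, with the totals updated as each tweet arrives.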

Motivations and opportunities
A new horizon that changes our lives…

New insight into data
o Why deal with more data? To gain new insight.
o This new insight is not only for top level executives.
• It will be used to get people throughout the enterprise to
run the business better and to provide better service to
customers.

Purpose of Big data analytics

o Examining large amounts of data
o Appropriate information
• Identification of hidden patterns, unknown correlations
o Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
Applications of Big data analytics

o Smarter Healthcare
o Multi-channel sales
o Finance
o Log Analysis
o Homeland Security
o Traffic Control
o Telecom
o Search Quality
o Manufacturing
o Trading Analytics
o Fraud and Risk
o Retail: Churn, NBO
Importance of Big Data by industry

(Copyright 2018 – Dresner Advisory Services)
Adoption of Big Data by industry

(Copyright 2018 – Dresner Advisory Services)


Adoption of Big Data 2015 - 2018

(Copyright 2018 – Dresner Advisory Services)

Big Data market forecast

By 2015, 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone


Source: Wikibon Taming Big Data

Big Data market forecast

Big Data job opportunities

Statistics for Big Data Analytics skills in IT jobs advertised across the UK (June, 2016)
Big Data job opportunities

Big Data Salaries in the United States


Salary estimated from 126,721 employees, users, and past and present job advertisements on
Indeed in the past 36 months. Last updated: February 17, 2020
(Source: https://www.indeed.com/salaries/big-data-Salaries)
