Introduction To Big Data

This document provides an introduction to big data. It defines big data as data that exceeds the processing capacity of conventional database systems due to its large volume, velocity, or variety. The three V's of big data - volume, velocity, and variety - are explained. Examples of real-world big data applications are given, such as recommendation systems, social network analysis, fraud detection, advanced weather forecasting, and data-driven journalism. The relationship between big data and data science is also discussed.


Introduction to Big Data
Jesús Montes
[email protected]

Sept. 2022
Introduction to Big Data
What is Big Data?

Introduction to Big Data 2


What is Big Data?
What are the “parameters” of data?

What is Big Data?

But, how big?

Source: https://www.visualcapitalist.com/from-amazon-to-zoom-what-happens-in-an-internet-minute-in-2021/

But, how big?
Wait a minute, ZB (zettabyte)?
Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025. Estimated, in ZB (source)

1 ZB = 1000 EB = 1000² PB = 1000³ TB = 1000⁴ GB

● If we stored 60 ZB on regular Blu-ray discs, they would weigh as much as 838 Nimitz-class aircraft carriers.
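To get a feel for the unit chain above, here is a minimal Python sketch of the decimal (SI) conversions. The `convert` helper and the 25 GB single-layer disc capacity are assumptions added for illustration; the slide does not state which disc capacity or weights it used, so the sketch only counts discs and does not try to reproduce the aircraft-carrier comparison.

```python
# Decimal (SI) storage units, as on the slide: each step is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def convert(value, src, dst):
    """Convert a quantity between two SI storage units (hypothetical helper)."""
    steps = UNITS.index(src) - UNITS.index(dst)
    return value * 1000 ** steps


# Assumption (not from the slide): 25 GB per single-layer Blu-ray disc.
DISC_CAPACITY_GB = 25

gigabytes = convert(60, "ZB", "GB")
discs = gigabytes // DISC_CAPACITY_GB

print(f"1 ZB = {convert(1, 'ZB', 'GB'):,} GB")
print(f"60 ZB would fill about {discs:,} discs of {DISC_CAPACITY_GB} GB")
```

Under that capacity assumption, 60 ZB corresponds to a few trillion discs, which is the scale the weight comparison is built on.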
A definition
Edd Dumbill (O’Reilly Media):

● ‘Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.’

The three V’s of Big Data

Velocity Volume Variety

The three V’s of Big Data
Velocity

Information is generated faster than it can be analyzed:

● The speed of network resources does not grow as fast as data volume

What we need:

● Faster stream processing and/or selective storing techniques
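As one possible illustration of a selective storing technique, the sketch below uses reservoir sampling (Algorithm R), which keeps a fixed-size uniform random sample of a stream without ever holding the whole stream in memory. The slides do not name a specific technique, so this choice, the `reservoir_sample` function, and the synthetic `event-N` stream are all assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream: a generator, so items are never all in memory at once.
events = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=5, rng=random.Random(0))
print(sample)
```

The memory used depends only on `k`, not on how many items arrive, which is the property that matters when data comes in faster than it can be stored.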

The three V’s of Big Data
Volume

Data volume grows faster than computational resources:

● Volume x10 every 5 years
● Raw CPU power is doubled every 18 months (Moore’s Law)

What we need:

● New technologies that store and manage data more efficiently

The three V’s of Big Data
Variety

Data sources are increasingly heterogeneous:

● Multi-structured or semi-structured data
● Complicated to fit into a classic relational model

What we need:

● Flexible data representation models
● Data storing and processing tools optimized for these new models
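A small sketch of what semi-structured data can look like in practice, assuming JSON documents as the flexible representation. The three records and their field names are invented for illustration; the point is only that each record carries its own structure, which a single fixed relational table would struggle to hold.

```python
import json

# Hypothetical records from three different sources; all field names are made up.
raw_records = [
    '{"source": "web", "user": "ana", "clicks": 12}',
    '{"source": "sensor", "device_id": 7, "reading": {"temp_c": 21.5}}',
    '{"source": "social", "user": "luis", "tags": ["bigdata", "intro"]}',
]

# Each line becomes a dictionary with whatever fields that record happens to have.
documents = [json.loads(line) for line in raw_records]

# Query the fields that exist and tolerate the ones that do not.
users = [doc["user"] for doc in documents if "user" in doc]
all_fields = sorted({key for doc in documents for key in doc})

print(users)
print(all_fields)
```

Only `source` is shared by every record, and one value is nested while another is a list; a relational schema would need many nullable columns or extra tables to represent the same three rows.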

Ok but, do we really need all the data?
In a voter intention poll, do we ask the entire population?

● Of course not. That would be like holding the election itself.
● We take a representative sample.

But what should the size of this sample be?

Ok but, do we really need all the data?
● Sample size will depend on the total size of the population and the confidence level we would like to achieve.
● Let's say we want a 95% confidence level with a 5% maximum error. How does the sample size n grow in relation to the population size N?

In case of doubt, go to http://www.wolframalpha.com 🙂
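One common way to answer the question is Cochran's formula with a finite population correction: n₀ = z²·p·(1−p)/e², then n = n₀ / (1 + (n₀ − 1)/N). The slides do not say which formula they have in mind, so this is an assumption, as are the function name and the conventional choices z = 1.96 (95% confidence) and p = 0.5 (the most conservative proportion).

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size via Cochran's formula with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # correction for a finite N
    return math.ceil(n)


for N in (1_000, 10_000, 1_000_000, 100_000_000):
    print(f"N = {N:>11,}  ->  n = {sample_size(N)}")
```

With these parameters n grows quickly for small populations and then levels off at 385: beyond a certain point, a larger population barely changes the sample that is needed.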

Ok but, do we really need all the data?
Wait a minute. Are you saying we can predict an election outcome (within a
reasonable margin of error) just by asking roughly 400 people?

● Apparently, that’s what (oversimplified) statistics are telling us.

But, even if we assume this is true, how exactly do we collect this sample?

● Do we simply ask the first 400 people we find in the street?
● How do we avoid sample bias?

Proper statistical tools and problem knowledge have been successfully addressing problems like this for decades.
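To make the sample-bias question concrete, here is a toy simulation comparing "the first 400 people we find" with a simple random sample of 400. The population is entirely hypothetical: 100,000 voters ordered by district, where the first district supports a candidate at 80% and everyone else at 40%. Simple random sampling is only one of several ways to reduce bias and is used here as an assumption.

```python
import random


def vote(i):
    """Hypothetical voter i: 1 = supports the candidate, 0 = does not."""
    if i < 20_000:
        return int(i % 5 != 0)   # district A: 4 out of 5 support
    return int(i % 5 < 2)        # everyone else: 2 out of 5 support


population = [vote(i) for i in range(100_000)]
true_share = sum(population) / len(population)

# Convenience sample: the first 400 people we meet all live in district A.
first_400 = population[:400]
convenience_share = sum(first_400) / len(first_400)

# Simple random sample: every voter has the same chance of being picked.
random_400 = random.Random(42).sample(population, 400)
random_share = sum(random_400) / len(random_400)

print(f"true share:         {true_share:.2f}")
print(f"first 400 people:   {convenience_share:.2f}")
print(f"random 400 people:  {random_share:.2f}")
```

Both samples have the same size, but only the random one is expected to land near the true share; the convenience sample reflects a single district.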
Then, what is Big Data?
Big data is not...

● … a replacement for statistical inference.
● … a replacement for traditional databases.
● … a replacement for standard business intelligence procedures.

Big Data tries to address new challenges where these (and other) techniques fall short: situations where data is being produced…

● … too fast (velocity).
● … in extremely large amounts (volume).
● … from many heterogeneous sources (variety).
The impact of Big Data
Penny Pritzker, US Secretary of Commerce, at a conference at MIT (March 2014):

● “Data analysis is the new fuel for the American economy”
● Citing a report by McKinsey & Co.: “If open data were available for these seven main sectors: electricity, petroleum, gas, education, transportation, health-care and finances, that could help to unlock up to three trillion dollars”.

Big Data and the Gartner hype cycle
● Big Data is a novel concept that has created a lot of hype.
● In 2013, a Gartner article claimed that Big Data was entering the “Trough of Disillusionment”.
● Nowadays we should be in the “Plateau of Productivity”, but it depends on multiple factors (region, industry, …).

http://blogs.gartner.com/svetlana-sicular/big-data-is-falling-into-the-trough-of-disillusionment/
Some examples of real Big Data applications
● Nowadays, Big Data techniques are being used by many:
○ Large corporations
○ Public services
○ Research institutions
○ Innovative start-ups
● Most people acknowledge the benefits of Big Data for customer management and marketing, but there are many more successful applications.

And many, many more...

Some examples of real Big Data applications
Recommendation systems
Social network analysis

Some examples of real Big Data applications
Fraud detection
Neuroscience

Some examples of real Big Data applications
Advanced weather forecasting

Some examples of real Big Data applications
Data-driven journalism

Big Data in three “easy” steps...
(Diagram: Data → data engineering → data analysis → decision making)

Data engineering

● Storing, managing and operating with data

Data analysis/modeling

● Extracting knowledge

Data-driven decision making

● Putting the knowledge to good use
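The three steps can be pictured as a tiny pipeline. The sketch below is deliberately simplistic and entirely hypothetical: the function names, the sales figures, and the restocking rule are invented to show how the output of each step feeds the next, not how any real system is built.

```python
from statistics import mean


def engineer(raw_rows):
    """Data engineering: parse raw records and drop the ones that cannot be used."""
    cleaned = []
    for row in raw_rows:
        try:
            cleaned.append(float(row))
        except ValueError:
            continue  # malformed or empty record
    return cleaned


def analyze(values):
    """Data analysis/modeling: extract knowledge (here, just the average)."""
    return mean(values)


def decide(average, threshold=100.0):
    """Decision making: put the knowledge to use with a made-up business rule."""
    return "restock" if average > threshold else "hold"


# Hypothetical raw daily sales, including two unusable records.
raw_daily_sales = ["120", "95", "n/a", "130", ""]

values = engineer(raw_daily_sales)
average = analyze(values)
decision = decide(average)
print(values, average, decision)
```

In a real Big Data setting each function would be replaced by dedicated storage, processing, and modeling tools, but the flow from raw data to decision stays the same.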

Big Data and Data Science
Data science is the process of extracting knowledge or insights from data in
various forms, either structured or unstructured.

● Based on the scientific method
● Can be seen as a continuation/combination of data analysis fields such as statistics, data mining, machine learning, etc.

When a data science problem cannot be addressed using traditional data storing/processing/analyzing techniques, then it also becomes a Big Data problem.

What is the problem?

(In Spanish: https://www.youtube.com/watch?v=r7Ha7NVW8Xk)


Data Science

A few Data Science use cases
Data science can be applied to many different fields and problems, from
decision making in business scenarios to the scientific domain or even more
recreational approaches.

Classical examples:

● The Wal-Mart “beer & diapers” case
● Moneyball (the real story behind the 2011 film starring Brad Pitt)
● The Netflix Prize

A few Data Science use cases
Function/model fitting
Pattern recognition

A few Data Science use cases
Prediction

A few Data Science use cases
Even create art

Final remarks/reminders
● What is Big Data? → Remember the three Vs
● When are we facing a Big Data problem? → When traditional techniques
are not enough
● What are the three “steps” of Big Data?
○ Data engineering
○ Data analysis/modeling
○ Decision making

In this course we will cover mostly data engineering. Data analysis is addressed in depth in other courses.
