Introduction To Big Data
BS (CS) 6th
Lecture # 3
Dr. Syed Attique Shah (PhD)
Common Problems in Big Data
• Big data is a blanket term used to refer to any collection of data so large and complex that it exceeds the processing capability of conventional data management systems and techniques.
• Data plays a very important role in every aspect of life.
• In most situations, some common problems recur:
• One common problem is that the size of the data is extremely large.
• Traditional systems like RDBMS do not scale up to this extent.
• Even where it is possible, doing it with an RDBMS is extremely costly.
• Another common problem is that data is split across multiple systems.
• It is not centralized in most cases.
Common Problems in Big Data
• The intention is to build a single system that can handle this diversity and volume of data, as well as perform different operations on the data.
• We can conclude that big data is a challenge consisting of three core problems, also called the 3 V's of Big Data.
• Big data is commonly characterized using a number of V's (mainly 6).
3 V's of Big Data
Characteristics of Big Data - Volume
• Volume – Refers to the amount of data being generated or present.
• The volume of data we receive these days is in petabytes or even exabytes, which is beyond the storage capacity of a single machine.
Characteristics of Big Data - Volume
• Volume is the big data dimension that relates to the sheer size of big data.
• This volume can come from large datasets being shared, or from many small data pieces and events being collected over time.
• Every minute, 204 million emails are sent, 200,000 photos are uploaded, and 1.8 million likes are generated on Facebook.
• The size and scale of storage for big data can be massive. Prefixes such as peta-, exa-, and yotta- are used to describe these sizes (a small unit-conversion sketch follows).
• CERN's Large Hadron Collider generates 15 petabytes a year.
• According to EMC, a big data company, digital data will grow by a factor of 44 until the year 2020, i.e. to 35.2 zettabytes.
• A zettabyte is 1 trillion GB, that is 10^21 bytes.
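A quick sanity check on these prefixes, as a minimal Python sketch; the 15 PB and 35.2 ZB figures are from the slide above, while the helper function and unit table are ours:

# Decimal (SI) storage prefixes used on this slide.
UNITS = {"KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12,
         "PB": 1e15, "EB": 1e18, "ZB": 1e21, "YB": 1e24}

def to_bytes(value, unit):
    """Convert a value in the given SI unit to bytes."""
    return value * UNITS[unit]

# CERN's LHC output, 15 petabytes per year, expressed in gigabytes:
print(to_bytes(15, "PB") / UNITS["GB"])           # 15,000,000.0 GB

# EMC's 2020 estimate, 35.2 zettabytes, expressed in trillions of GB:
print(to_bytes(35.2, "ZB") / UNITS["GB"] / 1e12)  # 35.2, since 1 ZB = 1 trillion GB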
Characteristics of Big Data - Volume
• What is the relevance of this much data in our world?
• The idea is to understand that businesses and organizations are collecting and leveraging large volumes of data to improve their end products, whether in safety, reliability, healthcare, or governance.
• Challenges involve the cost, scalability, and performance of storing, accessing, and processing this data.
Characteristics of Big Data - Variety
• Variety – Refers to the types of data being generated or present.
• The data we receive is not usually in the same format.
• It can be divided into three categories:
• Structured
• Semi-Structured
• Unstructured, e.g. a flat file
Characteristics of Big Data - Variety
• It refers to the increased diversity of data.
• Image data, text data, network data, geographic maps, and computer-generated simulations are only a few of the types of data we encounter every day.
• The heterogeneity of data can be characterized
along several dimensions.
• A satellite image of wildfires from NASA is very different
from tweets sent out by people who are seeing the fire
spread.
Characteristics of Big Data - Variety
• Sometimes we also use qualitative versus quantitative
measures. For example, age can be a number or we
represent it by terms like infant, juvenile, or adult.
• Think of an email collection (a minimal record sketch follows this slide):
• Sender, receiver, date - Structured
• Body of the email - Text
• Attachments - Multimedia
• Who-sends-to-whom - Network
• What the email refers to - Semantics
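A minimal, hypothetical Python record for one email; the field names are ours, not from any real email API, but each part maps to one of the data types above:

# Hypothetical email record illustrating variety within one collection.
email = {
    # Structured: fixed fields with well-defined types
    "sender": "alice@example.com",
    "receiver": "bob@example.com",
    "date": "2024-01-15",
    # Text: free-form, unstructured body
    "body": "Hi Bob, attached is the satellite image of the fire.",
    # Multimedia: binary attachments
    "attachments": ["fire_map.png"],
}

# Network: who-sends-to-whom is an edge in a communication graph.
edge = (email["sender"], email["receiver"])
print(edge)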
Characteristics of Big Data - Variety
• Impact of data variety:
• Harder to ingest
• Difficult to create common storage
• Difficult to compare and match data across varieties
• Difficult to integrate
• Management and policy challenges
Characteristics of Big Data - Velocity
• Velocity – Refers to the speed at which data is generated
and processing is required.
• The amount of data being generated is increasing at a
rapid pace
• Velocity refers to the increasing speed at which big data is created and the increasing speed at which the data needs to be stored and analyzed.
• Velocity relates to the speed of creating data, the speed of storing data, and the speed of analyzing data.
• Processing can be done as: Batch Processing, Real-Time Processing, or Streaming Analysis.
Characteristics of Big Data - Velocity
• Real-Time Processing
• Instantly capture streaming data
• Feed it to machines in real time
• Process it in real time
• Act
• Sensors and smart devices monitoring the human body can detect abnormalities in real time and trigger immediate action, potentially saving lives (a minimal sketch of this loop follows).
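A minimal sketch of that capture-process-act loop, assuming a made-up heart-rate feed and a simple threshold rule (both are illustrative, not a real medical protocol):

import random

# Hypothetical stream of heart-rate readings; a real system would read
# from a device or a message queue instead of a random generator.
def sensor_stream(n=10):
    for _ in range(n):
        yield random.gauss(75, 25)  # beats per minute

# Real-time loop: handle each reading as it arrives and act immediately.
for bpm in sensor_stream():
    if bpm < 40 or bpm > 140:   # illustrative alert thresholds
        print(f"ALERT: abnormal reading {bpm:.0f} bpm -> trigger action")
    else:
        print(f"ok: {bpm:.0f} bpm")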
Characteristics of Big Data - Velocity
• Batch Processing
• Collect Data
• Clean Data
• Feed in Chunks
• Wait
• Act
• This type of processing is still very common today, but it can be catastrophic for some businesses.
• Organizations that make decisions on the latest data are more likely to hit the target.
• For this reason, it is important to match the speed of processing with the speed of information generation and gain real-time decision-making power.
Characteristics of Big Data - Velocity
• Streaming Analysis
• Decisions based on processing of already-acquired data, as in batch processing, may give an incomplete picture; applications therefore need the real-time status of the context at hand. That is streaming analysis.
• Streaming data gives information on what is going on right now.
• Streaming data has velocity, meaning it gets generated at various rates.
• Analysis of such data in real time gives the agility and adaptability needed to maximize the benefits you want to extract (a windowed-analysis sketch follows).
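One common streaming pattern is to summarize events over small windows as they arrive, instead of storing the full history first; a toy Python sketch with made-up event rates:

# Toy tumbling-window average: emit one summary per `window` events,
# without ever holding the full stream in memory.
def windowed_averages(events, window=3):
    buf = []
    for e in events:
        buf.append(e)
        if len(buf) == window:
            yield sum(buf) / window
            buf = []  # start the next window

clicks_per_second = [5, 7, 6, 40, 42, 39, 6, 5, 7]  # made-up stream
for avg in windowed_averages(clicks_per_second):
    print(f"window average: {avg:.1f}")  # the middle window stands out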
More V’s of Big Data
• We have huge amounts of data, in different formats and of varying quality, which must be processed quickly.
• More Vs have been introduced to the big data community as we discover new
challenges and ways to define big data.
• They include:
• Veracity
• Value
• Valence
More V’s of Big Data
• Veracity – Refers to the biases, noise, and abnormality in data. Or, better yet, it refers to the often unmeasurable uncertainty about the truthfulness and trustworthiness of data.
• Data can become invalid if proper preprocessing has not been performed.
• Filtering the invalid data out increases its trustworthiness (a minimal filtering sketch follows).
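A minimal preprocessing sketch in the spirit of that last bullet; the records and validity rules are made up for illustration:

# Illustrative veracity filter: drop records that fail basic checks.
records = [
    {"user": "u1", "age": 34},
    {"user": "u2", "age": -5},    # invalid: negative age
    {"user": "u3", "age": None},  # invalid: missing value
    {"user": "u4", "age": 29},
]

def is_valid(rec):
    return rec["age"] is not None and 0 <= rec["age"] <= 120

clean = [r for r in records if is_valid(r)]
print(clean)  # only u1 and u4 survive; the rest would bias any analysis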
Characteristics of Big Data - Veracity
• Veracity of Big Data refers to the quality of the data.
• It is sometimes discussed together with validity, or with volatility, which refers to the lifetime of the data.
• Veracity is very important for making big data operational.
• Big data can be noisy and uncertain.
• It can be full of biases and abnormalities, and it can be imprecise.
• Data is of no value if it is not accurate; the results of big data analysis are only as good as the data being analyzed.
• Uncertainty of the data increases as we go from enterprise data to sensor data.
• Traditional enterprise data in warehouses has standardized quality solutions, such as master processes for ETL.
Characteristics of Big Data - Veracity
• As enterprises started incorporating less structured and unstructured people and machine data into their big data solutions, the data became messier and more uncertain.
• In January 2013, Google Flu Trends estimated almost twice as many flu cases as were reported by the CDC, the Centers for Disease Control and Prevention.
• The primary reason behind this was that Google Flu Trends used big data from the internet and did not properly account for uncertainties in the data.
• This resulted in what we call an overestimation.
More V’s of Big Data
• Valence – Refers to the connectedness of big data and the ways in which data can be used and formatted.
• Data that can be used for only one specific task is of lesser value.
• Data should be acquired and formatted in such a way as to increase its usability, since the velocity of data is very high.
Characteristics of Big Data - Valence
• Valence refers to connectedness. The more connected the data is, the higher its valence.
• Data items are often directly connected to one another:
• A city is connected to the country it belongs to.
• Two Facebook users are connected because they are friends.
• An employee is connected to his workplace.
• Data can also be indirectly connected:
• Two scientists are connected because they are both physicists.
• For a data collection, valence measures the ratio of actually connected data items to the possible number of connections within the collection (a small density sketch follows).
• The most important aspect of valence is that data connectivity increases over time.
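That ratio is simply the density of the connection graph; a small Python sketch with a made-up edge list:

# Valence as graph density: actual connections / possible connections.
# For n items, the number of possible undirected pairs is n*(n-1)/2.
def valence(n_items, edges):
    possible = n_items * (n_items - 1) / 2
    return len(edges) / possible

# Made-up collection: 5 data items, 4 direct connections.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
print(valence(5, edges))  # 4 / 10 = 0.4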
Characteristics of Big Data - Valence
• A well-known series of network graphs comes from a social experiment in which scientists attending a conference were asked to meet other scientists they did not know before. After several rounds of meetings, the new connections they formed are shown as red edges.
• A high-valence data set is denser. This makes many regular analytic techniques very inefficient.
• More complex analytical methods must be adopted to account for the increasing density.
Characteristics of Big Data - Valence
• Valence: Challenges
• More complex data exploration algorithms
• Modeling and prediction of valence changes
• Group event detection
• Emergent behavior analysis
Characteristics of Big Data - Value
• We have described five V's that are considered to be dimensions of big data.
• At the heart of the big data challenge is turning all of the other dimensions into truly useful business value.
• The idea behind processing all this big data in the first place is to bring value to the problem at hand.
Characteristics of Big Data - Value
• Value – Refers to the quality of, and enhanced decision-making for, an individual or organization.
• Data is of no use if it does not bring value.
• We should be able to derive insightful decisions from the data.
Characteristics of Big Data – Example
(Assignment)
• Let's imagine now that you're part of a company called Eglence Inc.
• One of the products of Eglence Inc is a highly popular mobile game called
Catch the Pink Flamingo.
• It's a multi-user game where the users have to catch special types of pink
flamingos that randomly pop up on the world map on their screens based on
the mission that gets updated randomly.
• The game is played by millions of people online throughout the world.
• One of the goals of the game is to form a network of players to collectively cover the world map with pink flamingo sightings and compete with other groups.
• Users can pick their groups based on player stats.
• The game's website sends free cool stuff to registered users.
Characteristics of Big Data - Example
• Registration requires users to enter demographic information such as gender, year of birth, city, highest education, and so on.
• However, most of the users enter inaccurate information about themselves, just like most of us do.
• To help improve the game, the game collects real-time usage activity data from each player and feeds it to its data servers.
• The players of this game are enthusiastically active on social media and have strong associations with the game.
• A popular Twitter hashtag for this game is #CatchThePinkFlamingo, which gets more than 200,000 mentions worldwide per day.
• There are strong communities of users who meet via social media and get together to play the game.
Characteristics of Big Data - Example
• Now, imagine yourself as the big data solutions architect for Eglence Inc.
• There are examples of all three types of data sources in this scenario:
• The mobile app generates data for the analysis of user activity.
• Twitter conversations of players form a rich source of unstructured data from people.
• The customer and game records are examples of data that the organization collects.
• This is a challenging big data example where all characteristics of big data are represented:
• There are high volumes of player, game, and Twitter data, which also speak to the variety of data.
• The data streams in from the mobile app, website, and social media in real time, which can be defined as high-velocity data.
• The quality of the demographic data users enter is unclear (veracity), and there are networks of players, which relates to the valence of big data.