Module 6 - Big Data and NOSQL
…
⚫ Big Data is a term used for a collection of data sets
that are so large and complex that they are difficult to store
and process using available database management
tools or traditional data processing applications.
⚫ It refers to a massive amount of data that keeps on growing
exponentially with time.
⚫ It is so voluminous that it cannot be processed or analysed
using conventional data processing techniques.
⚫ It includes data mining, data storage, data analysis, data
sharing, and data visualization.
⚫ The term is an all-comprehensive one including data, data
frameworks, along with the tools and techniques used to
process and analyse the data.
⚫ Big Data Driving Factors
⚫ Evolution of Big Data:
⚫ The term ‘Big Data’ has been in use since the early 1990s
⚫ Phase I: Big Data originated in the domain of database management.
⚫ Phase II: From the early 2000s, growing use of the Internet and the Web offered unique
data collection and data analysis opportunities. Companies such as Yahoo,
Amazon and eBay expanded their online stores and started analyzing customer
behavior for personalization. HTTP-based web content massively increased
the volume of semi-structured and unstructured data.
⚫ Phase III: Over the past decade, the large-scale use of smartphones with different
internet-based applications has made it possible to analyze behavioral data (such as
clicks and search queries) and location-based data (GPS data). Simultaneously,
the rise of sensor-based, internet-enabled devices, termed the ‘Internet of Things’
(IoT), has led millions of TVs, thermostats, wearables and even refrigerators to
generate zettabytes of data every day.
[Figure: The V’s of Big Data: Volume, Velocity, Variety, Veracity, Value]
⚫ Characteristics of Big Data:
⚫ The five characteristics that define Big Data are: Volume,
Velocity, Variety, Veracity and Value.
⚫ VOLUME:
⚫ Volume refers to the ‘amount of data’, which is growing day by day
at a very fast pace.
⚫ The size of the data generated by humans, machines and their
interactions on social media alone is massive.
⚫ Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will
be generated by 2020, which is an increase of 300 times from 2005.
⚫ Volume refers to the enormous amounts of
information generated every second from social media,
cell phones, cars, credit cards, M2M sensors, images,
video, and more.
⚫ Distributed systems are currently used to store the data
in several locations, brought together by a software
framework such as Hadoop.
⚫ Facebook alone reportedly generates about a billion messages,
records 4.5 billion clicks of the “like” button, and receives over
350 million new posts each day.
⚫ Such a huge amount of data can only be handled by Big
Data Technologies
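The 2020 projection quoted above can be checked with a little arithmetic. The sketch below assumes decimal units (1 ZB = 1,000 EB, as the text's own conversion implies) and a constant yearly growth rate between 2005 and 2020; the constant-rate assumption is a simplification that the source does not state.

```python
# Back-of-the-envelope check of the volume figures quoted in the text.
# Assumption: decimal (SI) units, i.e. 1 zettabyte = 1,000 exabytes.
EB_PER_ZB = 1_000

projected_2020_zb = 40   # from the text
growth_factor = 300      # "300 times" relative to 2005, from the text
years = 2020 - 2005

projected_2020_eb = projected_2020_zb * EB_PER_ZB
implied_2005_eb = projected_2020_eb / growth_factor

# Assumption: growth compounds at the same rate every year.
annual_growth = growth_factor ** (1 / years) - 1

print(f"2020 projection: {projected_2020_eb:,} EB")
print(f"Implied 2005 volume: {implied_2005_eb:.0f} EB")
print(f"Implied yearly growth: {annual_growth:.0%}")
```

Under these assumptions the 2005 volume works out to roughly 133 EB, and the implied growth is a little under 50% per year.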
VELOCITY
⚫ Velocity is defined as the pace at which different
sources generate the data every day.
⚫ This flow of data is massive and continuous.
⚫ There are 1.03 billion mobile Daily Active Users
(Facebook DAU) at the time of writing, which is an
increase of 22% year-over-year.
⚫ This shows how fast the number of users is
growing on social media and how fast data is
being generated daily.
⚫ If you are able to handle the velocity, you will be
able to generate insights and take decisions
based on real-time data.
⚫ Velocity refers to the high speed of accumulation of data.
⚫ In Big Data, data flows in from sources such as
machines, networks, social media, mobile phones etc.
⚫ There is a massive and continuous flow of data. Velocity
determines the potential of the data: how fast it is
generated and processed to meet demand.
⚫ Sampling the data can help in dealing with issues such as
velocity.
⚫ Example: More than 3.5 billion searches are made on Google
per day. Also, the number of Facebook users is
increasing by approximately 22% year over year.
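One common way to sample a fast stream whose length is not known in advance is reservoir sampling. The sketch below is an illustration and not a method named in the source; the function name `reservoir_sample` and the synthetic `events` stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream: a generator, so the full data is never held in memory.
events = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=5, seed=42)
print(sample)
```

Only `k` items are held in memory at any time, so the cost of keeping the sample does not grow with the rate or length of the stream.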
VARIETY
⚫ As there are many sources which are contributing to Big Data, the
type of data they are generating is different.
⚫ It can be structured, semi-structured or unstructured.
⚫ Hence, there is a variety of data which is getting generated every
day.
⚫ Earlier, data came mainly from Excel and databases; now
data arrives in the form of images, audio, video, sensor data
etc., as shown in the image below.
⚫ Hence, this variety of unstructured data creates problems in
capturing, storage, mining and analysing the data.
⚫ It refers to the nature of the data: structured, semi-structured or
unstructured.
⚫ It also refers to heterogeneous sources.
⚫ Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured,
semi-structured or unstructured.
⚫ Structured data: This is organized data. It generally
refers to data with a defined length and format.
⚫ Semi-structured data: This is partially organized data. It
generally does not conform to the formal structure of a
data model. Log files are an example of this type of data.
⚫ Unstructured data: This is unorganized data. It
generally refers to data that doesn’t fit neatly into the traditional row
and column structure of a relational database. Text, pictures, videos
etc. are examples of unstructured data, which can’t be stored in the
form of rows and columns.
⚫ Big Data is generated in multiple varieties.
⚫ Compared to traditional data like phone numbers and
addresses, the latest trend is data in the form of
photos, videos, audio and more, making about
80% of the data completely unstructured.
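The three kinds of data described above can be contrasted in a few lines. This is a minimal sketch with made-up values: the `Contact` class, the JSON text and the free-text note are all hypothetical, and JSON is used here as one familiar semi-structured format.

```python
import json
from dataclasses import dataclass


# Structured: every record has the same fields with fixed types.
@dataclass
class Contact:
    id: int
    name: str
    phone: str


contacts = [Contact(1, "Asha", "555-0100"), Contact(2, "Ravi", "555-0101")]

# Semi-structured: keys describe the data, but records may differ in shape.
raw_json = '[{"id": 1, "name": "Asha", "tags": ["vip"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(raw_json)

# Unstructured: free text with no schema; only generic processing applies.
note = "Asha called about her order and said delivery was late."
word_count = len(note.split())

print(contacts[0].phone)                                # fixed field, always present
print(records[0].get("tags"), records[1].get("tags"))   # optional field
print(word_count)
```

The structured records can be addressed by field name without checks, the semi-structured ones need `.get()` because a key may be absent, and the unstructured note supports only generic operations such as counting words.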
VERACITY
⚫ Veracity refers to doubt about or uncertainty in the available
data due to inconsistency and incompleteness.
⚫ In the image below, you can see that a few values are missing in
the table.
⚫ Also, a few values are hard to accept: for example, the minimum
value of 15000 in the 3rd row is not possible.
⚫ Such inconsistency and incompleteness are Veracity problems.
⚫ Veracity basically means the degree of reliability that the data
has to offer.
⚫ Since a major part of the data is unstructured and irrelevant,
Big Data systems need a way to filter it out or to
translate it, as the data is crucial to business
developments.
⚫ It refers to inconsistencies and uncertainty in data: the data
that is available can sometimes be messy, and its quality and
accuracy are difficult to control.
⚫ Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types and
sources.
⚫ Example: Data in bulk could create confusion, whereas too small
an amount of data could convey partial or incomplete information.
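A basic veracity check flags missing values and values that cannot be right. The sketch below uses hypothetical rows, since the table from the original slide is not reproduced here; the field names `station`, `min` and `max` and the rule "minimum must not exceed maximum" are assumptions chosen to mirror the 15000 example.

```python
# Hypothetical rows; the table from the original slide is not available.
rows = [
    {"station": "A", "min": 12, "max": 30},
    {"station": "B", "min": None, "max": 28},    # missing value
    {"station": "C", "min": 15000, "max": 31},   # implausible: min exceeds max
    {"station": "D", "min": 10, "max": None},    # missing value
]


def find_issues(row):
    """Return a list of data-quality problems found in one row."""
    issues = [f"missing {key}" for key, value in row.items() if value is None]
    lo, hi = row.get("min"), row.get("max")
    if lo is not None and hi is not None and lo > hi:
        issues.append("min greater than max")
    return issues


report = {row["station"]: find_issues(row) for row in rows}
clean = [row for row in rows if not report[row["station"]]]

for station, issues in report.items():
    print(station, issues or "ok")
```

Rows with an empty issue list end up in `clean`; what to do with the rest (drop, repair or impute) depends on the application.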
VALUE
⚫ Big data value refers to the usefulness of gathered data for
your business.
⚫ It is not just the amount of data that we store or process.
⚫ Data in itself is of no use or importance; it needs to be converted
into something valuable by extracting information from it.
⚫ It is actually the amount of valuable, reliable and trustworthy
data that needs to be stored, processed and analyzed to find
insights.
⚫ Types of Big-Data
⚫ Big Data is generally categorized into three different
varieties. They are as shown below:
⚫ Structured Data
⚫ Semi-Structured Data
⚫ Unstructured Data
Structured Data
⚫ SQL Databases
⚫ Spreadsheets such as Excel
⚫ OLTP Systems
⚫ Online forms
⚫ Sensors such as GPS or RFID tags
⚫ Network and Web server logs
⚫ Medical devices
⚫ Advantages of Structured Data:
⚫ Structured data has a well-defined structure that helps in easy
storage and access of data
⚫ Data can be indexed based on text strings as well as attributes. This
makes search operations hassle-free
⚫ Data mining is easy, i.e., knowledge can be easily extracted from data
⚫ Operations such as updating and deleting are easy due to the
well-structured form of the data
⚫ Business Intelligence operations such as data warehousing can be
easily undertaken
⚫ Easily scalable when the volume of data grows
⚫ Securing the data is easy
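The indexing, updating and deleting advantages listed above can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch; the `customers` table, its columns and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Mumbai"), ("Meera", "Pune")],
)

# Index on an attribute to speed up searches by city.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# Update and delete address rows through their well-defined columns.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Delhi", "Ravi"))
conn.execute("DELETE FROM customers WHERE name = ?", ("Meera",))

pune = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(pune, remaining)
```

Because every row shares the same columns, each operation is a single declarative statement and no per-record parsing is needed.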
Semi-Structured Data
⚫ Semi-Structured Data can be considered as another form of
Structured Data.
⚫ It inherits a few properties of Structured Data, but the major
part of this kind of data fails to have a definite structure and
also, it does not obey the formal structure of data models
such as an RDBMS.
⚫ Example: Comma-Separated Values (CSV) file.
⚫ Semi-structured data is data that does not conform to a data
model but has some structure.
⚫ It lacks a fixed or rigid schema.
⚫ It is data that does not reside in a relational database but
has some organizational properties that make it easier
to analyze.
⚫ With some processing, it can be stored in a relational
database.
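As an illustration of the processing needed to store semi-structured data relationally, the sketch below flattens JSON records into two SQLite tables. The record layout, the table names `users` and `user_tags`, and the choice to store absent fields as NULL are assumptions made for this example.

```python
import json
import sqlite3

# Hypothetical semi-structured input: records do not share the same keys.
raw = """[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "555-0101", "tags": ["vip", "beta"]}
]"""
records = json.loads(raw)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
# Nested lists get their own table, one row per list element.
conn.execute("CREATE TABLE user_tags (user_id INTEGER, tag TEXT)")

for rec in records:
    # Keys absent from a record are stored as NULL.
    conn.execute(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        (rec["id"], rec["name"], rec.get("email"), rec.get("phone")),
    )
    conn.executemany(
        "INSERT INTO user_tags VALUES (?, ?)",
        [(rec["id"], tag) for tag in rec.get("tags", [])],
    )

users = conn.execute("SELECT id, name, email, phone FROM users ORDER BY id").fetchall()
tags = conn.execute(
    "SELECT tag FROM user_tags WHERE user_id = 2 ORDER BY tag"
).fetchall()
print(users)
print(tags)
```

The cost of this mapping is a fixed schema: any key not anticipated in the table definition is dropped unless the schema is changed.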
⚫ Characteristics of semi-structured Data:
⚫ Data does not conform to a data model but has some structure.
⚫ Data cannot be stored in the form of rows and columns as in
databases
⚫ Semi-structured data contains tags and elements (metadata) which are
used to group data and describe how the data is stored
⚫ Similar entities are grouped together and organized in a hierarchy
⚫ Entities in the same group may or may not have the same attributes or
properties
⚫ Does not contain sufficient metadata which makes automation and
management of data difficult
⚫ Size and type of the same attributes in a group may differ
⚫ Due to the lack of a well-defined structure, it cannot be used easily by computer
programs
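Several of the characteristics above (tags as metadata, hierarchical grouping, entities in one group with different attributes) can be seen in a small XML document. The sketch below parses it with the standard library's `xml.etree.ElementTree`; the `library` document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both entries are <book> elements,
# but they do not carry the same child elements.
doc = """
<library>
  <book id="b1"><title>Data Basics</title><author>A. Rao</author><year>2015</year></book>
  <book id="b2"><title>Streams</title><pages>180</pages></book>
</library>
""".strip()

root = ET.fromstring(doc)

books = []
for book in root.findall("book"):
    # Tags act as metadata: each child's tag names the value it holds.
    fields = {child.tag: child.text for child in book}
    fields["id"] = book.get("id")
    books.append(fields)

# The union of field names shows what a fixed schema would have to cover.
all_fields = sorted(set().union(*(b.keys() for b in books)))
print(books)
print(all_fields)
```

A program consuming this data has to handle missing fields explicitly, for example `books[1].get("author")` returns `None`, which is one reason such data is harder to process than fixed-schema rows.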
⚫ Sources of semi-structured Data:
⚫ E-mails
⚫ XML and other markup languages
⚫ Binary executables
⚫ TCP/IP packets
⚫ Zipped files
⚫ Integration of data from different sources
⚫ Web pages
⚫ Advantages of Semi-structured Data:
⚫ The data is not constrained by a fixed schema
⚫ Flexible, i.e., the schema can be easily changed.
⚫ Data is portable
⚫ It is possible to view structured data as semi-structured
data.
⚫ It supports users who cannot express their needs in SQL
⚫ It can deal easily with the heterogeneity of sources.
⚫ Disadvantages of Semi-structured data