
Module 6 - Big Data and NOSQL

Big Data refers to large and complex data sets that cannot be processed using traditional data management tools. It is characterized by five V's: Volume, Velocity, Variety, Veracity, and Value, and has evolved significantly since the 1990s due to advancements in technology and the internet. Various types of data, including structured, semi-structured, and unstructured data, are generated from multiple sources, leading to diverse applications in fields such as transportation and analytics.

Uploaded by

signinshreyas
Copyright
© All Rights Reserved

Module I.

Introduction to Big Data


⚫ What is Big Data?
⚫ The term Big Data refers to a huge volume of data that
cannot be stored or processed by traditional data
storage or processing units.
⚫ Big Data is generated at a very large scale, and many
multinational companies process and analyse it to discover
insights and improve their business.
What is Big Data
• Big data is a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing applications.
• “Big Data” is data whose scale, diversity, and complexity require
new architectures, techniques, algorithms, and analytics to
manage it and extract value and hidden knowledge from it…
• "Big data" is a field that treats ways to analyze, systematically
extract information from, or otherwise deal with data sets that are
too large or complex to be dealt with by traditional data-processing
application software.


⚫ Big Data is a term used for a collection of data sets
that are so large and complex that they are difficult to
store and process using available database management
tools or traditional data processing applications.
⚫ It refers to a massive amount of data that keeps on growing
exponentially with time.
⚫ It is so voluminous that it cannot be processed or analysed
using conventional data processing techniques.
⚫ It includes data mining, data storage, data analysis, data
sharing, and data visualization.
⚫ The term is an all-encompassing one that covers data, data
frameworks, and the tools and techniques used to
process and analyse the data.
⚫ Big Data Driving Factors
⚫ Evolution of Big Data:
⚫ The term ‘Big Data’ has been in use since the early 1990s.
⚫ Phase I: Big Data originated in the domain of database management.
⚫ Phase II: From the early 2000s, usage of the Internet and the Web started offering unique
data collection and data analysis opportunities. Companies such as Yahoo,
Amazon and eBay expanded their online stores and started analyzing customer
behavior for personalization. HTTP-based content on the web massively increased
the volume of semi-structured and unstructured data.
⚫ Phase III: Over the past decade, the large-scale usage of smartphones with different
internet-based applications has made it possible to analyze behavioral data (such as
clicks and search queries) and also location-based data (GPS data). Simultaneously,
the rise of sensor-based, internet-enabled devices termed the ‘Internet of Things’
(IoT) is causing millions of TVs, thermostats, wearables and even refrigerators to
generate zettabytes of data every day.
[Figure: The V’s of Big Data: Volume, Velocity, Variety, Veracity, Value]
⚫ Characteristics of Big Data:
⚫ The five characteristics that define Big Data are Volume,
Velocity, Variety, Veracity and Value.

⚫ VOLUME:
⚫ Volume refers to the ‘amount of data’, which is growing day by day
at a very fast pace.
⚫ The data generated by humans, machines and their
interactions on social media alone is massive.
⚫ Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will
be generated by 2020, which is an increase of 300 times from 2005.
⚫ Volume refers to the unimaginable amounts of
information generated every second from social media,
cell phones, cars, credit cards, M2M sensors, images,
video, and whatnot.
⚫ Distributed systems are currently used to store data
in several locations, brought together by a software
framework such as Hadoop.
⚫ Facebook alone generates about a billion messages, records
4.5 billion clicks of the “like” button, and receives over
350 million new posts each day.
⚫ Such a huge amount of data can only be handled by Big
Data technologies.
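The volume figures quoted above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where 1 zettabyte is 1,000 exabytes; the function names are illustrative and not part of any library.

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_exabytes(zb):
    """Convert a size in zettabytes to exabytes."""
    return zb * BYTES_PER_ZETTABYTE / BYTES_PER_EXABYTE


def implied_baseline(final_amount, growth_factor):
    """Starting amount implied by a final amount and a growth multiple."""
    return final_amount / growth_factor


# 40 ZB predicted for 2020, stated as a 300-fold increase over 2005.
predicted_2020_eb = zettabytes_to_exabytes(40)
implied_2005_eb = implied_baseline(predicted_2020_eb, 300)

print(f"40 ZB = {predicted_2020_eb:,.0f} EB")
print(f"Implied 2005 volume: about {implied_2005_eb:.0f} EB")
```

Under these assumptions, the 300-fold growth figure implies a 2005 baseline of roughly 133 exabytes.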
VELOCITY
⚫ Velocity is defined as the pace at which different
sources generate the data every day.
⚫ This flow of data is massive and continuous.
⚫ There are 1.03 billion Daily Active Users
(Facebook DAU) on mobile as of now, which is an
increase of 22% year-over-year.
⚫ This shows how fast the number of users is
growing on social media and how fast data is
being generated daily.
⚫ If you are able to handle the velocity, you will be
able to generate insights and take decisions
based on real-time data.
⚫ Velocity refers to the high speed of accumulation of data.
⚫ In Big Data, data flows in from sources like
machines, networks, social media, mobile phones etc.
⚫ There is a massive and continuous flow of data. How fast
the data is generated and processed to meet demand
determines its potential.
⚫ Sampling the data can help in dealing with issues such as
velocity.
⚫ Example: More than 3.5 billion searches are made on
Google per day. Also, Facebook users are
increasing by approximately 22% year on year.
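One way to sample a fast stream without storing all of it is reservoir sampling. The slides do not name a specific sampling method, so this is only a minimal sketch of one common choice; the event stream and the function name reservoir_sample are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical event stream: each event is seen once and never stored in full.
events = (f"click-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=5, seed=42)
print(sample)
```

Memory use stays fixed at k items however long the stream runs, which is the property that makes sampling useful when data arrives faster than it can be stored.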
VARIETY
⚫ As there are many sources which are contributing to Big Data, the
type of data they are generating is different.
⚫ It can be structured, semi-structured or unstructured.
⚫ Hence, there is a variety of data which is getting generated every
day.
⚫ Earlier, we used to get data from Excel and databases; now
data comes in the form of images, audio, video, sensor data
etc., as shown in the image below.
⚫ Hence, this variety of unstructured data creates problems in
capturing, storing, mining and analysing the data.
⚫ It refers to the nature of data, which may be structured, semi-structured or
unstructured.
⚫ It also refers to heterogeneous sources.
⚫ Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured,
semi-structured or unstructured.
⚫ Structured data: This is organized data. It generally
refers to data that has a defined length and format.
⚫ Semi-structured data: This is semi-organised data. It is
generally a form of data that does not conform to the formal structure of
a data model. Log files are examples of this type of data.
⚫ Unstructured data: This is unorganized data. It
generally refers to data that doesn’t fit neatly into the traditional row
and column structure of a relational database. Texts, pictures, videos
etc. are examples of unstructured data, which can’t be stored in the
form of rows and columns.
⚫ Big Data is generated in multiple varieties.
⚫ Compared to traditional data like phone numbers and
addresses, the latest trend is data in the form of
photos, videos, audio and more, making about
80% of the data completely unstructured.
VERACITY
⚫ Veracity refers to data in doubt, that is, uncertainty about the
available data due to inconsistency and incompleteness.
⚫ In the image below, you can see that a few values are missing in
the table.
⚫ Also, a few values are hard to accept, for example the minimum value of
15000 in the 3rd row, which is not possible.
⚫ This inconsistency and incompleteness is what Veracity is concerned with.
⚫ Veracity basically means the degree of reliability that the data
has to offer.
⚫ Since a major part of the data is unstructured and irrelevant,
Big Data needs an alternate way to filter it or to
translate it, as the data is crucial in business
developments.
⚫ It refers to inconsistencies and uncertainty in data: the data
that is available can sometimes get messy, and its quality and
accuracy are difficult to control.
⚫ Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types and
sources.
⚫ Example: Data in bulk could create confusion, whereas too small an
amount of data could convey half or incomplete information.
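The two veracity problems described above, missing values and values that cannot be true, can be screened for with simple checks. The table from the slide is not available here, so the records, field names and plausible ranges below are all hypothetical.

```python
# Hypothetical plausible range for each field (assumed for illustration).
PLAUSIBLE_RANGE = {"min_temp_c": (-90.0, 60.0), "max_temp_c": (-90.0, 60.0)}


def find_quality_issues(records):
    """Report missing values and values outside the plausible range."""
    issues = []
    for row_number, record in enumerate(records, start=1):
        for field, (low, high) in PLAUSIBLE_RANGE.items():
            value = record.get(field)
            if value is None:
                issues.append(f"row {row_number}: {field} is missing")
            elif not low <= value <= high:
                issues.append(f"row {row_number}: {field}={value} is implausible")
    return issues


readings = [
    {"min_temp_c": 12.0, "max_temp_c": 24.5},
    {"min_temp_c": None, "max_temp_c": 30.1},     # missing value
    {"min_temp_c": 15000.0, "max_temp_c": 28.0},  # impossible value
]
for issue in find_quality_issues(readings):
    print(issue)
```

Checks like these only flag suspect rows; deciding whether to drop, correct or keep them is a separate judgement.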
VALUE
⚫ Big data value refers to the usefulness of gathered data for
your business.
⚫ It is not just the amount of data that we store or process.
⚫ Data in itself is of no use or importance; it needs to be converted
into something valuable by extracting information from it.
⚫ It is actually the amount of valuable, reliable and trustworthy
data that needs to be stored, processed and analyzed to find
insights.
⚫ Types of Big-Data
⚫ Big Data is generally categorized into three different
varieties. They are as shown below:
⚫ Structured Data
⚫ Semi-Structured Data
⚫ Unstructured Data
Structured Data

⚫ Structured Data has a dedicated data model and a
well-defined structure. It follows a consistent order and is
designed in such a way that it can be easily accessed and
used by a person or a computer.
⚫ Structured data is usually stored in well-defined columns in
databases.
⚫ Example: Database Management Systems (DBMS)
⚫ Structured data is data which conforms to a data model,
has a well-defined structure, follows a consistent order and
can be easily accessed and used by a person or a computer
program.
⚫ Structured data is usually stored in well-defined schemas
such as databases.
⚫ It is generally tabular, with columns and rows that clearly
define its attributes.
⚫ SQL (Structured Query language) is often used to manage
structured data stored in databases.
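As a small illustration of structured data managed with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The customers table and its rows are made up for the example.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai"), (3, "Meera", "Pune")],
)

# Every row has the same columns, so a query can address them by name.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
print(rows)
conn.close()
```

Because the schema fixes the columns in advance, filtering and sorting need no knowledge of how individual rows were produced.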
⚫ Characteristics of Structured Data:
⚫ Data conforms to a data model and has easily identifiable
structure
⚫ Data is stored in the form of rows and columns
⚫ Example : Database
⚫ Data is well organised, so the definition, format and meaning of the data
are explicitly known
⚫ Data resides in fixed fields within a record or file
⚫ Similar entities are grouped together to form relations or classes
⚫ Entities in the same group have same attributes
⚫ Easy to access and query, so data can be easily used by other
programs
⚫ Data elements are addressable, so they are efficient to analyse and
process
⚫ Sources of Structured Data:

⚫ SQL Databases
⚫ Spreadsheets such as Excel
⚫ OLTP Systems
⚫ Online forms
⚫ Sensors such as GPS or RFID tags
⚫ Network and Web server logs
⚫ Medical devices
⚫ Advantages of Structured Data:
⚫ Structured data has a well-defined structure that helps in easy
storage and access of data
⚫ Data can be indexed based on text strings as well as attributes. This
makes search operations hassle-free
⚫ Data mining is easy, i.e. knowledge can be easily extracted from data
⚫ Operations such as updating and deleting are easy due to the
well-structured form of the data
⚫ Business Intelligence operations such as data warehousing can be
easily undertaken
⚫ Easily scalable in case there is an increase in data
⚫ Ensuring the security of data is easy
Semi-Structured Data
⚫ Semi-Structured Data can be considered as another form of
Structured Data.
⚫ It inherits a few properties of Structured Data, but the major
part of this kind of data lacks a definite structure and
does not obey the formal structure of data models
such as an RDBMS.
⚫ Example: Comma Separated Values (CSV) file.
⚫ Semi-structured data is data that does not conform to a data
model but has some structure.
⚫ It lacks a fixed or rigid schema.
⚫ It is data that does not reside in a relational database but
that has some organizational properties that make it easier
to analyze.
⚫ With some processing, it can be stored in a relational
database.
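The "processing" needed to fit semi-structured data into a relational table can be sketched as follows. This is one possible approach under stated assumptions, not a standard procedure: the JSON records are hypothetical, missing attributes become None, and nested values are stored as JSON text.

```python
import json

# Hypothetical semi-structured records: same kind of entity, different attributes.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Meera"}
]
"""

records = json.loads(raw)

# Collect every attribute that appears in any record, in a stable order.
columns = sorted({key for record in records for key in record})


def to_row(record):
    """Map one record onto the shared columns; absent attributes become None."""
    row = []
    for column in columns:
        value = record.get(column)
        # Nested values are kept as JSON text so they fit in one column.
        if isinstance(value, (list, dict)):
            value = json.dumps(value)
        row.append(value)
    return tuple(row)


rows = [to_row(record) for record in records]
print(columns)
for row in rows:
    print(row)
```

The resulting rows all have the same width, so they could be inserted into a table, at the cost of many empty cells when records share few attributes.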
⚫ Characteristics of semi-structured Data:
⚫ Data does not conform to a data model but has some structure.
⚫ Data cannot be stored in the form of rows and columns as in
databases
⚫ Semi-structured data contains tags and elements (metadata) which are
used to group data and describe how the data is stored
⚫ Similar entities are grouped together and organized in a hierarchy
⚫ Entities in the same group may or may not have the same attributes or
properties
⚫ Does not contain sufficient metadata, which makes automation and
management of data difficult
⚫ Size and type of the same attributes in a group may differ
⚫ Due to the lack of a well-defined structure, it cannot be used by computer
programs easily
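The characteristics above (tags as metadata, a hierarchy, and entities in one group with different attributes) can be seen in a small XML document. The library and book elements below are invented for illustration; parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tags describe the data, and the two
# <book> entries do not carry the same set of child elements.
document = """
<library>
  <book id="b1">
    <title>Data at Scale</title>
    <author>A. Writer</author>
  </book>
  <book id="b2">
    <title>Streams and Sensors</title>
    <year>2019</year>
  </book>
</library>
"""

root = ET.fromstring(document)
books = root.findall("book")
for book in books:
    # Build a dict from whatever child tags this entry happens to have.
    fields = {child.tag: child.text for child in book}
    print(book.get("id"), fields)
```

A program reading this data has to cope with a field such as year being present in one entry and absent in the next, which is why such data is harder to process than a fixed table.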
⚫ Sources of semi-structured Data:

⚫ E-mails
⚫ XML and other markup languages
⚫ Binary executables
⚫ TCP/IP packets
⚫ Zipped files
⚫ Integration of data from different sources
⚫ Web pages
⚫ Advantages of Semi-structured Data:
⚫ The data is not constrained by a fixed schema
⚫ Flexible, i.e. the schema can be easily changed.
⚫ Data is portable
⚫ It is possible to view structured data as semi-structured
data.
⚫ It supports users who cannot express their needs in SQL
⚫ It can deal easily with the heterogeneity of sources.
⚫ Disadvantages of Semi-structured data:
⚫ Lack of a fixed, rigid schema makes it difficult to store
the data
⚫ Interpreting the relationship between data is difficult as
there is no separation of the schema and the data.
⚫ Queries are less efficient as compared to structured data.
Unstructured Data
⚫ Unstructured Data is a completely different type of data, which
neither has a structure nor follows the formal
structural rules of data models.
⚫ It does not even have a consistent format, and it is found to
vary all the time. However, it may occasionally carry information
related to date and time.
⚫ Example: Audio files, images etc.
⚫ Unstructured data is data which does not conform to
a data model and has no easily identifiable structure, such
that it cannot be used by a computer program easily.
⚫ Unstructured data is not organised in a pre-defined
manner and does not have a pre-defined data model, thus
it is not a good fit for a mainstream relational database.
⚫ Characteristics of Unstructured Data:
⚫ Data neither conforms to a data model nor has any
structure.
⚫ Data cannot be stored in the form of rows and columns
as in databases
⚫ Data does not follow any semantics or rules
⚫ Data lacks any particular format or sequence
⚫ Data has no easily identifiable structure
⚫ Due to the lack of identifiable structure, it cannot be used by
computer programs easily
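To show why unstructured data is harder for programs to use, the sketch below pulls a little structure out of free text. The memo is hypothetical, and the date pattern is a guess (ISO format) that would miss dates written any other way.

```python
import re
from collections import Counter

# Hypothetical free-text memo: no fields, no schema.
memo = (
    "Meeting held on 2021-03-15 to review the sensor rollout. "
    "Sensor faults were reported again on 2021-03-18; rollout is delayed."
)

# Dates are recovered only because we guess a pattern (ISO format here).
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", memo)

# A crude word count stands in for real text analysis.
words = re.findall(r"[a-z]+", memo.lower())
word_counts = Counter(words)

print(dates)
print(word_counts["sensor"], word_counts["rollout"])
```

Nothing in the text itself says which parts are dates or topics; the program has to impose that interpretation, and a different memo could defeat it.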
⚫ Sources of Unstructured Data:
⚫ Web pages
⚫ Images (JPEG, GIF, PNG, etc.)
⚫ Videos
⚫ Memos
⚫ Reports
⚫ Word documents and PowerPoint presentations
⚫ Surveys
⚫ Advantages of Unstructured Data:
⚫ It supports data which lacks a proper format or
sequence
⚫ The data is not constrained by a fixed schema
⚫ Very flexible due to the absence of a schema.
⚫ Data is portable
⚫ It is very scalable
⚫ It can deal easily with the heterogeneity of sources.
⚫ This type of data has a variety of business intelligence
and analytics applications.
⚫ Disadvantages of Unstructured data:
⚫ It is difficult to store and manage unstructured data due
to the lack of schema and structure
⚫ Indexing the data is difficult and error-prone due to
unclear structure and the absence of pre-defined attributes,
so search results are not very accurate.
⚫ Ensuring the security of data is a difficult task.
⚫ Differences between Structured, Semi-structured
and Unstructured data:
⚫ Examples of Big Data
⚫ Every day we upload millions of bytes of data. 90% of the world’s data
has been created in the last two years.
Big Data Applications:
Transportation:
⚫ Big Data powers the GPS smartphone applications most of us depend on to
get from place to place in the least amount of time.
⚫ GPS data sources include satellite images and government agencies.
⚫ Airplanes generate enormous volumes of data, on the order of 1,000
gigabytes for transatlantic flights.
⚫ Aviation analytics systems ingest all of this to analyze fuel efficiency,
passenger and cargo weights, and weather conditions, with a view toward
optimizing safety and energy consumption.
⚫ Congestion management and traffic control
⚫ Thanks to Big Data analytics, Google Maps can now tell
you the least traffic-prone route to any destination.
⚫ Route planning
⚫ Different itineraries can be compared in terms of user
needs, fuel consumption, and other factors to plan for
maximum efficiency.
⚫ Traffic safety
⚫ Real-time processing and predictive analytics are used to
pinpoint accident-prone areas.
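Route planning of this kind is often modelled as a shortest-path problem. The sketch below uses Dijkstra's algorithm on a tiny, hypothetical road network whose edge weights are current travel times in minutes. It illustrates the idea only and does not describe how Google Maps or any other product works internally.

```python
import heapq


def fastest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total minutes, path) from start to goal."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == goal:
            return minutes, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, cost in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (minutes + cost, neighbour, path + [neighbour]))
    # The goal cannot be reached from the start.
    return float("inf"), []


# Hypothetical road network; weights are travel times in minutes.
roads = {
    "Home": {"Highway": 10, "OldTown": 4},
    "Highway": {"Office": 25},  # congested stretch
    "OldTown": {"Bridge": 6},
    "Bridge": {"Office": 9},
    "Office": {},
}
print(fastest_route(roads, "Home", "Office"))
```

With live traffic data, the weights would be updated continuously and the same search re-run, so the recommended route can change as congestion builds.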
⚫ Meteorology
⚫ Weather satellites and sensors all over the world collect large amounts of
data for tracking environmental conditions.
⚫ Meteorologists use Big Data to:

⚫ Study natural disaster patterns

⚫ Prepare weather forecasts

⚫ Understand the impact of global warming

⚫ Predict the availability of drinking water in various world regions

⚫ Provide early warning of impending crises such as hurricanes and tsunamis


⚫ Healthcare
⚫ Big Data is slowly but surely making a major impact on
the huge healthcare industry.
⚫ Wearable devices and sensors collect patient data which
is then fed in real-time to individuals’ electronic health
records.
⚫ Providers and practice organizations are now using Big
Data for a number of purposes, including these:
⚫ Prediction of epidemic outbreaks
⚫ Early symptom detection to avoid preventable diseases
⚫ Electronic health records
⚫ Real-time alerting
⚫ Enhancing patient engagement
⚫ Prediction and prevention of serious medical conditions
Advantages of Big Data
⚫ Better Decision Making
⚫ Reduce costs of business processes
⚫ Fraud Detection
⚫ Increased productivity
⚫ Improved customer service
⚫ Difference between Traditional data and Big data
⚫ Traditional data: Traditional data is the structured data
that is maintained by all types of
businesses, from very small to big organizations.
⚫ In a traditional database system, a centralized database
architecture is used to store and maintain the data in a fixed
format or fields in a file.
⚫ Structured Query Language (SQL) is used for managing and
accessing the data.
⚫ The differences between Traditional data and Big data are as
follows:
Challenges with Big Data
⚫ Challenges which come along with Big Data:
⚫ Data Quality –
⚫ The problem here is the 4th V, i.e. Veracity.
⚫ The data here is very messy, inconsistent and incomplete. Dirty data
costs companies in the United States $600 billion every year.
⚫ Discovery –
⚫ Analyzing petabytes of data using extremely powerful algorithms to find
patterns and insights is very difficult.
⚫ Storage –
⚫ The more data an organization has, the more complex the problems of
managing it can become.
⚫ The question that arises here is “Where to store it?”. We need a storage
system which can easily scale up or down on-demand.
⚫ Analytics –
⚫ In the case of Big Data, most of the time we are unaware of the
kind of data we are dealing with, so analysing that data is even
more difficult.
⚫ Security –
⚫ Since the data is huge in size, keeping it secure is another
challenge.
⚫ It includes user authentication, restricting access on a
per-user basis, recording data access histories, proper use of data
encryption etc.
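Three of the security measures listed above (user authentication, per-user access restriction and an access history) can be sketched with the Python standard library alone. The DataStore class, user names and dataset names are hypothetical, and encryption is left out because the standard library has no general-purpose cipher; treat this as a toy model, not a production design.

```python
import hashlib
import hmac
import os
from datetime import datetime, timezone


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class DataStore:
    """Toy store with authentication, per-user permissions and an access log."""

    def __init__(self):
        self._users = {}        # user -> (salt, password hash)
        self._permissions = {}  # user -> set of readable datasets
        self.access_log = []    # (timestamp, user, dataset, allowed)

    def add_user(self, user, password, datasets):
        salt = os.urandom(16)
        self._users[user] = (salt, hash_password(password, salt))
        self._permissions[user] = set(datasets)

    def read(self, user, password, dataset):
        """Return True if the user is authenticated and allowed to read."""
        salt, stored = self._users.get(user, (b"", b""))
        authenticated = user in self._users and hmac.compare_digest(
            stored, hash_password(password, salt)
        )
        allowed = authenticated and dataset in self._permissions[user]
        # Record every attempt, successful or not.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.access_log.append((timestamp, user, dataset, allowed))
        return allowed


store = DataStore()
store.add_user("analyst", "s3cret", {"sales"})
print(store.read("analyst", "s3cret", "sales"))    # permitted dataset
print(store.read("analyst", "s3cret", "payroll"))  # dataset not permitted
print(store.read("analyst", "wrong", "sales"))     # bad password
print(len(store.access_log))
```

Logging denied attempts as well as successful ones is a deliberate choice here, since failed access is often the more interesting signal when reviewing the history.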
5 V’s of Big Data:
