UNIT-1
INTRODUCTION TO BIG DATA
A. BHANU PRASAD
Associate Professor, Dept. of CSE
Theory Contents Contd..
THE BIG DATA TECHNOLOGY LANDSCAPE: NoSQL (Not Only SQL),
Hadoop, Introduction to Hadoop, Introducing Hadoop, Why Hadoop?, Why
not RDBMS?, RDBMS versus Hadoop, Distributed Computing Challenges,
History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop
Distributors, HDFS (Hadoop Distributed File System), Processing Data with
Hadoop, Managing Resources and Applications with Hadoop YARN (Yet
Another Resource Negotiator), Interacting with Hadoop Ecosystem.
BOOKS
TEXT BOOKS:
1. Big Data and Analytics, Seema Acharya and Subhashini Chellappan, 2nd Edition, Wiley India.
REFERENCE BOOKS:
1. Big Data Now, O'Reilly Media, 2nd Edition, 2012.
Sources of Structured Data
• Relational databases: Oracle, IBM DB2, Microsoft SQL Server, MySQL (open source).
• Online transaction processing (OLTP): transactional/operational data from day-to-day
business activities, e.g., online banking and online shopping.
• Uses simple queries.
• Requires read/write operations.
• Size is smaller: 100 MB to 10 GB.
Ease of working with Structured data
1) Insert/delete/update: The Data Manipulation Language (DML)
operations provide the required ease with data input, storage, access,
process, analysis, etc.
2) Security: Encryption and tokenization solutions are available for the
security of information throughout its lifecycle. Only authorized
individuals are able to decrypt and view sensitive information.
3) Indexing: An index is a data structure that speeds up the data
retrieval operations.
4) Scalability: The storage and processing capabilities of the traditional
RDBMS can be easily scaled up by increasing the horsepower of the
database server.
5) Transactional processing: RDBMS supports the Atomicity,
Consistency, Isolation, and Durability (ACID) properties of
transactions. Atomicity: either a transaction happens in its entirety or
none of it at all. Consistency: before and after the execution of a
transaction, the database must be in a consistent state. Isolation:
concurrent transactions execute without interfering with one another.
Durability: all changes made to the database during a transaction are
permanent.
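To make points 1), 3), and 5) concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, names, and amounts are hypothetical, and a production system would use a server RDBMS rather than SQLite.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use a server RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")

# 1) Insert/delete/update: DML makes data input and access straightforward.
cur.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("Alice", 500.0))
cur.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("Bob", 300.0))

# 3) Indexing: an index speeds up retrieval on the indexed column.
cur.execute("CREATE INDEX idx_owner ON accounts (owner)")

# 5) Transactional processing: the transfer either commits in its entirety or not at all.
try:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE owner = 'Alice'")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE owner = 'Bob'")
    conn.commit()    # Durability: committed changes are permanent.
except sqlite3.Error:
    conn.rollback()  # Atomicity: on failure, none of the changes apply.

print(cur.execute("SELECT owner, balance FROM accounts").fetchall())
```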
1.1.2 Semi-Structured Data
Semi-structured data is also referred to as self-describing structure. It
has the following features:
1) It does not conform to the data models that one typically associates
with relational databases or any other form of data tables.
2) It uses tags to segregate semantic elements.
3) Tags are also used to enforce hierarchies of records and fields within
data.
4) There is no separation between the data and the schema. The amount
of structure used is dictated by the purpose at hand.
5) In semi-structured data, entities belonging to the same class and
grouped together need not necessarily have the same set of attributes.
Even if they do have the same set of attributes, the order of the
attributes may not be the same; for all practical purposes, the order is
not important either.
Sources of Semi-Structured Data
Amongst the sources for semi-structured data, the front runners are
“XML” and “JSON” as depicted in Fig.
1) XML: eXtensible Markup Language (XML) is hugely popularized by
web services developed utilizing the Simple Object Access Protocol
(SOAP) principles.
2) JSON: JavaScript Object Notation (JSON) is used to transmit data
between a server and a web application. JSON is popularized by web
services developed utilizing Representational State Transfer
(REST), an architectural style for creating scalable web services.
MongoDB (an open-source, distributed, NoSQL, document-oriented
database) and Couchbase (originally known as Membase; also an
open-source, distributed, NoSQL, document-oriented database) store
data natively in JSON format.
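A small sketch of how self-describing JSON behaves, using Python's standard json module; the records below are hypothetical.

```python
import json

# Self-describing records: the tags (field names) travel with the data, and two
# records of the same class need not share the same set of attributes.
records = '''
[
  {"name": "Asha", "email": "asha@example.com", "phone": "555-0100"},
  {"name": "Ravi", "dept": "CSE"}
]
'''

for person in json.loads(records):
    # Missing attributes are simply absent; no fixed schema is enforced.
    print(person.get("name"), "-", person.get("email", "no email on record"))
```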
1.1.3 Unstructured Data
Unstructured data does not conform to any pre-defined data model.
The structure of unstructured data is quite unpredictable. Various
sources of unstructured data are depicted in Figure 1.8.
Issues with “Unstructured” Data
Although unstructured data is known NOT to conform to a pre-defined
data model or be organized in a pre-defined manner, there are
instances wherein the structure of the data (placed in the unstructured
category) can still be implied.
As the figure indicates, there could be a few other reasons for placing
data in the unstructured category despite it having some structure or
even being highly structured.
How to Deal with Unstructured Data?
The following techniques are used to find patterns in or interpret
unstructured data:
1) Data mining: We use methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems to
unearth consistent patterns in large data sets and/or systematic
relationships between variables. It is the analysis step of the
“knowledge discovery in databases” process. A few popular data
mining algorithms are as follows:
Association rule mining: It is also called “market basket analysis”
or “affinity analysis”. It asks: when you buy a product, what other
product are you likely to purchase with it?
Regression analysis: It helps to predict the relationship between
two variables (a minimal sketch follows this list). The variable whose
value needs to be predicted is called the dependent variable, and the
variables used to predict it are referred to as the independent variables.
Collaborative filtering: It is about predicting a user’s preference or
preferences based on the preferences of a group of users.
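As promised above, a minimal sketch of regression analysis in pure Python; the data points are hypothetical, and real analyses would use a statistics library.

```python
# Ordinary least squares for simple linear regression: predict the dependent
# variable y from the independent variable x by fitting y = a + b*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # independent variable (hypothetical)
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # dependent variable (hypothetical)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = cov(x, y) / var(x); intercept a = mean_y - b * mean_x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

print(f"fitted model: y = {a:.2f} + {b:.2f} * x")
print("prediction for x = 6:", round(a + b * 6, 2))
```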
Deal with Unstructured Data Contd..
2) Text analytics or text mining: Compared to the structured data stored
in relational databases, text is largely unstructured, amorphous, and
difficult to deal with algorithmically. Text mining is the process of
gleaning high quality and meaningful information (through devising
of patterns and trends by means of statistical pattern learning) from
text. It includes tasks such as text categorization, text clustering,
sentiment analysis, concept/entity extraction, etc.
3) Natural language processing (NLP): It is related to the area of human
computer interaction. It is about enabling computers to understand
human or natural language input.
4) Noisy text analytics: It is the process of extracting structured or
semi-structured information from noisy unstructured data such as
chats, blogs, wikis, emails, message boards, text messages, etc. Noisy
unstructured data usually comprises one or more of the following:
spelling mistakes, abbreviations, acronyms, non-standard words,
missing punctuation, missing letter case, filler words such as “uh”,
“um”, etc.
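A hedged sketch of one small step of noisy text analytics: regex-based normalization in Python. The abbreviation map and the sample message are hypothetical; real systems use far richer dictionaries and language models.

```python
import re

# Hypothetical expansion map for common chat abbreviations.
ABBREVIATIONS = {"pls": "please", "msg": "message", "u": "you", "r": "are", "ur": "your"}

def clean_noisy_text(text: str) -> str:
    text = text.lower()                               # normalize letter case
    text = re.sub(r"\bu+h+m*\b", "", text)            # drop fillers like "uh", "uhm"
    words = re.findall(r"[a-z']+", text)              # crude tokenization
    words = [ABBREVIATIONS.get(w, w) for w in words]  # expand abbreviations
    return " ".join(words)

print(clean_noisy_text("Uh pls send me ur msg, u r late!!"))
# -> please send me your message you are late
```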
Deal with Unstructured Data Contd..
5) Manual tagging with metadata: This is about tagging data manually
with adequate metadata to provide the requisite semantics to
understand unstructured data.
6) Part-of-speech tagging: It is also called POS or POST or grammatical
tagging. It is the process of reading text and tagging each word in a
sentence as belonging to a particular part of speech such as “noun”,
“verb”, “adjective”, etc.
7) Unstructured Information Management Architecture (UIMA): It is an
open-source platform from IBM. It is used for real-time content
analytics: processing text and other unstructured data to find latent
meaning and relevant relationships buried therein.
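For experimentation, POS tagging is available in the NLTK library for Python; the snippet below assumes NLTK is installed and that the tokenizer/tagger resource names match your NLTK version (they have changed across releases).

```python
import nltk

# One-time model downloads (resource names may differ in newer NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Big data demands new tools for storage and analysis."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Each word is tagged with its part of speech, e.g. ('demands', 'VBZ') for a verb.
```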
1.2 Characteristics of Data
Data has three key characteristics:
1) Composition: The composition of data deals with the structure of data,
that is, the sources of data, the granularity, the types, and the nature of
data as to whether it is static or real-time streaming.
2) Condition: The condition of data deals with the state of data, that is,
“Can one use this data as is for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
3) Context: The context of data deals with “Where has this data been
generated?” “Why was this data generated?” “How sensitive is this data?”
“What are the events associated with this data?” and so on.
Small data (data as it existed prior to the big data revolution) is about
certainty. It is about fairly known data sources; it is about no major
changes to the composition or context of data.
Big data is about complexity: complexity in terms of multiple and
unknown datasets, exploding volume, the speed at which the data is
being generated and needs to be processed, and the variety of data
(internal or external, behavioral or social) that is being generated.
1.3 Evolution of BIG DATA
1970s and before was the era of mainframes.
The data was essentially primitive and structured. Relational
databases evolved in the 1980s and 1990s. That era was one of
data-intensive applications.
The World Wide Web (WWW) and the Internet of Things (IoT) have
led to an onslaught of structured, unstructured, and multimedia data
as shown in Table 2.1.
1.4 Definition of BIG DATA
Different sources defined Big data in different ways:
Big data is high-volume, high-velocity, and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making. (or)
Big data is anything beyond the human and technical infrastructure
needed to support storage, processing, and analysis. (or)
Big data is the term for a collection of datasets so large and complex
that it becomes difficult to process them using database system tools
and traditional processing applications.
Today’s BIG may be tomorrow’s NORMAL.
The 3Vs (Volume, Velocity, Variety) concept was proposed by Gartner
analyst Doug Laney.
There is no explicit definition of how big a dataset should be for it to
be considered “big data.” Big data is data that is just too big, moves too
fast, or does not fit the structures of typical database systems. The data
changes are highly dynamic.
1.5 Challenges with Big Data
Following are a few challenges with big data:
1) Data Generation: Data today is growing at an exponential rate. The
key questions here are: “Will all this data be useful for analysis?”, “Do
we work with all this data or a subset of it?”, “How will we separate
the knowledge from the noise?”, etc.
2) Cloud computing and virtualization: Cloud computing is the answer
to managing infrastructure for big data as far as cost-efficiency,
elasticity, and easy upgrading/downgrading is concerned. This
further complicates the decision to host big data solutions outside the
enterprise.
3) Retention: How long should one retain this data? Some data is
useful for making long-term decisions, whereas in a few cases the data
may become irrelevant and obsolete just a few hours after having
been generated.
4) Lack of talent: There are a lot of big data projects in major
organizations, but there is a shortage of skilled professionals with the
high level of proficiency in data science that is vital to
implementing big data solutions.
Challenges with Big Data Contd..
5) Data visualization: Data visualization is becoming popular as a
separate discipline, and we are short of business visualization experts
by quite a number.
6) Data quality: The problem here is the veracity of the data: the data
can be messy, inconsistent, and incomplete.
7) Discovery: Analyzing petabytes of data using extremely powerful
algorithms to find patterns and insights is very difficult.
8) Storage: The more data an organization has, the more complex the
problem of managing it can become. The question that arises here is
“Where to store it?” We need a storage system that can easily scale
up or down on demand.
9) Analytics: In the case of Big Data, most of the time we are unaware of
the kind of data we are dealing with, so analyzing that data is even
more difficult.
10) Security: Since the data is huge in size, keeping it secure is another
challenge. It includes user authentication, restricting access on a
per-user basis, recording data access histories, proper use of data
encryption, etc.
1.6 What is Big Data?
Big data is data that is big in volume, velocity, and variety. Refer
Figure 2.5.
1) Volume
Volume refers to the ‘amount of data’, which is growing day by day at a
very fast pace; whether data can actually be considered big data or not
depends on its volume.
Data is rapidly increasing: GB, TB, PB, and beyond.
Sources of big data
1) Typical internal data sources: Data present within an organization’s
firewall. It is as follows:
• Data storage: File systems, SQL (RDBMSs – Oracle, MS SQL
Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB,
Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients’ health records, students’
admission records, students’ assessment records, and so on.
Sources of big data Contd..
2) External data sources: Data residing outside an organization’s firewall.
It is as follows:
• Public Web: Wikipedia, weather, regulatory, compliance, census,
etc.
3) Both (internal + external data sources)
• Sensor data: Car sensors, smart electric meters, office buildings, air
conditioning units, refrigerators, and so on.
• Machine log data: Event logs, application logs, Business process
logs, audit logs, clickstream data, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube,
Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: Audio, video, image, podcast, etc.
• Docs: Comma separated value (CSV), Word Documents, PDF, XLS,
PPT, and so on.
What is Big Data? Contd..
2) Velocity: Refers to the speed of generation of data: how fast the data
is generated and processed to meet demands.
We have moved from the days of batch processing (remember our payroll
applications) to real-time processing.
Batch → Periodic → Near real time → Real-time processing
In 1990: hard disk: 1 GB-20 GB; RAM: 28 MB; reading capacity: 10 kbps.
3) Variety: Variety deals with a wide range of data types and sources of
data. There are three categories:
1) Structured data: From traditional transaction processing systems and
RDBMS, etc.
2) Semi-structured data: For example Hyper Text Markup Language
(HTML), eXtensible Markup Language (XML).
3) Unstructured data: For example unstructured text documents, audios,
videos, emails, photos, PDFs, social media, etc.
1.7 Other Characteristics of Data Which are
not Definitional Traits of Big Data
There are yet other characteristics of data which are not necessarily the
definitional traits of big data. Few of these are listed as follows:
1) Veracity and validity: Veracity refers to biases, noise, and
abnormality in data. The key question here is: “Is all the data that is
being stored, mined, and analyzed meaningful and pertinent to the
problem under consideration?” Validity refers to the accuracy and
correctness of the data. Any data picked up for analysis needs to be
accurate; this is not true of big data alone.
2) Volatility: Volatility of data deals with how long the data is valid
and how long it should be stored. There is some data that is required
for long-term decisions and remains valid for longer periods of time.
However, there are also pieces of data that become obsolete
minutes after their generation.
3) Variability: Data flows can be highly inconsistent, with periodic
peaks. The challenge is to handle and manage such data effectively.
4) Value: Is analyzing big data adding to the benefits of the
organizations doing it?
1.8 Why Big Data?
The more data we have for analysis, the greater will be the analytical
accuracy and also the greater would be the confidence in our decisions
based on these analytical findings.
This will entail a greater positive impact in terms of enhancing
operational efficiencies, reducing cost and time, innovating on new
products and services, and optimizing existing services. Refer Figure
2.8.
1.9 Are We Just an Information Consumer or
Do we also Produce Information?
There are several instances every day where you generate data:
1) A text message sent to confirm attendance at the promotion bash.
2) Use of a credit card to pay for gas/fuel at the gas station.
3) The Point of Sale system at Archie’s where your transaction gets
recorded.
4) Photographs and posts on social networking sites.
5) Likes and comments on your posts.
1.10 Traditional Business Intelligence (BI) versus
Big Data
Some of the differences between traditional BI and big data:
1) In a traditional BI environment, all the enterprise’s data is housed
in a central database server that scales vertically; in a big data
environment, data resides in a distributed file system that scales
horizontally.
2) In traditional BI, data is generally analyzed in an offline mode; in
big data, it is analyzed in both real-time and offline modes.
3) Traditional BI is about structured data, and the data is taken to the
processing functions (move data to code); big data is about variety
(structured, semi-structured, and unstructured data), and the
processing functions are taken to the data (move code to data).
1.11 A Typical Data Warehouse Environment
In a typical Data Warehouse (DW) environment, operational or
transactional or day-to-day business data is gathered from Enterprise
Resource Planning (ERP) systems, Customer Relationship
Management (CRM), legacy systems, and several third party
applications.
The data from these sources may differ in format [the data could have
been housed in any RDBMS such as Oracle, MS SQL Server, DB2, MySQL,
Teradata, and so on, or in spreadsheets (.xls, .xlsx, etc.), .csv, or .txt
files].
Data may come from data sources located in the same geography or
different geographies. This data is then integrated, cleaned up,
transformed, and standardized through the process of Extraction,
Transformation, and Loading (ETL). The transformed data is then
loaded into the enterprise data warehouse (available at the enterprise
level) or data marts (available at the business unit/ functional unit or
business process level).
A host of market-leading business intelligence and analytics tools are
then used to enable decision making from the use of ad-hoc queries,
SQL, enterprise dashboards, data mining, etc. Refer Figure 2.9.
[Figure 2.9: Data is gathered from different sources and in different
formats; it passes through Extraction, Transformation, and Loading
(ETL); business intelligence and analytics tools then consume it.]
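A minimal ETL sketch in Python, with sqlite3 standing in for the enterprise data warehouse; the file name, column names, and cleanup rules are hypothetical.

```python
import csv
import sqlite3

# Extract: read rows from a hypothetical CSV export of a source system
# (assumed columns: customer, region, amount).
with open("crm_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean up and standardize formats that differ across sources.
cleaned = []
for r in rows:
    cleaned.append((
        r["customer"].strip().title(),   # normalize name casing
        r["region"].strip().upper(),     # standardize region codes
        round(float(r["amount"]), 2),    # enforce a numeric amount
    ))

# Load: write the transformed rows into a warehouse table.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, region TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
dw.commit()
```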
1.12 A Typical Hadoop Environment
The Hadoop environment is very different from the data warehouse environment.
As is fairly obvious from Figure 2.10, the data sources are quite disparate: from
web logs to images, audio, and video, to social media data, to various docs,
PDFs, etc.
Here the data in focus is not just the data within the company’s firewall but
also data residing outside the company’s firewall.
This data is placed in Hadoop Distributed File System (HDFS). If need be, this
can be repopulated back to operational systems or fed to the enterprise data
warehouse or data marts or Operational Data Store (ODS) to be picked for
further processing and analysis.
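As a sketch of processing data that has been placed in HDFS, here is the classic word-count pair of scripts for Hadoop Streaming, written in Python; the input/output paths and job wiring are assumptions, though Hadoop Streaming itself is a standard Hadoop facility.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming pipes each input line to stdin and expects
# tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Streaming delivers the mapper output sorted by key, so all
# counts for one word arrive together and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The two scripts would typically be submitted with the hadoop-streaming jar shipped with the distribution, along the lines of `hadoop jar hadoop-streaming-*.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py` (the exact jar path varies by distributor).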
1.13 What is New Today?
Coexistence of Big Data and Data Warehouse
A few companies are comfortable working with their incumbent data
warehouse for standard BI and analytics reporting, for example the quarterly
sales report, customer dashboard, etc.
Hadoop, however, brings to the table the power to perform different types of
analysis on different types of data.
The same operational systems that were engaged in powering the data
warehouse can also populate the big data environment when they are needed
for computation-rich processing or for raw data exploration.
We cannot ignore the powerful analytics capability of Hadoop or the
revolutionary developments in RDBMS. So, the need of the hour is to have
both the data warehouse and Hadoop co-exist in today’s environment.
1.14 What is changing in the Realms of Big Data?
Three very important reasons why companies should compulsorily
consider leveraging big data:
1) Competitive advantage: The most important resource with any
organization today is their data. What they do with it will determine
their fate in the market.
2) Decision making: Decision making has shifted from the hands of the
elite few to the empowered many. Good decisions play a significant
role in furthering customer engagement, reducing operating margins
in retail, cutting cost and other expenditures in the health sector.
3) Value of data: The value of data continues to see a steep rise. As data
is the all-important resource, it is time to look at newer architectures,
tools, and practices to leverage it.