
Unit 1 (Big Data)

This document discusses different types of digital data: 1) Structured data which follows a predefined data model and schema and is stored in databases. About 10% of organizational data falls in this category. 2) Semi-structured data which has some structure but does not fully conform to a data model, like XML. Another 10% of data is semi-structured. 3) Unstructured data like documents, emails and presentations which do not have a clear data model and accounts for 80-90% of organizational data. The chapter introduces these data types and their sources and issues with unstructured data.

Uploaded by

Tejus R S
Copyright
© All Rights Reserved

CHAPTER 1

Types of Digital Data

BRIEF CONTENTS
What's in Store?
Classification of Digital Data
Structured Data
Sources of Structured Data
Ease of Working with Structured Data
Semi-Structured Data
Sources of Semi-Structured Data
Unstructured Data
Issues with "Unstructured" Data
How to Deal with Unstructured Data

"In God we trust, all others must bring data."


- W. Edwards Deming

WHAT'S IN STORE?
Irrespective of the size of the enterprise (big or small), data continues to be a precious and irreplaceable asset. Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise. Data is present in homogeneous sources as well as in heterogeneous sources. The need of the hour is to understand, manage, process, and take the data for analysis to draw valuable insights.
Data → Information
Information → Insights
This chapter is a "must read" for first-time learners interested in understanding the role of data in business intelligence and business analysis and businesses at large. This chapter will introduce you to the various formats of digital data (structured, semi-structured, and unstructured data), the sources of each format of data, the issues with the terminology of unstructured data, etc.
We suggest you refer to the learning resources suggested at the end of this chapter and also attempt the exercises to get a grip on this topic. We suggest you make your own notes/bookmarks while reading through the chapter.

1.1 CLASSIFICATION OF DIGITAL DATA
As depicted in Figure 1.1, digital data can be broadly classified into structured, semi-structured, and unstructured data.

1. Unstructured data: This is the data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80-90% of the data of an organization is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, researches, white papers, body of an email, etc.
2. Semi-structured data: This is the data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program; for example, emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
3. Structured data: This is the data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.

Ever since the 1980s most of the enterprise data has been stored in relational databases complete with rows/records/tuples, columns/attributes/fields, primary keys, foreign keys, etc. Over a period of time Relational Database Management System (RDBMS) matured and the RDBMS, as they are available today, have become more robust, cost-effective, and efficient. We have grown comfortable working with RDBMS - the storage, retrieval, and management of data has been immensely simplified. The data held in RDBMS is typically structured data. However, with the Internet connecting the world, data that existed beyond one's enterprise started to become an integral part of daily transactions. This data grew by leaps and bounds so much so that it became difficult for the enterprises to ignore it. All of this data was not structured. A lot of it was unstructured. In fact, Gartner estimates that almost 80% of data generated in any enterprise today is unstructured data. Roughly 10% of data falls in each of the structured and semi-structured categories. Refer Figure 1.2.

1.1.1 Structured Data


Let us begin with a very basic question - When do we say that the data is structured? The simple answer is when data conforms to a pre-defined schema/structure we say it is structured data.

[Figure 1.1: digital data classified into structured data, semi-structured data, and unstructured data.]

Figure 1.2 Approximate percentage distribution of digital data: structured data 10%, semi-structured data 10%, unstructured data 80%.

Think structured data, and think data model - a model of the types of business data that we intend to store, process, and access. Let us discuss this in the context of an RDBMS. Most of the structured data is held in RDBMS. An RDBMS conforms to the relational data model wherein the data is stored in rows/columns. Refer Table 1.1.
The number of rows/records/tuples in a relation is called the cardinality of a relation and the number of columns is referred to as the degree of a relation.
The first step is the design of a relation/table, the fields/columns to store the data, the type of data that will be stored [number (integer or real), alphabets, date, Boolean, etc.]. Next we think of the constraints that we would like our data to conform to (constraints such as UNIQUE values in the column, NOT NULL values in the column, a business constraint such as the value held in the column should not drop below 50, the set of permissible values in the column such as the column should accept only "CS", "IS", "MS", etc., as input).
To explain further, let us design a table/relation structure to store the details of the employees of an enterprise. Table 1.2 shows the structure/schema of an "Employee" table in a RDBMS such as Oracle.
Table 1.2 is an example of a good structured table (complete with table name, meaningful column names with data types, data length, and the relevant constraints) with absolute adherence to relational data model.

Table 1.1 A relation/table with rows and columns

         Column 1    Column 2    Column 3    Column 4
Row 1

Table 1.2 Schema of an "Employee" table in a RDBMS such as Oracle

Column Name    Data Type      Constraints
EmpNo          Varchar(10)    PRIMARY KEY
EmpName        Varchar(50)
Designation    Varchar(25)    NOT NULL
DeptNo         Varchar(5)
ContactNo      Varchar(10)    NOT NULL
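The schema in Table 1.2 can be tried out in a few lines. The sketch below is illustrative only: it uses Python's built-in sqlite3 module as a stand-in for the Oracle RDBMS named in the text, and the helper name try_insert is an assumption introduced here. SQLite accepts the Varchar(n) declarations but does not enforce the stated lengths, so only the PRIMARY KEY and NOT NULL constraints are exercised.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for the RDBMS named in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        EmpNo       VARCHAR(10) PRIMARY KEY,
        EmpName     VARCHAR(50),
        Designation VARCHAR(25) NOT NULL,
        DeptNo      VARCHAR(5),
        ContactNo   VARCHAR(10) NOT NULL
    )
    """
)

# A record that conforms to the schema is accepted.
conn.execute(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    ("E101", "Allen", "Software Engineer", "D1", "0999999999"),
)


def try_insert(row):
    """Hypothetical helper: True if the row satisfies the constraints, else False."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Violates NOT NULL on Designation.
missing_designation = try_insert(("E102", "Simon", None, "D1", "0777777777"))
# Violates PRIMARY KEY uniqueness, because E101 already exists.
duplicate_key = try_insert(("E101", "Simon", "Consultant", "D1", "0777777777"))

row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(missing_designation, duplicate_key, row_count)
```

Both rejected rows leave the table unchanged, which is the practical meaning of data "conforming to" a pre-defined schema.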
It goes without saying that each record in the table will have exactly the same structure. Let us take a look at a few records in Table 1.3.

Table 1.3 Sample records in the "Employee" table

EmpNo    EmpName    Designation          DeptNo    ContactNo
E101     Allen      Software Engineer    D1        0999999999
E102     Simon      Consultant           D1        0777777777
The tables in an RDBMS can also be related. For example, the above "Employee" table is related to the "Department" table on the basis of the common column, "DeptNo". It is not mandatory for the two tables that are related to have exactly the same name for the common column. On the contrary, the two tables are related on the basis of values held within the column, "DeptNo". Given in Figure 1.3 is a depiction of referential integrity constraint (primary - foreign key) with the "Department" table being the referenced table and "Employee" table being the referencing table.
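A minimal sketch of the referential integrity constraint described above, again using Python's sqlite3 module as an assumed stand-in for a production RDBMS. The department name "Engineering" and the employee "Maya" are made-up values for illustration. The last lines compute the degree and cardinality of the "Employee" relation as defined earlier in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Referenced table.
conn.execute(
    "CREATE TABLE Department (DeptNo VARCHAR(5) PRIMARY KEY, DeptName VARCHAR(50))"
)
# Referencing table: DeptNo must match a value held in Department.DeptNo.
conn.execute(
    """
    CREATE TABLE Employee (
        EmpNo   VARCHAR(10) PRIMARY KEY,
        EmpName VARCHAR(50),
        DeptNo  VARCHAR(5) REFERENCES Department (DeptNo)
    )
    """
)

conn.execute("INSERT INTO Department VALUES ('D1', 'Engineering')")  # name is hypothetical
conn.execute("INSERT INTO Employee VALUES ('E101', 'Allen', 'D1')")
conn.execute("INSERT INTO Employee VALUES ('E102', 'Simon', 'D1')")

# An employee pointing at a department that does not exist is rejected.
try:
    conn.execute("INSERT INTO Employee VALUES ('E103', 'Maya', 'D9')")
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False

cursor = conn.execute("SELECT * FROM Employee")
degree = len(cursor.description)      # number of columns in the relation
cardinality = len(cursor.fetchall())  # number of rows in the relation
print(orphan_accepted, degree, cardinality)
```

The foreign key is declared on the values of the column, so the two columns could carry different names and the relationship would still hold.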

1.1.1.1 Sources of Structured Data


If your data is highly structured, one can look at leveraging any of the available RDBMS [Oracle Corp. - Oracle, IBM - DB2, Microsoft - Microsoft SQL Server, EMC - Greenplum, Teradata - Teradata, MySQL (open source), PostgreSQL (advanced open source), etc.] to house it. Refer Figure 1.4. These databases are typically used to hold transaction/operational data generated and collected by day-to-day business activities. In other words, the data of the On-Line Transaction Processing (OLTP) systems are generally quite structured.

Figure 1.3 Relationship between "Employee" and "Department" tables.
Department: DeptNo, DeptName, DeptLocation, DeptEmpStrength
Employee: EmpNo, EmpName, EmpDesignation, DeptNo, EmpContactNo

Figure 1.4 Sources of structured data: databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.; spreadsheets; OLTP systems.

1.1.1.2 Ease of Working with Structured Data


Structured data provides the ease of working with it. Refer Figure 1.5. The ease is with respect to the following:

1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, process, analysis, etc.
2. Security: How does one ensure the security of information? There are available staunch encryption and tokenization solutions to warrant the security of information throughout its lifecycle. Organizations are able to retain control and maintain compliance adherence by ensuring that only authorized individuals are able to decrypt and view sensitive information.
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue in search operation are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for Atomicity, Consistency, Isolation, and Durability (ACID) properties of transaction. Given next is a quick explanation of the ACID properties:
Atomicity: A transaction is atomic, which means that either it happens in its entirety or none of it at all.
Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.
Isolation: The resource allocation to the transaction happens such that the transaction gets the impression that it is the only transaction happening in isolation.
Durability: All changes made to the database during a transaction are permanent and that accounts for the durability of the transaction.
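Atomicity is the easiest of the ACID properties to observe directly. The sketch below is an illustration under stated assumptions: it uses Python's sqlite3 module, and the "Account" table, its CHECK constraint, and the transfer function are hypothetical names invented for this example. A transfer consists of two updates; when the second one fails, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: balances are not allowed to go negative.
conn.execute(
    "CREATE TABLE Account (Name TEXT PRIMARY KEY, Balance INTEGER CHECK (Balance >= 0))"
)
conn.execute("INSERT INTO Account VALUES ('A', 100), ('B', 50)")
conn.commit()


def transfer(amount):
    """Move `amount` from account A to account B as one transaction."""
    try:
        # The connection context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Name = 'B'", (amount,)
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Name = 'A'", (amount,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # both updates succeed
failed = transfer(500)  # the credit succeeds, the debit violates the CHECK, both are undone
balances = dict(conn.execute("SELECT Name, Balance FROM Account"))
print(ok, failed, balances)
```

After the failed transfer the balances are exactly what the first transfer left behind: the transaction happened in its entirety or not at all.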

1.1.2 Semi-Structured Data


Semi-structured data is also referred to as self-describing structure. Refer Figure 1.6. It has the following features:

1. It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of records and fields within data.
4. There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand.
5. In semi-structured data, entities belonging to the same class and also grouped together need not necessarily have the same set of attributes. And if at all, they have the same set of attributes, the order of attributes may not be similar and for all practical purposes it is not important as well.

Figure 1.5 Ease of working with structured data: insert/update/delete, security, indexing/searching, scalability, transaction processing.

Figure 1.6 Characteristics of semi-structured data: inconsistent structure; self-describing (label/value pairs); often schema information is blended with data values; data objects may have different attributes not known beforehand.
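The features listed above can be seen in a small sketch. The two employee records below are made-up examples (the "Skills" attribute in particular is an assumption): they belong to the same class, yet they carry different attributes in a different order, and each record describes itself through its own labels.

```python
import json

# Two records of the same class; attribute sets and attribute order differ.
records_text = """
[
  {"EmpNo": "E101", "EmpName": "Allen", "Designation": "Software Engineer"},
  {"EmpName": "Simon", "EmpNo": "E102", "Skills": ["Hadoop", "Hive"]}
]
"""

records = json.loads(records_text)

# Each record is self-describing: its keys are its schema.
attribute_sets = [set(record) for record in records]
common = attribute_sets[0] & attribute_sets[1]
all_attributes = sorted(set().union(*attribute_sets))

print(sorted(common))
print(all_attributes)
```

A relational table would force both records into one fixed set of columns; here the structure travels with the data.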

1.1.2.1 Sources of Semi-Structured Data


Amongst the sources for semi-structured data, the front runners are "XML" and "JSON" as depicted in Figure 1.7.
1. XML: eXtensible Markup Language (XML) is hugely popularized by web services developed utilizing the Simple Object Access Protocol (SOAP) principles.
2. JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application. JSON is popularized by web services developed utilizing the Representational State Transfer (REST) - an architecture style for creating scalable web services. MongoDB (open-source, distributed, NoSQL, document-oriented database) and Couchbase (originally known as Membase, open-source, distributed, NoSQL, document-oriented database) store data natively in JSON format.
An example of HTML is as follows:
<HTML>
<HEAD>
<TITLE>Place your title here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
<HR>
<a href="http://bigdatauniversity.com">link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]">
[email protected]</a>.
<P>a new paragraph.
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>

Figure 1.7 Sources of semi-structured data: XML (eXtensible Markup Language), other markup languages, JSON (Java Script Object Notation).

Sample JSON document

{
_id: 9,
BookTitle: "Fundamentals of Business Analytics",
AuthorName: "Seema Acharya",
Publisher: "Wiley India",
YearofPublication: "2011"
}
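As a rough illustration, the sample document can be read with Python's standard library. Two assumptions are made here: the keys are wrapped in double quotes, since strict JSON requires quoted keys, and the XML rendering of the same record (the element and attribute names) is one possible choice made for this example rather than a format prescribed by the text.

```python
import json
import xml.etree.ElementTree as ET

# The sample document with quoted keys, as strict JSON requires.
json_text = """
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
"""
book = json.loads(json_text)

# The same record expressed with tags; the element names are an illustrative choice.
xml_text = """
<Book id="9">
  <BookTitle>Fundamentals of Business Analytics</BookTitle>
  <AuthorName>Seema Acharya</AuthorName>
  <Publisher>Wiley India</Publisher>
  <YearofPublication>2011</YearofPublication>
</Book>
"""
root = ET.fromstring(xml_text.strip())
book_from_xml = {child.tag: child.text for child in root}

print(book["BookTitle"], "|", book_from_xml["Publisher"], "|", root.get("id"))
```

In both formats the labels travel with the values, which is why a parser can recover the fields without a separately declared schema.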

1.1.3 Unstructured Data


Unstructured data does not conform to any pre-defined data model. In fact, to explain things a little more let us take a closer look at the various kinds of text available and the possible structure associated with it. As can be seen from the examples quoted in Table 1.4, the structure is quite unpredictable. In Figure 1.8 we look at the other sources of unstructured data.

1.1.3.1 Issues with "Unstructured" Data


Although unstructured data is known NOT to conform to a pre-defined data model or be organized in a pre-defined manner, there are incidents wherein the structure of the data (placed in the unstructured category) can still be implied. As mentioned in Figure 1.9, there could be few other reasons behind placing data in the unstructured category despite it having some structure or being highly structured.

There are situations where people argue that a text file should be in the category of semi-structured data and not unstructured data. Let us look at where they are coming from. Well, the text file does have a name, one can easily look at the properties to get information such as the owner of the file, the date on which the file was created, the size of the file, etc. Okay, we do have little metadata. But when it comes to analysis, we are more concerned with the content of the text file rather than the name or any of the other properties. In fact, the other properties may not in any way contribute to the processing/analysis task at hand. Therefore it is fair to place it in the unstructured data category.

Table 1.4 Few examples of disparate unstructured data

Twitter message    Feeling miffed. Victim of twishing.
Facebook post      LOL. C ya. BFN
Log files          127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I; Nav)"
Email              Hey Joan, possible to send across the first cut on the Hadoop chapter by Friday EOD or maybe we can meet up over a cup of coffee, Best regards, Tom

Figure 1.8 Sources of unstructured data: web pages, images, free-form text, audios, videos, body of email, text messages, chats, social media data, Word document.

Figure 1.9 Issues with terminology of unstructured data: structure can be implied despite not being formally defined; data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand; data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
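The log-file entry in Table 1.4 is a good case of structure that can be implied even though no schema is declared. The sketch below is one possible way to recover it: the regular expression and the field names are assumptions written for this single sample line, not a general-purpose log parser, and the referrer URL is written in its plain form.

```python
import re

# Sample line modelled on the log-file entry in Table 1.4.
log_line = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I; Nav)"'
)

# Field names are illustrative labels for the positions in the line.
pattern = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = pattern.match(log_line)
fields = match.groupdict()

for name, value in fields.items():
    print(f"{name}: {value}")
```

Once the implied structure is made explicit, each entry becomes a record with named fields that could be loaded into a table. The tweet, the Facebook post, and the email in the same table offer no such regularity.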
1.1.3.2 How to Deal with Unstructured Data?
Today, unstructured data constitutes approximately 80% of the data that is being generated in any enterprise. The balance is clearly shifting in favor of unstructured data as shown in Figure 1.10. It is such a big percentage that it cannot be ignored. Figure 1.11 states a few ways of dealing with unstructured data.

Figure 1.10 Unstructured data clearly constitutes a major percentage of enterprise data.

Figure 1.11 Dealing with unstructured data: data mining, natural language processing (NLP), text analytics, noisy text analytics.

The following techniques are used to find patterns in or interpret unstructured data:
1. Data mining: First, we deal with large data sets. Second, we use methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in large data sets and/or systematic relationships between variables. It is the analysis step of the "knowledge discovery in databases" process.
Few popular data mining algorithms are as follows:
Association rule mining: It is also called "market basket analysis" or "affinity analysis". It is used to determine "What goes with what?" It is about when you buy a product, what is the other product that you are likely to purchase with it. For example, if you pick up bread from the grocery, are you likely to pick eggs or cheese to go with it.
Regression analysis: It helps to predict the relationship between two variables. The variable whose value needs to be predicted is called the dependent variable and the variables which are used to predict the value are referred to as the independent variables.

PICTURE THIS
You are interested in purchasing real estate. You have been looking at a few good sites. You have come to the conclusion that cost of the real estate depends on the location (outskirts or prime locale), the amenities provided by the builder (joggers track, senior citizen zone, gymnasium, swimming pools, etc.), the built up area, etc. The cost of the real estate is the dependent variable and the location, amenities, built-up area are called the independent variables.
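To make the dependent/independent distinction concrete, here is a minimal sketch of simple linear regression with a single independent variable (built-up area) and the cost as the dependent variable. The figures are invented for illustration and chosen to lie exactly on a straight line so the result is easy to check by hand; real property data would be noisier and would involve the other independent variables as well. The function name fit_line is an assumption.

```python
# Hypothetical data: built-up area (sq ft) and cost (in thousands).
# The costs are constructed as 5 * area + 100 so the fit is easy to verify.
areas = [800, 1000, 1200, 1500, 2000]
costs = [4100, 5100, 6100, 7600, 10100]


def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, costs)

# Predict the dependent variable (cost) for an unseen value of the independent variable.
predicted_cost = slope * 1800 + intercept
print(slope, intercept, predicted_cost)
```

With several independent variables (location, amenities, built-up area) the same idea extends to multiple regression, which is usually handled with a statistics library rather than by hand.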
Collaborative filtering: It is about predicting a user's preference or preferences based on the preferences of a group of users. For example, take a look at Table 1.5.

Table 1.5 Sample records depicting learners' preferences for modes of learning

User      Learning using Audios    Learning using Videos    Textual Learners
User 1    Yes                      Yes                      No
User 2    Yes                      Yes                      Yes
User 3    Yes                      Yes                      No
User 4    Yes                      ?                        ?

We are looking at predicting whether User 4 will prefer to learn using videos or is a textual learner depending on one or a couple of his or her known preferences. We analyze the preferences of similar user profiles and on the basis of it, predict that User 4 will also like to learn using videos and is not a textual learner.
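The prediction for User 4 can be sketched as a simple majority vote among similar users. Everything below is an assumption made for illustration: the preference values are hypothetical data laid out in the shape of Table 1.5, "similar" is taken to mean agreeing with every known preference of the target user, and the function name predict is invented. Practical collaborative filtering systems use weighted similarity measures over far larger data sets.

```python
# Hypothetical preferences in the shape of Table 1.5 (True means "Yes").
preferences = {
    "User 1": {"audio": True, "video": True, "text": False},
    "User 2": {"audio": True, "video": True, "text": True},
    "User 3": {"audio": True, "video": True, "text": False},
}

# The only preference known for User 4.
known = {"audio": True}


def predict(known, preferences, item):
    """Predict a preference by majority vote among users who share the known ones."""
    similar = [
        profile
        for profile in preferences.values()
        if all(profile[key] == value for key, value in known.items())
    ]
    votes = sum(profile[item] for profile in similar)
    return votes * 2 > len(similar)  # strict majority of the similar users


likes_video = predict(known, preferences, "video")
textual_learner = predict(known, preferences, "text")
print(likes_video, textual_learner)
```

All three similar users learn using videos and only one of them is a textual learner, so the vote says User 4 is likely to prefer videos and is unlikely to be a textual learner.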
in relational databases. text i
2. Text analytics or text mining: Compared to the structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically. Text mining is the process of gleaning high quality and meaningful information (through devising of patterns and trends by means of statistical pattern learning) from text. It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity extraction, etc.
3. Natural language processing (NLP): It is related to the area of human computer interaction. It is about enabling computers to understand human or natural language input.
4. Noisy text analytics: It is the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message-boards, text messages, etc. The noisy unstructured data usually comprises one or more of the following: Spelling mistakes, abbreviations, acronyms, non-standard words, missing punctuation, missing letter case, filler words such as "uh", "um", etc.
5. Manual tagging with metadata: This is about tagging manually with adequate metadata to provide the requisite semantics to understand unstructured data.
6. Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech such as "noun", "verb", "adjective", etc.
7. Unstructured Information Management Architecture (UIMA): It is an open source platform from IBM. It is used for real-time content analytics. It is about processing text and other unstructured data to find latent meaning and relevant relationship buried therein. Read up more on UIMA at the link: http://www.ibm.com/developerworks/dataldownloads/uima/
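As a toy illustration of noisy text analytics, the sketch below cleans up the Facebook post from Table 1.4. The abbreviation table, the filler-word set, and the function name normalize are all assumptions made for this example; a practical system would need a much larger dictionary and would also have to handle spelling mistakes and missing punctuation.

```python
import re

# Hypothetical lookup tables, far smaller than anything usable in practice.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "c": "see",
    "ya": "you",
    "bfn": "bye for now",
}
FILLERS = {"uh", "um"}


def normalize(text):
    """Lower-case the text, drop filler words, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [token for token in tokens if token not in FILLERS]
    return " ".join(ABBREVIATIONS.get(token, token) for token in kept)


cleaned = normalize("LOL. C ya. BFN")
print(cleaned)
```

The cleaned text is still free-form, but it is now in a state where the other techniques in this list (text categorization, part-of-speech tagging, and so on) have a better chance of working.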
REMIND ME

Structured data: It conforms to a data model. It has a pre-defined schema. For example, RDBMS conforms to relational data model.

Semi-structured data: For this format of data, little metadata is available, but is insufficient. Semi-structured data have self-describing structure. There is little or no separation between data and schema.
betwec.
enimemene?

nredivgy t
r e d , s nRee net yrem a n l fsta tehn
erha m eni

Fsvsn esgplee hief Frnamir 0

tessiese Fhat w t qradef. Ther was f i ihst


the
that a MAde availabfa
at tht ertd of tha
bark fvam the partieinants They
hea in the f
aitinnal rading cnntents tir
de napery
to white apers,rexeaE f
fppes
orded arf ade aValahla For het
e r t i n p

AY
of the partiripants
ea a nd omgorerensinn
evperiemre and you m alsad
T A qond
of anather irh avperienes
m g of be:rg part

of sueh virtual classrmom


There is no dearth
There is a learmins huge
ions being conducted today.
other na think on the
ha they were 43 to learn. Just
v t
(ommunity out there eager
and the
varieh,
voiume of data that gets generated
r g * h e sesson Dunng the scores and grades. the
e ia (the list of attendees. their
4 oVerteith th the
poiling
s397 conversations. their assignments
chat
awt the
artpants using the instructor to gauge the lev
resptw
a
questions put forth by the
discussion
The hac alsc actvated from the learner
at
learnings/views of understanding and participation
te share their as well consume as
Pero fe pertuient: of data that we produce
h e r were assignments, etc.) training s e s s i o n s
oponanns rnuces become part of these virtual
submitted on
t r br aempoted ard
wich w i e huave

2.1 CHARACTERISTICS OF DATA
Let us start with the characteristics of data. As depicted in Figure 2.1, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, [...] types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" [...] cleansing for further enhancement and enrichment [...]
3. Context: The context of data deals with "Where has this data been generated?" "Why was this data generated?" "How sensitive is this data?" "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources [...]. Most often we have answers to queries like why this data was generated, where and when it was generated, [...] and so on. [...]
2.2 EVOLUTION OF BIG DATA

[Table residue; recoverable labels: Data Generation and Storage; Data Utilization; Data Driven; Complex and Unstructured; Complex and Relational; Relational databases: data-intensive applications; Primitive and Structured; 1970s and before; Relational (1980s and 1990s); 2000s and beyond.]

2.3 DEFINITION OF BIG DATA


If we were to ask you the simple question: "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about 3 Vs.
Refer Figure 2.2. Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.

Big data is high-volume, high-velocity, and high-variety information assets that demand cost effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary

The 3Vs concept was proposed by the Gartner analyst Doug Laney in a 2001 MetaGroup research publication, titled, 3D Data Management: Controlling Data Volume, Variety and Velocity.
Source: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controli-Data-Volume-Velocity-and-Variety.pdf

For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.
Figure 2.2 Definition of big data. Responses shown: "Anything beyond the human & technical infrastructure needed to support storage, processing and analysis"; "Terabytes or Petabytes or Zettabytes... Wait a minute, I heard Yottabytes too!"; "Today's BIG may be tomorrow's NORMAL"; "Dunno".

Figure 2.3 Definition of big data - Gartner. Parts shown: high-volume, high-velocity, high-variety; cost-effective, innovative forms of information processing; enhanced insight & decision making.

Part I of the definition "big data is high-volume, high-velocity, and high-variety information assets" talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.
Part II of the definition "cost effective, innovative forms of information processing" talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition "enhanced insight and decision making" talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value

2.4 CHALLENGES WITH BIG DATA


Refer Figure 2.4. Following are a few challenges with big data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading is concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed as some data is useful for making long-term decisions, whereas in few cases, the data may quickly become irrelevant and obsolete just a few hours after having been generated.
Figure 2.4 Challenges with big data: capture, storage, curation, search, analysis, transfer, visualization, privacy violations.

4. There is a dearth of skilled professionals who possess a high level of proficiency in data sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big the dataset should be for it to be considered "big data." Here we are to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic and therefore there is a need to ingest this as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number, as far as business visualization experts are concerned.

2.5 WHAT IS BIG DATA?


Big data is data that is big in volume, velocity, and variety. Refer Figure 2.5.

2.5.1 Volume
We have seen it grow from bits to bytes to petabytes and exabytes. Refer Table 2.2 and Figure 2.6.

Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
Figure 2.5 Data: Big in volume, variety, and velocity. Data velocity: batch, periodic, near real time, real time. Data volume: MB, GB, TB, PB. Data variety: table, database, photo, web, social, audio, video, mobile.

Table 2.2 Growth of data

Bits          0 or 1
Bytes         8 bits
Kilobytes     1024 bytes
Megabytes     1024^2 bytes
Gigabytes     1024^3 bytes
Terabytes     1024^4 bytes
Petabytes     1024^5 bytes
Exabytes      1024^6 bytes
Zettabytes    1024^7 bytes
Yottabytes    1024^8 bytes

2.5.1.1 Where Does This Data get Generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website is unstructured data; a CCTV coverage, a weather forecast report is unstructured data too. Refer Figure 2.7 for the sources of big data.

1. Typical internal data sources: Data present within an organization's firewall. It is as follows:
Data storage: File systems, SQL (RDBMSs - Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
Archives: Archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.
1 Kilobyte (KB)  = 1000 bytes
1 Megabyte (MB)  = 1,000,000 bytes
1 Gigabyte (GB)  = 1,000,000,000 bytes
1 Terabyte (TB)  = 1,000,000,000,000 bytes
1 Petabyte (PB)  = 1,000,000,000,000,000 bytes
1 Exabyte (EB)   = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Figure 2.6 A mountain of data.
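Table 2.2 counts in powers of 1024 while Figure 2.6 counts in powers of 1000. The short sketch below, with function names chosen only for this illustration, computes both ladders and shows how far apart the two conventions drift at the yottabyte level.

```python
UNITS = [
    "Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
    "Petabytes", "Exabytes", "Zettabytes", "Yottabytes",
]


def unit_sizes(base):
    """Size of each unit in bytes for a given base (1024 or 1000)."""
    return {name: base ** power for power, name in enumerate(UNITS)}


binary = unit_sizes(1024)   # the convention of Table 2.2
decimal = unit_sizes(1000)  # the convention of Figure 2.6


def humanize(n_bytes, base=1024):
    """Express a byte count in the largest unit that keeps the value at or above 1."""
    value = float(n_bytes)
    for name in UNITS:
        if value < base or name == UNITS[-1]:
            return f"{value:.2f} {name}"
        value /= base


print(binary["Kilobytes"], decimal["Kilobytes"])
print(humanize(5 * 1024 ** 4))
print(binary["Yottabytes"] / decimal["Yottabytes"])
```

The gap between the two conventions is about 2% for a kilobyte and grows to roughly 21% for a yottabyte, so it is worth stating which convention a storage figure uses.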

Figure 2.7 Sources of big data: data storage, archives, media, sensor data, docs, machine log data, business apps, public web, social media.

2. External data sources: Data residing outside an organization's firewall. It is as follows:
Public Web: Wikipedia, weather, regulatory, compliance, census, etc.

3. Both (internal + external data sources)
Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
Machine log data: Event logs, application logs, business process logs, audit logs, clickstream data, etc.
Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
Business apps: ERP, CRM, HR, Google Docs, and so on.
Media: Audio, Video, Image, Podcast, etc.
Docs: Comma separated value (CSV), Word Documents, PDF, XLS, PPT, and so on.

2.5.2 Velocity
We have moved from the days of batch processing (remember our payroll applications) to real-time processing.

Batch -> Periodic -> Near real time -> Real-time processing
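
One rough way to picture the two ends of this spectrum is to compute the same total twice: once over a complete, stored batch, and once incrementally as each record arrives. This is a minimal sketch; the payroll amounts and the function names `batch_total` and `streaming_totals` are made up for illustration.

```python
from typing import Iterable, Iterator


def batch_total(records: list[float]) -> float:
    """Batch style: wait until all records are collected, then process once."""
    return sum(records)


def streaming_totals(records: Iterable[float]) -> Iterator[float]:
    """Streaming style: update and emit the result as each record arrives."""
    total = 0.0
    for amount in records:
        total += amount
        yield total  # an up-to-date answer is available after every record


if __name__ == "__main__":
    payments = [1200.0, 950.0, 1430.0]  # hypothetical payroll amounts
    print(batch_total(payments))
    print(list(streaming_totals(payments)))
```

Both styles arrive at the same final figure. The difference is when an answer becomes available: only at the end of the run in the batch case, and after every record in the streaming case.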
More data -> More accurate analysis -> More confidence in decision making -> Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.

Figure 2.8 Why big data?

2.8 ARE WE JUST AN INFORMATION CONSUMER OR DO WE ALSO PRODUCE INFORMATION?

PICTURE THIS
You have been invited to your friend's promotion party. You are happy and excited to join your friend at this important milestone in her career. You send in your confirmation through a text message. You get ready and leave for your friend's residence. On the way, you stop at a gas station to refuel. You pay using your credit card. You stop at an upmarket Archie's store to pick a good greeting card and a ... You get the items billed at the Point of Sale system and pay cash at the counter. While at the party, you click photographs and post them on Facebook, ... and the likes. Within minutes, you start to get likes and comments on your posts.

Mention the places in this scenario where data was generated:


1. Text message to send in the confirmation to attend the promotion bash.
2. Use of credit card to pay for gas/fuel at the gas station.
3. Point of Sale system at Archie's where your transaction gets recorded.
4. Photographs and posts on social networking sites.
5. Likes and comments to your post.

Likewise, there are several instances every day where you generate data. Think about cases where ... consumer of information.

2.9 TRADITIONAL BUSINESS INTELLIGENCE (BI) VERSUS BIG DATA

Let us take a sneak peek into some of the differences that one encounters dealing with traditional BI and big data.
1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales in or out horizontally, as compared to a typical database server that scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real time as well as in offline mode.
3. Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code), whereas big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
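
The phrase "move code to data" can be illustrated with a toy sketch. A dataset is split into partitions (standing in for blocks held on different nodes), a small function is applied to each partition where it sits, and only the small per-partition results are combined. The partition contents and the names `count_errors` and `run_on_partitions` are hypothetical; a real distributed file system and scheduler are not modeled here.

```python
from typing import Callable

# Each inner list stands in for a block of log lines stored on a separate node.
PARTITIONS = [
    ["INFO start", "ERROR disk full", "INFO done"],
    ["ERROR timeout", "ERROR retry failed"],
    ["INFO heartbeat"],
]


def count_errors(lines: list[str]) -> int:
    """The 'code' that is shipped to each partition."""
    return sum(1 for line in lines if line.startswith("ERROR"))


def run_on_partitions(func: Callable[[list[str]], int],
                      partitions: list[list[str]]) -> int:
    """Apply func to every partition locally and combine the small results."""
    partial_results = [func(part) for part in partitions]  # one int per node
    return sum(partial_results)


if __name__ == "__main__":
    print(run_on_partitions(count_errors, PARTITIONS))
```

Only one integer per partition has to travel to the place where results are combined, which is the point of sending the function to the data instead of the data to the function.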

2.10 A TYPICAL DATA WAREHOUSE ENVIRONMENT


Let us look at a typical Data Warehouse (DW) environment. Operational or transactional or day-to-day business data is gathered from Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM), legacy systems, and several third party applications. The data from these sources may differ in format [data could have been housed in any RDBMS such as Oracle, MS SQL Server, DB2, MySQL, Teradata, and so on, or in a spreadsheet (.xls, .xlsx, etc.) or .csv or .txt]. Data may come from data sources located in the same geography or different geographies. This data is then integrated, cleaned up, transformed, and standardized through the process of Extraction, Transformation, and Loading (ETL). The transformed data is then loaded into the enterprise data warehouse (available at the enterprise level) or data marts (available at the business unit/functional unit or business process level). A host of market leading business intelligence and analytics tools are then used to enable decision making from the use of ad-hoc queries, SQL, enterprise dashboards, data mining, etc. Refer to Figure 2.9.
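
The ETL flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the two source extracts, their field names, and the `sales` table are invented, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical source extracts that differ in format.
CRM_CSV = "customer,amount\n alice ,100\nBob,250\n"
ERP_ROWS = [{"cust_name": "CAROL", "amt_cents": 7550}]


def extract() -> list[dict]:
    """Extraction: read raw records from each source as-is."""
    crm = list(csv.DictReader(io.StringIO(CRM_CSV)))
    tagged_crm = [{"source": "crm", **row} for row in crm]
    tagged_erp = [{"source": "erp", **row} for row in ERP_ROWS]
    return tagged_crm + tagged_erp


def transform(raw: list[dict]) -> list[tuple[str, float]]:
    """Transformation: clean up and standardize to (customer, amount)."""
    rows = []
    for record in raw:
        if record["source"] == "crm":
            name, amount = record["customer"], float(record["amount"])
        else:
            name, amount = record["cust_name"], record["amt_cents"] / 100
        rows.append((name.strip().title(), amount))
    return rows


def load(rows: list[tuple[str, float]], conn: sqlite3.Connection) -> None:
    """Loading: write standardized rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(transform(extract()), warehouse)
    print(warehouse.execute(
        "SELECT customer, amount FROM sales ORDER BY customer").fetchall())
```

Once the rows are loaded in one standardized shape, the ad-hoc SQL query at the end no longer needs to know which source system a record came from.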

2.11 A TYPICAL HADOOP ENVIRONMENT

Let us now study the Hadoop environment. Is it very different from the data warehouse environment, and where exactly is this difference? As is fairly obvious from Figure 2.10, the data sources are quite disparate, from web logs to images, audios, and videos to social media data to the various docs, pdfs, etc. Here the data in focus is not just the data within the company's firewall but also data residing outside the company's firewall. This data is placed in Hadoop Distributed File System (HDFS). If need be, this can be repopulated back to operational systems or fed to the enterprise data warehouse or data marts or Operational Data Store (ODS) to be picked for further processing and analysis.
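
Figure 2.10 shows MapReduce alongside HDFS. As a rough, single-process illustration of that processing style (this is not Hadoop's actual API), the sketch below counts requests per page in a few made-up web log lines using a map step, a shuffle (group by key) step, and a reduce step. All function names and log lines are hypothetical.

```python
from collections import defaultdict

# Hypothetical web log lines: client address, HTTP method, requested page.
WEB_LOGS = [
    "10.0.0.1 GET /home",
    "10.0.0.2 GET /cart",
    "10.0.0.1 GET /home",
]


def map_phase(line: str):
    """Emit (key, 1) for the page requested in one log line."""
    _ip, _method, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def page_counts(lines):
    """Run map, shuffle, and reduce over the given log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(page_counts(WEB_LOGS))
```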

Figure 2.9 A typical data warehouse environment. (Sources: ERP, CRM, legacy, and third party apps feed the data warehouse, which supports reporting, dashboarding, OLAP, ad hoc querying, and modeling.)

Figure 2.10 A typical Hadoop environment. (Sources: web logs, images and videos, social media, docs and PDFs feed Hadoop (HDFS, MapReduce), which in turn feeds operational systems, the data warehouse, data marts, and the ODS.)

2.12 WHAT IS NEW TODAY?
