Unit 1 (Big Data)
BRIEF CONTENTS
What's in Store?
Classification of Digital Data
Structured Data
Sources of Structured Data
Ease of Working with Structured Data
Semi-Structured Data
Sources of Semi-Structured Data
Unstructured Data
Issues with "Unstructured" Data
How to Deal with Unstructured Data
W. Edwards Deming
WHAT'S IN STORE?
Irrespective of the size of the enterprise (big or small), data continues to be a precious and irreplaceable asset.
Data is present internal to the enterprise and also exists outside the four walls and firewalls of the
enterprise. Data is present in homogeneous sources as well as in heterogeneous sources. The need of the
hour is to understand, manage, process, and take the data for analysis to draw valuable insights.
Data → Information
Information → Insights
This chapter is a "must read" for first-time learners interested in understanding the role of data in business
intelligence, business analysis, and businesses at large. This chapter will introduce you to the various
formats of digital data (structured, semi-structured, and unstructured data), the sources of each format of
data, the issues with the terminology of unstructured data, etc.
We suggest you refer to the learning resources suggested at the end of this chapter and also attempt the
exercises to get a grip on this topic. We suggest you make your own notes/bookmarks while reading through
the chapter.
Ever since the 1980s most of the enterprise data has been stored in relational databases complete with rows/
records/tuples, columns/attributes/fields, primary keys, foreign keys, etc. Over a period of time the Relational
Database Management System (RDBMS) matured, and the RDBMS, as they are available today, have
become more robust, cost-effective, and efficient. We have grown comfortable working with RDBMS: the
storage, retrieval, and management of data has been immensely simplified. The data held in RDBMS is
typically structured data. However, with the Internet connecting the world, data that existed beyond our
enterprise started to become an integral part of daily transactions. This data grew by leaps and bounds, so
much so that it became difficult for the enterprises to ignore it. Not all of this data was structured. A lot of
it was unstructured. In fact, Gartner estimates that almost 80% of data generated in any enterprise today
is unstructured data. Roughly 10% each falls in the structured and semi-structured categories.
Refer Figure 1.2.
[Figure: Classification of digital data into structured data, semi-structured data, and unstructured data.]
Types of Digital Data
[Figure 1.2: Approximate share of each format of digital data: structured data 10%, semi-structured data 10%,
unstructured data 80%.]
Think structured data, and think data model - a model of the types of business data that we intend to store,
process, and access. Let us discuss this in the context of an RDBMS. Most of the structured data is held in
RDBMS. An RDBMS conforms to the relational data model wherein the data is stored in rows/columns.
Refer Table 1.1.
The number of rows/records/tuples in a relation is called the cardinality of a relation and the number of
columns is referred to as the degree of a relation.
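As a small illustration of these two terms, the sketch below counts them for a toy relation. The column names and rows are illustrative assumptions chosen for this example, and the helper names `cardinality` and `degree` are hypothetical.

```python
# A toy relation: a tuple of column names plus a list of rows (tuples).
# The names and values here are illustrative assumptions.
columns = ("EmpNo", "EmpName", "Designation")
rows = [
    ("E101", "Allen", "Software Engineer"),
    ("E102", "Simon", "Consultant"),
]


def cardinality(relation_rows):
    """Cardinality: the number of rows (records/tuples) in the relation."""
    return len(relation_rows)


def degree(relation_columns):
    """Degree: the number of columns (attributes) in the relation."""
    return len(relation_columns)


print("cardinality =", cardinality(rows))
print("degree =", degree(columns))
```

Adding a row changes the cardinality, whereas adding a column changes the degree.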
The first step is the design of a relation/table, the fields/columns to store the data, and the type of data that
will be stored [number (integer or real), alphabets, date, Boolean, etc.]. Next we think of the constraints that
we would like our data to conform to (constraints such as UNIQUE values in the column, NOT NULL values
in the column, a business constraint such as the value held in the column should not drop below 50, the set
of permissible values in the column such as the column should accept only "CS", "IS", "MS", etc., as
input).
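The constraints listed above can be tried out with a minimal sketch that uses Python's built-in `sqlite3` module rather than a server RDBMS such as Oracle. The table name `student` and its columns are hypothetical, chosen only to exercise UNIQUE, NOT NULL, a "not below 50" rule, and a permitted-values rule.

```python
import sqlite3

# In-memory SQLite database; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        score   INTEGER CHECK (score >= 50),
        stream  TEXT CHECK (stream IN ('CS', 'IS', 'MS'))
    )
""")


def try_insert(row):
    """Return True if the row satisfies every constraint, else False."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO student (roll_no, email, score, stream) "
                "VALUES (?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert((1, "a@example.com", 72, "CS")),  # satisfies all constraints
    try_insert((2, "a@example.com", 80, "IS")),  # violates UNIQUE (email)
    try_insert((3, None, 65, "MS")),             # violates NOT NULL (email)
    try_insert((4, "d@example.com", 40, "CS")),  # violates score >= 50
    try_insert((5, "e@example.com", 90, "EE")),  # stream not in permitted set
]
print(results)
```

Only the first row is accepted; the database itself rejects the other four, so the rules do not have to be re-implemented in every application.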
To explain further, let us design a table/relation structure to store the details of the employees of an
enterprise. Table 1.2 shows the structure/schema of an "Employee" table in an RDBMS such as Oracle.
Table 1.2 is an example of a well-structured table (complete with table name, meaningful column names
with data types, data length, and the relevant constraints) with absolute adherence to the relational data model.
Table 1.3 Sample records in the "Employee" table
EmpNo   EmpName   Designation         DeptNo   ContactNo
E101    Allen     Software Engineer   D1       0999999999
E102    Simon     Consultant          D1       0777777777
...
It goes without saying that each record in the table will have exactly the same structure. Let us take a look
at how the "Employee" table relates to the "Department" table (Figure 1.3).
Department: DeptNo, DeptName, DeptLocation, DeptEmpStrength
Employee: EmpNo, EmpName, EmpDesignation, DeptNo, EmpContactNo
Figure 1.3 Relationship between "Employee" and "Department" tables.
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
Spreadsheets
OLTP systems
Figure 1.4 Sources of structured data.
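The relationship between the "Employee" and "Department" tables in Figure 1.3 rests on the shared DeptNo column. The sketch below is a minimal illustration using Python's built-in `sqlite3` module. It keeps only a few of the columns, and the department name "Engineering" and the employee "Maria" are invented sample values, not data from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Department (
        DeptNo   TEXT PRIMARY KEY,
        DeptName TEXT NOT NULL
    );
    CREATE TABLE Employee (
        EmpNo   TEXT PRIMARY KEY,
        EmpName TEXT NOT NULL,
        DeptNo  TEXT REFERENCES Department(DeptNo)
    );
    INSERT INTO Department VALUES ('D1', 'Engineering');
    INSERT INTO Employee VALUES ('E101', 'Allen', 'D1');
    INSERT INTO Employee VALUES ('E102', 'Simon', 'D1');
""")

# Follow the DeptNo foreign key from each employee to its department.
joined = conn.execute("""
    SELECT e.EmpName, d.DeptName
    FROM Employee AS e
    JOIN Department AS d ON e.DeptNo = d.DeptNo
    ORDER BY e.EmpNo
""").fetchall()
print(joined)

# An employee that points at a non-existent department is rejected.
try:
    conn.execute("INSERT INTO Employee VALUES ('E103', 'Maria', 'D9')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan rejected:", orphan_rejected)
```

The foreign key keeps the two tables in agreement: every DeptNo stored against an employee must exist in the "Department" table.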
Working with structured data is easy with respect to the following:
1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease
with data input, storage, access, process, analysis, etc.
2. Security: How does one ensure the security of information? Robust encryption and tokenization
solutions are available to warrant the security of information throughout its lifecycle. Organizations are
able to retain control and maintain compliance adherence by ensuring that only authorized individuals
are able to decrypt and view sensitive information.
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily the
SELECT statement) at the cost of additional writes and storage space, but the benefits that ensue
in search operations are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily scaled up
by increasing the horsepower of the database server (increasing the primary and secondary or peripheral
storage capacity, processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for the Atomicity, Consistency, Isolation, and Durability
(ACID) properties of a transaction. Given next is a quick explanation of the ACID properties:
Atomicity: A transaction is atomic, which means that either it happens in its entirety or none of it at all.
Consistency: The database moves from one consistent state to another consistent state. In other words,
if the same piece of information is stored at two or more places, they are in complete agreement.
Isolation: The resource allocation to the transaction happens such that the transaction gets the
impression that it is the only transaction happening in isolation.
Durability: All changes made to the database during a transaction are permanent and that accounts
for the durability of the transaction.
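Atomicity, the first of the ACID properties above, can be observed with a minimal sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` helper, and the balances are hypothetical. The point is only that a transfer which fails halfway leaves no partial change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # balances may never go negative
)
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 20)")
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst as a single transaction."""
    try:
        with conn:  # commits if both updates succeed, rolls back otherwise
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("A", "B", 30)       # both updates succeed
failed = transfer("A", "B", 500)  # debit would overdraw A, so the credit to B is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to B has already been applied when the debit fails, yet the rollback restores B's balance: the transaction happens in its entirety or not at all.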
[Figure: Ease of working with structured data: input/update/delete, security, scalability, transaction processing.]
Semi-structured data has the following characteristics:
1. It does not conform to the data models that one typically associates with relational databases or any
other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of records and fields within data.
4. There is no separation between the data and the schema. The amount of structure used is dictated by
the purpose at hand.
5. In semi-structured data, entities belonging to the same class and also grouped together need not
necessarily have the same set of attributes. And if at all they have the same set of attributes, the order of
attributes may not be similar, and for all practical purposes it is not important as well.
[Figure: Semi-structured data is self-describing (label/value pairs); often schema information is blended
with data values.]
<HR>
<a href="http://bigdatauniversity.com">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
<a href="mailto:[email protected]">...</a>
<HR>
</BODY>
</HTML>
{
    _id: 9,
    BookTitle: "Fundamentals of Business Analytics",
    AuthorName: "Seema Acharya",
    Publisher: "Wiley India",
    YearofPublication: "2011"
}
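To make the "self-describing" nature of such a record concrete, here is a minimal sketch that parses two JSON documents with Python's built-in `json` module. The first document mirrors the book record above. The second is an invented record that deliberately has a different set of attributes, in a different order.

```python
import json

# Two documents of the same "class" (books). The second one is a made-up
# example with a different attribute set and attribute order.
raw = """
[
  {"_id": 9,
   "BookTitle": "Fundamentals of Business Analytics",
   "AuthorName": "Seema Acharya",
   "Publisher": "Wiley India",
   "YearofPublication": "2011"},
  {"_id": 10,
   "Publisher": "Example Press",
   "BookTitle": "A Hypothetical Title",
   "Edition": 2}
]
"""

books = json.loads(raw)

# The labels travel with the data, so the "schema" can be discovered
# from the records themselves.
all_fields = sorted({key for book in books for key in book})
print(all_fields)

# Readers of semi-structured data must tolerate missing attributes.
for book in books:
    author = book.get("AuthorName", "unknown")
    print(book["_id"], book["BookTitle"], "by", author)
```

No table definition was needed up front. The price is that every consumer has to handle attributes that may or may not be present.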
There are situations where people argue that a text file should be in the category of semi-structured data
and not unstructured data. Let us look at where they are coming from. Well, the text file does have a name,
and one can easily look at the properties to get information such as the owner of the file, the date on which
the file was created, the size of the file, etc. Okay, we do have a little metadata. But when it comes to
analysis, we are more concerned with the content of the text file rather than the name or any of the other
properties. In fact, the other properties may not in any way contribute to the processing/analysis task at
hand. Therefore it is fair to place it in the unstructured data category.
[Figure: Structure can be implied despite not being formally defined.]
Figure 1.8 Sources of unstructured data: audios, videos, text messages, chats, Word documents.
1.1.3.2 How to Deal with Unstructured Data?
Today, unstructured data constitutes approximately 80% of the data that is being generated in any
enterprise. The balance is clearly shifting in favor of unstructured data as shown in Figure 1.10. It is such a
big percentage that it cannot be ignored. Figure 1.11 states a few ways of dealing with unstructured data.
Figure 1.10 Unstructured data clearly constitutes a major percentage of enterprise data.
The following techniques are used to find patterns in or interpret unstructured data:
1. Data mining: First, we deal with large data sets. Second, we use methods at the intersection of
artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in
large data sets and/or systematic relationships between variables. It is the analysis step of the
"knowledge discovery in databases" process.
A few popular data mining algorithms are as follows:
Association rule mining: It is also called "market basket analysis" or "affinity analysis". It is used to
determine "What goes with what?" It is about which other product you are likely to purchase when you
buy a given product. For example, if you pick up bread from the grocery, are you likely to pick eggs or
cheese to go with it?
Regression analysis: It helps to predict the relationship between two variables. The variable whose
value needs to be predicted is called the dependent variable and the variables which are used to
predict the value are referred to as the independent variables.
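The "What goes with what?" question behind association rule mining can be sketched with two standard measures, support and confidence. The text does not name these measures, so treat them as an assumption about how the analysis is usually carried out. The baskets below are invented for the bread, eggs, and cheese example.

```python
# Each basket is the set of items bought together in one visit (made-up data).
transactions = [
    {"bread", "eggs"},
    {"bread", "cheese"},
    {"bread", "eggs", "milk"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


bread_to_eggs = confidence({"bread"}, {"eggs"}, transactions)
bread_to_cheese = confidence({"bread"}, {"cheese"}, transactions)
print(f"bread -> eggs:   {bread_to_eggs:.2f}")
print(f"bread -> cheese: {bread_to_cheese:.2f}")
```

In this toy data, eggs accompany bread in two of the three bread baskets and cheese in one, so eggs are the likelier companion purchase. Full algorithms such as Apriori apply the same counts across all item combinations.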
PICTURE THIS
You are interested in purchasing real estate. You have been looking at a few good sites. You have come to
the conclusion that the cost of the real estate depends on the location (outskirts or prime locale), the
amenities provided by the builder (joggers track, senior citizen zone, gymnasium, swimming pools, etc.),
the built-up area, etc. The cost of the real estate is the dependent variable and the location, amenities, and
built-up area are called the independent variables.
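The real-estate scenario can be reduced to a minimal sketch with one independent variable (built-up area) and one dependent variable (cost), fitted by ordinary least squares. The figures are invented and chosen to lie exactly on a straight line. A real analysis would also include location and amenities as independent variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: built-up area and cost (arbitrary units).
areas = [1000, 1500, 2000, 2500]   # independent variable
costs = [50, 70, 90, 110]          # dependent variable

slope, intercept = fit_line(areas, costs)


def predict(area):
    """Predicted cost for a given built-up area."""
    return slope * area + intercept


print(f"cost is roughly {slope:.3f} * area + {intercept:.1f}")
print(f"predicted cost for area 1800: {predict(1800):.1f}")
```

Once the line is fitted, the cost of a property that was never observed can be estimated from its built-up area alone.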
Collaborative filtering: It is about predicting a user's preference or preferences based on the preferences
of other users, for example, whether a user is a textual learner. Refer Table 1.5.
Table 1.5 Sample records depicting learners' preferences for modes of learning
(columns: Learning using Audios, Learning using Videos, Textual Learners; one row per user, with Yes/No
entries)
2. Text analytics or text mining: Compared to the structured data stored in relational databases, text is
largely unstructured, amorphous, and difficult to deal with algorithmically. Text mining is the process
of gleaning high quality and meaningful information (through devising of patterns and trends by means
of statistical pattern learning) from text. It includes tasks such as text categorization and text clustering.
Structured data: It conforms to a data model.
2.1 CHARACTERISTICS OF DATA
As depicted in Figure 2.1, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the
granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for
analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?" "Why was this data
generated?" "How sensitive is this data?" "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known
data. Most often we have answers to queries such as why this data was generated, where and when it was
generated, and so on.
2.2 EVOLUTION OF BIG DATA
[Table: Evolution of data. 1970s and before: primitive and structured; 1980s and 1990s: relational; 2000s
and beyond.]
... Terabytes or petabytes or zettabytes of data.
4. I think it is about 3 Vs.
Refer Figure 2.2.
Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above
and more.
Big data is high-volume, high-velocity, and high-variety information assets that demand cost effective,
innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary
The 3Vs concept was proposed by the Gartner analyst Doug Laney in a 2001 MetaGroup research
publication titled 3D Data Management: Controlling Data Volume, Variety and Velocity.
Source: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.
[Figure 2.2: Responses to "What is big data?": "Terabytes or petabytes or zettabytes..." "Wait a minute, I
heard yottabytes too!!" "Dunno." "Today's BIG may be tomorrow's NORMAL."]
[Figure 2.3: Definition of big data: high-volume, high-velocity, high-variety; cost-effective, innovative
forms of information processing; enhanced insight and decision making.]
Part I of the definition "big data is high-volume, high-velocity, and high-variety information assets"
talks about voluminous data (humongous data) that may have great variety (a good mix of structured,
semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation,
processing, and analysis.
Part II of the definition "cost effective, innovative forms of information processing" talks about embracing
new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the
high-volume, high-velocity, and high-variety data.
Part III of the definition "enhanced insight and decision making" talks about deriving deeper, richer, and
meaningful insights and then using these insights to make faster and better decisions to gain business value
and thus a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
[Figure: Challenges with big data: storage, curation, search, transfer, visualization, privacy violations.]
4. There is a dearth of skilled professionals who possess a high level of proficiency in data sciences that is
vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search,
analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically
beyond the storage capacity of traditional database software tools. There is no explicit definition of
how big the dataset should be for it to be considered "big data." Here we are to deal with data that is
just too big, moves way too fast, and does not fit the structures of typical database systems. The data
changes are highly dynamic and therefore there is a need to ingest this as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number, as far
as ...
2.5.1 Volume
We have seen it grow from bits to bytes to petabytes and exabytes. Refer Table 2.2 and Figure 2.6.
[Figure 2.5 residue. Data velocity: batch, periodic, real time. Data volume: MB, GB, TB, PB. Data variety:
table, database, photo, social, web, audio, video, mobile.]
Figure 2.5 Data: Big in volume, variety, and velocity.
Table 2.2
Bits                 0 or 1
Bytes                8 bits
1 Kilobyte (KB)    = 1000 bytes
1 Megabyte (MB)    = 1,000,000 bytes
1 Gigabyte (GB)    = 1,000,000,000 bytes
1 Terabyte (TB)    = 1,000,000,000,000 bytes
1 Petabyte (PB)    = 1,000,000,000,000,000 bytes
1 Exabyte (EB)     = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB)   = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB)   = 1,000,000,000,000,000,000,000,000 bytes
Figure 2.6 A mountain of data.
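The ladder of units from bytes to yottabytes moves up by a factor of 1000 at each step. The sketch below turns a raw byte count into the largest fitting unit. The function name `human_readable` is hypothetical, and the decimal (power-of-1000) convention is assumed, matching the kilobyte definition of 1000 bytes.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000  # step up one unit


for size in (999, 1000, 2_500_000_000_000, 10**24):
    print(size, "->", human_readable(size))
```

Each step up the table is three more zeros, which is why a dataset that sounds large in gigabytes is only a small fraction of a petabyte.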
[Figure: Sources of big data, including data storage, media, and archives.]
2. External data sources: Data residing outside an organization's firewall. It is as follows:
Business apps: ERP, CRM, HR, Google Docs, and so on.
Media: Audio, Video, Image, Podcast, etc.
Docs: Comma separated value (CSV), Word Documents, PDF, XLS, PPT, and so on.
2.5.2 Velocity
We have moved from the days of batch processing (remember our payroll applications) to real-time
processing.
Batch → Periodic → Near real time → Real-time processing
PICTURE THIS
You have been invited to your friend's promotion party. You are happy and excited to join your friend
at this important milestone in her career. You send in your confirmation through a text message. You
get ready and leave for your friend's residence. On the way, you stop at a gas station to refuel. You
pay using your credit card. You stop at an upmarket Archie's store to pick a good greeting card and a ...
You get the items billed at the Point of Sale system and pay cash at the counter. While at the party, you
click photographs and post them on Facebook and the like. Within minutes, you start to get likes and
comments on your posts.
Likewise, there are several instances every day where you generate data. Think about cases where you are a
consumer of information.
Let us take a sneak peek into some of the differences that one encounters dealing with traditional BI and
big data.
1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big
data environment data resides in a distributed file system. The distributed file system scales by scaling
in or out horizontally, as compared to a typical database server that scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in
both real time as well as in offline mode.
3. Traditional BI is about structured data, and it is here that data is taken to the processing functions (move
data to code), whereas big data is about variety: structured, semi-structured, and unstructured data, and
here the processing functions are taken to the data (move code to data).
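The "move code to data" idea in the third point can be illustrated with a single-process simulation. It is only a sketch: the partitions stand in for blocks stored on different nodes, the record values are invented, and no actual Hadoop or distributed file system API is involved.

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for the records held on one node (made-up data).
partitions = [
    ["error", "ok", "ok"],
    ["ok", "error", "error"],
    ["ok"],
]


def count_locally(partition):
    """The 'code' shipped to a node: summarize only that node's records."""
    return Counter(partition)


def merge(left, right):
    """Combine two small per-node summaries into one."""
    return left + right


# Step 1: run the function where the data lives; only summaries travel.
partial_counts = [count_locally(p) for p in partitions]

# Step 2: merge the small summaries centrally.
total = reduce(merge, partial_counts, Counter())
print(dict(total))
```

Only the compact per-partition counts cross the network in this model, not the raw records, which is what makes the approach attractive when the data is far larger than the code.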
[Figure: A traditional BI environment: ERP, CRM, and legacy sources; data warehouse; reporting,
dashboarding, OLAP, and ad hoc querying.]
Figure 2.10 A typical Hadoop environment (the figure includes HDFS).