Big Data, Hadoop

2.1 CHARACTERISTICS OF DATA

Let us start with the characteristics of data. As depicted in Figure 2.1, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions will this data be able to answer, and so on.
Figure 2.1 Characteristics of data: composition, condition, and context.

Big data is about complexity... complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioral or social) that is being generated.
2.2 EVOLUTION OF BIG DATA
1970s and before was the era of mainframnes. The data was essentially primitive and structured. Relational
databases evolved in 1980s and 1990s. The era was of data intensive applications. The World Wide Web
(www)and the Interner of Things (loT) have led to an onslaught of structured, unstructured, and mul.
timedia data. Refer Table 2.1.
Table 2.1 The evolution of big data

1970s and before: Mainframes provided basic data storage; data was primitive and structured.
1980s and 1990s (Relational): Relational databases powered data-intensive applications; data was relational.
2000s and beyond: Structured data, unstructured data, and multimedia data; data is complex and unstructured.

2.3 DEFINITION OF BIG DATA

If we were to ask you the simple question "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about 3 Vs.
Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more. Refer Figure 2.2.

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary

The 3Vs (Volume, Variety, and Velocity) concept was proposed by the Gartner analyst Doug Laney in a 2001 MetaGroup research publication titled "3D Data Management: Controlling Data Volume, Velocity, and Variety."
Source: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.

Figure 2.2 Definition of big data: the varied responses listed above.


Figure 2.3 Definition of big data (Gartner): high-volume, high-velocity, and high-variety information assets; cost-effective, innovative forms of information processing; enhanced insight and decision making.

Part I of the definition, "big data is high-volume, high-velocity, and high-variety information assets", talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.
Part II of the definition, "cost-effective, innovative forms of information processing", talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition, "enhanced insight and decision making", talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
2.4 CHALLENGES WITH BIG DATA
Refer Figure 2.4. Following are a few challenges with big data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
Figure 2.4 Challenges with big data: capture, curation, search, analysis, transfer, visualization, and privacy violations.

4. There is a dearth of skilled professionals who possess the high level of proficiency in data sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered "big data." Here we are to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic and therefore there is a need to ingest the data as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number as far as business visualization experts are concerned.

2.5 WHAT IS BIG DATA?


Big data is data that is big in volume, velocity, and variety. Refer Figure 2.5.

2.5.1 Volume
We have seen data grow from bits to bytes to petabytes and exabytes. Refer Table 2.2 and Figure 2.6.
Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
2.5.1.1 Where Does This Data Get Generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data, as are a video on YouTube, a chat conversation on an Internet messenger, and a customer feedback form on an online retail website.
Figure 2.5 Data: big in volume (MB, GB, TB, PB), velocity (batch, periodic, near real time, real time), and variety (table, database, photo, web, social, audio, video, mobile).

Table 2.2 Growth of data

Bits        0 or 1
Bytes       8 bits
Kilobytes   1024 bytes
Megabytes   1024^2 bytes
Gigabytes   1024^3 bytes
Terabytes   1024^4 bytes
Petabytes   1024^5 bytes
Exabytes    1024^6 bytes
Zettabytes  1024^7 bytes
Yottabytes  1024^8 bytes

CCTV coverage and a weather forecast report are unstructured data too. Refer Figure 2.7 for the sources of big data.
1. Typical internal data sources: Data present within an organization's firewall. It is as follows:
• Data storage: File systems, SQL (RDBMSs: Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.
1 Kilobyte (KB) = 1,000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Figure 2.6 A mountain of data.
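Note that Table 2.2 counts in binary multiples (powers of 1024) while Figure 2.6 rounds to decimal multiples (powers of 1000); both conventions are in everyday use. A quick plain-Java sketch (the class name is our own) that prints the two side by side, stopping at exabytes since larger units overflow a long:

```java
public class StorageUnits {
    public static void main(String[] args) {
        String[] units = {"kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte", "exabyte"};
        long binary = 1, decimal = 1;
        for (String unit : units) {
            binary *= 1024;   // Table 2.2 convention: powers of 1024
            decimal *= 1000;  // Figure 2.6 convention: powers of 1000
            System.out.printf("1 %s = %,d bytes (binary) vs %,d bytes (decimal)%n",
                    unit, binary, decimal);
        }
    }
}
```

The gap widens with each unit: at the terabyte level the decimal figure is already about 9% smaller than the binary one.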

Figure 2.7 Sources of big data: data storage, archives, media, sensor data, docs, machine log data, business apps, public web, and social media.

2. External data sources: Data residing outside an organization's firewall. It is as follows:
• Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
3. Both (internal + external data sources):
• Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
• Machine log data: Event logs, application logs, business process logs, audit logs, clickstream data, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: Audio, video, image, podcast, etc.
• Docs: Comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.
2.5.2 Velocity
We have moved from the days of batch processing (remember our payroll applications?) to real-time processing:
Batch → Periodic → Near real time → Real-time processing
2.5.3 Variety
Variety deals with a wide range of data types and sources of data. We will study this under three categories: structured data, semi-structured data, and unstructured data.
1. Structured data: From traditional transaction processing systems, RDBMSs, etc.
2. Semi-structured data: For example, Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML).
3. Unstructured data: For example, unstructured text documents, audios, videos, emails, photos, PDFs, social media, etc.

2.6 OTHER CHARACTERISTICS OF DATA WHICH ARE NOT DEFINITIONAL TRAITS OF BIG DATA

There are yet other characteristics of data which are not necessarily the definitional traits of big data. A few of these are listed as follows:

1. Veracity and validity: Veracity refers to biases, noise, and abnormality in data. The key question here is: "Is all the data that is being stored, mined, and analyzed meaningful and pertinent to the problem under consideration?" Validity refers to the accuracy and correctness of the data. Any data that is picked up for analysis needs to be accurate; this is not true about big data alone.
2. Volatility: Volatility of data deals with how long the data is valid and how long it should be stored. There is some data that is required for long-term decisions and remains valid for longer periods of time. However, there are also pieces of data that quickly become obsolete minutes after their generation.
3. Variability: Data flows can be highly inconsistent with periodic peaks.

PICTURE THIS...
An online retailer announces a "big sale day" for a particular week. The retailer is likely to experience an upsurge in customer traffic to the website during this week. In the same way, he/she might experience a slump in business immediately after the festival season. This reemphasizes the point that one might witness spikes in data at some points in time, while at other times the data flow can go flat.

2.7 WHY BIG DATA?

The more data we have for analysis, the greater will be the analytical accuracy, and the greater would be the confidence in our decisions based on these analytical findings. This will entail a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, innovating on new products and new services, and optimizing existing services. Refer Figure 2.8.
More data → More accurate analysis → Greater confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, and optimized offerings, etc.
2.11 A TYPICAL HADOOP ENVIRONMENT
Let us now study the Hadoop environment. Is it very different from the data warehouse environment, and where exactly is this difference?
As is fairly obvious from Figure 2.10, the data sources are quite disparate, from web logs to images, audios, and videos, to social media data, to the various docs, PDFs, etc. Here the data in focus is not just the data within the company's firewall but also data residing outside the company's firewall. This data is placed in the Hadoop Distributed File System (HDFS). If need be, it can be repopulated back to operational systems or fed to the enterprise data warehouse, data marts, or the Operational Data Store (ODS) to be picked up for further processing and analysis.

Figure 2.9 A typical data warehouse environment: data from ERP, CRM, legacy, and third-party apps feeds the data warehouse, which supports reporting/dashboarding, OLAP, ad hoc querying, and modeling.


Figure 2.10 A typical Hadoop environment: disparate sources (web logs, images and videos, social media such as Twitter and Facebook, docs and PDFs) flow into Hadoop (HDFS and MapReduce), from where data can feed analytics, operational systems, the data warehouse, data marts, and the ODS.

2.12 WHAT IS NEW TODAY?

A coexistence strategy that combines the best of the legacy data warehouse and analytics environment with the new power of big data solutions is the best of both worlds. Refer Figure 2.11.

2.12.1 Coexistence of Big Data and Data Warehouse

It is NOT about rip and replace. It will not be possible to get rid of the RDBMS or massively parallel processing (MPP); instead, use the right tool for the right job.
As we are aware, a few companies are a wee bit comfortable working with the incumbent data warehouse for standard BI and analytics reporting, for example the quarterly sales report, customer dashboard, etc. The data warehouse can continue with its standard workload, drawing data from legacy operational systems and storing the historical data to provision traditional BI reporting and analytics needs. However, one will not be able to ignore the power that Hadoop brings to the table with different types of analysis on different types of data. The same operational systems, which till now were engaged in powering the data warehouse, can also populate the big data environment when they're needed for computation-rich processing or for raw data exploration. It will be a tight balancing act to steer the workload to the right platform based on what that platform was designed to do.
Figure 2.11 Big data and data warehouse coexistence: the same disparate sources (web logs, images and videos, social media, docs and PDFs) flow into Hadoop (HDFS and MapReduce) alongside the operational systems, data warehouse, data marts, and ODS.
Here is a thought-provoking piece from Ralph Kimball at a Cloudera webinar:
"Here's a question that made me laugh a little bit, but it's a serious question: 'Well, does this mean that relational databases are going to die?' I think that there was a sense, three or four years ago, that maybe this was all a giant zero-sum game between Hadoop and relational databases, and that has simply gone away.

Everyone has now realized that there's a huge legacy value in relational databases for the purposes they are used for. Not only transaction processing, but for all the much focused, index-oriented queries on that kind of data, and that will continue in a very robust way forever. Hadoop, therefore, will present this alternative kind of environment for different types of analysis for different kinds of data, and the two of them will coexist. And they will call each other. There may be points at which the business user isn't actually quite sure which one of them they are touching at any point of time."
Just as one cannot ignore the powerful analytics capability of Hadoop, one will not be able to ignore the revolutionary developments in RDBMSs, such as in-memory processing, etc. The need of the hour is to have both the data warehouse and Hadoop coexist in today's environment.

2.13 WHAT IS CHANGING IN THE REALMS OF BIG DATA?

Gone are the days when IT and business could work in silos and still see the business through. Today, it is an era of a tight handshake between business, IT, and yet another class called data scientists (more on this in Chapter 3 on "Big Data Analytics"). We are citing three very important reasons why companies should compulsorily consider leveraging big data:
1. Competitive advantage: The most important resource with any organization today is its data. What it does with this data will determine its fate in the market.
2. Decision making: Decision making has shifted from the hands of the elite few to the empowered many. Good decisions play a significant role in furthering customer engagement, reducing operating margins in retail, and cutting cost and other expenditures in the health sector.
Figure 3.2 Types of unstructured data available for analysis: websites, billing (POS), ERP, CRM, RFID, social media.


big time and also take into consideration big data that makes it to the organization at unprecedented level
in terms of volume, velocity, and variety.
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find
unknown correlationsand other useful information to make faster and better decisions. Analytics begin with
analyzing all available data. Refer Figure 3.2.

3.2 WHAT IS BIG DATA ANALYTICS?


Big Data Analytics is...
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in
the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World
Programming Systems (WPS), etc. to help process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.
Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand, and the color suggested was also to my liking. How did they arrive at this? In the recent past, I had been buying clothing of a particular brand, and my color preference was pastel shades. They had this stored in their database and pulled it out while making recommendations to me.

3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision making.
4. A tight handshake between three communities: IT, business users, and data scientists. Refer Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense, as the program for distributed processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today, and likely to be exabytes or zettabytes in the near future).

3.3 WHAT BIG DATA ANALYTICS ISN'T?


We have often asked participants of our learning programs what comes to mind when they hear the term big data. And we are not surprised by the answer... it is "Volume." But now that we have a clear understanding of big data, we know it isn't only about volume; the variety and velocity are very important factors too.
Figure 3.3 What is big data analytics? Better, faster decisions in real time; moving code to data for greater speed and efficiency; richer, deeper insights into customers, partners, and the business; working with datasets whose volume and variety are beyond the storage and processing capability of typical database software; competitive advantage; technology-enabled analytics; IT's collaboration with business users and data scientists; and time-sensitive decisions made in near real time by processing a steady stream of real-time data.

Refer Figure 3.4. Big data isn't just about technology. It is about understanding what the data is saying to us. It is about understanding relationships that we thought never existed between datasets. It is about patterns and trends waiting to be unveiled.
And of course, big data analytics is not here to replace our now very robust and powerful Relational Database Management System (RDBMS) or our traditional data warehouse. It is here to coexist with both the RDBMS and the data warehouse, leveraging the power of each to yield business value. Big data analytics is not a "one-size-fits-all" traditional RDBMS built on shared disk and memory.

One-size fit all" tráditional RDBMS Only about volume


buitt on shared disk and memory

Just about technology


Big Data Analytics isn't ..

Only used by huge online RDBMS


tike Google or Amazon Meant to replace
Companb
Meant to replace data warehouse

Figure 3.4 What big data analytics isn't?


3.4 WHY THIS SUDDEN HYPE AROUND BIG DATA ANALYTICS?
Refer Figure 3.5. Let us pin it down to three prominent reasons:
1. Data is growing at a 40% compound annual rate, reaching nearly 45 ZB by 2020. In 2010, about 1.2 trillion gigabytes of data was generated; this grew to about 2.4 trillion gigabytes in 2012 and to about 5 trillion gigabytes in the year 2014. Business data is expected to double every 1.2 years. Walmart, the world's biggest retailer, processes more than one million customer transactions per hour. 500 million tweets are posted by Twitter users every day. 2.7 billion "Likes" and comments are posted by Facebook users in a day. Every day, 2.5 quintillion bytes of data is created, with 90% of the world's data created in the past 2 years alone.
Source: http://www.ibm.com/software/data/bigdata/what-is-big-data.html
2. The cost per gigabyte of storage has hugely dropped.
3. There is an overwhelming number of user-friendly analytics tools available in the market today.
3.5 CLASSIFICATION OF ANALYTICS
There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

Figure 3.5 What big data entails: more data produced → more data stored → more data analyzed → better predictions → steady growth of analysis.
3.5.1 First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. It is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
3.5.2 Second School of Thought
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 3.1.
Table 3.1 Analytics 1.0, 2.0, and 3.0

Analytics 1.0 (era: mid-1950s to 2009)
• Descriptive statistics (report on events, occurrences, etc. of the past).
• Key questions asked: What happened? Why did it happen?
• Data from legacy systems, ERP, CRM, and third-party applications.
• Small and structured data sources; data stored in enterprise data warehouses or data marts.
• Data was internally sourced.
• Relational databases.

Analytics 2.0 (era: 2005 to 2012)
• Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
• Key questions asked: What will happen? Why will it happen?
• Big data.
• Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
• Data was often externally sourced.
• Database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0 (era: 2012 to present)
• Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage).
• Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
• A blend of big data and data from legacy systems, ERP, CRM, and third-party applications.
• A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
• Data is being both internally and externally sourced.
• In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Figure 3.6 Analytics 1.0, 2.0, and 3.0: descriptive analytics (hindsight: what happened?), diagnostic analytics (insight: why did it happen?), predictive analytics (foresight: what will happen?), and prescriptive analytics (how can we make it happen?).

Figure 3.6 shows the subtle growth of analytics from Descriptive → Diagnostic → Predictive → Prescriptive analytics.

3.6 GREATEST CHALLENGES THAT PREVENT BUSINESSES FROM CAPITALIZING ON BIG DATA

1. Obtaining executive sponsorship for investments in big data and its related activities (such as training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of structured, semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words, the need to address the storage and processing of a large volume, velocity, and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business decisions.
6. Choosing the optimal way to report findings and analysis of big data (visual presentation and analytics) for the presentations to make the most sense.
7. Determining what to do with the insights created from big data.

3.7 TOP CHALLENGES FACING BIG DATA


1. Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is a storage solution that can best withstand the onslaught of large volume, velocity, and variety of big data. Should you scale vertically or should you scale horizontally?
4.2 HADOOP

Hadoop is an open-source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant. He was working with Yahoo then. It was created to support distribution for "Nutch", the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. Hadoop is now a core part of the computing infrastructure for companies such as Yahoo, Facebook, LinkedIn, Twitter, etc. Refer Figure 4.8.

Figure 4.8 Hadoop: an Apache open-source software framework, inspired by Google MapReduce and the Google File System, built around the Hadoop Distributed File System and MapReduce.


4.2.1 Features of Hadoop
Let us cite a few features of Hadoop:
1. It is optimized to handle massive quantities of structured, semi-structured, and unstructured data,
using commodity hardware, that is, relatively inexpensive computers.
2. Hadoop has a shared nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data can still be processed
from another machine that stores its replica.
4. Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP). However, it is not a replacement for a relational database management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies within the data.
7. It is NOT good for processing small files. It works best with huge data files and datasets.

4.2.2 Key Advantages of Hadoop


Refer Figure 4.9 for a quick look at the key advantages of Hadoop. Some of them are as follows:
1. Stores data in its native format: Hadoop's data storage framework (HDFS, the Hadoop Distributed File System) can store data in its native format. There is no structure that is imposed while keying in or storing data. HDFS is pretty much schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.
2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of terabytes of data) across hundreds of inexpensive servers that operate in parallel.
Figure 4.9 Key advantages of Hadoop: stores data in its native format (no loss of information as there is no translation/transformation to any specific schema); scalability (proven to scale by companies like Facebook and Yahoo); delivers new insights for big data analytics; higher availability (fault tolerance through replication of data/failover across computer nodes); reduced cost (lower cost per terabyte of storage and processing); hardware can be added or swapped in or out of a cluster.

3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced cost per terabyte of storage and processing.
4. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data diligently, which means that whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds of data: structured, semi-structured, and unstructured data. It can help derive meaningful business insights from email conversations, social media data, clickstream data, etc. It can be put to several purposes such as log analysis, data mining, recommendation systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional systems, owing to the "move code to data" paradigm.
Hadoop has a shared-nothing architecture.

4.2.3 Versions of Hadoop


There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Let us take a look at the features of both. Refer Figure 4.10.
4.2.3.1 Hadoop 1.0
It has two main parts:
1. Data storage framework: It is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less. It simply stores data files, and these data files can be in just about any format.
Figure 4.10 Versions of Hadoop. Hadoop 1.0: MapReduce (cluster resource management and data processing) on top of HDFS (redundant, reliable storage). Hadoop 2.0: YARN (cluster resource management) running MapReduce and other data processing frameworks on top of HDFS (redundant, reliable storage).

The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what it can implement.
2. Data processing framework: This is a simple functional programming model initially popularized by Google as MapReduce. It essentially uses two functions, the MAP and the REDUCE functions, to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input to produce the output data. The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant, and scalable way.
There were, however, a few limitations of Hadoop 1.0. They are as follows:
1. The first limitation was the requirement for MapReduce programming expertise, along with proficiency in other programming languages, notably Java.
2. It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
3. One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or extract the data from HDFS and process it outside of Hadoop. Neither option was viable, as it led to process inefficiencies caused by the data being moved in and out of the Hadoop cluster.
Let us look at whether these limitations have been wholly or in part resolved by Hadoop 2.0.
4.2.3.2 Hadoop 2.0
In Hadoop 2.0, HDFS continues to be the data storage framework. However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Any application capable of dividing itself into parallel tasks is supported by YARN. YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability, and efficiency of the applications. It works by having an ApplicationMaster (in place of the erstwhile JobTracker) run applications on resources governed by a new NodeManager (in place of the erstwhile TaskTracker). The ApplicationMaster is able to run any application, and not just MapReduce.
This, in other words, means that MapReduce programming expertise is no longer required. Furthermore, Hadoop 2.0 not only supports batch processing but also real-time processing. MapReduce is no longer the only data processing option; other alternative data processing functions, such as data standardization and master data management, can now be performed natively in HDFS.
Figure 4.11 Hadoop ecosystem: Ambari (provisioning, managing, and monitoring Hadoop clusters); Sqoop (relational database data collector); Flume/Chukwa (log data collector); Mahout (machine learning); Pig (data flow); Hive (data warehouse); MapReduce (distributed processing); HBase (distributed table store); Oozie (workflow); ZooKeeper (coordination); all on top of HDFS (Hadoop Distributed File System).

4.2.4 Overview of Hadoop Ecosystems


The components of the Hadoop ecosystem are shown in Figure 4.11.
There are components available in the Hadoop ecosystem for data ingestion, processing, and analysis:
Data Ingestion → Data Processing → Data Analysis
Components that help with Data Ingestion are:
1. Sqoop
2. Flume
Componentsthat help with Data Processing are:
1. MapReduce
2. Spark
Componentsthat help with Data Analysis are:
1. Pig
2. Hive
3. Impala
HDFS
It is the distributed storage unit of Hadoop. It provides streaming access to file system data as well as file permissions and authentication. It is based on GFS (Google File System). It is used to scale a single cluster to hundreds and thousands of nodes. It handles large datasets running on commodity hardware. It is highly fault-tolerant. It stores files across multiple machines, and these files are stored in redundant fashion to allow for data recovery in case of failure.

PICTURE THIS...

An e-commerce website stores millions of customers' data in a distributed manner. The data has been collected over 4-5 years. It then runs batch analytics on the archived data to analyze customers' behavior, buying patterns, preferences, etc. This helps in understanding which products are purchased by which customers.
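Applications typically reach HDFS through its Java FileSystem API. Below is a minimal write-then-read sketch, assuming a configured Hadoop client; the archive path is hypothetical, and the Configuration object picks up the cluster's core-site.xml and hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                    // loads site configuration
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/archive/customers/2014/orders.csv");  // hypothetical path

        // Write once: an HDFS file is written sequentially, then closed.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("cust-001,2014-03-14,599.00\n");
        }

        // Read many times: stream the (replicated) file back.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```

Behind these two calls, HDFS splits the file into blocks, spreads them across DataNodes, and replicates each block, which is what makes batch analytics like the scenario above resilient to machine failures.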
HBase

It stores data in HDFS. It is the first non-batch component of the Hadoop ecosystem. It is a database on top of HDFS. It provides quick random access to the stored data and has very low latency compared to HDFS. It is a NoSQL database: it is non-relational and column-oriented. A table can have thousands of columns and multiple rows. Each row can have several column families; each column family can have several columns; each column can have several key values. It is based on Google BigTable. It is widely used by Facebook, Twitter, Yahoo, etc.

PICTURE THIS...
The same e-commerce website as in the HDFS case above also stores millions of product records. To search for a product among millions of products and to produce the result immediately (or, you could say, in real time), it needs to optimize the request and search process. HBase supports real-time analytics. Given the huge velocity of data, they opted for HBase over HDFS, as HDFS does not support real-time writes. The results were overwhelming: it reduced the query time from 3 days to 3 minutes.
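Random, low-latency reads and writes like the product lookup above go through the HBase client API rather than a batch job. A small sketch using the HBase 1.x Java client; the table name, column family, and row key are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseProductLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("products"))) {

            // Real-time write: one row keyed by product id, column family "details".
            Put put = new Put(Bytes.toBytes("prod-00042"));
            put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("name"),
                          Bytes.toBytes("steel tea kettle"));
            table.put(put);

            // Random read by row key: a single-row lookup, not a table scan.
            Result result = table.get(new Get(Bytes.toBytes("prod-00042")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("details"), Bytes.toBytes("name"))));
        }
    }
}
```

Because rows are indexed by key and stored column-family-wise, a single-row Get never touches the rest of the table, which is what the 3-days-to-3-minutes anecdote hinges on.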

Difference between HBase and Hadoop/HDFS

1. HDFS is the file system whereas HBase is a Hadoop database. They are like NTFS and MySQL.
2. HDFS is WORM (write once and read multiple times or many times). The latest versions support appending of data, but this feature is rarely used. However, HBase supports real-time random reads and writes.
3. HDFS is based on the Google File System (GFS) whereas HBase is based on Google BigTable.
4. HDFS supports only full table scans or partition table scans. HBase supports random small range scans or table scans.
5. Performance of Hive on HDFS is relatively very good, but for HBase it becomes 4-5 times slower.
6. The access to data is via MapReduce jobs only in HDFS, whereas in HBase the access is via Java APIs and REST, Avro, or Thrift APIs.
7. HDFS does not support dynamic storage owing to its rigid structure, whereas HBase supports dynamic storage.
8. HDFS has high-latency operations whereas HBase has low-latency operations.
9. HDFS is most suitable for batch analytics whereas HBase is for real-time analytics.
Hadoop Ecosystem Components for Data Ingestion
1. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are:
a) Importing data from RDBMSs such as MySQL, Oracle, DB2, etc. to the Hadoop file system (HDFS, HBase, Hive).
b) Exporting data from the Hadoop file system (HDFS, HBase, Hive) to RDBMSs (MySQL, Oracle, DB2).
Uses of Sqoop:
a) It has a connector-based architecture to allow plug-ins to connect to external systems such as MySQL, Oracle, DB2, etc.
b) It can provision the data from an external system on to HDFS and populate tables in Hive and HBase.
c) It integrates with Oozie, allowing you to schedule and automate import and export tasks (a minimal import sketch appears after the Flume example below).
2. Flume: Flume is an important log aggregator (it aggregates logs from different machines and places them in HDFS) component in the Hadoop ecosystem. Flume has been developed by Cloudera. It is designed for high-volume ingestion of event-based data into Hadoop. The default destination in Flume (called the sink in Flume parlance) is HDFS. However, it can also write to HBase or Solr.

PICTURE THIS...
There is a bank of web servers. Flume moves log events from those files into new aggregated files in HDFS for processing.
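Sqoop itself is normally driven from the command line, but it also exposes a Java entry point. A minimal sketch of the RDBMS-to-HDFS import described above, assuming Sqoop 1.x on the classpath; the JDBC URL, credentials, table, and target directory are hypothetical:

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Pull the "customers" table from MySQL into an HDFS directory.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",
            "--username", "etl_user",
            "--password", "secret",
            "--table", "customers",
            "--target-dir", "/data/sales/customers"
        };
        int exitCode = Sqoop.runTool(importArgs);  // same arguments the sqoop CLI accepts
        System.exit(exitCode);
    }
}
```

The same argument list works verbatim with the sqoop command-line client, which under the hood generates a MapReduce job that reads the table in parallel.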

Hadoop Ecosystem Components for Data Processing


1. MapReduce: It is a programming paradigm that allows distributed and parallel processing of huge datasets. It is based on Google MapReduce. Google released a paper on MapReduce in 2004, and that became the genesis of the Hadoop processing model. The MapReduce programming framework gets the input data from HDFS. There are two main phases: the Map phase and the Reduce phase. The map phase converts the input data into another set of data (key-value pairs). This new intermediate dataset then serves as the input to the reduce phase. The reduce phase acts on the datasets to combine (aggregate and consolidate) and reduce them to a smaller set of tuples. The result is then stored back in HDFS.
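To make the two phases concrete, here is the canonical word-count job written against the Hadoop MapReduce Java API (input and output HDFS paths are passed on the command line). The mapper emits a (word, 1) pair per token, the framework shuffles and groups the pairs by key, and the reducer sums each group:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line becomes a series of (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: (word, [1, 1, ...]) collapses to (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```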
2. Spark: It is both a programming model and a computing model. It is an open-source big data processing framework. It was originally developed in 2009 at UC Berkeley's AMPLab and became an open-source project in 2010. It is written in Scala. It provides in-memory computing for Hadoop. In Spark, workloads execute in memory rather than on disk, owing to which it is much faster (10 to 100 times) than when the workload is executed on disk. However, if the datasets are too large to fit into the available system memory, it can perform conventional disk-based processing. It serves as a potentially faster and more flexible alternative to MapReduce. It accesses data from HDFS (Spark does not have its own distributed file system) but bypasses the MapReduce processing.
Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or used independently of Hadoop (standalone). As a programming model, it works well with Scala, Python (it has API connectors for using it with Java or Python), or the R programming language.
The following are the Spark libraries:
a) Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in disparate applications.
b) Spark Streaming: It helps to analyze and present data in real time.
c) MLlib: It supports machine learning, such as applying advanced statistical operations on data in a Spark cluster.
d) GraphX: It helps in graph-parallel computation.
Spark and Hadoop are usually used together by several companies. Hadoop was primarily designed to house unstructured data and run batch processing operations on it. Spark is used where faster, in-memory or near-real-time processing of that data is needed.
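For contrast with MapReduce, the same word count in Spark's Java API chains the transformations in memory instead of writing intermediate results to disk between phases. A sketch assuming the Spark 2.x Java API; the HDFS paths are hypothetical:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/weblogs");  // hypothetical input
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.cache();  // keep the result in memory for further iterative queries
        counts.saveAsTextFile("hdfs:///data/wordcounts");             // hypothetical output
        sc.stop();
    }
}
```

Nothing executes until saveAsTextFile forces the lazy pipeline, which is also what lets Spark keep the intermediate RDDs in memory.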
4.2.5 Hadoop Distributions
Hadoop is an open-source Apache project. Anyone can freely download the core aspects of Hadoop. The core aspects of Hadoop include the following:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN (Yet Another Resource Negotiator)
4. Hadoop MapReduce
There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata, Hortonworks, Cloudera, etc., that have packaged Hadoop into more easily consumable distributions or services. Although these companies have slightly different strategies, the key essence remains the ability to distribute data and workloads across potentially thousands of servers, thus making big data manageable. A few Hadoop distributions are given in Figure 4.12.
Figure 4.12 Hadoop distributions: Intel Distribution for Apache Hadoop Software; Hortonworks; Cloudera's Distribution Including Apache Hadoop (CDH); EMC Greenplum HD; IBM InfoSphere BigInsights; MapR M5 Edition; Microsoft Big Data Solution.


4.2.6 Hadoop versus SQL
Table 4.6 lists the differences between Hadoop and SQL.
Table 4.6 Hadoop versus SQL

Hadoop                      SQL
Scale out                   Scale up
Key-value pairs             Relational tables
Functional programming      Declarative queries
Offline batch processing    Online transaction processing

4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors

Refer Figure 4.13 to get a glimpse of the leading market vendors offering integrated Hadoop systems.

Figure 4.13 Integrated Hadoop systems: EMC Greenplum, Oracle Big Data Appliance, Microsoft Big Data Solution, IBM InfoSphere, HP Big Data Solutions.

4.2.8 Cloud-Based Hadoop Solutions

Amazon Web Services holds out a comprehensive, end-to-end portfolio of cloud computing services to help manage big data. The aim is to achieve this and more while retaining the emphasis on reducing costs, scaling to meet demand, and accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to run MapReduce jobs directly on data in Google Cloud Storage, without the need to copy it to local disk or into the Hadoop Distributed File System (HDFS). The connector simplifies Hadoop deployment, reduces cost, and provides performance comparable to HDFS, all this while increasing reliability by eliminating the single point of failure of the name node. Refer Figure 4.14.
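In practice, the connector simply makes gs:// URIs resolvable through Hadoop's FileSystem API, so an ordinary job can point its input and output at Cloud Storage. A sketch of a driver reusing the WordCount mapper and reducer from Section 4.2.4; the bucket name is hypothetical, and the connector JAR with its fs.gs.* settings is assumed to be installed on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GcsWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount over GCS");
        job.setJarByClass(GcsWordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // gs:// paths are read and written directly; no copy into HDFS is needed.
        FileInputFormat.addInputPath(job, new Path("gs://example-bucket/weblogs/"));
        FileOutputFormat.setOutputPath(job, new Path("gs://example-bucket/wordcounts/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```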

Figure 4.14 Cloud-based solutions: Amazon Web Services, Google BigQuery.
