Big Data, Hadoop
2.1 CHARACTERISTICS OF DATA
Let us start with the characteristics of data. As depicted in Figure 2.1, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?" "Why was this data generated?" "How sensitive is this data?" "What are the events associated with this data?" and so on.
Figure 2.1 Characteristics of data: composition, condition, and context.

Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on. Big data is about complexity: complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioral or social) that is being generated.
2.2 EVOLUTION OF BIG DATA
The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s. The era was of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data. Refer Table 2.1.
Table 2.1 The evolution of big data

Data Generation and Storage | Data Utilization | Data Driven
Complex and unstructured | Structured data, unstructured data, multimedia data |
2.3 DEFINITION OF BIG DATA
If we were to ask you the simple question "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about 3 Vs.
Refer Figure 2.2. Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary

The 3Vs concept (Volume, Variety, and Velocity) was proposed by the Gartner analyst Doug Laney in a 2001 MetaGroup research publication, titled "3D Data Management: Controlling Data Volume, Velocity, and Variety".
Source: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.
Figure 2.2 Some responses to "Define Big Data": "Terabytes or petabytes or zettabytes..."; "Wait a minute... heard yottabytes too!!"; "Today's BIG may be tomorrow's NORMAL."
Figure 2.3 Definition of big data: high-volume, high-velocity, and high-variety information assets; cost-effective, innovative forms of information processing; enhanced insight and decision making.
Part I of the definition, "big data is high-volume, high-velocity, and high-variety information assets", talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.

Part II of the definition, "cost effective, innovative forms of information processing", talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.

Part III of the definition, "enhanced insight and decision making", talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.

Data → Information → Actionable intelligence → Better decisions → Enhanced business value
2.4 CHALLENGES WITH BIG DATA
Refer Figure 2.4. Following are a few challenges with big data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading is concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases, the data may quickly become irrelevant and obsolete just a few hours after having been generated.
Figure 2.4 Challenges with big data: capture, search, analysis, transfer, visualization, privacy violations.
4. There is a dearth of skilled professionals who possess a high level of proficiency in data sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big the dataset should be for it to be considered "big data." Here we are to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic and therefore there is a need to ingest this as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number, as far as business visualization experts are concerned.
2.5.1 Volume
We have seen it grow from bits to bytes to petabytes and exabytes. Refer Table 2.2 and Figure 2.6.

Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
2.5.1.1 Where Does This Data Get Generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website is unstructured data too.
Figure 2.5 Data: big in volume, velocity, and variety. Velocity: batch, periodic, real time; volume: MB, GB, TB, PB; variety: table, database, photo, web, social, audio, video, mobile.
Table 2.2 From bits to yottabytes

Bits       0 or 1
Bytes      8 bits
Kilobytes  1024 bytes
Megabytes  1024² bytes
Gigabytes  1024³ bytes
Terabytes  1024⁴ bytes
Petabytes  1024⁵ bytes
Exabytes   1024⁶ bytes
Zettabytes 1024⁷ bytes
Yottabytes 1024⁸ bytes
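Since each unit in Table 2.2 is 1024 times the previous one, the whole progression is easy to compute. Here is a small, purely illustrative Python sketch that prints the table:

    # Each unit is 1024 times the previous one: 1 kilobyte = 1024^1 bytes, etc.
    units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
             "petabyte", "exabyte", "zettabyte", "yottabyte"]

    for power, unit in enumerate(units, start=1):
        print(f"1 {unit} = 1024^{power} bytes = {1024 ** power:,} bytes")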
Figure 2.7 Sources of big data: archives, media, docs, sensor data, machine log data, business apps, public web, social media, and data storage.
1. Veracity and validity: Veracity refers to biases, noise, and abnormality in data. The key question here is: "Is all the data that is being stored, mined, and analyzed meaningful and pertinent to the problem under consideration?" Validity refers to the accuracy and correctness of the data. Any data that is picked up for analysis needs to be accurate. It is not just true about big data alone.
2. Volatility: Volatility of data deals with how long the data is valid and how long it should be stored. There is some data that is required for long-term decisions and remains valid for longer periods of time. However, there are also pieces of data that quickly become obsolete minutes after their generation.
3. Variability: Data flows can be highly inconsistent with periodic peaks.
PICTURE THIS...
An online retailer announces the "big sale day" for a particular week. The retailer is likely to experience an upsurge in customer traffic to the website during this week. In the same way, he/she might experience a slump in his/her business immediately after the festival season. This reemphasizes the point that one might witness spikes in data at some point in time, and at other times the data flow can go flat.
Figure 2.10 A typical data warehouse environment: ERP, CRM, legacy, and third-party apps feed the data warehouse, data marts, and ODS to support reporting/dashboarding, OLAP, ad hoc querying, and modeling; alongside, web logs and social media (Twitter, Facebook, etc.) feed MapReduce.
A coexistence strategy that combines the best of the legacy data warehouse and analytics environment with the new power of big data solutions is the best of both worlds. Refer Figure 2.11.
Figure 2.11 Coexistence of the data warehouse and Hadoop: operational systems feed the data warehouse, data marts, and ODS, while images and videos, social media (Twitter, Facebook, etc.), and docs & PDFs feed Hadoop/MapReduce.
...simply gone away. Everyone has now realized that there's a huge legacy value in relational databases for the purposes they are used for: not only transaction processing, but for all the much focused, index-oriented queries on that kind of data, and that will continue in a very robust way forever. Hadoop, therefore, will present this alternative kind of environment for different types of analysis for different kinds of data, and the two of them will coexist. And they will call each other. There may be points at which the business user isn't actually quite sure which one of them they are touching at any point of time.
Just as one cannot ignore the powerful analytics capability of Hadoop, one will not be able to ignore the revolutionary developments in RDBMS such as in-memory processing, etc. The need of the hour is to have both the data warehouse and Hadoop coexist in today's environment.
3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
Refer Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today and likely to be exabytes or zettabytes in the near future). A toy simulation of this idea follows below.
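The following is a minimal local Python sketch of the "move code to data" principle, using multiprocessing workers as stand-ins for cluster nodes; nothing in it is Hadoop-specific, and the partitions and the analyze function are purely illustrative:

    from multiprocessing import Pool

    def analyze(partition):
        """The 'code' shipped to each node: a few bytes of logic that
        runs where the data already lives (here, a local aggregation)."""
        return sum(partition)

    # Each list stands in for a data block resident on a different node.
    partitions = [
        list(range(0, 1_000)),
        list(range(1_000, 2_000)),
        list(range(2_000, 3_000)),
    ]

    if __name__ == "__main__":
        with Pool(len(partitions)) as pool:
            # The small function travels to each partition; only the tiny
            # partial results travel back, never the full data.
            partial_sums = pool.map(analyze, partitions)
        print(sum(partial_sums))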
Figure 3.3 What is big data analytics? Technology-enabled analytics; IT's collaboration with business users and data scientists; time-sensitive decisions made in near real time by processing a steady stream of real-time data.
Refer Figure 3.4. Big data isn't just about technology. It is about understanding what the data is saying to us. It is about understanding relationships that we thought never existed between datasets. It is about patterns and trends waiting to be unveiled. And of course, big data analytics is not here to replace our now very robust and powerful Relational Database Management System (RDBMS) or our traditional Data Warehouse. It is here to coexist with both RDBMS and Data Warehouse, leveraging the power of each to yield business value. Big data analytics is not a "one-size-fits-all" traditional RDBMS built on shared disk and memory.
Figure 3.5 What big data entails: more data analyzed.
3.5.1 First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. It is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
3.5.2 Second School of Thought
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 3.1.
Table 3.1 Analytics 1.0, 2.0, and 3.0

Analytics 1.0 (era: mid-1950s to 2009)
- Descriptive statistics (report on events, occurrences, etc. of the past).
- Key questions asked: What happened? Why did it happen?
- Data from legacy systems, ERP, CRM, and third-party applications.
- Small and structured data sources; data stored in enterprise data warehouses or data marts.
- Data was internally sourced.
- Relational databases.

Analytics 2.0 (era: 2005 to 2012)
- Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
- Key questions asked: What will happen? Why will it happen?
- Big data.
- Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
- Data was often externally sourced.
- Database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0 (era: 2012 to present)
- Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage).
- Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
- A blend of big data and data from legacy systems, ERP, CRM, and third-party applications.
- A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
- Data is both internally and externally sourced.
- In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Figure 3.6 Analytics 1.0, 2.0, and 3.0: descriptive analytics answers "What happened?" (hindsight), diagnostic analytics answers "Why did it happen?" (insight), predictive analytics answers "What will happen?" (foresight), and prescriptive analytics answers "How can we make it happen?"
Figure 3.6 shows the subtle growth of analytics from descriptive → diagnostic → predictive → prescriptive analytics.
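To make these flavors concrete, here is a small Python sketch in which descriptive analytics summarizes past sales, predictive analytics fits a simple least-squares trend to project the next period, and prescriptive analytics turns that projection into a recommended action; the sales figures and the stocking rule are made up for illustration:

    from statistics import mean

    # Monthly sales for the past six months (illustrative numbers).
    sales = [120, 135, 150, 160, 178, 195]

    # Descriptive: what happened?
    print("average monthly sales:", mean(sales))

    # Predictive: what will happen? Fit a least-squares line y = a + b*x.
    n = len(sales)
    xs = range(n)
    num = n * sum(x * y for x, y in zip(xs, sales)) - sum(xs) * sum(sales)
    den = n * sum(x * x for x in xs) - sum(xs) ** 2
    b = num / den
    a = mean(sales) - b * mean(xs)
    forecast = a + b * n  # projection for month 7
    print("forecast for next month:", round(forecast))

    # Prescriptive: what action should we take? (hypothetical stocking rule)
    print("recommended stock level:", round(forecast * 1.10))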
1. Obtaining executive sponsorships for investments in big data and its related activities (such as training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of structured, semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words, the need to address the storage and processing of large volume, velocity, and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business decisions.
6. Choosing the optimal way to report findings and analysis of big data (visual presentation and analytics) for the presentations to make the most sense.
7. Determining what to do with the insights created from big data.
Hadoop is an Apache open-source software framework inspired by Google MapReduce and the Google File System.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced cost/terabyte of storage and processing.
4. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data diligently, which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure, there will always be another copy of data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds of data: structured, semi-structured, and unstructured data. It can help derive meaningful business insights from email conversations, social media data, click-stream data, etc. It can be put to several purposes such as log analysis, data mining, recommendation systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional systems owing to the "move code to data" paradigm.
Hadoop has a shared-nothing architecture.
1. Data storage framework: This is the Hadoop Distributed File System (HDFS), which can store data files in just about any format. The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much needed flexibility and agility without being overly worried by what it can implement.
2. Data processing framework: This is a simple functional programming model initially popularized by Google as MapReduce. It essentially uses two functions, the MAP and the REDUCE functions, to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input to produce the output data. The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant, and scalable way. A single-process sketch of this model follows below.
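Here is a minimal Python sketch of the MAP and REDUCE functions for the classic word-count problem; in a real Hadoop job the map, shuffle/sort, and reduce phases run distributed across the cluster, whereas here they are simulated in a single process:

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        """MAP: emit an intermediate (word, 1) pair for every word."""
        for word in line.split():
            yield word.lower(), 1

    def reducer(word, counts):
        """REDUCE: aggregate all counts emitted for one word."""
        return word, sum(counts)

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]

    # Shuffle/sort phase: bring all pairs with the same key together.
    pairs.sort(key=itemgetter(0))

    # Reduce phase: one reducer call per distinct key.
    for word, group in groupby(pairs, key=itemgetter(0)):
        print(reducer(word, (count for _, count in group)))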
There were, however, a few limitations of Hadoop 1.0. They are as follows:
1. The first limitation was the requirement for MapReduce programming expertise along with proficiency required in other programming languages, notably Java.
2. It supported only batch processing, which although suitable for tasks such as log analysis and large-scale data mining projects is pretty much unsuitable for other kinds of projects.
3. One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop or extract the data from HDFS and process it outside of Hadoop. None of the options were viable, as it led to process inefficiencies caused by the data being moved in and out of the Hadoop cluster.
Let us look at whether these limitations have been wholly or in part resolved by Hadoop 2.0.
4.2.3.2 Hadoop 2.0
In Hadoop 2.0, HDFS continues to be the data storage framework. However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Any application capable of dividing itself into parallel tasks is supported by YARN. YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability, and efficiency of the applications. It works by having an ApplicationMaster in place of the erstwhile JobTracker, running applications on resources governed by a new NodeManager (in place of the erstwhile TaskTracker). The ApplicationMaster is able to run any application and not just MapReduce.
This, in other words, means that MapReduce programming expertise is no longer required. Furthermore, YARN not only supports batch processing but also real-time processing. MapReduce is no longer the only data processing option; other alternative data processing functions such as data standardization and master data management can now be performed natively in HDFS.
Figure 4.11 Hadoop ecosystem: Ambari (provisioning, managing, and monitoring Hadoop clusters); Sqoop (relational database data collector); Flume/Chukwa (log data collector); Hive (data warehouse); Pig (data flow); Mahout (machine learning); MapReduce (distributed processing); HBase (distributed table store); Oozie (workflow); ZooKeeper (coordination); HDFS (Hadoop Distributed File System).
HBase stores data in HDFS. It is the first non-batch component of the Hadoop ecosystem. It is a database on top of HDFS. It provides quick random access to the stored data. It has very low latency compared to HDFS. It is a NoSQL database: it is non-relational and is a column-oriented database. A table can have thousands of columns. A table can have multiple rows. Each row can have several column families. Each column family can have several columns. Each column can have several key values. It is based on Google BigTable. It is widely used by Facebook, Twitter, Yahoo, etc.
PICTURE THIS...
The same e-commerce website as in the HDFS case above also stores millions of product data. To search for a product among millions of products and to produce the result immediately (or, you can say, in real time), it needs to optimize the request and search process. HBase supports real-time analytics. Given the huge velocity of data, they opted for HBase over HDFS, as HDFS does not support real-time writes. The results were overwhelming: it reduced the query time from 3 days to 3 minutes.
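As an illustration of HBase's row-key/column-family model and its low-latency random access, the sketch below uses the third-party happybase Python client, which talks to HBase through its Thrift gateway; the table name, column family, and row key are hypothetical, and it assumes an HBase Thrift server is running on localhost:

    import happybase

    # Assumes an HBase Thrift server is reachable on localhost (port 9090).
    connection = happybase.Connection('localhost')

    # Create a table with a single column family named 'info' (hypothetical).
    connection.create_table('products', {'info': dict()})
    table = connection.table('products')

    # Each row key maps to columns grouped under column families.
    table.put(b'product-001', {b'info:name': b'widget',
                               b'info:price': b'9.99'})

    # Random, low-latency read by row key.
    print(table.row(b'product-001'))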
c) It integrates with Oozie, allowing you to schedule and automate import and export tasks.
2. Flume: Flume is an important log aggregator (aggregates logs from different machines and places them in HDFS) component in the Hadoop ecosystem. Flume has been developed by Cloudera. It is designed for high-volume ingestion of event-based data into Hadoop. The default destination in Flume (called sink in Flume parlance) is HDFS. However, it can also write to HBase or Solr.
PICTURE THIS...
There is a bank of web servers. Flume moves log events from those files into new aggregated files in HDFS for processing.
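Flume agents themselves are configured declaratively rather than programmed, but the tail-and-aggregate pattern they implement can be sketched in a few lines of Python; the file paths below are hypothetical, and this stand-in simply follows a growing log file and appends each new event to an aggregate file (the role the HDFS sink plays in Flume):

    import time

    SOURCE_LOG = "/var/log/webserver/access.log"  # hypothetical web server log
    AGGREGATE = "/tmp/aggregated_events.log"      # stand-in for the HDFS sink

    def tail_and_aggregate(source_path, sink_path):
        """Follow the source log and append new events to the sink,
        roughly the source -> channel -> sink flow of a Flume agent."""
        with open(source_path) as src, open(sink_path, "a") as sink:
            src.seek(0, 2)                 # start at end of file, like tail -F
            while True:
                line = src.readline()
                if not line:
                    time.sleep(0.5)        # wait for new events to arrive
                    continue
                sink.write(line)           # deliver the event to the "sink"
                sink.flush()

    if __name__ == "__main__":
        tail_and_aggregate(SOURCE_LOG, AGGREGATE)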
Hortonworks
MapR M5 Edition
EMC Greenplum
IBM InfoSphere