Module 1
• Dr. Ben and Dr. Stanley, two doctors at “GoodLife”, have been exchanging emails about a particular case of gastro-intestinal problem. Dr. Stanley has chanced upon a particular combination of drugs that has cured gastro-intestinal disorders in his patients. He has written an email about this combination of drugs to Dr. Ben.
• Dr. Mark has a patient in the “GoodLife” emergency unit with a case quite similar to the gastro-intestinal disorder whose cure Dr. Stanley has chanced upon. Dr. Mark has already tried the regular drugs but with no positive results so far. He quickly searches the organization’s database for answers, but with no luck. The information he wants is tucked away in the email conversation between two other “GoodLife” doctors, Dr. Ben and Dr. Stanley. Dr. Mark would have accessed the solution with a few mouse clicks had the storage and analysis of unstructured data been undertaken by “GoodLife”.
• As is the case at “GoodLife”, 80-85% of the data in any organization is unstructured, and it is growing at an alarming rate. An enormous amount of knowledge is buried in this data. In the above example, Dr. Stanley’s email to Dr. Ben, being in an unstructured format, had not been captured in the medical system.
• Unstructured data, thus, is data which cannot be stored in the form of rows and columns as in a database and does not conform to any data model, i.e. it is difficult to determine the meaning of the data. It does not follow any rules or semantics. It can be of any type and is hence unpredictable.
Semi-structured Data
Semi-structured data does not conform to any data model, i.e. it is difficult to determine the meaning of the data, nor can the data be stored in rows and columns as in a database. However, semi-structured data has tags and markers which help to group the data and describe how it is stored. These give some metadata, but it is not sufficient for the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties, as in the illustration of semi-structured data comparing Address 1 and Address 2.
• For example, an e-mail follows a standard format: To:, From:, Subject:, CC:, Body:.
• The tags give us some metadata, but the body of the e-mail has no format, nor does it convey the meaning of the data it contains.
• There is a very fine line between unstructured and semi-structured data.
Structured Data
• Structured data is organized in semantic chunks (entities)
• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions (attributes)
• Descriptions for all entities in a group (schema) have the same defined format, have a predefined length, are all present, and follow the same order.
What is Big Data? Introduction, Types, Characteristics, Examples
What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Now, let’s learn the definition of Big Data.
What is Big Data?
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, Big Data is still data, but of enormous size.
What is an Example of Big Data?
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in the form of photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types Of Big Data
Following are the types of Big Data:
1.Structured
2.Unstructured
3.Semi-structured
Structured
Any data that can be stored, accessed, and processed in the form of a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and for deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes being in the range of multiple zettabytes.
Do you know? 10^21 bytes (one billion terabytes) equal one zettabyte.
Looking at these figures one can easily understand why the name Big Data is
given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of ‘structured’ data.
Examples Of Structured Data
An ‘Employee’ table in a database is an example of Structured Data
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000
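Because the table has a fixed, predefined schema, it can be loaded directly into a relational database. Below is a minimal, illustrative Python sketch using the standard sqlite3 module as a stand-in for any relational DBMS; the table and column names simply mirror the example above.

import sqlite3

# In-memory SQLite database as a stand-in for a relational DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Employee (
           Employee_ID    INTEGER PRIMARY KEY,
           Employee_Name  TEXT,
           Gender         TEXT,
           Department     TEXT,
           Salary_In_lacs INTEGER
       )"""
)

rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the schema is known in advance, querying is straightforward.
for name, salary in conn.execute(
    "SELECT Employee_Name, Salary_In_lacs FROM Employee WHERE Department = 'Finance'"
):
    print(name, salary)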
Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them but, unfortunately, they don’t know how to derive value out of it, since this data is in its raw form or unstructured format.
Examples Of Un-structured Data
The output returned by ‘Google Search’
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
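To see how the tags provide partial structure, here is a small illustrative Python sketch that parses the records above with the standard xml.etree.ElementTree module; the records are wrapped in an assumed <people> root element so that the snippet forms a single well-formed document.

import xml.etree.ElementTree as ET

# The <rec> elements from above, wrapped in an assumed root element.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # The tags act as metadata: named fields can be pulled out even though
    # there is no fixed relational schema behind the data.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))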
Characteristics Of Big Data
Big data can be described by the following characteristics:
1.Volume
2.Variety
3.Velocity
4.Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value of the data. Also, whether particular data can actually be considered Big Data or not depends upon the volume of that data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for the storage, mining and analysis of data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the
data at times, thus hampering the process of being able to handle and manage
the data effectively.
Advantages Of Big Data Processing
The ability to process Big Data in a DBMS brings in multiple benefits, such as:
1.Businesses can utilize outside intelligence while making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
2.Improved customer service
Traditional customer feedback systems are getting replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and
natural language processing technologies are being used to read and evaluate
consumer responses.
3.Early identification of risk to the product/services, if any
4.Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for
new data before identifying what data should be moved to the data warehouse.
In addition, such integration of Big Data technologies and data warehouse helps
an organization to offload infrequently accessed data.
What is Big Data Analytics?
Big Data analytics is a process used to extract meaningful insights, such as hid-
den patterns, unknown correlations, market trends, and customer preferences.
Big Data analytics provides various advantages—it can be used for better deci-
sion making, preventing fraudulent activities, among other things.
Today, there are millions of data sources that generate data at a very rapid rate.
These data sources are present across the world. Some of the largest sources
of data are social media platforms and networks. Let’s use Facebook as an
example—it generates more than 500 terabytes of data every day. This data
includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data,
and unstructured data. For example, in a regular Excel sheet, data is classified
as structured data—with a definite format. In contrast, emails fall under semi-
structured, and your pictures and videos fall under unstructured data. All this
data combined makes up Big Data.
As the field of Big Data analytics continues to evolve, we can expect to see even
more amazing and transformative applications of this technology in the years
to come.
• Stage 3 - Data filtering - All of the identified data from the
previous stage is filtered here to remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the
tool is extracted and then transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same
fields across different datasets are integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and
statistical tools to discover useful information.
• Stage 7 - Visualization of data - With tools like Tableau, Power
BI, and QlikView, Big Data analysts can produce graphic visu-
alizations of the analysis.
• Stage 8 - Final analysis result - This is the last step of the Big
Data analytics lifecycle, where the final results of the analysis
are made available to business stakeholders who will take action.
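As a rough, purely illustrative Python sketch of stages 3 to 6, the snippet below filters out corrupt records, transforms the remaining values, aggregates them by a shared field, and computes a simple statistic; the field names and numbers are made up for the example.

# Made-up, illustrative records: one is corrupt (missing amount).
raw_records = [
    {"region": "North", "amount": "1200"},
    {"region": "South", "amount": None},          # corrupt record
    {"region": "North", "amount": "800"},
    {"region": "South", "amount": "1500"},
]

# Stage 3 - filtering: drop corrupt records.
clean = [r for r in raw_records if r["amount"] is not None]

# Stage 4 - extraction/transformation: convert amounts to numbers.
for r in clean:
    r["amount"] = float(r["amount"])

# Stage 5 - aggregation: combine records that share the same field.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]

# Stage 6 - analysis: a simple statistic over the aggregated data.
print(totals)
print("overall average:", sum(totals.values()) / len(totals))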
1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps
in creating reports, like a company’s revenue, profit, sales, and so on. Also, it
helps in the tabulation of social media metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facil-
ity utilization across its office and lab space. Using descriptive analytics, Dow
was able to identify underutilized space. This space consolidation helped the
company save nearly US $4 million annually.
2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques
like drill-down, data mining, and data recovery are all examples. Organizations
use diagnostic analytics because they provide an in-depth insight into a
particular problem.
Use Case: An e-commerce company’s report shows that their sales have
gone down, although customers are adding products to their carts. This can
be due to various reasons like the form didn’t load correctly, the shipping fee
is too high, or there are not enough payment options available. This is where
you can use diagnostic analytics to find the reason.
3. Predictive Analytics
This type of analytics looks into the historical and present data to make pre-
dictions of the future. Predictive analytics uses data mining, AI, and machine
learning to analyze current data and make predictions about the future. It
works on predicting customer trends, market trends, and so on.
4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.
• Education - Used to develop new and improve existing courses
based on market requirements
• Healthcare - With the help of a patient’s medical history, Big
Data analytics is used to predict how likely they are to have
health issues
• Media and entertainment - Used to understand the demand of
shows, movies, songs, and more to deliver a personalized recom-
mendation list to its users
• Banking - Customer income and spending patterns help to pre-
dict the likelihood of choosing various banking offers, like loans
and credit cards
• Telecommunications - Used to forecast network capacity and
improve customer experience
• Government - Big Data analytics helps governments in law en-
forcement, among other things
1.In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.
2.While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
3.In 2003, Google introduced a file system known as GFS (Google file system).
It is a proprietary distributed file system developed to provide efficient access
to data.
4.In 2004, Google released a white paper on Map Reduce. This technique sim-
plifies the data processing on large clusters.
5.In 2005, Doug Cutting and Mike Cafarella introduced a new file system known
as NDFS (Nutch Distributed File System). This file system also includes Map
reduce.
6.In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop’s first version, 0.1.0, was released in this year.
7.Doug Cutting named his project Hadoop after his son’s toy elephant.
8.In 2007, Yahoo ran two clusters of 1000 machines.
9.In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds.
10.In 2013, Hadoop 2.2 was released.
11.In 2017, Hadoop 3.0 was released.
Year  Event
2003  Google released the paper on the Google File System (GFS).
2004  Google released a white paper on MapReduce.
2006  Hadoop introduced. Hadoop 0.1.0 released. Yahoo deploys 300 machines and within this year reaches 600 machines.
2007  Yahoo runs 2 clusters of 1000 machines. Hadoop includes HBase.
2008  YARN JIRA opened. Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds. Yahoo clusters loaded with 10 terabytes per day. Cloudera was founded as a Hadoop distributor.
2009  Yahoo runs 17 clusters of 24,000 machines. Hadoop becomes capable enough to sort a petabyte. MapReduce and HDFS become separate subprojects.
2010  Hadoop added support for Kerberos. Hadoop operates 4,000 nodes with 40 petabytes. Apache Hive and Pig released.
2011  Apache ZooKeeper released. Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012  Apache Hadoop 1.0 version released.
2013  Apache Hadoop 2.2 version released.
2014  Apache Hadoop 2.6 version released.
2015  Apache Hadoop 2.7 version released.
2017  Apache Hadoop 3.0 version released.
Four modules comprise the primary Hadoop framework and work collectively
to form the Hadoop ecosystem:
Hadoop Distributed File System (HDFS): As the primary component of the
Hadoop ecosystem, HDFS is a distributed file system in which individual
Hadoop nodes operate on data that resides in their local storage. This removes
network latency, providing high-throughput access to application data. In
addition, administrators don’t need to define schemas up front.
Yet Another Resource Negotiator (YARN): YARN is a resource-management
platform responsible for managing compute resources in clusters and using them
to schedule users’ applications. It performs scheduling and resource allocation
across the Hadoop system.
MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of larger datasets and the instructions for processing those subsets are dispatched to multiple different nodes, where each subset is processed in parallel with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset.
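As a minimal, single-process illustration of the MapReduce programming model (not a distributed implementation), the Python sketch below counts words: the map step emits key-value pairs from each record, the pairs are grouped by key as in the shuffle, and the reduce step combines each group into a smaller result.

from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: each input record produces intermediate (key, value) pairs.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

# Reduce phase: combine each group into a smaller, aggregated result.
def reduce_fn(key, values):
    return key, sum(values)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # e.g. {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}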
Hadoop Common: Hadoop Common includes the libraries and utilities used and
shared by other Hadoop modules.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosys-
tem continues to grow and includes many tools and applications to help collect,
store, process, analyze, and manage big data. These include Apache Pig, Apache
Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
How does Hadoop work?
Hadoop allows for the distribution of datasets across a cluster of commodity
hardware. Processing is performed in parallel on multiple servers simultane-
ously.
Software clients input data into Hadoop. HDFS handles metadata and the dis-
tributed file system. MapReduce then processes and converts the data. Finally,
YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.
What are the benefits of Hadoop?
Scalability
Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has
a large ecosystem of tools, Hadoop is a low-cost option for the storage and
management of big data.
Flexibility
Hadoop allows for flexibility in data storage as data does not require preprocess-
ing before storing it which means that an organization can store as much data
as they like and then utilize it later.
Resilience
As a distributed computing model, Hadoop allows for fault tolerance and system resilience, meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one Hadoop cluster is replicated across other nodes within the system to fortify against the possibility of hardware or software failure.
What are the challenges of Hadoop?
MapReduce complexity and limitations
As a file-intensive system, MapReduce can be a difficult tool to utilize for com-
plex jobs, such as interactive analytical tasks. MapReduce functions also need
to be written in Java and can require a steep learning curve. The MapReduce
ecosystem is quite large, with many components for different functions that can
make it difficult to determine what tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large
datasets. An ecosystem of tools for authentication, encryption, auditing, and
provisioning has emerged to help developers secure data in Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance,
nor for data quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap.
Finding developers with the combined requisite skills in Java to program MapRe-
duce, operating systems, and hardware can be difficult. In addition, MapReduce
has a steep learning curve, making it hard to get new programmers up to speed
on its best practices and ecosystem.
Why is Hadoop important?
Research firm IDC estimated that 62.4 zettabytes of data were created or repli-
cated in 2020, driven by the Internet of Things, social media, edge computing,
and data created in the cloud. The firm forecasted that data growth from 2020
to 2025 was expected at 23% per year. While not all that data is saved (it is
either deleted after consumption or overwritten), the data needs of the world
continue to grow.
Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend
the capabilities of the core module. Some of the main software tools used with
Hadoop include:
Apache Hive: A data warehouse that allows programmers to work with data in
HDFS using a query language called HiveQL, which is similar to SQL
Apache HBase: An open source non-relational distributed database often paired
with Hadoop
Apache Pig: A tool used as an abstraction layer over MapReduce to analyze
large sets of data and enables functions like filter, sort, load, and join
Apache Impala: Open source, massively parallel processing SQL query engine
often used with Hadoop
Apache Sqoop: A command-line interface application for efficiently transferring
bulk data between relational databases and Hadoop
Apache ZooKeeper: An open source server that enables reliable distributed coordination in Hadoop; a service for “maintaining configuration information, naming, providing distributed synchronization, and providing group services”
Apache Oozie: A workflow scheduler for Hadoop jobs
What is Apache Hadoop used for?
Here are some common use cases for Apache Hadoop:
Analytics and big data
A wide variety of companies and organizations use Hadoop for research, pro-
duction data processing, and analytics that require processing terabytes or
petabytes of big data, storing diverse datasets, and data parallel processing.
Data storage and archiving
As Hadoop enables mass storage on commodity hardware, it is useful as a low-
cost storage option for all kinds of data, such as transactions, click streams, or
sensor and machine data.
Data lakes
Since Hadoop can help store data without preprocessing, it can be used to complement data lakes, where large amounts of unrefined data are stored.
Marketing analytics
Marketing departments often use Hadoop to store and analyze customer rela-
tionship management (CRM) data.
Risk management
Banks, insurance companies, and other financial services companies use Hadoop
to build risk analysis and management models.
AI and machine learning
Hadoop ecosystems help with the processing of data and model training opera-
tions for machine learning applications.
Data Analysis with Unix tools
To understand how to work with Unix tools, a weather dataset is used. Weather sensors gather data continuously at numerous locations across the globe and accumulate an enormous volume of log data, which is a good candidate for analysis with MapReduce, because all of the data must be processed and the data is record-oriented and semi-structured.
The data used is from the National Climatic Data Center, or NCDC. The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and are of fixed width.
Use of UNIX
So how do we find the highest recorded global temperature in the dataset (for each year) using Unix?
The classic tool for processing line-oriented data is awk.
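As a rough sketch of the same per-year logic, the following Python script scans line-oriented records from standard input and keeps the highest valid temperature. It is written in Python rather than awk, and the fixed-width field offsets for the temperature and quality-code fields are assumptions made for illustration; they should be checked against the NCDC format documentation.

import sys

def max_temperature(lines):
    """Return the highest valid temperature found in the given NCDC-style lines.

    The character offsets below are assumed for illustration only.
    """
    max_temp = None
    for line in lines:
        if len(line) < 93:
            continue                   # skip short or malformed records
        temp = int(line[87:92])        # assumed temperature field (tenths of a degree)
        quality = line[92:93]          # assumed quality-code field
        if temp != 9999 and quality in "01459":
            if max_temp is None or temp > max_temp:
                max_temp = temp
    return max_temp

if __name__ == "__main__":
    print(max_temperature(sys.stdin))

Running something like gunzip -c <year-file> | python max_temperature.py once per year’s file mirrors what the classic awk one-liner does for this dataset.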
Analyzing the Data with Unix Tools
To take advantage of the parallel processing that Hadoop provides, we need to
express our query as a MapReduce job. After some local, small-scale testing,
we will be able to run it on a cluster of machines.
Unix tools defined the modern computing landscape. Originally created in the 1960s, Unix tools still provide users, in today’s fast-moving marketplace, with an avenue for solving many of the engineering and business analytics problems professionals face today.
Although by the standards of shiny IDEs some may find the interface of these
tools arcane, their power for exploring and prototyping big data processing
workflows remains unmatched. Their versatility makes them the first choice
for obtaining a quick answer and the last resort for tackling difficult problems.
Compared to scripting languages, another great productivity booster, Unix tools
uniquely allow an interactive, explorative programming style, which is ideal for
solving efficiently many engineering and business analytics problems that we
face every day.
Natively available on all flavors of Unix-like operating systems, including
GNU/Linux and Mac OS X, the tools are nowadays also easy to install under
Windows.
While many Unix-like systems have come and gone over the years, there’s still
plenty of reasons why the original operating system has outlasted the competi-
tion.
Hadoop also provides a number of other tools for analyzing data, including
Apache Hive, Apache Pig, and Apache Spark. These tools provide higher-level
abstractions that simplify the process of data analysis.
Apache Hive provides a SQL-like interface for querying data stored in HDFS. It
translates SQL queries into MapReduce jobs, making it easier for analysts who
are familiar with SQL to work with Hadoop.
Apache Pig is a high-level scripting language that enables users to write data
processing pipelines that are translated into MapReduce jobs. Pig provides a
simpler syntax than MapReduce, making it easier to write and maintain data
processing code.
Apache Spark is a distributed computing framework that provides a fast and
flexible way to process large amounts of data. It provides an API for work-
ing with data in various formats, including SQL, machine learning, and graph
processing.
In summary, Hadoop provides a powerful framework for analyzing large amounts
of data. By storing data in HDFS and using MapReduce or other tools like
Apache Hive, Apache Pig, or Apache Spark, you can perform distributed data
processing and gain insights from your data that would be difficult or impossible
to obtain using traditional data analysis tools.
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write MapReduce programs in any language that can read from standard input and write to standard output. Hadoop offers several mechanisms to support non-Java development.
The primary mechanisms are Hadoop Pipes, which gives a native C++ interface to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and output to be used for map tasks and reduce tasks.
Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as follows:
1.Hadoop Streaming is a part of the Hadoop distribution.
2.It facilitates ease of writing Map Reduce programs and codes.
3.Hadoop Streaming supports almost all types of programming languages such
as Python, C++, Ruby, Perl etc.
4.The entire Hadoop Streaming framework runs on Java. However, the codes
might be written in different languages as mentioned in the above point.
5.The Hadoop Streaming process uses Unix Streams that act as an interface
between Hadoop and Map Reduce programs.
6.Hadoop Streaming uses various streaming command options; the two mandatory ones are -input directoryname or filename and -output directoryname.
As can be seen in the architecture diagram, there are eight key parts in the Hadoop Streaming architecture. They are:
A.Input Reader/Format
B.Key Value
C.Mapper Stream
D.Key-Value Pairs
E.Reduce Stream
F.Output Format
G.Map External
H.Reduce External
The involvement of these components is discussed in detail when we explain the working of Hadoop Streaming. To summarize the Hadoop Streaming architecture briefly: the starting point of the entire process is when the Mapper reads the input from the Input Reader/Format. Once the input data is read, it is mapped by the Mapper as per the logic given in the code. It then passes through the Reducer stream, and the data is transferred to the output after data aggregation is done. A more detailed description is given in the section below on the working of Hadoop Streaming.
How does Hadoop Streaming Work?
Input is read from standard input and the output is emitted to standard output
by Mapper and the Reducer. The utility creates a Map/Reduce job, submits
the job to an appropriate cluster, and monitors the progress of the job until
completion.
When a script is specified for mappers, every mapper task launches the script as a separate process when the mapper is initialized. Mapper task inputs are converted into lines and fed to the standard input of the process. Line-oriented outputs are collected from the standard output of the process, and every line is converted into a key/value pair, which is collected as the output of the mapper.
Similarly, when a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, its input key/value pairs are converted into lines and fed to the standard input (STDIN) of the process.
Each line of the line-oriented outputs is converted into a key/value pair after it
is collected from the standard output (STDOUT) of the process, which is then
collected as the output of the reducer.
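To make this concrete, here is a minimal, illustrative pair of Streaming scripts for a word count, written in Python (any language that reads standard input and writes standard output would do). mapper.py emits one tab-separated "word 1" pair per word; reducer.py relies on Hadoop sorting the mapper output by key before handing it to the reducer, so equal keys arrive consecutively.

#!/usr/bin/env python3
# mapper.py - emit one "word<TAB>1" line per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sum the counts for each word; equal keys arrive consecutively
# because Hadoop sorts the mapper output by key before the reduce phase.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job would then be submitted with the Hadoop Streaming jar, passing the mandatory -input and -output options together with -mapper mapper.py and -reducer reducer.py, and shipping the scripts to the cluster (for example with -file).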
Hadoop Ecosystem
Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which cannot be processed in an efficient manner with the help of traditional methodologies such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm
libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e. data. That is the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsi-
ble for storing large data sets of structured or unstructured data across various
nodes and thereby maintaining the metadata in the form of log files.
HDFS consists of two core components i.e.
1.Name node
2.Data Node
The Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, undoubtedly making Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
1.Resource Manager
2.Nodes Manager
3.Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations as per the requirement of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry out the processing logic and helps to write applications which transform big data sets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
Map() performs the sorting and filtering of data and thereby organizes it into groups. Map() generates key-value pair based results which are later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on the Pig Latin language, which is a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
HIVE:
With the help of an SQL methodology and interface, HIVE performs the reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable as it allows both real-time processing and batch processing. Also, all the SQL data types are supported by Hive, thus making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE command line helps in the processing of queries.
Mahout:
• Mahout allows machine learnability for a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environmental interaction, or algorithms.
• It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are nothing but concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
• Spark is best suited for real-time data whereas Hadoop is best
suited for structured data or batch processing, hence both are
used in most of the companies interchangeably.
Apache HBase:
• It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google’s BigTable and is thus able to work on big data sets effectively.
• At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components
too that carry out a huge task in order to make Hadoop capable of processing
large datasets. They are as follows:
• Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is based on Java and also provides a spell-check mechanism. Solr is built on top of Lucene.
• Zookeeper: There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
IBM Big Data Strategy:
• IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data strategy.
• The company offered solutions to store, manage, and analyze the huge amounts of data generated daily and equipped large and small companies to make informed business decisions.
• The company believed that its Big Data and analytics products and ser-
vices would help its clients become more competitive and drive growth.
Issues:
• Understand the concept of Big Data and its importance to large, medium, and small companies in the current industry scenario.
• Understand the need for implementing a Big Data strategy and the various issues and challenges associated with it.
• Analyze the Big Data strategy of IBM.
• Explore ways in which IBM’s Big Data strategy could be improved further.
Introduction to InfoSphere:
• InfoSphere Information Server provides a single platform for data integra-
tion and governance.
• The components in the suite combine to create a unified foundation for
enterprise information architectures, capable of scaling to meet any infor-
mation volume requirements.
• You can use the suite to deliver business results faster while maintaining
data quality and integrity throughout your information landscape.
• InfoSphere Information Server helps your business and IT personnel col-
laborate to understand the meaning, structure, and content of information
across a wide variety of sources.
• By using InfoSphere Information Server, your business can access and
use information in new ways to drive innovation, increase operational effi-
ciency, and lower risk.
BigInsights:
• BigInsights is a software platform for discovering, analyzing, and visualiz-
ing data from disparate sources.
• The flexible platform is built on an Apache Hadoop open-source framework
that runs in parallel on commonly available, low-cost hardware.
BigSheets:
• BigSheets is a browser-based analytic tool included in the InfoSphere Bi-
gInsights Console that you use to break large amounts of unstructured
data into consumable, situation-specific business contexts.
• These deep insights help you to filter and manipulate data from sheets
even further.