BDA Unit-1
Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
[OR]
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is still data, but of very large size.
[OR]
The definition of big data is data that contains greater variety, arriving in increasing volumes and with
more velocity.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in terms of photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with this kind of data (where the format is well known in advance) and deriving value out of it. However, we now foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available with them, but unfortunately they do not know how to derive value out of it, since this data is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Characteristics of Big Data
(i) Volume
(ii) Variety
(iii) Velocity
(iv) Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value of the data. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – Variability refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.
Every company uses its collected data in its own way; the more effectively a company uses its data, the more rapidly it grows. Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they
have to store large amounts of data. These tools help organizations in identifying more effective ways
of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools like Hadoop help them to analyze data immediately, thus enabling quick decisions based on the learnings.
For example, analysis of customer purchasing behavior helps companies to identify the products sold most and thus produce those products accordingly. This helps companies to get ahead of their competitors.
Companies can also use Big Data tools to improve their online presence. If a company does not know what its customers want, its success will suffer, resulting in a loss of clientele and an adverse effect on business growth. Big Data analytics helps businesses to identify customer-related trends and patterns, and customer behavior analysis leads to a profitable business.
Meet Hadoop:
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google
for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce
program runs on Hadoop which is an Apache open-source framework.
MapReduce contains two important tasks. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
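To make these two steps concrete, here is a minimal word-count sketch in Java using the org.apache.hadoop.mapreduce API (the class names are illustrative, not part of Hadoop): the map job emits a (word, 1) tuple for every word it sees, and the reduce job combines all tuples for the same word into a single (word, total) tuple.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map job: break each input line into (word, 1) tuples.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);                  // one tuple per word occurrence
    }
  }
}

// Reduce job: combine all tuples for a word into a single (word, total) tuple.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}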
HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Data Storage and Analysis - Hadoop
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
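As a quick back-of-the-envelope check of these figures, here is a small, self-contained Java sketch that uses only the numbers quoted above:

public class DriveReadTimes {
  public static void main(String[] args) {
    // 1990 drive: 1,370 MB at 4.4 MB/s
    double minutes1990 = 1370.0 / 4.4 / 60.0;                       // about 5.2 minutes
    // Modern drive: 1 TB (roughly 1,000,000 MB) at 100 MB/s
    double hoursNow = 1_000_000.0 / 100.0 / 3600.0;                 // about 2.8 hours
    // 100 drives read in parallel, each holding one hundredth of the terabyte
    double minutesParallel = (1_000_000.0 / 100.0) / 100.0 / 60.0;  // about 1.7 minutes
    System.out.printf("1990 drive: %.1f minutes%n", minutes1990);
    System.out.printf("1 TB drive: %.1f hours%n", hoursNow);
    System.out.printf("100 drives in parallel: %.1f minutes%n", minutesParallel);
  }
}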
Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times and that, statistically, their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much.
There's more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop's file system, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs. Like HDFS, MapReduce has built-in reliability.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Hadoop vs RDBMS
An RDBMS is more efficient for point queries, where data is indexed to improve disk latency, whereas Hadoop's MapReduce is more efficient for queries that involve the complete dataset. Moreover, MapReduce suits applications in which data is written once and read many times, whereas in an RDBMS the dataset is continuously updated.
   Property                  MapReduce               RDBMS
1. Size of data              Petabytes               Gigabytes
2. Integrity of data         Low                     High
3. Data schema               Dynamic                 Static
4. Access method             Batch                   Interactive and batch
5. Scaling                   Linear                  Nonlinear
6. Data structure            Unstructured            Structured
7. Normalization of data     Not required            Required
Hadoop vs Grid Computing
Moreover, MapReduce saves programmers from writing code for node failures and for handling data flow, as these are handled implicitly by the MapReduce framework, whereas grid computing gives the programmer explicit control over data flow and node failures.
Thus we can say that Hadoop is not a replacement for an RDBMS; the two systems can coexist.
Hadoop vs Cloud Computing
In simplest terms, cloud computing means storing and accessing your data, programs, and files over the internet rather than your PC's hard drive. Basically, the cloud is another name for the internet.
Cloud computing has several attractive benefits for end users and businesses. The primary benefits of cloud computing include:
Elasticity – with cloud computing, businesses only use the resources they require. Organizations can
increase their usage as computing needs increase and reduce their usage as the computing needs
decrease. This eliminates the need for investing heavily in IT infrastructures which may or may not be
used.
Self-service provisioning – users can always use the resources for almost any type of workload on
demand. This eliminates the need for IT admins to provide and manage computer resources.
Pay-per-use – compute resources are metered depending on the usage level. This means that users are only charged for the cloud resources they use.
Differences between Hadoop and Cloud Computing:
1. Hadoop is a framework which uses simple programming models to process large data sets across
clusters of computers. It is designed to scale up from single servers to thousands of machines, which
offer local computation and storage individually. Cloud computing, on the other hand, constitutes
various computing concepts, which involve a large number of computers which are usually connected
through a real-time communication network.
2. Cloud computing focuses on on-demand, scalable, and adaptable service models. Hadoop, on the other hand, is all about extracting value out of volume, variety, and velocity.
3. In cloud computing, Cloud MapReduce is an alternative implementation of MapReduce. The main difference between Cloud MapReduce and Hadoop is that Cloud MapReduce does not run on its own cluster infrastructure; rather, it relies on the infrastructure offered by different cloud service providers.
4. Hadoop is an ‘ecosystem’ of open source software projects which allow cheap computing which is
well distributed on industry-standard hardware. On the other hand, cloud computing is a model where
processing and storage resources can be accessed from any location via the internet.
Hadoop vs Volunteer Computing
When people first hear about Hadoop and MapReduce, they often ask, “How is it different from
SETI@home?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home
in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope
data for signs of intelligent life outside earth. SETI@home is the most well-known of many volunteer
computing projects; others include the Great Internet Mersenne Prime Search (to search for large
prime numbers) and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problem they are trying to solve into chunks
called work units, which are sent to computers around the world to be analyzed. For example, a
SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze
on a typical home computer. When the analysis is completed, the results are sent back to the server,
and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to
three different machines and needs at least two results to agree to be accepted.
Although SETI@home may be superficially similar to MapReduce (breaking a problem into
independent pieces to be worked on in parallel), there are some significant differences. The
SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of
thousands of computers across the world, since the time to transfer the work unit is dwarfed by the
time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running
in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home
runs a perpetual computation on untrusted machines on the Internet with highly variable connection
speeds and no data locality.
A Brief History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a major constraint for the project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data
processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). Along with this file system, MapReduce was also implemented in Nutch.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in the same year.
o Doug Cutting named the project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, using a 900-node cluster, in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Hadoop Ecosystem:
Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed efficiently with the help of traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing the cluster
Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too that are
part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e., data. That's the beauty of Hadoop: it revolves around data, which makes processing it easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node; it contains the metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
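As a small illustration of how a client program interacts with HDFS, here is a Java sketch using the org.apache.hadoop.fs API; the fs.defaultFS address and the file path are assumptions for a single-node setup, not part of the notes above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed single-node address

    FileSystem fs = FileSystem.get(conf);                // the client asks the name node for metadata
    Path path = new Path("/user/demo/hello.txt");        // hypothetical file path

    // Write a file: the bytes themselves are streamed to the data nodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeBytes("Hello, HDFS\n");
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}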
YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
The resource manager has the privilege of allocating resources for the applications in the system, whereas node managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later report back to the resource manager. The application manager works as an interface between the resource manager and the node manager and performs negotiations as per the requirement of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over the cluster and helps to write applications which transform big data sets into manageable ones.
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of the data and thereby organizes it in the form of groups. Map generates a key-value pair based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the
way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time (interactive) and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
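As an illustration of the JDBC route mentioned above, the following Java sketch connects to a HiveServer2 instance and runs a query; the URL, credentials, and the weather table are assumptions, not part of these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the URL below assumes a local single-node setup.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement();
         // Hypothetical table; HQL looks very much like SQL.
         ResultSet rs = stmt.executeQuery("SELECT year, MAX(temp) FROM weather GROUP BY year")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
      }
    }
  }
}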
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking algorithms
as per our need with the help of its own libraries.
Apache Spark:
It's a platform that handles all the process-intensive tasks like batch processing, interactive or iterative real-time processing, graph processing, visualization, etc.
It uses in-memory resources, which makes it faster than MapReduce in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing; hence both are used together in most companies, each for the workloads it handles best.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling almost anything within a Hadoop ecosystem. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing and quickly looking up such data.
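A small Java sketch of that kind of quick, key-based lookup with the HBase client API; the table name, row key, and column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table
      Get get = new Get(Bytes.toBytes("user#1001"));                         // row-key lookup
      Result result = table.get(get);
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(name == null ? "not found" : Bytes.toString(name));
    }
  }
}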
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is a Java library that also provides features such as spell checking, and Solr is a search platform built on top of Lucene.
Zookeeper: There was a huge issue of management, coordination, and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcomes these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequential order, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Hadoop Installation
In this section of the Hadoop tutorial, we will be talking about the Hadoop installation process.
Hadoop is supported on the Linux platform and its different flavors. If you are working on Windows, you can use Cloudera VMware that has preinstalled Hadoop, or you can use Oracle VirtualBox or the VMware Workstation. In this tutorial, I will be demonstrating the installation process for Hadoop using VMware Workstation 12. You can use any of the above to perform the installation. I will do this by installing CentOS on my VMware.
Prerequisites
Step 3: Setting up CentOS in VMware 12
Click on Create a New Virtual Machine
1. As seen in the screenshot above, browse the location of your CentOS file you downloaded. Note
that it should be a disc image file
2. Click on Next
1. Choose the name of your machine. Here, I have given the name CentOS 64-bit
2. Then, click Next
After this, you should be able to see a window as shown below. This screen indicates that the system is booting and getting ready for installation. You will be given 60 seconds to change the option from Install CentOS to something else; if you want Install CentOS to remain selected, just wait out the 60 seconds.
Note: In the image above, you can see three options, such as, I Finished Installing, Change Disc,
and Help. You don’t need to touch any of these until your CentOS is successfully installed.
o At the moment, your system is being checked and is getting ready for installation
Once the checking percentage reaches 100%, you will be taken to a screen as shown below:
Step 4: Here, you can choose your language. The default language is English, and that is what I have
selected
1. If you want any other language to be selected, specify it
2. Click on Continue
Step 5: Setting up the Installation Processes
o From Step 4, you will be directed to a window with various options as shown below:
o First, to select the software type, click on the SOFTWARE SELECTION option
Now, you will see the following window:
1. Select the Server with GUI option to give your server a graphical appeal
2. Click on Done
After clicking on Done, you will be taken to the main menu where you had previously
selected SOFTWARE SELECTION
Next, you need to click on INSTALLATION DESTINATION
On clicking this, you will see the following window:
1. Under Other Storage Options, select I would like to make additional space available
2. Then, select the radio button that says I will configure partitioning
3. Then, click on Done
o Next, you’ll be taken to another window as shown below:
1. Select the partition scheme here as Standard Partition
2. Now, you need to add three mount points here. For doing that, click on '+'
a) Select the Mount Point /boot as shown above
b) Next, select the Desired Capacity as 500 MiB as shown below:
c) Click on Add mount point
d) Again, click on ‘+’ to add another Mount Point
e) This time, select the Mount Point as swap and Desired Capacity as 2 GiB
f) Click on Add Mount Point
g) Now, to add the last Mount Point, click on + again
h) Add another Mount Point ‘/’ and click on Add Mount Point
i) Click on Done, and you will see the following window:
Note: This is just to make you aware of all the changes you had made in the partition of your drive
Now, click on Accept Changes if you’re sure about the partitions you have made
Next, select NETWORK & HOST NAME
You’ll be taken to a window as shown below:
Step 6: Configuration
o Once you complete step 5, you will see the following window where the final installation
process will be completed.
o But before that, you need to set the ROOT PASSWORD and create a user
o Click on ROOT PASSWORD, which will direct you to the following window:
1. Enter your root password here
2. Confirm the password
3. Click on Done
Now, click on USER CREATION, and you will be directed to the following window:
o In the next screen, you will see the installation process in progress
o When the installation is done (it takes up to 20–30 minutes), you will see the Reboot button, as shown below
Wait until a window pops up asking you to accept your license information
Step 7: Setting up the License Information
Step 8: Logging into CentOS
Note: All commands need to be run on the Terminal. You can open the Terminal by right-clicking on
the desktop and selecting Open Terminal
Step 9: Downloading and Installing Java 8
Download the Java 8 package and save the file in your home directory
Extract the Java tar file using the following command:
Step 10: Downloading and Extracting Hadoop
o Download a stable Hadoop release packed as a zipped file from the Apache Hadoop releases page and unpack it somewhere on your file system
o Extract the Hadoop file using the following command on the terminal:
Step 11: Moving Hadoop to a Location
o Use the following code to move your file to a particular location, here Hadoop:
mv hadoop-2.7.3 /home/intellipaaat/hadoop
Note: The location of the file you want to change may differ. For demonstration purposes, I
have used this location, and this will be the same throughout this tutorial. You can change it
according to your choice.
Step 12: Setting up the Environment Variables
o When you get logged in to your root user, enter the command: vi ~/.bashrc
o The above command takes you to the vi editor, and you should be able to see the following screen:
To edit the file, press Insert on your keyboard, and then start writing the following set of lines for setting the paths for Java and Hadoop:
#HADOOP VARIABLES START
export JAVA_HOME=(path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export HADOOP_INSTALL=$HADOOP_HOME   # define HADOOP_INSTALL, which the lines below refer to
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
After writing these lines, press Esc on your keyboard and type the command :wq!
This will save the file and exit you from the vi editor. The path has now been set, as can be seen in the image below:
Step 13: Adding Configuration Files
vi /home/intellipaaat/hadoop/etc/hadoop/hadoop-env.sh
o Replace this path with the Java path to tell Hadoop which path to use. You will see the
following window coming up:
o Change the JAVA_HOME variable to the path you had copied in the previous step
Step 14:
Now, several XML files need to be edited, and you need to set the property and the path for
them.
o Editing core-site.xml
Use the same command as in the previous step and just change the last part to core-site.xml as given below:
vi /home/intellipaaat/hadoop/etc/hadoop/core-site.xml
Next, you will see the following window:
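The window contents are not reproduced here. A minimal single-node core-site.xml, assuming HDFS will run on localhost at port 9000, typically places the following between the configuration tags (adjust the value to your own setup):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>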
Now, exit from this window by pressing Esc and entering the command :wq!
o Editing yarn-site.xml
vi /home/intellipaaat/hadoop/etc/hadoop/yarn-site.xml
You will see the following window:
Enter the code in between the configuration tags as shown below:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Exit from this window by pressing Esc and then writing the command :wq!
Editing mapred-site.xml
cp /home/intellipaaat/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template /home/intellipaaat/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Once the contents have been copied to a new file named mapred-site.xml, you can verify
it by going to the following path:
Home > intellipaaat > hadoop > hadoop-2.7.3 > etc > hadoop
vi /home/intellipaaat/hadoop/etc/hadoop/mapred-site.xml
In the new window, enter the following code in between the configuration tags as below:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Editing hdfs-site.xml
Before editing the hdfs-site.xml, two directories have to be created, which will contain the
namenode and the datanode.
mkdir -p /home/intellipaaat/hadoop_store/hdfs/namenode
mkdir -p /home/intellipaaat/hadoop_store/hdfs/datanode
vi /home/intellipaaat/hadoop/etc/hadoop/hdfs-site.xml
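The window contents are not reproduced here either. A minimal single-node hdfs-site.xml, assuming a replication factor of 1 and the two directories created above, would place the following between the configuration tags:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/intellipaaat/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/intellipaaat/hadoop_store/hdfs/datanode</value>
</property>
</configuration>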
That's all! All your configurations are done, and the Hadoop installation is now complete.
Step 15: Checking Hadoop
You will now need to check whether the Hadoop installation is successfully done on your system or
not.
Go to the location where you had extracted the Hadoop tar file, right-click on the bin, and open it
in the terminal
Now, write the command, ls
Next, if you see a window as below, then it means that Hadoop is successfully installed!
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to
run it on a cluster of machines.
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the
map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of
the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a
data preparation phase, setting up the data in such a way that the reducer function can
do its work on it: finding the maximum temperature for each year. The map function is
also a good place to drop bad records: here we filter out temperatures that are missing,
suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
Figure 2-1. MapReduce logical data flow
JAVA MAPREDUCE
Having run through how the MapReduce program works, the next step is to express
it in code. We need three things: a map function, a reduce function, and some code
to run the job. The map function is represented by the Mapper class, which declares
an abstract map() method.
The Mapper class is a generic type, with four formal type parameters that specify the
input key, input value, output key, and output value types of the map function.
For the present example, the input key is a long integer offset, the input value is
a line of text, the output key is a year, and the output value is an air temperature
(an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value
containing the line of input into a Java String, then use its substring() method to
extract the columns we are interested in.
The map() method also provides an instance of Context to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
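A sketch of the mapper just described (the class name and the exact column offsets are illustrative; they follow the NCDC record layout shown in the sample lines above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel used for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                 // year columns
    int airTemperature;
    if (line.charAt(87) == '+') {                         // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {   // drop missing/suspect readings
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}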
Again, four formal type parameters are used to specify the input and output
types, this time for the reduce function. The input types of the reduce function
must match the output types of the map function: Text and IntWritable. And in
this case, the output types of the reduce function are Text and IntWritable, for a
year and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
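A sketch of the reducer just described (the class name mirrors the mapper sketch above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());   // keep the highest reading seen so far
    }
    context.write(key, new IntWritable(maxValue));  // emit (year, maximum temperature)
  }
}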
A Job object forms the specification of the job. It gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code
into a JAR file (which Hadoop will distribute around the cluster). Rather than
explicitly specify the name of the JAR file, we can pass a class in the Job’s
setJarByClass() method, which Hadoop will use to locate the relevant JAR file
by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat,
and it can be a single file, a directory (in which case, the input forms all the files in
that directory), or a file pattern. As the name suggests, addInputPath() can be called
more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutput
Path() method on FileOutputFormat. It specifies a directory where the output
files from the reducer functions are written. The directory shouldn’t exist before
running the job, as Hadoop will complain and not run the job. This precaution is
to prevent data loss (it can be very annoying to accidentally overwrite the output
of a long job with another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The input types are controlled via the input format, which we have not explicitly
set since we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to
run the job. The waitForCompletion() method on Job submits the job and waits for it
to finish. The method’s boolean argument is a verbose flag, so in this case the job
writes information about its progress to the console.
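A sketch of the driver described above (the class names match the mapper and reducer sketches; Job.getInstance() assumes a Hadoop 2.x client, and the input and output paths are taken from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();                       // the job specification
    job.setJarByClass(MaxTemperature.class);           // Hadoop finds the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));   // file, directory, or file pattern
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // directory that must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);                 // output types of the reduce function
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);  // true = print progress to the console
  }
}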
After writing a MapReduce job, it’s normal to try it out on a small dataset to flush out
any immediate problems with the code. First install Hadoop in standalone mode—
there are instructions for how to do this in Appendix A. This is the mode in which
Hadoop runs using the local filesystem with a local job runner. Then install and
compile the examples using the instructions on the book’s website.
When the hadoop command is invoked with a classname as the first argument, it
launches a JVM to run the class. It is more convenient to use hadoop than straight
java since the former adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too. To add the application classes to the
classpath, we’ve defined an environment variable called HADOOP_CLASSPATH,
which the hadoop script picks up.
The last section of the output, titled “Counters,” shows the statistics that Hadoop
generates for each job it runs. These are very useful for checking whether the
amount of data processed is what you expected. For example, we can follow the
number of records that went through the system: five map inputs produced five
map outputs, then five reduce inputs in two groups produced two reduce outputs.
The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949   111
1950   22
This result is the same as when we went through it by hand earlier. We interpret
this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in
1950 it was 2.2°C.
THE OLD AND THE NEW JAVA MAPREDUCE APIS
The Java MapReduce API used in the previous section was first released in Hadoop 0.20.0. This new API, sometimes referred to as “Context Objects,” was designed to make the API easier to evolve in the future. It is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it.
The new API is not complete in the 1.x (formerly 0.20) release series, so the old API is recommended for these releases, despite having been marked as deprecated in the early 0.20 releases.
Previous editions of this book were based on 0.20 releases, and used the old API
throughout (although the new API was covered, the code invariably used the old
API). In this edition the new API is used as the primary API, except where mentioned. However, should you wish to use the old API, you can, since the code for all the examples in this book is available for the old API on the book's website.
• The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package (and
subpackages). The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
• In both APIs, key-value record pairs are pushed to the mapper and reducer, but
in addition, the new API allows both mappers and reducers to control the
execution flow by overriding the run() method. For example, records can be
processed in batches, or the execution can be terminated before all the
records have been processed. In the old API this is possible for mappers by
writing a MapRunnable, but no equivalent exists for reducers.
• Configuration has been unified. The old API has a special JobConf object for
job configuration, which is an extension of Hadoop’s vanilla Configuration
object (used for configuring daemons). In the new API, this distinction is
dropped, so job configuration is done through a Configuration.
• Job control is performed through the Job class in the new API, rather than the
old
JobClient, which no longer exists in the new API.
• Output files are named slightly differently: in the old API both map and
reduce outputs are named part-nnnnn, while in the new API map outputs are
named part- m-nnnnn, and reduce outputs are named part-r-nnnnn (where
nnnnn is an integer designating the part number, starting from zero).
• User-overridable methods in the new API are declared to throw java.lang.InterruptedException. What this means is that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.
• In the new API the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
Scale up
Resources such as CPU, network, and storage are common targets for scaling up. The goal is to
increase the resources supporting your application to reach or maintain adequate performance. In
a hardware-centric world, this might mean adding a larger hard drive to a computer for increased
storage capacity. It might mean replacing the entire computer with a machine that has more
CPU and a more performant network interface. If you are managing a non-cloud system, this
scaling up process can take anywhere from weeks up to months as you request, purchase, install,
and finally deploy the new resources.
In a cloud system, the process should take seconds or minutes. A cloud system might still target hardware, and that will be at the tens-of-minutes end of the time-to-scale range. But virtualized systems dominate cloud computing, and some scaling actions, like increasing storage volume capacity or spinning up a new container to scale up a microservice, can take seconds to deploy.
What is being scaled will not be that different. One may still shift applications to a larger VM or
it may be as simple as allocating more capacity on an attached storage volume.
Regardless of whether you are dealing with virtual or hardware resources, the take-home point is
that you are moving from one smaller resource and scaling up to one larger, more performant
resource.
Scale out
Scaling up makes sense when you have an application that needs to sit on a single machine. If
you have an application that has a loosely coupled architecture, it becomes possible to easily
scale out by replicating resources.
Scaling out a microservices application can be as simple as spinning up a new container running
a webserver app and adding it to the load balancer pool. When scaling out the idea is that it is
possible to add identical services to a system to increase performance. Systems that support this
model also tolerate the removal of resources when the load decreases. This allows greater
fluidity in scaling resource size in response to changing conditions.
The incremental nature of the scale out model is of great benefit when considering cost
management. Because components are identical, cost increments should be relatively
predictable. Scaling out also provides greater responsiveness to changes in demand. Typically
services can be rapidly added or removed to best meet resource needs. This flexibility and speed
effectively reduces spending by only using (and paying for) the resources needed at the time.