Big Data Hadoop
Big data is a collection of massive and complex data sets characterized by huge data volumes, demanding data management capabilities, social media analytics, and real-time data. Big data analytics is the process of examining these large amounts of heterogeneous digital data. Big data is about data volume and large data sets measured in terms of terabytes or petabytes, and the process of examining such data to extract useful information is known as big data analytics. The challenges include capturing, analysis, storage, searching, sharing, visualization, transferring, and privacy violations. Big data can neither be worked upon using traditional SQL queries, nor can a relational database management system (RDBMS) be used for its storage. As a result, a wide variety of scalable database tools and techniques has evolved. Hadoop, an open source framework for distributed data processing, is one of the prominent and well-known solutions. NoSQL offers non-relational databases such as MongoDB and Apache HBase.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
Types of Big Data
Big data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data. Over the period of time, talent in computer science has achieved greater success in developing techniques for working with such kind of data (where the format is well known in advance) and also deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Example row from an 'Employee' table in a relational database: 7465 | Shushil Roy | Male | Admin | 500000
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to the size being huge, unstructured data poses multiple challenges in terms of its processing for deriving value out of it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available with them but, unfortunately, they don't know how to derive value out of it since this data is in its raw form or unstructured format.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics of Big Data (the 5 V's)
1. Volume:
Volume refers to the huge amount of data that is generated and stored; the size of the data plays a crucial role in determining its value.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. Also, by the year 2020 we will have almost 40,000 exabytes of data.
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones etc.
There is a massive and continuous flow of data. This determines the potential of data that how fast the
data is generated and processed to meet the demands.
Sampling data can help in dealing with issues like velocity.
Example: More than 3.5 billion searches are made on Google per day. Also, Facebook users are increasing by approximately 22% year on year.
3. Variety:
Variety refers to nature of data that is structured, semi-structured and unstructured data.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise.
It can be structured, semi-structured and unstructured.
Structured data: This data is basically organized data. It generally refers to data that has a defined length and format.
Semi-structured data: This data is basically semi-organized data. It is generally a form of data that does not conform to the formal structure of data. Log files are an example of this type of data.
Unstructured data: This data basically refers to unorganized data. It generally refers to data that doesn't fit neatly into the traditional row and column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
4. Veracity:
It refers to inconsistencies and uncertainty in data; that is, the available data can sometimes get messy, and quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk could create confusion, whereas a small amount of data could convey half or incomplete information.
5. Value:
After having the 4 V’s into account there comes one more V which stands for Value!. The bulk of Data
having no Value is of no good to the company, unless you turn it into something useful.
Data in itself is of no use or importance but it needs to be converted into something valuable to extract
Information. Hence, you can state that Value! is the most important V of all the 5V’s.
Security Challenges:
Advanced analytic tools for unstructured big data and non-relational (NoSQL) databases are newer technologies in active development, and it can be difficult for security software and processes to keep pace with them.
Mature security tools effectively protect data ingress and storage. However, they may not have the same impact on data output from multiple analytics tools to multiple locations.
Big data administrators may decide to mine data without permission or notification. Whether the motivation is curiosity or criminal profit, security tools need to monitor and alert on suspicious access.
The sheer size of a big data installation, terabytes to petabytes, is too big for routine security audits. And because most big data platforms are cluster-based, this introduces multiple vulnerabilities across multiple nodes and servers.
If the big data owner does not regularly update security for the environment, they are at risk of data loss and exposure.
Security tools need to monitor and alert on suspicious malware infection on the system, database or a web CMS such as WordPress, and big data security experts must be proficient in cleanup and remediation.
Importance of Big Data Analytics:
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. In his report Big Data in Big Companies, IIA Director of Research Tom Davenport interviewed more than 50 businesses to understand how they used big data. He found they got value in the following ways:
1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient ways of doing
business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with
the ability to analyze new sources of data, businesses are able to analyze information immediately – and make
decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers' needs.
Applications of Big Data:
1. Banking and Securities: For monitoring financial markets through network activity monitors
and natural language processors to reduce fraudulent transactions. Exchange Commissions or
Trading Commissions are using big data analytics to ensure that no illegal trading happens by
monitoring the stock market.
2. Communications and Media: For real-time reporting of events around the globe on several platforms (mobile, web and TV) simultaneously. The music industry, a segment of media, is using big data to keep an eye on the latest trends, which are ultimately used by auto-tuning software to generate catchy tunes.
3. Sports: To understand the patterns of viewership of different events in specific regions and also
monitor the performance of individual players and teams by analysis. Sporting events like Cricket
world cup, FIFA world cup and Wimbledon make special use of big data analytics.
4. Healthcare: To collect public health data for faster responses to individual health problems and
identify the global spread of new virus strains such as Ebola. Health Ministries of different countries
incorporate big data analytic tools to make proper use of data collected after Census and surveys.
5. Education: To update and upgrade prescribed literature for a variety of fields which are
witnessing rapid development. Universities across the world are using it to monitor and track the
performance of their students and faculties and map the interest of students in different subjects via
attendance.
6. Insurance: For everything from developing new products to handling claims through predictive
analytics. Insurance companies use business big data to keep a track of the scheme of policy which is
the most in demand and is generating the most revenue.
7. Consumer Trade: To predict and manage staffing and inventory requirements. Consumer
trading companies are using it to grow their trade by providing loyalty cards and keeping a track of
them.
8. Transportation: For better route planning, traffic monitoring and management, and logistics.
This is mainly incorporated by governments to avoid congestion of traffic in a single place.
9. Energy: By introducing smart meters to reduce electrical leakages and help users to manage
their energy usage. Load dispatch centers are using big data analysis to monitor the load patterns and
identify the differences between the trends of energy consumption based on different parameters
and as a way to incorporate daylight savings.
Hadoop:
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has four modules: HDFS for storage, YARN for resource management, MapReduce for parallel processing, and the Hadoop Common libraries. These are described in the sections that follow.
MapReduce:
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once
we write an application in the MapReduce form, scaling the application to run over hundreds, thousands,
or even tens of thousands of machines in a cluster is merely a configuration change. This simple
scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
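For a concrete illustration of the map and reduce stages described above, the following is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are supplied as command-line arguments and are placeholders, not values from these notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: all counts for the same word are summed into a single total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job would typically be launched with a command of the form "hadoop jar wordcount.jar WordCount <input dir> <output dir>".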
Hadoop HDFS
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware. HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing of very large data sets.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of the cluster.
It provides streaming access to file system data.
It provides file permissions and authentication, and fault tolerance through data replication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks −
It manages the file system namespace.
It regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as per the need by changing the HDFS configuration.
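To show how a client sees these blocks, here is a hedged sketch using the HDFS Java API. The file path /user/hadoop/sample.txt is a hypothetical placeholder, and the cluster configuration is assumed to be available on the classpath (core-site.xml / hdfs-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file into HDFS (hypothetical path).
        Path file = new Path("/user/hadoop/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Inspect how the file is split into blocks and where the blocks are stored.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}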
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near the
data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for big data processing.
YARN basically separates the resource management layer from the processing layer. With YARN, the responsibilities that the Hadoop 1.0 Job Tracker carried alone are split between the Resource Manager and the Application Master.
YARN Features: YARN gained popularity because of the following features −
Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without disruption, thus making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy: It allows multiple engines to access the cluster, thus giving organizations the benefit of multi-tenancy.
The main components of YARN architecture include:
Client: It submits map-reduce jobs.
Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Application Master: An application is a single job submitted to a framework. The Application Master is responsible for negotiating resources with the Resource Manager, and for tracking the status and monitoring the progress of a single application. The Application Master requests a container launch from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends its report to the Resource Manager from time to time.
Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked by a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
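As a hedged illustration of step 7, the sketch below uses the YARN client Java API to ask the Resource Manager for a report of every known application and print its state and progress. It assumes the cluster configuration is available on the classpath; the class name is hypothetical.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager using the cluster configuration.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a report of every known application.
        List<ApplicationReport> reports = yarnClient.getApplications();
        for (ApplicationReport report : reports) {
            System.out.println(report.getApplicationId()
                    + "  name=" + report.getName()
                    + "  state=" + report.getYarnApplicationState()
                    + "  progress=" + (report.getProgress() * 100) + "%");
        }
        yarnClient.stop();
    }
}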
Difference between RDBMS and Hadoop:
Data volume − An RDBMS works better when the volume of data is low (in gigabytes). When the data size is huge, i.e. in terabytes and petabytes, an RDBMS fails to give the desired results. Hadoop was meant to handle very large data sizes, so it works better when the data size is big; it can easily process and store large amounts of data quite effectively compared to a traditional RDBMS.
Throughput − Throughput means the amount of data processed in a particular period of time. An RDBMS has lower throughput than Apache Hadoop; Hadoop provides higher throughput, and this is one of the reasons behind the heavy usage of Hadoop over traditional database management systems.
Data variety − Data variety generally means the type of data to be processed. An RDBMS can process structured data and cannot be used to manage unstructured data. Hadoop has the ability to process and store all varieties of data, whether structured, semi-structured or unstructured, so it can also be used to manage unstructured data. In this respect Hadoop is far better suited than a traditional relational database management system.
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Pre-installation Setup
Java must be installed; it is a prerequisite for running Hadoop on your system.
Creating a User
At the beginning, it is recommended to create a separate user for Hadoop to isolate Hadoop
file system from Unix file system. Follow the steps given below to create a user −
Switch to the root user using the command "su".
Create a user from the root account using the command “useradd username”.
Now you can open an existing user account using the command “su username”.
Open the Linux terminal and type the following commands to create a user.
% su
password:
% useradd hadoop
% passwd hadoop
New passwd:
Retype new passwd
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the
distribution in a sensible location, such as /usr/local (/opt is another standard choice; note
that Hadoop should not be installed in a user’s home directory, as that may be an NFS-
mounted directory):
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
You also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
It’s convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The following commands are used for generating a key-value pair using SSH, copying the public key from id_rsa.pub to authorized_keys, and providing the owner with read and write permissions to the authorized_keys file respectively.
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% chmod 0600 ~/.ssh/authorized_keys
Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". It is required to make changes in those configuration files according to your Hadoop infrastructure.
% cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in
your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and size of
Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, namenode
path, and datanode paths of your local file systems. It means the place where you want to
store the Hadoop infrastructure.
Let us assume the following data:
dfs.replication (data replication value) = 1
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open the hdfs-site.xml file and add the following properties in between the <configuration>, </configuration> tags.
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
Note − In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
% cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Moving data in and out of Hadoop
The first step in working with data in Hadoop is to make it available to Hadoop. There are two primary methods that can be used for moving data into Hadoop: writing external data at the HDFS level (a data push), or reading external data at the MapReduce level (more like a pull). Reading data in MapReduce has advantages in the ease with which the operation can be parallelized and made fault tolerant. Not all data is accessible from MapReduce, however, in which case the data must first be pushed into HDFS.
The traditional method of transferring data into HDFS is to use the put command. Let us see how to use the put command.
The main challenge in handling log data is in moving the logs produced by multiple servers to the Hadoop environment.
Hadoop File System Shell provides commands to insert data into Hadoop and read from it.
You can insert data into Hadoop using the put command as shown below.
$ hadoop fs -put /path of the required file /path in HDFS where to save the file
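The same transfer can also be done programmatically through the HDFS Java API. The sketch below is a minimal, hedged example; the local log file and the HDFS destination directory are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml for the file system URI
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/var/log/app/access.log");   // hypothetical local file
        Path hdfsDir = new Path("/user/hadoop/logs/");           // hypothetical HDFS directory

        // Equivalent of "hadoop fs -put <local path> <hdfs path>"
        fs.copyFromLocalFile(localFile, hdfsDir);
        System.out.println("Copied " + localFile + " to " + hdfsDir);
        fs.close();
    }
}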
We can use the put command of Hadoop to transfer data from these sources to HDFS. But,
it suffers from the following drawbacks −
Using the put command, we can transfer only one file at a time, while the data generators generate data at a much higher rate. Since analysis made on older data is less accurate, we need to have a solution to transfer data in real time.
If we use the put command, the data needs to be packaged and ready for upload. Since web servers generate data continuously, this is a very difficult task.
What we need here is a solution that can overcome the drawbacks of the put command and transfer the "streaming data" from data generators to centralized stores (especially HDFS) with less delay.
In HDFS, the file exists as a directory entry and the length of the file will be considered as
zero till it is closed. For example, if a source is writing data into HDFS and the network was
interrupted in the middle of the operation (without closing the file), then the data written in
the file will be lost.
Therefore we need a reliable, configurable, and maintainable system to transfer the log data
into HDFS.
Available Solutions
To send streaming data (log files, events etc..,) from various sources to HDFS, we have the
following tools available at our disposal −
Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log data. It is
designed to scale to a very large number of nodes and be robust to network and node
failures.
Apache Kafka
Kafka is a distributed publish-subscribe messaging system that can handle high-throughput streams of events and feed them to consumers such as HDFS.
Apache Flume
Flume is a tool for collecting, aggregating and moving large amounts of streaming data (such as log data) from various sources into a centralized store such as HDFS.
Moving data out of Hadoop
Data that has been processed in Hadoop may be read in place by external systems, or it will be pushed out of Hadoop. An example of this scenario would be one where you pulled some data from an OLTP database, performed some machine learning activities on that data, and then copied the results back into the OLTP database for use by your production systems.
In this section we’ll cover how to automate moving regular files from HDFS to a local
filesystem. We’ll also look at data egress to relational databases and HBase. To start off we’ll
look at how to copy data out of Hadoop using the HDFS Slurper.
Egress to a local filesystem
The challenge to using a local filesystem for egress (and ingress for that matter) is that map
and reduce tasks running on clusters won’t have access to the filesystem on a specific server.
You need to leverage one of the following three broad options for moving data from HDFS to
a filesystem:
1 Host a proxy tier on a server, such as a web server, which you would then write
to using MapReduce.
2 Write to the local filesystem in MapReduce and then as a postprocessing step
trigger a script on the remote server to move that data.
3 Run a process on the remote server to pull data from HDFS directly.
The third option is the preferred approach because it’s the simplest and most efficient,
and as such is the focus of this section. We’ll look at how you can use the HDFS
Slurper to automatically move files from HDFS out to a local filesystem.
MapReduce has a simple model of data processing: the inputs and outputs of the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Before processing, the job needs to know which data to process; this is achieved with the InputFormat class. An InputFormat is the class which selects the files from HDFS that should be input to the map function. An InputFormat is also responsible for creating the input splits and dividing them into records. The data is divided into a number of splits, and each such InputSplit is processed by a single map.
The InputFormat's getSplits() function computes the splits for each file and sends them to the JobTracker, which uses their storage locations to schedule map tasks to process them on the TaskTrackers. The map task then passes the split to the createRecordReader() method of the InputFormat on the TaskTracker to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper. The "start" is the byte position in the file where the RecordReader should start generating key-value pairs, and the "end" is where it should stop reading records; the RecordReader keeps reading from the input split until the file reading is completed.
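To make the InputFormat/RecordReader relationship concrete, here is a hedged sketch of a custom input format that wraps Hadoop's standard LineRecordReader and simply trims whitespace from each line before handing it to the mapper. The class names are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A custom InputFormat: splits are computed by FileInputFormat, and
// createRecordReader() returns the reader used to turn a split into records.
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new TrimmedLineRecordReader();
    }

    private static class TrimmedLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final Text trimmed = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // The split carries the start/end byte offsets described above.
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!delegate.nextKeyValue()) {
                return false;                      // no more records in this split
            }
            trimmed.set(delegate.getCurrentValue().toString().trim());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();       // byte offset of the current line
        }

        @Override
        public Text getCurrentValue() {
            return trimmed;                        // the trimmed line contents
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

A driver would enable it with job.setInputFormatClass(TrimmedTextInputFormat.class).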
RecordWriter is the class which handles the job of taking an individual key-value pair (i.e. the output from the reducer) and writing it to the location prepared by the OutputFormat. RecordWriter implements 'write' and 'close'. The write function takes key-value pairs from the MapReduce job and writes the bytes to HDFS. The close function closes the Hadoop data stream to the output file.
Input Formats
III. KeyValueTextInputFormat − Each line in the file is treated as a record in which the key and value are separated by a delimiter (a tab character by default). This matches the output produced by TextOutputFormat, Hadoop's default OutputFormat, so to interpret such files correctly, KeyValueTextInputFormat is appropriate.
IV. StreamInputFormat − Hadoop comes with an InputFormat for streaming which can be used outside streaming and can be used for processing XML documents. You can use it
by setting your input format to StreamInputFormat and setting the
stream.recordreader.class property to
org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader. The reader is
configured by setting job configuration properties to tell it the patterns for the start and
end tags.
VIII. Multiple Inputs − The input to a MapReduce job may consist of multiple input files; by default, all of the input is interpreted by a single InputFormat and a single Mapper. What often happens is that the data format evolves over time, so you have to write your mapper to cope with all of your legacy formats. Or you may have data sources that provide the same type of data but in different formats. This arises in the case of performing joins of different datasets. For instance, one might be tab-separated plain text, and the other a binary sequence file. Even if they are in the same format, they may have different representations, and therefore need to be parsed differently. These cases are handled elegantly by using the MultipleInputs class, which allows you to specify which InputFormat and Mapper to use on a per-path basis (see the sketch after this list).
IX. Database Input – DBInputFormat is an input format for reading data from a relational
database, using JDBC. Because it doesn’t have any sharding capabilities, you need to be
careful not to overwhelm the database from which you are reading by running too many
mappers.
For this reason, it is best used for loading relatively small datasets. The corresponding
output format is DBOutputFormat, which is useful for dumping job outputs (of modest
size) into a database.
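As referenced in the Multiple Inputs item above, the following is a hedged driver-side sketch of the MultipleInputs class. The input paths are hypothetical, and the mapper and reducer stubs are illustrative placeholders that simply pass records through grouped by key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFormatDriver {

    // Mapper for plain text lines: the key is the byte offset, the value is the whole line.
    public static class PlainTextMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // Mapper for key/value text files: the tab split is already done by the InputFormat.
    public static class KeyValueMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // A single reducer sees records from both sources grouped by key.
    public static class GroupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                joined.append(v.toString()).append('|');
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs example");
        job.setJarByClass(MultiFormatDriver.class);

        // Each input path gets its own InputFormat and its own Mapper (hypothetical paths).
        MultipleInputs.addInputPath(job, new Path("/data/plain"),
                TextInputFormat.class, PlainTextMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/kv"),
                KeyValueTextInputFormat.class, KeyValueMapper.class);

        job.setReducerClass(GroupReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}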
Output Formats
Hadoop has output data formats that correspond to the input formats.
I. Text Output – The default output format, TextOutputFormat, writes records as lines of
text. Its keys and values may be of any type, since TextOutputFormat turns them to
strings by calling toString() on them. Each key-value pair is separated by a tab
character, although that may be changed using the
mapreduce.output.textoutputformat.separator property.
II. SequenceFileOutputFormat − This is for writing binary output. As the name indicates, SequenceFileOutputFormat writes sequence files for its output. This is a good choice of output if it forms the input to a further MapReduce job, since it is compact and readily compressed.
III. SequenceFileAsBinaryOutputFormat − This is the counterpart to SequenceFileAsBinaryInputFormat; it writes keys and values in raw binary format into a sequence file container.
IV. MapFileOutputFormat – MapFileOutputFormat writes map files as output. The keys in
a MapFile must be added in order, so you need to ensure that your reducers emit keys
in sorted order.
V. Multiple Outputs − Sometimes there is a need to have more control over the naming of the output files or to produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to help you do this.
VI. Lazy Output − FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the job configuration and the underlying output format (see the sketch after this list).
VII. Database Output – The output formats for writing to relational databases and to
HBase.
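As referenced in items V and VI above, the following is a hedged driver-side sketch showing LazyOutputFormat together with a named output registered via MultipleOutputs. The job name, output path and named-output name ("errors") are hypothetical, and the mapper/reducer classes are assumed to be defined elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output formats example");
        job.setJarByClass(OutputConfigExample.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Lazy output: wrap the real output format so empty part files are not created.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // Multiple outputs: register an extra named output. A reducer can then write to it
        // by creating MultipleOutputs<Text, IntWritable> mos = new MultipleOutputs<>(context)
        // in setup() and calling mos.write("errors", key, value), closing mos in cleanup().
        MultipleOutputs.addNamedOutput(job, "errors",
                SequenceFileOutputFormat.class, Text.class, IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path("/data/out"));   // hypothetical path
        // ... set the mapper and reducer classes as usual, then submit:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}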
Unit II
Introduction to Hive:
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive Architecture Components
User Interface − Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store − Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine − HiveQL is similar to SQL for querying the schema information on the Metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a HiveQL query for the MapReduce job and process it.
Execution Engine − The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE − The Hadoop Distributed File System or HBASE are the data storage techniques used to store data into the file system.
Installation of Hive: All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system, so you need to install a Linux-flavoured OS. The following simple steps are executed for Hive installation. There are two prerequisites for installing Hive, i.e. Hadoop and Java must be installed.
Step 1: Verifying Java Installation:
Java must be installed on your system before installing Hive. Let us verify the Java installation using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response.
java version “1.7.0_71”
Java(TM) SE Runtime Environment(build 1.7.0_71-b13)
Java Hotspot(TM) Client VM(build 25.0-b02,mixed mode)
Verifying Hadoop Installation:The following steps are used to verify the Hadoop
installation.
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The following command is used to start dfs. Executing this command will start your
Hadoop file system.
$ start-dfs.sh
The following command is used to start the yarn script. Executing this command will start
your yarn daemons.
$ start-yarn.sh
The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/
The default port number to access all applications of cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/
The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.
The following commands are used to verify the download and extract the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and
copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Metastore. We use Apache Derby database.
Follow the steps given below to download and install Apache Derby:
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-
bin.tar.gz
$ ls
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy from the super user “su -”. The following commands are used to copy the
files from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ source ~/.bashrc
$ mkdir $DERBY_HOME/data
Configuring Metastore means specifying to Hive where the database is stored. You can do
this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all,
copy the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w) for these newly created folders as shown below.
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ cd $HIVE_HOME
$ bin/hive
SQL
The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive is highly scalable, as it can serve both purposes, i.e. large data set processing (batch query processing) and real-time processing (interactive query processing). Hive queries get internally converted into MapReduce programs.
It supports all primitive data types of SQL. You can use predefined functions, or also write tailored user-defined functions (UDFs) to accomplish your specific needs.
HiveQL
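As a hedged illustration of HiveQL, the sketch below runs a couple of HiveQL statements through the HiveServer2 JDBC driver from Java. The table name and columns, and the connection details (localhost:10000, user "hadoop", empty password) are assumptions, not values from these notes; the hive-jdbc library must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hadoop", "");
        Statement stmt = con.createStatement();

        // HiveQL looks very much like SQL: DDL plus SELECT queries.
        stmt.execute("CREATE TABLE IF NOT EXISTS employee "
                + "(id INT, name STRING, salary FLOAT) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

        ResultSet rs = stmt.executeQuery(
                "SELECT name, salary FROM employee WHERE salary > 40000");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getFloat(2));
        }
        con.close();
    }
}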
Introduction to PIG:
Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the reducer and mapper, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces the time of development using the multi-query approach. Pig is also beneficial for programmers who are not from a Java background: 200 lines of Java code can be written in only 10 lines using the Pig Latin language. Programmers who have SQL knowledge need less effort to learn Pig Latin.
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's researchers. At that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely large datasets. In the year 2007, it moved to the Apache Software Foundation (ASF), which made it an open source project. The first version (0.1) of Pig came out in the year 2008; the latest version, 0.17, came out in 2017.
Features of Pig:
For performing several operations, Apache Pig provides a rich set of operators such as filter, join, sort, etc.
It is easy to learn, read, and write; especially for SQL programmers, Apache Pig is a boon.
Apache Pig is extensible, so you can write your own user-defined functions and processes.
Pig can handle the analysis of both structured and unstructured data.
Introduction to NoSQL:
NoSQL (originally referring to "non-SQL" or "non-relational") is a database that provides a mechanism for storage and retrieval of data. This data is modeled in means other than the tabular relations used in relational databases. Such databases came into existence in the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century. NoSQL databases are used in real-time web applications and big data, and their use is increasing over time. NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact that they may support SQL-like query languages.
A NoSQL database offers simplicity of design, simpler horizontal scaling to clusters of machines, and finer control over availability. The data structures used by NoSQL databases are different from those used by default in relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it should solve. Data structures used by NoSQL databases are sometimes also viewed as more flexible than relational database tables.
Many NoSQL stores compromise consistency in favor of availability, speed and partition tolerance. Barriers
to the greater adoption of NoSQL stores include the use of low-level query languages, lack of standardized
interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true
ACID(Atomicity, Consistency, Isolation, Durability) transactions but a few databases, such as MarkLogic,
Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and
OrientDB have made them central to their designs.
Most NoSQL databases offer a concept of eventual consistency in which database changes are propagated
to all nodes so queries for data might not return updated data immediately or might result in reading data
that is not accurate which is a problem known as stale reads. Also some NoSQL systems may exhibit lost
writes and other forms of data loss. Some NoSQL systems provide concepts such as write-ahead logging to
avoid data loss. For distributed transaction processing across multiple databases, data consistency is an
even bigger challenge. This is difficult for both NoSQL and relational databases. Even current relational
databases do not allow referential integrity constraints to span databases. There are few systems that
maintain both X/Open XA standards and ACID transactions for distributed transaction processing.
Introduction to cloud computing
Cloud Computing is the delivery of computing services such as servers, storage, databases,
networking, software, analytics, intelligence, and more, over the Cloud (Internet).
Instead of an organization buying and maintaining its own physical hardware, with Cloud Computing a cloud vendor is responsible for the hardware purchase and maintenance. Cloud vendors also provide a wide variety of software and platforms as a service. We can take any required services on rent, and the cloud computing services will be charged based on usage.
The cloud environment provides an easily accessible online portal that makes it handy for the user to manage the compute, storage, network, and application resources. Some cloud service providers are the following:
Amazon Web Services, IBM Cloud, Google Cloud Platform, Terremark, Joyent, Rackspace, DigitalOcean, and Microsoft.
Types of cloud:
o Public Cloud: The cloud resources that are owned and operated by a third-party cloud
service provider are termed as public clouds. It delivers computing resources such as
servers, software, and storage over the internet
o Private Cloud: The cloud computing resources that are exclusively used inside a single
business or organization are termed as a private cloud. A private cloud may physically
be located on the company’s on-site datacentre or hosted by a third-party service
provider.
o Hybrid Cloud: It is the combination of public and private clouds, bound together by technology that allows data and applications to be shared between them. A hybrid cloud provides flexibility and more deployment options to the business.
The benefits of cloud computing have been spelled out extensively over a long time. Some of the stated
benefits are closely related, and I have summarized the major ones here:
Benefits:
Scalability and elasticity – Cloud is massively scalable, and allows organizations to grow their users from a
handful to hundreds virtually overnight. It only takes an order for additional subscriptions and a payment
to the cloud service provider. Elasticity is similar and allows for a sudden change in cloud computing
resources to respond to spikes in demand.
Accessibility and reliability – All you need to access a cloud service is a current subscription, a good
Internet connection and an internet-enabled device (e.g. desktop, tablet, phone). Cloud service providers
use redundant IT resources and a quick failover mechanism, and many of them offer a 24/7/365 and 99.9%
uptime guarantee.
Cost and operational efficiency – Cloud is cost-effective, since one uses the shared infrastructure of the
cloud service provider via pay-as-you-go modes of payment. Cloud also enhances operational efficiency,
since administrative tasks (e.g. software upgrades, storage increase, data backup) are off-loaded to the
cloud service provider.
Rapid and flexible deployment – Cloud service providers offer an ecosystem of ready-to-use services that
can be rapidly deployed with simple migration and configuration. Users may have the flexibility of choosing
online or installed deployment of cloud applications. Some service providers even offer the flexibility of
Public, Private and Hybrid Cloud.
Security and compatibility – Cloud service providers take the security of their systems very seriously to retain their customer base. They also keep their entire software stack updated and fully compatible to keep services up and running. Finally, cloud services can be expected to be compatible with a wide variety of mobile devices and web interfaces.
Challenges:
The challenges of cloud computing are known, but easily brushed aside or overlooked. Here are the key
challenges that you might have to deal with:
Internet connectivity – You need good Internet connectivity and a powered-up device to access the cloud.
This can be a challenge in a developing economy like Kenya, particularly outside the urban centers.
Accessing cloud services through public Wi-Fi could pose a risk, unless the necessary security measures are
taken.
Financial commitment – For most subscription plans you must make a monthly or annual financial
commitment. The service ceases once you stop payment, and in the worst case you might lose access to
your business data. Compare this to buying a permanent software license, which you only maintain for
good reason.
Data security and protection – Your cloud service provider could have the best security certifications, but
there’s no guarantee that you won’t lose your data. Cloud service providers might even abuse your data in
disregard of privacy concerns. Hackers are increasingly targeting cloud storage for their abundance of
sensitive data.
Readiness and maturity – Cloud requires a new thinking about computing, and adoption will fail if the
culture doesn’t change. Cloud buying decisions are increasingly made by functional managers and
influenced by end-user requirements. Managing the requirements and delivering the envisioned benefits
requires a high level of IT maturity.
Interoperability – Some of your existing applications might not be available as a cloud service. In addition,
you have little control over the cloud services that you subscribe to. Therefore, integration between
services from different service providers and applications that run on your organization’s infrastructure
could present a real problem.
Virtualization is the backbone of Cloud Computing; Cloud Computing brings efficient benefits and becomes more convenient with the help of virtualization, which also provides solutions for great challenges in the field of data security and privacy protection.
Virtualization is the imitation of hardware within a software program: it allows a single computer to play the role of multiple computers. Running a separate physical file server and web server doubles the cost of purchase, maintenance, depreciation, energy, and floor space; by creating virtual web or file servers instead, all of our objectives are fulfilled, such as using hardware resources to the maximum, flexibility, improvement in security, and reduced cost. Efficient use of resources, increased security, portability, problem-free testing, easier manageability, increased flexibility, fault isolation, and rapid deployment are the benefits of virtualization.
Virtualization in Cloud Computing:
Virtualization is used for reaching a high level of availability (or improving availability) and for capacity improvement.
Benefits of Cloud Computing in business: Cloud computing tends to be different from other computing concepts. Basically, it supports interactive and user-friendly web applications. Different people will have their own perspective: some consider cloud computing to be virtualized computer resources, dynamic development, and software deployment. In today's world, cloud computing has played an important role, especially in business. [66] found that cloud computing, as an innovative technology, helps the organization to stay competitive among others. It is able to bring various benefits to a business. Cloud computing is able to provide improved new capabilities which traditional IT solutions cannot provide.