Big Data Hadoop
Big data is a collection of massive and complex data sets characterized by huge data volumes, demanding data management capabilities, social media analytics, and real-time data. Big data analytics is the process of examining these large amounts of heterogeneous digital data. Big data is about data volume and large data sets measured in terms of terabytes or petabytes, and the process of examining such data to extract useful information is known as big data analytics. The challenges include capturing, analysis, storage, searching, sharing, visualization, transferring, and privacy violations. Big data can neither be worked upon using traditional SQL queries, nor can a relational database management system (RDBMS) be used for its storage. As a result, a wide variety of scalable database tools and techniques has evolved. Hadoop, an open source framework for distributed data processing, is one of the prominent and well-known solutions. NoSQL offers non-relational databases such as MongoDB and Apache HBase.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
Types of Big Data
Big data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data. Over the period of time, talent in computer science has achieved greater success in developing techniques for working with such kind of data (where the format is well known in advance) and also deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Example row from an 'Employee' table in a relational database: 7465 | Shushil Roy | Male | Admin | 500000
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to the size being huge, unstructured data poses multiple challenges in terms of its processing for deriving value out of it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available with them but, unfortunately, they don't know how to derive value out of it since this data is in its raw form or unstructured format.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics of Big Data (the 5 V's)
1. Volume:
Volume refers to the huge amount of data that is generated and stored; the size of the data plays a crucial role in determining its value.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. Also, by the year 2020 we will have almost 40,000 exabytes of data.
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones etc.
There is a massive and continuous flow of data. This determines the potential of data that how fast the
data is generated and processed to meet the demands.
Sampling data can help in dealing with issues like velocity.
Example: More than 3.5 billion searches are made on Google per day. Also, Facebook users are increasing by approximately 22% year on year.
3. Variety:
Variety refers to nature of data that is structured, semi-structured and unstructured data.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise.
It can be structured, semi-structured and unstructured.
Structured data: This data is basically organized data. It generally refers to data that has a defined length and format.
Semi-structured data: This data is basically semi-organized data. It is generally a form of data that does not conform to the formal structure of data. Log files are an example of this type of data.
Unstructured data: This data basically refers to unorganized data. It generally refers to data that doesn't fit neatly into the traditional row and column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
4. Veracity:
It refers to inconsistencies and uncertainty in data; that is, the available data can sometimes get messy, and quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk could create confusion, whereas a small amount of data could convey half or incomplete information.
5. Value:
After having the 4 V’s into account there comes one more V which stands for Value!. The bulk of Data
having no Value is of no good to the company, unless you turn it into something useful.
Data in itself is of no use or importance but it needs to be converted into something valuable to extract
Information. Hence, you can state that Value! is the most important V of all the 5V’s.
Security Challenges:
Advanced analytic tools for unstructured big data and non-relational (NoSQL) databases are newer technologies in active development, and it can be difficult for security software and processes to keep pace with them.
Mature security tools effectively protect data ingress and storage. However, they may not have the same impact on data output from multiple analytics tools to multiple locations.
Big data administrators may decide to mine data without permission or notification. Whether the motivation is curiosity or criminal profit, security tools need to monitor and alert on suspicious access.
The sheer size of a big data installation, terabytes to petabytes, is too big for routine security audits. And because most big data platforms are cluster-based, this introduces multiple vulnerabilities across multiple nodes and servers.
If the big data owner does not regularly update security for the environment, they are at risk of data loss and exposure.
Security tools need to monitor and alert on suspicious malware infection on the system, database or a web CMS such as WordPress, and big data security experts must be proficient in cleanup and remediation.
Importance of Big Data Analytics:
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. In his report Big Data in Big Companies, IIA Director of Research Tom Davenport interviewed more than 50 businesses to understand how they used big data. He found they got value in the following ways:
1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient ways of doing
business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with
the ability to analyze new sources of data, businesses are able to analyze information immediately – and make
decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers' needs.
Applications of Big Data:
1. Banking and Securities: For monitoring financial markets through network activity monitors
and natural language processors to reduce fraudulent transactions. Exchange Commissions or
Trading Commissions are using big data analytics to ensure that no illegal trading happens by
monitoring the stock market.
2. Communications and Media: For real-time reporting of events around the globe on several platforms (mobile, web and TV) simultaneously. The music industry, a segment of media, is using big data to keep an eye on the latest trends, which are ultimately used by auto-tuning software to generate catchy tunes.
3. Sports: To understand the patterns of viewership of different events in specific regions and also
monitor the performance of individual players and teams by analysis. Sporting events like Cricket
world cup, FIFA world cup and Wimbledon make special use of big data analytics.
4. Healthcare: To collect public health data for faster responses to individual health problems and
identify the global spread of new virus strains such as Ebola. Health Ministries of different countries
incorporate big data analytic tools to make proper use of data collected after Census and surveys.
5. Education: To update and upgrade prescribed literature for a variety of fields which are
witnessing rapid development. Universities across the world are using it to monitor and track the
performance of their students and faculties and map the interest of students in different subjects via
attendance.
6. Insurance: For everything from developing new products to handling claims through predictive
analytics. Insurance companies use business big data to keep a track of the scheme of policy which is
the most in demand and is generating the most revenue.
7. Consumer Trade: To predict and manage staffing and inventory requirements. Consumer
trading companies are using it to grow their trade by providing loyalty cards and keeping a track of
them.
8. Transportation: For better route planning, traffic monitoring and management, and logistics.
This is mainly incorporated by governments to avoid congestion of traffic in a single place.
9. Energy: By introducing smart meters to reduce electrical leakages and help users to manage
their energy usage. Load dispatch centers are using big data analysis to monitor the load patterns and
identify the differences between the trends of energy consumption based on different parameters
and as a way to incorporate daylight savings.
Hadoop:
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has four modules: HDFS for storage, YARN for resource management, MapReduce for parallel processing, and the Hadoop Common libraries. These are described in the sections that follow.
MapReduce:
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once
we write an application in the MapReduce form, scaling the application to run over hundreds, thousands,
or even tens of thousands of machines in a cluster is merely a configuration change. This simple
scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
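For a concrete illustration of the map and reduce stages described above, the following is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are supplied as command-line arguments and are placeholders, not values from these notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: all counts for the same word are summed into a single total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job would typically be launched with a command of the form "hadoop jar wordcount.jar WordCount <input dir> <output dir>".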
Hadoop HDFS
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware. HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing of very large data sets.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of the cluster.
It provides streaming access to file system data.
It provides file permissions and authentication, and fault tolerance through data replication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks −
It manages the file system namespace.
It regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as per the need by changing the HDFS configuration.
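To show how a client sees these blocks, here is a hedged sketch using the HDFS Java API. The file path /user/hadoop/sample.txt is a hypothetical placeholder, and the cluster configuration is assumed to be available on the classpath (core-site.xml / hdfs-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file into HDFS (hypothetical path).
        Path file = new Path("/user/hadoop/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Inspect how the file is split into blocks and where the blocks are stored.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}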
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near the
data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for big data processing.
YARN basically separates the resource management layer from the processing layer. With YARN, the responsibilities that the Hadoop 1.0 Job Tracker carried alone are split between the Resource Manager and the Application Master.
YARN Features: YARN gained popularity because of the following features −
Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without disruption, thus making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy: It allows multiple engines to access the cluster, thus giving organizations the benefit of multi-tenancy.
The main components of YARN architecture include:
Client: It submits map-reduce jobs.
Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Application Master: An application is a single job submitted to a framework. The Application Master is responsible for negotiating resources with the Resource Manager, and for tracking the status and monitoring the progress of a single application. The Application Master requests a container launch from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends its report to the Resource Manager from time to time.
Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked by a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
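As a hedged illustration of step 7, the sketch below uses the YARN client Java API to ask the Resource Manager for a report of every known application and print its state and progress. It assumes the cluster configuration is available on the classpath; the class name is hypothetical.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager using the cluster configuration.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a report of every known application.
        List<ApplicationReport> reports = yarnClient.getApplications();
        for (ApplicationReport report : reports) {
            System.out.println(report.getApplicationId()
                    + "  name=" + report.getName()
                    + "  state=" + report.getYarnApplicationState()
                    + "  progress=" + (report.getProgress() * 100) + "%");
        }
        yarnClient.stop();
    }
}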
Difference between RDBMS and Hadoop:
Data volume − An RDBMS works better when the volume of data is low (in gigabytes). When the data size is huge, i.e. in terabytes and petabytes, an RDBMS fails to give the desired results. Hadoop was meant to handle very large data sizes, so it works better when the data size is big; it can easily process and store large amounts of data quite effectively compared to a traditional RDBMS.
Throughput − Throughput means the amount of data processed in a particular period of time. An RDBMS has lower throughput than Apache Hadoop; Hadoop provides higher throughput, and this is one of the reasons behind the heavy usage of Hadoop over traditional database management systems.
Data variety − Data variety generally means the type of data to be processed. An RDBMS can process structured data and cannot be used to manage unstructured data. Hadoop has the ability to process and store all varieties of data, whether structured, semi-structured or unstructured, so it can also be used to manage unstructured data. In this respect Hadoop is far better suited than a traditional relational database management system.
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Pre-installation Setup
Java must be installed; it is a prerequisite for running Hadoop on your system.
Creating a User
At the beginning, it is recommended to create a separate user for Hadoop to isolate Hadoop
file system from Unix file system. Follow the steps given below to create a user −
Switch to the root user using the command "su".
Create a user from the root account using the command “useradd username”.
Now you can open an existing user account using the command “su username”.
Open the Linux terminal and type the following commands to create a user.
% su
password:
% useradd hadoop
% passwd hadoop
New passwd:
Retype new passwd
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the
distribution in a sensible location, such as /usr/local (/opt is another standard choice; note
that Hadoop should not be installed in a user’s home directory, as that may be an NFS-
mounted directory):
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
You also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
It’s convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The following commands are used for generating a key-value pair using SSH, copying the public key from id_rsa.pub to authorized_keys, and providing the owner with read and write permissions to the authorized_keys file respectively.
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% chmod 0600 ~/.ssh/authorized_keys
Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". It is required to make changes in those configuration files according to your Hadoop infrastructure.
% cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in
your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and size of
Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, namenode
path, and datanode paths of your local file systems. It means the place where you want to
store the Hadoop infrastructure.
Let us assume the following data:
dfs.replication (data replication value) = 1
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open the hdfs-site.xml file and add the following properties in between the <configuration>, </configuration> tags.
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
Note − In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
% cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Moving data in and out of Hadoop
The first step in working with data in Hadoop is to make it available to Hadoop. There are two primary methods that can be used for moving data into Hadoop: writing external data at the HDFS level (a data push), or reading external data at the MapReduce level (more like a pull). Reading data in MapReduce has advantages in the ease with which the operation can be parallelized and made fault tolerant. Not all data is accessible from MapReduce, however, in which case the data must first be pushed into HDFS.
The traditional method of transferring data into HDFS is to use the put command. Let us see how to use the put command.
The main challenge in handling log data is in moving the logs produced by multiple servers to the Hadoop environment.
Hadoop File System Shell provides commands to insert data into Hadoop and read from it.
You can insert data into Hadoop using the put command as shown below.
$ hadoop fs -put /path of the required file /path in HDFS where to save the file
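The same transfer can also be done programmatically through the HDFS Java API. The sketch below is a minimal, hedged example; the local log file and the HDFS destination directory are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml for the file system URI
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/var/log/app/access.log");   // hypothetical local file
        Path hdfsDir = new Path("/user/hadoop/logs/");           // hypothetical HDFS directory

        // Equivalent of "hadoop fs -put <local path> <hdfs path>"
        fs.copyFromLocalFile(localFile, hdfsDir);
        System.out.println("Copied " + localFile + " to " + hdfsDir);
        fs.close();
    }
}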
We can use the put command of Hadoop to transfer data from these sources to HDFS. But,
it suffers from the following drawbacks −
Using the put command, we can transfer only one file at a time, while the data generators generate data at a much higher rate. Since analysis made on older data is less accurate, we need to have a solution to transfer data in real time.
If we use the put command, the data needs to be packaged and ready for upload. Since web servers generate data continuously, this is a very difficult task.
What we need here is a solution that can overcome the drawbacks of the put command and transfer the "streaming data" from data generators to centralized stores (especially HDFS) with less delay.
In HDFS, the file exists as a directory entry and the length of the file will be considered as
zero till it is closed. For example, if a source is writing data into HDFS and the network was
interrupted in the middle of the operation (without closing the file), then the data written in
the file will be lost.
Therefore we need a reliable, configurable, and maintainable system to transfer the log data
into HDFS.
Available Solutions
To send streaming data (log files, events etc..,) from various sources to HDFS, we have the
following tools available at our disposal −
Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log data. It is
designed to scale to a very large number of nodes and be robust to network and node
failures.
Apache Kafka
Kafka is a distributed publish-subscribe messaging system that can handle high-throughput streams of events and feed them to consumers such as HDFS.
Apache Flume
Flume is a tool for collecting, aggregating and moving large amounts of streaming data (such as log data) from various sources into a centralized store such as HDFS.
Moving data out of Hadoop
Data that has been processed in Hadoop may be read in place by external systems, or it will be pushed out of Hadoop. An example of this scenario would be one where you pulled some data from an OLTP database, performed some machine learning activities on that data, and then copied the results back into the OLTP database for use by your production systems.
In this section we’ll cover how to automate moving regular files from HDFS to a local
filesystem. We’ll also look at data egress to relational databases and HBase. To start off we’ll
look at how to copy data out of Hadoop using the HDFS Slurper.
Egress to a local filesystem
The challenge to using a local filesystem for egress (and ingress for that matter) is that map
and reduce tasks running on clusters won’t have access to the filesystem on a specific server.
You need to leverage one of the following three broad options for moving data from HDFS to
a filesystem:
1 Host a proxy tier on a server, such as a web server, which you would then write
to using MapReduce.
2 Write to the local filesystem in MapReduce and then as a postprocessing step
trigger a script on the remote server to move that data.
3 Run a process on the remote server to pull data from HDFS directly.
The third option is the preferred approach because it’s the simplest and most efficient,
and as such is the focus of this section. We’ll look at how you can use the HDFS
Slurper to automatically move files from HDFS out to a local filesystem.
MapReduce has a simple model of data processing: the inputs and outputs of the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Before processing, the job needs to know which data to process; this is achieved with the InputFormat class. An InputFormat is the class which selects the files from HDFS that should be input to the map function. An InputFormat is also responsible for creating the input splits and dividing them into records. The data is divided into a number of splits, and each such InputSplit is processed by a single map.
The InputFormat's getSplits() function computes the splits for each file and sends them to the JobTracker, which uses their storage locations to schedule map tasks to process them on the TaskTrackers. The map task then passes the split to the createRecordReader() method of the InputFormat on the TaskTracker to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper. The "start" is the byte position in the file where the RecordReader should start generating key-value pairs, and the "end" is where it should stop reading records; the RecordReader keeps reading from the input split until the file reading is completed.
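To make the InputFormat/RecordReader relationship concrete, here is a hedged sketch of a custom input format that wraps Hadoop's standard LineRecordReader and simply trims whitespace from each line before handing it to the mapper. The class names are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A custom InputFormat: splits are computed by FileInputFormat, and
// createRecordReader() returns the reader used to turn a split into records.
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new TrimmedLineRecordReader();
    }

    private static class TrimmedLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final Text trimmed = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // The split carries the start/end byte offsets described above.
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!delegate.nextKeyValue()) {
                return false;                      // no more records in this split
            }
            trimmed.set(delegate.getCurrentValue().toString().trim());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();       // byte offset of the current line
        }

        @Override
        public Text getCurrentValue() {
            return trimmed;                        // the trimmed line contents
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

A driver would enable it with job.setInputFormatClass(TrimmedTextInputFormat.class).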
RecordWriter is the class which handles the job of taking an individual key-value pair (i.e. the output from the reducer) and writing it to the location prepared by the OutputFormat. RecordWriter implements 'write' and 'close'. The write function takes key-value pairs from the MapReduce job and writes the bytes to HDFS. The close function closes the Hadoop data stream to the output file.
Input Formats
III. KeyValueTextInputFormat − Each line in the file is treated as a record in which the key and value are separated by a delimiter (a tab character by default). This matches the output produced by TextOutputFormat, Hadoop's default OutputFormat, so to interpret such files correctly, KeyValueTextInputFormat is appropriate.
IV. StreamInputFormat − Hadoop comes with an InputFormat for streaming which can be used outside streaming and can be used for processing XML documents. You can use it
by setting your input format to StreamInputFormat and setting the
stream.recordreader.class property to
org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader. The reader is
configured by setting job configuration properties to tell it the patterns for the start and
end tags.
VIII. Multiple Inputs − The input to a MapReduce job may consist of multiple input files; by default, all of the input is interpreted by a single InputFormat and a single Mapper. What often happens is that the data format evolves over time, so you have to write your mapper to cope with all of your legacy formats. Or you may have data sources that provide the same type of data but in different formats. This arises in the case of performing joins of different datasets. For instance, one might be tab-separated plain text, and the other a binary sequence file. Even if they are in the same format, they may have different representations, and therefore need to be parsed differently. These cases are handled elegantly by using the MultipleInputs class, which allows you to specify which InputFormat and Mapper to use on a per-path basis (see the sketch after this list).
IX. Database Input – DBInputFormat is an input format for reading data from a relational
database, using JDBC. Because it doesn’t have any sharding capabilities, you need to be
careful not to overwhelm the database from which you are reading by running too many
mappers.
For this reason, it is best used for loading relatively small datasets. The corresponding
output format is DBOutputFormat, which is useful for dumping job outputs (of modest
size) into a database.
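As referenced in the Multiple Inputs item above, the following is a hedged driver-side sketch of the MultipleInputs class. The input paths are hypothetical, and the mapper and reducer stubs are illustrative placeholders that simply pass records through grouped by key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFormatDriver {

    // Mapper for plain text lines: the key is the byte offset, the value is the whole line.
    public static class PlainTextMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // Mapper for key/value text files: the tab split is already done by the InputFormat.
    public static class KeyValueMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // A single reducer sees records from both sources grouped by key.
    public static class GroupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                joined.append(v.toString()).append('|');
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs example");
        job.setJarByClass(MultiFormatDriver.class);

        // Each input path gets its own InputFormat and its own Mapper (hypothetical paths).
        MultipleInputs.addInputPath(job, new Path("/data/plain"),
                TextInputFormat.class, PlainTextMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/kv"),
                KeyValueTextInputFormat.class, KeyValueMapper.class);

        job.setReducerClass(GroupReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}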
Output Formats
Hadoop has output data formats that correspond to the input formats.
I. Text Output – The default output format, TextOutputFormat, writes records as lines of
text. Its keys and values may be of any type, since TextOutputFormat turns them to
strings by calling toString() on them. Each key-value pair is separated by a tab
character, although that may be changed using the
mapreduce.output.textoutputformat.separator property.
II. SequenceFileOutputFormat − This is for writing binary output. As the name indicates, SequenceFileOutputFormat writes sequence files for its output. This is a good choice of output if it forms the input to a further MapReduce job, since it is compact and readily compressed.
III. SequenceFileAsBinaryOutputFormat − This is the counterpart to SequenceFileAsBinaryInputFormat; it writes keys and values in raw binary format into a sequence file container.
IV. MapFileOutputFormat – MapFileOutputFormat writes map files as output. The keys in
a MapFile must be added in order, so you need to ensure that your reducers emit keys
in sorted order.
V. Multiple Outputs − Sometimes there is a need to have more control over the naming of the output files or to produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to help you do this.
VI. Lazy Output − FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the job configuration and the underlying output format (see the sketch after this list).
VII. Database Output – The output formats for writing to relational databases and to
HBase.
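As referenced in items V and VI above, the following is a hedged driver-side sketch showing LazyOutputFormat together with a named output registered via MultipleOutputs. The job name, output path and named-output name ("errors") are hypothetical, and the mapper/reducer classes are assumed to be defined elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output formats example");
        job.setJarByClass(OutputConfigExample.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Lazy output: wrap the real output format so empty part files are not created.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // Multiple outputs: register an extra named output. A reducer can then write to it
        // by creating MultipleOutputs<Text, IntWritable> mos = new MultipleOutputs<>(context)
        // in setup() and calling mos.write("errors", key, value), closing mos in cleanup().
        MultipleOutputs.addNamedOutput(job, "errors",
                SequenceFileOutputFormat.class, Text.class, IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path("/data/out"));   // hypothetical path
        // ... set the mapper and reducer classes as usual, then submit:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}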
Unit II
Introduction to Hive:
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive Architecture Components
User Interface − Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store − Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine − HiveQL is similar to SQL for querying the schema information on the Metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a HiveQL query for the MapReduce job and process it.
Execution Engine − The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE − The Hadoop Distributed File System or HBASE are the data storage techniques used to store data into the file system.
Installation of Hive: All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system, so you need to install a Linux-flavoured OS. The following simple steps are executed for Hive installation. There are two prerequisites for installing Hive, i.e. Hadoop and Java must be installed.
Step 1: Verifying Java Installation:
Java must be installed on your system before installing Hive. Let us verify the Java installation using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response.
java version “1.7.0_71”
Java(TM) SE Runtime Environment(build 1.7.0_71-b13)
Java Hotspot(TM) Client VM(build 25.0-b02,mixed mode)
Verifying Hadoop Installation:The following steps are used to verify the Hadoop
installation.
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The following command is used to start dfs. Executing this command will start your
Hadoop file system.
$ start-dfs.sh
The following command is used to start the yarn script. Executing this command will start
your yarn daemons.
$ start-yarn.sh
The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/
The default port number to access all applications of cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/
The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.
The following commands are used to verify the download and extract the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and
copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Metastore. We use Apache Derby database.
Follow the steps given below to download and install Apache Derby:
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-
bin.tar.gz
$ ls
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy from the super user “su -”. The following commands are used to copy the
files from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ source ~/.bashrc
$ mkdir $DERBY_HOME/data
Configuring Metastore means specifying to Hive where the database is stored. You can do
this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all,
copy the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w) for these newly created folders as shown below.
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ cd $HIVE_HOME
$ bin/hive
SQL
The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive is highly scalable, as it can serve both purposes, i.e. large data set processing (batch query processing) and real-time processing (interactive query processing). Hive queries get internally converted into MapReduce programs.
It supports all primitive data types of SQL. You can use predefined functions, or also write tailored user-defined functions (UDFs) to accomplish your specific needs.
HiveQL
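As a hedged illustration of HiveQL, the sketch below runs a couple of HiveQL statements through the HiveServer2 JDBC driver from Java. The table name and columns, and the connection details (localhost:10000, user "hadoop", empty password) are assumptions, not values from these notes; the hive-jdbc library must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hadoop", "");
        Statement stmt = con.createStatement();

        // HiveQL looks very much like SQL: DDL plus SELECT queries.
        stmt.execute("CREATE TABLE IF NOT EXISTS employee "
                + "(id INT, name STRING, salary FLOAT) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

        ResultSet rs = stmt.executeQuery(
                "SELECT name, salary FROM employee WHERE salary > 40000");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getFloat(2));
        }
        con.close();
    }
}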
Introduction to PIG:
Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but these are not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the reducer and mapper, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces the time of development using the multi-query approach. Pig is also beneficial for programmers who are not from a Java background: 200 lines of Java code can be written in only 10 lines using the Pig Latin language. Programmers who have SQL knowledge need less effort to learn Pig Latin.
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's researchers. At that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely large datasets. In the year 2007, it moved to the Apache Software Foundation (ASF), which made it an open source project. The first version (0.1) of Pig came out in the year 2008; the latest version, 0.17, came out in 2017.
Features of Pig:
For performing several operations, Apache Pig provides a rich set of operators such as filter, join, sort, etc.
It is easy to learn, read, and write; especially for SQL programmers, Apache Pig is a boon.
Apache Pig is extensible, so you can write your own user-defined functions and processes.
Pig can handle the analysis of both structured and unstructured data.
Introduction to NoSQL:
NoSQL (originally referring to "non-SQL" or "non-relational") is a database that provides a mechanism for storage and retrieval of data. This data is modeled in means other than the tabular relations used in relational databases. Such databases came into existence in the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century. NoSQL databases are used in real-time web applications and big data, and their use is increasing over time. NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact that they may support SQL-like query languages.
A NoSQL database offers simplicity of design, simpler horizontal scaling to clusters of machines, and finer control over availability. The data structures used by NoSQL databases are different from those used by default in relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it should solve. Data structures used by NoSQL databases are sometimes also viewed as more flexible than relational database tables.
Many NoSQL stores compromise consistency in favor of availability, speed and partition tolerance. Barriers
to the greater adoption of NoSQL stores include the use of low-level query languages, lack of standardized
interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true
ACID(Atomicity, Consistency, Isolation, Durability) transactions but a few databases, such as MarkLogic,
Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and
OrientDB have made them central to their designs.
Most NoSQL databases offer a concept of eventual consistency in which database changes are propagated
to all nodes so queries for data might not return updated data immediately or might result in reading data
that is not accurate which is a problem known as stale reads. Also some NoSQL systems may exhibit lost
writes and other forms of data loss. Some NoSQL systems provide concepts such as write-ahead logging to
avoid data loss. For distributed transaction processing across multiple databases, data consistency is an
even bigger challenge. This is difficult for both NoSQL and relational databases. Even current relational
databases do not allow referential integrity constraints to span databases. There are few systems that
maintain both X/Open XA standards and ACID transactions for distributed transaction processing.
Introduction to cloud computing
Cloud Computing is the delivery of computing services such as servers, storage, databases,
networking, software, analytics, intelligence, and more, over the Cloud (Internet).
Instead of an organization buying and maintaining its own physical hardware, with Cloud Computing a cloud vendor is responsible for the hardware purchase and maintenance. Cloud vendors also provide a wide variety of software and platforms as a service. We can take any required services on rent, and the cloud computing services will be charged based on usage.
The cloud environment provides an easily accessible online portal that makes it handy for the user to manage the compute, storage, network, and application resources. Some cloud service providers are the following:
Amazon Web Services, IBM Cloud, Google Cloud Platform, Terremark, Joyent, Rackspace, DigitalOcean, and Microsoft.
Types of cloud:
o Public Cloud: The cloud resources that are owned and operated by a third-party cloud
service provider are termed as public clouds. It delivers computing resources such as
servers, software, and storage over the internet
o Private Cloud: The cloud computing resources that are exclusively used inside a single
business or organization are termed as a private cloud. A private cloud may physically
be located on the company’s on-site datacentre or hosted by a third-party service
provider.
o Hybrid Cloud: It is the combination of public and private clouds, bound together by technology that allows data and applications to be shared between them. A hybrid cloud provides flexibility and more deployment options to the business.
The benefits of cloud computing have been spelled out extensively over a long time. Some of the stated
benefits are closely related, and I have summarized the major ones here:
Benefits:
Scalability and elasticity – Cloud is massively scalable, and allows organizations to grow their users from a
handful to hundreds virtually overnight. It only takes an order for additional subscriptions and a payment
to the cloud service provider. Elasticity is similar and allows for a sudden change in cloud computing
resources to respond to spikes in demand.
Accessibility and reliability – All you need to access a cloud service is a current subscription, a good
Internet connection and an internet-enabled device (e.g. desktop, tablet, phone). Cloud service providers
use redundant IT resources and a quick failover mechanism, and many of them offer a 24/7/365 and 99.9%
uptime guarantee.
Cost and operational efficiency – Cloud is cost-effective, since one uses the shared infrastructure of the
cloud service provider via pay-as-you-go modes of payment. Cloud also enhances operational efficiency,
since administrative tasks (e.g. software upgrades, storage increase, data backup) are off-loaded to the
cloud service provider.
Rapid and flexible deployment – Cloud service providers offer an ecosystem of ready-to-use services that
can be rapidly deployed with simple migration and configuration. Users may have the flexibility of choosing
online or installed deployment of cloud applications. Some service providers even offer the flexibility of
Public, Private and Hybrid Cloud.
Security and compatibility – Cloud service providers take the security of their systems very seriously to retain their customer base. They also keep their entire software stack updated and fully compatible to keep services up and running. Finally, cloud services can be expected to be compatible with a wide variety of mobile devices and web interfaces.
Challenges:
The challenges of cloud computing are known, but easily brushed aside or overlooked. Here are the key
challenges that you might have to deal with:
Internet connectivity – You need good Internet connectivity and a powered-up device to access the cloud.
This can be a challenge in a developing economy like Kenya, particularly outside the urban centers.
Accessing cloud services through public Wi-Fi could pose a risk, unless the necessary security measures are
taken.
Financial commitment – For most subscription plans you must make a monthly or annual financial
commitment. The service ceases once you stop payment, and in the worst case you might lose access to
your business data. Compare this to buying a permanent software license, which you only maintain for
good reason.
Data security and protection – Your cloud service provider could have the best security certifications, but
there’s no guarantee that you won’t lose your data. Cloud service providers might even abuse your data in
disregard of privacy concerns. Hackers are increasingly targeting cloud storage for their abundance of
sensitive data.
Readiness and maturity – Cloud requires a new thinking about computing, and adoption will fail if the
culture doesn’t change. Cloud buying decisions are increasingly made by functional managers and
influenced by end-user requirements. Managing the requirements and delivering the envisioned benefits
requires a high level of IT maturity.
Interoperability – Some of your existing applications might not be available as a cloud service. In addition,
you have little control over the cloud services that you subscribe to. Therefore, integration between
services from different service providers and applications that run on your organization’s infrastructure
could present a real problem.
Virtualization is the backbone of Cloud Computing; Cloud Computing brings efficient benefits and becomes more convenient with the help of virtualization, which also provides solutions for great challenges in the field of data security and privacy protection.
Virtualization is the imitation of hardware within a software program: it allows a single computer to play the role of multiple computers. Running a separate physical file server and web server doubles the cost of purchase, maintenance, depreciation, energy, and floor space; by creating virtual web or file servers instead, all of our objectives are fulfilled, such as using hardware resources to the maximum, flexibility, improvement in security, and reduced cost. Efficient use of resources, increased security, portability, problem-free testing, easier manageability, increased flexibility, fault isolation, and rapid deployment are the benefits of virtualization.
Virtualization in Cloud Computing:
Virtualization is used for reaching a high level of availability (or improving availability) and for capacity improvement.
Benefits of Cloud Computing in business: Cloud computing tends to be different from other computing concepts. Basically, it supports interactive and user-friendly web applications. Different people will have their own perspective: some consider cloud computing to be virtualized computer resources, dynamic development, and software deployment. In today's world, cloud computing has played an important role, especially in business. [66] found that cloud computing, as an innovative technology, helps the organization to stay competitive among others. It is able to bring various benefits to a business. Cloud computing is able to provide improved new capabilities which traditional IT solutions cannot provide.