Unit V Big Data Analytics
MapReduce is a functional programming paradigm that is well suited to handling parallel processing of
huge data sets distributed across a large number of computers.
MapReduce is the application paradigm supported by Hadoop and the infrastructure described in this unit.
1. Map: The map step essentially solves a small problem: Hadoop’s partitioner divides the problem into small
workable subsets and assigns those to map processes to solve.
2. Reduce: The reducer combines the results of the mapping processes and forms the output of the
MapReduce operation.
For example, if we were to count the number of times each word appears in a book, our MapReduce
application would output each word as a key and the value as the number of times it is seen.
Or more specifically, the book would probably be broken up into sentences or paragraphs, and the Map step
would return each word mapped either to the number of times it appears in that sentence or to “1” for each
occurrence of the word, and the Reducer would then combine the keys by adding their values together.
Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute
your data, in blocks, to the various slave nodes in its cluster. Then when you did submit your job to Hadoop, it
would distribute your code to the slave nodes and have each map and reduce task process data on that slave
node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this
example), and output the word as the key and the value as “1”.
The reduce task would then receive all instances of values mapped to a particular key; for example, it may
have 1,000 values of “1” mapped to the word “apple”, which would mean that the word “apple” appears 1,000
times in the text. The reduce task sums up all of the values and outputs that as its result.
Then your Hadoop job would be set up to handle all of the output from the various reduce tasks.
An example of MapReduce
Let’s look at a simple example. Assume you have five files, and each file contains two columns (a key
and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city on
the various measurement days. In this example, city is the key and temperature is the value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature for each city across all
of the data files (note that each file might have the same city represented multiple times). Using the
MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the
five files and the mapper task goes through the data and returns the maximum temperature for each city. For
example, the results produced from the mapper task for the data shown above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Assume that the other four mapper tasks (working on the other four files not shown here) produced the
following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37) (Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31) (Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the input results
and output a single value for each city, producing a final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
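The following sketch is not part of the original example; it illustrates the same map and reduce logic in plain Java, without the Hadoop API. The class name, helper methods, and the in-memory "shard" are purely illustrative; a real job would read the five files from HDFS.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;

public class MaxTemperatureSketch {

    // Map step: parse each "city, temperature" line of one shard into a (key, value) pair.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            pairs.add(new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim())));
        }
        return pairs;
    }

    // Reduce step: for every city key, keep only the maximum temperature value seen.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> maxByCity = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            maxByCity.merge(pair.getKey(), pair.getValue(), Math::max);
        }
        return maxByCity;
    }

    public static void main(String[] args) {
        List<String> shard = List.of("Toronto, 20", "Whitby, 25", "New York, 22",
                                     "Rome, 32", "Toronto, 4", "Rome, 33", "New York, 18");
        // Prints something like {New York=22, Rome=33, Toronto=20, Whitby=25}
        System.out.println(reduce(map(shard)));
    }
}

In the real framework the reduce step would run over the intermediate results of all five mappers, which is how the final (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38) output above is obtained.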
Another example of MapReduce: word count
The typical introductory program or ‘Hello World’ for Hadoop is a word count program. Word count
programs or functions do a few things: 1) look at a file with words in it, 2) determine what words are contained
in the file, and 3) count how many times each word shows up and potentially rank or sort the results. For
example, you could run a word count function on a 200 page book about software programming to see how
many times the word “code” showed up and what other words were more or less common.
The word counting problem becomes more complex when we want to run a word count function on
100,000 books, 100 million web pages, or many terabytes of data instead of a single file. For this volume of
data, we need a framework like MapReduce to help us by applying the principle of divide and conquer—
MapReduce basically takes each chapter of each book, gives it to a different machine to count, and then
aggregates the results on another set of machines. The MapReduce workflow for such a word count function proceeds as follows:
1. The system takes input from a file system and splits it up across separate Map nodes
2. The Map function or code is run and generates an output for each Map node—in the word count example, the Map function outputs each word it finds as a key with a count of “1” as the value
3. This output represents a set of intermediate key-value pairs that are moved to Reduce nodes as input
4. The Reduce function or code is run and generates an output for each Reduce node—in the word count
example, the reduce function sums the number of times a group of words or keys occurs
5. The system takes the outputs from each node to aggregate a final view.
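As a concrete sketch of this workflow (not part of the original notes), the classic word count job written against the standard org.apache.hadoop.mapreduce API could look roughly like the following; the class name and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);        // emit (word, 1) for every occurrence
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                // add up all the 1s for this word
            }
            result.set(sum);
            context.write(key, result);          // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local reduce on each map node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper corresponds to step 2 above, the shuffle of intermediate key-value pairs to step 3, and the reducer to step 4.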
MapReduce is implemented in a master/worker configuration, with one master serving as the coordinator of
many workers. A worker may be assigned a role of either a map worker or a reduce worker.
Step 1. Split input
The first step, and the key to massive parallelization in the next step, is to split the input into multiple
pieces. Each piece is called a split, or shard. For M map workers, we want to have M shards, so that each
worker will have something to work on. The number of workers is mostly a function of the number of machines
we have at our disposal.
The MapReduce library of the user program performs this split. The actual form of the split may be
specific to the location and form of the data. MapReduce allows the use of custom readers to split a collection
of inputs into shards, based on the specific format of the files.
Step 2. Fork processes
The next step is to create the master and the workers. The master is responsible for dispatching jobs to
workers, keeping track of progress, and returning results. The master picks idle workers and assigns them
either a map task or a reduce task. A map task works on a single shard of the original data. A reduce task works
on intermediate data generated by the map tasks. In all, there will be M map tasks and R reduce tasks. The
number of reduce tasks is the number of partitions defined by the user. A worker is sent a message by the
master identifying the program (map or reduce) it has to load and the data it has to read.
Step 3. Map
Each map task reads from the input shard that is assigned to it. It parses the data and generates (key,
value) pairs for data of interest. In parsing the input, the map function is likely to get rid of a lot of data that is
of no interest. By having many map workers do this in parallel, we can linearly scale the performance of the
task of extracting data.
Step 4. Partition
The stream of (key, value) pairs that each worker generates is buffered in memory and periodically
stored on the local disk of the map worker. This data is partitioned into R regions by a partitioning function.
The partitioning function is responsible for deciding which of the R reduce workers will work on a specific key.
The default partitioning function is simply a hash of key modulo R but a user can replace this with a custom
partition function if there is a need to have certain keys processed by a specific reduce worker.
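For illustration only (this assumes the org.apache.hadoop.mapreduce.Partitioner API rather than anything defined in these notes), a custom partitioner that reproduces the default hash-of-key-modulo-R behaviour could look like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to one of the R reduce partitions.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same idea as the default: hash of the key modulo R, kept non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A job would register it with job.setPartitionerClass(HashKeyPartitioner.class); replacing the body of getPartition is how certain keys can be steered to a specific reduce worker.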
Step 5. Sort (Shuffle)
When all the map workers have completed their work, the master notifies the reduce workers to start
working. The first thing a reduce worker needs to do is to get the data that it needs to present to the user's reduce
function. The reduce worker contacts every map worker via remote procedure calls to get the (key, value) data
that was targeted for its partition. This data is then sorted by the keys. Sorting is needed since it will usually be
the case that there are many occurrences of the same key and many keys will map to the same reduce worker
(same partition). After sorting, all occurrences of the same key are grouped together so that it is easy to grab
all the data that is associated with a single key.
Step 6. Reduce
With data sorted by keys, the user's Reduce function can now be called. The reduce worker calls
the Reduce function once for each unique key. The function is passed two parameters: the key and the list of
intermediate values that are associated with the key.
Step 7. Done!
When all the reduce workers have completed execution, the master passes control back to the user
program.
The client library initializes the shards and creates map workers, reduce workers, and a master. Map
workers are assigned a shard to process. If there are more shards than map workers, a map worker will be
assigned another shard when it is done. Map workers invoke the user's Map function to parse the data and
write intermediate (key, value) results onto their local disks. This intermediate data is partitioned into R
partitions according to a partitioning function. Each of the R reduce workers contacts all of the map workers and
gets the set of (key, value) intermediate data that was targeted to its partition. It then calls the user's Reduce
function once for each unique key and gives it a list of all values that were generated for that key. The Reduce
function writes its final output to a file that the user's program can access once MapReduce has completed.
Hadoop
As the World Wide Web grew in the late 1900s and early 2000s, search engines and indexes were
created to help locate relevant information amid the text-based content. In the early years, search results were
returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web
crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo,
AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting
and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across
different computers so multiple tasks could be accomplished simultaneously. During this time, another search
engine project called Google was in progress. It was based on the same concept – storing and processing data
in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s
early work with automating distributed data storage and processing. The Nutch project was divided – the web
crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop
(named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today,
Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache
Software Foundation (ASF), a global community of software developers and contributors.
Hadoop
In the evolution of data processing, we moved from flat files to relational databases and from relational
databases to NoSQL databases. Essentially, as the amount of captured data increased, so did our needs, and
traditional patterns no longer sufficed. The databases of old worked well with data that measured in
megabytes and gigabytes, but now that companies realize “data is king,” the amount of captured data is
measured in terabytes and petabytes. Even with NoSQL data stores, the question remains: How do we analyze
that amount of data?
Hadoop is an open-source framework for developing and executing distributed applications that
process very large amounts of data. Hadoop is meant to run on large clusters of commodity machines, which
can be machines in your data center that you’re not using or even Amazon EC2 images. The danger, of course,
in running on commodity machines is how to handle failure. Hadoop is architected with the assumption that
hardware will fail and as such, it can gracefully handle most failures. Furthermore, its architecture allows it to
scale nearly linearly, so as processing capacity demands increase, the only constraint is the amount of budget
you have to add more machines to your cluster.
Hadoop Architecture
At a high-level, Hadoop operates on the philosophy of pushing analysis code close to the data it is
intended to analyze rather than requiring code to read data across a network. As such, Hadoop provides its
own file system, aptly named Hadoop File System or HDFS. When you upload your data to the HDFS, Hadoop
will partition your data across the cluster (keeping multiple copies of it in case your hardware fails), and then it
can deploy your code to the machine that contains the data upon which it is intended to operate.
Like many NoSQL databases, HDFS organizes data by keys and values rather than relationally. In other
words, each piece of data has a unique key and a value associated with that key. Relationships between keys, if
they exist, are defined in the application, not by HDFS. And in practice, you’re going to have to think about your
problem domain a bit differently in order to realize the full power of Hadoop (see the section above on
MapReduce).
1. HDFS: The Hadoop file system is a distributed file system designed to hold huge amounts of data across
multiple nodes in a cluster (where huge can be defined as files that are 100+ terabytes in size!). Hadoop
provides both an API and a command-line interface for interacting with HDFS.
2. MapReduce Application: The next section reviews the details of MapReduce, but in short, MapReduce is a
functional programming paradigm for analyzing a single record in your HDFS. It then assembles the results
into a consumable solution. The Mapper is responsible for the data processing step, while the Reducer
receives the output from the Mappers and sorts the data that applies to the same key.
3. Partitioner: The partitioner is responsible for dividing a particular analysis problem into workable chunks of
data for use by the various Mappers. The HashPartitioner is a partitioner that divides work up by “rows” of
data in the HDFS, but you are free to create your own custom partitioner if you need to divide your data up
differently.
4. Combiner: If, for some reason, you want to perform a local reduce that combines data before sending it
back to Hadoop, then you’ll need to create a combiner. A combiner performs the reduce step, which
groups values together with their keys, but on a single node before returning the key/value pairs to Hadoop
for proper reduction.
5. InputFormat: Most of the time the default readers will work fine, but if your data is not formatted in a
standard way, such as “key, value” or “key [tab] value”, then you will need to create a custom InputFormat
implementation.
6. OutputFormat: Your MapReduce applications will read data in some InputFormat and then write data out
through an OutputFormat. Standard formats, such as “key [tab] value”, are supported out of the box, but
if you want to do something else, then you need to create your own OutputFormat implementation.
Additionally, Hadoop applications are deployed to an infrastructure that supports its high level of scalability
and resilience. These components include:
7. NameNode: The NameNode is the master of the HDFS that controls slave DataNode daemons; it
understands where all of your data is stored, how the data is broken into blocks, what nodes those blocks
are deployed to, and the overall health of the distributed filesystem. In short, it is the most important node
in the entire Hadoop cluster. Each cluster has one NameNode, and the NameNode is a single-point of
failure in a Hadoop cluster.
8. Secondary NameNode: The Secondary NameNode monitors the state of the HDFS cluster and takes
“snapshots” of the data contained in the NameNode. If the NameNode fails, then the Secondary
NameNode can be used in place of the NameNode. This does require human intervention, however, so
there is no automatic failover from the NameNode to the Secondary NameNode, but having the Secondary
NameNode will help ensure that data loss is minimal. Like the NameNode, each cluster has a single
Secondary NameNode.
9. DataNode: Each slave node in your Hadoop cluster will host a DataNode. The DataNode is responsible for
performing data management: It reads its data blocks from the HDFS, manages the data on each physical
node, and reports back to the NameNode with data management status.
10. JobTracker: The JobTracker daemon is your liaison between your application and Hadoop itself. There is
one JobTracker configured per Hadoop cluster and, when you submit your code to be executed on the
Hadoop cluster, it is the JobTracker’s responsibility to build an execution plan. This execution plan includes
determining the nodes that contain data to operate on, arranging nodes to correspond with data,
monitoring running tasks, and relaunching tasks if they fail.
11. TaskTracker: Similar to how data storage follows the master/slave architecture, code execution also follows
the master/slave architecture. Each slave node will have a TaskTracker daemon that is responsible for
executing the tasks sent to it by the JobTracker and communicating the status of the job (and a heartbeat)
with the JobTracker.
The master node contains two important components: the NameNode, which manages the cluster and is in
charge of all data, and the JobTracker, which manages the code to be executed and all of the TaskTracker
daemons. Each slave node has both a TaskTracker daemon as well as a DataNode: the TaskTracker receives its
instructions from the JobTracker and executes map and reduce processes, while the DataNode receives its data
from the NameNode and manages the data contained on the slave node. And of course there is a Secondary
NameNode listening to updates from the NameNode.
HIVE
Apache Hive is a component of the Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to
data stored in HDP. Unlike Pig, which is a scripting language with a focus on dataflows, Hive provides a
database query interface to Apache Hadoop.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different companies. For
example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
The Hive architecture consists of the following units:
User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data storage techniques used to store data in the file system.
Working of Hive
The following steps describe the workflow between Hive and Hadoop:
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job. Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
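As an illustration of submitting a query through this workflow from a program (not covered by the notes above; it assumes a HiveServer2 instance at the hypothetical address localhost:10000, the Hive JDBC driver on the classpath, and an illustrative table named weather), a client can use plain JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical connection details; adjust host, port, database and credentials.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // Hive compiles this HiveQL into a MapReduce job (step 7 above) behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT city, MAX(temperature) FROM weather GROUP BY city");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getInt(2));
            }
        }
    }
}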
What is NoSQL?
A NoSQL database environment is, simply put, a non-relational and largely distributed database system
that enables rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types. NoSQL
databases are sometimes referred to as cloud databases, non-relational databases, Big Data databases and a
myriad of other terms and were developed in response to the sheer volume of data being generated, stored
and analyzed by modern users (user-generated data) and their applications (machine-generated data).
In general, NoSQL databases have become the first alternative to relational databases, with scalability,
availability, and fault tolerance being key deciding factors. They go well beyond the more widely understood
legacy, relational databases (such as Oracle, SQL Server and DB2 databases) in satisfying the needs of today’s
modern business applications. A very flexible and schema-less data model, horizontal scalability, distributed
architectures, and the use of languages and interfaces that are “not only” SQL typically characterize this
technology.
There are four general types of NoSQL databases, each with their own specific attributes:
1. Graph database – Based on graph theory, these databases are designed for data whose relations are
well represented as a graph: the data has elements which are interconnected, with an undetermined
number of relations between them. Examples include Neo4j and Titan.
2. Key-Value store – These are some of the least complex NoSQL options. These databases are designed
for storing data in a schema-less way. In a key-value store, all of the data consists of an indexed key and
a value, hence the name. Examples of this type of database include Cassandra, DynamoDB, Azure Table
Storage (ATS), Riak, and BerkeleyDB.
3. Column store – Also known as wide-column stores, these databases are designed for storing data
tables as sections of columns of data rather than as rows of data. While this simple description sounds
like the inverse of a standard database, wide-column stores offer very high performance and a highly
scalable architecture. Examples include HBase, BigTable and HyperTable.
4. Document database – These expand on the basic idea of key-value stores: the “documents” are more
complex in that they contain data, and each document is assigned a unique key, which is used to
retrieve the document. They are designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data. Examples include MongoDB and CouchDB.
The reasons for businesses to adopt a NoSQL database environment over a relational database have
almost everything to do with the following market drivers and technical requirements.
When making the switch, consider following a roadmap from a relational database to a NoSQL database for a
walkthrough of NoSQL education, migration and success.
1. Workload diversity – Big Data comes in all shapes, colors and sizes. Rigid schemas have no place here;
instead you need a more flexible design. You want your technology to fit your data, not the other way
around. And you want to be able to do more with all of that data – perform transactions in real-time, run
analytics just as fast, and find anything you want in an instant from oceans of data, no matter what form
that data may take.
2. Scalability – With big data you want to be able to scale very rapidly and elastically, whenever and
wherever you want. This applies to all situations, whether scaling across multiple data centers or even
to the cloud if needed.
3. Performance – As has already been discussed, in an online world where nanosecond delays can cost
you sales, Big Data must move at extremely high velocities no matter how much you scale or what
workloads your database must perform. Performance of your environment, namely your applications,
should be high on the list of requirements for deploying a NoSQL platform.
4. Continuous Availability – Building off of the performance consideration, when you rely on big data to
feed your essential, revenue-generating 24/7 business applications, even high availability is not high
enough. Your data can never go down, therefore there should be no single point of failure in your
NoSQL environment, thus ensuring applications are always available.
5. Manageability – Operational complexity of a NoSQL platform should be kept at a minimum. Make sure
that the administration and development required to both maintain and maximize the benefits of
moving to a NoSQL environment are achievable.
6. Cost – This is certainly a glaring reason for making the move to a NoSQL platform as meeting even one
of the considerations presented here with relational database technology can become
prohibitively expensive. Deploying NoSQL properly allows for all of the benefits above while also
lowering operational costs.
7. Strong Community – This is perhaps one of the more important factors to keep in mind as you move to
a NoSQL platform. Make sure there is a solid and capable community around the technology, as this will
provide an invaluable resource for the individuals and teams that will be managing the environment.
Involvement on the part of the vendor should not only include strong support and technical resource
availability, but also consistent outreach to the user base. Good local user groups and meetups will
provide many opportunities for communicating with other individuals and teams that will provide great
insight into how to work best with the platform of choice.
When compared to relational databases, NoSQL databases are more scalable and provide superior
performance, and their data model addresses several issues that the relational model is not designed to
address:
Dynamic Schemas
Relational databases require that schemas be defined before you can add data. For example, you might
want to store data about your customers such as phone numbers, first and last name, address, city and state –
a SQL database needs to know what you are storing in advance.
This fits poorly with agile development approaches, because each time you complete new features, the
schema of your database often needs to change. So if you decide, a few iterations into development, that
you'd like to store customers' favorite items in addition to their addresses and phone numbers, you'll need to
add that column to the database, and then migrate the entire database to the new schema.
If the database is large, this is a very slow process that involves significant downtime. If you are
frequently changing the data your application stores – because you are iterating rapidly – this downtime may
also be frequent. There's also no way, using a relational database, to effectively address data that's completely
unstructured or unknown in advance.
NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it
easy to make significant application changes in real-time, without worrying about service interruptions – which
means development is faster, code integration is more reliable, and less database administrator time is needed.
Developers have typically had to add application-side code to enforce data quality controls, such as mandating
the presence of specific fields, data types or permissible values. More sophisticated NoSQL databases allow
validation rules to be applied within the database, allowing users to enforce governance across data, while
maintaining the agility benefits of a dynamic schema.
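As a toy illustration of the dynamic-schema idea (plain Java, not tied to any particular NoSQL product), schema-less records can be modelled as maps of field names to values, so a newly discovered field can be stored on later records without migrating earlier ones:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicSchemaSketch {
    public static void main(String[] args) {
        List<Map<String, Object>> customers = new ArrayList<>();

        // An early record: only the fields known at the time it was written.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "Alice");
        first.put("phone", "555-0100");
        customers.add(first);

        // A later record adds a brand-new field; no schema migration is required.
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "Bob");
        second.put("phone", "555-0101");
        second.put("favoriteItem", "headphones");
        customers.add(second);

        for (Map<String, Object> customer : customers) {
            System.out.println(customer);   // records with different fields coexist
        }
    }
}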
Auto-sharding
Because of the way they are structured, relational databases usually scale vertically – a single server
has to host the entire database to ensure acceptable performance for cross- table joins and transactions. This
gets expensive quickly, places limits on scale, and creates a relatively small number of failure points for
database infrastructure. The solution to support rapidly growing applications is to scale horizontally, by adding
servers instead of concentrating more capacity in a single server.
Sharding a database across many server instances can be achieved with SQL databases, but usually is
accomplished through SANs and other complex arrangements for making hardware act as a single server.
Because the database does not provide this ability natively, development teams take on the work of deploying
multiple relational databases across a number of machines. Data is stored in each database instance
autonomously. Application code is developed to distribute the data, distribute queries, and aggregate the
results of data across all of the database instances. Additional code must be developed to handle resource
failures, to perform joins across the different databases, for data rebalancing, replication, and other
requirements. Furthermore, many benefits of the relational database, such as transactional integrity, are
compromised or eliminated when employing manual sharding.
NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and
automatically spread data across an arbitrary number of servers, without requiring the application to even be
aware of the composition of the server pool. Data and query load are automatically balanced across servers,
and when a server goes down, it can be quickly and transparently replaced with no application disruption.
Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing
virtually unlimited capacity on demand, and taking care of all the necessary infrastructure administration tasks.
Developers no longer need to construct complex, expensive platforms to support their applications, and can
concentrate on writing application code. Commodity servers can provide the same processing and storage
capabilities as a single high-end server for a fraction of the price.
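A minimal sketch of the hash-based routing that auto-sharding performs under the hood (illustrative only; real systems also handle rebalancing, replication, and failover transparently):

import java.util.List;

public class ShardRouter {
    private final List<String> shards;   // e.g. hostnames of the server pool

    public ShardRouter(List<String> shards) {
        this.shards = shards;
    }

    // Deterministically maps a record key to one shard of the pool.
    public String shardFor(String key) {
        int index = Math.floorMod(key.hashCode(), shards.size());
        return shards.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node-a", "node-b", "node-c"));
        System.out.println(router.shardFor("customer:42"));   // always the same node for this key
    }
}

Note that this naive modulo scheme remaps most keys when a server is added or removed, which is why production systems layer rebalancing logic on top of the basic hashing idea.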
Replication
Most NoSQL databases also support automatic database replication to maintain availability in the event
of outages or planned maintenance events. More sophisticated NoSQL databases are fully self-healing, offering
automated failover and recovery, as well as the ability to distribute the database across multiple geographic
regions to withstand regional failures and enable data localization. Unlike relational databases, NoSQL
databases generally have no requirement for separate applications or expensive add-ons to implement
replication.
Integrated Caching
A number of products provide a caching tier for SQL database systems. These systems can improve read
performance substantially, but they do not improve write performance, and they add operational complexity
to system deployments. If your application is dominated by reads then a distributed cache could be considered,
but if your application has just a modest write volume, then a distributed cache may not improve the overall
experience of your end users, and will add complexity in managing cache invalidation.
Many NoSQL database technologies have excellent integrated caching capabilities, keeping
frequently used data in system memory as much as possible and removing the need for a separate caching
layer. Some NoSQL databases also offer a fully managed, integrated in-memory database management layer
for workloads demanding the highest throughput and lowest latency.
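A rough sketch of the read-through caching idea described above, using a small LRU map in plain Java (illustrative only; an integrated NoSQL cache also handles invalidation on writes across the whole cluster):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ReadThroughCache<K, V> {
    private final Map<K, V> lru;
    private final Function<K, V> backingStore;   // e.g. a lookup against the database

    public ReadThroughCache(int capacity, Function<K, V> backingStore) {
        this.backingStore = backingStore;
        // LinkedHashMap in access order evicts the least recently used entry.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public V get(K key) {
        // Serve frequently used data from memory; fall back to the store on a miss.
        return lru.computeIfAbsent(key, backingStore);
    }

    public void put(K key, V value) {
        lru.put(key, value);   // a real system would also write through to the backing store
    }                          // and invalidate copies held by other cache nodes
}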
Amazon S3
Amazon S3 (Simple Storage Service) is an online service provided by Amazon.com that allows web
marketers, retailers and web-preneurs to store large amounts of data online.
S3 is free to join and is a pay-as-you-go service, meaning you only ever pay for the hosting and
bandwidth costs that you use, making it very attractive for start-up, agile and lean companies looking to
minimize costs.
On top of this, the fully scalable, fast and reliable service provided by Amazon, makes it highly
attractive to video producers and marketers all over the world.
Amazon offers S3 as a hosting system, with pricing dependent on the geographic location of the
datacenter where you store your videos.
Why use Amazon S3?
1. Cost: The obvious answer is cost. Since you only ever pay for the storage and bandwidth you
use, you don't have to contend with high-end server costs, or pay for storage and bandwidth that you
will never use. Your bill is always in line with the volume of your use.
2. Scalability: Amazon S3 is fully scalable, and there are no limits to the amount of storage or
bandwidth that you use. Conventional hosting companies apply limits to the majority of their plans,
and once you hit them, you either get slapped with large extra costs, or they simply suspend your
account, putting your whole website out of action. Using Amazon S3 means you're no longer at the
mercy of the hosting company.
3. Reliability: S3 is provided by Amazon, a global leader in web services, with world-class technical
expertise. The S3 service is very reliable – there is currently a growing network of over 200,000
developers, and it is being used by a multitude of companies of different sizes, from start-ups to
Fortune 1000 companies – it's a widely tested system. This is backed up by a guarantee of 99.9%
uptime from Amazon, and their service level agreement if the service ever falls below that.
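For illustration (a sketch that assumes the AWS SDK for Java v1 on the classpath and credentials configured in the environment; the region, bucket, object key and file name are all hypothetical), uploading a file to S3 takes only a few lines:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class S3UploadExample {
    public static void main(String[] args) {
        // The region determines which datacenter (and price band) stores the object.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                                           .withRegion("us-east-1")
                                           .build();
        // You pay only for the storage and bandwidth this object actually uses.
        s3.putObject("my-video-bucket", "videos/intro.mp4", new File("intro.mp4"));
    }
}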
HDFS
HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored
across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data
losses in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode
software. It is a software that can be run on commodity hardware. The system having the namenode acts as
the master server and it does the following tasks:
• It manages the file system namespace.
• It regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode software. For
every node (Commodity hardware/System) in a cluster, there will be a datanode. These nodes manage the
data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more
segments and/or stored in individual data nodes. These file segments are called as blocks. In other words, the
minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it
can be changed in the HDFS configuration as per need.
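For example (a sketch; the property name dfs.blocksize is the one used by recent Hadoop releases, while older releases use dfs.block.size, and the cluster-wide setting normally lives in hdfs-site.xml), a client-side configuration could raise the block size to 128 MB:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the HDFS block size from the 64 MB default described above to 128 MB.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println("dfs.blocksize = " + conf.getLong("dfs.blocksize", 0));
    }
}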
Goals of HDFS
• Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of
components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault
detection and recovery.
• Huge datasets: HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
• Hardware at data: A requested task can be done efficiently when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is designed to store
very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large
cluster, thousands of servers both host directly attached storage and execute user application tasks. By
distributing storage and computation across many servers, the resource can grow with demand while
remaining economical at every size.
Introduction
Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very
large data sets using the MapReduce paradigm. While the interface to HDFS is patterned after the Unix
filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at
hand.
An important characteristic of Hadoop is the partitioning of data and computation across many
(thousands) of hosts, and the execution of application computations in parallel close to their data. A
Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding
commodity servers. Hadoop clusters at Yahoo! span 40,000 servers, and store 40 petabytes of application
data, with the largest cluster being 4000 servers. One hundred other organizations worldwide report using
Hadoop.
HDFS stores filesystem metadata and application data separately. As in other distributed filesystems,
like PVFS, Lustre, and GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application
data are stored on other servers called DataNodes. All servers are fully connected and communicate with
each other using TCP-based protocols. Unlike Lustre and PVFS, the DataNodes in HDFS do not rely on data
protection mechanisms such as RAID to make the data durable. Instead, like GFS, the file content is
replicated on multiple DataNodes for reliability. While ensuring data durability, this strategy has the added
advantage that data transfer bandwidth is multiplied, and there are more opportunities for locating
computation near the needed data.
Architecture
NameNode
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the
NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace
and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user
selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes
(typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the
mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. The cluster
can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may
execute multiple application tasks concurrently.
The NameNode keeps the entire namespace image in RAM. The persistent record of the image stored on disk
is called a checkpoint, and the NameNode also records modifications to the image in a write-ahead log, called
the journal, in its local storage. Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before
the acknowledgment is sent to the client. The checkpoint file is never changed by the NameNode; a new
file is written when a checkpoint is created during restart, when requested by the administrator, or by the
CheckpointNode described in the next section. During startup the NameNode initializes the namespace
image from the checkpoint, and then replays changes from the journal. A new checkpoint and an empty
journal are written back to the storage directories before the NameNode starts serving clients.
For improved durability, redundant copies of the checkpoint and journal are typically stored on multiple
independent local volumes and at remote NFS servers. The first choice prevents loss from a single volume
failure, and the second choice protects against failure of the entire node. If the NameNode encounters an
error writing the journal to one of the storage directories it automatically excludes that directory from the
list of storage directories. The NameNode automatically shuts itself down if no storage directory is
available.
The NameNode is a multithreaded system and processes requests simultaneously from multiple clients.
Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the
synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this
process, the NameNode batches multiple transactions. When one of the NameNode's threads initiates a
flush-and-sync operation, all the transactions batched at that time are committed together. Remaining
threads only need to check that their transactions have been saved and do not need to initiate a
flush-and-sync operation.
DataNodes
Each block replica on a DataNode is represented by two files in the local native filesystem. The first file
contains the data itself and the second file records the block's metadata including checksums for the data
and the generation stamp. The size of the data file equals the actual length of the block and does not
require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is
half full it needs only half of the space of the full block on the local drive.
During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the
handshake is to verify the namespace ID and the software version of the DataNode. If either does not
match that of the NameNode, the DataNode automatically shuts down.
The namespace ID is assigned to the filesystem instance when it is formatted. The namespace ID is
persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join
the cluster, thus protecting the integrity of the filesystem. A DataNode that is newly initialized and without
any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
After the handshake the DataNode registers with the NameNode. DataNodes persistently store their
unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable
even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it
registers with the NameNode for the first time and never changes after that.
A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block
report contains the block ID, the generation stamp and the length for each block replica the server hosts.
The first block report is sent immediately after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on
the cluster.
During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is
operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the
NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the
DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The
NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use,
and the number of data transfers currently in progress. These statistics are used for the NameNode's block
allocation and load balancing decisions.
The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send
instructions to the DataNodes. The instructions include commands to replicate blocks to other nodes,
remove local block replicas, re-register and send an immediate block report, and shut down the node.
These commands are important for maintaining the overall system integrity and therefore it is critical to
keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per
second without affecting other NameNode operations.
HDFS Client
User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem
interface.
Like most conventional filesystems, HDFS supports operations to read, write and delete files, and
operations to create and delete directories. The user references files and directories by paths in the
namespace. The user application does not need to know that filesystem metadata and storage are on
different servers, or that blocks have multiple replicas.
When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that
host replicas of the blocks of the file. The list is sorted by the network topology distance from the client.
The client contacts a DataNode directly and requests the transfer of the desired block. When a client
writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The
client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client
requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and
the client sends the further bytes of the file.
Unlike conventional filesystems, HDFS provides an API that exposes the locations of a file's blocks. This
allows applications like the MapReduce framework to schedule a task to where the data are located, thus
improving the read performance. It also allows an application to set the replication factor of a file. By
default a file's replication factor is three. For critical files or files which are accessed very often, having a
higher replication factor improves tolerance against faults and increases read bandwidth.
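A short sketch of reading a file through the HDFS client library (it uses the standard org.apache.hadoop.fs API; the path is illustrative, and the NameNode and DataNode interactions described above all happen behind this interface):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            // The client asks the NameNode for block locations, then streams from DataNodes.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}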
CheckpointNode
The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute
one of two other roles: a CheckpointNode or a BackupNode. The role is specified at the node
startup.
The CheckpointNode periodically combines the existing checkpoint and journal to create a new
checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the
NameNode since it has the same memory requirements as the NameNode. It downloads the current
checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint
back to the NameNode.
Creating periodic checkpoints is one way to protect the filesystem metadata. The system can start from
the most recent checkpoint if all other persistent copies of the namespace image or journal are
unavailable. Creating a checkpoint also lets the NameNode truncate the journal when the new checkpoint
is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts during
which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of
the journal file increases. Also, a very large journal extends the time required to restart the NameNode. For
a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.
BackupNode
A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is
capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of
the filesystem namespace that is always synchronized with the state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active NameNode,
saves them in journal on its own storage directories, and applies these transactions to its own namespace
image in memory. The NameNode treats the BackupNode as a journal store the same way as it treats
journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the
checkpoint on disk is a record of the latest namespace state.
The BackupNode can create a checkpoint without downloading checkpoint and journal files from the
active NameNode, since it already has an up-to-date namespace image in its memory. This makes the
checkpoint process on the BackupNode more efficient as it only needs to save the namespace into its local
storage directories.
The BackupNode can be viewed as a read-only NameNode. It contains all filesystem metadata
information except for block locations. It can perform all operations of the regular NameNode that do not
involve modification of the namespace or knowledge of block locations. Use of a BackupNode provides the
option of running the NameNode without persistent storage, delegating responsibility of persisting the
namespace state to the BackupNode.
Upgrades and Filesystem Snapshots
The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system
during software upgrades. The snapshot (only one can exist) is created at the cluster administrator's option whenever the system
is started. If a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges
them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the
old checkpoint and journal remain unchanged.
During handshake the NameNode instructs DataNodes whether to create a local snapshot. The local
snapshot on the DataNode cannot be created by replicating the directories containing the data files as this
would require doubling the storage capacity of every DataNode on the cluster. Instead each DataNode
creates a copy of the storage directory and hard links existing block files into it. When the DataNode
removes a block it removes only the hard link, and block modifications during appends use the
copy-on-write technique. Thus old block replicas remain untouched in their old directories.
The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the
system. The NameNode recovers the checkpoint saved when the snapshot was created. DataNodes restore
the previously renamed directories and initiate a background process to delete block replicas created after
the snapshot was made. Having chosen to roll back, there is no provision to roll forward. The cluster
administrator can recover the storage occupied by the snapshot by commanding the system to abandon
the snapshot; for snapshots created during upgrade, this finalizes the software upgrade.
System evolution may lead to a change in the format of the NameNode's checkpoint and journal files,
or in the data representation of block replica files on DataNodes. The layout version identifies the data
representation formats, and is persistently stored in the NameNode's and the DataNodes' storage
directories. During startup each node compares the layout version of the current software with the version
stored in its storage directories and automatically converts data from older formats to the newer ones. The
conversion requires the mandatory creation of a snapshot when the system restarts with the new software
layout version.
File Read and Write
An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed,
the bytes written cannot be altered or removed except that new data can be added to the file by reopening
the file for append. HDFS implements a single-writer, multiple-reader model.
The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to
the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When
the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the
soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client
fails to close the file or renew the lease, another client can preempt the lease. If after the hard limit expires
(one hour) and the client has failed to renew the lease, HDFS assumes that the client has quit and will
automatically close the file on behalf of the writer, and recover the lease. The writer's lease does not
prevent other clients from reading the file; a file may have many concurrent readers.
An HDFS file consists of blocks. When there is a need for a new block, the NameNode allocates a block
with a unique block ID and determines a list of DataNodes to host replicas of the block. The DataNodes
form a pipeline, the order of which minimizes the total network distance from the client to the last
DataNode. Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes
first buffer at the client side. After a packet buffer is filled (typically 64 KB), the data are pushed to the
pipeline. The next packet can be pushed to the pipeline before receiving the acknowledgment for the
previous packets. The number of outstanding packets is limited by the outstanding packets window size of
the client.
After data are written to an HDFS file, HDFS does not provide any guarantee that data are visible to a
new reader until the file is closed. If a user application needs the visibility guarantee, it can explicitly call
the hflush operation. Then the current packet is immediately pushed to the pipeline, and the hflush
operation will wait until all DataNodes in the pipeline acknowledge the successful transmission of the
packet. All data written before the hflush operation are then certain to be visible to readers.
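As a sketch of the visibility guarantee described above (using the standard org.apache.hadoop.fs API; the path and the log records are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHflushExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.log"))) {
            out.writeBytes("event-1\n");   // buffered on the client, then sent in packets
            out.hflush();                  // push the current packet down the pipeline;
                                           // data written so far is now visible to new readers
            out.writeBytes("event-2\n");
        }                                  // close() finalizes the block and releases the lease
    }
}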
If no error occurs, block construction goes through three stages: pipeline setup, data streaming, and pipeline
close. Consider a pipeline of three DataNodes (DN) and a block of five packets. From t0 to t1 is the pipeline
setup stage. The interval t1 to t2 is the data streaming stage, where t1 is the time when the first data packet
gets sent and t2 is the time that the acknowledgment to the last packet gets received. Here an hflush operation
transmits packet 2; the hflush indication travels with the packet data and is not a separate operation. The final
interval, t2 to t3, is the pipeline close stage for this block.
In a cluster of thousands of nodes, failures of a node (most commonly storage faults) are daily occurrences.
A replica stored on a DataNode may become corrupted because of faults in memory, disk, or network. HDFS
generates and stores checksums for each data block of an HDFS file. Checksums are verified by the HDFS client
while reading to help detect any corruption caused either by client, DataNodes, or network. When a client
creates an HDFS file, it computes the checksum sequence for each block and sends it to a DataNode along with
the data. A DataNode stores checksums in a metadata file separate from the block's data file. When HDFS
reads a file, each block's data and checksums are shipped to the client. The client computes the checksum for
the received data and verifies that the newly computed checksums match the checksums it received. If not,
the client notifies the NameNode of the corrupt replica and then fetches a different replica of the block from
another DataNode.
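A simplified illustration of the client-side check (plain Java with CRC32; HDFS actually keeps a checksum per fixed-size chunk of each block, but the comparison idea is the same):

import java.util.zip.CRC32;

public class ChecksumCheck {
    // Returns true if the received data still matches the checksum shipped with it.
    static boolean verify(byte[] receivedData, long shippedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(receivedData);
        return crc.getValue() == shippedChecksum;   // mismatch => report corrupt replica,
    }                                               // then fetch another replica

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        CRC32 crc = new CRC32();
        crc.update(block);
        System.out.println(verify(block, crc.getValue()));   // true for an uncorrupted replica
    }
}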
When a client opens a file to read, it fetches the list of blocks and the locations of each block replica from
the NameNode. The locations of each block are ordered by their distance from the reader. When reading the
content of a block, the client tries the closest replica first. If the read attempt fails, the client tries the next
replica in sequence. A read may fail if the target DataNode is unavailable, the node no longer hosts a replica of
the block, or the replica is found to be corrupt when checksums are tested.
HDFS permits a client to read a file that is open for writing. When reading a file open for writing, the length
of the last block still being written is unknown to the NameNode. In this case, the client asks one of the replicas
for the latest length before starting to read its content.
The design of HDFS I/O is particularly optimized for batch processing systems, like MapReduce, which require
high throughput for sequential reads and writes. Ongoing efforts will improve read/write response time for
applications that require real-time data streaming or random access.
Block Placement
For a large cluster, it may not be practical to connect all nodes in a flat topology. A common practice is to
spread the nodes across multiple racks. Nodes of a rack share a switch, and rack switches are connected by one
or more core switches. Communication between two nodes in different racks has to go through multiple
switches. In most cases, network bandwidth between nodes in the same rack is greater than network
bandwidth between nodes in different racks.
Cluster Topology
HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to
its parent node is assumed to be one. A distance between two nodes can be calculated by summing the
distances to their closest common ancestor. A shorter distance between two nodes implies greater bandwidth
available for transferring data between them.
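A small sketch of this distance rule follows, assuming each node is identified by a location path such as /dc1/rack2/node7 (the path format is illustrative). The distance is the number of hops from each node up to their closest common ancestor.

import java.util.Arrays;
import java.util.List;

public class NetworkDistance {
    // Each node is described by its location path, e.g. "/dc1/rack2/node7" (illustrative format).
    static int distance(String a, String b) {
        List<String> pa = Arrays.asList(a.split("/"));
        List<String> pb = Arrays.asList(b.split("/"));
        int common = 0;
        int min = Math.min(pa.size(), pb.size());
        while (common < min && pa.get(common).equals(pb.get(common))) {
            common++;                     // depth of the closest common ancestor
        }
        // One hop per level from each node up to the common ancestor.
        return (pa.size() - common) + (pb.size() - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node2")); // same rack -> 2
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack2/node5")); // different racks -> 4
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node1")); // same node -> 0
    }
}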
HDFS allows an administrator to configure a script that returns a node's rack identification given a node's
address. The NameNode is the central place that resolves the rack location of each DataNode. When a
DataNode registers with the NameNode, the NameNode runs the configured script to decide which rack the
node belongs to. If no such script is configured, the NameNode assumes that all the nodes belong to a
default single rack.
The placement of replicas is critical to HDFS data reliability and read/write performance. A good replica
placement policy should improve data reliability, availability, and network bandwidth utilization. Currently
HDFS provides a configurable block placement policy interface so that the users and researchers can
experiment and test alternate policies that are optimal for their applications.
The default HDFS block placement policy provides a tradeoff between minimizing the write cost, and
maximizing data reliability, availability and aggregate read bandwidth. When a new block is created, HDFS
places the first replica on the node where the writer is located. The second and the third replicas are placed on
two different nodes in a different rack. The rest are placed on random nodes with restrictions that no more
than one replica is placed at any one node and no more than two replicas are placed in the same rack, if
possible. The choice to place the second and third replicas on a different rack better distributes the block
replicas for a single file across the cluster. If the first two replicas were placed on the same rack, for any file,
two-thirds of its block replicas would be on the same rack.
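The following simplified sketch mirrors that selection for the common case of three replicas. The Node class and chooseTargets helper are hypothetical; the real policy additionally considers node load, available space, and other constraints.

import java.util.ArrayList;
import java.util.List;

public class DefaultPlacementSketch {
    // Hypothetical node descriptor used only for illustration.
    static class Node { String name; String rack; Node(String n, String r) { name = n; rack = r; } }

    // Simplified selection of three targets for a new block, mirroring the default policy:
    // replica 1 on the writer's node, replicas 2 and 3 on two different nodes of another rack.
    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                       // first replica: the writer's node
        for (Node n : cluster) {
            if (!n.rack.equals(writer.rack)) {                     // candidate on a remote rack
                if (targets.size() < 3 && !targets.contains(n)
                        && (targets.size() == 1 || n.rack.equals(targets.get(1).rack))) {
                    targets.add(n);                                // second and third replicas share that rack
                }
            }
            if (targets.size() == 3) break;
        }
        return targets;                                            // real HDFS also checks space, load, etc.
    }
}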
After all target nodes are selected, nodes are organized as a pipeline in the order of their proximity to the
first replica. Data are pushed to nodes in this order. For reading, the NameNode first checks if the client's host
is located in the cluster. If so, block locations are returned to the client in the order of their closeness to the
reader. The block is read from DataNodes in this preference order.
This policy reduces the inter-rack and inter-node write traffic and generally improves write performance.
Because the chance of a rack failure is far less than that of a node failure, this policy does not impact data
reliability and availability guarantees. In the usual case of three replicas, it can reduce the aggregate network
bandwidth used when reading data since a block is placed in only two unique racks rather than three.
Replication Management
The NameNode endeavors to ensure that each block always has the intended number of replicas. The
NameNode detects that a block has become under- or over-replicated when a block report from a DataNode
arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode
prefers, first, not to reduce the number of racks that host replicas and, second, to remove the replica from
the DataNode with the least amount of available disk space. The goal is to balance storage utilization across
DataNodes without reducing the block's availability.
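A sketch of that preference order, using a hypothetical Replica descriptor, might look like the following; the real NameNode logic considers additional factors.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExcessReplicaSketch {
    // Hypothetical replica descriptor for illustration.
    static class Replica { String node; String rack; long freeSpace;
        Replica(String n, String r, long f) { node = n; rack = r; freeSpace = f; } }

    // Pick one replica of an over-replicated block to delete.
    static Replica chooseReplicaToRemove(List<Replica> replicas) {
        // Count replicas per rack; removing from a rack that hosts more than one replica keeps rack coverage.
        Map<String, Long> perRack = replicas.stream()
                .collect(Collectors.groupingBy(r -> r.rack, Collectors.counting()));

        return replicas.stream()
                .filter(r -> perRack.get(r.rack) > 1)                        // prefer not to reduce the number of racks
                .min(Comparator.comparingLong((Replica r) -> r.freeSpace))   // then the node with least available space
                .orElseGet(() -> replicas.stream()                           // fall back if no rack hosts more than one replica
                        .min(Comparator.comparingLong((Replica r) -> r.freeSpace)).get());
    }
}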
When a block becomes under-replicated, it is put in the replication priority queue. A block with only one
replica has the highest priority, while a block with a number of replicas that is greater than two thirds of its
replication factor has the lowest priority. A background thread periodically scans the head of the replication
queue to decide where to place new replicas. Block replication follows a similar policy as that of new block
placement. If the number of existing replicas is one, HDFS places the next replica on a different rack. If the
block has two existing replicas on the same rack, the third replica is
placed on a different rack; otherwise, the third replica is placed on a different node in the same rack as an
existing replica. Here the goal is to reduce the cost of creating new replicas.
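The priority rule described above can be sketched as a simple function; the numeric priority levels here are illustrative and do not correspond to the actual queue levels used inside HDFS.

public class ReplicationPrioritySketch {
    // Lower return value = higher priority (illustrative scale only).
    static int priority(int liveReplicas, int replicationFactor) {
        if (liveReplicas == 1) {
            return 0;                                   // highest priority: one failure away from data loss
        }
        if (liveReplicas * 3 > replicationFactor * 2) {
            return 2;                                   // lowest priority: more than 2/3 of the target count exists
        }
        return 1;                                       // everything in between
    }

    public static void main(String[] args) {
        System.out.println(priority(1, 3));   // 0: a single remaining replica gets the highest priority
        System.out.println(priority(2, 3));   // 1: two of three is not more than two thirds, so mid priority
        System.out.println(priority(3, 4));   // 2: three of four exceeds two thirds, so lowest priority
    }
}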
The NameNode also makes sure that not all replicas of a block are located on one rack. If the
NameNode detects that a block's replicas end up at one rack, the NameNode treats the block as mis-
replicated and replicates the block to a different rack using the same block placement policy described
above. After the NameNode receives notification that the new replica has been created, the block becomes
over-replicated. The NameNode then decides to remove an old replica, because the over-replication policy
prefers not to reduce the number of racks.
Balancer
The HDFS block placement strategy does not take DataNode disk space utilization into account. This is to
avoid placing new data, which is more likely to be referenced, at the small subset of DataNodes that have a
lot of free storage. Therefore data might not always be placed uniformly across DataNodes. Imbalance also
occurs when new nodes are added to the cluster.
The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a threshold value as an
input parameter, which is a fraction between 0 and 1. A cluster is balanced if, for each DataNode, the
utilization of the node (the ratio of space used to total capacity at that node) differs from the utilization of the
whole cluster (the ratio of space used to total capacity of the cluster) by no more than the threshold value.
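The balanced condition can be expressed as a simple check, where utilization is the ratio of used space to capacity. This is only a sketch of the criterion, not of the balancer's implementation.

public class BalancerCheckSketch {
    // Utilization = used space / capacity, both for a single node and for the whole cluster.
    static boolean isBalanced(double nodeUsed, double nodeCapacity,
                              double clusterUsed, double clusterCapacity,
                              double threshold) {                  // threshold is a fraction between 0 and 1
        double nodeUtil = nodeUsed / nodeCapacity;
        double clusterUtil = clusterUsed / clusterCapacity;
        return Math.abs(nodeUtil - clusterUtil) <= threshold;
    }

    public static void main(String[] args) {
        // Node at 70% utilization, cluster at 55%, threshold 0.10 -> this node is out of balance (prints false).
        System.out.println(isBalanced(70, 100, 550, 1000, 0.10));
    }
}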
The tool is deployed as an application program that can be run by the cluster administrator. It iteratively
moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. One key
requirement for the balancer is to maintain data availability. When choosing a replica to move and deciding
its destination, the balancer guarantees that the decision does not reduce either the number of replicas or
the number of racks.
The balancer optimizes the balancing process by minimizing the inter-rack data copying. If the balancer
decides that a replica A needs to be moved to a different rack and the destination rack happens to have a
replica B of the same block, the data will be copied from replica B instead of replica A.
A configuration parameter limits the bandwidth consumed by rebalancing operations. The higher the
allowed bandwidth, the faster a cluster can reach the balanced state, but with greater competition with
application processes.
Block Scanner
Each DataNode runs a block scanner that periodically scans its block replicas and verifies that stored
checksums match the block data. In each scan period, the block scanner adjusts the read bandwidth in
order to complete the verification in a configurable period. If a client reads a complete block and checksum
verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.
The verification time of each block is stored in a human-readable log file. At any time there are up to
two files in the top-level DataNode directory, the current and previous logs. New verification times are
appended to the current file. Correspondingly, each DataNode has an in-memory scanning list ordered by
the replica's verification time.
Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The
NameNode marks the replica as corrupt, but does not schedule deletion of the replica immediately.
Instead, it starts to replicate a good copy of the block. Only when the number of good replicas reaches the
replication factor of the block is the corrupt replica scheduled to be removed. This policy aims to preserve
data as long as possible. So even if all replicas of a block are corrupt, the policy allows the user to retrieve
its data from the corrupt replicas.
Decommissioning
The cluster administrator specifies a list of nodes to be decommissioned. Once a DataNode is marked for
decommissioning, it will not be selected as the target of replica placement, but it will continue to serve
read requests. The NameNode starts to schedule replication of its blocks to other DataNodes. Once the
NameNode detects that all blocks on the decommissioning DataNode are replicated, the node enters the
decommissioned state. Then it can be safely removed from the cluster without jeopardizing any data
availability.