
1) Discuss the design of Hadoop Distributed File System (HDFS) and its concepts in detail.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem.
The Design of HDFS:
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Very large files:
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-
once, read-many-times pattern. A dataset is typically generated or copied from source,
then various analyses are performed on that dataset over time.
Commodity hardware:
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to
run on clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for
large clusters. HDFS is designed to carry on working without a noticeable interruption
to the user in the face of such failure.

Concepts in HDFS
Namenode

The namenode runs on commodity hardware with the GNU/Linux operating system and
the namenode software. The machine running the namenode acts as the master server
and performs the following tasks −

 Manages the file system namespace.
 Regulates clients’ access to files.
 Executes file system operations such as renaming, closing, and opening files and
directories.

Datanode

The datanode is commodity hardware running the GNU/Linux operating system and the
datanode software. Every node (commodity hardware/system) in a cluster hosts a
datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file system, as per client request.
 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

Block
The files in HDFS are divided into segments and stored in individual data nodes. These
file segments are called blocks. In other words, a block is the minimum amount of data
that HDFS can read or write. The default block size is 64 MB in Hadoop 1.x and 128 MB
in Hadoop 2.x onwards, but it can be changed as needed in the HDFS configuration.
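
The block size can also be overridden per client rather than cluster-wide. Below is a minimal Java sketch, assuming a Hadoop 2.x+ client and a hypothetical path /user/input/big.dat, that sets dfs.blocksize before creating a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cluster-wide default comes from dfs.blocksize in hdfs-site.xml (Hadoop 2.x+ name);
        // here we override it for files created by this client (256 MB).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // /user/input/big.dat is a hypothetical path used only for illustration.
        FSDataOutputStream out = fs.create(new Path("/user/input/big.dat"));
        out.writeUTF("data written into 256 MB blocks");
        out.close();
        fs.close();
    }
}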

2) Show how a client reads and writes data in HDFS. Give an example with code.
Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system that needs to be saved
in the HDFS file system. Follow the steps given below to insert the required file into the
Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from the local system to the Hadoop file system using the
put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
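
The same put and get operations can also be performed programmatically through the Java FileSystem API. The following is a minimal sketch; the paths mirror the shell examples above and are otherwise hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -put /home/file.txt /user/input
        fs.copyFromLocalFile(new Path("/home/file.txt"), new Path("/user/input/file.txt"));

        // Equivalent of: hadoop fs -get /user/output/outfile /home/hadoop_tp/
        fs.copyToLocalFile(new Path("/user/output/outfile"), new Path("/home/hadoop_tp/outfile"));

        fs.close();
    }
}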

OR
Anatomy of File Read in HDFS
Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes. The read proceeds in the following steps:

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem
object (which for HDFS is an instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node
and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to the
first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection
to the data node and then finds the best data node for the next block. This happens
transparently to the client, which from its point of view is simply reading a continuous
stream. Blocks are read in order, with the DFSInputStream opening new connections to
data nodes as the client reads through the stream. It will also call the name node to
retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
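
A minimal Java sketch of this read path, assuming a hypothetical file /user/output/outfile, could look like the following; the open(), read(), and close() calls map onto the steps above:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // DistributedFileSystem for an HDFS URI
        InputStream in = null;
        try {
            in = fs.open(new Path("/user/output/outfile"));  // Step 1: open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);  // Steps 3-5: read() calls stream the blocks
        } finally {
            IOUtils.closeStream(in);                         // Step 6: close() the stream
        }
    }
}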

Anatomy of File Write in HDFS


Next, we’ll look at how files are written to HDFS.
Note: HDFS follows the write-once, read-many-times model. Files already stored in
HDFS cannot be edited, but data can be appended by reopening them.

Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).


Step 2: DFS makes an RPC call to the name node to create a new file in the file
system’s namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has the
right permissions to create the file. If these checks pass, the name node makes a
record of the new file; otherwise, the file can’t be created and the client is thrown an
error (an IOException). The DFS returns an FSDataOutputStream for the client to start
writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it
writes to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the name node to allocate new blocks by
picking a list of suitable data nodes to store the replicas. The list of data nodes forms a
pipeline, and here we’ll assume the replication level is three, so there are three nodes
in the pipeline. The DataStreamer streams the packets to the first data node in the
pipeline, which stores each packet and forwards it to the second data node in the
pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third
(and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting
to be acknowledged by data nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This
flushes all the remaining packets to the data node pipeline and waits for
acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can’t edit files that are already
stored in HDFS, but we can append to them by reopening the file. This design allows
HDFS to scale to a large number of concurrent clients because the data traffic is
spread across all the data nodes in the cluster. Thus, it increases the availability,
scalability, and throughput of the system.
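
A corresponding minimal Java sketch of the write path, using a hypothetical file /user/input/newfile.txt, could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Step 1-2: create() asks the name node to record the new file (an IOException is thrown
        // if it already exists and overwrite is false); it returns an FSDataOutputStream.
        FSDataOutputStream out = fs.create(new Path("/user/input/newfile.txt"), false);
        // Steps 3-5: the written bytes are split into packets and pushed down the data node pipeline.
        out.write("hello hdfs".getBytes("UTF-8"));
        // Step 6: close() flushes the remaining packets and tells the name node the file is complete.
        out.close();
        fs.close();
    }
}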

3) Explain InputFormat and OutputFormat in detail.


INPUT FORMAT
Hadoop InputFormat checks the input specification of the job. The InputFormat splits
the input file into InputSplits and assigns them to individual Mappers.
How the input files are split up and read in Hadoop is defined by the InputFormat. A
Hadoop InputFormat is the first component in MapReduce; it is responsible for creating
the input splits and dividing them into records.
Initially, the data for a MapReduce task is stored in input files, which typically reside in
HDFS. Although the format of these files is arbitrary, line-based log files and binary
formats can be used. Using InputFormat we define how these input files are split and
read. The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework and provides the following functionality:
 The files or other objects that should be used for input are selected by the
InputFormat.
 InputFormat defines the data splits, which determine both the size of individual map
tasks and their potential execution server.
 InputFormat defines the RecordReader, which is responsible for reading actual
records from the input files.
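
These responsibilities are reflected in the shape of the InputFormat contract itself. The sketch below mirrors the two abstract methods of org.apache.hadoop.mapreduce.InputFormat and is shown for illustration only:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Mirrors the contract of org.apache.hadoop.mapreduce.InputFormat:
// getSplits() produces the input splits, createRecordReader() turns a split into records.
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}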
OUTPUT FORMAT
The Hadoop OutputFormat checks the output specification of the job. It determines the
RecordWriter implementation used to write the output to output files.

The Hadoop RecordWriter takes output data from the Reducer and writes it to output
files. The way these output key-value pairs are written to output files by the
RecordWriter is determined by the OutputFormat. The OutputFormat is the counterpart
of the InputFormat. OutputFormat instances provided by Hadoop are used to write to
files on HDFS or the local disk. OutputFormat describes the output specification for a
MapReduce job. On the basis of the output specification:
 The MapReduce job checks that the output directory does not already exist.
 OutputFormat provides the RecordWriter implementation to be used to write the
output files of the job. Output files are stored in a FileSystem.
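
In a job driver, the InputFormat and OutputFormat are wired in with setInputFormatClass() and setOutputFormatClass(). A minimal sketch is given below; the mapper and reducer classes are omitted, and the paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDriver.class);

        // A real job would also set mapper/reducer classes here; they are omitted in this sketch.
        job.setInputFormatClass(TextInputFormat.class);    // how input is split and read
        job.setOutputFormatClass(TextOutputFormat.class);  // how the RecordWriter writes output

        // Output key/value types of the (omitted) reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/input"));     // must exist
        FileOutputFormat.setOutputPath(job, new Path("/user/output"));  // must NOT already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}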

4) How does the Hadoop system analyze data? Explain your answer with example code.

5) Discuss different types and formats of Map Reduce with examples.


A MapReduce program is composed of a map procedure, which performs filtering and
sorting (such as sorting students by first name into queues, one queue for each name),
and a reduce method, which performs a summary operation (such as counting the
number of students in each queue, yielding name frequencies).
Input Formats
1. Text Inputs
TextInputFormat - default InputFormat where each record is a line of input
Key - byte offset within the file of the beginning of the line; Value - the contents of
the line, not including any line terminators, packaged as a Text object
mapreduce.input.linerecordreader.line.maxlength - can be used to set a maximum
expected line length, as a safeguard against corrupted files (corruption often shows up
as a very long line); see the sketch after this list
KeyValueTextInputFormat - Used to interpret TextOutputFormat (default output
that contains key-value pairs separated by a delimiter)
mapreduce.input.keyvaluelinerecordreader.key.value.separator - used to specify the
delimiter/separator which is tab by default
NLineInputFormat - used when the mappers need to receive a fixed number of
lines of input
mapreduce.input.line.inputformat.linespermap - controls the number of input lines
(N)
StreamXmlRecordReader - used to break XML documents into records
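
As a small illustration of the properties listed above, the sketch below caps the line length, changes the key-value separator, and notes how NLineInputFormat could be given a fixed number of lines per mapper; the job is only a fragment and the values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class TextInputTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Guard against corrupted files that show up as one extremely long line (1 MB cap here).
        conf.setInt("mapreduce.input.linerecordreader.line.maxlength", 1024 * 1024);
        // Use ',' instead of the default tab as the key-value separator.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "text input tuning");
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Alternatively, if NLineInputFormat were the input format, each mapper
        // could be given exactly 1000 input lines:
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 1000);
    }
}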

2. Binary Input:
SequenceFileInputFormat - stores sequences of binary key-value pairs
SequenceFileAsTextInputFormat - converts sequence file’s keys and values to
Text objects
SequenceFileAsBinaryInputFormat - retrieves the sequence file’s keys and
values as binary objects
FixedLengthInputFormat - reads fixed-width binary records from a file where the
records are not separated by delimiters

3. Multiple Inputs:
By default, all input is interpreted by a single InputFormat and a single Mapper
MultipleInputs - allows the programmer to specify which InputFormat and Mapper to
use on a per-path basis (see the sketch below)
Database Input/Output:
DBInputFormat - input format for reading data from a relational database
DBOutputFormat - output format for writing output data to a relational database
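
A minimal sketch of MultipleInputs is shown below; the two mapper classes and the paths /data/logs and /data/kv are hypothetical and exist only to show the per-path wiring:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsDriver {

    // Trivial mapper for plain text lines (key = byte offset, value = line).
    public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("log"), value);
        }
    }

    // Trivial mapper for separator-delimited key-value text (key = first field, value = rest).
    public static class KvMapper extends Mapper<Text, Text, Text, Text> {
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs");
        job.setJarByClass(MultipleInputsDriver.class);
        // Each input path gets its own InputFormat and Mapper.
        MultipleInputs.addInputPath(job, new Path("/data/logs"),
                TextInputFormat.class, LogMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/kv"),
                KeyValueTextInputFormat.class, KvMapper.class);
    }
}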

Output Formats
1. Text Output:
TextOutputFormat - default output format; writes records as lines of text (keys and
values are turned into strings, separated by a tab by default)
Its counterpart for reading such output back is KeyValueTextInputFormat, which breaks
lines into key-value pairs based on a configurable separator
2. Binary Output:
SequenceFileOutputFormat - writes sequence files as output
SequenceFileAsBinaryOutputFormat - writes keys and values in binary format
into a sequence file container
MapFileOutputFormat - writes map files as output
3. Multiple Outputs:
MultipleOutputs - allows the programmer to write data to multiple output files whose
names are derived from the output keys and values (see the sketch below)

4. Lazy Output:
LazyOutputFormat - wrapper output format that ensures the output file is created
only when the first record is emitted for a given partition
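
The driver-side wiring for MultipleOutputs and LazyOutputFormat could look like the following minimal sketch; the named output "summary" is hypothetical, and the reducer that would write to it is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output tuning");

        // Register an extra named output ("summary") alongside the job's main output.
        // Inside the reducer, a MultipleOutputs instance would write to it with
        // mos.write("summary", key, value) -- omitted here.
        MultipleOutputs.addNamedOutput(job, "summary",
                TextOutputFormat.class, Text.class, IntWritable.class);

        // Only create part files when the first record is actually written.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}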

6) Write the working procedure of HDFS and also explain the features
of HDFS.
Hadoop File System was developed using a distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant
and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge
data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with HDFS.
3. The built-in servers of namenode and datanode help users to easily check the
status of the cluster.
4. Streaming access to file system data.
5. HDFS provides file permissions and authentication.
OR
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data
may be lost. So, to overcome such problems, HDFS always maintains a copy of the
data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in
the event of failure. HDFS is so fault-tolerant that if any machine fails, another
machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that
makes Hadoop very powerful. Here, data is divided into multiple blocks and stored
across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one
platform to another.
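
The distributed, replicated layout described above can also be observed programmatically. The following minimal Java sketch, assuming a hypothetical file /user/input/file.txt, prints the replication factor and the datanodes hosting each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // /user/input/file.txt is a hypothetical path used only for illustration.
        FileStatus status = fs.getFileStatus(new Path("/user/input/file.txt"));
        System.out.println("Replication factor: " + status.getReplication());

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // Each block is stored on several datanodes (one per replica).
            System.out.println("Block " + i + " hosts: "
                    + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}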

7) Explain big data and algorithmic trading.


Algo-trading is the use of predefined programs to execute trades. A set of instructions
or an algorithm is fed into a computer program, and it automatically executes the trade
when the specified conditions are met.
Algorithmic trading has become synonymous with big data due to the growing
capabilities of computers. The automated process enables computer programs to
execute financial trades at speeds and frequencies that a human trader cannot.
Role of Big Data in Algorithmic Trading
1. Technical Analysis: Technical analysis is the study of prices and price behavior,
using charts as the primary tool.
2. Real-Time Analysis: The automated process enables computers to execute financial
trades at speeds and frequencies that a human trader cannot.
3. Machine Learning: With machine learning, algorithms are constantly fed data and
actually get smarter over time by learning from past mistakes, logically deducing new
conclusions based on past results, and creating new techniques that make sense based
on thousands of unique factors.

8) Discuss crowdsourcing analytics and inter- and trans-firewall analytics.
 Crowdsourcing is the collection of information, opinions, or work from a group of
people, usually sourced via the Internet.
 Crowdsourcing work allows companies to save time and money while tapping
into people with different skills or thoughts from all over the world.
Crowdsourcing analytics refers to crowdsourcing platforms that use machine learning
and advanced algorithms to analyze the din of the online “crowd,” determine who the
wise voices in the crowd are, and then turn the input from these sources into actionable
insights for companies. Finding these insights, and focusing on the best sources of
information, can be invaluable for organizations that are struggling to make sense of
mountains of audio, video, and unstructured text coming at them from all directions.

9) Explain big data and Hadoop open-source technology.

10) Relate crowd sourcing and big data. Justify the relationship with
an example.

11) Write down the aggregate data model in detail with an example.

12) Differentiate “scale up” and “scale out.” Explain with an example how Hadoop uses
the scale-out approach to improve performance.
Scaling up is taking what you’ve got and replacing it with something more powerful.
Scaling up is making a component bigger or faster so that it can handle more load.
Scaling up is a viable scaling solution until individual components cannot be scaled up
any further.
Example: From a networking perspective, this could be taking a 1GbE switch, and
replacing it with a 10GbE switch. Same number of switchports, but the bandwidth has
been scaled up via bigger pipes.
Scaling out takes the infrastructure you’ve got and replicates it to work in parallel.
Scaling out is adding more components in parallel to spread out a load. This has the
effect of increasing infrastructure capacity roughly linearly.
Example: Data centers often scale out using pods. Build a compute pod, spin up
applications to use it, then scale out by building another pod to add capacity. Hadoop
takes the scale-out approach: when more storage or processing capacity is needed,
additional commodity nodes are added to the cluster, and HDFS and MapReduce spread
data and tasks across them, so capacity and throughput grow roughly linearly with the
number of nodes.

13) Discuss in detail about the basic building blocks of Hadoop with
a neat sketch.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node
runs the NameNode and JobTracker daemons (and, in a small cluster, may also host a
DataNode and TaskTracker), whereas each slave node runs a DataNode and a
TaskTracker.

NameNode: The NameNode is the master of HDFS that directs the slave DataNode
daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps
track of how your files are broken down into file blocks, which nodes store those blocks
and the overall health of the distributed filesystem.
DataNode: Each slave machine in your cluster will host a DataNode daemon to
perform the grunt work of the distributed filesystem - reading and writing HDFS blocks
to actual files on the local file system. When you want to read or write an HDFS file, the
file is broken into blocks and the NameNode will tell your client which DataNode each
block resides on. Your client communicates directly with the DataNode daemons to
process the local files corresponding to the blocks.
JobTracker: Once you submit your code to your cluster, the JobTracker determines
the execution plan by determining which files to process, assigns nodes to different
tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will
automatically relaunch the task, possibly on a different node, up to a predefined limit of
retries.
TaskTracker: As with the storage daemons, the computing daemons also follow a
master/slave architecture: the JobTracker is the master overseeing the overall
execution of a MapReduce job, and the TaskTrackers manage the execution of
individual tasks on each slave node.
