Big Data Question and Answer
Hadoop – Architecture
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
store and process big data. Hadoop works on the MapReduce programming model, which was
introduced by Google. Today many big-brand companies use Hadoop in their organizations to
deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly
consists of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is a programming model, built on top of the YARN framework, whose major
feature is performing distributed processing in parallel across a Hadoop cluster; this
parallelism is what makes Hadoop so fast, because when you are dealing with Big Data,
serial processing is no longer practical. MapReduce has two main tasks, divided phase-wise:
in the first phase Map is utilized, and in the second phase Reduce is utilized. A job passes
through the following stages:
RecordReader: reads an input split and converts it into the key-value pairs that are fed to the mapper.
Map: a user-defined function that processes each input key-value pair and emits intermediate key-value pairs.
Combiner: an optional local reducer that aggregates a mapper's output on the same node to reduce the data sent over the network.
Partitioner: decides which reducer receives each intermediate key-value pair, typically by hashing the key.
Shuffle and Sort: transfers the intermediate data to the reducers and sorts it by key.
Reduce: a user-defined function that aggregates all the values for a key and emits the final output pairs.
OutputFormat: writes the final key-value pairs to the output location.
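To make these phases concrete, here is a minimal word-count sketch in Java (not taken from the original answer; class names and paths are illustrative). TokenizerMapper implements the Map phase, IntSumReducer serves as both the Combiner and the Reduce phase, and the framework supplies the RecordReader, Partitioner, Shuffle and Sort, and OutputFormat:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional Combiner phase
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}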
2. HDFS
HDFS (Hadoop Distributed File System) is utilized for storage. It is mainly designed for
working on commodity hardware devices (inexpensive devices) and follows a distributed file
system design. HDFS is designed in such a way that it prefers storing data in a few large
blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices
present in the Hadoop cluster. Data storage nodes in HDFS:
NameNode(Master)
DataNode(Slave)
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A file is divided into
multiple blocks of 128 MB each by default, and you can also change the block size manually.
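As a hedged illustration of changing the block size, the standard property in current releases is dfs.blocksize (older releases used dfs.block.size); the 256 MB value below is an arbitrary example:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Override the default 128 MB block size; the value is in bytes.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB (example value)
  }
}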
Replication: HDFS keeps multiple copies of every block (3 by default, controlled by the replication factor) on different DataNodes so that data survives node failures.
Rack Awareness: the NameNode uses its knowledge of the cluster's rack topology to place replicas on different racks, so a single rack failure does not lose every copy of a block.
HDFS Architecture
ArrayWritable: a Writable wrapper for an array whose elements are all instances of the same Writable class.
TwoDArrayWritable: a Writable wrapper for a two-dimensional array of Writable elements.
NullWritable: a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable, so in MapReduce a key or a value can be declared as NullWritable when we don't need to use that field.
ObjectWritable: a general-purpose generic object wrapper which can store any object such as Java primitives, String, Enum, Writable, null, or arrays.
Text: can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike Java's String data type, Text is mutable in Hadoop.
BytesWritable: a wrapper for an array of binary data.
GenericWritable: similar to ObjectWritable but supports only a few types. Users need to subclass GenericWritable and specify the types to support.
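A minimal sketch (all names are illustrative) showing how a few of these wrappers behave:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    // Text is mutable: the same object can be reset with new contents,
    // which lets MapReduce code reuse objects instead of reallocating.
    Text word = new Text("hello");
    word.set("world");

    IntWritable count = new IntWritable(42);
    System.out.println(word + " -> " + count.get());

    // NullWritable is a singleton that carries no data; use it when a
    // key or a value is not needed.
    NullWritable nothing = NullWritable.get();
    System.out.println(nothing);
  }
}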
13. Give a detailed explanation about Hadoop Streaming and Hadoop Pipes
Hadoop Streaming
It is a utility that comes with the Hadoop distribution and allows developers or programmers
to write Map-Reduce programs in different programming languages such as Ruby, Perl,
Python, C++, etc. We can use any language that can read from standard input (STDIN), like
keyboard input, and write using standard output (STDOUT). We all know the Hadoop
framework is completely written in Java, but programs for Hadoop do not necessarily need to
be coded in the Java programming language. The Hadoop Streaming feature has been
available since Hadoop version 0.14.1.
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce: an adapter layer
provided by Apache Hadoop that allows C++ application code to be used in MapReduce
programs. Applications that require high numerical performance may see better throughput if
written in C++ and used through Pipes. Unlike Hadoop Streaming, which uses standard I/O to
communicate with the map and reduce code, Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the C++ map or reduce function.
SequenceFile
1. SequenceFile files are flat files designed by Hadoop to store binary forms of <key, value> pairs.
2. A SequenceFile can serve as a container: small files packaged into a SequenceFile can be stored and processed efficiently.
3. SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class provides append functionality.
4. The key and value in a SequenceFile can be any built-in Writable type or a custom Writable type.
SequenceFile Compression
There are three types:
A. No compression: If compression is not enabled (the default setting), then each record
consists of its record length (number of bytes), the key length, the key, and the value. The
length field is four bytes.
B. Record compression: The record compression format is basically the same as the
uncompressed format; the difference is that the value bytes are compressed with the codec
defined in the header. Note that the key is not compressed.
C. Block compression: Block compression compresses multiple records at once, so it is
more compact than record compression and is generally preferred. Records are added to a
block until the number of bytes reaches a minimum size, which is defined by the
io.seqfile.compress.blocksize property; the default value is 1,000,000 bytes. The format is
record count, key length, key, value length, value.
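A minimal sketch of selecting a compression type when creating a writer, assuming the classic FileSystem-based createWriter overload and an illustrative path and key/value types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionChoice {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/blockcompressed.seq"); // illustrative path
    // Pass CompressionType.NONE, RECORD, or BLOCK to select the format.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK);
    writer.close();
  }
}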
Read/Write SequenceFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append key-value pairs to the file
6) Close the stream
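A minimal Java sketch of this write flow, assuming an illustrative path and Text/IntWritable as the key/value types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    Path path = new Path("/tmp/demo.seq");             // 3) output path (illustrative)
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(              // 4) get a SequenceFile.Writer
          fs, conf, path, Text.class, IntWritable.class);
      for (int i = 0; i < 100; i++) {
        writer.append(new Text("key" + i), new IntWritable(i)); // 5) append records
      }
    } finally {
      if (writer != null) {
        writer.close();                                // 6) close the stream
      }
    }
  }
}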
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a SequenceFile.Reader for reading
5) Get the key class and value class, and read the records
6) Close the stream
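A matching read sketch, using the same illustrative path; the key and value objects are instantiated from the classes recorded in the file header:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    Path path = new Path("/tmp/demo.seq");             // 3) input path (illustrative)
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf); // 4) open a Reader
      // 5) instantiate key/value from the classes stored in the file
      Writable key = (Writable) ReflectionUtils.newInstance(
          reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(
          reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      if (reader != null) {
        reader.close();                                // 6) close the stream
      }
    }
  }
}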
Read/Write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append key-value pairs to the file
6) Close the stream
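A minimal sketch of the MapFile write flow (illustrative path and types; note that MapFile.Writer takes the output path as a String directory name and requires keys to be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    String dir = "/tmp/demo.map";                      // 3) output directory (illustrative)
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(                     // 4) create a MapFile.Writer
          conf, fs, dir, Text.class, IntWritable.class);
      for (int i = 0; i < 100; i++) {
        // 5) append entries; zero-padding keeps the keys in sorted order
        writer.append(new Text(String.format("key%03d", i)), new IntWritable(i));
      }
    } finally {
      if (writer != null) {
        writer.close();                                // 6) close the stream
      }
    }
  }
}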
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a MapFile.Reader for reading
5) Get the key class and value class, and read the records
6) Close the stream
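A matching read sketch for the MapFile, using the same illustrative directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class MapFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    String dir = "/tmp/demo.map";                      // 3) input directory (illustrative)
    MapFile.Reader reader = null;
    try {
      reader = new MapFile.Reader(fs, dir, conf);      // 4) open a MapFile.Reader
      // 5) instantiate key/value from the classes stored in the file
      WritableComparable key = (WritableComparable) ReflectionUtils.newInstance(
          reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(
          reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      if (reader != null) {
        reader.close();                                // 6) close the stream
      }
    }
  }
}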