BDA Unit-4
UNIT-IV
BASICS OF HADOOP
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes
– design of Hadoop distributed file system (HDFS) – HDFS concepts – Java interface – data
flow – Hadoop I/O – data integrity – compression – serialization – Avro – file-based data
structures - Cassandra – Hadoop integration.
DATA FORMAT
Q. What is the data format of Hadoop?
For example, in the weather dataset commonly used to illustrate Hadoop, the data is stored using a line-oriented ASCII format in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If a file output format is not specified explicitly, text files are created as the output files. Output keys and values can be of any type, because TextOutputFormat converts them into strings by calling their toString() method.
HDFS stores data in units called blocks, the smallest unit of data that the file system can store. Files are broken down into these blocks, which are then distributed across the cluster and replicated for fault tolerance.
The number of reduce tasks is not governed by the size of the input but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys in each partition, but the records for
any given key are all in a single partition.
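The default partitioning behaviour is to hash the key and take the result modulo the number of reduce tasks, which is what keeps all records for a given key in one partition. A minimal sketch of such a partition function in the Hadoop Java API is shown below; the class name KeyHashPartitioner and the Text/IntWritable key-value types are illustrative assumptions, not taken from these notes.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash-partitions map output by key, so all records for a given key
// end up in the same partition and therefore the same reduce task.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}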
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Because the combiner function is an optimization, Hadoop provides no guarantee of how many times it will call it for a particular map output record, if at all.
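To show where the combiner and the default TextOutputFormat fit into a job, here is a minimal word-count style driver written against the Hadoop Java MapReduce API. The class names below (WordCount, TokenMapper, SumReducer) are illustrative, not taken from these notes; reusing the reducer as the combiner works here only because summing counts is associative and commutative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word. Also reused as the combiner,
    // which is safe because addition is associative and commutative.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optimization: may run zero or more times
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // TextOutputFormat is the default; it writes key<TAB>value lines
        // using toString() on the output key and value.
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}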
HADOOP STREAMING
Q. What is Hadoop streaming? Explain in details.
Definition:
Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It
uses UNIX standard streams as the interface between Hadoop and the user application.
Hadoop streaming is a utility that comes with the Hadoop distribution.
Streaming is naturally suited for text processing. The data is viewed line by line and processed as key-value pairs separated by a tab character. The reduce function reads lines from standard input, sorted by key, and writes its results to standard output.
A streaming job is configured with the following options:
Input = the input location from which the Mapper reads its data
Output = the location where the Reducer stores its output
Mapper = the executable file of the Mapper
Reducer = the executable file of the Reducer
The map and reduce functions read their input from STDIN and write their output to STDOUT. The Mapper reads the input data from the input reader/format as key-value pairs, maps them according to the logic written in the code, and passes them on to the reduce stream, which performs data aggregation and writes the data to the output.
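To keep all the examples in these notes in one language, the sketch below writes a streaming-style mapper as a plain Java program rather than a Python or shell script: it simply reads lines from STDIN and writes tab-separated key-value pairs to STDOUT, which is the whole contract a streaming mapper has to honour. The class name StreamingWordMapper is an illustrative assumption.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper is just an executable that reads records from STDIN
// and writes key<TAB>value lines to STDOUT.
public class StreamingWordMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // emit (word, 1)
                }
            }
        }
    }
}

Such an executable is what the Mapper option above refers to; a corresponding reducer would read the sorted key-value lines from STDIN, aggregate them, and write its results to STDOUT.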
HADOOP PIPES
Q. Briefly explain about Hadoop pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
Hadoop architecture
Name node
The namenode maintains two different files:
1. fsimage: It is the snapshot of the file system when name node started.
2. Edit logs: It is the sequence of changes made to the file system after namenode started.
Only when the namenode restarts are the edit logs applied to the fsimage to get the latest snapshot of the file system.
Secondary Namenode
Working of secondary Namenode:
1. It gets the edit logs from the Namenode at regular intervals and applies them to the fsimage.
2. Once it has a new fsimage, it copies it back to the Namenode.
3. The Namenode will use this fsimage at the next restart, which reduces the startup time.
The Secondary Namenode's whole purpose is to have a checkpoint in HDFS. It is just a helper node for the Namenode, which is why it is also known as the checkpoint node in the community.
HDFS BLOCK
Q. Write short note on HDFS block
HDFS is a block-structured file system: user data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block is 64 MB, which can be increased as per need.
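As a sketch of how block size, replication and block placement can be inspected through the HDFS Java interface, the snippet below uses the FileSystem API; the path /user/data/sample.txt is an assumed example, not a file referenced in these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // connect to the default file system (HDFS)

        Path file = new Path("/user/data/sample.txt");   // assumed example path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());

        // Each BlockLocation lists the datanodes holding one block of the file.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " hosted on " + String.join(", ", loc.getHosts()));
        }
        fs.close();
    }
}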
Heartbeat mechanism
The connectivity between the NameNode and a DataNode is maintained by persistent heartbeats, which each DataNode sends every three seconds.
The heartbeat gives the NameNode confirmation that the DataNode is available and that the block replicas it hosts are available.
Additionally, heartbeats carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
During normal operation, if the NameNode does not receive a heartbeat from a DataNode within ten minutes, it considers that DataNode to be out of service and the block replicas it hosts to be unavailable.
The NameNode schedules the creation of new replicas of those blocks on other DataNodes.
Heartbeat replies carry instructions from the NameNode back to the DataNode, including commands to:
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Re-register the node.
Among Hadoop's Writable wrapper classes, VIntWritable and VLongWritable are used for variable-length integer types and variable-length long types respectively.
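A small sketch of the space saving from variable-length encoding, using the Hadoop Writable wrapper classes; the values 42 and 1,000,000,000 are arbitrary examples chosen for illustration.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.VLongWritable;

public class VarLengthDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream smallBuf = new ByteArrayOutputStream();
        new VIntWritable(42).write(new DataOutputStream(smallBuf));

        ByteArrayOutputStream largeBuf = new ByteArrayOutputStream();
        new VLongWritable(1_000_000_000L).write(new DataOutputStream(largeBuf));

        // A small value serializes to a single byte; larger values take more bytes.
        System.out.println("42 as VIntWritable:   " + smallBuf.size() + " byte(s)");
        System.out.println("1e9 as VLongWritable: " + largeBuf.size() + " byte(s)");
    }
}

With variable-length encoding the small value serializes to a single byte, whereas a fixed-length IntWritable would always occupy four bytes.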
AVRO
Q. Write short note on Avro, file-based structure and Cassandra Hadoop integration.
Consider an example. Assume the block size is 128 MB and the cluster has 10 GB of capacity (so roughly 80 available blocks). Suppose that I have created 3 small files which together take 128 MB on disk (block files, checksums, replication with a factor of 3, and so on) and occupy 3 HDFS blocks. If I want to add another small file to HDFS, what does HDFS use to calculate the available blocks: the number of used blocks or the actual disk usage?
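A brief worked note, added here as an interpretation rather than something stated above: HDFS blocks are not pre-allocated on disk, so a file smaller than a block consumes only its actual bytes (multiplied by the replication factor) plus a small checksum overhead, even though the namespace records a full block for it. The NameNode therefore calculates remaining capacity from the actual disk usage the DataNodes report in their heartbeats, not from a count of used 128 MB block slots, so creating many small files does not exhaust the roughly 80 blocks any faster than their real size on disk would suggest.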
Healthcare Big Data is more complex than Big Data arising from any other major domain, because a variety of data sources and methods are used in conventional hospital settings and in healthcare administration (e-Health). In order to achieve its primary goal, which is to improve the patient experience while providing reliable care within financial constraints and in line with government regulations, healthcare big data (HBD) should be analysed to determine the level of satisfaction.
A large portion of Big Data is unstructured. The significant steps of big data management in the healthcare industry are data acquisition, data storage, data handling, data analysis and data visualization. A huge amount of data is generated daily by medical organizations; collectively it covers patients, healthcare centres, medical specialists and, of course, the diseases themselves. The data is enormous and provides insight into future predictions, which may prevent the most extreme