
V.R.S.

College of Engineering and Technology


(Reaccredited by NAAC and an ISO 9001:2008 Recertified Institution)

SUBJECT NAME : BIG DATA ANALYTICS


SUBJECT CODE : CCS334
REGULATION : 2021
YEAR/SEMESTER : III/V
BRANCH : CSE

UNIT-IV
BASICS OF HADOOP
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes
– design of Hadoop distributed file system (HDFS) – HDFS concepts – Java interface – data
flow – Hadoop I/O – data integrity – compression – serialization – Avro – file-based data
structures - Cassandra – Hadoop integration.

DATA FORMAT
Q. What is the data format of Hadoop?
 In the examples commonly used with Hadoop (such as the weather dataset), data is stored in a line-oriented ASCII format in which each line is a record; the format supports a rich set of meteorological elements, many of which are optional or of variable length.
 The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If the file output format is not specified explicitly, text files are created as the output files. Output key-value pairs can be of any type, because TextOutputFormat converts them to strings with the toString() method.
 HDFS data is stored in blocks. These blocks are the smallest unit of data that the file system can store. Files are broken down into blocks, which are then distributed across the cluster and also replicated for safety.



ANALYZING THE DATA WITH HADOOP
 Hadoop supports parallel processing, so we take advantage of this by expressing the query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
 MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
 Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
 MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node to the extent possible. A running MapReduce job consists of various phases, which are described below.

Phases of Hadoop MapReduce


 In the map phase, the input dataset is split into chunks, and map tasks process these chunks in parallel. The outputs of the map tasks become the inputs to the reduce tasks. Reducers process the intermediate data from the maps into smaller sets of tuples, producing the final output of the framework.
 The advantages of using MapReduce, which runs over distributed infrastructure (CPU and storage), are automatic parallelization and distribution of data in blocks across the distributed system; fault tolerance against failures of storage, computation and network infrastructure; deployment, monitoring and security capabilities; and a clear abstraction for programmers. Most MapReduce programs are written in Java, but they can also be written in any scripting language using Hadoop's Streaming API. A minimal word count program written against the Java MapReduce API is sketched below.
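 As a minimal sketch of these two phases, the program below is a word count job written against Hadoop's Java MapReduce API (the class names and the word count logic are illustrative, not a prescribed program):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}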
SCALING OUT
Q.Discuss data flow in MapReduce programming model.
 To scale out, we need to store the data in a distributed file system, typically HDFS, to allow
Hadoop to move the MapReduce computation to each machine hosting a part of the data.
 A MapReduce job is a unit of work that the client wants to be performed: It consists of the
input data, the MapReduce program and configuration information.



 Hadoop runs the job by dividing it into tasks, of which there are two types: Map tasks and
reduce tasks.
 There are two types of nodes that control the job execution process: A job tracker and a
number of task trackers.
 Job tracker: This tracker plays the role of scheduling jobs and tracking all jobs assigned to the task trackers.
 Task tracker: This tracker plays the role of tracking tasks and reporting their status to the job tracker.
 Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits.
Hadoop creates one map task for each split, which runs the user defined map function for
each record in the split.
 That split information is used by the YARN ApplicationMaster to try to schedule map tasks on the same node where the split data resides, making the task data-local. If map tasks are spawned at random locations, then each map task has to copy the data it needs to process from the DataNode where that split data resides, consuming a lot of cluster bandwidth. By trying to schedule map tasks on the node where the split data resides, the Hadoop framework sends computation to the data rather than bringing data to the computation, saving cluster bandwidth. This is called data locality optimization.
 Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.



MapReduce data flow with single reduce task

 The number of reduce tasks is not governed by the size of the input but is specified independently.
 When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys in each partition, but the records for
any given key are all in a single partition.

MapReduce data flow with Multiple reduce task

 Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not guarantee how many times it will call it for a particular map output record, if at all. A sketch of wiring a combiner into a job follows.
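 Because summing counts is both associative and commutative, the reducer from the illustrative word count sketch above can itself serve as the combiner; a single extra line in the driver wires it in (a minimal sketch, assuming the classes defined earlier):

job.setCombinerClass(IntSumReducer.class);   // run the reducer logic on each map's output before it is shuffled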
HADOOP STREAMING
Q. What is Hadoop streaming? Explain in details.
Definition:
 Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It uses UNIX standard streams as the interface between Hadoop and the user application. Hadoop Streaming is a utility that comes with the Hadoop distribution.
 Streaming is naturally suited for text processing. The data view is line-oriented and processed as key-value pairs separated by a tab character. The reduce function reads lines from standard input, sorted by key, and writes its results to standard output.



 It helps in real-time data analysis, which is much faster using MapReduce programming running on a multi-node cluster. Different technologies such as Spark and Kafka also help with real-time streaming alongside Hadoop.
Features of Hadoop Streaming
1. Users can execute non-Java-programmed MapReduce jobs on Hadoop clusters.
Supported languages include Python, Perl, and C++.
2. Hadoop Streaming monitors the progress of jobs and provides logs of a job's entire
execution for analysis.
3. Hadoop Streaming works on the MapReduce paradigm, so it supports scalability,
flexibility, and security/authentication.
4. Hadoop Streaming jobs are quick to develop and don't require much programming.
 Following code shows Streaming Utility:
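 A representative invocation looks like the following (the jar location, HDFS paths and script names are illustrative and vary between distributions):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py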

Where:
Input = Input location for Mapper from where it can read input
Output = location for the Reducer to store the output
Mapper = The executable file of the Mapper
Reducer = The executable file of the Reducer

 Map and reduce functions read their input from STDIN and produce their output to STDOUT. In the code execution process shown in the figure below, the Mapper reads the input data from the input reader/format in the form of key-value pairs, maps them according to the logic written in the code, and then passes them to the reduce stream, which performs data aggregation and writes the data to the output.



Code execution process

HADOOP PIPES
Q. Briefly explain about Hadoop pipes
 Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.

Execution of streaming and pipes


 With Hadoop pipes, we can implement applications that require higher performance in
numerical calculations using C++ in MapReduce. The pipes utility works by establishing a
persistent socket connection on a port with the Java pipes task on one end, and the external
C++ process at the other.



 Other dedicated alternatives and implementations are also available, such as Pydoop for Python and similar wrappers for C. These are mostly built as wrappers and are JNI-based. It is, however, worth noting that MapReduce tasks are often a small component of a larger pipeline of chained, redirected and recurring MapReduce jobs. This is usually done with the help of higher-level languages or APIs like Pig, Hive and Cascading, which can be used to express such data extraction and transformation problems.
DESIGN OF HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Q. Briefly explain about design and architecture of HDFS
 The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. HDFS is the file system component of Hadoop.
 HDFS stores file system metadata and application data separately. As in other distributed file systems, like GFS, HDFS stores metadata on a dedicated server called the NameNode. Application data is stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
 The Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale an Apache Hadoop cluster to hundreds of nodes.
 A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. When a file is saved in HDFS, it is broken into smaller chunks or "blocks".
 HDFS is a fault-tolerant and resilient system, meaning it prevents a failure in a node from affecting the overall system's health and allows for recovery from failure too. In order to achieve this, data stored in HDFS is automatically replicated across different nodes.
 HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The file system namespace
hierarchy is similar to most other existing file systems; one can create and remove files,
move a file from one directory to another, or rename a file.
 Hadoop distributed file system is a block-structured file system where each file is divided
into blocks of a pre-determined size. These blocks are stored across a cluster of one or
several machines.
 Apache Hadoop HDFS architecture follows a master/slave model, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).



 HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
 Design considerations of HDFS. HDFS is designed for:
1. Commodity hardware: HDFS does not require expensive hardware for executing user tasks. It is designed to run on clusters of commodity hardware.
2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
3. Very large files: HDFS is designed to store files that are hundreds of megabytes, gigabytes or terabytes in size.
 HDFS is not a good fit for:
4. Low-latency data access: applications that need access to data in tens of milliseconds.
5. Lots of small files: the NameNode holds file system metadata in memory, so the number of files it can manage is limited by its memory.
6. Multiple writers and arbitrary file modifications: files in HDFS may be written to by a single writer, and writes are always made at the end of the file; there is no support for multiple writers or for modifications at arbitrary offsets in the file.
 The HDFS achieve the following goals:
1. Manage large datasets: Organizing and storing huge datasets can be a hard task to handle. HDFS is used to manage applications that have to deal with huge datasets. To do this, HDFS may have hundreds of nodes per cluster.
2. Detecting faults: HDFS should have technology in place to scan and detect faults
quickly and effectively as it includes a large number of commodity hardware. Failure of
components is a common issue.
3. Hardware efficiency: When large datasets are involved it can reduce the network traffic
and increase the processing speed.
HDFS Architecture

Hadoop architecture



 DataNodes process and store data blocks, while NameNodes manage the many DataNodes,
maintain data block metadata and control client access.
NameNode and DataNode
 The NameNode holds the metadata for HDFS, such as namespace information and block information. When in use, all this information is stored in main memory, but it is also stored on disk for persistent storage.
 The NameNode manages the file system namespace. It keeps the directory tree of all files in the file system and the metadata about files and directories.
 A DataNode is a slave node in HDFS that stores the actual data as instructed by the NameNode. In brief, the NameNode controls and manages one or more DataNodes.
 DataNodes serve read and write requests. They also create, delete and replicate blocks on instruction from the NameNode.

Name node
 The NameNode maintains two files on disk:
1. fsimage: a snapshot of the file system when the NameNode started.
2. Edit logs: the sequence of changes made to the file system after the NameNode started.
 Edit logs are applied to the fsimage only when the NameNode restarts, to produce the latest snapshot of the file system.



 But NameNode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the NameNode runs for a long period of time.
 The following issues arise in this situation:
1. The edit log becomes very large, which is challenging to manage.
2. A NameNode restart takes a long time because many changes have to be merged.
3. In the case of a crash, a huge amount of metadata is lost, since the fsimage is very old.
 To overcome these issues we need a mechanism that keeps the edit log at a manageable size and maintains an up-to-date fsimage, so that the load on the NameNode is reduced.
 The Secondary NameNode helps to overcome these issues by taking over the responsibility of merging the edit logs with the fsimage from the NameNode.

Secondary Namenode
 Working of the Secondary NameNode:
1. It gets the edit logs from the NameNode at regular intervals and applies them to its copy of the fsimage.
2. Once it has a new fsimage, it copies it back to the NameNode.
3. The NameNode uses this fsimage for its next restart, which reduces the startup time.
 The Secondary NameNode's whole purpose is to provide a checkpoint in HDFS. It is just a helper node for the NameNode, which is why it is also known as the checkpoint node within the community.
HDFS BLOCK
Q. Write short note on HDFS block
 HDFS is a block-structured file system. In general, user data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block is 128 MB (64 MB in older Hadoop releases), and it can be changed as needed.



 HDFS is fault tolerant: if a DataNode fails, the current block write operation on that node is re-replicated to some other node. The block size, number of replicas and replication factor are specified in the Hadoop configuration files. Synchronization between the NameNode and DataNodes is done by heartbeat messages that are periodically sent from each DataNode to the NameNode.
 Apart from above components the job tracker and task trackers are used when map reduce
applications run over the HDFS. Hadoop Core consists of one master job tracker and several
task trackers. The job tracker runs on name nodes like a master while task trackers run on
data nodes like slaves.
 The job tracker is responsible for taking the requests from a client and assigning task
trackers to it with tasks to be performed. The job tracker always tries to assign tasks to the
task tracker on the data nodes where the data is locally present.
 If for some reason a node fails, the job tracker assigns the task to another task tracker on a node where a replica of the data exists, since data blocks are replicated across the DataNodes. This ensures that the job does not fail even if a node fails within the cluster.
The Command Line Interface:
 HDFS can be manipulated using the command line. All the commands used for manipulating HDFS through the command line interface begin with the "hadoop fs" command.
 Most of the familiar Linux-style file commands are supported over HDFS; each command option starts with a "-" sign.
 For example, the command for listing the files in a Hadoop directory is:
#hadoop fs -ls
 The general syntax of HDFS command line manipulation is:
#hadoop fs -<command>
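 A few other commonly used file system commands follow the same pattern (the paths shown are illustrative):
#hadoop fs -mkdir /user/hadoop/input
#hadoop fs -put localfile.txt /user/hadoop/input
#hadoop fs -cat /user/hadoop/input/localfile.txt
#hadoop fs -get /user/hadoop/input/localfile.txt .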
JAVA INTERFACE
Q. Briefly explain about java interface in Hadoop filesystem.
 Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS.



1. Reading Data from a Hadoop URL :
 One way to read a file from a Hadoop file system is by using a java.net.URL object to open a stream to read the data. The general idiom is as follows:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
 Java recognizes Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
 setURLStreamHandlerFactory() is a method in the java.net.URL class that sets the URL stream handler factory for the Java Virtual Machine. This factory is responsible for creating the URL stream handler instances that are used to retrieve the contents of a URL.
 This method can only be called once per JVM, so it is typically executed in a static block.
 Example: Displaying files from a Hadoop file system on standard output using a
URLStreamHandler.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

2. Reading Data Using the FileSystem API:

 Sometimes it is not possible to set a URLStreamHandlerFactory for our application, so in that case we use the FileSystem API to open an input stream for a file. A file in a Hadoop filesystem is represented by a Hadoop Path object.
 There are two static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
 FileSystem is a generic abstract class used to interface with a file system. The FileSystem class also serves as a factory for concrete implementations; the first form uses information from the configuration, such as the scheme and authority, to determine which file system to return.
 A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the class path, such as conf/core-site.xml.
FSDataInputStream:
 The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class.
 This class, in the org.apache.hadoop.fs package, is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream. A sketch of reading a file through this API is shown below.
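 A minimal sketch of reading a file through the FileSystem API (the class name is illustrative and the file URI is taken from the command line):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                        // e.g. hdfs://host/user/hadoop/file.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));             // actually an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}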
Writing Data:
 The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
 The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file. It also lives in the org.apache.hadoop.fs package.
 We can append to an existing file using the append() method :



public FSDataOutputStream append(Path f) throws IOException
 The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can continue writing to an existing file after a restart. The append operation is optional and is not implemented by all Hadoop file systems. A sketch of creating and writing a file is shown below.
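 A minimal sketch of creating a file and writing to it through the same API (the class name, path and contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/demo.txt");   // illustrative path
        FSDataOutputStream out = fs.create(file);        // returns an FSDataOutputStream
        try {
            out.writeUTF("hello hdfs");                  // write some data
        } finally {
            out.close();                                 // flush and close the stream
        }
    }
}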
DATA FLOW
Q. Explain in detail the data flow and heartbeat mechanism of HDFS. APR/MAY 2024
1. Anatomy of a File Read:
 The client opens the file it wishes to read by calling open() on the FileSystem object, which
for HDFS is an instance of Distributed FileSystem (DFS). DFS calls the namenode, using
RPC, to determine the locations of the blocks for the first few blocks in the file.
 For each block, the namenode returns the addresses of the datanodes that have a copy of that
block. Furthermore, the datanodes are sorted according to their proximity to the client. If the
client is itself a datanode, then it will read from the local datanode, if it hosts a copy of the
block.
 The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and
namenode I/O.

Client reading data from HDFS


 The client then calls read() on the stream. DFSInputStream, which has stored the datanode
addresses for the first few blocks in the file, then connects to the first(closest) datanode for
the first block in the file.



 Data is streamed from the datanode back to the client, which calls read()repeatedly on the
stream. When the end of the block is reached, DFSInputStreamwill close the connection to
the datanode, then find the best datanode for the next block. This happens transparently to
the client, which from its point of view is just reading a continuous stream.
 Blocks are read in order, with the DFSInputStream opening new connections to DataNodes as the client reads through the stream. It will also call the NameNode to retrieve the DataNode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
 During reading, if the DFSInputStream encounters an error while communicating with a DataNode, it will try the next closest one for that block.
2. Anatomy of a File Write:
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, data is split into packets by DFSOutputStream, which is
then written to an internal queue, called data queue. Data streamer consumes the data
queue.
4. Data streamer streams the packets to the first DataNode in the pipeline. It stores the
packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages the "ack queue" of packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.

Anatomy of a file write



Heartbeat Mechanism in HDFS
 A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a task tracker sends its heartbeat to the job tracker.

Heartbeat mechanism

 The connectivity between the NameNode and a DataNode are managed by the persistent
heartbeats that are sent by the DataNode every three seconds.
 The heartbeat provides the NameNode confirmation about the availability of the blocks and
the replicas of the DataNode.
 Additionally, heartbeats also carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
 During normal operation, if the NameNode does not receive a heartbeat from a DataNode within ten minutes, it considers that DataNode to be out of service and the block replicas it hosts to be unavailable.
 The NameNode schedules the creation of new replicas of those blocks on other DataNodes.
 The heartbeats carry roundtrip communications and instructions from the NameNode,
including commands to :
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Re-register the node.



d) Shut down the node.
e) Send an immediate block report.
Role of Sorter, Shuffler and Combiner in MapReduce Paradigm
 A combiner, also known as a semi-reducer, is an optional class that operates by accepting the
inputs from the Map class and thereafter passing the output key-value pairs to the Reducer
class.
 The main function of a combiner is to summarize the map output records with the same key.
The output of the combiner will be sent over the network to the actual reducer task as input.
 The process of transferring data from the mappers to reducers is known as shuffling i.e. the
process by which the system performs the sort and transfers the map output to the reducer as
input. So, shuffle phase is necessary for the reducers, otherwise, they would not have any
input.
 Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce.
Sort phase in MapReduce covers the merging and sorting of map outputs.
 Data from the mapper are grouped by the key, split among reducers and sorted by the key.
Every reducer obtains all values associated with the same key. Shuffle and sort phase in
Hadoop occur simultaneously and are done by the MapReduce framework.
HADOOP I/O
 Hadoop's input/output system comes with a set of primitives. Since Hadoop deals with multi-terabyte datasets, a closer look at these primitives gives an idea of how Hadoop handles data input and output.
DATA INTEGRITY
Q. Explain in details about data integrity and Hadoop local file system in HDFS.
 Data integrity means that data should remain accurate and consistent all across its storing,
processing and retrieval operations.
 However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data being read or written. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
 The commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size.



Data Integrity in HDFS:
 HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is not an issue.
 All data that enters into the system is verified by the datanodes before being forwarded for
storage or further processing. Data sent to the datanode pipeline is verified through
checksums and any corruption found is immediately notified to the client with
ChecksumException.
 The client read from the datanode also goes through the same drill. The datanodes maintain
a log of checksum verification to keep track of the verified block. The log is updated by the
datanode upon receiving a block verification success signal from the client. This type of
statistics helps in keeping the bad disks at bay.
 Apart from this, a periodic verification on the block store is made with the help of
DataBlockScanner running along with the datanode thread in the background. This protects
data from corruption in the physical storage media.
 HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good
replicas to produce a new, uncorrupt replica.
 If a client detects an error when reading a block, it reports the bad block and the datanode it
was trying to read from to the namenode before throwing a ChecksumException. The
namenode marks the block replica as corrupt, so it does not direct clients to it, or try to copy
this replica to another datanode.
 It then schedules a copy of the block to be replicated on another DataNode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted. It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using the open() method to read a file, as sketched below.
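 A minimal sketch of disabling checksum verification before reading a possibly corrupt file (the class name is illustrative; the file path is taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.setVerifyChecksum(false);                     // skip client-side checksum verification
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}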
Hadoop Local File System
 The Hadoop local file system performs client-side checksumming. When a file named filename is created, the client transparently creates a hidden file, .filename.crc, in the same directory, containing the checksums for each chunk of the file.
 The chunk size is controlled by the file.bytes-per-checksum property, which defaults to 512 bytes; since the chunk size is stored as metadata in the .crc file, the file can be read back correctly even if the setting for the chunk size has changed. If an error is detected, the local file system throws a ChecksumException.
Checksum file system:
 Local file systems use a checksum file system as a safety measure to ensure that the data is not corrupted or damaged in any way.
 In this file system, the underlying file system is called the raw file system. If an error is detected while working with the checksum file system, it calls reportChecksumFailure().
 The local file system then moves the affected file and its checksum to a side directory for bad files. It is then the responsibility of an administrator to keep a check on these bad files and take the necessary action.
COMPRESSION
Q. Difference between Compression and serialization
 Compression has two major benefits:
a) It reduces the storage space needed for a file.
b) It also increases the speed of data transfer to or from a disk and across the network.
 The following are the commonly used compression formats in Hadoop:
a) Deflate, b) Gzip, c) Bzip2, d) LZO, e) LZ4, and f) Snappy.
 All these compression methods primarily provide optimization of speed and storage space
and they all have different characteristics and advantages.
 Gzip is a general-purpose compressor that is faster than bzip2, while bzip2 compresses more compactly; bzip2's decompression is faster than its compression, but it is still the slower of the two.
 LZO, LZ4 and Snappy are optimized for speed, and so are the better tools when fast compression and decompression matter more than the compression ratio.
Codecs:
 A codec is an implementation of a compression-decompression algorithm used to compress or decompress a data stream for transmission or storage.
 In Hadoop, these compression and decompression operations are performed by different codecs supporting different compression formats, as sketched below.
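 As a minimal sketch, a codec can be instantiated and used to compress a stream; the choice of GzipCodec and the class name here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec through ReflectionUtils so it picks up the configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Wrap standard output so everything written to it is gzip-compressed
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // finish the compressed stream without closing the underlying one
    }
}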
SERIALIZATION.
Explain generic methods and classes in java. Give a procedure to stop java serialization.
Nov/Dec-2023
 Serialization is the process of converting a data object; a combination of code and data
represented within a region of data storage into a series of bytes that saves the state of the



object in an easily transmittable form. In this serialized form, the data can be delivered to
another data store, application, or some other destination.
 Data serialization is the process of converting an object into a stream of bytes to more easily save or transmit it.
 The reverse process, constructing a data structure or object from a series of bytes, is deserialization. The deserialization process recreates the object, making the data easier to read and modify as a native structure in a programming language.

Serialization and deserialization


 Serialization and deserialization work together to transform/recreate data objects to/from a
portable format.
 Serialization enables us to save the state of an object and recreate the object in a new location. Serialization encompasses both the storage of the object and the exchange of data. Since objects are composed of several components, saving or delivering all the parts typically requires significant coding effort, so serialization is a standard way to capture the object in a shareable format.
 Serialization appears in two main areas of data processing: interprocess communication and persistent data storage.
 Interprocess communication between nodes uses remote procedure calls (RPCs). In an RPC, the data is serialized into a binary stream and transferred to a remote node, where it is deserialized into the original message. The lifespan of an RPC message is typically less than a second.
 It is desirable that an RPC serialization format is:



a. Compact: A compact format makes the best use of network bandwidth, which is the
scarcest resource in a data center.
b. Fast: Interprocess communication forms the backbone for a distributed system, so it
is essential that there is as little performance overhead as possible for the
serialization and deserialization process.
c. Extensible: Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.
d. Interoperable: For some systems, it is desirable to be able to support clients that are
written in different languages to the server, so the format needs to be designed to
make this possible.
The Writable Interface
Q. Briefly explain about the writable interface of Hadoop.
 Hadoop uses its own serialization format called Writable. It is written in Java and is fast as
well as compact. The other serialization framework supported by Hadoop is Avro.
 The Writable interface defines two methods: One for writing its state to a DataOutput binary
stream and one for reading its state from a DataInput binary stream.
 When we write a key as IntWritable in the Mapper class and send it to the reducer class,
there is an intermediate phase between the Mapper and Reducer class i.e., shuffle and sort,
where each key has to be compared with many other keys. If the keys are not comparable,
then the shuffle and sort phase won't be executed or may be executed with a high amount of
overhead.
 If a key is taken as an IntWritable, it is comparable by default because a RawComparator acts on that type; it will compare the key with the other keys in the network. This cannot take place in the absence of Writable.
 WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions:
a. It provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the objects' compare() method.
b. It acts as a factory for RawComparator instances.
 To provide mechanisms for serialization and deserialization of data, Hadoop provides two important interfaces, Writable and WritableComparable. The Writable interface specification is as follows:



package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

 The WritableComparable interface is a sub-interface of Hadoop's Writable and Java's Comparable interfaces. Its specification is shown below:
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
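 As a minimal sketch, a custom key type can be built by implementing WritableComparable; the class and fields below are illustrative and show the write(), readFields() and compareTo() methods that the framework calls:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative custom key: an ID together with a count.
public class IdCountWritable implements WritableComparable<IdCountWritable> {
    private int id;
    private long count;

    public IdCountWritable() { }                    // a no-argument constructor is required

    public void set(int id, long count) {
        this.id = id;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);                           // serialize the state to the binary stream
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();                          // deserialize the fields in the same order
        count = in.readLong();
    }

    @Override
    public int compareTo(IdCountWritable other) {
        return Integer.compare(id, other.id);       // ordering used during the shuffle and sort phase
    }
}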
Writable Classes Hadoop Data Types:
 Hadoop provides classes that wrap the Java primitive types and implement the
WritableComparable and Writable Interfaces. They are provided in the
org.apache.hadoop.io package.
 All the Writable wrapper classes have a get() and a set() method for retrieving and storing
the wrapped value.
Primitive Writable Classes:
 These are writable wrappers for Java primitive data types and they hold a single primitive
value that can be set either at construction or via a setter method.
 All these primitive writable wrappers have get() and set() methods to read or write the
wrapped value. Below is the list of primitive writable data types available in Hadoop.
a) BooleanWritable b) ByteWritable
c) IntWritable d) VIntWritable
e) FloatWritable f) LongWritable
g) VLongWritable h) DoubleWritable

 In the above list VIntWritable and VLongWritable are used for variable length Integer types
and variable length long types respectively.



 Serialized sizes of the above primitive writable data types are the same as the size of actual
java data types. So, the size of IntWritable is 4 bytes and LongWritable is 8 bytes.
Text:
 Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of
java.lang.String.
 The Text class uses an int to store the number of bytes in the string encoding, so the
maximum value is 2 GB.
BytesWritable:
 BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves.
 BytesWritable is mutable, and its value may be changed by calling its set() method.
 NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder.
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types.
 It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.
 There are four Writable collection types in the org.apache.hadoop.io package: ArrayWritable, TwoDArrayWritable, MapWritable and SortedMapWritable.
 ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class.
 ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a toArray() method, which creates a shallow copy of the array.
 MapWritable and SortedMapWritable are implementations of java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type of each key and value field is a part of the serialization format for that field.

AVRO
Q. Write short note on Avro, file-based structure and Cassandra Hadoop integration.



 Data serialization is a technique of converting data into binary or text format. There are
multiple systems available for this purpose. Apache Avro is one of those data serialization
systems.
 Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization. Moreover, Avro uses a JSON format to specify the data structure, which makes it more powerful.
 Avro creates a data file where it keeps data along with schema in its metadata section. Avro
files include markers that can be used to split large data sets into subsets suitable for Apache
MapReduce processing.
 Avro has rich schema resolution capabilities. Within certain carefully defined constraints,
the schema used to read data need not be identical to the schema that was used to write the
data.
 An Avro data file has a metadata section where the schema is stored, which makes the file self-describing. Avro data files support compression and are splittable, which is crucial for a MapReduce data input format.
 Avro defines a small number of data types, which can be used to build application specific
data structures by writing schemas.
 Avro supports two kinds of data types:
a. Primitive types: Avro supports all the primitive types. We use the primitive type name to define the type of a given field. For example, a value that holds a string should be declared as {"type": "string"} in the schema.
b. Complex types: Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed (a record schema is sketched below).
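 As a minimal sketch, an Avro record schema is plain JSON and can be parsed with the Avro Java library (the record and field names are illustrative):

import org.apache.avro.Schema;

public class AvroSchemaDemo {
    public static void main(String[] args) {
        // Illustrative record schema written in Avro's JSON schema language
        String schemaJson =
              "{\"type\":\"record\",\"name\":\"WeatherRecord\","
            + "\"fields\":["
            + "{\"name\":\"stationId\",\"type\":\"string\"},"
            + "{\"name\":\"temperature\",\"type\":\"int\"}"
            + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        System.out.println(schema.toString(true));   // pretty-print the parsed schema
    }
}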
Avro data files:
 A data file has a header containing metadata, including the Avro schema and a sync marker,
followed by a series of blocks containing the serialized Avro objects.
 Blocks are separated by a sync marker that is unique to the file and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro data files are splittable, which makes them amenable to efficient MapReduce processing.
FILE-BASED DATA STRUCTURES



 Apache Hadoop supports text files, which are quite commonly used for storing data. Besides text files it also supports binary files, and one of these binary formats is the sequence file.
 Hadoop sequence file is a flat file structure which consists of serialized key-value pairs. This
is the same format in which the data is stored internally during the processing of the
MapReduce tasks.
 Sequence files can also be compressed for space considerations, and based on the compression type used, Hadoop sequence files can be of three types: uncompressed, record compressed and block compressed.
 To create a SequenceFile, use one of its createWriter() static methods, which return a SequenceFile.Writer instance.
 The keys and values stored in a SequenceFile do not necessarily need to be Writable. Any types that can be serialized and deserialized by a serialization framework may be used.
 Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods. A sketch of writing and reading a sequence file is shown below.
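 A minimal sketch of writing and then reading a sequence file is shown below (the path and the key/value types are illustrative; the createWriter() and Reader overloads used here are the classic ones, which are deprecated in newer Hadoop releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/hadoop/numbers.seq");   // illustrative path

        // Write a few key-value records
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        try {
            for (int i = 1; i <= 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }

        // Read the records back from beginning to end
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}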
The SequenceFile format

Structure of a sequence file with no compression and record compression


 A sequence file consists of a header followed by one or more records. The first three bytes of a sequence file are the bytes SEQ, which act as a magic number, followed by a single byte representing the version number. The header contains other fields, including the names of the key and value classes, compression details, user-defined metadata and the sync marker.
 Recall that the sync marker is used to allow a reader to synchronize to a record boundary
from any position in the file. Each file has a randomly generated sync marker, whose value
is stored in the header. Sync markers appear between records in the sequence file.



CASSANDRA HADOOP INTEGRATION.
Elaborate the impact of seamless Hadoop integration on enhancing data processing and
analytics. Nov/Dec-2023.
 Cassandra provides native support to Hadoop MapReduce, Pig and Hive. Cassandra
supports input to Hadoop with ColumnFamilyInputFormat and output with
ColumnFamilyOutputFormat classes, respectively.
 ColumnFamilyInputFormat: It is an implementation of
org.apache.hadoop.mapred.InputFormat. So, its implementation is dictated by the
InputFormat class specifications. Hadoop uses this class to get data for the MapReduce
tasks. It also fragments input data into small chunks that get fed to map tasks.
 ColumnFamilyOutputFormat: OutputFormat is the mechanism for writing the result from MapReduce to permanent storage. Cassandra implements Hadoop's OutputFormat; it enables Hadoop to write the results from the reduce tasks as column family rows. It is implemented such that the results are written to the column family in batches. This is a performance improvement, and the mechanism is called lazy write-back caching.
 ConfigHelper: ConfigHelper is a gateway to configure Cassandra-specific settings for
Hadoop. It saves developers from inputting a wrong property name because all the
properties are set using a method; any typo will appear at compile time.
 Bulk loading: BulkOutputFormat is another utility that Cassandra provides to improve the
write performance of jobs that result in large data. It streams the data in binary format,
which is much quicker than inserting one by one. It uses SSTableLoader to do this.
 Configuring Hadoop with Cassandra is itself quite some work. Writing verbose and long Java code to do something as trivial as word count is a turn-off for a high-level user like a data analyst, which is why higher-level tools such as Pig and Hive are often preferred. A rough configuration sketch is shown below.
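 A rough sketch of configuring a job to read from Cassandra follows; the keyspace, column family, address and partitioner names are illustrative, and the older org.apache.cassandra.hadoop API shown here differs between Cassandra versions, so treat the exact classes and methods as assumptions to be checked against the version in use:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputSetup {
    public static void main(String[] args) throws Exception {
        // Rough sketch only: the names and addresses below are illustrative assumptions.
        Job job = Job.getInstance(new Configuration(), "cassandra-input-job");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "demo_keyspace", "demo_cf");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        // ... set the mapper, reducer and output as in an ordinary MapReduce job ...
    }
}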
Compare and contrast Hadoop and RDBMS for performing large-scale batch analysis. APR/MAY 2024
In the ever-changing data landscape, two notable technologies stand out: Hadoop and Relational Database Management Systems (RDBMS). While both are important parts of the data ecosystem, their design, implementation and typical use cases are very different from each other. Businesses that want to maximize the capability of their data infrastructure need a thorough understanding of these differences.
Understanding Hadoop
Hadoop, an open-source Apache framework, has disrupted the way large-scale data processing and storage are handled. The Hadoop Distributed File System (HDFS) and MapReduce are two of its main components. HDFS was designed to store large amounts of data in partitioned clusters built from commodity hardware. This decouples storage from specialized hardware, reducing cost and making the system easier to grow.
MapReduce, in turn, is a parallel framework for processing and analyzing large amounts of data. It works by breaking a job down into many smaller tasks, spreading them across the cluster, and then aggregating the results. Hadoop's ability to process data in parallel makes it ideal for large-scale analytics, log processing, and workloads that require batch processing.
RDBMS
An RDBMS, exemplified by products such as Oracle, MySQL and PostgreSQL, is a database management system based on the relational model. It organizes data into tables of rows and columns and uses ACID properties to keep the data consistent.
An RDBMS excels at organizing data through predefined schemas, making it well suited for transactional workloads and applications that require complex, interactive queries. Structured Query Language (SQL), together with well-defined data types, provides a uniform way of defining, querying and manipulating this information.
The main differences are:
Data structures
 Hadoop is optimized for processing and analyzing unstructured and semi-structured data,
including text, images and log files. Its schema-on-read approach allows flexibility in
handling a variety of data types.
 On the other hand, RDBMS are optimized for structured data with static schemas. It relies
on a predefined table structure and incorporates data normalization to reduce redundancy
and ensure accuracy.
Scalability
 Hadoop exhibits horizontal scalability, which means that as the amount of data increases, it
can grow by adding more nodes to the cluster. This distributed system allows for easy
expansion without compromising performance.
 RDBMSs tend to scale vertically, requiring new hardware resources to enable increasing
workloads. Although vertical scaling has its limitations, RDBMS can still handle significant
data types through effective indexing and query optimization.
Processing paradigm
 Hadoop uses a batch processing paradigm, where data is processed in bulk across distributed nodes. This approach is best suited for applications that require extensive data processing, such as ETL workloads and data-intensive analysis.
 An RDBMS supports online transaction processing (OLTP) and online analytical processing (OLAP). OLTP focuses on transactional workloads characterized by frequent, short-lived transactions, while OLAP facilitates complex queries and analysis of historical data.
Cost Considerations
 Hadoop, being open source, offers a cost-effective solution for dealing with large volumes of data processing and storage. Organizations can use commodity hardware and open-source software components to create scalable Hadoop clusters at a fraction of the cost of a proprietary solution.



 RDBMS solutions, while providing robust features and support, often come with licensing
fees and hardware requirements that can significantly impact the overall cost of ownership,
especially for enterprise users.
Differences between Hadoop and RDBMS
Aspect                Hadoop                                            RDBMS
Data Structure        Handles structured and unstructured data          Primarily structured data
Scalability           Horizontal scaling: add commodity hardware        Vertical scaling: enhance a single server
Processing Paradigm   Batch processing using MapReduce or Spark         Interactive querying using SQL
Use Cases             Big data analytics, log processing, data lakes    Transactional applications, relational data
Cost                  Open-source software, commodity hardware          Licensing fees, hardware upgrades

CASE STUDY: BIG DATA IN A HEALTH RESEARCH SYSTEM APR/MAY 2024

Assume the block size is 128 MB and the cluster has 10 GB of raw capacity (so roughly 80 available blocks). Suppose three small files are created that together take 128 MB on disk, with a replication factor of 3 (so the space they consume includes block files, checksums and replicas), occupying 3 HDFS blocks. If another small file is added to HDFS, does HDFS use the count of used blocks or the actual disk usage to calculate the available capacity?

80 blocks - 3 blocks = 77 available blocks, or should the remaining capacity be computed from the actual disk usage? HDFS does not pre-allocate a full block for a small file: a block only occupies as much disk space as the data it holds (plus a small amount of checksum metadata), so the available capacity is reported from actual disk usage and replication, not from a count of whole blocks.

 Healthcare big data is more complex than big data arising from other critical domains, because a wide variety of data sources and methods are used in conventional hospital settings and in healthcare administration (e-Health). To achieve its essential objective, which is to improve the patient experience while delivering reliable care within financial constraints and government regulations, healthcare big data should be analysed to determine the level of satisfaction achieved.
 A large portion of this big data is unstructured. The major steps of big data management in the healthcare industry are data acquisition, data storage, data handling, data analysis and data visualization. A huge amount of data is generated daily by medical organizations, which collectively involves patients, healthcare centres, medical specialists and, of course, the diseases themselves. The data is enormous and provides insight for future predictions, which may prevent many serious medical cases from occurring. However, without big data analytics techniques and a Hadoop cluster, this data remains useless.
 Healthcare has increased its overall value by adopting big data techniques to analyse and understand its data from various sources.
 Medical data sets consist of huge volumes of data that are typically generated from diverse sources, for example physicians' case notes, hospital admission notes, discharge summaries, pharmacies, insurance companies, medical imaging, laboratories, sensor-based devices, genomics, social media and articles in medical journals. Healthcare data is, however, extremely complex and hard to manage. This is due to the astronomical growth of healthcare data, the rate at which this data is generated, and the diversity of the data produced.
Two Marks Questions with Answers
Q.1 Why do we need Hadoop streaming?
Ans. It helps in real-time data analysis, which is much faster using MapReduce programming running on a multi-node cluster. Different technologies such as Spark and Kafka also help with real-time streaming alongside Hadoop.
Q.2 What is the Hadoop Distributed file system?
Ans. The Hadoop Distributed File System (HDFS) is designed to store very large datasets reliably and to stream those datasets at high bandwidth to user applications. It stores file system metadata and application data separately. The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas.
Q.3 What is data locality optimization?
Ans. To run the map task on a node where the input data resides in HDFS. This is called data
locality optimization.
Q.4Why do map tasks write their output to the local disk, not to HDFS?
Ans.: Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So, storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.



Q.5 Why is a block in HDFS so large ?
Ans.: HDFS blocks are large compared to disk blocks and the reason is to minimize the cost of
seeks. By making a block large enough, the time to transfer the data from the disk can be made to
be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large
file made of multiple blocks operates at the disk transfer rate.
Q.6 How HDFS services support big data ?
Ans.: Five core elements of big data organized by HDFS services :
 Velocity - How fast data is generated, collated and analyzed.
 Volume - The amount of data generated.
 Variety - The type of data, this can be structured, unstructured, etc.
 Veracity - The quality and accuracy of the data.
 Value - How you can use this data to bring insight into your business processes.
Q.7 What if writable were not there in Hadoop ?
Ans.: Serialization is important in Hadoop because it enables easy transfer of data. If Writable were not present in Hadoop, then Java serialization would be used, which increases the data overhead on the network.
Q.8 Define serialization.
Ans.: Serialization is the process of converting object data into byte stream data for transmission
over a network across different nodes in a cluster or for persistent data storage.
Q.9 What is writable? Explain its Importance in Hadoop.
Ans. Writable is an interface in Hadoop. Writable in Hadoop acts as a wrapper class to almost all
the primitive data type of Java. That is how int of java has become IntWritable in Hadoop and
String of Java has become Text in Hadoop. Writables are used for creating serialized data types in
Hadoop. So, let us start by understanding what are data type, interface and serialization.
Q.10 What happens if a client detects an error when reading a block in Hadoop ?
Ans. If a client detects an error when reading a block :
 It reports the bad block and the DataNode it was trying to read from to the NameNode before throwing a ChecksumException.
 The namenode marks the block replica as corrupt, so it does not direct clients to it, or
try to copy this replica to another datanode.
 It then schedules a copy of the block to be replicated on another datanode, so its
replication factor is back at the expected level.
 Once this has happened, the corrupt replica is deleted.



Q.11 What is MapFile ?
Ans. A MapFile is a sorted sequence file with an index to permit lookups by key. A MapFile can be thought of as a persistent form of java.util.Map which is able to grow beyond the size of a map that is kept in memory.
Q.12 What are Hadoop pipes ?
Ans. Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
Q13.Why is ensuring data integrity crucial in HDFS? NOV/DEC 2023
Ans.
 HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is not an issue.
 All data that enters into the system is verified by the datanodes before being forwarded for
storage or further processing. Data sent to the datanode pipeline is verified through
checksums and any corruption found is immediately notified to the client with
ChecksumException
Q.14. In the context of Hadoop, what is the purpose of Hadoop pipes? NOV/DEC 2023
Ans. With Hadoop pipes, we can implement applications that require higher performance in
numerical calculations using C++ in MapReduce. The pipes utility works by establishing a
persistent socket connection on a port with the Java pipes task on one end, and the external C++
process at the other.
Q.15. List the applications where HDFS does not work well. APR/MAY 2024
Ans. Applications that need low-latency access to data will not work well with HDFS. Some examples are video conferencing apps (Zoom, Skype), online games (Fortnite, League of Legends), YouTube Live, Twitter, etc.
Q.16.Mention the necessity of serialization in Hadoop and present the default serialization
framework supported by Hadoop. APR/MAY 2024
Ans. Hadoop uses its own serialization format called Writable. It is written in Java and is fast as
well as compact. The other serialization framework supported by Hadoop is Avro. The Writable



interface defines two methods: One for writing its state to a DataOutput binary stream and one for
reading its state from a DataInput binary stream.
UNIT-IV
QUESTION BANK
1. What is the data format of Hadoop?
2. Discuss data flow in MapReduce programming model
3. What is Hadoop streaming? Explain in details.
4. Briefly explain about Hadoop pipes
5. Briefly explain about design and architecture of HDFS APR/MAY 2024
6. Write short note on HDFS block
7. Briefly explain about java interface in Hadoop filesystem.
8. Explain in detail the data flow and heartbeat mechanism of HDFS - case study of a
health researcher APR/MAY 2024
9. Explain in details about data integrity and Hadoop local file system in HDFS.
10. Compare and contrast Hadoop and RDBMS for performing large-scale batch
analysis. APR/MAY 2024
11. Difference between Compression and serialization
12. Explain generic methods and classes in java. Give a procedure to stop java
serialization. Nov/Dec-2023
13. Write short note on Avro, file-based structure and Cassandra Hadoop integration
14. Elaborate the impact of seamless Hadoop integration on enhancing data processing
and analytics. Nov/Dec-2023.

