Big Data Question and Answer
Hadoop – Architecture
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
store and process big data. Hadoop works on the MapReduce programming model, which was
introduced by Google. Today many big-brand companies use Hadoop in their organizations to
deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly
consists of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is a programming model, built on top of the YARN framework, whose major
feature is performing distributed processing in parallel across a Hadoop cluster; this
parallelism is what makes Hadoop so fast, because when you are dealing with Big Data,
serial processing is no longer practical. MapReduce has two main tasks, divided phase-wise:
in the first phase Map is utilized, and in the second phase Reduce is utilized. A job passes
through the following stages:
RecordReader: reads an input split and converts it into the key-value pairs that are fed to the mapper.
Map: a user-defined function that processes each input key-value pair and emits intermediate key-value pairs.
Combiner: an optional local reducer that aggregates a mapper's output on the same node to reduce the data sent over the network.
Partitioner: decides which reducer receives each intermediate key-value pair, typically by hashing the key.
Shuffle and Sort: transfers the intermediate data to the reducers and sorts it by key.
Reduce: a user-defined function that aggregates all the values for a key and emits the final output pairs.
OutputFormat: writes the final key-value pairs to the output location.
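To make these phases concrete, here is a minimal word-count sketch in Java (not taken from the original answer; class names and paths are illustrative). TokenizerMapper implements the Map phase, IntSumReducer serves as both the Combiner and the Reduce phase, and the framework supplies the RecordReader, Partitioner, Shuffle and Sort, and OutputFormat:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional Combiner phase
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}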
2. HDFS
HDFS (Hadoop Distributed File System) is utilized for storage. It is mainly designed for
working on commodity hardware devices (inexpensive devices) and follows a distributed file
system design. HDFS is designed in such a way that it prefers storing data in a few large
blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices
present in the Hadoop cluster. Data storage nodes in HDFS:
NameNode(Master)
DataNode(Slave)
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A file is divided into
multiple blocks of 128 MB each by default, and you can also change the block size manually.
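As a hedged illustration of changing the block size, the standard property in current releases is dfs.blocksize (older releases used dfs.block.size); the 256 MB value below is an arbitrary example:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Override the default 128 MB block size; the value is in bytes.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB (example value)
  }
}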
Replication: HDFS keeps multiple copies of every block (3 by default, controlled by the replication factor) on different DataNodes so that data survives node failures.
Rack Awareness: the NameNode uses its knowledge of the cluster's rack topology to place replicas on different racks, so a single rack failure does not lose every copy of a block.
HDFS Architecture
ArrayWritable: a Writable wrapper for an array whose elements are all instances of the same Writable class.
TwoDArrayWritable: a Writable wrapper for a two-dimensional array of Writable elements.
NullWritable: a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable, so in MapReduce a key or a value can be declared as NullWritable when we don't need to use that field.
ObjectWritable: a general-purpose generic object wrapper which can store any object such as Java primitives, String, Enum, Writable, null, or arrays.
Text: can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike Java's String data type, Text is mutable in Hadoop.
BytesWritable: a wrapper for an array of binary data.
GenericWritable: similar to ObjectWritable but supports only a few types. Users need to subclass GenericWritable and specify the types to support.
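A minimal sketch (all names are illustrative) showing how a few of these wrappers behave:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    // Text is mutable: the same object can be reset with new contents,
    // which lets MapReduce code reuse objects instead of reallocating.
    Text word = new Text("hello");
    word.set("world");

    IntWritable count = new IntWritable(42);
    System.out.println(word + " -> " + count.get());

    // NullWritable is a singleton that carries no data; use it when a
    // key or a value is not needed.
    NullWritable nothing = NullWritable.get();
    System.out.println(nothing);
  }
}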
13. Give a detailed explanation about Hadoop Streaming and Hadoop Pipes
Hadoop Streaming
It is a utility that comes with the Hadoop distribution and allows developers or programmers
to write Map-Reduce programs in different programming languages such as Ruby, Perl,
Python, C++, etc. We can use any language that can read from standard input (STDIN), like
keyboard input, and write using standard output (STDOUT). We all know the Hadoop
framework is completely written in Java, but programs for Hadoop do not necessarily need to
be coded in the Java programming language. The Hadoop Streaming feature has been
available since Hadoop version 0.14.1.
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce: an adapter layer
provided by Apache Hadoop that allows C++ application code to be used in MapReduce
programs. Applications that require high numerical performance may see better throughput if
written in C++ and used through Pipes. Unlike Hadoop Streaming, which uses standard I/O to
communicate with the map and reduce code, Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the C++ map or reduce function.
SequenceFile
1. SequenceFile files are flat files designed by Hadoop to store binary forms of <key, value> pairs.
2. A SequenceFile can serve as a container: small files packaged into a SequenceFile can be stored and processed efficiently.
3. SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class provides append functionality.
4. The key and value in a SequenceFile can be any built-in Writable type or a custom Writable type.
SequenceFile Compression
There are three types:
A. No compression: If compression is not enabled (the default setting), then each record
consists of its record length (number of bytes), the key length, the key, and the value. The
length field is four bytes.
B. Record compression: The record compression format is basically the same as the
uncompressed format; the difference is that the value bytes are compressed with the codec
defined in the header. Note that the key is not compressed.
C. Block compression: Block compression compresses multiple records at once, so it is
more compact than record compression and is generally preferred. Records are added to a
block until the number of bytes reaches a minimum size, which is defined by the
io.seqfile.compress.blocksize property; the default value is 1,000,000 bytes. The format is
record count, key length, key, value length, value.
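A minimal sketch of selecting a compression type when creating a writer, assuming the classic FileSystem-based createWriter overload and an illustrative path and key/value types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionChoice {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/blockcompressed.seq"); // illustrative path
    // Pass CompressionType.NONE, RECORD, or BLOCK to select the format.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK);
    writer.close();
  }
}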
Read/Write SequenceFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append key-value pairs to the file
6) Close the stream
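A minimal Java sketch of this write flow, assuming an illustrative path and Text/IntWritable as the key/value types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    Path path = new Path("/tmp/demo.seq");             // 3) output path (illustrative)
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(              // 4) get a SequenceFile.Writer
          fs, conf, path, Text.class, IntWritable.class);
      for (int i = 0; i < 100; i++) {
        writer.append(new Text("key" + i), new IntWritable(i)); // 5) append records
      }
    } finally {
      if (writer != null) {
        writer.close();                                // 6) close the stream
      }
    }
  }
}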
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a SequenceFile.Reader for reading
5) Get the key class and value class, and read the records
6) Close the stream
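A matching read sketch, using the same illustrative path; the key and value objects are instantiated from the classes recorded in the file header:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    Path path = new Path("/tmp/demo.seq");             // 3) input path (illustrative)
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf); // 4) open a Reader
      // 5) instantiate key/value from the classes stored in the file
      Writable key = (Writable) ReflectionUtils.newInstance(
          reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(
          reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      if (reader != null) {
        reader.close();                                // 6) close the stream
      }
    }
  }
}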
Read/Write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append key-value pairs to the file
6) Close the stream
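A minimal sketch of the MapFile write flow (illustrative path and types; note that MapFile.Writer takes the output path as a String directory name and requires keys to be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    String dir = "/tmp/demo.map";                      // 3) output directory (illustrative)
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(                     // 4) create a MapFile.Writer
          conf, fs, dir, Text.class, IntWritable.class);
      for (int i = 0; i < 100; i++) {
        // 5) append entries; zero-padding keeps the keys in sorted order
        writer.append(new Text(String.format("key%03d", i)), new IntWritable(i));
      }
    } finally {
      if (writer != null) {
        writer.close();                                // 6) close the stream
      }
    }
  }
}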
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a MapFile.Reader for reading
5) Get the key class and value class, and read the records
6) Close the stream
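A matching read sketch for the MapFile, using the same illustrative directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class MapFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // 1) create a Configuration
    FileSystem fs = FileSystem.get(conf);              // 2) get the FileSystem
    String dir = "/tmp/demo.map";                      // 3) input directory (illustrative)
    MapFile.Reader reader = null;
    try {
      reader = new MapFile.Reader(fs, dir, conf);      // 4) open a MapFile.Reader
      // 5) instantiate key/value from the classes stored in the file
      WritableComparable key = (WritableComparable) ReflectionUtils.newInstance(
          reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(
          reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      if (reader != null) {
        reader.close();                                // 6) close the stream
      }
    }
  }
}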