BigData Objective
1. IBM and ________ have announced a major initiative to use Hadoop to support university
courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
Explanation: Google and IBM announced a university initiative to address Internet-scale computing challenges.
2. Point out the correct statement :
a) Hadoop is an ideal environment for extracting and transforming small volumes of data
b) Hadoop stores data in HDFS and supports data compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve graph and machine learning problems
d) None of the mentioned
Explanation:Data compression can be achieved using compression algorithms like bzip2, gzip,
LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.
3. What license is Hadoop distributed under ?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
Explanation:Hadoop is Open Source, released under Apache 2 license.
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional
Hadoop cluster using a live CD.
b) OpenSolaris
c) GNU
d) Linux
Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.
5. Which of the following genres does Hadoop produce ?
a) Distributed file system
c) Java Message Service
d) Relational Database Management System
Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data
sets reliably, and to stream those data sets at high bandwidth to user applications.
6. What was Hadoop written in ?
a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)
Explanation: The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command-line utilities written as shell scripts.
7. Which of the following platforms does Hadoop run on ?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
Explanation: Hadoop is written in Java and is cross-platform.
8. Hadoop achieves reliability by replicating the data across multiple hosts, and hence does not
require ________ storage on hosts.
b) Standard RAID levels
c) ZFS
d) Operating system
Explanation:With the default replication value, 3, data is stored on three nodes: two on the same
rack, and one on a different rack.
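The placement described in this explanation can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function name place_replicas, the topology dictionary, and the node names are all hypothetical, and the sketch assumes that the writer's node holds the first replica and that some other rack has at least two nodes.

```python
import random

def place_replicas(topology, writer_node, seed=0):
    """Pick three nodes for one block: one on the writer's rack, two on another rack.

    topology maps a rack name to its list of node names (hypothetical layout).
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    # Assumption: at least one other rack has two or more nodes.
    candidates = sorted(r for r in topology if r != local_rack and len(topology[r]) >= 2)
    if not candidates:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(sorted(topology[remote_rack]), 2)
    return [writer_node, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
replicas = place_replicas(topology, "n1")
```

With this layout, losing either rack still leaves at least one copy of the block readable.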
9. Above the file systems comes the ________ engine, which consists of one Job Tracker, to
which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook
Explanation: The MapReduce engine is used to distribute work around a cluster.
10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and
matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence
Explanation: The Apache Mahout project's goal is to build a scalable machine learning tool.
1. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
Explanation: Adding security to Hadoop is challenging because not all interactions follow
the classic client-server pattern.
2. Point out the correct statement :
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided into lines or records
b) Prism
c) Project Big
d) Project Data
Explanation: Prism automatically replicates and moves data wherever it's needed across a vast
network of computing facilities.
Hadoop Questions and Answers Hadoop Ecosystem
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop Ecosystem.
1. ________ is a platform for constructing data flows for extract, transform, and load (ETL)
processing and analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
Explanation:Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs.
2. Point out the correct statement :
a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to
querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
Explanation:Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible
file systems.
3. _________ hides the limitations of Java behind a powerful and concise Clojure API for
a) Scalding
b) HCatalog
c) Cascalog
b) Drill
c) Oozie
d) None of the mentioned
Explanation: MapReduce provides a flexible and scalable foundation for analytics, from
traditional reporting to leading-edge machine learning algorithms.
8. The Pig Latin scripting language is not only a higher-level data flow language but also has
operators similar to :
a) SQL
c) XML
d) All of the mentioned
Explanation:Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL
and the low-level procedural style of MapReduce.
9. _______ jobs are optimized for scalability but not latency.
a) MapReduce
b) Drill
c) Oozie
d) Hive
Explanation: Hive queries are translated to MapReduce jobs to exploit the scalability of MapReduce.
10. ______ is a framework for performing remote procedure calls and data serialization.
a) Drill
b) BigTop
c) Avro
d) Chukwa
Explanation:In the context of Hadoop, Avro can be used to pass data from one program or
language to another.
Explanation: Mapper implementations override the JobConfigurable.configure method to initialize themselves.
2. Point out the correct statement :
a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by
the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format
d) All of the mentioned
Answer: d
Explanation:Reporters can be used to set application-level status messages and update Counters.
3. Input to the _______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.
4. The right number of reduces seems to be :
a) 0.90
b) 0.80
c) 0.36
d) 0.95
Explanation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * maximum number of reduce tasks per node).
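In the Hadoop MapReduce tutorial of that era, the 0.95 and 1.75 factors are multipliers applied to the number of nodes times the maximum number of reduce tasks each node can run. A minimal sketch of that rule of thumb follows; the function name suggested_reduces and its parameters are made up for illustration and are not part of any Hadoop API.

```python
def suggested_reduces(nodes, max_reduce_tasks_per_node, percent=95):
    """Return percent/100 * nodes * max_reduce_tasks_per_node, rounded down."""
    if percent not in (95, 175):
        raise ValueError("the rule of thumb uses 0.95 or 1.75")
    # Integer arithmetic avoids floating-point rounding surprises.
    return (percent * nodes * max_reduce_tasks_per_node) // 100

single_wave = suggested_reduces(10, 2)       # 0.95 * 10 * 2
two_waves = suggested_reduces(10, 2, 175)    # 1.75 * 10 * 2
```

With 0.95 all reduces can launch at once as the maps finish; with 1.75 the faster nodes finish a first round and start a second wave, which balances load better.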
5. Point out the wrong statement :
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the same key)
Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop
framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Explanation:JobConf represents a MapReduce job configuration.
Hadoop Questions and Answers Scaling out in Hadoop
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Scaling out in Hadoop.
1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in
the nodes.
a) NoSQL
b) NewSQL
c) SQL
d) All of the mentioned
Explanation: NoSQL systems make the most sense whenever the application is based on data
with varying data types and the data can be stored in key-value notation.
2. Point out the correct statement :
a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
b) HDFS runs on a small cluster of commodity-class nodes
c) NewSQL is frequently the collection point for big data
d) None of the mentioned
Explanation: Together, Hadoop and a relational data warehouse can form a very effective data
warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record
values with schema applied on read based on:
a) HCatalog
b) Hive
c) HBase
d) All of the mentioned
Explanation:Other means of tagging the values also can be used.
4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional
Hadoop deployments
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
Explanation: Isilon solutions also provide enterprise data protection and security options,
including file system auditing and data-at-rest encryption, to address compliance requirements.
5. Point out the wrong statement :
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and
highly efficient storage platform
b) Isilon's native HDFS integration means you can avoid the need to invest in a separate Hadoop
c) NoSQL systems do provide high latency access and accommodate less concurrent users
d) None of the mentioned
Explanation: NoSQL systems do provide low latency access and accommodate many concurrent users.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to :
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned
Explanation: HDFS and NoSQL file systems focus almost exclusively on adding nodes to
increase performance (scale-out), but even they require node configuration with elements of scale-up.
7. Which is the most popular NoSQL database for scalable big data store with Hadoop ?
a) HBase
b) MongoDB
c) Cassandra
d) None of the mentioned
Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store that lets you
host very large tables (billions of rows multiplied by millions of columns) on clusters built
with commodity hardware.
8. The ___________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
9. HBase provides ___________ like capabilities on top of Hadoop and HDFS.
a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System.
10. _______ refers to incremental costs with no major impact on solution design, performance
and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
d) None of the mentioned
Explanation: Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part of a
cluster does not require additional network switches.
Hadoop Questions and Answers Hadoop Streaming
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop Streaming.
1. Streaming supports streaming command options as well as _________ command options.
a) generic
b) tool
c) library
d) task
Explanation: Place the generic options before the streaming options, otherwise the command will fail.
2. Point out the correct statement :
a) You can specify any executable as the mapper and/or the reducer
b) You cannot supply a Java class as the mapper and/or the reducer
c) The class you supply for the output format should return key/value pairs of Text class
d) All of the mentioned
Explanation: If you do not specify an input format class, the TextInputFormat is used as the default.
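Because any executable can act as the mapper or reducer, a streaming job is often written as a small script that reads lines from standard input and writes tab-separated key/value lines to standard output. The sketch below shows a word count in that style. The function names map_lines, reduce_lines, and main are illustrative choices, not part of Hadoop, and the reducer assumes its input arrives sorted by key, as streaming provides.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts of adjacent lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def main(argv, stdin, stdout):
    """Run as mapper or reducer depending on the first argument."""
    step = reduce_lines if argv[1:2] == ["reduce"] else map_lines
    for out in step(stdin):
        stdout.write(out + "\n")

mapped = list(map_lines(["big data big"]))
reduced = list(reduce_lines(sorted(mapped)))
```

Calling main(sys.argv, sys.stdin, sys.stdout) under an `if __name__ == "__main__":` guard would turn the sketch into a script that could be passed to the -mapper and -reducer options.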
3. Which of the following Hadoop streaming command option parameter is required ?
a) output directoryname
b) mapper executable
c) input directoryname
d) All of the mentioned
Explanation: The input location, the output location, and the mapper executable are all required parameters.
4. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Explanation: Environment variables are set using the -cmdenv option.
5. Point out the wrong statement :
a) Hadoop has a library package called Aggregate
b) Aggregate allows you to define a mapper plugin class that is expected to generate
aggregatable items for each input key/value pair of the mappers
c) To use Aggregate, simply specify -mapper aggregate
d) None of the mentioned
Explanation: To use Aggregate, simply specify -reducer aggregate.
6. The ________ option allows you to copy jars locally to the current working directory of tasks
and automatically unjar the files.
a) archives
b) files
c) task
d) None of the mentioned
Explanation: The -archives option is also a generic option.
7. ______________ class allows the Map/Reduce framework to partition the map outputs based
on certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
d) None of the mentioned
Explanation: The primary key is used for partitioning, and the combination of the primary and
secondary keys is used for sorting.
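The idea of partitioning on a key prefix while sorting on the full key can be sketched as follows. This is a simulation in plain Python, not the KeyFieldBasedPartitioner itself: the function names are hypothetical, crc32 is only a stand-in hash, and the dotted numeric keys are sample data chosen for the illustration.

```python
import zlib

def partition_for(key, num_partition_fields, num_reducers, sep="."):
    """Choose a reducer from the leading fields of the key only."""
    prefix = sep.join(key.split(sep)[:num_partition_fields])
    # crc32 is a stand-in hash for this sketch, not what Hadoop uses.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers

def partition_and_sort(keys, num_partition_fields, num_reducers, sep="."):
    """Group keys by partition, then sort each partition on the whole key."""
    partitions = {}
    for key in keys:
        p = partition_for(key, num_partition_fields, num_reducers, sep)
        partitions.setdefault(p, []).append(key)
    for bucket in partitions.values():
        bucket.sort(key=lambda k: [int(f) for f in k.split(sep)])
    return partitions

keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
result = partition_and_sort(keys, num_partition_fields=2, num_reducers=3)
```

Keys that share their first two fields always land in the same partition, and within a partition the remaining fields decide the order.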
8. Which of the following classes provides a subset of features provided by the Unix/GNU Sort ?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
d) All of the mentioned
Explanation: Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications.
9. Which of the following classes is provided by Aggregate package ?
a) Map
b) Reducer
c) Reduce
d) None of the mentioned
Explanation:Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as sum, max, min and so on over a
sequence of values.
10. Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ______ utility.
a) Copy
b) Cut
c) Paste
d) Move
Explanation: The map function defined in the class treats each input key/value pair as a list of fields.
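The cut-like behaviour can be pictured with a tiny Python function that splits a record into fields and keeps only the requested positions. This is a sketch of the idea, not the FieldSelectionMapReduce implementation; the name select_fields and its zero-based index convention are assumptions made for the example.

```python
def select_fields(line, indices, sep="\t", out_sep="\t"):
    """Keep only the fields at the given zero-based positions, like `cut -f`."""
    fields = line.split(sep)
    # Positions beyond the end of the record are skipped rather than raising.
    return out_sep.join(fields[i] for i in indices if i < len(fields))

picked = select_fields("alice\t30\tparis\tengineer", [0, 2])
```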
Hadoop Questions and Answers Introduction to HDFS
This set of Multiple Choice Questions & Answers (MCQs) focuses on Hadoop Filesystem
1. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
Explanation:All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.
2. Point out the correct statement :
a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance
d) None of the mentioned
Explanation:There can be any number of DataNodes in a Hadoop Cluster.
3. HDFS works in a __________ fashion.
a) master-worker
b) master-slave
c) worker/slave.
d) All of the mentioned
Explanation: The NameNode serves as the master and each DataNode serves as a worker/slave.
4. ________ NameNode is used when the Primary NameNode goes down.
a) Rack
b) Data
c) Secondary
d) None of the mentioned
Explanation: The Secondary NameNode periodically checkpoints the NameNode's metadata. It is not a hot standby, but its checkpoint can help recovery when the primary NameNode fails.
5. Point out the wrong statement :
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
Explanation: The NameNode, not the DataNode, is aware of the files to which the blocks belong.
6. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
Explanation: HDFS can be used for storing archive data since it is cheaper, as HDFS allows
storing the data on low-cost commodity hardware while ensuring a high degree of fault tolerance.
7. The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Explanation: Data is replicated across different DataNodes to ensure a high degree of fault tolerance.
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
Explanation: A DataNode stores data in the Hadoop file system. A functional filesystem has
more than one DataNode, with data replicated across them.
9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) HDFS Shell
b) FS Shell
c) DFS Shell
d) None of the mentioned
Explanation: The File System (FS) shell includes various shell-like commands that directly
interact with the Hadoop Distributed File System (HDFS).
10. HDFS is implemented in _____________ programming language.
a) C++
b) Java
c) Scala
d) None of the mentioned
Explanation:HDFS is implemented in Java and any computer which can run Java can host a
NameNode/DataNode on it.
Hadoop Questions and Answers Java Interface
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Java Interface.
1. In order to read any file in HDFS, instance of __________ is required.
a) filesystem
b) datastream
c) outstream
d) inputstream
Explanation: A FileSystem instance opens the file and returns an FSDataInputStream, which is used to read the data.
2. Point out the correct statement :
a) The framework groups Reducer inputs by keys
b) The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged
c) Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate
keys are grouped, these can be used in conjunction to simulate secondary sort on values
d) All of the mentioned
Explanation:If equivalence rules for keys while grouping the intermediates are different from
those for grouping keys before reduction, then one may specify a Comparator.
3. ______________ provides a method to copy bytes from an input stream to any other stream in Hadoop.
a) IOUtils
b) Utils
c) IUtils
d) All of the mentioned
Explanation: IOUtils is a utility class in Hadoop's Java API that provides static methods such as copyBytes.
4. _____________ is used to read data from byte buffers.
a) write()
b) read()
c) readwrite()
d) All of the mentioned
Explanation: The readFully method can also be used instead of the read method.
5. Point out the wrong statement :
a) The framework calls reduce method for each <key, (list of values)> pair in the grouped inputs
b) The output of the Reducer is re-sorted
c) reduce method reduces values for a given key
d) None of the mentioned
Explanation: The output of the Reducer is not re-sorted.
6. Interface ____________ reduces a set of intermediate values which share a key to a smaller
set of values.
a) Mapper
b) Reducer
c) Writable
d) Readable
Explanation:Reducer implementations can access the JobConf for the job.
Explanation: The reporter parameter provides a facility to report progress.
Hadoop Questions and Answers Data Flow
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Data Flow.
1. ________ is a programming model designed for processing large volumes of data in parallel
by dividing the work into a set of independent tasks.
a) Hive
b) MapReduce
c) Pig
d) Lucene
Explanation:MapReduce is the heart of hadoop.
2. Point out the correct statement :
a) Data locality means movement of algorithm to the data instead of data to algorithm
b) When the processing is done on the data algorithm is moved across the Action Nodes rather
than data to the algorithm
c) Moving Computation is more expensive than Moving Data
d) None of the mentioned
Explanation:Data flow framework possesses the feature of data locality.
3. The daemons associated with the MapReduce phase are ________ and task-trackers.
a) job-tracker
b) map-tracker
c) reduce-tracker
d) All of the mentioned
Explanation:Map-Reduce jobs are submitted on job-tracker.
4. The JobTracker pushes work out to available _______ nodes in the cluster, striving to keep the
work as close to the data as possible
a) DataNodes
b) TaskTracker
c) ActionNodes
8. The default InputFormat is __________ which treats each line of input as a new value, and the
associated key is the byte offset.
a) TextFormat
b) TextInputFormat
c) InputFormat
d) All of the mentioned
Explanation:A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs.
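The key/value pairs a line-oriented record reader hands to the map task can be imitated in Python as below. This is only a sketch of the record shape (byte offset as key, line text as value); the name text_records is hypothetical, and split boundaries, which the real reader must handle, are ignored.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The value excludes the line terminator; the key is where the line starts.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

records = list(text_records(b"hello\nbig data\n"))
```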
9. __________ controls the partitioning of the keys of the intermediate map-outputs.
a) Collector
b) Partitioner
c) InputFormat
d) None of the mentioned
Explanation: The output of the mapper is sent to the partitioner.
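Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below mimics that in Python under the assumption that keys are Java String values, so it reimplements String.hashCode for characters in the Basic Multilingual Plane; Hadoop's Text keys hash their bytes differently, and the function names here are illustrative only.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, as Java int overflow does
    return h - (1 << 32) if h >= (1 << 31) else h

def hash_partition(key, num_reduce_tasks):
    """Mask off the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

partition = hash_partition("ab", 4)
```

Masking the sign bit keeps the result non-negative even when the hash code is negative, so every key maps to a valid reducer index.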
10. Output of the mapper is first written on the local disk for sorting and _________ process.
a) shuffling
b) secondary sorting
c) forking
d) reducing
Explanation:All values corresponding to the same key will go the same reducer.
Hadoop Questions and Answers Hadoop Archives
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop Archives.
1. _________ is the name of the archive you would like to create.
a) archive
b) archiveName
c) Name
d) None of the mentioned
Explanation: The name should have a *.har extension.
2. Point out the correct statement :
a) A Hadoop archive maps to a file system directory
b) Hadoop archives are special format archives
c) A Hadoop archive always has a *.har extension
d) All of the mentioned
Explanation:A Hadoop archive directory contains metadata (in the form of _index and
_masterindex) and data (part-*) files.
3. Using Hadoop Archives in __________ is as easy as specifying a different input filesystem
than the default file system.
a) Hive
b) Pig
c) MapReduce
d) All of the mentioned
Explanation: Because Hadoop Archives is exposed as a file system, MapReduce can use all the
logical input files in Hadoop Archives as input.
4. The __________ guarantees that excess resources taken from a queue will be restored to it
within N minutes of its need for them.
a) capacitor
b) scheduler
c) datanode
d) None of the mentioned
Explanation:Free resources can be allocated to any queue beyond its guaranteed capacity.
5. Point out the wrong statement :
a) The Hadoop archive exposes itself as a file system layer
b) Hadoop archives are immutable
c) Archive renames, deletes and creates return an error
d) None of the mentioned
Explanation:All the fs shell commands in the archives work but with a different URI.
6. _________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share
large clusters.
a) Flow Scheduler
b) Data Scheduler
c) Capacity Scheduler
d) None of the mentioned
Explanation: The Capacity Scheduler supports multiple queues, where a job is submitted to a queue.
7. Which of the following parameter describes destination directory which would contain the
archive ?
a) -archiveName
b) <source>
c) <destination>
d) None of the mentioned
Explanation: -archiveName is the name of the archive to be created.
8. _________ identifies filesystem pathnames which work as usual with regular expressions.
a) -archiveName <name>
b) <source>
c) <destination>
d) None of the mentioned
Explanation: <destination> identifies the destination directory which would contain the archive.
9. __________ is the parent argument used to specify the relative path to which the files should
be archived to
a) -archiveName <name>
b) -p <parent_path>
c) <destination>
d) <source>
Explanation: The hadoop archive command creates a Hadoop archive, a file that contains other files.
10. Which of the following is a valid syntax for hadoop archive ?
a) hadooparchive [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
b) hadooparch [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
c) hadoop [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
d) None of the mentioned
Explanation: The Hadoop archiving tool can be invoked using the following command format:
hadoop archive -archiveName name -p <parent> <src>* <dest>
Explanation: SequenceFile has 3 available formats: an uncompressed format, a record-compressed
format, and a block-compressed format.
5. Point out the wrong statement :
a) The data file contains all the key, value records but key N + 1 must be greater than or equal to
the key N
b) Sequence file is a kind of hadoop file based data structure
c) Map file type is splittable as it contains a sync point after several records
d) None of the mentioned
Explanation: A MapFile is also a Hadoop file-based data structure; it differs from a
sequence file in that its records are ordered by key.
6. Which of the following format is more compression-aggressive ?
a) Partition Compressed
b) Record Compressed
c) Block-Compressed
d) Uncompressed
Explanation:SequenceFile key-value list can be just a Text/Text pair, and is written to the file
during the initialization that happens in the SequenceFile.
7. The __________ is a directory that contains two SequenceFile.
a) ReduceFile
b) MapperFile
c) MapFile
d) None of the mentioned
Explanation: The two SequenceFiles are the data file (/data) and the index file (/index).
8. The ______ file is populated with the key and a LongWritable that contains the starting byte
position of the record.
a) Array
b) Index
c) Immutable
a) 128k
b) 256k
c) 24k
d) 36k
Explanation:LZO was designed with speed in mind: it decompresses about twice as fast as gzip,
meaning it's fast enough to keep up with hard drive read speeds.
Hadoop Questions and Answers Data Integrity
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Data Integrity.
1. The HDFS client software implements __________ checking on the contents of HDFS files.
a) metastore
b) parity
c) checksum
d) None of the mentioned
Explanation:When a client creates an HDFS file, it computes a checksum of each block of the
file and stores these checksums in a separate hidden file in the same HDFS namespace.
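The checksum idea can be illustrated with a short Python sketch that computes one CRC32 per fixed-size block when data is written and compares them again when it is read. This is not the HDFS client code: the 512-byte block size, the function names, and the in-memory list standing in for the hidden checksum file are all assumptions made for the example.

```python
import zlib

def block_checksums(data, block_size=512):
    """Compute one CRC32 per fixed-size block of the data."""
    return [zlib.crc32(data[i:i + block_size]) for i in range(0, len(data), block_size)]

def corrupt_blocks(data, expected, block_size=512):
    """Return the indices of blocks whose checksum no longer matches."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

original = bytes(range(256)) * 6          # 1536 bytes, i.e. three 512-byte blocks
stored = block_checksums(original)        # kept separately from the data
damaged = bytearray(original)
damaged[600] ^= 0xFF                      # flip one byte inside the second block
bad = corrupt_blocks(bytes(damaged), stored)
```

A reader that finds a mismatching block can then fetch that block from another replica instead of returning corrupt data.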
2. Point out the correct statement :
a) The HDFS architecture is compatible with data rebalancing schemes
b) Datablocks support storing a copy of data at a particular instant of time.
c) HDFS currently support snapshots.
d) None of the mentioned
Explanation:A scheme might automatically move data from one DataNode to another if the free
space on a DataNode falls below a certain threshold.
3. The ___________ machine is a single point of failure for an HDFS cluster.
a) DataNode
b) NameNode
c) ActionNode
d) All of the mentioned
Explanation:If the NameNode machine fails, manual intervention is necessary. Currently,
automatic restart and failover of the NameNode software to another machine is not supported.
4. The ____________ and the EditLog are central data structures of HDFS.
a) DsImage
b) FsImage
c) FsImages
d) All of the mentioned
Explanation: A corruption of these files can cause the HDFS instance to be non-functional.
5. Point out the wrong statement :
a) HDFS is designed to support small files only
b) Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously
c) NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog
d) None of the mentioned
Explanation:HDFS is designed to support very large files.
6. __________ support storing a copy of data at a particular instant of time.
a) Data Image
b) Datanots
c) Snapshots
d) All of the mentioned
Explanation:One usage of the snapshot feature may be to roll back a corrupted HDFS instance to
a previously known good point in time.
7. Automatic restart and ____________ of the NameNode software to another machine is not supported.
a) failover
b) end
c) scalability
c) Kafka
d) Avro
Explanation: Apache Avro doesn't require proxy objects or code generation.
2. Point out the correct statement :
a) Apache Avro is a framework that allows you to serialize data in a format that has a schema
built in
b) The serialized data is in a compact binary format that doesn't require proxy objects or code generation
c) Including schemas with the Avro messages allows any application to deserialize the data
d) All of the mentioned
Explanation:Instead of using generated proxy libraries and strong typing, Avro relies heavily on
the schemas that are sent along with the serialized data.
3. Avro schemas describe the format of the message and are defined using :
b) XML
c) JS
d) All of the mentioned
Explanation: The JSON schema content is put into a file.
4. The ____________ is an iterator which reads through the file and returns objects using the
next() method.
a) DatReader
b) DatumReader
c) DatumRead
d) None of the mentioned
Explanation:DatumReader reads the content through the DataFileReader implementation.
5. Point out the wrong statement :
a) Java code is used to deserialize the contents of the file into objects
b) Avro allows you to use complex data structures within Hadoop MapReduce jobs
c) The m2e plug-in automatically downloads the newly added JAR files and their dependencies
d) None of the mentioned
Explanation:A unit test is useful because you can make assertions to verify that the values of the
deserialized object are the same as the original values.
6. The ____________ class extends and implements several Hadoop-supplied interfaces.
a) AvroReducer
b) Mapper
c) AvroMapper
d) None of the mentioned
Explanation:AvroMapper is used to provide the ability to collect or map data.
7. ____________ class accepts the values that the ModelCountMapper object has collected.
a) AvroReducer
b) Mapper
c) AvroMapper
d) None of the mentioned
Explanation:AvroReducer summarizes them by looping through the values.
8. The ________ method in the ModelCountReducer class reduces the values the mapper
collects into a derived value
a) count
b) add
c) reduce
d) All of the mentioned
Explanation: In some cases, it can be a simple sum of the values.
9. Which of the following works well with Avro ?
a) Lucene
b) Kafka
c) MapReduce
b) Static number
c) UID
d) None of the mentioned
Explanation:Avro resolves possible conflicts through the name of the field.
8. Avro is said to be the future _______ layer of Hadoop.
a) RMC
b) RPC
c) RDC
d) All of the mentioned
Explanation:When Avro is used in RPC, the client and server exchange schemas in the
connection handshake.
9. When using reflection to automatically build our schemas without code generation, we need to
configure Avro using :
a) AvroJob.Reflect(jConf);
b) AvroJob.setReflect(jConf);
c) Job.setReflect(jConf);
d) None of the mentioned
Explanation:For strongly typed languages like Java, it also provides a generation code layer,
including RPC services code generation.
10. We can declare the schema of our data either in a ______ file.
b) XML
c) SQL
d) R
Explanation:Schema can be declared using an IDL or simply through Java beans by using
reflection-based schema building.
Explanation:Each block of array consists of a long count value, followed by that many array
items. A block with count zero indicates the end of the array. Each item is encoded per the array's
item schema.
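The block layout can be made concrete for an array of longs. In Avro's binary encoding a long is written as a zig-zag value in variable-length form, so the sketch below encodes the count, then the items, then a zero count to end the array. The function names are illustrative, and the sketch writes a single block only; it ignores the optional negative-count form that carries a block byte size.

```python
def encode_long(n):
    """Encode a 64-bit signed integer as zig-zag plus variable-length bytes."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_long_array(items):
    """Encode one block (count, then items) followed by the zero end marker."""
    out = bytearray()
    if items:
        out += encode_long(len(items))
        for item in items:
            out += encode_long(item)
    out += encode_long(0)  # a block with count zero ends the array
    return bytes(out)

encoded = encode_long_array([3, 27]).hex()
```

For the items 3 and 27 this yields the bytes 04 06 36 00: the count 2, the two items, and the terminating zero.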
5. Point out the wrong statement :
a) Record, enums and fixed are named types
b) Unions may immediately contain other unions
c) A namespace is a dot-separated sequence of such names
d) All of the mentioned
Explanation:Unions may not immediately contain other unions.
6. ________ instances are encoded using the number of bytes declared in the schema.
a) Fixed
b) Enum
c) Unions
d) Maps
Explanation: Except for unions, the JSON encoding is the same as is used to encode field default values.
7. ________ permits data written by one system to be efficiently sorted by another system.
a) Complex Data type
b) Order
c) Sort Order
d) All of the mentioned
Explanation: Avro binary-encoded data can be efficiently ordered without deserializing it to objects.
8. _____________ are used between blocks to permit efficient splitting of files for MapReduce processing.
a) Codec
b) Data Marker
c) Synchronization markers
d) Scala
Explanation: The data storage will be in the form of regions (tables). These regions will be split
up and stored in region servers.
Hadoop Questions and Answers Mapreduce Development-2
This set of Questions & Answers focuses on Hadoop MapReduce.
1. The Mapper implementation processes one line at a time via _________ method.
a) map
b) reduce
c) mapper
d) reducer
Explanation: The Mapper outputs are sorted and then partitioned per Reducer.
2. Point out the correct statement :
a) Mapper maps input key/value pairs to a set of intermediate key/value pairs
b) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
c) Mapper and Reducer interfaces form the core of the job
d) None of the mentioned
Explanation: The transformed intermediate records do not need to be of the same type as the
input records.
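The point about record types can be seen in a small word-count sketch. This is plain Python that imitates the map, group and reduce steps in memory; it is not Hadoop code, and the function names are invented for the illustration. The map input is an (integer offset, line) pair while the intermediate records are (word, count) pairs.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map one input record (byte offset, line text) to (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce all counts collected for one word to a single total."""
    return word, sum(counts)


def run_job(lines):
    """Run map over every line, group by key, then reduce in sorted key order."""
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
        offset += len(line.encode("utf-8")) + 1  # +1 for the assumed newline
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]
```

For example, run_job(["a b a", "b c"]) returns [("a", 2), ("b", 2), ("c", 1)].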
3. The Hadoop MapReduce framework spawns one map task for each __________ generated by
the InputFormat for the job.
a) OutputSplit
b) InputSplit
c) InputSplitStream
d) All of the mentioned
Explanation:Mapper implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and override it to initialize themselves.
4. Users can control which keys (and hence records) go to which Reducer by implementing a
custom :
a) Partitioner
b) OutputSplit
c) Reporter
d) All of the mentioned
Explanation:Users can control the grouping by specifying a Comparator via
5. Point out the wrong statement :
a) The Mapper outputs are sorted and then partitioned per Reducer
b) The total number of partitions is the same as the number of reduce tasks for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value)
d) None of the mentioned
Explanation:All intermediate values associated with a given output key are subsequently grouped
by the framework, and passed to the Reducer(s) to determine the final output.
6. Applications can use the ____________ to report progress and set application-level status
a) Partitioner
b) OutputSplit
c) Reporter
d) All of the mentioned
Explanation:Reporter is also used to update Counters, or just indicate that they are alive.
7. The right level of parallelism for maps seems to be around _________ maps per-node
a) 1-10
b) 10-100
c) 100-150
d) 150-200
Explanation:Task setup takes a while, so it is best if the maps take at least a minute to execute.
8. The number of reduces for the job is set by the user via :
a) JobConf.setNumTasks(int)
b) JobConf.setNumReduceTasks(int)
c) JobConf.setNumMapTasks(int)
d) All of the mentioned
Explanation:Reducer has 3 primary phases: shuffle, sort and reduce.
9. The framework groups Reducer inputs by key in _________ stage.
a) sort
b) shuffle
c) reduce
d) None of the mentioned
Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.
10. The output of the reduce task is typically written to the FileSystem via _____________ .
a) OutputCollector.collect
b) OutputCollector.get
c) OutputCollector.receive
d) OutputCollector.put
Explanation: The output of the Reducer is not sorted.
Hadoop Questions and Answers MapReduce Features-1
This set of Hadoop Questions & Answers for freshers focuses on MapReduce Features.
1. Which of the following is the default Partitioner for Mapreduce ?
a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
d) None of the mentioned
Explanation: The total number of partitions is the same as the number of reduce tasks for the job.
2. Point out the correct statement :
a) The right number of reduces seems to be 0.95 or 1.75
b) Increasing the number of reduces increases the framework overhead
c) With 0.95 all of the reduces can launch immediately and start transferring map outputs as the
maps finish
d) All of the mentioned
Explanation:With 1.75 the faster nodes will finish their first round of reduces and launch a
second wave of reduces doing a much better job of load balancing.
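The factors 0.95 and 1.75 are multipliers rather than absolute counts. In the Hadoop MapReduce tutorial they are applied to the number of nodes times the maximum number of reduce slots (or containers) per node; the text above does not state that multiplier, so treat it as an assumption of this sketch. The function and parameter names are illustrative.

```python
def suggested_reduces(nodes, reduce_slots_per_node, factor=0.95):
    """Suggest a reduce count as factor * (nodes * reduce slots per node).

    factor=0.95: all reduces can start at once as the maps finish.
    factor=1.75: faster nodes run a second wave, which balances load better.
    """
    return round(factor * nodes * reduce_slots_per_node)
```

With 10 nodes and 2 reduce slots each, the two factors give 19 and 35 reduces respectively.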
3. Which of the following partitions the key space ?
a) Partitioner
b) Compactor
c) Collector
d) All of the mentioned
Explanation:Partitioner controls the partitioning of the keys of the intermediate map-outputs.
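A hash-based partitioner can be sketched in a few lines. The version below masks off the sign bit of the key's hash and takes it modulo the number of reduce tasks, which is the usual shape of a hash partitioner. It uses a Java String.hashCode-style hash as a stand-in so that results are deterministic; real Hadoop key types compute their own hash codes, so the partition numbers here are only illustrative.

```python
INT_MAX = 0x7FFFFFFF


def java_string_hash(s):
    """Stand-in hash: the 31*h + char recurrence with 32-bit signed wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as signed 32-bit


def hash_partition(key, num_reduce_tasks):
    """Pick a partition in [0, num_reduce_tasks) from the key alone."""
    return (java_string_hash(key) & INT_MAX) % num_reduce_tasks
```

Because only the key is hashed, every record with the same key lands in the same partition, and the number of partitions equals the number of reduce tasks.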
4. ____________ is a generalization of the facility provided by the MapReduce framework to
collect data output by the Mapper or the Reducer
a) OutputCompactor
b) OutputCollector
c) InputCollector
d) All of the mentioned
Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
5. Point out the wrong statement :
a) It is legal to set the number of reduce-tasks to zero if no reduction is desired
b) The outputs of the map-tasks go directly to the FileSystem
c) The Mapreduce framework does not sort the map-outputs before writing them out to the
c) io.sort.mb
d) None of the mentioned
Explanation:When percentage of either buffer has filled, their contents will be spilled to disk in
the background.
10. ______________ is percentage of memory relative to the maximum heapsize in which map
outputs may be retained during the reduce.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
Explanation:When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.
Hadoop Questions and Answers MapReduce Features-2
This set of Interview Questions & Answers focuses on MapReduce.
1. ____________ specifies the number of segments on disk to be merged at the same time.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
Explanation:io.sort.factor limits the number of open files and compression codecs during the
2. Point out the correct statement :
a) The number of sorted map outputs fetched into memory before being merged to disk
b) The memory threshold for fetched map outputs before an in-memory merge is finished
c) The percentage of memory relative to the maximum heapsize in which map outputs may not
be retained during the reduce
d) None of the mentioned
Explanation:When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.
3. Map output larger than ___ percent of the memory allocated to copying map outputs.
a) 10
b) 15
c) 25
d) 35
Explanation:Map output will be written directly to disk without first staging through memory.
4. Jobs can enable task JVMs to be reused by specifying the job configuration :
a) mapred.job.recycle.jvm.num.tasks
b) mapissue.job.reuse.jvm.num.tasks
c) mapred.job.reuse.jvm.num.tasks
d) All of the mentioned
Explanation:Many of my tasks had performance improved over 50% using
5. Point out the wrong statement :
a) The task tracker has local directory to create localized cache and localized job
b) The task tracker can define multiple local directories
c) The Job tracker cannot define multiple local directories
d) None of the mentioned
Explanation:When the job starts, task tracker creates a localized job directory relative to the local
directory specified in the configuration.
6. During the execution of a streaming job, the names of the _______ parameters are
a) vmap
b) mapvim
c) mapreduce
d) mapred
Explanation: To get the values in a streaming job's mapper/reducer, use the parameter names with
the underscores.
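A streaming mapper or reducer written in Python would read such a value from its environment. The helper below is a minimal sketch: the name streaming_param is hypothetical, and the only behavior it assumes is the dot-to-underscore renaming described above (for example, mapred.job.id becoming mapred_job_id).

```python
import os


def streaming_param(name, environ=None):
    """Look up a job parameter by its dotted name in a streaming task's environment."""
    env = os.environ if environ is None else environ
    return env.get(name.replace(".", "_"))  # dots become underscores
```

Inside a task, streaming_param("mapred.job.id") would return the job id if the variable is set, or None otherwise.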
7. The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker
and logged to :
a) ${HADOOP_LOG_DIR}/user
b) ${HADOOP_LOG_DIR}/userlogs
c) ${HADOOP_LOG_DIR}/logs
d) None of the mentioned
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
8. ____________ is the primary interface by which user-job interacts with the JobTracker.
a) JobConf
b) JobClient
c) JobServer
d) All of the mentioned
Explanation: JobClient provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the MapReduce cluster's status information and so on.
9. The _____________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.
a) DistributedLog
b) DistributedCache
c) DistributedJars
d) None of the mentioned
Explanation:Cached libraries can be loaded via System.loadLibrary or System.load.
10. __________ is used to filter log files from the output directory listing.
a) OutputLog
b) OutputLogFilter
c) DistributedLog
d) DistributedJars
a) core-default.xml
b) core-site.xml
c) coredefault.xml
d) All of the mentioned
Explanation:Value strings are first processed for variable expansion.
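Variable expansion of this kind can be sketched as repeated substitution of ${name} references. The code below is a simplified illustration, not Hadoop's Configuration class: it resolves references only against a supplied dictionary, stops at the first unbound reference, and caps the number of substitutions at 20 (an assumed limit). The property names in the usage note are made up.

```python
import re

_VAR = re.compile(r"\$\{([^}$\s]+)\}")
MAX_SUBSTITUTIONS = 20  # assumed cap to guard against circular references


def expand(value, props):
    """Replace ${name} references in value using props, one at a time."""
    for _ in range(MAX_SUBSTITUTIONS):
        match = _VAR.search(value)
        if match is None:
            return value                  # nothing left to expand
        name = match.group(1)
        if name not in props:
            return value                  # unbound: leave the literal ${name}
        value = value[:match.start()] + props[name] + value[match.end():]
    raise ValueError("substitution depth too large in: " + value)
```

For instance, with props = {"basedir": "/user/${user.name}", "user.name": "alice"}, expand("${basedir}/tmp", props) returns "/user/alice/tmp".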
5. Point out the wrong statement :
a) addDeprecations adds a set of deprecated keys to the global deprecations
b) Configuration parameters cannot be declared final
c) addDeprecations method is lockless
d) None of the mentioned
Explanation:Configuration parameters may be declared final.
6. _________ method clears all keys from the configuration.
a) clear
b) addResource
c) getClass
d) None of the mentioned
Explanation:getClass is used to get the value of the name property as a Class.
7. ________ method adds the deprecated key to the global deprecation map.
a) addDeprecits
b) addDeprecation
c) keyDeprecation
d) None of the mentioned
Explanation:addDeprecation does not override any existing entries in the deprecation map.
8. ________ checks whether the given key is deprecated.
a) isDeprecated
b) setDeprecated
c) isDeprecatedif
Explanation:LinuxTaskController keeps track of all paths and directories on datanode.
10. The configuration file must be owned by the user running :
a) DataManager
b) NodeManager
c) ValidationManager
d) None of the mentioned
Explanation: To recap, local file-system permissions need to be modified
Hadoop Questions and Answers MapReduce Job-1
This set of Hadoop Interview Questions & Answers for freshers focuses on MapReduce Job.
1. __________ storage is a solution to decouple growing storage capacity from compute
a) DataNode
b) Archival
c) Policy
d) None of the mentioned
Explanation:Nodes with higher density and less expensive storage with low compute power are
becoming available.
2. Point out the correct statement :
a) When there is enough space, block replicas are stored according to the storage type list
b) One_SSD is used for storing all replicas in SSD
c) Hot policy is useful only for single replica blocks
d) All of the mentioned
Explanation: The first phase of Heterogeneous Storage changed datanode storage model from a
single storage.
3. ___________ is added for supporting writing single replica files in memory.
Explanation:When a block is cold, all replicas are stored in ARCHIVE.
8. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are
stored in :
d) All of the mentioned
Explanation:Warm storage policy is partially hot and partially cold.
9. ____________ is used for storing one of the replicas in SSD.
a) Hot
b) Lazy_Persist
c) One_SSD
d) All_SSD
Explanation: The remaining replicas are stored in DISK.
10. ___________ is used for writing blocks with single replica in memory.
a) Hot
b) Lazy_Persist
c) One_SSD
d) All_SSD
Explanation: The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
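The storage policies in this set can be summarized as a small lookup table. The sketch below lists, for each policy, the storage type of the first replica and of the remaining replicas. The Cold, One_SSD and Lazy_Persist rows follow the explanations above; the Hot, Warm and All_SSD rows follow the HDFS archival storage documentation and are not spelled out in this text. The names POLICIES and replica_placement are illustrative.

```python
# policy name -> (storage type of the first replica, storage type of the rest)
POLICIES = {
    "Hot":          ("DISK",     "DISK"),
    "Warm":         ("DISK",     "ARCHIVE"),
    "Cold":         ("ARCHIVE",  "ARCHIVE"),
    "One_SSD":      ("SSD",      "DISK"),
    "All_SSD":      ("SSD",      "SSD"),
    "Lazy_Persist": ("RAM_DISK", "DISK"),
}


def replica_placement(policy, replication):
    """Return the preferred storage type for each replica of a block."""
    first, rest = POLICIES[policy]
    return [first] + [rest] * (replication - 1)
```

With a replication factor of 3, One_SSD gives SSD, DISK, DISK, while Cold places all three replicas in ARCHIVE.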
Hadoop Questions and Answers MapReduce Job-2
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on MapReduce Job-2.
1. _________ is a data migration tool added for archiving data.
a) Mover
b) Hiver
c) Serde
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
Explanation:TaskTracker receives the information necessary for execution of a Task from
JobTracker, Executes the Task, and Sends the Results back to JobTracker.
2. Point out the correct statement :
a) MapReduce tries to place the data and the compute as close as possible
b) Map Task in MapReduce is performed using the Mapper() function.
c) Reduce Task in MapReduce is performed using the Map() function.
d) All of the mentioned
Explanation:This feature of MapReduce is Data Locality.
3. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Explanation:Map Task in MapReduce is performed using the Map() function.
4. _________ function is responsible for consolidating the results produced by each of the Map()
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Explanation:Reduce function collates the work and resolves the results.
5. Point out the wrong statement :
a) A MapReduce job usually splits the input data-set into independent chunks which are
Explanation:Maps are the individual tasks that transform input records into intermediate records.
9. The number of maps is usually driven by the total size of :
a) inputs
b) outputs
c) tasks
d) None of the mentioned
Explanation:Total size of inputs means total number of blocks of the input files.
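As a back-of-the-envelope check, the number of maps can be estimated by dividing the total input size by the block size. The sketch below treats the input as one contiguous stream, so it ignores the extra partial block that each separate file would add; the function name is made up for the example.

```python
def estimate_maps(total_input_bytes, block_size_bytes):
    """Estimate the map count as the number of blocks, rounding up."""
    return -(-total_input_bytes // block_size_bytes)  # integer ceiling division
```

For 10 TB of input and a 128 MB block size, estimate_maps(10 * 1024**4, 128 * 1024**2) gives 81920 maps.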
10. Running a ___________ program involves running mapping tasks on many or all of the
nodes in our cluster.
a) MapReduce
b) Map
c) Reducer
d) All of the mentioned
Explanation: In some applications, component tasks need to create and/or write to side-files,
which differ from the actual job-output files.
Hadoop Questions and Answers YARN-1
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on YARN-1.
1. ________ is the architectural center of Hadoop that allows multiple data processing engines.
b) Hive
c) Incubator
d) Chukwa
Explanation:YARN is the prerequisite for Enterprise Hadoop, providing resource management
and a central platform to deliver consistent operations, security, and data governance tools across
Hadoop clusters.
2. Point out the correct statement :
a) YARN also extends the power of Hadoop to incumbent and new technologies found within the
data center
b) YARN is the central point of investment for Hortonworks within the Apache community
c) YARN enhances a Hadoop compute cluster in many ways
d) All of the mentioned
Explanation:YARN provides ISVs and developers a consistent framework for writing data access
applications that run IN Hadoop.
3. YARN's dynamic allocation of cluster resources improves utilization over more static _______
rules used in early versions of Hadoop.
a) Hive
b) MapReduce
c) Impala
d) All of the mentioned
Explanation: Multi-tenant data processing improves an enterprise's return on its Hadoop
4. The __________ is a framework-specific entity that negotiates resources from the
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Explanation:Each ApplicationMaster has responsibility for negotiating appropriate resource
containers from the Scheduler.
5. Point out the wrong statement :
a) From the system perspective, the ApplicationMaster runs as a normal container.
b) The ResourceManager is the per-machine slave, which is responsible for launching the
applications' containers
c) The NodeManager is the per-machine slave, which is responsible for launching the
applications' containers, monitoring their resource usage
d) None of the mentioned
Explanation:ResourceManager has a scheduler, which is responsible for allocating resources to
the various applications running in the cluster, according to constraints such as queue capacities
and user limits.
6. Apache Hadoop YARN stands for :
a) Yet Another Reserve Negotiator
b) Yet Another Resource Network
c) Yet Another Resource Negotiator
d) All of the mentioned
Explanation:YARN is a cluster management technology.
7. MapReduce has undergone a complete overhaul in hadoop :
a) 0.21
b) 0.23
c) 0.24
d) 0.26
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the
8. The ____________ is the ultimate authority that arbitrates resources among all the
applications in the system.
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Explanation: The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework.
9. The __________ is responsible for allocating resources to the various running applications
subject to familiar constraints of capacities, queues etc.
a) Manager
b) Master
c) Scheduler
Explanation:All applications submitted to a queue will have access to the capacity allocated to
the queue.
3. The queue definitions and properties such as ________, ACLs can be changed, at runtime.
a) tolerant
b) capacity
c) speed
d) All of the mentioned
Explanation:Administrators can add additional queues at runtime, but queues cannot be deleted
at runtime.
4. The CapacityScheduler has a pre-defined queue called :
a) domain
b) root
c) rear
d) All of the mentioned
Explanation: All queues in the system are children of the root queue.
5. Point out the wrong statement :
a) The multiple of the queue capacity which can be configured to allow a single user to acquire
more resources
b) Changing queue properties and adding new queues is very simple
c) Queues cannot be deleted, only addition of new queues is supported
d) None of the mentioned
Explanation:You need to edit conf/capacity-scheduler.xml and run yarn rmadmin -refreshQueues
for changing queue properties.
6. The updated queue configuration should be a valid one i.e. queue-capacity at each level should
be equal to :
a) 50%
b) 75%
c) 100%
d) 0%
Explanation:Queues cannot be deleted, only addition of new queues is supported.
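The validity rule from question 6 (capacities at each level must add up to 100%) can be checked with a short recursive walk. The nested-dictionary layout below is an invented representation for this sketch and is not the format of capacity-scheduler.xml.

```python
def invalid_levels(name, children, tolerance=1e-6):
    """Return the paths of parent queues whose child capacities do not sum to 100."""
    bad = []
    if children:
        total = sum(child["capacity"] for child in children.values())
        if abs(total - 100.0) > tolerance:
            bad.append(name)
        for child_name, child in children.items():
            path = name + "." + child_name
            bad += invalid_levels(path, child.get("children", {}), tolerance)
    return bad
```

An empty result means every level is valid. For example, children of root with capacities 60 and 30 would be reported as ["root"].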
7. Users can bundle their Yarn code in a _________ file and execute it using jar command.
a) java
b) jar
c) C code
d) xml
Explanation: Usage: yarn jar <jar> [mainClass] args...
8. Which of the following command is used to dump the log container ?
a) logs
b) log
c) dump
d) All of the mentioned
Explanation: Usage: yarn logs -applicationId <application ID>
9. __________ will clear the RMStateStore and is useful if past applications are no longer
a) -format-state
b) -form-state-store
c) -format-state-store
d) None of the mentioned
Explanation:-format-state-store formats the RMStateStore.
10. Which of the following command runs ResourceManager admin client ?
a) proxyserver
b) run
c) admin
d) rmadmin
Explanation:proxyserver command starts the web proxy server.
Hadoop Questions and Answers Mapreduce Types
This set of Hadoop Questions & Answers for experienced focuses on MapReduce Types.
1. ___________ generates keys of type LongWritable and values of type Text.
a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) None of the mentioned
Explanation: If K2 and K3 are the same, you don't need to call setMapOutputKeyClass().
2. Point out the correct statement :
a) The reduce input must have the same types as the map output, although the reduce output
types may be different again
b) The map input key and value types (K1 and V1) are different from the map output types
c) The partition function operates on the intermediate key
d) All of the mentioned
Explanation:In practice, the partition is determined solely by the key (the value is ignored).
3. In _____________, the default job is similar, but not identical, to the Java equivalent.
a) Mapreduce
b) Streaming
c) Orchestration
d) All of the mentioned
Explanation:MapReduce Types and Formats MapReduce has a simple model of data processing.
4. An input _________ is a chunk of the input that is processed by a single map.
a) textformat
b) split
c) datanode
d) All of the mentioned
Explanation: Each split is divided into records, and the map processes each record (a key-value
pair) in turn.
5. Point out the wrong statement :
a) If V2 and V3 are the same, you only need to use setOutputValueClass()
b) The overall effect of Streaming job is to perform a sort of the input
c) A Streaming application can control the separator that is used when a key-value pair is turned
into a series of bytes and sent to the map or reduce process over standard input
d) None of the mentioned
Explanation:If a combine function is used then it is the same form as the reduce function, except
its output types are the intermediate key and value types (K2 and V2), so they can feed the
reduce function.
6. An ___________ is responsible for creating the input splits, and dividing them into records.
a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) InputFormat
Explanation: As a MapReduce application writer, you don't need to deal with InputSplits directly,
as they are created by an InputFormat.
7. ______________ is another implementation of the MapRunnable interface that runs mappers
concurrently in a configurable number of threads.
a) MultithreadedRunner
b) MultithreadedMap
c) MultithreadedMapRunner
d) SinglethreadedMapRunner
Explanation:A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs, which it passes to the map function.
8. Which of the following is the only way of running mappers ?
a) MapReducer
b) MapRunner
c) MapRed
d) All of the mentioned
Explanation:Having calculated the splits, the client sends them to the jobtracker.
9. _________ is the base class for all implementations of InputFormat that use files as their data
source .
a) FileTextFormat
b) FileInputFormat
c) FileOutputFormat
d) None of the mentioned
Explanation:FileInputFormat provides implementation for generating splits for the input files.
10. Which of the following method add a path or paths to the list of inputs ?
a) setInputPaths()
b) addInputPath()
c) setInput()
d) None of the mentioned
Explanation: FileInputFormat offers four static convenience methods for setting a JobConf's
input paths.
Hadoop Questions and Answers Mapreduce Formats-1
This set of Hadoop Interview Questions & Answers for experienced focuses on MapReduce
1. The split size is normally the size of an ________ block, which is appropriate for most
a) generic
b) task
c) library
Explanation:FileInputFormat splits only large files(Here large means larger than an HDFS
2. Point out the correct statement :
a) The minimum split size is usually 1 byte, although some formats have a lower bound on the
split size
b) Applications may impose a minimum split size.
c) The maximum split size defaults to the maximum value that can be represented by a Java long
d) All of the mentioned
Explanation: The maximum split size has an effect only when it is less than the block size,
forcing splits to be smaller than a block.
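The interaction of the three sizes can be written as one expression: the split size is the block size, capped by the maximum and raised to at least the minimum. The sketch below assumes that formula, with defaults taken from the statements above (1 byte minimum, the largest Java long as maximum); the function name is illustrative.

```python
JAVA_LONG_MAX = 2**63 - 1


def compute_split_size(block_size, min_size=1, max_size=JAVA_LONG_MAX):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))
```

With the defaults the split size equals the block size. A maximum below the block size forces smaller splits, and a minimum above it forces larger ones.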
3. Which of the following Hadoop streaming command option parameter is required ?
a) output directoryname
b) mapper executable
c) input directoryname
d) All of the mentioned
Explanation: The required parameters specify the input and output locations and the mapper.
4. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Explanation: An environment variable is set using the -cmdenv option.
5. Point out the wrong statement :
a) Hadoop works better with a small number of large files than a large number of small files
b) CombineFileInputFormat is designed to work well with small files
c) CombineFileInputFormat does not compromise the speed at which it can process the input in a
typical MapReduce job
b) Reducer
c) Reduce
d) None of the mentioned
Explanation:Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as sum, max, min and so on over a
sequence of values.
10. Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ______ utility.
a) Copy
b) Cut
c) Paste
d) Move
Explanation: The map function defined in the class treats each input key/value pair as a list of
Hadoop Questions and Answers Mapreduce Formats-2
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Mapreduce
1. ___________ takes node and rack locality into account when deciding which blocks to place
in the same split
a) CombineFileOutputFormat
b) CombineFileInputFormat
c) TextFileInputFormat
d) None of the mentioned
Explanation:CombineFileInputFormat does not compromise the speed at which it can process the
input in a typical MapReduce job.
2. Point out the correct statement :
a) With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable
number of lines of input
b) StreamXmlRecordReader, the page elements can be interpreted as records for processing by a
c) The number depends on the size of the split and the length of the lines.
d) All of the mentioned
Explanation:Large XML documents that are composed of a series of records can be broken
into these records using simple string or regular-expression matching to find start and end tags of
3. The key, a ____________, is the byte offset within the file of the beginning of the line.
a) LongReadable
b) LongWritable
c) LongWritable
d) All of the mentioned
Explanation: The value is the contents of the line, excluding any line terminators (newline,
carriage return), and is packaged as a Text object.
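The key and value described above can be reproduced with a small generator. This is a plain-Python imitation that works on an in-memory byte string, not Hadoop's record reader, and it assumes UTF-8 text. The function name is made up for the example.

```python
def text_records(data):
    """Yield (byte offset of line start, line text without terminators)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # advance by the full line, terminator included
```

For the input b"ab\ncd\r\ne" this produces (0, "ab"), (3, "cd") and (7, "e").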
4. _________ is the output produced by TextOutputFormat, Hadoop's default OutputFormat.
a) KeyValueTextInputFormat
b) KeyValueTextOutputFormat
c) FileValueTextInputFormat
d) All of the mentioned
Explanation:To interpret such files correctly, KeyValueTextInputFormat is appropriate.
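The round trip between the two formats can be imitated in a few lines. The sketch assumes the default tab separator and that a line with no separator becomes a key with an empty value. Both function names are invented for this illustration.

```python
def format_record(key, value, sep="\t"):
    """Write one record the way a key-separator-value text line looks."""
    return f"{key}{sep}{value}"


def parse_record(line, sep="\t"):
    """Split a line at the first separator into (key, value)."""
    key, _, value = line.partition(sep)  # no separator: value is ""
    return key, value
```

Only the first separator splits the line, so parse_record("k\tv\tw") returns ("k", "v\tw").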
5. Point out the wrong statement :
a) Hadoop's sequence file format stores sequences of binary key-value pairs
b) SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the
sequence file's keys and values as opaque binary objects
c) SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that retrieves the
sequence file's keys and values as opaque binary objects.
d) None of the mentioned
Explanation:SequenceFileAsBinaryInputFormat is used for reading keys, values from
SequenceFiles in binary (raw) format.
Explanation:TextOutputFormat keys and values may be of any type.
10. Which of the following writes MapFiles as output ?
a) DBInpFormat
b) MapFileOutputFormat
c) SequenceFileAsBinaryOutputFormat
d) None of the mentioned
Explanation:SequenceFileAsBinaryOutputFormat writes keys and values in raw binary format
into a SequenceFile container.
Hadoop Questions and Answers Hadoop Cluster-1
This set of Questions and Answers focuses on Hadoop Cluster
1. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
Explanation: Mapper implementations override the JobConfigurable.configure method to initialize themselves.
2. Point out the correct statement :
a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by
the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value)
d) All of the mentioned
Answer: d
Explanation:Reporters can be used to set application-level status messages and update Counters.
3. Input to the _______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
Explanation: Enterprise data protection and security options, including file system auditing and
data-at-rest encryption to address compliance requirements, are also provided by the Isilon solution.
5. Point out the wrong statement :
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and
highly efficient storage platform
b) Isilon's native HDFS integration means you can avoid the need to invest in a separate Hadoop
c) NoSQL systems do provide high latency access and accommodate less concurrent users
d) None of the mentioned
Explanation:NoSQL systems do provide low latency access and accommodate many concurrent
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to :
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned
Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to
increase performance (scale-out) but even they require node configuration with elements of scale
7. Which is the most popular NoSQL database for scalable big data store with Hadoop ?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
Explanation:HBase is the Hadoop database: a distributed, scalable Big Data store that lets you
host very large tables billions of rows multiplied by millions of columns on clusters built
with commodity hardware.
8. The ___________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
9. HBase provides ___________ like capabilities on top of Hadoop and HDFS.
a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
Explanation: Google Bigtable leverages the distributed data storage provided by the Google File
10. _______ refers to incremental costs with no major impact on solution design, performance
and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
d) None of the mentioned
Explanation: Adding more CPU/RAM/Disk capacity to a Hadoop DataNode that is already part of a
cluster does not require additional network switches.
Hadoop Questions and Answers HDFS Maintenance
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on HDFS
1. Which of the following is a common hadoop maintenance issue ?
a) Lack of tools
b) Lack of configuration management
c) Lack of web interface
d) None of the mentioned
Explanation:Without a centralized configuration management framework, you end up with a
number of issues that can cascade just as your usage picks up.
2. Point out the correct statement :
a) RAID is turned off by default
b) Hadoop is designed to be a highly redundant distributed system
c) Hadoop has a networked configuration system
d) None of the mentioned
Explanation:Hadoop deployment is sometimes difficult to implement.
3. ___________ mode allows you to suppress alerts for a host, service, role, or even the entire
a) Safe
b) Maintenance
c) Secure
d) All of the mentioned
Explanation:Maintenance mode can be useful when you need to take actions in your cluster and
do not want to see the alerts that will be generated due to those actions.
4. Which of the following is a configuration management system ?
a) Alex
b) Puppet
c) Acem
d) None of the mentioned
Explanation:Administrators may use configuration management systems such as Puppet and
Chef to manage processes.
5. Point out the wrong statement :
a) If you set the HBase service into maintenance mode, then its roles (HBase Master and all
Region Servers) are put into effective maintenance mode
b) If you set a host into maintenance mode, then any roles running on that host are put into
effective maintenance mode
c) Putting a component into maintenance mode prevents events from being logged
c) JMX
d) None of the mentioned
Explanation:Hadoop includes several managed beans (MBeans), which expose Hadoop metrics
to JMX-aware applications.
10. NameNode is monitored and upgraded in a __________ transition.
a) safemode
b) securemode
c) servicemode
d) None of the mentioned
Explanation: The HDFS service has some unique functions that may result in additional
information on its Status and Instances pages.
Hadoop Questions and Answers Monitoring HDFS
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Monitoring
1. For YARN, the ___________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource
d) Replication
Explanation:All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.
2. Point out the correct statement :
a) The Hadoop framework publishes the job flow status to an internally running web server on
the master nodes of the Hadoop cluster
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
d) None of the mentioned
Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows
information about the NameNode itself.
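The block-splitting idea behind option b can be sketched as a small calculation. This is a toy model, not Hadoop code: the function name split_into_blocks is hypothetical, and the 128 MB default is an assumption that matches the dfs.blocksize default in Hadoop 2.x and later (Hadoop 1.x used 64 MB).

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Every block is full-sized except possibly the last one, which only
    holds the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 300 MB file under the assumed 128 MB block size: two full blocks
# plus one 44 MB block.
blocks = split_into_blocks(300 * MB)
```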
3. For ________, the HBase Master UI provides information about the HBase Master uptime.
a) HBase
b) Oozie
c) Kafka
d) All of the mentioned
Explanation:HBase Master UI provides information about the number of live, dead and
transitional servers, logs, ZooKeeper information, debug dumps, and thread stacks.
4. ________ NameNode is used when the Primary NameNode goes down.
a) Rack
b) Data
c) Secondary
d) None of the mentioned
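Explanation: The Secondary NameNode keeps a recent checkpoint of the namespace (the fsimage
merged with the edits log), which can be used for recovery when the primary NameNode goes
down. It is not an automatic hot standby.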
5. Point out the wrong statement :
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong
Explanation: The NameNode, not the DataNode, is aware of the files to which the blocks belong.
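The division of knowledge described in this question can be modelled in a few lines. The class ToyNameNode and its methods are hypothetical names for a simplified sketch, not part of any Hadoop API: the DataNode side only ever reports block IDs, while the file-to-block mapping lives on the NameNode side.

```python
class ToyNameNode:
    """Toy model: holds the file -> blocks map and the block -> DataNodes map."""

    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, block_ids):
        self.file_to_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # A block report lists every block on that DataNode, so first drop
        # any stale locations for it, then record the reported blocks.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, set())))
            for block_id in self.file_to_blocks[path]
        ]


namenode = ToyNameNode()
namenode.add_file("/logs/app.log", ["blk_1", "blk_2"])

# Each report carries block ids only; no file names are involved.
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])

locations = namenode.locate("/logs/app.log")
```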
6. Which of the following scenarios may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
Explanation: HDFS can be used for storing archive data cheaply, since it stores the data on
low-cost commodity hardware while ensuring a high degree of fault tolerance.
7. The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Explanation: Data is replicated across different DataNodes to ensure a high degree of fault tolerance.
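The three scenarios listed in this question all reduce to the same check: some block has fewer live replicas than the replication factor requires. The sketch below is a simplified model under that assumption; the function name under_replicated and the sample data are hypothetical and do not reflect how the NameNode actually schedules re-replication.

```python
def under_replicated(block_locations, live_datanodes, replication_factor=3):
    """Return {block id: number of missing replicas} for deficient blocks.

    block_locations maps a block id to the DataNodes holding a healthy
    replica; replicas on DataNodes that are not live do not count.
    """
    live = set(live_datanodes)
    missing = {}
    for block_id, holders in block_locations.items():
        live_replicas = len(set(holders) & live)
        if live_replicas < replication_factor:
            missing[block_id] = replication_factor - live_replicas
    return missing


locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
all_nodes = ["dn1", "dn2", "dn3", "dn4"]

# Healthy cluster: nothing to replicate.
healthy = under_replicated(locations, all_nodes)

# Scenario b: a DataNode goes down.
after_failure = under_replicated(locations, ["dn2", "dn3", "dn4"])

# Scenario a: the replication factor is raised from 3 to 4.
after_increase = under_replicated(locations, all_nodes, replication_factor=4)

# Scenario c: a replica is corrupted, so it no longer counts as healthy.
corrupted = dict(locations, blk_2={"dn2", "dn3"})
after_corruption = under_replicated(corrupted, all_nodes)
```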
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
Explanation: A DataNode stores data in the Hadoop file system. A functional filesystem has
more than one DataNode, with data replicated across them.
9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) HDFS Shell
b) FS Shell
c) DFS Shell
d) None of the mentioned
Explanation: The File System (FS) shell includes various shell-like commands that directly
interact with the Hadoop Distributed File System.
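A minimal sketch of driving the FS shell from a script, assuming the hdfs launcher is on the PATH of the machine running it. The helper names fs_shell_command and run_fs_shell are hypothetical; the -ls subcommand is a standard FS shell command, and the path /user is only an example.

```python
import shutil
import subprocess


def fs_shell_command(*args):
    """Build the argument list for an FS shell invocation via `hdfs dfs`."""
    return ["hdfs", "dfs", *args]


def run_fs_shell(*args):
    """Run an FS shell command; return None if `hdfs` is not installed."""
    if shutil.which("hdfs") is None:
        return None
    return subprocess.run(
        fs_shell_command(*args),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect returncode and stderr
    )


# Equivalent to typing: hdfs dfs -ls /user
listing_cmd = fs_shell_command("-ls", "/user")
```

Passing the arguments as a list rather than a single string avoids shell quoting issues with paths that contain spaces.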
10. During start up, the ___________ loads the file system state from the fsimage and the edits
log file.
a) DataNode
b) NameNode
c) ActionNode