Hadoop MCQs
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “History of
Hadoop”.
1. IBM and ________ have announced a major initiative to use Hadoop to support university
courses in distributed computer programming.
a) Google Latitude
c) Google Variations
d) Google
View Answer
Answer: d
a) Hadoop is an ideal environment for extracting and transforming small volumes of data
c) The Giraph framework is less useful than a MapReduce job to solve graph and machine
learning
View Answer
Answer: b
Explanation: Data compression can be achieved using compression algorithms like bzip2,
gzip, LZO, etc. Different algorithms can be used in different scenarios based on their
capabilities.
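As a small illustration of the point above, output compression can be requested directly from the MapReduce job-configuration API. The helper class below is only a sketch; the property names shown (mapreduce.map.output.compress and mapreduce.map.output.compress.codec) are the ones used by recent Hadoop 2.x/3.x releases.

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public final class CompressionConfig {
        // Compress the final output of a MapReduce job with gzip.
        public static void enableGzipOutput(Job job) {
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        }

        // Compress intermediate map output as well, which reduces shuffle traffic.
        public static void enableMapOutputCompression(Job job) {
            job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
            job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                    GzipCodec.class, CompressionCodec.class);
        }
    }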
3. What license is Hadoop distributed under?
c) Shareware
d) Commercial
View Answer
Answer: a
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional
Hadoop cluster using a live CD.
a) OpenOffice.org
b) OpenSolaris
c) GNU
d) Linux
View Answer
Answer: b
Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.
b) JAX-RS
View Answer
Answer: a
Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data
sets reliably, and to stream those data sets at high bandwidth to the user.
b) Perl
View Answer
Answer: c
Explanation: The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command-line utilities written as shell scripts.
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
View Answer
Answer: c
8. Hadoop achieves reliability by replicating the data across multiple hosts and hence does
not require ________ storage on hosts.
a) RAID
c) ZFS
d) Operating system
View Answer
Answer: a
Explanation: With the default replication value, 3, data is stored on three nodes: two on the
same rack, and one on a different rack.
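A minimal sketch of how that replication factor is controlled from the HDFS Java API; the file path used here is hypothetical, and in practice the cluster-wide default is normally set in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default replication factor (normally set in hdfs-site.xml).
            conf.setInt("dfs.replication", 3);

            try (FileSystem fs = FileSystem.get(conf)) {
                // Override the replication factor for a single, hypothetical file.
                fs.setReplication(new Path("/data/example.txt"), (short) 3);
            }
        }
    }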
9. Above the file systems comes the ________ engine, which consists of one Job Tracker, to
which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook
View Answer
Answer: a
10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and
matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence
View Answer
Answer: a
Explanation: The Apache Mahout project’s goal is to build a scalable machine learning tool.
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop
Ecosystem”.
1. ________ is a platform for constructing data flows for extract, transform, and load (ETL)
processing and analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
View Answer
Answer: c
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-
level language for expressing data analysis programs.
a) Hive is not a relational database, but a query engine that supports the parts of SQL
specific to querying data
View Answer
Answer: a
Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-
compatible file systems.
3. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
a) Scalding
b) HCatalog
c) Cascalog
View Answer
Answer: c
Explanation: Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the
name “Cascalog” is a contraction of Cascading and Datalog.
a) C#
b) Java
c) C
d) C++
View Answer
Answer: b
Explanation: Hive also supports custom extensions written in Java, including user-defined
functions (UDFs) and serializer-deserializers for reading and optionally writing custom
formats.
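For illustration, a user-defined function of the kind this explanation describes can be as small as the sketch below; the class name is made up, and it uses the classic org.apache.hadoop.hive.ql.exec.UDF base class (newer Hive versions prefer GenericUDF). Once packaged in a jar, it would typically be registered with a CREATE TEMPORARY FUNCTION statement.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A tiny Hive UDF that upper-cases a string column.
    public final class UpperCaseUDF extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().toUpperCase());
        }
    }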
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
View Answer
Answer: a
Explanation: Rather than building Hadoop deployments manually on EC2 (Elastic Compute
Cloud) clusters, users can spin up fully configured Hadoop installations using simple
invocation commands, either through the AWS Web Console or through command-line
tools.
a) Scalding
b) HCatalog
c) Cascalog
d) Cascading
View Answer
Answer: d
a) MapReduce
b) Drill
c) Oozie
Answer: a
Explanation: MapReduce provides a flexible and scalable foundation for analytics, from
traditional reporting to leading-edge machine learning algorithms.
8. The Pig Latin scripting language is not only a higher-level data flow language but also has
operators similar to ____________
a) SQL
b) JSON
c) XML
View Answer
Answer: a
Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative style of
SQL and the low-level procedural style of MapReduce.
a) MapReduce
b) Drill
c) Oozie
d) Hive
View Answer
Answer: d
Explanation: Hive Queries are translated to MapReduce jobs to exploit the scalability of
MapReduce.
10. ______ is a framework for performing remote procedure calls and data serialization.
a) Drill
b) BigTop
c) Avro
d) Chukwa
View Answer
Answer: c
Explanation: In the context of Hadoop, Avro can be used to pass data from one program or
language to another.
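A brief sketch of what passing data between programs or languages looks like with Avro's generic Java API; the "User" schema below is invented for the example.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public final class AvroRecordExample {
        public static void main(String[] args) {
            // An inline schema for a hypothetical "User" record.
            String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"age\",\"type\":\"int\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            // Build a record that any Avro-aware program or language can read back.
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);
            System.out.println(user);
        }
    }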
This set of Multiple Choice Questions & Answers (MCQs) focuses on “Big-Data”.
1. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including _______________
View Answer
Answer: d
Explanation: Adding security to Hadoop is challenging because not all of the interactions
follow the classic client-server pattern.
c) In the Hadoop programming framework output files are divided into lines or records
Answer: b
Explanation: Hadoop batch-processes data distributed over a number of computers, ranging
into the hundreds and thousands.
3. According to analysts, for what can traditional IT systems provide a foundation when
they’re integrated with big data technologies like Hadoop?
View Answer
Answer: a
Explanation: Data warehousing integrated with Hadoop would give a better understanding
of data.
4. Hadoop is a framework that works with a variety of related tools. Common cohorts
include ____________
View Answer
Answer: a
Explanation: To use Hive with HBase you’ll typically want to launch two clusters, one to run
HBase and the other to run Hive.
5. Point out the wrong statement.
a) Hadoop processing capabilities are huge and its real advantage lies in the ability to
process terabytes & petabytes of data
b) Hadoop uses a programming model called “MapReduce”, all the programs should
conform to this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
View Answer
Answer: c
Explanation: The programming model, MapReduce, used by Hadoop is simple to write and
test.
View Answer
Answer: c
Explanation: Doug Cutting, Hadoop creator, named the framework after his child’s stuffed
toy elephant.
a) Open-source
b) Real-time
c) Java-based
View Answer
Answer: b
a) MapReduce
b) Mahout
c) Oozie
View Answer
Answer: a
a) Apple
b) Datamatics
c) Facebook
View Answer
Answer: c
Explanation: Facebook has many Hadoop clusters; the largest among them is the one used
for data warehousing.
a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
View Answer
Answer: a
Explanation: Prism automatically replicates and moves data wherever it’s needed across a
vast network of computing facilities.
This set of Multiple Choice Questions & Answers (MCQs) focuses on “Introduction to
MapReduce”.
1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by
the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
View Answer
Answer: c
Explanation: The TaskTracker receives the information necessary for the execution of a Task
from the JobTracker, executes the Task, and sends the results back to the JobTracker.
a) MapReduce tries to place the data and the compute as close as possible
View Answer
Answer: a
3. ___________ part of the MapReduce is responsible for processing one or more chunks
of data and producing the output results.
a) Maptask
b) Mapper
c) Task execution
View Answer
Answer: a
4. _________ function is responsible for consolidating the results produced by each of the
Map() functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
View Answer
Answer: a
Explanation: The Reduce function collates the work and resolves the results.
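As a concrete (if standard) example of such a consolidating function, the word-count reducer below sums the per-word counts emitted by the map tasks; it is a sketch using the org.apache.hadoop.mapreduce API rather than anything specific to this question set.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Consolidates the per-word counts emitted by the Map() tasks into one total per word.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }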
a) A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner
c) Applications typically implement the Mapper and Reducer interfaces to provide the map
and reduce methods
View Answer
Answer: d
Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them
and re-executes the failed tasks.
a) Java
b) C
c) C#
View Answer
Answer: a
Explanation: Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce
applications (not based on JNI™).
7. ________ is a utility which allows users to create and run jobs with any executables as the
mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
View Answer
a) Mapper
b) Reducer
View Answer
a) inputs
b) outputs
c) tasks
Answer: a
Explanation: Total size of inputs means the total number of blocks of the input files.
a) HashPar
b) Partitioner
c) HashPartitioner
Answer: c
Explanation: The default partitioner in Hadoop is the HashPartitioner, which has a
getPartition method that assigns each key to a partition.
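A custom partitioner that does essentially what the default HashPartitioner's getPartition method does is sketched below; the class name is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Maps each key to one of the reduce tasks, mirroring the default hash-based behaviour.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }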
11. Running a ___________ program involves running mapping tasks on many or all of the
nodes in our cluster.
a) MapReduce
b) Map
c) Reducer
Answer: a
Explanation: In some applications, component tasks need to create and/or write to side-files,
which differ from the actual job-output files.
a) Partitioner
b) OutputCollector
c) Reporter
View Answer
Answer: b
Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
10. _________ is the primary interface for a user to describe a MapReduce job to the
Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
View Answer
Answer: b
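A driver sketch using JobConf (the older org.apache.hadoop.mapred API) to describe a job to the framework; the mapper and reducer classes are left as placeholders and would have to be supplied before the job produces useful output.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Describe the job: output types, then input/output locations.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            // conf.setMapperClass(...); conf.setReducerClass(...);  // hypothetical classes

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }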
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Scaling out
in Hadoop”.
1. ________ systems are scale-out file-based (HDD) systems moving to more uses of
memory in the nodes.
a) NoSQL
b) NewSQL
c) SQL
Answer: a
Explanation: NoSQL systems make the most sense whenever the application is based on data
with varying data types and the data can be stored in key-value notation.
View Answer
Answer: a
Explanation: Hadoop, together with a relational data warehouse, can form a very effective
data warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64MB to 256MB block sizes of delimited record
values with schema applied on read based on ____________
a) HCatalog
b) Hive
c) Hbase
View Answer
Answer: a
a) EMR
b) Isilon solutions
c) AWS
View Answer
Answer: b
Explanation: Enterprise data protection and security options, including file system auditing
and data-at-rest encryption to address compliance requirements, are also provided by the
Isilon solution.
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and
highly efficient storage platform
b) Isilon native HDFS integration means you can avoid the need to invest in a separate
Hadoop infrastructure
c) NoSQL systems do provide high latency access and accommodate less concurrent users
View Answer
Answer: c
Explanation: NoSQL systems do provide low latency access and accommodate many
concurrent users.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to ____________
a) Scale out
b) Scale up
View Answer
Answer: a
Explanation: HDFS and NoSQL file systems focus almost exclusively on adding nodes to
increase performance (scale-out) but even they require node configuration with elements of
scale up.
7. Which is the most popular NoSQL database for scalable big data store with Hadoop?
a) Hbase
b) MongoDB
c) Cassandra
View Answer
Answer: a
Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store that lets
you host very large tables — billions of rows multiplied by millions of columns — on clusters
built with commodity hardware.
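A small sketch of storing and reading a cell through the HBase Java client API; the table name "users" and column family "info" are assumptions for the example and would need to exist on the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Read the same cell back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }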
8. The ___________ can also be used to distribute both jars and native libraries for use in
the map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
View Answer
Answer: c
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
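In the newer MapReduce API the same facility is exposed on the Job object; the sketch below ships a (hypothetical) lookup file and an archive to every task's working directory.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public final class CacheFileSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "cache-example");

            // Both paths are hypothetical; the fragment after '#' is the symlink name
            // that appears in each task's working directory.
            job.addCacheFile(new URI("/shared/lookup.txt#lookup.txt"));
            job.addCacheArchive(new URI("/shared/deps.jar#deps"));
        }
    }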
a) TopTable
b) BigTop
c) Bigtable
View Answer
Answer: c
Explanation: Google Bigtable leverages the distributed data storage provided by the Google
File System.
10. __________ refers to incremental costs with no major impact on solution design,
performance and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
View Answer
Answer: c
Explanation: Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part
of a cluster does not require additional network switches.
a) generic
b) tool
c) library
d) task
View Answer
Answer: a
Explanation: Place the generic options before the streaming options, otherwise the
command will fail.
a) You can specify any executable as the mapper and/or the reducer
b) You cannot supply a Java class as the mapper and/or the reducer
c) The class you supply for the output format should return key/value pairs of Text class
View Answer
Answer: a
Explanation: If you do not specify an input format class, the TextInputFormat is used as the
default.
a) output directoryname
b) mapper executable
c) input directoryname
d) all of the mentioned
View Answer
Answer: d
Explanation: The input and output directories and the mapper executable are all required parameters.
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
View Answer
Answer: c
b) Aggregate allows you to define a mapper plugin class that is expected to generate
“aggregatable items” for each input key/value pair of the mappers
View Answer
Answer: c
6. The ________ option allows you to copy jars locally to the current working directory of
tasks and automatically unjar the files.
a) archives
b) files
c) task
View Answer
Answer: a
7. ______________ class allows the Map/Reduce framework to partition the map outputs
based on certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
View Answer
Answer: b
Explanation: The primary key is used for partitioning, and the combination of the primary
and secondary keys is used for sorting.
8. Which of the following class provides a subset of features provided by the Unix/GNU Sort?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
Answer: c
Explanation: Hadoop has a library class, KeyFieldBasedComparator, that is useful for many
applications.
a) Map
b) Reducer
c) Reduce
View Answer
Answer: b
Explanation: Aggregate provides a special reducer class and a special combiner class, and a
list of simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on
over a sequence of values.
a) Copy
b) Cut
c) Paste
d) Move
View Answer
Answer: b
Explanation: The map function defined in the class treats each input key/value pair as a list
of fields.
Hadoop Questions and Answers – Introduction to HDFS
This set of Multiple Choice Questions & Answers (MCQs) focuses on “Introduction to
HDFS”.
1. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
View Answer
Answer: b
Explanation: All the metadata related to HDFS including the information about data nodes,
files stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.
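A minimal sketch showing that file metadata (path, replication, size) is served by the NameNode through the FileSystem API; the directory path used is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class ListHdfsDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                // Each FileStatus is answered from metadata held by the NameNode.
                for (FileStatus status : fs.listStatus(new Path("/user/example"))) {
                    System.out.println(status.getPath() + "  replication="
                            + status.getReplication() + "  size=" + status.getLen());
                }
            }
        }
    }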
a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of
fault tolerance
View Answer
Answer: a
a) master-worker
b) master-slave
c) worker/slave
d) all of the mentioned
View Answer
Answer: a
Explanation: The NameNode serves as the master and each DataNode serves as a worker/slave.
a) Rack
b) Data
c) Secondary
View Answer
Answer: c
Explanation: The Secondary NameNode is used for all-time availability and reliability.
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
View Answer
Answer: d
Explanation: The NameNode is aware of the files to which the blocks stored on each DataNode belong.
6. Which of the following scenario may not be a good fit for HDFS?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
View Answer
Answer: a
Explanation: HDFS can be used for storing archive data because it is cheaper: it allows data
to be stored on low-cost commodity hardware while ensuring a high degree of fault
tolerance.
7. The need for data replication can arise in various scenarios like ____________
View Answer
Answer: d
Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-
tolerance.
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
View Answer
Answer: a
9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) “HDFS Shell”
b) “FS Shell”
c) “DFS Shell”
View Answer
Answer: b
Explanation: The File System (FS) shell includes various shell-like commands that directly
interact with the Hadoop Distributed File System (HDFS).
a) C++
b) Java
c) Scala
View Answer
Answer: b
Explanation: HDFS is implemented in Java and any computer which can run Java can host a
NameNode/DataNode on it.
11. For YARN, the ___________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource
d) Replication
View Answer
Answer: c
Explanation: The YARN ResourceManager web UI provides host and port information for the
cluster's nodes and running applications.
a) The Hadoop framework publishes the job flow status to an internally running web server
on the master nodes of the Hadoop cluster
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of
fault tolerance
View Answer
Answer: a
Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows
information about the NameNode itself.
13. For ________ the HBase Master UI provides information about the HBase Master
uptime.
a) HBase
b) Oozie
c) Kafka
View Answer
Answer: a
Explanation: HBase Master UI provides information about the number of live, dead and
transitional servers, logs, ZooKeeper information, debug dumps, and thread stacks.
14. During start up, the ___________ loads the file system state from the fsimage and the
edits log file.
a) DataNode
b) NameNode
c) ActionNode
View Answer
Answer: b
Explanation: HDFS is implemented in Java, so any computer which can run Java can host a
NameNode/DataNode.