Introduction To Big Data PDF
Skills required to become Data Scientist, Big Data Specialist and Data Analyst.
• Fast data processing. In-memory processing makes Spark faster than Hadoop MapReduce
– up to 100 times for data in RAM and up to 10 times for data in storage.
• Iterative processing. If the task is to process the same data again and again, Spark
outperforms Hadoop MapReduce. Spark’s Resilient Distributed Datasets (RDDs) enable multiple
map operations in memory, while Hadoop MapReduce has to write interim results to disk.
• Near real-time processing. If a business needs immediate insights, it should opt
for Spark and its in-memory processing.
• Graph processing. Spark’s computational model is well suited to the iterative computations
that are typical in graph processing. Apache Spark also includes GraphX, an API for graph
computation.
• Machine learning. Spark has MLlib, a built-in machine learning library, while Hadoop
needs a third-party library to provide one. MLlib has out-of-the-box algorithms that also run
in memory. If required, they can be tuned and adjusted to specific needs.
• Joining datasets. Due to its speed, Spark can create all combinations faster, though
Hadoop may be better when joining very large data sets that require a lot of shuffling and
sorting.
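The iterative-processing point above can be illustrated with a toy contrast in plain Python. This is only a sketch of the idea, not Spark or Hadoop code: the function names `iterate_in_memory` and `iterate_via_disk` are hypothetical, and the doubling step stands in for any repeated map operation.

```python
import json
import os
import tempfile


def iterate_in_memory(data, steps):
    """Apply the same map step repeatedly, keeping interim results in RAM."""
    for _ in range(steps):
        data = [x * 2 for x in data]
    return data


def iterate_via_disk(data, steps):
    """Apply the same map step repeatedly, writing every interim result
    to a file and reading it back before the next step."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "interim.json")  # hypothetical scratch file
        for _ in range(steps):
            data = [x * 2 for x in data]
            with open(path, "w") as f:
                json.dump(data, f)
            with open(path) as f:
                data = json.load(f)
    return data


# Both strategies give the same answer; only the second pays for a
# disk round trip on every iteration.
in_memory_result = iterate_in_memory([1, 2, 3], steps=3)
via_disk_result = iterate_via_disk([1, 2, 3], steps=3)
```

Both calls return the same list; the difference is that the second version touches storage once per iteration, which is the overhead the bullet attributes to Hadoop MapReduce.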
A. Big data management and data mining
B. Data warehousing and business intelligence
C. Management of Hadoop clusters
D. Collecting and storing unstructured data
5. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
8. Define the Port Numbers for NameNode, Task Tracker and Job Tracker.
11. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
12. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
a) fsk
b) fsck
c) fetchdt
b) Hadoop has an option parsing framework that employs only parsing generic options
17. HDFS supports the ____________ command to fetch Delegation Token and store it in a file
on the local system.
a) fetdt
b) fetchdt
c) fsk
d) rec
18. In ___________ mode, the NameNode will interactively prompt you at the command line
about possible courses of action you can take to recover your data.
a) full
c) recovery
d) commit
a) classNAME displays the class name needed to get the Hadoop jar
a) dtcp
b) distcp
c) dcp
d) distc
21. __________ mode is a NameNode state in which it does not accept changes to the namespace.
a) Recover
b) Safe
c) Rollback
22. __________ command is used to interact and view Job Queue information in HDFS.
a) queue
b) priority
c) dist
23. Which of the following commands runs the HDFS secondary namenode?
a) secondary namenode
b) secondarynamenode
c) secondary_namenode
a) MapReduce tries to place the data and the compute as close as possible
26. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.
a) Maptask
b) Mapper
c) Task execution
27. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
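As a study aid for the two questions above, the following is a minimal single-process sketch of the map, group-by-key, and reduce flow. It is not Hadoop's API: `run_mapreduce`, `length_mapper`, and `count_reducer` are hypothetical names, and the two input lines are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a toy MapReduce job entirely in memory."""
    intermediate = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs,
    # which are grouped by key (a stand-in for the shuffle step).
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: consolidate all values that share a key.
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}


def length_mapper(line):
    """Emit (word length, word) for every word in the line."""
    for word in line.split():
        yield len(word), word


def count_reducer(key, values):
    """Count how many words were emitted for this key."""
    return len(values)


# Count words by length across two sample lines.
result = run_mapreduce(["big data tools", "map and reduce"],
                       length_mapper, count_reducer)
print(result)  # {3: 3, 4: 1, 5: 1, 6: 1}
```

The mapper only sees one record at a time and the reducer only sees one key at a time, which is what allows a real framework to run many of each in parallel.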
a) A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner
c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
30. ________ is a utility which allows users to create and run jobs with any executable as the
mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
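To make the idea in question 30 concrete, here is a hedged sketch of what a word-count mapper and reducer for such a utility could look like in Python. It assumes the usual streaming convention of line-oriented text with a tab between key and value, and reducer input sorted by key. The names `map_lines` and `reduce_lines` are hypothetical, and the pipeline is simulated locally rather than run on a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.
    Assumes the input is already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: map, sort (standing in for the shuffle), then reduce.
mapped = map_lines(["big data", "big deal"])
reduced = list(reduce_lines(sorted(mapped)))
print(reduced)  # ['big\t2', 'data\t1', 'deal\t1']
```

On a real cluster, each function would live in its own script that reads standard input and writes standard output, and the two scripts would be supplied as the mapper and reducer executables.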
31. __________ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper
b) Reducer
a) inputs
b) outputs
c) tasks
a) MapReduce
b) Map
c) Reducer
34. ___________ is the world’s most complete, tested, and popular distribution of Apache Hadoop
and related projects.
a) MDH
b) CDH
c) ADH
d) BDH
b) CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified
batch processing, interactive SQL, and interactive search, and role-based access controls
c) More enterprises have downloaded CDH than all other such distributions combined
36. Cloudera ___________ includes CDH and an annual subscription license (per node) to
Cloudera Manager and technical support.
a) Enterprise
b) Express
c) Standard
a) Enterprise
b) Express
c) Standard
d) Manager
a) HCatalog
b) Hbase
c) Impala
d) Oozie
a) multi-tenancy
b) flexibility
c) scalability