MapR Certified Hadoop Developer Study Guide (MCHD)
CONTENTS
About MapR Study Guides
Datasets
MapR certification study guides are intended to help you prepare for certification by
providing additional study resources, sample questions, and details about how to take the
exam. The study guide by itself is not enough to prepare you for the exam. You'll need
training, practice, and experience. The study guide will point you in the right direction and
help you get ready.
If you use all the resources in this guide, and spend 6-12 months on your own using the
software, experimenting with the tools, and practicing the role you are certifying for, you
should be well prepared to attempt the exams.
The MapR Certified Hadoop Developer credential is designed for Developers who program
MapReduce in Java. The credential measures the specific technical knowledge, skills, and
abilities required to design, develop, deploy, and manage MapReduce programs in Java.
1 What's on the Exam?
MapR tests new questions on the exam in an unscored manner. This means that you may see
test questions on the exam that are not used for scoring your exam. You will not know which
items are scored and which are unscored. Unscored items are being tested for inclusion in
future versions of the exam. They do not affect your results.
MapR exams are Pass or Fail. We do not publish the exam cut score because the passing score
changes frequently based on the scored items that are being used.
1.1 Describe the MapReduce computational model including input, map, reduce, splits,
outputs, and combiners
1.2 Define how data flows in a MapReduce workflow including details on how data is
loaded, analyzed, stored, and read
2.1 Describe how MapReduce jobs are executed and monitored in both MapReduce v.1
and in YARN
4.1 Demonstrate how to use the MapReduce API to solve common programming
problems
4.2 Describe how Mapper input and Reducer output work in processing data in
MapReduce
4.3 Demonstrate how to use the Mapper, Reducer, and Job class APIs
5.1 Demonstrate how to use counters to validate jobs and how to write custom counters
for specific tasks
5.2 Demonstrate how to manage and display jobs, history, and logs using the command
line interface
5.3 Demonstrate how to use MRUnit to test Mapper and Reducer class functionality
6.3 Describe the strategies that can be used to improve MapReduce performance
7.1 Describe the requirements for working with sequence files and compressing sequence
files on a MapR cluster
7.2 Demonstrate how to work with the distributed cache including distribution of jar
files, dynamic information to run a task, and using map-side joins.
7.3 Demonstrate how to work with HBase in MapReduce jobs as a source, as a sink, and
as both a source and a sink in your data flow
7.4 Describe the requirements for working with sequence files and compressing
sequence files on a MapR cluster
8.3 Demonstrate how to manage multiple jobs in MapReduce within the driver
9.1 Define the programming contract for mappers & reducers in MapReduce streaming
9.2 Demonstrate how to use non-Java programs such as Perl and Python to stream
MapReduce jobs
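The computational model named in objectives 1.1 and 1.2 (input splits, map, combiner, shuffle, reduce, output) can be illustrated without a cluster. The following is a minimal sketch in plain Java that simulates a word count in memory. It does not use the Hadoop API. The class name MapReduceModelSketch and all of its method names are hypothetical and chosen only for illustration, and Integer stands in for Hadoop's writable value types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the MapReduce data flow (not the Hadoop API).
public class MapReduceModelSketch {

    // Map: turn one input record (a line of text) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce: sum every value that shares a key. The same logic serves as the combiner.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Shuffle: group values by key; the TreeMap keeps keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Driver: each inner list plays the role of one input split handled by one mapper.
    static Map<String, Integer> run(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            for (String line : split) {
                mapped.addAll(map(line));
            }
            // Combiner: pre-aggregate one mapper's output before it is shuffled.
            for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
                combined.add(Map.entry(e.getKey(), reduce(e.getKey(), e.getValue())));
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(combined).entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("the quick brown fox", "the lazy dog"),
                List.of("the fox"));
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(run(splits));
    }
}
```

Reusing the reduce logic as the combiner is valid here only because summation is associative and commutative. In a real job the combiner is registered in the driver, for example with job.setCombinerClass as shown in sample question Q6.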
Sample Questions
The following questions represent the kinds of questions you will see on the exam. The
answers to these sample questions can be found in the answer key following the sample
questions.
Q2. What information is included in a heartbeat from task tracker to job tracker?
A. Job status
B. Network errors
C. CPU health
D. Task status
Q4. Which Java statement correctly sums the values of an Iterable values input
parameter?
Q6. Which Java statement correctly defines a combiner class in a MapReduce driver?
A. job.setCombinerClass(MyCombiner.class);
B. combiner.setClass(MyCombiner.class);
C. job.setCombiner(MyCombiner.class);
D. combiner.set(MyCombiner.class);
Q7. What is the correct signature for the map method of TableMapReduceUtil?
A. Map and reduce programs are distributed to mappers and reducers using the
distributed cache
B. Map and reduce programs are automatically distributed to mappers and
reducers
C. Map and reduce programs may be pre-installed on mappers and reducers in
a directory contained in the HADOOP_PATH environment variable
D. Map and reduce programs may be pre-installed on the mappers and reducers
in the $STREAMING_DIR/bin directory
Q10. Which statement accurately describes how data flows through a streaming
reducer?
A. Each line in the partition is sent to the reducer one line at a time and then
standard input is closed
B. Every line from the input file is sent at once and then standard input is closed
C. Input keys are separated from values by the newline character
D. Input values are terminated by the tab character
Answer Key
Correct answers are marked with an asterisk (*).
Q2. What information is included in a heartbeat from task tracker to job tracker?
A. Job status
B. Network errors
C. CPU health
D. *Task status
Q4. Which Java statement correctly sums the values of an Iterable values input
parameter?
Q6. Which Java statement correctly defines a combiner class in a MapReduce driver?
A. *job.setCombinerClass(MyCombiner.class);
B. combiner.setClass(MyCombiner.class);
C. job.setCombiner(MyCombiner.class);
D. combiner.set(MyCombiner.class);
Q7. What is the correct signature for the map method of TableMapReduceUtil?
A. *Map and reduce programs are distributed to mappers and reducers using
the distributed cache
B. Map and reduce programs are automatically distributed to mappers and
reducers
C. Map and reduce programs may be pre-installed on mappers and reducers in
a directory contained in the HADOOP_PATH environment variable
D. Map and reduce programs may be pre-installed on the mappers and reducers
in the $STREAMING_DIR/bin directory
Q10. Which statement accurately describes how data flows through a streaming
reducer?
A. *Each line in the partition is sent to the reducer one line at a time and then
standard input is closed
B. Every line from the input file is sent at once and then standard input is closed
C. Input keys are separated from values by the newline character
D. Input values are terminated by the tab character
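The data flow described in Q10 can be sketched as a small program that reads its partition from standard input one line at a time and stops when the stream is closed. The sketch below is written in Java to match the rest of this guide, although streaming jobs are more commonly written in languages such as Perl or Python. It assumes the default streaming layout, in which each line holds a key and a value separated by a tab and lines arrive sorted by key. The class name StreamingSumReducer is hypothetical, and skipping lines that contain no tab is a simplification of this sketch, not streaming behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical streaming-style reducer: sums integer values per key.
public class StreamingSumReducer {

    // Reads "key<TAB>value" lines sorted by key and writes one "key<TAB>sum" line per key.
    static void reduce(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String currentKey = null;
        long sum = 0;
        String line;
        // readLine() returns null once the input stream has been closed (end of partition).
        while ((line = reader.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // simplification: ignore lines without a tab separator
            }
            String key = line.substring(0, tab);
            long value = Long.parseLong(line.substring(tab + 1).trim());
            // Input is sorted, so a change of key means the previous key is complete.
            if (currentKey != null && !currentKey.equals(key)) {
                writer.print(currentKey + "\t" + sum + "\n");
                sum = 0;
            }
            currentKey = key;
            sum += value;
        }
        // Emit the final key after the input has ended.
        if (currentKey != null) {
            writer.print(currentKey + "\t" + sum + "\n");
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        reduce(new InputStreamReader(System.in, StandardCharsets.UTF_8),
               new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
    }
}
```

Because the reducer receives individual lines rather than a key with a grouped list of values, it has to detect key boundaries itself. This is the main difference from the Java Reducer API.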
2 Preparing for the Certification
MapR offers a number of training courses that will help you prepare. We recommend
taking the classroom training first, followed by self-paced online training, and then
several months of experimentation on your own learning the tools in a real-world
environment.
We also provide additional resources in this guide to support your learning. The blogs,
whiteboard walkthroughs, and ebooks are excellent supporting material in your efforts
to become a Hadoop Developer.
Course Description:
This course teaches developers how to write Hadoop applications using MapReduce
and YARN in Java. The course covers debugging, managing jobs, improving
performance, working with custom data, managing workflows, and using other
programming languages for MapReduce.
Self-paced Training
MapR self-paced courses are based on the instructor-led courses. We recommend that
everyone considering the certification take advantage of the free online training in
conjunction with the instructor-led training.
Course Description:
This course teaches developers how to write Hadoop applications using MapReduce and
YARN in Java. The course covers debugging, managing jobs, improving performance,
working with custom data, managing workflows, and using other programming
languages for MapReduce.
2. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay
Ghemawat
http://research.google.com/archive/mapreduce.html
4. Data-Intensive Text Processing with MapReduce - Jimmy Lin and Chris Dyer
http://lintool.github.io/MapReduceAlgorithms/
5. Hadoop: The Definitive Guide, MapReduce for the Cloud - Tom White
http://shop.oreilly.com/product/9780596521981.do
7. How To: Using Non-Java Programs or Streaming for MapReduce Jobs - James
Casaletto
https://www.mapr.com/blog/how-using-non-java-programs-or-streaming-mapreduce-jobs
Datasets
These are some datasets that we recommend for experimenting with.
3. Kaggle
This site includes a collection of datasets used in machine learning competitions run
by Kaggle. Areas include classification, regression, ranking, recommender systems,
and image analysis. These datasets can be found under the Competitions section at
http://www.kaggle.com/competitions
4. KDnuggets
This site has a detailed list of public datasets, including some of those mentioned
earlier. The list is available at http://www.kdnuggets.com/datasets/index.html
5. SF Open Data
SF OpenData is the central clearinghouse for data published by the City and County
of San Francisco and is part of the broader open data program.
https://data.sfgov.org/data
3 Taking the Exam
MapR Certification exams are delivered online using a service from Innovative Exams. A
human will proctor your exam. Your proctor will have access to your webcam and
desktop during your exam. Once you are logged in for your test session, and your
webcam and desktop are shared, your proctor will launch your exam.
This method allows you to take our exams anytime, and anywhere, but you will need a
quiet environment where you will remain uninterrupted for up to two hours. You will
also need a reliable Internet connection for the entire test session.
Once confirmed, your reservation will appear in the My Exams tab of Innovative Exams.
To cancel an exam, log in to www.examslocal.com, click My Exams, select the exam to
cancel, and then select the Cancel button to confirm the cancellation. A cancellation
confirmation email will be sent to you following the cancellation.
We recommend that you sign in 30 minutes in advance of your testing time so that you
can communicate with your proctor, and get completely set up well in advance of your
test time.
You will be required to share your desktop and your webcam prior to the exam start.
YOUR EXAM SESSION WILL BE RECORDED. If the proctor senses any misconduct, your
exam will be paused and you will be contacted by the proctor. If your misconduct is not
corrected, the proctor will shut down your exam, resulting in a Fail.
Examples of misconduct and/or misuse of the exam include, but are not limited to, the
following:
When you pass a MapR Certification exam, you will receive a confirmation email with
the details of your success. This will include the title of your certification and details on
how you can download your digital certificate, and share your certification on social
media.
Your certification will be updated in learn.mapr.com in your profile. From your profile
you can view your certificate and share it on LinkedIn.
Your certificate is available as a PDF. You can download and print your certificate from
your profile in learn.mapr.com.
Your credential contains a unique Certificate Number and a URL. You can share your
credential with anyone who needs to verify your certification.
If you happen to fail the exam, you will automatically qualify for a discounted exam
retake voucher. Retakes are $100 USD and can be purchased by contacting
[email protected]. MapR will verify your eligibility and supply you with a
special one-time-use discount code, which you can apply at the time of purchase.
Exam Retakes
If you fail an exam, you are eligible to purchase and retake the exam in 14 days. Once
you have passed the exam, you may not take that version (e.g., v.4.0) of the exam again,
but you may take any newer version of the exam (e.g., v.4.1). A test result found to be in
violation of the retake policy will result in no credit awarded for the test taken. Violators
of these policies may be banned from participation in the MapR Certification Program.