Certified Machine Learning Engineer – Associate
Question # 1
A college endowment office uses an Amazon S3 data lake with structured and
unstructured data to identify potential big donors. Many different data lake
records refer to the same person, so fundraisers need to de-duplicate the data before
storing it and preparing it for further processing. What is the easiest and most
effective way to achieve that goal?
A Write custom Python de-duplication code and run it on an EMR cluster.
B Use the AWS Glue Find Matches transform to identify and eliminate duplicate people.
C Find a matching algorithm on AWS Marketplace.
D Store data in compressed JSON format.
Answer B
AWS Glue provides machine learning capabilities for creating custom transforms that
cleanse your data. There is currently one available transform, named Find Matches. The
Find Matches transform identifies duplicate or matching records in your dataset, even
when the records have no common unique identifier and no fields match exactly. It does
not require writing any code or knowing how machine learning works. Find Matches runs
as part of Glue ETL and is a simple, serverless solution to the problem. The algorithm
can link people's records across different databases, even when many fields do not
match exactly (e.g. different name spellings, address differences, missing or
inaccurate data, etc.).
Writing custom Python code or finding commercial code on the Marketplace would be
neither the easiest nor the most effective solution. Storing the data as compressed
JSON does not address the issue.
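To make the idea of matching records without a shared identifier concrete, here is a toy sketch in plain Python. It is not the algorithm Glue uses; it substitutes simple character-level similarity from the standard library's difflib. The sample records, the field names, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical donor records: no shared ID, and no field matches exactly.
records = [
    {"name": "Katherine O'Neil", "city": "Boston"},
    {"name": "Kathrine ONeil", "city": "Boston, MA"},
    {"name": "Robert Chen", "city": "Seattle"},
]


def similarity(a, b):
    """Average character-level similarity over the fields both records share."""
    fields = a.keys() & b.keys()
    ratios = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(ratios) / len(ratios)


def find_duplicate_pairs(rows, threshold=0.8):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
        if similarity(rows[i], rows[j]) >= threshold
    ]


pairs = find_duplicate_pairs(records)
print(pairs)  # the two spellings of the same donor are paired
```

A hand-tuned threshold like this breaks down quickly on real data, which is the gap a trained matching transform is meant to fill.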
References
* Find Matches Glue Transform
* Glue Learning Transforms
Question # 2
A Machine Learning Engineer is tasked with developing a serverless BI
dashboard on AWS that has ML methods built in. What is the best AWS service
he can choose?
A Google BI integrated with AWS Dash
B Amazon QuickSight
C AWS Tableau
D SageMaker Serverless
Answer B
Amazon QuickSight is a fully managed, serverless business intelligence (BI) service
that includes ML insights and lets a user build and share interactive dashboards.
SageMaker Serverless is not the name of an AWS service (SageMaker offers a Serverless
Inference option, but that is not a BI tool). Tableau Server can be deployed into a
user's virtual private cloud (VPC) using AWS CloudFormation templates, but that is
neither serverless nor the most straightforward choice. AWS Dash is not an Amazon
service.
References
* Amazon QuickSight
* Tableau Server on AWS
Question # 3
Mark is running a small print-on-demand (POD) business. This month he has
been selling an average of 5 T-shirts per day. He is running low on inventory and
he wants to calculate the probability that he will sell more than 10 T-shirts
tomorrow. What probability distribution should he use for that calculation?
A Poisson distribution
B Normal (Gaussian) distribution
C Modified alpha distribution
D Student t-distribution
Answer A
The Poisson distribution is a discrete probability distribution that gives the
probability of a given number of events occurring in a fixed window of time or space.
It assumes that events happen at a constant average rate and independently of each
other.
The Student t-distribution (or simply t-distribution) and the Normal (Gaussian)
distribution are continuous probability distributions, so they are not the natural
model for a count of sales, while the modified alpha distribution does not exist.
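As a worked illustration of the calculation Mark needs, the sketch below computes P(X > 10) for a Poisson variable with a mean of 5 sales per day, using only the standard library. The function names are made up for this example; the rate of 5 and the cutoff of 10 come from the question.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


def prob_more_than(n, lam):
    """P(X > n), computed as 1 minus the cumulative probability up to n."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(n + 1))


p = prob_more_than(10, 5)
print(f"P(X > 10) = {p:.4f}")  # about 0.0137
```

Under these assumptions, the chance of selling more than 10 T-shirts tomorrow is a little over 1%.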
References
* Poisson Distribution Wiki
* Normal Distribution Wiki
Question # 4
The AWS Glue Data Catalog contains references to data that are used as sources
and targets of extract, transform, and load (ETL) jobs in AWS Glue. To create a
data warehouse or data lake, a user must catalog this data. One way to take
inventory of the data in the data store is to run a Glue crawler. Which of the
following is NOT a data store that a crawler can connect to?
A Amazon S3
B Amazon Redshift
C JDBC API
D Amazon ElastiCache
Answer D
A Glue crawler can connect to data stores such as Amazon S3, Amazon RDS, Amazon
Redshift, Amazon DynamoDB, or any database reachable through JDBC (Java Database
Connectivity).
Amazon ElastiCache is an in-memory data store service that a Glue crawler cannot
connect to. (A caveat: a user can write custom Scala or Python code and import custom
libraries and JAR files into Glue ETL jobs to access data sources not natively
supported by AWS Glue, like ElastiCache.)
References
* Glue Data Catalog
* What is Glue?
Question # 5
A Data Scientist is dealing with a binary classification problem with highly
imbalanced classes in a 1:200 ratio. He wants to fit and evaluate a decision tree
algorithm but does not expect it to perform very well on the raw unbalanced
dataset. Which two techniques can he use for data preparation? (Select
TWO.)
A Transform the training data with SMOTE.
B Under-sample the majority (normal) class.
C Use SVM (Support-Vector Machine) Algorithm.
D Normalize features of the majority class.
E Collect more data.
Answer A,B
Transforming the training data with SMOTE (Synthetic Minority Oversampling Technique)
and under-sampling the majority class will work best in this case. The original
SMOTE paper from 2002 is linked below.
SVM is a legitimate ML algorithm, but choosing a different algorithm is not a data
preparation step and will not resolve the class imbalance. Normalizing features of the
majority class and/or collecting more data will not solve the problem either.
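The two chosen techniques can be sketched in plain Python as follows. This is a simplified illustration of the SMOTE idea (interpolating between a minority row and one of its nearest minority neighbours), not the reference implementation from any library. The toy data, the function names, k=5, and the 200/200 target sizes are all assumptions made for the example.

```python
import random


def undersample(majority, n_keep, rng):
    """Randomly keep n_keep majority rows, sampled without replacement."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


rng = random.Random(0)
# Hypothetical toy data with the 1:200 class ratio from the question.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

majority_small = undersample(majority, 200, rng)
minority_big = minority + smote_like(minority, 190, k=5, rng=rng)
print(len(majority_small), len(minority_big))
```

Resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class ratio.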
References
* SMOTE Oversampling
* Original SMOTE Paper