Data Final

Uploaded by

Goutam Thukral
Copyright
© All Rights Reserved

Quiz

Question 1 (1 point)

Which of the following is not an example of data cleaning:


Question 1 options:

Removing entries where there is a negative value in the 'age' column
Rounding up the 'Customer ID' column to the nearest hundred
Deleting the entire 'gender' column because this particular study is not interested in the gender of the
customer
Filtering out and removing all customers over the age of 50 because they are not a target demographic for
this specific study

Question 2 (1 point)

Given this list of prices:

2 1 3 5 5 11 8 10 4 19 9

Using the list of prices above, find the min-max normalized price for
the apples sold at $11.
Question 2 options:

8/18
9/18
10/18
11/18
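The calculation behind Question 2 can be sketched in Python. This is a minimal illustration, assuming the usual min-max formula (x - min) / (max - min); the function name `min_max_normalize` is hypothetical and not part of the quiz.

```python
from fractions import Fraction

prices = [2, 1, 3, 5, 5, 11, 8, 10, 4, 19, 9]

def min_max_normalize(x, values):
    """Scale x to the [0, 1] range using the minimum and maximum of values."""
    lo, hi = min(values), max(values)
    # Fraction keeps the result exact instead of a rounded float.
    return Fraction(x - lo, hi - lo)

normalized = min_max_normalize(11, prices)
print(normalized)  # 5/9, the reduced form of 10/18
```

Here the minimum is 1 and the maximum is 19, so the $11 price maps to (11 - 1) / (19 - 1).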

Question 3 (1 point)
Which measurement does the k-NN algorithm use to
determine the classification of any given data point in the test
data set?
Question 3 options:

Entropy
Probability
Variance
Distance
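A small sketch of how k-NN classifies a point may help with Question 3. It assumes Euclidean distance and simple majority voting; the training records, labels, and the function name `knn_classify` are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(query, training, k):
    """Label query by majority vote among its k nearest training records."""
    # Sort the (point, label) records by Euclidean distance to the query.
    nearest = sorted(training, key=lambda rec: dist(query, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two records of class A, three of class B.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
    ((5.5, 6.5), "B"),
]

print(knn_classify((1.2, 1.4), training, k=3))  # A
print(knn_classify((1.2, 1.4), training, k=5))  # B
```

With k = 3 the two nearby A records win the vote. With k = 5 every training record votes, so the more frequent class B wins even though the query sits next to the A records.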

Question 4 (1 point)

There is no target variable identified in both supervised and
unsupervised data mining because the target variable is
identified based on the results.
Question 4 options:

True
False

Question 5 (1 point)

Given this list of prices:

2 1 3 5 5 11 8 10 4 19 9

Using the list of prices above, find the decimal-scaled price for apples
selling at $12.30/lb.
Question 5 options:

12.3
1.23
0.123
0.0123
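Decimal scaling for Question 5 can be sketched as follows. The sketch assumes the common definition x / 10^d, where d is the number of digits in the largest absolute value of the list (here 19, so d = 2); `decimal_scale` is a hypothetical helper name.

```python
prices = [2, 1, 3, 5, 5, 11, 8, 10, 4, 19, 9]

def decimal_scale(x, values):
    """Divide x by 10**d, where d is the digit count of the largest absolute value."""
    largest = max(abs(v) for v in values)
    d = len(str(int(largest)))  # 19 has two digits, so d = 2
    return x / 10 ** d

scaled = decimal_scale(12.30, prices)
print(round(scaled, 4))  # 0.123
```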

Question 6 (1 point)

Mean-squared Error is used to:


Question 6 options:

Evaluate the relationship between a test data set and training data set
Balance classification models where target variables have a significantly lower frequency than other
classes
Establish baseline performance to determine whether results produced by a data model are within the
confidence interval of expected results
Combine bias and variance to evaluate the accuracy of model estimation for a continuous target
variable
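For reference, mean-squared error is the average of the squared differences between actual and predicted values of a continuous target. The sketch below uses made-up numbers, and `mean_squared_error` is a hypothetical helper, not a library call.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values of a continuous target and a model's estimates.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # (1 + 0 + 4) / 3
```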

Question 7 (1 point)

Which of the following best defines clustering?


Question 7 options:

To group similar records together without the usage of a target variable
Classification of records into groups based on similarity
A hierarchical method of aggregating data into combinations of clusters
An effective method to group data records and used as an alternative to classification by neural network

Question 8 (1 point)

Select the properties of left-skewed data.


Question 8 options:
Positive skewness, and median is smaller than the mean
Positive skewness, and median is greater than the mean
Negative skewness, and median is smaller than the mean
Negative skewness, and median is greater than the mean

Question 9 (1 point)

Unsupervised data mining always requires human input.


Question 9 options:

True
False

Question 10 (1 point)
Note: the computed Z-score is (2 - 7) / 1.0 = -5, which is not among the answer options.

Given this list of prices:


2 1 3 5 5 11 8 10 4 19 9
Using the list of prices above, find the Z-score for the apples sold at $2
given a standard deviation of 1.0.
Question 10 options:

3
1
-1
-3
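The Z-score in Question 10 can be checked with a short sketch. It assumes the mean is taken over the full price list and that the stated standard deviation of 1.0 is used as given; `z_score` is a hypothetical helper name.

```python
from statistics import mean

prices = [2, 1, 3, 5, 5, 11, 8, 10, 4, 19, 9]

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

mu = mean(prices)  # 77 / 11 = 7
z = z_score(2, mu, 1.0)
print(z)  # -5.0
```

Under these assumptions the result is -5, which agrees with the annotation on this question that none of the listed options fits.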

Question 11 (1 point)
In the description task of data mining, analysts do this:
Question 11 options:

Identify methods to describe observed trends and patterns within the data
Describe the data at hand before moving on to the data cleaning and data transformation stage
Perform classification tasks such as creating decision trees and neural networks, and creating a report
based on the results
All of the above

Question 12 (1 point)
Skew = 3 * (Mean – Median) / Standard Deviation.

Given the following data, calculate the skewness of the distribution.


Mean = 50
Median = 45
Standard Deviation = 3

Question 12 options:

-3
3
-5
5
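The formula given in Question 12 translates directly into code. This is only a sketch of that formula; the function name `pearson_skew` is an assumption.

```python
def pearson_skew(mean, median, std_dev):
    """Skew = 3 * (Mean - Median) / Standard Deviation, as stated in the question."""
    return 3 * (mean - median) / std_dev

skew = pearson_skew(mean=50, median=45, std_dev=3)
print(skew)  # 5.0
```

A positive value means the mean lies above the median, which corresponds to right-skewed data.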

Question 13 (1 point)

Choose the answer that fits best. Jim is an IT analyst. For his
current task at work, he would like to assess how many more IT
staff his department should hire and train to keep up with IT
requests, because his company expects to add 100
staff members within the next 6 months. This is an example of:
Question 13 options:

Data Mining
Prediction
Estimation
Clustering

Question 14 (1 point)

In decision trees, the root node is located at the top of the
tree, and branches extend downwards toward decision nodes
that eventually terminate in leaf nodes.
Question 14 options:

True
False

Question 15 (1 point)

When someone wishes to examine variables, investigate
distributions of categorical variables, read histograms of
numeric variables, or look into relationships within a set of
variables, it is a good situation to use:
Question 15 options:

Hypothesis testing
Exploratory data analysis
Supervised data modeling
Data mining

Question 16 (1 point)

The best way to reduce the margin of error is:


Question 16 options:

Increasing the sample size
Decreasing the sample size
Increasing the confidence level
Decreasing the confidence level

Question 17 (1 point)

Overfitting is observed when:


Question 17 options:
Complexity is greater than the optimal level of model complexity, and error rate on the test data set
is increasing
Complexity is greater than the optimal level of model complexity, and error rate on the test data set is
decreasing
Complexity is greater than the optimal level of model complexity, and error rate on the training data set is
increasing
None of the above

Question 18 (1 point)
Not sure if the word “gradual” affects the answer

You can consider two variables 'a' and 'b' to be linearly
correlated if an increase in 'b' is associated with a gradual
increase in 'a'.
Question 18 options:

True
False

Question 19 (1 point)

The level of optimal model complexity can be found at:


Question 19 options:

The point of minimum error rate of the test data set
The point of minimum error rate of the training data set
The point of least distance between the error rate of the test data set and the training data set
The point of maximum distance between the error rate of the test data set and the training data set

Question 20 (1 point)
If the k value is too large in k-NN, which of the following is
likely to happen?
Question 20 options:

Overfitting
Training data set becomes corrupted
The most common class will dominate the classification
None of the above

Question 21 (1 point)

Margin of error can be decreased by:


Question 21 options:

Increasing sample size and increasing confidence level
Increasing sample size and decreasing confidence level
Decreasing sample size and increasing confidence level
Decreasing sample size and decreasing confidence level
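How sample size and confidence level move the margin of error (Questions 16 and 21) can be explored with a short sketch. It assumes the textbook formula E = z * sigma / sqrt(n) for a mean with known standard deviation; the values sigma = 10, n = 100, and the confidence levels are made up, and `margin_of_error` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """E = z * sigma / sqrt(n), with z the two-sided normal critical value."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

base = margin_of_error(sigma=10, n=100, confidence=0.95)
larger_sample = margin_of_error(sigma=10, n=400, confidence=0.95)
lower_confidence = margin_of_error(sigma=10, n=100, confidence=0.90)

# Both changes shrink the margin of error relative to the baseline.
print(larger_sample < base, lower_confidence < base)  # True True
```

Under this formula, quadrupling the sample size halves the margin of error, while lowering the confidence level reduces it through a smaller critical value z.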

Question 22 (1 point)

We should use hypothesis testing instead of exploratory data
analysis when:
Question 22 options:

We want to examine relationships among attributes within a data set
There is a large prediction error after testing against the initial hypothesized relationship between two
variables, which necessitates the forming of a new hypothesis
Many potential outliers are identified
The a priori hypothesis has been identified

Question 23 (1 point)

If the mean is 68 and the margin of error is 1.55, what is your
confidence interval?
Question 23 options:

[66.45, 69.55]
[65.55, 70.45]
[59.13, 78.20]
[0, 1.55]
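The interval in Question 23 follows from point estimate ± margin of error. A minimal sketch, with `confidence_interval` as a hypothetical helper name:

```python
def confidence_interval(point_estimate, margin_of_error):
    """Return (lower, upper) as point estimate minus/plus the margin of error."""
    return (point_estimate - margin_of_error, point_estimate + margin_of_error)

low, high = confidence_interval(68, 1.55)
print(round(low, 2), round(high, 2))  # 66.45 69.55
```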

Question 24 (1 point)

Confidence interval estimate can be defined as:


Question 24 options:

An example of prediction, classification, clustering, and association methods
An interval of numbers produced by a point estimate combined with the associated confidence level
The margin of error in a regression line
The percentage of results which provide normalized data points

Question 25 (1 point)

During data preprocessing, it is best practice to omit records
with missing values in order to ensure that data is clean and
easy to work with.
Question 25 options:

True
False

Question 26 (1 point)

Reasons that data mining has increased in usage across multiple
industries include:
Question 26 options:

Commercialization of products makes it easier for users to find data-driven solutions to problems
Continual technological advancements have made it faster to process more data
External pressure for companies to find advantages over their competitors
All of the above

Question 27 (1 point)

Which of the following is not a common task of data mining?


Question 27 options:

Association
Prediction
Classification
Compilation

Question 28 (1 point)

Given this list of prices:


2 1 3 5 5 11 8 10 4 19 9
Using the list of prices above, calculate the mode price.
Question 28 options:

5
8
7
4

Question 29 (1 point)

Which of the following is most likely to be a duplicate record in
the database and should be removed after further investigation?
Question 29 options:

Two customers have identical first and last names, but different birthdays
Two customers have identical 'Customer ID' fields
Data set with three nominal fields, and each field takes only four values. There are 63 records in total.
All of the above

Question 30 (1 point)
Identify the best tool for determining if a predictor is useful for
predicting a target variable.
Question 30 options:

Overlay Histogram
Directed Web Graph
Contingency Table with Row Percentages
C4.5 algorithm

Question 31 (1 point)

You can consider two variables 'a' and 'b' to be linearly
correlated if an increase in 'b' is associated with a decrease in
'a'.
Question 31 options:

True
False

Question 32 (1 point)

This is a list of prices of apples compiled around the city (in $/lb):
2 1 3 5 5 11 8 10 4 19 9
Using the list of prices above, calculate the mean price.
Question 32 options:

5
8
7
4

Question 33 (1 point)

Multivariate graphs can be used to:


Question 33 options:

Normalize data
Identify and confirm findings from the initial univariate exploration
Uncover new findings that the initial univariate exploration may have missed
All of the above

Question 34 (1 point)

Clustering algorithms cannot use recursive methods of splitting.


Question 34 options:

True
False

Question 35 (1 point)

In k-means clustering, the centroid is used to:


Question 35 options:

Represent the distance between clusters and is set after initial data set partitioning
Represent the center point between clusters and is updated during each pass
Represent the center point of a given cluster and is updated during each pass
Represent the outer bounds of any given cluster and is updated during each pass
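One pass of k-means can be sketched to show what the centroid does in Question 35. The sketch assumes Euclidean distance and mean-based centroids; the points, starting centroids, and the function name `kmeans_pass` are made up for illustration.

```python
from math import dist

def kmeans_pass(points, centroids):
    """Assign each point to its nearest centroid, then recompute each centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    new_centroids = []
    for cluster, old in zip(clusters, centroids):
        if cluster:
            # New centroid is the coordinate-wise mean of the cluster's points.
            new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
        else:
            # Keep the old centroid if no point was assigned to it.
            new_centroids.append(old)
    return new_centroids

points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (9.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

centroids = kmeans_pass(points, centroids)
print(centroids)  # [(1.5, 1.0), (8.5, 9.0)]
```

Repeating `kmeans_pass` until the centroids stop moving gives the usual iterative procedure.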
Question 36 (1 point)
When working with potential outliers in multiple variables, a
good tool to help identify the outliers is a:
Question 36 options:

Histogram
Frequency distribution chart
Scatterplot, 2D
Least squares regression

Question 37 (1 point)

Which of the following is a requirement that must be met in
order to use a decision tree?
Question 37 options:
Target attribute classes must not be discrete, as decision tree logic cannot be applied to continuous target
variables
Training data set must provide the algorithm with target variable values
Both the training and testing dataset must be varied and rich, as this is an unsupervised learning algorithm
All training data must be normalized

Question 38 (1 point)
Clustering uses a combination of classification, estimation, and
prediction in order to segment the entire data set into
subgroups.
Question 38 options:

True
False

Question 39 (1 point)

Given this list of prices:


2 1 3 5 5 11 8 10 4 19 9
Using the list of prices above, calculate the median price.
Question 39 options:

5
8
7
4
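The mean, median, and mode asked for in Questions 32, 39, and 28 can all be checked with Python's standard `statistics` module. This is just a verification sketch over the price list from the quiz.

```python
from statistics import mean, median, mode

prices = [2, 1, 3, 5, 5, 11, 8, 10, 4, 19, 9]

# Sorted: 1 2 3 4 5 5 8 9 10 11 19 (11 values, so the median is the 6th).
print(sorted(prices))
print(mean(prices), median(prices), mode(prices))  # 7 5 5
```

The mean (7) sits above the median (5) because the single high price of 19 pulls the average upward.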

Question 40 (1 point)
Statistical inference is a tool for:
Question 40 options:

Prediction and classification
Classification and clustering
Classification, clustering, and association
Estimation and prediction
