
Reading 4 Big Data Projects - Answers

The document consists of a series of questions and explanations related to big data and machine learning concepts, including data exploration, model evaluation metrics, and characteristics of big data. It covers topics such as feature selection, precision, recall, and the importance of data curation. Each question is accompanied by an explanation that clarifies the correct answer and the reasoning behind it.


Question #1 of 13 Question ID: 1472236

In big data projects, data exploration is least likely to encompass:

A) feature selection.
B) feature engineering.
C) feature design.

Explanation

Data exploration encompasses exploratory data analysis, feature selection, and feature
engineering.

(Module 4.2, LOS 4.d)

Question #2 of 13 Question ID: 1472234

Big data is most likely to suffer from low:

A) veracity.
B) velocity.
C) variety.

Explanation

Big data is defined as data with high volume, velocity, and variety. Big data often suffers from
low veracity, because it can contain a high percentage of meaningless data.

(Module 4.1, LOS 4.a)

Question #3 of 13 Question ID: 1472237

Under which of these conditions is a machine learning model said to be underfit?

A) The model identifies spurious relationships.


B) The model treats true parameters as noise.
C) The input data are not labelled.

Explanation

Underfitting describes a machine learning model that is not complex enough to describe the
data it is meant to analyze. An underfit model treats true parameters as noise and fails to
identify the actual patterns and relationships. A model that is overfit (too complex) will tend
to identify spurious relationships in the data. Labelling of input data is related to the use of
supervised or unsupervised machine learning techniques.

(Module 4.3, LOS 4.f)

Question #4 of 13 Question ID: 1681485

Which of the following uses of data is most accurately described as curation?

A) A data technician accesses an offsite archive to retrieve data that has been stored there.
B) An investor creates a word cloud from financial analysts’ recent research reports about a company.
C) An analyst gathering data for sentiment analysis determines what sources to use.

Explanation

Data collection (curation) is determining the sources of data to be used (e.g., web scraping,
specific social media sites). Word clouds are a visualization technique. Moving data from a
storage medium to where they are needed is referred to as transfer.

(Module 4.1, LOS 4.a)

Question #5 of 13 Question ID: 1472238

When evaluating the fit of a machine learning algorithm, it is most accurate to state that:

A) precision is the percentage of correctly predicted classes out of total predictions.
B) accuracy is the ratio of correctly predicted positive classes to all predicted positive classes.
C) recall is the ratio of correctly predicted positive classes to all actual positive classes.

Explanation

Recall (also called sensitivity) is the ratio of correctly predicted positive classes to all actual
positive classes. Precision is the ratio of correctly predicted positive classes to all predicted
positive classes. Accuracy is the percentage of correctly predicted classes out of total
predictions.

(Module 4.3, LOS 4.c)
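The three definitions above can be sketched as small Python functions over confusion-matrix counts (the names tp, fp, tn, and fn are illustrative, standing for true/false positives and negatives):

```python
def precision(tp, fp):
    # Ratio of correctly predicted positives to all predicted positives.
    return tp / (tp + fp)

def recall(tp, fn):
    # Ratio of correctly predicted positives to all actual positives.
    return tp / (tp + fn)

def accuracy(tp, fp, tn, fn):
    # Percentage of correctly predicted classes out of total predictions.
    return (tp + tn) / (tp + fp + tn + fn)
```

For example, with 8 true positives, 2 false positives, 7 true negatives, and 3 false negatives, precision is 0.8, recall is about 0.73, and accuracy is 0.75.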

Question #6 of 13 Question ID: 1685260

In big data analysis, the three primary tasks involved in data exploration are most accurately
described as:

A) data collection, data curation, and data preparation.
B) exploratory data analysis, feature selection, and feature engineering.
C) data wrangling, data curation, and model training.

Explanation

Data exploration involves three central tasks: exploratory data analysis, feature selection,
and feature engineering. Exploratory data analysis uses visualizations to observe and
summarize data. Feature selection is where only pertinent features from the dataset are
selected for machine learning model training. Feature engineering is the process of creating
new features by changing or transforming existing features.

(Module 4.2, LOS 4.d)

Question #7 of 13 Question ID: 1472232

An executive describes her company's "low latency, multiple terabyte" requirements for
managing Big Data. To which characteristics of Big Data is the executive referring?

A) Volume and velocity.
B) Velocity and variety.
C) Volume and variety.

Explanation
Big Data may be characterized by its volume (the amount of data available), velocity (the
speed at which data are communicated), and variety (degrees of structure in which data
exist). "Terabyte" is a measure of volume. "Latency" refers to velocity.

(Module 4.1, LOS 4.a)

Question #8 of 13 Question ID: 1685261

In big data analysis, the most appropriate method of gaining a high-level picture of the
composition of textual content is through the use of a:

A) scatterplot.
B) histogram.
C) word cloud.

Explanation

Word clouds are an effective way to gain a high-level picture of the composition of textual
content. Histograms, box plots, and scatterplots are common techniques for exploring
structured data.

(Module 4.2, LOS 4.d)

Question #9 of 13 Question ID: 1472235

The process of splitting a given text into separate words is best characterized as:

A) tokenization.
B) stemming.
C) bag-of-words.

Explanation

Text is considered to be a collection of tokens, where a token is equivalent to a word.
Tokenization is the process of splitting a given text into separate tokens. Bag-of-words (BOW)
is a collection of a distinct set of tokens from all the texts in a sample dataset. Stemming is
the process of converting inflected word forms into a base word.

(Module 4.1, LOS 4.g)
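Tokenization and the bag-of-words can be illustrated with plain Python (the sample texts are invented for the example; real text processing would also handle punctuation and stemming):

```python
texts = ["the bond defaults", "the bond pays the coupon"]

# Tokenization: split each text into separate tokens (words).
tokens = [text.lower().split() for text in texts]

# Bag-of-words (BOW): the distinct set of tokens across all texts.
bow = set(word for doc in tokens for word in doc)
```

Here `tokens` preserves each document's word sequence, while `bow` keeps only the distinct vocabulary, discarding order and counts.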


Question #10 - 13 of 13 Question ID: 1472240

Based on Exhibit 1, Karlsson's model's precision is closest to:

A) 91%.
B) 71%.
C) 81%.

Explanation

Precision, the ratio of correctly predicted positive classes (true positives) to all predicted
positive classes, is calculated as:

Precision (P) = TP /(TP + FP) = 307 / (307 + 31) = 0.9083 (91%)

In the context of this default classification, high precision would help us avoid the situation
where a bond is incorrectly predicted to default when it actually is not going to default.

(Module 4.3, LOS 4.c)

Question #11 - 13 of 13 Question ID: 1472241

Karlsson is especially concerned about the possibility that her model may indicate that a bond
will not default, but then the bond actually defaults. Karlsson decides to use the model's recall
to evaluate this possibility. Based on the data in Exhibit 1, the model's recall is closest to:

A) 83%.
B) 73%.
C) 93%.

Explanation

Recall (R) = TP / (TP + FN) = 307 / (307 + 23) = 0.9303 (93%)

Recall is useful when the cost of a false negative is high, such as when we predict that a bond
will not default but it actually will. In cases like this, high recall indicates that false negatives
will be minimized.

(Module 4.3, LOS 4.c)


Question #12 - 13 of 13 Question ID: 1472242

Karlsson would like to gain a sense of her model's overall performance. In her research,
Karlsson learns about the F1 score, which she hopes will provide a useful measure. Based on
Exhibit 1, Karlsson's model's F1 score is closest to:

A) 72%.
B) 82%.
C) 92%.

Explanation

The model's F1 score, which is the harmonic mean of precision and recall, is calculated as:

F1 score = (2 × P × R) / (P + R) = (2 × 0.9083 × 0.9303) / (0.9083 + 0.9303) = 0.9192 (92%)

Like accuracy, the F1 score is a measure of overall performance that gives equal weight to FP
and FN.

(Module 4.3, LOS 4.c)

Question #13 - 13 of 13 Question ID: 1472243

Karlsson also learns of the model measure of accuracy. Based on Exhibit 1, Karlsson's model's
accuracy metric is closest to:

A) 79%.
B) 89%.
C) 69%.

Explanation

The model's accuracy is the percentage of correctly predicted classes out of total predictions.
Model accuracy is calculated as:

Accuracy = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / N
= (307 + 113) / (307 + 31 + 113 + 23) = 420 / 474
= 0.8861 (89%)

(Module 4.3, LOS 4.c)
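The four metrics from Questions #10 through #13 can be reproduced together from the confusion-matrix counts used in the explanations (TP = 307, FP = 31, TN = 113, FN = 23):

```python
TP, FP, TN, FN = 307, 31, 113, 23

precision = TP / (TP + FP)                           # 0.9083 -> 91%
recall = TP / (TP + FN)                              # 0.9303 -> 93%
f1 = 2 * precision * recall / (precision + recall)   # 0.9192 -> 92%
accuracy = (TP + TN) / (TP + FP + TN + FN)           # 0.8861 -> 89%
```

Note that the F1 score (92%) falls between precision (91%) and recall (93%), as the harmonic mean always must.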
