
[IT0089] FUNDAMENTALS OF ANALYTICS MODELING

MODULE 1 – STATISTICAL ANALYSIS

WHY DO WE SUMMARIZE DATA?
- The main objective of summarizing data is to easily identify the characteristics of the data to be processed.
- Summarizing your data will enable you to classify the normal values of your data and uncover the odd values present in your dataset.

WHAT IS RAW DATA?
- Raw data pertains to the collected data before it is processed or ranked.
  o Qualitative raw data
  o Categorical raw data

STATISTICAL ANALYSIS
- Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analysis.

TYPES OF STATISTICS
1. Descriptive Statistics - consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures.
2. Inferential Statistics - consists of methods that use sample results to help make decisions or predictions about a population.

DATA SET
- A collection of observations on one or more variables.

VARIABLE
- A characteristic under study that assumes different values for different elements.

OBSERVATION OR MEASUREMENT
- Pertains to a value of a variable.

QUANTITATIVE VARIABLES
- A variable that can be measured numerically.
- Data collected on a quantitative variable are called quantitative data.
  o Discrete Variables – a variable whose values are countable; it can assume only certain values.
  o Continuous Variables – variables that can take on any value in an interval. Also called float, interval, or numeric.

QUALITATIVE VARIABLES
- A qualitative or categorical variable pertains to a variable whose values cannot be measured numerically.

ORDINAL VARIABLES
- Categorical data that has an explicit ordering.
- Also called an ordered factor.

BINARY VARIABLES
- A special case of categorical data with just two categories (0/1, True or False).
- Also called dichotomous, logical, or indicator.

EXPLORATORY DATA ANALYSIS (EDA)
- Refers to the critical process of performing initial investigations on the data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

MEASURES OF CENTRAL TENDENCY
- A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data are located (i.e., their central tendency).

ARITHMETIC MEAN
- Calculated by adding together the values in the sample; the sum is then divided by the number of items in the sample (the replication).

MEDIAN
- The median is the middle value, taken when you arrange your numbers in order (rank).
- When calculating the median, one must remember the following:
  o Arrange the data values of the given data set in increasing order (from smallest to largest).
  o Find the value that divides the ranked data set into two equal parts.

MODE
- The most frequent value in a sample.
- It is calculated by working out how many there are of each value in your sample; the one with the highest frequency is the mode.
- It is possible to get tied frequencies, in which case you report both values.

TYPES OF MODES
1. UNIMODAL - the distribution is said to be unimodal if it has only one value with the highest number of occurrences.
2. BIMODAL - the data set contains two modes.
3. MULTIMODAL - the distribution has more than two modes.

MEASURES OF DISPERSION
- A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

RANGE
- The difference between the largest and the smallest value in a data set.
- Range = largest value – smallest value

STANDARD DEVIATION
- The standard deviation is used when the data are normally distributed. You can think of it as a sort of "average deviation" from the mean.
- General formula for calculating standard deviation

INTERPRETING STANDARD DEVIATION
- A small standard deviation means that the values in a statistical data set are close to the mean of the data set, on average.
- A large standard deviation means that the values in the data set are farther away from the mean, on average.
- The standard deviation can never be a negative number.
- The smallest possible value for the standard deviation is 0.
- The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set), because it is based on the distance from the mean.
- The standard deviation has the same units as the original data.

VARIANCE
- The overall distance of all the points from your mean.

TWO WAYS TO EXPLORE DISTRIBUTION
1. Graphically – frequency histograms or tally plots draw a picture of the sample shape.
2. Shape statistics – such as skewness and kurtosis.

HISTOGRAM
- A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.

BOXPLOT
- A plot introduced by Tukey as a quick way to visualize the distribution of data.

SKEWNESS
- A measure of how central the average is in the distribution.
- The skewness of a sample is a measure of how central the average is in relation to the overall spread of values.
  o POSITIVELY SKEWED - a positive value indicates that the average is skewed to the left, that is, there is a long "tail" of more positive values.
  o NEGATIVELY SKEWED - a negative value indicates that the average is skewed to the right, that is, there is a long "tail" of more negative values.
- If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.
- If a histogram and frequency distribution are skewed to the right, the value of the mean is the largest among the three and the mode has the lowest value.
- If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value.

KURTOSIS
- A measure of how pointy the distribution is.

WHAT IS STATISTICS?
- Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting empirical data.
- It relies upon the calculation of numbers.
- It relies upon how the numbers are chosen and how the statistics are interpreted.

TYPES OF STATISTICS
1. Descriptive Statistics - describing and summarizing data sets using pictures and statistical quantities.
2. Inferential Statistics - analyzing data sets and drawing conclusions from them.
3. Probability - the study of chance events governed by rules (or laws).

WHAT IS HYPOTHESIS TESTING?
- A statistical procedure for testing whether chance is a plausible explanation of an experimental finding.

NULL HYPOTHESIS
- The null hypothesis is usually denoted by H0 and the alternative hypothesis is denoted by H1.
- The null hypothesis is a statement that is assumed to be true at the beginning of an analysis.
- Suppose you are trying a certain case: initially, the person being questioned is not guilty. This initial verdict is the null hypothesis. The contrary of this claim is the alternative hypothesis.

STUDENT'S T-TEST
- In statistics, Student's t-test is a method developed by William Sealy Gosset used in testing a hypothesis about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown.
  • If the t value < critical value, do not reject the null hypothesis.
  • If the t value > critical value, reject the null hypothesis.

SIGNIFICANCE LEVELS (ALPHA)
- A significance level, also known as alpha or α, is an evidentiary standard that a researcher sets before the study.
- It defines how strongly the sample evidence must contradict the null hypothesis before you can reject the null hypothesis for the entire population.

CONFIDENCE INTERVALS
- The table shows the equivalent confidence levels and levels of significance.
  • If the p value > α, do not reject the null hypothesis.
  • If the p value < α, reject the null hypothesis.

DEGREES OF FREEDOM
- df = N1 + N2 – 2

ASSUMPTIONS IN T-TEST
• Independent observations
• Normally distributed data for each group
• Equal variances for each group

ANOVA
- Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following:
  o a continuous dependent variable, or response variable
  o a discrete independent variable, also called a predictor or explanatory variable

ONE-WAY ANOVA
- Use analysis of variance to test for differences between population means.

TYPE OF PREDICTORS / TYPE OF RESPONSE
- CONTINUOUS response:
  o Categorical predictors – Analysis of Variance (ANOVA)
  o Continuous predictors – Ordinary Least Squares (OLS) Regression
  o Continuous and categorical predictors – Analysis of Covariance (ANCOVA)
- CATEGORICAL response:
  o Categorical predictors – Contingency Table Analysis or Logistic Regression
  o Continuous predictors – Logistic Regression
  o Continuous and categorical predictors – Logistic Regression

RESEARCH QUESTIONS FOR ONE-WAY ANOVA
• Do accountants, on average, earn more than teachers?
• Do people treated with one of two new drugs have higher average T-cell counts than people in the control group?
• Do people spend different amounts depending on which type of credit card they have?

ASSUMPTIONS IN ANOVA
• Observations are independent.
• Errors are normally distributed.
• All groups have equal response variances.

CORRELATION
- Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors, and between predictors and a target variable.
- Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.

CORRELATION COEFFICIENT
- A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
- A scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.

ASSUMPTIONS IN CORRELATION
• The correlation coefficient measures the extent to which two variables are associated with one another.
• When high values of v1 go with high values of v2, v1 and v2 are positively associated.
• When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.
• The correlation coefficient is a standardized metric, so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
• 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.

TYPES OF CORRELATION

SUPPLEMENTARY MATERIALS

Subtopic 1

Every data analysis requires data. Data can be in different forms such as images, text, videos, etc., that are usually gathered from different data sources. Organizations store data in different ways, some through data warehouses, traditional RDBMS, or even through the cloud. With the voluminous amount of data that an organization processes each day, the dilemma of how to start data analysis emerges.

How do we start performing an analysis?

First and foremost, know your data.

To understand your organization's data, there are numerous techniques that can be used. In module 1, the most common techniques will be identified.

What is raw data?

Raw data pertains to the collected data before it's processed or ranked.

Suppose you are tasked to gather the ages (in years) of 50 students in a university. The table below is an example of quantitative raw data.

Another example is gathering the student status of the same 50 students. Now we will have an example of categorical raw data, which is presented in the table below.

The examples above can also be called ungrouped data. An ungrouped data set contains information on each member of a sample or population individually.

Raw data can be summarized through charts, dashboards, tables, and numbers. The most common way to describe raw data is through frequency distributions.

A frequency distribution shows how the frequencies are distributed over various categories. Below is a frequency table that summarizes a survey conducted by Gallup Poll about Worries About Not Having Enough Money to Pay Normal Monthly Bills.

A frequency distribution of a qualitative variable enumerates all the categories and the number of instances that belong to each category.

Let's transform the following responses into a frequency table in order to interpret the data better.
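Building such a frequency table can be sketched in a few lines. This is a minimal example; the responses below are hypothetical, since the actual Gallup answers are not reproduced here:

```python
from collections import Counter

# Hypothetical survey responses (the real Gallup data is not shown here).
responses = ["Worried", "Not worried", "Worried", "Worried",
             "Not worried", "Worried", "Not worried", "Worried"]

# A frequency distribution: each category with its number of instances.
frequency = Counter(responses)
for category, count in frequency.most_common():
    print(category, count)
```

`Counter` enumerates every category and how many instances belong to it, which is exactly what a frequency distribution of a qualitative variable does.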
It is easier to understand the data through a frequency table, isn't it?

In order to study data, in this module we will be using statistics.

Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data,
as well as of making decisions based on such analysis.

Since statistics is a broad body of knowledge, it is divided into two areas: descriptive
statistics and inferential statistics.

Descriptive statistics consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures.

Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.
Basic terms

A variable is a characteristic under study that assumes different values for different elements.

Observation or measurement pertains to a value of a variable.

A data set is a collection of observations on one or more variables.

Customer ID First Name Last Name Address Age


1 Ana Liza Quezon 29
2 Ben Tan Makati 30
3 Rose Park Antipolo 22
4 Dan Mangulabnan Pampanga 23
5 Romeo Pascual San Juan 24
In the above table, the variables are the customer ID, first name, last name, address, and age. The data set consists of all the records from customer 1 to 5. Each record can also be called a measurement or an observation.
Types of Variables

1. Quantitative Variables
It pertains to a variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data.

Examples:
Income, height, gross sales, price of a home, number of cars owned.
Quantitative variables are divided into two types: discrete variables and continuous
variables.

• A discrete variable is a variable whose values are countable. In other words, a discrete variable can assume only certain values.
• A continuous variable is a variable that can assume any numerical value in an interval.

2. Qualitative or Categorical Variables

It pertains to variables whose values cannot be measured numerically.

Population Vs Sample

A population consists of all elements – individuals, items, or objects – whose characteristics are being studied.
A sample is a portion of the population selected for study.

Visual example of Population vs Sample


Let’s proceed to the different ways to group and study data.

Aside from using the frequency table, we can also make use of the different graphs that are
commonly used to visually present data.

The first one is a bar graph. A bar graph is made of bars whose heights represent the frequencies of the respective categories. One type of bar graph is called a Pareto chart.

A Pareto chart is a bar graph wherein the bars are arranged based on their heights, in descending order (largest to smallest).
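The Pareto ordering is just a descending sort of the category frequencies. A minimal sketch, using hypothetical frequencies (the city names are borrowed from the customer table earlier, purely for illustration):

```python
# Hypothetical category frequencies for a bar graph.
freq = {"Quezon": 12, "Makati": 30, "Antipolo": 18, "Pampanga": 25}

# A Pareto chart arranges the bars from largest to smallest frequency.
pareto_order = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
print(pareto_order)
# -> [('Makati', 30), ('Pampanga', 25), ('Antipolo', 18), ('Quezon', 12)]
```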

Another way to present data is through a pie chart. A pie chart is a circle divided into portions that represent frequencies or percentages of a population.
To graph grouped data, we can make use of the following methods:

Grouped data can be presented using histograms. Histograms can be drawn for a frequency distribution. A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis.

The above histogram shows the percentage distribution of annual car insurance premiums in 50
states. The data used to make this distribution and histogram are based on estimates made by
insure.com
Understanding Frequency Distribution Curve

Knowing the meaning of each curve shape in a histogram helps in interpreting a dataset. A histogram can be:
1. Symmetric
2. Skewed
3. Uniform rectangular
A symmetric histogram is a histogram in which both sides are equal.

If a histogram doesn’t have equal sides, it is said to be skewed.

• A skewed-to-the-right histogram has longer tail on the right side.


• A skewed-to-the-left histogram has longer tail on the left.

If the histogram has equal values throughout, it is considered a uniform or rectangular histogram.

Measures of Central Tendency
1. Mean
2. Median
3. Mode
Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation
Examples of statistical analysis application in real life:

1. Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to
the airline industry and to help guitarists make beautiful music.
2. Researchers keep children healthy by using statistics to analyze data from the
production of viral vaccines, which ensures consistency and safety.
3. Communication companies use statistics to optimize network resources, improve
service and reduce customer churn by gaining greater insight into subscriber
requirements.
4. Government agencies around the world rely on statistics for a clear understanding
of their countries, their businesses and their people.
Understanding the measures of central tendencies:

Measures of central tendency are useful in identifying the middle value of histograms and frequency distributions. The methods used to calculate the measures of central tendency determine the typical values that can be found in your data.

What is mean?
The mean, also called the arithmetic mean, is the sum of all the values divided by the number of items added.
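As a quick sketch (the scores below are hypothetical):

```python
# Mean = sum of all values / number of items added.
scores = [85, 89, 99]
mean = sum(scores) / len(scores)
print(mean)  # -> 91.0
```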
What is median?
The median is the middle value, taken when you arrange your numbers in order (rank). This
measure of the average does not depend on the shape of the data.
When calculating the median, one must remember the following:
1. Arrange the data values of the given data set in increasing order. (from smallest to
largest)
2. Find the value that divides the ranked data set in two equal parts.
Example:

Find the median for 2014 compensation:

Step 1: Arrange the values in increasing order

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

Step 2: Identify the center of the data sets.

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

For this example, the middle value is 21.0, therefore it is the median of the data values.

There are instances where the number of data values is even; in this case, the two middle numbers are added together and divided by two.
See the example below:

The following data describes the cell phone minutes used last month by 12 randomly
selected customers:
230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721
Here, we need to arrange the data values first. It will give us:

160 184 201 230 263 326 380 397 510 721 2053 3864

If we observe our data, we have no single central value. To compute the median, we need to identify the two values that divide the data into two equal parts; in our case, these are 326 and 380. We can only have one median per data set, so the median is calculated as:

(326 + 380) / 2 = 353
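Both cases above (odd and even counts) can be captured in one small helper. A sketch, checked against the cell phone minutes example:

```python
def median(data):
    """Sort the values, then take the middle one; with an even count,
    add the two middle values together and divide by two."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

minutes = [230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721]
print(median(minutes))  # -> 353.0
```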
What is mode?

The mode is the value that occurs most frequently in the given dataset.

Let’s identify the most frequent values in the following example:

77 82 74 81 79 84 74 78

In this data set, the value 74 appears twice. Therefore, our mode is 74.

When a dataset has only one value that repeats the most, the distribution is called unimodal. If the distribution has two values that repeat the most, it is called bimodal. If there are more than two modes in a dataset, it is said to be multimodal.
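The mode (including tied modes) can be found by counting value frequencies. A sketch, using the data from the example above:

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest frequency:
    one result -> unimodal, two -> bimodal, more -> multimodal."""
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(modes([77, 82, 74, 81, 79, 84, 74, 78]))  # -> [74]
```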

Relationships among mean, median, and mode:


• If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.

• If a histogram and frequency distribution are skewed to the right, it means that the value of the mean is the largest among the three and the mode has the lowest value. If the mean is the largest, the dataset is sensitive to outliers that occur on the right.

• If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value. In this scenario, the left tail contains the outliers.
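These relationships give a quick numeric skew check: compare the mean against the median. A sketch with a hypothetical data set containing one large outlier on the right:

```python
import statistics

data = [2, 3, 3, 4, 5, 30]       # hypothetical; 30 is a right-side outlier
mean = statistics.mean(data)     # pulled upward by the outlier
med = statistics.median(data)
# Mean above the median suggests a right (positive) skew.
print(mean > med)  # -> True
```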

Measures of Dispersion
If an analyst wants to know how dispersed a dataset is, the methods of calculating the measures of dispersion can be used. Measures of dispersion help in determining how spread out the data values are.
What is range?

Range is the simplest method to compute when measuring the dispersion of data. The range can be obtained by subtracting the smallest value in the dataset from the largest value.

Range = largest value − smallest value
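A sketch, using the 2014 compensation values from the median example above:

```python
# Range = largest value - smallest value.
data = [16.2, 16.9, 19.3, 19.3, 19.6, 21.0, 22.2, 22.5, 28.7, 33.7, 42.1]
data_range = max(data) - min(data)
print(round(data_range, 1))  # -> 25.9
```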

What is standard deviation?

This method is the most commonly used measure of dispersion. The value of standard
deviation tells how closely the values of a data set are clustered around the mean.
The things one must remember when dealing with standard deviation are the following:

1. A lower value of the standard deviation indicates that the values of the data set are spread over a smaller range around the mean.
2. A larger value of the standard deviation indicates that the values of the data set are spread over a relatively larger range around the mean.
3. The standard deviation can be obtained by taking the positive square root of the variance.
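A sketch with hypothetical values, showing that the standard deviation is the positive square root of the variance:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]                # hypothetical sample
var = statistics.variance(data)          # sample variance (n - 1 denominator)
sd = statistics.stdev(data)              # sample standard deviation
print(var)                               # -> 3.5
print(math.isclose(sd, math.sqrt(var)))  # -> True
```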
FORMATIVES B 20
C 30
17/20 D 35
1. A skewed-to-the-right histogram has longer tail E 24
[NONE]
on the left side [TRUE]

2. Boxplot is a plot of frequency table with the


11. Identify whether the statement below is a null
bins on the x-axis and the count (or pro-
hypothesis or alternative hypothesis:
portion) on the y-axis [FALSE]
Bioflu is more effective than aspirin in helping
3. Categorical data can have an explicit ordering
a person who has had a heart attack [NULL
[TRUE]
HYPOTHESIS] - ALTERNATIVE
4. The difference between largest and the smallest
12. Identify whether the statement below is an
value in a data set [RANGE]
example of
A. Positive correlation
5. A data set is a collection of observation on one
B. Negative correlation
or more variables [TRUE]
C. No correlation
The more one runs, the less likely one is to have
6. People’s age is an example of continuous
cardiovascular problems. [POSITIVE
variable. [TRUE] – FALSE - DISCRETE
CORRELATION] - NEGATIVE
7. You conduct a survey where the respondents
could choose from good, better, best, excellent.
13. The scores awarded to 25 students for an
What type of variable should contain this type
assignment were as follows:
of data? [ORDINAL]
4 7 5 9 8 6 7 7 8 5 6 9 8 5 8 7 4 7 3 6 8 9 7 6 9.
What is the mode? [7]
8. The data that is collected before being
14. Inferential statistics is about describing and
processed is called statistical data [FALSE]
summarizing data sets using pictures and
statistical quantities [FALSE] –
9. The value of the mean is ____ to/the value of
DESCRIPTIVE STATISTICS
the mode in the histogram shown below.
15. On the semester final, Joe scored 85, Jill scored
89, and Bill scored a 99. What was the average
score the average score for these students? [91]
16. Identify whether the statement below is an
example of
a. Positive correlation
b. Negative correlation
c. No correlation
As one increases in age, often one’s agility
decreases. [NEGATIVE CORRELATION]
[LESS THAN] 17. Find the median: 12, 5, 9, 18, 22, 25, 5 [12]
18. Identify whether the statement below is a null
10. What is the mode of the following data? hypothesis or alternative hypothesis:
STUDENT SCORE the sky is blue [NULL HYPOTHESIS]
A 10
19. Observe the histogram below. Based on it, how 5. Boxplot is a plot of frequency table with the
many students were greater than or equal to 60 bins on the x-axis and the count (or pro-
inches tall? [6] - 11 portion) on the y-axis [FALSE] -
HISTOGRAM
6. Continuous variables are variables that can
assume any numerical value or interval
[TRUE]
7. Continuous variable is variable whose
values are countable [FALSE] - DISCRETE
8. People’s age is an example of continuous
variable [TRUE] - DISCRETE
9. A population is consisting of all elements –
20. Identify whether the statement below is an individuals, items, objects, whose
example of characteristics are being studied. [TRUE]
a. Positive correlation 10. Match the type of variable of each instance:
b. Negative correlation Number of car accident – DISCRETE
c. No correlation Alicia’s weight during December –
As you drink more coffee, the number of hours you CONTINUOUS
stay awake increases. [POSITIVE Customer feedback - CATEGORICAL
CORRELATION]
11. The standard deviation can never be a
negative number [TRUE]
18/20 12. Inferential statistics is about describing and
summarizing data sets using pictures and
1. Which of the following doesn’t measure statistical quantities [FALSE] -
how spread a data set is? [MODE] DESCRIPTIVE
2. If the distribution has more than two modes, 13. Identify whether the statement below is an
it is called multimodal [TRUE] example of
3. The value of the mean is ____ to/the value d. Positive correlation
of the mode in the histogram shown below. e. Negative correlation
f. No correlation
As you drink more coffee, the number of hours you
stay awake increases. [POSITIVE
CORRELATION]

14. When finding the median, what do you do if


there are two middle numbers?[ADD
THEM TOGETHER AND DIVIDE BY
2]
15. Identify whether the statement below is an
[LESS THAN] example of
d. Positive correlation
4. The data that is collected before being e. Negative correlation
processed is called statistical data [TRUE] f. No correlation
FALSE – RAW DATA
As one increases in age, often one’s agility 7. Categorical data can have an explicit
decreases. [NEGATIVE CORRELATION] ordering [TRUE]
16. To find the average of a set of numbers, add 8. Match the ff:
up all items and divide by _____. [THE
NUMBER OF ITEMS] Size of computer monitor – ordinal –
17. Inferential statistics is the study of chance INTERVAL
events governed by rules [FALSE] - Anime genre – categorical
PROBABILITY
18. When x and y are negatively correlated, as Anime season – interval – ORDINAL
the value of x increases, the value of y tends
9. Which of the ff doesn’t measure how spread
to decrease [TRUE]
a data set is? [MODE]
19. Identify whether the statement below is a
10. What is median of following data?
null hypothesis or alternative hypothesis:
STUDENT SCORE
iPhone 6 Plus weighs 6.77 ounces [NULL
A 10
HYPOTHESIS] B 20
20. Identify whether the statement below is a C 30
null hypothesis or alternative hypothesis: D 35
the sky is blue [NULL HYPOTHESIS] E 24
[24]

11. If you are going to conduct a study if there’s a


16/20
difference between five fertilizers, you use
1. A boxplot is a qualitative variable correlation [TRUE] – FALSE (correlation only
enumerates all the categories and the requires two variables)
number of instances that belong to each
12. The cost of four cell phones is $345, $400,
category. [FALSE] – FREQUENCY
$110, and $640. What is the median cost? [$372.50]
DISTRIBUTION
2. A data set is a collection of observations on 13. A residual is the difference between the
one or more variables [TRUE] observed value of the response and the predict value
3. Continuous variables are variables that can of the response variable [FALSE] - TRUE
assume any numerical value or interval
[TRUE] 14. What is the median of this data set? 7.3, 2.9, 1.5,
4. A data set is a collection of observations on 0.6, 3.8 [2.9]
one or more variables [TRUE]
15. The standard deviation can never be a negative
5. The difference between the largest and the
number [TRUE]
smallest value in a data set [RANGE]
6. What is the mode of the following data? 16. Boxplot is a plot in which the x-axis is the value
STUDENT SCORE of one variable, and the y-axis the value of another.
A 10 [FALSE] - HISTOGRAM
B 20
C 30 17. two-tailed test is a statistical technique used to
D 35 compare the means of two or more groups of
E 24 observations or treatments [FALSE] - ANOVA
[NONE]
18. Identify whether the statement below is a null
hypothesis or alternative hypothesis:
Bioflu is more effective than aspirin in helping 6. The data below is bimodal.
a person who has had a heart attack
Data: 77 82 74 81 79 84 74 82
[ALTERNATIVE HYPOTHESIS]
19. Observe the histogram below. Based on it, how - True
many students were greater than or equal to 60
8. What is the median of the following data?
inches tall? [6] - 11

- 24

9. People’s age is an example of continuous variable


20. Identify whether the statement below is a null
hypothesis or alternative hypothesis: - True – FALSE - DISCRETE
The sky is blue [NULL HYPOTHESIS]
10. It is a portion of the population selected for study.
(Use lowercase for your answer)

- Sample
FORMATIVES
11. If the histogram is skewed right, the mean is greater
1. If a histogram and frequency distribution are skewed than the median.
to the left, the value of the means is the largest.
- True
- False
12. Identify whether the statement below is an example
2. Which of the following is an example of categorical of
raw data?
a) positive correlation
- Collected subjects offered
b) negative correlation
3. You conduct a survey where the respondents could
choose from good, better, best, excellent. What type of c) no correlation
variable should contain this type of data?
The more one eats, the less hunger one will have.
- Ordinal
- Positive correlation – NEGATIVE?
4. Student height is a categorical variable.
14. Boxplot is a plot in which the x-axis is the value of
- True ? one variable, and the y-axis the value of another.

5. Quantitative variable pertains to the variables that - False


values couldn’t be measured.
15.0 indicates no correlation, but be aware that random
- False arrangement of data will produce both positive and
negative values for the correlation coefficient just by
chance
- True - True

16. The alternative hypothesis or research hypothesis 2. It refers to the critical process of performing initial
to spot anomalies, to test hypotheses and to check
assumptions with the help of summary statistics and
graphical representations.

Ha represents an alternative claim about the value of the parameter.
- True
17. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation
As one increases in age, often one's agility decreases.
- Negative correlation
18. Find the median: 15, 5, 9, 18, 22, 25, 5
- 15
19. Observe the histogram below. Based on it, how many students were greater than or equal to 60 inches tall?
(histogram of student heights omitted)
- 11
20. The height of three volleyball players is 210 cm, 220 cm and 191 cm. What is their average height?
- 207

20/20
1. The standard deviation can never be a negative number
- True
3. A variable is a characteristic under study that assumes different values for different elements.
- True
4. The data that is collected before being processed is called statistical data.
- False
5. The smallest possible value for the standard deviation is 1.
- False
6. Observation or measurement pertains to a value of a variable.
- True
7. Customer gender is an example of ordinal data.
- False
8. If the distribution has more than two modes, it is called multimodal.
- True
9. It is a measure of how central the average is in relation to the overall spread of values
- Skewness
10. A special case of categorical with just two categories is called logical.
- False - BINARY
11. Identify whether the statement below is a null hypothesis or alternative hypothesis: Contrary to popular belief, people can see through walls.
- null hypothesis
12. The height of three volleyball players is 210 cm, 220 cm and 191 cm. What is their average height?
- 207
13. 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance
- True
14. Inferential statistics is about describing and summarizing data sets using pictures and statistical quantities
- False - DESCRIPTIVE
15. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation
The more you exercise your muscles, the stronger they get.
- positive correlation
16. Inferential Statistics is about analyzing data sets and drawing conclusions from them
- True
17. A residual is the difference between the observed value of the response and the predicted value of the response variable.
- True
18. The smallest possible value for the standard deviation is 0.
- True
19. When conducting a two-tailed t-test, the data is normally distributed.
- True
20. When x and y are positively correlated, as the value of x increases, the value of y tends to increase as well.
- True
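The arithmetic items above can be double-checked in a few lines of Python using the standard library's statistics module (a quick sketch, not part of the reviewer itself):

```python
# Check the quiz arithmetic: median of the item-18 list, the volleyball
# players' average height, and the fact that a standard deviation is never
# negative (its smallest possible value is 0).
from statistics import mean, median, pstdev

assert median([15, 5, 9, 18, 22, 25, 5]) == 15   # sorted: 5, 5, 9, 15, 18, 22, 25
assert mean([210, 220, 191]) == 207              # (210 + 220 + 191) / 3
assert pstdev([210, 220, 191]) >= 0              # standard deviation >= 0 always
print("all quiz arithmetic checks out")
```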
[IT0089] MODULE 2
OVERVIEW OF ANALYTICS LIFE CYCLE
• Singapore's Supply and Demand gap on a typical day
• Microsoft's Voice Recognition
• Gmail's Spam
• Instagram algorithm
• Weather Forecast
What exactly is data mining?
Data Mining is defined as a process used to extract usable insights from a larger set of any raw data.
- It implies analyzing data patterns in large batches of data using one or more software.
KEY FEATURES OF DATA MINING
- Automatic pattern predictions based on trend and behavior analysis
- Prediction based on likely outcomes
- Creation of decision-oriented information
- Focus on large data sets and databases for analysis
- Clustering based on finding and visually documenting groups of facts not previously known
KEY TERMS
- A data set may have attribute variables and target variable(s)
- The values of the attribute variables are used to determine the values of the target variable(s)
- Attribute variables and target variables may also be called independent variables and dependent variables, respectively, to reflect that the values of the target variables depend on the values of the attribute variables
DATA PATTERNS LEARNED IN DATA MINING
➢ Classification and prediction patterns
➢ Cluster and association patterns
➢ Data reduction patterns
➢ Outliers and anomaly patterns
➢ Sequential and temporal patterns
Data-mining approaches can be separated into two categories.
TWO MAJOR APPROACHES IN DATA MINING
1. Supervised learning – the desired output is known.
2. Unsupervised learning – it is used against data that has no historical labels.
DATA ANALYTICS LIFE CYCLE
- Big data analysis differs from traditional data analysis primarily due to the volume, velocity, and variety characteristics of the data being processed.
WHAT IS BIG DATA?
- Normal data comes from your traditional relational database system
- Big data comes from different data sources. Data are usually in petabytes or terabytes
CHARACTERISTICS OF BIG DATA
1. VOLUME
- big data starts as low as 1 terabyte and it has no upper limit
2. VELOCITY
- big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second
3. VARIETY
- big data is composed of unstructured, semi-structured, and structured datasets
HOW CAN YOU ENSURE THE SUCCESS OF AN ANALYTICS PROJECT?
People – first, you need the right people
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT
• Business User – understands the domain area
• Project Sponsor – provides requirements (monetary, equipment, etc.)
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction, supports the analytics sandbox
• Data Scientist – provides analytic techniques, data and modeling
DIFFERENT ANALYTICAL LIFE CYCLE
DATA LIFE CYCLE ACCORDING TO SAS
1ST PHASE: ASK
- Whether you're a data scientist developing a new model to reduce churn, or a business executive wanting to improve the customer experience, this phase defines what your business needs to know
2ND PHASE: PREPARE
- This phase is both critical to success and frustratingly time-consuming. You have data sitting in databases, on desktops or in Hadoop, plus you want to capture live-streaming data.
3RD PHASE: EXPLORE
- In this phase, you'll search for relationships, trends and patterns to gain a deeper understanding of your data.
- You'll also develop and test hypotheses through rapid prototyping in an iterative process.
4TH PHASE: MODEL
- This is the phase where you code your own models using R or Python or an interactive predictive software like SAS
- Organizations must be able to identify the appropriate algorithm to analyze the available data.
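The MODEL phase above mentions coding your own models in R or Python. As a minimal illustration, a least-squares line can be fitted in plain Python with the closed-form formulas; the data points here are invented:

```python
# Minimal sketch of the MODEL phase: fit a least-squares line y = a + b*x
# in plain Python. The data points are made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form ordinary-least-squares estimates of slope (b) and intercept (a).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
print(f"fitted line: y = {a:.2f} + {b:.2f}*x")
```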
5TH PHASE: IMPLEMENT
- After validating the accuracy of the model generated during the previous phase, it must be implemented in the organization.
- Models are usually deployed by automating common manual tasks
6TH PHASE: ACT
- In this phase, we enable two types of decisions: operational decisions that are automated, and strategic decisions where individuals make a long-term impact.
7TH PHASE: EVALUATE
- Organizations must always monitor the performance of the model generated.
- When the performance starts to degrade below the acceptance level, the model can be recalibrated or replaced with a new model.
8TH PHASE: ASK AGAIN
- The marketplace changes. Your business changes. And that's why your analytics process occasionally needs to change.

KDD
- The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.
- It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.
STEPS OF THE KDD PROCESS
1. Developing an understanding of
   a. The application domain
   b. The relevant prior knowledge
   c. The goals of the end-user
2. Creating a target data set: selecting a data set or focusing on a subset of variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing
   a. Removal of noise or outliers
   b. Collecting necessary information to model or account for noise
   c. Strategies for handling missing data fields
   d. Accounting for time sequence and known changes
4. Data reduction and projection
   a. Finding useful features to represent the data depending on the goal of the task
5. Choosing the data mining task
   a. Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm
   a. Selecting method(s) to be used for searching for patterns in the data
   b. Deciding which models and parameters may be appropriate
   c. Matching a particular data mining method with the overall criteria of the KDD process
7. Data mining
   a. Searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns
9. Consolidating discovered knowledge

CRISP-DM
- The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance
1ST PHASE: BUSINESS UNDERSTANDING
- This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
2ND PHASE: DATA UNDERSTANDING
- The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.
3RD PHASE: DATA PREPARATION
- The data preparation phase covers all activities needed to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data
- Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.
4TH PHASE: MODELING
- In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values
- Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, going back to the data preparation phase is often necessary.
5TH PHASE: EVALUATION
- Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives.
- A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
6TH PHASE: DEPLOYMENT
- Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
- In many cases, it is the customer, not the data analyst, who carries out the deployment steps.

SUBTOPIC 1
WHAT IS DATA MINING?
- Data mining aims at discovering useful data patterns from massive amounts of data.
ARTIFICIAL INTELLIGENCE
- is the broad science of mimicking human abilities.
- AI is the science of training machines to perform human tasks
MACHINE LEARNING
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
ASK YOURSELF!
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
- Classifying emails as spam or not spam. (T)
- Watching you label emails as spam or not spam. (E)
- The number (or fraction) of emails correctly classified as spam/not spam. (P)
- None of the above—this is not a machine learning problem.
STANDARD EXAMPLES OF MACHINE LEARNING
• Smart Tagging
• Product Recommendations
• Priority Filtering
• Spam Filtering
• Text Processing
• Speech Recognition
• Face Recognition
Artificial Intelligence – a technique which enables machines to mimic human behavior
Machine Learning – subset of AI techniques which use statistical methods to enable machines to improve with experience
Deep Learning – subset of ML which makes the computation of multi-layer neural networks feasible
ASPECTS OF BUSINESS ANALYTICS
Forecasting – leveraging historical time series data to provide better insights into decision-making about the future
Data Mining – perform predictive analytics and pattern discovery techniques to address numerous business problems
Text Analytics – finding treasures in unstructured data like social media or survey tools which could uncover insights about consumer sentiment
Optimization – analyze massive amounts of data in order to accurately identify decisions which are likely to produce the most optimal results
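Mitchell's E/T/P definition above can be made concrete for the spam example: P is simply the fraction of emails classified correctly. A toy sketch with invented labels:

```python
# Toy illustration of Mitchell's framing for the spam example.
# T: classifying emails as spam or not spam
# E: the labels you provided while marking emails
# P: fraction of emails classified correctly (all data here is invented)
true_labels = ["spam", "ham", "spam", "ham", "ham"]
predicted   = ["spam", "ham", "ham",  "ham", "ham"]

correct = sum(t == p for t, p in zip(true_labels, predicted))
P = correct / len(true_labels)
print(f"performance P = {P:.2f}")  # 4 of 5 correct -> performance P = 0.80
```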
DATA MINING CAN BE PERFORMED IN THE FOLLOWING DATA TYPES:
• Relational databases
• Data warehouses
• Advanced DB and information repositories
• Object-oriented and object-relational databases
• Transactional and spatial databases
• Heterogeneous and legacy databases
• Multimedia and streaming databases
• Text databases
• Text mining and web mining
IMPACT OF ANALYTICS
(figure omitted)
DATA MINING TRENDS
• Ongoing research to ensure that analysts have access to modern techniques which are robust and scalable
• Innovative computational implementations of existing analytical methods
• Creative applications of existing methods to solve new and different problems
• Integration of methods from multiple disciplines to provide targeted solutions
THE ROLE OF BIG DATA IN DATA MINING
(figure omitted)
SPECIFIC APPLICATION
• Attrition/Churn prediction
• Propensity to buy/avail of a product or service
• Cross-sell or up-sell probability
• Next-best offer
• Time-to-event modeling
• Fraud detection
• Revenue/profit predictions
• Probability of default in credit risk assessment
MAJOR TYPES OF DATA PATTERNS
• CLASSIFICATION AND PREDICTION
- Allow us to classify or predict values of target variables from values of attribute variables.
• CLUSTER AND ASSOCIATION
- Cluster patterns give groups of similar data records such that data records in one group are similar but have larger differences from data records in another group.
- Association patterns are established based on co-occurrences of items in data records
• DATA REDUCTION PATTERNS
- Data reduction patterns look for a small number of variables that can be used to represent a data set with a much larger number of variables
• OUTLIER AND ANOMALY PATTERNS
- Outliers and anomalies are data points that differ largely from the norm of data
• SEQUENTIAL AND TEMPORAL PATTERNS
- Sequential and temporal patterns reveal patterns in a sequence of data points.
- If the sequence is defined by the time over which data points are observed, we call the sequence of data points a time series.
DATA MINING TECHNIQUES
- CLASSIFICATION
- CLUSTERING
- REGRESSION
- OUTLIER DETECTION
- SEQUENTIAL PATTERNS
- PREDICTION
- ASSOCIATION RULES
CHALLENGES
- Skilled experts are needed to formulate the data mining queries.
- Overfitting: due to a small training database, a model may not fit future states.
- Data mining needs large databases which sometimes are difficult to manage
- Business practices may need to be modified to determine how to use the information uncovered.
- If the data set is not diverse, data mining results may not be accurate.
- Integrating information needed from heterogeneous databases and global information systems could be complex
ADVANTAGES OF DATA MINING
- Data mining techniques help companies to get knowledge-based information.
- Data mining helps organizations to make profitable adjustments in operation and production.
- Data mining is a cost-effective and efficient solution compared to other statistical data applications.
- Data mining helps with the decision-making process.
- Facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
- It can be implemented in new systems as well as existing platforms.
- It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.
DISADVANTAGES OF DATA MINING
- There are chances that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
- Many data mining analytics software is difficult to operate and requires advance training to work on.
- Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
- The data mining techniques are not accurate, and so can cause serious consequences in certain conditions.
INDUSTRIES THAT UTILIZE DATA MINING
• Communications
• Insurance
• Education
• Manufacturing
• Banking
• Retail
• Service providers
• E-commerce
• Supermarkets
• Crime
• Bioinformatics
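The outlier and anomaly patterns described above can be illustrated with one common rule of thumb: flag values more than two standard deviations from the mean (the convention and data here are chosen for illustration only):

```python
# Hypothetical outlier check: flag points more than 2 population standard
# deviations away from the mean. The data values are invented.
from statistics import mean, pstdev

data = [10, 12, 11, 13, 12, 11, 40]   # 40 differs largely from the norm
m, s = mean(data), pstdev(data)
outliers = [x for x in data if abs(x - m) > 2 * s]
print(outliers)
```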
SUBTOPIC 2
SUPERVISED LEARNING
- the desired output is known.
- also known as predictive modeling
- uses patterns to predict the values of the label on additional unlabeled data
- used in applications where historical data predicts likely future events
USE OF DATA IN SUPERVISED LEARNING
- We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into partitions:
  o TRAINING DATASET - Consists of the data used to build the candidate models.
  o TEST DATASET - The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
- If there is only one dataset, it may be partitioned into training and test sets.
- The basic assumption is that the training and test sets are produced by independent sampling from an infinite population.
SUPERVISED LEARNING EXAMPLE TECHNIQUE
Classification Tree
- Partition a data set of observations into increasingly smaller and more homogeneous subsets.
- At each iteration, a subset of observations is split into two new subsets based on the values of a single variable.
- Series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity.
Logistic Regression
- Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.
UNSUPERVISED LEARNING
- used against data that has no historical labels
- the goal is to explore the data and find some structure within
- There is no right or wrong answer
UNSUPERVISED LEARNING EXAMPLE TECHNIQUES
Self-organizing maps
(figure omitted)
Nearest-neighbor mapping
- k-nearest neighbors (k-NN): This method can be used either to classify an outcome category or predict a continuous outcome.
  o k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
  o When k-NN is used as a classification method, a new observation is classified as Class 1 if the percentage of its k nearest neighbors in Class 1 is greater than or equal to a specified cut-off value (e.g. 0.5).
  o When k-NN is used as a prediction method, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors
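The k-NN classification rule above can be sketched in a few lines of Python; the training points, k, and cut-off value are invented for illustration:

```python
# Sketch of k-NN classification as described above: a new point is assigned
# Class 1 when at least a cut-off fraction (0.5 here) of its k nearest
# neighbors are Class 1. Distance is Euclidean; training data is invented.
from math import dist

train = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((5.0, 5.0), 1),
         ((6.0, 5.5), 1), ((5.5, 6.0), 1)]

def knn_classify(point, k=3, cutoff=0.5):
    neighbors = sorted(train, key=lambda rec: dist(rec[0], point))[:k]
    share_class1 = sum(label for _, label in neighbors) / k
    return 1 if share_class1 >= cutoff else 0

print(knn_classify((5.2, 5.1)))  # its 3 nearest neighbors are all Class 1
```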
K-Nearest neighbor
- To classify an outcome, the training set is searched for the one that is "most like" it. This is an example of "instance based" learning. It is "rote learning", the simplest form of learning
Clustering
- A definition of clustering could be "the process of organising objects into groups whose members are similar in some way".
WHAT IS WEKA?
- Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.
Logistic Regression in Weka
• Open diabetes dataset.
• Click Classify Tab
• Choose Classifier: functions>Logistic
• Use Test Options: Use Training Set
• Press Start
Trees in Weka
• Open weather dataset
• Click Classify Tab
• Choose Classifier: trees>J48
• Use Test Options: Use Training Set
• Press Start
• Right-click result from Result List for options
• Choose Visualize tree
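The logistic regression idea noted earlier (a categorical outcome y = 0 or 1 modeled through a linear function of explanatory variables) can be sketched with the sigmoid function; the coefficients below are invented, not fitted:

```python
# Sketch of the logistic-regression idea: a linear function of an explanatory
# variable passed through the sigmoid yields P(y = 1). Coefficients invented.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_proba(x, intercept=-4.0, coef=1.0):
    return sigmoid(intercept + coef * x)

# Classify as 1 when the predicted probability reaches the 0.5 cut-off.
print(predict_proba(6.0) > 0.5)
```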
FORMATIVES

19/20
1. Ordinal data can also be called as ordered factor [TRUE]
2. Which of the following is not included on the analytical life cycle defined by SAS? [INTEGRATION]
3. The last phase of the analytical life cycle defined by SAS is implementation [FALSE] – ask again
4. It is a role that is responsible for collecting, analyzing, and interpreting large amount of data [DATA SCIENTIST]
5. Data preparation is the first step in CRISP-DM [FALSE] – Business Understanding
6. These are values that lie away from the bulk of the data [OUTLIERS]
7. It refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods [KDD]
8. A key objective is to determine if there is some important business issue that has not been sufficiently considered [TRUE]
9. Identify the third step in CRISP-DM [DATA PREPARATION]
10. It is defined as heterogeneous data from multiple sources combined in a common source [DATA INTEGRATION]
11. Machine learning is a subset of artificial intelligence [TRUE]
12. Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task. [TRUE]
13. Supposed you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach would you use? [UNSUPERVISED LEARNING] – Supervised learning
14. Data mining implies analyzing data patterns in large batches of data using one or more software [TRUE]
15. It uses statistical methods to enable machines to improve with experience [MACHINE LEARNING]
16. In this approach in data mining, only input data will be given [UNSUPERVISED LEARNING]
17. One of the advantages of data mining is that there are chances that companies may sell useful information of their customers to other companies for money [FALSE] – disadvantage
18. Data mining cannot be used on spatial databases [FALSE]
19. Regression is a famous supervised learning technique [TRUE]
20. The goal of unsupervised learning is to explore the data and find some structure within [TRUE]

17/20
1. Reviewing process is under what phase in CRISP-DM? [BUSINESS UNDERSTANDING] – EVALUATION
2. Data preparation is the first step in CRISP-DM [FALSE] – Business Understanding
3. Business user provides business domain expertise based on deep understanding of the data [FALSE] – Business Intelligence Analyst
4. Selecting data mining technique is under what phase in CRISP-DM? [MODELING]
5. Interpreting mined patterns concludes the KDD process [TRUE] – FALSE – Consolidating discovered knowledge
6. The last phase of the analytical life cycle defined by SAS is implementation [FALSE] – ask again
7. Depending on the requirements, the deployment phase can be as simple as generating a report [TRUE]
8. Evaluation is the last phase in CRISP-DM [FALSE] – Deployment
9. It is the role that ensures the progress of any project [PROJECT MANAGER]
10. A special case of categorical with just two categories [CATEGORICAL] – BINARY
11. Artificial intelligence is a subset of machine learning [FALSE]
12. Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications or a Java API [TRUE]
13. Regression is a famous supervised learning technique [TRUE]
14. Machine learning is a subset of artificial intelligence [TRUE]
15. A supervised learning algorithm learns from labeled training data, helps you to predict outcomes for unforeseen data [TRUE]
16. Supervised learning goal is to determine the function so well that when a new input data set is given, it can predict the output [TRUE]
17. In this approach in data mining, input variables and output variables will be given [SUPERVISED LEARNING]
18. Data mining is a cost-effective and efficient solution compared to other statistical data applications. [TRUE]
19. Association rule algorithm is an example of what approach in data mining? [UNSUPERVISED LEARNING]
20. It makes the computation of multi-layer neural network feasible [DEEP LEARNING]

16/20
1. This is the type of learning which uses patterns to predict the values of the label on additional unlabeled data [SUPERVISED LEARNING]
2. Unsupervised machine learning finds all kinds of unknown patterns in data [TRUE]
3. Data mining focuses on small data sets and databases for analysis [TRUE] – FALSE
4. Which of the following is not considered as a data mining technique? [KURTOSIS]
5. Data mining can be performed on web mining data [TRUE]
6. Data mining cannot be used on spatial databases [FALSE]
7. Data mining is the broad science of mimicking human abilities [FALSE] – AI
8. Association rule algorithm is an example of what approach in data mining? [SUPERVISED LEARNING] – UNSUPERVISED
9. There is no right or wrong on predictive modeling [FALSE]
10. It is used against data that has no historical labels [UNSUPERVISED LEARNING]
11. It is defined as heterogeneous data from multiple sources combined in a common source [DATA INTEGRATION]
12. It refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods [KDD]
13. This role is responsible for creating database environment for analytic projects [DATABASE ADMINISTRATOR]
14. Selecting data mining technique is under what phase in CRISP-DM? [BUSINESS UNDERSTANDING] – MODELING
15. It is the practice of science and technology that is dedicated to building and data-handling problems that arise due to high volume of data [DATA SCIENCE] ?
16. CRISP-DM stands for _______ [CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING]
17. Evaluation is the last phase in CRISP-DM [FALSE] – DEPLOYMENT
18. Which of the following is not considered as a phase of CRISP-DM? [DATA SELECTION]
19. These are values that lie away from the bulk of the data [OUTLIERS]
20. This role provides the funding when doing analytical project [PROJECT SPONSOR]

16/20
1. This phase starts with initial data collection. [DATA UNDERSTANDING]
2. It is the role that ensures the progress of any project. [PROJECT MANAGER]
3. A special case of categorical with just two categories. [BINARY]
4. Creating target datasets also includes collecting necessary information to model or account for noise. [FALSE]
5. It is a role that is responsible for collecting, analyzing, and interpreting large amount of data. [DATA SCIENTIST]
6. This role is responsible for creating database environment for analytic projects. [DATABASE ADMINISTRATOR]
7. Ordinal data can also be called as ordered factor. [TRUE]
8. Identify the third step in CRISP-DM. [DATA PREPARATION]
9. Which of the following is not included on the analytical life cycle defined by SAS? [INTEGRATION]
10. The last phase of the analytical life cycle defined by SAS is implementation. [FALSE] – Deployment
11. In this approach in data mining, input variables and output variables will be given. [SUPERVISED LEARNING]
12. Artificial intelligence is a subset of machine learning. [FALSE]
13. Supposed you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach should you use? [UNSUPERVISED LEARNING] ?
14. Association rule algorithm is an example of what approach in data mining? [SUPERVISED LEARNING] – UNSUPERVISED
15. Data mining cannot be used on text databases. [FALSE]
16. A data value that is very different from most of the data. [DATA FRAME] ?
17. Self-organizing maps are example of supervised learning. [TRUE] – FALSE
18. Unsupervised machine learning finds all kinds of unknown patterns in data. [TRUE]
19. Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the experience E in this setting? [WATCHING YOU LABEL EMAILS AS SPAM OR NOT SPAM]
20. Logistic regression is classified as supervised learning. [TRUE]

In this phase you'll also develop and test hypotheses through rapid prototyping in an iterative process.
- Explore
It is a comprehensive data mining methodology and process model that provides anyone – from novice to data mining experts – with a complete blueprint for conducting a data mining project.
- CRISP-DM
CRISP-DM stands for ____________________________.
- Cross-industry standard process for data mining
It is a role that is responsible for collecting, analyzing, and interpreting large amount of data.
- Data Scientist
It is the aspect of business analytics that finds patterns in unstructured data like social media or survey tools which could uncover insights about consumer sentiment
- Data Mining ?
One of the benefits of data mining is overfitting.
- False
It is the phase of CRISP-DM where analysts review the steps executed.
- Evaluation
Ordinal data can also be called as ordered factor.
- True
Identify the third step in CRISP-DM.
- Data Preparation
In this approach in data mining, input variables and output variables will be given.
- Supervised Learning
Self-organizing maps are example of supervised learning.
- False
Logistic regression is classified as supervised learning.
- True
Association rule algorithm is an example of what approach in data mining?
- Unsupervised Learning
Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.
- True
Unsupervised machine learning finds all kinds of unknown patterns in data
- True
All data is unlabeled and the algorithms learn the inherent structure from the input data. This statement pertains to ___________.
- UnSupervised Learning ?
Random forest is an example of supervised learning.
- True
Unsupervised methods help you to find features which can be useful for categorization.
- True
Which of the following is not included on the analytical life cycle defined by SAS?
- Integration
These are values that lie away from the bulk of the data.
- Outliers
CRISP-DM stands for ____________________________.
- Cross-industry standard process for data mining
All data is labeled and the algorithms learn to predict the output from the input data. This statement pertains to ___________. (Use lowercase for your answer)
- supervised
It makes the computation of multi-layer neural network feasible.
- Deep Learning
Unsupervised methods help you to find features which can be useful for categorization.
- True
Logistic regression is classified as supervised learning.
- True
Supposed you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach should you use?
- Supervised learning
One of the challenges of data mining is that it can be implemented in new systems as well as existing platforms.
- False - ADVANTAGES

A key objective is to determine if there is some important business issue that has not been sufficiently considered.
- True
All data is labeled and the algorithms learn to predict the output from the input data. This statement pertains to ___________. (Use lowercase for your answer)
- supervised Learning

17/20
This role provides the funding when doing analytical project.
- Project Sponsor
It is a role that is responsible for collecting, analyzing, and interpreting large amount of data.
- Data Scientist
Based on the analytical life cycle defined by SAS, this phase has two types of decisions: operational and strategic.
- Act
Interpreting mined patterns concludes the KDD process.
- True
It is the aspect of business analytics that finds patterns in unstructured data like social media or survey tools which could uncover insights about consumer sentiment.
- Text analytics
This is the first step in KDD process.
- Data Selection
In this phase, you'll search for relationships, trends and patterns to gain a deeper understanding of your data.
- Explore
One of the advantages of data mining is that there are chances that companies may sell useful information of their customers to other companies for money.
- False - DISADVANTAGES
Which of the following is not considered as a data mining technique?
- Kurtosis
Supposed your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the performance measure P in this setting?
- The number of emails correctly classified as spam/not spam
Self-organizing maps are example of supervised learning.
- False

SUPPLEMENTARY MATERIALS

In this module, we've discussed the concept of data mining and its application in real life. Data Mining is defined as a process used to extract usable data from a larger set of any raw data.

It implies analyzing data patterns in large batches of data using one or more software. It can be implemented in new systems as well as existing platforms.

The increase in the use of data-mining techniques in business has been caused largely by three events.
• The explosion in the amount of data being produced and electronically tracked
• The ability to electronically warehouse these data
• The affordability of computer power to analyze the data
What exactly is Supervised Learning?

In Supervised learning, you train the


machine using data which is well
"labeled." It means some data is already
tagged with the correct answer. It can be
compared to learning which takes place in
the presence of a supervisor or a teacher.

A supervised learning algorithm learns from


labeled training data, helps you to predict
outcomes for unforeseen data. Successfully
building, scaling, and deploying accurate
supervised machine learning Data science
model takes time and technical expertise
from a team of highly skilled data scientists.

Moreover, Data scientist must rebuild


models to make sure the insights given
remains true until its data changes.

How does Supervised Learning Works?

For example, you want to train a machine to


help you predict how long it will take you to
drive home from your workplace. Here, you
start by creating a set of labeled data.

This data includes:

• Weather conditions
• Time of the day
• Holidays
Example scenario:

You instinctively know that if it's raining outside, then it will take you longer to drive
home. But the machine needs data and statistics.

Let's see now how you can develop a supervised learning model of this example which helps the user to determine the commute time. The first thing you require to create is a training data set. This training set will contain the total commute time and corresponding factors like weather, time, etc. Based on this training set, your machine might see there's a direct relationship between the amount of rain and the time you will take to get home.

So, it ascertains that the more it rains, the longer you will be driving to get back to your home.
It might also see the connection between the time you leave work and the time you'll be on the
road.

The closer you're to 6 p.m. the longer time it takes for you to get home. Your machine may find
some of the relationships with your labeled data.

This is the start of your Data Model. It begins to see how rain impacts the way people drive. It also starts to see that more people travel during a particular time of day.
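The training-set idea above can be sketched as a tiny least-squares model; the rainfall figures and commute times below are made-up illustration data, not from the module:

```python
# Fit a simple linear model commute_time = b0 + b1 * rainfall (least squares),
# from a small labeled training set (inputs: rainfall in mm; labels: minutes).
rainfall = [0, 2, 4, 6, 8]        # hypothetical observed rainfall
commute = [30, 34, 38, 42, 46]    # hypothetical observed commute times

n = len(rainfall)
x_bar = sum(rainfall) / n
y_bar = sum(commute) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(rainfall, commute)) \
     / sum((x - x_bar) ** 2 for x in rainfall)
b0 = y_bar - b1 * x_bar

# The model has learned the pattern: more rain -> longer drive home.
print(b0, b1)        # 30.0 2.0 on this data
print(b0 + b1 * 5)   # predicted commute for 5 mm of rain: 40.0
```

This is the same closed-form least squares estimation covered in Module 3; here it just "learns" from the labeled examples, which is what makes it supervised.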

Famous Supervised Learning Techniques

In both regression and classification, the goal is to find specific relationships or structure in the input data that allow us to effectively produce correct output data.

What exactly is Unsupervised Learning?

Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabeled data.

Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning, although unsupervised learning can be more unpredictable compared with other learning methods such as deep learning and reinforcement learning.

Here are the prime reasons for using Unsupervised Learning:

1. Unsupervised machine learning finds all kinds of unknown patterns in data.
2. Unsupervised methods help you to find features which can be useful for categorization.
3. It takes place in real time, so all the input data can be analyzed and labeled in the presence of learners.
4. It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.

How does Unsupervised Learning work?

Sample scenario:
Let's, take the case of a baby and her family dog.

She knows and identifies this dog. A few weeks later a family friend brings along a dog and
tries to play with the baby.

The baby has not seen this dog before. But she recognizes many features (2 ears, eyes, walking on 4 legs) that are like her pet dog. She identifies the new animal as a dog. This is unsupervised learning, where you are not taught but you learn from the data (in this case, data about a dog). Had this been supervised learning, the family friend would have told the baby that it's a dog.

To further understand the differences between the two methods, observe the table below.
Method
• Supervised: Input variables and output variables will be given.
• Unsupervised: Only input data will be given.

Goal
• Supervised: To determine the function so well that when new input data is given, it can predict the output.
• Unsupervised: To model the hidden patterns or underlying structure in the given input data in order to learn about the data.

Class
• Supervised: Machine learning problems, data mining and neural networks
• Unsupervised: Machine learning problems, data mining and neural networks

Examples
• Supervised: Classification and Regression (e.g., linear regression, support vector machine)
• Unsupervised: Clustering and Association (e.g., k-means, association rules)

Who uses
• Supervised: Data scientists
• Unsupervised: Data scientists

Eco-systems
• Supervised: Big data processing, data mining, etc.
• Unsupervised: Big data processing, data mining, etc.

Uses
• Supervised: Often used for expert systems in image recognition, speech recognition, forecasting, financial analysis, and for training neural networks and decision trees.
• Unsupervised: Unsupervised learning algorithms are used to pre-process the data, during exploratory analysis, or to pre-train supervised learning algorithms.
M3- Main
(Introduction to Regression Analysis)
What is Regression Analysis?

• A technique of studying the dependence of one variable (called the dependent variable) on one or more other variables (called explanatory variables), with a view to estimating or predicting the average value of the dependent variable in terms of the known or fixed values of the independent variables.
When do you use regression?
• Estimate the relationship that exists, on the average, between the dependent variable and the
explanatory variable
• Determine the effect of each of the explanatory variables on the dependent variable, controlling the
effects of all other explanatory variables
• Predict the value of dependent variable for a given value of the explanatory variable
Understanding regression model based on the concept of slope
• The mathematics of the slope-intercept form is similar to the regression model.
y = mx + b
• And when using the slope intercept formula, we focus on the
two constants (numbers) m and b.
• m describes the slope or steepness of the line, whereas
• b represents the y-intercept or the point where the graph crosses the y-axis.
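A quick sketch of what m and b mean, using illustrative values only:

```python
# Slope-intercept form y = m*x + b: a minimal sketch.
def line(m, b, x):
    """Evaluate y = m*x + b."""
    return m * x + b

# b is the y-intercept: the value of y when x = 0.
assert line(2, 1, 0) == 1
# m is the slope: the change in y for each one-unit step in x.
assert line(2, 1, 3) - line(2, 1, 2) == 2
```

The regression model below follows the same form, with the parameters estimated from data instead of chosen by hand.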
Regression Model
• The situation using the regression model is analogous
to that of the interviewers, except instead of using
interviewers, predictions are made by performing a
linear transformation of the predictor variable.
• The prediction takes the form
y = ax + b
• where a and b are parameters in the regression model
Parameters
• Dependent variable or response: Variable being predicted
• Independent variables or predictor variables: Variables being used to predict the value of the dependent
variable.
• Simple regression: A regression analysis involving one independent variable and one dependent
variable.
• In statistical notation:
y = dependent variable
x = independent variable
Types of regression

Who uses regression?


• Data analysts
• Market researchers
• Professors
• Data scientists
Advantages of using regression
• It indicates the strength of impact of multiple independent variables on a dependent variable
• It indicates the significant relationships between dependent variable and independent variable.
Regression Pitfalls

• Overfitting
It occurs when the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting on the training set.

• Excluding Important Predictor


Variables
The linear association between
two variables ignoring other
relevant variables can differ both
in magnitude and direction from
the association that controls for
other relevant variables.

• Extrapolation
It refers to estimates and
predictions of the target variable
made using the regression
equation with values of the
predictor variable outside of the
range of the values of x in the
data set

• Missing Values
Missing data has the potential to
adversely affect a regression
analysis by reducing the total
usable sample size.

• Power and sample size


In small datasets, a lack of
observations can lead to poorly
estimated models with large
standard errors.
M3S1
(Overview of regression)
What is Regression analysis?
- Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables.
- It can be utilized to assess the strength of the relationship between variables and for modeling the
future relationship between them.
Linear Regression
-It is a model that tests the relationship between a dependent variable and a single independent variable.
-It can also be described using the following expression:

𝑦 = 𝑎 + 𝑏𝑋 + 𝜖
Where:
• y – dependent variable
• X – independent (explanatory) variable
• a – intercept
• b – slope

• 𝜖 – residual (error)
Linear Model Assumptions
1. The dependent and independent variables show a linear relationship between the slope and the
intercept.
2. The independent variable is not random.
3. The value of the residual (error) is zero.
4. The value of the residual (error) is constant across all
observations.
5. The value of the residual (error) is not correlated across all
observations.
6. The residual (error) values follow the normal distribution.
Example Problem:
If you want to know the strength of relationship between House Price and Square feet, you can use
regression.
Step 1:
Identify the dependent and independent variables.
Step 2:
Run regression analysis on the data using any system that offers statistical analysis. For this example, we
can use Microsoft Excel with the help of Data Analysis Tool pack.
Note: The Data Analysis Tools must be enabled manually in your Excel through Options.
Step 3:
Analyze the results. Take note of the values for coefficients.
Step 4:
Substitute the values of the coefficients to the formula mentioned previously.
Ex:

𝒉𝒐𝒖𝒔𝒆 𝒑𝒓𝒊𝒄𝒆 = 𝟗𝟖. 𝟐𝟓 + 𝟎. 𝟏𝟎𝟗𝟖(𝒔𝒒. 𝒇𝒕.)

• 98.25- value of intercept from the results provided in Step 3


• 0.1098 - value of X Variable 1 from the results provided in Step 3.
Step 5:
Insert the value for sq. ft. to predict for the house price.

𝒉𝒐𝒖𝒔𝒆 𝒑𝒓𝒊𝒄𝒆 = 𝟗𝟖. 𝟐𝟓 + 𝟎. 𝟏𝟎𝟗𝟖(𝒔𝒒. 𝒇𝒕. )

𝒉𝒐𝒖𝒔𝒆 𝒑𝒓𝒊𝒄𝒆 = 𝟗𝟖. 𝟐𝟓 + 𝟎. 𝟏𝟎𝟗𝟖 (𝟏𝟓𝟎𝟎)

𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅 𝒉𝒐𝒖𝒔𝒆 𝒑𝒓𝒊𝒄𝒆 = 𝟐𝟔𝟐. 𝟗𝟓

• 1500- example value for sq.ft.
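Steps 4-5 can be checked with a few lines of Python, using the example coefficients above:

```python
# Predict house price from the fitted simple regression:
# house_price = intercept + slope * sq_ft
intercept = 98.25   # "Intercept" coefficient from the regression output (Step 3)
slope = 0.1098      # "X Variable 1" coefficient from the regression output

def predict_house_price(sq_ft):
    return intercept + slope * sq_ft

print(round(predict_house_price(1500), 2))  # 262.95, matching Step 5
```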


Multiple linear
• Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
𝑦 = 𝑎 + 𝑏𝑋1 + 𝑐𝑋2 + 𝑑𝑋3 + 𝜖
Interpretation
• Interpretation of slope coefficient βj : Represents the change in the mean value of the dependent
variable y that corresponds to a one unit increase in the independent variable xj , holding the values of
all other independent variables in the model constant.
• The multiple regression equation that describes how the mean value of y is related to x1 , x2 , . . . , xq :
E( y | x1 , x2 , . . . , xq ) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq
Multiple Linear Regression Model

• Estimated multiple regression equation: ŷ = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq
• b0, b1, b2, . . . , bq = the point estimates of β0, β1, β2, . . . , βq
• ŷ = estimated value of the dependent variable
• The least squares method is used to develop the estimated multiple regression equation:
• Finding b0, b1, b2, . . . , bq that satisfy min Σ (yi − ŷi)² = min Σ ei².
• Uses sample data to provide the values of b0, b1, b2, . . . , bq that minimize the sum of squared residuals.
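A minimal sketch of using an estimated multiple regression equation for prediction. The coefficients below are taken from the exam-score example in the formative exercises later in these notes (intercept 67.67, 5.56 per study hour, −0.60 per prep exam):

```python
# Estimated multiple regression equation: y_hat = b0 + b1*x1 + b2*x2
b0, b1, b2 = 67.67, 5.56, -0.60   # intercept, hours studied, prep exams taken

def predict_exam_score(hours, prep_exams):
    return b0 + b1 * hours + b2 * prep_exams

# Holding prep_exams constant, each extra study hour adds b1 = 5.56 points.
print(round(predict_exam_score(10, 4), 2))  # 120.87
```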
Nonlinear Regression
• Nonlinear regression is a regression in which the dependent or criterion variables are modeled as
a non-linear function of model parameters and one or more independent variables.
• Nonlinear regression can be modeled with several equations similar to the one below:
Regression Analysis
• Regression is a technique used for forecasting, time series modeling and finding the causal effect between the variables.
Why use regression?
1) Prediction of a target variable (forecasting).
2) Modeling the relationships between the dependent variable and the explanatory variable.
3) Testing hypothesis.
M3S2
(Linear Regression)
Linear Regression
• The whole process of linear regression is based on the fact that there exists a relation between the
independent variables and dependent variable.
Simple Linear Regression
• Regression Model: The equation that describes how y is related to x and an error term.
• Simple Linear Regression Model:
y = β0 + β1 x + ε
• Parameters: The characteristics of the population, β0 and β1
• Random variable - Error term, ε
• The error term accounts for the variability in y that cannot be explained by the linear relationship
between x and y.
• Regression equation: The equation that describes how the expected value of y, denoted E(y), is related
to x.
• Regression equation for simple linear regression: E(y|x) = β0 + β1x
• E(y|x) = expected value of y for a given value of x
• β0 = y-intercept of the regression line
• β1 = slope
• The graph of the simple linear regression equation is a straight line.
Least Square Method
• Least squares method: A procedure for using sample data to find the estimated regression equation.
• Here, we will determine the values of b0 and b1 .
• Interpretation of b0 and b1 :
• The slope b1 is the estimated change in the mean of the dependent variable y that is associated
with a one unit increase in the independent variable x.
• The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0.
• Least squares method equation: min Σ (yi − ŷi)²
• yi = observed value of the dependent variable for the ith observation
• ŷi = predicted value of the dependent variable for the ith observation
• n = total number of observations
• ith residual: The error made using the regression model to estimate the mean value of the dependent
variable for the ith observation.

• We are finding the regression that minimizes the sum of squared errors.
• Least squares estimates of the regression parameters:
• Slope equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• y-intercept: b0 = ȳ − b1x̄
• xi = value of the independent variable for the ith observation
• yi = value of the dependent variable for the ith observation
• x̄ = mean value for the independent variable
• ȳ = mean value for the dependent variable
• n = total number of observations
• Experimental region: The range of values of the independent variables in the data used to estimate
the model.
• The regression model is valid only over this region.
• Extrapolation: Prediction of the value of the dependent variable outside the experimental region.
• It is risky
Multiple Linear
• It is a statistical technique that uses several independent variables to predict the dependent variable.
• It is the extension of linear regression.
Difference
• Linear Regression only deals with one independent variable.

• Multiple linear Regression with more than one independent variable.


Case Study
You want to know the effect of violence, stress, social support on internalizing behavior.
Details about the study
• Participants were children aging from 8 to 12
• Lived in high-violence areas, USA
• Hypothesis: violence and stress lead to internalizing behavior, whereas social support would reduce
internalizing behavior.
Parameters
• Predictors
• Degree of witnessing violence
• Measure of life stress
• Measure of social support
• Outcome
• Internalizing behavior (i.e. depression, anxiety)
Test for overall significance
• Shows if there is a linear relationship between all of the X variables taken together and Y
• Hypothesis:

Interpretations
• Slopes for Witness and Stress are positive, but slope for Social Support is negative
• If you had subjects with identical stress and social support, a one unit increase in Witness would
produce .038 unit increase in internalizing symptoms
• If Witness = 20, Stress = 5 and SocSupport = 35, then we would predict that the internalizing symptoms would be .012.
M3S3
(Logistic Regression)
Logistic Regression
- It is a statistical technique used to develop predictive models with categorical dependent
variables having dichotomous or binary outcomes.
- Similar to linear regression, the logistic regression models the relationship between the
dependent variable and one or more independent variables.
Graph of Logistic Regression

Logistic Regression Assumption


• Logistic regression measures the relationship between the categorical dependent variable and one or
more independent variables by estimating probabilities using a logistic function, which is the cumulative
logistic distribution.
• There are other variants of logistic regression that focus on modeling a categorical variable with three
or more levels, say X, Y, and Z and a few others.
Concept of Probability
- Unlike linear regression, logistic regression model is concern for the log of odds ratio or the
probability of the event to happen.
- Everything starts with the concept of probability.
Example Scenario
Let's say that the probability of success of some event is 0.8.
Then the probability of failure is 1 – 0.8 = 0.2.
The odds of success are defined as the ratio of the probability of success over the probability of failure.
Logistic Regression Equation
In our example, the odds of success are 0.8/0.2 = 4. That is to say that the odds of success are 4 to 1. If
the probability of success is 0.5, that is, a 50-50 percent chance, then the odds of success are 1 to 1.
- The equation of logistic regression can be defined as follows:
log(p / (1 − p)) = b0 + b1x1 + ∙ ∙ ∙ + bqxq
- To predict the probability of the event happening, we can further solve the preceding equation as follows:
p = 1 / (1 + e^−(b0 + b1x1 + ∙ ∙ ∙ + bqxq))
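The odds calculation from the example scenario, and the logistic (sigmoid) transformation used to turn the linear term into a probability, can be sketched as follows; the coefficients b0 and b1 here are placeholders, not fitted values:

```python
import math

# Probability -> odds, as in the example: p = 0.8 gives odds 0.8/0.2 = 4.
def odds(p):
    return p / (1 - p)

# Logistic regression models the log of the odds as a linear function,
# log(p / (1 - p)) = b0 + b1*x, so the probability is the sigmoid of it.
def predicted_probability(b0, b1, x):
    z = b0 + b1 * x
    return 1 / (1 + math.exp(-z))

print(round(odds(0.8), 10))             # 4.0 (odds of success are 4 to 1)
print(odds(0.5))                        # 1.0 (a 50-50 chance: odds 1 to 1)
print(predicted_probability(0, 0, 0))   # 0.5 when the linear term is zero
```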
Maximum-Likelihood estimation
- It is a method of estimating the parameters of a statistical model with given data.
- The method of maximum likelihood selects the set of values of the model parameters that
maximizes the likelihood function, that is, it maximizes the “agreement” of the selected model
with the observed data
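A minimal illustration of the maximum-likelihood idea, using a simple Bernoulli (success/failure) sample rather than the full logistic-regression likelihood; the outcomes below are made-up:

```python
# Maximum likelihood for a Bernoulli parameter p, by brute-force search:
# the likelihood of the data is the product of p (for each success)
# and 1 - p (for each failure); the maximizing p is the sample proportion.
data = [1, 1, 1, 0, 1]  # hypothetical outcomes: 4 successes, 1 failure

def likelihood(p):
    out = 1.0
    for y in data:
        out *= p if y == 1 else 1 - p
    return out

# Search a fine grid of candidate values for p.
candidates = [i / 1000 for i in range(1, 1000)]
p_hat = max(candidates, key=likelihood)
print(p_hat)  # 0.8 — the sample proportion 4/5
```

In practice, the logistic-regression parameters are found the same way in spirit: numerically searching for the coefficient values that maximize the likelihood of the observed data.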
Building Logistic Regression Model
- You can perform logistic regression using R by using the glm() function.
- The family = "binomial" command tells R to use the glm function to fit a logistic regression
model. (The glm() function can fit other models too; we'll look into this later.)
Interpreting Results
• A positive estimate indicates that, for every unit increase of the respective independent variable, there
is a corresponding increase in the log of odds ratio and the other way for a negative estimate.
• Along with the independent variables, we also see 'Intercept'. The intercept is the log of odds of the event (Good or Bad Quality) when all the categorical predictors have a value of 0.
• We can see the standard error, z value, and p-value along with an asterisk indication to easily identify
significance.
• We then determine whether the estimate is truly far away from 0. If the standard error of the estimate
is small, then relatively small values of the estimate can reject the null hypothesis.
• If the standard error is large, then the estimate should also be large enough to reject the null hypothesis.
Testing the Significance
• To test the significance, we use the 'Wald Z Statistic' to measure how many standard deviations the
estimate is away from 0.
• The significance of the estimate can be determined if the probability of the event happening by
chance is less than 5%.
Two ways to validate model accuracy
• Confusion Matrix
• ROC curve

What is Confusion Matrix?


- A confusion matrix is a table used to analyze the performance of a model (classification).
- Each column of the matrix represents the instances in a predicted class while each row represents the
instances in an actual class or vice versa.
What is ROC Curve?
- The Receiver Operating Characteristic (ROC) curve is a standard technique to summarize
classification model performance over a range of trade-offs between true positive (TP) and false
positive (FP) error rates.
- The ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly)
versus 1-specificity for the possible cutoff classification probability values.
Columns in Confusion Matrix
• True Positive (TP): When it is predicted as TRUE and is actually TRUE
• False Positive (FP): When it is predicted as TRUE and is actually FALSE
• True Negative (TN): When it is predicted as FALSE and is actually FALSE
• False Negative (FN): When it is predicted as FALSE and is actually TRUE
Confusion Matrix Formula
An exhaustive list of metrics that are usually computed from the confusion matrix to aid in interpreting
the goodness of fit for the classification model are as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
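Plugging in the counts from the formative exercise later in these notes (TP = 20, TN = 27, FP = 18, FN = 25), the accuracy formula works out as follows; sensitivity and specificity are added here as standard companion metrics:

```python
# Confusion-matrix metrics; counts taken from the formative exercise.
TP, TN, FP, FN = 20, 27, 18, 25

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct predictions / all predictions
sensitivity = TP / (TP + FN)                 # true positive rate (recall)
specificity = TN / (TN + FP)                 # true negative rate

print(f"{accuracy:.0%}")      # 52%
print(round(sensitivity, 3))  # 0.444
print(round(specificity, 3))  # 0.6
```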
Interpreting ROC Curve

- Interpreting the ROC curve is again straightforward. The ROC curve visually helps us understand how our model compares with a random prediction.
- A random prediction will always have a 50% chance of predicting correctly; after comparing our model with this baseline, we can understand how much better our model is.
- The diagonal line indicates the accuracy of random predictions and the lift from the diagonal line
towards the left upper corner indicates how much improvement our model has in comparison to
the random predictions.
- Models having a higher lift from the diagonals are considered to be more accurate models.
Formative assessment 3
26/30
1. It is computed as the ratio of the two odds. [Odds ratio]
2. It pertains to the accuracy of the provisional model is not as high on the test set as it is on the
training set. [OVERFITTING]
3. Supposed you want to train a machine to help you predict how long it will take you to drive
home from your workplace using regression. What type of data mining approach should you
use? [supervised Learning]
4. There should not be any multicollinearity between the independent variables in the model, and
all independent variables should be independent to each other. [TRUE]
5. In logistic regression, it is that the target variable must be discrete and mostly binary or
dichotomous. [TRUE]
6. The explanatory variable is the variable being predicted. [FALSE]
7. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear]
8. In regression, the value of the residual (error) is one. [FALSE]
9. Multiple linear regressions are classified as supervised learning. [TRUE]
10. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [Origin of the coffee]
11. It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression. [Data
Analysis Tool Pack]
12. It is the range of values of the independent variables in the data used to estimate the model.
[Experimental region]
13. It is essentially similar to the simple linear model, with the exception that multiple independent
variables are used in the model. [Multiple Regression]
14. The graph of the simple multiple regression equation is a straight line. [TRUE]
15. Researcher question: Do fourth graders tend to be taller than third graders?

This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the explanatory variable on this study? [grade level]

16. It is used to explain the variability in the response variable. [Error term]
17. Given the results for multiple linear regression below. Predict the exam score if a student spent 10 hours in studying and had 4 prep exams taken. [67.67 + 5.56 × (number of hours) + (−0.60 × 4)]

[120.87]

18. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [origin of the coffee]
19. It is a statistical technique that uses several independent variables to predict the dependent
variable. [Multiple linear regression]
20. Researcher question: Do fourth graders tend to be taller than third graders?

This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the response variable on this study? [height]
21. Logistic regression is used to predict continuous target variable. [FALSE]
22. Given the results for multiple linear regression below. Identify the estimated regression
equation.
[exam score = 5.56*hours + prep_exams*-0.60 + 67.67]
23. y= a+ bX_1+cX_2+dx_3+ 𝜖
the formula above is used to model non linear regression. [FALSE]
24. When it is predicted as TRUE and is actually FALSE. [FALSE POSITIVE]
25. Multiple linear regression is classified as unsupervised learning. [FALSE]
26. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear regression]
27. Which of the following evaluation metrics can’t be applied in the case of logistic regression
output to compare with the target? [mean squared error]
28. Compute for the accuracy of the model depicted in the confusion matrix below;
TP= 20
TN= 27
FP=18
FN=25
[52%]
29. Nonlinear is the extension of linear regression. [FALSE] - multiple
30. Provide one (1) way to validate the accuracy of the logit model [ROC CURVE]

28/30 Formative 3
1. It is used to develop the estimated multiple regression [Least squared method]
2. It is used to measure the binary or dichotomous classifier performance visually and Area Under
Curve (AUC) is used to quantify the model performance [ROC Curve]
3. If a predictor variable X is found to be highly significant, we could conclude that: [changes in X
are associated to changes in Y]
4. Which of the following is not a reason when to use regression? [When you aim to know the
products that are usually bought together]
5. A group of middle school students wants to know if they can use height to predict age, what is
the response variable in this study? [Age]
6. It can be utilized to assess the strength of the relationship between variables and for modeling the
future relationship between them. [Regression Analysis]
7. Two variables are correlated if there is a linear association between them. If not, the variables are
uncorrelated. [TRUE]
8. In linear models, the dependent and independent variables show a linear relationship between the
slope and the intercept. [TRUE]
9. What type of relationship is shown in the graph below?
[Positive linear relationship]
10. It is a point that lies far away from the rest. [OUTLIERS]
11. What type of relationship is shown in the graph below?

[Negative Linear Relationship]


12. Formula below describes:

[Linear regression]
13. Logistic regression is classified as one of the example of unsupervised learning [false]
14. Logistic regression is used to predict continuous target variable [FALSE]
15. In the equation Y = a + bX + €, A and B are considered as independent variables. [FALSE]


16. A regression line is used for all of the following except one. Which one is not a valid use of a
regression line? [To predict the value of Y for an individual, given that individual’s X-
value.]
17. It is considered as the log of odds of the event [Intercept] – probability
45.What type of relationship is shown in the What is considered as the parameter/s?
graph above?
- B, x
Fat intake – No relationship
FORMATIVE 25/30
46. In the equation Y =a + bX + €
It is a model that tests the relationship between a
A and b are considered as independent dependent variable and a single independent
variables. – False variable.

47.Provide one (1) way to validate the accuracy - Linear regression


of a logit model. - ROC MODEL
Supposed you want to train a machine to help you
48. A regression line is used for all of the predict how long it will take you to drive home from
following except one. Which one is not a valid your workplace using regression. What type of data
use of a regression line? – to predict the value mining approach should you use?
of Y for an individual given that individual’s X-
- Supervised Learning
value.
In small datasets, a lack of observations can lead to
49. it is considered as the log odds of the event.
poorly estimated models with large standard errors.
– Intercept
- True
50. A linear regression analysis was conducted
to predict the company sales. Below are the Multiple linear regression is classified as supervised
results. Identify the linear regression equation learning.
assuming that x is the parameter for
advertising. - True

It is computed as the ratio of the two odds.

- Odds ratio

Multilinear regression is a regression in which the


dependent or criterion variables are modeled as a
non-linear function of model parameters and one or
- sales=23.42+intercept more independent variables.

51. Compute for the accuracy of the model - False


depicted in the confusion matrix below;
Regression is a technique used for forecasting, time
TP = 20 series modeling and finding the casual effect
between the variables.
TN = 27
- False
FP = 18
It is used to compare two nested generalized linear
FN = 25 models.

- 52% - Likelihood ratio test

52. In the equation The lower is the AUC value, the worse is the model
predictive accuracy.
Y =a + bX + €
- True - exam score = 5.56*hours + prep_exams*-0.60 +
67.67

It is the variable being manipulated by researchers.

- Explanatory variable
Which of the following is not a reason when to use
regression?

Which of the following is not a reason to use regression?
- When you aim to know the products that are usually bought together.

It is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables.
- Nonlinear Regression

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
- Multiple Regression

There should not be any multicollinearity between the independent variables in the model, and all independent variables should be independent of each other.
- True

If a predictor variable x is found to be highly significant we would conclude that: a change in x causes a change in y.
- False (changes in x are associated with changes in y)

Overfitted data can significantly lose predictive ability due to an erratic response to noise, whereas underfitted data will lack the accuracy to account for the variability of the response in its entirety.
- True

The formula below describes y=a+bX_1+cX_2+dX_3+ϵ.
- Multiple linear regression

Regression is a technique used for forecasting, time series modeling, and finding the causal effect between the variables.
- False

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pack

A group of middle school students wants to know if they can use height to predict age. The explanatory variable is height.
- True

Multiple linear regression can be expressed as y=a+bX+ϵ.
- False

The _____________ are defined as the ratio of the probability of success over the probability of failure. (Use lowercase for your answer.)
- odds ratio

Given the results for multiple linear regression below. Identify the estimated regression equation.
- exam score = 5.56*hours + prep_exams*-0.60 + 67.67

Multiple linear regression is classified as unsupervised learning.
- False

The formula above is used to model nonlinear regression.
- False

Logistic regression is classified as supervised learning.
- True

If a predictor variable x is found to be highly significant we would conclude that:
- changes in x are associated with changes in y

It is a model that tests the relationship between a dependent variable and a single independent variable.
- Linear Regression

When it is predicted as TRUE and is actually TRUE.
- True Positive

If a predictor variable x is found to be highly significant we would conclude that: a change in y causes a change in x.
- False (changes in x are associated with changes in y)

It is the variable being predicted.
- Dependent variable

When it is predicted as FALSE and is actually FALSE.
- True Negative

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous.
- True

Response variable is the variable being manipulated by researchers.
- False

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pak

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
- Multiple regression

FORMATIVE 25/30
A linear regression analysis was conducted to predict the company sales. Below are the results. Identify the linear regression equation assuming that x is the parameter for advertising.
- sales = 23.42x + intercept

It pertains to the accuracy of the provisional model is not as high on the test set as it is on the training set. (Use UPPERCASE for your answer.)
- OVERFITTING

Extrapolation of the range of values of the independent variables in the data used to estimate the model.
- True

The higher is the AUC value, the better is the model predictive accuracy.
- True

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous.
- True

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pack

The explanatory variable is the variable being predicted.
- False

Multiple linear regression is classified as unsupervised learning.
- False

A researcher believes that the origin of the beans used to make a cup of coffee affects hyperactivity. He wants to compare coffee from three different regions: Africa, South America, and Mexico. What is the response variable in this study?
- Hyperactivity level

When it is predicted as FALSE and is actually FALSE.
- True Negative

The graph of the simple linear regression equation is a straight line.
- True

Logistic regression measures the relationship between the ____________ dependent variable and one or more independent variables. (Use lowercase for your answer.)
- categorical

A group of middle school students wants to know if they can use height to predict age. The explanatory variable is height.
- True

Which of the following evaluation metrics can't be applied in the case of logistic regression output to compare with the target?
- Mean-Squared-Error

Given the results for multiple linear regression below. Identify the estimated regression equation.
- exam score = 5.56*hours + prep_exams*-0.60 + 67.67
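The estimated equation exam score = 5.56*hours + prep_exams*(-0.60) + 67.67 can be evaluated directly. A minimal sketch (the helper name is hypothetical); with 7 hours of study and 2 prep exams it reproduces the 105.39 quoted later in the reviewer:

```python
# Evaluate the fitted multiple regression equation from the reviewer:
# exam score = 67.67 + 5.56*hours - 0.60*prep_exams
def predict_exam_score(hours, prep_exams):
    return 67.67 + 5.56 * hours - 0.60 * prep_exams

# 5.56*7 = 38.92;  38.92 - 0.60*2 + 67.67 = 105.39
print(round(predict_exam_score(7, 2), 2))  # 105.39
```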

There should not be any multicollinearity between the independent variables in the model, and all independent variables should be independent of each other.
- True

Logistic regression is classified as one of the examples of unsupervised learning.
- False

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
- Multiple linear regression

Which of the following methods do we use to best fit the data in Logistic Regression?
- Maximum Likelihood

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
- Multiple Regression

What type of relationship is shown in the graph below?

The graph of the simple linear regression equation is a straight line.
- True

Multiple linear regression can be expressed as y=a+bX+ϵ.
- False
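The odds referred to in the odds-ratio item are p/(1−p), and logistic regression models the log of these odds. A minimal illustration (the helper name is hypothetical); a fair coin with p = 0.5 gives odds of 1:

```python
import math

# Odds = P(success) / P(failure) = p / (1 - p).
def odds(p):
    return p / (1 - p)

print(odds(0.5))                      # 1.0  (fair coin: odds of heads are 1)
print(round(math.log(odds(0.5)), 4))  # 0.0  (log-odds, i.e. the logit)
```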
2. Overfitted data can significantly lose the
predictive ability due to an erratic response
to noise whereas underfitted will lack the
accuracy to account for the variability in
response in its entirety.
• True

3. A group of middle school students wants to


know if they can use height to predict age.
What is the response variable in this study
• Height
• The type of school
• Age
(https://online.stat.psu.edu/stat200/lesson/1/1.1/1.1.2)

• The number of students

-Undefined 4. Multilinear regression is a regression in


which the dependent or criterion variables
are modeled as a non-linear function of
model parameters and one or more
A regression line is used for all of the following independent variables.
except one. Which one is not a valid use of a • False
regression line?

-to determine if a change in X causes a change in


Y. 5. It is the variable being manipulated by
researchers.
• Explanatory Variable
• Factor
• Response Variable
• Dependent Variable

6. Multiple linear regression is classified as


supervised learning.
• True

7. Research question: Do fourth graders


24/30 tend to be taller than the third graders?

1. Regression is a famous supervised learning This is an observational study. The researcher


technique. wants to use grade level to explain differences
• True
in height. What is the response variable in this • The cups of coffee taken
study?
• Origin of the coffee
• Gender
• Hyperactivity level
• Age
• Grade level
13. 1The value of the residual (error) is not
• Height correlated across all observations.
• True

8. If p value for model fit is less than 0.5, 14. There should not be any multicollinearity
then signify that our full model fits between the independent variables in the
significantly better than our reduced model, and all independent variables should
model. be independent to each other
• False
• False

9. It is used to compare two nested 15. It is the variable being predicted.


generalized linear models. • Explanatory variable
• Odds ratio • Dependent variable
• Hosmer-lemeshow test • Independent variable
• Likelihood ratio test • Factor
• ROC Curve
16. The graph of the simple linear regression
10. In linear models, the dependent and equation is a straight line
independent variables show a linear • True
relationship between the slope and the
intercept 17. It is the tool pack in Microsoft Excel that can
be downloaded to perform linear regression.
• True • Data Analysis Tool Pack
11. Supposed you want to train a machine • Test Mining Tool Pack
to help you predict how long it will take • Regression Tool Pack
you to drive home from your workplace. • Data Mining Tool Pack
What type of data mining approach
should you use 18. It is a point that lies far away from the rest.
• Outliers
• Unsupervised Learning • Dummy variables
• Supervised learning • Residual

19. The formula below describes

12. A researcher believes that the origin of y=a+bX_1+cX_2+ 〖dX〗_3+ ϵ


the beans used to make a cup of coffee • Linear Regression
affects hyperactivity. He wants to • Multiple Linear Regression
compare coffee from three different • Nonlinear Regression
regions: Africa, South America, and
Mexico. What is the explanatory variable
in this study? 20. Logistic regression is classified as one of
the examples of unsupervised learning.
• The taste of the coffee • False
A and b are considered as independent
variables.
21. It is a table used to analyze the
performance of a model (classification). • False
(Use lowercase)
• confusion matrix
28. Shown below is a scatterplot of Y versus X.
Which of the following is the approximate
22. The family = ______________ command value for R2?
tells R to use the glm function to fit a logistic • 99.5%
regression model. (Use lowercase for your • 50%
answer to supply the code) • 25%
• binomial
23. It is a model that tests the relationship
between a dependent variable and a single
independent variable.
• Linear Regression
• Least squared method
• Multiple regression
• Nonlinear regression

24. y=a+bX_1+cX_2+ 〖dX〗


_3+ ϵy=a+bX_1+cX_2+ 〖dX〗_3+ ϵ
The formula above is used to model nonlinear
regression.
• False

29. Nonlinear is the extension of linear regression


25. It is a regression in which the dependent
• False
or criterion variables are modeled as a
30. A group of middle school students wants to
non-linear function of model parameters
know if they can use height to predict age. The
and one or more independent variables.
response variable in this study is age.
• Linear Regression • True
• Least squared method
• Multiple regression
• Nonlinear regression ADDITIONAL FORMATIVE

It is essentially similar to the simple linear


26. Logistic regression is classified as model, with the exception that multiple
supervised learning. independent variables are used in the model.
• True
Multiple regression

Which of the following is not a reason when to


27. In the equation use regression? When you aim to know the
y=a+bX+ ϵy=a+bX+ ϵ products that are usually bought together.
It pertains to the accuracy of the provisional Logistic regression is classified as supervised
model is not as high on the test set as it is on the learning. True
training set. (Use UPPERCASE for your answer) In logistic regression, it is that the target variable
OVERFITTING must be discrete and mostly binary or dichotomous.
True
It is used to explain the variability in the
It can be utilized to assess the strength of the
response variable. response variable relationship between variables and for modeling the
future relationship between them Regression
It is used to measure the binary or dichotomous Analysis
classifier performance visually and Area Under
Curve (AUC) is used to quantify the model If p value for model fit is less than 0.5, then signify
that our full model fits significantly better than our
performance. ROC Curve reduced model. False

Missing data has the potential to adversely Factor is the variable being predicted. False
affect a regression analysis by reducing the total
It is a model that tests the relationship between
usable sample size. True
a dependent variable and a single independent
In regression, the value of the residual (error) is variable. Linear Regression
one. True False What type of relationship is shown in the graph
below? Negative Linear Relationship
The value of the residual (error) is not
correlated across all observations. True

It is the variable of primary interest. response


variable
If p value for model fit is less than 0.5, then signify
that our full model fits significantly better than our
reduced model. False

It is used to explain the variability in the response


variable parameter

The value of the residual (error) is one false

The value of the residual (error) is correlated across It is essentially similar to the simple linear model,
all observations. False with the exception that multiple independent
variables are used in the model. Multiple linear
It is essentially similar to the simple linear model, regression
with the exception that multiple independent
Logistic regression is used to predict continuous
variables are used in the model. Multiple Regression
target variable. False
The value of the residual (error) is not correlated
Response variable is the variable being manipulated
across all observations. True
by researcher. False
Given the results for multiple linear regression
below. Predict the exam score if a student spent Logistic regression is classified as supervised
7 hours studying and had 2 prep exams taken.

A researcher believes that the origin of the


beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from
105.39
three different regions: Africa, South America,
and Mexico. What is the explanatory variable in The value of the residual (error) is not
this study? Origin of the coffee correlated across all observations. True
The lower is the AUC value, the worse is the model Extrapolation of the range of values of the
predictive accuracy. True independent variables in the data used to estimate
the model. True
Regression is a famous supervised learning
technique. True A group of middle school students wants to know if
they can use height to predict age. What is the
In regression, the value of the residual (error) is one. explanatory variable in this study? height
False
Supposed you want to train a machine to help you
How will you express the equation of a regression predict how long it will take you to drive home from
analysis where you aim to predict the value of y your workplace. What type of data mining approach
based on x. (Use lowercase and avoid so much space) should you use? Supervised Learning
y=b0+b1*x1
Multiple linear regression is classified as
Regression is a technique used for forecasting, time unsupervised learning. False
series modeling, and finding the causal effect
between the variables. False It is a point that lies far away from the rest. Outliers

Supposed you want to train a machine to help you When it is predicted as TRUE and is actually TRUE.
predict how long it will take you to drive home from True Positive
your workplace using regression. What type of data
mining approach should you use? Supervised Two variables are correlated if there is a linear
Learning association between them. If not, the variables are
uncorrelated. True
Multilinear regression is a regression in which the
dependent or criterion variables are modeled as a Multiple linear regression is classified as
non-linear function of model parameters and one or unsupervised learning. False
more independent variables. False
What type of relationship is shown in the graph
Multilinear regression is a regression in which the below? Negative Linear Relationship
dependent or criterion variables are modeled as a
non-linear function of model parameters and one or
more independent variables. False

If a predictor variable x is found to be highly


significant we would conclude that: changes in x are
associated to changes in y

It is used to develop the estimated multiple


regression. Multiple Regression

The graph for a nonlinear relationship is often a


straight line false

It is essentially similar to the simple linear model,


with the exception that multiple independent Logistic regression is used to predict continuous
variables are used in the model. Multiple Regression. target variable. False

In logistic regression, it is that the target variable You predicted negative and it’s false. False Negative
must be discrete and mostly binary or dichotomous.
True In ROC Curve, models having a higher lift from the
diagonals are considered to be more accurate
What type of relationship is shown in the graph models. True
above? No relationship
y=a+bX_1+cX_2+ 〖dX〗 A group of middle school students wants to
_3+ ϵy=a+bX_1+cX_2+ 〖dX〗_3+ ϵ know if they can use height to predict age. What
is the response variable in this study? Age
The formula above is used to model nonlinear
regression. False It is a regression in which the dependent or criterion
variables are modeled as a non-linear function of
Logistic regression is classified as supervised model parameters and one or more independent
learning. True variables. Nonlinear Regression

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. Multiple regression

Slope is the point that lies far away from the rest. False (a point far from the rest is an outlier)

Nonlinear is the extension of linear regression. False

The lower is the AUC value, the worse is the model predictive accuracy. True

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous. True

Multiple linear regression is classified as supervised learning. True

It is the variable of primary interest. Response variable

A group of middle school students wants to know if they can use height to predict age. What is the response variable in this study? Age

If p value for model fit is less than 0.5, then signify that our full model fits significantly better than our reduced model. False

Given the results for multiple linear regression below. Predict the exam score if a student spent 10 hours in studying and had 4 prep exams taken.

120.98
Supposed you want to train a machine to help It can be utilized to assess the strength of the
you predict how long it will take you to drive relationship between variables and for modeling the
home from your workplace using regression. future relationship between them. Regression
What type of data mining approach should you Analysis

use? Supervised Learning Logistic regression is classified as one of the


examples of unsupervised learning. False
In the equation y=a+bX+ ϵ What is considered
as a dependent parameter/s? y Research question: Do fourth graders tend to be
taller than third graders?
the graph for a nonlinear relationship is often a
straight line. False
wants to use grade level to explain differences
In linear models, the dependent and in height. What is the explanatory variable on
independent variables show a linear this study? grade level
relationship between the slope and the
intercept true The family = ______________ command tells R to use
the glm function to fit a logistic regression model.
The formula below describes (Use lowercase for your answer to supply the code)
binomial
y=a+bX_1+cX_2+ 〖dX〗
_3+ ϵy=a+bX_1+cX_2+ 〖dX〗_3+ ϵ What type of relationship is shown in the graph
Multiple linear regression below? No relationship
This is an observational study. The researcher
wants to use grade level to explain differences
in height. What is the response variable in this
study? Height
It can be utilized to assess the strength of the
relationship between variables and for modeling
the future relationship between them.
Regression Analysis

It is used to develop the estimated multiple


regression. Multiple Regression
It refers to estimates and predictions of the
target variable made using the regression
It is a regression in which the dependent or criterion equation with values of the predictor variable
variables are modeled as a non-linear function of outside of the range of the values of x in the data
model parameters and one or more independent set. (Use UPPERCASE for your answer)
variables. Nonlinear Regression PREDICTION
Which of the following evaluation metrics can’t be The formula below describes
applied in the case of logistic regression output to
compare with the target? Mean-Squared-Error y= (b_1+ b_2 x_1+b_3 x_2+b_4 x_3)/(1+
Nonlinear is the extension of linear regression. False
b_5 x_1+b_6 x_2+ b_7 x_3 )
Nonlinear regression
Which of the following methods do we use to best fit
the data in Logistic Regression? Maximum Likelihood
A researcher believes that the origin of the beans
used to make a cup of coffee affects hyperactivity.
When it is predicted as TRUE and is actually TRUE.
He wants to compare coffee from three different
True Positive
regions: Africa, South America, and Mexico. What is
the response variable in this study? Hyperactivity
Residual is a point that lies far away from the rest
level
false
The graph of the simple linear regression equation is
You predicted negative and it’s false. False Negative. a straight line true

How do we assess the goodness of fit or accuracy of In the equation


the model in logistic regression? (Use lowercase for
your answer) roc curve y=a+bX+ ϵy=a+bX+ ϵ

The graph for a nonlinear relationship is often a What is considered as the parameter/s? a,b
straight line. True In logistic regression, it is that the target
variable must be discrete and mostly binary or
How will you express the equation of a regression
dichotomous. True
analysis where you aim to predict the value of y
based on x. Y=a+bX It is used to explain the variability in the
response variable parameter
It is used to explain the variability in the response
variable parameter Logistic regression is classified as one of the
examples of unsupervised learning. False
Research question: Do fourth graders tend to be
taller than the third graders?
Regression is a technique used for forecasting, If a predictor variable x is found to be highly
time series modeling, and finding the casual significant we would conclude that: a change in y
effect between the variables. True causes a change in x. false
There should not be any multicollinearity Regression is a famous supervised learning
between the independent variables in the technique. True
model, and all independent variables should be
independent to each other true exam score = 5.56*hours + prep_exams*-0.60 + 67.67

It is the range of values of the independent It is a statistical technique that uses several
variables in the data used to estimate the model. independent variables to predict the dependent
Experimental region variable. Multiple linear regression
The value of the residual (error) is correlated
It is the tool pack in Microsoft Excel that can be
across all observations. False
downloaded to perform linear regression. Data
Two variables are correlated if there is a linear Analysis Tool Pack
association between them. If not, the variables
are uncorrelated. True Compute for the accuracy of the model depicted
in the confusion matrix below;
Suppose you have been given a fair coin and you
want to find out the odds of getting heads. TP = 20
Which of the following option is true for such a TN = 27
case? odds will be 1
FP = 18
It is essentially similar to the simple linear
model, with the exception that multiple FN = 25
independent variables are used in the model.
Multiple linear regression 52%

It is a regression in which the dependent or


criterion variables are modeled as a non-linear It is considered as the log of odds of the event. (Use
function of model parameters and one or more lowercase for your answer) intercept
independent variables Multiple Regression
Supposed your model was able to predict that a
Shown below is a scatterplot of Y versus X. certain student fails the certification exam and she
Which of the following is the approximate value actually is not. False Negative
for R2? 99.5% Two variables are correlated if there is a linear
association between them. If not, the variables are
uncorrelated. True
[IT0089] MODULE 4 MAIN (TIME SERIES AND FORECASTING)

TIME SERIES
- Time series analysis was introduced by Box and Jenkins (1976) to model and analyze time series data with autocorrelation.
- A sequence of observations on a variable measured at successive points in time or over successive periods of time.
o The measurements may be taken every hour, day, week, month, year, or any other regular interval. The pattern of the data is important to understanding the series' past behavior.
o If the behavior of the time series data of the past is expected to continue in the future, we can use it to guide us in selecting an appropriate forecasting method.

TIME SERIES DATA
- Time series data consist of data observations over time.

APPLICATIONS OF TIME SERIES
- Predicting stock prices
- Airline fares
- Labor force size
- Unemployment data
- Natural gas price

KEY IDEAS IN FORECASTING
- The objective of time series analysis is to uncover a pattern in the time series and then extrapolate the pattern into the future.
- The forecast is based solely on past values of the variable and/or on past forecast errors.
- Modern data-collection technologies have enabled individuals, businesses, and government agencies to collect vast amounts of data that may be used for causal forecasting.

TIME SERIES PATTERNS
1. Trend Pattern
- Gradual shifts or movements to relatively higher or lower values over a longer period of time
o A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.
2. Seasonal Pattern
- Recurring patterns over successive periods of time
o Example: A manufacturer of swimming pools expects low sales activity in the fall and winter months, with peak sales in the spring and summer months to occur every year.
- A time series plot can exhibit a seasonal pattern not only over a one-year period but also over periods of less than one year in duration.
o Example: daily traffic volume shows within-the-day "seasonal" behavior

M4S1 [OVERVIEW OF TIME SERIES ANALYSIS]

What is TIME SERIES?
- Time series data is a sequence of observations collected from a process with equally spaced periods of time.
- It establishes a relation between "cause" and "effect".
- One variable is "Time", which is considered the independent variable, and the second is "Data", also known as the dependent variable.
Understanding the Time Series Data
- Time series data is expected to have two variables: time and data.

OTHER EXAMPLES OF TIME SERIES DATA
1. Daily data on sales
2. Monthly inventory
3. Daily customers
4. Monthly interest rates, cost
5. Monthly unemployment rates
6. Weekly measures of money supply
7. Daily closing prices of stock indexes, and so on

PLOTTING TIME SERIES DATA

BENEFITS OF TIME SERIES ANALYSIS
Through Time Series Analysis, businessmen can predict changes in the economy. Furthermore, it can also help in determining the following:
• Stock Market Analysis
• Risk analysis and evaluation
• Census analysis
• Budgetary analysis
• Inventory studies
• Sales forecasting

COMPONENTS OF TIME SERIES
Secular Trend
- The increase or decrease in the movements of a time series is called secular trend.
- A time series may show an upward trend or downward trend for a period of years, and this may be due to factors like:
o Increase of population
o Change in technological progress
o Large-scale shifts in consumer demands

Seasonal Variation
- A short-term fluctuation in a time series which occurs periodically in a year.
Examples:
• More woolen clothes are sold in winter than in summer
• Each year more ice cream is sold in summer and very little in winter

Cyclical Variations
- Recurrent upward or downward movements in a time series where the period of the cycle is greater than a year.

Irregular Variations
- Fluctuations in the time series that are short in duration, erratic in nature, and follow no regularity in their occurrence pattern.

FORECASTING METHODOLOGIES
1. Exponential Smoothing
2. Moving Average

APPLICATIONS AVAILABLE TO PERFORM TSA
1. Tableau
2. Excel
3. SAS 9.4
4. SAP Analytics

FORECAST ACCURACY
- Computing forecasts and measures of forecast accuracy using the most recent value as the forecast for the next period.
FORECAST ERROR
- Measures to determine how well a particular forecasting method is able to reproduce the time series data that are already available.
- Forecast Error: the difference between the actual and the forecasted values for period t.
- Mean Forecast Error: the mean or average of the forecast errors.
- Mean Absolute Error (MAE): a measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting one another.
- Mean Squared Error (MSE): a measure that avoids the problem of positive and negative errors offsetting each other, obtained by computing the average of the squared forecast errors.

UNIVARIATE TIME SERIES MODELS
- Univariate time series models are models used when the dependent variable is a single time series.

MULTIVARIATE TIME SERIES MODELS
- Used when there are multiple dependent variables. In addition to depending on their own past values, each series may depend on past and present values of the other series.
- Modeling U.S. gross domestic product, inflation, and unemployment together as endogenous variables is an example of a multivariate time series model.

WHY USE TIME SERIES ANALYSIS?
Advantages
1. Reliability - time series uses collected historical data over a period of time.
2. Seasonal Patterns - since TSA deals with periodic data, it is easy to predict seasonal patterns.
3. Estimation of trends - the graph of TSA makes it easy to visualize the increase or decrease in sales, production, etc.
4. Growth - through depicting patterns, TSA helps in measuring financial growth.

M4S2 [MOVING AVERAGE]

CONCEPT OF AVERAGING METHODS
- If a time series is generated by a constant process subject to random error, then the mean is a useful statistic and can be used as a forecast for the next period.
- Averaging methods are suitable for stationary time series data, where the series is in equilibrium around a constant value (the underlying mean) with a constant variance over time.
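The accuracy measures defined above (forecast error, MAE, MSE) can be sketched in a few lines. The helper names are hypothetical, and the actual/forecast pairs (20 vs 30, 81 vs 90, 39 vs 42) echo the formative exercises later in the reviewer:

```python
# Forecast error e_t = actual_t - forecast_t;
# MAE = mean(|e_t|);  MSE = mean(e_t**2).
def forecast_errors(actual, forecast):
    return [y - f for y, f in zip(actual, forecast)]

def mae(actual, forecast):
    return sum(abs(e) for e in forecast_errors(actual, forecast)) / len(actual)

def mse(actual, forecast):
    return sum(e ** 2 for e in forecast_errors(actual, forecast)) / len(actual)

actual = [20, 81, 39]
forecast = [30, 90, 42]
print(forecast_errors(actual, forecast))  # [-10, -9, -3]
print(round(mse(actual, forecast), 2))    # 63.33  ((100 + 81 + 9) / 3)
```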

WHAT IS MOVING AVERAGE?
- It is a technique that calculates the overall trend in sales volume from the historical data of the company.
- This technique is well-known when forecasting short-term trends.

SAMPLE SCENARIO

APPLYING MOVING AVERAGE
The mean sales for the first five years (2003-2007) is calculated by finding the mean of the first five years.
• Compute for the second subset of 5 years (2004-2008).
• Get the average of the third subset (2005-2009).
• Continue calculating each five-year average until you reach 2009-2013. This gives you a series of points that you can plot on a chart of moving averages.

REMINDERS
- For monthly data, a 12-month moving average, MA(12), eliminates or averages out seasonal effects.
- Equal weights are assigned to each observation used in the average.
- Each new data point is included in the average as it becomes available, and the oldest data point is discarded.

MOVING AVERAGE FORMULA
- A moving average of order k, MA(k), is the mean value of k consecutive observations.
- K is the number of terms in the moving average. The moving average model does not handle trend or seasonality very well, although it can do better than the total mean.

EXAMPLE: WEEKLY DEPARTMENT STORE SALES
- The weekly sales figures (in millions of dollars) presented in the following table are used by a major department store to determine the need for temporary sales personnel.

AVERAGING METHODS
The Mean
- Uses the average of all the historical data as the forecast.
- When new data becomes available, the forecast for time t+2 is the new mean including the previously observed data plus this new observation.
- This method is appropriate when there is no noticeable trend or seasonality.

ASSUMPTIONS ON MOVING AVERAGE
- The moving average for time period t is the mean of the "k" most recent observations.
- The constant number k is specified at the outset.
- The smaller the number k, the more weight is given to recent periods.
- The greater the number k, the less weight is given to more recent periods.
- A large k is desirable when there are wide, infrequent fluctuations in the series.
- A small k is most desirable when there are sudden shifts in the level of the series.
- For quarterly data, a four-quarter moving average, MA(4), eliminates or averages out seasonal effects.

Graph of Store's weekly sales

Using Moving Average
• Use a three-week moving average (k=3) for the department store sales to forecast for weeks 24 and 26.
• The forecast error is
Forecasted Value
• The forecast for week 26 is

Results
FORMATIVES

Compute for the MSE:
Actual = 20, Forecast = 30 → (20 − 30)² = 100

Compute for the MSE:
Actual = 81, Forecast = 90 → (81 − 90)² = 81

Compute for the MSE: -9 85.5 81 9

MSE stands for the mean standard error. False

The reproduction of crops is highly dependent on this component in time series. Seasonal variations

Which of the following describes an unpredictable, rare event that appears in the time series? Irregular variations

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. True

What does the graph below illustrate? an increasing trend only

88.67

Moving average is suitable for dealing with a time series that has short-term trends. True
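The formative computations nearby can be checked in a few lines: the squared errors (20 − 30)² = 100 and (81 − 90)² = 81, and the MAD of the errors 5, −3, 4, where the mean of |5|, |−3|, |4| is 4. The helper name is hypothetical:

```python
# Squared error for one period, plus MAD (mean absolute deviation)
# over a small set of forecast errors.
def squared_error(actual, forecast):
    return (actual - forecast) ** 2

errors = [5, -3, 4]
mad = sum(abs(e) for e in errors) / len(errors)

print(squared_error(20, 30))  # 100
print(squared_error(81, 90))  # 81
print(mad)                    # 4.0
```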

Compute for the MAD based on the following


Mean absolute error is the mean or average of values:3
the forecast errors. False
Error |Error|
Qualitative methods can be used when past
information about the variable being forecast is
available. False 5 5
Trend projection is an example of a time series.
True -3 3

Compute for the forecasted error: 3


4 4
Actual Forecasted
39 42
The formula below describes MAD
Another measure is mean standard error (MSE),
which is a measure of the average size of the
prediction errors in squared units. True It pertains to the gradual shifts or movements to
relatively higher or lower values over a longer
It is a model that tests the relationship between a period of time. Trend Pattern
dependent variable and a single independent
Another measure is mean standard error (MSE),
variable. Linear Regression
which is a measure of the average size of the
The moving average method uses a weighted prediction errors in squared units. True
average of the observed value. False
The forecast is Ft+1 based on weighting the most
Which of the following describes the overall recent observation yt with a weight alpha and
tendency of a time series? Trend weighting the most recent forecast Ft with a weight
of 1-alpha . true
Its objective is to uncover a pattern in the time series
and then extrapolate the pattern into the future. Supply the missing values given for the attribute
Time Series Analysis salary. 19, 275 18, 625 18, 650 19,525

If a time series is generated by a constant process


salary
subject to random error, then median is a useful
statistic. True
16000
The value of α with the smallest RMSE is chosen for
use in producing future forecasts. True 12000
One measure of forecasting accuracy is the mean
accuracy deviation (MAD). As the name suggests, it 17500
is a measure of the average size of the prediction
errors. To estimate the change in Y for a one-unit 29000
change in X. false
If elections are held every 6 years in a country, then
Based on the given values below, what is the value of
the existing government may allow wages to
F3 if the value of the alpha is 0.2?
increase the year before the election to make the
40 (F1&F1 = 39 TO GET 40 F3=0.2(44)+0.8(39.99))
people more likely to vote for them. The statement is
an example of cyclical variation. True

Trend series a sequence of observations on a


variable measured at successive points in time or
over successive periods of time. True
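The exponential-smoothing items above all use the update F(t+1) = α·y(t) + (1 − α)·F(t); for example, with α = 0.2 and F2 = 39, an observation y2 = 44 gives F3 = 0.2(44) + 0.8(39) = 40. A minimal sketch:

```python
# Simple exponential smoothing: F(t+1) = alpha * y(t) + (1 - alpha) * F(t).
def exponential_smoothing(y, alpha, f1=None):
    """Return the forecasts F1..F(n+1); F1 defaults to the first observation."""
    forecasts = [y[0] if f1 is None else f1]
    for obs in y:
        forecasts.append(alpha * obs + (1 - alpha) * forecasts[-1])
    return forecasts

# The formative item above: alpha = 0.2, F2 = 39, y2 = 44.
print(round(0.2 * 44 + 0.8 * 39, 2))  # 40.0
```

A larger α (e.g., the 0.8 from the "rapid response" item) weights the most recent observation more heavily, so the forecast reacts faster to a real change in the pattern; a smaller α smooths more.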
22/30

- The classical multiplicative time series model indicates that the forecast is the product of which of the following terms? [trend, cyclical, and irregular components]
- Each year, more sweaters are sold in winter and very few in the summer season. This is an example of a ______________. [Seasonal variations]
- Which of the following statements is correct regarding moving averages? [The choice of the number of periods impacts the performance of the moving average forecast]
- It is a measure that avoids the problem of positive and negative errors offsetting each other, obtained by computing the average of the squared forecast errors. [Mean Absolute Error]
- The method of moving averages is used for which of the following purposes? [smooth the time series]
- Which of the following research scenarios would time-series analysis be best for? [Measuring the time it takes for something to happen based on a given number of variables]
- When exponential smoothing is used as a forecasting method, which of the following is used as the forecast for the next period? [smoothed value in the current time period]
- Which of the following is a valid weight for exponential smoothing? [0.5]
- Which of the following indicates the purpose for using the least squares method on time series data? [identifying the trend variations]
- Compute for the Mean Absolute Deviation: [4]
- An autoregressive forecast includes which of the following terms? [All of the above]
- Which of the following indicates a guideline for selecting a forecast model? [All of the above]
- A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. [True]
- To assess the adequacy of a forecasting model, which of the following methods is used? [Mean absolute deviation]
- It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time. [Trend Pattern]
- Which of the following indicates the calculation of an index number? [(Pi/Pbase)100]
- It is a measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting one another. [Mean Absolute Error]
- How many degrees of freedom are used for the t test to determine the significance of the highest order autoregressive term? [n − 2p − 1]
- Compute for the MSE [100]
- Spike is the predictable change in something based on the season. [False]
- It pertains to the difference between the actual and the forecasted values for period t [Forecast Error]
- A time series data usually has two variables namely transaction and item. [False]
- Deviation and error are used interchangeably in time series [False]
- Which of the following statements best describes the cyclical component? [It represents periodic fluctuations that recur within the business cycle]
- Compute for the 3 Moving Average based on the table below: [53]
- Compute for the MSE based on the following values: [16.67?]
- The formula below describes [MAPE] (no screenshot of the formula was captured)
- Compute for the MAD based on the following value: [6]
- MAPE stands for mean absolute predicted error [False]
- Exponential Smoothing is an obvious extension of the moving average method [True] (another attempt records False)
- It is the variable being manipulated by researchers. [Explanatory variable]
- Given the following values for age, what is the problem with the data? [Data Inconsistency?]
- A heterogenous data set is a data set whose data records have the same target value. [False]
- The sales of the mini electric fans that Louise sells varies every season. This is an example of a seasonal effect to a time series. [True]
- One measure of forecasting accuracy is the mean accuracy deviation (MAD). As the name suggests, it is a measure of the average size of the prediction errors. To estimate the change in Y for a one-unit change in X. [False]

24/20

- Regression is a famous supervised learning technique. [True]
- Forecasting methods can be classified as qualitative and quantitative [True]
- MSE stands for the mean standard error. [False]
- The classical multiplicative time series model indicates that the forecast is the product of which of the following terms? [trend, cyclical, and irregular components]
- It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time. [Trend Pattern]
- Compute for the MAPE: [−7.69]
- The MAD for the following values is 63.67. [False]
- It is a measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting one another [Mean Absolute Error]
- Time series is an example of unsupervised learning. [False]
- Averaging methods are suitable for stationary time series data [True]
- These are fluctuations in the time series that are short in duration, erratic in nature [Irregular variations]
- The formula below describes [MAPE]
- What component affects the time series analysis when a sudden volcanic eruption happens? [Irregular variations]
- What does the graph below illustrate? [an increasing trend only]
- The choice of the number of periods impacts the performance of the moving average forecast. [True]
- The seasonal component represents periodic fluctuations that recur within the business cycle. [False]
- It is the variable being manipulated by researchers. [Explanatory variable]
- MSE stands for ________. [Mean Squared Error]
- The formula below describes [MSE]
- The forecast Ft+1 is based on weighting the most recent observation yt with a weight alpha and weighting the most recent forecast Ft with a weight of 1 − alpha. [True]
- Exponential Smoothing is an obvious extension of the five-moving average method [False]
- The equation below illustrates how to compute the mean absolute deviation. [False]
- Supply the missing values given for the attribute salary. [18,650]
- Averaging methods are suitable for nonstationary time series data. [False]
- Which of the following describes an unpredictable, rare event that appears in the time series? [Irregular variations]
- Trend series a sequence of observations on a variable measured at successive points in time or over successive periods of time [True]
- With moving average the idea is that the most recent observations will usually provide the best guide as to the future, so we want a weighting scheme that has decreasing weights as the observations get older (exponential smoothing) [False]
- Exponential Smoothing is an obvious extension of the moving average method. [False]
- The value of α with the smallest RMSE is chosen for use in producing future forecasts. [True]

[IT0089] MODULE 5 (MAIN) OVERVIEW OF ASSOCIATION ANALYSIS

WHAT ARE PATTERNS?
- Patterns are sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
- Patterns represent intrinsic and important properties of datasets

WHEN DO YOU USE PATTERN DISCOVERY?
▪ What products were often purchased together?
▪ What are the subsequent purchases after buying an iPad?
▪ What code segments likely contain copy-and-paste bugs?
▪ What word sequences likely form phrases in this corpus?

CONCEPT OF MARKET BASKET ANALYSIS
- Market basket analysis is like an imaginary basket used by retailers to check the combination of two or more items that the customers are likely to buy
- "Two-thirds of what we buy in the supermarket we had no intention of buying," - Paco Underhill, author of Why We Buy: The Science of Shopping

ASSOCIATION RULE IS THE FOUNDATION OF SEVERAL RECOMMENDER SYSTEMS

POSSIBLE ACTIONS BASED ON ASSOCIATION RULE

BARBIE DOLL -> CANDY
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.

PROCESS OF RULE SELECTION

Generate all rules that meet specified support & confidence
• Find frequent item sets (those with sufficient support – see above)
• From these item sets, generate rules with sufficient confidence

RULE INTERPRETATION

Lift Ratio
- shows how effective the rule is in finding consequents (useful if finding particular consequents is important)

Confidence
- shows the rate at which consequents will be found (useful in learning costs of promotion)

Support
- measures overall impact

RULE SET (XLMiner)

CHALLENGES
- Random data can generate apparently interesting association rules
- The more rules you produce, the greater this danger
- Rules based on large numbers of records are less subject to this danger
M5 (ST1- PATTERN DISCOVERY)

WHAT IS THE ESSENCE OF DATA MINING?

"...the discovery of interesting, unexpected, or valuable structures in large data sets." –David Hand

"If you've got terabytes of data, and you're relying on data mining to find interesting things in there for you, you've lost before you've even begun." –Herb Edelstein

WHAT IS PATTERN DISCOVERY?
- According to SAS, it is one of the broad categories of analytical methods associated with data mining.
- It is a process of uncovering patterns from massive data sets.

PATTERN DISCOVERY CAUTION
• Poor data quality
• Opportunity
• Interventions
• Separability
• Obviousness
• Non-stationarity

APPLICATIONS OF PATTERN DISCOVERY
• Data reduction
• Novelty detection
• Profiling
• Market basket analysis
• Sequence analysis

PATTERN DISCOVERY KEY IDEAS
• Patterns are not known
• But data which are believed to possess patterns are given
• Examples:
o Clustering: grouping similar samples into clusters
o Associative rule mining: discover certain features that often appear together in data
• Patterns are known beforehand and are observed/described by:
o Explicit samples
o Similar samples (usually)
• Modeling approaches
o Build a model for each pattern
o Find the best fit model for new data
• Usually require training using observed samples

SEQUENTIAL PATTERN
- Shopping sequences, medical treatments, natural disasters, weblog click streams, programming execution sequences, DNA, protein, etc.

PATTERN DISCOVERY TECHNIQUES
1. CLUSTERING - Clustering aims at dividing the data set into groups (clusters) where the inter-cluster similarities are minimized while the similarities within each cluster are maximized
2. ASSOCIATION RULE MINING - Association rule discovery on usage data results in finding groups of pages that are commonly accessed
3. SEQUENTIAL PATTERN MINING
4. CLASSIFICATION
5. DECISION TREES
6. NAÏVE BAYES CLASSIFIER

SUPPORT
- The strength of the association is measured by the support and confidence of the rule.
- The support for the rule A => B is the probability that the two item sets occur together.
- Support is estimated using: (transactions that contain every item in A and B) / (all transactions)

CALCULATING FOR THE SUPPORT

TRANSACTION ID | ITEMS
Lanz | Bread, Milk
Ong | Bread, Lady's Choice, Eggs
Gonzales | Bread, Milk, Butter
Gepilano | Bread, Diaper, Coke
Francisco | Bread, Diaper, Milk
Tupaz | Bread, Milk, Butter

support for the bread => milk is 4/6

CONFIDENCE
- defines the likeliness of occurrence of consequent on the cart given that the cart already has the antecedents
- Confidence is estimated using: (transactions containing both A and B) / (transactions containing A)
- Technically, confidence is the conditional probability of occurrence of consequent given the antecedent.

CALCULATING FOR THE CONFIDENCE

RULE INTERPRETATION

LIFT RATIO - shows how effective the rule is in finding consequents (useful if finding particular consequents is important)

CONFIDENCE - shows the rate at which consequents will be found (useful in learning costs of promotion)

SUPPORT - measures overall impact
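The three measures above can be computed directly from the transaction table shown earlier (Lanz .. Tupaz). A minimal sketch, using the support and confidence formulas quoted in this reviewer; the lift definition (confidence divided by the support of the consequent) follows the usual convention:

```python
# Support, confidence, and lift for a rule A => B over the
# Lanz .. Tupaz transaction table shown above.
transactions = [
    {"Bread", "Milk"},                    # Lanz
    {"Bread", "Lady's Choice", "Eggs"},   # Ong
    {"Bread", "Milk", "Butter"},          # Gonzales
    {"Bread", "Diaper", "Coke"},          # Gepilano
    {"Bread", "Diaper", "Milk"},          # Francisco
    {"Bread", "Milk", "Butter"},          # Tupaz
]

def support(itemset):
    # proportion of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # conditional probability of the consequent given the antecedent
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # confidence relative to what independence would predict
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Bread", "Milk"}))       # 4/6, as computed above
print(confidence({"Bread"}, {"Milk"}))  # 4/6, since bread is in all 6 baskets
print(lift({"Bread"}, {"Milk"}))        # 1.0
```

The lift of 1.0 here says that knowing a basket contains bread tells you nothing extra about milk, since bread appears in every transaction.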

M5 (ST2- ASSOCIATION RULE)

ASSOCIATION RULE DISCOVERY
- Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item

DID YOU KNOW?
- Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.

APPLICATIONS OF ASSOCIATION RULE
• Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together
• Telecommunication: (each customer is a transaction containing the set of phone calls)
• Credit Cards/Banking Services: (each card/account is a transaction containing the set of customer's payments)
• Medical Treatments: (each patient is represented as a transaction containing the ordered set of diseases)
• Basketball-Game Analysis: (each game is represented as a transaction containing the ordered set of ball passes)

ADVANTAGES
- Uses large itemset property
- Easily parallelized
- Easy to implement

DISADVANTAGES
- Assumes transaction database is memory resident
- Requires many database scans

KEY IDEAS
• I: a set of all the items
• Transaction T: a set of items such that T ⊆ I
• Transaction Database D: a set of transactions
• A transaction T ⊆ I contains a set X ⊆ I of some items, if X ⊆ T
• An Association Rule: is an implication of the form X ⇒ Y, where X, Y ⊆ I

TERMS
• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
• The support s of an itemset X is the percentage of transactions in the transaction database D that contain X.
• The support of the rule X ⇒ Y in the transaction database D is the support of the itemset X ∪ Y in D.
• The confidence of the rule X ⇒ Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D.

SAMPLE SCENARIO

[Figure: sample transaction scenario]

VARIABLES IN ASSOCIATION RULE
• GIVEN:
- a set I of all the items;
- a database D of transactions;
- minimum support s;
- minimum confidence c;
• FIND:
- all association rules X ⇒ Y with a minimum support s and confidence c.

PROBLEM DECOMPOSITION
• Find all sets of items that have minimum support (frequent itemsets)
• Use the frequent itemsets to generate the desired rules

TERMS
"IF" part = antecedent
"THEN" part = consequent
"Item set" = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)

EXAMPLE: PHONE FACEPLATES

[Figure: faceplate transaction data]

TASK 1: COMPUTE FOR THE SUPPORT

MANY RULES ARE POSSIBLE
For example: Transaction 1 supports several rules, such as
• "If red, then white" ("If a red faceplate is purchased, then so is a white one")
• "If white, then red"
• "If red and white, then green"
• + several more

TASK 2: COMPUTE FOR THE CONFIDENCE

EXAMPLE RULE
{red, white} => {green} with confidence = 2/4 = 50%
• [(support {red, white, green})/(support {red, white})]
{red, green} => {white} with confidence = 2/2 = 100%
• [(support {red, white, green})/(support {red, green})]
Plus 4 more with confidence of 100%, 33%, 29% & 100%
If confidence criterion is 70%, report only rules 2, 3 and 6

TASK 3: COMPUTE FOR THE LIFT RATIO
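The two-step decomposition above (find frequent itemsets, then keep rules that meet the confidence criterion) can be sketched with a brute-force pass; this is illustrative only, not the Apriori algorithm itself. The transactions below are a reconstruction of the standard faceplate example consistent with the counts quoted above ({red, white} => {green} at 2/4 and {red, green} => {white} at 2/2); treat them as an assumption.

```python
from itertools import combinations

# Brute-force sketch of the rule-selection process above:
# step 1 keeps itemsets meeting a minimum support,
# step 2 emits rules X => Y meeting a minimum confidence.
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(min_support, min_confidence):
    items = sorted(set().union(*transactions))
    out = []
    for size in range(2, len(items) + 1):
        for itemset in map(frozenset, combinations(items, size)):
            if support(itemset) < min_support:
                continue  # step 1: keep only frequent itemsets
            for k in range(1, size):
                for antecedent in map(frozenset, combinations(itemset, k)):
                    conf = support(itemset) / support(antecedent)
                    if conf >= min_confidence:  # step 2: confidence filter
                        out.append((set(antecedent),
                                    set(itemset - antecedent), conf))
    return out

# Report only rules meeting the 70% confidence criterion used above.
for x, y, conf in rules(min_support=0.2, min_confidence=0.7):
    print(x, "=>", y, round(conf, 2))
```

With these transactions, {red, green} => {white} comes out at confidence 1.0, matching the 2/2 = 100% computation in the example.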
FORMATIVES:
17/20

Transaction table used by several items below:
Trans A: Beer, Peanut, Egg
Trans B: Beer, Milk, Peanut, Diaper
Trans C: Milk, Diaper, Egg
Trans D: Peanut, Egg, Diaper
Trans E: Beer, Peanut, Egg
Trans F: Egg, Beer, Peanut
(Trans G: Beer, Diaper, Peanut — added in the 7-transaction version of the table)

1. Write the support formula for the following expression: A => B [(transactions that contain every item in A and B) / (all transactions)]
2. It is intended to select the "best" subset of predictors. (Use lowercase for your answer) [variable selection]
3. When it comes to association analysis, the more rules you produce, the greater the risk is. [TRUE]
4. Observe the table below and compute for the confidence of beer -> diaper [NONE OF THE CHOICES] (1/4 or 0.25)
5. Supposed you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha? [0.8]
6. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item [TRUE]
7. Observe the table below and compute for the lift ratio of Diaper -> milk [1.5] (also recorded: 2)
8. Observe the table below and compute for the support for beer -> peanut [4/7]
9. Which of the following is not an application of a sequential pattern? [IDENTIFYING FAKE NEWS]
10. Which of the following is not an application of pattern discovery? [NONE OF THE CHOICES]
11. It shows how effective the rule is in finding consequents [LIFT RATIO]
12. Trend series a sequence of observations on a variable measured at successive points in time or over successive periods of time [FALSE]
13. Which of the following is not an advantage of using association rule? [ASSUMES TRANSACTION DATABASE IS MEMORY RESIDENT]
14. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences [TRUE]
15. Observe the table below and compute for the support for beer -> peanut [4/7] (also recorded: 5/7)
16. Observe the table below and compute for the support for SD Card => phone case [0.3]
17. It measures the overall impact [SUPPORT]
18. Clustering aims to discover certain features that often appear together in data [FALSE]
19. Observe the table below and compute for the support for diaper -> peanut [0.43]
20. It is another type of association analysis that involves using sequence data [ASSOCIATION RULE]

17/20

1. Segmentation is a data mining method that usually consists of two variables: a transaction and an item [FALSE]
2. It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer) [variable clustering]
3. It controls for the support (frequency) of consequent while calculating the conditional probability of occurrence of {Y} given {X} [LIFT RATIO]
4. (item based on the transaction table; question screenshot and answer not captured)
5. Association rule mining is about grouping similar samples into clusters [FALSE]
6. It is the process of discovering useful patterns and trends in large data sets (use lowercase for your answer) [data mining]
7. Observe the table below and compute for the support for beer => egg [0.43]
8. Observe the table below and compute for the lift ratio of Egg -> Peanut [0.93]
9. Write the support formula for the following expression: A => B [(transactions that contain every item in A and B) / (all transactions)]
10. It is the conditional probability of occurrence of consequent given the antecedent [CONFIDENCE]
11. Observe the table below and compute for the support for milk -> egg [0.14]
12. It shows how effective the rule is in finding components (consequents) [LIFT RATIO]
13. Observe the table below and compute for the support for diaper -> peanut [0.43]
14. Observe the table below and compute for the support for airpods => charger [0.4]
15. Observe the table below and compute for the support for Fries => Burger [0.6]
16. Input validation helps to lessen what type of anomaly? [INSERTION ANOMALY]
17. Observe the table below and compute for the support for Phone Case => SD Card [0.5]
18. In the association analysis, the values of items can be categoric or numeric [FALSE] (the values should be categoric only)
19. Lift ratio shows how effective the rule is in finding consequents. [TRUE]
20. Observe the following table and compute for the lift ratio of cake -> fries [0.1]

20/20

1. It is intended to select the "best" subset of predictors. [Variable Selection]
2. Write the Support formula for the following expression: A => B [(transactions that contain every item in A and B) / (all transactions)]
3. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item. [TRUE]
4. Which of the following is not an application of a sequential pattern? [Identifying fake news]
5. Which of the following is not an application of pattern discovery? [None of the choices]
6. It shows how effective the rule is in finding consequents. [Lift ratio]
7. Trend series a sequence of observations on a variable measured at successive points in time or over successive periods of time. [False] (it should be "time series")
8. Observe the table below and compute for the confidence of Beer -> diaper. [None of the choices]
9. Observe the table below and compute for the confidence of Diaper => peanut [0.43]
10. Observe the table below and compute for the support of Beer => peanut [4/7]
11. Observe the table below and compute for the support of SD Card => Phone case [0.3]
12. Observe the table below and compute for the lift ratio of (rule screenshot not captured) [1.5]
13. When it comes to association analysis, the more rules you produce, the greater the risk is. [True]
14. Supposed you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha? [0.8]
15. Which of the following is not an advantage of using association rule? [Assumes transaction database is memory resident.]
16. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. [True]
17. It measures the overall impact. [Support]
18. Clustering aims to discover certain features that often appear together in data. [False]
19. It is another type of association analysis that involves using sequence data. [Association Rule]
20. Observe the table below and compute for the support of SD Card => Phone case [0.3] (also recorded: 0.93)
7. observe the table below and compute
for the confidence of
1. Segmentation is a data mining method
Airpods->powerbank
that usually consists of two variables a
transaction and an item. [False]
2. It is a useful tool for data reduction,
such as choosing the best variables or
cluster components for
analysis.[Variable Clustering]
3. It controls for the support ( frequency)
of consequent while calculating the
conditional probability of occurrence of
{Y} given {X}. [Lift ratio]
4. Association rule mining is about
grouping similar samples into clusters.
[False]
5. It is the process of discovering useful
patterns and trends in large data Answer:0.4
sets.[data mining]
6. observe the table below and compute 8. observe the table below and support
for the lift ratio of for Fries -> Burger
egg -> Peanut

Answer: 0.6
9. It is the conditional probability of --------------------------------------------------------------------------
occurrence of consequent given the
antecedent. [Confidence] 15/20
10. Input validation helps to lessen what
type of anomaly? [ Insertion anomaly] 1. Market Basket Analysis creates if-Then
11. observe the table below and compute scenario rules. The IF part is called the
for the confidence of _______ (use lower case for your
Phone -> SD card answer) [ANTECEDENT]
2. It is the conditional probability of
occurrence of consequent given the
antecedent [CONFIDENCE]
3. It controls for the support (frequency)
of consequent while calculating the
conditional probability of occurrence [Y]
given [X]. [LIFT RATIO]
4. Observe the table below and compute
for the confidence of diaper->egg.

Answer: 0.5
12. Lift ratio shows how effective the rule is
in finding consequents. [True]
13. observe the table below and compute
for the lift ratio of

[0.50]
5. Observe the table below and compute for
milk->egg.

[0.14]
Answer: 0.1 6. Observe the table below and compute for
lift ratio diaper->milk.
8. Observe the table below and compute for
the confidence of egg->peanut.

[1.5]
7. Observe the table below and compute for
[0.8]
the confidence of beer->diaper.

1. Write the confidence formula for the following


expression: A->B
[(transactions containing both A and B
)/(transaction containing A)]
2. Lift ration shows how effective the rule is in
finding consequents. [FALSE]
3. Observe the table below and compute for the
confidence of airpods->powerbank.

[0.40]

[0.33]
4. Observe the table below and compute for the
support for diaper->peanut.

[0.43]
[0.67]
13. Which of the following is not an application of a
sequential pattern? [IDENTIFYING FAKE NEWS] 18. It pertains to how likely item Y is purchased when
item X is purchased, expressed as [X->Y]/ [confidence]
14. Lift ratio shows how effective the rule is in finding
consequents [TRUE] 19. Which of the following is not an application of
pattern discovery? [None of the choices]
15. Market Basket Analysis creates if-Then scenario
rules. The THEN part is called the _______ (use lower 20. Observe the table below and compute for the
case for your answer) [CONSEQUENT] support for Phone case->SD card.

16. It is the conditional probability of occurrence of


consequent given the antecedent [CONFIDENCE]

17. Observe the table below and compute for the lift
ratio of powerbank->airpods.
Transaction E Beer, Peanut, Egg

Transaction F Egg, Beer, Peanut

Transaction G Beer, Diaper, Peanut


0.75 (3/7 = .43 Umulit sa number19)

5. It controls for the support (frequency) of


consequent while calculating the conditional
probability of occurrence of {Y} given {X}. Lift
ratio
6. It pertains to how popular an itemset is, as
measured by the proportion of transactions
in which an itemset appears. Support
7. It shows how effective the rule is in finding
consequents. Lift ratio
8. Observe the table below and compute
for the lift ratio of diaper→ milk.
[0.3]
Transaction ID Items

Transaction A Beer, Peanut, Egg


1. The objective of clustering is to uncover a
pattern in the time series and then
extrapolate the pattern into the future. False Transaction B Beer, Milk, Peanut, Diaper
2. These are set of items, subsequences, or
substructures that occur frequently together
Transaction C Milk, Diaper, Egg
(or strongly correlated) in the data set.
Pattern
3. When it comes to association analysis, the Transaction D Peanut, Egg, Diaper
more rules you produce, the greater the risk
is. True
Transaction E Beer, Peanut, Egg
4. Observe the table below and compute
for the support for diaper⇒ peanut
Transaction F Egg, Beer
Transaction ID Items
Transaction G Beer, Diaper, Peanut
Transaction A Beer, Peanut, Egg
1.5 0.29/0.29/0.57=1.75

9. Supposed you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha? 0.8
10. Association rule mining is about grouping similar samples into clusters. False
11. It is the conditional probability of occurrence of consequent given the antecedent. Confidence
12. It is the conditional probability of occurrence of consequent given the antecedent. Confidence
13. Observe the table below and compute for the confidence of phone case→SD Card. 0.5 (0.3/0.6 = 0.5)

Transaction ID: Items
1: airpods, charger, powerbank
2: powerbank, phone case
3: airpods, phone case, charger
4: phone case, SD Card
5: SD Card, charger, airpods
6: SD Card, phone case, powerbank
7: powerbank, phone case, SD Card
8: airpods, powerbank
9: phone case, airpods
10: charger, SD Card, airpods

14. It pertains to how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. (Use lowercase for your answer) support
15. Observe the table below and compute for the support for beer⇒peanut. 4/7
16. Market Basket Analysis creates If-Then scenario rules. True
17. It shows how effective the rule is in finding consequents. Lift ratio
18. It is another type of association analysis that involves using sequence data. Association rule
19. Observe the table below and compute for the support for diaper⇒peanut. 0.43 (3/7 = .43; repeats number 4)

Transaction ID: Items
Transaction A: Beer, Peanut, Egg
Transaction B: Beer, Milk, Peanut, Diaper
Transaction C: Milk, Diaper, Egg
Transaction D: Peanut, Egg, Diaper
Transaction E: Beer, Peanut, Egg
Transaction F: Egg, Beer, Peanut
Transaction G: Beer, Diaper, Peanut
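Item 13's confidence can be verified the same way from the ten accessory transactions above; this is an illustrative sketch, not part of the original material:

```python
# Confidence of the rule {phone case} -> {SD card},
# using the 10-transaction accessories table above.
baskets = [
    {"airpods", "charger", "powerbank"},
    {"powerbank", "phone case"},
    {"airpods", "phone case", "charger"},
    {"phone case", "SD card"},
    {"SD card", "charger", "airpods"},
    {"SD card", "phone case", "powerbank"},
    {"powerbank", "phone case", "SD card"},
    {"airpods", "powerbank"},
    {"phone case", "airpods"},
    {"charger", "SD card", "airpods"},
]

def sup(items):
    # support = fraction of baskets containing all the items
    return sum(items <= b for b in baskets) / len(baskets)

# confidence = support(antecedent with consequent) / support(antecedent)
conf = sup({"phone case", "SD card"}) / sup({"phone case"})
print(conf)  # 0.3 / 0.6 = 0.5
```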
20. It shows how effective the rule is in finding consequents. Lift ratio

[IT0089] MODULE 6 – DATA PREPARATION

Data Preparation
- Data preparation is the method of transforming raw data into a form suitable for modeling.
- It is also known as data preprocessing.

Why do we need to preprocess the data?
• MISSING VALUES – Having null values in your data set could affect the accuracy of the model.
• OUTLIERS – When your data has outliers, they could affect the distribution of your data.
• INCONSISTENT DATA
• IMPROPERLY FORMATTED DATA
• LIMITED FEATURES
• THE NEED FOR TECHNIQUES SUCH AS FEATURE ENGINEERING

Ways to Preprocess Data (Data Preprocessing Techniques)

DATA CLEANING
- Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.

DATA INTEGRATION AND TRANSFORMATION
- There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.

DATA REDUCTION
- Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Variables Selection
- Observe minimizing garbage in, garbage out (GIGO).

Procedures
1. Backward-selection
2. Forward-selection

[M6-ST1] DATA ANOMALIES

What are anomalies?
- Anomalies are problems that can occur in poorly planned, unnormalized databases where all the data is stored in one table (a flat-file database).
- Anomalies are caused when there is too much redundancy in the database's information.

Database Anomalies
• Insertion Anomaly – happens when inserting vital data into the database is not possible because other data is not already there.
• Update Anomaly – happens when the person charged with the task of keeping all the records current and accurate is asked, for example, to change an employee's title due to a promotion. If the data is stored redundantly in the same table and the person misses any of the copies, then there will be multiple titles associated with the employee. The end user has no way of knowing which is the correct title.
• Deletion Anomaly – happens when the deletion of unwanted information causes desired information to be deleted as well. For example, if a single database record contains information about a particular product along with information about a salesperson for the company, and the salesperson quits, then information about the product is deleted along with the salesperson information.

Handling Missing Data
- Missing data is a problem that continues to plague data analysis methods.
- Let's examine the cars data set.
- A common method of "handling" missing values is simply to omit the records or fields with missing values from the analysis.
- However, this action would make our data biased.

Common Criteria in Handling Null Values
1. Replace the missing value with some constant, specified by the analyst.
2. Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).
3. Replace the missing values with a value generated at random from the observed distribution of the variable.
4. Replace the missing values with imputed values based on the other characteristics of the record.

Applications
1) Replacing missing field values with user-defined constants.
2) Replacing missing field values with means or modes.

HOW TO HANDLE OUTLIERS?

Use Graphical Methods to Identify Outliers

Variable Clustering
• Collapse the categories based on the number of observations in a category.
• Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.
• Use smoothed weight of evidence coding to convert the categorical input to a continuous input.
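The null-handling criteria above (constant, mean/mode, and a random draw from the observed distribution) can be sketched in a few lines of standard-library Python; the sample columns here are hypothetical, and the model-based imputation of criterion 4 is only named, not implemented:

```python
# Sketch of the null-handling criteria, using only the standard library.
import random
import statistics

scores = [100, 89, None, 90, 75, 87]               # numeric field with a null
cities = ["Makati", "Caloocan", None, "Caloocan"]  # categorical field

observed = [v for v in scores if v is not None]

# 1. Replace with an analyst-specified constant.
by_constant = [v if v is not None else 0 for v in scores]

# 2. Replace with the field mean (numeric) or mode (categorical).
mean = statistics.mean(observed)
by_mean = [v if v is not None else mean for v in scores]
mode = statistics.mode([c for c in cities if c is not None])
by_mode = [c if c is not None else mode for c in cities]

# 3. Replace with a random draw from the observed distribution.
by_random = [v if v is not None else random.choice(observed) for v in scores]

print(mean)  # (100 + 89 + 90 + 75 + 87) / 5 = 88.2
print(mode)  # Caloocan
```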

HOW TO HANDLE CATEGORICAL DATA?

Dealing with Categorical Inputs
- When a categorical input has many levels, expanding the input into dummy variables can greatly increase the dimension of the input space.
- Including categorical inputs in the model can cause quasi-complete separation.

HOW TO HANDLE NOISY DATA?

Handling Noisy Data
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.

DEALING WITH CATEGORICAL INPUTS

Solutions to the problems of categorical inputs include the following choices:
• Use the categorical input as a link to other data sets.

Anomaly Detection
- Anomaly detection (aka outlier analysis) is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset's normal behavior.
- Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, for instance a change in consumer behavior.

Observe the graph: it shows an anomalous drop detected in time series data. The anomaly is the yellow part of the line extending far below the blue shaded area, which is the normal range for this metric.

THREE MAIN CATEGORIES OF BUSINESS DATA ANOMALIES

1. Global Outliers
- Also known as point anomalies, these outliers exist far outside the entirety of a data set.
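Technique 1 above (binning) can be illustrated by splitting the sorted values into equal-frequency bins and replacing each value by its bin's mean. This is a minimal sketch; the price list is a made-up example:

```python
# Smoothing by bin means: sort, split into equal-frequency bins,
# and replace each value with the mean of its bin.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any leftover values
        end = (i + 1) * size if i < n_bins - 1 else len(data)
        bin_ = data[i * size:end]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```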
2. Contextual Outliers
- Also called conditional outliers, these anomalies have values that significantly deviate from the other data points that exist in the same context.

3. Collective Outliers
- When a subset of data points within a set is anomalous to the entire dataset, those values are called collective outliers. In this category, individual values aren't anomalous globally or contextually.

WHY DOES YOUR COMPANY NEED ANOMALY DETECTION?

Reasons to detect anomalies
1. Anomaly detection for application performance
2. Anomaly detection for product quality
3. Anomaly detection for user experience

[M6-ST2] – VARIABLE SELECTION

Variable Selection
- Variable selection is intended to select the "best" subset of predictors.

Why Variable Selection?
• We want to explain the data in the simplest way; redundant predictors should be removed.
• Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
• Collinearity is caused by having too many variables trying to do the same job.
• Cost: if the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.

Prior to variable selection
1. Identify outliers and influential points - maybe exclude them at least temporarily.
2. Add in any transformations of the variables that seem appropriate.

FORWARD SELECTION
- The forward selection procedure starts with no variables in the model.

Steps of Forward Selection
1. For the first variable to enter the model, select the predictor most highly correlated with the target.
2. For each remaining variable, compute the sequential F-statistic for that variable, given the variables already in the model.
3. For the variable selected in step 2, test for the significance of the sequential F-statistic. If the resulting model is not significant, then stop, and report the current model without adding the variable from step 2. Otherwise, add the variable from step 2 into the model and return to step 2.

BACKWARD SELECTION
- This is the simplest of all variable selection procedures and can be easily implemented without special software.
- In situations where there is a complex hierarchy, backward elimination can be run manually while taking account of what variables are eligible for removal.

Steps of Backward Selection
1. Perform the regression on the full model; that is, using all available variables. For example, perhaps the full model has four variables, x1, x2, x3, x4.
2. For each variable in the current model, compute the partial F-statistic. In the first pass through the algorithm, these would be F(x1|x2, x3, x4), F(x2|x1, x3, x4), F(x3|x1, x2, x4), and F(x4|x1, x2, x3). Select the variable with the smallest partial F-statistic. Denote this value Fmin.
3. Test for the significance of Fmin. If Fmin is not significant, then remove the variable associated with Fmin from the model, and return to step 2. If Fmin is significant, then stop the algorithm and report the current model.

STEPWISE PROCEDURE
- The stepwise procedure represents a modification of the forward selection procedure.
- A variable that has been entered into the model early in the forward selection process may turn out to be nonsignificant once other variables have been entered into the model.

What if you have 2^k models to choose from?
Step 1: The analyst specifies how many (k) models of each size he or she would like reported, as well as the maximum number of predictors (p) the analyst wants in the model.
Step 2: All models of one predictor are built. Their R², R²adj, Mallows' Cp (see below), and s values are calculated. The best k models are reported, based on these measures.
Step 3: Then all models of two predictors are built. Their R², R²adj, Mallows' Cp (see below), and s values are calculated. The best k models are reported.
• The procedure continues in this way until the maximum number of predictors (p) is reached. The analyst then has a listing of the best models of each size, 1, 2, ..., p, to assist in the selection of the best overall model.

EXAMPLE

FORMATIVES

Forward selection is the opposite of stepwise selection.
• FALSE

Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
• True

Linearity is caused by having too many variables trying to do the same job.
• False (collinearity)
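The forward-selection steps above can be sketched as a greedy search. As a simplification, this sketch substitutes the drop in residual sum of squares with a fixed stopping threshold for the sequential F-test, and runs on synthetic data (the coefficients and threshold are illustrative choices, not from the module):

```python
# Greedy forward selection: start with no predictors and repeatedly add
# the candidate that most reduces the residual sum of squares.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))  # candidate predictors x1..x4
# Only columns 0 and 2 actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=n)

def rss(cols):
    """Residual sum of squares of an intercept + `cols` least-squares fit."""
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

selected, remaining = [], [0, 1, 2, 3]
while remaining:
    best = min(remaining, key=lambda c: rss(selected + [c]))
    # stop when the best addition no longer improves the fit meaningfully
    if rss(selected) - rss(selected + [best]) < 1.0:
        break
    selected.append(best)
    remaining.remove(best)

print(selected)  # picks the two informative predictors: [0, 2]
```

A real implementation would test the sequential F-statistic for significance at each step rather than using a fixed improvement threshold.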
It is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
• Estimation

It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features.
• Feature extraction

Postcoding process is necessary for:
• Structured questions

A good rule of thumb in having a right amount of data is to have 10 records for every predictor value.
• True

Forward selection is the simplest variable selection model.
• False

Enumerate at least one of the two (2) types of variable transformation commonly used in machine learning: numeric variable and categorical variable

Data preparation affects:
• The objectives of the research
• The quality of the research
• The research approach
• The sample size

This happens when the deletion of unwanted information causes desired information to be deleted as well.
• Insertion anomaly
• Deletion anomaly

The following are techniques to treat missing values except:
• Substitute an imputed value
• Returning to the field
• Substitute an imputed value
• Case wise deletion

Clustering can also detect outliers.
• True

Supply the missing values given for the attribute city.
City: Makati, Caloocan, Caloocan, Makati, Caloocan
deviate from the other data points that exist in the
It is a manipulation of scale values to ensure comparability with other scales:
• Scale transformation

It is a best practice to divide your dataset into train and test dataset.
• True

Outlier analysis can provide good product quality.
• True

A review of the questionnaires is essential in order to:
• Select the data analysis strategy
• Find new insights
• Increase the quality of the data
• Increase accuracy and precision of the collected data

Feature selection maps the original feature space to a new feature space with lower dimensions by combining the original feature space.
• False (should be feature extraction)

This happens when inserting vital data into the database is not possible because other data is not already there.
• Insertion anomaly

Unnecessary predictors will add noise to the estimation of other quantities that we are interested in.
• True

You can also use regression when handling noisy data.
• True

These anomalies have values that significantly deviate from the other data points that exist in the same context.
• Contextual outliers

When there's a missing value for a categorical variable, it is ideal to supply it by computing for the average of the data values available.
• False

Anomaly detection can cause a bad user experience.
• False (answer uncertain)

You can use histogram to detect outliers.
• True

When a subset of points within a set is anomalous to the entire data set, those values are:
• Collective outliers

These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database).
• Anomalies
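The -8990 entry flagged as a data inconsistency in the age items below is exactly the kind of global outlier a simple numeric rule catches. A sketch using a 3×IQR fence (the multiplier is a conventional choice for extreme outliers; the quartile estimate is deliberately rough):

```python
# Flag values far outside the interquartile range (IQR).
def iqr_outliers(values):
    data = sorted(values)
    n = len(data)
    q1, q3 = data[n // 4], data[(3 * n) // 4]  # rough quartiles
    iqr = q3 - q1
    lo, hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr    # 3x IQR: extreme outliers only
    return [v for v in values if v < lo or v > hi]

ages = [16, 27, -8990, 19, 15, 18]
print(iqr_outliers(ages))  # [-8990]
```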
Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18
Answer: Data Inconsistency

Given are the following records for the attribute rating. What is the problem with the data?
application_rating: 1, 2, a, b, c, 3
Answer: Data Inconsistency

Supply the missing value. AGD
Degree: SMBA, WMA, AGD, AGD, WMA, WMA, -, AGD, AGD

The following can be done to treat unsatisfactory response except: Returning to the field
A good rule of thumb in having a right amount of data is to have 10 records for every predictor value. True
Anomaly detection is also known as outlier analysis. True
It is a best practice to divide your dataset into train and test dataset. True
It fits and performs variable selection on an ordinary least square regression predictive model. Linear regression selection
It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer) variable clustering
The simplest of all variable selection procedures is stepwise procedure. FALSE
Backward selection starts with all the variables. True
It is about estimating the value for the target variable except that the target variable is categorical rather than numeric. Estimation
Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True
Variable clustering is about grouping the attributes with similarities. True
Resampling refers to the process of sampling at random and with replacement from a data set. True
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True
The figure below illustrates the first step in doing backward selection. False
These outliers exist far outside the entirety of a data set. Global outliers
These are also known as point anomalies. Global outliers
Forward selection is the opposite of stepwise selection. False
Anomaly detection can cause a bad user experience. True (also marked False; answer uncertain)

Formative 6
Forward selection is the opposite of stepwise selection.
• False

Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
• True

Linearity is caused by having too many variables trying to do the same job.
• False (should be collinearity)

It is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
• Estimation

It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features.
• Feature extraction

Postcoding process is necessary for:
• Structured questions

A good rule of thumb in having a right amount of data is to have 10 records for every predictor value.
• True

Forward selection is the simplest variable selection model.
• False

Enumerate at least one of the two (2) types of variable transformation commonly used in machine learning.

Data preparation affects:
• The objectives of the research
• The quality of the research
• The research approach
• The sample size

It is a manipulation of scale values to ensure comparability with other scales:
• Scale transformation

It is a best practice to divide your dataset into train and test dataset.
• True

This happens when the deletion of unwanted information causes desired information to be deleted as well.
• Insertion anomaly
• Deletion anomaly

The following are techniques to treat missing values except:
• Substitute an imputed value
• Returning to the field
• Substitute an imputed value
• Case wise deletion

Clustering can also detect outliers.
• True

Supply the missing values given for the attribute city.
City: Makati, Caloocan, Caloocan, Makati, Caloocan
• Caloocan
• Makati
• Quezon
• None of the choices

These anomalies have values that significantly deviate from the other data points that exist in the same context.
• Contextual outliers

When there's a missing value for a categorical variable, it is ideal to supply it by computing for the average of the data values available.
• True

Outlier analysis can provide good product quality.
• True

A review of the questionnaires is essential in order to:
• Select the data analysis strategy
• Find new insights
• Increase the quality of the data
• Increase accuracy and precision of the collected data

Feature selection maps the original feature space to a new feature space with lower dimensions by combining the original feature space.
• False

Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18
Answer: Data Inconsistency

Anomaly detection can cause a bad user experience.
• False

You can use histogram to detect outliers.
• True

When a subset of points within a set is anomalous to the entire data set, those values are:
• Collective outliers

These anomalies have values that significantly deviate from the other data points that exist in the same context.
• Contextual outliers

This happens when inserting vital data into the database is not possible because other data is not already there.
• Insertion anomaly

These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database).
• Anomalies

Unnecessary predictors will add noise to the estimation of other quantities that we are interested in.
• True

You can also use regression when handling noisy data.
• True
16/20

A homogenous data set is a data set whose data records have the same target value. True
This happens when the deletion of unwanted information causes desired information to be deleted as well. Deletion anomaly
Instead of using the real number for the age attribute, you categorized the age as the following:
Young = 12 – 17
Adult = 18 – 34
Old = 35 – 60
What kind of data preparation was practiced? Data Cleaning
It is the process of integrating multiple databases, data cubes, or files. data integration
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database). Anomalies
You can also use regression when handling noisy data. True
The procedure starts with an empty set of features [reduced set]. Forward Selection
It is the simplest of all variable selection procedures and can be easily implemented without special software (Use lowercase for your answer) Backward Selection
The forward selection procedure starts with no variables in the model. True
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True
The figure below illustrates the first step in doing backward selection. False (no picture shown)
It is intended to select the "best" subset of predictors. (Use lowercase for your answer) Variables Selection
It is a best practice to divide your dataset into train and test dataset. True
Given the following values for age, what is the problem with the data? Data Inconsistency
Supply for the missing values. 19.6
It is a manipulation of scale values to ensure comparability with variables with other scales: Scale transformation
Supply the missing value in the given data below. 88.2
Enumerate at least one of the two (2) types of variable transformation commonly used in machine learning: (Use lowercase for your answer) categorical variables / numerical variables
Forward selection is the opposite of stepwise selection. False
The figure below illustrates the basic steps for what type of variable selection method? Backward
Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True

17/20

1. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
2. It is a manipulation of scale values to ensure comparability with variables with other scales: [SCALE TRANSFORMATION]
3. It is the process of integrating multiple databases, data cubes, or files [DATA INTEGRATION]
4. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employees. This is an example of what type of data anomaly? [UPDATE ANOMALY]
5. Input validation helps to lessen the deletion anomaly [FALSE]
6. Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18 [DATA INCONSISTENCY]
7. Mode is used when catering missing values for numerical variables [FALSE]
8. A homogenous data set is a data set whose data records have the same target value [TRUE]
9. Post coding process is necessary for: [STRUCTURED QUESTIONS]
10. Given the following records for the attribute rating. What is the problem with the data?
Application_rating: 1, 2, A, B, C, 3 [DATA INCONSISTENCY]
11. It identifies the set of input variables that jointly explains the maximum amount of data variance. The target variable is not considered with this method. [UNSUPERVISED SELECTION]
12. Clustering aims to discover certain features that often appear together in data [FALSE]
13. Backward selection starts with no variables [FALSE]
14. Forward selection is the simplest variable selection model [FALSE]
15. It fits and performs variable selection on an ordinary least square regression predictive model. [LINEAR REGRESSION SELECTION]
16. It identifies the set of input variables that jointly explain the maximum amount of variance contained in the target [UNSUPERVISED SELECTION]
17. The simplest of all variable selection procedures is stepwise procedure [FALSE]
18. It is intended to select the "best" subset of predictors (use lowercase for your answer) [variable selection]
19. Forward selection is the opposite of stepwise selection [FALSE]
20. It is the simplest of all variable selection procedures and can be easily implemented without special software (use lowercase for your answer) [backward selection]

20/20

1. You can also use regression when handling noisy data [TRUE]
2. When a subset of data points within a set is anomalous to the entire dataset, those values are: [COLLECTIVE OUTLIERS]
3. The following can be done to treat unsatisfactory response except: [ASSIGNING MISSING VALUES]
4. A homogenous data set is a data set whose data records have the same target value [TRUE]
5. Post coding process is necessary for: [STRUCTURED QUESTIONS]
6. Anomaly detection is also known as outlier analysis [TRUE]
7. Supply the missing value in the given data below.
Exam_scores: 100, 89, -, 90, 75, 87 [88.2]
8. When Stephen tried to change the section of all students enrolled to his class, however, upon performing the query, only one data record was modified instead of all the records. What data anomaly was present in Stephen's database? [UPDATE ANOMALY]
9. In this category, individual values aren't anomalous globally or contextually [COLLECTIVE OUTLIERS]
10. It is used when there is a single measurement of each element in the sample: [INTERDEPENDENCE TECHNIQUES]
11. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
12. Data preparation affects: [THE QUALITY OF THE RESEARCH]
13. It performs a greedy search to find the best performing feature subset. It iteratively creates models and determines the best or the worst performing feature at each iteration [RECURSIVE FEATURE ELIMINATION]
14. The first step in stepwise procedure is to select the predictor most highly correlated with the target [FALSE]
15. It involves both running the analysis to create unique clusters or segments and evaluating or describing the clusters that are created in the analysis. [CLUSTER ANALYSIS]
16. Give the first step for backward selection [Perform the regression on the full model]
17. A good rule of thumb in having a right amount of data is to have 10 records for every predictor value [TRUE]
18. Forward selection is the simplest variable selection model [FALSE]
19. The simplest of all variable selection procedures is stepwise procedure. [FALSE]
20. Clustering aims to discover certain features that often appear together in data [FALSE]

18/20

1. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employee. This is an example of what type of data anomaly? [UPDATE ANOMALY]
2. It is a best practice to divide your dataset into train and test dataset. [TRUE]
3. You can use histogram to detect outliers [TRUE]
4. Anomaly detection is also known as outlier analysis [TRUE]
5. The following can be done to treat unsatisfactory response except: [RETURNING TO THE FIELD]
6. Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18 [DATA INCONSISTENCY]
7. Histogram is used to see missing data [FALSE]
8. Clustering can also detect outliers [TRUE]
9. These outliers exist far outside the entirety of a data set [GLOBAL OUTLIERS]
10. Identify at least one of the two principal reasons for eliminating a variable: (use lowercase for your answer) [redundancy or irrelevancy]
11. Variable clustering is about grouping the attributes with similarities [TRUE]
12. Supply the missing values given for the attribute salary.
Salary: 16000, 12000, 17500, 29000 [18,625]
13. It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features. [FEATURE EXTRACTION]
14. The figure below illustrates the basic steps for what type of variable selection method? [FORWARD SELECTION]
15. Forward selection is the simplest variable selection model [FALSE]
16. These are variables that significantly influence Y and so should be in the model but are excluded [OMITTED VARIABLES]
17. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
18. The first step in stepwise procedure is to select the predictor most highly correlated with the target. [FALSE]
19. Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily [TRUE]
20. The procedure starts with an empty set of features [reduced set]. [FORWARD SELECTION]

16/20

1. Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18
Answer: Data Inconsistency
2. The following can be done to treat unsatisfactory response except: [RETURNING TO THE FIELD]
3. Supply the missing value. [AGD]
Degree: SMBA, WMA, AGD, AGD, WMA, WMA, -, AGD, AGD
4. The following are techniques to treat missing values except: [RETURNING TO THE FIELD]
5. Supply for the missing values. [19.6]
6. Enumerate at least one main category of business data anomalies (use lowercase for your answer) [GLOBAL OUTLIERS]
7. It is the process of integrating multiple databases, data cubes, or files [DATA INTEGRATION]
8. It is a manipulation of scale values to ensure comparability with variables with other scales: [SCALE TRANSFORMATION]
9. Resampling refers to the process of sampling at random and with replacement from a data set [TRUE]
10. Identify at least one of the two principal reasons for eliminating a variable: (use lowercase for your answer) [REDUNDANCY]
11. It identifies the set of input variables that jointly explain the maximum amount of variance contained in the target. [UNSUPERVISED LEARNING]
12. It involves both running the analysis to create unique clusters or segments and evaluating or describing the clusters that are created in the analysis [CLUSTER ANALYSIS]
13. Variable clustering is about grouping the attributes with similarities [TRUE]
14. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
15. When a subset of data points within a set is anomalous to the entire dataset, those values are: [COLLECTIVE OUTLIERS]
16. It is review of the questionnaires in order to increase accuracy and precision of the collected data: [EDITING]
17. Forward selection is the opposite of stepwise selection [FALSE]
18. It is the simplest of all variable selection procedures and can be easily implemented without special software (use lowercase for your answer) [BACKWARD SELECTION]
19. It starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set [STEPWISE FORWARD AND BACKWARD SELECTION] or [STEPWISE BACKWARD ELIMINATION]
20. Backward selection starts with all the variables [TRUE]

Perform the regression on the full model; that is, AGD


using all available variables. For example, perhaps
the full model has four variables, x1, x2, x3, x4. -
WMA
Backward Selection

WMA
Estimation is about estimating the value for the
- target variable except that the target variable is
categorical rather than numeric. True
AGD
The figure below illustrates the first step in
doing backward selection.
AGD
These outliers exist far outside the entirety of a data
set. Global outliers
False
These are also known as point anomalies. Global
outliers Forward selection is the opposite of stepwise
selection. False
Anomaly detection can cause a bad user experience.
True When Stephen tried to change the section of all
students enrolled to his class however, upon performing
Anomaly detection is also known as outlier analysis. the query, only one data record was modified instead of
True all the records. What data anomaly was present in
Stephen’s database?
It is a best practice to divide your dataset into train
and test dataset. True Update anomaly

It fits and performs variable selection on an ordinary A review of the questionnaires is essential in order
least square regression predictive model. Linear to:
regression selection
Increase accuracy and precision of the collected
It is a useful tool for data reduction, such as choosing data
the best variables or cluster components for
analysis. (Use lowercase for your answer) variable These are variables that significantly influence Y
clustering and so should be in the model but are excluded.
The simplest of all variable selection procedures is Omitted variables
stepwise procedure. FALSE
The first step in stepwise procedure is to select the
Backward selection starts with all the variables. True predictor most highly correlated with the target.
It is about estimating the value for the target False
variable except that the target variable is categorical
rather than numeric. Estimation It performs a greedy search to find the best
performing feature subset. It iteratively creates
Prior to variable selection, one must identify outliers
models and determines the best or the worst
and influential points - maybe exclude them at least
performing feature at each iteration.
temporarily. True
Recursive feature elimination
Variable clustering is about grouping the attributes
with similarities. True When a subset of data points within a set is
Resampling refers to the process of sampling at
anomalous to the entire dataset, those values are:
random and with replacement from a data set. True Collective outliers
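The backward elimination idea that several of the items above describe (start with all variables, repeatedly drop the one whose removal hurts the fit least, stop when any further removal clearly degrades the model) can be sketched in a few lines. This is an illustrative pure-Python sketch, not from the course materials; the function names, the tolerance, and the toy data are all made up for the example.

```python
# Sketch of backward elimination: fit the full model, drop the variable
# whose removal increases the error least, repeat until any removal hurts.
import random

def ols_sse(X, y):
    """Sum of squared errors of an ordinary least squares fit (normal equations)."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Solve xtx * b = xty by Gauss-Jordan elimination with partial pivoting.
    a = [r[:] + [v] for r, v in zip(xtx, xty)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col and a[col][col] != 0:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(bi * xi for bi, xi in zip(b, row))) ** 2
               for row, yi in zip(X, y))

def backward_eliminate(X, y, names, tol=1.5):
    """Drop variables while the best reduced model's SSE stays within tol * base."""
    keep = list(range(len(names)))
    while len(keep) > 1:
        base = ols_sse([[row[i] for i in keep] for row in X], y)
        # Score every candidate one-variable removal.
        trials = [(ols_sse([[row[j] for j in keep if j != i] for row in X], y), i)
                  for i in keep]
        best_sse, drop = min(trials)
        if base > 0 and best_sse > tol * base:  # removal hurts too much: stop
            break
        keep.remove(drop)
    return [names[i] for i in keep]

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * r[0] - 2 * r[1] + random.gauss(0, 0.1) for r in X]  # x3, x4 are pure noise
selected = backward_eliminate(X, y, ["x1", "x2", "x3", "x4"])
print(selected)  # ['x1', 'x2']
```

The same greedy loop, run in the other direction (adding the best variable at each step), gives forward selection; recursive feature elimination follows the same pattern but ranks features by a model's own importance scores instead of the refit error.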
[IT0089] MODULE 7 – CLASSIFICATION ALGORITHM

- SUPERVISED CLASSIFICATION
- VARIOUS MEASUREMENT SCALES
- GENERALIZATION
- NONLINEARITIES AND INTERACTIONS
- VARIABLE ANNUITY DATA SET
- MODEL SELECTION
- OUTCOME PRESENTATION

DATA SPLITTING
- A common predictive modeling practice is to build models from a sample with a primary outcome proportion different from the original population.

SEPARATE SAMPLING
- Target-based samples are created by considering the primary outcome cases separately from the secondary outcome cases.

STRATIFIED RANDOM SAMPLING

CLASSIFICATION
- It is the task of assigning objects to one of several predefined categories; a pervasive problem that encompasses many diverse applications.

THE MODELING SAMPLE
+ Similar predictive power with smaller case count
- must adjust assessment statistics and graphics
- must adjust prediction estimates for bias

ALGORITHMS FOR CLASSIFICATION
• Linear and Nonlinear Regression Models
• Naïve Bayes Classifier
• Decision Trees

THE OPTIMISM PRINCIPLE

WHAT IS A DECISION TREE?
- A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision.
- The main components of a decision tree are decision points, represented by nodes, and the actions and specific choices taken from a decision point.

SAMPLE SCENARIO

KEY TERMS
Leaf – represents the value, based on the values given by the input variables, in the path running from the root node to the leaf.
- Decision trees always start with a root node and end on a leaf. Notice that the trees don't converge at any point; they split their way out as the nodes are processed.

IDENTIFY THE ROOT NODE

APPLICATIONS
- Marketers use decision trees to segment customers by type and predict whether a customer will buy a specific type of product.
- In the medical field, decision tree models have been designed to diagnose blood infections or even predict heart attack outcomes in chest pain patients. Variables in the decision tree include diagnosis, treatment, and patient data.
- The gaming industry now uses multiple decision trees in movement recognition and facial recognition. The Microsoft Kinect platform uses this method to track body movement.

MODULE 7 – SUBTOPIC 1

WHAT IS PREDICTIVE MODELING?
- a commonly used statistical technique to predict future behavior.
- Predictive modeling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.

MOST COMMON PREDICTIVE MODELING TECHNIQUES
1. Logistic Regression
2. Clustering
3. Decision Tree
4. Random Forest
5. K-nearest neighbor
6. XGBoost

BENEFITS
1. Demand forecasting
2. Workforce planning and churn analysis
3. Forecasting of external factors
4. Analysis of competitors
5. Fleet or equipment maintenance
6. Modeling credit or other financial risks

TOOLS USED IN PREDICTIVE MODELING
1. R Programming - Used for statistics and data modeling. It can easily manipulate your data and present it in different ways.
2. Tableau Public - Free software that connects to any data source, be it a corporate data warehouse, Microsoft Excel, or web-based data, and creates data visualizations, maps, dashboards, etc., with real-time updates presented on the web.
3. Python - An object-oriented scripting language which is easy to read, write, and maintain, and is a free open-source tool.
4. SAS - A programming environment and language for data manipulation and a leader in analytics. SAS is easily accessible and manageable and can analyze data from any source.
5. RapidMiner - One of the best predictive analysis systems developed. It provides an integrated environment for deep learning, text mining, machine learning, and predictive analysis.
6. Orange - A software suite for machine learning and data mining. It best aids data visualization and is component-based software. It is written in the Python computing language.
7. Weka - Best suited for data analysis and predictive modeling. It contains algorithms and visualization tools that support machine learning.

APPLICATIONS
- Target Marketing
- Attrition Prediction
- Credit Scoring
- Fraud Detection

SOME CHALLENGES
- Operational/Observational
- Massive
- Errors and Outliers
- Missing Values

ANALYTICS PROFESSIONALS OFTEN USE DATA FROM THE FF SOURCES TO FEED PREDICTIVE MODELS:
- Transaction data
- CRM data
- Customer service data
- Survey or polling data
- Digital marketing and advertising data
- Economic data
- Demographic data
- Machine-generated data (i.e. telemetric data or data from sensors)
- Geographical data
- Web traffic data
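The data splitting and stratified random sampling ideas in this module can be sketched in a few lines: sample at random within each target class so that the train and test sets keep the original class proportions. This is an illustrative sketch, not from the course materials; the function name, fraction, and toy records are made up for the example.

```python
# Stratified random split: shuffle and cut WITHIN each target class so the
# class proportions of the original data survive in both partitions.
import random
from collections import defaultdict

def stratified_split(records, get_label, test_frac=0.25, seed=42):
    by_class = defaultdict(list)
    for rec in records:                          # group records into strata
        by_class[get_label(rec)].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)                       # random sampling within the stratum
        cut = int(round(len(group) * test_frac))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# 20 "primary" outcome cases and 80 "secondary" outcome cases
data = [("case%d" % i, "primary" if i < 20 else "secondary") for i in range(100)]
train, test = stratified_split(data, get_label=lambda r: r[1])
print(len(train), len(test))  # 75 25
```

Both partitions end up 20% primary / 80% secondary, matching the population; a plain random split on a rare primary outcome could easily leave one partition with almost no primary cases.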

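Of the techniques listed above, a decision tree is the easiest to sketch by hand: internal (decision) nodes test one attribute, and each path from the root ends in a leaf holding a class label. The tiny tree below is for the credit-risk style problem discussed in this module; the attributes and thresholds are invented purely for illustration.

```python
# A hand-built decision tree: (attribute, threshold, left_branch, right_branch);
# a bare string is a leaf/terminal node holding the predicted class.
TREE = ("savings", 50_000,                       # root node: split on savings
        ("income", 30_000,                       # decision node (low savings)
         "bad credit risk",                      # leaf
         "good credit risk"),                    # leaf
        "good credit risk")                      # leaf (high savings)

def classify(record, node=TREE):
    """Walk from the root to a leaf using the record's attribute values."""
    if isinstance(node, str):                    # reached a leaf/terminal node
        return node
    attr, threshold, left, right = node
    branch = left if record[attr] < threshold else right
    return classify(record, branch)

print(classify({"savings": 10_000, "income": 20_000}))  # bad credit risk
print(classify({"savings": 80_000, "income": 20_000}))  # good credit risk
```

Classifying a record is exactly the "pass the data record through the tree using its attribute values" process described in the formatives; algorithms such as CART or C4.5 differ only in how they *learn* the splits from training data.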
MODULE 7 – SUBTOPIC 2

DECISION TREE
- a graphical representation of possible outcomes to a decision based on certain conditions.
- It's called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree.

KEY TERMS
Root Node (Top Decision Node)
- It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
Splitting
- It is the process of dividing a node into two or more sub-nodes.
Decision Node
- When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node
- Nodes with no children (no further splits) are called leaf or terminal nodes.
Pruning
- When we reduce the size of decision trees by removing nodes (the opposite of splitting), the process is called pruning.
Branch/Sub-Tree
- A subsection of the decision tree is called a branch or sub-tree.
Parent and Child Node
- A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

VISUALIZING DECISION TREE

SIMPLE DECISION TREE PROBLEM
- Analyzing credit risk, with potential customers being classified as either good or bad credit risks.

GENERATED DECISION TREE

REQUIREMENTS OF DECISION TREE
1. Decision tree algorithms represent supervised learning, and as such require preclassified target variables. A training data set must be supplied, which provides the algorithm with the values of the target variable.
2. This training data set should be rich and varied, providing the algorithm with a healthy cross section of the types of records for which classification may be needed in the future.
3. The target attribute classes must be discrete. That is, one cannot apply decision tree analysis to a continuous target variable. Rather, the target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging.

ADVANTAGES
- Simple to understand and interpret
- Have value even with complex data
- Can be combined with other decision techniques

KEY IDEAS
- Decision trees are drawn from top-to-bottom or left-to-right
- The top (or left-most) node is called the root node
- Descendant node(s) are called child node(s)
- The bottom (or right-most) nodes are called leaf nodes
- A unique path from the root to each leaf is called a rule

TYPES OF DECISION TREES
1. Binary trees – only two choices in each split. Can be non-uniform (uneven) in depth.
2. N-way trees – three or more choices in at least one of its splits.

DECISION TREE ALGORITHMS
• Hunt's Algorithm
• CART
• ID3
• C4.5
• SLIQ
• SPRINT
• CHAID

FORMATIVES:

M7 16/20
1. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in the split criterion. - False
2. A decision tree can be large enough to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record. - True
3. A subsection of the decision tree - Sub-tree
4. Which of the following does not define a leaf node? - Also called an internal node
5. A node which is divided into sub-nodes is? - Parent node
6. The topmost node in a decision tree is known as the internal node. - False
7. Decision trees can work on continuous target variables - True
8. It is also called a terminal node - Leaf node
9. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record. - True
10. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. - Binary trees
11. It is also called a terminal node - Leaf node
12. It is an unsupervised learning technique, as it describes how the data is organized without using an outcome. - Clustering
13. An efficient model was defined as provide at least 1. - Decision tree
14. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item - True
15. An objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future. - False
16. It is also called a terminal node - Leaf node
17. A heterogeneous data set is a data set whose data records have the same target value. - True
18. A decision tree can be large enough to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record. - True
19. Segments profile is the tool available in SAS Enterprise Miner to execute sequence analysis - False
20. A heterogeneous data set is a data set whose data records have the same target value. - True

M7 18/20
1. Leaf node is also known as internal node - False
2. A node which is divided into sub-nodes is ________ - Parent node
3. The encircled part shown in the picture below illustrates the: - Decision node
4. SAS Enterprise Guide does not include a full programming interface - False
5. Child nodes are also called sub-nodes - True
6. It trains forest predictive models by fitting multiple decision trees - Forest selection
7. It trains a decision tree predictive model - Decision tree selection
8. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in the split criterion. - False
9. It trains a gradient boosting predictive model by fitting a set of additive decision trees - Gradient boosting selection
10. It is the process of dividing a node into two or more sub-nodes - Splitting
11. Assessing whether a mortgage application is a good or bad credit risk is an example of classification - True
12. It is the process of integrating multiple databases, data cubes, or files. - Data Integration
13. Which Apache product is used for managing real-time transactions such as logs and events? - Apache Kafka
14. Which of the following is not an example of classification? - Estimating the grade point average (GPA) of a graduate student, based on that student's undergraduate GPA
15. Factor is the variable being manipulated by researchers - True
16. Determining whether a will was written by the actual deceased or fraudulently by someone else is an example of classification. - True
17. It is the scientific domain that's dedicated to knowledge discovery via data analysis - Data Science
18. The model can be used to classify or predict the outcome of interest in new cases where the outcome is unknown. - True
19. Behavioral analytics are also part of pattern discovery - True
20. It is a commonly used statistical technique to predict future behavior. (Use lowercase for your answer)

M7 13/20
1. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. - Binary trees
2. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record. - True
3. If a data set has a numeric attribute variable, the variable needs to be transformed into a categorical variable before being used to construct a decision tree. - True
4. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. - Binary trees
5. Sub-nodes are the subsections of the decision tree - True
6. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record. - True
7. Decision trees can work on continuous target variables - False
8. It trains a gradient boosting predictive model by fitting a set of additive decision trees - Gradient boosting selection
9. It is the process of dividing a node into two or more sub-nodes - Splitting
10. It is also called a terminal node - Leaf node
11. Factor is the variable being manipulated by researchers - False
12. It is a commonly used statistical technique to predict future behavior - Predictive modeling
13. Movie recommendation systems are an example of: 1. Classification 2. Clustering 3. Reinforcement learning 4. Regression - 2 and 3
14. A subsection of the decision tree - Sub-nodes
15. The objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future - True
16. SAS Enterprise Guide does not include a full programming interface - False
17. An efficient model was defined as provide at least 1. - Weka
18. It is the scientific domain that's dedicated to knowledge discovery via data analysis - Data Science
19. A heterogeneous data set is a data set whose data records have the same target values - True
20. The following can be done when dealing with categorical inputs except: - Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.

Other:
1. The encircled part shown in the picture below illustrates the: • decision node
2. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. • Binary trees
3. It trains a gradient boosting predictive model by fitting a set of additive decision trees. • Gradient boosting selection
4. Decision node represents the entire population or sample and this further gets divided into two or more homogeneous sets • True
5. It is the scientific domain that's dedicated to knowledge discovery via data analysis. • Data Science
6. It aims at dividing the data set into groups. • Clustering
7. Predicting the percentage increase in traffic deaths next year if the speed limit is increased is an example of clustering. • False
8. Provide an example of predictive models: 1. Logistic regression 2. Clustering 3. Decision Tree 4. Random forest 5. K-nearest neighbor 6. XGBoost
9. Behavioral analytics are also part of the pattern discovery. • True
10. The objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future. • False
11. It is the process of integrating multiple databases, data cubes, or files. • Data Integration
12. A decision tree can be large enough to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record. • True
13. It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. • Multiple linear regression
14. Root node is a graphical representation of possible outcomes to a decision based on certain conditions. • False
15. Give at least one decision tree algorithm. (Use UPPERCASE for your answer) • Hunt's Algorithm, CART, ID3, C4.5, SLIQ, SPRINT, CHAID
16. Pruning is a process of dividing a node into two or more sub-nodes. • False
17. It is a graphical representation of possible outcomes to a decision based on certain conditions. • decision tree
18. A node which is divided into sub-nodes is ______________. • Parent node
19. Child nodes are also called sub-nodes. • True
20. It trains forest predictive models by fitting multiple decision trees. • Forest selection
21. It is the process of dividing a node into two or more sub-nodes. • Splitting
22. Sub-nodes are the subsections of the decision tree. • True
23. A subsection of the decision tree. • Sub-tree
24. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record. • True
25. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in a split criterion. • False
26. Which Apache product is used for managing real-time transactions such as logs and events? • Apache Kafka
27. As you build tasks, SAS Enterprise Guide generates SAS code. • True
28. An efficient model was defined as provide at least 1. • N-way trees
29. It is a famous data mining method which requires that the data consist of two variables: a transaction and an item. • Association Rule
30. It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. • Multiple linear regression
31. Pruning is a process of dividing a node into two or more sub-nodes. • False
32. Decision trees can work on continuous target variables. • True
33. Which of the following is not an example of classification? • Estimating the grade point average (GPA) of a graduate student, based on that student's undergraduate GPA
34. There is no guarantee that making locally optimal decisions at separate times leads to the smallest decision tree or a globally optimal decision. • True
35. In cluster analysis, the goal is to identify distinct groupings of cases across a set of inputs. • True • False
36. In cluster analysis, the goal is to partition cases from a cloud of data (data that doesn't necessarily have distinct groups) into contiguous groups. • False
37. SAS Enterprise Guide does not include a full programming interface. • False
38. Determining whether a will was written by the actual deceased, or fraudulently by someone else, is an example of classification. • True
39. It aims at dividing the data set into groups. • Clustering
40. Predicting the percentage increase in traffic deaths next year if the speed limit is increased is an example of clustering. • False
41. The following can be done when dealing with categorical inputs except:
• Use smoothed weight of evidence coding to convert the categorical input to a continuous input.
• Collapse the categories based on the number of observations in a category
• No answer text provided.
• Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.
42. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. • Binary trees
43. It pertains to the process of reducing the size of decision trees by removing nodes. • Pruning
44. Which of the following is not an algorithm used for streaming features?
• Alpha-investing algorithm
• ANOVA
• OSFS
• Grafting algorithm

Note: some of the answers above are wrong and need corrections.

Module 8 - CLUSTER ANALYSIS

Unsupervised Classification
(Diagram: inputs → grouping → cluster 1, cluster 2, cluster 3)
Unsupervised classification: grouping of cases based on similarities in input values.

What is Cluster Analysis?
Naturally occurring groups?
Yes - Cluster Analysis
No - Segmentation

Clustering
"Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual."
Everitt (1998), The Cambridge Dictionary of Statistics

Clustering in real life
- While you have thousands of customers, there are really only a handful of major types into which most of your customers can be grouped.
• Bargain hunter
• Man/woman on a mission
• Impulse shopper
• Weary Parent
• DINK (dual income, no kids)
A case study

K-means Algorithm - an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.

Steps in performing K-means (Training Data)
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.

Euclidean Distance
• Euclidean distance gives the linear distance between any two points in n-dimensional space.
• It is a generalization of the Pythagorean theorem.
• In two dimensions, the distance from the origin (0, 0) to a point (x1, x2) is DE = √(x1² + x2²); more generally, DE = √((x2 − x1)² + (y2 − y1)²) for points (x1, y1) and (x2, y2).

- The objective is to identify the features, or combination of features, that uniquely describe each cluster.
• small investors
• younger, big investors
• older, big investors

SUBTOPIC 1. INTRODUCTION TO CLUSTERING

Use Clustering to identify fraud
- Most fraudulent customer activity is difficult to identify by a single variable. Are there unusual combinations of behaviors that can help identify criminal activity or fraud?
- A customer who rarely travels starts making purchases in many foreign countries. Fraud alert!
- A customer who has never shopped online before begins to make many online purchases. Fraud alert!

Clustering for Store Location
- You want to open new grocery stores in the U.S. based on demographics. Where should you locate the following types of new stores?
• Low-end budget grocery stores
• Small boutique grocery stores
• Large full-service supermarkets
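The Euclidean distance described above can be written out directly for n-dimensional points; for two dimensions it reduces to the Pythagorean theorem. This is a small illustrative sketch, not from the course materials.

```python
# Euclidean distance between two n-dimensional points:
# DE = sqrt(sum over dimensions of (difference)^2)
import math

def euclidean(a, b):
    return math.sqrt(sum((x2 - x1) ** 2 for x1, x2 in zip(a, b)))

print(euclidean((0, 0), (3, 4)))        # 5.0  (the 3-4-5 right triangle)
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0  (works in any dimension)
```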
Segmentation Analysis
- When no clusters exist, use the K-means algorithm to partition cases into contiguous groups.

Cluster Profiling
- Cluster profiling can be defined as the derivation of a class label from a proposed cluster solution.
Example Problem
Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2,10), A2(2,5), A3(8,4), A4(5,8), A5(7,5), A6(6,4), A7(1,2), A8(4,9).
Initial cluster centers: A1(2,10), A4(5,8), and A7(1,2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as p(a, b) = |x2 - x1| + |y2 - y1|.

Sample calculations for point A1(2,10):
p(A1, Mean1) = |2 - 2| + |10 - 10| = 0 + 0 = 0, where Mean1 = (2, 10)
p(A1, Mean2) = |5 - 2| + |8 - 10| = 3 + 2 = 5, where Mean2 = (5, 8)

Solution

Iteration 1 - centers (2,10), (5,8), (1,2):

Point     | Dist Mean1 | Dist Mean2 | Dist Mean3 | Cluster
A1 (2,10) |     0      |     5      |     9      |    1
A2 (2,5)  |     5      |     6      |     4      |    3
A3 (8,4)  |    12      |     7      |     9      |    2
A4 (5,8)  |     5      |     0      |    10      |    2
A5 (7,5)  |    10      |     5      |     9      |    2
A6 (6,4)  |    10      |     5      |     7      |    2
A7 (1,2)  |     9      |    10      |     0      |    3
A8 (4,9)  |     3      |     2      |    10      |    2

Cluster 1: A1(2,10); Cluster 2: A3(8,4), A4(5,8), A5(7,5), A6(6,4), A8(4,9); Cluster 3: A2(2,5), A7(1,2)

New centers:
Cluster 1: has one point, A1(2,10), which was the old mean (remains)
Cluster 2: ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 2 - centers (2,10), (6,6), (1.5,3.5):

Point     | Dist Mean1 | Dist Mean2 | Dist Mean3 | Cluster
A1 (2,10) |     0      |     8      |     7      |    1
A2 (2,5)  |     5      |     5      |     2      |    3
A3 (8,4)  |    12      |     4      |     7      |    2
A4 (5,8)  |     5      |     3      |     8      |    2
A5 (7,5)  |    10      |     2      |     7      |    2
A6 (6,4)  |    10      |     2      |     5      |    2
A7 (1,2)  |     9      |     9      |     2      |    3
A8 (4,9)  |     3      |     5      |     8      |    1

Cluster 1: A1(2,10), A8(4,9); Cluster 2: A3(8,4), A4(5,8), A5(7,5), A6(6,4); Cluster 3: A2(2,5), A7(1,2)

New centers:
Cluster 1 (points 1 & 8): ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
Cluster 2 (points 3, 4, 5 & 6): ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 3 - centers (3,9.5), (6.5,5.25), (1.5,3.5):

Point     | Dist Mean1 | Dist Mean2 | Dist Mean3 | Cluster
A1 (2,10) |    1.5     |    9.25    |     7      |    1
A2 (2,5)  |    5.5     |    4.75    |     2      |    3
A3 (8,4)  |   10.5     |    2.75    |     7      |    2
A4 (5,8)  |    3.5     |    4.25    |     8      |    1
A5 (7,5)  |    8.5     |    0.75    |     7      |    2
A6 (6,4)  |    8.5     |    1.75    |     5      |    2
A7 (1,2)  |    9.5     |    8.75    |     2      |    3
A8 (4,9)  |    1.5     |    6.25    |     8      |    1

Final clusters:
Cluster 1 (points 1, 4 & 8): A1(2,10), A4(5,8), A8(4,9); new center ((2 + 5 + 4)/3, (10 + 8 + 9)/3) = (3.67, 9)
Cluster 2 (points 3, 5 & 6): A3(8,4), A5(7,5), A6(6,4); new center ((8 + 7 + 6)/3, (4 + 5 + 4)/3) = (7, 4.33)
Cluster 3 (points 2 & 7): A2(2,5), A7(1,2); new center ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Reassigning the points to these centers leaves the clusters unchanged, so the algorithm has converged.

Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3-5 until convergence or termination.
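The worked example above can be reproduced in a few lines of Python, using the same Manhattan distance p(a, b) = |x2 - x1| + |y2 - y1| and the same initial centers A1, A4, A7. This is an illustrative sketch (not from the course materials); the function names are made up, but it converges to the same clusters derived by hand.

```python
# K-means on the eight points of the worked example, with Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def kmeans(points, centers):
    assignment = None
    while True:
        # Step 3: assign each point to its nearest cluster center.
        new = {name: min(range(len(centers)),
                         key=lambda c: manhattan(p, centers[c]))
               for name, p in points.items()}
        if new == assignment:            # Step 5: stop once nothing moves
            return assignment, centers
        assignment = new
        # Step 4: move each center to the centroid of its cluster.
        for c in range(len(centers)):
            members = [points[n] for n, cl in assignment.items() if cl == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))

assignment, centers = kmeans(points, [(2, 10), (5, 8), (1, 2)])
clusters = [sorted(n for n, c in assignment.items() if c == i) for i in range(3)]
print(clusters)  # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
```

The result matches the hand calculation: Cluster 1 = {A1, A4, A8}, Cluster 2 = {A3, A5, A6}, Cluster 3 = {A2, A7}.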

Application of K-means using SAS Enterprise Miner
- Enterprise Miner profile of International Plan adopters across clusters
- VoiceMail Plan adopters and nonadopters are mutually exclusive

SUBTOPIC 2. K-MEANS ALGORITHM

What is a cluster and cluster analysis?
• A cluster is a group of similar objects
• Cluster analysis is a set of data-driven partitioning techniques designed to group cases into clusters such that the degree of association or similarity is strong between members of the same cluster

What is the k-means algorithm?
- K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
- The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

K-means algorithm process:
Step 1: Ask the user how many clusters k the data set should be partitioned into.
Step 2: Randomly assign k records to be the initial cluster center locations.
Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center "owns" a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, ..., Ck.

Clusters of Users
• Cluster 1: Sophisticated Users. A small group of customers who have adopted both the International Plan and the VoiceMail Plan.
• Cluster 2: The Average Majority. The largest segment of the customer base, some of whom have adopted the VoiceMail Plan but none of whom have adopted the International Plan.
• Cluster 3: Voice Mail Users. A medium-sized group of customers who have all adopted the VoiceMail Plan but not the International Plan.
Concept of K-means 3. Predicting house price based on the size of
- The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations produce different results; the better choice is to place them as far away from each other as possible.
- The next step is to take each point belonging to a given data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done.
points exhibiting certain similarities
- True
Advantages
6. It can be defined as the derivation of a class
• Fast, robust and easier to understand. label from a proposed cluster solution
• Relatively efficient: - Cluster Profiling
• Gives best result when data set are distinct or 7. Clustering is supervised classification
well separated from each other. - False
8. There is a separate “quality” fuction that
Disadvantages measured the “goodness” of a cluster
• The learning algorithm requires apriori - True
specification of the number of cluster centers. 9. Identify if the statement is a clustering or
• The use of Exclusive Assignment - If there are not:
two highly overlapping data then k-means will Identifying groups of motor insurance policy
not be able to resolve that there are two clusters. holders with a high average claim cost.
• The learning algorithm is not invariant to non-
- True
linear transformations i.e. with different
10. Cluster analysis is a statistical technique
representation of data we get different results
used to identify how various units—like
(data represented in form of cartesian
people, groups, or societies—can be
coordinates and polar co-ordinates will give
grouped together because of the
different results).
characteristics they have in common
• Euclidean distance measures can unequally
- True
weight underlying factors.
11. The easiest and simplest clustering
• The learning algorithm provides the local optima
algorithm that is widely used because of its
of the squared error function.
simple methods of implementation is called
• Randomly choosing of the cluster center cannot k-means algorithm
lead us to the fruitful result
- True
• Applicable only when mean is defined i.e. fails 12. K-means algorithm can be used in
for categorical data. forecasting car plant electricity usage.
• Unable to handle noisy data and outliers. - False
• Algorithm fails for non-linear data set. 13. Natural language processing is an example
of clustering
- True
FORMATIVES: 14. K-means cannot handle noisy data and
outliers
M8 16/20
- True
1. Clustering analysis in negatively affected by 15. Data point belonging to different clusters
heteroscedasticity have low degree of dissimilarity
- False - True
2. It requires to specify the number of clusters 16. Data point belonging to different clusters
(k) in advance is an advance of k-means have low degree of dissimilarity
- True - True
17. This method is used to quickly cluster large
datasets. Here, researchers define the
number of clusters prior to performing the 7. K-Means clustering is an unsupervised iterative
actual study clustering technique.
- Hierarchical Cluster
- True
18. The Euclidean distance from each case in
8. There is a separate “quality” function that
the training data to each cluster center is
measures the “goodness” of a cluster. - google
calculated
- True
- True
9. Statistical analysis a set of methods for
19. Which is the following is not example of a
cluster analysis? constructing a (hopefully) sensible and
- Price Prediction informative classification of an initially
20. You can randomly select any k data point as unclassified set of data, using the variable
cluster center values observed on each individual.”
- True - False
10. What is the minimum no. of variables/ features
required to perform clustering?
FAM - FA8 - 18/20 - 2
- 0
- 3
- 1
1. Identify if the statement is a clustering or not:
11. Clustering is useful in software evolution as it
Cluster analysis may be used to identify helps to reduce legacy properties in code by
whether a tumour is malignant or if it is benign. reforming functionality that has become
dispersed. It is a form of restructuring and
- True hence is a way of direct preventative
2. Clustering is unsupervised classification. maintenance.

- False
- True 12. The easiest and simplest clustering algorithm
3. K-means algorithm has target label. that is widely used because of its simple
- False methods of implementation is called k-means
4. Identify if the statement is a clustering or not: algorithm.

For cities on fault lines, geologists use cluster - True


analysis to evaluate seismic risk and the 13. Unsupervised learning means that there is an
potential weaknesses of earthquake-prone outcome to be predicted.
regions. By considering the results of this - False
research, residents can do their best to prepare 14. This method is used to quickly cluster large
mitigate potential damage. datasets. Here, researchers define the number
of clusters prior to performing the actual study.
- True
5. Market basket partitions the given data set into - Two-Step Cluster
k predefined distinct clusters. - Hierarchical Cluster
- False - K-Means Cluster
6. Identify if the statement is a clustering or not:

Observed earth quake epicenters should be


clustered along continent faults
15. Low kurtosis and skewness statistics on the inputs avoid creating single-case outlier clusters.
- True
16. Clustering may be used to identify different evolving species or subspecies.
- True
17. K-means can handle noisy data and outliers.
- False
18. K-means cannot handle noisy data and outliers.
- True
19. Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data.
- True
20. When your data has outliers, there's a tendency that the results of the k-means algorithm would be inaccurate.
- True
21. Its objective is to identify the features, or combination of features, that uniquely describe each cluster.
- Association Rule Mining
- Segmentation
- Cluster profiling
22. It can be defined as the derivation of a class label from a proposed cluster solution.
- Cluster profiling
- Segmentation
- Association Rule Mining
23. Clustering is supervised classification.
- False
24. Is it possible that assignment of observations to clusters does not change between successive iterations in K-Means
- True
25. Song recommendation is an example of cluster analysis.
- True
26. In association, you select cluster centers in such a way that they are as far as possible from each other.
- False
27. A cluster is defined as a collection of data points exhibiting certain similarities.
- True
28. Natural language processing is an example of clustering.
- True
29. Which of the following is not an example of cluster analysis?
- Price prediction
- Song recommendation
- For cities on fault lines, geologists use cluster analysis to evaluate seismic risk and the potential weaknesses of earthquake-prone regions. By considering the results of this research, residents can do their best to prepare and mitigate potential damage.
- Observed earth quake epicenters should be clustered along continent faults
30. K-means partitions the given data set into k predefined distinct clusters.
- True
31. Data points belonging to different clusters have low degree of dissimilarity.
- True
32. Movie Recommendation systems are an example of clustering analysis.
- True
33. Data points belonging to one cluster have low degree of similarity.
- False
34. Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects) on the basis of a set of measured variables into a number of different groups such that similar subjects are placed in the same group. (https://www.sheffield.ac.uk/mash/statistics/multivariate#:~:text=It%20also%20covers%20using%20Factor%20analysis%20as%20a%20classification%20tool%20in%20practice.&text=Cluster%20analysis%20is%20a%20multivariate,placed%20in%20the%20same%20group.)
- True
35. K-means algorithm can be used in forecasting car plant electricity usage.
- True
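The k-means procedure summarized in the notes above (randomly pick k data points as the initial centers, assign every point to its nearest center by Euclidean distance, move each center to the mean of its members, and stop once assignments no longer change between successive iterations) can be sketched in a few lines of Python. This is a minimal illustration on invented toy data, not a reference implementation from the course:

```python
import random

def kmeans(points, k, max_iter=100, seed=42):
    """Minimal k-means sketch: random data points as initial centers,
    squared-Euclidean assignment, mean update, stop on stable assignment."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # any k data points can serve as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to the nearest center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        # Update step: each center becomes the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # if a cluster empties out, keep its old center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment

# Two well-separated toy blobs; k-means recovers one cluster per blob.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.1, 7.9), (7.8, 8.2)]
centers, labels = kmeans(data, k=2)
print(labels[:3], labels[3:])  # the first three points share one label, the last three the other
```

On data this well separated the loop settles after a few iterations; on overlapping or noisy data the result depends on the random initial centers, which is exactly the weakness the "Disadvantages" list points out.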
FINAL EXAM REVIEW

1. Identify whether the statement is null or alternative hypothesis: all iPhone 6 Plus units weigh 6.77 ounces.
- Null hypothesis
2. Association rule mining is about grouping similar samples into segments
- False
(That would be cluster analysis. Association rule mining refers to market basket analysis: looking for objects or items that co-occur, i.e., are present at the same time, in a specific database.)
3. Behavioral analytics are also part of pattern discovery.
- True; the behavior of a customer follows a certain pattern.
4. Market Basket Analysis creates If-Then scenario rules. The "THEN" part is called the ________.
- Consequent
5. It measures the proportion of data records that contain the item set X.
- Support
(In a given data set, it is concerned with the overall impact of X.)
6. In Association rule mining, the more rules you produce, the greater the risk of the generated model.
- True
(The more rules you generate, the riskier your model is, because it can no longer tell which of the combinations is significant.)
7. What variable selection is best to utilize if there are 100 columns available in the dataset?
- Forward selection
8. Which of the following types of association rules pertains to a rule that is well known by an expert within the business?
- Trivial rules (rules you already know will come out when you run the mining)
(Inexplicable rule: a rule you cannot explain because you are not part of the company; you need expert advice for this. Actionable rule: a rule you discover that is new, i.e., high-quality information.)
9. The data is bimodal. Data: 77 82 74 81 79 84 74 82
- True (two modes exist: 74 and 82 each appear twice)
10. The corona outbreak is an example of seasonal variations
- False (it does not recur every year)
(Seasonal variation happens periodically every year or month, meaning the trend is ongoing.)
11. A technique that enables machines to mimic human behavior
- AI
(Machine learning: creating a machine or program that automatically learns from the data being fed to it; it works automatically based on historical data. Deep learning: a subset of machine learning that tries to mimic the human brain.)
12. If the lift ratio is <1, it means that having the antecedent does not increase the chances of having the consequent.
- True
(If the lift ratio is >1, there is a greater chance that someone will buy product B if they buy product A.)
13. Identify the third step in CRISP-DM
- Data preparation
14. A special case of categorical with just two categories
- Binary
15. You can use a boxplot to detect outliers
- True
16. Which of the following does not define a leaf node?
- Also called an internal node
17. Student height is a categorical variable
- False
18. The simplest of all variable selection procedures is the stepwise procedure
- False
19. You conduct a survey where the respondents could choose from good, better, best, excellent. What would be the data type?
- Ordinal
(It is a Likert-type scale; the responses have an order.)
20. A good rule of thumb for having the right amount of data is to have 10 records for every predictor variable
- True
21. The standard deviation can never be a negative number
- True
22. The electricity usage is an example of ______.
- Seasonal variations
(You pay monthly; usage resets, and the trend is monthly: it rises and falls month after month. In short, the scenario repeats monthly.)
(Cyclical: like a seasonal variation but with a wider gap between occurrences, e.g., the corona virus, or an economic depression that happens every 5 or 10 years.)
23. What is the minimum no. of variables/features required to perform clustering?
- 1
(We can group records based on 1 variable alone.)
24. Which of the following is not an advantage of association rule mining?
- Assumes transaction database is memory resident
25. Movie Recommendation systems are an example of clustering analysis.
- True
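The support and lift measures that appear in the review items above (support = proportion of records containing item set X; lift < 1 means the antecedent does not raise the chance of the consequent) can be computed directly. The transaction database below is invented purely for illustration, and `support`, `confidence`, and `lift` are hypothetical helper names, not course-provided functions:

```python
from fractions import Fraction

# Toy transaction database, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Proportion of records that contain the item set (review item 5)."""
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

def confidence(antecedent, consequent):
    """Share of records with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Lift > 1: the antecedent increases the chance of the consequent;
    lift < 1: it does not (review item 12)."""
    return confidence(antecedent, consequent) / support(consequent)

# Rule: {diapers} -> {beer}
print(support({"diapers"}))              # 4/5 (4 of the 5 records contain diapers)
print(confidence({"diapers"}, {"beer"})) # 3/4 (3 of those 4 also contain beer)
print(lift({"diapers"}, {"beer"}))       # 5/4 (> 1, so diapers raise the chance of beer)
```

Using `Fraction` keeps the ratios exact, which makes the lift > 1 versus lift < 1 comparison from item 12 unambiguous.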