[IT0089] FUNDAMENTALS OF

MODULE 1 – STATISTICAL ANALYSIS
WHY DO WE SUMMARIZE DATA?
- The main objective of summarizing data is to easily identify the characteristics of the data to be processed.
- Summarizing your data will enable you to classify the normal values of your data and uncover the odd values present in your dataset.
WHAT IS RAW DATA?
- Raw data pertains to the collected data before it’s processed or ranked.
o Quantitative raw data
o Categorical raw data
- Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analysis.
TYPES OF STATISTICS
1. Descriptive Statistics
- consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures
2. Inferential Statistics
- consists of methods that use sample results to help make decisions or predictions about a population
DATA SET
- a collection of observations on one or more variables
VARIABLE
- a characteristic under study that assumes different values for different elements
OBSERVATION OR MEASUREMENT
- Pertains to a value of a variable.
QUANTITATIVE VARIABLES
- It pertains to a variable that can be measured numerically
- Data collected on a quantitative variable are called quantitative data
o Discrete Variables – a variable whose values are countable. It can assume only certain values.
o Continuous Variables – variables that can take on any value in an interval. Also called float, interval, or numeric.
QUALITATIVE VARIABLES
- Qualitative or categorical variable pertains to variables whose values cannot be measured numerically
ORDINAL VARIABLES
- Categorical data that has an explicit ordering
- Also called ordered factor
BINARY VARIABLES
- A special case of categorical with just two categories (0/1, True or False)
- Also called dichotomous, logical, or indicator
EXPLORATORY DATA ANALYSIS (EDA)
- It refers to the critical process of performing initial investigations on the data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations.
MEASURES OF CENTRAL TENDENCIES
- A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of
where most of the data are located (i.e. their central tendency)
MEAN
- Calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication)
MEDIAN
- The median is the middle value, taken when you arrange your numbers in order (rank)
- When calculating the median, one must remember the following:
o Arrange the data values of the given data set in increasing order (from smallest to largest).
o Find the value that divides the ranked data set in two equal parts
MODE
- It is the most frequent value in a sample
- It is calculated by working out how many there are of each value in your sample
- The one with the highest frequency is the mode
- It is possible to get tied frequencies, in which case you report both values
1. UNIMODAL - the distribution is said to be unimodal if it has only one value with the highest number of occurrences
2. BIMODAL - the data set contains two modes
3. MULTIMODAL - if the distribution has more than two modes
MEASURES OF DISPERSION
- A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out
RANGE
- The difference between the largest and the smallest value in a data set
- Range = largest value – smallest value
STANDARD DEVIATION
- The standard deviation is used when the data are normally distributed. You can think of it as a sort of “average deviation” from the mean.
- General formula for calculating standard deviation
- A small standard deviation means that the values in a statistical data set are close to the mean of the data set, on average
- A large standard deviation means that the values in the data set are farther away from the mean, on average
- The standard deviation can never be a negative number
- The smallest possible value for the standard deviation is 0
- The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set). That’s because the standard
deviation is based on the distance from the mean.
- The standard deviation has the same units as the original data
- Overall distance of all the points from your mean
1. Graphically – using frequency histograms or tally plots draws a picture of the sample shape.
2. Shape statistics – such as skewness and kurtosis
HISTOGRAM
- A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.
BOXPLOT
- A plot introduced by Tukey as a quick way to visualize the distribution of data.
- If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal
- If a histogram and frequency distribution are skewed to the right, it means that the value of the mean is the largest among the three and the mode has the lowest value
- If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value
SKEWNESS
- a measure of how central the average is in the distribution
- The skewness of a sample is a measure of how central the average is in relation to the overall spread of values
o A positive value indicates that the average is skewed to the left, that is, there is a long “tail” of more positive values
o A negative value indicates that the average is skewed to the right, that is, there is a
long “tail” of more negative values
KURTOSIS
- A measure of how pointy the distribution is
WHAT IS STATISTICS?
- Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data
- Rely upon the calculation of numbers
- Rely upon how the numbers are chosen and how statistics are interpreted
TYPES OF STATISTICS
1. Descriptive Statistics - describing and summarizing data sets using pictures and statistical quantities
2. Inferential Statistics - analyzing data sets and drawing conclusions from them
3. Probability - the study of chance events governed by rules (or laws)
WHAT IS HYPOTHESIS TESTING?
- a statistical procedure for testing whether chance is a plausible explanation of an experimental finding
- Null Hypothesis is usually denoted by H0 and the alternative hypothesis is denoted by H1
- The null hypothesis is a statement that is assumed to be true at the beginning of an analysis
- Suppose you are trying to test a certain case, initially, the person being questioned is not guilty. This initial verdict is the null hypothesis. The contrary of this claim is the alternative hypothesis
In statistics, Student’s t-test is a method developed by William Sealy Gosset used in testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown
• If the t value < critical value = don’t reject the null hypothesis
• If the t value > critical value = reject the null hypothesis
SIGNIFICANCE LEVELS (ALPHA)
- A significance level, also known as alpha or α, is an evidentiary standard that a researcher sets before the study
- It defines how strongly the sample evidence must contradict the null hypothesis before you can reject the null hypothesis for the entire population
CONFIDENCE INTERVALS
- The table shows the equivalent confidence levels and levels of significance
• If the p value > α = don’t reject the null hypothesis
• If the p value < α = reject the null hypothesis
DEGREES OF FREEDOM
= N1 + N2 – 2
• independent observations
• normally distributed data for each group
• equal variances for each group
ANOVA
- Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following:
o a continuous dependent variable or response variable
o a discrete independent variable, also called a predictor or explanatory variable
ONE-WAY ANOVA
- Use analysis of variance to test for differences between population means.

TYPE OF RESPONSE | TYPE OF PREDICTORS: CATEGORICAL | CONTINUOUS | CONTINUOUS AND CATEGORICAL
CONTINUOUS | Analysis of Variance (ANOVA) | Ordinary Least Squares Regression | Analysis of Covariance
CATEGORICAL | Contingency Table Analysis or Logistic Regression | Logistic Regression | Logistic Regression

RESEARCH QUESTIONS FOR ONE-WAY ANOVA
• Do accountants, on average, earn more than teachers?
• Do people treated with one of two new drugs have higher average T-cell counts than people in the control group?
• Do people spend different amounts depending on which type of credit card they have?
ASSUMPTIONS IN ANOVA
• Observations are independent
• Errors are normally distributed
• All groups have equal response variances.
CORRELATION
- Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors, and between predictors and a target variable
- Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y
- Correlation coefficient is a metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
- Scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
• The correlation coefficient measures the extent to which two variables are associated with one another.
• When high values of v1 go with high values of v2, v1 and v2 are positively associated
• When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated
• The correlation coefficient is a standardized metric so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation)
• 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance
Subtopic 1
Every data analysis requires data. Data can be in different forms such as images, text, videos, etc., that are usually gathered from different data sources. Organizations store data in different ways, some through data warehouses, traditional RDBMS, or even through the cloud. With the voluminous amount of data that an organization processes each day, the dilemma of how to start data analysis emerges.
How do we start performing an analysis?
First and foremost, know your data.
To understand your organization’s data, there are numerous techniques that can be used. In module 1, the most common techniques will be identified.
What is raw data?

Raw data pertains to the collected data before it’s processed or ranked.
Suppose you are tasked to gather the ages (in years) of 50 students in a university.

The table below is an example of quantitative raw data.

Another example is gathering the student status of the same 50 students. Now we will have an example of categorical raw data.

The examples above can also be called ungrouped data. An ungrouped data set contains information on each member of a sample or population individually.

Raw data can be summarized through charts, dashboards, tables, and numbers. The most common way to describe raw data is through frequency distributions.

A frequency distribution shows how the frequencies are distributed over various categories. Below is a frequency table that summarizes the survey conducted by Gallup Poll about Worries About Not Having Enough Money to Pay Normal Monthly Bills.

A frequency distribution of a qualitative variable enumerates all the categories and the number of instances that belong to each category.
Let’s transform the following responses into a frequency table in order to interpret the data better.
It is easier to understand through a frequency table, isn’t it?

In order to study data, in this module we will be using statistics.

Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data,
as well as of making decisions based on such analysis.

Since statistics is a broad body of knowledge, it is divided into two areas: descriptive
statistics and inferential statistics.

Descriptive statistics consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures.

Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.
Basic terms

A variable is a characteristic under study that assumes different values for different elements.

Observation or measurement pertains to a value of a variable.

A data set is a collection of observations on one or more variables.

Customer ID   First Name   Last Name     Address    Age
1             Ana          Liza          Quezon     29
2             Ben          Tan           Makati     30
3             Rose         Park          Antipolo   22
4             Dan          Mangulabnan   Pampanga   23
5             Romeo        Pascual       San Juan   24
In the above table, the variables are the customer ID, first name, last name, address, and age. The data set consists of all the transactions from customer 1 to 5. Each transaction can also be called a measurement or observation.
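The same three terms can be seen in a small Python sketch of the customer table. The key names (`customer_id`, `first_name`, and so on) are illustrative choices, and the column split of the first row (Ana / Liza / Quezon) is an assumption about how that row reads.

```python
# One dictionary per observation (row); the keys are the variables.
# The column split of the first row is an assumption.
customers = [
    {"customer_id": 1, "first_name": "Ana", "last_name": "Liza", "address": "Quezon", "age": 29},
    {"customer_id": 2, "first_name": "Ben", "last_name": "Tan", "address": "Makati", "age": 30},
    {"customer_id": 3, "first_name": "Rose", "last_name": "Park", "address": "Antipolo", "age": 22},
    {"customer_id": 4, "first_name": "Dan", "last_name": "Mangulabnan", "address": "Pampanga", "age": 23},
    {"customer_id": 5, "first_name": "Romeo", "last_name": "Pascual", "address": "San Juan", "age": 24},
]

variables = list(customers[0])             # the variables of the data set
observations = len(customers)              # the number of observations (rows)
ages = [row["age"] for row in customers]   # every value of one variable

print(variables)
print(observations, ages)
```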
Types of Variables

1. Quantitative Variables
It pertains to a variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data.

Examples: income, height, gross sales, price of a home, number of cars owned.
Quantitative variables are divided into two types: discrete variables and continuous variables.

• A discrete variable is a variable whose values are countable. In other words, a discrete variable can assume only certain values.
• Continuous variables are variables that can assume any numerical value over a certain interval or intervals.

2. Qualitative or Categorical Variables

It pertains to variables whose values cannot be measured numerically.

Population Vs Sample

A population consists of all elements (individuals, items, or objects) whose characteristics are being studied.
A sample is a portion of the population selected for study.
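A minimal Python sketch of the distinction follows. The population values are randomly generated, hypothetical ages (not data from the module), and simple random sampling is only one assumed way of selecting a sample.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: the ages of all 50 students being studied.
population = [rng.randint(17, 25) for _ in range(50)]

# A sample: a portion of the population selected for study
# (here by simple random sampling without replacement).
sample = rng.sample(population, k=10)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {population_mean:.2f}, sample mean: {sample_mean:.2f}")
```

The two means are usually close but not identical, which is why inferential statistics is needed to draw conclusions about a population from a sample.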

Visual example of Population vs Sample

Let’s proceed to the different ways to group and study data.

Aside from using the frequency table, we can also make use of the different graphs that are
commonly used to visually present data.

The first one is a bar graph. A bar graph is made of bars whose heights represent the frequencies of respective categories. One type of bar graph is called a Pareto chart.

A Pareto chart is a bar graph in which the bars are arranged by height in descending order (largest to smallest).
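The ordering that defines a Pareto chart can be sketched in a few lines of Python. The complaint categories are hypothetical, and the bars are drawn as text so that no plotting library is assumed.

```python
from collections import Counter

# Hypothetical categorical data: reasons for customer complaints.
complaints = (["Billing"] * 3 + ["Delivery"] * 5
              + ["Quality"] * 2 + ["Other"] * 1)

counts = Counter(complaints)

# Pareto order: categories sorted by frequency, largest to smallest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, frequency in pareto_order:
    print(f"{category:<10}{'#' * frequency} ({frequency})")
```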

Another way to present data is through a pie chart. A pie chart is a circle divided into portions that represent the frequencies or percentages of a population.
To graph grouped data, we can use the following methods:

Grouped data can be presented using histograms. A histogram can be drawn for a frequency distribution. It is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis.

The above histogram shows the percentage distribution of annual car insurance premiums in 50
states. The data used to make this distribution and histogram are based on estimates made by
Understanding Frequency Distribution Curve

Knowing what each shape of a histogram means helps in interpreting a dataset. A histogram can be:
1. Symmetric
2. Skewed
3. Uniform or rectangular
A symmetric histogram is a histogram that is identical on both sides of its central point.

If a histogram doesn’t have equal sides, it is said to be skewed.

• A skewed-to-the-right histogram has a longer tail on the right side.

• A skewed-to-the-left histogram has a longer tail on the left side.

If all classes of a histogram have the same frequency, it is considered a uniform or rectangular histogram.

Measures of Central Tendencies
1. Mean
2. Median
3. Mode
Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation
Examples of statistical analysis application in real life:

1. Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to
the airline industry and to help guitarists make beautiful music.
2. Researchers keep children healthy by using statistics to analyze data from the
production of viral vaccines, which ensures consistency and safety.
3. Communication companies use statistics to optimize network resources, improve
service and reduce customer churn by gaining greater insight into subscriber
4. Government agencies around the world rely on statistics for a clear understanding
of their countries, their businesses and their people.
Understanding the measures of central tendencies:

Measures of central tendency are useful in identifying the middle value of histograms and frequency distributions. The methods used to calculate the measures of central tendency determine the typical values that can be found in your data.

What is mean?
The mean, also called the arithmetic mean, is the sum of all the values divided by the number of values added.
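As a quick check of the definition, the sketch below computes the mean of the five ages from the customer table earlier in this module, once by hand and once with Python's standard `statistics` module.

```python
import statistics

ages = [29, 30, 22, 23, 24]  # ages from the customer table

# Sum of all the values divided by the number of values.
mean_by_hand = sum(ages) / len(ages)

# The same calculation using the standard library.
mean_stdlib = statistics.mean(ages)

print(mean_by_hand, mean_stdlib)
```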
What is median?
The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data.
When calculating the median, one must remember the following:
1. Arrange the data values of the given data set in increasing order (from smallest to largest).
2. Find the value that divides the ranked data set in two equal parts.

Find the median for 2014 compensation:

Step 1: Arrange the values in increasing order

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

Step 2: Identify the center of the data set.

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

For this example, the middle value is 21.0, therefore it is the median of the data values.

There are instances where the number of data values is even. In this case, the two middle numbers are added and the sum is divided by two.
See the example below:

The following data describes the cell phone minutes used last month by 12 randomly
selected customers:
230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721
Here, we need to arrange the data values first. It will give us:

160 184 201 230 263 326 380 397 510 721 2053 3864

Our data has an even number of values, so there is no single central value. To compute the median, we need to identify the two values that divide the data into two equal parts. In our case, these are 326 and 380.
We can only have one median per data set, so the median is calculated as:

$$\text{Median} = \frac{326 + 380}{2} = 353$$
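Both median examples (the odd-sized compensation data and the even-sized cell phone data) can be reproduced with a short Python sketch. The function name `median` is an illustrative choice, and the standard library's `statistics.median` is shown alongside it for comparison.

```python
import statistics


def median(values):
    """Middle value of the ranked data; average of the two middle values if the count is even."""
    ordered = sorted(values)      # Step 1: arrange in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # odd count: one middle value
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count


compensation = [16.2, 16.9, 19.3, 19.3, 19.6, 21.0, 22.2, 22.5, 28.7, 33.7, 42.1]
minutes = [230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721]

print(median(compensation), statistics.median(compensation))
print(median(minutes), statistics.median(minutes))
```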
What is mode?

The mode is the value that occurs most frequently in the given dataset.

Let’s identify the most frequent values in the following example:

77 82 74 81 79 84 74 78

In this data set, the value 74 appears twice. Therefore, our mode is 74.

When a dataset has only one value that repeats the most, the distribution is called unimodal. If the distribution has two values that repeat the most, it is called bimodal. If there are more than two modes in a dataset, it is said to be multimodal.
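A minimal Python sketch of finding the modes and naming the distribution is shown below. The helper names `modes` and `modality` are illustrative, and treating a data set in which every value occurs once as having no mode is a convention assumed here, not something stated in the module.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), sorted; empty if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


def modality(values):
    """Name the distribution by its number of modes."""
    names = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return names.get(len(modes(values)), "multimodal")


scores = [77, 82, 74, 81, 79, 84, 74, 78]    # the example above
variant = [77, 82, 74, 81, 79, 84, 74, 82]   # same data with 82 repeated

print(modes(scores), modality(scores))
print(modes(variant), modality(variant))
```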

Relationships among mean, median, and mode:

• If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.

• If a histogram and frequency distribution are skewed to the right, the value of the mean is the largest among the three and the mode has the lowest value. The mean is the largest because it is sensitive to the outliers that occur in the right tail.

• If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value. In this scenario, the left tail contains the outliers.
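These relationships can be checked on small data sets. The values below are hypothetical and were chosen to have a long right tail; the left-skewed set is simply their mirror image. The ordering is a rule of thumb that holds for many skewed data sets, though not for every one.

```python
import statistics

# Hypothetical right-skewed data: most values are small, with a long right tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

# Mirror image of the same data, which gives a long left tail instead.
left_skewed = [21 - value for value in right_skewed]


def summary(values):
    """Return (mean, median, mode) for a data set."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))


right_mean, right_median, right_mode = summary(right_skewed)
left_mean, left_median, left_mode = summary(left_skewed)

print("right-skewed:", right_mean, right_median, right_mode)
print("left-skewed: ", left_mean, left_median, left_mode)
```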

Measures of Dispersion
If an analyst wants to know how dispersed a dataset is, the measures of dispersion can be used. Measures of dispersion help in determining how spread out the data values are.
What is range?

Range is the simplest measure of dispersion to compute. The range can be obtained by subtracting the smallest value in the dataset from the largest value.
$$\text{Range} = \text{largest value} - \text{smallest value}$$
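For instance, the range of the cell phone minutes used in the median example can be computed as follows; the variable names are illustrative.

```python
# Cell phone minutes used last month by 12 customers (from the median example).
minutes = [230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721]

largest = max(minutes)
smallest = min(minutes)
data_range = largest - smallest   # range = largest value - smallest value

print(largest, smallest, data_range)
```

Because the range uses only the two extreme values, a single outlier such as 3864 dominates it.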

What is standard deviation?

This method is the most commonly used measure of dispersion. The value of standard
deviation tells how closely the values of a data set are clustered around the mean.
The things one must remember when dealing with standard deviation are the following:

1. A lower value of the standard deviation indicates that the values of the data set are spread over a smaller range around the mean.
2. A larger value of the standard deviation indicates that the values of the data set are spread over a relatively larger range around the mean.
3. The standard deviation can be obtained by taking the positive square root of the variance.
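A minimal sketch of the calculation in Python follows. The data values are hypothetical, and the split into a population version (divide by n) and a sample version (divide by n - 1) is an assumption added here, since the notes do not say which one is intended.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = sum(data) / len(data)
squared_deviations = [(value - mean) ** 2 for value in data]

# Population version: divide by n, then take the positive square root.
population_variance = sum(squared_deviations) / len(data)
population_sd = math.sqrt(population_variance)

# Sample version: divide by n - 1 instead.
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_sd = math.sqrt(sample_variance)

print(population_sd, statistics.pstdev(data))  # population standard deviation
print(sample_sd, statistics.stdev(data))       # sample standard deviation
```

A data set whose values are all identical has a standard deviation of 0, which is the smallest value it can take.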

17/20
1. A skewed-to-the-right histogram has longer tail on the left side [TRUE]
2. Boxplot is a plot of frequency table with the bins on the x-axis and the count (or proportion) on the y-axis [FALSE]
3. Categorical data can have an explicit ordering
4. The difference between largest and the smallest value in a data set [RANGE]
5. A data set is a collection of observation on one or more variables [TRUE]
6. People’s age is an example of continuous variable. [TRUE] – FALSE - DISCRETE
7. You conduct a survey where the respondents could choose from good, better, best, excellent. What type of variable should contain this type of data? [ORDINAL]
8. The data that is collected before being processed is called statistical data [FALSE]
9. The value of the mean is ____ to/the value of the mode in the histogram shown below. [LESS THAN]
10. What is the mode of the following data?
A 10
B 20
C 30
D 35
E 24
11. Identify whether the statement below is a null hypothesis or alternative hypothesis:
Bioflu is more effective than aspirin in helping a person who has had a heart attack [NULL HYPOTHESIS]
12. Identify whether the statement below is an example of
A. Positive correlation
B. Negative correlation
C. No correlation
The more one runs, the less likely one is to have cardiovascular problems. [POSITIVE CORRELATION]
13. The scores awarded to 25 students for an assignment were as follows:
4 7 5 9 8 6 7 7 8 5 6 9 8 5 8 7 4 7 3 6 8 9 7 6 9.
What is the mode? [7]
14. Inferential statistics is about describing and summarizing data sets using pictures and statistical quantities [FALSE] –
15. On the semester final, Joe scored 85, Jill scored 89, and Bill scored a 99. What was the average score for these students? [91]
16. Identify whether the statement below is an example of
a. Positive correlation
b. Negative correlation
c. No correlation
As one increases in age, often one’s agility
17. Find the median: 12, 5, 9, 18, 22, 25, 5 [12]
18. Identify whether the statement below is a null hypothesis or alternative hypothesis:
19. Observe the histogram below. Based on it, how many students were greater than or equal to 60 inches tall? [6] - 11
20. Identify whether the statement below is an example of
a. Positive correlation
b. Negative correlation
c. No correlation
As you drink more coffee, the number of hours you stay awake increases. [POSITIVE CORRELATION]

18/20
1. Which of the following doesn’t measure how spread a data set is? [MODE]
2. If the distribution has more than two modes, it is called multimodal [TRUE]
3. The value of the mean is ____ to/the value of the mode in the histogram shown below. [LESS THAN]
4. The data that is collected before being processed is called statistical data [TRUE]
5. Boxplot is a plot of frequency table with the bins on the x-axis and the count (or proportion) on the y-axis [FALSE] -
6. Continuous variables are variables that can assume any numerical value or interval
7. Continuous variable is variable whose values are countable [FALSE] - DISCRETE
8. People’s age is an example of continuous variable [TRUE] - DISCRETE
9. A population is consisting of all elements – individuals, items, objects, whose characteristics are being studied. [TRUE]
10. Match the type of variable of each instance:
Number of car accident – DISCRETE
Alicia’s weight during December – CONTINUOUS
Customer feedback - CATEGORICAL
11. The standard deviation can never be a negative number [TRUE]
12. Inferential statistics is about describing and summarizing data sets using pictures and statistical quantities [FALSE] - DESCRIPTIVE
13. Identify whether the statement below is an example of
a. Positive correlation
b. Negative correlation
c. No correlation
As you drink more coffee, the number of hours you stay awake increases. [POSITIVE CORRELATION]
14. When finding the median, what do you do if there are two middle numbers? [ADD
15. Identify whether the statement below is an example of
a. Positive correlation
b. Negative correlation
c. No correlation
As one increases in age, often one’s agility decreases. [NEGATIVE CORRELATION]
16. To find the average of a set of numbers, add up all items and divide by _____. [THE NUMBER OF ITEMS]
17. Inferential statistics is the study of chance events governed by rules [FALSE] -
18. When x and y are negatively correlated, as the value of x increases, the value of y tends to decrease [TRUE]
19. Identify whether the statement below is a null hypothesis or alternative hypothesis:
iPhone 6 Plus weighs 6.77 ounces [NULL HYPOTHESIS]
20. Identify whether the statement below is a null hypothesis or alternative hypothesis:
The sky is blue [NULL HYPOTHESIS]

1. A boxplot of a qualitative variable enumerates all the categories and the number of instances that belong to each category. [FALSE] – FREQUENCY
2. A data set is a collection of observations on one or more variables [TRUE]
3. Continuous variables are variables that can assume any numerical value or interval [TRUE]
4. A data set is a collection of observations on one or more variables [TRUE]
5. The difference between the largest and the smallest value in a data set [RANGE]
6. What is the mode of the following data?
STUDENT SCORE
A 10
B 20
C 30
D 35
E 24
7. Categorical data can have an explicit ordering [TRUE]
8. Match the following:
Size of computer monitor – ordinal – INTERVAL
Anime genre – categorical
Anime season – interval – ORDINAL
9. Which of the following doesn’t measure how spread a data set is? [MODE]
10. What is the median of the following data?
A 10
B 20
C 30
D 35
E 24
11. If you are going to conduct a study if there’s a difference between five fertilizers, you use correlation [TRUE] – FALSE (correlation only requires two variables)
12. The cost of four cell phones is $345, $400, $110, and $640. What is the median cost? [$372.50]
13. A residual is the difference between the observed value of the response and the predicted value of the response variable [FALSE] - TRUE
14. What is the median of this data set? 7.3, 2.9, 1.5, 0.6, 3.8 [2.9]
15. The standard deviation can never be a negative number [TRUE]
16. Boxplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
17. Two-tailed test is a statistical technique used to compare the means of two or more groups of observations or treatments [FALSE] - ANOVA
18. Identify whether the statement below is a null hypothesis or alternative hypothesis:
Bioflu is more effective than aspirin in helping a person who has had a heart attack
19. Observe the histogram below. Based on it, how many students were greater than or equal to 60 inches tall? [6] - 11
20. Identify whether the statement below is a null hypothesis or alternative hypothesis:
The sky is blue [NULL HYPOTHESIS]

1. If a histogram and frequency distribution are skewed to the left, the value of the mean is the largest.
- False
2. Which of the following is an example of categorical raw data?
- Collected subjects offered
3. You conduct a survey where the respondents could choose from good, better, best, excellent. What type of variable should contain this type of data?
- Ordinal
4. Student height is a categorical variable.
- True ?
5. Quantitative variable pertains to the variables whose values cannot be measured.
- False
- True
6. The data below is bimodal.
Data: 77 82 74 81 79 84 74 82
- True
8. What is the median of the following data?
- 24
9. People’s age is an example of continuous variable
- True – FALSE - DISCRETE
10. It is a portion of the population selected for study. (Use lowercase for your answer)
- Sample
11. If the histogram is skewed right, the mean is greater than the median.
- True
12. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation
The more one eats, the less hunger one will have.
- Positive correlation – NEGATIVE?
14. Boxplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
- False
15. 0 indicates no correlation, but be aware that random arrangement of data will produce both positive and negative values for the correlation coefficient just by chance
- True

16. The alternative hypothesis or research hypothesis Ha represents an alternative claim about the value of the parameter.
- True
17. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation
As one increases in age, often one’s agility decreases.
- Negative correlation
18. Find the median: 15, 5, 9, 18, 22, 25, 5
- 15
19. Observe the histogram below. Based on it, how many students were greater than or equal to 60 inches tall?
- 11
20. The height of three volleyball players is 210 cm, 220 cm and 191 cm. What is their average height?
- 207

20/20
1. The standard deviation can never be a negative number
2. It refers to the critical process of performing initial investigations on the data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations
- Exploratory Data Analysis
3. A variable is a characteristic under study that assures different values for different elements.
- True
4. The data that is collected before being processed is called statistical data.
- False
5. The smallest possible value for the standard deviation is 1.
- False
6. Observation or measurement pertains to a value of a variable.
- True
7. Customer gender is an example of ordinal data.
- False
8. If the distribution has more than two modes, it is called multimodal.
- True
9. It is a measure of how central the average is in relation to the overall spread of values
- Skewness
10. A special case of categorical with just two categories is called logical.
- False - BINARY
11. Identify whether the statement below is a null hypothesis or alternative hypothesis:
Contrary to popular belief, people can see through walls.
- null hypothesis
12. The height of three volleyball players is 210 cm, 220 cm and 191 cm. What is their average height?
- 207
13. 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance
- True
14. Inferential statistics is about describing and summarizing data sets using pictures and statistical
15. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation
The more you exercise your muscles, the stronger they get.
- positive correlation
16. Inferential Statistics is about analyzing data sets and drawing conclusions from them
- True
17. A residual is the difference between the observed value of the response and the predicted value of the response variable.
- True
18. The smallest possible value for the standard deviation is 0.
- True
19. When conducting two-tailed T-test, the data is normally distributed.
- True
20. When x and y are positively correlated, as the value of x increases, the value of y tends to increase as well.
- True
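
The Module 1 notes on the t-test (compare the t value with a critical value; degrees of freedom = N1 + N2 – 2; equal variances for each group) can be sketched in Python as below. This is a minimal sketch of a pooled two-sample t statistic, not a full testing routine. The two groups are hypothetical data, the function names are illustrative, comparing the absolute value of t is an assumption made for a two-tailed test, and the critical value 2.306 is the standard t-table entry for a two-tailed test at α = 0.05 with 8 degrees of freedom.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2                                   # degrees of freedom
    pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


def decide(t, critical_value):
    """Two-tailed decision rule (assumed): reject H0 when |t| exceeds the critical value."""
    return "reject H0" if abs(t) > critical_value else "do not reject H0"


# Hypothetical measurements for two independent groups.
group_a = [20, 22, 19, 24, 25]
group_b = [28, 27, 30, 26, 29]

t_value, df = pooled_t_statistic(group_a, group_b)
decision = decide(t_value, critical_value=2.306)  # t-table value for df = 8, alpha = 0.05

print(f"t = {t_value:.3f}, df = {df}, decision: {decision}")
```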
[IT0089] MODULE 2
• Singapore’s Supply and Demand gap on a typical day
• Microsoft’s Voice Recognition
• Gmail’s Spam
• Instagram algorithm
• Weather Forecast
What exactly is data mining?
Data Mining is defined as a process used to extract usable insights from a larger set of any raw data.
- It implies analysing data patterns in large batches of data using one or more software.
- Automatic pattern predictions based on trend and behavior analysis
- Prediction based on likely outcomes
- Creation of decision-oriented information
- Focus on large data sets and databases for analysis
- Clustering based on finding and visually documented groups of facts not previously known.
KEY TERMS
- A data set may have attribute variables and target variable(s)
- The values of the attribute variables are used to determine the values of the target variable(s)
- Attribute variables and target variables may also be called independent variables and dependent variables, respectively, to reflect that the values of the target variables depend on the values of the attribute variables
MINING
➢ Classification and prediction patterns
➢ Cluster and association patterns
➢ Data reduction patterns
➢ Outliers and anomaly patterns
➢ Sequential and temporal patterns
Data-mining approaches can be separated into two categories.
TWO MAJOR APPROACHES IN DATA MINING
1. Supervised learning - the desired
2. Unsupervised learning - it is used against data that has no historical labels
DATA ANALYTICS LIFE CYCLE
- Big data analysis differs from traditional data analysis primarily due to the volume, velocity, and variety characteristics of data being processed
WHAT IS BIG DATA?
- Normal data comes from your traditional relational database system
- Big data comes from different data sources. Data are usually in petabytes or terabytes
CHARACTERISTICS OF BIG DATA
- Volume: big data starts as low as 1 terabyte and it has no upper limit
- Velocity: big data enters an average system at velocity ranging between 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second
- Variety: big data is composed of unstructured, semi-structured, and structured datasets
People – first, you need the right people
• Business User – understands the domain area
• Project Sponsor – provides requirements (monetary, equipment,
• Project manager – ensures meeting
• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction, supports analytics sandbox
• Data Scientist – provides analytic techniques and data modeling
DIFFERENT ANALYTICAL LIFE CYCLE
DATA LIFE CYCLE
1ST PHASE: ASK
- Whether you’re a data scientist developing a new model to reduce churn, or a business executive wanting to improve the customer experience, this phase defines what your business needs to know
2ND PHASE: PREPARE
- This phase is both critical to success and frustratingly time-consuming. You have data sitting in databases, on desktops or in Hadoop, plus you want to capture live-streaming data.
3RD PHASE: EXPLORE
- In this phase, you’ll search for relationships, trends and patterns to gain a deeper understanding of your data.
- You’ll also develop and test hypotheses through rapid prototyping in an iterative process.
4TH PHASE: MODEL
- This is the phase where you code your own models using R or Python or an interactive predictive software
- Organizations must be able to identify the appropriate algorithm to analyze the available data.
5TH PHASE: IMPLEMENT
- After validating the accuracy of the model generated during the previous phase, it must be implemented in the organization.
- Models are usually deployed by automating common manual tasks
6TH PHASE: ACT
- In this phase, we enable two types of decisions: operational decisions that are automated, and strategic decisions where individuals make a long-term impact.
7TH PHASE: EVALUATE
- Organizations must always monitor the performance of the model generated.
- When the performance starts to degrade below the acceptance level, the model can be recalibrated or replaced with a new model.
8TH PHASE: ASK AGAIN
- The marketplace changes. Your business changes. And that’s why your analytics process occasionally needs to change.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
- The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.
- It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.
STEPS OF THE KDD PROCESS
1. Developing an understanding of
a. The application domain
b. The relevant prior knowledge
c. The goals of the end-user
2. Creating a target data set: selecting a data set or focusing on a subset of variables, or data samples, on which discovery is to be performed.
3. Data Cleaning and preprocessing
a. Removal of noise or outliers
b. Collecting necessary information to model or account for noise
c. Strategies for handling missing data fields
d. Accounting for time sequence and known changes
4. Data reduction and projection
a. Finding useful features to represent the data depending on the goal of the task
5. Choosing the data mining task
a. Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s)
a. Selecting method(s) to be used for searching for patterns in the data
b. Deciding which models and parameters may be appropriate
c. Matching a particular data mining method with the overall criteria of the KDD Process
7. Data mining
a. Searching for patterns of interest in a particular representation form or a set of such representations as classification rules or trees, regression, clustering, and so forth
8. Interpreting mined patterns
9. Consolidating discovered knowledge
CRISP-DM
- The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance
1ST PHASE: BUSINESS UNDERSTANDING
- This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
2ND PHASE: DATA UNDERSTANDING
- The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.
3RD PHASE: DATA PREPARATION
- The data preparation phase covers all activities needed to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data.
- Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.
4TH PHASE: MODELING
- In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values
- Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, going back to the data preparation phase is often necessary.
5TH PHASE: EVALUATION
- Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives.
- A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a
decision on the use of the data mining results should be reached.
6TH PHASE: DEPLOYMENT
- Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
- In many cases, it is the customer, not the data analyst, who carries out the deployment steps.

- Data mining aims at discovering useful data patterns from massive amounts of data.
- Artificial intelligence (AI) is the broad science of mimicking human abilities.
- AI is the science of training machines to perform human tasks
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

ASK YOURSELF!
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
- Classifying emails as spam or not spam. (T)
- Watching you label emails as spam or not spam. (E)
- The number (or fraction) of emails correctly classified as spam/not spam. (P)
- None of the above: this is not a machine learning problem.

• Smart Tagging
• Product Recommendations
• Priority Filtering
• Spam Filtering
• Text Processing
• Speech Recognition
• Face Recognition
Artificial Intelligence – a technique which enables machines to mimic human behavior
Machine Learning – subset of AI techniques which use statistical methods to enable machines to improve with experience
Deep Learning – subset of ML which makes the computation of multi-layer neural networks feasible
ASPECTS OF BUSINESS ANALYTICS
Forecasting – leveraging historical time series data to provide better insights into decision-making about the future
Data Mining – performs predictive analytics and pattern discovery techniques to address numerous business problems
Text Analytics – finding treasures in unstructured data like social media or survey tools which could uncover insights about consumer sentiment
Optimization – analyze massive amounts of data in order to accurately identify decisions
which are likely to produce the most optimal results
➢ Relational databases
➢ Data warehouses
➢ Advanced DB and information
➢ Object-oriented and object-relational
➢ Transactional and Spatial databases
➢ Heterogeneous and legacy databases
➢ Multimedia and streaming database
➢ Text databases
➢ Text mining and web mining
DATA MINING TRENDS
• Ongoing research to ensure that analysts have access to modern techniques which are robust and
• Innovative computational implementations of existing analytical methods
• Creative applications of existing methods to solve new and different
• Integration of methods from multiple disciplines to provide targeted

• Attrition/Churn prediction
• Propensity to buy/avail of a product or service
• Cross-sell or up-sell probability
• Next-best offer
• Time-to-event modeling
• Fraud detection
• Revenue/profit predictions
• Probability of default in credit risk assessment

- Classification and prediction patterns allow us to classify or predict values of target variables from values of attribute variables.
- Cluster patterns give groups of similar data records such that data records in one group are similar but have larger differences from data records in another group.
- Association patterns are established based on co-occurrences of items in data records
- Data reduction patterns look for a small number of variables that can be used to represent a data set with a much larger number of variables
➢ OUTLIER AND ANOMALY PATTERNS
- Outliers and anomalies are data points that differ largely from the norm of data
➢ SEQUENTIAL AND TEMPORAL PATTERNS
- Sequential and temporal patterns reveal patterns in a sequence of data points.
- If the sequence is defined by the time over which data points are observed, we call the sequence of data points a time series.
DATA MINING TECHNIQUES
- CLASSIFICATION
- PREDICTION
- ASSOCIATION RULES
CHALLENGES OF DATA MINING
- Skilled experts are needed to formulate the data mining queries.
- Overfitting: Due to small size training database, a model may not fit future
- Data mining needs large databases which sometimes are difficult to manage
- Business practices may need to be modified to determine how to use the information uncovered.
- If the data set is not diverse, data mining results may not be accurate.
- Integrating the information needed from heterogeneous databases and global information systems could be complex
ADVANTAGES OF DATA MINING
- Data mining techniques help companies to get knowledge-based information.
- Data mining helps organizations to make profitable adjustments in operation and production.
- Data mining is a cost-effective and efficient solution compared to other statistical data applications.
- Data mining helps with the decision-making process.
- Facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
- It can be implemented in new systems as well as existing platforms
- It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.
DISADVANTAGES OF DATA MINING
- There is a chance that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
- Much data mining analytics software is difficult to operate and requires advanced training to work on.
- Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
- The data mining techniques are not accurate, and so they can cause serious consequences in certain conditions
INDUSTRIES THAT UTILIZE DATA MINING
➢ Communications
➢ Insurance
➢ Education
➢ Manufacturing
➢ Banking
➢ Retail
➢ Service providers
➢ Super markets
➢ Crime
➢ Bioinformatics
SUBTOPIC 2
SUPERVISED LEARNING
- the desired output is known.
- also known as predictive modeling
- uses patterns to predict the values of the label on additional unlabeled data
- used in applications where historical data predicts likely future events
- We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into partitions:
o TRAINING DATASET - Consists of the data used to build the candidate models.
o TEST DATASET - The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model.
- If there is only one dataset, it may be partitioned into training and test sets.
- The basic assumption is that the training and test sets are produced by independent sampling from an infinite population.
SUPERVISED LEARNING EXAMPLE TECHNIQUE
Classification Tree
- Partition a data set of observations into increasingly smaller and more homogeneous subsets.
- At each iteration, a subset of observations is split into two new subsets based on the values of a single variable.
- Series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity.
Logistic Regression
- Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.
UNSUPERVISED LEARNING
- used against data that has no historical labels
- the goal is to explore the data and find some structure within
- There is no right or wrong answer
UNSUPERVISED LEARNING EXAMPLE TECHNIQUES
Self-organizing maps
Nearest-neighbor mapping
- k-nearest neighbors (k-NN): This method can be used either to classify an outcome category or predict a continuous outcome.
o k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
o When k-NN is used as a classification method, a new observation is classified as Class 1 if the percentage of its k nearest neighbors in Class 1 is greater than or equal to a specified cut-off value (e.g. 0.5).
o When k-NN is used as a prediction method, a new observation’s outcome value is predicted to be the average of the outcome values of its k nearest neighbors
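The training/test partition and the k-NN rules above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the data points, the 75/25 split, and the helper names (`euclidean`, `knn_classify`, `knn_predict`) are assumptions made for the example, not part of the module.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(train, new_point, k):
    """The k training rows whose features are closest to new_point."""
    return sorted(train, key=lambda row: euclidean(row[0], new_point))[:k]

def knn_classify(train, new_point, k=3, cutoff=0.5):
    """Class 1 if the share of the k nearest neighbours in Class 1 >= cutoff."""
    share_class_1 = sum(label for _, label in nearest(train, new_point, k)) / k
    return 1 if share_class_1 >= cutoff else 0

def knn_predict(train, new_point, k=3):
    """Average outcome value of the k nearest neighbours."""
    return sum(y for _, y in nearest(train, new_point, k)) / k

# Hypothetical labelled observations: (features, class label 0 or 1).
data = [((1.0, 1.2), 0), ((0.8, 0.9), 0), ((1.3, 1.0), 0), ((0.9, 1.4), 0),
        ((3.0, 3.2), 1), ((3.3, 2.9), 1), ((2.8, 3.1), 1), ((3.1, 3.4), 1)]

# Partition the single data set into training and test sets (assumed 75/25).
random.Random(0).shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# Estimate effectiveness only on data not used to build the model.
correct = sum(knn_classify(train, x, k=3) == y for x, y in test)
accuracy = correct / len(test)

# k-NN as a prediction method on hypothetical numeric outcomes.
numeric_train = [((1,), 10), ((2,), 20), ((3,), 30), ((10,), 100)]
estimate = knn_predict(numeric_train, (2.1,), k=3)

print(accuracy, estimate)
```

With two well-separated groups like these, every test point is expected to be classified correctly; real data is rarely this clean, which is why the held-out test set matters.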
K-Nearest neighbor
- To classify an outcome, the training set is searched for the one that is “most like” it. This is an example of “instance based” learning. It is “rote learning”, the simplest form of learning
Clustering
- A definition of clustering could be “the process of organising objects into groups whose members are similar in some way”.
WHAT IS WEKA?
- Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.
Logistic Regression in Weka
• Open diabetes dataset.
• Click Classify Tab
• Choose Classifier: functions>Logistic
• Use Test Options: Use Training Set
• Press Start
Trees in Weka
• Open weather dataset
• Click Classify Tab
• Choose Classifier: trees>J48
• Use Test Options: Use Training Set
• Press Start
• Right-click result from Result List for options
• Choose Visualize tree

FORMATIVES

19/20
1. Ordinal data can also be called as ordered factor [TRUE]
2. Which of the following is not included on the analytical life cycle defined by SAS? [INTEGRATION]
3. The last phase of the analytical life cycle defined by SAS is implementation [FALSE] – ask again
4. It is a role that is responsible for collecting, analyzing, and interpreting large amount of data [DATA SCIENTIST]
5. Data preparation is the first step in CRISP-DM [FALSE] – Business Understanding
6. These are values that lie away from the bulk of the data [OUTLIERS]
7. It refers to the broad process of finding knowledge in data and emphasizes the “high-level” application of particular data mining methods [KDD]
8. A key objective is to determine if there is some important business issue that has not been sufficiently considered [TRUE]
9. Identify the third step in CRISP-DM
10. It is defined as heterogeneous data from multiple sources combined in a common source [DATA
11. Machine learning is a subset of artificial intelligence [TRUE]
12. Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of correct data mining tool is a very difficult task. [TRUE]
13. Suppose you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach would you use? [UNSUPERVISED LEARNING] – Supervised learning
14. Data mining implies analyzing data patterns in large batches of data using one or more software [TRUE]
15. It uses statistical methods to enable machines to improve with experience [MACHINE LEARNING]
16. In this approach in data mining, only input data will be given [UNSUPERVISED LEARNING]
17. One of the advantages of data mining is that there are chances of companies may sell useful information of their customers to other companies for money [FALSE] - disadvantage
18. Data mining cannot be used on spatial databases [FALSE]
19. Regression is a famous supervised learning technique [TRUE]
20. The goal of unsupervised learning is to explore the data and find some structure within [TRUE]

17/20
1. Reviewing process is under what phase in CRISP-DM? [BUSINESS UNDERSTANDING] - EVALUATION
2. Data preparation is the first step in CRISP-DM [FALSE] – Business Understanding
3. Business user provides business domain expertise based on deep understanding of the data [FALSE] – Business Intelligence analyst
4. Selecting data mining technique is under what phase in CRISP-DM? [MODELING]
5. Interpreting mined patterns concludes the KDD process [TRUE] – FALSE – Consolidating discovered knowledge
6. The last phase of the analytical life cycle defined by SAS is implementation [FALSE] – ask again
7. Depending on the requirements, the deployment phase can be as simple as generating a report [TRUE]
8. Evaluation is the last phase in CRISP-DM [FALSE] - Deployment
9. It is the role that ensures the progress of any project [PROJECT MANAGER]
10. A special case of categorical with just two categories [CATEGORICAL] - BINARY
11. Artificial intelligence is a subset of machine learning [FALSE]
12. Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications or a JAVA API [TRUE]
13. Regression is a famous supervised learning technique [TRUE]
14. Machine learning is a subset of artificial intelligence [TRUE]
15. A supervised learning algorithm learns from labeled training data, helps you to predict outcomes for unforeseen data [TRUE]
16. Supervised learning goal is to determine the function so well that when new input data set given, can predict the output [TRUE]
17. In this approach in data mining, input variables and output variables will be given [SUPERVISED LEARNING]
18. Data mining is a cost-effective and efficient solution compared to other statistical data applications. [TRUE]
19. Association rule algorithm is an example of what approach in data mining? [UNSUPERVISED LEARNING]
20. It makes the computation of multi-layer neural network feasible [DEEP LEARNING]

1. This is the type of learning which uses patterns to predict the values of the label on additional unlabeled data
2. Unsupervised machine learning finds all kind of unknown patterns in data
3. Data mining focuses on small data sets and databases for analysis
4. Which of the following is not considered as data mining technique?
5. Data mining can be performed on web mining data [TRUE]
6. Data mining cannot be used on spatial databases [FALSE]
7. Data mining is the broad science of mimicking human abilities [FALSE] - AI
8. Association rule algorithm is an example of what approach in data mining?
9. There is no right or wrong on predictive modeling [FALSE]
10. It is used against data that has no historical labels [UNSUPERVISED LEARNING]
11. It is defined as heterogeneous data from multiple sources combined in a common source [DATA
12. It refers to the broad process of finding knowledge in data and emphasizes the “high-level” application of particular data mining methods [KDD]
13. This role is responsible for creating database environment for analytic projects [DATABASE ADMINISTRATOR]
14. Selecting data mining technique is under what phase in CRISP-DM? MODELING
15. It is the practice of science and technology that is dedicated to building and data-handling problems that arise due to high volume of data [DATA SCIENCE] ?
16. CRISP-DM stands for _______
17. Evaluation is the last phase in CRISP-DM
18. Which of the following is not considered as a phase of CRISP-DM? [DATA SELECTION]
19. These are values that lie away from the bulk of the data [OUTLIERS]
20. This role provides the funding when doing analytical project [PROJECT SPONSOR]

16/20
1. This phase starts with initial data collection. [DATA UNDERSTANDING]
2. It is the role that ensures the progress of any project. [PROJECT MANAGER]
3. A special case of categorical with just two categories. [BINARY]
4. Creating target datasets also includes collecting necessary information to model or account for noise. [FALSE]
5. It is a role that is responsible for collecting, analyzing, and interpreting large amount of data.
6. This role is responsible for creating database environment for analytic projects. [DATABASE ADMINISTRATOR]
7. Ordinal data can also be called as ordered factor. [TRUE]
8. Identify the third step in CRISP-DM.
9. Which of the following is not included on the analytical life cycle defined by SAS? [INTEGRATION]
10. The last phase of the analytical life cycle defined by SAS is implementation. [FALSE] - Deployment
11. In this approach in data mining, input variables and output variables will be given.
12. Artificial intelligence is a subset of In this phase you’ll also develop and test
machine learning. [FALSE] hypotheses through rapid prototyping in an
13. Supposed you want to train a iterative process.
machine to help you predict how - Explore
long it will take you to drive home
It is a comprehensive data mining
from your workplace. What type of
methodology and process model that
data mining approach should you
provides anyone – from novice to data mining
experts – with a complete blueprint for
LEARNING] ? conducting a data mining project.
14. Association rule algorithm is an
example of what approach in data - CRISP-DM
mining? [SUPERVISED CRISP-DM stands for
LEARNING] - UNSUPERVISED ____________________________.
15. Data mining cannot be used on
- Cross-industry standard process for
text databases. [FALSE] data mining
16. A data value that is very different
from most of the data. [DATA It is a role that is responsible for collecting,
analyzing, and interpreting large amount of
17. Self-organizing maps are example
of supervised learning. [TRUE] - - Data Scientist
FALSE It is the aspect of business analytics that finds
18. Unsupervised machine learning patterns in unstructured data like social
finds all kind of unknown patterns media or survey tools which could uncover
in data. [TRUE] insights about consumer sentiment
19. Suppose your email program
watches which emails you do or do - Data Mining ?
not mark as spam, and based on One of the benefits of data mining is
that learns how to better filter overfitting.
spam. What is the experience E in
this setting? [WATCHING YOU - False
It is the phase of CRISP-DM where analysts
20. Logistic regression is classified as
supervised learning. [TRUE] review the steps executed.

- Evaluation
Ordinal data can also be called as ordered
A key objective is to determine if there is
some important business issue that has not
- True been sufficiently considered.

- True

Identify the third step in CRISP-DM.

- Data Preparation All data is labeled and the algorithms learn to
predict the output from the input data. This
In this approach in data mining, input
statement pertains to ___________. (Use
variables and output variables will be given.
lowercase for your answer)
- Supervised Learning
- supervised Learning
Self-organizing maps are example of
supervised learning. 17/20
- False This role provides the funding when doing
Logistic regression is classified as supervised analytical project.
- Project Sponsor
- True It is a role that is responsible for collecting,
Association rule algorithm is an example of analyzing, and interpreting large amount of
what approach in data mining? data.

- Unsupervised Learning - Data Scientist

Weka is tried and tested open source Based on the analytical life cycle defined by
machine learning software that can be SAS, this phase has two types of decisions:
accessed through a graphical user interface, operational and strategic.
standard terminal applications, or a Java API.
- Act
- True Interpreting mined patterns concludes the
Unsupervised machine learning finds all kind KDD process.
of unknown patterns in data
- True
- True It is the aspect of business analytics that finds
All data is unlabeled and the algorithms learn patterns in unstructured data like social
to inherent structure from the input data. media or survey tools which could uncover
This statement pertains to ___________. insights about consumer sentiment.

- UnSupervised Learning ? - Text analytics

Random forest is an example of supervised This is the first step in KDD process.
learning. - Data Selection
- True In this phase, you’ll search for relationships,
Unsupervised methods help you to find trends and patterns to gain a deeper
features which can be useful for understanding of your data.
categorization. - Explore
- True
Which of the following is not included on the - False - ADVANTAGES
analytical life cycle defined by SAS?
One of the advantages of data mining is that
- Integration there are chances of companies may sell
useful information of their customers to
These are values that lie away from the bulk
other companies for money.
of the data.
- Outliers
Which of the following is not considered as
CRISP-DM stands for
data mining technique?
- Kurtosis
- Cross-industry standard process for
data mining Supposed your email program watches which
emails you do or do not mark as spam, and
All data is labeled and the algorithms learn to
based on that learns how to better filter
predict the output from the input data. This
spam. What is the performance measure P in
statement pertains to ___________. (Use
this setting?
lowercase for your answer)
- The number of email correctly classified
- supervised
as spam/not spam
It makes the computation of multi-layer
neural network feasible. Self-organizing maps are example of
supervised learning.
- Deep Learning
- False
Unsupervised methods help you to find
features which can be useful for

- True

Logistic regression is classified as supervised learning.
- True

Suppose you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach should you use?
- Supervised learning

One of the challenges of data mining is that it can be implemented in new systems as well as existing platforms.

SUPPLEMENTARY MATERIALS

In this module, we’ve discussed the concept of data mining and its application in real life. Data Mining is defined as a process used to extract usable data from a larger set of any raw data.

It implies analyzing data patterns in large batches of data using one or more software.

The increase in the use of data-mining techniques in business has been caused largely by three events.
• The explosion in the amount of data being produced and electronically tracked
• The ability to electronically warehouse these data
• The affordability of computer power to analyze the data

What exactly is Supervised Learning?

In Supervised learning, you train the machine using data which is well "labeled." It means some data is already tagged with the correct answer. It can be compared to learning which takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you to predict outcomes for unforeseen data. Successfully building, scaling, and deploying accurate supervised machine learning models takes time and technical expertise from a team of highly skilled data scientists.

Moreover, data scientists must rebuild models to make sure the insights given remain true until the data changes.

How does Supervised Learning Work?

For example, you want to train a machine to help you predict how long it will take you to drive home from your workplace. Here, you start by creating a set of labeled data.

This data includes:

• Weather conditions
• Time of the day
• Holidays

All these details are your inputs. The output is the amount of time it took to drive back home on that specific day.
Example scenario:

You instinctively know that if it's raining outside, then it will take you longer to drive
home. But the machine needs data and statistics.

Let's see now how you can develop a supervised learning model for this example which helps the user to determine the commute time. The first thing you need to create is a training data set. This training set will contain the total commute time and corresponding factors like weather, time, etc. Based on this training set, your machine might see there's a direct relationship between the amount of rain and the time you will take to get home.

So, it ascertains that the more it rains, the longer you will be driving to get back to your home. It might also see the connection between the time you leave work and the time you'll be on the road.

The closer you are to 6 p.m., the longer it takes for you to get home. Your machine may find some of the relationships with your labeled data.

This is the start of your Data Model. It begins to capture how rain impacts the way people drive. It also starts to see that more people travel during a particular time of day.
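The commute example can be sketched as a very small regression. Everything below is an assumption made for illustration: the rainfall and commute numbers are invented, only one input (rain) is used, and `fit_line` is a hypothetical helper that fits a straight line by ordinary least squares.

```python
# Hypothetical labeled training set: (rain in mm, minutes to drive home).
training_set = [(0, 30), (2, 34), (5, 40), (8, 46), (10, 50)]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# "Training": learn the relationship between rain and commute time.
intercept, slope = fit_line(training_set)

def predict_commute(rain_mm):
    """Predicted commute time in minutes for an unseen amount of rain."""
    return intercept + slope * rain_mm

print(slope, predict_commute(6))
```

A positive slope is the numeric form of "the more it rains, the longer you will be driving". A fuller model would add the other inputs listed earlier, such as time of day and holidays.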

Famous Supervised Learning Techniques

In both regression and classification, the goal is to find specific relationships or structure in an
input data that allow us to effectively produce correct output data.

What exactly is Unsupervised Learning?

Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabeled data.

Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning. However, unsupervised learning can be more unpredictable compared with other natural learning, deep learning, and reinforcement learning methods.

Here are the prime reasons for using Unsupervised Learning:

1. Unsupervised machine learning finds all kinds of unknown patterns in data.
2. Unsupervised methods help you to find features which can be useful for categorization.
3. It takes place in real time, so all the input data is analyzed and labeled in the presence of learners.
4. It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.
How does Unsupervised Learning work?

Sample scenario:
Let's take the case of a baby and her family dog.

She knows and identifies this dog. A few weeks later a family friend brings along a dog and
tries to play with the baby.

The baby has not seen this dog before, but she recognizes that many of its features (two ears, eyes,
walking on four legs) are like her pet dog's. She identifies the new animal as a dog. This is
unsupervised learning, where you are not taught but you learn from the data (in this case, data
about a dog). Had this been supervised learning, the family friend would have told the baby that
it's a dog.

To further understand the differences between the two methods, observe the comparison below.

Input data
• Supervised Learning: Input variables and output variables will be given.
• Unsupervised Learning: Only input data will be given.

Goal
• Supervised Learning: The goal is to determine the function so well that, when a new input
data set is given, it can predict the output.
• Unsupervised Learning: The goal is to model the hidden patterns or underlying structure in
the given input data in order to learn about the data.

Related fields
• Supervised Learning: Machine learning problems, data mining, and neural networks
• Unsupervised Learning: Machine learning, data mining problems, and neural networks

Types
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Association

Examples
• Supervised Learning: Linear regression, Support vector machine
• Unsupervised Learning: k-means, Association

Who uses
• Supervised Learning: Data scientists
• Unsupervised Learning: Data scientists

Eco-systems
• Supervised Learning: Big data processing, data mining, etc.
• Unsupervised Learning: Big data processing, data mining

Uses
• Supervised Learning: Often used for expert systems in image recognition, speech recognition,
forecasting, financial analysis, and training neural networks and decision trees, etc.
• Unsupervised Learning: Used to pre-process the data, during exploratory analysis, or to
pre-train supervised learning algorithms.

M3- Main
(Introduction to Regression Analysis)
What is Regression Analysis?

• A technique for studying the dependence of one variable (called the dependent variable) on one or
more other variables (called explanatory variables), with a view to estimating or predicting the
average value of the dependent variable in terms of the known or fixed values of the explanatory variables.
When do you use regression?
• Estimate the relationship that exists, on the average, between the dependent variable and the
explanatory variable
• Determine the effect of each of the explanatory variables on the dependent variable, controlling the
effects of all other explanatory variables
• Predict the value of dependent variable for a given value of the explanatory variable
Understanding regression model based on the concept of slope
• The mathematics of the slope-intercept form of a line is similar to the regression model.
y = mx + b
• And when using the slope intercept formula, we focus on the
two constants (numbers) m and b.
• m describes the slope or steepness of the line, whereas
• b represents the y-intercept or the point where the graph crosses the y-axis.
Regression Model
• The situation using the regression model is analogous
to that of the interviewers, except instead of using
interviewers, predictions are made by performing a
linear transformation of the predictor variable.
• The prediction takes the form
y = ax + b
• where a and b are parameters in the regression model
• Dependent variable or response: Variable being predicted
• Independent variables or predictor variables: Variables being used to predict the value of the dependent variable
• Simple regression: A regression analysis involving one independent variable and one dependent variable
• In statistical notation:
y = dependent variable
x = independent variable
Types of regression

Who uses regression?

• Data analysts
• Market researchers
• Professors
• Data scientists
Advantages of using regression
• It indicates the strength of impact of multiple independent variables on a dependent variable
• It indicates the significant relationships between dependent variable and independent variable.
Regression Pitfalls

• Overfitting
It occurs when the accuracy of the
provisional model is not as high
on the test set as it is on the
training set, often because the
provisional model has fit the noise
in the training set

• Excluding Important Predictor

The linear association between
two variables ignoring other
relevant variables can differ both
in magnitude and direction from
the association that controls for
other relevant variables.

• Extrapolation
It refers to estimates and
predictions of the target variable
made using the regression
equation with values of the
predictor variable outside of the
range of the values of x in the
data set

• Missing Values
Missing data has the potential to
adversely affect a regression
analysis by reducing the total
usable sample size.

• Power and sample size

In small datasets, a lack of
observations can lead to poorly
estimated models with large
standard errors.
(Overview of regression)
What is Regression analysis?
- Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables.
- It can be utilized to assess the strength of the relationship between variables and for modeling the
future relationship between them.
Linear Regression
-It is a model that tests the relationship between a dependent variable and a single independent variable.
-It can also be described using the following expression:

𝑦 = 𝑎 + 𝑏𝑋 + 𝜖
• y – dependent variable
• X – independent (explanatory) variable
• a – intercept
• b – slope

• 𝜖 – residual (error)
Linear Model Assumptions
1. The dependent and independent variables show a linear relationship between the slope and the intercept.
2. The independent variable is not random.
3. The expected value of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations.
5. The value of the residual (error) is not correlated across all observations.
6. The residual (error) values follow the normal distribution.
Example Problem:
If you want to know the strength of the relationship between House Price and Square feet, you can use linear regression.
Step 1:
Identify the dependent and independent variables.
Step 2:
Run regression analysis on the data using any system that offers statistical analysis. For this example, we
can use Microsoft Excel with the help of the Data Analysis Tool pack.
Note: The Data Analysis Tool pack must be enabled manually in Excel through Options.
Step 3:
Analyze the results. Take note of the values for coefficients.
Step 4:
Substitute the values of the coefficients to the formula mentioned previously.

house price = 98.25 + 0.1098(sq. ft.)

• 98.25 - value of the intercept from the results provided in Step 3

• 0.1098 - value of X Variable 1 (the slope) from the results provided in Step 3.
Step 5:
Insert the value for sq. ft. to predict for the house price.

house price = 98.25 + 0.1098(sq. ft.)

house price = 98.25 + 0.1098(1500)

predicted house price = 262.95

• 1500 - example value for sq. ft.
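
The substitution in Steps 4 and 5 can also be checked with a few lines of Python. This is only a minimal sketch: the intercept and slope are copied from the worked example above, while the function name predict_house_price is a hypothetical name chosen for illustration.

```python
# Coefficients taken from the worked example (Step 3 output).
INTERCEPT = 98.25   # estimated house price when sq. ft. is 0
SLOPE = 0.1098      # estimated change in house price per additional sq. ft.

def predict_house_price(sq_ft):
    """Return the predicted house price for a given floor area."""
    return INTERCEPT + SLOPE * sq_ft

# Step 5: insert the example value of 1500 sq. ft.
print(round(predict_house_price(1500), 2))  # 262.95
```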

Multiple Linear Regression
• Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
𝑦 = 𝑎 + 𝑏𝑋1 + 𝑐𝑋2 + 𝑑𝑋3 + 𝜖
• Interpretation of slope coefficient βj : Represents the change in the mean value of the dependent
variable y that corresponds to a one unit increase in the independent variable xj , holding the values of
all other independent variables in the model constant.
• The multiple regression equation that describes how the mean value of y is related to x1 , x2 , . . . , xq :
E( y | x1 , x2 , . . . , xq ) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq
Multiple Linear Regression Model

• Estimated multiple regression equation: ŷ = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq

• b0 , b1 , b2 , . . . , bq = the point estimates of β0 , β1 , β2 , . . . , βq

• ŷ = estimated value of the dependent variable

• The least squares method is used to develop the estimated multiple regression equation:

• Finding b0 , b1 , b2 , . . . , bq that satisfy $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} e_i^2$.

• Uses sample data to provide the values of b0 , b1 , b2 , . . . , bq that minimize the sum of
squared residuals.
Nonlinear Regression
• Nonlinear regression is a regression in which the dependent or criterion variables are modeled as
a non-linear function of model parameters and one or more independent variables.
• Nonlinear regression can be modeled by several equations similar to the one below:

Regression Analysis
• Regression is a technique used for forecasting, time series modeling and finding the causal effect
between the variables.
Why use regression?
1) Prediction of a target variable (forecasting).
2) Modeling the relationships between the dependent variable and the explanatory variable.
3) Testing hypothesis.
(Linear Regression)
Linear Regression
• The whole process of linear regression is based on the fact that there exists a relation between the
independent variables and dependent variable.
Simple Linear Regression
• Regression Model: The equation that describes how y is related to x and an error term.
• Simple Linear Regression Model:
y = β0 + β1 x + ε
• Parameters: The characteristics of the population, β0 and β1
• Random variable - Error term, ε
• The error term accounts for the variability in y that cannot be explained by the linear relationship
between x and y.
• Regression equation: The equation that describes how the expected value of y, denoted E(y), is related
to x.
• Regression equation for simple linear regression: E(y|x) = β0 + β1x
• E(y|x) = expected value of y for a given value of x
• β0 = y-intercept of the regression line
• β1 = slope
• The graph of the simple linear regression equation is a straight line.
Least Square Method
• Least squares method: A procedure for using sample data to find the estimated regression equation.
• Here, we will determine the values of b0 and b1 .
• Interpretation of b0 and b1 :
• The slope b1 is the estimated change in the mean of the dependent variable y that is associated
with a one unit increase in the independent variable x.
• The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0.
• Least squares method equation:

$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• yi = observed value of the dependent variable for the ith observation

• ŷi = predicted value of the dependent variable for the ith observation

• n = total number of observations
• ith residual: The error made using the regression model to estimate the mean value of the dependent
variable for the ith observation.

$e_i = y_i - \hat{y}_i$

• We are finding the regression that minimizes the sum of squared errors.
• Least squares estimates of the regression parameters:
Slope equation

$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

y-intercept equation

$b_0 = \bar{y} - b_1 \bar{x}$

• xi = value of the independent variable for the ith observation

• yi = value of the dependent variable for the ith observation

• x̄ = mean value for the independent variable

• ȳ = mean value for the dependent variable

• n = total number of observations
• Experimental region: The range of values of the independent variables in the data used to estimate
the model.
• The regression model is valid only over this region.
• Extrapolation: Prediction of the value of the dependent variable outside the experimental region.
• It is risky, because the regression model is valid only over the experimental region.
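
The least squares estimates and the experimental region described in this section can be sketched in Python as follows. This is an illustrative sketch only: the data values are hypothetical (chosen to lie exactly on y = 2 + 3x so the result is easy to verify), and the function names are assumptions made for this example. The formulas used are the standard least squares estimates for simple linear regression.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of the independent variable
    y_bar = sum(ys) / n  # mean of the dependent variable
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # y-intercept
    return b0, b1

def in_experimental_region(x, xs):
    """True if x lies within the range of x values used to fit the model."""
    return min(xs) <= x <= max(xs)

# Hypothetical sample data lying exactly on the line y = 2 + 3x.
xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]

b0, b1 = fit_simple_linear_regression(xs, ys)

# Residuals e_i = y_i - y_hat_i and the sum of squared errors being minimized.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

print(b0, b1, sse)
print(in_experimental_region(3.5, xs))  # inside the fitted range
print(in_experimental_region(10, xs))   # outside: this would be extrapolation
```

Checking in_experimental_region before predicting is one simple way to flag an extrapolated prediction instead of trusting it silently.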
Multiple Linear Regression
• It is a statistical technique that uses several independent variables to predict the dependent variable.
• It is the extension of linear regression.
• Simple linear regression deals with only one independent variable.

• Multiple linear regression deals with more than one independent variable.

Case Study
You want to know the effect of violence, stress, social support on internalizing behavior.
Details about the study
• Participants were children aged 8 to 12
• They lived in high-violence areas in the USA
• Hypothesis: violence and stress lead to internalizing behavior, whereas social support would reduce
internalizing behavior.
• Predictors
• Degree of witnessing violence
• Measure of life stress
• Measure of social support
• Outcome
• Internalizing behavior (i.e. depression, anxiety)
Test for overall significance
• Shows if there is a linear relationship between all of the X variables taken together and Y
• Hypothesis:

• Slopes for Witness and Stress are positive, but slope for Social Support is negative
• If you had subjects with identical stress and social support, a one unit increase in Witness would
produce .038 unit increase in internalizing symptoms
• If Witness = 20, Stress = 5 and SocSupport = 35, then we would predict that the internalizing symptoms
would be .012.
(Logistic Regression)
Logistic Regression
- It is a statistical technique used to develop predictive models with categorical dependent
variables having dichotomous or binary outcomes.
- Similar to linear regression, the logistic regression models the relationship between the
dependent variable and one or more independent variables.
Graph of Logistic Regression

Logistic Regression Assumption

• Logistic regression measures the relationship between the categorical dependent variable and one or
more independent variables by estimating probabilities using a logistic function, which is the cumulative
logistic distribution.
• There are other variants of logistic regression that focus on modeling a categorical variable with three
or more levels, say X, Y, and Z and a few others.
Concept of Probability
- Unlike linear regression, the logistic regression model is concerned with the log of the odds, or the
probability of the event happening.
- Everything starts with the concept of probability.
Example Scenario
Let's say that the probability of success of some event is 0.8.
Then the probability of failure is 1 – 0.8 = 0.2.
The odds of success are defined as the ratio of the probability of success over the probability of failure.
In our example, the odds of success are 0.8/0.2 = 4. That is to say that the odds of success are 4 to 1. If
the probability of success is 0.5, that is, a 50-50 chance, then the odds of success are 1 to 1.
Logistic Regression Equation
- The equation of logistic regression can be defined as follows, where p is the probability of the event:

ln( p / (1 − p) ) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq

To predict the probability of the event to happen, we can further solve the preceding equation as follows:

p = 1 / ( 1 + e^−(β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq) )
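
A minimal Python sketch of the odds arithmetic in this example, together with the standard logistic function that turns log odds back into a probability. The function names are hypothetical and chosen only for illustration.

```python
import math

def odds(p):
    """Odds of success: probability of success over probability of failure."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the quantity a logistic regression models)."""
    return math.log(odds(p))

def probability_from_log_odds(z):
    """Logistic function: converts log odds z back to a probability."""
    return 1 / (1 + math.exp(-z))

print(round(odds(0.8), 4))  # 4.0, i.e. odds of 4 to 1
print(odds(0.5))            # 1.0, i.e. odds of 1 to 1

# Round trip: probability -> log odds -> probability.
z = log_odds(0.8)
print(round(probability_from_log_odds(z), 4))  # 0.8
```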

Maximum-Likelihood estimation
- It is a method of estimating the parameters of a statistical model with given data.
- The method of maximum likelihood selects the set of values of the model parameters that
maximizes the likelihood function, that is, it maximizes the “agreement” of the selected model
with the observed data
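
To make the idea of maximizing the likelihood concrete, here is a small Python sketch that fits a one-predictor logistic regression by plain gradient ascent on the log-likelihood. Everything in it is an assumption for illustration: the data (hours studied versus pass/fail) is hypothetical, the learning rate and step count are arbitrary choices, and gradient ascent is used only because it is short to write. Statistical software such as R's glm() uses a different fitting algorithm but searches for the same maximum.

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(xs, ys, b0, b1):
    """Log-likelihood of the binary outcomes ys under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Estimate (b0, b1) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1

# Hypothetical data: hours studied and whether the student passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
print("intercept:", b0, "slope:", b1)
print("log-likelihood at start:", log_likelihood(hours, passed, 0.0, 0.0))
print("log-likelihood after fitting:", log_likelihood(hours, passed, b0, b1))
```

The fitted parameters give a higher log-likelihood than the starting values, which is the "agreement" with the observed data that the method maximizes.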
Building Logistic Regression Model
- You can perform logistic regression in R by using the glm() function.
- The family = "binomial" argument tells the glm() function to fit a logistic regression
model. (The glm() function can fit other models too; we'll look into this later.)
Interpreting Results
• A positive estimate indicates that, for every unit increase of the respective independent variable, there
is a corresponding increase in the log of odds; a negative estimate indicates a corresponding decrease.
• Along with the independent variables, we also see 'Intercept'. Intercept is the log of odds of the
event (Good or Bad Quality) when all the categorical predictors have a value of 0.
• We can see the standard error, z value, and p-value along with an asterisk indication to easily identify
significance.
• We then determine whether the estimate is truly far away from 0. If the standard error of the estimate
is small, then relatively small values of the estimate can reject the null hypothesis.
• If the standard error is large, then the estimate should also be large enough to reject the null hypothesis.
Testing the Significance
• To test the significance, we use the 'Wald Z Statistic' to measure how many standard deviations the
estimate is away from 0.
• The estimate is considered significant if the probability of observing it by
chance is less than 5%.
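
The Wald z statistic and the 5% rule can be sketched in Python as follows. The estimates and standard errors are hypothetical numbers, not results from this module, and the two-sided p-value assumes the z statistic follows a standard normal distribution.

```python
import math

def wald_z_test(estimate, std_error):
    """Return (z, p_value) for the null hypothesis that the coefficient is 0."""
    z = estimate / std_error  # how many standard errors the estimate is from 0
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Same hypothetical estimate, two different standard errors.
z_small_se, p_small_se = wald_z_test(0.98, 0.5)
z_large_se, p_large_se = wald_z_test(0.98, 1.0)

print(z_small_se, p_small_se)  # small standard error: p is just under 0.05
print(z_large_se, p_large_se)  # large standard error: p is well above 0.05
```

The same estimate is significant with the small standard error and not significant with the large one, which is the point made in the interpretation notes above.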
Two ways to validate model accuracy
• Confusion Matrix
• ROC curve

What is Confusion Matrix?

- A confusion matrix is a table used to analyze the performance of a model (classification).
- Each column of the matrix represents the instances in a predicted class while each row represents the
instances in an actual class or vice versa.
What is ROC Curve?
- The Receiver Operating Characteristic (ROC) curve is a standard technique to summarize
classification model performance over a range of trade-offs between true positive (TP) and false
positive (FP) error rates.
- The ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly)
versus 1-specificity for the possible cutoff classification probability values.
Columns in Confusion Matrix
• True Positive (TP): When it is predicted as TRUE and is actually TRUE
• False Positive (FP): When it is predicted as TRUE and is actually FALSE
• True Negative (TN): When it is predicted as FALSE and is actually FALSE
• False Negative (FN): When it is predicted as FALSE and is actually TRUE
Confusion Matrix Formula
An exhaustive list of metrics that are usually computed from the confusion matrix to aid in interpreting
the goodness of fit for the classification model are as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
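
As a quick check of the accuracy formula, the sketch below uses the confusion matrix counts that appear in the formative assessment of this module (TP = 20, TN = 27, FP = 18, FN = 25). The function names are hypothetical, and sensitivity and specificity are included using their standard definitions because the ROC curve is plotted from them.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """True positive rate: share of actual TRUE cases predicted as TRUE."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of actual FALSE cases predicted as FALSE."""
    return tn / (tn + fp)

tp, tn, fp, fn = 20, 27, 18, 25

print(round(accuracy(tp, tn, fp, fn) * 100))  # 52 (percent)
print(round(sensitivity(tp, fn), 3))          # 0.444
print(round(specificity(tn, fp), 3))          # 0.6
```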
Interpreting ROC Curve

- Interpreting the ROC curve is again straightforward. The ROC curve visually helps us
understand how our model compares with a random prediction.
- The random prediction will always have a 50% chance of predicting correctly; after comparing
with this baseline, we can understand how much better our model is.
- The diagonal line indicates the accuracy of random predictions and the lift from the diagonal line
towards the left upper corner indicates how much improvement our model has in comparison to
the random predictions.
- Models having a higher lift from the diagonals are considered to be more accurate models.
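
The "lift" above the diagonal is usually summarized by the area under the ROC curve (AUC). The sketch below computes AUC with the pairwise definition (the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as half), which equals the area under the ROC curve. The scores are hypothetical predicted probabilities, not output from a model in this module.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive case ranked above the negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities for actual TRUE and actual FALSE cases.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]

print(round(auc(positives, negatives), 3))   # 0.889 (8 of 9 pairs ranked correctly)
print(auc([0.9, 0.8], [0.2, 0.1]))           # 1.0, a perfect ranking
print(auc([0.5, 0.5], [0.5, 0.5]))           # 0.5, no better than random
```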
Formative assessment 3
1. It is computed as the ratio of the two odds. [Odds ratio]
2. It pertains to the accuracy of the provisional model is not as high on the test set as it is on the
training set. [OVERFITTING]
3. Suppose you want to train a machine to help you predict how long it will take you to drive
home from your workplace using regression. What type of data mining approach should you
use? [Supervised Learning]
4. There should not be any multicollinearity between the independent variables in the model, and
all independent variables should be independent to each other. [TRUE]
5. In logistic regression, it is that the target variable must be discrete and mostly binary or
dichotomous. [TRUE]
6. The explanatory variable is the variable being predicted. [FALSE]
7. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear]
8. In regression, the value of the residual (error) is one. [FALSE]
9. Multiple linear regressions are classified as supervised learning. [TRUE]
10. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [Origin of the coffee]
11. It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression. [Data
Analysis Tool Pack]
12. It is the range of values of the independent variables in the data used to estimate the model.
[Experimental region]
13. It is essentially similar to the simple linear model, with the exception that multiple independent
variables are used in the model. [Multiple Regression]
14. The graph of the simple multiple regression equation is a straight line. [TRUE]
15. Researcher question: Do fourth graders tend to be taller than third graders?

This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the explanatory variable on this study? [grade level]

16. It is used to explain the variability in the response variable. [Error term]
17. Given the results for multiple linear regression below. Predict the exam score if a student spent
10 hours in studying and had 4 prep exams
taken. [67.67 + 5.56 (number of hours = 10) + (-0.60)(4) = 120.87]


18. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [origin of the coffee]
19. It is a statistical technique that uses several independent variables to predict the dependent
variable. [Multiple linear regression]
20. Researcher question: Do fourth graders tend to be taller than third graders?

This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the response variable on this study? [height]
21. Logistic regression is used to predict continuous target variable. [FALSE]
22. Given the results for multiple linear regression below. Identify the estimated regression equation.
[exam score = 5.56*hours + prep_exams*-0.60 + 67.67]
23. y= a+ bX_1+cX_2+dx_3+ 𝜖
the formula above is used to model non linear regression. [FALSE]
24. When it is predicted as TRUE and is actually FALSE. [FALSE POSITIVE]
25. Multiple linear regression is classified as unsupervised learning. [FALSE]
26. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear regression]
27. Which of the following evaluation metrics can’t be applied in the case of logistic regression
output to compare with the target? [mean squared error]
28. Compute for the accuracy of the model depicted in the confusion matrix below;
TP = 20
TN = 27
FP = 18
FN = 25
[52%]
29. Nonlinear is the extension of linear regression. [FALSE] - multiple
30. Provide one (1) way to validate the accuracy of the logit model [ROC CURVE]

28/30 Formative 3
1. It is used to develop the estimated multiple regression equation. [Least squares method]
2. It is used to measure the binary or dichotomous classifier performance visually and Area Under
Curve (AUC) is used to quantify the model performance [ROC Curve]
3. If a predictor variable X is found to be highly significant, we could conclude that: [changes in X
are associated to changes in Y]
4. Which of the following is not a reason when to use regression? [When you aim to know the
products that are usually bought together]
5. A group of middle school students wants to know if they can use height to predict age, what is
the response variable in this study? [Age]
6. It can be utilized to assess the strength of the relationship between variables and for modeling the
future relationship between them. [Regression Analysis]
7. Two variables are correlated if there is a linear association between them. If not, the variables are
uncorrelated. [TRUE]
8. In linear models, the dependent and independent variables show a linear relationship between the
slope and the intercept. [TRUE]
9. What type of relationship is shown in the graph below?
[Positive linear relationship]
10. It is a point that lies far away from the rest. [OUTLIER]
11. What type of relationship is shown in the graph below?

[Negative Linear Relationship]

12. Formula below describes:

[Linear regression]
13. Logistic regression is classified as one of the examples of unsupervised learning [false]
14. Logistic regression is used to predict continuous target variable [FALSE]
15. In the equation

A and B are considered as independent variables [FALSE]

16. A regression line is used for all of the following except one. Which one is not a valid use of a
regression line? [To predict the value of Y for an individual, given that individual’s X-
17. It is considered as the log of odds of the event [Intercept] – probability
45. What type of relationship is shown in the graph above?
- Fat intake – No relationship

46. In the equation Y = a + bX + ϵ, a and b are considered as independent variables.
- False

47. Provide one (1) way to validate the accuracy of a logit model.
- ROC curve

48. A regression line is used for all of the following except one. Which one is not a valid
use of a regression line?
- To predict the value of Y for an individual, given that individual's X-

49. It is considered as the log odds of the event.
- Intercept

50. A linear regression analysis was conducted to predict the company sales. Below are the
results. Identify the linear regression equation assuming that x is the parameter for advertising.
- sales = 23.42x + intercept

51. Compute for the accuracy of the model depicted in the confusion matrix below;
TP = 20
TN = 27
FP = 18
FN = 25
- 52%

52. In the equation Y = a + bX + ϵ, what is considered as the parameter/s?
- B, x

It is a model that tests the relationship between a dependent variable and a single independent
variable.
- Linear regression

Suppose you want to train a machine to help you predict how long it will take you to drive home from
your workplace using regression. What type of data mining approach should you use?
- Supervised Learning

In small datasets, a lack of observations can lead to poorly estimated models with large standard errors.
- True

Multiple linear regression is classified as supervised learning.
- True

It is computed as the ratio of the two odds.
- Odds ratio

Multilinear regression is a regression in which the dependent or criterion variables are modeled as a
non-linear function of model parameters and one or more independent variables.
- False

Regression is a technique used for forecasting, time series modeling and finding the casual effect
between the variables.
- False

It is used to compare two nested generalized linear models.
- Likelihood ratio test

The lower is the AUC value, the worse is the model predictive accuracy.
- True

- exam score = 5.56*hours + prep_exams*-0.60 +

It is the variable being manipulated by researchers.
- Explanatory variable

Which of the following is not a reason when to use regression?
- When you aim to know the products that are usually bought together.

It is a regression in which the dependent or criterion variables are modeled as a non-linear function of
model parameters and one or more independent variables.
- Nonlinear Regression

It is essentially similar to the simple linear model, with the exception that multiple independent
variables are used in the model.
- Multiple Regression

There should not be any multicollinearity between the independent variables in the model, and all
independent variables should be independent to each other
- True

The formula below describes
- Multiple linear regression

Regression is a technique used for forecasting, time series modeling, and finding the casual effect
between the variables.
- False

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pack

A group of middle school students wants to know if they can use height to predict age. The explanatory
variable is height.
- True
If a predictor variable x is found to be highly Multiple linear regression can be expressed
significant we would conclude that:

- a change in x causes a change in y as

(changes in x are associated to changes in y)
- False
Overfitted data can significantly lose the predictive
ability due to an erratic response to noise whereas The _____________ are defined as the ratio of the
underfitted will lack the accuracy to account for the probability of success over the probability of failure.
variability in response in its entirety. (Use lowercase for your answer).

- True - odds ratio

Given the results for multiple linear regression Multiple linear regression is classified as
below. Identify the estimated regression equation. unsupervised learning.

- False

The formula above is used to model nonlinear

- False -False

It is a model that tests the relationship between a dependent variable and a single independent
variable.
- Linear Regression

When it is predicted as TRUE and is actually TRUE.
- True Positive

If a predictor variable x is found to be highly significant we would conclude that: a change in y
causes a change in x.
- False (changes in x are associated to changes in y)

When it is predicted as FALSE and is actually FALSE.
- True Negative

Response variable is the variable being manipulated by researchers.
- False

Logistic regression is classified as supervised learning.

If a predictor variable x is found to be highly significant we would conclude that:
- changes in x are associated to changes in y

It is the variable being predicted.
- Dependent variable

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous.
- True

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pak

It is essentially similar to the simple linear model, with the exception that multiple independent
variables are used in the model.
- Multiple regression
A linear regression analysis was conducted to
predict the company sales. Below are the It pertains to the accuracy of the provisional model
results. Identify the linear regression equation is not as high on the test set as it is on the training
assuming that x is the parameter for set. (Use UPPERCASE for your answer)

Extrapolation of the range of values of the

independent variables in the data used to estimate
the model.


In logistic regression, it is that the target variable

- sales = 23.42x + intercept must be discrete and mostly binary or dichotomous.


The higher is the AUC value, the better is the It is the tool pack in Microsoft Excel that can be
model predictive accuracy. downloaded to perform linear regression.

-True -Data Analysis Tool Pack

The explanatory variable is the variable being Multiple linear regression is classified as
predicted. unsupervised learning.
-False -False
A researcher believes that the origin of the beans When it is predicted as FALSE and is actually
used to make a cup of coffee affects hyperactivity. FALSE.
He wants to compare coffee from three different
regions: Africa, South America, and Mexico. What -True Negative
is the response variable on this study?
Logistic regression measures the relationship
-Hyperactivity level between the ____________ dependent
variable and one or more independent
The graph of the simple linear regression equation variables. (Use lowercase for your answer)
is a straight line
A group of middle school students wants to know if
Given the results for multiple linear regression they can use height to predict age. The explanatory
variable is height.
below. Identify the estimated regression equation.

-exam score = 5.56*hours + prep_exams*-0.60 +

Which of the following evaluation metrics can’t be applied in the case of
logistic regression output to compare with the target?

There should not be any multicollinearity between the independent variables
in the model, and all independent variables should be independent of each
other

Logistic regression is classified as one of the examples of unsupervised
learning.
-False

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model.
-Multiple linear regression

Which of the following methods do we use to best fit the data in Logistic
Regression?
-Maximum Likelihood

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model.
-Multiple Regression

What type of relationship is shown in the graph below?

The graph of the simple linear regression equation is a straight line.

Multiple linear regression can be expressed as y = a + bX + ϵ
-Undefined

A regression line is used for all of the following except one. Which one is
not a valid use of a regression line?
-to determine if a change in X causes a change in Y.

24/30

1. Regression is a famous supervised learning technique.
• True

2. Overfitted data can significantly lose the predictive ability due to an
erratic response to noise whereas underfitted will lack the accuracy to
account for the variability in response in its entirety.
• True

3. A group of middle school students wants to know if they can use height to
predict age. What is the response variable in this study
• Height
• The type of school
• Age
• The number of students

4. Multilinear regression is a regression in which the dependent or
criterion variables are modeled as a non-linear function of model parameters
and one or more independent variables.
• False

5. It is the variable being manipulated by
• Explanatory Variable
• Factor
• Response Variable
• Dependent Variable

6. Multiple linear regression is classified as supervised learning.
• True

7. Research question: Do fourth graders tend to be taller than the third
graders? This is an observational study. The researcher wants to use grade
level to explain differences in height. What is the response variable in
this study?
• Gender
• Age
• Grade level
• Height

8. If the p-value for model fit is less than 0.5, it signifies that our full
model fits significantly better than our reduced model.
• False

9. It is used to compare two nested generalized linear models.
• Odds ratio
• Hosmer-lemeshow test
• Likelihood ratio test
• ROC Curve

10. In linear models, the dependent and independent variables show a linear
relationship between the slope and the intercept
• True

11. Suppose you want to train a machine to help you predict how long it will
take you to drive home from your workplace. What type of data mining
approach should you use
• Unsupervised Learning
• Supervised learning

12. A researcher believes that the origin of the beans used to make a cup of
coffee affects hyperactivity. He wants to compare coffee from three
different regions: Africa, South America, and Mexico. What is the
explanatory variable in this study?
• The taste of the coffee
• The cups of coffee taken
• Origin of the coffee
• Hyperactivity level

13. The value of the residual (error) is not correlated across all
observations.
• True

14. There should not be any multicollinearity between the independent
variables in the model, and all independent variables should be independent
of each other
• False

15. It is the variable being predicted.
• Explanatory variable
• Dependent variable
• Independent variable
• Factor

16. The graph of the simple linear regression equation is a straight line
• True

17. It is the tool pack in Microsoft Excel that can be downloaded to perform
linear regression.
• Data Analysis Tool Pack
• Test Mining Tool Pack
• Regression Tool Pack
• Data Mining Tool Pack

18. It is a point that lies far away from the rest.
• Outliers
• Dummy variables
• Residual

19. The formula below describes
y = a + bX_1 + cX_2 + dX_3 + ϵ
• Linear Regression
• Multiple Linear Regression
• Nonlinear Regression

20. Logistic regression is classified as one of the examples of unsupervised
learning.
• False
21. It is a table used to analyze the performance of a model
(classification). (Use lowercase)
• confusion matrix

22. The family = ______________ command tells R to use the glm function to
fit a logistic regression model. (Use lowercase for your answer to supply
the code)
• logit regression

23. It is a model that tests the relationship between a dependent variable
and a single independent variable.
• Linear Regression
• Least squared method
• Multiple regression
• Nonlinear regression

24. y = a + bX_1 + cX_2 + dX_3 + ϵ
The formula above is used to model nonlinear regression.
• False

25. It is a regression in which the dependent or criterion variables are
modeled as a non-linear function of model parameters and one or more
independent variables.
• Linear Regression
• Least squared method
• Multiple regression
• Nonlinear regression

26. Logistic regression is classified as supervised learning.
• True

27. In the equation
y = a + bX + ϵ
A and b are considered as independent
• False

28. Shown below is a scatterplot of Y versus X. Which of the following is
the approximate value for R^2?
• 99.5%
• 50%
• 25%
• 0%

29. Nonlinear is the extension of linear regression
• False

30. A group of middle school students wants to know if they can use height
to predict age. The response variable in this study is age.
• True

ADDITIONAL FORMATIVE

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple
regression

Which of the following is not a reason when to use regression? When you aim
to know the products that are usually bought together.
It pertains to the case where the accuracy of the provisional model is not
as high on the test set as it is on the training set. (Use UPPERCASE for
your answer) OVERFITTING

Logistic regression is classified as supervised learning. True

In logistic regression, the target variable must be discrete and mostly
binary or dichotomous.

It is used to explain the variability in the response variable. response
variable

It can be utilized to assess the strength of the relationship between
variables and for modeling the future relationship between them Regression
Analysis

It is used to measure the binary or dichotomous classifier performance
visually and Area Under Curve (AUC) is used to quantify the model
performance. ROC Curve

If the p-value for model fit is less than 0.5, it signifies that our full
model fits significantly better than our reduced model. False

Missing data has the potential to adversely affect a regression analysis by
reducing the total usable sample size. True

Factor is the variable being predicted. False

It is a model that tests the relationship between a dependent variable and a
single independent variable. Linear Regression

In regression, the value of the residual (error) is one. True False

What type of relationship is shown in the graph below? Negative Linear
Relationship

The value of the residual (error) is not correlated across all observations.
True

It is the variable of primary interest. response

If the p-value for model fit is less than 0.5, it signifies that our full
model fits significantly better than our reduced model. False

It is used to explain the variability in the response variable parameter

The value of the residual (error) is one false

The value of the residual (error) is correlated across all observations.
False

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple linear
regression

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple
Regression

Logistic regression is used to predict continuous target variable. True

The value of the residual (error) is not correlated across all observations.
True

Response variable is the variable being manipulated by researcher. False

Given the results for multiple linear regression below. Predict the exam
score if a student spent hours studying and had 2 prep exams taken.

Logistic regression is classified as supervised learning. True

A researcher believes that the origin of the beans used to make a cup of
coffee affects hyperactivity. He wants to compare coffee from three
different regions: Africa, South America, and Mexico. What is the
explanatory variable in this study? Origin of the coffee

The value of the residual (error) is not correlated across all observations.
True

The lower the AUC value, the worse the model's predictive accuracy. True

Extrapolation of the range of values of the independent variables in the
data used to estimate the model. True

Regression is a famous supervised learning technique. True

A group of middle school students wants to know if they can use height to
predict age. What is the explanatory variable in this study? height

In regression, the value of the residual (error) is one.

Suppose you want to train a machine to help you predict how long it will
take you to drive home from your workplace. What type of data mining
approach should you use? Supervised Learning

How will you express the equation of a regression analysis where you aim to
predict the value of y based on x. (Use lowercase and avoid so much space)

Multiple linear regression is classified as unsupervised learning. False

Regression is a technique used for forecasting, time series modeling, and
finding the causal effect between the variables. False

It is a point that lies far away from the rest. Outliers

Suppose you want to train a machine to help you predict how long it will
take you to drive home from your workplace using regression. What type of
data mining approach should you use? Supervised Learning

When it is predicted as TRUE and is actually TRUE. True Positive

Two variables are correlated if there is a linear association between them.
If not, the variables are uncorrelated. True

Multilinear regression is a regression in which the dependent or criterion
variables are modeled as a non-linear function of model parameters and one
or more independent variables. True

Multiple linear regression is classified as unsupervised learning. False

What type of relationship is shown in the graph below? Negative Linear
Relationship

Multilinear regression is a regression in which the dependent or criterion
variables are modeled as a non-linear function of model parameters and one
or more independent variables. False

If a predictor variable x is found to be highly significant we would
conclude that: changes in x are associated to changes in y

It is used to develop the estimated multiple regression. Multiple Regression

The graph for a nonlinear relationship is often a straight line false

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple
Regression.

Logistic regression is used to predict continuous target variable. True

In linear regression, the target variable must be discrete and mostly binary
or dichotomous. True

You predicted negative and it’s false. False Negative

In ROC Curve, models having a higher lift from the diagonals are considered
to be more accurate models. True

What type of relationship is shown in the graph above? No relationship

y = a + bX_1 + cX_2 + dX_3 + ϵ
The formula above is used to model nonlinear regression. False

A group of middle school students wants to know if they can use height to
predict age. What is the response variable in this study? Age

It is a regression in which the dependent or criterion variables are modeled
as a non-linear function of model parameters and one or more independent
variables. Nonlinear Regression

Logistic regression is classified as supervised learning. True

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple
regression

Slope is the point that lies far away from the rest. True

Nonlinear is the extension of linear regression.

The lower the AUC value, the worse the model's predictive accuracy. True

In logistic regression, the target variable must be discrete and mostly
binary or dichotomous. True

Multiple linear regression is classified as supervised learning. True

Given the results for multiple linear regression below. Predict the exam
score if a student spent 10 hours in studying and had 4 prep exams taken.

It is the variable of primary interest. Parameter

A group of middle school students wants to know if they can use height to
predict age. What is the response variable in this study? Age

If the p-value for model fit is less than 0.5, it signifies that our full
model fits significantly better than our reduced model. False

Suppose you want to train a machine to help you predict how long it will
take you to drive home from your workplace using regression. What type of
data mining approach should you use? Supervised Learning

It can be utilized to assess the strength of the relationship between
variables and for modeling the future relationship between them. Regression
Analysis

Logistic regression is classified as one of the examples of unsupervised
learning. False

In the equation y = a + bX + ϵ What is considered as a dependent
parameter/s? y

Research question: Do fourth graders tend to be taller than third graders?
This is an observational study. The researcher wants to use grade level to
explain differences in height. What is the explanatory variable in this
study? grade level

The graph for a nonlinear relationship is often a straight line. True

In linear models, the dependent and independent variables show a linear
relationship between the slope and the intercept true

The family = ______________ command tells R to use the glm function to fit a
logistic regression model. (Use lowercase for your answer to supply the
code)

The formula below describes
y = a + bX_1 + cX_2 + dX_3 + ϵ
Multiple linear regression

What type of relationship is shown in the graph below? No relationship

This is an observational study. The researcher wants to use grade level to
explain differences in height. What is the response variable in this study?
Height

It can be utilized to assess the strength of the relationship between
variables and for modeling the future relationship between them. Regression
Analysis

It is used to develop the estimated multiple regression. Multiple Regression

It refers to estimates and predictions of the target variable made using the
regression equation with values of the predictor variable outside of the
range of the values of x in the data set. (Use UPPERCASE for your answer)
PREDICTION

It is a regression in which the dependent or criterion variables are modeled
as a non-linear function of model parameters and one or more independent
variables. Nonlinear Regression

Which of the following evaluation metrics can’t be applied in the case of
logistic regression output to compare with the target? Mean-Squared-Error

The formula below describes
y = (b_1 + b_2 x_1 + b_3 x_2 + b_4 x_3)/(1 + b_5 x_1 + b_6 x_2 + b_7 x_3)
Nonlinear regression

Nonlinear is the extension of linear regression. False

Which of the following methods do we use to best fit the data in Logistic
Regression? Maximum Likelihood

A researcher believes that the origin of the beans used to make a cup of
coffee affects hyperactivity. He wants to compare coffee from three
different regions: Africa, South America, and Mexico. What is the response
variable in this study? Hyperactivity

When it is predicted as TRUE and is actually TRUE. True Positive

Residual is a point that lies far away from the rest

The graph of the simple linear regression equation is a straight line true

You predicted negative and it’s false. False Negative.

How do we assess the goodness of fit or accuracy of the model in logistic
regression? (Use lowercase for your answer) roc curve

In the equation
y = a + bX + ϵ
What is considered as the parameter/s? a,b

The graph for a nonlinear relationship is often a straight line. True

In logistic regression, the target variable must be discrete and mostly
binary or dichotomous. True

How will you express the equation of a regression analysis where you aim to
predict the value of y based on x. Y=a+bX

It is used to explain the variability in the response variable parameter

It is used to explain the variability in the response variable parameter

Logistic regression is classified as one of the examples of unsupervised
learning. False

Research question: Do fourth graders tend to be taller than the third
graders?

Regression is a technique used for forecasting, time series modeling, and
finding the causal effect between the variables. True

If a predictor variable x is found to be highly significant we would
conclude that: a change in y causes a change in x. false

There should not be any multicollinearity between the independent variables
in the model, and all independent variables should be independent of each
other true

Regression is a famous supervised learning technique. True

exam score = 5.56*hours + prep_exams*-0.60 + 67.67
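
The estimated equation above can be evaluated directly to answer the "predict the exam score" items. The following is only a minimal sketch: the function name `predict_exam_score` is an assumption made for illustration, and the inputs (10 hours of study, 4 prep exams) are taken from the prediction question that appears elsewhere in this reviewer.

```python
def predict_exam_score(hours: float, prep_exams: float) -> float:
    """Evaluate exam score = 5.56*hours + prep_exams*-0.60 + 67.67."""
    slope_hours = 5.56        # coefficient of hours studied
    slope_prep_exams = -0.60  # coefficient of prep exams taken
    intercept = 67.67
    return slope_hours * hours + slope_prep_exams * prep_exams + intercept


# Inputs from the reviewer question: 10 hours of study, 4 prep exams.
score = predict_exam_score(hours=10, prep_exams=4)
print(f"Predicted exam score: {score:.2f}")
```

Each coefficient is the estimated change in exam score for a one-unit change in that predictor while the other predictor is held fixed.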

It is the range of values of the independent variables in the data used to
estimate the model. Experimental region

It is a statistical technique that uses several independent variables to
predict the dependent variable. Multiple linear regression

The value of the residual (error) is correlated across all observations.
False

It is the tool pack in Microsoft Excel that can be downloaded to perform
linear regression. Data Analysis Tool Pack

Two variables are correlated if there is a linear association between them.
If not, the variables are uncorrelated. True

Suppose you have been given a fair coin and you want to find out the odds of
getting heads. Which of the following option is true for such a case? odds
will be 1

It is essentially similar to the simple linear model, with the exception
that multiple independent variables are used in the model. Multiple linear
regression

Compute for the accuracy of the model depicted in the confusion matrix
below;
TP = 20
TN = 27
FP = 18
FN = 25
52%
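
The two numeric answers above (accuracy of 52% and odds of 1 for a fair coin) can be checked with a short calculation. This is a minimal sketch: the helper names `accuracy` and `odds` are assumptions made for illustration, and accuracy is taken as the usual share of correct predictions, (TP + TN) divided by all cases.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


# Counts from the confusion matrix question above.
acc = accuracy(tp=20, tn=27, fp=18, fn=25)
print(f"Accuracy: {acc:.0%}")  # 47 correct out of 90

# A fair coin lands heads with probability 0.5.
print(f"Odds of heads: {odds(0.5)}")
```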

It is a regression in which the dependent or criterion variables are modeled
as a non-linear function of model parameters and one or more independent
variables Multiple Regression

It is considered as the log of odds of the event. (Use lowercase for your
answer) intercept

Supposed your model was able to predict that a certain student fails the
certification exam and she actually is not. False Negative

Shown below is a scatterplot of Y versus X. Which of the following is the
approximate value for R^2? 99.5%

Two variables are correlated if there is a linear association between them.
If not, the variables are uncorrelated. True
[IT0089] MODULE 4 MAIN (TIME SERIES

TIME SERIES
- Time series analysis was introduced by Box and Jenkins (1976) to model and
analyze time series data with autocorrelation.
- A sequence of observations on a variable measured at successive points in
time or over successive periods of time.
o The measurements may be taken every hour, day, week, month, year, or any
other regular interval. The pattern of the data is important for
understanding the series’ past behavior.
o If the behavior of the time series data of the past is expected to
continue in the future, we can use it to guide us in selecting an
appropriate forecasting method.

TIME SERIES DATA
- Time series data consist of data observations over time.

APPLICATIONS OF TIME SERIES
- Predicting stock prices
- Airline fares
- Labor force size
- Unemployment data
- Natural gas price

- The objective of time series analysis is to uncover a pattern in the time
series and then extrapolate the pattern into the future.
- The forecast is based solely on past values of the variable and/or on past
forecast errors
- Modern data-collection technologies have enabled individuals, businesses,
and government agencies to collect vast amounts of data that may be used for
causal forecasting.

1. Trend Pattern
- Gradual shifts or movements to relatively higher or lower values over a
longer period of time
o A trend is usually the result of long-term factors such as population
increases or decreases, shifting demographic characteristics of the
population, improving technology, changes in the competitive landscape,
and/or changes in consumer preferences.
2. Seasonal pattern
- Recurring patterns over successive periods of time
o Example: A manufacturer of swimming pools expects low sales activity in
the fall and winter months, with peak sales in the spring and summer months
to occur every year.
- A time series plot can exhibit a seasonal pattern not only over a one-year
period but also over periods of less than one year.
o Example: daily traffic volume shows within-the-day “seasonal” behavior

M4S1 [OVERVIEW OF TIME SERIES ANALYSIS]
- Time series data is a sequence of observations collected from a process
with equally spaced periods of time.
- It establishes a relation between “cause” and “effect”
- One variable is “Time”, which is considered as the independent variable,
and the second is “Data”, also known as the dependent variable.
Understanding the Time Series Data
- Time series data is expected to have two variables: time and data

1. Daily data on sales
2. Monthly inventory
3. Daily customers
4. Monthly interest rates, cost
5. Monthly unemployment rates
6. Weekly measures of money supply
7. Daily closing prices of stock indexes, and so on

BENEFITS OF TIME SERIES ANALYSIS
Through Time Series Analysis, businessmen can predict changes in the
economy. Furthermore, it could also help in determining the following:
• Stock Market Analysis
• Risk analysis and evaluation
• Census analysis
• Budgetary analysis
• Inventory studies
• Sales forecasting

Secular Trend
- The increase or decrease in the movements of a time series is called
secular trend.
- A time series data may show upward trend or downward trend for a period of
years and this may be due to factors like:
o Increase of population
o Change in technological progress
o Large scale shift in consumers demands

Seasonal variation
- It is a short-term fluctuation in a time series which occurs periodically
in a year.

Examples:
• More woolen clothes are sold in winter than in the season of summer
• Each year more ice creams are sold in summer and very little in Winter
season

Cyclical Variations
- These are recurrent upward or downward movements in a time series but the
period of cycle is greater than a year.

Irregular Variations
- These are fluctuations in the time series that are short in duration,
erratic in nature and follow no regularity in the occurrence

FORECASTING METHODOLOGIES
1. Exponential Smoothing
2. Moving Average

1. Tableau
2. Excel
3. SAS 9.4
4. SAP Analytics

- Computing Forecasts and Measures of Forecast Accuracy using the most
recent Value as the Forecast for the next Period

- Measures to determine how well a particular forecasting method is able to
reproduce the time series data that are already available
- Forecast Error: Difference between the actual and the forecasted values
for period t.
- Mean Forecast Error: Mean or average of the forecast errors.
- Mean Absolute Error (MAE): Measure of forecast accuracy that avoids the
problem of positive and negative forecast errors offsetting one another.
- Mean Squared Error (MSE): Measure that avoids the problem of positive and
negative errors offsetting each other. It is obtained by computing the
average of the squared forecast errors.

- Univariate time series models are models used when the dependent variable
is a single time series.
- Multivariate time series models are used when there are multiple dependent
variables. In addition to depending on their own past values, each series
may depend on past and present values of the other series
- Modeling U.S. gross domestic product, inflation, and unemployment together
as endogenous variables is an example of a multivariate time series model.

1. Reliability - time series uses collected historical data over a period of
time.
2. Seasonal Patterns - since TSA deals with periodic data, it is easy to
predict seasonal pattern.
3. Estimation of trends - the graph of TSA makes it easy to visualize the
increase or decrease in sales, production, etc.
4. Growth - through depicting patterns, TSA helps in measuring the financial
growth.

CONCEPT OF AVERAGING METHODS
- If a time series is generated by a constant process subject to random
error, then mean is a useful statistic and can be used as a forecast for the
next period.
- Averaging methods are suitable for stationary time series data where the
series is in equilibrium around a constant value (the underlying mean) with
a constant variance over time.
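
As a minimal sketch of the ideas above, the following forecasts each period with the mean of all earlier observations and then scores those forecasts with the forecast error, Mean Forecast Error, MAE, and MSE. The sales figures and the function names (`mean_forecasts`, `accuracy_measures`) are hypothetical and chosen only for illustration; the error is taken as actual minus forecast.

```python
def mean_forecasts(series):
    """Forecast period t+1 as the mean of observations 1..t."""
    return [sum(series[:t]) / t for t in range(1, len(series))]


def accuracy_measures(actual, forecast):
    """Return MFE, MAE and MSE for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]  # forecast errors
    n = len(errors)
    return {
        "MFE": sum(errors) / n,                  # signed errors can cancel
        "MAE": sum(abs(e) for e in errors) / n,  # absolute values cannot
        "MSE": sum(e * e for e in errors) / n,   # squares cannot either
    }


# Hypothetical sales for five periods (not taken from the notes).
sales = [17, 21, 19, 23, 18]
forecasts = mean_forecasts(sales)                   # forecasts for periods 2..5
measures = accuracy_measures(sales[1:], forecasts)  # period 1 has no forecast
print(forecasts)
print(measures)
```

Because positive and negative errors offset each other in the Mean Forecast Error, it can be close to zero even when individual forecasts are far off, which is why MAE and MSE are reported alongside it.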


- It is a technique that calculates the overall trend in sales volume from
historical data of the company
- This technique is well known for forecasting short-term trends.

The mean sales for the first five years (2003-2007) is calculated by finding
the mean from the first five
• Compute for the second subset of 5 years
• Get the average of the third subset (2005-2009)
• Continue calculating each five-year average until you reach 2009-2013.
This gives you a series of points that you can plot a chart for moving
averages.

AVERAGING METHODS
The Mean
- Uses the average of all the historical data as the forecast
- When new data becomes available, the forecast for time t+2 is the new mean
including the previously observed data plus this new observation.
- This method is appropriate when there is no noticeable trend or
seasonality.

- The moving average for time period t is the mean of the “k” most recent
observations.
- The constant number k is specified at the outset.
- The smaller the number k, the more weight is given to recent periods
- The greater the number k, the less weight is given to more recent periods.
- A large k is desirable when there are wide, infrequent fluctuations in the
series.
- A small k is most desirable when there are sudden shifts in the level of
series.
- For quarterly data, a four-quarter moving average, MA (4), eliminates or
averages out seasonal effects.


- For monthly data, a 12-month moving average, MA (12), eliminates or
averages out the seasonal effect.
- Equal weights are assigned to each observation used in the average.
- Each new data point is included in the average as it becomes available,
and the oldest data point is discarded.
- A moving average of order k, MA (k), is the mean value of k consecutive
observations.
- K is the number of terms in the moving average. The moving average model
does not handle trend or seasonality very well, although it can do better
than the total mean.
- The weekly sales figures (in millions of dollars) presented in the
following table are used by a major department store to determine the need
for temporary sales personnel.

Graph of Store’s weekly sales

Using Moving Average
• Use a three-week moving average (k=3) for the department store sales to
forecast for the week 24 and 26.
• The forecast error is
Forecasted Value
• The forecast for the week 26 is
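
The sales table and the computed values for this worked example did not survive in these notes, so the sketch below uses hypothetical weekly figures to show the same k = 3 calculation. The function name `moving_average_forecasts` and the data are assumptions made for illustration only.

```python
def moving_average_forecasts(series, k):
    """Forecast each next period as the mean of the k most recent observations.

    Returns forecasts for periods k+1 through n+1, where n = len(series).
    """
    return [sum(series[i - k:i]) / k for i in range(k, len(series) + 1)]


# Hypothetical weekly sales for weeks 1..6 (not the department store data).
weekly_sales = [3, 6, 9, 6, 3, 9]
k = 3
forecasts = moving_average_forecasts(weekly_sales, k)

# forecasts[0] is the forecast for week 4, built from weeks 1 to 3.
week4_error = weekly_sales[3] - forecasts[0]  # actual minus forecast
print(forecasts)
print(week4_error)
```

Each forecast drops the oldest observation and adds the newest one, with all k observations weighted equally, as described in the bullets above.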

Compute for the MSE
100 = abs(20 – 30) ^ 2

Compute for the MSE: -9 85.5 81 9
81 = abs(81 – 90) ^ 2

A trend is usually the result of long-term factors such as population
increases or decreases, shifting demographic characteristics of the
population, improving technology, changes in the competitive landscape,
and/or changes in consumer preferences. True

MSE stands for the mean standard error. False

The reproduction of crops is highly dependent on this component in time
series. Seasonal variations

What does the graph below illustrate? an increasing trend only

Which of the following describes an unpredictable, rare event that appears
in the time series? Irregular

Moving average is suitable for dealing with a time series that has
short-term trends. True

Mean absolute error is the mean or average of the forecast errors. False

Compute for the MAD based on the following values:3
Error   |Error|
5       5
-3      3
4       4

Qualitative methods can be used when past information about the variable
being forecast is available. False

Trend projection is an example of a time series. True

Compute for the forecasted error: 3
Actual   Forecasted
39       42

The formula below describes MAD

Another measure is mean standard error (MSE), which is a measure of the
average size of the prediction errors in squared units. True

It pertains to the gradual shifts or movements to relatively higher or lower
values over a longer period of time. Trend Pattern

It is a model that tests the relationship between a dependent variable and a
single independent variable. Linear Regression

Another measure is mean standard error (MSE), which is a measure of the
average size of the prediction errors in squared units. True

The moving average method uses a weighted average of the observed value.
False

Which of the following describes the overall tendency of a time series?
Trend

The forecast F_{t+1} is based on weighting the most recent observation y_t
with a weight alpha and weighting the most recent forecast F_t with a weight
of 1-alpha. true
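
The statement above describes the exponential smoothing update, F_{t+1} = alpha * y_t + (1 - alpha) * F_t. The following is a minimal sketch of that recursion. The observations are hypothetical, the function name `exponential_smoothing` is an assumption, and starting the recursion with F_1 = y_1 is a common convention that these notes do not state.

```python
def exponential_smoothing(series, alpha):
    """Return forecasts F_2 .. F_{n+1} using F_{t+1} = alpha*y_t + (1-alpha)*F_t.

    Assumption: the recursion starts from F_1 = y_1.
    """
    forecast = series[0]  # F_1, set to the first observation by assumption
    forecasts = []
    for y in series:
        forecast = alpha * y + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


# Hypothetical observations y_1..y_3 with alpha = 0.2.
observations = [10, 20, 15]
smoothed = exponential_smoothing(observations, alpha=0.2)
f2, f3, f4 = smoothed
print(f"F2 = {f2:.2f}, F3 = {f3:.2f}, F4 = {f4:.2f}")
```

A small alpha keeps the forecast close to the previous forecast, while an alpha near 1 makes it follow the most recent observation.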
Its objective is to uncover a pattern in the time series and then
extrapolate the pattern into the future. Time Series Analysis

Supply the missing values given for the attribute salary.
19,275  18,625  18,650  19,525
12000
17500
29000

If a time series is generated by a constant process subject to random error,
then median is a useful statistic. True

The value of α with the smallest RMSE is chosen for use in producing future
forecasts. True

One measure of forecasting accuracy is the mean accuracy deviation (MAD). As
the name suggests, it is a measure of the average size of the prediction
errors.

To estimate the change in Y for a one-unit change in X. false

If elections are held every 6 years in a country, then the existing
government may allow wages to increase the year before the election to make
the people more likely to vote for them. The statement is an example of
cyclical variation. True

Based on the given values below, what is the value of F3 if the value of the
alpha is 0.2?

Trend series is a sequence of observations on a variable measured at
successive points in time or over successive periods of time. True
The classical multiplicative time series model indicates
that the forecast is the product of which of the
following terms?
Each year, more sweaters are sold in winter and
- trend, cyclical, and irregular components very few in the summer season. This is an
example of a ______________.
- Seasonal variations

Which of the following statements is correct regarding moving averages?
- The choice of the number of periods impacts the performance of the moving average forecast

The method of moving averages is used for which of the following purposes?
- smooth the time series

When exponential smoothing is used as a forecasting method, which of the following is used as the forecast for the next period?
- smoothed value in the current time period

Which of the following is a valid weight for exponential smoothing?
- 0.5

Which of the following indicates the purpose for using the least squares method on time series data?
- identifying the trend variations

An autoregressive forecast includes which of the following terms?
- All of the above

Which of the following indicates a guideline for selecting a forecast model?
- All of the above

To assess the adequacy of a forecasting model, which of the following methods is used?
- Mean absolute deviation

Which of the following indicates the calculation of an index number?
- (Pi / Pbase) x 100

How many degrees of freedom are used for the t test to determine the significance of the highest order autoregressive term?
- n - 2p - 1

It is a measure that avoids the problem of positive and negative errors offsetting each other and is obtained by computing the average of the squared forecast errors
- Mean Absolute Error

Which of the following research scenarios would time-series analysis be best for?
- Measuring the time it takes for something to happen based on a given number of variables

Compute for the Mean Absolute Deviation:
- 4

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.
- True

It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time
- Trend Pattern

It is a measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting one another.
- Mean Absolute Error

Compute for the MSE

- 100

Spike is the predictable change in something based

on the season.

- False
It pertains to the difference between the actual and
the forecasted values for period t

- Forecast Error
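
The accuracy measures named in these items can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers are assumptions, not data from the course. The forecast error for period t is taken as actual minus forecast, as defined above.

```python
def forecast_errors(actual, forecast):
    # Forecast error for period t = actual value - forecasted value
    return [a - f for a, f in zip(actual, forecast)]

def mean_absolute_error(actual, forecast):
    # Absolute values stop positive and negative errors from cancelling out
    errors = forecast_errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)

def mean_squared_error(actual, forecast):
    # Squaring also avoids cancellation and penalizes large errors more
    errors = forecast_errors(actual, forecast)
    return sum(e * e for e in errors) / len(errors)

# Hypothetical values, for illustration only
actual = [20, 24, 18, 22]
forecast = [22, 20, 21, 19]

errors = forecast_errors(actual, forecast)   # [-2, 4, -3, 3]
mae = mean_absolute_error(actual, forecast)  # 3.0
mse = mean_squared_error(actual, forecast)   # 9.5
```

Note that the plain sum of the errors here is only 2, even though every forecast is off by at least 2 units, which is why the absolute or squared errors are averaged instead.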

A time series data usually has two variables namely

probability of occurrence of consequent
TRANSACTION ITEMS given the antecedent.
Ong Bread, Lady’s Choice, Eggs
Gonzales Bread, Milk, Butter
Gepilano Bread, Diaper, Coke
Francisco Bread, Diaper, Milk
Tupaz Bread, Milk, Butter
support for the bread => milk is 4/6


LIFT RATIO - shows how effective the rule is in finding consequents (useful if finding particular consequents is important)

CONFIDENCE - shows the rate at which consequents will be found (useful in learning costs of promotion)

SUPPORT - measures overall impact
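
For reference, the three measures can be written as formulas. The notation is an assumption chosen to match the definitions in these notes: D is the transaction database, X is the antecedent, and Y is the consequent. The lift ratio is written in its usual form as the confidence divided by the benchmark confidence, which is the support of the consequent alone.

```latex
\[
\operatorname{support}(X \Rightarrow Y)
  = \frac{\left|\{\, T \in D : X \cup Y \subseteq T \,\}\right|}{|D|}
\]
\[
\operatorname{confidence}(X \Rightarrow Y)
  = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)}
\]
\[
\operatorname{lift}(X \Rightarrow Y)
  = \frac{\operatorname{confidence}(X \Rightarrow Y)}{\operatorname{support}(Y)}
\]
```

A lift ratio above 1 suggests that the antecedent and consequent occur together more often than they would if they were independent.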



- Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item.


- Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.

• Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together
• Telecommunication: (each customer is a transaction containing the set of phone calls)
• Credit Cards/Banking Services: (each card/account is a transaction containing the set of customer’s payments)
• Medical Treatments: (each patient is represented as a transaction containing the ordered set of diseases)
• Basketball-Game Analysis: (each game is represented as a transaction containing the ordered set of ball passes)

ADVANTAGES
- Uses large itemset property
- Easily parallelized
- Easy to implement

DISADVANTAGES
- Assumes transaction database is memory
- Requires many database scans

KEY IDEAS
• I: a set of all the items
• Transaction T: a set of items such that T ⊆ I
• Transaction Database D: a set of transactions
• A transaction T ⊆ I contains a set X ⊆ I of some items, if X ⊆ T
• An Association Rule: is an implication of the form X => Y, where X, Y ⊆ I

• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
• The support s of an itemset X is the percentage of transactions in the transaction database D that contain X
• The support of the rule X => Y in the transaction database D is the support of the itemset X ∪ Y in D.
• The confidence of the rule X => Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D.

VARIABLES IN ASSOCIATION RULE
- a set I of all the items;
- a database D of transactions;
- minimum support s;
- minimum confidence c;
- all association rules X => Y with a minimum support s and confidence c.

• Find all sets of items that have minimum support (frequent itemsets)
• Use the frequent itemsets to generate the desired rules
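
The two-step idea in these notes (first find the frequent itemsets, then generate rules from them) can be sketched with a brute-force search. This is a minimal illustration, not an efficient Apriori implementation: the function names, the tiny transaction list, and the thresholds are all assumptions made for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    # Step 1: keep every itemset whose support meets the minimum
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, k):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
                found_any = True
        if not found_any:
            break  # no frequent k-itemset, so no larger itemset can be frequent
    return frequent

def generate_rules(frequent, min_confidence):
    # Step 2: split each frequent itemset into antecedent => consequent
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), s, confidence))
    return rules

# Hypothetical transactions, for illustration only
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.7)
# rules == [({'butter'}, {'bread'}, 0.5, 1.0)]
```

With these assumed thresholds only one rule survives: butter => bread, with support 0.5 and confidence 1.0. The rule bread => milk has support 0.5 but confidence of only about 0.67, so it is dropped.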
TERMS
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)

For example: Transaction 1 supports several rules, such as
• “If red, then white” (“If a red faceplate is purchased, then so is a white one”)
• “If white, then red”
• “If red and white, then green”
• + several more

{red, white} => {green} with confidence = 2/4 = 50%
• [(support {red, white, green})/(support {red, white})]

{red, green} => {white} with confidence = 2/2 = 100%
• [(support {red, white, green})/(support {red, green})]

Plus 4 more with confidence of 100%, 33%, 29% & 100%

If confidence criterion is 70%, report only rules 2, 3 and 6



1. Write the support formula for the following: A => B
[ (transactions that contain every item in A and B) / (all transactions) ]

2. It is intended to select the “best” subset of

predictors. (Use lowercase for your answer)
ASSOCIATION RULE MINING [variable selection]
3. When it comes to association analysis, the
more rules you produce, the greater the risk
is. [TRUE]
4. Observe the table below and compute for
the confidence of beer -> diaper [NONE
OF THE CHOICES] [1/4 OR 0.25]

Trans A Beer, Peanut, Egg
Trans B Beer, Milk, Peanut, Diaper
Trans C Milk, Diaper, Egg
Trans D Peanut, Egg, Diaper
Trans E Beer, Peanut, Egg
Trans F Egg, Beer, Peanut
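
The confidence asked for in this item can be checked with a short Python sketch. The helper names are assumptions made for illustration; the transactions are copied from the table above. The lift function uses the usual definition (confidence divided by the support of the consequent) and is included because the same table is reused for a lift ratio item.

```python
# Transactions A to F from the table above
transactions = {
    "A": {"Beer", "Peanut", "Egg"},
    "B": {"Beer", "Milk", "Peanut", "Diaper"},
    "C": {"Milk", "Diaper", "Egg"},
    "D": {"Peanut", "Egg", "Diaper"},
    "E": {"Beer", "Peanut", "Egg"},
    "F": {"Egg", "Beer", "Peanut"},
}

def count(items):
    # Number of transactions that contain every item in `items`
    return sum(1 for basket in transactions.values() if items <= basket)

def confidence(antecedent, consequent):
    # Transactions with antecedent AND consequent / transactions with antecedent
    return count(antecedent | consequent) / count(antecedent)

def lift(antecedent, consequent):
    # Confidence divided by the benchmark confidence (support of the consequent)
    benchmark = count(consequent) / len(transactions)
    return confidence(antecedent, consequent) / benchmark

beer_to_diaper = confidence({"Beer"}, {"Diaper"})  # 1 of the 4 beer baskets has a diaper
diaper_to_milk = lift({"Diaper"}, {"Milk"})        # (2/3) / (2/6), about 2
```

Beer appears in four transactions (A, B, E, F) and only Trans B also contains Diaper, which gives the 1/4 recorded in the answer.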
5. Suppose you want to solve a time series
problem where a rapid response to a real
change in the pattern of observations is
desired. Which among the following is the
ideal value for your alpha? [0.8]
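
A small Python sketch can show why a larger alpha responds faster. It assumes the common form of simple exponential smoothing, where the next forecast is alpha times the latest actual value plus (1 - alpha) times the latest forecast, and where the first forecast is set equal to the first observation. The function name and the series are hypothetical.

```python
def exponential_smoothing(series, alpha):
    # forecasts[0] is the forecast for the first period (assumed = first observation);
    # the last element is the forecast for the period after the series ends
    forecasts = [series[0]]
    for y in series:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

# Hypothetical series with a real shift in level from 10 to 20
series = [10, 10, 10, 20, 20, 20]

fast = exponential_smoothing(series, alpha=0.8)
slow = exponential_smoothing(series, alpha=0.2)

next_fast = fast[-1]  # about 19.92, already close to the new level of 20
next_slow = slow[-1]  # about 14.88, still lagging well behind
```

After three periods at the new level, the alpha = 0.8 forecast has almost caught up, while the alpha = 0.2 forecast has covered less than half of the change.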
6. Affinity analysis is a data mining method
that usually consists of two variables: a
transaction and an item [TRUE]
7. Observe the table below and compute for the
lift ratio of
Diaper ->milk landscape, and/or changes in consumer
preferences [TRUE]
Trans A Beer, Peanut, Egg ID
Trans B Beer, Milk, Peanut, Diaper Trans A Beer, Peanut, Egg
Trans C Milk, Diaper, Egg Trans B Beer, Milk, Peanut, Diaper
Trans D Peanut, Egg, Diaper Trans C Milk, Diaper, Egg
Trans E Beer, Peanut, Egg Trans D Peanut, Egg, Diaper
Trans F Egg, Beer, Peanut Trans E Beer, Peanut, Egg
[1.5] [2] Trans F Egg, Beer, Peanut

15. observe the table below and compute for the

TRANSACTION ITEMS support for beer ->peanut [4/7]
16. observe the table below and compute for the
Trans A Beer, Peanut, Egg
Trans B Beer, Milk, Peanut, Diaper support for SD Card => phone case [0.3]
Trans C Milk, Diaper, Egg
Trans D Peanut, Egg, Diaper
Trans E Beer, Peanut, Egg
Trans F Egg, Beer, Peanut

8. observe the table below and compute for the

support for beer ->peanut [4/7]
9. Which of the following is not an application
of a sequential pattern? [IDENTIFYING
10. Which of the following is not an application
of pattern discovery? [NONE OF THE
11. It shows how effective the rule is in finding
consequents [LIFT RATIO]
12. Trend series a sequence of observations on a
variable measured at successive points in
time or over successive periods of time
13. Which of the following is not an advantage
of using association rule? [ASSUMES
14. A trend is usually the result of long-term
factors such as population increases or
decreases, shifting demographic
characteristics of the population, improving
technology, changes in the competitive
Trans D Peanut, Egg, Diaper
Trans E Beer, Peanut, Egg
Trans F Egg, Beer, Peanut
Trans G Beer, Diaper, Peanut

categorical inputs except:
- Decision node
4. SAS Enterprise Guide does not include a full programming interface
- False
5. Child nodes are also called sub-nodes
- True
6. It trains forest predictive models by fitting multiple decision trees
- Forest selection
7. It trains a decision tree predictive model
- Decision tree selection
8. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in the split criterion.
- False
9. It trains a gradient boosting predictive model by fitting a set of additive decision trees
- Gradient boosting selection
10. It is a process of dividing a node into two or more sub-nodes
- Splitting
11. Assessing whether a mortgage application is a good or bad credit risk is an example of classification
- True
12. It is the process of integrating multiple databases, data cubes, or files.
- Data Integration
13. Which Apache product is used for managing real time transactions such as logs and events?
- Apache Kafka
14. Which of the following is not an example of classification?
- Estimating the grade point average (GPA) of a graduate student, based on that student’s undergraduate GPA
15. Factor is a variable being manipulated by researchers
- True
16. Determining whether a will was written by the actual deceased or fraudulently by someone else is an example of classification.
- True
17. It is the scientific domain that’s dedicated to knowledge discovery via data analysis
- Data Science
18. The model can be used to classify or predict the outcome of interest in new cases where the outcome is unknown.
- True
19. Behavioral analytics are also part of the pattern
- True
20. It is a commonly used statistical technique to predict future behavior. (Use lowercase for your answer)

Module 8 - CLUSTER ANALYSIS

Unsupervised Classification
[Figure: inputs are grouped into cluster 1, cluster 2 and cluster 3]
Unsupervised classification: grouping of cases based on similarities in input values.

What is Cluster Analysis?
Naturally occurring groups?
Yes- Cluster Analysis
No- Segmentation

Clustering
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
Everitt (1998), The Cambridge Dictionary of Statistics

Clustering in real life
-While you have thousands of customers, there are really only a handful of major types into which most of your customers can be grouped.
• Bargain hunter
• Man/woman on a mission
• Impulse shopper
• Weary Parent
• DINK (dual income, no kids)
A case study
-The objective is to identify the features, or combination of features, that uniquely describe each cluster.
• small investors
• younger, big investors
• older, big investors

K-means Algorithm - an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.

Steps in performing K-means
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Re-assign cases.

Euclidean Distance
• Euclidean distance gives the linear distance between any two points in n-dimensional space.
• It is a generalization of the Pythagorean theorem.
DE(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
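
The Euclidean distance described in this part of the notes can be written as a short Python function. This is a minimal sketch with an assumed function name. It works for points with any number of dimensions, as long as both points have the same number of coordinates.

```python
import math

def euclidean_distance(p, q):
    # Straight-line distance between two points in n-dimensional space
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean_distance((0, 0), (3, 4))  # 5.0, the 3-4-5 right triangle
```

In two dimensions this is exactly the Pythagorean theorem: the distance is the hypotenuse of a right triangle whose legs are the differences in x and in y.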


Use Clustering to identify fraud

-Most fraudulent customer activity is difficult to identify by a single variable. Are there unusual combinations of behaviors that can help identify criminal activity or fraud?
-A customer who rarely travels starts making purchases in many foreign countries. Fraud alert!
-A customer who has never shopped online before begins to make many online purchases. Fraud alert!

Steps in performing K-means
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.

Clustering for Store Location

-You want to open new grocery stores in the U.S. based on demographics. Where should you locate the following types of new stores?
• Low-end budget grocery stores
• small boutique grocery stores
• large full-service supermarkets
Segmentation Analysis
-When no clusters exist, use the K-means algorithm to partition cases into contiguous groups.

Cluster Profiling
-Cluster profiling can be defined as the derivation of a
class label from a proposed cluster solution.
Example Problem
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2,10) A2(2,5) A3(8,4) A4(5,8) A5(7,5) A6(6,4) A7(1,2) A8(4,9).

Initial cluster centers are: A1(2,10), A4(5,8) and A7(1,2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as p(a,b) = |x2 - x1| + |y2 - y1|

Solution

Iteration 1: Mean 1 = (2, 10), Mean 2 = (5, 8), Mean 3 = (1, 2)

Sample computation for point A1 (2, 10):
P(Point, Mean1) = |x2 - x1| + |y2 - y1|
                = |2 - 2| + |10 - 10|
                = 0 + 0
                = 0
P(Point, Mean2) = |x2 - x1| + |y2 - y1|
                = |5 - 2| + |8 - 10|
                = 3 + 2
                = 5

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)   0             5             9             1
A2 (2, 5)    5             6             4             3
A3 (8, 4)    12            7             9             2
A4 (5, 8)    5             0             10            2
A5 (7, 5)    10            5             9             2
A6 (6, 4)    10            5             7             2
A7 (1, 2)    9             10            0             3
A8 (4, 9)    3             2             10            2

Cluster 1: (2,10)
Cluster 2: (8,4), (5,8), (7,5), (6,4), (4,9)
Cluster 3: (2,5), (1,2)

New cluster centers:
Cluster 1: has 1 point A1(2,10) which was the old mean (remains)
Cluster 2: ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 2: Mean 1 = (2, 10), Mean 2 = (6, 6), Mean 3 = (1.5, 3.5)

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)   0             8             7             1
A2 (2, 5)    5             5             2             3
A3 (8, 4)    12            4             7             2
A4 (5, 8)    5             3             8             2
A5 (7, 5)    10            2             7             2
A6 (6, 4)    10            2             5             2
A7 (1, 2)    9             9             2             3
A8 (4, 9)    3             5             8             1

Cluster 1: (2,10), (4,9)
Cluster 2: (8,4), (5,8), (7,5), (6,4)
Cluster 3: (2,5), (1,2)

New cluster centers:
Cluster 1: points 1 & 8
((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
Cluster 2: points 3, 4, 5 & 6
((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 3: Mean 1 = (3, 9.5), Mean 2 = (6.5, 5.25), Mean 3 = (1.5, 3.5)

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)   1.5           9.25          7             1
A2 (2, 5)    5.5           4.75          2             3
A3 (8, 4)    10.5          2.75          7             2
A4 (5, 8)    3.5           4.25          8             1
A5 (7, 5)    8.5           0.75          7             2
A6 (6, 4)    8.5           1.75          5             2
A7 (1, 2)    9.5           8.75          2             3
A8 (4, 9)    1.5           6.25          8             1

Cluster 1: (2,10), (5,8), (4,9)
Cluster 2: (8,4), (7,5), (6,4)
Cluster 3: (2,5), (1,2)

New cluster centers:
Cluster 1: points 1, 4 & 8
((2 + 5 + 4)/3, (10 + 8 + 9)/3) = (3.67, 9)
Cluster 2: points 3, 5 & 6
((8 + 7 + 6)/3, (4 + 5 + 4)/3) = (7, 4.33)
Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
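
The worked example can be replayed with a short Python sketch. It follows the example's own choices rather than a textbook k-means: points are assigned using the |x2 - x1| + |y2 - y1| distance, and each center is updated to the mean of its members. The function names are assumptions made for illustration.

```python
def manhattan(a, b):
    # Distance function from the example: |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def k_means(points, centers, max_iter=100):
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the index of its closest center
        new_labels = [
            min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Move each center to the mean of the points assigned to it
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers

# Points A1..A8 and the initial centers A1, A4, A7 from the example
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
labels, centers = k_means(points, centers=[(2, 10), (5, 8), (1, 2)])

# labels use 0-based cluster indices: 0 = Cluster 1, 1 = Cluster 2, 2 = Cluster 3
# labels == [0, 2, 1, 0, 1, 1, 2, 0]
```

The run ends with A1, A4 and A8 in Cluster 1, A3, A5 and A6 in Cluster 2, and A2 and A7 in Cluster 3, with centers at roughly (3.67, 9), (7, 4.33) and (1.5, 3.5). One more pass leaves the assignments unchanged, which is the stopping condition used here.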
What is cluster and cluster analysis?

• A cluster is a group of similar objects
• Cluster analysis is a set of data-driven partitioning techniques designed to group objects into clusters, such that the degree of association or similarity is strong between members of the same cluster

What is k-means algorithm?

-K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
-The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

K-means algorithm process:
Step 1: Ask the user how many clusters k the data set should be partitioned into.
Step 2: Randomly assign k records to be the initial cluster center locations.
Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center "owns" a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, ..., Ck.
Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3–5 until convergence or termination.

Application of K-means using SAS Enterprise Miner

[Figure: Enterprise Miner profile of International Plan adopters across clusters]
[Figure: VoiceMail Plan adopters and nonadopters are mutually exclusive]

Clusters of Users
• Cluster 1: Sophisticated Users. A small group of customers who have adopted both the International Plan and the VoiceMail Plan.
• Cluster 2: The Average Majority. The largest segment of the customer base, some of whom have adopted the VoiceMail Plan but none of whom have adopted the International Plan.
• Cluster 3: Voice Mail Users. A medium-sized group of customers who have all adopted the VoiceMail Plan but not the International Plan.
Concept of K-means
-The main idea is to define k centers, one for each cluster. These centers should be placed in a cunning way because different locations cause different results. So, the better choice is to place them as far away from each other as possible.
-The next step is to take each point belonging to a given data set and associate it to the nearest center. When no point is pending, the first step is completed and an early grouping is done.

Advantages
• Fast, robust and easier to understand.
• Relatively efficient.
• Gives best result when data sets are distinct or well separated from each other.

Disadvantages
• The learning algorithm requires a priori specification of the number of cluster centers.
• The use of Exclusive Assignment - If there are two highly overlapping data then k-means will not be able to resolve that there are two clusters.
• The learning algorithm is not invariant to non-linear transformations, i.e. with different representation of data we get different results (data represented in form of cartesian coordinates and polar coordinates will give different results).
• Euclidean distance measures can unequally weight underlying factors.
• The learning algorithm provides the local optima of the squared error function.
• Randomly choosing the cluster centers may not lead to a fruitful result.
• Applicable only when mean is defined, i.e. fails for categorical data.
• Unable to handle noisy data and outliers.
• Algorithm fails for non-linear data set.

FORMATIVES:
M8 16/20

1. Clustering analysis is negatively affected by heteroscedasticity
- False
2. It requires to specify the number of clusters (k) in advance is an advance of k-means
- True
3. Predicting house price based on the size of the house is an example of cluster analysis.
- True
4. It is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group
- Kmeans
5. A cluster is defined as a collection of data points exhibiting certain similarities
- True
6. It can be defined as the derivation of a class label from a proposed cluster solution
- Cluster Profiling
7. Clustering is supervised classification
- False
8. There is a separate “quality” function that measures the “goodness” of a cluster
- True
9. Identify if the statement is a clustering or not: Identifying groups of motor insurance policy holders with a high average claim cost.
- True
10. Cluster analysis is a statistical technique used to identify how various units (like people, groups, or societies) can be grouped together because of the characteristics they have in common
- True
11. The easiest and simplest clustering algorithm that is widely used because of its simple methods of implementation is called k-means algorithm
- True
12. K-means algorithm can be used in forecasting car plant electricity usage.
- False
13. Natural language processing is an example of clustering
- True
14. K-means cannot handle noisy data and outliers
- True
15. Data points belonging to different clusters have low degree of dissimilarity
- True
16. Data points belonging to different clusters have low degree of dissimilarity
