It0089 Finalreviewer
BOXPLOT
- A plot introduced by Tukey as a
quick way to visualize the
distribution of data.
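A boxplot is drawn from the five-number summary of a data set (minimum, first quartile, median, third quartile, maximum). A minimal sketch of computing those values with the standard library, using illustrative data that is not from the reviewer:

```python
# Five-number summary underlying a boxplot: minimum, Q1, median, Q3, maximum.
# Quartiles use the "inclusive" method so they interpolate over the sample itself.
import statistics

data = [1, 2, 3, 4, 5, 6, 7]

q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
summary = (min(data), q1, med, q3, max(data))
print(summary)  # (1, 2.5, 4.0, 5.5, 7)
```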
SKEWNESS
- A measure of how central the average is in the distribution
- The skewness of a sample is a measure of how central the average is in relation to the overall spread of values
- If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal
  o POSITIVELY SKEWED - A positive value indicates that the bulk of the values lies to the left; that is, there is a long “tail” of more positive values
  o NEGATIVELY SKEWED - A negative value indicates that the bulk of the values lies to the right; that is, there is a long “tail” of more negative values
- If a histogram and frequency distribution are skewed to the right, it means that the value of the mean is the largest among the three measures and the mode has the lowest value
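The sample skewness described above can be computed from the second and third central moments (the Fisher-Pearson formula): positive for a long right tail, negative for a long left tail. A sketch with illustrative data, not taken from the reviewer:

```python
# Sample skewness (Fisher-Pearson): m3 / m2^(3/2), where m2 and m3 are the
# second and third central moments. Positive => long tail of more positive values.
def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment (spread)
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment (asymmetry)
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 10]   # long tail of more positive values
print(skewness(right_tailed) > 0)       # True
```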
KURTOSIS
- A measure of how pointy the distribution is

WHAT IS STATISTICS?
- Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data
- Relies upon the calculation of numbers
- Relies upon how the numbers are chosen and how statistics are interpreted

TYPES OF STATISTICS
1. Descriptive Statistics - describing and summarizing data sets using pictures and statistical quantities
2. Inferential Statistics - analyzing data sets and drawing conclusions from them
3. Probability - the study of chance events governed by rules (or laws)

WHAT IS HYPOTHESIS TESTING?
- a statistical procedure for testing whether chance is a plausible explanation of an experimental finding

NULL HYPOTHESIS
- The null hypothesis is usually denoted by H0 and the alternative hypothesis is denoted by H1
- The null hypothesis is a statement that is assumed to be true at the beginning of an analysis
- Suppose you are trying a certain case; initially, the person being questioned is not guilty. This initial verdict is the null hypothesis. The contrary of this claim is the alternative hypothesis.

STUDENT’S T-TEST
- In statistics, Student’s t-test is a method developed by William Sealy Gosset used in testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown
• If the t value < critical value = don’t reject the null hypothesis
• If the t value > critical value = reject the null hypothesis

SIGNIFICANCE LEVELS (ALPHA)
- A significance level, also known as alpha or α, is an evidentiary standard that a researcher sets before the study
- It defines how strongly the sample evidence must contradict the null hypothesis before you can reject the null hypothesis for the entire population

CONFIDENCE INTERVALS
- The table shows the equivalent confidence levels and levels of significance
• If the p value > α = don’t reject the null hypothesis
• If the p value < α = reject the null hypothesis

DEGREES OF FREEDOM
= N1 + N2 – 2
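The pooled two-sample t statistic with df = N1 + N2 − 2 can be sketched as follows. The data values are illustrative; comparing |t| to the critical value from a t-table (at the chosen alpha and df) gives the reject / don't-reject decision described above:

```python
# Pooled two-sample t statistic with df = n1 + n2 - 2.
import math

def pooled_t(a, b):
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variance of a
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)  # sample variance of b
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

t, df = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(t, 6), df)  # -1.0 8
```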
ASSUMPTION IN T-TEST
• independent observations
• normally distributed data for each group
• equal variances for each group

ANOVA
- Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following:
  o a continuous dependent variable or response variable
  o a discrete independent variable, also called a predictor or explanatory variable

ONE-WAY ANOVA
- Use analysis of variance to test for differences between population means.

RESEARCH QUESTIONS FOR ONE-WAY ANOVA
• Do accountants, on average, earn more than teachers?
• Do people treated with one of two new drugs have higher average T-cell counts than people in the control group?
• Do people spend different amounts depending on which type of credit card they have?

ASSUMPTIONS IN ANOVA
• Observations are independent
• Errors are normally distributed
• All groups have equal response variances.

TYPE OF PREDICTORS / TYPE OF RESPONSE
- Continuous response, categorical predictors: Analysis of Variance (ANOVA)
- Continuous response, continuous predictors: Ordinary Least Squares (OLS) Regression
- Continuous response, continuous and categorical predictors: Analysis of Covariance (ANCOVA)
- Categorical response, categorical predictors: Contingency Table Analysis or Logistic Regression
- Categorical response, continuous predictors: Logistic Regression
- Categorical response, continuous and categorical predictors: Logistic Regression

CORRELATION
- Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors, and between predictors and a target variable
- Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y

CORRELATION COEFFICIENT
- A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
- Scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.

ASSUMPTIONS IN CORRELATION
• The correlation coefficient measures the extent to which two variables are associated with one another.
• When high values of v1 go with high values of v2, v1 and v2 are positively associated
• When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated
• The correlation coefficient is a standardized metric so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation)
• 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance

TYPES OF CORRELATION

SUPPLEMENTARY MATERIALS

Subtopic 1

Every data analysis requires data. Data can be in different forms such as images, text, videos, etc., that are usually gathered from different data sources. Organizations store data in different ways, some through data warehouses, traditional RDBMS, or even through the cloud. With the voluminous amount of data that an organization processes each day, the dilemma of how to start data analysis emerges.

How do we start performing an analysis?

First and foremost, know your data.

To understand your organization’s data, there are numerous techniques that can be used. In module 1, the most common techniques will be identified.

What is raw data?

Raw data pertains to the collected data before it’s processed or ranked.

Suppose you are tasked to gather the ages (in years) of 50 students in a university. The table below is an example of quantitative raw data.

Another example is gathering the student status of the same 50 students. Now we will have an example of categorical raw data which is presented in the table below.

The examples above can also be called ungrouped data. An ungrouped data set contains information on each member of a sample or population individually.

Raw data can be summarized through charts, dashboards, tables, and numbers. The most common way to describe raw data is through frequency distributions.

A frequency distribution shows how the frequencies are distributed over various categories. Below is a frequency table that summarizes a survey conducted by Gallup Poll about the Worries About Not Having Enough Money to Pay Normal Monthly Bills.

A frequency distribution of a qualitative variable enumerates all the categories and the number of instances that belong to each category.

Let’s transform the following responses into a frequency table in order to interpret the data better.
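Building a frequency distribution for a qualitative variable can be sketched with the standard library. The responses below are made-up placeholders, not the Gallup survey data:

```python
# Frequency distribution of a categorical variable: each category with the
# number of instances that belong to it.
from collections import Counter

responses = ["yes", "no", "yes", "yes", "no", "maybe", "yes", "no"]

freq = Counter(responses)            # category -> frequency
for category, count in freq.most_common():
    print(category, count)
# yes 4
# no 3
# maybe 1
```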
It is easier to understand through a frequency table, isn’t it?
Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data,
as well as of making decisions based on such analysis.
Since statistics is a broad body of knowledge, it is divided into two areas: descriptive
statistics and inferential statistics.
A variable is a characteristic under study that assumes different values for different elements.
1. Quantitative Variables
It pertains to a variable that can be measured numerically. Data collected on a
quantitative variable are called quantitative data.
Examples:
Income, height, gross sales, price of a home, number of cars owned.
Quantitative variables are divided into two types: discrete variables and continuous
variables.
Population Vs Sample
Aside from using the frequency table, we can also make use of the different graphs that are
commonly used to visually present data.
The first one is a bar graph. A bar graph is made of bars whose heights represent the
frequencies of respective categories. One type of bar graph is called a Pareto chart.
A Pareto chart is a bar graph wherein the bars are arranged based on their heights, in
descending order (largest to smallest).
Another way to present data is through a pie chart. A pie chart is a circle divided into
portions that represent frequencies or percentages of a population.
To graph grouped data, we can use the following methods:
Grouped data can be presented using histograms. Histograms can be drawn for a frequency
distribution. A histogram is a graph in which classes are marked on the horizontal axis and the
frequencies, relative frequencies, or percentages are marked on the vertical axis.
The above histogram shows the percentage distribution of annual car insurance premiums in 50
states. The data used to make this distribution and histogram are based on estimates made by
insure.com
Understanding Frequency Distribution Curve
Knowing the meaning of each curve in a histogram is helpful when interpreting a dataset. A
histogram can be:
1. Symmetric
2. Skewed
3. Uniform rectangular
A symmetric histogram is the type of histogram in which both halves are mirror images of each other.
Measures of Central Tendency
1. Mean
2. Median
3. Mode
Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation
Examples of statistical analysis application in real life:
1. Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to
the airline industry and to help guitarists make beautiful music.
2. Researchers keep children healthy by using statistics to analyze data from the
production of viral vaccines, which ensures consistency and safety.
3. Communication companies use statistics to optimize network resources, improve
service and reduce customer churn by gaining greater insight into subscriber
requirements.
4. Government agencies around the world rely on statistics for a clear understanding
of their countries, their businesses and their people.
Understanding the measures of central tendency:
Measures of central tendency are useful in identifying the middle value of histograms and
frequency distributions. The methods used to calculate the measures of central tendency
determine the typical values that can be found in your data.
What is mean?
The mean, also called the arithmetic mean, is the sum of all the values divided by the
number of values.
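A minimal sketch of that definition, using illustrative values:

```python
# Mean = sum of the values divided by the number of values.
import statistics

values = [2, 4, 6, 8]
mean = sum(values) / len(values)   # (2 + 4 + 6 + 8) / 4
print(mean)  # 5.0
```

The standard-library `statistics.mean(values)` gives the same result.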
What is median?
The median is the middle value, taken when you arrange your numbers in order (rank). This
measure of the average does not depend on the shape of the data.
When calculating the median, one must remember the following:
1. Arrange the data values of the given data set in increasing order. (from smallest to
largest)
2. Find the value that divides the ranked data set in two equal parts.
Example:
16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1
For this example, the middle value is 21.0, therefore it is the median of the data values.
There are instances where the number of data values is even; in this case, the two middle
numbers are added and divided by two.
See the example below:
The following data describes the cell phone minutes used last month by 12 randomly
selected customers:
230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721
Here, we need to arrange the data values first. It will give us:
160 184 201 230 263 326 380 397 510 721 2053 3864
If we observe our data, there is no single central value. To compute the median, we need to
identify the values that divide the data into two equal parts; in our case these are 326 and 380.
We can only have one median per data set, so the median is calculated as:

(326 + 380) / 2 = 353
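The worked example above can be checked with the standard library, which performs exactly this averaging of the two middle values when the count is even:

```python
# Median of the 12 phone-minute values: the average of the two middle
# values of the sorted data, (326 + 380) / 2 = 353.
import statistics

minutes = [230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721]
print(statistics.median(minutes))  # 353.0
```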
What is mode?
Mode is the value that occurs most frequently in the given dataset.
77 82 74 81 79 84 74 78
In this data set, the value 74 appears twice. Therefore, our mode is 74.
When a dataset has only one value that repeats the most, the distribution is called
unimodal. If the data in the distribution has two values that repeat the most, it is called
bimodal. If there are more than two modes in a dataset, it is said to be multimodal.
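The mode example above can be checked directly; `statistics.multimode` also shows whether the data are unimodal, bimodal, or multimodal by listing every mode:

```python
# 74 appears twice in the data set, more often than any other value.
import statistics

scores = [77, 82, 74, 81, 79, 84, 74, 78]
print(statistics.mode(scores))       # 74
print(statistics.multimode(scores))  # [74]  -> one mode, so unimodal
```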
• If a histogram and frequency distribution are skewed to the right, the value of the mean
is the largest among the three and the mode has the lowest value. If the mean is the largest,
it means that the dataset is sensitive to outliers that occur on the right.
• If a histogram and frequency distribution are skewed to the left, the value of the mean
is the smallest and the mode has the largest value. In this scenario, the left tail contains
the outliers.
Measures of Dispersion
If an analyst wants to know how dispersed a dataset is, the methods of calculating the measures
of dispersion can be used. Measures of dispersion help determine how spread out the data
values are.
What is range?
Range is the simplest method to compute when measuring the dispersion of data. The range
can be obtained by subtracting the smallest value in the dataset from the largest value.
Range = Largest value − Smallest value

What is standard deviation?
Standard deviation is the most commonly used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean.
The things one must remember when dealing with standard deviation are the following:
1. A lower value of standard deviation indicates that the values of the data set are spread over
a smaller range around the mean.
2. A larger value of the standard deviation indicates that the values of the data set are
spread over a relatively larger range around the mean.
3. The standard deviation can be obtained by taking the positive square root of the
variance.
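Both measures can be sketched with the standard library. The data values are illustrative, chosen so the population mean is 5 and the population standard deviation is exactly 2:

```python
# Range and standard deviation as described above.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)   # largest value - smallest value
pop_sd = statistics.pstdev(data)      # positive square root of the variance
print(value_range, pop_sd)  # 7 2.0
```

`statistics.stdev` gives the sample (n − 1) version instead of the population one.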
FORMATIVES

17/20
1. A skewed-to-the-right histogram has longer tail on the left side [TRUE]
B 20   C 30   D 35   E 24   [NONE]
- 24
- Sample

FORMATIVES
1. If a histogram and frequency distribution are skewed to the left, the value of the mean is the largest.
- False
2. Which of the following is an example of categorical raw data?
- Collected subjects offered
3. You conduct a survey where the respondents could choose from good, better, best, excellent. What type of variable should contain this type of data?
- Ordinal
4. Student height is a categorical variable.
- True ?
11. If the histogram is skewed right, the mean is greater than the median.
- True
12. Identify whether the statement below is an example of a) positive correlation b) negative correlation c) no correlation: The more one eats, the less hunger one will have.
- Positive correlation – NEGATIVE?
14. Boxplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
16. The alternative hypothesis or research hypothesis Ha represents an alternative claim about the value of the parameter.
- True
17. Identify whether the statement below is an example of a) positive correlation b) negative correlation c) no correlation: As one increases in age, often one’s agility decreases.
- Negative correlation
18. Find the median: 15, 5, 9, 18, 22, 25, 5
- 15
19. Observe the histogram below. Based on it, how many students were greater than or equal to 60 inches tall?

2. It refers to the critical process of performing initial investigations on the data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations
- Exploratory Data Analysis
3. A variable is a characteristic under study that assumes different values for different elements.
- True
4. The data that is collected before being processed is called statistical data.
- False
5. The smallest possible value for the standard deviation is 1.
- False
6. Observation or measurement pertains to a value of a variable.
- True
- True
- False
- True
- False - BINARY
- True
- False - DESCRIPTIVE
SPECIFIC APPLICATION
• Attrition/Churn prediction
• Propensity to buy/avail of a product or service
• Cross-sell or up-sell probability
• Next-best offer
• Time-to-event modeling
• Fraud detection
• Revenue/profit predictions

- Cluster patterns give groups of similar data records such that data records in one group are similar but have larger differences from data records in another group.
- Association patterns are established based on co-occurrences of items in data records.

DATA REDUCTION PATTERNS
- Data reduction patterns look for a small number of variables that can be used to represent a data set with a much larger number of variables.

DATA MINING TECHNIQUES
- CLASSIFICATION
- CLUSTERING
- REGRESSION
- OUTLIER DETECTION
- SEQUENTIAL PATTERNS
- PREDICTION
- ASSOCIATION RULES
CHALLENGES
- Skilled experts are needed to formulate the data mining queries.
- Overfitting: due to a small training database, a model may not fit future states.
- Data mining needs large databases which sometimes are difficult to manage.
- Business practices may need to be modified to determine how to use the information uncovered.
- If the data set is not diverse, data mining results may not be accurate.
- Integrating information needed from heterogeneous databases and global information systems could be complex.

OUTLIER AND ANOMALY PATTERNS
- Outliers and anomalies are data points that differ largely from the norm of the data.

SEQUENTIAL AND TEMPORAL PATTERNS
- Sequential and temporal patterns reveal patterns in a sequence of data points.
- If the sequence is defined by the time over which data points are observed, we call the sequence of data points a time series.

ADVANTAGES OF DATA MINING
- Data mining technique helps companies to get knowledge-based information.
- Data mining helps organizations to make profitable adjustments in operation and production.
- Data mining is a cost-effective and efficient solution compared to other statistical data applications.
- Data mining helps with the decision-making process.
- Facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
- It can be implemented in new systems as well as existing platforms.
- It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.

DISADVANTAGES OF DATA MINING
- There are chances that companies may sell useful information of their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
- Many data mining analytics software is difficult to operate and requires advance training to work on.
- Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
- The data mining techniques are not accurate, and so can cause serious consequences in certain conditions.
INDUSTRIES THAT UTILIZE DATA MINING
Communications
Insurance
Education
Manufacturing
Banking
Retail
Service providers
E-commerce
Supermarkets
Crime
Bioinformatics

SUBTOPIC 2

SUPERVISED LEARNING
- the desired output is known.
- also known as predictive modeling
- uses patterns to predict the values of the label on additional unlabeled data
- used in applications where historical data predicts likely future events

USE OF DATA IN SUPERVISED LEARNING
- We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into partitions:
  o TRAINING DATASET - Consists of the data used to build the candidate models.
  o TEST DATASET - The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model.
- If there is only one dataset, it may be partitioned into a training and test sets.
- The basic assumption is that the training and test sets are produced by independent sampling from an infinite population.

SUPERVISED LEARNING EXAMPLE TECHNIQUES

Classification Tree
- Partition a data set of observations into increasingly smaller and more homogeneous subsets.
- At each iteration, a subset of observations is split into two new subsets based on the values of a single variable.
- Series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity.

Logistic Regression
- Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.

UNSUPERVISED LEARNING
- used against data that has no historical labels
- the goal is to explore the data and find some structure within
- There is no right or wrong answer

UNSUPERVISED LEARNING EXAMPLE TECHNIQUES

Self-organizing maps

Nearest-neighbor mapping
- k-nearest neighbors (k-NN): This method can be used either to classify an outcome category or predict a continuous outcome.
  o k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
  o When k-NN is used as a classification method, a new observation is classified as Class 1 if the percentage of its k nearest neighbors in Class 1 is greater than or equal to a specified cut-off value (e.g. 0.5).
  o When k-NN is used as a prediction method, a new observation’s outcome value is predicted to be the average of the outcome values of its k nearest neighbors.
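The k-NN classification rule described above (Euclidean distance, cut-off on the share of Class 1 neighbors) can be sketched in a few lines. The training points are illustrative:

```python
# Minimal k-NN classifier: classify a new observation as Class 1 when the
# share of its k nearest neighbors in Class 1 meets the cut-off (0.5 here).
import math

def knn_classify(train, new_point, k=3, cutoff=0.5):
    # train: list of ((x, y), label) pairs; similarity = Euclidean distance
    by_distance = sorted(train, key=lambda p: math.dist(p[0], new_point))
    neighbors = [label for _, label in by_distance[:k]]
    share_class1 = neighbors.count(1) / k
    return 1 if share_class1 >= cutoff else 0

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(knn_classify(train, (5.2, 5.1)))  # 1
print(knn_classify(train, (0.3, 0.4)))  # 0
```

For prediction of a continuous outcome, the same neighbors would be averaged instead of voted on.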
K-Nearest Neighbor
- To classify an outcome, the training set is searched for the one that is “most like” it. This is an example of “instance-based” learning. It is “rote learning”, the simplest form of learning.

Clustering
- A definition of clustering could be “the process of organising objects into groups whose members are similar in some way”.

WHAT IS WEKA?
- Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.

Logistic Regression in Weka
• Open diabetes dataset.
• Click Classify Tab
• Choose Classifier: functions>Logistic
• Use Test Options: Use Training Set
• Press Start

Trees in Weka
• Open weather dataset
• Click Classify Tab
• Choose Classifier: trees>J48
• Use Test Options: Use Training Set
• Press Start
• Right-click result from Result List for options
• Choose Visualize tree

FORMATIVES
10. It is defined as heterogeneous data from multiple sources combined in a common source [DATA INTEGRATION]
11. Machine learning is a subset of artificial intelligence [TRUE]
12. Different data mining tools work in different manners due to different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task. [TRUE]
13. Supposed you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach would you use? [UNSUPERVISED LEARNING] – Supervised learning
14. Data mining implies analyzing data patterns in large batches of data using one or more software [TRUE]
15. It uses statistical methods to enable machines to improve with experience [MACHINE LEARNING]
16. In this approach in data mining, only input data will be given [UNSUPERVISED LEARNING]
17. One of the advantages of data mining is that there are chances of companies may sell useful information of their customers to other companies for money [FALSE] - disadvantage
FORMATIVES 18. Data mining cannot be used on
spatial databases [FALSE]
19/20 19. Regression is a famous supervised
1. Ordinal data can also be called as learning technique [TRUE]
ordered factor [TRUE] 20. The goal of unsupervised learning is
2. Which of the following is not to explore the data and find some
included on the analytical life cycle structure within [TRUE]
defined by SAS? [INTEGRATION] 17/20
3. The last phase of the analytical life
cycle defined by SAS implementation
[FALSE] – ask again 1. Reviewing process is under what
4. It is a role that is responsible for
collecting, analyzing, and interpreting phase in CRISP-DM? [BUSINESS
large amount of data [DATA UNDERSTANDING] -
SCIENTIST]
5. Data preparation is the first step in EVALUATION
CRSIP-DM [FALSE] – Business
Understanding 2. Data preparation is the first step in
6. These are values that lie away from CRISP-DM [FALSE] – Business
the bulk of the data [OUTLIERS]
7. It refers to the broad process of Understanding
finding knowledge in data and 3. Business user provides business
emphasizes the “high-level”
application of particular data mining domain expertise based on deep
methods [KDD] understanding of the data [FALSE] –
8. A key objective is to determine if
there is some important business Business Intelligence analyst
issue that has not been sufficiently
considered [TRUE]
9. Identify the third step in CRISP-DM
[DATA PREPARATION]
4. Selecting data mining technique is given, can predict the ouput [TRUE]
under what phase in CRISP-DM? 17. In this approach in data mining, input
[MODELING] variables and ouput variables will be
5. Interpreting mined patterns concludes given [SUPERVISED LEARNING]
the KDD process [TRUE] – FALSE 18. Data mining is a cost-effective and
– Consolidating discovered efficient solution compared to other
knowledge statistical data applications. [TRUE]
6. The last phase of the analytical life 19. Association rule algorithm is an
cycle defined by SAS is example of what approach in data
implementation [FALSE] – ask mining? [UNSUPERVISED
again LEARNING]
7. Depending on the requirements, the 20. It makes the computation of multi-
deployment phase can be as simple as layer neural network feasible [DEEP
generating a report [TRUE] LEARNING]
8. Evaluation is the last phase in
16/20
CRISP-DM [FALSE] - Deployment
1. This is the type of learning which
9. It is the role that ensures the progress
uses patterns to predict the values of
of any project [PROJECT
the label on additional unlabeled data
MANAGER]
[SUPERVISED LEARNING]
10. A special case of categorical with just
2. Unsupervised machine learning finds
two categories [CATEGORICAL] -
all kind of unknown patterns in data
BINARY
[TRUE]
11. Artificial intelligence is a subset of a
3. Data mining focuses on small data
machine learning [FALSE]
sets and databases for analysis
12. Weka is tried and tested open source
[TRUE] -FALSE
machine learning software that can be
4. Which of the following is not
accessed through a graphical user
considered as data mining technique?
interface, standard terminal
[KURTOSIS]
applications or a JAVA API [TRUE]
5. Data mining can be performed on
13. Regression is a famous supervised
web mining data [TRUE]
learning technique [TRUE]
6. Data mining cannot be used on
14. Machine learning is a subset of
spatial databases [FALSE]
artificial intelligence [TRUE]
7. Data mining is the broad science of
15. A supervised learning algorithm
mimicking human abilities [FALSE]
learns from labeled training data,
- AI
helps you to predict outcomes for
8. Association rule algorithm is an
unforeseen data [TRUE]
example of what approach in data
16. Supervised learning goal is to
mining? [SUPERVISED
determine the function so well that
LEARNING] - UNSUPERVISED
when new input data set
9. There is no right or wrong on 20. This role provides the funding when
predictive modeling [FALSE] doing analytical project [PROJECT
10. It is used against data that has no SPONSOR]
historical labels [UNSUPERVISED 16/20
LEARNING]
1. This phase starts with initial data
11. It is defined as heterogeneous data
collection. [DATA
from multiple sources combined in a
UNDERSTANDING]
common source [DATA
2. It is the role that ensures the
INTEGRATION]
progress of any project. [PROJECT
12. It refers to the broad process of
finding knowledge in data and MANAGER]
emphasizes the “high-level” 3. A special case of categorical with
application of particular data mining just two categories. [BINARY]
methods [KDD] 4. Creating target datasets also
13. This role is responsible for creating includes collecting necessary
database environment for analytic information to model or account
projects [DATABASE for noise. [FALSE]
ADMINISTRATOR] 5. It is a role that is responsible for
14. Selecting data mining technique is collecting, analyzing, and
under what phase in CRISP-DM? interpreting large amount of data.
[BUSINESS UNDERSTANDING] - [DATA SCIENTIST]
MODELING 6. This role is responsible for creating
15. It is the practice of science and database environment for analytic
technology that is dedicated to projects. [DATABASE
building and data-handling problems ADMINISTRATOR]
that arise due to high volume of data 7. Ordinal data can also be called as
[DATA SCIENCE] ? ordered factor. [TRUE]
16. CRISP-DM stands for _______
8. Identify the third step in CRISP-
[CROSS-INDUSTRY STANDARD
DM. [DATA PREPARATION]
PROCESS FOR DATA MINING]
9. Which of the following is not
17. Evaluation is the last phase in
included on the analytical life cycle
CRISP-DM [FALSE] -
defined by SAS? [INTEGRATION]
DEPLOYMENT
10. The last phase of the analytical life
18. Which of the following is not
considered as a phase of CRISP-DM? cycle defined by SAS is
[DATA SELECTION] implementation. [FALSE] -
19. These are values that lie away from Deployment
the bulk of the data [OUTLIERS] 11. In this approach in data mining,
input variables and output
variables will be given.
[SUPERVISED LEARNING]
12. Artificial intelligence is a subset of In this phase you’ll also develop and test
machine learning. [FALSE] hypotheses through rapid prototyping in an
13. Supposed you want to train a iterative process.
machine to help you predict how - Explore
long it will take you to drive home
It is a comprehensive data mining
from your workplace. What type of
methodology and process model that
data mining approach should you
provides anyone – from novice to data mining
use? [UNSUPERVISED
experts – with a complete blueprint for
LEARNING] ? conducting a data mining project.
14. Association rule algorithm is an
example of what approach in data - CRISP-DM
mining? [SUPERVISED CRISP-DM stands for
LEARNING] - UNSUPERVISED ____________________________.
15. Data mining cannot be used on
- Cross-industry standard process for
text databases. [FALSE] data mining
16. A data value that is very different
from most of the data. [DATA It is a role that is responsible for collecting,
analyzing, and interpreting large amount of
FRAME] ?
data.
17. Self-organizing maps are example
of supervised learning. [TRUE] - - Data Scientist
FALSE It is the aspect of business analytics that finds
18. Unsupervised machine learning patterns in unstructured data like social
finds all kind of unknown patterns media or survey tools which could uncover
in data. [TRUE] insights about consumer sentiment
19. Suppose your email program
watches which emails you do or do - Data Mining ?
not mark as spam, and based on One of the benefits of data mining is
that learns how to better filter overfitting.
spam. What is the experience E in
this setting? [WATCHING YOU - False
LABELS EMAILS SPAM AS SPAM]
It is the phase of CRISP-DM where analysts
20. Logistic regression is classified as
supervised learning. [TRUE] review the steps executed.
- Evaluation
Ordinal data can also be called an ordered factor.
- True

Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.
- True

Unsupervised machine learning finds all kinds of unknown patterns in data.
- True

All data is unlabeled and the algorithms learn the inherent structure from the input data. This statement pertains to ___________.
- unsupervised learning

Random forest is an example of supervised learning.
- True

Unsupervised methods help you to find features which can be useful for categorization.
- True

Which of the following is not included in the analytical life cycle defined by SAS?
- Integration

These are values that lie away from the bulk of the data.
- Outliers

A key objective is to determine if there is some important business issue that has not been sufficiently considered.
- True

Based on the analytical life cycle defined by SAS, this phase has two types of decisions: operational and strategic.
- Act

Interpreting mined patterns concludes the KDD process.
- True

This is the first step in the KDD process.
- Data Selection

In this phase, you'll search for relationships, trends and patterns to gain a deeper understanding of your data.
- Explore

One of the advantages of data mining is that there are chances that companies may sell useful information of their customers to other companies for money.
- False - DISADVANTAGES
Which of the following is not considered as a data mining technique?
- Kurtosis

CRISP-DM stands for ____________________________.
- Cross-industry standard process for data mining

All data is labeled and the algorithms learn to predict the output from the input data. This statement pertains to ___________. (Use lowercase for your answer)
- supervised

It makes the computation of multi-layer neural networks feasible.
- Deep Learning

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the performance measure P in this setting?
- The number of emails correctly classified as spam/not spam

Self-organizing maps are an example of supervised learning.
- False

Unsupervised methods help you to find features which can be useful for categorization.
- True
In this module, we've discussed the concept of data mining and its application in real life. Data Mining is defined as a process used to extract usable data from a larger set of any raw data. It implies analyzing data patterns in large batches of data using one or more software.

Suppose you want to train a machine to help you predict how long it will take you to drive home from your workplace. What type of data mining approach should you use?
- Supervised learning

Factors to consider:
• Weather conditions
• Time of the day
• Holidays

Example scenario:
You instinctively know that if it's raining outside, then it will take you longer to drive home. But the machine needs data and statistics.
Let's now see how you can develop a supervised learning model of this example which helps the user determine the commute time. The first thing you need to create is a training data set. This training set will contain the total commute time and corresponding factors like weather, time, etc. Based on this training set, your machine might see there's a direct relationship between the amount of rain and the time you will take to get home.
So, it ascertains that the more it rains, the longer you will be driving to get back to your home. It might also see the connection between the time you leave work and the time you'll be on the road.
The closer you are to 6 p.m., the longer it takes for you to get home. Your machine may find some of these relationships with your labeled data.
This is the start of your Data Model. It begins to model how rain impacts the way people drive. It also starts to see that more people travel during a particular time of day.
In both regression and classification, the goal is to find specific relationships or structure in the input data that allow us to effectively produce correct output data.
Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabeled data.
Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning. However, unsupervised learning can be more unpredictable compared with other methods such as deep learning and reinforcement learning.
Sample scenario:
Let's take the case of a baby and her family dog.
She knows and identifies this dog. A few weeks later a family friend brings along a dog and tries to play with the baby.
The baby has not seen this dog before, but she recognizes many of its features (2 ears, eyes, walking on 4 legs) as being like her pet dog. She identifies the new animal as a dog. This is unsupervised learning, where you are not taught but you learn from the data (in this case, data about a dog). Had this been supervised learning, the family friend would have told the baby that it's a dog.
To further understand the differences between the two methods, observe the comparison below.

Supervised Learning:
• Tasks: Classification, Regression
• Example algorithms: Linear regression, Support vector machine, Neural network, etc.
• Example applications: recognition, forecasting, financial modeling

Unsupervised Learning:
• Tasks: Clustering, Association
• Example algorithms: k-means, etc.
• Example applications: Big data processing, eco-systems
• A technique of studying the dependence of one variable (called the dependent variable) on one or more other variables (called explanatory variables), with a view to estimating or predicting the average value of the dependent variable in terms of the known or fixed values of the independent variables.
When do you use regression?
• Estimate the relationship that exists, on the average, between the dependent variable and the explanatory variable
• Determine the effect of each of the explanatory variables on the dependent variable, controlling for the effects of all other explanatory variables
• Predict the value of the dependent variable for a given value of the explanatory variable
Understanding the regression model based on the concept of slope
• The mathematics of slope is similar to the regression model.
y = mx + b
• When using the slope-intercept formula, we focus on the two constants (numbers) m and b.
• m describes the slope or steepness of the line, whereas
• b represents the y-intercept or the point where the graph crosses the y-axis.
Regression Model
• The situation using the regression model is analogous
to that of the interviewers, except instead of using
interviewers, predictions are made by performing a
linear transformation of the predictor variable.
• The prediction takes the form
y = ax + b
• where a and b are parameters in the regression model
Parameters
• Dependent variable or response: Variable being predicted
• Independent variables or predictor variables: Variables being used to predict the value of the dependent
variable.
• Simple regression: A regression analysis involving one independent variable and one dependent
variable.
• In statistical notation:
y = dependent variable
x = independent variable
Types of regression
• Overfitting
It pertains to a situation where the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting on the training set.
• Extrapolation
It refers to estimates and predictions of the target variable made using the regression equation with values of the predictor variable outside of the range of the values of x in the data set.
• Missing Values
Missing data has the potential to adversely affect a regression analysis by reducing the total usable sample size.
𝑦 = 𝑎 + 𝑏𝑋 + 𝜖
Where:
• y – dependent variable
• X – independent (explanatory) variable
• a – intercept
• b – slope
• 𝜖 – residual (error)
Linear Model Assumptions
1. The dependent and independent variables show a linear relationship between the slope and the intercept.
2. The independent variable is not random.
3. The mean of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations.
5. The value of the residual (error) is not correlated across observations.
6. The residual (error) values follow the normal distribution.
Example Problem:
If you want to know the strength of relationship between House Price and Square feet, you can use
regression.
Step 1:
Identify the dependent and independent variables.
Step 2:
Run regression analysis on the data using any system that offers statistical analysis. For this example, we
can use Microsoft Excel with the help of Data Analysis Tool pack.
Note: The Data Analysis ToolPak must be enabled manually in Excel through Options.
Step 3:
Analyze the results. Take note of the values for coefficients.
Step 4:
Substitute the values of the coefficients to the formula mentioned previously.
• The least squares method is used to develop the estimated multiple regression equation:
• Uses sample data to provide the values of b0 , b1 , b2 , . . . , bq that minimize the sum of
squared residuals.
Nonlinear Regression
• Nonlinear regression is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables.
• Nonlinear regression can be modeled with several forms of equations, such as polynomial or exponential functions.
Regression Analysis
• Regression is a technique used for forecasting, time series modeling and finding the causal effect between the variables.
Why use regression?
1) Prediction of a target variable (forecasting).
2) Modeling the relationships between the dependent variable and the explanatory variable.
3) Testing hypotheses.
M3S2
(Linear Regression)
Linear Regression
• The whole process of linear regression is based on the fact that there exists a relation between the
independent variables and dependent variable.
Simple Linear Regression
• Regression Model: The equation that describes how y is related to x and an error term.
• Simple Linear Regression Model:
y = β0 + β1 x + ε
• Parameters: The characteristics of the population, β0 and β1
• Random variable - Error term, ε
• The error term accounts for the variability in y that cannot be explained by the linear relationship
between x and y.
• Regression equation: The equation that describes how the expected value of y, denoted E(y), is related
to x.
• Regression equation for simple linear regression: E(y|x) = β0 + β1x
• E(y|x) = expected value of y for a given value of x
• β0 = y-intercept of the regression line
• β1 = slope
• The graph of the simple linear regression equation is a straight line.
Least Square Method
• Least squares method: A procedure for using sample data to find the estimated regression equation.
• Here, we will determine the values of b0 and b1 .
• Interpretation of b0 and b1 :
• The slope b1 is the estimated change in the mean of the dependent variable y that is associated
with a one unit increase in the independent variable x.
• The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0.
• Least squares method equation:
• We are finding the regression line that minimizes the sum of squared errors: min Σ(yᵢ − ŷᵢ)²
• Least squares estimates of the regression parameters:
• Slope: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
• y-intercept: b₀ = ȳ − b₁x̄
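The least squares estimates can be computed directly from the two formulas above. A minimal Python sketch, using made-up sample data (the notes use Excel's Data Analysis ToolPak for this step):

```python
# Least squares estimates for simple linear regression.
# xs, ys are made-up illustration data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.2, 7.9, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# slope: b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
# y-intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar
```

For this data the estimates come out to b1 ≈ 1.98 and b0 ≈ 0.14, so the fitted line is ŷ ≈ 0.14 + 1.98x.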
Interpretations
• Slopes for Witness and Stress are positive, but slope for Social Support is negative
• If you had subjects with identical stress and social support, a one unit increase in Witness would
produce .038 unit increase in internalizing symptoms
• If Witness = 20, Stress = 5 and SocSupport 35, then we would predict that the internalizing symptoms
would be .012.
M3S3
(Logistic Regression)
Logistic Regression
- It is a statistical technique used to develop predictive models with categorical dependent
variables having dichotomous or binary outcomes.
- Similar to linear regression, the logistic regression models the relationship between the
dependent variable and one or more independent variables.
Graph of Logistic Regression
To predict the probability of the event happening, the logistic equation can be solved for p:
p = 1 / (1 + e^−(a + bX))
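Converting the linear predictor (the log odds) into a probability is a one-line computation. A small sketch with made-up coefficient values a and b (not from the notes):

```python
import math

# Hypothetical logistic regression coefficients (illustration only)
a, b = -1.5, 0.8          # intercept and slope on the log-odds scale
x = 2.0                   # value of the independent variable

log_odds = a + b * x                 # logit(p) = a + b*x
p = 1 / (1 + math.exp(-log_odds))    # inverse logit gives the probability
odds = p / (1 - p)                   # equivalently, odds = exp(log_odds)
```

Here log_odds = 0.1, so p ≈ 0.525: a positive log-odds value always maps to a probability above 0.5.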
Maximum-Likelihood estimation
- It is a method of estimating the parameters of a statistical model with given data.
- The method of maximum likelihood selects the set of values of the model parameters that
maximizes the likelihood function, that is, it maximizes the “agreement” of the selected model
with the observed data
Building Logistic Regression Model
- You can perform logistic regression using R by using the glm() function.
- The family = "binomial" command tells R to use the glm function to fit a logistic regression
model. (The glm() function can fit other models too; we'll look into this later.)
Interpreting Results
• A positive estimate indicates that, for every unit increase of the respective independent variable, there is a corresponding increase in the log odds, and the reverse for a negative estimate.
• Along with the independent variables, we also see 'Intercept'. Intercept is the log odds of the event (Good or Bad Quality) when all the categorical predictors have a value of 0.
• We can see the standard error, z value, and p-value along with an asterisk indication to easily identify significance.
• We then determine whether the estimate is truly far away from 0. If the standard error of the estimate
is small, then relatively small values of the estimate can reject the null hypothesis.
• If the standard error is large, then the estimate should also be large enough to reject the null hypothesis.
Testing the Significance
• To test the significance, we use the 'Wald Z Statistic' to measure how many standard deviations the
estimate is away from 0.
• The significance of the estimate can be determined if the probability of the event happening by
chance is less than 5%.
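The Wald z check described above is just the estimate divided by its standard error, compared against the two-sided 5% cutoff of about 1.96. A sketch with made-up numbers (not from the notes):

```python
# Wald z statistic: how many standard errors the estimate is from 0.
estimate, std_error = 0.9, 0.3   # hypothetical coefficient and its SE

z = estimate / std_error          # Wald z statistic
significant = abs(z) > 1.96       # significant at the two-sided 5% level
```

With these numbers z = 3.0, which exceeds 1.96, so the coefficient would be judged significant at the 5% level.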
Two ways to validate model accuracy
• Confusion Matrix
• ROC curve
- Interpreting the ROC curve is again straightforward. The ROC curve visually helps us understand how our model compares with a random prediction.
- A random prediction will always have a 50% chance of predicting correctly; by comparing our model against it, we can understand how much better our model is.
- The diagonal line indicates the accuracy of random predictions and the lift from the diagonal line
towards the left upper corner indicates how much improvement our model has in comparison to
the random predictions.
- Models having a higher lift from the diagonals are considered to be more accurate models.
Formative assessment 3
26/30
1. It is computed as the ratio of the two odds. [Odds ratio]
2. It pertains to a situation where the accuracy of the provisional model is not as high on the test set as it is on the training set. [OVERFITTING]
3. Suppose you want to train a machine to help you predict how long it will take you to drive home from your workplace using regression. What type of data mining approach should you use? [Supervised Learning]
4. There should not be any multicollinearity between the independent variables in the model, and
all independent variables should be independent to each other. [TRUE]
5. In logistic regression, it is that the target variable must be discrete and mostly binary or
dichotomous. [TRUE]
6. The explanatory variable is the variable being predicted. [FALSE]
7. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear]
8. In regression, the value of the residual (error) is one. [FALSE]
9. Multiple linear regressions are classified as supervised learning. [TRUE]
10. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [Origin of the coffee]
11. It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression. [Data
Analysis Tool Pack]
12. It is the range of values of the independent variables in the data used to estimate the model.
[Experimental region]
13. It is essentially similar to the simple linear model, with the exception that multiple independent
variables are used in the model. [Multiple Regression]
14. The graph of the simple multiple regression equation is a straight line. [TRUE]
15. Researcher question: Do fourth graders tend to be taller than third graders?
This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the explanatory variable on this study? [grade level]
16. It is used to explain the variability in the response variable. [Error term]
17. Given the results for multiple linear regression below. Predict the exam score if a student spent 10 hours in studying and had 4 prep exams taken. [67.67 + 5.56(10) - 0.60(4) = 120.87]
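The recorded answer can be verified by substituting into the estimated equation from the notes, exam score = 67.67 + 5.56·hours − 0.60·prep_exams (Python used here just as a calculator):

```python
# Predict exam score from the fitted multiple regression in the notes
hours, prep_exams = 10, 4
exam_score = 67.67 + 5.56 * hours - 0.60 * prep_exams
print(round(exam_score, 2))  # 120.87
```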
18. A researcher believes that the origin of the beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee from three different regions: Africa, South America,
and Mexico. What is the explanatory variable in this study? [origin of the coffee]
19. It is a statistical technique that uses several independent variables to predict the dependent
variable. [Multiple linear regression]
20. Researcher question: Do fourth graders tend to be taller than third graders?
This is an observational study. The researcher wants to use grade level to explain differences in
height. What is the response variable on this study? [height]
21. Logistic regression is used to predict continuous target variable. [FALSE]
22. Given the results for multiple linear regression below. Identify the estimated regression
equation.
[exam score= 5.58*hours+ prep_exams*-0.60+67.67]
23. y = a + bX_1 + cX_2 + dX_3 + 𝜖
The formula above is used to model nonlinear regression. [FALSE]
24. When it is predicted as TRUE and is actually FALSE. [FALSE POSITIVE]
25. Multiple linear regression is classified as unsupervised learning. [FALSE]
26. It is a regression in which the dependent or criterion variables are modeled as a non-linear
function of model parameters and one or more independent variables. [Nonlinear regression]
27. Which of the following evaluation metrics can’t be applied in the case of logistic regression
output to compare with the target? [mean squared error]
28. Compute for the accuracy of the model depicted in the confusion matrix below;
TP= 20
TN= 27
FP=18
FN=25
[52%]
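The accuracy in item 28 follows directly from the confusion matrix values given: (TP + TN) divided by all predictions. A quick check in Python:

```python
# Accuracy from the confusion matrix given in the notes
TP, TN, FP, FN = 20, 27, 18, 25
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 47 / 90
print(f"{accuracy:.0%}")  # 52%
```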
29. Nonlinear is the extension of linear regression. [FALSE] - multiple
30. Provide one (1) way to validate the accuracy of the logit model [ROC CURVE]
28/30 Formative 3
1. It is used to develop the estimated multiple regression. [Least squares method]
2. It is used to measure the binary or dichotomous classifier performance visually and Area Under
Curve (AUC) is used to quantify the model performance [ROC Curve]
3. If a predictor variable X is found to be highly significant, we could conclude that: [changes in X
are associated to changes in Y]
4. Which of the following is not a reason when to use regression? [When you aim to know the
products that are usually bought together]
5. A group of middle school students wants to know if they can use height to predict age, what is
the response variable in this study? [Age]
6. It can be utilized to assess the strength of the relationship between variables and for modeling the
future relationship between them. [Regression Analysis]
7. Two variables are correlated if there is a linear association between them. If not, the variables are
uncorrelated. [TRUE]
8. In linear models, the dependent and independent variables show a linear relationship between the
slope and the intercept. [TRUE]
9. What type of relationship is shown in the graph below?
[Positive linear relationship]
10. It is a point that lies far away from the rest. [OUTLIERS]
11. What type of relationship is shown in the graph below?
[Linear regression]
13. Logistic regression is classified as one of the example of unsupervised learning [false]
14. Logistic regression is used to predict continuous target variable [FALSE]
15. In the equation:
- Odds ratio

52. In the equation Y = a + bX + 𝜖:
- Explanatory variable

The lower is the AUC value, the worse is the model predictive accuracy.
- True

Given the results for multiple linear regression below. Identify the estimated regression equation.
- exam score = 5.56*hours + prep_exams*-0.60 + 67.67

Which of the following is not a reason when to use regression?
- When you aim to know the products that are usually bought together.

There should not be any multicollinearity between the independent variables in the model, and all independent variables should be independent of each other.
- True

A group of middle school students wants to know if they can use height to predict age. The explanatory variable is height.
- True

Multiple linear regression is classified as unsupervised learning.
- False

It is a model that tests the relationship between a dependent variable and a single independent variable.
- Linear Regression

Logistic regression is classified as supervised learning.
- True

When it is predicted as TRUE and is actually TRUE.
- True Positive

If a predictor variable x is found to be highly significant we would conclude that:
- changes in x are associated to changes in y

If a predictor variable x is found to be highly significant we would conclude that: a change in y causes a change in x.
- False (changes in x are associated to changes in y)

It is the variable being predicted.
- Dependent variable

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous.
- True

The higher is the AUC value, the better is the model predictive accuracy.
- True

The explanatory variable is the variable being predicted.
- False

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression.
- Data Analysis Tool Pack

A researcher believes that the origin of the beans used to make a cup of coffee affects hyperactivity. He wants to compare coffee from three different regions: Africa, South America, and Mexico. What is the response variable on this study?
- Hyperactivity level

When it is predicted as FALSE and is actually FALSE.
- True Negative

The graph of the simple linear regression equation is a straight line.
- True

Logistic regression measures the relationship between the ____________ dependent variable and one or more independent variables. (Use lowercase for your answer)
- categorical
8. If p value for model fit is less than 0.5, then it signifies that our full model fits significantly better than our reduced model.
• False

14. There should not be any multicollinearity between the independent variables in the model, and all independent variables should be independent of each other.
• False

Missing data has the potential to adversely affect a regression analysis by reducing the total usable sample size. True

In regression, the value of the residual (error) is one. False

The value of the residual (error) is not correlated across all observations. True

The value of the residual (error) is correlated across all observations. False

It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. Multiple Regression

Factor is the variable being predicted. False

It is a model that tests the relationship between a dependent variable and a single independent variable. Linear Regression

What type of relationship is shown in the graph below? Negative Linear Relationship

Given the results for multiple linear regression below. Predict the exam score if a student spent hours studying and had 2 prep exams taken.

Suppose you want to train a machine to help you predict how long it will take you to drive home from your workplace using regression. What type of data mining approach should you use? Supervised Learning

Multilinear regression is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables. False

Logistic regression is used to predict continuous target variable. True

Response variable is the variable being manipulated by researcher. False

In linear regression, it is that the target variable must be discrete and mostly binary or dichotomous. True

What type of relationship is shown in the graph above? No relationship

y = a + bX_1 + cX_2 + dX_3 + ϵ
The formula above is used to model nonlinear regression. False

Logistic regression is classified as supervised learning. True

When it is predicted as TRUE and is actually TRUE. True Positive

Two variables are correlated if there is a linear association between them. If not, the variables are uncorrelated. True

Multiple linear regression is classified as unsupervised learning. False

You predicted negative and it's false. False Negative

In ROC Curve, models having a higher lift from the diagonals are considered to be more accurate models. True

Multiple linear regression is classified as supervised learning. True

It is the variable of primary interest. Parameter

A group of middle school students wants to know if they can use height to predict age. What is the response variable in this study? Age

It is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables. Nonlinear Regression

Slope is the point that lies far away from the rest. True

Nonlinear is the extension of linear regression. False

The graph for a nonlinear relationship is often a straight line. True

How will you express the equation of a regression analysis where you aim to predict the value of y based on x. Y = a + bX

It is used to explain the variability in the response variable. Parameter

In logistic regression, it is that the target variable must be discrete and mostly binary or dichotomous. True

The lower is the AUC value, the worse is the model predictive accuracy. True

Research question: Do fourth graders tend to be taller than the third graders?

Regression is a technique used for forecasting, time series modeling, and finding the causal effect between the variables. True

There should not be any multicollinearity between the independent variables in the model, and all independent variables should be independent of each other. True

It is the range of values of the independent variables in the data used to estimate the model. Experimental region

Given the results for multiple linear regression below. Predict the exam score if a student spent 10 hours in studying and had 4 prep exams taken.
exam score = 5.56*hours + prep_exams*-0.60 + 67.67

What is considered as the parameter/s? a, b

Logistic regression is classified as one of the examples of unsupervised learning. False

If a predictor variable x is found to be highly significant we would conclude that: a change in y causes a change in x. False

Regression is a famous supervised learning technique. True

It is a statistical technique that uses several independent variables to predict the dependent variable. Multiple linear regression

It is the tool pack in Microsoft Excel that can be downloaded to perform linear regression. Data Analysis Tool Pack

Suppose you have been given a fair coin and you want to find out the odds of getting heads. Which of the following options is true for such a case? Odds will be 1

Compute for the accuracy of the model depicted in the confusion matrix below:
TP = 20, TN = 27, FP = 18, FN = 25
52%
FORECAST ACCURACY
- Measures to determine how well a particular forecasting method is able to reproduce the time series data that are already available
- Forecast Error: Difference between the actual and the forecasted values for period t
- Mean Forecast Error: Mean or average of the forecast errors

UNIVARIATE TIME SERIES MODELS
- Univariate time series models are models used when the dependent variable is a single time series.

MULTIVARIATE TIME SERIES MODELS
- Used when there are multiple dependent variables. In addition to depending on their own past values, each series may depend on past and present values of the other series.
- Modeling U.S. gross domestic product, inflation, and unemployment together as endogenous variables is an example of a multivariate time series model.
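The forecast-accuracy measures defined above (forecast error, mean forecast error, plus the MAD and MSE used in the formative items) can be sketched in a few lines of Python, here with made-up actual/forecast values since the notes' worked table was lost:

```python
# Forecast-accuracy measures over a small made-up series
actual   = [20, 25, 22, 30]
forecast = [22, 24, 25, 27]

errors = [a - f for a, f in zip(actual, forecast)]    # forecast errors
mfe = sum(errors) / len(errors)                       # mean forecast error
mad = sum(abs(e) for e in errors) / len(errors)       # mean absolute deviation
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error
```

For these numbers the errors are -2, 1, -3, 3, giving MFE = -0.25, MAD = 2.25, and MSE = 5.75.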
FORMATIVES 81
What does the graph below illustrate?
- an increasing trend only

Which of the following describes an unpredictable, rare event that appears in the time series?
- Irregular variations

When exponential smoothing is used as a forecasting method, which of the following is used as the forecast for the next period?
- smoothed value in the current time period

Which of the following research scenarios would time-series analysis be best for?
- Measuring the time it takes for something to happen based on a given number of variables

Which of the following is a valid weight for exponential smoothing?
- 0.5

Which of the following indicates the purpose for using the least squares method on time series data?

Compute for the Mean Absolute Deviation:
- 100

It pertains to the difference between the actual and the forecasted values for period t
- Forecast Error
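Exponential smoothing, asked about above, forms each new forecast as a weighted blend of the latest observation and the previous forecast. A minimal sketch with made-up data and the weight α = 0.5 from the quiz item:

```python
# Simple exponential smoothing: F(t+1) = alpha*y(t) + (1 - alpha)*F(t)
alpha = 0.5                      # smoothing weight, 0 < alpha < 1
series = [20.0, 22.0, 19.0, 24.0]

forecast = series[0]             # common initialization: F(1) = y(1)
for y in series:
    forecast = alpha * y + (1 - alpha) * forecast
# 'forecast' now holds the forecast for the next (unobserved) period
```

With this data the successive forecasts are 20, 21, 20, 22, so the forecast for the next period is 22.0.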
- 53
- False
- True (False)
- 6
It is the variable being manipulated by researchers.
- Explanatory variable

Given the following values for age, what is the problem with the data?
- False
- True
- True
- True
The sales of the mini electric fans that Louise sells vary every season. This is an example of a seasonal effect on a time series.
- True

A time series data usually has two variables, namely transaction and item.
- False
MSE stands for the mean standard error. The classical multiplicative time series model
indicates that the forecast is the product of which of
- False the following terms?
It pertains to the gradual shifts or movements to
- trend, cyclical, and irregular components
relatively higher or lower values over a longer period
of time. The MAD for the following values is 63.67.
- Trend Pattern
- -7.69
- Irregular variations
- Irregular variations
an increasing trend only The choice of the number of periods impacts the
performance of the moving average forecast.
- True
The seasonal component represents periodic
fluctuations that recur within the business cycle. It is the variable being manipulated by researchers.
- True
- False
- False
- False
- Irregular variations
- False
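Several items above ask for the forecast error, the MAD, and the MSE, and note that with exponential smoothing the smoothed value in the current period serves as the next period's forecast. A minimal sketch tying these together (the demand series and alpha = 0.5 are made up for illustration; only the formulas come from the material):

```python
# Forecast error e_t = actual_t - forecast_t. MAD averages |e_t|,
# MSE averages e_t squared. Simple exponential smoothing:
#   F(t+1) = alpha * y(t) + (1 - alpha) * F(t),
# i.e. the current smoothed value is next period's forecast.
actual = [20, 24, 22, 26, 25]   # hypothetical demand series
alpha = 0.5                     # a valid weight: between 0 and 1

forecast = [actual[0]]          # seed the first forecast with y1
for t in range(1, len(actual)):
    forecast.append(alpha * actual[t - 1] + (1 - alpha) * forecast[t - 1])

errors = [a - f for a, f in zip(actual, forecast)]
mad = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)
print(forecast)        # [20, 20.0, 22.0, 22.0, 24.0]
print(mad, mse)        # 1.8 6.6
```

A larger alpha (e.g. the 0.8 in the formative below) reacts faster to a real change in the pattern, which is why the alpha with the smallest RMSE is chosen for future forecasts.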
The value of α with the smallest RMSE is chosen for use in producing future forecasts. - True
The forecast Ft+1 is based on weighting the most recent observation yt with a weight alpha and the most recent forecast Ft with a weight (1 − alpha).

[IT0089] MODULE 5 (MAIN) OVERVIEW OF ASSOCIATION ANALYSIS

WHAT ARE PATTERNS?
- Patterns are sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
- Patterns represent intrinsic and important properties of datasets

WHEN DO YOU USE PATTERN DISCOVERY?
▪ What products were often purchased together?
▪ What are the subsequent purchases after buying an iPad?
▪ What code segments likely contain copy-and-paste bugs?
▪ What word sequences likely form phrases in this corpus?

CONCEPT OF MARKET BASKET ANALYSIS
- Market basket analysis is like an imaginary basket used by retailers to check the combination of two or more items that the customers are likely to buy
- “Two-thirds of what we buy in the supermarket we had no intention of buying,” - Paco Underhill, author of Why We Buy: The Science of Shopping

ASSOCIATION RULE IS THE FOUNDATION OF SEVERAL RECOMMENDER SYSTEMS

Possible actions when a rule such as Barbie => candy is found:
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase
7. Do not advertise candy and Barbie together
8. Offer candies in the shape of a Barbie doll

PROCESS OF RULE SELECTION
Generate all rules that meet specified support & confidence:
• Find frequent item sets (those with sufficient support – see above)
• From these item sets, generate rules with sufficient confidence

RULE INTERPRETATION
Lift Ratio - shows how effective the rule is in finding consequents (useful if finding particular consequents is important)
Confidence - shows the rate at which consequents will be found (useful in learning costs of promotion)
Support

SUPPORT
A => B

TRANSACTION ID   ITEMS
Trans A   Beer, Peanut, Egg
Trans B   Beer, Milk, Peanut, Diaper
Trans C   Milk, Diaper, Egg
Trans D   Peanut, Egg, Diaper
Trans E   Beer, Peanut, Egg
Trans F   Egg, Beer, Peanut
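The table above can be turned into a worked example for the three rule metrics. A minimal pure-Python sketch; the rule Diaper => Milk is used because its lift is asked for in the formatives that follow:

```python
# Support, confidence, and lift for a rule A => B, computed on the
# six-transaction table above (Trans A through Trans F).
transactions = [
    {"Beer", "Peanut", "Egg"},             # Trans A
    {"Beer", "Milk", "Peanut", "Diaper"},  # Trans B
    {"Milk", "Diaper", "Egg"},             # Trans C
    {"Peanut", "Egg", "Diaper"},           # Trans D
    {"Beer", "Peanut", "Egg"},             # Trans E
    {"Egg", "Beer", "Peanut"},             # Trans F
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence adjusted for how common the consequent already is."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Diaper", "Milk"}))       # 2 of 6 baskets -> 0.333...
print(confidence({"Diaper"}, {"Milk"}))  # 2 of the 3 Diaper baskets
print(lift({"Diaper"}, {"Milk"}))        # 2.0
```

With the seven-transaction version of the table (Trans G adds Beer, Diaper, Peanut), the same `support` function reproduces the 0.43 recorded for diaper => peanut (3 of 7 baskets).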
5. Supposed you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha? [0.8]
6. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item [TRUE]
7. Observe the table below and compute for the lift ratio of Diaper -> Milk [1.5] [2]
… landscape, and/or changes in consumer preferences [TRUE]

TRANSACTION ID   ITEMS
Trans A   Beer, Peanut, Egg
Trans B   Beer, Milk, Peanut, Diaper
Trans C   Milk, Diaper, Egg
Trans D   Peanut, Egg, Diaper
Trans E   Beer, Peanut, Egg
Trans F   Egg, Beer, Peanut

12. It shows how effective the rule is in finding components (consequents) [LIFT RATIO]
13. Observe the table below and compute for the support for diaper -> peanut [0.43]

TRANSACTION ID   ITEMS
Trans A   Beer, Peanut, Egg
Trans B   Beer, Milk, Peanut, Diaper
Trans C   Milk, Diaper, Egg
Trans D   Peanut, Egg, Diaper
Trans E   Beer, Peanut, Egg
Trans F   Egg, Beer, Peanut
Trans G   Beer, Diaper, Peanut

14. Observe the table below and compute for the support for airpods => charger [0.4]
15. Observe the table below and compute for the support for Fries => Burger [0.6]
A => B
Answer: 0.6
9. It is the conditional probability of occurrence of consequent given the antecedent. [Confidence]
10. Input validation helps to lessen what type of anomaly? [Insertion anomaly]
11. Observe the table below and compute for the confidence of Phone -> SD card
Answer: 0.1
12. Lift ratio shows how effective the rule is in finding consequents. [True]
13. Observe the table below and compute for the lift ratio of ___ [0.50]

--------------------------------------------------------------------------
15/20
1. Market Basket Analysis creates if-Then scenario rules. The IF part is called the _______ (use lower case for your answer) [ANTECEDENT]
2. It is the conditional probability of occurrence of consequent given the antecedent [CONFIDENCE]
3. It controls for the support (frequency) of consequent while calculating the conditional probability of occurrence [Y] given [X]. [LIFT RATIO]
4. Observe the table below and compute for the confidence of diaper->egg. Answer: 0.5
4. Observe the table below and compute for the support for diaper->peanut. [0.43]
5. Observe the table below and compute for milk->egg. [0.14]
6. Observe the table below and compute for lift ratio diaper->milk. [1.5]
7. Observe the table below and compute for the confidence of beer->diaper. [0.40]
8. Observe the table below and compute for the confidence of egg->peanut. [0.8]
Unmatched answers: [0.33], [0.67]
13. Which of the following is not an application of a sequential pattern? [IDENTIFYING FAKE NEWS]
14. Lift ratio shows how effective the rule is in finding consequents [TRUE]
15. Market Basket Analysis creates if-Then scenario rules. The THEN part is called the _______ (use lower case for your answer) [CONSEQUENT]
17. Observe the table below and compute for the lift ratio of powerbank->airpods.
18. It pertains to how likely item Y is purchased when item X is purchased, expressed as [X->Y] [confidence]
19. Which of the following is not an application of pattern discovery? [None of the choices]
20. Observe the table below and compute for the support for Phone case->SD card.

Transaction ID   Items
3   airpods, phone case, charger
4   phone case, SD Card
5   SD Card, charger, airpods
6   SD Card, phonecase, powerbank
7   Powerbank, phonecase, SD Card
Transaction E   Beer, Peanut, Egg

16. Market Basket Analysis creates If-Then scenario rules. True
17. It shows how effective the rule is in finding consequents. Lift ratio
18. It is another type of association analysis that involves using sequence data. Association rule
19. Observe the table below and compute for the support for diaper⇒peanut
• MISSING VALUES
- Having null values in your data set could affect the accuracy of the model.
• OUTLIERS
- When your data has outliers, it could affect the distribution of your data.
• INCONSISTENT DATA
• IMPROPERLY FORMATTED DATA
• LIMITED FEATURES
• THE NEED FOR TECHNIQUES SUCH AS FEATURE ENGINEERING

Ways to Preprocess Data

DATA CLEANING
- Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning

DATA INTEGRATION AND TRANSFORMATION
- There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.

Variables Selection
Observe minimizing garbage in, garbage out (GIGO)
Procedures
1. Backward-selection
2. Forward-selection

[M6-ST1] DATA ANOMALIES

What are anomalies?
- Anomalies are problems that can occur in poorly planned, unnormalized databases where all the data is stored in one table (a flat-file database).
- Anomalies are caused when there is too much redundancy in the database's information

Database Anomalies
• Insertion Anomaly – happens when inserting vital data into the database is not possible because other data is not already there.
• Update Anomalies – happen when the person charged with the task of keeping all the records current and accurate is asked, for example, to change an employee’s title due to a promotion.
- If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employee. The end user has no way of knowing which is the correct title.
• Deletion Anomalies – happen when the deletion of unwanted information causes desired information to be deleted as well.
- For example, if a single database record contains information about a particular product along with information about a salesperson for the company and the salesperson quits, then information about the product is deleted along with salesperson information.

4. Replace the missing values with imputed values based on the other characteristics of the record.

Applications
1) Replacing missing field values with user-defined constants.
2) Replacing missing field values with means or modes
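The mean/mode replacement named above can be sketched in a few lines. The numeric scores mirror the exam_scores formative item later in this reviewer (the missing value is imputed as 88.2), and the city list mirrors the Caloocan/Makati item; both pairings are illustrative:

```python
from statistics import mode  # most frequent value, for categorical fields

def impute_mean(values):
    """Fill None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Fill None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

exam_scores = [100, 89, None, 90, 75, 87]
print(impute_mean(exam_scores))  # the None becomes 88.2

cities = ["Caloocan", "Makati", "Caloocan", "Caloocan", "Makati", None]
print(impute_mode(cities))       # the None becomes "Caloocan"
```

Mode (not mean) is the sensible fill for categorical variables, which is why the formatives mark "mode for numerical variables" as false.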
Forward Selection
STEPS OF BACKWARD-SELECTION
BACKWARD SELECTION
EXAMPLE
STEPWISE PROCEDURE
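The headings above name the selection procedures covered in this module; the original step-by-step figures did not survive extraction. A minimal greedy forward-selection sketch under stated assumptions (the dataset is made up, and each candidate is scored on its own univariate R² rather than by refitting the full model at every step, which a real implementation would do):

```python
# Forward selection starts with an empty feature set and repeatedly
# adds the candidate that best explains the target, here scored by
# squared Pearson correlation (univariate R^2).
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def forward_select(features, target, k):
    """Return the names of k features, added greedily one at a time."""
    chosen = []
    remaining = dict(features)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda f: pearson(remaining[f], target) ** 2)
        chosen.append(best)
        del remaining[best]
    return chosen

# Hypothetical data: x1 tracks y closely, x2 is noise, x3 is anti-correlated.
y = [1, 2, 3, 4, 5]
features = {
    "x1": [1.1, 2.0, 2.9, 4.2, 5.1],
    "x2": [3, 1, 4, 1, 5],
    "x3": [5, 4, 4, 2, 1],
}
print(forward_select(features, y, k=2))  # ['x1', 'x3']
```

Backward selection is the mirror image: perform the regression on the full model first, then repeatedly drop the weakest predictor.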
Caloocan
• True
You can also use regression when handling noisy data. True
Data preparation affects: - Quality of the research
Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18
- Data Inconsistency
The following can be done to treat unsatisfactory response except: Returning to the field
A good rule of thumb in having a right amount of data is to have 10 records for every predictor value. True
Given are the following records for the attribute rating. What is the problem with the data?
application_rating: 1
- Data Inconsistency
Anomaly detection is also known as outlier analysis. True
• False
Enumerate at least one of the two (2) types of variables transformation commonly used in machine learning
Supply the missing value in the given data below.
Caloocan, Makati, Caloocan, Caloocan, Makati
• Quezon
• None of the choices
Data preparation affects:
• The objectives of the research
• The quality of the research
These anomalies have values that significantly deviate from the other data points that exist in the same context. • Contextual outliers
Given the following values for age, what is the problem with the data?
Age: 16, 27, -8990, 19, 15, 18
Answer: Data Inconsistency
When there’s a missing value for a categorical variable, it is ideal to supply it by computing for the average of the data values available. • True
Outlier analysis can provide good product quality. • True
A review of the questionnaires is essential in order to:
• Select the data analysis strategy
• Find new insights
• Increase the quality of the data
• Increase accuracy and precision of the collected data
Anomaly detection can cause a bad user experience. • False
You can use histogram to detect outliers. • True
Feature selection maps the original feature space to a new feature space with lower dimensions by combining the original feature space.
True
False
When a subset of points within a set is anomalous to the entire data set, those values are: Collective outliers
These anomalies have values that significantly deviate from the other data points that exist in the same context. • Contextual outliers
This happens when inserting vital data into the database is not possible because other data is not already there. • Insertion anomaly
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database). • Anomalies
Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. • True
You can also use regression when handling noisy data. True

16/20
Given the following values for age, what is the problem with the data?
Supply for the missing values. - 19.6
Young = 12 – 17
Adult = 18 – 34
Old = 35 – 60
What kind of data preparation was practiced? Data Cleaning
It is a manipulation of scale values to ensure comparability with variables with other scales: Scale transformation
Supply the missing value in the given data below. - 88.2
A homogenous data set is a data set whose data records have the same target value. True
This happens when the deletion of unwanted information causes desired information to be deleted as well. Deletion anomaly
It is the process of integrating multiple databases, data cubes, or files. data integration
These are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table (a flat-file database). Anomalies
You can also use regression when handling noisy data. True
The procedure starts with an empty set of features [reduced set]. Forward Selection
It is the simplest of all variable selection procedures and can be easily implemented without special software (Use lowercase for your answer) Backward Selection
The forward selection procedure starts with no variables in the model. True
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True
The figure below illustrates the first step in doing backward selection. False (no picture 😊)
It is intended to select the “best” subset of predictors. (Use lowercase for your answer) Variables Selection
It is a best practice to divide your dataset into train and test dataset. True
Enumerate at least one of the two (2) types of variables transformation commonly used in machine learning: (Use lowercase for your answer)
categorical variables / numerical variables
Forward selection is the opposite of stepwise selection. False
The figure below illustrates the basic steps for what type of variable selection method? Backward
Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True

17/20
1. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
2. It is a manipulation of scale values to ensure comparability with variables with other scales: [SCALE TRANSFORMATION]
3. It is the process of integrating multiple databases, data cubes, or files [DATA INTEGRATION]
4. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employees. This is an example of what type of data anomaly? [UPDATE ANOMALY]
5. Input validation helps to lessen the deletion anomaly [FALSE]
6. Given the following values for age, what is the problem with the data? Age: 16, 27, -8990, 19, 15, 18 [DATA INCONSISTENCY]
7. Mode is used when catering missing values for numerical variables [FALSE]
8. A homogenous data set is a data set whose data records have the same target value [TRUE]
9. Post coding Process is necessary for: [STRUCTURED QUESTIONS]
10. Given the following records for the attribute rating. What is the problem with the data? Application_rating: 1, 2, A, B, C, 3 [DATA INCONSISTENCY]
11. It identifies the set of input variables that jointly explains the maximum amount of data variance. The target variable is not considered with this method. [UNSUPERVISED SELECTION]
12. Clustering aims to discover certain features that often appear together in data [FALSE]
13. Backward selection starts with no variables [FALSE]
14. Forward selection is the simplest variable selection model [FALSE]
15. It fits and performs variable selection on an ordinary least square regression predictive model. [LINEAR REGRESSION SELECTION]
16. It identifies the set of input variables that jointly explain the maximum amount of variance contained in the target [UNSUPERVISED SELECTION]
17. The simplest of all variable selection procedures is stepwise procedure [FALSE]
18. It is intended to select the “best” subset of predictors (use lowercase for your answer) [variable selection]
19. Forward selection is the opposite of stepwise selection [FALSE]
20. It is the simplest of all variable selection procedures and can be easily implemented without special software (use lowercase for your answer) [backward selection]

20/20
1. You can also use regression when handling noisy data [TRUE]
2. When a subset of data points within a set is anomalous to the entire dataset, those values are: [COLLECTIVE OUTLIERS]
3. The following can be done to treat unsatisfactory response except: [ASSIGNING MISSING VALUES]
4. A homogenous data set is a data set whose data records have the same target value [TRUE]
5. Post coding process is necessary for: [STRUCTURED QUESTIONS]
6. Anomaly detection is also known as outlier analysis [TRUE]
7. Supply the missing value in the given data below. Exam_scores: 100, 89, __, 90, 75, 87 [88.2]
8. When Stephen tried to change the section of all students enrolled to his class however, upon performing the query, only one data record was modified instead of all the records. What data anomaly was present in Stephen’s database? [UPDATE ANOMALY]
9. In this category, individual values aren’t anomalous globally or contextually [COLLECTIVE OUTLIERS]
10. It is used when there is a single measurement of each element in the sample: [INTERDEPENDENCE TECHNIQUES]
11. It fits and performs variable selection on an ordinary least square regression predictive model [LINEAR REGRESSION SELECTION]
12. Data preparation affects: [THE QUALITY OF THE RESEARCH]
13. It performs a greedy search to find the best performing feature subset. It iteratively creates models and determines the best or the worst performing feature at each iteration [RECURSIVE FEATURE ELIMINATION]
14. The first step in stepwise procedure is to select the predictor most highly correlated with the target [FALSE]
15. It involves both running the analysis to create unique clusters or segments and evaluating or describing the clusters that are created in the analysis. [CLUSTER ANALYSIS]
16. Give the first step for backward selection [Perform the regression on the full model]
17. A good rule of thumb in having a right amount of data is to have 10 records for every predictor value [TRUE]
18. Forward selection is the simplest variable selection model [FALSE]
19. The simplest of all variable selection procedures is stepwise procedure. [FALSE]
20. Clustering aims to discover certain features that often appear together in data [FALSE]

From the 18/20 quiz:
10. Identify at least one of the two principal reasons for eliminating a variable: (use lowercase for your answer) [redundancy or irrelevancy]
11. Variable clustering is about grouping the attributes with similarities [TRUE]
18/20
1. If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employee. This is an example of what type of data anomaly? [UPDATE ANOMALY]
2. It is a best practice to divide your dataset into train and test dataset. [TRUE]
3. You can use histogram to detect outliers [TRUE]
4. Anomaly detection is also known as outlier analysis [TRUE]
5. The following can be done to treat unsatisfactory response except: [RETURNING TO THE FIELD]
6. Given the following values for age, what is the problem with the data? Age: 16, 27, -8990, 19, 15, 18 [DATA INCONSISTENCY]
7. Histogram is used to see missing data [FALSE]
8. Clustering can also detect outliers [TRUE]
9. These outliers exist far outside the entirety of a data set [GLOBAL OUTLIERS]
12. Supply the missing values given for the attribute salary: 16000, 12000, 17500, 29000, __ [18,625]
13. It is the process of transforming the existing features into a lower-dimensional space, typically generating new features that are composites of the existing features. [FEATURE EXTRACTION]
14. The figure below illustrates the basic steps for what type of variable selection method? [FORWARD SELECTION]
15. Forward selection is the simplest variable selection model [FALSE]
16. These are variables that significantly influence Y and so should be in the model but are excluded [OMITTED VARIABLES]
17. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. [TRUE]
18. The first step in stepwise procedure is to select the predictor most highly correlated with the target. [FALSE]
19. Prior to variable selection, one must identify outliers and influential points – maybe exclude them at least temporarily [TRUE]
20. The procedure starts with an empty set of features [reduced set]. [FORWARD SELECTION]
WMA
AGD
AGD
16/20
1. Given the following values for age, what is the problem with the data? Age: 16, 27, -8990, 19, 15, 18
4. The following are techniques to treat missing values except: [RETURNING TO THE FIELD]
5. Supply for the missing values. [19.6]
WMA
AGD
AGD
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric. True
The figure below illustrates the first step in doing backward selection. False
These outliers exist far outside the entirety of a data set. Global outliers
These are also known as point anomalies. Global outliers
Forward selection is the opposite of stepwise selection. False
Anomaly detection can cause a bad user experience. True
Anomaly detection is also known as outlier analysis. True
When Stephen tried to change the section of all students enrolled to his class however, upon performing the query, only one data record was modified instead of all the records. What data anomaly was present in Stephen’s database? Update anomaly
It is a best practice to divide your dataset into train and test dataset. True
It fits and performs variable selection on an ordinary least square regression predictive model. Linear regression selection
A review of the questionnaires is essential in order to: Increase accuracy and precision of the collected data
It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. (Use lowercase for your answer) variable clustering
These are variables that significantly influence Y and so should be in the model but are excluded. Omitted variables
The simplest of all variable selection procedures is stepwise procedure. FALSE
Backward selection starts with all the variables. True
The first step in stepwise procedure is to select the predictor most highly correlated with the target. False
It is about estimating the value for the target variable except that the target variable is categorical rather than numeric. Estimation
It performs a greedy search to find the best performing feature subset. It iteratively creates models and determines the best or the worst performing feature at each iteration. Recursive feature elimination
Prior to variable selection, one must identify outliers and influential points - maybe exclude them at least temporarily. True
Variable clustering is about grouping the attributes with similarities. True
When a subset of data points within a set is anomalous to the entire dataset, those values are: Collective outliers
Resampling refers to the process of sampling at random and with replacement from a data set. True
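Several items above equate anomaly detection with outlier analysis, and the recurring age column (16, 27, -8990, 19, 15, 18) makes a compact test case for a global outlier. A minimal z-score sketch (the 2.0 threshold is an arbitrary but common choice):

```python
# Flag global outliers: values whose distance from the mean exceeds
# `threshold` standard deviations. -8990 in the age column is the
# inconsistent entry the formatives keep pointing at.
def zscore_outliers(values, threshold=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

ages = [16, 27, -8990, 19, 15, 18]
print(zscore_outliers(ages))  # [-8990]
```

A histogram works for the same purpose visually: the -8990 bar sits far outside the entirety of the data set, which is what makes it a global (point) anomaly.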
[IT0089] MODULE 7 – CLASSIFICATION
ALGORITHM
SUPERVISED CLASSIFICATION
GENERALIZATION
SEPARATE SAMPLING
CLASSIFICATION
MODULE 7 – SUBTOPIC 1
BENEFITS
1. Demand forecasting
2. Workforce planning and churn analysis
3. Forecasting of external factors
4. Analysis of competitors
5. Fleet or equipment maintenance
6. Modeling credit or other financial risks
2. Tableau Public
- A free software that connects any data source, be it corporate Data Warehouse, Microsoft Excel or web-based data, and creates data visualizations, maps, dashboards etc. with real-time updates presented on the web.
3. Python
- It is an object-oriented scripting language which is easy to read, write, maintain and is a free open source tool.
4. SAS
- A programming environment and language for data manipulation and a leader in analytics. SAS is easily accessible, manageable and can analyze data from any source.
5. RapidMiner
- One of the best predictive analysis systems developed. It provides an integrated environment for deep learning, text mining, machine learning & predictive analysis.
6. Orange
- It is a perfect software suite for machine learning & data mining.
- It best aids the data visualization and is a component-based software.
- It has been written in the Python computing language.
7. Weka
- It is best suited for data analysis and predictive modeling
- It contains algorithms and visualization tools that support machine learning.

APPLICATIONS
- Target Marketing
- Attrition Prediction
- Credit Scoring
- Fraud Detection

SOME CHALLENGES
- Operational/Observational
- Massive
- Errors and Outliers
- Missing Values

- Transaction data
- CRM data
- Customer service data
- Survey or polling data
- Digital marketing and advertising data
- Economic data
- Demographic data
- Machine-generated data (i.e. telemetric data or data from sensors)
- Geographical data
- Web traffic data
MODULE 7 – SUBTOPIC 2
DECISION TREE
- a graphical representation of possible outcomes of a decision based on certain conditions.
- It’s called a decision tree because it starts
with a single box (or root), which then
branches off into a number of solutions, just
like a tree.
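The root-and-branches picture above can be made concrete: at each node, a tree-growing algorithm picks the split that leaves the purest branches. A minimal sketch with a made-up weather data set and the Gini index as the purity measure (one of several criteria used in practice):

```python
# A decision tree grows by choosing the split whose child nodes are
# purest. This scores candidate splits with the Gini index on a tiny
# hypothetical training set (outlook -> play?).
records = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("overcast", "yes"),
]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(records, value):
    """Weighted impurity after splitting on outlook == value."""
    left = [label for outlook, label in records if outlook == value]
    right = [label for outlook, label in records if outlook != value]
    n = len(records)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# 'sunny' isolates the two 'no' records, so impurity drops to zero.
print(split_gini(records, "sunny"))     # 0.0
print(split_gini(records, "overcast"))  # higher, so a worse split
```

Note the target here is discrete ("yes"/"no"), matching requirement 3 below: decision tree analysis needs clearly demarcated target classes, not a continuous target.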
KEY TERMS
Decision Node
Leaf/Terminal Node
Pruning
Branch/Sub-Tree
- A sub section of the decision tree is called a branch or sub-tree.
Parent and Child Node
- A node which is divided into subnodes is called a parent node of the sub-nodes, whereas the subnodes are the children of the parent node.

VISUALIZING DECISION TREE

REQUIREMENTS OF DECISION TREE
1. Decision tree algorithms represent supervised learning, and as such require preclassified target variables. A training data set must be supplied, which provides the algorithm with the values of the target variable.
2. This training data set should be rich and varied, providing the algorithm with a healthy cross section of the types of records for which classification may be needed in the future.
3. The target attribute classes must be discrete. That is, one cannot apply decision tree analysis to a continuous target variable. Rather, the target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging.

ADVANTAGES
- Are simple to understand and interpret
- Have value even with complex data
- Can be combined with other decision techniques

FORMATIVES:

M7 16/20
1. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in the split criterion. - False
2. A decision tree can be large to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record. - True
3. A sub section of the decision tree - Sub-tree
KEY IDEAS
- Decision trees are drawn from top-to-bottom or left-to-right
- Top (or left-most) node is called root node
- Descendant node(s) are called child node(s)
- Bottom (or right-most) node is called leaf node
- Unique path from root to each leaf is called a rule

TYPES OF DECISION TREES
1. Binary Trees – only two choices in each split. It can be non-uniform (uneven) in depth
2. N-way trees – three or more choices in at least one of its splits

DECISION TREE ALGORITHMS
• Hunt’s Algorithm
• CART
• ID3
• C4.5
• SLIQ
• SPRINT
• CHAID

4. Which of the following does not define a leaf node? - Also called as internal node
5. A node, which is divided into sub-nodes is? - Parent node
6. The topmost node in a decision tree is known as the internal node. - False
7. Decision tree can work on continuous target variables - True
8. It is also called a terminal node - Leaf Node
9. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record. - True
10. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth. - Binary trees
11. It is also called a terminal node - Leaf node
12. It is an unsupervised learning technique as it describes how the data is organized without using an outcome. - Clustering
13. An efficient model was defined as provide at least 1. - Decision tree
14. Affinity analysis is a data mining method that usually consists of two variables: a transaction and an item - True
15. An objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future. - False
16. It is also called as terminal node - Leaf node
17. A heterogenous data set is a data set whose data records have the same target value. - True
18. A decision tree can be large to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record. - True
19. Segments profile is the tool available in SAS Enterprise Miner to execute sequence analysis - False
20. A heterogenous data set is a data set whose data records have the same target value. - True

M7 18/20
1. Leaf node is also known as internal node - False
2. A node, which is divided into sub-nodes is ________ - Parent node
3. The encircled part shown in the picture below illustrates the: - Decision node
4. SAS Enterprise Guide does not include a full programming interface - False
5. Child nodes are also called as sub-nodes - True
6. It trains forest predictive models by fitting multiple decision trees - Forest selection
7. It trains a decision tree predictive model - Decision tree selection
8. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in the split criterion. - False
9. It trains a gradient boosting predictive model by fitting a set of additive decision trees - Gradient boosting selection
10. It is a process of dividing a node into two or more sub-nodes - Splitting
11. Assessing whether a mortgage application is a good or bad credit risk is an example of classification - True
12. It is the process of integrating multiple databases, data cubes, or files. - Data Integration
13. Which Apache product is used for managing real time transactions such as logs and events? - Apache Kafka
14. Which of the following is not an example of classification? - Estimating the grade point average (GPA) of a graduate student, based on that student’s undergraduate GPA
15. Factor is the variable being manipulated by researchers - True
16. Determining whether a will was written by the actual deceased or fraudulently by someone else is an example of classification. - True
17. It is the scientific domain that’s dedicated to knowledge discovery via data analysis - Data Science
18. The model can be used to classify or predict the outcome of interest in new cases where the outcome is unknown. - True
19. Behavioral analytics are also part of the pattern discovery - True
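Several quiz items above describe classifying a data record by passing it through the tree using the record's attribute values. A minimal traversal sketch (the hard-coded credit-risk tree is hypothetical, echoing the mortgage item):

```python
# Classify a record by walking it down a decision tree: test one
# attribute at each decision node, follow the matching branch, and
# stop at a leaf (terminal node). Nodes with a "label" are leaves.
tree = {
    "attr": "income",
    "branches": {
        "high": {"label": "good risk"},
        "low": {
            "attr": "savings",
            "branches": {
                "high": {"label": "good risk"},
                "low": {"label": "bad risk"},
            },
        },
    },
}

def classify(node, record):
    """Follow branches until a leaf node is reached; return its label."""
    while "label" not in node:
        node = node["branches"][record[node["attr"]]]
    return node["label"]

print(classify(tree, {"income": "low", "savings": "high"}))  # good risk
print(classify(tree, {"income": "low", "savings": "low"}))   # bad risk
```

The path taken for each record (root, then one branch per decision node, ending at a leaf) is exactly the "unique path from root to leaf" that the key ideas above call a rule.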
M7 13/20 14. A sub section of the decision tree
- Sub-nodes
1. It is a type of decision tree with only two 15. The objective of clustering is to uncover a
choices in each split. It can be non-uniform pattern in the time series and then
(uneven) in depth. extrapolated the pattern into the future
- Binary trees - True
2. The decision tree can classify a data record 16. SAS Enterprise Guide does not include a
by passing the data record through the full programming interface
decision tree using the attribute values in the - False
data record. 17. An efficient model was defined as provide at
- True least 1.
3. If a data set has a numeric attribute variable, - Weka
the variable needs to be transformed into a 18. It is the scientific domain that’s dedicated to
categorical variable before being used to knowledge discovery via data analysis
construct a decision tree. - Data Science
- True 19. A heterogenous data set is a data set whose
4. It is a type of decision tree with only two data records have the same target values
choices in each split. It can be non-uniform - True
(uneven) in depth. 20. The following can be done when dealing
- Binary trees with categorical inputs except:
5. Sub-nodes are the sub sections of the - Collapse the categories based on the
decision tree reduction in the chi-square test of
- True association between the categorical input
6. The decision tree can classify a data record and the target.
by passing the data record through the
decision tree using the attribute values in the
data record. Other:
- True
7. Decision tree can work on continuous target 1. The encircled part shown in the picture below
variables illustrates the:
- False
8. It trains a gradient boosting predictive model
by fitting a set of addictive decision trees
- Gradient boosting selection
9. It is a process of dividing a node into two or
more sub-nodes
- Splitting
10. It is also called a terminal node
- Leaf node
11. Factor is the variable being manipulated by
researchers • decision node
- False 2. It is a type of decision tree with only two
12. It is a commonly used statistical technique choices in each split. It can be non-
to predict future behavior uniform(uneven) in depth.
- Predictive modeling • Binary trees
13. Movie Recommendation system are an 3. It trains a gradient boosting predictive model by
example of: fitting a set of additive decision trees.
1. Classification • Gradient boosting selection
2. Clustering 4. Decision node represents the entire population
3. Reinforcement learning or sample and this further gets divided into two
4. Regression or more homogeneous sets
- 2 and 3
• True
5. It is the scientific domain that’s dedicated to knowledge discovery via data analysis.
• Data Science
6. It aims at dividing the data set into groups.
• Clustering
7. Predicting the percentage increase in traffic deaths next year if the speed limit is increased is an example of clustering.
• False
8. Provide an example of predictive models: 1. Logistic regression 2. Clustering 3. Decision Tree 4. Random forest 5. K-nearest neighbor 6. XGBoost
9. Behavioral analytics are also part of the pattern discovery.
• True
10. The objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future.
• False
11. It is the process of integrating multiple databases, data cubes, or files.
• Data Integration
12. A decision tree can be large enough to have as many leaf nodes as data records in the training data set, with each leaf node containing each data record.
• True
13. It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
• Multiple linear regression
14. Root node is a graphical representation of possible outcomes to a decision based on certain conditions.
• False
15. Give at least one decision tree algorithm. (Use UPPERCASE for your answer)
Hunt’s Algorithm
CART
ID3
C4.5
SLIQ
SPRINT
CHAID
16. Pruning is a process of dividing a node into two or more sub-nodes.
• False
17. It is a graphical representation of possible outcomes to a decision based on certain conditions.
• decision tree
18. A node, which is divided into sub-nodes is ______________.
• Parent node
19. Child nodes are also called as sub-nodes.
• True
20. It trains forest predictive models by fitting multiple decision trees.
• Forest selection
21. It is a process of dividing a node into two or more sub-nodes.
• Splitting
22. Sub-nodes are the sub-sections of the decision tree.
• True
23. A sub section of the decision tree.
• Sub-tree
24. The decision tree can classify a data record by passing the data record through the decision tree using the attribute values in the data record.
• True
25. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in a split criterion.
• False
26. Which Apache product is used for managing real-time transactions such as logs and events?
• Apache Kafka
27. As you build tasks, SAS Enterprise Guide generates SAS code.
• True
28. An efficient model was defined as provide at least 1.
• N-way trees
29. It is a famous data mining method which requires that the data must consist of two variables: a transaction and an item.
• Association Rule
30. It is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model.
• Multiple linear regression
31. Pruning is a process of dividing a node into two or more sub-nodes.
• False
32. Decision tree can work on continuous target variables.
• True
33. Which of the following is not an example of classification?
• Estimating the grade point average (GPA) of a graduate student, based on that student’s undergraduate GPA
34. There is no guarantee that making locally optimal decisions at separate times leads to the smallest decision tree or a globally optimal decision.
• True
35. In cluster analysis, the goal is to identify distinct groupings of cases across a set of inputs.
• True
• False
36. In cluster analysis, the goal is to partition cases from a cloud of data (data that doesn't necessarily have distinct groups) into contiguous groups.
• False
37. SAS Enterprise Guide does not include a full programming interface.
• False
38. Determining whether a will was written by the actual deceased, or fraudulently by someone else, is an example of classification.
• True
39. It aims at dividing the data set into groups.
• Clustering
40. Predicting the percentage increase in traffic deaths next year if the speed limit is increased is an example of clustering.
• False
41. The following can be done when dealing with categorical inputs except:
• Use smoothed weight of evidence coding to convert the categorical input to a continuous input.
• Collapse the categories based on the number of observations in a category
• No answer text provided.
• Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.
42. It is a type of decision tree with only two choices in each split. It can be non-uniform (uneven) in depth.
• Binary trees
43. It pertains to the process of reducing the size of decision trees by removing nodes.
• Pruning
44. Which of the following is not an algorithm used for streaming features?
• Alpha-investing algorithm
• ANOVA
• OSFS
• Grafting algorithm

Note: some items here are wrong and need corrections.

Still M7
18/20

1. Leaf node is also known as internal node
- False
2. A node, which is divided into sub-nodes is ________
- Parent node
3. The encircled part shown in the picture below illustrates the:
- Decision node
4. SAS Enterprise Guide does not include a full programming interface
- False
5. Child nodes are also called as sub-nodes
- True
6. It trains forest predictive models by fitting multiple decision trees
- Forest selection
7. It trains a decision tree predictive model
- Decision tree selection
8. A decision tree has no shortcomings in expressing classification and prediction patterns because it uses multiple attribute variables in a split criterion.
- False
9. It trains a gradient boosting predictive model by fitting a set of additive decision trees
- Gradient boosting selection
10. It is a process of dividing a node into two or more sub-nodes
- Splitting
11. Assessing whether a mortgage application is a good or bad credit risk is an example of classification
- True
12. It is the process of integrating multiple databases, data cubes, or files.
- Data Integration
13. Which Apache product is used for managing real-time transactions such as logs and events?
- Apache Kafka
14. Which of the following is not an example of classification?
- Estimating the grade point average (GPA) of a graduate student, based on that student’s undergraduate GPA
15. Factor is the variable being manipulated by researchers
- True
16. Determining whether a will was written by the actual deceased, or fraudulently by someone else, is an example of classification.
- True
17. It is the scientific domain that’s dedicated to knowledge discovery via data analysis
- Data Science
18. The model can be used to classify or predict the outcome of interest in new cases where the outcome is unknown.
- True
19. Behavioral analytics are also part of the pattern discovery
- True
20. It is a commonly used statistical technique to predict future behavior. (Use lowercase for your answer)
- predictive modeling

Module 8 - CLUSTER ANALYSIS

Unsupervised Classification
[Figure: cases described by their inputs are grouped into cluster 1, cluster 2, and cluster 3.]
Unsupervised classification: grouping of cases based on similarities in input values.

What is Cluster Analysis?
Naturally occurring groups?
Yes - Cluster Analysis
No - Segmentation

Clustering
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
Everitt (1998), The Cambridge Dictionary of Statistics

Clustering in real life
- While you have thousands of customers, there are really only a handful of major types into which most of your customers can be grouped.
• Bargain hunter
• Man/woman on a mission
• Impulse shopper
• Weary Parent
• DINK (dual income, no kids)
A case study

Training Data
1. Select inputs.
2. Select k cluster centers.
5. Re-assign cases.

Euclidean Distance
• Euclidean distance gives the linear distance between any two points in n-dimensional space.
• It is a generalization of the Pythagorean theorem.
[Figure: the Euclidean distance DE from the origin (0,0) to a point (x1, x2), DE = sqrt(x1^2 + x2^2).]

- When no clusters exist, use the K-means algorithm to partition cases into contiguous groups.

Cluster Profiling
- Cluster profiling can be defined as the derivation of a class label from a proposed cluster solution.

Example Problem
Cluster the following eight points (with (x,y)) representing locations into three clusters: A1(2,10) A2(2,5) A3(8,4) A5(7,5) A6(6,4) A7(1,2) A8(4,9).
P(Point, Mean1) = |x2 - x1| + |y2 - y1|
= |2 - 2| + |10 - 10|
= 0 + 0
= 0
----------------------------
Cluster 3:
(2,5)
(1,2)
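The distance measures and the case-assignment step above can be sketched in Python. This is a minimal illustration, not the full worked solution from the notes: the three initial cluster centers chosen below are an assumption for demonstration, since the notes list the points but not the starting centers.

```python
from math import sqrt

def euclidean(p, q):
    # Linear distance between two points in n-dimensional space
    # (a generalization of the Pythagorean theorem).
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # |x2 - x1| + |y2 - y1|, the distance used in the P(Point, Mean1) example.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Points from the example problem (A4 is missing in the notes, kept as-is).
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A5": (7, 5),
          "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

# One assignment step of k-means: each point goes to its nearest center.
# These three initial centers are hypothetical, chosen only to illustrate.
centers = [(2, 10), (6, 4), (1, 2)]
assignment = {name: min(range(len(centers)),
                        key=lambda i: euclidean(p, centers[i]))
              for name, p in points.items()}

print(manhattan((2, 10), (2, 10)))  # 0, reproducing P(Point, Mean1) = 0
print(assignment)
```

A full k-means run would then recompute each center as the mean of its assigned points and repeat the assignment step until nothing moves.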
- False
- True
3. K-means algorithm has target label.
- False
4. Identify if the statement is a clustering or not:
- True
12. The easiest and simplest clustering algorithm that is widely used because of its simple methods of implementation is called k-means algorithm.
- True
16. Clustering may be used to identify different evolving species or subspecies.
- True
17. K-means can handle noisy data and outliers.
- False
22. It can be defined as the derivation of a class label from a proposed cluster solution.
- Cluster profiling
- Segmentation
- Association Rule Mining
23. Clustering is supervised classification.
- False
25. Song recommendation is an example of cluster analysis.
- True
- Price prediction
- Song recommendation
- For cities on fault lines, geologists use cluster analysis to evaluate seismic risk and the potential weaknesses of earthquake-prone regions. By considering the results of this research, residents can do their best to prepare and mitigate potential damage.
- Observed earthquake epicenters should be clustered along continental faults
30. K-means partitions the given data set into k predefined distinct clusters.
- True
31. Data points belonging to different clusters have low degree of dissimilarity.
- True
32. Movie Recommendation systems are an example of clustering analysis.
- True
33. Data points belonging to one cluster have low degree of similarity.
- False
34. Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects) on the basis of a set of measured variables into a number of different groups such that similar subjects are placed in the same group.
(https://www.sheffield.ac.uk/mash/statistics/multivariate)
- True
35. K-means algorithm can be used in forecasting car plant electricity usage.
- True

M8 16/20

1. Clustering analysis is negatively affected by heteroscedasticity
- False
2. Requiring the number of clusters (k) to be specified in advance is a disadvantage of k-means
- True
3. Predicting house price based on the size of the house is an example of cluster analysis.
- True
4. It is an iterative algorithm that tries to partition the dataset into k pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group
- K-means
5. A cluster is defined as a collection of data points exhibiting certain similarities
- True
6. It can be defined as the derivation of a class label from a proposed cluster solution
- Cluster Profiling
7. Clustering is supervised classification
- False
8. There is a separate “quality” function that measures the “goodness” of a cluster
- True
9. Identify if the statement is a clustering or not: Identifying groups of motor insurance policy holders with a high average claim cost.
- True
10. Cluster analysis is a statistical technique used to identify how various units—like people, groups, or societies—can be grouped together because of the characteristics they have in common
- True
11. The easiest and simplest clustering algorithm that is widely used because of its simple methods of implementation is called k-means algorithm
- True
12. K-means algorithm can be used in forecasting car plant electricity usage.
- False
13. Natural language processing is an example of clustering
- True
14. K-means cannot handle noisy data and outliers
- True
15. Data points belonging to different clusters have low degree of dissimilarity
- True
16. Data points belonging to different clusters have low degree of dissimilarity
- True
17. This method is used to quickly cluster large datasets. Here, researchers define the number of clusters prior to performing the actual study
- Hierarchical Cluster
18. The Euclidean distance from each case in the training data to each cluster center is calculated
- True
19. Which of the following is not an example of cluster analysis?
- Price Prediction
20. You can randomly select any k data points as cluster centers
- True

FINAL EXAM REVIEW

1. Identify whether the statement is a null or alternative hypothesis: all iPhone 6 Plus units weigh 6.77 ounces.
- Null hypothesis
2. Association rule mining is about grouping similar samples into segments
- False
(Because that is cluster analysis. Association rule mining refers to market basket analysis, in the sense that it is the search for objects or items that co-occur, or are present at the same time, in a specific database.)
3. Behavioral analytics are also part of pattern discovery.
- True, the behavior of a customer has a certain pattern
4. Market Basket Analysis creates If-Then scenario rules. The “THEN” part is called the ________.
- Consequent
5. It measures the proportion of data records that contain the item set X.
- Support
(In a certain data set, it is concerned with the overall impact of X.)
6. In Association rule mining, the more rules you produce, the greater the risk of the model being generated.
- True
(In Association rule mining, the more rules you generate, the riskier your model is: it no longer knows which of the combinations is significant.)
7. What variable selection is best to utilize if there are 100 columns available in the dataset?
- Forward selection
8. Which of the following types of association rules pertains to a rule that is well known by an expert within the business?
- Trivial rules (rules you already know will appear when you run the mining)
(Inexplicable rules: rules you cannot explain because you are not part of the company; you need expert advice here.
Actionable rules: rules you discover that are new, i.e., high-quality information.)
9. The data is bimodal. Data: 77 82 74 81 79 84 74 82
- True (because two modes exist)
10. The corona outbreak is an example of seasonal variations
- False (because it does not occur every year)
(Seasonal variation happens periodically every year or month, meaning the trend is recurring.)
11. A technique that enables machines to mimic human behavior
- AI
(Machine learning: creating a machine or program that automatically learns from the data being fed to it; it works automatically based on historical data.
Deep learning: a subset of machine learning concerned with trying to mimic the human brain.)
12. If the lift ratio is <1, it means that having the antecedent does not increase the chances of having the consequent.
- 1
- Ordinal (because it is a Likert scale: it has order)
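The bimodal claim in item 9 can be checked directly by counting frequencies: two values tie for the highest count, so the data has two modes. A quick sketch:

```python
from collections import Counter

# Data from item 9: the data is bimodal if two values tie for the top count.
data = [77, 82, 74, 81, 79, 84, 74, 82]
counts = Counter(data)
top = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top)
print(modes)  # [74, 82] -> two modes, so the data is bimodal
```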
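The support and lift definitions in items 5 and 12 can be made concrete with a small worked example. The basket data below is invented purely for illustration; note how the computed lift comes out below 1, which (per item 12) means the antecedent does not increase the chances of the consequent.

```python
# Toy market-basket data (made up for illustration only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    # Proportion of data records (baskets) that contain the item set.
    return sum(itemset <= b for b in baskets) / len(baskets)

def lift(antecedent, consequent):
    # Lift > 1: the antecedent raises the chance of the consequent;
    # lift < 1: it does not (item 12).
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))

print(support({"bread"}))         # 0.8 (4 of 5 baskets contain bread)
print(lift({"bread"}, {"milk"}))  # 0.6 / (0.8 * 0.8) = 0.9375 < 1
```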