Chapter 10 - Introduction To Data Mining
and Decisions, Third Edition
Copyright © 2020, 2016, 2013 Pearson Education, Inc. All Rights Reserved Slide - 1
Data Mining
• Data mining focuses on understanding the characteristics of, and patterns among, variables in large databases using a variety of statistical and analytical tools.
– It is used to identify relationships among variables in large data sets and to uncover hidden patterns they may contain.
The Scope of Data Mining
• Cluster Analysis
– identifying groups in which elements are in some way similar
• Classification
– analyzing data to predict how to classify a new data element
• Association
– analyzing databases to identify natural associations among
variables and create rules for target marketing or buying
recommendations
• Cause-and-effect Modeling
– developing analytic models to describe relationships between
metrics that drive business performance
Cluster Analysis
• Cluster analysis, also called data segmentation, is
a collection of techniques that seek to group or
segment a collection of objects (observations or
records) into subsets or clusters, such that those
within each cluster are more closely related to one
another than objects assigned to different clusters.
– The objects within clusters should exhibit a high amount
of similarity, whereas those in different clusters will be
dissimilar.
Clustering Methods
• Hierarchical clustering
– Agglomerative clustering methods, which proceed by a series of fusions of the n objects into groups.
– Divisive clustering methods, which separate the n objects successively into finer groupings.
Distance Measures
• Euclidean distance is the straight-line distance between two points.
• The Euclidean distance between two points x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is
d(x, y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]
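The straight-line formula is simple to compute directly; a minimal Python sketch (illustrative only — the textbook carries out these calculations in Excel, and the points below are made up):

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points given as equal-length sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical two-attribute records (not the textbook's data):
print(euclidean_distance((3, 4), (0, 0)))  # → 5.0
```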
Example 10.1: Applying the Euclidean
Distance Measure
• Colleges and Universities
Normalizing Distance Measures
• Convert each attribute to z-scores, z = (x − mean)/(standard deviation), so that attributes measured on different scales contribute comparably to the distance.
• The normalized distance measure between Amherst and Barnard is then found by applying the Euclidean distance formula to the z-scores.
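The normalization step can be sketched in Python; the school attribute values below are hypothetical, not the Colleges and Universities data:

```python
import math
from statistics import mean, stdev

def z_scores(values):
    """Convert raw values to z-scores: z = (x - mean) / (standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical attribute columns for five schools (not the textbook data):
sat = [1315, 1220, 1240, 1176, 1310]
accept_rate = [0.22, 0.53, 0.36, 0.51, 0.33]
z_sat, z_acc = z_scores(sat), z_scores(accept_rate)

# Normalized Euclidean distance between the first two records:
d = math.sqrt((z_sat[0] - z_sat[1]) ** 2 + (z_acc[0] - z_acc[1]) ** 2)
print(round(d, 3))
```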
Single Linkage Clustering
Example 10.2: Single Linkage
Clustering
Dendrogram
• A dendrogram is a visualization of the clustering process. The y-axis measures the intercluster distance; the dendrogram shows the sequence in which clusters are formed as you move up the diagram.
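The agglomerative procedure described earlier (successive fusions of the n objects, merging the clusters whose closest members are nearest) can be sketched in plain Python. This is a toy single-linkage implementation for small n with made-up points; production work would use a library such as scipy:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start with singleton clusters, repeatedly
    merge the two clusters whose closest members are nearest (single
    linkage), until k clusters remain."""
    clusters = [[p] for p in points]

    def cluster_dist(c1, c2):
        # Single linkage: distance between clusters = smallest pairwise distance.
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_linkage(pts, 3))  # two tight pairs plus the isolated point
```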
Classification
• Classification methods seek to classify a
categorical outcome into one of two or more
categories based on various data attributes.
• For each record in a database, we have a
categorical variable of interest and a number of
additional predictor variables.
• For a given set of predictor variables, we would like
to assign the best value of the categorical variable.
Credit Approval Decisions Data
• Categorical variable of interest: Decision (whether to
approve – coded as 1 – or reject – coded as 0 – a
credit application)
• Predictor variables: shown in columns A-E (note that
homeowner is also coded numerically)
Example 10.3: Classifying Credit-
Approval Decisions Intuitively
• Large bubbles correspond to rejected applications.
• When the credit score is > 640, most applications
were approved.
– Classification rule: if the credit score is greater than 640, classify the application as approved (1); otherwise, classify it as rejected (0).
Example 10.3 Continued
• Alternate classification rule using visualization
Measuring Classification Performance
• Find the probability of making a misclassification
error and summarize the results in a classification
matrix, which shows the number of cases that were
classified either correctly or incorrectly.
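A classification matrix can be tallied directly from actual and predicted labels. A minimal Python sketch with hypothetical decisions (not the textbook's records):

```python
def classification_matrix(actual, predicted, labels=(1, 0)):
    """Count how many cases of each actual class were predicted as each class."""
    matrix = {(a, p): 0 for a in labels for p in labels}
    for a, p in zip(actual, predicted):
        matrix[(a, p)] += 1
    return matrix

# Hypothetical decisions: 1 = approve, 0 = reject (not the textbook data)
actual    = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
m = classification_matrix(actual, predicted)

# Off-diagonal cells are misclassification errors:
errors = m[(1, 0)] + m[(0, 1)]
print(m, "misclassification rate:", errors / len(actual))  # 2 of 10 → 0.2
```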
Example 10.4: Classification Matrix for
Credit-Approval Classification Rules
Two applicants with credit
scores exceeding 640 were
rejected.
Example 10.5: Classifying Records for Credit
Decisions Using Credit Scores and Years of
Credit History
• Classify new records.
• If we use the simple credit score rule (that a score of more than
640 is needed to approve an application), then we would
classify the decision for the first, third, and sixth records to be 1
and the rest to be 0.
Example 10.5 Continued
• Using the second rule, only the last record would be approved for credit.
Classification Techniques
• k-Nearest Neighbors (k-NN) Algorithm
– Finds records in a database that have similar numerical
values of a set of predictor variables.
• Discriminant Analysis
– Uses predefined classes based on a set of linear
discriminant functions of the predictor variables.
k-Nearest Neighbors (k-NN)
• The k-nearest neighbors (k-NN) algorithm is a
classification scheme that attempts to find records in a
database that are similar to one we wish to classify.
Similarity is based on how close the record's numerical predictor values are to those of other records, measured using normalized Euclidean distances.
k-Nearest Neighbor Rules
• The nearest neighbor to a record is the one that has the smallest distance from it.
– If k = 1, the 1-NN rule classifies a record in the same category as its nearest neighbor.
– For k > 1, the k-NN rule finds the k nearest neighbors of the record we want to classify and assigns it the category held by the majority of those neighbors.
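The k-NN voting rule fits in a few lines of Python (illustrative only — the text implements the search in Excel, and the records below are hypothetical, already-normalized values):

```python
import math
from collections import Counter

def knn_classify(record, training, k):
    """Classify `record` by majority vote among its k nearest training records.
    `training` is a list of (features, label) pairs; distances are Euclidean,
    so features should already be normalized to comparable scales."""
    neighbors = sorted(training, key=lambda t: math.dist(record, t[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical normalized (credit score, years of history) records:
train = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.1), 0), ((0.1, 0.3), 0)]
print(knn_classify((0.85, 0.85), train, k=3))  # → 1 (two of three neighbors approved)
```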
Example 10.6: Using k-NN for Classifying
Credit-Approval Decisions
• Credit Approval Decisions Classification Data
• Consider the first new record, 51. If k = 1, the record having
the minimum distance from record 51 is record 27. Since the
credit decision was to approve, we would classify record 51 as
an approval.
Example 10.6 Continued
• We can easily implement the search for the nearest
neighbor in Excel using the SMALL, MATCH, and
VLOOKUP functions. To find the kth smallest value in an
array, use the function =SMALL(array, k). To identify the
record associated with this value, use the MATCH
function with match_type = 0 for an exact match. Then
use the VLOOKUP function to identify the decision
associated with the record.
Discriminant Analysis
• Discriminant analysis is a technique for classifying a set
of observations into predefined classes. The purpose is to
determine the class of an observation based on a set of
predictor variables.
• With only two classification groups, we can apply regression analysis; when there are more than two groups, linear regression cannot be applied and specialized software must be used.
Example 10.7: Classifying Credit
Decisions Using Discriminant Analysis
Example 10.7 Continued
• Rule for classifying observations using the discriminant
scores. Compute a cut-off value so that if a discriminant
score is less than or equal to it, the observation is
assigned to one group; otherwise, it is assigned to the
other group.
• One simple way is to use the midpoint of the average
discriminant scores:
– Cut-Off Value = (0.9083 + 0.0781)/2 = 0.4932
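The midpoint cut-off rule can be sketched as follows. This is a hedged illustration: real discriminant scores come from the discriminant functions (not reproduced here), and the new-record scores below are made up; only the two group averages come from the slide:

```python
from statistics import mean

def cutoff_and_classify(scores_group1, scores_group0, new_scores):
    """Midpoint cut-off rule: average the mean discriminant score of each
    group; scores above the cut-off go to group 1, others to group 0.
    (Which side maps to which group depends on how the scores were built.)"""
    cutoff = (mean(scores_group1) + mean(scores_group0)) / 2
    return cutoff, [1 if s > cutoff else 0 for s in new_scores]

# Using the slide's group averages as single illustrative scores,
# with two hypothetical new discriminant scores:
cutoff, classes = cutoff_and_classify([0.9083], [0.0781], [0.75, 0.30])
print(round(cutoff, 4), classes)  # → 0.4932 [1, 0]
```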
Association Rule Mining
• Association rule mining, often called affinity
analysis, seeks to uncover associations and/or
correlation relationships in large data sets.
– Association rules identify attributes that occur together
frequently in a given data set.
– Market basket analysis, for example, is used to
determine groups of items consumers tend to purchase
together.
Example 10.8: Custom Computer
Configuration
• PC Purchase Data
• We might want to know which components are often
ordered together.
Measuring Strength of Association
• Support for the (association) rule is the percentage (or number) of transactions that include all items in both the antecedent and the consequent.
• Confidence of the (association) rule is the ratio of the number of transactions that include all items in both the antecedent and the consequent (namely, the support count) to the number of transactions that include all items in the antecedent.
• Expected confidence is the proportion of transactions that include the consequent, regardless of the antecedent.
• Lift is the ratio of confidence to expected confidence; lift greater than 1 indicates that the antecedent and consequent occur together more often than chance would suggest.
Example 10.9: Measuring Strength of
Association
• A supermarket database has 100,000 point-of-sale
transactions; 2000 include both A and B items; 5000
include C; and 800 include A, B, and C.
• Association rule: “If A and B are purchased, then C is also
purchased.”
– Support = 800/100,000 = 0.8%
– Confidence = 800/2,000 = 40%
– Expected confidence = 5,000/100,000 = 5%
– Lift = 40%/5% = 8
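These quantities follow directly from the transaction counts; a small Python sketch of the arithmetic:

```python
def association_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence, expected confidence, and lift for a rule
    'if antecedent, then consequent', computed from transaction counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    expected_confidence = n_consequent / n_total
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift

# Slide's numbers: 100,000 transactions; 2,000 with A and B; 5,000 with C;
# 800 with A, B, and C, for the rule "if A and B are purchased, then C is":
s, c, e, lift = association_metrics(100_000, 2_000, 5_000, 800)
print(f"support={s:.1%} confidence={c:.0%} expected={e:.0%} lift={lift:.0f}")
```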
Example 10.10: Using Correlations to
Explore Associations
• PC Purchase Data
Cause-and-Effect Modeling
• Correlation analysis can help us develop cause-and-
effect models that relate lagging and leading
measures.
– Lagging measures tell us what has happened and are
often external business results such as profit, market
share, or customer satisfaction.
– Leading measures predict what will happen and are
usually internal metrics such as employee satisfaction,
productivity, and turnover.
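The association between a leading and a lagging measure can be checked with the sample Pearson correlation coefficient. A minimal sketch with made-up monthly data (not the Ten Year Survey):

```python
from statistics import mean

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical monthly metrics: a leading measure (employee satisfaction)
# and a lagging one (customer satisfaction), both on a 1-5 scale:
employee = [3.1, 3.4, 3.8, 4.0, 4.2, 4.5]
customer = [3.0, 3.2, 3.9, 3.8, 4.1, 4.4]
print(round(pearson_r(employee, customer), 3))  # strong positive correlation
```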
Example 10.11: Using Correlation for
Cause-and-Effect Modeling
• Ten Year Survey data
– Satisfaction was measured on a 1-5 scale.
• Correlation matrix
Example 10.11 Continued
• Logical model