
Introduction to Data Mining

Data Mining

• Data mining is a rapidly growing field of business analytics focused on a better understanding of the characteristics and patterns among variables in large data sets.
• It is used to identify and understand hidden patterns that large data sets may contain.
• It involves both descriptive and prescriptive analytics, though it is primarily prescriptive.

The Scope of Data Mining
Some common approaches to data mining
Association
– analyze data to identify natural associations among variables and create rules for target marketing or buying recommendations
• Netflix uses association to understand what types of movies a customer likes and provides recommendations based on that data.
• Amazon makes recommendations based on past purchases.
• Supermarket loyalty cards collect data on customers’ purchase habits and print coupons based on what was just purchased.
The Scope of Data Mining
Some common approaches to data mining
Clustering
₋ Similar to classification, but when no groups have been defined; finds groupings within data
₋ Example: An insurance company could use clustering to group clients by age, location, and types of insurance purchased.
₋ The categories are unspecified, and this is referred to as ‘unsupervised learning’.

The Scope of Data Mining

Some common approaches to data mining


Classification
– analyze data to predict how to classify new elements
– Spam filtering in email by examining textual characteristics of a message
– Helps predict whether a credit-card transaction may be fraudulent
– Is a loan application high risk?
– Will a consumer respond to an ad?

Association Rule Mining
Association Rule Mining (affinity analysis)
• Seeks to uncover associations in large data sets
• Association rules identify attributes that occur together frequently in a given data set.
• Market basket analysis, for example, is used to determine groups of items consumers tend to purchase together.
• Association rules provide information in the form of if-then (antecedent-consequent) statements.
• The rules are probabilistic in nature.

Association Rule Mining
Custom Computer Configuration
(PC Purchase Data)
• Suppose we want to know which PC
components are often ordered together.

Figure 12.35

Association Rule Mining

Measuring the Strength of Association Rules

Support for the (association) rule is the percentage (or number) of transactions that include all items, both antecedent and consequent.
Confidence of the (association) rule is the ratio of the number of transactions that include all items (antecedent and consequent) to the number of transactions that include the antecedent items.
Lift is the ratio of confidence to expected confidence, where expected confidence is the percentage of transactions that include the consequent.

Association Rule Mining
Measuring Strength of Association
A supermarket database has 100,000 point-of-sale transactions:
2000 include both A and B items
5000 include C
800 include A, B, and C
Association rule:
If A and B are purchased, then C is also purchased.
Support = 800/100,000 = 0.008
Confidence = 800/2000 = 0.40
Expected confidence = 5000/100,000 = 0.05
Lift = 0.40/0.05 = 8
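The same calculations can be reproduced in a few lines of code. The sketch below is a minimal illustration using the transaction counts from the supermarket example above; the variable names are ours, not part of the text.

```python
# Support, confidence, and lift for the rule
# "If A and B are purchased, then C is also purchased."
total_transactions = 100_000
count_a_and_b = 2_000        # transactions containing both A and B (the antecedent)
count_c = 5_000              # transactions containing C (the consequent)
count_a_b_c = 800            # transactions containing A, B, and C

support = count_a_b_c / total_transactions          # 0.008
confidence = count_a_b_c / count_a_and_b            # 0.40
expected_confidence = count_c / total_transactions  # 0.05
lift = confidence / expected_confidence             # 8.0

print(support, confidence, expected_confidence, lift)
```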

Association Rule Mining
(continued) Identifying Association Rules for PC
Purchase Data

Figure 12.37

Association Rule Mining

Example 12.14 (continued): Identifying Association Rules for PC Purchase Data

Figure 12.38

Rules are sorted by their Lift Ratio (how much more likely a customer is to purchase the consequent if they purchase the antecedents).
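The slides use XLMiner to generate this rule table, but a comparable table can be produced outside Excel. The sketch below uses the third-party mlxtend library on a hypothetical one-hot matrix of PC orders; the component names, data values, and thresholds are assumptions for illustration, not taken from the text, and API details may vary by mlxtend version.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot data: each row is an order, each column a PC component (True if ordered)
orders = pd.DataFrame({
    "Intel CPU":  [True, True, False, True, True],
    "AMD CPU":    [False, False, True, False, False],
    "HD Monitor": [True, True, True, False, True],
    "Extra RAM":  [True, False, True, True, True],
})

# Find itemsets appearing in at least 40% of orders, then derive if-then rules from them
frequent_itemsets = apriori(orders, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

# Sort by lift ratio, as in Figure 12.38
print(rules.sort_values("lift", ascending=False)
           [["antecedents", "consequents", "support", "confidence", "lift"]])
```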

Cluster Analysis
• Similar to classification, but when no groups have been defined; finds groupings within data
• Cluster analysis has many powerful uses, such as market segmentation.
• You can view each individual record’s predicted cluster membership.
• Also called data segmentation
• Two major methods (each sketched in code below):
  1. Hierarchical clustering
     a) Agglomerative methods (used in XLMiner) proceed as a series of fusions
  2. k-means clustering (available in XLMiner) partitions data into k clusters so that each element belongs to the cluster with the closest mean
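The slides use XLMiner for both methods. As an alternative illustration of k-means, here is a minimal sketch using scikit-learn on a small made-up numeric data set; the values and the choice of k = 3 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric records (e.g., two measured attributes per client)
X = np.array([
    [25, 1200], [27, 1500], [23, 1100],   # a younger, low-spend group
    [45, 5200], [48, 5600], [44, 4900],   # an older, high-spend group
    [35, 3000], [36, 3300], [34, 2800],   # a middle group
])

# Partition the data into k = 3 clusters; each record is assigned
# to the cluster whose mean (centroid) is closest
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # predicted cluster membership for each record
print(kmeans.cluster_centers_)  # the cluster means
```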
Cluster Analysis – Agglomerative Methods
Dendrogram – a diagram illustrating fusions or divisions at successive stages
Objects “closest” in distance to each other are gradually joined together.
Euclidean distance is the most commonly used measure of the distance between objects.
Figure 12.2
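As a quick illustration of this distance measure, the short sketch below computes the Euclidean distance between two records; the attribute values are made up for illustration.

```python
import numpy as np

# Two records described by the same numeric attributes (hypothetical, already normalized values)
x = np.array([0.82, 0.67, 0.91])
y = np.array([0.35, 0.40, 0.52])

# Euclidean distance: square root of the sum of squared attribute differences
distance = np.sqrt(np.sum((x - y) ** 2))
print(distance)
```

In practice, attributes measured on very different scales are usually normalized first so that no single attribute dominates the distance.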

Clustering Colleges and Universities
Cluster the Colleges and Universities data
using the five numeric columns in the data
set.
Use the hierarchical method

Figure 12.3

• This process of agglomeration leads to the construction of a dendrogram.

• This is a tree-like diagram that summarizes the process of clustering.

• For any given number of clusters we can determine the records in the clusters by sliding a
horizontal line (ruler) up and down the dendrogram until the number of vertical intersections of
the horizontal line equals the number of clusters desired.
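Outside XLMiner, the same agglomerate-then-cut procedure can be sketched with SciPy. The example below builds a linkage on a small hypothetical matrix of normalized college attributes, draws the dendrogram, and “cuts” it to obtain a chosen number of clusters; the data values and the choice of four clusters are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical normalized numeric attributes for a handful of colleges (five columns)
X = np.array([
    [0.2, 0.3, 0.1, 0.4, 0.2],
    [0.3, 0.2, 0.2, 0.5, 0.3],
    [0.8, 0.9, 0.7, 0.9, 0.8],
    [0.7, 0.8, 0.9, 0.8, 0.9],
    [0.5, 0.5, 0.4, 0.6, 0.5],
    [0.4, 0.6, 0.5, 0.5, 0.4],
])

# Agglomerative clustering: repeatedly fuse the two closest clusters
# (Euclidean distance, average linkage)
Z = linkage(X, method="average", metric="euclidean")

# The dendrogram summarizes the sequence of fusions
dendrogram(Z)
plt.show()

# "Slide the horizontal ruler" to a position that yields the desired number of clusters, e.g., 4
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)  # predicted cluster number for each record
```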

(continued) Clustering of Colleges
Hierarchical clustering results: Dendrogram

Height of the bars is a measure of dissimilarity in the clusters that are merging into one.

Smaller clusters “agglomerate” into bigger ones, with the least possible loss of cohesiveness at each stage.

From Figure 12.8

(continued) Clustering of Colleges
Hierarchical clustering results: Predicted clusters

From Figure 12.9

(continued) Clustering of Colleges

Hierarchical clustering results: Predicted clusters

Cluster    # of Colleges
1          23
2          22
3          3
4          1

Figure 12.9
(continued) Clustering of Colleges
Hierarchical clustering results for clusters 3 and 4

Schools in cluster 3 appear similar.
Cluster 4 has considerably higher Median SAT and Expenditures/Student.

Classification
• Recognizes patterns that describe the group to which an item belongs
• We will analyze the Credit Approval Decisions data to predict how to classify new elements.
Categorical variable of interest: Decision (whether to approve or reject a credit application)
Predictor variables: shown in columns A–E

Figure 12.10

Classification
Modified Credit Approval Decisions
The categorical variables are coded as numeric:
Homeowner - 0 if No, 1 if Yes
Decision - 0 if Reject, 1 if Approve

Figure 12.11
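A comparable recoding takes one line per column in pandas. The sketch below is only an illustration; the sample rows are made up based on the description above, not the Figure 12.11 data.

```python
import pandas as pd

# Hypothetical rows mirroring the Credit Approval Decisions layout
df = pd.DataFrame({
    "Homeowner": ["Yes", "No", "Yes"],
    "Credit Score": [725, 573, 677],
    "Decision": ["Approve", "Reject", "Approve"],
})

# Code the categorical variables as numeric, as in Figure 12.11
df["Homeowner"] = df["Homeowner"].map({"No": 0, "Yes": 1})
df["Decision"] = df["Decision"].map({"Reject": 0, "Approve": 1})
print(df)
```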

Classification
Using Training and Validation Data
Data mining projects typically involve large volumes of
data.
The data can be partitioned into:
▪ training data set – has known outcomes and is used to
“teach” the data-mining algorithm
▪ validation data set – used to fine-tune a model
▪ test data set – tests the accuracy of the model
In XLMiner, partitioning can be random or user-specified.

Classification
(continued) Partitioning Data Sets in XLMiner
Partitioning choices when choosing random partitioning:
1. Automatic: 60% training, 40% validation
2. Specify %: 50% training, 30% validation, 20% test (the training and validation percentages can be modified)
3. Equal # records: 33.33% each for training, validation, and test
XLMiner has size and relative-size limitations on the data sets, which can affect the amount and percentage of data assigned to each set.
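Outside XLMiner, an equivalent random partition can be made with scikit-learn. The sketch below shows the “Specify %” case (50% training, 30% validation, 20% test); the stand-in data set is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set standing in for the full credit approval records
df = pd.DataFrame({"Credit Score": range(100), "Decision": [0, 1] * 50})

# First peel off 50% for training, then split the remaining half into validation and test
train, rest = train_test_split(df, train_size=0.50, random_state=1)
validation, test = train_test_split(rest, train_size=0.60, random_state=1)  # 0.60 of 50% = 30% of the total

print(len(train), len(validation), len(test))  # 50, 30, 20
```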

Classification Techniques
Three Data-Mining Approaches to Classification (each sketched in code below):
1. k-Nearest Neighbors (k-NN) Algorithm
   – find records in a database that have similar numerical values of a set of predictor variables
2. Discriminant Analysis (what we will do)
   – use predefined classes based on a set of linear discriminant functions of the predictor variables
3. Logistic Regression
   – estimate the probability of belonging to a category using a regression on the predictor variables
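All three approaches have standard scikit-learn implementations. The sketch below fits each one to a tiny hypothetical version of the recoded credit data; the values are made up for illustration and are not the Figure 12.11 data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Hypothetical predictors (Homeowner 0/1, Credit Score) and coded Decision (0 = Reject, 1 = Approve)
X = np.array([[1, 725], [0, 573], [1, 677], [0, 625], [1, 527], [0, 795]])
y = np.array([1, 0, 1, 0, 0, 1])

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=3),            # classify by the most similar records
    "Discriminant analysis": LinearDiscriminantAnalysis(),  # linear discriminant functions of the predictors
    "Logistic regression": LogisticRegression(),            # probability of belonging to the "Approve" class
}

new_applicant = np.array([[1, 650]])  # a new record to classify
for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict(new_applicant))
```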

Classification Techniques
(continued) Using Discriminant Analysis for
Classifying New Data

Figure 12.27

Half of the applicants are in the “Approved” class
