Chapter 4 - IS 466 - Spring Semester 23-24 Final
Chapter 4
Data Mining Process, Methods, and
Algorithms
Copyright © 2020, 2015, 2011 Pearson Education, Inc. All Rights Reserved
Learning Objectives
What is Data Mining?
Definition of Data Mining
• The nontrivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data stored in structured
databases. -- Fayyad et al., (1996)
• Keywords in this definition: Process, nontrivial, valid, novel, potentially
useful, understandable.
• Other names: knowledge extraction, pattern analysis, knowledge
discovery, information harvesting, pattern searching, data dredging.
Opening Vignette
Miami-Dade Police Department Is Using Predictive
Analytics to Foresee and Fight Crime
Data Mining Concepts and Definitions:
Why Data Mining?
• More intense competition at the global scale.
• Recognition of the value in data sources.
• Availability of quality data on customers, vendors, transactions, Web,
etc.
• Consolidation and integration of data repositories into data
warehouses.
• The exponential increase in data processing and storage capabilities.
• Decrease in the cost of hardware and software for data storage and
processing.
• Movement toward conversion of information resources into nonphysical
form.
Data Mining Is a Blend of Multiple
Disciplines
Figure 4.1 Data Mining Is a Blend of Multiple Disciplines.
Data Mining Characteristics & Objectives
• Source of data for DM is often a consolidated data warehouse (not
always!).
• DM environment is usually a client-server or a Web-based information
systems architecture.
• Data is the most critical ingredient for DM, and it may include
soft/unstructured data.
• The miner is often an end user
• Striking it rich requires creative thinking
• Data mining tools’ capabilities and ease of use are essential (Web,
parallel processing, etc.)
How Data Mining Works
• DM extracts patterns from data
– Pattern? A mathematical (numeric and/or symbolic) relationship among
data items
• Types of patterns
– There are four different types of patterns:
How Data Mining Works
• Types of patterns (continued)
A Taxonomy for Data Mining
Figure 4.2 A Simple Taxonomy for Data Mining Tasks, Methods, and Algorithms.
Other Data Mining Patterns/Tasks
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
Data Mining Applications (2 of 4)
Data Mining Applications (3 of 4)
Data Mining Applications (4 of 4)
Data Mining Process
Data Mining Process: CRISP-DM (1 of 2)
Data Mining Process: CRISP-DM (2 of 2)
Data Mining Process: SEMMA
Data Mining Process: KDD
Figure 4.6 KDD (Knowledge Discovery in Databases) Process.
What Data Mining Methodology are you
using?
Figure 4.7 Ranking of Data Mining Methodologies/Processes.
Best Algorithms based on type of DM Task
• Depending on the business need, different types of data mining tasks
can be used: prediction, clustering, or association.
Data Mining Methods for Prediction:
Data Mining Methods: Classification
Data Mining Methods: Classification
• The output variable is categorical (nominal or ordinal) in nature
– Nominal data: data that can be labelled or classified into mutually
exclusive categories within a variable. The categories cannot be
ordered in a meaningful way.
Example: for the nominal variable "preferred mode of transportation,"
the categories may be car, bus, train, tram, or bicycle.
– Ordinal data: a statistical data type in which the variable has
natural, ordered categories.
Examples: a grading system (excellent, very good, good, poor) or the
finishing order in a race (first, second, third).
Data Mining Methods: Classification
Estimation Methodologies for
Classification: Single/Simple Split
• Simple split (or holdout or test sample estimation)
– Split the data into two mutually exclusive sets: training (~70%) and
testing (~30%)
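A minimal sketch of the simple split, assuming the records fit in memory and that a random shuffle is acceptable. The function name `simple_split`, the `train_fraction` parameter, and the fixed `seed` are illustrative choices, not part of the slides.

```python
import random


def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 record IDs standing in for real rows.
train, test = simple_split(range(100))
print(len(train), len(test))
```

The two returned lists never share a record, which is what "mutually exclusive" requires.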
Estimation Methodologies for
Classification: k-Fold Cross Validation
• Data is split into k mutually exclusive subsets, and k training/testing
experiments are conducted
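One way to picture the k experiments is the sketch below: each subset (fold) is used once for testing while the remaining k - 1 folds form the training set. The helper name `k_fold_splits` is hypothetical, and the records are dealt into folds in order; in practice the data would usually be shuffled first.

```python
def k_fold_splits(records, k):
    """Yield (training, testing) pairs; each fold is the test set exactly once."""
    records = list(records)
    folds = [records[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        testing = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, testing


# Hypothetical data: 10 record IDs, 5 folds -> 5 training/testing experiments.
splits = list(k_fold_splits(range(10), k=5))
for training, testing in splits:
    print(len(training), len(testing))
```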
Accuracy of Classification Models
• In classification problems, the primary source for accuracy
estimation is the confusion matrix (or, classification matrix)
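\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \]
\[ \text{True Negative Rate} = \frac{TN}{TN + FP} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]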
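The measures above can be computed directly from the four cells of a two-class confusion matrix. The sketch below assumes the counts are already known; the function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the slide's accuracy measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # same ratio as the true positive rate
    }


# Hypothetical counts for 100 test cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```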
Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Decision Trees
• Employs a divide-and-conquer method
• Recursively divides a training set until each division consists of
examples from one class:
A general algorithm (steps) for building a decision tree:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the
data into mutually exclusive subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf node until the stopping
criterion is reached.
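The four steps can be sketched as a short recursive function. This is only an illustration under stated assumptions: the slides do not name a splitting criterion, so the sketch picks the attribute with the lowest weighted entropy (an ID3-style choice), it handles categorical attributes only, and it stops when a node is pure or no attributes remain. All names and the tiny data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Entropy of the class labels in a set of rows."""
    counts = Counter(row[target] for row in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def best_attribute(rows, attributes, target):
    """Step 2: choose the attribute whose split leaves the least entropy."""
    def weighted_entropy(attr):
        groups = Counter(row[attr] for row in rows)
        return sum(
            (n / len(rows)) * entropy([r for r in rows if r[attr] == v], target)
            for v, n in groups.items()
        )
    return min(attributes, key=weighted_entropy)


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stopping criterion (assumed): pure node, or nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    # Step 3: one branch per attribute value, each with its own subset.
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        branches[value] = build_tree(subset, remaining, target)  # Step 4
    return (attr, branches)


def predict(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical training data in the spirit of the golf example.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

Swapping in a different splitting or stopping rule changes only `best_attribute` and the first `if` in `build_tree`, which is where DT algorithms mainly differ.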
Decision Trees
• DT algorithms mainly differ on
1. Splitting criteria
Which variable, what value, etc.
2. Stopping criteria
When to stop building the tree
Decision Trees
Source:
https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
Decision Trees
Source of image:
https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
Decision Trees
• Example: Should I play golf or not?
Source of image:
https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
Decision Trees
• Example: Should I give a loan or not?
Source of image:
https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
Cluster Analysis for Data Mining
(1 of 4)
• Used for automatic identification of natural groupings of
things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns
new instances
• There is no output/target variable
• In marketing, it is also known as segmentation
Cluster Analysis for Data Mining
(2 of 4)
• Clustering results may be used to
– Identify natural groupings of customers
– Identify rules for assigning new cases to classes for
targeting/diagnostic purposes
– Provide characterization, definition, labeling of
populations
– Decrease the size and complexity of problems for
other data mining methods
– Identify outliers in a specific domain (e.g., rare-event
detection)
Cluster Analysis for Data Mining
(3 of 4)
• Analysis methods
– Statistical methods such as k-means, k-modes, and so
on.
– Neural networks (adaptive resonance theory [ART],
self-organizing map [SOM])
– Fuzzy logic (e.g., fuzzy c-means algorithm)
– Genetic algorithms
• How many clusters?
Cluster Analysis for Data Mining
(4 of 4)
• k-Means Clustering Algorithm
– k: pre-determined number of clusters
– Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k points as initial cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some
convergence criterion is met (usually that the
assignment of points to clusters becomes stable).
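The steps can be sketched in plain Python as follows. Two simplifications are assumptions of this sketch rather than part of the slides: the initial centers are drawn from the data points themselves instead of being generated at random in the feature space, and closeness is measured with squared Euclidean distance. The function names and sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1 (assumed: sample k data points)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # convergence: assignments are stable
            break
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```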
Cluster Analysis for Data Mining -
k-Means Clustering Algorithm
Figure 4.13 A Graphical Illustration of the Steps in the
k-Means Algorithm.
Association Rule Mining (1 of 7)
• A very popular DM method in business
• Finds interesting relationships (affinities) between
variables (items or events)
• Part of the machine-learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis or affinity analysis
Association Rule Mining (2 of 7)
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service plan
70 percent of the time.”
• How do you use such a pattern/knowledge?
– Put the items next to each other
– Promote the items as a package
– Place items far apart from each other!
Association Rule Mining (3 of 7)
• Representative applications of association rule mining
include
– In business: cross-marketing, cross-selling, store
design, catalog design, e-commerce site design,
optimization of online advertising, product pricing, and
sales/promotion configuration
– In medicine: relationships between symptoms and
illnesses; diagnosis and patient characteristics and
treatments (to be used in medical DSS); and genes
and their functions (to be used in genomics projects)
– …
Association Rule Mining (4 of 7)
• Are all association rules interesting and useful?
A Generic Rule: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS) ~ antecedent
Y: Right-hand-side (RHS) ~ consequent
S: Support: how often X and Y go together
C: Confidence: how often Y appears in transactions that contain X
Example:
In the total number of transactions data:
{Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan}
[30%, 70%]
i.e., laptops and antivirus software were present in 30% of total
transactions, and in cases where laptops and antivirus software were
present, an extended service plan was also found 70% of the time.
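A small sketch of how S and C could be computed from transaction data, taking support as the share of all transactions that contain both X and Y (the "S: Support" definition above) and confidence as the share of X-containing transactions that also contain Y. The function names and the ten transactions are invented for illustration and do not reproduce the 30%/70% figures.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(transactions, lhs, rhs):
    both = set(lhs) | set(rhs)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, lhs, rhs):
    both = set(lhs) | set(rhs)
    return count_containing(transactions, both) / count_containing(transactions, lhs)


# Hypothetical point-of-sale data: each set is one transaction.
transactions = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan", "mouse"},
    {"laptop", "antivirus"},
    {"laptop", "mouse"},
    {"printer", "paper"},
    {"printer", "ink"},
    {"mouse"},
    {"paper", "ink"},
    {"laptop", "service plan"},
]
lhs = {"laptop", "antivirus"}
rhs = {"service plan"}
rule_support = support(transactions, lhs, rhs)
rule_confidence = confidence(transactions, lhs, rhs)
print(rule_support, rule_confidence)
```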
Association Rule Mining (5 of 7)
Association Rule Mining (6 of 7)
• Several algorithms have been developed for discovering
(identifying) association rules
– Apriori
– Eclat
– FP-Growth
– + Derivatives and hybrids of the three
• The algorithms help identify the frequent item sets, which
are then converted to association rules
Association Rule Mining (7 of 7)
• Apriori Algorithm
– Finds subsets that are common to at least a minimum
number of the itemsets
– Uses a bottom-up approach: frequent subsets are extended
one item at a time (the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item
subsets, and so on), and groups of candidates at each level
are tested against the data for minimum support (see the
figure).
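The bottom-up search can be sketched as below. This is a compact illustration, not a tuned implementation: minimum support is expressed as a transaction count, candidates are built by joining frequent itemsets from the previous level, and a candidate is kept only if all of its smaller subsets were frequent. The function name and the five transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset: number of transactions containing it}."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # one-item subsets
    frequent = {}
    size = 1
    while candidates:
        # Test this level's candidates against the data for minimum support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        size += 1
        # Extend frequent subsets by one item to form the next candidates.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Keep a candidate only if every (size - 1)-item subset is frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The frequent itemsets returned here are the raw material that a later step would convert into association rules.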
Association Rule Mining Apriori
Algorithm
Figure 4.14 A Graphical Illustration of the Steps in the
Apriori Algorithm.
Data Mining Software Tools
Figure 4.15 Popular Data Mining Software Tools (Poll Results).
• Commercial
– IBM SPSS Modeler
(formerly Clementine)
– SAS Enterprise Miner
– Statistica - Dell/Statsoft
– … many more
• Free and/or Open Source
– KNIME
– RapidMiner
– Weka
– R, …
Application Case 4.6 (2 of 4)
Data Mining Goes to Hollywood: Predicting
Financial Success of Movies
A Typical Classification Problem
Table 4.3 Movie Classification based on Receipts
Class No.   Range (in millions of dollars)
1           < 1 (Flop)
2           > 1, < 10
3           > 10, < 20
4           > 20, < 40
5           > 40, < 65
6           > 65, < 100
7           > 100, < 150
8           > 150, < 200
9           > 200 (Blockbuster)
Application Case 4.6 (3 of 4)
Data Mining Goes to Hollywood: Predicting
Financial Success of Movies
Figure 4.16 Process Flow Screenshot for the Box-Office Prediction System.