Data Mining - Bi 3
(Continued…)
Learning Objectives
[Figure: Data mining at the intersection of Statistics, Artificial Intelligence, Pattern Recognition, Machine Learning, Mathematical Modeling, and Databases]
[Figure: Data categories: Structured; Unstructured or Semi-Structured]
• Types of patterns
– Association
– Prediction
– Cluster (segmentation)
– Sequential (or time series) relationships
Application Case 5.2
Harnessing Analytics to Combat Crime:
Predictive Analytics Helps Memphis
Police Department Pinpoint Crime and
Focus Police Resources
Questions for Discussion
1. How did the Memphis Police Department use data
mining to better combat crime?
2. What were the challenges, the proposed solution,
and the obtained results?
A Taxonomy for Data Mining Tasks
[Table columns: Data Mining | Learning Method | Popular Algorithms]
• Types of DM
– Hypothesis-driven data mining
– Discovery-driven data mining
Data Mining Applications
• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
• Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
Data Mining Applications
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel industry
• Healthcare
• Medicine
(Healthcare and medicine: increasingly more popular application areas for data mining)
• Entertainment industry
• Sports
• Etc.
Application Case 5.3
A Mine on Terrorist Funding
Questions for Discussion
1. How can data mining be used to fight terrorism?
Comment on what else can be done beyond what is
covered in this short application case.
2. Do you think data mining, while essential for
fighting terrorist cells, also jeopardizes individuals’
rights of privacy?
Data Mining Process
• A manifestation of best practices
• A systematic way to conduct DM projects
• Different groups have different versions
• Most common standard processes:
– CRISP-DM (Cross-Industry Standard Process for
Data Mining)
– SEMMA (Sample, Explore, Modify, Model, and
Assess)
– KDD (Knowledge Discovery in Databases)
Data Mining Process: CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
[Figure: the six steps form a cycle around the Data Sources]
Data Mining Process: CRISP-DM
• Data Consolidation
– Collect data
– Select data
– Integrate data
• Data Transformation
– Normalize data
– Discretize/aggregate data
– Construct new attributes
• Result: well-formed data
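The transformation steps named above (normalize, discretize, construct new attributes) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, field names, bin boundaries, and helper functions are all hypothetical, and min-max scaling is only one of several ways to normalize.

```python
# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"age": 23, "income": 20000.0, "visits": 4},
    {"age": 41, "income": 50000.0, "visits": 10},
    {"age": 67, "income": 80000.0, "visits": 2},
]


def min_max_normalize(values):
    """Rescale numeric values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper cut point exceeds value.

    labels must have exactly one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Normalize data: income rescaled to the range [0, 1].
income_norm = min_max_normalize([r["income"] for r in records])

for record, norm in zip(records, income_norm):
    record["income_norm"] = norm
    # Discretize data: numeric age mapped to an (assumed) age group.
    record["age_group"] = discretize(
        record["age"], [30, 50], ["young", "middle", "senior"]
    )
    # Construct a new attribute from two existing ones.
    record["income_per_visit"] = record["income"] / record["visits"]

for record in records:
    print(record)
```

The cut points (30 and 50) are arbitrary here; in practice they would come from the business and data understanding steps.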
Data Mining Process: SEMMA
• Sample: generate a representative sample of the data
• Explore: visualization and basic description of the data
• Modify: select variables, transform variable representations
• Model: use a variety of statistical and machine learning models
• Assess: evaluate the accuracy and usefulness of the models
Application Case 5.4
Data Mining in Cancer Research
Questions for Discussion
1. How can data mining be used for ultimately
curing illnesses like cancer?
2. What do you think are the promises and major
challenges for data miners in contributing to
medical and biological research endeavors?
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employs supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or ordinal)
in nature
• Classification versus regression?
• Classification versus clustering?
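To make "learn from past data, classify new data" concrete, the sketch below uses a 1-nearest-neighbor rule. That rule is chosen here only because it is short; the slides do not single it out. The training points, the class labels, and the `classify` helper are hypothetical.

```python
import math

# Hypothetical labeled "past data": (feature vector, categorical class label).
training = [
    ((1.0, 1.0), "low_risk"),
    ((1.5, 2.0), "low_risk"),
    ((5.0, 5.5), "high_risk"),
    ((6.0, 5.0), "high_risk"),
]


def classify(point, labeled_examples):
    """Return the label of the training example closest to point (1-nearest-neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]


# Classify new, unlabeled observations.
print(classify((1.2, 1.4), training))
print(classify((5.5, 5.2), training))
```

The output is a category rather than a number, which is what separates classification from regression. The labels are supplied with the training data (supervised learning), which is what separates it from clustering.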
Assessment Methods for Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Accuracy of Classification Models
• In classification problems, the primary source for
accuracy estimation is the confusion matrix
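Confusion matrix (columns: True Class; rows: Predicted Class):

                        True Positive                True Negative
Predicted Positive      True Positive Count (TP)     False Positive Count (FP)
Predicted Negative      False Negative Count (FN)    True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$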
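The confusion-matrix measures above can be computed directly from lists of actual and predicted labels. The sketch below is illustrative: the label lists and both helper functions are hypothetical, and it assumes every denominator is non-zero (each class appears at least once among the actual and the predicted labels).

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN, treating the label `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Apply the accuracy, rate, precision, and recall formulas to the four counts.

    Assumes all denominators are non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical test-set labels.
actual = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "yes"]

counts = confusion_counts(actual, predicted, positive="yes")
metrics = classification_metrics(*counts)
print(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and true positive rate share the same formula, so they always agree.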
Estimation Methodologies for Classification
[Figure: Simple split. Preprocessed data is divided into training data (2/3), used for model development to produce a classifier, and testing data (1/3), used for model assessment (scoring) to estimate prediction accuracy.]
– For ANN, the data is split into three sub-sets (training [~60%],
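The 2/3 training and 1/3 testing split can be sketched as follows. The `holdout_split` helper, the seed, and the stand-in data are hypothetical. Shuffling before the cut is an assumption made so that neither subset inherits the original row order.

```python
import random


def holdout_split(rows, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of rows and cut it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = list(range(30))  # stand-in for 30 preprocessed records
train, test = holdout_split(data)
print(len(train), len(test))
```

The classifier is built on `train` only; `test` is held back and used once, to estimate prediction accuracy on data the model has not seen.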
[Figure: plot with both axes running from 0 to 1 and three curves labeled A, B, and C]
Source: KDNuggets.com
Big Data Software Tools and Platforms
Apache Hadoop/Hbase/Pig/Hive (67)
SQL (185)
Java (138)
Python (119)
C/C++ (66)
Other languages (57)
Perl (37)
Awk/Gawk/Shell (31)
F# (5)