Introduction to Data Mining
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Outline:
A Categorization of Analytical Methods
Prescriptive Analytics: What should happen?
What is Data Mining?
◦ According to McKinsey Global Institute (MGI)
Most American companies with more than 1000 employees have 200 TB
of data, increasing 40% annually
Retailers could expect to realize an increase in their operating margin of
more than 60%
Why Data Mining?
• Other examples
– Bank of America, West Coast customer service call center (source:
CIO Magazine)
• 13 million customer calls per month; in the past, all callers were offered
the same products/services
• Now, with access to each customer’s individual profile, customer service
representatives offer new products or services that may be of greatest
interest to him/her
– Supermarkets
• Each cash-register product scan collected helps to build a profile about
the shopping habits of your family, and the other families who are
checking out
The Need for Human Direction of Data Mining
– Some early data mining definitions described process as
“automatic”
– “…this has misled many people into believing data mining is a
product that can be bought rather than a discipline that must be
mastered.” (Berry, Linoff)
– Automation no substitute for human input
– Data mining is easy to do badly
– Humans need to be actively involved in every phase of data
mining process
– Task of data mining should be integrated into human process of
problem solving
Cross Industry Standard Process: CRISP-DM
CRISP-DM Lifecycle
Cross Industry Standard Process: CRISP-DM
(1) Business/Research Understanding Phase
◦ Define project requirements and objectives
◦ Translate objectives into data mining problem definition
◦ Prepare preliminary strategy to meet objectives
(2) Data Understanding Phase
◦ Collect data
◦ Perform exploratory data analysis (EDA)
◦ Assess data quality
◦ Optionally, select interesting subsets
(3) Data Preparation Phase
◦ Prepares for modeling in subsequent phases
◦ Select cases and variables appropriate for analysis
◦ Cleanse and prepare data so it is ready for modeling tools
◦ Perform transformation of certain variables, if needed
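The data preparation steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the `income` field, and the `prepare` helper are hypothetical, dropping incomplete records is just one possible cleansing choice, and min-max scaling is just one possible transformation.

```python
def prepare(records, field):
    """Drop records missing `field`, then min-max scale that field to [0, 1]."""
    # Cleansing (one possible choice): keep only records with a value present.
    complete = [r for r in records if r.get(field) is not None]
    if not complete:
        return []
    values = [r[field] for r in complete]
    lo, hi = min(values), max(values)
    span = hi - lo
    prepared = []
    for r in complete:
        scaled = dict(r)  # copy so the raw record is left untouched
        # Transformation (one possible choice): min-max normalization.
        scaled[field] = (r[field] - lo) / span if span else 0.0
        prepared.append(scaled)
    return prepared


# Hypothetical raw data with one missing income value.
raw = [
    {"id": 1, "income": 30000},
    {"id": 2, "income": None},
    {"id": 3, "income": 90000},
    {"id": 4, "income": 60000},
]
clean = prepare(raw, "income")
for row in clean:
    print(row)
```

Whether incomplete records should be dropped or repaired depends on the project; the sketch only shows that cleansing and transformation happen before any modeling tool sees the data.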
Cross Industry Standard Process: CRISP-DM
• (4) Modeling Phase
– Select and apply one or more modeling techniques
– Calibrate model parameters to optimize results
– If necessary, additional data preparation may be required for supporting a
particular technique
• (5) Evaluation Phase
– Evaluate one or more models for effectiveness
– Determine whether defined objectives achieved
– Establish whether some important facet of the problem has not been
sufficiently accounted for
– Make decision regarding data mining results before deploying to field
• (6) Deployment Phase
– Make use of models created
– Simple deployment example: generate report
– Complex deployment example: implement parallel data mining effort in
another department
– In businesses, customer often carries out deployment based on your model
Fallacies of Data Mining
• Five Fallacies of Data Mining (Louie, Nautilus Systems, Inc.)
1. Fallacy: Data mining process is autonomous; requires little oversight
   Reality:
   – Requires significant intervention during every phase
   – After model deployment, new models require updates
   – Continuous evaluative measures monitored by analysts
2. Fallacy: Data mining quickly pays for itself
   Reality:
   – Return rates vary
   – Depending on startup, personnel, data preparation costs, etc.
3. Fallacy: Data mining software easy to use
   Reality:
   – Ease of use varies across projects
   – Analysts must combine subject matter knowledge with specific problem domain
4. Fallacy: Data mining automatically cleans data in databases
   Reality:
   – Data mining often uses data from legacy systems
   – Data possibly not examined or used in years
   – Organizations starting data mining efforts confronted with huge data preprocessing task
5. Fallacy: Data mining always provides positive results
   Reality:
   – There is no guarantee of positive results
   – But used properly, data mining can provide actionable and highly profitable results
What Tasks Can Data Mining Accomplish?
• Six common data mining tasks
– Description
– Estimation
– Classification
– Prediction
– Clustering
– Association
What Tasks Can Data Mining Accomplish? (cont’d)
1. Description
– Describes patterns or trends in data
– Data mining models should be transparent
• That is, results should be interpretable by humans
• Some data mining methods more transparent than others
– Decision Trees (Transparent)
– Neural Networks (Blackbox)
What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (1/3)
◦ Target variable is numeric
◦ Models built from complete data records
Records include values for each predictor field and numeric
target variable in training set
◦ For new observations, estimate the target variable
What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (2/3) – Further examples
What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (3/3)
– The figure shows scatter plot of graduate GPA against undergraduate GPA (1000 students)
– Linear regression finds line (blue) best approximating relationship between two variables
– Regression line estimates student’s graduate GPA based on their undergraduate GPA,
resulting in the following model: ŷ = 1.24 + 0.67x
– For example, suppose student’s undergraduate GPA = 3.0
– According to estimation model, estimated student’s graduate GPA = 1.24 + 0.67(3.0) = 3.25
– Point (x = 3.0, ŷ = 3.25) lies on regression line
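The estimation model can be checked with a short sketch. The coefficients 1.24 and 0.67 come from the slide; the function names and the three sample points are hypothetical (the points are chosen to lie exactly on the slide's line and are not the 1000-student data set). The `fit_line` helper shows ordinary least squares for a single predictor, which is one common way such a line is found.

```python
def estimate_graduate_gpa(undergrad_gpa, intercept=1.24, slope=0.67):
    """Apply the slide's model: y-hat = 1.24 + 0.67 * x."""
    return intercept + slope * undergrad_gpa


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Estimate for an undergraduate GPA of 3.0, as in the slide.
estimate = estimate_graduate_gpa(3.0)
print(round(estimate, 2))  # 3.25

# Hypothetical points generated from the slide's line, so the fit recovers it.
xs = [2.0, 3.0, 4.0]
ys = [estimate_graduate_gpa(x) for x in xs]
intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

With real data the points scatter around the line rather than lying on it, and the fitted coefficients are the ones that minimize the sum of squared vertical distances.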
What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (1/4)
◦ Similar to Estimation task, except target variable is categorical
a) Use training data to develop model that classifies Income Bracket based
on predictor variables
What Tasks Can Data Mining Accomplish? (cont’d)
Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software Engineer     High
002      28   M       Marketing Consultant  Middle
003      35   M       Unemployed            Low
…        …    …       …                     …
What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (3/4) – The drug prescription example
• Interested in classifying the type of drug a patient should be prescribed, based on age
of the patient, and the patient’s sodium / potassium ratio
• Scatter plot of 200 patients with their sodium/potassium ratios against age, and the
particular drug prescribed by the shade of the points
• What drug should be prescribed for:
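One way such a question might be answered for a new patient is a k-nearest neighbor classifier, sketched below. This is an assumption for illustration, not necessarily the method behind the slide's figure: the patient records, the drug labels "A" and "B", and the `knn_classify` helper are all hypothetical, and the distance is plain Euclidean distance on unscaled age and sodium/potassium ratio.

```python
import math
from collections import Counter


def knn_classify(training, new_point, k=3):
    """Return the majority label among the k training records nearest to new_point."""
    # Sort training records by Euclidean distance to the new observation.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], new_point))
    # Majority vote among the k closest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: ((age, sodium/potassium ratio), prescribed drug).
patients = [
    ((23, 25.0), "A"),
    ((30, 28.0), "A"),
    ((35, 27.0), "A"),
    ((60, 9.0), "B"),
    ((65, 11.0), "B"),
    ((58, 10.0), "B"),
]

young_high_ratio = knn_classify(patients, (28, 26.0))
older_low_ratio = knn_classify(patients, (62, 10.0))
print(young_high_ratio, older_low_ratio)
```

In practice the two predictors would usually be rescaled first, since age and the ratio are measured on different scales and the larger-valued one would otherwise dominate the distance.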
What Tasks Can Data Mining Accomplish? (cont’d)
4. Prediction
◦ Similar to classification and estimation, except results lie in the future
◦ Methods used for estimation and prediction
   Include point estimation, confidence interval estimation, linear regression
   and correlation, multiple regression, k-nearest neighbor, decision trees and
   neural networks
Example prediction tasks in business and research:
◦ Predict price of stock 3 months into future, based on past performance
◦ Predict percentage increase in traffic deaths next year, if speed limit increased
◦ Predict whether molecule in newly discovered drug leads to profitable
   pharmaceutical drug
[Figure: chart with quarters Q1 to Q4 on the horizontal axis]
What Tasks Can Data Mining Accomplish? (cont’d)
5. Clustering
– Refers to grouping records into classes of similar objects
– Cluster – a collection of records similar to one another, and dissimilar to
records in other clusters
– Clustering algorithm seeks to segment data set into homogeneous
subgroups
– Target variable not specified
• Clustering does not try to classify/estimate/predict target variable
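The idea of segmenting records into homogeneous subgroups without a target variable can be sketched with k-means, one widely used clustering algorithm. The slides do not name a specific algorithm here, so k-means is an assumption, as are the sample points, the starting centroids, and the helper names.

```python
import math


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of iterations."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # New centroid = coordinate-wise mean of the cluster's points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Hypothetical records with two numeric fields and no target variable.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

No label is supplied anywhere in the sketch: the two groups emerge only from how close the records are to one another, which is what distinguishes clustering from classification.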
What Tasks Can Data Mining Accomplish? (cont’d)
6. Association (2/2) - Association Tasks in Business and
Research:
The slides are derived from the following publisher instructor
material. This work is protected by United States copyright laws
and is provided solely for the use of instructors in teaching
their courses and assessing student learning. Dissemination or
sale of any part of this work will destroy the integrity of the
work and is not permitted. All recipients of this work are
expected to abide by these restrictions.
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.