Chapter 1
Business
Introduction to
Data Mining
What is Data Mining?
• Data mining is the process of discovering useful patterns and trends
in large data sets
• United States 2012 Presidential Elections (source: MIT
Technology Review)
The campaign first identified likely Obama voters using a data mining model, and
then made sure that these voters actually got to the polls
It also used a separate data mining model to predict the polling outcomes
county-by-county
Hamilton County, Ohio: the model predicted 56.4% for Obama; the actual result
was 56.6%, so the prediction was off by only 0.2 percentage points
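The size of the miss in the Ohio example is plain arithmetic. A quick check, using only the two percentages quoted on the slide (the variable names are illustrative):

```python
# Percentages quoted on the slide for the Ohio county example.
predicted_pct = 56.4
actual_pct = 56.6

# Absolute error, in percentage points.
error_points = round(abs(actual_pct - predicted_pct), 1)

# Error relative to the actual result, as a percentage of it.
relative_error_pct = round(100 * error_points / actual_pct, 2)

print(error_points)
print(relative_error_pct)
```

The absolute gap is two tenths of a percentage point; expressed relative to the actual vote share it is still well under one percent.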
Why Data Mining?
(cont’d)
• Other examples
• Bank of America, West Coast customer service call center
(source: CIO Magazine)
13 million customer calls per month – in the past all callers were
offered the same products/services
Now, with access to each customer’s individual profile, customer service
representatives offer new products or services that may be of greatest
interest to that customer
• Supermarkets
Each cash-register product scan collected helps to build a profile about the
shopping habits of your family, and the other families who are checking out
Wanted: Data Miners
“We are drowning in information but starved for knowledge.”
-Megatrends, John Naisbitt
We are inundated with data in most fields, but…
There are not enough trained human analysts available who are skilled
to convert the data into knowledge
According to a McKinsey report
◦ “There will be a shortage of talent…”
◦ “…particularly of people with deep expertise in statistics and
machine learning, and the managers and analysts who know how to
operate companies by using insights from big data.”
◦ Demand for talent to exceed supply “…by 140,000 to 190,000
positions”
◦ “… we project a need for 1.5 million additional managers and
analysts in the United States”
• Factors
• Explosive growth in data collection, as in supermarket scanners
• Storing the data in data warehouses
• Increased access to data from web navigation and intranets
• Competitive pressure to increase market share in globalized economy
• Growth of computing power and storage capacity
The Need for Human
Direction of Data Mining
• Some early data mining definitions described process as
“automatic”
• “…this has misled many people into believing data mining is
a product that can be bought rather than a discipline that must be
mastered.” (Berry, Linoff)
• Automation no substitute for human input
• Data mining is easy to do badly
• Understanding statistical and mathematical model structures of
underlying software required
• Humans need to be actively involved in every phase of data
mining process
• Task of data mining should be integrated into human process of
problem solving
Cross Industry Standard Process: CRISP-DM
(Developed in 1996. Fits data mining into the general problem-solving
strategy of business/research unit)
• Iterative CRISP-DM process shown in outer circle
• Dependencies between phases shown
• Next phase depends on results from preceding phase
• Returning to earlier phase possible before moving forward
[Figure: CRISP-DM Lifecycle, with labels including Data Preparation Phase, Modeling Phase, Evaluation Phase, Deployment Phase]
Cross Industry Standard
Process: CRISP-DM (cont’d)
(1) Business/Research Understanding Phase
• Define project requirements and objectives
• Translate objectives into data mining problem definition
• Prepare preliminary strategy to meet objectives
(2) Data Understanding Phase
• Collect data
• Perform exploratory data analysis (EDA)
• Assess data quality
• Optionally, select interesting subsets
(3) Data Preparation Phase
• Prepares for modeling in subsequent phases
• Select cases and variables appropriate for analysis
• Cleanse and prepare data so it is ready for modeling tools
• Perform transformation of certain variables, if needed
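The three data preparation steps above (select, cleanse, transform) can be sketched on a toy data set. Everything here is an assumption for illustration: the records, the field names `age` and `income`, the choice to drop incomplete cases, and the use of z-scores as the transformation.

```python
from statistics import mean, pstdev

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "income": 52000, "notes": "a"},
    {"id": 2, "age": None, "income": 61000, "notes": "b"},
    {"id": 3, "age": 45, "income": 75000, "notes": "c"},
    {"id": 4, "age": 29, "income": 48000, "notes": "d"},
]

# 1. Select the variables appropriate for analysis.
variables = ["age", "income"]
selected = [{v: r[v] for v in variables} for r in raw]

# 2. Cleanse: here, simply drop cases with any missing value.
clean = [r for r in selected if all(r[v] is not None for v in variables)]

# 3. Transform: z-score each variable so both are on a comparable scale.
def standardize(records, name):
    values = [r[name] for r in records]
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

prepared = {v: standardize(clean, v) for v in variables}
print(prepared)
```

Dropping incomplete cases is only one way to cleanse; imputing a replacement value is a common alternative when too many records would otherwise be lost.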
Cross Industry Standard
Process: CRISP-DM (cont’d)
• (4) Modeling Phase
• Select and apply one or more modeling techniques
• Calibrate model settings to optimize results
• If necessary, additional data preparation may be required
for supporting a particular technique
• (5) Evaluation Phase
• Evaluate one or more models for effectiveness
• Determine whether defined objectives achieved
• Establish whether some important facet of the problem has
not been sufficiently accounted for
• Make decision regarding data mining results before
deploying to field
Fallacies of Data Mining
• Four Fallacies of Data Mining (Louie,
Nautilus Systems, Inc.)
Fallacy 1
• Set of tools can be turned loose on data repositories
• Finds answers to all business problems
Reality
• No automatic data mining tools solve problems
• Rather, data mining is a process (CRISP-DM)
• Integrates into overall business objectives
What Tasks Can Data Mining
Accomplish?
• Description
• Estimation
• Prediction
• Classification
• Clustering
• Association
What Tasks Can Data Mining
Accomplish? (cont’d)
1. Description
• Describes patterns or trends in data
• Data mining models should be transparent
• That is, results should be interpretable by humans
• For example, Decision Trees (transparent) > Neural Networks (opaque)
• High-quality description accomplished using Exploratory Data
Analysis (EDA)
• Graphical method of exploring patterns and trends in data
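EDA is mostly graphical, but the idea can be sketched without a plotting library. The snippet below bins a made-up list of ages by decade and prints a crude text histogram; both the data and the decade-wide bins are assumptions for illustration.

```python
from collections import Counter

# Hypothetical ages for a handful of customers.
ages = [23, 27, 31, 35, 36, 38, 41, 44, 47, 52, 58, 63]

# Bin each age into its decade (20s, 30s, ...).
bins = Counter((age // 10) * 10 for age in ages)

# Print one row of asterisks per bin: a crude text histogram.
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'*' * bins[start]}")
```

Even this rough picture shows where the records concentrate, which is the kind of pattern the description task aims to surface.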
What Tasks Can Data Mining
Accomplish? (cont’d)
2. Estimation
• Similar to Classification task, except target variable is numeric
• Models built from complete data records
• Records include values for each predictor field and numeric target variable
in training set
• For new observations, estimate the target variable
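As a minimal sketch of the estimation task, the snippet below fits a straight line by ordinary least squares to a few made-up training records and then estimates the target for a new observation. The slide does not name a technique; simple linear regression, the data values, and the function name `estimate` are assumptions chosen for illustration.

```python
# Hypothetical training records: (predictor value, numeric target value).
training = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.0)]

xs = [x for x, _ in training]
ys = [y for _, y in training]
n = len(training)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares for a single predictor.
slope = sum((x - x_bar) * (y - y_bar) for x, y in training) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

def estimate(x_new):
    """Estimate the numeric target for a new observation."""
    return intercept + slope * x_new

print(round(estimate(6.0), 2))
```

The model is built only from complete records (every pair has both a predictor and a target), and it is then applied to a predictor value that was not in the training set.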
What Tasks Can Data Mining
Accomplish? (cont’d)
• Estimation – Further examples
• Estimate amount of money a family of four will spend on back-to-school
shopping
• basketball player
• Estimate CGPA
What Tasks Can Data Mining
Accomplish? (cont’d)
3. Prediction
• Similar to
Example prediction tasks in business and research: ?
4. Classification
• Similar to Estimation task, except target variable is categorical
• Example: Classify the Income Bracket of an individual as Low,
Middle or High based on their Age, Gender and Occupation
• Use training data to develop model that classifies Income Bracket based
on predictor variables
• Apply model to cases not currently in the database, to obtain estimated
Income Bracket classification
What Tasks Can Data Mining
Accomplish? (cont’d)
• Classification – Example in detail
Using the training data set, the algorithm would:
Examine the data set containing both the predictor variables and the (already
classified) target variable, income bracket
Algorithm (software) “learns about” which combinations of variables are
associated with which income brackets (for example, Older females -> High
Income)
Then, when looking at new records with no income information,
the algorithm would:
Based on the classifications in the training set, assign classifications to the
new records (for example, 63-year-old female professor -> High)
Subject   Age   Gender   Occupation          Income Bracket
001       47    F        Software Engineer   High
…         …     …        …                   …
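The train-then-classify flow described above can be sketched with a one-nearest-neighbor rule. This is a stand-in technique, not one named on the slide. Only the first training row comes from the slide's table; the other rows, the distance function, and its scaling of age by 10 are assumptions made up for illustration.

```python
# Training records: (age, gender, occupation, income bracket).
# Only the first row appears on the slide; the rest are made up.
training = [
    (47, "F", "Software Engineer", "High"),
    (28, "M", "Marketing Consultant", "Middle"),
    (35, "M", "Unemployed", "Low"),
    (61, "F", "Professor", "High"),
    (24, "F", "Student", "Low"),
]

def distance(a, b):
    """Mixed distance: scaled age gap plus 1 per categorical mismatch."""
    age_gap = abs(a[0] - b[0]) / 10
    mismatches = (a[1] != b[1]) + (a[2] != b[2])
    return age_gap + mismatches

def classify(record):
    """Return the income bracket of the closest training record."""
    nearest = min(training, key=lambda row: distance(row[:3], record))
    return nearest[3]

# A new record with no income information.
print(classify((63, "F", "Professor")))
```

With these toy rows, the 63-year-old female professor lands closest to the 61-year-old professor in the training set and so inherits that record's bracket.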
Figure 1.3 - Which Drug Should Be Prescribed for Which Type of Patient?
[Scatter plot. Axes: Na / K (sodium / potassium ratio) and Age]
Dot Legend:
Light gray – Drug Y
Medium gray – Drugs A or X
Dark gray – Drugs B or C
What Tasks Can Data Mining
Accomplish? (cont’d)
5. Clustering
• Refers to grouping records into classes of similar objects
• Cluster – a collection of records similar to one another, and
dissimilar to records in other clusters
• Clustering algorithm seeks to segment data set into
homogeneous subgroups
• Target variable not specified
• Clustering does not try to classify/estimate/predict target variable
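One common way to segment records into homogeneous subgroups is k-means, sketched below in plain Python. The slide does not name an algorithm, so k-means, the six 2-D points, and the choice of two starting centroids are all assumptions for illustration.

```python
# Hypothetical 2-D records (e.g. two standardized attributes per customer).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centroids]
    return dists.index(min(dists))

def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each record to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its records.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Start from two of the records as initial centroids (an arbitrary choice).
labels, centroids = k_means(points, [points[0], points[3]])
print(labels)
```

No target variable appears anywhere in the code: the grouping comes only from how similar the records are to one another.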
Nielsen Claritas’ PRIZM
segmentation system
01 Upper Crust            02 Blue Blood Estates      03 Movers and Shakers
04 Young Digerati         05 Country Squires         06 Winner’s Circle
07 Money and Brains       08 Executive Suites        09 Big Fish, Small Pond
10 Second City Elite      11 God’s Country           12 Brite Lites, Little City
13 Upward Bound           14 New Empty Nests         15 Pools and Patios
16 Bohemian Mix           17 Beltway Boomers         18 Kids and Cul-de-sacs
19 Home Sweet Home        20 Fast-Track Families     21 Gray Power
22 Young Influentials     23 Greenbelt Sports        24 Up-and-Comers
25 Country Casuals        26 The Cosmopolitans       27 Middleburg Managers
28 Traditional Times      29 American Dreams         30 Suburban Sprawl
31 Urban Achievers        32 New Homesteaders        33 Big Sky Families
34 White Picket Fences    35 Boomtown Singles        36 Blue-Chip Blues
37 Mayberry-ville         38 Simple Pleasures        39 Domestic Duos
40 Close-in Couples       41 Sunset City Blues       42 Red, White and Blues
43 Heartlanders           44 New Beginnings          45 Blue Highways
46 Old Glories            47 City Startups           48 Young and Rustic
49 American Classics      50 Kid Country, USA        51 Shotguns and Pickups
52 Suburban Pioneers      53 Mobility Blues          54 Multi-Culti Mosaic
55 Golden Ponds           56 Crossroads Villagers    57 Old Milltowns
58 Back Country Folks     59 Urban Elders            60 Park Bench Seniors
61 City Roots             62 Hometown Retired        63 Family Thrifts
64 Bedrock America        65 Big City Blues          66 Low-Rise Living
Table 1.2 The 66 clusters used by the PRIZM segmentation system.
• Clusters for zip code 90210, Beverly Hills, California are:
• #01: Upper Crust
• #03: Movers and Shakers
• #04: Young Digerati
• #07: Money and Brains
• #16: Bohemian Mix
• The description for Cluster # 01: Upper Crust
• The nation’s most exclusive address
• The wealthiest lifestyle in America
• Haven for empty-nesting couples between the ages of 45 and 64
• Highest concentration of residents earning over $100,000/year
• Most opulent standard of living
What Tasks Can Data Mining
Accomplish? (cont’d)
6. Association
• Find out which attributes “go together”
• Commonly used for Market Basket Analysis (aka Affinity
Analysis)
• Quantify relationships between two or more attributes in the
form of rules as:
IF antecedent THEN consequent
• Rules measured using support and confidence
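Support and confidence for a rule of the form IF antecedent THEN consequent can be computed directly from transaction data. In the sketch below, support is the share of all baskets containing both sides of the rule, and confidence is the share of baskets containing the antecedent that also contain the consequent. The five baskets and the item names are made up for illustration.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(antecedent, consequent):
    """Share of all baskets that contain both sides of the rule."""
    both = antecedent | consequent
    return sum(both <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    both = antecedent | consequent
    return sum(both <= b for b in with_antecedent) / len(with_antecedent)

# Rule: IF diapers THEN beer.
rule = ({"diapers"}, {"beer"})
print(support(*rule), confidence(*rule))
```

For this toy data, two of the five baskets contain both items, and two of the three baskets with diapers also contain beer, so the rule has support 2/5 and confidence 2/3.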