
Chapter 1

Data mining is the process of discovering useful patterns and trends in large data sets. It allows organizations to better understand their data, make predictions about future events, and identify opportunities. The document discusses why data mining is important for applications like predicting election outcomes, customizing customer service, and optimizing supermarket sales. It also outlines the Cross Industry Standard Process for Data Mining (CRISP-DM), an iterative 6-phase process for conducting data mining projects. The tasks that can be accomplished through data mining include description, estimation, prediction, classification, clustering, and association. Human guidance is still needed throughout the data mining process.

Data Mining for Business
Introduction to Data Mining
What is Data Mining?
• Data mining is the process of discovering useful patterns and trends in large data sets
• United States 2012 Presidential Elections (source: MIT Technology Review)
◦ The Obama campaign first identified likely Obama voters using a data mining model, and then made sure that these voters actually got to the polls
◦ It used a separate data mining model to predict the polling outcomes county-by-county
◦ Hamilton County, Ohio: the model predicted 56.4% for Obama; the actual result was 56.6%, so the prediction was off by only 0.2 percentage points
Why Data Mining?
(cont’d)
• Other examples
• Bank of America, West Coast customer service call center (source: CIO Magazine)
◦ 13 million customer calls per month; in the past, all callers were offered the same products/services
◦ Now, with access to each customer’s individual profile, customer service representatives offer the new products or services that may be of greatest interest to him/her
• Supermarkets
◦ Each cash-register product scan helps to build a profile of the shopping habits of your family and of the other families who are checking out
Wanted: Data Miners
“We are drowning in information but starved for knowledge.”
- Megatrends, John Naisbitt

• We are inundated with data in most fields, but…
• There are not enough trained human analysts available who are skilled at converting the data into knowledge
• According to McKinsey Report
◦ “There will be a shortage of talent…”
◦ “…particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.”
◦ Demand for talent to exceed supply “…by 140,000 to 190,000 positions”
◦ “… we project a need for 1.5 million additional managers and analysts in the United States”
• Factors
◦ Explosive growth in data collection, as in supermarket scanners
◦ Storing the data in data warehouses
◦ Increased access to data from web navigation and intranets
◦ Competitive pressure to increase market share in globalized economy
◦ Growth of computing power and storage capacity
The Need for Human
Direction of Data Mining
• Some early data mining definitions described the process as “automatic”
• “…this has misled many people into believing data mining is a product that can be bought rather than a discipline that must be mastered.” (Berry, Linoff)
• Automation is no substitute for human input
• Data mining is easy to do badly
• Understanding of the statistical and mathematical model structures underlying the software is required
• Humans need to be actively involved in every phase of the data mining process
• The task of data mining should be integrated into the human process of problem solving
Cross Industry Standard Process: CRISP-DM
(Developed in 1996. Fits data mining into the general problem-solving strategy of a business/research unit)

• Iterative CRISP-DM process shown in outer circle
• Most significant dependencies between phases shown
• Next phase depends on results from preceding phase
• Returning to earlier phase possible before moving forward

[Figure: CRISP-DM Lifecycle. Six phases arranged in a cycle: Business / Research Understanding Phase, Data Understanding Phase, Data Preparation Phase, Modeling Phase, Evaluation Phase, Deployment Phase]
Cross Industry Standard
Process: CRISP-DM (cont’d)
(1) Business/Research Understanding Phase
• Define project requirements and objectives
• Translate objectives into data mining problem definition
• Prepare preliminary strategy to meet objectives
(2) Data Understanding Phase
• Collect data
• Perform exploratory data analysis (EDA)
• Assess data quality
• Optionally, select interesting subsets
(3) Data Preparation Phase
• Prepares for modeling in subsequent phases
• Select cases and variables appropriate for analysis
• Cleanse and prepare data so it is ready for modeling tools
• Perform transformation of certain variables, if needed

Cross Industry Standard
Process: CRISP-DM (cont’d)
(4) Modeling Phase
• Select and apply one or more modeling techniques
• Calibrate model settings to optimize results
• If necessary, additional data preparation may be required to support a particular technique
(5) Evaluation Phase
• Evaluate one or more models for effectiveness
• Determine whether defined objectives achieved
• Establish whether some important facet of the problem has not been sufficiently accounted for
• Make decision regarding data mining results before deploying to field
Cross Industry Standard
Process: CRISP-DM (cont’d)

(6) Deployment Phase
• Make use of models created
• Simple deployment example: generate report
• Complex deployment example: implement parallel data mining effort in another department
• In businesses, customer often carries out deployment based on your model
Fallacies of Data Mining
• Four Fallacies of Data Mining (Louie, Nautilus Systems, Inc.)

Fallacy 1
• Set of tools can be turned loose on data repositories
• Finds answers to all business problems
Reality
• No automatic data mining tools solve problems
• Rather, data mining is a process (CRISP-DM)
• Integrates into overall business objectives

Fallacy 2
• Data mining process is autonomous
• Requires little oversight
Reality
• Requires significant intervention during every phase
• After model deployment, new models require updates
• Continuous evaluative measures monitored by analysts

Fallacy 3
• Data mining quickly pays for itself
Reality
• Return rates vary
• Depending on startup, personnel, data preparation costs, etc.

Fallacy 4
• Data mining software easy to use
Reality
• Ease of use varies across projects
• Analysts must combine subject matter knowledge with specific problem domain
What Tasks Can Data Mining
Accomplish?
• Six common data mining tasks

• Description
• Estimation
• Prediction
• Classification
• Clustering
• Association
What Tasks Can Data Mining
Accomplish? (cont’d)

1. Description
• Describes patterns or trends in data
• Data mining models should be transparent
• That is, results should be interpretable by humans
• For example, decision trees (transparent) are preferred over neural networks (opaque)
• High-quality description accomplished using Exploratory Data
Analysis (EDA)
• Graphical method of exploring patterns and trends in data
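
As a rough illustration of the description task, the sketch below summarizes one numeric field and draws a text histogram, a crude stand-in for the graphical EDA the slide refers to. The ages and the function names (summarize, text_histogram) are invented for illustration and do not come from the slides.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def text_histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = {}
    for v in values:
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical customer ages, invented for illustration only.
ages = [23, 25, 31, 35, 36, 41, 44, 47, 52, 58, 63]

stats = summarize(ages)
hist = text_histogram(ages, bin_width=10)

print(stats)
for lower, count in hist.items():
    # One '#' per record that falls in the bin.
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
```

A real project would more likely use a plotting library; plain text is used here only to keep the sketch self-contained.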

What Tasks Can Data Mining
Accomplish? (cont’d)
2. Estimation
• Similar to Classification task, except target variable is numeric
• Models built from complete data records
• Records include values for each predictor field and numeric target variable
in training set
• For new observations, estimate the target variable

• Example: Estimate a patient’s systolic blood pressure, based on patient’s age, gender, body-mass index, and sodium levels
• Use training data to develop model that estimates systolic blood pressure
based on predictor variables
• Apply model to new cases, to obtain estimated systolic blood pressure
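
The two steps above (build a model from training data, then apply it to a new case) can be sketched with simple least-squares regression. This is a simplification under stated assumptions: it uses a single predictor (age) instead of the four listed on the slide, the training values are invented, and fit_line and estimate are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def estimate(model, x):
    """Apply a fitted (intercept, slope) model to a new observation."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training records: patient age and systolic blood pressure.
train_age = [30, 40, 50, 60, 70]
train_bp = [114, 121, 125, 129, 136]

model = fit_line(train_age, train_bp)

# Estimate the numeric target for a new patient aged 55.
new_estimate = estimate(model, 55)
print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}, estimate={new_estimate:.1f}")
```

With more than one predictor, the same idea extends to multiple regression, which is usually done with a statistics library and not by hand.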

What Tasks Can Data Mining
Accomplish? (cont’d)
• Estimation – Further examples
• Estimate the amount of money a family of four will spend on back-to-school shopping
• basketball player
• Estimate CGPA

◦ Statistical analysis uses several estimation methods: point estimation, confidence interval estimation, linear regression and correlation, and multiple regression
What Tasks Can Data Mining
Accomplish? (cont’d)
3. Prediction
• Similar to classification and estimation, except results lie in the future
• Example prediction tasks in business and research:
◦ Predict price of stock 3 months into future, based on past performance
◦ Predict percentage increase in traffic deaths next year, if speed limit increased
◦ Predict the winner of this fall’s World Series, based on a comparison of the team statistics
◦ Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug

[Figure: stock price by quarter (Q1 to Q4), with the future values marked “?”]
What Tasks Can Data Mining
Accomplish? (cont’d)

4. Classification
• Similar to Estimation task, except target variable is categorical
• Example: Classify the Income Bracket of an individual as Low, Middle or High based on their Age, Gender and Occupation
• Use training data to develop model that classifies Income Bracket based
on predictor variables
• Apply model to cases not currently in the database, to obtain estimated
Income Bracket classification

What Tasks Can Data Mining
Accomplish? (cont’d)
• Classification: example in detail
• Using the training data set, the algorithm would:
◦ Examine the data set containing both the predictor variables and the (already classified) target variable, income bracket
◦ “Learn” which combinations of variables are associated with which income brackets (for example, older females -> High income)
• Then, when looking at new records with no income information, the algorithm would:
◦ Based on the classifications in the training set, assign classifications to the new records (for example, 63-year-old female professor -> High)

Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software Engineer     High
002      28   M       Marketing Consultant  Middle
003      35   M       Unemployed            Low
…        …    …       …                     …
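
The slides do not name a specific classification algorithm. As one possible illustration, the sketch below uses a k-nearest-neighbor vote: it "learns" only by storing the labeled training records, then assigns a new record the majority income bracket of its closest neighbors. The first three training records come from the table above; the other four, the distance function, and the choice k = 3 are assumptions made for this example.

```python
from collections import Counter

# (age, gender, occupation, income bracket)
# First three rows are from the slide's table; the rest are hypothetical.
training = [
    (47, "F", "Software Engineer", "High"),
    (28, "M", "Marketing Consultant", "Middle"),
    (35, "M", "Unemployed", "Low"),
    (58, "F", "Professor", "High"),
    (52, "M", "Software Engineer", "High"),
    (24, "F", "Unemployed", "Low"),
    (31, "F", "Marketing Consultant", "Middle"),
]


def distance(a, b):
    """Ad hoc mixed distance: age gap in decades plus 1 per categorical mismatch."""
    age_gap = abs(a[0] - b[0]) / 10
    gender_gap = 0 if a[1] == b[1] else 1
    occupation_gap = 0 if a[2] == b[2] else 1
    return age_gap + gender_gap + occupation_gap


def classify(record, k=3):
    """Return the majority income bracket among the k nearest training records."""
    neighbors = sorted(training, key=lambda row: distance(record, row))[:k]
    votes = Counter(row[3] for row in neighbors)
    return votes.most_common(1)[0][0]


# New records with no income information.
professor = classify((63, "F", "Professor"))
job_seeker = classify((26, "M", "Unemployed"))
print(professor, job_seeker)
```

A decision tree, mentioned earlier as a transparent model, would be another reasonable choice for the same task.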
Figure 1.3 - Which Drug Should Be Prescribed for Which Type of Patient?
[Figure: scatter plot with axes Na / K (sodium / potassium ratio) and Age. Dot legend: light gray = Drug Y; medium gray = Drugs A or X; dark gray = Drugs B or C]
What Tasks Can Data Mining
Accomplish? (cont’d)
5. Clustering
• Refers to grouping records into classes of similar objects
• Cluster – a collection of records similar to one another, and
dissimilar to records in other clusters
• Clustering algorithm seeks to segment data set into
homogeneous subgroups
• Target variable not specified
• Clustering does not try to classify/estimate/predict target variable

• For example, Claritas, Inc. PRIZM software clusters demographic profiles for different geographic areas according to zip code
• It describes every American zip code area in terms of distinct lifestyle types
(see next slide for example)
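
To make "segment the data set into homogeneous subgroups" concrete, here is a minimal k-means sketch. The slides do not say which clustering algorithm PRIZM uses, so k-means is only an illustrative choice; the six two-attribute records and the starting centroids are invented. Note that no target variable appears anywhere in the code.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical records described by two numeric attributes.
records = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

centroids, clusters = kmeans(records, centroids=[(0, 0), (10, 10)])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

In practice the attributes would be standardized first and the number of clusters chosen by the analyst, which is one more place where human direction is needed.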

Nielsen Claritas’ PRIZM segmentation system

Table 1.2 The 66 clusters used by the PRIZM segmentation system.
01 Upper Crust            02 Blue Blood Estates     03 Movers and Shakers
04 Young Digerati         05 Country Squires        06 Winner’s Circle
07 Money and Brains       08 Executive Suites       09 Big Fish, Small Pond
10 Second City Elite      11 God’s Country          12 Brite Lites, Little City
13 Upward Bound           14 New Empty Nests        15 Pools and Patios
16 Bohemian Mix           17 Beltway Boomers        18 Kids and Cul-de-sacs
19 Home Sweet Home        20 Fast-Track Families    21 Gray Power
22 Young Influentials     23 Greenbelt Sports       24 Up-and-Comers
25 Country Casuals        26 The Cosmopolitans      27 Middleburg Managers
28 Traditional Times      29 American Dreams        30 Suburban Sprawl
31 Urban Achievers        32 New Homesteaders       33 Big Sky Families
34 White Picket Fences    35 Boomtown Singles       36 Blue-Chip Blues
37 Mayberry-ville         38 Simple Pleasures       39 Domestic Duos
40 Close-in Couples       41 Sunset City Blues      42 Red, White and Blues
43 Heartlanders           44 New Beginnings         45 Blue Highways
46 Old Glories            47 City Startups          48 Young and Rustic
49 American Classics      50 Kid Country, USA       51 Shotguns and Pickups
52 Suburban Pioneers      53 Mobility Blues         54 Multi-Culti Mosaic
55 Golden Ponds           56 Crossroads Villagers   57 Old Milltowns
58 Back Country Folks     59 Urban Elders           60 Park Bench Seniors
61 City Roots             62 Hometown Retired       63 Family Thrifts
64 Bedrock America        65 Big City Blues         66 Low-Rise Living

• Clusters for zip code 90210, Beverly Hills, California are:
◦ #01: Upper Crust
◦ #03: Movers and Shakers
◦ #04: Young Digerati
◦ #07: Money and Brains
◦ #16: Bohemian Mix
• The description for Cluster #01: Upper Crust
◦ The nation’s most exclusive address
◦ The wealthiest lifestyle in America
◦ Haven for empty-nesting couples between the ages of 45 and 64
◦ Highest concentration of residents earning over $100,000/year
◦ Most opulent standard of living
What Tasks Can Data Mining
Accomplish? (cont’d)

Clustering - Clustering Tasks in Business and Research:
• Use as dimensionality-reduction tool for data set having several hundred inputs
• For gene expression clustering, where very large quantities of genes may exhibit similar behavior
• As preliminary step in data mining
◦ Resulting clusters used as input to different technique downstream, such as neural networks
What Tasks Can Data Mining
Accomplish? (cont’d)

6. Association
• Find out which attributes “go together”
• Commonly used for Market Basket Analysis (aka Affinity Analysis)
• Quantify relationships between two or more attributes in the form of rules such as:
IF antecedent THEN consequent
• Rules measured using support and confidence
• Example: A particular supermarket might find that:
◦ On Thursday night, 200 of 1,000 customers bought diapers, and of those buying diapers, 50 purchased beer
◦ Association Rule: “IF buy diapers, THEN buy beer”
◦ Support = 50/1,000 = 5%, and confidence = 50/200 = 25%
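
The two measures can be checked with a short calculation. Under the usual definitions, support is the share of all transactions that contain both the antecedent and the consequent, and confidence is the share of transactions containing the antecedent that also contain the consequent. The sketch below rebuilds the Thursday-night counts as a list of baskets; the "milk" filler item and the rule_metrics helper are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule IF antecedent THEN consequent."""
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_both / n_total
    confidence = n_both / n_antecedent
    return support, confidence


# 1,000 baskets: 200 contain diapers, and 50 of those also contain beer.
# "milk" is a placeholder item for the remaining 800 baskets.
baskets = (
    [{"diapers", "beer"}] * 50
    + [{"diapers"}] * 150
    + [{"milk"}] * 800
)

support, confidence = rule_metrics(baskets, "diapers", "beer")
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

The function assumes at least one transaction contains the antecedent; otherwise the confidence is undefined and the division fails.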
