Chapter 1

Data mining is the process of discovering useful patterns and trends in large data sets. It allows organizations to better understand their data, make predictions about future events, and identify opportunities. The document discusses why data mining is important for applications like predicting election outcomes, customizing customer service, and optimizing supermarket sales. It also outlines the Cross Industry Standard Process for Data Mining (CRISP-DM), an iterative 6-phase process for conducting data mining projects. The tasks that can be accomplished through data mining include description, estimation, prediction, classification, clustering, and association. Human guidance is still needed throughout the data mining process.

Uploaded by

shiva kulshrestha

Data Mining for Business
Introduction to Data Mining
What is Data Mining?
• Data mining is the process of discovering useful patterns and trends in large data sets
• United States 2012 Presidential Elections (source: MIT Technology Review)
◦ The campaign first identified likely Obama voters using a data mining model, and then made sure that these voters actually got to the polls
◦ It used a separate data mining model to predict the polling outcomes county-by-county
◦ Hamilton County, Ohio: the model predicted 56.4% for Obama; the actual result was 56.6%, so the prediction was off by only 0.2 percentage points
Why Data Mining?
(cont’d)
• Other examples
• Bank of America, West Coast customer service call center (source: CIO Magazine)
◦ 13 million customer calls per month; in the past, all callers were offered the same products/services
◦ Now, with access to each customer’s individual profile, customer service representatives offer new products or services that may be of greatest interest to him/her
• Supermarkets
◦ Each cash-register product scan helps to build a profile of the shopping habits of your family, and of the other families who are checking out
Wanted: Data Miners
“We are drowning in information but starved for knowledge.”
-Megatrends, John Naisbitt
• We are inundated with data in most fields, but…
• There are not enough trained human analysts available who are skilled at converting the data into knowledge
• According to McKinsey Report
◦ “There will be a shortage of talent…”
◦ “…particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.”
◦ Demand for talent to exceed supply “…by 140,000 to 190,000 positions”
◦ “… we project a need for 1.5 million additional managers and analysts in the United States”
• Factors
◦ Explosive growth in data collection, as in supermarket scanners
◦ Storing the data in data warehouses
◦ Increased access to data from web navigation and intranets
◦ Competitive pressure to increase market share in globalized economy
◦ Growth of computing power and storage capacity
The Need for Human
Direction of Data Mining
• Some early data mining definitions described process as
“automatic”
• “…this has misled many people into believing data mining is a
product that can be bought rather than a discipline that must be
mastered.” (Berry, Linoff)
• Automation no substitute for human input
• Data mining is easy to do badly
• Understanding statistical and mathematical model structures of
underlying software required
• Humans need to be actively involved in every phase of data
mining process
• Task of data mining should be integrated into human process of
problem solving

Cross Industry Standard Process: CRISP-DM
(developed in 1996; fits data mining into the general problem-solving strategy of a business/research unit)

• Iterative CRISP-DM process shown in outer circle
• Most significant dependencies between phases shown
• Next phase depends on results from preceding phase
• Returning to earlier phase possible before moving forward

[Figure: CRISP-DM Lifecycle. Phases: Business / Research Understanding Phase, Data Understanding Phase, Data Preparation Phase, Modeling Phase, Evaluation Phase, Deployment Phase]
Cross Industry Standard
Process: CRISP-DM (cont’d)
(1) Business/Research Understanding Phase
• Define project requirements and objectives
• Translate objectives into data mining problem definition
• Prepare preliminary strategy to meet objectives
(2) Data Understanding Phase
• Collect data
• Perform exploratory data analysis (EDA)
• Assess data quality
• Optionally, select interesting subsets
(3) Data Preparation Phase
• Prepares for modeling in subsequent phases
• Select cases and variables appropriate for analysis
• Cleanse and prepare data so it is ready for modeling tools
• Perform transformation of certain variables, if needed

Cross Industry Standard
Process: CRISP-DM (cont’d)
• (4) Modeling Phase
• Select and apply one or more modeling techniques
• Calibrate model settings to optimize results
• If necessary, additional data preparation may be required
for supporting a particular technique
• (5) Evaluation Phase
• Evaluate one or more models for effectiveness
• Determine whether defined objectives achieved
• Establish whether some important facet of the problem has
not been sufficiently accounted for
• Make decision regarding data mining results before
deploying to field

Cross Industry Standard
Process: CRISP-DM (cont’d)

• (6) Deployment Phase

• Make use of models created
• Simple deployment example: generate report
• Complex deployment example: implement parallel data mining
effort in another department
• In businesses, customer often carries out deployment based on
your model

Fallacies of Data Mining
• Four Fallacies of Data Mining (Louie, Nautilus Systems, Inc.)

Fallacy 1
• Set of tools can be turned loose on data repositories
• Finds answers to all business problems
Reality 1
• No automatic data mining tools solve problems
• Rather, data mining is process (CRISP-DM)
• Integrates into overall business objectives

Fallacy 2
• Data mining process is autonomous
• Requires little oversight
Reality 2
• Requires significant intervention during every phase
• After model deployment, new models require updates
• Continuous evaluative measures monitored by analysts

Fallacy 3
• Data mining quickly pays for itself
Reality 3
• Return rates vary
• Depending on startup, personnel, data preparation costs, etc.

Fallacy 4
• Data mining software easy to use
Reality 4
• Ease of use varies across projects
• Analysts must combine subject matter knowledge with specific problem domain
What Tasks Can Data Mining
Accomplish?
• Six common data mining Strategies

• Description
• Estimation
• Prediction
• Classification
• Clustering
• Association

What Tasks Can Data Mining
Accomplish? (cont’d)

1. Description
• Describes patterns or trends in data
• Data mining models should be transparent
• That is, results should be interpretable by humans
• For example, Decision Trees (transparent) > Neural Networks (opaque)
• High-quality description accomplished using Exploratory Data
Analysis (EDA)
• Graphical method of exploring patterns and trends in data

What Tasks Can Data Mining
Accomplish? (cont’d)
2. Estimation
• Similar to Classification task, except target variable is numeric
• Models built from complete data records
• Records include values for each predictor field and numeric target variable
in training set
• For new observations, estimate the target variable

• Example: Estimate a patient’s systolic blood pressure, based on
patient’s age, gender, body-mass index, and sodium levels
• Use training data to develop model that estimates systolic blood pressure
based on predictor variables
• Apply model to new cases, to obtain estimated systolic blood pressure
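
The estimation workflow above (fit a model on complete training records, then apply it to a new case) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the source: it uses a single predictor (age) and ordinary least squares for brevity, whereas the slide lists several predictors. The training values and the names TRAINING, fit_line, and estimate are invented for illustration.

```python
# Hypothetical training records: (age in years, systolic blood pressure in mmHg).
# These numbers are invented for illustration only; they are not clinical data.
TRAINING = [(30, 115.0), (40, 120.0), (50, 125.0), (60, 130.0)]


def fit_line(records):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate(age, intercept, slope):
    """Apply the fitted model to a new case to estimate the numeric target."""
    return intercept + slope * age


intercept, slope = fit_line(TRAINING)
print(estimate(45, intercept, slope))
```

With more predictors (gender, body-mass index, sodium levels), the same idea extends to multiple regression.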

What Tasks Can Data Mining
Accomplish? (cont’d)
• Estimation – Further examples
• Estimate the amount of money a family of four will spend on back-to-school shopping
• basketball player
• Estimate CGPA

◦ Statistical Analysis uses several estimation methods:
point estimation, confidence interval estimation, linear
regression and correlation, and multiple regression
What Tasks Can Data Mining
Accomplish? (cont’d)
3. Prediction
• Similar to classification and estimation, except results lie in the future
• Example prediction tasks in business and research:
◦ Predict price of stock 3 months into future, based on past performance
◦ Predict percentage increase in traffic deaths next year, if speed limit increased
◦ Predicting the winner of this fall’s World Series, based on a comparison of the team statistics
◦ Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug

[Figure: stock price by quarter (Q1 to Q4), with future values marked “?”]
What Tasks Can Data Mining
Accomplish? (cont’d)

4. Classification
• Similar to Estimation task, except target variable is categorical
• Example: Classify the Income Bracket of an individual as Low,
Middle or High based on their Age, Gender and Occupation
• Use training data to develop model that classifies Income Bracket based
on predictor variables
• Apply model to cases not currently in the database, to obtain estimated
Income Bracket classification

What Tasks Can Data Mining
Accomplish? (cont’d)
• Classification – Example in detail
• Using the training data set, the algorithm would:
◦ Examine the data set containing both the predictor variables and the (already classified) target variable, income bracket
◦ The algorithm (software) “learns about” which combinations of variables are associated with which income brackets (for example, Older females -> High Income)
• Then, when looking at new records with no income information, the algorithm would:
◦ Based on the classifications in the training set, assign classifications to the new records (for example, 63-year-old female professor -> High)

Subject   Age   Gender   Occupation             Income Bracket
001       47    F        Software Engineer      High
002       28    M        Marketing Consultant   Middle
003       35    M        Unemployed             Low
…         …     …        …                      …
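
The idea of assigning a class to a new record based on already-classified training records can be sketched with a one-nearest-neighbor rule. This is only an illustrative stand-in: the source does not say which classification algorithm is used, and the distance function (age difference in decades plus a penalty of 1 for a gender mismatch) is an assumption made here. Only the three records shown in the table are used, and Occupation is ignored for simplicity.

```python
# Training records taken from the table above: (age, gender, income bracket).
TRAINING = [
    (47, "F", "High"),
    (28, "M", "Middle"),
    (35, "M", "Low"),
]


def distance(age_a, gender_a, age_b, gender_b):
    """Assumed dissimilarity: age gap in decades plus 1 if genders differ."""
    return abs(age_a - age_b) / 10 + (0 if gender_a == gender_b else 1)


def classify(age, gender):
    """Return the income bracket of the most similar training record."""
    nearest = min(TRAINING, key=lambda rec: distance(age, gender, rec[0], rec[1]))
    return nearest[2]


print(classify(63, "F"))
```

With so few records the result is fragile; a realistic model would be trained on many records and evaluated before use.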

Figure 1.3 - Which Drug Should Be Prescribed for Which Type of Patient?
[Scatter plot. Axes: Na / K (sodium / potassium ratio) and Age. Dot legend: light gray = Drug Y; medium gray = Drugs A or X; dark gray = Drugs B or C]
What Tasks Can Data Mining
Accomplish? (cont’d)
5. Clustering
• Refers to grouping records into classes of similar objects
• Cluster – a collection of records similar to one another, and
dissimilar to records in other clusters
• Clustering algorithm seeks to segment data set into
homogeneous subgroups
• Target variable not specified
• Clustering does not try to classify/estimate/predict target variable

• For example, Claritas, Inc. PRIZM software clusters demographic
profiles for different geographic areas according to zip code
• It describes every American zip code area in terms of distinct lifestyle types
(see next slide for example)

Nielsen Claritas’ PRIZM segmentation system

01 Upper Crust   02 Blue Blood Estates   03 Movers and Shakers
04 Young Digerati   05 Country Squires   06 Winner’s Circle
07 Money and Brains   08 Executive Suites   09 Big Fish, Small Pond
10 Second City Elite   11 God’s Country   12 Brite Lites, Little City
13 Upward Bound   14 New Empty Nests   15 Pools and Patios
16 Bohemian Mix   17 Beltway Boomers   18 Kids and Cul-de-sacs
19 Home Sweet Home   20 Fast-Track Families   21 Gray Power
22 Young Influentials   23 Greenbelt Sports   24 Up-and-Comers
25 Country Casuals   26 The Cosmopolitans   27 Middleburg Managers
28 Traditional Times   29 American Dreams   30 Suburban Sprawl
31 Urban Achievers   32 New Homesteaders   33 Big Sky Families
34 White Picket Fences   35 Boomtown Singles   36 Blue-Chip Blues
37 Mayberry-ville   38 Simple Pleasures   39 Domestic Duos
40 Close-in Couples   41 Sunset City Blues   42 Red, White and Blues
43 Heartlanders   44 New Beginnings   45 Blue Highways
46 Old Glories   47 City Startups   48 Young and Rustic
49 American Classics   50 Kid Country, USA   51 Shotguns and Pickups
52 Suburban Pioneers   53 Mobility Blues   54 Multi-Culti Mosaic
55 Golden Ponds   56 Crossroads Villagers   57 Old Milltowns
58 Back Country Folks   59 Urban Elders   60 Park Bench Seniors
61 City Roots   62 Hometown Retired   63 Family Thrifts
64 Bedrock America   65 Big City Blues   66 Low-Rise Living
Table 1.2 The 66 clusters used by the PRIZM segmentation system.

• Clusters for zip code 90210, Beverly Hills, California are:
◦ #01: Upper Crust
◦ #03: Movers and Shakers
◦ #04: Young Digerati
◦ #07: Money and Brains
◦ #16: Bohemian Mix
• The description for Cluster # 01: Upper Crust
◦ The nation’s most exclusive address
◦ The wealthiest lifestyle in America
◦ Haven for empty-nesting couples between the ages of 45 and 64
◦ Highest concentration of residents with:
◦ over $100,000/year
◦ Most opulent standard of living.
What Tasks Can Data Mining
Accomplish? (cont’d)

Clustering - Clustering Tasks in Business and Research:
• Use as dimensionality-reduction tool for data set having several
hundred inputs
• For gene expression clustering, where very large quantities of
genes may exhibit similar behavior

• As preliminary step in data mining

• Resulting clusters used as input to different technique downstream, such as
neural networks
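
The clustering task described on the preceding slides (segmenting records into homogeneous subgroups with no target variable) can be sketched with a small k-means routine. This is a minimal sketch under stated assumptions: k-means is one common clustering algorithm and is not named in the source, the one-dimensional data values are invented, and the starting centroids are fixed by hand so the result is reproducible.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster 1-D points: assign each to its nearest centroid, then
    move each centroid to the mean of its members, until stable."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two visibly separated groups of values.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
labels, centroids = kmeans_1d(points, centroids=[1.0, 10.0])
print(labels, centroids)
```

Note that no target variable appears anywhere: the cluster labels are produced by the algorithm rather than supplied in the data.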

What Tasks Can Data Mining
Accomplish? (cont’d)

6. Association
• Find out which attributes “go together”
• Commonly used for Market Basket Analysis (aka Affinity
Analysis)
• Quantify relationships between two or more attributes in the
form of rules as:
IF antecedent THEN consequent
• Rules measured using support and confidence

• Example: A particular supermarket might find that:

• Thursday night 200 of 1,000 customers bought diapers, and of those buying
diapers, 50 purchased beer
• Association Rule: “IF buy diapers, THEN buy beer”
• Support = 50/1,000 = 5% (customers who bought both diapers and beer, out of all customers), and confidence = 50/200 = 25%
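
The support and confidence of the rule “IF buy diapers, THEN buy beer” can be worked out from the counts in the example. The sketch below uses the standard definitions (support = transactions containing both antecedent and consequent divided by all transactions; confidence = transactions containing both divided by transactions containing the antecedent). The function and variable names are chosen here for illustration.

```python
def support(n_both, n_total):
    """Fraction of all transactions containing antecedent AND consequent."""
    return n_both / n_total


def confidence(n_both, n_antecedent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return n_both / n_antecedent


# Counts from the supermarket example above.
n_customers = 1000   # all Thursday-night customers
n_diapers = 200      # customers who bought diapers (antecedent)
n_diapers_beer = 50  # customers who bought both diapers and beer

rule_support = support(n_diapers_beer, n_customers)
rule_confidence = confidence(n_diapers_beer, n_diapers)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```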
