Introduction to Data Mining

Uploaded by

SahilPatel

The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Introduction to Data Mining
Outline:
• What is Data Mining?
• Why Data Mining?
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• What Tasks Can Data Mining Accomplish?
• Warming Up to Programming in R

A Categorization of Analytical Methods
• What should happen? (Prescriptive Analytics)
• What will happen? (Predictive Analytics)
• What happened? Why did it happen?

What is Data Mining?
◦ According to McKinsey Global Institute (MGI)
  • Most American companies with more than 1,000 employees have 200 TB of data, increasing 40% annually
  • Retailers could expect to realize an increase in their operating margin of more than 60%
◦ United States 2012 Presidential Election (source: MIT Technology Review)
  • The Obama campaign first identified likely Obama voters using a data mining model, and then made sure that these voters actually got to the polls
  • It used a separate data mining model to predict the polling outcomes county-by-county
  • Hamilton County, Ohio: the model predicted 56.4% for Obama; the actual result was 56.6%, so the prediction was off by only 0.2 percentage points

Why Data Mining?
• Other examples
– Bank of America, West Coast customer service call center (source:
CIO Magazine)
• 13 million customer calls per month; in the past, all callers were offered the same products/services
• Now, with access to each customer’s individual profile, customer service representatives offer the new products or services that may be of greatest interest to that customer
– Supermarkets
• Each cash-register product scan helps to build a profile of the shopping habits of your family and of the other families who are checking out

Data mining is the process of discovering useful patterns and trends in large datasets

Wanted: Data Miners
“We are drowning in information but starved for knowledge.” (Megatrends, John Naisbitt)

• We are inundated with data in most fields, but…
• There are not enough trained human analysts available who are skilled at converting the data into knowledge
• According to McKinsey Report
  ◦ “There will be a shortage of talent…”
  ◦ “…particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.”
  ◦ Demand for talent to exceed supply “…by 140,000 to 190,000 positions”
  ◦ “… we project a need for 1.5 million additional managers and analysts in the United States”
• Factors
  – Explosive growth in data collection, as in supermarket scanners
  – Storing the data in data warehouses
  – Increased access to data from web navigation and intranets
  – Competitive pressure to increase market share in globalized economy
  – Growth of computing power and storage capacity

The Need for Human Direction of Data Mining
– Some early data mining definitions described process as
“automatic”
– “…this has misled many people into believing data mining is a product that can be bought rather than a discipline that must be mastered.” (Berry, Linoff)
– Automation no substitute for human input
– Data mining is easy to do badly
– Humans need to be actively involved in every phase of data mining process
– Task of data mining should be integrated into human process of problem solving

Cross Industry Standard Process: CRISP-DM
• Cross-Industry Standard Process for Data Mining (CRISP-DM) developed in 1996
– Adaptive: Next phase depends on results from preceding phase
– Returning to earlier phase possible before moving forward

[Figure: CRISP-DM Lifecycle. Business / Research Understanding Phase, Data Understanding Phase, Data Preparation Phase, Modeling Phase, Evaluation Phase, Deployment Phase]

Cross Industry Standard Process: CRISP-DM
• (1) Business/Research Understanding Phase
◦ Define project requirements and objectives
◦ Translate objectives into data mining problem definition
◦ Prepare preliminary strategy to meet objectives
• (2) Data Understanding Phase
◦ Collect data
◦ Perform exploratory data analysis (EDA)
◦ Assess data quality
◦ Optionally, select interesting subsets
• (3) Data Preparation Phase
◦ Prepares for modeling in subsequent phases
◦ Select cases and variables appropriate for analysis
◦ Cleanse and prepare data so it is ready for modeling tools
◦ Perform transformation of certain variables, if needed

Cross Industry Standard Process: CRISP-DM
• (4) Modeling Phase
– Select and apply one or more modeling techniques
– Calibrate model parameters to optimize results
– If necessary, additional data preparation may be required for supporting a
particular technique
• (5) Evaluation Phase
– Evaluate one or more models for effectiveness
– Determine whether defined objectives achieved
– Establish whether some important facet of the problem has not been
sufficiently accounted for
– Make decision regarding data mining results before deploying to field
• (6) Deployment Phase
– Make use of models created
– Simple deployment example: generate report
– Complex deployment example: implement parallel data mining effort in
another department
– In businesses, customer often carries out deployment based on your model

Fallacies of Data Mining
• Five Fallacies of Data Mining (Louie, Nautilus Systems, Inc.)
1. Fallacy: Data mining process is autonomous and requires little oversight
   Reality:
   • Requires significant intervention during every phase
   • After model deployment, new models require updates
   • Continuous evaluative measures monitored by analysts
2. Fallacy: Data mining quickly pays for itself
   Reality:
   • Return rates vary, depending on startup, personnel, data preparation costs, etc.
3. Fallacy: Data mining software easy to use
   Reality:
   • Ease of use varies across projects
   • Analysts must combine subject matter knowledge with specific problem domain
4. Fallacy: Data mining automatically cleans data in databases
   Reality:
   • Data mining often uses data from legacy systems
   • Data possibly not examined or used in years
   • Organizations starting data mining efforts confronted with huge data preprocessing task
5. Fallacy: Data mining always provides positive results
   Reality:
   • There is no guarantee of positive results
   • But used properly, data mining can provide actionable and highly profitable results

What Tasks Can Data Mining Accomplish?
• Six common data mining tasks
– Description
– Estimation
– Classification
– Prediction
– Clustering
– Association

What Tasks Can Data Mining Accomplish? (cont’d)

1. Description
– Describes patterns or trends in data
– Data mining models should be transparent
• That is, results should be interpretable by humans
• Some data mining methods more transparent than others
– Decision Trees (Transparent)
– Neural Networks (Black box)
– High-quality description accomplished using Exploratory Data Analysis (EDA)
• Graphical method of exploring patterns and trends in data

What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (1/3)
◦ Target variable is numeric
◦ Models built from complete data records
  • Records include values for each predictor field and numeric target variable in training set
◦ For new observations, estimate the target variable
◦ Example: Estimate a patient’s systolic blood pressure, based on patient’s age, gender, body-mass index, and sodium levels
a) Use training data to develop model that estimates systolic blood pressure based on predictor variables
b) Apply model to new cases, to obtain estimated systolic blood pressure
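
Steps (a) and (b) can be sketched in a few lines of Python (the slides themselves contain no code). This is only an illustration under stated assumptions: it uses a single predictor, age, rather than the four listed in the example, the training values are invented, and `fit_line` and `estimate_sbp` are hypothetical names.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# (a) Training set: complete records with predictor AND numeric target.
# These values are hypothetical, chosen only to make the example run.
train_age = [30, 40, 50, 60, 70]
train_sbp = [115, 120, 125, 130, 135]  # systolic blood pressure

intercept, slope = fit_line(train_age, train_sbp)


def estimate_sbp(age):
    """(b) Apply the fitted model to a new case with no recorded target."""
    return intercept + slope * age


new_patient_estimate = estimate_sbp(55)
print(intercept, slope, new_patient_estimate)
```

A real model for this task would use all available predictors (multiple regression) and would be checked on data held out from training.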

What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (2/3) – Further examples
– Estimate amount of money a family of four will spend on back-to-school shopping
– Estimate GPA of graduate student, based on student’s undergraduate GPA

Statistical Analysis uses several estimation methods: point estimation, confidence interval estimation, linear regression and correlation, and multiple regression

What Tasks Can Data Mining Accomplish? (cont’d)

2. Estimation (3/3)
– The figure shows a scatter plot of graduate GPA against undergraduate GPA (1000 students)
– Linear regression finds line (blue) best approximating relationship between two variables
• Regression line estimates student’s graduate GPA based on their undergraduate GPA, resulting in the following model: ŷ = 1.24 + 0.67x
• For example, suppose student’s undergraduate GPA = 3.0
• According to estimation model, estimated student’s graduate GPA = 1.24 + 0.67(3.0) = 3.25
• Point (x = 3.0, ŷ = 3.25) lies on regression line
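
The worked arithmetic can be checked with a short Python sketch. The coefficients 1.24 and 0.67 come from the slide; the function name `estimate_graduate_gpa` is an illustrative choice, not something from the source material.

```python
INTERCEPT = 1.24  # from the slide's fitted model
SLOPE = 0.67      # from the slide's fitted model


def estimate_graduate_gpa(undergrad_gpa):
    """Point on the regression line: y-hat = 1.24 + 0.67 * x."""
    return INTERCEPT + SLOPE * undergrad_gpa


print(round(estimate_graduate_gpa(3.0), 2))  # 3.25
```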

What Tasks Can Data Mining Accomplish? (cont’d)

3. Classification (1/4)
◦ Similar to Estimation task, except target variable is categorical
◦ Example: Classify the Income Bracket of an individual as Low, Middle or High based on their Age, Gender and Occupation
a) Use training data to develop model that classifies Income Bracket based on predictor variables
b) Apply model to cases not currently in the database, to obtain estimated Income Bracket classification

What Tasks Can Data Mining Accomplish? (cont’d)

3. Classification (2/4) – Example in detail

– Using the training data set, the algorithm would:
◦ Examine the data set containing both the predictor variables and the (already classified) target variable, income bracket
◦ “Learn about” which combinations of variables are associated with which income brackets (for example, Older females -> High Income)
– Then, when looking at new records with no income information, the algorithm would:
◦ Based on the classifications in the training set, assign classifications to the new records (for example, 63-year-old female professor -> High)

Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software Engineer     High
002      28   M       Marketing Consultant  Middle
003      35   M       Unemployed            Low
…        …    …       …                     …

What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (3/4) – The drug prescription example
• Interested in classifying the type of drug a patient should be prescribed, based on age of the patient, and the patient’s sodium / potassium ratio
• Scatter plot of 200 patients with their sodium/potassium ratios against age, with the particular drug prescribed indicated by the shade of the points
• What drug should be prescribed for:
– Young patient with high Na/K ratio?
  • Young patients with high Na/K are in the upper left region
  • Past patients in this region got Drug A
  • The recommended classification for such patients is Drug A
– Older patient with low Na/K ratio?
  • Lower right region
  • Past patients in this region got either dark gray (Drug C) or medium gray (Drug B)
  • Definitive classification not possible without further information

Legend: Light gray – Drug A; Medium gray – Drug B; Dark gray – Drug C

What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (4/4) – Handling many predictors
• Classification tasks with 2 or 3 predictors
– Can be analyzed using charts and plots like the drug example above
• Many datasets have far more predictors
– This requires common data mining methods for classification like:
• k-nearest neighbor
• decision trees
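
As a rough illustration of the first method listed, here is a minimal k-nearest neighbor sketch in Python. It is not taken from the slides: the patient records are invented, the names `PAST_PATIENTS` and `knn_classify` are hypothetical, and the two predictors are used on their raw scales (in practice predictors are usually rescaled first so that one does not dominate the distance).

```python
from collections import Counter
from math import dist

# Hypothetical past patients: ((age, Na/K ratio), drug prescribed)
PAST_PATIENTS = [
    ((23, 30.0), "A"),
    ((28, 27.5), "A"),
    ((31, 32.0), "A"),
    ((62, 9.0), "B"),
    ((68, 11.0), "C"),
    ((71, 8.5), "C"),
]


def knn_classify(records, query, k=3):
    """Majority vote among the k records closest to `query` (Euclidean distance)."""
    nearest = sorted(records, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Young patient with a high Na/K ratio
young_high_ratio = knn_classify(PAST_PATIENTS, (25, 29.0))
print(young_high_ratio)  # A
```

The same function works unchanged with more than two predictors, which is why methods of this kind are used once charts are no longer practical. For an older patient with a low ratio, such as `(66, 10.0)`, the three nearest hypothetical records carry mixed labels, so the vote is less clear-cut.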

What Tasks Can Data Mining Accomplish? (cont’d)

4. Prediction
◦ Similar to classification and estimation, except results lie in the future
◦ Methods used for estimation and classification applicable to prediction
  • Includes point estimation, confidence interval estimation, linear regression and correlation, multiple regression, k-nearest neighbor, decision trees and neural networks
• Example prediction tasks in business and research:
  • Predict price of stock 3 months into future, based on past performance
  • Predict percentage increase in traffic deaths next year, if speed limit increased
  • Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug

[Figure: stock price by quarter (Q1 to Q4), with future values marked “?”]

What Tasks Can Data Mining Accomplish? (cont’d)

5. Clustering
– Refers to grouping records into classes of similar objects
– Cluster – a collection of records similar to one another, and dissimilar to
records in other clusters
– Clustering algorithm seeks to segment data set into homogeneous
subgroups
– Target variable not specified
• Clustering does not try to classify/estimate/predict target variable

• Clustering Tasks in Business and Research:

– Target marketing niche product for small business that does not have large
marketing budget
– For accounting purposes, to segment financial behavior into benign and
suspicious categories
– Use as dimensionality-reduction tool for data set having several hundred
inputs
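
To make the idea of segmenting records without a target variable concrete, here is a minimal sketch of k-means, one widely used clustering algorithm (the slides do not name a specific algorithm, so this choice is an assumption). It clusters a single numeric field; the spending values, starting centers, and the name `k_means_1d` are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Basic k-means on one numeric field, starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical weekly spending per customer; no target variable is supplied
spending = [10, 12, 11, 50, 52, 51]
final_centers, groups = k_means_1d(spending, centers=[10, 50])
print(final_centers)  # [11.0, 51.0]
print(groups)         # [[10, 12, 11], [50, 52, 51]]
```

Note that the algorithm is never told which group a customer belongs to; the two segments emerge only from similarity between records.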
What Tasks Can Data Mining Accomplish? (cont’d)
6. Association (1/2)
– Find out which attributes “go together”
– Commonly used for Market Basket Analysis
– Quantify relationships between two or more attributes in the form of rules
as:
IF antecedent THEN consequent

– Rules measured using support and confidence

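– Example: A particular supermarket might find that:
• Thursday night 200 of 1,000 customers bought diapers, and of those buying diapers, 50 purchased beer
• Association Rule: “IF buy diapers, THEN buy beer”
• Support of the antecedent (diapers) = 200/1,000 = 20%, and confidence = 50/200 = 25%
• Under the common definition of rule support as the share of all transactions containing both items, support = 50/1,000 = 5%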
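
The quantities in the diapers-and-beer example follow directly from three counts. The Python sketch below (not part of the original slides) reports both the antecedent's support and the joint support of the rule, since texts differ on which of the two is called "support"; the function and variable names are illustrative.

```python
def rule_metrics(n_transactions, n_antecedent, n_both):
    """Support and confidence for a rule IF antecedent THEN consequent."""
    antecedent_support = n_antecedent / n_transactions  # share buying diapers
    rule_support = n_both / n_transactions              # share buying both items
    confidence = n_both / n_antecedent                  # beer buyers among diaper buyers
    return antecedent_support, rule_support, confidence


# Counts from the supermarket example: 1,000 customers, 200 bought diapers,
# and 50 of those also bought beer
antecedent_support, rule_support, confidence = rule_metrics(1000, 200, 50)
print(f"{antecedent_support:.0%} {rule_support:.0%} {confidence:.0%}")  # 20% 5% 25%
```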

What Tasks Can Data Mining Accomplish? (cont’d)
6. Association (2/2) - Association Tasks in Business and Research:
• Investigating the proportion of subscribers to your company’s cell phone plan that respond positively to an offer of a service upgrade.
• Determining the proportion of cases in which a new drug will exhibit dangerous side effects.

What Tasks Can Data Mining Accomplish? (cont’d)

Classification > Supervised learning

Clustering > Unsupervised learning
