Introduction to Data Mining

Uploaded by

SahilPatel
The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Introduction to Data Mining
Outline:

• What is Data Mining?
• Why Data Mining?
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• What Tasks Can Data Mining Accomplish?
• Warming Up to Programming in R

A Categorization of Analytical Methods

• What should happen? (Prescriptive Analytics)
• What will happen? (Predictive Analytics)
• What happened? Why did it happen?

What is Data Mining?
◦ According to McKinsey Global Institute (MGI)
  – Most American companies with more than 1,000 employees have 200 TB of data, increasing 40% annually
  – Retailers making full use of their data could expect to realize an increase in their operating margin of more than 60%

◦ United States 2012 Presidential Elections (source: MIT Technology Review)
  – The Obama campaign first identified likely Obama voters using a data mining model, and then made sure that these voters actually got to the polls
  – Used a separate data mining model to predict the polling outcomes county-by-county
  – Hamilton County, Ohio: the model predicted 56.4% for Obama; actual result was 56.6%, so the prediction was off by only 0.2 percentage points

Why Data Mining?
• Other examples
– Bank of America, West Coast customer service call center (source:
CIO Magazine)
• 13 million customer calls per month – in the past they all were offered
the same products/services
• Now, with access to each customer’s individual profile, customer service
representatives offer new products or services that may be of greatest
interest to him/her
– Supermarkets
• Each cash-register product scan collected helps to build a profile about
the shopping habits of your family, and the other families who are
checking out

Data mining is the process of discovering useful patterns and trends in large datasets
Wanted: Data Miners
“We are drowning in information but starved for knowledge.” (Megatrends, John Naisbitt)

• We are inundated with data in most fields, but…
• There are not enough trained human analysts available who are skilled at converting the data into knowledge
• According to McKinsey Report
◦ “There will be a shortage of talent…”
◦ “…particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.”
◦ Demand for talent to exceed supply “…by 140,000 to 190,000 positions”
◦ “… we project a need for 1.5 million additional managers and analysts in the United States”

• Factors
– Explosive growth in data collection, as in supermarket scanners
– Storing the data in data warehouses
– Increased access to data from web navigation and intranets
– Competitive pressure to increase market share in globalized economy
– Growth of computing power and storage capacity

The Need for Human Direction of Data Mining
– Some early data mining definitions described process as
“automatic”
– “…this has misled many people into believing data mining is
a product that can be bought rather than a discipline that must be
mastered.” (Berry, Linoff)
– Automation no substitute for human input
– Data mining is easy to do badly
– Humans need to be actively involved in every phase of data
mining process
– Task of data mining should be integrated into human process of
problem solving
Cross Industry Standard Process: CRISP-DM

• Cross-Industry Standard Process for Data Mining (CRISP-DM) developed in 1996
– Adaptive: Next phase depends on results from preceding phase
– Returning to earlier phase possible before moving forward

[Figure: CRISP-DM Lifecycle. Business / Research Understanding Phase -> Data Understanding Phase -> Data Preparation Phase -> Modeling Phase -> Evaluation Phase -> Deployment Phase]

Cross Industry Standard Process: CRISP-DM
• (1) Business/Research Understanding Phase
– Define project requirements and objectives
– Translate objectives into data mining problem definition
– Prepare preliminary strategy to meet objectives
• (2) Data Understanding Phase
– Collect data
– Perform exploratory data analysis (EDA)
– Assess data quality
– Optionally, select interesting subsets
• (3) Data Preparation Phase
– Prepares for modeling in subsequent phases
– Select cases and variables appropriate for analysis
– Cleanse and prepare data so it is ready for modeling tools
– Perform transformation of certain variables, if needed

Cross Industry Standard Process: CRISP-DM
• (4) Modeling Phase
– Select and apply one or more modeling techniques
– Calibrate model parameters to optimize results
– If necessary, additional data preparation may be required for supporting a
particular technique
• (5) Evaluation Phase
– Evaluate one or more models for effectiveness
– Determine whether defined objectives achieved
– Establish whether some important facet of the problem has not been
sufficiently accounted for
– Make decision regarding data mining results before deploying to field
• (6) Deployment Phase
– Make use of models created
– Simple deployment example: generate report
– Complex deployment example: implement parallel data mining effort in
another department
– In businesses, customer often carries out deployment based on your model

Fallacies of Data Mining
• Five Fallacies of Data Mining (Louie, Nautilus Systems, Inc.)
1. Fallacy: Data mining process is autonomous; requires little oversight
Reality:
• Requires significant intervention during every phase
• After model deployment, models still require updates as new data arrive
• Continuous evaluative measures monitored by analysts

2. Fallacy: Data mining quickly pays for itself
Reality:
• Return rates vary, depending on startup, personnel, and data preparation costs, etc.

3. Fallacy: Data mining software easy to use
Reality:
• Ease of use varies across projects
• Analysts must combine subject matter knowledge with an understanding of the specific problem domain

4. Fallacy: Data mining automatically cleans data in databases
Reality:
• Data mining often uses data from legacy systems
• Data possibly not examined or used in years
• Organizations starting data mining efforts confronted with huge data preprocessing task

5. Fallacy: Data mining always provides positive results
Reality:
• There is no guarantee of positive results
• But used properly, data mining can provide actionable and highly profitable results

What Tasks Can Data Mining Accomplish?
• Six common data mining tasks
– Description
– Estimation
– Classification
– Prediction
– Clustering
– Association

What Tasks Can Data Mining Accomplish? (cont’d)

1. Description
– Describes patterns or trends in data
– Data mining models should be transparent
• That is, results should be interpretable by humans
• Some data mining methods more transparent than others
– Decision Trees (Transparent)
– Neural Networks (Black box)

– High-quality description accomplished using Exploratory Data Analysis (EDA)
• Graphical method of exploring patterns and trends in data

What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (1/3)
◦ Target variable is numeric
◦ Models built from complete data records
  – Records include values for each predictor field and numeric target variable in training set
◦ For new observations, estimate the target variable

◦ Example: Estimate a patient’s systolic blood pressure, based on patient’s age, gender, body-mass index, and sodium levels
a) Use training data to develop model that estimates systolic
blood pressure based on predictor variables
b) Apply model to new cases, to obtain estimated systolic blood
pressure
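
The two steps above (develop a model from training data, then apply it to new cases) can be sketched in Python. This is only a minimal illustration under stated assumptions, not the method used in the course material: it fits a one-predictor least-squares line using age alone, and the training records as well as the names `fit_simple_regression` and `estimate_bp` are hypothetical.

```python
from statistics import mean

def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical training set: complete records with predictor (age)
# and numeric target (systolic blood pressure). Invented for illustration.
train_age = [30, 40, 50, 60, 70]
train_bp = [115, 120, 125, 130, 135]

# Step a) develop the model from the training data
intercept, slope = fit_simple_regression(train_age, train_bp)

# Step b) apply the model to a new case with no recorded target value
def estimate_bp(age):
    return intercept + slope * age

print(estimate_bp(45))
```

A realistic model would include the other predictors named on the slide (gender, body-mass index, sodium levels), which calls for multiple regression rather than a single-predictor line.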

What Tasks Can Data Mining Accomplish? (cont’d)
2. Estimation (2/3) – Further examples

– Estimate amount of money a family of four will spend on back-to-school shopping

– Estimate GPA of graduate student, based on student’s undergraduate GPA

Statistical Analysis uses several estimation methods: point estimation, confidence interval estimation, linear regression and correlation, and multiple regression

What Tasks Can Data Mining Accomplish? (cont’d)

2. Estimation (3/3)
– The following figure shows scatter plot of graduate GPA against undergraduate GPA (1000 students)
– Linear regression finds line (blue) best approximating relationship between two variables
• Regression line estimates student’s graduate GPA based on their undergraduate GPA, resulting in the following model: ŷ = 1.24 + 0.67x
• For example, suppose student’s undergraduate GPA = 3.0
• According to estimation model, estimated student’s graduate GPA = 1.24 + 0.67(3.0) = 3.25
• Point (x = 3.0, ŷ = 3.25) lies on regression line
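
The worked example can be checked with a few lines of Python. The coefficients 1.24 and 0.67 come from the slide; the function name `estimate_graduate_gpa` is a hypothetical label chosen for this sketch.

```python
def estimate_graduate_gpa(undergrad_gpa, intercept=1.24, slope=0.67):
    """Apply the slide's regression model: y-hat = 1.24 + 0.67 * x."""
    return intercept + slope * undergrad_gpa

# Estimated graduate GPA for an undergraduate GPA of 3.0
print(round(estimate_graduate_gpa(3.0), 2))  # 3.25
```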

What Tasks Can Data Mining Accomplish? (cont’d)

3. Classification (1/4)
◦ Similar to Estimation task, except target variable is categorical

◦ Example: Classify the Income Bracket of an individual as Low, Middle or High based on their Age, Gender and Occupation

a) Use training data to develop model that classifies Income Bracket based on predictor variables

b) Apply model to cases not currently in the database, to obtain estimated Income Bracket classification

What Tasks Can Data Mining Accomplish? (cont’d)

3. Classification (2/4) – Example in detail


– Using the training data set, the algorithm would:
◦ Examine the data set containing both the predictor variables and the (already classified)
target variable, income bracket
◦ Algorithm (software) “learns about” which combinations of variables are associated with
which income brackets (for example, Older females -> High Income)
– Then, when looking at new records with no income information, the
algorithm would:
◦ Based on the classifications in the training set, assign classifications to the new
records (for example, 63-year-old female professor -> high)

Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software Engineer     High
002      28   M       Marketing Consultant  Middle
003      35   M       Unemployed            Low
…        …    …       …                     …

What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (3/4) – The drug prescription example
• Interested in classifying the type of drug a patient should be prescribed, based on age
of the patient, and the patient’s sodium / potassium ratio
• Scatter plot of 200 patients with their sodium/potassium ratios against age, with the drug prescribed indicated by the shade of the points
• What drug should be prescribed for:
– Young patient with high Na/K ratio?
◦ Young patients with high Na/K are in the upper left region
◦ Past patients in this region got Drug A
◦ The recommended classification for such patients is Drug A
– Older patient with low Na/K ratio?
◦ Lower right region
◦ Past patients in this region got either dark gray (Drug C) or medium gray (Drug B)
◦ Definitive classification not possible without further information

Legend: Light gray – Drug A; Medium gray – Drug B; Dark gray – Drug C
What Tasks Can Data Mining Accomplish? (cont’d)
3. Classification (4/4) – Handling many predictors
• Classification tasks with 2 or 3 predictors
– Can be analyzed using charts and plots like the drug example
above

• Many datasets have multiple predictors
– This requires common data mining methods for classification like:
• k-nearest neighbor
• decision trees
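
As a rough sketch of the first method listed, the snippet below classifies a new patient by majority vote among the k nearest past patients, echoing the drug prescription example. The patient records are invented for illustration, the names `nearest_votes` and `knn_classify` are hypothetical, and the features are left unscaled for brevity (in practice age and Na/K ratio would usually be rescaled before computing distances).

```python
import math
from collections import Counter

# Hypothetical past patients: ((age, Na/K ratio), drug prescribed)
past_patients = [
    ((20, 35.0), "Drug A"), ((25, 30.0), "Drug A"), ((30, 33.0), "Drug A"),
    ((60, 8.0), "Drug B"), ((65, 10.0), "Drug B"),
    ((70, 9.0), "Drug C"), ((68, 7.0), "Drug C"), ((62, 11.0), "Drug C"),
]

def nearest_votes(train, query, k=3):
    """Count the class labels among the k training records closest to query."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest)

def knn_classify(train, query, k=3):
    """Return the most common label among the k nearest neighbors."""
    return nearest_votes(train, query, k).most_common(1)[0][0]

# Young patient with a high Na/K ratio: all three neighbors got Drug A
print(knn_classify(past_patients, (22, 32.0)))

# Older patient with a low Na/K ratio: the neighbors disagree
print(nearest_votes(past_patients, (66, 9.0)))
```

The split vote for the older patient mirrors the earlier observation that a definitive classification in that region is not possible without further information.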

What Tasks Can Data Mining Accomplish? (cont’d)

4. Prediction
◦ Similar to classification and estimation, except results lie in the future
◦ Methods used for estimation and classification applicable to prediction
  – Includes point estimation, confidence interval estimation, linear regression and correlation, multiple regression, k-nearest neighbor, decision trees and neural networks

• Example prediction tasks in business and research:
  – Predict price of stock 3 months into future, based on past performance
  – Predict percentage increase in traffic deaths next year, if speed limit increased
  – Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug

[Figure: stock price by quarter (Q1 to Q4), with future values marked "?"]

What Tasks Can Data Mining Accomplish? (cont’d)

5. Clustering
– Refers to grouping records into classes of similar objects
– Cluster – a collection of records similar to one another, and dissimilar to
records in other clusters
– Clustering algorithm seeks to segment data set into homogeneous
subgroups
– Target variable not specified
• Clustering does not try to classify/estimate/predict target variable

• Clustering Tasks in Business and Research:
– Target marketing niche product for small business that does not have large
marketing budget
– For accounting purposes, to segment financial behavior into benign and
suspicious categories
– Use as dimensionality-reduction tool for data set having several hundred
inputs
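
To make the idea of segmenting records into homogeneous subgroups concrete, here is a minimal sketch of k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means is an assumption chosen for illustration; the customer records, the starting centroids, and the names `assign` and `k_means` are hypothetical. Note that no target variable appears anywhere in the code.

```python
import math

def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (monthly spend in $100s, store visits per month)
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
             (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

centroids, clusters = k_means(customers, [(1.0, 1.0), (9.0, 9.0)])
for center, members in zip(centroids, clusters):
    print(center, members)
```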
What Tasks Can Data Mining Accomplish? (cont’d)
6. Association (1/2)
– Find out which attributes “go together”
– Commonly used for Market Basket Analysis
– Quantify relationships between two or more attributes in the form of rules
as:
IF antecedent THEN consequent

– Rules measured using support and confidence

– Example: A particular supermarket might find that:
• Thursday night 200 of 1,000 customers bought diapers, and of those buying diapers, 50
purchased beer
• Association Rule: “IF buy diapers, THEN buy beer”
• Support = 50/1,000 = 5% of all customers bought both items (200/1,000 = 20% bought diapers), and confidence = 50/200 = 25%
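
The rule measures can be computed directly from the counts in the supermarket example. This sketch assumes the common convention in which a rule's support is the share of all transactions containing both antecedent and consequent; it also reports the share containing the antecedent alone, since the two are easy to confuse. The function name `rule_metrics` is hypothetical.

```python
def rule_metrics(n_transactions, n_antecedent, n_both):
    """Return (antecedent share, rule support, confidence) for IF A THEN B."""
    antecedent_share = n_antecedent / n_transactions  # P(A)
    support = n_both / n_transactions                 # P(A and B)
    confidence = n_both / n_antecedent                # P(B given A)
    return antecedent_share, support, confidence

# 1,000 customers, 200 bought diapers, 50 of those also bought beer
antecedent_share, support, confidence = rule_metrics(1000, 200, 50)
print(f"bought diapers: {antecedent_share:.0%}")  # 20%
print(f"support:        {support:.0%}")           # 5%
print(f"confidence:     {confidence:.0%}")        # 25%
```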

What Tasks Can Data Mining Accomplish? (cont’d)
6. Association (2/2) - Association Tasks in Business and Research:

• Investigating the proportion of subscribers to your company’s cell phone plan that respond positively to an offer of a service upgrade.

• Determining the proportion of cases in which a new drug will exhibit dangerous side effects.

What Tasks Can Data Mining Accomplish? (cont’d)

Classification -> Supervised learning

Clustering -> Unsupervised learning
