Lecturenotes Data Mining
IDENTIFICATION
STYLES OF DATA MINING
• Directed data mining: takes the form of predictive
modelling, where we know exactly what we want to
predict
• It classifies data for use in making predictions or
estimates, with the goal of deriving target values
• E.g. banks may use it to predict defaulters on loans;
businesses may use it to decide whom to market their
products to
• Uses popular data mining algorithms such as
decision trees (which will be discussed in detail later on)
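As a rough illustration of directed data mining, the sketch below fits a one-level decision tree (a "decision stump") that predicts loan default from income. The records, the single income attribute, and the names best_split and predict_default are all hypothetical, chosen only to show the idea of learning from a known target; a real decision tree algorithm splits repeatedly on many attributes.

```python
# Hypothetical labelled records: (annual income in thousands, defaulted?)
records = [(18, True), (22, True), (25, True), (31, False),
           (40, False), (52, False), (27, True), (45, False)]

def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Return the income threshold with the lowest weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    values = sorted({income for income, _ in rows})
    # Candidate thresholds are midpoints between consecutive distinct incomes.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for income, label in rows if income <= threshold]
        right = [label for income, label in rows if income > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold

def majority(labels):
    """True if at least half of the labels are True."""
    return sum(labels) * 2 >= len(labels)

threshold = best_split(records)
left_prediction = majority([label for income, label in records if income <= threshold])
right_prediction = majority([label for income, label in records if income > threshold])

def predict_default(income):
    """Predict the target value (default or not) for a new applicant."""
    return left_prediction if income <= threshold else right_prediction

print(threshold, predict_default(20), predict_default(50))
```

The target (defaulted or not) is known in advance for the training records, which is what makes this a directed task.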
• Undirected data mining: finds patterns
in the data and leaves it up to the user to
determine whether or not these patterns are
important
• Data is placed in a format that makes it easier
for us to make sense of it
• The most commonly used algorithm is clustering,
which groups data together based on
common characteristics (to be discussed in detail later)
• One can then take one of the derived clusters
and apply the decision tree algorithm to it so
as to focus on a particular segment of the
cluster
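The clustering step described above can be sketched with a very small one-dimensional k-means. The spend figures, the function name kmeans_1d, and the choice of k-means itself are assumptions made for illustration; the notes do not say which clustering algorithm is meant.

```python
# Hypothetical unlabelled data: monthly spend per customer.
spend = [12, 15, 14, 80, 85, 90, 13, 88]

def kmeans_1d(values, k=2, iterations=10):
    """Minimal 1-D k-means (requires k >= 2).

    Centroids start evenly spaced between the minimum and maximum value.
    """
    lo, hi = min(values), max(values)
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d(spend)
print(centroids)
print(clusters)

# One derived cluster can then be handed to a directed method,
# for example a decision tree trained only on these customers.
high_spenders = clusters[1]
```

No target value is supplied here; whether the two groups are meaningful is left to the user to judge.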
DATA MINING METHODOLOGY
DATA MINING ALGORITHMS
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Observations:
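• Rules can be binary partitions of the same itemset: for example,
{Milk, Diaper} → {Beer} and {Beer} → {Milk, Diaper} both come from
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements:
first find itemsets that meet the minimum support, then generate
rules from each of them that meet the minimum confidence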
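The observations can be checked numerically. Here the support of a rule X → Y is the fraction of transactions containing every item of X and Y, and its confidence is that support divided by the support of X. The five transactions below are a hypothetical basket data set used only for illustration, and the function names are likewise assumptions.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(lhs | rhs) / support(lhs)

def rules_from(itemset):
    """Yield every binary partition lhs -> rhs of itemset with its metrics."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            lhs = frozenset(combo)
            rhs = frozenset(itemset) - lhs
            yield lhs, rhs, support(lhs | rhs), confidence(lhs, rhs)

rules = list(rules_from({"Milk", "Diaper", "Beer"}))
for lhs, rhs, s, c in rules:
    print(f"{sorted(lhs)} -> {sorted(rhs)}  s={s:.2f}  c={c:.2f}")
```

All six rules share one support value because they are computed from the same itemset, while the confidence changes with the left-hand side. This is why support can be tested once per itemset and confidence afterwards per rule.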
CROSS-INDUSTRY STANDARD PROCESS FOR DATA
MINING (CRISP-DM)
CRISP-DM: OVERVIEW
Business Understanding
• Understanding project objectives and
requirements; Data mining problem definition
Data Understanding
• Initial data collection and familiarization; Identify
data quality issues; Initial, obvious results
Data Preparation
• Record and attribute selection; Data cleansing
Modeling
• Run the data mining tools
Evaluation
• Determine if results meet business objectives;
Identify business issues that should have been
addressed earlier
Deployment
• Put the resulting models into practice; Set up for
continuous mining of the data