Lesson 6 - Data Mining
CONTENT
IT resources
Business group
DATA MANIPULATION
Manipulation on columns:
Transformation
Derivation
Elimination
Manipulation on rows:
Aggregation
Change detection
Missing value detection
Outlier detection
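The column and row manipulations listed above can be sketched on a small in-memory table. This is only an illustration: the customer records, the field names (income, spend, fax), the previous_regions snapshot and the outlier rule (more than 5 median absolute deviations from the median) are all assumptions made for the example, not part of the lesson.

```python
import math
import statistics

# Hypothetical customer records; None marks a missing value.
customers = [
    {"id": 1, "region": "north", "income": 42000.0, "spend": 1200.0, "fax": "n/a"},
    {"id": 2, "region": "north", "income": 58000.0, "spend": 900.0, "fax": "n/a"},
    {"id": 3, "region": "south", "income": None, "spend": 1100.0, "fax": "n/a"},
    {"id": 4, "region": "south", "income": 61000.0, "spend": 9800.0, "fax": "n/a"},
    {"id": 5, "region": "south", "income": 39000.0, "spend": 1000.0, "fax": "n/a"},
]
# Hypothetical earlier snapshot of the region column, used for change detection.
previous_regions = {1: "north", 2: "south", 3: "south", 4: "south", 5: "south"}

# --- Manipulation on columns ---
for row in customers:
    # Transformation: put spend on a log scale.
    row["log_spend"] = math.log(row["spend"])
    # Derivation: new variable built from two existing ones (None if income is missing).
    row["spend_ratio"] = (
        row["spend"] / row["income"] if row["income"] is not None else None
    )
    # Elimination: drop a column that carries no information.
    del row["fax"]

# --- Manipulation on rows ---
# Aggregation: total spend per region.
spend_by_region = {}
for row in customers:
    spend_by_region[row["region"]] = spend_by_region.get(row["region"], 0.0) + row["spend"]

# Change detection: customers whose region differs from the earlier snapshot.
changed_ids = [row["id"] for row in customers if previous_regions[row["id"]] != row["region"]]

# Missing value detection: customers without an income value.
missing_income_ids = [row["id"] for row in customers if row["income"] is None]

# Outlier detection: spend further than 5 median absolute deviations from the median.
spends = [row["spend"] for row in customers]
median_spend = statistics.median(spends)
mad = statistics.median([abs(s - median_spend) for s in spends])
outlier_ids = [row["id"] for row in customers if abs(row["spend"] - median_spend) > 5 * mad]

print(spend_by_region, changed_ids, missing_income_ids, outlier_ids)
```

A median-based rule is used here because a single extreme value would inflate a mean and standard deviation computed on so few rows; other outlier rules are equally valid.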
DATA PREPARATION
For modeling, incoming data is sampled and split into several streams:
Train set: Used to build models
Test set: Used for out-of-sample tests of the model quality and to
select the final model candidate
Scoring data: Used for model-based prediction
The data sets must be carefully examined and designed to ensure the
statistical significance of the results obtained
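As a minimal sketch of the split described above, the following shuffles labelled records and cuts them into a train set and a test set, while scoring data is kept as a separate stream without known outcomes. The 70/30 ratio, the fixed seed, the function name split_streams and the sample records are assumptions for illustration only.

```python
import random


def split_streams(records, train_fraction=0.7, seed=42):
    """Shuffle labelled records and split them into (train_set, test_set)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled history: (customer_id, purchased).
labelled = [(customer_id, customer_id % 4 == 0) for customer_id in range(100)]
# Scoring data: customers whose outcome is not yet known.
scoring_data = [(customer_id, None) for customer_id in range(100, 120)]

train_set, test_set = split_streams(labelled)
print(len(train_set), len(test_set), len(scoring_data))
```

Keeping the seed fixed makes the split repeatable, which also helps when the train, test and score data sets have to be archived later.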
DEFINE BUSINESS OBJECTIVES
A Cost / Revenue matrix describes how the business mechanics will work in the
supported campaign and gives business users an immediately interpretable table
Example: Call Center Campaign
Assume the average cost per call is $5. Each positive responder (purchaser) will
generate additional costs due to:
Administration work required to register the responder as a new customer
Cost of the delivered phone handset ($100)
Customers who respond positively will generate average revenue of $1,000 per
year
COST/REVENUE MATRIX – CALL CENTER CAMPAIGN
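The figures from the call center example can be turned into a small cost/revenue table. This is a sketch under stated assumptions: the lesson gives no amount for the administration work, so ADMIN_COST = 20 is a hypothetical placeholder; revenue is counted for the first year only; and customers who are not called are assumed to generate neither cost nor revenue.

```python
COST_PER_CALL = 5.0      # from the example
HANDSET_COST = 100.0     # from the example
ANNUAL_REVENUE = 1000.0  # from the example, first year only
ADMIN_COST = 20.0        # ASSUMPTION: not given in the lesson


def cost_revenue_matrix(admin_cost=ADMIN_COST):
    """Return cost, revenue and profit for each (called, purchaser) combination."""
    matrix = {}
    for called in (True, False):
        for purchaser in (True, False):
            buys = called and purchaser  # assumption: no purchase without a call
            cost = COST_PER_CALL if called else 0.0
            if buys:
                cost += HANDSET_COST + admin_cost
            revenue = ANNUAL_REVENUE if buys else 0.0
            matrix[(called, purchaser)] = {
                "cost": cost,
                "revenue": revenue,
                "profit": revenue - cost,
            }
    return matrix


matrix = cost_revenue_matrix()
for (called, purchaser), cell in matrix.items():
    print(f"called={called!s:5} purchaser={purchaser!s:5} {cell}")

# Response rate at which the expected profit per call is zero:
# p * profit_if_purchase + (1 - p) * profit_if_no_purchase = 0
gain = matrix[(True, True)]["profit"]
loss = -matrix[(True, False)]["profit"]
break_even_rate = loss / (gain + loss)
print(f"break-even response rate: {break_even_rate:.4%}")
```

With these assumed figures, a called purchaser yields 1000 - (5 + 100 + 20) = 875 in the first year, and a called non-purchaser costs 5.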
Data sourcing
Mixed top-down and bottom-up process driven by business requirements
(top) and technical restrictions (bottom)
Data warehouse infrastructures with advanced data cleansing processes can
help ensure working with high-quality data
All available metadata must be collected to fully understand data types,
value ranges and the primary/foreign key structures
Build a (simple) relational data model onto which the source data will be
mapped
STEP 2: LOADING THE DATA
Inspect the descriptive statistics of the univariate distributions of all available
variables
Variables that can be excluded
Taking on only one value (i.e. the variable is a constant)
With mostly missing values
Directly or indirectly identifying an individual customer
Showing collinearities
Showing very little correlation with the target variable
Containing personal identifiers
Check if all variables have been mapped to the appropriate data types
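Some of the exclusion criteria above can be checked automatically. The sketch below flags identifier columns, mostly missing variables, constants and variables with very little correlation to the target. The function name screen_variables, the thresholds (more than 50% missing, absolute Pearson correlation below 0.1) and the sample columns are assumptions for illustration; collinearity between predictors is not checked here.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def screen_variables(columns, target, identifiers=(), max_missing=0.5, min_abs_corr=0.1):
    """Map each variable that should be excluded to the reason for excluding it."""
    reasons = {}
    for name, values in columns.items():
        present = [(v, t) for v, t in zip(values, target) if v is not None]
        if name in identifiers:
            reasons[name] = "identifier"
        elif 1 - len(present) / len(values) > max_missing:
            reasons[name] = "mostly missing"
        elif len({v for v, _ in present}) <= 1:
            reasons[name] = "constant"
        elif abs(pearson([v for v, _ in present], [t for _, t in present])) < min_abs_corr:
            reasons[name] = "low correlation with target"
    return reasons


# Hypothetical data: target is 1 for purchasers, 0 for non-purchasers.
target = [1, 0, 1, 0, 1, 0, 1, 0]
columns = {
    "customer_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "country": ["DE"] * 8,
    "fax_usage": [None, None, None, None, None, 3.0, None, None],
    "calls_per_month": [30, 5, 28, 7, 35, 4, 31, 6],
    "shoe_size": [40, 40, 42, 42, 40, 40, 42, 42],
}

reasons = screen_variables(columns, target, identifiers={"customer_id"})
print(reasons)
```

Only calls_per_month survives the screening in this example; in practice the thresholds would be chosen per project.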
GAIN CUSTOMER INSIGHT
Two steps:
The rules (or linear / non-linear analytical models) are built based
on a training set
These rules are then applied to a new dataset for generating the
answers needed for the campaign
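The two steps can be illustrated with the simplest possible rule: a single threshold on one variable. This is a sketch only; the variable (calls per month), the function names fit_threshold_rule and apply_rule, and the data are hypothetical, and real projects would use richer rules or models.

```python
def fit_threshold_rule(train_rows):
    """Step 1: find the threshold that classifies the most training rows correctly.

    Each row is (value, purchased); the rule predicts a purchase when value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for candidate in sorted({value for value, _ in train_rows}):
        correct = sum((value >= candidate) == bool(label) for value, label in train_rows)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def apply_rule(threshold, new_values):
    """Step 2: apply the learned rule to customers with unknown outcome."""
    return [value >= threshold for value in new_values]


# Hypothetical training set: (calls_per_month, purchased).
train = [(30, 1), (5, 0), (28, 1), (7, 0), (35, 1), (4, 0), (12, 0), (22, 1)]
threshold = fit_threshold_rule(train)

# Hypothetical new customers to be targeted (or not) in the campaign.
predictions = apply_rule(threshold, [3, 25, 40, 10])
print(threshold, predictions)
```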
STEP 2: PREDICTIVE MODELING
Guidelines:
Distinguish between different types of predictive models obtained
through different modeling paradigms: supervised and unsupervised
modeling
Find the relationships between the variables describing the customers
that predict their likelihood of group membership (purchaser or
non-purchaser); assigning this likelihood is referred to as scoring (e.g., a score between 0 and 1)
Apply unsupervised modeling where group membership is not known
beforehand
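For the supervised case, scoring can be sketched with a one-variable logistic regression fitted by plain gradient descent. This is an illustrative sketch, not a production method: the data, the learning rate, the number of epochs and the helper names (fit_logistic, score) are assumptions.

```python
import math
import statistics


def sigmoid(z):
    """Numerically stable logistic function; maps any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data: calls per month and whether the customer purchased.
calls = [30, 5, 28, 7, 35, 4, 12, 22]
purchased = [1, 0, 1, 0, 1, 0, 0, 1]

# Standardize the input using training statistics only.
mean_calls = statistics.fmean(calls)
spread_calls = statistics.pstdev(calls)
train_x = [(c - mean_calls) / spread_calls for c in calls]
w, b = fit_logistic(train_x, purchased)


def score(calls_per_month):
    """Return the modeled purchase likelihood, a score between 0 and 1."""
    return sigmoid(w * (calls_per_month - mean_calls) / spread_calls + b)


scores = [score(c) for c in (3, 40)]
print(scores)
```

A score threshold then decides which customers enter the final target selection.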
STEP 3: SELECT MODEL
STEP 2: ARCHIVE RESULTS
Each data mining project produces a large amount of information, including:
Raw data used
Transformations for each variable
Formulas for creating derived variables
Train, test and score data sets
Target variable calculation
Models and their parameterizations
Score threshold levels
Final customer target selections
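One possible way to keep the listed items together is a single manifest file stored next to the project data. The sketch below serializes such a manifest as JSON. Every file name, formula and parameter value in it is a hypothetical placeholder, and the key names are an assumption about how the items might be labelled.

```python
import json

# All names, paths and values below are hypothetical placeholders.
archive_manifest = {
    "raw_data": ["customers_raw.csv"],
    "variable_transformations": {"spend": "log"},
    "derived_variable_formulas": {"spend_ratio": "spend / income"},
    "data_sets": {"train": "train.csv", "test": "test.csv", "score": "score.csv"},
    "target_variable": "purchased within campaign period",
    "models": [
        {"type": "logistic regression", "parameters": {"learning_rate": 0.1, "epochs": 2000}}
    ],
    "score_threshold": 0.5,
    "final_target_selection": "target_list.csv",
}

# Serialize with sorted keys so that two archives can be compared line by line.
manifest_json = json.dumps(archive_manifest, indent=2, sort_keys=True)
restored = json.loads(manifest_json)
print(manifest_json)
```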
Data Mining can assist in selecting the right target customers or in identifying previously
unknown customers with similar behavior and needs
A good target list is likely to increase purchase rates and has a positive impact on
revenue
In the context of CRM, the individual customer is often the central object analyzed by
means of data mining methods
A complete data mining process comprises assessing and specifying the business
objectives; data sourcing; transformation and creation of analytical variables;
building analytical models using techniques such as logistic regression and neural
networks; scoring customers; and obtaining feedback from the field
Learning and refining the data mining process is the key to success