
BIG DATA

Predictive Analytics Methodology

Lecturer: Lucrezia Noli


Lesson 2
CRISP - DM

• CRISP-DM stands for “Cross Industry Standard Process for Data Mining” and it’s
a widely used methodology to create predictive analytics solutions
• In 1996 the European Union financed the work to define the methodology, which
was carried out by four companies: SPSS, NCR Corporation, Daimler-Benz and OHRA.
• The first version was completed by 1999; in 2006 new work started to define a second
standard, CRISP-DM 2.0.
• This second version was never finished
• Nonetheless, CRISP-DM in its original version is widely used by companies
entering data mining projects
CRISP - DM

[Diagram: the CRISP-DM cycle, with the following stages]

ASSESSMENT
• Requirements, obstacles
• Risks & unexpected events
• Costs vs benefits

DATA EXPLORATION
• Understanding data sources
• Statistics
• Visual analysis
• Outliers analysis
• Quality assessment

DATA QUALITY FIXES
• Missing values

FEATURE ENGINEERING
• Aggregations
• Transformations
• Normalizations

SUBSETTING
• Training set
• Test set
• Validation set

MODELING
• Choice of algorithm
• Choice of parameters
• Training

EVALUATION
• Statistical metrics
• Economic metrics

OPERATIONALIZE
• Automatic re-training
• Automatic scoring
• Scoring on-demand

FEEDBACK
• Monitoring performance
• Review requirements
• Model review
Business understanding

What is the problem we want to solve?


• This is pivotal because it will define all the subsequent steps of our
analysis
• Do we have the data necessary to solve the problem?
• Do we have them internally, or do we have to request them from someone
else?
• What are the requirements of the project?
• How much will it cost? Do we have the skill-set necessary to carry
out such an analysis, or should we hire someone externally?
• What are the risks involved in the project?
• What are the best and worst case scenarios?
• Are benefits higher than costs?
Data understanding

• How do our data look?


• Statistical descriptive analysis (mean, sdev)
• How many variables are we dealing with, of what type?
• Is there correlation between variables?
• Are there missing values?

• Especially in the case of Big Data…


• What supports do we need to efficiently store and analyze all the data?
• Do we have unstructured data as well? In this case we will need to turn them
into structured form before we can carry out any analysis
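As a minimal sketch, these first exploratory checks could look like the following (assuming pandas and a hypothetical toy table standing in for the project's data):

```python
import pandas as pd

# Hypothetical toy dataset standing in for the project's data sources
df = pd.DataFrame({
    "age":    [23, 35, 41, None, 52],
    "income": [1800, 2500, 3100, 2700, 4000],
})

stats = df.describe()                 # mean, std, quartiles per numeric column
n_missing = df.isna().sum()           # missing values per column
corr = df["age"].corr(df["income"])   # pairwise correlation (NaN pairs dropped)
```

The same three lines already answer the questions above: how the variables are distributed, how many holes the dataset has, and whether variables move together.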
Data preparation

• Exclude constant variables


• Can you tell me why?
• Exclude/substitute missing values
• Most algorithms won’t work if they find holes in the dataset
• There are many ways to do this, it’s very important to understand how
• Standardize/normalize
• If we deal with data values on different scales, we will need to standardize
their scales so that variables are comparable
• Aggregate variables
• We don’t necessarily use variables the way they are initially presented, we
might want to make operations with them, and use a new aggregated
variable instead of the original ones
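A minimal sketch of two of these steps, dropping a constant variable and standardizing scales, assuming pandas and a hypothetical toy table:

```python
import pandas as pd

# Hypothetical toy table; "country" is constant, so it carries no information
df = pd.DataFrame({
    "country": ["IT", "IT", "IT"],
    "height":  [170.0, 180.0, 190.0],
    "weight":  [60.0, 80.0, 100.0],
})

# Exclude constant variables: a single unique value cannot discriminate outcomes
df = df.loc[:, df.nunique() > 1]

# Standardize remaining columns to zero mean and unit variance,
# so variables on different scales become comparable
df = (df - df.mean()) / df.std()
```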
Splitting the data

• For SUPERVISED models, we need to split the dataset in two parts:


• Training set
• Test set

• Can you tell me WHY?
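A minimal sketch of such a split, assuming NumPy and a hypothetical feature matrix (an 80/20 holdout):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1)    # hypothetical feature matrix

# Shuffle row indices, then hold out 20% as a test set the model never
# sees during training, so we can measure generalization honestly
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
```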


Modelling

• Identify which type of problem we are solving (classification, regression,


clustering, market basket analysis, …)
• Depending on this, data will be prepared differently, and we’ll have a number
of algorithms that can solve that kind of problem to choose from
• Select the algorithm(s)
• We cannot know which model is the best performing beforehand, we usually
try more than one and compare them
• Train the model on training data
• In the training phase, the model learns from the data and understands what
caused a certain output to occur
• Optimize parameters
• The true goal of training is to find those parameters which yield the best
predictive result (i.e. the most accurate)
Evaluation

• Compare model performance by calculating both:


• Statistical performance
• Economic performance

• Metrics of evaluation will depend on whether the model is supervised or
unsupervised
• For supervised models we have the «answer to our question» from the past
data
• For unsupervised models, we don’t, and will have to use different metrics
• Metrics of evaluation also depend on whether the model is:
• a classification,
• a regression,
• a clustering analysis
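A toy sketch of the two kinds of metric side by side; the per-hit gain and per-miss cost here are assumed business values, not part of the methodology:

```python
# Statistical metric: accuracy of a hypothetical classifier's predictions
actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Economic metric: translate hits and misses into money (assumed values)
gain_per_hit, cost_per_miss = 100, 40
profit = sum(gain_per_hit if a == p else -cost_per_miss
             for a, p in zip(actual, predicted))
```

Two models with the same accuracy can yield very different profit once the business values are attached, which is why both metrics are computed.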
Deployment

• Operationalize the model by including it in the client’s business


structure
• For example it could be that our model’s predictions need to be
provided to the users of an app. In this case the model will have to
communicate with the app and provide predictions when required
• Or it might be included in a production pipeline, to decide whether
products are likely to have defects and shouldn’t be delivered
• It could be used to create alerts, so that our prediction has to be
connected to some signaling device
•…
Boxplot

[Figure: boxplot, with outliers above the top limit and below the bottom limit]

• r = Q3 - Q1 (interquartile range)
• Bottom limit = Q1 - 1.5r
• Top limit = Q3 + 1.5r
• The box spans Q1 to Q3, with the mean marked inside; points beyond the
limits are flagged as outliers
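These limits can be computed directly; a minimal sketch assuming NumPy and a small toy sample with one planted outlier:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])   # 100 is a planted outlier

q1, q3 = np.percentile(x, [25, 75])
r = q3 - q1                                    # interquartile range
bottom, top = q1 - 1.5 * r, q3 + 1.5 * r       # whisker limits
outliers = x[(x < bottom) | (x > top)]         # points beyond the limits
```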
Candle Plot
[Figure: side-by-side boxplots of a numeric variable for each class A, B, C
of a class variable]
One hot encoding (binarization)

Gender   Female   Male
M        0        1
F        1        0
F        1        0
M        0        1
M        0        1
F        1        0
M        0        1
F        1        0
M        0        1
M        0        1
F        1        0
F        1        0

Notice:
• We are adding N new columns, where N is the number of classes of the
transformed variable
• We will only need N-1 columns to express the transformed variable
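A minimal sketch of this binarization, assuming pandas; `get_dummies` is one common way to produce the N (or N-1) columns:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})

# One 0/1 column per class; pandas names them after the category labels,
# sorted alphabetically (here "F" before "M")
encoded = pd.get_dummies(df["Gender"]).astype(int)
encoded.columns = ["Female", "Male"]

# drop_first=True keeps only N-1 columns, which is enough to express the variable
compact = pd.get_dummies(df["Gender"], drop_first=True).astype(int)
```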
Target rate encoding

Notice:
• We are not adding any additional columns
• We are using the target variable to calculate the numerical value used to
substitute the categories…
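A minimal sketch, assuming pandas, a hypothetical `city` feature and a binary target:

```python
import pandas as pd

# Hypothetical categorical feature and binary target
df = pd.DataFrame({
    "city":   ["Rome", "Rome", "Milan", "Milan", "Milan"],
    "target": [1, 0, 1, 1, 0],
})

# Replace each category with the target rate (mean of the target) observed
# within that category: no new columns are added
rates = df.groupby("city")["target"].mean()
df["city"] = df["city"].map(rates)
```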
Why transform categories to numbers?

Most algorithms rely on distances or on (non-)linear transformations
expressed as equations
• Linear regression fits a line to a set of points, which are represented as
vectors
• Logistic regression identifies a line that divides groups of points, which are
again represented as vectors
• Clustering algorithms calculate distances between points
• Neural networks are a set of logistic (or more complex) functions, all based
on vectors

➢ Obviously, transforming a category into a number makes it harder to
understand the meaning of a variable.
➢ We face a trade-off between readability and the possibility of using certain
algorithms
Create new instances
• When there is an imbalance between classes of the target variable (e.g. fraud/not
fraud), it’s very difficult for the model to learn to predict the minority class

• It is thus possible to create new instances of the minority class

• Oversampling with replacement

• SMOTE: Synthetic Minority Over-sampling TEchnique
- See “SMOTE: Synthetic Minority Over-sampling Technique”, Chawla et
al., 2002
Oversampling with replacement
• Minority cases are just replicated

• It has been shown that this doesn’t really improve the predictive power
of our model, because we’re not adding any new information
- See Ling & Li, 1998; Japkowicz, 2000
SMOTE
• SMOTE creates new instances by mixing the features of a group of neighboring
observations

• Differently from the «oversampling with replacement» technique, cases are not just
replicated but actually created
• How synthetic cases are created:
• Calculate the difference between the feature vectors of a certain number of
«nearest neighbors» which belong to the minority class
• Multiply this difference by a multiplier randomly picked between 0 and 1. This
process creates the feature vector of the new observation
• In this way the new instance is positioned on the segment which connects
the initially picked nearest neighbors
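A minimal sketch of the interpolation step, assuming NumPy and hypothetical 2-D minority-class points (a single nearest-neighbor is used here for brevity, not the full SMOTE procedure):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class feature vectors (2-D toy data on the y = x line)
minority = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])

def smote_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                   # exclude the point itself
    j = int(np.argmin(dists))           # nearest minority-class neighbor
    gap = rng.random()                  # random multiplier in [0, 1)
    # The new instance lies on the segment connecting the two points
    return points[i] + gap * (points[j] - points[i])

new_point = smote_sample(minority, rng)
```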
Exercise on data preparation

• What to note when reading a table


• Base descriptive statistics
• Missing data analysis
• Outliers analysis
• Constant variables
Descriptive statistics

• Are there ID-like variables? → can’t include them in a predictive model


• Mean and standard deviation → is the variable constant?
• Outliers → are there out-of-range values?
• Missing values → how many do we have? How do we deal with them?
• Classes proportion → do we have the same proportion of observations
among different classes of a variable?
Handling missing values

• Exclude the entire variable → we lose a lot of information
• Exclude rows with missing values → shrinks the number of observations
• Substitute → how?
• Mean
• Most frequent value
• Linear interpolation
• More advanced methods: use a function to predict the missing value
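A minimal sketch of two of the substitution options, assuming pandas and a hypothetical series with holes:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

filled_mean   = s.fillna(s.mean())   # substitute with the mean of observed values
filled_interp = s.interpolate()      # linear interpolation between neighbors
```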
