Big Data Lesson 2 Lucrezia Noli
• CRISP-DM stands for “Cross Industry Standard Process for Data Mining”; it is a
widely used methodology for building predictive analysis solutions
• In 1996 the European Union financed the work to define the methodology, which
was carried out by four companies: SPSS, NCR Corporation, Daimler-Benz and OHRA
• The first version was defined by 1999; in 2006 work started on a second
standard, CRISP-DM 2.0
• This second version was never finished
• Nonetheless, CRISP-DM in its original version is widely used by companies
undertaking data mining projects
CRISP-DM
DATA EXPLORATION
• Understanding data sources
• Statistics
• Visual analysis
• Outliers analysis
• Quality assessment
SUBSETTING
• Training set
• Test set
• Validation set
OPERATIONALIZE
• Automatic re-training
• Automatic scoring
• Scoring on-demand
FEEDBACK
• Monitoring performance
• Review requirements
• Model review
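The SUBSETTING step listed above divides the data into a training set, a validation set and a test set. A minimal sketch of such a split follows. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are illustrative assumptions, not values given in the lesson.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle the rows and split them into training, validation and test sets.

    The fractions are hypothetical defaults; whatever is left after the
    training and validation sets goes to the test set.
    """
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing avoids a split that follows the original row order, which matters when the data is sorted by time or by class.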
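Box plot (figure): limits used to identify outliers
• Q1 and Q3 are the first and third quartiles; the box spans from Q1 to Q3
• The line inside the box marks the central value of the variable (the median, in a standard box plot)
• r = Q3 - Q1
• Bottom limit = Q1 - 1.5r
• Top limit = Q3 + 1.5r
• Observations below the bottom limit or above the top limit are outliers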
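The box plot limits can be computed directly from the quartiles. The sketch below uses made-up sample values and the hypothetical helper `outlier_limits`. The quartile convention is an assumption: it is the default of Python's `statistics.quantiles`, and other tools may return slightly different Q1 and Q3.

```python
import statistics


def outlier_limits(values):
    """Return (bottom_limit, top_limit) using the 1.5 * r rule, r = Q3 - Q1."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _median, q3 = statistics.quantiles(values, n=4)
    r = q3 - q1
    return q1 - 1.5 * r, q3 + 1.5 * r


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # illustrative values only
bottom, top = outlier_limits(data)
outliers = [x for x in data if x < bottom or x > top]
print(bottom, top, outliers)  # -2.75 15.25 [30]
```

Here Q1 = 4 and Q3 = 8.5, so r = 4.5 and only the value 30 falls outside the limits.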
Candle Plot (figure): distribution of a number variable for each value (A, B, C) of a class variable
One hot encoding (binarization)
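As a small illustration of one hot encoding, the sketch below turns a class variable with values A, B and C into one binary column per value. The function name `one_hot_encode` and the sample values are assumptions made for this example.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    # Each row has a single 1, in the column of the value's category
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


categories, encoded = one_hot_encode(["A", "C", "B", "A"])
print(categories)  # ['A', 'B', 'C']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each class value becomes its own 0/1 column, so a model that only accepts numeric inputs can use the variable without an artificial ordering between A, B and C.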
Notice:
• Studies have found that oversampling with replacement (simply replicating
minority-class cases) does not really improve the predictive power of our
model, because we are not really adding information
- See Ling & Li, 1998; Japkowicz, 2000
SMOTE
• SMOTE creates new instances by mixing the features of a group of neighbor
observations
• Unlike the «oversampling with replacement» technique, cases are not just
replicated but actually created
• How synthetic cases are created:
• Take a minority-class observation and calculate the difference between its
feature vector and that of one of its «nearest neighbors» belonging to the
minority class
• Multiply this difference by a multiplier randomly picked between 0 and 1 and
add the result to the feature vector of the original observation. This process
creates the feature vector of the new observation
• In this way the new instance is positioned on the segment which connects
the observation and the picked nearest neighbor
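The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reference implementation of SMOTE: the function name `smote`, its parameters `n_new`, `k` and `seed`, the Euclidean distance and the sample points are all hypothetical choices made for the example.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority observations")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority observation and find its k nearest minority neighbors
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random multiplier between 0 and 1 applied to the difference vector
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
synthetic = smote(minority, n_new=5)
print(synthetic)
```

Because the multiplier lies between 0 and 1, every synthetic point falls on the segment between the chosen observation and its neighbor, so the new cases stay inside the region already occupied by the minority class.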
Exercise on data preparation