
BIG DATA

Predictive Analytics Methodology

Lecturer: Lucrezia Noli


Lesson 2
CRISP - DM

• CRISP-DM stands for “Cross Industry Standard Process for Data Mining” and it’s
a widely used methodology to create predictive analytics solutions
• In 1996 the European Union financed the work to define the methodology, which
was carried out by four companies: SPSS, NCR Corporation, Daimler-Benz and OHRA.
• The first version was completed by 1999; in 2006 new work started to define a second
standard, CRISP-DM 2.0.
• This second version was never finished
• Nonetheless, CRISP-DM in its original version is widely used by companies
entering data mining projects
CRISP - DM

[Diagram: the CRISP-DM cycle, with the following stages]

ASSESSMENT
• Requirements, obstacles
• Risks & unexpected events
• Costs vs benefits

DATA EXPLORATION
• Understanding data sources
• Statistics
• Visual analysis
• Outliers analysis
• Quality assessment

DATA QUALITY FIXES
• Missing values

FEATURE ENGINEERING
• Aggregations
• Transformations
• Normalizations

SUBSETTING
• Training set
• Test set
• Validation set

MODELING
• Choice of algorithm
• Choice of parameters
• Training

EVALUATION
• Statistical metrics
• Economic metrics

OPERATIONALIZE
• Automatic re-training
• Automatic scoring
• Scoring on-demand

FEEDBACK
• Monitoring performance
• Review requirements
• Model review
Business understanding

What is the problem we want to solve?


• This is pivotal because it will define all the subsequent steps of our
analysis
• Do we have the data necessary to solve the problem?
• Do we have them internally, or do we have to request them from someone
else?
• What are the requirements of the project?
• How much will it cost? Do we have the skill-set necessary to carry
out such an analysis, or should we hire someone externally?
• What are the risks involved in the project?
• What are the best and worst case scenarios?
• Are benefits higher than costs?
Data understanding

• How do our data look?


• Statistical descriptive analysis (mean, sdev)
• How many variables are we dealing with, of what type?
• Is there correlation between variables?
• Are there missing values?

• Especially in the case of Big Data…


• What supports do we need to efficiently store and analyze all the data?
• Do we have unstructured data as well? In this case we will need to turn them
into structured form before we can carry out any analysis
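As a minimal sketch, these first exploratory checks could look like the following (assuming pandas and a hypothetical toy table standing in for the project's data):

```python
import pandas as pd

# Hypothetical toy dataset standing in for the project's data sources
df = pd.DataFrame({
    "age":    [23, 35, 41, None, 52],
    "income": [1800, 2500, 3100, 2700, 4000],
})

stats = df.describe()                 # mean, std, quartiles per numeric column
n_missing = df.isna().sum()           # missing values per column
corr = df["age"].corr(df["income"])   # pairwise correlation (NaN pairs dropped)
```

The same three lines already answer the questions above: how the variables are distributed, how many holes the dataset has, and whether variables move together.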
Data preparation

• Exclude constant variables


• Can you tell me why?
• Exclude/substitute missing values
• Most algorithms won’t work if they find holes in the dataset
• There are many ways to do this, it’s very important to understand how
• Standardize/normalize
• If we deal with data values on different scales, we will need to standardize
their scales so that variables are comparable
• Aggregate variables
• We don’t necessarily use variables the way they are initially presented, we
might want to make operations with them, and use a new aggregated
variable instead of the original ones
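A minimal sketch of two of these steps, dropping a constant variable and standardizing scales, assuming pandas and a hypothetical toy table:

```python
import pandas as pd

# Hypothetical toy table; "country" is constant, so it carries no information
df = pd.DataFrame({
    "country": ["IT", "IT", "IT"],
    "height":  [170.0, 180.0, 190.0],
    "weight":  [60.0, 80.0, 100.0],
})

# Exclude constant variables: a single unique value cannot discriminate outcomes
df = df.loc[:, df.nunique() > 1]

# Standardize remaining columns to zero mean and unit variance,
# so variables on different scales become comparable
df = (df - df.mean()) / df.std()
```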
Splitting the data

• For SUPERVISED models, we need to split the dataset in two parts:


• Training set
• Test set

• Can you tell me WHY?
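A minimal sketch of such a split, assuming NumPy and a hypothetical feature matrix (an 80/20 holdout):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(100).reshape(100, 1)    # hypothetical feature matrix

# Shuffle row indices, then hold out 20% as a test set the model never
# sees during training, so we can measure generalization honestly
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
```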


Modelling

• Identify which type of problem we are solving (classification, regression,


clustering, market basket analysis, …)
• Depending on this, data will be prepared differently, and we’ll have a number
of algorithms that can solve that kind of problem to choose from
• Select the algorithm(s)
• We cannot know which model is the best performing beforehand, we usually
try more than one and compare them
• Train the model on training data
• In the training phase, the model learns from the data and understands what
caused a certain output to occur
• Optimize parameters
• The true goal of training is to find those parameters which yield the best
predictive result (i.e. the most accurate)
Evaluation

• Compare model performance by calculating both:


• Statistical performance
• Economic performance

• Metrics of evaluation will depend on whether the model is supervised or
unsupervised
• For supervised models we have the «answer to our question» from the past
data
• For unsupervised models, we don’t, and will have to use different metrics
• Metrics of evaluation also depend on whether the model is:
• a classification,
• a regression,
• a clustering analysis
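A toy sketch of the two kinds of metric side by side; the per-hit gain and per-miss cost here are assumed business values, not part of the methodology:

```python
# Statistical metric: accuracy of a hypothetical classifier's predictions
actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Economic metric: translate hits and misses into money (assumed values)
gain_per_hit, cost_per_miss = 100, 40
profit = sum(gain_per_hit if a == p else -cost_per_miss
             for a, p in zip(actual, predicted))
```

Two models with the same accuracy can yield very different profit once the business values are attached, which is why both metrics are computed.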
Deployment

• Operationalize the model by including it in the client’s business


structure
• For example it could be that our model’s predictions need to be
provided to the users of an app. In this case the model will have to
communicate with the app and provide predictions when required
• Or it might be included in a production pipeline, to decide whether
products are likely to have defects and shouldn’t be delivered
• It could be used to create alerts, so that our prediction has to be
connected to some signaling device
•…
Boxplot

[Figure: boxplot, with outliers above the top limit and below the bottom limit]

• r = Q3 - Q1 (interquartile range)
• Bottom limit = Q1 - 1.5r
• Top limit = Q3 + 1.5r
• The box spans Q1 to Q3, with the mean marked inside; points beyond the
limits are flagged as outliers
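These limits can be computed directly; a minimal sketch assuming NumPy and a small toy sample with one planted outlier:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])   # 100 is a planted outlier

q1, q3 = np.percentile(x, [25, 75])
r = q3 - q1                                    # interquartile range
bottom, top = q1 - 1.5 * r, q3 + 1.5 * r       # whisker limits
outliers = x[(x < bottom) | (x > top)]         # points beyond the limits
```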
Candle Plot
[Figure: side-by-side boxplots of a numeric variable for each class A, B, C
of a class variable]
One hot encoding (binarization)

Gender   Female   Male
M        0        1
F        1        0
F        1        0
M        0        1
M        0        1
F        1        0
M        0        1
F        1        0
M        0        1
M        0        1
F        1        0
F        1        0

Notice:
• We are adding N new columns, where N is the number of classes of the
transformed variable
• We will only need N-1 columns to express the transformed variable
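A minimal sketch of this binarization, assuming pandas; `get_dummies` is one common way to produce the N (or N-1) columns:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})

# One 0/1 column per class; pandas names them after the category labels,
# sorted alphabetically (here "F" before "M")
encoded = pd.get_dummies(df["Gender"]).astype(int)
encoded.columns = ["Female", "Male"]

# drop_first=True keeps only N-1 columns, which is enough to express the variable
compact = pd.get_dummies(df["Gender"], drop_first=True).astype(int)
```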
Target rate encoding

Notice:
• We are not adding any additional columns
• We are using the target variable to calculate the numerical value used to
substitute the categories…
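A minimal sketch, assuming pandas, a hypothetical `city` feature and a binary target:

```python
import pandas as pd

# Hypothetical categorical feature and binary target
df = pd.DataFrame({
    "city":   ["Rome", "Rome", "Milan", "Milan", "Milan"],
    "target": [1, 0, 1, 1, 0],
})

# Replace each category with the target rate (mean of the target) observed
# within that category: no new columns are added
rates = df.groupby("city")["target"].mean()
df["city"] = df["city"].map(rates)
```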
Why transform categories to numbers?

Most algorithms rely on distances or on (non-)linear transformations
expressed as equations
• Linear regression fits a line to a set of points, which are represented as
vectors
• Logistic regression identifies a line that divides groups of points, which are
again represented as vectors
• Clustering algorithms calculate distances between points
• Neural networks are a set of logistic (or more complex) functions, all based
on vectors

➢ Obviously, transforming a category into a number makes it harder to
understand the meaning of a variable.
➢ We face a trade-off between readability and the possibility of using certain
algorithms
Create new instances
• When there is an imbalance between classes of the target variable (e.g. fraud/not
fraud), it’s very difficult for the model to learn to predict the minority class

• It is thus possible to create new instances of the minority class

• Oversampling with replacement

• SMOTE: Synthetic Minority Over-sampling TEchnique
- See “SMOTE: Synthetic Minority Over-sampling Technique”, Chawla et
al., 2002
Oversampling with replacement
• Minority cases are just replicated

• It has been shown that this doesn’t really improve the predictive power
of our model, because we’re not adding any new information
- See Ling & Li, 1998; Japkowicz, 2000
SMOTE
• SMOTE creates new instances by mixing the features of a group of neighboring
observations

• Differently from the «oversampling with replacement» technique, cases are not just
replicated but actually created
• How synthetic cases are created:
• Calculate the difference between the feature vectors of a certain number of
«nearest neighbors» which belong to the minority class
• Multiply this difference by a multiplier randomly picked between 0 and 1. This
process creates the feature vector of the new observation
• In this way the new instance is positioned on the segment which connects
the initially picked nearest neighbors
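A minimal sketch of the interpolation step, assuming NumPy and hypothetical 2-D minority-class points (a single nearest-neighbor is used here for brevity, not the full SMOTE procedure):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class feature vectors (2-D toy data on the y = x line)
minority = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])

def smote_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                   # exclude the point itself
    j = int(np.argmin(dists))           # nearest minority-class neighbor
    gap = rng.random()                  # random multiplier in [0, 1)
    # The new instance lies on the segment connecting the two points
    return points[i] + gap * (points[j] - points[i])

new_point = smote_sample(minority, rng)
```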
Exercise on data preparation

• What to note when reading a table


• Base descriptive statistics
• Missing data analysis
• Outliers analysis
• Constant variables
Descriptive statistics

• Are there ID-like variables? → can’t include them in a predictive model


• Mean and standard deviation → is the variable constant?
• Outliers → are there out-of-range values?
• Missing values → how many do we have? How do we deal with them?
• Classes proportion → do we have the same proportion of observations
among different classes of a variable?
Handling missing values

• Exclude the entire variable → we lose a lot of information
• Exclude rows with missing values → shrinks the number of observations
• Substitute → how?
• Mean
• Most frequent value
• Linear interpolation
• More advanced methods: use a function to predict the missing value
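A minimal sketch of two of the substitution options, assuming pandas and a hypothetical series with holes:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

filled_mean   = s.fillna(s.mean())   # substitute with the mean of observed values
filled_interp = s.interpolate()      # linear interpolation between neighbors
```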
