Information Management School (IMS)
6. DATA PREPARATION
6.1
Introduction
Data preparation
Data preparation
[Figure: the original dataset is transformed into the modeling dataset, also known as the Analytical Base Table (ABT)]
Why data preparation
Due to their size and multiple, heterogeneous sources, real-world databases
commonly have:
§ “Noise” (random error or variance)
§ Missing data
§ Inconsistent data
For this reason, data must be preprocessed and prepared to improve the
efficiency and ease of the mining process.
Forms of data preprocessing and preparation
Han et al. (2012)
6.2
Data cleaning
Data preparation
Objective
Fix variable problems, such as:
§Duplicates
§Redundancy
§Incorrect or miscoded values
§Outliers
§Missing values
Duplicates
It is common for real-world datasets to have duplicate instances, even when
they should not exist (e.g., having two instances of the same customer profile)
§ If instances are exact-match duplicates, with all columns having the same
values, most of the time all duplicated instances could be deleted (except
one, of course)
§ If some columns are not equal (e.g., if there are two customer instances with
the same name, telephone, and address but a different volume of purchases),
aggregations may be required. In this example, sales would need to be
summed up and one of the instances deleted afterwards, as in the sketch below
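A minimal pandas sketch of both cases, using hypothetical customer data (all column names and values are illustrative):

```python
import pandas as pd

# Hypothetical customer data: customer 1 appears twice with different purchase volumes
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "name": ["Ana", "Ana", "Ana", "Bruno"],
    "purchases": [100.0, 100.0, 50.0, 80.0],
})

# Exact-match duplicates: keep a single instance of each
df = df.drop_duplicates()

# Near-duplicates (same customer, different purchases): aggregate, keeping one row
df = df.groupby(["customer_id", "name"], as_index=False)["purchases"].sum()
```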
Redundancy
When two attributes are redundant, one of them should not be included in the
modeling dataset. Removing highly correlated attributes is a typical example
(see the sketch below)
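A minimal sketch of dropping one attribute from each highly correlated pair; the 0.9 threshold is an illustrative choice, not a rule from the slides:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one attribute of each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```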
Incorrect or miscoded values
[Figure: example of a Yes/No field containing miscoded values]
Approaches to handling
outliers
Data cleaning
Approaches to handling outliers (1/6)
Remove from the modeling data
Use when outliers distort the models more than they help, namely in numeric
algorithms such as K-means clustering or Principal Component Analysis (PCA)
Approaches to handling outliers (2/6)
Separate the outliers and create models just for them
§ Relax the definition of outliers from two standard
deviations from the mean to three standard deviations
§ Create a separate model for outliers (e.g., linear
regression)
Approaches to handling outliers (3/6)
Transform the outliers so they
are no longer outliers
§ Apply skew transformation or
normalization techniques to
reduce the distance between
the outliers and the main body
of the distribution
§ Apply a MIN or MAX function
based on "valid" minimum or
maximum values (see the sketch below)
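A minimal sketch of both ideas on a hypothetical salary series; the 1st/99th percentile limits are an illustrative choice of "valid" minimum and maximum:

```python
import numpy as np
import pandas as pd

salary = pd.Series([35_000, 42_000, 51_000, 38_000, 900_000])  # one extreme value

# Skew transformation: the log pulls the outlier towards the main body
salary_log = np.log1p(salary)

# MIN/MAX capping: clip to "valid" limits (here, the 1st and 99th percentiles)
salary_capped = salary.clip(lower=salary.quantile(0.01), upper=salary.quantile(0.99))
```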
Approaches to handling outliers (4/6)
Transform the outlier and
create an indicator column
Apply skew transformation or
normalization techniques as in
the previous approach but,
additionally, create a dummy
column indicating whether the
observation is an outlier (0: no;
1: yes)
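A minimal sketch with a hypothetical salary column; defining outliers as more than 3 standard deviations from the mean follows the earlier slide, but the exact rule is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [35_000.0, 42_000.0, 51_000.0, 900_000.0]})

# Dummy column flagging outliers, here defined as |z-score| > 3 (0: no; 1: yes)
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
df["salary_outlier"] = (z.abs() > 3).astype(int)

# Then transform the variable itself, as in the previous approach
df["salary"] = np.log1p(df["salary"])
```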
Approaches to handling outliers (5/6)
Bin the data (discretize the data)
Because transformations may not tame the most extreme outliers, an
alternative is to transform the numeric variable into a categorical one
(e.g., instead of the salary amount, use low, medium, high)
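A minimal sketch with pd.cut; the bin edges and the salary values are illustrative assumptions:

```python
import pandas as pd

salary = pd.Series([25_000, 40_000, 75_000, 2_000_000])

# Explicit bin edges turn the numeric variable into low/medium/high categories;
# the extreme outlier simply falls into the top bin
salary_bin = pd.cut(
    salary,
    bins=[0, 30_000, 60_000, float("inf")],
    labels=["low", "medium", "high"],
)
```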
Approaches to handling outliers (6/6)
Leave in the data without modification
Use algorithms that are unaffected by outliers, such as decision tree-based
algorithms
Approaches to handling
missing values
Data cleaning
Approaches to handling missing values (1/6)
Listwise and column deletion
§If a small percentage of observations
have columns with missing values, just
remove those observations
§If a specific column has many missing
values, consider removing it
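A minimal pandas sketch of both deletions, on hypothetical data where 'age' is rarely missing and 'fax' is almost entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 33],
    "fax": [np.nan, np.nan, np.nan, "123"],
})

df = df.dropna(subset=["age"])   # listwise deletion: drop the few rows missing 'age'
df = df.drop(columns=["fax"])    # column deletion: 'fax' is almost entirely missing
```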
Approaches to handling missing values (2/6)
Imputation with a constant
§For categorical variables, this is as
simple as filling the missing values with a
value indicating that it is missing (e.g.,
"NULL")
§For numeric variables, if 0 (zero) makes
sense (e.g., bank balance), then fill with
0 (zero). Otherwise, try another
approach, such as mean or median
imputation
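A minimal sketch of both constants, on hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["Corporate", np.nan, "SME"],
    "balance": [100.0, np.nan, 250.0],
})

df["segment"] = df["segment"].fillna("NULL")  # categorical: explicit "missing" level
df["balance"] = df["balance"].fillna(0)       # numeric: zero is meaningful for a balance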
Approaches to handling missing values (3/6)
Mean and median imputation (for
continuous variables)
One of the most common approaches for
continuous variables is imputing the mean
value. However, if the distribution is
skewed, the median can be a better choice
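A minimal sketch on a hypothetical right-skewed income series, showing why the median is preferable under skew:

```python
import numpy as np
import pandas as pd

income = pd.Series([1_000, 1_200, 1_100, np.nan, 50_000])  # right-skewed

income_mean = income.fillna(income.mean())      # imputed value is pulled up by the extreme
income_median = income.fillna(income.median())  # more representative of the typical value
```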
Approaches to handling missing values (4/6)
Imputation with distributions
For numeric variables, when a large
percentage of values is missing,
mean/median imputation distorts the
summary statistics. In these cases,
each missing value should be
replaced with a random draw from a
known distribution (fitted to the
variable's distribution)
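A minimal sketch, assuming the variable is approximately normal (the data and the distributional assumption are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
height = pd.Series([170.0, np.nan, 182.0, 165.0, np.nan, 175.0])

# Draw replacements from N(mean, std) estimated on the observed values
mask = height.isna()
height.loc[mask] = rng.normal(height.mean(), height.std(), size=mask.sum())
```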
Approaches to handling missing values (5/6)
Random imputation from the variable's own distribution
For each missing value, randomly
select one of the non-missing
values already present in the
column
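A minimal sketch, sampling with replacement from the observed values of a hypothetical column:

```python
import numpy as np
import pandas as pd

city = pd.Series(["Lisboa", np.nan, "Porto", "Lisboa", np.nan])

# For each missing value, draw (with replacement) from the non-missing values
mask = city.isna()
city.loc[mask] = (
    city.dropna().sample(n=mask.sum(), replace=True, random_state=42).to_numpy()
)
```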
Additional consideration on missing values
Creation of dummy variables
In some cases, the existence of missing values can be informative for the
model. In those cases, besides implementing one of the previous approaches, a
dummy variable can be created to indicate whether there is a missing value
(0: no; 1: yes)
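A minimal sketch on a hypothetical income column; note that the indicator must be created before imputing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1_000.0, np.nan, 1_200.0]})

# Create the indicator BEFORE imputing, otherwise the information is lost
df["income_missing"] = df["income"].isna().astype(int)  # 0: no; 1: yes
df["income"] = df["income"].fillna(df["income"].median())
```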
6.3
Data reduction
Data preparation
Dimensionality reduction
Data reduction
The curse of dimensionality
As the number of candidate variables for modeling increases, the number of
observations must also increase (exponentially) to capture the high-
dimensional patterns. One way to address this problem is to reduce the
number of dimensions
source: http://www.turingfinance.com
Attribute subset selection
Datasets may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant. For example, for segmenting
customers, the telephone number is likely irrelevant
Attribute subset selection types
§Filter: uses statistical tests (Pearson correlation, Chi-squared, etc.), as in the sketch below
§Wrapper: uses ML to select the features to use (forward selection,
backward selection, or another method)
§Embedded: included in the algorithm itself (e.g., Decision trees)
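A minimal sketch of the filter approach with scikit-learn, using the Iris dataset purely for illustration (chi-squared requires non-negative features, which holds here):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Filter approach: keep the 2 features with the highest chi-squared score
X_selected = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```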
Techniques for dimensionality reduction
§Principal Components Analysis (PCA): reduces dimensionality,
while retaining as much variance in data as possible (finds a new
set of variables that are a linear combination of the original
variables)
§Kernel PCA (KPCA): nonlinear variation of PCA
§Linear Discriminant Analysis (LDA): supervised learning
method that transforms a set of features into a new set that best separates the classes
§Singular Value Decomposition (SVD): extracts the important features
from the data while reconstructing the original dataset into a smaller
one (e.g., transforming a 1 024-pixel image into 66 pixels)
§Among others
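A minimal PCA sketch with scikit-learn, again using the Iris dataset only as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # reduce 4 dimensions to 2
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance retained by each component
```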
Numerosity reduction
Data reduction
Numerosity reduction (1/2)
Methods:
§Aggregations: aggregate the data at a different unit of analysis
(e.g., weekly data instead of daily data)
§Clustering: cluster representations of the data are used to replace
the actual data
§Parametric data reduction: regression and log-linear models are
used to "predict" an output based on a set of inputs (e.g., using
multivariate linear regression to transform a set of variables into only
one)
Numerosity reduction (2/2)
Methods (cont.):
§Sampling: represent the data with a random sample (RS) of the instances
(e.g., a random sample of size s = 4)
Beware of survivorship bias: concentrating on the instances
that passed some selection or sampling process and
overlooking those that did not
(one focuses on what one can see and
ignores what one cannot see)
6.4
Data transformation
Data preparation
Normalization
Data transformation
Normalization
§ Some algorithms, such as the K-MEANS algorithm, have
difficulty in covering variables in very different ranges (e.g., age
in the range of [15, 80] and salary in the range [30 000, 80 000]
§ Linear regression coefficients are also influenced
disproportionately by the large values of a skewed distribution
§ Normalization can make a continuous variable fall within a
specific range while maintaining the relative differences between
the values for the variable
Common normalization techniques
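The original slide presents these techniques graphically; as a minimal sketch, two of the most common ones, min-max normalization and z-score standardization, can be computed directly (the values are illustrative):

```python
import pandas as pd

age = pd.Series([15, 40, 80])

# Min-max normalization: x' = (x - min) / (max - min), maps to [0, 1]
age_minmax = (age - age.min()) / (age.max() - age.min())

# Z-score standardization: x' = (x - mean) / std, gives mean 0 and std 1
age_zscore = (age - age.mean()) / age.std()
```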
Measures scaling
Normalization techniques are also used to bring measurements expressed in
different scales to the same scale
Example
Tripadvisor’s reviews rating scale: [1, 5]
Booking.com’s reviews rating scale: [2.5, 10]
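A sketch of mapping the Booking.com range onto Tripadvisor's, using the general linear rescaling x' = new_min + (x − old_min) · (new_max − new_min) / (old_max − old_min):

```python
def rescale(x, old_min=2.5, old_max=10.0, new_min=1.0, new_max=5.0):
    """Map a value linearly from [old_min, old_max] to [new_min, new_max]."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

print(rescale(2.5))   # 1.0 -> Booking's minimum becomes Tripadvisor's minimum
print(rescale(10.0))  # 5.0 -> and the maximum becomes the maximum
print(rescale(7.5))   # ~3.67
```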
Feature engineering
Data transformation
Feature engineering
The creation of new features (also known as "derived variables" or
"derived attributes") adds more value to the quality of the data
than any other modeling step
Distributions and possible “corrections”
Abbott (2014)
Binning (discretizing) variables (1/2)
Abbott (2014)
www.towardsdatascience.com
Binning (discretizing) variables (2/2)
Other possible transformations
Encode categorical variables – Label encoding (1/3)
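The original slide illustrates label encoding with a figure; a minimal scikit-learn sketch, with illustrative segment values:

```python
from sklearn.preprocessing import LabelEncoder

segments = ["Corporate", "SME", "Individual", "Corporate"]

# Each level gets an integer code (alphabetical: Corporate=0, Individual=1, SME=2)
encoded = LabelEncoder().fit_transform(segments)
print(encoded)  # [0 2 1 0]
```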
Encode categorical variables – One-hot encoding (2/3)
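Likewise, a minimal one-hot encoding sketch with pandas (same illustrative values):

```python
import pandas as pd

df = pd.DataFrame({"segment": ["Corporate", "SME", "Individual", "Corporate"]})

# One 0/1 column per level of the categorical variable
dummies = pd.get_dummies(df["segment"], prefix="segment")
print(dummies.columns.tolist())
# ['segment_Corporate', 'segment_Individual', 'segment_SME']
```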
Encode categorical variables (3/3)
Approach to handling high cardinality:
§Encode categorical variables using an encoder that does not
generate a column for each value/level of the categorical
variable (e.g., the count or probability of observations that have
that value/level)
§If there is a hierarchy, consider using only the higher levels. For
example, if you have street, city, and region, consider using only
city and region, or even just region
§For values/levels present in more than a predetermined
threshold of observations (e.g., 2%), create dummy variables

Example (frequency encoding of Segment, plus a dummy variable for Corporate):

Original:
CustomerID  Spent  Segment
1           € 100  Corporate
2           € 120  SME
3           € 110  Individual
4           € 105  Corporate

Encoded:
CustomerID  Spent  Segment  Corporate
1           € 100  2/4      1
2           € 120  1/4      0
3           € 110  1/4      0
4           € 105  2/4      1
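A minimal pandas sketch reproducing the example above (frequency encoding plus a dummy for the frequent level):

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Spent": [100, 120, 110, 105],
    "Segment": ["Corporate", "SME", "Individual", "Corporate"],
})

# Dummy for a frequent level ('Corporate' appears in 50% of the observations)
df["Corporate"] = (df["Segment"] == "Corporate").astype(int)

# Frequency encoding: share of observations per level (2/4, 1/4, 1/4, 2/4)
df["Segment"] = df["Segment"].map(df["Segment"].value_counts(normalize=True))
```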
Date/time variables
Datasets are two-dimensional, so, when models require the
introduction of time, transformations are necessary to represent
this third dimension (time) in the columns
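One common way to do this (a sketch; the column names, dates, and reference date are illustrative) is to derive columns that expose the time dimension:

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-15", "2024-02-03"])})

# Derived columns expose the time dimension to the model
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.dayofweek
df["days_since_order"] = (pd.Timestamp("2024-03-01") - df["order_date"]).dt.days
```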
Multidimensional features
These are among the most powerful features. The two most common examples
are:
§Interactions: multiplication of variables
§Ratios: division of variables
Usually, domain expertise is required to understand which
interactions, and above all, which ratios may have modeling value.
Multidimensional features - ratios
Ratios are important because they are difficult for most algorithms to
uncover. Ratios can:
§Provide a normalized version of a variable. For example, a
percentage such as a customer website purchase ratio =
number of purchases / customer website visits
§Incorporate complex ideas. For example, the ratio of claims received
to premiums paid in an insurance company =
claims received / premiums paid
§Make models "live" longer. For example, a model for real
estate property value, due to the increasing trend in prices, instead of
using each property's price, could use =
property price / average property price
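A minimal sketch of both feature types on hypothetical columns, with a guard against division by zero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"purchases": [3, 10], "visits": [30, 0], "price": [2.0, 3.0]})

# Interaction: multiplication of variables
df["spend_proxy"] = df["purchases"] * df["price"]

# Ratio: division of variables, avoiding division by zero
df["purchase_ratio"] = df["purchases"] / df["visits"].replace(0, np.nan)
```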
6.5
Data integration
Data preparation
Merge data
Joining data that comes from two or more
databases about the unit of analysis under study
[Figure: example of merging multiple sources for a stock forecast]
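A minimal pandas sketch of joining two hypothetical sources on the unit of analysis:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Lisboa", "Porto"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 50, 80]})

# Join the two sources on the unit of analysis (the customer)
abt = customers.merge(orders, on="customer_id", how="left")
```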
Reformat data
Apply syntactic modifications that do not change data meaning,
but are required for modeling, for example:
§Remove commas from text fields if the dataset is supposed to be
saved as comma-separated values (CSV)
§Remove any ordering that might exist in the observations
§Trim some variables (e.g., text variables) to a certain maximum size
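A minimal sketch of these reformatting steps on a hypothetical text column (the 255-character limit and the shuffle seed are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"comment": ["good, fast delivery", "ok"]})

# Remove commas so the file can safely be saved as CSV, and trim to a maximum size
df["comment"] = df["comment"].str.replace(",", " ", regex=False).str.slice(0, 255)

# Shuffle to remove any ordering that might exist in the observations
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```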
Data Science for Marketing
© 2021-2024 Nuno António (Rev. 2024-08-28)
Instituto Superior de Estatística e Gestão da Informação
Universidade Nova de Lisboa