Module 2
Data Processing
Processing Information
Missing values
Noisy data
Incomplete data
Accurate
Precise
Complete
Interpretable
Correct
as possible.
Data cleaning
Dealing with missing values
Dealing with erroneous data and outliers
Data transformation
Changing data types (discretization)
Changing range of data values (normalization)
Adding variables
Data reduction
Feature selection
Sampling
A blank
A ‘.’
An ‘n/a’
A ‘?’
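A minimal sketch of how the markers listed above could be recognized when reading raw text fields. The marker set and the helper name to_missing are assumptions made for illustration; a real dataset may use other conventions.

```python
# Hypothetical marker set taken from the list above; extend it per dataset.
MISSING_MARKERS = {"", ".", "n/a", "?"}

def to_missing(value):
    """Return None for a missing-value marker, else the stripped string."""
    cleaned = value.strip()
    return None if cleaned.lower() in MISSING_MARKERS else cleaned

raw_row = ["34", " ", ".", "N/A", "?", "blue"]
clean_row = [to_missing(v) for v in raw_row]
print(clean_row)  # ['34', None, None, None, None, 'blue']
```

Mapping every marker to a single sentinel (here None) makes the later replacement steps independent of how the source file encoded missing data.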
Delete the entire row (depends on how many rows you have)
Replace by a fixed value (‘unknown’)
Replace values by a statistic associated with a particular
column or a particular group – mean, median, mode
Replace values based on nearest neighbors
Replace values based on likelihood.
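The first three strategies above can be sketched as follows, assuming missing entries are already marked with None. The ages list and the impute helper are hypothetical; nearest-neighbor and likelihood-based replacement need a model and are not shown.

```python
from statistics import mean, median, mode

ages = [23, None, 31, 45, None, 31]  # hypothetical column, None = missing

# Strategy 1: delete the entries (rows) that have a missing value.
dropped = [a for a in ages if a is not None]

# Strategy 2: replace by a fixed value.
fixed = ["unknown" if a is None else a for a in ages]

# Strategy 3: replace by a statistic computed on the observed values.
def impute(values, statistic):
    fill = statistic([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

by_mean = impute(ages, mean)      # fills with 32.5
by_median = impute(ages, median)  # fills with 31.0
by_mode = impute(ages, mode)      # fills with 31
print(by_mean)
```

Deleting rows is the simplest option but shrinks the dataset; replacing by a statistic keeps every row at the cost of reducing the variance of the column.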
[Figure: Handling missing values, with four branches: impute a constant value; impute with mean …; impute based on a model; impute randomly]
Isabelle Bichindaritz, SUNY Oswego
Outline
Introduction to module
Locating and downloading datasets
Datasets and files
Data sources
Data preprocessing
Importance of data preprocessing
Data preprocessing tasks
Missing values
Replacing missing values
Normalizing and discretizing data
Data normalization
Discretization
Data reduction
Feature selection
Data sampling
Introduction to R language
Principles of R
Working with R
Data preprocessing with R
Normalizing and Discretizing Data
To handle noise in the data, the values can be transformed globally (normalized).
Methods include:
Min-max normalization
Z-score normalization
Decimal scaling
Decimal scaling divides each value by a power of 10, which transforms the values into the interval [-1, 1] if there are negative values, and
into [0, 1] otherwise.
Ex: we want the highest age to be less than 1, therefore we divide by 1,000 = 10^3.
Decimal scaling only divides every value by a power of 10, so it preserves the
shape of the original data distribution. It acts similarly to image resizing in
photo editing software (shrink / magnify).
Z-score normalization is the most widely used: the resulting values have mean 0
and standard deviation 1, which is advantageous with certain statistical methods.
However, it does not make a non-normal distribution normal, and the values lose
their original units and scale.
Min-max normalization can accommodate any new range we want, not only [0, 1]
or [-1, 1] as with decimal scaling.
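A minimal sketch of the three normalization methods, using the standard textbook definitions rather than any formula shown in the slides. The function names and the ages list are hypothetical; the z-score uses the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max |v| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

ages = [18, 25, 40, 67, 110]  # hypothetical ages, highest value above 100

print(min_max(ages))          # first value 0.0, last value 1.0
print(min_max(ages, -1, 1))   # same data mapped to [-1, 1]
print(z_score(ages))          # mean 0, standard deviation 1
print(decimal_scaling(ages))  # divided by 10^3, highest value 0.11
```

All three functions are linear rescalings: they change the location and scale of the values, not the ordering of the samples.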
Effects of discretization:
Smooths data.
Reduces noise.
Reduces data size.
Enables specific methods using nominal data.
Manual methods:
Distribution analysis.
Automatic methods:
Binning.
Equal-width binning
Equal-depth binning
Regression analysis.
Cluster analysis.
Natural partitioning.
w = (max – min) / n, where w is the bin width and n the number of bins.
Ex: if the range is [0, 100] and we want 4 bins, each bin will have
a width of
(100 – 0) / 4 = 25
For integer values, the bins will be: [0, 24], [25, 49], [50, 74], [75, 100].
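The equal-width rule above can be sketched as follows. The function names are hypothetical, and placing the maximum value in the last bin is an assumption that matches the [75, 100] bin of the example.

```python
def equal_width_edges(lo, hi, n):
    """Return the n (start, end) intervals of width (hi - lo) / n."""
    w = (hi - lo) / n
    return [(lo + i * w, lo + (i + 1) * w) for i in range(n)]

def bin_index(value, lo, hi, n):
    """Return the 0-based bin of value; the maximum falls in the last bin."""
    w = (hi - lo) / n
    return min(int((value - lo) / w), n - 1)

print(equal_width_edges(0, 100, 4))
# [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0)]
print([bin_index(v, 0, 100, 4) for v in (0, 24, 25, 74, 75, 100)])
# [0, 0, 1, 2, 3, 3]
```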
n = nb / d, where n is the number of bins, nb the number of samples, and d the depth (number of samples per bin).
Ex: if the range is [0, 100] for 100 samples of different values (for
example 99 is missing) and we want 20 samples in each bin, the number
of bins will be:
100 / 20 = 5
The bins will be: [0, 19], [20, 39], [40, 59], [60, 79], [80, 100].
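A minimal sketch of equal-depth binning that reproduces the example above: 100 distinct integer values in [0, 100] with 99 missing, and 20 samples per bin. The function name equal_depth_bins is hypothetical.

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` samples."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

values = [v for v in range(101) if v != 99]  # 100 samples, 99 is missing
bins = equal_depth_bins(values, 20)

print(len(bins))                        # 5
print([(b[0], b[-1]) for b in bins])
# [(0, 19), (20, 39), (40, 59), (60, 79), (80, 100)]
```

If the number of samples is not a multiple of the depth, the last bin in this sketch simply holds fewer samples.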
Feature selection.
Sampling.
Data compression.
Data aggregation.
etc.
Main methods:
Simple random sampling with replacement.
Simple random sampling without replacement.
Stratified sampling.
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement); from Han and Kamber (2014)]
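The three sampling methods listed above can be sketched with Python's random module. The population, the strata, the 10% fraction, and the stratified_sample helper are all hypothetical; the fixed seed is only there to make the run repeatable.

```python
import random
from collections import defaultdict

rng = random.Random(42)          # fixed seed so the run is repeatable
population = list(range(1, 101))

# Simple random sampling without replacement: no item is drawn twice.
srswor = rng.sample(population, k=10)

# Simple random sampling with replacement: an item may be drawn again.
srswr = rng.choices(population, k=10)

def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction, without replacement, from each stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical records: (age group, id), with 60 "young" and 40 "senior".
records = [("young", i) for i in range(60)] + [("senior", i) for i in range(40)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(len(strat))  # 10: 6 "young" and 4 "senior"
```

Stratified sampling keeps the proportions of each group in the sample, which a simple random sample only does on average.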
Introduction to R Language
R is an open-source programming environment for computation and graphics, used for statistical
analysis and data science applications.
Developed originally by Ross Ihaka and Robert Gentleman at the University of Auckland
in New Zealand, it is now maintained by the “R core group” (http://www.R-project.org).