
Data Pre-processing

Why Data Preprocessing?

Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
Steps Involved

What is Data?

Categorical attributes : Features whose values are taken from a defined set of values. For instance, the days of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} form a category because a value is always taken from this set. Another example is the Boolean set {True, False}.

Numerical attributes : Features whose values are continuous or integer-valued. They are represented by numbers and possess most of the properties of numbers.
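
A quick illustration in pandas (a minimal sketch; the column names and values are invented for this example): a categorical feature can be declared together with its defined value set, while a numerical feature is simply stored as numbers.

import pandas as pd

# Hypothetical frame mixing a categorical and a numerical feature.
df = pd.DataFrame({
    "day": ["Monday", "Friday", "Sunday", "Friday"],
    "temperature": [21.5, 19.0, 23.2, 20.1],
})

# Declaring the defined set of values makes the categorical nature explicit;
# values outside the set would become missing (NaN).
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
df["day"] = pd.Categorical(df["day"], categories=days)

print(df.dtypes)  # day -> category, temperature -> float64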
Missing values :

Eliminate rows with missing data :
• A simple and sometimes effective strategy. It fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.

Estimate missing values :
• If only a reasonable percentage of values are missing, simple interpolation methods can be run to fill in those values. However, the most common method of dealing with missing values is to fill them in with the mean, median or mode value of the respective feature.
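
Both strategies can be sketched in a few lines of pandas (a minimal example; the columns and values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

# Elimination: drop every row that contains a missing value.
dropped = df.dropna()

# Estimation: fill missing values with the mean (or median/mode) of the feature.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["income"] = filled["income"].fillna(filled["income"].median())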
Inconsistent values :

Data can contain inconsistent values. For instance, an ‘Address’ field may contain a ‘Phone number’. This may be due to human error, or the information may have been misread while being scanned from a handwritten form.

It is therefore always advisable to perform data assessment, such as checking what the data type of each feature should be and whether it is the same for all the data objects.
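
Such an assessment can be sketched in pandas (a minimal, hypothetical example; the column names and the digits-and-dashes pattern are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "address": ["12 Oak St", "98765-43210", "5 Elm Rd"],  # a phone number slipped in
    "phone":   ["12345-67890", "23456-78901", "34567-89012"],
})

# Check the inferred type of every column, then flag 'address' entries
# that look like phone numbers (digits and dashes only).
print(df.dtypes)
suspect = df["address"].str.fullmatch(r"[\d\-]+")
print(df[suspect])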
Duplicate values :

A dataset may include data objects which are duplicates of one another. This can happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates.

In most cases, duplicates are removed so as not to give a particular data object an advantage or bias when running machine learning algorithms.
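
In pandas, deduplication is a one-liner (a minimal sketch with invented data):

import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha"],
    "email": ["asha@example.com", "ravi@example.com", "asha@example.com"],
})

# Keep only the first occurrence of each duplicated row.
deduped = df.drop_duplicates(keep="first")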
Imputation methods (as offered by the Orange Impute widget) :

• Don’t Impute does nothing with the missing values.
• Average/Most-frequent uses the average value (for continuous attributes) or the most common value (for discrete attributes).
• As a distinct value creates new values to substitute for the missing ones.
• Model-based imputer constructs a model for predicting the missing value, based on the values of other attributes; a separate model is constructed for each attribute. The default model is a 1-NN learner, which takes the value from the most similar example (this is sometimes referred to as hot deck imputation; see the sketch after this list). This algorithm can be substituted by one that the user connects to the input signal Learner for Imputation. Note, however, that if there are discrete and continuous attributes in the data, the algorithm needs to be capable of handling them both; at the moment only the 1-NN learner can do that. (In the future, when Orange has more regressors, the Impute widget may have separate input signals for discrete and continuous models.)
• Random values computes the distributions of values for each attribute and then imputes by picking random values from them.
• Remove examples with missing values removes the examples containing missing values. This check also applies to the class attribute if Impute class values is checked.
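
Outside Orange, the model-based 1-NN (hot deck) idea can be approximated with scikit-learn's KNNImputer; this is an analogue for illustration, not the Orange widget itself, and the data is made up:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [9.0, 8.0, 7.5],
])

# n_neighbors=1 takes the value from the single most similar example,
# mirroring the 1-NN behaviour described above. The missing entry in the
# first row is filled with 3.0, copied from the nearby second row.
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)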
Noisy data :

Noisy data are data containing a large amount of additional meaningless information, called noise.
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to:
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning:
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?

• Binning method
• Clustering
• Combined computer and human inspection
• Regression

The binning method is used to smooth data or to handle noisy data. The data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.

There are three approaches to performing smoothing:

• Smoothing by bin means : each value in a bin is replaced by the mean value of the bin.
• Smoothing by bin medians : each value in a bin is replaced by the bin’s median value.
• Smoothing by bin boundaries : the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size (a uniform grid).
  • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N.
  • The most straightforward approach.
  • But outliers may dominate the presentation, and skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples.
  • Gives good data scaling.
  • Managing categorical attributes can be tricky.
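
Both partitioning schemes can be sketched with pandas (a minimal example reusing the price data from the worked example below; pd.cut gives equal-width bins, pd.qcut gives equal-depth bins):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10.
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals each holding roughly the same number of samples.
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts())
print(equal_depth.value_counts())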
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
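The numbers above can be verified with a few lines of plain Python (a minimal sketch; ties in the boundary rule are broken toward the lower boundary):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth: 4 values per bin

for b in bins:
    mean = round(sum(b) / len(b))  # smoothing by bin means
    lo, hi = min(b), max(b)        # bin boundaries
    by_means = [mean] * len(b)
    by_boundaries = [lo if v - lo <= hi - v else hi for v in b]
    print(b, "->", by_means, "|", by_boundaries)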
Outlier Detection using Cluster Analysis
Outlier Detection using Linear Regression
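
The figures for these two slides are not reproduced here. As a minimal, self-contained sketch of both ideas (assuming scikit-learn, with synthetic data), a point can be flagged as an outlier when it lies far from its cluster centroid, or when its regression residual is unusually large:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cluster analysis: points far from their assigned centroid are outliers.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[20.0, 20.0]]])                # one planted outlier
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(X[dist > dist.mean() + 3 * dist.std()])  # flags the planted point

# Linear regression: points with unusually large residuals are outliers.
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(0, 0.5, 50)
y[25] += 10.0                                  # one planted outlier
resid = np.abs(y - LinearRegression().fit(x, y).predict(x))
print(x[resid > resid.mean() + 3 * resid.std()])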
Continued…
