Data Preprocessing
• The representation and quality of the data used in an analysis is the first and foremost concern any analyst must address.
• Data preprocessing is a data mining technique that transforms raw data into an understandable format, because real-world data is often incomplete, inconsistent or even erroneous.
• Data preprocessing resolves such issues.
• Issues of data quality include:
o Noise and outliers
o Missing data
o Duplicate data
• Data preprocessing ensures that the subsequent data mining processes are free from such errors.
• It is a prerequisite for data mining: it prepares the raw data for the core mining processes.
• Data preprocessing is part of a larger, complex phase known as ETL (Extraction, Transformation and Loading).
• ETL involves extracting data from multiple sources, transforming it into a standardized format and loading it into the data mining system for analysis.
Data Preprocessing Methods
Raw data is highly vulnerable to missing values, noise and inconsistency, and the quality of the data affects the data mining results. So there is a need to improve the quality of the data in order to improve the mining results. To achieve better results, raw data is pre-processed to enhance its quality and make it error free. This eases the mining process.
Data preprocessing is performed in the following stages:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
1. Data cleaning
First, the raw or noisy data goes through the process of cleansing.
In data cleansing, missing values are filled in, noisy data is smoothed, inconsistencies are resolved, and outliers are identified and removed in order to clean the data. To elaborate further:
a) Handling missing values:
It is often found that many of the database tuples or records do not have any
recorded values for some attributes. Such cases of missing values are filled by
different methods, as described below.
i) Fill in the missing value manually: Naturally, manually filling each missing value is laborious and time consuming, so it is practical only when the missing values are few in number. Other methods, described below, deal with missing values when the dataset is very large or when there are many missing values.
ii) Use of some global constant in place of the missing value: In this method, missing values are replaced by a global label such as ‘Unknown’ or -∞. Although this is one of the easiest approaches to dealing with missing values, it should be avoided where the mining program might mistake the repeated occurrences of a global label such as ‘Unknown’ for an interesting pattern. Hence, this method should be used with caution.
iii) Use the attribute mean to fill in the missing value: Fill in the missing values
for each attribute with the mean of other data values of the same attribute.
This is a better way to handle missing values in a dataset.
iv) Use the most probable value to fill in the missing value: Another efficient method is to fill in the missing values with values determined by tools such as Bayesian formalism, decision tree induction or other inference-based tools. This is one of the best methods, as it uses most of the information already present to predict the missing values and is less biased than the previous methods. The only difficulty with this method is the complexity of performing the analysis.
v) Ignore the tuple: If the tuple contains more than one missing value and the other methods are not applicable, then the best strategy is to ignore the whole tuple. This is commonly done when the class label is missing or when the tuple has missing values for most of its attributes. This method should not be used if the percentage of missing values per attribute varies significantly.
Note: Using the attribute mean to fill in missing values is the most common technique used by data mining tools to handle missing values. However, one can always use knowledge of probability to fill in these values, as sketched below.
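A minimal sketch of the global-constant and attribute-mean approaches, assuming pandas is available; the DataFrame, its column names (marks, grade) and the values in it are hypothetical:

```python
import pandas as pd

# Hypothetical student records with missing values in two attributes.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "marks": [72.0, None, 64.0, None, 80.0],
    "grade": ["B", "A", None, "B", "A"],
})

# (ii) Global constant: replace a missing categorical value with a label
#      such as 'Unknown' -- to be used with caution, as noted above.
df["grade"] = df["grade"].fillna("Unknown")

# (iii) Attribute mean: replace missing numeric values with the column mean.
df["marks"] = df["marks"].fillna(df["marks"].mean())

print(df)
```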
b) Handling noisy data
• Most data mining algorithms are adversely affected by noisy data.
• Noise can be defined as unwanted variance or some random error in a measured variable.
• Noise is removed from the data by ‘smoothing’. The methods used for data smoothing are as follows:
i) Binning methods
• The Binning method is used to divide the values of an attribute into bins or
buckets.
• It is commonly used to convert one type of attribute to another type.
• For example, it may be necessary to convert a real-valued numeric attribute
like temperature to a nominal attribute with values cold, cool, warm, and hot
before its processing.
• This is also called ‘discretizing’ a numeric attribute.
• There are two types of discretization, namely, equal interval and equal
frequency.
• In equal interval binning, we calculate a bin size and then put the samples
into the appropriate bin.
• In equal frequency binning, we allow the bin sizes to vary, with our goal being
to choose bin sizes so that every bin has about the same number of samples
in it.
o The idea is that if each bin has the same number of samples, no bin, or
sample, will have greater or lesser impact on the results of data
mining.
• To understand this process, consider a dataset of the marks of 50 students. The process divides this dataset, on the basis of the marks, into 10 bins for this example.
In the case of equal interval binning, we create the bins 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-100. If most students have marks between 60 and 80, a few bins will be full while most bins (e.g., 0-10, 10-20, 90-100) will have very few entries. It may therefore be better to divide this dataset on an equal frequency basis: with the same 50 students and the same 10 bins, instead of creating fixed ranges such as 0-10, 10-20 and so on, we first sort the student records by marks in descending (or ascending) order. The 5 students with the highest marks go into the first bin, the next 5 students into the next bin, and so on. If students at a bin boundary have the same marks, the bin ranges can be shifted so that students with the same marks fall into one common bin.
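Both kinds of binning can be sketched with pandas, assuming it is available; the 50 synthetic marks below are hypothetical, and pd.cut / pd.qcut stand in for equal interval and equal frequency binning respectively:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical marks of 50 students, concentrated roughly between 60 and 80.
marks = pd.Series(np.clip(rng.normal(70, 12, 50), 0, 100).round())

# Equal interval binning: ten fixed-width bins covering 0-100.
equal_width = pd.cut(marks, bins=range(0, 101, 10), include_lowest=True)

# Equal frequency binning: ten bins holding roughly 5 students each;
# duplicates='drop' merges bins whose boundaries coincide (tied marks).
equal_freq = pd.qcut(marks, q=10, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```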
ii) Clustering or outlier analysis
• Clustering or outlier analysis is a method that allows detection of outliers by
clustering.
• In clustering, values which are common or similar are organized into groups
or ‘clusters’, and those values which lie outside these clusters are termed as
outliers or noise.
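A minimal sketch of clustering-based outlier detection, assuming NumPy and scikit-learn are available; the two-attribute data, the choice of k-means with two clusters and the mean-plus-three-standard-deviations threshold are illustrative assumptions, not a prescribed method:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two hypothetical numeric attributes: most points fall into two tight groups,
# plus one point that lies far away from both.
data = np.vstack([
    rng.normal([20.0, 20.0], 1.5, size=(40, 2)),
    rng.normal([60.0, 60.0], 1.5, size=(40, 2)),
    [[95.0, 5.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Distance of each point from the centre of its own cluster.
dist = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Points unusually far from their cluster centre are flagged as outliers.
threshold = dist.mean() + 3 * dist.std()
print(data[dist > threshold])
```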
iii) Regression
• Regression is another method which allows data smoothing by fitting the data to some function.
• For example, linear regression is one of the most widely used methods; it aims at finding the line that best fits the values of two variables or attributes (the line of best fit).
• The primary purpose of this is to predict the value of one variable using the other.
• Similarly, multiple regression is used when more than two variables are involved. Regression fits the data to mathematical equations, which in turn removes noise and hence smooths the dataset.
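A minimal sketch of smoothing by linear regression, assuming NumPy is available; the noisy x/y values and the use of np.polyfit for the least-squares line are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical pair of attributes: y depends roughly linearly on x, plus noise.
x = np.linspace(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, size=x.size)

# Least-squares fit of the best straight line through the noisy values.
slope, intercept = np.polyfit(x, y, deg=1)

# Replace the noisy observations with the fitted ("smoothed") values,
# which can also be used to predict y for new values of x.
y_smoothed = slope * x + intercept
print(round(slope, 2), round(intercept, 2))
```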
2. Data integration
• One of the most necessary steps to be taken during data analysis is data integration.
• Data integration is a process which combines data from a plethora of
sources (such as multiple databases, flat files or data cubes) into a unified
data store.
• During data integration, a number of tricky issues have to be considered.
• For example, how can the data analyst or the analyzing machine be sure that student_id in one database and student_number in another database refer to the same entity? This is referred to as the problem of entity identification.
• The solution to this problem lies in ‘metadata’. Databases and data warehouses contain metadata, which is data about data. The data analyst refers to this metadata to avoid errors during the process of data integration.
• Another issue, which may be caused by schema integration, is redundancy. In database terms, an attribute is said to be redundant if it can be derived from another table (of the same database).
• Mistakes in attribute naming can also lead to redundancies in the resulting dataset. A number of tools are used to integrate data from different sources into one unified schema; a minimal merge is sketched below.
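A minimal sketch of integrating two sources whose keys name the same entity differently (the student_id / student_number example above), assuming pandas is available; the tables and their contents are hypothetical:

```python
import pandas as pd

# Two hypothetical sources describing the same students under different keys.
registrar = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chloe"],
})
exam_office = pd.DataFrame({
    "student_number": [2, 3, 4],
    "marks": [64, 80, 71],
})

# Metadata tells us that student_id and student_number refer to the same
# entity, so one key is renamed before merging into a unified data store.
unified = registrar.merge(
    exam_office.rename(columns={"student_number": "student_id"}),
    on="student_id",
    how="outer",
)
print(unified)
```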
3. Data transformation
• When the values of one attribute are small compared to those of other attributes, that attribute will not have much influence on the mining of information, since both its values and the variation within the attribute are smaller than those of the other attributes.
• Thus, data transformation is a process in which data is consolidated or
transformed into some other standard forms which are better suited for
data mining.
• All attributes should be transformed to a similar scale for clustering to be
effective unless we wish to give more weight to some attributes that are
comparatively large in scale.
• The two most popular and widely used data transformation techniques are normalization and standardization.
Normalization
• In normalization, all the attributes are converted to a normalized score, i.e., rescaled to the range [0, 1]. The problem with normalization is outliers.
• If there is an outlier, it will tend to crunch all of the other values down toward zero. To understand this, suppose the range of students’ marks is 35 to 45 out of 100.
• Then 35 will be considered as 0 and 45 as 1, and students will be distributed
between 0 to 1 depending upon their marks.
• But if there is one student having marks 90, then it will act as an outlier and
in this case, 35 will be considered as 0 and 90 as 1.
• Now, it will crunch most of the values down toward the value of zero. In this
scenario, the solution is standardization.
Standardization
• In case of standardization, the values are all spread out so that we have a
standard deviation of 1.
• Generally, there is no rule for when to use normalization versus
standardization.
• However, if your data has outliers, use standardization, otherwise use
normalization.
• Using standardization tends to make the remaining values for all of the
other attributes fall into similar ranges since all attributes will have the
same standard deviation of 1.
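A minimal sketch contrasting the two transformations on the marks example above, assuming NumPy is available; the specific mark values are hypothetical:

```python
import numpy as np

# Hypothetical marks: most lie between 35 and 45, with one outlier at 90.
marks = np.array([35, 38, 40, 41, 43, 45, 90], dtype=float)

# Min-max normalization to [0, 1]: the outlier becomes 1 and crunches the
# remaining values down toward 0.
normalized = (marks - marks.min()) / (marks.max() - marks.min())

# Standardization (z-score): zero mean and unit standard deviation, so the
# non-outlier values stay spread out on a comparable scale.
standardized = (marks - marks.mean()) / marks.std()

print(normalized.round(2))
print(standardized.round(2))
```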
4. Data reduction
• It is often seen that when complex data analysis and mining processes are carried out over very large datasets, they take so long that the whole data mining or analysis process becomes unviable.
• Data reduction techniques come to the rescue in such situations.
• Using data reduction techniques, a dataset can be represented in a reduced
manner without actually compromising the integrity of original data.
• Data reduction is all about reducing the dimensions (referring to the total
number of attributes) or reducing the volume.
• Moreover, mining when carried out on reduced datasets often results in
better accuracy and proves to be more efficient.
• There are many methods to reduce large datasets to yield useful
knowledge. A few among them are:
i. Dimension reduction: In data warehousing, a ‘dimension’ equips us with structured labeling information, but not all dimensions (attributes) are necessary at any one time. Dimension reduction uses algorithms such as Principal Component Analysis (PCA). With such algorithms, one can detect and remove redundant and weakly relevant attributes or dimensions (see the sketch after this list).
ii. Numerosity reduction: It is a technique which is used to choose smaller
forms of data representation for reducing the dataset volume.
iii. Data compression: We can also use data compression techniques to reduce the dataset size. These techniques are classified as lossy and lossless compression techniques, in which encoding mechanisms (e.g. Huffman coding) are used.
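A minimal sketch of dimension reduction with PCA (item i above), assuming NumPy and scikit-learn are available; the synthetic dataset with redundant attributes and the 95% variance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical dataset with 5 attributes, two of which are redundant
# because they are linear combinations of the other three.
base = rng.normal(size=(100, 3))
data = np.hstack([base, 2.0 * base[:, :1], base[:, 1:2] + base[:, 2:3]])

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)

print(data.shape, "->", reduced.shape)   # typically (100, 5) -> (100, 3)
print(pca.explained_variance_ratio_.round(3))
```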