05 DS Data Preprocessing - Cleaning
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis; the goal is to improve
the quality of the data and to make it more suitable for the specific data mining task. Some
common steps in data preprocessing include:
❖ Data Cleaning
❖ Data Integration
❖ Data Transformation
❖ Data Reduction
❖ Data Discretization
❖ Data Normalization
Data Cleaning
Real-world data can have many irrelevant and missing parts. Data cleaning is done to handle
them; it involves the handling of missing data, noisy data, and so on. A minimal missing-value
sketch appears below, followed by two approaches for smoothing noisy data.
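As a minimal sketch of missing-data handling (the data and column name here are hypothetical), a missing numeric value can be filled with the column mean using pandas; dropping the incomplete rows is another common option:

import numpy as np
import pandas as pd

# Hypothetical data set with one missing price value
df = pd.DataFrame({"price": [4.0, 8.0, np.nan, 21.0, 25.0]})

# Fill the missing value with the attribute mean
# (alternatively, drop incomplete rows with df.dropna())
df["price"] = df["price"].fillna(df["price"].mean())
print(df)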
• Regression:
Data can be smoothed by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having several independent variables); a small
sketch follows these two approaches.
• Clustering:
This approach groups similar data values into clusters. Values that fall outside all clusters can
then be treated as outliers.
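The sketch below is a minimal illustration of smoothing by simple linear regression; the data and variable names are made up for illustration:

import numpy as np

# Hypothetical noisy observations of y measured at positions x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit a linear regression y = a*x + b, then replace each value
# with its fitted (smoothed) counterpart
a, b = np.polyfit(x, y, deg=1)
y_smooth = a * x + b
print(y_smooth)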
• Binning Method:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing.
For Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is
9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value. In general, the larger the width, the
greater the effect of the smoothing. Alternatively, bins may be equal width,
where the interval range of values in each bin is constant.
Binning is also used as a discretization technique.
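A minimal pure-Python sketch of the three smoothing variants on the price data above, using equal-frequency bins of size 3:

# Sorted price data from the example above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Partition into equal-frequency bins of size 3
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean
means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes its bin's median
medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer
# of the bin minimum and bin maximum
boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
              for b in bins]

print(means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]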
Data Transformation
This step transforms the data into forms appropriate for the mining process. It involves the
following ways:
1. Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or
0.0 to 1.0); a sketch follows this list.
2. Attribute Construction: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization: This is done to replace the raw values of a numeric attribute by interval labels
or conceptual labels.
4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher
level in the hierarchy.
For Example: The attribute “city” can be generalized to “country”.
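As an example of the first strategy, min-max normalization rescales values into the range [0.0, 1.0] via v' = (v - min) / (max - min); a minimal sketch with made-up values:

import numpy as np

# Hypothetical attribute values
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # [0.    0.125 0.25  0.5   1.   ]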
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in
volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be
more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions are detected
and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as
parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods
such as clustering, sampling, and the use of histograms; a sampling sketch follows this list.
• Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher
conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation
of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they
allow the mining of data at multiple levels of abstraction.
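As a concrete sketch of numerosity reduction by simple random sampling without replacement (the sizes here are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data set of 10,000 records
data = rng.normal(loc=50, scale=10, size=10_000)

# Keep a 1% simple random sample without replacement
sample = rng.choice(data, size=100, replace=False)

# The much smaller sample approximates statistics of the full data
print(data.mean(), sample.mean())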
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can
help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy
and speed of the subsequent data mining process.
Semantic heterogeneity and differences in the structure of data pose great challenges in data integration;
a minimal merge sketch follows.
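The pandas sketch below shows the mechanical merge step for two hypothetical source tables whose key attributes are named differently; entity identification (matching cust_id to customer_id) is exactly where semantic heterogeneity shows up:

import pandas as pd

# Two hypothetical source tables using different key names
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [10.0, 25.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Delhi"]})

# Resolve the schema mismatch, then merge into one data set
customers = customers.rename(columns={"customer_id": "cust_id"})
merged = orders.merge(customers, on="cust_id", how="left")
print(merged)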
Data Discretization
In discretization, the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-
level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be
defined for the same attribute to accommodate the needs of various users.
Concept Hierarchy Generation for Nominal Data
Attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema and can be automatically defined at the schema
definition level.
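A brief pandas sketch of discretizing a numeric age attribute into interval labels and into higher-level conceptual labels (the cut points are chosen arbitrarily for illustration):

import pandas as pd

ages = pd.Series([5, 13, 22, 35, 47, 68])

# Replace raw ages with interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# A higher conceptual level of the same hierarchy: youth / adult / senior
concepts = pd.cut(ages, bins=[0, 20, 60, 100],
                  labels=["youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))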