
Data Preprocessing

Dec 2024
Outline

• Data quality problems


• Data preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Introduction

• Objects and attributes

• Data are sets of objects (also called samples, vectors, or instances) placed on the rows of a table; each column of the table corresponds to an attribute.
Attribute Values

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values.
• Example: height can be measured in feet or in meters
– Different attributes can be mapped to the same set of values
• Example: attribute values for both ID and age are integers
– Properties of attribute values can differ
• ID has no upper limit, but age has a minimum and a maximum value
Attributes with Examples
Data Quality

• Data have quality if they satisfy the requirements of the intended use:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability: how much the data are trusted by users.
– Interpretability: how easily the data are understood.
Data Quality Problems

• Noise refers to distortion (modification) of original values, caused by various interferences that occur mainly during data collection:
– Faulty data collection instruments
– Data entry errors
– Data transmission problems
– Technology limitations
– Inconsistency in naming conventions
Data Quality Problems

• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set; they are often generated by measurement errors.
– An outlier is an object (observation) that is, in some sense, distant from the rest of the data.
– It represents an ‘alien’ object in the dataset.
Data Quality Problems
• Missing values: attribute values that are not stored for an attribute
• Reasons for missing values:
– Information was not collected (e.g., people decline to give their age or weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Data Quality Problems
• Duplication: data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous sources
• Example:
– Same person with multiple email addresses
Data Quality Problems

• Inconsistent data: data containing discrepancies
– e.g., Age = "42" but Birthday = "03/07/1997"
– e.g., rating was previously recorded as "1, 2, 3" but is now recorded as "A, B, C"
– e.g., discrepancies between duplicate records
• Impossible data combinations (e.g., Gender: Male, Pregnant: Yes)
• Some ML models need the data in a specific format; for example, a random forest (RF) implementation typically cannot handle null values directly.
Why is Data Preprocessing Important?

• Data preprocessing consumes most of the time and implementation effort, and can be more critical than the machine-learning algorithm itself.

• Less data (fewer attributes): machine learning methods can learn faster
• Higher accuracy: machine learning methods can generalize better
• Simpler results: they are easier to understand
Data Preprocessing Major Tasks
Data Cleaning

• Dirty data can cause confusion for the mining procedure, resulting in unreliable output.

• Cleaning routines work to "clean" the data by:
– Filling in missing values
– Smoothing noisy data
– Identifying or removing outliers
– Resolving inconsistencies.
Data Cleaning

• Handling missing values:
– Ignore the tuple (data object), usually when the class label is missing.
• By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple, which might be useful.
• Not very effective unless the tuple contains several attributes with missing values.
– Fill in the missing values manually: time-consuming and not feasible for a large data set.
Data Cleaning
• Handling missing values:
– Estimate the missing values
• Use a global constant
– Simple, but not foolproof.
• Use the attribute mean or median
– Computed over all samples belonging to the same class as the given tuple
• Use the most probable value (popular)
– Determined by regression, the Bayesian formalism, or a decision tree
– A popular strategy, since it uses the most information from the present data to predict missing values
• Choosing the right technique depends on the problem domain.
• A missing value may not imply an error in the data!
Example: Missing-Value Handling Methods

Attribute name    Data type    Handling method
Sex               Nominal      Replace by the mode value.
Age               Numeric      Replace by the mean value.
Religion          Nominal      Replace by the mode value.
Height            Numeric      Replace by the mean value.
Marital status    Nominal      Replace by the mode value.
Job               Nominal      Replace by the mode value.
Weight            Numeric      Replace by the mean value.
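As an illustration of the table above, here is a minimal pandas sketch (assuming a DataFrame with a few of these columns; the data values are made up) that fills numeric attributes with the column mean and nominal attributes with the column mode:

```python
import pandas as pd

# Hypothetical data with missing entries, mirroring the attributes above
df = pd.DataFrame({
    "Sex": ["M", "F", None, "F"],
    "Age": [25, None, 31, 40],
    "Height": [1.70, 1.62, None, 1.80],
})

numeric_cols = df.select_dtypes(include="number").columns
nominal_cols = df.columns.difference(numeric_cols)

# Numeric attributes: replace missing values by the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Nominal attributes: replace missing values by the column mode
for col in nominal_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```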
Noise Data
• Noise is a random error or variance in a measured variable.
• Noisy data are data with a large amount of additional meaningless information, called noise; they are corrupted and distorted data.
• How to handle noise?
– Binning methods
– Regression
– Clustering
• Noisy data can be smoothed:
– Binning methods smooth a sorted data value by consulting its neighborhood, i.e., the values around it.
– They perform local smoothing.
– The sorted values are distributed into bins.
Noise Data

• Noisy data can be smoothed:
– Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size (a uniform grid)
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
• Skewed data are not handled well
Noise Data

• Noisy data can be smoothed:
– Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the same number of samples
• Gives good data scaling
• Managing categorical attributes can be tricky
– Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
– Smoothing by bin medians: each value is replaced by the bin median.
– Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
Binning-Example
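The worked figure from the original slide is not reproduced here; the following is a minimal sketch, using a small made-up list of values, of equal-width binning followed by smoothing by bin means:

```python
import numpy as np

# Made-up sorted values (illustrative only)
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-width partitioning into N bins: width W = (B - A) / N
N = 3
A, B = values.min(), values.max()
edges = np.linspace(A, B, N + 1)

# Assign each value to a bin (clip so the maximum value falls in the last bin)
bin_ids = np.clip(np.digitize(values, edges) - 1, 0, N - 1)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = np.array([values[bin_ids == b].mean() for b in bin_ids])
print(smoothed)
```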
Regression

• Regression conforms data values to a function.

• Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other.

• In multiple linear regression, more than two attributes are involved and the data are fit to a multidimensional surface.
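A minimal NumPy sketch of linear-regression smoothing (the attributes x and y are hypothetical): fit a line to two attributes and replace the noisy attribute by the values predicted from the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)                  # predictor attribute
y = 2.5 * x + 3.0 + rng.normal(0, 5, size=50)   # noisy attribute to smooth

# Fit the "best" line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: replace y by the values predicted from the fitted line
y_smoothed = slope * x + intercept
```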
Clustering

• Similar values are organized into groups (clusters); outliers may be detected as values that fall outside of the set of clusters.
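A minimal sketch of cluster-based outlier detection with scikit-learn's DBSCAN (assuming scikit-learn is available; the eps and min_samples values are arbitrary): points that do not fall inside any cluster are labeled -1 and can be treated as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense clusters plus a couple of far-away points (illustrative data)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),
    rng.normal(5, 0.3, size=(50, 2)),
    [[10.0, 10.0], [-8.0, 7.0]],
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = X[labels == -1]   # DBSCAN marks noise points with label -1
print(len(outliers), "outliers detected")
```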
Data Cleaning as a Process

• The first step in data cleaning as a process is discrepancy detection.
• Discrepancies are caused by several factors:
– Poorly designed data entry forms
– Human error in data entry
– Deliberate errors
– Data decay (e.g., outdated addresses)
– Inconsistent data representations and inconsistent use of codes
– Inconsistencies due to data integration, e.g., different names for the same attribute in different databases
Data Cleaning as a Process

• Commercial tools that can aid in the discrepancy detection step:
– Data scrubbing tools: use simple domain knowledge to detect errors and make corrections
• Rely on parsing and fuzzy matching techniques
– Data auditing tools: find discrepancies by analyzing the data to discover rules and relationships
• Then detect data that violate such conditions
• Once discrepancies are found, we typically need to define and apply (a series of) transformations to correct them.
Data Cleaning as a Process

• Commercial tools can also assist in the data transformation step:
– ETL (extraction/transformation/loading) tools: allow simple transformations; for example, replace the attribute name "gender" by "sex."
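A minimal pandas sketch of that kind of simple transformation (the DataFrame and its columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F"], "age": [34, 29]})

# Attribute-level transformation: rename the column "gender" to "sex"
df = df.rename(columns={"gender": "sex"})

# Value-level transformation: map coded values to full labels
df["sex"] = df["sex"].map({"M": "male", "F": "female"})
```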
Data Integration

• Data integration blends data from multiple sources into a coherent data store.

"Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information."

• Careful integration can help reduce and avoid redundancies and inconsistencies.
• The semantic heterogeneity and structure of data pose great challenges in data integration.
Data Integration

• In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
• The use of denormalized tables is another source of data redundancy.
Data Integration
• Entity identification problem:
– There are a number of issues to consider during data integration, such as schema integration and object matching.
• How can equivalent real-world entities from multiple data sources be matched up? This is the entity identification problem.
• Example: customer-id in one database and cust-number in another may refer to the same attribute.
– Solution: use metadata (data about data).
Data Integration
• Redundancy and correlation analysis:
– Redundancy is another important issue in data integration.
– Inconsistencies in attribute or dimension naming can also cause redundancies in data integration.
– Some redundancies can be detected by correlation analysis (see the sketch below).
• Chi-square test: nominal attributes
• Correlation / covariance: numeric attributes
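A minimal sketch of both checks (assuming pandas and SciPy are available; the column names and values are hypothetical): a chi-square test of independence for two nominal attributes, and a Pearson correlation for two numeric attributes.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preferred_reading": ["fiction", "non-fiction", "fiction", "fiction",
                          "non-fiction", "fiction", "fiction", "non-fiction"],
    "price": [10.0, 12.5, 11.0, 9.5, 13.0, 10.5, 12.0, 9.0],
    "sales_tax": [1.0, 1.25, 1.1, 0.95, 1.3, 1.05, 1.2, 0.9],
})

# Nominal attributes: chi-square test of independence on a contingency table
table = pd.crosstab(df["gender"], df["preferred_reading"])
chi2, p_value, dof, expected = chi2_contingency(table)

# Numeric attributes: Pearson correlation; values near +/-1 suggest redundancy
r = np.corrcoef(df["price"], df["sales_tax"])[0, 1]
print(f"chi2={chi2:.2f}, p={p_value:.3f}, correlation={r:.3f}")
```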
Data Integration Approaches

• Data Consolidation
– Brings data together from several separate systems
– The goal is to reduce the number of data storage locations.

• Data Propagation
– Data propagation is the use of applications to copy data from
one location to another.
– It is event-driven and can be done synchronously or
asynchronously
Data Integration Approaches

• Data Virtualization
– Uses an interface to provide a near real-time, unified view of
data from disparate sources with different data models.
• Data Federation
– A form of data virtualization
– Uses a virtual database and creates a common data model for
heterogeneous data from different systems
• Data Warehousing
– Data warehouses are storage repositories for data
– Data warehousing implies the cleansing, reformatting, and
storage of data
Data Reduction

• Most machine learning and data mining techniques may not be effective for high-dimensional data.

• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

• Analytics on the reduced data set should be more efficient, yet produce the same (or almost the same) analytical results.
Data Reduction

• Data reduction strategies include:
– Dimensionality reduction: the process of reducing the number of random variables or attributes under consideration.
• Methods: attribute subset selection, wavelet transforms, and principal components analysis.
– Numerosity reduction: replace the original data volume by alternative, smaller forms of data representation.
• Methods:
– Parametric: a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data.
» Regression and log-linear models
Data Reduction

• Data reduction strategies include:
– Nonparametric methods for storing reduced representations of the data include:
• Histograms, clustering, sampling, and data cube aggregation
– Data compression: transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
Data Reduction

– Lossless compression: the original data can be reconstructed from the compressed data without any loss of information.
– Lossy compression: only an approximation of the original data can be reconstructed.
Data Reduction- Wavelet Transform
• The discrete wavelet transform (DWT), when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
– The two vectors are of the same length.
• "How can this technique be useful for data reduction if the wavelet-transformed data are of the same length as the original data?"
• The wavelet transform is a mathematical tool used in signal processing and data analysis that decomposes a signal into its constituent parts at different frequency levels.

• A compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients.
Data Reduction- Wavelet Transform

• Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
• Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes.
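A minimal sketch of DWT-based reduction (assuming the PyWavelets package, pywt, is available; the wavelet family and the retention fraction are arbitrary choices): keep only the strongest coefficients, zero out the rest, and reconstruct an approximation with the inverse transform.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=256))          # illustrative 1-D data vector

# Forward DWT: decompose the vector into wavelet coefficients
coeffs = pywt.wavedec(x, "haar", level=4)
flat, slices = pywt.coeffs_to_array(coeffs)

# Keep only the strongest 10% of coefficients (by magnitude)
threshold = np.quantile(np.abs(flat), 0.90)
flat[np.abs(flat) < threshold] = 0.0

# Inverse DWT: reconstruct an approximation of the original data
kept = pywt.array_to_coeffs(flat, slices, output_format="wavedec")
x_approx = pywt.waverec(kept, "haar")
```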
Data Reduction: PCA

• Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions.

• PCA searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k is less than or equal to n.

• The goal is to find a projection that captures the largest amount of variation in the data.
Data Reduction: PCA

• PCA is a technique for forming new variables that are linear composites of the original variables.
– It reduces the dimensionality of a data set by finding a new, smaller set of variables.
• PCs may be used as inputs to multiple regression and cluster analysis.
• PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
Data Reduction: PCA

• PCA retains most of the sample's information (the variation present in the sample, given by the correlations between the original variables).

– The new variables are called principal components, and the values of the new variables are called principal component scores.

• PCA is a dimensionality reduction technique that transforms a dataset into a set of orthogonal components, known as principal components.
Data Reduction: PCs

• PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
– The 1st PC is a minimum-distance fit to a line in X space.
– The 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
Data Reduction: PCA- How it works
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space

• Covariance matrix: PCA starts by computing the covariance matrix of the data, which represents how the different dimensions (features) of the data vary together.
• Eigenvalues: indicate the amount of variance captured by each principal component.
• Eigenvectors: represent the directions of the axes along which the data vary the most.
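A minimal NumPy sketch of the steps just described (the data and the number of retained components, k, are arbitrary): center the data, compute the covariance matrix, take its eigenvectors, and project onto the top-k directions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # illustrative data: 100 samples, 5 features
k = 2                                # number of principal components to keep

# 1. Center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigendecomposition: eigenvalues = variance captured, eigenvectors = directions
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
components = eigvecs[:, order[:k]]

# 3. Principal component scores: project the centered data onto the new axes
scores = Xc @ components                 # shape (100, k)
```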
Data Reduction: Feature Subset Selection

• Another way to reduce the dimensionality of data

• Redundant features
– Duplicate much or all of the information contained in one or more other attributes
– Example: the purchase price of a product and the amount of sales tax paid
• Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting students' GPA
Data Reduction: Feature Subset Selection
• Techniques
– Brute-force approach:
• Try all possible feature subsets as input to the machine learning algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the machine learning algorithm
– Filter approaches:
• Features are selected before the machine learning algorithm is run (see the sketch below)
– Wrapper approaches:
• Use the machine learning algorithm as a black box to find the best subset of attributes
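A minimal sketch of a filter approach using scikit-learn's SelectKBest (assuming scikit-learn is available; the dataset and k are illustrative): each feature is scored against the target before any model is trained, and only the top-k are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 10 candidate features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)     # target depends on features 0 and 3

# Filter approach: rank features by an ANOVA F-score, independently of any model
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print("selected feature indices:", selector.get_support(indices=True))
```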
Data Reduction: Histograms

• Histograms use binning to approximate data distributions and are a popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
• Example (a sorted list of prices): 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

A histogram for price using singleton buckets: each bucket represents one price-value/frequency pair.
Data Reduction: Histograms

• How are the buckets determined and the attribute values partitioned?
• Partitioning rules:
– Equal-width: in an equal-width histogram, the width of each bucket range is uniform.
– Equal-frequency (or equal-depth): in an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant.
• Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data.
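A minimal NumPy sketch applying both partitioning rules to the sorted price list shown on the previous slide (the number of buckets is an arbitrary choice):

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
                   14, 14, 14, 15, 15, 15, 15, 15, 15,
                   18, 18, 18, 18, 18, 18, 18, 18,
                   20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
                   25, 25, 25, 25, 25, 28, 28, 30, 30, 30])

# Equal-width histogram: 3 buckets of uniform width over [min, max]
width_counts, width_edges = np.histogram(prices, bins=3)

# Equal-frequency (equal-depth) histogram: bucket edges at the quantiles,
# so each bucket holds roughly the same number of values
depth_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print("equal-width:", width_counts, width_edges)
print("equal-depth:", depth_counts, depth_edges)
```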
Application of DR

• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
• Face recognition
• Handwritten digit recognition
• Intrusion detection
Data Reduction-Example
Data Transformation and Discretization

• Data are transformed or consolidated into forms appropriate for mining
• This involves the following:
– Smoothing: remove noise from the data (binning, regression, and clustering)
– Attribute construction: new attributes are constructed from the given set of attributes
– Aggregation: summary or aggregation operations are applied to the data
Data Transformation and Discretization

• Data are transformed or consolidated into forms appropriate for mining
• This involves the following:
– Normalization: scale the attribute values to fall within a small, specified range, e.g., -1.0 to 1.0 or 0 to 1
– Discretization: raw values of a numeric attribute (e.g., age) are replaced by interval labels
– Concept hierarchy generation for nominal data: e.g., street can be generalized to higher-level concepts, like city or country
Data Transformation by Normalization

• To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.
• Normalizing the data attempts to give all attributes an equal weight.
• Normalization is useful for classification (e.g., neural networks, nearest-neighbor) and clustering.
• Methods for data normalization:
– Min-max normalization
– Z-score normalization
– Normalization by decimal scaling
Data Transformation by Normalization

• Min-max normalization performs a linear transformation on the original data.
• Min-max normalization preserves the relationships among the original data values.
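The formula slides are not reproduced in this extraction; the following is a minimal NumPy sketch of the three standard methods listed above (the attribute values and the target range are arbitrary):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # illustrative attribute values

# Min-max normalization to a new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer
# such that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)
```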
Discretization by Binning

• Binning is a top-down splitting technique based on a specified number of bins.
• Binning methods are also used as discretization methods for data reduction and concept hierarchy generation.
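A minimal pandas sketch of discretizing a numeric attribute (the age values and the number of bins are illustrative): pd.cut produces equal-width intervals and pd.qcut produces equal-frequency intervals, replacing raw values with interval labels.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 22, 25, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width discretization into 3 interval labels
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equal-depth) discretization into 3 interval labels
equal_depth = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```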
Case Study
Online data preprocessing: a case study approach (Mohammed et al., 2019)

• The authors applied preprocessing to Flight MH370 social-media data. After preprocessing, they used the resulting data to examine the flight community structure, discover types of social relationships, reveal the truth behind some of the unusual events, and study people's coping behavior (adaptation patterns) during the disaster.
