Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g. occupation=“ ”
noisy: containing errors or outliers e.g. Salary=“-10”
inconsistent: containing discrepancies in codes or
names
e.g. Age=“42” Birthday=“03/07/1997”
e.g. Was rating “1,2,3”, now rating “A, B, C”
e.g. discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected
and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why is Data Preprocessing important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation
comprises the majority of the work of building a data
warehouse
Measures for data quality
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how trustworthy are the data? Are they correct?
Interpretability: how easily can the data be understood?
Value added
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is much smaller in volume
but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning
Importance
“Data cleaning is the number one problem in data
warehousing”
Data cleaning tasks – this routine attempts to
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data
Missing data may be due to
Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time
of entry
No registered history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data
1. Ignore the tuple
2. Fill in missing values manually: tedious and often infeasible
3. Fill in automatically with a global constant.
4. Fill in with the attribute mean
5. Fill in with the attribute mean for all samples
belonging to the same class as the given tuple
6. Fill in with the most probable value, determined using
regression or inference-based methods such as a Bayesian
formula or a decision tree
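As an illustration of strategies 4 and 5 above, a minimal sketch using pandas; the column names and values are invented for the example.

import numpy as np
import pandas as pd

# Hypothetical customer table with missing income values.
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],            # class label of each tuple
    "income": [50000, np.nan, 30000, np.nan, 34000],
})

# Strategy 4: fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of the samples in the same class.
class_filled = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())   # the global mean (38,000) fills both gaps
print(class_filled.tolist())  # class A mean (50,000) and class B mean (32,000) are used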
Noisy Data
Noise: random error or variance in a measured
variable.
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Clustering
Similar values are organized into groups (clusters).
Values that fall outside of clusters are considered outliers.
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Regression
Data can be smoothed by fitting the data to a function, e.g., with
linear regression or multiple linear regression
Discretization Method: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute,
the width of intervals will be W = (B - A)/N
The most straightforward approach, but outliers may dominate the
presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing
approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
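The same calculation as a short Python sketch (plain lists, no libraries):

# Equi-depth binning with smoothing by bin means and by bin boundaries,
# reproducing the price example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

smoothed_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def to_boundary(b):
    # replace each value by the closer of the bin's min/max boundary
    return [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]

smoothed_boundaries = [to_boundary(b) for b in bins]

print(smoothed_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smoothed_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]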
Data Integration
Data integration:
Combines data from multiple sources (multiple databases, data
cubes, or flat files)
Issues during data integration
Schema integration
Integrate metadata (about the data) from different sources
Entity identification problem: identify real world entities from
multiple data sources
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different
sources may differ, e.g., different scales, metric vs. British
units (see the sketch after this list)
Removing duplicates and redundant data
An attribute may be redundant if it can be derived from another table (e.g., annual revenue)
Inconsistencies in attribute naming
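As a toy sketch of resolving such a value conflict before integrating two sources (assuming pandas; the tables, column names, and the weight attribute are invented for illustration):

import pandas as pd

# Two hypothetical sources describing the same kind of entity: one stores
# weight in pounds (British/imperial units), the other in kilograms (metric).
src_a = pd.DataFrame({"cust_id": [1, 2], "weight_lb": [154.0, 176.0]})
src_b = pd.DataFrame({"cust_id": [3],    "weight_kg": [70.0]})

# Resolve the value conflict by converting everything to one scale ...
src_a["weight_kg"] = src_a["weight_lb"] * 0.45359237
# ... then integrate and drop the now-redundant attribute.
merged = pd.concat([src_a.drop(columns="weight_lb"), src_b], ignore_index=True)
print(merged)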
Data Transformation
Smoothing: remove noise from data (binning, clustering,
regression)
Normalization: scaled to fall within a small, specified
range such as –1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations
apply to data
Generalization: concept hierarchy climbing
Low-level/primitive/raw data are replaced by higher-level
concepts
Data Transformation: Normalization
Useful for classification algorithms involving
Neural networks
Distance measurements (nearest neighbor)
Backpropagation algorithm (NN) – normalizing helps to
speed up the learning phase
Distance-based methods – normalization prevents
attributes with an initially large range (e.g., income) from
outweighing attributes with initially smaller ranges (e.g.,
binary attributes)
Data Transformation: Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA
Ex. Let income range from 12,000 to 98,000 be normalized to [0.0, 1.0].
Then 73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μA: mean, σA: standard deviation of attribute A):
v' = (v - μA) / σA
Ex. Let μA = 54,000 and σA = 16,000. Then 73,600 is mapped to
(73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
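The three normalizations as a small Python sketch; the helper names are mine, not from any particular library.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # Z-score normalization
    return (v - mu) / sigma

def decimal_scaling(values):
    # Divide by 10^j, with j the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([-10, 250, 987]))            # [-0.01, 0.25, 0.987]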
Data Reduction Strategies
Data may be too big to work with – analysis can take too long or
become impractical or infeasible
Data reduction techniques
Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results
Data reduction strategies
Data cube aggregation – apply aggregation operations (data cube)
Dimensionality reduction—remove unimportant attributes
Data compression – encoding mechanism used to reduce data size
Numerosity reduction – data replaced or estimated by alternative,
smaller data representations: parametric models (store the model
parameters instead of the actual data) or non-parametric methods
(clustering, sampling, histograms)
Discretization and concept hierarchy generation – replaced by ranges
or higher conceptual levels
Dimensionality Reduction
Problem: Feature selection (i.e., attribute subset
selection):
Select a minimum set of attributes (features) that is
sufficient for the data mining task.
Best/worst attributes are determined using tests of
statistical significance – e.g., information gain (as used when
building a decision tree for classification)
Solution: Heuristic methods (due to the exponential number of
choices, 2^d):
step-wise forward selection (sketched after this list)
step-wise backward elimination
combining forward selection and backward elimination
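A generic sketch of the step-wise forward selection heuristic. The scoring function is an assumption: it could be information gain or the held-out accuracy of a model built on the candidate subset.

def forward_selection(attributes, score):
    # Greedily add the attribute that most improves score(subset),
    # stopping when no remaining attribute gives an improvement.
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best:
                best, best_attr, improved = s, a, True
        if improved:
            selected.append(best_attr)
    return selected

# Toy usage with a made-up additive score: a1 and a3 are the useful attributes.
useful = {"a1": 0.3, "a2": 0.0, "a3": 0.2}
print(forward_selection(["a1", "a2", "a3"], lambda subset: sum(useful[a] for a in subset)))
# ['a1', 'a3']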
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video, image compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequences are not audio: they are typically short and vary
slowly with time
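Returning to string compression: a tiny sketch of lossless compression with Python's standard zlib module (the record text is made up).

import zlib

text = b'occupation="engineer";salary=52000;' * 100   # repetitive hypothetical records
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "->", len(compressed))   # far fewer bytes; the exact ratio varies
assert restored == text                   # lossless: the original is recovered exactly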
Data Compression
[Figure omitted: lossless compression restores the original data exactly, while lossy compression yields only an approximation of the original data]
Numerosity Reduction
Reduce the data volume by choosing alternative ‘smaller’
forms of data representation
Two types:
Parametric – a model is used to estimate the data; only the
model parameters are stored instead of the actual data
regression
log-linear model
Nonparametric –storing reduced representation of the
data
Histograms
Clustering
Sampling
Histograms
A popular data reduction technique
Divide data into buckets and store average (sum) for each
bucket
Use binning to approximate data distributions
Buckets lie along the horizontal axis; the height (or area) of a
bucket gives the average frequency of the values it represents
A bucket holding a single attribute-value/frequency pair is called
a singleton bucket; more commonly, buckets represent continuous
ranges of the given attribute
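A small sketch of an equi-width histogram as a reduced representation, reusing the price values from the binning example; only the bucket ranges and counts would be stored.

from collections import Counter

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_buckets = 3
lo, hi = min(values), max(values)
width = (hi - lo) / n_buckets            # equal-width buckets

def bucket_of(v):
    # map a value to its bucket index, clamping the maximum into the last bucket
    return min(int((v - lo) // width), n_buckets - 1)

counts = Counter(bucket_of(v) for v in values)
for b in range(n_buckets):
    print(f"[{lo + b * width:g}, {lo + (b + 1) * width:g}): {counts[b]} values")
# [4, 14): 2 values   [14, 24): 3 values   [24, 34): 4 values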
Clustering
Partition data set into clusters, and one can store cluster
representation only.
Can be very effective if data is clustered but not if data
is “smeared”/ spread.
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures.
There are many choices of clustering definitions and
clustering algorithms
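A minimal sketch, assuming scikit-learn is available, of keeping only the cluster representatives (centroids and cluster sizes) instead of the raw points; the data is synthetic.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs: data that clusters well, not "smeared" data.
data = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Reduced representation: 2 centroids plus the cluster sizes (vs. 200 raw points).
print(km.cluster_centers_)
print(np.bincount(km.labels_))   # roughly [100, 100]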
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Adaptive sampling methods, e.g., stratified sampling, are
developed for such cases
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
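A short pandas sketch of the three schemes on a skewed, synthetic table (~10% drawn per stratum in the stratified case):

import pandas as pd

data = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10, "x": range(100)})

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR).
srswor = data.sample(n=10, replace=False, random_state=1)
srswr = data.sample(n=10, replace=True, random_state=1)

# Stratified sampling: draw ~10% from each class so the rare class "B"
# is still represented approximately proportionally.
stratified = data.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1)
)
print(stratified["cls"].value_counts())   # 9 tuples of class A, 1 of class B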
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior)
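A concrete sketch of one such step with pandas: discretizing age into intervals and replacing values with concept labels (the cut points 30 and 60 are assumptions, not prescribed here).

import pandas as pd

ages = pd.Series([18, 25, 42, 67, 73])

# Replace the raw numeric values by higher-level concept labels.
labels = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle-aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle-aged', 'senior', 'senior']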
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
because some data mining algorithms only accept
categorical attributes.
Some techniques:
Binning methods – equal-width, equal-frequency
Histogram
Entropy-based methods
References
J. Han and M. Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2000.
T. Dasu and T. Johnson. Exploratory Data Mining and Data
Cleaning. John Wiley & Sons, 2003.
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive
Framework for Data Cleaning and Transformation. VLDB 2001.
H.V. Jagadish et al. Special Issue on Data Reduction Techniques.
Bulletin of the Technical Committee on Data Engineering, 20(4),
December 1997.