Unit 2 Data Preprocessing

The document discusses data preprocessing, which includes data cleaning, integration, reduction, and transformation. Data cleaning involves handling missing, noisy, and inconsistent data through techniques like filling in missing values, smoothing noisy data, and resolving inconsistencies. Data integration combines data from multiple sources by mapping schemas and resolving conflicts. Data reduction reduces dimensionality and numerosity through techniques like compression and sampling, while data transformation and discretization normalize the data and map values to intervals.

Uploaded by

Abhijeet Thamake

Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view

 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much are the data trusted to be correct?
 Interpretability: how easily the data can be
understood?

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation

Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data

 Data is not always available

 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 history or changes of the data not registered
 Missing data may need to be inferred
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as
Bayesian formula or decision tree
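As a minimal sketch of two of the automatic strategies listed above (attribute mean and class-wise attribute mean), the following uses made-up records; the class labels, income values, and function names are assumptions for illustration, not part of the slides.

```python
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
records = [
    ("low", 20_000), ("low", None), ("low", 30_000),
    ("high", 80_000), ("high", 100_000), ("high", None),
]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

global_filled = fill_with_attribute_mean(records)
class_filled = fill_with_class_mean(records)
```

With these toy values the global mean fills both gaps with the same number, while the class-wise mean keeps the "low" and "high" groups apart, which is why the slide calls it the smarter option.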
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning

 duplicate records

 incomplete data

 inconsistent data

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)

Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
 Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter's Wheel)
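The rule checks named under discrepancy detection can be sketched in a few lines. This is only an illustration under assumed data: the record layout, field names, and the age range of 0 to 120 are hypothetical and not taken from the slides.

```python
# Hypothetical customer records; field names are illustrative only.
rows = [
    {"cust_id": 1, "zip": "61801", "age": 42},
    {"cust_id": 2, "zip": None, "age": 35},
    {"cust_id": 2, "zip": "6180", "age": -10},
]

def unique_rule_violations(rows, key):
    """Uniqueness rule: return values of `key` that occur more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r[key]
        (dupes if v in seen else seen).add(v)
    return dupes

def null_rule_violations(rows, key):
    """Null rule: return indexes of rows where `key` is missing."""
    return [i for i, r in enumerate(rows) if r[key] is None]

def range_violations(rows, key, lo, hi):
    """Metadata (range) check: indexes of rows with values outside [lo, hi]."""
    return [i for i, r in enumerate(rows)
            if r[key] is not None and not lo <= r[key] <= hi]

duplicate_ids = unique_rule_violations(rows, "cust_id")
missing_zip = null_rule_violations(rows, "zip")
bad_age = range_violations(rows, "age", 0, 120)
```

Checks like these only flag suspicious rows; deciding how to correct them is the scrubbing and auditing step described above.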

Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton
= William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources
are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration

 Redundant data often occur when integrating multiple databases
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes can often be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Nominal Data)
 Χ2 (chi-square) test
$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$
 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess    Not play chess    Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction   50 (210)      1000 (840)        1050
Sum (col.)                 300           1200              1500

 Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
$$ \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.93 $$
 It shows that like_science_fiction and play_chess are
correlated in the group
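The calculation can be reproduced with a short sketch. It derives each expected count as (row sum × column sum) / total, then sums the four cell contributions; the function name `chi_square` is an illustrative choice, and the printed value should come out close to the figure quoted in the example.

```python
# Observed counts from the contingency table in the example.
observed = [
    [250, 200],   # like science fiction: play chess, not play chess
    [50, 1000],   # not like science fiction: play chess, not play chess
]

def chi_square(table):
    """Pearson chi-square statistic for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_sums[i] * col_sums[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

chi2 = chi_square(observed)
print(round(chi2, 2))
```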
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product moment coefficient)
$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 rA,B = 0: no linear correlation (this alone does not imply independence); rA,B < 0: negatively correlated
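A minimal sketch of the correlation coefficient follows. It assumes the (n − 1) form of the denominator together with the sample standard deviation (`statistics.stdev`), so the two are consistent; the data values are illustrative and not from the slides.

```python
from statistics import mean, stdev

def pearson_r(a, b):
    """Correlation coefficient with the (n - 1) denominator.

    stdev() is the sample standard deviation, which matches the
    (n - 1) factor used here.
    """
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * stdev(a) * stdev(b))

# Illustrative data (not from the slides).
a = [1, 2, 3, 4, 5]
r_pos = pearson_r(a, [2, 4, 6, 8, 10])   # B rises as A rises
r_neg = pearson_r(a, [10, 8, 6, 4, 2])   # B falls as A rises
```

The first pair gives a coefficient at the positive extreme and the second at the negative extreme, matching the sign interpretation above.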
Covariance (Numeric Data)
 Covariance is similar to correlation

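$$ Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B}) $$
Correlation coefficient:
$$ r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B} $$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B (computed with the same 1/n normalization as the covariance).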
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected value,
B is likely to be smaller than its expected value.
 Independence: if A and B are independent, then CovA,B = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example

 It can be simplified in computation as
$$ Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B} $$
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
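The stock example can be checked with a short sketch that computes the covariance two ways: from the definition (average product of deviations) and from the shortcut E(A·B) − E(A)·E(B). Variable names are illustrative; both results should agree with the value worked out above.

```python
# Stock values from the example, one (A, B) pair per day.
pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]

n = len(pairs)
mean_a = sum(a for a, _ in pairs) / n   # E(A)
mean_b = sum(b for _, b in pairs) / n   # E(B)

# Definition: average product of deviations from the means.
cov_def = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n

# Shortcut: E(A*B) - E(A)*E(B).
cov_short = sum(a * b for a, b in pairs) / n - mean_a * mean_b
```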
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

 Wavelet transforms

 Principal Components Analysis (PCA)

 Feature subset selection, feature creation

 Numerosity reduction (some simply call it: Data Reduction)

 Regression and Log-Linear Models

 Histograms, clustering, sampling

 Data cube aggregation

 Data compression

Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in data
 The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
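To make the eigenvector idea concrete, here is a rough standard-library sketch that builds the covariance matrix and approximates its dominant eigenvector by power iteration. Power iteration is a technique swapped in here to avoid a linear algebra library; it is not named in the slides, and it assumes the starting vector is not orthogonal to the dominant eigenvector. The points are made up.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Illustrative 2-D points lying close to the line y = x (not from the slides).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov, centered = covariance_matrix(points)
pc1 = first_principal_component(cov)

# Project each centered point onto pc1: two attributes reduced to one.
projected = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Because the points lie near y = x, the first component comes out close to the diagonal direction, and the single projected coordinate retains most of the variation.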

Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA

Data Reduction 2: Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model

parameters, store only the parameters, and discard

the data (except possible outliers)
 Ex.: Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
 Non-parametric methods
 Do not assume models

 Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models
 Linear regression
 Data modeled to fit a straight line

 Often uses the least-square method to fit the line

 Multiple regression
 Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

 Log-linear model
 Approximates discrete multidimensional probability

distributions

Regression Analysis
 Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
 The parameters are estimated so as to give a "best fit" of the data
 Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
 Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: data points and the fitted line y = x + 1, showing an observed value Y1 at X1 and its fitted value Y1']
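A least squares fit of a straight line can be sketched as follows. The data points are invented to scatter around y = x + 1, and the names `fit_line` and `predict` are illustrative; the point is that, for numerosity reduction, only the two fitted parameters need to be stored.

```python
def fit_line(xs, ys):
    """Least-squares slope w and intercept b for y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Illustrative points scattered around y = x + 1 (not from the slides).
xs = [0, 1, 2, 3, 4]
ys = [1.1, 1.9, 3.2, 3.9, 4.9]
w, b = fit_line(xs, ys)

def predict(x):
    """Regenerate an approximate value from the stored parameters only."""
    return w * x + b
```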
Clustering
 Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data
is “smeared”
 Can have hierarchical clustering and be stored in multidimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 10

Sampling

 Sampling: obtaining a small sample s to represent the whole

data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified sampling
 Note: Sampling may not reduce database I/Os (page at a
time)
Types of Sampling

 Simple random sampling

 There is an equal probability of selecting any particular

item
 Sampling without replacement
 Once an object is selected, it is removed from the

population
 Sampling with replacement
 A selected object is not removed from the population

 Stratified sampling:
 Partition the data set, and draw samples from each

partition (proportionally, i.e., approximately the same

percentage of the data)
 Used in conjunction with skewed data
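The sampling types above can be sketched with Python's `random` module. The skewed data set (90 "young" and 10 "senior" records), the 10% fraction, and the helper name `stratified_sample` are assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data set: 90 "young" records and 10 "senior" records.
data = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

# Simple random sample without replacement (SRSWOR): no item repeats.
srswor = rng.sample(data, 10)

# Simple random sample with replacement (SRSWR): items may repeat.
srswr = rng.choices(data, k=10)

def stratified_sample(rows, key, fraction, rng):
    """Draw approximately the same fraction from every stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

stratified = stratified_sample(data, key=lambda r: r[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 may contain no "senior" record at all, whereas the stratified sample always represents both groups in proportion, which is the motivation for using it with skewed data.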

Sampling: With or without Replacement
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Data Cube Aggregation

 The lowest level of a data cube (base cuboid)

 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
Data Compression
[Figure: original data mapped to compressed data; lossless compression recovers the original data, lossy compression recovers only an approximation of the original data]
Data Transformation
 A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified with
one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing
Normalization
 Min-max normalization: to [new_minA, new_maxA]
$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 $$
 Z-score normalization (μA: mean, σA: standard deviation):
$$ v' = \frac{v - \mu_A}{\sigma_A} $$
 Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
$$ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 $$
 Normalization by decimal scaling
$$ v' = \frac{v}{10^j} $$
where j is the smallest integer such that max(|v'|) < 1
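The three normalization methods can be sketched as small functions and checked against the income example. The function names and the decimal-scaling input values are illustrative assumptions; the decimal-scaling helper searches only non-negative j, which is enough when some value has magnitude of at least 1.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
scaled, j = decimal_scaling([-986, 917])   # illustrative values
```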
Discretization
 Three types of attributes
 Nominal: values from an unordered set, e.g., color, profession
 Ordinal: values from an ordered set, e.g., military or academic rank
 Numeric: quantitative values, e.g., integers or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
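As one unsupervised example of replacing values with interval labels, the sketch below uses equal-width intervals. The age values, the choice of three intervals, and the label format are assumptions for illustration; it also assumes the values are not all identical, since the interval width would then be zero.

```python
def equal_width_labels(values, k):
    """Map each value to an interval label using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # The maximum value belongs to the last interval, not a (k+1)-th one.
        idx = min(int((v - lo) / width), k - 1)
        left = lo + idx * width
        if idx < k - 1:
            labels.append(f"[{left:g}, {left + width:g})")
        else:
            labels.append(f"[{left:g}, {hi:g}]")
    return labels

ages = [13, 15, 22, 25, 33, 35, 40, 45, 52, 70]  # illustrative values
labels = equal_width_labels(ages, 3)
```

Ten distinct ages collapse into three labels, which is the data-size reduction the slide refers to.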

Data Discretization Methods
 Typical methods: All the methods can be applied recursively
 Binning
 Top-down split, unsupervised
 Histogram analysis
 Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ²) analysis (supervised when the χ² test uses class labels, as in ChiMerge; bottom-up merge)

Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
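The price example can be reproduced with a short sketch. The helper names are illustrative, the bin means are rounded to integers as in the example, and the partitioning assumes the number of values divides evenly by the number of bins.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin mean (rounded to an integer)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```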
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at

the lowest level of the hierarchy

 Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
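The distinct-value heuristic amounts to sorting attributes by their number of distinct values. A minimal sketch, using the counts from the example and an illustrative function name, could look like this; as noted above, it would order attributes such as weekday, month, quarter, and year incorrectly.

```python
# Distinct-value counts from the example.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}

def generate_hierarchy(counts):
    """Order attributes from top level (fewest distinct values) to lowest."""
    return sorted(counts, key=counts.get)

hierarchy = generate_hierarchy(distinct_counts)
print(" < ".join(reversed(hierarchy)))
```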
