
Data Mining: Data Preprocessing
Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?

• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
Major Tasks in Data Preprocessing

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  - Part of data reduction, of particular importance for numerical data
Forms of data preprocessing (overview figure)
Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Cleaning

• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
Missing Data

• Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - being inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree

The simpler strategies are sketched below.
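A minimal pandas sketch of the simpler strategies; the toy data and column names are hypothetical:

```python
import pandas as pd

# Toy sales data with missing customer incomes (hypothetical columns)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income": [52000.0, None, 61000.0, None, 48000.0],
    "class_label": ["yes", "no", None, "yes", "no"],
})

# 1. Ignore the tuple: drop rows whose class label is missing
dropped = df.dropna(subset=["class_label"])

# 2. Use a global constant to fill in the missing value
constant_filled = df.fillna({"income": -1, "class_label": "unknown"})

# 3. Use the attribute mean to fill in the missing value
mean_filled = df.assign(income=df["income"].fillna(df["income"].mean()))
```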
Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data?

• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and check them by human
• Regression
  - smooth by fitting the data to regression functions
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Both partitionings are sketched below.
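A short pandas sketch contrasting the two partitionings, using the price data from the next slide (three bins assumed):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: 3 intervals of width W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth partitioning: 3 intervals with ~4 samples each
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts(sort=False))
print(equal_depth.value_counts(sort=False))
```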
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
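A numpy sketch that reproduces the example above (bin means rounded to integers, as on the slide; ties in the boundary rule snap to the lower boundary):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)  # three equi-depth bins of four sorted values

# Smoothing by bin means: replace each value with its bin's rounded mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer bin boundary
lo, hi = bins[:, :1], bins[:, -1:]
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)       # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_boundaries)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```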
Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Integration

• Data integration: combines data from multiple sources into a coherent store.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent data mining process.
Data Integration

• There are a number of issues to consider during data integration. Schema integration and object matching can be tricky.
• How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.
• For example, how can the data analyst or the computer be sure that customer-id in one database and cust-number in another refer to the same attribute?
Data Integration

• When matching attributes from one database to another during integration, special attention must be paid to the structure of the data.
• For example, in one system a discount may be applied to the order, whereas in another system it is applied to each individual line item within the order.
• If this is not caught before integration, items in the target system may be improperly discounted.
Handling Redundant Data

• Redundant data often occur when multiple databases are integrated
  - The same attribute may have different names in different databases
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation

Strategies for data transformation are:

• Smoothing: remove noise from the data
• Attribute construction: new attributes are constructed
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: values are scaled to fall within a small, specified range
• Discretization: raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior)
Data Transformation by Normalization

• The measurement unit used can affect the data analysis.
  - For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
• To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.
• This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0].
Data Transformation by Normalization

Methods for normalization:

• Min-max normalization:

      v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

• Z-score normalization:

      v' = (v - mean_A) / stand_dev_A

• Normalization by decimal scaling:

      v' = v / 10^j,  where j is the smallest integer such that max(|v'|) < 1
Data Transformation by Normalization

Let A be a numeric attribute with n observed values v1, v2, ..., vn. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A]:

      v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Data Transformation by Normalization

In z-score normalization, the values of A are normalized based on the mean and standard deviation of A (all three methods are sketched in code below):

      v' = (v - mean_A) / stand_dev_A
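A minimal numpy sketch of the three methods; the sample values and the [0.0, 1.0] target range for min-max are illustrative assumptions:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical values

# Min-max normalization to a new range, here [0.0, 1.0] (assumed)
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: mean 0, standard deviation 1
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10**j
```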
Discretization

• The raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
• Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
• Discretization:
  - divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduce data size by discretization
  - prepare for further analysis
Discretization

• Discretization techniques can be categorized based on how the discretization is performed, such as whether it uses class information or in which direction it proceeds (i.e., top-down vs. bottom-up).
• If the discretization process uses class information, we say it is supervised discretization.
• If the process starts by first finding one or a few points to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
Discretization

• This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split-points, removes some by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals. The top-down case is sketched below.
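A hedged sketch of the top-down idea, using unsupervised midpoint splitting with an assumed recursion depth (a supervised method would pick split points using class information instead):

```python
def top_down_split(values, depth=2):
    """Recursively bisect the value range at its midpoint (unsupervised,
    equal-width splitting; real methods choose split points using class
    information or another quality measure)."""
    lo, hi = min(values), max(values)
    if depth == 0 or lo == hi:
        return [(lo, hi)]
    mid = (lo + hi) / 2  # one split point over the entire range
    left = [v for v in values if v <= mid]
    right = [v for v in values if v > mid]
    if not left or not right:
        return [(lo, hi)]
    return top_down_split(left, depth - 1) + top_down_split(right, depth - 1)

# Example: the price data from the binning slide
print(top_down_split([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
```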
Data Reduction Strategies

• A warehouse may store terabytes of complex data, so analysis/mining may take a very long time to run on the complete data set
• Data reduction
  - obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
Dimensionality Reduction

• Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.
• It transforms or projects the original data onto a smaller space.
• Attribute subset selection (i.e., feature selection) is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed, as in the sketch below.
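One simple heuristic for attribute subset selection is sketched below: drop near-constant attributes as irrelevant and highly correlated ones as redundant. The thresholds, helper name, and toy columns are assumptions, not a standard API:

```python
import pandas as pd

def select_attributes(df, var_threshold=1e-3, corr_threshold=0.95):
    """Drop near-constant (irrelevant) and highly correlated (redundant)
    numeric attributes; one simple heuristic among many."""
    keep = [c for c in df.columns if df[c].var() > var_threshold]
    corr = df[keep].corr().abs()
    selected = []
    for col in keep:
        if all(corr.loc[col, s] < corr_threshold for s in selected):
            selected.append(col)
    return df[selected]

# Hypothetical data: 'const' is near-constant, 'km' duplicates 'meters'
df = pd.DataFrame({
    "meters": [1.0, 2.0, 3.0, 4.0],
    "km": [0.001, 0.002, 0.003, 0.004],
    "const": [7.0, 7.0, 7.0, 7.0],
    "score": [10.0, 5.0, 8.0, 1.0],
})
print(select_attributes(df).columns.tolist())  # ['meters', 'score']
```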
Numerosity Reduction

• Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation.
• These techniques may be parametric or nonparametric.
  - For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored instead of the actual data. (Outliers may also be stored.) The parametric case is sketched below.
  - Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, etc.
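A small sketch of the parametric case: fit a straight line to a hypothetical attribute and store only the two model parameters instead of all 1,000 points:

```python
import numpy as np

# Hypothetical series: 1,000 (x, y) points where y is roughly linear in x
x = np.arange(1000.0)
y = 3.0 * x + 5.0 + np.random.default_rng(0).normal(0, 2.0, 1000)

# Parametric reduction: fit a line and store only two parameters
slope, intercept = np.polyfit(x, y, deg=1)

# Reconstruct approximate values from the model instead of the raw data
y_approx = slope * x + intercept
print(slope, intercept)  # close to 3.0 and 5.0
```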
Data Compression

• In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless (illustrated below).
• If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
• Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
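A tiny illustration of the lossless case with Python's standard zlib module; the repeated byte string is just sample data:

```python
import zlib

data = b"price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34\n" * 100

compressed = zlib.compress(data)         # lossless compression
restored = zlib.decompress(compressed)   # exact reconstruction

assert restored == data                  # no information loss
print(len(data), "->", len(compressed))  # original vs compressed size
```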
Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Discretization and Concept Hierarchy

• Discretization
  - reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior), as in the sketch below.
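A minimal pandas sketch that climbs one level of an age concept hierarchy; the age boundaries (18 and 60) and labels are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 58, 67, 81])

# Replace numeric ages with higher-level concepts; the boundaries
# (18 and 60) are illustrative assumptions, not fixed definitions
labels = pd.cut(ages, bins=[0, 18, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())
```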
Discretization for Numeric Data

• Binning (see sections before)
• Histogram analysis (see sections before)
• Clustering analysis (see sections before)