
UNIT-2 Data Preprocessing

Lecture Topic
**********************************************
Lecture-13 Why preprocess the data?
Lecture-14 Data cleaning
Lecture-15 Data integration and transformation
Lecture-16 Data reduction
Lecture-17 Discretization and concept
hierarchy generation
Lecture-13
Why preprocess the data?
Lecture-13 Why Data Preprocessing?
Data in the real world is:

incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or
names
No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of
quality data

Lecture-13 Why Data Preprocessing?


Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:

Accuracy

Completeness

Consistency

Timeliness

Believability

Value added

Interpretability

Accessibility
Broad categories:

intrinsic, contextual, representational, and
accessibility.

Lecture-13 Why Data Preprocessing?


Major Tasks in Data Preprocessing
Data cleaning

Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration

Integration of multiple databases, data cubes, or files
Data transformation

Normalization and aggregation
Data reduction

Obtains a reduced representation that is much smaller in volume but produces the
same or similar analytical results
Data discretization

Part of data reduction but with particular importance, especially
for numerical data

Lecture-13 Why Data Preprocessing?


Forms of data preprocessing

Lecture-13 Why Data Preprocessing?


Lecture-14
Data cleaning
Data Cleaning

Data cleaning tasks



Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Lecture-14 - Data cleaning


Missing Data
Data is not always available

E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the time of
entry

history or changes of the data were not registered
Missing data may need to be inferred.
Lecture-14 - Data cleaning
How to Handle Missing Data?
Ignore the tuple: usually done when class label
is missing
Fill in the missing value manually
Use a global constant to fill in the missing value:
e.g., “unknown”

Lecture-14 - Data cleaning


How to Handle Missing Data?
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class to fill in the missing value
Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision
tree

Lecture-14 - Data cleaning
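The fill strategies above can be sketched in a few lines of pandas; a minimal illustration, assuming a hypothetical table with columns income and class:

# Sketch (assumption): filling missing 'income' values in a toy pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Global constant fill, e.g. the label "unknown"
filled_const = df["income"].fillna("unknown")

# Attribute mean fill
filled_mean = df["income"].fillna(df["income"].mean())

# Mean of all samples belonging to the same class
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_class_mean.tolist())   # [30.0, 30.0, 50.0, 60.0, 70.0]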


Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming conventions
Other data problems that require data cleaning

duplicate records

incomplete data

inconsistent data

Lecture-14 - Data cleaning


How to Handle Noisy Data?
Binning method:

first sort data and partition into (equal-
frequency) bins

then one can smooth by bin means, smooth
by bin median, smooth by bin boundaries
Clustering

detect and remove outliers
Regression

smooth by fitting the data to a regression
function, e.g., linear regression
Lecture-14 - Data cleaning
Simple Discretization Methods: Binning

Equal-width (distance) partitioning:



It divides the range into N intervals of equal size:
uniform grid

if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.

The most straightforward

But outliers may dominate presentation

Skewed data is not handled well.
Equal-depth (frequency) partitioning:

It divides the range into N intervals, each containing
approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky.

Lecture-14 - Data cleaning


Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Lecture-14 - Data cleaning
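A minimal Python sketch of the equi-depth binning and smoothing shown above (plain lists, no external libraries assumed):

# Sketch: equi-depth binning and smoothing of the price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: each value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]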


Cluster Analysis

Lecture-14 - Data cleaning


Regression
(Figure: data points such as (X1, Y1) smoothed onto the fitted regression line y = x + 1)

Lecture-14 - Data cleaning


Lecture-15
Data integration and
transformation
Data Integration
Data integration:

combines data from multiple sources into a coherent
store
Schema integration

integrate metadata from different sources

Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts

for the same real world entity, attribute values from
different sources are different

possible reasons: different representations, different
scales, e.g., metric vs. British units

Lecture-15 - Data integration and transformation


Handling Redundant Data in Data
Integration

Redundant data often occur when integrating multiple databases

The same attribute may have different names
in different databases

One attribute may be a “derived” attribute in
another table, e.g., annual revenue

Lecture-15 - Data integration and transformation


Handling Redundant Data in Data
Integration
Redundant data may be able to be
detected by correlation analysis
Careful integration of the data from
multiple sources may help reduce/avoid
redundancies and inconsistencies and
improve mining speed and quality

Lecture-15 - Data integration and transformation
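Correlation-based redundancy detection can be sketched with NumPy; a minimal illustration in which the attribute names and the cut-off threshold are made up for the example:

# Sketch: flag a pair of numeric attributes as redundant when their
# correlation coefficient is close to +/-1.
import numpy as np

annual_revenue  = np.array([100., 120., 150., 180., 210.])
monthly_revenue = annual_revenue / 12.0        # a "derived" attribute

r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
if abs(r) > 0.95:                              # threshold is a design choice
    print(f"correlation {r:.2f}: likely redundant, consider dropping one attribute")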


Data Transformation

Smoothing: remove noise from data


Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing

Lecture-15 - Data integration and transformation


Data Transformation

Normalization: scaled to fall within a small,


specified range

min-max normalization

z-score normalization

normalization by decimal scaling
Attribute/feature construction

New attributes constructed from the given ones

Lecture-15 - Data integration and transformation


Data Transformation: Normalization

min-max normalization

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

z-score normalization

    v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Lecture-15 - Data integration and transformation
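A minimal NumPy sketch of the three normalization formulas above, applied to one small attribute vector:

# Sketch: min-max, z-score, and decimal-scaling normalization of one attribute.
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

# min-max normalization into [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j

print(v_minmax)    # [0.  0.125  0.25  0.5  1.]
print(v_decimal)   # [0.02  0.03  0.04  0.06  0.1]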


Lecture-16
Data reduction
Data Reduction

Warehouse may store terabytes of data:


Complex data analysis/mining may take a
very long time to run on the complete data
set
Data reduction

Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results

Lecture-16 - Data reduction


Data Reduction Strategies

Data reduction strategies



Data cube aggregation

Attribute subset selection

Dimensionality reduction

Numerosity reduction

Discretization and concept hierarchy
generation

Lecture-16 - Data reduction


Data Cube Aggregation
The lowest level of a data cube

the aggregated data for an individual entity of interest

e.g., a customer in a phone calling data warehouse.
Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with
Reference appropriate levels

Use the smallest representation which is enough to
solve the task
Queries regarding aggregated information should
be answered using the data cube, when possible
Lecture-16 - Data reduction
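Rolling the lowest-level data up to a coarser cube level can be sketched with a pandas group-by; the table and column names here are made up for illustration:

# Sketch: aggregate per-call records up to (customer, year) and then to year only.
import pandas as pd

calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "year":     [2023, 2024, 2023, 2023, 2024],
    "minutes":  [30, 45, 10, 20, 25],
})

per_customer_year = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()
per_year = calls.groupby("year", as_index=False)["minutes"].sum()   # higher aggregation level
print(per_year)   # the smallest representation that still answers a per-year query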
Dimensionality Reduction
Feature selection (attribute subset selection):

Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features

reduces the number of patterns, making them easier to understand
Heuristic methods

step-wise forward selection

step-wise backward elimination

combining forward selection and backward elimination

decision-tree induction

Lecture-16 - Data reduction


Wavelet Transforms
(Examples of wavelet families: Haar-2, Daubechies-4)

Discrete wavelet transform (DWT): linear signal


processing
Compressed approximation: store only a small fraction of
the strongest wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:

Length, L, must be an integer power of 2 (padding with 0s, when
necessary)

Each transform has 2 functions: smoothing, difference

Applies to pairs of data, resulting in two sets of data of length L/2

Applies the two functions recursively until it reaches the desired length
Lecture-16 - Data reduction
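The smoothing/difference pair described above can be sketched in a few lines of NumPy; this is a simplified one-level Haar illustration under those assumptions, not a full DWT library:

# Sketch: one Haar step = pairwise smoothing (averages) and differences.
import numpy as np

def haar_step(x):
    x = np.asarray(x, dtype=float)            # length must be even (pad with 0s if not)
    smooth = (x[0::2] + x[1::2]) / 2.0        # length L/2
    detail = (x[0::2] - x[1::2]) / 2.0        # length L/2
    return smooth, detail

def haar_transform(x, target_len=1):
    coeffs = []
    smooth = np.asarray(x, dtype=float)
    while len(smooth) > target_len:           # apply recursively to the smoothed part
        smooth, detail = haar_step(smooth)
        coeffs.append(detail)
    return smooth, coeffs                     # keep only the strongest coefficients to compress

approx, details = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(approx, [d.tolist() for d in details])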
Principal Component Analysis

Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data

The original data set is reduced to one consisting of N
data vectors on c principal components (reduced
dimensions)
Each data vector is a linear combination of the c
principal component vectors
Works for numeric data only
Used when the number of dimensions is large

Lecture-16 - Data reduction
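A minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix, keeping c principal components (the data values are made up for the example):

# Sketch: reduce k-dimensional numeric data to c principal components.
import numpy as np

def pca_reduce(X, c):
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(X_centered, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # the c strongest components
    return X_centered @ top                          # N x c reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_reduce(X, c=1))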


Principal Component Analysis

(Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane)

Lecture-16 - Data reduction


Attribute subset selection
Attribute subset selection reduces the data
set size by removing irrelevant or
redundant attributes.
The goal is to find a minimum set of attributes
Uses basic heuristic methods of attribute
selection

Lecture-16 - Data reduction


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Decision tree: the root splits on A4?; its branches split on A1? and A6?; the leaves are Class 1 / Class 2)

> Reduced attribute set: {A1, A4, A6}

Lecture-16 - Data reduction


Numerosity Reduction
Parametric methods

Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)

Log-linear models: obtain a value at a point in m-D
space as the product of values on appropriate marginal
subspaces
Non-parametric methods

Do not assume models

Major families: histograms, clustering, sampling

Lecture-16 - Data reduction


Regression and Log-Linear Models
Linear regression: Data are modeled to fit a
straight line

Often uses the least-squares method to fit the line

Multiple regression: allows a response variable


Y to be modeled as a linear function of
multidimensional feature vector

Log-linear model: approximates discrete


multidimensional probability distributions
Lecture-16 - Data reduction
Regress Analysis and Log-Linear
Models
Linear regression: Y = α + β X

The two parameters α and β specify the line and are to
be estimated using the data at hand,

by applying the least-squares criterion to the known values of
Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.

Many nonlinear functions can be transformed into the
above.
Log-linear models:

The multi-way table of joint probabilities is
approximated by a product of lower-order tables.

Probability: p(a, b, c, d) = α_ab β_ac γ_ad δ_bcd
Lecture-16 - Data reduction
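A minimal NumPy sketch of estimating α and β by the least-squares criterion (the sample data are made up):

# Sketch: least-squares estimates of the line Y = alpha + beta * X.
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} * X")   # the two stored parameters replace the raw data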
Histograms
A popular data reduction technique
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.
(Figure: histogram of prices with buckets from 10,000 to 90,000)
Lecture-16 - Data reduction
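A minimal sketch of equal-width bucketing that stores one summary value (sum and count) per bucket, assuming prices in the 0-100,000 range as in the figure:

# Sketch: reduce a list of prices to per-bucket sums over equal-width buckets.
import numpy as np

prices = np.array([12000, 15000, 31000, 33000, 47000, 52000, 68000, 71000, 88000])
edges = np.arange(0, 100001, 20000)          # buckets: 0-20k, 20-40k, ..., 80-100k
bucket_sums, _ = np.histogram(prices, bins=edges, weights=prices)
bucket_counts, _ = np.histogram(prices, bins=edges)
print(bucket_sums, bucket_counts)            # store these instead of the raw values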
Clustering

Partition the data set into clusters, and store only the cluster
representation
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.

Lecture-16 - Data reduction


Sampling
Allows a large data set to be represented
by a much smaller random sample (subset) of the data.
Let a large data set D contain N tuples.
Methods to reduce data set D:

Simple random sample without replacement
(SRSWOR)

Simple random sample with replacement
(SRSWR)

Cluster sample

Stratified sample
Lecture-16 - Data reduction
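The first two sampling methods can be sketched with the Python standard library; D is a toy stand-in data set here:

# Sketch: SRSWOR and SRSWR over a toy data set D of N tuples.
import random

D = list(range(100))     # stand-in for N = 100 tuples
n = 10                   # desired sample size

srswor = random.sample(D, n)                    # without replacement: no duplicates
srswr  = [random.choice(D) for _ in range(n)]   # with replacement: duplicates possible
print(srswor, srswr)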
Sampling

(Figure: raw data reduced by SRSWOR, simple random sampling without replacement, and by SRSWR, simple random sampling with replacement)
Lecture-16 - Data reduction
Sampling

(Figure: raw data vs. a cluster/stratified sample)

Lecture-16 - Data reduction


Lecture-17
Discretization and concept
hierarchy generation
Discretization

Three types of attributes:



Nominal — values from an unordered set

Ordinal — values from an ordered set

Continuous — real numbers
Discretization: divide the range of a continuous
attribute into intervals

Some classification algorithms only accept
categorical attributes.

Reduce data size by discretization

Prepare for further analysis

Lecture-17 - Discretization and concept hierarchy generation


Discretization and Concept hierarchy

Discretization

reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies

reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).

Lecture-17 - Discretization and concept hierarchy generation


Discretization and concept hierarchy
generation for numeric data

Binning

Histogram analysis

Clustering analysis

Entropy-based discretization

Discretization by intuitive partitioning

Lecture-17 - Discretization and concept hierarchy generation


Entropy-Based Discretization
Given a set of samples S, if S is partitioned into
two intervals S1 and S2 using boundary T, the
entropy after partitioning is

    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization.
The process is recursively applied to the partitions
obtained until some stopping criterion is met, e.g.,
when the information gain Ent(S) - E(T, S) falls below a threshold.
Experiments show that it may reduce data size
and improve classification accuracy

Lecture-17 - Discretization and concept hierarchy generation
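A minimal Python sketch of choosing the single best boundary T by minimizing E(S, T); the values and class labels are made up for the example:

# Sketch: pick the boundary T that minimizes the weighted entropy E(S, T).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate boundary between points
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if best is None or e < best[1]:
            best = (t, e)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_boundary(values, labels))   # boundary 6.5, entropy 0.0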


Discretization by intuitive partitioning

3-4-5 rule can be used to segment numeric data into


relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equal-width
intervals
* If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals

Lecture-17 - Discretization and concept hierarchy generation


Example of 3-4-5 rule
Suppose profits at different branches range from Min = -$351 to Max = $4,700.

Step 1: take Low (the 5%-tile) = -$159 and High (the 95%-tile) = $1,838 as the working range.
Step 2: the most significant digit gives msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equal-width intervals: (-$1,000 - $0], ($0 - $1,000], ($1,000 - $2,000].
Step 4: adjust the boundary intervals to cover Min and Max: since Min = -$351, the first interval shrinks to (-$400 - $0]; since Max = $4,700, a new interval ($2,000 - $5,000] is added.
Each top-level interval can then be partitioned recursively: (-$400 - $0] into 4 sub-intervals of $100, ($0 - $1,000] and ($1,000 - $2,000] into 5 sub-intervals of $200 each, and ($2,000 - $5,000] into 3 sub-intervals of $1,000.
Lecture-17 - Discretization and concept hierarchy generation
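A minimal Python sketch of the top-level 3-4-5 split, using the rounded low, high, and msd from Step 2 of the example above; the function name and interface are illustrative, not from the source:

# Sketch: split [low, high) into 3, 4, or 5 equal-width intervals per the 3-4-5 rule.
def three_four_five(low, high, msd_unit):
    distinct = round((high - low) / msd_unit)      # distinct values at the most significant digit
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                          # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000, 1000))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]  -- matches Step 3 of the example above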
Concept hierarchy generation for categorical
data
Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by
explicit data grouping
Specification of a set of attributes, but not of
their partial ordering
Specification of only a partial set of attributes

Lecture-17 - Discretization and concept hierarchy generation


Specification of a set of attributes

Concept hierarchy can be automatically


generated based on the number of distinct
values per attribute in the given attribute set.
The attribute with the most distinct values is
placed at the lowest level of the hierarchy.

country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Lecture-17 - Discretization and concept hierarchy generation
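A minimal sketch of ordering the attributes by their distinct-value counts (fewest values at the top of the hierarchy, most values at the bottom), using the counts listed above:

# Sketch: derive the hierarchy ordering from distinct-value counts per attribute.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # fewest values first = highest level
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country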
