DM Chapter 3 Data Preprocessing
Data Mining
Data Preprocessing
(Ch # 3)
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
2
Why Data Preprocessing?
u Data in the real world is dirty
w incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
w noisy: containing errors or outliers
w inconsistent: containing discrepancies in codes or names
u Data quality is a major concern in Data Mining and Knowledge
Discovery tasks.
u Why: Almost all Data Mining algorithms induce knowledge
strictly from data.
u No quality data, no quality mining results!
w Quality decisions must be based on quality data
u No quality data, inefficient mining process!
w Complete, noise-free, and consistent data means faster
algorithms
w The quality of knowledge extracted highly depends on the quality
of data 3
u Measures for Data Quality: A multidimensional
view
w Accuracy: correct or wrong, accurate or not
w Completeness: not recorded, unavailable, …
w Consistency: some modified but some not,
dangling, …
w Timeliness: timely update?
w Believability: how much the data are trusted to be
correct?
w Interpretability: how easily the data can be
understood? 4
Major Tasks in Data Preprocessing
u Data cleaning
w Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
u Data integration
w Integration of multiple databases, data cubes, or files
u Data reduction
w Obtains reduced representation in volume but
produces the same or similar analytical results
u Data transformation
w Normalization and aggregation
u Data discretization
w Part of data reduction but with particular importance,
especially for numerical data
5
Forms of data preprocessing
6
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
7
Data Cleaning
u Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission
error
w incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
w noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
w inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
w Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
8
Incomplete (Missing) Data
u Data is not always available
w E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
u Missing data may be due to
w equipment malfunction
w inconsistent with other recorded data and thus
deleted
w data not entered due to misunderstanding
w certain data may not be considered important at the
time of entry
w history or changes of the data were not registered
9
u Missing data may need to be inferred
Data Cleaning
10
Methods of Treating Missing Data
u Ignoring and discarding data:- There are two main ways to discard data
with missing values.
w Discard all records that have missing data (also called discard-case
analysis). Usually done when the class label is missing (assuming the task is
classification).
w Discard only those attributes that have a high level of missing data.
u Fill in the missing value manually: tedious + infeasible?
u Use a global constant to fill in the missing value: e.g., “unknown”, a new
class.
u Imputation using mean/median or mode:- One of the most frequently used
methods (statistical technique); a pandas sketch follows this slide.
w Use the attribute mean to fill in the missing value
w Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
w Replace missing values of numeric (continuous) attributes using the mean or
median (the median is robust against noise).
w Replace missing values of discrete attributes using the mode.
11
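A minimal pandas sketch of the mean/median/mode imputation just described. The DataFrame, its column names, and the class column are hypothetical, chosen only to illustrate the idea.

import pandas as pd

# Hypothetical data with missing values (NaN / None)
df = pd.DataFrame({
    "income": [45000.0, None, 52000.0, 61000.0, None],
    "city":   ["Lahore", "Karachi", None, "Lahore", "Karachi"],
    "class":  ["yes", "no", "yes", "no", "yes"],
})

# Numeric (continuous) attribute: fill with the median (robust against noise)
df["income"] = df["income"].fillna(df["income"].median())

# Discrete (nominal) attribute: fill with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Smarter variant: fill with the mean of samples belonging to the same class
# df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)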
Methods of Treating Missing Data
u Replace missing values using a prediction/classification
model:-
w Use the most probable value to fill in the missing value:
inference-based, such as a Bayesian formula or a decision tree
w Advantage:- it considers the relationships among the known
attribute values and the missing values, so the imputation
accuracy is high.
w Disadvantage:- if no correlation exists between some missing
attribute values and the known attribute values, the imputation
cannot be performed.
w (Alternative approach):- use a hybrid combination of a prediction/
classification model and mean/mode.
• First try to impute the missing value using the prediction/classification
model, and then fall back to the median/mode.
w We will study this topic further in Association Rule Mining.
12
Methods of Treating Missing Data
u K-Nearest Neighbor (k-NN) approach (best approach):-
w k-NN imputes the missing attribute value on the basis
of the K nearest neighbors. Neighbors are determined on the
basis of a distance measure.
w Once the K neighbors are determined, the missing value is
imputed by taking the mean/median or mode of the neighbors'
known values of the missing attribute (see the sketch after this slide).
u Smoothing by Regression
• Smooth by fitting the data to regression functions
17
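A short sketch of the k-NN imputation idea using scikit-learn's KNNImputer, one common implementation of this approach. The array values and the choice of K = 3 are illustrative assumptions.

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; np.nan marks missing attribute values
X = np.array([
    [25.0, 50000.0],
    [27.0, np.nan],
    [30.0, 61000.0],
    [np.nan, 58000.0],
    [45.0, 90000.0],
])

# Each missing value is replaced by the mean of that attribute
# over the K nearest neighbors (distance computed over known entries)
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X)
print(X_imputed)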
Smoothing by Binning Method
u Equal-width (distance) partitioning:
w It divides the range into k intervals of equal size (width): a uniform grid
w If A and B are the lowest and highest values of the attribute, the width of
the intervals will be W = (B − A)/k, where k is the number of bins.
w The most straightforward approach
w But outliers may dominate the presentation
w Skewed data is not handled well.
w Where does k come from?
u Equal-depth / equal-height (frequency) partitioning:
w It divides the range into M intervals, each containing approximately the
same number of samples
w Good data scaling
18
u Equal width is easier to implement, but equal depth (frequency) gives
better results.
19
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-width) bins with boundaries at A+W, A+2W, …, A+(k−1)W
(here A = 4, B = 34, k = 3, so W = 10 and the boundaries are 14 and 24):
- Bin 1: 4, 8, 9
- Bin 2: 15, 21, 21, 24
- Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
20
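A small Python sketch of equal-width binning and smoothing by bin means on the price data above. The three-bin choice mirrors the example; the code is a plain reimplementation for illustration, not any particular library's API.

import math

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
k = 3
A, B = min(prices), max(prices)
W = (B - A) / k                      # (34 - 4) / 3 = 10

# Equal-width bins with right-closed intervals: [4,14], (14,24], (24,34]
bins = [[] for _ in range(k)]
for v in prices:
    idx = 0 if v == A else min(k - 1, math.ceil((v - A) / W) - 1)
    bins[idx].append(v)

# Smoothing by bin means: replace every value with its bin's mean
smoothed = [[round(sum(b) / len(b)) for _ in b] for b in bins]
print(bins)      # [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
print(smoothed)  # [[7, 7, 7], [20, 20, 20, 20], [28, 28, 28, 28, 28]]

# Equal-depth (frequency) bins could instead be obtained with pandas.qcut(prices, k).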
Regression Method for Smoothing the Data
u Regression is a technique that conforms data values to a function. Linear
regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
[Figure: scatter of (x, y) data points with the fitted line y = x + 1; the observed
value Y1 at X1 is replaced by the predicted value Y1′ on the line.]
21
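A brief numpy sketch of smoothing by linear regression in the spirit of the figure. The x/y values are made up for illustration, and numpy.polyfit is just one way to fit the line.

import numpy as np

# Hypothetical noisy observations of two attributes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])

# Fit the "best" straight line y = a*x + b (least squares)
a, b = np.polyfit(x, y, deg=1)

# Smooth: replace each observed y with the value predicted by the line
y_smoothed = a * x + b
print(a, b)          # roughly a ≈ 1, b ≈ 1, i.e. y ≈ x + 1
print(y_smoothed)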
Detecting Outliers (Clustering)
u Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.
22
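A hedged sketch of the clustering idea above, here using density-based clustering (DBSCAN) so that values falling outside every dense group are flagged as outliers. The data, eps, and min_samples are illustrative assumptions, not part of the slide.

import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of values, plus one obvious outlier (95)
values = np.array([4, 5, 6, 7, 25, 26, 27, 28, 95], dtype=float).reshape(-1, 1)

# Points that do not fall inside any dense cluster get the label -1 ("noise")
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(values)
outliers = values.ravel()[labels == -1]
print(outliers)    # expected: [95.]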
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
23
Data Integration
u Data integration:
w combines data from multiple sources into a
coherent store
24
Data Integration (Problem 1)
u Attribute naming (in schema integration)
Problem: Entity identification problem: identify real-world entities from
multiple data sources. Attributes are named differently across different
data sources, e.g., A.cust-id vs. B.cust-# (integrate metadata from
different sources).
[Figure: source tables whose customer-key attribute is named CustomerID,
CustID, and ClientID are merged into a single CustomerID attribute by an
Extraction, Transformation, and Loading (ETL) tool.]
[Figure: value conflicts across sources: Gender coded as Male/Female in one
source and M/F in another; Weight recorded in kilograms (6, 10) in one source
and in pounds (6, 10) in another, integrated as kilograms (6, 10, 2.72, 4.54).]
27
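A hedged pandas sketch of the two integration problems in the figure: reconciling attribute names (CustomerID vs. CustID vs. ClientID) and reconciling value conventions (M/F vs. Male/Female, pounds vs. kilograms). The table contents are hypothetical; a real ETL tool would do this at scale.

import pandas as pd

# Two hypothetical sources with different schemas and conventions
src_a = pd.DataFrame({"CustID": [1, 2], "Gender": ["M", "F"], "Weight_lb": [6.0, 10.0]})
src_b = pd.DataFrame({"ClientID": [3, 4], "Gender": ["Male", "Female"], "Weight_kg": [6.0, 10.0]})

# Entity identification: map differing attribute names onto one schema
a = src_a.rename(columns={"CustID": "CustomerID"})
b = src_b.rename(columns={"ClientID": "CustomerID"})

# Resolve value conflicts: one gender coding, one unit (kilograms)
a["Gender"] = a["Gender"].map({"M": "Male", "F": "Female"})
a["Weight_kg"] = a["Weight_lb"] * 0.4536      # pounds -> kilograms (6 lb ≈ 2.72 kg)
a = a.drop(columns=["Weight_lb"])

integrated = pd.concat([a, b], ignore_index=True)
print(integrated)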
Handling Redundant Data in Data Integration
u Redundant data occur often when integrating
multiple databases
w The same attribute may have different names in
different databases
w One attribute may be a “derived” attribute in another
table, e.g., annual revenue
u Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
u Redundancy can be checked using correlation analysis.
28
Correlation Analysis for Detecting Redundancy
u For numeric attributes we can use correlation and covariance
u Correlation between two numeric attributes A and B can be checked with the
correlation coefficient:

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}
        = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}

1. If the resulting value > 0, then A and B are positively correlated: if A
increases, B will also increase. If the value of r is close to 1, either A or B
can be removed as redundant.
2. If the resulting value = 0, then A and B are independent (no correlation).
3. If the resulting value < 0, then A and B are negatively correlated: if the
value of A increases, the value of B will decrease.
u Covariance between two numeric attributes A and B is

Cov(A,B) = \sigma_{AB} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}
         = \frac{\sum_{i=1}^{n} a_i b_i}{n} - \bar{A}\,\bar{B}

u Under some assumptions, a covariance of 0 implies independence
29
Example: Correlation and Covariance
u Example: for the following data, find the correlation and covariance values
30
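Since the example's data table did not survive extraction, here is a hedged numpy sketch with made-up values for two attributes A and B, showing how the r_{A,B} and Cov(A,B) formulas above can be computed.

import numpy as np

# Hypothetical paired observations of attributes A and B
A = np.array([2.0, 3.0, 5.0, 6.0, 9.0])
B = np.array([30.0, 36.0, 54.0, 63.0, 90.0])
n = len(A)

# Covariance (dividing by n, as in the slide)
cov_AB = np.sum((A - A.mean()) * (B - B.mean())) / n      # = mean(A*B) - mean(A)*mean(B)

# Correlation coefficient r_{A,B} = Cov(A,B) / (sigma_A * sigma_B)
r_AB = cov_AB / (A.std() * B.std())                        # np.std divides by n by default
print(cov_AB, r_AB)

# Cross-check with numpy's built-in Pearson correlation
print(np.corrcoef(A, B)[0, 1])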
χ² Correlation Test for Nominal Data
u For nominal data, correlation between two attributes can be checked with the
chi-square (χ²) statistic over their r × c contingency table of observed counts
o_{ij} and expected counts e_{ij}:

\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
32
u Example: For the following data, find whether the two attributes are
independent or not. Suppose that we have 1500 data points.
35
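The contingency table for this example is not shown in the extracted text, so the sketch below uses a hypothetical 2×2 table of 1500 counts (the numbers are invented) just to show how the χ² statistic above can be computed, here with scipy.stats.chi2_contingency.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts o_ij
# (rows: values of attribute 1, columns: values of attribute 2), 1500 points total
observed = np.array([
    [300, 150],
    [200, 850],
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
print(expected)   # e_ij = (row total * column total) / grand total

# If p_value is below the chosen significance level (e.g. 0.05),
# reject independence: the two attributes are correlated.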
Data Reduction
wReduced representation of the data set
that is much smaller in volume, yet
closely maintains the integrity of the
original data. Different Strategies are:
• Dimensionality Reduction
• Numerosity reduction
• Data Compression
36
Dimensionality Reduction (DR)
wProcess of reducing the number of
random attributes under consideration.
wTwo very common methods are Wavelet
Transforms and Principal Components
Analysis
wThey transform or project the data onto a
smaller space
37
Numerosity Reduction
wReplace the original data volume by
alternative smaller forms of data
representation.
wRegression and log-linear models
(parametric methods)
wHistograms, clustering, sampling and
data cube aggregation (nonparametric
methods)
38
Data Compression
w Transformations are applied to obtain a
reduced or compressed representation
w Lossless: The original data can be
reconstructed from the compressed data
without any information loss.
w Lossy: Only an approximation of the original
data can be reconstructed.
39
DR: Principal Components Analysis (PCA)
u Why PCA?
u PCA is a useful statistical technique that has
found applications in:
w Face recognition
w Image Compression
w Reducing dimension of data
40
PCA Goal:
Removing Dimensional Redundancy
u The major goal of PCA in Data Mining is to remove
the “dimensional redundancy” from data.
u What does that mean?
w A typical dataset contains several dimensions
(variables) that may or may not correlate.
w Dimensions that correlate vary together.
w The information represented by a set of dimensions
with high correlation can be extracted by studying just
one dimension that represents the whole set.
w Hence the goal is to reduce the dimensions of a dataset
to a smaller set of representative dimensions that do
not correlate. 41
PCA Goal:
Removing Dimensional Redundancy
[Figure: a dataset with 12 dimensions, Dim 1 through Dim 12.]
Analyzing 12-dimensional data is challenging!
PCA Goal:
Removing Dimensional Redundancy
[Figure: the same 12 dimensions, Dim 1 through Dim 12.]
But some dimensions represent redundant information. Can we “reduce” these?
PCA Goal:
Removing Dimensional Redundancy
[Figure: the 12 dimensions, Dim 1 through Dim 12, feeding into a “PCA black box”.]
Let’s assume we have a “PCA black box” that can reduce the correlating
dimensions. Pass the 12-dimensional data set through the black box to get a
three-dimensional data set.
PCA Goal:
Removing Dimensional Redundancy
Given an appropriate reduction, analyzing the reduced dataset is much more
efficient than analyzing the original “redundant” data.
[Figure: the 12 dimensions (Dim 1–Dim 12) pass through the PCA black box and
come out as three dimensions (Dim A, Dim B, Dim C).]
Pass the 12-dimensional data set through the black box to get a
three-dimensional data set.
PCA Goal: Change of Basis
u Assume X is the 6-dimensional data set given as
input
Dimensions (columns) of X:

X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & x_{16} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & x_{26} \\
\vdots &        &        &        &        & \vdots \\
x_{51} & x_{52} & x_{53} & x_{54} & x_{55} & x_{56}
\end{bmatrix}

u PCA looks for a change of basis: an m × m matrix P that maps each data
vector x to new coordinates y:

\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1m} \\
p_{21} & p_{22} & \cdots & p_{2m} \\
\vdots &        &        & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mm}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}

u The covariance of two dimensions X and Y over n samples is

cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
Covariance Interpretation
u We have data set for students study hour (H) and
marks achieved (M)
u We find cov(H,M)
u Exact value of covariance is not as important as the
sign (i.e. positive or negative)
u +ve , both dimensions increase together
u -ve , as one dimension increases other decreases
u Zero, there exists no relationship
Covariance Matrix
u Covariance is always measured
between 2 – dim.
u What if we have a data set with more
than 2-dim?
u We have to calculate more than one
covariance measurement.
u E.g., from a 3-dim data set (dimensions
x, y, z) we could calculate cov(x,y),
cov(x,z), and cov(y,z)
Covariance Matrix
u Can use covariance matrix to find
covariance of all the possible pairs
u Since cov(a,b)=cov(b,a)
The matrix is symmetrical about the
main diagonal
u Step 2:
w Mean normalization and feature scaling
w Subtract the mean from each data point
Step 1 & Step 2
w The covariance matrix of the example data:

cov = \begin{bmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{bmatrix}
Step 4: Calculate the eigenvectors and eigenvalues of the
covariance matrix
u |C − λI| = 0 is used to find the eigenvalues λ of the covariance matrix C
u Then (C − λI)x = 0 (i.e., Bx = 0 with B = C − λI) is solved for the eigenvectors
u Since the covariance matrix is square, we can calculate the
eigenvectors and eigenvalues of the matrix

eigenvalues = \begin{bmatrix} 0.04908 \\ 1.28403 \end{bmatrix}

eigenvectors = \begin{bmatrix} -0.735 & -0.678 \\ 0.678 & -0.735 \end{bmatrix}

(each column is a unit eigenvector; the first corresponds to λ = 0.04908, the
second to λ = 1.28403)
What does this all mean?
[Figure: the data points plotted together with the two eigenvectors of the
covariance matrix overlaid.]
Conclusion
u Eigenvector give us information about
the pattern.
u By looking at graph in previous slide.
See how one of the eigenvectors go
through the middle of the points.
u Second eigenvector tells about another
weak pattern in data.
u So by finding eigenvectors of
covariance matrix we are able to extract
lines that characterize the data.
Step 5: Choosing components and forming a
feature vector.
u The eigenvector with the highest eigenvalue is the
principal component of the data set.
u In our example, the eigenvector with the
largest eigenvalue was the one that
pointed down the middle of the data.
u So, once the eigenvectors are found, the
next step is to order them by
eigenvalue, highest to lowest.
u This gives the components in order of
significance.
Cont’d
u We have n – dimension
u So we will find n eigenvectors
u But if we choose only the first p eigenvectors,
then the final dataset has only p
dimensions.
Cont’d
u To obtain the final dataset, we multiply
the transpose of the chosen eigenvector
(feature-vector) matrix with the transpose
of the original data matrix.
u Final dataset will have data items in
columns and dimensions along rows.
u So we have original data set
represented in terms of the vectors we
chose.
Original data set represented using two
eigenvectors.
Original data set represented using one
eigenvector.
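A compact numpy sketch of the PCA steps walked through above (mean-adjust, covariance matrix, eigen-decomposition, keep the top p eigenvectors, project). The 2-D data here is made up, so the numbers will differ from the slides' covariance matrix and eigenvalues; the projection is written with data points as rows, which is equivalent to the transposed product described in the slides.

import numpy as np

# Hypothetical 2-D data set: rows are data points, columns are dimensions
X = np.array([[2.5, 3.0], [0.5, 1.1], [2.2, 2.7], [1.9, 2.0],
              [3.1, 3.6], [2.3, 2.6], [2.0, 2.4], [1.0, 1.2]])

# Steps 1-2: mean normalization (subtract the mean of each dimension)
X_adj = X - X.mean(axis=0)

# Step 3: covariance matrix of the mean-adjusted data
C = np.cov(X_adj, rowvar=False)          # divides by n-1, as in the slides

# Step 4: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric

# Step 5: order eigenvectors by eigenvalue (highest first) and keep the top p
order = np.argsort(eigvals)[::-1]
p = 1
feature_vector = eigvecs[:, order[:p]]   # columns = chosen principal components

# Final data: project the mean-adjusted data onto the chosen eigenvectors
final_data = X_adj @ feature_vector
print(eigvals[order])
print(final_data)                        # 8 points, now with only p = 1 dimension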
Data Preprocessing
u Why preprocess the data?
u Data cleaning
u Data integration
u Data reduction
u Data Transformation and Discretization
u Summary
67
Data Transformation
u The data are transformed or consolidated into forms
appropriate for mining. Different strategies
include:
w Smoothing – binning, regression, and clustering
w Attribute construction – new attributes are constructed from
the given set of attributes.
w Aggregation – Summary or aggregation operations are applied
e.g. construction of data cube
w Normalization – data are scaled so as to fall within a smaller
range, e.g., -1.0 to 1.0 or 0.0 to 1.0
w Discretization – values of a numeric attribute are replaced by
interval labels or conceptual labels (a concept hierarchy for the
numeric attribute)
w Concept hierarchy generation for nominal data – nominal
attribute values are generalized to higher-level concepts, e.g.,
street is generalized to block, city, or country.
68
Data Normalization
u A database can contain n numbers of continuous type
attributes.
u A continuous attribute with a larger range, or noise, can
dominate the distance between objects.
w (Remember: All continuous type attributes similarity is checked
using a single Euclidean distance formula).
u For example: ‘Income’ attribute can dominate the distance
as compared to ‘Weight’ and ‘Age’ attributes.
u The objective of normalization is to convert all numeric
attributes so that their values fall within a small specified
range, such as 0.0 to 1.0.
u Normalization is particularly useful for clustering and
distance measure algorithms such as k-nearest-neighbor.
Data Normalization
u Min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

u Z-score normalization:

v' = \frac{v - mean_A}{stand\_dev_A}

u Decimal scaling normalization:

v' = \frac{v}{10^{j}}, where j is the smallest integer such that \max(|v'|) < 1
Building mineable data sets
Data Transformation: Normalization
u Min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

u Z-score normalization:

v' = \frac{v - mean_A}{stand\_dev_A}
73
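A small numpy sketch of the normalization formulas above applied to a hypothetical attribute; the values and the [0.0, 1.0] target range are illustrative.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization into [new_min, new_max] = [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j

print(v_minmax)
print(v_zscore)
print(v_decimal)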
Concept Hierarchy Generation
u Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
u Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
u Concept hierarchy formation: recursively reduce the data by collecting
and replacing low-level concepts (such as numeric values for age) by
higher-level concepts (such as youth, adult, or senior)
u Concept hierarchies can be explicitly specified by domain experts and/
or data warehouse designers
u Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
74
Concept Hierarchy Generation
for Nominal Data
u Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
w street < city < state < country
u Specification of a hierarchy for a set of values by explicit
data grouping
w {Urbana, Champaign, Chicago} < Illinois
u Specification of only a partial set of attributes
w E.g., only street < city, not others
u Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
w E.g., for a set of attributes: {street, city, state, country }
75
Automatic Concept Hierarchy Generation
u Some hierarchies can be automatically generated
based on the analysis of the number of distinct values
per attribute in the data set
w The attribute with the most distinct values is placed
at the lowest level of the hierarchy
w Exceptions, e.g., weekday, month, quarter, year
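A tiny pandas sketch of the automatic-generation heuristic just described: count distinct values per attribute and order the attributes from most distinct (lowest hierarchy level) to fewest (highest level). The table and its values are hypothetical.

import pandas as pd

# Hypothetical location data
df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA"],
    "state":   ["Ontario", "Ontario", "Texas", "Texas", "Ohio"],
    "city":    ["Toronto", "Montreal", "Austin", "Austin", "Columbus"],
    "street":  ["1 Main St", "2 Bay St", "3 Elm St", "4 Oak St", "5 Pine St"],
})

# The attribute with the most distinct values goes to the lowest level of the hierarchy
distinct_counts = df.nunique().sort_values(ascending=False)
hierarchy = list(distinct_counts.index)     # lowest level first
print(distinct_counts)
print(" < ".join(hierarchy))                # street < city < state < country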