
Module 8

Exploratory Data Analysis and Data Transformations

Copyright © 2018 McGraw Hill Education, All Rights Reserved.

Data Exploration
• What should we look for when exploring data? There is no trivial answer.
• Data summaries, correlation analysis, and data visualization through graphs and plots bring out characteristics of the data that will guide the data mining process.
• Exploration is thus an exercise in bringing out something new, unknown to the miner, which can then be exploited to improve the success of data mining.
Basic Statistical Descriptions
• Basic statistical descriptions can be used to become familiar with the data and its characteristics, such as identifying noise and outliers.
• Mean, median and mode help answer the question 'where do most of the attribute values fall?'
• A second aspect of statistical description is getting an idea of the dispersion of the data.
• The most common dispersion measures are the variance and standard deviation of the data.
• Correlation analysis is used to find redundancies in the data; a short sketch of these descriptions follows.
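A minimal sketch of these summaries, assuming pandas is available (the DataFrame and its column names are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 35, 35, 40, 120],    # 120 is a suspiciously large value
    "income": [40, 42, 55, 60, 61, 75, 80],  # in thousands
})

# Central tendency: where do most of the attribute values fall?
print(df["age"].mean(), df["age"].median(), df["age"].mode()[0])

# Dispersion: variance and standard deviation
print(df["age"].var(), df["age"].std())

# Correlation analysis: high off-diagonal values hint at redundancy
print(df.corr())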
Data Visualization
• Data visualization aims to communicate data clearly and effectively through graphical representation.
• Visualization techniques help to discover data relationships that are otherwise not clearly observable by looking at the raw data.
• For huge datasets (for example, one million points), it is common to draw a random sample from the data and use it to generate the visualization, making it more interpretable; a sketch follows.
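A minimal sketch of this sampling idea, assuming pandas and matplotlib are available (the dataset here is synthetic):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One million synthetic points stand in for a huge dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1_000_000)})
df["y"] = 2 * df["x"] + rng.normal(size=1_000_000)

# Plot a 1% random sample instead of all points.
sample = df.sample(n=10_000, random_state=0)
plt.scatter(sample["x"], sample["y"], s=2, alpha=0.3)
plt.xlabel("x")
plt.ylabel("y")
plt.show()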
Data Transformations
Data Cleansing
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleansing routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in
the data.
Missing Values
• Reasons: malfunctioning measurement equipment, changes in experimental design during data collection, human errors in data entry, and deliberate errors (e.g., respondents not willing to divulge information).
• If the number of instances with missing values is small, those instances might simply be omitted.
• Alternatively, replace each missing value with an imputed value, such as the mean of the variable across all samples; both options are sketched below.
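Both options, sketched with pandas (the column name and values are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [1.62, np.nan, 1.75, 1.80, np.nan]})

# Option 1: omit the instances with missing values.
df_dropped = df.dropna()

# Option 2: impute each missing value with the variable's mean.
df_imputed = df.fillna({"height": df["height"].mean()})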
Noisy data
• Noise is a random error or variance in a measured variable.
• Various smoothing techniques are employed by commercial data-cleansing tools; one such technique is sketched below.
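One such technique, smoothing by bin means, sketched with pandas (the values are invented; real tools offer many variants):

import pandas as pd

s = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Replace each value by the mean of its bin of three consecutive values.
smoothed = s.groupby(s.index // 3).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]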
Outliers
• Outliers are inaccurate values resulting from measurement error, data-entry errors, or the like.
• Such values often deviate significantly from the pattern that is apparent in the remaining values.
• Identifying outliers: clustering techniques and data visualization techniques; a simple statistical check is also sketched below.
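Clustering and visualization are the techniques named above; as a complementary sketch (not from the slide itself), a simple z-score rule also flags values that deviate strongly from the rest:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks like a data-entry error

z = (s - s.mean()) / s.std()
print(s[z.abs() > 2])  # flags 95; the threshold 2 is a common, tunable choice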
Derived attributes
• The art of making the data mean more.
• Defining new variables that express the information inherent in the data in ways that make the information more useful; an illustration follows.
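As a hypothetical illustration (the field names are invented), a ratio of two raw fields can carry more useful information than either field alone:

import pandas as pd

df = pd.DataFrame({"debt": [20_000, 5_000], "income": [40_000, 50_000]})
df["debt_to_income"] = df["debt"] / df["income"]  # derived attribute
print(df)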
Standard Numeric Variables
• Normalization of the data so that values lie in a fixed range, such as [0, 1]; a sketch follows.
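A sketch of min-max normalization to the range [0, 1], one common choice of fixed range:

import pandas as pd

s = pd.Series([2.0, 5.0, 9.0, 12.0])
s_scaled = (s - s.min()) / (s.max() - s.min())
print(s_scaled.tolist())  # [0.0, 0.3, 0.7, 1.0]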
Replacing Categorical Variables with Numeric Ones
• Random enumeration creates spurious information
that data mining algorithms have no way of ignoring.
• A popular approach is to create a separate binary variable for each category, as sketched below.
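A sketch of this approach (often called one-hot encoding) with pandas; the column name and categories are invented:

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})
dummies = pd.get_dummies(df["colour"], prefix="colour")
print(dummies.columns.tolist())  # ['colour_blue', 'colour_green', 'colour_red']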
Discretizing numeric attributes
• Some classification and clustering methods deal with categorical attributes only, and cannot handle attributes measured on a numeric scale.
• To use such methods on general datasets, numeric attributes must first be 'discretized' into a small number of distinct ranges; a sketch follows.
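A sketch of equal-width discretization with pandas (the bin count and labels are illustrative choices):

import pandas as pd

age = pd.Series([5, 17, 23, 41, 58, 72])
age_band = pd.cut(age, bins=3, labels=["young", "middle", "old"])
print(age_band.tolist())  # ['young', 'young', 'young', 'middle', 'old', 'old']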
Attribute Reduction
• Principal Component Analysis (PCA) is a useful technique for
reducing the number of attributes in the model by analyzing
the input variables.
• Very valuable for highly correlated data.
• It provides a few new variables that are weighted linear
combinations of the original variables.
• It does not use the target variable.
Principal Component Analysis
• The potential problems with the data are noise and
redundancy.
• The covariance matrix $\Sigma$ describes all relationships between pairs of attributes in our dataset.
• To reduce redundancy, ensure that each variable co-varies as little as possible with the other variables.
• Evidently, in an optimized matrix, all off-diagonal terms in $\Sigma$ are zero; removing redundancy thus diagonalizes $\Sigma$.
• We can visualize the data with $n$ attributes as a cloud of $N$ points in an $n$-dimensional vector space.
• The $N \times n$ data matrix $X$ is given as $X = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_N]^T$, where the $i$-th row is the $i$-th data point $\mathbf{x}_i^T$.

• When we subtract the mean from each of the data dimensions, we get the transformed data in mean-deviation form.
• For PCA, working with data whose mean is zero is more convenient.
• From now onwards, assume that the given dataset is a zero-mean dataset; $X$ still represents the data matrix for the zero-mean dataset as well.
• For this dataset, the covariance matrix is $\Sigma = \frac{1}{N-1} X^T X$.

• Consider the transformed data $Z = X W^T$, whose rows are $\mathbf{z}_i = W \mathbf{x}_i$. In terms of $Z$, the covariance is given by $\Sigma_Z = \frac{1}{N-1} Z^T Z = W \Sigma W^T$.
• Thus, the symmetric covariance matrix $\Sigma$ can be diagonalized by selecting the transformation matrix $W$ to be a matrix whose each row is an eigenvector of $\Sigma$.
• By this selection, $\Sigma_Z = W \Sigma W^T = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.
• This was the first goal for PCA.


• The other goal is reducing noise.
• We see that $\Sigma_Z = \Lambda$, so the variances in the transformed domain are given by the diagonal entries of $\Lambda$: $\mathrm{Var}(z_j) = \sigma_j^2 = \lambda_j$, for $j = 1, \ldots, n$.
• The largest eigenvalue thus corresponds to the maximum variance.
• Therefore the first principal component is the
eigenvector of the covariance matrix associated with
the largest eigenvalue.
• The other principal components are given by the eigenvectors associated with the remaining eigenvalues, in order of decreasing magnitude.
• The next step is to order the eigenvalue-eigenvector pairs by the magnitude of the eigenvalues, in descending order.
• Select the leading $k$ components that explain more than, for example, 90% of the variance.
• The final step in PCA is to derive the new dataset, i.e., the transformation of the original dataset: $Z = X W_k^T$, where the rows of $W_k$ are the leading $k$ eigenvectors. A numerical sketch follows.
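A compact numerical sketch of the whole procedure with NumPy; the data is synthetic, and the covariance uses the $\frac{1}{N-1}$ sample convention assumed above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=200)  # make one attribute redundant

X = X - X.mean(axis=0)              # mean-deviation (zero-mean) form
Sigma = X.T @ X / (X.shape[0] - 1)  # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Order eigenvalue-eigenvector pairs by descending eigenvalue magnitude.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the leading k components that explain at least 90% of the variance.
k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.90)) + 1
W = eigvecs[:, :k].T   # rows of W are the leading k eigenvectors
Z = X @ W.T            # the new, reduced dataset
print(k, Z.shape)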
