Module 8
PROPRIETARY MATERIAL © 2018 The McGraw Hill Education, Inc. All rights reserved. No part of this PowerPoint slide may be displayed, reproduced or distributed in any form or by any
means, without the prior written permission of the publisher, or used beyond the limited distribution to teachers and educators permitted by McGraw Hill for their individual course preparation.
If you are a student using this PowerPoint slide, you are using it without permission.
Data Exploration
• What should one look for when exploring data?
• There is no trivial answer.
• Data summaries, correlation analysis and data
visualization through graphs and plots bring out the
characteristics of the data that will guide the data
mining process.
• Exploration is thus an exercise to bring out something
new, unknown to the miner, which can then be
exploited to improve the success of data mining.
Basic Statistical Descriptions
• Basic statistical descriptions help in getting familiar
with the data and its characteristics, and in
identifying noise and outliers.
• Mean, median and mode (measures of central tendency)
help answer ‘where do most of the values of an
attribute fall?’
• The second aspect of statistical description is the
dispersion (spread) of the data.
• The most common measures of data dispersion are
the variance and the standard deviation.
• Correlation analysis is used to find redundancies
among the attributes of the data.
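The measures listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the helper names `describe` and `pearson` and the toy height data are assumptions made for the example. Two attributes that record the same quantity in different units show a correlation of about 1, which flags one of them as redundant.

```python
import math
import statistics


def describe(values):
    """Central tendency and dispersion for one numeric attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Population variance and standard deviation (divide by n).
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two attributes."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy data: the same heights in centimetres and in inches.
height_cm = [150, 160, 160, 170, 180]
height_in = [h / 2.54 for h in height_cm]

summary = describe(height_cm)
r = pearson(height_cm, height_in)

print(summary)
print("correlation:", r)
```

A correlation close to +1 or -1 suggests that one of the two attributes carries little extra information and is a candidate for removal before mining.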
Data Visualization
• Data visualization aims to communicate data clearly
and effectively through graphical representation.
• The largest eigenvalue, thus, corresponds to maximum
variance.
• Therefore the first principal component is the
eigenvector of the covariance matrix associated with
the largest eigenvalue.
• The other principal components are given by the eigenvectors
associated with eigenvalues of decreasing magnitude.
• The next step is to order the eigenvalue-eigenvector pairs by
the magnitude of the eigenvalues in descending order.
• Select the leading k components that explain more than,
for example, 90% of the variance.
• The final step in PCA is to derive the new dataset, a
transformation of the original dataset onto the selected
components.
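The PCA steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `pca`, the parameter `variance_threshold`, and the small two-attribute dataset are hypothetical and chosen only for illustration. It uses `np.linalg.eigh` because the covariance matrix is symmetric.

```python
import numpy as np


def pca(X, variance_threshold=0.90):
    """Project X onto the leading principal components.

    Keeps the smallest number k of components whose cumulative
    explained variance exceeds variance_threshold.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # centre each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes

    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Order eigenvalue-eigenvector pairs by descending eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Cumulative fraction of variance explained by the leading components.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, variance_threshold, side="right")) + 1
    k = min(k, len(eigvals))

    components = eigvecs[:, :k]             # columns are principal components
    return Xc @ components, components, eigvals


# Hypothetical toy data: two strongly correlated attributes.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

Z, components, eigvals = pca(X)
print("eigenvalues:", eigvals)
print("components kept:", components.shape[1])
print("transformed shape:", Z.shape)
```

Because the two toy attributes are strongly correlated, a single component passes the 90% threshold here, and the variance of the transformed data along that component equals the largest eigenvalue.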