
Principal Component Analysis and Cluster Analysis

Principal Components Analysis

● Definition: PCA is a dimensionality reduction technique that transforms a
dataset with potentially correlated variables into a set of linearly
uncorrelated variables called principal components.
● Objective: The primary goal of PCA is to simplify the covariance structure
of the dataset by finding new axes (principal components) that capture
the maximum variance in the data.

Key Concepts in PCA

1. Covariance Structure:
○ Covariance measures the extent to which two variables change
together. In PCA, we are interested in understanding the
relationships between variables in terms of their covariance.
○ PCA seeks to transform the data into new coordinates (principal
components) where these covariances are zero, meaning the
components are uncorrelated.

2. Principal Directions of Variance:


○ Principal Directions are the directions in which the data shows the
most variance (or spread).
○ Principal Components are vectors that represent these directions.
The first principal component represents the direction with the
highest variance, while the second is the direction orthogonal to it
with the next highest variance, and so on.

3. Orthogonal Transformation:
○ PCA applies an orthogonal transformation to the dataset to achieve
decorrelation. Each new axis is orthogonal (at a right angle) to the
others, which eliminates redundancy in the data.

Understanding PCA with an Example

Consider a dataset with two variables, represented by the X and Y axes in a
2D plane.

● Principal Directions in Data:
○ Suppose the data distribution is primarily along an axis called the
U-axis, which represents the direction of maximum variance. The V-axis is
orthogonal to the U-axis and represents the secondary direction of
variance.
○ By re-orienting the coordinate system to align with the U and V
axes, the dataset gains a more compact representation centered
around its mean.

● Transformation to the U-V System:
○ Each data point, initially represented as (X,Y), can be transformed
into the (U,V) coordinate system.
○ In this transformed system, the covariance between U and V is zero,
meaning the dataset is decorrelated, providing a simplified and clear
view of the underlying data structure.
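
A minimal NumPy sketch of this transformation (the synthetic dataset and all
variable names here are illustrative assumptions, not taken from the document):
the eigenvectors of the covariance matrix supply the U and V directions, and the
covariance of the transformed points comes out diagonal.

import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data in the original (X, Y) system.
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# Center the data and compute its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors of the covariance matrix give the U and V directions.
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: cov is symmetric
order = np.argsort(eigvals)[::-1]           # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Transform (X, Y) points into the (U, V) system.
uv = centered @ eigvecs

# The covariance in the new system is (numerically) diagonal:
# U and V are uncorrelated.
print(np.round(np.cov(uv, rowvar=False), 4))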

Dimensionality Reduction Using PCA

1. Reducing Complexity in High-Dimensional Data:


○ When working with multidimensional datasets, some dimensions
might add minimal new information (often due to noise or
redundant relationships). PCA helps to identify and eliminate such
dimensions, thus reducing the dataset’s dimensionality without
losing essential information.

2. Illustration with Linearly Related Variables:


○ If two variables in the dataset are linearly related, PCA will identify
one main direction of variance (e.g., the U-axis).
○ In such cases, all values along the secondary direction (V-axis) may
be close to zero, primarily due to noise or insignificant variance.
○ By discarding the V-axis, we can represent the data solely by the U
variable, reducing dimensionality from two to one, while retaining
most of the dataset’s variance (see the sketch after this list).

3. Hyper-Ellipse and Class Boundaries:


○ When data follows a normal distribution, PCA can represent the
data spread with a hyper-ellipse (or an ellipse in 2D).
○ This hyper-ellipse, enclosing most data points, acts as a boundary
within which data points are likely to fall, helping identify classes
and outliers.
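
The two-to-one reduction described in point 2 can be sketched with
scikit-learn's PCA class (a library choice assumed here for illustration; the
nearly linear synthetic data is likewise made up):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)   # two almost linearly related variables
data = np.column_stack([x, y])

pca = PCA(n_components=1)        # keep only the main direction of variance (the U-axis)
u = pca.fit_transform(data)      # each point is now a single U value

# Nearly all of the variance survives the 2 -> 1 reduction.
print(pca.explained_variance_ratio_)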

Computing the Principal Components

● In computational terms, the principal components are found by calculating
the eigenvectors and eigenvalues of the data covariance matrix. This
process is equivalent to finding the axis system in which the covariance
matrix is diagonal.
● The eigenvector with the largest eigenvalue is the direction of greatest
variation, the one with the second largest eigenvalue is the (orthogonal)
direction with the next highest variation and so on.
● Let A be an n × n matrix. The eigenvalues of A are defined as the roots of:
det(A − λI) = 0
○ where I is the n × n identity matrix. This equation is called the
characteristic equation (or characteristic polynomial) and has n roots.
● Let λ be an eigenvalue of A. Then there exists a vector x such that:
Ax = λx

● The vector x is called an eigenvector of A associated with the eigenvalue
λ. It is a direction vector only and can be scaled to any magnitude.
● To find a numerical solution for x we need to set one of its elements to an
arbitrary value, say 1, which gives us a set of simultaneous equations to
solve for the other elements.
● If there is no solution we repeat the process with another element.
Ordinarily we normalise the final values so that x has length one, that is
xTx = 1. Suppose we have a 3 × 3 matrix A with eigenvectors x1, x2, x3, and
eigenvalues λ1, λ2, λ3 so that:
Ax1 = λ1x1, Ax2 = λ2x2, Ax3 = λ3x3

● Putting the eigenvectors as the columns of a matrix gives:
Φ = [x1 x2 x3]

● and writing the eigenvalues as a diagonal matrix:
Λ = diag(λ1, λ2, λ3)

● gives us the matrix equation: AΦ = ΦΛ

● We normalize the eigenvectors to unit magnitude, and they are
orthogonal, so: ΦΦT = ΦTΦ = I

● which means that: ΦTAΦ = Λ


● And: A = ΦΛΦT

● Now let us consider how this applies to the covariance matrix in the PCA
process. Let Σ be an n×n covariance matrix. There is an orthogonal n × n
matrix Φ whose columns are eigenvectors of Σ and a diagonal matrix Λ
whose diagonal elements are the eigenvalues of Σ, such that: ΦT ΣΦ = Λ

● We can look at the matrix of eigenvectors Φ as a linear transformation
which, in the example of figure 1, transforms data points in the [X, Y] axis
system into the [U, V] axis system.
● In the general case the linear transformation given by Φ transforms the
data points into a data set where the variables are uncorrelated. The
covariance matrix of the data in the new coordinate system is Λ, which has
zeros in all the off-diagonal elements.
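
These relations can be checked numerically. The sketch below (with an
arbitrary, made-up three-variable dataset) builds the covariance matrix Σ,
takes its eigendecomposition, and verifies that ΦTΣΦ = Λ, that Σ = ΦΛΦT, and
that Φ is orthogonal:

import numpy as np

rng = np.random.default_rng(1)

# Three correlated variables (illustrative data only).
data = rng.normal(size=(200, 3))
data[:, 1] += 0.7 * data[:, 0]
data[:, 2] += 0.4 * data[:, 0]

sigma = np.cov(data, rowvar=False)      # covariance matrix Σ (3 x 3)
eigvals, phi = np.linalg.eigh(sigma)    # columns of phi are eigenvectors of Σ
lam = np.diag(eigvals)                  # Λ: eigenvalues on the diagonal

print(np.allclose(phi.T @ sigma @ phi, lam))    # ΦTΣΦ = Λ  -> True
print(np.allclose(phi @ lam @ phi.T, sigma))    # Σ = ΦΛΦT  -> True
print(np.allclose(phi @ phi.T, np.eye(3)))      # ΦΦT = I   -> True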

Benefits of Using PCA

● Reduction of Noise and Overfitting: By removing components that
mainly represent noise, PCA reduces the risk of overfitting in machine
learning applications.
● Simplification of Data Interpretation: PCA provides a more compact
representation of data, making complex relationships easier to visualize
and analyze.
● Efficient Computation in High-Dimensional Data: For high-dimensional
datasets, PCA reduces computational costs by focusing only on significant
components.

Limitations of PCA

● Linear Assumption: PCA assumes that data variation is linear, which may
not always hold, especially in complex datasets.
● Sensitivity to Scaling: PCA is sensitive to the scale of the data. It’s
essential to standardize data to avoid misleading principal component
results.
● Loss of Interpretability: Reduced dimensions might lead to a loss of
direct interpretability, as principal components are combinations of
original variables.

Applications of PCA

PCA is widely used across various fields due to its versatility in simplifying
datasets:

● Image Compression: PCA helps in reducing image data by focusing on
key visual components.
● Genomics and Bioinformatics: PCA helps in gene expression analysis by
identifying significant patterns in genetic data.
● Finance: PCA is used in stock market analysis to reduce the complexity of
data and find key patterns.
● Data Visualization: PCA aids in visualizing high-dimensional data in two
or three dimensions, facilitating easier pattern recognition.

Cluster Analysis

The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering. A cluster is a collection of data objects that
are similar to one another within the same cluster and are dissimilar to the
objects in other clusters. A cluster of data objects can be treated collectively as
one group and so may be considered as a form of data compression. Cluster
analysis tools based on k-means, k-medoids, and several other methods have also
been built into many statistical analysis software packages or systems, such as
S-Plus, SPSS, and SAS.

Applications:

● Cluster analysis has been widely used in numerous applications, including
market research, pattern recognition, data analysis, and image processing.
● In business, clustering can help marketers discover distinct groups in their
customer bases and characterize customer groups based on purchasing
patterns.
● Clustering helps in the identification of areas of similar land use in an
earth observation database and in the identification of groups of houses
in a city according to house type, value, and geographic location, as well
as the identification of groups of automobile insurance policy holders with
a high average claim cost.
● Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their
similarity.

Typical Requirements Of Clustering In Data Mining

● Scalability: Many clustering algorithms work well on small data sets
containing fewer than several hundred data objects; however, a large
database may contain millions of objects. Clustering on a sample of a
given large data set may lead to biased results.
● Ability to deal with different types of attributes: Many algorithms are
designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary,
categorical (nominal), and ordinal data, or mixtures of these data types.
● Discovery of clusters with arbitrary shape: Many clustering algorithms
determine clusters based on Euclidean or Manhattan distance measures.
Algorithms based on such distance measures tend to find spherical
clusters with similar size and density. However, a cluster could be of any
shape. It is important to develop algorithms that can detect clusters of
arbitrary shape.
● Minimal requirements for domain knowledge to determine input
parameters: Many clustering algorithms require users to input certain
parameters in cluster analysis (such as the number of desired clusters).
The clustering results can be quite sensitive to input parameters.
Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it
also makes the quality of clustering difficult to control.
● Ability to deal with noisy data: Most real-world databases contain
outliers or missing, unknown, or erroneous data. Some clustering
algorithms are sensitive to such data and may lead to clusters of poor
quality.
● Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e.,
database updates) into existing clustering structures and, instead, must
determine a new clustering from scratch. Some clustering algorithms are
sensitive to the order of input data. That is, given a set of data objects,
such an algorithm may return dramatically different clusterings depending
on the order in which the input objects are presented.

Major Clustering Methods:

A. Partitioning Methods

● A partitioning method constructs k partitions of the data, where each
partition represents a cluster and k ≤ n. That is, it classifies the data into
k groups, which together satisfy the following requirements: each group
must contain at least one object, and each object must belong to exactly
one group.
● A partitioning method creates an initial partitioning. It then uses an
iterative relocation technique that attempts to improve the partitioning by
moving objects from one group to another.
● The general criterion of a good partitioning is that objects in the same
cluster are close or related to each other, whereas objects of different
clusters are far apart or very different.

B. Hierarchical Methods
● A hierarchical method creates a hierarchical decomposition of the given
set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical decomposition is
formed.
○ The Agglomerative approach, also called the bottom-up approach,
starts with each object forming a separate group. It successively
merges the objects or groups that are close to one another, until all
of the groups are merged into one or until a termination condition
holds.
○ The divisive approach, also called the top-down approach, starts
with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters, until eventually
each object forms its own cluster, or until a termination condition holds.
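
A short sketch of the agglomerative (bottom-up) approach described above,
using SciPy's hierarchical clustering routines; the toy data and the choice of
Ward linkage are assumptions made for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)

# Two well-separated groups of 2D points (illustrative data).
data = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                  rng.normal(5, 0.5, size=(20, 2))])

# Agglomerative clustering: each point starts as its own group and the
# closest groups are merged step by step (Ward linkage here).
merges = linkage(data, method="ward")

# Cut the resulting hierarchy so that two clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)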

C. Density-Based Methods

● Most partitioning methods cluster objects based on the distance between
objects. Such methods can find only spherical-shaped clusters and
encounter difficulty in discovering clusters of arbitrary shapes.
● Other clustering methods have been developed based on the notion of
density. Their general idea is to continue growing the given cluster as
long as the density in the neighborhood exceeds some threshold; that is,
for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points. Such a method
can be used to filter out noise (outliers) and discover clusters of arbitrary
shape.
● DBSCAN and its extension, OPTICS, are typical density-based methods
that grow clusters according to a density-based connectivity analysis.
DENCLUE is a method that clusters objects based on the analysis of the
value distributions of density functions.
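
A minimal DBSCAN sketch with scikit-learn; the eps and min_samples values,
like the generated data, are illustrative and would need tuning on real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)

# A ring-shaped cluster, a compact blob, and a few scattered noise points.
angles = rng.uniform(0, 2 * np.pi, size=200)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
ring += rng.normal(scale=0.05, size=ring.shape)
blob = rng.normal(loc=[3.0, 0.0], scale=0.1, size=(100, 2))
noise = rng.uniform(-2, 5, size=(10, 2))
data = np.vstack([ring, blob, noise])

# Grow clusters wherever the eps-neighbourhood contains at least
# min_samples points; sparse points are left unassigned.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)

# Label -1 marks points treated as noise (outliers); the ring is found
# as one arbitrary-shaped cluster even though it is not spherical.
print(sorted(set(labels)))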

D. Grid-Based Methods
● Grid-based methods quantize the object space into a finite number of
cells that form a grid structure.
● All of the clustering operations are performed on the grid structure i.e., on
the quantized space. The main advantage of this approach is its fast
processing time, which is typically independent of the number of data
objects and dependent only on the number of cells in each dimension in
the quantized space.
● STING is a typical example of a grid-based method. WaveCluster applies
wavelet transformation for clustering analysis and is both grid-based and
density-based.

E. Model-Based Methods

● Model-based methods hypothesize a model for each of the clusters and
find the best fit of the data to the given model.
● A model-based algorithm may locate clusters by constructing a density
function that reflects the spatial distribution of the data points.
● It also leads to a way of automatically determining the number of clusters
based on standard statistics, taking "noise" or outliers into account and
thus yielding robust clustering methods.
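
One common instance of the model-based family is a Gaussian mixture model.
The sketch below (using scikit-learn's GaussianMixture on made-up data, a
choice assumed here for illustration) also shows the point about choosing the
number of clusters from a standard statistic, here the BIC score:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Data drawn from three Gaussian "clusters" (illustrative only).
data = np.vstack([rng.normal([0, 0], 0.4, size=(100, 2)),
                  rng.normal([4, 0], 0.4, size=(100, 2)),
                  rng.normal([2, 3], 0.4, size=(100, 2))])

# Fit mixtures with different numbers of components and keep the best BIC.
best_k, best_bic = None, np.inf
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bic = gmm.bic(data)
    if bic < best_bic:
        best_k, best_bic = k, bic

print("chosen number of clusters:", best_k)   # typically 3 for this data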

Classical Partitioning Methods

1. Centroid-Based Technique: The K-Means Method:

The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low. Cluster similarity is measured in regard to the
mean value of the objects in a cluster, which can be viewed as the cluster’s
centroid or center of gravity.

The k-means algorithm proceeds as follows:


● First, it randomly selects k of the objects, each of which initially
represents a cluster mean or center.
● For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the distance between the object and
the cluster mean.
● It then computes the new mean for each cluster.
● This process iterates until the criterion function converges.
● Typically, the square-error criterion is used, defined as:
E = Σ(i=1 to k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of the square error for all objects in the data set, p is
the point in space representing a given object, and mi is the mean of
cluster Ci.
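
A compact NumPy sketch of exactly these steps (random initial means,
assignment by distance, mean update, stop when the square-error E no longer
decreases); it is an illustrative implementation, not taken from any
particular package:

import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly select k of the objects as the initial cluster means.
    means = data[rng.choice(len(data), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Square-error criterion E = sum over clusters of |p - mi|^2.
        error = np.sum((data - means[labels]) ** 2)
        if error >= prev_error:     # criterion has converged
            break
        prev_error = error
        # Recompute each cluster mean (keep the old mean if a cluster is empty).
        means = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                          else means[i] for i in range(k)])
    return labels, means

# Usage on two obvious groups of points.
rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                  rng.normal(3, 0.3, size=(50, 2))])
labels, means = k_means(data, k=2)
print(means)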

2. The k-Medoids Method

● The k-means algorithm is sensitive to outliers because an object with an
extremely large value may substantially distort the distribution of data.
This effect is particularly exacerbated due to the use of the square-error
function.
● Instead of taking the mean value of the objects in a cluster as a reference
point, we can pick actual objects to represent the clusters, using one
representative object per cluster. Each remaining object is clustered with
the representative object to which it is the most similar.
● The Partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object and its
corresponding reference point. That is, an absolute-error criterion is used,
defined as:
E = Σ(j=1 to k) Σ(p ∈ Cj) |p − oj|
where E is the sum of the absolute error for all objects in the data set, p is
the point in space representing a given object in cluster Cj, and oj is the
representative object of Cj.

● The initial representative objects are chosen arbitrarily. The iterative
process of replacing representative objects by nonrepresentative objects
continues as long as the quality of the resulting clustering is improved.
● This quality is estimated using a cost function that measures the average
dissimilarity between an object and the representative object of its
cluster.
● To determine whether a nonrepresentative object, orandom, is a good
replacement for a current representative object, oj, the following four
cases are examined for each of the nonrepresentative objects.

Case 1: p currently belongs to representative object oj. If oj is replaced by
orandom as a representative object and p is closest to one of the other
representative objects, oi, i≠j, then p is reassigned to oi.

Case 2: p currently belongs to representative object oj. If oj is replaced by
orandom as a representative object and p is closest to orandom, then p is
reassigned to orandom.

Case 3: p currently belongs to representative object oi, i≠j. If oj is replaced by
orandom as a representative object and p is still closest to oi, then the
assignment does not change.

Case 4: p currently belongs to representative object oi, i≠j. If oj is replaced by
orandom as a representative object and p is closest to orandom, then p is
reassigned to orandom.
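
A small PAM-style sketch that makes the swap test concrete: the absolute-error
criterion E is evaluated before and after replacing a representative object
with a nonrepresentative one, and the swap is kept only when E decreases. The
function names and data are illustrative assumptions:

import numpy as np

def absolute_error(data, medoid_idx):
    # E = sum over all objects of the distance to the nearest representative object.
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def k_medoids(data, k, seed=0, max_iter=50):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(data), size=k, replace=False))
    for _ in range(max_iter):
        improved = False
        current = absolute_error(data, medoids)
        for j in range(k):                    # each representative object oj
            for cand in range(len(data)):     # each candidate orandom
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = cand
                trial_error = absolute_error(data, trial)
                if trial_error < current:     # keep the swap only if E drops
                    medoids, current, improved = trial, trial_error, True
        if not improved:
            break
    # Reassign every object to its nearest representative object.
    dists = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2)
    return np.array(medoids), dists.argmin(axis=1)

rng = np.random.default_rng(6)
data = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
                  rng.normal(3, 0.3, size=(30, 2))])
medoids, labels = k_medoids(data, k=2)
print(data[medoids])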

Outlier Analysis

● There exist data objects that do not comply with the general behavior or
model of the data. Such data objects, which are grossly different from or
inconsistent with the remaining set of data, are called outliers.
● Many data mining algorithms try to minimize the influence of outliers or
eliminate them altogether. This, however, could result in the loss of
important hidden information because one person’s noise could be
another person’s signal.
● In other words, the outliers may be of particular interest, such as in the
case of fraud detection, where outliers may indicate fraudulent activity.
● Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining. It can be used in fraud detection, for
example, by detecting unusual usage of credit cards or telecommunication
services.
● In addition, it is useful in customized marketing for identifying the
spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various
medical treatments.
