ML - Unit - 2

ML Unit-II

Unsupervised Learning

Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses.

The most common unsupervised learning method is cluster analysis, which is used for exploratory
data analysis to find hidden patterns or grouping in data. The clusters are modeled using a measure
of similarity which is defined upon metrics such as Euclidean or probabilistic distance.

Common clustering algorithms include:

Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree

k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a
cluster

Gaussian mixture models: models clusters as a mixture of multivariate normal density components

Self-organizing maps: uses neural networks that learn the topology and distribution of the data

Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic
clustering; in data mining for sequence and pattern mining; in medical imaging for image
segmentation; and in computer vision for object recognition.

k-means clustering algorithm

k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering
problem.

The procedure classifies a given data set into a certain number of clusters (assume k clusters)
fixed a priori.

The main idea is to define k centers, one for each cluster.

These centers should be placed carefully, because different starting locations lead to different results.
A good choice is to place them as far away from each other as possible. The next step
is to take each point of the data set and associate it with the nearest center. When no
point is pending, the first step is completed and an early grouping is done. At this point we need to
re-calculate k new centroids as the barycenters of the clusters resulting from the previous step.

After we have these k new centroids, a new binding has to be done between the same data set
points and the nearest new center. A loop has been generated. As a result of this loop, the k centers
change their location step by step until no more changes occur; in other words, the
centers do not move any more.
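Finally, this algorithm aims at minimizing an objective function known as the squared error function, given
by:

$$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \| x_j - v_i \| \right)^2$$

where,
‘||xj - vi||’ is the Euclidean distance between the data point xj (the jth point assigned to the ith cluster) and the cluster center vi.

‘ci’ is the number of data points in ith cluster.

‘c’ is the number of cluster centers.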

Algorithmic steps for k-means clustering

Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be the set of centers.

1) Randomly select ‘c’ cluster centers.

2) Calculate the distance between each data point and cluster centers.

3) Assign each data point to the cluster center that is nearest to it among all the cluster centers.

4) Recalculate the new cluster center using:

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$$

where ‘ci’ represents the number of data points in ith cluster and the sum runs over the data points xj currently assigned to that cluster.

5) Recalculate the distance between each data point and the newly obtained cluster centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3).
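
The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function names `kmeans` and `squared_error`, the optional `init` argument (used to fix the starting centers instead of drawing them at random), the `seed` and `max_iter` parameters, and the sample points are all assumptions made for this example.

```python
import math
import random


def kmeans(points, c, init=None, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into c groups."""
    # Step 1: choose the initial centers (random data points unless given).
    if init is None:
        init = random.Random(seed).sample(points, c)
    centers = [tuple(p) for p in init]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign every point to its nearest center.
        new_labels = [
            min(range(c), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 6: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members.
        for j in range(c):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


def squared_error(points, centers, labels):
    """Objective: sum of squared distances from each point to its center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Made-up sample data: two well separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(data, c=2)
print(labels, squared_error(data, centers, labels))
```

Because the result depends on the starting centers, it is common to run the procedure several times with different seeds and keep the run with the smallest squared error.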

Advantages:-
1) Fast, robust and easy to understand.

2) Relatively efficient: O(tknd), where n is # objects, k is # clusters, d is # dimension of each object,
and t is # iterations. Normally, k, t, d << n.

3) Gives the best results when the clusters in the data set are distinct or well separated from each other.

Fig I: Showing the result of k-means for 'N' = 60 and 'c' = 3

Disadvantages:-

1) The learning algorithm requires a priori specification of the number of cluster centers.

2) The use of exclusive assignment: if two clusters overlap heavily, k-means will not be
able to resolve that there are two clusters.

3) The learning algorithm is not invariant to non-linear transformations, i.e. with a different representation
of the data we get different results (data represented in Cartesian coordinates and in polar coordinates
will give different results).

4) Euclidean distance measures can unequally weight underlying factors.

5) The learning algorithm may converge to a local optimum of the squared error function rather than the global one.

6) Randomly chosen initial cluster centers may not lead to a good result. Please refer to Fig.

7) Applicable only when the mean is defined, i.e. it fails for categorical data.

8) Unable to handle noisy data and outliers.

9) The algorithm fails for non-linear data sets.


Fig II: Showing the non-linear data set where k-means algorithm fails

kernel k-means clustering algorithm

This algorithm applies the same trick as k-means, with one difference: in the calculation of distance,
a kernel method is used instead of the Euclidean distance.

Algorithmic steps for Kernel k-means clustering

Let X = {a1, a2, a3, ..., an} be the set of data points and 'c' be the number of clusters.

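1) Randomly initialize ‘c’ cluster centers.

2) Compute the distance between each data point and each cluster center in the transformed space using:

$$\| \Phi(a_i) - m_c \|^2, \qquad m_c = \frac{1}{|\pi_c|} \sum_{a_j \in \pi_c} \Phi(a_j)$$

where,

cth cluster is denoted by πc.

‘mc’ denotes the mean of the cluster πc.

‘Ф(ai)’ denotes the data point ai in transformed space.

Ф(ai). Ф(aj) = exp(-q ||ai - aj||^2) for gaussian kernel.

= (c + ai.aj)^d for polynomial kernel (here c is a constant offset, not the number of clusters).

Since Ф is never evaluated explicitly, the distance is expanded so that it uses only dot products:

$$\| \Phi(a_i) - m_c \|^2 = \Phi(a_i) \cdot \Phi(a_i) - \frac{2}{|\pi_c|} \sum_{a_j \in \pi_c} \Phi(a_i) \cdot \Phi(a_j) + \frac{1}{|\pi_c|^2} \sum_{a_j \in \pi_c} \sum_{a_l \in \pi_c} \Phi(a_j) \cdot \Phi(a_l)$$

3) Assign each data point to the cluster center whose distance is minimum.

4) Repeat from step 2) until no data points are re-assigned.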
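
A minimal sketch of these steps in plain Python is given below. It is written under a few assumptions that are not in the notes: the cluster centers are never stored (they exist only implicitly in the transformed space), so step 1 is realised as a random initial partition of the points; the distance to a cluster mean is computed from kernel values only, using the standard expansion of the squared distance into dot products; and the names `gaussian_kernel` and `kernel_kmeans`, the optional `labels` argument, and the sample points are hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, q=0.5):
    """k(a, b) = exp(-q * ||a - b||^2)."""
    return math.exp(-q * math.dist(a, b) ** 2)


def kernel_kmeans(points, c, kernel, labels=None, max_iter=100, seed=0):
    """Return one cluster label per point; centers exist only implicitly."""
    n = len(points)
    K = [[kernel(a, b) for b in points] for a in points]  # all pairwise dot products
    if labels is None:  # step 1: a random initial partition
        rng = random.Random(seed)
        labels = [rng.randrange(c) for _ in range(n)]
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if labels[i] == j] for j in range(c)]
        # ||m_c||^2 is shared by all points, so compute it once per cluster.
        center_sq = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            # Step 2: squared feature-space distance to every non-empty cluster.
            dists = {
                j: K[i][i] - 2.0 * sum(K[i][l] for l in m) / len(m) + center_sq[j]
                for j, m in enumerate(clusters) if m
            }
            # Step 3: pick the nearest cluster.
            new_labels.append(min(dists, key=dists.get))
        # Step 4: stop once no point is re-assigned.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Made-up sample data; the third point starts in the wrong cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = kernel_kmeans(points, 2, gaussian_kernel, labels=[0, 0, 1, 1, 1, 1])
print(labels)
```

The Gram matrix `K` has one entry per pair of points, which is one reason the time and memory cost of the method grows quickly with the size of the data set.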

Fig I: Result obtained by applying Gaussian Kernel k-means with 'q' =10

Advantages

1) Algorithm is able to identify the non-linear structures.

2) Algorithm is best suited for real life data set.

Disadvantages

1) The number of cluster centers needs to be predefined.

2) Algorithm is complex in nature and time complexity is large.

What is Dimensionality Reduction?


In machine learning classification problems, there are often too many factors on the basis of which
the final classification is done. These factors are basically variables called features. The higher the
number of features, the harder it gets to visualize the training set and then work on it. Sometimes,
most of these features are correlated, and hence redundant. This is where dimensionality reduction
algorithms come into play. Dimensionality reduction is the process of reducing the number of
random variables under consideration, by obtaining a set of principal variables. It can be divided
into feature selection and feature extraction.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?


An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the content
of the e-mail, whether the e-mail uses a template, etc. However, some of these features may
overlap. As another example, a classification problem that relies on both humidity and rainfall can
be collapsed into just one underlying feature, since the two are correlated to a
high degree. Hence, we can reduce the number of features in such problems. A 3-D classification
problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2 dimensional
space, and a 1-D problem to a simple line. The below figure illustrates this concept, where a 3-D
feature space is split into two 1-D feature spaces, and later, if found to be correlated, the number
of features can be reduced even further.

Components of Dimensionality Reduction


There are two components of dimensionality reduction:
• Feature selection: Here we try to find a subset of the original set of variables, or features,
that can be used to model the problem. It usually involves one of three
approaches:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with fewer dimensions.
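
As a small illustration of the filter approach to feature selection, the sketch below drops any feature that is almost perfectly correlated with a feature that has already been kept. This is one possible filter criterion, not the only one; the function names `pearson` and `correlation_filter`, the 0.95 threshold, and the feature values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_filter(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical measurements; rainfall is an exact linear function of humidity.
features = {
    "humidity": [30.0, 45.0, 60.0, 75.0, 90.0],
    "rainfall": [4.0, 5.5, 7.0, 8.5, 10.0],
    "wind": [12.0, 5.0, 9.0, 14.0, 7.0],
}
selected = correlation_filter(features)
print(selected)
```

Here `rainfall` is discarded because it carries the same information as `humidity`, while `wind` is kept.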

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be either linear or non-linear, depending upon the method used. The
prime linear method, called Principal Component Analysis, or PCA, is discussed below.

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that, while the data in a
higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the
lower-dimensional space should be maximum.

It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction
of the variance of the original data.

Hence, we are left with a smaller number of eigenvectors, and there might have been some data loss
in the process. But the most important variances should be retained by the remaining eigenvectors.
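
The three PCA steps can be illustrated with a short plain-Python sketch. To stay self-contained it finds only the leading eigenvector, using power iteration, which is a substitute for a full eigendecomposition and an assumption of this example; the helper names `covariance` and `leading_eigenpair` and the sample data are hypothetical as well.

```python
def covariance(rows):
    """Return (covariance matrix, mean-centered rows) for a list of samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered


def leading_eigenpair(mat, iters=200):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix.

    Assumes mat is not the zero matrix and that the start vector is not
    orthogonal to the leading eigenvector.
    """
    d = len(mat)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * mat[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v


# Made-up data in which the second feature is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov, centered = covariance(rows)          # step 1: covariance matrix
lam, v = leading_eigenpair(cov)           # step 2: leading eigenvector
# Step 3: project the centered data onto that eigenvector (2-D -> 1-D).
projected = [sum(x[j] * v[j] for j in range(len(v))) for x in centered]
explained = lam / sum(cov[j][j] for j in range(len(cov)))
print(v, explained)
```

For this data the single component points along the direction (1, 2) and accounts for all of the variance, so nothing is lost by keeping one feature instead of two.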

Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb
are applied.
 Explain fold
 based filtering

3. Kernel Principal Component Analysis

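Many machine learning problems are nonlinear, and nonlinear feature
mappings can help to produce new features which make prediction problems linear. In this section
we will discuss the following idea: transform the dataset to a new higher-dimensional (in
some cases infinite-dimensional) feature space and use PCA in that space in order to produce
uncorrelated features. Such a method is called Kernel Principal Component Analysis or KPCA.

Let us denote the covariance matrix in the new feature space as

$$C = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \, \phi(x_i)^T$$

where $\phi(x_i)$ is the image of the data item $x_i$, $i = 1, \dots, N$, under the feature mapping $\phi$. We will assume that the dimensionality of the feature space equals
$M$.

The eigendecomposition of $C$ is given by

$$C v_k = \lambda_k v_k, \quad k = 1, \dots, M$$

By the definition of $C$,

$$\frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \, \phi(x_i)^T v_k = \lambda_k v_k$$

and therefore, for $\lambda_k \neq 0$,

$$v_k = \frac{1}{N \lambda_k} \sum_{i=1}^{N} \left( \phi(x_i)^T v_k \right) \phi(x_i)$$

It is easy to see that $v_k$ is a linear combination of the $\phi(x_i)$ and thus can be written as

$$v_k = \sum_{i=1}^{N} a_{ki} \, \phi(x_i)$$

Substituting this into the equation above, multiplying from the left by $\phi(x_l)^T$ and writing the result in matrix notation, we get

$$K^2 a_k = \lambda_k N K a_k,$$

which is satisfied by the solutions of

$$K a_k = \lambda_k N a_k$$

where $K$ is the $N \times N$ Gram matrix in the feature space, with elements $K_{ij} = \phi(x_i)^T \phi(x_j)$, and $a_k$ are column-vectors with
elements $a_{ki}$. Eigenvectors of $C$ should be orthonormal, therefore, we get the following:

$$v_k^T v_k = 1 \;\Rightarrow\; a_k^T K a_k = 1 \;\Rightarrow\; \lambda_k N \, a_k^T a_k = 1$$

Having the eigenvectors of $C$, we can get the projection of an item $x$ on the $k$-th eigenvector:

$$y_k(x) = \phi(x)^T v_k = \sum_{i=1}^{N} a_{ki} \, \phi(x)^T \phi(x_i)$$

So far, we have assumed that the mapping $\phi$ is known. From the equations above, we can see
that the only thing we need for the data transformation is the eigendecomposition of the Gram
matrix $K$. The dot products that are its elements can be defined without any explicit definition of $\phi$.
The function defining such dot products in some Hilbert space is called a kernel. Valid kernels are
those that satisfy the conditions of Mercer’s theorem. There are many different types of kernels; several
popular ones are:

1. Linear: $k(x, y) = x^T y$;
2. Gaussian: $k(x, y) = \exp(-\gamma \| x - y \|^2)$;
3. Polynomial: $k(x, y) = (x^T y + c)^d$.

Using a kernel function we can write a new equation for the projection of some data item $x$ onto the $k$-th
eigenvector:

$$y_k(x) = \sum_{i=1}^{N} a_{ki} \, k(x, x_i)$$

So far, we have assumed that the mapped data $\phi(x_i)$ have zero mean. Using the centered mapping

$$\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{N} \sum_{j=1}^{N} \phi(x_j)$$

and substituting it into the equation for $K$, we get

$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$$

where $\mathbf{1}_N$ is an $N \times N$ matrix in which each element equals $1/N$.

Summary: Now we are ready to write the whole sequence of steps to perform KPCA:

1. Calculate $K$.
2. Calculate $\tilde{K}$.
3. Find the eigenvectors $a_k$ of $\tilde{K}$ corresponding to nonzero eigenvalues and normalize
them: $\lambda_k N \, a_k^T a_k = 1$, where $\lambda_k N$ is the corresponding eigenvalue of $\tilde{K}$.
4. Sort the eigenvectors found in descending order of the corresponding eigenvalues.
5. Perform projections onto the given subset of eigenvectors.

The method described above requires choosing the number of components, the kernel and its
parameters. It should be noted that the number of nonlinear principal components in the general
case is infinite, but since we are computing the eigenvectors of an $N \times N$ matrix $K$, we can
calculate at most $N$ nonlinear principal components.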
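
The KPCA steps can be tried out with the following plain-Python sketch. It is a simplified illustration under stated assumptions: it uses a Gaussian kernel, it extracts only the leading component by power iteration rather than a full eigendecomposition, and it projects only the training items. The names `rbf`, `centered_gram` and `leading_component`, the parameter `gamma`, and the four one-dimensional sample points are made up for the example.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * math.dist(x, y) ** 2)


def centered_gram(points, kernel):
    """Steps 1-2: build the Gram matrix, then center it in feature space."""
    n = len(points)
    K = [[kernel(a, b) for b in points] for a in points]
    row_mean = [sum(K[i]) / n for i in range(n)]  # equals column means (K is symmetric)
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def leading_component(Kc, iters=500):
    """Step 3 for the top component only, via power iteration.

    Assumes Kc is not the zero matrix (the points are not all identical).
    """
    n = len(Kc)
    a = [math.sin(i + 1.0) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iters):
        w = [sum(Kc[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        a = [x / norm for x in w]
    mu = sum(a[i] * Kc[i][j] * a[j] for i in range(n) for j in range(n))
    # mu is the eigenvalue of the centered Gram matrix; rescale the
    # coefficients so that mu * a^T a = 1.
    return mu, [x / math.sqrt(mu) for x in a]


# Made-up data: two tight groups on a line.
points = [(0.0,), (0.1,), (5.0,), (5.1,)]
Kc = centered_gram(points, rbf)
mu, a = leading_component(Kc)
# Step 5: the projection of training item i is sum_j a_j * Kc[i][j].
projections = [sum(a[j] * Kc[i][j] for j in range(len(a))) for i in range(len(a))]
print(mu, projections)
```

The first nonlinear component takes one sign for the first group of points and the opposite sign for the second group, so it separates the two groups.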

Matrix Factorization:
As the name suggests, matrix factorization factorizes a matrix, i.e. it finds two (or more)
matrices such that when you multiply them you get back the original matrix.

Matrix factorization can be used to discover latent features underlying the interactions
between two different kinds of entities. (Of course, you can consider more than two
kinds of entities and you will be dealing with tensor factorization, which would be more
complicated.) And one obvious application is to predict ratings in collaborative filtering.

In a recommendation system such as Netflix or MovieLens, there is a group of users and
a set of items (movies for the above two systems). Given that each user has rated some
items in the system, we would like to predict how the users would rate the items that
they have not yet rated, such that we can make recommendations to the users. In this
case, all the information we have about the existing ratings can be represented in a
matrix. Assume now we have 5 users and 4 items, and ratings are integers ranging from
1 to 5; the matrix may look something like this (a hyphen means that the user has not yet
rated the movie):

      D1   D2   D3   D4
U1     5    3    -    1
U2     4    -    -    1
U3     1    1    -    5
U4     1    -    -    4
U5     -    1    5    4

Hence, the task of predicting the missing ratings can be considered as filling in the
blanks (the hyphens in the matrix) such that the values would be consistent with the
existing ratings in the matrix.

The intuition behind using matrix factorization to solve this problem is that there should
be some latent features that determine how a user rates an item. For example, two users
would give high ratings to a certain movie if they both like the actors/actresses of the
movie, or if the movie is an action movie, which is a genre preferred by both users.
Hence, if we can discover these latent features, we should be able to predict a rating with
respect to a certain user and a certain item, because the features associated with the user
should match with the features associated with the item.

In trying to discover the different features, we also make the assumption that the
number of features would be smaller than the number of users and the number of items.
It should not be difficult to understand this assumption because clearly it would not be
reasonable to assume that each user is associated with a unique feature (although this is
not impossible). And anyway if this is the case there would be no point in making
recommendations, because each of these users would not be interested in the items
rated by other users. Similarly, the same argument applies to the items.

The mathematics of matrix factorization


Having discussed the intuition behind matrix factorization, we can now go on to work
on the mathematics. Firstly, we have a set $U$ of users, and a set $D$ of items. Let $R$ of
size $|U| \times |D|$ be the matrix that contains all the ratings that the users have assigned to
the items. Also, we assume that we would like to discover $K$ latent features. Our task,
then, is to find two matrices $P$ (a $|U| \times K$ matrix) and $Q$ (a $|D| \times K$ matrix)
such that their product approximates $R$:

$$R \approx P \times Q^T = \hat{R}$$

In this way, each row of $P$ would represent the strength of the associations between a
user and the features. Similarly, each row of $Q$ would represent the strength of the
associations between an item and the features. To get the prediction of a rating of an
item $d_j$ by user $u_i$, we can calculate the dot product of the two vectors corresponding
to $u_i$ and $d_j$:

$$\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{K} p_{ik} q_{kj}$$

Here $q_{kj}$ denotes the element of $Q^T$ in row $k$ and column $j$, i.e. the weight of feature $k$ for item $d_j$.

Now, we have to find a way to obtain $P$ and $Q$. One way to approach this problem is to
first initialize the two matrices with some values, calculate how ‘different’ their product
is from $R$, and then try to minimize this difference iteratively. Such a method is called
gradient descent; it aims at finding a local minimum of the difference.

The difference here, usually called the error between the estimated rating and the real
rating, can be calculated by the following equation for each user-item pair:

$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \left( r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj} \right)^2$$

Here we consider the squared error because the estimated rating can be either higher or
lower than the real rating.

To minimize the error, we have to know in which direction we have to modify the values
of $p_{ik}$ and $q_{kj}$. In other words, we need to know the gradient at the current values, and
therefore we differentiate the above equation with respect to these two variables
separately:

$$\frac{\partial}{\partial p_{ik}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) \, q_{kj} = -2 e_{ij} q_{kj}$$

$$\frac{\partial}{\partial q_{kj}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) \, p_{ik} = -2 e_{ij} p_{ik}$$

Having obtained the gradient, we can now formulate the update rules for
both $p_{ik}$ and $q_{kj}$, stepping against the gradient:

$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}} e_{ij}^2 = p_{ik} + 2 \alpha e_{ij} q_{kj}$$

$$q'_{kj} = q_{kj} - \alpha \frac{\partial}{\partial q_{kj}} e_{ij}^2 = q_{kj} + 2 \alpha e_{ij} p_{ik}$$

Here, $\alpha$ is a constant whose value determines the rate of approaching the minimum.
Usually we will choose a small value for $\alpha$, say 0.0002. This is because if we make too
large a step towards the minimum we may run into the risk of missing the minimum
and end up oscillating around the minimum.

A question might have come to your mind by now: if we find two matrices $P$ and $Q$ such
that $P \times Q^T$ approximates $R$, won't our predictions of all the unseen ratings
be zeros? In fact, we are not really trying to come up with $P$ and $Q$ such that we can
reproduce $R$ exactly. Instead, we will only try to minimise the errors of the observed
user-item pairs. In other words, if we let $T$ be a set of tuples, each of which is in the form
of $(u_i, d_j, r_{ij})$, such that $T$ contains all the observed user-item pairs together with the
associated ratings, we are only trying to minimise every $e_{ij}$ for $(u_i, d_j, r_{ij}) \in T$. (In other
words, $T$ is our set of training data.) As for the rest of the unknowns, we will be able to
determine their values once the associations between the users, items and features have
been learnt.

Using the above update rules, we can then iteratively perform the operation until the
error converges to its minimum. We can check the overall error as calculated using the
following equation and determine when we should stop the process.

$$E = \sum_{(u_i, d_j, r_{ij}) \in T} e_{ij}^2 = \sum_{(u_i, d_j, r_{ij}) \in T} \left( r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj} \right)^2$$
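
The iterative procedure can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the original author's code: unrated entries are stored as `None`, both factor matrices are initialised with random values in [0, 1), the item factors are stored with one row per item (so `Q[j][k]` plays the role of the weight of feature k for item j), and a fixed number of passes is used instead of a convergence check. The function names `predict`, `total_error` and `factorize` are hypothetical.

```python
import random


def predict(P, Q, i, j):
    """Estimated rating: dot product of user row i of P and item row j of Q."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


def total_error(R, P, Q):
    """Sum of squared errors over the observed (non-None) ratings only."""
    return sum(
        (r - predict(P, Q, i, j)) ** 2
        for i, row in enumerate(R) for j, r in enumerate(row) if r is not None
    )


def factorize(R, K=2, steps=5000, alpha=0.0002, seed=0):
    """Gradient descent on the squared error of the observed ratings."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(K)] for _ in R]      # users x features
    Q = [[rng.random() for _ in range(K)] for _ in R[0]]   # items x features
    for _ in range(steps):
        for i, row in enumerate(R):
            for j, r in enumerate(row):
                if r is None:
                    continue  # unrated pairs are not part of the training set
                e = r - predict(P, Q, i, j)
                for k in range(K):
                    p, q = P[i][k], Q[j][k]
                    # Step against the gradient of the squared error.
                    P[i][k] = p + 2 * alpha * e * q
                    Q[j][k] = q + 2 * alpha * e * p
    return P, Q


# The 5 x 4 rating matrix from the text; None marks a missing rating.
R = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
P0, Q0 = factorize(R, steps=0)   # the random starting point
P, Q = factorize(R)              # after training
print(total_error(R, P0, Q0), total_error(R, P, Q))
print(predict(P, Q, 0, 2))       # estimated rating of item D3 by user U1
```

Comparing the two printed errors shows how far the observed ratings are fitted; the last line fills in one of the blanks of the matrix.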
Matrix completion:
