ML - Unit - 2
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses.
The most common unsupervised learning method is cluster analysis, which is used for exploratory
data analysis to find hidden patterns or grouping in data. The clusters are modeled using a measure
of similarity which is defined upon metrics such as Euclidean or probabilistic distance.
Common clustering algorithms include:
k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
Gaussian mixture models: models clusters as a mixture of multivariate normal density components
Self-organizing maps: uses neural networks that learn the topology and distribution of the data
Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic
clustering; in data mining for sequence and pattern mining; in medical imaging for image
segmentation; and in computer vision for object recognition.
k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering
problem.
The procedure follows a simple way to group a given data set into a certain number of clusters
(assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster.
These centers should be placed carefully, because different starting locations lead to different results.
A better choice is to place them as far away from each other as possible. The next step
is to take each point in the data set and associate it with the nearest center. When no
point is pending, the first step is complete and an early grouping is done. At this point we need to
re-calculate k new centroids as the barycenters of the clusters resulting from the previous step.
After we have these k new centroids, a new binding has to be done between the same data set
points and the nearest new center. A loop has been generated. As a result of this loop, the k centers
change their location step by step until no more changes occur; in other words, the
centers do not move any more.
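One simple way to act on the advice to place the initial centers far apart is a farthest-first traversal: start from one point, then repeatedly add the point that is farthest from the centers chosen so far. The sketch below assumes NumPy; the function name `farthest_first_centers` and the choice of the starting point are illustrative assumptions, not part of the procedure described above.

```python
import numpy as np

def farthest_first_centers(X, k, first=0):
    """Pick k rows of X as initial centers, each far from those already chosen."""
    X = np.asarray(X, dtype=float)
    chosen = [first]
    # distance from every point to its nearest chosen center so far
    nearest = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < k:
        idx = int(np.argmax(nearest))          # farthest remaining point
        chosen.append(idx)
        nearest = np.minimum(nearest, np.linalg.norm(X - X[idx], axis=1))
    return X[chosen]

# toy data (made up for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 9.0]])
centers = farthest_first_centers(X, k=3)
```

This is only one possible initialization heuristic; it is sensitive to outliers, since an outlier is by construction far from everything else.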
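Finally, this algorithm aims at minimizing an objective function known as the squared error function, given
by:
$$J(V) = \sum_{j=1}^{c} \sum_{x_i \in C_j} \left( \| x_i - v_j \| \right)^2$$
where,
'||xi - vj||' is the Euclidean distance between xi and vj,
'Cj' is the set of data points assigned to the jth cluster,
'c' is the number of cluster centers.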
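To make the squared error function concrete, the following sketch evaluates it with NumPy for a fixed assignment of points to centers. The function name `squared_error`, the `labels` array and the toy data are assumptions made for illustration.

```python
import numpy as np

def squared_error(X, centers, labels):
    """Sum of squared Euclidean distances between each point and its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[labels]      # x_i - v_j, where j is the center assigned to point i
    return float(np.sum(diffs ** 2))

# toy data: two points around each of two centers
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
J = squared_error(X, centers, labels)   # each point is at distance 1 from its center
```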
Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be the set of centers.
1) Select 'c' initial cluster centers, for example at random.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center that is nearest to it.
4) Recalculate each cluster center as the mean (barycenter) of the data points assigned to it.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
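The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `k_means`, the random choice of initial centers among the data points, the `seed` and `max_iter` parameters, and the handling of empty clusters are all choices made here.

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=0):
    """Basic k-means; returns (centers, labels) for data X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initial centers: c distinct data points chosen at random (an assumption)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # distance from every point to every center, shape (n, c)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # assign each point to its nearest center
        new_labels = np.argmin(dists, axis=1)
        # stop when no point was reassigned
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # move each center to the mean of the points assigned to it
        for j in range(c):
            members = X[labels == j]
            if len(members) > 0:       # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels

# two well-separated groups of three points each (toy data)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centers, labels = k_means(X, c=2)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.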
Advantages:-
1) Fast, robust and easy to understand.
3) Gives the best results when the clusters in the data set are distinct or well separated from each other.
Disadvantages:-
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) The use of exclusive assignment: if two clusters overlap heavily, k-means will not be
able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e. different representations
of the data give different results (data represented in Cartesian coordinates and in polar coordinates
will give different results).
5) The learning algorithm is only guaranteed to reach a local optimum of the squared error function.
6) Randomly chosen initial cluster centers may not lead to a good result. Please refer to the figure.
7) Applicable only when the mean is defined, i.e. it fails for categorical data.
Kernel k-means applies the same trick as k-means, with one difference: the distance is calculated
with a kernel method instead of the Euclidean distance.
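Let X = {a1, a2, a3, ..., an} be the set of data points and 'c' be the number of clusters.
2) Compute the distance between each data point and each cluster center in the transformed space using:
$$D(a_i, m_j) = \| \phi(a_i) - m_j \|^2, \qquad m_j = \frac{1}{|\pi_j|} \sum_{a_l \in \pi_j} \phi(a_l)$$
where,
$\phi(a_i)$ denotes the data point $a_i$ in the transformed space, $\pi_j$ is the jth cluster and $m_j$ is the
mean of cluster $\pi_j$ in the transformed space.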
Fig I: Result obtained by applying Gaussian Kernel k-means with 'q' =10
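The distance to a cluster mean in the transformed space can be computed from kernel values alone, because expanding the squared norm leaves only inner products of mapped points. The sketch below is a minimal NumPy illustration under stated assumptions: the function names, the Gaussian kernel parameter `gamma` and the explicit `init_labels` argument are hypothetical choices made here, and the toy data is made up.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||a_i - a_j||^2); gamma is an assumed parameter name."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kernel_distances(K, labels, c):
    """Squared distance in the transformed space from each point to each cluster mean."""
    n = K.shape[0]
    dists = np.full((n, c), np.inf)
    for j in range(c):
        members = labels == j
        size = members.sum()
        if size == 0:
            continue                   # an empty cluster attracts no points
        # ||phi(a_i) - m_j||^2 = K_ii - (2/|pi_j|) sum_l K_il + (1/|pi_j|^2) sum_{l,m} K_lm
        dists[:, j] = (np.diag(K)
                       - 2.0 * K[:, members].sum(axis=1) / size
                       + K[np.ix_(members, members)].sum() / size ** 2)
    return dists

def kernel_k_means(K, init_labels, c, max_iter=100):
    """Reassign points to the nearest cluster mean until the assignment is stable."""
    labels = np.asarray(init_labels)
    for _ in range(max_iter):
        new_labels = np.argmin(kernel_distances(K, labels, c), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
K = gaussian_kernel_matrix(X, gamma=0.5)
labels = kernel_k_means(K, init_labels=[0, 0, 1, 1, 1, 1], c=2)
```

Like plain k-means, this procedure depends on the starting assignment and can stop in a poor local optimum, so in practice several initializations are usually tried.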
Advantages
Disadvantages
Many machine learning problems are nonlinear, and nonlinear feature mappings can help to
produce new features which make prediction problems linear. In this section
we will discuss the following idea: transforming the dataset to a new higher-dimensional (in
some cases infinite-dimensional) feature space and using PCA in that space in order to produce
uncorrelated features. Such a method is called Kernel Principal Component Analysis or KPCA.
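Consider data points $x_1, \dots, x_n$ and a nonlinear mapping $\phi: \mathbb{R}^d \to \mathbb{R}^M$,
where $x_i \in \mathbb{R}^d$. We will consider that the dimensionality of the feature space equals
$M$.
By the definition of the covariance matrix in the feature space,
$$C = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \phi(x_i)^T,$$
every eigenvector $v_k$ of $C$ with a nonzero eigenvalue lies in the span of the mapped points,
and therefore
$$v_k = \sum_{i=1}^{n} \alpha_{ki} \phi(x_i),$$
where the coefficient vector $\alpha_k$ can be taken to be an eigenvector of the kernel matrix $K$ with
entries $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.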
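Commonly used kernel functions $K(x, y) = \phi(x)^T \phi(y)$ include:
1. Linear: $K(x, y) = x^T y$;
2. Gaussian: $K(x, y) = \exp(-\gamma \| x - y \|^2)$;
3. Polynomial: $K(x, y) = (x^T y + c)^p$.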
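As a small illustration, the three kernels can be written as plain functions. The standard forms are assumed here: the dot product for the linear kernel, exp(-gamma * ||x - y||^2) for the Gaussian kernel and (x . y + c)^p for the polynomial kernel. The parameter names `gamma`, `c` and `p` and their default values are assumptions.

```python
import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return float(np.dot(x, y))

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def polynomial_kernel(x, y, c=1.0, p=2):
    """K(x, y) = (x . y + c)^p"""
    return float((np.dot(x, y) + c) ** p)

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
k_lin = linear_kernel(x, y)        # 1*3 + 2*0
k_poly = polynomial_kernel(x, y)   # (3 + 1)^2
k_gauss = gaussian_kernel(x, y)    # exp(-8), since ||x - y||^2 = 8
```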
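Using a kernel function, we can write a new equation for the projection of some data item $x$ onto the
$k$-th eigenvector $v_k = \sum_{i} \alpha_{ki} \phi(x_i)$:
$$v_k^T \phi(x) = \sum_{i=1}^{n} \alpha_{ki} K(x_i, x)$$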
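So far, we have assumed that the mapped data points $\phi(x_i)$ have zero mean. In general this does
not hold, so the kernel matrix $K$ is replaced by its centered version
$$\tilde{K} = K - 1_n K - K 1_n + 1_n K 1_n,$$
where $1_n$ is the $n \times n$ matrix with all entries equal to $1/n$.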
Summary: Now we are ready to write the whole sequence of steps to perform KPCA:
1. Calculate the kernel matrix $K$.
2. Calculate the centered kernel matrix $\tilde{K}$.
3. Find the eigenvectors $\alpha_k$ of $\tilde{K}$ corresponding to nonzero eigenvalues $\lambda_k$ and normalize
them: $\lambda_k \, \alpha_k^T \alpha_k = 1$.
4. Sort the eigenvectors in descending order of the corresponding eigenvalues.
5. Perform projections onto the given subset of eigenvectors.
The method described above requires choosing the number of components, the kernel and its
parameters. It should be noted that the number of nonlinear principal components in the general
case is infinite, but since we are computing the eigenvectors of an $n \times n$ matrix $\tilde{K}$, at most
$n$ nonlinear principal components can be calculated.
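The sequence of steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a definitive implementation: a Gaussian kernel with parameter `gamma` is assumed, the function name `kernel_pca` is made up, and the requested components are assumed to have strictly positive eigenvalues.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Project the rows of X onto the leading kernel principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # kernel matrix (Gaussian kernel, an assumed choice)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # centered kernel matrix
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigenvectors of the centered kernel matrix (eigh returns ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(K_c)
    # sort in descending order and keep the leading components
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # normalize so that lambda_k * alpha_k^T alpha_k = 1 (requires positive eigenvalues)
    alphas = eigvecs / np.sqrt(eigvals)
    # projections of the training points onto the selected components
    return K_c @ alphas, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))             # made-up data
Z, lam = kernel_pca(X, n_components=2)  # Z has one row per data point
```

With this normalization the projected features are uncorrelated: `Z.T @ Z` is diagonal, with the selected eigenvalues on the diagonal.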
Matrix Factorization:
As the name suggests, matrix factorization means factorizing a matrix, i.e. finding two (or more)
matrices such that multiplying them gives back the original matrix.
Matrix factorization can be used to discover latent features underlying the interactions
between two different kinds of entities. (Of course, you can consider more than two
kinds of entities and you will be dealing with tensor factorization, which would be more
complicated.) And one obvious application is to predict ratings in collaborative filtering.
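For example, suppose five users (rows U1 to U5) have rated some of four items (columns); a
hyphen means that the user has not rated the item:
U1   5   3   -   1
U2   4   -   -   1
U3   1   1   -   5
U4   1   -   -   4
U5   -   1   5   4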
Hence, the task of predicting the missing ratings can be considered as filling in the
blanks (the hyphens in the matrix) such that the values would be consistent with the
existing ratings in the matrix.
The intuition behind using matrix factorization to solve this problem is that there should
be some latent features that determine how a user rates an item. For example, two users
would give high ratings to a certain movie if they both like the actors/actresses of the
movie, or if the movie is an action movie, which is a genre preferred by both users.
Hence, if we can discover these latent features, we should be able to predict a rating with
respect to a certain user and a certain item, because the features associated with the user
should match with the features associated with the item.
In trying to discover the different features, we also make the assumption that the
number of features would be smaller than the number of users and the number of items.
It should not be difficult to understand this assumption because clearly it would not be
reasonable to assume that each user is associated with a unique feature (although this is
not impossible). And anyway if this is the case there would be no point in making
recommendations, because each of these users would not be interested in the items
rated by other users. Similarly, the same argument applies to the items.
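Let $R$ be the $|U| \times |D|$ matrix of ratings given by a set $U$ of users to a set $D$ of items, and
suppose we want to discover $K$ latent features. The task is then to find two matrices $P$ (of size
$|U| \times K$) and $Q$ (of size $|D| \times K$) such that their product approximates $R$:
$$R \approx P Q^T = \hat{R}$$
In this way, each row of $P$ would represent the strength of the associations between a
user and the features. Similarly, each row of $Q$ would represent the strength of the
associations between an item and the features. To get the prediction of a rating of an
item $d_j$ by user $u_i$, we can calculate the dot product of the two vectors corresponding
to $u_i$ and $d_j$:
$$\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{K} p_{ik} q_{jk}$$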
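A tiny numeric illustration of this prediction step, assuming a user-feature matrix `P` and an item-feature matrix `Q` with one row per user and per item respectively. The values below are made up for illustration only.

```python
import numpy as np

# 2 users x 2 features (made-up values)
P = np.array([[1.0, 0.5],
              [0.2, 1.5]])
# 3 items x 2 features (made-up values)
Q = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])

r_hat_01 = P[0] @ Q[1]   # predicted rating of item 1 by user 0: 1.0*1.0 + 0.5*1.0
R_hat = P @ Q.T          # all user-item predictions at once, shape (2, 3)
```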
Now, we have to find a way to obtain $P$ and $Q$. One way to approach this problem is to
first initialize the two matrices with some values, calculate how 'different' their product
is from $R$, and then try to minimize this difference iteratively. Such a method is called
gradient descent, aiming at finding a local minimum of the difference.
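The difference here, usually called the error between the estimated rating and the real
rating, can be calculated by the following equation for each user-item pair:
$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \Big( r_{ij} - \sum_{k=1}^{K} p_{ik} q_{jk} \Big)^2$$
where $p_{ik}$ and $q_{jk}$ are the entries of the user-feature matrix $P$ and the item-feature matrix $Q$.
Here we consider the squared error because the estimated rating can be either higher or
lower than the real rating.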
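To minimize the error, we have to know in which direction we have to modify the values
of $p_{ik}$ and $q_{jk}$. In other words, we need to know the gradient at the current values, and
therefore we differentiate the above equation with respect to these two variables
separately:
$$\frac{\partial}{\partial p_{ik}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) q_{jk} = -2 e_{ij} q_{jk}$$
$$\frac{\partial}{\partial q_{jk}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) p_{ik} = -2 e_{ij} p_{ik}$$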
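Having obtained the gradient, we can now formulate the update rules for
both $p_{ik}$ and $q_{jk}$:
$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}} e_{ij}^2 = p_{ik} + 2 \alpha e_{ij} q_{jk}$$
$$q'_{jk} = q_{jk} - \alpha \frac{\partial}{\partial q_{jk}} e_{ij}^2 = q_{jk} + 2 \alpha e_{ij} p_{ik}$$
Here, $\alpha$ is a constant whose value determines the rate of approaching the minimum.
Usually we will choose a small value for $\alpha$, say 0.0002. This is because if we make too
large a step towards the minimum we may run into the risk of missing the minimum
and end up oscillating around the minimum.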
A question might have come to your mind by now: if we find two matrices $P$ and $Q$ such
that $P Q^T$ approximates $R$, will our predictions of all the unseen ratings not
be zeros? In fact, we are not really trying to come up with $P$ and $Q$ such that we can
reproduce $R$ exactly. Instead, we will only try to minimise the errors of the observed
user-item pairs. In other words, if we let $T$ be a set of tuples, each of which is in the form
of $(u_i, d_j, r_{ij})$, such that $T$ contains all the observed user-item pairs together with the
associated ratings, we are only trying to minimise every $e_{ij}$ for $(u_i, d_j, r_{ij}) \in T$. (In other
words, $T$ is our set of training data.) As for the rest of the unknowns, we will be able to
determine their values once the associations between the users, items and features have
been learnt.
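Using the above update rules, we can then iteratively perform the operation until the
error converges to its minimum. We can check the overall error as calculated using the
following equation and determine when we should stop the process.
$$E = \sum_{(u_i, d_j, r_{ij}) \in T} e_{ij}^2 = \sum_{(u_i, d_j, r_{ij}) \in T} \Big( r_{ij} - \sum_{k=1}^{K} p_{ik} q_{jk} \Big)^2$$
Here $T$ is the set of observed user-item pairs together with their ratings.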
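The whole procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: missing ratings are encoded as 0, the function names, the number of steps and the stopping tolerance `tol` are choices made here, and the starting matrices are random. The rating matrix is the five-user example from earlier in this section.

```python
import numpy as np

def total_error(R, P, Q):
    """Sum of squared errors over the observed (non-zero) entries of R."""
    mask = R > 0
    return float(np.sum((R - P @ Q.T)[mask] ** 2))

def matrix_factorization(R, P, Q, steps=5000, alpha=0.0002, tol=0.001):
    """Gradient descent on the observed entries; P is users x K, Q is items x K."""
    P, Q = P.copy(), Q.copy()
    observed = [(i, j) for i in range(R.shape[0])
                for j in range(R.shape[1]) if R[i, j] > 0]
    for _ in range(steps):
        for i, j in observed:
            e_ij = R[i, j] - P[i] @ Q[j]        # error for this user-item pair
            p_old = P[i].copy()
            P[i] += 2 * alpha * e_ij * Q[j]     # update the user's feature vector
            Q[j] += 2 * alpha * e_ij * p_old    # update the item's feature vector
        if total_error(R, P, Q) < tol:          # stop once the overall error is small
            break
    return P, Q

# rating matrix from the example above; 0 marks a missing rating (an assumption)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
rng = np.random.default_rng(0)
P0, Q0 = rng.random((5, 2)), rng.random((4, 2))   # K = 2 latent features
P, Q = matrix_factorization(R, P0, Q0)
R_hat = P @ Q.T    # includes predictions for the missing entries
```

With such a small learning rate the fit improves slowly, so the number of steps matters; the values in `R_hat` at the positions of the hyphens are the predicted ratings.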
Matrix completion: