K-Means Problem

The document provides instructions for a clustering analysis assignment. It includes: 1) An unlabeled dataset of 350 samples with 90 attributes each and initial centroids dataset of 15 samples to cluster. 2) Tasks to implement K-means clustering using different distance measures, normalize the data, cluster the data using different numbers of K clusters, and analyze the RSS value at each iteration. 3) The goal is to reveal hidden structures in the data, find the optimal number of clusters K, and discuss RSS value changes and potential empty clusters.

Uploaded by Gentian Strana

Pattern Recognition

Universiteti i Prishtinës

Part two of the project (submission deadline: 23.01.2017)

Problem Description

There exists an unclassified data set that contains hidden structures. The task
in this assignment is to perform a comprehensive cluster analysis in order to
reveal these structures and groups of similar data.
The archive contains an unlabeled data set, test.txt, and a set of initial
centroids, centroids.txt. Both files have the following format:
[attribute1_value <space> attribute2_value <space> ... <space>
attribute90_value].
The unlabeled data set includes 350 samples and the initial centroids set consists
of 15 samples. Data instances in both files have 90 attributes.
Finally, prepare an academic report and deliver it together with the source code
and any additional material that you used during your work.
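
As an illustration of the file format described above, the following is a minimal Python sketch of a loader. It assumes that the square brackets in the format description are notation only and do not appear literally in the files, and that each line holds one sample. The function name load_samples and the in-memory demo data are hypothetical and not part of the assignment material.

```python
import io

import numpy as np


def load_samples(source, n_attributes=90):
    """Read whitespace-separated rows into a 2-D float array.

    `source` may be a file name or an open file-like object.
    """
    data = np.loadtxt(source, dtype=float, ndmin=2)
    if data.shape[1] != n_attributes:
        raise ValueError(
            f"expected {n_attributes} attributes per row, got {data.shape[1]}"
        )
    return data


# Tiny in-memory stand-in for a data file (3 attributes instead of 90).
demo_file = io.StringIO("0.5 1.0 2.0\n3.0 4.5 6.0\n")
demo_data = load_samples(demo_file, n_attributes=3)
```

With the real files, load_samples("test.txt") would be expected to return an array of shape (350, 90) and load_samples("centroids.txt") an array of shape (15, 90), provided the files match the description above.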

Tasks:

1. Implement a simple K-means method that can handle real-valued attributes.
The program must also support the Euclidean, Manhattan (City Block),
Squared Euclidean (the same as the Euclidean distance, but without taking
the square root) and Chebyshev distances. You are free to use any kind of
weights (for features or data instances) in the program if necessary.

2. Rescale the attribute values to obtain normalized data within the range
[0,1], which is more suitable and reliable for proper cluster analysis. You
can use the following equation for rescaling: xNew = (x - Min) / (Max - Min).
Feel free to use your own rescaling method.

3. Perform clustering of the unlabeled data set. You can use the provided
initial centroids set or generate your own. The following stopping criteria
can be considered:
3.1 Maximum number of iterations: 100
3.2 The clusters are consistent (no changes in the group matrix or the
centroids in the current iteration, which means that the clusters are stable).
4. Cluster analysis can also be represented more formally as an optimization
procedure that tries to minimize the Residual Sum of Squares (RSS) objective
function:

$$RSS = \sum_{k=1}^{K} \sum_{x \in \omega_k} \lVert x - \mu(\omega_k) \rVert^2$$

where μ(ωk) is the centroid of a particular cluster k, K is the total number
of clusters, and x is a data sample in the cluster ωk.

4.1 Report the value of the RSS function at each iteration of your program
for a particular distance measure and number of clusters K.

4.2 Discuss how the value of the RSS function changes during cluster analysis,
from the first iteration to the last one: does it increase or decrease, and
why?

5. Try different numbers of clusters in your program (K=2...15) and build a
plot that shows the dependency between the number K and the value of the RSS
function at the last iteration.

5.1 What is the optimal number of clusters K for the given data set?

5.2 Did you get any empty clusters? What is a possible solution to this
problem?
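
The tasks above can be sketched in Python roughly as follows. This is only one possible starting point, not a reference solution. The function names rescale, pairwise and kmeans, the metric labels and the four-sample demo data are all hypothetical. The sketch assumes that the centroid update is always the arithmetic mean, whichever distance is chosen, and that an empty cluster simply keeps its previous centroid.

```python
import numpy as np


def rescale(X):
    """Min-max rescale each attribute to [0, 1]: (x - Min) / (Max - Min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0
    return (X - lo) / span


def pairwise(X, C, metric="euclidean"):
    """Distances from every sample in X to every centroid in C, shape (n, k)."""
    diff = np.abs(X[:, None, :] - C[None, :, :])
    if metric == "manhattan":
        return diff.sum(axis=2)
    if metric == "chebyshev":
        return diff.max(axis=2)
    squared = (diff ** 2).sum(axis=2)
    if metric == "sqeuclidean":
        return squared
    if metric == "euclidean":
        return np.sqrt(squared)
    raise ValueError(f"unknown metric: {metric}")


def kmeans(X, init, metric="euclidean", max_iter=100):
    """Return (labels, centroids, rss_history) for the given initial centroids."""
    C = np.array(init, dtype=float)
    labels = None
    rss_history = []
    for _ in range(max_iter):                      # stopping criterion 3.1
        new_labels = pairwise(X, C, metric).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # stopping criterion 3.2
        labels = new_labels
        for k in range(len(C)):
            members = X[labels == k]
            if len(members) > 0:
                C[k] = members.mean(axis=0)
            # Assumption: an empty cluster keeps its previous centroid.
        # RSS: squared Euclidean distance of each sample to its own centroid.
        rss_history.append(float(((X - C[labels]) ** 2).sum()))
    return labels, C, rss_history


# Hypothetical toy data: two obvious groups in two dimensions.
X_demo = rescale(np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]))
labels_demo, centroids_demo, rss_demo = kmeans(X_demo, X_demo[[0, 2]])
```

For the plot in task 5, kmeans could be called once per K in the range 2...15, recording the last entry of rss_history each time. How the K initial centroids are chosen is left open here. If the provided centroids are used, they would presumably need to be rescaled with the same Min and Max as the data. Empty clusters can be detected by counting labels, for example with np.bincount(labels, minlength=K); re-seeding an empty cluster's centroid from an existing sample is one commonly used remedy.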
