0% found this document useful (0 votes)

160 views38 pages

The K-Means Clustering Algorithm in Java - Baeldung

This document provides an overview of the k-means clustering algorithm and how to implement it in Java. It begins by explaining unsupervised learning and clustering. K-means clustering aims to partition observations into k clusters where each observation belongs to the cluster with the nearest mean. It works by randomly initializing k centroids and then iteratively assigning observations to the closest centroid and recalculating the centroid locations until convergence is reached. The document then discusses how to represent features in data to model different training datasets for clustering in Java.

Uploaded by

jefferyleclerc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

160 views38 pages

The K-Means Clustering Algorithm in Java - Baeldung

Uploaded by

jefferyleclerc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

The K-Means Clustering Algorithm in

(/)

Java

Last updated: January 8, 2024

Written by: Ali Dehghani (https://www.baeldung.com/author/ali-dehghani)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 1/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 2/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Reviewed by:Start Eric Martin (https://www.baeldung.com/editor/eric-editor)

Here (/start-here) Courses Guides
▼ About ▼ ▼

Artificial (/)
Intelligence (https://www.baeldung.com/category/artificial-intelligence)

Get started with Spring and Spring Boot, through the Learn Spring
course:
>> CHECK OUT THE COURSE (/ls-course-start)

1. Overview
Clustering is an umbrella term for a class of unsupervised algorithms to discover groups of things,
people, or ideas that are closely related to each other.
In this apparently simple one-liner definition, we saw a few buzzwords. What exactly is clustering?
What is an unsupervised algorithm?
In this tutorial, we’re going to, first, shed some lights on these concepts. Then, we’ll see how they can
manifest themselves in Java.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 3/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

2. Unsupervised
(/)
Algorithms
Before we use most
Startlearning algorithms, we should
Here (/start-here) somehow feedGuides
Courses some sample data to
▼
them and
About ▼ ▼

allow the algorithm to learn from those data. In Machine Learning terminology, we call that sample
(/)
dataset training data. Also, the whole process is known as the training process.
Anyway, we can classify learning algorithms based on the amount of supervision they need during
the training process. The two main types of learning algorithms in this category are:

Supervised Learning: In supervised algorithms, the training data should include the actual
solution for each point. For example, if we’re about to train our spam filtering algorithm, we feed
both the sample emails and their label, i.e. spam or not-spam, to the algorithm. Mathematically
speaking, we’re going to infer the f(x) from a training set including both xs and ys.
Unsupervised Learning: When there are no labels in training data, then the algorithm is an
unsupervised one. For example, we have plenty of data about musicians and we’re going
discover groups of similar musicians in the data.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 4/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

3. Clustering (/)

Clustering is an unsupervised algorithm to discover

Start Here (/start-here) groups of similarGuides
Courses ▼
things, ideas, or people.
AboutUnlike▼ ▼

supervised algorithms, we’re not training clustering algorithms with examples of known labels.
(/)
Instead, clustering tries to find structures within a training set where no point of the data is the label.

3.1. K-Means Clustering

K-Means is a clustering algorithm with one fundamental property: the number of clusters is defined
in advance. In addition to K-Means, there are other types of clustering algorithms like Hierarchical
Clustering, Affinity Propagation, or Spectral Clustering.

3.2. How K-Means Works

Suppose our goal is to find a few similar groups in a dataset like:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 5/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

(/wp-content/uploads/2019/08/Date-6.png)
K-Means begins with k randomly placed centroids. Centroids, as their name suggests, are the center
points of the clusters. For example, here we’re adding four random centroids:

(/wp-content/uploads/2019/08/Date-7.png)
Then we assign each existing data point to its nearest centroid:
https://w w w .baeldung.com/java-k-means-clustering-algorithm 6/38
3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

(/wp-content/uploads/2019/08/Date-8.png)
After the assignment, we move the centroids to the average location of points assigned to it.
Remember, centroids are supposed to be the center points of clusters:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 7/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

(/wp-content/uploads/2019/08/Date-10.png)

The current iteration concludes each time we’re done relocating the centroids. We repeat these
iterations until the assignment between multiple consecutive iterations stops changing:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 8/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

(/wp-content/uploads/2019/08/Date-copy.png)
When the algorithm terminates, those four clusters are found as expected. Now that we know how
K-Means works, let’s implement it in Java.

3.3. Feature Representation

https://w w w .baeldung.com/java-k-means-clustering-algorithm 9/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

When modeling different training datasets, we need a data structure to represent model attributes
(/) values. For example, a musician can have a genre attribute with a value like
and their corresponding
Rock. We usually use the term feature to refer to the combination of an attribute and its value.
Start Here (/start-here) Courses Guides
▼ About ▼ ▼

(/)

To prepare a dataset for a particular learning algorithm, we usually use a common set of numerical
attributes that can be used to compare different items. For example, if we let our users tag each artist
with a genre, then at the end of the day, we can count how many times each artist is tagged with a
specific genre:

(/wp-content/uploads/2019/08/Screen-Shot-1398-04-29-at-22.30.58.png)
The feature vector for an artist like Linkin Park is [rock -> 7890, nu-metal -> 700, alternative -> 520,
pop -> 3]. So if we could find a way to represent attributes as numerical values, then we can simply
compare two different items, e.g. artists, by comparing their corresponding vector entries.
https://w w w .baeldung.com/java-k-means-clustering-algorithm 10/38
3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

Since numeric vectors are such versatile data structures, we’re going to represent features using
(/)
them. Here’s how we implement feature vectors in Java:

Start {Here (/start-here)

public class Record Courses ▼ Guides ▼ About ▼

private final String description;

(/)
private final Map<String, Double> features;

// constructor, getter, toString, equals and hashcode

}

3.4. Finding Similar Items

In each iteration of K-Means, we need a way to find the nearest centroid to each item in the dataset.
One of the simplest ways to calculate the distance between two feature vectors is to use Euclidean
Distance (https://en.wikipedia.org/wiki/Euclidean_distance). The Euclidean distance between two
vectors like [p1, q1] and [p2, q2] is equal to:

(/wp-

content/uploads/2019/08/4febdae84cbc320c19dd13eac5060a984fd438d8.svg)
Let’s implement this function in Java. First, the abstraction:

public interface Distance {

double calculate(Map<String, Double> f1, Map<String, Double> f2);
}

https://w w w .baeldung.com/java-k-means-clustering-algorithm 11/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

In addition to Euclidean distance, there are other approaches to compute the distance or similarity
between different items(/) like the Pearson Correlation Coefficient (/cs/correlation-coefficient). This
abstraction makes it easy to switch between different distance metrics.
Start Here (/start-here)
Let’s see the implementation Courses
for Euclidean distance: Guides
▼ About ▼ ▼

(/)
public class EuclideanDistance implements Distance {

@Override
public double calculate(Map<String, Double> f1, Map<String, Double> f2) {
double sum = 0;
for (String key : f1.keySet()) {
Double v1 = f1.get(key);
Double v2 = f2.get(key);

if (v1 != null && v2 != null) {

sum += Math.pow(v1 - v2, 2);
}
}

return Math.sqrt(sum);
}
}

First, we calculate the sum of squared differences between corresponding entries. Then, by applying
the sqrt function, we compute the actual Euclidean distance.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 12/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

3.5. Centroid Representation

Centroids are in the same space as normal features, so we can represent them similar to features:

public class Centroid {

private final Map<String, Double> coordinates;

// constructors, getter, toString, equals and hashcode

}

Now that we have a few necessary abstractions in place, it’s time to write our K-Means
implementation. Here’s a quick look at our method signature:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 13/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

public class KMeans {(/)

private static final Random random = new Random();

Start Here (/start-here) Courses ▼ Guides ▼ About ▼
public static Map<Centroid, List<Record>> fit(List<Record> records,
(/)k,
int
Distance distance,
int maxIterations) {
// omitted
}
}

Let’s break down this method signature:

The dataset is a set of feature vectors. Since each feature vector is a Record, then the dataset
type is List<Record>
The k parameter determines the number of clusters, which we should provide in advance
distance encapsulates the way we’re going to calculate the difference between two features
K-Means terminates when the assignment stops changing for a few consecutive iterations. In
addition to this termination condition, we can place an upper bound for the number of iterations,
too. The maxIterations argument determines that upper bound
When K-Means terminates, each centroid should have a few assigned features, hence we’re
using a Map<Centroid, List<Record>> as the return type. Basically, each map entry corresponds
to a cluster

3.6. Centroid Generation

The first step is to generate k randomly placed centroids.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 14/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

Although each centroid can contain totally random coordinates, it’s a good practice to generate
(/)
random coordinates between the minimum and maximum possible values for each attribute.
Generating random centroids without considering the range of possible values would cause the
algorithm to converge more(/start-here)
Start Here slowly. Courses Guides
▼ About ▼ ▼

First, we should
(/) compute the minimum and maximum value for each attribute, and then, generate
the random values between each pair of them:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 15/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

private static List<Centroid>

(/) randomCentroids(List<Record> records, int k) {
List<Centroid> centroids = new ArrayList<>();
Map<String, Double> maxs = new HashMap<>();
Map<String,Start
Double>
Heremins = new HashMap<>(); Courses ▼
(/start-here) Guides ▼ About ▼

for (/)
(Record record : records) {
record.getFeatures().forEach((key, value) -> {
// compares the value with the current max and choose the bigger value between them
maxs.compute(key, (k1, max) -> max == null || value > max ? value : max);

// compare the value with the current min and choose the smaller value between them
mins.compute(key, (k1, min) -> min == null || value < min ? value : min);
});
}

Set<String> attributes = records.stream()

.flatMap(e -> e.getFeatures().keySet().stream())
.collect(toSet());
for (int i = 0; i < k; i++) {
Map<String, Double> coordinates = new HashMap<>();
for (String attribute : attributes) {
double max = maxs.get(attribute);
double min = mins.get(attribute);
coordinates.put(attribute, random.nextDouble() * (max - min) + min);
}

centroids.add(new Centroid(coordinates));
}

return centroids;
}

Now, we can assign each record to one of these random centroids.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 16/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

3.7. Assignment
(/)
First off, given a Record, we should find the centroid nearest to it:
Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 17/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

private static Centroid

(/) nearestCentroid(Record record, List<Centroid> centroids, Distance distance)
{
double minimumDistance = Double.MAX_VALUE;
Centroid nearest = null;
Start Here (/start-here) Courses ▼ Guides ▼ About ▼
for (/)
(Centroid centroid : centroids) {
double currentDistance = distance.calculate(record.getFeatures(),
centroid.getCoordinates());

if (currentDistance < minimumDistance) {

minimumDistance = currentDistance;
nearest = centroid;
}
}

return nearest;
}

Each record belongs to its nearest centroid cluster:

private static void assignToCluster(Map<Centroid, List<Record>> clusters,

Record record,
Centroid centroid) {
clusters.compute(centroid, (key, list) -> {
if (list == null) {
list = new ArrayList<>();
}

list.add(record);
return list;
});
}

https://w w w .baeldung.com/java-k-means-clustering-algorithm 18/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

3.8. Centroid Relocation

(/)
If, after one iteration, a centroid does not contain any assignments, then we won’t relocate it.
Start Here
Otherwise, we should (/start-here)
relocate Courses for each attribute
the centroid coordinate Guides
▼
to the average About
location of all
▼ ▼

assigned (/)records:

private static Centroid average(Centroid centroid, List<Record> records) {

if (records == null || records.isEmpty()) {
return centroid;
}

Map<String, Double> average = centroid.getCoordinates();

records.stream().flatMap(e -> e.getFeatures().keySet().stream())
.forEach(k -> average.put(k, 0.0));

for (Record record : records) {

record.getFeatures().forEach(
(k, v) -> average.compute(k, (k1, currentValue) -> v + currentValue)
);
}

average.forEach((k, v) -> average.put(k, v / records.size()));

return new Centroid(average);

}

Since we can relocate a single centroid, now it’s possible to implement

the relocateCentroids method:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 19/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

private static List<Centroid>

(/) relocateCentroids(Map<Centroid, List<Record>> clusters) {
return clusters.entrySet().stream().map(e -> average(e.getKey(),
e.getValue())).collect(toList());
} Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)
This simple one-liner iterates through all centroids, relocates them, and returns the new centroids.

3.9. Putting It All Together

In each iteration, after assigning all records to their nearest centroid, first, we should compare the
current assignments with the last iteration.
If the assignments were identical, then the algorithm terminates. Otherwise, before jumping to the
next iteration, we should relocate the centroids:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 20/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

public static Map<Centroid,

(/) List<Record>> fit(List<Record> records,
int k,
Distance distance,
int maxIterations) {
Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)
List<Centroid> centroids = randomCentroids(records, k);
Map<Centroid, List<Record>> clusters = new HashMap<>();
Map<Centroid, List<Record>> lastState = new HashMap<>();

// iterate for a pre-defined number of times

for (int i = 0; i < maxIterations; i++) {
boolean isLastIteration = i == maxIterations - 1;

// in each iteration we should find the nearest centroid for each record
for (Record record : records) {
Centroid centroid = nearestCentroid(record, centroids, distance);
assignToCluster(clusters, record, centroid);
}

// if the assignments do not change, then the algorithm terminates

boolean shouldTerminate = isLastIteration || clusters.equals(lastState);
lastState = clusters;
if (shouldTerminate) {
break;
}

// at the end of each iteration we should relocate the centroids

centroids = relocateCentroids(clusters);
clusters = new HashMap<>();
}

return lastState;
}

https://w w w .baeldung.com/java-k-means-clustering-algorithm 21/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

4. Example: Discovering
(/)
Similar Artists on Last.fm
Last.fm builds a Start
detailed
Hereprofile of each user’s musical
(/start-here) Coursestaste by recording
▼ Guidesdetails of what About
the user▼ ▼

listens to. In this section, we’re going to find clusters of similar artists. To build a dataset appropriate for
(/) ll use three APIs from Last.fm:
this task, we’
1. API to get a collection of top artists (https://www.last.fm/api/show/chart.getTopArtists) on
Last.fm.
2. Another API to find popular tags (https://www.last.fm/api/show/chart.getTopTags). Each user
can tag an artist with something, e.g. rock. So, Last.fm maintains a database of those tags and
their frequencies.
3. And an API to get the top tags for an artist (https://www.last.fm/api/show/artist.getTopTags),
ordered by popularity. Since there are many such tags, we’ll only keep those tags that are among
the top global tags.

4.1. Last.fm’s API

To use these APIs, we should get an API Key from Last.fm (https://www.last.fm/api/authentication)
and send it in every HTTP request. We’re going to use the following Retrofit (/retrofit) service for
calling those APIs:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 22/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

public interface LastFmService {

@GET("/2.0/?method=chart.gettopartists&format=json&limit=50")
Call<Artists> topArtists(@Query("page") int page);

@GET("/2.0/?method=artist.gettoptags&format=json&limit=20&autocorrect=1")
Call<Tags> topTagsFor(@Query("artist") String artist);

@GET("/2.0/?method=chart.gettoptags&format=json&limit=100")
Call<TopTags> topTags();

// A few DTOs and one interceptor

}

So, let’s find the most popular artists on Last.fm:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 23/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

// setting up the Retrofit

(/) service
private static List<String> getTop100Artists() throws IOException {
List<String> artists
Start Here= (/start-here)
new ArrayList<>(); Courses ▼ Guides ▼ About ▼
// Fetching the first two pages, each containing 50 records.
for (/)
(int i = 1; i <= 2; i++) {
artists.addAll(lastFm.topArtists(i).execute().body().all());
}

return artists;
}

Similarly, we can fetch the top tags:

private static Set<String> getTop100Tags() throws IOException {

return lastFm.topTags().execute().body().all();
}

Finally, we can build a dataset of artists along with their tag frequencies:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 24/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

private static List<Record>

(/) datasetWithTaggedArtists(List<String> artists,
Set<String> topTags) throws IOException {
List<Record> records = new ArrayList<>();
for (StringStart
artist : artists)
Here {
(/start-here) Courses ▼ Guides ▼ About ▼
Map<String, Double> tags = lastFm.topTagsFor(artist).execute().body().all();
(/)
// Only keep popular tags.
tags.entrySet().removeIf(e -> !topTags.contains(e.getKey()));

records.add(new Record(artist, tags));

}

return records;
}

4.2. Forming Artist Clusters

Now, we can feed the prepared dataset to our K-Means implementation:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 25/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

List<String> artists (/)

= getTop100Artists();
Set<String> topTags = getTop100Tags();
List<Record> records = datasetWithTaggedArtists(artists, topTags);
Start Here (/start-here) Courses ▼ Guides ▼ About ▼
Map<Centroid, List<Record>> clusters = KMeans.fit(records, 7, new EuclideanDistance(), 1000);
(/) the cluster configuration
// Printing
clusters.forEach((key, value) -> {
System.out.println("-------------------------- CLUSTER ----------------------------");

// Sorting the coordinates to see the most significant tags first.

System.out.println(sortedCentroid(key));
String members = String.join(", ",
value.stream().map(Record::getDescription).collect(toSet()));
System.out.print(members);

System.out.println();
System.out.println();
});

If we run this code, then it would visualize the clusters as text output:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 26/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

------------------------------
(/) CLUSTER -----------------------------------
Centroid {classic rock=65.58333333333333, rock=64.41666666666667, british=20.333333333333332, ... }
David Bowie, Led Zeppelin, Pink Floyd, System of a Down, Queen, blink-182, The Rolling Stones,
Metallica, Start Here (/start-here) Courses ▼ Guides ▼ About ▼
Fleetwood Mac, The Beatles, Elton John, The Clash
(/)
------------------------------ CLUSTER -----------------------------------
Centroid {Hip-Hop=97.21428571428571, rap=64.85714285714286, hip hop=29.285714285714285, ... }
Kanye West, Post Malone, Childish Gambino, Lil Nas X, A$AP Rocky, Lizzo, xxxtentacion,
Travi$ Scott, Tyler, the Creator, Eminem, Frank Ocean, Kendrick Lamar, Nicki Minaj, Drake

------------------------------ CLUSTER -----------------------------------

Centroid {indie rock=54.0, rock=52.0, Psychedelic Rock=51.0, psychedelic=47.0, ... }
Tame Impala, The Black Keys

------------------------------ CLUSTER -----------------------------------

Centroid {pop=81.96428571428571, female vocalists=41.285714285714285, indie=22.785714285714285, ...
}
Ed Sheeran, Taylor Swift, Rihanna, Miley Cyrus, Billie Eilish, Lorde, Ellie Goulding, Bruno Mars,
Katy Perry, Khalid, Ariana Grande, Bon Iver, Dua Lipa, Beyoncé, Sia, P!nk, Sam Smith, Shawn Mendes,
Mark Ronson, Michael Jackson, Halsey, Lana Del Rey, Carly Rae Jepsen, Britney Spears, Madonna,
Adele, Lady Gaga, Jonas Brothers

------------------------------ CLUSTER -----------------------------------

Centroid {indie=95.23076923076923, alternative=70.61538461538461, indie rock=64.46153846153847, ...
}
Twenty One Pilots, The Smiths, Florence + the Machine, Two Door Cinema Club, The 1975, Imagine
Dragons,
The Killers, Vampire Weekend, Foster the People, The Strokes, Cage the Elephant, Arcade Fire,
Arctic Monkeys

------------------------------ CLUSTER -----------------------------------

Centroid {electronic=91.6923076923077, House=39.46153846153846, dance=38.0, ... }
Charli XCX, The Weeknd, Daft Punk, Calvin Harris, MGMT, Martin Garrix, Depeche Mode, The
Chainsmokers,
Avicii, Kygo, Marshmello, David Guetta, Major Lazer
https://w w w .baeldung.com/java-k-means-clustering-algorithm 27/38
3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)
------------------------------ CLUSTER -----------------------------------
Centroid {rock=87.38888888888889, alternative=72.11111111111111, alternative rock=49.16666666, ...
}
Start
Weezer, The White Here (/start-here)
Stripes, Courses
Nirvana, Foo Fighters, Guides
Maroon ▼5, Oasis, Panic! at the
▼ About ▼
Disco, Gorillaz,
Green Day, The Cure, Fall Out Boy, OneRepublic, Paramore, Coldplay, Radiohead, Linkin Park,
(/)
Red Hot Chili Peppers, Muse

Since centroid coordinations are sorted by the average tag frequency, we can easily spot the
dominant genre in each cluster. For example, the last cluster is a cluster of a good old rock-bands, or
the second one is filled with rap stars.
Although this clustering makes sense, for the most part, it’s not perfect since the data is merely
collected from user behavior.

5. Visualization
A few moments ago, our algorithm visualized the cluster of artists in a terminal-friendly way. If we
convert our cluster configuration to JSON and feed it to D3.js, then with a few lines of JavaScript, we’ll
have a nice human-friendly Radial Tidy-Tree (https://observablehq.com/@d3/radial-tidy-tree?
collection=@d3/d3-hierarchy):

https://w w w .baeldung.com/java-k-means-clustering-algorithm 28/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 29/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 30/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)
(/wp-content/uploads/2019/08/Screen-Shot-1398-05-04-at-12.09.40.png)
We have to convert our Map<Centroid, List<Record>> to a JSON with a similar schema like this d3.js
example (https:/Start
/raw.githubusercontent.com/d3/d3-hierarchy/v1.1.8/test/data/flare.json).
Here (/start-here) Courses Guides ▼ About ▼ ▼

(/)

6. Number of Clusters
One of the fundamental properties of K-Means is the fact that we should define the number of
clusters in advance. So far, we used a static value for k, but determining this value can be a
challenging problem. There are two common ways to calculate the number of clusters:
1. Domain Knowledge
2. Mathematical Heuristics
If we’re lucky enough that we know so much about the domain, then we might be able to simply
guess the right number. Otherwise, we can apply a few heuristics like Elbow Method or Silhouette
Method to get a sense on the number of clusters.
Before going any further, we should know that these heuristics, although useful, are just heuristics
and may not provide clear-cut answers.

6.1. Elbow Method

To use the elbow method, we should first calculate the difference between each cluster centroid and
all its members. As we group more unrelated members in a cluster, the distance between the
centroid and its members goes up, hence the cluster quality decreases.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 31/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

One way to perform this distance calculation is to use the Sum of Squared Errors. Sum of squared
errors or SSE is equal (/)
to the sum of squared differences between a centroid and all its members:

Start Here
public static double (/start-here)List<Record>>
sse(Map<Centroid, Courses
clustered,
▼ Guides
Distance distance)
▼ { About ▼
double sum = 0;
(/)
for (Map.Entry<Centroid, List<Record>> entry : clustered.entrySet()) {
Centroid centroid = entry.getKey();
for (Record record : entry.getValue()) {
double d = distance.calculate(centroid.getCoordinates(), record.getFeatures());
sum += Math.pow(d, 2);
}
}

return sum;
}

Then, we can run the K-Means algorithm for different values of k and calculate the SSE for each of
them:

List<Record> records = // the dataset;

Distance distance = new EuclideanDistance();
List<Double> sumOfSquaredErrors = new ArrayList<>();
for (int k = 2; k <= 16; k++) {
Map<Centroid, List<Record>> clusters = KMeans.fit(records, k, distance, 1000);
double sse = Errors.sse(clusters, distance);
sumOfSquaredErrors.add(sse);
}

At the end of the day, it’s possible to find an appropriate k by plotting the number of clusters against
the SSE:

https://w w w .baeldung.com/java-k-means-clustering-algorithm 32/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 33/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

(/wp-content/uploads/2019/08/Screen-Shot-1398-05-04-at-17.01.36.png)
Usually, as the number of clusters increases, the distance between cluster members decreases.
However, we can’t choose any arbitrary large values for k, since having multiple clusters with just one
member defeats the whole purpose of clustering.
The idea behind the elbow method is to find an appropriate value for k in a way that the SSE
decreases dramatically around that value. For example, k=9 can be a good candidate here.

https://w w w .baeldung.com/java-k-means-clustering-algorithm 34/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

7. Conclusion (/)
In this tutorial, first, weHere
Start covered a few importantCourses
(/start-here) concepts in Machine Learning. Then we About
Guides
▼
got ▼ ▼

aquatinted with the mechanics of the K-Means clustering algorithm. Finally, we wrote a simple
(/)
implementation for K-Means, tested our algorithm with a real-world dataset from Last.fm, and
visualized the clustering result in a nice graphical way.
As usual, the sample code is available on our GitHub
(https://github.com/eugenp/tutorials/tree/master/algorithms-modules/algorithms-miscellaneous-
3) project, so make sure to check it out!

Get started with Spring and Spring Boot, through the Learn Spring
course:
>> CHECK OUT THE COURSE (/ls-course-end)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 35/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

Learning to build your API

with Spring?
Download the E-book (/rest-api-spring-

guide)

Comments are closed on this article!

https://w w w .baeldung.com/java-k-means-clustering-algorithm 36/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

(/)

Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)

COURSES
ALL COURSES (/ALL-COURSES)
ALL BULK COURSES (/ALL-BULK-COURSES)
ALL BULK TEAM COURSES (/ALL-BULK-TEAM-COURSES)
THE COURSES PLATFORM (HTTPS://COURSES.BAELDUNG.COM)

SERIES
JAVA “BACK TO BASICS” TUTORIAL (/JAVA-TUTORIAL)
JACKSON JSON TUTORIAL (/JACKSON)
APACHE HTTPCLIENT TUTORIAL (/HTTPCLIENT-GUIDE)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 37/38

3/10/24, 6:35 AM The K-Means Clustering Algorithm in Java | Baeldung

REST W ITH SPRING TUTORIAL (/REST-W ITH-SPRING-SERIES)

(/) (/PERSISTENCE-W ITH-SPRING-SERIES)
SPRING PERSISTENCE TUTORIAL
SECURITY W ITH SPRING (/SECURITY-SPRING)
SPRING REACTIVE TUTORIALS (/SPRING-REACTIVE-GUIDE)
Start Here (/start-here) Courses ▼ Guides ▼ About ▼

(/)
ABOUT
ABOUT BAELDUNG (/ABOUT)
THE FULL ARCHIVE (/FULL_ARCHIVE)
EDITORS (/EDITORS)
JOBS (/TAG/ACTIVE-JOB/)
OUR PARTNERS (/PARTNERS)
PARTNER W ITH BAELDUNG (/ADVERTISE)

TERMS OF SERVICE (/TERMS-OF-SERVICE)

https://w w w .baeldung.com/java-k-means-clustering-algorithm 38/38

" by Nils Gottfries (2013), Palgrave Macmillan. This Is An Advanced
No ratings yet
" by Nils Gottfries (2013), Palgrave Macmillan. This Is An Advanced
6 pages
Coma
100% (1)
Coma
42 pages
PSS 5000 APNO Vehicle Tagging 80510800
100% (1)
PSS 5000 APNO Vehicle Tagging 80510800
46 pages
Penberthy Eductor
No ratings yet
Penberthy Eductor
16 pages
Unit-5 Unit-5: Case Studies of Big Data Analytics Using Map-Reduce Programming
No ratings yet
Unit-5 Unit-5: Case Studies of Big Data Analytics Using Map-Reduce Programming
11 pages
Grami Product List & Price 2021
No ratings yet
Grami Product List & Price 2021
6 pages
Recreational & Adventure 1
50% (2)
Recreational & Adventure 1
40 pages
Ecs268: Structural & Material Laboratory: I. Objective
No ratings yet
Ecs268: Structural & Material Laboratory: I. Objective
7 pages
Clustering K-Means
100% (2)
Clustering K-Means
28 pages
Lesson 1 Intro To Orgl Behavior
No ratings yet
Lesson 1 Intro To Orgl Behavior
19 pages
Polar & Non Polar-Electronegativity
No ratings yet
Polar & Non Polar-Electronegativity
23 pages
Belvilla en - Rent Out Your Holiday Home Successfully
No ratings yet
Belvilla en - Rent Out Your Holiday Home Successfully
4 pages
Catalogue Corolla Altis Compressed 1
No ratings yet
Catalogue Corolla Altis Compressed 1
8 pages
Banana Fibre Extracting Project
No ratings yet
Banana Fibre Extracting Project
2 pages
General Feedback For Module 7
No ratings yet
General Feedback For Module 7
1 page
Machine Learning & Data Mining: Understanding
No ratings yet
Machine Learning & Data Mining: Understanding
7 pages
Machine Learning & Data Mining
No ratings yet
Machine Learning & Data Mining
108 pages
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-C
No ratings yet
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-C
10 pages
Clustering Algorithm: An Unsupervised Learning Approach
No ratings yet
Clustering Algorithm: An Unsupervised Learning Approach
23 pages
Unsupervised Learning
No ratings yet
Unsupervised Learning
24 pages
158 9 (Clustering)
No ratings yet
158 9 (Clustering)
36 pages
Weka
No ratings yet
Weka
22 pages
Clustering
No ratings yet
Clustering
84 pages
UnSupervised Learning
No ratings yet
UnSupervised Learning
40 pages
Assignment No. A6: 1 Title
No ratings yet
Assignment No. A6: 1 Title
5 pages
Presentation: Operating System Concept CS-582
No ratings yet
Presentation: Operating System Concept CS-582
13 pages
YETI Documentation: Release 1.0
No ratings yet
YETI Documentation: Release 1.0
53 pages
ENVI Classic Tutorial: Target Detection
No ratings yet
ENVI Classic Tutorial: Target Detection
18 pages
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-H
No ratings yet
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-H
4 pages
Analysis of Mapreduce Algorithms: Harini Padmanaban
No ratings yet
Analysis of Mapreduce Algorithms: Harini Padmanaban
6 pages
FHFA Motion Not To Disclose Freddie Mac Accounting
No ratings yet
FHFA Motion Not To Disclose Freddie Mac Accounting
11 pages
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-A
No ratings yet
2023 Data, Analytics, and Artificial Intelligence Adoption Strategy-A
7 pages
7.introduction To Clustering
No ratings yet
7.introduction To Clustering
11 pages
Chemical Identity: Material Safety Data Sheet Gasoline/Petrol
No ratings yet
Chemical Identity: Material Safety Data Sheet Gasoline/Petrol
4 pages
Bearings Archives - Marine Engineering Study Materials
100% (1)
Bearings Archives - Marine Engineering Study Materials
5 pages
K Mean
No ratings yet
K Mean
12 pages
Technical Data Sheet: Zwaluw Fix-O-Chem (Styrene Free)
No ratings yet
Technical Data Sheet: Zwaluw Fix-O-Chem (Styrene Free)
2 pages
Improved K-Means Map Reduce Algorithm For Big Data Cluster Analysis
No ratings yet
Improved K-Means Map Reduce Algorithm For Big Data Cluster Analysis
7 pages
2 Mapreduce Model Principles
No ratings yet
2 Mapreduce Model Principles
7 pages
MapReduce - What It Is, and Why It Is So Popular
No ratings yet
MapReduce - What It Is, and Why It Is So Popular
7 pages
K-Means Clustering Optimization Algorithm Based On Mapreduce
No ratings yet
K-Means Clustering Optimization Algorithm Based On Mapreduce
6 pages
Balanced K-Means Revisited-1
No ratings yet
Balanced K-Means Revisited-1
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-P
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-P
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-4
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-4
3 pages
Data Visualization Cheat Sheet For Basic Machine Learning Algorithms - by Boriharn K - Mar, 2024 - Towards Data Science
No ratings yet
Data Visualization Cheat Sheet For Basic Machine Learning Algorithms - by Boriharn K - Mar, 2024 - Towards Data Science
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-A
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-A
6 pages
Fuzzy K-Mean Clustering in Mapreduce On Cloud Based Hadoop: Dweepna Garg
No ratings yet
Fuzzy K-Mean Clustering in Mapreduce On Cloud Based Hadoop: Dweepna Garg
4 pages
The Incremental Online K Means Clustering Algorithm and Its Application To Color Quantization
No ratings yet
The Incremental Online K Means Clustering Algorithm and Its Application To Color Quantization
42 pages
Chapter 5. Clustering Algorithms-Stud
No ratings yet
Chapter 5. Clustering Algorithms-Stud
44 pages
Unit 4 Clustering - K-Means and Hierarchical
No ratings yet
Unit 4 Clustering - K-Means and Hierarchical
40 pages
Unit 4
No ratings yet
Unit 4
74 pages
Clustering in Machine Learning
No ratings yet
Clustering in Machine Learning
20 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-O
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-O
3 pages
P-3 1 2-Kmeans
No ratings yet
P-3 1 2-Kmeans
43 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-5
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-5
4 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-9
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-9
4 pages
Hadoop
No ratings yet
Hadoop
7 pages
Data Mining For BI - Part 5
No ratings yet
Data Mining For BI - Part 5
34 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-17
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-17
3 pages
Fast Scalable K-Means++ Algorithm With Mapreduce
No ratings yet
Fast Scalable K-Means++ Algorithm With Mapreduce
2 pages
Crew Info Stew
No ratings yet
Crew Info Stew
7 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community
3 pages
WWW Simplilearn Com Tutorials Machine Learning Tutorial K Means Clustering Algor
No ratings yet
WWW Simplilearn Com Tutorials Machine Learning Tutorial K Means Clustering Algor
19 pages
Clustering Techniques - Hierarchical, K-Means Clustering
No ratings yet
Clustering Techniques - Hierarchical, K-Means Clustering
22 pages
Paper Dvi
No ratings yet
Paper Dvi
7 pages
DM Lecture 06
No ratings yet
DM Lecture 06
32 pages
Week 11
No ratings yet
Week 11
49 pages
Cluster Lecture-1
No ratings yet
Cluster Lecture-1
20 pages
Futuretech Olympiad 2022
No ratings yet
Futuretech Olympiad 2022
12 pages
Tutorial For K Means Clustering in Python Sklearn - MLK - Machine Learning Knowledge-5
No ratings yet
Tutorial For K Means Clustering in Python Sklearn - MLK - Machine Learning Knowledge-5
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-14
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-14
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-16
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-16
3 pages
Balanced K-Means Revisited-5
No ratings yet
Balanced K-Means Revisited-5
3 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-1Q
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-1Q
2 pages
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-1E
No ratings yet
SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-1E
2 pages
A Distance-Based Kernel For Classification Via Support Vector Machines - PMC-17
No ratings yet
A Distance-Based Kernel For Classification Via Support Vector Machines - PMC-17
1 page
Clustering
No ratings yet
Clustering
6 pages
SBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique For Big Data
No ratings yet
SBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique For Big Data
7 pages
ML UNIT 4 Sir
No ratings yet
ML UNIT 4 Sir
42 pages
ML CH 4
No ratings yet
ML CH 4
51 pages
Sinopsis Muhammad Haris Yulianto-1
No ratings yet
Sinopsis Muhammad Haris Yulianto-1
6 pages
ML - Unit - 2
No ratings yet
ML - Unit - 2
13 pages
K, Eans
No ratings yet
K, Eans
4 pages
Day 3 - Content
No ratings yet
Day 3 - Content
50 pages
Daily Report Swiss Embassy Jakarta
No ratings yet
Daily Report Swiss Embassy Jakarta
1 page
ML Unit 4 V1
No ratings yet
ML Unit 4 V1
30 pages
Tmco Single Chamber Manual
No ratings yet
Tmco Single Chamber Manual
13 pages
Big Data
No ratings yet
Big Data
21 pages
K-Means Clustering
No ratings yet
K-Means Clustering
6 pages
CLUSTERING
No ratings yet
CLUSTERING
11 pages
K-Means Clustering Algorithm
No ratings yet
K-Means Clustering Algorithm
40 pages
The Role of Catestatin in Pree
No ratings yet
The Role of Catestatin in Pree
18 pages
K Mean
No ratings yet
K Mean
7 pages
Improving Statistical Methods To Protect Wildlife Populations - ScienceDaily
No ratings yet
Improving Statistical Methods To Protect Wildlife Populations - ScienceDaily
7 pages
Notice To IEA Dwarka Museum
No ratings yet
Notice To IEA Dwarka Museum
2 pages
Exp 7
No ratings yet
Exp 7
6 pages
K Means
No ratings yet
K Means
2 pages
ML 12
No ratings yet
ML 12
19 pages
RGUHS - B.SC Nursing - 2012 - 1 - Mar - 1754 Anatomy and Physiology (Rs 3)
No ratings yet
RGUHS - B.SC Nursing - 2012 - 1 - Mar - 1754 Anatomy and Physiology (Rs 3)
1 page
K-Means ML
No ratings yet
K-Means ML
23 pages
K Means Final
No ratings yet
K Means Final
10 pages
Som New
No ratings yet
Som New
21 pages
Clustering
No ratings yet
Clustering
80 pages
Unsupervised Learning Final
No ratings yet
Unsupervised Learning Final
17 pages
CCN202 Kinetix 5700 Troubelshooting and Project Interpretation
No ratings yet
CCN202 Kinetix 5700 Troubelshooting and Project Interpretation
2 pages
Lec 37
No ratings yet
Lec 37
13 pages
Charlton Salt Screener
No ratings yet
Charlton Salt Screener
2 pages
K-Means Data Clustering From Scratch Using C# - Visual Studio Magazine
No ratings yet
K-Means Data Clustering From Scratch Using C# - Visual Studio Magazine
9 pages
K Means Algorithm
No ratings yet
K Means Algorithm
4 pages
ML Mod 4 Part 1
No ratings yet
ML Mod 4 Part 1
99 pages
Lecture Unsupervised (17!04!2024)
No ratings yet
Lecture Unsupervised (17!04!2024)
61 pages
ML Module5 Clustering
No ratings yet
ML Module5 Clustering
71 pages
Clustering and Dimensionality Reduction
No ratings yet
Clustering and Dimensionality Reduction
58 pages
Lecture 2.1.1 To 2.1.2
No ratings yet
Lecture 2.1.1 To 2.1.2
97 pages
Unsupervised ML
No ratings yet
Unsupervised ML
15 pages
K-Means Clustering. in My Previous Blog, We Have Seen Some - by Seema Singh - DataDrivenInvestor
No ratings yet
K-Means Clustering. in My Previous Blog, We Have Seen Some - by Seema Singh - DataDrivenInvestor
19 pages
L7 Clustering
No ratings yet
L7 Clustering
58 pages
ml2 1
No ratings yet
ml2 1
7 pages