K-Means Algorithm

The document provides a comprehensive guide on K-Means clustering in Python, explaining clustering techniques, particularly K-Means and density-based clustering. It outlines the algorithm's steps, implementation using scikit-learn, and the importance of feature scaling. Additionally, it includes practical examples and code snippets for building a K-Means clustering pipeline with real-world datasets.

K-Means Clustering in Python: A Practical Guide

What Is Clustering?

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data:

1. Meaningfulness
2. Usefulness

Meaningful clusters expand domain knowledge. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.

Useful clusters, on the other hand, serve as an intermediate step in a data pipeline. For example, businesses use clustering for customer segmentation. The clustering results segment customers into groups with similar purchase histories, which businesses can then use to create targeted advertising campaigns.

Note: You'll learn about unsupervised machine learning techniques in this tutorial. If you're interested in learning more about supervised machine learning techniques, then check out Logistic Regression in Python.

There are many other applications of clustering, such as document clustering and social network analysis. These applications are relevant in nearly every industry, making clustering a valuable skill for professionals working with data in any field.

Density-Based Clustering

Density-based clustering determines cluster assignments based on the density of data points in a region. Clusters are assigned where there are high densities of data points separated by low-density regions.

Unlike the other clustering categories, this approach doesn't require the user to specify the number of clusters. Instead, there is a distance-based parameter that acts as a tunable threshold. This threshold determines how close points must be to be considered a cluster member.

Examples of density-based clustering algorithms include Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, and Ordering Points To Identify the Clustering Structure, or OPTICS.

The strengths of density-based clustering methods include the following:

* They excel at identifying clusters of nonspherical shapes.
* They're resistant to outliers.

The weaknesses of density-based clustering methods include the following:

* They aren't well suited for clustering in high-dimensional spaces.
* They have trouble identifying clusters of varying densities.

How to Perform K-Means Clustering in Python

In this section, you'll take a step-by-step tour of the conventional version of the k-means algorithm. Understanding the details of the algorithm is a fundamental step in the process of writing your k-means clustering pipeline in Python. What you learn in this section will help you decide if k-means is the right choice to solve your clustering problem.

Understanding the K-Means Algorithm

Conventional k-means requires only a few steps. The first step is to randomly select k centroids, where k is equal to the number of clusters you choose. Centroids are data points representing the center of a cluster.

The main element of the algorithm works by a two-step process called expectation-maximization. The expectation step assigns each data point to its nearest centroid. Then, the maximization step computes the mean of all the points for each cluster and sets the new centroid.
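To make the two-step process concrete, here's a minimal sketch of expectation-maximization for k-means written in plain NumPy. This is an illustrative addition rather than code from the original tutorial, and the function name and defaults are arbitrary:

```python
import numpy as np

def kmeans_em(points, k, n_iter=100, seed=0):
    """Minimal k-means via expectation-maximization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Randomly pick k of the points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Maximization step: recompute each centroid as its cluster's mean.
        # (A production version would also handle clusters that become empty.)
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving, so the algorithm has converged.
        centroids = new_centroids
    return centroids, labels
```

The loop body maps one-to-one onto the pseudocode below: the assignment step is the expectation, and the centroid update is the maximization.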
Here's what the conventional version of the k-means algorithm looks like:

```
Algorithm 1: k-means
1: Specify the number k of clusters to assign.
2: Randomly initialize k centroids.
3: repeat
4:     expectation: Assign each point to its closest centroid.
5:     maximization: Compute the new centroid (mean) of each cluster.
6: until the centroid positions do not change.
```

The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge, or match the previous iteration's assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of k-means is to try to minimize this value.

The figure below shows the centroids and SSE updating through the first five iterations from two different runs of the k-means algorithm on the same dataset:

[Figure: k-means iteration 1 for two different runs on the same dataset, with centroids marked by an x. Initialization #1 shows SSE 6735.8 and Initialization #2 shows SSE 5238.5.]

The purpose of this figure is to show that the initialization of the centroids is an important step. It also highlights the use of SSE as a measure of clustering performance. After choosing a number of clusters and the initial centroids, the expectation-maximization step is repeated until the centroid positions reach convergence and are unchanged.

The random initialization step causes the k-means algorithm to be nondeterministic, meaning that cluster assignments will vary if you run the same algorithm twice on the same dataset. Researchers commonly run several initializations of the entire k-means algorithm and choose the cluster assignments from the initialization with the lowest SSE.

Writing Your First K-Means Clustering Code in Python

Thankfully, there's a robust implementation of k-means clustering in Python from the popular machine learning package scikit-learn. You'll learn how to write a practical implementation of the k-means algorithm using the scikit-learn version of the algorithm.

Note: If you're interested in gaining a deeper understanding of how to write your own k-means algorithm in Python, then check out the Python Data Science Handbook.

The code in this tutorial requires some popular external Python packages and assumes that you've installed Python with Anaconda. For more information on setting up your Python environment for machine learning in Windows, read through Setting Up Python for Machine Learning on Windows. Otherwise, you can begin by installing the required packages:

```
(base) $ conda install matplotlib numpy pandas seaborn scikit-learn ipython
(base) $ conda install -c conda-forge kneed
```

The code is presented so that you can follow along in an ipython console or Jupyter Notebook. This step will import the modules needed for all the code in this section:

```python
In [1]: import matplotlib.pyplot as plt
   ...: from kneed import KneeLocator
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.cluster import KMeans
   ...: from sklearn.metrics import silhouette_score
   ...: from sklearn.preprocessing import StandardScaler
```
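As an aside, the "pick the lowest SSE across several initializations" strategy described above is easy to demonstrate. The following sketch is an addition to the original tutorial (it assumes the imports above): each single-initialization run can land on a different SSE, and you keep the best one. This is exactly what scikit-learn's n_init parameter automates for you.

```python
# Compare SSE (exposed as inertia_) across several single-init runs.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

best_fit = None
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(X)
    print(f"seed={seed}  SSE={km.inertia_:.1f}")
    # Keep the run that achieved the lowest SSE.
    if best_fit is None or km.inertia_ < best_fit.inertia_:
        best_fit = km
```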
You can generate synthetic clustering data using make_blobs(), a convenience function in scikit-learn used to generate synthetic clusters. make_blobs() uses these parameters:

* n_samples is the total number of samples to generate.
* centers is the number of centers to generate.
* cluster_std is the standard deviation of the clusters.

make_blobs() returns a tuple of two values:

1. A two-dimensional NumPy array with the x- and y-values for each of the samples
2. A one-dimensional NumPy array containing the cluster labels for each sample

Note: Many scikit-learn algorithms rely heavily on NumPy in their implementations. If you want to learn more about NumPy arrays, check out Look Ma, No For Loops: Array Programming With NumPy.

Generate the synthetic data and labels:

```python
In [2]: features, true_labels = make_blobs(
   ...:     n_samples=200,
   ...:     centers=3,
   ...:     cluster_std=2.75,
   ...:     random_state=42
   ...: )
```

Nondeterministic machine learning algorithms like k-means are difficult to reproduce. The random_state parameter is set to an integer value so you can follow the data presented in the tutorial. In practice, it's best to leave random_state as the default value, None.

Here's a look at the first five elements for each of the variables returned by make_blobs():

```python
In [3]: features[:5]
Out[3]:
array([[  9.77075874,   3.27621022],
       [ -9.71349666,  11.27451802],
       [ -6.91330582,  -9.34755911],
       [-10.86185913, -10.75063497],
       [ -8.50038027,  -4.54370383]])

In [4]: true_labels[:5]
Out[4]: array([1, 0, 2, 2, 2])
```

Datasets usually contain numerical features that have been measured in different units, such as height (in inches) and weight (in pounds). A machine learning algorithm would consider weight more important than height only because the values for weight are larger and have higher variability from person to person.

Machine learning algorithms need to consider all features on an even playing field. That means the values for all features must be transformed to the same scale. The process of transforming numerical features to use the same scale is known as feature scaling. It's an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn's preprocessing documentation.

In this example, you'll use the StandardScaler class. This class implements a type of feature scaling called standardization. Standardization scales, or shifts, the values for each numerical feature in your dataset so that the features have a mean of 0 and standard deviation of 1:

```python
In [5]: scaler = StandardScaler()
   ...: scaled_features = scaler.fit_transform(features)
```

Take a look at how the values have been scaled in scaled_features:

```python
In [6]: scaled_features[:5]
Out[6]:
array([[ 2.13082109,  0.25604351],
       [-1.52698523,  1.41036744],
       [-1.00130152, -1.56583175],
       [-1.74256891, -1.76832509],
       [-1.29924521,  ...       ]])
```
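As a quick sanity check (an addition, not part of the original tutorial), you can verify what standardization did: each column of scaled_features should now have a mean of roughly 0 and a standard deviation of roughly 1, and the result matches the manual computation of subtracting the column mean and dividing by the column standard deviation:

```python
import numpy as np

# Each feature column should now be centered at 0 with unit spread.
print(scaled_features.mean(axis=0))  # approximately [0., 0.]
print(scaled_features.std(axis=0))   # approximately [1., 1.]

# StandardScaler's standardization matches the manual transformation.
manual = (features - features.mean(axis=0)) / features.std(axis=0)
print(np.allclose(manual, scaled_features))  # True
```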
Now the data are ready to be clustered. The KMeans estimator class in scikit-learn is where you set the algorithm parameters before fitting the estimator to the data. The scikit-learn implementation is flexible, providing several parameters that can be tuned.

Here are the parameters used in this example:

* init controls the initialization technique. The standard version of the k-means algorithm is implemented by setting init to "random". Setting this to "k-means++" employs an advanced trick to speed up convergence, which you'll use later.
* n_clusters sets k for the clustering step. This is the most important parameter for k-means.
* n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. The default behavior for the scikit-learn algorithm is to perform ten k-means runs and return the results of the one with the lowest SSE.
* max_iter sets the number of maximum iterations for each initialization of the k-means algorithm.

Instantiate the KMeans class with the following arguments:

```python
In [7]: kmeans = KMeans(
   ...:     init="random",
   ...:     n_clusters=3,
   ...:     n_init=10,
   ...:     max_iter=300,
   ...:     random_state=42
   ...: )
```

The parameter names match the language that was used to describe the k-means algorithm earlier in the tutorial. Now that the KMeans class is ready, the next step is to fit it to the data in scaled_features. This will perform ten runs of the k-means algorithm on your data with a maximum of 300 iterations per run:

```python
In [8]: kmeans.fit(scaled_features)
Out[8]: KMeans(init='random', n_clusters=3, random_state=42)
```

Statistics from the initialization run with the lowest SSE are available as attributes of kmeans after calling .fit():

```python
In [9]: # The lowest SSE value
   ...: kmeans.inertia_
Out[9]: 74.57960106819854

In [10]: # Final locations of the centroids
   ...: kmeans.cluster_centers_
Out[10]:
array([[ 1.19539276,  0.13158148],
       [ ...       ,  1.05589975],
       [-0.91941183, -1.18551732]])

In [11]: # The number of iterations required to converge
   ...: kmeans.n_iter_
Out[11]: 6
```

Finally, the cluster assignments are stored as a one-dimensional NumPy array in kmeans.labels_. Here's a look at the first five predicted labels:

```python
In [12]: kmeans.labels_[:5]
Out[12]: array([0, 1, 2, 2, 2], dtype=int32)
```

Note that the order of the cluster labels for the first two data objects was flipped. The order was [1, 0] in true_labels but [0, 1] in kmeans.labels_, even though those data objects are still members of their original clusters in kmeans.labels_.

This behavior is normal, as the ordering of cluster labels is dependent on the initialization. Cluster 0 from the first run could be labeled cluster 1 in the second run and vice versa. This doesn't affect clustering evaluation metrics.
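That last point can be demonstrated directly. The sketch below is an addition to the original tutorial: it swaps cluster labels 0 and 1 and shows that the adjusted Rand index (ARI), a metric used later in this tutorial, is unchanged, because it compares groupings rather than the label values themselves:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Swap cluster labels 0 and 1; the grouping itself is unchanged.
relabeled = np.where(
    kmeans.labels_ == 0, 1, np.where(kmeans.labels_ == 1, 0, kmeans.labels_)
)

print(adjusted_rand_score(true_labels, kmeans.labels_))
print(adjusted_rand_score(true_labels, relabeled))  # identical score
```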
How to Build a K-Means Clustering Pipeline in Python

Now that you have a basic understanding of k-means clustering in Python, it's time to perform k-means clustering on a real-world dataset. These data contain gene expression values from a manuscript authored by The Cancer Genome Atlas (TCGA) Pan-Cancer analysis project investigators.

There are 801 samples (rows) representing five distinct cancer subtypes. Each sample has gene expression values for 20,531 genes (columns). The dataset is available from the UC Irvine Machine Learning Repository, but you can use the Python code below to obtain the data programmatically.

In this section, you'll build a robust k-means clustering pipeline. Since you'll perform multiple transformations of the original input data, your pipeline will also serve as a practical clustering framework.

Building a K-Means Clustering Pipeline

Assuming you want to start with a fresh namespace, import all the modules needed to build and evaluate the pipeline, including pandas and seaborn for more advanced visualizations:

```python
In [1]: import tarfile
   ...: import urllib.parse
   ...: import urllib.request
   ...:
   ...: import numpy as np
   ...: import matplotlib.pyplot as plt
   ...: import pandas as pd
   ...: import seaborn as sns
   ...:
   ...: from sklearn.cluster import KMeans
   ...: from sklearn.decomposition import PCA
   ...: from sklearn.metrics import silhouette_score, adjusted_rand_score
   ...: from sklearn.pipeline import Pipeline
   ...: from sklearn.preprocessing import LabelEncoder, MinMaxScaler
```

Download and extract the TCGA dataset from UCI:

```python
In [2]: uci_tcga_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00401/"
   ...: archive_name = "TCGA-PANCAN-HiSeq-801x20531.tar.gz"
   ...:
   ...: # Build the url
   ...: full_download_url = urllib.parse.urljoin(uci_tcga_url, archive_name)
   ...:
   ...: # Download the file
   ...: r = urllib.request.urlretrieve(full_download_url, archive_name)
   ...:
   ...: # Extract the data from the archive
   ...: tar = tarfile.open(archive_name, "r:gz")
   ...: tar.extractall()
   ...: tar.close()
```

After the download and extraction is completed, you should have a directory that looks like this:

```
TCGA-PANCAN-HiSeq-801x20531/
├── data.csv
└── labels.csv
```

The KMeans class in scikit-learn requires a NumPy array as an argument. The NumPy package has a helper function to load the data from the text file into memory as NumPy arrays:

```python
In [3]: datafile = "TCGA-PANCAN-HiSeq-801x20531/data.csv"
   ...: labels_file = "TCGA-PANCAN-HiSeq-801x20531/labels.csv"
   ...:
   ...: data = np.genfromtxt(
   ...:     datafile,
   ...:     delimiter=",",
   ...:     usecols=range(1, 20532),
   ...:     skip_header=1
   ...: )
   ...:
   ...: true_label_names = np.genfromtxt(
   ...:     labels_file,
   ...:     delimiter=",",
   ...:     usecols=(1,),
   ...:     skip_header=1,
   ...:     dtype="str"
   ...: )
```

Check out the first three columns of data for the first five samples as well as the labels for the first five samples:

```python
In [4]: data[:5, :3]
Out[4]:
array([[0.        , 2.01720929, 3.26552691],
       [0.        , 0.59273209, 1.58842082],
       [0.        , 3.51175898, 4.32719872],
       [0.        , 3.66361787, 4.50764878],
       [0.        , 2.65574107, 2.82154696]])

In [5]: true_label_names[:5]
Out[5]: array(['PRAD', 'LUAD', 'PRAD', 'PRAD', 'BRCA'], dtype='<U4')
```

The cancer subtypes are stored as abbreviated strings, but scikit-learn's evaluation metrics expect integer labels, so convert them with LabelEncoder:

```python
In [6]: label_encoder = LabelEncoder()

In [7]: true_labels = label_encoder.fit_transform(true_label_names)

In [8]: true_labels[:5]
Out[8]: array([4, 3, 4, 4, 0])

In [9]: label_encoder.classes_
Out[9]: array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype='<U4')
```

Since the encoder found five classes, set the number of clusters to match:

```python
In [10]: n_clusters = len(label_encoder.classes_)
```

Build the preprocessing pipeline, which scales the features with MinMaxScaler and then reduces their dimensionality with PCA:

```python
In [11]: preprocessor = Pipeline(
   ...:     [
   ...:         ("scaler", MinMaxScaler()),
   ...:         ("pca", PCA(n_components=2, random_state=42)),
   ...:     ]
   ...: )
```

Now that you've built a pipeline to process the data, you'll build a separate pipeline to perform k-means clustering. You'll override the following default arguments of the KMeans class:

* init: You'll use "k-means++" instead of "random" to ensure centroids are initialized with some distance between them. In most cases, this will be an improvement over "random".
* n_init: You'll increase the number of initializations to ensure you find a stable solution.
* max_iter: You'll increase the number of iterations per initialization to ensure that k-means will converge.

Build the k-means clustering pipeline with user-defined arguments in the KMeans constructor:

```python
In [12]: clusterer = Pipeline(
   ...:     [
   ...:         (
   ...:             "kmeans",
   ...:             KMeans(
   ...:                 n_clusters=n_clusters,
   ...:                 init="k-means++",
   ...:                 n_init=50,
   ...:                 max_iter=500,
   ...:                 random_state=42,
   ...:             ),
   ...:         ),
   ...:     ]
   ...: )
```

The Pipeline class can be chained to form a larger pipeline. Build an end-to-end k-means clustering pipeline by passing the "preprocessor" and "clusterer" pipelines to Pipeline:

```python
In [13]: pipe = Pipeline(
   ...:     [
   ...:         ("preprocessor", preprocessor),
   ...:         ("clusterer", clusterer),
   ...:     ]
   ...: )
```
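Before fitting, you can confirm that the nested steps are wired up the way you expect. This short check is an addition to the original tutorial, using the Pipeline's standard step-name access:

```python
# Steps are addressable by the names given when the pipelines were built.
print(pipe.named_steps["preprocessor"].named_steps["pca"])
# PCA(n_components=2, random_state=42)

# Indexing a Pipeline with a step name is an equivalent shorthand.
print(pipe["clusterer"]["kmeans"].n_clusters)  # 5
```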
Calling .fit() with data as the argument performs all the pipeline steps on the data:

```python
In [14]: pipe.fit(data)
Out[14]:
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('scaler', MinMaxScaler()),
                                 ('pca',
                                  PCA(n_components=2, random_state=42))])),
                ('clusterer',
                 Pipeline(steps=[('kmeans',
                                  KMeans(max_iter=500, n_clusters=5,
                                         n_init=50, random_state=42))]))])
```

The pipeline performs all the necessary steps to execute k-means clustering on the gene expression data! Depending on your Python REPL, .fit() may print a summary of the pipeline.

Objects defined inside pipelines are accessible using their step name. Evaluate the performance by calculating the silhouette coefficient:

```python
In [15]: preprocessed_data = pipe["preprocessor"].transform(data)

In [16]: predicted_labels = pipe["clusterer"]["kmeans"].labels_

In [17]: silhouette_score(preprocessed_data, predicted_labels)
Out[17]: 0.5118775528450304
```

Calculate ARI, too, since the ground truth cluster labels are available:

```python
In [18]: adjusted_rand_score(true_labels, predicted_labels)
Out[18]: 0.722276752060253
```

As mentioned earlier, the scale for each of these clustering performance metrics ranges from -1 to 1. A silhouette coefficient of 0 indicates that clusters are significantly overlapping one another, and a silhouette coefficient of 1 indicates clusters are well-separated. An ARI score of 0 indicates that cluster labels are randomly assigned, and an ARI score of 1 means that the true labels and predicted labels form identical clusters.

Since you specified n_components=2 in the PCA step of the k-means clustering pipeline, you can also visualize the data in the context of the true labels and predicted labels. Plot the results using a pandas DataFrame and the seaborn plotting library:

```python
In [19]: pcadf = pd.DataFrame(
   ...:     pipe["preprocessor"].transform(data),
   ...:     columns=["component_1", "component_2"],
   ...: )
   ...:
   ...: pcadf["predicted_cluster"] = pipe["clusterer"]["kmeans"].labels_
   ...: pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

In [20]: plt.style.use("fivethirtyeight")
   ...: plt.figure(figsize=(8, 8))
   ...:
   ...: scat = sns.scatterplot(
   ...:     x="component_1",
   ...:     y="component_2",
   ...:     s=50,
   ...:     data=pcadf,
   ...:     hue="predicted_cluster",
   ...:     style="true_label",
   ...:     palette="Set2",
   ...: )
   ...: scat.set_title(
   ...:     "Clustering results from TCGA Pan-Cancer\nGene Expression Data"
   ...: )
   ...: plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
   ...: plt.show()
```

Here's what the plot looks like:

[Figure: scatterplot titled "Clustering results from TCGA Pan-Cancer Gene Expression Data", with component_1 and component_2 on the axes, points colored by predicted cluster and styled by true label.]

The visual representation of the clusters confirms the results of the two clustering evaluation metrics. The performance of your pipeline was pretty good. The clusters only slightly overlapped, and cluster assignments were much better than random.
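Another way to inspect that agreement, added here as a complement to the plot, is a cross-tabulation of true cancer types against predicted cluster IDs using the pcadf DataFrame built above:

```python
# Rows are true cancer types; columns are predicted cluster IDs.
# One dominant cluster per row corroborates the high ARI score.
print(pd.crosstab(pcadf["true_label"], pcadf["predicted_cluster"]))
```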
Tuning a K-Means Clustering Pipeline

Your first k-means clustering pipeline performed well, but there's still room to improve. That's why you went through the trouble of building the pipeline: you can tune the parameters to get the most desirable clustering results.

The process of parameter tuning consists of sequentially altering one of the input values of the algorithm's parameters and recording the results. At the end of the parameter tuning process, you'll have a set of performance scores, one for each new value of a given parameter. Parameter tuning is a powerful method to maximize performance from your clustering pipeline.

By setting the PCA parameter n_components=2, you squished all the features into two components, or dimensions. This value was convenient for visualization on a two-dimensional plot. But using only two components means that the PCA step won't capture all of the explained variance of the input data.

Explained variance measures the discrepancy between the PCA-transformed data and the actual input data. The relationship between n_components and explained variance can be visualized in a plot to show you how many components you need in your PCA to capture a certain percentage of the variance in the input data. You can also use clustering performance metrics to evaluate how many components are necessary to achieve satisfactory clustering results.

In this example, you'll use clustering performance metrics to identify the appropriate number of components in the PCA step. The Pipeline class is powerful in this situation. It allows you to perform basic parameter tuning using a for loop.

Iterate over a range of n_components and record evaluation metrics for each iteration:

```python
In [21]: # Empty lists to hold evaluation metrics
   ...: silhouette_scores = []
   ...: ari_scores = []
   ...: for n in range(2, 11):
   ...:     # This sets the number of components for the PCA step,
   ...:     # but leaves other steps unchanged
   ...:     pipe["preprocessor"]["pca"].n_components = n
   ...:     pipe.fit(data)
   ...:
   ...:     silhouette_coef = silhouette_score(
   ...:         pipe["preprocessor"].transform(data),
   ...:         pipe["clusterer"]["kmeans"].labels_,
   ...:     )
   ...:     ari = adjusted_rand_score(
   ...:         true_labels,
   ...:         pipe["clusterer"]["kmeans"].labels_,
   ...:     )
   ...:
   ...:     # Add metrics to their lists
   ...:     silhouette_scores.append(silhouette_coef)
   ...:     ari_scores.append(ari)
```

Plot the evaluation metrics as a function of n_components to visualize the relationship between adding components and the performance of the k-means clustering results:

```python
In [22]: plt.style.use("fivethirtyeight")
   ...: plt.figure(figsize=(6, 6))
   ...: plt.plot(
   ...:     range(2, 11),
   ...:     silhouette_scores,
   ...:     c="#008fd5",
   ...:     label="Silhouette Coefficient",
   ...: )
   ...: plt.plot(range(2, 11), ari_scores, c="#fc4f30", label="ARI")
   ...: plt.xlabel("n_components")
   ...: plt.legend()
   ...: plt.title("Clustering Performance as a Function of n_components")
   ...: plt.tight_layout()
   ...: plt.show()
```

The above code generates a plot showing performance metrics as a function of n_components:

[Figure: "Clustering Performance as a Function of n_components" — the silhouette coefficient and ARI plotted against n_components from 2 to 10.]

There are two takeaways from this figure:

1. The silhouette coefficient decreases linearly. The silhouette coefficient depends on the distance between points, so as the number of dimensions increases, the sparsity increases.
2. The ARI improves significantly as you add components. It appears to start tapering off after n_components=7, so that would be the value to use for presenting the best clustering results from this pipeline.

Like most machine learning decisions, you must balance optimizing clustering evaluation metrics with the goal of the clustering task. In situations when cluster labels are available, as is the case with the cancer dataset used in this tutorial, ARI is a reasonable choice. ARI quantifies how accurately your pipeline was able to reassign the cluster labels.

The silhouette coefficient, on the other hand, is a good choice for exploratory clustering because it helps to identify subclusters. These subclusters warrant additional investigation, which can lead to new and important insights.
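To act on the n_components=7 takeaway, you can refit the pipeline at that setting and report both metrics. This closing sketch is an addition to the original tutorial, reusing the objects defined above:

```python
# Refit the tuned pipeline at the value suggested by the plot.
pipe["preprocessor"]["pca"].n_components = 7
pipe.fit(data)

final_labels = pipe["clusterer"]["kmeans"].labels_
print(adjusted_rand_score(true_labels, final_labels))
print(silhouette_score(pipe["preprocessor"].transform(data), final_labels))
```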
Conclusion

You now know how to perform k-means clustering in Python. Your final k-means clustering pipeline was able to cluster patients with different cancer types using real-world gene expression data. You can use the techniques you learned here to cluster your own data, understand how to get the best clustering results, and share insights with others.

In this tutorial, you learned:

* What the popular clustering techniques are and when to use them
* What the k-means algorithm is
* How to implement k-means clustering in Python
* How to evaluate the performance of clustering algorithms
* How to build and tune a robust k-means clustering pipeline in Python
* How to analyze and present clustering results from the k-means algorithm

You also took a whirlwind tour of scikit-learn, an accessible and extensible tool for implementing k-means clustering in Python. You're now ready to perform k-means clustering on datasets you find interesting.

Note: The dataset used in this tutorial was obtained from the UCI Machine Learning Repository. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. The original dataset is maintained by The Cancer Genome Atlas Pan-Cancer analysis project.

Author: Kevin Arvai (Real Python)
