Clustering in Python-Dr. Afsaneh Javadi

The document provides an overview of clustering in Python using libraries such as Sklearn, NumPy, SciPy, and Matplotlib. It includes examples of K-means and DBSCAN clustering algorithms, along with explanations of data manipulation and visualization techniques. The document emphasizes the importance of preprocessing data and tuning hyperparameters for effective clustering results.


Sklearn (short for Scikit-learn) is a free, open-source machine learning library for Python. It provides
simple and efficient tools for data analysis and is built on top of NumPy, SciPy, and matplotlib. Sklearn
supports various supervised and unsupervised learning algorithms, such as classification, regression,
clustering, and dimensionality reduction. It also includes tools for data preprocessing, model selection,
and evaluation. Sklearn is widely used in academia and industry for a variety of machine learning tasks.

NumPy is a Python library used for performing numerical operations in scientific computing. It provides
efficient multi-dimensional arrays and matrices, along with a large variety of mathematical functions to
operate on these arrays. NumPy is widely used in data science, machine learning, and scientific
computing domains where numerical computations and large data sets are common.

Here are some examples of how to use NumPy:

1. Creating an array:

import numpy as np

arr = np.array([1, 2, 3])

2. Reshaping an array:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

new_arr = arr.reshape(3, 2)

3. Applying math functions to an array:

import numpy as np

arr = np.array([1, 2, 3])

new_arr = np.sqrt(arr)

4. Indexing and slicing an array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr[3])

print(arr[1:4])

5. Combining two arrays:

import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

new_arr = np.concatenate((arr1, arr2))

6. Applying statistical functions to an array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

mean = np.mean(arr)

median = np.median(arr)

std_dev = np.std(arr)
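The snippets above assign their results without displaying them. The following sketch reruns the same kinds of operations with print calls so the returned values are visible. The variable names (grid, reshaped, roots, combined) are introduced here only for illustration, and the square-root input is changed to [1, 4, 9] so that the results are whole numbers.

```python
import numpy as np

# Reshaping: 2 rows x 3 columns becomes 3 rows x 2 columns (values keep row-major order)
grid = np.array([[1, 2, 3], [4, 5, 6]])
reshaped = grid.reshape(3, 2)
print(reshaped.tolist())              # [[1, 2], [3, 4], [5, 6]]

# Element-wise math: np.sqrt is applied to every element of the array
roots = np.sqrt(np.array([1, 4, 9]))
print(roots.tolist())                 # [1.0, 2.0, 3.0]

# Indexing is zero-based, and a slice excludes its stop index
arr = np.array([1, 2, 3, 4, 5])
print(arr[3])                         # 4
print(arr[1:4].tolist())              # [2, 3, 4]

# Concatenation joins the arrays end to end
combined = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))
print(combined.tolist())              # [1, 2, 3, 4, 5, 6]

# Statistics: np.std computes the population standard deviation by default (ddof=0)
print(np.mean(arr))                   # 3.0
print(np.median(arr))                 # 3.0
print(round(float(np.std(arr)), 4))   # 1.4142
```

Passing ddof=1 to np.std gives the sample standard deviation instead, which is the usual choice when the array is a sample from a larger population.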


SciPy is a Python-based open-source software library that is used for scientific computing, data analysis,
and numerical optimization. It provides a range of powerful tools and functions for various fields like
mathematics, engineering, physics, and statistics.

Here are some common use cases for SciPy:

1. Optimization: using the scipy.optimize module to find the minimum or maximum of a function,
or to solve constrained optimization problems.

2. Integration: using the scipy.integrate module to calculate definite integrals of functions on a
given interval.

3. Interpolation: using the scipy.interpolate module to construct a function that passes through a set
of data points, and to estimate new values between the existing ones.

4. Linear algebra: using the scipy.linalg module to perform matrix operations, such as finding
eigenvalues and eigenvectors, solving linear systems of equations, or computing matrix
decompositions.

5. Signal processing: using the scipy.signal module to perform a variety of signal processing tasks,
such as filtering, convolution, and spectral analysis (general Fourier transforms are provided by
the scipy.fft module).

6. Statistics: using the scipy.stats module to generate random numbers, calculate summary
statistics, or perform hypothesis testing and statistical inference.

7. Image processing: using the scipy.ndimage module to process images, such as filtering or
enhancing contrast.

8. Sparse matrices: using the scipy.sparse module to work with large, sparse matrices efficiently.
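As a rough illustration of the first four use cases, the sketch below calls one function from each of scipy.optimize, scipy.integrate, scipy.interpolate, and scipy.linalg. The toy functions and matrices are made up for this example; they are assumptions chosen so the answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, interpolate, linalg, optimize

# Optimization: f(x) = (x - 2)^2 has its minimum at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print(result.x)            # close to 2.0

# Integration: the definite integral of x^2 on [0, 3] is exactly 9
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)
print(area)                # close to 9.0

# Interpolation: a cubic spline through samples of y = x^2,
# evaluated at a point that is not among the samples
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = xs ** 2
spline = interpolate.CubicSpline(xs, ys)
print(float(spline(1.5)))  # close to 2.25

# Linear algebra: solve the system 3x + y = 9, x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)
print(solution)            # close to [2. 3.]
```

The results are floating-point approximations, so they should be compared against the exact values with a tolerance rather than with ==.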

Matplotlib is a Python library for creating static, interactive, and animated data visualizations. It
provides a wide variety of plots, such as line plots, bar plots, histograms, scatter plots, 3D
plots, and heatmaps. Matplotlib allows for easy customization of the plots, including color and font
choices, line and marker styles, and axis formatting. It is widely used in scientific and engineering fields
for data analysis and visualization.

Line plot: A line plot shows how a dependent variable changes over time or over another ordered
variable.

Scatter plot: A scatter plot shows the relationship between two numeric variables.

Histogram: A histogram shows the shape of the distribution of a numeric variable.

Bar plot: A bar plot compares values across the categories of a categorical variable.

Heatmap: A heatmap is a graphical representation of data in which values are displayed as colors.

Box plot: A box plot summarizes the shape of a distribution and reveals the presence of outliers.

Pie chart: A pie chart shows the proportion of each category in a dataset.

3D plot: Matplotlib provides tools to create 3D plots, which can be used to visualize data in 3D space.

Polar plot: A polar plot draws data points in a polar coordinate system.

Subplots: Subplots place multiple plots in the same figure, which makes the data easier to compare.
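Several of the plot types listed above can be combined in one figure. The sketch below is one possible arrangement, not a prescribed layout: it draws a line plot, a scatter plot, a histogram, and a bar plot in a 2 x 2 grid of subplots. The data is randomly generated for illustration, and the output file name plot_types_demo.png is an arbitrary choice.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the made-up data is reproducible
x = np.linspace(0, 10, 50)

# One figure holding a 2 x 2 grid of axes
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, np.sin(x))                        # trend of y over x
axes[0, 0].set_title("Line plot")

axes[0, 1].scatter(x, x + rng.normal(size=x.size))   # relationship between two variables
axes[0, 1].set_title("Scatter plot")

axes[1, 0].hist(rng.normal(size=500), bins=20)       # distribution of one variable
axes[1, 0].set_title("Histogram")

axes[1, 1].bar(["A", "B", "C"], [5, 3, 7])           # comparison across categories
axes[1, 1].set_title("Bar plot")

fig.tight_layout()

# Save the figure to the system temporary directory
output_path = os.path.join(tempfile.gettempdir(), "plot_types_demo.png")
fig.savefig(output_path)
plt.close(fig)
print(output_path)
```

In an interactive session, the matplotlib.use("Agg") line can be dropped and fig.savefig replaced by plt.show() to open the figure in a window.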

Example 1:

There are many clustering algorithms that you can use depending on your data and the problem you are
trying to solve. Here is an example of how to write code for the K-means clustering algorithm in Python:

from sklearn.cluster import KMeans

import numpy as np

# Assume you have a dataset of features that you want to cluster

# The features are stored in a NumPy array called "features"

num_clusters = 3

# Create a KMeans object with the number of clusters you want to find

kmeans = KMeans(n_clusters=num_clusters)

# Fit the KMeans object to your dataset

kmeans.fit(features)

# Get the labels of each data point (i.e., which cluster it belongs to)

labels = kmeans.labels_

# Get the centroids of each cluster

centroids = kmeans.cluster_centers_

In the code above, we first import the KMeans class from the scikit-learn library. We then define the
number of clusters we want to find (num_clusters) and create a KMeans object with that number of
clusters. We fit the KMeans object to our dataset of features (features) using the fit() method. We then
use the labels_ attribute to get the cluster labels for each data point in our dataset, and the
cluster_centers_ attribute to get the centroids of each cluster.
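The code above assumes that a features array already exists. To try the same steps end to end, the sketch below generates a synthetic dataset with scikit-learn's make_blobs as a stand-in for real data. The sample count, the cluster spread, the n_init and random_state values, and the new_points array are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for "features": 300 two-dimensional points around 3 centers
features, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

num_clusters = 3

# n_init=10 restarts the algorithm from 10 random initializations and keeps the best run;
# random_state makes the result reproducible
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(features)

labels = kmeans.labels_               # one cluster index per data point
centroids = kmeans.cluster_centers_   # one row of coordinates per cluster

print(labels.shape)         # (300,)
print(centroids.shape)      # (3, 2)
print(np.bincount(labels))  # number of points assigned to each cluster

# A fitted model can also assign previously unseen points to the nearest centroid
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans.predict(new_points))
```

The cluster indices themselves are arbitrary: a different random_state may number the same groups differently.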

Keep in mind that there are many different clustering algorithms, and the implementation details may
differ depending on the algorithm you choose to use. Additionally, you may need to preprocess your
data or tune the hyperparameters of your clustering algorithm to get good results.
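As one possible illustration of preprocessing and tuning, the sketch below standardizes the features with StandardScaler and then compares several values of n_clusters using the silhouette score. Both choices are assumptions made for this example rather than steps prescribed by these notes; other scalers and other selection criteria (such as the inertia "elbow") are equally valid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; the second feature is inflated so the scales differ widely
features, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)
features[:, 1] *= 100.0

# Preprocessing: rescale every feature to zero mean and unit variance,
# so that no single feature dominates the Euclidean distances K-means relies on
scaled = StandardScaler().fit_transform(features)

# Tuning: fit K-means for several cluster counts and record the silhouette score of each
scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(scaled)
    scores[k] = silhouette_score(scaled, cluster_labels)

# A higher silhouette score means tighter, better separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette score:", best_k)
```

The silhouette score is a heuristic, so the value of k it favors should be checked against domain knowledge or a plot of the clusters.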

Example 2:

Here is an example implementation of the DBSCAN algorithm in Python using the scikit-learn library:
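A minimal sketch consistent with the description that follows is given here. It assumes the iris data comes from scikit-learn's bundled load_iris loader, and the variable names (iris, X, dbscan, labels) are illustrative choices.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Load the iris measurements (150 samples, 4 features); the species labels are not used
iris = load_iris()
X = iris.data

# eps is the neighborhood radius; min_samples is the number of points required
# within that radius for a point to be treated as a core point
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit the model to the data
dbscan.fit(X)

# Retrieve one label per data point; -1 marks points treated as noise
labels = dbscan.labels_
print(labels)
```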

In this example, we load the iris dataset and define a DBSCAN instance with an epsilon value (eps) of 0.5 and
a min_samples value of 5. Here eps is the neighborhood radius, and min_samples is the number of points
(including the point itself) that must lie within that radius for a point to count as a core point. We then
fit the model to the data and retrieve the resulting cluster labels; points that belong to no cluster are
labeled -1 (noise). Finally, we print the label of each data point. Note that the DBSCAN algorithm does not
require us to specify the number of clusters in advance: it discovers them automatically based on the
input parameters.
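To see how the number of discovered clusters depends on the input parameters, the sketch below reruns DBSCAN on the iris data for a few eps values and counts the clusters and noise points each time. The specific eps values and the results dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data

# Map each eps value to (number of clusters, number of noise points)
results = {}
for eps in (0.3, 0.5, 0.8):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # The label -1 marks noise, so it is not counted as a cluster
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples held fixed, a larger eps can only turn noise points into cluster members, never the reverse, so the noise count does not increase as eps grows.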
