Clustering in Python-Dr. Afsaneh Javadi
Scikit-learn (imported as sklearn) is a free, open-source machine learning library for Python. It provides
simple and efficient tools for data analysis and is built on top of NumPy, SciPy, and Matplotlib. It
supports supervised and unsupervised learning tasks such as classification, regression,
clustering, and dimensionality reduction. It also includes tools for data preprocessing, model selection,
and evaluation. Scikit-learn is widely used in academia and industry for a variety of machine learning tasks.
NumPy is a Python library used for performing numerical operations in scientific computing. It provides
efficient multi-dimensional arrays and matrices, along with a large variety of mathematical functions to
operate on these arrays. NumPy is widely used in data science, machine learning, and scientific
computing domains where numerical computations and large data sets are common.
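1. Creating an array:
import numpy as np
arr = np.array([1, 4, 9, 16, 25, 36])  # illustrative values
print(arr)
The remaining examples reuse arr.
2. Reshaping an array:
new_arr = arr.reshape(3, 2)  # 3 rows, 2 columns
3. Applying an element-wise function:
new_arr = np.sqrt(arr)  # square root of every element
4. Indexing and slicing:
print(arr[3])    # fourth element (indexing starts at 0)
print(arr[1:4])  # elements at positions 1, 2, and 3
5. Computing summary statistics:
mean = np.mean(arr)
median = np.median(arr)
std_dev = np.std(arr)  # population standard deviation (ddof=0)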
SciPy is an open-source Python library for scientific computing, data analysis,
and numerical optimization. It provides tools for fields such as mathematics, engineering,
physics, and statistics. Common uses include:
1. Optimization: using the scipy.optimize module to find the minimum or maximum of a function,
or to solve constrained optimization problems.
2. Interpolation: using the scipy.interpolate module to fit a curve to a set of data points, or to
generate new values from existing data.
3. Linear algebra: using the scipy.linalg module to perform matrix operations, such as finding
eigenvalues and eigenvectors, solving linear systems of equations, or computing matrix
decompositions.
4. Signal processing: using the scipy.signal module to perform a variety of signal processing tasks,
such as filtering, convolution, and spectral analysis (Fourier transforms themselves are provided by scipy.fft).
5. Statistics: using the scipy.stats module to generate random numbers, calculate summary
statistics, or perform hypothesis testing and statistical inference.
6. Image processing: using the scipy.ndimage module to process images, such as filtering or
enhancing contrast.
7. Sparse matrices: using the scipy.sparse module to work with large, sparse matrices efficiently.
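A few of the modules listed above can be tried in a handful of lines. The sketch below is only an illustration: the function being minimized, the matrix A, the vector b, and the sample values are made up for the example and are not taken from the course material.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Optimization: minimize f(x) = (x - 2)^2, whose minimum lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
x_min = result.x

# Linear algebra: solve the linear system A @ x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # illustrative coefficients
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

# Statistics: summary statistics of a small, made-up sample.
sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
summary = stats.describe(sample)

print(x_min)
print(solution)
print(summary.mean, summary.variance)
```

Each call returns ordinary NumPy arrays or small result objects, so the outputs can be passed straight to NumPy or Matplotlib.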
Matplotlib is a Python library for creating static, interactive, and animated data visualizations.
It provides a wide variety of plots, such as line plots, bar plots, histograms, scatter plots, 3D
plots, and heatmaps. Matplotlib allows for easy customization of the plots, including color and font
choices, line and marker styles, and axis formatting. It is widely used in scientific and engineering fields
for data analysis and visualization.
Line plot: A line plot shows how a dependent variable changes over time or along another
variable.
Scatter plot: A scatter plot can be used when we have two numeric variables and we want to see their
relationship.
Histogram: A histogram helps in understanding the shape of the distribution of a numeric
variable.
Bar plot: A bar plot can be used when we have categorical variables and we want to compare them.
Heatmap: A heatmap is a graphical representation of data in which values are displayed as colors.
Box plot: A box plot gives us an idea about the shape of the distribution and the presence of outliers.
Pie chart: A pie chart can be used to show the proportion of different categories in a dataset.
3D plot: Matplotlib provides tools to create 3D plots, which can be used to visualize data in a 3D space.
Polar plot: A polar plot is used to plot data points in a polar coordinate system.
Subplots: Using subplots, we can create multiple plots in the same figure for a better comparison of data.
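Four of the plot types above (line, scatter, histogram, and bar) can be placed in one figure using subplots. The following is a minimal sketch with made-up data; the variable names, the random seed, and the category labels are assumptions chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the figure is reproducible
x = np.linspace(0, 10, 50)
y = np.sin(x)
values = rng.normal(size=200)        # made-up numeric variable for the histogram
categories = ["A", "B", "C"]         # made-up categories for the bar plot
counts = [5, 3, 7]

# One figure with a 2 x 2 grid of axes.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, y)
axes[0, 0].set_title("Line plot")
axes[0, 1].scatter(x, y + rng.normal(scale=0.2, size=x.size))
axes[0, 1].set_title("Scatter plot")
axes[1, 0].hist(values, bins=20)
axes[1, 0].set_title("Histogram")
axes[1, 1].bar(categories, counts)
axes[1, 1].set_title("Bar plot")
fig.tight_layout()
```

To see the result, either call fig.savefig with a file name of your choice, or remove the matplotlib.use("Agg") line and call plt.show() in an interactive session.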
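Example 1:
There are many clustering algorithms that you can use depending on your data and the problem you are
trying to solve. Here is an example of how to write code for the K-means clustering algorithm in Python:
from sklearn.cluster import KMeans
# features is assumed to be an array of shape (n_samples, n_features) loaded beforehand
# Define the number of clusters you want to find
num_clusters = 3
# Create a KMeans object with the number of clusters you want to find
kmeans = KMeans(n_clusters=num_clusters)
# Fit the KMeans object to the dataset of features
kmeans.fit(features)
# Get the labels of each data point (i.e., which cluster it belongs to)
labels = kmeans.labels_
# Get the centroid of each cluster
centroids = kmeans.cluster_centers_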
In the code above, we first import the KMeans class from the scikit-learn library. We then define the
number of clusters we want to find (num_clusters) and create a KMeans object with that number of
clusters. We fit the KMeans object to our dataset of features (features) using the fit() method. We then
use the labels_ attribute to get the cluster labels for each data point in our dataset, and the
cluster_centers_ attribute to get the centroids of each cluster.
Keep in mind that there are many different clustering algorithms, and the implementation details may
differ depending on the algorithm you choose to use. Additionally, you may need to preprocess your
data or tune the hyperparameters of your clustering algorithm to get good results.
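To make the preprocessing and hyperparameter remarks concrete, here is a minimal end-to-end sketch. It is not part of the original example: the synthetic data from make_blobs stands in for features, StandardScaler is one possible preprocessing step, and the n_init and random_state values are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `features`: 150 two-dimensional points around 3 centers.
features, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Preprocessing: rescale every feature to zero mean and unit variance,
# so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Hyperparameters: number of clusters, number of random restarts, and a fixed seed.
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(scaled)

labels = kmeans.labels_              # cluster index of each data point
centroids = kmeans.cluster_centers_  # one row per cluster, in the scaled space

print(np.bincount(labels))  # number of points assigned to each cluster
print(centroids.shape)
```

Because the model is fitted on the scaled data, the centroids are expressed in scaled units; the scaler's inverse_transform method maps them back to the original units if needed.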
Example 2:
Here is an example implementation of the DBSCAN algorithm in Python using the scikit-learn library:
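A minimal sketch of such a listing, using the iris dataset and the parameter values (eps=0.5, min_samples=5) given in the description that follows. The variable names X, dbscan, and labels are assumptions made for this sketch.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Load the iris dataset: 150 samples with 4 numeric features each.
X = load_iris().data

# Define the DBSCAN model with the neighborhood radius and density threshold.
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit the model to the data.
dbscan.fit(X)

# Retrieve the cluster label of each data point; -1 marks noise points.
labels = dbscan.labels_
print(labels)
```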
In this example, we load the iris dataset and define a DBSCAN instance with an epsilon value (eps) of 0.5 and
a minimum of 5 samples (min_samples) in a point's neighborhood for that point to count as a core point.
We then fit the model to the data and retrieve the resulting cluster labels. Finally, we print out the label
of each data point; points labeled -1 are treated as noise. Note that the
DBSCAN algorithm does not require us to specify the number of clusters in advance: it discovers them
automatically based on the input parameters.