
AAM QUESTION BANK WITH ANSWERS

CHAPTER 4 – Unsupervised Learning: Clustering Algorithms


[12 marks]

Q WHAT IS K-MEANS CLUSTERING?


 K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled
dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be created in the process;
if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
 It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in an unlabelled dataset on its own, without the need
for any training.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
 The main aim of this algorithm is to minimize the sum of distances between the data
points and their corresponding cluster centroids.
 The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters,
and repeats the process until the best clusters are found. The value of k should be
predetermined in this algorithm.

Q HOW DOES THE K-MEANS ALGORITHM WORK?


The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
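The loop above can also be written directly in a few lines of NumPy. The following is only a minimal sketch of the same iteration (random centroids, assignment, recomputation); the array X, the value of k and the iteration cap are illustrative assumptions, and empty clusters are not handled.

import numpy as np

def simple_kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3: assign each data point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop once no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids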



Q DESCRIBE FAILURES OF K-MEANS

Failures or challenges associated with K-Means:


1. Sensitive to Initial Centroid Positions: K-Means is sensitive to the initial placement of
centroids. Different initializations can lead to different final cluster assignments, and the
algorithm may converge to a local minimum rather than the global minimum.
2. Assumes Spherical Clusters: K-Means assumes that clusters are spherical and equally
sized. In situations where clusters have different shapes, densities, or sizes, K-Means may
fail to accurately capture the underlying structure of the data.
3. Sensitive to Outliers: Outliers can significantly impact the performance of K-Means.
Since the algorithm relies on the mean (centroid) of the data points in each cluster,
outliers can disproportionately influence the centroid, leading to suboptimal cluster
assignments.
4. Requires Pre-specification of the Number of Clusters (K): One of the major limitations of
K-Means is that it requires the user to specify the number of clusters (K) in advance.
Choosing an inappropriate value for K can result in poor clustering results.
5. Limited to Euclidean Distance: K-Means uses Euclidean distance to measure the
dissimilarity between data points and centroids. This can be a limitation when dealing
with data that does not adhere to Euclidean geometry or when the features have different
scales.
6. May Produce Unbalanced Clusters: K-Means can produce clusters of significantly
different sizes. In cases where the data naturally forms clusters of unequal sizes, K-
Means may not be the most suitable algorithm.
7. Not Robust to Non-Convex Shapes: K-Means assumes that clusters are convex, which
means it struggles with non-convex shapes. If the true clusters have complex, non-convex
boundaries, K-Means may fail to accurately represent them.
8. Does Not Handle Categorical Data Well: K-Means is designed for numerical data, and it
may not perform well with categorical or binary features. Preprocessing techniques, such
as one-hot encoding, are often required.
9. Noisy Data Impact: Noise in the data can lead to incorrect cluster assignments. K-Means
is not robust to noisy data, and outliers or irrelevant features can affect the clustering
results.



10. Convergence to Local Optima: K-Means uses an iterative optimization process, and it
may converge to a local minimum rather than the global minimum. Multiple runs with
different initializations are often performed to mitigate this issue, as sketched below.
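Because of points 1 and 10, scikit-learn's KMeans is usually run several times with different random initializations and the run with the lowest inertia (WCSS) is kept. A minimal sketch, assuming a numeric feature matrix X:

from sklearn.cluster import KMeans

# n_init controls how many times K-Means is restarted with different centroid seeds;
# the fit with the lowest inertia (within-cluster sum of squares) is kept
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.inertia_)   # WCSS of the best of the 10 runs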

Q IMPLEMENTATION OF K-Means ALGORITHM

The steps to be followed for the implementation are given below:


o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
Step-1: Data pre-processing Step
o Importing Libraries:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for mathematical calculations, matplotlib for plotting graphs, and pandas for managing the dataset.
o Importing the Dataset:
# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

o Extracting Independent Variables:
x = dataset.iloc[:, [3, 4]].values
Step-2: Finding the optimal number of clusters using the elbow method
#finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []   # initializing the list for the values of WCSS

# using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
Step-3: Training the K-means algorithm on the training dataset

#training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
Step-4: Visualizing the Clusters
#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()



Q. Advantages / Benefits of Dimension Reduction
1. It facilitates data compression and decreases the necessary storage space.
2. It reduces the amount of time needed to conduct identical calculations.
3. It addresses the issue of multi-collinearity, which enhances the performance of the model. It
eliminates superfluous characteristics.

Q. What are the common methods to perform Dimension Reduction?


1. Missing Values:
 When we come across missing values while analysing data, how should we proceed?
 To begin, we should determine the cause and then impute the missing data or eliminate
variables using suitable approaches.
 However, what if a variable has an excessive number of missing values? A variable with a
very high missing-value ratio carries little information, so it is usually better to drop it
entirely rather than impute it.
2. Low Variance: If there are a large number of dimensions, it is advisable to exclude
variables with low variance compared to the others, as such variables will not effectively
account for the variation in the target variable.
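A minimal sketch of the low-variance filter using scikit-learn's VarianceThreshold; the numeric DataFrame df and the threshold value are illustrative assumptions:

from sklearn.feature_selection import VarianceThreshold

# drop every feature whose variance falls below the chosen threshold
selector = VarianceThreshold(threshold=0.01)   # illustrative cutoff
X_reduced = selector.fit_transform(df)
kept_columns = df.columns[selector.get_support()]   # names of the surviving features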

3. Decision Trees:
 Decision trees serve as a comprehensive approach to addressing various issues such as
handling missing values and outliers and identifying the relevant variables.
 They also performed effectively during our Data Hackathon: multiple data scientists
employed decision tree algorithms and achieved successful outcomes.

4. Random Forest:
 Random Forest is a method similar to a decision tree, and its feature importance scores
can be used to keep only the most relevant variables.
 It is important to note that random forests tend to show a bias towards variables with a
higher number of distinct values, meaning they favour numeric variables over binary or
categorical ones.
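One common way to use a random forest for dimension reduction is to rank the features by their importance scores and keep only the top ones. A hedged sketch, assuming a numeric feature DataFrame X and labels y:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# rank features by impurity-based importance and keep, say, the ten best
importances = pd.Series(forest.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(10).index
X_reduced = X[top_features]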

5. Strong Correlation:
Dimensions that have a strong correlation can negatively impact the model's performance.
Furthermore, it is undesirable to have several variables that carry comparable information or
similar variation, a phenomenon commonly referred to as "multicollinearity".



6. Backward Feature Elimination:
This method begins with all n dimensions. We compute the sum of squared residuals (SSR)
after removing each variable in turn, repeating this n times. We then identify the variable whose
removal results in the smallest increase in SSR and remove it, leaving a dataset with n-1 input
features. The procedure can be repeated until the desired number of features remains.
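scikit-learn's RFE (recursive feature elimination) automates a closely related idea: it repeatedly drops the least important feature according to a fitted model. A sketch under the assumption of a feature DataFrame X, a target y, and five features to keep:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# drop one feature per iteration until only the desired number remains
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
rfe.fit(X, y)
selected = X.columns[rfe.support_]   # features that survived the elimination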

7. Factor Analysis:
There are essentially two approaches to conducting factor analysis:
Exploratory Factor Analysis (EFA)
Confirmatory Factor Analysis (CFA)
8. Principal Component Analysis (PCA)
 It is a method that transforms the variables into a new set of variables that are linear
combinations of the original variables.
 The new set of variables is referred to as the principal components.
 The components are obtained such that the first principal component captures the largest
possible variation in the original data, and each subsequent component captures the
largest possible remaining variance.
 Applying PCA makes the original variables lose their interpretability: if interpretability
of the results is a priority for your analysis, PCA is not the appropriate technique for
your project.

Q. Define Correlation:
 Correlation refers to the statistical relationship between two or more variables.
 Correlation is a statistical term that quantifies the direction and magnitude of the linear
relationship between two variables.
 A correlation value of 0 indicates the absence of a linear relationship between the two
variables, whereas correlation coefficients of 1 and -1 indicate perfect positive and
negative correlations, respectively.
 The principal components in PCA are linear combinations of the original variables that
maximise the amount of variation in the data accounted for. The calculation of the principal
components involves the use of the correlation matrix.
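A small illustration of how a correlation matrix can be computed with pandas and used as a high-correlation filter; the DataFrame df and the 0.9 cutoff are illustrative assumptions:

import pandas as pd

corr = df.corr()   # pairwise correlation coefficients in [-1, 1]

# flag every pair of distinct features whose absolute correlation exceeds 0.9
cols = corr.columns
high_corr_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.9
]
print(high_corr_pairs)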



Q. PCA Implementation
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

m_data = pd.read_csv('mushrooms.csv')

# Machine learning models work with numbers, so we need to encode the
# string categories as integers
encoder = LabelEncoder()

# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]

# Scale the features


scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Explained variance')
plt.xlabel('Principal components')
plt.show()
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)

plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()

pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()
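One way to choose n_components less arbitrarily than the 17 used above is to look at the cumulative explained variance ratio and keep just enough components to cover, say, 95% of the variance. A hedged sketch continuing from the scaled X_features above (the 0.95 cutoff is an illustrative choice):

import numpy as np

pca_full = PCA().fit(X_features)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components that explains at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components, cumulative[n_components - 1])

# alternatively, PCA accepts a float and chooses the component count itself
pca_95 = PCA(n_components=0.95).fit(X_features)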

Q. List Dimensionality Reduction Techniques
(See the list under "Q. Common techniques of Dimensionality Reduction" at the end of this unit.)



Q. Advantages / Benefits of Dimensionality Reduction
o By reducing the dimensions of the features, the space required to store the dataset also
gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.

Q. Disadvantages of Dimensionality Reduction


There are also some disadvantages of applying dimensionality reduction, which are given
below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal components to
retain is sometimes not known in advance.



Q. Approaches of Dimension Reduction
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
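A minimal filter-method sketch using scikit-learn's SelectKBest with the chi-square score; the non-negative feature matrix X, the labels y, and the choice of k=10 are illustrative assumptions:

from sklearn.feature_selection import SelectKBest, chi2

# score every feature independently of any model and keep the ten best
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)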
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and its
performance is evaluated; the performance decides whether to add or remove those features to
increase the accuracy of the model. This method is more accurate than the filter method but
more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
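Forward selection can be sketched with scikit-learn's SequentialFeatureSelector, which repeatedly adds the feature that most improves a chosen model's cross-validated score; the logistic-regression estimator and the count of five features are illustrative assumptions:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# wrapper method: each candidate feature set is evaluated by actually fitting the model
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction='forward',   # use 'backward' for backward selection
)
sfs.fit(X, y)
X_selected = sfs.transform(X)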
3. Embedded Methods: Embedded methods evaluate the importance of each feature during the
training iterations of the machine learning model itself. Some common techniques of
embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
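A hedged embedded-method sketch: LASSO shrinks the coefficients of unhelpful features to exactly zero while it trains, so feature selection falls out of the fitting itself. The alpha value and the assumed X and y are illustrative:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# features whose LASSO coefficient is shrunk to zero are discarded
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
X_embedded = selector.fit_transform(X, y)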
Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space
with fewer dimensions. This approach is useful when we want to retain all of the information
but use fewer resources while processing it.



Some common feature extraction techniques are:
a) Principal Component Analysis
b) Linear Discriminant Analysis
c) Kernel PCA
d) Quadratic Discriminant Analysis

Q. Common techniques of Dimensionality Reduction


a) Principal Component Analysis
b) Backward Elimination
c) Forward Selection
d) Score comparison
e) Missing Value Ratio
f) Low Variance Filter
g) High Correlation Filter
h) Random Forest
i) Factor Analysis
j) Auto-Encoder

