
Machine Learning/Data Science

(csc 407)
Chapters 5 & 6
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
5: Introduction to Machine Learning Models

• Training vs Testing vs Validation Sets


• The fundamental purpose of splitting the dataset is to assess how effectively the trained model will generalize to new data. This split can be achieved using the train_test_split function of scikit-learn.
• Training Set
• This is the actual dataset from which the model learns, i.e., the model sees and learns from this data to predict the outcome or to make the right decisions.
• Most training data is collected from several sources and then preprocessed and organized to ensure proper performance of the model.
• The quality and type of the training data largely determine the ability of the model to generalize, i.e., the better the quality and diversity of the training data, the better the performance of the model.
• This data typically makes up more than 60% of the total data available for the project.
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Training Set Example
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making a dummy array to represent x, y for this example:
# an array for x ranging from 0-15, reshaped into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y is just a list of the numbers 0-7 representing the target variable
y = range(8)

# Splitting the dataset in an 80-20 fashion, i.e. the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)

Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
Training set y: [0, 7, 2, 4, 3, 6]
5: Introduction to Machine Learning Models

• Training vs Testing vs Validation Sets


• Testing Set
• This dataset is independent of the training set but has a somewhat similar probability distribution of classes, and it is used as a benchmark to evaluate the model, only after the training of the model is complete.
• The testing set is usually a properly organized dataset containing all kinds of data for scenarios that the model would probably face when used in the real world.
• Often the validation and testing sets are combined and used as a single testing set, which is not considered good practice.
• If the accuracy of the model on the training data is greater than that on the testing data, the model is said to be overfitting.
• This data is approximately 20-25% of the total data available for the project.
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Testing Set Example
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making a dummy array to represent x, y: an array for x ranging from 0-15,
# reshaped into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y is just a list of the numbers 0-7 representing the target variable
y = range(8)

# Splitting the dataset in an 80-20 fashion, i.e. the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)

Output:
Testing set x: [[ 2 3]
[10 11]]
Testing set y: [1, 5]
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set
• The validation set is used to fine-tune the hyperparameters of the model and is considered a part of the training of the model.
• The model only sees this data for evaluation but does not learn from it, which provides an objective, unbiased evaluation of the model.
• The validation dataset can also be used for regularization, by stopping training early when the loss on the validation set becomes greater than the loss on the training set, i.e., to balance bias and variance.
• This data is approximately 10-15% of the total data available for the project, but this can change depending on the number of hyperparameters, i.e., if the model has many hyperparameters then a larger validation set will give better results.
• When the accuracy of the model on the validation data is close to (or greater than) its accuracy on the training data, the model is said to have generalized well.


5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making a dummy array to represent x, y: an array for x ranging from 0-23,
# reshaped into an 8x3 matrix
x = np.arange(24).reshape((8, 3))

# y is just a list of the numbers 0-7 representing the target variable
y = range(8)

# Splitting the dataset in an 80-20 fashion, i.e. the training set is 80% of the total data
# and the combined testing & validation set is the remaining 20%
x_train, x_Combine, y_train, y_Combine = train_test_split(x, y, train_size=0.8, random_state=42)

# Splitting the combined set in a 50-50 fashion, i.e. the testing set is 50% of the combined set
# and the validation set is the other 50%
x_val, x_test, y_val, y_test = train_test_split(x_Combine, y_Combine, test_size=0.5, random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)
print(" ")

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
print(" ")

# Validation set
print("Validation set x: ", x_val)
print("Validation set y: ", y_val)
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
Output:

Training set x: [[ 0 1 2]
[21 22 23]
[ 6 7 8]
[12 13 14]
[ 9 10 11]
[18 19 20]]
Training set y: [0, 7, 2, 4, 3, 6]

Testing set x: [[15 16 17]]


Testing set y: [5]

Validation set x: [[3 4 5]]


Validation set y: [1]
5: Introduction to Machine Learning Models

• Metrics for Evaluation


• Model evaluation is the process that uses some metrics which
help us to analyze the performance of the model.
• Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions.
• Therefore, evaluating a model plays a vital role so that we can
judge the performance of our model.
• The evaluation also helps to analyze a model’s key weaknesses.
• There are many metrics like Accuracy, Precision, Recall, F1
score, Area under Curve, Confusion Matrix, and Mean Square
Error.
• Cross-validation is one technique that is followed during the training phase, and it is a model evaluation technique as well (a minimal sketch follows below).
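• A minimal sketch of k-fold cross-validation, assuming the iris dataset and DecisionTreeClassifier used later in this chapter:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the model is trained and scored 5 times,
# each time holding out a different fifth of the data for evaluation
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())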
5: Introduction to Machine Learning Models

• Metrics for Evaluation


• Evaluation Metrics for Classification Task
• Load the iris dataset which has features like the length
and width of sepals and petals.
• The target values are Iris setosa, Iris virginica, and Iris
versicolor.
• After importing the dataset we divide the dataset into
train and test datasets in the ratio 80:20.
• Then we call Decision Trees and train our model.
• After that, we perform the prediction and calculate the
accuracy score, precision, recall, and f1 score.
• We also plot the confusion matrix.


5: Introduction to Machine Learning Models

• Metrics for Evaluation


# import the libraries and dataset
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, \
    recall_score, f1_score, accuracy_score
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Now let’s load the toy dataset iris flowers from the sklearn.datasets library and then split it into
training and testing parts (for model evaluation) in the 80:20 ratio.

iris = load_iris()
X = iris.data
y = iris.target

# Holdout method. Dividing the data into train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20, test_size=0.20)

• Now, let’s train a Decision Tree Classifier model on the training data, and then we will move on
to the evaluation part of the model using different metrics.

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Accuracy

• Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions.
• This is the most fundamental metric used to evaluate the model.
• The formula is given by

Accuracy = Number of correct predictions / Total number of predictions
• However, Accuracy has a drawback. It cannot perform well on an imbalanced dataset.


• Suppose a model simply predicts the majority class label for most of the data. It yields high accuracy, but it cannot classify the minority class labels and performs poorly on them (see the sketch after the output below).

print("Accuracy:", accuracy_score(y_test, y_pred))

Output:
Accuracy: 0.9333333333333333
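• As an illustrative sketch (with hypothetical labels, not the iris data), a model that always predicts the majority class scores high accuracy on an imbalanced dataset while never detecting the minority class:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 9 negatives, 1 positive
y_true_imb = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred_imb = [0] * 10  # the model always predicts the majority class

print("Accuracy:", accuracy_score(y_true_imb, y_pred_imb))  # 0.9
print("Recall:", recall_score(y_true_imb, y_pred_imb))      # 0.0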
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Precision and Recall
• Precision is the ratio of true positives to the summation of true positives and false positives. It basically
analyses the positive predictions.

Precision = TP/(TP+FP)
• The drawback of Precision is that it does not consider the True Negatives and False Negatives.

• Recall is the ratio of true positives to the summation of true positives and false negatives. It basically
analyses the number of correct positive samples.

Recall = TP/(TP+FN)
print("Precision:", precision_score(y_test, y_pred, average="weighted"))

print('Recall:', recall_score(y_test, y_pred, average="weighted"))

Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• F1 Score
• The F1 score is the harmonic mean of precision and recall. It is seen that during the
precision-recall trade-off if we increase the precision, recall decreases and vice versa.
The goal of the F1 score is to combine precision and recall.

F1 score = (2×Precision×Recall)/(Precision+Recall)

# calculating f1 score
print('F1 score:', f1_score(y_test, y_pred, average="weighted"))

Output:
F1 score: 0.9327777777777778
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
• A confusion matrix is an N x N matrix, where N is the number of target classes. It represents the counts of actual versus predicted outputs (a minimal binary example follows the list of terms below). Some terminologies in the matrix are as follows:

• True Positives: It is also known as TP. It is the output in which the actual and the predicted values
are YES.
• True Negatives: It is also known as TN. It is the output in which the actual and the predicted
values are NO.
• False Positives: It is also known as FP. It is the output in which the actual value is NO but the
predicted value is YES.
• False Negatives: It is also known as FN. It is the output in which the actual value is YES but the
predicted value is NO.
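• A minimal binary sketch (hypothetical labels) showing where TN, FP, FN and TP sit in scikit-learn's confusion matrix layout:

from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the rows are actual classes and the columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))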
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(
confusion_matrix=confusion_matrix, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Output:

In the output, the accuracy of the model is 93.33%. Precision is approximately 0.944 and Recall is 0.933. F1 score is
approximately 0.933. Finally, the confusion matrix is plotted. Here class labels denote the target classes:
0 = Setosa
1 = Versicolor
2 = Virginica
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• AUC-ROC
• AUC (Area Under Curve) is an evaluation metric that is used to analyze the classification model at different threshold
values. The Receiver Operating Characteristic(ROC) curve is a probabilistic curve used to highlight the model’s
performance. The curve has two parameters:
• TPR: It stands for True positive rate. It basically follows the formula of Recall.
• FPR: It stands for False Positive rate. It is defined as the ratio of False positives to the summation of false positives and
True negatives.
• This curve is useful as it helps us to determine the model’s capacity to distinguish between different classes. Let us
illustrate this with the help of a simple Python example
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 0, 1]
y_pred = [1, 0, 0.9, 0.2]
auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("Auc", auc)
Output:
Auc 0.75
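• To see the TPR/FPR pairs behind the AUC value, roc_curve returns the two rates at each decision threshold (a sketch using the same toy scores as above):

from sklearn.metrics import roc_curve

y_true = [1, 0, 0, 1]
y_scores = [1, 0, 0.9, 0.2]

# fpr and tpr give one point on the ROC curve per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("Thresholds:", thresholds)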
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
• Regression is used to determine continuous values. It is mostly used to find a relation between a dependent and an
independent variable.
• For classification, we use a confusion matrix, accuracy, f1 score, etc.
• But for regression analysis, since we are predicting a numerical value, it may differ from the actual output.
• So, we consider the error calculation as it helps to summarize how close the prediction is to the actual value.
• There are many metrics available for evaluating the regression model.

• We are going to implement a simple regression model using the Mumbai weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind Speed, and Wind Direction.

• We are basically interested in finding a relationship between Temperature and Relative Humidity.
• Here Relative Humidity is the dependent variable, and Temperature is the independent variable.
• We will perform linear regression and use the metrics to evaluate the performance of our model.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
# importing the libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,\
mean_squared_error, mean_absolute_percentage_error

• Now let's load the data into a pandas DataFrame and then split it into training and testing parts (for model evaluation) in the 80:20 ratio.

df = pd.read_csv('weather.csv')
X = df.iloc[:, 2].values
Y = df.iloc[:, 3].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20,
random_state=0)

• Now, let's train a simple linear regression model on the training data, and then we will move on to evaluating it using different metrics.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
• Mean Absolute Error(MAE)
• This is the simplest metric used to analyze the loss over the whole dataset.
• As we all know the error is basically the difference between the predicted and actual values.
• Therefore, MAE is defined as the average of the errors calculated.
• Here we calculate the modulus of the error, perform the summation and then divide the result by the number of data points.
• It is a positive quantity and is not concerned about the direction. The formula of MAE is given by MAE = ∑|ypred-yactual| / N
mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:
Mean Absolute Error 1.7236295632503873
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Squared Error(MSE)
• The most commonly used metric is Mean Square error or MSE.
• It is a function used to calculate the loss.
• We find the difference between the predicted values and the truth variable, square the result and then find the average over the whole
dataset.
• MSE is always positive as we square the values.
• The smaller the MSE, the better the performance of our model. The formula of MSE is given by: MSE = ∑(ypred - yactual)² / N

mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)


print("Mean Square Error", mse)

Output:
Mean Square Error 3.9808057060106954
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Root Mean Squared Error(RMSE)

• RMSE is a popular method and is the extended version of MSE(Mean Squared Error).
• This method is basically used to evaluate the performance of our model.
• It indicates how much the data points are spread around the best line.
• It is the standard deviation of the Mean squared error.
• A lower value means that the data point lies closer to the best fit line.

RMSE = √(∑(ypred - yactual)² / N)

rmse = mean_squared_error(y_true=Y_test, y_pred=Y_pred, squared=False)


print("Root Mean Square Error", rmse)

Output:
Root Mean Square Error 1.9951956560725306
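• Note that in recent scikit-learn releases (1.4 and newer) the squared=False flag is deprecated in favour of a dedicated function; a minimal equivalent sketch, assuming such a version is installed:

from sklearn.metrics import root_mean_squared_error

# Equivalent RMSE computation on newer scikit-learn versions
rmse = root_mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Root Mean Square Error", rmse)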
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Absolute Percentage Error(MAPE)
• MAPE is basically used to express the error in terms of a percentage.
• It is defined as the absolute difference between the actual and predicted value, divided by the actual value.
• The results are then summed up and finally, we calculate the average. The smaller the percentage, the better the performance of the model.
• The formula is given by

MAPE = ∑(|yactual - ypred| / yactual) / N × 100%

mape = mean_absolute_percentage_error(Y_test, Y_pred, sample_weight=None, multioutput='uniform_average')


print("Mean Absolute Percentage Error", mape)

Output:
Mean Absolute Percentage Error 0.02334408993333347
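• Note that scikit-learn's mean_absolute_percentage_error returns the error as a fraction, so 0.0233 corresponds to roughly 2.33%. To tie the four formulas back to the scikit-learn results, here is a minimal NumPy sketch (reusing Y_test and Y_pred from above) that computes each metric by hand:

import numpy as np

y_actual = np.asarray(Y_test, dtype=float)
y_hat = np.asarray(Y_pred, dtype=float)
errors = y_hat - y_actual

mae = np.mean(np.abs(errors))              # MAE: mean absolute error
mse = np.mean(errors ** 2)                 # MSE: mean squared error
rmse = np.sqrt(mse)                        # RMSE: square root of MSE
mape = np.mean(np.abs(errors / y_actual))  # MAPE as a fraction (multiply by 100 for %)

print(mae, mse, rmse, mape)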


6: Supervised Machine Learning Models
• Decision Tree Classifier
• Decision Tree Classifiers are a fundamental machine learning algorithm for classification tasks.
• They organize data into a tree-like structure where internal nodes represent decisions, branches represent outcomes, and leaf nodes represent class labels.

• Implementing Decision Tree Classifiers with Scikit-Learn

• The DecisionTreeClassifier from Sklearn has the ability to perform multi-class classification on a dataset. The syntax for
DecisionTreeClassifier is as follows:

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Let's go through the parameters (a brief usage sketch follows the list):
• criterion: It measures the quality of a split. Supported values are 'gini', 'entropy' and 'log_loss'. The default value is 'gini'
• splitter: This parameter is used to choose the split at each node. Supported values are 'best' & 'random'. The default value is 'best'
• max_features: It defines the number of features to consider when looking for the best split.
• max_depth: This parameter denotes maximum depth of the tree (default=None).
• min_samples_split: It defines the minimum number of samples reqd. to split an internal node (default=2).
• min_samples_leaf: The minimum number of samples required to be at a leaf node (default=1)
• max_leaf_nodes: It defines the maximum number of possible leaf nodes.
• min_impurity_decrease: A node will be split if the split induces a decrease of the impurity greater than or equal to this value (default=0.0).
• class_weight: It defines the weights associated with classes.
• ccp_alpha: It is a complexity parameter used for minimal cost-complexity pruning
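• As a brief usage sketch (with values chosen only for illustration), several of these parameters can be set when constructing the classifier:

from sklearn.tree import DecisionTreeClassifier

# Illustrative, non-default settings: entropy splits, a shallower tree,
# and stricter minimum sample counts to reduce overfitting
clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=4,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=1,
)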
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Let’s implement the code
#import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# load iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# split dataset to training and test set
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state = 99)
# initialize decision tree classifier
clf = DecisionTreeClassifier(random_state=1)
# train the classifier
clf.fit(X_train, y_train)
# predict using classifier
y_pred = clf.predict(X_test)
# calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Output:
Accuracy: 0.9555555555555556
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Hyperparameter Tuning with Decision Tree Classifier
• Hyperparameters are configuration settings that control the behavior of a decision tree model and significantly affect its performance.
• Proper tuning can improve accuracy, reduce overfitting and enhance generalization of model.
• Popular methods for tuning include Grid Search, Random Search, and Bayesian Optimization, which explore different combinations to find the best configuration.
• Hyperparameter tuning using GridSearchCV
from sklearn.model_selection import GridSearchCV

# Hyperparameters to fine-tune
param_grid = {
'max_depth': range(1, 10, 1),
'min_samples_leaf': range(1, 20, 2),
'min_samples_split': range(2, 20, 2),
'criterion': ["entropy", "gini"]
}

tree = DecisionTreeClassifier(random_state=1)
# GridSearchCV
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid, cv=5, verbose=True)
grid_search.fit(X_train, y_train)

print("best accuracy", grid_search.best_score_)


print(grid_search.best_estimator_)
6: Supervised Machine Learning Models
• Decision Tree Classifier
Output:
Fitting 5 folds for each of 1620 candidates, totalling 8100 fits
best accuracy 0.9714285714285715
DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=3, random_state=1)

• Here we defined the parameter grid with a set of hyperparameters and a list of possible values for each.
• GridSearchCV evaluates the different hyperparameter combinations for the DecisionTreeClassifier and selects the best combination based on the performance across all k folds (a sketch of evaluating the tuned model on the test set follows below).
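• The best estimator found by the grid search can then be checked against the held-out test set (a minimal sketch reusing X_test and y_test from the earlier split):

# Evaluate the tuned tree on the held-out test data
best_tree = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_tree.predict(X_test))
print("Test accuracy of tuned tree:", test_accuracy)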
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
• Decision tree visualization is used to interpret and comprehend the model's choices.
• We'll plot the fitted tree to see how the individual features are used to split the data, and then inspect the feature importances (after the plot code below) to see which features have the greatest predictive power.
• Here we fetch the best estimator obtained from GridSearchCV as the decision tree classifier.

from sklearn.tree import plot_tree


import matplotlib.pyplot as plt

tree_clf = grid_search.best_estimator_
plt.figure(figsize=(18, 15))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names,
class_names=iris.target_names)
plt.show()
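• To complement the tree plot, the fitted estimator's feature_importances_ attribute shows how much each feature contributed to the splits (a minimal sketch):

# Feature importances of the tuned decision tree
for name, importance in zip(iris.feature_names, tree_clf.feature_importances_):
    print(f"{name}: {importance:.3f}")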
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
6: Unsupervised Machine Learning Models
• K Means Clustering
• K-means clustering is a technique used to organize data into groups based on their similarity.
• For example, an online store can use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalised marketing.
• The algorithm works by first randomly picking some central points called centroids and each data point is then assigned to the closest centroid forming a cluster.
• After all the points are assigned to a cluster the centroids are updated by finding the average position of the points in each cluster.
• This process repeats until the centroids stop changing forming clusters.
• The goal of clustering is to divide the data points into clusters so that similar data points belong to same group.
• How does k-means clustering work?
• We are given a data set of items with certain features and values for these features (like a vector).
• The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters
we want to classify our items into.
6: Unsupervised Machine Learning Models
• K Means Clustering
6: Unsupervised Machine Learning Models
• K Means Clustering
• The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement. The algorithm works
as follows:

• First, we randomly initialize k points, called means or cluster centroids.


• We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
• We repeat the process for a given number of iterations and at the end, we have our clusters.
• The “points” mentioned above are called means because they are the mean values of the items categorized in them.
• To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set.
• Another method is to initialize the means at random values between the boundaries of the data set. For example, if the items have values in [0, 3] for a feature x, we will initialize the means with values for x within [0, 3].
6: Unsupervised Machine Learning Models
• K Means Clustering
• Implementation of K-Means Clustering in Python
• We will use a blobs dataset and show how the clusters are formed.

Step 1: Importing the necessary libraries


• We are importing NumPy for numerical computations, Matplotlib to plot the graph, and make_blobs from sklearn.datasets to generate the dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

Step 2: Create the custom dataset with make_blobs and plot it


X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()

6: Unsupervised Machine Learning Models
• K Means Clustering
Output:

Step 3: Initialize the random centroids


The code initializes three clusters for K-means clustering. It sets a random seed and generates random cluster centers within a specified range, and creates an empty list of points for each cluster.
6: Unsupervised Machine Learning Models
• K Means Clustering
k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

clusters

Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}

6: Unsupervised Machine Learning Models
• K Means Clustering
Step 4: Plot the random initialize center with data points
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()

Output:

• The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 5: Define Euclidean distance

def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 6: Create the function to Assign and Update the cluster center
The E-step assigns each data point to the nearest cluster center, and the M-step updates each cluster center based on the mean of the points assigned to it.

def assign_clusters(X, clusters):
    # E-step: assign each point to the cluster with the nearest center
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    # M-step: move each center to the mean of its assigned points, then clear the point lists
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
        clusters[i]['points'] = []
    return clusters
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 7: Create the function to Predict the cluster for the datapoints

def pred_cluster(X, clusters):
    # Assign each point to the nearest (updated) cluster center
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred

Step 8: Assign, Update, and predict the cluster center

clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)

Step 9: Plot the data points with their predicted cluster center

plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
6: Unsupervised Machine Learning Models
• K Means Clustering
Output:

• The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.
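• For comparison with the from-scratch version above, the same clustering can be done with scikit-learn's KMeans in a few lines (a minimal sketch on the same blobs data):

from sklearn.cluster import KMeans

# Fit scikit-learn's KMeans on the same data and plot the resulting clusters
kmeans = KMeans(n_clusters=3, random_state=23, n_init=10)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', c='red')
plt.show()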
