CSC407 - Chapters 5 & 6
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
5: Introduction to Machine Learning Models
# Splitting the dataset 80-20, i.e., the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)
# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
5: Introduction to Machine Learning Models
# Making a dummy array to represent x, e.g., an array ranging from 0-15
# reshaped to form a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))
# y is a list of the numbers 0-7 representing the target variable
y = range(8)
# Splitting the dataset 80-20, i.e., the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
Output:
Testing set x: [[ 2 3]
[10 11]]
Testing set y: [1, 5]
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set
• The validation set is used to fine-tune the hyperparameters of the model and is considered part of the training process.
• The model sees this data only for evaluation and does not learn from it, so it provides an objective, unbiased evaluation of the model.
• The validation set can also be used for early stopping, i.e., interrupting training when the loss on the validation set begins to exceed the loss on the training set; this helps control bias and variance (see the sketch after this list).
• This data is typically about 10-15% of the total data available for the project, but the proportion can change with the number of hyperparameters: a model with many hyperparameters benefits from a larger validation set.
• When the model's accuracy on the validation data is comparable to (or higher than) its accuracy on the training data, the model is said to have generalized well.
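A minimal, runnable sketch of validation-based early stopping (the model, toy data, and patience value are illustrative, not from the slides):
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# toy regression data (illustrative)
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=42)
model = SGDRegressor(random_state=42)
best_val_loss, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(100):
    model.partial_fit(X_train, y_train)  # one pass over the training data
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0  # validation loss still improving
    else:
        bad_epochs += 1  # validation loss worsened
    if bad_epochs >= patience:  # stop before the model starts overfitting
        break
print("stopped after epoch", epoch, "best validation loss", best_val_loss)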
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
# Making a dummy array to represent x, e.g., an array ranging from 0-23 reshaped to an 8x3 matrix
x = np.arange(24).reshape((8, 3))
# y is just a list of the numbers 0-7 representing the target variable
y = range(8)
# Splitting the dataset 80-20, i.e., the training set is 80% and the combined testing & validation set is 20% of the total data
x_train, x_Combine, y_train, y_Combine = train_test_split(x, y, train_size=0.8, random_state=42)
# Splitting the combined dataset 50-50, i.e., the testing set is 50% of the combined set and the validation set is the other 50%
x_val, x_test, y_val, y_test = train_test_split(x_Combine, y_Combine, test_size=0.5, random_state=42)
# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
print(" ")
# Testing set
print("Testing set x: ",x_test)
print("Testing set y: ",y_test)
print(" ")
# Validation set
print("Validation set x: ",x_val)
print("Validation set y: ",y_val)
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
Output:
Training set x: [[ 0 1 2]
[21 22 23]
[ 6 7 8]
[12 13 14]
[ 9 10 11]
[18 19 20]]
Training set y: [0, 7, 2, 4, 3, 6]
5: Introduction to Machine Learning Models
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# split into training and testing sets (the slide omits this step; an 80-20 split is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• Now, let's train a Decision Tree Classifier on the training data, and then move on to evaluating the model using different metrics.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Accuracy
• Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions.
• This is the most fundamental metric used to evaluate the model.
• The formula is given by: Accuracy = Number of correct predictions / Total number of predictions = (TP + TN) / (TP + TN + FP + FN)
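The value below can be reproduced with scikit-learn's accuracy_score (a minimal sketch; the slide's own code is not shown):
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))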
Output:
Accuracy: 0.9333333333333333
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Precision and Recall
• Precision is the ratio of true positives to the summation of true positives and false positives. It basically
analyses the positive predictions.
Precision = TP/(TP+FP)
• The drawback of Precision is that it does not consider the True Negatives and False Negatives.
• Recall is the ratio of true positives to the summation of true positives and false negatives. It basically
analyses the number of correct positive samples.
Recall = TP/(TP+FN)
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• F1 Score
• The F1 score is the harmonic mean of precision and recall. Because of the precision-recall trade-off, increasing precision tends to decrease recall and vice versa. The goal of the F1 score is to combine precision and recall into a single measure.
F1 score = (2×Precision×Recall)/(Precision+Recall)
# calculating f1 score
from sklearn.metrics import f1_score
print('F1 score:', f1_score(y_test, y_pred, average="weighted"))
Output:
F1 score: 0.9327777777777778
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
• A confusion matrix is an N x N matrix, where N is the number of target classes. It summarizes the counts of actual versus predicted outputs. Some terminology used in the matrix is as follows (a short sketch follows the list):
• True Positives: It is also known as TP. It is the output in which the actual and the predicted values
are YES.
• True Negatives: It is also known as TN. It is the output in which the actual and the predicted
values are NO.
• False Positives: It is also known as FP. It is the output in which the actual value is NO but the
predicted value is YES.
• False Negatives: It is also known as FN. It is the output in which the actual value is YES but the
predicted value is NO.
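A tiny binary sketch of these four terms (the numbers are illustrative, not from the iris example):
from sklearn.metrics import confusion_matrix
y_actual = [1, 1, 0, 0, 1, 0]  # 1 = YES, 0 = NO
y_hat = [1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_actual, y_hat).ravel()  # the 2x2 matrix flattened as TN, FP, FN, TP
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 2 TN: 2 FP: 1 FN: 1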
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
from sklearn import metrics
import matplotlib.pyplot as plt
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Output:
In the output, the accuracy of the model is 93.33%. Precision is approximately 0.944 and Recall is 0.933. F1 score is
approximately 0.933. Finally, the confusion matrix is plotted. Here class labels denote the target classes:
0 = Setosa
1 = Versicolor
2 = Virginica
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• AUC-ROC
• AUC (Area Under Curve) is an evaluation metric used to analyse the classification model at different threshold values. The Receiver Operating Characteristic (ROC) curve is a probabilistic curve used to highlight the model's performance. The curve has two parameters:
• TPR: True Positive Rate. It follows the same formula as Recall: TPR = TP / (TP + FN).
• FPR: False Positive Rate. It is defined as the ratio of false positives to the summation of false positives and true negatives: FPR = FP / (FP + TN).
• This curve is useful as it helps us determine the model's capacity to distinguish between different classes. Let us illustrate this with a simple Python example:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = [1, 0, 0, 1]
y_pred = [1, 0, 0.9, 0.2]  # predicted scores/probabilities, not hard labels
auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("Auc", auc)
Output:
Auc 0.75
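To inspect the curve itself rather than only the area under it, sklearn's roc_curve returns the FPR/TPR pairs at each score threshold (a minimal sketch reusing y_true and y_pred from above):
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred)
for f, t in zip(fpr, tpr):
    print("FPR:", f, "TPR:", t)  # one point on the ROC curve per threshold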
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
• Regression is used to determine continuous values. It is mostly used to find a relation between a dependent and an
independent variable.
• For classification, we use a confusion matrix, accuracy, f1 score, etc.
• But for regression analysis, since we are predicting a numerical value, it may differ from the actual output.
• So, we consider the error calculation as it helps to summarize how close the prediction is to the actual value.
• There are many metrics available for evaluating the regression model.
• We are going to implement a simple regression model using the Mumbai weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind Speed, and Wind Direction.
• We are basically interested in finding a relationship between Temperature and Relative Humidity.
• Here Relative Humidity is the dependent variable, and Temperature is the independent variable.
• We will perform linear regression and use the metrics to evaluate the performance of our model.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, \
    mean_squared_error, mean_absolute_percentage_error
• Now let's load the data into a pandas DataFrame and then split it into training and testing parts (for model evaluation) in an 80:20 ratio.
df = pd.read_csv('weather.csv')
X = df.iloc[:, 2].values
Y = df.iloc[:, 3].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
• Now, let's train a simple linear regression model on the training data, and then move on to the evaluation part of the model using different metrics.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
• Mean Absolute Error (MAE)
• This is the simplest metric used to analyze the loss over the whole dataset.
• As we all know, the error is basically the difference between the predicted and actual values.
• Therefore, MAE is defined as the average of the absolute errors calculated.
• Here we take the modulus of each error, perform the summation, and then divide the result by the number of data points.
• It is a positive quantity and is not concerned with direction. The formula of MAE is given by: MAE = Σ|y_pred − y_actual| / N
mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:
Mean Absolute Error 1.7236295632503873
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Squared Error (MSE)
• The most commonly used metric is Mean Squared Error, or MSE.
• It is a function used to calculate the loss.
• We find the difference between the predicted values and the actual values, square the result, and then find the average over the whole dataset.
• MSE is always positive as we square the values.
• The smaller the MSE, the better the performance of our model. The formula of MSE is given by: MSE = Σ(y_pred − y_actual)² / N
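The value below is presumably produced with mean_squared_error, which was imported earlier (a minimal sketch):
# calculating MSE on the test predictions
mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Square Error", mse)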
Output:
Mean Square Error 3.9808057060106954
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Root Mean Squared Error(RMSE)
• RMSE is a popular metric and is the extended version of MSE (Mean Squared Error).
• This method is basically used to evaluate the performance of our model.
• It indicates how much the data points are spread around the best-fit line.
• It is the square root of the MSE, i.e., the standard deviation of the prediction errors.
• A lower value means that the data points lie closer to the best-fit line.
RMSE = √(Σ(y_pred − y_actual)² / N)
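A minimal sketch for the value below, taking the square root of the MSE (numpy is imported here just for the square root):
import numpy as np
rmse = np.sqrt(mean_squared_error(y_true=Y_test, y_pred=Y_pred))
print("Root Mean Square Error", rmse)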
Output:
Root Mean Square Error 1.9951956560725306
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Absolute Percentage Error(MAPE)
• MAPE is basically used to express the error in terms of a percentage.
• It is based on the absolute difference between the actual and predicted values.
• Each error is then divided by the corresponding actual value.
• The results are then summed up and finally averaged. The smaller the percentage, the better the performance of the model.
• The formula is given by: MAPE = (Σ(|y_actual − y_pred| / |y_actual|)) / N
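A minimal sketch for the value below, using the mean_absolute_percentage_error imported earlier:
mape = mean_absolute_percentage_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Percentage Error", mape)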
Output:
Mean Absolute Percentage Error 0.02334408993333347
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Decision Tree Classifiers are a fundamental machine learning algorithm for classification tasks.
• They organize data into a tree-like structure where internal nodes represent decisions, branches represent outcomes, and leaf nodes represent class labels.
• The DecisionTreeClassifier from sklearn can perform multi-class classification on a dataset. The syntax for DecisionTreeClassifier is as follows:
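The slide's code is not preserved; a minimal sketch consistent with the iris workflow above (the split size and random_state are assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)  # assumed split
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))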
Output:
Accuracy: 0.9555555555555556
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Hyperparameter Tuning with Decision Tree Classifier
• Hyperparameters are configuration settings that control the behavior of a decision tree model and significantly affect its performance.
• Proper tuning can improve accuracy, reduce overfitting, and enhance the generalization of the model.
• Popular methods for tuning include Grid Search, Random Search, and Bayesian Optimization, which explore different combinations to find the best configuration.
• Hyperparameter tuning using GridSearchCV
from sklearn.model_selection import GridSearchCV
tree = DecisionTreeClassifier(random_state=1)
# parameter grid: each key is a hyperparameter, each value a list of candidates
# (these values are illustrative; the slide's original grid is not shown)
param_grid = {'max_depth': [2, 3, 5, 10],
              'min_samples_leaf': [1, 2, 5],
              'criterion': ['gini', 'entropy']}
# GridSearchCV
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid, cv=5, verbose=True)
grid_search.fit(X_train, y_train)
• Here we defined the parameter grid with a set of hyperparameters and a list of possible values for each.
• GridSearchCV evaluates the different hyperparameter combinations for the DecisionTreeClassifier and selects the best combination of hyperparameters based on the performance across all k folds.
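After fitting, the chosen combination can be inspected via attributes provided by GridSearchCV:
print("Best parameters:", grid_search.best_params_)  # best hyperparameter combination found
print("Best CV score:", grid_search.best_score_)     # mean cross-validated score of that combination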
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
• Decision Tree visualization is used to interpret and comprehend the model's choices.
• We'll plot the fitted tree to see which features carry the greatest predictive power in its splits.
• Here we fetch the best estimator obtained from GridSearchCV as the decision tree classifier.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
tree_clf = grid_search.best_estimator_
plt.figure(figsize=(18, 15))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
6: Unsupervised Machine Learning Models
• K Means Clustering
• K-means clustering is a technique used to organize data into groups based on their similarity.
• For example, an online store may use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers, and Big Spenders for personalised marketing.
• The algorithm works by first randomly picking some central points called centroids; each data point is then assigned to the closest centroid, forming a cluster.
• After all the points are assigned to a cluster, the centroids are updated by finding the average position of the points in each cluster.
• This process repeats until the centroids stop changing, at which point the clusters are formed.
• The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
• How does k-means clustering work?
• We are given a data set of items with certain features and values for these features (like a vector).
• The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into. A minimal sketch with sklearn's built-in KMeans follows; the subsequent slides then implement the algorithm from scratch.
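A quick illustration using sklearn's built-in KMeans (the dataset parameters and k are illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# toy data with 3 natural groups (parameters are illustrative)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
print(kmeans.cluster_centers_)  # the learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points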
6: Unsupervised Machine Learning Models
• K Means Clustering
6: Unsupervised Machine Learning Models
• K Means Clustering
• The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement. The algorithm works as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# generate the toy dataset X (this call is missing on the slide; its parameters are assumed)
X, y_true = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
• Another method is to initialize the means at random values between the boundaries of the data set.
• For example, if a feature x takes values in [0,3], we initialize the means with values for x drawn from [0,3].
6: Unsupervised Machine Learning Models
• K Means Clustering
Output: (scatter plot of the raw data points)
# initialize k random cluster centres (reconstructed; the slide preserved only the last lines of this snippet)
k = 3
clusters = {}
np.random.seed(23)
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)  # random point in [-2, 2] per feature
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster
clusters
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
Output:
• The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 5: Define Euclidean distance
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2)**2))
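A quick sanity check of the helper on a 3-4-5 right triangle:
print(distance(np.array([0, 0]), np.array([3, 4])))  # prints 5.0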
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 6: Create the function to Assign and Update the cluster center
The E-step assigns data points to the nearest cluster center, and the M-step updates the cluster centers based on the mean of the assigned points, as in K-means clustering.
# (reconstructed: the slide preserved only fragments of these two functions)
def assign_clusters(X, clusters):
    # E-step: assign each point to its nearest cluster centre
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters
def update_clusters(X, clusters):
    # M-step: move each centre to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
        clusters[i]['points'] = []
    return clusters
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 7: Create the function to Predict the cluster for the datapoints
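The function body is not shown on the slide; a minimal sketch consistent with the distance helper and cluster structure above:
def pred_cluster(X, clusters):
    # label each point with the index of its nearest cluster centre
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred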
Step 8: Assign points to clusters, update the centres, and predict each point's cluster
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
Step 9: Plot the data points with their predicted cluster center
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
6: Unsupervised Machine Learning Models
• K Means Clustering
Output:
• The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.