CSC407 - Chapters 5 & 6
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
5: Introduction to Machine Learning Models
# Splitting the dataset 80-20, i.e., the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)
# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
5: Introduction to Machine Learning Models
# Making a dummy array to represent x, e.g., an array ranging from 0-15
# reshaped to form a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))
# y is a list of the numbers 0-7 representing the target variable
y = range(8)
# Splitting the dataset 80-20, i.e., the training set is 80% and the testing set is 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
Output:
Testing set x: [[ 2 3]
[10 11]]
Testing set y: [1, 5]
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set
• The validation set is used to fine-tune the hyperparameters of the model and is considered part of the training process.
• The model sees this data only for evaluation and does not learn from it, so it provides an objective, unbiased evaluation of the model.
• The validation set can also be used for early stopping, i.e., interrupting training when the loss on the validation set begins to exceed the loss on the training set; this helps control bias and variance (see the sketch after this list).
• This data is typically about 10-15% of the total data available for the project, but the proportion can change with the number of hyperparameters: a model with many hyperparameters benefits from a larger validation set.
• When the model's accuracy on the validation data is comparable to (or higher than) its accuracy on the training data, the model is said to have generalized well.
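A minimal, runnable sketch of validation-based early stopping (the model, toy data, and patience value are illustrative, not from the slides):
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# toy regression data (illustrative)
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=42)
model = SGDRegressor(random_state=42)
best_val_loss, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(100):
    model.partial_fit(X_train, y_train)  # one pass over the training data
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0  # validation loss still improving
    else:
        bad_epochs += 1  # validation loss worsened
    if bad_epochs >= patience:  # stop before the model starts overfitting
        break
print("stopped after epoch", epoch, "best validation loss", best_val_loss)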
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
# Making a dummy array to represent x, e.g., an array ranging from 0-23 reshaped to an 8x3 matrix
x = np.arange(24).reshape((8, 3))
# y is just a list of the numbers 0-7 representing the target variable
y = range(8)
# Splitting the dataset 80-20, i.e., the training set is 80% and the combined testing & validation set is 20% of the total data
x_train, x_Combine, y_train, y_Combine = train_test_split(x, y, train_size=0.8, random_state=42)
# Splitting the combined dataset 50-50, i.e., the testing set is 50% of the combined set and the validation set is the other 50%
x_val, x_test, y_val, y_test = train_test_split(x_Combine, y_Combine, test_size=0.5, random_state=42)
# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
print(" ")
# Testing set
print("Testing set x: ",x_test)
print("Testing set y: ",y_test)
print(" ")
# Validation set
print("Validation set x: ",x_val)
print("Validation set y: ",y_val)
5: Introduction to Machine Learning Models
• Training vs Testing vs Validation Sets
• Validation Set Example
Output:
Training set x: [[ 0 1 2]
[21 22 23]
[ 6 7 8]
[12 13 14]
[ 9 10 11]
[18 19 20]]
Training set y: [0, 7, 2, 4, 3, 6]
5: Introduction to Machine Learning Models
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# split into training and testing sets (the slide omits this step; an 80-20 split is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• Now, let's train a Decision Tree Classifier on the training data, and then move on to evaluating the model using different metrics.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Accuracy
• Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions.
• This is the most fundamental metric used to evaluate the model.
• The formula is given by: Accuracy = Number of correct predictions / Total number of predictions = (TP + TN) / (TP + TN + FP + FN)
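The value below can be reproduced with scikit-learn's accuracy_score (a minimal sketch; the slide's own code is not shown):
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))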
Output:
Accuracy: 0.9333333333333333
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Precision and Recall
• Precision is the ratio of true positives to the summation of true positives and false positives. It basically
analyses the positive predictions.
Precision = TP/(TP+FP)
• The drawback of Precision is that it does not consider the True Negatives and False Negatives.
• Recall is the ratio of true positives to the summation of true positives and false negatives. It basically
analyses the number of correct positive samples.
Recall = TP/(TP+FN)
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• F1 Score
• The F1 score is the harmonic mean of precision and recall. Because of the precision-recall trade-off, increasing precision tends to decrease recall and vice versa. The goal of the F1 score is to combine precision and recall into a single measure.
F1 score = (2×Precision×Recall)/(Precision+Recall)
# calculating f1 score
from sklearn.metrics import f1_score
print('F1 score:', f1_score(y_test, y_pred, average="weighted"))
Output:
F1 score: 0.9327777777777778
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
• A confusion matrix is an N x N matrix, where N is the number of target classes. It summarizes the counts of actual versus predicted outputs. Some terminology used in the matrix is as follows (a short sketch follows the list):
• True Positives: It is also known as TP. It is the output in which the actual and the predicted values
are YES.
• True Negatives: It is also known as TN. It is the output in which the actual and the predicted
values are NO.
• False Positives: It is also known as FP. It is the output in which the actual value is NO but the
predicted value is YES.
• False Negatives: It is also known as FN. It is the output in which the actual value is YES but the
predicted value is NO.
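A tiny binary sketch of these four terms (the numbers are illustrative, not from the iris example):
from sklearn.metrics import confusion_matrix
y_actual = [1, 1, 0, 0, 1, 0]  # 1 = YES, 0 = NO
y_hat = [1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_actual, y_hat).ravel()  # the 2x2 matrix flattened as TN, FP, FN, TP
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 2 TN: 2 FP: 1 FN: 1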
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Confusion Matrix
from sklearn import metrics
import matplotlib.pyplot as plt
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Output:
In the output, the accuracy of the model is 93.33%. Precision is approximately 0.944 and Recall is 0.933. F1 score is
approximately 0.933. Finally, the confusion matrix is plotted. Here class labels denote the target classes:
0 = Setosa
1 = Versicolor
2 = Virginica
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• AUC-ROC
• AUC (Area Under Curve) is an evaluation metric used to analyse the classification model at different threshold values. The Receiver Operating Characteristic (ROC) curve is a probabilistic curve used to highlight the model's performance. The curve has two parameters:
• TPR: True Positive Rate. It follows the same formula as Recall: TPR = TP / (TP + FN).
• FPR: False Positive Rate. It is defined as the ratio of false positives to the summation of false positives and true negatives: FPR = FP / (FP + TN).
• This curve is useful as it helps us determine the model's capacity to distinguish between different classes. Let us illustrate this with a simple Python example:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = [1, 0, 0, 1]
y_pred = [1, 0, 0.9, 0.2]  # predicted scores/probabilities, not hard labels
auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("Auc", auc)
Output:
Auc 0.75
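To inspect the curve itself rather than only the area under it, sklearn's roc_curve returns the FPR/TPR pairs at each score threshold (a minimal sketch reusing y_true and y_pred from above):
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred)
for f, t in zip(fpr, tpr):
    print("FPR:", f, "TPR:", t)  # one point on the ROC curve per threshold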
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
• Regression is used to determine continuous values. It is mostly used to find a relation between a dependent and an
independent variable.
• For classification, we use a confusion matrix, accuracy, f1 score, etc.
• But for regression analysis, since we are predicting a numerical value, it may differ from the actual output.
• So, we consider the error calculation as it helps to summarize how close the prediction is to the actual value.
• There are many metrics available for evaluating the regression model.
• We are going to implement a simple regression model using the Mumbai weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind Speed, and Wind Direction.
• We are basically interested in finding a relationship between Temperature and Relative Humidity.
• Here Relative Humidity is the dependent variable, and Temperature is the independent variable.
• We will perform linear regression and use the metrics to evaluate the performance of our model.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, \
    mean_squared_error, mean_absolute_percentage_error
• Now let's load the data into a pandas DataFrame and then split it into training and testing parts (for model evaluation) in an 80:20 ratio.
df = pd.read_csv('weather.csv')
X = df.iloc[:, 2].values
Y = df.iloc[:, 3].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
• Now, let's train a simple linear regression model on the training data, and then move on to the evaluation part of the model using different metrics.
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Evaluation Metrics for Regression Task
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
• Mean Absolute Error (MAE)
• This is the simplest metric used to analyze the loss over the whole dataset.
• As we all know, the error is basically the difference between the predicted and actual values.
• Therefore, MAE is defined as the average of the absolute errors calculated.
• Here we take the modulus of each error, perform the summation, and then divide the result by the number of data points.
• It is a positive quantity and is not concerned with direction. The formula of MAE is given by: MAE = Σ|y_pred − y_actual| / N
mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:
Mean Absolute Error 1.7236295632503873
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Squared Error (MSE)
• The most commonly used metric is Mean Squared Error, or MSE.
• It is a function used to calculate the loss.
• We find the difference between the predicted values and the actual values, square the result, and then find the average over the whole dataset.
• MSE is always positive as we square the values.
• The smaller the MSE, the better the performance of our model. The formula of MSE is given by: MSE = Σ(y_pred − y_actual)² / N
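The value below is presumably produced with mean_squared_error, which was imported earlier (a minimal sketch):
# calculating MSE on the test predictions
mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Square Error", mse)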
Output:
Mean Square Error 3.9808057060106954
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Root Mean Squared Error(RMSE)
• RMSE is a popular metric and is the extended version of MSE (Mean Squared Error).
• This method is basically used to evaluate the performance of our model.
• It indicates how much the data points are spread around the best-fit line.
• It is the square root of the MSE, i.e., the standard deviation of the prediction errors.
• A lower value means that the data points lie closer to the best-fit line.
RMSE = √(Σ(y_pred − y_actual)² / N)
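A minimal sketch for the value below, taking the square root of the MSE (numpy is imported here just for the square root):
import numpy as np
rmse = np.sqrt(mean_squared_error(y_true=Y_test, y_pred=Y_pred))
print("Root Mean Square Error", rmse)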
Output:
Root Mean Square Error 1.9951956560725306
5: Introduction to Machine Learning Models
• Metrics for Evaluation
• Mean Absolute Percentage Error(MAPE)
• MAPE is basically used to express the error in terms of a percentage.
• It is based on the absolute difference between the actual and predicted values.
• Each error is then divided by the corresponding actual value.
• The results are then summed up and finally averaged. The smaller the percentage, the better the performance of the model.
• The formula is given by: MAPE = (Σ(|y_actual − y_pred| / |y_actual|)) / N
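A minimal sketch for the value below, using the mean_absolute_percentage_error imported earlier:
mape = mean_absolute_percentage_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Percentage Error", mape)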
Output:
Mean Absolute Percentage Error 0.02334408993333347
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Decision Tree Classifiers are a fundamental machine learning algorithm for classification tasks.
• They organize data into a tree-like structure where internal nodes represent decisions, branches represent outcomes, and leaf nodes represent class labels.
• The DecisionTreeClassifier from sklearn can perform multi-class classification on a dataset. The syntax for DecisionTreeClassifier is as follows:
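The slide's code is not preserved; a minimal sketch consistent with the iris workflow above (the split size and random_state are assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)  # assumed split
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))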
Output:
Accuracy: 0.9555555555555556
6: Supervised Machine Learning Models
• Decision Tree Classifier
• Hyperparameter Tuning with Decision Tree Classifier
• Hyperparameters are configuration settings that control the behavior of a decision tree model and significantly affect its performance.
• Proper tuning can improve accuracy, reduce overfitting, and enhance the generalization of the model.
• Popular methods for tuning include Grid Search, Random Search, and Bayesian Optimization, which explore different combinations to find the best configuration.
• Hyperparameter tuning using GridSearchCV
from sklearn.model_selection import GridSearchCV
tree = DecisionTreeClassifier(random_state=1)
# parameter grid: each key is a hyperparameter, each value a list of candidates
# (these values are illustrative; the slide's original grid is not shown)
param_grid = {'max_depth': [2, 3, 5, 10],
              'min_samples_leaf': [1, 2, 5],
              'criterion': ['gini', 'entropy']}
# GridSearchCV
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid, cv=5, verbose=True)
grid_search.fit(X_train, y_train)
• Here we defined the parameter grid with a set of hyperparameters and a list of possible values for each.
• GridSearchCV evaluates the different hyperparameter combinations for the DecisionTreeClassifier and selects the best combination of hyperparameters based on the performance across all k folds.
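After fitting, the chosen combination can be inspected via attributes provided by GridSearchCV:
print("Best parameters:", grid_search.best_params_)  # best hyperparameter combination found
print("Best CV score:", grid_search.best_score_)     # mean cross-validated score of that combination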
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
• Decision Tree visualization is used to interpret and comprehend the model's choices.
• We'll plot the fitted tree to see which features carry the greatest predictive power in its splits.
• Here we fetch the best estimator obtained from GridSearchCV as the decision tree classifier.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
tree_clf = grid_search.best_estimator_
plt.figure(figsize=(18, 15))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()
6: Supervised Machine Learning Models
• Visualizing Decision Tree Classifier
6: Unsupervised Machine Learning Models
• K Means Clustering
• K-means clustering is a technique used to organize data into groups based on their similarity.
• For example, an online store may use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers, and Big Spenders for personalised marketing.
• The algorithm works by first randomly picking some central points called centroids; each data point is then assigned to the closest centroid, forming a cluster.
• After all the points are assigned to a cluster, the centroids are updated by finding the average position of the points in each cluster.
• This process repeats until the centroids stop changing, at which point the clusters are formed.
• The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
• How does k-means clustering work?
• We are given a data set of items with certain features and values for these features (like a vector).
• The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into. A minimal sketch with sklearn's built-in KMeans follows; the subsequent slides then implement the algorithm from scratch.
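A quick illustration using sklearn's built-in KMeans (the dataset parameters and k are illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# toy data with 3 natural groups (parameters are illustrative)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
print(kmeans.cluster_centers_)  # the learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points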
6: Unsupervised Machine Learning Models
• K Means Clustering
6: Unsupervised Machine Learning Models
• K Means Clustering
• The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement. The algorithm works as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# generate the toy dataset X (this call is missing on the slide; its parameters are assumed)
X, y_true = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
• Another method is to initialize the means at random values between the boundaries of the data set.
• For example, if a feature x takes values in [0,3], we initialize the means with values for x drawn from [0,3].
6: Unsupervised Machine Learning Models
• K Means Clustering
Output: (scatter plot of the raw data points)
# initialize k random cluster centres (reconstructed; the slide preserved only the last lines of this snippet)
k = 3
clusters = {}
np.random.seed(23)
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)  # random point in [-2, 2] per feature
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster
clusters
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
Output:
• The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 5: Define Euclidean distance
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2)**2))
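A quick sanity check of the helper on a 3-4-5 right triangle:
print(distance(np.array([0, 0]), np.array([3, 4])))  # prints 5.0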
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 6: Create the function to Assign and Update the cluster center
The E-step assigns data points to the nearest cluster center, and the M-step updates the cluster centers based on the mean of the assigned points, as in K-means clustering.
# (reconstructed: the slide preserved only fragments of these two functions)
def assign_clusters(X, clusters):
    # E-step: assign each point to its nearest cluster centre
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters
def update_clusters(X, clusters):
    # M-step: move each centre to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
        clusters[i]['points'] = []
    return clusters
6: Unsupervised Machine Learning Models
• K Means Clustering
Step 7: Create the function to Predict the cluster for the datapoints
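The function body is not shown on the slide; a minimal sketch consistent with the distance helper and cluster structure above:
def pred_cluster(X, clusters):
    # label each point with the index of its nearest cluster centre
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred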
Step 8: Assign points to clusters, update the centres, and predict each point's cluster
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
Step 9: Plot the data points with their predicted cluster center
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
6: Unsupervised Machine Learning Models
• K Means Clustering
Output:
• The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.