
6 - Classification and Regression Tasks


Supervised Learning: Classification and Regression Tasks
Intro to AI and Data Science
NGN 112 – Fall 2024

Amer S. Zakaria
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Salam Dhou, CSE

Last Updated on: 13th of Nov. 2024


Table of Contents
2

Regression vs. Classification

Regression

Applications of Regression

Classification

Applications of Classification
Regression and Classification
3

◻ Regression is a method for understanding the relationship between independent variables (features) and a dependent variable (output). Once this relationship has been estimated, the output can be predicted from the inputs.

◻ Classification is a method for finding a function that divides the dataset into classes based on its variables. A computer program is trained on the training dataset and, based on that training, assigns data to the different classes.

◻ Common things:
▪ Regression and classification are both supervised learning methods
▪ They both require a dataset for training so they can make predictions
Regression versus Classification
4

Difference:
◻ In regression:

▪ Output is continuous (numbers).


▪ The purpose is to find the line or curve that best fits the data and predicts the
output more accurately.
◻ In classification:
▪ Output is discrete (class labels).
▪ The purpose is to find the decision boundary, which divides the dataset into
different classes.
Regression
5

◻ The objective is to find a line, curve, or surface that best fits the data.
◻ Finding the regression line or curve is an optimization problem. The best line or curve is the one that minimizes the distance (error) between the line or curve and the data points.
◻ Given the training data, a regression algorithm searches for the line or curve that best fits the data.
◻ This model (line or curve) is used later for prediction.
Exercise
6

Exercise: The following are regression tasks. Identify the input variable(s) and output variable(s) for each task.
Examples of Regression Tasks
7

◻ Predict the price of a house based on variables like size of the house,
number of rooms, school district, neighborhood, etc.

Input Variable(s):
• Size of the house
• Number of rooms
• School district
• Neighborhood
Output Variable(s):
• House price
Examples of Regression Tasks (cont.)
8

◻ Predict the net worth of people based on variables like their age,
income, education, etc.

Input Variable(s):
• Age
• Income
• Education
Output Variable(s):
• Person's net worth
Examples of Regression Tasks (cont.)
9

◻ Predict the sales amount of a new product based on advertising expenditure.

Input Variable(s):
• Advertising expenditure
Output Variable(s):
• Amount of sales


Examples of Regression Tasks (cont.)
10

◻ Predict wind velocity as a function of temperature, humidity, air pressure, etc.

Input Variable(s):
• Temperature
• Humidity
• Air pressure
Output Variable(s):
• Wind velocity
Types of Regression
11

◻ Regression
A statistical technique that uses independent (input) variables to predict the
outcome of a dependent (output) variable.
◻ Linear regression
The dependent variable shows a linear relationship with each of the independent variables.
◻ Non-linear regression
The dependent variable shows a non-linear relationship with the independent variables.
Types of Regression (cont.)
12

◻ Simple regression
It establishes a relationship between one independent (input) variable and one dependent (output) variable. It attempts to draw the line or curve that best fits the data and minimizes the regression errors.
▪ Example of Simple Linear Regression:
Equation of a line: $y = b_1 x_1 + b_0$
where $x_1$ is the input variable, $y$ is the output variable, and $b_1$, $b_0$ are the coefficients.
◻ Multiple regression
It establishes a relationship between multiple independent (input) variables and one dependent (output) variable.
▪ Example of Multiple Linear Regression:
$y = b_n x_n + \dots + b_2 x_2 + b_1 x_1 + b_0$
where $x_1, \dots, x_n$ are the input variables, $y$ is the output variable, and $b_n, \dots, b_0$ are the coefficients.
The objective in a regression problem is to calculate the coefficients using optimization techniques.
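The coefficient calculation can be sketched for the simple linear case. The snippet below is a minimal illustration that assumes ordinary least squares as the optimization criterion (the slides do not name a specific technique); the function name `fit_line` and the data are made up for the example.

```python
def fit_line(xs, ys):
    """Return (b1, b0) that minimize the sum of squared errors for y = b1*x + b0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the least-squares line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b1, b0

# Made-up data that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b1, b0 = fit_line(xs, ys)
print(b1, b0)  # 2.0 1.0
```

Because the sample points lie exactly on a line, the recovered coefficients match the line that generated them; with noisy data the same formulas give the best-fitting line in the least-squares sense.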
Simple Linear Regression
13

Example
◻ Estimating the net worth of people based on their age.

◻ One feature: Age, output: Net worth

[Figure: scatter plot of net worth (vertical axis) versus age (horizontal axis)]
Simple Linear Regression (cont.)
14

Example
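◻ If you want to draw a line representing the data, which of the following lines is the best?

[Figure: scatter plot of net worth (vertical axis) versus age (horizontal axis) with three candidate lines labeled A, B, and C]
Answer: Line B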
Simple Linear Regression: Training & Fitting
15

Example
◻ A simple linear regression model can be represented by a line that best fits the data.
◻ Can you give the model (line) equation, given a point that the line passes through?
[Figure: fitted line on axes of net worth versus age, with 500 marked on the net worth axis and 80 marked on the age axis]
Line equation: $y = b_1 x + b_0$
Here:
(Net worth) = $b_1$ (age) + $b_0$
where $b_1$ is the slope and $b_0$ is the y-intercept (the value of $y$ when $x = 0$)
(Net worth) = (500/80) (age) + 0
Simple Linear Regression: Prediction
16

Example
◻ Using this model, predict the net worth of a person of age 36

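[Figure: fitted line on axes of net worth versus age, with age 36 marked and the corresponding net worth shown as "?"]
Given the line equation:
(Net worth) = (500/80) (age) + 0
By substituting into the equation:
(Net worth) = (500/80) (36) + 0 = 225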
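The substitution above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic; the helper name `predict_net_worth` is hypothetical.

```python
# Coefficients taken from the line equation on the slide
b1 = 500 / 80   # slope
b0 = 0.0        # y-intercept

def predict_net_worth(age):
    """Evaluate the fitted line (Net worth) = b1 * (age) + b0."""
    return b1 * age + b0

prediction = predict_net_worth(36)
print(prediction)  # 225.0
```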
Evaluation of Regression: MSE
17

◻ Mean squared error (MSE) is a measure of prediction error: the average of the squared errors (differences) between the actual values and the predicted values.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
◻ Here $n$ is the number of data points, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
◻ The smaller the value of MSE (the closer to zero), the better.
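As a worked example of the MSE formula, the sketch below computes it by hand on four made-up points. The function name `mse_manual` is hypothetical; in practice the slides use `mean_squared_error` from scikit-learn.

```python
def mse_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

# Errors are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4
mse = mse_manual(y_true, y_pred)
print(mse)  # 1.5
```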
Evaluation of Regression: R2 Coefficient
18

◻ The coefficient of determination ($R^2$) is an accuracy measure. It measures how much of the variation in the output is explained by the input(s).
$$R^2 = 1 - \frac{SSE_{\text{Regression}}}{SSE_{\text{Total}}}$$
◻ Here
$$SSE_{\text{Regression}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{and} \quad SSE_{\text{Total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
◻ In $SSE_{\text{Total}}$, $\bar{y}$ is the mean of the actual values.
◻ The value of $R^2$ is typically between 0.0 and 1.0 (it can be negative when the model fits worse than always predicting the mean $\bar{y}$).
➢ 0.0 means the regression model is not doing a good job of capturing the trend in the data.
➢ 1.0 means the regression model is doing a good job of describing the relationship between the input(s) and the output.
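The two sums of squared errors can be computed directly. The sketch below uses made-up values and a hypothetical helper `r2_manual`; it also shows that a model that always predicts the mean gets $R^2 = 0$.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SSE_regression / SSE_total."""
    y_mean = sum(y_true) / len(y_true)
    sse_regression = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sse_total = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1.0 - sse_regression / sse_total

# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

# SSE_regression = 1 + 0 + 1 + 4 = 6, SSE_total = 9 + 1 + 1 + 9 = 20
r2 = r2_manual(y_true, y_pred)
print(round(r2, 2))  # 0.7

# Always predicting the mean (6.0) explains none of the variation
r2_mean_model = r2_manual(y_true, [6.0, 6.0, 6.0, 6.0])
print(r2_mean_model)  # 0.0
```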
Regression Task Pipeline
19

1. Data Preprocessing (perform normalization if necessary)


2. Splitting data into a training set and testing set
3. Selecting and creating the model
4. Training the model
5. Prediction using the trained model
6. Model evaluation
Common Python Codes – Regression
20

Regression Task Pipeline:
1. Data Preprocessing (perform normalization if necessary)
2. Splitting data into a training set and a testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
3. Selecting and creating the model (see slide)
4. Training the model:
regressor.fit(X_train, y_train)
5. Prediction using the trained model:
y_pred = regressor.predict(X_test)
6. Model evaluation
from sklearn.metrics import mean_squared_error, r2_score
MSE = mean_squared_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
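The pipeline steps can be strung together into one runnable script. The sketch below is an illustration only: it assumes synthetic, noise-free data (y = 3x + 2) and picks `LinearRegression` as the model, so the near-perfect scores say nothing about real datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Synthetic data (an assumption for illustration): y = 3x + 2, no noise
X = np.arange(20, dtype=float).reshape(-1, 1)   # 20 records, 1 feature
y = 3.0 * X.ravel() + 2.0

# 2. Split into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# 3. Select and create the model
regressor = LinearRegression()

# 4. Train the model
regressor.fit(X_train, y_train)

# 5. Predict using the trained model
y_pred = regressor.predict(X_test)

# 6. Evaluate the model
MSE = mean_squared_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
print(X_train.shape, X_test.shape)
print(round(MSE, 6), round(R2, 6))
```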
Applications of Regression in Engineering-
Example 1: Simple Regression
21

Predicting students scores based on study hours

Dataset link:
https://www.kaggle.com/datasets/himanshunakrani/student-study-hours?ref=machinelearningnuggets.com
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
22

◻ Data Preparation and Preprocessing

import pandas as pd

# Loading dataset
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.describe()
Output
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
23

◻ Data Preparation and Preprocessing

# Print the first 5 records


stud_scores.head()

Output
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
24

◻ Data Preparation and Preprocessing


# Creating the input and output variables
X = stud_scores['Hours'] # input variable
y = stud_scores['Scores'] # output variable

# The inputs to machine learning methods have to be arrays,
# so convert X to an array
X = X.to_numpy()
print(X)
Output
array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7,
5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8,
6.9, 7.8])
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
25

◻ Data Preparation and Preprocessing


# The data has to be represented as an array of records (a 2D array),
# so the input needs to be reshaped if it has a single feature.

# The function reshape changes the shape (dimensions) of an array
# without changing its data.

# The 1 argument indicates that we want 1 column. The -1 argument
# tells NumPy to determine the number of rows automatically from
# the total number of elements in the array.

X = X.reshape(-1, 1)
print(X)
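The effect of `reshape(-1, 1)` is easiest to see on a tiny array. The sketch below uses the first three `Hours` values from the dataset; the variable name `X_2d` is introduced only for this illustration.

```python
import numpy as np

# The first three study-hour values, as a 1D array
X = np.array([2.5, 5.1, 3.2])
print(X.shape)      # (3,)

# One column; NumPy infers the number of rows (3) from the array size
X_2d = X.reshape(-1, 1)
print(X_2d.shape)   # (3, 1)
print(X_2d)
```

Each record is now its own row with a single feature, which is the layout scikit-learn estimators expect for the input matrix.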
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
26

◻ Data Preparation and Preprocessing

# The training data passed to machine learning methods has to be arrays,
# so convert y to an array
y = y.to_numpy()
print(y)

Output

array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42,
17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86], dtype=int64)
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
27

◻ Splitting the data into training and testing


# Splitting the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# The parameter random_state controls the shuffling applied to the
# data before the split. It is set to None by default.

# Set random_state to an integer for reproducible output across
# multiple function calls (in other words, to get the same results).
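The reproducibility provided by `random_state` can be checked directly. The sketch below uses a made-up ten-record dataset and compares two calls that share the same integer seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 records with one feature each
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Two calls with the same integer seed produce identical splits
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.30, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.30, random_state=1)

same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)      # True
print(len(X_test_a))   # 3 records, i.e. 30% of 10
```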
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
28

◻ Creating the model

# Creating the LINEAR model (linear regression is used in this example)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

# OR
# Creating the non-linear (polynomial) model
# from sklearn.svm import SVR
# regressor = SVR(kernel='poly')  # degree 3 is the default value
# regressor = SVR(kernel='poly', degree=4)  # degree 4

# OR
# Creating the non-linear (RBF) model
# from sklearn.svm import SVR
# regressor = SVR(kernel='rbf')
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
29

◻ Training the model

# Training the model
import numpy as np
regressor.fit(X_train, y_train)

# Getting the coefficients and intercept.
# Note: coef_ is only available when using a linear model, so remove
# these lines for non-linear regression models.
# np.round is used because regressor.coef_ is an array.
print('Coefficient: ', np.round(regressor.coef_, 2))

# You can use np.round or round on a floating point variable.
print('Intercept: ', np.round(regressor.intercept_, 2))

Output
Coefficient: [10.41]
Intercept: -1.51
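The reported coefficient and intercept define the fitted line, so a prediction can be reproduced by hand. The sketch below uses the rounded values printed above and 1.1 study hours (one of the values in the dataset); because the coefficients are rounded, the result only approximates what `regressor.predict` would return.

```python
# Rounded values from the output above
b1 = 10.41   # coefficient (slope)
b0 = -1.51   # intercept

# Score = b1 * Hours + b0
hours = 1.1
predicted_score = b1 * hours + b0
print(round(predicted_score, 2))  # 9.94
```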
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
30

◻ Prediction using the trained model

# PREDICTION OF TEST RESULT


y_pred = regressor.predict(X_test)
print('Predictions:\n', y_pred)

Output
Predictions:
[ 9.93952968 32.84320126 18.26813752 86.97915227 48.45934097
78.65054442 61.99332873 75.52731648]
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
31

◻ Evaluating the model


# Model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
MSE = mean_squared_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
# Note: the function r2_score takes the true values (y_test) and the predicted ones (y_pred)

print("Mean squared error (MSE):", round(MSE, 2))
print('Coefficient of determination(R squared): ', round(R2, 2))

# Note: alternative way to calculate R squared
print('Coefficient of determination(R squared) using score function: ',
      round(regressor.score(X_test, y_test), 2))
# Note: the function score takes X_test and y_test
Output
Mean squared error (MSE): 56.09
Coefficient of determination(R squared): 0.89
Coefficient of determination(R squared) using score function: 0.89
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
32

◻ Evaluating the model


import matplotlib.pyplot as plt

# Plot the actual test values as a scatter plot
plt.scatter(X_test, y_test, label='Actual')
# Plot the values predicted by the regression model
plt.scatter(X_test, y_pred, label='Predicted')

plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
33

◻ Evaluating the model


import matplotlib.pyplot as plt

# Using all data
# Plot the basic scatter plot
plt.plot(X, y, 'o', label='Actual')
# 'o' makes a scatter plot; alternatively you can write:
# plt.scatter(X, y, label='Actual')
# Plot the values predicted by the regression model
y_pred_all_data = regressor.predict(X)
plt.plot(X, y_pred_all_data, 'o', label='Predicted')
plt.legend()
plt.show()
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
34

◻ Regression line/curve produced by several regression models:

[Figure: three panels showing the fitted line/curve for Linear Regression, Non-Linear Regression (Polynomial of degree 3), and Non-Linear Regression (RBF Kernel)]
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
35

California housing dataset

Dataset link:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
36

◻ Data Preparation and Preprocessing


from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing() #load the dataset


X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target
#OR
#X,y = fetch_california_housing(return_X_y = True)
# Extra: For creating a dataframe (next slide)
feature_names = housing_data.feature_names
target_names = housing_data.target_names
print('Feature names: ', feature_names)
print('\nTarget names: ', target_names)#Median house value for households
print('\nShape of dataset', X.shape)

Output
Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']
Target names: ['MedHouseVal']
Shape of dataset (20640, 8)
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
37

◻ Extra: Data Preparation and Preprocessing – Create a DataFrame


# If you want to display the dataset as a table,
# you need to convert X and y into a DataFrame
import pandas as pd

# Convert X and y into a DataFrame
df = pd.DataFrame(data=X, columns=feature_names)
df['House_Value'] = y # new column, target

# Print the DataFrame
df
df.head()
Output

# For DataFrame data, the features and target can be recovered with:
X = df.drop('House_Value', axis=1) # axis=1 drops a column (0 = row)
y = df['House_Value'] # target column
# Or
# selected_columns = ['label 1', 'label 2', ..]
# X = df[selected_columns]
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
38

◻ Splitting data into the training set and testing set

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Note: In this example, random_state is not set to an int, so expect
# different splits and consequently different results!

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

Output
(14448, 8)
(6192, 8)
(14448,)
(6192,)
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
39

◻ Creating the model


# importing the linearRegression class
from sklearn.linear_model import LinearRegression

# instantiate the Linear Regression model


regressor = LinearRegression()

#You can also create other non-linear models as in Example 1


Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
40

◻ Training the model

# training the model


regressor.fit(X_train, y_train)
# get the coefficients and intercept

import numpy as np
print("Coefficients:\n", np.round(regressor.coef_,2))
print('Intercept:\n', round(regressor.intercept_,2))

Output

Coefficients: [ 0.45 0.01 -0.13 0.84 -0. -0. -0.42 -0.44]


Intercept: -37.35
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
41

◻ Prediction using the trained model

# expose the model to new values and predict the target vector
y_predictions = regressor.predict(X_test)
print('Predictions:', y_predictions)

Output

Predictions: [2.54827248 2.98136965 2.10894987 ...


2.82017938 6.84565693 2.68012622]
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
42

◻ Model evaluation
#Model evaluation
from sklearn.metrics import mean_squared_error, r2_score
# The mean squared error
print("Mean squared error: " , round(mean_squared_error(y_test,
y_predictions),2))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: ", round(r2_score(y_test,
y_predictions),2))

#or you can use the score function as in Example 1

Output

Mean squared error: 0.55


Coefficient of determination: 0.57
Applications of Regression in Engineering-
Example 3: Multiple Regression
43

Diabetes Dataset

Dataset link:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
44

◻ Data Preparation and Preprocessing

from sklearn import datasets

# Load the diabetes dataset


diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
45

◻ Splitting data into the training set and testing set


# Split the data and targets - SHORT WAY
# Use the function train_test_split to split the data and targets into
# training and testing sets. The testing set is 20% of the data, and the
# rest is the training portion.
from sklearn.model_selection import train_test_split
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2)

# OR: Longer way: Check extra slide 100


Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
46

◻ Creating the model

from sklearn import linear_model

# Create linear regression object


regr = linear_model.LinearRegression()

# You can also create non-linear models as well


Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
47

◻ Training the model


# Train the model using the training set
regr.fit (diabetes_X_train, diabetes_y_train)

# Get the model equation


# The coefficients
print("Coefficients: \n", regr.coef_)
# The y-intercept
print("y-intercept: \n", regr.intercept_)

Output
Coefficients:
[ -30.77592231 -197.11523603 519.50634733 346.49118652 -688.21410873
431.49892496 19.3325826 94.20724607 716.79048049 75.26379265]
y-intercept:
152.09140122905802
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
48

◻ Prediction using the trained model


# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
49

◻ Model evaluation
from sklearn.metrics import mean_squared_error, r2_score

# The mean squared error


print("Mean squared error: " ,
round(mean_squared_error(diabetes_y_test, diabetes_y_pred),2))

# The coefficient of determination: 1 is perfect prediction


print("Coefficient of determination: ",
round(r2_score(diabetes_y_test, diabetes_y_pred),2))

Output

Mean squared error: 2197.35


Coefficient of determination: 0.57
Classification
50

❑ The objective is to find a decision boundary or decision surface that separates the classes.
❑ The decision boundary partitions the samples in the dataset into two or more sets, one for each class.
❑ Each machine learning algorithm has its own way of finding that decision boundary, that is, of drawing a line, set of lines, or curve to separate the classes.
❑ Different decision boundaries for the same dataset:
Exercise
51

Exercise: The following are classification tasks. Identify the input


variable(s) and output variable(s) for each task.
Examples of Classification Tasks
52

◻ Classifying credit card transactions as legitimate or fraudulent

Input Variable(s):
• Features of credit card transactions, such as date and time of transaction, amount, etc.
Output Variable(s):
• Classes such as legitimate or fraudulent
Examples of Classification Tasks (cont.)
53

◻ Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data

Input Variable(s):
• Satellite images
Output Variable(s):
• Classes such as water bodies, urban areas, forests, etc.
Examples of Classification Tasks (cont.)
54

◻ Categorizing news stories as finance, weather, entertainment, sports, etc.

Input Variable(s):
• News stories
Output Variable(s):
• Classes such as finance, entertainment, sports, etc.
Examples of Classification Tasks (cont.)
55

◻ Predicting tumor cells as benign or malignant

Input Variable(s):
• Features describing tumor shape and texture
Output Variable(s):
• Classes such as benign or malignant
Classification Algorithms
56

◻ There are several types of classification algorithms you can


use depending on the dataset you’re working with. The
following are five of the most common classification
algorithms:
▪ Decision Tree (DT)
▪ K-Nearest Neighbors (KNN)
▪ Naïve Bayes
▪ Logistic Regression
▪ Support Vector Machines (SVM)
Classification Algorithms (cont.)
57

There are different machine learning algorithms.

Training Phase: A learning algorithm is used to build the model.

Training/learning results in a trained model.

Testing Phase: The trained model is used for prediction.
Classification Algorithms- Decision Tree
58

◻ Decision Trees (DT)


▪ Decision Tree is a supervised learning technique that can be used for
both classification and Regression.
▪ It is a tree-structured classifier where internal nodes represent the
features of a dataset, branches represent the decision rules, and each
leaf node represents the outcome.
Classification Algorithms- DT (cont.)
59

◻ Decision nodes represent questions about features (e.g., Home Owner?), which have two or more branches (e.g., Yes and No).
◻ Leaf nodes (e.g., Defaulted Borrower --> Yes, Defaulted Borrower --> No) represent a classification or decision.
◻ Decision trees can handle both categorical and numerical data.
Classification Algorithms- DT (cont.)
60

◻ Decision trees allow you to ask multiple "linear questions" to classify a non-linearly separable dataset.
◻ Example 1: The following is a dataset with two features, Sun and Wind, and two classes:
■ Good day for surfing
■ Not a good day for surfing
[Figure: the dataset plotted with a sunny / not sunny axis and a not windy / windy axis, next to a sample decision tree that separates the classes using the questions "Windy?" (Yes / No) and "Sunny?" (Yes / No)]

Classification Algorithms- DT (cont.)
61

◻ Example 2: The following is a dataset with two features, X1 and X2, and two classes.
◻ Can we build the decision tree to classify this sample set?
◻ Hint: Start splitting using X1
Sample decision tree to separate the classes of this dataset:
X1 < 3?
  Yes --> X2 < 2? (Yes / No)
  No --> X2 < 4? (Yes / No)
Classification Algorithms- DT
62

◻ The DT algorithm decides where to split the data based on impurity.
◻ It finds split points that result in subsets that are as pure as possible.
◻ A subset is purer when most of the data in it belong to the same class.
[Figure: candidate splits of a dataset, with one annotated "This is a better split as it results in purer subsets"]
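Impurity can be made concrete with a small calculation. The sketch below assumes the Gini impurity, which is one common measure and the default criterion of scikit-learn's `DecisionTreeClassifier`; the slides do not say which measure they have in mind. The helper names and the two candidate splits are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_impurity(subsets):
    """Impurity of a split: subset impurities weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_impurity(s) for s in subsets)

# Two hypothetical ways of splitting the same 8 samples
split_a = [["yes", "yes", "yes", "yes"], ["no", "no", "no", "no"]]  # pure subsets
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]  # mixed subsets

impurity_a = weighted_impurity(split_a)
impurity_b = weighted_impurity(split_b)
print(impurity_a, impurity_b)  # 0.0 0.5
```

Split A has the lower impurity, so an algorithm that minimizes impurity would prefer it.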
Example of a Decision Tree

[Figure: the training data table and the decision tree model built from it. Splitting attributes: Home Owner, MarSt, Income]

Home Owner?
  Yes --> NO
  No --> MarSt?
    Married --> NO
    Single, Divorced --> Income?
      < 80K --> NO
      > 80K --> YES

63
Another Example of a Decision Tree

[Figure: the same training data with a different decision tree]

MarSt?
  Married --> NO
  Single, Divorced --> Home Owner?
    Yes --> NO
    No --> Income?
      < 80K --> NO
      > 80K --> YES

There could be more than one tree that fits the same data!
64
Apply the Trained Model to Predict the Class of a Test Sample

Start from the root of the tree.

[Figure: the decision tree from slide 63 (Home Owner, then MarSt, then Income), with the test sample shown beside it]

65
Apply the Trained Model to Predict the Class of a Test Sample (cont.)

[Figure: the decision tree from slide 63 (Home Owner, then MarSt, then Income), with the test sample shown beside it]

Predicted class is "No"

70
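Applying the tree to a test sample amounts to a chain of if-statements that starts at the root. The sketch below encodes the tree from slide 63; the function name, argument names, and both test samples are hypothetical, and treating an income of exactly 80K as "not below 80K" is an assumption because the slide only labels the branches "< 80K" and "> 80K".

```python
def predict_defaulted(home_owner, marital_status, income_k):
    """Walk the decision tree from the root and return the leaf label."""
    # Root node: Home Owner?
    if home_owner:
        return "No"
    # Second node: MarSt?
    if marital_status == "Married":
        return "No"
    # Single or Divorced: decide on income (in thousands)
    if income_k < 80:
        return "No"
    return "Yes"

# Hypothetical test samples
sample_1 = predict_defaulted(home_owner=False, marital_status="Married", income_k=80)
sample_2 = predict_defaulted(home_owner=False, marital_status="Single", income_k=95)
print(sample_1, sample_2)  # No Yes
```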
Classification Algorithms- DT: Python
71

◻ Example of classification using DT in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.tree import DecisionTreeClassifier

#create the classifier


clf = DecisionTreeClassifier()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))

Output
[1 2]
Classification Algorithms- KNN
72

◻ K-Nearest Neighbor (KNN)


▪ Simple algorithm for classification.
▪ Stores all the available training samples and classifies the new
samples based on the similarity measure (e.g., distance
functions).
▪ Basic idea: If it walks like a duck, quacks like a duck, then it’s
probably a duck.
Classification Algorithms- KNN (cont.)
73

◻ Requires the following:


▪ A set of labeled records
▪ Proximity metric to compute
distance/similarity between a pair
of records. For example, calculating
the Euclidean distance between the
sample pairs.
▪ The value of K is the number of
nearest neighbors to consider.
▪ One method for using class labels of
K nearest neighbors to determine
the class label of an unknown
record is, for example, by taking a
majority vote.
Classification Algorithms- KNN (cont.)
74

◻ The K-nearest neighbors of a record x are the data points that have the k smallest distances to x.
◻ Different values of k affect the predicted class.

[Figure: three panels showing the same record with different values of k. Predicted class is -; predicted class is either + or -; predicted class is +]
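The KNN procedure (compute distances, keep the k closest records, take a majority vote) fits in a few lines of plain Python. The sketch below is an illustration, not scikit-learn's implementation; it assumes Euclidean distance, the function name `knn_predict` is hypothetical, and the six labeled samples are the ones used in the Python examples in these slides.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training record
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Keep the labels of the k closest records
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # Majority vote (ties go to the label encountered first, i.e. the closer one)
    return Counter(k_labels).most_common(1)[0][0]

X_train = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y_train = [1, 1, 1, 2, 2, 2]

pred_a = knn_predict(X_train, y_train, [-0.8, -1], k=3)
pred_b = knn_predict(X_train, y_train, [4, 1], k=3)
print(pred_a, pred_b)  # 1 2
```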
Classification Algorithms- KNN: Python
75

◻ Example of classification using KNN in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.neighbors import KNeighborsClassifier

#create the classifier


clf = KNeighborsClassifier()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))

Output
[1 2]
Classification Algorithms- NB
76

❑ Naïve Bayes (NB): a probabilistic framework for solving classification problems, based on Bayes' theorem:
$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$
❑ Consider each attribute and class label as random variables
❑ Given a record $X$ with attributes $(X_1, X_2, \dots, X_d)$
▪ Goal is to predict class $Y$
▪ Specifically, we want to find the value of $Y$ that maximizes $P(Y \mid X)$

❑ We can estimate $P(Y \mid X)$ directly from data.


Classification Algorithms- NB: Example
77

❑ Example:
❑ Assume you have two friends, Adam and Lena.
❑ You received a message from one of them, but you do not know who the sender is.
❑ You would like to use machine learning to "predict" the sender.
❑ Assuming equal prior probability: Both Adam and Lena have access to the internet and can write emails.
■ P(Adam) = 0.5
■ P(Lena) = 0.5
Classification Algorithms- NB: Example (cont.)
78

◻ The model is trained on the following probabilities that


describe the frequency of mentioning specific words by each
of the persons in their conversations (assume the language
consists of three words for simplicity)
◻ Lena mentions ‘Love’ in 50% of her conversations, while she
mentions ‘Deal’ and ‘Life’ in 20% and 30% of her
conversations, respectively.
Classification Algorithms- NB: Example (cont.)
79

◻ Assume you received an Email with contents:


Love!
◻ Whom do you think would be the sender of the email?
Lena! Why? Because Lena has higher probability of using the
word ‘Love’ than Adam.
Classification Algorithms- NB: Example (cont.)
80

◻ Assume you received an Email with contents:


Love Life!
◻ Whom do you think would be the sender of the email?
Lena! Why? Because Lena has higher probability of using both
the word ‘Love’ and the word ‘Life’ than Adam.
Classification Algorithms- NB: Example (cont.)
81

◻ Assume you received an Email with contents:


Life Deal!
◻ Whom do you think would be the sender of the email?
Let’s calculate:
P(Adam is sender of “Life Deal”) = prior_probability × P(Adam saying ‘life’) × P(Adam saying ‘Deal’)
= 0.5 × 0.1 × 0.8 = 0.04
P(Lena is sender of “Life Deal”) = prior_probability × P(Lena saying ‘life’) × P(Lena saying ‘Deal’)
= 0.5 × 0.3 × 0.2 = 0.03
Adam! Why? because of the higher probability
Classification Algorithms- NB: Example (cont.)
82

◻ Assume you received an Email with contents:


Love Deal!
◻ Whom do you think would be the sender of the email?
Let’s calculate: P(Adam is sender of “Love Deal”) = 0.5 × 0.1 × 0.8 = 0.04
P(Lena is sender of “Love Deal”) = 0.5 × 0.5 × 0.2 = 0.05
Lena! Why? because of the higher probability
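The calculations in this example can be reproduced in a short script. In the sketch below, Lena's word probabilities come from the description above and Adam's (Love 0.1, Life 0.1, Deal 0.8) are read off the calculations on the slides; the function names are hypothetical.

```python
# Equal prior probabilities for the two senders
priors = {"Adam": 0.5, "Lena": 0.5}

# Probability of each sender using each word
word_probs = {
    "Adam": {"Love": 0.1, "Deal": 0.8, "Life": 0.1},
    "Lena": {"Love": 0.5, "Deal": 0.2, "Life": 0.3},
}

def sender_scores(message):
    """Score per sender: prior times the product of the word probabilities."""
    scores = {}
    for person, prior in priors.items():
        score = prior
        for word in message.split():
            score *= word_probs[person][word]
        scores[person] = score
    return scores

def predict_sender(message):
    """Return the sender with the highest score."""
    scores = sender_scores(message)
    return max(scores, key=scores.get)

for msg in ["Love", "Love Life", "Life Deal", "Love Deal"]:
    print(msg, "->", predict_sender(msg))
```

The scores are not normalized by the probability of the message, which is fine for choosing a sender because that factor is the same for both candidates.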
Classification Algorithms- NB: Python
83

◻ Example of classification using NB in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.naive_bayes import GaussianNB

#create the classifier


clf = GaussianNB()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))
Output
[1 2]
Evaluation of Classification: Accuracy
84

◻ Accuracy measures how many samples were classified correctly out of the total number of samples used in the prediction.
$$\text{accuracy} = \frac{\text{Number of correctly classified samples}}{\text{Total number of samples used in the prediction}}$$
◻ The value of accuracy is between 0.0 and 1.0.
➢ 0.0 means the model did not make any correct predictions.
➢ 1.0 means the model predicted ALL the tested samples correctly.
➢ The higher the accuracy (closer to 1.0), the better the model.
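A quick worked example of the accuracy formula, using made-up true and predicted labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 4 of the 5 predictions are correct
y_true = [1, 1, 2, 2, 1]
y_pred = [1, 2, 2, 2, 1]

acc = accuracy(y_true, y_pred)
print(acc)  # 0.8
```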
Classification Task Pipeline
85

1. Data Preprocessing, which includes labeling any non-numerical


output, and, if necessary, normalizing numerical data.
2. Splitting data into a training set and testing set
3. Selecting and creating the classification model
4. Training the model
5. Prediction using the trained model
6. Model evaluation
Common Python Codes – Classification

Classification Task Pipeline:
1. Data Preprocessing (if necessary: reshaping, labeling, and normalization)
2. Splitting data into a training set and a testing set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
3. Selecting and creating the model (see slide)
4. Training the model:
clf.fit(X_train, y_train)
5. Prediction using the trained model:
y_pred = clf.predict(X_test)
6. Model evaluation: Calculating accuracy
clf.score(X_test, y_test)
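The six steps can be strung together into one runnable script. The sketch below assumes the Iris dataset and a K-Nearest Neighbors classifier purely for illustration; any of the classifiers covered in this lecture could take the place of `KNeighborsClassifier`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Data preprocessing: Iris is already numeric, so no labeling is needed here
X, y = load_iris(return_X_y=True)

# 2. Split into a training set (70%) and a testing set (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# 3. Select and create the model (KNN is an arbitrary choice for this sketch)
clf = KNeighborsClassifier()

# 4. Train the model
clf.fit(X_train, y_train)

# 5. Predict the labels of the test features
y_pred = clf.predict(X_test)

# 6. Evaluate: accuracy on the testing set
accuracy = clf.score(X_test, y_test)
print("Accuracy is:", round(accuracy, 2))
```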


Applications of Classification in Engineering-
Example 1: Data Processing

◻ Data Preprocessing

from sklearn.datasets import load_iris  # import the dataset

# Explore the dataset
# data = load_iris(return_X_y=False)
# print(data.target_names)  # class labels

X, y = load_iris(return_X_y=True)
# The output is already encoded as numbers, so no labeling is needed

X = X[:, :2]  # take only the first two features for visualization purposes
# Other option: X = X[:, [0, 3]] selects the 1st and 4th feature columns
Dataset link:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
Applications of Classification in Engineering-
Example 1: Data Processing with Labeling

◻ Data Preprocessing, with labeling


◻ If target classes are categorical text, you need to encode them as numbers before you proceed with machine learning. Here is an example.
◻ Using the simplest form of encoding, which is a Label Encoder:
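A minimal sketch of label encoding with scikit-learn's `LabelEncoder` follows. The text labels in `y_text` are made up for illustration and are not taken from the original slide.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels (made up for illustration)
y_text = ["setosa", "virginica", "versicolor", "setosa", "virginica"]

encoder = LabelEncoder()
# Classes are sorted alphabetically and then numbered starting from 0
y = encoder.fit_transform(y_text)

print(list(encoder.classes_))                    # the class names, in encoded order
print(list(y))                                   # the encoded labels
print(list(encoder.inverse_transform([0, 2])))   # map numbers back to text
```

Keeping the fitted `encoder` around makes it possible to turn numeric predictions back into readable class names with `inverse_transform`.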
Applications of Classification in Engineering-
Example 1: Data Splitting

◻ Splitting data into a training set and testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Note: random_state is not set to an int, so you may get different results on each run
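To see the effect of `random_state`, the following sketch splits the Iris data twice with the same seed and checks that both splits are identical. The seed value 42 is arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Using the same integer random_state reproduces exactly the same split
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
print(X_test_a.shape)  # 20% of the 150 Iris samples
```

Fixing the seed is useful when comparing classifiers, because every model is then evaluated on the same testing samples.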
Applications of Classification in Engineering-
Example 1: Selecting and Creating a Model

◻ Selecting and creating the classifier model

from sklearn.tree import DecisionTreeClassifier #import the decision tree class


from sklearn.neighbors import KNeighborsClassifier #import the KNN class
from sklearn.naive_bayes import GaussianNB #import the NB class
from sklearn.svm import LinearSVC #import the Linear Support Vector Classifier
from sklearn.svm import SVC #import the Support Vector Classifier
Applications of Classification in Engineering-
Example 1: Selecting and Creating a Model

◻ Creating the model
◻ It can be ONE of the following (let's consider the Decision Tree Classifier):
# Create the classifier
# Decision Tree Classifier
clf = DecisionTreeClassifier()

# K-Nearest Neighbors Classifier
clf = KNeighborsClassifier()

# Naive Bayes Classifier
clf = GaussianNB()

# Linear SVM
clf = LinearSVC()

# Non-Linear SVM with polynomial kernel
clf = SVC(kernel='poly')

# Non-Linear SVM with Radial Basis Function (RBF) kernel
clf = SVC(kernel='rbf')
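Since all of these classifiers share the same `fit` / `score` interface, they can be compared in a loop on one fixed split. The sketch below is an illustration under assumptions: the fixed `random_state=0`, the `max_iter=10000` setting for `LinearSVC`, and the dictionary names are choices made here, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # first two features, as in Example 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Linear SVM": LinearSVC(max_iter=10000),  # more iterations to help convergence
    "SVM (poly kernel)": SVC(kernel="poly"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # train on the same training set
    scores[name] = clf.score(X_test, y_test)   # accuracy on the same testing set
    print(f"{name}: {scores[name]:.2f}")
```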
Applications of Classification in Engineering-
Example 1: Training

◻ Training the model

# Train the model by calling the function fit and passing the training features and labels
clf = clf.fit(X_train, y_train)


Applications of Classification in Engineering-
Example 1: Using the Model for Prediction

◻ Prediction using the trained model

#predicting the labels of the test features


y_pred = clf.predict(X_test)

print(y_pred)

Applications of Classification in Engineering-
Example 1: Model Evaluation using Accuracy

◻ Model evaluation
# Evaluate the model
# Print the accuracy score
print('Accuracy is:', round(clf.score(X_test, y_test), 2))

Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting (DT)

◻ Model evaluation via Plotting for DT algorithms


#plot the decision tree, only if using DT classifier
from sklearn import tree
tree.plot_tree(clf)

Output: [Figure: plot of the trained decision tree]
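When a figure is not convenient, the learned tree can also be printed as text rules with `sklearn.tree.export_text`. The sketch below is self-contained; limiting the tree with `max_depth=2` and the feature names passed in are assumptions made to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # first two features: sepal length and sepal width

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the decision rules as indented text instead of a plot
rules = export_text(clf, feature_names=["sepal length", "sepal width"])
print(rules)
```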
Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting

◻ Plot the Decision Boundary produced by the classifier:


# Plot the decision boundary by calling plot_decision_regions from the mlxtend library
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

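Output: [Figure: decision regions of the classifier, with Sepal Length on the x-axis and Sepal Width on the y-axis]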
Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting

◻ Decision Boundaries produced by several classifiers:


[Figure: decision boundaries in six panels: DT, KNN, NB, Linear SVM, SVM with poly kernel, SVM with RBF kernel]
Summary: Regression vs. Classification

◻ Differences between classification and regression

Property | Classification | Regression
Output type | Discrete (class labels) | Continuous (numbers)
Purpose | To find the decision boundary, which divides the dataset into different classes. | To find the line/curve that best fits the data and predicts the output more accurately.
Evaluation | Accuracy, and other measures such as F-score, precision, recall, and confusion matrix | MSE, R², and other metrics such as SSE, MAE, and MAPE.
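The additional classification measures named in the table are available in `sklearn.metrics`. A small sketch on made-up binary labels (not data from this lecture) shows how they relate to accuracy:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical binary labels for 10 test samples (made up for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
prec = precision_score(y_true, y_pred)  # true positives / predicted positives
rec = recall_score(y_true, y_pred)      # true positives / actual positives

print(cm)
print(acc, prec, rec)
```

Here 7 of the 10 predictions are correct (accuracy 0.7), 4 of the 5 predicted positives are truly positive (precision 0.8), and 4 of the 6 actual positives are found (recall of about 0.67).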
Learning Outcomes

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
Extra: Applications of Regression in
Engineering- Example 3: Multiple Regression

◻ Another approach to splitting data into the training set and testing set
# Split the data and targets - SHORT WAY
# Use the function train_test_split to split the data and targets into
# training and testing sets. Testing data size is 20% of the data, and the rest
# is the training portion
from sklearn.model_selection import train_test_split
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2)

# OR: Split the data into training and testing sets - LONG WAY
diabetes_X_train = diabetes_X[:-20]  # the first part of the array, excluding the last 20 records
diabetes_X_test = diabetes_X[-20:]  # the last 20 records

# Split the labels into training and testing sets
diabetes_y_train = diabetes_y[:-20]  # the first part of the array, excluding the last 20 records
diabetes_y_test = diabetes_y[-20:]  # the last 20 records
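The two approaches above are not equivalent: the short way shuffles the data and holds out 20% of it, while the long way keeps the original order and holds out exactly the last 20 records. The sketch below, which assumes `diabetes_X` and `diabetes_y` come from `load_diabetes`, shows that `train_test_split` reproduces the long way when it is given `test_size=20` and `shuffle=False`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Assumption: diabetes_X and diabetes_y are the features and targets of load_diabetes
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)

# LONG WAY: slice off the last 20 records by hand
X_train_long, X_test_long = diabetes_X[:-20], diabetes_X[-20:]
y_train_long, y_test_long = diabetes_y[:-20], diabetes_y[-20:]

# SHORT WAY without shuffling: an integer test_size means "this many records"
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=20, shuffle=False)

same = (np.array_equal(X_train, X_train_long)
        and np.array_equal(X_test, X_test_long)
        and np.array_equal(y_train, y_train_long)
        and np.array_equal(y_test, y_test_long))
print(same)
```

Shuffling (the default) is usually preferable unless the order of the records carries meaning, for example in time-ordered data.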
