
Supervised Learning: Classification and

Regression Tasks
Intro to AI and Data Science
NGN 112 – Fall 2024

Ammar Hasan
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Salam Dhou, CSE

Last Updated on: 22nd of August 2024


Table of Contents
2

Regression vs. Classification

Regression

Applications of Regression

Classification

Applications of Classification
Regression and Classification
3

◻ Regression is a method for understanding the relationship between


independent variables or features and a dependent variable or output.
Output can be predicted once the relationship between independent and
dependent variables has been estimated.

◻ Classification is a method for finding a function that helps in dividing the


dataset into classes based on different variables. In Classification, a
computer program is trained on the training dataset and based on that
training, it categorizes the data into different classes.

◻ What they have in common:
▪ Regression and classification are both supervised learning methods
▪ They both require a dataset for training so they can make predictions
Regression versus Classification
4

Difference:
◻ In regression:

▪ Output is continuous (numbers).


▪ The purpose is to find the line or curve that best fits the data and predicts the
output more accurately.
◻ In classification:
▪ Output is discrete (class labels).
▪ The purpose is to find the decision boundary, which divides the dataset into
different classes.
5

Regression
Regression
6

◻ The objective is to find a line, curve, or surface that best fits


the data.
◻ Finding the regression line or curve is an optimization
problem. The best line or curve is the one that minimizes the
distance (error) between the line or curve and the data
points.
◻ Given the training data, the regression algorithms try to find
the best line or curve that best fits the data.
◻ This model (line or curve) is used later for prediction.
Exercise
7

Exercise: The following are regression tasks. Identify the input variable(s) and output variable(s) for each task.
Examples of Regression Tasks
8

◻ Predict the price of a house based on variables like size of the house,
number of rooms, school district, neighborhood, etc.

Input Variable(s): size of the house, number of rooms, school district, neighborhood
Output Variable(s): house price
Examples of Regression Tasks (cont.)
9

◻ Predict the net worth of people based on variables like their age,
income, education, etc.

Input Variable(s): age, income, education
Output Variable(s): person's net worth
Examples of Regression Tasks (cont.)
10

◻ Predicting sales amounts of a new product based on advertising expenditure.

Input Variable(s): advertising expenditure
Output Variable(s): amount of sales


Examples of Regression Tasks (cont.)
11

◻ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

Input Variable(s): temperature, humidity, air pressure
Output Variable(s): wind velocity
Types of Regression
12

◻ Linear regression
A statistical technique that uses independent (input) variables to predict the
outcome of a dependent (output) variable. The dependent variable shows
a linear relationship with each of the independent variables.

◻ Non-linear regression
The dependent variable shows a non-linear relationship with the
independent variables.
Types of Regression (cont.)
13

◻ Simple regression
It establishes a relationship between one independent (input) variable and one
dependent (output) variable. It attempts to draw a line or curve that fits the data most
and minimizes regression errors.
▪ Example of Simple Linear Regression:
Equation of a line: $y = b_1 x_1 + b_0$
where $x_1$ is the input variable, $y$ is the output variable, and $b_1$, $b_0$ are the coefficients.
◻ Multiple regression
It establishes a relationship between multiple independent (input) variables and one
dependent (output) variable.
▪ Example of Multiple Linear Regression:
$y = b_n x_n + \dots + b_2 x_2 + b_1 x_1 + b_0$
where $x_1, \dots, x_n$ are the input variables, $y$ is the output variable, and $b_n, \dots, b_0$ are the coefficients.
The objective in a regression problem is to calculate the coefficients using optimization techniques.
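As a concrete illustration of that optimization, NumPy's least-squares polynomial fit can estimate the coefficients of a simple linear model. This is a minimal sketch; the toy data are assumptions for illustration.

import numpy as np

# Toy data (assumed for illustration): one input variable and one output variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# A degree-1 polyfit solves the least-squares problem for a line y = b1*x + b0
b1, b0 = np.polyfit(x, y, 1)
print('b1 (slope):', round(b1, 2))
print('b0 (intercept):', round(b0, 2))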
Simple Linear Regression
14

Example
◻ Estimating the net worth of people based on their age.

◻ One feature: Age; output: Net worth

[Scatter plot of net worth vs. age]
Simple Linear Regression (cont.)
15

Example
◻ If you want to draw a line representing the data, which line of

the following is the best?

[Scatter plot of net worth vs. age with three candidate lines A, B, and C]

Answer: Line B
Simple Linear Regression: Training & Fitting
16

Example
◻ Simple Linear Regression model can be represented by a line that
best fits the data.
◻ Can you give the model (line) equation? Given a point that the line
passes through.
Line equation: $y = b_1 x + b_0$
Here: (Net worth) $= b_1$ (age) $+ b_0$, where $b_1$ is the slope and $b_0$ is the y-intercept (the value of y when x = 0).
The line passes through the origin and the point (age = 80, net worth = 500), so:
(Net worth) = (500/80) (age) + 0
Simple Linear Regression: Prediction
17

Example
◻ Using this model, predict the net worth of a person of age 36

Given the line equation: (Net worth) = (500/80) (age) + 0
Substituting age = 36: (Net worth) = (500/80) (36) + 0 = 225
Evaluation of Regression: MSE
18

◻ Mean squared error (MSE) is an accuracy measure: the average of the squared errors (differences) between the predicted values and the actual values.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

◻ Here $n$ is the number of actual points, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
◻ The smaller the value of MSE (closer to zero), the better.
Evaluation of Regression: R2 Coefficient
19

◻ Coefficient of Determination ($R^2$) is an accuracy measure. It measures how much of the change in the output is explained by the change in the input.

$$R^2 = 1 - \frac{SSE_{\text{Regression}}}{SSE_{\text{Total}}}$$

◻ Here

$$SSE_{\text{Regression}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad\text{and}\quad SSE_{\text{Total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

◻ In $SSE_{\text{Total}}$, $\bar{y}$ is the mean of the actual values.
◻ The value of $R^2$ is between 0.0 and 1.0.
➢ 0.0 means the regression model is not doing a good job of capturing the trend in the data.
➢ 1.0 means the regression model is doing a good job of describing the relationship between the input(s) and the output.
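Both formulas can be evaluated directly in a few lines of NumPy. A minimal sketch; the actual and predicted values below are assumptions for illustration.

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])     # the y_i values (assumed)
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])  # the y-hat values (assumed)

n = len(y_actual)
MSE = np.sum((y_actual - y_predicted) ** 2) / n

SSE_regression = np.sum((y_actual - y_predicted) ** 2)
SSE_total = np.sum((y_actual - np.mean(y_actual)) ** 2)
R2 = 1 - SSE_regression / SSE_total

print('MSE:', round(MSE, 3))
print('R squared:', round(R2, 3))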
Regression Task Pipeline
20

1. Data Preprocessing (perform normalization if necessary)


2. Splitting data into a training set and testing set
3. Selecting and creating the model
4. Training the model
5. Prediction using the trained model
6. Model evaluation
Tasks in Regression
21

1. Data Preprocessing → 2. Splitting data into training set and testing set → 3. Selecting or creating the model → 4. Training the model → 5. Prediction using the trained model → 6. Model evaluation
Common Python Codes – Regression
22

Regression Task Pipeline:


1. Data Preprocessing (perform normalization if necessary)
2. Splitting data into training set and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state = 1)

3. Selecting and creating the model choose one of the codes from next slide
4. Training the model:
regressor.fit(X_train, y_train)

5. Prediction using the trained model:


y_pred = regressor.predict(X_test)

6. Model evaluation
from sklearn.metrics import mean_squared_error, r2_score
MSE = mean_squared_error(y_test,y_pred)
R2 = r2_score(y_test,y_pred)
Details of Selecting and creating a model
23

◻ Selecting and Creating the model

# Creating the LINEAR model


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

# OR
# Creating the non-linear (polynomial) model
from sklearn.svm import SVR
regressor = SVR(kernel = 'poly') #degree 3 is the default value
#regressor = SVR(kernel = 'poly', degree = 4) #degree 4

# OR
# Creating the non-linear (RBF) model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
24

Applications of Regression
Applications of Regression in Engineering-
Example 1: Simple Regression
25

Predicting students scores based on study hours

Input Variable(s) Output Variable(s)


• Hours of Study Score

Dataset link: https://www.kaggle.com/datasets/himanshunakrani/student-study-hours?ref=machinelearningnuggets.com
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
26

◻ Data Preparation and Preprocessing


import pandas as pd

# LOADING DATASET
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.describe()

Output
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
27

◻ Data Preparation and Preprocessing

stud_scores.head() #prints the first 5 records

Output
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
28

◻ Data Preparation and Preprocessing


# creating input data and output variable
X = stud_scores['Hours'] # input variable
y = stud_scores['Scores'] # output variable

#the input to machine learning methods have to be arrays


#converting X to an array

X = X.to_numpy()
print(X)
Output
array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7,
5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8,
6.9, 7.8])
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
29

◻ Data Preparation and Preprocessing


#The data has to be represented as an array of records.
#We need to reshape the input if it has a single feature.

#The function reshape changes the shape (dimensions) of an array without changing its data.
#The 1 argument indicates that we want to have 1 column.
#The -1 argument indicates that we want NumPy to automatically determine the number of rows needed based on the total number of elements in the array.

X = X.reshape(-1, 1)
print(X)

Output: (the 25 values reshaped into a 25x1 column array)
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
30

◻ Data Preparation and Preprocessing

#the input to machine learning methods have to be arrays


#converting y to an array
y = y.to_numpy()
print(y)

Output
array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42,
17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86], dtype=int64)
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
31

◻ Splitting the data into training and testing

# SPLITTING THE DATA


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30, random_state = 1)

#The parameter random_state controls the shuffling applied to the data before applying the split. It is set to None by default.
#Set random_state to an integer for reproducible output across multiple function calls (in other words, if you want to get the same results).
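A small sketch of this behavior (the toy arrays are assumptions): with the same integer random_state, two calls return identical splits.

import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10).reshape(-1, 1)  # toy feature column (assumed)
y_demo = np.arange(10)                 # toy targets (assumed)

# Same seed -> identical splits on every call
_, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
_, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
print(np.array_equal(a_test, b_test))  # True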
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
32

◻ Creating the model

# Creating the LINEAR model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

# Let's consider Linear Regression in this example.

# OR
# Creating the non-linear (polynomial) model
from sklearn.svm import SVR
regressor = SVR(kernel = 'poly') #degree 3 is the default value
#regressor = SVR(kernel = 'poly', degree = 4) #degree 4

# OR
# Creating the non-linear (RBF) model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
33

◻ Training the model (note: coef_ is only available when using a linear model)

#Training the model
import numpy as np
regressor.fit(X_train, y_train)

# GETTING THE COEFFICIENTS AND INTERCEPT (for linear models only)
print('Coefficient: ', np.round(regressor.coef_, 2))
# We use np.round because regressor.coef_ is an array. Remove this operation for nonlinear regression models.
print('Intercept: ', np.round(regressor.intercept_, 2))
# You can use np.round or round on a floating-point variable.

Output
Coefficient: [10.41]
Intercept: -1.51
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
34

◻ Prediction using the trained model

# PREDICTION OF TEST RESULT


y_pred = regressor.predict(X_test)
print('Predictions:\n', y_pred)

Output
Predictions:
[ 9.93952968 32.84320126 18.26813752 86.97915227 48.45934097
78.65054442 61.99332873 75.52731648]
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
35

◻ Evaluating the model


#Model Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
MSE = mean_squared_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
#Note: the function r2_score takes the true labels (y_test) and the predicted ones (y_pred)

print("Mean squared error (MSE):", round(MSE, 2))
print('Coefficient of determination (R squared): ', round(R2, 2))

#Note: an alternative way to calculate R squared
print('Coefficient of determination (R squared) using score function: ', round(regressor.score(X_test, y_test), 2))
#Note: the function score takes X_test and y_test

Output
Mean squared error (MSE): 56.09
Coefficient of determination (R squared): 0.89
Coefficient of determination (R squared) using score function: 0.89
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
36

◻ Evaluating the model


import matplotlib.pyplot as plt

#plot basic scatterplot
plt.scatter(X_test, y_test, label = 'Actual')

#plot the predicted points
plt.scatter(X_test, y_pred, label = 'Predicted')

plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

Output: (scatter plot of actual and predicted points)
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
37

◻ Evaluating the model


import matplotlib.pyplot as plt

#Using all data
#plot the basic scatterplot
plt.plot(X, y, 'o', label = 'Actual')
#'o' makes a scatter plot; alternatively you can write: plt.scatter(X, y, label = 'Actual')

#plot the predicted regression line
y_pred_all_data = regressor.predict(X)
plt.plot(X, y_pred_all_data, 'o', label = 'Predicted')

Output: (scatter plot of the data with the model's predictions)
Applications of Regression in Engineering-
Example 1: Simple Regression (cont.)
38

◻ Regression line/curve produced by several regression models:

[Plots: Linear Regression; Non-Linear Regression (polynomial of degree 3); Non-Linear Regression (RBF kernel)]
Complete code for Student Scores Dataset
after excluding some optional code

Note: upload student_scores.csv in Colab before running this code.

#Step 1: Importing data and preprocessing
import pandas as pd
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.describe()
X = stud_scores['Hours'] # input variable
y = stud_scores['Scores'] # output variable
X = X.to_numpy() # convert to numpy because machine learning algorithms use numpy
X = X.reshape(-1, 1) # convert to a column for use with machine learning algorithms
y = y.to_numpy() # convert to numpy because machine learning algorithms use numpy
#Step 2: Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
#Step 3: Selecting and Creating the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
#Step 4: Training the model on the training set
regressor.fit(X_train, y_train)
#Optional code to print out the coefficients
import numpy as np
print("Coefficients:\n", np.round(regressor.coef_, 2))
print('Intercept:\n', round(regressor.intercept_, 2))
#Step 5: Using the trained model, predict the output for the testing set
y_pred = regressor.predict(X_test)
#Step 6: Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: ", round(mean_squared_error(y_test, y_pred), 2))
print("Coefficient of determination: ", round(r2_score(y_test, y_pred), 2))
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
40

Input Variable(s): longitude of district, latitude of district, median house age, population of district, total rooms, total bedrooms, median income, total houses
Output Variable(s): median house value
Dataset link: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
41

◻ Data Preparation and Preprocessing


from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing() #load the dataset


X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target
#OR
#X,y = fetch_california_housing(return_X_y = True)

feature_names = housing_data.feature_names
target_names = housing_data.target_names
print('Feature names: ', feature_names)
print('\nTarget names: ', target_names)#Median house value for households
print('\nShape of dataset', X.shape)

Output
Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']
Target names: ['MedHouseVal']
Shape of dataset (20640, 8)
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
42

◻ Data Preparation and Preprocessing


#If you want to display the dataset and visualize the table,
#you need to convert X and y into a DataFrame

import pandas as pd

# Convert X and y into a DataFrame
df = pd.DataFrame(data=X, columns=feature_names)
df['House_Value'] = y # new column, target

# Print the DataFrame
df
df.head()

Output: (the first rows of the DataFrame)

Side note, for DataFrame data:
X = df.drop('House_Value', axis=1) # axis=1 drops a column (axis=0 would drop a row)
y = df['House_Value'] # target column
# Or select specific columns:
selected_columns = ['label 1', 'label 2', ...]
X = df[selected_columns]
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
43

◻ Splitting data into training set and testing set

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

#Note: In this example, random_state is not set to an integer, so expect different splits and consequently different results!

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

Output (14448, 8)
(6192, 8)
(14448,)
(6192,)
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
44

◻ Creating the model


# importing the linearRegression class
from sklearn.linear_model import LinearRegression

# instantiate the Linear Regression model


regressor = LinearRegression()

#You can also create other non-linear models as in Example 1


Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
45

◻ Training the model

# training the model


regressor.fit(X_train, y_train)
# get the coefficients and intercept

import numpy as np
print("Coefficients:\n", np.round(regressor.coef_,2))
print('Intercept:\n', round(regressor.intercept_,2))

Output
Coefficients: [ 0.45  0.01 -0.13  0.84 -0.   -0.   -0.42 -0.44]
Intercept: -37.35
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
46

◻ Prediction using the trained model

# expose the model to new values and predict the target vector
y_pred = regressor.predict(X_test)
print('Predictions:', y_pred)

Output

Predictions: [2.54827248 2.98136965 2.10894987 ...


2.82017938 6.84565693 2.68012622]
Applications of Regression in Engineering-
Example 2: Multiple Regression (cont.)
47

◻ Model evaluation
#Model evaluation
from sklearn.metrics import mean_squared_error, r2_score
# The mean squared error
print("Mean squared error: " , round(mean_squared_error(y_test,
y_pred),2))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: ", round(r2_score(y_test,
y_pred),2))

#or you can use the score function as in Example 1

Output
Mean squared error: 0.55
Coefficient of determination: 0.57
Complete code for California Housing Dataset
after excluding some optional code
48

#Step 1: Importing data and preprocessing


from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing() #load the dataset
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the output data
#Step 2: Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
#Step 3: Selecting and Creating the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
#Step 4: training the model on training set
regressor.fit(X_train, y_train)
#Optional code to print out the coefficients
import numpy as np
print("Coefficients:\n", np.round(regressor.coef_,2))
print('Intercept:\n', round(regressor.intercept_,2))
#Step 5: Using the trained model predict the output for testing set
y_pred = regressor.predict(X_test)
#Step 6: Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: " , round(mean_squared_error(y_test, y_pred),2))
print("Coefficient of determination: ", round(r2_score(y_test, y_pred),2))
Applications of Regression in Engineering-
Example 3: Multiple Regression
49

Diabetes Dataset

Input Variable(s):
• age: Age in years
• sex: Gender of the patient
• bmi: Body mass index
• bp: Average blood pressure
• tc: Total serum cholesterol
• ldl: Low-density lipoproteins
• hdl: High-density lipoproteins
• tch: Total cholesterol / HDL
• ltg: Possibly log of serum triglycerides level
• glu: Blood sugar level

Output Variable(s):
• Measure of disease progression

Dataset link: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
50

◻ Data Preparation and Preprocessing

from sklearn import datasets

# Load the diabetes dataset


X, y = datasets.load_diabetes(return_X_y=True)
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
51

◻ Splitting data into training set and testing set


# Split the data and targets - SHORT WAY
# Use the function train_test_split to split the data and targets into training and testing sets.
# The testing set is 20% of the data, and the rest is the training portion.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size= 0.2)

# OR: Split the data into training and testing sets – LONG WAY
X_train = X[:-20] #the first part of the array excluding the last 20 records
X_test = X[-20:] #the last 20 records

# Split the labels into training and testing sets


y_train = y[:-20] #the first part of the array excluding the last 20 records
y_test = y[-20:] #the last 20 records
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
52

◻ Creating the model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

# You can also create non-linear models as well


Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
53

◻ Training the model


# Train the model using the training set
regressor.fit (X_train, y_train)

# Get the model equation
# The coefficients
print("Coefficients: \n", regressor.coef_)
# The y-intercept
print("y-intercept: \n", regressor.intercept_)

Output:
Coefficients:
[ -30.77592231 -197.11523603 519.50634733 346.49118652 -688.21410873
431.49892496 19.3325826 94.20724607 716.79048049 75.26379265]
y-intercept:
152.09140122905802
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
54

◻ Prediction using the trained model


# Make predictions using the testing set
y_pred = regressor.predict(X_test)
Applications of Regression in Engineering-
Example 3: Multiple Regression (cont.)
55

◻ Model evaluation
from sklearn.metrics import mean_squared_error, r2_score

# The mean squared error


print("Mean squared error: " ,
round(mean_squared_error(y_test, y_pred),2))

# The coefficient of determination: 1 is perfect prediction


print("Coefficient of determination: ", round(r2_score(y_test,
y_pred),2))

Output:
Mean squared error: 2197.35
Coefficient of determination: 0.57
Complete code for Diabetes Dataset
after excluding some optional code
56

#Step 1: Importing data and preprocessing


from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True)
#Step 2: Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
#Step 3: Selecting and Creating the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
#Step 4: training the model on training set
regressor.fit(X_train, y_train)
#Optional code to print out the coefficients
import numpy as np
print("Coefficients:\n", np.round(regressor.coef_,2))
print('Intercept:\n', round(regressor.intercept_,2))
#Step 5: Using the trained model predict the output for testing set
y_pred = regressor.predict(X_test)
#Step 6: Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: " , round(mean_squared_error(y_test, y_pred),2))
print("Coefficient of determination: ", round(r2_score(y_test, y_pred),2))
Class Activity For Diabetes Dataset
57

1. Try different values of test_size in Step 2 and notice the effect on the MSE and R2 scores.
2. Try different models in Step 3 and identify the model that gives the best R2 score. The model options are listed on the "Details of Selecting and creating a model" slide. Note that if you use a nonlinear model, then delete the code for printing the coefficients and intercept.
3. In the diabetes dataset, we have 10 features, so we cannot make a single scatter plot of the inputs against the output. However, a scatter plot is possible with one feature at a time. Use the following code to draw the scatter plot for the feature at index 2, i.e., BMI, which has been normalized with Z-score normalization.
import matplotlib.pyplot as plt
#plot basic scatterplot
plt.scatter(X_test[:,2], y_test, label = 'Actual')
#plot the predicted points
plt.scatter(X_test[:,2], y_pred, label = 'Predicted')
plt.xlabel('Scaled BMI')
plt.ylabel('Disease Progression')
plt.legend()
plt.show()
58

Classification
Classification
59

❑ The objective is to find a decision boundary or decision surface that


separates the classes.
❑ The decision boundary partitions the samples in the dataset into two or more sets, one for each class.
❑ Each machine learning algorithm has its own way of finding that decision boundary, that is, how the model draws a line, set of lines, or curve to separate the classes.
❑ Different decision boundaries for the same dataset:

[Figure: several possible decision boundaries drawn over the same dataset]
Exercise
60

Exercise: The following are classification tasks. Identify the input variable(s) and output variable(s) for each task.
Examples of Classification Tasks
61

◻ Classifying credit card transactions as legitimate or fraudulent

Input Variable(s): features of credit card transactions, such as date and time of the transaction, amount, etc.
Output Variable(s): classes such as legitimate or fraudulent
Examples of Classification Tasks (cont.)
62

◻ Classifying land covers (water bodies, urban areas, forests,


etc.) using satellite data

Input Variable(s): satellite images
Output Variable(s): classes such as water bodies, urban areas, forests, etc.
Examples of Classification Tasks (cont.)
63

◻ Categorizing news stories as finance, weather, entertainment,


sports, etc.

Input Variable(s): news stories
Output Variable(s): classes such as finance, entertainment, sports, etc.
Examples of Classification Tasks (cont.)
64

◻ Predicting tumor cells as benign or malignant

Input Variable(s): features describing the tumor's shape and texture
Output Variable(s): classes such as benign or malignant
Classification Algorithms
65

◻ There are several types of classification algorithms you can


use depending on the dataset you’re working with. The
following are five of the most common classification
algorithms:
▪ Decision Tree (DT)
▪ K-Nearest Neighbors (KNN)
▪ Naïve Bayes
▪ Logistic Regression
▪ Support Vector Machines (SVM)
Tasks in Classification
66

1. Data Preprocessing → 2. Splitting data into training set and testing set → 3. Selecting or creating the model → 4. Training the model → 5. Prediction using the trained model → 6. Model evaluation
Classification Task Pipeline
67

1. Data Preprocessing, which includes labeling any non-numerical


output, and, if necessary, normalizing numerical data.
2. Splitting data into a training set and testing set
3. Selecting and creating the classification model
4. Training the model
5. Prediction using the trained model
6. Model evaluation
Common Python Codes – Classification
68

Classification Task Pipeline:


1. Data Preprocessing (if necessary, reshaping, labeling and normalization)
2. Splitting data into a training set and a testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state = 1)

3. Selecting and creating the model choose one of the codes from the next slide
4. Training the model:
clf.fit(X_train, y_train)

5. Prediction using the trained model:


y_pred = clf.predict(X_test)

6. Model evaluation: Calculating accuracy

clf.score (X_test, y_test)


Details of Selecting and Creating a Model
69

◻ Select one of the following classifiers


#Decision tree Classifier
from sklearn import tree #import the tree class
from sklearn.tree import DecisionTreeClassifier #import the decision tree class
clf = DecisionTreeClassifier()
#OR
#K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier #import the KNN class
clf = KNeighborsClassifier()
#OR
#Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB #import the NB class
clf = GaussianNB()
#OR
#Linear SVM
from sklearn.svm import LinearSVC #import the LinearSVC class
clf = LinearSVC()
#OR
#Non-Linear SVM with polynomial kernel
from sklearn.svm import SVC #import the SVC class
clf = SVC(kernel='poly') #Non-Linear SVM with polynomial kernel
#or
clf = SVC(kernel='rbf') #Non-Linear SVM with Radial Basis Function (RBF) kernel
70

How Classification Algorithms Work


Classification Algorithms (cont.)
71

◻ There are different machine learning algorithms.
◻ Training Phase: a learning algorithm is used to build the model; training/learning results in a trained model.
◻ Testing Phase: the trained model is used for prediction.

[Flowchart: training data → learning algorithm → trained model → predictions on test data]
Classification Algorithms- Decision Tree
72

◻ Decision Trees (DT)


▪ Decision Tree is a supervised learning technique that can be used for
both classification and regression.
▪ It is a tree-structured classifier where internal nodes represent the
features of a dataset, branches represent the decision rules, and each
leaf node represents the outcome.
Classification Algorithms- DT (cont.)
73

◻ Decision nodes represent questions about features (e.g., Home


Owner?), which have two or more branches (e.g., Yes and No).
◻ Leaf nodes (e.g., Defaulted Borrower --> Yes, Defaulted Borrower --
> No) represent a classification or decision.
◻ Decision trees can handle both categorical and numerical data.
Classification Algorithms- DT (cont.)
74

◻ Decision trees allow you to ask multiple "linear questions" to classify a non-linearly separable dataset.
◻ Example 1: The following is a dataset with two features, Sun and Wind, and two classes:
■ Good day for surfing
■ Not a good day for surfing

Sample decision tree to separate the classes of this dataset: the root node asks "Windy?" (Yes/No), and the No branch then asks "Sunny?" (Yes/No).

[Scatter plot of the dataset with regions labeled sunny / not sunny and not windy / windy]
Classification Algorithms- DT (cont.)
75

◻ Example 2: The following is a dataset with two features, X1 and X2, and two classes.
◻ Can we build the decision tree to classify this sample set?
◻ Hint: Start splitting using X1

Sample decision tree to separate the classes of this dataset:

X1 < 3?
├─ Yes → X2 < 2? (Yes/No)
└─ No → X2 < 4? (Yes/No)
Classification Algorithms- DT
76

◻ The DT algorithm decides where to split the data based on impurity.
◻ It finds split points that result in subsets that are as pure as possible.
◻ A subset is purer when most data in it belong to the same
class.
This is a better split as it
results in purer subsets
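The slides do not name a specific impurity measure; the Gini index is one common choice (and the default in scikit-learn's DecisionTreeClassifier). A minimal sketch with assumed class counts:

import numpy as np

def gini(class_counts):
    # Gini impurity: 1 minus the sum of squared class proportions; 0 means a pure subset
    p = np.array(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

print(gini([5, 5]))   # 0.5  -> maximally impure for two classes
print(gini([9, 1]))   # 0.18 -> purer subset (a better split produces subsets like this)
print(gini([10, 0]))  # 0.0  -> pure subset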
Example of a Decision Tree

Splitting attributes:

Home Owner?
├─ Yes → NO
└─ No → MarSt?
     ├─ Single, Divorced → Income?
     │    ├─ < 80K → NO
     │    └─ > 80K → YES
     └─ Married → NO

[Left: the training data table; right: the decision tree model]

77
Another Example of a Decision Tree

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
     ├─ Yes → NO
     └─ No → Income?
          ├─ < 80K → NO
          └─ > 80K → YES

There could be more than one tree that fits the same data!

[Training data table]

78
Apply the Trained Model to Predict the Class of a Test Sample

Start from the root of the tree and, at each decision node, follow the branch that matches the test sample's attribute value until a leaf is reached. The slides trace a test sample with Home Owner = No and MarSt = Married:

Home Owner?
├─ Yes → NO
└─ No → MarSt?            (Home Owner = No: follow the No branch)
     ├─ Single, Divorced → Income?
     │    ├─ < 80K → NO
     │    └─ > 80K → YES
     └─ Married → NO      (MarSt = Married: this branch reaches a leaf)

Predicted class is "No".

79-84
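The walk above can be mimicked with a small nested-dictionary tree. This is an illustrative sketch, not the slides' code; the tree encoding and the predict helper are assumptions.

# Hypothetical encoding of the decision tree above
tree = {'feature': 'HomeOwner',
        'Yes': 'NO',
        'No': {'feature': 'MarSt',
               'Married': 'NO',
               'Single, Divorced': {'feature': 'Income',
                                    '< 80K': 'NO',
                                    '> 80K': 'YES'}}}

def predict(node, sample):
    # Follow branches until a leaf (a plain string label) is reached
    while isinstance(node, dict):
        node = node[sample[node['feature']]]
    return node

test_sample = {'HomeOwner': 'No', 'MarSt': 'Married'}
print(predict(tree, test_sample))  # NO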
Classification Algorithms- DT: Python
85

◻ Example of classification using DT in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.tree import DecisionTreeClassifier

#create the classifier


clf = DecisionTreeClassifier()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))

Output:
[1 2]
Classification Algorithms- KNN
86

◻ K-Nearest Neighbor (KNN)


▪ Simple algorithm for classification.
▪ Stores all the available training samples and classifies the new
samples based on the similarity measure (e.g., distance
functions).
▪ Basic idea: If it walks like a duck, quacks like a duck, then it’s
probably a duck.
Classification Algorithms- KNN (cont.)
87

◻ Requires the following:


▪ A set of labeled records
▪ Proximity metric to compute
distance/similarity between a pair
of records. For example, calculating
the Euclidean distance between the
sample pairs.
▪ The value of K is the number of
nearest neighbors to consider.
▪ One method for using class labels of
K nearest neighbors to determine
the class label of an unknown
record is, for example, by taking a
majority vote.
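A from-scratch sketch of this idea, using Euclidean distance and a majority vote; the toy training points and labels are assumptions for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Labels of the k nearest neighbors
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among the k nearest neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [8, 8], [9, 8]])
y_train = np.array(['duck', 'duck', 'goose', 'goose'])
print(knn_predict(X_train, y_train, np.array([1.5, 1.2]), k=3))  # duck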
Classification Algorithms- KNN (cont.)
88

◻ K-nearest neighbors of a record x are data points that have


the k smallest distances to x
◻ Different values of k affect the predicted class:

[Figure: three neighborhood sizes around the same unknown record; the predicted class is -, either + or -, and +, respectively]
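In scikit-learn, k is the n_neighbors parameter of KNeighborsClassifier (the default is 5). A short sketch of trying several values on the toy dataset used in the next example:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

for k in (1, 3, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, y)
    print('k =', k, '-> prediction:', clf.predict([[0, 0]]))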
Classification Algorithms- KNN: Python
89

◻ Example of classification using KNN in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.neighbors import KNeighborsClassifier

#create the classifier


clf = KNeighborsClassifier()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))

Output:
[1 2]
Classification Algorithms- NB
90

❑ Naïve Bayes (NB): a probabilistic framework for solving


classification problems, based on Bayes Theorem.

❑ Consider each attribute and class label as random variables


❑ Given a record X with attributes (X1, X2,…, Xd)
▪ Goal is to predict class Y
▪ Specifically, we want to find the value of Y that maximizes P(Y| X)

❑ We can estimate P(Y| X) directly from data.


Classification Algorithms- NB: Example
91

❑ Example:
❑ Assume you have two friends, Adam and Lena.
❑ You received a message from one of them, but you do not know who the sender is.
❑ You would like to use machine learning to "predict" the sender.
❑ Assuming equal prior probability: both Adam and Lena have access to the internet and can write emails.
■ P(Adam) = 0.5
■ P(Lena) = 0.5
Classification Algorithms- NB
92

◻ The model is trained on the following probabilities that


describe the frequency of mentioning specific words by each
of the persons in their conversations (assume the language
consists of three words for simplicity)
◻ Lena mentions 'Great' in 50% of her conversations, while she mentions 'Deal' and 'Life' in 20% and 30% of her conversations, respectively.

[Word-frequency charts: Adam — Great 10%, Deal 80%, Life 10%; Lena — Great 50%, Deal 20%, Life 30%]
Classification Algorithms- NB
93

◻ Assume you received an Email with contents:


Great!
◻ Whom do you think would be the sender of the email?
Lena! because Lena has higher probability of using the word
‘Great’ than Adam.

Classification Algorithms- NB
94

◻ Assume you received an Email with contents:


Great Life!
◻ Whom do you think would be the sender of the email?
Lena! because Lena has higher probability of using both the
word ‘Great’ and the word ‘Life’ than Adam.

Classification Algorithms- NB
95

◻ Assume you received an Email with contents:


Life Deal!
◻ Whom do you think would be the sender of the email?
Let’s calculate:
P(Adam is sender of “Life Deal”) = prior_probability × P(Adam saying ‘life’) × P(Adam saying ‘Deal’)
= 0.5 × 0.1 × 0.8 = 0.04
P(Lena is sender of “Life Deal”) = prior_probability × P(Lena saying ‘life’) × P(Lena saying ‘Deal’)
= 0.5 × 0.3 × 0.2 = 0.03
Adam! → because of the higher probability
Classification Algorithms- NB
96

◻ Assume you received an Email with contents:


Great Deal!
◻ Whom do you think would be the sender of the email?
Let’s calculate: P(Adam is sender of “Great Deal”) = 0.5 × 0.1 × 0.8 = 0.04
P(Lena is sender of “Great Deal”) = 0.5 × 0.5 × 0.2 = 0.05
Lena! because of the higher probability

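The hand calculation above can be scripted directly; a minimal sketch using the word-use probabilities from the example (the dictionary layout is an assumption):

# Word-use probabilities per sender, taken from the example
p_words = {'Adam': {'Great': 0.1, 'Deal': 0.8, 'Life': 0.1},
           'Lena': {'Great': 0.5, 'Deal': 0.2, 'Life': 0.3}}
prior = {'Adam': 0.5, 'Lena': 0.5}

def nb_score(sender, message_words):
    # prior probability times the product of per-word probabilities
    # (the "naive" conditional-independence assumption)
    score = prior[sender]
    for w in message_words:
        score *= p_words[sender][w]
    return score

for sender in ('Adam', 'Lena'):
    print(sender, nb_score(sender, ['Great', 'Deal']))
# Adam: 0.5 * 0.1 * 0.8 = 0.04 ; Lena: 0.5 * 0.5 * 0.2 = 0.05 -> Lena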
Classification Algorithms- NB: Python
97

◻ Example of classification using NB in Python


import numpy as np
#creating a dataset of 6 samples, where
#X is the array of feature vectors
#y is the array of labels
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

#import the classifier package


from sklearn.naive_bayes import GaussianNB

#create the classifier


clf = GaussianNB()

#train the classifier on the whole dataset


clf.fit(X, y)

#use the trained classifier to predict the classes of two samples


print(clf.predict([[-0.8, -1],[4, 1]]))

Output:
[1 2]
Evaluation of Classification: Accuracy
98

◻ Accuracy measures how many samples were classified correctly out of the total number of samples used in the prediction.

$$\text{accuracy} = \frac{\text{Number of correctly classified samples}}{\text{Total number of samples used in the prediction}}$$

◻ The value of accuracy is between 0.0 and 1.0.
➢ 0.0 means the model did not make any correct predictions.
➢ 1.0 means the model predicted ALL the tested samples correctly.
➢ The higher the accuracy (closer to 1.0), the better the model.
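A minimal sketch of this formula (the true and predicted labels are assumptions); sklearn's accuracy_score computes the same value.

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])  # assumed true labels
y_pred = np.array([0, 1, 0, 0, 1])  # assumed predicted labels

print(np.mean(y_true == y_pred))       # 0.8, directly from the formula
print(accuracy_score(y_true, y_pred))  # 0.8, sklearn equivalent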
99

Applications of Classification
Applications of Classification
100

Iris Dataset

Dataset link: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

Input Variable(s): sepal length, sepal width, petal length, petal width
Output Variable(s): class of flower
Applications of Classification in Engineering-
Example 1: Data Processing
101

◻ Data Preprocessing

from sklearn.datasets import load_iris #import the dataset

#Explore the dataset
#data = load_iris(return_X_y=False)
#print(data.target_names) #class labels

X, y = load_iris(return_X_y=True)
#Output is already encoded to numbers, so no need for labeling

X = X[:, :2] # we only take the first two features, for visualization purposes
#Other option: X = X[:, [0, 3]] includes the data of the 1st and 4th feature columns

Dataset link: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
Applications of Classification in Engineering-
Example 1: Data Processing with Labeling
102

◻ Data Preprocessing, with labeling
◻ If the target classes are categorical text, you need to encode them before you proceed with machine learning, for example using the simplest form of encoding, a Label Encoder:
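A minimal Label Encoder sketch (the text class labels below are assumptions for illustration):

from sklearn.preprocessing import LabelEncoder

y_text = ['setosa', 'versicolor', 'virginica', 'setosa']  # assumed categorical labels

le = LabelEncoder()
y_encoded = le.fit_transform(y_text)    # encode text labels as integers
print(y_encoded)                        # [0 1 2 0]
print(le.inverse_transform(y_encoded))  # recover the original text labels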
Applications of Classification in Engineering-
Example 1: Data Splitting
103

◻ Splitting data into a training set and testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

#Note: random_state is not set to an integer, so you may get different results
Applications of Classification in Engineering-
Example 1: Selecting and Creating a Model
104

◻ Selecting and creating the classifier model


◻ Need to import the packages

from sklearn import tree #import the tree module
from sklearn.svm import LinearSVC #import the LinearSVC class
from sklearn.svm import SVC #import the SVC class
from sklearn.tree import DecisionTreeClassifier #import the decision tree class
from sklearn.neighbors import KNeighborsClassifier #import the KNN class
from sklearn.naive_bayes import GaussianNB #import the NB class
Applications of Classification in Engineering-
Example 1: Selecting and Creating a Model
105

◻ Creating the model


◻ It can be ONE of the following:
#Create the classifier
#Decision tree Classifier
clf = DecisionTreeClassifier()

# Let's consider the Decision Tree classifier in this example.

#K-Nearest Neighbors Classifier
clf = KNeighborsClassifier()

#Naive Bayes Classifier


clf = GaussianNB()

#Linear SVM
clf = LinearSVC()

#Non-Linear SVM with polynomial kernel


clf = SVC(kernel='poly')

#Non-Linear SVM with Radial Basis Function (RBF) kernel


clf = SVC(kernel='rbf')
Applications of Classification in Engineering-
Example 1: Training
106

◻ Training the model

#training the model by calling the function fit
#and passing the training features and labels

clf = clf.fit(X_train, y_train)


Applications of Classification in Engineering-
Example 1: Using the Model for Prediction
107

◻ Prediction using the trained model

#predicting the labels of the test features


y_pred = clf.predict(X_test)

print(y_pred)

Output:
Applications of Classification in Engineering-
Example 1: Model Evaluation using Accuracy
108

◻ Model evaluation
#Evaluating the model
#print the accuracy score
print('Accuracy is:', round(clf.score(X_test, y_test), 2))

Output:
Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting (DT)
109

◻ Model evaluation via Plotting for DT algorithms

#plot the decision tree, only if using DT classifier


from sklearn import tree
tree.plot_tree(clf)
Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting
110

◻ Plot the Decision Boundary produced by the classifier:


#plot the decision boundary using the plot_decision_regions function from the mlxtend library
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
Applications of Classification in Engineering-
Example 1: Model Evaluation via Plotting
111

◻ Decision boundaries produced by several classifiers:

[Plots: DT, KNN, NB, Linear SVM, SVM with poly kernel, SVM with RBF kernel]
Complete code for Iris Dataset
after excluding some optional code
112

#Step 1: Importing data and preprocessing


from sklearn.datasets import load_iris #import the dataset
X, y = load_iris(return_X_y=True)
X = X[:, :2] # we only take the first two features for visualization purposes
#Step 2: Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
#Step 3: Selecting and Creating the model
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
#Step 4: training the model on training set
clf.fit(X_train, y_train)
#Step 5: Using the trained model predict the output for testing set
y_pred = clf.predict(X_test)
#Step 6: Evaluate the model
print('Accuracy is:', round(clf.score(X_test, y_test), 2))
#plot the decision boundary using the plot_decision_regions function from the mlxtend library
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
Class Activity For Iris Dataset
113

Try different classification models in Step 3.
The model options are listed on the "Details of Selecting and Creating a Model" slide.
Plot the decision boundaries for each classification model.

Summary: Regression vs. Classification
114

◻ Differences between classification and regression

Property: Output type
• Classification: discrete (class labels)
• Regression: continuous (numbers)

Property: Purpose
• Classification: to find the decision boundary, which divides the dataset into different classes.
• Regression: to find the line/curve that best fits the data and predicts the output more accurately.

Property: Evaluation
• Classification: accuracy (and other measures not covered in this course, such as F-score, precision, recall, and the confusion matrix)
• Regression: MSE, R2 (and other metrics not covered in this course, such as SSE, MAE, and MAPE)
Learning Outcomes
115

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
