Lecture 3 - MachineLearning-CrashCourse2023
Kshitij Sharma
Department of Computer Science
Faculty of Information Technology and Electrical Engineering
What?
Why?
Machine Learning is the future
# inspect the normalised data
summary(normalisedData)
# randomly sample 80% of the row indices from the original data
trainRandom <- sample(1:nrow(iris), 0.8 * nrow(iris))
# accuracy: percentage of correct predictions in a confusion matrix
accuracy <- function(x) {
  sum(diag(x)) / sum(x) * 100
}
# plot accuracy against the number of neighbours
ggplot(d1, aes(x = neighbours, y = accuracy)) + geom_line() + geom_point() + theme_bw()
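The fragments above come from a k-nearest-neighbours walkthrough on the iris data. A minimal end-to-end sketch tying them together; the normalise helper, the loop bound of 15, and the variable names I introduce here are illustrative choices, not the lecture's exact code:
library(class)      # provides knn()
library(ggplot2)
# scale each numeric column to [0, 1]
normalise <- function(x) (x - min(x)) / (max(x) - min(x))
normalisedData <- as.data.frame(lapply(iris[, 1:4], normalise))
summary(normalisedData)
# randomly sample 80% of the row indices for training
set.seed(1234)
trainRandom <- sample(1:nrow(iris), 0.8 * nrow(iris))
trainY <- iris$Species[trainRandom]
testY  <- iris$Species[-trainRandom]
# accuracy (as defined above) for k = 1..15, collected for the plot
results <- sapply(1:15, function(k) {
  pred <- knn(normalisedData[trainRandom, ], normalisedData[-trainRandom, ],
              cl = trainY, k = k)
  accuracy(table(testY, pred))
})
d1 <- data.frame(neighbours = 1:15, accuracy = results)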
Classification
• Decision tree
# read data
set.seed(6789)
titanic <- read.csv('https://raw.githubusercontent.com/guru99-edu/R-Programming/master/titanic_data.csv')
head(titanic)
# clean data:
# - drop columns that carry no useful information
# - give "pclass" and "survived" labels instead of numbers (easier to understand)
# - remove rows with NAs
cleanData <- titanic %>%
  select(-c(home.dest, cabin, name, x, ticket)) %>%
  mutate(pclass = factor(pclass, levels = c(1, 2, 3),
                         labels = c('Upper', 'Middle', 'Lower')),
         survived = factor(survived, levels = c(0, 1),
                           labels = c('No', 'Yes'))) %>%
  na.omit()
glimpse(cleanData)
#Split into training and testing
set.seed(1234)
trainRandom <- sample(1:nrow(cleanData), 0.8 * nrow(cleanData))
trainData <- cleanData[trainRandom,]
testData <- cleanData[-trainRandom,]
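The decision tree itself can be grown on this split; a minimal sketch assuming the rpart and rpart.plot packages, and reusing the accuracy() helper defined earlier:
library(rpart)
library(rpart.plot)
# grow a classification tree predicting survival from all remaining columns
fit <- rpart(survived ~ ., data = trainData, method = 'class')
rpart.plot(fit)                                  # draw the tree
# evaluate on the held-out 20%
predicted <- predict(fit, testData, type = 'class')
accuracy(table(testData$survived, predicted))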
Prediction/classification
[Scatter plot: Sales vs. money spent on ads]
Problem with linear regression
Prediction/classification
• Linear Regression
[Plot: a fitted straight line on the Sales vs. ads data]
Prediction/classification
• Polynomial Regression
• Y = 14.81X + 2.89X² + 88.33
[Plot: a fitted polynomial curve on the Sales vs. ads data]
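A hedged sketch of fitting such a polynomial in R; the adsData data frame and its columns are made-up stand-ins for the slide's example:
# synthetic data generated from the slide's equation, plus noise
set.seed(42)
adsData <- data.frame(ads = runif(100, 0, 10))
adsData$sales <- 88.33 + 14.81 * adsData$ads + 2.89 * adsData$ads^2 +
  rnorm(100, sd = 20)
# second-degree polynomial regression: sales = b0 + b1*ads + b2*ads^2
polyFit <- lm(sales ~ ads + I(ads^2), data = adsData)
coef(polyFit)   # should land near the slide's 88.33, 14.81 and 2.89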
Prediction/classification Quality
• Error = original - predicted
• Mean Absolute Error (MAE): the mean of |original - predicted|
• Root Mean Squared Error (RMSE): the square root of the mean of (original - predicted)²
• R-squared
Prediction/classification Quality
• Precision, Recall, Accuracy
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
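These measures are one-liners in R; a sketch assuming numeric vectors original and predicted for regression, and a 2x2 confusion matrix cm <- table(actual, predictedClass) with the positive class second (all names are illustrative):
# regression error measures
mae  <- mean(abs(original - predicted))
rmse <- sqrt(mean((original - predicted)^2))
# R-squared: share of variance explained by the model
rsq <- 1 - sum((original - predicted)^2) / sum((original - mean(original))^2)
# classification measures from the confusion matrix
precision <- cm[2, 2] / (cm[2, 2] + cm[1, 2])   # TP / (TP + FP)
recall    <- cm[2, 2] / (cm[2, 2] + cm[2, 1])   # TP / (TP + FN)
acc       <- sum(diag(cm)) / sum(cm)            # overall accuracy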
# the examples above call individual algorithms directly; you can try other models through caret
# for a list of models the caret package supports (and the names to pass to modelLookup), run:
names(getModelInfo())
Caret
train(
  formula,
  data = …,
  method = "…",
  trControl = …,
  preProcess = …,
  tuneGrid = …,
  metric = …
)
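Each slot filled in with one valid choice, here a 10-fold cross-validated kNN on iris; the method name and grid are just examples:
library(caret)
knnModel <- train(
  Species ~ .,                                           # formula
  data = iris,                                           # data
  method = "knn",                                        # model name
  trControl = trainControl(method = "cv", number = 10),  # resampling
  preProcess = c("center", "scale"),                     # preprocessing
  tuneGrid = expand.grid(k = 1:10),                      # candidate k values
  metric = "Accuracy")                                   # selection metric
knnModel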
Naïve Bayes
Outlook Temp Humidity Windy Play
RAINY HOT HIGH FALSE NO
RAINY HOT HIGH TRUE NO
OVERCAST HOT HIGH FALSE YES
SUNNY MILD HIGH FALSE YES
SUNNY COOL NORMAL FALSE YES
SUNNY COOL NORMAL TRUE NO
OVERCAST COOL NORMAL TRUE YES
RAINY MILD HIGH FALSE NO
RAINY COOL NORMAL FALSE YES
SUNNY MILD NORMAL FALSE YES
RAINY MILD NORMAL TRUE YES
OVERCAST MILD HIGH TRUE YES
OVERCAST HOT NORMAL FALSE YES
SUNNY MILD HIGH TRUE NO
Naïve Bayes
Frequency table (Outlook):
         YES NO
SUNNY    3   2
OVERCAST 4   0
RAINY    2   3
Likelihood table (Outlook):
         YES  NO
SUNNY    3/9  2/5
OVERCAST 4/9  0/5
RAINY    2/9  3/5
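The likelihood table is just the frequency table with each column divided by its class total; a one-line sketch, assuming the fourteen rows above sit in a data frame named play (a hypothetical name):
# P(Outlook | Play): column-wise proportions of the frequency table
prop.table(table(play$Outlook, play$Play), margin = 2)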
Naïve Bayes
          YES   NO
SUNNY     3/9   2/5   5/14
OVERCAST  4/9   0/5   4/14
RAINY     2/9   3/5   5/14
          9/14  5/14
P(SUNNY|YES) = 3/9 = 0.33
P(YES) = 9/14 = 0.64
P(SUNNY) = 5/14 = 0.36
Naïve Bayes
P(YES|SUNNY) = P(SUNNY|YES) × P(YES) / P(SUNNY)
= (0.33 × 0.64) / 0.36
= 0.60
Naïve Bayes
Bayes' theorem: P(c|X) = P(X|c) × P(c) / P(X)
Naïve Bayes
Likelihood table (Temp):
     YES  NO
HOT  2/9  2/5
MILD 4/9  2/5
COOL 3/9  1/5
Likelihood table (Humidity):
       YES  NO
HIGH   3/9  4/5
NORMAL 6/9  1/5
Likelihood table (Windy):
      YES  NO
FALSE 6/9  2/5
TRUE  3/9  3/5
Naïve Bayes (all likelihood tables)
Temp  YES  NO        Outlook   YES  NO
HOT   2/9  2/5       SUNNY     3/9  2/5
MILD  4/9  2/5       OVERCAST  4/9  0/5
COOL  3/9  1/5       RAINY     2/9  3/5
Humidity  YES  NO    Windy  YES  NO
HIGH      3/9  4/5   FALSE  6/9  2/5
NORMAL    6/9  1/5   TRUE   3/9  3/5
Naive independence assumption: P(c|X) ∝ P(x1|c) × P(x2|c) × … × P(xn|c) × P(c)
Naïve Bayes
Outlook Temp Humidity Windy Play
RAINY COOL HIGH TRUE ?
P(YES|X) = P(RAINY|YES) × P(COOL|YES) × P(HIGH|YES) × P(TRUE|YES) × P(YES)
P(NO|X) = P(RAINY|NO) × P(COOL|NO) × P(HIGH|NO) × P(TRUE|NO) × P(NO)
Naïve Bayes
Outlook Temp Humidity Windy Play
RAINY COOL HIGH TRUE ?
P(YES|X) = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
P(NO|X) = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
P(NO|X) > P(YES|X), so the predicted class is Play = NO
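The same arithmetic as a quick R check, with the values read off the likelihood tables above:
# unnormalised scores for the instance (RAINY, COOL, HIGH, TRUE)
pYes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
pNo  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206
pYes / (pYes + pNo)   # ~0.20, so the model predicts NO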
Naïve Bayes - Implementation
library(dplyr)
library(ggplot2)
library(caret)
library(mlbench)                      # provides the PimaIndiansDiabetes2 data
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)  # drop rows with NAs
# split data
set.seed(1234)
trainingSamples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
trainData <- PimaIndiansDiabetes2[trainingSamples, ]
testData <- PimaIndiansDiabetes2[-trainingSamples, ]
Naïve Bayes - Implementation
#train
model <- train(
diabetes ~.,
data = trainData,
method = "nb",
trControl =
trainControl("cv", number = 10))
# test
predictions <- predict(model, testData)
# evaluate
confusionMatrix(predictions, testData$diabetes)
Naïve Bayes - Implementation
# parameter tuning: list the tunable parameters of the "nb" method
modelLookup("nb")
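modelLookup("nb") should report fL, usekernel and adjust as the tunable parameters; a hedged tuning sketch over those, with illustrative grid values:
tunedModel <- train(
  diabetes ~ .,
  data = trainData,
  method = "nb",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(fL = 0:2,                 # Laplace smoothing
                         usekernel = c(TRUE, FALSE),
                         adjust = 1))              # kernel bandwidth factor
tunedModel$bestTune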
Regression