R Course - Part7 ML - Exercise Sheet 2024
Introduction
Definition: Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part of artificial
intelligence. Machine learning algorithms build a model based on sample data, known as
training data, in order to make predictions or decisions without being explicitly programmed to
do so. Machine learning algorithms are used in a wide variety of applications, such as in
medicine, email filtering, speech recognition, and computer vision, where it is difficult or
unfeasible to develop conventional algorithms to perform the needed tasks.
There are two basic types of ML: supervised and unsupervised. Unsupervised ML just discovers
patterns of similarity in the data, whereas supervised ML algorithms train and optimize machines on
a subset of data with known class labels.
Practical Part
Objective
We will try to build a classifier separating different iris species according to their sepal and petal
features (length and width).
Libraries
Make sure that the packages below are installed. If not, you can install these packages with the
following commands.
install.packages("caret", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("grid", dependencies = TRUE)
install.packages("gridExtra", dependencies = TRUE)
install.packages("factoextra", dependencies = TRUE)
install.packages("ggdendro", dependencies = TRUE)
install.packages("rpart.plot", dependencies = TRUE)
We will use the powerful ‘caret’ package (short for Classification and Regression Training), which
contains functions for supervised ML.
Import it:
library(caret)
Inspect the data and get some basic stats. What does the data look like? How many rows and columns are there?
head(iris)
tail(iris)
dim(iris)
summary(iris)
str(iris)
Now check for missing values using the is.na() function and print out the number of
missing items in the data frame, as shown below.
sum(is.na(iris))
You can create a new data set without missing values using the na.omit() function in R.
#newdata <- na.omit(mydata)
Plot a density histogram and a boxplot of the iris data for each species:
scales <- list(x=list(relation="free"), y=list(relation="free"))
png(filename = "featurePlot_Density.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "density", scales = scales,
auto.key = list(columns = 3))
dev.off()
png(filename = "featurePlot_Box.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "box", scales = scales,
par.strip.text = list(cex = 0.6))
dev.off()
Unsupervised classification
In order to see how well our classification performs, we will mask the species labels in
30% of the data.
Split the dataset 70/30:
set.seed(9)
TrainingIndex <- createDataPartition(iris$Species, p=0.7, list = F)
Now let’s mask the species labels for the remaining 30% of the data, so that we can treat these as
samples whose species we do not know:
TestingSet <- iris[-TrainingIndex,]
TestingSet$Species <- "unknown"
dim(TestingSet)
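One way to look for structure without using the labels is hierarchical clustering; the ggdendro
package installed above can draw the dendrogram. A minimal sketch (the object name hc is our own):
library(ggdendro)
# Hierarchical clustering on the four scaled numeric features
hc <- hclust(dist(scale(iris[, 1:4])))
ggdendrogram(hc)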
PCA
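Another unsupervised view is a PCA on the four numeric features, colouring the samples by their
(partly masked) species labels. A minimal sketch using prcomp() and the factoextra package
installed above; TrainingSet and maskedIris are our own helper names:
library(factoextra)
# Re-combine the labelled training rows with the masked test rows
TrainingSet <- iris[TrainingIndex, ]
TrainingSet$Species <- as.character(TrainingSet$Species)
maskedIris <- rbind(TrainingSet, TestingSet)
# PCA on the four numeric features (centred and scaled)
pca <- prcomp(maskedIris[, 1:4], center = TRUE, scale. = TRUE)
# Plot the samples on the first two components, coloured by (masked) species
fviz_pca_ind(pca, habillage = factor(maskedIris$Species),
addEllipses = TRUE, label = "none")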
Supervised classification
In order to test how well our classification performs, we will split the data 70/30: we train
with 70% of the data and test with the remaining 30%.
set.seed(9)
training_indices <- createDataPartition(iris$Species, p=0.7, list = F)
irisTrain <- iris[training_indices, ]
irisTest <- iris[-training_indices, ]
Now we can run the supervised ML. Let’s specify some important parameters/settings for the ML
(10-fold cross validation with 3 repeats).
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = multiClassSummary, classProbs = T, savePredictions = T)
Also, we define a metric to estimate machine performance: the machine with the highest value of
this metric will be selected.
metric <- "Accuracy"
k-NN
set.seed(7)
fit.knn <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "knn", metric = metric,
trControl = control, preProcess = c("center", "scale"))
Decision Tree
set.seed(7)
fit.cart <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rpart", metric = metric,
trControl = control, preProcess = c("center", "scale"))
A useful property of decision trees is that the decisions of the machine can be followed
easily. Plot the best decision tree:
library(rpart.plot)
rpart.plot(fit.cart$finalModel, type = 5)
Random Forest
set.seed(7)
fit.rf <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rf", metric = metric, trControl = control,
preProcess = c("center", "scale"))
SVM
set.seed(7)
fit.svm_Lin <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmLinear", metric = metric,
trControl = control, preProcess = c("center", "scale"))
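The performance summary below also expects a radial-kernel SVM (fit.svm_Rad), which is not
trained above. Assuming caret’s "svmRadial" method is what is intended, it can be fitted the
same way:
set.seed(7)
fit.svm_Rad <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmRadial", metric = metric,
trControl = control, preProcess = c("center", "scale"))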
Artificial neural network
set.seed(7)
fit.nnet <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "nnet", metric = metric,
trControl = control, preProcess = c("center", "scale"))
LDA
set.seed(7)
fit.lda <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "lda", metric = metric,
trControl = control, preProcess = c("center", "scale"))
Summarize ML performance
results <- resamples(list(lda=fit.lda, knn=fit.knn, svm_Lin=fit.svm_Lin, svm_Rad=fit.svm_Rad,
rf=fit.rf, cart=fit.cart, nnet=fit.nnet))
summary(results)
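To see how the best machine performs on unseen data, predict the held-out 30% and compare with
the true labels. A minimal sketch, using the random forest as an example (any of the fitted
models works the same way):
pred <- predict(fit.rf, newdata = irisTest[, 1:4])
confusionMatrix(data = pred, reference = irisTest$Species)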
***********************************
Now it’s time for your first ML project of your own. Load the data:
data(GermanCredit, package = "caret")
Train classifiers to distinguish the classes "Bad" and "Good" (‘Class’ column). Use the same
settings as before: a 70/30 split and 10-fold cross validation with 3 repeats, selecting for the
best Balanced Accuracy.
Which ML algorithm performs best? What are the six most important features for the
classification?
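A minimal skeleton to get you started; the object names are our own, and the train() calls are
left for you to fill in:
data(GermanCredit, package = "caret")
set.seed(9)
idx <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
gcTrain <- GermanCredit[idx, ]
gcTest <- GermanCredit[-idx, ]
# train() the models as above (same trainControl, selecting for Balanced Accuracy),
# then inspect the feature importance of the best model, e.g. varImp(fit.best)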
Well done!