
R Programming Course

Jena University Hospital, February 2024


R Programming course, Thursday, Classification (Machine Learning)
Prof. Dr. Rainer König, Alicia Hiemisch, Mohadeseh Naseri, supported by Dr. Marcus Oswald,
Antje Biering, Maria Dinkelacker and Kulwadee Thanamit

Introduction
Definition: Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part of artificial
intelligence. Machine learning algorithms build a model based on sample data, known as
training data, in order to make predictions or decisions without being explicitly programmed to
do so. Machine learning algorithms are used in a wide variety of applications, such as in
medicine, email filtering, speech recognition, and computer vision, where it is difficult or
unfeasible to develop conventional algorithms to perform the needed tasks.

A major application is classification.

There are two basic types of ML: supervised and unsupervised. Unsupervised ML only discovers
patterns of data similarity, whereas supervised ML algorithms train and optimize a machine on
a subset of the data with known class labels.

Unsupervised Machine Learning


Clustering: Basically, clustering is about grouping objects systematically using
distance/similarity measures.
Principal component analysis (PCA): used for dimensionality reduction by projecting each data
point onto only the first few principal components to obtain lower-dimensional data while
preserving as much of the data's variation as possible.

Supervised Machine Learning


k-nearest neighbors algorithm (k-NN): one of the simplest supervised machine learning
algorithms. k-NN measures the similarity between a new observation and the available cases
and assigns the new case to the category that is most common among its k nearest cases. The
choice of k is crucial for the quality of the results.
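As an illustration (the practical part below trains k-NN via the caret package instead), a minimal sketch using the 'class' package (usually shipped with R, but not in the install list below; k = 5 is an arbitrary choice) could look like this:
library(class)                      # assumed extra package, not used in the exercise below
set.seed(1)
idx   <- sample(nrow(iris), 100)    # 100 labelled training cases
train <- scale(iris[idx, 1:4])      # centre and scale the features
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
pred  <- knn(train = train, test = test, cl = iris$Species[idx], k = 5)
table(pred, iris$Species[-idx])     # confusion table: predicted vs. true species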
Linear discriminant analysis (LDA): aims to find a linear combination of features that
characterizes or separates two or more classes of objects/observations, thereby reducing the
dimensionality systematically.
Support vector machine (SVM): SVM maps training examples to points in space and constructs
a hyperplane so that the distance from it to the nearest data point on each side is maximized. A
good separation is achieved by the hyperplane that has the largest distance to the nearest
training-data point of any class. Kernel trick: the method maps non-linearly separable data into a
higher-dimensional space, making the data linearly separable.
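As a sketch of a kernel SVM (the practical part below trains a linear SVM via caret instead), assuming the 'e1071' package is available as an extra:
library(e1071)                             # assumed extra package, not used in the exercise below
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predict(fit, iris), iris$Species)    # resubstitution confusion table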
Decision tree (DT): Structure in which each internal node represents a "test" on an attribute (e.g.
Petal Width > 1), each branch represents the outcome of the test, and each leaf node
represents a class label (decision taken after computing all attributes). The paths from root to
leaf represent classification rules.
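A minimal sketch of such a tree on the iris data, using the rpart package (a dependency of the rpart.plot package installed below); each internal node shows an attribute test, each leaf a species label:
library(rpart)
library(rpart.plot)
tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(tree, type = 5)    # tests at the nodes, class labels at the leaves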
Random forest (RF): a set of decision trees, each calculated using a bootstrapped dataset and a
random subset of variables at each decision step. This variety is what makes RF more effective
than individual decision trees. To classify new data, the data are run down all trees and
assigned to the class that receives the most votes.
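A minimal sketch, assuming the 'randomForest' package is available as an extra (caret's method = "rf" in the practical part wraps the same package); ntree bootstrapped trees are grown and mtry variables are tried at each split:
library(randomForest)           # assumed extra package
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf$confusion                    # out-of-bag confusion matrix (majority vote over all trees)
predict(rf, head(iris[, 1:4]))  # class with the most votes for new data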
Artificial neural network (ANN): An ANN is based on a collection of connected nodes called
neurons. The neurons are connected via channels and are grouped in different layers. The input
layer receives the input, the output layer predicts the output and the hidden layers between
them perform the main part of the calculation. During the training phase the weights assigned to
the channels are adjusted to improve the prediction of the output layer. To classify new data, the
data are propagated through the neural net using the weights and functions calculated in
the training phase. The data are assigned to the class that achieves the highest score
(probability) in the output layer.
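A minimal sketch with a single hidden layer, using the nnet package (caret's method = "nnet" in the practical part relies on the same package); size, decay and maxit are illustrative choices:
library(nnet)
set.seed(1)
net <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01, maxit = 200, trace = FALSE)
head(predict(net, iris))                   # output-layer scores (probabilities) per class
head(predict(net, iris, type = "class"))   # class with the highest score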

Practical Part
Objective
We will try to build a classifier separating different iris species according to their sepal and petal
features (length and width).

Libraries
Make sure that the packages below are installed. If not, you can install these packages with the
following commands.
install.packages("caret", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("grid", dependencies = TRUE)
install.packages("gridExtra", dependencies = TRUE)
install.packages("factoextra", dependencies = TRUE)
install.packages("ggdendro", dependencies = TRUE)
install.packages("rpart.plot", dependencies = TRUE)

We will use the powerful ‘caret’ package (short for Classification and Regression Training), which
contains functions for supervised ML.
Import it:
library(caret)

Set the working directory, where the output figures will be saved


setwd(" ... a path to the newly created folder ... ")

Load the iris data


data(iris)

Inspect the data and get some basic stats. How does the data look? How many rows, columns?
head(iris)
tail(iris)
dim(iris)
summary(iris)
str(iris)

Now check for missing values using the is.na() function and print the number of missing items
in the data frame as shown below.
sum(is.na(iris))

You can create a new data set without missing values using the na.omit() function in R.
#newdata <- na.omit(mydata)

Shuffle the data to get rid of any patterns.


The set.seed() function in R is used to create reproducible results when writing code that
involves random values. By using the set.seed() function, you guarantee that the same random
values are produced each time you run the code.
set.seed(7)
iris <- iris[sample(1:nrow(iris), size = nrow(iris), replace = F),]
head(iris, 20)

Plot a pairs plot (scatterplot matrix) and a boxplot of the four iris features for each species
scales <- list(x=list(relation="free"), y=list(relation="free"))
png(filename = "featurePlot_Pairs.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs", scales = scales, auto.key = list(columns
= 3))
dev.off()
png(filename = "featurePlot_Box.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "box", scales = scales, par.strip.text = list(cex = 0.6))
dev.off()

Unsupervised classification
In order to see how well our classification performs, we will mask the species labels of 30% of
the data.
Split the dataset 70/30:
set.seed(9)
TrainingIndex <- createDataPartition(iris$Species, p=0.7, list = F)

70% of the data with class labels:


TrainingSet <- iris[TrainingIndex,]
dim(TrainingSet)

Now let’s mask the species labels for the remaining 30% of the data, so we assume these are
samples where we do not know the species:
TestingSet <- iris[-TrainingIndex,]
TestingSet$Species <- "unknown"
dim(TestingSet)

We will use two unsupervised ML methods, PCA and clustering (a clustering sketch follows the PCA part). First, combine the training and testing sets:


data_unsup <- as.data.frame(rbind(TrainingSet, TestingSet))

We will also need some packages for plotting


library(ggplot2)
library(grid)
library(gridExtra)
library(factoextra) # because of ‘fviz_eig()’ function

PCA

pca <- prcomp(data_unsup[,1:4], center = T, scale. = T)

What kind of object is pca?


class(pca)
is.list(pca)

Get the variances explained by each PC


png(filename = "Scree.Plot.png", width = 12, height = 12, units = "cm", res = 300)
fviz_eig(pca, addlabels = T)
dev.off()

Get the PCA results


df_out <- as.data.frame(pca$x)
head(df_out)

Assign species label


df_out$Species <- as.character(data_unsup[,5])

Plot the PCA results


p1 <- ggplot(df_out, aes(x=PC1, y=PC2, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

p2 <- ggplot(df_out, aes(x=PC1, y=PC3, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

p3 <- ggplot(df_out, aes(x=PC2, y=PC3, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

Combine the three plots into one


pFin <- grid.arrange(p1,p2,p3, ncol=2)

Save PCA plot as .png file


ggsave(pFin, filename = "PCA.png", device = "png", dpi = 600, width = 30, height = 30,
units = "cm")
Supervised Classification
Setosa is way too easy to identify and therefore not interesting for us. We will focus on versicolor
and virginica, since they are not so well separated by PCA and clustering. Therefore, we
remove all setosa samples.
iris <- iris[-c(which(iris$Species == "setosa")),]
iris$Species <- factor(iris$Species)

In order to test how well our classification performs, we will split the data 70/30, train with 70%
of the data and test with the other 30%
set.seed(9)
training_indices <- createDataPartition(iris$Species, p=0.7, list = F)

Select 70% of the data to train the models


irisTrain <- iris[training_indices,]

Use the remaining 30% of the dataset for model testing


irisTest <- iris[-training_indices,]

Now we do the supervised ML. Let’s specify some important parameters/settings for the ML (10-
fold cross-validation with 3 repeats).
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = multiClassSummary, classProbs = T, savePredictions = T)

Also, we define the metric used to estimate model performance: the model with the highest
value of this metric will be selected.
metric <- "Accuracy"

k-NN
set.seed(7)
fit.knn <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "knn", metric = metric, trControl =
control, preProcess = c("center", "scale"))
Decision Tree
set.seed(7)
fit.cart <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rpart", metric = metric, trControl =
control, preProcess = c("center","scale"))
A useful property of decision trees is that the decisions of the machine can be followed up
easily. Plot the best decision tree:
library(rpart.plot)
rpart.plot(fit.cart$finalModel, type = 5)
Random Forest
set.seed(7)
fit.rf <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rf", metric = metric, trControl = control,
preProcess = c("center", "scale"))
SVM
set.seed(7)
fit.svm_Lin <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmLinear", metric = metric,
trControl = control, preProcess = c("center", "scale"))
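The model comparison further below also includes a radial-kernel SVM (fit.svm_Rad), whose training call is not printed on this sheet; a matching call, assuming caret's svmRadial method, would be:
set.seed(7)
fit.svm_Rad <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmRadial", metric = metric,
trControl = control, preProcess = c("center", "scale"))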
Artificial neural network
set.seed(7)
fit.nnet <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "nnet", metric = metric,
trControl = control, preProcess = c("center", "scale"))
LDA
set.seed(7)
fit.lda <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "lda", metric = metric, trControl =
control, preProcess = c("center", "scale"))

Summarize ML performance
results <- resamples(list(lda=fit.lda, knn=fit.knn, svm_Lin=fit.svm_Lin, svm_Rad=fit.svm_Rad,
rf=fit.rf, cart=fit.cart, nnet=fit.nnet))
summary(results)

Plot comparison of models


scales <- list(x=list(relation="free"), y=list(relation="free"))
png(filename = "ML_performance.png", width = 20, height = 20, units = "cm", res = 300)
dotplot(results, scales=scales, par.strip.text=list(cex=0.76), par.settings = list(par.xlab.text =
list(cex = 0)))
dev.off()
What is the best model in terms of Accuracy?
Testing
Predict the outcome for the testing dataset
predictions <- predict(fit.nnet, irisTest[,1:4])
Compare the predicted outcome with the true outcome
confusionMatrix(predictions, irisTest$Species)

How important are the individual features for the machine?


gbmImp <- varImp(fit.nnet, scale = T)
png(filename = "Iris_ML_FeatImp.png", width = 10, height = 10, units = "cm", res = 300)
plot(gbmImp, main="Iris Variable Importance (ANN)", top=4)
dev.off()

***********************************
Now it’s time for your first own ML project. Load the data:
data(GermanCredit, package = "caret")
Do the ML to identify the classes "bad" and "good" (‘Class’ column). Use the same settings as
before: 70/30 splitting, 10 times repeated CV, and select for the best Balanced Accuracy.
Which ML algorithm performs best? What are the six most important features for the
classification?

Well done!
