
R Programming Course

Jena University Hospital, February 2024


R Programming course, Thursday, Classification (Machine Learning)
Prof. Dr. Rainer König, Alicia Hiemisch, Mohadeseh Naseri, supported by Dr. Marcus Oswald,
Antje Biering, Maria Dinkelacker and Kulwadee Thanamit

Introduction
Definition: Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part of artificial
intelligence. Machine learning algorithms build a model based on sample data, known as
training data, in order to make predictions or decisions without being explicitly programmed to
do so. Machine learning algorithms are used in a wide variety of applications, such as in
medicine, email filtering, speech recognition, and computer vision, where it is difficult or
unfeasible to develop conventional algorithms to perform the needed tasks.

A major application is classification.

There are two basic types of ML: supervised and unsupervised. Unsupervised ML only discovers
patterns of data similarity, whereas supervised ML algorithms train and optimize a machine on
a subset of the data with known class labels.

Unsupervised Machine Learning


Clustering: Basically, clustering is about grouping objects systematically using
distance/similarity measures.
Principal component analysis (PCA): used for dimensionality reduction by projecting each data
point onto only the first few principal components to obtain lower-dimensional data while
preserving as much of the data's variation as possible.

Supervised Machine Learning


k-nearest neighbors algorithm (k-NN): one of the simplest supervised machine learning
algorithms. k-NN measures the similarity between a new observation and the available cases
and assigns the new case to the category that is most common among its k nearest cases. The
choice of k is crucial for the quality of the results.
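As an illustration (the practical part below trains k-NN via the caret package instead), a minimal sketch using the 'class' package (usually shipped with R, but not in the install list below; k = 5 is an arbitrary choice) could look like this:
library(class)                      # assumed extra package, not used in the exercise below
set.seed(1)
idx   <- sample(nrow(iris), 100)    # 100 labelled training cases
train <- scale(iris[idx, 1:4])      # centre and scale the features
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
pred  <- knn(train = train, test = test, cl = iris$Species[idx], k = 5)
table(pred, iris$Species[-idx])     # confusion table: predicted vs. true species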
Linear discriminant analysis (LDA): aims to find a linear combination of features that
characterizes or separates two or more classes of objects/observations, thereby reducing the
dimensionality systematically.
Support vector machine (SVM): SVM maps training examples to points in space and constructs
a hyperplane so that the distance from it to the nearest data point on each side is maximized. A
good separation is achieved by the hyperplane that has the largest distance to the nearest
training-data point of any class. Kernel trick: the method maps non-linearly separable data into a
higher-dimensional space, making the data linearly separable.
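As a sketch of a kernel SVM (the practical part below trains a linear SVM via caret instead), assuming the 'e1071' package is available as an extra:
library(e1071)                             # assumed extra package, not used in the exercise below
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predict(fit, iris), iris$Species)    # resubstitution confusion table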
Decision tree (DT): Structure in which each internal node represents a "test" on an attribute (e.g.
Petal Width > 1), each branch represents the outcome of the test, and each leaf node
represents a class label (decision taken after computing all attributes). The paths from root to
leaf represent classification rules.
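A minimal sketch of such a tree on the iris data, using the rpart package (a dependency of the rpart.plot package installed below); each internal node shows an attribute test, each leaf a species label:
library(rpart)
library(rpart.plot)
tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(tree, type = 5)    # tests at the nodes, class labels at the leaves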
Random forest (RF): a set of decision trees, each calculated using a bootstrapped dataset and a
random subset of variables at each decision step. This variety is what makes RF more effective
than individual decision trees. To classify new data, the data are run down all trees and
assigned to the class that receives the most votes.
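A minimal sketch, assuming the 'randomForest' package is available as an extra (caret's method = "rf" in the practical part wraps the same package); ntree bootstrapped trees are grown and mtry variables are tried at each split:
library(randomForest)           # assumed extra package
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf$confusion                    # out-of-bag confusion matrix (majority vote over all trees)
predict(rf, head(iris[, 1:4]))  # class with the most votes for new data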
Artificial neural network (ANN): An ANN is based on a collection of connected nodes called
neurons. The neurons are connected via channels and are grouped in different layers. The input
layer receives the input, the output layer predicts the output and the hidden layers between
them perform the main part of the calculation. During the training phase the weights assigned to
the channels are adjusted to improve the prediction of the output layer. To classify new data, the
data are propagated through the neural net using the weights and functions calculated in
the training phase. The data are assigned to the class that achieves the highest score
(probability) in the output layer.
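A minimal sketch with a single hidden layer, using the nnet package (caret's method = "nnet" in the practical part relies on the same package); size, decay and maxit are illustrative choices:
library(nnet)
set.seed(1)
net <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01, maxit = 200, trace = FALSE)
head(predict(net, iris))                   # output-layer scores (probabilities) per class
head(predict(net, iris, type = "class"))   # class with the highest score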

Practical Part
Objective
We will try to build a classifier separating different iris species according to their sepal and petal
features (length and width).

Libraries
Make sure that the packages below are installed. If not, you can install these packages with the
following commands.
install.packages("caret", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("grid", dependencies = TRUE)
install.packages("gridExtra", dependencies = TRUE)
install.packages("factoextra", dependencies = TRUE)
install.packages("ggdendro", dependencies = TRUE)
install.packages("rpart.plot", dependencies = TRUE)

We will use the powerful ‘caret’ package (short for Classification and Regression Training), which
contains functions for supervised ML.
Import it:
library(caret)

Set the working directory, where the output figures will be saved


setwd(" ... a path to the newly created folder ... ")

Load the iris data


data(iris)

Inspect the data and get some basic stats. How does the data look? How many rows, columns?
head(iris)
tail(iris)
dim(iris)
summary(iris)
str(iris)

Now check for missing values using the is.na() function and print the number of missing items
in the data frame as shown below.
sum(is.na(iris))

You can create a new data set without missing values using the na.omit() function in R.
#newdata <- na.omit(mydata)

Shuffle the data to get rid of any patterns.


The set.seed() function in R is used to create reproducible results when writing code that
involves random values. By using the set.seed() function, you guarantee that the same random
values are produced each time you run the code.
set.seed(7)
iris <- iris[sample(1:nrow(iris), size = nrow(iris), replace = F),]
head(iris, 20)

Plot a pairs plot (scatterplot matrix) and a boxplot of the four iris features for each species
scales <- list(x=list(relation="free"), y=list(relation="free"))
png(filename = "featurePlot_Pairs.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs", scales = scales, auto.key = list(columns
= 3))
dev.off()
png(filename = "featurePlot_Box.png", width = 15, height = 15, units = "cm", res = 300)
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "box", scales = scales, par.strip.text = list(cex = 0.6))
dev.off()

Unsupervised classification
In order to see how well our classification performs, we will mask the species labels of 30% of
the data.
Split the dataset 70/30:
set.seed(9)
TrainingIndex <- createDataPartition(iris$Species, p=0.7, list = F)

70% of the data with class labels:


TrainingSet <- iris[TrainingIndex,]
dim(TrainingSet)

Now let’s mask the species labels for the remaining 30% of the data, so we assume these are
samples where we do not know the species:
TestingSet <- iris[-TrainingIndex,]
TestingSet$Species <- "unknown"
dim(TestingSet)

We will use two unsupervised ML methods, PCA and clustering (a clustering sketch follows the PCA part). First, combine the training and testing sets:


data_unsup <- as.data.frame(rbind(TrainingSet, TestingSet))

We will also need some packages for plotting


library(ggplot2)
library(grid)
library(gridExtra)
library(factoextra) # because of ‘fviz_eig()’ function

PCA

pca <- prcomp(data_unsup[,1:4], center = T, scale. = T)

What kind of object is pca?


class(pca)
is.list(pca)

Get the variances explained by each PC


png(filename = "Scree.Plot.png", width = 12, height = 12, units = "cm", res = 300)
fviz_eig(pca, addlabels = T)
dev.off()

Get the PCA results


df_out <- as.data.frame(pca$x)
head(df_out)

Assign species label


df_out$Species <- as.character(data_unsup[,5])

Plot the PCA results


p1 <- ggplot(df_out, aes(x=PC1, y=PC2, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

p2 <- ggplot(df_out, aes(x=PC1, y=PC3, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

p3 <- ggplot(df_out, aes(x=PC2, y=PC3, color=Species, label=rownames(df_out))) +
geom_point() + geom_text(aes(label=rownames(df_out)), hjust=0, vjust=0) +
theme_bw()

Combine the three plots into one


pFin <- grid.arrange(p1,p2,p3, ncol=2)

Save PCA plot as .png file


ggsave(pFin, filename = "PCA.png", device = "png", dpi = 600, width = 30, height = 30,
units = "cm")
Supervised Classification
Setosa is way too easy to identify and therefore not interesting for us. We will focus on versicolor
and virginica, since they are not so well separated by PCA and clustering. Therefore, we
remove all setosa samples.
iris <- iris[-c(which(iris$Species == "setosa")),]
iris$Species <- factor(iris$Species)

In order to test how well our classification performs, we will split the data 70/30, train with 70%
of the data and test with the other 30%
set.seed(9)
training_indices <- createDataPartition(iris$Species, p=0.7, list = F)

Select 70% of the data to train the models


irisTrain <- iris[training_indices,]

Use the remaining 30% of the dataset for model testing


irisTest <- iris[-training_indices,]

Now we do the supervised ML. Let’s specify some important parameters/settings for the ML (10-
fold cross-validation with 3 repeats).
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = multiClassSummary, classProbs = T, savePredictions = T)

Also, we define the metric used to estimate model performance: the model with the highest
value of this metric will be selected.
metric <- "Accuracy"

k-NN
set.seed(7)
fit.knn <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "knn", metric = metric, trControl =
control, preProcess = c("center", "scale"))
Decision Tree
set.seed(7)
fit.cart <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rpart", metric = metric, trControl =
control, preProcess = c("center","scale"))
A useful property of decision trees is that the decisions of the machine can be followed up
easily. Plot the best decision tree:
library(rpart.plot)
rpart.plot(fit.cart$finalModel, type = 5)
Random Forest
set.seed(7)
fit.rf <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "rf", metric = metric, trControl = control,
preProcess = c("center", "scale"))
SVM
set.seed(7)
fit.svm_Lin <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmLinear", metric = metric,
trControl = control, preProcess = c("center", "scale"))
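The model comparison further below also includes a radial-kernel SVM (fit.svm_Rad), whose training call is not printed on this sheet; a matching call, assuming caret's svmRadial method, would be:
set.seed(7)
fit.svm_Rad <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "svmRadial", metric = metric,
trControl = control, preProcess = c("center", "scale"))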
Artificial neural network
set.seed(7)
fit.nnet <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "nnet", metric = metric,
trControl = control, preProcess = c("center", "scale"))
LDA
set.seed(7)
fit.lda <- train(x=irisTrain[,1:4], y=irisTrain[,5], method = "lda", metric = metric, trControl =
control, preProcess = c("center", "scale"))

Summarize ML performance
results <- resamples(list(lda=fit.lda, knn=fit.knn, svm_Lin=fit.svm_Lin, svm_Rad=fit.svm_Rad,
rf=fit.rf, cart=fit.cart, nnet=fit.nnet))
summary(results)

Plot comparison of models


scales <- list(x=list(relation="free"), y=list(relation="free"))
png(filename = "ML_performance.png", width = 20, height = 20, units = "cm", res = 300)
dotplot(results, scales=scales, par.strip.text=list(cex=0.76), par.settings = list(par.xlab.text =
list(cex = 0)))
dev.off()
What is the best model in terms of Accuracy?
Testing
Predict the outcome for the testing dataset
predictions <- predict(fit.nnet, irisTest[,1:4])
Compare the predicted outcome with the true outcome
confusionMatrix(predictions, irisTest$Species)

How important are the individual features for the machine?


gbmImp <- varImp(fit.nnet, scale = T)
png(filename = "Iris_ML_FeatImp.png", width = 10, height = 10, units = "cm", res = 300)
plot(gbmImp, main="Iris Variable Importance (ANN)", top=4)
dev.off()

***********************************
Now it’s time for your first own ML project. Load the data:
data(GermanCredit, package = "caret")
Do the ML to identify the classes "bad" and "good" (‘Class’ column). Use the same settings as
before: 70/30 splitting, 10 times repeated CV, and select for the best Balanced Accuracy.
Which ML algorithm performs best? What are the six most important features for the
classification?

Well done!
