Tutorial 10
DSA1101
Introduction to Data Science
November 9, 2018
Exercise 1. Logistic regression in R
In Tutorial 7, we looked at the CSV dataset “Titanic.csv”, which provides information on the
fate of passengers on the fatal maiden voyage of the ocean liner Titanic and includes the
variables economic status (class), sex, age and survival. We trained a naïve Bayes classifier
on this dataset to predict survival. This week, we will use logistic regression to predict
survival, and compare the performance of the two classifiers visually using ROC curves.
(a) Load the dataset “Titanic.csv” which has been posted under the folder for Tutorial 7.
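    # stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0)
    Titanic_dataset = read.csv("Titanic.csv", stringsAsFactors = TRUE)
    dim(Titanic_dataset)   # number of rows and columns
    head(Titanic_dataset)  # first six rows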
(b) Perform logistic regression of ‘Survived’ on all the feature variables.
    Survival_logistic <- glm(Survived ~ .,
                             data = Titanic_dataset,
                             family = binomial(link = "logit"))
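To see what a fit like the one in (b) represents, the sketch below uses the same call pattern on a small simulated data frame. The names `toy`, `x`, `y` and the coefficient values are made up for illustration and are not taken from Titanic.csv. It checks that the fitted probabilities are the inverse logit of the linear predictor.

```r
# Simulated data: illustrative only, not the Titanic dataset
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * toy$x))

# Same call pattern as in part (b)
toy_fit <- glm(y ~ x, data = toy, family = binomial(link = "logit"))

# Linear predictor (log-odds) and fitted probabilities
eta   <- predict(toy_fit, type = "link")
p_hat <- predict(toy_fit, type = "response")

# Under the logit link, p = 1 / (1 + exp(-eta)); plogis() computes this
all.equal(unname(p_hat), unname(plogis(eta)))

# Exponentiated coefficients can be read as odds ratios
exp(coef(toy_fit))
```

The `type = "response"` predictions are the scores that part (d) passes to the ROC calculation.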
(c) Perform naïve Bayes classification of ‘Survived’ based on all the feature variables.
    library(e1071)
    Survival_Nbayes <- naiveBayes(Survived ~ .,
                                  data = Titanic_dataset)
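As a rough illustration of the calculation a naive Bayes classifier performs for categorical features, the sketch below works through one prediction by hand in base R. The data frame `toy` is invented for this example and is not the Titanic dataset; no smoothing is applied.

```r
# Toy data: illustrative only, not the Titanic dataset
toy <- data.frame(
  Sex      = c("Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female"),
  Age      = c("Adult", "Child", "Adult", "Adult", "Adult", "Child", "Adult", "Adult"),
  Survived = c("No", "Yes", "Yes", "Yes", "No", "Yes", "No", "No")
)

# Prior class probabilities P(Survived)
prior <- prop.table(table(toy$Survived))

# Conditional probabilities P(feature | Survived); each row sums to 1
p_sex <- prop.table(table(toy$Survived, toy$Sex), margin = 1)
p_age <- prop.table(table(toy$Survived, toy$Age), margin = 1)

# Unnormalised score for a female adult:
# prior * P(Female | class) * P(Adult | class)
score <- as.vector(prior) * p_sex[, "Female"] * p_age[, "Adult"]

# Normalise so the two posterior probabilities sum to 1
posterior <- score / sum(score)
posterior
```

The "naive" part is the product in `score`: the features are treated as conditionally independent given the class.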
(d) Observe and compare the ROC curves for the two classifiers.
    library(ROCR)

    # ROC curve for logistic regression, using fitted survival probabilities
    pred = predict(Survival_logistic, type = "response")
    predObj = prediction(pred, Titanic_dataset$Survived)
    rocObj = performance(predObj, measure = "tpr", x.measure = "fpr")
    plot(rocObj)

    # ROC curve for naive Bayes; type = "raw" returns one probability column per class
    nb_prediction <- predict(Survival_Nbayes, Titanic_dataset, type = "raw")
    score <- nb_prediction[, 2]  # probability of the second class level of Survived
    pred_nb <- prediction(score, Titanic_dataset$Survived)
    roc_nb = performance(pred_nb, measure = "tpr", x.measure = "fpr")
    plot(roc_nb, add = TRUE, col = 2)

    legend("bottomright", c("logistic regression", "naive Bayes"),
           col = c("black", "red"), lty = 1)
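To make explicit what the ROC curves in (d) plot, the sketch below computes the (fpr, tpr) points and the area under the curve by hand in base R. The helpers `roc_points` and `auc_trapezoid` and the scores and labels are hypothetical, written for this illustration only; they are not part of ROCR.

```r
# Hypothetical helper: ROC points from numeric scores and 0/1 labels
roc_points <- function(score, label) {
  # Sweep the threshold from "classify nothing as positive" downwards
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(score >= t & label == 1) / sum(label == 1))
  fpr <- sapply(thresholds, function(t) sum(score >= t & label == 0) / sum(label == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# Hypothetical helper: area under the ROC curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Made-up scores and labels, for illustration only
label <- c(1, 1, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)

roc <- roc_points(score, label)
auc <- auc_trapezoid(roc)
auc
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9. A curve that lies closer to the top-left corner has a larger area, which gives a single number to set beside the visual comparison; ROCR reports the same quantity through `performance()` with `measure = "auc"`.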