
Lab 6

Yash Rathod

2022-12-08

1. Upload Titanic dataset

library(tidyr)
library(caret)

## Warning: package 'caret' was built under R version 4.2.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.2

## Loading required package: lattice

test_main <- read.csv("test.csv")


train_main <- read.csv("train.csv")
summary(train_main)

## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20

## 3rd Qu.: 31.00
## Max. :512.33
##

summary(test_main)

## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA’s :1

# Keep only rows with a valid port of embarkation (drops the 2 rows where Embarked == "")
train_main <- train_main[train_main$Embarked == "C" | train_main$Embarked == "Q" | train_main$Embarked == "S", ]

By summarizing the results we can see that there are 177 NA's in the Age column of the train dataset (and 86 in the test dataset). To visualize the distribution of this column, let's plot a histogram.

hist(train_main$Age, main = "Age Distribution", col = "lightgreen", xlab = "Age")

[Figure: histogram "Age Distribution", Frequency (0 to 200) vs. Age (0 to 80)]

We can see that the data is roughly normally distributed but slightly right-skewed, so the median is a reasonable replacement for the missing values. We will replace the NA's with 28, which is the median Age of the train dataset.

train_main$Age <- ifelse(is.na(train_main$Age), median(train_main$Age, na.rm = TRUE),train_main$Age)

As seen above, the test dataset does not have a Survived column, so we split our train_main dataset into train and test sets (we need the Survived column in our test set to compute accuracy later in the code).

set.seed(123)
split = createDataPartition(train_main$Survived, p = 0.8, list = FALSE)
train = train_main[split,]
test = train_main[-split,]

2. Define Survived column as TARGET variable

colnames(train)[2] <- "TARGET"
colnames(test)[2] <- "TARGET"
names(train)

## [1] "PassengerId" "TARGET" "Pclass" "Name" "Sex"


## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"

3. Select features that can be predictive of the survival status

4. Drop features that you think are not predictive and explain why
they are being dropped

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

train_selected <- select(train,"Pclass","Sex","Age","Fare","SibSp","Parch","Embarked","TARGET")


test_selected <- select(test,"Pclass","Sex","Age","Fare","SibSp","Parch","Embarked","TARGET")

Above we have selected the following columns:

- Pclass: the passenger class (1 = first class, 2 = second class, 3 = third class).
- Sex: the passenger's gender (male or female).
- Age: the passenger's age.
- SibSp: the number of siblings or spouses the passenger was traveling with.
- Parch: the number of parents or children the passenger was traveling with.
- Fare: the fare paid by the passenger.
- Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

I have dropped the following columns:

- PassengerId: the unique ID of the passenger.
- Name: the passenger's name.
- Cabin: the passenger's cabin number.
- Ticket: the passenger's ticket number.

These columns are dropped because they are unique (or near-unique) identifiers for each passenger, so they carry little generalizable signal about survival; a quick check below illustrates this.
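As a quick sanity check, we can count the distinct values in each dropped column; a column that is unique per row cannot help the model generalize. A minimal sketch, run on the train split created above:

# Count the number of distinct values in each dropped column of the train split
sapply(train[, c("PassengerId", "Name", "Ticket", "Cabin")],
       function(col) length(unique(col)))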

5. Transform selected categorical features with Dummy values

train_selected$Sex <- as.factor(train_selected$Sex)


train_selected$Embarked <- as.factor(train_selected$Embarked)
test_selected$Sex <- as.factor(test_selected$Sex)
test_selected$Embarked <- as.factor(test_selected$Embarked)
str(test_selected)

## 'data.frame': 177 obs. of 8 variables:


## $ Pclass : int 3 3 1 1 3 2 2 3 1 1 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 2 2 2 1 ...
## $ Age : num 22 26 54 58 14 28 34 28 19 28 ...
## $ Fare : num 7.25 7.92 51.86 26.55 7.85 ...
## $ SibSp : int 1 0 0 0 0 0 0 0 3 1 ...
## $ Parch : int 0 0 0 0 0 0 0 0 2 0 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 1 3 1 ...
## $ TARGET : int 0 1 0 1 0 1 1 0 0 1 ...

We have transformed two variables, Sex and Embarked, into factors. Note that the raw Embarked column also contains an empty string "" as a fourth category; since we already removed those rows when filtering train_main above, the factor has only the three valid levels (C, Q, S) and the line below is no longer needed.

#test_selected <- test_selected[!test_selected$Embarked == "",]
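Converting to factors is sufficient here because glm() expands factors into dummy (indicator) variables internally, as the EmbarkedQ and EmbarkedS coefficients in the model summary below show. If explicit dummy columns are preferred, caret's dummyVars() can generate them; a minimal sketch, assuming the factor columns created above:

# Full-rank encoding drops one reference level per factor, matching
# the treatment contrasts glm() uses internally.
dv <- dummyVars(TARGET ~ ., data = train_selected, fullRank = TRUE)
train_dummies <- data.frame(predict(dv, newdata = train_selected))
test_dummies <- data.frame(predict(dv, newdata = test_selected))
str(train_dummies)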

6. Apply logistic regression on the train/test dataset

model = glm(TARGET ~ ., data = train_selected, family = "binomial")

7. Compute your model’s accuracy using accuracy function

library(caret)
predictions = predict(model, newdata = test_selected, type = "response")
predictions_binary = ifelse(predictions > 0.5, 1, 0)
accuracy <- mean(predictions_binary == test_selected$TARGET)
print(accuracy)

## [1] 0.8248588

Accuracy ≈ 82.49%.
Note: I was not able to use the accuracy function from the caret library as it was throwing errors on my system. I could have dug deeper into it, but due to time restrictions I calculated the accuracy manually.
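For reference, caret's confusionMatrix() reports accuracy (along with sensitivity, specificity, and other statistics) directly; it requires both inputs to be factors with matching levels, which is a common source of the kind of error mentioned above. A minimal sketch:

# Both vectors must be factors with identical levels
cm <- confusionMatrix(
  data = factor(predictions_binary, levels = c(0, 1)),
  reference = factor(test_selected$TARGET, levels = c(0, 1)),
  positive = "1" # treat "survived" as the positive class
)
cm$overall["Accuracy"]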

summary(model)

##
## Call:
## glm(formula = TARGET ~ ., family = "binomial", data = train_selected)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5063 -0.6270 -0.4479 0.6743 2.3819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.796595 0.603742 7.945 1.95e-15 ***
## Pclass -0.979673 0.156268 -6.269 3.63e-10 ***
## Sexmale -2.635261 0.220833 -11.933 < 2e-16 ***
## Age -0.033096 0.008459 -3.913 9.13e-05 ***
## Fare 0.001880 0.002480 0.758 0.4484
## SibSp -0.291558 0.121906 -2.392 0.0168 *
## Parch -0.061550 0.124431 -0.495 0.6208
## EmbarkedQ -0.231530 0.410737 -0.564 0.5730
## EmbarkedS -0.524459 0.254227 -2.063 0.0391 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 940.00 on 711 degrees of freedom
## Residual deviance: 644.21 on 703 degrees of freedom
## AIC: 662.21
##
## Number of Fisher Scoring iterations: 5

From the above results we can see that Fare and Parch are not significant predictors (nor is EmbarkedQ, although EmbarkedS is significant at the 0.05 level), while Pclass, Sex, Age, and SibSp all contribute significantly; a reduced model dropping the clearly non-significant terms is sketched below.
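As a follow-up, one could refit the model without Fare and Parch and compare the AIC of the two fits; a minimal sketch:

# A lower AIC for the reduced model would justify dropping the two predictors
model_reduced <- glm(TARGET ~ Pclass + Sex + Age + SibSp + Embarked,
                     data = train_selected, family = "binomial")
AIC(model, model_reduced)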

confusion_Matrix <- table(predictions_binary, test_selected$TARGET)


confusion_Matrix

##
## predictions_binary 0 1
## 0 89 18
## 1 13 57

The model is doing fairly well with an accuracy of about 82%. The confusion matrix above shows the class-wise predictions of the model and lets us identify the error types. Of the 107 passengers predicted as non-survivors, 18 actually survived (false negatives), while of the 70 predicted survivors, 13 did not survive (false positives, i.e. type I errors).
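Those counts translate directly into precision and recall for the survived class; a minimal sketch computed from the table above:

# Rows of confusion_Matrix are predictions, columns are actual labels
TP <- confusion_Matrix["1", "1"] # predicted survived, actually survived (57)
FP <- confusion_Matrix["1", "0"] # predicted survived, actually did not (13)
FN <- confusion_Matrix["0", "1"] # predicted not survived, actually did (18)
c(precision = TP / (TP + FP), recall = TP / (TP + FN))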
After Email Instructions

Generating the Survived column for test dataset

predictTest = predict(model, type ="response", newdata = test_main)


test_main$Survived = as.numeric(predictTest >= 0.5)
table(test_main$Survived)

##
## 0 1
## 197 134

According to the results, our model predicts that of the 331 test passengers who received a prediction, 134 survived and 197 did not. Only 331 of the 418 test passengers get a prediction because predict() returns NA for rows with a missing Age or Fare; imputing those values first, as sketched below, would cover all 418.
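A minimal sketch of that imputation, reusing medians from the train data:

# Fill missing Age and Fare in the test set with the train medians, then
# re-predict so every one of the 418 passengers gets a label
test_main$Age[is.na(test_main$Age)] <- median(train_main$Age, na.rm = TRUE)
test_main$Fare[is.na(test_main$Fare)] <- median(train_main$Fare, na.rm = TRUE)
test_main$Survived <- as.numeric(predict(model, newdata = test_main,
                                         type = "response") >= 0.5)
table(test_main$Survived) # now covers all 418 passengers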