1.1 Loading The Data: Survival by Sex
str(train_data)
Survival by Sex
So it looks like there is a clear trend: females are much more likely to survive than males.
It’s a well-known fact that women and children were favoured for the lifeboats, and our data certainly
supports the hypothesis that a female passenger is more likely to survive. However, I want to see what
information lurks behind the Age variable.
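One way to put a number on that trend is a row-wise proportion table. The sketch below runs on a tiny made-up data frame (`toy` is a hypothetical stand-in, not the real train_data), so the proportions it prints are illustrative only:

```r
# Toy stand-in for train_data; the values are illustrative, not real Titanic records.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0, 0)
)

# Row-wise proportions: each row (one per sex) sums to 1,
# so the "1" column reads as the survival rate for that sex.
sex_rates <- prop.table(table(toy$Sex, toy$Survived), margin = 1)
print(round(sex_rates, 2))
```

Swapping `toy` for train_data should give the real survival rates by sex, assuming the Sex and Survived columns are named as in the rest of this post.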
Survival by Children
Another factor that we would expect to have a big bearing on our predictions is the passenger class.
It’s been said that the richer passengers were favoured for the lifeboats and, as the film depicts, 3rd
class passengers were often placed at the very bottom of the ship. So let’s see if the data matches up
to the movies.
mosaicplot(table(train_data$Survived, train_data$Pclass),
           main = "Passenger Survival by Class",
           ylab = "Passenger class", xlab = "Survived",
           col = hcl(c(240, 120, 80)))
2.3 Others
Another piece of information is hidden in the Cabin variable: the first character of the Cabin
value indicates the deck. The data is largely incomplete, but it could prove useful for those
passengers that do have cabin data, so let’s add it.
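A minimal sketch of how the deck could be pulled out of Cabin follows. The helper name `add_deck`, the new column name `Deck`, and the "Unknown" label for missing cabins are all assumptions for illustration, not necessarily what was used here:

```r
# Toy rows standing in for the real data; Cabin is empty or NA for most passengers.
toy <- data.frame(
  PassengerId = 1:4,
  Cabin = c("C85", "", NA, "E46"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: take the first character of Cabin as the deck letter,
# and label passengers with no cabin data as "Unknown".
add_deck <- function(df) {
  cabin <- as.character(df$Cabin)
  deck <- substr(cabin, 1, 1)
  deck[is.na(deck) | deck == ""] <- "Unknown"
  df$Deck <- factor(deck)
  df
}

toy <- add_deck(toy)
print(toy)
```

Keeping an explicit "Unknown" level, rather than NA, lets a model treat "no cabin recorded" as a category in its own right.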
test_data$Survived <- NULL # remove the Survived field from the test data set
# Now let's split the data one step further for our cross-validation set
temp_data <- split( train_data , train_data$PassengerId > 570 )
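split() with a logical condition returns a list with elements named "FALSE" and "TRUE". The sketch below shows how the two pieces could be pulled apart on a toy data frame; the names `train_part` and `cv_part` are hypothetical, and it is an assumption that the "TRUE" piece is what becomes the cross-validation set:

```r
# Toy data with PassengerIds straddling the 570 boundary.
toy <- data.frame(PassengerId = 566:575, Survived = rep(c(0, 1), 5))

# Logical split: the list elements are named "FALSE" and "TRUE".
parts <- split(toy, toy$PassengerId > 570)

train_part <- parts[["FALSE"]]  # PassengerId <= 570
cv_part    <- parts[["TRUE"]]   # PassengerId >  570

print(names(parts))
print(nrow(train_part))
print(nrow(cv_part))
```

Splitting on PassengerId is simple and reproducible; a random split is a common alternative if the row order might carry some pattern.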
Megan Risdal explains the rationale behind the following fills extremely well:
#Passengers 62 and 830 both embarked at Cherbourg according to Megan
train_data$Embarked[which(train_data$PassengerId == 62)] <- 'C'
cross_validate_data$Embarked[which(cross_validate_data$PassengerId == 830)] <- 'C'
# To guess the fare of passenger 1044 we use the median value of their class
# and embarkation location. Thanks again to Megan
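The median fill described in that comment could look roughly like the sketch below. The helper `fill_fare` is hypothetical and the fares in `toy` are made up, so the resulting value is not the real imputed fare:

```r
# Toy data: passenger 1044 has a missing Fare; all other values are illustrative.
toy <- data.frame(
  PassengerId = c(1, 2, 3, 4, 1044),
  Pclass      = c(3, 3, 3, 1, 3),
  Embarked    = c("S", "S", "S", "S", "S"),
  Fare        = c(7.25, 8.05, 9.50, 80.00, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace one passenger's Fare with the median Fare of
# passengers sharing the same class and port of embarkation.
fill_fare <- function(df, id) {
  row   <- which(df$PassengerId == id)
  peers <- df$Pclass == df$Pclass[row] & df$Embarked == df$Embarked[row]
  df$Fare[row] <- median(df$Fare[peers], na.rm = TRUE)
  df
}

toy <- fill_fare(toy, 1044)
print(toy$Fare[toy$PassengerId == 1044])
```

The median is used rather than the mean because fares are heavily skewed by a few very expensive tickets.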
Let’s use the valid data in our test_data set to predict values in all of the sets.
Now that we have a complete Age field, we can add the Person field that we investigated earlier
to determine whether a passenger is a man, woman, boy, or girl.
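As a rough sketch, the Person field could be derived from Sex and Age as shown below. The helper `add_person` is hypothetical, and the age boundary of 18 between child and adult is an assumption; the boundary actually used may differ:

```r
# Hypothetical helper: classify each passenger as man/woman/boy/girl.
# adult_age = 18 is an assumed threshold, not taken from the original analysis.
add_person <- function(df, adult_age = 18) {
  is_child <- df$Age < adult_age
  person <- ifelse(df$Sex == "male",
                   ifelse(is_child, "boy", "man"),
                   ifelse(is_child, "girl", "woman"))
  df$Person <- factor(person, levels = c("man", "woman", "boy", "girl"))
  df
}

# Toy passengers; the values are illustrative.
toy <- data.frame(
  Sex = c("male", "female", "male", "female"),
  Age = c(40, 29, 6, 12),
  stringsAsFactors = FALSE
)

toy <- add_person(toy)
print(toy)
```

This only works once Age has no missing values, which is why the field is added after the age prediction step.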
# Build the model (note: not all possible variables are used)
data = train_data)
Using the variable importance function, I removed the unimportant variables, leaving us with only the 8
most important. You can see clearly that Title takes the top spot, so we’re glad we made the new
field.
# Recall-Precision curve
plot(tpr)
The accuracy measure from the ROCR package gives us a good measure for our model: the best accuracy is about 0.838, reached at a cutoff of about 0.652.
## accuracy cutoff.254
## 0.8380062 0.6520000
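To make the idea of an "accuracy cutoff" concrete, here is a base R sketch that scans candidate cutoffs and keeps the one with the highest accuracy. It is not the ROCR implementation: the helper `accuracy_by_cutoff`, the `>=` rule for predicting survival, and the toy scores are all assumptions for illustration.

```r
# Hypothetical helper: for each distinct score used as a cutoff, predict
# "survived" when score >= cutoff, and measure the fraction of correct calls.
accuracy_by_cutoff <- function(scores, labels) {
  cutoffs <- sort(unique(scores))
  acc <- sapply(cutoffs, function(cut) mean((scores >= cut) == (labels == 1)))
  best <- which.max(acc)  # first cutoff reaching the maximum accuracy
  c(accuracy = acc[best], cutoff = cutoffs[best])
}

# Toy predicted probabilities and true outcomes; the values are illustrative.
scores <- c(0.10, 0.35, 0.40, 0.65, 0.80, 0.90)
labels <- c(0, 0, 1, 0, 1, 1)

best <- accuracy_by_cutoff(scores, labels)
print(best)
```

Read this way, the output above says the model is most accurate when a passenger is predicted to survive only if their predicted probability is at least around 0.65, rather than the default 0.5.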
5 Conclusion
At the time of writing I’m ranked 961/4401, which I don’t think is too bad for a first effort. I could
spend much more time improving the age prediction or fine-tuning the model for further refinement, but
it’s time to move on to more advanced competitions.