Coding titanicMain
Workflow stages
The competition solution workflow goes through the seven stages described in the Data Science Solutions book.
The workflow indicates the general sequence in which the stages may follow one another; however, there are use
cases with exceptions.
Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our
model determine, for a test dataset that does not contain the survival information, whether the passengers in the
test dataset survived or not?
We may also want to develop some early understanding about the domain of our problem. This is described on
the Kaggle competition description page here (https://www.kaggle.com/c/titanic). Here are the highlights to note.
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502
out of 2224 passengers and crew. That translates to a survival rate of about 32%.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for
the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were
more likely to survive than others, such as women, children, and the upper-class.
Workflow goals
The data science solutions workflow solves for seven major goals.
Classifying. We may want to classify or categorize our samples. We may also want to understand the
implications or correlation of different classes with our solution goal.
Correlating. One can approach the problem based on the available features within the training dataset. Which
features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a
correlation (https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature
values change, does the solution state change as well, and vice versa? This can be tested both for numerical and
categorical features in the given dataset. We may also want to determine correlation among features other than
survival for subsequent goals and workflow stages. Correlating certain features may help in creating,
completing, or correcting features.
Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm,
one may require all features to be converted to numerical equivalents, for instance converting text
categorical values to numeric values.
Completing. Data preparation may also require us to estimate any missing values within a feature. Model
algorithms may work best when there are no missing values.
Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within
features and try to correct these values or exclude the samples containing the errors. One way to do this is to
detect any outliers among our samples or features. We may also completely discard a feature if it is not
contributing to the analysis or may significantly skew the results.
Creating. Can we create new features based on an existing feature or a set of features, such that the new
feature follows the correlation, conversion, and completeness goals?
Charting. How to select the right visualization plots and charts depending on the nature of the data and the solution
goals.
In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Acquire data
The Python Pandas package helps us work with our datasets. We start by acquiring the training and testing
datasets into Pandas DataFrames. We also combine these datasets to run certain operations on both datasets
together.
In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]
newdf = pd.concat(combine)
newdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
Note the feature names, which we will manipulate or analyze directly. These feature names are described on
the Kaggle data page here (https://www.kaggle.com/c/titanic/data).
In [3]: print(train_df.columns.values)
Which features are numerical? These values change from sample to sample. Within the numerical features, are the
values discrete, continuous, or timeseries-based? Among other things, this helps us select the appropriate plots
for visualization.
In [4]:
train_df.head()
Out[4]:
[preview of the first five rows of train_df: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
In [6]:
newdf.head()
Out[6]:
[preview of the first five rows of newdf: Age, Cabin, Embarked, Fare, Name, Parch, PassengerId, Pclass, Sex, SibSp, Survived, Ticket]
Some features mix numerical and alphanumeric data within the same feature (for example Ticket and Cabin). These are candidates for the correcting goal.
This is harder to review for a large dataset; however, reviewing a few samples from a smaller dataset may tell us
outright which features may require correcting.
The Name feature may contain errors or typos, as there are several ways used to describe a name, including
titles, round brackets, and quotes used for alternative or short names.
In [5]: train_df.tail()
Out[5]:
[last five rows of train_df (PassengerId 887-891): PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
In [8]: newdf.tail()
Out[8]:
[last five rows of newdf (index 413-417, PassengerId 1305-1309): Age, Cabin, Embarked, Fare, Name, Parch, PassengerId, Pclass, Sex, SibSp, Survived, Ticket]
The Cabin, Age, and Embarked features contain a number of null values, in that order, for the training dataset.
Cabin and Age are incomplete in the case of the test dataset.
In [6]:
train_df.info()
print('_'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In [10]:
newdf.info()
print('_'*40)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
________________________________________
What is the distribution of numerical feature values across the samples?
This helps us determine, among other early insights, how representative the training dataset is of the actual
problem domain.
Total samples are 891 or 40% of the actual number of passengers on board the Titanic
(2,224). Survived is a categorical feature with 0 or 1 values.
Around 38% of samples survived, compared with the actual survival rate of 32%.
Most passengers (> 75%) did not travel with parents or children.
Nearly 30% of the passengers had siblings and/or spouse aboard.
Fares varied significantly with few passengers (<1%) paying as high as $512.
Few elderly passengers (<1%) within age range 65-80.
In [7]:
train_df.describe()
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions a 38% survival rate
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
Out[7]:
      PassengerId   Survived    Pclass        Age     SibSp     Parch       Fare
count  891.000000 891.000000       ...        ...       ...       ...        ...
std    257.353842   0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min      1.000000   0.000000  1.000000   0.420000  0.000000  0.000000   0.000000
25%    223.500000   0.000000  2.000000  20.125000  0.000000  0.000000   7.910400
50%    446.000000   0.000000  3.000000  28.000000  0.000000  0.000000  14.454200
75%    668.500000   1.000000  3.000000  38.000000  1.000000  0.000000  31.000000
max           ...        ...       ...        ...       ...       ...        ...
In [12]: newdf.describe()
Out[12]:
            Age        Fare     Parch  PassengerId    Pclass     SibSp  Survived
mean  29.881138   33.295479  0.385027   655.000000  2.294882  0.498854    0.3838
std   14.413493   51.758668  0.865560   378.020061  0.837836  1.041658    0.4865
min    0.170000    0.000000  0.000000     1.000000  1.000000  0.000000    0.0000
25%   21.000000    7.895800  0.000000   328.000000  2.000000  0.000000    0.0000
50%   28.000000   14.454200  0.000000   655.000000  3.000000  0.000000    0.0000
75%   39.000000   31.275000  0.000000   982.000000  3.000000  1.000000    1.0000
max   80.000000  512.329200  9.000000  1309.000000  3.000000  8.000000    1.0000
In [8]: train_df.describe(include=['O'])
Out[8]:
                          Name   Sex  Ticket        Cabin Embarked
top  Goldsmith, Mr. Frank John  male  347082  C23 C25 C27        S
In [14]: newdf.describe(include=['O'])
Out[14]:
           Cabin Embarked              Name   Sex    Ticket
top  C23 C25 C27        S  Kelly, Mr. James  male  CA. 2343
Correlating.
We want to know how well each feature correlates with Survived. We want to do this early in our project and
match these quick correlations with modelled correlations later in the project.
Completing.
1. We may want to complete the Age feature as it is definitely correlated to survival.
2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.
Correcting.
1. Ticket feature may be dropped from our analysis as it contains high ratio of duplicates (22%) and there may
not be a correlation between Ticket and survival.
2. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and
test dataset.
3. PassengerId may be dropped from the training dataset as it does not contribute to survival.
4. Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.
Creating.
1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family
members on board.
2. We may want to engineer the Name feature to extract Title as a new feature.
3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an
ordinal categorical feature.
4. We may also want to create a Fare range feature if it helps our analysis.
Classifying.
We may also add to our assumptions based on the problem description noted earlier.
1. Women (Sex=female) were more likely to have survived.
2. Children (below some Age cutoff) were more likely to have survived.
3. The upper-class passengers (Pclass=1) were more likely to have survived.
Pclass We observe significant correlation (>0.5) among Pclass=1 and Survived (classifying #3). We decide
to include this feature in our model.
Sex We confirm the observation during problem definition that Sex=female had very high survival rate at
74% (classifying #1).
SibSp and Parch These features have zero correlation for certain values. It may be best to derive a
feature or a set of features from these individual features (creating #1).
In [9]: train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[9]:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
newdf[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[16]:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
Out[10]:
Sex Survived
0 female 0.742038
1 male 0.188908
newdf[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[18]:
Sex Survived
0 female 0.742038
1 male 0.188908
Out[11]:
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
newdf[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[20]:
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
Out[12]:
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
newdf[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[22]:
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
7 9 NaN
A histogram chart is useful for analyzing continuous numerical variables like Age, where banding or ranges will
help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins
or equally ranged bands. This helps us answer questions relating to specific bands (did infants have a better
survival rate?).
Note that in these histogram visualizations the x-axis represents Age bins, while the y-axis represents the count of samples or passengers.
Observations.
Decisions.
This simple analysis confirms our assumptions as decisions for subsequent workflow stages.
We should consider Age (our assumption classifying #2) in our model training.
Complete the Age feature for null values (completing #1).
We should band age groups (creating #3).
In [13]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x2a37db59b70>
In [24]:
g = sns.FacetGrid(newdf, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x2c61f930a90>
Correlating numerical and ordinal features
We can combine multiple features for identifying correlations using a single plot. This can be done with
numerical and categorical features which have numeric values.
Observations.
Pclass=3 had most passengers, however most did not survive. Confirms our classifying assumption #2.
Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption
#2.
Most passengers in Pclass=1 survived. Confirms our classifying assumption #3.
Pclass varies in terms of Age distribution of passengers.
Decisions.
In [14]:
In [26]:
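The two grid-plot cells above are empty in this export. A minimal sketch of a plot that supports the observations below, mirroring the FacetGrid pattern already used for the Age histograms (the exact cell contents and parameters are assumptions):

# grid of Age histograms split by Pclass (rows) and Survived (columns)
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()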
Observations.
Female passengers had much better survival rate than males. Confirms classifying (#1). Exception in
Embarked=C where males had higher survival rate. This could be a correlation between Pclass and
Embarked and in turn Pclass and Survived, not necessarily a direct correlation between Embarked and
Survived.
Males had better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing
(#2).
Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating
(#1).
Decisions.
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x2a37e6febe0>
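The cells that produced the FacetGrid outputs above and below are not preserved. A hedged sketch of a point plot that would show survival rate by Pclass and Sex for each port of embarkation (the parameter choices are assumptions):

# survival rate by Pclass and Sex, one panel per Embarked port
grid = sns.FacetGrid(train_df, row='Embarked', aspect=1.6)
grid.map_dataframe(sns.pointplot, x='Pclass', y='Survived', hue='Sex',
                   order=[1, 2, 3], hue_order=['male', 'female'], palette='deep')
grid.add_legend()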
In [28]:
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x2c620547cc0>
Correlating categorical and numerical features
We may also want to correlate categorical features (with non-numeric values) and numeric features. We can
consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric
continuous), with Survived (Categorical numeric).
Observations.
Higher fare paying passengers had better survival. Confirms our assumption for creating (#4) fare ranges.
Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).
Decisions.
In [16]:
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x2a37e6fe8d0>
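The plotting cells behind these outputs are not preserved. A hedged sketch of a bar plot of average Fare by Sex, split by Embarked and Survived (the parameter choices are assumptions):

# average Fare by Sex, one panel per Embarked port and Survived outcome
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', aspect=1.6)
grid.map_dataframe(sns.barplot, x='Sex', y='Fare', order=['male', 'female'])
grid.add_legend()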
In [30]:
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x2c6206518d0>
Wrangle data
We have collected several assumptions and decisions regarding our datasets and solution requirements. So far
we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and
assumptions for correcting, creating, and completing goals.
Based on our assumptions and decisions we want to drop the Cabin (correcting #2) and Ticket (correcting #1)
features.
Note that where applicable we perform operations on both training and testing datasets together to stay
consistent.
In [17]: print("Before", train_df.shape, test_df.shape, combine[0].shape,
combine[1].shape)
Out[17]:
"After", newdf.shape
Out[32]:
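The drop operation itself is not preserved in this export. A minimal sketch consistent with the correcting decisions above and the "Before"/"After" shape checks (an assumption, not the original cell):

# drop Ticket (correcting #1) and Cabin (correcting #2) from all working DataFrames
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
newdf = newdf.drop(['Ticket', 'Cabin'], axis=1)
print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)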
In the following code we extract a Title feature using regular expressions. The RegEx pattern (\w+\.) matches
the first word which ends with a dot character within the Name feature. The expand=False flag returns a
Series rather than a DataFrame.
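The extraction cell itself is not preserved. A hedged sketch using an equivalent pattern, ' ([A-Za-z]+)\.', which captures the title without the trailing dot (an assumption consistent with the Title values tabulated below):

# extract the title (the word before the dot) from Name, for both splits and the combined frame
for dataset in combine + [newdf]:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)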
Observations.
When we plot Title, Age, and Survived, we note the following observations.
Most titles band Age groups accurately. For example, the Master title has an Age mean of 5 years.
Survival among Title Age bands varies slightly.
Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).
Decision.
In [18]:
pd.crosstab(train_df['Title'], train_df['Sex'])
Out[18]:
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
In [34]:
pd.crosstab(newdf['Title'], newdf['Sex'])
Out[34]:
Sex female male
Title
Capt 0 1
Col 0 4
Countess 1 0
Don 0 1
Dona 1 0
Dr 1 7
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 61
Miss 260 0
Mlle 2 0
Mme 1 0
Mr 0 757
Mrs 197 0
Ms 2 0
Rev 0 8
Sir 0 1
Out[19]:
Survived 0 1
Title
Capt 1 0
Col 1 1
Countess 0 1
Don 1 0
Dr 4 3
Jonkheer 1 0
Lady 0 1
Major 1 1
Master 17 23
Miss 55 127
Mlle 0 2
Mme 0 1
Mr 436 81
Mrs 26 99
Ms 0 1
Rev 6 0
Sir 0 1
Out[36]:
Survived 0.0 1.0
Title
Capt 1 0
Col 1 1
Countess 0 1
Don 1 0
Dr 4 3
Jonkheer 1 0
Lady 0 1
Major 1 1
Master 17 23
Miss 55 127
Mlle 0 2
Mme 0 1
Mr 436 81
Mrs 26 99
Ms 0 1
Rev 6 0
Sir 0 1
We can replace many titles with a more common name or classify them as Rare .
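The replacement cell is not preserved. A hedged sketch of the kind of consolidation that would produce the Title/Survived summary below; the exact grouping of rare titles is an assumption based on the crosstabs above:

# group uncommon titles as Rare and fold French variants into Miss/Mrs
for dataset in combine + [newdf]:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dona',
                                                 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()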
In [20]:
Out[20]:
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
In [38]:
Out[38]:
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
In [21]:
train_df.head()
Out[21]:
[first five rows of train_df after dropping Ticket and Cabin: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Fare, Embarked, Title]
In [40]:
newdf.head()
Out[40]:
[first five rows of newdf after dropping Ticket and Cabin: Age, Embarked, Fare, Name, Parch, PassengerId, Pclass, Sex, SibSp, Survived, Title]
Now we can safely drop the Name feature from training and testing datasets. We also do not need the
PassengerId feature in the training dataset.
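The drop cell is not preserved. A minimal sketch consistent with the (1309, 9) shape reported below (an assumption):

# drop Name everywhere; drop PassengerId where it is not needed for the submission
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
newdf = newdf.drop(['Name', 'PassengerId'], axis=1)
train_df.shape, test_df.shape, newdf.shape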
In [22]:
Out[22]:
In [42]:
(1309, 9)
In [43]: newdf.head()
Out[43]:
Let us start by converting Sex feature to a new feature called Gender where female=1 and male=0.
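The conversion cell is not preserved. Later outputs keep the column name Sex with values 0 and 1, so the sketch below maps the existing column in place rather than creating a separate Gender column (an assumption):

# map Sex to numeric values: female=1, male=0
for dataset in combine + [newdf]:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
train_df.head()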
In [23]:
train_df.head()
Out[23]:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 22.0 1 0 7.2500 S 1
1 1 1 1 38.0 1 0 71.2833 C 3
2 1 3 1 26.0 0 0 7.9250 S 2
3 1 1 1 35.0 1 0 53.1000 S 3
4 0 3 0 35.0 0 0 8.0500 S 1
In [45]:
newdf.head()
Out[45]:
1. A simple way is to generate random numbers between mean and standard deviation
(https://en.wikipedia.org/wiki/Standard_deviation).
2. A more accurate way of guessing missing values is to use other correlated features. In our case we note
correlation among Age, Gender, and Pclass. Guess Age values using median
(https://en.wikipedia.org/wiki/Median) values for Age across sets of Pclass and Gender feature
combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on.
3. Combine methods 1 and 2. So instead of guessing age values based on median, use random numbers
between mean and standard deviation, based on sets of Pclass and Gender combinations.
Methods 1 and 3 will introduce random noise into our models. The results from multiple executions might
vary. We will prefer method 2.
In [24]:
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x2a37ec71390>
In [47]:
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x2c621e22f60>
In [25]:
Out[25]:
Pclass Embarked Sex Age
0 1 C 0 40.111111
1 1 C 1 36.052632
2 1 Q 0 44.000000
3 1 Q 1 33.000000
4 1 S 0 41.897188
5 1 S 1 32.704545
11 2 S 1 29.719697
10 2 S 0 30.875889
9 2 Q 1 30.000000
8 2 Q 0 57.000000
7 2 C 1 19.142857
6 2 C 0 25.937500
12 3 C 0 25.016800
13 3 C 1 14.062500
14 3 Q 0 28.142857
15 3 Q 1 22.850000
16 3 S 0 26.574766
17 3 S 1 23.223684
In [49]:
Out[49]:
Pclass Embarked Sex Age
0 1 C 0 40.047619
1 1 C 1 38.107692
2 1 Q 0 44.000000
3 1 Q 1 35.000000
4 1 S 0 41.705977
5 1 S 1 35.609375
11 2 S 1 28.455165
10 2 S 0 30.491702
9 2 Q 1 30.000000
8 2 Q 0 53.750000
7 2 C 1 19.363636
6 2 C 0 27.269231
12 3 C 0 24.129474
13 3 C 1 16.818182
14 3 Q 0 26.738095
15 3 Q 1 24.333333
16 3 S 0 26.146241
17 3 S 1 22.854771
Let us start by preparing an empty array to contain guessed Age values based on Pclass x Gender
combinations.
In [26]:
guess_ages = np.zeros((2,3))
guess_ages
Out[26]:
Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six
combinations.
In [27]:
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            # median Age for each Sex x Pclass combination (method 2 above)
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j + 1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            guess_ages[i, j] = age_guess
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)
train_df.head()
Out[27]:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 22 1 0 7.2500 S 1
1 1 1 1 38 1 0 71.2833 C 3
2 1 3 1 26 0 0 7.9250 S 2
3 1 1 1 35 1 0 53.1000 S 3
4 0 3 0 35 0 0 8.0500 S 1
In [52]:
for dataset in [newdf]:
    for i in range(0, 2):
        for j in range(0, 3):
            # median Age for each Sex x Pclass combination (method 2 above)
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j + 1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            guess_ages[i, j] = age_guess
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)
newdf.head()
Out[52]:
Age Embarked Fare Parch Pclass Sex SibSp Survived Title
0 22 S 7.2500 0 3 0 1 0.0 1
1 38 C 71.2833 0 1 1 1 1.0 3
2 26 S 7.9250 0 3 1 0 1.0 2
3 35 S 53.1000 0 1 1 1 1.0 3
4 35 S 8.0500 0 3 0 0 0.0 1
In [28]:
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Out[28]:
AgeBand Survived
In [54]:
newdf['AgeBand'] = pd.cut(newdf['Age'], 5)
newdf[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Out[54]:
AgeBand Survived
In [29]:
Out[29]:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title AgeBand
In [56]:
Out[56]:
Age Embarked Fare Parch Pclass Sex SibSp Survived Title AgeBand
In [30]:
Out[30]:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 1 1 0 7.2500 S 1
1 1 1 1 2 1 0 71.2833 C 3
2 1 3 1 1 0 0 7.9250 S 2
3 1 1 1 2 1 0 53.1000 S 3
4 0 3 0 2 0 0 8.0500 S 1
In [58]:
Out[58]:
Age Embarked Fare Parch Pclass Sex SibSp Survived Title
0 1 S 7.2500 0 3 0 1 0.0 1
1 2 C 71.2833 0 1 1 1 1.0 3
2 1 S 7.9250 0 3 1 0 1.0 2
3 2 S 53.1000 0 1 1 1 1.0 3
4 2 S 8.0500 0 3 0 0 0.0 1
Create new feature combining existing features
We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop
Parch and SibSp from our datasets.
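The cells behind the FamilySize tables below are not preserved. A minimal sketch, counting the passenger along with SibSp and Parch so that FamilySize ranges from 1 to 11 as shown (an assumption):

# family size = siblings/spouses + parents/children + the passenger themselves
for dataset in combine + [newdf]:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)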
In [31]:
Out[31]:
FamilySize Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
In [60]:
Out[60]:
FamilySize Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
In [32]:
Out[32]:
IsAlone Survived
0 0 0.505650
1 1 0.303538
In [62]:
Out[62]:
IsAlone Survived
0 0 0.505650
1 1 0.303538
In [33]:
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]
train_df.head()
Out[33]:
Survived Pclass Sex Age Fare Embarked Title IsAlone
0 0 3 0 1 7.2500 S 1 0
1 1 1 1 2 71.2833 C 3 0
2 1 3 1 1 7.9250 S 2 1
3 1 1 1 2 53.1000 S 3 0
4 0 3 0 2 8.0500 S 1 1
In [64]:
newdf.head()
Out[64]:
Age Embarked Fare Pclass Sex Survived Title IsAlone
0 1 S 7.2500 3 0 0.0 1 0
1 2 C 71.2833 1 1 1.0 3 0
2 1 S 7.9250 3 1 1.0 2 1
3 2 S 53.1000 1 1 1.0 3 0
4 2 S 8.0500 3 0 0.0 1 1
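The next cells preview an artificial Age*Class feature that combines the ordinal Age with Pclass. The creating cells themselves are not preserved in this export; a minimal sketch consistent with the values shown below (an assumption):

# product of the ordinal Age band and the passenger class
for dataset in combine + [newdf]:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass
train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)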
In [34]:
Out[34]:
   Age*Class  Age  Pclass
0          3    1       3
1          2    2       1
2          3    1       3
3          2    2       1
4          6    2       3
5          3    1       3
6          3    3       1
7          0    0       3
8          3    1       3
9          0    0       2
In [66]:
Out[66]:
   Age*Class  Age  Pclass
0          3    1       3
1          2    2       1
2          3    1       3
3          2    2       1
4          6    2       3
5          3    1       3
6          3    3       1
7          0    0       3
8          3    1       3
9          0    0       2
In [35]:
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
Out[35]:
'S'
In [68]:
freq_port = newdf.Embarked.dropna().mode()[0]
freq_port
Out[68]:
'S'
In [36]:
for dataset in combine:
dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
Out[36]:
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
In [70]:
Out[70]:
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
In [37]:
train_df.head()
Out[37]:
Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 0 3 0 1 7.2500 0 1 0 3
1 1 1 1 2 71.2833 1 3 0 2
2 1 3 1 1 7.9250 0 2 1 3
3 1 1 1 2 53.1000 0 3 0 2
4 0 3 0 2 8.0500 0 1 1 6
In [72]:
newdf.head()
Out[72]:
Age Embarked Fare Pclass Sex Survived Title IsAlone Age*Class
0 1 0 7.2500 3 0 0.0 1 0 3
1 2 1 71.2833 1 1 1.0 3 0 2
2 1 0 7.9250 3 1 1.0 2 1 3
3 2 0 53.1000 1 1 1.0 3 0 2
4 2 0 8.0500 3 0 0.0 1 1 6
Note that we are not creating an intermediate new feature or doing any further analysis for correlation to guess
the missing value, as we are replacing only a single value. The completion goal achieves the desired requirement for
the model algorithm to operate on non-null values.
We may also want to round off the fare to two decimals as it represents currency.
In [38]:
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
Out[38]:
PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 892 3 0 2 7.8292 2 1 1 6
1 893 3 1 2 7.0000 0 3 0 6
2 894 2 0 3 9.6875 2 1 1 6
3 895 3 0 1 8.6625 0 1 1 3
4 896 3 1 1 12.2875 0 3 0 3
In [74]:
newdf['Fare'].fillna(newdf['Fare'].dropna().median(), inplace=True)
newdf.head()
Out[74]:
Age Embarked Fare Pclass Sex Survived Title IsAlone Age*Class
0 1 0 7.2500 3 0 0.0 1 0 3
1 2 1 71.2833 1 1 1.0 3 0 2
2 1 0 7.9250 3 1 1.0 2 1 3
3 2 0 53.1000 1 1 1.0 3 0 2
4 2 0 8.0500 3 0 0.0 1 1 6
In [39]:
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
Out[39]:
FareBand Survived
In [76]:
newdf['FareBand'] = pd.qcut(newdf['Fare'], 4)
newdf[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
Out[76]:
FareBand Survived
In [40]:
for dataset in combine:
dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
dataset['Fare'] = dataset['Fare'].astype(int)
train_df.head(10)
Out[40]:
   Survived  Pclass  Sex  Age  Fare  Embarked  Title  IsAlone  Age*Class
0         0       3    0    1     0         0      1        0          3
1         1       1    1    2     3         1      3        0          2
2         1       3    1    1     1         0      2        1          3
3         1       1    1    2     3         0      3        0          2
4         0       3    0    2     1         0      1        1          6
5         0       3    0    1     1         2      1        1          3
6         0       1    0    3     3         0      1        1          3
7         0       3    0    0     2         0      4        0          0
8         1       3    1    1     1         0      3        0          3
9         1       2    1    0     2         1      3        0          0
In [78]:
newdf.head(10)
Out[78]:
Age Embarked Fare Pclass Sex Survived Title IsAlone Age*Class
0 1 0 0 3 0 0.0 1 0 3
1 2 1 3 1 1 1.0 3 0 2
2 1 0 1 3 1 1.0 2 1 3
3 2 0 3 1 1 1.0 3 0 2
4 2 0 1 3 0 0.0 1 1 6
5 1 2 1 3 0 0.0 1 1 3
6 3 0 3 1 0 0.0 1 1 3
7 0 0 2 3 0 0.0 4 0 0
8 1 0 1 3 1 1.0 3 0 3
9 0 1 2 2 1 1.0 3 0 0
In [41]: test_df.head(10)
Out[41]:
PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 892 3 0 2 0 2 1 1 6
1 893 3 1 2 0 0 3 0 6
2 894 2 0 3 1 2 1 1 6
3 895 3 0 1 1 0 1 1 3
4 896 3 1 1 1 0 3 0 3
5 897 3 0 0 1 0 1 1 0
6 898 3 1 1 0 2 2 1 3
7 899 2 0 1 2 0 1 0 2
8 900 3 1 1 0 1 3 1 3
9 901 3 0 1 2 0 1 0 3
In [80]: newdf.head(10)
Out[80]:
Age Embarked Fare Pclass Sex Survived Title IsAlone Age*Class
0 1 0 0 3 0 0.0 1 0 3
1 2 1 3 1 1 1.0 3 0 2
2 1 0 1 3 1 1.0 2 1 3
3 2 0 3 1 1 1.0 3 0 2
4 2 0 1 3 0 0.0 1 1 6
5 1 2 1 3 0 0.0 1 1 3
6 3 0 3 1 0 0.0 1 1 3
7 0 0 2 3 0 0.0 4 0 0
8 1 0 1 3 1 1.0 3 0 3
9 0 1 2 2 1 1.0 3 0 0
Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
Perceptron
Artificial neural network
RVM or Relevance Vector Machine
In [42]:
Out[42]:
In [82]:
X_newdf.shape, Y_newdf.shape
Out[82]:
In [43]:
Features = X_train
Class = Y_train
(712, 8) (712,)
(179, 8) (179,)
In [151]:
Features = X_newdf
Class = Y_newdf
(1047, 8) (1047,)
(262, 8) (262,)
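The cells defining these training arrays are only partly preserved. A minimal sketch that matches the shapes shown, assuming an 80/20 hold-out split of the labelled samples; the variable names follow the later cells, but the split parameters are assumptions:

from sklearn.model_selection import train_test_split

X_train = train_df.drop('Survived', axis=1)           # 891 labelled samples, 8 features
Y_train = train_df['Survived']
X_test = test_df.drop('PassengerId', axis=1).copy()   # 418 competition samples without labels
X_newdf = newdf.drop('Survived', axis=1)
Y_newdf = newdf['Survived']                           # still NaN for the 418 unlabelled rows

# hold out 20% of the labelled data for validation
Feature_Train, Feature_Test, Class_Train, Class_Test = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=0)
print(Feature_Train.shape, Class_Train.shape)
print(Feature_Test.shape, Class_Test.shape)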
Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship
between the categorical dependent variable (feature) and one or more independent variables (features) by
estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference
Wikipedia (https://en.wikipedia.org/wiki/Logistic_regression).
Note the confidence score generated by the model based on our training dataset.
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[44]:
80.359999999999999
In [45]: pd.crosstab(logreg.predict(X_train),Y_train)
Out[45]:
Survived 0 1
row_0
0 479 105
1 70 237
logreg = LogisticRegression()
logreg.fit(Feature_Train, Class_Train)
Y_pred = logreg.predict(Feature_Test)
acc_log = round(logreg.score(Feature_Test, Class_Test) * 100, 2)
acc_log
Out[46]:
81.010000000000005
In [ ]:
We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing
goals. This can be done by calculating the coefficient of the features in the decision function.
Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative
coefficients decrease the log-odds of the response (and thus decrease the probability).
Sex is the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the
probability of Survived=1 increases the most.
Inversely as Pclass increases, probability of Survived=1 decreases the most.
Age*Class is a good artificial feature to model as it has the second highest negative correlation with
Survived.
So is Title, which has the second highest positive correlation.
In [47]:
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
Out[47]:
Feature Correlation
1 Sex 2.128733
5 Title 0.394961
4 Embarked 0.310878
2 Age 0.261064
6 IsAlone 0.242516
3 Fare -0.000617
7 Age*Class -0.277807
0 Pclass -0.733955
Next we model using Support Vector Machines which are supervised learning models with associated learning
algorithms that analyze data used for classification and regression analysis. Given a set of training samples,
each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that
assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier.
Reference Wikipedia (https://en.wikipedia.org/wiki/Support_vector_machine).
Note that the model generates a confidence score which is higher than the Logistic Regression model.
In [48]: # Support Vector Machines without Split
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Out[48]:
83.840000000000003
svc = SVC()
svc.fit(Feature_Train, Class_Train)
Y_pred = svc.predict(Feature_Test)
acc_svc = round(svc.score(Feature_Test, Class_Test) * 100, 2)
acc_svc
Out[49]:
82.680000000000007
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used
for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample
being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically
small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference
Wikipedia (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
The KNN confidence score is better than both Logistic Regression and SVM.
In [50]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
Out[50]:
84.739999999999995
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers
are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning
problem. Reference Wikipedia (https://en.wikipedia.org/wiki/Naive_Bayes_classifier).
The model generated confidence score is the lowest among the models evaluated so far.
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
Out[51]:
72.280000000000001
The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether
an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear
classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining
a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in
the training set one at a time. Reference Wikipedia (https://en.wikipedia.org/wiki/Perceptron).
In [52]: # Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
C:\Users\BigDataLab\Anaconda2\envs\mudi\lib\site-packages\sklearn\linear_mod
el\stochastic_gradient.py:84: FutureWarning: max_iter and tol parameters hav
e been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.1
9. If both are left unset, they default to max_iter=5 and tol=None. If tol i
s not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter
will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
Out[52]:
78.0
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
Out[53]:
79.120000000000005
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
C:\Users\BigDataLab\Anaconda2\envs\mudi\lib\site-packages\sklearn\linear_mod
el\stochastic_gradient.py:84: FutureWarning: max_iter and tol parameters hav
e been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifi
er'> in 0.19. If both are left unset, they default to max_iter=5 and tol=Non
e. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, defaul
t max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
Out[54]:
74.069999999999993
This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions
about the target value (tree leaves). Tree models where the target variable can take a finite set of values are
called classification trees; in these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees where the target variable can take
continuous values (typically real numbers) are called regression trees. Reference Wikipedia
(https://en.wikipedia.org/wiki/Decision_tree_learning).
The model confidence score is the highest among models evaluated so far.
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[55]:
86.760000000000005
In [56]:
# print Feature_Train.shape, Class_Train.shape
# print Feature_Test.shape, Class_Test.shape
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(Feature_Train, Class_Train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(Feature_Test, Class_Test) * 100, 2)
acc_decision_tree
Out[56]:
81.560000000000002
Survived 0 1
row_0
0 95 23
1 10 51
In [60]:
prediction = decision_tree.predict(Feature_Test)
y = pd.DataFrame(Class_Test)
y['PredictedByModel'] = prediction
combine = [Feature_Test, y]
newdf = pd.concat(combine, axis=1)
print (newdf.head())
newdf.to_csv('titanic_Output.csv')
PredictedByModel
783 0
347 0
623 0
246 1
309 1
The next model Random Forests is one of the most popular. Random forests or random decision forests are an
ensemble learning method for classification, regression and other tasks, that operate by constructing a
multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia
(https://en.wikipedia.org/wiki/Random_forest).
The model confidence score is the highest among models evaluated so far. We decide to use this model's
output (Y_pred) for creating our competition submission of results.
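The cell behind the 86.76 score and the crosstab below is not preserved; a sketch following the same pattern as the other models (an assumption):

# Random Forest on the full labelled training data
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest)
pd.crosstab(random_forest.predict(X_train), Y_train)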
86.76
Survived 0 1
row_0
0 500 69
1 49 273
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(Feature_Train, Class_Train)
Y_pred = random_forest.predict(X_test)
random_forest.score(Feature_Test, Class_Test)
acc_random_forest = round(random_forest.score(Feature_Test, Class_Test) * 100, 2)
print (acc_random_forest)
81.01
Survived 0 1
row_0
0 92 21
1 13 53
Model evaluation
We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision
Tree and Random Forest score the same, we choose to use Random Forest as they correct for decision trees'
habit of overfitting to their training set.
In [68]:
from sklearn.model_selection import ShuffleSplit, cross_val_score  # imports assumed; not preserved in the export
logreg = LogisticRegression(random_state=4)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=50)
scores = cross_val_score(logreg, Features, Class, cv=cv)  # scores computation assumed from the printed results below
print (scores)
print (scores.mean())
[ 0.77477477 0.78181818 0.74226804 0.64864865 0.74137931]
0.737777791365
C:\Users\BigDataLab\Anaconda2\envs\mudi\lib\site-packages\sklearn\cross_vali
dation.py:41: DeprecationWarning: This module was deprecated in version 0.18
in favor of the model_selection module into which all the refactored classes
and functions are moved. Also note that the interface of the new CV iterator
s are different from that of this module. This module will be removed in 0.2
0.
"This module will be removed in 0.20.", DeprecationWarning)
In [70]:
import numpy as np
print (Feature_Train.shape[0])
from sklearn.cross_validation import KFold, cross_val_score  # imports assumed; KFold(n, n_folds=...) matches the deprecated cross_validation API
rf = RandomForestClassifier(random_state=1, n_estimators=100)
kf = KFold(Feature_Train.shape[0], n_folds=5, random_state=1)
#cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=50)
scores = cross_val_score(rf, Feature_Train, Class_Train, cv=kf)  # scores computation assumed from the printed results below
print (scores)
print (scores.mean())
712
[ 0.69565217 0.73076923 0.80373832 0.69811321 0.75 ]
0.735654585997