01-Logistic Regression With Python
[75]: train.head()
                                               Name     Sex   Age  SibSp
0                           Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1
2                            Heikkinen, Miss. Laina  female  26.0      0
3      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                          Allen, Mr. William Henry    male  35.0      0
Roughly 20 percent of the Age data is missing. That proportion is likely small enough for reasonable replacement with some form of imputation. Looking at the Cabin column, though, we are simply missing too much of that data to do anything useful with it at a basic level. We’ll probably drop this column later, or change it to another feature like “Cabin Known: 1 or 0”; a quick sketch of that idea follows below.
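We don’t build that feature in this notebook, but as a rough illustration (the CabinKnown column name here is just for the sketch, run before Cabin is dropped):

    # Hypothetical "Cabin Known" feature: 1 if a cabin was recorded
    # for the passenger, 0 if the Cabin value is missing.
    train['CabinKnown'] = train['Cabin'].notnull().astype(int)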
Let’s continue on by visualizing some more of the data! Check out the video for full explanations of these plots; this code is just here to serve as a reference.
[77]: sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')
[78]: sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
[79]: sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
[80]: sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)
      # Note: distplot is deprecated in newer seaborn versions;
      # sns.histplot is the modern replacement.
[81]: train['Age'].hist(bins=30,color='darkred',alpha=0.7)
[82]: sns.countplot(x='SibSp',data=train)
[83]: train['Fare'].hist(color='green',bins=40,figsize=(8,4))
We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these average age values per passenger class to impute the missing Age values.
[87]: def impute_age(cols):
          Age = cols[0]
          Pclass = cols[1]

          # If Age is missing, fill it in with the average age
          # observed for that passenger class in the boxplot above.
          if pd.isnull(Age):
              if Pclass == 1:
                  return 37
              elif Pclass == 2:
                  return 29
              else:
                  return 24
          else:
              return Age
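The cell that applies this function is missing from this extract (the numbering jumps from [87] to [90]), but applying it row-wise over the Age and Pclass columns would look like this:

    # Apply impute_age to each row's (Age, Pclass) pair and
    # overwrite Age with the imputed values.
    train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)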
Great! Let’s go ahead and drop the Cabin column and drop the rows where Embarked is NaN.
[90]: train.drop('Cabin',axis=1,inplace=True)
[91]: train.head()
[92]: train.dropna(inplace=True)
[93]: train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
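The cells that convert the categorical columns appear to be missing from this extract. Since logistic regression needs numeric inputs, Sex and Embarked are typically converted to dummy variables before the raw columns are dropped in the next cell. A sketch of that step, assuming pandas’ get_dummies with drop_first=True:

    # Encode the categorical columns as 0/1 dummy variables.
    # drop_first=True drops one level from each so the dummies
    # aren't perfectly collinear (the "dummy variable trap").
    sex = pd.get_dummies(train['Sex'], drop_first=True)
    embark = pd.get_dummies(train['Embarked'], drop_first=True)

    # Attach the numeric dummy columns to the DataFrame.
    train = pd.concat([train, sex, embark], axis=1)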
[95]: train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
[97]: train.head()
[98]: from sklearn.model_selection import train_test_split

[99]: X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                          train['Survived'],
                                                          test_size=0.30,
                                                          random_state=101)
[100]: from sklearn.linear_model import LogisticRegression

[101]: logmodel = LogisticRegression()

[102]: logmodel.fit(X_train,y_train)

[102]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False)

[103]: predictions = logmodel.predict(X_test)
3.3 Evaluation
We can check precision, recall, and f1-score using the classification report!
[104]: from sklearn.metrics import classification_report
[105]: print(classification_report(y_test,predictions))
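(The report’s printed output didn’t survive the export.) If you also want the raw error counts behind those scores, a confusion matrix pairs naturally with the report; a minimal sketch using the same y_test and predictions:

    from sklearn.metrics import confusion_matrix

    # Rows are the actual classes, columns the predicted classes:
    # [[true negatives,  false positives],
    #  [false negatives, true positives]]
    print(confusion_matrix(y_test, predictions))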
Not so bad! You might want to explore more feature engineering and the other titanic_test.csv file. Some suggestions for feature engineering:
• Try grabbing the Title (Dr., Mr., Mrs., etc.) from the name as a feature (see the sketch after this list)
• Maybe the Cabin letter could be a feature
• Is there any info you can get from the ticket?
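As a starting point for the title idea, here’s a minimal sketch. It assumes the training file is named titanic_train.csv and re-loads the data, since the Name column was dropped from train above:

    import pandas as pd

    # Re-load the raw data; Name was dropped from train earlier.
    df = pd.read_csv('titanic_train.csv')

    # Names look like "Braund, Mr. Owen Harris", so the title is the
    # word ending in a period that follows the comma.
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    print(df['Title'].value_counts().head())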