
01-Logistic Regression with Python

February 28, 2023

1 Logistic Regression with Python


For this lecture we will be working with the Titanic Data Set from Kaggle. This is a very famous
data set and very often is a student’s first step in machine learning!
We’ll be trying to predict a classification: survival or deceased. Let’s begin our understanding of
implementing Logistic Regression in Python for classification.
We’ll use a “semi-cleaned” version of the Titanic data set; if you use the data set hosted directly on
Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

1.1 Import Libraries


Let’s import some libraries to get started!
[73]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

1.2 The Data


Let’s start by reading in the titanic_train.csv file into a pandas dataframe.
[74]: train = pd.read_csv('titanic_train.csv')

[75]: train.head()

[75]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1

2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

2 Exploratory Data Analysis


Let’s begin some exploratory data analysis! We’ll start by checking out missing data!

2.1 Missing Data


We can use seaborn to create a simple heatmap to see where we are missing data!
[76]: sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

[76]: <matplotlib.axes._subplots.AxesSubplot at 0x11a56f7b8>
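If you prefer exact counts to the heatmap, a quick complementary check (just a sketch, not part of the original lecture) is to tally the nulls per column:

train.isnull().sum()   # Age and Cabin show by far the most missing values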

Roughly 20 percent of the Age data is missing. That proportion is likely small enough to replace
reasonably with some form of imputation. Looking at the Cabin column, however, we are missing too
much of that data to do anything useful with it at a basic level. We’ll probably drop it later, or
change it to another feature like “Cabin Known: 1 or 0” (sketched below).
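If you did want to keep some signal from Cabin rather than dropping it, a minimal sketch of that “Cabin Known: 1 or 0” idea might look like this (the column name CabinKnown is just an illustrative choice, not something used later in this notebook):

train['CabinKnown'] = train['Cabin'].notnull().astype(int)   # 1 if a cabin was recorded, 0 otherwise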
Let’s continue on by visualizing some more of the data! Check out the video for full explanations
of these plots; the code here is just for reference.
[77]: sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')

[77]: <matplotlib.axes._subplots.AxesSubplot at 0x11afae630>

[78]: sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

[78]: <matplotlib.axes._subplots.AxesSubplot at 0x11b004a20>

[79]: sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

[79]: <matplotlib.axes._subplots.AxesSubplot at 0x11b130f28>

[80]: sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)

[80]: <matplotlib.axes._subplots.AxesSubplot at 0x11c16f710>

[81]: train['Age'].hist(bins=30,color='darkred',alpha=0.7)

[81]: <matplotlib.axes._subplots.AxesSubplot at 0x11b127ef0>

[82]: sns.countplot(x='SibSp',data=train)

[82]: <matplotlib.axes._subplots.AxesSubplot at 0x11c4139e8>

[83]: train['Fare'].hist(color='green',bins=40,figsize=(8,4))

[83]: <matplotlib.axes._subplots.AxesSubplot at 0x113893048>


2.2 Data Cleaning


We want to fill in the missing age data instead of just dropping those rows. One way to do this is
to fill in the mean age of all the passengers (imputation). However, we can be smarter about this
and check the average age by passenger class. For example:
[86]: plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

[86]: <matplotlib.axes._subplots.AxesSubplot at 0x11c901cc0>

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll
use these average age values to impute Age based on Pclass.
[87]: def impute_age(cols):
          Age = cols[0]
          Pclass = cols[1]

          if pd.isnull(Age):
              if Pclass == 1:
                  return 37
              elif Pclass == 2:
                  return 29
              else:
                  return 24
          else:
              return Age

Now apply that function!


[88]: train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
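As an alternative to hard-coding the three ages above, you could let pandas compute a per-class statistic for you. This sketch (run instead of the two cells above, not after them) fills each missing Age with the median Age of its Pclass:

train['Age'] = train['Age'].fillna(
    train.groupby('Pclass')['Age'].transform('median'))   # per-Pclass median, aligned row by row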

Now let’s check that heat map again!


[89]: sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

[89]: <matplotlib.axes._subplots.AxesSubplot at 0x11c4dae10>

Great! Let’s go ahead and drop the Cabin column and the rows where Embarked is NaN.
[90]: train.drop('Cabin',axis=1,inplace=True)

[91]: train.head()

[91]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Embarked


0 0 A/5 21171 7.2500 S
1 0 PC 17599 71.2833 C
2 0 STON/O2. 3101282 7.9250 S
3 0 113803 53.1000 S
4 0 373450 8.0500 S

[92]: train.dropna(inplace=True)

2.3 Converting Categorical Features


We’ll need to convert categorical features to dummy variables using pandas! Otherwise our machine
learning algorithm won’t be able to directly take in those features as inputs. We pass drop_first=True
so one redundant dummy column is dropped from each feature, which avoids perfectly collinear inputs.
[93]: train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId 889 non-null int64
Survived 889 non-null int64
Pclass 889 non-null int64
Name 889 non-null object
Sex 889 non-null object
Age 889 non-null float64
SibSp 889 non-null int64

Parch 889 non-null int64
Ticket 889 non-null object
Fare 889 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB

[94]: sex = pd.get_dummies(train['Sex'],drop_first=True)


embark = pd.get_dummies(train['Embarked'],drop_first=True)
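One version note: on newer pandas releases (2.0 and later), pd.get_dummies returns boolean columns by default instead of 0/1 numbers. If you are on such a version and want the numeric columns shown in the head() output below, you can pass dtype=int:

sex = pd.get_dummies(train['Sex'], drop_first=True, dtype=int)
embark = pd.get_dummies(train['Embarked'], drop_first=True, dtype=int)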

[95]: train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

[96]: train = pd.concat([train,sex,embark],axis=1)

[97]: train.head()

[97]: PassengerId Survived Pclass Age SibSp Parch Fare male Q S


0 1 0 3 22.0 1 0 7.2500 1.0 0.0 1.0
1 2 1 1 38.0 1 0 71.2833 0.0 0.0 0.0
2 3 1 3 26.0 0 0 7.9250 0.0 0.0 1.0
3 4 1 1 35.0 1 0 53.1000 0.0 0.0 1.0
4 5 0 3 35.0 0 0 8.0500 1.0 0.0 1.0

Great! Our data is ready for our model!

3 Building a Logistic Regression model


Let’s start by splitting our data into a training set and test set (there is another test.csv file that
you can play around with in case you want to use all this data for training).

3.1 Train Test Split


[98]: from sklearn.model_selection import train_test_split

[100]: X_train, X_test, y_train, y_test = train_test_split(
           train.drop('Survived', axis=1),
           train['Survived'],
           test_size=0.30,
           random_state=101)

3.2 Training and Predicting


[101]: from sklearn.linear_model import LogisticRegression

[102]: logmodel = LogisticRegression()


logmodel.fit(X_train,y_train)

[102]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
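The parameter printout above comes from an older scikit-learn. On current versions the default solver is 'lbfgs', and you may see a convergence warning on this unscaled data; if so, a simple workaround (a sketch, not required for the lecture results) is to allow more iterations:

logmodel = LogisticRegression(max_iter=1000)   # give lbfgs more iterations to converge on unscaled features
logmodel.fit(X_train, y_train)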

[103]: predictions = logmodel.predict(X_test)

Let’s move on to evaluate our model!

3.3 Evaluation
We can check precision, recall, and f1-score using the classification report!
[104]: from sklearn.metrics import classification_report

[105]: print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.81      0.93      0.86       163
           1       0.85      0.65      0.74       104

 avg / total       0.82      0.82      0.81       267
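A confusion matrix is another quick way to see where those errors fall; a minimal sketch using the same predictions:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))   # rows are actual classes, columns are predicted classes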

Not so bad! You might want to explore more feature engineering and the other titanic_test.csv file.
Some suggestions for feature engineering:
• Try grabbing the Title (Dr., Mr., Mrs., etc.) from the Name as a feature (a rough sketch follows this list)
• Maybe the Cabin letter could be a feature
• Is there any info you can get from the Ticket?
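As a starting point for the first suggestion, here is a rough sketch of pulling the title out of the Name column. Note that Name was dropped from train earlier in this notebook, so you would run this on a fresh read of the csv (the regular expression and the variable name titles are only illustrative):

titles = pd.read_csv('titanic_train.csv')['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.value_counts().head())   # e.g. Mr, Miss, Mrs, Master, Dr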
