Lecture 10-Logistic Regression - Part - 2 - Jupyter Notebook
Lecture 10-Part2
Logistic Regression
In [14]: import pandas as pd
         import numpy as np
         import matplotlib.pyplot as plt
         import seaborn as sns
         %matplotlib inline
The Data
Import the Dataset.
Out[17]:
   Names            emails                                             Country       Time Spent on Site  Salary
0  Martina Avila    [email protected]                                  Bulgaria      25.649648           55330.06006
1  Harlan Barnes    [email protected]                                  Belize        32.456107           79049.07674
2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...  Algeria       20.945978           41098.60826
3  Jade Cunningham  [email protected]                                  Cook Islands  54.039325           37143.35536
4  Cedric Leach     [email protected]                                  Brazil        34.249729           37355.11276
In [18]: data.head()
Out[18]:
   Names            emails                                             Country       Time Spent on Site  Salary
0  Martina Avila    [email protected]                                  Bulgaria      25.649648           55330.06006
1  Harlan Barnes    [email protected]                                  Belize        32.456107           79049.07674
2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...  Algeria       20.945978           41098.60826
3  Jade Cunningham  [email protected]                                  Cook Islands  54.039325           37143.35536
4  Cedric Leach     [email protected]                                  Brazil        34.249729           37355.11276
Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
In [19]: data.isnull()
Out[19]:
Names emails Country Time Spent on Site Salary Clicked
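The cell above prints the boolean mask itself rather than plotting it. A minimal sketch of the heatmap the text describes could look like the following. The small DataFrame is a hypothetical stand-in for the lecture's data, with a few gaps added on purpose, and the cmap choice and title are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the lecture's DataFrame, with gaps added on purpose.
data = pd.DataFrame({
    "Time Spent on Site": [25.649648, np.nan, 20.945978, 54.039325],
    "Salary": [55330.06006, 79049.07674, np.nan, 37143.35536],
    "Clicked": [0, 1, 0, 1],
})

# Numeric summary of the same information: missing values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Each cell is coloured by whether it is missing, so gaps stand out as stripes.
ax = sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap="viridis")
ax.set_title("Missing values by column")
```

In a notebook with %matplotlib inline the figure renders under the cell; the per-column counts are a quick check when the heatmap is hard to read.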
Data Cleaning
The Names, emails, and Country columns are not used as numeric features, so the cleaned frame keeps only
Time Spent on Site, Salary, and Clicked. Rather than imputing missing values (for example, filling gaps
with a column mean), we simply drop any rows that still contain NaN.
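The data.head() output that follows shows only three columns: Time Spent on Site, Salary, and Clicked. The cell that produced that shape is not shown, so the sketch below is an assumption about one way to get there: drop the unused columns by name, then drop incomplete rows. The sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mirror the column names shown in the notebook output.
data = pd.DataFrame({
    "Names": ["Person A", "Person B", "Person C"],
    "emails": ["a@example.com", "b@example.com", "c@example.com"],
    "Country": ["Bulgaria", "Belize", "Algeria"],
    "Time Spent on Site": [25.649648, np.nan, 20.945978],
    "Salary": [55330.06006, 79049.07674, 41098.60826],
    "Clicked": [0, 1, 0],
})

# Drop the columns that are not used as model inputs (assumed step).
data = data.drop(columns=["Names", "emails", "Country"])

# Drop any row that still has a missing value, as the notebook does with dropna.
data = data.dropna()

print(data.head())
```

Reassigning the result, instead of using inplace=True, keeps each step explicit; either form gives the same frame here.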
In [21]: data.head()
Out[21]:
Time Spent on Site Salary Clicked
0 25.649648 55330.06006 0
1 32.456107 79049.07674 1
2 20.945978 41098.60826 0
3 54.039325 37143.35536 1
4 34.249729 37355.11276 0
In [22]: data.dropna(inplace=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 499 entries, 0 to 498
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time Spent on Site 499 non-null float64
1 Salary 499 non-null float64
2 Clicked 499 non-null int64
dtypes: float64(2), int64(1)
memory usage: 15.6 KB
In [43]: X_train
Out[43]:
Time Spent on Site Salary
55 27.432028 40814.47633
57 47.070590 80709.83902
63 31.518373 35277.25683
11 34.530898 30221.93714
In [44]: y_train
Out[44]: 187 1
55 0
457 0
57 1
308 1
..
63 0
326 1
337 0
11 0
351 1
Name: Clicked, Length: 399, dtype: int64
In [45]: X_test
Out[45]:
Time Spent on Site Salary
98 12.866031 27148.27919
72 26.410241 55388.71453
In [46]: y_test
Out[46]: 246 0
491 1
330 1
453 0
155 0
..
98 0
183 0
72 0
367 1
405 0
Name: Clicked, Length: 100, dtype: int64
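The outputs above show 399 training rows and 100 test rows out of 499, which is consistent with holding out about 20% of the data. The split cell itself is not shown, so the following is only a sketch using scikit-learn's train_test_split. The values test_size=0.2 and random_state=42 are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), sized like the cleaned frame: 499 rows.
rng = np.random.default_rng(0)
n = 499
data = pd.DataFrame({
    "Time Spent on Site": rng.uniform(5, 60, n),
    "Salary": rng.uniform(20000, 100000, n),
})
# Hypothetical labelling rule, used only to have a binary target column.
data["Clicked"] = (data["Time Spent on Site"] > 32).astype(int)

# Features and target, using the column names from the notebook.
X = data[["Time Spent on Site", "Salary"]]
y = data["Clicked"]

# Hold out 20% of the rows for testing (assumed fraction).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Because the rows are shuffled before splitting, the index labels in the training and test sets appear out of order, as they do in the outputs above.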
Out[48]: LogisticRegression()
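Out[48] is what a notebook displays when a LogisticRegression with default settings is fitted, because fit returns the estimator itself. The fitting cell is not shown, so this is a sketch of the usual fit-and-predict step on synthetic data. The variable name logmodel, the labelling rule, and the split parameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), using the notebook's column names.
rng = np.random.default_rng(0)
n = 499
X = pd.DataFrame({
    "Time Spent on Site": rng.uniform(5, 60, n),
    "Salary": rng.uniform(20000, 100000, n),
})
# Hypothetical noisy labelling rule, only so the model has something to learn.
noisy_time = X["Time Spent on Site"] + rng.normal(0, 5, n)
y = (noisy_time > 32).astype(int).rename("Clicked")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a logistic regression with default settings on the training rows.
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

# Hard class labels (0 or 1) and the estimated probability of class 1.
predictions = logmodel.predict(X_test)
probabilities = logmodel.predict_proba(X_test)[:, 1]
print(predictions[:10])
```

The predictions array is what the evaluation step compares against y_test.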
Evaluation
In [37]: print(classification_report(y_test, predictions))
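The report's output is not shown in the notebook. To illustrate what classification_report summarises, here is a small sketch with hypothetical labels and predictions that can be checked by hand; confusion_matrix is added as an assumption, since it gives the raw counts behind the precision and recall figures.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions, small enough to verify by hand.
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
predictions = [0, 1, 1, 1, 0, 0, 0, 1]

# Per-class precision, recall, F1 and support, plus overall accuracy.
report = classification_report(y_test, predictions)
print(report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Here three of the four true 0s and three of the four true 1s are predicted correctly, so precision and recall are 0.75 for both classes.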
Out[51]: <AxesSubplot:>