Introduction To Data Analytics
ITE 5201
Lecture 10 - Logistic Regression
Instructor: Parisa Pouladzadeh
Email: [email protected]
Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women
[Figure: scatter plot of systolic blood pressure (mm Hg, 80–220) against age (20–90 years), with the fitted regression line y = α + β₁x₁; the slope of the line is β₁]
Regression coefficient β₁
◦ Measures the association between y and x
◦ The amount by which y changes on average when x changes by one unit
◦ Estimated by the least squares method
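As a quick illustration of a least-squares line fit (the numbers here are made up for the example and are not the Table 1 data):

import numpy as np

# hypothetical age/SBP pairs, for illustration only
age = np.array([22, 30, 41, 52, 63, 71, 80])
sbp = np.array([118, 121, 130, 140, 153, 160, 172])

# np.polyfit with degree 1 performs an ordinary least squares line fit;
# it returns the coefficients highest power first: [slope, intercept]
b1, a = np.polyfit(age, sbp, 1)
print(f"sbp ≈ {a:.1f} + {b1:.2f} * age")  # slope b1: average SBP change per year of age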
Linear regression? No.
[Figure: a binary (0/1) outcome plotted against age (0–100 years); a fitted straight line is a poor model for a binary outcome]
Logistic regression is used to predict binary outputs with two possible values labeled "0" or "1"
◦ The logistic model's output can be one of two classes: pass/fail, win/lose, healthy/sick
Hours Studying   Pass/Fail
1                0
1.5              0
2                0
3                1
3.25             0
4                1
5                1
6                1
[Figures: Pass/Fail plotted against hours of studying, fitted once with a linear model and once with a logistic regression model]
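Logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function σ(z) = 1 / (1 + e⁻ᶻ), which maps any real number to a probability in (0, 1). A minimal sketch on the table above, with made-up (not fitted) coefficients:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients for illustration; a real model would fit these
b0, b1 = -4.0, 1.5
hours = np.array([1, 1.5, 2, 3, 3.25, 4, 5, 6])
p_pass = sigmoid(b0 + b1 * hours)  # predicted probability of passing for each student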
• Now we need to convert the predicted probability into a class value, "0" or "1".
[Figure: logistic regression model output vs. hours of studying; a threshold at 0.5 assigns Class 1 above the line and Class 0 below it]
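A minimal sketch of the thresholding step, continuing the hypothetical probabilities from the sigmoid sketch above:

import numpy as np

# hypothetical predicted probabilities (rounded outputs of the sketch above)
p_pass = np.array([0.08, 0.15, 0.27, 0.62, 0.71, 0.88, 0.97, 0.99])
# probabilities at or above the 0.5 threshold map to class 1, the rest to class 0
classes = (p_pass >= 0.5).astype(int)
print(classes)  # [0 0 0 1 1 1 1 1]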
Evaluation metrics
For classification:
◦ Accuracy, Precision, Recall, F1-score, ROC curves, …
For regression:
◦ Normalized RMSE (NRMSE), Normalized Mean Absolute Error (NMAE)
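The classification metrics are all available in scikit-learn; a small self-contained sketch:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]  # actual classes
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(accuracy_score(y_true, y_pred))   # 0.8    -> fraction of correct predictions
print(precision_score(y_true, y_pred))  # 1.0    -> of the predicted 1s, how many were truly 1
print(recall_score(y_true, y_pred))     # ~0.667 -> of the true 1s, how many were found
print(f1_score(y_true, y_pred))         # 0.8    -> harmonic mean of precision and recall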
Lecture 10 - Part 2
Logistic Regression
In [14]: 1 import pandas as pd
2 import numpy as np
3 import matplotlib.pyplot as plt
4 import seaborn as sns
5 %matplotlib inline
The Data
Import the Dataset.
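The loading cell itself is not reproduced in this transcript; a minimal sketch, assuming the data sits in a CSV file (the filename is hypothetical):

data = pd.read_csv('clicked_ads.csv')  # hypothetical filename; the head of the frame is shown below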
In [18]: 1 data.head()
Out[18]:
   Names            emails                                              Country       Time Spent on Site  Salary
0  Martina Avila    [email protected]                             Bulgaria      25.649648           55330.06006
1  Harlan Barnes    [email protected]                           Belize        32.456107           79049.07674
2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...  Algeria       20.945978           41098.60826
3  Jade Cunningham  [email protected]                            Cook Islands  54.039325           37143.35536
4  Cedric Leach     [email protected]                         Brazil        34.249729           37355.11276
Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
In [19]: 1 data.isnull()
Out[19]:
Names emails Country Time Spent on Site Salary Clicked
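A minimal sketch of the heatmap mentioned above (the styling arguments are just one reasonable choice):

# light cells mark missing values; y tick labels are suppressed for readability
sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis')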
Data Cleaning
We want to handle missing data instead of simply dropping every incomplete row. One common approach is imputation, e.g. filling in a missing numeric value with the mean of its column. Here, however, the Names, emails, and Country columns are not useful as model inputs, so let's drop them and then drop any remaining rows containing NaN values.
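The column-dropping cell is not shown in the transcript; one way to arrive at the frame in Out[21] below:

# drop the non-numeric columns that won't be used as model inputs
data.drop(['Names', 'emails', 'Country'], axis=1, inplace=True)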
In [21]: 1 data.head()
Out[21]:
Time Spent on Site Salary Clicked
0 25.649648 55330.06006 0
1 32.456107 79049.07674 1
2 20.945978 41098.60826 0
3 54.039325 37143.35536 1
4 34.249729 37355.11276 0
In [22]: 1 data.dropna(inplace=True)
In [ ]: 1 data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 499 entries, 0 to 498
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time Spent on Site 499 non-null float64
1 Salary 499 non-null float64
2 Clicked 499 non-null int64
dtypes: float64(2), int64(1)
memory usage: 15.6 KB
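The train/test split cells are also missing from the transcript. The lengths of the outputs below (399 training rows and 100 test rows out of 499) suggest roughly an 80/20 split; a sketch assuming scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

X = data[['Time Spent on Site', 'Salary']]  # feature columns
y = data['Clicked']                         # binary target
# test_size=0.2 reproduces the ~80/20 split seen below; random_state is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)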
In [43]: 1 X_train
Out[43]:
Time Spent on Site Salary
55 27.432028 40814.47633
57 47.070590 80709.83902
63 31.518373 35277.25683
11 34.530898 30221.93714
In [44]: 1 y_train
Out[44]: 187 1
55 0
457 0
57 1
308 1
..
63 0
326 1
337 0
11 0
351 1
Name: Clicked, Length: 399, dtype: int64
In [45]: 1 X_test
Out[45]:
Time Spent on Site Salary
98 12.866031 27148.27919
72 26.410241 55388.71453
In [46]: 1 y_test
Out[46]: 246 0
491 1
330 1
453 0
155 0
..
98 0
183 0
72 0
367 1
405 0
Name: Clicked, Length: 100, dtype: int64
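The model-fitting cells are not shown; a minimal sketch consistent with Out[48] below, assuming scikit-learn:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)          # the fitted estimator's repr is echoed as Out[48]
predictions = model.predict(X_test)  # class predictions used in the evaluation below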
Out[48]: LogisticRegression()
Evaluation
In [37]: 1 print(classification_report(y_test,predictions))
Out[51]: <AxesSubplot:>
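Out[51] (&lt;AxesSubplot:&gt;) is the return value of a plotting call; a sketch that would produce it, assuming a seaborn heatmap of the confusion matrix:

from sklearn.metrics import classification_report, confusion_matrix

# annot=True writes the counts into the cells; fmt='d' formats them as integers
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, fmt='d')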
k-Nearest Neighbors (KNN)
Basic idea:
◦ If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: a test record among the training records; compute the distance from the test record to each training record and choose the k nearest ones to vote on its class]
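A from-scratch sketch of that idea (Euclidean distance, majority vote among the k nearest training records); all names here are illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # X_train: 2-D array of training records, y_train: 1-D array of their labels,
    # x_test: a single test record as a 1-D array
    # 1. compute the distance from the test record to every training record
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # 2. collect the labels of the k nearest training records
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # 3. the majority vote among those labels is the predicted class
    return Counter(nearest_labels).most_common(1)[0][0]

# e.g. knn_predict(np.array([[0, 0], [1, 1], [5, 5]]),
#                  np.array([0, 0, 1]), np.array([0.8, 0.9]), k=3)  -> 0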
Import Libraries
In [10]: 1 import pandas as pd
2 import seaborn as sns
3 import matplotlib.pyplot as plt
4 import numpy as np
5 %matplotlib inline
In [12]: 1 df.head()
Out[12]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS
Out[13]: StandardScaler()
Out[20]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM
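Only the outputs of the scaling cells survive in the transcript (Out[13] echoes a fitted StandardScaler, Out[20] shows the scaled features). A sketch of the cells that would produce them, assuming the label column is called TARGET CLASS as in the text below and sits last in the frame:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))                         # Out[13]: StandardScaler()
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))
# rebuild a DataFrame of scaled features, keeping the original column names
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])    # shown in Out[20]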
Using KNN
Remember that we are trying to come up with a model to predict whether someone falls into the TARGET CLASS or not. We'll start with k=1.
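The split and model-creation cells are missing from the transcript; a sketch, assuming the scaled features from above and a hypothetical 70/30 split:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    df_feat, df['TARGET CLASS'], test_size=0.3, random_state=0)  # split size is an assumption
knn = KNeighborsClassifier(n_neighbors=1)  # k=1, matching Out[25] below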
In [25]: 1 knn.fit(X_train,y_train)
Out[25]: KNeighborsClassifier(n_neighbors=1)
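The prediction cell is not shown either; the pred used below would come from:

pred = knn.predict(X_test)  # class predictions for the test set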
In [28]: 1 print(confusion_matrix(y_test,pred))
[[109 45]
[ 33 113]]
In [29]: 1 print(classification_report(y_test,pred))
Choosing a K Value
Let's go ahead and use the elbow method to pick a good K Value:
In [30]: 1 error_rate = []
2
3 # Will take some time
4 for i in range(1,40):
5
6     knn = KNeighborsClassifier(n_neighbors=i)
7     knn.fit(X_train,y_train)
8     pred_i = knn.predict(X_test)
9     error_rate.append(np.mean(pred_i != y_test))
In [31]: 1 plt.figure(figsize=(10,6))
2 plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
3          markerfacecolor='red', markersize=10)
4 plt.title('Error Rate vs. K Value')
5 plt.xlabel('K')
6 plt.ylabel('Error Rate')
WITH K=1
[[109 45]
 [ 33 113]]
WITH K=30
[[114 40]
 [ 20 126]]
Going from K=1 to K=30 cuts the misclassified test points from 78 (45 + 33) to 60 (40 + 20), so the larger K performs noticeably better on this test set.