
Introduction to Data Analytics
ITE 5201
Lecture 10: Logistic Regression
Instructor: Parisa Pouladzadeh
Email: [email protected]
Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women

[Scatter plot: SBP (mm Hg) versus Age (years) for the data in Table 1, with fitted line SBP = 81.54 + 1.222 × Age]
adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974


Simple linear regression
Relation between two continuous variables (SBP and age):

y = α + β₁x₁

Regression coefficient β₁ (slope):
◦ Measures the association between y and x
◦ Amount by which y changes, on average, when x changes by one unit
◦ Estimated by the least squares method

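As a quick sketch of how such a line could be fit with least squares in Python (the age/SBP values below are illustrative stand-ins, not the actual Table 1 data):

import numpy as np

# Hypothetical age/SBP pairs standing in for Table 1
age = np.array([22, 30, 41, 52, 63, 74, 85], dtype=float)
sbp = np.array([110, 118, 130, 145, 158, 170, 184], dtype=float)

# Least squares fit of SBP = alpha + beta1 * Age
beta1, alpha = np.polyfit(age, sbp, deg=1)
print(f"SBP = {alpha:.2f} + {beta1:.3f} * Age")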


Logistic regression
Table 2 Age and signs of coronary heart disease (CD)



How can we analyse these data?
Compare mean age of diseased and non-diseased
◦ Non-diseased: 38.6 years
◦ Diseased: 58.7 years (p < 0.0001)

Linear regression?



Dot-plot: Data from Table 2
[Dot plot: Signs of coronary disease (Yes/No) versus Age (years), 0–100]



Logistic regression
Linear regression is used to predict outputs on a continuous spectrum.
o Example: predicting revenue based on the outside air temperature.

Logistic regression is used to predict binary outputs with two possible values labeled "0" or "1".
o Logistic model output can be one of two classes: pass/fail, win/lose, healthy/sick
Hours Studying   Pass/Fail
1                0
1.5              0
2                0
3                1
3.25             0
4                1
5                1
6                1

[Scatter plot: Pass/Fail versus Hours of Studying, with a fitted linear model]

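As a minimal sketch (assuming scikit-learn, which the notebook later in these notes also uses), a logistic model can be fit to the hours-studied table above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied and pass/fail labels from the table above
hours = np.array([[1], [1.5], [2], [3], [3.25], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing after 2.5 hours of study
print(model.predict_proba([[2.5]])[0, 1])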


Logistic regression
• Linear regression is not suitable for classification problems.
• Linear regression is unbounded, so logistic regression is the better candidate: its output value ranges from 0 to 1.

[Plot: Pass/Fail versus Hours of Studying, comparing a linear model with a logistic regression model]



Logistic regression
• The logistic regression algorithm works by first evaluating a linear equation of the independent predictors to produce a value.
• We then need to convert this value into a probability, which ranges from 0 to 1.
[Plot: logistic regression model mapping Hours of Studying to the probability of Pass/Fail]

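A minimal sketch of that conversion via the sigmoid function (the linear coefficients below are illustrative, not fitted values):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear score: intercept + coefficient * hours studied
hours = 2.5
z = -4.0 + 1.5 * hours          # assumed coefficients, for illustration only
probability = sigmoid(z)        # value between 0 and 1
print(probability)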


Logistic regression
FROM PROBABILITY TO CLASS

• Now we need to convert the probability into a class value, "0" or "1", by applying a threshold (typically 0.5).
[Plot: logistic regression model with a 0.5 threshold separating Class 0 from Class 1]

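Continuing the sketch, the thresholding step itself is a single comparison (the probability value is illustrative):

probability = 0.62          # e.g. output of the sigmoid step above, illustrative value
threshold = 0.5
predicted_class = 1 if probability >= threshold else 0
print(predicted_class)      # -> 1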


Metrics
It is extremely important to use quantitative metrics for evaluating a machine learning model.

For classification:
Accuracy / Precision / Recall / F1-score, ROC curves, …

For regression:
Normalized RMSE, Normalized Mean Absolute Error (NMAE)
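A minimal sketch of computing a few of these metrics with scikit-learn (the label and prediction arrays are illustrative; RMSE and MAE are normalized here by the range of the true values, one common convention):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Classification: illustrative true labels and predictions
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression: normalized RMSE and NMAE
y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_r = np.array([2.8, 5.4, 2.9, 6.1])
rng = y_true_r.max() - y_true_r.min()
nrmse = np.sqrt(mean_squared_error(y_true_r, y_pred_r)) / rng
nmae = mean_absolute_error(y_true_r, y_pred_r) / rng
print(nrmse, nmae)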

MSE
MSE (Mean Squared Error) is the average squared error between actual and predicted values.
Squared error is a row-level error calculation in which the difference between the prediction and the actual value is squared.
The main draw of using MSE is that squaring the error means large errors are punished, or at least clearly highlighted.

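A minimal sketch of the row-level calculation on illustrative values:

import numpy as np

actual = np.array([200.0, 150.0, 310.0])
predicted = np.array([210.0, 140.0, 300.0])

squared_errors = (predicted - actual) ** 2   # row-level squared error
mse = squared_errors.mean()                  # Mean Squared Error
print(squared_errors, mse)                   # [100. 100. 100.] 100.0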


RMSE
Root Mean Squared Error (RMSE) is the square root of the mean squared error (MSE) between the predicted and actual values.

A benefit of using RMSE is that the metric it produces is in the units of the quantity being predicted. For example, using RMSE in a house price prediction model would give the error in terms of house price, which helps end users easily understand model performance.

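A minimal sketch on illustrative house prices, showing that RMSE comes out in the same units as the target:

import numpy as np

actual_price = np.array([250_000.0, 310_000.0, 420_000.0])
predicted_price = np.array([240_000.0, 330_000.0, 400_000.0])

mse = ((predicted_price - actual_price) ** 2).mean()
rmse = np.sqrt(mse)      # error expressed in the same units as price
print(rmse)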


R squared compared to RMSE
RMSE (or MSE) measures how well the model predicts the validation/test values, while R² is a measure of goodness of fit: how well the model captures the variance in the training set.
R² is not only a measure of goodness of fit; it also measures how much the model (the set of independent variables you selected) explains the behaviour of your dependent variable.

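A minimal sketch comparing the two on illustrative values (using scikit-learn's mean_squared_error and r2_score):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.6])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # in the units of y
r2 = r2_score(y_true, y_pred)                         # fraction of variance explained
print(rmse, r2)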



Lecture 10-Part2

Logistic Regression
In [14]: import pandas as pd
         import numpy as np
         import matplotlib.pyplot as plt
         import seaborn as sns
         %matplotlib inline

The Data
Import the Dataset.

In [17]: data = pd.read_csv('Downloads/Facebook.csv')
         data.head(5)

Out[17]:
            Names                                             emails        Country  Time Spent on Site       Salary
0    Martina Avila                                  [email protected]       Bulgaria           25.649648  55330.06006
1    Harlan Barnes                                  [email protected]         Belize           32.456107  79049.07674
2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...        Algeria           20.945978  41098.60826
3  Jade Cunningham                                  [email protected]   Cook Islands           54.039325  37143.35536
4     Cedric Leach                                  [email protected]         Brazil           34.249729  37355.11276




In [18]: data.head()

Out[18]:
            Names                                             emails        Country  Time Spent on Site       Salary
0    Martina Avila                                  [email protected]       Bulgaria           25.649648  55330.06006
1    Harlan Barnes                                  [email protected]         Belize           32.456107  79049.07674
2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...        Algeria           20.945978  41098.60826
3  Jade Cunningham                                  [email protected]   Cook Islands           54.039325  37143.35536
4     Cedric Leach                                  [email protected]         Brazil           34.249729  37355.11276

Missing Data
We can check for missing values with isnull() (and could visualize them with a seaborn heatmap).

In [19]: data.isnull()

Out[19]:
Names emails Country Time Spent on Site Salary Clicked

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

494 False False False False False False

495 False False False False False False

496 False False False False False False

497 False False False False False False

498 False False False False False False

499 rows × 6 columns




Explore the dataset

In [38]: click = data[data['Clicked']==1]
         no_click = data[data['Clicked']==0]

In [39]: print("Total num of data =", len(data))

         print("Number of customers who clicked on Ad =", len(click))
         print("Percentage Clicked =", 1.*len(click)/len(data)*100.0, "%")

         print("Did not Click =", len(no_click))
         print("Percentage who did not Click =", 1.*len(no_click)/len(data)*100.0, "%")

Total num of data = 499


Number of customers who clicked on Ad = 250
Percentage Clicked = 50.1002004008016 %
Did not Click = 249
Percentage who did not Click = 49.899799599198396 %

Data Cleaning
The Names, emails and Country columns are not useful as numerical features for this model, so we drop them, and then drop any rows that still contain missing values.

In [20]: data.drop(['Names', 'emails', 'Country'],axis = 1,inplace=True)

In [21]: data.head()

Out[21]:
Time Spent on Site Salary Clicked

0 25.649648 55330.06006 0

1 32.456107 79049.07674 1

2 20.945978 41098.60826 0

3 54.039325 37143.35536 1

4 34.249729 37355.11276 0




In [22]: data.dropna(inplace=True)

Converting Categorical Features

The categorical columns (Names, emails, Country) were already dropped above, so the remaining features are all numeric:

In [40]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499 entries, 0 to 498
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time Spent on Site 499 non-null float64
1 Salary 499 non-null float64
2 Clicked 499 non-null int64
dtypes: float64(2), int64(1)
memory usage: 15.6 KB

Logistic Regression model


Train Test Split
In [41]: from sklearn.model_selection import train_test_split

In [42]: X_train, X_test, y_train, y_test = train_test_split(data.drop('Clicked',axis=1),
                                                             data['Clicked'], test_size=0.2,
                                                             random_state=101)




In [43]: X_train

Out[43]:
Time Spent on Site Salary

187 46.995205 89227.57988

55 27.432028 40814.47633

457 25.366808 37192.01715

57 47.070590 80709.83902

308 43.880448 77371.64859

... ... ...

63 31.518373 35277.25683

326 42.903343 78401.67203

337 37.278453 50158.74558

11 34.530898 30221.93714

351 30.391102 59519.43092

399 rows × 2 columns

In [44]: y_train

Out[44]: 187 1
55 0
457 0
57 1
308 1
..
63 0
326 1
337 0
11 0
351 1
Name: Clicked, Length: 399, dtype: int64




In [45]: X_test

Out[45]:
Time Spent on Site Salary

246 19.919153 30201.25465

491 37.173216 63750.41558

330 43.750975 50777.99687

453 29.156654 39394.28363

155 30.730586 47012.72759

... ... ...

98 12.866031 27148.27919

183 23.653926 29808.11365

72 26.410241 55388.71453

367 44.661437 75426.28108

405 30.916826 19123.46645

100 rows × 2 columns

In [46]: y_test

Out[46]: 246 0
491 1
330 1
453 0
155 0
..
98 0
183 0
72 0
367 1
405 0
Name: Clicked, Length: 100, dtype: int64

Training and Predicting


In [47]: from sklearn.linear_model import LogisticRegression

In [48]: logmodel = LogisticRegression()
         logmodel.fit(X_train,y_train)

Out[48]: LogisticRegression()

In [49]: predictions = logmodel.predict(X_test)

Let's move on to evaluate our model!




Evaluation

We can check precision, recall and f1-score using the classification report.

In [36]: from sklearn.metrics import classification_report

In [37]: print(classification_report(y_test,predictions))

precision recall f1-score support

0 0.94 0.89 0.92 57


1 0.87 0.93 0.90 43

accuracy 0.91 100


macro avg 0.91 0.91 0.91 100
weighted avg 0.91 0.91 0.91 100

In [51]: from sklearn.metrics import classification_report, confusion_matrix

         cm = confusion_matrix(y_test, predictions)
         sns.heatmap(cm, annot=True, fmt="d")

Out[51]: <AxesSubplot:>



Introduction to Data Analytics
ITE 5201
Lecture 11: K Nearest Neighbor Classification
Instructor: Parisa Pouladzadeh
Email: [email protected]
www.udemy.com/course/python-for-data-science-and-machine-learning
Nearest Neighbor Classifiers
➢The KNN algorithm is one of the simplest classification algorithms.
➢It is non-parametric:
➢ it does not make any assumptions about the underlying data distribution.
➢It is a lazy learning algorithm:
➢ there is no explicit training phase, or it is very minimal,
➢ which also means that the training phase is pretty fast,
➢ and this lack of generalization means that KNN keeps all the training data.
➢Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
Nearest Neighbor Classifiers

Basic idea:
◦ If it walks like a duck, quacks like a duck, then it’s probably a duck

[Diagram: compute the distance between the test record and the training records, then choose the closest ones]



Nearest Neighbor Classifiers

KNN Algorithm is based on feature similarity


How closely out-of-sample features resemble our training set
determines how we classify a given data point



Basic Idea

➢The k-NN classification rule is to assign to a test sample the majority category label of its k nearest training samples
➢In practice, k is usually chosen to be odd, so as to avoid ties
➢The k = 1 rule is generally called the nearest-neighbor classification rule



Definition of Nearest Neighbor

[Diagrams: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a test record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.



Classification steps

➢Training phase: a model is constructed from the training instances.
➢ the classification algorithm finds relationships between predictors and targets
➢ the relationships are summarised in a model
➢Testing phase: test the model on a test sample whose class labels are known but not used for training the model
➢Usage phase: use the model for classification on new data whose class labels are unknown



K-Nearest Neighbor
Features
◦ All instances correspond to points in an n-dimensional Euclidean
space
◦ Classification is delayed till a new instance arrives
◦ Classification done by comparing feature vectors of the different
points
◦ Target function may be discrete or real-valued



K-Nearest Neighbor

➢An arbitrary instance x is represented by its feature vector (a1(x), a2(x), a3(x), ..., an(x))
➢ai(x) denotes the i-th feature of x
➢Euclidean distance between two instances:
  d(xi, xj) = sqrt( Σ r=1..n ( ar(xi) − ar(xj) )² )
➢Continuous-valued target function:
➢ the prediction is the mean value of the k nearest training examples

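A minimal sketch of this distance calculation with NumPy:

import numpy as np

def euclidean_distance(x_i, x_j):
    # d(x_i, x_j) = sqrt of the sum over features of (a_r(x_i) - a_r(x_j))^2
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # -> 5.0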


Euclidean Distance
•K-nearest neighbours uses the local neighborhood to obtain a prediction
•The K memorized examples most similar to the one being classified are retrieved
•A distance function is needed to compare the examples' similarity
•This means that if we change the distance function, we change how examples are classified



Normalization
If the ranges of the features differ, features with bigger values will dominate the distance calculation.
In general, feature values are normalized prior to the distance calculation.

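A minimal sketch of z-score normalization (the same idea StandardScaler applies in the notebook later; the feature values are illustrative):

import numpy as np

X = np.array([[25.0, 55000.0],
              [32.0, 79000.0],
              [21.0, 41000.0]])   # illustrative features with very different ranges

# z-score normalization: each column gets mean 0 and standard deviation 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)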


Voronoi diagram
•We frequently need to find the nearest hospital, surgery or
supermarket.
•A map divided into cells, each cell covering the region closest to a
particular centre, can assist us in our quest.



Voronoi diagram
Another practical problem is to choose a location for a new service,
such as a school, which is as far as possible from existing schools while
still serving the maximum number of families.
A Voronoi diagram can be used to find the largest empty circle amid a
collection of points, giving the ideal location for the new school. Of
course, numerous parameters other than distance must be considered,
but access time is often the critical factor.



Numerical Example
Steps:
1. Determine parameter K = number of nearest neighbors
2. Calculate the distance between the query-instance and all the training
samples
3. Sort the distance and determine nearest neighbors based on the K-th
minimum distance
4. Gather the category of the nearest neighbors
5. Use simple majority of the category of nearest neighbors as the
prediction value of the query instance

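A minimal sketch of these five steps in Python (the tiny training set and query point are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: distance between the query instance and all training samples
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: sort and take the indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: gather the neighbors' categories and take a simple majority vote
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative data: two features, two classes
X_train = np.array([[7.0, 7.0], [7.0, 4.0], [3.0, 4.0], [1.0, 4.0]])
y_train = np.array(["Bad", "Bad", "Good", "Good"])
print(knn_predict(X_train, y_train, np.array([3.0, 7.0]), k=3))  # -> "Good"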


Example



Voronoi diagram


Lecture 11-K Nearest Neighbour-Part 2

K Nearest Neighbors with Python

Import Libraries
In [10]: import pandas as pd
         import seaborn as sns
         import matplotlib.pyplot as plt
         import numpy as np
         %matplotlib inline

Get the Data


In [11]: df = pd.read_csv('Downloads/KNN_Project_Data')

In [12]: df.head()

Out[12]:
          XVPM         GWYH         TRAT        TLLZ         IGGA         HYKR         EDFS  ...
0  1636.670614   817.988525  2565.995189  358.347163   550.417491  1618.870897  2147.641254  ...
1  1013.402760   577.587332  2644.141273  280.428203  1161.873391  2084.107872   853.404981  ...
2  1300.035501   820.518697  2025.854469  525.562292   922.206261  2552.355407   818.676686  ...
3  1059.347542  1066.866418   612.000041  480.827789   419.467495   685.666983   852.867810  ...
4  1018.340526  1313.679056   950.622661  724.742174   843.065903  1370.554164   905.469453  ...

Standardize the Variables


In [4]: from sklearn.preprocessing import StandardScaler

In [5]: scaler = StandardScaler()

In [13]: scaler.fit(df.drop('TARGET CLASS',axis=1))

Out[13]: StandardScaler()




In [16]: scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))

In [20]: df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
         df_feat.head()

Out[20]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM

0 1.568522 -0.443435 1.619808 -0.958255 -1.128481 0.138336 0.980493 -0.932794 1.008313

1 -0.112376 -1.056574 1.741918 -1.504220 0.640009 1.081552 -1.182663 -0.461864 0.258321

2 0.660647 -0.436981 0.775793 0.213394 -0.053171 2.030872 -1.240707 1.149298 2.184784

3 0.011533 0.191324 -1.433473 -0.100053 -1.507223 -1.753632 -1.183561 -0.888557 0.162310

4 -0.099059 0.820815 -0.904346 1.609015 -0.282065 -0.365099 -1.095644 0.391419 -1.365603

Train Test Split


In [21]: from sklearn.model_selection import train_test_split

In [22]: X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['TARGET CLASS'],
                                                             test_size=0.30)

Using KNN
Remember that we are trying to come up with a model to predict whether an observation belongs to the TARGET CLASS or not. We'll start with k=1.

In [23]: from sklearn.neighbors import KNeighborsClassifier

In [24]: knn = KNeighborsClassifier(n_neighbors=1)

In [25]: knn.fit(X_train,y_train)

Out[25]: KNeighborsClassifier(n_neighbors=1)

In [26]: pred = knn.predict(X_test)

Predictions and Evaluations

Let's evaluate our KNN model!

In [27]: from sklearn.metrics import classification_report,confusion_matrix




In [28]: print(confusion_matrix(y_test,pred))

[[109 45]
[ 33 113]]

In [29]: print(classification_report(y_test,pred))

precision recall f1-score support

0 0.77 0.71 0.74 154


1 0.72 0.77 0.74 146

accuracy 0.74 300


macro avg 0.74 0.74 0.74 300
weighted avg 0.74 0.74 0.74 300

Choosing a K Value
Let's go ahead and use the elbow method to pick a good K Value:

In [30]: error_rate = []

         # Will take some time
         for i in range(1,40):
             knn = KNeighborsClassifier(n_neighbors=i)
             knn.fit(X_train,y_train)
             pred_i = knn.predict(X_test)
             error_rate.append(np.mean(pred_i != y_test))




In [31]: plt.figure(figsize=(10,6))
         plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
                  markerfacecolor='red', markersize=10)
         plt.title('Error Rate vs. K Value')
         plt.xlabel('K')
         plt.ylabel('Error Rate')

Out[31]: Text(0, 0.5, 'Error Rate')




In [32]: # FIRST A QUICK COMPARISON TO OUR ORIGINAL K=1
         knn = KNeighborsClassifier(n_neighbors=1)

         knn.fit(X_train,y_train)
         pred = knn.predict(X_test)

         print('WITH K=1')
         print('\n')
         print(confusion_matrix(y_test,pred))
         print('\n')
         print(classification_report(y_test,pred))

WITH K=1

[[109 45]
[ 33 113]]

precision recall f1-score support

0 0.77 0.71 0.74 154


1 0.72 0.77 0.74 146

accuracy 0.74 300


macro avg 0.74 0.74 0.74 300
weighted avg 0.74 0.74 0.74 300




In [35]: # NOW WITH K=23
         knn = KNeighborsClassifier(n_neighbors=23)

         knn.fit(X_train,y_train)
         pred = knn.predict(X_test)

         print('WITH K=23')
         print('\n')
         print(confusion_matrix(y_test,pred))
         print('\n')
         print(classification_report(y_test,pred))

WITH K=23

[[114 40]
[ 20 126]]

precision recall f1-score support

0 0.85 0.74 0.79 154


1 0.76 0.86 0.81 146

accuracy 0.80 300


macro avg 0.80 0.80 0.80 300
weighted avg 0.81 0.80 0.80 300

