Univariate Logistic Regression
Finding the Best Fit Sigmoid Curve - I
Likelihood
Now, let’s say that for the ten points in our example, the labels are as follows:
Point no. 1 2 3 4 5 6 7 8 9 10
Diabetes no no no yes no yes yes yes yes yes
In this case, the likelihood would be equal to:
(1−P1) (1−P2 ) (1−P3) (1−P5 ) (P4) (P6) (P7) (P8) (P9) (P10) ✓ Correct
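The product above can be sketched in a few lines of Python. This is only an illustration: the values in `probs` are made up (the source gives no probabilities), and `likelihood` is a hypothetical helper name.

```python
# Labels from the table above: 1 = diabetic ('yes'), 0 = not diabetic ('no')
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

def likelihood(probs, labels):
    """Multiply P_i for every 'yes' point and (1 - P_i) for every 'no' point."""
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p if y == 1 else (1 - p)
    return result

# Hypothetical predicted probabilities, for illustration only
probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.9, 0.95]
print(likelihood(probs, labels))
```

A better-fitting sigmoid curve gives high P to the 'yes' points and low P to the 'no' points, which makes this product larger.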
Odds and Log Odds
Log Odds
So, let’s say that the equation for the log odds is: $\ln\left(\frac{P}{1-P}\right) = -13.5 + 0.06x$
For x = 220, the log odds are equal to -13.5+(0.06*220) = -0.3. For x = 231.5, log odds are equal to:
ans: 0.39
Log Odds
So, let’s say that the equation for log odds is: $\ln\left(\frac{P}{1-P}\right) = -13.5 + 0.06x$
For x = 220, the log odds are equal to -0.3 and for x = 231.5, the log odds are equal to 0.39. For x = 243, the log
odds are equal to:
ans: 1.08
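The three answers can be checked with a short sketch. The intercept -13.5 and slope 0.06 are taken from the worked arithmetic above. The conversion to a probability uses the standard sigmoid, and the function names are hypothetical.

```python
import math

def log_odds(x, b0=-13.5, b1=0.06):
    """Log odds as a linear function of x (coefficients from the worked example)."""
    return b0 + b1 * x

def probability(x):
    """Convert log odds back to a probability with the sigmoid function."""
    return 1 / (1 + math.exp(-log_odds(x)))

for x in (220, 231.5, 243):
    print(x, round(log_odds(x), 2), round(probability(x), 2))
```

Each step of 11.5 in x adds 0.69 to the log odds, which is roughly ln 2, so the odds approximately double with each step.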
Multivariate Logistic Regression - Model Building
Data Cleaning and Preparation - I
Level counts
In the text above, you saw that for the variable ‘MultipleLines’ the value counts of the levels ‘Yes’, ‘No’, and ‘No
phone service’ are 3390, 2971, and 682 respectively. When you run the same command for the column
‘OnlineBackup’, what will the value count for its level ‘No internet service’ turn out to be?
1526 ✓ Correct
Levels of Dummy Variables
If you check the value counts of the levels ‘OnlineBackup’, ‘OnlineSecurity’, ‘DeviceProtection’, and all the others for
which one of the levels was dropped manually, you can see that the count of the level ‘No internet service’ is the
same for all, i.e. 1526. Can you explain briefly why this has happened?
ans: This happens because the level ‘No internet service’ just tells you whether a user has internet service or not.
Now because the number of users not having an internet service is the same, the count of this level in all of these
variables will be the same. You can also check the value counts of the variable ‘InternetService’ and you’ll see that
the output you’ll get is:
Fiber Optic 3096
DSL 2421
No 1526
Coincidence? No!
This information is already contained in the variable ‘InternetService’ and hence, the count will be the same in all
the variables with the level ‘No internet service’. This is actually also the reason we chose to drop this particular
level.
Data Cleaning and Preparation - II
Standardising Variables
In a dataset with mean 50 and standard deviation 12, what will be the value of a variable with an initial value of 20
after you standardise it?
1.9
-1.9
2.5
-2.5 ✓ Correct
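The answer follows from the z-score formula, (value - mean) / standard deviation. A minimal check, using a hypothetical helper name:

```python
def standardise(value, mean, std):
    """Return the z-score of value for the given mean and standard deviation."""
    return (value - mean) / std

print(standardise(20, 50, 12))  # -2.5
```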
Standardising the train and test sets
As Rahim mentioned in the lecture, you use 'fit_transform' on the train set but just 'transform' on the test set. Recall
that you learnt this in linear regression as well. Why do you think this is done?
Suggested Answer
The 'fit_transform' command first learns the mean and standard deviation of each variable and then scales the data to
have a mean of 0 and a standard deviation of 1, i.e. it scales all the variables using:
$x_{scaled} = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation learnt from the train set.
Once this is done, all the variables are transformed using this formula. When you move on to the test
set, you do not want the scaler to learn anything new. You want to reuse the centring and scaling that you learnt when
you used 'fit' on the train dataset. This is why you don't apply 'fit' to the test data, just 'transform'.
Building your First Model
Correlation Table
Which of the following commands can be used to view the correlation table for the dataframe telecom?
telecom.corr() ✓ Correct
Checking Correlations
Take a look at the heatmap provided above. Which of the variables have the highest correlation between them?
StreamingTV_Yes and StreamingMovies_Yes
StreamingTV_No and StreamingMovies_No
MultipleLines_No and MultipleLines_Yes ✓ Correct
Significant Variables
Which of the following variables are insignificant as of now based on the summary statistics above? (More than
one option may be correct.)
Note: Use p-value to determine the insignificant variables.
PhoneService ✓ Correct
MultipleLines_Yes
TechSupport_Yes ✓ Correct
Negatively Correlated Variables
Which of the following variables are negatively correlated with the target variable based on the summary statistics
given above? (More than one option may be correct.)
tenure ✓ Correct
TotalCharges
MonthlyCharges ✓ Correct
p-values
After learning the coefficients of each variable, the model also produces a ‘p-value’ of each coefficient. Fill in the
blanks so that the statement is correct:
“The null hypothesis is that the coefficient is __. If the p-value is small, you can say that the coefficient is significant
and hence the null hypothesis ____.”
zero, can be rejected ✓ Correct
Feature Elimination using RFE
Threshold Value
You saw that Rahim chose a cut-off of 0.5. What can be said about this threshold?
It was arbitrarily chosen by us, i.e. there’s nothing special about 0.5. We could have chosen something else as well. ✓ Correct
Significance based on RFE
Based on the RFE output shown above, which of the variables is least significant?
OnlineBackup_Yes
Partner
gender_Male ✓ Correct
Churn based on Threshold
Suppose the following table shows the predicted values for the probabilities for 'Churn'. Assuming you chose an
arbitrary cut-off of 0.5 wherein a probability of greater than 0.5 means the customer would churn and a
probability of less than or equal to 0.5 means the customer wouldn't churn, which of these customers do you think
will churn? (More than one option may be correct.)
Customer Probability(Churn)
A 0.45
B 0.67
C 0.98
D 0.49
E 0.03
B ✓ Correct
C ✓ Correct
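Applying the cut-off is a one-line filter. The sketch below uses the probabilities from the table and the rule stated in the question (strictly greater than 0.5 means churn).

```python
churn_prob = {"A": 0.45, "B": 0.67, "C": 0.98, "D": 0.49, "E": 0.03}
cutoff = 0.5

# A customer is predicted to churn only if the probability exceeds the cut-off
predicted_churn = [name for name, p in churn_prob.items() if p > cutoff]
print(predicted_churn)  # ['B', 'C']
```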
Confusion Matrix and Accuracy
Confusion Matrix and Accuracy
Given the confusion matrix below, can you tell how many 'Churns' were correctly identified, i.e. if the person has
actually churned, it is predicted as a churn?
Actual/Predicted Not Churn Churn
Not Churn 80 30
Churn 20 70
80
30
20
70 ✓ Correct
Calculating Accuracy
From the confusion matrix you saw in the last question, compute the accuracy of the model.
Actual/Predicted Not Churn Churn
Not Churn 80 30
Churn 20 70
70%
75% ✓ Correct
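Accuracy is the sum of the diagonal of the confusion matrix divided by the total count. A minimal sketch with the matrix from the question:

```python
# Rows are actual classes, columns are predicted classes
confusion = [[80, 30],   # actual Not Churn: [predicted Not Churn, predicted Churn]
             [20, 70]]   # actual Churn:     [predicted Not Churn, predicted Churn]

correct = confusion[0][0] + confusion[1][1]         # diagonal entries
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(accuracy)  # 0.75
```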
Confusion Matrix
Suppose you built a logistic regression model to predict whether a patient has lung cancer or not and you get the
following confusion matrix as the output.
Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of the patients were wrongly identified as a 'Yes'?
400
100 ✓ Correct
Confusion Matrix
Take a look at the table again.
Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of these patients were correctly labelled, i.e. if the patient had lung cancer it was actually predicted as a
'Yes' and if they didn't have lung cancer, it was actually predicted as a 'No'?
150
400
500
550 ✓ Correct
Accuracy Calculation
From the table you used for the last two questions, what will be the accuracy of the model?
Actual/Predicted No Yes
No 400 100
Yes 50 150
57.14%
64.29%
71.43%
78.57% ✓ Correct
Manual Feature Elimination
Multivariate Logistic Regression (Variable Selection)
Based on the above information, what can you say about the log odds of these two customers?
PS: Recall the log odds for univariate logistic regression was given as:
log odds (customer A) < log odds (customer B)
log odds (customer A) = log odds (customer B)
log odds (customer A) > log odds (customer B) ✓ Correct
Multivariate Logistic Regression (Variable Selection)
Now, what can you say about the odds of churn for these two customers?
For customer A, the odds of churning are lower than for customer B
For customer A, the odds of churning are equal to those for customer B
For customer A, the odds of churning are higher than for customer B ✓ Correct
Multivariate Logistic Regression - Log Odds
Now, suppose two customers, customer C and customer D, are such that their behaviour is exactly the same,
except for the fact that customer C has OnlineSecurity, while customer D does not. What can you say about the
odds of churn for these two customers?
For customer C, the odds of churning are lower than for customer D ✓ Correct
Graded Questions
Logistic Regression in Python
Which of these methods is used for fitting a logistic regression model using statsmodels?
OLS()
GLM() ✓ Correct
Confusion Matrix
Given the following confusion matrix, calculate the accuracy of the model.
Actual/Predicted Nos Yeses
Nos 1000 50
Yeses 250 1200
96%
88% ✓ Correct
Diabetic based on Threshold
Suppose you are building a logistic regression model to determine whether a person has diabetes or not. Following
are the values of predicted probabilities of 10 patients.
6 ✓ Correct
Log Odds
Suppose you are working for a media services company like Netflix. They're launching a new show called 'Sacred
Games' and you are building a logistic regression model which will predict whether a person will like it or not based
on whether consumers have liked/disliked some previous shows. You have the data of five of the previous shows
and you're just using the dummy variables for these five shows to build the model. If the variable is 1, it means that
the consumer liked the show and if the variable is zero, it means that the consumer didn't like the show. The
following table shows the values of the coefficients for these five shows that you got after building the logistic
regression model.
Variable Name Coefficient Value
TrueDetective_Liked 0.47
ModernFamily_Liked -0.45
Mindhunter_Liked 0.39
Friends_Liked -0.23
Narcos_Liked 0.55
Now, you have the data of three consumers Reetesh, Kshitij, and Shruti for these 5 shows indicating whether or
not they liked these shows. This is shown in the table below:
Based on this data, which one of these three consumers is most likely to like the new show 'Sacred Games'?
Reetesh ✓ Correct
Multivariate Logistic Regression - Model Evaluation
Metrics Beyond Accuracy: Sensitivity & Specificity
False Positives
What is the number of False Positives for the model given below?
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
400
100 ✓ Correct
Sensitivity
Sensitivity is defined as the ratio of the number of correctly predicted positives to the total number of actual
positives, i.e. $\text{Sensitivity} = \frac{TP}{TP + FN}$
What is the sensitivity of the following model?
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
60%
75% ✓ Correct
Evaluation Metrics
Among the three metrics that you've learnt about, which one is the highest for the model below?
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
Accuracy
Sensitivity
Specificity ✓ Correct
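The comparison can be reproduced with a small helper. The function name `evaluate` is hypothetical, and the counts come from the confusion matrix in the question, with 'Churn' as the positive class.

```python
def evaluate(tn, fp, fn, tp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of actual churners caught
        "specificity": tn / (tn + fp),   # share of actual non-churners caught
    }

metrics = evaluate(tn=400, fp=100, fn=50, tp=150)
best = max(metrics, key=metrics.get)
print(metrics, best)
```

Here accuracy is about 78.6%, sensitivity is 75% and specificity is 80%, so specificity is the largest.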
Sensitivity and Specificity in Python
False Negatives
What is the number of False Negatives for the model given below?
Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50
80
40
30 ✓ Correct
Specificity
Specificity is defined as the ratio of the number of correctly predicted negatives to the total number of actual
negatives, i.e. $\text{Specificity} = \frac{TN}{TN + FP}$
What is the approximate specificity of the following model?
Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50
60%
67% ✓ Correct
Evaluation Metrics
Which among accuracy, sensitivity, and specificity is the highest for the model below?
Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50
Accuracy
Sensitivity
Specificity ✓ Correct
Other Metrics
In the code, you saw Rahim evaluate some other metrics as well. These were:
As you can see, the 'False Positive Rate' is basically (1 - Specificity). Check the formula and the values in the code
to verify.
The positive predictive value is the number of positives correctly predicted divided by the total number of positives
predicted. This is also known as 'Precision', which you'll learn more about soon.
Similarly, the negative predictive value is the number of negatives correctly predicted divided by the total number of
negatives predicted. There's no particular term for this as such.
Calculate the given three metrics for the model below and identify which one is the largest among them.
Negative Predictive Value ✓ Correct
Understanding ROC Curve
TPR and FPR
Given the following confusion matrix, calculate the value of True Positive Rate (TPR) and False Positive Rate
(FPR).
Actual/Predicted Not Churn Churn
Not Churn 300 200
Churn 100 400
TPR = 40%, FPR = 80%
TPR = 40%, FPR = 60%
TPR = 80%, FPR = 40% ✓ Correct
True Positive Rate
You have the following table showcasing the actual 'Churn' labels and the predicted probabilities for 5 customers.
Customer Churn Predicted Churn Probability
Thulasi 1 0.52
Aditi 0 0.56
Jaideep 1 0.78
Ashok 0 0.45
Amulya 0 0.22
Calculate the True Positive Rate and False Positive Rate for the cutoffs of 0.4 and 0.5. Which of these cutoffs will
give you a better model?
Note: The good model is the one in which TPR is high and FPR is low.
Cutoff of 0.4
Cutoff of 0.5 ✓ Correct
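The two cutoffs can be compared directly from the table. This sketch assumes a customer is predicted as 'Churn' when the probability is strictly greater than the cutoff, and `tpr_fpr` is a hypothetical helper name.

```python
# (customer, actual churn label, predicted churn probability)
customers = [
    ("Thulasi", 1, 0.52),
    ("Aditi", 0, 0.56),
    ("Jaideep", 1, 0.78),
    ("Ashok", 0, 0.45),
    ("Amulya", 0, 0.22),
]

def tpr_fpr(rows, cutoff):
    """Return (TPR, FPR) when probabilities above the cutoff are labelled churn."""
    tp = sum(1 for row in rows if row[1] == 1 and row[2] > cutoff)
    fp = sum(1 for row in rows if row[1] == 0 and row[2] > cutoff)
    positives = sum(1 for row in rows if row[1] == 1)
    negatives = len(rows) - positives
    return tp / positives, fp / negatives

for cutoff in (0.4, 0.5):
    print(cutoff, tpr_fpr(customers, cutoff))
```

Both cutoffs catch both churners (TPR = 1), but at 0.4 two of the three non-churners are flagged while at 0.5 only one is, so 0.5 has the lower FPR.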
Changing the Threshold
You initially chose a threshold of 0.5 wherein a churn probability of greater than 0.5 would result in the customer
being identified as 'Churn' and a churn probability of less than 0.5 would result in the customer being identified
as 'Not Churn'.
Now, suppose you decreased the threshold to a value of 0.3. What will be its effect on the classification?
More customers would now be classified as 'Churn'. ✓ Correct
TPR and FPR
Fill in the blanks:
When the value of TPR increases, the value of FPR ______.
increases ✓ Correct
Area Under the Curve
You have the following five AUCs (Area under the curve) for ROCs plotted for five different models. Which of these
models is the best?
Model AUC
A 0.54
B 0.82
C 0.79
D 0.66
E 0.56
B ✓ Correct
ROC Curve in Python
ROC Curve
Following is the ROC curve that you got.
As you can see, when the 'True Positive Rate' is 0.8, the 'False Positive Rate' is about 0.24. What will be the value of
specificity, then?
0.8
0.2
0.76 ✓ Correct
ROC Curve
Which of the following ROC curves represents the best model?
C ✓ Correct
Finding the Optimal Threshold
Choosing the Optimal Cut-off
Suppose you created a dataframe to find out the optimal cut-off point for a model you built. The dataframe looks
like the following:
Threshold Probability Accuracy Sensitivity Specificity
0.0 0.0 0.21 1.00 0.00
0.1 0.1 0.39 0.96 0.22
0.2 0.2 0.56 0.88 0.49
0.3 0.3 0.59 0.81 0.53
0.4 0.4 0.62 0.78 0.63
0.5 0.5 0.74 0.73 0.74
0.6 0.6 0.81 0.64 0.79
0.7 0.7 0.78 0.42 0.83
0.8 0.8 0.63 0.21 0.92
0.9 0.9 0.56 0.03 0.98
Based on the table above, what will the approximate value of the optimal cut-off be?
0.4
0.5 ✓ Correct
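One way to read the table programmatically is to pick the row where accuracy, sensitivity and specificity lie closest together. That selection rule is an assumption made for this sketch; the source only states the answer.

```python
# (probability cut-off, accuracy, sensitivity, specificity) from the table above
rows = [
    (0.0, 0.21, 1.00, 0.00), (0.1, 0.39, 0.96, 0.22), (0.2, 0.56, 0.88, 0.49),
    (0.3, 0.59, 0.81, 0.53), (0.4, 0.62, 0.78, 0.63), (0.5, 0.74, 0.73, 0.74),
    (0.6, 0.81, 0.64, 0.79), (0.7, 0.78, 0.42, 0.83), (0.8, 0.63, 0.21, 0.92),
    (0.9, 0.56, 0.03, 0.98),
]

def spread(row):
    """Gap between the largest and smallest of the three metrics."""
    metrics = row[1:]
    return max(metrics) - min(metrics)

best = min(rows, key=spread)
print(best[0])  # 0.5
```

At 0.5 the three metrics are 0.74, 0.73 and 0.74, which is where the curves would roughly intersect if plotted.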
Choosing a model evaluation metric
As you learnt, there is usually a trade-off between various model evaluation metrics, and you cannot maximise all
of them simultaneously. For example, if you increase sensitivity (% of correctly predicted churns), the specificity (% of
correctly predicted non-churns) will reduce.
Let's say that you are building a telecom churn prediction model with the business objective that your company
wants to implement an aggressive customer retention campaign to retain the 'high churn-risk' customers. This is
because a competitor has launched extremely low-cost mobile plans, and you want to avoid churn as much as
possible by incentivising the customers. Assume that budget is not a constraint.
Which of the following metrics should you choose to maximise?
Accuracy
Sensitivity ✓ Correct
Model Evaluation Metrics - Exercise
Accuracy of the Model
Using the threshold of 0.3, what is the approximate accuracy of the model now?
72%
77% ✓ Correct
Confusion Matrix
Get the confusion matrix after using the cut-off 0.3. What is the number of 'False Negatives' now?
2793
842
283 ✓ Correct
Sensitivity
In the last question you saw that in the confusion matrix, the Churns are being captured better now. Using the
confusion matrix, can you tell what will the approximate sensitivity of the model now be?
67
72
76
78 ✓ Correct
Precision and Recall
Calculating Precision
Calculate the precision value for the following model.
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
60% ✓ Correct
F1-score
There is a measure known as the F1-score, which essentially combines both precision and recall. It is basically the
harmonic mean of precision and recall, and its formula is given by: $F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
The F1-score is useful when you want to look at the performance of precision and recall together.
Calculate the F1-score for the model below:
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
33%
67% ✓ Correct
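The F1-score for this matrix can be verified with a short sketch, using the harmonic mean of precision and recall. The variable names are illustrative and 'Churn' is treated as the positive class.

```python
tn, fp, fn, tp = 400, 100, 50, 150   # counts from the confusion matrix above

precision = tp / (tp + fp)           # 150 / 250
recall = tp / (tp + fn)              # 150 / 200
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision is 0.60 and recall is 0.75, so the F1-score comes to about 0.67, between the two and closer to the smaller value.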
Optimal Cut-off
When using the sensitivity-specificity tradeoff, you found out that the optimal cutoff point was 0.3. Now, when you
plotted the precision-recall tradeoff, you got the following curve:
What is the optimal cutoff point according to the curve given above?
0.24
0.42 ✓ Correct
Making Predictions
Calculating Accuracy
Recall that in the last segment you saw that the cutoff based on the precision-recall tradeoff curve was
approximately 0.42. When you take this cut-off, you get the following confusion matrix on the test set.
Actual/Predicted Not Churn Churn
Not Churn 1294 234
Churn 223 359
What will the approximate value of accuracy be on the test set now?
60%
72%
75%
78% ✓ Correct
Calculating Recall
For the confusion matrix you saw in the last question, what will the approximate value of recall be?
Actual/Predicted Not Churn Churn
Not Churn 1294 234
Churn 223 359
62% ✓ Correct
Graded Questions
Calculating Sensitivity
Suppose you got the following confusion matrix for a model by using a cutoff of 0.5.
Actual/Predicted Not Churn Churn
Not Churn 1200 400
Churn 350 1050
Calculate the sensitivity for the model above. Now suppose for the same model, you changed the cutoff from 0.5
to 0.4 such that your number of true positives increased from 1050 to 1190. What will be the change in
sensitivity?
Note: Report the answer in terms of new_value - old_value, i.e. if the sensitivity was, say, 0.6 earlier and then
changed to 0.8, report it as (0.8 - 0.6), i.e. 0.2.
0.05
-0.05
0.1 ✓ Correct
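The arithmetic behind this answer: the number of actual churners stays fixed at 350 + 1050 = 1400, so only the true-positive count changes. A minimal check:

```python
actual_positives = 350 + 1050            # false negatives + true positives at cutoff 0.5

old_sensitivity = 1050 / actual_positives
new_sensitivity = 1190 / actual_positives
change = new_sensitivity - old_sensitivity
print(round(old_sensitivity, 2), round(new_sensitivity, 2), round(change, 2))
```

Sensitivity rises from 0.75 to 0.85, a change of 0.1.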
Calculating Precision
Consider the confusion matrix you had in the last question.
Actual/Predicted Not Churn Churn
Not Churn 1200 400
Churn 350 1050
Calculate the values of precision and recall for the model and determine which of the two is higher.
Precision
Recall ✓ Correct
True Positive Rate
Fill in the blanks.
The True Positive Rate (TPR) metric is exactly the same as ______.
Sensitivity ✓ Correct
Threshold
Suppose someone built a logistic regression model to predict whether a person has a heart disease or not. All you
have from their model is the following table which contains data of 10 patients.
Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1
Now, you wanted to find out the cutoff based on which the classes were predicted, but you can't. But can you
identify which of the following cutoffs would be a valid cutoff for the model above based on the 10 data points
given in the table? (More than one option may be correct.)
0.50 ✓ Correct
0.55 ✓ Correct
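A cutoff is valid if it reproduces every predicted label in the table. The sketch below checks this, assuming a label of 1 is assigned when the probability is strictly greater than the cutoff. The extra candidates 0.45 and 0.60 are illustrative values added here, not options from the original question.

```python
# (predicted probability, predicted label) for patients 1001 to 1010
predictions = [
    (0.34, 0), (0.58, 1), (0.79, 1), (0.68, 1), (0.21, 0),
    (0.04, 0), (0.48, 0), (0.64, 1), (0.61, 1), (0.86, 1),
]

def is_valid_cutoff(cutoff):
    """True if thresholding at cutoff reproduces every predicted label."""
    return all((p > cutoff) == (label == 1) for p, label in predictions)

for cutoff in (0.45, 0.50, 0.55, 0.60):
    print(cutoff, is_valid_cutoff(cutoff))
```

Any cutoff between 0.48 (the highest probability labelled 0) and 0.58 (the lowest labelled 1) is consistent with the table.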
Evaluation Metrics
Consider the same model given in the last question.
Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1
Calculate the values of Accuracy, Sensitivity, Specificity, and Precision. Which of these four metrics is the highest
for the model?
Accuracy
Sensitivity ✓ Correct
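The four metrics can be counted straight from the actual and predicted labels in the table. A minimal sketch with illustrative variable names:

```python
# (actual heart disease, predicted label) for patients 1001 to 1010
rows = [(0, 0), (1, 1), (1, 1), (0, 1), (0, 0),
        (0, 0), (1, 0), (1, 1), (0, 1), (1, 1)]

tp = sum(1 for actual, pred in rows if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in rows if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in rows if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in rows if actual == 1 and pred == 0)

metrics = {
    "accuracy": (tp + tn) / len(rows),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),
}
print(metrics, max(metrics, key=metrics.get))
```

With 4 true positives, 3 true negatives, 2 false positives and 1 false negative, sensitivity (0.8) is the highest of the four.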
Logistic Regression - Industry Applications - Part I
Nuances of Logistic Regression - Variable Transformation-II
Woe Analysis
What information would you infer from the woe trend of tenure variable?
As tenure increases, the chances of churning decrease ✓ Correct
Woe Analysis
Choose the correct option:
Coarse binning is required for tenure variable as there is no monotonic trend in fine binning
Coarse binning is not required for tenure variable as there is a clear monotonic trend in fine binning ✓ Correct
Woe Analysis
What does negative woe signify in 'contract' variable (refer sheet-3)?
% of churners (bad customers) are more than % of no-churners (good customers) ✓ Correct
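The sign of WOE can be illustrated with a small sketch. It assumes the common convention WOE = ln(% of good customers / % of bad customers) per bucket, which is consistent with the answer above, and IV = sum of (% good - % bad) × WOE. The bucket names and counts are made up and are not taken from the course spreadsheet.

```python
import math

# Hypothetical buckets: (name, non-churners 'good', churners 'bad')
buckets = [("bucket_1", 200, 300), ("bucket_2", 300, 150), ("bucket_3", 500, 50)]

total_good = sum(good for _, good, _ in buckets)
total_bad = sum(bad for _, _, bad in buckets)

woe = {}
iv = 0.0
for name, good, bad in buckets:
    good_share = good / total_good      # % of all non-churners in this bucket
    bad_share = bad / total_bad         # % of all churners in this bucket
    woe[name] = math.log(good_share / bad_share)
    iv += (good_share - bad_share) * woe[name]

print(woe, round(iv, 2))
```

In `bucket_1` the share of churners (60%) exceeds the share of non-churners (20%), so its WOE is negative. The reverse holds for `bucket_3`.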
Woe Analysis
Compare the woe trends of both variables (tenure and contract).
Based on the woe trend, which variable when increased in value, might decrease the likelihood of churn?
Tenure
Contract
Both ✓ Correct
Information Value
What is the total information value of both the variables?
Contract = 0.83, Tenure = 1.24
Contract = 1.24 , Tenure = 0.83 ✓ Correct
Information Value
Choose the correct option?
Contract variable has stronger predictive power than tenure ✓ Correct
Nuances of Logistic Regression - Variable Transformation-III
WOE Missing Value
Woe value for NA bucket is:
0.51
0.41
-0.41
-0.51 ✓ Correct
Missing value
NA bucket can be merged with -
1-1 Bucket
2-2 Bucket
7-9 Bucket
None ✓ Correct
Graded Questions
Logistic Regression
What do you infer from the woe plot of the 'Grade' variable?
As the loan grade varies from A to G, the woe values gradually decrease from +0.99 to -1.09 ✓ Correct
Logistic Regression
Choose the correct option:
Woe graph shows monotonic nature ✓ Correct
Logistic Regression
Information value of the 'Grade' variable is:
0.56
0.43
0.34 ✓ Correct
Logistic Regression: Industry Applications - Part II
Coding Practice (Optional)
Fibonacci Series
Description
Compute and display Fibonacci series upto n terms where n is a positive integer entered
n = int(input())
# first two terms
n1, n2 = 0, 1
count = 0
# check if the number of terms is valid
if n <= 0:
    print("Please enter a positive integer")
# if there is only one term, print n1
elif n == 1:
    print("Fibonacci sequence upto", n, ":")
    print(n1)
# generate fibonacci sequence
else:
    print("Fibonacci sequence:")
    while count < n:
        print(n1)
        nth = n1 + n2
        # update values
        n1 = n2
        n2 = nth
        count += 1
Prime Numbers
Description
Determine whether a positive integer n is a prime number or not. Assume n>1.
Display “number entered is prime” if n is prime, otherwise display “number entered is n
n = int(input())
out = True
# n is not prime if any i in 2..n-1 divides it
for i in range(2, n):
    if n % i == 0:
        out = False
        break
if out:
    print("number entered is prime")
else:
    print("number entered is not prime")
Armstrong number
Description
Any number, say n is called an Armstrong number if it is equal to the sum of its digits
n = int(input())
# Python program to check if the number is an Armstrong number or not
# initialize the running total (avoids shadowing the built-in sum)
total = 0
# find the sum of the cube of each digit
temp = n
while temp > 0:
    digit = temp % 10
    total += digit ** 3
    temp //= 10
# display the result
if n == total:
    print(True)
else:
    print(False)
Selecting dataframe columns
Description
Write a program to select all columns of a dataframe except the ones specified.
The input will contain a list of columns that you should skip.
You should print the first five rows of the dataframe as output where the columns are a
import pandas as pd
import ast,sys
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/X0kvr3wEYXRzONE5W37xWWY
input_str = sys.stdin.read()
to_omit = ast.literal_eval(input_str)
# keep only the columns that are not listed in to_omit
df = df.loc[:, ~df.columns.isin(to_omit)]
# print the first five rows with the remaining columns in sorted order
print(df.loc[:, sorted(df.columns)].head())
Two series
Description
Given two pandas series, find the position of elements in series2 in series1.
You can assume that all elements in series2 will be present in series1.
The input will contain two lines with series1 and series2 respectively.
The output should be a list of indexes indicating elements of series2 in series 1.
Note: In the output list, the indexes should be in ascending order.
import ast,sys
import pandas as pd
input_str = sys.stdin.read()
input_list = ast.literal_eval(input_str)
series1=pd.Series(input_list[0])
series2=pd.Series(input_list[1])
out_list=[pd.Index(series1).get_loc(num) for num in series2]
print(list(map(int,out_list)))#do not alter this step, list must be int type for evalua
Cleaning columns
Description
For the given dataframe, you have to clean the "Installs" column and print its correlat
You have to do the following:
1. Remove characters like ',' from the number of installs.
2. Delete rows where the Installs column has irrelevant strings like 'Free'
3. Convert the column to int type
You can access the dataframe using the following URL in your Jupyter notebook:
https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA7/googleplaystor
import pandas as pd
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA
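# strip the thousands separators and the trailing '+' (literal match, not a regex)
df.Installs = df.Installs.str.replace(',', '', regex=False)
df.Installs = df.Installs.str.replace('+', '', regex=False)
# drop rows where Installs holds an irrelevant string such as 'Free'
df = df[df.Installs != 'Free']
df.Installs = df.Installs.astype(int)
# restrict to numeric columns; newer pandas versions raise an error on text columns otherwise
print(df.corr(numeric_only=True))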