Application of Logistic Regression To People-Analytics

This document discusses using logistic regression to predict whether job candidates will renege, or not join a company, after accepting an offer. It outlines objectives like identifying key drivers of a candidate's decision. The analyst prepares the data, explores it visually, builds a logistic regression model, and validates the model's goodness of fit. Important variables like notice period, line of business, candidate source, and compensation difference are identified. The regression coefficients are interpreted to understand the impact of variables on a candidate's likelihood of joining.


Application of Logistic Regression to People-Analytics
AMDA-2021
Prof. Ujjwal Das
Indian Institute of Management, Udaipur
Objectives
• Use of an analytical approach to predict renege.
• Past data from Indian IT companies revealed that
30% of the candidates did not join the company after
offer acceptance, which significantly increased the
overall cost of recruitment.
• Analytics could possibly help in identifying the key
drivers that influence a candidate in either
joining/not-joining a company after accepting the
offer, as it would largely help clients save both cost
and time.
Objectives
• However, there was a risk involved: any error in
this prediction could turn out to be a costly affair,
as the client could ‘‘wrongly’’ reject a potential
candidate even without interviewing him/her. In
addition to this, we will also see
• Data preparation
• Data exploration
• Data modeling
• Validation and implementation
Objectives
• Obviously, there are many options for classification (joined/not joined); we will consider and validate logistic regression
• Explore the data by advanced visualization
• Will use dummy variables, interactions in model
building; variable selection
• Validate the goodness-of-fit of the model using
measures like omnibus test
• Predicting performance by using sensitivity, specificity,
ROC curve, AUC and Youden’s Index
Activities and Data Challenges in Analytics
Projects
• The first job, perhaps, is to translate the questions
from the subject matter experts (SMEs) into statistical
problems, e.g. hypothesis tests
• Make sure that all the questions from SME can
be answered from the available data
• Technically speaking: identify the factors
which may influence the decision-making
process of a candidate despite having an offer
Challenges continued..
• Before entering complex analytics
methodology it’s always recommended to
start with exploratory analysis
• Visualize the data and draw some basic insight
about your final decision making
• Will perform advanced graphics as we did in
previous sessions
Possible Research Question(s)
• Develop an appropriate model that can be
used by ScaleneWorks for predicting
candidates who are unlikely to join after
accepting the offer.
• Which are the important variables on renege?
• Devise a predictive algorithm to calculate the
probability of acceptance of an offer and
finally joining the company after offer
acceptance.
Exploratory Data Analysis (EDA)
• Before answering the questions of the client it is
recommended to discuss the data (overview)
• Understanding the business processes of different
functions which substantiate the problem in hand.
• Formulating hypotheses that need to be tested
and could lead to a possible solution. Hypothesis
formulation typically happens in confirmatory data
analysis, whereas EDA may start with no hypothesis
at all and instead aims to develop an insight about
the data
Various Activities Performed during a project
lifecycle
• A typical instance of confirmatory data analysis
where several hypotheses were formulated with
the involvement of subject matter experts (SMEs)
• SMEs (in the field of talent management) were
asked for factors that influence the decision by
candidates.
• Note that these are only suggestions; you have
to decide from the data which of them can
actually be checked
Hypotheses Generated on the basis of the interview among domain experts from talent
management

• H1: when the offered compensation is less than the expected compensation by X%.
• H2: when candidate has to re-locate from one city to another.
• H3: when there is significant disparity between salary increment vis-à-vis per capita income in the city
• H4: when there is minimal responsibility, on the candidate, of immediate family members
(parents/siblings).
• H5: when candidate moves from a higher tier company to a lower tier company.
• H6: when candidate moves from a product development company to a service-oriented company.
• H7: when there is less disparity between a candidate’s current designation and designation offered.
• H8: when a candidate’s spouse is working in a different location.
• H9: when a candidate’s educational background includes a Tier1 institute.
• H10: when the time lag between the different stages of selection process increases.*
• H11: based on the period in which the offer was rolled out (salary increment cycle in the parting company
vs. salary increment cycle in the joining company).
• H12: with the channel through which the candidate’s profile was sourced.
• H13: based on the number of times the candidate has changed companies in the past.
• H14: when the hiring is not project-specific but general.
• H15: when the delay in background verification (BGV) process extends beyond the date of joining of the
candidate
Data Preparation
• At the time of extraction from the system, some corrections
were made e.g. “Calcutta”, “Kolkatta” were replaced by
“Kolkata”, bandwise correction for some CTC values etc.
• To improve data completeness, logical imputation was
carried out after discussing with ScaleneWorks: 
1. Expected CTC is blank, imputing the expected CTC with
the offered CTC (data completeness improved from 90% to
95%)
2. Offered CTC is blank, imputing the offered CTC with the
expected CTC (data completeness improved from 92% to 96%)
Data Exploration by Visualization
• Data exploration or visualization plays an important
role in understanding the attributes. Exploration
helps in understanding the data better.
•  Since the objective was to understand the impact of
various attributes on the final HR Status (Joined/Not
Joined), all the attributes were plotted against HR
status.
• This also ensures that it’s a classification problem
and hence a logistic regression may be appropriate
as a prediction model
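Plotting every attribute against HR status amounts to a normalized cross-tabulation per attribute; a minimal pandas sketch with made-up data (the real case columns may be named differently):

```python
import pandas as pd

# Made-up sample standing in for the case data
df = pd.DataFrame({
    "status":        ["Joined", "Not Joined", "Joined",
                      "Joined", "Not Joined", "Joined"],
    "notice_period": [30, 120, 30, 45, 120, 45],
})

# Row-normalized proportions: share of Joined / Not Joined per notice period
tab = pd.crosstab(df["notice_period"], df["status"], normalize="index")
```

The same table feeds a stacked bar chart directly, e.g. `tab.plot(kind="bar", stacked=True)`.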
Pie chart showing percentage of joined/not
joined
Impact of Notice-Period on Joining
Impact of other Variables
• Joining bonus (offered or not); LOB i.e. Line of
Business for which offer was rolled out; job
location; candidate source; gender; date of
joining extended based on candidate’s
request; offer band based on candidate’s
experience and performance in interview;
relocation needed
• similarly for continuous variables…
Data Modeling
• To apply analytical techniques for model
building, the dataset has been divided into
training and validation datasets
• 80% of the data was used for training and the
remaining 20% for validation. We then apply
logistic regression to the training data. Surely,
some other methods like classification trees
and random forests (RF) are also applicable
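The 80/20 split can be sketched with scikit-learn on placeholder data (the feature matrix and labels here are simulated, not the case data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder Joined (0) / Not Joined (1) labels

# 80% training / 20% validation, stratified to keep the class mix similar
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```

Stratifying matters here because the classes are unbalanced (roughly 30% reneges in the historical data).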
Logistic Regression
• The HR wants to predict not joining, since this will save
their money & time
• The mathematical form of the model is logit(π) = η, where
π = P(Response = 1) and β denotes the vector of unknown
regression coefficients
• Systematic component: the X's are explanatory variables
(can be continuous, discrete, or both) and enter linearly in
the parameters, i.e. η = β0 + β1x1 + ... + βkxk
• Link function: logit link:
η = logit(π) = log(π/(1−π))
Basic Features
• Subjects are independent (we had it before).
• The dependent variable Yi does NOT follow
normal, but it follows binomial distribution
because of its dichotomous nature.
• The homoscedasticity (constant error variance)
assumption is not required.
• It uses maximum likelihood estimation (MLE)
rather than ordinary least squares (OLS) to
estimate the parameters.
Logistic Regression: goodness-of-fit
• Once fitted we need to understand whether the full model is
useful or not: Omnibus test.
• The Omnibus Tests of Model Coefficients is used to check
that the new model (with explanatory variables included) is
an improvement over the baseline model. It uses chi-square
tests to see if there is a significant difference between the
Log-likelihoods (specifically the -2LLs) of the baseline
model and the new model.
• If the new model has a significantly reduced -2LL compared
to the baseline then it suggests that the new model is
explaining more of the variability in the outcome and is an
improvement! Hence, our new model is significantly better
Goodness-of-fit continued
• Null Hypothesis: All regression coefficients (β-
values) are zero
• Alternative Hypothesis: Not all regression
coefficients (β-values) are zero.
• We reject the null hypothesis when the p-value
(significance value) is less than 0.05 (5%
significance) indicating that the variable may be
useful and plays an important role in
classification
Discussion on Regression Output
• Interpretation of numbers like regression coefficients; p-values
from the R output
• The p-value for DOJ_Extended, Notice Period, lob, Candidate
Source, and Percent_diff (main effects) showed significance
and suggests that the coefficients of these variables explain the
variation in the HR status.
• The p-value of interaction variable, that is, Age*LOB also
suggests significance in explaining HR status but not for
Age*Offered_Band
• Also note that the SEs of the estimates are in general small,
indicating the analysis is stable; large SEs would signal a
challenging problem!
Interpretation of Estimates
• An easy interpretation of the regression
coefficients in logistic regression is that when
the regression coefficient is positive, the
P(Y=1) increases as the corresponding variable
value increases. When the regression
coefficient is negative, the P(Y=1) decreases as
the corresponding variable value increases.
• A common way to interpret the regression
coefficients is through odds and odds ratio
Odds and Odds Ratio
• A very popular metric among people at large. Instead
of using probability people like to speak about odds in
favor of some incident.
• Odds of belonging to class 1 is defined as the ratio of
probability of belonging to class 1 to the same for
belonging to class 0 (also called “odds in favor of class
1”).
• If probability is known one can compute odds and
vice-versa. Odds=exp(linear predictor).
• Odds for continuous predictor
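The probability↔odds correspondence in the bullets above is a one-line conversion in each direction; a small sketch:

```python
def odds(p):
    """Odds in favor of class 1 from its probability."""
    return p / (1.0 - p)

def prob(o):
    """Probability of class 1 back from the odds (the inverse map)."""
    return o / (1.0 + o)

# e.g. P(not joining) = 0.75 corresponds to odds of 3 to 1
o = odds(0.75)
p_back = prob(o)
```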
Odds and Odds Ratio
• Odds((X1+1), X2, …, Xp) / Odds(X1, X2, …, Xp) = exp(β1);
hence, a single-unit increase in X1 is associated with a
change of exp(β1) in the odds favoring the "success",
keeping all other variables constant
• For a dummy variable, same interpretation but
comparing with the reference level
• If β1 > 0 the odds increase; if β1 < 0 they decrease
• If percent.CTC.difference increases one unit then the odds in
favor of not joining change by a factor exp(−0.0081) = 0.99
(a slight decrease), holding all other variables fixed
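Since odds = exp(linear predictor) under the model, the ratio of odds at x+1 versus x equals exp(β1) regardless of the starting x. A quick check using the percent-CTC-difference slope −0.0081 from the slides (the intercept value here is made up):

```python
import numpy as np

beta0, beta1 = 0.2, -0.0081            # beta0 illustrative; beta1 from the slides

def odds_at(x):
    return np.exp(beta0 + beta1 * x)   # odds = exp(linear predictor)

ratio = odds_at(6.0) / odds_at(5.0)    # one-unit increase in the predictor
# ratio equals exp(beta1), i.e. slightly lower odds of not joining
```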
Odds and Odds Ratio
• For “DOJ extended” with reference level “No”,
exp(β1) = exp(−0.9) = 0.4066. This means people who got
their date of joining extended have lower odds in favor of
not joining after accepting the offer (and hence are less
likely to renege), compared to those who didn’t get the DOJ
extended.
• Finally, the odds ratio (OR) is the ratio of two odds, and is
generally used to compare different classes of observations.
The log odds ratio for a notice period of 45 days over 30
days is 1.8; hence the odds of not joining are higher with
45 days than with 30 days
Sensitivity; Specificity & Model Accuracy
• Interpret the classification table (also known as confusion
matrix) with cut-off value 0.5. Classification cut-off
probability can be changed to achieve higher sensitivity value
• Sensitivity is the probability that predicted class is +ve when
observed class is +ve; Specificity is the probability that the
predicted class is -ve when the observed class is -ve; and
Model accuracy denotes the correct classification of observed
classes using the model
• Sensitivity measures the true positive rate and the specificity
measures the true negative rate. That is, sensitivity is the model’s
ability to correctly classify positive given the observation is
positive, whereas, specificity is the model’s ability to correctly
classify negative given the observation is negative
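Sensitivity, specificity, and accuracy at the 0.5 cut-off, sketched from a confusion matrix on toy labels and scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                  # 1 = positive class
p_hat  = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3])  # model probabilities

y_pred = (p_hat >= 0.5).astype(int)                  # classify at the 0.5 cut-off
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
accuracy = (tp + tn) / y_true.size  # overall correct classification
```

Changing the threshold in `y_pred` trades sensitivity against specificity, which is exactly what motivates the cut-off discussion below.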
Predictive Performance of the Classifier
• Predictive algorithm to calculate the probability of acceptance of
an offer and finally not joining the company after offer
acceptance can be clearly modeled by the regression equation;
Invert to get probabilities
• Evaluation criteria for the predictive performance of a classifier:
Predictive accuracy, ROC curve, AUC
• AUC has been proved analytically to be more consistent and
discriminating than accuracy
• One easily noted advantage of AUC over accuracy is that it is
independent of decision thresholds/ cut-off points
• Also, in unbalanced data AUC is empirically shown to be more
effective in discriminating than accuracy
Cut-off Probability & Evaluation of the
Classifier
• Final question that a data analyst has to answer is:
What cut-off probability should ScaleneWorks use to classify
joining and not joining the firm after accepting the offer?
• We have seen already that if we change the cut-off all
measures (accuracy etc.) changes immediately. We have to
select cut-off satisfying some optimal conditions.
• Receiver operating characteristic (ROC) curves are useful
tools to evaluate classifiers in several applications.
• An ROC plot displays the performance of a binary
classification method with continuous or discrete output.
• It shows the sensitivity and specificity as the output threshold
is moved over the range of all possible values.
Cut-off Probability & Evaluation Continued

• The area under the curve (AUC), also referred to as index of
accuracy (A) or concordance index, represents the
performance of the ROC curve. ROC is plotted with the true
positive rate (sensitivity) on the Y axis and the false positive
rate (1−specificity) on the X axis.
• “False hopes are more dangerous than fears”- J. R. R.
Tolkien. Hence, it’s desirable to have a low FPR
• The area under the ROC curve is the proportion of
concordant pairs in the data. The higher the AUC, the
better the model. But AUC is also criticized for ignoring the
goodness-of-fit
Business Decision
• Youden’s index gives the optimal cut-off probability when both
sensitivity and specificity are treated equally important. Index =
Max[Sensitivity(p) + Specificity(p) -1]; max over p.
• The optimal cut-off under Youden’s index is the one that returns
the maximum value of the index
• The optimal cut-off probability based on Youden’s Index is
0.30; this indicates that ScaleneWorks should use a cut-off
probability of 0.30. Anyone with probability of more than 0.30
may be classified as “not joining.”
• A final remark: there may be scenarios where the estimates
of the regression coefficient(s) do not exist. This is called
separation, and penalized regression is a solution.
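The Youden's-index search described above can be sketched with scikit-learn's `roc_curve` on toy data (the real case arrives at a 0.30 cut-off; this toy example lands elsewhere):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
p_hat  = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3])

# Keep every threshold so the argmax scans all candidate cut-offs
fpr, tpr, thresholds = roc_curve(y_true, p_hat, drop_intermediate=False)

j = tpr - fpr                          # Youden's J = sensitivity + specificity - 1
best_cutoff = thresholds[np.argmax(j)] # cut-off maximizing the index
```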
