Application of Logistic Regression To People-Analytics
AMDA-2021
Prof. Ujjwal Das
Indian Institute of Management, Udaipur
Objectives
• Use of an analytical approach to predict renege.
• Past data from Indian IT companies revealed that
30% of the candidates did not join the company after
offer acceptance, which significantly increased the
overall cost of recruitment.
• Analytics could possibly help in identifying the key
drivers that influence a candidate's decision to join or
not join a company after accepting the offer, which
would largely help clients save both cost and time.
Objectives
• However, there was a risk involved: any error in
this prediction could turn out to be a costly affair,
as the client could "wrongly" reject a potential
candidate without even interviewing him/her. In
addition to this, we will also cover
• Data preparation
• Data exploration
• Data modeling
• Validation and implementation
Objectives
• There are many options for classification
(joined/not joined); here we will consider and validate
logistic regression
• Explore the data by advanced visualization
• Use dummy variables and interactions in model
building; perform variable selection
• Validate the goodness-of-fit of the model using
measures like omnibus test
• Assess prediction performance using sensitivity,
specificity, the ROC curve, AUC and Youden's Index
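As an illustration only, the sketch below shows how these measures could be computed in R once a logistic model has been fitted on training data and applied to a validation set; the object names fit, valid and Status are placeholders (not part of the case data) and the pROC package is assumed to be available.

# Illustrative R sketch: predictive performance of a fitted logistic model 'fit'
# on a validation set 'valid' with binary outcome 'Status' (1 = Not Joined)
library(pROC)
pred_prob  <- predict(fit, newdata = valid, type = "response")  # predicted probabilities
pred_class <- ifelse(pred_prob > 0.5, 1, 0)                     # classify at cut-off 0.5
conf_mat   <- table(Observed = valid$Status, Predicted = pred_class)
sens <- conf_mat["1", "1"] / sum(conf_mat["1", ])               # sensitivity (true positive rate)
spec <- conf_mat["0", "0"] / sum(conf_mat["0", ])               # specificity (true negative rate)
roc_obj <- roc(valid$Status, pred_prob)                         # ROC curve
auc(roc_obj)                                                    # area under the ROC curve
coords(roc_obj, "best", best.method = "youden")                 # cut-off maximising Youden's Index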
Activities and Data Challenges in Analytics
Projects
• The first job, perhaps, is to translate the questions
from the subject matter experts (SMEs) into statistical
problems, e.g. hypothesis testing
• Make sure that all the questions from the SMEs can
be answered from the available data
• Technically speaking: identify the factors
which may influence the decision-making
process of a candidate who already holds an offer
Challenges continued..
• Before moving to complex analytical
methodology, it is always recommended to
start with exploratory analysis
• Visualize the data and draw some basic insights
that inform the final decision making
• We will use advanced graphics, as in
previous sessions
Possible Research Question(s)
• Develop an appropriate model that can be
used by ScaleneWorks for predicting
candidates who are unlikely to join after
accepting the offer.
• Which variables are important drivers of renege?
• Devise a predictive algorithm to calculate the
probability that a candidate finally joins the
company after accepting the offer.
Exploratory Data Analysis (EDA)
• Before answering the client's questions it is
recommended to first get an overview of the data
• Understanding the business processes of different
functions which substantiate the problem at hand.
• Formulating hypotheses which need to be tested
and could lead to a possible solution. Hypothesis
formulation typically happens in confirmatory
data analysis, whereas EDA may not start with
any hypothesis at all; its aim is to develop
insight about the data
Various Activities Performed during a project
lifecycle
• A typical instance of confirmatory data analysis
where several hypotheses were formulated with
the involvement of subject matter experts (SMEs)
• SMEs (in the field of talent management) were
asked for factors that influence the decision by
candidates.
• Note that these are only suggestions; you have
to decide from the data which of them can
actually be checked
Hypotheses Generated on the Basis of Interviews with Domain Experts from Talent
Management
• H1: when the offered compensation is less than the expected compensation by X%.
• H2: when candidate has to re-locate from one city to another.
• H3: when there is significant disparity between salary increment vis-à-vis per capita income in the city
• H4: when the candidate has minimal responsibility for immediate family members
(parents/siblings).
• H5: when candidate moves from a higher tier company to a lower tier company.
• H6: when candidate moves from a product development company to a service-oriented company.
• H7: when there is less disparity between a candidate’s current designation and designation offered.
• H8: when a candidate’s spouse is working in a different location.
• H9: when a candidate’s educational background includes a Tier1 institute.
• H10: when the time lag between the different stages of selection process increases.*
• H11: based on the period in which the offer was rolled out (salary increment cycle in the parting company
vs. salary increment cycle in the joining company).
• H12: with the channel through which the candidate’s profile was sourced.
• H13: based on the number of times the candidate has changed companies in the past.
• H14: when the hiring is not project-specific but general.
• H15: when the delay in the background verification (BGV) process extends beyond the candidate's date of
joining.
Data Preparation
• At the time of extraction from the system, some corrections
were made, e.g. "Calcutta" and "Kolkatta" were replaced by
"Kolkata", band-wise corrections were applied to some CTC values, etc.
• To improve data completeness, logical imputation was
carried out after discussing with ScaleneWorks (a minimal
sketch follows the two rules below):
1. Where the expected CTC is blank, impute it with the
offered CTC (data completeness improved from 90% to
95%)
2. Where the offered CTC is blank, impute it with the
expected CTC (data completeness improved from 92% to 96%)
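The R sketch below illustrates these two imputation rules; the data frame offers and the column names Expected.CTC and Offered.CTC are hypothetical placeholders for the actual case data.

# Illustrative R sketch of the logical imputation rules above
miss_exp <- is.na(offers$Expected.CTC)
offers$Expected.CTC[miss_exp] <- offers$Offered.CTC[miss_exp]    # rule 1: fill expected with offered
miss_off <- is.na(offers$Offered.CTC)
offers$Offered.CTC[miss_off] <- offers$Expected.CTC[miss_off]    # rule 2: fill offered with expected
mean(complete.cases(offers[, c("Expected.CTC", "Offered.CTC")])) # completeness after imputation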
Data Exploration by Visualization
• Data exploration or visualization plays an important
role in understanding the attributes and helps in
understanding the data better.
• Since the objective was to understand the impact of
various attributes on the final HR Status (Joined/Not
Joined), all the attributes were plotted against HR
status.
• This also confirms that it is a classification problem
and hence logistic regression may be appropriate
as a prediction model
Pie chart showing percentage of joined/not joined
Impact of Notice-Period on Joining
Impact of other Variables
• Joining bonus (offered or not); LOB i.e. Line of
Business for which offer was rolled out; job
location; candidate source; gender; date of
joining extended based on candidate’s
request; offer band based on candidate’s
experience and performance in interview;
relocation needed
• similarly for continuous variables…
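As an illustration of such plots, the R sketch below compares one categorical attribute against HR status; ggplot2 is assumed, and the data frame offers with columns Status and Notice.period are placeholder names, not the actual case variables.

# Illustrative R sketch: plotting a categorical attribute against HR status
library(ggplot2)
ggplot(offers, aes(x = factor(Notice.period), fill = factor(Status))) +
  geom_bar(position = "fill") +                        # stacked proportions of Joined / Not Joined
  labs(x = "Notice period", y = "Proportion", fill = "HR status")
prop.table(table(offers$Notice.period, offers$Status), margin = 1)  # row-wise proportions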
Data Modeling
• To apply analytical techniques for model building,
the dataset was divided into training and validation
datasets
• A total of 80% of the data was used for training
and the remaining 20% for validation. Logistic
regression was then applied to the training data;
of course, other methods such as classification
trees and random forests (RF) are also applicable
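A minimal R sketch of such an 80/20 split is shown below; the data frame offers and the seed value are placeholders.

# Illustrative R sketch: 80/20 split into training and validation sets
set.seed(123)                                           # for reproducibility
n <- nrow(offers)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # random 80% of row indices
train <- offers[train_idx, ]                            # training set
valid <- offers[-train_idx, ]                           # validation set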
Logistic Regression
• The HR wants to predict not joining, since this will save
their money and time
• The mathematical form of the model is
logit(π) = log(π/(1−π)) = β0 + β1x1 + ... + βkxk,
where π = P(Response = 1) and β = (β0, β1, ..., βk) denotes
the vector of unknown regression coefficients
• Systematic component: the X's are explanatory variables
(can be continuous, discrete, or both) and enter linearly in
the parameters, i.e. η = β0 + β1x1 + ... + βkxk
• Link function: Logit link:
η = logit(π) = log(π/(1−π))
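A minimal R sketch of fitting such a model on the training data is given below; the variable names follow the case discussion but are placeholders for the actual column names.

# Illustrative R sketch: fitting the logistic regression on the training set
fit <- glm(Status ~ DOJ.Extended + Notice.period + LOB + Candidate.Source +
             Percent.difference.CTC + Age + Age:LOB,
           data = train, family = binomial(link = "logit"))  # logit link, fitted by MLE
summary(fit)   # regression coefficients, standard errors and p-values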
Basic Features
• Subjects are independent (as we had before).
• The dependent variable Yi does NOT follow a
normal distribution; it follows a binomial distribution
because of its dichotomous nature.
• The homoscedasticity (constant variance) assumption
is not required.
• It uses maximum likelihood estimation (MLE)
rather than ordinary least squares (OLS) to
estimate the parameters.
Logistic Regression: goodness-of-fit
• Once fitted we need to understand whether the full model is
useful or not: Omnibus test.
• The omnibus test of model coefficients is used to check
whether the new model (with explanatory variables included) is
an improvement over the baseline (intercept-only) model. It uses
a chi-square test to see if there is a significant difference
between the log-likelihoods (specifically the -2LLs) of the
baseline model and the new model.
• If the new model has a significantly reduced -2LL compared
to the baseline, this suggests that the new model explains
more of the variability in the outcome and is an
improvement; hence, the new model is significantly better.
Goodness-of-fit continued
• Null Hypothesis: All regression coefficients (β-
values) are zero
• Alternative Hypothesis: Not all regression
coefficients (β-values) are zero.
• We reject the null hypothesis when the p-value
(significance value) is less than 0.05 (5%
significance level), indicating that at least one of
the variables may be useful and plays an important
role in classification
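A minimal R sketch of this omnibus (likelihood-ratio) test, assuming the fitted model fit and training data train from the earlier sketches, is:

# Illustrative R sketch: omnibus test against the intercept-only baseline
null_fit <- glm(Status ~ 1, data = train, family = binomial)  # baseline model
anova(null_fit, fit, test = "Chisq")                          # compares the -2 log-likelihoods
# a small p-value (< 0.05) => reject H0 that all regression coefficients are zero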
Discussion on Regression Output
• Interpretation of numbers like regression coefficients; p-values
from the R output
• The p-values for DOJ_Extended, Notice Period, LOB, Candidate
Source, and Percent_diff (main effects) show significance,
suggesting that the coefficients of these variables explain the
variation in the HR status.
• The p-value of the interaction term Age*LOB also
suggests significance in explaining HR status, but that of
Age*Offered_Band does not
• Also note that the standard errors (SEs) of the estimates are in
general small, indicating that the analysis is stable; otherwise it
would be a challenging problem!
Interpretation of Estimates
• An easy interpretation of the regression
coefficients in logistic regression is that when
the regression coefficient is positive, the
P(Y=1) increases as the corresponding variable
value increases. When the regression
coefficient is negative, the P(Y=1) decreases as
the corresponding variable value increases.
• A common way to interpret the regression
coefficients is through odds and odds ratio
Odds and Odds Ratio
• A very popular metric among people at large. Instead
of using probability, people often speak about the odds in
favor of some event.
• The odds of belonging to class 1 are defined as the ratio of
the probability of belonging to class 1 to the probability of
belonging to class 0 (also called "odds in favor of class
1").
• If the probability is known one can compute the odds and
vice versa. In logistic regression, Odds = exp(linear predictor) = exp(η).
• Odds for continuous predictor
Odds and Odds Ratio
• Odds(X1+1, X2, …, Xp) / Odds(X1, X2, …, Xp) = exp(β1);
hence, a one-unit increase in X1 multiplies the odds by exp(β1),
holding the other predictors fixed.
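In R, these odds ratios can be obtained directly from the fitted model of the earlier sketch; a minimal illustration (Wald-type intervals) is:

# Illustrative R sketch: odds ratios from the fitted logistic model 'fit'
exp(coef(fit))            # exp(beta_j): multiplicative change in odds per unit increase in X_j
exp(confint.default(fit)) # Wald-type 95% confidence intervals for the odds ratios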