Logistic regression is a statistical model used to predict a binary outcome variable like whether a job candidate accepts a job offer. The document discusses steps for building a logistic regression model including obtaining a dataset with applicant and outcome data, preprocessing the data, dividing it into training and test sets, training the model on the training set, evaluating the model's performance on the test set using metrics like accuracy, and using the model to make predictions on new data. Results show the training and test accuracies are reasonably high (~81%) suggesting the model generalizes well without overfitting or underfitting. Sensitivity, specificity, and accuracy metrics for evaluating binary classification models are also explained. Finally, support vector machines are discussed as an alternative modeling approach.

Q1.

Logistic regression is a statistical model that examines the relationship between one or more independent variables and a binary outcome variable, such as whether or not a candidate accepts a job offer and joins the organization.

To create a logistic regression model, the first step is to obtain a dataset that contains
information about the applicants and whether or not they accepted the job offer and joined
the company. You will then need to clean and pre-process the data by removing missing
values, scaling the data, and encoding categorical variables.
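As a sketch of these pre-processing steps, one might proceed as follows with pandas and scikit-learn. The column names come from the field list in this document, but the rows are invented for illustration; the real data file is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the HR dataset; these rows are invented for illustration.
df = pd.DataFrame({
    "Age": [28.0, 35.0, None, 41.0],
    "Notice.period": [30, 60, 45, 90],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Status": ["Yes", "No", "Yes", "No"],
})

# 1. Remove records with missing values
df = df.dropna()

# 2. Encode categorical variables: 0/1 for the outcome, one-hot for predictors
df["Status"] = (df["Status"] == "Yes").astype(int)
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)

# 3. Scale the numeric columns to zero mean and unit variance
num_cols = ["Age", "Notice.period"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```

Whether to drop rows with missing values or impute them is a judgment call; dropping is shown here only because it is the option the text mentions.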

In this case:
- The data provided is HR analytics data
- The classification goal is to predict whether a candidate will accept the offer or not
- The dataset has 9012 records and 17 data fields

### Data Exploration


- Age: numeric
- Candidate.Ref: numeric
- Candidate.relocate.actual: categorical (`Yes` or `No`)
- Candidate.Source: categorical (`Agency`, `Direct`, `Employee Referral`)
- DOJ.Extended: categorical (`Yes` or `No`)
- Duration.to.accept.offer: numeric
- Gender: categorical (`Male` or `Female`)
- Joining.Bonus: categorical (`yes` or `no`)
- LOB: categorical (`Axon`, `BFSI`, etc.)
- Location: categorical (`Ahmedabad`, `Bangalore`, etc.)
- Notice.period: numeric
- Offered.band: categorical (`E0`, `E1`, `E2`, `E4`)
- Pecent.hike.expected.in.CTC: numeric
- Percent.difference.CTC: numeric
- Percent.hike.offered.in.CTC: numeric
- Rex.in.Yrs: numeric
- Status: categorical (`Yes` or `No`)
### Data understanding and assumptions

* Here we treat `Status` as the final outcome. Strictly speaking, accepting an offer doesn't guarantee the candidate joined, but with the given data we can infer that `Yes` means the offer was accepted and `No` means it was declined.

* At first glance, `Candidate.Ref` is just an applicant identifier with no bearing on the outcome of the analysis, so we can exclude that field.
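Under these assumptions, dropping the identifier and encoding the outcome might look like the sketch below; the rows are made-up illustrations, not records from the real dataset.

```python
import pandas as pd

# Invented rows for illustration; `Candidate.Ref` and `Status` are
# real field names from the list above, the values are not real data.
df = pd.DataFrame({
    "Candidate.Ref": [1001, 1002, 1003],
    "Status": ["Yes", "No", "Yes"],
    "Age": [34, 27, 41],
})

# Candidate.Ref is just an applicant identifier, so exclude it
df = df.drop(columns=["Candidate.Ref"])

# Treat Status as the binary outcome: Yes -> 1 (accepted), No -> 0 (declined)
df["Status"] = df["Status"].map({"Yes": 1, "No": 0})
```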

Next, divide the data into two sets: a training set used to fit the model and a testing set used to assess its performance on data it has not seen.
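A typical split might look like the following sketch, here on synthetic data standing in for the 9012 records; the 70/30 ratio and the `random_state` value are illustrative choices, not taken from the document.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and 0/1 outcomes standing in for the real records
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Hold out 30% for testing; stratify keeps the class balance similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```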

After that, train your logistic regression model by determining the coefficients of the
independent variables that are most effective in predicting the outcome (i.e., whether or not
the candidate accepted the job offer and joined the company).
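A minimal sketch of this training step, again on synthetic data (the feature construction is invented for illustration): fitting estimates one coefficient per independent variable plus an intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: only the first feature truly drives the outcome
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 3))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Fitting learns one coefficient per independent variable plus an intercept
model = LogisticRegression()
model.fit(X_train, y_train)

print(model.coef_)       # learned coefficients, shape (1, 3)
print(model.intercept_)  # learned intercept
```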

Once the model has been trained, evaluate its performance using various metrics such as
accuracy, precision, and recall on the testing set.
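For example, with hypothetical test labels and predictions (not results from the document's model), the metrics mentioned above can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test-set outcomes (1 = joined) and model predictions
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_test, y_pred))   # fraction of correct predictions -> 0.75
print(precision_score(y_test, y_pred))  # of predicted joiners, how many joined -> 0.75
print(recall_score(y_test, y_pred))     # of actual joiners, how many were found -> 0.75
```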

Finally, utilize the model to make predictions about new data, allowing you to estimate the
likelihood of a candidate accepting the job offer and joining the organization.
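A minimal sketch of scoring a new candidate, assuming a fitted model and a single illustrative feature (both invented here); `predict_proba` returns the estimated probability of each class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on toy data where a larger feature value means acceptance is more likely
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# Estimated probability that a new candidate with feature value 4.5 accepts
p_accept = model.predict_proba([[4.5]])[0, 1]
print(p_accept)  # well above 0.5 for this toy data
```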

It's important to note that creating a dependable logistic regression model necessitates
careful selection and pre-processing of the data, as well as well-thought-out feature
engineering.

Results:
After fitting the logistic regression model we obtained the following accuracies:

Training accuracy: 0.814835368109508

Testing accuracy: 0.8102635228848821

Based on these results we can observe the following:

1. Overall Performance: Both the training and testing accuracy are reasonably high, which suggests that the model performs well overall.
2. Overfitting/Underfitting: The difference between training and testing accuracy is quite small (less than 1%). This is a good sign: it suggests that the model is neither overfitting nor underfitting. Overfitting occurs when a model performs well on the training data but poorly on the testing data, indicating that it has "memorized" the training data rather than "learned" the patterns. Underfitting occurs when a model performs poorly on both training and testing data, indicating that it hasn't learned the patterns well enough. In this case, the model seems to be well balanced between bias and variance, which is desirable.
3. Model Suitability: The high accuracy on both training and testing datasets suggests that logistic regression is a suitable model for this particular dataset. The model appears to generalize well to unseen data.

Remember, while accuracy provides a quick snapshot of model performance, it isn't always the most informative metric, particularly for imbalanced datasets. Depending on the specific task, we might also want to consider other metrics such as precision, recall, F1 score, ROC AUC, etc., to get a more comprehensive picture of the model's performance.

Q2.

Sensitivity, specificity, and model accuracy are commonly used metrics for evaluating binary classification models like logistic regression. Here's how to interpret them:
1. Sensitivity: Sensitivity measures the proportion of actual positive applicants (i.e., applicants who accepted the job offer and joined the company) who are correctly identified as positive by the model. In other words, sensitivity indicates how well the model identifies true positives. A sensitivity of 1.0 means that the model correctly identifies all true positives, while a sensitivity of 0.0 means that it identifies none of them.
2. Specificity: Specificity measures the proportion of actual negative applicants (i.e., applicants who declined the job offer and did not join the company) who are correctly identified as negative by the model. In other words, specificity indicates how well the model identifies true negatives. A specificity of 1.0 means that the model correctly identifies all true negatives, while a specificity of 0.0 means that it identifies none of them.
3. Model Accuracy: Model accuracy measures the overall performance of the model in correctly identifying both positive and negative cases. It is simply the proportion of all cases in which the model correctly predicts the outcome, whether positive or negative. An accuracy of 1.0 means that the model correctly predicts all outcomes, while an accuracy of 0.5 means that the model does no better than chance.

In general, a good classification model should have high sensitivity, high specificity, and high overall accuracy. However, the relative importance of these metrics depends on the particular application and the consequences of false positives and false negatives. For example, in some cases it may be more important to minimize false negatives (i.e., ensure that all true positives are identified), while in other cases it may be more important to minimize false positives (i.e., avoid negative cases being misidentified as positive).
Sensitivity is a measure of how well a binary classification model detects true positives: the proportion of actual positive cases that the model correctly identifies as positive. It is calculated by dividing the number of correct positive predictions by the total number of actual positive cases. A high sensitivity means that the model is good at detecting positive cases when they are present in the data. However, a high sensitivity may come with a higher false positive rate, meaning that the model may also label some negative cases as positive.

Specificity is the complement of sensitivity and measures how well the model detects true negatives: the proportion of actual negative cases that the model correctly identifies as negative. It is calculated by dividing the number of correct negative predictions by the total number of actual negative cases. A high specificity means that the model is good at identifying negative cases when they are present in the data. However, a high specificity may come with a higher false negative rate, meaning that the model may also label some positive cases as negative.
Model accuracy is a measure of how well a binary classification
model performs overall. It measures the proportion of all cases in
which the model correctly predicts the outcome, whether positive
or negative. Model accuracy is calculated by dividing the total
number of correct predictions by the total number of cases. High
model accuracy means that the model is able to correctly identify
both positive and negative cases. However, model accuracy alone
may not be sufficient to fully evaluate the performance of a binary
classification model, especially if the classes are unbalanced.
In summary, sensitivity and specificity measure different aspects of
a binary classification model's performance, while model accuracy
provides an overall measure of the model's performance. When
selecting a metric to evaluate a binary classification model, it is
important to consider the specific application and the relative costs
of false positives and false negatives.
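These three definitions can be computed directly from a confusion matrix. The labels below are hypothetical, not results from the document's model:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes: 1 = accepted and joined, 0 = declined
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                # true positive rate: 3/4
specificity = tn / (tn + fp)                # true negative rate: 5/6
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall: 8/10

print(sensitivity, specificity, accuracy)
```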

Q3. For the third part I have chosen to go with SVM.

SVMs are popular due to their ability to handle high-dimensional data and their versatility through the use of different kernel functions. Here, after the logistic regression model, I performed classification using SVM: once with default hyperparameters and twice with a linear kernel (C = 1.0 and C = 100.0).

The scores for the three runs were:

Model accuracy score with default hyperparameters: 0.8147
Training set score: 0.8132
Test set score: 0.8125

Model accuracy score with linear kernel and C=1.0: 0.8125
Training set score: 0.8132
Test set score: 0.8125

Model accuracy score with linear kernel and C=100.0: 0.8125
Training set score: 0.8132
Test set score: 0.8125
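The three runs above can be reproduced in outline with scikit-learn's `SVC`; the data here is synthetic stand-in data, so the scores will not match the reported ones.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable data standing in for the HR features
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = {}
for name, model in [
    ("default (rbf)", SVC()),                        # default hyperparameters
    ("linear, C=1.0", SVC(kernel="linear", C=1.0)),
    ("linear, C=100.0", SVC(kernel="linear", C=100.0)),
]:
    model.fit(X, y)
    scores[name] = model.score(X, y)
    print(name, scores[name])
```

Larger C penalizes margin violations more heavily; on this toy data all three configurations separate the classes easily, which mirrors how close the three reported scores are.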

Both models perform well on this dataset. Depending on the number of features and the volume of data, we can choose logistic regression or SVM based on the differences below.

Both Logistic Regression and Support Vector Machines (SVM) are powerful tools in the field
of machine learning and have their own strengths. Here's a brief overview of when each
might be the best choice:
Logistic Regression:
1. Binary Classification Problems: Logistic regression is a go-to method for binary
classification problems. It's straightforward and efficient to implement.
2. Need for Probabilistic Results: Logistic regression not only gives a prediction
(classification), but also provides probabilities of the predicted outputs. This can be
useful if you need to gauge the certainty of predictions.
3. Large-Scale Datasets: Logistic regression can be a better choice for large-scale
datasets because it's generally faster and more efficient than SVM, especially in cases
where the number of observations outnumbers the features.
4. Linearly Separable Data: If your data is linearly separable (you can draw a straight
line to separate different classes), logistic regression can perform very well.
Support Vector Machines (SVM):
1. Non-Linear Data: SVMs can handle both linear and non-linear data. With the use of
the kernel trick, SVMs can model complex, non-linear decision boundaries.
2. High-Dimensional Space: SVM works well in a high-dimensional space - that is, when
you have a lot of features. This is a scenario where SVMs often outperform logistic
regression.
3. Robust to Outliers: SVMs are more robust to outliers than logistic regression. The
SVM algorithm is designed to maximize the margin and is therefore less sensitive to
outliers.
4. Small to Medium-Sized Datasets: SVMs can be computationally expensive and may not scale well to very large datasets (in terms of both the number of samples and the dimensionality), so they are best suited to small and medium-sized data.
