Logistic regression is a statistical model used to predict a binary outcome variable like whether a job candidate accepts a job offer. The document discusses steps for building a logistic regression model including obtaining a dataset with applicant and outcome data, preprocessing the data, dividing it into training and test sets, training the model on the training set, evaluating the model's performance on the test set using metrics like accuracy, and using the model to make predictions on new data. Results show the training and test accuracies are reasonably high (~81%) suggesting the model generalizes well without overfitting or underfitting. Sensitivity, specificity, and accuracy metrics for evaluating binary classification models are also explained. Finally, support vector machines are discussed as an alternative modeling approach.

Q1.

Logistic regression is a statistical model that examines the relationship between one or more independent variables and a binary outcome variable, such as whether or not a candidate accepts a job offer and joins the organization.

To create a logistic regression model, the first step is to obtain a dataset that contains
information about the applicants and whether or not they accepted the job offer and joined
the company. You will then need to clean and pre-process the data by removing missing
values, scaling the data, and encoding categorical variables.
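As a sketch of these pre-processing steps, one might proceed as follows with pandas and scikit-learn. The column names come from the field list in this document, but the rows are invented for illustration; the real data file is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the HR dataset; these rows are invented for illustration.
df = pd.DataFrame({
    "Age": [28.0, 35.0, None, 41.0],
    "Notice.period": [30, 60, 45, 90],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Status": ["Yes", "No", "Yes", "No"],
})

# 1. Remove records with missing values
df = df.dropna()

# 2. Encode categorical variables: 0/1 for the outcome, one-hot for predictors
df["Status"] = (df["Status"] == "Yes").astype(int)
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)

# 3. Scale the numeric columns to zero mean and unit variance
num_cols = ["Age", "Notice.period"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```

Whether to drop rows with missing values or impute them is a judgment call; dropping is shown here only because it is the option the text mentions.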

In this case:
- The data provided is HR analytics data
- The classification goal is to predict whether a candidate will accept the offer or not
- The dataset has 9012 records and 17 data fields

### Data Exploration


- Age: numeric
- Candidate.Ref: numeric
- Candidate.relocate.actual: categorical (`Yes` or `No`)
- Candidate.Source: categorical (`Agency`, `Direct`, `Employee Referral`)
- DOJ.Extended: categorical (`Yes` or `No`)
- Duration.to.accept.offer: numeric
- Gender: categorical (`Male` or `Female`)
- Joining.Bonus: categorical (`yes` or `no`)
- LOB: categorical (`Axon`, `BFSI`, etc.)
- Location: categorical (`Ahmedabad`, `Bangalore`, etc.)
- Notice.period: numeric
- Offered.band: categorical (`E0`, `E1`, `E2`, `E4`)
- Pecent.hike.expected.in.CTC: numeric
- Percent.difference.CTC: numeric
- Percent.hike.offered.in.CTC: numeric
- Rex.in.Yrs: numeric
- Status: categorical (`Yes` or `No`)
### Data understanding and assumptions

* Here we treat `Status` as the final outcome. Strictly speaking, accepting an offer doesn't guarantee the candidate joined, but with the given data we can infer that `Yes` means the offer was accepted and `No` means it was declined.

* At first glance, `Candidate.Ref` is just an applicant identifier with no bearing on the outcome of the analysis, so we can exclude that field.
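Under these assumptions, dropping the identifier and encoding the outcome might look like the sketch below; the rows are made-up illustrations, not records from the real dataset.

```python
import pandas as pd

# Invented rows for illustration; `Candidate.Ref` and `Status` are
# real field names from the list above, the values are not real data.
df = pd.DataFrame({
    "Candidate.Ref": [1001, 1002, 1003],
    "Status": ["Yes", "No", "Yes"],
    "Age": [34, 27, 41],
})

# Candidate.Ref is just an applicant identifier, so exclude it
df = df.drop(columns=["Candidate.Ref"])

# Treat Status as the binary outcome: Yes -> 1 (accepted), No -> 0 (declined)
df["Status"] = df["Status"].map({"Yes": 1, "No": 0})
```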

Next, divide the data into two sets: a training set used to fit the model and a testing set used to assess its performance on data it has not seen.
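A typical split might look like the following sketch, here on synthetic data standing in for the 9012 records; the 70/30 ratio and the `random_state` value are illustrative choices, not taken from the document.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and 0/1 outcomes standing in for the real records
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Hold out 30% for testing; stratify keeps the class balance similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```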

After that, train your logistic regression model by determining the coefficients of the
independent variables that are most effective in predicting the outcome (i.e., whether or not
the candidate accepted the job offer and joined the company).
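A minimal sketch of this training step, again on synthetic data (the feature construction is invented for illustration): fitting estimates one coefficient per independent variable plus an intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: only the first feature truly drives the outcome
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 3))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Fitting learns one coefficient per independent variable plus an intercept
model = LogisticRegression()
model.fit(X_train, y_train)

print(model.coef_)       # learned coefficients, shape (1, 3)
print(model.intercept_)  # learned intercept
```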

Once the model has been trained, evaluate its performance using various metrics such as
accuracy, precision, and recall on the testing set.
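For example, with hypothetical test labels and predictions (not results from the document's model), the metrics mentioned above can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test-set outcomes (1 = joined) and model predictions
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_test, y_pred))   # fraction of correct predictions -> 0.75
print(precision_score(y_test, y_pred))  # of predicted joiners, how many joined -> 0.75
print(recall_score(y_test, y_pred))     # of actual joiners, how many were found -> 0.75
```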

Finally, utilize the model to make predictions about new data, allowing you to estimate the
likelihood of a candidate accepting the job offer and joining the organization.
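A minimal sketch of scoring a new candidate, assuming a fitted model and a single illustrative feature (both invented here); `predict_proba` returns the estimated probability of each class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on toy data where a larger feature value means acceptance is more likely
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# Estimated probability that a new candidate with feature value 4.5 accepts
p_accept = model.predict_proba([[4.5]])[0, 1]
print(p_accept)  # well above 0.5 for this toy data
```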

It's important to note that creating a dependable logistic regression model necessitates
careful selection and pre-processing of the data, as well as well-thought-out feature
engineering.

Results:
After fitting the logistic regression model we obtained the following accuracies:

Training accuracy: 0.814835368109508

Testing accuracy: 0.8102635228848821

Based on these results we can observe the following:

1. Overall Performance: Both the training and testing accuracy are reasonably high, which suggests that the model performs well overall.
2. Overfitting/Underfitting: The difference between training and testing accuracy is quite small (less than 1%). This is a good sign: it suggests that the model is neither overfitting nor underfitting. Overfitting occurs when a model performs well on the training data but poorly on the testing data, indicating that it has "memorized" the training data rather than "learned" the patterns. Underfitting occurs when a model performs poorly on both training and testing data, indicating that it hasn't learned the patterns well enough. In this case, the model seems to be well balanced between bias and variance, which is desirable.
3. Model Suitability: The high accuracy on both training and testing datasets suggests that logistic regression is a suitable model for this particular dataset. The model appears to generalize well to unseen data.

Remember, while accuracy provides a quick snapshot of model performance, it isn't always the most informative metric, particularly for imbalanced datasets. Depending on the specific task, we might also want to consider other metrics such as precision, recall, F1 score, ROC AUC, etc., to get a more comprehensive picture of the model's performance.

Q2.

Sensitivity, specificity, and model accuracy are commonly used metrics for evaluating binary classification models like logistic regression. Here's how to interpret them:
1. Sensitivity: Sensitivity measures the proportion of actual positive applicants (i.e., applicants who accepted the job offer and joined the company) who are correctly identified as positive by the model. In other words, sensitivity indicates how well the model identifies true positives. A sensitivity of 1.0 means that the model correctly identifies all true positives, while a sensitivity of 0.0 means that it identifies none of them.
2. Specificity: Specificity measures the proportion of actual negative applicants (i.e., applicants who declined the job offer and did not join the company) who are correctly identified as negative by the model. In other words, specificity indicates how well the model identifies true negatives. A specificity of 1.0 means that the model correctly identifies all true negatives, while a specificity of 0.0 means that it identifies none of them.
3. Model Accuracy: Model accuracy measures the overall performance of the model in correctly identifying both positive and negative cases. It is simply the proportion of all cases in which the model correctly predicts the outcome, whether positive or negative. An accuracy of 1.0 means that the model correctly predicts all outcomes, while an accuracy of 0.5 means that the model does no better than chance.

In general, a good classification model should have high sensitivity, high specificity, and high overall accuracy. However, the relative importance of these metrics depends on the particular application and the consequences of false positives and false negatives. For example, in some cases it may be more important to minimize false negatives (i.e., ensure that all true positives are identified), while in other cases it may be more important to minimize false positives (i.e., avoid negative cases being misidentified as positive).
Sensitivity is a measure of how well a binary classification model detects true positives: the proportion of actual positive cases that the model correctly identifies as positive. It is calculated by dividing the number of correct positive predictions by the total number of actual positive cases. A high sensitivity means that the model is good at detecting positive cases when they are present in the data. However, a high sensitivity may come with a higher false positive rate, meaning that the model may also label some negative cases as positive.

Specificity is the complement of sensitivity and measures how well the model detects true negatives: the proportion of actual negative cases that the model correctly identifies as negative. It is calculated by dividing the number of correct negative predictions by the total number of actual negative cases. A high specificity means that the model is good at identifying negative cases when they are present in the data. However, a high specificity may come with a higher false negative rate, meaning that the model may also label some positive cases as negative.
Model accuracy is a measure of how well a binary classification
model performs overall. It measures the proportion of all cases in
which the model correctly predicts the outcome, whether positive
or negative. Model accuracy is calculated by dividing the total
number of correct predictions by the total number of cases. High
model accuracy means that the model is able to correctly identify
both positive and negative cases. However, model accuracy alone
may not be sufficient to fully evaluate the performance of a binary
classification model, especially if the classes are unbalanced.
In summary, sensitivity and specificity measure different aspects of
a binary classification model's performance, while model accuracy
provides an overall measure of the model's performance. When
selecting a metric to evaluate a binary classification model, it is
important to consider the specific application and the relative costs
of false positives and false negatives.
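These three definitions can be computed directly from a confusion matrix. The labels below are hypothetical, not results from the document's model:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes: 1 = accepted and joined, 0 = declined
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                # true positive rate: 3/4
specificity = tn / (tn + fp)                # true negative rate: 5/6
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall: 8/10

print(sensitivity, specificity, accuracy)
```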

Q3. For the third part I have chosen to go with SVM.

SVMs are popular due to their ability to handle high-dimensional data and their versatility through the use of different kernel functions. Here, after the logistic regression model, I performed classification using SVM: once with default hyperparameters and twice with a linear kernel (C = 1.0 and C = 100.0).

The scores for the three runs were:

Model accuracy score with default hyperparameters: 0.8147
Training set score: 0.8132
Test set score: 0.8125

Model accuracy score with linear kernel and C=1.0: 0.8125
Training set score: 0.8132
Test set score: 0.8125

Model accuracy score with linear kernel and C=100.0: 0.8125
Training set score: 0.8132
Test set score: 0.8125
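The three runs above can be reproduced in outline with scikit-learn's `SVC`; the data here is synthetic stand-in data, so the scores will not match the reported ones.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable data standing in for the HR features
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = {}
for name, model in [
    ("default (rbf)", SVC()),                        # default hyperparameters
    ("linear, C=1.0", SVC(kernel="linear", C=1.0)),
    ("linear, C=100.0", SVC(kernel="linear", C=100.0)),
]:
    model.fit(X, y)
    scores[name] = model.score(X, y)
    print(name, scores[name])
```

Larger C penalizes margin violations more heavily; on this toy data all three configurations separate the classes easily, which mirrors how close the three reported scores are.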

Both models perform well on this dataset. Depending on the number of features and the volume of data, we can choose logistic regression or SVM based on the differences below.

Both Logistic Regression and Support Vector Machines (SVM) are powerful tools in the field
of machine learning and have their own strengths. Here's a brief overview of when each
might be the best choice:
Logistic Regression:
1. Binary Classification Problems: Logistic regression is a go-to method for binary
classification problems. It's straightforward and efficient to implement.
2. Need for Probabilistic Results: Logistic regression not only gives a prediction
(classification), but also provides probabilities of the predicted outputs. This can be
useful if you need to gauge the certainty of predictions.
3. Large-Scale Datasets: Logistic regression can be a better choice for large-scale
datasets because it's generally faster and more efficient than SVM, especially in cases
where the number of observations outnumbers the features.
4. Linearly Separable Data: If your data is linearly separable (you can draw a straight
line to separate different classes), logistic regression can perform very well.
Support Vector Machines (SVM):
1. Non-Linear Data: SVMs can handle both linear and non-linear data. With the use of
the kernel trick, SVMs can model complex, non-linear decision boundaries.
2. High-Dimensional Space: SVM works well in a high-dimensional space - that is, when
you have a lot of features. This is a scenario where SVMs often outperform logistic
regression.
3. Robust to Outliers: SVMs are more robust to outliers than logistic regression. The
SVM algorithm is designed to maximize the margin and is therefore less sensitive to
outliers.
4. Small to Medium-Sized Datasets: SVMs can be computationally expensive and may not scale well to very large datasets (in terms of both the number of samples and the dimensionality), so they are best suited to small and medium-sized data.
