### Data Exploration

*(Printed unique values from the categorical fields: 'Yes'/'No' for the outcome and 'Agency'/'Direct'/'Employee Referral' for the recruitment source.)*
To create a logistic regression model, the first step is to obtain a dataset that contains
information about the applicants and whether or not they accepted the job offer and joined
the company. You will then need to clean and pre-process the data by removing missing
values, scaling numeric features, and encoding categorical variables.
In this case:
- The data provided is HR analytics data
- The classification goal is to predict whether the candidate will accept the offer or not
- It has 9012 records and 17 data fields
- Here we can take `Status` to be the final outcome. Strictly speaking, accepting an offer doesn't guarantee that the candidate joined, but with the given data we can treat 'Yes' as offer acceptance and 'No' as refusal.
- A first glance at the data shows that `Candidate.Ref` is just an applicant identifier and has no bearing on the outcome, so we can skip analysing that field (it is dropped in the preprocessing sketch below).
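A minimal preprocessing sketch of these steps, assuming a pandas/scikit-learn workflow. The file name `hr_analytics.csv` is a placeholder, the 'Yes'/'No' coding of `Status` follows the description above, and the rest of the schema is assumed.

```python
import pandas as pd

# Placeholder file name for the HR analytics export.
df = pd.read_csv("hr_analytics.csv")

# Candidate.Ref is just an applicant identifier, so drop it.
df = df.drop(columns=["Candidate.Ref"])

# Remove records with missing values (imputation would be an alternative).
df = df.dropna()

# Target: 1 if the candidate accepted the offer ("Yes"), 0 otherwise.
y = (df["Status"] == "Yes").astype(int)

# One-hot encode the remaining categorical fields.
X = pd.get_dummies(df.drop(columns=["Status"]), drop_first=True)
```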
Next, divide the data into two sets: a training set used to fit the model, and a testing set
used to assess how well the model performs on data it has not seen.
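A possible split, continuing from the `X` and `y` built above. The 80/20 ratio, stratification, and seed are illustrative choices; scaling is done here, after the split, so that no test-set information leaks into training.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out 20% of the records for testing; stratify to keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```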
After that, train the logistic regression model: fitting estimates a coefficient for each
independent variable, reflecting how strongly that variable predicts the outcome (i.e.,
whether or not the candidate accepted the job offer and joined the company).
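A training sketch under the same assumptions, using scikit-learn's `LogisticRegression`. Pairing the fitted coefficients with the feature names shows which variables push a prediction toward acceptance (positive) or refusal (negative).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit the model on the scaled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One coefficient per one-hot-encoded feature, sorted by effect direction.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```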
Once the model has been trained, evaluate its performance on the testing set using metrics
such as accuracy, precision, and recall.
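For instance, the three metrics named above can each be computed on the held-out set in one call:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```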
Finally, utilize the model to make predictions about new data, allowing you to estimate the
likelihood of a candidate accepting the job offer and joining the organization.
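A sketch of scoring new applicants. Here `new_applicants` is a hypothetical DataFrame with the same raw fields as the training file; the `reindex` step aligns its dummy columns with those seen during training.

```python
import pandas as pd

# Hypothetical fresh batch of applicant records (same raw schema, no Status).
new_X = pd.get_dummies(new_applicants, drop_first=True).reindex(
    columns=X.columns, fill_value=0
)
new_X = scaler.transform(new_X)

# predict_proba returns [P(refuse), P(accept)] for each candidate.
accept_prob = model.predict_proba(new_X)[:, 1]
print(accept_prob)
```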
It's important to note that creating a dependable logistic regression model necessitates
careful selection and pre-processing of the data, as well as well-thought-out feature
engineering.
Results:
After training the logistic regression model, we observed the following:
1. Overall Performance: Both the training and testing accuracy are reasonably high,
which suggests that your model has a good overall performance.
2. Overfitting/Underfitting: The difference between training and testing accuracy is
quite small (less than 1%). This is a good sign: it suggests that the model is neither
overfitting nor underfitting. Overfitting occurs when a model performs well on the
training data but poorly on the testing data, indicating that it has "memorized" the
training data rather than "learned" the patterns. Underfitting occurs when a model
performs poorly on both training and testing data, indicating that it hasn't learned
the patterns well enough. Here, the model seems to be well balanced between bias
and variance, which is desirable (a quick check is sketched after this list).
3. Model Suitability: The high accuracy on both training and testing datasets suggests
that Logistic Regression is a suitable model for this particular dataset. The model
appears to generalize well to unseen data.
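A quick way to run the check from point 2, reusing the fitted model and splits from the sketches above:

```python
# A gap below roughly 1% between the two scores suggests neither
# overfitting nor underfitting.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train: {train_acc:.4f}  test: {test_acc:.4f}  gap: {train_acc - test_acc:.4f}")
```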
Remember, while accuracy can provide a quick snapshot of model performance, it isn't
always the most informative metric, particularly for imbalanced datasets. Depending on the
specific task, we might also want to consider other metrics such as precision, recall, F1 score,
ROC AUC, etc., to get a more comprehensive understanding of your model's performance.
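Continuing the same sketch, those additional metrics are readily available in scikit-learn; note that ROC AUC is computed from probability scores rather than hard labels.

```python
from sklearn.metrics import classification_report, roc_auc_score

# Precision, recall, and F1 per class (y_pred from the evaluation step above).
print(classification_report(y_test, y_pred))

# ROC AUC from the predicted acceptance probabilities.
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```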
Q2.
SVMs are popular due to their ability to handle high-dimensional data and their versatility
through the use of different kernel functions.
After the logistic regression model, I performed the same classification with SVMs, once
with a linear kernel and once with a non-linear kernel (a sketch of both fits follows).
Both models perform well on the dataset. Depending on the number of features and the size
of the dataset, we can go ahead with logistic regression or SVM based on the differences
outlined below.
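A minimal sketch of the two SVM fits, reusing the earlier splits. The source does not name the non-linear kernel, so RBF (scikit-learn's default) is assumed here.

```python
from sklearn.svm import SVC

# Linear kernel: a straight decision boundary, comparable to logistic regression.
svm_linear = SVC(kernel="linear")
svm_linear.fit(X_train, y_train)
print("linear kernel:", svm_linear.score(X_test, y_test))

# RBF kernel: a non-linear boundary via the kernel trick.
svm_rbf = SVC(kernel="rbf")
svm_rbf.fit(X_train, y_train)
print("rbf kernel   :", svm_rbf.score(X_test, y_test))
```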
Both Logistic Regression and Support Vector Machines (SVM) are powerful tools in the field
of machine learning and have their own strengths. Here's a brief overview of when each
might be the best choice:
Logistic Regression:
1. Binary Classification Problems: Logistic regression is a go-to method for binary
classification problems. It's straightforward and efficient to implement.
2. Need for Probabilistic Results: Logistic regression not only gives a prediction
(classification), but also provides probabilities of the predicted outputs. This can be
useful if you need to gauge the certainty of predictions.
3. Large-Scale Datasets: Logistic regression can be a better choice for large-scale
datasets because it's generally faster and more efficient than SVM, especially in cases
where the number of observations outnumbers the features.
4. Linearly Separable Data: If your data is linearly separable (you can draw a straight
line to separate different classes), logistic regression can perform very well.
Support Vector Machines (SVM):
1. Non-Linear Data: SVMs can handle both linear and non-linear data. With the use of
the kernel trick, SVMs can model complex, non-linear decision boundaries (see the toy
example after this list).
2. High-Dimensional Space: SVM works well in a high-dimensional space - that is, when
you have a lot of features. This is a scenario where SVMs often outperform logistic
regression.
3. Robust to Outliers: SVMs are more robust to outliers than logistic regression. The
SVM algorithm is designed to maximize the margin and is therefore less sensitive to
outliers.
4. Small to Medium-Sized Datasets: SVMs can be computationally expensive and may
not scale well to very large datasets (in terms of both the number of samples and the
dimensionality), so they are best suited to small and medium-sized datasets.
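To make point 1 concrete, here is a self-contained toy example on data that no straight line can separate; the dataset and comparison are illustrative only, not drawn from the HR data.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: the classes are not linearly separable.
X_m, y_m = make_moons(n_samples=500, noise=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_m, y_m, random_state=0)

# The RBF kernel lets the SVM bend its boundary around the moons,
# which a linear model like logistic regression cannot do.
print("logistic :", LogisticRegression().fit(Xtr, ytr).score(Xte, yte))
print("svm (rbf):", SVC(kernel="rbf").fit(Xtr, ytr).score(Xte, yte))
```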