Project Submission Predictive Modelling - Logistic Regression and LDA
Modelling Project
Report
(Logistic Regression and LDA Case Study)
Ankit bhagat
Date of Submission: 28th Nov
Business problem
Logistic Regression and LDA Case Study
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You
are provided details of 872 employees of a company. Among these employees, some
opted for the package and some didn't. You have to help the company in predicting
whether an employee will opt for the package or not on the basis of the information
given in the data set. Also, find out the important factors on the basis of which the
company will focus on particular employees to sell their packages.
Data Dictionary:
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Checking Head
Checking Tail
INFO
There are no null values in the dataset.
The columns hold integer and object (string) data.
DATA DESCRIBE
Salary, age, educ, the number of young children, the number of older children and the foreign
indicator are the attributes we have to examine to help the company predict whether
a person will opt for the holiday package or not.
Unique values in the categorical data
HOLLIDAY_PACKAGE: 2
Yes 401
No 471
Name: Holliday Package, dtype: int64
FOREIGN : 2
Yes 216
No 656
Name: foreign, dtype: int64
Percentage of target:
This split indicates that about 46% of employees (401 of 872) opted for the holiday package.
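The share of each class follows directly from the counts listed above. A minimal sketch of that arithmetic, using only the two reported counts (the variable names are illustrative):

```python
# Class counts copied from the value counts reported above.
counts = {"Yes": 401, "No": 471}

total = sum(counts.values())  # 872 employees in total

# Share of each class in the target column.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# Yes: 46.0%
# No: 54.0%
```

The two classes are fairly balanced, so accuracy remains a usable metric alongside F1 and AUC.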
FOREIGN
HOLIDAY PACKAGE
The plot shows that the employees who opted for the holiday package have salaries below 150000.
HOLIDAY PACKAGE VS YOUNG CHILDREN
AGE VS SALARY VS HOLIDAY PACKAGE
Employees aged 50 to 60 seem not to take the holiday package, whereas employees aged 30
to 50 with a salary below 50000 opted for it more often.
YOUNG CHILDREN VS AGE VS HOLIDAY PACKAGE
OLDER CHILDREN VS AGE VS HOLIDAY_PACKAGE
BIVARIATE ANALYSIS
DATA DISTRIBUTION
There is no strong correlation between the variables, and the data appear to be roughly normally
distributed. The distributions do not differ much between the two holiday package classes; no two
clearly separate distributions are visible in the data.
There is no multicollinearity in the data.
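One way to back up a multicollinearity check is to list every pair of numeric columns whose absolute correlation exceeds a threshold. The sketch below assumes pandas and uses made-up demo data; the helper name high_correlation_pairs, the 0.8 threshold and the salary_copy column are illustrative choices, not taken from the report.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Made-up demo data (NOT the project dataset): three independent columns
# plus one deliberately redundant column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Salary": rng.normal(50000, 15000, 200),
    "age": rng.normal(40, 10, 200),
    "educ": rng.normal(9, 3, 200),
})
demo["salary_copy"] = demo["Salary"] * 1.01  # perfectly correlated with Salary

pairs = high_correlation_pairs(demo)
print(pairs)  # only the (Salary, salary_copy) pair should be flagged
```

An empty list on the real data would support the statement that no pair of predictors is strongly correlated.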
TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
The dataset contains outliers. Because LDA relies on numerical estimates (class means and a shared
covariance matrix), treating the outliers should help the model perform better.
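The report does not state which treatment was applied. A common choice, shown here purely as an assumption, is to cap values at the IQR whiskers (1.5 times the interquartile range beyond the quartiles). The function name cap_outliers_iqr and the sample values are illustrative.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 matches boxplot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Made-up salaries with one extreme value (not taken from the dataset).
salary = pd.Series(
    [30000, 35000, 40000, 42000, 45000, 50000, 250000], dtype="float64"
)
capped = cap_outliers_iqr(salary)

print(capped.max())  # 62500.0 (Q3 = 47500, IQR = 10000, upper whisker = 62500)
```

Capping keeps all 872 rows, which matters for a dataset this small; dropping outlier rows would be the alternative.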
AFTER OUTLIER TREATMENT
Note: the answers to 2.2 and 2.3 are given together below.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
Encoding the string-valued columns as numeric codes is needed so that the logistic regression model can use them.
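A minimal sketch of the encoding and the 70:30 split, assuming pandas and scikit-learn. The column names follow the outputs shown earlier; the rows are made up, and the random_state value and the stratified split are assumptions rather than settings taken from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only (not the project data).
demo = pd.DataFrame({
    "Holliday_Package": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "Salary": [48000, 37000, 58000, 66000, 42000, 31000, 75000, 52000, 39000, 61000],
    "age": [30, 45, 46, 31, 44, 42, 51, 57, 33, 39],
    "foreign": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Encode the two string columns as 0/1 codes.
for col in ["Holliday_Package", "foreign"]:
    demo[col] = demo[col].str.lower().map({"no": 0, "yes": 1})

X = demo.drop(columns="Holliday_Package")
y = demo["Holliday_Package"]

# 70:30 split; stratify keeps the Yes/No ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)  # (7, 3) (3, 3)
```

No scaling step appears here, in line with the instruction in 2.2.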
Grid search is used to find the best solver and hyperparameters for the logistic regression
model.
Grid search selects the liblinear solver, which is suitable for small datasets.
The tolerance and penalty were also chosen by grid search.
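The search described above can be sketched with scikit-learn's GridSearchCV. The grid values, the 3-fold cross-validation, the F1 scoring and the synthetic data from make_classification are all assumptions for illustration; the report does not list the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded project data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Two sub-grids, because the l1 penalty is not supported by every solver.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "tol": [1e-4, 1e-5]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "tol": [1e-4, 1e-5]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=10000), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)

print(search.best_params_)          # chosen solver, penalty and tolerance
print(round(search.best_score_, 3))  # mean cross-validated F1 of the best setting
best_model = search.best_estimator_  # refitted on the full training split
```

The best_estimator_ attribute is the model to carry forward into the train and test evaluation.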
Predictions on the training data:
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Final Model: Compare Both the models and write inference which model is best/optimized.
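The metrics requested in 2.3 can be collected for both models in one loop. This is a sketch on synthetic data, assuming scikit-learn; the results dictionary and the model settings are illustrative, and the numbers it prints are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded project data.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "LDA": LinearDiscriminantAnalysis(),
}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, (X_part, y_part) in splits.items():
        pred = model.predict(X_part)              # hard labels at the 0.5 cut-off
        prob = model.predict_proba(X_part)[:, 1]  # probability of the positive class
        results[(name, split)] = {
            "accuracy": accuracy_score(y_part, pred),
            "auc": roc_auc_score(y_part, prob),
            "confusion": confusion_matrix(y_part, pred),
        }
        print(name, split, results[(name, split)]["accuracy"])
```

sklearn.metrics.roc_curve returns the false and true positive rates needed to draw the ROC curve from the same probabilities. Similar train and test scores for a model suggest it is not overfitting.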
CONFUSION MATRIX FOR TEST DATA
AUC, ROC CURVE FOR TEST DATA
LDA
MODEL SCORE
MODEL SCORE
CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1
SCORE
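A cut-off sweep of this kind can be sketched as follows: predicted probabilities from LDA are turned into labels at each cut-off from 0.1 to 0.9, and accuracy and F1 are recorded for each. The data are synthetic and the choice to rank by F1 is an assumption; the report does not say which of the two scores decided the final cut-off.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the encoded project data.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=1
)
lda = LinearDiscriminantAnalysis().fit(X, y)
prob = lda.predict_proba(X)[:, 1]  # probability of the positive class

rows = []
for cutoff in [i / 10 for i in range(1, 10)]:  # 0.1, 0.2, ..., 0.9
    pred = (prob > cutoff).astype(int)         # label 1 when probability exceeds the cut-off
    rows.append((
        cutoff,
        accuracy_score(y, pred),
        f1_score(y, pred, zero_division=0),
    ))

for cutoff, acc, f1 in rows:
    print(f"cut-off {cutoff:.1f}: accuracy {acc:.3f}, F1 {f1:.3f}")

best = max(rows, key=lambda r: r[2])  # cut-off with the highest F1
print("best cut-off by F1:", best[0])
```

A cut-off chosen this way should be confirmed on the test set, because tuning it on the training data alone can overstate the gain.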
AUC AND ROC CURVE
Comparing the two models, we find that their results are the same, but LDA works better when the
target variable is categorical.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
The business problem was to predict whether an employee would opt for a holiday package or not.
For this problem we made predictions with both logistic regression and linear discriminant
analysis, and the two models gave the same results.
The EDA clearly indicates that people aged above 50 are not much interested in holiday packages,
whereas people aged 30 to 50 generally opt for them. Employees aged 50 to 60 seem not to take
the holiday package, while employees aged 30 to 50 with a salary below 50000 opted for it
more often.
The important factors deciding the predictions are salary, age and educ.
Recommendations
1. To improve uptake among employees aged above 50, we can offer packages to religious destinations.
2. For employees earning more than 150000, we can offer vacation holiday packages.
3. For employees with a larger number of older children, we can offer packages to holiday
vacation destinations.