
Project Submission Predictive Modelling - Logistic Regression and LDA

The document discusses a case study using logistic regression and LDA to predict whether employees will opt for a holiday package based on their characteristics. Key factors influencing the predictions included salary, age, education level, and number of children. Both models performed similarly with an accuracy around 80%. Based on the analysis, recommendations were made to target packages towards older employees and higher earners.

Uploaded by

ankitbhagat

Predictive Modelling Project Report
(Logistic Regression and LDA Case Study)

Module 5 DSBA part 2

Ankit Bhagat
Date of Submission: 28th Nov

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You
are provided details of 872 employees of a company. Among these employees, some
opted for the package and some didn't. You have to help the company in predicting
whether an employee will opt for the package or not on the basis of the information
given in the data set. Also, find out the important factors on the basis of which the
company will focus on particular employees to sell their packages.
Data Dictionary:

Variable Name       Description
Holiday_Package     Opted for holiday package (yes/no)
Salary              Employee salary
age                 Age in years
edu                 Years of formal education
no_young_children   Number of young children (younger than 7 years)
no_older_children   Number of older children
foreign             Foreigner (yes/no)

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.

Loading all the necessary libraries for model building.


Next, we read the head and tail of the dataset to check that the data has been loaded properly.

Checking Head

Checking Tail

SHAPE OF THE DATA (872, 8)

INFO

There are no null values in the dataset.
The columns are of integer and object type.
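
The checks above (shape, info, null values, describe) can be sketched with pandas as follows. This is a minimal illustration, not the report's original code: the small data frame is made up, and the column names are assumed from the data dictionary and the text.

```python
import pandas as pd

# Hypothetical stand-in for the holiday-package data (values are made up;
# column names are assumptions based on the data dictionary).
df = pd.DataFrame({
    "Holiday_Package": ["no", "yes", "no", "yes"],
    "Salary": [48412, 37207, 58022, 66503],
    "age": [30, 45, 46, 31],
    "educ": [8, 8, 9, 11],
    "no_young_children": [1, 0, 0, 2],
    "no_older_children": [1, 1, 0, 0],
    "foreign": ["no", "no", "no", "yes"],
})

print(df.head())   # first rows
print(df.tail())   # last rows
print(df.shape)    # (rows, columns)
df.info()          # column dtypes and non-null counts

# Count of missing values per column
null_counts = df.isnull().sum()
print(null_counts)

# Descriptive statistics for the numeric columns
summary = df.describe()
print(summary)
```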

DATA DESCRIBE

The dataset contains integer and continuous data.

Holiday package is our target variable.

Salary, age, educ, the number of young children, the number of older children, and whether the
employee is a foreigner are the attributes we have to examine to help the company predict whether
a person will opt for the holiday package or not.

Unique values in the categorical data
HOLLIDAY_PACKAGE: 2
Yes 401
No 471
Name: Holliday Package, dtype: int64
FOREIGN : 2
Yes 216
No 656
Name: foreign, dtype: int64
Percentage of target :

This split indicates that about 46% of employees (401 of 872) opted for the holiday package.
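
The share of each class can be recomputed from the counts listed above. The sketch below rebuilds the target column from those counts only (401 "yes", 471 "no"); the series name is an assumption.

```python
import pandas as pd

# Rebuild the target from the reported class counts: 401 "yes" and 471 "no".
target = pd.Series(["yes"] * 401 + ["no"] * 471, name="Holiday_Package")

# normalize=True returns proportions instead of raw counts
share = target.value_counts(normalize=True)
print(share.round(3))
```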

CATEGORICAL UNIVARIATE ANALYSIS

FOREIGN

HOLIDAY PACKAGE

HOLIDAY PACKAGE VS SALARY

Employees with a salary below 150000 appear to have always opted for the holiday package.

HOLIDAY PACKAGE VS AGE

HOLIDAY PACKAGE VS EDUC

HOLIDAY PACKAGE VS YOUNG CHILDREN

HOLIDAY PACKAGE VS OLDER CHILDREN

AGE VS SALARY VS HOLIDAY PACKAGE

Employees aged 50 to 60 seem not to take the holiday package, whereas employees aged 30 to 50
with a salary below 50000 have opted for it more often.

EDUC VS SALARY VS HOLIDAY PACKAGE

YOUNG CHILDREN VS AGE VS HOLIDAY PACKAGE

OLDER CHILDREN VS AGE VS HOLIDAY_PACKAGE

BIVARIATE ANALYSIS
DATA DISTRIBUTION

There is no strong correlation between the variables, and the data appear to be roughly normally
distributed. The distributions do not differ much between employees who opted for the holiday
package and those who did not; no two clearly distinct distributions are visible in the data.

There is no multicollinearity in the data.
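
One simple way to check for multicollinearity is to look at the pairwise correlation matrix of the numeric columns. The report does not state which check was used, so the following is only a sketch under that assumption, run on synthetic, independently drawn columns rather than the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns drawn independently of each other (not the real data).
num = pd.DataFrame({
    "Salary": rng.normal(47000, 15000, 500),
    "age": rng.integers(20, 62, 500),
    "educ": rng.integers(1, 21, 500),
})

# Pairwise Pearson correlations between the numeric columns
corr = num.corr()
print(corr.round(2))

# Mask the diagonal (always 1.0) and take the largest absolute correlation
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
max_abs_corr = off_diag.abs().max().max()
print(max_abs_corr)
```

A large off-diagonal value (commonly something like 0.8 or more is treated as a warning sign) would suggest that two predictors carry overlapping information.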
TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
We have outliers in the dataset. Because LDA estimates class means and a shared covariance matrix,
both of which are sensitive to extreme values, treating the outliers should help the model perform better.

AFTER OUTLIER TREATMENT

No outliers remain in the data; all of them have been treated.
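
The report does not say how the outliers were treated. A common choice is to cap values at the 1.5 x IQR whiskers of the box plot; the sketch below assumes that approach. The helper name `cap_outliers_iqr` and the salary values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at those bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)

# Made-up salaries with one extreme value at the end
salary = pd.Series(
    [30000, 35000, 40000, 42000, 45000, 50000, 236961], dtype=float
)
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because only 872 observations are available.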

Note: the answers to 2.2 and 2.3 are given together below.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).

ENCODING CATEGORICAL VARIABLE

Encoding converts the string-valued columns to numbers, which the logistic regression model requires as input.
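
A minimal sketch of the encoding and the 70:30 split, assuming the two string columns take the values "no" and "yes". The data frame, the column names, and `random_state` are illustrative assumptions, not taken from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame with two string columns and one numeric column
df = pd.DataFrame({
    "Holiday_Package": ["no", "yes"] * 10,
    "Salary": range(30000, 50000, 1000),
    "foreign": ["no", "no", "yes", "no"] * 5,
})

# Encode the string columns as integer codes: "no" -> 0, "yes" -> 1
for col in ["Holiday_Package", "foreign"]:
    df[col] = pd.Categorical(df[col], categories=["no", "yes"]).codes

X = df.drop(columns="Holiday_Package")   # predictors
y = df["Holiday_Package"]                # target

# 70% train, 30% test; random_state is fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```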

GRID SEARCH METHOD:

Grid search is used for logistic regression to find the optimal solver and its parameters.

Grid search selects the liblinear solver, which is suitable for small datasets.
The tolerance and penalty were also found using grid search.
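
A grid search over solver, penalty, and tolerance could look like the sketch below, using scikit-learn's `GridSearchCV`. The parameter grid, the `cv` and `scoring` settings, and the synthetic data are all assumptions; the report does not list the values it searched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic binary-classification data (not the real dataset)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hypothetical grid: both solvers support the l1 and l2 penalties
param_grid = {
    "solver": ["liblinear", "saga"],
    "penalty": ["l1", "l2"],
    "tol": [1e-4, 1e-5],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    cv=3,           # 3-fold cross-validation on the training data
    scoring="f1",
)
grid.fit(X, y)

print(grid.best_params_)   # best combination found on this data
best_model = grid.best_estimator_
```

The best combination depends on the data, so the result on the real dataset may differ from what this synthetic example selects.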
Predicting on the training data:

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Final Model: Compare Both the models and write inference which model is best/optimized.

CONFUSION MATRIX FOR TEST DATA

AUC, ROC CURVE FOR TEST DATA
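
The test-set metrics named in 2.3 (accuracy, confusion matrix, ROC curve, ROC_AUC score) can be computed along these lines. The data is synthetic and the model settings are assumptions, so the numbers it prints are not the report's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, roc_auc_score, roc_curve
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic binary-classification data (not the real dataset)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.8, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(solver="liblinear").fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

cm = confusion_matrix(y_test, y_pred)   # rows: actual, columns: predicted
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)     # uses probabilities, not labels

# Points of the ROC curve; plotting fpr against tpr gives the curve itself
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

print(cm)
print(acc, auc)
```

The same calls on `X_train` and `y_train` give the training-set metrics, which makes it easy to compare the two sets for overfitting.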

LDA

PREDICTING THE VARIABLE

MODEL SCORE

CLASSIFICATION REPORT TRAIN DATA

MODEL SCORE

CLASSIFICATION REPORT TEST DATA

CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1
SCORE
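
The cut-off search can be sketched as a loop over candidate thresholds applied to the LDA class probabilities. The range of cut-offs (0.1 to 0.9), the synthetic data, and the choice to pick the best cut-off by F1 score are assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(2)

# Synthetic binary-classification data (not the real dataset)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.8, size=300) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
prob_yes = lda.predict_proba(X)[:, 1]   # probability of the positive class

results = []
for cutoff in np.linspace(0.1, 0.9, 9):
    # Predict 1 whenever the probability exceeds the cut-off
    pred = (prob_yes > cutoff).astype(int)
    results.append((
        round(float(cutoff), 1),
        accuracy_score(y, pred),
        f1_score(y, pred, zero_division=0),
    ))

for cutoff, acc, f1 in results:
    print(f"cut-off {cutoff:.1f}: accuracy {acc:.3f}, F1 {f1:.3f}")

# Pick the cut-off with the highest F1 score
best_cutoff, best_acc, best_f1 = max(results, key=lambda r: r[2])
print(best_cutoff)
```

The default cut-off used by `predict` corresponds to 0.5, so any other value trades precision against recall.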

AUC AND ROC CURVE

Comparing the two models, we find that their results are the same, but LDA works better when the
target variable is categorical.

2.4 Inference: Based on these predictions, what are the insights and recommendations?

The business problem was to predict whether an employee would opt for a holiday package or not.
For this problem we made predictions using both logistic regression and linear discriminant
analysis, and both gave the same results.

The EDA indicates that people aged above 50 are not much interested in holiday packages, whereas
people aged 30 to 50 generally opt for them. In particular, employees aged 50 to 60 seem not to
take the holiday package, while employees aged 30 to 50 with a salary below 50000 have opted for
it more often.

The important factors deciding the predictions are salary, age and educ.
Recommendations
1. To improve uptake among employees aged above 50, we can offer packages to religious destinations.
2. For people earning more than 150000, we can offer vacation holiday packages.
3. For employees with a larger number of older children, we can offer packages to holiday
vacation destinations.
