DVT Project

Uploaded by

Monica

| MONICA SHARMA

MACHINE LEARNING PROJECT REPORT

MACHINE LEARNING PROJECT REPORT
- MONICA SHARMA


TABLE OF CONTENTS
1. Problem 1
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check
shape, data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables
1.2 Data Pre-processing - Prepare the data for modelling: Outlier Detection (treat, if needed) -
Feature Engineering / drop redundant features (if needed) - Encode the data - Train-test split
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check
the performance of the models across train and test set using different metrics and comment on the
same
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the
model (minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier -
Comment on model performance after tuning the model
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same. Note: AdaBoost or
GradientBoosting classifier can be built
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the
model (minimum 2 parameters to be tuned) - Comment on model performance after tuning the
model
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best
model with proper rationale - Conclude with the key takeaways (actionable insights and
recommendations) for the business
2. Problem 2
2.1 Data Preparation - Data preparation and exploratory data analysis - Pick out the Deal
(Dependent Variable) and Description columns into a separate dataframe - Create two corpora, one
with those who secured a deal and the other with those who did not secure a deal - Find the number
of characters for both corpora - Text preprocessing on the corpus which secured the deal
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame
2.1.2 Create two corpora - one with those who secured a deal and the other with those who
did not secure a deal
2.1.3 Find the number of characters for both corpora
2.1.4 Text pre-processing on the corpus which secured the deal
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a
deal - Provide insights from the preprocessed data
2.3 Business Report Quality - Adhere to the business report checklist


LIST OF FIGURES
Figure 1-1: No. of male & female using different transport modes
Figure 1-2: Distribution of Age
Figure 1-3: Distribution of Work Experience
Figure 1-4: Observation on Gender
Figure 1-5: Distribution on preferred mode of transport
Figure 1-6: Gender Impact on mode of transport
Figure 1-7: Work Exp Impact on mode of transport
Figure 1-8: Age Impact on mode of transport
Figure 1-9: Outlier Plot

LIST OF TABLES
Table 1-1: Data Information
Table 1-2: Duplicate Value information
Table 1-3: Shape of the data
Table 1-4: Statistical Information of the dataset
Table 1-5: Preferred mode of Transport wrt Gender
Table 1-6: Multivariate Analysis (Heat Map)
Table 2-1: Head of the dataset (Part 1)
Table 2-2: Head of the dataset (Part 2)
Table 2-3: Shape of the data
Table 2-4: Dataset type
Table 2-5: Dataset Information
Table 2-6: Null values of the dataset


1. Problem 1
Context
You are in discussions with ABC Consulting about providing transport for its employees. For this
purpose, you are tasked with understanding how the employees of ABC Consulting currently prefer to
commute between home and office. Based on parameters such as age, salary and work
experience given in the data set ‘Transport.csv’, you are required to predict the preferred mode of
transport. The project requires you to build several Machine Learning models and compare them so that
a final model can be selected.

Objective
The objective is to build various Machine Learning models on this data set and, based on their
evaluation metrics, decide which model should be finalized for predicting the mode of transport
chosen by an employee.

Data Dictionary
Age: Age of the Employee in Years

Gender: Gender of the Employee

Engineer: 1 for Engineer, 0 for Non-Engineer

MBA: 1 for MBA, 0 for Non-MBA

Work Exp: Experience in years

Salary: Salary in Lakhs per Annum

Distance: Distance in km from Home to Office

license: 1 if the employee has a driving licence, 0 if not

Transport: Mode of Transport


1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
Data is imported and the following are the observations:

Table 1-1:Data Information

Table 1-2:Duplicate Value information

Table 1-3:Shape of the data

• There are 444 employee records.
• There are 9 variables in total: Transport is the dependent variable and the other 8 are independent variables.
• There are no duplicate records in the data.
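
The checks above can be reproduced with a few pandas calls. The sketch below is illustrative only: it builds a small hypothetical DataFrame using the column names from the data dictionary instead of reading ‘Transport.csv’, and the helper name summarize_frame is an assumption, not the code used for this report.

```python
import pandas as pd

# Hypothetical toy rows using the column names from the data dictionary;
# the real analysis would start from pd.read_csv("Transport.csv").
toy = pd.DataFrame({
    "Age": [28, 23, 29, 31],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Engineer": [0, 1, 1, 1],
    "MBA": [0, 0, 0, 1],
    "Work Exp": [4, 2, 7, 9],
    "Salary": [14.3, 8.3, 13.4, 20.5],
    "Distance": [3.2, 3.3, 4.1, 12.5],
    "license": [0, 0, 0, 1],
    "Transport": ["Public Transport", "Public Transport",
                  "Public Transport", "Private Transport"],
})


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return the basic facts reported in this section for any DataFrame."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "n_duplicates": int(df.duplicated().sum()),  # fully identical rows
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


summary = summarize_frame(toy)
print(summary)
```

Run against the real data set, the same helper would be expected to report 444 rows, 9 columns and 0 duplicates, matching the observations above.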

Statistical Summary


Table 1-4:Statistical Information of the dataset

• 50% of the employees have work experience of 5 years or less and 75% of the employees have
work experience of 8 years or less.
• The average employee age is 27.75 years.
• The average salary of an employee is 16.23 lakhs per annum.
• 75% of the employees have a travel distance of 13 km or less.
Figure 1-1: No. of male & female using different transport modes.

• Out of 444 records, 316 are ‘Male’ and the remaining 128 are ‘Female’.
• 300 employees travel by Public transport and 144 by Private transport.


Figure 1-2: Distribution of Age

• The distribution of age is right-skewed. From the plot it can be inferred that most of the employees
are aged between 23 and 30 years.

Figure 1-3: Distribution of Work Experience

• The Work Exp variable appears right-skewed, with most of the employees having work experience
between 0 and 8 years.


Figure 1-4: Observation on Gender

- As can be observed, 71.2% of the employees in the dataset are male and 28.8% are female.

Figure 1-5: Distribution on preferred mode of transport

- 300 people use Public transport and the remaining 144 prefer Private transport.


Table 1-5:Preferred mode of Transport wrt Gender

Multivariate Analysis

Table 1-6:Multivariate Analysis (Heat Map)

- As can be observed from the heat map, Work Exp is highly correlated with Salary and Age.
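
The numbers behind such a heat map are pairwise Pearson correlations. The sketch below shows how they could be computed with pandas; the values are hypothetical and chosen only so that the example runs, they are not the report's data.

```python
import pandas as pd

# Hypothetical values, chosen only so the example runs; not the report's data.
toy = pd.DataFrame({
    "Age": [23, 25, 27, 30, 34, 38],
    "Work Exp": [1, 2, 5, 8, 11, 16],
    "Salary": [7.5, 9.0, 12.5, 15.0, 28.0, 40.0],
    "Distance": [5.1, 12.3, 8.4, 10.2, 9.9, 14.0],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)

# Correlation of Work Exp with the other variables, strongest first.
work_exp_corr = corr["Work Exp"].drop("Work Exp").sort_values(ascending=False)
print(work_exp_corr.round(2))
```

A plotting library can then colour the corr matrix to produce the heat map itself.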


Figure 1-6: Gender Impact on mode of transport

- Females tend to prefer Private transport more than males do.

Figure 1-7: Work Exp Impact on mode of transport

- People with higher work experience prefer Private transport over Public
transport.


Figure 1-8: Age Impact on mode of transport

- People aged over 30 generally prefer Private transport over Public
transport.


1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) -
Feature Engineering / drop redundant features (if needed) - Encode the data - Train-test split

Outlier Detection

Figure 1-9: Outlier Plot

• There are outliers present. However, for now we will keep the outliers and proceed with model
building.
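
A box plot marks as outliers the points that fall more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly; the function name iqr_outlier_mask and the salary values are hypothetical and only illustrate the technique.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical salaries (lakhs per annum); the last value is deliberately extreme.
salary = pd.Series([8.5, 9.0, 11.5, 13.0, 14.5, 15.0, 16.0, 57.0], name="Salary")
mask = iqr_outlier_mask(salary)
print(salary[mask].tolist())
```

Keeping the flagged rows, as done here, is a reasonable choice for tree-based ensembles, which split on thresholds and are not distorted by extreme values in the way distance-based models can be.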

Data Split

• Shape of training set: (310, 8)
• Shape of test set: (134, 8)
• Percentage of classes in training set:

1 0.674194
0 0.325806
Name: Transport, dtype: float64

• Percentage of classes in test set:

1 0.679104
0 0.320896
Name: Transport, dtype: float64
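
The shapes and class proportions above are what a stratified 70/30 split of 444 records produces. The sketch below is a hedged reconstruction on hypothetical random data: the 0/1 encoding (Public transport = 1, which matches the roughly 67% share of class 1), test_size=0.30, stratify and random_state are all assumptions, since the report does not state the settings used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 444  # number of records reported in section 1.1

# Hypothetical data with the report's column names; feature values are random.
df = pd.DataFrame({
    "Age": rng.integers(18, 44, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Engineer": rng.integers(0, 2, n),
    "MBA": rng.integers(0, 2, n),
    "Work Exp": rng.integers(0, 25, n),
    "Salary": rng.uniform(6, 60, n).round(1),
    "Distance": rng.uniform(3, 24, n).round(1),
    "license": rng.integers(0, 2, n),
    "Transport": ["Public Transport"] * 300 + ["Private Transport"] * 144,
})

# Encode the two text columns as 0/1 (assumed mapping: Male = 1, Public = 1).
df["Gender"] = (df["Gender"] == "Male").astype(int)
df["Transport"] = (df["Transport"] == "Public Transport").astype(int)

X = df.drop(columns="Transport")
y = df["Transport"]

# Assumed settings: 70/30 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
```

Stratifying on the target keeps the share of each class nearly identical in the training and test sets, which is the pattern visible in the percentages reported above.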


1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check
the performance of the models across train and test set using different metrics and comment on
the same.

Model evaluation criterion:

The model can make two kinds of wrong predictions:

1. The model predicts that the public mode of transport is preferred, but the employee prefers the
private mode.
2. The model predicts that the private mode of transport is preferred, but the employee prefers the
public mode.

Which case is more important?

Both are important, since the aim is to correctly estimate the number of employees who prefer private transport.

How to reduce the losses?

• The F1 score can be used as the evaluation metric: the greater the F1 score, the higher
the chances of minimizing both False Negatives and False Positives.
• We will use balanced class weights so that the model focuses equally on both classes.

We have created functions to calculate different metrics and confusion matrix so that we don't have to
use the same code repeatedly for each model.

• The model_performance_classification_sklearn function will be used to check the
performance of the models.
• The confusion_matrix_sklearn function will be used to plot the confusion matrix.
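
The report names the helper but does not show it. A minimal sketch of what model_performance_classification_sklearn might look like is given below; the signature, the returned columns and the tiny demonstration data are assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )


# Tiny hypothetical example: one feature that separates the classes perfectly.
X_demo = pd.DataFrame({"Salary": [8.0, 9.5, 11.0, 35.0, 42.0, 50.0]})
y_demo = pd.Series([1, 1, 1, 0, 0, 0])
demo_model = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

perf = model_performance_classification_sklearn(demo_model, X_demo, y_demo)
print(perf)
```

Calling the same function once with the training data and once with the test data gives the pair of metric tables used to judge overfitting in the sections that follow.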

a. Bagging - Model Building


• Checking model performance on training set

• Checking model performance on test set


- As we can see, the model is overfitting here. We will try to tune the model and reduce
overfitting.

b. Random Forest - Model Building

• Checking model performance on training set

• Checking model performance on test set


- Similar to the bagging model, the random forest model is overfitting
here. We will try to tune the model and reduce overfitting.
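
The overfitting check for both untuned models amounts to comparing a metric on the training set with the same metric on the test set. The sketch below does this with F1 on hypothetical synthetic data generated by make_classification; the data, the default hyperparameters and the random seeds are assumptions and will not reproduce the report's figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the encoded transport data (444 rows, 8 features).
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4,
    weights=[0.32, 0.68], flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

models = {
    # For bagging, the balanced class weights are set on the base decision tree.
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(class_weight="balanced", random_state=1),
        random_state=1,
    ),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train_f1": f1_score(y_train, model.predict(X_train)),
        "test_f1": f1_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

A training F1 close to 1 combined with a clearly lower test F1 is the signature of overfitting described above.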


1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment
on model performance after tuning the model.

a. Hyperparameter Tuning – Bagging Classifier


• Checking model performance on test set


- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.

b. Hyperparameter Tuning – Random Forest Classifier


- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
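
One common way to tune two or more parameters is a cross-validated grid search scored on F1. The sketch below shows the idea for a random forest on hypothetical synthetic data; the parameter grid, the number of trees and the number of folds are assumptions, as the report does not list the values that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in for the encoded transport data.
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4,
    weights=[0.32, 0.68], flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical grid: two parameters that limit how complex each tree can get.
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=1),
    param_grid,
    scoring="f1",  # same metric as the evaluation criterion
    cv=3,
)
search.fit(X_train, y_train)

tuned_model = search.best_estimator_  # refitted on the full training set
print(search.best_params_)
print(round(search.best_score_, 3))
```

Restricting depth and leaf size is a typical response to overfitting: it trades some training performance for a smaller gap between training and test metrics.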


1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
a. Boosting - Model Building and Hyperparameter Tuning
• Checking model performance on training set

- We can see 206 true positives, 3 false negatives, 32 false
positives and 69 true negatives.

• Checking model performance on test set

- We can see 87 true positives, 4 false negatives, 17 false
positives and 26 true negatives.


1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.

- On the training set (310 records), the tuned model gives 204 true positives, 5 false negatives,
40 false positives and 61 true negatives.


- On the test set (134 records), the tuned model gives 86 true positives, 5 false negatives,
19 false positives and 24 true negatives.
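
The confusion-matrix counts reported in sections 1.5 and 1.6 are enough to recompute the standard metrics by hand. The sketch below does that arithmetic; the function name metrics_from_counts and the dictionary labels are illustrative, and the positive class is assumed to be class 1.

```python
def metrics_from_counts(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(accuracy, 3),
        "recall": round(recall, 3),
        "precision": round(precision, 3),
        "f1": round(f1, 3),
    }


# Counts as reported in the text, in the order (TP, FN, FP, TN).
reported = {
    "Boosting (train)": (206, 3, 32, 69),
    "Boosting (test)": (87, 4, 17, 26),
    "Tuned boosting (train)": (204, 5, 40, 61),
    "Tuned boosting (test)": (86, 5, 19, 24),
}

results = {name: metrics_from_counts(*counts) for name, counts in reported.items()}
for name, metrics in results.items():
    print(name, metrics)
```

On these counts the test F1 works out to roughly 0.89 before tuning and 0.88 after, so on this particular split tuning does not appear to have improved the boosting model.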


1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and
recommendations) for the business.

Observation
- Based on the above results for all the models, it can be observed that the AdaBoost classifier
will be able to provide better predictions. Compared to the other models, the AdaBoost
classifier shows better accuracy and precision.


- Looking at the feature importance of the AdaBoost classifier model, the three most
important features are Salary, Distance and Age.

Actionable Insights and Recommendations:

- Important variables are Salary, Age, Work Exp and Distance.
- Age and Work Exp are correlated.
- People with higher salaries prefer to use Private transport. However, there are outliers
in the dataset.
- People aged over 30 generally prefer Private transport over
Public transport.
- People with higher work experience tend to prefer the Private mode of transport.
There are outliers with higher work experience in the Public transport data.


2. Problem 2
Context

A dataset of Shark Tank episodes is made available. It contains pitches made by 495 entrepreneurs
to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.

2.1 Data Preparation - Data preparation and exploratory data analysis - Pick out the
Deal (Dependent Variable) and Description columns into a separate dataframe -
Create two corpora, one with those who secured a deal and the other with those
who did not secure a deal - Find the number of characters for both corpora -
Text preprocessing on the corpus which secured the deal

a. Data Description

Table 2-1: Head of the dataset (Part 1)

Table 2-2: Head of the dataset (Part 2)

Table 2-3: Shape of the data

Table 2-4: Dataset type

Table 2-5: Dataset Information

Table 2-6: Null values of the dataset

- There are 495 rows and 19 columns.
- The dataset contains 2 Boolean, 5 integer and 12 object columns.
- There are null values present in the entrepreneur and website columns. However, as we will
not be using these columns for our study, we can leave them as they are.


2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
- The new dataframe “df2” has 495 rows and 2 columns, i.e., Deal and Description.

2.1.2 Create two corpora - one with those who secured a deal and the other with those who did not
secure a deal
We created two corpora: Corpus 1 (deal secured) and Corpus 2 (deal not secured).

2.1.3 Find the number of characters for both corpora

- The number of characters in the corpus which secured the deal is 45002.
- The number of characters in the corpus which did not secure the deal is 47184.
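
The three steps above (column selection, splitting into two corpora, counting characters) can be sketched in a few lines of pandas. The rows below are hypothetical, the column names follow the spelling used in this report and may differ in case from the actual file, and joining descriptions with a single space is an assumption that affects the character totals.

```python
import pandas as pd

# Hypothetical rows; the real dataset has 495 rows and 19 columns.
df = pd.DataFrame({
    "Deal": [True, False, True, False],
    "Description": [
        "Organic dog treats sold online.",
        "A smart lock for bikes.",
        "Custom-fit earbuds.",
        "Reusable party cups.",
    ],
    "Episode": [1, 1, 2, 2],
})

# Keep only the target and the text column.
df2 = df[["Deal", "Description"]].copy()

# One corpus per outcome, each joined into a single string.
deal_corpus = " ".join(df2.loc[df2["Deal"], "Description"])
no_deal_corpus = " ".join(df2.loc[~df2["Deal"], "Description"])

# Number of characters in each corpus.
print(len(deal_corpus), len(no_deal_corpus))
```

Because Deal is a Boolean column, it can be used directly as a row mask, and its negation selects the pitches that did not secure a deal.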


2.1.4 Text pre-processing on the corpus which secured the deal

We will apply text preprocessing to the corpus of those who secured the deal.

a. Removal of http links

b. De-contraction of words

c. Tokenization

d. Lowercasing: Lowercasing all the text data, although commonly overlooked, is one of the
simplest and most effective forms of text preprocessing.


e. Removal of punctuation

f. Removal of stop words:

- Stop words are a set of commonly used words in a language.
- Examples of stop words in English are “a”, “the”, “is”, “are” etc. The intuition behind
removing stop words is that, by dropping low-information words from the text, we can focus on
the important words instead.

g. Lemmatization

- Lemmatization is, on the surface, very similar to stemming: the goal is to
remove inflections and map a word to its root form.

h. Normalization (aggregating the pre-processing functions into one):
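
A single normalization function chaining these steps might look like the sketch below. It is not the report's code: the contraction map and stop-word set are small illustrative subsets, lowercasing is applied before de-contraction so that the map only needs lowercase keys, and lemmatization is left as an optional callable because a real lemmatizer (for example from NLTK) needs external resources.

```python
import re
import string

# Small illustrative subsets; a real pipeline would use fuller lists.
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "n't": " not",
    "'re": " are", "'ll": " will", "'ve": " have",
}
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "to", "of", "for", "in", "it"}


def normalize(text, lemmatize=None):
    """Apply the pre-processing steps in order and return a list of tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove links
    text = text.lower()                                    # lowercase
    for short, full in CONTRACTIONS.items():               # de-contract
        text = text.replace(short, full)
    tokens = text.split()                                  # tokenize on whitespace
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]          # strip punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # drop stop words
    if lemmatize is not None:                              # optional lemmatizer
        tokens = [lemmatize(t) for t in tokens]
    return tokens


sample = "We can't wait: visit https://example.com for the BEST dog treats!"
tokens = normalize(sample)
print(tokens)
```

Passing any callable that maps a token to its root form as the lemmatize argument completes the pipeline with the lemmatization step.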


2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a
deal - Provide insights from the preprocessed data.

- From the word cloud, we can say that entrepreneurs who secured a deal used
words like ‘product’, ‘make’, ‘design’, ‘online’, ‘offer’ and ‘need’, along with other positive and
product-descriptive words, to attract the customer’s interest, hence securing the deal.
- Hence, to improve the chances of securing a deal, one should make more use of words which
attract the customer’s interest, including more product- and design-oriented words.
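
A word cloud sizes each word by how often it occurs, so the underlying quantity is a simple frequency count. The sketch below computes it with the standard library; the token list is hypothetical and stands in for the pre-processed deal-secured corpus.

```python
from collections import Counter

# Hypothetical pre-processed tokens from the deal-secured corpus.
tokens = [
    "product", "make", "design", "product", "online",
    "offer", "product", "design", "need",
]

# Count occurrences of each word; larger counts mean larger words in the cloud.
frequencies = Counter(tokens)
top_words = frequencies.most_common(2)
print(top_words)
```

Inspecting the counts directly is a useful complement to the picture, since it shows how large the differences between the most frequent words actually are.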

2.3 Business Report Quality - Adhere to the business report checklist
