
Data Science and Business Analytics
Ensemble Techniques: Project Debrief

Visa Approval

EasyVisa Project

Problem Statement
Context:
Business communities in the United States are facing high demand for human resources, but one of the
constant challenges is identifying and attracting the right talent, which is perhaps the most important element in
remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals
both locally as well as abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to
work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on
their wages or working conditions by ensuring US employers' compliance with statutory requirements when
they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office
of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United
States and grants certifications in those cases where employers can demonstrate that there are not sufficient
US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the
area of intended employment.

Objective:
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and
permanent labor certifications. This was a nine percent increase in the overall number of processed applications
from the previous year. The process of reviewing every case is becoming a tedious task as the number of
applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in
shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm EasyVisa for data-
driven solutions. You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a
classification model:

Facilitate the process of visa approvals.


Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on
the drivers that significantly influence the case status.

Data Description
The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

case_id: ID of each visa application


continent: Continent of the employee
education_of_employee: Education level of the employee
has_job_experience: Does the employee have any job experience? Y = Yes; N = No
requires_job_training: Does the employee require any job training? Y = Yes; N = No
no_of_employees: Number of employees in the employer's company
yr_of_estab: Year in which the employer's company was established
region_of_employment: Information of foreign worker's intended region of employment in the US.
prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of
intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not
underpaid compared to other workers offering the same or similar service in the same area of employment.
unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
case_status: Flag indicating if the Visa was certified or denied

Note: This is a sample solution for the project. Projects will NOT be
graded on the basis of how well the submission matches this sample
solution. Projects will be graded on the basis of the rubric only.

Importing necessary libraries
In [ ]:
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data


import numpy as np
import pandas as pd

# Library to split data


from sklearn.model_selection import train_test_split

# Libraries to help with data visualization


import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns


pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)

# Libraries for different ensemble classifiers


from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
AdaBoostClassifier,
GradientBoostingClassifier,
StackingClassifier,
)

from xgboost import XGBClassifier


from sklearn.tree import DecisionTreeClassifier

# Libraries to get different metric scores


from sklearn import metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
precision_score,
recall_score,
f1_score,
)

# To tune different models


from sklearn.model_selection import GridSearchCV

Import Dataset
In [ ]:
visa = pd.read_csv("EasyVisa.csv")

In [ ]:
# copying data to another variable to avoid any changes to original data
data = visa.copy()

Overview of the Dataset

View the first and last 5 rows of the dataset

In [ ]:
data.head()

Out[ ]:

case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_em

0 EZYV01 Asia High School N N 14513 2007

1 EZYV02 Asia Master's Y N 2412 2002

2 EZYV03 Asia Bachelor's N Y 44444 2008

3 EZYV04 Asia Bachelor's N N 98 1897

4 EZYV05 Africa Master's Y N 1082 2005

In [ ]:
data.tail()
Out[ ]:

case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab regio

25475 EZYV25476 Asia Bachelor's Y Y 2601 2008

25476 EZYV25477 Asia High School Y N 3274 2006

25477 EZYV25478 Asia Master's Y N 1121 1910

25478 EZYV25479 Asia Master's Y Y 1918 1887

25479 EZYV25480 Asia Bachelor's Y N 3195 1960

Understand the shape of the dataset

In [ ]:
data.shape
Out[ ]:
(25480, 12)

The dataset has 25480 rows and 12 columns

Check the data types of the columns for the dataset

In [ ]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_id 25480 non-null object
1 continent 25480 non-null object
2 education_of_employee 25480 non-null object
3 has_job_experience 25480 non-null object
4 requires_job_training 25480 non-null object
5 no_of_employees 25480 non-null int64
6 yr_of_estab 25480 non-null int64
7 region_of_employment 25480 non-null object
8 prevailing_wage 25480 non-null float64
9 unit_of_wage 25480 non-null object
10 full_time_position 25480 non-null object
11 case_status 25480 non-null object
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB

no_of_employees, yr_of_estab, and prevailing_wage are numeric features, while the rest are of object type.
There are no null values in the dataset.

In [ ]:
# checking for duplicate values
data.duplicated().sum()
Out[ ]:
0

There are no duplicate values in the data.

Exploratory Data Analysis

Let's check the statistical summary of the data

In [ ]:
data.describe().T
Out[ ]:

count mean std min 25% 50% 75% max

no_of_employees 25480.0 5667.043210 22877.928848 -26.0000 1022.00 2109.00 3504.0000 602069.00

yr_of_estab 25480.0 1979.409929 42.366929 1800.0000 1976.00 1997.00 2005.0000 2016.00

prevailing_wage 25480.0 74455.814592 52815.942327 2.1367 34015.48 70308.21 107735.5125 319210.27

Observations:

The range of the number of employees in a company is huge. There are some anomalies in the data, as we
can see that the minimum number of employees is equal to -26, which is not possible. We will have to fix
this.
The year of establishment of companies ranges from 1800 to 2016, which seems fine.
The average prevailing wage is 74455.81. There's also a very large difference between the 75th percentile and the
maximum value, which indicates that there might be outliers present in this column (a quick check is sketched below).
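As a rough check on that last point (a minimal sketch using the 1.5*IQR rule; it assumes the data DataFrame created above), the number of unusually high prevailing wages can be counted as follows:

# counting prevailing_wage values above the 1.5*IQR upper fence
q1, q3 = data["prevailing_wage"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print((data["prevailing_wage"] > upper_fence).sum(), "values lie above the upper fence")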

Fixing the negative values in the number of employees column

In [ ]:
data.loc[data["no_of_employees"] < 0].shape
Out[ ]:
(33, 12)

We will consider the 33 observations as data entry errors and take the absolute values for this column.

In [ ]:
# taking the absolute values for number of employees
data["no_of_employees"] = abs(data["no_of_employees"])

Let's check the count of each unique category in each of the categorical variables

In [ ]:
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)

# Printing number of count of each unique value in each column


for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)

EZYV21731 1
EZYV22853 1
EZYV24045 1
EZYV20282 1
EZYV17175 1
..
EZYV2953 1
EZYV990 1
EZYV9352 1
EZYV24207 1
EZYV8395 1
Name: case_id, Length: 25480, dtype: int64
--------------------------------------------------
Asia 16861
Europe 3732
North America 3292
South America 852
Africa 551
Oceania 192
Name: continent, dtype: int64
--------------------------------------------------
Bachelor's 10234
Master's 9634
High School 3420
Doctorate 2192
Name: education_of_employee, dtype: int64
--------------------------------------------------
Y 14802
N 10678
Name: has_job_experience, dtype: int64
--------------------------------------------------
N 22525
Y 2955
Name: requires_job_training, dtype: int64
--------------------------------------------------
Northeast 7195
South 7017
West 6586
Midwest 4307
Island 375
Name: region_of_employment, dtype: int64
--------------------------------------------------
Year 22962
Hour 2157
Week 272
Month 89
Name: unit_of_wage, dtype: int64
--------------------------------------------------
Y 22773
N 2707
Name: full_time_position, dtype: int64
--------------------------------------------------
Certified 17018
Denied 8462
Name: case_status, dtype: int64
--------------------------------------------------
Observations:

Most of the applications in the data are from Asia, followed by Europe.

Most of the applicants have a bachelor's degree, followed by a master's degree.
Most of the applicants have job experience and do not require job training.
Most applicants have their worksite in the Northeast region of the US.
Most applicants have a yearly unit of wage.
Most of the visa applications are for full-time job positions.
The target column case_status is imbalanced, with the majority of applications being certified.

In [ ]:
# checking the number of unique values
data["case_id"].nunique()
Out[ ]:
25480

All the values in the case id column are unique.


We can drop this column.

In [ ]:
data.drop(["case_id"], axis=1, inplace=True)

Univariate Analysis

In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram

Observations on number of employees

In [ ]:
histogram_boxplot(data, "no_of_employees")

The distribution of the number of employees is heavily right-skewed.


Some companies have more than 500k employees. Such companies might have multiple offices around the
world.

Observations on prevailing wage

In [ ]:
histogram_boxplot(data, "prevailing_wage")
The distribution of prevailing wage is skewed to the right.
There are some job roles where the prevailing wage is more than 200k.
The distribution suggests that some applicants have a prevailing wage of around 0; let's have a look at them. As
we saw in the data summary, the minimum value is 2.13.

In [ ]:
# checking the observations which have less than 100 prevailing wage
data.loc[data["prevailing_wage"] < 100]
Out[ ]:

       continent      education_of_employee  has_job_experience  requires_job_training  no_of_employees  yr_of_estab  ...
338    Asia           Bachelor's             Y                   N                      2114             2012         ...
634    Asia           Master's               N                   N                      834              1977         ...
839    Asia           High School            Y                   N                      4537             1999         ...
876    South America  Bachelor's             Y                   N                      731              2004         ...
995    Asia           Master's               N                   N                      302              2000         ...
...    ...            ...                    ...                 ...                    ...              ...          ...
25023  Asia           Bachelor's             N                   Y                      3200             1994         ...
25258  Asia           Bachelor's             Y                   N                      3659             1997         ...
25308  North America  Master's               N                   N                      82953            1977         ...
25329  Africa         Bachelor's             N                   N                      2172             1993         ...
25461  Asia           Master's               Y                   N                      2861             2004         ...

176 rows × 11 columns

It looks like the unit of the wage for these observations is hours.

In [ ]:
data.loc[data["prevailing_wage"] < 100, "unit_of_wage"].value_counts()
Out[ ]:
Hour 176
Name: unit_of_wage, dtype: int64

All such observations where the prevailing wage is less than 100 have the unit of wage as hours. This makes
sense and confirms that these are not anomalous observations in the data.

In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the annotation
        y = p.get_height()  # y-coordinate of the annotation

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage

    plt.show()  # show the plot

Observations on continent

In [ ]:
labeled_barplot(data, "continent", perc=True)
About two-thirds (66.2%) of the applicants are from Asia, followed by 14.6% of the applications from Europe.

Observations on education of employee

In [ ]:
labeled_barplot(data, "education_of_employee", perc=True)

40.2% of the applicants have a bachelor's degree, followed by 37.8% having a master's degree.
8.6% of the applicants have a doctorate degree.

Observations on job experience

In [ ]:
labeled_barplot(data, "has_job_experience", perc=True)
58.1% of the applicants have job experience.

Observations on job training

In [ ]:
labeled_barplot(data, "requires_job_training", perc=True)

88.4% of the applicants do not require any job training.

Observations on region of employment

In [ ]:

labeled_barplot(data, "region_of_employment", perc=True)


Northeast, South, and West have almost equal percentages of applicants (25%-28%).
The Island region has only 1.5% of the applicants.

Observations on unit of wage

In [ ]:

labeled_barplot(data, "unit_of_wage", perc=True)

90.1% of the applicants have a yearly unit of the wage, followed by 8.5% of the applicants having hourly
wages.

Observations on case status

In [ ]:
labeled_barplot(data, "case_status", perc=True)
66.8% of the visas were certified.

Bivariate Analysis

In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(10, 5))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

There's no strong correlation among the numeric independent features of the data.

Creating functions that will help us with further analysis.

In [ ]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [ ]:

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any
impact on visa certification

In [ ]:
stacked_barplot(data, "education_of_employee", "case_status")

case_status Certified Denied All


education_of_employee
All 17018 8462 25480
Bachelor's 6367 3867 10234
High School 1164 2256 3420
Master's 7575 2059 9634
Doctorate 1912 280 2192
---------------------------------------------------------------------------------------
---------------------------------

Education seems to have a positive relationship with visa certification, i.e., the higher the education, the higher
the chances of the visa getting certified.
Around 85% of the visa applications were certified for applicants with a Doctorate degree, while around 80% were
certified for applicants with a Master's degree.
Around 60% of the visa applications were certified for applicants with Bachelor's degrees.
Applicants who do not have a degree and have graduated from high school are more likely to have their
applications denied.

Different regions have different requirements of talent having diverse educational backgrounds. Let's analyze it
further

In [ ]:
plt.figure(figsize=(10, 5))
sns.heatmap(
pd.crosstab(data["education_of_employee"], data["region_of_employment"]),
annot=True,
fmt="g",
cmap="viridis",
)
plt.ylabel("Education")
plt.xlabel("Region")
plt.show()
The requirement for applicants who have completed high school is highest in the South region, followed by the
Northeast region.
The requirement for Bachelor's degree holders is highest in the South region, followed by the West region.
The requirement for Master's degree holders is highest in the Northeast region, followed by the South region.
The requirement for Doctorate holders is highest in the West region, followed by the Northeast region.

Let's have a look at the percentage of visa certifications across each region

In [ ]:
stacked_barplot(data, "region_of_employment", "case_status")

case_status Certified Denied All


region_of_employment
All 17018 8462 25480
Northeast 4526 2669 7195
West 4100 2486 6586
South 4913 2104 7017
Midwest 3253 1054 4307
Island 226 149 375
---------------------------------------------------------------------------------------
---------------------------------

The Midwest region sees the highest percentage of visa certifications - around 75%, followed by the South region,
where around 70% of the visa applications get certified.
The Island, West, and Northeast regions have an almost equal percentage of visa certifications.

Let's similarly check the continents and find out how the visa status varies across them.

In [ ]:
stacked_barplot(data, "continent", "case_status")

case_status Certified Denied All


continent
All 17018 8462 25480
Asia 11012 5849 16861
North America 2037 1255 3292
Europe 2957 775 3732
South America 493 359 852
Africa 397 154 551
Oceania 122 70 192
---------------------------------------------------------------------------------------
---------------------------------
Applications from Europe and Africa have a higher chance of getting certified.
Around 80% of the applications from Europe are certified.
Asia has the third-highest percentage (around 65%) of visa certifications and has the highest number of
applications.

Experienced professionals might look abroad for opportunities to improve their lifestyles and career
development. Let's see if having work experience has any influence over visa certification

In [ ]:

stacked_barplot(data, "has_job_experience", "case_status")

case_status Certified Denied All


has_job_experience
All 17018 8462 25480
N 5994 4684 10678
Y 11024 3778 14802
---------------------------------------------------------------------------------------
---------------------------------

Having job experience seems to be a key differentiator between visa applications getting certified or denied.
Around 75% of the applications from applicants with some job experience were certified, compared to applicants
without any job experience, of whom only around 55% had their applications certified.
Do the employees who have prior work experience require any job training?

In [ ]:

stacked_barplot(data, "has_job_experience", "requires_job_training")

requires_job_training N Y All
has_job_experience
All 22525 2955 25480
N 8988 1690 10678
Y 13537 1265 14802
---------------------------------------------------------------------------------------
---------------------------------

A smaller percentage of applicants require job training if they have prior work experience (a quick way to compute these shares is sketched below).
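To quantify this (a minimal sketch, assuming the data DataFrame used above), the counts printed above can be normalized by row:

# share of applicants requiring job training, split by prior job experience
pd.crosstab(
    data["has_job_experience"], data["requires_job_training"], normalize="index"
)

From the counts above, roughly 1265/14802 (about 9%) of experienced applicants require training, versus 1690/10678 (about 16%) of those without experience.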

The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze
the data and see if the visa status changes with the prevailing wage

In [ ]:
distribution_plot_wrt_target(data, "prevailing_wage", "case_status")
The median prevailing wage for the certified applications is slightly higher as compared to denied
applications.

Checking if the prevailing wage is similar across all the regions of the US

In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="region_of_employment", y="prevailing_wage")
plt.show()

Midwest and Island regions have slightly higher prevailing wages as compared to other regions.
The distribution of prevailing wage is similar across West, Northeast, and South regions.

The prevailing wage has different units (Hourly, Weekly, etc). Let's find out if it has any impact on visa
applications getting certified.

In [ ]:
stacked_barplot(data, "unit_of_wage", "case_status")

case_status Certified Denied All


unit_of_wage
All 17018 8462 25480
Year 16047 6915 22962
Hour 747 1410 2157
Week 169 103 272
Month 55 34 89
---------------------------------------------------------------------------------------
---------------------------------
Unit of prevailing wage is an important factor for differentiating between a certified and a denied visa
application.
If the unit of prevailing wage is Yearly, there's a high chance of the application getting certified.
Around 70% of the applications with a yearly unit of wage were certified, while only around 35% of the
applications with an hourly unit of wage were certified.
Monthly and Weekly units of prevailing wage have roughly the same percentage of visa applications getting certified.

Data Pre-processing

Outlier Check
Let's check for outliers in the data.

In [ ]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations

There are quite a few outliers in the data.


However, we will not treat them as they are proper values.

Data Preparation for modeling


We want to predict which visa will be certified.
Before we proceed to build a model, we'll have to encode categorical features.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.

In [ ]:
data["case_status"] = data["case_status"].apply(lambda x: 1 if x == "Certified" else 0)

X = data.drop(["case_status"], axis=1)
Y = data["case_status"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)

In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))

Shape of Training set : (17836, 21)


Shape of test set : (7644, 21)
Percentage of classes in training set:
1 0.667919
0 0.332081
Name: case_status, dtype: float64
Percentage of classes in test set:
1 0.667844
0 0.332156
Name: case_status, dtype: float64

Model evaluation criterion


The model can make wrong predictions as:

1. The model predicts that the visa application will get certified but in reality, the visa application should get denied.
2. The model predicts that the visa application will not get certified but in reality, the visa application should get certified.

Which case is more important?

Both the cases are important:

If a visa is certified when it should have been denied, an unsuitable employee will get the job position while US
workers miss the opportunity to work in that position.
If a visa is denied when it should have been certified, the US will lose a suitable human resource that could
contribute to the economy.
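In terms of the confusion matrix used below, these two errors sit in the off-diagonal cells. A minimal illustration (it assumes the 1 = Certified / 0 = Denied encoding created during data preparation, and uses a tiny made-up label list purely for demonstration):

from sklearn.metrics import confusion_matrix

# rows = true labels, columns = predicted labels, ordered [0, 1] = [Denied, Certified]
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
false_positives = cm[0, 1]  # predicted Certified, actually Denied
false_negatives = cm[1, 0]  # predicted Denied, actually Certified
print(cm, false_positives, false_negatives)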

How to reduce the losses?

The F1 score can be used as the metric for evaluating the model; it is the harmonic mean of precision and recall, so
the greater the F1 score, the better the chances of minimizing both false negatives and false positives.
We will use balanced class weights so that the model focuses equally on both classes.
First, let's create functions to calculate different metrics and the confusion matrix so that we don't have to use the
same code repeatedly for each model.

The model_performance_classification_sklearn function will be used to check the performance of the models.
The confusion_matrix_sklearn function will be used to plot the confusion matrix.

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf

In [ ]:

def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n {0:.2%} ".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Decision Tree - Model Building and Hyperparameter Tuning


In [ ]:

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(model, X_train, y_train)

In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
Out[ ]:

Accuracy Recall Precision F1

0 1.0 1.0 1.0 1.0

There are 0 errors on the training set; each sample has been classified correctly.
The model has performed very well on the training set.
As we know, a decision tree will continue to grow and classify each data point correctly if no restrictions are
applied, as the tree will learn all the patterns in the training set.
Let's check the performance on the test data to see if the model is overfitting.

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(model, X_test, y_test)
In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
Out[ ]:

Accuracy Recall Precision F1

0 0.664835 0.742801 0.752232 0.747487

The decision tree model is overfitting the data as expected and is not able to generalize well on the test set.
We will have to prune the decision tree.

Hyperparameter Tuning - Decision Tree

In [ ]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)

# Grid of parameters to choose from


parameters = {
"max_depth": np.arange(10, 30, 5),
"min_samples_leaf": [3, 5, 7],
"max_leaf_nodes": [2, 3, 5],
"min_impurity_decrease": [0.0001, 0.001],
}

# Type of scoring used to compare parameter combinations


scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer, n_jobs=-1)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.


dtree_estimator.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', max_depth=10, max_leaf_nodes=2,
min_impurity_decrease=0.0001, min_samples_leaf=3,
random_state=1)

In [ ]:

confusion_matrix_sklearn(dtree_estimator, X_train, y_train)


In [ ]:
dtree_estimator_model_train_perf = model_performance_classification_sklearn(
dtree_estimator, X_train, y_train
)
dtree_estimator_model_train_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.712548 0.931923 0.720067 0.812411

In [ ]:
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)

In [ ]:
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
dtree_estimator, X_test, y_test
)
dtree_estimator_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.706567 0.930852 0.715447 0.809058

The decision tree model has a very high recall, but the precision is quite low.
The performance of the model after hyperparameter tuning has become generalized.
We are getting an F1 score of 0.81 and 0.80 on the training and test set, respectively.
Let's try building some ensemble models and see if the metrics improve.

Bagging - Model Building and Hyperparameter Tuning

Bagging Classifier

In [ ]:
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train, y_train)
Out[ ]:
BaggingClassifier(random_state=1)

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(bagging_classifier, X_train, y_train)

In [ ]:
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
bagging_classifier, X_train, y_train
)
bagging_classifier_model_train_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.985198 0.985982 0.99181 0.988887

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)

In [ ]:
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
bagging_classifier, X_test, y_test
)
bagging_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1

0 0.691523 0.764153 0.771711 0.767913

The bagging classifier is overfitting on the training set like the decision tree model.
We'll try to reduce overfitting and improve the performance by hyperparameter tuning.

Hyperparameter Tuning - Bagging Classifier

In [ ]:

# Choose the type of classifier.


bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from


parameters = {
"max_samples": [0.7, 0.8, 0.9],
"max_features": [0.7, 0.8, 0.9],
"n_estimators": np.arange(90, 120, 10),
}

# Type of scoring used to compare parameter combinations


acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.


bagging_estimator_tuned.fit(X_train, y_train)

Out[ ]:
BaggingClassifier(max_features=0.7, max_samples=0.7, n_estimators=100,
random_state=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_train, y_train)

In [ ]:
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_train, y_train
)
bagging_estimator_tuned_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.996187 0.999916 0.994407 0.997154

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)

In [ ]:
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_test, y_test
)
bagging_estimator_tuned_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.724228 0.895397 0.743857 0.812622

After tuning the hyperparameters, the bagging classifier is still overfitting.


There's a big difference in the training and the test recall.

Random Forest

In [ ]:
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=1, class_weight="balanced")
rf_estimator.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier(class_weight='balanced', random_state=1)

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(rf_estimator, X_train, y_train)


In [ ]:
# Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(
rf_estimator, X_train, y_train
)
rf_estimator_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 1.0 1.0 1.0 1.0

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(rf_estimator, X_test, y_test)

In [ ]:
rf_estimator_model_test_perf = model_performance_classification_sklearn(
rf_estimator, X_test, y_test
)
rf_estimator_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.727368 0.847209 0.768343 0.805851

With default parameters, random forest is overfitting the training data.


We'll try to reduce overfitting and improve recall by hyperparameter tuning.

Hyperparameter Tuning - Random Forest

In [ ]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)

parameters = {
"max_depth": list(np.arange(5, 15, 5)),
"max_features": ["sqrt", "log2"],
"min_samples_split": [3, 5, 7],
"n_estimators": np.arange(10, 40, 10),
}

# Type of scoring used to compare parameter combinations


acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(rf_tuned, parameters, scoring=acc_scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.


rf_tuned.fit(X_train, y_train)
Out[ ]:

RandomForestClassifier(max_depth=10, max_features='sqrt', min_samples_split=7,


n_estimators=20, oob_score=True, random_state=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(rf_tuned, X_train, y_train)

In [ ]:
rf_tuned_model_train_perf = model_performance_classification_sklearn(
rf_tuned, X_train, y_train
)
rf_tuned_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.769119 0.91866 0.776556 0.841652

Checking model performance on test set


In [ ]:

confusion_matrix_sklearn(rf_tuned, X_test, y_test)

In [ ]:
rf_tuned_model_test_perf = model_performance_classification_sklearn(
rf_tuned, X_test, y_test
)
rf_tuned_model_test_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.738095 0.898923 0.755391 0.82093

After hyperparameter tuning the model performance has generalized.


We have an F1 score of 0.84 and 0.82 on the training and test data, respectively.
The model has a high recall and a good precision.

Boosting - Model Building and Hyperparameter Tuning

AdaBoost Classifier

In [ ]:
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train, y_train)

Out[ ]:
AdaBoostClassifier(random_state=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(ab_classifier, X_train, y_train)
In [ ]:

ab_classifier_model_train_perf = model_performance_classification_sklearn(
ab_classifier, X_train, y_train
)
ab_classifier_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.738226 0.887182 0.760688 0.81908

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(ab_classifier, X_test, y_test)

In [ ]:

ab_classifier_model_test_perf = model_performance_classification_sklearn(
ab_classifier, X_test, y_test
)
ab_classifier_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.734301 0.885015 0.757799 0.816481

The model is giving a generalized performance.


We have received a good F1 score of approximately 0.82 on both the training and test sets.

Hyperparameter Tuning - AdaBoost Classifier

In [ ]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from


parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=1),
DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=1),
DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=1),
],
"n_estimators": np.arange(60, 100, 10),
"learning_rate": np.arange(0.1, 0.4, 0.1),
}

# Type of scoring used to compare parameter combinations


acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.


abc_tuned.fit(X_train, y_train)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
max_depth=1,
random_state=1),
learning_rate=0.1, n_estimators=90, random_state=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(abc_tuned, X_train, y_train)

In [ ]:

abc_tuned_model_train_perf = model_performance_classification_sklearn(
abc_tuned, X_train, y_train
)
abc_tuned_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.719163 0.781415 0.79469 0.787997

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(abc_tuned, X_test, y_test)

In [ ]:
abc_tuned_model_test_perf = model_performance_classification_sklearn(
abc_tuned, X_test, y_test
)
abc_tuned_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.716641 0.781587 0.79151 0.786517

After tuning, the F1 score has decreased.


The recall of the model has decreased, but the precision has improved.

Gradient Boosting Classifier

In [ ]:
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(random_state=1)

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(gb_classifier, X_train, y_train)


In [ ]:
gb_classifier_model_train_perf = model_performance_classification_sklearn(
gb_classifier, X_train, y_train
)
gb_classifier_model_train_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.758802 0.88374 0.783042 0.830349

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(gb_classifier, X_test, y_test)

In [ ]:
gb_classifier_model_test_perf = model_performance_classification_sklearn(
gb_classifier, X_test, y_test
)
gb_classifier_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.744767 0.876004 0.772366 0.820927

The model is giving a good and generalized performance.


We are getting the F1 score of 0.83 and 0.82 on the training and test set, respectively.
Let's see if the performance can be improved further by hyperparameter tuning.

Hyperparameter Tuning - Gradient Boosting Classifier

In [ ]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)

# Grid of parameters to choose from


parameters = {
"n_estimators": [200, 250, 300],
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"learning_rate": np.arange(0.1, 0.4, 0.1),
}

# Type of scoring used to compare parameter combinations


acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.


gbc_tuned.fit(X_train, y_train)
Out[ ]:

GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8, n_estimators=200, random_state=1,
subsample=1)

Checking model performance on training set

In [ ]:
confusion_matrix_sklearn(gbc_tuned, X_train, y_train)

In [ ]:
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
gbc_tuned, X_train, y_train
)
gbc_tuned_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.764017 0.882649 0.789059 0.833234

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
In [ ]:

gbc_tuned_model_test_perf = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
gbc_tuned_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.743459 0.871303 0.773296 0.819379

After tuning there is not much change in the model performance as compared to the model with default
values of hyperparameters.

XGBoost Classifier

In [ ]:
xgb_classifier = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)

Out[ ]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(xgb_classifier, X_train, y_train)


In [ ]:
xgb_classifier_model_train_perf = model_performance_classification_sklearn(
xgb_classifier, X_train, y_train
)
xgb_classifier_model_train_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.838753 0.931419 0.843482 0.885272

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)

In [ ]:
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
xgb_classifier, X_test, y_test
)
xgb_classifier_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.733255 0.860725 0.767913 0.811675

The XGBoost model has performed very well on the training set but is not able to generalize on the test set.
Let's try and tune the hyperparameters and see if the performance can be generalized.

Hyperparameter Tuning - XGBoost Classifier

In [ ]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters to choose from


parameters = {
"n_estimators": np.arange(150, 250, 50),
"scale_pos_weight": [1, 2],
"subsample": [0.9, 1],
"learning_rate": np.arange(0.1, 0.21, 0.1),
"gamma": [3, 5],
"colsample_bytree": [0.8, 0.9],
"colsample_bylevel": [ 0.9, 1],
}

# Type of scoring used to compare parameter combinations


acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search


grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters


xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.


xgb_tuned.fit(X_train, y_train)
Out[ ]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.1, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(xgb_tuned, X_train, y_train)

In [ ]:

xgb_tuned_model_train_perf = model_performance_classification_sklearn(
xgb_tuned, X_train, y_train
)
xgb_tuned_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.765474 0.881642 0.791127 0.833935

Checking model performance on test set

In [ ]:
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)

In [ ]:

xgb_tuned_model_test_perf = model_performance_classification_sklearn(
xgb_tuned, X_test, y_test
)
xgb_tuned_model_test_perf

Out[ ]:

Accuracy Recall Precision F1

0 0.74516 0.86954 0.775913 0.820063

XGBoost model after tuning is giving a good and generalized performance.


We have received the F1 score of 0.83 and 0.82 on the training and the test set, respectively.

Stacking Classifier
In [ ]:

estimators = [
("AdaBoost", ab_classifier),
("Gradient Boosting", gbc_tuned),
("Random Forest", rf_tuned),
]

final_estimator = xgb_tuned

stacking_classifier = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)

stacking_classifier.fit(X_train, y_train)

Out[ ]:

StackingClassifier(estimators=[('AdaBoost', AdaBoostClassifier(random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8,
n_estimators=200,
random_state=1,
subsample=1)),
('Random Forest',
RandomForestClassifier(max_depth=10,
max_features='sqrt',
min_samples_split=7,
n_estimators=20,
oob_score=Tru...
eval_metric='logloss', gamma=5,
gpu_id=-1,
importance_type='gain',
interaction_constraints='',
learning_rate=0.1,
max_delta_step=0, max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=200, n_jobs=8,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))

Checking model performance on training set

In [ ]:

confusion_matrix_sklearn(stacking_classifier, X_train, y_train)

In [ ]:
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
stacking_classifier, X_train, y_train
)
stacking_classifier_model_train_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.770296 0.892554 0.790558 0.838465

Checking model performance on test set

In [ ]:

confusion_matrix_sklearn(stacking_classifier, X_test, y_test)


In [ ]:

stacking_classifier_model_test_perf = model_performance_classification_sklearn(
stacking_classifier, X_test, y_test
)
stacking_classifier_model_test_perf
Out[ ]:

Accuracy Recall Precision F1

0 0.74529 0.879138 0.771399 0.821752

The stacking model has also given a good and generalized performance.


The performance is comparable to the XGBoost model.
We have received F1 scores of about 0.84 and 0.82 on the training and test set, respectively.

Model Comparison and Final Model Selection


Comparing all models

In [ ]:

# training performance comparison

models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
dtree_estimator_model_train_perf.T,
bagging_classifier_model_train_perf.T,
bagging_estimator_tuned_model_train_perf.T,
rf_estimator_model_train_perf.T,
rf_tuned_model_train_perf.T,
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[ ]:

                                 Accuracy    Recall  Precision        F1
Decision Tree                    1.000000  1.000000   1.000000  1.000000
Tuned Decision Tree              0.712548  0.931923   0.720067  0.812411
Bagging Classifier               0.985198  0.985982   0.991810  0.988887
Tuned Bagging Classifier         0.996187  0.999916   0.994407  0.997154
Random Forest                    1.000000  1.000000   1.000000  1.000000
Tuned Random Forest              0.769119  0.918660   0.776556  0.841652
Adaboost Classifier              0.738226  0.887182   0.760688  0.819080
Tuned Adaboost Classifier        0.719163  0.781415   0.794690  0.787997
Gradient Boost Classifier        0.758802  0.883740   0.783042  0.830349
Tuned Gradient Boost Classifier  0.764017  0.882649   0.789059  0.833234
XGBoost Classifier               0.838753  0.931419   0.843482  0.885272
XGBoost Classifier Tuned         0.765474  0.881642   0.791127  0.833935
Stacking Classifier              0.770296  0.892554   0.790558  0.838465

In [ ]:
# testing performance comparison

models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
dtree_estimator_model_test_perf.T,
bagging_classifier_model_test_perf.T,
bagging_estimator_tuned_model_test_perf.T,
rf_estimator_model_test_perf.T,
rf_tuned_model_test_perf.T,
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df

Testing performance comparison:

Out[ ]:

                                 Accuracy    Recall  Precision        F1
Decision Tree                    0.664835  0.742801   0.752232  0.747487
Tuned Decision Tree              0.706567  0.930852   0.715447  0.809058
Bagging Classifier               0.691523  0.764153   0.771711  0.767913
Tuned Bagging Classifier         0.724228  0.895397   0.743857  0.812622
Random Forest                    0.727368  0.847209   0.768343  0.805851
Tuned Random Forest              0.738095  0.898923   0.755391  0.820930
Adaboost Classifier              0.734301  0.885015   0.757799  0.816481
Tuned Adaboost Classifier        0.716641  0.781587   0.791510  0.786517
Gradient Boost Classifier        0.744767  0.876004   0.772366  0.820927
Tuned Gradient Boost Classifier  0.743459  0.871303   0.773296  0.819379
XGBoost Classifier               0.733255  0.860725   0.767913  0.811675
XGBoost Classifier Tuned         0.745160  0.869540   0.775913  0.820063
Stacking Classifier              0.745290  0.879138   0.771399  0.821752

The tuned Random Forest model has given a good and generalized performance. We will use it as our final model.
With the tuned random forest model, we are getting an F1 score of 0.84 and 0.82 on the training and the test set,
respectively.
Let's check the important features of the final model.

Important features of the final model

In [ ]:
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Looking at the feature importance of the Random Forest model, the top three important features to look for
while certifying a visa are education of the employee, job experience, and prevailing wage (a quick way to list them is sketched below).
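For a quick tabular view of the same information (a minimal sketch, assuming the fitted rf_tuned model and X_train from above), the importances can also be listed as a sorted Series:

# top feature importances of the tuned random forest
feat_imp = pd.Series(rf_tuned.feature_importances_, index=X_train.columns)
print(feat_imp.sort_values(ascending=False).head(5))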

Actionable Insights and Recommendations
The profile of the applicants for whom the visa status can be approved:

Primary information to look at:

Education level - At least a Bachelor's degree; Master's and Doctorate degrees are preferred.
Job Experience - Should have some job experience.
Prevailing wage - The median prevailing wage of the employees for whom the visa got certified is
around 72k.

Secondary information to look at:

Unit of Wage - Applicants having a yearly unit of wage.


Continent - Ideally, the nationality and ethnicity of an applicant shouldn't matter for working in a country, but it
has been observed that applicants from Europe, Africa, and Asia have higher chances of visa certification.
Region of employment - Our analysis suggests that applications to work in the Midwest region have higher
chances of visa approval. Approvals can also be guided by the requirement of talent; from our analysis we see that:
The requirement for applicants who have completed high school is highest in the South region, followed by
the Northeast region.
The requirement for Bachelor's degree holders is highest in the South region, followed by the West region.
The requirement for Master's degree holders is highest in the Northeast region, followed by the South region.
The requirement for Doctorate holders is highest in the West region, followed by the Northeast region.

The profile of the applicants for whom the visa status can be denied:

Primary information to look at:

Education level - Doesn't have any degree and has completed high school.
Job Experience - Doesn't have any job experience.
Prevailing wage - The median prevailing wage of the employees for whom the visa got denied is
around 65k.

Secondary information to look at:

Unit of Wage - Applicants having an hourly unit of wage.


Continent - Ideally, the nationality and ethnicity of an applicant shouldn't matter for working in a country, but it
has been observed that applicants from South America, North America, and Oceania have higher chances of their
visa applications getting denied.

Additional information about employers and employees can be collected to gain better insights, such as:
Employers: the wage they are offering to the applicant, the sector in which the company operates, etc.
Employees: specialization in their educational degree, number of years of experience, etc.
