Dsba Project Main Et Easyvisa
Dsba Project Main Et Easyvisa
Analytics
Ensemble Techniques: Project Debrief
Visa Approval
EasyVisa Project
Problem Statement
Context:
Business communities in the United States are facing high demand for human resources, but one of the
constant challenges is identifying and attracting the right talent, which is perhaps the most important element in
remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals
both locally as well as abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to
work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on
their wages or working conditions by ensuring US employers' compliance with statutory requirements when
they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office
of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United
States and grants certifications in those cases where employers can demonstrate that there are not sufficient
US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the
area of intended employment.
Objective:
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and
permanent labor certifications. This was a nine percent increase in the overall number of processed applications
from the previous year. The process of reviewing every case is becoming a tedious task as the number of
applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning based solution that can help in
shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm EasyVisa for data-
driven solutions. You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a
classification model:
Data Description
The data contains the different attributes of employee and the employer. The detailed data dictionary is given
below.
Note: This is a sample solution for the project. Projects will NOT be
graded on the basis of how well the submission matches this sample
solution. Projects will be graded on the basis of the rubric only.
import warnings
warnings.filterwarnings("ignore")
Import Dataset
In [ ]:
visa = pd.read_csv("EasyVisa.csv")
In [ ]:
# copying data to another variable to avoid any changes to original data
data = visa.copy()
In [ ]:
data.head()
Out[ ]:
In [ ]:
data.tail()
Out[ ]:
In [ ]:
data.shape
Out[ ]:
(25480, 12)
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_id 25480 non-null object
1 continent 25480 non-null object
2 education_of_employee 25480 non-null object
3 has_job_experience 25480 non-null object
4 requires_job_training 25480 non-null object
5 no_of_employees 25480 non-null int64
6 yr_of_estab 25480 non-null int64
7 region_of_employment 25480 non-null object
8 prevailing_wage 25480 non-null float64
8 prevailing_wage 25480 non-null float64
9 unit_of_wage 25480 non-null object
10 full_time_position 25480 non-null object
11 case_status 25480 non-null object
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB
no_of_employees , yr_of_estab , and prevailing_wage are numeric features while rest are objects.
There are no null values in the dataset.
In [ ]:
# checking for duplicate values
data.duplicated().sum()
Out[ ]:
0
In [ ]:
data.describe().T
Out[ ]:
Observations:
The range of the number of employees in a company is huge. There are some anomalies in the data as we
can see that the minimum number of employees is equal to -26, which is not possible. We will have to fix
this.
The year of establishment of companies ranges from 1800 to 2016, which seems fine.
The average prevailing wage is 74455.81. There's also a very huge difference in 75th percentile and
maximum value which indicates there might be outliers present in this column.
In [ ]:
data.loc[data["no_of_employees"] < 0].shape
Out[ ]:
(33, 12)
We will consider the 33 observations as data entry errors and take the absolute values for this column.
In [ ]:
# taking the absolute values for number of employees
data["no_of_employees"] = abs(data["no_of_employees"])
Let's check the count of each unique category in each of the categorical variables
In [ ]:
# Making a list of all catrgorical variables
cat_col = list(data.select_dtypes("object").columns)
EZYV21731 1
EZYV22853 1
EZYV24045 1
EZYV20282 1
EZYV17175 1
..
EZYV2953 1
EZYV990 1
EZYV9352 1
EZYV24207 1
EZYV8395 1
Name: case_id, Length: 25480, dtype: int64
--------------------------------------------------
Asia 16861
Europe 3732
North America 3292
South America 852
Africa 551
Oceania 192
Name: continent, dtype: int64
--------------------------------------------------
Bachelor's 10234
Master's 9634
High School 3420
Doctorate 2192
Name: education_of_employee, dtype: int64
--------------------------------------------------
Y 14802
N 10678
Name: has_job_experience, dtype: int64
--------------------------------------------------
N 22525
Y 2955
Name: requires_job_training, dtype: int64
--------------------------------------------------
Northeast 7195
South 7017
West 6586
Midwest 4307
Island 375
Name: region_of_employment, dtype: int64
--------------------------------------------------
Year 22962
Hour 2157
Week 272
Month 89
Name: unit_of_wage, dtype: int64
--------------------------------------------------
Y 22773
N 2707
Name: full_time_position, dtype: int64
--------------------------------------------------
Certified 17018
Denied 8462
Name: case_status, dtype: int64
--------------------------------------------------
Observations:
In [ ]:
# checking the number of unique values
data["case_id"].nunique()
Out[ ]:
25480
In [ ]:
data.drop(["case_id"], axis=1, inplace=True)
Univariate Analysis
In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the colum
n
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
In [ ]:
histogram_boxplot(data, "no_of_employees")
In [ ]:
histogram_boxplot(data, "prevailing_wage")
The distribution of prevailing wage is skewed to the right.
There are some job roles where the prevailing wage is more than 200k.
The distribution suggests that some applicants have prevailing wage around 0, let's have a look at them. As
we say in the data summary the minimum value is 2.13.
In [ ]:
# checking the observations which have less than 100 prevailing wage
data.loc[data["prevailing_wage"] < 100]
Out[ ]:
South
876 Bachelor's Y N 731 2004 North
America
North
25308 Master's N N 82953 1977 North
America
It looks like the unit of the wage for these observations is hours.
In [ ]:
data.loc[data["prevailing_wage"] < 100, "unit_of_wage"].value_counts()
Out[ ]:
Hour 176
Name: unit_of_wage, dtype: int64
All such observations where the prevailing wage is less than 100 have the unit of wage as hours. This makes
sense and confirms that these are not anomalous observations in the data.
In [ ]:
# function to create labeled barplots
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
Observations on continent
In [ ]:
labeled_barplot(data, "continent", perc=True)
More than half (66.2%) of the applicants are from Asia, followed by 14.6% of the applications from Europe.
In [ ]:
labeled_barplot(data, "education_of_employee", perc=True)
40.2% of the applicants have a bachelor's degree, followed by 37.8% having a master's degree.
8.6% of the applicants have a doctorate degree.
In [ ]:
labeled_barplot(data, "has_job_experience", perc=True)
58.1% of the customers have job experience.
In [ ]:
labeled_barplot(data, "requires_job_training", perc=True)
In [ ]:
In [ ]:
90.1% of the applicants have a yearly unit of the wage, followed by 8.5% of the applicants having hourly
wages.
In [ ]:
labeled_barplot(data, "case_status", perc=True)
66.8% of the visas were certified.
Bivariate Analysis
In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(10, 5))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
In [ ]:
### function to plot distributions wrt target
target_uniq = data[target].unique()
plt.tight_layout()
plt.show()
In [ ]:
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any
impact on visa certification
In [ ]:
stacked_barplot(data, "education_of_employee", "case_status")
Education seems to have a positive relationship with the certification of visa that is higher the education
higher are the chances of visa getting certified.
Around 85% of the visa applications got certified for the applicants with Doctorate degree. While 80% of the
visa applications got certified for the applicants with Master's degree.
Around 60% of the visa applications got certified for applicants with Bachelor's degrees.
Applicants who do not have a degree and have graduated from high school are more likely to have their
applications denied.
Different regions have different requirements of talent having diverse educational backgrounds. Let's analyze it
further
In [ ]:
plt.figure(figsize=(10, 5))
sns.heatmap(
pd.crosstab(data["education_of_employee"], data["region_of_employment"]),
annot=True,
fmt="g",
cmap="viridis",
)
plt.ylabel("Education")
plt.xlabel("Region")
plt.show()
The requirement for the applicants who have passed high school is most in the South region, followed by
Northeast region.
The requirement for Bachelor's is mostly in South region, followed by West region.
The requirement for Master's is most in Northeast region, followed by South region.
The requirement for Doctorate's is mostly in West region, followed by Northeast region.
Let's have a look at the percentage of visa certifications across each region
In [ ]:
stacked_barplot(data, "region_of_employment", "case_status")
Midwest region sees the highest number of visa certifications - around 75%, followed by the south region
that sees around 70% of the visa applications getting certified.
Island, West, and Northeast region has an almost equal percentage of visa certifications.
Lets' similarly check for the continents and find out how the visa status vary across different continents.
In [ ]:
stacked_barplot(data, "continent", "case_status")
Experienced professionals might look abroad for opportunities to improve their lifestyles and career
development. Let's see if having work experience has any influence over visa certification
In [ ]:
Having job experience seems to be a key differentiator between visa applications getting certified or denied.
Around 80% of the applications were certified for the applicants who have some job experience as
compared to the applicants who do not have any job experience.
Applicants without job experiences saw only 60% of the visa applications getting certified.
Do the employees who have prior work experience require any job training?
In [ ]:
requires_job_training N Y All
has_job_experience
All 22525 2955 25480
N 8988 1690 10678
Y 13537 1265 14802
---------------------------------------------------------------------------------------
---------------------------------
Less percentage of applicants require job training if they have prior work experience.
The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze
the data and see if the visa status changes with the prevailing wage
In [ ]:
distribution_plot_wrt_target(data, "prevailing_wage", "case_status")
The median prevailing wage for the certified applications is slightly higher as compared to denied
applications.
Checking if the prevailing wage is similar across all the regions of the US
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="region_of_employment", y="prevailing_wage")
plt.show()
Midwest and Island regions have slightly higher prevailing wages as compared to other regions.
The distribution of prevailing wage is similar across West, Northeast, and South regions.
The prevailing wage has different units (Hourly, Weekly, etc). Let's find out if it has any impact on visa
applications getting certified.
In [ ]:
stacked_barplot(data, "unit_of_wage", "case_status")
Data Pre-processing
Outlier Check
Let's check for outliers in the data.
In [ ]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
plt.show()
Observations
In [ ]:
data["case_status"] = data["case_status"].apply(lambda x: 1 if x == "Certified" else 0)
X = data.drop(["case_status"], axis=1)
Y = data["case_status"]
X = pd.get_dummies(X, drop_first=True)
In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
1. Model predicts that the visa application will get certified but in reality, the visa application should get denied.
2. Model predicts that the visa application will not get certified but in reality, the visa application should get
certified.
F1 Score can be used a the metric for evaluation of the model, greater the F1 score higher are the
chances of minimizing False Negatives and False Positives.
We will use balanced class weights so that model focuses equally on both classes.
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the
same code repeatedly for each model.
In [ ]:
# defining a function to compute different metrics to check performance of a classificati
on model built using sklearn
model: classifier
predictors: independent variables
target: dependent variable
"""
return df_perf
In [ ]:
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n {0:.2%} ".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
model = DecisionTreeClassifier(random_state=1)
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(model, X_train, y_train)
In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
0 errors on the training set, each sample has been classified correctly.
Model has performed very well on the training set.
As we know, a decision tree will continue to grow and classify each data point correctly if no restrictions are
applied as the trees will learn all the patterns in the training set.
Let's check the performance on test data to see if the model is overfitting.
In [ ]:
confusion_matrix_sklearn(model, X_test, y_test)
In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
Out[ ]:
The decision tree model is overfitting the data as expected and not able to generalize well on the test set.
We will have to prune the decision tree.
In [ ]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)
In [ ]:
Out[ ]:
In [ ]:
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
In [ ]:
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
dtree_estimator, X_test, y_test
)
dtree_estimator_model_test_perf
Out[ ]:
The decision tree model has a very high recall but, the precision is quite less.
The performance of the model after hyperparameter tuning has become generalized.
We are getting an F1 score of 0.81 and 0.80 on the training and test set, respectively.
Let's try building some ensemble models and see if the metrics improve.
Bagging Classifier
In [ ]:
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train, y_train)
Out[ ]:
Out[ ]:
BaggingClassifier(random_state=1)
In [ ]:
In [ ]:
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
bagging_classifier, X_train, y_train
)
bagging_classifier_model_train_perf
Out[ ]:
In [ ]:
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
In [ ]:
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
bagging_classifier, X_test, y_test
)
bagging_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
The bagging classifier is overfitting on the training set like the decision tree model.
We'll try to reduce overfitting and improve the performance by hyperparameter tuning.
In [ ]:
Out[ ]:
BaggingClassifier(max_features=0.7, max_samples=0.7, n_estimators=100,
random_state=1)
In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_train, y_train)
In [ ]:
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_train, y_train
)
bagging_estimator_tuned_model_train_perf
Out[ ]:
In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)
In [ ]:
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(
bagging_estimator_tuned, X_test, y_test
)
bagging_estimator_tuned_model_test_perf
Out[ ]:
Random Forest
In [ ]:
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=1, class_weight="balanced")
rf_estimator.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier(class_weight='balanced', random_state=1)
In [ ]:
In [ ]:
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
In [ ]:
rf_estimator_model_test_perf = model_performance_classification_sklearn(
rf_estimator, X_test, y_test
)
rf_estimator_model_test_perf
Out[ ]:
In [ ]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)
parameters = {
"max_depth": list(np.arange(5, 15, 5)),
"max_features": ["sqrt", "log2"],
"min_samples_split": [3, 5, 7],
"n_estimators": np.arange(10, 40, 10),
}
In [ ]:
confusion_matrix_sklearn(rf_tuned, X_train, y_train)
In [ ]:
rf_tuned_model_train_perf = model_performance_classification_sklearn(
rf_tuned, X_train, y_train
)
rf_tuned_model_train_perf
Out[ ]:
In [ ]:
rf_tuned_model_test_perf = model_performance_classification_sklearn(
rf_tuned, X_test, y_test
)
rf_tuned_model_test_perf
Out[ ]:
AdaBoost Classifier
In [ ]:
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train, y_train)
Out[ ]:
AdaBoostClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(ab_classifier, X_train, y_train)
In [ ]:
ab_classifier_model_train_perf = model_performance_classification_sklearn(
ab_classifier, X_train, y_train
)
ab_classifier_model_train_perf
Out[ ]:
In [ ]:
confusion_matrix_sklearn(ab_classifier, X_test, y_test)
In [ ]:
ab_classifier_model_test_perf = model_performance_classification_sklearn(
ab_classifier, X_test, y_test
)
ab_classifier_model_test_perf
Out[ ]:
In [ ]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(abc_tuned, X_train, y_train)
In [ ]:
abc_tuned_model_train_perf = model_performance_classification_sklearn(
abc_tuned, X_train, y_train
)
abc_tuned_model_train_perf
Out[ ]:
In [ ]:
In [ ]:
confusion_matrix_sklearn(abc_tuned, X_test, y_test)
In [ ]:
abc_tuned_model_test_perf = model_performance_classification_sklearn(
abc_tuned, X_test, y_test
)
abc_tuned_model_test_perf
Out[ ]:
In [ ]:
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(random_state=1)
In [ ]:
Out[ ]:
In [ ]:
confusion_matrix_sklearn(gb_classifier, X_test, y_test)
In [ ]:
gb_classifier_model_test_perf = model_performance_classification_sklearn(
gb_classifier, X_test, y_test
)
gb_classifier_model_test_perf
Out[ ]:
In [ ]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8, n_estimators=200, random_state=1,
subsample=1)
In [ ]:
confusion_matrix_sklearn(gbc_tuned, X_train, y_train)
In [ ]:
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
gbc_tuned, X_train, y_train
)
gbc_tuned_model_train_perf
Out[ ]:
In [ ]:
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
In [ ]:
gbc_tuned_model_test_perf = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
gbc_tuned_model_test_perf
Out[ ]:
After tuning there is not much change in the model performance as compared to the model with default
values of hyperparameters.
XGBoost Classifier
In [ ]:
xgb_classifier = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)
Out[ ]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
In [ ]:
Out[ ]:
In [ ]:
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)
In [ ]:
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
xgb_classifier, X_test, y_test
)
xgb_classifier_model_test_perf
Out[ ]:
The XGBoost model on the training set has performed very well but it is not able to generalize on the test
set.
Let's try and tune the hyperparameters and see if the performance can be generalized.
In [ ]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
In [ ]:
In [ ]:
xgb_tuned_model_train_perf = model_performance_classification_sklearn(
xgb_tuned, X_train, y_train
)
xgb_tuned_model_train_perf
Out[ ]:
In [ ]:
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)
In [ ]:
xgb_tuned_model_test_perf = model_performance_classification_sklearn(
xgb_tuned, X_test, y_test
)
xgb_tuned_model_test_perf
Out[ ]:
Stacking Classifier
In [ ]:
estimators = [
("AdaBoost", ab_classifier),
("Gradient Boosting", gbc_tuned),
("Random Forest", rf_tuned),
]
final_estimator = xgb_tuned
stacking_classifier = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
stacking_classifier.fit(X_train, y_train)
Out[ ]:
StackingClassifier(estimators=[('AdaBoost', AdaBoostClassifier(random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(rando
m_state=1),
max_features=0.8,
n_estimators=200,
random_state=1,
subsample=1)),
('Random Forest',
RandomForestClassifier(max_depth=10,
max_features='sqrt',
min_samples_split=7,
n_estimators=20,
oob_score=Tru...
eval_metric='logloss', gamma=5,
eval_metric='logloss', gamma=5,
gpu_id=-1,
importance_type='gain',
interaction_constraints='',
learning_rate=0.1,
max_delta_step=0, max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=200, n_jobs=8,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
In [ ]:
In [ ]:
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
stacking_classifier, X_train, y_train
)
stacking_classifier_model_train_perf
Out[ ]:
In [ ]:
stacking_classifier_model_test_perf = model_performance_classification_sklearn(
stacking_classifier, X_test, y_test
)
stacking_classifier_model_test_perf
Out[ ]:
In [ ]:
models_train_comp_df = pd.concat(
[
dtree_estimator_model_train_perf.T,
dtree_estimator_model_train_perf.T,
bagging_classifier_model_train_perf.T,
bagging_estimator_tuned_model_train_perf.T,
rf_estimator_model_train_perf.T,
rf_tuned_model_train_perf.T,
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Tuned
Tuned Tuned Tuned Tuned Gradient XGBoo
Decision Bagging Random Adaboost Gradient XGBoost
Decision Bagging Random Adaboost Boost Classif
Tree Classifier Forest Classifier Boost Classifier
Tree Classifier Forest Classifier Classifier Tun
Classifier
Accuracy 0.712548 0.712548 0.985198 0.996187 1.0 0.769119 0.738226 0.719163 0.758802 0.764017 0.838753 0.7654
Recall 0.931923 0.931923 0.985982 0.999916 1.0 0.918660 0.887182 0.781415 0.883740 0.882649 0.931419 0.8816
Precision 0.720067 0.720067 0.991810 0.994407 1.0 0.776556 0.760688 0.794690 0.783042 0.789059 0.843482 0.7911
F1 0.812411 0.812411 0.988887 0.997154 1.0 0.841652 0.819080 0.787997 0.830349 0.833234 0.885272 0.8339
In [ ]:
# testing performance comparison
models_test_comp_df = pd.concat(
[
dtree_estimator_model_test_perf.T,
dtree_estimator_model_test_perf.T,
bagging_classifier_model_test_perf.T,
bagging_estimator_tuned_model_test_perf.T,
rf_estimator_model_test_perf.T,
rf_tuned_model_test_perf.T,
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Out[ ]:
Tuned
Tuned Tuned Tuned Tuned Gradient XGBoo
Decision Bagging Random Adaboost Gradient XGBoost
Decision Bagging Random Adaboost Boost Classif
Tree Classifier Forest Classifier Boost Classifier
Tree Classifier Forest Classifier Classifier Tun
Classifier
Accuracy 0.706567 0.706567 0.691523 0.724228 0.727368 0.738095 0.734301 0.716641 0.744767 0.743459 0.733255 0.7451
Recall 0.930852 0.930852 0.764153 0.895397 0.847209 0.898923 0.885015 0.781587 0.876004 0.871303 0.860725 0.8695
Precision 0.715447 0.715447 0.771711 0.743857 0.768343 0.755391 0.757799 0.791510 0.772366 0.773296 0.767913 0.7759
F1 0.809058 0.809058 0.767913 0.812622 0.805851 0.820930 0.816481 0.786517 0.820927 0.819379 0.811675 0.8200
F1 0.809058 0.809058 0.767913 0.812622 0.805851 0.820930 0.816481 0.786517 0.820927 0.819379 0.811675 0.8200
Tuned
Tuned Tuned Tuned Tuned Gradient XGBoo
Decision Bagging Random Adaboost Gradient XGBoost
Decision Bagging Random Adaboost Boost Classif
Tree Classifier Forest Classifier Boost Classifier
Tree Classifier Forest Classifier Classifier Tun
Tuned Random Forest model has given a good and generalized performance. We will use itClassifier
as our final
model.
With the tuned random forest model we are getting the F1 score of 0.84 and 0.82 on the training and the test
set, respectively.
Let's check the important features of the final model.
In [ ]:
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Looking at the feature importance of the Random Forest model, the top three important features to look for
while certifying a visa are -Education of the employee, Job experience, and Prevailing Wage.
Education level - At least has a Bachelor's degree - Master's and doctorate are preferred.
Job Experience - Should have some job experience.
Prevailing wage - The median prevailing wage of the employees for whom the visa got certified is
around 72k.
The profile of the applicants for whom the visa status can be denied:
Education level - Doesn't have any degree and has completed high school.
Job Experience - Doesn't have any job experience.
Prevailing wage - The median prevailing wage of the employees for whom the visa got certified is
around 65k.
Additional information of employers and employees can be collected to gain better insights. Information
such as:
Employers: Information about the wage they are offering to the applicant, Sector in which company
operates in, etc
Employee's: Specialization in their educational degree, Number of years of experience, etc