BA15 Machine Learning Assignment Guidelines Assignment 01
Assignment Description:
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and
permanent labor certifications. This was a nine percent increase in the overall number of processed applications
from the previous year. Reviewing every case manually is becoming tedious as the number of applicants
increases every year.
This growth calls for a machine learning based solution that can help shortlist candidates with higher
chances of visa approval. OFLC has hired the firm EasyVisa for data-driven solutions.
You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a classification model:
Assignment Objectives:
By completing this assignment, you will be able to:
Assignment Guidelines:
Datasets to Use:
Each row in the dataset corresponds to a visa application and includes both employee- and
employer-related features:
Feature                  Description
case_id                  ID of each visa application
continent                Continent of the applicant
education_of_employee    Education level of the employee
has_job_experience       Y/N indicating prior job experience
requires_job_training    Y/N indicating need for job training
no_of_employees          Number of employees in employer company
yr_of_estab              Year of company establishment
region_of_employment     Region of employment in the U.S.
prevailing_wage          Market wage for the job in that location
unit_of_wage             Wage unit (Hourly, Weekly, Monthly, Yearly)
full_time_position       Y/N indicating full-time role
case_status              Target variable: Certified / Denied
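As a rough illustration of how rows with this layout might be prepared for a classifier, the sketch below parses two made-up records and converts the Y/N flags and the target to 0/1. The sample values, the helper name encode_row, and the choice of Certified = 1 are assumptions for illustration, not part of the provided dataset.

```python
import csv
import io

# Two invented rows that follow the column layout above (not real OFLC data).
SAMPLE_CSV = """case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
X001,Asia,Master's,Y,N,1200,1998,West,85000.0,Yearly,Y,Certified
X002,Europe,Bachelor's,N,Y,45,2010,South,32.5,Hourly,N,Denied
"""

# Columns documented as Y/N flags in the feature table.
YES_NO_COLUMNS = ("has_job_experience", "requires_job_training", "full_time_position")


def encode_row(row):
    """Return a copy of one CSV row with flags, numbers, and target converted."""
    encoded = dict(row)
    for col in YES_NO_COLUMNS:
        encoded[col] = 1 if row[col] == "Y" else 0
    encoded["no_of_employees"] = int(row["no_of_employees"])
    encoded["yr_of_estab"] = int(row["yr_of_estab"])
    encoded["prevailing_wage"] = float(row["prevailing_wage"])
    # Assumption: Certified is treated as the positive class (1).
    encoded["case_status"] = 1 if row["case_status"] == "Certified" else 0
    return encoded


rows = [encode_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    print(r["case_id"], r["has_job_experience"], r["prevailing_wage"], r["case_status"])
```

The remaining text columns (continent, education_of_employee, region_of_employment, unit_of_wage) are left as strings here; they would still need a categorical encoding before modelling.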
Problem Statement:
The U.S. labor market is experiencing an increasing demand for skilled workers, driving employers to
seek qualified individuals both domestically and internationally. The Office of Foreign Labor
Certification (OFLC) oversees the complex process of evaluating visa applications submitted by
employers wishing to hire foreign talent. In FY 2016 alone, over 775,000 applications were processed —
a 9% increase from the previous year. As this volume grows annually, the manual review of applications
becomes increasingly inefficient, error-prone, and resource-intensive.
To address this challenge, EasyVisa, a data science consultancy firm, has been hired by OFLC to develop
a machine learning–based classification model that can assist in identifying applications that are most
likely to be approved. By analyzing a variety of employer and employee attributes — such as education
level, work experience, job type, prevailing wage, and location — the model aims to:
This project will involve comprehensive exploratory data analysis (EDA), data preprocessing, and the
application of ensemble techniques such as Bagging (Random Forest) and Boosting (AdaBoost,
Gradient Boosting). The end goal is to build a robust and interpretable model that improves the
efficiency, fairness, and scalability of the visa approval process.
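A minimal sketch of the ensemble comparison described above, assuming scikit-learn is available. It uses synthetic data from make_classification as a stand-in for the encoded visa features, and the hyperparameters, split ratio, and use of F1 as the comparison metric are illustrative choices, not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the visa data (assumption: the real features
# have already been converted to numeric form).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=42
)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting (boosting)": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

On the real dataset, the same loop could be reused after preprocessing, with the scoring metric swapped for whichever one is justified in the analysis.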
Analysis Expectations:
Q1: Clearly define the business problem and its relevance in the current labor market scenario. (2 marks)
Q2: Perform univariate analysis on categorical and numerical variables using appropriate plots. (2 marks)
Q3: Comment on the distribution and patterns observed from univariate analysis. (2 marks)
Q4: Perform bivariate analysis between independent features and the target variable using visualizations. (3 marks)
Q5: Provide insights on how features like education, experience, pay unit, continent, and prevailing wage influence visa status. (2 marks)
Q8: Identify missing values and justify the chosen treatment method. (2 marks)
Q9: Detect and treat outliers (if any), and provide rationale. (2 marks)
Q10: Create or transform features that help improve model performance and explain the reasoning. (3 marks)
Q11: Properly split the dataset into training and testing sets with justification of the split ratio. (3 marks)
Q12: Build and evaluate Decision Tree, Bagging, and Random Forest classifiers. (3 marks)
Q13: Compare model performance using metrics like Accuracy, Precision, Recall, F1-Score. (3 marks)
Q14: Select and justify the evaluation metric(s) appropriate for this classification task. (1 mark)
Q16: Perform hyperparameter tuning for Decision Tree, Bagging, and Random Forest models. (3 marks)
Q17: Evaluate and compare performance of tuned models across all metrics. (2 marks)
Q19: Build AdaBoost and Gradient Boosting models and evaluate their performance. (2 marks)
Q20: Compare results with Bagging models using chosen metrics. (2 marks)
Q23: Interpret feature importance and how boosting methods capture complex patterns. (2 marks)
Q24: Based on overall analysis, provide at least 3 actionable insights that could help stakeholders. (2 marks)
Q25: Justify the final model selection based on performance and explain how it can be used by EasyVisa. (3 marks)
Q26: Ensure smooth structure, appropriate code comments, readable formatting, and no execution errors. (3 marks)
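For Q13 and Q14, it can help to see how Accuracy, Precision, Recall, and F1-Score follow from the confusion-matrix counts. The sketch below is illustrative only: the function name classification_metrics and the sample labels are made up, and Certified is assumed to be encoded as 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = Certified, 0 = Denied.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these sample labels there are 4 true positives, 1 false positive, 2 false negatives, and 3 true negatives, giving accuracy 0.70, precision 0.80, recall 0.67, and F1 0.73. Which metric matters most depends on whether wrongly shortlisting a likely denial or wrongly excluding a likely certification is considered the costlier error.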
Deliverables:
Submission Details:
Submission Mode: LMS upload as a zipped folder containing the notebook and presentation (if applicable)
Evaluation Criteria:
Plagiarism Policy:
All submissions must be original
Cite all external data sources or references
Academic integrity will be strictly enforced