Salary Data Analysis - Phase 1
CIT-466-502
Student Name – ID
Student Name – ID
1.1 Introduction
In this project, we will explore a salary dataset to understand factors that can have a significant
impact on employee salaries, and then we will build a regression model to predict the salary of
employees. Businesses today increasingly focus on strategic workforce planning and equitable
compensation, which makes the analysis of salary data important for optimizing human resource
practices. In this project, we will use statistical analysis along with a regression model to
analyze, visualize, and model the factors associated with employee salaries. By exploring the
relationship of employee salaries with different factors such as age, gender, education level, job
title, and years of experience, we want to generate useful insights that can help companies with
employee hiring, retention, and compensation strategies.
We will follow a systematic approach to complete this project. This approach includes data
preparation, model planning, model building, and evaluation of the model's performance.
The insights generated from this project will not only help organizations offer
market-competitive salaries but also identify any potential inequities or gender disparities.
Addressing such inequities helps organizations improve employee satisfaction and retention and
can also enhance the company’s reputation.
Age: This column represents the age of each employee in years. The values in this column are
numeric.
Gender: This column contains the gender of each employee, which can be either male or female.
The values in this column are categorical.
Education Level: This column contains the educational level of each employee, which can be
high school, bachelor's degree, master's degree, or PhD. The values in this column are
categorical.
Job Title: This column contains the job title of each employee. The job titles can vary depending
on the company and may include positions such as manager, analyst, engineer, or administrator.
The values in this column are categorical.
Years of Experience: This column represents the number of years of work experience of each
employee. The values in this column are numeric.
Salary: This column represents the annual salary of each employee in US dollars. The values in
this column are numeric and can vary depending on factors such as job title, years of experience,
and education level.
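The six columns described above can be summarized as a typed record. The following is a minimal sketch, written in Python for illustration only (the project itself uses RStudio). The class name EmployeeRecord, the field names, and the exact category spellings are assumptions; the actual dataset may label its values differently.

```python
from dataclasses import dataclass

# Assumed category labels, based on the column descriptions; the real
# dataset may spell these values differently.
GENDERS = {"Male", "Female"}
# Listed from lowest to highest so the list position can serve as an ordinal rank.
EDUCATION_LEVELS = ["High School", "Bachelor's", "Master's", "PhD"]


@dataclass(frozen=True)
class EmployeeRecord:
    """One row of the salary dataset (hypothetical field names)."""
    age: float                  # numeric, in years
    gender: str                 # categorical
    education_level: str        # categorical (ordered)
    job_title: str              # categorical, free-form
    years_of_experience: float  # numeric, in years
    salary: float               # numeric, annual salary in US dollars

    def __post_init__(self):
        # Reject values that contradict the column descriptions.
        if self.gender not in GENDERS:
            raise ValueError(f"unexpected gender: {self.gender!r}")
        if self.education_level not in EDUCATION_LEVELS:
            raise ValueError(f"unexpected education level: {self.education_level!r}")
        if min(self.age, self.years_of_experience, self.salary) < 0:
            raise ValueError("age, experience, and salary must be non-negative")

    def education_rank(self) -> int:
        """Return the ordinal position of the education level (0 = lowest)."""
        return EDUCATION_LEVELS.index(self.education_level)
```

Encoding education level as an ordinal rank is one possible design choice; treating it as an unordered factor is an equally valid alternative during model planning.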
From a business point of view, the dataset contains variables that can provide valuable
insights about employees and support strategic workforce planning and human resources
optimization. By analyzing variables such as age, gender, education level, job title, and years
of experience alongside salary, we can identify key trends that inform hiring, training, and
retention strategies. Understanding the relationships between salary and the other variables
can help businesses assess whether their current salary structure aligns with industry
standards and with employee expectations, so that compensation practices remain competitive
and fair.
We can also analyze salary, job title, and experience by gender to reveal any gaps or biases
in compensation and promotion practices, which will enable companies to create targeted
initiatives and equal opportunities for male and female employees. By identifying discrepancies
in salary by gender or by experience level, the organization can take proactive steps to
address potential inequities, which will help to strengthen the company’s reputation as an
inclusive and fair employer.
1.3 Purpose
The purpose of this project is to predict the salary of an employee based on the other factors
provided in the dataset. Because the output variable, salary, is numeric and continuous,
regression is a suitable technique for this problem.
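To illustrate what regression on a continuous output looks like, the sketch below fits a one-predictor ordinary least squares line and reports two common evaluation measures. It is written in Python for illustration only (the project itself uses RStudio), the function names are hypothetical, and the sample numbers are made up and are not taken from the dataset. The project's model would use several predictors rather than one.

```python
from math import sqrt


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def evaluate(x, y, intercept, slope):
    """Return (RMSE, R-squared) of the fitted line on the given data."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y must contain at least two distinct values")
    return sqrt(ss_res / n), 1 - ss_res / ss_tot


# Made-up illustrative values: years of experience and annual salary (USD).
years = [1, 3, 5, 7, 9]
salary = [40000, 52000, 58000, 73000, 80000]

intercept, slope = fit_simple_ols(years, salary)
rmse, r2 = evaluate(years, salary, intercept, slope)
print(f"salary ~ {intercept:.0f} + {slope:.0f} * years (RMSE={rmse:.0f}, R^2={r2:.3f})")
```

Here the slope estimates the average change in salary per additional year of experience, and RMSE and R-squared summarize how closely the line fits the data it was trained on.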
Building a model that predicts employee salaries accurately is crucial for this project.
Organizations can use such a model to support faster decision-making, operational efficiency,
and retention efforts, to identify potential salary disparities, and to make sure that
compensation practices promote both fairness and competitiveness. The insights generated from
this dataset can help organizations attract and retain top talent, offer competitive salary
packages, and enhance employee satisfaction.
Data Resources:
● The historical employee salary data that will be used for analysis and building a
regression model.
Technical Resources:
● Data Storage
● Computation Resources and Computer
● RStudio, which will be used for data cleaning, data analysis, data visualization, and
regression modeling
Human Resources:
● Data Engineer: Responsible for cleaning and transforming data to make it ready and
suitable for further analysis.
● Data Scientist: Responsible for building regression models to predict the salary of
employees.
● Project Manager: Responsible for making sure the project progresses according to plan and
that the milestones and objectives are met on time and with high quality.
1. Hypothesis 1: Higher education levels are associated with higher average salaries.
2. Hypothesis 2: Employees with more years of experience tend to have higher salaries.
3. Hypothesis 3: Salaries may not be evenly distributed across genders for senior-level
positions, which may suggest a potential gender disparity in leadership roles.
4. Hypothesis 4: Younger employees with advanced degrees have similar or higher salaries
compared to older employees with the same job titles and fewer formal qualifications.
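A first descriptive check of Hypotheses 1 and 2 could compare group means and compute a correlation. The sketch below is written in Python for illustration only (the project itself uses RStudio). The function names and dictionary keys are hypothetical, and the sample records are made up and are not taken from the dataset. These summaries only describe the data; a formal statistical test would still be needed to support or reject each hypothesis.

```python
from collections import defaultdict
from math import sqrt


def mean_salary_by_group(records, key):
    """Average the 'salary' field of the records within each value of `key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[key]] += rec["salary"]
        counts[rec[key]] += 1
    return {group: sums[group] / counts[group] for group in sums}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must contain at least two distinct values")
    return sxy / sqrt(sxx * syy)


# Made-up illustrative records; keys are assumed names, not the dataset's headers.
records = [
    {"education": "Bachelor's", "experience": 2, "salary": 50000},
    {"education": "Bachelor's", "experience": 6, "salary": 65000},
    {"education": "Master's", "experience": 4, "salary": 70000},
    {"education": "PhD", "experience": 10, "salary": 110000},
]

# Hypothesis 1: compare average salary across education levels.
print(mean_salary_by_group(records, "education"))

# Hypothesis 2: correlation between years of experience and salary.
experience = [rec["experience"] for rec in records]
salaries = [rec["salary"] for rec in records]
print(pearson_r(experience, salaries))
```

The same grouping function can be reused for Hypothesis 3 by grouping on gender within a subset of senior-level job titles.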