Income Prediction Analysis
3 authors, including:
Yashkumar Kalariya
Mercer University
All content following this page was uploaded by Yashkumar Kalariya on 13 February 2024.
1 Introduction
Income in the US is determined by several factors, including age, sex, occupation, and
educational status. Individual decisions may also affect a person's earnings. For example,
whether a person chooses to pursue a Master's or a Doctorate degree can affect that person's
income level. A good understanding of the factors that determine a high income is crucial for
informed decision making.
In this research paper, we explore the variables that predict wages for an adult in the labor
force. We highlight certain factors (independent variables) such as years of education,
sex, race, age, and hours worked per week. The dependent variable in our dataset is income.
We also examine which of the independent variables has the most significant impact on
income levels in the US and identify any existing trends in income based on the
factors listed.
2 Literature Review
Much research across different fields has studied the relationship between
income and race, gender, marital status, skill set, and many other factors. Predicting a person's
income has many benefits, including identifying suitable educational and career paths and
forecasting an individual's financial situation. The prediction model we develop for this research
assesses the variation and significance of the variables that affect income levels while
predicting individual potential earnings for informed decision making.
The goal of the data sources is to examine trends in earnings based on several factors, including
age, gender, race, and educational level. We have found supporting research and articles that
identify the difference in earnings between men and women. According to Blau and Kahn
(2017), there is a clear pay gap between men and women. Their research
highlights gender as an influential factor in determining income and earnings.
The dataset supplies detailed information on income and the variables that affect it. However, as
with most datasets, there are limitations that may call into question the accuracy of the
proposed model. For example, some variables correlate highly with each other. `Age` and
`Experience` are a clear example. As individuals age, they are likely to gain
more experience, which may in turn lead to higher wages.
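The overlap between two such predictors can be checked with a correlation coefficient and a variance inflation factor (VIF). The following is a minimal sketch on synthetic data, not on the dataset used in this paper: the way `exper` is generated from `age` is an assumption made only for illustration, and the helper names `pearson` and `vif_two_predictors` are hypothetical. With exactly two predictors, the VIF reduces to 1 / (1 - r^2), where r is their correlation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(r):
    """VIF when a model has exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r * r)


# Synthetic data (assumption): experience grows with age, plus noise.
random.seed(0)
age = [random.randint(18, 65) for _ in range(500)]
exper = [max(0.0, a - 18 + random.gauss(0, 3)) for a in age]

r = pearson(age, exper)
vif = vif_two_predictors(r)
print(f"correlation = {r:.3f}, VIF = {vif:.1f}")
```

A VIF far above 1 signals that the two predictors carry largely the same information, which inflates the standard errors of their coefficients.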
However, the Income dataset does not account for every occupation in the US or for how much
each earns. For example, minimum wages vary across states in the US, which the
dataset does not account for. Most datasets contain some bias. Although the
model has proven effective in showing the relationships between income,
gender, and educational attainment, it is important to note that omitted variables
leave room for bias in the prediction model (Analytics Vidhya, 2022).
The dataset used in this project originated from the US Bureau of Labor
Statistics. It contains 9799 observations of 23 variables. The data were collected from a survey
that highlights the factors that may affect an individual's wage in the US. Some of these factors
include hours worked, education and work experience, age, location, race, and marital status.
The summary statistics of the dataset are given in Table 1.
Of the 23 variables in this dataset, 22 are factors that affect wages in the US
according to the Current Population Survey.
The data visualization in Figure 2 shows the distribution of wage, age, educational experience,
work experience, and family income. The purpose of the distribution charts is to give a proper
understanding of how each variable is distributed, for informed decision
making. Each chart represents the distribution of one of these variables.
We also plotted the binary and count variables in our dataset. The plots in Figure 3 help us
further visualize the dataset. They show the count of individuals in the labor force by
race (i.e., Asian, Black, White), gender (i.e., the number of people who identify as
male or female), hours worked, marital status (married or unmarried), number of children,
and the location of the individuals sampled in our
dataset.
The above data did not include any missing or 'N/A' values. However, for this project, we
may exclude a few of the variables that have been listed as factors that influence wages in the
US. For example, the information on Medicaid, insurance, and Medicare can be excluded from
the dataset. We can analyze the data to confirm that there are no missing values. We will also
analyze the dataset for outliers that may affect the accuracy of our
model. Figure 4 illustrates the continuous variables in the dataset for outlier
detection.
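One common way to flag outliers in a continuous variable is the interquartile range (IQR) rule, which corresponds to the whiskers of a box plot. The sketch below is an illustration only: the paper does not state which rule was used, the function name `iqr_outliers` is hypothetical, the multiplier 1.5 is the conventional default, and the wage values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up hourly wages with one extreme value (assumption).
wages = [12, 15, 14, 18, 20, 22, 19, 16, 17, 250]
outliers = iqr_outliers(wages)
print("flagged as outliers:", outliers)
```

Whether a flagged value should be dropped, capped, or kept is a modeling decision; the rule only identifies candidates.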
Can we predict income based on the predictor variables in this dataset? The prediction model is
meant to be used in real-life situations. Here are a few instances in which the prediction can be
applied:
Navigating educational paths and career choices: Students or individuals in the labor
force can use this model to steer them toward the career path that would be most beneficial to
them financially.
Incorporating future financial planning: Individuals can use the model to
predict their future earnings based on their qualifications, for planning and budgeting.
Encouraging fair and appropriate hiring practices: Businesses and corporations may use the
model to identify which candidates merit a higher income in a competitive setting. For
example, all other factors being constant, the model predicts that a candidate with more years
of education will earn more than a candidate with fewer. In addition,
companies can adopt this model to identify gender bias in wages.
The accuracy
of the income prediction model in this case depends on the relationship between the predictor
variables and income.
In this project, we use regression models to estimate the relationship between wages
and the independent variables in our dataset.
This method aims to predict income from a set of predictor variables using two regression
models: Linear Regression and Logistic Regression. The goal is to develop suitable predictive
models to help individuals and organizations estimate income levels effectively. After the
data collection and cleaning process, we can explore our predictor variables appropriately.
5.1.2 Linear Regression
The above results summarize a linear regression model that predicts 'wages' based on
the predictor variables. The coefficients represent the estimated effect of each predictor
variable on wage. Significant predictors (p-value < 0.05) include 'educ', 'exper', 'faminc',
'nchild', 'black', 'female', 'metro', 'Midwest', and 'south'. The R-squared value of 0.2362 shows
that the model explains 23.62% of the variation in wages. The larger the absolute value of a
coefficient, the stronger the association between a one-unit change in that variable and
wages. In this case, education has one of the largest positive coefficients (2.62), indicating
that, on average, each added unit of education is associated with an increase in wage of 2.62
units. We therefore regard 'educ' as the most efficient predictor of wages in this model. Table 3
gives a summary of the linear regression model. Table 4 shows the VIF values of the predictor
variables from the linear regression model.
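To make the roles of the coefficients and of R-squared concrete, the following sketch fits an ordinary least squares model with the normal equations and computes R-squared as 1 - SS_res / SS_tot. It is an illustration under stated assumptions, not the code behind Table 3: the data are synthetic, only two predictors are used, the generating coefficients (2.6 and -4.4) are chosen by us, and the function names are hypothetical.

```python
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y_i for row, y_i in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(X, y, beta):
    """Share of the variation in y explained by the fitted model."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((y_i - p) ** 2 for y_i, p in zip(y, preds))
    ss_tot = sum((y_i - mean_y) ** 2 for y_i in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): wage depends on education and gender plus noise.
random.seed(1)
X, y = [], []
for _ in range(5000):
    educ = random.randint(8, 20)
    female = random.randint(0, 1)
    X.append([1.0, educ, female])  # leading 1.0 is the intercept column
    y.append(-10 + 2.6 * educ - 4.4 * female + random.gauss(0, 10))

beta = ols(X, y)
r2 = r_squared(X, y, beta)
print("intercept, educ, female:", [round(b, 3) for b in beta])
print("R-squared:", round(r2, 4))
```

The fitted coefficients land close to the generating values, and R-squared stays well below 1 because of the noise term, which mirrors how a model can have significant predictors and still explain only part of the variation in wages.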
In addition to 'educ' (Education), which we regard as the most efficient predictor of wages based
on its coefficient magnitude, several other predictors are statistically significant and have
meaningful coefficients, indicating their usefulness in predicting wages.
faminc (Family Income): The coefficient for faminc is positive (0.000016), suggesting that
higher family income is associated with higher wages. While the coefficient is small, it is
statistically significant.
nchild (Number of Children): The coefficient for nchild is 1.11, indicating that having more
children is associated with higher wages, on average.
metro (Metropolitan Area): Living in a metropolitan area has a positive impact on wages, as
indicated by the coefficient of 3.097.
female (Gender): The coefficient for female is -4.440, indicating that being female is associated
with lower wages, on average.
Midwest (Region - Midwest): Residing in the Midwest region has a negative impact on wages, as
shown by the coefficient of -2.104.
south (Region - South): Similarly, living in the South region is associated with lower wages, as
indicated by the coefficient of -0.791.
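As a worked example of how these coefficients combine, the sketch below computes the difference in predicted wage between two hypothetical individuals. The coefficients are the ones reported above; the two profiles are invented for illustration, and any predictor not listed in a profile (including 'exper' and 'black', whose coefficients are not quoted in the text) is assumed to be equal for both people, so it cancels out of the difference.

```python
# Linear regression coefficients as reported in the text above.
coefficients = {
    "educ": 2.62,
    "faminc": 0.000016,
    "nchild": 1.11,
    "metro": 3.097,
    "female": -4.440,
    "Midwest": -2.104,
    "south": -0.791,
}


def predicted_wage_difference(profile_a, profile_b):
    """Predicted wage of A minus predicted wage of B.

    Variables missing from both profiles are treated as equal and cancel.
    """
    return sum(
        coef * (profile_a.get(name, 0) - profile_b.get(name, 0))
        for name, coef in coefficients.items()
    )


# Hypothetical profiles (assumption): they differ in three variables only.
person_a = {"educ": 16, "metro": 1, "female": 0}
person_b = {"educ": 12, "metro": 0, "female": 1}

difference = predicted_wage_difference(person_a, person_b)
print(f"predicted wage difference (A - B): {difference:.3f} units")
```

Here the four extra years of education contribute 4 x 2.62 = 10.48 units, metropolitan residence adds 3.097, and the gender coefficient adds 4.440, for a total of 18.017 units.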
The second model we use to predict income is the Logistic Regression
Model. To do this, the response variable wage must first be converted to a binary variable. Table 5
gives a summary of the logistic regression model. Table 6 shows the VIF values of the predictor
variables from the logistic regression model.
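The two steps described here, turning wage into a binary outcome and fitting a logistic model, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure behind Table 5: the data are synthetic, the median is used as a hypothetical cut-off because the text does not state the threshold, a single standardized predictor is used, and the model is fitted by plain gradient descent on the log-loss.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, steps=1000):
    """Fit log-odds = b0 + b1 * x by gradient descent on the mean log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Synthetic data (assumption): wage rises with education plus noise.
random.seed(2)
educ = [random.randint(8, 20) for _ in range(500)]
wage = [2.6 * e + random.gauss(0, 10) for e in educ]

# Binarize the response; the median cut-off is a hypothetical choice.
threshold = statistics.median(wage)
wage_binary = [1 if w > threshold else 0 for w in wage]

# Standardize the predictor so gradient descent converges quickly.
mean_e = statistics.mean(educ)
sd_e = statistics.pstdev(educ)
z = [(e - mean_e) / sd_e for e in educ]

b0, b1 = fit_logistic(z, wage_binary)
preds = [1 if sigmoid(b0 + b1 * v) >= 0.5 else 0 for v in z]
accuracy = sum(p == y for p, y in zip(preds, wage_binary)) / len(preds)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, accuracy = {accuracy:.3f}")
```

A positive `b1` means that more education raises the log-odds of falling in the high-wage class, which is the same reading applied to the coefficients discussed below.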
Intercept: The intercept represents the estimated log-odds of the binary outcome variable (wage
binary) when all predictor variables are set to zero. In this case, it is approximately -7.60548.
The negative sign indicates that the estimated probability of the outcome is below 0.5 when all
predictors are zero.
Predictor Variables:
educ: For a one-unit increase in education (e.g., one additional year of education), the log-odds
of the binary outcome are expected to increase by approximately 0.42135. The p-value (0.00000)
suggests that education is highly statistically significant.
exper: For a one-unit increase in experience, the log-odds of the binary outcome are expected to
increase by approximately 0.02908. The p-value (0.00000) indicates that experience is highly
statistically significant.
faminc: The coefficient rounds to zero (0.00000), suggesting that a one-unit change in family
income has a negligible effect on the log-odds of the binary outcome. However, it is
statistically significant (p-value: 0.00028).
Hrswork: For a one-unit increase in hours worked, the log-odds of the binary outcome are
expected to increase by approximately 0.00424. The p-value (0.14612) suggests that hours
worked are not statistically significant at a typical significance level of 0.05.
nchild: For a one-unit increase in the number of children, the log-odds of the binary outcome are
expected to increase by approximately 0.15371. The p-value (0.00000) indicates that the number
of children is highly statistically significant.
Asian: The coefficient is negative (-0.02795), suggesting that being Asian is associated with a
decrease in the log-odds of the binary outcome, but the effect is not statistically significant (p-
value: 0.79560).
black: For individuals who are Black, the log-odds of the binary outcome are expected to
decrease by approximately 0.49815. The p-value (0.00000) suggests that race (being Black) is
highly statistically significant.
divorced: Being divorced is associated with an increase in the log-odds of the binary outcome by
approximately 0.07800, but the effect is not statistically significant (p-value: 0.29488).
female: Being female is associated with a decrease in the log-odds of the binary outcome of
approximately 0.65940. The p-value (0.00000) suggests that gender (being female) is highly
statistically significant.
metro: Living in a metropolitan area is associated with an increase in the log-odds of the binary
outcome of approximately 0.52112. The p-value (0.00000) shows that metro status is highly
statistically significant.
Midwest: Living in the Midwest is associated with a decrease in the log-odds of the binary
outcome of approximately 0.16256. The p-value (0.01903) suggests that living in the Midwest
is statistically significant.
northeast: Living in the Northeast is associated with a slight increase in the log-odds of the
binary outcome by approximately 0.05265, but the effect is not statistically significant (p-value:
0.46397).
south: Living in the South is associated with a decrease in the log-odds of the binary outcome of
approximately 0.06954, but the effect is not statistically significant (p-value: 0.29559).
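Log-odds coefficients are easier to read once converted to odds ratios, since exp(coefficient) is the factor by which the odds of the binary outcome are multiplied for a one-unit increase in the predictor. The short sketch below applies this to the statistically significant coefficients reported above and converts the intercept into a baseline probability. Only the arithmetic is added here; the variable names in the code are our own.

```python
import math

# Statistically significant logistic coefficients as reported above.
log_odds_coefficients = {
    "educ": 0.42135,
    "exper": 0.02908,
    "nchild": 0.15371,
    "black": -0.49815,
    "female": -0.65940,
    "metro": 0.52112,
    "Midwest": -0.16256,
}

# Odds ratio: multiplicative change in the odds per one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in log_odds_coefficients.items()}

# Probability implied by the intercept when every predictor equals zero.
intercept = -7.60548
baseline_probability = 1.0 / (1.0 + math.exp(-intercept))

for name, ratio in odds_ratios.items():
    print(f"{name:8s} odds ratio = {ratio:.3f}")
print(f"baseline probability = {baseline_probability:.6f}")
```

An odds ratio above 1 (for example 'educ' or 'metro') raises the odds of the outcome, while a ratio below 1 (for example 'female' or 'black') lowers them. The baseline probability is very small because a person with zero years of education and zero for every other predictor lies far outside the observed data.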
6 Conclusion
In conclusion, both the linear and logistic regression models have shown promise in predicting
income. Table 7 summarizes both the linear and logistic models. However, when
choosing between the two, the logistic regression model is our preferred choice. This
decision is based on the nature of the problem. The logistic regression model also gives a more
realistic result given the predictor variables in our dataset. For example, in our logistic model,
the results show a positive coefficient for the number of hours worked, suggesting a positive
association between hours worked and income, although this coefficient is not statistically
significant at the 0.05 level (p-value: 0.14612). Predicting income levels often involves
classifying individuals into income categories, making it a classification problem. Logistic
regression is well suited to such tasks because it gives a clear probability-based classification,
making it easier to interpret and act upon. Therefore, due to its suitability for the income
prediction problem and its ability to give more insight, the logistic regression model is our
preferred choice.
References
Aamir Ali A. "Adult Census Income-Analysis." Medium, November 11, 2019. URL:
https://medium.com/data-warriors/eda-of-adult-census-income-dataset-cc9ac1a3d552
Mincer, Jacob A. "Introduction to Schooling, Earnings and Income." National Bureau of Economic
Research, January 1974. URL:
https://www.nber.org/books-and-chapters/schooling-experience-and-earnings/introduction-schooling-experience-and-earnings
Blau, Francine D., and Lawrence M. Kahn. 2017. "The Gender Wage Gap: Extent, Trends, and
Explanations." Journal of Economic Literature, 55 (3): 789-865. DOI: 10.1257/jel.20160995. URL:
https://www.aeaweb.org/articles?id=10.1257/jel.20160995
Dataset: cpsdata_PoEbook.csv
Appendix
Tables
Figures