Report
This report investigates the factors that affect earning potential. Our analysis explored the relationship
between various demographic attributes and the likelihood of earning over 50k. We used a dataset from the
2022 US Census and applied two machine learning models, Logistic Regression and Random Forest, to identify
the most significant factors in determining earning potential. Our findings suggest that attributes such as
education level, gender, and average hours worked per week have a significant impact on the likelihood of
earning over 50k.
However, it is important to note that our findings are not conclusive, and there are other factors that could
potentially affect earning potential that we did not consider due to time constraints. Overall, our analysis
provides valuable insight into the relationship between various attributes and earning potential. This report
concludes with several potential implications of our findings, including employment policies, workforce
diversity initiatives, and work-life balance issues. Further research is necessary to fully understand the impact
of these factors on earning potential, but our analysis represents an important step in the right direction.
After pre-processing the data and training the models, we evaluated their performance using three metrics:
accuracy, area under the Receiver Operating Characteristic curve (AUC), and F1 score.
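The sketch below illustrates how these three metrics are defined, written in plain Python so that each
definition is visible. The toy labels and scores are hypothetical and are not drawn from the census data,
and the function names are chosen for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = earns over 50k, 0 = does not.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]   # predicted probabilities
y_pred = [int(s >= 0.5) for s in scores]   # class labels at a 0.5 threshold

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 are computed from thresholded class labels, whereas AUC is computed from the predicted
scores themselves, so it does not depend on the choice of threshold.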
Our results showed that the random forest model had better prediction performance than the logistic
regression model, with higher accuracy, AUC, and F1 score. This suggests that the random forest model was
able to capture more complex relationships between the predictor variables and the target variable than the
logistic regression model. Other classification models, such as support vector machines, have also been used
to predict income levels on similar datasets.
However, the performance of these models varies depending on the dataset and the specific problem being
addressed. The comparative advantages and disadvantages of the two models used in this project are as
follows:
• Logistic Regression (LR): We used this to analyse the relationship between certain factors and the
likelihood of earning over 50k. Binary logistic regression was a natural fit for this project because
the target has two classes: earning over 50k or not.
Advantages:
1. Interpretable: it is easy to understand the coefficients and the contribution of each predictor
variable to the prediction.
2. Computationally efficient: logistic regression models are faster to train and to predict with than
more complex models such as random forests, which we also used.
3. Works well with small datasets, particularly when the classes are close to linearly separable.
Disadvantages:
1. Assumes linear relationships: LR assumes that the relationship between the predictor variables and
the log-odds of the target variable is linear. If the relationship is non-linear, the performance of the
model may suffer.
2. May underfit: if the model is too simple, it may not capture the complexity of the data.
• Random Forest (RF): We used this model, which constructs a multitude of decision trees, to determine
the most important variables for predicting whether someone earns over 50k. It allowed us to identify
strong associations between attributes and earning potential, and we obtained more accurate
predictions with it than with LR.
Advantages:
1. Can handle non-linear relationships: unlike LR, RF can capture non-linear relationships between the
predictor variables and the target variable.
2. Can ‘ignore’ outliers: averaging the predictions of multiple decision trees dampens their effect.
Disadvantages:
1. Computationally expensive: RF is more computationally expensive than LR.
2. Harder to interpret: The predictions of RF models are harder to interpret compared to LR models.
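To make the interpretability point about LR concrete, the sketch below shows how a fitted logistic
regression turns coefficients into a probability, and how each coefficient can be read as an odds ratio.
The intercept, coefficients, and feature names are hypothetical values chosen for illustration; they are
not the coefficients fitted in this project.

```python
import math

# Hypothetical coefficients for illustration only; not fitted to the census data.
intercept = -4.0
coefficients = {"education_years": 0.30, "hours_per_week": 0.03}


def predict_probability(person):
    """Logistic regression: sigmoid of the intercept plus a weighted sum of features."""
    z = intercept + sum(coefficients[name] * person[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios():
    """exp(coefficient) is the multiplicative change in odds per one-unit increase."""
    return {name: math.exp(c) for name, c in coefficients.items()}


person = {"education_years": 13, "hours_per_week": 40}
print(round(predict_probability(person), 3))
print({name: round(ratio, 3) for name, ratio in odds_ratios().items()})
```

An odds ratio above 1 means the feature raises the odds of earning over 50k, holding the other features
fixed. An RF has no equivalent single number per feature, which is why its predictions are harder to explain.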
As for a direct comparison between the two machine learning models, the RF model appears to be the better
predictor of earning potential given the variables we opted for:
• Accuracy on the test set: RF scored 0.8558 compared to LR’s lower 0.7619
• AUC score: RF > LR again, as 0.7678 > 0.7593
• F1 score: RF > LR, as 0.6631 > 0.5994
As for improvements, we could try tuning the hyperparameters, for example increasing the number of trees in
the RF model to more than 200. Additionally, we could try to address the class imbalance in the target
variable by using techniques like oversampling or undersampling. Finally, we could analyse the importance of
the predictor variables and identify which ones matter most in predicting the income level.
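As a minimal sketch of the oversampling idea mentioned above, the function below duplicates randomly chosen
rows from the smaller class until every class is the same size. The function name, the toy rows, and the
fixed seed are assumptions made for illustration; in practice this would be applied to the training split
only, so that duplicated rows do not leak into the test set.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the largest class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy rows: [education_years, hours_per_week]; label 1 = earns over 50k.
rows = [[9, 40], [10, 35], [13, 45], [12, 38], [16, 50]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```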
We began with three main hypotheses, which we discussed and were curious about as a group:
• The higher the level of education achieved, the more likely it is that the individual makes above 50k
• There will be more males that earn above 50k than females
• The more hours an individual works, the more likely they are to earn above 50k
Going through each preliminary hypothesis and evaluating whether our initial thoughts matched what our data
analysis yielded, we found that our hypothesis of a positive correlation between the average years spent in
education and the number of individuals earning over 50k had some merit. This finding indicates that higher
education levels tend to correspond with higher salaries, as seen below.
However, if we look at the specific levels of education reached by individuals in our training set, our
hypothesis, to be fully correct, should have shown a higher proportion of Doctorate holders (the highest level
of education recorded) earning over 50k, but it did not. It seems that a higher proportion of Doctorate
holders did not earn over 50k than did. There may be several reasons for this, but the most obvious is that
our sample of Doctorate holders was not large enough to support a fair conclusion, as you can see from the
graph below.
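One way to quantify how much a small sample limits such a conclusion is a confidence interval on the
proportion. The sketch below uses the Wilson score interval, which is our choice of technique for
illustration rather than something applied in the analysis above, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts, not taken from the census data.
small = wilson_interval(4, 10)       # a rare education level: 4 of 10 over 50k
large = wilson_interval(400, 1000)   # the same 40% share in a much larger group
print(small, large)
```

With the same observed share, the interval for the small group is far wider, so a surprising proportion in
a rare category is weak evidence on its own.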
Overall, the first hypothesis held up well in general, but we have to question why our highest education
level did not fit it. Further analysis would be needed to determine whether there are diminishing returns to
education, where the correlation with earning over 50k begins to weaken after a certain point, as seen below.
Now moving on to our second initial hypothesis: There will be more males that earn above 50k than females.
Our analysis shows more males than females earning over 50k, as seen in the graph below. However, this may
be partly due to the significantly larger number of men than women included in the census data, which could
lead to an unfair representation. Nevertheless, even when we consider the proportion of women within their
own group, the results show a lower proportion of them making over 50k than their male counterparts. This
suggests that gender does have a statistically significant correlation with earning potential.
Overall, I think the statistically significant correlation between gender and earning potential shown in our
analysis supports our initial hypothesis.
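The distinction drawn above between raw counts and within-group proportions can be sketched as follows. The
records are hypothetical toy data in which one group is deliberately larger than the other, and the function
name is chosen for illustration.

```python
from collections import Counter


def share_over_50k(records):
    """Return {group: share of that group earning over 50k}."""
    totals, high = Counter(), Counter()
    for group, earns_over_50k in records:
        totals[group] += 1
        high[group] += int(earns_over_50k)
    return {group: high[group] / totals[group] for group in totals}


# Hypothetical toy records: (gender, earns over 50k). Not the census data.
records = [
    ("Male", True), ("Male", False), ("Male", True), ("Male", False),
    ("Male", False), ("Male", True),
    ("Female", False), ("Female", True), ("Female", False), ("Female", False),
]
shares = share_over_50k(records)
print(shares)
```

Dividing by each group's own size removes the effect of one group simply having more rows, which is what
makes the comparison between groups fair.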
Thirdly, we can discuss the validity of our last hypothesis: The more hours an individual works, the more
likely they are to earn above 50k. This seemed like the most intuitive and logical hypothesis we could make
at the time, and it also turned out to be the one that matched our expectations most closely. As you can see
from our model, there is a statistically significant correlation between the number of hours an individual
works and the likelihood of earning above the 50k threshold.
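A correlation between a numeric attribute and a yes/no outcome can be measured with the Pearson coefficient,
which in this setting is known as the point-biserial correlation. The sketch below is one simple way to
compute it; the toy values are hypothetical and this is not necessarily the measure behind the result above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, not the census data.
hours = [20, 35, 40, 40, 50, 60]   # hours worked per week
over_50k = [0, 0, 0, 1, 1, 1]      # 1 = earns over 50k
r = pearson_r(hours, over_50k)
print(round(r, 3))
```

A value near +1 means longer hours go together with earning over 50k, a value near 0 means no linear
association, and in either case the coefficient describes association rather than cause.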
Linking some of our findings together, the average hours worked may be a plausible reason for the pay
discrepancy we observe between the genders. Women sometimes take on more traditional roles in the household,
which leaves them less time for paid work and limits the number of hours they work, and fewer hours are
clearly correlated with earning less than the 50k threshold.
Regarding the possible impacts of our findings and research, we can clearly observe a strong correlation
between certain demographic characteristics, such as education level, gender and average hours worked per
week, and the likelihood of earning above the 50k threshold. There are many implications (and possibly
consequences) of our findings, but we would need to conduct further research and gather more data to fully
support our ideas. Some implications concern employment policies, workforce diversity initiatives, and
work-life balance issues.
From our research, we concluded that there was a clear discrepancy in earning potential between males and
females. The consequences of this would be detrimental to corporations and hiring teams as a whole, as there
seems to be a bias towards a specific gender. In saying this, we must remember that our findings are not
conclusive, and there were many issues we did not tackle, so we have only scratched the surface of this
matter. If a hiring team were to adopt models similar to ours, it could work towards a fairer and more
equitable pay structure and hiring process, helping to even out the gender wage gap that does seem to exist
in our data.
Another implication our findings may allude to is that, to clear that coveted 50k threshold, one may have to
sacrifice work-life balance and lean more towards work in order to increase earning potential. In a world
where mental health is becoming more prominent and is at the forefront of everyone’s minds, corporations may
face pressure to reduce hours and find more effective ways to increase productivity than simply increasing
hours worked.
Moving on to the ethical concerns of our project, we encountered a few. A usual issue would be limited data,
but I do not think this was a major issue for us in this project. I think a main concern was the lack of
representation in our data, whether in the balance of males and females surveyed or in racial and ethnic
diversity. A lack of representation may have skewed some of the conclusions we drew from our data. In terms
of ethics, it is well worth noting that personal and sensitive data should be protected and used
appropriately.
Explainability, and the risk of not always understanding what you are building, was another ethical issue in
this project. Inferring relationships between attributes in an attempt to find the most interesting
correlations may not always be the fairest way to analyse big data. It was important for us to consider the
explainability of our results and findings by giving clear explanations and by making the relevance of the
patterns and correlations we produced a high priority.
To wrap up our project, we successfully found many clear correlations between certain attributes and the
likelihood of earning over 50k. It is important to note that our findings are not conclusive. There were many
strengths to our approach, but the main one was using multiple models (RF and LR) to identify the most
important factors in analysing earning potential, which increases the validity of our findings. Using the
2022 census data set was another advantage, as our findings are replicable and verifiable by other
researchers.
With good moves always come blunders, and one disadvantage of our conclusions is that we did not consider
all the other factors relevant to earning potential, such as location and industry job experience, although
that can be excused due to time constraints. Another disadvantage of our findings is that the dataset was
limited to 2022, so it represents a specific point in time rather than the whole job market. Overall, I
think we produced some great insight into some of the factors concerning earning potential.