
This document summarizes a study that used logistic regression and random forest models to predict income level using 2022 US Census data. The random forest model had better prediction performance, accurately predicting income over $50k based on attributes like education, gender, and hours worked. While the models supported hypotheses that higher education and more hours worked correlate with higher income, they also suggested gender gaps exist. The study concludes more research is needed but the findings provide insight into relationships between attributes and income levels.

Uploaded by

王大锤


Predicting Income Levels Using 2022 US Census Data: A Comparison of Logistic Regression and Random Forest

In this project, we aimed to predict the income level of individuals based on various demographic and
socioeconomic factors using two main classification models: logistic regression and random forest.

This report investigates the factors that affect earning potential. Our analysis explores the relationship
between demographic attributes and the likelihood of earning over 50k. We used a dataset from the 2022 US
Census and applied two machine learning models, logistic regression and random forest, to identify the most
significant factors in determining earning potential. Our findings suggest that attributes such as education
level, gender, and average hours worked per week are strongly associated with the likelihood of earning over
50k.

However, our findings are not conclusive: other factors that could affect earning potential were not
considered because of time constraints. Overall, our analysis provides insight into the relationship between
various attributes and earning potential. The report closes with several potential implications of our
findings for employment policies, workforce diversity initiatives, and work-life balance. Further research is
necessary to fully understand the impact of these factors on earning potential, but our analysis is a first
step in that direction.

After pre-processing the data and training the models, we evaluated their performance using three metrics:
accuracy, area under the Receiver Operating Characteristic curve (AUC), and F1 score.
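
To make the three metrics concrete, here is a minimal sketch that computes each one from its definition in plain Python. The toy labels and predicted probabilities are hypothetical and are not taken from the census data; the report does not say which library or threshold was used, so the 0.5 cut-off is an assumption.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is fine
    for a small illustration but slow for a full census dataset.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels (1 = earns over 50k) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
# Assumed decision threshold of 0.5 to turn probabilities into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 are computed from the thresholded labels, while AUC is computed from the scores themselves, so it does not depend on the choice of threshold.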

Our results showed that the random forest model had better prediction performance than the logistic
regression model, with higher accuracy, AUC, and F1 score. This suggests that the random forest model was
able to capture more complex relationships between the predictor variables and the target variable than the
logistic regression model. Other classification models, such as support vector machines, have also been
used to predict income levels on similar datasets.

However, the performance of these models varies depending on the dataset and the specific problem being
addressed. The comparative advantages and disadvantages of the two models used in this project are as
follows:

• Logistic Regression (LR): We used this to analyse the relationship between certain factors and the
likelihood of earning over 50k. Binary logistic regression suited this project because it models the
probability of a two-class outcome, here whether or not a person earns over 50k.

Advantages:
1. Interpretable: the coefficients show the direction and size of each predictor variable's contribution
to the prediction.
2. Computationally efficient: LR is faster to train and to predict with than more complex models such as
random forests, which we also used.
3. Works well with small datasets: LR performs well when the dataset is small and the classes are close
to linearly separable.

Disadvantages:
1. Assumes linear relationships: LR assumes that the log-odds of the target variable are a linear function
of the predictor variables. If the relationship is non-linear, the performance of the model may suffer.
2. May underfit: because the model is simple, it may not capture the complexity of the data.
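
The interpretability point can be illustrated with a short sketch. It assumes scikit-learn and NumPy are available (the report does not name its tooling) and uses synthetic stand-in predictors, not the census data. The variable names and the data-generating coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for two predictors (NOT the census data).
education_years = rng.normal(12, 3, n)
hours_per_week = rng.normal(40, 10, n)

# Assumed data-generating process: log-odds are linear in both predictors.
log_odds = -14.0 + 0.5 * education_years + 0.15 * hours_per_week
p = 1.0 / (1.0 + np.exp(-log_odds))
earns_over_50k = rng.binomial(1, p)

X = np.column_stack([education_years, hours_per_week])
model = LogisticRegression(max_iter=1000).fit(X, earns_over_50k)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["education_years", "hours_per_week"], odds_ratios):
    print(f"{name}: odds ratio per unit = {ratio:.2f}")
```

An odds ratio above 1 means the predictor raises the odds of the positive class, holding the other predictor fixed. This direct reading is what a random forest does not offer.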

• Random Forest (RF): This model builds a multitude of decision trees and combines their predictions. We
used it in this project to determine the most important variables for predicting whether someone earns
over 50k. It allowed us to identify strong associations between attributes and earning potential, and
it gave more accurate predictions than LR.

Advantages:
1. Handles non-linear relationships: unlike LR, RF can model non-linear relationships between the
predictor variables and the target variable.
2. Robust to outliers: combining the predictions of many decision trees reduces the influence of outliers.

Disadvantages:
1. Computationally expensive: RF is more computationally expensive than LR.
2. Harder to interpret: The predictions of RF models are harder to interpret compared to LR models.
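
The trade-off between the two models can be seen on a deliberately non-linear toy problem. This is a sketch assuming scikit-learn and NumPy; the XOR-style data is synthetic and chosen to favour RF, so it says nothing about the size of the gap on the census data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(2000, 2))
# XOR-style target: positive when the two features have the same sign.
# No single straight line separates the classes.
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lr = LogisticRegression().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

lr_acc = lr.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(f"LR accuracy: {lr_acc:.3f}, RF accuracy: {rf_acc:.3f}")

# Impurity-based importances sum to 1 across features. They rank variables
# but, unlike LR coefficients, give no direction of effect.
print(rf.feature_importances_)
```

On data like this, LR stays near chance while RF recovers the pattern, at the cost of longer training and a model that can only be summarised through aggregate measures such as feature importances.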

In a direct comparison of the two machine learning models, RF appears to be the better predictor of earning
potential for the variables we chose (values rounded to four decimal places):

• Accuracy on the test set: RF 0.8558, LR 0.7619
• AUC score: RF 0.7678, LR 0.7593
• F1 score: RF 0.6631, LR 0.5994

As for improvements, we could tune the hyperparameters, for example by increasing the number of trees in the
RF model (to more than 200). We could also address the class imbalance in the target variable with techniques
such as oversampling or under-sampling. Finally, we could analyse the importance of the predictor variables to
identify which matter most in predicting income level.
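
As an illustration of the oversampling idea, here is a minimal standard-library sketch of random oversampling. The function name `random_oversample` and the toy rows are hypothetical. In practice this would be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def random_oversample(
    rows: Sequence[Sequence[float]], labels: Sequence[int], seed: int = 0
) -> Tuple[List[Sequence[float]], List[int]]:
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical imbalanced training set: 6 rows at or below 50k, 2 above.
# Each row is [years of education, hours per week].
rows = [[9, 35], [10, 40], [12, 38], [11, 30], [13, 40], [12, 45], [16, 50], [18, 60]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Under-sampling is the mirror image: rows are dropped from the majority class instead of duplicated in the minority class.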

As a group, we started with three main hypotheses:

• The higher the level of education achieved, the more likely it is that an individual makes above 50k
• There will be more males that earn above 50k than females
• The more hours an individual works, the more likely they are to earn above 50k

Evaluating each preliminary hypothesis against what our data analysis yielded, we found that our hypothesis
of a positive correlation between the average years spent in education and the likelihood of earning over
50k had some merit. This finding indicates that higher education levels tend to correspond with higher
salaries, as seen below.

However, when we look at the specific levels of education reached by individuals in our training set, the
hypothesis does not fully hold. For it to be fully correct, a higher proportion of Doctorate holders (the
highest level of education recorded) should have earned over 50k, but this was not the case: a higher
proportion of Doctorate holders did not earn over 50k than did. There may be several reasons for this, but
the most obvious is that our sample of Doctorate holders was too small to support a fair conclusion, as the
accompanying graph shows.

Overall, the first hypothesis held up well in general, but we have to ask why the highest education level did
not fit it. Further analysis would be needed to determine whether there are diminishing returns to education,
where the correlation with earning over 50k begins to weaken after a certain point, as the accompanying graph
suggests.
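
The small-sample concern raised for Doctorate holders can be made concrete with a confidence interval. The sketch here is not part of the original analysis: it uses the Wilson score interval, which is one common choice, and hypothetical counts rather than the report's figures.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same 40% observed rate at two sample sizes.
small = wilson_interval(successes=4, n=10)
large = wilson_interval(successes=400, n=1000)
print(f"n=10:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=1000: {large[0]:.2f} to {large[1]:.2f}")
```

With only ten observations the interval is wide enough to include 50%, so an observed minority of high earners in a small group cannot be distinguished from a true majority. Reporting group sizes next to each proportion would let readers judge this directly.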

Moving on to our second hypothesis: there will be more males that earn above 50k than females. Our analysis
shows more men than women earning over 50k, as seen in the graph below. However, this may be partly due to
the significantly larger number of men than women in the census data, which could lead to an unfair
representation. Nevertheless, even when we consider proportions rather than counts, a lower proportion of
women than men make over 50k. This suggests that gender does have a statistically significant correlation
with earning potential.
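
The report does not state which test supports the "statistically significant" claim. One common check for a difference between two group proportions is a two-proportion z-test, sketched here with the standard library. The counts are hypothetical and are not the census figures.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for H0: p1 == p2,
    using the pooled-proportion normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 3000 of 10000 men and 550 of 5000 women over 50k.
z, p_value = two_proportion_z_test(3000, 10000, 550, 5000)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

Because the test compares proportions, unequal group sizes are already accounted for. A significant result shows an association in the sample, not its cause.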

Overall, we think the statistically significant correlation between gender and earning potential shown in our
analysis supports our initial hypothesis.

Thirdly, we can discuss the validity of our last hypothesis: the more hours an individual works, the more
likely they are to earn above 50k. This seemed like the most intuitive and logical hypothesis we could make
at the time, and it also turned out to be the one that matched our expectations most closely. As our model
shows, there is a statistically significant correlation between the number of hours an individual works and
the likelihood of earning above the 50k threshold.

Linking some of our findings together, average hours worked may be a plausible reason for the pay
discrepancy we observe between genders. Women sometimes take on more traditional roles in the household,
which leaves them less time for paid work and limits the number of hours they work, and fewer hours are
correlated with earning less than the 50k threshold.

Regarding possible impacts of our findings, we observe a strong correlation between certain attributes or
demographic characteristics, such as education level, gender, and average hours worked per week, and the
likelihood of earning above the 50k threshold. Our findings have many implications (and possibly
consequences), but we would need to conduct further research and gather more data to substantiate them.
Some of these implications concern employment policies, workforce diversity initiatives, and work-life
balance.

From our research, we concluded that there was a clear discrepancy in earning potential between males and
females. This would be detrimental to corporations and hiring teams as a whole, as there seems to be a
bias towards hiring a specific gender. That said, we must remember that our findings are not conclusive:
we only scratched the surface, and there were many issues we did not tackle that would be needed for a full
picture. A hiring team that adopted a model or models similar to ours could use them to work towards a
fairer and more equitable pay structure and hiring process, helping to even out the gender wage gap that
does seem to exist in our data.

Another implication our findings may allude to is that, to clear the 50k threshold, one may have to
sacrifice work-life balance and lean more towards work in order to increase earning potential. In a world
where mental health is increasingly at the forefront of everyone's minds, corporations may come under
pressure to reduce hours and find more effective ways to increase productivity than simply increasing hours
worked.

Moving on to the ethical concerns of our project, we encountered a few. A common issue is limited data, but
we do not think this was a major issue in this project. Our main concern was the lack of representation in
our data, whether in the balance of males and females surveyed or in racial and ethnic diversity. A lack of
representation may have skewed some of the conclusions we drew from our data. In terms of ethics, it is also
worth noting that personal and sensitive data should be protected and used appropriately.

Explainability, and the risk of not fully understanding what a model is doing, was another ethical issue in
this project. Inferring certain attributes from others in an attempt to find the most interesting
correlations may not always be the fairest way to analyse big data. It was important for us to consider the
explainability of our results and findings, by giving clear explanations and making the relevance of the
patterns and correlations we produced a high priority.

To wrap up our project, we found many clear correlations between certain attributes and the likelihood of
earning over 50k. It is important to note that our findings are not conclusive. Our approach had several
strengths. The main one was using multiple models (RF and LR) to identify the most important factors in
analysing earning potential, which increased the accuracy and validity of our findings. Using the 2022 census
dataset was another advantage, as our findings can be replicated and verified by other researchers.

With good moves always come blunders. One disadvantage of our conclusions is that we did not consider all
other factors relevant to earning potential, such as location and industry job experience, although that can
be excused given the time constraints. Another disadvantage is that the dataset was limited to 2022, so it
represents a specific point in time rather than the whole job market. Overall, we think we produced some
useful insight into some of the factors concerning earning potential.
