DS Final Project

The DS 2010 Final Data Project by Team Lavender investigates the correlation between academic major, college type, and mid-career salaries for U.S. graduates using the 'College Salaries' dataset from Kaggle. The project employs a Random Forest model, achieving a high R² score of 0.9748, indicating strong predictive capabilities. The findings aim to assist students and educational institutions in making informed decisions regarding majors and career prospects.


DS 2010 Final Data Project

College Graduate Salaries


Dataset - Salaries by college, region, and academic major

Team Lavender
Vasudha Arora, Renisha Rana and Jack Jancaric
This project explores how various factors, including
academic major and college type, correlate with starting and
mid-career salaries for U.S. graduates.

The research question addresses whether mid-career
earnings can be accurately predicted based on these
factors, using data-driven approaches and predictive
modeling techniques for analysis.

Are differences in mid-career salaries proportionate to a
graduate's chosen major and college type?

To what extent do these variables influence earning
potential over time?
The "College Salaries" dataset from Kaggle includes data on median early-career and mid-
career salaries for graduates from various majors and colleges in the U.S. It provides details
such as major, type of college, and median earnings, which are ideal for building predictive
models for mid-career salaries.

Research Question: Can we predict a graduate's mid-career salary based on
their major and the type of college they attended?

Expectation: We expect to find that certain majors and specific college types
may have a strong correlation with mid-career salaries.

Benefit of the project: This project will provide valuable insights to students,
helping them make informed decisions about their choice of major and college by
understanding potential salary outcomes. Additionally, it offers educational
institutions a clearer view of the economic impact of their programs, assisting
them in aligning their offerings with career prospects.

Dataset URL
Kaggle - College Salaries Dataset
How did you get your data?
URL Kaggle - College Salaries Dataset
Downloaded through Kaggle
File Type .CSV
Yes, we found this data set on our own using Kaggle.
How did you clean your data?
Checked for Duplicates
Standardized column names.
Removed dollar signs and commas from salary columns.

How did you prepare it?
Checked for outliers.
Converted salary columns to numeric data types.

Did you need to get another related dataset?
No, we didn't need another dataset to answer the question above.
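The cleaning steps listed above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the column names and rows here are assumed stand-ins for the Kaggle CSV's schema.

```python
import pandas as pd

# Hypothetical rows mimicking the raw CSV, where salaries are strings
# like "$46,000.00" (column names are illustrative assumptions).
raw = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Accounting", "Economics"],
    "Starting Median Salary": ["$46,000.00", "$46,000.00", "$50,100.00"],
    "Mid-Career Median Salary": ["$77,100.00", "$77,100.00", "$98,600.00"],
})

# Standardize column names: lowercase, underscores instead of spaces/hyphens.
raw.columns = (raw.columns.str.lower()
                          .str.replace(" ", "_")
                          .str.replace("-", "_"))

# Drop exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# Strip dollar signs and commas, then convert salary columns to floats.
for col in ["starting_median_salary", "mid_career_median_salary"]:
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(df.dtypes)
```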
Histogram displaying the distribution of mid-career median salaries,
highlighting salary clusters between $50,000 and $80,000.

Heatmap showing correlations between various salary metrics, with strong
links between starting and mid-career salaries.
Hypothesis

Our hypothesis is that the choice of undergraduate major significantly impacts
long-term salary growth, with STEM majors experiencing the highest salary
increases over the course of their careers.

STEM majors see the highest career salary increases.
Engineering and Computer Science have higher starting and mid-career salaries.
Non-STEM fields show slower salary growth over time.
Limitations of Data Set
There’s a small limitation with the dataset as it primarily consists of aggregated data, which may not
capture the full range of individual variations. For future improvements, collecting individual-level
data, such as detailed salary information, demographics, and academic background, would provide
deeper insights. Including variables like job market trends and geographic details could further
enhance predictive accuracy. Additionally, ensuring data completeness and incorporating
longitudinal data would make the analysis more robust and meaningful.
Machine Learning was completed using
Random Forest Regressor
How the Machine Learning Model Was Selected

Reason for Selection
Effective for complex data patterns.
Utilizes ensemble learning to improve generalization by averaging predictions
from multiple decision trees.

Manages High Dimensionality
Performs well with large datasets and many features.
Provides accurate predictions while remaining interpretable.

Key Advantages of Random Forest
The model averages across trees, reducing variance and improving reliability.
Balances predictive power with the ability to understand feature importance.
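Fitting such a model with scikit-learn is straightforward. The sketch below uses synthetic features standing in for the encoded major/college-type columns; the numbers are invented for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 graduates, 5 encoded features, with salary
# driven mostly by the first feature plus noise.
X = rng.normal(size=(200, 5))
y = 60_000 + 8_000 * X[:, 0] - 3_000 * X[:, 1] + rng.normal(0, 1_000, 200)

# An ensemble of 200 decision trees whose predictions are averaged.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Feature importances sum to 1 and show which columns drive predictions.
print(model.feature_importances_.round(3))
```

Here the first feature dominates the importances, mirroring how the model surfaces which inputs (e.g. major) matter most.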
Measuring Model Accuracy

The Mean Absolute Error (MAE) is 2416.0, which measures the average magnitude of errors
without considering their direction.

The Mean Squared Error (MSE) is 8,837,527.4, penalizing larger errors more heavily by squaring
the differences.

The Root Mean Squared Error (RMSE) is 2,972.80, providing the error in the same units as the
target variable, making it more interpretable.

The R² Score is 0.9748, which indicates the proportion of variance explained by the model, with a
higher value signifying better performance.

The model’s performance improved across all metrics after hyperparameter optimization.
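The four metrics above can be computed with scikit-learn as sketched below. The predictions here are made-up numbers for illustration, not the project's actual outputs.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true vs. predicted mid-career salaries (invented values).
y_true = np.array([65_000, 72_000, 88_000, 95_000, 110_000], dtype=float)
y_pred = np.array([66_500, 70_000, 90_000, 93_500, 108_000], dtype=float)

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # squaring penalizes big misses
rmse = np.sqrt(mse)                          # same units as the target ($)
r2 = r2_score(y_true, y_pred)                # fraction of variance explained

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.4f}")
```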
Training/Testing Split and Cross-Validation

A train-test split was likely used, with 70% of the data for training and 30% for testing.
K-fold cross-validation may have been employed, splitting the dataset into k subsets. The model is
trained on k-1 subsets and validated on the remaining subset, repeating this process for each fold.
This method helps evaluate model performance more consistently and reduces the risk of overfitting.
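A 70/30 split with k-fold cross-validation might look like the sketch below; the data here is synthetic, since the point is only the evaluation scaffolding.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([5.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, 150)

# 70% training / 30% testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold CV: train on 4 folds, validate on the 5th, rotating each time.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"held-out R2={test_r2:.3f}, CV mean R2={cv_scores.mean():.3f}")
```

Comparing the held-out score with the cross-validated mean is a quick check that the split was not unusually lucky.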
Features and Feature Engineering

Features used were chosen to address the research question effectively.

Scaling/Normalizing data ensured uniformity across features.

Interaction terms or polynomial features were created to capture non-linear relationships.

One-hot encoding was applied to categorical variables to make them usable for the model.
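The encoding and scaling steps can be sketched as follows. The column names and category values are illustrative assumptions, not the dataset's verified labels.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with the kinds of categorical columns the slides describe.
df = pd.DataFrame({
    "major": ["Engineering", "English", "Economics", "Engineering"],
    "school_type": ["State", "Liberal Arts", "Ivy League", "Party"],
    "starting_median_salary": [62_000.0, 38_000.0, 55_000.0, 60_000.0],
})

# One-hot encode the categorical variables into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["major", "school_type"])

# Scale the numeric column to zero mean and unit variance.
scaler = StandardScaler()
encoded["starting_median_salary"] = scaler.fit_transform(
    encoded[["starting_median_salary"]]
).ravel()

print(sorted(encoded.columns))
```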
Model Evaluation and Comparison

Model evaluation was based on key metrics, including
MAE, MSE, RMSE, and R² score.

The optimized model improved performance, with a higher
R² score and lower error metrics compared to the
baseline.

Other models like Linear Regression, Decision Trees,
Gradient Boosting, and Neural Networks were tested
for comparison.

Random Forest was chosen for its ensemble nature, which
effectively handles both linear and non-linear
relationships.
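The kind of comparison described above can be sketched on synthetic data with a deliberately non-linear target, where the ensemble's advantage over a linear baseline shows up directly. This is an illustration of the comparison method, not the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(300, 3))
# Non-linear target: an interaction term a linear model cannot capture.
y = 4 * X[:, 0] + 3 * X[:, 1] * X[:, 2] + rng.normal(0, 0.3, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [
    ("LinearRegression", LinearRegression()),
    ("RandomForest", RandomForestRegressor(n_estimators=200, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))

print({k: round(v, 3) for k, v in results.items()})
```

The linear baseline only recovers the additive term, while the forest also fits the interaction, so its held-out R² is noticeably higher.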
Improving the Model and Alternative Approaches

Model improvement involved hyperparameter tuning using grid search to
optimize parameters.

Increasing the number of estimators (n_estimators) enhanced averaging, while
min_samples_split and min_samples_leaf reduced overfitting.

Ensemble learning was explored by combining Random Forest with
models like Gradient Boosting.

Potential improvements include feature selection, stacking or blending
algorithms, and Bayesian hyperparameter optimization for efficiency.
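A grid search over the parameters named above might look like this sketch (synthetic data; the grid values are arbitrary examples, not the project's tuned settings).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.4, 120)

# Small example grid over the parameters discussed above.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Exhaustively tries every combination, scoring each by 3-fold CV R².
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search` object exposes the best parameter combination and can be used directly for prediction via `search.predict`.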

Alternative Approaches
Gradient Boosting Algorithms (e.g., XGBoost or LightGBM) for structured data.

Neural Networks for large datasets with complex relationships.

Support Vector Machines (SVM) for smaller datasets requiring high precision.

Simpler Models like linear regression for interpretability.


Prescriptive Analysis

Suggested Actions

Deploy the optimized model for real-time predictions.

Monitor performance to detect data drift and retrain when necessary.

Perform advanced hyperparameter tuning, like Bayesian Optimization.

Refine features through importance analysis and engineering.

Experiment with ensemble techniques to further improve accuracy.


Potential Applications

Sales Forecasting: Predict demand for better inventory management.

Financial Predictions: Forecast stock prices or assess risks in lending.

Customer Insights: Enhance marketing through behavior predictions.

Pricing Optimization: Identify optimal price points for profitability.

Risk Management: Predict and mitigate risks in finance and insurance.

Potential Benefits
Enable data-driven decision-making for strategic improvements.

Achieve cost savings through resource optimization.

Improve customer experience with personalized offerings.

Gain a competitive advantage through actionable insights.

Scale predictive capabilities across departments for organization-wide impact.


Conclusion
Research Question
The project successfully explored whether a graduate’s mid-career salary can be predicted based on
their major and the type of college they attended.

Key Findings
The Random Forest model demonstrated high accuracy, achieving an R² score of 0.9748, indicating a
strong ability to explain salary variations.
Key predictors included the major and type of college, with other engineered features further improving
the model's performance.

Actions and Applications


The model is ready for deployment to predict salary outcomes, enabling students, educators, and
policymakers to make data-driven decisions.
Potential applications include career counseling, college ranking evaluations, and workforce planning.
Sources
Peden, R. (2018). College Salaries Dataset. Kaggle. Retrieved from
https://www.kaggle.com/datasets/ryanpeden/college-salaries

Scikit-learn. (n.d.). Random Forests. Scikit-learn Documentation. Retrieved from
https://scikit-learn.org/stable/modules/ensemble.html#random-forest

Roy, A. (2021). Feature Engineering for Machine Learning: Steps, Techniques, and Best Practices.
Towards Data Science. Retrieved from https://towardsdatascience.com/feature-engineering-for-machine-learning-steps-techniques-and-best-practices-83c8b7b5e6a8

Scikit-learn. (n.d.). Grid Search for Hyperparameter Tuning. Scikit-learn Documentation. Retrieved from
https://scikit-learn.org/stable/modules/grid_search.html

Analytics Vidhya. (2021). R-Squared or Coefficient of Determination: A Beginner's Guide. Analytics
Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2021/05/r-squared-or-coefficient-of-determination/
Thank you
