DS Final Project

The DS 2010 Final Data Project by Team Lavender investigates the correlation between academic major, college type, and mid-career salaries for U.S. graduates using the 'College Salaries' dataset from Kaggle. The project employs a Random Forest model, achieving a high R² score of 0.9748, indicating strong predictive capabilities. The findings aim to assist students and educational institutions in making informed decisions regarding majors and career prospects.


DS 2010 Final Data Project

College Graduate Salaries


Dataset - Salaries by college, region, and academic major

Team Lavender
Vasudha Arora, Renisha Rana and Jack Jancaric
This project explores how various factors, including
academic major and college type, correlate with starting and
mid-career salaries for U.S. graduates.

The research question addresses whether mid-career
earnings can be accurately predicted based on these
factors, using data-driven approaches and predictive
modeling techniques for analysis.

Are differences in mid-career salaries proportionate to a
graduate's chosen major and college type?

To what extent do these variables influence earning
potential over time?
The "College Salaries" dataset from Kaggle includes data on median early-career and mid-
career salaries for graduates from various majors and colleges in the U.S. It provides details
such as major, type of college, and median earnings, which are ideal for building predictive
models for mid-career salaries.

Research Question: Can we predict a graduate's mid-career salary based on
their major and the type of college they attended?

Expectation: We expect to find that certain majors and specific college types
may have a strong correlation with mid-career salaries.

Benefit of the project: This project will provide valuable insights to students,
helping them make informed decisions about their choice of major and college by
understanding potential salary outcomes. Additionally, it offers educational
institutions a clearer view of the economic impact of their programs, assisting
them in aligning their offerings with career prospects.

Dataset URL
Kaggle - College Salaries Dataset
How did you get your data?
URL Kaggle - College Salaries Dataset
Downloaded through Kaggle
File Type .CSV
Yes, we found this data set on our own using Kaggle.
How did you clean your data?
Checked for Duplicates
Standardized column names.
Removed dollar signs and commas from salary columns.

How did you prepare it?
Checked for outliers.
Converted salary columns to numeric data types.

Did you need to get another related dataset?
No, we didn't need another dataset to answer the question above.
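The cleaning steps listed above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the column names and rows here are assumed stand-ins for the Kaggle CSV's schema.

```python
import pandas as pd

# Hypothetical rows mimicking the raw CSV, where salaries are strings
# like "$46,000.00" (column names are illustrative assumptions).
raw = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Accounting", "Economics"],
    "Starting Median Salary": ["$46,000.00", "$46,000.00", "$50,100.00"],
    "Mid-Career Median Salary": ["$77,100.00", "$77,100.00", "$98,600.00"],
})

# Standardize column names: lowercase, underscores instead of spaces/hyphens.
raw.columns = (raw.columns.str.lower()
                          .str.replace(" ", "_")
                          .str.replace("-", "_"))

# Drop exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# Strip dollar signs and commas, then convert salary columns to floats.
for col in ["starting_median_salary", "mid_career_median_salary"]:
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(df.dtypes)
```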
Histogram displaying the distribution of mid-career median salaries,
highlighting salary clusters between $50,000 and $80,000.

Heatmap showing correlations between various salary metrics, with strong
links between starting and mid-career salaries.
Hypothesis

Our hypothesis is that the choice of undergraduate major significantly impacts
long-term salary growth, with STEM majors experiencing the highest salary
increases over the course of their careers.

STEM majors see the highest career salary increases.
Engineering and Computer Science have higher starting and mid-career salaries.
Non-STEM fields show slower salary growth over time.
Limitations of Data Set
There’s a small limitation with the dataset as it primarily consists of aggregated data, which may not
capture the full range of individual variations. For future improvements, collecting individual-level
data, such as detailed salary information, demographics, and academic background, would provide
deeper insights. Including variables like job market trends and geographic details could further
enhance predictive accuracy. Additionally, ensuring data completeness and incorporating
longitudinal data would make the analysis more robust and meaningful.
Machine Learning was completed using
Random Forest Regressor
How the Machine Learning Model Was Selected

Reason for Selection
Effective for complex data patterns.
Utilizes ensemble learning to improve generalization by averaging predictions
from multiple decision trees.

Manages High Dimensionality
Performs well with large datasets and many features.
Provides accurate predictions while remaining interpretable.

Key Advantages of Random Forest
The model averages across trees, reducing variance and improving reliability.
Balances predictive power with the ability to understand feature importance.
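Fitting such a model with scikit-learn is straightforward. The sketch below uses synthetic features standing in for the encoded major/college-type columns; the numbers are invented for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 graduates, 5 encoded features, with salary
# driven mostly by the first feature plus noise.
X = rng.normal(size=(200, 5))
y = 60_000 + 8_000 * X[:, 0] - 3_000 * X[:, 1] + rng.normal(0, 1_000, 200)

# An ensemble of 200 decision trees whose predictions are averaged.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Feature importances sum to 1 and show which columns drive predictions.
print(model.feature_importances_.round(3))
```

Here the first feature dominates the importances, mirroring how the model surfaces which inputs (e.g. major) matter most.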
Measuring Model Accuracy

The Mean Absolute Error (MAE) is 2416.0, which measures the average magnitude of errors
without considering their direction.

The Mean Squared Error (MSE) is 8,837,527.4, penalizing larger errors more heavily by squaring
the differences.

The Root Mean Squared Error (RMSE) is 2,972.80, providing the error in the same units as the
target variable, making it more interpretable.

The R² Score is 0.9748, which indicates the proportion of variance explained by the model, with a
higher value signifying better performance.

The model’s performance improved across all metrics after hyperparameter optimization.
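The four metrics above can be computed with scikit-learn as sketched below. The predictions here are made-up numbers for illustration, not the project's actual outputs.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true vs. predicted mid-career salaries (invented values).
y_true = np.array([65_000, 72_000, 88_000, 95_000, 110_000], dtype=float)
y_pred = np.array([66_500, 70_000, 90_000, 93_500, 108_000], dtype=float)

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # squaring penalizes big misses
rmse = np.sqrt(mse)                          # same units as the target ($)
r2 = r2_score(y_true, y_pred)                # fraction of variance explained

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.4f}")
```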
Training/Testing Split and Cross-Validation

A train-test split was likely used, with 70% of the data for training and 30% for testing.
K-fold cross-validation may have been employed, splitting the dataset into k subsets. The model is
trained on k-1 subsets and validated on the remaining subset, repeating this process for each fold.
This method helps evaluate model performance more consistently and reduces the risk of overfitting.
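A 70/30 split with k-fold cross-validation might look like the sketch below; the data here is synthetic, since the point is only the evaluation scaffolding.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([5.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, 150)

# 70% training / 30% testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold CV: train on 4 folds, validate on the 5th, rotating each time.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"held-out R2={test_r2:.3f}, CV mean R2={cv_scores.mean():.3f}")
```

Comparing the held-out score with the cross-validated mean is a quick check that the split was not unusually lucky.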
Features and Feature Engineering

Features used were chosen to address the research question effectively.

Scaling/Normalizing data ensured uniformity across features.

Interaction terms or polynomial features were created to capture non-linear relationships.

One-hot encoding was applied to categorical variables to make them usable for the model.
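The encoding and scaling steps can be sketched as follows. The column names and category values are illustrative assumptions, not the dataset's verified labels.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with the kinds of categorical columns the slides describe.
df = pd.DataFrame({
    "major": ["Engineering", "English", "Economics", "Engineering"],
    "school_type": ["State", "Liberal Arts", "Ivy League", "Party"],
    "starting_median_salary": [62_000.0, 38_000.0, 55_000.0, 60_000.0],
})

# One-hot encode the categorical variables into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["major", "school_type"])

# Scale the numeric column to zero mean and unit variance.
scaler = StandardScaler()
encoded["starting_median_salary"] = scaler.fit_transform(
    encoded[["starting_median_salary"]]
).ravel()

print(sorted(encoded.columns))
```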
Model Evaluation and Comparison

Model evaluation was based on key metrics, including
MAE, MSE, RMSE, and R² score.

The optimized model improved performance, with a higher
R² score and lower error metrics compared to the
baseline.

Other models like Linear Regression, Decision Trees,
Gradient Boosting, and Neural Networks were tested
for comparison.

Random Forest was chosen for its ensemble nature, which
effectively handles both linear and non-linear
relationships.
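The kind of comparison described above can be sketched on synthetic data with a deliberately non-linear target, where the ensemble's advantage over a linear baseline shows up directly. This is an illustration of the comparison method, not the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(300, 3))
# Non-linear target: an interaction term a linear model cannot capture.
y = 4 * X[:, 0] + 3 * X[:, 1] * X[:, 2] + rng.normal(0, 0.3, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [
    ("LinearRegression", LinearRegression()),
    ("RandomForest", RandomForestRegressor(n_estimators=200, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))

print({k: round(v, 3) for k, v in results.items()})
```

The linear baseline only recovers the additive term, while the forest also fits the interaction, so its held-out R² is noticeably higher.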
Improving the Model and Alternative Approaches

Model improvement involved hyperparameter tuning using grid search to
optimize parameters.

Increasing the number of estimators (n_estimators) enhanced averaging, while
min_samples_split and min_samples_leaf reduced overfitting.

Ensemble learning was explored by combining Random Forest with
models like Gradient Boosting.

Potential improvements include feature selection, stacking or blending
algorithms, and Bayesian hyperparameter optimization for efficiency.
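A grid search over the parameters named above might look like this sketch (synthetic data; the grid values are arbitrary examples, not the project's tuned settings).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.4, 120)

# Small example grid over the parameters discussed above.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Exhaustively tries every combination, scoring each by 3-fold CV R².
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search` object exposes the best parameter combination and can be used directly for prediction via `search.predict`.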

Alternative Approaches
Gradient Boosting Algorithms (e.g., XGBoost or LightGBM) for structured data.

Neural Networks for large datasets with complex relationships.

Support Vector Machines (SVM) for smaller datasets requiring high precision.

Simpler Models like linear regression for interpretability.


Prescriptive Analysis

Suggested Actions

Deploy the optimized model for real-time predictions.

Monitor performance to detect data drift and retrain when necessary.

Perform advanced hyperparameter tuning, like Bayesian Optimization.

Refine features through importance analysis and engineering.

Experiment with ensemble techniques to further improve accuracy.


Potential Applications

Sales Forecasting: Predict demand for better inventory management.

Financial Predictions: Forecast stock prices or assess risks in lending.

Customer Insights: Enhance marketing through behavior predictions.

Pricing Optimization: Identify optimal price points for profitability.

Risk Management: Predict and mitigate risks in finance and insurance.

Potential Benefits
Enable data-driven decision-making for strategic improvements.

Achieve cost savings through resource optimization.

Improve customer experience with personalized offerings.

Gain a competitive advantage through actionable insights.

Scale predictive capabilities across departments for organization-wide impact.


Conclusion
Research Question
The project successfully explored whether a graduate’s mid-career salary can be predicted based on
their major and the type of college they attended.

Key Findings
The Random Forest model demonstrated high accuracy, achieving an R² score of 0.9748, indicating a
strong ability to explain salary variations.
Key predictors included the major and type of college, with other engineered features further improving
the model's performance.

Actions and Applications


The model is ready for deployment to predict salary outcomes, enabling students, educators, and
policymakers to make data-driven decisions.
Potential applications include career counseling, college ranking evaluations, and workforce planning.
Sources
Peden, R. (2018). College Salaries Dataset. Kaggle. Retrieved from
https://www.kaggle.com/datasets/ryanpeden/college-salaries

Scikit-learn. (n.d.). Random Forests. Scikit-learn Documentation. Retrieved from
https://scikit-learn.org/stable/modules/ensemble.html#random-forest

Roy, A. (2021). Feature Engineering for Machine Learning: Steps, Techniques, and Best Practices.
Towards Data Science. Retrieved from https://towardsdatascience.com/feature-engineering-for-machine-learning-steps-techniques-and-best-practices-83c8b7b5e6a8

Scikit-learn. (n.d.). Grid Search for Hyperparameter Tuning. Scikit-learn Documentation. Retrieved from
https://scikit-learn.org/stable/modules/grid_search.html

Analytics Vidhya. (2021). R-Squared or Coefficient of Determination: A Beginner's Guide. Analytics
Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2021/05/r-squared-or-coefficient-of-determination/
Thank you
