DS Final Project
Team Lavender
Vasudha Arora, Renisha Rana and Jack Jancaric
This project explores how various factors, including
academic major and college type, correlate with starting and
mid-career salaries for U.S. graduates.
Benefit of the project: This project provides valuable insights to students,
helping them make informed decisions about their choice of major and college by
understanding potential salary outcomes. It also offers educational
institutions a clearer view of the economic impact of their programs, helping
them align their offerings with career prospects.
Dataset URL
Kaggle - College Salaries Dataset
How did you get your data?
Downloaded through Kaggle
File type: .CSV
Yes, we found this dataset on our own using Kaggle.
How did you clean your data?
Checked for duplicates.
Standardized column names.
Removed dollar signs and commas from salary columns.
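The cleaning steps above can be sketched in pandas. The column names and salary values below are illustrative assumptions, since the dataset's exact schema isn't shown here:

```python
import pandas as pd

# Hypothetical raw rows mirroring the Kaggle CSV's formatting.
df = pd.DataFrame({
    "Undergraduate Major": ["Engineering", "English", "Engineering"],
    "Starting Median Salary": ["$63,000.00", "$38,000.00", "$63,000.00"],
})

# Check for and drop duplicate rows.
df = df.drop_duplicates()

# Standardize column names: lowercase with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Remove dollar signs and commas from salary columns, then convert to numeric.
salary_cols = [c for c in df.columns if "salary" in c]
for col in salary_cols:
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(df)
```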
Our hypothesis is that the choice of undergraduate major significantly impacts long-
term salary growth, with STEM majors experiencing the highest salary increases
over the course of their careers.
The Mean Absolute Error (MAE) is 2,416.0, which measures the average magnitude of errors
without considering their direction.
The Mean Squared Error (MSE) is 8,837,527.4, penalizing larger errors more heavily by squaring
the differences.
The Root Mean Squared Error (RMSE) is 2,972.80, providing the error in the same units as the
target variable, making it more interpretable.
The R² Score is 0.9748, which indicates the proportion of variance explained by the model, with a
higher value signifying better performance.
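All four metrics can be computed with scikit-learn. The arrays below are made-up salary values for illustration, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted salaries (not the project's data).
y_true = np.array([52000.0, 74000.0, 43000.0, 96000.0])
y_pred = np.array([54000.0, 71000.0, 45000.0, 93000.0])

mae = mean_absolute_error(y_true, y_pred)   # average |error|
mse = mean_squared_error(y_true, y_pred)    # squares errors, penalizing large ones
rmse = np.sqrt(mse)                         # back in salary units
r2 = r2_score(y_true, y_pred)               # proportion of variance explained

print(mae, mse, rmse, r2)
```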
The model’s performance improved across all metrics after hyperparameter optimization.
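One common way to run that optimization is a grid search over Random Forest hyperparameters. A minimal sketch with scikit-learn's `GridSearchCV`; the grid values and the synthetic data are placeholders, not the project's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the salary features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Candidate hyperparameters (illustrative values only).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
}

# Exhaustively try every combination, scoring each by cross-validated RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```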
Training/Testing Split and Cross-Validation
A train-test split was likely used, with 70% of the data for training and 30% for testing.
K-fold cross-validation may have been employed, splitting the dataset into k subsets. The model is
trained on k−1 subsets and validated on the remaining subset, repeating this process for each fold.
This method helps evaluate model performance more consistently and reduces the risk of overfitting.
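Both evaluation strategies can be sketched as follows, again with a synthetic dataset standing in for the salary data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

# 70/30 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("hold-out R^2:", model.score(X_test, y_test))

# 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("CV R^2 per fold:", cv_scores)
```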
Features and Feature Engineering
One-hot encoding was applied to categorical variables to make them usable for the model.
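In pandas this is typically done with `get_dummies`, which expands each categorical column into one binary column per category level. The column names below are assumptions about the dataset's schema:

```python
import pandas as pd

# Hypothetical categorical columns from the salary dataset.
df = pd.DataFrame({
    "major": ["Engineering", "English", "Biology"],
    "school_type": ["State", "Ivy League", "State"],
})

# One dummy column per category level, e.g. major_Engineering, school_type_State.
encoded = pd.get_dummies(df, columns=["major", "school_type"])
print(encoded.columns.tolist())
```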
Model Evaluation and Comparison
Alternative Approaches
Gradient Boosting Algorithms (e.g., XGBoost or LightGBM) for structured data.
Support Vector Machines (SVM) for smaller datasets requiring high precision.
Suggested Actions
Potential Benefits
Enable data-driven decision-making for strategic improvements.
Key Findings
The Random Forest model demonstrated high accuracy, achieving an R² score of 0.9747, indicating a
strong ability to explain salary variations.
Key predictors included the major and type of college, with other engineered features further improving
the model's performance.
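Predictor importance of this kind is usually read off a fitted Random Forest's `feature_importances_` attribute. A sketch with synthetic data, where the hypothetical feature names stand in for the project's actual encoded columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

# Synthetic features: only the first two drive the target,
# standing in for "major" and "school type" effects.
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=200, random_state=7)
forest.fit(X, y)

# Importances sum to 1; higher means the feature's splits reduced more variance.
names = ["major", "school_type", "noise_1", "noise_2"]
for name, imp in zip(names, forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```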