Assign4 Gam

This document describes building an additive model to predict workers' wages using their age, year, and education level. An additive model was created using a generalized additive model (GAM) with 6 splines for age and year and 5 splines for education. Partial dependency plots show the influence of each feature on wages. The model was validated by comparing actual and predicted test set values and analyzing residuals and correlation, with low R2 scores indicating the model could be improved by adding more features.

Uploaded by

Chelsi Gondalia

Building an Additive Model

Chelsi Gondalia
10/25/2021
In this study, we work with a dataset that contains the wages of workers and some of their
demographic data, such as age, year, marital status, and education, among several others. Overall, we have 3000
records and 9 features. The objective of this study is to build an additive model to predict the wages of
workers.
Our model focuses on three features: age, year, and education. The theoretical representation of
this model is given by Equation (1).
$$\text{wage}_i = \beta_0 + f_1(\text{year}_i) + f_2(\text{age}_i) + f_3(\text{education}_i) + \varepsilon_i \tag{1}$$
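To make the additive structure of Equation (1) concrete, the following minimal sketch computes a prediction as an intercept plus one contribution per feature. The component functions, the intercept value, and the 1 to 5 coding of education are made up for illustration; in the fitted model each f is a spline estimated from the data, not a closed-form expression.

```python
# Hypothetical component functions standing in for the fitted smooth terms.
# Their shapes are assumptions chosen only to illustrate the additive form.
def f_year(year):
    return 1.2 * (year - 2006)        # assumed gentle upward trend

def f_age(age):
    return -0.05 * (age - 48) ** 2    # assumed curve that peaks at age 48

def f_education(level):
    return 12.0 * (level - 3)         # assumes education is coded 1..5

BETA_0 = 110.0  # made-up intercept, not a fitted value

def predict_wage(year, age, education):
    """Additive prediction: intercept plus one contribution per feature."""
    return BETA_0 + f_year(year) + f_age(age) + f_education(education)

baseline = predict_wage(year=2006, age=48, education=3)
younger = predict_wage(year=2006, age=38, education=3)
print(baseline, younger)
```

Because the terms are simply summed, changing one feature shifts the prediction by that feature's contribution alone, which is what makes the per-feature plots of such a model easy to read.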
The dataset was split into train and test sets, with the test set containing 25% of the total data. The
train set was then used to build our additive model in Python using the function GAM. For age and
year, we used n_splines=6, meaning that each of the fitted smoothing functions was built from
6 spline basis functions. For education, we used n_splines=5. The results of the GAM can be interpreted with the
help of the partial dependency plots shown in Figure 1. The spline functions are what give the curves in
Figure 1 their smooth shape. Each plot visualizes how a feature (on the x-axis)
influences our response variable, wage (on the y-axis). The dotted lines around each solid curve
represent the 95% confidence intervals. For the feature “year”, the wage increases overall, with one
peculiar drop between 2007 and 2008. For the feature “age”, the wage increases steeply until
~48 years and then begins to decline. The feature “education” appears to have a fairly linear relationship
with wage.

Figure 1. Partial dependency plots with confidence intervals.
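The 75/25 train/test split described above can be sketched with the Python standard library alone. The helper name, the shuffling, and the fixed seed are assumptions made for illustration; the report does not state how the split was randomized.

```python
import random

def train_test_split_indices(n_records, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]

# 3000 records, as in the dataset described in this study.
train_idx, test_idx = train_test_split_indices(3000)
print(len(train_idx), len(test_idx))  # prints "2250 750"
```

With 3000 records this leaves 2250 rows for fitting the model and 750 rows for validation.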


Now that we have a fair understanding of our additive model and its features, we must validate it. The test
set predictions and actual values are plotted in Figure 2 for comparison. It is evident that our model is
poor at predicting wages beyond 175. To further validate the model, the residual distribution, which
is assumed to be reasonably normal apart from a long right tail, is plotted in Figure 3.

Figure 2. Comparing actual test set values to GAM predictions.

Figure 3. Distribution of residuals.
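One way to put a number on the right tail of the residuals is the sample skewness, sketched below. The actual and predicted values are invented for illustration and are not taken from the dataset; they only mimic the pattern of large under-predictions at high wages.

```python
from statistics import mean, pstdev

def residuals(actual, predicted):
    """Residual = actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]

def skewness(values):
    """Population skewness: the mean cubed z-score.
    A positive value indicates a longer right tail."""
    m = mean(values)
    sd = pstdev(values)
    return mean([((v - m) / sd) ** 3 for v in values])

# Illustrative numbers only: the high wages are under-predicted.
actual = [80, 95, 105, 110, 120, 130, 200, 280]
predicted = [90, 100, 105, 108, 115, 120, 140, 150]

res = residuals(actual, predicted)
print(res)
print(skewness(res) > 0)
```

A clearly positive skewness would be consistent with a model that under-predicts the highest wages, as seen for wages beyond 175.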


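Lastly, we check the agreement between our actual data points and the GAM predictions using the R² score.
The scores for both the test and the train sets can be viewed in Table 1. As we can see, the
R² score is low for both sets (about 0.3), indicating that our model explains only a small share of the
variation in wages. Because the train and test scores are similar, the weakness looks more like underfitting than
overfitting, which suggests that we could improve this model by adding more features.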
Table 1. R² score for test and train sets.

        Test set    Train set
R²      0.32        0.29
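For reference, a score like those in Table 1 can be computed as follows, assuming R² here is the usual coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares). The function name and the toy inputs are illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy checks: perfect predictions score 1, predicting the mean scores 0.
print(r_squared([1, 2, 3], [1, 2, 3]))  # prints 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # prints 0.0
```

Under this definition, a score of about 0.3 means the predictions account for roughly 30% of the variation in wages around their mean.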
