Syndicate 6 - Assignment 3
The regression tree shows the highest level of error, reflecting the limited granularity the model can achieve. Lowering the CP value had little impact on the amount of error the model produced, while a value of 0.05 made the tree too simple to be functional; for this reason, 0.01 was the optimal value for CP. Only one hidden layer was used for the neural network, as adding more layers did not improve the predictive performance of the model. This was the most accurate of the machine learning models on both MAE and MASE. Two clusters were chosen for K-means clustering, as all three outputs (the elbow method, the gap statistic, and the silhouette measure) indicated two as the optimal number of clusters. The K-NN regression model was the third most accurate of the machine learning models, with all error measures higher than those of the neural network and the K-means model.
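The choice of two clusters can be sketched as follows. This is a minimal illustration of two of the three methods mentioned (the elbow method and the silhouette measure) using scikit-learn on synthetic data standing in for the standardised wine variables; the wine data itself is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the physicochemical variables: two loose blobs,
# so the "true" number of clusters is known to be 2.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 4)),
    rng.normal(4.0, 1.0, size=(100, 4)),
])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)  # both measures agree on k = 2 for well-separated blobs
```

In the report's case the gap statistic was computed as well; all three measures pointing to the same k is what justifies the choice of two clusters.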
Comparing the variable importance plot of the linear regression from the first assignment with the variable importance plot of the regression tree, Garson's relative importance, and Olden's connection weights shows broad agreement on which variables matter most for the quality score.
Syndicate 6 - Alister King, Radeyan Sazzad, Matt Lewis, Sope Dalley
Figure 1 – VIP of physicochemical properties from Task 1.
Figure 2 – VIP of the regression tree.
The variable importance plot showed Alc, Density, Sulphates, and VA as the most important factors, noting that the magnitude of the result cannot be interpreted. This differed from the correlation plot, in which density is only the second most important factor.
The highest-magnitude factors from Garson's relative importance show CA, Alc, TSD, and FA to be the most important factors for the quality score of wine; note, however, that the direction of the response cannot be determined. This differed from the correlation table, in which Sulphates had a higher correlation with the quality score and VA had a strong negative correlation. Olden's connection weights show Alc, FA, FSD, and Sulphates to be the most important variables for the quality score of wine; note, however, that the magnitudes of these values cannot be interpreted directly. This differed from the correlation table, as FA had a lower correlation than other variables and CA had a positive correlation despite a negative importance.
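Both importance measures discussed above can be computed directly from the weights of a single-hidden-layer network. A minimal sketch follows; the weight matrices are illustrative only, not the fitted network from the report.

```python
import numpy as np

def garson(w_ih, w_ho):
    """Garson's relative importance for a single-hidden-layer network.
    w_ih: (inputs, hidden) input-to-hidden weights
    w_ho: (hidden,) hidden-to-output weights
    Returns importances summing to 1; absolute values discard the signs,
    which is why the direction of each variable's effect is lost."""
    contrib = np.abs(w_ih) * np.abs(w_ho)   # (inputs, hidden)
    rel = contrib / contrib.sum(axis=0)     # each input's share per hidden node
    imp = rel.sum(axis=1)
    return imp / imp.sum()

def olden(w_ih, w_ho):
    """Olden's connection weights: signed sums of weight products.
    Direction is preserved, but magnitudes are not directly interpretable."""
    return w_ih @ w_ho

# Illustrative weights for 3 inputs and 2 hidden nodes (not the report's model)
w_ih = np.array([[ 0.8, -0.2],
                 [ 0.1,  0.9],
                 [-0.4,  0.3]])
w_ho = np.array([1.0, -0.5])

print(garson(w_ih, w_ho))  # non-negative relative importances, sum to 1
print(olden(w_ih, w_ho))   # signed contributions: [0.9, -0.35, -0.55]
```

The two functions make the report's caveats concrete: Garson keeps magnitude but loses direction, while Olden keeps direction but its raw magnitudes are not comparable across models.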
Figure 5 – Histogram of fitted residuals for linear, stepwise, and nonlinear regressions.
Regression Tree
Regression tree analysis resulted in a USLF similar to that of the three regression models (99.432), indicating the same impact of residuals at the higher end and minimal benefit from segmenting the data. This is due to the fewer observations at the higher and lower QS values.
Neural Network
The USLF for the single-layer neural network was 97.846, about two points lower than the regression models and the tree, indicating it is slightly less costly to the user under the specified loss function.
K-Means Clustering
K-means clustering with two clusters showed a USLF roughly three points higher than the regression models, indicating a higher cost to the user. This is driven by the higher standard error, which increases both the opportunity cost and the marketing cost (larger residuals on both sides).
K-NN Regression
K-NN regression showed a USLF similar to the regression models (99.432), reflecting a similar amount of error (RMSE ≈ 0.6), as outlined in Table 1, which drives the residuals and thus the USLF.
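As a sketch of how the K-NN error figure is obtained: the snippet below fits a K-NN regressor and computes RMSE on a held-out split. The data is synthetic (integer scores derived from a noisy linear signal, standing in for the wine variables), so the RMSE printed is not the report's 0.6.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
# Integer quality scores, as in the report's data: a noisy linear signal
# rounded to whole numbers and clipped to a plausible score range.
y = np.clip(np.round(5 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.7, 300)), 3, 8)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((knn.predict(X_te) - y_te) ** 2))
print(round(rmse, 3))
```

The residual vector computed this way is also the input to the USLF, which is why a similar RMSE produces a similar USLF.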
The neural network model resulted in the lowest cost to the user based on the loss function weighted towards opportunity cost (which heavily penalises underprediction). Sensitivity analysis, conducted by varying the weights, further reinforced the cost of underprediction. Larger amounts of data would allow greater segmentation for the extreme scores (7, 8), which would reduce the impact of underprediction. It is also worth noting that the actual scores are allocated as integers, which reduces data resolution and increases the error across all models.
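The asymmetric weighting described above can be illustrated with a small sketch. The weights here are hypothetical (the report's actual USLF weights are not restated), chosen only to show underprediction costing more per unit than overprediction:

```python
import numpy as np

def asymmetric_loss(y_true, y_pred, w_under=3.0, w_over=1.0):
    """Loss weighted towards opportunity cost: underprediction
    (prediction below the actual score) costs w_under per unit,
    overprediction costs w_over. Weights are illustrative only."""
    resid = y_true - y_pred
    under = np.clip(resid, 0, None)   # amount underpredicted
    over = np.clip(-resid, 0, None)   # amount overpredicted
    return float(np.sum(w_under * under + w_over * over))

y_true = np.array([5, 6, 7, 8])
low  = np.array([4, 5, 6, 7])   # underpredicts every score by 1
high = np.array([6, 7, 8, 9])   # overpredicts every score by 1
print(asymmetric_loss(y_true, low))   # 4 units under * 3.0 = 12.0
print(asymmetric_loss(y_true, high))  # 4 units over  * 1.0 = 4.0
```

Varying `w_under` relative to `w_over`, as in the sensitivity analysis, shifts the ranking in favour of models that underpredict less, which is what favoured the neural network here.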