Syndicate 6 - Assignment 3

The document compares different predictive models for wine quality and discusses their performance based on error metrics and a user-specific loss function. It finds that a neural network model has the lowest error and loss, while k-means clustering performs worst. Regression tree and k-NN models have similar errors to basic regressions. Underpredicting high quality wines leads to the highest losses.


Syndicate 6- Alister King, Radeyan Sazzad, Matt Lewis, Sope Dalley

Predictive Analytics Assignment 3


Comment on how your results from these methods compare to your analysis from Syndicate Task #1.
Compared to the results from Syndicate Task #1 shown in Q2, only the neural network improves on the basic regressions, achieving the lowest RMSE, MAE and user loss of any model. The regression tree and k-NN produce errors comparable to the Task #1 regressions, while the k-means clustering model is the least accurate of all models. All models outperform the naïve benchmark, as shown by MASE values below one.

Table 1: Prediction Error For All Models

Model          RMSE   MAE    MAPE   MASE   UserLoss
Linear         0.635  0.488  8.700  0.735   99.432
Stepwise       0.636  0.489  8.700  0.735   99.436
Non-Linear     0.627  0.488  8.710  0.734   99.437
RegTree        0.629  0.474  8.902  0.743   99.432
NeurNet        0.599  0.462  8.779  0.735   97.846
kMeansCluster  0.661  0.532  9.940  0.834  102.607
kNN            0.609  0.493  9.324  0.773   99.432
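The error metrics in Table 1 can be reproduced from a vector of actual and predicted quality scores. The sketch below is a minimal pure-Python illustration (the function name and the toy scores are our own, not the assignment's data); note that MASE is scaled here by the in-sample MAE of a lag-1 naïve forecast, which is one common convention.

```python
import math

def error_metrics(actual, predicted):
    """Return (RMSE, MAE, MAPE in %, MASE) for paired actual/predicted scores."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / a for a, e in zip(actual, errors)) / n
    # MASE scales MAE by the MAE of a naive lag-1 forecast,
    # so a value below 1 beats the naive case.
    naive_mae = sum(abs(actual[i] - actual[i - 1]) for i in range(1, n)) / (n - 1)
    mase = mae / naive_mae
    return rmse, mae, mape, mase

# Toy example with hypothetical quality scores
rmse, mae, mape, mase = error_metrics([5, 6, 7, 5], [5.5, 6.0, 6.0, 5.0])
```

A MASE below one, as for every model in Table 1, means the model's MAE beats this naïve benchmark.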

The regression tree shows a relatively high level of error, given the limited granularity a tree with few terminal nodes can achieve. Lowering the complexity parameter (CP) had little impact on the model's error, while a value of 0.05 made the tree too simple to be functional; for this reason 0.01 was taken as the optimal CP. Only one hidden layer was used for the neural network, as adding more layers did not improve its predictive performance; it was the most accurate of the machine learning models for MAE and MASE. Two clusters were chosen for the k-means clustering, as all three diagnostics (the elbow method, the gap statistic and the silhouette measure) indicated two as the optimal number of clusters. The k-NN regression model was the third most accurate of the machine learning models, with all errors higher than the neural network but lower than k-means clustering.
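For illustration, k-NN regression of the kind used here can be sketched in a few lines. This is a deliberately simplified version (unweighted average, squared Euclidean distance, no feature scaling), and the features and scores below are hypothetical rather than the wine data:

```python
def knn_regress(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Squared Euclidean distance is sufficient for ranking neighbours
    by_distance = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical scaled features -> quality scores
X = [[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.9]]
y = [5, 7, 5, 8]
pred = knn_regress(X, y, [0.12, 0.22], k=2)  # averages the two nearest points
```

In practice k is tuned on a validation set, and features should be standardised first so that no single physicochemical variable dominates the distance.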

Comparing the variable importance plot of the linear regression from the first assignment with the variable importance plot for the regression tree, Garson's relative importance and Olden's connection weights shows broad agreement on which variables matter most for the quality score.

Figure 1 - VIP of physicochemical properties from Task 1. Figure 2 – VIP of the regression tree.

The variable importance plot showed alcohol (Alc), density, sulphates and volatile acidity (VA) as the most important factors, noting that the magnitudes of the results cannot be interpreted directly. This differed from the correlation plot, in which density was only the second most important factor.

Figure 3 – Garson’s relative importance. Figure 4 – Olden’s connection weights.

The highest-magnitude factors from Garson's relative importance show citric acid (CA), alcohol (Alc), total sulfur dioxide (TSD) and fixed acidity (FA) to be the most important for the quality score of wine; note, however, that the direction of the response cannot be determined from this measure. This differed from the correlation table, in which sulphates had a higher correlation with quality score and VA had a strong negative correlation. Olden's connection weights show Alc, FA, free sulfur dioxide (FSD) and sulphates to be the most important variables for the quality score; here, however, the magnitudes of the variables cannot be interpreted, only their signs. This also differed from the correlation table, as FA had a lower correlation than other variables and CA showed a positive correlation despite a negative importance.
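Garson's algorithm itself is simple enough to sketch: it routes each input's importance through the absolute values of the input-to-hidden and hidden-to-output weights of a single-hidden-layer network (biases are ignored, as is standard for the method). The weights below are invented for illustration, not taken from the fitted model:

```python
def garson_importance(w_in, w_out):
    """Garson's relative importance for a single-hidden-layer network.

    w_in[i][h] is the weight from input i to hidden node h;
    w_out[h] is the weight from hidden node h to the output.
    Returns importances summing to 1. Signs are discarded, which is
    why the direction of the response cannot be recovered (Olden's
    method keeps the signed products w_in[i][h] * w_out[h] instead).
    """
    n_in, n_hid = len(w_in), len(w_out)
    # Contribution of input i routed through hidden node h
    contrib = [[abs(w_in[i][h]) * abs(w_out[h]) for h in range(n_hid)]
               for i in range(n_in)]
    hidden_totals = [sum(contrib[i][h] for i in range(n_in)) for h in range(n_hid)]
    raw = [sum(contrib[i][h] / hidden_totals[h] for h in range(n_hid))
           for i in range(n_in)]
    total = sum(raw)
    return [r / total for r in raw]

# Two inputs, two hidden nodes: each input feeds one hidden node equally,
# so the importances split evenly.
shares = garson_importance([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```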

Consider the scenario where you would like to use the predictions for forming pricing and marketing strategies.
The premium wine segment typically ranges in price from $50 to over $1,000 per bottle, compared with the mass-produced market at roughly $5-$20 per bottle. Assuming margins follow a similar pattern, a weighting of five times the marketing cost was allocated to the opportunity cost of the margin not realised on a premium wine. Using this logic, a user-specific loss function was developed that sums the instances where the prediction was above 7 but the actual score in the training set was below 7 (excess marketing expense), and the opposite, where the actual was above 7 but the prediction was below 7 (opportunity cost). The sum incorporates the magnitude of each error.

User-Specific Loss Function (USLF) = excess marketing expense + 5 × opportunity cost
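Under these assumptions the USLF can be written as a short function. The handling of a score of exactly 7 is our assumption (the text only says "above 7" and "below 7", so exact-7 cases contribute nothing here), and the scores in the example are hypothetical:

```python
PREMIUM_WEIGHT = 5   # premium margin assumed ~5x the marketing cost
THRESHOLD = 7        # quality score separating premium from mass-market

def user_loss(actual, predicted):
    """Sum error magnitudes for the two costly misprediction cases."""
    loss = 0.0
    for a, p in zip(actual, predicted):
        if p > THRESHOLD and a < THRESHOLD:
            # Predicted premium, actually not: excess marketing expense
            loss += p - a
        elif a > THRESHOLD and p < THRESHOLD:
            # Actually premium, predicted not: weighted opportunity cost
            loss += PREMIUM_WEIGHT * (a - p)
    return loss

# Toy example: one underpredicted premium wine (5 x 2 = 10)
# and one overpredicted cheap wine (2.5), total 12.5
loss = user_loss([8, 5], [6, 7.5])
```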



Regression Models (Task #1)


All three regression models (linear, stepwise, nonlinear) from Task #1 produced near-identical results for the loss function (99.432-99.437), indicating similar under- and over-prediction characteristics (see the residual plots below). Further inspection of the residuals showed a tendency to underpredict at the higher quality scores (6-7) and overpredict at the lower scores (3-4). This was the case for all three regression models, indicating a flatter prediction curve. Sensitivity analysis on the regression models found that the most significant underpredictions occurred at the most extreme quality scores (8), attracting the higher opportunity-cost penalty from the USLF. These larger residuals are also reflected in the histograms below. If the regression models were used to predict wine scores, this flat curve should be kept in mind: the models should potentially be reconsidered for the premium market, replaced with the segmentation methods discussed below, or supplemented by a separate regression tailored to premium quality scores.
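The under/over-prediction pattern described above can be checked numerically by averaging the residuals within each actual quality score; positive values indicate underprediction, negative values overprediction. A minimal sketch with invented scores that mimic a "flat" model:

```python
from collections import defaultdict

def mean_residual_by_score(actual, predicted):
    """Average residual (actual - predicted) for each actual quality score.

    Positive means the model underpredicts that score band,
    negative means it overpredicts.
    """
    buckets = defaultdict(list)
    for a, p in zip(actual, predicted):
        buckets[a].append(a - p)
    return {score: sum(res) / len(res) for score, res in sorted(buckets.items())}

# Hypothetical flat predictions clustered near the mean score:
# low scores come out negative (overpredicted), high scores positive.
actual = [3, 4, 5, 6, 7, 7]
predicted = [4.5, 4.8, 5.2, 5.6, 6.1, 6.3]
profile = mean_residual_by_score(actual, predicted)
```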

Figure 5– Histogram of fitted residuals for linear, stepwise, and nonlinear regressions.

Regression Tree
Regression tree analysis resulted in a similar USLF to the three regression models (99.432), indicating the same impact of residuals at the higher end and minimal benefit from segmenting the data. This is due to the small number of observations at the highest and lowest quality scores.

Neural Network
The USLF for the single-layer neural network was 97.846, roughly 2 points lower than the regression models and the tree, indicating it is slightly less costly to the user under the specified loss function.

K-Means Clustering
K-means clustering with 2 clusters showed a roughly 3-point higher USLF than the regression models, indicating a higher cost to the user. This is driven by its higher standard error, which increases both the opportunity cost and the marketing cost (larger residuals on both sides).

K-NN Regression
K-NN regression showed a similar USLF to the regression models (99.432), reflecting a similar level of error (RMSE ≈ 0.6) as outlined in Table 1, which drives the residuals and thus the USLF.

The neural network model resulted in the lowest cost to the user under a loss function weighted towards opportunity cost (i.e. one that significantly penalises underprediction). Sensitivity analysis, conducted by increasing the weight further, reinforced how dominant the cost of underprediction is. Larger amounts of data would allow greater segmentation at the extreme scores (7, 8), which would reduce the impact of underprediction. It is also worth noting that the actual scores are recorded as integers, which reduces data resolution and increases error across all models.
