Final Assignment
The Exchange Rate has a positive coefficient in the regression model. This
indicates that, on average, a higher exchange rate is associated with a higher number
of tourist arrivals. In practice, when a country's currency weakens, so that one unit of
foreign currency buys more of it, tourists may find it cheaper to travel because their
foreign money has more purchasing power. Although the data show a positive
relationship, an increase in the exchange rate does not necessarily cause an increase
in visitor arrivals. Unobserved factors could influence both the exchange rate and
visitor arrivals. For example, a thriving global economy may encourage more
individuals to travel (raising tourist arrivals) while also influencing currency rates.
The R² value, 0.2483 (Appendix A), measures how much of the variation in tourist
arrivals the model explains. Here the Exchange Rate accounts for roughly 25% of the
variation in tourist numbers, so on its own it is a weak predictor of tourist
movement. An R² this far from 1 indicates that other influential factors are missing
from the model. These might include global events, tourism marketing efforts, and
political conditions, among others.
1.2 Residuals Diagnostics
The Breusch-Godfrey test results indicate that the linear regression model does not
adequately capture the temporal structure in the data. With a test statistic of 348.59
on 10 degrees of freedom, the p-value is far below the conventional 0.05 threshold
for statistical significance (Appendix B). We therefore reject the null hypothesis of no
serial correlation and conclude that there is significant autocorrelation in the
residuals up to lag 10. The residuals are not random, so the model has not captured
all the underlying temporal structure in the time series. The ACF plot reinforces this
finding: the initial lags show significant autocorrelation, giving visual confirmation of
serial correlation in the residuals.
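To give a sense of how small this p-value is, the following Python sketch evaluates the upper tail of a chi-square distribution at the reported statistic. It assumes the statistic is compared against a chi-square distribution with 10 degrees of freedom (one per lag). The function name `chi2_sf_even` is illustrative, and the closed form it uses holds only for an even number of degrees of freedom.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    total = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * total

# Test statistic and degrees of freedom quoted in the text (Appendix B).
statistic = 348.59
lags = 10

p_value = chi2_sf_even(statistic, lags)
print(f"p-value = {p_value:.3g}")
```

For comparison, the 5% critical value with 10 degrees of freedom is about 18.3, so a statistic of 348.59 lies far into the rejection region.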
The serial correlation in the residuals, highlighted by both the Breusch-Godfrey test
and the ACF plot, indicates that the current regression model does not adequately
represent the relationship between the exchange rate and tourist arrivals. The model
provides some insight, but the structure left in its residuals suggests that it is
missing important temporal dynamics or other influential factors. Serial correlation
also makes the reported standard errors and p-values less reliable. To obtain more
accurate and reliable results, the model needs refinement, or a shift towards
modelling approaches better suited to autocorrelated time series data.
● In January 2020, when the exchange rate is expected to be RM 3.5 per USD,
the forecast tourist arrivals in Malaysia are approximately 1.4 million.
● For February 2020, with a higher exchange rate of RM 4.9 per USD, the model
predicts a rise in tourist arrivals to approximately 2.27 million.
● The forecast interval for January is quite wide, spanning from around 184,000 to
2.62 million tourists. Similarly, the February interval spans from around 1.04
million to 3.5 million tourists.
Given the wide forecast intervals, while the point forecasts for January and February
2020 seem reasonable, there's a substantial degree of uncertainty associated with
these predictions. This uncertainty might arise from the model's inability to capture all
influencing factors, as indicated by the presence of autocorrelation in the residuals.
Standard Error:
● Intercept: 192,178 measures the sampling variability, or uncertainty, of the
intercept estimate. The smaller the standard error, the more precise the
coefficient estimate.
● Exchange Rate: 56,202 measures the sampling variability, or uncertainty, of the
exchange rate coefficient estimate.
T-value:
● Intercept: -4.017 indicates that the estimated intercept lies about four
standard errors below the null hypothesis value of zero.
● Exchange Rate: the t-value of 11.055 provides strong evidence of a
significant positive relationship between the exchange rate (RM to USD) and
the number of tourist arrivals. As the exchange rate increases, the number of
tourist arrivals also tends to increase, and a relationship this strong is very
unlikely to arise by chance alone.
Pr(>|t|):
● Intercept: 7.13e-05 is the p-value associated with the t-test for the intercept.
Given that it's much smaller than the common alpha level of 0.05, the
intercept is statistically significant.
● Exchange Rate: < 2e-16 is extremely small, far below the typical 0.05 threshold,
indicating that the relationship between the exchange rate and tourist arrivals,
as described by this coefficient, is statistically significant.
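The coefficient estimates themselves are not quoted here, but they can be recovered approximately from the figures above, since a t-value is the estimate divided by its standard error. The Python sketch below does this and then reproduces the January and February 2020 point forecasts reported earlier. The function name `point_forecast` is illustrative, and the results are approximate because the quoted t-values and standard errors are rounded.

```python
# Standard errors and t-values as quoted in the text (Appendix A).
se = {"intercept": 192_178, "exchange_rate": 56_202}
t = {"intercept": -4.017, "exchange_rate": 11.055}

# t = estimate / standard error, so estimate = t * standard error.
coef = {name: t[name] * se[name] for name in se}

def point_forecast(rate):
    """Fitted tourist arrivals for a given RM-per-USD exchange rate."""
    return coef["intercept"] + coef["exchange_rate"] * rate

jan = point_forecast(3.5)  # assumed January 2020 exchange rate from the text
feb = point_forecast(4.9)  # assumed February 2020 exchange rate from the text

print(f"intercept ~ {coef['intercept']:,.0f}, slope ~ {coef['exchange_rate']:,.0f}")
print(f"January 2020: {jan:,.0f}  February 2020: {feb:,.0f}")
```

Under these figures, each one-unit rise in the RM per USD rate is associated with roughly 620,000 additional arrivals, and the two forecasts come out near 1.40 million and 2.27 million.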
Residuals:
Minimum -1634548
1Q -425708
Median -165448
3Q 515899
Maximum 1559931
A residual is the actual value minus the fitted value, so a negative residual means
the model over-predicted and a positive residual means it under-predicted.
● Min: This is the smallest residual, indicating that in the most extreme case, the
model over-predicted the actual tourist arrivals by about 1.63 million.
● 1Q: 25% of the residuals are below this value. This means that in 25% of the
cases, the model over-predicted the actual value by more than 425,708
tourists.
● Median: A negative median means the model over-predicts in more than half
of the cases; the median residual is an over-prediction of 165,448.
● 3Q: 75% of the residuals are below this value. This means that in the top 25%
of cases, the model under-predicted the actual value by more than 515,899
tourists.
● Max: This is the largest residual, indicating that in the most extreme case, the
model under-predicted the actual tourist arrivals by about 1.56 million.
Multiple R-squared:
● This value represents the proportion of the variance in the dependent
variable (Tourist Arrival) that's explained by the independent variable
(Exchange Rate). An R² of 0.2483 means that the model explains
about 24.83% of the variability in the tourist arrivals. The closer this
value is to 1, the better the model fits the data.
Adjusted R-squared:
● Similar to R², the adjusted R² provides a measure of how well the
independent variables explain the variability in the dependent variable.
However, it adjusts for the number of predictors in the model,
preventing it from artificially inflating when unnecessary predictors are
added. An adjusted R² of 0.2463 indicates that, after adjusting for the
number of predictors, the model explains about 24.63% of the
variability in tourist arrivals.
F-statistic:
● An F-statistic of 122.2 is quite large, suggesting that the model
provides a significantly better fit to the data than a model with no
predictors.
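The three fit statistics are tied together by simple identities, which gives a quick consistency check on the reported output. The Python sketch below assumes a regression with an intercept and a single predictor. The residual degrees of freedom are not stated in the text, so they are inferred here from R² and the F-statistic; Appendix A would give the actual value.

```python
# Figures quoted in the text (Appendix A).
r2 = 0.2483
f_stat = 122.2
t_slope = 11.055

# With one predictor, the F-statistic is the square of the slope's t-value.
f_from_t = t_slope ** 2

# F = R^2 / ((1 - R^2) / df_resid), so df_resid = F * (1 - R^2) / R^2.
# This is an inferred value, not one reported in the text.
df_resid = round(f_stat * (1 - r2) / r2)

# Adjusted R^2 for one predictor: 1 - (1 - R^2) * (n - 1) / (n - 2),
# where n - 2 = df_resid and n - 1 = df_resid + 1.
adj_r2 = 1 - (1 - r2) * (df_resid + 1) / df_resid

print(f"F from t-value: {f_from_t:.1f}")
print(f"inferred residual df: {df_resid}")
print(f"adjusted R-squared: {adj_r2:.4f}")
```

The squared t-value matches the reported F-statistic, and the adjusted R² implied by the inferred degrees of freedom agrees with the reported 0.2463.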
● Seasonality: There appear to be regular peaks and troughs within each year.
This suggests that there is a seasonal pattern in the tourist arrivals data.
● Trend: There's an overall upward trend in the data, indicating that the number
of tourist arrivals has generally increased over the years. There are, however,
periods where the trend is relatively flat or slightly declining.
● Cyclic Patterns: Beyond the regular seasonal fluctuations, there seem to be
longer-term patterns where the tourist arrivals rise and fall.
2.2 Determine Seasonality
a) Plotting the time series data offers an intuitive way to detect recurring trends
or patterns that happen at consistent timeframes. When the graph displays
regular peaks, valleys, or other noticeable patterns at certain times each year,
it's a sign of seasonality. This direct visual method is often the starting point
because it vividly underscores pronounced seasonal tendencies.
b) On the other hand, the ACF plot, also known as a correlogram, illustrates how
a time series is correlated with its own previous values, which are termed
lags. When there's seasonality in the data, the ACF plot will exhibit
pronounced spikes at multiples of the seasonal period. For example, in
monthly data with an annual seasonal pattern, we’d expect to see prominent
spikes at lags 12, 24, and 36. Such a recurring pattern in the
ACF, especially a notable spike at lag 12, confirms that each value is
correlated with the value for the same month a year earlier, signalling strong
seasonality.
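The behaviour of the ACF at the seasonal lag can be illustrated with a small Python sketch. The series below is synthetic (a pure annual sine wave over ten years), not the tourist data, and the helper name `acf` is illustrative. It uses the usual sample autocorrelation formula.

```python
import math

def acf(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[i] - mean) * (series[i - lag] - mean) for i in range(lag, n))
    return num / denom

# Synthetic monthly series: 10 years with a 12-month cycle (illustration only).
series = [100 + 20 * math.sin(2 * math.pi * m / 12) for m in range(120)]

acf_6 = acf(series, 6)    # half a year apart: peaks line up with troughs
acf_12 = acf(series, 12)  # one year apart: peaks line up with peaks

# Approximate 95% bounds commonly drawn on a correlogram.
bound = 1.96 / math.sqrt(len(series))

print(f"lag 6: {acf_6:.2f}, lag 12: {acf_12:.2f}, bounds: +/-{bound:.2f}")
```

A strongly positive value at lag 12 together with a strongly negative value at lag 6 is the signature of an annual cycle. Real data would show weaker, noisier spikes.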
The data are subset to a particular time window, January 2010 to January 2020,
allowing for a more detailed and relevant analysis of recent trends, patterns, and
behaviours in tourist arrivals (Appendix C).
This "tourist2" dataset is split into training and test subsets. Roughly 80% of the
data (96 of 121 observations) is allocated for training, covering January 2010 to
December 2017, while the remaining 20% (25 observations) is set aside for testing,
covering January 2018 to January 2020. The lengths of both subsets are verified
using the `length()` function, as shown in Appendix D.
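The observation counts follow directly from the date ranges, as the short Python sketch below shows. The helper name `months_between` is illustrative, and the sketch assumes one observation per calendar month with both end points included.

```python
def months_between(start, end):
    """Inclusive number of months from (year, month) start to (year, month) end."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1]) + 1

train_n = months_between((2010, 1), (2017, 12))
test_n = months_between((2018, 1), (2020, 1))
train_share = train_n / (train_n + test_n)

print(f"train: {train_n}, test: {test_n}, train share: {train_share:.1%}")
```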
2.5 Fit a piecewise linear regression model to the training set and include
seasonal dummies (Appendix E)
The model's R² of approximately 0.60 means it captures a reasonable share of the
core trends in the data. Time, the 2014 trend shift, and seasonality are therefore
important factors, but the roughly 40% of variation left unexplained suggests that
other external variables also matter. The highly significant F-statistic indicates that
the predictors are jointly significant in explaining tourist arrivals.
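One common way to set up the predictors for such a model is sketched below in Python. This is an assumption about the construction, not a copy of the code behind Appendix E. It takes the time index to start at 1 in January 2010, places the knot at January 2014 (index 49), defines the segmented term as max(0, t - knot), and uses January as the baseline month for the seasonal dummies.

```python
MONTHS = 96  # training observations, January 2010 to December 2017
KNOT = 49    # assumed knot: January 2014 when t = 1 is January 2010

def design_row(t):
    """Predictors for time index t: intercept, Time, SegmentedTime, 11 dummies."""
    month = (t - 1) % 12 + 1
    # Zero up to the knot, then 1, 2, 3, ... so the slope can change after it.
    segmented = max(0, t - KNOT)
    # One dummy per month from February to December; January is the baseline.
    dummies = [1 if month == m else 0 for m in range(2, 13)]
    return [1, t, segmented] + dummies

X = [design_row(t) for t in range(1, MONTHS + 1)]

print(len(X), "rows x", len(X[0]), "columns")
print("December 2013:", X[47][:3], " February 2014:", X[49][:3])
```

With this setup, the coefficient on the segmented term is the change in slope after the knot, so the post-2014 trend is the sum of the two trend coefficients.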
2.6 Plot of the training set data with the fitted values
Both the actual and fitted values show a growth pattern up to approximately
January 2014. From this point, the fitted values follow a descending trend, which
reflects the segmented design of the model and suggests that an event around
January 2014 may have influenced tourist arrivals. The actual data show the same
change in trend, which supports placing the break in the regression at that point.
The piecewise linear regression approximates the actual data well overall, but there
are instances where it falls short of or exceeds the actual figures. Clear seasonal
patterns are noticeable in the actual tourist arrivals, characterised by recurring highs
and lows. The model captures some of this seasonal behaviour, but not all of it:
certain peaks in the actual data exceed the fitted values, and certain troughs fall
below them.
Tourist ~ Time + SegmentedTime + Month
The graph contrasts the actual tourist arrivals with the model's forecasts over a
given period. The actual data show a gentle climb in tourist figures, marked by
consistent highs and lows indicative of seasonality. In contrast, the forecasts from
the piecewise linear regression model show a continuous decrease, consistent with
the model extrapolating the downward trend fitted after January 2014.
The actual data show clear seasonal patterns, with some months consistently
experiencing a rise or fall in tourist numbers. The forecasts reproduce the seasonal
shape but do not align closely with the actual peaks and troughs.
The discrepancy between the actual and forecast numbers grows over time,
especially towards the latter part of the graph, so the model drifts further from the
actual numbers as the horizon lengthens. In the graph's initial phases, the model
tends to overestimate tourist counts. Later, it falls below the actual figures,
particularly during the pronounced peaks in tourist arrivals.
2.8.2 Produce forecasts for the time period spanning the test dataset using an
ensemble forecast comprising the ETS and ARIMA forecasts
The chart shows the actual tourist numbers alongside forecasts from three
methods. The actual data exhibit a rising trend with pronounced seasonal highs and
lows. The ETS method, an exponential smoothing model with error, trend, and
seasonal components, initially tracks the actual figures quite well. However, as time
progresses, it tends to fall short, especially during peak tourist seasons. The ARIMA
method, which combines autoregressive, differencing, and moving-average terms,
predicts a steady decline in tourist numbers. It captures the drop in tourists in
certain periods, but it is generally more conservative and misses the high seasons.
The Ensemble approach averages the ETS and ARIMA forecasts and so lies between
the two. Nonetheless, even this combined approach tends to underestimate the high
points in the actual data.
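A simple average is the most common way to combine two forecasts, and the accuracy figures in Section 2.8.5 are consistent with it: mean error and mean percentage error are linear in the forecast, so for a simple average they must equal the mean of the two component values. The Python sketch below shows the combination and that check. The function name `ensemble` and the three-month forecast values are hypothetical and used for illustration only.

```python
def ensemble(ets_forecast, arima_forecast):
    """Pointwise simple average of two equal-length forecast sequences."""
    if len(ets_forecast) != len(arima_forecast):
        raise ValueError("forecasts must cover the same periods")
    return [(e + a) / 2 for e, a in zip(ets_forecast, arima_forecast)]

# Hypothetical forecasts for three months, for illustration only.
combined = ensemble([2.10e6, 2.30e6, 2.00e6], [2.00e6, 2.10e6, 1.90e6])

# ME and MPE quoted in Section 2.8.5.
ets = {"ME": 67505.509, "MPE": 2.8190726}
arima = {"ME": 3750.373, "MPE": -0.1219569}
ensemble_reported = {"ME": 35627.941, "MPE": 1.3485578}

# For a simple average, ME and MPE are the mean of the component values.
averaged = {k: (ets[k] + arima[k]) / 2 for k in ets}

for k in averaged:
    print(f"{k}: averaged {averaged[k]:.4f}, reported {ensemble_reported[k]}")
```

RMSE and MAE do not average in this way, which is why the ensemble's RMSE can fall below the mean of the two component RMSE values.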
2.8.3 Produce forecasts for the time period spanning the test dataset using the
bagging procedure
2.8.4 Produce the plot along with the point forecasts from the piecewise linear
regression model, ensemble method, and the bagging procedure.
The chart presents actual tourist arrivals, which rise steadily with evident seasonal
highs and lows. The Piecewise Linear Regression closely tracks these arrivals at first
but drifts away in the later stages, often underestimating the actual figures. The
Exponential Smoothing (ETS) method gives a fair representation, especially in the
early sections, but tends to underestimate the peaks as time progresses. The ARIMA
forecast is more cautious, suggesting a mild decline and frequently missing the
seasonal high points. The Ensemble method, which averages the ETS and ARIMA
forecasts, provides a more measured forecast between the two. The Bagging
approach, which averages forecasts from models fitted to bootstrapped versions of
the series, aligns reasonably well with the actual data early on, but slightly
underestimates later figures.
2.8.5 Determine method that produces the most accurate forecast (Appendix F)
Piecewise linear regression
ME 23147.049
RMSE 149539.1
MAE 117584.9
MPE 0.7676820
MAPE 5.431734
These errors are comparatively low: the RMSE, MAE, and MAPE are each the second
lowest among the five methods, and the ME sits in the middle of the range.
ETS
ME 67505.509
RMSE 167287.4
MAE 133773.9
MPE 2.8190726
MAPE 6.103809
This model has the highest ME, RMSE, and MAE among all models. It also has the
highest MAPE, suggesting it may not be the best model in terms of percentage
errors.
ARIMA
ME 3750.373
RMSE 157146.9
MAE 116758.9
MPE -0.1219569
MAPE 5.382891
ARIMA has the lowest ME, MAE, and MAPE of all the methods, but its RMSE is the
second highest, which suggests a few large errors. It is clearly better than the ETS
model.
Ensemble
ME 35627.941
RMSE 157120.6
MAE 119557.7
MPE 1.3485578
MAPE 5.481776
The Ensemble improves on ETS for every measure, but it beats ARIMA only
marginally on RMSE, and its ME, MAE, and MAPE are higher than ARIMA's. Combining
forecasts therefore helps relative to the weaker component here, but not relative to
the stronger one.
Bagging
ME 12042.293
RMSE 144335.1
MAE 124681.4
MPE 0.1179655
MAPE 5.746465
This method has the lowest RMSE among all models, and its ME is the second
lowest. However, its MAE and MAPE are the second highest, so it is not the most
accurate in terms of absolute or percentage errors.
In conclusion, the ranking depends on the measure. Based on RMSE, which penalises
large errors most heavily, the Bagging method is the most accurate forecasting
method among the ones considered. ARIMA has the lowest ME, MAE, and MAPE, so
the choice of Bagging rests on RMSE.
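Because the methods rank differently on different measures, it helps to tabulate the best method per measure. The Python sketch below does this with the figures quoted above. It assumes that the first, unlabelled block of figures belongs to the piecewise linear regression, and it compares ME by absolute size, since a mean error close to zero in either direction indicates low bias. The function name `best_by` is illustrative.

```python
# Test-set accuracy measures quoted in this section (Appendix F).
# Assumption: the first, unlabelled block is the piecewise linear regression.
metrics = {
    "Piecewise": {"ME": 23147.049, "RMSE": 149539.1, "MAE": 117584.9, "MAPE": 5.431734},
    "ETS":       {"ME": 67505.509, "RMSE": 167287.4, "MAE": 133773.9, "MAPE": 6.103809},
    "ARIMA":     {"ME": 3750.373,  "RMSE": 157146.9, "MAE": 116758.9, "MAPE": 5.382891},
    "Ensemble":  {"ME": 35627.941, "RMSE": 157120.6, "MAE": 119557.7, "MAPE": 5.481776},
    "Bagging":   {"ME": 12042.293, "RMSE": 144335.1, "MAE": 124681.4, "MAPE": 5.746465},
}

def best_by(measure):
    """Method with the smallest error magnitude on the given measure."""
    return min(metrics, key=lambda method: abs(metrics[method][measure]))

best = {measure: best_by(measure) for measure in ("ME", "RMSE", "MAE", "MAPE")}

for measure, method in best.items():
    print(f"{measure}: {method}")
```

Which measure to prioritise is a design choice: RMSE weights large misses more heavily, while MAE and MAPE treat all errors in proportion to their size.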
Phase (3): Conclusion and policy implications
The analysis began with a close look at the trends and patterns of tourist arrivals in
Malaysia. From Phase (1), it was clear that tourist visits followed predictable seasonal
patterns, likely influenced by events like festivals, school breaks, or even the
weather.
Policy Implications:
● The rhythmic ebb and flow of tourist numbers suggest a need for strategic
planning. During slower months, perhaps we can introduce special
promotions to attract visitors. And when a busy season is anticipated, it's
crucial to ensure we're well-equipped to handle the crowd, guaranteeing a
seamless experience for all.
● The Bagging method's performance, with the lowest RMSE of the methods
compared, shows the value of data-driven decision-making. Relying on such
techniques can guide stakeholders, helping them anticipate and navigate
challenges effectively.
● Even with a strong performer like the Bagging method, it's essential to keep
our models updated. Factors like global events or economic shifts can alter
tourism patterns, so the models should be refitted as new data arrive.
● With some months more popular than others for tourists, spreading out
marketing efforts can ensure a more balanced influx of visitors throughout the
year. This can lead to better revenue management and a sustainable tourism
model.
Appendix
Appendix A
Summary of Linear Regression Model
Appendix B
Residual Diagnostics
Appendix C
Subset of data starting from January 2010 to January 2020
Appendix D
Training and Test dataset
Appendix E
Summary of Piecewise Linear Regression Model
Appendix F
Determine method that produces the most accurate forecast