
CAPSTONE PROJECT REPORT

CAC - 3
Machine Learning

“GoldForecast”

A Regression and Time Series Analysis

Submitted By,

Suhani Lariya

22112338

4 B.Sc Economics & Analytics

Under the guidance of

Prof. Vandana Bhagat

Department of Data Science

3rd May 2024

INDEX

Sl. No.  Table of Contents
1. Introduction
2. Problem Statement
3. Dataset Description
4. EDA & Preprocessing
5. Model Selection & Model Building
6. Result Description
7. Conclusions & Recommendations

I. INTRODUCTION

This project undertakes a comprehensive analysis of predicting gold prices primarily focused on the USA
market. It aims to leverage regression models and time series forecasting techniques to anticipate
fluctuations in gold prices based on key independent variables such as inflation, unemployment rate,
interest rate, and oil prices.

The project begins with a clear delineation of its objectives and their relevance in addressing contemporary
challenges in the finance domain. Through the systematic exploration of data science methodologies, it
seeks to elucidate the intricate relationships between economic indicators and gold prices, thus
empowering stakeholders to make informed decisions.

Key phases of the project include:

● Problem Identification:

The project identifies the core challenge of predicting gold prices in the USA market and formulates
specific objectives to address this problem.

● Data Collection and Description:

A suitable dataset encompassing historical data on gold prices, inflation rates, unemployment rates,
interest rates, and oil prices is acquired and described in detail.

● Exploratory Data Analysis (EDA) and Preprocessing:

Through EDA, the project delves into the characteristics of the dataset, uncovering patterns, correlations,
and outliers. Preprocessing steps are undertaken to clean and transform the data, ensuring its suitability
for modeling.

● Model Selection and Building:

The project justifies the selection of regression models and time series forecasting techniques based on the
problem statement and dataset characteristics. Models are built and fine-tuned to predict gold prices
accurately.

● Result Interpretation:

The outcomes of the models are analyzed and interpreted, shedding light on the factors influencing gold
prices and the predictive performance of the models.

● Recommendations and Conclusion:

Based on the findings, the project provides actionable recommendations for stakeholders in the finance
industry and concludes by summarizing the project's contributions and potential avenues for future
research.

II. PROBLEM STATEMENT

The problem statements for the study can be categorized into 3 parts:

Problems that can be solved through:

1. EDA (Exploratory Data Analysis) - This part covers the questions that are resolved at the
data-preparation stage. It deals with:
● Missing Values: Check for missing values in any of the columns and decide on
strategies for handling them, such as imputation or removal.
● Outliers: Identify outliers in the Gold Prices column as well as in other
independent variables and determine if they are genuine data points or errors.
● Data Visualization: The type of data, the distribution of Gold Prices and other
variables, pair plots, histograms, etc.
● Correlation Analysis: Explore the relationships between Gold Prices and other
independent variables (Inflation, Unemployment rate, Interest rates, Oil Prices)
using correlation analysis or visualizations (e.g., scatter plots, correlation
matrices).
● Variable Importance: Determine the relative importance of each independent
variable in predicting Gold Prices using techniques like feature importance.
● Data Scaling: Evaluate whether data scaling is necessary for any of the variables.
● Multicollinearity: Determine whether the data has multicollinearity between two or
more independent variables, and how to deal with it.

Exploring these aspects through EDA can provide valuable insights into the data and help
in building more accurate predictive models for Gold Prices.

2. Regression model - Some problems can be addressed effectively through regression
analysis.
● Prediction: The primary objective of regression analysis in this scenario is to predict
future Gold Prices based on historical data and the values of the independent
variables. By fitting a regression model to the historical data, we can estimate the
relationship between Gold Prices and the independent variables and then use this
relationship to make predictions for future time periods.
● Model Selection: Select an appropriate regression model that best fits the data by
exploring different types of regression models (e.g., linear regression, ridge
regression, Lasso regression), assessing their performance using techniques such as
cross-validation, and fine-tuning them through hyperparameter tuning to decide
which model to continue with.
● Assumptions Checking: Check whether the assumptions of regression analysis are
met, including normality of residuals and the absence of multicollinearity.
● Model Evaluation: Once you have fitted several regression models, you need to
evaluate their performance to determine which model is better suited for
predicting Gold Prices. Common evaluation metrics for regression models include
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared
(coefficient of determination). These metrics are compared to identify the model that
provides the best balance between accuracy and simplicity.
● Interpretation of Results: Regression analysis also allows you to interpret the
coefficients of the independent variables in the model. This can provide valuable
insights into the factors influencing Gold Prices in the real world.

3. Time-Series model/forecasting - Some questions can be answered most directly with time
series analysis, especially when forecasting.
● Time Series Forecasting: Predict future values of a time series variable (e.g., stock
prices, sales, temperature) based on historical data. EDA can help identify trends,
seasonality, and other patterns in the data, while regression modeling can be used
to build predictive models.
● Seasonality and Trends: Identify seasonality or long-term trends in Gold Prices and
other variables over time. This could involve visualizing time series plots and
applying decomposition methods.
● Data Stationarity: Check if the data exhibits stationarity, which is crucial for time
series analysis and forecasting.

The challenges and insights in a time series project extend beyond analysis and exploratory data
analysis (EDA). They encompass the entire process of completing the project, including data
collection, preprocessing, modeling, and interpretation. It is something we gain from experience.

III. DATA DESCRIPTION

● The data has 2244 rows and 6 columns, including the timestamp. The collected data was
originally a single time series, but as additional variables were added it effectively became panel data.
● Data Source:

The data was collected from various reputable sources, including the Federal Reserve website,
the U.S. Energy Information Administration (EIA) website, Statista, the World Bank, and
Macrotrends.
These sources provide reliable and authoritative data on economic and financial indicators,
making them suitable for analysis and prediction.
● Data Collection Process:
-The objective was to predict gold prices using time series data. Weekly data for gold prices was
obtained from the Federal Reserve website.
-Additional variables impacting gold prices were collected individually from various sources,
including economic indicators such as inflation, unemployment rate, interest rates, and oil prices.
-Due to variations in data availability and consistency, the time period for analysis was narrowed
down from 1978-2023 to 1980-2020, spanning 40 years.
-This timeframe was chosen to ensure consistency in data coverage across all variables and to
manage the size of the dataset, as handling too many rows of data became challenging.
-Also, while many variables theoretically impact gold prices, in practice the data is difficult to
find because consistent data is not available for all of them.
● Data Variables:
-Time Series Variable: Gold Prices (USD) - Weekly data from 1980 to 2020.
-Economic Indicators: Inflation, Unemployment Rate, Interest Rates, Oil Prices (USD), and
potentially other relevant variables impacting gold prices.
-Each variable is collected at weekly intervals, providing a detailed view of changes over time.
-Variable Relation with Gold prices:
1. Gold Prices (USD): Represents gold's market value in U.S. dollars per ounce. Gold is
often sought during economic uncertainty or high inflation, leading to increased prices.

2. Inflation: Measures the rate of price increase for goods and services. Gold is considered a
hedge against inflation, so its prices tend to rise during periods of high inflation.
3. Unemployment Rate: Indicates the percentage of the workforce without jobs. High
unemployment rates often coincide with economic instability, which tends to increase gold prices.
4. Interest Rates: Reflect the cost of borrowing or returns on investments. Low interest rates
make gold more attractive as an investment, while higher rates may reduce demand and
lower prices.
5. Oil Prices (USD): Represent crude oil's market value per barrel. Changes in oil prices can
impact inflation, and oil generally shares a positive relationship with gold prices.

● Descriptive Statistics:
-The data originally contains some missing values; roughly 1-2 years of data were unavailable for
some variables.
-Count: The number of non-null observations for each variable. In this case, there are 2136
observations for each variable.
-Mean: The average value of each variable. For example, the mean gold price is approximately
$684.83 USD.
-Minimum: The lowest value observed for each variable. For instance, the lowest gold price
observed is $253.80 USD.
-25th percentile (first quartile): The value below which 25% of the observations fall. For
example, 25% of the gold prices are below $353.99 USD.
-Median (50th percentile, second quartile): The middle value of the dataset when sorted in
ascending order. For example, the median gold price is $417.75 USD.
-75th percentile (third quartile): The value below which 75% of the observations fall. For
example, 75% of the gold prices are below $1140.94 USD.
-Maximum: The highest value observed for each variable. For instance, the highest gold price
observed is $2031.15 USD.
-Standard deviation: A higher standard deviation indicates greater variability in the data. For
example, the standard deviation for gold prices is approximately $457.73 USD.

● Visualisation:

Histograms can be used to examine the distribution of the data. The data for Gold Prices (USD)
is right-skewed, and "Inflation" and "Interest rates" are also right-skewed. This can be addressed
with transformation techniques.

A scatter plot of "Gold Prices (USD)" against "Year" shows an upward trend, with prices
increasing over the years.

● Insight :
The dataset was exceptionally large, making it challenging to efficiently monitor and analyze
each individual value, particularly within Excel. Large datasets can overwhelm traditional
spreadsheet software, leading to performance issues, increased likelihood of errors, and difficulty
in maintaining data integrity.

IV. EDA (EXPLORATORY DATA ANALYSIS) & PREPROCESSING

1. Missing Values :

The data has a total of 2244 rows, of which two variables have only 2192 and 2136 non-null
rows respectively, so they contain some missing values.

To address missing values in the dataset, the most suitable approach for this time series
data is removal. The availability of data for variables such as inflation and unemployment
rate varied, with inflation data available until 2021 and unemployment rate data only
available until 2020. Since the dataset follows a time series format and missing values
cannot be accurately imputed using methods like mean, median, or forward/backward fill,
it was decided to remove rows with missing values.

By removing rows with missing values, we ensured that the integrity of the time series
data remained intact, preserving the temporal relationships between variables. This
approach avoids introducing bias or inaccuracies that may arise from imputing missing
values.
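A minimal sketch of this removal step, assuming the merged weekly data sits in a CSV named gold_weekly.csv with a 'Year' date column (the file and column names are illustrative, not taken from the report):

```python
import pandas as pd

# Load the merged weekly dataset (file and column names are assumptions).
df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"])

# Inspect how many values are missing per column.
print(df.isna().sum())

# Drop rows with missing values instead of imputing, so every remaining
# weekly observation is a genuine recorded value.
df = df.dropna().reset_index(drop=True)
print(df.shape)
```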

2. Outlier Detection:

Looking into the data, there were many outliers present.

It was observed that there were numerous instances where values around 200 to 300 appeared to
be outliers. However, upon further investigation, it was found that these outliers were not random
occurrences but were often associated with significant economic activities or events. For example,
during periods of economic recession or instability, unemployment rates may spike, leading to
outlier values in the dataset. Similarly, high inflation rates during recessionary periods can also
result in outliers.

Given the historical context and the potential significance of these outlier values, it was decided
not to remove them from the dataset. Removing these outliers could potentially obscure the true
underlying patterns and trends in the data. By retaining these outlier values, the dataset provides
a more comprehensive and accurate representation of the economic dynamics and fluctuations
over time.
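The report does not state exactly how the outliers were flagged; one common approach is the interquartile-range (IQR) rule, sketched below (column names are assumptions based on the variables described above):

```python
import pandas as pd

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()
cols = ["Gold Prices (USD)", "Inflation", "Unemployment rate",
        "Interest rate", "Oil Prices (USD)"]

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Boolean mask marking values outside 1.5 * IQR of the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# Flag (but deliberately keep) the potential outliers in each column.
for col in cols:
    print(col, "->", int(iqr_outliers(df[col]).sum()), "potential outliers")
```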

3. Visualisation:

From the distribution of each column, it can be seen that almost all the variables have
right-skewed data, meaning the majority of values lie on the left side of the distribution.
Only "Unemployment rate" has a roughly even distribution, from a broader perspective.

The subplots show the trend of each variable spanning 1980 to 2020. The trend for most of the
variables looks non-stationary; from the graphs, plot 1 (Gold Prices (USD)) and plot 3
(Unemployment rate) are definitely non-stationary. This is a common phenomenon for
time-series and panel data.

-Given below is called “PAIRPLOT” , also known as a scatterplot matrix, displays pairwise
relationships between variables in a dataset. Each variable in the dataset is plotted against every
other variable, resulting in a grid of scatter plots.

-Oil Prices and Inflation are not strongly correlated, as can be seen in the graph.

-Unemployment rate and Gold Prices (USD) appear more correlated according to the pairplot.

-Inflation & Oil Prices, and Interest rate & Oil Prices, are not much correlated.
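A minimal sketch of these plots using matplotlib and seaborn (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()
cols = ["Gold Prices (USD)", "Inflation", "Unemployment rate",
        "Interest rate", "Oil Prices (USD)"]

# Histograms: reveal the right skew in gold prices, inflation, and interest rates.
df[cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Pairplot (scatterplot matrix): pairwise relationships between all variables.
sns.pairplot(df[cols])
plt.show()
```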

4. Correlation Matrix:

-Interest rate and gold prices share a correlation of -0.56 , as higher interest rates can lead to
lower demand for gold and, consequently, lower gold prices.

-It can be observed that the relationship between inflation and gold prices is theoretically
positive, but according to the correlation matrix it is negative, at -0.21. This may be due to
external factors, such as government policies, and in any case it is not a very strong relationship.

- The relationship between the unemployment rate and gold prices shows as -0.01, an extremely
weak negative correlation. Theoretically, the relationship between these two variables may not
always be consistent because other factors influence market sentiment and investor behavior;
according to the data, the correlation is slightly negative.

-Changes in oil prices can indirectly affect gold prices. A significant rise in oil prices may raise
concerns about inflation and economic slowdown, prompting investors to hedge with gold, thus
increasing its demand and price. Conversely, a substantial decrease in oil prices may alleviate
inflation worries but signal economic weakness, leading investors to seek safe-haven assets like
gold, consequently driving its prices higher. The correlation coefficient between Oil Prices (USD)
and Gold Prices (USD) is fairly strong at 0.77, which confirms this positive correlation.
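A minimal sketch of how such a correlation matrix and heatmap can be produced (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()

# Pearson correlation matrix of the numeric variables.
corr = df.drop(columns=["Year"]).corr()
print(corr["Gold Prices (USD)"].round(2))

# Heatmap with the coefficients annotated, matching the matrix discussed above.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```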

5. Data Scaling (Standardization):

Assessing the need for data scaling, it was determined that scaling would not be pursued. There is
a disparity in magnitudes between variables: prices are represented by large numbers (Gold
Prices (USD), Oil Prices (USD)), while rates are in decimal form (Unemployment rate, Interest
rate). Scaling these variables could have introduced excessive variation, leading to the decision to
retain the original scale. This choice remains consistent even after considering the potential
effects of scaling.

Scaling aims to bring all features to a similar scale, typically between 0 and 1 or centered around
a mean of 0 with a standard deviation of 1. However, if the disparity in magnitude between
variables is too large, scaling might not be appropriate as it could distort the relationships
between variables.

In our scenario, where scaling doesn't improve the interpretability or performance of the model
and may even introduce unnecessary variability, it's reasonable not to scale the data.
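For reference, had standardization been applied, it would look like the sketch below (the report did not scale the data; file and column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()
cols = ["Gold Prices (USD)", "Inflation", "Unemployment rate",
        "Interest rate", "Oil Prices (USD)"]

# Standardization: each column rescaled to mean 0 and standard deviation 1.
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)
print(df_scaled.describe().round(2))
```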

6. Multicollinearity:

To check whether the data has multicollinearity, we used Python. Upon revisiting the results,
some variable pairs, such as Gold Prices (USD) and Oil Prices (USD), and Inflation and Interest
rates, were closely linked, with correlation coefficients ranging from 0.7 to 0.9. This suggests
they share overlapping information, making it harder to isolate their individual effects, and
implies multicollinearity.

To handle this, I looked into using principal component analysis (PCA) to simplify things, but
the results didn't seem promising.

The plot showed a downward trend with k = 1, suggesting PCA might not be the best fit here.

Also, dropping any variables would seriously limit the number of factors considered in the
analysis. So, for now, all variables are retained, acknowledging that there may be some overlap in
the data while still aiming to make the most of it for the analysis.
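The report checked multicollinearity through correlations and PCA; a complementary diagnostic (not used in the report, shown here only as a sketch with assumed column names) is the variance inflation factor:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()

# VIF for each predictor; values above roughly 5-10 suggest multicollinearity.
X = add_constant(df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```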

V. MODEL SELECTION AND MODEL BUILDING

1. The data requires regression analysis

● Linear Regression Model : We chose linear regression at first because it's a simple and
interpretable model that provides a good baseline for understanding the relationship
between the input variables and the target variable. Additionally, linear regression
assumes a linear relationship between the features and the target variable, which is a
reasonable starting point for many datasets.

Linear Regression RMSE: 1.2366982197576302e-13

Linear Regression R-squared : 1

( When we initially ran linear regression and obtained an accuracy of 1, it might seem
like we've found the perfect model. However, this high accuracy could be indicative of
overfitting, especially if we're testing the model on the same data it was trained on.
Overfitting occurs when a model learns the noise in the training data rather than the
underlying pattern, resulting in poor generalization to new, unseen data.

To address the possibility of overfitting and to ensure that our model generalizes well to
new data, we explored various other models such as decision trees, random forests, and
ensembling techniques like stacking. )
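A minimal sketch of the baseline linear regression fit on a held-out test set (file and column names are assumptions; the report's exact setup may differ):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("gold_weekly.csv", parse_dates=["Year"]).dropna()
X = df[["Inflation", "Unemployment rate", "Interest rate", "Oil Prices (USD)"]]
y = df["Gold Prices (USD)"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin = LinearRegression().fit(X_train, y_train)
pred = lin.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R-squared:", r2_score(y_test, pred))
```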

● Lasso Regression : It offers a regularization technique that helps prevent overfitting by
penalizing the absolute size of the regression coefficients. Lasso performs feature
selection by shrinking some coefficients to exactly zero, effectively removing those
features from the model. This feature selection property is useful when dealing with
datasets with many features, as it can help simplify the model and improve its
interpretability by focusing on the most important features.

Additionally, Lasso regression can handle multicollinearity among the features, which is
common in real-world datasets. Therefore, by using Lasso regression, we aim to build a
more robust and parsimonious model while avoiding overfitting.

Results:

Lasso Regression RMSE: 2.2337202089187744e-06

Lasso Regression R-squared: 1

The accuracy is still essentially the same. We will check whether these results are genuine or a
result of overfitting.
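A short sketch of the Lasso fit, continuing from the train/test split in the previous sketch (the alpha value is purely illustrative, not taken from the report):

```python
from sklearn.linear_model import Lasso

# alpha controls the strength of the L1 penalty; larger values shrink more
# coefficients to exactly zero, performing feature selection.
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("Coefficients:", dict(zip(X.columns, lasso.coef_)))
print("Test R-squared:", lasso.score(X_test, y_test))
```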

● DECISION TREE :

Decision trees recursively split the data into subsets based on the features that best separate the
target variable. Each split aims to maximize the homogeneity of the target variable within the
resulting subsets.

-The R-squared value of 1.0 indicates that the Decision Tree perfectly predicts the target variable
on the training data.

-The high R-squared value close to 1.0 (specifically 0.9999932907263452) on the testing data
suggests that the model generalizes well to unseen data.

-A lower RMSE signifies better model performance, and in this case, the RMSE is relatively low,
indicating that the Decision Tree model's predictions are close to the actual values on average.

However, the Decision Tree also performs exceptionally well on the testing data, with a very
high R-squared value and a low RMSE. This indicates that the model generalizes well and is not
overfitting to the training data, unlike the linear regression model which had perfect accuracy but
might have overfit the data.

Overall, the Decision Tree model demonstrates superior performance compared to the linear
regression model, particularly in terms of generalization to unseen data.
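A minimal sketch of the decision-tree fit, reusing the train/test split from the earlier regression sketch (hyperparameters are defaults, chosen for illustration):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
print("Training R-squared:", tree.score(X_train, y_train))
print("Testing R-squared:", tree.score(X_test, y_test))
print("Testing RMSE:", mean_squared_error(y_test, tree.predict(X_test)) ** 0.5)
```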

● CROSS VALIDATION:

a)On Lasso regression model

Cross-validation was performed on the Lasso Regression results, comparing training accuracy
with testing accuracy. The lower the RMSE, the better.

The training and testing mean squared errors (MSEs) are very close, suggesting that the model
generalizes well to unseen data. The cross-validation RMSE scores also indicate consistent
performance across different folds, with a mean RMSE of approximately 0.0002189. The small
difference between training and testing MSEs further suggests that the model is not overfitting.

After performing cross-validation, the overall R-squared of Lasso Regression indicates that
cross-validation has helped make the model better.

b) On the Linear Regression model -

After performing cross-validation on this model, the results stayed the same, which indicates
that the Linear Regression model is not a good fit for this data.

So, Lasso Regression gives the better results.
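A minimal sketch of k-fold cross-validation for the Lasso model, continuing from the earlier sketches (the number of folds is illustrative):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE for the Lasso model.
scores = cross_val_score(lasso, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("CV RMSE per fold:", -scores)
print("Mean CV RMSE:", -scores.mean())
```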

-Residual Analysis :

Residual analysis allows us to evaluate the goodness of fit of the regression model. If the
residuals are randomly distributed around zero with constant variance, it indicates that the model
captures the underlying relationships in the data well. Deviations from this pattern may suggest
issues such as underfitting or overfitting.

1. Shapiro-Wilk Test: After running this test, we found that the residuals are not normally
distributed.

2. Breusch-Pagan Test: This test checks for heteroscedasticity (non-constant variance of the
residuals). Its results also indicated that the residual assumptions are not satisfied.

-APPLYING A LOG TRANSFORMATION TO OBTAIN NORMALLY DISTRIBUTED RESIDUALS

After applying the log transformation, the distribution looked symmetric graphically, but
re-running the tests gave the same results: the residuals were still not normally distributed.
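A minimal sketch of both residual tests on the Lasso model's test-set residuals, continuing from the earlier sketches (the exact model whose residuals were tested is not stated in the report):

```python
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

# Residuals of the fitted Lasso model on the test set.
residuals = y_test - lasso.predict(X_test)

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed.
stat, p_norm = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_norm)

# Breusch-Pagan: the null hypothesis is constant-variance (homoscedastic) residuals.
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, sm.add_constant(X_test))
print("Breusch-Pagan p-value:", bp_p)
```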

● ENSEMBLING MODELS

We employed several ensemble learning techniques to improve the predictive performance of
our models. Ensemble methods combine predictions from multiple individual models to produce
a stronger learner.

a. Bagging (Random Forest):

Bagging, or Bootstrap Aggregating, involves training multiple models (often decision trees) on
different bootstrap samples of the training data and averaging their predictions.

Bagging helps reduce overfitting by reducing the variance of the individual models. It achieves
this by training each model on a slightly different subset of the data, leading to more robust
predictions.

–Training R-squared: The R-squared value of 1.0 indicates that the Decision Tree perfectly
predicts the target variable on the training data, capturing all the variance in the target variable
based on the features.

–Testing R-squared: The high R-squared value close to 1.0 (specifically 0.9999932907263452)
on the testing data suggests that the model generalizes well and maintains strong predictive
performance on unseen data.

–Decision Tree RMSE: The Root Mean Squared Error (RMSE) of 1.1855782531712218
indicates the average magnitude of the residuals (the differences between the predicted and
actual values) in the model's predictions.

The Decision Tree model shows perfect fit on the training data, which is not uncommon for
decision trees as they can easily memorize the training data due to their ability to create complex
decision boundaries.


–Random Forest Testing R-squared: The high R-squared value close to 1.0 (specifically
0.999994285557632) on the testing data indicates that the Random Forest model generalizes
exceptionally well and maintains strong predictive performance on unseen data.
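A minimal sketch of the bagging model, reusing the earlier train/test split (100 estimators is the scikit-learn default, used here for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

# Bagging: 100 decision trees trained on bootstrap samples, predictions averaged.
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Training R-squared:", rf.score(X_train, y_train))
print("Testing R-squared:", rf.score(X_test, y_test))
```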

b. Boosting (Gradient Boosting and XGBoost):
The Gradient Boosting Regression model exhibits outstanding performance, achieving an
R-squared value of 0.999975 on the training dataset, indicating its ability to accurately capture
almost all the variance in the target variable. Impressively, it maintains its strong predictive
power on unseen data, with a testing R-squared of 0.999957, suggesting excellent generalization
capability. This model demonstrates its effectiveness in capturing complex relationships within
the dataset, making it a powerful tool for predictive modeling.
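A minimal sketch of the boosting model, again reusing the earlier split (default hyperparameters, shown for illustration only):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Boosting: shallow trees are added sequentially, each correcting the errors
# of the ensemble built so far.
gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
print("Training R-squared:", gbr.score(X_train, y_train))
print("Testing R-squared:", gbr.score(X_test, y_test))
```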

Initially, we started with a simple Linear Regression model, which yielded perfect accuracy on
the training data but raised concerns about potential overfitting due to its inability to generalize
well on unseen data.

To address this limitation, we explored various advanced regression techniques. Lasso
Regression, for instance, introduced regularization to penalize overly complex models, resulting
in improved generalization performance by preventing overfitting.

Subsequently, we experimented with ensemble methods like Random Forest Regression and
Gradient Boosting Regression. These techniques leverage multiple models to make predictions,
effectively reducing overfitting while enhancing predictive accuracy.

Additionally, we employed cross-validation techniques to evaluate model performance on
multiple subsets of the data, ensuring robustness and generalization capability. Through this
iterative process of model selection, regularization, and validation, we successfully improved
predictive accuracy while mitigating overfitting, culminating in models that exhibit strong
performance on both training and testing datasets.

TIME SERIES ANALYSIS

1. Importing Data:

Start by importing the dataset containing the time series data. Ensure that the dataset is structured
properly, with each column representing a variable and the rows representing observations over
time.

2. Converting Year Column to Timestamp as Index:

The first step in time series analysis is to ensure that the time-related data is properly formatted.
Convert the 'Year' column to a datetime data type and set it as the index of the DataFrame. This
allows for easier manipulation and analysis of time series data.
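A minimal sketch of this step (file and column names are assumptions):

```python
import pandas as pd

ts = pd.read_csv("gold_weekly.csv")

# Convert the 'Year' column to datetime and set it as the index so that
# plotting, resampling, and differencing operate on a proper time axis.
ts["Year"] = pd.to_datetime(ts["Year"])
ts = ts.set_index("Year").sort_index()
gold = ts["Gold Prices (USD)"].dropna()
```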

3. Finding the Trend of Each Variable:

Visualize the trend of each variable over time using time series plots. These plots display the
values of each variable against time, allowing you to identify any long-term patterns or trends
present in the data, such as increasing or decreasing values over time.

4. Splitting the Data:

Divide the dataset into training and testing sets using an 80:20 ratio. The training set will be used
to train the forecasting model, while the testing set will be used to evaluate the performance of
the model. This step ensures that the model's performance is assessed on unseen data.
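A short sketch of the chronological 80:20 split, continuing from the `gold` series above:

```python
# 80:20 chronological split; time series data must not be shuffled.
split = int(len(gold) * 0.8)
train, test = gold.iloc[:split], gold.iloc[split:]
print(len(train), "training observations,", len(test), "testing observations")
```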

5. Finding Stationarity:

Stationarity is a crucial assumption in time series analysis, as it ensures that the statistical
properties of the series remain constant over time. Conduct the Augmented Dickey-Fuller (ADF)
test on each variable to determine its stationarity. If a variable is found to be non-stationary,
further processing may be required to make it stationary.

(non-stationary gold prices column)
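A minimal sketch of the ADF test on the training series:

```python
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: the null hypothesis is that the series is non-stationary.
stat, p_value, *_ = adfuller(train)
print("ADF statistic:", stat, " p-value:", p_value)
# A p-value above 0.05 means we fail to reject the null, i.e. the series is non-stationary.
```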

6. Taking First Difference:

For non-stationary variables identified in the previous step, compute the first difference by
subtracting each value from its preceding value. This transformation helps remove any trend or
seasonality present in the data, making it stationary and suitable for time series analysis.

(Gold Prices variable after first differencing)
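Continuing the previous sketch, the first difference and the re-run of the ADF test look like this:

```python
# First difference removes the trend; re-run the ADF test on the differenced series.
gold_diff = train.diff().dropna()
stat, p_value, *_ = adfuller(gold_diff)
print("ADF p-value after first differencing:", p_value)
```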

7. Calculating the Moving Average and Finding the Trend, Seasonality, and Residual Components:
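The report shows this step only as plots; a minimal sketch using statsmodels' classical decomposition, assuming an annual cycle in the weekly data (period=52 is an assumption):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Classical decomposition into trend, seasonal, and residual components.
decomposition = seasonal_decompose(train, model="additive", period=52)
decomposition.plot()

# A 52-week rolling (moving) average also highlights the long-run trend.
rolling_mean = train.rolling(window=52).mean()
```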

8. Forecasting Model Selection:

ARIMA (Autoregressive Integrated Moving Average) is a widely used forecasting model for
time series data. It is suitable for data with trends and/or seasonality, which are common
characteristics of many real-world time series datasets

ARIMA models can capture a wide range of temporal dependencies, including autocorrelation,
trend, and seasonality. By adjusting the model parameters, such as the order of autoregression
(p), differencing (d), and moving average (q), ARIMA can accommodate various patterns present
in the data.

Overall, ARIMA is a versatile and effective forecasting model that can capture complex
temporal patterns in time series data.
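A minimal sketch of fitting an ARIMA model to the training series; the (1, 1, 1) order is illustrative, with d=1 matching the first differencing found necessary above (the report does not state the final order used):

```python
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train, order=(1, 1, 1))
fitted = model.fit()
print(fitted.summary())

# Forecast over the length of the test set.
forecast = fitted.forecast(steps=len(test))
```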

9. Model Training and Evaluation:

Train the selected forecasting model using the training data and evaluate its performance using
the testing data. Common evaluation metrics include Mean Absolute Error (MAE), Mean
Squared Error (MSE), and Root Mean Squared Error (RMSE).
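A short sketch of computing these metrics on the ARIMA forecast from the previous step:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(test, forecast)
mse = mean_squared_error(test, forecast)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```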

VI. RESULT DESCRIPTION

1. Linear Regression:
● Training R-squared: 1.0
● Testing R-squared: 1.0
● RMSE: 2.2829534657510607e-13
● Cross-validation RMSE: 2.4273458808647473e-13
2. Lasso Regression:
● Training R-squared: 1.0
● Testing R-squared: 1.0
● RMSE: 2.3371014992128577e-13
● Cross-validation RMSE: 2.4273458808647473e-13
3. Decision Tree Regression:
● Training R-squared: 1.0
● Testing R-squared: 0.9999932907263452
● RMSE: 1.1855782531712218
4. Random Forest Regression:
● Training R-squared: 0.9999980259321656
● Testing R-squared: 0.999994285557632
● RMSE: RMSE Value
5. Gradient Boosting Regression:
● Training R-squared: 0.9999754562135037
● Testing R-squared: 0.9999573703979345
● RMSE: RMSE Value

Each model's performance is assessed based on its R-squared value, which indicates the
proportion of the variance in the dependent variable that is predictable from the independent
variables. Additionally, the Root Mean Squared Error (RMSE) provides a measure of the average
deviation between the predicted and actual values, with lower values indicating better model
performance.

The best model was Random Forest: although it has a slightly lower R-squared and a larger error
than the linear models, it reduces overfitting, unlike Linear Regression.

Overall, the ensemble methods, including Random Forest Regression and Gradient Boosting
Regression, demonstrate superior predictive accuracy compared to the baseline Linear and Lasso
Regression models. These techniques leverage the combined strength of multiple models to
capture complex relationships in the data, leading to enhanced performance and reduced risk of
overfitting.

VII. CONCLUSION & RECOMMENDATION

1. Continuous Monitoring:

Since financial markets are dynamic and subject to various external factors, it's crucial to
continuously monitor model performance and update the model as new data becomes available.
Regular retraining of the model with fresh data ensures that it remains relevant and accurate over
time.

2. Risk Management:

While regression models provide valuable insights into gold price prediction, it's essential to
acknowledge the inherent uncertainty and volatility in financial markets. Users should exercise
caution and consider implementing risk management strategies to mitigate potential losses
associated with model predictions.

3. Consultation with Experts:

For critical decision-making processes, consulting with domain experts or financial analysts can
provide valuable insights and validate the model's predictions. Combining quantitative analysis
with qualitative expertise can lead to more robust and reliable forecasts.

4. Diversification:

Finally, users should consider diversifying their investment portfolio to spread risk across
different asset classes and minimize the impact of unforeseen events or market fluctuations.
While regression models can inform investment decisions, diversification remains a fundamental
principle of risk management in finance.
