Stock Report
CHAPTER 1
INTRODUCTION
In this project, we aim to predict stock prices by using a variety of time series
forecasting techniques. Stock price prediction is a challenging task due to the dynamic and
volatile nature of financial markets. Our aim is to explore and compare the effectiveness of
several models, ranging from classical statistical approaches to advanced machine learning and
deep learning techniques. We begin by applying AR, MA, and ARIMA (Auto-Regressive
Integrated Moving Average) models, traditional time series methods known for their ability to
capture linear patterns. Next, we employ regression and ensemble models such as XGBoost and
Random Forest, which offer flexibility in handling both linear and nonlinear relationships in the data.
To push the boundaries further, we integrate deep learning techniques such as Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term
Memory (LSTM) networks, known for their ability to capture sequential dependencies in time
series data. Through this multi-technique approach, we evaluate the models based on their
predictive accuracy, robustness to market fluctuations, and overall performance in stock price
forecasting. Our findings highlight the advantages and limitations of each method, offering
valuable insights for financial analysts and traders looking to make data-driven investment
decisions.
This section outlines the hardware and software requirements used to develop the stock
price prediction system. These specifications ensure the system is designed to handle data
preprocessing, model training, evaluation, and deployment efficiently.
CHAPTER 2
LITERATURE SURVEY
1. “Profit Prediction Using ARIMA, SARIMA and LSTM Models in Time Series
Forecasting: A Comparison” by Uppala Meena Sirisha, Manjula C Belavagi & Girija
Attigeri.
This paper compares ARIMA, SARIMA, and LSTM models for profit prediction. The study
emphasizes the strengths and weaknesses of these models in time series forecasting,
particularly in terms of predictive accuracy and computational efficiency. LSTM was found to
outperform ARIMA and SARIMA in handling non-linear patterns due to its ability to learn
from sequential data and capture dependencies across time steps.
ARIMA is a statistical approach for time series forecasting that models the relationship
between an observation and its lagged values, including error terms from previous predictions.
It is ideal for non-seasonal data and focuses on making non-stationary data stationary before
applying the model.
Key Components:
• AR (Auto-Regressive) Order (p): Represents the number of lag observations included
in the model.
• I (Integrated) Order (d): Specifies the degree of differencing required to make the
series stationary.
• MA (Moving Average) Order (q): Indicates the number of lagged forecast errors in the
model.
SARIMA enhances ARIMA by incorporating seasonal patterns, making it suitable for data
with periodic fluctuations. It introduces additional parameters to model seasonality.
Additional Parameters:
• Seasonal orders (P, D, Q) and the seasonal period (s), which model the repeating pattern at fixed intervals.
LSTM Architecture:
• LSTM Layer: Stores and updates memory over time steps. It uses gates (input, forget,
and output) to manage the flow of information.
This paper examines the performance of ARIMA, SARIMA, and LSTM models in profit
prediction for time series data.
• ARIMA & SARIMA Models: These statistical models are shown to perform well for
data with linear trends and seasonality. However, they struggle with non-linear patterns
and long-term dependencies.
• LSTM Model: As a neural network-based model, LSTM can capture complex, non-
linear relationships and dependencies over time, making it ideal for time series
forecasting.
• Comparison: The paper concludes that LSTM outperforms ARIMA and SARIMA in
accuracy for datasets with non-linear trends, though it requires more computational
power and tuning expertise.
• Applications: Demonstrates the usability of LSTM in financial forecasting tasks where
traditional statistical models fail to generalize well.
2. “On the Autoregressive Time Series Model Using Real and Complex Analysis”, by
Torsten Ullrich
This paper delves into the global structure of autoregressive models using real and complex
analysis. It highlights that autoregressive models can be understood as combinations of
sinusoidal, exponential, and polynomial functions, aiding in the model selection process. These
insights help avoid unsuccessful modeling attempts by ensuring compatibility with the data's
global properties.
The Autoregressive (AR) model is a fundamental time series analysis method that predicts a
data point based on its past values and a stochastic error term. It’s suitable for modeling short-
term relationships within a stationary time series.
Key Features
• Stationarity Requirement: The series should have constant statistical properties over
time.
• Short-Term Dependency: Captures relationships between recent data points.
This research explores autoregressive (AR) models using both real and complex mathematical
frameworks.
• Global and Local Structure: The authors analyse how AR models derive their short-
term predictions from previous values but often lack explicit global structural insights.
• Complex Analysis: By combining polynomial, sinusoidal, and exponential functions,
the paper highlights how AR models' global behaviour can be better understood.
• Applications: This approach aids in model selection, ensuring that the chosen model
aligns with the data's inherent properties.
• Key Contribution: Offers theoretical insights into how complex analysis can refine AR
models, reducing the risk of poor model selection and providing a mathematical basis
for understanding temporal patterns.
3. “XGBoost: A Scalable Tree Boosting System” by Tianqi Chen & Carlos Guestrin
This paper introduces XGBoost, a highly scalable gradient boosting framework for
machine learning.
Key Innovations:
• A regularized learning objective that controls model complexity and prevents overfitting.
• Sparsity-aware split finding for data with missing or sparse values.
• A weighted quantile sketch for accurate split proposals on weighted data.
• System optimizations such as parallel and distributed computing, cache-aware access, and out-of-core computation.
4. “Stock Price Prediction Using CNN and LSTM-Based Deep Learning Models” by
Sidra Mehtab & Jaydip Sen
This study employs CNN and LSTM models for predicting stock prices of the NIFTY
50 index. It highlights that LSTM-based models excel at capturing temporal dependencies,
while CNN is faster in execution. The encoder-decoder convolutional LSTM model was the
most accurate, while a simple CNN model was the quickest.
The CNN is a type of deep learning algorithm designed to extract features from
structured data like images or tensors.
1. Purpose: CNNs are effective in identifying spatial patterns and features in data. Originally
designed for image recognition, they can be adapted for other domains, such as stock market
prediction, by processing tensors derived from 1D or 2D data.
2. Architecture: A CNN is composed of multiple layers:
• Convolutional Layers: These apply filters to input data to capture patterns. Each filter
focuses on a specific aspect, such as edges or textures in image data or trends in tensors
for time-series data.
• Pooling Layers: These reduce dimensionality and help retain essential features while
reducing computational complexity.
• Flatten Layer: Converts multidimensional data into a 1D vector for input into
subsequent layers, such as dense or fully connected layers.
This study develops deep learning models for predicting stock prices using historical
data from the NIFTY 50 index.
CNN-Based Models:
This research compares deep learning architectures—MLP, RNN, LSTM, and CNN—for
stock market prediction using NSE and NYSE data. CNN outperformed others in accuracy and
generalization, showcasing its ability to learn from shared dynamics between distinct markets.
This study applies deep learning architectures to stock price prediction, comparing models
trained on NSE (India) data.
• Models Used: MLP, RNN, LSTM, and CNN.
This review aggregates findings from multiple studies on AI techniques like neural
networks, reinforcement learning, and ensemble methods in stock market prediction. It
identifies trends, challenges, and future directions, emphasizing AI's potential to predict market
trends with high accuracy through hybrid and ensemble approaches.
This paper synthesizes findings from multiple reviews on AI applications in stock market
prediction.
• Techniques Reviewed:
• Neural networks
• Reinforcement learning
• Hybrid and ensemble models
• Challenges Identified:
• Overfitting in small datasets
• Difficulty in integrating external factors like news and sentiment analysis
• Future Directions: Emphasizes combining AI with advanced data sources (e.g.,
sentiment analysis, macroeconomic indicators) to improve predictive performance.
• Key Takeaway: AI-driven models, particularly hybrid ones, show promise in
improving prediction accuracy and mitigating market volatility.
CHAPTER 3
SYSTEM ANALYSIS
The existing methods for time series analysis and forecasting primarily focus
on creating models based on one of the following three categories:
• Statistical Models: These models typically rely on traditional statistical methods to
capture the underlying patterns in time series data. Common statistical models include
ARIMA (Autoregressive Integrated Moving Average), Exponential Smoothing State
Space Models (ETS), and SARIMA (Seasonal ARIMA). These models are widely used
for tasks like trend and seasonality detection, and for forecasting future values based
on historical data.
• Machine Learning Models: Machine learning methods aim to learn patterns and
relationships within time series data using algorithms that do not require explicitly
defined equations like statistical models. Examples of machine learning models include
Random Forests, Support Vector Machines (SVM), XGBoost, and Gradient Boosting
Machines. These models can handle more complex data and relationships, often
offering superior accuracy over traditional statistical methods, especially when dealing
with large datasets or non-linear relationships.
• Deep Learning Models: Deep learning methods, particularly Recurrent Neural
Networks (RNNs) and Long Short-Term Memory (LSTM) networks, are designed to
work with sequences of data and capture temporal dependencies more effectively.
These models are capable of learning more complex patterns, and they excel in cases
with large amounts of time series data or when the data is non-stationary, meaning its
statistical properties change over time.
While the existing systems utilizing statistical, machine learning, and deep learning models
have been effective, they exhibit certain limitations when applied to specific use cases, such as
stock price prediction for a single stock:
• Lack of Integration: In the current systems, different models are generally applied
independently to time series data. However, these systems do not combine or evaluate
multiple models within the context of a single stock. The absence of such integrated
analysis could potentially lead to missed opportunities in selecting the best-suited
model for prediction, especially in volatile markets like stock prices.
• Model Selection Issues: The existing systems may not provide a clear framework for
comparing the effectiveness of different models on the same dataset, leading to a lack
of understanding of which approach would give the most accurate results. Users may
need to manually test several models, which can be time-consuming and may not
always yield the best solution.
To overcome the limitations mentioned above, the proposed system aims to provide a
comprehensive solution where a single stock is analyzed using all state-of-the-art models
across three domains: Statistical Models, Machine Learning Models, and Deep Learning
Models. The system would compare the results of all these models using a common
performance metric, namely RMSE (Root Mean Square Error), to allow for easy comparison
of their prediction accuracy.
The core features of the proposed system include:
• Comprehensive Model Application: The system applies a range of models from
statistical, machine learning, and deep learning techniques to a single stock's historical
data. This ensures that the most effective model can be identified for the given stock.
• Comparison via RMSE: By comparing the models using RMSE, the system can
objectively evaluate which model performs best, with a lower RMSE indicating better
accuracy (a minimal code sketch of this comparison follows this list).
• User-Centric: Users can choose the model based on their requirements, such as the level
of complexity they desire or the speed of the model.
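As a rough illustration of the RMSE comparison mentioned above, the following sketch assumes hypothetical placeholders (y_test for the test-set targets, arima_preds and lstm_preds for two models' forecasts); it is not the project's exact evaluation code.

import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Root Mean Square Error: lower values indicate better accuracy
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical comparison of two models on the same test set
# results = {"ARIMA": rmse(y_test, arima_preds), "LSTM": rmse(y_test, lstm_preds)}
# best_model = min(results, key=results.get)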
CHAPTER 4
SYSTEM DESIGN
The system architecture is designed to process historical stock market data, train
predictive models, and provide future stock price predictions. The architecture consists of the
following components:
1. Data Ingestion: Historical stock data is fetched from the Yahoo Finance database using the YFinance API.
2. Preprocessing and Exploratory Data Analysis: The data is cleaned and analyzed in Google Colab.
3. Model Building:
• Algorithms: AR, MA, ARIMA, Random Forest, XGBoost, RNN and LSTM for
time series analysis.
4. Deployment: A locally hosted Streamlit-based UI/UX for user interaction.
Figure 1 depicts the Architecture of our System for stock price prediction. The process
begins with fetching stock data from the Yahoo Database using the YFinance API. The
data is processed and analyzed in Google Colab, involving steps such as exploratory data
analysis and data preprocessing. The processed data is then used for model building, and
the resulting model is deployed locally for user interaction through a Streamlit-based
UI/UX.
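A minimal sketch of the data-ingestion step described above (the start date here is an assumption; the actual range used in the project may differ):

import yfinance as yf

# Fetch historical daily data for Reliance from Yahoo Finance
data = yf.download('RELIANCE.NS', start='2015-01-01', end='2024-12-24')
close = data['Close']  # Close price series used for forecasting
print(close.head())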
CHAPTER 5
DATA COLLECTION AND PREPARATION
5.1 DATA COLLECTION
Table 1: Head of the Data
Table 1 and Table 2 represent the first 5 and last 5 rows of Reliance stock data,
respectively, stored in the data frame. These tables provide a snapshot of the initial and
final stock records in the dataset for quick reference.
Missing values in the dataset are handled by replacing them with the average (mean) of the
values in the surrounding data points. The term "nearby" usually refers to adjacent or
neighboring values within the time series or dataset.
• For example, if a particular data point is missing in a time series, the missing value can
be replaced by the average of the previous and next values. In some cases, the mean of
a defined range of values (like the previous few data points) is used to ensure more
contextual consistency.
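A small sketch of this kind of imputation with pandas (the values are hypothetical):

import pandas as pd

# Hypothetical Close-price series with one missing value
s = pd.Series([100.0, 101.5, None, 103.0, 102.5])

# Replace the gap with the mean of the previous and next observations
filled = s.fillna((s.shift(1) + s.shift(-1)) / 2)

# Alternative: mean of the previous few data points (rolling window)
# filled = s.fillna(s.rolling(window=3, min_periods=1).mean())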
4. Potential Limitations
• Not Always Accurate: This approach works well when the data is relatively stable and
has small variations. However, in cases where the data fluctuates greatly or contains
significant outliers, replacing missing values with the mean may introduce errors.
• Doesn't Capture Complex Relationships: This method does not account for complex
relationships or patterns that might exist between different variables. For example, if
the missing value is influenced by multiple other variables, a more sophisticated
imputation method (like using machine learning models or interpolation) might be
needed.
CHAPTER 6
EXPLORATORY DATA ANALYSIS
In time series analysis, the goal is to break down the data into its core components:
Trend, Seasonality, and Residuals. These components help to understand the underlying
patterns in the data, which can be useful for forecasting and model building. By visualizing
these components through decomposition, we can better interpret how each part contributes to
the overall time series.
Exploration of Data for Trends, Seasonality, and Residuals:
In time series analysis, the goal is to decompose data into three components:
1. Trend: The long-term movement or direction in the data (e.g., upward or downward).
2. Seasonality: The regular, repeating patterns over fixed intervals (e.g., daily, weekly,
yearly).
3. Residuals: The random fluctuations or noise remaining after removing trend and
seasonality.
Decomposition breaks down the time series into these components for better analysis and
forecasting. By plotting the decomposed components:
• Trend plot shows the overall direction of the data.
• Seasonality plot reveals recurring cycles or patterns.
• Residual plot helps identify randomness or unexplained variations.
Decomposing the data helps:
• Identify and forecast long-term trends.
• Account for periodic patterns (seasonality).
• Evaluate residuals to check for unexplained noise.
Methods:
• Classical Decomposition (Additive or Multiplicative)
• STL Decomposition (robust method using LOESS)
In essence, this process aids in better understanding the time series and improving forecasting
accuracy.
Figure 3, Figure 4, and Figure 5 present the trend, seasonal, and residual components of the
Close price after classical decomposition, respectively.
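A minimal sketch of classical decomposition with statsmodels (the period of 252 trading days is an assumption for yearly seasonality in daily data):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Classical additive decomposition of the Close price
decomposition = seasonal_decompose(data['Close'], model='additive', period=252)
trend = decomposition.trend        # long-term movement
seasonal = decomposition.seasonal  # repeating pattern
residual = decomposition.resid     # remaining noise

decomposition.plot()
plt.show()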
CHAPTER 7
METHODOLOGY
7.1 PROJECT WORKFLOW
1. Data Collection: Reliance stock data fetched via yfinance.
2. Model Building: statistical, machine learning, and deep learning models.
3. Model Evaluation: RMSE on test data.
4. Output Visualization: visualize errors and output predictions.
AR MODEL
The Autoregressive (AR) model is a fundamental time series analysis method that
predicts a data point based on its past values and a stochastic error term. It’s suitable for
modeling short-term relationships within a stationary time series.
Key Features
• Stationarity Requirement: The series should have constant statistical properties over
time.
Model Properties
The AR model assumes that the time series can be represented as a combination of exponential,
sinusoidal, and polynomial trends, depending on the roots of its characteristic polynomial.
Parameter Estimation
Parameters (φ) are estimated using techniques such as ordinary least squares or maximum
likelihood estimation.
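A minimal sketch of fitting an AR model with statsmodels (the lag order of 5 is illustrative; train and test are assumed train/test splits of the Close series):

from statsmodels.tsa.ar_model import AutoReg

# Fit an autoregressive model on the training series
ar_model = AutoReg(train, lags=5).fit()

# Forecast over the test horizon
ar_forecast = ar_model.predict(start=len(train), end=len(train) + len(test) - 1)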
MA MODEL
The Moving Average (MA) model is another fundamental time series analysis
method that expresses the current value of a time series as a linear combination of past
white noise terms (random shocks). It is particularly suitable for modelling short-term
dependencies and smoothing noise in a stationary time series.
Key Features
• Stationarity Requirement: The series must have constant statistical properties
over time.
• Dependency on Error Terms: Captures the influence of past random shocks on
the current data point.
Model Properties
The MA model assumes that the time series behaviour arises from the interaction of
past stochastic error terms, which can lead to patterns that mimic trends or periodicity,
depending on the weights applied to past errors.
• Fixed Lag Structure: Only considers a finite number of past error terms.
• Uncorrelated Noise: Relies on the assumption that the white noise
terms are uncorrelated and have a mean of zero.
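A minimal sketch of an MA model, expressed as ARIMA with p = 0 and d = 0 (q = 1 is illustrative and would normally be read from the ACF plot):

from statsmodels.tsa.arima.model import ARIMA

# Pure MA(1) model: ARIMA with no AR terms and no differencing
ma_model = ARIMA(train, order=(0, 0, 1)).fit()
ma_forecast = ma_model.predict(start=len(train), end=len(train) + len(test) - 1)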
ARIMA MODEL
ARIMA is a statistical approach for time series forecasting that models the relationship
between an observation and its lagged values, including error terms from previous predictions.
It is ideal for non-seasonal data and focuses on making non-stationary data stationary before
applying the model.
Key Components:
• AR (Auto-Regressive) Order (p): Represents the number of lag observations included
in the model.
• I (Integrated) Order (d): Specifies the degree of differencing required to make the series
stationary.
• MA (Moving Average) Order (q): Indicates the number of lagged forecast errors in the
model.
Methodology:
1. Stationarity Check:
Time series data is tested for stationarity using statistical methods like the Augmented
Dickey-Fuller (ADF) test and KPSS test. Non-stationary series are transformed using
techniques like differencing or logarithmic scaling.
2. Order Selection:
Candidate values of p and q are identified from the ACF and PACF plots of the stationary (differenced) series.
3. Model Fitting:
The ARIMA model is built using statistical methods like maximum likelihood
estimation to minimize errors. After fitting, the data is scaled back to its original form for
interpretation.
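A minimal sketch of this workflow (the (p, d, q) order here is illustrative; the appendix uses (0, 2, 1) for this dataset):

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Step 1: check stationarity with the Augmented Dickey-Fuller test
adf_stat, p_value = adfuller(train)[:2]
d = 0 if p_value < 0.05 else 1  # difference once if non-stationary (simplified rule)

# Steps 2-3: fit ARIMA and forecast the test horizon
arima_model = ARIMA(train, order=(1, d, 1)).fit()
arima_forecast = arima_model.predict(start=len(train), end=len(train) + len(test) - 1)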
Strengths:
• Performs well for data with clear linear trends.
Limitations:
• Struggles with non-linear patterns and long-term dependencies.
SARIMA MODEL
SARIMA enhances ARIMA by incorporating seasonal patterns, making it suitable for
data with periodic fluctuations. It introduces additional parameters to model seasonality.
Additional Parameters:
• Seasonal AR order (P), seasonal differencing (D), seasonal MA order (Q), and the seasonal period (s).
Methodology:
1. Order Selection:
Seasonal patterns are identified using seasonal ACF and PACF plots. P, D, Q, and s values are
determined accordingly.
2. Model Fitting:
The SARIMA model is trained similarly to ARIMA but accommodates both trend and
seasonality.
3. Residual Diagnostics:
The model’s validity is checked through residual analysis to ensure unbiased predictions.
Strengths:
• Effective for time series data with both trend and seasonality.
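A minimal sketch using statsmodels' SARIMAX (the non-seasonal and seasonal orders are illustrative and would normally come from the seasonal ACF/PACF analysis):

from statsmodels.tsa.statespace.sarimax import SARIMAX

# ARIMA(1,1,1) plus a seasonal (1,1,1) component with period s = 12
sarima_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
sarima_forecast = sarima_model.predict(start=len(train), end=len(train) + len(test) - 1)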
VARIMA MODEL
The Vector Autoregressive Integrated Moving Average (VARIMA) model is an
extension of ARIMA designed to handle multivariate time series data, where multiple
interdependent variables are modeled together. It combines the features of ARIMA with
vectorized (multivariate) modeling, making it suitable for systems where variables influence
one another over time.
Key Features
• Multivariate Dependency: Models interrelationships between multiple time series.
• Stationarity Adjustment: Handles non-stationary data using differencing, similar to
ARIMA.
• Lagged Relationships: Incorporates past values of all variables and their error terms.
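A minimal sketch using statsmodels' VARMAX, which the appendix imports; since VARMAX has no built-in integration term, the series are differenced manually here, and both the column choice and the (p, q) order are illustrative:

from statsmodels.tsa.api import VARMAX

# Two interdependent series, differenced once for stationarity
multi = data[['Close', 'Volume']].diff().dropna()

varma_model = VARMAX(multi, order=(1, 1)).fit(disp=False)
varma_forecast = varma_model.forecast(steps=30)  # next 30 steps for both series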
XG-BOOST
XGBoost, or Extreme Gradient Boosting, is a scalable and powerful tree boosting
system widely used in machine learning.
1. Overview: XGBoost is an implementation of gradient tree boosting, a technique that
iteratively builds decision trees to optimize a specified loss function. It combines predictions
from multiple trees to improve accuracy, making it effective for both classification and
regression tasks.
2. Key Features:
• Regularized Learning Objective: XGBoost introduces a regularization term to control
model complexity, preventing overfitting and ensuring smoother weight adjustments
across trees.
• Sparsity Awareness: It handles datasets with missing values or sparse inputs efficiently,
leveraging default directions for handling such entries.
• Weighted Quantile Sketch: This method enables accurate split proposals for datasets
with weighted instances, enhancing its ability to handle large-scale data.
3. Optimization Techniques:
o Parallel and Distributed Computing: XGBoost supports multi-threading and distributed
processing to speed up model training on large datasets.
o Cache-Aware Access: By optimizing memory access patterns, it reduces delays caused by
non-contiguous memory reads.
o Out-of-Core Computation: It processes datasets that do not fit into memory by dividing data
into manageable blocks and using techniques like compression and sharding.
4. Practical Impact: XGBoost has been instrumental in achieving state-of-the-art results in
various machine learning challenges, such as Kaggle competitions and industry applications
like ad click prediction and fraud detection.
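A minimal sketch of applying XGBoost to regression on lagged price features (hyperparameters are illustrative; X_train, y_train and X_test are assumed lagged-feature splits):

from xgboost import XGBRegressor

# Gradient-boosted trees on lagged Close prices
xgb_model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)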
RANDOM FOREST
The Random Forest is an ensemble machine learning model that uses multiple
decision trees to improve prediction accuracy and robustness. It is versatile and suitable
for both classification and regression tasks.
Key Features
• Ensemble Method: Combines multiple decision trees to reduce
overfitting and variance.
• Bootstrap Aggregation (Bagging): Builds trees on random subsets of
data with replacement.
Applications
• Classification: Predicting customer churn, spam detection.
• Regression: House price prediction, stock price forecasting.
• Anomaly Detection: Identifying outliers in datasets.
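A minimal sketch of a Random Forest regressor on the same lagged features (the number of trees is illustrative):

from sklearn.ensemble import RandomForestRegressor

# Bagged ensemble of decision trees for regression
rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)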
ANN ARCHITECTURE
An Artificial Neural Network (ANN) is a machine learning model inspired by the
structure and functioning of the human brain. It learns complex relationships in data through
multiple interconnected layers of neurons.
Key Features
• Non-linearity: Captures non-linear patterns in data.
• Layered Structure: Includes input, hidden, and output layers.
• Learning Mechanism: Adjusts weights using backpropagation and
gradient descent.
Applications
• Image and Speech Recognition: Computer vision, voice assistants.
• Predictive Modelling: Time series prediction, medical diagnosis.
• Classification and Regression Tasks.
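A minimal sketch of such a feed-forward network in Keras, mirroring the model in the appendix (layer sizes are illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Fully connected network on lagged Close-price features
ann = Sequential([
    Dense(64, activation='relu', input_dim=X_train.shape[1]),
    Dense(32, activation='relu'),
    Dense(1)  # single regression output
])
ann.compile(optimizer='adam', loss='mean_squared_error')
ann.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=0)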
CNN ARCHITECTURE
The CNN is a type of deep learning algorithm designed to extract features from
structured data like images or tensors.
CNN Basics:
1. Purpose: CNNs are effective in identifying spatial patterns and features in data. Originally
designed for image recognition, they can be adapted for other domains, such as stock market
prediction, by processing tensors derived from 1D or 2D data.
2. Architecture:
o A CNN is composed of multiple layers:
▪ Convolutional Layers: These apply filters to input data to capture patterns. Each filter focuses
on a specific aspect, such as edges or textures in image data or trends in tensors for time-series
data.
▪ Pooling Layers: These reduce dimensionality and help retain essential features while reducing
computational complexity.
▪ Flatten Layer: Converts multidimensional data into a 1D vector for input into subsequent
layers, such as dense or fully connected layers.
3. Key Components:
o Kernel Size: Determines the size of the filter used in convolution. For example, a kernel size
of 3 means the filter examines a 3x3 grid of data at a time.
o Activation Functions: Functions like ReLU (Rectified Linear Unit) introduce non-linearity,
allowing the network to model complex relationships.
o Feature Extraction: By stacking multiple convolutional and pooling layers, the CNN captures
hierarchical features from data.
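A minimal sketch of a 1D CNN over fixed-length windows of past prices (window length, filter count, and kernel size are illustrative; inputs are assumed to have shape (samples, window, 1)):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

window = 10  # illustrative look-back length

cnn = Sequential([
    Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(window, 1)),
    MaxPooling1D(pool_size=2),  # reduce dimensionality, keep salient features
    Flatten(),                  # 1D vector for the dense layers
    Dense(32, activation='relu'),
    Dense(1)
])
cnn.compile(optimizer='adam', loss='mean_squared_error')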
RNN ARCHITECTURE
A Recurrent Neural Network (RNN) is a specialized type of ANN designed to handle
sequential data. It retains information from previous steps in the sequence through feedback
connections.
Key Features
• Sequential Data Processing: Suitable for time series, natural language
processing, and other sequence-based tasks.
• Memory Mechanism: Maintains a hidden state to capture temporal
dependencies.
• Weight Sharing: Uses the same weights across different time steps.
Applications
• Time Series Analysis: Stock price prediction, weather forecasting.
• Natural Language Processing (NLP): Text generation, sentiment analysis,
machine translation.
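A minimal sketch of a simple RNN on the same windowed inputs (units and window length are illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

window = 10
rnn = Sequential([
    SimpleRNN(50, activation='tanh', input_shape=(window, 1)),  # hidden state carries temporal context
    Dense(1)
])
rnn.compile(optimizer='adam', loss='mean_squared_error')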
LSTM ARCHITECTURE
LSTM is a deep learning approach designed to capture long-term dependencies in
sequential data. It excels in time series forecasting due to its ability to model complex patterns,
trends, and seasonality dynamically.
Architecture:
• LSTM Layer: Stores and updates memory over time steps. It uses gates (input, forget,
and output) to manage the flow of information.
• Dense Layer: Outputs the final prediction.
Methodology:
1. Data Preprocessing:
o Normalize features using MinMaxScaler to scale data into the range [0, 1].
o Reshape the dataset into sequences of input-output pairs (e.g., using a rolling window
approach).
2. Model Definition:
o Define the LSTM layer with neurons and activation functions (e.g., sigmoid or ReLU).
o Use mean squared error as the loss function and Adam optimizer for training.
3. Training and Validation:
o Train the model on a portion of the dataset and evaluate it using validation data.
o The model learns sequential dependencies and adjusts weights dynamically.
Strengths:
o Handles both linear and non-linear dependencies.
o Adapts to varying complexities in the data, making it robust for real-world applications.
Limitations:
o High computational cost and training time.
o Requires careful tuning of hyperparameters like batch size, epochs, and learning rate.
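A minimal sketch of this LSTM workflow (window length and layer sizes are illustrative; series is assumed to be the Close-price series):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Scale prices to [0, 1] and build rolling-window input/output pairs
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(series.values.reshape(-1, 1))

window = 10
X = np.array([scaled[i - window:i, 0] for i in range(window, len(scaled))])
y = scaled[window:, 0]
X = X.reshape(-1, window, 1)

lstm = Sequential([
    LSTM(50, activation='tanh', input_shape=(window, 1)),  # gated memory over time steps
    Dense(1)                                               # final prediction
])
lstm.compile(optimizer='adam', loss='mean_squared_error')
lstm.fit(X, y, epochs=50, batch_size=32, validation_split=0.2, verbose=0)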
7.3 RESULTS
In this section, we will examine the performance of the models on the test data.
AR Model
MA Model
Figures 7 and 8 present the performance of the AR and MA models on the test data
respectively.
ARIMA Model
XGBoost Model
CNN Model
RMSE
Figure 18 represents the root mean square error of all the models for comparison.
STREAMLIT INTERFACE
Figure 21: WebApp displaying the ANN model prediction for the next 30 days.
Figure 20 presents the web interface showing the options to select the Model among
ANN, CNN, RNN and LSTM.
Figure 21 presents the ANN model’s future predictions for the next 30 days.
Figure 23: WebApp displaying the LSTM model prediction for the next 30 days.
CHAPTER 8
TESTING
AR
Figures 24 and 25 present the predictions by the AR and MA models for the next 30
days.
ARIMA
SARIMA
Figures 26 and 27 present the predictions by the ARIMA and SARIMA models for the
next 30 days.
VARIMA
XGBOOST
Figures 28 and 29 present the predictions by the VARIMA and XGBoost models for the
next 30 days.
RANDOM FOREST
ANN
Figures 30 and 31 present the predictions by the Random Forest and ANN models for the
next 30 days.
CNN
RNN
Figures 32 and 33 present the predictions by the CNN and RNN models for the next 30
days.
LSTM
Figure 34 presents the predictions by the LSTM model for the next 30 days.
CHAPTER 9
CONCLUSION
This project offers a comprehensive and robust approach to stock price prediction,
drawing on a diverse range of modeling techniques from statistical, machine learning, and deep
learning paradigms. By leveraging these three methodologies, we were able to assess and
compare the strengths and weaknesses of different models in the context of time series
forecasting. The evaluation of 11 models—5 statistical models (AR, MA, ARIMA, SARIMA,
VARIMA), 2 machine learning models (XGBoost, Random Forest), and 4 deep learning
models (ANN, CNN, RNN, LSTM)—enabled us to comprehensively explore various
forecasting techniques and their applicability in predicting stock prices.
Throughout the course of the project, a variety of factors were considered to assess
model performance, such as handling seasonality, trend extraction, data complexity, and
computational efficiency. The SARIMA model was found to be particularly effective in
capturing seasonality and trends, making it a strong candidate for time series data with clear
periodic patterns. Random Forest, a robust machine learning technique, excelled in handling
high-dimensional feature sets and interactions, providing reliable predictions in the presence
of noisy data. Meanwhile, the CNN, with its ability to capture spatial patterns and local
dependencies, showed great promise in recognizing subtle but important features within the
historical stock price data, making it particularly useful in scenarios with non-linear
relationships.
The integration of these models into a single predictive framework provides a well-
rounded solution that can accommodate different types of data and forecasting requirements.
The insights gained from this analysis have significant implications for stock market
forecasting, where stakeholders can benefit from tailored recommendations on the most
effective models depending on the specific characteristics of the stock data being analyzed.
Whether looking for simple trend-based predictions or complex non-linear relationships, the
project highlights the flexibility and power of combining traditional statistical techniques with
more advanced machine learning and deep learning models.
A key strength of this project is the Streamlit-based frontend that was developed,
enabling stakeholders to easily visualize the predicted stock prices for the next 30 days. This
interactive platform empowers users to make informed, data-driven decisions, as they can
easily interpret predictions and trends, track future market movements, and assess potential
risks. The visualizations provided offer valuable insights that enhance decision-making
processes and add practical value to the overall system.
FUTURE ENHANCEMENTS
The current system can be upgraded to perform real-time analysis and forecasting by
integrating big data technologies. By harnessing tools such as Apache Kafka, Spark Streaming,
and Apache Flink, the system can continuously ingest and process live stock market data. This
would enable the prediction models to update in real-time, providing stakeholders with the
most up-to-date insights. Additionally, leveraging big data platforms like Hadoop or Apache
Hive can help manage and process vast amounts of financial data, enabling more accurate and
timely predictions. The integration of real-time data feeds from financial markets could also
allow for dynamic adaptation of models to account for rapidly changing market conditions,
further improving the predictive power of the system.
Another avenue for improvement lies in combining different modeling techniques to form hybrid
models, which can leverage the strengths of various approaches. For instance, combining the time
series forecasting capabilities of SARIMA with the predictive power of Random Forest or
XGBoost could produce a more robust and accurate model. Hybrid models may take the form of
ensemble learning, where the outputs of several models are combined (e.g., through stacking or
boosting), or neural-ensemble models, where deep learning models such as CNN or LSTM are
fused with traditional statistical or machine learning methods. This approach can reduce individual
model biases and improve the overall prediction accuracy by incorporating complementary
strengths.
The predictive power of the models can also be increased through more advanced feature
engineering. By incorporating complex features such as volatility indices, moving averages,
momentum indicators, and market liquidity measures, the models can account for deeper market
dynamics. Additionally, dimensionality reduction techniques like PCA (Principal Component
Analysis) or t-SNE could be used to extract the most important features from high-dimensional
data, improving model training efficiency and reducing overfitting.
BIBLIOGRAPHY
Books:
1. Time Series Forecasting in Python – Marco Peixeiro
2. Python Machine Learning by Example – Yuxi (Hayden) Liu
3. Deep Learning with Python – Francois Chollet
Research Papers:
[1] “Profit Prediction Using ARIMA, SARIMA and LSTM Models in Time Series Forecasting:
A Comparison” by Uppala Meena Sirisha, Manjula C Belavagi & Girija Attigeri.
[2] “On the Autoregressive Time Series Model Using Real and Complex Analysis”, by Torsten
Ullrich
[3] “XGBoost: A Scalable Tree Boosting System” by Tianqi Chen & Carlos Guestrin
[4] “Stock Price Prediction Using CNN and LSTM-Based Deep Learning Models” by Sidra
Mehtab & Jaydip Sen
[5] “Understanding Random Forests: From Theory to Practice” by Gilles Louppe
Online Resources:
https://companiesmarketcap.com/inr/india/most-profitable-indian-companies/
https://seeve.medium.com/time-series-modeling-13f076d6d6ca
https://medium.com/@batuhansenerr/ai-powered-financial-analysis-multi-agent-systems-transform-data-into-insights-d94e4867d75d
https://towardsdatascience.com/exploring-the-lstm-neural-network-model-for-time-series-8b7685aa8cf
https://pub.towardsai.net/can-machine-learning-outperform-statistical-models-for-time-series-forecasting-73dc2045a9c5
APPENDIX
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.api import VARMAX
from sklearn.metrics import mean_squared_error
# Augmented Dickey-Fuller test for stationarity of the Close series
result = adfuller(data["Close"])
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
# ACF Plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_acf(data["Close"], lags=40, ax=plt.gca())
plt.title('Autocorrelation Function (ACF)')

# PACF Plot
plt.subplot(1, 2, 2)
plot_pacf(data["Close"], lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Function (PACF)')
plt.tight_layout()
plt.show()

# Labels for a separate price plot (fragment of another plotting snippet)
plt.ylabel('Price')
plt.legend()
plt.show()
# ARIMA on the training series; order (p, d, q) = (0, 2, 1)
model = ARIMA(train, order=(0, 2, 1))
model_fit = model.fit()

# Refit on the extended training series and forecast the next 30 steps
model = ARIMA(new_train, order=(0, 2, 1))
model_fit = model.fit()
MA_future_prediction = model_fit.predict(start=len(new_train), end=len(new_train) + 30, dynamic=False)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

y = lagged_data['target']

# Feed-forward ANN on the lagged features
model = Sequential([
    Dense(64, activation='relu', input_dim=X_train.shape[1]),
    Dense(32, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer
])
model.compile(optimizer='adam', loss='mean_squared_error')  # compile before training

%%capture
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
predictions = model.predict(X_test)
Streamlit code:
import pandas as pd
import numpy as np
import yfinance as yf
from tensorflow.keras.models import load_model
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import streamlit as st

# Load data
data = yf.download('RELIANCE.NS', end='2024-12-24')
series = data['Close']

# Prepare data (create_lagged_features is defined elsewhere in the full application)
lagged_data = create_lagged_features(series, lag=3)
X = lagged_data.drop('target', axis=1)
y = lagged_data['target']

# Future predictions (scaler is the fitted MinMaxScaler from the training step)
def predict_future(model, last_data, steps=30):
    future_preds = []
    for _ in range(steps):
        scaled_input = scaler.transform(last_data.reshape(1, -1))
        next_pred = model.predict(scaled_input)
        future_preds.append(next_pred[0, 0])
        last_data = np.append(last_data[1:], next_pred)
    return np.array(future_preds)

# Streamlit UI
st.title("Reliance Stock Price Prediction")

# Model selection (models maps model names to loaded models; defined elsewhere in the app)
model_choice = st.selectbox("Select a Model to View Predictions:", list(models.keys()))