Project Report
Bachelor of Technology
in
Computer Science and Engineering
May 2024
Declaration
We declare that this written submission represents our ideas in our own words and that, where others' ideas or words have been included, we have adequately cited and referenced the original sources. We also declare that we have adhered to all principles of academic honesty and integrity and have not misrepresented, fabricated, or falsified any idea/data/fact/source in our submission. We understand that any violation of the above will be cause for disciplinary action by the University and can also evoke penal action from the sources which have thus not been properly cited or from which proper permission has not been taken when needed.
The plagiarism check report is attached at the end of this document.
Acknowledgement
We would like to express our sincere gratitude to Dr. Anil Kumar Dahiya (Supervisor) for their invaluable guidance, support, and encouragement throughout this research endeavor. Their expertise and insights have been instrumental in shaping the direction of this project. We would also like to thank DIT University for providing the necessary resources and facilities for conducting this research. Additionally, we extend our appreciation to Dr. Rakesh Kumar Pandey, who has contributed to this project in various capacities. Finally, we are grateful to our families, friends, and teachers for their unwavering support and encouragement throughout this journey. Their understanding and encouragement have been a constant source of motivation for us.
Table of Contents
1. Abstract
2. List of Tables
3. List of Figures
4. List of Symbols and Abbreviations
5. Chapters
5.1 Introduction
5.2 Project Research Methodology
5.3 Model Paradigm
5.4 Results and Discussion
5.5 Summary and Conclusions
5.6 Scope for Future Work
6. References
7. Appendix
Abstract
This report presents a comprehensive study on the development of machine learning (ML)
and deep learning (DL) frameworks for trend prediction and optimization of Nifty50 sectoral
indices. The research integrates sentiment analysis derived from social media platforms to
enhance the predictive accuracy of stock market trends. The key objective of a successful
stock market prediction strategy is to not only generate the highest possible returns but also to
minimize inaccuracies in stock price estimations. In trading, utilizing sentiment analysis
helps investors make well-informed choices about where to put their money. However,
forecasting stock prices is a complex task due to their susceptibility to a wide array of
influences, including shifts in investor mood, economic and political landscapes, leadership
transitions, and more. Predictions based solely on past data or textual content tend to be
unreliable. To improve accuracy, there is a growing focus on integrating the sentiment from
news sources with existing stock price information. A deep learning method has been
developed to track the trends of Nifty50 stocks, utilizing data scraped from social media
platforms like Twitter, Facebook, StockTwits, and YouTube. This data was cleaned and
analyzed to obtain subjectivity and polarity scores, reflecting positive, neutral, or negative
sentiments. By integrating these sentiment scores with market data, a novel approach was
formed to predict Nifty50 returns using the deep learning model.
1.1 Background
Over the years, as communication technologies have evolved and high-speed internet access has become more ubiquitous, social media platforms such as blogs, Facebook, and Twitter have gained significant popularity and effectiveness as channels for interaction and information sharing. These platforms have revolutionized how people communicate and collaborate, leading to a proliferation of user-generated content. However, manual analysis of the vast amount of data generated on these platforms has become prohibitively expensive, prompting the development of automated systems such as Sentiment Analysis. Sentiment Analysis swiftly determines the overall sentiment of news stories, providing a valuable tool amidst the growing popularity of these strategies and making it easier to understand evolving trends in the stock market, potentially yielding profitable returns with minimal effort. The evolution of communication technologies has revolutionized the way individuals interact and share information globally. With the widespread adoption of high-speed internet, social media platforms such as Facebook, Twitter, and blogs have become integral parts of daily life. These platforms facilitate real-time communication and collaboration, allowing users to share ideas, opinions, interests, and personal experiences with a diverse global audience.

Sentiment Analysis has emerged as a critical tool for analyzing the vast volume of user-generated content on social media platforms. By swiftly determining the overall sentiment of news stories, social media posts, and online discussions, Sentiment Analysis provides valuable insights into public opinion and sentiment trends. This automated approach has become increasingly indispensable amidst the growing popularity of social media strategies in various domains, including stock market analysis. The integration of social media into stock market analysis has gained traction in recent years as researchers and investors recognize the potential of sentiment analysis for predicting market trends. Social media platforms serve as rich sources of information, offering real-time insights into investor sentiment, market sentiment, and emerging trends. By leveraging sentiment analysis techniques, researchers and investors can gain a deeper understanding of market dynamics and make more informed decisions.
2.2.4 Standardization:
Standardization is a crucial preprocessing step aimed at ensuring uniformity in feature scales,
thereby preventing any particular feature from dominating the modeling process. In this
project, the preprocessed dataset underwent standardization using the StandardScaler. By
scaling the features to have a mean of 0 and a standard deviation of 1, the data was
normalized and brought to a common scale, facilitating more stable and effective model
training. This standardization process ensured that each feature contributed proportionately to
the model's learning process, thereby preventing biases and improving the overall
performance of the machine learning models for trend prediction of Nifty50 stocks.
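For illustration, the sketch below shows how this step can be performed with scikit-learn's StandardScaler; the DataFrame name is assumed rather than taken from the project code.

```python
# Illustrative sketch of the standardization step (variable names are assumed).
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Scale every feature to mean 0 and standard deviation 1."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)  # NumPy array of z-scored columns
    return pd.DataFrame(scaled, columns=features.columns, index=features.index)
```

In practice, the scaler would typically be fit on the training split only and then reused to transform the test split, to avoid information leakage.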
2.2.5 Data Splitting:
Data splitting is a critical step in machine learning model development, enabling the
evaluation of model performance on unseen data. In this project, the preprocessed dataset was
divided into training and testing sets using a common ratio of 70% for training and 30% for
testing. This partitioning facilitated the development of machine learning models on the
training data while allowing for the independent validation of model performance on the
testing data. By separating the dataset into distinct training and testing subsets, the risk of
overfitting was mitigated, ensuring that the models generalized well to unseen data.
Additionally, this approach enabled robust model evaluation and provided insights into the
effectiveness of the developed models for trend prediction of Nifty50 stocks.
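A minimal sketch of such a 70/30 split using scikit-learn is given below; the variable names X and y (features and trend labels) and the random seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

def split_dataset(X, y, test_size=0.30, seed=42):
    """Split features X and trend labels y into 70% training / 30% testing sets."""
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

For time-ordered market data, a chronological split (shuffle=False) may be preferable to avoid look-ahead bias; the sketch reflects the generic 70/30 split described above.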
2.3 Machine Learning
a) Support Vector Machine: SVMs were developed in the 1990s by Vladimir N. Vapnik and his colleagues, who published this work in a paper titled "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing" in 1995. Support Vector Machines (SVMs) play a pivotal role in accurately predicting trends in the dataset, owing to their robustness and adaptability to complex market dynamics. SVMs excel at discerning intricate patterns by effectively separating data points into distinct classes using hyperplanes. This capability allows SVMs to capture both linear and nonlinear relationships, making them versatile for modeling diverse behaviors. The SVM algorithm is widely used in machine learning as it can handle both linear and nonlinear classification tasks. When the data is not linearly separable, kernel functions are used to transform the data into a higher-dimensional space in which linear separation becomes possible. This application of kernel functions is known as the "kernel trick", and the choice of kernel function, such as linear, polynomial, radial basis function (RBF), or sigmoid kernels, depends on the data characteristics and the specific use case. Thus, to separate multi-dimensional data we use a hyperplane. For two-dimensional data that is linearly separable, the separating hyperplane is simply a line, y = ax + b. Renaming x as x1 and y as x2, we get:
a·x1 − x2 + b = 0,
which can be written in vector form as w·x + b = 0, with w = (a, −1) and x = (x1, x2).
Once we have the hyperplane, we can use it to make predictions. We define the hypothesis function h as:
h(x) = +1 if w·x + b ≥ 0, and h(x) = −1 if w·x + b < 0.
A point on or above the hyperplane is classified as class +1, and a point below the hyperplane is classified as class −1.
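To make this concrete, the sketch below fits an SVM classifier with scikit-learn; the RBF kernel and parameter values are illustrative choices, not necessarily the configuration used in this project.

```python
from sklearn.svm import SVC

def train_svm(X_train, y_train, kernel="rbf", C=1.0, gamma="scale"):
    """Fit an SVM classifier; a nonlinear kernel applies the 'kernel trick'
    when the classes are not linearly separable in the original feature space."""
    model = SVC(kernel=kernel, C=C, gamma=gamma)
    model.fit(X_train, y_train)
    return model

# For a fitted model, decision_function(X) returns the signed value of w·x + b,
# whose sign plays the role of the hypothesis h(x); predict(X) returns the labels.
```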
b) Logistic Regression: Logistic regression, pioneered by David Cox, models the relationship
between multiple independent variables and a dependent variable, specifically suited for
situations with binary outcomes and continuous predictors. Unlike traditional regression
methods, logistic regression is adept at classifying observations into distinct categories,
relaxing assumptions like normality of independent variables and absence of multicollinearity
[33], [34]. The logistic regression equation is represented as:
P(Y = 1 | x) = 1 / (1 + e^(−(w·x + b)))    (5)
Training a Logistic Regression model involves the estimation of parameters w and b from the
training data. This estimation typically revolves around maximizing the likelihood of
observing the training labels given the input features [35]. Formally, this is expressed as
maximizing the likelihood function
L(w, b) = ∏_(i=1)^N P(Y = y_i | x_i),
where N represents the number of training samples, y_i denotes the true label of the i-th training sample, and x_i signifies the feature vector of the i-th training sample.
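A minimal scikit-learn sketch is shown below; fitting internally maximizes the (regularized) log-likelihood described above, and the solver and regularization strength shown are illustrative.

```python
from sklearn.linear_model import LogisticRegression

def train_logistic_regression(X_train, y_train):
    """Fit a logistic regression model P(Y=1|x) = 1 / (1 + exp(-(w·x + b)))."""
    model = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
    model.fit(X_train, y_train)  # maximizes the regularized log-likelihood
    return model

# predict_proba() returns the estimated P(Y=1|x); predict() thresholds it at 0.5.
```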
c) Random Forest Classifier: Random Forest, introduced by Leo Breiman and Adele Cutler in 2001, is a powerful ensemble learning method widely employed for classification and regression tasks. Its foundation lies in the construction of multiple decision trees during training and the aggregation of their predictions through voting or averaging, resulting in robust and accurate predictions. It creates different training subsets by sampling the training data with replacement, and the final output is based on majority voting. This parallel, bagging-based approach contrasts with boosting methods such as AdaBoost and XGBoost, which combine weak learners into a strong learner by building sequential models so that the final model has the highest accuracy. Steps involved in the Random Forest algorithm:
Step 1: In the Random forest model, a subset of data points and a subset of features is
selected for constructing each decision tree. Simply put, n random records and m features are
taken from the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output is obtained by majority voting for classification or by averaging for regression.
By combining the predictions of multiple trees, Random Forests can capture complex
relationships between input variables and the target variable, making them suitable for
capturing nonlinear dynamics [40], [41]. Hyperparameters are used in random forests either to enhance the performance and predictive power of the models or to make training faster. The hyperparameters used by the random forest classifier to increase predictive power are n_estimators, max_features, min_samples_leaf, criterion, and max_leaf_nodes. To speed up training, n_jobs is used, while random_state and oob_score control reproducibility and out-of-bag validation.
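The sketch below shows how these hyperparameters map onto scikit-learn's RandomForestClassifier; the specific values are placeholders rather than the tuned values used in this work.

```python
from sklearn.ensemble import RandomForestClassifier

def train_random_forest(X_train, y_train):
    """Fit a Random Forest with the hyperparameters discussed above."""
    model = RandomForestClassifier(
        n_estimators=200,      # number of trees (predictive power)
        max_features="sqrt",   # features considered at each split
        min_samples_leaf=2,    # minimum samples required at a leaf
        criterion="gini",      # split-quality measure
        max_leaf_nodes=None,   # optional cap on leaves per tree
        n_jobs=-1,             # parallel training for speed
        oob_score=True,        # out-of-bag estimate of generalization
        random_state=42,       # reproducibility
    )
    model.fit(X_train, y_train)
    return model
```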
d) Gradient Boosting Classifier: Gradient boosting builds the model in stages, updating the prediction at each iteration as
ŷ_m(x) = ŷ_(m−1)(x) + λ · h_m(x),
where ŷ_m(x) represents the predicted value at iteration m for input x, ŷ_(m−1)(x) is the prediction from the previous iteration, h_m(x) is the weak learner (e.g., a decision tree) trained to fit the residuals, and λ is the learning rate, controlling the step size in the gradient descent process [45].
There are two main implementations of Gradient Boosting Machines: XGBoost and LightGBM. The hyperparameters used in the gradient boosting classifier here are learning_rate, n_estimators, subsample, and max_depth; using these hyperparameters, we aimed to achieve maximum accuracy with LightGBM [46], [47].
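For illustration, a minimal LightGBM configuration using these hyperparameters is sketched below; the values shown are placeholders rather than the tuned settings.

```python
from lightgbm import LGBMClassifier

def train_lightgbm(X_train, y_train):
    """Fit a LightGBM gradient boosting classifier with the hyperparameters above."""
    model = LGBMClassifier(
        learning_rate=0.1,    # shrinkage (lambda) applied to each weak learner
        n_estimators=300,     # number of boosting iterations
        subsample=0.8,        # fraction of rows sampled per iteration...
        subsample_freq=1,     # ...enabled by performing bagging every iteration
        max_depth=6,          # depth limit for each tree
        random_state=42,
    )
    model.fit(X_train, y_train)
    return model
```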
Grey Wolf Optimization (GWO)
2) Encircling prey: The position of a grey wolf is updated around the prey as
X(t+1) = X_p(t) − A · D    (9)
where t indicates the current iteration, A and C are coefficient vectors, X_p is the position vector of the prey, and X indicates the position vector of a grey wolf. The vectors A and C are calculated as follows:
A = 2a · r_1 − a    (10)
C = 2 · r_2    (11)
where r_1 and r_2 are random vectors in [0, 1] and the components of a decrease linearly from 2 to 0 over the course of the iterations.
3) Hunting: In order to mathematically simulate the hunting behavior of grey wolves, we suppose that the alpha (best candidate solution), beta, and delta have better knowledge about the potential location of prey. Hence, we store the top three solutions attained thus far and direct the remaining search agents, including the omegas, to adjust their positions according to the following formulas:
D_α = |C_1 · X_α − X|,  D_β = |C_2 · X_β − X|,  D_δ = |C_3 · X_δ − X|    (12)
X_1 = X_α − A_1 · D_α,  X_2 = X_β − A_2 · D_β,  X_3 = X_δ − A_3 · D_δ    (13)
X(t+1) = (X_1 + X_2 + X_3) / 3    (14)
5) Searching for prey (exploration): In the GWO algorithm, randomness is introduced through parameters like |A| and |C| to encourage divergence among search agents, promoting global exploration. |A| facilitates exploration, while |C| provides random weights to influence prey factors, aiding in avoiding local optima. Unlike |A|, |C| maintains randomness throughout optimization, preventing stagnation in local optima. GWO starts with a random population of wolves, iteratively estimating prey positions through the alpha, beta, and delta wolves
while adjusting their distances accordingly.

Model Paradigm
The framework proposed in this study, as depicted in Fig. 1, outlines a schematic flow diagram for predicting trends in the Nifty50 indices. The input features and output labels undergo preprocessing before being fed into the predictive models. A notable preprocessing step addresses multicollinearity: features whose pairwise correlation exceeds a threshold of 0.75 are averaged into a new column and treated as a single new feature. Incorporating Grey Wolf Optimization (GWO) for hyperparameter tuning significantly enhances the predictive performance of our models. Hyperparameter tuning is vital for optimizing machine learning and deep learning models, aiming to identify the optimal set of hyperparameters that maximizes performance metrics such as accuracy, precision, or recall. Manual hyperparameter tuning is often time-consuming and requires domain expertise. By leveraging metaheuristic algorithms like GWO, however, we automate the exploration of the hyperparameter space efficiently, mitigating the challenges associated with manual tuning.
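To illustrate how such metaheuristic tuning can be wired up, the sketch below implements a simplified GWO loop that tunes an SVM's C and gamma by maximizing cross-validated accuracy; the population size, iteration count, search bounds, and choice of model are assumptions for illustration, not the exact configuration of this project.

```python
# Simplified, illustrative GWO-based hyperparameter tuning (assumed setup).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(position, X, y):
    """Mean cross-validated accuracy for a candidate (log10 C, log10 gamma)."""
    model = SVC(kernel="rbf", C=10 ** position[0], gamma=10 ** position[1])
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

def gwo_tune(X, y, n_wolves=8, n_iter=20, bounds=((-2, 3), (-4, 1)), seed=42):
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    dim = len(bounds)

    # Random initial wolf positions inside the search bounds.
    wolves = rng.uniform(lo, hi, size=(n_wolves, dim))
    scores = np.array([fitness(w, X, y) for w in wolves])
    best_pos, best_score = wolves[np.argmax(scores)].copy(), scores.max()

    for t in range(n_iter):
        # Alpha, beta, delta: the three best solutions in the current population.
        order = np.argsort(scores)[::-1]
        alpha, beta, delta = (wolves[j].copy() for j in order[:3])

        a = 2 - 2 * t / n_iter  # 'a' decreases linearly from 2 to 0
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                    # Eq. (10)
                C = 2 * r2                            # Eq. (11)
                D = np.abs(C * leader - wolves[i])    # distance to the leader
                new_pos += (leader - A * D) / 3.0     # average of X_1, X_2, X_3
            wolves[i] = np.clip(new_pos, lo, hi)
            scores[i] = fitness(wolves[i], X, y)

        if scores.max() > best_score:                 # track the best-ever alpha
            best_score = scores.max()
            best_pos = wolves[np.argmax(scores)].copy()

    return {"C": 10 ** best_pos[0], "gamma": 10 ** best_pos[1],
            "cv_accuracy": best_score}
```

Calling gwo_tune(X_train, y_train) returns the best hyperparameters found together with their cross-validated accuracy.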
Fig. 1 Flow diagram of the proposed framework

Fig. 2 Confusion matrix for total data: (a) SVM without GWO, and (b) SVM with GWO
The confusion matrices shown in Fig. 2 serve as visual representations of the Support Vector Machine (SVM) model's performance, contrasting scenarios with and without the Grey Wolf Optimization (GWO) technique. The first matrix depicts SVM's classification outcomes, offering insights into its accuracy and predictive capabilities. The second matrix reflects the model's performance after GWO integration, showcasing potential improvements in classification accuracy.

Fig. 3 Confusion matrix for total data: (a) LR without GWO, and (b) LR with GWO
The confusion matrices shown in Fig. 3 for Logistic Regression depict the model's classification results before and after integrating the Grey Wolf Optimization (GWO) method. The first matrix showcases the model's classification accuracy and misclassifications. After GWO implementation, the second matrix illustrates minimal improvement in accuracy, with the model's performance increasing by less than 1%. Despite the slight enhancement, the matrices offer valuable insight into the impact of GWO on Logistic Regression's predictive capabilities, underscoring the need for further optimization strategies to achieve more substantial performance gains.

Fig. 4 Confusion matrix for total data: (a) RFC without GWO, and (b) RFC with GWO
The confusion matrices shown in Fig. 4 for Random Forest represent the classification outcomes before and after the integration of Grey Wolf Optimization (GWO). The model achieved an impressive accuracy of approximately 99.4% without GWO, as depicted in the first matrix. Upon implementing GWO, there was a marginal improvement, with the model's accuracy rising to around 99.6%, as evidenced in the second matrix.

Fig. 5 Confusion matrix for total data: (a) GBC without GWO, and (b) GBC with GWO
The evolution of the Gradient Boosting Classifier (GBC) can be observed by comparing its confusion matrices before and after the incorporation of Grey Wolf Optimization (GWO). Initially, the GBC exhibited a modest accuracy of approximately 79.2456%. Upon implementing GWO, the accuracy rose to 100%. This stark improvement underscores the significant impact of GWO in enhancing the performance of the Gradient Boosting Classifier, leading to a perfect predictive accuracy.
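For reference, the sketch below shows how confusion matrices and accuracy values like those reported in Figs. 2-5 can be computed with scikit-learn; the fitted model and test split names are assumed from the earlier steps.

```python
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix

def evaluate(model, X_test, y_test):
    """Return the confusion matrix and accuracy of a fitted classifier."""
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    ConfusionMatrixDisplay(cm).plot()  # renders a matrix like those in Figs. 2-5
    return cm, acc
```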