
Volume 10, Issue 2, February – 2025 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/25feb804

Sentiment Prediction for Market Volatility

Niraj Patel¹
¹St. Clair College

Publication Date: 2025/04/16

Abstract: This project presents an automated framework for generating sentiment metrics from SEC 10-K filings, aiming
to predict stock market returns and volatility at the sector, portfolio, and firm levels. The system comprises two core models:
an SEC Filing Extraction Model, which preprocesses filings, and a Supervised Lexicon Learning Model, which analyzes
sentiment using a four-step process. This includes identifying sentiment-related words, assigning predictive weights,
aggregating sentiment scores, and applying the Kalman Filter for trend analysis. Empirical results demonstrate the
effectiveness of sentiment metrics from 10-K filings, particularly the Item 1A risk factor section, in forecasting market
movements.

How to Cite: Niraj Patel. (2025). Sentiment Prediction for Market Volatility. International Journal of Innovative Science and
Research Technology, 10(2), 2531-2555. https://doi.org/10.38124/ijisrt/25feb804

I. INTRODUCTION

• Motivation
This is the age of artificial intelligence and big data. With advances in computational power, large amounts of varied data, such as text, video, and audio, have been used for scientific analysis. Among these forms of data, textual data has attracted attention fastest in the social sciences. The numerical representation of text for statistical analysis is extremely high-dimensional, so an empirical study that seeks to exploit the richness of text must face the challenges of dimensionality. Machine learning can be employed to extract richer meaning from textual data for predictive analysis in a high-dimensional data environment [1].

In finance, textual data is commonly employed for predicting market movements [1], [2], [3], [4], [5], [6]. In stock prediction, textual analysis of market sentiment has shown notable success. News data has been used to analyse sentiment for predicting short-term stock price movements [3]. Similarly, social media text has been used by integrating social media sentiment with AI [5]. Annual report data has also been used for stock market forecasting [6]. Many types of textual data have thus been used to predict market movements. What type of textual data, then, is informative for this purpose? If you want to invest in a public company in the United States, where can you begin your investment journey?

There are myriad ways to begin investing in a public company. However, what if you know nothing, or only a little, about the firm in which you would invest? Then you should first get to know the firm properly. Where can you find reliable and trustworthy information about a firm? A rich deposit of reliable knowledge can be found in the Form 10-K filing. The Form 10-K filing has been mandated by the Securities and Exchange Commission (SEC) since 1934. It originates in Section 13 or 15(d) of the Securities Exchange Act of 1934. The 10-K is a comprehensive official document offering a thorough overview of a company’s business, its potential challenges, and its financial performance through the fiscal year. In the 10-K, a company’s leadership provides its perspective on the business outcomes and the factors influencing them [7]. Furthermore, several studies found that 10-K filings have predictive power for stock prices [8], [9], [10], [11], [12]. [8] found some evidence that the market reacted to 10-K filings in a statistically significant way. [10], [9] showed a correlation between the complexity of 10-K filings and stock price volatility. [11], [12] found a positive correlation between 10-K filings and stock prices using cutting-edge AI methodologies. Hence, investors should pay attention to 10-K filings.

Since the beginning of 2005, the SEC has required a new section in all firms’ annual filings. The section, called “Section 1A of the Annual Report on Form 10-K”, discusses “the most significant factors that make the company speculative or risky” [13]. Prior to this change, companies were only obligated to provide this information in their registration statements when issuing equity or debt securities. Some companies also voluntarily offered risk disclosures in the section called “Management’s Discussion and Analysis of Financial Condition and Results of Operations (MD&A)”.

Opponents of the new disclosure requirements argue that risk factor disclosures are unlikely to offer valuable information. First, risk disclosure can be biased. Managers might resist disclosing negative information because of their business or career incentives [14], [15], [16], [17], [18]. Second, overconfident managers may perceive less risk, or may have the illusion that they can effectively manage the risks confronting their firms [19]. Third, managers tended to disclose all possible
risks and uncertainties without making precise predictions or providing detailed financial assessments. This practice stems from the fact that companies were not required to estimate the probability that a disclosed risk would actually materialise. Furthermore, firms had no obligation to specify the financial impact that a disclosed risk might have on their present or future financial statements [20]. Since 2010, the SEC has warned companies to “avoid generic risk factor disclosure that could apply to any company” [21] and has continuously pushed for precise risk factor disclosures through the comment letter process [22]. Recently, the SEC has been demanding the explicit and specific inclusion of both cybersecurity and climate change risk factor disclosures [23], [24].

Notably, numerous studies reach the opposite conclusion, providing evidence that risk factor disclosures do in fact provide valuable information [25], [26], [27], [28], [29], [20], [19]. The disclosures reflect the genuine risks confronting firms. They may also help investors assess the volatility of a firm’s cash flows, and tax-related risk factor disclosures offer details about the level of a firm’s future cash flows, helping investors incorporate this information into current stock prices [25]. The newly created risk factor disclosures also correlate with conventional asset pricing risk factors, indicating that the disclosed factors are valuable for assessing overall risk [26]. Cybersecurity risk factor disclosures increase the risk of a company’s stock price declining in the future [27]. Furthermore, the length of the disclosure is associated with market reaction. Lengthier risk factor disclosures correlate negatively with market reactions [25]. More detailed disclosures tend to generate more pronounced market reactions [28]. Changes in the length of risk disclosures can also influence an investor’s risk perceptions [29]. Overall, although disclosures are occasionally seen as generic [20] or susceptible to bias [19], they still provide risk-related information that can assist investors and affect stock values.

• Project Objective
Given the informational value of 10-K filings, as introduced in the motivation section, our project aims to achieve the following main goal:

• Develop an automated pipeline that generates sentiment scores from the 10-K reports of firms in the technology industry for predicting a firm’s market movement, i.e., return and volatility.

The following are the sub-steps for the main goal:

• Extract 10-K filings, then extract the risk factors from the extracted 10-K filings. The list of 10-K filings was based on the Invesco QQQ Trust Series 1 (QQQ).
Our pipeline will collect annual disclosures from the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system and extract the risk factor section of each report. It will also be able to automatically collect the latest 10-K filings for prompt generation of sentiment information.

• Generate Various Sentiment Scores from the Extracted 10-K Reports.
Our study aims to generate sentiment metrics for predicting a firm’s market movement, i.e., return and volatility, from both the entire 10-K report and the risk factor disclosures. This is the main goal of our project. Recently, a new model has been proposed to generate strong indicators for predicting price reactions to new information. This model [1] generated a powerful return-predictive signal from a supervised sentiment text model trained on news data. Our methodology for generating sentiment scores follows the methods suggested in [1], but extends them. [1] employed only news articles to generate return-predictive signals, whereas our study uses 10-K reports, with a particular focus on the risk factor section, to generate volatility-predictive as well as return-predictive signals. Furthermore, the predictive sentiment signals are generated at three stakeholder levels: a sector, a portfolio, and an individual firm. As far as we are aware, no research applies supervised sentiment learning to 10-K reports for predicting volatility, nor does any offer a comparative study of pre-established and learned sentiments using 10-K reports for return or volatility prediction.

• Build an Automated Pipeline on the Airflow Framework.
10-K filings are released annually, but publication dates differ from firm to firm. Among the 100 firms listed in the QQQ for our study, for instance, some firm’s 10-K is released almost every single day of the year. Therefore, to offer prompt sentiment information, our system should update the latest 10-K filings daily or at least monthly.

• Evaluate Sentiment Scores in the Context of Prediction for Returns and Volatility.
After achieving our main goal, we will evaluate the generated sentiment scores quantitatively and qualitatively. For the quantitative analysis, we will use the Pearson correlation (see Appendix E for the formula) to measure the correlation between the metrics we generated. For the qualitative analysis, we will use the top 15 most influential words extracted during sentiment score generation. The most impactful words will be used to infer a topic that may affect sentiment.

• Contribution
We have completed the main goal, including all sub-goals, set in the Project Objective section.

Our contributions are as follows:

• Developed an Automated System for Generating Sentiment Analysis Metrics, Comprising Two Main Models:

• The 10-K Filing Extraction Model, which automatically retrieves 10-K filings from the Electronic Data Gathering,
Analysis, and Retrieval (EDGAR) system, including the extraction of the Item 1A risk factor disclosure section. This data feeds into the Sentiment Score Prediction Model. Referenced in [30].
• The Sentiment Score Prediction Model, which evaluates sentiment across three levels of stakeholders: sector (e.g., Technology), portfolio (e.g., top 10 firms), and individual company (e.g., Nvidia). Referenced in [1]. Unlike the model in [1], our model demonstrates results across three levels of stakeholders to maximise the model’s applicability.

• Formulated 12 Sentiment Metrics across these Three Levels:

• Sector Level: Developed sector sentiment metrics from 10-K filings, with labels based on QQQ’s return and volatility.
• Portfolio Level: Formulated portfolio sentiment metrics from 10-K filings, with labels based on portfolio returns and volatility.
• Company Level: Established company-specific sentiment metrics from 10-K filings, with labels based on company returns and volatility.
• Generated sentiment scores for 10-K/Item 1A using the methodology in [1]. This approach is novel, as no prior study has applied this methodology to 10-K filings or Item 1A risk factors.
• Utilised return and volatility to train sentiment scores, which is a unique approach. No other study, including [1], has used volatility to train sentiment signals. Our model produced both volatility-predictive and return-predictive sentiment signals.
• Assessed the performance of the Sentiment Score Prediction Model.
• Conducted both quantitative (correlation analysis) and qualitative (most influential words for return/volatility) analysis to critically evaluate the derived sentiment metrics. These evaluations help investors improve their understanding of market or firm fundamentals.
• Connected to the Airflow server for automated updates. Our system automatically updates the latest 10-K filings, facilitating seamless data preparation and providing the latest sentiment scores for the three stakeholder levels. Investors can receive prompt sentiment information for a sector, a portfolio, and a firm. A detailed explanation is available in Appendix H.

• Outline of the Project
The thesis structure is outlined below:

• Chapter 1 - Introduction provides the motivation for this study, including the study objectives and contributions.
• Chapter 2 - Related Works reviews various textual sentiment analysis approaches and the relevant background information.
• Chapter 3 - 10-K Filings Extraction Model explains the 10-K filing acquisition process and the detailed extraction process for the Item 1A risk factor section.
• Chapter 4 - Methodology gives a comprehensive explanation of the sentiment score prediction model.
• Chapter 5 - Experiment Results critically evaluates all generated sentiment metrics through quantitative analysis (i.e., correlation analysis) and qualitative analysis (i.e., the most influential word set).
• Chapter 6 - Conclusion summarises the results of our suggested model, offers guidance to investors on using the model, and discusses limitations and future work.

II. RELATED WORKS

• Textual Sentiment Analysis in Finance
In the field of finance, sentiment analysis has grown in significance due to its wide range of applications. Through sentiment analysis, we can analyse, interpret, and derive insights from large volumes of financial data. One popular use case of sentiment analysis is stock market movement prediction [2], [3], [4], [6], [5], [31], [32]. In this context, two types of sentiment have been researched: investor sentiment and textual sentiment. Investor sentiment refers to an investor’s subjective sentiment about firms or markets. Textual sentiment, or text-based sentiment, indicates the level of positive or negative sentiment in texts. In some studies [1], [33], for instance, the term ‘tone’ (i.e., positive or negative) in a corporate disclosure means sentiment. Investor sentiment and textual sentiment are fundamentally different. Investor sentiment encompasses the subjective assessments and behavioural traits of investors. In contrast, textual sentiment may incorporate investor sentiment but also covers a more objective representation of the state of companies, institutions, and markets [34]. In textual sentiment analysis, various sources have been used, such as news articles, social media data, and corporate disclosures. News articles and social media data are used to analyse the short-term effects of sentiment on market variables such as return, volatility, and stock volume [3], [5]. Studies of corporate disclosures usually focus on the relationship between sentiment and firm performance [35], [36]. In this paper, we focused on text-based sentiment from a corporate disclosure, analysing its impact not only on firm performance but also on portfolios and the market.

Furthermore, many previous studies that extracted textual sentiment relied on pre-defined sentiment dictionaries rather than statistical text analysis. A pre-defined sentiment dictionary is created through the dictionary-based approach, which uses a mapping algorithm through which a computer programme reads text and categorises words, phrases, or sentences into predefined categories [37]. There are several major pre-defined sentiment dictionaries for textual sentiment analysis in finance. The first is the General Inquirer (GI), or Harvard IV-4 dictionary (HIV4). The second is the DICTION dictionary. Both have been widely used for financial analysis [38], [39], [40], [41], [42], [43]. However, both GI/Harvard and DICTION, being general English dictionaries, are inaccurate in a financial context. To overcome this issue, Loughran and McDonald created the LM dictionary, which is specific to the finance domain. Since the LM dictionary was developed for 10-K sentiment assessment, we used this dictionary as our benchmark to
evaluate our estimated sentiments. The LM dictionary has been used by many researchers [44], [45], [46], [47], [48].

• Textual Sentiment Analysis with AI in Finance
With the increase in computing power and the development of cutting-edge AI methodologies, AI technology has been applied to textual sentiment analysis in finance. In classical machine learning, some previous papers used off-the-shelf techniques such as the Support Vector Machine (SVM), Naïve Bayes, Decision Trees, or artificial neural networks (ANN) to control the curse of dimensionality in textual sentiment analysis in a finance context [49], [37], [50], [51], [52]. For instance, [52] extracted sentiments from social media text (e.g., StockTwits) using a variety of machine-learning binary classifiers. They found that the SVM classifier achieved higher accuracy than Decision Trees and Naïve Bayes. However, classical machine-learning techniques could not capture a sentence’s complex features and contextual information. These tasks require deep-learning techniques, which facilitate understanding sequential information, complex feature extraction, and location identification [53].

Deep learning, a branch of machine learning, employs multi-layered neural networks, called deep neural networks, to extract complex features [54]. In textual sentiment analysis, deep learning can be useful for learning patterns and contextual information in a sentence [55]. Many studies have shown the effectiveness of deep learning models, such as recurrent neural networks (RNN) [56], [57], convolutional neural networks (CNN) [58], [59], [60], and attention mechanisms [53], [61], for sentiment analysis in finance. Recently, transformer architectures such as BERT or RoBERTa have shown superior performance in sentiment analysis in the finance domain [62]. However, this superior performance comes at a cost. Transformers demand extensive data and computational power for training and testing. Moreover, they require considerable prediction times, rendering them less viable for real-time applications or environments with constrained processing capabilities [63]. The transformer is also perceived as a ‘black box’ because its complex internal mechanisms are hard to interpret [64].

Recently, [1] suggested an interesting and innovative sentiment model with good performance. The suggested model is a lexicon learning model that finds the correlation between the sentiment of firm-specific news and returns. This model has five virtues. First, it is a transparent and simple supervised lexicon learning model. It needs only basic econometric methods, such as correlation analysis and maximum likelihood estimation. Hence, this approach is entirely ‘white box’. Secondly, the model requires minimal computing resources. It takes only a matter of minutes to handle millions of documents on a laptop computer. Thirdly, the model is broadly scalable. Unlike existing lexicon-based models that depend on a pre-existing sentiment dictionary, it can use various types of textual data in the finance domain without a pre-defined dictionary. Fourthly, this supervised model does not require manual labelling to train the sentiment model, so it avoids the considerable expense of manual labelling labour. With minimal, reliable, and clever assumptions, the labelling mechanism works as an automated process. Finally, because it is trained on the returns of the sampled companies to analyse sentiment, this method is likely more effective at predicting stock returns than others.

In this paper, we gained access to the model’s other four benefits by leveraging its scalability. In other words, unlike the model in [1], which used news articles to generate return-predictive sentiment signals, we used informative 10-K filings (plus the Item 1A risk factor section) to produce volatility-predictive as well as return-predictive sentiment signals. No prior study applied the model’s methodology to 10-K filings and the risk factor section. Moreover, the original methodology did not use volatility to train sentiment signals. We also broadened the range of model application levels from a portfolio level to a sector level and a firm level to maximise the model’s applicability.

III. 10-K FILINGS EXTRACTION MODEL

In this research, we utilised textual data for financial analysis. Among the myriad kinds of textual data available, we selected Form 10-K filings, which contain some of the richest information about firms. We collected the entire 10-K filings for predicting sentiment scores, and also collected the Item 1A risk factor section of each filing to extract more informative features. Form 10-K filings are the official documents that all publicly traded firms in the United States are required to submit to the SEC. The SEC enforces very strict rules regarding the content and structure of the information required in Form 10-K filings. These filings contain no pictures or charts. A well-structured 10-K is divided into five separate parts. The first three parts offer a concise summary of the firm’s primary business activities, including its services and products; enumerate every risk encountered by the firm; and provide detailed financial information about the firm over the last five years. The fourth part delivers senior management’s analysis of the firm’s financial results. The final part includes the actual financial figures: the firm’s audited financial statements, which consist of the income statement, balance sheets, and statement of cash flows. A detailed description of the 10-K structure can be found in Appendix A. Hence, Form 10-K filings exhibit uniform structures. Thanks to these organised structures, we can algorithmically extract information from the filings [65], [30].

A. 10-K Filings Collection
Form 10-K filings can officially be found on the SEC website, where they are available to the public. The SEC provides 10-K filings in HTML or TXT formats through its database system, known as the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. This system offers various official filings, including 10-K, 10-Q, 8-K, and 6-K, among others. For our research, we were able to collect most of the 10-K filings of firms in the technology sector from EDGAR. EDGAR offers files in both TXT and HTML formats, but since most of the 10-K reports were easily accessible in HTML format, our algorithms focused exclusively on extracting HTML files. For instance, EDGAR
has 8140 filings in HTML format for all firms in the S&P 500, whereas only 48 filings are in TXT format [30]. We used the portfolio of Invesco QQQ Trust Series 1, an exchange-traded fund (QQQ or QQQ ETF), to compile a list of tech firms representing the technology sector in the United States. The QQQ is designed to passively track the Nasdaq 100 Index, which conventionally represents the technology sector. More details of the QQQ portfolio can be found in Appendix C.

Form 10-K filings can be easily collected by using a Central Index Key (CIK) in EDGAR. A CIK is a number given to an individual or company by the SEC and is used to identify the ownership of a filing. Initially, we collected the list of CIKs of the firms listed in QQQ to facilitate data retrieval through the RSS feeds provided by EDGAR. It is important to respect the limit imposed on crawlers, a maximum request rate of ten requests per second, which ensures equitable access. In this report, we specifically focused on Item 1A, the risk factor section, as well as the entirety of the 10-K filing. As the mandatory disclosure of Item 1A as a separate section of the filing has been required since 2005, our crawler collected 10-K forms spanning nearly 17 years, from January 2006 to December 2023, for every firm listed in QQQ. In total, we collected 1383 filings, primarily stored in HTML. You can find an example case in Appendix B.

B. Risk Factor Extraction Process
In this study, our focus was on Item 1A, the Risk Factor section, to attain richer information. Initially, we extracted all contents from the Item 1A Risk Factor section. We separately extracted each risk heading along with its corresponding content and then aggregated these elements. To extract only the risk factor section, we relied on the well-organised structure of the 10-K. You can find an example case in Appendix B.

Form 10-K filings have well-structured formats in HTML. We observed the following regularities in the structure of the 10-Ks:

• The heading of a risk is followed by the explanation of the risk (see the example in Appendix B).
• There is a summary of the Risk Factor section at the beginning of the section.
• Non-informative elements are located in the footers, including page numbers, copyright information, or disclaimers.
• Reiterated expressions appear in certain sections; specifically, “Item 1A Risk Factor (Continued).”
• Diverse layouts exist within a report, with variations in fonts, headings, and overall formatting.

With the aid of these regularities in the filing’s HTML structure, our algorithms can accurately locate and extract the risk factor section. The process of extracting risk factors from HTML involves four steps:

• Page Footers Removal
• Headings Extraction
• Risk Factor Section Detection
• Titles Extraction

C. Page Footers Removal
The page footers of the HTML files contain repetitive information. First, we removed the repeated string “Item 1A. Risk Factors (Continued)” using RegExr 4 in the regex table. We observed that page footers are typically found near horizontal lines marked by ‘<hr>’ tags or by ‘<div>’ tags with the style ‘page-break-after: always’ (RegExr 2 in the regex table). These were replaced with a ‘split of pages’ marker using the BeautifulSoup library. Our algorithm then identified and removed non-informative page footers, both textual and numerical, located near these markers by scanning both forward and backward.

D. Headings Extraction
Headings were extracted by detecting tags typically associated with headings, identified by their styles or attributes. The algorithm uses the attributes listed in Table 1 to identify headings and encapsulates their contents within ‘<heading>’ and ‘</heading>’ markers, so that the headings are readily identifiable in later analysis steps. Additionally, we removed the ‘<heading>’ and ‘</heading>’ markers when the heading contained five words or fewer.

E. Risk Factor Section Detection
The risk factor section is extracted by identifying the heading “Item 1A - Risk Factor” (RegExr 6, 7 in the regex table) and the subsequent heading (RegExr 8-16 in the regex table). The contents of the section are located between this heading and the subsequent headings. Typically, if the pattern “Item 1A - Risk Factor” appears in a heading, it may contain extra spaces, line breaks, or other variations. Conversely, if it does not appear in a heading, the pattern is consistently enclosed in special quotation marks, either “&ldquo;”, “&rdquo;”, or “&quot;”. The algorithm, improved from [65], detects the positions of the several types of headings by iterating through the regex patterns in Table I. Additionally, after extracting the contents of the section, the algorithm checks that the extracted content exceeds 1000 characters in length.

F. Titles Extraction
The extracted headings include both titles (such as “Risks Related to Legal, Regulatory and Compliance Matters” or “FORWARD-LOOKING STATEMENT”) and headings (like “We may be unable to adequately protect our proprietary intellectual property rights, which may limit our ability to compete effectively.”). We noted that, after removing stopwords, every word in a title starts with a capital letter. Thus, for headings fitting this pattern, we replace their markers with ‘<title>’ and ‘</title>’.
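The section detection and title rules of Sections E and F can be illustrated with a minimal sketch. It is not the authors' implementation: the two regular expressions, the stopword list, and the function names `extract_risk_section` and `is_title` are assumptions made for illustration, since the paper's own patterns live in its regex table. The sketch keeps only a candidate section longer than 1000 characters, which also skips a short table-of-contents match.

```python
import re

# Hypothetical, deliberately small stopword list; the paper does not state which list it used.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "or", "with", "by", "our"}

# Assumed patterns: an "Item 1A ... Risk Factor(s)" heading and the next item heading.
START_RE = re.compile(r"item\s*1a\s*[.:\-]?\s*risk\s+factors?", re.IGNORECASE)
END_RE = re.compile(r"item\s*(1b|2)\s*[.:\-]", re.IGNORECASE)
MIN_LENGTH = 1000  # length check described in Section E


def extract_risk_section(text, min_length=MIN_LENGTH):
    """Return the text between an Item 1A heading and the next item heading, or None."""
    for start in START_RE.finditer(text):
        end = END_RE.search(text, start.end())
        stop = end.start() if end else len(text)
        body = text[start.end():stop].strip()
        if len(body) > min_length:  # skip short matches such as a table of contents entry
            return body
    return None


def is_title(heading):
    """True if every non-stopword in the heading starts with a capital letter."""
    words = [w for w in re.findall(r"[A-Za-z][A-Za-z'-]*", heading)
             if w.lower() not in STOPWORDS]
    return bool(words) and all(w[0].isupper() for w in words)
```

For example, `is_title("Risks Related to Legal, Regulatory and Compliance Matters")` is true, while a sentence-style risk heading beginning "We may be unable to ..." is not, so only the former would receive title markers.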

Table 1 Heading Pattern
Sample | Description
font-weight: bold | Bold heading with style attribute
font-weight: 700 | Bold heading with style attribute
<b></b> | Bold tag
<strong></strong> | Strong tag
text-decoration: underline | Underlined heading with style attribute
font-style: italic | Italic heading with style attribute
<i></i> | Italic tag
<em></em> | Emphasized tag
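The heading patterns in Table 1 can be applied roughly as in the sketch below. It is only an illustration under stated assumptions: the paper uses the BeautifulSoup library, whereas this sketch uses Python's standard `html.parser` to stay self-contained, and the class `HeadingMarker`, the function `mark_headings`, and the handling of void tags are hypothetical. Elements matching a Table 1 pattern are wrapped in heading markers, and the markers are dropped for headings of five words or fewer.

```python
import re
from html.parser import HTMLParser

# Tag and style patterns taken from Table 1.
HEADING_TAGS = {"b", "strong", "i", "em"}
HEADING_STYLE_RE = re.compile(
    r"font-weight:\s*(bold|700)|text-decoration:\s*underline|font-style:\s*italic",
    re.IGNORECASE,
)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags without a closing tag
MIN_WORDS = 6  # headings of five words or fewer lose their markers


class HeadingMarker(HTMLParser):
    """Illustrative parser that wraps heading-styled text in <heading> markers."""

    def __init__(self):
        super().__init__()
        self.out = []     # output text pieces
        self.stack = []   # one flag per open element: does it look like a heading?
        self.buffer = []  # text collected inside the current heading

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        self.stack.append(tag in HEADING_TAGS or bool(HEADING_STYLE_RE.search(style)))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        was_heading = self.stack.pop()
        if was_heading and not any(self.stack):  # outermost heading element closed
            self._flush()

    def handle_data(self, data):
        (self.buffer if any(self.stack) else self.out).append(data)

    def _flush(self):
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if not text:
            return
        if len(text.split()) >= MIN_WORDS:
            text = f"<heading>{text}</heading>"
        self.out.append(text)


def mark_headings(html):
    parser = HeadingMarker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Real filings contain malformed and deeply nested markup, so a production version would need more tolerant tag matching than this stack-based sketch provides.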

G. Preliminary Analysis

The data extraction algorithm successfully collected the 10-K filings of 94 of the 100 firms listed in the QQQ ETF directly from the SEC EDGAR database. The six firms that are not included are foreign-based and instead issued 20-F filings. In total, the algorithm gathered 1397 filings, demonstrating its effectiveness for 10-K filings. However, some documents do not follow the standard structure, for example because they have an unstructured format or place the risk factor section inside other sections. There are 63 such non-standardised documents among the 1397, and for these our algorithm was not able to accurately identify and collect the relevant information. Excluding them, we collected 1334 filings in total. Despite this limitation, the design can be adapted to collect various types of SEC filings, making it a scalable tool for financial research that requires textual report analysis.

IV. METHODOLOGY

In this section, we introduced the supervised learning model for 10-K sentiment analysis and explained how the model functioned. Section 4.1, which follows [1], demonstrated how the model generated a sentiment score for a 10-K filing by using the return at the time of the filing’s publication as a label. In Section 4.2, we described in detail the model that used volatility as a label to estimate sentiment and its adaptation process. In Section 4.3, to represent the macroscopic sentiment trend of the technology sector, we introduced the Kalman filter and its process for removing noise in the context of 10-K sentiment analysis.

 A Sentiment Score Prediction Model

 Notation
To establish notation for the probabilistic sentiment model, we considered a set of $n$ 10-K filings and a vocabulary $V$ consisting of $m$ words. The index $i = 1, \ldots, n$ identified a 10-K filing, and therefore both the filing’s publication date and the company to which it related. The word counts were recorded in a vector $d_i \in \mathbb{R}_+^m$, where $d_{i,j}$ represented the number of times word $j$ occurred in 10-K filing $i$. We defined $D \in \mathbb{R}_+^{n \times m}$ as a document-term matrix, with $D = [d_1, \ldots, d_n]^\top$, representing word counts in each document, so that $d_i$ was the $i$-th row of $D$. For a subset of the vocabulary $S \subseteq V$ listing column indices, $D_{[S]}$ was the corresponding submatrix and $d_{[S],i}$ was the word count vector in subset $S$ for the $i$-th filing.

We labelled filing $i$ with the associated time series variable $y_i$, either return or volatility, on the publication date of the filing. In this project, we assumed that each filing had a sentiment score, denoted by $p_i \in [0, 1]$. In the case of return fitting, a high score suggested that the filing had a predominantly positive tone. In the context of volatility fitting, however, the score implies neither a positive nor a negative tone, as volatility signifies market uncertainty. Therefore, in the volatility setting, a high score indicates high market uncertainty and a low score suggests low market uncertainty.

 Model Assumptions

 Assumption 1:
We assumed that vocabulary $V$ consisted of a set $S$ of sentiment-charged words and a set $N$ of neutral words, i.e., $V = S \cup N$. These sets were mutually exclusive, i.e., $S \cap N = \emptyset$. Furthermore, we posited that the set of sentiment-charged words affected the tone of a filing, whereas the set of neutral words did not influence its tone, i.e., $d_{[S],i} \not\perp\!\!\!\perp p_i$ and $d_{[N],i} \perp\!\!\!\perp p_i$. The sentiment word count was independent of the neutral word count, i.e., $d_{[S],i} \perp\!\!\!\perp d_{[N],i}$, implying that the model did not need to include the neutral words.

 Assumption 2:
We assumed that the sentiment-charged word counts $d_{[S],i}$ were produced by a mixture multinomial distribution:

$$d_{[S],i} \sim \mathrm{Multinomial}\left(s_i,\; p_i O_+ + (1 - p_i)\, O_-\right) \qquad [1]$$

Where $s_i$ was the total count of sentiment-charged words in the $i$-th filing, i.e., $s_i = \sum_{w \in S} d_{i,w}$. This determined the scale of the multinomial. We then modelled a matrix $O = [O_+, O_-] \in \mathbb{R}_+^{|S| \times 2}$, representing the probabilities of individual word counts using a mixture model that incorporated positive and negative topics. The model included $O_+$, a vector of $|S|$ non-negative elements with a unit $\ell_1$-norm, representing a probability distribution across words, such that $\sum_{w \in S} o_{+,w} = 1$, where $o_{+,w}$ was the $w$-th element of $O_+$. $O_+$ symbolised a ‘positive sentiment topic’ and represented the expected distribution of word frequencies in a filing with the highest possible positive sentiment, where the probability $p_i$ equals 1. Similarly, $O_-$ symbolised a ‘negative sentiment topic’ that represented the distribution of word frequencies in a filing with the most pronounced negative sentiment, where the probability $p_i$ equals 0. The sentiment score $p_i$, where $0 < p_i < 1$, was a mixture coefficient of the two sentiment topics.
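
To make Assumption 2 concrete, the following sketch simulates the sentiment-charged word counts of one filing from a mixture of a positive and a negative topic. The topic vectors, the four-word vocabulary and the function name `sample_counts` are made up for illustration and do not come from the paper.

```python
import random
from collections import Counter

# Hypothetical topic vectors over |S| = 4 sentiment-charged words; each sums to one.
O_PLUS = [0.50, 0.30, 0.15, 0.05]    # word distribution when p_i = 1
O_MINUS = [0.05, 0.15, 0.30, 0.50]   # word distribution when p_i = 0


def sample_counts(p_i, s_i, rng):
    """Draw s_i words from the mixture p_i * O_plus + (1 - p_i) * O_minus."""
    probs = [p_i * op + (1.0 - p_i) * om for op, om in zip(O_PLUS, O_MINUS)]
    draws = rng.choices(range(len(probs)), weights=probs, k=s_i)
    tally = Counter(draws)
    return [tally[w] for w in range(len(probs))]


# One simulated filing with a fairly positive tone and 200 sentiment-charged words.
counts = sample_counts(p_i=0.8, s_i=200, rng=random.Random(0))
```

A filing with $p_i$ close to 1 draws most of its words from the positive topic, and the counts always sum to $s_i$.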

 Assumption 3:
Finally, we assumed that the sentiment score fully encapsulated the information within a filing that affected the dependent variable, i.e., $y_i \perp\!\!\!\perp d_i \mid p_i$. This implied that the sentiment score could primarily be used as a feature for return or volatility prediction models.

 Model
The model, incorporating these three assumptions, consisted of three steps for predicting the sentiment score of a 10-K filing. First, we extracted the set of sentiment-charged words, denoted as $\hat{S}$. Second, we estimated the matrix of positive and negative topic parameters over words, $\hat{O} = [\hat{O}_+, \hat{O}_-]$. Third, we predicted the sentiment score $\hat{p}_i$ of a 10-K filing using penalised maximum likelihood estimation. Each step was described in detail in Section 4.1.3 to Section 4.1.4. The model overview can be found in.

 Extracting Sentiment-Charged Words:
In a 10-K filing, sentiment-neutral words were likely to predominate in terms of both the number of words and total counts. This dominance tended to introduce noise and could make the extraction of sentiment-charged words from the entire vocabulary computationally burdensome, especially if sentiment-neutral words were not selectively excluded. Hence, in our model, we filtered out the set of sentiment-neutral words and focused solely on the subset of sentiment-charged words, estimating topic parameters for this subset alone. To achieve this, we utilised realised stock returns as a label (see the calculation of the return $y_i$, or $R_t$, below; volatility was also used as a label, with the fitting process described in detail in Section 4.2). The indicator function, represented by $\mathbb{1}_{\{a\}}$, was defined as 1 when condition $a$ was met, and 0 otherwise. Consequently, $\mathbb{1}_{\{d_{i,w} > 0\}}$ represented the presence of word $w$ in the $i$-th filing, and $\mathbb{1}_{\{y_i > 0\}}$ denoted that the return variable associated with the $i$-th filing was positive. For each word $w \in V$, we could compute the frequency of word $w$ in 10-K filings with positive returns relative to its overall frequency across all filings using the following equation:

$$f_w = \frac{\sum_{i=1}^{n} \mathbb{1}_{\{d_{i,w} > 0\}}\, \mathbb{1}_{\{y_i > 0\}}}{\sum_{i=1}^{n} \mathbb{1}_{\{d_{i,w} > 0\}}} \qquad [2]$$

This measure signified the word’s sentiment tone relative to return values. For any given word $w$, if the filings that contained it generally coincided with positive rather than negative returns, $w$ was tagged with a positive sentiment. Conversely, if occurrences of $w$ tended more often to align with negative returns, it was considered to have a negative sentiment.

Subsequently, we assessed $f_w$ against appropriate thresholds. If $f_w$ was approximately 0.5, this suggested that the word was sentiment-neutral and belonged to the set $N$. To differentiate sentiment-charged words, we defined $\alpha_+$ and $\alpha_-$ within the interval $(0, 0.5]$ to filter out sentiment-neutral words. A word was classified as a positive sentiment word if $f_w > 0.5 + \alpha_+$ and as a negative sentiment word if $f_w < 0.5 - \alpha_-$. We also defined a third threshold, $k \in \mathbb{N}$, which pertains to the count of word $w$ across all filings. The threshold $k$ acted as a minimum frequency requirement to mitigate the impact of rare words, which could be sources of noise. For instance, a term appearing only once in all filings would result in an $f_w$ of either 0 or 1, thus being noisy and unreliable. By applying the threshold $k$, we filtered out such noisy instances, requiring that a word appeared in more than $k$ filings to be considered: $\sum_{i=1}^{n} \mathbb{1}_{\{d_{i,w} > 0\}} > k$. With these conditions, our extracted set $\hat{S}$ was defined by:

$$\hat{S} = \left\{ w \in V : f_w > 0.5 + \alpha_+ \ \text{or}\ f_w < 0.5 - \alpha_- \right\} \cap \left\{ w \in V : \sum_{i=1}^{n} \mathbb{1}_{\{d_{i,w} > 0\}} > k \right\} \qquad [3]$$

 Return $y_i$, or $R_t$, Calculation
When using a return as our label, we computed it around the publication time $t$, over the days $t - 1$ to $t + 1$, as suggested in [1]. Using only the return at the publication time $t$ can produce a noisy signal. A return may respond slowly to the content of 10-K filings, so the market may need time to reflect a firm’s released 10-K filing in its stock price. Additionally, some contents of 10-K filings can already be reflected in the return, as the annually released 10-K filings can contain repetitive contents. To mitigate the noisy signal, we used open-close log returns $R_t$:

$$R_t = \log\left( \frac{P_{(t+1)^C}}{P_{(t-1)^O}} \right) \qquad [4]$$

Where $P$ refers to the equity price at a given time. The market open time at day $t$ was denoted $t^O$, and the market close time at day $t$ was denoted $t^C$. We used the open-close return to make a symmetric alignment with the 10-K filing’s published date.

 Estimating Probabilistic Distribution of Topic Parameters:
In this process, our goal was to estimate the topic parameters for a report. $\hat{O}_i = [\hat{O}_{i+}, \hat{O}_{i-}]$ referred to a $1 \times 2$ vector, with each element corresponding to the estimated probabilistic distribution of positive topic or negative topic for a filing, respectively. To obtain $\hat{O} = [\hat{O}_+, \hat{O}_-]$, we associated the sentiment expressed in a 10-K filing with stock returns or volatility (fitting on volatility is explained in Section 4.2). This approach comes from the assumption that stock returns or volatility at the publication date of a 10-K filing represented the sentiment for it. Note that we did not have direct sentiment scores for the filings, so we estimated the sentiment proxy by using the standardised ranks of stock returns as a label in the equation:

$$\tilde{p}_i = \frac{\mathrm{rank}(y_i)}{n} \qquad [5]$$

Where the expression defined $\tilde{p}_i$, the estimated sentiment proxy for the $i$-th filing, as the rank of $y_i$, i.e., a stock return or volatility (fitting on volatility is explained in
Section 4.2), divided by $n$, the total number of filings. The rank of $y_i$ was taken in ascending order.

With these estimated sentiment proxies, we formed a matrix $\hat{W}$ with two rows and one column per filing: one row for the estimated positive sentiment proxy $\tilde{p}_i$ and one for the estimated negative sentiment proxy $1 - \tilde{p}_i$, as in the equation:

$$\hat{W} = \begin{bmatrix} \tilde{p}_1 & \tilde{p}_2 & \cdots & \tilde{p}_n \\ 1 - \tilde{p}_1 & 1 - \tilde{p}_2 & \cdots & 1 - \tilde{p}_n \end{bmatrix} \qquad [6]$$

Subsequently, we adjusted the counts of sentiment-charged words by dividing each count by the total sentiment-charged word count for the filing, as in the equation:

$$\hat{h}_i = \frac{d_{[\hat{S}],i}}{\hat{s}_i} \qquad [7]$$

Where $\hat{h}_i$ was the vector of relative counts of sentiment-charged words. This was defined as the ratio of $d_{[\hat{S}],i}$, the word counts within the subset $\hat{S}$ for the $i$-th filing, to $\hat{s}_i$, which was the total sentiment-charged word count for the filing. These relative term counts were collected in a matrix $\hat{H} = [\hat{h}_1, \ldots, \hat{h}_n]$.

In this process, we aimed to estimate a matrix $\hat{O}$, which contained parameters that referred to the probability distribution of positive and negative sentiments. We could estimate $\hat{O}$ with a regression, using matrix $\hat{H}$ as the response and matrix $\hat{W}$ as the predictor, as in the equation:

$$\hat{O} = \hat{H}\hat{W}^\top \left( \hat{W}\hat{W}^\top \right)^{-1} \qquad [8]$$

In the final step, to ensure the estimates correspond to a probability distribution, we corrected any negative entries by resetting them to zero and then re-normalised each column so that their totals were equal to one. This process produced a revised matrix, but to simplify notation, we reused $\hat{O}$ for the resulting matrix, and we labelled its first and second columns as $\hat{O}_+$ and $\hat{O}_-$, respectively.

 Scoring New Filing

Having constructed the estimators $\hat{S}$ and $\hat{O}$, and given the mixture multinomial distribution in Assumption 2 for the count vector $d_{[\hat{S}]}$ of a new filing, we could estimate the most probable sentiment score $p$ by Maximum Likelihood Estimation (MLE):

$$\hat{p} = \arg\max_{p \in [0,1]} \left\{ \frac{1}{\hat{s}} \sum_{w \in \hat{S}} d_w \log\left( p\, \hat{O}_{+,w} + (1 - p)\, \hat{O}_{-,w} \right) + \lambda \log\left( p (1 - p) \right) \right\} \qquad [9]$$

Where $\hat{s}$ represented the total count of words from the set $\hat{S}$ in the new filing, while $d_w$, $\hat{O}_{+,w}$, and $\hat{O}_{-,w}$ referred to the $w$-th elements of their respective vectors, and $\lambda$ was a positive constant used to adjust the model. In the MLE, $\lambda \log(p(1-p))$ was a penalty term to avoid overfitting when there were reports with few sentiment-charged words. For instance, if a filing had a limited number of positive sentiment-charged words and no negative words, the unpenalised model would conclude that the filing had a positive tone even though it contained only a few positive words. The term served as a regularising factor that pushed the estimated sentiments toward 0.5, indicative of a neutral sentiment. In other words, the penalty term nudged the model to be more conservative when scoring new filings.

 Volatility Label
Section 4.1 introduced the sentiment prediction model with the firm return as $y_i$, as in [1]. Notably, [1] suggested the model is universally adaptable. However, its core assumptions and methodologies might not suit every forecasting objective, such as using volatility as a label. While a return label fits into a binary or discrete framework, volatility does not naturally do so. The model therefore had to be adapted to use volatility as a label.

In the returns setting, we could employ returns as a label corresponding to binary sentiment topics, i.e., positive and negative, because one either made a loss or a profit. However, this classification was not clear in the context of volatility. The adaptation for volatility was to set a threshold $\theta$ and label all values above $\theta$ as high volatility and those below it as low volatility. We then replaced 0 by $\theta$ in Equation [2] and the value 0.5 by a quantile $q$ in Equation [3]. Note that there is no single right threshold or quantile, but the choice of these hyper-parameters impacts the outcome of the model.

Volatility is, in nature, an asymmetric variable; thus, ranking the volatility as in Equation [5] loses substantial informational value for estimating the sentiment score. Normalising the volatility can be an appropriate alternative as it preserves asymmetry:

$$p_i^* = \frac{y_i - \min(y)}{\max(y) - \min(y)} \qquad [10]$$

Where $p_i^*$ is greater than or equal to 0 and less than or equal to 1. However, the normalised volatility itself was not able to catch the outliers of the market movement despite making it symmetric around zero, leading to poor predictive performance. Thus, we used volatility values over multiple days for more robust labels. In practice, we averaged the volatility over three days. Among standard volatility proxies such as squared returns or the realised volatility, we used the intra-daily range, as it is a less noisy volatility proxy, leading to less distortion [66], [67]. The intra-daily log range is defined as:

[11]
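
The three estimation steps of Section 4.1 (screening in [2] and [3], topic estimation in [5] to [8], and scoring in [9]) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, rank ties are broken by input order, and a simple grid search stands in for whatever optimiser was used for the penalised MLE. The arguments `theta` and `center` default to the return setting (0 and 0.5); for volatility labels they would be replaced by the threshold θ and the quantile q described above.

```python
import math


def screen_words(D, y, alpha_plus, alpha_minus, k, theta=0.0, center=0.5):
    """Equations [2]-[3]: return indices of sentiment-charged words.

    D is a list of per-filing word-count lists and k >= 0.
    """
    selected = []
    for w in range(len(D[0])):
        doc_freq = sum(1 for row in D if row[w] > 0)       # filings containing w
        if doc_freq <= k:                                  # too rare to trust
            continue
        hits = sum(1 for row, y_i in zip(D, y) if row[w] > 0 and y_i > theta)
        f_w = hits / doc_freq                              # Equation [2]
        if f_w > center + alpha_plus or f_w < center - alpha_minus:
            selected.append(w)
    return selected


def estimate_topics(D_S, y):
    """Equations [5]-[8]: regress relative counts on rank-based proxies."""
    n = len(y)
    p = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda j: y[j]), start=1):
        p[i] = rank / n                                    # Equation [5]
    # W W^T = [[a, b], [b, c]] for W = [p; 1 - p], inverted in closed form
    a = sum(v * v for v in p)
    b = sum(v * (1 - v) for v in p)
    c = sum((1 - v) ** 2 for v in p)
    det = a * c - b * b
    totals = [max(sum(row), 1) for row in D_S]
    O_plus, O_minus = [], []
    for w in range(len(D_S[0])):
        h = [row[w] / s for row, s in zip(D_S, totals)]    # Equation [7]
        hp = sum(hv * pv for hv, pv in zip(h, p))
        hq = sum(hv * (1 - pv) for hv, pv in zip(h, p))
        # Row w of H W^T (W W^T)^{-1}, with negative entries reset to zero
        O_plus.append(max((hp * c - hq * b) / det, 0.0))
        O_minus.append(max((hq * a - hp * b) / det, 0.0))
    sum_plus, sum_minus = sum(O_plus), sum(O_minus)
    return [v / sum_plus for v in O_plus], [v / sum_minus for v in O_minus]


def score_filing(d_S, O_plus, O_minus, lam, n_grid=999):
    """Equation [9]: penalised MLE, here by grid search over p in (0, 1)."""
    s_hat = max(sum(d_S), 1)
    best_p, best_val = 0.5, -math.inf
    for j in range(1, n_grid + 1):
        p = j / (n_grid + 1)
        loglik = sum(d * math.log(p * op + (1 - p) * om + 1e-12)
                     for d, op, om in zip(d_S, O_plus, O_minus)) / s_hat
        val = loglik + lam * math.log(p * (1 - p))
        if val > best_val:
            best_p, best_val = p, val
    return best_p
```

In a two-word toy case where one word is purely positive and the other purely negative, a filing with counts 9 and 1 scores about 0.9 without the penalty and about 0.83 with λ = 0.1, which illustrates how the penalty pulls scores toward 0.5.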

The volatility proxy $\tilde{V}_t$ was then computed like this:

[12]

Where the intra-daily log range was squared and divided by the adjustment factor, $4\log(2)$. The adjustment factor is used to correct potential bias that may occur in the data generation process when we assume that the process follows Brownian motion with drift (Parkinson, 1980). Then, we attained the averaged volatility over three days for a more reliable label:

[13]

Which is the common approach to calculating multi-day aggregation of daily volatility proxies [68].

 Kalman Filter
In this paper, we generated sentiment metrics of both the technology sector and a single firm (e.g. Nvidia) with 10-K filings over a certain time. However, the estimates of the industry’s time-varying sentiment measures are expected to be very noisy. To obtain robust sentiment features as time series, we adapted the Kalman filter to smooth the time-varying noisy signals. In the context of smoothing sentiments of 10-K filings across the industry, we referred to [69] to adapt the Kalman filter to our context. The adapted Kalman filter in our context is as follows:

[14]

[15]

Firstly, we assumed that there were an observed sentiment score $v_t$ and an unobserved sentiment score, the state variable $\mu_t$ in the Kalman filter framework. We extracted the unobserved sentiment score from the publication-date-based sentiment scores via Equation [14] and Equation [15]. Equation [14] referred to the prediction model and Equation [15] referred to the correction model. The prediction model updated the unobserved state, and the correction model represented the observed sentiment scores that were affected by the state variable. The Kalman filter worked recursively through prediction and correction. It predicted the unobserved state estimates from the previous weighted variable. A more detailed explanation of this Kalman filter model can be found in [70]. In our study, we employed the Kalman smoother instead of the Kalman filter, as the former was more suitable for showing a retrospective trend when analysing the industry-level sentiments. The Kalman filter and the Kalman smoother are very similar in terms of estimating the unobserved sentiment score. However, the Kalman filter estimates the unobserved sentiment score at time $t$ given observations up to and including time $t$, whereas the Kalman smoother estimates the hidden sentiment score based on all observations for $t = 1, \ldots, T$. In this paper, we refer to the Kalman smoother as the Kalman filter for convenience.

To adapt the Kalman filter to our context, our sentiment scores had to be converted to time series data, because there were multiple sentiment scores on the same publication date of 10-K filings. To convert them to time series data, we aggregated sentiment scores on the same date over all firms as $\hat{p}_t = \frac{1}{|A_t|} \sum_{i \in A_t} \hat{p}_i$, where $A_t$ referred to the set of 10-K filings published on day $t$. Subsequently, $v_t$ was substituted by $\hat{p}_t$ in Equation 4.9 and we obtained the filtered industry-level sentiment scores $\tilde{p}_t$ after applying the Kalman filter. In other words, this approach implied that all 10-K filings published on the same day were treated equally. This approach may enhance the model’s ability to estimate sentiment scores more accurately. However, it comes with the trade-off of transforming the score into an aggregate metric of all 10-K publications within a given day.

V. EXPERIMENT RESULTS

In this paper, we aimed to generate various sentiment scores from our suggested sentiment prediction model. To achieve that, we generated sentiment scores at three stakeholder levels: sector level, portfolio level, and company level. At the sector level, we generated scores from the model that aggregated all filings of companies from the technology sector. At the portfolio level, we generated scores for a specific set of companies. In our study, we aggregated filings of the top 10 firms listed in QQQ. Finally, at the company level, we aggregated all filings of a single firm to generate sentiment scores. We generated Nvidia’s sentiment metrics in our study. The models at each level used either the entire 10-K or only the risk factor section, labelled with either return or volatility. That is, each level generated four types of sentiment scores. With three levels and four types each, we generated 12 types of sentiment scores in total (see Contribution). Furthermore, for the analysis at both the sector level and the portfolio level, we applied the Kalman filter to mitigate noise.

In this chapter, we showed a statistical overview of the experimental results to introduce what we found. Then, we evaluated the sentiment scores we generated at the three stakeholder levels through correlation analysis and qualitative analysis with the most influential words.

 Descriptive Statistics
This section shows descriptive statistics of the sentiment scores we generated at the three different stakeholder levels. We show the number of observations for each model and the number of sentiment words used for each model. We also calculate the mean, Standard Deviation (SD), and maximum and minimum sentiment scores for each model.

 Evaluation Methods

 Correlation Analysis
We used the Pearson correlation to evaluate our estimated sentiment scores. The Pearson correlation coefficient rests on two assumptions [71]. The first assumption is that the two variables are continuous. Secondly, the two variables are assumed to have a linear relation. Based on these
assumptions, we could analyse our sentiment scores with the Pearson correlation. We used four variables (i.e. $\tilde{p}_{RET}$, $\tilde{p}_{VOL}$, $\tilde{p}_{LM}$, and Stock) for correlation analysis. The correlation of variable pairs was calculated through the Pearson correlation formula in Appendix E, and their linear correlation could be checked through a significance test [72]. In a significance test, 0.05 is conventionally used as a significance level. However, a long list of scientists recently proposed a new p-value threshold (i.e. 0.005) to improve reproducibility, as the conventionally used threshold leads to a high rate of false positives, even without taking into account any issues related to the experiment, procedure, or documentation [73]. In our study, we reported significance at both the 0.05 and the 0.005 levels to make our results more reliable.

 Qualitative Analysis with Most Influential Words
In the process of predicting sentiment scores for all models we have introduced so far, we could extract the top 15 influential words from $\tilde{p}_{RET}$ and $\tilde{p}_{VOL}$ for each model. The detailed extraction process can be found in Equation [3]. With the extracted set, we could infer a topic for the analysis of the three stakeholder levels.

The set of extracted words might be interpreted in various ways. The objective of the word extraction was to offer the most influential keywords for financial analysis. We assumed the value of the words would be enhanced when combined with domain knowledge. Also, using generative AI could offer a good reference to our word set for in-depth and insightful analysis.

 Evaluation Results

 Sector Level (i.e. Technology Sector)
In the technology sector analysis, we computed the sector sentiment scores with all firms’ filings, considering either the entire 10-K filing or only the Item 1A section. To improve readability, we introduced a table at the top right of all the score graph figures to make it easy to identify which model each figure refers to. Moreover, we employed a Kalman filter to represent a reliable sentiment trend for the sector analysis [70]. An industry-level analysis generated a considerable amount of signal noise. We could control these noisy signals by using the Kalman filter.

 Sector Correlation Analysis:

Table 2 Sentiment token Statistics with 10-K and Risk Factor

Model                                            # of Total Tokens (# of Sentiment Tokens)   Mean   SD     Max/Min
$\tilde{p}_{RET}$ with 10-K (Figure X)           15,863 (8,725)                              0.48   0.28   0.92/0.08
$\tilde{p}_{VOL}$ with 10-K (Figure X)           15,863 (9,518)                              0.43   0.24   0.92/0.08
$\tilde{p}_{RET}$ with Risk factor (Figure X)    9,030 (5,147)                               0.46   0.27   0.92/0.08
$\tilde{p}_{VOL}$ with Risk factor (Figure X)    9,030 (3,792)                               0.42   0.25   0.92/0.08

Table 3 Sentiment token Statistics with 10-K and Risk Factor

Model                                            # of Total Tokens (# of Sentiment Tokens)   Mean   SD     Max/Min
$\tilde{p}_{RET}$ with 10-K                      7,040 (6,434)                               0.42   0.30   0.92/0.08
$\tilde{p}_{VOL}$ with 10-K                      7,040 (6,985)                               0.43   0.33   0.89/0.08
$\tilde{p}_{RET}$ with Risk factor               4,265 (4,052)                               0.42   0.29   0.92/0.08
$\tilde{p}_{VOL}$ with Risk factor               4,265 (4,005)                               0.44   0.31   0.88/0.08

Table 4 Sentiment token Statistics with 10-K and Risk Factor Signal Noise.

Model                                            # of Total Tokens (# of Sentiment Tokens)   Mean   SD     Max/Min
$\hat{p}_{RET}$ with 10-K                        3,861 (3,861)                               0.49   0.32   0.89/0.13
$\hat{p}_{VOL}$ with 10-K                        3,861 (3,861)                               0.39   0.21   0.86/0.13
$\hat{p}_{RET}$ with Risk factor                 2,201 (2,201)                               0.49   0.32   0.90/0.11
$\hat{p}_{VOL}$ with Risk factor                 2,201 (2,201)                               0.41   0.18   0.80/0.17
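
The smoothing applied to the sector-level and portfolio-level scores can be illustrated with a small sketch. It assumes a local-level model (a random-walk state observed with noise), which may differ from the paper's Equations [14] and [15]. The noise variances `q` and `r`, the function names and the sample data are hypothetical. Same-day filings are averaged with equal weight before filtering, and a backward (Rauch-Tung-Striebel) pass turns the filtered estimates into smoothed ones.

```python
from collections import defaultdict


def aggregate_daily(scores):
    """Average the scores of all filings published on the same day."""
    by_day = defaultdict(list)
    for day, score in scores:
        by_day[day].append(score)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}


def kalman_smooth(v, q=0.01, r=0.1):
    """Local-level model: mu_t = mu_{t-1} + eta_t (var q), v_t = mu_t + eps_t (var r)."""
    T = len(v)
    mu_pred, P_pred = [0.0] * T, [0.0] * T    # one-step-ahead predictions
    mu_filt, P_filt = [0.0] * T, [0.0] * T    # estimates given data up to t
    mu_prev, P_prev = v[0], 1.0               # loose initial state
    for t in range(T):
        mu_pred[t], P_pred[t] = mu_prev, P_prev + q            # prediction step
        gain = P_pred[t] / (P_pred[t] + r)                     # Kalman gain
        mu_filt[t] = mu_pred[t] + gain * (v[t] - mu_pred[t])   # correction step
        P_filt[t] = (1.0 - gain) * P_pred[t]
        mu_prev, P_prev = mu_filt[t], P_filt[t]
    mu_smooth = list(mu_filt)
    for t in range(T - 2, -1, -1):                             # backward pass
        c = P_filt[t] / P_pred[t + 1]
        mu_smooth[t] = mu_filt[t] + c * (mu_smooth[t + 1] - mu_pred[t + 1])
    return mu_filt, mu_smooth


daily = aggregate_daily([("2023-02-01", 0.4), ("2023-02-01", 0.6), ("2023-02-03", 0.7)])
filtered, smoothed = kalman_smooth(list(daily.values()))
```

The filtered value at time t uses observations up to t only, while the smoothed value uses the whole series, which is why the smoother suits a retrospective trend analysis.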

 The Sector Sentiment Model with 10-K Filing (Figure )

Figure showed the predicted sentiment scores for the technology sector. These scores were estimated with the 10-K filings of almost all firms listed in the QQQ (i.e. 94 firms out of 100), and we used the entire 10-K filings to calculate the scores. Note that the sentiment scores labelled with return and volatility in Figure are filtered. Upon reviewing graphs in and , it was evident that both unfiltered return sentiment and volatility sentiment contained significant noise. Therefore, filtered versions were chosen to more reliably depict the sector’s trend. Other models, which will be introduced in the following parts, also used the filtered one.

We calculated two types of average loss for the same windows: one between the estimated sentiment score $\tilde{p}_{RET}$ and the normalized rank of return, and the other between the estimated sentiment score $\tilde{p}_{VOL}$ and the normalized rank of volatility. Furthermore, we calculated how well our sentiment analysis models can predict the sentiment of the financial markets based on return or volatility. For this model (Figure ), the loss of the model $\tilde{p}_{RET}$ was 0.25, and the accuracy rate was 78%. It showed that $\tilde{p}_{RET}$ was well predicted compared to the QQQ return over the given window. It meant that $\tilde{p}_{RET}$ represented the sentiment of the technology sector.
Additionally, the $\tilde{p}_{VOL}$ sentiment of this model showed stronger prediction accuracy. The $\tilde{p}_{VOL}$ model achieved a 92% accuracy rate with a loss of 0.22.

To evaluate our model critically, we selected the LM as our benchmark as it was created for 10-K sentiment assessment. In Figure , the $\tilde{p}_{LM}$ score showed a steady decrease until 2018, followed by a significant and gradual decline, with intermittent fluctuations due to noise. This trend occurred despite the technology sector’s rapid growth from 2018 until just before the COVID-19 pandemic, indicating a generally negative direction in the model’s movements. This was because the predominance of negative over positive vocabulary in the LM dictionary (2,355 negative vs 354 positive words out of 86,531 words) suggested a bias towards negative sentiment in the $\tilde{p}_{LM}$ score, as approximately 87% of the relevant vocabulary is negative. The rest, 83,822 words, were neutral and excluded from the sentiment analysis. This imbalance likely caused the $\tilde{p}_{LM}$ score to trend negatively. Additionally, the sector sentiment prediction model took 37 seconds to execute.

These tables (Table G.1 and Table 5) reported Pearson correlation coefficients between the sentiment estimates, including QQQ’s stock price, both filtered and unfiltered. Upon analyzing Table G.1, we found that, with the exception of the VOL and LM pair, there was no linear correlation between any pairs of unfiltered sentiment scores. However, analysis of filtered sentiment scores (Table 5) revealed more significant correlations compared to unfiltered scores. All other pairs, with the exception of LM and Stock, exhibited weak correlations. Specifically, an r value of -0.245 between $\tilde{p}_{RET}$ and $\tilde{p}_{VOL}$ indicated a weak negative correlation, suggesting that positive sentiment in RET generally corresponded to negative sentiment in VOL, and vice versa. Furthermore, an r value of +0.296 between $\tilde{p}_{RET}$ and Stock implies a weak positive correlation, indicating that positive sector sentiment in RET was associated with rising QQQ stock prices, and vice versa. In Figure , despite oscillating fluctuations due to noise from the sector, $\tilde{p}_{RET}$ showed a slow increase during the observed window. Notably, the sentiment trend in 2023 for $\tilde{p}_{RET}$ closely mirrored the QQQ stock performance in the same year. The pair of $\tilde{p}_{VOL}$ and Stock (r = +0.121) also showed a positive correlation, but a weaker one. Moreover, both the pair of $\tilde{p}_{RET}$ and $\tilde{p}_{LM}$ (r = -0.359) and the pair of $\tilde{p}_{VOL}$ and $\tilde{p}_{LM}$ (r = -0.173) showed a weak negative correlation. All pairs exhibited low p-values, with those in Table being significantly lower. This suggests that the sample results, specifically the correlation coefficients, provide sufficient evidence to reject the null hypothesis for the entire population [74]. In our case, the null hypothesis was that there was no correlation between the pair, whereas the alternative hypothesis was that there was a correlation between them. However, the lower p-value of all pairs does not measure the probability that the alternative hypothesis is true. The p-value is not an absolute index for arguing that a hypothesis is true. Instead, the lower p-value shows that our experiments have statistical significance [75], [76], [77].

Table 5 Filtered, A Sector Sentiment Correlation with 10-K Filing

$r_{\tilde{p}_i,\tilde{p}_j}$   RET        VOL        LM         Stock
RET                              1
VOL                              **-0.245   1
LM                               **-0.359   **-0.173   1
Stock                            **0.296    **0.121    **-0.910   1

 Note: *p-value < 0.05, **p-value < 0.005

 The Sector Sentiment Model with Only-Risk-Factor
The sentiment scores in were predicted based on the risk factor section. Note again that only showed the filtered one (the unfiltered can be found at.

The $\tilde{p}_{RET}$ model accuracy was 73% with a loss of 0.25. The risk factor $\tilde{p}_{RET}$ model accuracy is 5 percentage points lower than that of the entire 10-K $\tilde{p}_{RET}$ model, which is 78%. Moreover, the risk factor $\tilde{p}_{VOL}$ model showed stronger performance, with 90% accuracy and a loss of 0.22. Comparing the 10-K sector $\tilde{p}_{LM}$ and the risk factor $\tilde{p}_{LM}$, the latter is generally lower than the former. This pattern suggested that the $\tilde{p}_{LM}$ score, influenced by an LM dictionary dominated by negative words and a pessimistically toned risk factor section, inherently leaned towards a negative sentiment [25], [78].

When comparing the training data for both models (Figure , and ), the aggregated word count from all QQQ firms’ filings was 15,863 after preprocessing, and the count from just the risk factor sections of these filings was 9,030 after undergoing the same preprocessing. This observation highlighted that the risk factor sections, despite typically constituting only 10% to 15% of the entire filing, contained a significant portion of the words relevant to the model’s sentiments. It showed that a risk factor section could offer informational value for textual analysis in the finance domain. Also, the minor differences in accuracy between the models could suggest that the risk factor section provided a valuable textual context in predicting sentiment scores for both $\tilde{p}_{RET}$ and $\tilde{p}_{VOL}$. However, upon comparing both models, it was observed that all scores from the model analyzing risk factor sections were lower than those from the comprehensive model. It indicated that the tone of a risk factor section is pessimistic/negative, as other studies also empirically found [25], [78]. The training time for this model was also reduced by roughly 40% (i.e. to 23 s).

Table showed the Pearson correlation coefficients of the filtered sentiment sector model with only the risk factor. Compared to the sector model with the entire 10-K filing, shown in Table 5, the sector risk factor model showed, in general, lower r values for all pairs, and some pairs showed an altered sign, such as the pair of $\tilde{p}_{RET}$ and $\tilde{p}_{LM}$, the pair of $\tilde{p}_{RET}$ and Stock, and the pair of $\tilde{p}_{VOL}$ and $\tilde{p}_{LM}$. Except for the pair of $\tilde{p}_{RET}$ and $\tilde{p}_{LM}$, which did not show statistical significance,
the r = −0.376 between p̃_RET and Stock showed a weak negative correlation in the risk factor model: the more positive the sector's tone, the lower the sector's price. This outcome contrasted with that of the 10-K model (Table 5.1) and with our intuition that good sentiment is related to a higher return. The other pairs did not show a meaningful correlation, as their r values were close to 0.

 Sector Most Influential Words:

 The Sector Sentiment Model with 10-K Filing (Figure )
The following words were the most influential words, positively or negatively, in the p̃_RET sentiment score prediction.

From the positively influential words in S̃+_RET, we could categorise a few themes that could drive the technology sector's positive sentiment tone. The first theme could be 'Innovation and Development'. Words such as musk, maximiser, searcher, multifaceted, and incisive could be interpreted as innovation-relevant vocabulary. maximiser and searcher could refer to a figure who leads technological innovation or development. Interestingly, the word musk seemed to indicate "Elon Musk", one of the figures who represent innovation in the era of the fourth industrial revolution. Moreover, the words multifaceted and incisive could indicate a capability or capacity (or both) required for innovation. In Figure , the technology sector developed gradually and significantly during the given period. This remarkable development could be associated with innovation. In this context, words such as musk, torque, and bolivia were additionally associated with innovation, specifically indicating electric cars or robotics. Elon Musk, CEO of Tesla, leads an American firm known for its electric vehicles and robotics innovations. The word torque could be related to an electric vehicle or a robot. Bolivia is home to the world's largest lithium deposits [79], and lithium is a key component of electric car batteries. Furthermore, the words gilt, div (which we interpreted as dividend), memoranda, and stance could be interpreted as finance-relevant terms. Good financial health could be an important factor in economic growth [80]. Hence, these finance terms represent 'Financial Health'. An emphasis on improving the financial health of each tech firm might affect their stock positively, leading to a rise in the sector return.

We could also infer a theme from the negatively influential words in S̃−_RET. 'Technological Difficulty' could be a theme that makes the sector tone negative. Words such as tango, transformer, scalar, infer, and tera could represent a firm's technological difficulties. For instance, Tango was Google's augmented reality project, which was deprecated due to technological issues [81], [82]. Moreover, glitches in Google's Gemini cost Google 90 billion dollars in stock value in a single day. Furthermore, the transformer is a significant natural language processing (NLP) model architecture for generative AI. scalar, an element used to define a vector space in machine learning, can refer to the number of parameters of an AI model, which is associated with model performance. tera referred to a terabyte, representing a large dataset.

Two themes that increase market uncertainty could be inferred from the extracted words. The first theme could be 'COVID-19', from words including lancet and forearm. lancet refers to The Lancet, a major medical journal, and forearm is a body part relevant to vaccination. Both could be related to COVID-19. The second theme we could infer was 'Innovation and Growth'. The words included fang, bully, amenable, killing, granularity, koa, and renter. fang referred to FANG, an acronym for major technology firms in the US that were known for their volatile stock prices. Words like granularity, koa, and renter could be associated with a tech firm's development. The word renter indicated the concept of the sharing economy, of which cloud services could be a representative example. koa represented a web framework for Node.js, which highlighted back-end technology.

In fact, the extracted words could be interpreted in various ways. One theme we could infer was 'Environmental Issues', from words including swine, laryngoscope, farming, moss, and subway. swine could refer to swine flu, and laryngoscope could be related to COVID-19. The flu and the pandemic could increase or decrease the volatility of the technology sector. Moreover, words such as farming, moss, and subway could be associated with modernisation or urbanisation for sustainable development, or they could be related to climate change. A few studies supported the view that environmental issues were correlated with volatility [83], [84].

 The Sector Sentiment Model with Only-Risk-Factor
These words affected the sector sentiment positively. The word set showed a theme similar to the previous one (see S̃+_RET of the 10-K sector model). The first theme, 'Innovation and Development', could be inferred from words including artificially, mechanic, excellence, and explorer. artificially represents Artificial Intelligence (AI), which is a cornerstone of the fourth industrial revolution. explorer could be interpreted in the same context as maximiser and searcher in the previous model (see S̃+_RET of the 10-K sector model). The second theme was 'Financial Health'. Words such as roe, stance, escheat, wrongfully, and ruble could indicate the theme. roe referred to ROE (i.e. Return on Equity). roe, stance, and wrongfully could symbolise firms' financial condition. escheat could convey asset retention. Also, ruble, the currency of the Russian Federation, could indicate global market interactions.

A few words could be interpreted as similar to the 'Technological Difficulty' theme mentioned previously (see S̃−_RET of the 10-K sector model). The words included biogen, surviving, kinase, ramification, and chemotherapy. These words could specifically indicate biotechnology: biogen, kinase, and chemotherapy are terms used in biotechnology, while ramification and surviving could be interpreted as difficulty in developing biotechnology. Interestingly, this S̃−_RET model, trained with the risk factor section, explicitly extracted the word covid, which rapidly dropped the sector sentiment along with the reduction in stock price.
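
The positive and negative word sets discussed in this subsection are the terms that moved the predicted score most in either direction. As a minimal sketch of how such lists could be read off a fitted lexicon, assuming each word carries a single signed weight (an assumption; the weighting scheme is not restated in this section), the function below ranks words by weight. The name most_influential and all numeric weights are hypothetical and serve only as an illustration; only the words themselves are taken from the sector lists.

```python
def most_influential(weights, k=3):
    """Return the k most positive and k most negative words by signed weight."""
    # Sort ascending so the most negative weights come first.
    ranked = sorted(weights.items(), key=lambda item: item[1])
    negative = [word for word, w in ranked[:k] if w < 0]
    # Walk the tail backwards so the largest positive weight is listed first.
    positive = [word for word, w in reversed(ranked[-k:]) if w > 0]
    return positive, negative


# Hypothetical weights, invented for illustration only.
toy_weights = {
    "musk": 0.42, "torque": 0.31, "gilt": 0.18, "stance": 0.05,
    "tango": -0.27, "scalar": -0.35, "tera": -0.12,
}

positive, negative = most_influential(toy_weights, k=3)
print("positive:", positive)
print("negative:", negative)
```

A ranking of this kind only surfaces candidate words; grouping them into themes such as 'Innovation and Development' remains a manual, interpretive step.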

A few words, such as pressing, prominently, and endocrinologist, could show COVID-19 relevance. Also, spoof appeared again; together with impropriety, programmatic, and erroneously, it could symbolise cybersecurity. Interestingly, clothing and polymeric were extracted. They could indicate wearable technology in health care.

 K. Portfolio Level (i.e. Top10 Firms Portfolio)
We generated sentiment scores at the portfolio level. The portfolio consists of the top 10 firms listed in the QQQ fund, treated equally. The top 10 firms denote the ten most heavily invested companies (ranks 1 to 10) in QQQ, i.e. the 50th percentile of the QQQ fund. Note again that the Kalman filter was also applied here.

 Portfolio Correlation Analysis:

 The Top10 Sentiment Model with 10-K Filing
The figure indicated the predicted sentiment scores for the top10 portfolio. These scores were estimated from the top 10 firms' entire 10-K filings for the window from 2006 to 2023. The p̃_RET model achieved 76% accuracy with a loss of 0.22. The previously mentioned sector models had 78% and 72%, respectively. The p̃_VOL model's accuracy decreased to 78% with a loss of 0.20 (: 92%, : 90%), but it still predicted the volatility-labelled sentiment well. Training was slightly faster, reduced to 13 seconds, as we only used the 10-K filings of the top 10 firms.

Compared to the sector's sentiment score (Figure ), the top10 10-K model () showed a far clearer trend once noisy signals were again removed through the Kalman filter (the unfiltered model can be found at ). We found a few very interesting phenomena in this model. Firstly, we observed that p̃_RET strongly mirrored the top 10 stock price movement. p̃_RET had a steady and moderate increase from 2006 to around 2016. Then, beginning around 2018, p̃_RET showed a sharp rise that continued even after the outbreak of COVID-19. Very similarly, the top 10 stock rose sharply over a few years from early 2018 and showed a steep drop from 2021 to the beginning of 2023. Then, it increased sharply again. The steep drop from 2021 to 2023 could have stemmed from the COVID-19 pandemic. Several studies argued that quantitative easing (QE) seemed to be significantly and positively related to a stronger recovery of the US stock market. This could explain QQQ's sharp increase after the outbreak of COVID-19 [85], [86], [87], [88], [89], [90], [91], [92]. However, the post-COVID-19 surge in sentiment suggested that the p̃_RET model did not accurately gauge event significance, a factor likely considered in 10-K filings. This limitation could arise from managers' ignorance of an event's impact at filing time or from the model's inability to quantify how much sentiment is driven by the event. Essentially, the model treated significant terms like "COVID-19" or "restrictions" merely as text, without evaluating their actual importance or sentiment contribution.

Secondly, the p̃_VOL model showed interesting points. The p̃_VOL model revealed trends in market volatility, with a sharp increase until around 2010, followed by a decline until around 2016, likely influenced by the 2007-2008 financial crisis and subsequently moderated by quantitative easing (QE) policies. Economists credited the Federal Reserve's monetary stimulus with mitigating the crisis's effects [85], [86], [93], [88], [87], [89], [94]. A similar volatility movement happened again during the outbreak of COVID-19. Furthermore, this p̃_VOL model showed a mirror-image (decalcomania-like) movement relative to p̃_LM, where increases in volatility seemed to be associated with decreases in p̃_LM and vice versa. Despite this, p̃_LM failed to accurately reflect significant market events, such as the 2008 financial crisis or the COVID-19 pandemic, showing no substantial change in its trend from 2018 to 2024. This lack of responsiveness suggested the p̃_LM model's inability to dynamically mirror the market, likely due to its training dataset being heavily skewed towards negative terms.

Table 5 showed the Pearson correlation coefficients of the filtered top10 sentiment model with the entire 10-K filings. All r values indicated a strong correlation with rigorous statistical significance. Compared to Figure , the top 10 firms revealed clearer correlations by removing the noise associated with the other 90 firms. A strong positive correlation of r = +.905 between p̃_RET and p̃_VOL indicated that higher sector positivity was linked with greater uncertainty. Additionally, r(p̃_RET, Stock) = +.956 showed that increases in sector stock prices correlated with more positive sentiment, and similarly, a strong positive correlation existed between p̃_VOL and stock prices, suggesting that stock prices rose with increased market uncertainty. Strong negative correlations of r(p̂_RET, p̂_LM) = −.922 and r(p̂_VOL, p̂_LM) = −.903 indicated that decreases in p̂_LM were associated with increased market sentiment and uncertainty, and vice versa. Also, p̂_LM had a strong negative correlation with Stock.

 The Top10 Sentiment Model with Only-Risk-Factor
The figure showed the top 10 firms' estimated sentiment scores when training on only the risk factor section. Compared to the 10-K top10 model (), the performance of this model () decreased slightly to 71% with a loss of 0.23 (p̂_RET) and 74% with a loss of 0.21 (p̂_VOL). This could be because the vocabulary of the risk factor section is smaller than that of the entire 10-K filings. The training time was accordingly reduced to 8 seconds. For the same reason, we observed that the p̂_RET model steadily increased. However, this model did not reflect the market situation well: it gradually increased during the technology industry boom and decreased only very slightly after the outbreak of COVID-19. Also, the p̂_VOL model showed market uncertainty falling steadily during the same period. The p̂_RET model dynamically, albeit minimally, reflected market changes, while the p̂_VOL model struggled due to its focus on the risk factor section, which is dominated by negative tones. When using the model trained on risk factors, we must remember that it might offer a pessimistic market view. Furthermore, the p̂_LM model indicated a slow, moderate shift towards negativity, showing that it lacks dynamic market responsiveness, similar to the patterns seen in the earlier p̂_LM model (). This negative bias distorted the analysis at the portfolio level.
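
The correlations reported for the portfolio models are Pearson coefficients computed on Kalman-filtered sentiment series. The sketch below illustrates that pipeline on made-up numbers. It assumes a scalar local-level (random-walk) Kalman filter with hypothetical noise variances, since the filter configuration is not given in this section, and it reports the usual t statistic for testing r against zero with n − 2 degrees of freedom. All function names, parameter values, and data are illustrative assumptions, not the study's implementation or results.

```python
import math
import random


def kalman_filter_1d(observations, process_var=1e-3, measurement_var=1e-1):
    """Scalar local-level Kalman filter; returns the filtered state estimates."""
    estimate = observations[0]  # initial state: the first observation
    error_var = 1.0             # initial uncertainty about the state
    filtered = []
    for z in observations:
        # Predict: the state follows a random walk, so only its variance grows.
        error_var += process_var
        # Update: blend the prediction with the new observation via the gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Synthetic annual data for 2006-2023, invented for illustration only.
random.seed(0)
n_years = 18
price = [100.0 + 12.0 * i for i in range(n_years)]
raw_score = [0.01 * i + random.gauss(0.0, 0.05) for i in range(n_years)]

smooth_score = kalman_filter_1d(raw_score)
r = pearson_r(smooth_score, price)
print(f"r = {r:+.3f}, t = {t_statistic(r, n_years):.2f}")
```

With only one observation per year, n is small, so a large |r| should be read together with its significance level; comparing the t statistic with a t distribution with n − 2 degrees of freedom gives the corresponding p-value.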

When examining the Pearson correlation Table , the correlations involving p̂_VOL differed from those of the 10-K top10 model (Table F.2). We observed that all of these correlations, namely r(p̂_VOL, p̂_RET) (= +.145), r(p̂_VOL, p̂_LM) (= .349), and r(p̂_VOL, Stock) (= +.185), became weak linear correlations. r(p̂_VOL, p̂_RET) and r(p̂_VOL, Stock) were even close to zero, implying that the correlation was lost. We could assume that the p̂_VOL score was not calculated correctly in the model trained with the risk factor section.

 Portfolio Most Influential Words:

 The Top10 Sentiment Model with 10-K Filing ()
The word sets extracted from the top10 10-K model seemed to indicate 'Electric cars'. The 10-K model includes words like roadster, musk, supercharger, cluster, roof, sedan, and dangerous. These words explicitly indicated 'Electric cars'. Notably, the same theme also appeared in the top10 risk-factor model discussed next. From this observation, we could infer that 'Electric cars' have a strong correlation with the technology sector return.

The 10-K word set that affected return negatively seemed to consist of generic words, which made it hard to identify a theme. However, S̃−_RET in the top10 risk factor model, discussed next, seemed to highlight a distinct theme.

The positive word set shared a common theme, 'Electric cars', with S̃+_RET in the top10 10-K model. The words indicating the topic include autonomous, neural, pilot, ford, berlin, and race. autonomous, neural, and pilot could refer to self-driving, while ford, berlin, and race could represent the automobile industry. From this observation, 'Electric cars' was also a topic that increased the technology market's uncertainty as well as the market's positivity.

Words from S̃−_VOL such as restrain, resilience, tester, and insecure could indicate the topic 'Stability and Resilience'. The continuous efforts (tester) of a tech firm to overcome technological difficulties (insecure, restrain) could improve the stability of the firm's stock (resilience), contributing to lower volatility.

 The Top10 Sentiment Model with Only-Risk-Factor
The word set including battery, solar, musk, screen, roadster, and motor could indicate 'Electric Cars'. Interestingly, this topic also appeared in the top10 10-K model as increasing both return and volatility, suggesting that investors should pay careful attention to the electric car industry. It could be the next alpha (α) signal to beat the market.

The theme 'Supply chain risk', for instance, could be revealed from words like subcontractor, entrust, dependence, and nationwide. Several studies showed that supply chain risk for high-technology industries increased severely due to geopolitical disruptions [95], [96], [97], [98]. subcontractor, entrust, and dependence could indicate firms' dependency (dependence) on others for their production and services. nationwide could represent supply chain risk occurring globally.

The following word sets extracted from the risk factor section increased or decreased the portfolio uncertainty.

S̃+_VOL could indicate the theme 'Innovation and Development', which increases market uncertainty. circuit and generating could symbolise generative AI and its semiconductor chip, the GPU. virtual could refer to the metaverse. transact could indicate transaction technology like blockchain or digital payment systems. Several studies found that firms' R&D investment in these fields, which represents innovation, has a strong positive relationship with volatility [99], [100], [101]. The 'Innovation' topic from the model supported the argument of these studies.

S̃−_VOL could refer to the theme 'Infrastructure and Services', which decreases market uncertainty. depot and host could indicate a data centre and its hosting service. Also, sea could be related to undersea cables for data transmission. club and membership can represent a social infrastructure (or network) for tech operations. Several studies found that digital and traditional infrastructure investments are key components in stabilising and driving forward economic growth as well as in reducing market uncertainty [102], [103], [104]. These studies support the model's suggestion that 'Infrastructure and Services' could decrease market uncertainty.

 Company Level (i.e. Nvidia)

 Nvidia Correlation Analysis:

 Nvidia Sentiment Model with 10-K filings
The figure represented Nvidia's predicted sentiment scores from the model trained with the entire 10-K filing. The p̂_RET model achieved 75% accuracy with a loss of 0.19, and the p̂_VOL model achieved 69% accuracy with a loss of 0.25. Our sentiment prediction model thus predicted a single firm's sentiment fairly well. Training took 7 seconds.

The p̂_RET model seemed to reveal a periodic pattern in the sentiment tone. The tone became dramatically more positive from 2006 to 2011 and markedly more negative from 2014 to 2020. In the same period, Nvidia's stock price remained about the same. Then, the tone sharply became positive. The stock was hugely affected by the outbreak of COVID-19 and rebounded due to the US government's aggressive QE policy and Nvidia's dominant position in the GPU market in the fourth industrial revolution. We assumed that the periodic rise and fall in Nvidia's sentiment might have stemmed from the firm's internal optimism and the market not responding to that optimism. Furthermore, the p̂_VOL model showed a movement similar to the p̂_RET model. This model could capture the aftermath of COVID-19, but not very well. As our model was not trained on 2024 data, neither p̂_VOL nor p̂_RET and p̂_LM could capture the current surge. Interestingly, p̂_LM showed a mirror-image (decalcomania-like) pattern relative to p̂_RET. Both could also capture COVID-19's aftermath.

The table revealed that all r values (i.e. all correlations) moved in the same direction as the correlations of the 10-K top10 model (Table F.2), but with lower correlation strength. For instance, Nvidia showed a moderate positive
correlation at r(p̃_RET, p̃_VOL) (= +.612), whereas the 10-K top10 model (Table F.2) showed a strong positive correlation at r(p̃_RET, p̃_VOL) (= +.905).

 Nvidia Sentiment Model with Only-Risk-Factor
The figure indicated Nvidia's predicted sentiment scores from the model trained with the risk factor section. Compared to the 10-K Nvidia sentiment model (), p̂_RET showed almost the same performance, and the accuracy of p̂_VOL increased by 6%. The prediction also took less time: 4 seconds, as fewer tokens were used (i.e. 2,201). In the same comparison between and , p̂_RET and p̂_VOL showed a similar movement, whereas the risk factor p̂_LM model showed a gradually falling movement. We assumed this was also because the p̂_LM model dataset was predominantly filled with negative words, as previously mentioned in .

The table revealed a meaningful correlation: compared to the 10-K Nvidia model, the risk factor model showed a slightly stronger correlation with more rigorous statistical significance in both r(p̂_VOL, p̂_RET) and r(p̂_VOL, Stock). The correlation results of the 10-K Nvidia model were not actually reliable, as almost all correlations lacked statistical significance, except for the RET and VOL pair. Note that the p̂_LM correlations in both models were less informative for our project; the movements of p̂_LM also mirrored Nvidia's stock behaviour inaccurately.

 Nvidia Most Influential Words:

 Nvidia Sentiment Model with 10-K filing
The following words were the most impactful words, positively or negatively, in Nvidia's p̂_RET sentiment score prediction.

The word set with a positive impact on Nvidia's sentiment could be interpreted in two categories. The first category could be 'Innovation and Development', a theme that also appeared in the sector analysis. Words such as talent, thermal, failing, strict, and connect could be related to a firm's innovation. talent could indicate that talented employees were drivers of innovation. thermal could convey innovative development of the GPU, which is Nvidia's main product, as temperature control is required for higher chip performance [105]. Also, words including chain, favour, exclusion, and tender could symbolise the second theme, 'GPU market domination'. tender could represent the tendering process, which is a key component of a supply chain management system. We could interpret Nvidia's GPU market domination as being caused by its innovation and strategic supply chain management, including tender processing, leading to its exclusive market position [106], [107].

The word set with a negative influence on return could be interpreted in various ways. ChatGPT4 ([108]) suggested that 'Market Challenges' could have a negative impact on Nvidia's sentiment. The relevant words for the theme included redemption, grid, initiative, petition, eastern, big, retrospective, branded, revolving, enactment, reducing, drone, joint, and pursuit. ChatGPT4 ([108]) argued that they reflected various operational and market challenges. For Nvidia, this could entail navigating energy regulations (grid), responding to legal and regulatory petitions, while managing its brand in a competitive market (big, eastern) by adapting to laws affecting its business model (initiative) or product offerings (enactment).

We could interpret 'Nvidia, Venture Capitalist' from words such as starting, attempt, and derive. Nvidia is currently ramping up its venture investment [109], [110], [111]. Venture investment in a startup (starting) that attempts to derive innovation could be risky and could subsequently increase volatility.

The following words mostly influenced the reduction of Nvidia's market uncertainty.

Words such as virus, grown, returned, settle, near, and occurrence could refer to 'Stability from Alleviating the Aftermath of COVID-19'. They may hint at recovery (settle, returned, near, occurrence) from setbacks (virus, grown), signalling a reduction in volatility.

 Nvidia Sentiment Model with Only-Risk-Factor ()
The following positive word set seemed to give more accurate information for identifying a positive topic for return.

For example, the word set including card, mineral, original, thermal, chain, and intangible could symbolise the 'Graphics Processing Unit (GPU)', which has been the cash cow for Nvidia. These words seemed to be directly indicative of the GPU. card could represent a graphics card. mineral could indicate core raw materials for GPU manufacturing [112]. chain could refer to the supply chain of Nvidia, which had a dominant position in the GPU market. thermal could indicate the innovative development of the GPU. original and intangible could indicate the originality of Nvidia's GPU and its intellectual property.

Note that the words were extracted from the risk factor section, where negative words were prevalent. This section contained valuable information for analysing firms' risks, so we assume that the risk section could offer more informative insight for analysing their negative sentiment. From words like console and video, we could deduce the 'Video Game Industry' topic and its negative impact on return. Nvidia's actual historical revenue data supported our interpretation: the gaming segment's revenue has fallen [113].

The theme 'Regulation and social responsibility on AI' emerged from words including social, responsible, severe, restricted, and procedure. Given that Environmental, Social, and Governance (ESG) ratings effectively influence return and volatility, ESG scores have become important financial indicators [114], [115], [116]. Consequently, we could interpret the word set as pointing to the importance (severe) of self-regulation (restricted) in its product manufacturing procedure (procedure) as part of its social responsibility (social, responsible) on AI.

The words relevant to the 'External risks' theme included worm, hacker, fault, occurrence, confidential, and geographic. Cyber security, as an external risk, could be
deduced from words like worm, hacker, fault, occurrence, and confidential. Also, geographic could indicate a geopolitical issue, as the US government is restricting exports of cutting-edge AI chips to China [117]. While cyber security or geopolitical issues could heighten volatility, we could interpret that, in the context of low volatility, Nvidia's effective risk management strategies mitigated its volatility. However, note that the model could be biased, as managers' biases were reflected in the training data, the risk factor section. Just because managers' statements on those issues were positive does not mean that the risks disappeared [118].

VI. CONCLUSIONS

A. Conclusions
Our suggested models estimated sentiment scores at three stakeholder levels: sector level, portfolio level, and company level. Each model was trained with either the entire 10-K filings or the risk factor section.

At the sector level, the sector model trained with the entire 10-K provided more informative sentiments than the risk-factor-centred sector model. The 10-K model outperformed the risk factor model for both the return-labelled sentiment (i.e. p̃_RET) and the volatility-labelled sentiment (i.e. p̂_VOL). The accuracies for p̂_RET and p̂_VOL were higher by 5% and 2%, respectively. The 10-K model took more time to train, though. Notably, p̂_VOL in the 10-K sector model showed periodic fluctuations in volatility, whereas p̂_RET fluctuated less. In terms of correlation analysis, the 10-K sector model showed a stronger correlation than the risk-factor sector model. In the qualitative analysis of the most impactful words, both models seemed to exhibit similar topics. The baseline model (i.e. p̂_LM) showed distorted performance compared to both the 10-K sector model and the risk factor sector model. This was because the baseline model was biased towards the negative, as its training set was predominantly filled with negative tokens. It is noteworthy that the Kalman filter should be applied in sector analysis to control noise.

At the portfolio level, the model based on the entire 10-K generally revealed more informative sentiments than the model with the risk factor. The 10-K portfolio model outperformed the risk factor portfolio model for both p̃_RET and p̃_VOL. The accuracies for p̃_RET and p̃_VOL were higher by 5% and 4%, respectively. Training took longer than for the risk-factor model. In the correlation analysis, the 10-K model depicted a far stronger correlation for all pairs. In the 10-K portfolio model, p̃_RET strongly mirrored the top 10 portfolio stock price movement. p̃_RET seemed to show no significant response to external macroeconomic factors such as the 2008 financial crisis and COVID-19. On the other hand, p̃_VOL revealed trends in volatility in response to the external factors. In the influential word analysis, the risk factor portfolio model seemed to highlight a distinct topic. Thus, we recommended employing two models (i.e. both the 10-K one and the risk factor one) for a portfolio-level analysis. Again, it is worth noting that the Kalman filter was also required for portfolio analysis.

At the company level, the company-specific model using risk factors yielded sentiments with greater informational value than the model focusing on the complete 10-K documents. The risk factor model outperformed the 10-K-centred model for both p̂_RET and p̂_VOL. The accuracy levels for p̂_RET and p̂_VOL increased by 1% and 6%, respectively. The training time was shorter. For both the risk-factor company model and the 10-K company model, p̂_RET and p̂_VOL exhibited a similar movement. In the correlation analysis, the model centred on risk factors demonstrated a marginally higher correlation, accompanied by more robust statistical significance. In the most influential word analysis, both models provided distinct and detailed words that made the themes identifiable. Hence, at the company level, we advised using the risk-factor-centred model for trend and correlation analysis while suggesting the use of both models for word analysis.

VII. LIMITATIONS AND FUTURE WORKS

Our suggested models cannot currently capture meaningful multi-word phrases, as the model essentially identifies sentiment-charged words based on a bag of words. For instance, the model will split the phrase 'chief executive officer (CEO)' into separate words and then evaluate whether each word is a sentiment-charged word with respect to the dependent variables. In this word-based separation, the phrase loses its original meaning, CEO. Hence, in the future, we suggest adding a function that brings contextual information into the model; bi-grams or n-grams are examples.

Industrial sentiment score prediction does not consider the allocation proportions of the QQQ portfolio. In the QQQ ETF, the top 10 firms accounted for around 45% of the total allocation in 2023, and the remaining 90 firms accounted for the remaining 55%. The portfolio is reconstructed annually. So, to predict a robust industry-level sentiment score, the scores should take the portfolio allocation proportions into account. In our industrial sentiment trend analysis, however, all sentiment scores are weighted equally, as the portfolio rebalancing data is difficult to obtain. For instance, the return-labelled sentiment score of Apple Inc. in 2023 is 0.006, whereas Amazon.com Inc.'s sentiment score in 2023 is −0.009. We used these scores without weighting them by their portfolio proportions. In 2023, QQQ allocated 9.22% to Apple and 4.83% to Amazon. Thus, the weighted sentiment scores (i.e. 0.05532 = 0.006 * 9.22 for Apple and −0.04347 = −0.009 * 4.83 for Amazon) are required for a robust sentiment score calculation. In the future, we suggest that the sentiment score at a given date should consider the portfolio allocation weights at the same date.

Our model cannot adapt to firms' latest return or volatility, because the model calculation only works with filings as of their publication date. In other words, the predicted sentiment score is an annual data point at the date the filing is released. If more recent textual information representing a firm were available, our model would generate more recent sentiment scores. We suggest using the 10-Q filing, which is a comprehensive report of a firm like the 10-K but must be submitted quarterly.

REFERENCES

[1]. Z. Ke, B. T. Kelly, and D. Xiu, “Predicting returns with text data,” University of Chicago, Becker Friedman Institute for Economics Working Paper, no. 2019-69, September 2020, Yale ICF Working Paper No. 2019-10, Chicago Booth Research Paper No. 20-37. [Online]. Available: https://ssrn.com/abstract=3389884
[2]. R. P. Schumaker and H. Chen, “Textual analysis of stock market prediction using breaking financial news: The azfin text system,” ACM Transactions on Information Systems, vol. 27, no. 2, p. 12, Mar 2009. [Online]. Available: https://doi.org/10.1145/1462198.1462204
[3]. D. Shah, H. Isah, and F. Zulkernine, “Predicting the effects of news sentiments on the stock market,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 4705–4708.
[4]. J. Maqbool, P. Aggarwal, R. Kaur, A. Mittal, and I. A. Ganaie, “Stock prediction by integrating sentiment scores of financial news and mlp-regressor: A machine learning approach,” Procedia Computer Science, vol. 218, pp. 1067–1078, 2023. [Online]. Available: https://doi.org/10.1016/j.procs.2023.01.086
[5]. Z. Wang, Z. Hu, F. Li et al., “Learning-based stock trending prediction by incorporating technical indicators and social media sentiment,” Cognitive Computation, vol. 15, pp. 1092–1102, 2023, received September 15, 2022; accepted February 9, 2023; published March 9, 2023; issue date May 1, 2023. [Online]. Available: https://doi.org/10.1007/s12559-023-10125-8
[6]. A. Glodd and D. Hristova, “Extraction of forward-looking financial information for stock price prediction from annual reports using nlp techniques,” in Proceedings of the 56th Hawaii International Conference on System Sciences, 2023, p. 10, rights: Attribution NonCommercial-NoDerivatives 4.0 International. [Online]. Available: https://hdl.handle.net/10125/103313
[7]. United States Securities and Exchange Commission, “Form 10-k,” 2024. [Online]. Available: https://www.sec.gov/files/form10-k.pdf
[8]. S. Asthana and S. Balsam, “The effect of edgar on the market reaction to 10-k filings,” Journal of Accounting and Public Policy, vol. 20, no. 4-5, pp. 349–372, 2001. [Online]. Available: https://doi.org/10.1016/S0278-4254(01)00035-7
[9]. H. You and X. Zhang, “Financial reporting complexity and investor underreaction to 10-k information,” Review of Accounting Studies, vol. 14, pp. 559–586, 2009. [Online]. Available: https://doi.org/10.1007/s11142-008-9083-2
[10]. C. Kim, K. Wang, and L. Zhang, “Readability of 10-k reports and stock price crash risk,” Contemporary Accounting Research, vol. 36, pp. 1184–1216, 2019. [Online]. Available: https://doi.org/10.1111/1911-3846.12452
[11]. S. Blomme and J. Dedeyne, “Predicting the effect of 10-k, 10-q and 8-k company reports on abnormal stock returns using finbert nlp methods,” University of Ghent, Tech. Rep., 2020. [Online]. Available: https://libstore.ugent.be/fulltxt/RUG01/002/837/812/RUG01-002837812 2020 0001 AC.pdf
[12]. K. R. Jønsson and J. B. Jako, “Predicting stock performance using 10-k filings: A natural language processing approach employing convolutional neural networks,” Copenhagen Business School, Tech. Rep., 2020.
[13]. “Securities and exchange commission final rule, release no. 33–8591(fr-75),” http://sec.gov/rules/final/33-8591.pdf, 2005.
[14]. R. Watts and J. Zimmerman, Positive accounting theory. Englewood Cliffs, NJ: Prentice Hall, 1986.
[15]. T. Scott, “Incentives and disincentives for financial disclosure: Voluntary disclosure of defined benefit pension plan information by canadian firms,” The Accounting Review, vol. 69, pp. 26–43, 1994.
[16]. T. Fields, T. Lys, and L. Vincent, “Empirical research on accounting choice,” Journal of Accounting and Economics, vol. 31, pp. 255–307, 2001.
[17]. S. P. Kothari, X. Li, and J. Short, “The effect of disclosures by management, analysts, and business press on cost of capital, return volatility, and analyst forecasts: a study using content analysis,” The Accounting Review, vol. 84, pp. 1639–1670, 2009.
[18]. S. P. Kothari, S. Shu, and P. Wysocki, “Do managers withhold bad news?” Journal of Accounting Research, vol. 47, pp. 241–276, 2009.
[19]. S. Chang, “Risk factor disclosures and ceo overconfidence,” College of Business Administration Seoul National University, 2019.
[20]. Reuters, “‘refco risks boiler-plate disclosure.’ By scott malone,” http://w4.stern.nyu.edu/news/news.cfm?doc id=5094, 2005.
[21]. SEC, “Form 10-k instructions,” http://www.sec.gov/about/forms/Form10-k.pdf, 2010.
[22]. C. Magazine, “Sec pushes companies for more risk information,” August 2, 2010, 2010.
[23]. D. Kingsley, M. Solomon, and K. Jaconi, “Sec risk factor disclosure rules,” https://corpgov.law.harvard.edu/2021/12/22/Sec-risk-factor-disclosure-rules/, 2021.
[24]. H. M. Peirce, “Sec harming investors and helping hackers: Statement on cybersecurity risk management, strategy, governance, and incident disclosure,” 2023.
[25]. J. L. Campbell, H. Chen, D. Dhaliwal, H. Lu, and L. Steele, “The information content of mandatory risk factor disclosures in corporate filings,” Review of Accounting Studies, vol. 19, no. 1, pp. 396–455, 2014.
[26]. R. Israelsen, “Tell it like it is: disclosed risks and factor portfolios,” 2014, working paper.
[27]. V. Song, H. Cavusoglu, G. M. Lee, and M. L. Z. Ma, “It risk factor disclosure and stock price crashes,” in Proceedings of the 53rd Hawaii International Conference on System Sciences, 2020.
[28]. O. Hope, D. Hu, and H. Lu, “The benefits of specific risk-factor disclosures,” Review of Accounting Studies, 2016, forthcoming.
[29]. T. Kravet and V. Muslu, “Textual risk disclosures and investors’ risk perceptions,” Review of Accounting Studies, vol. 18, no. 4, pp. 1088–1122, 2013.

[30]. J. Sha, “A data pipeline framework for automated extraction of risk factors from sec-10k filings,” School of Informatics, 2023.
[31]. J. Smailović, M. Grčar, N. Lavrač, and M. Žnidaršič, “Predictive sentiment analysis of tweets: A stock market application,” in Proc. Int. Workshop Hum.-Comput. Interact. Knowl. Discovery Complex Unstructured Big Data, 2013, pp. 77–88.
[32]. R. Ren, D. D. Wu, and T. Liu, “Forecasting stock market movement direction using sentiment analysis and support vector machine,” IEEE Systems Journal, vol. 13, no. 1, pp. 760–770, Mar 2019.
[33]. N. Jegadeesh and D. Wu, “Word power: A new approach for content analysis,” Journal of Financial Economics, vol. 110, pp. 712–729, 2013.
[34]. C. Kearney and S. Liu, “Textual sentiment in finance: A survey of methods and models,” International Review of Financial Analysis, vol. 33, pp. 171–185, 2014.
[35]. Y. Guo and L. Zhou, “Textual tone in corporate financial disclosures: A survey of the literature,” International Journal of Disclosure and Governance, vol. 17, pp. 101–110, 2020, received October 25, 2019; published June 27, 2020; issue date September 1, 2020.
[36]. S. Che et al., “Anticipating corporate financial performance from ceo letters utilizing sentiment analysis,” Mathematical Problems in Engineering, vol. 2020, 2020.
[37]. F. Li, “The information content of forward-looking statements in corporate filings—a naïve bayesian machine learning approach,” Journal of Accounting Research, vol. 48, pp. 1049–1102, 2010.
[38]. P. Tetlock, “Giving content to investor sentiment: The role of media in the stock market,” Journal of Finance, vol. 62, pp. 1139–1168, 2007.
[39]. J. Engelberg, “Costly information processing: Evidence from information announcements,” in AFA 2009 San Francisco Meetings Paper, 2008, available at SSRN: http://ssrn.com/abstract=1107998.
[40]. J. Engelberg, A. Reed, and M. Ringgenberg, “How are shorts informed? Short sellers, news, and information processing,” Journal of Financial Economics, vol. 105, no. 2, pp. 260–278, 2012.
[41]. S. Ferris, G. Hao, and M. Liao, “The effect of issuer conservatism on ipo pricing and performance,” Review of Finance, vol. 7, no. 3, pp. 993–1027, 2013.
[42]. E. Henry and A. J. Leone, “Measuring qualitative information in capital markets research,” April 2009, available at SSRN: https://ssrn.com/abstract=1470807 or http://dx.doi.org/10.2139/ssrn.1470807.
[43]. A. K. Davis, W. Ge, D. Matsumoto, and J. L. Zhang, “The effect of manager-specific optimism on the tone of earnings conference calls,” January 13 2014, CAAA Annual Conference 2012, available at SSRN: https://ssrn.com/abstract=1982259 or http://dx.doi.org/10.2139/ssrn.1982259.
[44]. J. Doran, D. Peterson, and S. Price, “Earnings conference call content and stock price: The case of reits,” Journal of Real Estate Finance and Economics, vol. 45, no. 2, pp. 402–434, 2010.
[45]. X. Huang, S. H. Teoh, and Y. Zhang, “Tone management,” August 21 2013, The Accounting Review, forthcoming, available at SSRN: https://ssrn.com/abstract=1960376 or http://dx.doi.org/10.2139/ssrn.1960376
[46]. N. J. Ferguson, D. Philip, H. Lam, and J. M. Guo, “Media content and stock returns: The predictive power of press,” pp. 1–31, May 27 2015, available at SSRN: https://ssrn.com/abstract=2611046.
[47]. H. Chen, P. De, Y. J. Hu, and B.-H. Hwang, “Wisdom of crowds: The value of stock opinions transmitted through social media,” Review of Financial Studies (RFS), forthcoming, available at SSRN: https://ssrn.com/abstract=1807265 or http://dx.doi.org/10.2139/ssrn.1807265.
[48]. T. Loughran and B. McDonald, “Ipo first-day returns, offer price revisions, volatility, and form s-1 language,” Journal of Financial Economics, vol. 109, pp. 307–326, 2013. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0304405X13000603
[49]. A. H. Huang, A. Zang, and R. Zheng, “Evidence on the information content of text in analyst reports,” Accounting Review, April 2014, forthcoming. [Online]. Available: https://ssrn.com/abstract=1888724
[50]. N. Li, X. Liang, X. Li, C. Wang, and D. D. Wu, “Network environment and financial risk using machine learning and sentiment analysis,” Human and Ecological Risk Assessment: An International Journal, vol. 15, no. 2, pp. 227–252, 2009.
[51]. S. Shuhidan, S. Hamidi, S. Kazemian, S. Shuhidan, and M. Ismail, “Sentiment analysis for financial news headlines using machine learning algorithm,” in Proceedings of the 7th International Conference on Kansei Engineering and Emotion Research 2018. KEER 2018. Advances in Intelligent Systems and Computing, vol. 739. Singapore: Springer, 2018.
[52]. G. Wang, T. Wang, B. Wang, D. Sambasivan, Z. Zhang, H. Zheng, B. Zhao, and S. Barbara, “Crowds on wall street: Extracting value from collaborative investing platforms,” 03 2015.
[53]. S. Sohangir, D. Wang, A. Pomeranets, and T. Khoshgoftaar, “Big data: deep learning for financial sentiment analysis,” Journal of Big Data, vol. 5, no. 1, p. 3, Dec 2018.
[54]. IBM, “Deep learning,” https://www.ibm.com/topics/deep-learning, accessed: 2024-04-01.
[55]. L. Zhang, S. Wang, and B. Liu, “Deep learning for sentiment analysis: A survey,” WIREs Data Mining and Knowledge Discovery, vol. 8, no. e1253, 2018.
[56]. D. Tang, B. Qin, and T. Liu, “Document modeling with gated recurrent neural network for sentiment classification,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.
[57]. K. S. Tai, R. Socher, and C. Manning, “Improved semantic representations from tree-structured long

short-term memory networks,” http://arxiv.org/abs/1503.00075, 2015.
[58]. Y. Kim, “Convolutional neural networks for sentence classification,” http://arxiv.org/abs/1408.5882, 2014.
[59]. X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[60]. R. Johnson and T. Zhang, “Deep pyramid convolutional neural networks for text categorization,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Long Papers), vol. 1, 2017, pp. 562–570.
[61]. Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[62]. K. Mishev, A. Gjorgjevikj, I. Vodenska, L. Chitkushev, and D. Trajanov, “Evaluation of sentiment analysis in finance: From lexicons to transformers,” IEEE Access, vol. 8, pp. 131 662–131 682, 2020.
[63]. M. Rizinski, H. Peshov, K. Mishev, M. Jovanovik, and D. Trajanov, “Sentiment analysis in finance: From transformers back to explainable lexicons (xlex),” IEEE Access, vol. 12, pp. 7170–7198.
[64]. J. Dobson, “On reading and interpreting black box deep neural networks,” International Journal of Digital Humanities, vol. 5, pp. 431–449, 2023.
[65]. J. Hering, “The annual report algorithm: Retrieval of financial statements and extraction of textual information,” Friedrich-Alexander-Universität Erlangen-Nürnberg, Lange Gasse 20, 90403 Nuremberg, Germany, 11 2016, first version: October 4, 2016. Current version: November 28, 2016.
[66]. S. Alizadeh, M. W. Brandt, and F. X. Diebold, “Range-based estimation of stochastic volatility models,” The Journal of Finance, vol. 57, no. 3, pp. 1047–1091, 2002.
[67]. A. J. Patton, “Volatility forecast comparison using imperfect volatility proxies,” Journal of Econometrics, vol. 160, no. 1, pp. 246–256, 2011.
[68]. F. Corsi, “A simple approximate long-memory model of realized volatility,” Journal of Financial Econometrics, vol. 7, no. 2, pp. 174–196, 2009.
[69]. S. Borovkova and P. Lammers, “Sector news sentiment indices,” available at SSRN 3080318, 2017.
[70]. J. Durbin and S. J. Koopman, Time series analysis by state space methods. OUP Oxford, 2012, vol. 38.
[71]. Statistics Solutions, “Pearson correlation assumptions,” 2023, accessed: 24-March-2024. [Online]. Available: https://www.statisticssolutions.com/pearson-correlation-assumptions/
[72]. Newcastle University, “Critical region and confidence interval,” https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/hypothesis-testing/critical-region-and-confidence-interval.html, 2024, accessed: 24-03-2024.
[73]. D. J. Benjamin, J. Berger, M. Johannesson, B. A. Nosek, E. Wagenmakers, R. Berk et al., “Redefine statistical significance,” Jul 2017, https://doi.org/10.31234/osf.io/mky9j.
[74]. M. B. Editor. (2014) How to correctly interpret p values. Topics: Hypothesis Testing. [Online]. Available: https://blog.minitab.com/en/Adventures-in-statistics-2/how-to-correctly-interpret-p-values
[75]. J. Frost. (2014) Why are p values misinterpreted so frequently? [Online]. Available: https://statisticsbyjim.com/hypothesis-testing/p-values-misinterpreted/
[76]. A. S. Association, “American statistical association releases statement on statistical significance and p-values,” 2016, for more information: Ron Wasserstein, [email protected]. [Online]. Available: https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf
[77]. J. Park. (2016) Don’t be overwhelmed by p-value. [Online]. Available: https://boxnwhis.kr/2016/04/15/dont be overwhelmed by pvalue.html
[78]. J. J. Filzen, “The information content of risk factor disclosures in quarterly reports,” Accounting Horizons, vol. 29, no. 4, pp. 887–916, Dec 2015. [Online]. Available: https://doi.org/10.2308/acch-51175
[79]. L. Chan and N. Devia-Valbuena. (2023, Dec.) In the global rush for lithium, bolivia is at a crossroads. Accessed: 2024-03-26. [Online]. Available: https://www.usip.org/publications/2023/12/Global-rush-lithium-bolivia-crossroads
[80]. R. S. Koijen, T. J. Philipson, and H. Uhlig, “Financial health economics,” Econometrica, vol. 84, pp. 195–242, 2016.
[81]. J. Kastrenakes. (2017, Dec.) Google’s project tango is shutting down because arcore is already here. Accessed: 2024-03-26. [Online]. Available: https://www.theverge.com/2017/12/15/16782556/Project-tango-google-shutting-down-drcore-augmented-reality
[82]. J. D’Onfro. (2017, Dec.) Google is going to shut down its once-hyped augmented reality project, tango. Accessed: 2024-03-26. [Online]. Available: https://www.cnbc.com/2017/12/15/Google-kills-project-tango-ar-project.html
[83]. M. Majeed and M. Mazhar, “Environmental degradation and output volatility: A global perspective,” Pakistan Journal of Commerce and Social Science, vol. 13, pp. 180–208, 03 2019.
[84]. M. Majeed, M. Mazhar, and S. Sabir, “Environmental quality and output volatility: the case of south asian economies,” Environmental Science and Pollution Research, vol. 28, pp. 31 276–31 288, 2021. [Online]. Available: https://doi.org/10.1007/s11356-021-12659-6
[85]. J. Gagnon, M. Raskin, J. Remache, and B. Sack, “Large-scale asset purchases by the federal reserve:

Did they work?” Federal Reserve Bank of New York, Staff Report 441, March 2010.
[86]. H. Chen, V. Cúrdia, and A. Ferrero, “The macroeconomic effects of large-scale asset purchase programs,” Federal Reserve Bank of New York, Staff Report 527, December 2011.
[87]. V. Cúrdia, “How stimulatory are large-scale asset purchases?” FRBSF Economic Letter, 08 2013.
[88]. S. Gilchrist and E. Zakrajšek, “The impact of the federal reserve’s large-scale asset purchase programs on corporate credit risk,” Journal of Money, Credit and Banking, vol. 45, pp. 29–57, 2013.
[89]. G. Wang, “The effects of quantitative easing announcements on the mortgage market: An event study approach,” International Journal of Financial Studies, vol. 7, no. 1, p. 9, 2019. [Online]. Available: https://doi.org/10.3390/ijfs7010009
[90]. S. Sunder, “How did the u.s. stock market recover from the covid-19 contagion?” Mind Soc, vol. 20, pp. 261–263, 2021. [Online]. Available: https://doi.org/10.1007/s11299-020-00238-0
[91]. Ü. Seven and F. Yılmaz, “World equity markets and covid-19: Immediate response and recovery prospects,” Research in International Business and Finance, vol. 56, p. 101349, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0275531920309570
[92]. P.-O. Gourinchas, Ş. Kalemli-Özcan, V. Penciakova, and N. Sander, “Fiscal policy in the age of covid: Does it ‘get in all of the cracks?’,” National Bureau of Economic Research, Working Paper 29293, September 2021. [Online]. Available: http://www.nber.org/papers/w29293
[93]. J. C. Stein, “Evaluating large-scale asset purchases,” October 2012. [Online]. Available: https://www.federalreserve.gov/newsevents/speech/stein20121011a.htm
[94]. S. Luck and T. Zimmermann, “Ten years later-did qe work?” May Org/2019/05/ten-years-laterdid-qe-work/
[95]. The Economist, “America’s war on huawei nears its endgame,” The Economist, July 2020. [Online]. Available: https://www.economist.com/Briefing/2020/07/16/americas-war-on-huawei-nears-its-endgame
[96]. C. Ting-Fang and L. Li, “Us-china tech war: Beijing’s secret chipmaking champions,” Nikkei Asia, May 2021. [Online]. Available: https://asia.nikkei.com/Spotlight/The-Big-Story/US-China-tech-war-Beijing-s-secret-chipmaking-champions
[97]. J. Lewis, Learning the superior techniques of the barbarians: China’s pursuit of semiconductor independence. Washington: CSIS, 2019.
[98]. D. Wu, Y. Lee, and Y. Ngui, “Chip shortage set to worsen as covid rampages through malaysia,” Bloomberg, August 2021. [Online]. Available: https://www.bloomberg.com/news/articles/2021-08-23/chip-shortage-set-to-worsen-as-covid-rampages-through-malaysia?sref=TtblOutP
[99]. M. Mazzucato and M. Tancioni, “R&d, patents and stock return volatility,” J Evol Econ, vol. 22, pp. 811–832, 2012. [Online]. Available: https://doi.org/10.1007/s00191-012-0289-x
[100]. S. Gharbi, J.-M. Sahut, and F. Teulon, “R&d investments and high-tech firms’ stock return volatility,” Technological Forecasting and Social Change, vol. 88, pp. 306–312, 2014. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0040162513002643
[101]. B. Hai, Q. Gao, X. Yin, and J. Chen, “R&d volatility and market value: the role of executive overconfidence,” Chinese Management Studies, vol. 14, no. 2, pp. 411–431, 2020. [Online]. Available: https://doi.org/10.1108/CMS-05-2019-0170
[102]. M. Brinkman and V. Sarma, “Infrastructure investing will never be the same,” McKinsey & Company, 8 2022, article.
[103]. F. Blanc-Brude, W. Schmundt, T. Bumberger, R. Friedrich, B. Georgii, A. Gupta, L. Lum, and M. Wilms, “Infrastructure strategy 2022: A pivot to the digital frontier,” Boston Consulting Group, 3 2022, article.
[104]. L. Gunnion, “Infrastructure investment: An economist’s view from the ground up,” 7 2021, article.
[105]. A. Prakash, H. Amrouch, M. Shafique, T. Mitra, and J. Henkel, “Improving mobile gaming performance through cooperative cpu-gpu thermal management,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2897937.2898031
[106]. K. Leswing, “Meet the $10,000 nvidia chip powering the race for a.i.” https://www.cnbc.com/2023/02/23/nvidias-a100-is-the-10000-chip-powering-the-race-for-ai-.html, 2023, accessed: 2024-03-27.
[107]. “Nvidia industries: Retail - supply chain management,” https://www.nvidia.com/en-gb/industries/retail/supply-chain-management/, 2024, accessed: 2024-03-27.
[108]. OpenAI, “Chatgpt (4) [large language model],” https://chat.openai.com, 2024.
[109]. “Nvidia for startups: Venture capital,” https://www.nvidia.com/en-gb/startups/venture-capital/, 2024, accessed: 2024-03-27.
[110]. “Nvidia investments,” https://blogs.nvidia.com/blog/nvidia-investments/, 2024, accessed: 2024-03-27.
[111]. PYMNTS, “Nvidia invests in 35 ai companies in 2023,” https://www.pymnts.com/artificial-intelligence-2/2023/nvidia-invests-in-35-ai-companies-in-2023/, December 2023, accessed: 2024-03-27.
[112]. Euromines, “Key value chain electronics euromines,” 2020. [Online]. Available: https://euromines.org/files/key_value_chain_electronics_euromines_final.pdf
[113]. Nvidia, “2023 annual report,” 2023. [Online]. Available: https://s201.q4cdn.com/141608511/files/doc_financials/202 2023-Annual-Report-1.pdf
[114]. M. La Torre, F. Mango, A. Cafaro, and S. Leo, “Does the esg index affect stock return? evidence from the

eurostoxx50,” Sustainability, vol. 12, no. 16, p. 6387, 2020. [Online]. Available: https://doi.org/10.3390/su12166387
[115]. G. Capelle-Blancard and A. Petit, “Every little helps? esg news and stock market reaction,” Journal of Business Ethics, vol. 157, pp. 543–565, 2019. [Online]. Available: https://doi.org/10.1007/s10551-017-3667-3
[116]. G. Giese, L.-E. Lee, D. Melas, Z. Nagy, and L. Nishikawa, “Foundations of esg investing: How esg affects equity valuation, risk, and performance,” The Journal of Portfolio Management, vol. 45, pp. 69–83,
[117]. D. Sevastopulo and Q. Liu, “Us-china doomsday scenario not likely to happen, says nvidia’s jensen huang,” Financial Times, 10 2023. [Online]. Available: https://www.ft.com/content/
[118]. Y. Yu, “Us-china doomsday scenario not likely to happen, says nvidia’s jensen huang,” Nikkei Asian Review, 03 2024. [Online]. Available: https://asia.nikkei.com/Business/Tech/Semiconductors/U.S.-China-doomsday-scenario-not-likely-to-happen-Nvidia-s-Jensen-Huang
[119]. Securities and Exchange Commission, “Form 10-k,” https://www.sec.gov/files/form10-k.pdf, 2024, Annual Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934, OMB Number: 3235-0063, Expires: December 31, 2026, Estimated average burden hours per response: 2,249.36.
[120]. Wiki, “Form 10-k,” https://en.wikipedia.org/wiki/Form_10-K, 2024,
[121]. Analytics Vidhya, “Stemming vs lemmatization in nlp: Must know differences,” 2022, accessed: 22-03-2024.
[122]. D. Garcia, “Sentiment during recessions,” The Journal of Finance, vol. 68, no. 3, pp. 1267–1300, 2013.
[123]. T. Loughran and B. McDonald, “When is a liability not a liability? textual analysis, dictionaries, and 10-ks,” The Journal of Finance, vol. 66, no. 1, pp. 35–65, 2011.
[124]. Wikipedia, “Pearson correlation coefficient - Wikipedia, the free encyclopedia,” 2024, [Online; accessed 24-March-2024]. [Online]. Available: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

APPENDIX


 Part 1

 Item 1: Business - This section describes the business of the company: its main products or services, subsidiaries, and markets
in which it operates. It may also contain recent events, competition, regulation, and labour problems.
 Item 1A: Risk Factors - This section provides risks and uncertainties, likely external effects, or possible failures that could
affect the company's financial performance. Risk factors are generally listed in order of importance.
 Item 1B: Unresolved Staff Comments - This section offers an explanation of any issues raised by the SEC staff on the previous
reports if these issues have not been resolved afterwards.
 Item 2: Properties - This section only lays out the company's significant physical properties, not intellectual or intangible
property.
 Item 3: Legal Proceedings - This section discloses any significant ongoing lawsuit or other legal proceeding.
 Item 4 - [RESERVED]

 Part 2

 Item 5: Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- This section discloses the company's stock market performance, dividends, and repurchases of its own stock.
 Item 6 - [RESERVED]
 Item 7: Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A) - This section
presents management's discussion of the firm's financial performance, challenges, opportunities, and future outlook.
 Item 7A: Quantitative and Qualitative Disclosures about Market Risks - This section shows the company's exposure to
market risks.
 Item 8: Financial Statements and Supplementary Data - This section offers the audited financial statements. It contains the
balance sheet, income statement, cash flow statement, and footnotes.
 Item 9: Changes in and Disagreements with Accountants on Accounting and Financial Disclosure - Companies address any
changes in or disagreements with their accountants over financial reporting.
 Item 9A: Controls and Procedures - This section details the company's procedures for disclosure controls and its internal
control mechanisms for financial reporting.
 Item 9B: Other Information - This section offers additional information that does not align with the content of other sections.

 Part 3

 Item 10: Directors, Executive Officers and Corporate Governance – This section delves into the specifics of the company’s
leadership, their respective roles, and the practices in place for corporate governance.
 Item 11: Executive Compensation – This section addresses compensation policies and programmes, and the compensation of
top executives.
 Item 12: Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters – This
section reports the ownership of the company’s stock by major shareholders and insiders.
 Item 13: Certain Relationships and Related Transactions, and Director Independence – This section encompasses
transactions involving directors, executives, and their affiliates, along with details regarding the independence of directors.
 Item 14: Principal Accountant Fees and Services – The section details the fees charged by the company’s auditors for their
services.

 Part 4

 Item 15: Exhibits, Financial Statement Schedules – This section contains a list of the financial statements and exhibits.[119],
[120]

 10-K Filing (e.g Nvidia 2010-03-18)

 [heading]Our Company[/heading]

NVIDIA Corporation helped awaken the world to the power of computer graphics when it invented the graphics processor
unit, or GPU, in 1999. Expertise in programmable GPUs has led to breakthroughs in parallel processing which make supercomputing
inexpensive and widely accessible.

[heading]ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS[/heading]
The following discussion and analysis of our financial condition and results of operations should be read in conjunction with
"Item 1A. Risk Factors", "Item 6. Selected Financial Data", our Consolidated Financial Statements and related Notes thereto

[heading]Overview[/heading] [heading]Our Company[/heading] NVIDIA Corporation helped awaken the world to the power
of computer graphics when it invented the graphics processor unit, or GPU, in 1999. Expertise in programmable GPUs has led to
breakthroughs in parallel processing which make supercomputing inexpensive and widely accessible. We serve the entertainment
and consumer market with our GeForce graphics products, the professional design and visualization market with our Quadro
graphics products, the high-performance computing market with our Tesla computing solutions products, and the mobile computing
market with our Tegra system-on-a-chip products.

 Risk Factor Section (e.g., Nvidia, 2023-02-24)

 [title]Risk Factors Summary[/title]

 [heading]Risks Related to Our Industry and Markets[/heading]

Failure to meet the evolving needs of our industry and markets may adversely impact our financial results. Competition in our
current and target markets could cause us to lose market share and revenue.

 [heading]Risks Related to Demand, Supply and Manufacturing[/heading]

Failure to estimate customer demand properly has led and could lead to mismatches between supply and demand. Dependency
on third-party suppliers and their technology reduces our control over product quantity and quality, manufacturing yields,
development, enhancement, and product delivery schedules and could harm our business. Defects in our products have caused and
could cause us to incur significant expenses to remediate and can damage our business.

 Preprocessing
Using the 10-K filings extraction model described in Chapter 3, we collected almost all 10-K filings of the firms listed in the QQQ
from 2006 to 2023, 1383 filings in total, together with text files containing only the extracted Item 1A risk factor section. We
preprocessed these files before feeding them into our prediction model. Firstly, we split all sentences of a filing into words. Secondly,
we removed numbers, proper nouns, and special characters (i.e., punctuation) from the texts. For instance, Item 8 of a 10-K filing
contains many numbers because it includes the financial statements of the company. As numbers are not necessary for textual
analysis, we removed them. Thirdly, we removed stop words such as “is”, “the”, or “and”, as they carry little information. In the
final step of the preprocessing, we applied lemmatisation instead of stemming. Although lemmatisation is computationally more
expensive than stemming, it tends to recover a more accurate base form of a word through linguistic analysis (i.e., considering
parts of speech) [121].
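The four preprocessing steps can be illustrated with a minimal sketch. It is not the pipeline used in this study: the stop-word list and the lemma table below are tiny hypothetical stand-ins for a full stop-word list and a part-of-speech-aware lemmatiser, and proper-noun removal (which needs a tagger) is left out.

```python
import re

# Illustrative stop-word list only; a full list would be used in practice.
STOP_WORDS = {"is", "the", "and", "a", "of", "to", "in", "our"}

# Toy lemma table standing in for a real part-of-speech-aware lemmatiser.
LEMMAS = {"needs": "need", "markets": "market",
          "results": "result", "evolving": "evolve"}


def preprocess(text):
    """Tokenise, drop non-alphabetic tokens and stop words, then lemmatise."""
    # Step 1: split the text into lower-cased word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: keep purely alphabetic tokens (drops numbers and punctuation).
    tokens = [t for t in tokens if t.isalpha()]
    # Step 3: remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 4: map each token to its base form where the toy table knows it.
    return [LEMMAS.get(t, t) for t in tokens]


sentence = ("Failure to meet the evolving needs of our industry and markets "
            "may adversely impact our financial results in 2023.")
tokens = preprocess(sentence)
print(tokens)
```

A lookup table cannot resolve words whose base form depends on the part of speech, which is the case a real lemmatiser handles and a stemmer does not.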

 Hyper-Parameter Setting
We used the same hyper-parameter settings across the models to ensure a fair comparison between the sentiment scores trained
with return labels and those trained with volatility labels at the sector level, the portfolio level, and the company level.

In Section 4.2, we replaced 0 by the threshold θ in Equation 4.2, and the value 0.5 by the quantile q in Equation 4.3. We used θ
to define high and low volatility. To find balanced θ and q hyper-parameters for both the technology sector and a firm (in this
paper, Nvidia), we selected their values on the training windows. For the technology sector, we set θ as the 65th percentile of the
distribution and q as 0.65. That is, we defined volatility above the 65th percentile as high volatility and volatility below the 65th
percentile as low volatility. Note that we used the 65th percentile for θ and 0.65 for q for both the QQQ volatility itself and the
volatilities of the top 10 firms in QQQ. Empirically, the 3-day volatility of QQQ and that of the top 10 showed a similar trend. The
peaks in the graph correspond to the financial crisis of 2008 and the COVID-19 pandemic around 2020. For a firm (i.e., Nvidia),
θ was likewise set as the 65th percentile of the distribution, and q was 0.65. Similar to the QQQ volatility graph, Nvidia experienced
high levels of volatility during the 2008 financial crisis and the COVID-19 pandemic, and showed higher volatility in general
during the training windows.

Furthermore, the three-day returns of both the technology sector and a firm were balanced around the median. Hence, we kept the
value 0.5 in Equation 4.3 for both the technology sector returns and the firm returns.
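As a minimal sketch of the thresholding described above, the snippet below labels observations as high (1) or low (0) relative to a percentile threshold θ. The percentile helper, the function names, and the sample numbers are assumptions made for illustration; they are not data or code from this study.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def label_by_threshold(values, pct):
    """Return (theta, labels), with label 1 where the value exceeds theta."""
    theta = percentile(values, pct)
    return theta, [1 if v > theta else 0 for v in values]


# Hypothetical 3-day volatilities: theta is the 65th percentile.
vols = [0.8, 1.1, 0.9, 3.5, 1.0, 1.2, 4.2, 0.7, 1.3, 0.95]
theta_vol, vol_labels = label_by_threshold(vols, 65)

# Hypothetical 3-day returns: the median (50th percentile) splits them evenly.
rets = [-0.02, 0.01, 0.03, -0.01, 0.00, 0.02]
theta_ret, ret_labels = label_by_threshold(rets, 50)

print(theta_vol, vol_labels)
print(theta_ret, ret_labels)
```

With a 65th-percentile threshold roughly a third of the observations end up in the high class, whereas the median split used for returns gives two classes of equal size.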

In Equation 4.3, we specifically defined α and k. Note that α is a threshold to filter out sentiment-neutral words. We set α+
and α− within the interval (0, 0.5] such that the positive word set and the negative word set each include 100 words. Moreover,
k ∈ N is a threshold on the count of word w across all filings; we used k as a minimum frequency requirement to reduce the
influence of rare words. We set k as the 90th percentile of the term frequency distribution. This means we ignored words whose
count falls below that 90th percentile.
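The role of the two thresholds can be sketched as follows. The sketch assumes that the screening statistic of a word is the fraction of filings containing it that carry a positive label, which is how such screening is commonly set up but is an assumption here because Equation 4.3 is not reproduced in this section. Instead of fixing α+ and α− directly, it keeps the n words furthest above and below q, which corresponds to choosing the thresholds so that each set has a fixed size. All names and the toy data are hypothetical.

```python
def screen_words(doc_words, labels, q, k, n_per_side):
    """Select sentiment-charged words.

    doc_words: list of sets, the distinct words of each filing
    labels:    list of 0/1 labels, one per filing
    q:         neutral level of the screening statistic
    k:         minimum number of filings a word must appear in
    """
    count, positive_count = {}, {}
    for words, y in zip(doc_words, labels):
        for w in words:
            count[w] = count.get(w, 0) + 1
            positive_count[w] = positive_count.get(w, 0) + y

    # Minimum-frequency requirement: rare words are ignored.
    frac = {w: positive_count[w] / count[w] for w in count if count[w] >= k}

    # Rank words above q (positive side) and below q (negative side).
    pos = sorted((w for w in frac if frac[w] > q), key=lambda w: (-frac[w], w))
    neg = sorted((w for w in frac if frac[w] < q), key=lambda w: (frac[w], w))
    return pos[:n_per_side], neg[:n_per_side]


docs = [{"growth", "strong", "risk"}, {"growth", "strong"},
        {"loss", "risk", "decline"}, {"loss", "decline", "strong"},
        {"growth", "rare"}]
labels = [1, 1, 0, 0, 1]
positive, negative = screen_words(docs, labels, q=0.5, k=2, n_per_side=2)
print(positive, negative)
```

In this toy run "rare" is dropped by the frequency requirement even though it only occurs in a positively labelled filing, and "risk" is dropped as neutral because its statistic equals q.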

In Equation 4.8, we also defined λ, a positive constant used in a penalty term to regularise our model. The penalty term prevents
the model from overfitting when few sentiment-charged words appear in a filing. Without the penalty term, the model can treat a
filing that contains a few negative words but no positive words as a negative filing. However, just because a filing contains a few
negative sentiment-charged words and no positive ones does not mean that the filing has a negative tone. To control this effect,
we set the penalty coefficient λ to 0.1.
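The effect of the penalty can be illustrated with a small sketch. It assumes that the score p of a filing maximises a normalised log-likelihood plus the penalty λ·log(p(1 − p)), a form commonly used in supervised sentiment extraction; this is an assumption, since Equation 4.8 is not reproduced in this section. The word probabilities `o_pos` and `o_neg` and the grid search are hypothetical choices made for illustration.

```python
import math


def penalised_score(word_counts, o_pos, o_neg, lam, grid=999):
    """Grid-search the score p in (0, 1) that maximises
    sum_w (d_w / s) * log(p * o_pos[w] + (1 - p) * o_neg[w])
    + lam * log(p * (1 - p)),
    where d_w is the count of word w and s is the total count."""
    s = sum(word_counts.values())
    best_p, best_val = 0.5, -math.inf
    for i in range(1, grid + 1):
        p = i / (grid + 1)
        loglik = sum(d / s * math.log(p * o_pos[w] + (1 - p) * o_neg[w])
                     for w, d in word_counts.items())
        val = loglik + (lam * math.log(p * (1 - p)) if lam > 0 else 0.0)
        if val > best_val:
            best_p, best_val = p, val
    return best_p


# A filing with only two occurrences of one negative word (toy probabilities).
counts = {"loss": 2}
o_pos = {"loss": 0.1}   # hypothetical probability under the positive topic
o_neg = {"loss": 0.3}   # hypothetical probability under the negative topic

p_without = penalised_score(counts, o_pos, o_neg, lam=0.0)
p_with = penalised_score(counts, o_pos, o_neg, lam=0.1)
print(p_without, p_with)
```

Without the penalty the estimate collapses to the most negative value on the grid; with λ = 0.1 it is pulled back towards the neutral value 0.5, which is the behaviour the penalty term is meant to produce.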

 Baseline
The purpose of this paper is to predict sentiment scores for both the technology sector and a firm from the contents of 10-K
filings. To that end, we inferred the sentiment scores p̂RET and p̂VOL by labelling the filings with return and volatility, respectively.
We then introduced a baseline sentiment score against which to compare our sentiment scores.

Our baseline model for calculating baseline sentiment scores was suggested by [122] and uses the dictionary created by [123].
Our baseline Loughran and McDonald (LM) sentiment score, p̂LM, is computed as

[16]

where LM+ refers to the positive word list and LM− refers to the negative word list of Loughran and McDonald’s dictionary.
To obtain the baseline sentiment scores, we calculated the difference between the number of positive words and the number of
negative words, and then divided this result by the total number of words in each 10-K filing.
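The verbal definition above translates directly into code. In the sketch below the two word sets are small hypothetical stand-ins for the LM positive and negative lists, and the token list is made up; returning 0.0 for an empty filing is also an assumption.

```python
def lm_score(tokens, positive_words, negative_words):
    """(number of positive words - number of negative words) / total words."""
    if not tokens:
        return 0.0  # assumption: an empty filing is treated as neutral
    n_pos = sum(1 for t in tokens if t in positive_words)
    n_neg = sum(1 for t in tokens if t in negative_words)
    return (n_pos - n_neg) / len(tokens)


# Toy stand-ins for the LM+ and LM- word lists.
LM_POS = {"gain", "improve", "strong"}
LM_NEG = {"loss", "decline", "adverse"}

filing = ["strong", "demand", "offset", "loss", "decline", "segment"]
score = lm_score(filing, LM_POS, LM_NEG)
print(score)
```

One positive and two negative words out of six tokens give a slightly negative score; because every listed word counts equally, the baseline ignores how strongly a word is associated with returns or volatility.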

To give more detail on the LM dictionary used in our evaluation: it is traditionally used for financial analysis. LM offered an
improved word list for more accurate textual analysis of financial documents. Earlier analyses of 10-K filings based on the Harvard
dictionary misclassified negative words in a financial context: about three-fourths of the words it counts as negative in 10-K
filings do not carry a negative connotation in financial reports. To address this issue, LM suggested three approaches. Firstly, LM
created a refined list of words that more accurately reflects negative sentiment in a financial context, by analysing every word that
appears in at least 5% of the SEC’s 10-K filings. Secondly, LM introduced a term weighting scheme that dampens the influence of
frequently mentioned words and amplifies the significance of rarer terms, thus mitigating the misclassification of words. Finally,
they added five other word classifications (i.e., positive, uncertainty, litigious, strong modal, and weak modal words). They found
that these new classifications can be linked to market reactions, volatility in stock returns, unexpected earnings, and trading
volumes [123]. In our study, we used only the negative and positive categories to adapt the financial dictionary to our model.

 Portfolio
In this study, we formed a portfolio to evaluate the technology sector sentiment scores. The portfolio can be replaced by any
other portfolio for financial analysis. In our study, we selected the Invesco QQQ Trust Series 1, an exchange-traded fund (QQQ or
QQQ ETF). This passive fund (i.e., our portfolio) tracks the Nasdaq 100 Index, which consists of shares from 100 of the largest and
most innovative non-financial firms listed on the Nasdaq stock exchange. Holdings in QQQ are predominantly large-cap
technology firms, which account for 60% of the portfolio. As such, the QQQ is conventionally considered a technology sector fund.
The top 10 holdings represent a 50% allocation of the portfolio, with 9 out of 10 firms being in the tech sector. To represent the
technology sector in the US, we formed two portfolios from the QQQ fund. Firstly, we formed a portfolio that has exactly the
same allocation proportions as the QQQ fund itself as of 2023. These allocation proportions are revised annually, so our current
portfolio reflects the 2023 allocation data. The actual 2023 portfolio allocation can be found in Appendix D. This portfolio is used
to compare the sector sentiment scores.

The second portfolio was constructed from the top 10 firms, taking into account the asset allocation ratios of the first portfolio.
In other words, the second portfolio consisted of the top 10 firms, which account for 50% of the QQQ portfolio, weighted according
to the proportions invested in the first portfolio. This portfolio was computed as:

[17]

where j denotes the firm ranked j-th in the 2023 portfolio, wj represents the portfolio weight of the j-th firm, and Aj refers to
the allocated proportion of the j-th firm in the 2023 portfolio.
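One way to read this construction is that the weights of the top firms are their allocations rescaled to sum to one, i.e. wj = Aj divided by the sum of the top-n allocations. The sketch below implements that reading, which is an assumption because the equation itself is not reproduced here; the firm names and allocation values are hypothetical, and n = 3 is used only to keep the example short.

```python
def top_n_weights(allocations, n=10):
    """Rescale the n largest allocations so that the weights sum to 1."""
    top = sorted(allocations.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(a for _, a in top)
    return {firm: a / total for firm, a in top}


# Hypothetical allocation proportions A_j of a four-firm portfolio.
allocations = {"FIRM_A": 0.12, "FIRM_B": 0.10, "FIRM_C": 0.08, "FIRM_D": 0.05}
weights = top_n_weights(allocations, n=3)
print(weights)
```

Rescaling keeps the relative sizes of the holdings from the first portfolio while making the smaller portfolio fully invested.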

In the case of a single firm evaluation, we did not form a portfolio for it. Instead, we followed the firm’s stock market price.
 Definition


[18]

 r refers to the correlation coefficient.

 xi and yi refer to the individual sample points for variables x and y, respectively.
 x̅ and y̅ represent the mean values of the samples for x and y. [124]
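The symbols listed above match the sample (Pearson) correlation coefficient, so the following sketch assumes that this is the definition intended. The function name and the sample values are illustrative only.

```python
import math


def pearson_r(x, y):
    """Sample correlation: sum((xi - x_bar) * (yi - y_bar)) divided by
    sqrt(sum((xi - x_bar)^2) * sum((yi - y_bar)^2))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    # A constant series has zero variance and makes this division fail.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sentiment scores and subsequent 3-day returns.
sentiment = [0.2, 0.4, 0.5, 0.7, 0.9]
returns = [-0.01, 0.00, 0.01, 0.01, 0.03]
r = pearson_r(sentiment, returns)
print(r)
```

A value of r close to 1 or -1 indicates a strong linear relationship between the sentiment score and the market variable, while a value near 0 indicates none.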

 The Sector Sentiment Model with Only-Risk-Factor

 The Top10 Sentiment Model with 10-K

Our system can fetch the latest SEC filings and automates data preparation (i.e., preprocessing, extracting the risk factor section,
and creating a document-term dictionary). In principle, our system can collect every type of SEC filing. In practice, we collected
10-K filings, then extracted the risk factor section and created a document-term dictionary. As mentioned in Experiment Result, we
generated sentiment metrics for three different stakeholders using either the entire 10-K filing or the risk factor section. To do
that, we need 6 types of document-term dictionary. One of them, for instance, is the document-term dictionary of an individual
firm (e.g., Nvidia) built from the complete 10-K filings.

Our system can automatically create the 6 types of document-term dictionary from the latest 10-K filings. This automation
runs on Apache Airflow, an open-source platform designed to manage workflows in data engineering pipelines. Our Airflow
automation system consists of three DAGs, one for each stakeholder level. Note that a Directed Acyclic Graph (DAG) is a collection
of all the tasks you want to execute, arranged to reflect their relationships and dependencies. Each DAG creates the corresponding
document-term dictionary for our Sentiment Score Prediction Model. As mentioned in Project Objective, the publication dates of
10-K filings vary from firm to firm, so filings are released on nearly every day of the year. Thus, we scheduled our DAGs
differently: the sector-level DAG (i.e., the technology sector from the QQQ) updates daily, the portfolio-level DAG (i.e., Top10)
updates monthly, and the firm-level DAG updates yearly. You can check the data workflow in.
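The three schedules can be sketched as an Airflow DAG file. This is not the project's actual DAG code: the DAG ids, the task id, and the `build_dictionary` placeholder are hypothetical, and the sketch assumes Airflow 2.4 or later, where `DAG` accepts a `schedule` argument. Airflow is imported inside the builder so that the schedule table can be inspected without an Airflow installation.

```python
from datetime import datetime

# Update frequency per stakeholder level, using Airflow's preset schedules.
SCHEDULES = {"sector": "@daily", "top10": "@monthly", "firm": "@yearly"}


def build_dictionary(level):
    """Placeholder task: collect new 10-K filings for this level and rebuild
    its document-term dictionary (the real steps are not shown here)."""
    print(f"rebuilding document-term dictionary for level: {level}")


def build_dags(start_date=datetime(2024, 1, 1)):
    """Create one DAG per stakeholder level (requires Apache Airflow)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dags = {}
    for level, schedule in SCHEDULES.items():
        dag = DAG(
            dag_id=f"document_term_{level}",
            start_date=start_date,
            schedule=schedule,
            catchup=False,  # do not back-fill runs before deployment
        )
        PythonOperator(
            task_id="build_dictionary",
            python_callable=build_dictionary,
            op_kwargs={"level": level},
            dag=dag,
        )
        dags[dag.dag_id] = dag
    return dags
```

In a real DAG file the result of `build_dags()` would be assigned to module-level names (for example with `globals().update(build_dags())`) so that the Airflow scheduler can discover the three DAGs.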