Sentiment Prediction For Market Volatility
Abstract: This project presents an automated framework for generating sentiment metrics from SEC 10-K filings, aiming
to predict stock market returns and volatility at the sector, portfolio, and firm levels. The system comprises two core models:
an SEC Filing Extraction Model, which preprocesses filings, and a Supervised Lexicon Learning Model, which analyzes
sentiment using a four-step process. This includes identifying sentiment-related words, assigning predictive weights,
aggregating sentiment scores, and applying the Kalman Filter for trend analysis. Empirical results demonstrate the
effectiveness of sentiment metrics from 10-K filings, particularly the Item 1A risk factor section, in forecasting market
movements.
How to Cite: Niraj Patel. (2025). Sentiment Prediction for Market Volatility. International Journal of Innovative Science and
Research Technology, 10(2), 2531-2555. https://doi.org/10.38124/ijisrt/25feb804
I. INTRODUCTION

Motivation
This is the age of artificial intelligence and big data. With advances in computational power, large amounts of varied data, such as text, video, and audio, have been used for scientific analysis. Among these data forms, textual data has gained attention the fastest in the social sciences. The numerical representation of textual data for statistical analysis is extremely high-dimensional, so an empirical study that seeks to exploit textual richness must confront the dimensionality challenge. Machine learning can be employed to extract richer meaning from textual data for predictive analysis in a high-dimensional data environment [1].

In finance, textual data is commonly employed for predicting market movements [1], [2], [3], [4], [5], [6]. In stock prediction, textual analysis of market sentiment has shown notable success. News data was employed to analyse sentiment in the prediction of short-term stock price movements [3]. Similarly, social media textual data was utilised by integrating social media sentiment and AI [5]. Annual report data was also used for stock market forecasting [6]. Many types of textual data have thus been used to predict market movements. What type of textual data, then, is informative for this purpose? If you want to invest in a public company in the United States, where can you begin your investment journey?

There are myriad ways to begin investing in a public company. However, what if you know nothing, or only a little, about the firm in which you would invest? Then you should first understand the firm correctly. Where, then, can we find reliable and trustworthy information about a firm? A rich deposit of reliable knowledge can be found in the Form 10-K filing. The Form 10-K filing has been mandated by the Securities and Exchange Commission (SEC) since 1934. It has its origins in Section 13 or 15(d) of the Securities Exchange Act of 1934. The 10-K is a comprehensive official document offering a thorough overview of a company’s business, its potential challenges, and its financial performance through the fiscal year. In the 10-K, a company’s leadership provides its perspective on the business outcomes and the factors influencing them [7]. Furthermore, several studies found that 10-K filings offered predictive power for stock prices [8], [9], [10], [11], [12]. [8] found evidence that the market reacted to 10-K filings in a statistically significant way. [9], [10] showed a correlation between the complexity of a 10-K filing and stock price volatility. [11], [12] found a positive correlation between 10-K filings and stock prices through cutting-edge AI methodologies. Hence, investors should pay attention to 10-K filings.

Since the beginning of 2005, the SEC has required a new section in all firms’ annual filings. The section, called “Section 1A of the Annual Report on Form 10-K”, discusses “the most significant factors that make the company speculative or risky” [13]. Prior to this change, companies were only obligated to provide this information in their registration when issuing their equity or debt securities. Some companies also voluntarily offer risk disclosures in the section called “Management’s Discussion and Analysis of Financial Condition and Results of Operations (MD&A)”.

Opponents of the new disclosure requirements argue that risk factor disclosures are unlikely to offer valuable information. First, risk disclosure can be biased. Managers might resist disclosing negative information about their business or career incentives [14], [15], [16], [17], [18]. Second, managers’ overconfidence could make them perceive less risk, or overconfident managers could have the illusion that they can effectively manage the risks confronting their firms [19]. Third, managers tended to disclose all possible
In the final step, to ensure the estimates corresponded to a probability distribution, we corrected any negative entries by resetting them to zero and then re-normalised each column so that its total was equal to one. This process produced a revised matrix, but to simplify notation, we reused Ô for the resulting matrix, and we labelled its first and second columns as Ô+ and Ô−, respectively.

[9]

[11]

Where ŝ represented the total count of words from the set Ŝ in the new filing, while d_{i,w}, Ô_{+,w}, and Ô_{−,w} referred to

Volatility is, by nature, an asymmetric variable; thus, ranking the volatility as in Equation 4.4 loses substantial informational value for estimating the sentiment score. Normalising the volatility is therefore an appropriate alternative, as it preserves asymmetry:

$$P_i^{*} = \frac{y_i - \min(y)}{\max(y) - \min(y)}$$   [10]
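The two normalisation steps described in this passage (cleaning the estimated matrix and min-max scaling the volatility) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `renormalise_columns` and `minmax_normalise` and the toy numbers are assumptions introduced here.

```python
def renormalise_columns(matrix):
    """Reset negative entries to zero, then rescale each column to sum to one."""
    clipped = [[max(value, 0.0) for value in row] for row in matrix]
    n_cols = len(clipped[0])
    col_sums = [sum(row[j] for row in clipped) for j in range(n_cols)]
    # An all-zero column is left as zeros instead of dividing by zero.
    return [
        [row[j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n_cols)]
        for row in clipped
    ]


def minmax_normalise(values):
    """Map values onto [0, 1]; unlike ranking, relative distances are preserved."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalisation is undefined for a constant series")
    return [(v - lo) / (hi - lo) for v in values]


# Toy word-by-sentiment estimates: column 0 plays the role of O+, column 1 of O-.
o_hat = renormalise_columns([[0.6, -0.1], [0.3, 0.5], [0.3, 0.5]])
o_plus = [row[0] for row in o_hat]
o_minus = [row[1] for row in o_hat]

# Toy volatility values with one large spike (made-up numbers).
p_star = minmax_normalise([0.01, 0.02, 0.03, 0.10])
```

With a ranking, the spike in the last observation would sit only one step above its neighbour; after min-max scaling it remains far from the other three values, which is the asymmetry the text refers to.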
Table 4 Sentiment Token Statistics with 10-K and Risk Factor Signal Noise.
                          # of Total Tokens           Mean   SD     Max/Min
                          (# of Sentiment Tokens)
p̂RET with 10-K            3,861 (3,861)               0.49   0.32   0.89/0.13
p̂VOL with 10-K            3,861 (3,861)               0.39   0.21   0.86/0.13
p̂RET with Risk factor     2,201 (2,201)               0.49   0.32   0.90/0.11
p̂VOL with Risk factor     2,201 (2,201)               0.41   0.18   0.80/0.17
Note: *p-value < 0.05, **p-value < 0.005

The Sector Sentiment Model with Only-Risk-Factor
The sentiment scores in were predicted based on the risk factor section. Note again that only showed the filtered one (the unfiltered can be found at ).

The p̂RET model accuracy was 73% with a loss of 0.25. The risk factor p̂RET model accuracy was 5 percentage points lower than that of the entire 10-K p̂RET model, which was 78%. Moreover, the risk factor p̂VOL model showed stronger performance, achieving 90% accuracy with a loss of 0.22. Comparing the 10-K sector p̂LM and the risk factor p̂LM, the latter was generally lower than the former. This pattern suggested that the p̂LM score, influenced by an LM dictionary dominated by negative words and a pessimistically toned risk factor section, inherently leaned towards negative sentiment [25], [78].

When comparing the training data for both models (Figure , and ), the aggregated word count from all QQQ firms’ filings was 15,863 after preprocessing, and the count from just the risk factor sections of these filings was 9,030 after the same preprocessing. This observation highlighted that the risk factor sections, despite typically constituting only 10% to 15% of the entire filing, contained a significant portion of the words relevant to the model’s sentiments. It showed that a risk factor section could offer informational value for textual analysis in the finance domain. Also, the minor differences in accuracy between the models could suggest that the risk factor section provided valuable textual context for predicting sentiment scores for both p̂RET and p̂VOL. However, upon comparing both models, all scores from the model analysing risk factor sections were lower than those from the comprehensive model. This indicated that the tone of a risk factor section is pessimistic or negative, as other studies have also found empirically [25], [78]. The training time for this model was also reduced by 40% (i.e. 23s).

Table showed the Pearson correlation coefficients of the filtered sentiment sector model with only the risk factor. Compared to the sector model with the entire 10-K filing, shown in Table 5.1, the sector risk factor model showed, in general, lower r values for all pairs, and some pairs showed an altered sign, such as the pair of p̂RET and p̂LM, the pair of p̂RET and Stock, and the pair of p̂VOL and p̂LM. Except for the pair of p̂RET and p̂LM which did not show statistical significance,
The Top10 Sentiment Model with 10-K Filing
Indicated the predicted sentiment score for the top10 portfolio. These scores were estimated based on the top 10 firms’ entire 10-K filings for a given window from 2006 to 2023. The p̂RET model achieved a 76% accuracy rate with a loss of 0.22. The previously mentioned sector models had 78% and 72%, respectively. The p̂VOL model’s accuracy rate decreased to 78% with a loss of 0.20 (: 92%, : 90%), but it still predicted the sector’s volatility-labelled sentiment well. The training time was slightly faster, reduced to 13 seconds, as we only used the 10-K filings of the top 10 firms.

Compared to the sector’s sentiment score (Figure ), the top10 10-K model () showed a far clearer trend once noisy signals were again removed through the Kalman filter (the unfiltered model can be found at ). We found a few extremely interesting phenomena in this model. Firstly, we observed that p̂RET strongly mirrored the top 10 stock price movement. p̂RET showed a steady and moderate increase from 2006 to around 2016. Then, beginning around 2018, p̂RET showed a sharp rise that continued even after the outbreak of COVID-19. Very similarly, the top 10 stock price rose sharply over a few years from early 2018 and showed a steep drop from 2021 to the beginning of 2023. Then, it increased sharply again. The steep drop from 2021 to 2023 could have stemmed from the COVID-19 pandemic. Several studies argued that quantitative easing (QE) seemed to be significantly and positively related to a stronger recovery of the US stock market. This could explain the QQQ’s sharp increase after the outbreak of COVID-19 [85], [86], [87], [88], [89], [90], [91], [92]. However, the post-COVID-19 surge in sentiment suggested that the p̂RET model did not accurately gauge event significance, a factor likely considered in 10-K filings. This limitation could arise from managers’ ignorance about an event’s impact at filing time or from the model’s inability to quantify how much sentiment is driven by the event. Essentially, the model treated significant terms like “COVID-19” or “restrictions” merely as text, without evaluating their actual importance or sentiment contribution.

Secondly, the p̂VOL model showed interesting points. The p̂VOL model revealed trends in market volatility, with a sharp increase until around 2010, followed by a decline until

Table 5 showed the Pearson correlation coefficients of the filtered top10 sentiment model with the entire 10-K filings. All r values indicated a strong correlation with rigorous statistical significance. Compared to Figure , the top 10 firms revealed clearer correlations by removing the noise associated with the other 90 firms. A strong positive correlation of r = +.905 between p̂RET and p̂VOL indicated that higher sector positivity was linked with greater uncertainty. Additionally, r(p̂RET, Stock) = +.956 showed that increases in sector stock prices correlated with more positive sentiment, and similarly, a strong positive correlation existed between p̂VOL and stock prices, suggesting that stock prices rose with increased market uncertainty. Strong negative correlations of r(p̂RET, p̂LM) = −.922 and r(p̂VOL, p̂LM) = −.903 indicated that decreases in p̂LM were associated with increased market sentiment and uncertainty, and vice versa. Also, p̂LM had a strong negative correlation with Stock.

The Top10 Sentiment Model with Only-Risk-Factor
Showed the top 10 firms’ estimated sentiment scores by training only the risk factor section. Compared to the 10-K top10 model (), this model’s () performance decreased slightly to 71% with a loss of 0.23 (p̂RET) and 74% with a loss of 0.21 (p̂VOL). This could be because the vocabulary of the risk factor section is smaller than that of the entire 10-K filings. The training time was also reduced to 8 seconds. For the same reason, we observed that the p̂RET model steadily increased. However, this model did not reflect the market situation well. It gradually increased during the technology industry boom and decreased very slightly after the outbreak of COVID-19. Also, the p̂VOL model showed market uncertainty steadily falling during the same period. The p̂RET model dynamically, albeit minimally, reflected market changes, while the p̂VOL model struggled due to its focus on the risk factor section dominated by negative tones. When using the model trained on risk factors, we must remember it might offer a pessimistic market view. Furthermore, the p̂LM model indicated a slow, moderate shift towards negativity, showing it lacked dynamic market responsiveness, similar to patterns seen in the earlier p̂LM model (). This negative bias distorted the analysis at the portfolio level.
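The Kalman filtering referred to in this section is not specified in detail here, so the following is only a minimal sketch assuming a one-dimensional local-level model (a random-walk state observed with noise). The function name `kalman_filter_level`, the variances `process_var` and `obs_var`, and the toy scores are hypothetical choices for illustration, not values from the study.

```python
def kalman_filter_level(observations, process_var=1e-3, obs_var=1e-1):
    """Filter a noisy series with a 1-D local-level Kalman filter.

    State model:       x_t = x_{t-1} + w_t,  w_t ~ N(0, process_var)
    Observation model: z_t = x_t + v_t,      v_t ~ N(0, obs_var)
    """
    estimate = observations[0]   # initialise the state at the first observation
    variance = obs_var           # initial uncertainty of that state
    filtered = [estimate]
    for z in observations[1:]:
        # Predict: the level is carried forward and its uncertainty grows.
        prior_var = variance + process_var
        # Update: blend the prediction and the new observation via the Kalman gain.
        gain = prior_var / (prior_var + obs_var)
        estimate = estimate + gain * (z - estimate)
        variance = (1.0 - gain) * prior_var
        filtered.append(estimate)
    return filtered


# Toy yearly sentiment scores (made-up values) with visible noise.
raw_scores = [0.40, 0.55, 0.35, 0.60, 0.45, 0.70, 0.50, 0.75]
smooth_scores = kalman_filter_level(raw_scores)
```

A larger `obs_var` relative to `process_var` makes the filter trust individual observations less, which yields a smoother trend at the cost of reacting more slowly to genuine shifts.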
At the sector level, the sector model trained with the entire 10-K provided more informative sentiments than the risk-factor-centred sector model. The 10-K model outperformed the risk factor model for both the return-labelled sentiment (i.e. p̂RET) and the volatility-labelled sentiment (i.e. p̂VOL). The accuracies for p̂RET and p̂VOL were higher by 5 and 2 percentage points, respectively. The 10-K model took more time to train, though. Notably, p̂VOL in the 10-K sector model showed periodic fluctuations in volatility, whereas p̂RET fluctuated less. In terms of correlation analysis, the 10-K sector model showed stronger correlations than the risk-factor sector model. In the qualitative analysis of the most impactful words, both models seemed to exhibit similar topics. The baseline model (i.e. p̂LM) showed distorted performance compared to both the 10-K sector model and the risk factor sector model. This was because the baseline model was biased towards negativity, as the training set was predominantly filled with negative tokens. It is noteworthy that the Kalman filter should be applied in sector analysis to control noise.

At the portfolio level, the model based on the entire 10-K generally revealed more informative sentiments than the model with the risk factor. The 10-K portfolio model outperformed the risk factor portfolio model for both p̂RET and p̂VOL. The accuracies for p̂RET and p̂VOL were higher by 5 and 4 percentage points, respectively. Training took longer than for the risk-factor model. In the correlation analysis, the 10-K model showed much stronger correlations for all pairs. In the 10-K portfolio model, p̂RET strongly mirrored the top 10 portfolio stock price movement. p̂RET seemed to show no significant response to external macroeconomic factors such as the 2008 financial crisis and COVID-19. On the other hand, p̂VOL revealed trends in the volatility response to these external factors. In the influential word analysis, the risk factor portfolio model seemed to highlight a distinct topic. Thus, we recommended employing two models (i.e. both the 10-K one and the risk factor one) for a portfolio-level analysis. Again, it is notable that the Kalman filter was also required for a portfolio analysis.

Our suggested models cannot currently capture meaningful phrases, as the model essentially identifies sentiment-charged words based on a bag of words. For instance, the model will split the phrase ‘chief executive officer (CEO)’ into separate words and then evaluate whether each word is sentiment-charged with respect to the dependent variables. In this word-based separation process, the original meaning, which is CEO, is lost. Hence, in the future, we suggest adding a function that supplies contextual information to the model. Bi-grams or n-grams are examples.

Industrial sentiment score prediction does not consider the allocation proportions of the QQQ portfolio. In the QQQ ETF, the top 10 firms account for around 45% of the total allocation in 2023, and the remaining 90 firms account for the remaining 55%. The portfolio is reconstructed annually. So, to predict a robust industrial-level sentiment score, the scores should take the portfolio allocation proportions into account. In our industrial sentiment trend analysis, however, all sentiment scores are weighted equally because portfolio rebalancing data is difficult to obtain. For instance, the return-labelled sentiment score of Apple Inc. in 2023 is 0.006, whereas Amazon.com Inc.’s sentiment in 2023 is -0.009. We used these scores without weighting them by their portfolio proportions. In 2023, the QQQ allocated 9.22% to Apple and 4.83% to Amazon. Thus, the weighted sentiment scores (i.e. 0.05532 = 0.006 * 9.22 for Apple and -0.04347 = -0.009 * 4.83 for Amazon) are required for a robust sentiment score calculation. In the future, we suggest that the sentiment score at a given date should consider the allocation weight of the portfolio on the same date.

Our model cannot adapt to firms’ latest returns or volatility because the calculation only works with the filings released at the publication date. In other words, the predicted sentiment score is an annual data point at the date on which the filing is released. If more recent textual information were available to represent a firm, our model would generate more recent sentiment scores. We suggest using the 10-Q filing, which is a comprehensive report of a firm like the 10-K but must be submitted quarterly.

REFERENCES
APPENDIX
Part 1
Item 1: Business - This section describes the business of the company: its main products or services, subsidiaries, and markets
in which it operates. It may also contain recent events, competition, regulation, and labour problems.
Item 1A: Risk Factors - This section sets out risks and uncertainties, likely external effects, or possible failures that could
affect the company's financial performance. Risk factors are generally listed in order of importance.
Item 1B: Unresolved Staff Comments - This section explains any issues raised by the SEC staff on previous
reports that have not yet been resolved.
Item 2: Properties - This section lays out only the company's significant physical properties, not intellectual or intangible
property.
Item 3: Legal Proceedings - This section discloses any significant ongoing lawsuit or other legal proceeding.
Item 4 - [RESERVED]
Part 2
Item 5: Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- This section discloses the company's performance in the stock market, dividends, and repurchases of its own stock.
Item 6 - [RESERVED]
Item 7: Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A) - This section
presents management's discussion of the firm's financial performance, challenges, opportunities, and future outlook.
Item 7A: Quantitative and Qualitative Disclosures about Market Risks - This section shows the company's exposure to
market risks.
Item 8: Financial Statements and Supplementary Data - This section offers the audited financial statements. It contains the
balance sheet, income statement, cash flow statement, and footnotes.
Item 9: Changes in and Disagreements with Accountants on Accounting and Financial Disclosure - Companies address any
changes in or disputes with their accountants over financial reporting.
Item 9A: Controls and Procedures - This section details the company's procedures for disclosure controls and its internal
control mechanisms for financial reporting.
Item 9B: Other Information - This section offers additional information that does not align with the content of other sections.
Part 3
Item 10: Directors, Executive Officers and Corporate Governance – This section delves into the specifics of the company’s
leadership, their respective roles, and the practices in place for corporate governance.
Item 11: Executive Compensation – This section addresses compensation policies and programmes, and the compensation of
top executives.
Item 12: Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters – This
section reports the ownership of the company’s stock by major shareholders as well as insiders.
Item 13: Certain Relationships and Related Transactions, and Director Independence – This section encompasses
transactions involving directors, executives, and their affiliates, along with details regarding the independence of directors.
Item 14: Principal Accountant Fees and Services – The section details the fees charged by the company’s auditors for their
services.
Part 4
Item 15: Exhibits, Financial Statement Schedules – This section contains a list of the financial statements and exhibits [119], [120].
[heading]Our Company[/heading]
NVIDIA Corporation helped awaken the world to the power of computer graphics when it invented the graphics processor
unit, or GPU, in 1999. Expertise in programmable GPUs has led to breakthroughs in parallel processing which make supercomputing
inexpensive and widely accessible.
[Heading]Overview[/heading] [heading]Our Company[/heading] NVIDIA Corporation helped awaken the world to the power
of computer graphics when it invented the graphics processor unit, or GPU, in 1999. Expertise in programmable GPUs has led to
breakthroughs in parallel processing which make supercomputing inexpensive and widely accessible. We serve the entertainment
and consumer market with our GeForce graphics products, the professional design and visualization market with our Quadro
graphics products, the high-performance computing market with our Tesla computing solutions products, and the mobile computing
market with our Tegra system-on-a-chip products.
Failure to meet the evolving needs of our industry and markets may adversely impact our financial results. Competition in our
current and target markets could cause us to lose market share and revenue.
Failure to estimate customer demand properly has led and could lead to mismatches between supply and demand. Dependency
on third-party suppliers and their technology reduces our control over product quantity and quality, manufacturing yields,
development, enhancement, and product delivery schedules and could harm our business. Defects in our products have caused and
could cause us to incur significant expenses to remediate and can damage our business.
Preprocessing
Through the 10-K filings extraction model in Chapter 3, we collected almost all 10-K filings of the firms listed in the QQQ from
2006 to 2023, amounting to 1,383 filings, as well as text files containing only the extracted Item 1A risk factor section. We
preprocessed these files before feeding them into our prediction model. Firstly, we split all sentences of a filing into words. Secondly, we
removed non-alphabetic tokens such as numbers and special characters (i.e. punctuation), as well as proper nouns. For instance,
Item 8 of a 10-K filing contains many numbers because it includes the company's financial statements. As numbers were
not necessary for textual analysis, we removed them. Thirdly, we removed stop words such as “is”, “the”, or “and”, as they
carry little information. In the final step of the preprocessing, we applied lemmatisation instead of stemming. Although
lemmatisation is computationally more expensive than stemming, it tends to capture a more accurate base form of a word through
linguistic analysis (i.e. considering parts of speech) [121].
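The preprocessing steps above can be sketched as follows. This is a simplified, dependency-free illustration rather than the pipeline used in the study: the stop-word set and the lookup-table lemmatiser (`STOP_WORDS`, `LEMMAS`, `preprocess`) are tiny hypothetical stand-ins for a full stop-word list and a part-of-speech-aware lemmatiser, and proper-noun removal is omitted because it needs a tagger.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use a full
# stop-word list and a part-of-speech-aware lemmatiser.
STOP_WORDS = {"is", "the", "and", "a", "of", "to", "in", "our", "could"}
LEMMAS = {"risks": "risk", "results": "result", "markets": "market",
          "harmed": "harm", "losses": "loss"}


def preprocess(text):
    # Step 1: split the text into lower-cased whitespace-delimited tokens.
    tokens = re.findall(r"\S+", text.lower())
    # Step 2: strip surrounding punctuation, then drop non-alphabetic tokens.
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]
    # Step 3: remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 4: map each token to its base form (unchanged if not in the table).
    return [LEMMAS.get(t, t) for t in tokens]


sample = "Competition in our markets could cause losses of 12% and harmed results in 2023."
cleaned = preprocess(sample)
```

Note that this sketch also discards hyphenated tokens such as "third-party", since they are not purely alphabetic; whether that is desirable depends on the vocabulary one wants to keep.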
Hyper-Parameter Setting
We set the hyper-parameters equal across the models to ensure a fair comparison between the sentiments with return labels and
those with volatility labels at the sector level, the portfolio level, and the company level.
In Section 4.2, we replaced 0 with θ in Equation 4.2, and the value 0.5 with the quantile q in Equation 4.3. θ is a threshold that we
used to define high and low volatility. To find balanced θ and q hyper-parameters for both the technology sector and a
firm (in this paper, Nvidia), we selected the values of θ and q from, on the training windows. In the technology sector from, we
practically set θ as the 65th percentile of the distribution and q as 0.65. This means we defined volatility above the 65th percentile as
high volatility and volatility below the 65th percentile as low volatility. Note that we set the 65th percentile and 0.65 for θ and
q, respectively, for both the QQQ volatility itself and the volatilities of the top 10 firms in QQQ. Empirically, the 3-day volatility for
both QQQ and the top 10 showed a similar trend. We can see peaks in the graph corresponding to the financial crisis of 2008 and
the COVID-19 pandemic around 2020. In the case of a firm (i.e. Nvidia) from, θ was set as the 65th percentile of the distribution,
and q was 0.65. Similar to the QQQ volatility graph, Nvidia experienced high levels of volatility during the 2008 financial crisis
and COVID-19, and showed higher volatility in general during the training windows.
Furthermore, returns for both the technology sector and a firm showed a balanced proportion at the median of three-day
returns. Hence, we set 0.5 in Equation 4.3 for both the technology sector in and a firm return in.
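As an illustration of this thresholding, the sketch below labels each observation as high or low relative to a percentile-based θ. It is a minimal sketch under assumptions: the helper names `percentile` and `label_high_low`, the linear-interpolation percentile rule, the +1/-1 label encoding, and the toy values are introduced here and are not taken from the study.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def label_high_low(values, pct):
    """Return (theta, labels), with label +1 if value > theta and -1 otherwise."""
    theta = percentile(values, pct)
    return theta, [1 if v > theta else -1 for v in values]


# Toy 3-day volatilities and returns (made-up numbers).
volatilities = [0.010, 0.012, 0.015, 0.011, 0.030, 0.013,
                0.045, 0.014, 0.016, 0.020, 0.018]
returns = [-0.02, 0.01, 0.03, -0.01, 0.00, 0.02, -0.03]

theta_vol, vol_labels = label_high_low(volatilities, 65)  # 65th percentile
theta_ret, ret_labels = label_high_low(returns, 50)       # median split
```

Because volatility is right-skewed, a threshold above the median (here the 65th percentile) reserves the "high" label for the upper tail, whereas returns are split at the median.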
In Equation 4.3, we specifically defined α and k. Note that α was a threshold to filter out sentiment-neutral words. We set α+
and α− within the interval (0, 0.5] such that the positive word set and the negative word set each included 100 words. Moreover,
k ∈ ℕ was another threshold relating to the count of word w across all filings; we used k as a minimum frequency
requirement to reduce the influence of rare words. We set k as the 90th percentile of the term frequency distribution. This means
we ignored words whose frequency fell below the 90th percentile.
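A rough sketch of how such thresholds could act on a vocabulary is given below. Equation 4.3 is not reproduced in this section, so the screening rule here is an assumption modelled on common supervised lexicon screening: a word is kept as positive if the share of its filings carrying a positive label is at least q + α+, as negative if that share is at most q − α−, and in both cases only if it appears in at least k filings. All names (`screen_words`, `alpha_plus`, `alpha_minus`) and the toy data are hypothetical.

```python
from collections import defaultdict


def screen_words(docs, labels, q=0.5, alpha_plus=0.2, alpha_minus=0.2, k=2):
    """Split the vocabulary into positive and negative sentiment-charged words.

    docs   : list of token lists, one per filing
    labels : +1 / -1 label per filing (e.g. derived from returns or volatility)
    """
    doc_count = defaultdict(int)   # number of filings containing the word
    pos_count = defaultdict(int)   # number of those filings with a positive label
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            doc_count[word] += 1
            if label > 0:
                pos_count[word] += 1

    positive, negative = set(), set()
    for word, n in doc_count.items():
        if n < k:                  # minimum-frequency filter against rare words
            continue
        share = pos_count[word] / n
        if share >= q + alpha_plus:
            positive.add(word)
        elif share <= q - alpha_minus:
            negative.add(word)     # words in between are treated as neutral
    return positive, negative


docs = [["growth", "demand", "risk"], ["growth", "record"],
        ["loss", "risk", "litigation"], ["loss", "demand", "risk"]]
labels = [1, 1, -1, -1]
positive_words, negative_words = screen_words(docs, labels)
```

In this sketch `k` is an absolute count; to mirror the text, it could instead be derived as the 90th percentile of the observed term frequencies, and α+ and α− could be tuned until each set holds the desired number of words.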
Baseline
The purpose of the paper was to predict the sentiment score for both the technology sector and a firm from the contents of 10-
K filings. To achieve that, we inferred the sentiment scores p̂RET and p̂VOL by labelling with return and volatility, respectively.
We then introduced a baseline sentiment score against which to compare our sentiment scores.
Our baseline model for calculating baseline sentiment scores was suggested by [122] and used the dictionary created by [123].
Our baseline Loughran and McDonald (LM) sentiment score, p̂LM, is computed as
$$\hat{p}_{LM} = \frac{\#\{\text{words} \in LM^{+}\} - \#\{\text{words} \in LM^{-}\}}{\#\{\text{words in the filing}\}}$$   [16]
Where LM+ refers to the positive word list and LM− refers to the negative word list of Loughran and McDonald’s dictionary.
To obtain the baseline sentiment scores, we calculated the difference between the number of positive words and the number of negative
words, and then divided this result by the total number of words in each 10-K filing.
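The baseline computation just described can be sketched as below. The two word sets are tiny hypothetical stand-ins for the LM+ and LM− lists (the real dictionary is far larger), and `lm_score` is a name introduced here for illustration.

```python
# Hypothetical miniature stand-ins for the LM positive and negative word lists.
LM_POSITIVE = {"gain", "improve", "strong", "success"}
LM_NEGATIVE = {"loss", "decline", "adverse", "litigation", "impairment"}


def lm_score(tokens):
    """(number of positive words - number of negative words) / total words."""
    if not tokens:
        raise ValueError("cannot score an empty filing")
    positives = sum(1 for t in tokens if t in LM_POSITIVE)
    negatives = sum(1 for t in tokens if t in LM_NEGATIVE)
    return (positives - negatives) / len(tokens)


# Toy preprocessed filing: 2 positive words, 4 negative words, 10 tokens in total.
filing = ["strong", "demand", "offset", "adverse", "litigation", "loss",
          "market", "gain", "decline", "supply"]
score = lm_score(filing)
```

The score lies in [-1, 1]; a filing with more negative than positive dictionary words, as in this toy example, receives a negative score.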
To explain the LM dictionary further: it is traditionally used for financial analysis. LM offered an improved dictionary for
more accurate analysis of financial documents. Earlier textual analysis of 10-K filings based on the Harvard dictionary
misclassified negative words in a financial context: three-fourths of the words it identified as negative in 10-K filings do not
carry a negative connotation in financial reports. To address this issue, LM suggested three
approaches. Firstly, LM created a refined list of words that more accurately reflects negative sentiment in a financial context, by
analysing every word that appears in at least 5% of the SEC’s 10-K filings. Secondly, LM introduced a term weighting scheme that
controls the influence of frequently mentioned words and amplifies the significance of rarer terms, thus mitigating the
misclassification of words. Finally, they added five other word classifications (positive, uncertainty, litigious, strong modal, and
weak modal words). They found that these new classifications can be linked to market reactions, volatility in stock returns,
unexpected earnings, and trading volumes [123]. In our study, we used only the negative and positive categories to adapt the financial
dictionary to our model.
Portfolio
In this study, we formed a portfolio to evaluate the technology sector sentiment scores. The portfolio can be swapped for any
other portfolio for financial analysis. In practice, we selected the Invesco QQQ Trust Series 1, an exchange-traded fund (QQQ or
QQQ ETF). This passive fund (i.e., our portfolio) tracks the Nasdaq 100 Index, which consists of shares from 100 of the largest and
most innovative non-financial firms listed on the Nasdaq stock exchange. Holdings in QQQ are predominantly in large-cap
technology firms, accounting for 60% of the portfolio. As such, the QQQ is conventionally considered a technology sector fund.
The top 10 holdings represent a 50% allocation of the portfolio, with 9 out of 10 firms being in the tech sector. To represent the
technology sector in the US, we formed two portfolios from the QQQ fund. Firstly, we formed a portfolio with exactly the
same allocation proportions as the QQQ fund itself as of 2023. The allocation proportions are adjusted annually, so our current
portfolio reflects 2023 allocation data. The actual 2023 portfolio allocation can be found in Appendix D. This portfolio is used to
compare the sector sentiment scores.
The second portfolio was constructed from the top 10 firms, considering the asset allocation ratio of the first portfolio. In other
words, the second portfolio consisted of the top 10 firms, accounting for 50% of the QQQ portfolio, weighted according to the proportion
invested in the first portfolio. This portfolio was computed as:
[17]
Where j denoted the firm ranked j-th by investment in the 2023 portfolio, and wj represented the portfolio weight of the j-th
firm. Aj referred to the allocated proportion of the j-th firm in the 2023 portfolio.
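The portfolio equation itself does not survive in this section. One plausible reading, shown here purely as an assumption, is that each weight wj is the allocation Aj rescaled so that the weights of the selected firms sum to one, and that the portfolio value is the weighted sum of firm-level values. The function names, firm labels, and numbers below are all hypothetical.

```python
def rescale_weights(allocations):
    """Rescale allocation shares A_j so that the resulting weights w_j sum to one."""
    total = sum(allocations.values())
    if total <= 0:
        raise ValueError("allocations must have a positive total")
    return {firm: share / total for firm, share in allocations.items()}


def portfolio_value(weights, firm_values):
    """Weighted sum of firm-level values (e.g. stock prices or returns)."""
    return sum(weights[firm] * firm_values[firm] for firm in weights)


# Made-up allocation percentages and prices for three placeholder firms.
allocations = {"FIRM_A": 10.0, "FIRM_B": 6.0, "FIRM_C": 4.0}
weights = rescale_weights(allocations)
value = portfolio_value(weights, {"FIRM_A": 100.0, "FIRM_B": 50.0, "FIRM_C": 25.0})
```

Rescaling keeps the relative proportions of the original fund while letting a subset of its holdings stand on its own as a fully invested portfolio.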
In the case of a single firm evaluation, we did not form a portfolio for it. Instead, we followed the firm’s stock market price.
Definition
[18]
Our system can automatically create 6 types of document-term dictionary from the latest 10-K filings. This automation
runs on Apache Airflow, an open-source platform designed to manage workflow processes in data engineering pipelines. Our
Airflow automation system consists of three DAGs, one for each stakeholder level. Note that a Directed Acyclic Graph (DAG)
refers to a collection of all the tasks you want to execute, arranged to reflect their relationships and dependencies. Each DAG creates
the corresponding document-term dictionary for our Sentiment Score Prediction Model. As mentioned in Project Objective, the
publication dates of 10-K filings vary by firm, so filings are released nearly every day throughout the year. Thus,
we scheduled our DAGs differently. The sector-level DAG (i.e. the technology sector from the QQQ) updates daily, the portfolio-level DAG (i.e.
Top10) updates monthly, and the firm-level DAG updates yearly. You can check the data workflow in.