More Than Sentiments
Introduction
MoreThanSentiments (Jiang and Srinivasan, 2022) is a Python library written to help researchers calculate Boilerplate (Lang and Stice-Lawrence, 2015), Redundancy (Cazier and Pfeiffer, 2017), Specificity (Hope et al., 2016), Relative Prevalence (Blankespoor, 2019), and related text scores. Nowadays, people frequently talk about text embedding, semantic similarity, intention detection, and sentiment analysis. MoreThanSentiments, however, is inspired by the idea that properly quantifying text structure can also help researchers extract a great deal of meaningful information, and this domain-independent package is easy to implement in a wide range of text quantification projects.
Supported Measurements
Boilerplate
In textual analysis, boilerplate is a combination of words that can be removed from a sentence without significantly changing the original meaning. In other words, it is a measure of informativeness: the more boilerplate a document contains, the less informative it tends to be. The score is calculated as the portion of words that appear in sentences containing boilerplate phrases, scaled by the total number of words in the document.
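Written as a formula, this is roughly (a paraphrase of the definition above; the package's exact implementation may differ in details):

\[
\text{Boilerplate}(d) = \frac{\#\{\text{words in sentences of } d \text{ that contain a boilerplate phrase}\}}{\#\{\text{words in } d\}}
\]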
Redundancy
Redundancy is a measure of the usefulness of the text. It is defined as the percentage of super-long sentences/phrases (e.g., 10-grams) that occur more than once within a document. Intuitively, if a very long phrase is used repeatedly, the author is restating information that has already been given earlier in the document.
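Expressed the same way (again a paraphrase of the definition, not necessarily the exact implementation):

\[
\text{Redundancy}(d) = \frac{\#\{\text{10-grams in } d \text{ that occur more than once}\}}{\#\{\text{10-grams in } d\}}
\]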
Specificity
Specificity is a measure of the quality of relating uniquely to a particular subject. It
is defined as the number of specific entity names, quantitative values, and
times/dates, all scaled by the total number of words in a document. Currently, the Specificity function is built on the named entity recognizer from spaCy.
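As a rough formula:

\[
\text{Specificity}(d) = \frac{\#\text{entity names} + \#\text{quantitative values} + \#\text{times/dates}}{\#\{\text{words in } d\}}
\]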
Relative Prevalence
Relative Prevalence is a measure of hard information. It is defined as the number of numerical values scaled by the total length of the text, and it helps evaluate the proportion of quantitative information in a given text.
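Or, roughly:

\[
\text{Relative Prevalence}(d) = \frac{\#\{\text{numerical values in } d\}}{\#\{\text{words in } d\}}
\]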
Installation
The easiest way to install the toolbox is via pip (pip3 in some distributions):
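For example (assuming the package name on PyPI matches the library name):

pip install MoreThanSentiments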
Usage
Import the Package
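A minimal import sketch, assuming the package is imported under the mts alias used in the snippets below (pandas is also assumed for the dataframe operations):

import MoreThanSentiments as mts
import pandas as pd

With the package imported, you can read in your data: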
my_dir_path = "D:/YourDataFolder"
df = mts.read_txt_files(PATH = my_dir_path)
This is a built-in function that helps you read a folder of separate .txt files into Python. If you already have all the data stored in a .csv file, you can simply read it with pandas as usual.
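For instance (the file name is a placeholder for your own data):

df = pd.read_csv("my_documents.csv")  # expects a column holding the raw text, e.g. df['text']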
Sentence Token
df['sent_tok'] = df.text.apply(mts.sent_tok)
Clean Data
If you want to clean the text at the sentence level:
# Clean every tokenized sentence in each document
df['cleaned_data'] = df['sent_tok'].apply(
    lambda sentences: [mts.clean_data(s,
                                      lower = True,
                                      punctuations = True,
                                      number = False,
                                      unicode = True,
                                      stop_words = False) for s in sentences]
)
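Boilerplate
The Boilerplate score is computed on the tokenized sentences. A sketch of the call, with argument values mirroring the parameter defaults described below:

df['Boilerplate'] = mts.Boilerplate(df.sent_tok, min_doc = 5, get_ngram = False)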
Parameters:
min_doc: when building the ngram list, ignore ngrams whose document frequency is strictly lower than the given threshold. The default is 5 documents; a value of about 30% of the number of documents is recommended. min_doc also accepts values between 0 and 1, which are read as percentages (e.g., 0.3 is read as 30%).
get_ngram: if this parameter is set to True, the function returns a dataframe with all the ngrams and their corresponding frequencies, and the min_doc parameter is ignored.
Redundancy
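The score is computed on the cleaned, tokenized sentences. A sketch of the call (the n-gram length of 10 is an assumption here, mirroring the 10-gram example above):

df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)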
Parameters:
Specificity
df['Specificity'] = mts.Specificity(df.text)
Parameters:
Relative Prevalence
df['Relative_prevalence'] = mts.Relative_prevalence(df.text)
Parameters:
Conclusion
MoreThanSentiments is still a developing project, yet it already shows the potential to help researchers across different domains. The package simplifies the process of quantifying text structure and provides a variety of text scores for NLP projects.
Python Script
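Putting the pieces above together, a complete run might look roughly like this (a sketch, not the author's exact script; the folder path is a placeholder, and the Boilerplate/Redundancy argument values are assumptions based on the parameters discussed above):

import MoreThanSentiments as mts

# Read a folder of .txt files into a dataframe with a text column
my_dir_path = "D:/YourDataFolder"
df = mts.read_txt_files(PATH = my_dir_path)

# Tokenize each document into sentences, then clean every sentence
df['sent_tok'] = df.text.apply(mts.sent_tok)
df['cleaned_data'] = df['sent_tok'].apply(
    lambda sentences: [mts.clean_data(s, lower = True, punctuations = True,
                                      number = False, unicode = True,
                                      stop_words = False) for s in sentences]
)

# Compute the four scores
df['Boilerplate'] = mts.Boilerplate(df.sent_tok, min_doc = 5)
df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)
df['Specificity'] = mts.Specificity(df.text)
df['Relative_prevalence'] = mts.Relative_prevalence(df.text)

print(df.head())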
Citation
Related Reading
Reference
Blankespoor, E. (2019). The Impact of Information Processing Costs on Firm Disclosure Choice: Evidence from the XBRL Mandate. Journal of Accounting Research, 57, 919–967. https://doi.org/10.1111/1475-679X.12268
Hope, O.-K., Hu, D., & Lu, H. (2016). The benefits of specific risk-factor disclosures. Review of Accounting Studies, 21, 1005–1045. https://doi.org/10.1007/s11142-016-9371-1
Cazier, R. A., & Pfeiffer, R. J. (2017). 10-K Disclosure Repetition and Managerial Reporting Incentives. Journal of Financial Reporting, 2(1), 107–131. https://doi.org/10.2308/jfir-51912