More Than Sentiments
Introduction
MoreThanSentiments (Jiang and Srinivasan, 2022) is a Python library written to help researchers calculate Boilerplate (Lang and Stice-Lawrence, 2015), Redundancy (Cazier and Pfeiffer, 2017), Specificity (Hope et al., 2016), Relative Prevalence (Blankespoor, 2019), and related text scores. Nowadays, people frequently talk about text embedding, semantic similarity, intention detection, and sentiment analysis. MoreThanSentiments, however, is inspired by the idea that properly quantifying text structure can also help researchers extract a great deal of meaningful information, and this domain-independent package is easy to implement in a wide range of text quantification projects.
Supported Measurements
Boilerplate
In textual analysis, boilerplate is a combination of words that can be removed from a sentence without significantly changing the original meaning. In other words, it is a measure of informativeness: the more boilerplate a document contains, the less informative it tends to be. The score is calculated as the portion of words that appear in sentences containing boilerplate phrases, scaled by the total number of words in the document.
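Written as a formula, this is roughly (a paraphrase of the definition above; the package's exact implementation may differ in details):

\[
\text{Boilerplate}(d) = \frac{\#\{\text{words in sentences of } d \text{ that contain a boilerplate phrase}\}}{\#\{\text{words in } d\}}
\]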
Redundancy
Redundancy is a measure of the usefulness of the text. It is defined as the percentage of super-long sentences/phrases (e.g., 10-grams) that occur more than once within a document. Intuitively, if a very long phrase is used repeatedly, the author is restating information that has already been given earlier in the document.
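Expressed the same way (again a paraphrase of the definition, not necessarily the exact implementation):

\[
\text{Redundancy}(d) = \frac{\#\{\text{10-grams in } d \text{ that occur more than once}\}}{\#\{\text{10-grams in } d\}}
\]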
Specificity
Specificity is a measure of the quality of relating uniquely to a particular subject. It
is defined as the number of specific entity names, quantitative values, and
times/dates, all scaled by the total number of words in a document. Currently, the Specificity function is built on the named entity recognizer from spaCy.
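As a rough formula:

\[
\text{Specificity}(d) = \frac{\#\text{entity names} + \#\text{quantitative values} + \#\text{times/dates}}{\#\{\text{words in } d\}}
\]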
Relative Prevalence
Relative Prevalence is a measure of hard information. It is defined as the number of numerical values scaled by the total length of the text, and it helps evaluate the proportion of quantitative information in a given text.
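Or, roughly:

\[
\text{Relative Prevalence}(d) = \frac{\#\{\text{numerical values in } d\}}{\#\{\text{words in } d\}}
\]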
Installation
The easiest way to install the toolbox is via pip (pip3 in some distributions):
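For example (assuming the package name on PyPI matches the library name):

pip install MoreThanSentiments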
Usage
Import the Package
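A minimal import sketch, assuming the package is imported under the mts alias used in the snippets below (pandas is also assumed for the dataframe operations):

import MoreThanSentiments as mts
import pandas as pd

With the package imported, you can read in your data: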
my_dir_path = "D:/YourDataFolder"
df = mts.read_txt_files(PATH = my_dir_path)
This is a built-in function that helps you read a folder of separate .txt files into Python. If you already have all the data stored in a .csv file, you can simply read it with pandas as usual.
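For instance (the file name is a placeholder for your own data):

df = pd.read_csv("my_documents.csv")  # expects a column holding the raw text, e.g. df['text']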
Sentence Token
df['sent_tok'] = df.text.apply(mts.sent_tok)
Clean Data
If you want to clean the text at the sentence level:
# Clean every tokenized sentence in each document
df['cleaned_data'] = df['sent_tok'].apply(
    lambda sentences: [mts.clean_data(s,
                                      lower = True,
                                      punctuations = True,
                                      number = False,
                                      unicode = True,
                                      stop_words = False) for s in sentences]
)
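Boilerplate
The Boilerplate score is computed on the tokenized sentences. A sketch of the call, with argument values mirroring the parameter defaults described below:

df['Boilerplate'] = mts.Boilerplate(df.sent_tok, min_doc = 5, get_ngram = False)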
Parameters:
min_doc: when building the ngram list, ignore ngrams whose document frequency is strictly lower than the given threshold. The default is 5 documents; a value of about 30% of the number of documents is recommended. min_doc also accepts values between 0 and 1, which are read as percentages (e.g., 0.3 is read as 30%).
get_ngram: if this parameter is set to True, the function returns a dataframe with all the ngrams and their corresponding frequencies, and the min_doc parameter is ignored.
Redundancy
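The score is computed on the cleaned, tokenized sentences. A sketch of the call (the n-gram length of 10 is an assumption here, mirroring the 10-gram example above):

df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)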
Parameters:
Specificity
df['Specificity'] = mts.Specificity(df.text)
Parameters:
Relative Prevalence
df['Relative_prevalence'] = mts.Relative_prevalence(df.text)
Parameters:
Conclusion
MoreThanSentiments is still a developing project, yet it already shows the potential to help researchers across different domains. The package simplifies the process of quantifying text structure and provides a variety of text scores for NLP projects.
Python Script
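Putting the pieces above together, a complete run might look roughly like this (a sketch, not the author's exact script; the folder path is a placeholder, and the Boilerplate/Redundancy argument values are assumptions based on the parameters discussed above):

import MoreThanSentiments as mts

# Read a folder of .txt files into a dataframe with a text column
my_dir_path = "D:/YourDataFolder"
df = mts.read_txt_files(PATH = my_dir_path)

# Tokenize each document into sentences, then clean every sentence
df['sent_tok'] = df.text.apply(mts.sent_tok)
df['cleaned_data'] = df['sent_tok'].apply(
    lambda sentences: [mts.clean_data(s, lower = True, punctuations = True,
                                      number = False, unicode = True,
                                      stop_words = False) for s in sentences]
)

# Compute the four scores
df['Boilerplate'] = mts.Boilerplate(df.sent_tok, min_doc = 5)
df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)
df['Specificity'] = mts.Specificity(df.text)
df['Relative_prevalence'] = mts.Relative_prevalence(df.text)

print(df.head())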
Citation
Related Reading
Reference
Blankespoor, E. (2019). The Impact of Information Processing Costs on Firm Disclosure Choice: Evidence from the XBRL Mandate. Journal of Accounting Research, 57, 919–967. https://doi.org/10.1111/1475-679X.12268
Hope, O.-K., Hu, D., & Lu, H. (2016). The benefits of specific risk-factor disclosures. Review of Accounting Studies, 21, 1005–1045. https://doi.org/10.1007/s11142-016-9371-1
Cazier, R. A., & Pfeiffer, R. J. (2017). 10-K Disclosure Repetition and Managerial Reporting Incentives. Journal of Financial Reporting, 2(1), 107–131. https://doi.org/10.2308/jfir-51912