Project Example
1. Introduction
In this group project, we work with the “A Million News Headlines” dataset, on which we
perform three NLP tasks: Clustering, Topic Modeling, and Headline Classification. Starting
from the provided Kaggle starter notebooks, we examine the classical methods and extend them
with “deeper” algorithms.
In Section 2, we present the preliminaries that support the entire training process. Sections
3 to 5 then formulate and explore the three major NLP tasks. In Section 6, we combine these
methods and propose a simple web-based search engine.
Prior to applying Machine Learning methods to explore the latent semantic space of the given
dataset, we compute some basic statistics and make the following findings:
1) The total number of headlines is 1,244,184, including 31,180 duplicated headlines.
2) The total number of unique words is 108,058, including stop words and inflected words
with the same word stem.
3) Publication dates range from 2003/02/19 to 2021/12/31.
To filter out uninformative terms and facilitate further exploration, we apply three
preprocessing steps to the dataset:
1) Deduplication: Using the pandas library, redundant examples with an identical
headline are deleted.
2) Stemming: A SnowballStemmer from the nltk library is applied to reduce inflected
words to their word stem.
3) Stop word removal: Stop words are very common words that appear in nearly every
document. Ignoring them filters out a considerable amount of uninformative text and
makes learning more efficient.
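The three steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the stop-word list and the `naive_stem` helper are hypothetical stand-ins (the report uses pandas for deduplication and nltk's SnowballStemmer), and stop words are removed before stemming so that the stop-word lookup sees unmodified tokens.

```python
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use a full list such as nltk's English stop words.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "for", "on", "over"}


def naive_stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(headlines: Iterable[str],
               stem: Callable[[str], str] = naive_stem) -> List[List[str]]:
    """Deduplicate headlines, drop stop words, then stem the remaining tokens."""
    unique = list(dict.fromkeys(headlines))  # order-preserving deduplication
    return [
        [stem(tok) for tok in headline.lower().split() if tok not in STOP_WORDS]
        for headline in unique
    ]


docs = preprocess([
    "police charge man over crash",
    "police charge man over crash",      # exact duplicate, removed
    "council plans for the new roads",
])
print(docs)
```

To get closer to the setup described in the text, the `stem` argument can be replaced by the `stem` method of an nltk `SnowballStemmer("english")` instance.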
Because the original data has no labels for testing, we need to choose the number of clusters
(that is, the k value of k-means) ourselves. By trying different k values and evaluating the
results with the elbow method or the silhouette score, we can select a suitable k value.
1) Elbow method
The elbow method is a technique used to determine the optimal number of clusters in a
dataset for clustering algorithms like K-Means. It works by plotting the sum of squared
distances between data points and their assigned cluster center (also known as the
"inertia") against different values of k, the number of clusters.
The idea is that as the number of clusters increases, the distance between each data
point and its assigned cluster center decreases, resulting in a lower inertia value.
However, after a certain point, adding more clusters doesn't result in a significant
decrease in inertia. This point is known as the "elbow point", and it indicates the optimal
number of clusters.
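We use the sum of squared errors (SSE) to measure inertia:
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
where C_j is the set of points assigned to cluster j and \mu_j is its center.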
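The elbow procedure described above can be sketched as follows. This is a minimal sketch on toy data, assuming a basic Lloyd's k-means written in NumPy; the function name `kmeans_sse`, the blob data, and the range of k are illustrative choices, not the project's implementation. The SSE is the sum of squared distances from each point to its nearest cluster center.

```python
import numpy as np


def kmeans_sse(X: np.ndarray, k: int, n_iter: int = 25, seed: int = 0) -> float:
    """Run a basic Lloyd's k-means and return the final SSE (inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


# Toy data (assumption): three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

sse_by_k = {k: kmeans_sse(X, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

Plotting `sse_by_k` against k and looking for the point where the curve flattens gives the elbow. On real headline vectors the bend is usually far less sharp than on such toy blobs.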
2) Silhouette score
To calculate the silhouette score for a single data point i, we first calculate two values:
a(i) [intra-cluster dissimilarity] = the average distance from point i to all other
points in the cluster it belongs to
b(i) [inter-cluster dissimilarity] = the average distance from point i to all points
in the closest cluster that it does not belong to
We then compute the silhouette score as:
s(i) = (b(i) - a(i)) / \max(a(i), b(i))
The resulting silhouette score ranges from -1 to 1, with higher values indicating better
clustering results. A score of 0 indicates that the data points may be overlapping or very
close to the decision boundary between clusters. In general, a silhouette score of 0.5 or
higher is considered to be a good clustering solution.
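As a worked illustration of the definition, the sketch below computes s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point in NumPy. It assumes Euclidean distance and at least two clusters, and sets s(i) = 0 for singleton clusters (a common convention). The four-point data set is a toy example.

```python
import numpy as np


def silhouette_scores(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    # Pairwise Euclidean distances, shape (n, n)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                      # convention: s(i) = 0 for singletons
        a = dist[i, same].sum() / (n_same - 1)   # excludes the point itself
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
scores = silhouette_scores(X, labels)
print(scores, scores.mean())
```

For the first point, a = 1 and b = 10.5, so s = 9.5 / 10.5, which is close to 1 because the two clusters are far apart. The mean of s(i) over all points is the score used to compare different values of k.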
Setting k = 50, we obtain the following clustering results (only 15 clusters are
displayed):
4. Method 2: Topic Modeling
In this section, we will first introduce and examine two classical Topic Modeling methods:
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), after which a
recently proposed DNN-based method called BERTopic will be introduced. Finally,
evaluations of the three methods are presented. For more details of the training process, please
refer to the attached “Topic_Modeling.ipynb”.
As introduced in our lectures, LSA is mainly based on Singular Value Decomposition (SVD).
During the implementation, a sparse term-document matrix is generated, which holds the
weights (typically TF-IDF) of words in each document. Subsequently, SVD decomposes this
matrix into three matrices, U, Σ, and V, so that the number of rows (words) can be reduced
without significantly affecting the similarity structure among columns (documents).
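The decomposition described above can be sketched with NumPy on a toy corpus. The four stemmed "headlines", the smoothed IDF variant, and k = 2 are assumptions made for illustration; the project's notebook may build the matrix differently.

```python
import numpy as np

# Toy corpus of already-preprocessed headlines (assumption, not from the dataset).
docs = [
    ["polic", "charg", "man", "court"],
    ["man", "court", "murder", "charg"],
    ["council", "plan", "water", "fund"],
    ["govt", "fund", "water"],
]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[index[w], j] += 1

# TF-IDF weighting with a smoothed IDF (one common variant, assumed here).
df = (A > 0).sum(axis=1)
idf = np.log((1 + len(docs)) / (1 + df)) + 1
A_tfidf = A * idf[:, None]

# SVD, then keep only the k largest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(A_tfidf, full_matrices=False)
k = 2
term_topics = U[:, :k]                        # shape (n_terms, k)
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T      # shape (n_docs, k)

for t in range(k):
    top = np.argsort(-np.abs(term_topics[:, t]))[:3]
    print(t, [vocab[i] for i in top])
```

Each column of `term_topics` plays the role of one latent topic, and the terms with the largest absolute weights in a column are the topic's top words.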
Topics for each document can be derived from the matrix by finding the top n words that are
most related to the document. In this section, we perform LSA to get 10 topic clusters using a
downsized dataset with 50K headlines:
0 polic, man, charg, new, say, court, death, murder, car, crash 5813
1 new, say, plan, council, govt, australia, year, nsw, health, water 6483
2 man, charg, court, face, murder, jail, accus, new, die, kill 1603
3 say, plan, council, govt, australia, urg, nsw, chang, fund, water 2755
4 plan, council, govt, urg, water, fund, nsw, warn, health, hous 6774
5 australia, win, court, nsw, death, day, face, warn, crash, report 17498
6 court, charg, face, govt, murder, council, accus, death, hear, told 2229
7 council, govt, man, urg, fund, nsw, water, hit, qld, wa 1924
8 council, australia, charg, court, face, say, day, polic, new, murder 605
9 win, australian, crash, kill, court, council, open, car, die, world 4316
Like LSA, LDA is based on the distributional hypothesis that words with close meanings are
likely to occur in similar documents. However, LDA is a generative statistical model: it
regards each document as a probability distribution over topics and each topic as a
probability distribution over words, both drawn from Dirichlet distributions, and it treats
the term-document matrix as generated from these distributions. Rather than using SVD, LDA
decomposes the term-document matrix through a probabilistic procedure and often performs
better than LSA. The results of the LDA topic model on 50K headlines are as follows:
2 govern, final, dead, group, say, train, forc, brisban, need, run 5076
3 govt, fund, school, test, protest, record, boost, worker, work, health 5123
4 say, court, plan, death, water, hous, polic, woman, accus, melbourn 4936
5 man, kill, year, crash, win, attack, murder, qld, car, claim 5121
6 new, australian, cut, case, flood, coronavirus, famili, high, driver, ban 5065
7 urg, hospit, elect, china, rise, fear, nation, price, hit, market 4881
8 council, sydney, wa, nsw, day, warn, chang, world, die, talk 4904
9 report, help, sa, home, coast, south, minist, fight, north, meet 5039
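To make the generative view of LDA concrete, the sketch below samples a small term-document matrix from the model: topic-word distributions and document-topic mixtures are drawn from Dirichlet distributions, and each word is drawn by first picking a topic. The vocabulary, sizes, and the hyperparameters `alpha` and `beta` are hypothetical. Fitting LDA runs this process in reverse (inferring the distributions from observed counts), which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["polic", "court", "murder", "council", "water", "fund"]
n_topics, n_docs, doc_len = 2, 3, 8
alpha, beta = 0.5, 0.5          # hypothetical symmetric Dirichlet hyperparameters

# Each topic is a distribution over words: phi[k] ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(len(vocab), beta), size=n_topics)

# Term-document count matrix generated by the model (terms x documents)
counts = np.zeros((len(vocab), n_docs), dtype=int)
for d in range(n_docs):
    theta = rng.dirichlet(np.full(n_topics, alpha))   # topic mixture of document d
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)             # draw a topic for this word slot
        w = rng.choice(len(vocab), p=phi[z])          # draw a word from that topic
        counts[w, d] += 1

print(counts)
```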
4.3 BERTopic
As an extended application of BERT in Topic Modeling, BERTopic [5] improves this process
by leveraging several clustering techniques and a class-based variation of TF-IDF (c-TF-IDF)
to extract coherent topic representation. Innovative mechanisms are adopted in the following
stages:
1) Document Embedding: An embedder based on the Sentence-Transformers framework converts
headlines to dense vector representations so that the embedded headlines can be clustered
more accurately. Here we choose “SentenceTransformer('all-mpnet-base-v2')” as the
embedding model, a general-purpose model that typically performs strongly on sentence
embedding tasks.
2) Document Clustering: By combining the dimensionality reduction method UMAP with the
hierarchical density-based clustering algorithm HDBSCAN, BERTopic preserves much of the
structure of the high-dimensional data and handles the dense vector representations
generated in the previous step effectively. In addition, UMAP can be used across different
language models as it is not computationally restricted by the embedding dimensions.
3) Topic Representation: To model topic representations based on the word distribution
of each topic cluster, a modified version of TF-IDF is adopted. Typically, a TF-IDF
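measure is used to quantify how important a term is to a document:
W_{t,d} = tf_{t,d} \cdot \log(N / df_t)
where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents
that contain t, and N is the total number of documents. To apply the same idea to the
importance of t to a cluster c, all documents in a cluster are treated as a single document
and the modified version, called c-TF-IDF [5], is used:
W_{t,c} = tf_{t,c} \cdot \log(1 + A / tf_t)
where tf_{t,c} is the frequency of term t in cluster (class) c, tf_t is the frequency of t
across all classes, and A is the average number of words per class. The inverse class
frequency thus takes the place of the inverse document frequency.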
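The class-based weighting can be illustrated with a short NumPy sketch. It assumes the c-TF-IDF form given in the BERTopic paper [5], W_{t,c} = tf_{t,c} · log(1 + A / tf_t), where tf_{t,c} is the frequency of term t in class c, tf_t is the frequency of t across all classes, and A is the average number of words per class. The count matrix is a toy example, and every term is assumed to occur at least once so that tf_t is never zero.

```python
import numpy as np


def c_tf_idf(counts: np.ndarray) -> np.ndarray:
    """counts[c, t] = frequency of term t in class (cluster) c."""
    tf_t = counts.sum(axis=0)                 # frequency of each term over all classes
    A = counts.sum() / counts.shape[0]        # average number of words per class
    return counts * np.log(1 + A / tf_t)


vocab = ["cricket", "murder", "court"]
counts = np.array([[3.0, 0.0, 1.0],    # class 0: sport-like cluster (toy data)
                   [0.0, 2.0, 1.0]])   # class 1: crime-like cluster (toy data)
W = c_tf_idf(counts)
for c, row in enumerate(W):
    print(c, vocab[int(row.argmax())], row.round(3))
```

A term that is frequent in one cluster but rare overall ("cricket" in class 0) gets the highest weight, while a term spread evenly across clusters ("court") is down-weighted. This is what makes the top words of each topic distinctive.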
-1 river, indigen, offic, shoot, cost, teen, food, adelaid, babi, assault 26004
0 cup, fund, test, polici, promis, drop, pacif, welfar, feder, posit 906
1 england, cricket, ash, odi, india, socceroo, cup, lanka, pont, sri 845
2 murder, guilti, plead, appeal, manslaught, lawyer, killer, trial, juri, verdict 823
3 china, burma, trade, thai, thailand, taiwan, kong, hong, myanmar, free 544
4 drug, alcohol, heroin, bust, traffick, cocain, seiz, raid, liquor, meth 479
5 sexual, sex, porn, rape, paedophil, child, offend, assault, offenc, abus 456
7 doctor, medic, surgeri, health, patient, medicar, clinic, wait, ama, dr 407
8 highway, fatal, truck, crash, driver, car, die, road, hurt, injur 369
Since Topic Modeling is an unsupervised method, which does not include any labeled data for
accuracy-based metrics, in this section, we evaluate the performance of the three topic models
using Topic Coherence (TC) and Topic Diversity (TD).
Topic Model TC TD
It can be observed that the TCs of all three models are relatively low; even the highest is
only very close to 0. One possible cause is underfitting: we use only 50K headlines in this
task, so the models may fail to capture the relationships among words.
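The report does not spell out how TD is computed. A common definition, assumed in the sketch below, is the fraction of unique words among the top words of all topics, so that 1 means no overlap between topics. The two example topics reuse the first five words of LSA topics 0 and 2 from Section 4, and the resulting number is only illustrative.

```python
from typing import List


def topic_diversity(topics: List[List[str]], top_n: int = 10) -> float:
    """Fraction of unique words among the top_n words of all topics."""
    words = [w for topic in topics for w in topic[:top_n]]
    return len(set(words)) / len(words)


topics = [
    ["polic", "man", "charg", "new", "say"],
    ["man", "charg", "court", "face", "murder"],
]
td = topic_diversity(topics, top_n=5)
print(td)  # 0.8, since "man" and "charg" appear in both topics
```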
5.1 Introduction
As a part of a group project focusing on the classification of news headlines, this section aims
to explore zero-shot learning techniques in the context of natural language processing (NLP).
We investigate the performance of various state-of-the-art NLP models, including GPT-2,
GPT-3, BERT, and BART, in classifying news headlines into predefined categories.
We examine different data pre-processing methods to enhance model performance and develop
a web application that allows users to input news headlines and receive the top 4 relevant topics
with their similarity scores. This section contributes to the overall project by demonstrating the
potential of zero-shot learning techniques in news headline classification and providing a
practical tool for real-time contextual understanding of news articles. The following are the
models and techniques we used in this part of the experiment:
GPT-2: Introduced by Radford et al. [1], the Generative Pre-trained Transformer 2 (GPT-2) is
a large-scale generative language model that has demonstrated impressive performance in
various NLP tasks, including text classification. The model uses a unidirectional Transformer
architecture and has been pretrained on a diverse range of web text. Although GPT-2 has shown
remarkable results in several NLP tasks, its unidirectional nature can be a limitation for certain
classification problems.
GPT-3: Developed by Brown et al., GPT-3 is one of the largest language models available,
with 175 billion parameters. It has achieved strong results on various NLP benchmarks and
can be applied to a wide range of tasks in zero-shot and few-shot settings, without
task-specific fine-tuning. However, its massive size can pose challenges in terms of
computational resources and deployment.
BART: Presented by Lewis et al. [2], the Bidirectional and Auto-Regressive Transformers
(BART) model combines the strengths of BERT and GPT models by employing a bidirectional
encoder and a unidirectional decoder. BART has shown exceptional performance in various
NLP tasks, including abstractive summarization and text classification. Its hybrid architecture
can provide a more balanced approach to capturing context in text.
In addition to the models mentioned above, we also explored various data pre-processing
techniques to improve model performance. These techniques include tokenization,
lemmatization, and stop word removal, which are commonly used in NLP tasks to reduce noise
and focus on the most informative parts of the text. By investigating these models and
techniques, we aim to identify the most suitable approach for zero-shot classification of news
headlines in our project.
Before delving into the model selection process, it is crucial to pre-process the input data to
ensure the efficiency and accuracy of the models. Data pre-processing involves several steps,
including tokenization, stop word removal, and lemmatization. In this section, we will discuss
the pre-processing techniques applied to the "A Million News Headlines" dataset for our project.
1. Tokenization: The first step in pre-processing is tokenization, which breaks the input text
into individual words or tokens. This process allows the models to better understand and
analyse the input data.
2. Stop word Removal: Stop words are common words such as "and", "the", and "in" that do
not carry significant meaning and can be safely removed from the text. By filtering out stop
words, we reduce noise and focus on the meaningful words in the headlines. This step helps
improve the performance of the models by reducing the input size and allowing them to
concentrate on relevant information.
The pre-processed data is then used as input for the various models, including GPT-2, GPT-3,
BERT, and BART, in the model selection process. By applying these pre-processing techniques,
we ensure that the input data is clear, concise, and suitable for analysis by the models, resulting
in better performance and more accurate results in the classification task.
In our experiment, we aimed to select the most suitable model for zero-shot classification of
news headlines by comparing the performance of GPT-2, GPT-3, BERT, and BART models.
We provided the same news headline to each model and analysed the similarity scores between
the input headline and the predefined topics. The model that generated the most accurate and
consistent similarity scores across various headlines was considered the most suitable for our
project.
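The ranking step of this comparison (turning per-topic scores into the top 4 topics shown to the user) can be sketched as below. The raw scores are hypothetical placeholders for whatever a model returns for one headline, and the softmax normalization is an assumption, since the report does not state how the similarity scores are normalized.

```python
import numpy as np
from typing import Dict, List, Tuple


def top_k_topics(raw_scores: Dict[str, float], k: int = 4) -> List[Tuple[str, float]]:
    """Softmax-normalize raw per-label scores and return the k best labels."""
    labels = list(raw_scores)
    logits = np.array([raw_scores[label] for label in labels], dtype=float)
    probs = np.exp(logits - logits.max())     # subtract max for numerical stability
    probs /= probs.sum()
    order = np.argsort(-probs)[:k]
    return [(labels[i], float(probs[i])) for i in order]


# Hypothetical raw scores a model might assign to one headline (not real output).
raw = {"sport": 0.2, "politics": 2.1, "crime": 3.4, "health": -0.5, "business": 1.0}
for label, score in top_k_topics(raw):
    print(f"{label}: {score:.3f}")
```

Feeding the same headline and candidate topics to each model and comparing the resulting rankings is one simple way to carry out the comparison described above.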
GPT-2 and GPT-3 showed reasonable performance in some cases, but GPT-2 struggled with
certain headlines due to its unidirectional nature. While GPT-3 delivered impressive similarity
scores, its massive size and computational requirements posed challenges for deployment.
BERT demonstrated a strong ability to capture context, which is essential for accurate
classification, but its performance was not significantly better than BART in our experiment.
The BART model achieved the most balanced performance, generating accurate similarity
scores across different headlines and effectively capturing context with its hybrid architecture.
BART is a denoising autoencoder, which allows it to reconstruct the input text from a corrupted
version. This unique architecture enables BART to understand the relationships between
different parts of the input text, making it suitable for tasks like text classification. Additionally,
it demonstrated a better trade-off between model complexity and computational requirements
compared to GPT-3.
Based on the experimental results, we selected the BART model for our project. Its strong
performance in generating accurate similarity scores and effectively capturing context,
combined with its relatively lower computational requirements, makes it a suitable choice for
zero-shot classification of news headlines. In the following sections, we will discuss the
implementation details, including data pre-processing techniques and the integration of the
BART model into our system. The following images show the results of the program on the
front-end page.
In our initial exploration, we also briefly attempted to handle Chinese characters in order to
improve the classification performance for Chinese news headlines. We used the "jieba" library,
which is popular for Chinese text segmentation. However, we did not integrate this method
into the final version of our code.
The preliminary approach involved tokenization and keyword extraction using "jieba",
followed by reassembling the keywords into a string. This pre-processed string was then
classified using the pre-trained BART model. Although this method was effective in processing
Chinese text and yielded improved classification results, it was not adopted in the final
implementation due to the project's focus on English news headlines.
6. Combination of Methods: A Simple Search Engine
The image above showcases the user interface of our search engine, where users
can input text queries to explore relevant content. The search engine processes the
input data using a range of techniques, such as topic modeling, clustering, and zero-
shot classification, to generate accurate and meaningful results.
The second image demonstrates the output display of the search engine. The results
are grouped by topic, providing users with a clear and organized presentation of the
information. This allows users to easily identify and focus on the topics that are most
relevant to their interests.
From the BERTopic model, we obtain about 459 topics, whose frequencies change over time. We
use time series analysis to predict the trend of these topic frequencies. Due to limited
computing power, we only consider predicting the trend of topics within a year (regardless of
seasonality). We use the ARIMA model for time series analysis, and the frequency forecast
falls within a 95% confidence interval with an error of ±1.53.
6.2.1 Preprocessing
First, by processing the data generated from the BERTopic model, we obtain the time-frequency
distribution of each topic. Taking topic 49 as an example, we split its series into a training
set (230 observations) and a test set (14 observations) to prepare for forecasting.
Figure 1
We chose Autoregressive Integrated Moving Average (ARIMA) as our final forecasting model
which takes three parameters p, d, q. To find out the optimal parameters of the ARIMA model,
we need to find out some time-frequency distribution characteristics of the topic.
Through simple observation, we found that the distribution of some topics changes less over
time (a), while others change greatly (b).
(a) (b)
We want to find the topics that change substantially over time and forecast their subsequent
trends. Therefore, we conduct a white noise test on the data to detect whether the time series
is purely random. We use the Ljung-Box method (LB test) to test whether the series has any
lag correlation, and thus whether it shows overall correlation or randomness. The white noise
test is usually performed together with a stationarity test, since a white noise series is
itself stationary while a stationary series is not necessarily white noise. If the LB test
cannot distinguish a topic's series from white noise, we consider that topic to have no
predictive value (random distribution).
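The statistic behind the LB test can be computed directly. The sketch below implements the Ljung-Box statistic Q = n(n+2) Σ_{k=1}^{h} ρ_k² / (n − k) in NumPy, where ρ_k is the lag-k sample autocorrelation. The two series are synthetic stand-ins, not the topic frequencies from the report, and the p-value step (comparing Q with a chi-square distribution with h degrees of freedom) is left out.

```python
import numpy as np


def ljung_box_q(x: np.ndarray, max_lag: int) -> float:
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n - k)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = (xc[k:] * xc[:-k]).sum() / denom   # lag-k sample autocorrelation
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


rng = np.random.default_rng(0)
noise = rng.normal(size=200)                 # white noise: Q should be small
trend = np.arange(200) + noise               # strongly autocorrelated series
print(ljung_box_q(noise, 10), ljung_box_q(trend, 10))
```

A large Q (relative to the chi-square critical value, about 18.3 for 10 lags at the 5% level) means the white noise hypothesis is rejected, so the series contains structure worth modeling.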
We assume that the frequency of topic 49 is not randomly distributed over time.
Figure 1 shows that the frequency series is not stable over time, so we need a more stable
series to support forecasting. Taking the first-order difference of the original data is a
common remedy. However, we need some tests to justify doing so.
i. ADF test
ii. ACF/PACF
For both the raw data and the first-order difference, the ACF and PACF values lie within the
confidence bounds, so no conclusion about the values of p and q can be drawn from them.
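The differencing and the ACF check can be sketched in NumPy as follows. The series is a synthetic random walk of length 244 (matching the 230 + 14 split only in size), not the report's data, and the band ±1.96/√n is the usual large-sample approximation of the 95% bounds. PACF is not covered by this sketch.

```python
import numpy as np


def acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelations for lags 1..max_lag."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    denom = (xc ** 2).sum()
    return np.array([(xc[k:] * xc[:-k]).sum() / denom for k in range(1, max_lag + 1)])


def significant_lags(x: np.ndarray, max_lag: int = 10) -> list:
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = 1.96 / np.sqrt(len(x))
    return [k + 1 for k, r in enumerate(acf(x, max_lag)) if abs(r) > bound]


# Hypothetical topic-frequency series (random walk), not the report's data.
rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=244))
diffed = np.diff(series)                     # first-order difference

print("raw:", significant_lags(series))
print("diff:", significant_lags(diffed))
```

When no lag stands out, as reported for topic 49, the plots give no cut-off from which to read p and q, and a search over candidate orders is the usual fallback.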
6.2.2 Determine p, q
Since the ACF/PACF plots cannot settle the values of p and q, we iterate over candidate
(p, q) pairs and select the pair with the minimum BIC, obtaining p = 2 and q = 3.
In this project, we explored three different methods for natural language processing and applied
them to the task of news headline classification and search. We utilized topic modeling with
BERTopic, clustering with K-means, and zero-shot learning with the BART model to analyze
news headlines and generate relevant topics and clusters.
Through our experiments, we found that BART was the most suitable model for our
classification task due to its strong performance in generating accurate similarity scores and
effectively capturing context with its hybrid architecture. We also developed a web application
that allows users to input news headlines and receive the top four relevant topics with their
similarity scores.
In addition to news headline classification, we also combined the methods to build a simple
search engine that efficiently handles text queries and displays accurate, relevant results. We
used Flask for back-end development and JavaScript for front-end design to create a user-
friendly interface.
Furthermore, we explored time series analysis with the ARIMA model to predict the frequency
trend of different topics over time. We found that topics with significant changes over time can
be predicted with a reasonable level of accuracy.
Overall, our project demonstrates the potential of natural language processing techniques in
understanding and analyzing news headlines. Our web application and search engine can
provide users with quick and accurate access to relevant news topics. However, there are still
limitations to our methods, such as the need for large computational resources and the potential
for biases in topic modeling.
8. References
[1] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are
unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[4] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language
understanding," arXiv preprint arXiv:1810.04805, 2018.
[5] M. Grootendorst, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure,"
arXiv preprint arXiv:2203.05794, 2022.