Project Example

Uploaded by

benjaminxin11

Project Report

1. Introduction

In this group project, we are given the “A Million News Headlines” dataset, on which we
perform three NLP tasks: Clustering, Topic Modeling, and Headline Classification. Starting from
the provided Kaggle starter notebooks, we examine the classical methods and extend them
with “deeper” algorithms.

In Section 2, we present the data exploration and preprocessing that support the later
experiments. Sections 3 to 5 then formulate and explore the three major NLP tasks. In
Section 6, we combine these methods in a simple web-based search engine.

2. Data Exploration and Preprocessing

Before applying Machine Learning methods to explore the latent semantic space of the given
dataset, we compute some basic statistics and find the following:
1) The total number of headlines is 1244184, including 31180 duplicated headlines.
2) The total number of unique words is 108058, including stop words and inflected words
with the same word stem.
3) Publication dates range from 2003/02/19 to 2021/12/31.

To filter out uninformative terms and facilitate further exploration, we preprocess the
dataset in three steps:
1) Deduplication: Using the pandas library, we delete redundant examples with an identical
headline.
2) Stemming: A SnowballStemmer from the nltk library is applied to reduce inflected words
to their word stem.
3) Stop word removal: Stop words are the most common words, which appear in nearly every
document. Ignoring them filters out a considerable amount of uninformative content and
makes learning more efficient.
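
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `toy_stem` is a crude suffix stripper standing in for the nltk SnowballStemmer used in the project, so its stems will not match the real stemmer in general.

```python
import re

# Tiny illustrative stop-word list (an assumption; nltk ships a much fuller one).
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "for", "on", "over"}


def toy_stem(word):
    """Crude suffix stripper standing in for nltk's SnowballStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(headlines):
    """Deduplicate headlines, drop stop words, and stem the remaining tokens."""
    # dict.fromkeys keeps the first occurrence of each headline, in order.
    unique = list(dict.fromkeys(headlines))
    cleaned = []
    for text in unique:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Stop words are filtered before stemming so that they are matched
        # in their original spelling.
        cleaned.append([toy_stem(t) for t in tokens if t not in STOP_WORDS])
    return cleaned


sample = [
    "police investigating crash in sydney",
    "police investigating crash in sydney",  # exact duplicate, removed
    "council plans for the new water funding",
]
result = preprocess(sample)
print(result)
```

With pandas, the deduplication step corresponds to calling `drop_duplicates` on the headline column.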

After the preprocessing:

1) The total number of headlines is 1213004.
2) The total number of unique words is 77914.
3) Publication dates still range from 2003/02/19 to 2021/12/31.
4) The most frequent words are shown below:

[Figure: most frequent words after preprocessing]

3. Method 1: Clustering
After the preprocessing procedure, we use clustering to explore the features of the headline
data. Clustering is a type of unsupervised learning technique where we group similar data
points into clusters. It can be useful when we don't have labeled data and we want to explore
the structure of the data. Here we use k-means to find the potential clustering of headline data.

3.1 Determining k

Since the original data has no labels for testing, we need to decide the number of clusters
(that is, the k value of k-means) ourselves. By trying different k values and evaluating the
results with the elbow method and the silhouette score, we can choose a suitable k value.
1) Elbow method
The elbow method is a technique used to determine the optimal number of clusters in a
dataset for clustering algorithms like K-Means. It works by plotting the sum of squared
distances between data points and their assigned cluster center (also known as the
"inertia") against different values of k, the number of clusters.
The idea is that as the number of clusters increases, the distance between each data
point and its assigned cluster center decreases, resulting in a lower inertia value.
However, after a certain point, adding more clusters doesn't result in a significant
decrease in inertia. This point is known as the "elbow point", and it indicates the optimal
number of clusters.
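We use the sum of squared errors (SSE) to measure inertia:

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where C_j is the j-th cluster and \mu_j is its center, and we take k at the elbow point.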
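
As a small illustration of the inertia used by the elbow method, the sketch below computes the SSE for hand-assigned clusters on four toy points. The points, labels, and centers are made up for the example; in the project, k-means itself produces the assignments and centers.

```python
def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single center at the mean of all points.
sse_k1 = sse(points, [0, 0, 0, 0], [(5.0, 1.0)])
# k = 2: one center for each well-separated pair of points.
sse_k2 = sse(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(sse_k1, sse_k2)
```

Going from k = 1 to k = 2 removes most of the SSE here; on real data, the elbow is the k after which such drops become small.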

2) Silhouette score
The silhouette score is a metric used to evaluate the quality of clustering results. It
provides a measure of how similar a data point is to its own cluster compared to other
clusters, and it ranges from -1 to 1. A high silhouette score indicates that a data point is
well-matched to its own cluster and poorly-matched to neighboring clusters, while a
low score indicates that a data point may be assigned to the wrong cluster.

To calculate the silhouette score for a single data point i, we first calculate two values:
a(i) [intra-cluster dissimilarity] = the average distance from point i to all other points
in the cluster it belongs to
b(i) [inter-cluster dissimilarity] = the average distance from point i to all points in the
closest other cluster
We then compute the silhouette score as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

The overall score is the average of s(i) over all data points. A score of 0 indicates that
the data points may be overlapping or very close to the decision boundary between clusters.
As a rule of thumb, a silhouette score of 0.5 or higher is considered to indicate a good
clustering solution.
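
The definition of a(i) and b(i) can be checked on a toy example. The sketch below is a plain-Python illustration on four one-dimensional points with made-up labels, using Euclidean distance; returning 0 for singleton clusters is a convention assumed here, not something stated in this report.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette scores s(i) = (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster.
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dist_by_cluster.pop(labels[i], None)
        if not own or not dist_by_cluster:
            # Singleton cluster or only one cluster overall: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # intra-cluster dissimilarity
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # closest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

The two tight, well-separated groups give a mean score close to 1, as expected for a clean clustering.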

Based on these two criteria, we choose k = 50.

3.2 Model building

Using k = 50 as the parameter, we get the following clustering results (only 15 clusters
are displayed):

[Figure: clustering results for 15 of the 50 clusters]

4. Method 2: Topic Modeling

As an important unsupervised Machine Learning method, Topic Modeling can be used to
explore and analyze the latent space of a collection of documents: documents are clustered
by a pre-defined mechanism, and multiple topics can be summarized from the clusters.

Generally, a Topic Modeling task incorporates the following steps:

1) Document Vectorization: In order to extract latent topics from given documents, we will
usually perform vectorization to preprocess the data and gain a dictionary of all the words.
Based on the dictionary, the vectorizer can convert each document to a mathematical
representation, i.e. an array.
2) Document Modeling: For each document d_i, approximate it as a linear combination of
topics. Draw its topic proportion θ_i^d ~ Cat(d_i) based on the pre-defined mechanisms (e.g.
LSA, LDA, to be discussed later).
3) Document Embedding: With extracted topics, perform embeddings for each document d_i.
This step can also be completed using a pre-defined word embedder.
4) Document Clustering: With all embeddings, cluster documents into k clusters.
5) Cluster Modeling: For each cluster c_j (1 ≤ j ≤ k), draw its topic proportion θ_j^c ~ Cat(c_j).
6) Topic Representation: For each cluster c_j, choose its top n topics based on θ_j^c. For each
document d_i, choose its top m topics based on θ_i^d.
Note that Cat(∙) denotes the categorical distribution.

In this section, we will first introduce and examine two classical Topic Modeling methods:
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), after which a
recently proposed DNN-based method called BERTopic will be introduced. Finally,
evaluations of the three methods are presented. For more details of the training process, please
refer to the attached “Topic_Modeling.ipynb”.

4.1 Latent Semantic Analysis (LSA)

As introduced in our lectures, LSA is mainly based on the Singular Value Decomposition
(SVD) process. During the implementation, a sparse matrix called term-document matrix is
generated, which describes the frequencies (typically TF-IDF) of words in each document.
Subsequently, the matrix will be decomposed into three matrices: U, Σ, and V, by SVD so that
the number of rows (words) can be reduced without significantly affecting the similarity
structure among columns (documents).
Topics for each document can be derived from the matrix by finding the top n words that are
most related to the document. In this section, we perform LSA to get 10 topic clusters using a
downsized dataset with 50K headlines:

Topic cluster | Top 10 words | # of headlines
0 | polic, man, charg, new, say, court, death, murder, car, crash | 5813
1 | new, say, plan, council, govt, australia, year, nsw, health, water | 6483
2 | man, charg, court, face, murder, jail, accus, new, die, kill | 1603
3 | say, plan, council, govt, australia, urg, nsw, chang, fund, water | 2755
4 | plan, council, govt, urg, water, fund, nsw, warn, health, hous | 6774
5 | australia, win, court, nsw, death, day, face, warn, crash, report | 17498
6 | court, charg, face, govt, murder, council, accus, death, hear, told | 2229
7 | council, govt, man, urg, fund, nsw, water, hit, qld, wa | 1924
8 | council, australia, charg, court, face, say, day, polic, new, murder | 605
9 | win, australian, crash, kill, court, council, open, car, die, world | 4316
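
For reference, the decomposition that LSA relies on can be written compactly. The notation below (X for the term-document matrix with m terms and n documents, r for the number of retained topics) is introduced here for illustration and does not appear in the original report.

```latex
X \approx U_r \Sigma_r V_r^{\top}, \qquad
U_r \in \mathbb{R}^{m \times r}, \quad
\Sigma_r \in \mathbb{R}^{r \times r}, \quad
V_r \in \mathbb{R}^{n \times r}
```

Here the columns of U_r relate terms to topics, the rows of V_r relate documents to topics, and the diagonal of \Sigma_r holds the r largest singular values; r = 10 would correspond to the ten topic clusters reported in this subsection.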

4.2 Latent Dirichlet Allocation (LDA)

Like LSA, LDA builds on the distributional hypothesis that words with close meanings are
likely to occur in similar documents. However, LDA is a generative statistical model that
regards each document as a probability distribution over topics and each topic as a
probability distribution over words, both with Dirichlet priors; the term-document matrix is
assumed to be generated from these distributions. Rather than using SVD, LDA fits these
distributions through probabilistic inference, and it is generally reported to perform
better than LSA. The results of the LDA topic model on 50K headlines are as follows:

Topic cluster | Top 10 words | # of headlines
0 | australia, open, investig, arrest, make, farmer, indigen, time, region, park | 5048
1 | polic, charg, face, miss, set, drug, end, adelaid, mp, victim | 4807
2 | govern, final, dead, group, say, train, forc, brisban, need, run | 5076
3 | govt, fund, school, test, protest, record, boost, worker, work, health | 5123
4 | say, court, plan, death, water, hous, polic, woman, accus, melbourn | 4936
5 | man, kill, year, crash, win, attack, murder, qld, car, claim | 5121
6 | new, australian, cut, case, flood, coronavirus, famili, high, driver, ban | 5065
7 | urg, hospit, elect, china, rise, fear, nation, price, hit, market | 4881
8 | council, sydney, wa, nsw, day, warn, chang, world, die, talk | 4904
9 | report, help, sa, home, coast, south, minist, fight, north, meet | 5039

4.3 BERTopic

In recent years, Bidirectional Encoder Representations from Transformers (BERT), developed
by Google [4], has been applied to a wide range of NLP tasks, including but not limited to
Language Understanding and Question Answering, and has achieved state-of-the-art
performance. Its deep bidirectional representations are pre-trained on unlabeled text by
jointly conditioning on both left and right context, which allows BERT to outperform
classical unsupervised methods in NLP.

As an application of BERT-style embeddings to Topic Modeling, BERTopic [5] improves this
process by leveraging several clustering techniques and a class-based variation of TF-IDF
(c-TF-IDF) to extract coherent topic representations. New mechanisms are adopted in the
following stages:
1) Document Embedding: An embedder based on the Sentence-Transformers framework converts
headlines to dense vector representations so that the embedded headlines can be clustered
more accurately. Here we choose “SentenceTransformer('all-mpnet-base-v2')” as the
embedding model, which is among the strongest general-purpose models in that framework.
2) Document Clustering: BERTopic combines the dimension reduction method UMAP with the
hierarchical density-based clustering algorithm HDBSCAN. UMAP preserves much of the
structure of the high-dimensional data, and HDBSCAN can then effectively cluster the
reduced dense vector representations. In addition, UMAP can be used across different
language models, as it is not computationally restricted by the embedding dimensions.
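3) Topic Representation: To model topic representations based on the word distribution
of each topic cluster, a modified version of TF-IDF is adopted. The classical TF-IDF
measures how important a term is to a document:

$$W_{t,d} = tf_{t,d} \cdot \log\left(\frac{N}{df_t}\right)$$

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents
containing t, and N is the total number of documents. To apply a similar idea to the
importance of term t to cluster c, we use the modified version called c-TF-IDF [5]:

$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_t}\right)$$

where tf_{t,c} is the frequency of term t in cluster c, tf_t is the frequency of t across all
clusters, and A is the average number of words per cluster. Here the inverse class
frequency takes the place of the inverse document frequency.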
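
To make the class-based weighting concrete, the following plain-Python sketch scores terms per cluster with the c-TF-IDF formula from [5], W(t, c) = tf(t, c) · log(1 + A / tf(t)), where A is the average number of words per cluster. The two clusters and their tokens are made up, and the BERTopic library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score each term per class; class_tokens maps a class label to its token list."""
    tf_per_class = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    tf_total = Counter()
    for counts in tf_per_class.values():
        tf_total.update(counts)
    # A: average number of words per class.
    avg_words = sum(tf_total.values()) / len(class_tokens)
    return {
        c: {t: tf * math.log(1 + avg_words / tf_total[t]) for t, tf in counts.items()}
        for c, counts in tf_per_class.items()
    }


# Hypothetical clusters: all headlines of a cluster are joined into one token list.
clusters = {
    "sport": ["cricket", "cricket", "cup", "win"],
    "crime": ["court", "court", "murder", "win"],
}
weights = c_tf_idf(clusters)
top_sport = max(weights["sport"], key=weights["sport"].get)
print(top_sport, weights["sport"])
```

The word "win" occurs in both clusters and therefore receives the lowest weight in each, while cluster-specific words such as "cricket" rise to the top.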

Finally, the results output by BERTopic are as follows:

Topic cluster | Top 10 words | # of headlines
-1 | river, indigen, offic, shoot, cost, teen, food, adelaid, babi, assault | 26004
0 | cup, fund, test, polici, promis, drop, pacif, welfar, feder, posit | 906
1 | england, cricket, ash, odi, india, socceroo, cup, lanka, pont, sri | 845
2 | murder, guilti, plead, appeal, manslaught, lawyer, killer, trial, juri, verdict | 823
3 | china, burma, trade, thai, thailand, taiwan, kong, hong, myanmar, free | 544
4 | drug, alcohol, heroin, bust, traffick, cocain, seiz, raid, liquor, meth | 479
5 | sexual, sex, porn, rape, paedophil, child, offend, assault, offenc, abus | 456
6 | fish, shark, whale, prawn, lobster, tuna, fishermen, dolphin, seafood, fisher | 424
7 | doctor, medic, surgeri, health, patient, medicar, clinic, wait, ama, dr | 407
8 | highway, fatal, truck, crash, driver, car, die, road, hurt, injur | 369
9 | drum, hour, countri, monday, tuesday, wednesday, 2014, grandstand, 2015, thursday | 349

The following heatmap (i.e. similarity matrix) is generated:

[Figure: topic similarity heatmap]

4.4 Evaluation: Topic Coherence (TC) and Topic Diversity (TD)

Since Topic Modeling is an unsupervised method, which does not include any labeled data for
accuracy-based metrics, in this section, we evaluate the performance of the three topic models
using Topic Coherence (TC) and Topic Diversity (TD).

Topic Model | TC | TD
LSA | 0.003 | 0.390
LDA | -0.094 | 0.980
BERTopic | -0.090 | 0.981

It can be observed that the TCs of all three models are low: even the highest (LSA, 0.003)
is only marginally above 0. One possible cause is underfitting, as we only use 50K
headlines in this task, so the models may fail to fully capture the relationships among
words.
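
The report does not spell out how TD is computed. A common definition, assumed here, is the fraction of unique words among the top words of all topics. The sketch below applies it to the ten LSA topics listed in Section 4.1; under this assumption it gives 0.39, which agrees with the LSA value in the table above.

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    words = [word for topic in topics for word in topic]
    return len(set(words)) / len(words) if words else 0.0


# Top 10 words of the ten LSA topics, copied from Section 4.1.
lsa_topics = [
    "polic man charg new say court death murder car crash",
    "new say plan council govt australia year nsw health water",
    "man charg court face murder jail accus new die kill",
    "say plan council govt australia urg nsw chang fund water",
    "plan council govt urg water fund nsw warn health hous",
    "australia win court nsw death day face warn crash report",
    "court charg face govt murder council accus death hear told",
    "council govt man urg fund nsw water hit qld wa",
    "council australia charg court face say day polic new murder",
    "win australian crash kill court council open car die world",
]
td_lsa = topic_diversity([topic.split() for topic in lsa_topics])
print(td_lsa)
```

A low TD such as this one means that many topics share the same top words, which matches the visibly overlapping LSA topics.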

5. Method 3: Headline Classification

5.1 Introduction
As a part of a group project focusing on the classification of news headlines, this section aims
to explore zero-shot learning techniques in the context of natural language processing (NLP).
We investigate the performance of various state-of-the-art NLP models, including GPT-2,
GPT-3, BERT, and BART, in classifying news headlines into predefined categories.

We examine different data pre-processing methods to enhance model performance and develop
a web application that allows users to input news headlines and receive the top 4 relevant topics
with their similarity scores. This section contributes to the overall project by demonstrating the
potential of zero-shot learning techniques in news headline classification and providing a
practical tool for real-time contextual understanding of news articles. The models and
techniques used in this part of the experiment are described below.

5.2 Experiment Overview:

GPT-2: Introduced by Radford et al. [1], the Generative Pre-trained Transformer 2 (GPT-2) is
a large-scale generative language model that has demonstrated impressive performance in
various NLP tasks, including text classification. The model uses a unidirectional Transformer
architecture and has been pretrained on a diverse range of web text. Although GPT-2 has shown
remarkable results in several NLP tasks, its unidirectional nature can be a limitation for certain
classification problems.

GPT-3: Developed by Brown et al. [2], GPT-3 was one of the largest language models at the
time of its release, with 175 billion parameters. It has achieved strong results on various
NLP benchmarks and can be applied to a wide range of tasks in zero-shot and few-shot
settings, without task-specific fine-tuning. However, its massive size can pose challenges
in terms of computational resources and deployment.

BERT: Proposed by Devlin et al. [4], the Bidirectional Encoder Representations from
Transformers (BERT) model has revolutionized the NLP field by introducing a bidirectional
architecture, allowing it to better capture context in text. BERT has been pretrained on a large
corpus of text and has demonstrated strong performance on various NLP tasks, including text
classification. Its bidirectional nature can be an advantage for classification tasks where context
is essential.

BART: Presented by Lewis et al., the Bidirectional and Auto-Regressive Transformers
(BART) model combines the strengths of BERT and GPT models by employing a bidirectional
encoder and a unidirectional decoder. BART has shown exceptional performance in various
NLP tasks, including abstractive summarization and text classification. Its hybrid architecture
can provide a more balanced approach to capturing context in text.

In addition to the models mentioned above, we also explored various data pre-processing
techniques to improve model performance. These techniques include tokenization,
lemmatization, and stop word removal, which are commonly used in NLP tasks to reduce noise
and focus on the most informative parts of the text. By investigating these models and
techniques, we aim to identify the most suitable approach for zero-shot classification of news
headlines in our project.

5.3 Data Pre-processing:

Before delving into the model selection process, it is crucial to pre-process the input data to
ensure the efficiency and accuracy of the models. Data pre-processing involves several steps,
including tokenization, stop word removal, and lemmatization. In this section, we discuss
the pre-processing techniques applied to the "A Million News Headlines" dataset for our project.

1. Tokenization: The first step in pre-processing is tokenization, which breaks the input text
into individual words or tokens. This process allows the models to better understand and
analyse the input data.

2. Stop word Removal: Stop words are common words such as "and", "the", and "in" that do
not carry significant meaning and can be safely removed from the text. By filtering out stop
words, we reduce noise and focus on the meaningful words in the headlines. This step helps
improve the performance of the models by reducing the input size and allowing them to
concentrate on relevant information.

3. Lemmatization: Lemmatization is the process of converting words to their base or dictionary
form. For example, "running" would be transformed into "run". This step helps in reducing the
dimensionality of the data and allows models to recognize different forms of the same word as
a single entity, thereby improving their ability to capture context and relationships between
words.

The pre-processed data is then used as input for the various models, including GPT-2, GPT-3,
BERT, and BART, in the model selection process. By applying these pre-processing techniques,
we ensure that the input data is clear, concise, and suitable for analysis by the models, resulting
in better performance and more accurate results in the classification task.

5.4 Model Selection and Implementation

In our experiment, we aimed to select the most suitable model for zero-shot classification of
news headlines by comparing the performance of GPT-2, GPT-3, BERT, and BART models.
We provided the same news headline to each model and analysed the similarity scores between
the input headline and the predefined topics. The model that generated the most accurate and
consistent similarity scores across various headlines was considered the most suitable for our
project.
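
The comparison procedure described above amounts to scoring one headline against a fixed list of candidate topics and keeping the best matches. The sketch below shows that ranking step only. `stub_classifier` is a hypothetical stand-in with fixed scores, not a real model; its return format mirrors the dictionary with "labels" and "scores" keys returned by the Hugging Face `pipeline("zero-shot-classification", ...)` call. A BART checkpoint such as `facebook/bart-large-mnli` could be plugged in instead, but that checkpoint name is an assumption, since the report does not name the one used.

```python
def top_topics(headline, candidate_labels, classifier, k=4):
    """Return the k best (label, score) pairs for a headline."""
    result = classifier(headline, candidate_labels)
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return pairs[:k]


def stub_classifier(headline, labels):
    """Hypothetical stand-in for a zero-shot model: fixed scores, no learning."""
    fake_scores = {
        "sports": 0.70,
        "business": 0.12,
        "crime": 0.10,
        "politics": 0.05,
        "health": 0.03,
    }
    ranked = sorted(labels, key=lambda label: fake_scores.get(label, 0.0), reverse=True)
    return {
        "sequence": headline,
        "labels": ranked,
        "scores": [fake_scores.get(label, 0.0) for label in ranked],
    }


labels = ["politics", "sports", "crime", "health", "business"]
top4 = top_topics("socceroos win world cup qualifier", labels, stub_classifier)
print(top4)
```

Keeping the classifier as a parameter makes it easy to run the same headline through several models and compare their rankings, which is how the models are compared in this section.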

GPT-2 and GPT-3 showed reasonable performance in some cases, but GPT-2 struggled with
certain headlines due to its unidirectional nature. While GPT-3 delivered impressive similarity
scores, its massive size and computational requirements posed challenges for deployment.
BERT demonstrated a strong ability to capture context, which is essential for accurate
classification, but its performance was not significantly better than BART in our experiment.

The BART model achieved the most balanced performance, generating accurate similarity
scores across different headlines and effectively capturing context with its hybrid architecture.
BART is a denoising autoencoder, which allows it to reconstruct the input text from a corrupted
version. This unique architecture enables BART to understand the relationships between
different parts of the input text, making it suitable for tasks like text classification. Additionally,
it demonstrated a better trade-off between model complexity and computational requirements
compared to GPT-3.

Based on the experimental results, we selected the BART model for our project. Its strong
performance in generating accurate similarity scores and effectively capturing context,
combined with its relatively lower computational requirements, makes it a suitable choice for
zero-shot classification of news headlines. In the following sections, we will discuss the
implementation details, including data pre-processing techniques and the integration of the
BART model into our system. The following images show the results of the program on the
front-end page.

[Figures: classification results on the front-end page]

5.5 Extra Attempt: Handling Chinese Query

In our initial exploration, we also briefly attempted to handle Chinese characters in order to
improve the classification performance for Chinese news headlines. We used the "jieba" library,
which is popular for Chinese text segmentation. However, we did not integrate this method
into the final version of our code.

The preliminary approach involved tokenization and keyword extraction using "jieba",
followed by reassembling the keywords into a string. This pre-processed string was then
classified using the pre-trained BART model. Although this method was effective in processing
Chinese text and yielded improved classification results, it was not adopted in the final
implementation due to the project's focus on English news headlines.

6. Combination of Methods: A Simple Search Engine

6.1. Search Engine Features and Implementation

In this section, we introduce the features of our search engine, which is built using a
combination of methods. The back-end is developed using Flask, while the front-end
is created with JavaScript. Users can enter text information as queries, and the
search engine processes the input using topic modeling, clustering, and zero-shot
classification techniques. The output is then displayed on the front-end for users to
view and analyze.

The image above showcases the user interface of our search engine, where users
can input text queries to explore relevant content. The search engine processes the
input data using a range of techniques, such as topic modeling, clustering, and zero-
shot classification, to generate accurate and meaningful results.
The second image demonstrates the output display of the search engine. The results
are grouped by topic, providing users with a clear and organized presentation of the
information. This allows users to easily identify and focus on the topics that are most
relevant to their interests.

In summary, our search engine combines multiple methods to effectively process
and present information to users. By leveraging the power of Flask for back-end
development and JavaScript for front-end design, we can create a user-friendly
interface that efficiently handles text queries and displays accurate, relevant results.

6.2. Topic trend prediction

The BERTopic model yields about 459 topics, whose frequencies change over time. We use time
series analysis to predict the trend of these frequencies. Due to limited computing power,
we only predict the trend of a topic within one year and ignore seasonality. We use the
ARIMA model for the time series analysis, and the forecast frequencies fall within a 95%
confidence interval of ±1.53.

6.2.1 Preprocessing
First, by processing the output of the BERTopic model, we obtain the frequency of each topic
over time. Taking topic 49 as an example, we divide its series into a training set (230
data points) and a test set (14 data points) for the later prediction.

[Figure 1: frequency of topic 49 over time]

We chose Autoregressive Integrated Moving Average (ARIMA) as our final forecasting model
which takes three parameters p, d, q. To find out the optimal parameters of the ARIMA model,
we need to find out some time-frequency distribution characteristics of the topic.

6.2.1.1 White noise test

By simple observation, we find that the frequency of some topics changes little over
time (a), while that of others changes greatly (b).

[Figure: example topics whose frequency (a) changes little and (b) changes greatly over time]

We want to find the topics that change greatly over time and predict their subsequent
trends. Therefore, we conduct a white noise test on each series to detect whether it is
purely random. We use the Ljung-Box test (LB test), which checks whether the series has
significant autocorrelation at any of a set of lags and thus whether it is correlated or
random overall. The white noise test is usually performed together with the stationarity
test (a white noise series is itself stationary, but a stationary series is not
necessarily white noise). If the test cannot reject the hypothesis that a topic's series is
white noise, we consider the topic to have no predictive value (random distribution).
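
For illustration, the Ljung-Box statistic itself is simple to compute: Q = n(n + 2) · Σ_{k=1..h} ρ_k² / (n − k), where ρ_k is the sample autocorrelation at lag k, n the series length, and h the number of lags. The sketch below is a plain-Python version run on a made-up trending series; the project's actual test call is not shown in the report. Under the white-noise hypothesis, Q is compared with a chi-square distribution with h degrees of freedom, whose 95% critical value for h = 1 is about 3.841.

```python
def ljung_box_q(series, lags):
    """Ljung-Box Q statistic over lags 1..lags."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


# Made-up series with a clear upward trend, i.e. strongly autocorrelated.
trend = [float(t) for t in range(24)]
q_trend = ljung_box_q(trend, lags=1)

CHI2_95_DF1 = 3.841  # 95% critical value of chi-square with 1 degree of freedom
print(q_trend, q_trend > CHI2_95_DF1)
```

Here Q far exceeds the critical value, so the white-noise hypothesis is rejected and the series is worth modeling. In practice, `acorr_ljungbox` from statsmodels reports the statistic together with its p-value.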

The result of topic 49’s white noise test:

We therefore treat the frequency of topic 49 as not randomly distributed over time.

6.2.1.2 Stationarity test

From Figure 1 we can see that the frequency series is not stationary, so we need a more
stationary series for prediction. Taking the first-order difference of the original data is
a common way to achieve this. However, we need some tests to justify doing so.
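
First-order differencing replaces each value with its change from the previous value, which removes a linear trend. The sketch below shows the operation on a made-up series; the `order` argument plays the role of the ARIMA parameter d.

```python
def difference(series, order=1):
    """Apply differencing `order` times: y[t] = x[t] - x[t - 1]."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


trend = [3, 5, 8, 12, 17]
first = difference(trend)       # changes between consecutive values
second = difference(trend, 2)   # changes of those changes
print(first, second)
```

Each round of differencing shortens the series by one value, so forecasts made on the differenced series have to be cumulatively summed to return to the original scale.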

i. ADF test

After first-order differencing, the p-value of the ADF test drops significantly. Therefore,
we choose d = 1.

ii. ACF/PACF

Result of raw data’s ACF/PACF:

Result of first order difference’s ACF/PACF:

For both the raw data and the first-order difference, the ACF/PACF values lie within the
confidence bands, so we cannot determine p and q from these plots.

6.2.2 Determine p, q

Since ACF/PACF cannot determine p and q, we iterate over candidate values and select the
pair with the minimum BIC, which gives p = 2 and q = 3.
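
For reference, the criterion minimized in this search is the Bayesian Information Criterion. The symbols below are introduced here for illustration: m is the number of estimated model parameters (it grows with p and q), n is the number of observations, and \hat{L} is the maximized likelihood of the fitted model.

```latex
\mathrm{BIC} = m \ln(n) - 2 \ln(\hat{L})
```

The first term penalizes larger models and the second rewards goodness of fit, so the minimum-BIC pair (p, q) balances the two.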

The performance of the model:

We examine the residuals of the fitted model and obtain a p-value < 0.05.

6.2.3 Final prediction

7. Discussion and Conclusion

In this project, we explored three different methods for natural language processing and applied
them to the task of news headline classification and search. We utilized topic modeling with
BERTopic, clustering with K-means, and zero-shot learning with the BART model to analyze
news headlines and generate relevant topics and clusters.

Through our experiments, we found that BART was the most suitable model for our
classification task due to its strong performance in generating accurate similarity scores and
effectively capturing context with its hybrid architecture. We also developed a web application
that allows users to input news headlines and receive the top four relevant topics with their
similarity scores.

In addition to news headline classification, we also combined the methods to build a simple
search engine that efficiently handles text queries and displays accurate, relevant results. We
used Flask for back-end development and JavaScript for front-end design to create a user-
friendly interface.

Furthermore, we explored time series analysis with the ARIMA model to predict the frequency
trend of different topics over time. We found that topics with significant changes over time can
be predicted with a reasonable level of accuracy.

Overall, our project demonstrates the potential of natural language processing techniques in
understanding and analyzing news headlines. Our web application and search engine can
provide users with quick and accurate access to relevant news topics. However, there are still
limitations to our methods, such as the need for large computational resources and the potential
for biases in topic modeling.

8. References

[1] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are
unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.

[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P.
Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child,
A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei,
"Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.

[3] A. Choudhary, M. Alugubelly, and R. Bhargava, "A Comparative Study on Transformer-based
News Summarization," in 2023 15th International Conference on Developments in
eSystems Engineering (DeSE), Baghdad & Anbar, Iraq, 2023, pp. 256-261, doi:
10.1109/DeSE58274.2023.10099798.

[4] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language
understanding," arXiv preprint arXiv:1810.04805, 2018.

[5] M. Grootendorst, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure,"
arXiv preprint arXiv:2203.05794, 2022.

2019 - Latent Dirichlet Allocation (LDA) and Topic Modeling: Models, Applications, A Survey
43 pages
The Future of Search
From Everand
The Future of Search
Andres J. Clary
No ratings yet
SNLP Overview
No ratings yet
SNLP Overview
43 pages
7.2 Latent
No ratings yet
7.2 Latent
27 pages
NLP Notes-1
No ratings yet
NLP Notes-1
54 pages
Data Structures and Algorithms with Python
From Everand
Data Structures and Algorithms with Python
Aadinath Pothuvaal
No ratings yet
A Beginner's Guide To Latent Dirichlet Allocation (LDA)
No ratings yet
A Beginner's Guide To Latent Dirichlet Allocation (LDA)
9 pages
Machine Learning for data science Unit-5
No ratings yet
Machine Learning for data science Unit-5
10 pages
Topic Modelling and LSA
No ratings yet
Topic Modelling and LSA
10 pages
Sessionppt Topicmoelling
No ratings yet
Sessionppt Topicmoelling
40 pages
Topic Models Dsi Talk March 2017
No ratings yet
Topic Models Dsi Talk March 2017
24 pages
Draft: Automatic Topic Labeling Using Ontology-Based Topic Models
No ratings yet
Draft: Automatic Topic Labeling Using Ontology-Based Topic Models
7 pages
Ljubesic08 Document
No ratings yet
Ljubesic08 Document
6 pages
Improve Text Classification Accuracy Based On Classifier Fusion Methods
No ratings yet
Improve Text Classification Accuracy Based On Classifier Fusion Methods
6 pages
Lecture 6 - From Unstructured Texts to Structure Data I
No ratings yet
Lecture 6 - From Unstructured Texts to Structure Data I
17 pages
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
No ratings yet
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
6 pages
2018 Conference Paper 2
No ratings yet
2018 Conference Paper 2
9 pages
Topic Modeling Clustering of Deep Webpages
No ratings yet
Topic Modeling Clustering of Deep Webpages
9 pages
Mastering Algorithms and Data Structures
From Everand
Mastering Algorithms and Data Structures
Manish Soni
No ratings yet
Sbalchiero Topicmodelinglongtextsand
No ratings yet
Sbalchiero Topicmodelinglongtextsand
14 pages
Adison Wongkar, Christoph Wertz, What Are People Saying About Net Neutrality
No ratings yet
Adison Wongkar, Christoph Wertz, What Are People Saying About Net Neutrality
5 pages
Experiments With Non Parametric Topic Models
No ratings yet
Experiments With Non Parametric Topic Models
10 pages
News Article Category Predictor
No ratings yet
News Article Category Predictor
6 pages
Text Classification MLND Project Report Prasann Pandya
No ratings yet
Text Classification MLND Project Report Prasann Pandya
17 pages
Topic Modeling Text Clustering Based On Deep Learning Model
No ratings yet
Topic Modeling Text Clustering Based On Deep Learning Model
11 pages
Improving Topic Models With Latent Feature Word Representations
No ratings yet
Improving Topic Models With Latent Feature Word Representations
16 pages
OpenAI O3 and the New Era of Smart AI Models
No ratings yet
OpenAI O3 and the New Era of Smart AI Models
5 pages
Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
No ratings yet
Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
45 pages
Ask Your PDF (Thesis)
No ratings yet
Ask Your PDF (Thesis)
42 pages
Data + AI Summit 2024 - Keynote Day 2
No ratings yet
Data + AI Summit 2024 - Keynote Day 2
32 pages
A Single Model Is Not All You Need
No ratings yet
A Single Model Is Not All You Need
11 pages
ChatGPT-Ebook-4th-edition
No ratings yet
ChatGPT-Ebook-4th-edition
109 pages
Prompt Engineering - Links and Resources
No ratings yet
Prompt Engineering - Links and Resources
2 pages
Introduction To Azure AI PPT
No ratings yet
Introduction To Azure AI PPT
32 pages
Impact of Gen AI in Cybersecurity & Privacy
No ratings yet
Impact of Gen AI in Cybersecurity & Privacy
27 pages
AI API Course
No ratings yet
AI API Course
85 pages
2401.06466v1
No ratings yet
2401.06466v1
13 pages
ChatGPT in the Classroom the Future of Educational AI From Elementary to University - Transformative Strategies for... (Hussaini, Saif) (Z-Library)
No ratings yet
ChatGPT in the Classroom the Future of Educational AI From Elementary to University - Transformative Strategies for... (Hussaini, Saif) (Z-Library)
153 pages
Fantastically Ordered Prompts and Where To Find Them: Overcoming Few-Shot Prompt Order Sensitivity
No ratings yet
Fantastically Ordered Prompts and Where To Find Them: Overcoming Few-Shot Prompt Order Sensitivity
13 pages
Build A Chatbot On Your CSV Data With LangChain and OpenAI
No ratings yet
Build A Chatbot On Your CSV Data With LangChain and OpenAI
5 pages
openai_response_aia_white-paper
No ratings yet
openai_response_aia_white-paper
7 pages
Cheat-Sheet-Azure-AI-Engineer-Associate-AI-102
No ratings yet
Cheat-Sheet-Azure-AI-Engineer-Associate-AI-102
29 pages
Large Language Models For Business Process Management
No ratings yet
Large Language Models For Business Process Management
18 pages
Project Report
No ratings yet
Project Report
12 pages
Virtual Assitant with NLP Proposal
No ratings yet
Virtual Assitant with NLP Proposal
9 pages
Contrash
No ratings yet
Contrash
68 pages
Document From Manishkumar
No ratings yet
Document From Manishkumar
48 pages
NL2Color Refining Color Palettes For Charts With Natural Language
No ratings yet
NL2Color Refining Color Palettes For Charts With Natural Language
11 pages
Generating Music Using AI: Ebba Rickard
No ratings yet
Generating Music Using AI: Ebba Rickard
66 pages
Chat GPT For Hacking 230130 133243
100% (1)
Chat GPT For Hacking 230130 133243
43 pages
Week2 Llms
No ratings yet
Week2 Llms
25 pages
GPT4架构揭秘
No ratings yet
GPT4架构揭秘
12 pages
I. From Gpt-4 To Agi Counting The Ooms - Situational Awareness
No ratings yet
I. From Gpt-4 To Agi Counting The Ooms - Situational Awareness
4 pages
ChatGPT's Influence On Customer Experience in Digital Marketing
No ratings yet
ChatGPT's Influence On Customer Experience in Digital Marketing
11 pages
An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures
No ratings yet
An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures
11 pages


