Large Language Models Are Zero-Shot Text Classifiers
Zhiqiang Wang, Yiran Pang, Yanbin Lin
Department of Electrical Engineering and Computer Science, Florida Atlantic University
Boca Raton, FL 33431, USA
{zwang2022, ypang2022, liny2020}@fau.edu

arXiv:2312.01044v1 [cs.CL] 2 Dec 2023

Abstract: Pre-trained large language models (LLMs) have become extensively used across various sub-disciplines of natural language processing (NLP). In NLP, text classification problems have garnered considerable focus, but they still face limitations related to expensive computational cost, time consumption, and robust performance on unseen classes. With the proposal of chain of thought prompting (CoT), LLMs can be implemented using zero-shot learning (ZSL) with step-by-step reasoning prompts, instead of conventional question-and-answer formats. Zero-shot LLMs can alleviate these limitations in text classification by directly utilizing pre-trained models to predict both seen and unseen classes. Our research primarily validates the capability of GPT models in text classification. We focus on effectively applying prompt strategies to various text classification scenarios. In addition, we compare the performance of zero-shot LLMs with other state-of-the-art text classification methods, including traditional machine learning methods, deep learning methods, and ZSL methods. Experimental results demonstrate the effectiveness of LLMs as zero-shot text classifiers in three of the four datasets analyzed. This proficiency is especially advantageous for small businesses or teams that may not have extensive knowledge in text classification.

Index Terms: Zero-shot text classification, large language models, text classification, Chat GPT-4, Llama2.

I. INTRODUCTION

Text classification is considered the most fundamental and crucial task in the field of natural language processing. Over the past several decades, text classification issues have received extensive attention and have been effectively tackled in numerous practical applications [1]–[3]. These applications include sentiment analysis [4], topic labeling [5], question answering [6] and dialog act classification [7]. Due to the information explosion, processing and classifying large amounts of text data manually is time-consuming and challenging. Therefore, introducing machine learning methods for text classification tasks is indispensable.

The majority of text classification systems can typically be separated into four key stages: Feature Extraction, Dimension Reduction, Classifier Selection, and Evaluation [8]. For the classifier, traditional and popular machine learning (ML) methods include Multinomial Naive Bayes (MNB), Logistic Regression (LG), k-nearest neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Adaboost. These supervised ML methods face significant limitations, primarily that they require extensive amounts of task-specific, labeled data for training before they can effectively predict outcomes on new data [9]. Another drawback of supervised learning is that models are limited to classifying data into known classes, and are incapable of categorizing data into classes that are unseen or not labeled in the training data [10]. Recently, deep learning (DL) methods have outperformed earlier ML algorithms in natural language processing (NLP) tasks. The success of these deep learning algorithms is attributed to their ability to model intricate and non-linear relationships in data [11]. The deep learning models for text and document classification typically involve three fundamental architectures: deep neural networks (DNN), recurrent neural networks (RNN), and convolutional neural networks (CNN). An RNN usually works through Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures for text classification, encompassing an input layer (word embedding), hidden layers, and the output layer [12]. However, DL methods also need dataset labeling and a lot of training data. A dataset labeled with one specific set of classes cannot be repurposed to train a method designed to predict an entirely different set of classes.

Apart from these ML-based and DL-based methods, recent progress in NLP has resulted in the development of large language models (LLMs) such as Llama2 and ChatGPT. Existing research [13]–[15] suggests that the impressive performance of LLMs holds promise for addressing the above limitations. LLMs are sophisticated language models characterized by their extensive parameter sizes and remarkable learning abilities [16]. The effectiveness of LLMs is commonly credited to their proficiency in few-shot or zero-shot learning within context. Pre-trained LLMs are extensively employed across various sub-disciplines of NLP and are typically recognized for their exceptional ability to learn from a few examples, which is known as Few-Shot Learning (FSL) [17]. In NLP tasks, both zero-shot learning (ZSL) [18] and FSL are techniques used to enable models to perform tasks they haven’t been explicitly trained on, but they differ in their approach and reliance on training data. In FSL, the model is trained with a handful of labeled task-specific examples [19]. In contrast to FSL, ZSL directly utilizes pre-trained models to predict both known (seen) and unknown (unseen) classes without requiring any labeled training instances or fine-tuning [20]. FSL is advantageous, even with access to larger datasets, because labeling data is time-consuming and training on extensive data sets can be computationally costly [21]. ZSL, in turn, can
save even more computational cost and time by entirely skipping the steps of labeling, tokenization, data preprocessing, and feature extraction [22].

ZSL is an emerging learning paradigm that aims to tackle a task in the absence of any training examples specific to that task. LLMs using ZSL can solve various tasks by merely conditioning the models on instructions that describe the task, which is known as “prompting” [23]. The prompts can be designed manually [24] or automatically [25]. GPT-3 was assessed on tasks using zero-shot, one-shot, and n-shot prompts, which included only a natural language description, one solved example, and n solved examples, respectively [26]–[28]. These studies found that GPT-3’s zero-shot performance significantly lags behind its few-shot performance in tasks like reading comprehension, question answering, and natural language inference [27]. A potential reason for this is that without few-shot examples, it becomes more challenging for models to perform effectively on prompts that deviate from the format of the pre-training data; the same holds for GPT-3.5. Chain of thought prompting (CoT) was proposed in [29] to feed LLMs with step-by-step reasoning examples, instead of conventional question-and-answer formats. Zero-shot-CoT, a new approach introduced in [30], significantly enhanced the zero-shot performance of LLMs in various reasoning tasks, such as arithmetic, symbolic reasoning, and logical reasoning, without the need for task-specific few-shot examples. By adding a simple prompt “Let’s think step by step” before answers, the study shows that LLMs can significantly outperform standard zero-shot methods in diverse reasoning tasks. Img2LLM, a plug-and-play module, was proposed by [31] to generate effective LLM prompts that describe image content as exemplar question-answer pairs. These designed prompts enabled LLMs to perform zero-shot visual question-answering without end-to-end training. The capability of LLMs such as OpenAI’s Codex and AI21’s Jurassic J-1 was investigated in [32] for zero-shot vulnerability repair in coding. A multi-turn question-answering framework for zero-shot information extraction, called ChatIE, was introduced by [33] to leverage the capabilities of ChatGPT.

Although many studies focus on the performance of zero-shot LLMs, fewer compare the classification performance of zero-shot LLMs with traditional ML methods, DL methods, and ZSL methods. For an intuitive comparison, we design step-by-step prompts to analyze the zero-shot classification performance of state-of-the-art large language models, namely Llama2, GPT-3.5, and GPT-4. We carry out comprehensive assessments of these models using four different datasets, covering sentiment analysis, a four-class classification task, and spam detection. Various ML methods, DL methods, and ZSL methods are implemented to compare performance: MNB, LG, RF, DT, KNN, RNN, LSTM, GRU, BART [34] and DeBERTa [35]. Our main goal is to evaluate the zero-shot performance of LLMs against other existing text classification approaches.

Our main contributions are as follows:
• Innovative Application of GPT Models in Text Classification: We demonstrate how GPT models can simplify the text classification process by directly generating classification labels, thereby avoiding traditional feature extraction and classifier training steps.
• Extensive Evaluation and Comparison Across Multiple Datasets: Our research includes a wide-ranging evaluation across datasets from various domains, comparing the performance of GPT models with traditional machine learning methods and neural network models, affirming their effectiveness in text classification tasks.
• Practical Implications for Small Businesses or Teams and Open Source Contribution: We highlight the practical value of GPT models in text classification for small businesses or teams, who may lack in-depth knowledge in this area. To foster further research and application in this field, we have made our code open source, allowing community members to directly utilize and improve upon these models.

The paper is organized as follows: In Section II, we briefly discuss the background and related work. Section III presents our proposed methodology, mainly including the overview of the proposed method and the practical methodology. Section IV presents the experiment results of all methods using four different datasets. Section V discusses the results. Section VI concludes our paper. Finally, Section VII states the data and results availability.

II. BACKGROUND AND RELATED WORK

A. Rule-Based Methods

Early work in text classification primarily focused on rule-based methods. Decision tree algorithms, including C4.5 [36] and CART [37], classify texts by constructing tree structures based on feature selection criteria. These methods are easy to understand and implement. However, they may generate overly complex structures when dealing with complex or high-dimensional data, which leads to overfitting issues. Expert systems, like MYCIN [38], make decisions based on rules defined by domain experts. Pattern matching methods identify specific types of texts by matching predefined patterns or keyword sequences. They perform well in applications such as spam email filtering [39]. However, these rule-based approaches depend heavily on initial settings and may struggle to adapt to new patterns or noisy data.

B. Probability-based Methods

Compared to rule-based models, probability-based models offer greater flexibility and generalization capabilities through their mathematical frameworks and data-driven approach. The Naive Bayes classifier [40], employing Bayes’ theorem for text classification, stands out even with its assumption of feature independence. It demonstrates significant effectiveness in areas like spam detection and sentiment analysis. The Naive Bayes classifier is widely acknowledged for its simplicity and robust performance with large datasets. The Hidden Markov Model (HMM) [41] is another key probability-based model,
especially suitable for processing sequential data. In natural language processing tasks such as part-of-speech tagging and speech recognition, HMMs effectively address challenges by considering state transition probabilities.

C. Geometry-based Methods

Geometry-based methods offer a distinct perspective in handling high-dimensional data. In text classification, these methods primarily focus on the spatial relationships between data points. SVMs [42] effectively classify text by finding the optimal separating hyperplane in a high-dimensional space. This approach is particularly suited for scenarios with large and complex feature spaces, as it maximizes the margin between classes to enhance classification accuracy. However, SVMs may face challenges in computational efficiency and resource consumption when dealing with very large datasets. Techniques like Principal Component Analysis (PCA) [43] and Linear Discriminant Analysis (LDA) [44] simplify the classification task by reducing data dimensions. These methods effectively lower complexity and computational costs but may lose important information for classification during the dimensionality reduction process.

D. Statistic-based Methods

Statistic-based models in text classification utilize the statistical properties of data for decision-making. KNN [45] classifies a data point by analyzing its closest neighbors. However, KNN may encounter efficiency issues with large datasets, as classifying a point requires calculating its distance to every other data point. Logistic Regression [46] classifies by estimating the probability of data belonging to a specific category. It performs well in text classification tasks with relatively simple feature relationships. However, statistic-based models are generally sensitive to data preprocessing and feature selection. They struggle with complex or heavily nonlinear feature-rich data.

E. Deep learning Methods

Deep learning has become a key technology in text classification, capable of handling complex language features. The CNN text classification model [47] captures local textual features through convolutional layers. This approach excels in sentiment analysis and topic categorization. LSTM [48] and GRU [49], as optimized versions of RNNs, are particularly effective in addressing long-distance dependencies in text. Transformer models, like BERT [50], achieve remarkable results in various NLP tasks by utilizing self-attention mechanisms. Specifically, the BERT model demonstrates powerful capabilities in text classification tasks. However, these methods typically require substantial data for training, often necessitating extensive datasets to achieve optimal performance. This reliance on large training sets can pose challenges, especially when collecting or labeling data is difficult or impractical.

F. Zero-shot Methods

ZSL aims to classify data without direct examples of certain classes during training. This offers a solution to the limitations of data-intensive deep learning models in text classification. Methods based on knowledge graphs, as explored in recent studies, utilize auxiliary information about inter-class relationships, represented in rich semantic knowledge graphs. These methods have been instrumental in achieving state-of-the-art performance across several benchmarks and tasks [51]. Another significant advancement in ZSL is the use of semantic embedding vectors. This approach, which conceptualizes zero-shot learning as a regression problem from input to embedding space, has shown effectiveness in tasks like ImageNet zero-shot learning [52]. Furthermore, the exploration of large pre-trained language models (PLMs) such as BERT in zero-shot learning scenarios has opened new avenues. Research has revealed that strategies like Multi-Null Prompting in BERT family models can yield promising results, surpassing manually created prompts, although some limitations exist, particularly in language understanding tasks under zero-shot settings [53]. In addition to these methods, the adaptation of sophisticated PLMs like BART [34] and DeBERTa [35] has further expanded the capabilities of zero-shot learning in text classification. These models, with their advanced architectures and pre-training methodologies, have been leveraged to enhance performance in various NLP tasks, including those that require understanding and generating nuanced human language. BART, with its unique combination of bidirectional and autoregressive transformers, and DeBERTa, with its disentangled attention mechanism, exemplify the advancements in leveraging deep learning for effective zero-shot text classification.

III. METHODOLOGY

As shown in Fig. 1, traditional text classification involves three key steps: data preprocessing, feature extraction, and classifier training. Each step plays a vital role in the overall process. Given a textual input x = (x_1, x_2, ...), the first step is standardizing and preprocessing. This step includes noise removal (e.g., punctuation and special characters), stop word filtering, stemming, and lemmatization. The aim is to reduce noise and standardize text data for subsequent feature extraction. The preprocessing function P can be expressed as:

    P(x) → x′    (1)

where x′ denotes the preprocessed text. Next, the processed text x′ undergoes feature extraction. Common techniques include bag-of-words, TF-IDF, and word embeddings. This step transforms text into a numerical feature vector suitable for machine learning models. The feature extraction function F can be defined as:

    F(x′) → h    (2)

where h represents the feature vector of the text. Finally, a machine learning algorithm (e.g., Support Vector Machine,
Decision Tree, Neural Network) is employed to construct a classification model. This model predicts the category of text based on the feature vector h. The classification model C is formulated as:

    p(y|x′) = softmax(W · MLP(h))    (3)

where W denotes the trainable parameters of the classifier, typically trained from scratch. In this traditional approach, each step is essential, forming a complete text classification process. While this method can achieve high accuracy, especially when fine-tuned for specific tasks, it may have limitations in processing efficiency for large datasets and adaptability to novel task types.

Utilizing GPT models for text classification employs a single-step, prompt-based method. This streamlined approach leverages GPT’s generative capabilities to directly produce specific classification labels. The prompt is meticulously designed to guide the GPT model towards generating a precise classification label in a predefined format. Considering examples from Table I and Table II, a prompt for e-commerce product classification could be structured as follows:

    You are an AI assistant and you are very good at doing e-commerce products classification. You are going to help a customer to classify the products in the e-commerce website. You are only allowed to choose one of the following 4 categories: Household, Books, Clothing & Accessories, Electronics. Please provide only one category for each product in JSON format where the key is the index for each product and the value is one of the 4 categories. For example: {1: Household}. Please do not repeat or return the content back again, just provide the category in the defined format.

This prompt explicitly instructs the GPT model to classify products into one of four categories and express the outcome in JSON format. The classification process using this prompt can be formalized as:

    GPT-Response(Prompt) → JSON Classification    (4)

Here, GPT-Response represents the GPT model processing the prompt, and JSON Classification is the output in JSON format, indicating the category for each product.

This method efficiently utilizes the natural language processing and generative capabilities of GPT models. By directing the model to produce classification results in a specific format, it simplifies the classification process and eliminates the need for intermediate steps like feature extraction or explicit verbalizer mapping. This makes it a highly practical approach for diverse text classification tasks.

IV. EXPERIMENT AND RESULTS

While accuracy calculation is the most straightforward evaluation method, it is not effective for unbalanced datasets [54]. F1 Score, Matthews Correlation Coefficient (MCC), Accuracy (ACC), receiver operating characteristics (ROC), and area under the ROC curve (AUC) are suitable for evaluating text classification algorithms [8].

In this study, we conduct an extensive evaluation of our proposed methodologies across four distinct datasets. These datasets encompass a diverse range of applications: sentiment analysis was performed using COVID-19 related tweets (Gabriel Preda, 2020) [55] and economic texts (Malo et al., 2014) [56], a four-class classification task was applied to e-commerce texts (Gautam, 2019) [57], and spam detection was implemented on an SMS dataset (SMS Spam Collection, 2012) [58].

For each dataset, a comprehensive set of models is employed to assess the effectiveness of the proposed methods. These models fall into four categories:
• Traditional ML Algorithms: This category includes MNB, LG, RF, DT, and KNN.
• DL Architectures: We utilize advanced deep neural network models such as RNN, LSTM, and GRU.
• ZSL Models: In this category, we explore the performance of zero-shot models, specifically the transformer-based models BART (facebook/bart-large-mnli) and DeBERTa (microsoft/deberta-large-mnli).
• LLMs: State-of-the-art large language models including Llama2 (Llama2-70B), GPT-3.5 (gpt-3.5-turbo-1106), and GPT-4 (gpt-4-1106-preview) were assessed.

For all traditional ML algorithms and DL models, we maintain uniformity in the input processing. This means that each model uses the same processed text derived from a consistent raw text processing flow, encompassing both training and testing datasets. This approach ensures that variations in performance can be attributed more directly to the model’s capabilities rather than differences in input processing.

On the other hand, for zero-shot learning models and LLMs, we directly employ the raw text from the testing dataset. It is important to note that the testing dataset remained identical across all models, fostering a fair and consistent basis for comparison.

It is worth mentioning that the traditional ML algorithms and DL models do not undergo any specialized or model-specific text processing enhancements. This decision is intentional, to minimize the number of variables influencing the experimental results. While this might have resulted in performance that is not state-of-the-art for each individual model, it is crucial for maintaining the integrity of the comparative analysis. Our goal is to evaluate each model’s inherent capabilities under standardized conditions, thereby providing a more transparent and direct comparison of their performance in the context of zero-shot text classification.

In order to maintain consistency and precision in the results obtained from LLMs, we standardize the hyper-parameters
Fig. 1. Traditional text classification flow
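
As a rough illustration of the flow in Fig. 1, the following minimal Python sketch chains a preprocessing function P, a bag-of-words feature extractor F, and a classifier C. Everything in it is an assumption made for illustration, not the authors' code: the tiny stop-word list, the toy training messages, and all function and class names are hypothetical, and a Multinomial Naive Bayes classifier with add-one smoothing (one of the baselines named in this paper) stands in for the softmax form of C.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "you"}


def preprocess(text):
    """P(x) -> x': lowercase, keep alphabetic tokens only, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def extract_features(tokens):
    """F(x') -> h: bag-of-words term counts."""
    return Counter(tokens)


class MultinomialNaiveBayes:
    """Classifier C: Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, feature_vectors, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for h, y in zip(feature_vectors, labels):
            self.word_counts[y].update(h)
            self.vocab.update(h)
        return self

    def predict(self, h):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / n_docs)  # log prior
            for word, count in h.items():
                if word in self.vocab:  # ignore words never seen in training
                    likelihood = (self.word_counts[c][word] + 1) / (total + len(self.vocab))
                    score += count * math.log(likelihood)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy training data, invented for illustration only.
TRAIN = [
    ("Win a free prize now, claim your cash", "spam"),
    ("Free cash offer, click to win", "spam"),
    ("Are we still meeting for lunch today", "ham"),
    ("See you at the office tomorrow", "ham"),
]

model = MultinomialNaiveBayes().fit(
    [extract_features(preprocess(text)) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)


def classify(text):
    """Run the full flow: preprocessing, feature extraction, classification."""
    return model.predict(extract_features(preprocess(text)))


print(classify("Claim your free cash prize"))
print(classify("See you at lunch tomorrow"))
```

Note that every stage depends on labeled training data and on choices (stop words, features) made before the classifier ever sees a test message, which is the overhead the zero-shot flow avoids.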

Fig. 2. LLMs’ zero shot text classification flow
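
The single-step flow in Fig. 2 can be sketched in a similar way. The snippet below is a hedged illustration, not the authors' implementation: `build_prompt`, `parse_response`, and `classify` are hypothetical helper names, the prompt wording is adapted from the e-commerce prompt in Section III, and `call_llm` is an assumed placeholder for whatever function sends a prompt to a model and returns its text reply. Because a reply may wrap the JSON in extra text or use unquoted keys as in the paper's `{1: Household}` example, the parser first tries strict JSON and then falls back to a simple pattern match.

```python
import json
import re

LABELS = ["Household", "Books", "Clothing & Accessories", "Electronics"]


def build_prompt(texts, labels=LABELS):
    """Assemble one prompt that asks for a JSON object mapping index -> category."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(texts, start=1))
    parts = [
        "You are an AI assistant and you are very good at doing e-commerce products classification.",
        f"You are only allowed to choose one of the following {len(labels)} categories: "
        + ", ".join(labels) + ".",
        "Please provide only one category for each product in JSON format where the key is "
        "the index for each product and the value is one of the categories.",
        'For example: {"1": "' + labels[0] + '"}.',
        "Please do not repeat or return the content back again, "
        "just provide the category in the defined format.",
        "",
        numbered,
    ]
    return "\n".join(parts)


def parse_response(raw, labels=LABELS):
    """Extract {index: label} from a reply, tolerating text around the JSON object."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    body = match.group(0)
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Fallback for loosely formatted replies such as {1: Household}.
        data = dict(re.findall(r'"?(\d+)"?\s*:\s*"?([^",}]+)"?', body))
    result = {}
    for key, value in data.items():
        label = str(value).strip()
        # Keep only well-formed indices and labels from the allowed set.
        if str(key).isdigit() and label in labels:
            result[int(key)] = label
    return result


def classify(texts, call_llm, labels=LABELS):
    """Zero-shot flow: prompt -> model reply -> parsed labels. `call_llm` is assumed."""
    return parse_response(call_llm(build_prompt(texts, labels)), labels)


def fake_llm(prompt):
    """Stand-in for a real model call, returning a canned reply with extra text."""
    return 'Sure! Here you go: {"1": "Books", "2": "Electronics"} Hope this helps.'


print(classify(["A mystery novel", "USB flash drive"], fake_llm))
```

Dropping any label outside the allowed set keeps downstream evaluation from silently counting malformed replies as predictions; how such cases were handled in the paper's experiments is not stated here.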

TABLE I
EXAMPLES OF COVID-19 TWEETS WITH DIFFERENT SENTIMENT LABELS

Label | Tweet
Neutral | Vistamalls says supermarket sales to ’balance’ #COVID19 impact https://t.co/caE2rT6MhO
Negative | Just now on the telly, Woolies have stopped all online and click n collect orders. Due to overwhelming demand. #coronavirus #StopPanicBuying
Positive | Efforts 2 contain #COVID-19 are shifting demand & disrupting Ag supply chains. @raboresearch has collated our analysis of current & expected impacts in one place 2 help our @RabobankAU network keep informed. https://t.co/P41vjG4uD6

TABLE II
EXAMPLES OF E-COMMERCE TEXT WITH DIFFERENT LABELS

Label | E-commerce Text
Clothing & Accessories | Cherokee by Unlimited Boys’ Straight Regular Fit Trousers Cherokee kids beige trousers made of 100% cotton twill fabric.
Household | Nutella Hazelnut Spread with Cocoa, 290g Size:290g Because the taste is simply unique! The secret is its special recipe, the selected ingredients and the careful preparation. Here we want to tell you about Nutella and all the passion and care that we put in its production every day.
Books | NIACL Assistant Preliminary Online Exam Practice Work Book - 2280.
Electronics | Transcend 512 MB Compact Flash (TS512MCF300) Transcend’s CF300 cards are high-speed industrial CF cards offering impressive 300X transfer rates. With matchless performance and durability, CF300 CF cards are perfect for POS and embedded systems that require both industrial-grade reliability and an ultra-high speed data transfer.

across all LLMs by setting the temperature to 0.01 and top-p to 0.9. The temperature parameter controls the randomness of the prediction distribution, with a lower temperature resulting in less random completions. The top-p parameter, also known as nucleus sampling, restricts the model’s choices to the smallest set of most probable tokens whose cumulative probability reaches 90%, thereby preventing the selection of highly improbable words.

Furthermore, to ensure uniformity in model inputs, the prompts for the same dataset are kept identical across different LLMs. For different datasets, the same core prompt structure is used, with only the labels and dataset names adjusted as required. This approach is adopted to minimize variability in model responses attributable to differences in input, so that the evaluation focuses on each model’s inherent capabilities rather than variances in input.

Table III shows that the performance of the evaluated algorithms and models in sentiment classification tasks for both COVID-19 tweets and economic texts is not
outstanding. Within the four categories of classifiers, traditional ML algorithms consistently present the lowest accuracy across the two datasets.

TABLE III
RESULTS IN SENTIMENT CLASSIFICATION

           COVID19 Tweet              Economic Text
Model      ACC     F1      AUC        ACC     F1      AUC
MNB        0.3933  0.3639  0.5531     0.4533  0.3632  0.5563
LG         0.4333  0.3488  0.5404     0.5200  0.3066  0.5427
RF         0.4467  0.3184  0.6184     0.5133  0.3453  0.5990
DT         0.4733  0.4105  0.5602     0.4067  0.3446  0.5060
KNN        0.3800  0.3486  0.5216     0.4800  0.3620  0.5614
RNN        0.7400  0.7186  0.8925     0.6333  0.5797  0.7874
LSTM       0.7867  0.7619  0.8925     0.6533  0.4627  0.7293
GRU        0.8200  0.8106  0.9226     0.6933  0.5767  0.7928
BART       0.5000  0.3516  0.5882     0.4600  0.4258  0.6603
DeBERTa    0.5467  0.3805  0.5954     0.4467  0.4251  0.6385
Llama2     0.5267  0.4748  -          0.7000  0.5230  -
GPT-3.5    0.5333  0.4943  -          0.6667  0.6683  -
GPT-4      0.5267  0.5095  -          0.7133  0.7096  -

Notably, the accuracies of most models do not surpass the 80% threshold, with the exception of the GRU model, which achieves the highest accuracy of 82% in classifying COVID-19 tweets. This represents a substantial increase of approximately 20-30 percentage points over the zero-shot classifiers and LLMs, which reach around 50-55%. In turn, the zero-shot classifiers and LLMs improve on the traditional ML algorithms by roughly 3-10 percentage points.

In the analysis of economic texts, while DL methods perform better than traditional ML algorithms, the margin is narrower compared to the results with COVID-19 tweets. In particular, among the 3 LLMs, Llama2 and GPT-4 slightly outperform all the other algorithms and models with accuracies of 70.00% and 71.33% respectively, with GPT-4 achieving the highest accuracy.

TABLE IV
RESULTS IN ORIGINAL TWEETS VS. CLEAN TWEETS

               GPT-3.5          GPT-4            Llama2
Original Text  0.5413±0.0099    0.5200±0.0047    0.5280±0.0099
Clean Text     0.5145±0.0087    0.5560±0.0203    0.4973±0.0037

In the COVID tweet testing dataset, there are 65 negative, 24 neutral, and 61 positive tweets. Looking into the details of the LLMs’ predictions, GPT-3.5 and Llama2 show a preference towards predicting tweets as “negative”, more than the original distribution suggests, while GPT-4 tends to classify tweets as “neutral”.

Considering that tweets usually include URLs, HTML tags, hashtags, and mentions, and that text preprocessing is an important step in the traditional text classification process, we remove URLs, HTML tags, digits, hashtags, mentions, and stop words from the tweets dataset. Then the cleansed tweets are re-evaluated in LLMs over 5 times to ensure the results are reliable and consistent.

Table IV shows that before text cleaning, GPT-3.5 has superior accuracy, while after cleaning, GPT-4 outperforms the others. The decreased performance of GPT-3.5 and Llama2 after cleaning suggests that these models may leverage the full range of information contained in raw tweets, including hashtags and mentions, to inform their predictions. Conversely, GPT-4’s improved accuracy indicates a possible advantage in processing cleaner, more structured data. This variation in model performance underscores the importance of considering the specific characteristics of text data and the corresponding pre-processing steps when working with different LLMs.

TABLE V
RESULTS IN E-COMMERCE TEXT

           E-commerce text
Model      ACC     F1      AUC
MNB        0.2667  0.2546  0.5289
LG         0.3867  0.2820  0.6529
RF         0.5133  0.4410  0.7267
DT         0.5467  0.5412  0.6964
KNN        0.3600  0.3159  0.5891
RNN        0.9600  0.9598  0.9964
LSTM       0.9467  0.9468  0.9854
GRU        0.9400  0.9401  0.9883
BART       0.7133  0.7272  0.4391
DeBERTa    0.6267  0.6358  0.4726
Llama2     0.8067  0.6644  -
GPT-3.5    0.8867  0.8935  -
GPT-4      0.9000  0.9078  -

TABLE VI
RESULTS IN SMS

           SMS
Model      ACC     F1      AUC
MNB        0.7600  0.6425  0.7346
LG         0.8333  0.4545  0.4808
RF         0.9067  0.7678  0.7346
DT         0.8667  0.7337  0.7538
KNN        0.8400  0.6384  0.6327
RNN        0.9800  0.9558  0.9462
LSTM       0.9600  0.9005  0.8500
GRU        0.9867  0.9699  0.9500
BART       0.7000  0.5479  0.5942
DeBERTa    0.8200  0.6906  0.7481
Llama2     0.7267  0.4441  -
GPT-3.5    0.8733  0.7996  -
GPT-4      0.9733  0.9467  -

In Table V, the DL models demonstrate superior performance, with all accuracies surpassing 90%, where RNN
Fig. 3. The confusion matrices for LLMs’ classification results in COVID19 tweets.
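
Confusion matrices like those in Fig. 3 can be tabulated directly from gold and predicted labels. The sketch below is illustrative only: the six label pairs are invented, the function names are hypothetical, and the macro-averaged F1 is an assumption, since the averaging scheme behind the reported F1 scores is not stated in this section.

```python
LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels.

    Every prediction must be one of `labels`; malformed model outputs
    should be filtered or mapped before calling this function.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores (an assumed averaging scheme)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted i, true class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example labels, not taken from the paper's datasets.
y_true = ["negative", "negative", "neutral", "positive", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "negative", "positive", "positive"]

cm = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, cm):
    print(label, row)
print("accuracy:", accuracy(cm))
print("macro F1:", macro_f1(cm))
```

Reading along a row shows where a true class leaks to; reading down a column shows which classes a model over-predicts, which is the kind of bias toward “negative” or “neutral” discussed for the COVID-19 tweets.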

achieves the highest accuracy, 0.9600. This suggests that RNN architectures are particularly effective for this task. Notably, the LLMs also perform well, with GPT-4 achieving the highest accuracy among them, 0.9000, which demonstrates the effectiveness of LLMs in understanding and classifying complex e-commerce text.

In Table VI, the DL models again show strong results in detecting spam SMS, particularly in accuracy and F1 score, with all accuracies over 95%, even better than in classifying e-commerce text. Traditional ML algorithms, such as RF and DT, also show commendable performance, especially in accuracy. As for LLMs, GPT-4 shows a significant lead in accuracy at 97.33%, which is only slightly worse than GRU (98.67%) and RNN (98.00%) but better than LSTM (96.00%).

V. DISCUSSION

In this study, GPT-4 consistently outperforms traditional ML algorithms across all four datasets. It is also worth pointing out that Llama2 and GPT-3.5 show strengths in sentiment analysis and e-commerce text classification. More importantly, Llama2 and GPT-4 defeat all the other models in economic text analysis. Despite the gap in accuracy between the LLMs and DL models in COVID-19 sentiment analysis, LLMs deliver robust results in the remaining tasks.

Our research and experiments show that, even with a low temperature and a high top-p value, LLMs might not always yield the expected output. For instance, despite the given prompts, Llama2 occasionally adds extraneous text or generates more outputs than inputs. In contrast, traditional ML and DL models typically produce standardized outputs, facilitating downstream tasks. The latest OpenAI LLMs offer JSON output format support and consistent results through the use of a constant random seed, features that we hope to see in other LLMs shortly.

However, a concern arises from the LLMs’ training data, which usually include most publicly available internet data and may therefore have included our evaluation datasets, potentially biasing the results. Nonetheless, in our experience with commercial private data, GPT-3.5 and GPT-4 have demonstrated commendable performance, comparable or superior to RNNs/CNNs, with accuracies exceeding 90%.

VI. CONCLUSION AND FUTURE WORK

The performance of LLMs in three of the four datasets studied supports the conclusion that LLMs can effectively function as zero-shot text classifiers. This capability is particularly beneficial for small businesses or teams lacking in-depth expertise in text classification. It enables them to rapidly deploy text classifiers, allowing them to concentrate on downstream tasks.

Future accuracy improvements might include refining prompts with more detailed background information or more precise label definitions. Another prospective improvement could involve implementing a critic agent, drawing inspiration from actor-critic algorithms [59], to evaluate and enhance the results provided by LLMs. This study opens up new approaches, especially in sentiment analysis, where none of the algorithms achieved superior accuracy with a standard text classification process, indicating a promising direction for future research.

VII. DATA AND RESULTS AVAILABILITY STATEMENT

Source code, datasets and all experiment logs generated and/or analysed in this study are available in the following GitHub repository: https://github.com/yeyimilk/llm-zero-shot-classifiers.

REFERENCES

[1] M. Jiang, Y. Liang, X. Feng, X. Fan, Z. Pei, Y. Xue, and R. Guan, “Text classification based on deep belief network and softmax regression,” Neural Computing and Applications, vol. 29, pp. 61–70, 2018.
[2] X. Liu, X. You, X. Zhang, J. Wu, and P. Lv, “Tensor graph convolutional networks for text classification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 8409–8416.
[3] K. N. Singh, S. D. Devi, H. M. Devi, and A. K. Mahanta, “A novel approach for dimension reduction using word embedding: An enhanced text classification approach,” International Journal of Information Management Data Insights, vol. 2, no. 1, p. 100061, 2022.
[4] B. Liu, Sentiment analysis and opinion mining. Springer Nature, 2022.
[5] J. Chen, Z. Gong, and W. Liu, “A dirichlet process biterm-based mixture model for short text stream clustering,” Applied Intelligence, vol. 50, pp. 1609–1619, 2020.

[6] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, “Deep learning–based text classification: a comprehensive review,” ACM computing surveys (CSUR), vol. 54, no. 3, pp. 1–40, 2021.
[7] L. Qin, W. Che, Y. Li, M. Ni, and T. Liu, “Dcr-net: A deep co-interactive relation network for joint dialog act recognition and sentiment classification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 8665–8672.
[8] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text classification algorithms: A survey,” Information, vol. 10, no. 4, p. 150, 2019.
[9] I. Sarker, “Machine learning: algorithms, real-world applications and research directions,” SN Computer Science, vol. 2, p. 160, 2021.
[10] W. Wang, V. W. Zheng, H. Yu, and C. Miao, “A survey of zero-shot learning: Settings, methods, and applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–37, 2019.
[11] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[12] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 1017–1024.
[13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[14] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” Learning and individual differences, vol. 103, p. 102274, 2023.
[15] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” arXiv preprint arXiv:2211.01910, 2022.
[16] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” arXiv preprint arXiv:2307.03109, 2023.
[17] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM computing surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[18] Y. Xian, B. Schiele, and Z. Akata, “Zero-shot learning-the good, the bad and the ugly,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4582–4591.
[19] W. Alhoshan, L. Zhao, A. Ferrari, and K. J. Letsholo, “A zero-shot learning approach to classifying requirements: A preliminary study,” in International Working Conference on Requirements Engineering: Foundation for Software Quality. Springer, 2022, pp. 52–59.
[20] H. Larochelle, D. Erhan, and Y. Bengio, “Zero-data learning of new tasks,” in AAAI, vol. 1, no. 2, 2008, p. 3.
[21] S. Kadam and V. Vaidya, “Review and analysis of zero, one and few shot learning approaches,” in Intelligent Systems Design and Applications: 18th International Conference on Intelligent Systems Design and Applications (ISDA 2018) held in Vellore, India, December 6-8, 2018, Volume 1. Springer, 2020, pp. 100–112.
[22] W. Alhoshan, A. Ferrari, and L. Zhao, “Zero-shot learning for requirements classification: An exploratory study,” Information and Software Technology, vol. 159, p. 107202, 2023.
[23] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[24] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7.
[25] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” arXiv preprint arXiv:2010.15980, 2020.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[27] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
[28] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” arXiv preprint arXiv:2012.15723, 2020.
[29] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” arXiv preprint arXiv:2203.11171, 2022.
[30] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
[31] J. Guo, J. Li, D. Li, A. M. H. Tiong, B. Li, D. Tao, and S. Hoi, “From images to textual prompts: Zero-shot visual question answering with frozen large language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10867–10877.
[32] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023, pp. 2339–2356.
[33] X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang, S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al., “Zero-shot information extraction via chatting with chatgpt,” arXiv preprint arXiv:2302.10205, 2023.
[34] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
[35] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020.
[36] J. R. Quinlan, C4.5: programs for machine learning. Elsevier, 2014.
[37] W.-Y. Loh, “Classification and regression trees,” Wiley interdisciplinary reviews: data mining and knowledge discovery, vol. 1, no. 1, pp. 14–23, 2011.
[38] E. Shortliffe, Computer-based medical consultations: MYCIN. Elsevier, 2012, vol. 2.
[39] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos, “Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach,” arXiv preprint cs/0009009, 2000.
[40] S. Xu, “Bayesian naïve bayes classifiers to text classification,” Journal of Information Science, vol. 44, no. 1, pp. 48–59, 2018.
[41] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[42] T. Joachims, Learning to classify text using support vector machines. Springer Science & Business Media, 2002, vol. 668.
[43] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[44] K. Torkkola, “Linear discriminant analysis in document classification,” in IEEE ICDM workshop on text mining, vol. 29, 2001.
[45] G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “Knn model-based approach in classification,” in On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003. Proceedings. Springer, 2003, pp. 986–996.
[46] A. Genkin, D. D. Lewis, and D. Madigan, “Large-scale bayesian logistic regression for text categorization,” Technometrics, vol. 49, no. 3, pp. 291–304, 2007.
[47] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
[48] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
[49] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[50] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[51] J. Chen, Y. Geng, Z. Chen, J. Z. Pan, Y. He, W. Zhang, I. Horrocks, and H. Chen, “Low-resource learning with knowledge graphs: A comprehensive survey,” CoRR abs/2112.10006, 2021.
[52] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean, “Zero-shot learning by convex combination of semantic embeddings,” arXiv preprint arXiv:1312.5650, 2013.

[53] Y. Wang, L. Wu, J. Li, X. Liang, and M. Zhang, “Are the bert family zero-shot learners? a study on their potential and limitations,” Artificial Intelligence, p. 103953, 2023.
[54] J. Huang and C. X. Ling, “Using auc and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[55] G. Preda, “Covid19 tweets,” 2020. [Online]. Available: https://www.kaggle.com/dsv/1451513
[56] P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala, “Good debt or bad debt: Detecting semantic orientations in economic texts,” Journal of the Association for Information Science and Technology, vol. 65, no. 4, pp. 782–796, 2014.
[57] Gautam, “E commerce text dataset,” 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3355823
[58] T. Almeida and J. Hidalgo, “SMS Spam Collection,” UCI Machine Learning Repository, 2012, DOI: https://doi.org/10.24432/C5CC84.
[59] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in neural information processing systems, vol. 12, 1999.
