Next Sentence Prediction: The Impact of Preprocessing Techniques in Deep Learning
Research Center for Data and Information Science, National Research and Innovation Agency, Bandung, Indonesia
Research Center for Data and Information Science, National Research and Innovation Agency, Jakarta, Indonesia
Abstract—Text preprocessing in the field of Natural Language Processing (NLP) plays a crucial role in enhancing a model's ability to understand given tasks. This process helps avoid potential errors that could interfere with data processing. This study investigates the effectiveness of four preprocessing techniques (tokenization, case folding, stopword removal, and stemming) in predicting the next word in a sentence. The case focuses on preprocessing Indonesian-language text documents using word2vec embedding and a long short-term memory (LSTM) classification model. The evaluation combines a human perception approach with an evaluation metric focused on the suitability of word sequences and word pairs in the prediction results. The results reveal that the combination of tokenizing and case folding can enrich the meaning of the sentences, whereas the combination of tokenizing, case folding, stopword removal, and stemming can cause lost and overlapping meaning in the sentences. The prediction model performs best on one-word n-grams, with precision 0.1, recall 1.0, and F-score 0.1.

Keywords—word prediction, n-gram, word2vec, rouge, lstm

I. INTRODUCTION

Next sentence prediction (NSP) is a supervised learning task in natural language processing (NLP) in which a model is trained to predict whether a given sentence is the next sentence in a sequence. NSP matters because it captures the relation between sentences and the context of a sentence [1]. It has been shown to benefit various NLP tasks, including machine translation, question answering, and summarization. For example, in extractive summarization, the model can predict the next sentence to include in the summary based on the source article and the summary produced so far [2]. In machine translation, next sentence prediction can be used to generate translations by predicting the next sentence in the target language based on the previous sentence [3]. Similarly, in question answering, it can help generate relevant answers [4]. In recent years, information extraction has become an important area for successful NLP applications in the field of sentence prediction. Information extraction in NLP refers to the process of identifying and extracting structured information from unstructured text; the main goal is to obtain relationships, entities, and other facts from a set of documents. So far, information extraction has been successfully applied to various sectors such as industry [5][6][7], health [8][9], government [10], education [11], trending topic analysis [12], and others. In its implementation, NLP algorithms can make human tasks more effective, for example in chatbot applications [13], machine translation [14], summarization [15], question answering systems [16], and so on.

The performance of an NLP model is strongly influenced by preprocessing techniques, and several studies have analyzed their effect on text classification. For example, [17] analyzes sentiment using stopword removal and stemming and concludes that both techniques have a good impact on the accuracy of machine learning algorithms, even though other preprocessing techniques were not included. Another study [18] applied tokenization, stopword removal, stemming, and other preprocessing parameters to classify medical documents and concluded that stemming contributes to increased accuracy. Furthermore, [19] revealed that stemming, tokenizing, stopword removal, and removing titles and punctuation from the data frame can increase accuracy in text summarization, whereas accuracy decreased when these steps were not included in testing. In addition, [20] applied tokenization, normalization, stopword removal, and stemming to classify Arabic-language text, showing that the model gives good accuracy with preprocessing, while accuracy decreases without it.

On the other hand, preprocessing can also reduce accuracy, as demonstrated in [21]. That study applied stemming and stopword removal to classify Turkish newspaper texts and found that those preprocessing approaches have a very small impact on accuracy. In addition, [22] proposed stemming, light-stemming, and word clusters; their experiments showed that stemming reduces vector sizes and improves classification time. We observed that in most cases, preprocessing techniques have increased accuracy. Furthermore, [23] used a disaster dataset to examine the effect of semi-preprocessing techniques, i.e., tokenization, case folding, and stopword removal, from which it can be deduced that preprocessing can improve classification accuracy.

Based on the description above, the effect of preprocessing techniques on classification accuracy tends to fluctuate: preprocessing can promote efficiency and increase classification accuracy, but it can also reduce it. In addition, the dataset in previous
research has been labeled or targeted to make it easier for the classification algorithm to learn the rules; the weakness of such models lies in who labels the dataset. Therefore, in-depth research is needed to evaluate preprocessing performance. The purpose of this study is to examine the impact of preprocessing techniques on the performance of next-sentence prediction with deep learning, particularly for the Indonesian language. Our work consists of three stages: first, we design a model for data preprocessing of a text document; second, we explore the preprocessing steps, i.e., tokenizing, filtering, case folding, and stemming, on the document dataset; and last, we use the N-gram model to construct the words. In this study, we propose an LSTM algorithm that uses word2vec embedding for model training. To validate the analysis, we evaluate the frequency of input words against the number of corresponding words in the word group using ROUGE metrics, and we assess the meaning and ambiguity of the prediction results using human evaluation.

II. RELATED WORKS

Many studies on different types of preprocessing methods for next-sentence prediction have been established so far, and next-sentence prediction tends to be discussed in the scope of deep-learning models. For example, a study [24] proposed a GRU-based RNN method that can predict the next most appropriate and suitable word in the Bangla language; the results show a significant contribution to prediction quality. A study by [25] compared deep learning models for next-word prediction in the Ukrainian language; the experiments showed that Markov chains gave the fastest yet adequate results, while a hybrid LSTM model gives adequate results but works slowly. In addition, [26] presented two deep learning techniques, namely LSTM and Bi-LSTM, for the task of next-word prediction; the resulting model can predict the next word effectively, reduce the number of keystrokes, and increase typing speed. Another study [27] investigated the impact of preprocessing techniques in detecting Arabic health information on social media, comparing traditional machine learning classifiers, i.e., KNN, SVM, multinomial NB, and logistic regression, against the deep learning architectures Bi-LSTM and CNN. It concluded that the Bi-LSTM architecture performed better than the CNN architecture and the MNB classifier, and that Bi-LSTM with Mazajak Skip-Gram pre-trained word embeddings produced the best model. Then [28] proposed an N-gram model for next-word prediction in the Kurdish language, since it can suggest text accurately.

The authors observed that previous works focused on building deep learning models to generate predictions; however, the contribution of preprocessing to classification accuracy [29] has not been well investigated in depth. The main contributions of this paper are as follows:

• Comparing four preprocessing techniques, i.e., tokenizing, stopword removal, case folding, and stemming, for text prediction.
• Proposing, in the context of constructing words, an N-gram model that can be used to generate new words based on the statistical properties of a dataset of existing words.
• Demonstrating an LSTM model that uses word2vec embedding for next-sentence prediction.

III. MATERIAL AND METHODS

A. Dataset Collection

The dataset in this study is sourced from online news sites that contain NLP information (in the Indonesian language). A total of 1423 data items were collected, comprising 391 unique words and 106 sentences, to train the deep learning algorithm.

B. Methods

In the context of NLP, preprocessing techniques are used to clean and prepare data for analysis. These techniques can help remove noise from the data, make it easier to read, and reduce the computational burden of analysis. By simplifying the analysis process, preprocessing techniques can help improve the accuracy and efficiency of NLP models. In this research, we divide the analysis model into two main parts, namely the preprocessing and prediction models. The preprocessing comprises five steps: first, case folding, which changes the letters in sentences into lowercase; second, tokenizing, which divides the text of sentences into words; third, stopword removal (filtering), which removes irrelevant words from the document; fourth, stemming, which changes affixed words into basic word forms; and fifth, applying N-grams to summarize and transform the input data into a set of features. In this study, stopword removal and stemming use the NLTK and Sastrawi libraries (a minimal sketch of these steps is given after Fig. 1). For the prediction model, two algorithms are used, namely word2vec and LSTM, where the word2vec embedding represents the words as vectors.

Fig. 1. Flow diagram of the next sentence prediction
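For illustration, the following is a minimal sketch of the five preprocessing steps and the word2vec embedding step, assuming the NLTK Indonesian stopword list and the Sastrawi stemmer named above, together with the gensim word2vec implementation; the function names, example sentence, and parameter values are illustrative rather than the exact experimental configuration.

```python
# Sketch of the five preprocessing steps plus word2vec embedding;
# assumes `pip install nltk Sastrawi gensim` and is illustrative only.
import nltk
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from gensim.models import Word2Vec

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("indonesian"))   # NLTK Indonesian stopword list
STEMMER = StemmerFactory().create_stemmer()      # Sastrawi stemmer

def preprocess(sentence, remove_stopwords=True, stem=True):
    text = sentence.lower()                      # 1) case folding
    tokens = text.split()                        # 2) tokenizing
    if remove_stopwords:                         # 3) stopword removal (filtering)
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:                                     # 4) stemming to basic word forms
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens

def ngrams(tokens, n):
    # 5) n-gram features over the cleaned tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["NLP merupakan gabungan komputasi linguistik dan model statistik"]
token_lists = [preprocess(s) for s in corpus]
print(ngrams(token_lists[0], 2))                 # bigram features

# word2vec embedding trained over the preprocessed corpus (illustrative sizes)
w2v = Word2Vec(sentences=token_lists, vector_size=100, min_count=1)
```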
TABLE I (FRAGMENT; EARLIER ROWS AND HEADER NOT RECOVERED FROM THE SOURCE). ROUGE-1 PRECISION / RECALL / F-SCORE PER PREPROCESSING SCENARIO

No | Keyword | Scenario 1 (P / R / F) | Scenario 2 (P / R / F) | Scenario 3 (P / R / F)
3 | NLP merupakan gabungan | 0.055 / 0.255 / 0.083 | 0.071 / 0.255 / 0.111 | 0.055 / 0.255 / 0.083
4 | Natural NLP | 0.155 / 1.0 / 0.26 | 0.214 / 1.0 / 0.352 | 0.1 / 0.666 / 0.173
5 | Preprocessing Teknologi | 0.055 / 0.333 / 0.086 | 0.07 / 0.333 / 0.117 | 0.055 / 0.33 / 0.086
6 | Natural dilatih | 0.055 / 0.255 / 0.083 | 0.071 / 0.255 / 0.111 | 0.055 / 0.25 / 0.083
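As a consistency check on these rows (assuming the three score triplets follow the same scenario order as Table II), each f-score is the harmonic mean of the adjacent precision and recall, per equation (3) in Section IV-A; for keyword no. 4 under the first scenario:

$$F = \frac{2 \times 0.155 \times 1.0}{0.155 + 1.0} \approx 0.268,$$

which matches the tabulated 0.26 up to rounding.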
TABLE II. COMPARISON OF THE TESTING RESULTS OF TEXT PREPROCESSING TECHNIQUES ON SENTENCE PREDICTION

Keywords use n-gram sizes of unigram, bigram, and trigram; "Id" is the Indonesian prediction and "Eg" its English gloss.

No. 1. Keyword: Id: natural / Eg: natural
- Case folding and tokenizing: Id: natural language processing nlp adalah salah satu cabang ilmu artificial intelligence ai yang memberi kemampuan pada komputer untuk memahami teks. Eg: natural language processing nlp is a branch of artificial intelligence ai that gives computers the ability to understand text.
- Case folding, tokenizing and stopword removal: Id: natural language processing nlp salah cabang ilmu artificial intelligence ai kemampuan komputer memahami teks. Eg: natural language processing nlp branch artificial intelligence ai ability computers understand text.
- Case folding, tokenizing and stemming: Id: natural language processing nlp adalah teknologi yang kait dengan mampu komputer dalam paham tafsir. Eg: natural language processing nlp is a technology related to the ability of computers to understand interpretation.

No. 2. Keyword: Id: stemming adalah / Eg: stemming is
- Case folding and tokenizing: Id: stemming adalah proses mereduksi token kata ke bentuk dasarnya. Eg: stemming is the process of reducing word tokens to their basic form.
- Case folding, tokenizing and stopword removal: Id: stemming proses menghilangkan imbuhan akhir. Eg: stemming process removing final affixes.
- Case folding, tokenizing and stemming: Id: stemming adalah proses reduksi token kata ke bentuk dasar. Eg: stemming is the process of reducing word tokens to the basic form.

No. 3. Keyword: Id: NLP merupakan gabungan / Eg: NLP is a combination of
- Case folding and tokenizing: Id: nlp merupakan gabungan antara komputasi linguistik dan model statistik, machine learning, dan deep learning. Eg: nlp is a combination of computational linguistics and statistical models, machine learning and deep learning.
- Case folding, tokenizing and stopword removal: Id: nlp merupakan gabungan komputasi linguistik model statistik machine learning deep learning. Eg: nlp combination computational linguistics statistical models machine learning deep learning.
- Case folding, tokenizing and stemming: Id: nlp merupakan gabungan dapat bantu usaha tingkat wawas mereka dan dapat lebih banyak visibilitas. Eg: nlp is a combination that can help businesses gain insight and gain more visibility.

No. 4. Keyword: Id: natural NLP / Eg: natural NLP
- Case folding and tokenizing: Id: natural nlp menggunakan kecerdasan buatan untuk mengambil input, memprosesnya, dan memahaminya dengan cara yang dapat dipahami. Eg: natural nlp uses artificial intelligence to take input, process it, and understand it in a comprehensible way.
- Case folding, tokenizing and stopword removal: Id: natural nlp processing nlp salah cabang ilmu artificial intelligence ai kemampuan komputer memahami teks kata kata manusia. Eg: natural nlp processing nlp branch artificial intelligence ai ability computers understand text human words.
- Case folding, tokenizing and stemming: Id: natural nlp juga masih bisa terus latih dan ubah iring dengan kembang bisnis bisnis. Eg: natural NLP can also continue to train and change along with the development of the business.

No. 5. Keyword: Id: preprocessing teknologi / Eg: preprocessing technology
- Case folding and tokenizing: Id: preprocessing teknologi ini mampu bekerja dalam skala yang besar, termasuk menganalisa teks dan suara. Eg: this preprocessing technology is capable of working on a large scale, including analyzing text and sound.
- Case folding, tokenizing and stopword removal: Id: preprocessing teknologi skala besar, menganalisa teks suara sistem internal, bentuk dokumen, testimoni online. Eg: large-scale preprocessing technology, analyzing internal system voice text, document forms, online testimonials.
- Case folding, tokenizing and stemming: Id: preprocessing teknologi ini yang buat komputer mampu erti bahasa dalam bentuk teks maupun pesan suara. Eg: this preprocessing technology makes computers capable of understanding language in the form of text or voice messages.

No. 6. Keyword: Id: nlp berbasis aturan / Eg: a rule-based nlp
- Case folding and tokenizing: Id: nlp berbasis aturan menggunakan aturan linguistik yang dirancang dengan cermat. Eg: rule-based nlp uses carefully designed linguistic rules.
- Case folding, tokenizing and stopword removal: Id: nlp berbasis aturan linguistik model statistik, machine learning, deep learning. Eg: nlp based linguistic rules statistical models, machine learning, deep learning.
- Case folding, tokenizing and stemming: Id: nlp berbasis aturan dapat bantu usaha tingkat wawas mereka dan dapat lebih banyak visibilitas. Eg: rule-based nlp can help businesses level up and get more visibility.
… shows the structure of the model, trained with the hidden-layer configuration embedding_1, lstm_1 and dense_1 (a minimal sketch of this stack is given below). Meanwhile, Fig. 3 shows the training and loss curves of the LSTM model.
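For reference, here is a minimal sketch of a model with this layer stack, assuming the Keras Sequential API (whose auto-generated layer names match the embedding_1/lstm_1/dense_1 pattern reported here); the sequence length and layer dimensions are assumptions, not the exact training configuration.

```python
# Illustrative embedding -> LSTM -> dense stack for next-word
# prediction; dimensions are assumed, not the paper's settings.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

VOCAB_SIZE = 391   # unique words reported in Section III-A
SEQ_LEN = 5        # assumed context window (n-gram length)

model = Sequential([
    Input(shape=(SEQ_LEN,)),
    Embedding(VOCAB_SIZE, 100),               # embedding layer (word2vec-sized vectors)
    LSTM(128),                                # recurrent layer
    Dense(VOCAB_SIZE, activation="softmax"),  # next-word probability distribution
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# A pretrained word2vec matrix can be copied into the embedding layer
# afterwards, e.g. via embedding_layer.set_weights([w2v_matrix]).
model.summary()
```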
IV. RESULTS AND DISCUSSION

In this section, we evaluate the effect of three text preprocessing scenarios on the prediction results: the first scenario applies case folding and tokenizing; the second applies case folding, tokenizing and stopword removal; and the third applies case folding, tokenizing and stemming. We compare the performance of the three scenarios on a variety of prediction tasks. The results obtained for each scenario are presented in Tables I and II, which compare the results of each evaluation scenario over six provided text keywords.

A. Evaluation and Comparison

There are two main evaluation approaches for investigating the quality of prediction results: human evaluation and the ROUGE evaluation metric. Human evaluation involves having human judges assess the quality of the predictions; this can be done by asking the judges to read the predictions and the actual text and then rate the quality of the predictions. Evaluation with ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which consists of rouge-1, rouge-2, rouge-L, and rouge-Lsum [32], is conducted by using rouge-1 to measure the similarity between the input and the prediction results and comparing recall, precision, and f-score, following equations (1), (2), and (3).
$$\text{R-1 Recall} = \frac{\sum_{gram_1 \in \text{Reference}} \text{Count}_{\text{match}}(gram_1)}{\sum_{gram_1 \in \text{Reference}} \text{Count}(gram_1)} \tag{1}$$

$$\text{R-1 Precision} = \frac{\sum_{gram_1 \in \text{Candidate}} \text{Count}_{\text{match}}(gram_1)}{\sum_{gram_1 \in \text{Candidate}} \text{Count}(gram_1)} \tag{2}$$

$$\text{R-1 F-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
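As a small illustration of equations (1)-(3), the sketch below computes ROUGE-1 precision, recall, and f-score from clipped unigram-overlap counts; it is a simplified stand-in for a full ROUGE implementation (e.g., the rouge-score package), and the example tokens are illustrative.

```python
# Unigram-overlap sketch of ROUGE-1 precision/recall/f-score,
# i.e., equations (1)-(3); illustrative, not the exact evaluation code.
from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())                 # matching unigrams, clipped
    precision = overlap / max(sum(cand.values()), 1)     # eq. (2)
    recall = overlap / max(sum(ref.values()), 1)         # eq. (1)
    fscore = 2 * precision * recall / (precision + recall) if overlap else 0.0  # eq. (3)
    return precision, recall, fscore

print(rouge1("natural language processing nlp".split(),
             "natural language".split()))
```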
B. Discussion
In this part we discuss the comparison of the evaluation approaches, namely human evaluation and the ROUGE metrics, for each preprocessing scenario. The first evaluation uses the ROUGE-1 metric, which matches the number of suitable n-grams, consisting of word order and word pairs, between the candidate text (the sentence prediction) and the reference, as presented in Table I. The candidate is the predicted n-gram, while the reference is the list of word groups that the predicted n-gram should match. We tested one-word (unigram), two-word (bigram) and three-word (trigram) keywords sequentially. It can be seen in Table I that, among the precision, recall, and f-score values for the four preprocessing techniques, case folding tends to produce low scores compared with the other techniques. This condition is caused by the low probability of successive words in the dataset. The prediction results with case folding and tokenizing preprocessing produce sentence suitability that correlates with the prediction results. Conversely, testing with non-consecutive (random) keywords produces higher scores, because more non-consecutive keywords (Table I, numbers 4-6) are found in the dataset. An important point to notice is that, despite the high scores, testing with non-consecutive keywords results in overlapping sentence recommendations, especially when stopword removal and stemming are applied.

For the second experiment, with human evaluation, the results in Table II can be described as follows:

• Scenario 1 (preprocessing with case folding and tokenizing). The model can predict the next text suitably based on the predetermined keywords.

• Scenario 2 (preprocessing with case folding, tokenizing and stopword removal). The model can make predictions, but the meaning is affected. The stopword removal process tends to eliminate words deemed unimportant in sentences; removing connecting words such as "yang", "tentang", "atau", "dengan" and others from the NLTK stopword list [33] affects the meaning of the predicted sentences.

• Scenario 3 (preprocessing with case folding, tokenizing and stemming). Stemming is the preprocessing technique that reduces words to their root form [34] by removing affixes such as prefixes and suffixes. The effect of stemming on clarity of meaning can be significant, especially for languages that have many affixes, as it removes important information about the tense and aspect of verbs. The results show that sentence predictions under this scenario cannot provide clarity of meaning, because sentences containing prefixes, infixes and other elements are reduced to their basic forms. This result contrasts with the studies in [17][19][20][18][35], in which stopword removal and stemming had a good impact on the classification results.

V. CONCLUSION

The findings of the analysis reveal that applying the preprocessing techniques of tokenizing and case folding gives a significant increase in the meaning of the predicted words, because these techniques make it easier for the model to learn word patterns. The combination of tokenizing, case folding and stopword removal results in ineffective sentence prediction as well as erroneous meaning, because stopword removal strips conjunctions, punctuation marks and supposedly irrelevant words from the documents; removing such words drives precision, recall and f-score to 0 (see Table I, no. 2). Meanwhile, the combination of tokenizing, case folding and stemming leads to lost meaning in the predicted sentences and produces overlapping sentence structures. The application of stopword removal and stemming in these cases can nevertheless reduce the size and index dimensions of a sentence, and stopword removal remains appropriate for removing certain punctuation marks, symbols, numbers, and internet sites that are not considered important indications in documents. While ROUGE metrics are useful for evaluating the frequency of input words against the number of corresponding words in the word dataset, they do not guarantee the correctness of the prediction results, as they consider neither the semantic meaning of the words nor the context of the text. For further work, it is necessary to compare the model with other word embeddings and to study non-consecutive (random) keywords more deeply.

ACKNOWLEDGMENT

This work is supported and organized by the National Research and Innovation Agency (BRIN).