0% found this document useful (0 votes)

26 views6 pages

Next Sentence Prediction The Impact of Preprocessing Techniques in Deep Learning

Uploaded by

Yuyun Wabula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views6 pages

Next Sentence Prediction The Impact of Preprocessing Techniques in Deep Learning

Uploaded by

Yuyun Wabula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/374836072

Next Sentence Prediction: The Impact of Preprocessing Techniques in Deep

Learning

Conference Paper · October 2023

DOI: 10.1109/IC3INA60834.2023.10285805

CITATIONS READS

0 125

6 authors, including:

Yuyun Wabula Andi Djalal Latief

STMIK Handayani BRIN Indonesia
34 PUBLICATIONS 91 CITATIONS 11 PUBLICATIONS 15 CITATIONS

SEE PROFILE SEE PROFILE

Hazriani Zainuddin
Handayani University, Indonesia
26 PUBLICATIONS 70 CITATIONS

SEE PROFILE

All content following this page was uploaded by Yuyun Wabula on 20 October 2023.

The user has requested enhancement of the downloaded file.

2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA)

Next Sentence Prediction: The Impact of

Preprocessing Techniques in Deep Learning
1st Yuyun 2nd Andi Djalal Latief 3rd Tri Sampurno
2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA) | 979-8-3503-9487-0/23/$31.00 ©2023 IEEE | DOI: 10.1109/IC3INA60834.2023.10285805

Research Center for Data and Research Center for Data and Research Center for Data and
Information Science Information Science Information Science
National Research and Innovation National Research and Innovation National Research and Innovation
Agency Agency Agency
Bandung, Indonesia Bandung, Indonesia Jakarta, Indonesia
[email protected] [email protected] [email protected]

4th Hazriani 5th Andriansyah Oktafiandi Arisha 6th Mushaf

dept. of computer system dept. of computer system dept. of informatics management
Handayani University Handayani University Dipa Makassar University
Makassar, Indonesia Makassar, Indonesia Makassar, Indonesia
[email protected] [email protected] [email protected]

Abstract— Text preprocessing in the field of Natural [11], trending topic analysis[12] and others. In its
Language Processing (NLP) plays a crucial role in enhancing the implementation, NLP algorithms can help human tasks to
model's ability to understand given tasks. This process helps in become more effective. For example the chatboat application
avoiding potential errors that could interfere with data [13], machine translation [14], summarization [15], question
processing. This study aims to investigate the effectiveness of answering system[16] and so on.
four preprocessing techniques (tokenization, case-folding,
stopword removal, and stemming) in predicting the next word Performance of an NLP model is strongly influenced by
in a sentence. The case focuses on Indonesian language text preprocessing techniques. Several studies have been analyzed
documents preprocessing using word2vec embedding and the the effect of preprocessing techniques on text classification.
long short-term memory networks (LSTM) classification model. For example, [17] analyzes sentiment using stopword removal
The evaluation was conducted by incorporating human and stemming. They concluded that both techniques have a
perception approach and evaluation matrix which focused on good impact on the accuracy of machine learning algorithms
the suitability of word sequences and pairs on the prediction even though they did not include other preprocessing
results. The result reveals that the combination of tokenizing techniques. Another study [18] proposed tokenization,
and casefolding can enrich the meaning of the sentences. stopword removal, stemming and other preprocessing
Meanwhile the combination of tokenizing, casefolding, parameters in classifying medical documents which
stopword removal, and stemming can cause lose and overlap concluded that stemming contributes to increase accuracy.
meaning in the sentences. The prediction model performs best
Furthermore, [19] revealed that stemming, tokenizing,
on one-word n-grams, with precision 0.1, recall 1.0, and F-score
0.1.
stopwords removal, removing title from data frame and
punctuations in text summarization can increase accuracy,
Keywords—word prediction, n-gram, word2vec, rouge, lstm, however the accuracy decreased when the testing doesn’t
include. In addition, [20] applied tokenization, normalization,
I. INTRODUCTION stopwords removal and stemming to classify the Arabic
Next sentence prediction (NSP) is a supervised learning language, which proved that the model with preprocessing
task in natural language processing (NLP), where a model is give good accuracy, while without preprocessing the accuracy
trained to predict whether a given sentence is the next sentence decreases.
in a sequence. The importance of NSP to know relation On the other hand, preprocessing also can reduce accuracy
between sentences and understand the context of the sentence as demonstrated in research [21]. This research proposed the
[1]. It has been shown beneficial for various NLP tasks, stemming and stopwords removal to classify the Turkish
including machine translation, question answering, and newspaper texts, and stated that those preprocessing
summarization. For example, in extractive summarization, the approaches have very small impact in terms of accuracy. In
model can predict the next sentence to include in the summary addition, [22] proposed stemming, light-stemming, and word
based on the source article and the summary produced[2]. In clusters. Their experiments proved that stemming reduces
machine translation, next sentence prediction can be used to vector sizes, and improves the classification time. We
generate translations by predicting the next sentence in the observed that in most cases, preprocessing techniques have
target language based on the previous sentence [3]. Similarly, increased accuracy. Furthermore, [23] demonstrated disaster
in question answering, it can help generate relevant dataset to examine the effect of the text semi-preprocessing
answers[4]. In recent years, information extraction has technique i.e., tokenization, case folding, stop word removal.
become an important area for successful NLP applications in It can be deduced that preprocessing models can improve
the field of sentence prediction. Information extraction in NLP classification accuracy.
refers to the process of identifying and extracting structured
information from unstructured text. The main goal is to obtain Based on the description above, the use of preprocessing
information in the form of relationships, entities and other techniques in classification accuracy tends to fluctuate. The
facts from a set of documents. So far, information extraction preprocessing techniques could promote efficiency and
has been successfully applied to various sectors such as increase classification accuracy. On the other hand, can also
industry[5][6][7], health [8][9], government [10], education reduce the accuracy. In addition, the dataset in previous

Authorized licensed use limited to: Badan Riset Dan Inovasi Nasional. Downloaded on October 19,2023 at 00:48:57 UTC from IEEE Xplore. Restrictions apply.
979-8-3503-9487-0/23/$31.00 ©2023 IEEE 274
2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA)

research has been labeled or targeted to make it easier for the III. MATERIAL AND METHODS
classification algorithm to understand the rules. The weakness
of this model lies in who labels the dataset. Therefore, in depth A. Dataset Collection
research is needed to evaluate preprocessing performance. The dataset in this study is sourced from online news sites
The purpose of this study is to examine the impact of the that contain NLP information (in Indonesian language). A
preprocessing technique on the performance of the next- total of 1423 data were collected, with 391 unique words and
sentence prediction with deep learning technique, particularly 106 number sentences to train using the deep learning
in Indonesian language. Our works consist of three stages; algorithm.
first work is to design a model for data preprocessing in a
B. Methods
document text. Secondly, we explore the processing steps i.e.,
tokenizing, filtering, casefolding, and stemming towards the In the context of NLP, Preprocessing techniques are used
documents dataset. And the last, we use the N-gram model to to clean and prepare data for analysis. These techniques can
construct the words. In this study, we propose LSTM help to remove noise from the data, make it easier to read, and
algorithm uses word2vec embedding for model training. To reduce the computational burden of analysis. By simplifying
validate the analysis, we evaluate the frequency of input words the analysis process, preprocessing techniques can help to
compared to the number of corresponding words in the word improve the accuracy and efficiency of NLP models. In this
group using ROUGE metrics, and we assess the meaning and research, we divide the analysis model into 2 main parts,
ambiguity of the prediction result using human evaluation. namely preprocessing and prediction models. The
preprocessing including four steps; First is casefolding by
II. RELATED WORKS changing letters in sentences into lowercase; Secondly is
A lot of studies on different types of preprocessing tokenizing by dividing text in sentences into words; Third is
methods for next-sentence prediction have been established so stopword removal or filtering by removing irrelevant words in
far. The scope of next-sentence prediction tends to discussed the document; Fourth is stemming by changing affixed words
in deep-learning models. For example, a study[24] proposed into basic word forms; and fifth is applying N-gram to
a GRU based RNN method that can predict the next most summarize and transform the input data into a set of features.
appropriate and suitable word in Bangla language. The results In this study, preprocessing stopword removal and stemming
show a significant contribution to the prediction result. A used the NLTK and Sastrawi libraries. For prediction model
study by [25] compared deep learning model for a next-word two algorithms are used, namely word2vec and LSTM. The
prediction for Ukrainian language. The experiments proved word2vec embedding to represent the vector of the words.
that the Markov chains presented the fastest and adequate
results. The hybrid model LSTM presents adequate results but
it works slowly. In addition, [26] presented two deep learning
techniques namely STM and Bi LSTM in exploring the task
of next word prediction. As a result, the model can predict the
next word effectively, reduce the number of word and increase
the typing speed. Other study in [27], investigating the impact
of pre-processing techniques in detecting Arabic health
information on social media. This work compared the
traditional machine learning classifiers i.e., KNN, SVM,
multinomial NB, logistic regression against deep learning
architectures Bi-LSTM and CNN. It is concluded that the Bi-
LSTM architecture performed better than the CNN
architecture and the MNB classifier. Then Bi-LSTM with
Mazajak Skip-Gram pre-trained word embedding produced
the best model. Then [28] proposed N-gram model for the next
word prediction using Kurdish language since it can suggest
text accurately.
The authors observed that previous works focused on
building deep learning models in generating predictions.
However, in depth the contribution of preprocessing to
classification accuracy [29] has not well-investigate. The main
contributions of this paper are as follows:
• Comparing the four preprocessing techniques i.e.,
tokenizing, stopwords removal, casefolding, and Fig. 1. Flow diagram of the next sentence prediction
stemming towards text prediction.
• In the context of constructing words, we proposed an
N-gram model can be used to generate new words
based on the statistical properties of a dataset of
existing words.
• Demonstrate LSTM Model uses the word2vec
embedding for next-sentence prediction

Fig. 2. Structure of layer in LSTM training model

Authorized licensed use limited to: Badan Riset Dan Inovasi Nasional. Downloaded on October 19,2023 at 00:48:57 UTC from IEEE Xplore. Restrictions apply.
275
2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA)

The advantage of this algorithm is that it can capture semantic

relationships of the words by similar words grouping. In
addition, it can overcome the problems of synonyms and
homonyms that are often found in the NLP task[30]. This
relationship is then used to convert alphanumeric characters
into vectors. While the Long-Short Term Memory (LSTM)
model is used to train the model and predict the next sentence.
The whole stages conducting in this study as presented in Fig.
1. LSTM is a Deep Learning method that is able to remember
information from the past in the learning process of the model.
To train models, LSTM has the ability to store memories for
Fig. 3. Training and lost of LSTM model, loss: 0.0634, training: 0.9683 a long time simultaneously in learning patterns [31]. Fig. 2
TABLE I. MATRIX EVALUATION OF PREDICTION RESULT WITH RECALL-ORIENTED UNDERSTUDY FOR GISTING EVALUATION (ROUGE-1)
Keyword case folding and tokenizing case folding, tokenizing, stop word case folding, tokenizing, stemming
N
(n-gram size: unigram,
o bigram, trigram) precision Recall f-score precision Recall f-score precision Recall f-score
1 Natural 0.1 1.0 0.181 0.142 1.0 0.255 0.1 1.0 0.181

2 Stemming adalah 0.055 0.333 0.086 0 0 0 0.055 0.333 0.086

3 NLP merupakan gabungan 0.055 0.255 0.083 0.071 0.255 0.111 0.055 0.255 0.083

4 Natural NLP 0.155 1.0 0.26 0.214 1.0 0.352 0.1 0.666 0.173

5 Preprocessing Teknologi 0.055 0.333 0.086 0.07 0.333 0.117 0.055 0.33 0.086
6 Natural dilatih 0.055 0.255 0.083 0.071 0.255 0.111 0.055 0.25 0.083

TABLE II. COMPARISON OF THE TESTING RESULT OF TEXT REPROCESSING TECHNIQUES ON SENTENCE PREDICTION
Keyword
(n-gram size:
N Sentence prediction Sentence prediction Sentence prediction
unigram,
o (case folding and tokenizing) (case folding, tokenizing and stop word) folding, tokenizing and stemming
bigram,
trigram)
Id: natural language processing nlp
Id: natural language processing nlp adalah Id: natural language processing nlp
adalah teknologi yang kait dengan
salah satu cabang ilmu artificial intelligence salah cabang ilmu artificial intelligence
mampu komputer dalam paham
ai yang memberi kemampuan pada ai kemampuan komputer memahami
Id: natural tafsir
1 komputer untuk memahami teks teks
Eg: natural Eg: natural language processing
Eg: natural language processing nlp is a Eg: natural language processing nlp
nlp is a technologi related to the
branch of artificial intelligence ai that gives branch artificial intelligence ai ability
ability of computers to understand
computers the ability to understand text computers understand text
interpretation
Id:
Id: stemming adalah proses mereduksi Id: stemming proses menghilangkan Id: stemming adalah proses reduksi
stemming
token kata ke bentuk dasarnya imbuhan akhir token kata ke bentuk dasar
2 adalah
Eg: stemming is the process of reducing Eg: stemming adalah proses Eg: stemming is the process of
Eg: stemming
word tokens to their basic form menghilangkan imbuhan akhir remove final affixes
is
Id: NLP Id: nlp merupakan gabungan antara Id: nlp merupakan gabungan komputasi Id: nlp merupakan gabungan dapat
merupakan komputasi linguistik dan model statistik, linguistik model statistik machine bantu usaha tingkat wawas mereka
gabungan machine learning, dan deep learning learning deep learning dan dapat lebih banyak visibilitas
3
Eg: NPL is a Eg: nlp is a combination of computational Eg: nlp combination computational Eg: nlp is a combination that can
combinatio linguistics, and statistical models, machine linguistics statistical models machine help businesses gain insight and
n of learning and deep learning learning deep learning gain more visibility
Id: natural nlp processing nlp salah
Id: natural nlp menggunakan kecerdasan
cabang ilmu artificial intelligence ai Id: natural nlp juga masih bisa
buatan untuk mengambil input,
kemampuan komputer memahami teks terus latih dan ubah iring dengan
Id: natural NLP memprosesnya, dan memahaminya dengan
kata kata manusia kembang bisnis bisnis
4 Eg: natural cara yang dapat dipahami
Eg: natural nlp processing nlp branch Eg: natural NLP can also continue
NLP Eg: natural nlp uses artificial intelligence
artificial intelligence ai ability to train and change along with the
to take input process it and understand it in
computers understand text human development of the business
a comprehensible way
words
Id: preprocessing teknologi ini
Id: Id: preprocessing teknologi skala besar,
Id: preprocessing teknologi ini mampu yang buat komputer mampu erti
preprocessi menganalisa teks suara sistem internal,
bekerja dalam skala yang besar, termasuk bahasa dalam bentuk teks maupun
ng bentuk dokumen, testimoni online
menganalisa teks dan suara pesan suara
5 teknologi Eg: large-scale preprocessing
Eg: This preprocessing technology is Eg: This preprocessing technology
Eg: technology, analyzing internal system
capable of working on a large scale, makes computers capable of
preprocessin voice text, document forms, online
including analyzing text and sound understand language in the form of
g technology testimonials
text or voice messages
6 Id: nlp berbasis aturan linguistik model Id: nlp berbasis aturan dapat bantu
Id: nlp berbasis Id: nlp berbasis aturan menggunakan aturan statistik, machine learning, deep usaha tingkat wawas mereka dan
aturan linguistik yang dirancang dengan cermat learning dapat lebih banyak visibilitas
Eg: a rule- Eg: rule-based nlp uses thorough designed Eg: nlp based linguistic rules statistical Eg: rule-based nlp can help their
based nlp linguistic rules models, machine learning, deep businesses level up and get more
learning visibility

Authorized licensed use limited to: Badan Riset Dan Inovasi Nasional. Downloaded on October 19,2023 at 00:48:57 UTC from IEEE Xplore. Restrictions apply.
276
2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA)

shows the structure of the model trained with hidden layers For the second experiment with human evaluation,
configuration: embedding_1, lstm_1 and dense_1. observing the result in Table II, it can be described as follows:
Meanwhile, Fig. 3. shows the training and loss of the LSTM • Scenario-1 (preprocessing with preprocessing
model. casefolding and tokenizing). The model can predict
next text suitably based on predetermined keywords.
IV. RESULTING AND DISCUSSION
• Scenario-2 (preprocessing with preprocessing
In this section, we evaluate the effect of three text casefolding, tokenizing and stopword removal). The
preprocessing scenarios on the predictive results, namely: the model can make predictions but affecting the meaning.
first scenario is applying the casefolding and tokenizing; the The stopword removal process tends to eliminate
second scenario is applying casefolding, tokenizing and words that are not important in sentences, thus
stopword removal; and the third scenario is applying removing connecting words such as “yang”, “tentang”,
casefolding, tokenizing and stemming. We will compare the
“atau”, “dengan”, and others will affect the meaning
performance of the three scenarios on a variety of predictive
of the predicted sentences, NLTK stopwords list [33].
tasks. The result obtained for each scenario as presented in
Table I and II. The table provide comparison result for each • Scenario-3 (preprocessing with preprocessing
evaluation scenarios over six provided text keywords. casefolding, tokenizing and stemming. Stemming is the
preprocessing technique that reduces words to their
A. Evaluation and Comparison root form [34]. Stemming works by removing affixes,
There are two main evaluation approaches for and suffix. The effect of stemming on the clarity of
investigating the quality of prediction results: human meaning can be significant, especially for languages
evaluation and the ROUGE evaluation matrix. Human that have a lot of affixes. This would remove the
evaluation involves having human judges assess the quality of important information about the tense and aspect of the
the predictions. This can be done by asking the judges to read verbs. The result shows that sentence predictions with
the predictions and the actual text and then rate the quality of this scenario cannot provide clarity of meaning,
the predictions. While evaluation with ROUGE (Oriented because sentences containing prefix, infix and other
Understudy for Gisting Evaluation) which consists of rouge- elements are returned to their basic form. This result
1, rouge-2, rouge-L, and rouge Lsum[32] , conducted by using contrast with researches in [17][19][20] [18][35],
rouge-1 to calculate input similarity and prediction results and which had a good impact on the classification results
comparing recall, precision, and f-score, with equation 1, 2, using the stopword removal and stemming.
and 3.
∑
R-1 Recall= (1) V. CONCLUSION
∑
∑
R-1 Precision= (2) The findings of the analysis reveal that the application of
∑
∗ preprocessing techniques, specifically tokenizing and case
R-1 F-Score = 2 ∗ (3) folding gives a significant increase in the meaning of the
!!
predicted word. For the reason that these techniques make it
B. Discussion
easier for the model to learn the patterns of the word. For
In this part we discuss the comparison of the evaluation combination of tokenizing, case folding and stopword
approaches, namely between human evaluation and the removal results in ineffective sentence prediction as well as
ROUGE Metrics toward each preprocessing scenario. For error meaning. This is because stopword removal removing
the first evaluation is ROUGE-1 metrics, which is to match conjunctions, punctuation marks, and irrelevant words in the
the suitability of the number of n-grams consisting of word documents. Removing unimportant words by removing stop
order and word pairs between candidate text (sentence words affects precision, recall and f-score, which is worth 0
prediction) and reference, as presented in Table I. The (see Table I no. 2). Meanwhile, the combination of
candidate is the predicted n-gram. Meanwhile, the reference tokenizing, case folding and stemming lead to lost meaning
is a list of word groups that the predicted n-gram should of the predicted sentence as well as produce overlapping
match. We tested one-word (unigram), two-word (bigram) sentence structures. Therefore, the application of stopword
and three-word (trigram) keywords sequentially. It can be removal and stemming techniques in these cases can reduce
seen in Table I that the precision, recall, and f-score values the size and dimensions index of a sentence. The application
towards the four preprocessing techniques, the case folding of stopwords except to remove certain punctuation marks,
tend to produce low scores among others techniques. This symbols, numbers, and internet sites that are not considered
condition caused by the probability of successive words in the important indications in documents. While ROUGE metrics
dataset is less. The prediction results with case folding and are useful for evaluating the frequency of input words
tokenizing preprocessing produce sentence suitability that compared to the number of corresponding words in the word
correlates with the prediction results. Conversely, testing dataset, they do not guarantee the correctness of prediction
with non-consecutive (random) keywords produces a higher results, as they do not consider the semantic meaning of the
score, because more non-consecutive keywords (Table I, words or the context of the text. For further work, it is
numbers 4-6) are found in the dataset. An important note to necessary to compare the model with other word embeddings,
be notices is despite having a high score, testing with non- and deeper studies for non-consecutive (random) keywords
consecutive keywords results in overlapping sentence
recommendations especially when the stopword removal and ACKNOWLEDGMENT
stemming are applied. This work is supported and organized by the National
Research and Innovation Agency “BRIN”

Authorized licensed use limited to: Badan Riset Dan Inovasi Nasional. Downloaded on October 19,2023 at 00:48:57 UTC from IEEE Xplore. Restrictions apply.
277
2023 International Conference on Computer, Control, Informatics and its Applications (IC3INA)

REFERENCES Pattern Recognition in Information Systems, PRIS 2010, in

Conjunction with ICEIS 2010, no. Iceis, pp. 53–61, 2010, doi:
[1] W. Shi and V. Demberg, “Next sentence prediction helps implicit 10.5220/0003028700530061.
discourse relation classification within and across domains,” [19] A. P. Widyassari, E. Noersasongko, A. Syukur, and Affandy, “The 7-
EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Phases Preprocessing Based On Extractive Text Summarization,”
Natural Language Processing and 9th International Joint Conference 2022 7th International Conference on Informatics and Computing,
on Natural Language Processing, Proceedings of the Conference, pp. ICIC 2022, pp. 1–8, 2022, doi: 10.1109/ICIC56845.2022.10006998.
5790–5796, 2019, doi: 10.18653/v1/d19-1586. [20] A. Y. Muaad et al., “Arabic Document Classification: Performance
[2] J. Liu, J. C. K. Cheung, and A. Louis, “What comes next? Extractive Investigation of Preprocessing and Representation Techniques,”
summarization by next-sentence prediction,” 2019, [Online]. Math Probl Eng, vol. 2022, 2022, doi: 10.1155/2022/3720358.
Available: http://arxiv.org/abs/1901.03859 [21] D. Torunoǧlu, E. Çakirman, M. C. Ganiz, S. Akyokuş, and M. Z.
[3] K. Yang, “Transformer-based Korean Pretrained Language Models: Gürbüz, “Analysis of preprocessing methods on classification of
A Survey on Three Years of Progress,” pp. 1–7, 2021, [Online]. Turkish texts,” INISTA 2011 - 2011 International Symposium on
Available: http://arxiv.org/abs/2112.03014 INnovations in Intelligent SysTems and Applications, pp. 112–117,
[4] P. Do and T. H. V. Phan, “Developing a BERT based triple 2011, doi: 10.1109/INISTA.2011.5946084.
classification model using knowledge graph embedding for question [22] X. Liu, “Feature Reduction Techniques for Arabic Text
answering system,” Applied Intelligence, vol. 52, no. 1, pp. 636–651, Categorization,” Journal of the American Society for Information
2022, doi: 10.1007/s10489-021-02460-w. Science and Technology, vol. 64, no. July, pp. 1852–1863, 2013, doi:
[5] M. S. Lin, C. G. Y. Tang, X. J. Kom, J. Y. Eyu, and C. Xu, “Building 10.1002/asi.
a Natural Language Processing Model to Extract Order Information [23] A. O. Arisha, Hazriani, and Y. Wabula, “Text Preprocessing
from Customer Orders for Interpretative Order Management,” IEEE Approaches in CNN for Disaster Reports Dataset,” 5th International
International Conference on Industrial Engineering and Engineering Conference on Artificial Intelligence in Information and
Management, vol. 2022-Decem, pp. 81–86, 2022, doi: Communication, ICAIIC 2023, pp. 216–220, 2023, doi:
10.1109/IEEM55944.2022.9989801. 10.1109/ICAIIC57133.2023.10067109.
[6] R. Garg, A. W. Kiwelekar, and L. D. Netak, “Logistics and freight [24] O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah,
transportation management: An NLP based approach for shipment “Bangla word prediction and sentence completion using GRU: An
tracking,” Pertanika J Sci Technol, vol. 29, no. 4, pp. 2745–2765, extended version of RNN on N-gram language model,” 2019
2021, doi: 10.47836/PJST.29.4.28. International Conference on Sustainable Technologies for Industry
[7] P. M. Mah, I. Skalna, and J. Muzam, “Natural Language Processing 4.0, STI 2019, vol. 0, pp. 1–6, 2019, doi:
and Artificial Intelligence for Enterprise Management in the Era of 10.1109/STI47673.2019.9068063.
Industry 4.0,” Applied Sciences (Switzerland), vol. 12, no. 18, 2022, [25] K. Shakhovska, I. Dumyn, N. Kryvinska, and M. K. Kagita, “An
doi: 10.3390/app12189207. Approach for a Next-Word Prediction for Ukrainian Language,”
[8] E. Hossain et al., “Natural Language Processing in Electronic Health Wirel Commun Mob Comput, vol. 2021, 2021, doi:
Records in relation to healthcare decision-making: A systematic 10.1155/2021/5886119.
review,” Comput Biol Med, vol. 155, no. February, p. 106649, 2023, [26] R. Sharma, N. Goel, N. Aggarwal, P. Kaur, and C. Prakash, “Next
doi: 10.1016/j.compbiomed.2023.106649. Word Prediction in Hindi Using Deep Learning Techniques,” 2019
[9] A. Rahman et al., Federated learning-based AI approaches in smart International Conference on Data Science and Engineering, ICDSE
healthcare: concepts, taxonomies, challenges and open issues, vol. 2019, pp. 55–60, 2019, doi: 10.1109/ICDSE47409.2019.8971796.
26, no. 4. Springer US, 2022. doi: 10.1007/s10586-022-03658-4. [27] Y. Albalawi, J. Buckley, and N. S. Nikolov, “Investigating the impact
[10] C. H. Ku, A. Iriberri, and G. Leroy, “Natural Language Processing of pre-processing techniques and pre-trained word embeddings in
and e-Government : Crime Information Extraction from detecting Arabic health information on social media,” J Big Data,
Heterogeneous Data Sources,” The Proceedings of the 9th Annual vol. 8, no. 1, 2021, doi: 10.1186/s40537-021-00488-w.
International Digital Government Research Conference, pp. 162– [28] H. K. Hamarashid, S. A. Saeed, and T. A. Rashid, “Next word
170, 2006. prediction based on the N-gram model for Kurdish Sorani and
[11] S. Montalvo, J. Palomo, and C. De La Orden, “Building an Kurmanji,” Neural Comput Appl, vol. 33, no. 9, pp. 4547–4566,
educational platform using NLP: A case study in teaching finance,” 2021, doi: 10.1007/s00521-020-05245-3.
Journal of Universal Computer Science, vol. 24, no. 10, pp. 1403– [29] A. K. Uysal and S. Gunal, “The impact of preprocessing on text
1423, 2018. classification,” Inf Process Manag, vol. 50, no. 1, pp. 104–112, 2014,
[12] K. H. Musliadi, H. Zainuddin, and Y. Wabula, “Twitter Social Media doi: 10.1016/j.ipm.2013.08.006.
Conversion Topic Trending Analysis Using Latent Dirichlet [30] B. A. H. Kholifatullah and A. Prihanto, “Penerapan Metode Long
Allocation Algorithm,” Journal of Applied Engineering and Short Term Memory Untuk Klasifikasi Pada Hate Speech,” Journal
Technological Science, vol. 4, no. 1, pp. 390–399, 2022, doi: of Informatics and Computer Science (JINACS), vol. 04, pp. 292–
10.37385/jaets.v4i1.1143. 297, 2023, doi: 10.26740/jinacs.v4n03.p292-297.
[13] A. Elcholiqi and A. Musdholifah, “Chatbot in Bahasa Indonesia [31] D. Bruneo and F. De Vita, “On the use of LSTM networks for
using NLP to Provide Banking Information,” IJCCS (Indonesian predictive maintenance in smart industries,” Proceedings - 2019
Journal of Computing and Cybernetics Systems), vol. 14, no. 1, p. 91, IEEE International Conference on Smart Computing, SMARTCOMP
2020, doi: 10.22146/ijccs.41289. 2019, pp. 241–248, 2019, doi: 10.1109/SMARTCOMP.2019.00059.
[14] N. S. Khan, A. Abid, and K. Abid, “A Novel Natural Language [32] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of
Processing (NLP)–Based Machine Translation Model for English to Summaries Chin-Yew,” ProceedingsoftheACL-04Workshop., vol.
Pakistan Sign Language Translation,” Cognit Comput, vol. 12, no. 4, Text Summa, pp. 74–81, 2004.
pp. 748–765, 2020, doi: 10.1007/s12559-020-09731-7. [33] S. Sarica and J. Luo, “Stopwords in technical language processing,”
[15] Rahul, S. Adhikar, and Monika, “NLP based Machine Learning PLoS One, vol. 16, no. 8 August, pp. 1–13, 2021, doi:
Approaches for Text Summarization,” Proceedings of the 4th 10.1371/journal.pone.0254937.
International Conference on Computing Methodologies and [34] D. Mustikasari, I. Widaningrum, R. Arifin, and W. H. E. Putri,
Communication, ICCMC 2020, no. Iccmc, pp. 535–538, 2020, doi: “Comparison of Effectiveness of Stemming Algorithms in Indonesian
10.1109/ICCMC48092.2020.ICCMC-00099. Documents,” Proceedings of the 2nd Borobudur International
[16] A. Bouziane, D. Bouchiha, N. Doumi, and M. Malki, “Question Symposium on Science and Technology (BIS-STE 2020), vol. 203, pp.
Answering Systems: Survey and Trends,” Procedia Comput Sci, vol. 154–158, 2021, doi: 10.2991/aer.k.210810.025.
73, no. Awict, pp. 366–375, 2015, doi: 10.1016/j.procs.2015.12.005. [35] D. J. Ladani and N. P. Desai, “Stopword Identification and Removal
[17] S. Alam and N. Yao, “The impact of preprocessing steps on the Techniques on TC and IR applications: A Survey,” 2020 6th
accuracy of machine learning algorithms in sentiment analysis,” International Conference on Advanced Computing and
Comput Math Organ Theory, vol. 25, no. 3, pp. 319–335, 2019, doi: Communication Systems, ICACCS 2020, pp. 466–472, 2020, doi:
10.1007/s10588-018-9266-8. 10.1109/ICACCS48705.2020.9074166.
[18] C. A. Gonçalves, C. T. Gonçalves, R. Camacho, and E. Oliveira,
“The impact of pre-processing on the classification of MEDLINE
documents,” Proceedings of the 10th International Workshop on

Authorized licensed use limited to: Badan Riset Dan Inovasi Nasional. Downloaded on October 19,2023 at 00:48:57 UTC from IEEE Xplore. Restrictions apply.
278
View publication stats

RPS Ilmu Ashwat
No ratings yet
RPS Ilmu Ashwat
7 pages
Modern Legal Drafting
94% (17)
Modern Legal Drafting
274 pages
Rankin On The Treatment of Pitch in Early Music Writing
100% (1)
Rankin On The Treatment of Pitch in Early Music Writing
71 pages
The Therapeutic Properties and Applications of Aloe Vera: A Review
No ratings yet
The Therapeutic Properties and Applications of Aloe Vera: A Review
11 pages
Agar Wood Grade Determination System Using Image Processing Technique
No ratings yet
Agar Wood Grade Determination System Using Image Processing Technique
4 pages
Rheumatoid Arthritis: History, Stages, Epidemiology, Pathogenesis, Diagnosis and Treatment
No ratings yet
Rheumatoid Arthritis: History, Stages, Epidemiology, Pathogenesis, Diagnosis and Treatment
12 pages
Review and Open Issues of Cryptographic Algorithms in Cyber Security
No ratings yet
Review and Open Issues of Cryptographic Algorithms in Cyber Security
6 pages
Bill of Quantities With 3D Views Using Building Information Modeling
No ratings yet
Bill of Quantities With 3D Views Using Building Information Modeling
16 pages
Spoken English Book
100% (1)
Spoken English Book
118 pages
Enabling Technologies and Tools For Digital Twin: Journal of Manufacturing Systems March 2021
No ratings yet
Enabling Technologies and Tools For Digital Twin: Journal of Manufacturing Systems March 2021
20 pages
LinearizedMorisonDragforImprovementSemi Submersible
No ratings yet
LinearizedMorisonDragforImprovementSemi Submersible
10 pages
3 Naskahijcp
No ratings yet
3 Naskahijcp
14 pages
Earthing
No ratings yet
Earthing
13 pages
Electrical Earthing in Troubled Environment PDF
No ratings yet
Electrical Earthing in Troubled Environment PDF
13 pages
Antioxidant and Antiproliferative Activities of An
No ratings yet
Antioxidant and Antiproliferative Activities of An
6 pages
Tracing Technique For Blaster Attack PDF
No ratings yet
Tracing Technique For Blaster Attack PDF
9 pages
A Review of Rare Earth Mineral Processing Technolo
100% (1)
A Review of Rare Earth Mineral Processing Technolo
17 pages
Highlighted BANGKOK DECLARATION AND AWARENESS OF ASEAN MEMBER COUNTRIES THE REGIONAL LAW OF CLEANING OUR OCEANS
No ratings yet
Highlighted BANGKOK DECLARATION AND AWARENESS OF ASEAN MEMBER COUNTRIES THE REGIONAL LAW OF CLEANING OUR OCEANS
6 pages
GalleyproofJPAA 2020 164pre Print
No ratings yet
GalleyproofJPAA 2020 164pre Print
12 pages
Evaluation of DSP Based Numerical Relay For Overcu
No ratings yet
Evaluation of DSP Based Numerical Relay For Overcu
9 pages
Design and Analysis of Low Noise Amplifier
No ratings yet
Design and Analysis of Low Noise Amplifier
11 pages
Effectof Different Drying Methodsonthe Morphology Crystallinity Swellingof Nata
No ratings yet
Effectof Different Drying Methodsonthe Morphology Crystallinity Swellingof Nata
8 pages
Distributionof Nutrient Concentrationinthe Upwellingareaoff East Coastof Peninsularmalaysiaduringthe Southwest Monsoon
No ratings yet
Distributionof Nutrient Concentrationinthe Upwellingareaoff East Coastof Peninsularmalaysiaduringthe Southwest Monsoon
15 pages
Robotics and Automation in Agriculture: Present and Future Applications
No ratings yet
Robotics and Automation in Agriculture: Present and Future Applications
12 pages
40 IJPQAVol8Issue3Article7
No ratings yet
40 IJPQAVol8Issue3Article7
7 pages
Risk Factors in The Development of Knee Osteoarthritis: A Case-Control Study
No ratings yet
Risk Factors in The Development of Knee Osteoarthritis: A Case-Control Study
5 pages
The Covid-19 Chronicles of Malaysia in The Face A Pandemic (Second Edition)
No ratings yet
The Covid-19 Chronicles of Malaysia in The Face A Pandemic (Second Edition)
2 pages
TMI PhaseDetection10
No ratings yet
TMI PhaseDetection10
13 pages
Development of Strand Burner For Solid Propellant Burning Rate Studies
No ratings yet
Development of Strand Burner For Solid Propellant Burning Rate Studies
10 pages
Sur Ah Yasin Dan Am Alan Harian
No ratings yet
Sur Ah Yasin Dan Am Alan Harian
6 pages
Congestion Management in Power System: A Review: April 2017
No ratings yet
Congestion Management in Power System: A Review: April 2017
7 pages
Aerobic Post-Treatment of Different Anaerobically Digested Palm Oil Mill Ef Uent (POME)
No ratings yet
Aerobic Post-Treatment of Different Anaerobically Digested Palm Oil Mill Ef Uent (POME)
6 pages
Ground Control 2013
No ratings yet
Ground Control 2013
6 pages
Azmietal 2022
No ratings yet
Azmietal 2022
15 pages
List of 188 Japanese Particles: Indicates The Indirect Object of A Verb
No ratings yet
List of 188 Japanese Particles: Indicates The Indirect Object of A Verb
9 pages
Numerical Relay For Overcurrent Protection Using T
No ratings yet
Numerical Relay For Overcurrent Protection Using T
7 pages
Introducing and Exchanging Personal Information: Practice 1
No ratings yet
Introducing and Exchanging Personal Information: Practice 1
11 pages
1 s2.0 S0960148123003579 Main
No ratings yet
1 s2.0 S0960148123003579 Main
18 pages
Buổi 3 Có Liên Quan Đến Phân Phối Chương Trình Ra Đề Kiểm Tra
100% (1)
Buổi 3 Có Liên Quan Đến Phân Phối Chương Trình Ra Đề Kiểm Tra
9 pages
IGCSE Extended Exam Booklet
100% (2)
IGCSE Extended Exam Booklet
25 pages
Spatiotemporal Dynamics of Development Inequalities in Lahore City Region, Pakistan
No ratings yet
Spatiotemporal Dynamics of Development Inequalities in Lahore City Region, Pakistan
2 pages
Waste To Energy Conversion For A Sustainable Future: Heliyon October 2021
No ratings yet
Waste To Energy Conversion For A Sustainable Future: Heliyon October 2021
13 pages
Early Astronomical Sites in Kashmir
No ratings yet
Early Astronomical Sites in Kashmir
8 pages
Goethe Institut Nigeria Brochure 20232027
No ratings yet
Goethe Institut Nigeria Brochure 20232027
16 pages
Sensors 20 04410
No ratings yet
Sensors 20 04410
28 pages
PBS and Oil Palm Mesocarp Fiber As New Lignocellulosic Material Composites
No ratings yet
PBS and Oil Palm Mesocarp Fiber As New Lignocellulosic Material Composites
9 pages
Journal 2
No ratings yet
Journal 2
9 pages
2013arisetal IEEEBEIAC CSRforAccountingGraduate
No ratings yet
2013arisetal IEEEBEIAC CSRforAccountingGraduate
7 pages
Dervin & Gross 2016 ICC in Education PDF
No ratings yet
Dervin & Gross 2016 ICC in Education PDF
270 pages
Potato Disease Detection Using Machine Learning: February 2021
No ratings yet
Potato Disease Detection Using Machine Learning: February 2021
5 pages
Voice - Meherab
No ratings yet
Voice - Meherab
13 pages
ACA-Critical Review of Technology Enhanced Learning1
No ratings yet
ACA-Critical Review of Technology Enhanced Learning1
11 pages
Internet of Things IoT Security With Blockchain Technology A State-of-the-Art Review
No ratings yet
Internet of Things IoT Security With Blockchain Technology A State-of-the-Art Review
18 pages
Examples Simple Present
No ratings yet
Examples Simple Present
7 pages
3 EvaluationofCounselingServicesProvidedbyCommunity
No ratings yet
3 EvaluationofCounselingServicesProvidedbyCommunity
9 pages
Spolsky - Bridging The Gap - A Theory SLL
No ratings yet
Spolsky - Bridging The Gap - A Theory SLL
21 pages
Third Term English Test 1: Part One: Reading Comprehension and Vocabulary (06 Points) Ary (06 Points) 06 Points)
100% (1)
Third Term English Test 1: Part One: Reading Comprehension and Vocabulary (06 Points) Ary (06 Points) 06 Points)
2 pages
2023 AMS ReviewRotorSystems
No ratings yet
2023 AMS ReviewRotorSystems
8 pages
RPS
No ratings yet
RPS
5 pages
Exploring The Influence of Artificial Intelligence Technology On Consumer
No ratings yet
Exploring The Influence of Artificial Intelligence Technology On Consumer
12 pages
Daily English Language Lesson Plan Class 2 Biru: Teacher's Reflection
No ratings yet
Daily English Language Lesson Plan Class 2 Biru: Teacher's Reflection
4 pages
Simple Present and Past Simple.
No ratings yet
Simple Present and Past Simple.
2 pages
Student Assessment Workbook: BSBWRT401 Write Complex Documents
No ratings yet
Student Assessment Workbook: BSBWRT401 Write Complex Documents
25 pages
Capitalization Worksheet 2
No ratings yet
Capitalization Worksheet 2
3 pages
Echinochloa Colona From The Rice Fields of Punjab: Article
No ratings yet
Echinochloa Colona From The Rice Fields of Punjab: Article
9 pages
Review of Complexity Metrics For Object Oriented S
No ratings yet
Review of Complexity Metrics For Object Oriented S
8 pages
Third Conditional Exercise 2
No ratings yet
Third Conditional Exercise 2
2 pages
2021 Ao
No ratings yet
2021 Ao
8 pages
C. Articles
No ratings yet
C. Articles
8 pages
1 s2.0 S0160791X22003311 Main
No ratings yet
1 s2.0 S0160791X22003311 Main
12 pages
Micro ControllerBasedSmartElectronicVoting
No ratings yet
Micro ControllerBasedSmartElectronicVoting
6 pages
The Relationship Between Conceptual Metaphor and Culture I. Ibarretxe-Antuïano, I.,In Intercultural Pragmatics - Vol. 10 (2013), P. 315 - 339
No ratings yet
The Relationship Between Conceptual Metaphor and Culture I. Ibarretxe-Antuïano, I.,In Intercultural Pragmatics - Vol. 10 (2013), P. 315 - 339
26 pages
Bài Thi Tham KH o
No ratings yet
Bài Thi Tham KH o
13 pages
The Role of Machine Learning in Digital
No ratings yet
The Role of Machine Learning in Digital
13 pages
Unit V Flat LM Cse
No ratings yet
Unit V Flat LM Cse
19 pages
Elementary GW 05a-1
No ratings yet
Elementary GW 05a-1
2 pages
Predicting The Actual Use of M-Learning Systems: A Comparative Approach Using PLS-SEM and Machine Learning Algorithms
No ratings yet
Predicting The Actual Use of M-Learning Systems: A Comparative Approach Using PLS-SEM and Machine Learning Algorithms
17 pages
1 ST
No ratings yet
1 ST
8 pages
Q4 Module 2
No ratings yet
Q4 Module 2
35 pages
UNIT2 TOP DOWN AND BOTTOM UP Parser
No ratings yet
UNIT2 TOP DOWN AND BOTTOM UP Parser
151 pages
ESL Notes On The Future Tense
No ratings yet
ESL Notes On The Future Tense
6 pages
Social Awareness The Power of Digital Elements in
No ratings yet
Social Awareness The Power of Digital Elements in
11 pages
Modes of Failure of Proximal Femoral Nail PFN in U
No ratings yet
Modes of Failure of Proximal Femoral Nail PFN in U
11 pages
The Effectof Exam Stresson Students Eatinghabitsandlifestyleofuniversitystudentsin Erbil
No ratings yet
The Effectof Exam Stresson Students Eatinghabitsandlifestyleofuniversitystudentsin Erbil
8 pages
Revolutionizing Agriculture: A Comprehensive Review of Nanotechnology Applications
No ratings yet
Revolutionizing Agriculture: A Comprehensive Review of Nanotechnology Applications
19 pages
Grade Vii Question Bank
No ratings yet
Grade Vii Question Bank
28 pages
Allied 1 - Writing Good Journalistic Style
No ratings yet
Allied 1 - Writing Good Journalistic Style
24 pages
IoT based Projects: Realization with Raspberry Pi, NodeMCU and Arduino
From Everand
IoT based Projects: Realization with Raspberry Pi, NodeMCU and Arduino
Rajesh Singh
No ratings yet
Cookbook for Mobile Robotic Platform Control: With Internet of Things And Ti Launch Pad
From Everand
Cookbook for Mobile Robotic Platform Control: With Internet of Things And Ti Launch Pad
Dr. Anita Gehlot
No ratings yet
Computer Vision: Exploring the Depths of Computer Vision
From Everand
Computer Vision: Exploring the Depths of Computer Vision
Fouad Sabry
No ratings yet

Next Sentence Prediction The Impact of Preprocessing Techniques in Deep Learning

Uploaded by

Next Sentence Prediction The Impact of Preprocessing Techniques in Deep Learning

Uploaded by

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Next Sentence Prediction: The Impact of Preprocessing Techniques in Deep

Conference Paper · October 2023

Yuyun Wabula Andi Djalal Latief

SEE PROFILE SEE PROFILE

The user has requested enhancement of the downloaded file.

Next Sentence Prediction: The Impact of

4th Hazriani 5th Andriansyah Oktafiandi Arisha 6th Mushaf

Fig. 2. Structure of layer in LSTM training model

The advantage of this algorithm is that it can capture semantic

2 Stemming adalah 0.055 0.333 0.086 0 0 0 0.055 0.333 0.086

REFERENCES Pattern Recognition in Information Systems, PRIS 2010, in

You might also like