Final Thesis Report
A Thesis in Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
Summer 2023
Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A
Machine Learning Journey
A Thesis in Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE). The thesis has been examined and
approved,
_____________________________
Prof. Dr. Utpal Kanti Das
Professor and Chairman
_____________________________
Dr. Hasibur Rashid Chayon
Co-supervisor, Coordinator, and Associate Professor
_____________________________
Rubayea Ferdows
Assistant Professor
Summer 2023
Letter of Transmittal
07 October 2022
The Chair
4 Embankment Drive Road, Sector 10, Uttara Model Town Dhaka 1230, Bangladesh
Dear Sir,
We would like to submit a proposal titled "Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine Learning Journey" in compliance with your instructions and the requirements of the thesis course. The study's objective is to employ machine learning strategies to anticipate the subsequent word from user input; with the help of this model, the next word can be predicted.
Yours sincerely,
_______________ _________________
Sadia Afrin                              Mahazabin Akter
ID: 20103025                             ID: 20103013
Student Declaration
This is to certify that the work presented in this thesis paper, titled "Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine Learning Journey," is the result of study and research carried out by the undersigned students under the direction of Rubayea Ferdows, Assistant Professor, Department of Computer Science and Engineering, International University of Business Agriculture and Technology.
_______________ _________________
Sadia Afrin                              Mahazabin Akter
ID: 20103025                             ID: 20103013
Supervisor’s Certification
We attest that between September 22, 2022, and September 5, 2023, Sadia Afrin (ID: 20103025) and Mahazabin Akter (ID: 20103013) completed their thesis work in the Department of Computer Science and Engineering at the International University of Business Agriculture and Technology (IUBAT). The thesis was titled "Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine Learning Journey." During this period they consulted with me regularly, as mandated by the department. I therefore recommend that their thesis report be approved for an oral examination.
_____________________________
Rubayea Ferdows
Assistant Professor
Department of Computer Science and Engineering
Abstract
Due to the internet's current widespread accessibility and the rapid expansion of digital data, machine learning research has risen considerably during the last ten years. Machine learning is also crucial for natural language processing. We conduct online searches daily, and each time we perform a search, the search engine proposes candidate words. The search engine uses a machine learning technique, specifically natural language processing, to achieve this. Bangla is one of the world's most extensively used languages, and the next word in the language is something we can anticipate as well. This will make our efforts simpler and improve our quality of life. Traditionally, we conduct online searches, and the writing process takes time. However, if we apply machine learning in this case, it will make keyword suggestions depending on our search; with the help of these recommendations, we can quickly choose our desired word and conduct an internet search for it. There is an issue, however: for the sake of word prediction, we occasionally need to write the entire sentence. In our study, we proposed a novel approach for LSTM and N-gram-based prediction of Bengali words and sentences. We used n-gram sequencing, tokenization, and stop-word filtering to train our model in our system. After utilizing Long Short-Term Memory (LSTM) to train and test our model, we obtained high accuracy for unigrams, bigrams, trigrams, four-gram sentences, and five-gram sentences. Based on input, our proposed model can also anticipate the complete text.
Acknowledgments
First and foremost, we want to thank the all-powerful, kind, and beautiful God for
granting us the stamina to finish the thesis report on time.
We would like to express our gratitude to the late Professor Dr. Md. Alimullah Miyan
for founding the International University of Business, Agriculture, and Technology
(IUBAT), which is a great place to study. We would like to express our gratitude and
respect to Professor Dr. Abdur Rab, our current honorary vice chancellor at the
International University of Business, Agriculture, and Technology. We would like to
express our gratitude to the late Prof. Dr. Md Abdul Haque, Chairman of the
Department of Computer Science and Engineering at the International University of
Business, Agriculture, and Technology (IUBAT), for granting us permission to study
there and enabling us to envision a promising future in the emerging field of
technology.
For their unwavering guidance and support throughout the entire BCSE program, we
are extremely grateful to Prof. Dr. Utpal Kanti Das, Chairman of the Department of
Computer Science and Engineering, and Dr. Hasibur Rashid Chayon, Coordinator of
the Department of Computer Science and Engineering, at IUBAT - International
University of Business Agriculture and Technology. We also want to express our
gratitude to Rubayea Ferdows, our thesis supervisor, who gave us the chance to create
this thesis report by offering her insightful recommendations and counsel at any time
and in any circumstance. We truly appreciate all of the faculty's ongoing assistance
and support throughout our thesis period.
Table of Contents
Letter of Transmittal......................................................................................................III
Student Declaration.......................................................................................................IV
Supervisor’s Certification...............................................................................................V
Abstract.........................................................................................................................VI
Acknowledgments.......................................................................................................VII
List of Tables.................................................................................................................IX
List of Figures................................................................................................................X
Chapter I: Introduction....................................................................................................1
1.1 Background and Context....................................................................................1
1.2 Problem Statements............................................................................................2
1.3 Research Objective.............................................................................................3
1.4 Research Motivation...........................................................................................3
1.5 Relevance and Importance of the Research........................................................4
Chapter II: Literature Review.........................................................................................5
Chapter III: Methodology..............................................................................................10
3.1 Dataset Collection.............................................................................................10
3.2 Dataset Pre-Processing.....................................................................................15
3.2.1 Punctuation Removal..............................................................................15
3.2.2 Stop Words Removal...............................................................................15
3.2.3 Tokenization............................................................................................11
3.3 N-Gram.............................................................................................................16
3.4 Splitting the Dataset..........................................................................................12
3.5 Deep Learning Model.......................................................................................12
3.5.1 LSTM......................................................................................................12
Chapter IV: Result and Discussions..............................................................................23
4.1 Result Analysis.......................................................................................23
Chapter V: Conclusion..................................................................................................28
5.1 Limitations........................................................................................................28
5.2 Future Works.....................................................................................................29
Reference.......................................................................................................................30
Chapter I: Introduction
1.1 Background and Context
Machine learning is a branch of artificial intelligence. It aims to make machines behave and make decisions more like humans by enabling them to learn and, in effect, create their own programming, without explicit programming or significant human engagement. The learning process is automated and improved based on the experiences the machines gain while performing the task.
The computers are given high-quality data, and a variety of methodologies are employed to create ML models that train the computers on this data. The method used depends on the type of data available and the specifics of the automated action needed. [1]
Deep learning is a subclass of machine learning based on neural networks with three or more layers. Although they cannot replicate the human brain, these neural networks attempt to emulate its functionality and to "learn" from massive amounts of data. Even a neural network with a single layer can produce approximations, but additional hidden layers help to improve and optimize the results.
Deep learning is the cornerstone of many artificial intelligence (AI) systems and services that enhance automation by performing mental and physical tasks without requiring human input. Deep learning technology is the driving force behind both everyday technologies (such as digital assistants, voice-activated TV remote controls, and credit card fraud detection) and emerging ones (such as self-driving cars).[2] In computer science, the field of "artificial intelligence" (AI) known as "natural language processing" (NLP) is more narrowly focused on enabling machines to comprehend spoken and written words similarly to how people do.
NLP combines statistical, machine learning, and deep learning models with
computational linguistics, which uses rules to represent human language. These
technologies enable computers to fully "understand" what is said or written, including
the motivations and feelings of the speaker or writer, and to translate human language
into text or audio data.[3]
The main objective of the Natural Language Processing (NLP) task "next word
prediction" is to identify the word that is most likely to come after a given string of
words. Numerous applications, such as text auto-completion, speech recognition, and
machine translation, depend on this predictive capability. Due to the remarkable
success of deep learning algorithms in a variety of language-related tasks, such as
next-word prediction, natural language processing (NLP) has experienced a
revolution.[4]
Bangla is among the most widely used languages in existence. Using deep learning, we can also predict the following word in Bangla and generate a phrase. In this work, we present a novel method for anticipating the subsequent Bengali word and formulating a sentence.
1.2 Problem Statements
The objective of the problem statement is to preprocess our dataset using different preprocessing methods, including tokenization and punctuation removal, and then to train and test the model. We attempt to explore the correlation between the attributes in order to ascertain how the attributes of the dataset relate to one another and how the next-word forecasts evolve. With the help of automatic next-word prediction, Bangladeshi citizens will be able to use the internet more effectively and conduct more focused internet research.
1.3 Research Objective
Machine learning is without a doubt one of the most intriguing fields of computer science. Learning from data is achieved by giving the machine the right inputs. It is crucial to understand how machine learning works and, subsequently, how it is used over time. The first stage of machine learning involves choosing the algorithm and feeding it training data. Whether a subject is more or less well known, the training data is what allows the most effective machine learning algorithm to be developed: the kind of training data supplied determines how well the algorithm can connect concepts. New input data is then fed to the machine learning algorithm to check whether it is operating correctly, and the outcome and the prediction are cross-checked. If the prediction and the results are inconsistent, the algorithm is retrained until the desired result is attained. This makes it possible for the machine learning algorithm to continuously learn on its own and converge on the optimal solution, progressively increasing in accuracy over time. Machine learning is significant because it enables companies to monitor patterns in operational behavior and customer behavior trends, as well as to support the introduction of new products. Many of today's most prosperous businesses, including Facebook, Google, and Uber, give machine learning top priority in their day-to-day operations, and it has become a key point of differentiation for many firms. With the aid of machine learning, we can develop an algorithm that uses data and statistical analysis to predict outcomes or build classifications, and we can train a model to improve its performance. In this research we explore the underlying concepts of machine learning and learn how it functions, so that we can effectively anticipate the next word for next-word prediction and also generate a phrase using deep learning.
1.5 Relevance and Importance of the Research
Machine learning (ML) is one of the most exciting technologies ever created. It is the field of study that enables computers to learn without being explicitly programmed. Artificial intelligence systems strive to complete difficult tasks in a manner akin to how individuals approach challenges. In the technique we propose, we attempt to anticipate the next word as well as to construct a phrase in Bangla. Bangla is one of the richest languages in the world. Many people conduct daily online searches; however, much of the time they are unable to obtain precise information because of a lack of knowledge. Deep learning allows us to predict the following word in a sentence and to generate it, which makes it much simpler for us to do online searches, create reports, and perform other tasks.
Chapter II: Literature Review
One of the richest languages in the world is Bangla. More than 200 million people speak and conduct everyday business in the Bangla language. To help Bengalis use the internet more effectively and produce reports and other documents more quickly, in our research we forecast the next word and also construct a sentence in the Bangla language. Everyone will benefit greatly from it. In this study, we put forward a technique for predicting the next word in Bengali and creating a phrase. For our research, we studied several papers, which gave us motivation for this work. Some of the papers are similar to our work; here we discuss some of them.
methodology is discussed in the third section, the experimental setup is described in the
fourth section, and the experimental results are shown in the final section. The
proposed method can be utilized as a word prediction aid in a variety of applications, according to the paper.
contributes to the field of word prediction by introducing a learning and classification-
based approach, which leverages machine learning to provide more contextually
relevant and accurate word predictions, ultimately improving the efficiency and quality
of text input in various applications.
Paper 6: N-gram based Statistical Grammar Checker for Bangla and English
The thesis paper describes a statistical grammar checker that analyzes words based on
n-grams and POS tags to evaluate whether a sentence is grammatically correct. N-grams,
which are sequential patterns of words or characters, are employed to analyze the
linguistic structure of sentences in both languages. The paper explores how statistical
information from large corpora in both languages can be harnessed to identify and
correct grammatical errors, including issues related to word order, verb agreement, and
sentence structure. By leveraging N-gram models, the system provides automated
suggestions for grammatical improvements, making it a valuable tool for writers and
learners of both Bangla and English. The research contributes to the development of
effective and accessible grammar-checking tools for two languages, addressing the
specific linguistic nuances and challenges of each.
research contributes to the development of automated text input systems in Bengali,
improving the user experience and text generation efficiency in this language.
Paper 9: Word completion and sequence prediction in Bangla language using trie
and a hybrid approach of sequential LSTM and N-gram
The research suggests a hybrid method that combines sequential LSTM and N-gram
models to anticipate and finish a Bangla phrase. To increase the accuracy of the
prediction procedure, the study additionally used a trie data structure to hold the Bangla
words and their frequencies. When the suggested model was tested on a corpus of
Bangla, the findings revealed that it performed better than other models already in use
regarding accuracy and effectiveness. The work aims to create a Bangla language
model that is more useful and accurate and can be applied to various tasks, including
text-to-speech conversion, machine translation, and speech recognition.
Paper 10: Bangla Word Prediction and Sentence Completion Using GRU: An
Extended Version of RNN on N-gram Language Model
The paper aims to solve the problem of predicting Bangla's next most appropriate and
suitable word. It also contributes to the technology of word prediction systems. The
proposed method is based on RNN with GRU (Gated Recurrent Unit) as an extension
of RNN. It uses a sequence-to-sequence architecture and employs a Bangla N-gram
Language Model. In this work, they suggested a way to predict the following most
acceptable and suitable words in Bangla, and it can also recommend the related
sentence to help advance the field of word prediction systems. The method predicts the
next most appropriate and suitable word in Bangla. It first tokenizes the sentence, then
predicts the next word using the GRU model. The prediction is made considering the context of the input sequence.
Table 2.1 Literature Review

4. A Learning-Classification Based Approach for Word Prediction
Summary: This thesis paper presents an approach to word prediction using a learning-classification framework. The authors evaluate a word prediction system based on machine learning and classification techniques. The proposed approach focuses on training models to classify and rank potential next words in a given context. The study aims to investigate the effectiveness and applicability of this learning-classification based approach for word prediction, shedding light on its potential benefits and limitations.
Methods: Naive Bayes and latent semantic information, SVM, RNN.
Limitations: Does not predict the full sentence; out-of-vocabulary words.

5. A Stochastic Prediction Interface for Urdu
Summary: The thesis paper focuses on developing a stochastic prediction interface for Urdu, which is a morphologically rich language (MRL) and requires a number of computational tools for natural language processing (NLP). The paper highlights the usefulness of stochastic models in predicting the next word in a sentence and improving the efficiency of text input.
Methods: Stochastic prediction algorithm, N-gram model (unigram, bigram and trigram).
Limitations: If the input is more than 3 words, their approach does not work.

6. N-gram based Statistical Grammar Checker for Bangla and English
Summary: This thesis paper presents a comprehensive study on the development and evaluation of an N-gram-based statistical grammar checker for two distinct languages, Bangla and English. The research explores the utilization of N-gram models to build a grammar-checking system that is capable of identifying and suggesting corrections for grammatical errors in both the Bangla and English languages. The study involves the creation of N-gram language models, the implementation of error detection algorithms, and the evaluation of the system's performance in grammar correction for these languages.
Methods: POS (parts-of-speech) tagging approach, N-gram.
Limitations: Rule-based and hybrid checking systems can be used.

7. Verification of Bangla Sentence Structure using N-Gram
Summary: The thesis paper focuses on using N-gram language models to verify the sentence structure of the Bangla language. The paper also mentions the importance of developing language-specific models for sentence structure verification due to the differences in grammar rules and structures between languages, and describes a system for sentence structure verification based on N-gram models.
Methods: N-gram model.
Limitations: Combined rule-based and hybrid systems can be used.

8. Automated word prediction in bangla language using stochastic language models
Summary: This paper proposes a method for word prediction in the Bangla language using stochastic N-gram language models. The paper aims to reduce misspelling by predicting the correct word in a sentence. It also proposes a hybrid method for real-word error detection and correction in Bangla, which combines two different approaches such as N-gram, and evaluates the performance of different stochastic language models for word prediction in Bangla.
Methods: Stochastic N-gram language models, hybrid method.
Limitations: There is a lack of word prediction tools specifically tailored for the Bangla language.

9. Word completion and sequence prediction in Bangla language using trie and a hybrid approach of sequential LSTM and N-gram
Summary: The thesis paper proposes using a hybrid approach of trie, sequential LSTM, and N-gram models to predict the next word and complete words in the Bangla language. The paper mentions the importance of developing language-specific models for word completion and sequence prediction due to the differences in grammar rules and structures between languages, and describes the implementation of the trie data structure for word completion.
Methods: Hybrid sequential LSTM and N-gram models.
Limitations: Limited dataset size; complex grammar and morphology.

10. Bangla Word Prediction and Sentence Completion Using GRU: An Extended Version of RNN on N-gram Language Model
Summary: This research explores advanced techniques for Bangla (Bengali) word prediction and sentence completion using Gated Recurrent Units (GRU), an extended version of Recurrent Neural Networks (RNN), in conjunction with N-gram language models. The study focuses on enhancing text prediction and sentence completion accuracy in Bengali, aiming to provide users with contextually relevant word suggestions and sentence completions.
Methods: GRU, RNN.
Limitations: The prediction is made in the context of the input sequence; it may not be suitable for general use where the context may vary.
Chapter III: Methodology
Our methodology achieves the objectives of preprocessing the dataset, predicting the next word, generating sentences, and evaluating the performance of our deep learning model for next word and sentence prediction in Bengali text using LSTM and N-grams. This systematic approach helps us develop an effective and context-aware text prediction system tailored for the Bengali language. In our method, we tokenize our dataset, preprocess it by removing stop words and punctuation, perform N-gram modeling, and then train our model using LSTM. Fig. 3.1 displays the whole operational flow of our proposed system for Bangla next word prediction from the dataset. The full procedure is described in the sections that follow. We use our deep learning model to train and test Bangla next word prediction.
3.1 Dataset Collection
We gathered our data from articles in newspapers such as Prothom Alo, posts on social media, and various websites.
3.2 Dataset Pre-Processing
3.2.1 Punctuation Removal
Eliminating punctuation makes it easier to treat all texts equally. For instance, once the punctuation is removed, the phrases "data" and "data!" are treated the same. Because contraction words can lose their meaning when punctuation is removed, we must be careful not to damage the content: depending on how the removal is configured, a word like "don't" may be changed into "dont" or "don t". We remove all punctuation for our Bangla word prediction.[17]
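As a concrete illustration, the following is a minimal sketch of punctuation stripping; the regular expression and the sample sentence are illustrative assumptions, not the exact cleaning code used in our experiments. Note that Bengali sentences end with the danda (।), which must be handled alongside ASCII punctuation.

import re

# Bengali sentences end with the danda (।); include it along with
# common ASCII punctuation so that "তথ্য।" and "তথ্য" are treated alike.
PUNCTUATION = r"[।!?,;:\"'()\[\]{}\-–—.]"

def remove_punctuation(text: str) -> str:
    # Replace punctuation with spaces, then collapse repeated whitespace.
    cleaned = re.sub(PUNCTUATION, " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_punctuation("আমি ভাত খাই।"))  # -> "আমি ভাত খাই"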
3.2.2 Stop Words Removal
When examining natural language, stop words are typically disregarded. Along with articles, prepositions, pronouns, conjunctions, etc., these words are among the most common in all languages, although they don't offer much context for the text. Stop words in English include "the," "a," "an," "so," and "what." Every human language has numerous stop words. By removing this low-level information, the text can be made more focused on the key information. In other words, the model we train to achieve our goal is unaffected by the absence of these elements. Since fewer tokens are needed for training, the removal of stop words undoubtedly reduces the size of the dataset and, consequently, the training time. Stop words are also removed in our analysis.[18]
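A minimal sketch of how stop-word filtering can be applied to tokenized Bangla text. The three-entry stop-word set below is purely illustrative; a real run would load a full Bangla stop-word list from a file or an NLP resource.

# Illustrative subset only: "and", "that", "this".
BANGLA_STOP_WORDS = {"এবং", "যে", "এই"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list.
    return [tok for tok in tokens if tok not in BANGLA_STOP_WORDS]

tokens = "আমি এবং তুমি এই বই পড়ি".split()
print(remove_stop_words(tokens))  # content words only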
3.2.3 Tokenization
Tokenization is the process of dividing raw text into manageable pieces, i.e., splitting the original text into tokens such as words and phrases. These tokens help with context comprehension and NLP model building. By preserving the word order in the text, tokenization aids in deciphering the text's meaning. "It is raining," for instance, can be tokenized into "It," "is," and "raining." A variety of programs and frameworks can be used for tokenization; NLTK, Gensim, and Keras are a few libraries that can complete the task. Text can be tokenized into words or into whole sentences: word tokenization splits text into words, whereas sentence tokenization does the same for sentences.[19]
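Since Keras is among the libraries mentioned above, here is a minimal sketch of word-level tokenization with the Keras Tokenizer; the two-sentence corpus is an illustrative assumption.

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["আমি ভাত খাই", "আমি বই পড়ি"]  # tiny illustrative corpus

tokenizer = Tokenizer()           # word-level tokenization by default
tokenizer.fit_on_texts(corpus)    # build the word-to-integer index

print(tokenizer.word_index)                  # e.g. {'আমি': 1, ...}
print(tokenizer.texts_to_sequences(corpus))  # sentences as integer ID lists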
3.3 N-Gram
N-grams are contiguous groups of items from a document, such as letters, words, numbers, and other symbols; technically, they are the adjacent sequences of items in a document. They are fundamental when text data is used for NLP (Natural Language Processing) tasks. Spelling checks, text mining, machine translation, language models, and semantic features are just a few of their many uses. We choose N-gram based word prediction because N-grams offer a natural way to develop sentence completion systems: these statistical techniques can accurately predict the words that will follow in a phrase. To calculate probabilities with statistical models, the N-gram data is divided into a training set and a test set, and we carefully build and analyze the training corpus in order to determine the probabilities of a test sentence. Consider a corpus of 1,000,000 Bangla words that contains the word "abstract" 500 times; the probability of the word "abstract" is then 500/1,000,000 = 0.0005.
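The sketch below shows both ideas: extracting adjacent n-grams from a token list, and estimating a unigram probability by relative frequency as in the "abstract" example above. The token list is an illustrative assumption.

from collections import Counter

def ngrams(tokens, n):
    # All adjacent n-token windows of the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["আমি", "ভাত", "খাই", "আমি", "বই", "পড়ি"]
print(ngrams(tokens, 2))  # bigrams: ('আমি', 'ভাত'), ('ভাত', 'খাই'), ...

# Maximum-likelihood unigram probability: a word seen 500 times in a
# 1,000,000-word corpus has probability 500/1,000,000 = 0.0005.
counts, total = Counter(tokens), len(tokens)
print(counts["আমি"] / total)  # relative frequency of one word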
3.4 Splitting the Dataset
We split our dataset into 80% for training and 20% for testing. After splitting the dataset, we fit our model for training and testing.
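A minimal sketch of the 80/20 split using scikit-learn. The arrays and the random seed are illustrative stand-ins, with X holding integer-encoded n-gram prefixes and y the index of the word that follows each prefix.

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the encoded dataset.
X = np.arange(100).reshape(50, 2)  # 50 two-word prefixes
y = np.arange(50)                  # the word following each prefix

# 80% for training, 20% for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)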
3.5 Deep Learning Model
For our research, we have used LSTM for predicting the next Bangla word.
3.5.1 LSTM
The Long Short-Term Memory (LSTM) recurrent neural network (RNN) architecture is frequently used in deep learning. It is excellent at capturing long-term dependencies, which makes it ideal for applications requiring sequence prediction. Because it has feedback connections, unlike regular feedforward neural networks, an LSTM can handle entire sequences of data rather than single data points. This makes it particularly adept at identifying and forecasting patterns in sequential data such as time series, text, and voice.
The LSTM resolves the vanishing gradient problem that RNNs suffer from, so in this section we examine how it does so by looking at the LSTM's architectural layout. At a high level, an LSTM cell performs very much like an RNN cell. As seen in the figure below, the LSTM network design is split into three sections, each of which serves a particular purpose.
The first section decides whether the information from the previous timestamp should be remembered. The second section attempts to learn new information from the input to the cell. In the third section, the cell passes the updated information of the current timestamp on to the next timestamp. One cycle of the LSTM is regarded as a single time step.
These three parts of an LSTM unit are known as gates. They control how data enters and exits the memory cell, or LSTM cell. The first, second, and third gates are called the forget gate, the input gate, and the output gate, respectively. An LSTM unit, composed of these three gates and a memory cell, can be compared to a layer of neurons in a normal feedforward neural network, with each neuron having a hidden state and a current state.
Like a conventional RNN, an LSTM has a hidden state, with H(t−1) standing for the hidden state of the previous timestamp and H(t) for the current timestamp. The LSTM also has a cell state, represented by C(t−1) for the previous and C(t) for the current timestamp. The cell state is referred to as the long-term memory, whereas the hidden state is referred to as the short-term memory.
See the illustration below.
Figure 3.3 LSTM
It is important to remember that the cell state carries information along all timestamps.
Forget Gate:
The initial decision made in a cell of the LSTM neural network is whether to keep or
discard the data from the previous time step.
The forget gate equation is shown below.
ft = σ(Wf⋅[ht−1, xt] + Uf⋅Ct−1 + bf) ……(1)
Here xt is the input at the current timestamp, ht−1 is the hidden state of the previous timestamp, Wf is the weight matrix associated with the hidden state, Uf is the weight attached to the previous cell state Ct−1, and bf is a bias term. A sigmoid function is then applied, so ft ends up as a number between 0 and 1. This ft is afterwards multiplied by the cell state of the previous timestamp, as shown below.
Input Gate:
The input gate assesses the relevance of the new information carried by the input. The input gate's equation is shown below:
it = σ(Wi⋅[ht−1, xt] + bi) ……(2)
New Information:
Nt = tanh(Wn⋅[ht−1, xt] + bn) ……(3)
The new information that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t−1 and the input x at timestamp t. Tanh is the activation function here, so the value of the new information will range between −1 and 1. Information is added to or removed from the cell state at the current timestamp depending on whether Nt is positive or negative. However, Nt is not added to the cell state directly; the updated cell-state equation is as follows:
Ct = ft ⊙ Ct−1 + it ⊙ Nt ……(4)
Here Ct−1 is the cell state at the previous timestamp, Ct is the updated cell state at the current timestamp, and the other variables are the ones we calculated previously.
Output Gate:
Ot = σ(Wo⋅[ht−1, xt] + bo) ……(5)
Thanks to the sigmoid function, Ot is also guaranteed to have a value between 0 and 1. The hidden state is then determined from Ot and the updated cell state passed through tanh, as in the equation below:
Ht = Ot ⊙ tanh(Ct) ……(6)
The hidden state is thus influenced both by the long-term memory (Ct) and by the output gate. Applying softmax to the hidden state Ht gives the output of the current timestamp, and the prediction is the token with the highest score in that output.
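To make equations (1)–(6) concrete, here is a minimal NumPy sketch of a single LSTM timestep. It follows the standard formulation (omitting the peephole term Uf⋅Ct−1 of equation (1)); all weights and dimensions are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM timestep following equations (1)-(6) above,
    # in the standard form without the peephole term of eq. (1).
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])    # forget gate, eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])    # input gate, eq. (2)
    n_t = np.tanh(W["n"] @ z + b["n"])    # new information, eq. (3)
    c_t = f_t * c_prev + i_t * n_t        # cell state update, eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])    # output gate, eq. (5)
    h_t = o_t * np.tanh(c_t)              # hidden state, eq. (6)
    return h_t, c_t

# Illustrative sizes: 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 7)) for k in "fino"}
b = {k: np.zeros(4) for k in "fino"}
h_t, c_t = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)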
Figure 3.13 LSTM diagram
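The following is a minimal sketch of how an n-gram-prefix-to-next-word LSTM can be assembled in Keras. The vocabulary size, the 128-dimensional embedding, and the 512 LSTM units are assumptions (the latter two read off the 128/512 parameter columns of Table 4.1), not the exact configuration of our trained models.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000  # assumed vocabulary size

model = Sequential([
    # Map word IDs to dense vectors, then summarize the prefix with LSTM.
    Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    LSTM(512),
    # Softmax over the vocabulary scores every candidate next word.
    Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.summary()

# Training would then use the 80/20 split from Section 3.4, e.g.:
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)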
Chapter IV: Result and Discussions
4.1 Result Analysis
Our research demonstrates the significant advantages of leveraging deep learning
techniques, specifically LSTM networks, in Bengali text prediction and generation
tasks. These findings have practical implications for the development of intelligent
Bengali language applications and open avenues for further exploration in the field of
natural language processing for Bengali text.
In this section, we present the results of our study, which encompasses data
preprocessing, next-word prediction, sentence generation, and the evaluation of deep
learning model performance.
The dataset used in this paper consists of more than 200,000 words, collected from newspapers, posts on social media, and various websites. For the initial
stage of our research, we meticulously preprocessed our Bengali text dataset. This
involved tokenization, the removal of stopwords, punctuation, and special characters,
and ensuring uniform text encoding. The aim was to create a clean and structured
dataset that could be effectively utilized for training and evaluating our models.
Figure 4.1 LSTM model
In comparison to the Bigram model, which has an average accuracy of 73.15% and an
average loss of 53.36%, the trained Uni-gram model has an average accuracy of
32.17% and an average loss of 276.44%. The identical dataset used for Uni-gram and
Bi-gram yields an average accuracy of 99.84% and an average loss of 8.52% for Tri-
gram. The average accuracy for the 4-gram and 5-gram samples was 99.52% and
99.70%, respectively, with average losses of 2.04% and 1.11%. This demonstrates that as n grows, accuracy rises while the loss falls. This indicates the effectiveness of
LSTM networks in capturing sequential dependencies in Bengali text. This highlights
the limitations of traditional N-gram models in capturing long-range dependencies and
context in the Bengali language, as opposed to the more advanced LSTM architecture.
Table 4.1 Performance Analysis

Name                    | Parameters | Loss    | Accuracy
1-gram modeling         | 128, 512   | 276.44% | 32.17%
2-gram model with LSTM  | 128, 512   | 48.16%  | 73.15%
3-gram with LSTM        | 128, 512   | 8.52%   | 99.84%
4-gram with LSTM        | 128, 512   | 2.04%   | 99.52%
5-gram with LSTM        | 128, 512   | 1.11%   | 99.70%
We then explored the task of sentence generation, a more complex and creative aspect of natural language processing. The LSTM-based model excelled in generating coherent and contextually relevant Bengali sentences. Qualitatively, the sentences produced by the LSTM model were more fluent and contextually accurate than those generated by the N-gram model. Our system suggests sentences based on up to five input words; it does not support more than five words, and longer inputs are reduced to three words before being used as input.
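As an illustration, the following is a minimal greedy-decoding sketch for generating a sentence word by word with a trained model; model, tokenizer, and the sequence length are assumed to be the objects produced in Chapter III, and the function name is hypothetical.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate(model, tokenizer, seed_text, num_words=5, seq_len=4):
    # Greedy decoding: repeatedly append the highest-scoring next word.
    text = seed_text
    for _ in range(num_words):
        ids = tokenizer.texts_to_sequences([text])[0]
        ids = pad_sequences([ids[-seq_len:]], maxlen=seq_len)
        probs = model.predict(ids, verbose=0)[0]
        next_id = int(np.argmax(probs))            # highest-scoring token
        text += " " + tokenizer.index_word[next_id]
    return text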
Figure 4.2 Input and Output of LSTM
4.2 Discussion
The LSTM model outperforms the traditional N-gram model in both next-word
prediction and sentence generation tasks, showcasing the effectiveness of deep learning
approaches for Bengali text processing. Effective data pre-processing is fundamental to
the success of natural language processing tasks. Our pre-processing efforts resulted in
a high-quality dataset, which served as the foundation for subsequent tasks. This
underscores the importance of data preparation in text analysis and machine learning.
The high accuracy achieved by our LSTM model highlights its suitability for next-
word prediction tasks. It excels in capturing context and dependencies within Bengali
sentences, making it a valuable tool for applications like language modeling and text
generation.
Sentence generation is a complex task that requires the model to understand the
semantics and syntax of the language. Our successful sentence generation results with
the LSTM model showcase its potential in creative applications such as content
generation and chatbot interactions.
Our comprehensive evaluation demonstrates the superiority of the LSTM model in
capturing semantic and syntactic structures in Bengali text. Its superior performance on
various metrics highlights its effectiveness as a deep learning architecture for language-
related tasks.
Our study has successfully achieved the defined objectives. Effective data pre-
processing laid the foundation for subsequent tasks, emphasizing the importance of
data quality in NLP. The LSTM model demonstrated superior performance in both
next-word prediction and sentence-generation tasks, showcasing its versatility and
potential for various NLP applications. Comprehensive model evaluation using
multiple metrics confirms the advantages of deep learning, particularly LSTM, in
Bengali text analysis.
Chapter V: Conclusion
In our proposed model, using LSTM together with an n-gram model, we predict the next word and also generate the full sentence, and the model performs well.
5.1 Limitations
The "next word prediction" language modeling task in machine learning aims to
anticipate the most probable word or group of words that will appear following a
particular input context. This job uses linguistic structures and statistical trends to
generate precise predictions based on the provided context.The Next Word Prediction
models have a wide range of applications across many industries. As you start to send
a message, your phone can suggest the subsequent word to assist you type more
quickly. Search engines do something similar by anticipating your input and offering
search alternatives. Next word prediction helps us communicate more effectively and
swiftly by foreseeing what we could say or look for.We can also predict what would
happen next and produce sentences in Bangla. For word prediction and sentence
generation in our research, we used LSTM and n-gram sequencing. Our accuracy is
also rather good. However, there are some restrictions in our job as well. We don't
employ any data processing techniques like NER, Bag of Words, or algorithms like
BERT, or GPT, in our work. Only when we enter 5 words does our model function.
Our model doesn't function if there are more than 5 words. They will be changed into
three words for input. Finding the data for the word predictions is challenging. There
is a lot of unfamiliar language as a result.
5.2 Future Works
This paper proposes a method for predicting the next most appropriate and suitable word in the Bengali language, as well as suggesting the corresponding sentence, to contribute to the technology of word prediction systems. The proposed approach uses LSTM (Long Short-Term Memory) and N-gram models to create language models that can predict the word(s) following the input sequence provided. The paper discusses the dataset used for the experiments, which was collected from different sources in the Bengali language.
Reference
[2] IBM (2023). What is Deep Learning? | IBM. [online] www.ibm.com. Available at:
https://www.ibm.com/topics/deep-learning.
[3] IBM (2023). What is Natural Language Processing? | IBM. [online] www.ibm.com.
Available at: https://www.ibm.com/topics/natural-language-processing.
[4]GeeksforGeeks. (2023). Next Word Prediction with Deep Learning in NLP. [online]
Available at: https://www.geeksforgeeks.org/next-word-prediction-with-deep-learning-
in-nlp/ [Accessed 27 Aug. 2023].
[5] Steffen Bickel, Peter Haider and Tobias Scheffer, (2005), "Predicting Sentences using N-Gram Language Models," In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[7] Eyas El-Qawasmeh, (2004), "Word Prediction via a Clustered Optimal Binary Search Tree," International Arab Journal of Information Technology, Vol. 1, No. 1.
[8] Hisham Al-Mubaid, (2007), "A Learning-Classification Based Approach for Word Prediction," The International Arab Journal of Information Technology, Vol. 4, No. 3.
[9] Qaiser Abbas, (2014), "A Stochastic Prediction Interface for Urdu," Intelligent Systems and Applications, Vol. 7, No. 1, pp 94-100.
[10] Umrinder Pal Singh, Vishal Goyal and Anisha Rani, (2014), "Disambiguating Hindi Words Using N-Gram Smoothing Models," International Journal of Engineering Sciences, Vol. 10, Issue June, pp 26-29.
[11] Jahangir Alam, Naushad Uzzaman and Mumit Khan, (2006), "N-gram based Statistical Grammar Checker for Bangla and English," In Proceedings of the International Conference on Computer and Information Technology.
[12] Nur Hossain Khan, Gonesh Chandra Saha, Bappa Sarker and Md. Habibur Rahman, (2014), "Checking the Correctness of Bangla Words using N-Gram," International Journal of Computer Application, Vol. 89, No. 11.
[13] Nur Hossain Khan, Md. Farukuzzaman Khan, Md. Mojahidul Islam, Md. Habibur Rahman and Bappa Sarker, (2014), "Verification of Bangla Sentence Structure using N-Gram," Global Journal of Computer Science and Technology, Vol. 14, Issue 1.
[16] Avro Keyboard, an auto-completion tool for Bangla typing. Available at https://www.omicronlab.com/avro-keyboard.html.
[17] Daniel Jurafsky and James H. Martin, (2000), Speech and Language Processing, USA: Prentice-Hall, Inc.
[18] Text Cleaning Methods in NLP, Analytics Vidhya. Available at https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp.
[21] Haque, M.M., Habib, M.T. and Rahman, M.M., 2016. Automated word
prediction in bangla language using stochastic language models. arXiv
preprint arXiv:1602.07803.