
Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine Learning Journey

Sadia Afrin, Mahazabin Akter

A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and
Technology

Summer 2023
Next Word and Sentence Prediction in Bengali Text Using LSTM and N-grams: A
Machine Learning Journey

Sadia Afrin, Mahazabin Akter

A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE). The thesis has been examined and
approved,

_____________________________
Prof. Dr. Utpal Kanti Das
Professor and Chairman

_____________________________
Dr. Hasibur Rashid Chayon
Co-supervisor, Coordinator, and Associate Professor

_____________________________
Rubayea Ferdows
Assistant Professor

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology

Summer 2023
Letter of Transmittal

07 October 2022

The Chair

Thesis Defense Committee

Department of Computer Science and Engineering

IUBAT–International University of Business Agriculture and Technology

4 Embankment Drive Road, Sector 10, Uttara Model Town Dhaka 1230, Bangladesh

Subject: Letter of Transmittal.

Dear Sir,

We would like to submit our thesis titled "Next Word and Sentence Prediction in
Bengali Text Using LSTM and N-grams: A Machine Learning Journey" in compliance
with your instructions and the requirements of the thesis course. The study's
objective is to employ machine learning strategies to anticipate the subsequent
word from user input; with the help of this model, we can predict the next word.

We hope that this request will merit your approval.

Yours sincerely,

_______________ _________________
Sadia Afrin Mahazabin Akter
Id:20103025 Id:20103013

Student Declaration

This is to certify that the results presented in this thesis, titled "Next Word
and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine
Learning Journey," were produced by the study and research carried out by the
undersigned students under the direction of Rubayea Ferdows, Assistant Professor,
Department of Computer Science and Engineering, International University of
Business Agriculture and Technology.

_______________ _________________
Sadia Afrin Mahazabin Akter
Id:20103025 Id:20103013

Supervisor’s Certification

We attest that between September 22, 2022, and September 5, 2023, Sadia Afrin
(ID:20103025) and Mahazabin Akter (ID:20103013) completed their thesis work in
the Department of Computer Science and Engineering at the International University
of Business Agriculture and Technology (IUBAT). The thesis was titled “Next Word
and Sentence Prediction in Bengali Text Using LSTM and N-grams: A Machine
Learning Journey." During this time, they consulted with me regularly, as
mandated by the department. I therefore recommend that their thesis report be
approved for an oral examination.

_____________________________

Rubayea Ferdows
Assistant Professor
Department of Computer Science and Engineering

IUBAT – International University of Business Agriculture and Technology

Abstract

Due to the internet's widespread accessibility and the rapid expansion of digital
data, machine learning research has risen considerably during the last ten years.
Machine learning is also crucial for natural language processing. We conduct
online searches daily, and each time we perform a search our search engine
proposes words; it achieves this with a machine learning technique, specifically
Natural Language Processing. Bangla is one of the world's most extensively used
languages, and anticipating the next word in such a language makes our efforts
simpler and improves our quality of life. Traditionally, when we conduct online
searches, the writing process takes time. If we apply machine learning in this
case, however, it will make keyword suggestions depending on our search, and with
these recommendations we can quickly choose the desired word and search for it.
One issue remains: for the sake of word prediction, we occasionally need to write
the entire sentence. In our study, we propose a novel approach for LSTM and
N-gram-based prediction of Bengali words and sentences. We used n-gram
sequencing, tokenization, and stop-word filtering to train our model. After using
Long Short-Term Memory (LSTM) to train and test our model, we obtained high
accuracy for unigrams, bigrams, trigrams, four-gram sentences, and five-gram
sentences. Based on the input, our suggested model can also anticipate the
complete text.

Keywords: LSTM, N-gram, Tokenization, Stop words, next word prediction.

Acknowledgments

First and foremost, we want to thank the all-powerful, kind, and beautiful God for
granting us the stamina to finish the thesis report on time.
We would like to express our gratitude to the late Professor Dr. Md. Alimullah Miyan
for founding the International University of Business, Agriculture, and Technology
(IUBAT), which is a great place to study. We would like to express our gratitude and
respect to Professor Dr. Abdur Rab, our current honorary vice chancellor at the
International University of Business, Agriculture, and Technology. We would like to
express our gratitude to the late Prof. Dr. Md Abdul Haque, Chairman of the
Department of Computer Science and Engineering at the International University of
Business, Agriculture, and Technology (IUBAT), for granting us permission to study
there and enabling us to envision a promising future in the emerging field of
technology.
For their unwavering guidance and support throughout the entire BCSE program, we
are extremely grateful to Prof. Dr. Utpal Kanti Das, Chairman of the Department of
Computer Science and Engineering, and Dr. Hasibur Rashid Chayon, Co-Ordinator of
the Department of Computer Science and Engineering, at IUBAT - International
University of Business Agriculture and Technology. We also want to express our
gratitude to Rubayea Ferdows, our thesis supervisor, who gave us the chance to create
this thesis report by offering her insightful recommendations and counsel at any time
and in any circumstance. We truly appreciate all of the faculty's ongoing assistance
and support throughout our thesis period.

Table of contents

Letter of Transmittal......................................................................................................III
Student Declaration.......................................................................................................IV
Supervisor’s Certification...............................................................................................V
Abstract.........................................................................................................................VI
Acknowledgments.......................................................................................................VII
List of Tables.................................................................................................................IX
List of Figures................................................................................................................X
Chapter I: Introduction....................................................................................................1
1.1 Background and Context....................................................................................1
1.2 Problem Statements............................................................................................2
1.3 Research Objective.............................................................................................3
1.4 Research Motivation...........................................................................................3
1.5 Relevance and Importance of the Research........................................................4
Chapter II: Literature Review.........................................................................................5
Chapter III: Methodology..............................................................................................10
3.1 Dataset Collection.............................................................................................10
3.2 Dataset Pre-Processing.....................................................................................15
3.2.1 Punctuation Removal..............................................................................15
3.2.2 Stop Words Removal...............................................................................15
3.2.3 Tokenization............................................................................................11
3.3 N-Gram.............................................................................................................16
3.4 Splitting the Dataset..........................................................................................12
3.5 Deep Learning Model.......................................................................................12
3.5.1 LSTM......................................................................................................12
Chapter IV: Result and Discussions..............................................................................23
4.1 Result Analysis.................................................................23
Chapter V: Conclusion..................................................................................................28
5.1 Limitations........................................................................................................28
5.2 Future Works.....................................................................................................29
Reference.......................................................................................................................30

List of Tables

Table 2.1 Literature Review.............................................................8


Table 4.1 Performance Analysis.....................................................20

IX
List of Figures

Figure 3.1 Working Procedure................................................................10


Figure 3.2 Architecture of LSTM……………………………………….13
Figure 3.3 LSTM....................................................................................14
Figure 3.4 Timestamps of LSTM……………………………………...14
Figure 3.5 Forget Gate............................................................................14
Figure 3.6 Sigmoid Function..................................................................15
Figure 3.7 Input Gate..............................................................................15
Figure 3.8 New Information....................................................................16
Figure 3.9 Updating Current Cells..........................................................16
Figure 3.11 Output Gate.........................................................................17
Figure 3.12 Hidden State........................................................................17
Figure 3.13 Output of LSTM................................................................. 17
Figure 3.14 LSTM diagram....................................................................18
Figure 4.2 Input and Output of LSTM...................................................20

Chapter I: Introduction

1.1 Background and Context

A branch of artificial intelligence called "machine learning" uses statistical
methods to enable computers to learn and form judgments without having to be
explicitly programmed. It is based on the idea that computers are capable of
independently coming to conclusions, identifying patterns in data, and learning
from them.

As a subfield of artificial intelligence, machine learning aims to make machines
behave and make decisions more like humans by enabling them to learn and create
their own programming, without explicit programming or significant human
engagement. The learning process is automated and improves based on the
experience the machines gain while performing the task.

The computers are given high-quality data, and a variety of methodologies are
employed to create ML models to train the computers on this data. Depending on the
type of data available and the specifics of the automated action needed, a certain
method will be used. [1]

Deep learning is a subclass of machine learning built on neural networks, that
is, networks with three or more layers. Although they cannot replicate the human
brain, these neural networks attempt to emulate its functionality and to "learn"
from massive amounts of data. Even a neural network with a single layer can
produce approximations, but additional hidden layers help improve and optimize
accuracy.

Deep learning is the cornerstone of many artificial intelligence (AI) systems and
services that enhance automation by performing mental and physical tasks without
requiring human input. Deep learning technology is the driving force behind both
everyday products and services (like digital assistants, voice-activated TV
remote controls, and credit card fraud detection) and emerging technologies (like
self-driving cars).[2] In computer science, the field of "artificial
intelligence" (AI) known as "natural language processing" (NLP) is more narrowly
focused on enabling machines to comprehend spoken and written words similarly to
how people do.

NLP combines statistical, machine learning, and deep learning models with
computational linguistics, which uses rules to represent human language. These
technologies enable computers to fully "understand" what is said or written, including
the motivations and feelings of the speaker or writer, and to translate human language
into text or audio data.[3]

The main objective of the Natural Language Processing (NLP) task "next word
prediction" is to identify the word that is most likely to come after a given string of
words. Numerous applications, such as text auto-completion, speech recognition, and
machine translation, depend on this predictive capability. Due to the remarkable
success of deep learning algorithms in a variety of language-related tasks, such as
next-word prediction, natural language processing (NLP) has experienced a
revolution.[4]

Bangla is among the most widely spoken languages in existence. Using deep
learning, we can also guess the following word in Bangla and generate a phrase.
In this work, we present a novel method for anticipating the subsequent Bengali
word and formulating a sentence.

1.2 Problem Statements

The objective of this work is to preprocess our dataset using different
preprocessing methods, including tokenization and punctuation removal, and then
to train and test a model. We will explore the correlation between the attributes
of the dataset in order to ascertain how they relate to one another and how the
next-word forecasts evolve. With the help of automatic next-word prediction and
sentence construction, Bangladeshi users will be able to use the internet
effectively and conduct more focused research.

1.3 Research Objective

 Pre-process our dataset
 Predict the next word
 Generate a sentence
 Evaluate deep learning model performance

1.4 Research Motivation

Machine learning is without a doubt one of the most intriguing fields of computer
science. The task of learning from data is achieved with the right machine
inputs. It is crucial to comprehend how machine learning works and, subsequently,
how it is utilized over time. The first stage of machine learning involves
choosing the algorithm and feeding in training data. Whether a subject is more or
less well-known, good training data helps develop the most effective machine
learning algorithm, and the type of training data given determines how well the
algorithm can connect concepts. New data is then fed to the machine learning
algorithm to see whether it is operating correctly, and the outcome and
prediction are cross-checked. If the prediction and results are inconsistent, the
algorithm is trained repeatedly until the desired result is attained. This makes
it possible for the machine learning algorithm to continuously learn on its own
and arrive at the optimal solution, progressively increasing accuracy over time.
Machine learning is significant because it enables companies to monitor patterns
in operational behavior and customer behavior trends, as well as to support the
introduction of new products. Today's most prosperous businesses, including
Facebook, Google, and Uber, give machine learning top priority in their
day-to-day operations, and it has become a key point of differentiation for many
firms. With the aid of machine learning, we can develop an algorithm that uses
data from statistical analysis to predict outcomes or build classifications, and
we can train a model to improve its performance. In this work we explore the
underlying concepts of machine learning and how it functions: using deep
learning, we can effectively anticipate the next word and also generate the
phrase.

1.5 Relevance and Importance of the Research

Machine learning is one of the most exciting technologies ever created. It is the
field of study that enables computers to learn without being explicitly
instructed. Artificial intelligence systems strive to complete difficult tasks in
a manner akin to how people approach challenges. In the technique we propose, we
attempt to anticipate the next word and to construct a phrase in Bangla. Bangla
is one of the richest languages in the world. Many people conduct daily online
searches, yet much of the time they are unable to obtain precise information.
Deep learning allows us to predict the following word in a sentence and generate
the sentence, which makes it much simpler to do online searches, create reports,
and perform other tasks.

Chapter II: Literature Review

One of the richest languages in the world is Bangla: more than 200 million people
speak it and conduct their everyday business in it. To help Bengalis use the
internet more effectively and produce reports and other documents more quickly,
in this research we forecast the next word and also construct a sentence in the
Bangla language. Everyone will benefit greatly from it. In this study, we put
forward a novel technique for predicting the next word in Bengali and creating a
phrase. In preparation, we studied earlier work that motivated this research;
some of the papers are similar to ours, and we discuss several of them here.

Paper 1: Predicting Sentences using N-Gram Language Models


A method for predicting sentences using N-gram language models is proposed in this
research. In order to improve the ability for predicting sentences, the research
generalizes hierarchical word sequence models. By inserting several assumptions
within the framework, this is accomplished. The thesis proposes document analysis as
opposed to sentence-by-sentence analysis. This strategy is advantageous since it gives
the words a wider context, which increases the model's accuracy. The thesis also
presents a strategy for predicting various phrase forms, which represents a significant
development in the discipline of natural language processing (NLP). The thesis
outlines a number of possible uses for this approach, such as determining the order of
article blocks in English e-newspapers and determining the tone of news stories.

Paper 2: Application of Artificial Intelligence Methods in a Word-Prediction Aid


The purpose of the thesis paper is to use artificial intelligence to develop a word
prediction tool. A method for predicting the next word in a sentence using N-gram
language models is proposed in the research. The importance of next-word prediction
as a Natural Language Processing (NLP) task is addressed in the paper, which
offers experimental data to demonstrate the efficacy of the suggested approach.
The paper is organized as follows: the word prediction problem is introduced in
Section 1, the N-gram language model is explained in the second section, the suggested

methodology is discussed in the third section, the experimental setup is described in the
fourth section, and the experimental results are shown in the final section. The
proposed method can be utilized as a word prediction assist in a variety of applications,
the paper says.

Paper 3: Word Prediction via a Clustered Optimal Binary Search Tree


The paper presents an innovative approach to word prediction, specifically focusing on
optimizing the prediction process using a clustered optimal binary search tree (BST).
Word prediction is a valuable technology for enhancing text input efficiency, and this
research seeks to improve its accuracy and speed. This applies to each thesis paper's
design and the word prediction statistical methodology. The suggested approach for
word prediction entails creating an OBST. To do this, a cluster of computers is set up to
create an Optimal Binary Search Tree (OBST). Then, a trigram, unigram, and bigram
statistical model of words is produced using the OBST. The keystroke saving is
enhanced by the statistical model. It was discovered that the suggested strategy, which
includes an OBST, increases word prediction accuracy in comparison to the
conventional approaches. The study offers a thorough method for word prediction that
includes statistical models and the creation of an OBST. The suggested strategy is
discovered to be more successful than the conventional ways.

Paper 4: A Learning-Classification-Based Approach for Word Prediction


The paper introduces an innovative methodology for word prediction in natural
language processing. Word prediction is a valuable tool for enhancing text input
efficiency, and this research presents a novel approach to improving its accuracy. The
suggested method is based on machine learning algorithms and context features. The
proposed model was developed by utilizing a method that consists of three stages:
training, optimization, and inferences. In the training stage, a sizable corpus of text data
is collected and preprocessed to extract features including word co-occurrence, part-of-
speech tags, and word frequency. The best machine learning algorithm is chosen and its
parameters are adjusted to produce the best performance during the optimization stage.
In the inference stage, the following word will be predicted using the trained model and
its context. The research comes to the conclusion that the suggested method performs
well at word prediction and can be applied to a variety of NLP tasks. This research

contributes to the field of word prediction by introducing a learning and classification-
based approach, which leverages machine learning to provide more contextually
relevant and accurate word predictions, ultimately improving the efficiency and quality
of text input in various applications.

Paper 5: A Stochastic Prediction Interface for Urdu


A stochastic prediction interface for the Urdu language is presented in the thesis article.
The interface is evaluated in the research using the KS formula, and outcomes are
presented. The paper also provides a framework for Bengali word prediction using
stochastic language models. The term "stochastic" suggests that the interface leverages
statistical language models, which capture the likelihood of word sequences in Urdu
text. These models allow the interface to predict the next word or phrase in a sentence
based on the context provided by the user's input. The purpose of this study is to
describe the construction of a hierarchically annotated corpus of the Urdu language and
the evaluation of this corpus using semi-semantic annotation of part of speech. The
paper's methodology uses a stochastic language model to generate a term from a
sentence fragment. Five stochastic language models—the unigram, bigram, trigram,
backoff, and deleted interpolation—are used in this study. This research contributes to
the development of text input systems for Urdu by introducing a stochastic prediction
interface that enhances the user experience and text generation efficiency in this
language. The paper lays down a foundation for text prediction of an unknown word in
the Bangla language.

Paper 6 : N-gram based Statistical Grammar Checker for Bangla and English
The thesis paper describes a statistical grammar checker that analyzes words based on
n-grams and POS tags to evaluate whether a sentence is correct in grammar. N-grams,
which are sequential patterns of words or characters, are employed to analyze the
linguistic structure of sentences in both languages. The paper explores how statistical
information from large corpora in both languages can be harnessed to identify and
correct grammatical errors, including issues related to word order, verb agreement, and
sentence structure. By leveraging N-gram models, the system provides automated
suggestions for grammatical improvements, making it a valuable tool for writers and
learners of both Bangla and English. The research contributes to the development of

effective and accessible grammar-checking tools for two languages, addressing the
specific linguistic nuances and challenges of each.

Paper 7: Verification of Bangla Sentence Structure using N-Gram


The paper delves into the application of N-gram language models to the task of
verifying the grammatical structure of Bangla sentences. The paper explores how
statistical information derived from large corpora of Bangla text can be leveraged to
verify the correctness of sentence structures, including aspects like word order, tense
agreement, and phrase composition. It specifically trains a system using a corpus of
one million experimental word tokens. This corpus is used to validate the system's
performance and train it. The approach has been successful in determining the test
sentences' structural validity. With a 93% success rate, it has attained a high rate of
accuracy. The method can be used in a variety of fields where precise phrase
identification is essential, including spelling and syntactic verification, speech
recognition, machine translation, character recognition, and others.

Paper 8: Automated word prediction in Bangla language using stochastic language models
The paper explores the development and application of stochastic language models to
the task of word prediction in the Bengali (Bangla) language. Stochastic language
models are statistical models that capture the probability of word sequences occurring
in a language. The thesis paper presents a methodology for automated word
prediction (AWP) in Bangla using stochastic language models (SLMs). The authors
suggest using different N-gram language models (unigram, bigram,
trigram, deleted interpolation, and backoff models) to forecast the right word to employ
in a sentence. This method aids in speeding up prediction time and predicting Bangla
words more accurately. Combining SLMs and other N-gram language models may
efficiently predict Bangla words. Additional language models and hybrid models can
improve forecast accuracy and efficiency. Deep Learning models can also be an
effective AWP technique, especially for complex Bangla words. These results can be a
foundation for future study or AWP implementation in Bangla. The paper outlines how
these stochastic language models are trained on large Bengali text corpora to learn the
patterns and probabilities of word sequences. The resulting models can then provide
contextually relevant word suggestions based on the input provided by users. This

research contributes to the development of automated text input systems in Bengali,
improving the user experience and text generation efficiency in this language.

Paper 9: Word completion and sequence prediction in Bangla language using trie
and a hybrid approach of sequential LSTM and N-gram
The research suggests a hybrid method that combines sequential LSTM and N-gram
models to anticipate and finish a Bangla phrase. To increase the accuracy of the
prediction procedure, the study additionally used a trie data structure to hold the Bangla
words and their frequencies. When the suggested model was tested on a corpus of
Bangla, the findings revealed that it performed better than other models already in use
regarding accuracy and effectiveness. The work aims to create a Bangla language
model that is more useful and accurate and can be applied to various tasks, including
text-to-speech conversion, machine translation, and speech recognition.

Paper 10: Bangla Word Prediction and Sentence Completion Using GRU: An
Extended Version of RNN on N-gram Language Mode
The paper aims to solve the problem of predicting Bangla's next most appropriate and
suitable word. It also contributes to the technology of word prediction systems. The
proposed method is based on RNN with GRU (Gated Recurrent Unit) as an extension
of RNN. It uses a sequence-to-sequence architecture and employs a Bangla N-gram
Language Model. In this work, they suggested a way to predict the following most
acceptable and suitable words in Bangla, and it can also recommend the related
sentence to help advance the field of word prediction systems. The method predicts the
next most appropriate and suitable word in Bangla. It first tokenizes the sentence, then
predicts the next word using the GRU model. The prediction is made considering
the context of the input sequence.

Table 2.1 Literature Review

1. Predicting Sentences using N-Gram Language Models
   Overview: They create an evaluation metric and modify N-gram language models to tackle the issue of word prediction from an initial text fragment, empirically investigating the predictability of emails from call centers, private emails, weather forecasts, and cooking instructions using an instance-based technique as a baseline.
   Method applied: N-gram language models, baseline models, evaluation metrics.
   Limitations: Limited contextual understanding; fixed context length.

2. Application of Artificial Intelligence Methods in a Word-Prediction Aid
   Overview: Explores the development and application of artificial intelligence (AI) methods in the context of a word-prediction aid. The study investigates various AI techniques and models to enhance word prediction, with a focus on improving user experience and accessibility, and encompasses the design, implementation, evaluation, and analysis of AI-based word prediction aids, addressing their potential benefits and limitations.
   Method applied: Chart bottom-up AI algorithm, recurrent neural networks (RNNs).
   Limitations: No limitations discussed in the paper.

3. Word Prediction via a Clustered Optimal Binary Search Tree
   Overview: Creates an Optimal Binary Search Tree (OBST) for the statistical approach to word prediction. Additional connections are included in the OBST so that the bigram and trigram of the language can be provided; the authors also recommend incorporating other improvements to obtain the best word prediction performance.
   Method applied: Optimal Binary Search Tree.
   Limitations: No neural network algorithm has been applied.

4. A Learning-Classification Based Approach for Word Prediction
   Overview: Evaluates a word prediction system based on machine learning and classification techniques. The proposed approach focuses on training models to classify and rank potential next words in a given context, investigating the effectiveness and applicability of this learning-classification based approach and shedding light on its potential benefits and limitations.
   Method applied: Naive Bayes with latent information; SVM; RNN.
   Limitations: Does not predict the full sentence; out-of-vocabulary words.

5. A Stochastic Prediction Interface for Urdu
   Overview: Develops a stochastic prediction interface for Urdu, a morphologically rich language (MRL) that requires a number of computational tools for natural language processing (NLP). The paper highlights the usefulness of stochastic models in predicting the next word in a sentence and improving the efficiency of text input.
   Method applied: Stochastic prediction algorithm; N-gram models (unigram, bigram, and trigram).
   Limitations: The approach does not work if the input is more than 3 words.

6. N-gram based Statistical Grammar Checker for Bangla and English
   Overview: Presents a comprehensive study on the development and evaluation of an N-gram-based statistical grammar checker for two distinct languages, Bangla and English. The research explores the utilization of N-gram models to build a grammar-checking system capable of identifying and suggesting corrections for grammatical errors in both languages, involving the creation of N-gram language models, the implementation of error detection algorithms, and the evaluation of the system's performance.
   Method applied: Parts-of-speech (POS) tagging approach, N-gram.
   Limitations: Rule-based and hybrid checking systems could be used.

7. Verification of Bangla Sentence Structure using N-Gram
   Overview: Focuses on using N-gram language models to verify the sentence structure of the Bangla language, and notes the importance of developing language-specific models for sentence structure verification due to the differences in grammar rules and structures between languages. Describes a system for sentence structure verification based on N-gram models.
   Method applied: N-gram model.
   Limitations: A combined rule-based and hybrid system could be used.

8. Automated word prediction in Bangla language using stochastic language models
   Overview: Proposes a method for word prediction in the Bangla language using stochastic N-gram language models, aiming to reduce misspelling by predicting the correct word in a sentence. It also proposes a hybrid method for real-word error detection and correction in Bangla that combines two different approaches such as N-gram, and evaluates the performance of different stochastic language models for word prediction in Bangla.
   Method applied: Stochastic N-gram language models; hybrid method.
   Limitations: There is a lack of word prediction tools specifically tailored for the Bangla language.

9. Word completion and sequence prediction in Bangla language using trie and a hybrid approach of sequential LSTM and N-gram
   Overview: Proposes a hybrid approach of trie, sequential LSTM, and N-gram models to predict the next word and complete words in the Bangla language. The paper notes the importance of developing language-specific models for word completion and sequence prediction due to the differences in grammar rules and structures between languages, and describes the implementation of the trie data structure for word completion.
   Method applied: Hybrid sequential LSTM and N-gram models.
   Limitations: Limited dataset size; complex grammar and morphology.

10. Bangla Word Prediction and Sentence Completion Using GRU: An Extended Version of RNN on N-gram Language Model
    Overview: Explores advanced techniques for Bangla (Bengali) word prediction and sentence completion using Gated Recurrent Units (GRU), an extended version of recurrent neural networks (RNN), in conjunction with N-gram language models, focusing on enhancing text prediction and sentence completion accuracy in Bengali and providing users with contextually relevant word suggestions and sentence completions.
    Method applied: GRU, RNN.
    Limitations: The prediction is made in the context of the input sequence; it may not be suitable for general use where the context may vary.

Chapter III: Methodology

Our methodology achieves the objectives of preprocessing the dataset, predicting
the next word, generating sentences, and evaluating the performance of our deep
learning model for next word and sentence prediction in Bengali text using LSTM
and N-grams. This systematic approach yields an effective and context-aware text
prediction system tailored for the Bengali language. In our method, we tokenize
our dataset, preprocess it by removing stop words and punctuation, perform N-gram
modeling, and then train our model using LSTM. Fig. 3.1 displays the whole
operational flow of our proposed system for Bangla next word prediction from the
dataset; the full procedure is described step by step in this chapter. We use our
deep learning algorithm to train and test for Bangla next word prediction.

Figure 3.1 Working Procedure

3.1 Dataset Collection

We gathered our data from articles in newspapers such as Prothom Alo, posts on
social media, and various websites.

Table 3.1 Dataset Details

3.2 Dataset Pre-Processing

The performance of a deep learning model depends on how effectively its input is
processed. We apply several steps in the data processing for our research. Here
are the steps:

3.2.1 Punctuation Removal

Eliminating punctuation makes it possible to treat all texts equally: for
instance, once punctuation is removed, the strings "data" and "data!" are treated
the same. Because contraction words lose their meaning when punctuation is
removed, we must be careful not to damage the content; for example, a word like
"don't" becomes "dont". We remove all punctuation in our Bangla word prediction
pipeline.[17]

3.2.2 Stop Words Removal

When examining natural language, stop words are typically disregarded. These
words, which include articles, prepositions, pronouns, conjunctions, and so on,
are among the most common in all languages, although they don't offer much
context for the text. Stop words in English include "the," "a," "an," "so," and
"what." Every human language has numerous stop words. By removing this low-level
information, the text can be made more focused on the key information; in other
words, the model we train to achieve our goal is unaffected by their absence.
Since fewer tokens are needed for training, the removal of stop words reduces the
size of the dataset and, consequently, the training time. Stop words are likewise
removed in our analysis.[18]
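A minimal sketch of this filtering step is shown below; the handful of Bangla
stop words listed is a hypothetical sample for illustration, since the full
stop-word list we used is not reproduced here.

```python
# Hypothetical sample of common Bangla stop words (conjunctions, particles, etc.).
BANGLA_STOP_WORDS = {"এবং", "বা", "কিন্তু", "যে", "এই", "ও"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in BANGLA_STOP_WORDS]

print(remove_stop_words(["আমি", "এবং", "তুমি", "স্কুলে", "যাই"]))
# -> ['আমি', 'তুমি', 'স্কুলে', 'যাই']
```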

3.2.3 Tokenization

Tokenization is the process of dividing raw text into manageable pieces, that is,
tokens such as words and phrases. These tokens help with context comprehension
and NLP model building. By examining the word order in the text, tokenization
aids readers in deciphering the text's meaning. "It is raining," for instance,
can be tokenized into "It," "is," and "raining." A variety of programs and
frameworks can be used for tokenization; NLTK, Gensim, and Keras are a few
libraries that can complete the task. Phrases or whole sentences can be
tokenized: word tokenization is the process of dividing text into words, whereas
sentence tokenization does the same for sentences.[19]
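Since Keras is one of the libraries mentioned above, here is a minimal sketch of
word tokenization with its Tokenizer on a two-sentence toy corpus; the sentences
are placeholders, not our dataset.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["আমি ভাত খাই", "আমি স্কুলে যাই"]  # toy corpus for illustration

tokenizer = Tokenizer()          # builds a word -> integer index mapping
tokenizer.fit_on_texts(corpus)

print(tokenizer.word_index)      # e.g. {'আমি': 1, 'ভাত': 2, 'খাই': 3, ...}
print(tokenizer.texts_to_sequences(corpus))  # e.g. [[1, 2, 3], [1, 4, 5]]
```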

3.3 N-Gram

N-grams are contiguous sequences of items, such as words or characters, that
appear in a document. The use of text data for NLP (Natural Language Processing)
activities depends on them. Spelling checks, text mining, machine translation,
language models, and semantic features are just a few of their various uses. We
choose N-gram based word prediction because N-grams offer a natural way to
develop sentence completion systems: these statistical techniques can more
accurately predict the words that will be added to a phrase. N-gram data is
divided into a training set and a test set in order to calculate probabilities
using statistical models, and we meticulously create and analyze the training
corpus in order to determine the probabilities of a test sentence. Take a corpus
of 1,000,000 Bangla words that contains the word "abstract" 500 times; the
unigram probability of "abstract" is then 500/1,000,000 = 0.0005.
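The sketch below illustrates both ideas under these definitions: sliding n-gram
windows over a token-id sequence (the last id in each window is the prediction
target) and the unigram relative-frequency calculation. The token ids are made up
for the example.

```python
from collections import Counter

def ngram_windows(token_ids, n):
    """All contiguous n-grams: the first n-1 ids are context, the last is the target."""
    return [token_ids[i:i + n] for i in range(len(token_ids) - n + 1)]

ids = [4, 9, 2, 7, 7, 3]  # made-up token ids
print(ngram_windows(ids, 3))  # [[4, 9, 2], [9, 2, 7], [2, 7, 7], [7, 7, 3]]

# Unigram probability = count / corpus size, as in 500 / 1,000,000 = 0.0005.
counts = Counter(ids)
print(counts[7] / len(ids))  # relative frequency of token 7 -> 0.333...
```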

3.4 Splitting the Dataset

We split our dataset into 80% for training and 20% for testing. After splitting
the dataset, we fit our model for training and testing.
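A minimal sketch of this 80/20 split with scikit-learn follows; the placeholder
arrays stand in for the real context windows and next-word labels produced by the
preprocessing stage.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the (n-1)-word context windows (X) and next-word labels (y).
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80% training, 20% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```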

3.5 Deep Learning Model

For our research, we have used LSTM to predict the next Bangla word.

3.5.1 LSTM

The Long Short-Term Memory (LSTM) recurrent neural network (RNN) architecture is
frequently used in deep learning. It is well suited to applications requiring
sequence prediction since it is excellent at capturing long-term dependencies.
Because it has feedback connections, unlike regular feedforward neural networks,
an LSTM can handle entire data sequences rather than single data points. This
makes it particularly adept at identifying and forecasting patterns in sequential
data, such as time series, text, and voice.

The LSTM resolves the vanishing gradient problem that RNNs have, so in this
section we examine how it does so by studying the LSTM's architectural layout. At
a high level, an LSTM cell behaves very much like an RNN cell. The LSTM network's
internal operations are depicted here: as seen in the graphic below, the LSTM
design is split into three sections, each of which serves a particular purpose.
The first section decides whether the information from the previous timestamp
should be remembered. The second section attempts to learn new information from
the current input. The third section then passes the updated information from the
current timestamp on to the next timestamp. One such LSTM cycle is regarded as a
single time step.

These three parts of an LSTM unit are known as gates. They control how data
enters and exits the memory cell (the LSTM cell). The first, second, and third
gates are called the forget gate, input gate, and output gate, respectively. An
LSTM unit composed of these three gates and a memory cell can be compared to a
neuron in a normal feedforward neural network, with each unit maintaining a
hidden state and a current cell state.

Figure 3.2 Architecture of LSTM

An LSTM has a hidden state, similar to a conventional RNN, with H(t−1) standing
for the hidden state of the preceding timestamp and H(t) for the present
timestamp. The cell states C(t−1) for the past and C(t) for the present timestamp
represent the LSTM's cell state. The cell state is referred to as long-term
memory in this context, whereas the hidden state acts as short-term memory. See
the illustration below.

Figure 3.3 LSTM

It is important to remember that the cell state carries information along with it
across all timestamps.

Figure 3.4 Timestamps of LSTM

Forget Gate:

The initial decision made in a cell of the LSTM neural network is whether to keep or
discard the data from the previous time step.
The forget gate equation is shown below.

ft = σ(Wf ⋅ ht−1 + Uf ⋅ xt + bf) …………………………..(1)

Let's attempt to comprehend this equation. Here xt is the input at the current
timestamp and Uf is the weight matrix attached to it; ht−1 is the hidden state of
the previous timestamp and Wf is the weight matrix attached to it; bf is a bias
term. A sigmoid function is then applied, so ft becomes a number between 0 and 1.
This ft is afterwards multiplied element-wise by the cell state of the previous
timestamp, as shown below.

Input Gate:

Utilizing the input gate, the fresh data brought in by the input is assessed for
relevancy. Below is a depiction of the input gate's equation.

it = σ(Wi ⋅ ht−1 + Ui ⋅ xt + bi) …………………………………………….(2)

Here xt is the input at timestamp t and Ui is the weight matrix attached to it;
ht−1 is the hidden state at the previous timestamp and Wi is the weight matrix
attached to it; bi is a bias term. We again apply the sigmoid function, so it
takes a value between 0 and 1 at timestamp t.

New Information:

Nt = tanh(Wc ⋅ ht−1 + Uc ⋅ xt + bc) …………………….(3)

The new information that needs to be passed to the cell state is a function of
the hidden state at the previous timestamp t−1 and the input x at timestamp t.
Tanh is the activation function here, so the value of the new information Nt
ranges between −1 and 1. Information is either added to or removed from the cell
state at the current timestamp depending on whether Nt is positive or negative.
However, Nt is not added to the cell state directly; the update equation is as
follows:

Ct = ft ⊙ Ct−1 + it ⊙ Nt ……………………………….(4)

Here Ct−1 represents the cell state at the previous timestamp, ⊙ denotes
element-wise multiplication, and the other variables are the ones we previously
calculated.

Output Gate:

Ot = σ(Wo ⋅ ht−1 + Uo ⋅ xt + bo) ……………………………………………….(5)

Thanks to the sigmoid function, Ot is likewise guaranteed to have a value between
0 and 1. The hidden state is then determined from Ot and the updated cell state
using tanh, as in the equation below.

Ht = Ot ⊙ tanh(Ct) …………………………………………………..(6)

Both the long-term memory (Ct) and the output gate affect the hidden state. To
get the output of the current timestamp, simply apply SoftMax to the hidden state
Ht.

Figure 3.12 Output of LSTM

The prediction is the token with the highest score in the output.
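To make the six equations concrete, here is a minimal NumPy sketch of a single
LSTM time step; the dimensions and random weights are assumptions for
illustration only and do not correspond to our trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following equations (1)-(6).

    W, U, b are dicts keyed by gate name: W acts on the previous hidden
    state h_{t-1}, U acts on the current input x_t."""
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])   # (1) forget gate
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])   # (2) input gate
    n_t = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])   # (3) new information
    c_t = f_t * c_prev + i_t * n_t                           # (4) cell state update
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])   # (5) output gate
    h_t = o_t * np.tanh(c_t)                                 # (6) hidden state
    return h_t, c_t

# Toy dimensions and random weights, purely for illustration.
hidden, n_in = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(hidden, hidden)) for g in "fico"}
U = {g: rng.normal(size=(hidden, n_in)) for g in "fico"}
b = {g: np.zeros(hidden) for g in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h)  # next hidden state (short-term memory)
```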

Figure 3.13 LSTM diagram

Chapter IV: Result and Discussions


In this section, we present the outcomes of the next-word prediction system we
developed and evaluate the effectiveness of our model.

4.1 Result Analysis

Our research demonstrates the significant advantages of leveraging deep learning
techniques, specifically LSTM networks, in Bengali text prediction and generation
tasks. These findings have practical implications for the development of intelligent
Bengali language applications and open avenues for further exploration in the field of
natural language processing for Bengali text.

In this section, we present the results of our study, which encompasses data
preprocessing, next-word prediction, sentence generation, and the evaluation of deep
learning model performance.

The dataset used in this paper consists of more than 200,000 words. These data are
collected from newspapers, posts on social media, and various websites. For the initial
stage of our research, we meticulously preprocessed our Bengali text dataset. This
involved tokenization, the removal of stopwords, punctuation, and special characters,
and ensuring uniform text encoding. The aim was to create a clean and structured
dataset that could be effectively utilized for training and evaluating our models.

We conducted extensive experiments to evaluate the performance of our models in
predicting the next word in Bengali sentences. Experiments must be conducted, and
the findings thoroughly analyzed, to validate the proposed technique. We
therefore tested our suggested method on a corpus dataset, training five distinct
models with equivalent topologies for 128 epochs with a batch size of 512.
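For reference, the following is a minimal Keras sketch of the kind of next-word
model evaluated here; vocab_size, the embedding width, and the number of LSTM
units are illustrative placeholders rather than the exact configuration used in
this thesis, while the epoch count and batch size are taken from the text above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, context_len = 20000, 4   # assumed sizes; context_len = n - 1 for an n-gram model

model = Sequential([
    Embedding(vocab_size, 100, input_length=context_len),  # word ids -> dense vectors
    LSTM(128),                                             # sequence encoder
    Dense(vocab_size, activation="softmax"),               # score for every vocabulary word
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

# Training setup reported in the text (X_train/y_train come from the
# n-gram sequencing and 80/20 split described in Chapter III):
# model.fit(X_train, y_train, epochs=128, batch_size=512,
#           validation_data=(X_test, y_test))
```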

Figure 4.1 LSTM model

In comparison to the bigram model, which has an average accuracy of 73.15% and an
average loss of 48.16%, the trained unigram model has an average accuracy of
32.17% and an average loss of 276.44%. On the identical dataset used for the
unigram and bigram models, the trigram model yields an average accuracy of 99.84%
and an average loss of 8.52%. The average accuracy for the 4-gram and 5-gram
models was 99.52% and 99.70%, respectively, with average losses of 2.04% and
1.11%. This demonstrates that as n grows, accuracy rises and the loss falls,
which indicates the effectiveness of LSTM networks in capturing sequential
dependencies in Bengali text. It also highlights the limitations of traditional
N-gram models in capturing long-range dependencies and context in the Bengali
language, as opposed to the more advanced LSTM architecture.

Table 4.1 Performance Analysis

Model Name        Epochs   Batches   Average Loss   Accuracy
1-gram model      128      512       276.44%        32.17%
2-gram with LSTM  128      512       48.16%         73.15%
3-gram with LSTM  128      512       8.52%          99.84%
4-gram with LSTM  128      512       2.04%          99.52%
5-gram with LSTM  128      512       1.11%          99.70%

We also explored the task of sentence generation, a more complex and creative
aspect of natural language processing. The LSTM-based model excelled in
generating coherent and contextually relevant Bengali sentences. Qualitatively,
the sentences produced by the LSTM model were more fluent and contextually
accurate than those generated by the N-gram model. Our system suggests sentences
based on five-word inputs; it does not support inputs of more than five words,
and longer inputs are truncated to three words before prediction.

Figure 4.2 Input and Output of LSTM

To comprehensively assess the performance of our deep learning models, we employed


various evaluation metrics. The LSTM model consistently outperformed the N-gram
model across all these metrics. This emphasizes the superior ability of LSTM networks
to capture semantic and syntactic structures in Bengali text.

4.2 Discussion:
The LSTM model outperforms the traditional N-gram model in both next-word
prediction and sentence generation tasks, showcasing the effectiveness of deep learning
approaches for Bengali text processing. Effective data pre-processing is fundamental to
the success of natural language processing tasks. Our pre-processing efforts resulted in
a high-quality dataset, which served as the foundation for subsequent tasks. This
underscores the importance of data preparation in text analysis and machine learning.
The high accuracy achieved by our LSTM model highlights its suitability for next-
word prediction tasks. It excels in capturing context and dependencies within Bengali
sentences, making it a valuable tool for applications like language modeling and text
generation.

Sentence generation is a complex task that requires the model to understand the
semantics and syntax of the language. Our successful sentence generation results with
the LSTM model showcase its potential in creative applications such as content
generation and chatbot interactions.

Our comprehensive evaluation demonstrates the superiority of the LSTM model in
capturing semantic and syntactic structures in Bengali text. Its superior performance on
various metrics highlights its effectiveness as a deep learning architecture for language-
related tasks.

Our study has successfully achieved the defined objectives. Effective data pre-
processing laid the foundation for subsequent tasks, emphasizing the importance of
data quality in NLP. The LSTM model demonstrated superior performance in both
next-word prediction and sentence-generation tasks, showcasing its versatility and
potential for various NLP applications. Comprehensive model evaluation using
multiple metrics confirms the advantages of deep learning, particularly LSTM, in
Bengali text analysis.

Chapter V: Conclusion

In our proposed model, using LSTM together with an n-gram model, we predict the
next word and also generate the full sentence, and it performs better than
traditional approaches.

5.1 Limitations

The "next word prediction" language modeling task in machine learning aims to
anticipate the most probable word or group of words that will appear following a
particular input context. This job uses linguistic structures and statistical trends to
generate precise predictions based on the provided context.The Next Word Prediction
models have a wide range of applications across many industries. As you start to send
a message, your phone can suggest the subsequent word to assist you type more
quickly. Search engines do something similar by anticipating your input and offering
search alternatives. Next word prediction helps us communicate more effectively and
swiftly by foreseeing what we could say or look for.We can also predict what would
happen next and produce sentences in Bangla. For word prediction and sentence
generation in our research, we used LSTM and n-gram sequencing. Our accuracy is
also rather good. However, there are some restrictions in our job as well. We don't
employ any data processing techniques like NER, Bag of Words, or algorithms like
BERT, or GPT, in our work. Only when we enter 5 words does our model function.
Our model doesn't function if there are more than 5 words. They will be changed into
three words for input. Finding the data for the word predictions is challenging. There
is a lot of unfamiliar language as a result.

5.2 Future Works

This paper proposes a method for predicting the next most appropriate and
suitable word in the Bengali language, as well as suggesting the corresponding
sentence, to contribute to the technology of word prediction systems. The
proposed approach uses LSTM (Long Short-Term Memory) and N-gram models to create
language models that can predict the word(s) from the input sequence provided.
The paper discusses the dataset used for the experiments, which was collected
from different sources in the Bengali language.

The objective of the language modeling problem known as "next word prediction" is
to predict the word or group of words most likely to appear after a specific
input context, using language patterns and statistical trends to make accurate
predictions. Next word prediction models have applications in a wide range of
disciplines: to help you type more quickly, your phone can suggest the next word
as you begin a message, and search engines similarly offer search choices in
anticipation of your input. Thanks to next-word prediction, we may communicate
more swiftly and efficiently by thinking ahead to what we might say or seek. We
can also construct Bangla sentences and predict what comes next. To produce
sentences and predict words, we used LSTM with n-gram sequencing in this study,
and our accuracy is fairly good. But there are some restrictions on what we have
done so far. We do not employ data processing techniques like NER or Bag of
Words, or algorithms like BERT or GPT, and our model functions only for five-word
inputs: with more than five words it does not function, and such inputs are
truncated to three words. Finding data for word prediction is challenging, so the
model encounters a lot of unfamiliar language. In future work, we will attempt to
apply the n-gram model according to the user's input preferences. Additionally,
we will employ data processing methods like NER and Bag of Words, and algorithms
like BERT and GPT, to obtain more accurate results. In an effort to construct
sentences that adhere to Bangla grammatical norms and conventions, we will try to
predict the next word and check the grammar and spelling.

Reference

[1] Advani, V. (2020). What is Machine Learning? How Machine Learning Works and future of it? [online] GreatLearning. Available at: https://www.mygreatlearning.com/blog/what-is-machine-learning/.

[2] IBM (2023). What is Deep Learning? [online] www.ibm.com. Available at: https://www.ibm.com/topics/deep-learning.

[3] IBM (2023). What is Natural Language Processing? [online] www.ibm.com. Available at: https://www.ibm.com/topics/natural-language-processing.

[4] GeeksforGeeks (2023). Next Word Prediction with Deep Learning in NLP. [online] Available at: https://www.geeksforgeeks.org/next-word-prediction-with-deep-learning-in-nlp/ [Accessed 27 Aug. 2023].

[5] Steffen Bickel, Peter Haider and Tobias Scheffer, (2005), "Predicting Sentences using N-Gram Language Models," In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[6] Nestor Garay-Vitoria and Julio Gonzalez-Abascal, (2005), "Application of Artificial Intelligence Methods in a Word-Prediction Aid," Laboratory of Human-Computer Interaction for Special Needs.

[7] Eyas El-Qawasmeh, (2004), "Word Prediction via a Clustered Optimal Binary Search Tree," International Arab Journal of Information Technology, Vol. 1, No. 1.

[8] Hisham Al-Mubaid, (2007), "A Learning-Classification Based Approach for Word Prediction," The International Arab Journal of Information Technology, Vol. 4, No. 3.

[9] Qaiser Abbas, (2014), "A Stochastic Prediction Interface for Urdu," Intelligent Systems and Applications, Vol. 7, No. 1, pp. 94-100.

[10] Umrinder Pal Singh, Vishal Goyal and Anisha Rani, (2014), "Disambiguating Hindi Words Using N-Gram Smoothing Models," International Journal of Engineering Sciences, Vol. 10, Issue June, pp. 26-29.

[11] Jahangir Alam, Naushad Uzzaman and Mumit Khan, (2006), "N-gram based Statistical Grammar Checker for Bangla and English," In Proceedings of the International Conference on Computer and Information Technology.

[12] Nur Hossain Khan, Gonesh Chandra Saha, Bappa Sarker and Md. Habibur Rahman, (2014), "Checking the Correctness of Bangla Words using N-Gram," International Journal of Computer Application, Vol. 89, No. 11.

[13] Nur Hossain Khan, Md. Farukuzzaman Khan, Md. Mojahidul Islam, Md. Habibur Rahman and Bappa Sarker, (2014), "Verification of Bangla Sentence Structure using N-Gram," Global Journal of Computer Science and Technology, Vol. 14, Issue 1.

[14] TypingAid auto completion tool. Available at: http://www.autohotkey.com/community/viewtopic.php?f=2&t=53630.

[15] LetMeType auto completion tool. Available at: http://letmetype.en.softonic.com/download?ex=SWH-1776.0#downloading.

[16] Avro Keyboard for auto completion in Bangla typing. Available at: https://www.omicronlab.com/avro-keyboard.html.

[17] Daniel Jurafsky and James H. Martin, (2000), Speech and Language Processing, USA: Prentice-Hall, Inc.

[18] Rastogi, K. (2022). Text Cleaning Methods in NLP. [online] Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp.

[19] Khanna, C. (2021). Text pre-processing: Stop words removal using different libraries. [online] Medium. Available at: https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a.

[20] Chakravarthy, S. (2020). Tokenization for Natural Language Processing. [online] Medium. Available at: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4.

[21] Haque, M.M., Habib, M.T. and Rahman, M.M., 2016. Automated word prediction in bangla language using stochastic language models. arXiv preprint arXiv:1602.07803.

