Language Detection Using Natural Language Processing
All content following this page was uploaded by Yadvendra Pratap Singh on 28 February 2024.
Abstract: Natural language processing (NLP) is a method for correctly identifying text based on the provided content or subject matter. An extensive study makes it simpler to interpret any language and comprehend what is being said. Although NLP is a challenging technique, notable applications include Siri and Alexa. Language detection allows us to determine the language being used in a given document. The Python model used in this work can be used to analyse the basic linguistics of any language. The "words" that make up sentences are the essential building blocks of knowledge and its expression. Correctly identifying them and comprehending the context in which they are used are essential. NLP helps here by making it easier to identify the language used in a particular piece of information, whether written or spoken. NLP gives computers the ability to understand human language and respond correctly, performing language detection for us. This paper summarises developments in natural language processing, including analysis, establishment, various areas of rapid advancement in natural language processing research, development tools, and techniques.

Keywords: Natural Language Processing, Language Detection, Virtual Assistants, Text Analytics, Machine Learning

I. INTRODUCTION
Natural Language Processing (NLP) is a technique for processing languages and transforming them into forms that the user can readily process or interpret. NLP is a method of computer programming that is based on pattern learning [1]. It consists of two parts, Natural Language Understanding (NLU) and Natural Language Generation (NLG).

We can use NLU to determine the meaning of a specific word or passage of text, whether it is written or spoken. NLG creates meaningful sentences from a representation of text or data.

NLP is the foundation of how Language Detection operates. Language is processed and identified using NLP. With the aid of NLP, different word and language types can be detected.

NLP aids in analyzing presented text and identifies language and word meaning. NLP makes it simple to recognise business writings. By identifying the datasets to which each language belongs and evaluating the text to determine its meaning and intent, NLP assists us in handling numerous languages and detecting them. This can be implemented using NLP with the help of numerous datasets and libraries, which also widens its scope. The majority of NLP applications require monolingual data because they are language specific. To develop an application in a target language, it can be essential to preprocess the data and filter out text that is written in other languages [2]. For instance, we must declare each input's precise language. Lexical (structural) analysis, syntactic analysis, semantic analysis, discourse synthesis, and pragmatic analysis are all part of natural language processing. Voice detectors, scanners, computational linguistics, and text chats are common applications in linguistic communication. These days, we employ artificial intelligence (AI) techniques to process natural language by analysing enormous samples of human-written text (conversation, keywords, and details) [3]. By looking at these patterns, trained algorithms can comprehend the "context" of writing, human speech, and other forms of human communication. Deep learning and machine learning algorithms are frequently used to build NLP frameworks and to complete typical NLP tasks efficiently [1]. The application of language detection and natural language processing continues to expand significantly.

II. LITERATURE REVIEW
Work on NLP truly started in the late 1940s, even though the "Turing Test" and syntactic structures, with its rule-based system, were developed in 1950 and 1957, respectively. Up until 1990, growth was sluggish due to inadequate computing power, the use of systems that relied on complex handwritten rules, and a narrow vocabulary. Due to the advancement of machine learning and the ongoing expansion of computing power, interest in research and applications has recently surged [15]. The recent major NLP breakthrough areas include speech recognition, dialogue systems, language processing, and the application of deep learning techniques.

NLP has generated a great deal of research interest and opened up many opportunities for using its techniques in automation, robotics, and digital transformation despite
Authorized licensed use limited to: Manipal University. Downloaded on July 18,2023 at 04:52:19 UTC from IEEE Xplore. Restrictions apply.
2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS)
the challenges it still faces (such as those related to human-computer interfaces) [3].

Prior to 1990, most research was on NLP concepts and machine translation. Deep learning, machine learning, and statistical models have been used to great effect in the most recent NLP research. Research in deep learning and artificial intelligence occasionally overlaps with research in natural language processing. Today, these techniques are commonly employed to carry out NLP tasks in the most efficient way possible [1].

One day, conversing with a machine will be as simple as conversing with a person. NLP continues to use unstructured data and give it meaning for a machine. Industries including robotics, healthcare, finance, connected cars, and smart homes will continue to benefit from NLP [2].

One of the first uses of NLP in the early years of the twenty-first century was machine translation from one human language to another [13]. However, it quickly became popular in the customer service industry. The most well-known NLP customer service tool is a virtual assistant, also known as a "Chatbot." Different applications are used in various sectors. These are listed below:

A. Systems for conversation
A conversational system enables us to hold a natural-language conversation with an automated system using a speech or text interface [2]. Such systems help businesses automate challenging activities and offer round-the-clock service to their customers. The two most common varieties of conversational systems are chatbots and virtual assistants. Today, e-commerce, social media, banking, and other self-service point-of-sale systems use these two kinds of system to provide a range of services to their customers.

B. Text Analytics
The goal of text analytics, sometimes referred to as text mining, is to extract useful information from text, whether in longer texts like emails and documents or in shorter ones like SMS texts and tweets [23]. Social media analysis is one of the most common use cases for text analytics.

C. Machine Translation
The objective of machine translation is to automatically translate material from one natural language to another while preserving the intended meaning.

Google Translate is the most widely used machine translation tool. Other machine translation software is also employed in speech translation and education [14].

NLP is also used in manufacturing, healthcare, customer service, automotive, retail, finance, and education.

Virtual assistants developed by combining machine learning, computer vision, and natural language processing are being used by hospitals. These virtual assistants automatically build and obtain patient histories by interacting with patients [12][25]. Virtual assistants also manage common tasks including patient registration and appointment scheduling.

Self-driving cars, which are enabled by NLP, are one of the most remarkable developments in the manufacturing sector and are growing in popularity in the industry.

In the banking sector, NLP-based solutions are used to create applications such as sentiment analysis, document search, and credit scoring. Credit scoring programmes let banks and financial institutions determine a person's creditworthiness and provide a credit score by using NLP and machine learning. Sentiment analysis applications automate document categorization and named entity recognition to select the information that is most relevant to investors' demands [23]. In document search applications, banks and other financial organisations use chatbot interfaces to let their customers search for information and get simple transactional answers [24].

Robotics and process automation are two highly promising NLP application areas. To process instructions for assembling and moving products and machines, a robot on a manufacturing line can use natural language processing (NLP) to communicate with a human operator who is stationed remotely [4].

Using Natural Language, Computer Vision, and Machine Learning technologies, a retail virtual assistant placed in front of a retail business can detect what the customer requires and provide quick information and promotional offers [10].

Because computer vision and natural language processing are integrated, a platform in the education industry can provide students with a virtual classroom. Digital assistants have already been used to help students solve problems using specialised information from online libraries [9].

D. Frameworks and Tools for NLP Development
Today's development tools are readily accessible due to the worldwide interest that open-source communities have shown in them [6]. These frameworks and tools contain built-in libraries and can be customised to fit specific industry standards.

The natural language representation block uses structured, tree or graph models to express the knowledge of natural language [7]. A Natural Language database is a set of Natural Language data that machine learning algorithms use to carry out further NLP tasks, similar to MNIST or other databases.

This database is used by the representation and transformation blocks to perform their tasks. Natural language transformation employs a range of learning and extraction techniques to derive meaningful and pertinent activities from the NLP tasks [5]. Natural language communication is the presentation of the behaviours that are intended and desired to occur as a result of tasks aided by NLP [11]. The end result might either be computer activity, like a robot arm moving, or it could be Natural Language.
Natural language processing has developed as a result of human conversation. The procedure involves the conversion of human natural language into a machine-understandable format. The following tasks could be included in NLP:

1) Word Sense Disambiguation - For a word with multiple meanings, semantic analysis is used to select the meaning that is most suitable in a particular context.

2) Speech Recognition - This is a process in which voice data is converted into text data.

3) Named Entity Recognition - It identifies words as relevant and useful entities.

4) Part of speech tagging - It determines the part of speech of a particular piece of text in a sentence or piece of information according to the most suitable context.

There are two components of NLP, Natural Language Understanding (NLU) and Natural Language Generation (NLG).

NLU: It involves the following-

a. Lexical Ambiguity: It arises when the correct and relevant meaning of a word has to be found in a text.

b. Referential Ambiguity: It arises when a word, such as a pronoun, could refer to more than one entity mentioned in the text.

c. Syntactical Ambiguity: It arises when the structure of a piece of text allows more than one meaning.

NLG: This is a process of converting structured information into human language [20]. It produces meaningful sentences from a representation of text or data. It involves-

a. Sentence Planning: It includes choosing meaningful words and phrases in a piece of information.

b. Text Planning: Through this, we obtain relevant facts and figures from a knowledge base.

c. Text Realization: Through this, the sentence plan is mapped into sentence structure.

Natural Language Processing also includes Sentiment Analysis, a technique that uses statistics to determine the emotional meaning and intention of the content provided.

Language Detection (LD) is a subset of NLP and, as discussed earlier, works on the principles of NLP [19]. Here, the language in which a particular piece of writing or knowledge base is written is identified [11]. Computational approaches treat this problem as a special case of text categorization that is solved with the help of various statistical methods [21]. LD is an efficient way to sort and categorize information and to apply additional language-specific layers of workflow [22]. It can help us identify errors in a document, whether in grammar or in spelling. For example, if we write a sentence in English that contains a spelling error [18], then, using the principle of Language Detection, the system can identify and correct the misspelled word and can also analyze the text and recognize the language in which it is written as 'English'. NLP has many libraries such as NLTK, spaCy, Gensim, etc. [16]. These libraries give access to the features of NLP and support the creation of NLP models, and they are widely helpful in building Language Detection models.

III. METHODOLOGY
The "Google Colab" platform is used for the implementation. The data is loaded using a prepared "Language Detection Using NLP" file. To train a model, a dataset from Kaggle and Github is used. Only a few of the many languages in the downloaded dataset were picked, based on the requirements. Each part of the implementation is described step by step below.

• STEP - I
The first step is importing all the libraries and packages that are needed to accomplish the task.

• STEP - II
The next step is mounting the dataset from the local computer to Google Drive. The dataset is uploaded to Google Drive as a zip file and is then mounted from Google Drive into the "Google Colab" environment. About 80 GB of local storage is available on the distributed server of the Google Colab environment.

• STEP - III
To obtain data from a csv file, use the function read_csv(), which extracts the data in the form of a data frame.

• STEP - IV
Now, we define the necessary variables, which play a very important role in building our machine learning model. Fig. 1 shows the variable names and their respective values.

Fig. 1. Defining the Variable
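Steps I to IV can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library so that it runs without extra packages, the in-memory CSV and its column names Text and Language are hypothetical, and the helper names load_rows, head and value_counts are illustrative. The comments name the pandas calls each helper is meant to mirror.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the dataset file; the column names "Text" and
# "Language" are assumptions, not taken from the paper.
SAMPLE_CSV = """Text,Language
Hello how are you,English
Good morning to everyone,English
Bonjour tout le monde,French
Comment allez vous,French
Hola como estas,Spanish
Buenos dias a todos,Spanish
Thank you very much,English
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts (pandas: pd.read_csv(path))."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def head(rows, n=5):
    """Return the first n rows (pandas: df.head(n), where n defaults to 5)."""
    return rows[:n]


def value_counts(rows, column):
    """Count unique values, most frequent first
    (pandas: df[column].value_counts())."""
    return Counter(row[column] for row in rows).most_common()


rows = load_rows(SAMPLE_CSV)
first_rows = head(rows)
language_counts = value_counts(rows, "Language")

print(first_rows[0])
print(language_counts)
```

In a Colab notebook the same inspection would normally be done on the data frame returned by read_csv; the plain lists used here only make the behaviour of the two inspection calls explicit.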
The variables in Fig. 1 are defined as follows:

• head
By default, the head function in pandas displays the first five rows of the data frame. It takes one optional argument n, the number of rows to return from the start, so .head(n) returns the first n rows of the data frame.

• value_counts
The value_counts() function returns an object that contains the counts of unique values. The resulting object is sorted in descending order, so that the most frequently occurring element comes first.

• STEP - V
The LabelEncoder class from the sklearn module is used, as described below.

Sklearn provides an effective tool for encoding the levels of categorical features into numeric values. Label encoding converts the labels into a numeric form that can be read by the machine, so that machine learning algorithms can decide better how the labels should be handled. It is an important pre-processing step for a structured dataset in supervised learning.

LabelEncoder encodes labels with values between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label is repeated, it is assigned the same value as before.

The fit_transform() method fits the label encoder and returns the labels encoded as integers. (The 1-of-K coding scheme, which converts multi-class labels into binary labels, is produced by a different sklearn class, LabelBinarizer, not by LabelEncoder.) Including LabelEncoder and fit_transform is shown in Fig. 2.

• STEP - VI
An array named data_list is created, and the re.sub() function, which belongs to the Regular Expressions (re) module in Python, is used.

re.sub() returns a string in which all occurrences that match the specified pattern are replaced by the replacement string. Using this function, we can replace multiple elements by listing them in the pattern.

The .lower() method is used to convert all letters to lower case.

The text is then added to the data_list array using the append function.

We then use the sklearn.feature_extraction module from the sklearn package, as described below.

The sklearn.feature_extraction module is used to extract features, in a format that is supported by machine learning algorithms, from datasets that consist of formats such as text and images.

CountVectorizer is a tool provided by the scikit-learn library in Python. It converts a given text into a vector based on the frequency (count) of every word that occurs in the text [8]. This is extremely helpful when multiple texts are present and we wish to convert the words in every text into vectors for use in later analysis.

CountVectorizer creates a matrix in which each column represents a unique word and each row represents one text sample. The value of each cell is the count of that word in the given text sample.

We use the fit_transform() method, which combines the fit and transform methods and is equivalent to calling fit() followed by transform(). Both are performed on the input data in a single call, which converts the data points.

We then check the dimensions of the array X using the .shape attribute. Creating the array data_list is shown in Fig. 3.

Fig. 3. Creating an array data_list

• STEP - VII
In this step, the list is split into a training set and a testing set.
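To make Steps V to IX concrete (the later steps are described in the text that follows), the sketch below re-implements with the Python standard library roughly what the scikit-learn classes named in this section compute: label encoding, text cleaning with re.sub, a word-count matrix, a train/test split, a Multinomial Naïve Bayes classifier with Laplace smoothing, and accuracy. It is a simplified stand-in rather than the scikit-learn API or the authors' code; the toy sentences, the regular expression, the smoothing value alpha=1.0, and all function names are assumptions made for illustration.

```python
import math
import random
import re
from collections import Counter


def clean_text(text):
    """Step VI: replace symbols and digits with spaces, then lowercase.
    The character class is a hypothetical choice, not the paper's pattern."""
    return re.sub(r'[!@#$(),.\n"%^*?:;~`0-9]', " ", text).lower()


def encode_labels(labels):
    """Step V: map each distinct label to an integer in 0..n_classes-1."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def count_vectorize(texts):
    """Step VI: one row per text, one column per unique word, cells are counts."""
    vocab = sorted({word for text in texts for word in text.split()})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for text in texts:
        row = [0] * len(vocab)
        for word in text.split():
            row[column[word]] += 1
        matrix.append(row)
    return matrix, vocab


def split_data(X, y, test_size=0.25, seed=0):
    """Step VII: shuffle the indices and hold out a test_size fraction."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(X) * test_size)))
    test, train = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])


def fit_multinomial_nb(X, y, n_classes, alpha=1.0):
    """Step VIII: log priors and Laplace-smoothed log word likelihoods."""
    n_features = len(X[0])
    class_counts = Counter(y)
    word_counts = [[0] * n_features for _ in range(n_classes)]
    for row, label in zip(X, y):
        for j, count in enumerate(row):
            word_counts[label][j] += count
    log_prior, log_like = [], []
    for c in range(n_classes):
        prior = class_counts[c] / len(y)
        # A class missing from the training data can never be predicted.
        log_prior.append(math.log(prior) if prior > 0 else float("-inf"))
        total = sum(word_counts[c]) + alpha * n_features
        log_like.append([math.log((n + alpha) / total) for n in word_counts[c]])
    return log_prior, log_like


def predict(model, X):
    """Pick, for each row, the class with the highest log posterior score."""
    log_prior, log_like = model
    predictions = []
    for row in X:
        scores = [lp + sum(n * ll for n, ll in zip(row, likes))
                  for lp, likes in zip(log_prior, log_like)]
        predictions.append(scores.index(max(scores)))
    return predictions


def accuracy(y_true, y_pred):
    """Step IX: fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy corpus (hypothetical); the paper uses a dataset from Kaggle and Github.
samples = [
    ("Hello, how are you?", "English"),
    ("Good morning to everyone!", "English"),
    ("Thank you very much.", "English"),
    ("Bonjour tout le monde!", "French"),
    ("Merci beaucoup mon ami.", "French"),
    ("Comment allez vous madame?", "French"),
    ("Hola, como estas amigo?", "Spanish"),
    ("Buenos dias a todos.", "Spanish"),
    ("Muchas gracias por todo!", "Spanish"),
]

data_list = [clean_text(text) for text, _ in samples]
y, classes = encode_labels([language for _, language in samples])
X, vocab = count_vectorize(data_list)
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2)
model = fit_multinomial_nb(X_train, y_train, n_classes=len(classes))
test_accuracy = accuracy(y_test, predict(model, X_test))
print(len(X), len(vocab), test_accuracy)
```

With so few sentences the held-out accuracy is not meaningful; the point is only to show how the pieces fit together. In practice the scikit-learn classes named in this section would replace the hand-written functions.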
Splitting the data in this way is one of the major concepts in machine learning model building [17]. X, Y and test_size are passed as parameters to the train_test_split function, which is part of the sklearn module. Dividing the dataset into training and testing datasets is shown in Fig. 4.

The training and testing datasets are held in four variables, as given below:
1. X_train
2. Y_train
3. X_test
4. Y_test

• STEP - VIII
In this step, the model is built and trained with model.fit() using the MultinomialNB class. Although Fig. 5 refers to it as a neural network, MultinomialNB is a Naïve Bayes classifier.

It is another useful Naïve Bayes classifier, which assumes that the features are drawn from a simple multinomial distribution. To implement the Multinomial Naïve Bayes algorithm for classification, Scikit-learn provides sklearn.naive_bayes.MultinomialNB.

Fig. 5. Building Neural Network Model

• STEP - IX
This step is used to find the model accuracy. Finding the model accuracy is shown in Fig. 6.

Fig. 6. Find the Model Accuracy

• STEP - X
We check the accuracy of our model. The accuracy of the model is shown in Fig. 7.

Fig. 7. Accuracy of the model

V. CONCLUSION
The rise of technology in the modern world has also given rise to increased requirements, which justify the development taking place around us every day. Natural Language Processing and Language Detection open up wider scope for making tasks easier for human beings and for helping them recognize texts in an easier, better and more systematic manner, making technical work easier with the use of statistical methods. We have therefore attempted to create a Language Detection model, with the help of Natural Language Processing, that can solve Language Detection problems and help identify text easily and aptly using appropriate and efficient methods. This is important and useful in today's world, and it does justice to the usage of words and linguistics in the body of the content provided in various documents.

REFERENCES
[1] Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. 2018. A Survey of the Usages of Deep Learning in Natural Language Processing. 1, 1 (July 2018), 35 pages.
[2] Robert Dale. "The commercial NLP Landscape in 2017", Article in Natural Language Engineering, July 2017.
[3] ACL 2018: 56th Annual Meeting of Association for Computational Linguistics. https://acl2018.org
[4] Predictive Analytics Today: www.predictiveanalyticstoday.com [accessed in Dec 2018].
[5] Ali Shatnawi, Ghadeer Al-Bdour, Raffi Al-Qurran and Mahmoud Al-Ayyoub. 2018. A Comparative Study of Open Source Deep Learning Frameworks. 2018 9th International Conference on Information and Communication Systems (ICICS).
[6] Intelligent automation: Making cognitive real. Knowledge Series I, Chapter 2. 2018, EY report.
[7] Jacques Bughin, Eric Hazan, Sree Ramaswamy, Michael Chui, Tera Allas, Peter Dahlström, Nicolaus Henke, Monica Trench. 2017. MGI ARTIFICIAL INTELLIGENCE THE NEXT DIGITAL FRONTIER? McKinsey & Company report, July 2017.
[8] Svetlana Sicular, Kenneth Brant. 2018. Hype Cycle for Artificial Intelligence, 2018. Gartner report, July 2018.
[9] Oshin Agarwal, Funda Durupinar, Norman I. Badler, and Ani Nenkova. 2019. Word embeddings (also) encode human personality stereotypes. In Proceedings of the Joint Conference on Lexical and Computational Semantics, pages 205–211, Minneapolis, MN.
[10] Quarteroni, Silvia. (2018). Natural Language Processing for Industry: ELCA's experience. Informatik-Spektrum. 41. 10.1007/s00287-018-1094-1.
[11] Young, Tom & Hazarika, Devamanyu & Poria, Soujanya & Cambria, Erik. (2018). Recent Trends in Deep Learning Based Natural Language Processing [Review Article]. IEEE Computational Intelligence Magazine. 13. 55-75. 10.1109/MCI.2018.2840738.
[12] Amirhosseini, Mohammad Hossein, Kazemian, Hassan, Ouazzane, Karim and Chandler, Chris (2018). Natural language processing approach to NLP meta model automation. In: International Joint Conference on Neural Networks (IJCNN), 8-13 July 2018, Rio de Janeiro, Brazil.
[13] Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP: A survey. Proceedings of the 28th International Conference on Computational Linguistics, pages 6838–6855.
[14] Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges.
[15] Garrett Wilson and Diane J Cook. 2020. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST), 11(5):1–46.
[16] Artem Abzaliev. 2019. On GAP coreference resolution shared task: insights from the 3rd place solution. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 107–112, Florence, Italy.
[17] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33.
[18] Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2020. Gender Bias in Neural Natural Language Processing, pages 189–202. Springer International Publishing, Cham. George A. Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
[19] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proc. of ACL.
[20] Marouane Birjali, Mohammed Kasri, Abderrahim Beni-Hssane. A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Received 1 July 2020, Revised 25 March 2021, Accepted 10 May 2021, Available online 14 May 2021, Version of Record 18 May 2021.
[21] Performance Evaluation and Comparison using Deep Learning Techniques in Sentiment Analysis. A. Pasumpon Pandian, Professor, Dean (R&D), CARE College of Engineering, Trichy, India. ISSN: 2582-2640 (online). Submitted: 17.05.2021, Revised: 07.06.2021, Accepted: 26.06.2021, Published: 03.07.2021.
[22] Radiuk, Pavlo, Pavlova, Olga, Hrypynska, Nadiia. An ensemble machine learning approach for Twitter sentiment analysis. Issue Date: 17-Jul-2022.
[23] Conducting Sentiment Analysis. Lei Lei and Dinlin Liu. Cambridge: Cambridge University Press, 2021.
[24] Luca Barbaglia, Sergio Consoli, Sebastiano Manzan, Luca Tiozzo Pezzoli, Elisa Tosetti. Sentiment Analysis of Economic Text: A Lexicon-based Approach. 23 pages. Posted: 13 May 2022, Date Written: May 11, 2022.
[25] Patil, Ratna, and Sharavari Tamane. “A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Diabetes.” International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 5, 1 Oct. 2018, p. 3966.