An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi




Shashi Shekhar1, Dilip Kumar Sharma1, M.M. Sufyan Beg2

1 GLA University, Department of Computer Engineering and Applications, India
2 Aligarh Muslim University, Department of Computer Engineering, India

{shashi.shekhar, dilip.sharma}@gla.ac.in, [email protected]

Abstract. The paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at word level using a BLSTM neural model. In Natural Language Processing, one of the imperative and relatively less mature areas is transliteration. During transliteration, issues like language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used by users in social media nowadays is code-mixed text, i.e., a mixing of two or more languages. In code-mixed data, one language will be written using another language's script. So, to process such code-mixed text, identification of the language used in each word is important for language processing. The major contribution of the work is to propose a technique for identifying the language of Hindi-English code-mixed data used on three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features were used for the representation of each word. Much research has recently been done in the field of language identification, but word-level language identification in the transliterated environment is a current research issue in code-mixed data. We have implemented a deep learning model based on BLSTM that predicts the origin of a word, from the language perspective, based on the specific words that come before it in the sequence. A multichannel neural network combining CNN and BLSTM is used for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi occur together, combined with cBoW and Skip-gram models for evaluation. The proposed system's BLSTM context capture module gives better accuracy for the word embedding model as compared to character embedding, evaluated on our two testing sets. The problem is modeled collectively with the deep-learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.

Keywords. Language identification, transliteration, character embedding, word embedding, NLP, machine learning.

1 Introduction

Humans use natural language as their medium for communication. Natural Language Processing (NLP) is an area of Artificial Intelligence where we train the machine to understand and process text to make human-computer interactions more efficient. Applications of NLP lie in several fields like machine translation, text processing, entity extraction and so on [1]. A large amount of data is now available on the Web as text. With the emergence of several social media platforms and the availability of a large amount of text data in them, NLP plays a great role in understanding and generating data today. Social media platforms are used widely today by people to discuss their interests, hobbies, reviews of products, movies and so on.


In earlier days, the language used on such platforms was purely English. Today, mixing multiple languages together is a popular trend. These kinds of languages are called code-mixed languages. An example of Hindi-English code-mixed text is given in the sentences below:

Sentence 1: GLA University ka course structure kaisa hai
Tags: NE/OOV E H E E H H

Sentence 2: Aray Friend, ek super idea hai mere paas
Tags: H E H E E H H H

Here Hindi words are labeled as H, English words as E, and named entities as NE. We can observe from the example that the Hindi words, tagged as H, are written in Roman script instead of Unicode (Devanagari) characters. The above example is cited to give an idea of code-mixed data.

The paper presents a novel architecture, which captures information at both the word level and the context level to output the final tag for language identification, i.e., which language each word belongs to. For the word level, we have used a multichannel neural network (MNN) inspired by recent works in computer vision. Such networks have also shown promising results in NLP tasks like sentence classification [2]. For context capture, we used a Bi-LSTM-CRF. The context module was tested more rigorously since, in quite a few of the previous works, this information has been sidelined or ignored. We have experimented on Hindi-English (H-E) code-mixed data. Hindi is the most popular spoken language of India. Here, Hindi words are written in Roman transliterated form using the English alphabet.

For processing monolingual text, the primary step would be Part-Of-Speech (POS) tagging of the text. However, in the case of social media text, the primary feature to be considered is the identification of the language, particularly for code-mixed text [3]. The language identification for code-mixed text proposed in this paper is implemented using word embedding models. The term word embedding refers to the vector representation of the given data, capturing the semantic relations between the words in the data. The work is a generalized approach because the system can be extended to other NLP applications, since only word embedding features are considered. The work involves features obtained from two embedding models, word-based embedding and character-based embedding. A comparison of the performance of the two models with the addition of contextual information is performed in this paper. Machine learning [4] based classification is used for training and testing the system.

A framework for discovering user intent in Hindi Roman transliteration through word-level language identification is addressed here. The remainder of the paper is organized as follows: an overview of the related work on language identification in the multilingual domain is given in Section 2. The proposed methodology, considering the word embedding and character embedding methods, is discussed in Section 3. The dataset description is stated in Section 4. Section 5 describes the experimental evaluation and the results obtained. Section 6 analyses the inferences obtained from the work done and points towards future work.

2 Related Research

In this section, some of the recent techniques regarding language transliteration and identification are listed and reviewed. Code-switching and mixing is a current research area in the field of language tagging. Language Identification (LID) is a primary task in many text processing applications, and hence much


research is going on in this area, especially with code-mixed data. King and Abney [5] used semi-supervised methods for building a word-level language identifier. Nguyen and Doğruöz [6] used a CRF model limited to bigrams for identifying the language. Logistic regression, along with a module that gives the code-switching probability, was used by Vyas et al. [7]. Das and Gambäck [8] used various features like a dictionary, n-grams, edit distance and word context for identifying the origin of a word.

A shared task on Mixed Script Information Retrieval (MSIR) 2015 was conducted in which a subtask included language identification for 8 code-mixed Indian languages, Telugu, Tamil, Marathi, Bangla, Gujarati, Hindi, Kannada, and Malayalam, each mixed with English [9].

The MSIR language identification task was implemented using a machine learning based SVM classifier and obtained an accuracy of 76% [16]. Word-level language identification was performed for English-Hindi using supervised methods in [10]. A Naive Bayes classifier was used to identify the language of Hindi-English data, and an accuracy of 77% was obtained [11].

Language identification is also performed as a primary step for several other applications. [12] implemented a sentiment analysis system which utilized the MSIR 2015 English-Tamil, English-Telugu, English-Hindi, and English-Bengali code-mixed datasets. Another emotion detection system was developed for Hindi-English data with machine learning based and Teaching Learning Based Optimization (TLBO) techniques [13]. Part-of-Speech tagging was done for an English-Bengali-Hindi corpus, including the language identification step, in [14].

Since code-mixed script is the common trend in social media text today, much research is going on for information extraction from such text. An analysis of the behavior of a code-mixed Hindi-English Facebook dataset was done in [15]. A POS tagging technique was applied to code-mixed social media text in Indian languages [16].

A shared task was organized for entity extraction on code-mixed Hindi-English and Tamil-English social media text [17]. Entity extraction for a code-mixed Hindi-English and Tamil-English dataset was performed with embedding models [18].

Sapkal et al. [19] have given an approach based on the use of SMS, which is meant for communicating with others in minimal words. The regional-language messages are typed using English alphabets due to the lack of regional keyboards. This SMS language may fluctuate, which leads to miscommunication. The focus was on transliterating short forms to full forms. Zubiaga et al. [20] described language identification as the mission of defining the language of a given text. On the other hand, certain issues, like quantifying the individuality of similar languages in a multilingual document and analyzing the language of short texts, are still unresolved. The section below describes the proposed methodology to overcome the research gap identified in the area of transliterated code-mixed data. Alekseev et al. [29] consider word embedding an efficient feature and proposed entity extraction for user profiling using word-embedding features.

3 Proposed Methodology for Language Identification

Code-mixed data include the combination of a native script (familiar language) and a non-native script (unfamiliar language). Due to this combination, a massive number of complications arise while dealing with this mixed code. Language identification is the main and foremost problem identified in mixed-code data, since every user will not be familiar with every language recognized in the globe. The language identification problem arises when the text is written in different languages. This incorporates problems such as script specification, which is the possibility of different scripts between the source and target languages.

The proposed system is comprised of two modules. The first one is a multichannel neural network trained at the word level, while the second one is a simple bidirectional LSTM trained at the context level. The second module takes the input from the first module, along with some other features, to produce the language tag.

The proposed architecture, illustrated in Figure 2, is inspired by [30, 31, 21], where recent deep neural architectures were developed for image classification tasks.


Fig. 1. Framework for word origin detection

Fig. 2. Methodology of the proposed system

The proposed model uses a very similar concept for learning the language at the word level. This is because the architecture allows the network to capture representations of different types, which can be really helpful in NLP tasks such as identifying the origin of a word, with respect to the language used, in code-mixed data. The network we developed has 4 channels; the first three enter a Convolution 1D (Conv1D) network [22], while the fourth one enters a Long Short Term Memory (LSTM) network [23].
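To make the four-channel design concrete, the following is a minimal sketch assuming Keras (TensorFlow 2.x). The character-level inputs, kernel widths, layer sizes and three-way output are illustrative assumptions, not the authors' exact configuration.

from tensorflow.keras import layers, Model

MAX_WORD_LEN = 20   # characters per word (assumed)
CHAR_VOCAB = 64     # character vocabulary size (assumed)
EMB_DIM = 100       # embedding size, matching the 100-dim vectors used in this work

char_ids = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")
emb = layers.Embedding(CHAR_VOCAB, EMB_DIM)(char_ids)

# Channels 1-3: Conv1D layers with different kernel widths capture
# character n-gram patterns of different sizes.
conv_feats = []
for k in (2, 3, 4):  # kernel widths are an assumption
    c = layers.Conv1D(filters=64, kernel_size=k, activation="relu")(emb)
    conv_feats.append(layers.GlobalMaxPooling1D()(c))

# Channel 4: an LSTM reads the character sequence sequentially.
lstm_feat = layers.LSTM(64)(emb)

merged = layers.Concatenate()(conv_feats + [lstm_feat])
word_repr = layers.Dense(128, activation="relu")(merged)
lang_tag = layers.Dense(3, activation="softmax")(word_repr)  # e.g. E / H / Other

model = Model(char_ids, lang_tag)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

The parallel channels give the network several views of the same word, which is the property the multichannel design exploits.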


In this work, two systems were developed, based on word-based embedding features and character-based context features. For the character-based system, the same procedure as the word-based one is followed, except that the vectors are character vectors. The methodology of the proposed system is illustrated in Figure 2.

For the embedding to capture the word representation more effectively, additional data, apart from the train and test data, must be provided to the embedding model. The additional data used here is also code-mixed Hindi-English social media data, collected from other shared tasks.

The input for the word embedding will be the train data and the additionally collected dataset. The embedding model generates a vector for each unique vocabulary word present in the data. Along with extracting the feature vectors of the train data, its context information is also extracted.

The incorporation of the immediate left and right context features with the features of the current word is called 3-gram context appending. 5-gram features were also extracted, which is the extraction of features from the two neighboring words before and after the current word. So, if the vocabulary size of the training data is |V| and the embedding feature size generated is 100 for each word, then after context appending with 3-gram features a matrix of size |V| × 300 is obtained; 5-gram appending will result in a matrix of size |V| × 500.

The test data was also given to the embedding models. The data were then appended with the 3-gram and 5-gram context information. These are then fed to a machine learning based classifier to train and test the system.
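The n-gram context appending described above amounts to concatenating each token's embedding with the embeddings of its neighbors. A minimal NumPy sketch, assuming zero-vector padding at sentence boundaries:

import numpy as np

def context_append(token_vectors, window=1):
    """token_vectors: (n_tokens, dim) array of per-token embeddings.
    window=1 yields 3-gram rows (dim*3); window=2 yields 5-gram rows (dim*5)."""
    n, dim = token_vectors.shape
    padded = np.vstack([np.zeros((window, dim)),
                        token_vectors,
                        np.zeros((window, dim))])
    return np.array([np.concatenate(padded[i:i + 2 * window + 1])
                     for i in range(n)])

vecs = np.random.rand(7, 100)           # 7 tokens, 100-dim embeddings
print(context_append(vecs, 1).shape)    # (7, 300)  -> 3-gram appending
print(context_append(vecs, 2).shape)    # (7, 500)  -> 5-gram appending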

3.1 Word-Based Embedding Model

The word-based embedding model is used to find the feature vectors that are useful in predicting the neighboring tokens in a context. The feature vector for this model is generated using the Skip-gram architecture of the popular Word2vec package proposed by Mikolov et al. [24]. Apart from the skip-gram model, another architecture, continuous Bag of Words (cBoW), is also available [24].

Word2vec is a predictive model that is used to produce word embeddings from raw text. It exists in two forms, the continuous Bag-of-Words model (cBoW) and the Skip-gram model. Algorithmically, these two are similar, except that cBoW forecasts target words from source context words, whereas skip-gram forecasts source context words from the target words. This gives the flexibility to use skip-gram when we have a large dataset, while cBoW can be used for a smaller dataset. We focus on the skip-gram model for language identification at word level in the multilingual domain (i.e., answering which language a word belongs to) in the rest of this paper. The Skip-gram model is illustrated in Figure 3.

Fig. 3. Skip gram model

Here the input token is T0, which is fed to a log-linear classifier to predict the neighboring words. T−2, T−1, T1 and T2 are the words before and after the current word.

When the data is given to the Skip-gram model, it maximizes the average log probability L, which is formulated as in Equation 1. In the equation, N is the total number of words in the train data and x is the context size; p is the softmax probability, which is given by Equation 2:

$L = \frac{1}{N} \sum_{n=1}^{N} \sum_{-x \le i \le x} \log p(T_{n+i} \mid T_n)$, (1)

$P(T_j \mid T_k) = \frac{\exp({V'_{T_j}}^{\top} V_{T_k})}{\sum_{w=1}^{W} \exp({V'_w}^{\top} V_{T_k})}$, (2)

where W is the vocabulary size and $P(T_j \mid T_k)$ is the probability of occurrence of the next word; V' is the output vector representation.
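As a concrete illustration of generating these vectors, the following sketch assumes the gensim library (4.x API); the toy corpus stands in for the code-mixed training plus additional data:

from gensim.models import Word2Vec

corpus = [
    ["party", "abhi", "baaki", "hai"],
    ["ek", "super", "idea", "hai", "mere", "paas"],
]

# sg=1 selects the skip-gram objective of Equation 1 (sg=0 would select
# cBoW); window=2 plays the role of the context size x; vector_size
# matches the 100-dim vectors used in this work.
model = Word2Vec(corpus, vector_size=100, window=2, sg=1, min_count=1)
print(model.wv["hai"].shape)  # (100,) -- one 1 x 100 vector per vocabulary word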
word, then after context appending with 3-gram


The dataset, along with the additional dataset collected, was given to the skip-gram model. The vector size to be generated was fixed at 100. The skip-gram model generates a vector of size 1 × 100 for each vocabulary word available in the dataset.

From this, the vectors for the training data were extracted. The context-appending features were then extracted from this file. The final training file for the classifier consists of the tokens in the train data, their language tags, and the 3-gram and 5-gram context feature vectors extracted.

Thus, three training files are generated, with |V| × 101, |V| × 301 and |V| × 501 dimensions. The test data, with its corresponding context-appended vectors, is fed to the classifier for testing the system.

3.2 Character-Based Embedding Model

The procedure for character embedding is the same as that of the skip-gram based word embedding. Each token in the train data gets split into characters and fed to the system. This generates a vector for each character. The vector size to be generated was fixed at 100. The vectors generated for each character are used to create vectors for each token as per Equation 3:

$Y = x + S\,h(W, C_{t-k}, \ldots, C_{t+k}, C)$, (3)

In the above equation, the softmax parameters are denoted by x and S, h is the embedding function over the character and word features, C denotes the character vectors and W the word vectors; $C_{t-k}, \ldots, C_{t+k}$ are the characters in the train data.

Fig. 4. Embedding Model

Figure 4 shows how the word PAANI gets split into characters and given to the system to produce an embedding feature vector. Vectors are generated for each character in the word. These are then transformed to produce the character-based embedding vector of the word PAANI using Equation 3. The vectors for each token are then used to extract the context feature vectors. The feature vector with context features is appended along with the language tag and is fed to the classifier for training the system. A similar procedure is followed for the test file. The vectors generated from the character embedding model are then transformed into a context matrix for the test data. This context matrix with the test words is fed to the classifier for testing the system.
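A simplified stand-in for this character-based pipeline is sketched below, again assuming gensim. Mean-pooling of the character vectors is our assumption, used here in place of the exact composition of Equation 3:

import numpy as np
from gensim.models import Word2Vec

tokens = ["paani", "party", "abhi", "baaki", "hai"]
char_corpus = [list(tok) for tok in tokens]   # e.g. ['p','a','a','n','i']

# Character vectors learned with skip-gram over character sequences.
char_model = Word2Vec(char_corpus, vector_size=100, window=2, sg=1,
                      min_count=1)

def token_vector(token):
    # Compose a 100-dim token vector from its character vectors
    # (mean-pooling here; Equation 3 defines the model's own composition).
    return np.mean([char_model.wv[c] for c in token], axis=0)

print(token_vector("paani").shape)  # (100,)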


3.3 Design Considerations and Proposed Algorithm

‒ Each document must consist of words from two languages.
‒ All the documents must be in a single script. The chosen script, in this case, is the Roman script.
‒ In the Indian scenario, code-mixing applies between English and other Indian languages.
‒ The languages used in the proposal are English and Hindi, where Hindi is represented using Roman script, not Devanagari.

If the Hindi words were written in Devanagari script, it would be a simpler task to identify the language. The task of identifying the language becomes non-trivial when both Hindi and English are written using the same character set.

Algorithm 1. Proposed algorithm for language identification

Input: Code-mixed data.
Output: Language of the input word.
Algorithm steps:
1. Input a term from the test document. Let D = W1, W2, ..., Wn be a document, where the Wi are the words.
2. Letters of the words: {a–z}.
3. // Word-level tagging task based on vectors
   3.1 Lb(Wi) is chosen from language L, where L = {LE, LH, LO}. // Check the frequencies of characters in Wi and Wj.
   3.2 Generate vectors for the characters of Wi and Wj.
   3.3 Apply the similarity metric:
       $Sim(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}}$
4. Label the word E (English) or H (Hindi).
5. Check the Conf_Score of the classifier for language Lj on input Wi, with 0 ≤ Conf_Score ≤ 1, where Conf_Score is the similarity metric sim(Wx, Wy); x and y can be words in a string, and sim(x, y) ∈ [0, 1] after normalization:
   sim(x, y) = 1: exact match;
   sim(x, y) = 0: completely different x and y;
   0 < sim(x, y) < 1: approximate similarity.
   The threshold value is 1 for an exact match:
   = 1: matches LE;
   < 1: matches LH or LO, based on the list condition LOW; if LE matches LOW, then L = LO.
6. Classify the word as E, H or O.

Table 1. Dataset ICON 2016 [25]

Data | Training sentences | Testing sentences | Training tokens | Testing tokens
Facebook | 772 | 111 | 20,615 | 2,167
Twitter | 1,096 | 110 | 17,311 | 2,163
WhatsApp | 763 | 219 | 3,218 | 802

Table 2. Sample data

Data sample:
amir se hoti hai, garib se hotii hai
door se hotee hai, qarib se hoti hai
magar jahaan bhi hoti hai, ai mere dost
shaadiyaan to naseeb se hoti hai

Mixed-script data sample:
Party abhi baaki hai...
Party abhee baaki hai...

4 Dataset Descriptions

The dataset used for this work is obtained from the POS tagging task for Hindi-English code-mixed social media text conducted at ICON 2016 [25]. The dataset contains text from three social media platforms, namely Facebook, Twitter and WhatsApp. The train data provided contains the tokens of the dataset with their corresponding language tags and POS tags.

The dataset used here for language identification is the Indian language corpora used in the FIRE 2014 (Forum for IR Evaluation) shared task on transliterated search. The data used for training the classifier consists of bilingual documents containing English and Hindi words in Romanized script, drawn from Bollywood song lyrics. The complete database of songs consists of 63,000 documents in the form of text files (dataset of FIRE MSIR). Table 2 shows a sample from the dataset with various transliterated variations of a non-English word, and a second sample of mixed-script data having both English words and transliterated Hindi words.


5 Evaluation and Experimental Results

The next section discusses the complete experimental part, along with results and the consequent discussion.

5.1 Experimental Results

The proposed algorithm for retrieving the language of a word in code-mixed data is evaluated on the basis of statistical measures, and also evaluated using the machine learning approach. The section below provides the complete evaluation based on the statistical model. We performed two separate experiments on the code-mixed data. To rationalize the performance of language identification, we computed code-mixing patterns in the dataset using two metrics; these are used to understand the mixing patterns present in the data.

The proposed system is analyzed and evaluated based on the following code-mixing metrics.

MI: The Multilingual Index is a word-count measure that quantifies the distribution variations of the language tags in a corpus of languages. Equation 4 defines the MI as:

$MI = \frac{1 - \sum_j P_j^2}{(k - 1)\sum_j P_j^2}$, (4)

where k denotes the number of languages and $P_j$ denotes the number of words in language j over the number of words in the corpus. The value of MI resides between 0 and 1: a value of 0 corresponds to a monolingual corpus, and 1 corresponds to an equal number of tokens from each language in the corpus.

CMI: Code-Mixing Index. At the utterance level, this is calculated by discovering the most frequent language in the utterance and then counting the frequency of the words belonging to all other languages present. It is calculated using Equation 5:

$CMI = \frac{\sum_{i=1}^{n} w_i - \max(w_i)}{n - u}$, (5)

where $\sum_{i=1}^{n} w_i$ is the sum over all languages present in the utterance and max{wi} is the maximum number of words from any language (considering the case that more than one language can have the same maximum word count); n denotes the total number of tokens, and u denotes the number of tokens carrying other, language-independent tags.

If an utterance contains only language-independent tokens (i.e., n = u), its index is considered to be zero. For other utterances, we use normalization (multiplying the value by 100) to obtain values in the range of 0 to 100. Applying this equation, we get CMI = 0 for monolingual utterances, because max(wi) = n − u. Equation 5 is normalized as in Equation 6:

$CMI = \begin{cases} 100 \times \left[1 - \frac{\max\{w_i\}}{n - u}\right] & : n > u, \\ 0 & : n = u, \end{cases}$ (6)


where the $w_i$ are the words labelled with each language tag and max{wi} is the count of the most prominent language's words. Applying the above equation, we get a CMI value of 0 for a monolingual utterance, while a higher value of CMI designates a high degree of language mixing.

To understand the measure, consider the following scenario: sentence S1 contains ten words, five from language L1 and the remaining five from language L2. Applying Equation 6, the CMI will be 100 × (1 − 5/10) = 50. However, another sentence S2 contains 10 words, each from a different language; its CMI = 100 × (1 − 1/10) = 90. This rightly reflects that S2 is highly mixed, as every word belongs to a different language. The CMI value helps us understand the level of code mixing present in the dataset. Table 3 lists the values obtained for MI and CMI for the corpus.

Table 3. MI and CMI values

Language set | MI | CMI
Hindi-English | 0.582 | 22.229
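The CMI of Equation 6 and the worked S1/S2 example above can be reproduced with a few lines of Python; treating 'Other' as the language-independent tag u is an assumption following Table 4:

from collections import Counter

def cmi(tags, independent=("Other",)):
    # Equation 6: 100 * (1 - max(w_i) / (n - u)) for n > u, else 0.
    n = len(tags)
    u = sum(1 for t in tags if t in independent)
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t not in independent)
    return 100.0 * (1 - max(counts.values()) / (n - u))

s1 = ["H"] * 5 + ["E"] * 5           # 5 Hindi + 5 English words
print(cmi(s1))                       # 50.0

s2 = [f"L{i}" for i in range(10)]    # every word from a different language
print(cmi(s2))                       # 90.0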
Secondly, we computed the similarity score based on the proposed algorithm over the dataset, using Equation 7. It provides the basis for labeling a word as either English or Hindi based on the frequency of the word. The proposed algorithm checks the Conf_Score of the classifier for language Lj on input Wi, with 0 ≤ Conf_Score ≤ 1, where Conf_Score is the similarity metric sim(Wx, Wy), and x and y can be words in a string. The section below describes the different results obtained on the code-mixed dataset when calculating the similarity score at the word level and the sentence level. Figures 5(a) and 5(b) describe the results obtained at the word level for Hindi Roman transliterated words in the corpus. Figure 6 plots the similarity at the sentence level. Figure 7 visualizes word-level language identification by the statistical model, based on the proposed design and the algorithm discussed in Section 3.3:

$Sim(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}}$, (7)
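Equation 7 is the cosine similarity over feature vectors. A small sketch of it, together with the exact-match threshold rule of step 5 of Algorithm 1 (the vectors shown are illustrative placeholders):

import numpy as np

def sim(x, y):
    # Sim(X, Y) = sum(X_i * Y_i) / (sqrt(sum(X_i^2)) * sqrt(sum(Y_i^2)))
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def label(score, eps=1e-9):
    # Threshold of 1 means an exact match (label E here); anything below
    # falls through to the Hindi/Other decision of Algorithm 1.
    return "E" if score >= 1.0 - eps else "H or O"

x = np.array([1.0, 0.0, 2.0])
print(sim(x, x), label(sim(x, x)))           # 1.0 -> exact match
print(sim(x, np.array([0.5, 1.0, 1.0])))     # 0 < sim < 1 -> approximate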
The next part of the evaluation applies the BLSTM neural model. The dataset, as described in Section 4, is the Hindi-English code-mixed social media text from the ICON 2016 POS tagging task [25], covering Facebook, Twitter and WhatsApp; we use this Hindi-English dataset for the experimental evaluation.

Fig. 5 (a). Word-level similarity scores for variants of the test word "hoti" (hoti 1.0, hotii 0.94, hotie 0.89, hotei 0.89)

Fig. 5 (b). Word-level similarity

Fig. 6. Sentence-level similarity for variants of the sentence "party abhi baaki hai"

Fig. 7. Visualization of word-level language identification by the statistical model

The labels used are summarized in Table 4. The training data contains the tokens of the dataset with their corresponding language tags and POS tags:

1. E indicates English words, for example: This, and, there.
2. H indicates Hindi words, for example: aisa, mera, tera.
3. NE indicates named entities like Person, Location and Organization, for example: Narendra Modi, India, Facebook.


4. Other indicates tokens containing special characters and numbers, for example: @, #, 0–9.
5. Ambiguous indicates words used ambiguously in Hindi and English, for example: is, to, us.
6. Mixed indicates words made of Hindi-English and number combinations, for example: MadamJi, Sirji.
7. Unk indicates unrecognized words, for example: t.M, @.s, Ss.

All seven tags are present in the Facebook dataset, whereas 'E', 'H', 'NE' and 'Other' are the tags present in the Twitter and WhatsApp data. The size of the training and testing data is summarized in Table 1. From the table, it can be observed that the average number of tokens per comment in the WhatsApp training and testing data is much lower than for Facebook and Twitter. This may be due to the fact that Facebook and Twitter data mostly contain news articles and comments, which push the average token count per comment up, while WhatsApp contains short conversational messages.

For generating the embedding vectors, more data has to be provided to efficiently capture the distributional similarity of the data. The additional dataset collected, along with the training data, is given to the embedding model. The additional Hindi-English code-mixed data were collected from the shared task on Mixed Script Information Retrieval (MSIR), conducted in 2016 [26] and 2015 [27], and from the shared task on Code-Mix Entity Extraction conducted by the Forum for Information Retrieval and Evaluation (FIRE) in 2016 [28]. Most of the data collected for embedding is Hindi-English code-mixed Twitter data. The size of the dataset used for embedding is given in Table 5.

Table 4. Description of the labels for the Hindi-English dataset

Label | Description | Hindi-English %
E | English words only | 57.76
H | Hindi words only | 20.41
NE | Named Entity | 6.59
Other | Symbols, emoticons | 14.8
Ambiguous | Cannot determine whether Hindi or English | 0.27
Mixed | Word of Hindi and English in combination | 0.08
Unk | Unrecognized word | 0.09

Table 5. Embedding dataset: number of sentences in the datasets used for embedding (Facebook, Twitter and WhatsApp)

ICON 2016 | 2,631
MSIR 2015 | 2,700
MSIR 2016 | 6,139

Context appending was done for the Facebook, Twitter and WhatsApp train as well as test data. These were given to the learning model for training and testing. The cross-validation accuracies obtained for Facebook, Twitter, and WhatsApp with 1-gram, 3-gram and 5-gram features, for the character-based embedding model and the word-based embedding model, are presented in the section below.

Table 6. F-measure obtained for Twitter

Embedding | Type | E | H | NE
Character embedding | 1-gram | 84.95 | 93.31 | 78.38
Character embedding | 3-gram | 85.34 | 93.44 | 77.12
Character embedding | 5-gram | 85.38 | 93.49 | 80.27
Word embedding | 1-gram | 65.86 | 82.96 | 62.22
Word embedding | 3-gram | 85.71 | 93.97 | 83.94
Word embedding | 5-gram | 85.42 | 93.16 | 78.15

Table 7. F-measure obtained for Facebook

Embedding | Type | E | H | NE
Character embedding | 1-gram | 85.65 | 92.92 | 64.95
Character embedding | 3-gram | 86.45 | 93.36 | 65.02
Character embedding | 5-gram | 85.47 | 92.55 | 65.05
Word embedding | 1-gram | 85.02 | 92.03 | 62.80
Word embedding | 3-gram | 86.99 | 93.51 | 67.21
Word embedding | 5-gram | 85.15 | 92.47 | 61.03

When comparing the overall accuracy obtained for Facebook, Twitter, and WhatsApp, we can see that the accuracy obtained is higher with the word-based model than with the character-based embedding model.


Fig. 8. Visualization of the word representations learned by the Bi-LSTM model

Table 8. F-measure obtained for WhatsApp

Embedding | Type | E | H | NE
Character embedding | 1-gram | 52.42 | 80.15 | 28.57
Character embedding | 3-gram | 54.99 | 80.26 | 37.70
Character embedding | 5-gram | 54.39 | 80.91 | 31.58
Word embedding | 1-gram | 50.42 | 79.62 | 40.00
Word embedding | 3-gram | 60.88 | 81.98 | 40.27
Word embedding | 5-gram | 53.70 | 80.19 | 40.12

It can also be observed that, in the word-based embedding model, 3-gram features give higher accuracy than the 1-gram and 5-gram context feature models, while in the character-based model 5-gram features give higher accuracy than 1-gram and 3-gram.

Tables 6, 7 and 8 show the performance on the Facebook, Twitter and WhatsApp Hindi-English code-mixed data; we can see that the F-score for the language labels E (English), H (Hindi) and NE (Named Entity) is better using word embedding.

From the performance of the data tabulated in Tables 6, 7 and 8, it is clearly seen that the 3-gram word embedding based model gives a better score than the other models. Table 6 holds the label-wise accuracy for the Twitter data, Table 7 for the Facebook data and Table 8 for the WhatsApp data.

It can be observed from the tables that the 3-gram word embedding model gives significantly better accuracy than the 1-gram and 5-gram word embeddings and the character embedding model, whereas for the character embedding model accuracy is better with the 5-gram features, most clearly for WhatsApp. This is because the system needs more context information to identify the language; that is why the 5-gram embedding gives a better result in the case of WhatsApp for the character embedding technique.

We visualize the representations learned by the RNN model through the word embeddings for a selected subset of words from the datasets. The result in Figure 8 maps the labels to colors indicating the seven parameters defined in Table 4. The color encoding is summarized as follows: red for label E, blue for label H, black for label NE, orange for label Other, purple for labels Ambiguous and Mixed, and yellow for label Unk (unrecognized word).

The proposed neural model gives a clearer separation between the different labeling parameters defined in Table 4, along with a crystal-clear separation between the languages Hindi and English used in the code-mixed dataset. This result shows that the model can be scaled to detect language in code-mixed data without any additional feature engineering.
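For reference, a minimal sketch of a bidirectional-LSTM sequence tagger of the kind evaluated here, assuming Keras (TensorFlow 2.x); the sequence length, vocabulary size and seven-tag output layer are illustrative assumptions based on Table 4:

from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM, N_TAGS = 50, 20000, 100, 7  # assumed sizes

words = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB_DIM, mask_zero=True)(words)

# Reading the sequence in both directions lets the tag for each word
# depend on the words that come before and after it.
ctx = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(ctx)

model = Model(words, tags)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")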


6 Conclusions

The intricacy of language identification in code-mixed and code-switched data is governed by the following: the data source, the code-switching and code-mixing manners, and the relation between the languages involved. We find that code mixing is heavily used in the social media context, as per the evaluation and experiments undertaken in this work. Code-mixing metrics help in identifying the code-mixing patterns across language pairs. By analyzing the code-mixing metrics, we conclude that Hindi and English words are often mixed in our dataset. It would be worthwhile to investigate the emerging trend of code switching and code mixing to draw conclusions about the behavioral patterns in data from different sources, such as song lyrics, chat data with different language sets, blog data, and scripts of plays or movies. We have implemented two different evaluation models, a statistical model and a neural learning model, and obtained competitive results for the identification of languages. This is probably due to the amount of training and testing data we have.

The results show that the word embeddings are capable of detecting the language separation by identifying the origin of a word and correspondingly mapping it to its language label. The BLSTM system performs better for the HIN-ENG language pair. The BLSTM model captures long-distance dependencies in a sequence, and this is in line with the observation made above for word-level language identification in code-mixed data, considering the context of the word belonging to the labeled languages. Scaling this system to identify other linguistic characteristics with different language datasets is a potential future direction to explore.

References

1. Weiscbedel, R., Carbonell, J., Grosz, B., Lehnert, W., Marcus, M., Perrault, R., & Wilensky, R. (1989). White paper on natural language processing. Association for Computational Linguistics, pp. 481–493.
2. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv:1408.5882.
3. Barman, U., Das, A., Wagner, J., & Foster, J. (2014). Code mixing: A challenge for language identification in the language of social media. EMNLP'14, Vol. 13, pp. 1–23.
4. King, L., Baucom, E., Gilmanov, T., Kübler, S., Whyatt, D., Maier, W., & Rodrigues, P. (2014). The IUCL+ system: Word-level language identification via extended Markov models. Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 102–106. DOI:10.3115/v1/W14-3912.
5. King, B. & Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1110–1119.
6. Nguyen, D. & Doğruöz, A.S. (2013). Word level language identification in online multilingual communication. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 857–862.
7. Vyas, Y., Gella, S., Sharma, J., Bali, K., & Choudhury, M. (2014). POS tagging of English-Hindi code-mixed social media content. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 974–979. DOI:10.3115/v1/D14-1105.
8. Das, A. & Gambäck, B. (2014). Identifying languages at the word level in code-mixed Indian social media text. ICON'14.
9. Sequiera, R., Choudhury, M., Gupta, P., Rosso, P., Kumar, S., Banerjee, S., Naskar, S., Bandyopadhyay, S., Chittaranjan, G., Das, A., & Chakma, K. (2015). Overview of FIRE'15 shared task on mixed script information retrieval. Proceedings of FIRE, Vol. 1587, pp. 19–25.
10. Jhamtani, H., Bhogi, S.K., & Raychoudhury, V. (2014). Word-level language identification in bi-lingual code-switched texts. 28th Pacific Asia Conference on Language, Information and Computation, pp. 348–357.
11. Ethiraj, R., Shanmugam, S., Srinivasa, G., & Sinha, N. (2015). NELIS - Named Entity and Language Identification System: Shared task system description. Proceedings of FIRE, Vol. 1587, pp. 43–46.
12. Bhargava, R., Sharma, Y., & Sharma, S. (2016). Sentiment analysis for mixed script Indic sentences. International Conference on Advances in Computing, Communications and Informatics, ICACCI'16, pp. 524–529. DOI:10.1109/ICACCI.2016.7732099.
13. Sharma, S., Srinivas, P., & Balabantaray, R. (2016). Emotion detection using online machine learning method and TLBO on mixed script. Language Resources and Evaluation Conference, Vol. 10, No. 5, pp. 47–51.
14. Barman, U., Wagner, J., & Foster, J. (2016). Part-of-speech tagging of code-mixed social media content: Pipeline, stacking and joint modeling. Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 30–39.
15. Bali, K., Jatin, S., Choudhury, M., & Vyas, Y. (2014). "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126.
16. Vyas, Y., Gella, S., Sharma, J., Bali, K., & Choudhury, M. (2014). POS tagging of English-Hindi code-mixed social media content. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 974–979.
17. Rao, P.R.K. & Devi, S. (2016). CMEE-IL: Code mix entity extraction in Indian languages from social media text @FIRE'16 - An overview. FIRE Workshops, Vol. 1737, pp. 289–295.
18. Remmiya-Devi, G., Veena, P.V., Anand-Kumar, M., & Soman, K.P. (2016). AMRITA-CEN@FIRE 2016: Code-mix entity extraction for Hindi-English and Tamil-English tweets. CEUR Workshop Proceedings, Vol. 1737, pp. 304–308.
19. Sapkal, K. & Shrawankar, U. (2016). Transliteration of secured SMS to Indian regional language. Procedia Computer Science, Vol. 78, pp. 748–755. DOI:10.1016/j.procs.2016.02.048.
20. Zubiaga, A., Vicente, I.S., Gamallo, P., Pichel, J.R., Alegria, I., Aranberri, N., & Fresno, V. (2015). TweetLID: A benchmark for tweet language identification. Language Resources and Evaluation, Vol. 50, No. 4, pp. 729–766. DOI:10.1007/s10579-015-9317-4.
21. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. DOI:10.1109/CVPR.2015.7298594.
22. LeCun, Y., Haffner, P., Bottou, L., & Bengio, Y. (1999). Object recognition with gradient-based learning. Shape, Contour and Grouping in Computer Vision, pp. 319–345.
23. Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, Vol. 9, No. 8, pp. 1735–1780. DOI:10.1162/neco.1997.9.8.1735.
24. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111–3119.
25. Jamatia, A. & Das, A. (2016). Task report: Tool contest on POS tagging for code-mixed Indian social media (Facebook, Twitter, and WhatsApp) text @ ICON 2016. International Conference on Natural Language Processing.
26. Banerjee, S., Chakma, K., Naskar, S., Das, A., Rosso, P., Bandyopadhyay, S., & Choudhury, M. (2016). Overview of the Mixed Script Information Retrieval (MSIR) at FIRE-2016. CEUR Workshop Proceedings, Vol. 1737, pp. 94–99. DOI:10.1007/978-3-319-73606-8_3.
27. Sequiera, R., Choudhury, P., Rosso, P., Kumar, S., Banerjee, S., Naskar, S., Bandyopadhyay, S., Chittaranjan, G., Das, A., & Chakma, K. (2015). Overview of FIRE'15 shared task on mixed script information retrieval. Post-proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation, Gandhinagar, Vol. 1587, pp. 19–25.
28. Srinidhi-Skanda, V., Singh, S., Remmiya-Devi, G., Veena, P.V., Anand-Kumar, M., & Soman, K.P. (2016). CEN@Amrita FIRE 2016: Context based character embeddings for entity extraction in code-mixed text. CEUR Workshop Proceedings, Vol. 1737, pp. 321–324.
29. Alekseev, A. & Nikolenko, S. (2017). Word embeddings for user profiling in online social networks. Computación y Sistemas, Vol. 21, No. 2. DOI:10.13053/cys-21-2-2734.
30. Shekhar, S., Sharma, D.K., & Beg, M.S. (2018). Hindi Roman linguistic framework for retrieving transliteration variants using bootstrapping. Procedia Computer Science, Vol. 125, pp. 59–67. DOI:10.1016/j.procs.2017.12.010.
31. Veena, P.V., Anand-Kumar, M., & Soman, K.P. (2018). Character embedding for language identification in Hindi-English code-mixed social media text. Computación y Sistemas, Vol. 22, No. 1. DOI:10.13053/cys-22-1-2775.

Article received on 20/02/2019; accepted on 25/07/2020. Corresponding author is Shashi Shekhar.

Computación y Sistemas, Vol. 24, No. 4, 2020, pp. 1415–1427. doi: 10.13053/CyS-24-4-3151
