An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
1 GLA University,
Abstract. This paper describes the application of the code-mixing index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. In Natural Language Processing, transliteration is one of the important and relatively less mature areas. During transliteration, issues such as language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used on social media today is often code-mixed text, i.e., a mixture of two or more languages. In code-mixed data, one language is written using the script of another. To process such text, identifying the language of each word is therefore important. The major contribution of this work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features were used to represent each word. Much research has recently been done in the field of language identification, but word-level language identification in a transliterated environment remains an open research issue for code-mixed data. We have implemented a deep learning model based on BLSTM that predicts the language of origin of a word based on the specific words that have come before it in the sequence. A multichannel neural network combining CNN and BLSTM is used for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi are used, combined with cBoW and Skip-gram for evaluation. The BLSTM context capture module of the proposed system gives better accuracy for the word embedding model than for character embedding when evaluated on our two testing sets. The problem is modeled collectively with the deep learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.

Keywords. Language identification, transliteration, character embedding, word embedding, NLP, machine learning.

1 Introduction

Humans use natural language as their medium of communication. Natural Language Processing (NLP) is an area of Artificial Intelligence in which machines are trained to understand and process text in order to make human-computer interaction more efficient. Applications of NLP span several fields, such as machine translation, text processing, and entity extraction [1]. A large amount of data is now available on the Web as text. With the emergence of several social media platforms and the large amount of text data available on them, NLP plays a major role in understanding and generating data today. Social media platforms are widely used by people to discuss interests, hobbies, reviews of products, movies, and so on.
In earlier days, the language used on such platforms was purely English. Today, mixing multiple languages together is a common phenomenon. Such language is called code-mixed language. An example of Hindi-English code-mixed text is given in the sentences below:

Sentence 1: GLA University ka course structure kaisa hai:

NE/OOV E H E E H H

Sentence 2: Aray Friend, ek super idea hai mere paas:

H E H E E H H H

Here Hindi words are labeled H, English words are labeled E, and named entities are labeled NE. We can observe from the example that the Hindi words, tagged as H, are written in Roman script rather than in Devanagari Unicode characters. The example is given only to convey the idea of code-mixed data.

The paper presents a novel architecture that captures information at both the word level and the context level to output the final language tag for each word. For the word level, we use a multichannel neural network (MNN) inspired by recent work in computer vision. Such networks have also shown promising results in NLP tasks such as sentence classification [2]. For context capture, we use a Bi-LSTM-CRF. The context module was tested more rigorously because this information has been sidelined or ignored in quite a few previous works. We have experimented on Hindi-English (H-E) code-mixed data. Hindi is the most widely spoken language of India. Here, Hindi words are written in Roman transliterated form using the English alphabet.

For processing monolingual text, the primary step would be Part-Of-Speech (POS) tagging of the text. However, in the case of social media text, the primary feature to be considered is the identification of the language, particularly for code-mixed text [3]. The language identification for code-mixed text proposed in this paper is implemented using word embedding models. The term word embedding refers to a vector representation of the given data that captures the semantic relations between the words in the data. The approach is general, because only word embedding features are considered and the system can therefore be extended to other NLP applications. The work involves features obtained from two embedding models, word-based embedding and character-based embedding. The performance of the two models with the addition of contextual information is compared in this paper. Machine learning based classification [4] is used for training and testing the system.

A framework for discovering user intent in Roman-transliterated Hindi through word-level language identification is addressed here. The remainder of the paper is organized as follows. Section 2 gives an overview of related work on language identification in the multilingual domain. Section 3 discusses the proposed methodology, considering word embedding and character embedding methods. Section 4 describes the dataset. Section 5 describes the experimental evaluation and the results obtained. Section 6 analyses the inferences drawn from the work and points towards future work.

2 Related Research

In this section, some recent techniques for language transliteration and identification are listed and reviewed. Code-switching and code-mixing are a current research area in the field of language tagging. Language Identification (LID) is a primary task in many text processing applications, and hence several
research efforts are ongoing in this area, especially for code-mixed data. King and Abney [5] used semi-supervised methods to build a word-level language identifier. Nguyen and Doğruöz [6] used a CRF model limited to bigrams for identifying the language. Logistic regression, along with a module that gives the code-switching probability, was used by Vyas et al. [7]. Das and Gambäck [8] used various features, such as a dictionary, n-grams, edit distance and word context, to identify the origin of a word.

A shared task on Mixed Script Information Retrieval (MSIR) was conducted in 2015, in which one subtask was language identification for 8 code-mixed Indian languages, Telugu, Tamil, Marathi, Bangla, Gujarati, Hindi, Kannada, and Malayalam, each mixed with English [9].

The MSIR language identification task was implemented using a machine learning based SVM classifier, obtaining an accuracy of 76% [16]. Word-level language identification was performed for English-Hindi using supervised methods in [10]. A Naive Bayes classifier was used to identify the language of Hindi-English data, and an accuracy of 77% was obtained [11].

Language identification is also performed as a preliminary step in several other applications. The authors of [12] implemented a sentiment analysis system that used the MSIR 2015 English-Tamil, English-Telugu, English-Hindi, and English-Bengali code-mixed datasets. An emotion detection system was developed for Hindi-English data with machine learning based and Teaching Learning Based Optimization (TLBO) techniques [13]. Part-of-Speech tagging was done for an English-Bengali-Hindi corpus, including the language identification step, in [14].

Since code-mixed script is a common trend in social media text today, much research is under way on information extraction from such text. An analysis of the behavior of a code-mixed Hindi-English Facebook dataset was done in [15]. POS tagging was performed on code-mixed social media text in Indian languages [16].

A shared task was organized for entity extraction on code-mixed Hindi-English and Tamil-English social media text [17]. Entity extraction for the code-mixed Hindi-English and Tamil-English dataset was performed with embedding models [18].

Sapkal et al. [19] proposed an approach for SMS, which is meant for communicating with others in minimal words. Regional-language messages are typed using the English alphabet due to the lack of regional keywords. This SMS language may vary, which leads to miscommunication. The focus was on transliterating short forms to full forms. Zubiaga et al. [20] described language identification as the task of determining the language of a given text. However, certain issues, such as quantifying the individuality of similar languages in multilingual documents and analyzing the language of short texts, remain unresolved. Alekseev et al. [29] consider word embeddings an efficient feature and proposed entity extraction for user profiling using word-embedding features. The section below describes the proposed methodology to address the research gap identified in the area of transliterated code-mixed data.

3 Proposed Methodology for Language Identification

Code-mixed data combine a native script (familiar language) with a non-native script (unfamiliar language). Because of this combination, a large number of complications arise when dealing with such data. Language identification is the foremost problem in code-mixed data, since not every user is familiar with every language. The language identification problem arises when text is written in different languages. It includes problems such as script specification, i.e., the possibility of different scripts between the source and target languages.

The proposed system comprises two modules. The first is a multichannel neural network trained at the word level, while the second is a simple bidirectional LSTM trained at the context level. The second module takes its input from the first module, along with some other features, to produce the language tag.

The proposed architecture, illustrated in Figure 2, is inspired by [30, 31, 21], the recent deep neural architectures developed for image classification tasks. The proposed model uses a very similar concept for learning the language at the word level. This is because the architecture allows the network to capture representations of different types, which can be really helpful for NLP tasks for identifying the origin of a word in context to the language used
identify the language as both Hindi and English are written using the same character set.

Algorithm 1. Proposed Algorithm for Language Identification

Input: Code-mixed data
Output: Language of the input word

Algorithm steps:
1. Input a term from the test document. Let D = W_1, W_2, ..., W_n be a document, where the W_i are its words.
2. Consider the letters {a-z} of each word.
3. Word-level tagging task based on vectors:
   3.1 The label Lb(W_i) is chosen from the language set L, where L = {L_E, L_H, L_O}. Check the frequencies of the characters in W_i and W_j.
   3.2 Generate character vectors for W_i and W_j.
   3.3 Apply the similarity metric:

   \[ \mathrm{Sim}(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}} \]

4. Label the word as E (English) or H (Hindi).
5. Check the confidence score Conf_Score of the classifier for language L_j on input W_i, with 0 ≤ Conf_Score ≤ 1, where Conf_Score is the similarity metric sim(W_x, W_y) and x and y can be any words in the string:
   sim(x, y) ∈ [0, 1] after normalization
   sim(x, y) = 1: exact match
   sim(x, y) = 0: x and y are completely different
   0 < sim(x, y) < 1: approximate similarity
   Threshold value = 1 for an exact match:
   = 1 matches L_E
   < 1 matches L_H or L_O, based on the list condition LoW
   If L_E matches LoW, then L = L_O
6. Classify the word as E, H or O.

4 Dataset Descriptions

The dataset used for this work is obtained from the POS tagging task for Hindi-English code-mixed social media text conducted at ICON 2016 [25]. The dataset contains text from three social media platforms, namely Facebook, Twitter and WhatsApp. The training data provided contain the tokens of the dataset with their corresponding language tags and POS tags.

Table 1. Dataset ICON 2016 [25]

            No. of Sentences        No. of Tokens
Data        Training   Testing      Training   Testing
Facebook    772        111          20,615     2,167
Twitter     1,096      110          17,311     2,163
WhatsApp    763        219          3,218      802

The dataset used here for language identification is the Indian language corpora used in the FIRE 2014 (Forum for IR Evaluation) shared task on transliterated search. The data used for training the classifier consist of bilingual documents containing English and Hindi words in Romanized script for Bollywood song lyrics.

The complete database of songs consists of 63,000 documents in the form of text files (dataset of FIRE MSIR). The table below shows a sample of the dataset with various transliterated variations of non-English words, and a second sample of mixed-script data containing English and transliterated Hindi words.

Table 2. Sample data

Data sample
amir se hoti hai, garib se hotii hai
door se hotee hai, qarib se hoti hai
magar jahaan bhi hoti hai, ai mere dost
shaadiyaan to naseeb se hoti hai

Mixed Script Data sample
Party abhi baaki hai…………..
Party abhee baaki hai………………………..
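Steps 3.1 to 5 of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy English lexicon, and the way the threshold rule is read (a best similarity of 1 against the English list gives E, anything lower gives H, and tokens without letters or on an explicit "other" list give O) are assumptions made for illustration, since the listing leaves the list condition LoW unspecified.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase  # step 2: letters a-z


def char_vector(word):
    """Step 3.2: 26-dimensional character-frequency vector of a word."""
    counts = Counter(ch for ch in word.lower() if ch in ALPHABET)
    return [counts[ch] for ch in ALPHABET]


def cosine_similarity(x, y):
    """Step 3.3: Sim(X, Y) = sum(X_i*Y_i) / (sqrt(sum X_i^2) * sqrt(sum Y_i^2))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_sq = sum(a * a for a in x) * sum(b * b for b in y)
    if norm_sq == 0:
        return 0.0  # a token without letters is treated as completely different
    return dot / math.sqrt(norm_sq)


def label_word(word, english_lexicon, other_words=frozenset(),
               threshold=1.0, tol=1e-9):
    """Steps 4-6 (assumed reading): return (label, Conf_Score).

    Conf_Score is the best similarity between the word and any entry of the
    English list; a score reaching the threshold is labelled E, otherwise H.
    Tokens with no a-z letters, or listed in other_words, are labelled O.
    """
    vec = char_vector(word)
    if not any(vec) or word.lower() in other_words:
        return "O", 0.0
    conf_score = max(
        (cosine_similarity(vec, char_vector(ref)) for ref in english_lexicon),
        default=0.0,
    )
    label = "E" if conf_score >= threshold - tol else "H"
    return label, conf_score


# Hypothetical toy lexicon, for illustration only.
TOY_ENGLISH_LEXICON = ["party", "idea", "super"]

for token in ["Party", "abhi", "baaki", "hai", "@#"]:
    print(token, label_word(token, TOY_ENGLISH_LEXICON))
```

Because character-frequency vectors ignore character order, two anagrams also reach a similarity of 1 in this sketch, so an "exact match" here means identical letter counts rather than identical spelling.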
characters and numbers, for example: @, #, 0-9.
5. Ambiguous indicates words used ambiguously in Hindi and English, for example: is, to, us.
6. Mixed indicates words that combine Hindi, English and numbers, for example: MadamJi, Sirji.
7. Unk indicates unrecognized words, for example: t.M , @.s,Ss.

All seven tags are present in the Facebook dataset, whereas only 'E', 'H', 'NE' and 'Other' are present in the Twitter and WhatsApp data. The size of the training and testing data is summarized in Table 4. From the table, it can be observed that the average number of tokens per comment in the WhatsApp training and testing data is much lower than in the Facebook and Twitter data. This may be because the Facebook and Twitter data mostly contain news articles and comments, which raise the average token count per comment, while the WhatsApp data contain short conversational messages.

Table 4. Description of the labels for Hindi-English dataset

Label       Description                                 Hindi-English %
E           English words only                          57.76
H           Hindi words only                            20.41
NE          Named Entity                                6.59
Other       Symbols, Emoticons                          14.8
Ambiguous   Can't determine whether Hindi or English    0.27
Mixed       Word of Hindi English in combination        0.08
Unk         Unrecognized word                           0.09

To generate the embedding vectors, more data have to be provided so that the distributional similarity of the data can be captured effectively. The additional data are given to the embedding model together with the training data. The additional Hindi-English code-mixed data were collected from the shared tasks on Mixed Script Information Retrieval (MSIR) conducted in 2016 [26] and 2015 [27], and from the shared task on Code-Mix Entity Extraction conducted by the Forum for Information Retrieval and Evaluation (FIRE) in 2016 [28]. Most of the data collected for embedding are Hindi-English code-mixed Twitter data. The size of the dataset used for embedding is given in the table below.

Table 5. Embedding dataset

Number of sentences in the dataset used for embedding (Facebook, Twitter and WhatsApp)
ICON2016     2631
MSIR 2015    2700
MSIR 2016    6139

Table 6. F measure obtained for Twitter

Embedding Type        n-gram    E       H       NE
Character Embedding   1 gram    84.95   93.31   78.38
                      3 gram    85.34   93.44   77.12
                      5 gram    85.38   93.49   80.27
Word Embedding        1 gram    65.86   82.96   62.22
                      3 gram    85.71   93.97   83.94
                      5 gram    85.42   93.16   78.15
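As a quick check of the tokens-per-comment observation made earlier in this section, the averages can be recomputed from the sentence and token counts reported for the ICON 2016 data in Table 1. The sketch below assumes that one sentence in that table corresponds to one comment; the dictionary layout and function name are illustrative only.

```python
# (sentences, tokens) per platform and split, copied from Table 1 (ICON 2016).
ICON_2016_COUNTS = {
    "Facebook": {"train": (772, 20615), "test": (111, 2167)},
    "Twitter": {"train": (1096, 17311), "test": (110, 2163)},
    "WhatsApp": {"train": (763, 3218), "test": (219, 802)},
}


def average_tokens(counts):
    """Average number of tokens per sentence for every platform and split."""
    return {
        platform: {
            split: tokens / sentences
            for split, (sentences, tokens) in splits.items()
        }
        for platform, splits in counts.items()
    }


AVERAGES = average_tokens(ICON_2016_COUNTS)
for platform, splits in AVERAGES.items():
    print(f"{platform}: train {splits['train']:.2f}, test {splits['test']:.2f}")
```

With these counts, WhatsApp averages about 4 tokens per sentence, against roughly 16 to 27 for Twitter and Facebook, which is consistent with the observation in the text.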
Context appending was done for each of the Facebook, Twitter and WhatsApp training and test sets. These were given to the learning model for training and testing. The cross-validation accuracies obtained for Facebook, Twitter, and WhatsApp with 1-gram, 3-gram and 5-gram features for the character-based and word-based embedding models are presented in the section below.

Table 7. F measure obtained for Facebook

Embedding Type        n-gram    E       H       NE
Character Embedding   1 gram    85.65   92.92   64.95
                      3 gram    86.45   93.36   65.02
                      5 gram    85.47   92.55   65.05
Word Embedding        1 gram    85.02   92.03   62.80
                      3 gram    86.99   93.51   67.21
                      5 gram    85.15   92.47   61.03

When comparing the overall accuracy obtained for Facebook, Twitter, and WhatsApp, we can see
this work. Code-mixing metrics help in identifying code-mixing patterns across language pairs. By analyzing the code-mixing metrics, we conclude that Hindi and English words are often mixed in our dataset. It would be worthwhile to investigate the emerging trend of code switching and code mixing in order to draw conclusions about behavioral patterns in data from different sources, such as song lyrics, chat data with different language sets, blog data, and scripts of plays or movies. We have implemented two different evaluation models, a statistical model and a neural learning model, and obtained competitive results for the identification of languages. This is probably due to the amount of training and testing data we have.

The results show that word embeddings are capable of detecting the language separation by identifying the origin of a word and mapping it to its language label. The BLSTM system performs better for the HIN-ENG language pair. The BLSTM model captures long-distance dependencies in a sequence, which is in line with the observation made above for word-level language identification in code-mixed data, where the context of the word is considered. Scaling this system to identify other linguistic characteristics with datasets in different languages is a potential future direction.

References

1. Weischedel, R., Carbonell, J., Grosz, B., Lehnert, W., Marcus, M., Perrault, R., & Wilensky, R. (1989). White paper on natural language processing. Association for Computational Linguistics, pp. 481–493.
2. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv:1408.5882.
3. Barman, U., Das, A., Wagner, J., & Foster, J. (2014). Code mixing: A challenge for Language Identification in the Language of Social Media. EMNLP'14, Vol. 13, pp. 1–23.
4. King, L., Baucom, E., Gilmanov, T., Kübler, S., Whyatt, D., Maier, W., & Rodrigues, P. (2014). The IUCL+ System: Word-Level Language Identification via Extended Markov Models. Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 102–106. DOI:10.3115/v1/W14-3912.
5. King, B. & Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1110–1119.
6. Nguyen, D. & Doğruöz, A. S. (2013). Word level language identification in online multilingual communication. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 857–862.
7. Vyas, Y., Gella, S., Sharma, J., Bali, K., & Choudhury, M. (2014). POS tagging of English-Hindi code-mixed social media content. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 974–979. DOI:10.3115/v1/D14-1105.
8. Das, A. & Gambäck, B. (2014). Identifying languages at the word level in code-mixed Indian social media text. ICON'14.
9. Sequiera, R., Choudhury, M., Gupta, P., Rosso, P., Kumar, S., Banerjee, S., Naskar, S., Bandyopadhyay, S., Chittaranjan, G., Das, A., & Chakma, K. (2015). Overview of FIRE'15 Shared Task on Mixed Script Information Retrieval. Proceedings of FIRE, Vol. 1587, pp. 19–25.
10. Jhamtani, H., Bhogi, S. K., & Raychoudhury, V. (2014). Word-level language identification in bi-lingual code-switched texts. 28th Pacific Asia Conference on Language, Information and Computation, pp. 348–357.
11. Ethiraj, R., Shanmugam, S., Srinivasa, G., & Sinha, N. (2015). NELIS - Named Entity and Language Identification System: Shared Task System Description. Proceedings of FIRE, Vol. 1587, pp. 43–46.
12. Bhargava, R., Sharma, Y., & Sharma, S. (2016). Sentiment Analysis for Mixed Script Indic Sentences. International Conference on Advances in Computing, Communications and Informatics, ICACCI'16, pp. 524–529. DOI:10.1109/ICACCI.2016.7732099.
13. Sharma, S., Srinivas, P., & Balabantaray, R. (2016). Emotion Detection using Online Machine Learning Method and TLBO on Mixed Script. Language Resources and Evaluation Conference, Vol. 10, No. 5, pp. 47–51.
14. Barman, U., Wagner, J., & Foster, J. (2016). Part-of-speech tagging of code-mixed social media content: Pipeline, stacking and joint modeling. Proceedings of the Second Workshop on
Computational Approaches to Code Switching, pp. 30–39.
15. Bali, K., Jatin, S., Choudhury, M., & Vyas, Y. (2014). I am borrowing ya mixing? An Analysis of English-Hindi Code Mixing in Facebook. Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126.
16. Vyas, Y., Gella, S., Sharma, J., Bali, K., & Choudhury, M. (2014). POS tagging of English-Hindi Code-Mixed Social Media Content. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 974–979.
17. Rao, P.R.K. & Devi, S. (2016). CMEE-IL: Code Mix Entity Extraction in Indian Languages from Social Media Text@FIRE'16 - An Overview. FIRE Workshops, Vol. 1737, pp. 289–295.
18. Remmiya-Devi, G., Veena, P.V., Anand-Kumar, M., & Soman, K. P. (2016). AMRITA-CEN@FIRE 2016: Code-mix Entity Extraction for Hindi-English and Tamil-English tweets. CEUR Workshop Proceedings, Vol. 1737, pp. 304–308.
19. Sapkal, K. & Shrawankar, U. (2016). Transliteration of Secured SMS to Indian Regional Language. Procedia Computer Science, Vol. 78, pp. 748–755. DOI:10.1016/j.procs.2016.02.048.
20. Zubiaga, A., Vicente, I.S., Gamallo, P., Pichel, J.R., Alegria, I., Aranberri, N., & Fresno, V. (2015). TweetLID: A benchmark for tweet language identification. Language Resources and Evaluation, Vol. 50, No. 4, pp. 729–766. DOI:10.1007/s10579-015-9317-4.
21. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. DOI:10.1109/CVPR.2015.7298594.
22. LeCun, Y., Haffner, P., Bottou, L., & Bengio, Y. (1999). Object recognition with gradient-based learning. Shape, Contour and Grouping in Computer Vision, pp. 319–345.
23. Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, Vol. 9, No. 8, pp. 1735–1780. DOI:10.1162/neco.1997.9.8.1735.
24. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. pp. 3111–3119.
25. Jamatia, A. & Das, A. (2016). Task Report: Tool Contest on POS Tagging for Code-mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text@ ICON 2016. International Conference on Natural Language Processing.
26. Banerjee, S., Chakma, K., Naskar, S., Das, A., Rosso, P., Bandyopadhyay, S., & Choudhury, M. (2016). Overview of the Mixed Script Information Retrieval (MSIR) at FIRE-2016. CEUR Workshop Proceedings, Vol. 1737, pp. 94–99. DOI:10.1007/978-3-319-73606-8_3.
27. Sequiera, R., Choudhury, M., Gupta, P., Rosso, P., Kumar, S., Banerjee, S., Naskar, S., Bandyopadhyay, S., Chittaranjan, G., Das, A., & Chakma, K. (2015). Overview of FIRE'15 Shared Task on Mixed Script Information Retrieval. Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation, Gandhinagar, Vol. 1587, pp. 19–25.
28. Srinidhi-Skanda, V., Singh, S., Remmiya-Devi, G., Veena, P.V., Anand-Kumar, M., & Soman, K. P. (2016). CEN@ Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text. CEUR Workshop Proceedings, Vol. 1737, pp. 321–324.
29. Alekseev, A. & Nikolenko, S. (2017). Word embeddings for user profiling in online social networks. Computación y Sistemas, Vol. 21, No. 2. DOI:10.13053/cys-21-2-2734.
30. Shekhar, S., Sharma, D.K., & Beg, M.S. (2018). Hindi Roman Linguistic Framework for Retrieving Transliteration Variants using Bootstrapping. Procedia Computer Science, Vol. 125, pp. 59–67. DOI:10.1016/j.procs.2017.12.010.
31. Veena, P.V., Anand-Kumar, M., & Soman, K. P. (2018). Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text. Computación y Sistemas, Vol. 22, No. 1. DOI:10.13053/cys-22-1-2775.

Article received on 20/02/2019; accepted on 25/07/2020.
Corresponding author is Shashi Shekhar.