Bangla News Recommendation Using doc2vec

This document summarizes a paper that proposes using doc2vec to recommend Bangla news articles based on their content. Doc2vec is a neural network technique that learns vector representations of documents from unlabeled text data. The authors trained doc2vec on a corpus of 300,000 Bangla news articles and found it performed better than LDA and LSA for news recommendation and information retrieval tasks. They believe doc2vec is well-suited for this task as it can capture semantic relationships between documents without extensive feature engineering or preprocessing.


Conference Paper · September 2018
DOI: 10.1109/ICBSLP.2018.8554679



International Conference on Bangla Speech and Language Processing (ICBSLP), 21-22 September, 2018

Bangla news recommendation using doc2vec


Rabindra Nath Nandi*, M.M.Arefin Zaman*, Tareq Al Muntasir, Sakhawat Hosain Sumit, Tanvir Sourov and Md. Jamil-Ur Rahman
Socian Ltd
Dhaka, Bangladesh
{rabindra, arefin, tareq, sumit, tanvir, jamil}@socian.ai

Abstract—We present a content-based Bangla news recommendation system using paragraph vectors, also known as doc2vec. doc2vec is a neural-network-driven approach that encapsulates a document representation in a low-dimensional vector. doc2vec can effectively capture semantic relationships between documents in a large collection of texts. We perform both qualitative and quantitative experiments on a large Bangla news corpus and show that doc2vec performs better than two popular topic modeling techniques, LDA and LSA. In the top-10 recommendation scenario, the suggestions from doc2vec are more contextually correct than those of both LDA and LSA. doc2vec also outperforms LDA and LSA on a human-generated triplet dataset with 91% accuracy, where LDA and LSA give 85% and 84% accuracy respectively.

Keywords—content-based, Bangla news recommendation, paragraph vectors, doc2vec, LDA, LSA.

I. INTRODUCTION

Digital information over the internet is growing exponentially day by day. With this tremendous amount of information, users face the information overload problem. News recommendation systems address this problem by providing relevant information quickly and efficiently to massive numbers of users [1].

Recommendation systems are mainly categorized into three types: (1) Collaborative Filtering, (2) Content-based Filtering and (3) Hybrid Systems [2]. The collaborative approach estimates a user's interest in an item by analyzing the preferences of other users who have already experienced the item. Because this approach does not consider the content and features of the item, a system that starts or bootstraps without any prior user information faces the cold-start problem. Content-based recommendation systems, on the other hand, try to recommend an item to a user based on the description of the item and, if user information is available, a profile of the user's interests [3]; they analyze item descriptions to identify items that are of particular interest to the user. Hybrid systems integrate both the collaborative and content-based ideas to build a more robust system that overcomes the drawbacks of these two approaches; most modern industrial recommendation systems are in fact hybrids.

The quality of a content-based recommendation system mainly depends on the representation of the content of the items and of the user profiles. Most item content is in textual format, such as documents, news articles or movie reviews, so extracting latent features from textual information is a major concern in this type of recommendation system. Topic modeling is an active research area in the document modeling domain that aims to find the topics of a set of documents and documents which are semantically and contextually similar. Several studies are available on topic or concept discovery techniques, e.g. Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and their variants, for document recommendation [4, 5].

Bangla is one of the most widely spoken languages in the world: about 250-300 million people speak Bangla as their first language. Due to the lack of sufficient research work and language modeling resources, developing an intelligent system such as a news recommender for Bangla is more challenging than for resource-rich languages such as English, Spanish and German. The authors in [13] proposed an ontology-based recommendation system for cross-lingual news, covering Bangla and English.

In this work, we use a recent idea from distributional semantics called document embedding using doc2vec [8], which is highly scalable and works without any preprocessing or feature extraction steps other than tokenization. doc2vec learns the semantics and compositionality of linguistic components using a deep learning architecture. This neural architecture is simple and reduces human effort significantly. It compresses the contextual and structural information of a whole document into a single fixed-length numeric vector. Although this approach is theoretically interesting and straightforward, the main challenge is that it needs a large amount of data to build a high-dimensional semantic space in which documents are placed well according to their latent representations. We have collected about 0.3 million news articles from different Bangla news websites to build a Bangla news corpus.

Document embedding models are trained on uncategorized articles, and the trained model can generate an embedding for an input article. This embedding can be used as a feature vector for document categorization, document ranking and information retrieval tasks.

We have evaluated the performance of the document vectors on information retrieval tasks against LDA, LSA and n-gram based feature extraction methods. We show that the document vectors perform better and are easier to deploy as an information retrieval system. We focus purely on content-based recommendation, since our crawled dataset does not contain any user information; a similar approach has been followed in [7].

*The authors contributed equally
978-1-5386-8207-4/18/$31.00 ©2018 IEEE

II. METHODS

In this section, we discuss the doc2vec model that is used for the recommendation task. Furthermore, we
briefly explain two popular topic modeling methods, namely LSA and LDA, which are also used for modeling unstructured texts.
A. doc2vec

doc2vec is an approach to learn a model that can generate an embedding for a given document [8]. Unlike some commonly used methods such as bag-of-words (BOW), n-gram models or averaging of word vectors, this method is generic and can be used to generate embeddings from texts of any length. It can be trained in a totally unsupervised fashion from large corpora of raw text without needing any task-specific labeled dataset. doc2vec performs particularly well at representing longer documents [12].

doc2vec is an extension of existing word embedding models. A well-known technique for learning word vectors is to predict a word given the other words in its context. For a sequence of training words w_1, w_2, w_3, ..., w_T, the objective of the model is to maximize the average log probability

    (1/T) * Σ_{t=k}^{T-k} log p(w_t | w_{t-k}, ..., w_{t+k})    (1)

The probability is calculated using the softmax function, which is defined as

    p(w_t | w_{t-k}, ..., w_{t+k}) = exp(y_{w_t}) / Σ_i exp(y_i)    (2)

where y_i is the unnormalized log-probability for output word i.

Paragraph vectors are jointly trained with the word vectors. At first, the paragraph vector and the word vectors are initialized randomly. While training the language model, both of these vectors learn a semantic representation of the sequence of sentences. The paragraph vector also contributes to the prediction task along with the word vectors.

Two kinds of frameworks have been proposed by Le and Mikolov to learn doc2vec [8]. These are:

(1) doc2vec with the distributed memory model;

(2) doc2vec with the distributed bag-of-words model.

In the distributed memory model, paragraphs and words are jointly trained using a stochastic gradient descent optimizer. Each paragraph is mapped to a unique fixed-dimension vector represented by a column in matrix D, and each word is represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. The model is shown in Fig. 1.

Fig. 1. The distributed memory model of doc2vec for a Bangla sentence

In this model, a paragraph vector is fixed for all samples generated by a sliding window from a document, and the word vectors are shared across all documents. The total number of learning parameters, excluding the softmax parameters, is N × p + M × q, where p is the length of the paragraph vector, q is the length of the word vector, N is the number of paragraphs, and M is the number of words in the vocabulary.

In the distributed bag-of-words model in Fig. 2, the model is trained strictly to predict words randomly sampled from a paragraph, and no context word is used as part of the input data. For each iteration, a text window is sampled and then a random word is chosen from that window to form a classification task given the paragraph vector. The basic difference from the distributed memory model is that this model needs fewer parameters and the model size is relatively small, but it cannot preserve word order. This model is similar to the word2vec skip-gram model.

Fig. 2. The distributed bag-of-words model of doc2vec for a Bangla sentence

B. Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a popular technique in distributional semantics that analyzes the semantic relationships between a set of documents by applying singular value decomposition to the term-document matrix [4]. LSA outputs a low-rank term-document representation in which similar documents and similar words are placed closer together. The similarity between two documents is computed as the cosine similarity between their corresponding column vectors and, in a similar way, the correlation between two words is computed from their corresponding row vectors. LSA captures some basic linguistic properties such as synonymy and polysemy.
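As a concrete illustration of the LSA pipeline just described, the following self-contained sketch applies a truncated SVD to a tiny hypothetical term-document count matrix (toy values, not the paper's data) and compares documents by cosine similarity:

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows = terms, cols = docs).
# Docs 0 and 1 share vocabulary; docs 2 and 3 share a different vocabulary.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional LSA vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents with overlapping vocabulary end up closer in the LSA space.
print(cosine(doc_vecs[0], doc_vecs[1]) > cosine(doc_vecs[0], doc_vecs[2]))
```

In practice one would build the matrix from the full corpus (possibly with tf-idf weighting) and pick k in the hundreds, as the paper does with 100 topics.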
C. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus used to extract its hidden structure and topics [5]. The key concept is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [9]. The LDA model projects documents into a topical embedding space; it generates a topic vector from a document, which can be used as the document's features.
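Whichever of the three models above produces the document vectors (doc2vec embeddings, LDA topic vectors, or LSA vectors — in the paper they are trained with gensim [14]), the recommendation step reduces to nearest-neighbor search by cosine similarity, and the triplet evaluation used later in the paper reduces to comparing two cosine similarities. A minimal sketch with toy 2-d vectors (function names and values are illustrative, not from the paper):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_top_k(query_vec, doc_vecs, k=10):
    """Return indices of the k documents most similar to the query vector."""
    sims = np.array([cosine(query_vec, d) for d in doc_vecs])
    return [int(i) for i in np.argsort(-sims)[:k]]

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) vector triplets where the
    anchor is closer to the positive than to the negative document."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy document vectors: doc 0 is the query, doc 1 is near it, doc 2 is not.
docs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
print(recommend_top_k(docs[0], docs, k=2))   # nearest neighbors first
print(triplet_accuracy([(docs[0], docs[1], docs[2])]))
```

The same two functions apply unchanged to 100-dimensional vectors from any of the three models, which is what makes the comparison in the experiments straightforward.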

III. EXPERIMENTAL STUDIES

This section describes the data acquisition techniques, validation parameters and performance evaluation of doc2vec and the other document modeling techniques.

A. Data Acquisition

We collected data from fifteen Bangla newspapers by running a news crawler for one week using Apache Nutch. The news corpus has more than 300,000 uncategorized articles, which have been used for training the paragraph vector model. We also collected 37,000 labeled news articles spanning 12 different categories for the purpose of evaluating recommendation performance.

In the preprocessing step, the HTML pages are cleaned using the python-goose library and further distilled using the beautifulsoup4 library. In the tokenization step, we first remove special characters (e.g. '।', '!', '?') and non-Bangla characters, and then a whitespace tokenizer is used for word segmentation. We do not apply any stop-word removal, as it loses structural information and makes context understanding difficult for the model.

B. Experimental Setup

Though the distributed bag-of-words model loses word order and should be inferior to the distributed memory model according to the original paper [8], our experiments show that, given enough training samples, it outperforms the distributed memory model. Therefore, our choice of architecture for training doc2vec is DBOW. These models have been implemented using the gensim library [14], one of the most popular Python libraries for text mining and statistical semantics. The model is trained using only the CPU.

C. Validation Parameters

The accuracy of the model is evaluated by measuring performance on the triplet dataset using Eqn. 3, where M is the total number of triplets, sim(A_i, P_i) is the similarity between the anchor and the positive document, and sim(A_i, N_i) is the similarity between the anchor and the negative document of the i-th triplet.

    Accuracy = (1/M) * Σ_{i=1}^{M} f(i),  where f(i) = 1 if sim(A_i, P_i) > sim(A_i, N_i) and 0 otherwise    (3)

The accuracy is thus the fraction of triplets for which the similarity between the anchor and the positive document is higher than that between the anchor and the negative document.

D. Results and Discussion

We present both qualitative and quantitative experiments for a better understanding of model performance.

First, the model is trained on our uncategorized news articles and the trained vectors are visualized using t-SNE [10]. t-SNE (t-Distributed Stochastic Neighbor Embedding) is a visualization technique for large-scale high-dimensional data. It projects high-dimensional data into a low-dimensional space while capturing local and global structure, resulting in a map in which clusters of similar data points are placed nearby. From the visualization map in Fig. 3, it can be seen that articles from the same category are clustered together.

Fig. 3. t-SNE visualization of Bangla news embeddings using doc2vec

Next, we qualitatively look at the nearest neighbors of news articles in the document embedding space. We generate top-ten recommendations for different query documents to test the models, and we find that doc2vec gives more acceptable recommendations than LDA and LSA. As an example, we tabulate the results for a document titled 'pিতমা িবসজর্ েন েশষ হেলা দুেগর্াৎসব' in Table I. The query article is related to 'Durga Puja', an annual festival of the Hindu religion, and the recommended articles are expected to be related to this festival. All of the top ten recommendations by doc2vec are closely related to the event of 'Durga Puja'.

On the other hand, the first recommendation from LDA is titled 'আকােশ uড়েলা ফানুস, নদীেত ভাসেলা েনৗকা', which is related to 'Buddha Purnima'; it is not expected as the first recommendation because there is plenty of news related to 'Durga Puja'. Another recommendation of LDA is 'ঐিতহয্ আর সmpীিতেত পিবt আ রা পািলত ৈসয়দপুের', which is related to 'Ashura', one of the major religious days of Islam. This recommendation is correct in the sense that it is religious news, but not relevant enough to be in the top-10 recommendations. The clearly incorrect recommendation of LDA is 'সেmলেনর psিত চেল রাতভর', which is about a totally different event.

LSA makes even more mistakes than LDA for this query. The articles recommended by LSA titled ' মায়ূন আহেমেদর জnিদন আজ', ' d মুহmদ শিহদুlাহ'র তম জn বািষর্কী আজ' and 'হ ◌ু মায়ূন আহেমেদর জnিদন আজ' are totally different from the query article. These unexpected recommendations are
TABLE I. TITLES OF TOP 10 RECOMMENDATIONS FOR QUERY ARTICLE “pিতমা িবসজর্ েন েশষ হেলা দুেগর্াৎসব” BY DOC2VEC, LDA AND LSA

doc2vec LDA LSA


kমারী পূজায় জনেজায়ার আকােশ uড়েলা ফানুস, নদীেত ভাসেলা েনৗকা মহা মী o kমারীপূজা আজ

ম েপ ম েপ হাহাকার, চলেছ pিতমা িবসজর্ েনর psিত ম েপ ম েপ হাহাকার, চলেছ pিতমা িবসজর্ েনর psিত ফিরদপুের রামকৃ িমশন আ েম
ঁ র
িসদ ু েখলায় েমেত uঠেলা কলকাতা দুেগর্াৎসেবর মহানবমী আজ pিতমা িবসজর্ েন েশষ হেলা দুেগর্াৎসব
মহা মী o kমারীপূজা আজ সাভাের আনn ulােস মহা মী পািলত আগরতলায় kমারী পূজায় পূণয্াথর্ীর ঢল
ঐিতহয্ আর সmpীিতেত পিবt আ রা পািলত kি য়ার েছঁ uিড়য়ায় লালন sরণ utসব আজ
বৃnাবেনর িবকl দুবলার চেরর রাস েমলা
ৈসয়দপুের েথেক
সাmpদািয়ক সmpীিতর িচরnন kমারী পূজায় জনেজায়ার pিতমা িবসজর্ েনর মাধয্েম েশষ হেলা শারদীয় uৎসব
েদবী দুগার্ িবসজর্ েন ম প েলােত িবষােদর সুর pিতমা িবসজর্ েনর মাধয্েম েশষ হেলা শারদীয় uৎসব মায়ূন আহেমেদর জnিদন আজ
দুগির্ তনািশনী েদবী দুগার্ সেmলেনর psিত চেল রাতভর d মুহmদ শিহদুlাহ'র তম জn বািষর্ কী আজ
ঝালকািঠেত দুেগর্াৎসব uপলেk ঐিতহয্বাহী ‘দশহরার
বাnরবােন দুগার্ pিতমা িবসজর্ ন তারায় তারায় িমেলেছ রং-েবরংেয়র ফানুস
েমলা
নানা রঙ আর ৈবিচtয্ বষর্ণ o ধমর্ীয় uৎসবমুখর পিরেবেশ pিতমা িবসজর্ ন হুমায়ূন আহেমেদর জnিদন আজ
We see that doc2vec gives 91.0% accuracy which is better
than other methods. LDA and LSA give 85%, 84%
about the birthday of two popular writers of Bangladesh.
accuracy respectively. We can conclude that doc2vec can
Another recommendation by LSA ‘kি য়ার েছঁ uিড়য়ায় লালন
capture the semantic similarity between Bangla news
documents with long text more efficiently than other
sরণ utসব আজ েথেক' which is related to a prominent
baseline methods.
Bengali philosopher and poet ‘Lalon Shah’.
The recommendations by doc2vec are significantly better
than both LDA and LSA model for this query as its top 10 IV. CONCLUSION
recommendations are all about ‘Durga Puja’.

Content-based news recommendation system is


We perform a quantitative evaluation to measure how well strongly connected to the semantic relevance measure of
doc2vec learned semantic representation using a triplet news articles. We develop a Bangla news recommender
dataset as described in [6]. system using doc2vec. The most attractive part of this
model is its language-independent learning and adaptation
capability to a large corpus. With a sufficient amount of
The triplets were generated manually by human from a data, it can learn a lot of intrinsic property, structural maps
Bangla news corpus. The dataset consists of 330 triplets. and semantic variations of a language.
Each triplet (anchor, positive, negative) is chosen by
carefully analyzing the content of the three articles. The Our experiments show that doc2vec surpasses two
first two documents are not only from the same category popular topic modeling techniques; LDA and LSA for
of the newspaper (e.g. entertainment, sports) but also they building a content-based Bangla news recommendation
are semantically equivalent. The negative document is system. This model can also be used for many different
from a different category and also contextually different applications like Bangla news clustering, news
from the anchor article. The objective of triplet evaluation summarization and question answering.
is to explore the performance of a model to ensure high
relevance between similar document pairs (anchor,
positive) than (anchor, negative) pairs. The model REFERENCES
accuracy is counted by using the Eqn. 3 which depends on
the number of triplets for which the model outputs a high [1] G. Adomavicius and A. Tuzhilin, “Toward the next generation
similarity score between the anchor and positive of recommender systems: A survey of the state-of-the-art and
possible extensions”, IEEE Transaction on Knowledge and
document. Data Engineering, Volume 17, Issue 6, pp.734-749, 2005.
The experimental result on triplet dataset is given in Table [2] M. Balabanović and Y. Shoham, “Fab: content-based, collaborative
recommendation”, Communications of the ACM, vol. 40, issue 3,
II. We extract 100 topics from both LDA and LSA to use 1997, pp. 66-72.
feature size same for all models. [3] M. Madhukar, Challenges & Limitation in Recommender Systems.
International Journal of Latest Trends in Engineering and
TABLE II: PERFORMANCES OF DIFFERENT METHODS ON HUMAN- Technology (IJLTET), vol. 4, issue 3, pp. 138-142, September
GENERATED TRIPLETS 2014.
Embedding Size/ [4] P. Velvizhi, S. Aishwarya and R. Bhuvaneswari, "Ranking of
Model Accuracy Document Recommendations from Conversations using
Topics
Probabilistic Latent Semantic Analysis", International Conference
Bag-of-words N/A 67%
on Innovations in Engineering and Technology (ICIET), 2016:
LSA 100 84% 133-138.
LDA 100 85% [5] T. Chang and W. Hsiao, “LDA-based Personalized Document
doc2vec 100 91.0% Recommendation”, PACIS 2013 Proceedings, 2013.
[6] A. M. Dai, C. Olah, Q. V. Le, and G. S. Corrado, “Document
embedding with doc2vec.” NIPS Deep Learning Workshop, 2014.
[7] Md. N. M. Adnan, M. R. Chowdury, I. Taz, T. Ahmed and R. M [12] J. H. Lau and T. Baldwin, “An Empirical Evaluation of doc2vec
Rahman, “Content Based News Recommendation System Based on with Practical Insights into Document Embedding Generation”,
Fuzzy Logic”, 3rd International Conference on Informatics, Proceedings of the 1st Workshop on Representation Learning for
Electronics & Vision, 2014. NLP, pp. 78–86,, 2016.
[8] Q. V. Le and T. Mikolov. “Distributed representations of sentences [13] S. N. Ferdous and M. M. Ali, ‘A Semantic Content Based
and documents”, International Conference on Machine Learning, Recommendation System for Cross-Lingual News’, International
2014. Conference on Imaging, Vision & Pattern Recognition, 2017.
[9] D. M. Beli, A. Y. Ng and M. I. Jordan, “Latent dirichlet [14] R. Rehurek and P. Sojka, ‘Software Framework for Topic
allocation”, Journal of machine Learning research, vol. 3, pp. 993- Modelling with Large Corpora’, Proceeding of the LREC 2010
1022, 2003. Workshop on New Challenges for NLP Frameworks, 2010.
[10] L. J. P. van der Maaten and G. E. Hinton. “Visualizing high-
dimensional data using t-SNE”. Journal of Machine Learning
Research, 2008.
[11] S. T. Dumais. ‘’Latent Semantic Analysis’’. Annual Review of
Information Science and Technology. vol. 38, pp. 188–230, 2005.

