Bangla News Recommendation Using doc2vec
Abstract—We present a content-based Bangla news recommendation system using paragraph vectors, also known as doc2vec. doc2vec is a neural-network-driven approach that encapsulates a document's representation in a low-dimensional vector. doc2vec can effectively capture semantic relationships between documents in a large collection of texts. We perform both qualitative and quantitative experiments on a large Bangla news corpus and show that doc2vec performs better than two popular topic modeling techniques, LDA and LSA. In the top-10 recommendation scenario, the suggestions from doc2vec are more contextually correct than those of both LDA and LSA. doc2vec also outperforms LDA and LSA on a human-generated triplet dataset with 91% accuracy, where LDA and LSA give 85% and 84% accuracy, respectively.

Keywords—content-based, Bangla news recommendation, paragraph vectors, doc2vec, LDA, LSA.

*The authors contributed equally.

I. INTRODUCTION

Digital information over the internet is growing exponentially day by day. With this tremendous amount of information, users face the problem of information overload. News recommendation systems solve this problem by delivering information quickly and efficiently to massive numbers of users [1].

Recommendation systems are mainly categorized into three types: (1) collaborative filtering, (2) content-based filtering, and (3) hybrid systems [2]. The collaborative approach estimates a user's interest in an item by analyzing the preferences of other users who have already experienced the item. This approach does not consider the content or features of the item itself; for this reason, when a system starts or bootstraps without any prior user information, it faces the cold-start problem. Content-based recommendation systems, on the other hand, try to recommend an item to a user based on the description of the item and a profile of the user's interests, if user information is available [3]. They analyze item descriptions to identify items that are of particular interest to the user. There are also hybrid systems that integrate the collaborative and content-based ideas to build a more robust system that overcomes the drawbacks of both approaches; most modern industrial recommendation systems are in fact hybrid.

The quality of a content-based recommendation system mainly depends on the representation of the content of the items and of the user profiles. Most item content is in textual form, such as documents, news, and movie reviews. Extracting latent features from textual information is therefore a major concern in this type of recommendation system. Topic modeling is an active research area in the document modeling domain that aims to find the topics of a set of documents and to identify documents which are semantically and contextually similar. Several studies apply topic or concept discovery techniques, e.g., Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and their variants, to document recommendation [4, 5].

Bangla is one of the most widely spoken languages in the world: about 250-300 million people speak Bangla as their first language. Due to the lack of sufficient research and language modeling resources, developing an intelligent system such as a news recommender for Bangla is more challenging than for resource-rich languages, e.g., English, Spanish, and German. The authors in [13] proposed an ontology-based cross-lingual recommendation system that covers Bangla and English news.

In this work, we use a recent idea from distributional semantics called document embedding using doc2vec [8], which is highly scalable and works without any preprocessing or feature extraction steps other than tokenization. doc2vec learns the semantics and compositionality of linguistic components using a deep learning architecture. This neural architecture is simple and reduces human effort significantly. It compresses the whole contextual and structural information of a document into a single numeric vector. Although this approach is theoretically interesting and straightforward, the main challenge is that it needs a large amount of data to build a high-dimensional semantic space in which documents are placed accurately according to their latent representations. We have collected about 0.3 million news articles from different Bangla news websites to build a Bangla news corpus.

Document embedding models are trained on uncategorized articles, and the trained model can generate an embedding for an input article. This embedding can be used as a feature vector for document categorization, document ranking, and information retrieval tasks.

We have evaluated the performance of the document vectors on information retrieval tasks against LDA, LSA, and n-gram based feature extraction methods. We show that document vectors perform better and are easier to deploy as an information retrieval system. We focus purely on content-based recommendation since our crawled dataset does not contain any user information; a similar approach was followed in [7].
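Concretely, the content-based recommendation step reduces to nearest-neighbor search over document embeddings. The following minimal sketch (not from the paper; the embedding matrix and query vector are random stand-ins for the output of a trained model) ranks a corpus by cosine similarity to a query article:

```python
import numpy as np

def top_k_similar(doc_vectors: np.ndarray, query_vector: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    # Normalize so that the dot product equals cosine similarity.
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query = query_vector / np.linalg.norm(query_vector)
    return np.argsort(docs @ query)[::-1][:k]

# Random stand-ins: 1,000 documents with 300-dimensional embeddings.
rng = np.random.default_rng(0)
corpus_vectors = rng.normal(size=(1000, 300))
query_vector = rng.normal(size=300)
print(top_k_similar(corpus_vectors, query_vector, k=10))
```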
II. METHODS

In this section, we discuss the doc2vec model used for the recommendation task. Furthermore, we briefly explain two popular topic modeling methods, namely LSA and LDA, which are also used for modeling unstructured texts.
A. doc2vec
doc2vec is an approach to learning a model that can generate an embedding for a given document [8]. Unlike some commonly used methods such as bag-of-words (BOW), n-gram models, or averaging of word vectors, this method is generic and can generate embeddings from texts of any length. It can be trained in a completely unsupervised fashion on large corpora of raw text without needing any task-specific labeled dataset. doc2vec performs particularly well at representing longer documents [12].

doc2vec is an extension of existing word embedding models. A well-known technique for learning word vectors is to predict a word given the other words in its context. For a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, the objective of the model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \quad (1)$$

The probability is calculated using the softmax function, which is defined as:

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \quad (2)$$

where $y_i$ is the unnormalized log-probability of output word $i$.
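As a concrete illustration of equations (1) and (2), the sketch below (toy sizes and random weights, purely illustrative) computes $\log p(w_t \mid w_{t-k}, \ldots, w_{t+k})$ for a single context by combining the context word vectors and applying the softmax:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 50, 8                  # toy vocabulary and vector size

W = rng.normal(size=(vocab_size, dim))   # word vectors, one row per word
U = rng.normal(size=(vocab_size, dim))   # softmax output weights
b = np.zeros(vocab_size)                 # softmax bias

context_ids = [3, 17, 42]                # ids of the surrounding words w_{t-k..t+k}
target_id = 7                            # id of the word w_t to predict

h = W[context_ids].mean(axis=0)          # combine the context into one hidden vector
y = U @ h + b                            # unnormalized log-probabilities y_i
p = np.exp(y - y.max()) / np.exp(y - y.max()).sum()   # softmax, eq. (2)
print(np.log(p[target_id]))              # one term of the average in eq. (1)
```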
Paragraph vectors are jointly trained with the word vectors. At first, the paragraph vector and the word vectors are initialized randomly. While training the language model, both of these vectors learn a semantic representation of the sequence of sentences. The paragraph vector contributes to the prediction task along with the word vectors.

Two kinds of frameworks have been proposed by Le and Mikolov to learn doc2vec [8]:

(1) doc2vec with the distributed memory model, and

(2) doc2vec with the distributed bag-of-words model.

In the distributed memory model, paragraphs and words are jointly trained using a stochastic gradient descent optimizer. Each paragraph is mapped to a unique fixed-dimensional vector, represented by a column in matrix D, and each word is represented by a column in matrix W. The paragraph vector and the word vectors are averaged or concatenated to predict the next word in a context. The model is shown in Fig. 1.

Fig. 1. The distributed memory model of doc2vec for a Bangla sentence.

In this model, a paragraph vector is fixed for all samples generated by a sliding window over a document, while the word vectors are shared across all documents. The total number of learnable parameters, excluding the softmax parameters, is $N \times p + M \times q$, where $p$ is the length of the paragraph vector, $q$ is the length of a word vector, $N$ is the number of paragraphs, and $M$ is the number of words in the vocabulary. (For instance, with $N$ = 300,000 articles, a vocabulary of $M$ = 100,000 words, and illustrative sizes $p = q = 300$, this amounts to roughly 120 million parameters.)

In the distributed bag-of-words model, shown in Fig. 2, the model is trained strictly to predict words randomly sampled from a paragraph, and no context words are used as input. For each iteration, a text window is sampled, and then a random word is chosen from that window to form a classification task given the paragraph vector. The basic difference from the distributed memory model is that this model needs fewer parameters and its size is relatively small, but it cannot preserve word order. This model is similar to the word2vec skip-gram model.
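A minimal training sketch is shown below using the gensim library's Doc2Vec implementation (an assumption; this extract does not name the toolkit the authors used). The tiny tokenized corpus, vector size, window, and epoch count are illustrative, not the paper's settings; dm=1 selects the distributed memory model and dm=0 the distributed bag-of-words model.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative tokenized corpus; in practice this would be ~0.3M tokenized Bangla articles.
corpus = [
    ["durga", "puja", "festival", "dhaka"],
    ["idol", "immersion", "ends", "durga", "puja"],
    ["writer", "birthday", "celebrated", "today"],
]
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]

# dm=1 -> distributed memory; dm=0 -> distributed bag-of-words.
model = Doc2Vec(tagged, vector_size=300, window=5, min_count=1, dm=1, epochs=40)

# Infer a vector for an unseen article and retrieve similar training documents (gensim 4.x API).
new_vec = model.infer_vector(["durga", "puja", "immersion"])
print(model.dv.most_similar([new_vec], topn=2))
```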
B. Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a popular technique in distributional semantics that analyzes the semantic relationships among a set of documents using the term-document matrix and singular value decomposition [4]. LSA produces a latent space in which similar documents and similar words are placed close together. The similarity between two documents is computed as the cosine similarity between their corresponding column vectors, and, in a similar way, the correlation between two words is computed from their corresponding row vectors. LSA captures some basic linguistic properties such as synonymy and polysemy.
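For comparison, a minimal LSA sketch is given below, using scikit-learn's TF-IDF and truncated SVD as the term-document decomposition (an assumption about tooling; the tiny corpus and two latent components are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "durga puja festival begins in dhaka",
    "idol immersion marks the end of durga puja",
    "famous writer birthday celebrated today",
]

# Build the term-document statistics, then reduce with SVD to get the latent space.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa_vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Document similarity is the cosine between the corresponding LSA vectors.
print(cosine_similarity(lsa_vectors[0:1], lsa_vectors))
```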
[Recommendation examples (table garbled in extraction): top-10 Bangla news headlines suggested by the doc2vec, LDA, and LSA models for a query article about Durga Puja. Most headlines concern Durga Puja celebrations, Kumari Puja, and idol immersion; a few off-topic suggestions concern writers' birthdays and the Lalon memorial festival.]
Some of the recommendations returned by LSA are about the birthdays of two popular writers of Bangladesh. Another recommendation by LSA, 'কুষ্টিয়ার ছেঁউড়িয়ায় লালন স্মরণ উৎসব আজ থেকে' ("Lalon memorial festival begins today in Cheuriya, Kushtia"), relates to the prominent Bengali philosopher and poet Lalon Shah. The recommendations by doc2vec are significantly better than those of both the LDA and LSA models for this query, as its top-10 recommendations are all about Durga Puja.

On the human-generated triplet dataset, we see that doc2vec gives 91.0% accuracy, which is better than the other methods; LDA and LSA give 85% and 84% accuracy, respectively. We can conclude that doc2vec captures the semantic similarity between long Bangla news documents more efficiently than the other baseline methods.
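The triplet evaluation can be made precise as follows: for each human-annotated triplet (anchor, similar, dissimilar), a method scores a hit when it places the anchor closer to the similar document than to the dissimilar one. A minimal sketch, with random vectors standing in for the doc2vec/LDA/LSA document representations:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(embeddings, triplets) -> float:
    """Fraction of (anchor, similar, dissimilar) triplets ranked correctly."""
    hits = sum(
        cosine(embeddings[a], embeddings[s]) > cosine(embeddings[a], embeddings[d])
        for a, s, d in triplets
    )
    return hits / len(triplets)

# Stand-in embeddings and triplets; the real inputs are model vectors for
# the human-annotated Bangla news triplets.
rng = np.random.default_rng(2)
embeddings = {i: rng.normal(size=300) for i in range(6)}
triplets = [(0, 1, 2), (3, 4, 5)]
print(triplet_accuracy(embeddings, triplets))
```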
IV. CONCLUSION