
Running head: TOPIC MODEL BY USING LATENT DIRICHLET ALLOCATION

Topic Model by Using Latent Dirichlet Allocation

Name

Institution

Topic model by using Latent Dirichlet Allocation (LDA)

Topic modelling is a type of statistical modelling for discovering the abstract topics that occur in a collection of documents. The topic model is considered one of the most powerful techniques in text mining for data mining, finding relationships among data and text documents, and discovering latent data. In text mining there are often collections of documents, such as news articles or blog posts, that are to be divided into natural groups so that the articles can be understood separately (Boyd-Graber, 2017). The topic model is an unsupervised classification technique for such documents, comparable to clustering on numeric data, which discovers natural groups of items even when we are uncertain about what we are searching for. Researchers have published many articles in the field of topic modelling and have applied it in various fields such as political science, software engineering, medicine, and linguistics.

There are different topic modelling techniques; one of the most popular in this field is Latent Dirichlet Allocation (LDA). Researchers have suggested different models for topic modelling based on latent Dirichlet allocation (Patel, 2018). This article will be beneficial and helpful in implementing latent approaches to topic modelling; research has also been conducted from highly scholarly articles on LDA-based modelling to discover the development and intellectual structure of topic modelling. Standard statistical techniques can be used to reverse the generative process, inferring the set of topics that was responsible for creating a collection of documents.

Topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. More intuitively and in particular, some of their tasks are:

(a) Tagging

The abstract topics that appear in a collection of documents are used as tags that best represent the information in them.

(b) Unsupervised Learning

Topic modelling can be compared with clustering: just as the number of clusters is an output parameter in clustering, the number of topics is an output parameter here. In topic modelling we build clusters of words rather than clusters of texts (Blei, Ng, & Jordan, 2003). A text is therefore a mixture of all the topics, each with a specific weight.

(c) Dimensionality Reduction

Instead of representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in Vocabulary}, we can represent it in topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}, as sketched below.
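
A minimal sketch of this reduction in R, assuming the AssociatedPress document-term matrix that ships with the topicmodels package; the two-topic fit is purely illustrative:

library(topicmodels)
library(tidytext)

data("AssociatedPress")
dim(AssociatedPress)   # word space: one dimension per vocabulary term

# Illustrative two-topic fit; the seed only makes the example repeatable
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Topic space: per-document topic weights instead of raw word counts
ap_gamma <- tidy(ap_lda, matrix = "gamma")
head(ap_gamma)

The gamma output has one row per (document, topic) pair, so each document is now described by k = 2 weights rather than one count per vocabulary term.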

Latent Dirichlet Allocation (LDA) is one of the most common topic-model algorithms and is a particularly popular method for fitting a topic model. Latent Dirichlet Allocation is a probabilistic generative model for collections of discrete data such as text corpora. It is used in topic modelling and natural language processing (NLP), among other areas. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modelled as a finite mixture over an underlying set of topics. Each topic is, in turn, modelled as an infinite mixture over an underlying set of topic probabilities. In the context of text modelling, the topic probabilities provide an explicit representation of a document. LDA imagines a fixed set of topics, and each topic represents a set of words; hence the main goal of LDA is to map all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics. Latent Dirichlet Allocation then treats each document as a mixture of topics and each topic as a mixture of words, which is used to classify the text in a document into particular topics. This ability allows documents to overlap in terms of their content rather than being separated into discrete groups. The two guiding principles of LDA are:

(i) Every document is a mixture of topics

Each document may contain words from several topics in certain proportions. For instance, in a two-topic model one can say, "Document 1 is 90% topic A and 10% topic B, while Document 2 is 20% topic A and 80% topic B."

(ii) Every topic is a mixture of words

For instance, one can imagine a two-topic model of British news, with one topic for entertainment and the other for politics. The most common words in the entertainment topic might be "actor", "movies", "series", and "starring", while the politics topic might be made up of words like "Queen", "Prime Minister", and "the royal family". Importantly, words can be shared between topics: words such as "winner", "budget", and "loser" may appear in both politics and entertainment (Teh, Newman, & Welling, 2007). A short sketch of both principles follows.
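
Both principles can be seen in a minimal sketch, again assuming the bundled AssociatedPress data and an illustrative two-topic fit: the gamma matrix holds the per-document topic proportions, and the beta matrix holds the per-topic word probabilities.

library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress")
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Principle (i): every document is a mixture of topics.
# gamma gives per-document topic proportions (e.g. 90% topic 1).
tidy(ap_lda, matrix = "gamma") %>% filter(document == 1)

# Principle (ii): every topic is a mixture of words.
# beta gives per-topic word probabilities; show each topic's top words.
tidy(ap_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup()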

LDA is a statistical technique for estimating both of these at the same time: it discovers the mixture of words that is associated with each topic, and it also determines the mixture of topics that describes each document. There are a number of current applications of this algorithm, and we are going to investigate one in depth.
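
Before turning to those applications, it helps to state the model formally. For a single document, Blei, Ng, and Jordan (2003) write the joint distribution of a topic mixture \theta, topic assignments z, and observed words w as

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

where \alpha is the Dirichlet prior on the per-document topic mixture and \beta parameterizes the per-topic word probabilities. Fitting LDA amounts to reversing this generative process to infer the topics from the observed words.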

Some of the benefits and cases where Latent Dirichlet Allocation has been used are:

 As an initial step in summarizing a big collection of text information.

 To tag fresh text information automatically using the learned topics, as sketched after this list.

 To reduce a big body of text information to certain keywords (or sequences of main words, using N-grams), or to reduce the task of clustering or searching a large number of files to searching or clustering keywords (themes). This helps to reduce the amount of resources needed for data search and retrieval.

 For the clustering and classification of images.
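
For the tagging use case, a minimal sketch under the same assumptions: a model fitted on part of AssociatedPress, with the posterior() function from topicmodels used to infer topic weights for held-out documents, which then serve as automatic tags.

library(topicmodels)

data("AssociatedPress")

# Fit on the first 2000 documents; the split is purely illustrative
ap_lda <- LDA(AssociatedPress[1:2000, ], k = 2, control = list(seed = 1234))

# Infer topic mixtures for unseen documents with the fitted model
new_docs <- AssociatedPress[2001:2010, ]
post <- posterior(ap_lda, newdata = new_docs)

# post$topics is a document x topic matrix of weights; tag each
# held-out document with its highest-weight topic
tags <- apply(post$topics, 1, which.max)
tags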

Figure 1.0 below shows a flow chart of a text analysis that incorporates topic modelling. The topicmodels package takes a document-term matrix as input and produces a model that can be tidied by tidytext, so that it can be manipulated and visualized with dplyr and ggplot2.
Figure 1.0. A flow chart of a text analysis that incorporates topic modelling.
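
A minimal sketch of that workflow, again assuming the bundled AssociatedPress data and an illustrative two-topic model:

library(topicmodels)
library(tidytext)
library(dplyr)
library(ggplot2)

data("AssociatedPress")

# Document-term matrix -> topicmodels
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Model -> tidytext -> dplyr -> ggplot2: plot each topic's top terms
tidy(ap_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = "free")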

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Boyd-Graber, J. L. (2017). Applications of topic models. Hanover, MA: now Publishers Inc.

Patel, V. H. (2018). Topic modeling using latent Dirichlet allocation on disaster tweets. Manhattan, KS: Kansas State University.

Teh, Y. W., Newman, D., & Welling, M. (2007). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (pp. 1353-1360).
