
SOCIAL MEDIA ANALYTICS

EXPERIMENT 4
Aim: To build an LDA topic model from the given Google CSV dataset and use
word clouds to visualize the top N keywords in each topic.
Libraries Used:
NLTK: The Natural Language Toolkit, or more commonly NLTK, is a suite of
libraries and programs for symbolic and statistical natural language processing
for English written in the Python programming language. It supports
classification, tokenization, stemming, tagging, parsing, and semantic
reasoning functionalities.
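As a minimal sketch of NLTK's tokenization and stemming (the sample sentence is my own; this assumes NLTK is installed, and uses components that need no downloaded corpora):

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

# Rule-based tokenizer: works without any downloaded NLTK data files
tokens = TreebankWordTokenizer().tokenize("Topic models summarize large text collections.")

# Reduce each token to its stem (PorterStemmer lowercases by default)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```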
RE: A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular string,
which comes down to the same thing).
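For example, a small pattern match with the re module (the hashtag pattern and sample string are my own illustration):

```python
import re

# Find all hashtag-like tokens in a string
pattern = re.compile(r"#\w+")
matches = pattern.findall("Loving #Python and #NLP today!")
print(matches)  # ['#Python', '#NLP']
```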
NUMPY: NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a
large collection of high-level mathematical functions to operate on these
arrays.
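A small illustration of NumPy's multi-dimensional arrays and vectorized operations (the values are made up):

```python
import numpy as np

# A 2-D array (matrix) and a vectorized operation on it
counts = np.array([[1, 2, 3], [4, 5, 6]])
print(counts.shape)        # (2, 3)
print(counts.sum(axis=0))  # column sums: [5 7 9]
```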
Pandas: Pandas is a Python library used for working with data sets. It has
functions for analyzing, cleaning, exploring, and manipulating data. Pandas
is a fast, powerful, flexible, and easy-to-use open-source data analysis and
manipulation tool, built on top of the Python programming language.
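As a sketch of the kind of filtering used in this experiment (the rows below are made up, standing in for the Google CSV dataset):

```python
import pandas as pd

# A small DataFrame standing in for the loaded CSV data
df = pd.DataFrame({"text": ["great phone", "bad battery", "great camera"],
                   "likes": [10, 2, 7]})

# Select rows by a condition on one column
popular = df[df["likes"] > 5]["text"].tolist()
print(popular)  # ['great phone', 'great camera']
```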
Pprint: The pprint module provides a capability to “pretty-print” arbitrary
Python data structures in a form which can be used as input to the interpreter.
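For instance, nested topic-keyword output (the values here are hypothetical) is much easier to read when pretty-printed:

```python
from pprint import pprint

# Hypothetical topic -> (word, weight) output from a topic model
topics = {0: [("data", 0.12), ("model", 0.09)], 1: [("cloud", 0.11)]}
pprint(topics, width=40)  # one entry per line instead of one long repr
```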
Gensim: Gensim is an open-source library for unsupervised topic modelling,
document indexing, retrieval by similarity, and other natural language
processing functionalities, using modern statistical machine learning. Gensim is
implemented in Python.
Spacy: spaCy is a free, open-source library for NLP in Python written in Cython.
spaCy is designed to make it easy to build systems for information
extraction or general-purpose natural language processing.
Logging: is used to import the built-in logging module. This module allows you
to use a logger to log messages. Logging is the process of keeping a record
of events that occur in a computer system, these events can include problems,
errors, or information on current operations.
Matplotlib: Matplotlib is an amazing visualization library in Python for 2D
plots of arrays. Matplotlib is a multi-platform data visualization library built on
NumPy arrays and designed to work with the broader SciPy stack. It was
introduced by John Hunter in the year 2002. One of the greatest benefits of
visualization is that it allows us visual access to huge amounts of data in
easily digestible visuals. Matplotlib consists of several plots like line, bar,
scatter, histogram etc.
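As a small sketch of a Matplotlib bar plot (the topic sizes are hypothetical; the Agg backend is used so the figure renders to a file without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without a display
import matplotlib.pyplot as plt

# Bar chart of (hypothetical) document counts per topic
topic_ids = [0, 1, 2]
doc_counts = [40, 25, 35]
fig, ax = plt.subplots()
ax.bar(topic_ids, doc_counts)
ax.set_xlabel("Topic")
ax.set_ylabel("Number of documents")
fig.savefig("topic_sizes.png")
```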

CoherenceModel: CoherenceModel is used for the evaluation of topic models. The
coherence score in topic modeling measures how interpretable the topics are
to humans. In this case, topics are represented as the top N words with the
highest probability of belonging to that particular topic. Briefly, the
coherence score measures how similar these words are to each other.
Lemmatize: Lemmatization is the process of grouping together the different
inflected forms of a word so they can be analyzed as a single item.
Lemmatization is similar to stemming, but it brings context to the words:
it links words with similar meanings to one word.

PyLDAvis: A Python library for interactive topic model visualization.
pyLDAvis is designed to help users interpret the topics in a topic model
that has been fit to a corpus of text data. The package extracts information
from a fitted LDA topic model to inform an interactive web-based
visualization.

Here are some theoretical points related to the experiment:

LDA topic model using Gensim: The topic-modelling strategy used by LDA is to
assign the text in a document to a specific topic, and LDA constructs two
Dirichlet distributions as its model: a distribution of topics per document
and a distribution of words per topic. During training, the LDA algorithm
re-arranges the topic-keyword assignments to produce a good composition of
both distributions: the distribution of topics within each document and the
distribution of keywords within each topic.
Every document is modelled as a multinomial distribution over topics, and
every topic is represented by a multinomial distribution over words. Because
LDA assumes that each piece of text contains related terms, we should choose
a proper corpus of data.

WordCloud: A word cloud is a data visualization technique for representing
text data in which the size of each word indicates its frequency or
importance. Significant textual data points can be highlighted using a word
cloud. Word clouds are widely used for analyzing data from social network
websites. For generating a word cloud in Python, the modules needed are
matplotlib, pandas, and wordcloud.
Bigram and Trigram Model: An N-gram is a sequence of n items (words, in this
case) from a given sample of text or speech. For example, given the text
“Susan is a kind soul, she will help you out as long as it is within her
boundaries”:
bigrams: [‘susan is’, ‘is a’, ‘a kind’, ‘kind soul’, ‘soul she’, ‘she will’,
‘will help’, ‘help you’, …]
trigrams: [‘susan is a’, ‘is a kind’, ‘a kind soul’, ‘kind soul she’,
‘soul she will’, ‘she will help’, …]
From the examples above, we can see that n in n-grams can take different
values: a sequence of 2 grams is called a bigram, and a sequence of 3 grams
is called a trigram.
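The n-gram construction above can be sketched in a few lines of plain Python (the helper name `ngrams` is my own):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined as strings) over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "susan is a kind soul she will help you out".split()
print(ngrams(tokens, 2)[:4])  # ['susan is', 'is a', 'a kind', 'kind soul']
print(ngrams(tokens, 3)[:2])  # ['susan is a', 'is a kind']
```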
t-SNE: t-distributed Stochastic Neighbour Embedding (t-SNE) is a technique
for dimensionality reduction. It is used to visualize high-dimensional data
by embedding it into a lower-dimensional space, such as 2D or 3D. The
algorithm models pairwise similarities in the original data to determine how
best to represent it using fewer dimensions. t-SNE is a non-linear
dimensionality reduction technique, which means it can separate data that
cannot be separated by a straight line. It can be used for data exploration
and visualization.
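A minimal t-SNE sketch using scikit-learn (the random 5-dimensional points stand in for document-topic vectors; note that perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

# 20 random points in 5 dimensions, standing in for document-topic vectors
rng = np.random.default_rng(0)
X = rng.random((20, 5))

# Embed into 2D for visualization; perplexity=5 suits this tiny sample
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(embedding.shape)  # (20, 2)
```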

CONCLUSION:

This experiment successfully demonstrated the creation of an LDA (Latent
Dirichlet Allocation) topic model using the provided Google CSV dataset. LDA
is a powerful technique for uncovering hidden thematic structures within a
large collection of text documents. By employing this model, we were able to
identify and extract meaningful topics from the dataset, shedding light on
the underlying themes present in the data.
The use of word clouds to visualize the top N keywords within each
identified topic added an informative and visually engaging dimension to our
analysis. Word clouds provide a clear and concise representation of the most
prominent terms associated with each topic, aiding in the interpretation and
understanding of the topics generated by the LDA model.
Overall, this experiment showcased the potential of LDA topic modeling and
word cloud visualization as powerful tools for uncovering and communicating
the key themes and keywords within a dataset, facilitating better
comprehension and decision-making in various applications involving textual
data.
