Sma Exp 4
EXPERIMENT 4
Aim: To build an LDA topic model from the given Google CSV dataset and use a
word cloud to visualize the top N keywords in each topic.
Libraries Used:
NLTK: The Natural Language Toolkit, or more commonly NLTK, is a suite of
libraries and programs for symbolic and statistical natural language processing
for English written in the Python programming language. It supports
classification, tokenization, stemming, tagging, parsing, and semantic
reasoning functionalities.
RE: A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular string,
which comes down to the same thing).
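As a minimal sketch of how the re module might be used to clean raw text before topic modelling, the example below removes e-mail-like tokens, collapses whitespace, and drops single quotes. The function name clean_text and the sample string are illustrative assumptions, not part of the experiment's dataset.

```python
import re


def clean_text(text: str) -> str:
    """Remove e-mail-like tokens, collapse whitespace, and drop single quotes."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # tokens containing '@'
    text = re.sub(r"\s+", " ", text)        # runs of spaces/newlines -> one space
    text = re.sub(r"'", "", text)           # single quotes
    return text.strip()


# Hypothetical sample document
sample = "Contact me at user@example.com\nabout the car's   engine."
cleaned = clean_text(sample)
print(cleaned)
```

Each re.sub call takes a pattern, a replacement, and the input string, and returns a new string; the original string is left unchanged.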
NUMPY: NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a
large collection of high-level mathematical functions to operate on these
arrays.
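To illustrate the array operations described above, here is a small sketch using a made-up document-term count matrix. The numbers are hypothetical and only show how sums and element-wise division work across array axes.

```python
import numpy as np

# Hypothetical counts: 3 documents (rows) x 4 vocabulary words (columns)
counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 1, 1],
])

doc_lengths = counts.sum(axis=1)   # total words in each document
term_totals = counts.sum(axis=0)   # total occurrences of each word

# Broadcasting divides every row by that document's length,
# giving relative word frequencies per document.
freqs = counts / doc_lengths[:, np.newaxis]
print(freqs)
```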
Pandas: Pandas is a Python library used for working with data sets. It has
functions for analyzing, cleaning, exploring, and manipulating data. Pandas
is a fast, powerful, flexible and easy-to-use open-source data analysis and
manipulation tool, built on top of the Python programming language.
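The following sketch shows one way pandas could load a CSV file and prepare a list of documents. The column names (id, content) and the inline CSV text are assumptions made so the example is self-contained; the actual dataset in this experiment may use different columns.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real dataset file
csv_text = (
    "id,content\n"
    "1,Topic models find themes\n"
    "2,\n"
    "3,Word clouds show keywords\n"
)

df = pd.read_csv(io.StringIO(csv_text))   # a file path works the same way
df = df.dropna(subset=["content"])        # drop rows with no text
documents = df["content"].tolist()        # plain Python list of strings
print(documents)
```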
Pprint: The pprint module provides a capability to “pretty-print” arbitrary
Python data structures in a form which can be used as input to the interpreter.
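A short sketch of pretty-printing a nested structure is given below. The topics list is invented for illustration and only mimics the general shape of topic/keyword output; it is not a result from this experiment.

```python
import ast
from pprint import pformat, pprint

# Hypothetical topics: (topic id, [(keyword, weight), ...])
topics = [
    (0, [("engine", 0.031), ("car", 0.027), ("speed", 0.019)]),
    (1, [("game", 0.042), ("team", 0.035), ("season", 0.021)]),
]

pprint(topics, width=60)            # print with line breaks and indentation
text = pformat(topics, width=60)    # same formatting, returned as a string

# The formatted text is valid Python, so it can be parsed back.
restored = ast.literal_eval(text)
```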
Gensim: Gensim is an open-source library for unsupervised topic modelling,
document indexing, retrieval by similarity, and other natural language
processing functionalities, using modern statistical machine learning. Gensim is
implemented in Python.
spaCy: spaCy is a free, open-source library for NLP, written in Python and
Cython. spaCy is designed to make it easy to build systems for information
extraction or general-purpose natural language processing.
Logging: Python's built-in logging module lets you use a logger to record
messages. Logging is the process of keeping a record of events that occur in
a computer system; these events can include problems, errors, or information
on current operations.
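As a minimal sketch of the logging module, the example below sets up a logger that records messages at INFO level and above. The logger name lda_experiment and the messages are hypothetical; the output is captured in memory here only so it can be inspected, whereas a real script would usually log to the console or a file.

```python
import io
import logging

stream = io.StringIO()                      # in-memory log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))

logger = logging.getLogger("lda_experiment")  # hypothetical logger name
logger.setLevel(logging.INFO)               # ignore DEBUG messages
logger.addHandler(handler)
logger.propagate = False                    # keep output out of the root logger

logger.debug("this message is below the INFO threshold")
logger.info("training started")
logger.warning("empty document skipped")

log_text = stream.getvalue()
print(log_text)
```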
Matplotlib: Matplotlib is a visualization library in Python for 2D
plots of arrays. It is a multi-platform data visualization library built on
NumPy arrays and designed to work with the broader SciPy stack. It was
introduced by John Hunter in the year 2002. One of the greatest benefits of
visualization is that it gives us visual access to huge amounts of data in
an easily digestible form. Matplotlib supports several plot types, such as
line, bar, scatter, and histogram.
CONCLUSION: