Ex6 SMA
AIM:
Google Reviews Analytics
THEORY:
Instant Data Scraper is an automated data-extraction browser extension that works on any website and exports the scraped data as Excel or CSV files. It uses AI to predict which data on an HTML page is most relevant and allows saving it to an Excel or CSV file (XLS, XLSX, CSV).
The extension is free and works well alongside SEO tools, CRM and recruiter systems, sales-lead management tools, and email-marketing campaigns, making web scraping and data downloading straightforward. It also preserves data security and privacy, as the scraped data does not leave the browser. In this experiment it is used to scrape Google reviews into a CSV file for analysis.
Topic modeling is a frequently used approach to discover hidden semantic patterns portrayed
by a text corpus and automatically identify topics that exist inside it.
Namely, it’s a type of statistical modeling that leverages unsupervised machine learning to
analyze and identify clusters or groups of similar words within a body of text.
For example, a topic modeling algorithm may be deployed to determine whether the contents
of a document imply it is an invoice, a complaint, or a contract.
Topics are the latent descriptions of a corpus (large group) of text. Intuitively, documents
regarding a specific topic are more likely to produce certain words more frequently.
For example, the words “dog” and “bone” are more likely to appear in documents concerning
dogs, whereas “cat” and “meow” are more likely to be found in documents regarding cats.
Consequently, the topic model would scan the documents and produce clusters of similar
words.
Essentially, topic models work by deducing words and grouping similar ones into topics to
create topic clusters.
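For instance, this grouping behaviour can be reproduced with a few lines of scikit-learn. The toy corpus and the choice of two topics below are purely illustrative and are not part of the experiment.
# Illustrative sketch: LDA tends to group co-occurring words ("dog"/"bone" vs "cat"/"meow") into topics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the dog chewed a bone",
    "my dog buried the bone in the yard",
    "the cat said meow",
    "a hungry cat will meow loudly",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top_words = [words[j] for j in weights.argsort()[::-1][:3]]
    print("Topic", t, ":", top_words)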
Two popular topic modeling techniques are Latent Semantic Analysis (LSA) and Latent
Dirichlet Allocation (LDA). They share the objective of discovering hidden semantic patterns
in text data, but they achieve it in different ways.
Latent Dirichlet Allocation (LDA) was initially proposed in 2000 in a paper titled “Inference
of population structure using multilocus genotype data.” The paper predominantly focused
on population genetics, which is a subfield of genetics concerned with genetic differences
within and among populations. Three years later, Latent Dirichlet Allocation was applied in
machine learning.
The authors of the paper describe the technique as “a generative model for text and other
collections of discrete data.” Thus, LDA may be described as a natural language processing
technique used to identify the topics a document belongs to based on the words it contains.
More specifically, LDA is a Bayesian network, meaning it’s a generative statistical model
that assumes documents are made up of words that aid in determining the topics. Thus,
documents are mapped to a list of topics by assigning each word in the document to different
topics. This model ignores the order of words occurring in a document and treats them as a
bag of words.
The output is the topic model together with the documents expressed as combinations of the
topics.
In a nutshell, all the algorithm does is find the weights of the connections between
documents and topics and between topics and words.
Visually, imagine a word distribution over documents. The algorithm creates an intermediate
layer of topics and figures out the weights between documents and topics and between topics
and words, so documents are no longer connected directly to words but to topics.
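Both weight layers can be read directly off a fitted model. The sketch below uses a made-up corpus (not the review data): transform returns the document-to-topic weights and components_ holds the topic-to-word weights.
# Sketch of the two weight layers LDA learns (toy data, not the review corpus)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

demo_docs = ["dog bone dog bone", "cat meow cat meow", "dog bone", "cat meow cat"]
demo_vect = CountVectorizer()
X_demo = demo_vect.fit_transform(demo_docs)        # bag-of-words counts (word order ignored)

demo_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_demo)
print(demo_lda.transform(X_demo).round(2))         # document -> topic weights
print(demo_lda.components_.round(2))               # topic -> word weights
print(demo_vect.get_feature_names_out())           # columns of the topic-word matrix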
If we want to get an intuition about what each topic is about, we need to carry out a more
detailed analysis. For a corpus of reviews, the LDA procedure can be summarised in the
following steps:
1. Choose the number of topics (k) you want to extract from the corpus.
2. Preprocess the review corpus by removing stop words and punctuation and converting
words to their root forms using stemming or lemmatization.
3. Create a vocabulary list of all unique words in the corpus.
4. Convert each review in the corpus into a bag-of-words representation, where each
word is represented by its index in the vocabulary list and the count of that word in
the review.
5. Initialize the model by randomly assigning each word in each review to one of the k
topics.
6. For each review r in the corpus, iterate through each word w in the review and
calculate the probability distribution over the k topics, given the current assignments
of all other words in the document to their topics and the current topic-word
distribution.
7. Sample a new topic assignment for word w based on the probability distribution
calculated in step 6.
8. Repeat steps 6 and 7 for all reviews in the corpus until convergence is achieved.
9. Output the topic-word distribution and document-topic distribution as the final result.
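The steps above amount to collapsed Gibbs sampling. A rough, self-contained sketch is shown below; the hyperparameters, iteration count, and toy corpus are assumptions for illustration only.
# Collapsed Gibbs sampling sketch of steps 1-9 (toy word-id corpus, illustrative hyperparameters)
import numpy as np

def lda_gibbs(docs, vocab_size, k=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))                # document-topic counts
    n_kw = np.zeros((k, vocab_size))               # topic-word counts
    n_k = np.zeros(k)                              # total words per topic
    z = [[rng.integers(k) for _ in doc] for doc in docs]   # step 5: random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):                         # step 8: repeat until (approximate) convergence
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # step 6: topic distribution given all other assignments
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                t = rng.choice(k, p=p / p.sum())   # step 7: sample a new topic
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    # step 9: normalised document-topic and topic-word distributions
    doc_topic = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    topic_word = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return doc_topic, topic_word

# Toy corpus of word ids (vocabulary: 0="dog", 1="bone", 2="cat", 3="meow")
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [2, 3, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, vocab_size=4)
print(doc_topic.round(2))
print(topic_word.round(2))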
CODE / OUTPUT:
# NLTK stop words, lemmatizer and scikit-learn vectorizer for preprocessing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))
import pandas as pd

# Load the Google reviews scraped with Instant Data Scraper
df = pd.read_csv('/content/google.csv')
df
# Additional NLTK imports and corpora for tagging, stemming and tokenisation
from nltk.corpus import sentiwordnet as swn
from nltk.tag import pos_tag, map_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
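The cells that create df['cleaned_text'], the vectorizer vect, and the per-review topic weights lda_top used below are missing from the listing. The following is a minimal sketch of those steps; the review column name 'review_text', the vectorizer settings, and the choice of 7 topics are assumptions for illustration, not values from the original notebook.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

lemmatizer = WordNetLemmatizer()

def clean(review):
    # Step 2: lowercase, tokenise, drop stop words and non-alphabetic tokens, lemmatize
    tokens = word_tokenize(str(review).lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

df['cleaned_text'] = df['review_text'].apply(clean)  # 'review_text' is an assumed column name

# Steps 3-4: vocabulary and bag-of-words counts, then fit LDA with 7 topics (assumed)
vect = CountVectorizer(max_features=1000)
X = vect.fit_transform(df['cleaned_text'])
lda = LatentDirichletAllocation(n_components=7, random_state=42)
lda_top = lda.fit_transform(X)  # per-review topic weights used below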
vocab = vect.get_feature_names_out()  # vocabulary learned by the vectorizer (step 3)
print("Review 0: ") #Service is not good as wanted. Very limited staff.otherwise food quality
is good.
for i,topic in enumerate(lda_top[0]):
print("Topic ",i,": ",topic*100,"%")
print("Review 1: ")
for i,topic in enumerate(lda_top[0]):
print("Topic ",i,": ",topic*100,"%")
import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models  # renamed from pyLDAvis.gensim in pyLDAvis 3.x

# Tokenised reviews for building the gensim dictionary
reviews = df['cleaned_text']
texts = [word_tokenize(review) for review in reviews]
dictionary1 = corpora.Dictionary(texts)
print(dictionary1)
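The cell that trains the gensim model lda_model used below is also not shown. A minimal sketch follows; num_topics=7 matches the loop below, while passes and random_state are illustrative assumptions.
# Bag-of-words corpus and gensim LDA model (sketch; hyperparameters are assumptions)
corpus = [dictionary1.doc2bow(text) for text in texts]
lda_model = gensim.models.LdaModel(corpus=corpus,
                                   id2word=dictionary1,
                                   num_topics=7,
                                   random_state=42,
                                   passes=10)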
# Top 20 words for each of the 7 gensim topics
word_dict = {}
for i in range(7):
    words = lda_model.show_topic(i, topn=20)
    word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words]
pd.DataFrame(word_dict)
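The pyLDAvis import above is otherwise unused; one way the fitted topics could be inspected interactively in the notebook (assuming pyLDAvis 3.x) is sketched below.
# Optional: interactive topic visualisation (assumes pyLDAvis >= 3.x)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary1)
vis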
CONCLUSION:
Through this experiment, we learnt how to scrape Google reviews using Instant Data Scraper and how to discover the topics discussed in them using LDA topic modeling.