Ex6 SMA
AIM:
Google Reviews Analytics
THEORY:
Instant Data Scraper is an automated data-extraction browser extension that works on any website and exports the scraped data as Excel or CSV files. It uses AI to predict which data on an HTML page is most relevant and allows saving it to an Excel or CSV file (XLS, XLSX, CSV).
The extension is free and works well alongside SEO tools, CRM and recruiter systems, sales-lead management tools, and email-marketing campaigns, making web scraping and data downloading straightforward. It also preserves data security and privacy, as the scraped data does not leave the browser. In this experiment it is used to scrape Google reviews into a CSV file for analysis.
Topic modeling is a frequently used approach to discover hidden semantic patterns portrayed
by a text corpus and automatically identify topics that exist inside it.
Namely, it’s a type of statistical modeling that leverages unsupervised machine learning to
analyze and identify clusters or groups of similar words within a body of text.
For example, a topic modeling algorithm may be deployed to determine whether the contents
of a document imply it is an invoice, a complaint, or a contract.
Topics are the latent descriptions of a corpus (large group) of text. Intuitively, documents
regarding a specific topic are more likely to produce certain words more frequently.
For example, the words “dog” and “bone” are more likely to appear in documents concerning
dogs, whereas “cat” and “meow” are more likely to be found in documents regarding cats.
Consequently, the topic model would scan the documents and produce clusters of similar
words.
Essentially, topic models work by deducing words and grouping similar ones into topics to
create topic clusters.
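For instance, this grouping behaviour can be reproduced with a few lines of scikit-learn. The toy corpus and the choice of two topics below are purely illustrative and are not part of the experiment.
# Illustrative sketch: LDA tends to group co-occurring words ("dog"/"bone" vs "cat"/"meow") into topics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the dog chewed a bone",
    "my dog buried the bone in the yard",
    "the cat said meow",
    "a hungry cat will meow loudly",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top_words = [words[j] for j in weights.argsort()[::-1][:3]]
    print("Topic", t, ":", top_words)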
Two popular topic modeling techniques are Latent Semantic Analysis (LSA) and Latent
Dirichlet Allocation (LDA). They share the objective of discovering hidden semantic patterns
in text data, but they achieve it in different ways.
Latent Dirichlet Allocation (LDA) was initially proposed in 2000 in a paper titled “Inference
of population structure using multilocus genotype data.” The paper predominantly focused
on population genetics, which is a subfield of genetics concerned with genetic differences
within and among populations. Three years later, Latent Dirichlet Allocation was applied in
machine learning.
The authors of the paper describe the technique as “a generative model for text and other
collections of discrete data.” Thus, LDA may be described as a natural language processing
technique used to identify the topics a document belongs to based on the words it contains.
More specifically, LDA is a Bayesian network, meaning it’s a generative statistical model
that assumes documents are made up of words that aid in determining the topics. Thus,
documents are mapped to a list of topics by assigning each word in the document to different
topics. This model ignores the order of words occurring in a document and treats them as a
bag of words.
The output is the topic model together with the documents expressed as combinations of the
topics.
In a nutshell, all the algorithm does is find the weights of the connections between
documents and topics and between topics and words.
Visually, imagine a word distribution over documents. The algorithm creates an intermediate
layer of topics and figures out the weights between documents and topics and between topics
and words, so documents are no longer connected directly to words but to topics.
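Both weight layers can be read directly off a fitted model. The sketch below uses a made-up corpus (not the review data): transform returns the document-to-topic weights and components_ holds the topic-to-word weights.
# Sketch of the two weight layers LDA learns (toy data, not the review corpus)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

demo_docs = ["dog bone dog bone", "cat meow cat meow", "dog bone", "cat meow cat"]
demo_vect = CountVectorizer()
X_demo = demo_vect.fit_transform(demo_docs)        # bag-of-words counts (word order ignored)

demo_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_demo)
print(demo_lda.transform(X_demo).round(2))         # document -> topic weights
print(demo_lda.components_.round(2))               # topic -> word weights
print(demo_vect.get_feature_names_out())           # columns of the topic-word matrix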
If we want to get an intuition about what each topic is about, we need to carry out a more
detailed analysis. For a corpus of reviews, the LDA procedure can be summarised in the
following steps:
1. Choose the number of topics (k) you want to extract from the corpus.
2. Preprocess the review corpus by removing stop words and punctuation and converting
words to their root forms using stemming or lemmatization.
3. Create a vocabulary list of all unique words in the corpus.
4. Convert each review in the corpus into a bag-of-words representation, where each
word is represented by its index in the vocabulary list and the count of that word in
the review.
5. Initialize the model by randomly assigning each word in each review to one of the k
topics.
6. For each review r in the corpus, iterate through each word w in the review and
calculate the probability distribution over the k topics, given the current assignments
of all other words in the document to their topics and the current topic-word
distribution.
7. Sample a new topic assignment for word w based on the probability distribution
calculated in step 6.
8. Repeat steps 6 and 7 for all reviews in the corpus until convergence is achieved.
9. Output the topic-word distribution and document-topic distribution as the final result.
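The steps above amount to collapsed Gibbs sampling. A rough, self-contained sketch is shown below; the hyperparameters, iteration count, and toy corpus are assumptions for illustration only.
# Collapsed Gibbs sampling sketch of steps 1-9 (toy word-id corpus, illustrative hyperparameters)
import numpy as np

def lda_gibbs(docs, vocab_size, k=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))                # document-topic counts
    n_kw = np.zeros((k, vocab_size))               # topic-word counts
    n_k = np.zeros(k)                              # total words per topic
    z = [[rng.integers(k) for _ in doc] for doc in docs]   # step 5: random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):                         # step 8: repeat until (approximate) convergence
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # step 6: topic distribution given all other assignments
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                t = rng.choice(k, p=p / p.sum())   # step 7: sample a new topic
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    # step 9: normalised document-topic and topic-word distributions
    doc_topic = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    topic_word = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return doc_topic, topic_word

# Toy corpus of word ids (vocabulary: 0="dog", 1="bone", 2="cat", 3="meow")
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [2, 3, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, vocab_size=4)
print(doc_topic.round(2))
print(topic_word.round(2))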
CODE / OUTPUT:
# NLTK stop words, lemmatizer and scikit-learn vectorizer for preprocessing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))
import pandas as pd

# Load the Google reviews scraped with Instant Data Scraper
df = pd.read_csv('/content/google.csv')
df
# Additional NLTK imports and corpora for tagging, stemming and tokenisation
from nltk.corpus import sentiwordnet as swn
from nltk.tag import pos_tag, map_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
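The cells that create df['cleaned_text'], the vectorizer vect, and the per-review topic weights lda_top used below are missing from the listing. The following is a minimal sketch of those steps; the review column name 'review_text', the vectorizer settings, and the choice of 7 topics are assumptions for illustration, not values from the original notebook.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

lemmatizer = WordNetLemmatizer()

def clean(review):
    # Step 2: lowercase, tokenise, drop stop words and non-alphabetic tokens, lemmatize
    tokens = word_tokenize(str(review).lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

df['cleaned_text'] = df['review_text'].apply(clean)  # 'review_text' is an assumed column name

# Steps 3-4: vocabulary and bag-of-words counts, then fit LDA with 7 topics (assumed)
vect = CountVectorizer(max_features=1000)
X = vect.fit_transform(df['cleaned_text'])
lda = LatentDirichletAllocation(n_components=7, random_state=42)
lda_top = lda.fit_transform(X)  # per-review topic weights used below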
vocab = vect.get_feature_names_out()  # vocabulary learned by the vectorizer (step 3)
print("Review 0: ") #Service is not good as wanted. Very limited staff.otherwise food quality
is good.
for i,topic in enumerate(lda_top[0]):
print("Topic ",i,": ",topic*100,"%")
print("Review 1: ")
for i,topic in enumerate(lda_top[0]):
print("Topic ",i,": ",topic*100,"%")
import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models  # renamed from pyLDAvis.gensim in pyLDAvis 3.x

# Tokenised reviews for building the gensim dictionary
reviews = df['cleaned_text']
texts = [word_tokenize(review) for review in reviews]
dictionary1 = corpora.Dictionary(texts)
print(dictionary1)
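The cell that trains the gensim model lda_model used below is also not shown. A minimal sketch follows; num_topics=7 matches the loop below, while passes and random_state are illustrative assumptions.
# Bag-of-words corpus and gensim LDA model (sketch; hyperparameters are assumptions)
corpus = [dictionary1.doc2bow(text) for text in texts]
lda_model = gensim.models.LdaModel(corpus=corpus,
                                   id2word=dictionary1,
                                   num_topics=7,
                                   random_state=42,
                                   passes=10)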
# Top 20 words for each of the 7 gensim topics
word_dict = {}
for i in range(7):
    words = lda_model.show_topic(i, topn=20)
    word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words]
pd.DataFrame(word_dict)
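The pyLDAvis import above is otherwise unused; one way the fitted topics could be inspected interactively in the notebook (assuming pyLDAvis 3.x) is sketched below.
# Optional: interactive topic visualisation (assumes pyLDAvis >= 3.x)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary1)
vis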
CONCLUSION:
Through this experiment, we learnt how to scrape Google reviews using Instant Data Scraper and how to discover the topics discussed in them using LDA topic modeling.