Zomato Restaurant Clustering & Sentiment Analysis - Ipynb - Colaboratory
Contribution - Team
Project Summary -
Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato
provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.
India is well known for the diverse, multi-cuisine food available in its many restaurants and hotel resorts, a reflection of its unity in
diversity. The restaurant business in India is always evolving: more Indians are warming up to the idea of eating restaurant food, whether by
dining out or having food delivered. The growing number of restaurants in every state of India motivated this inspection of the data to uncover
insights, interesting facts and figures about the Indian food industry in each city. This project therefore focuses on analysing the Zomato
restaurant data for each city in India.
The data comes in two separate files whose columns are largely self-explanatory. A brief description of each:
Restaurant names and metadata - Can be used to cluster the restaurants into segments. The data also holds information on cuisine and
cost that can be used in a cost vs. benefit analysis.
Restaurant reviews - Can be used for sentiment analysis. The reviewer metadata can also be used to identify the critics in the industry.
Importing libraries
Loading the dataset
Shape of dataset
Dataset information
Handling duplicate values
Handling missing values
Understanding the columns
Variable description
Data wrangling
Data visualization
Storytelling and experimenting with charts
Text preprocessing
Latent Dirichlet Allocation
Sentiment analysis
Challenges faced
Conclusion
https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 1/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
GitHub Link -
https://github.com/Bhojraj-Jadhav/Zomato-Restaurant-Clustering-and-Sentiment-Analysis
Problem Statement
The project focuses on both customers and the company. The task is to analyze the sentiment of the customer reviews in the data and
draw useful conclusions in the form of visualizations, and to cluster the Zomato restaurants into different segments. The data is visualized
so that it can be analysed at a glance. The analysis also addresses business cases that can directly help customers find the best
restaurant in their locality and help the company grow and work on the areas in which it is currently lagging.
This could help in clustering the restaurants into segments. Also the data has valuable information around cuisine and costing which can be
used in cost vs. benefit analysis
Data could be used for sentiment analysis. Also the metadata of reviewers can be used for identifying the critics in the industry.
Let's Begin !
Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from textblob import TextBlob
from IPython.display import Image
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
import gensim
import warnings
warnings.filterwarnings('ignore')
Dataset Loading
# mounting drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Importing datasets.
meta_df_main=pd.read_csv("/content/drive/MyDrive/ML P3/Zomato Metadata.csv")
# Working on a copy so that the raw data stays untouched.
meta_df = meta_df_main.copy()
meta_df.head()
   Name             Links                                                                Cost  Collections                                        Cuisines                                           Timings
0  Beyond Flavours  https://www.zomato.com/hyderabad/beyond-flavou...  800   Food Hygiene Rated Restaurants in Hyderabad, C...  Chinese, Continental, Kebab, European, South I...  12noo... t... 3:30pm 6:30p... t... 11:30p... (Mon... Sun
1  Paradise         https://www.zomato.com/hyderabad/paradise-gach...  800   Hyderabad's Hottest                                Biryani, North Indian, Chinese                     11 A... to 1... P...
Dataset Information
# Dataset Info.
meta_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 105 non-null object
1 Links 105 non-null object
2 Cost 105 non-null object
3 Collections 51 non-null object
4 Cuisines 105 non-null object
5 Timings 104 non-null object
dtypes: object(6)
memory usage: 5.0+ KB
Duplicate Values
meta_df.duplicated(keep='last').sum()
# Resetting index.
meta_df.reset_index(inplace=True)
meta_df['Name'].duplicated().sum()
meta_df.isnull().sum()
index 0
Name 0
Links 0
Cost 0
Collections 54
Cuisines 0
Timings 1
dtype: int64
meta_df[meta_df['Collections'].isnull()].head()
    index  Name                      Links                                                                Cost  Collections  Cuisines           Timings
7   7      Shah Ghouse Spl Shawarma  https://www.zomato.com/hyderabad/shah-ghouse-s...  300   NaN          Lebanese           12 No... to Midn...
15  15     KFC                       https://www.zomato.com/hyderabad/kfc-gachibowli     500   NaN          Burger, Fast Food  11... to...
...        NorFest - ...                                                                                                                12 No...
plt.figure(figsize=(15,5))
sns.heatmap(meta_df.isnull(),cmap='plasma',annot=False,yticklabels=False)
plt.title(" Visualising Missing Values");
The missing values are concentrated in the Collections column (54 of 105 rows), with one more in Timings. Since Collections holds free-text tags rather than numeric values, the null values are not imputed.
# Dataset Columns.
meta_df.columns
Variables Description
3. Data Wrangling
# Convert the 'Cost' column, deleting the comma and changing the data type into 'int64'.
meta_df['Cost'] = meta_df['Cost'].str.replace(",","").astype('int64')
# Dataset Info.
meta_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 105 non-null int64
1 Name 105 non-null object
2 Links 105 non-null object
3 Cost 105 non-null int64
4 Collections 51 non-null object
5 Cuisines 105 non-null object
6 Timings 104 non-null object
dtypes: int64(2), object(5)
memory usage: 5.9+ KB
4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables
Chart - 1
top10_res_by_cost = meta_df[['Name','Cost']].groupby('Name',as_index=False).sum().sort_values(by='Cost',ascending=False).head(10)
plt.axis("off");
Chart - 2
plt.figure(figsize=(15,6))
# Grouping by restaurant name and sorting by cost for visualisation.
top_10_affor_rest=meta_df[['Name','Cost']].groupby('Name',as_index=False).sum().sort_values(by='Cost',ascending=False).tail(10)
The plot shows the 10 most affordable restaurants, i.e. the ten with the lowest total cost. The y-axis lists the restaurant names, while the
x-axis shows the total cost. The code sorts all restaurants by cost in descending order and keeps the last ten rows.
Chart - 3
<Axes: >
The resulting bar chart shows the top 10 most frequent values in the Collections column on the y-axis and their corresponding counts on the x-
axis. The horizontal orientation of the bars makes it easy to compare the counts of the different collections. The longer the bar, the higher the
count.
To plot the cuisines, we first have to count how often each cuisine term occurs in the data (the frequency of each cuisine). This requires
a few text-cleaning operations: removing stop words, converting all text to lower case, removing punctuation, repeated characters,
numbers and emojis, and finally applying a count vectorizer.
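from collections import Counter
# Counting each pair of adjacent words in the cuisine text, skipping pairs whose first word ends with a comma.
# `words` is assumed to be the list of cuisine tokens built in an earlier cell.
two_words = {' '.join(words):n for words,n in Counter(zip(words, words[1:])).items() if not words[0][-1]==(',')}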
# Sorting the most frequent cuisine at the top and order by descending
two_words_dfc = two_words_dfc.sort_values(by = "Frequency", ascending = False)
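The pair-counting idea can be tried in isolation. The following is a minimal, dependency-free sketch, not the notebook's exact cell: `sample_cuisines` and `count_two_word_cuisines` are hypothetical names, and the sample strings are made up. It counts pairs per restaurant and drops any pair that straddles a comma, so cross-cuisine pairs are excluded.

```python
from collections import Counter

# Hypothetical sample of the 'Cuisines' column (not taken from the dataset).
sample_cuisines = [
    "North Indian, Chinese",
    "Fast Food, North Indian",
    "South Indian, North Indian, Chinese",
]

def count_two_word_cuisines(cuisine_strings):
    """Count adjacent word pairs that stay within one comma-separated entry."""
    counts = Counter()
    for text in cuisine_strings:
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            # A trailing comma on the first word marks the end of one cuisine,
            # so this pair would straddle two different cuisines: skip it.
            if first.endswith(","):
                continue
            counts[f"{first} {second.rstrip(',')}"] += 1
    return counts

two_word_counts = count_two_word_cuisines(sample_cuisines)
print(two_word_counts.most_common(3))
```

Counting per restaurant rather than over one long token list is a design choice made here for clarity; it avoids pairing the last word of one restaurant with the first word of the next.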
     Cuisine Words        Frequency
6    north indian         61
9    indian chinese       27
42   fast food            15
4    south indian         9
5    indian north         9
33   chinese north        8
24   indian continental   6
65   italian north        6
8    biryani north        6
28   food north           6
93   continental italian  6
0    chinese continental  5
34   indian kebab         3
84   indian asian         3
77   indian mughlai       3
19   continental north    3
54   chinese biryani      3
105  desserts cafe        3
53   burger fast          3
18   asian continental    3
Chart - 4
# Visualizing the frequency of the Cuisines.
sns.set_style("whitegrid")
plt.figure(figsize = (18, 8))
sns.barplot(y = "Cuisine Words", x = "Frequency", data = two_words_20c, palette = "magma")
plt.title("Top 20 Two-word Frequencies of Cuisines", size = 20)
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.xlabel("Frequency", size = 20)
plt.ylabel(None)
plt.savefig("Top_20_Two-word_Frequencies_of_Cuisines.png")
plt.show()
The DataFrame contains two columns: "Cuisine Words" and "Frequency". The "Cuisine Words" column lists the most frequent two-word cuisine
terms, while the "Frequency" column shows the number of times each term appears in the dataset. This information helps in
understanding the most common cuisine types in the dataset. It can also be used to identify trends and patterns in the types of
cuisines that are popular or in demand among customers.
review_df.head()
Dataset Information
review_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Restaurant 10000 non-null object
1 Reviewer 9962 non-null object
2 Review 9955 non-null object
3 Rating 9962 non-null object
4 Metadata 9962 non-null object
5 Time 9962 non-null object
6 Pictures 10000 non-null int64
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
Duplicate Values
36
review_df.isnull().sum()
Restaurant 0
Reviewer 38
Review 45
Rating 38
Metadata 38
Time 38
Pictures 0
dtype: int64
As we can see, there are only a few missing values, so they are all dropped; the loss of data is small.
This notebook uses bokeh and plotly to explore the relationships between ratings, reviews and cost, and NLTK and gensim to convert text
to vectors and find relationships between texts. It also shows word clouds.
# proportion or percentage of occurrences for each unique value in the Rating column.
review_df['Rating'].value_counts(normalize=True)
5 0.384662
4 0.238205
1 0.174162
3 0.119755
2 0.068661
4.5 0.006926
3.5 0.004718
2.5 0.001907
1.5 0.000903
Like 0.000100
Name: Rating, dtype: float64
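# Replacing the non-numeric 'Like' value with NaN, converting the column to float and taking the mean.
review_df.loc[review_df['Rating'] == 'Like', 'Rating'] = np.nan
review_df['Rating'] = review_df['Rating'].astype('float64')
print(review_df['Rating'].mean())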
3.601044071880333
review_df['Rating'].value_counts(normalize=True)
5.0 0.3832
4.0 0.2373
1.0 0.1735
3.0 0.1193
2.0 0.0684
4.5 0.0069
3.5 0.0047
3.6 0.0039
2.5 0.0019
1.5 0.0009
Name: Rating, dtype: float64
In the rating distribution, about 38% of reviews are rated 5 and about 24% are rated 4, suggesting that people tend to rate good food highly.
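The second distribution contains a 3.6 bucket, which is consistent with non-numeric or missing ratings having been replaced by the column mean (about 3.60). The cell that does this is not shown, so the following is only a dependency-free sketch of that idea on made-up values; `raw_ratings` and `to_float_or_nan` are hypothetical names. In pandas, the same steps would typically use `pd.to_numeric(..., errors='coerce')` followed by `fillna`.

```python
import math

# Hypothetical raw values mimicking the 'Rating' column: numeric strings,
# one non-numeric entry ('Like') and one missing value (None).
raw_ratings = ["5", "4", "Like", "3.5", None, "1"]

def to_float_or_nan(value):
    """Return the value as a float, or NaN when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

parsed = [to_float_or_nan(v) for v in raw_ratings]

# Mean over the valid ratings only.
valid = [r for r in parsed if not math.isnan(r)]
mean_rating = sum(valid) / len(valid)

# Fill every gap with the mean so that no row has to be discarded.
cleaned = [mean_rating if math.isnan(r) else r for r in parsed]
print(mean_rating, cleaned)
```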
Chart - 5
import plotly.express as px
fig = px.scatter(review_df, x=review_df['Rating'], y=review_df['Review_length'])
fig.update_layout(title_text="Rating vs Review Length")
fig.update_xaxes(ticks="outside", tickwidth=1, tickcolor='crimson',tickangle=45, ticklen=10)
fig.show()
[Figure: scatter plot "Rating vs Review Length"; x-axis: Rating, y-axis: Review_length]
The scatter plot indicates that the length of a review does not affect its rating.
Chart - 6
# Creating polarity variable to see sentiments in reviews (using TextBlob).
from textblob import TextBlob
review_df['Polarity'] = review_df['Review'].apply(lambda x: TextBlob(x).sentiment.polarity)
<Axes: ylabel='Frequency'>
Polarity is float which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement. Subjective sentences
generally refer to personal opinion, emotion or judgment whereas objective refers to factual information. Subjectivity is also a float which lies in
the range of [0,1].
Stop words are very common words in a language (such as "the", "is" and "and") that are removed from text data during natural language
processing. This helps to reduce the dimensionality of the feature space and focus on the more important words in the text.
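As a small illustration of the effect, the sketch below cleans one made-up review without any external library. `STOP_WORDS` is a tiny hand-picked set (NLTK's English list is much longer), and `DOMAIN_WORDS` is a hypothetical stand-in for the notebook's `rest_word` list of restaurant-specific words.

```python
import re

# Tiny illustrative stop-word set; an assumption, not NLTK's actual list.
STOP_WORDS = {"the", "was", "and", "a", "is", "but", "very"}
# Hypothetical domain-specific words that carry little signal in restaurant reviews.
DOMAIN_WORDS = {"food", "place", "restaurant"}

def clean_review(text):
    """Lower-case the text, strip punctuation and drop stop/domain words."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [w for w in text.split()
            if w not in STOP_WORDS and w not in DOMAIN_WORDS]

tokens = clean_review("The food was stale, and the waiter was very impolite!")
print(tokens)
```

Only the words that describe the complaint remain, which is what makes the later word clouds and topic models easier to read.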
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Loading the English stop-word list.
stop_words = stopwords.words('english')
print(stop_words)
rest_word=['order','restaurant','taste','ordered','good','food','table','place','one','also']
rest_word
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yours
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
['order',
'restaurant',
'taste',
'ordered',
'good',
'food',
'table',
'place',
'one',
'also']
Chart - 7
Chart - 8
The chart shows the top 15 reviewers by number of reviews in the dataset. Analyzing the reviews written by these top
reviewers can help improve customer satisfaction and loyalty, ultimately leading to increased revenue and growth.
Chart - 9
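import re
# Removing punctuation, lower-casing and splitting each review into tokens.
review_df['Review']=review_df['Review'].map(lambda x: re.sub(r'[,\.!?]','', x))
review_df['Review']=review_df['Review'].map(lambda x: x.lower())
review_df['Review']=review_df['Review'].map(lambda x: x.split())
# Removing English stop words and the restaurant-specific words.
review_df['Review']=review_df['Review'].apply(lambda x: [item for item in x if item not in stop_words])
review_df['Review']=review_df['Review'].apply(lambda x: [item for item in x if item not in rest_word])
# Stemming token by token (PorterStemmer.stem expects a single word) and joining back into one string.
ps = PorterStemmer()
review_df['Review']=review_df['Review'].map(lambda x: ' '.join(ps.stem(word) for word in x))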
long_string = ','.join(list(review_df['Review'].values))
long_string
wordcloud = WordCloud(background_color="white", max_words=100, contour_width=3, contour_color='steelblue')
wordcloud.generate(long_string)
wordcloud.to_image()
Chart - 10
long_string = ','.join(list(neg_rev['Review'].values))
long_string
wordcloud = WordCloud(background_color="white", max_words=100, contour_width=3, contour_color='steelblue')
wordcloud.generate(long_string)
wordcloud.to_image()
Service, bad chicken, staff behavior and stale food are the key reasons for negative reviews.
Text Cleaning
# Creating word embeddings and t-SNE plot. (for positive and negative reviews).
The reviews are split by rating. Rows where the Rating column is greater than or equal to 3 are selected as the positive reviews, and rows
where the Rating column is less than 3 are selected as the negative reviews, assuming that Rating is a scale from 1 to 5 with 5 being the highest rating.
Create a corpus of words from the negative reviews in the neg_rev DataFrame.
return corpus
[["['corn',",
"'cheese',",
"'balls',",
"'manchow',",
"'soup',",
"'paneer',",
"'shashlik',",
"'sizzler',",
"'sizzler',",
"'stale',",
"'paneer',",
"'smelling',",
"'waiter',",
"'impolite',",
"'even',",
"'accept',",
"'mistake',",
"'never',",
"'going']"],
["['went',",
"'team',",
"'lunch',",
"'worst',",
"'tasteless',",
"'service',",
"'slow',",
"'ac',",
"'working',",
"'we’ve',",
"'requested',",
"'multiple',",
"'times',",
"'use',",
"'please',",
"'don’t',",
"'waste',",
"'money',",
"'strictly',",
"'recommend',",
"'prefer',",
"'beyond',",
"'flavours']"]]
Create a corpus of words from the positive reviews.
return corpus
[["['ambience',",
"'quite',",
"'saturday',",
"'lunch',",
"'cost',",
"'effective',",
"'sate',",
"'brunch',",
"'chill',",
"'friends',",
"'parents',",
"'waiter',",
"'soumen',",
"'das',",
"'really',",
"'courteous',",
"'helpful']"],
["['ambience',",
"'pleasant',",
"'evening',",
"'service',",
"'prompt',",
"'experience',",
"'soumen',",
"'das',",
"'-',",
"'kudos',",
"'service']"]]
LDA
Topic modeling is a process that automatically identifies the topics present in a text corpus and assigns each text to a topic. LDA (Latent
Dirichlet Allocation) is one such method. If observations are words collected into documents, it posits that each document is a
mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.
The output lists the 10 most frequent words of each topic.
Topic 0: veg, starters, buffet, main, course, ambience, non, service, lunch, items
Topic 1: great, best, ambience, music, night, friends, drinks, nice, service, floor
Topic 2: service, great, staff, excellent, visit, awesome, time, experience, thanks, us
Topic 3: chicken, biryani, rice, mutton, fried, veg, soup, pork, cooked, tikka
Topic 4: paneer, butter, indian, curry, north, masala, paratha, dal, roti, naan
Topic 5: ambience, nice, service, really, try, best, great, amazing, must, visit
Topic 6: delivery, ice, cream, time, sauces, shake, chocolate, brownie, hot, awesome
Topic 7: quantity, less, money, quality, shawarma, received, delivered, waste, value, tasty
Topic 8: even, time, service, bad, experience, zomato, worst, us, never, get
Topic 9: chicken, like, burger, spicy, dish, fried, sauce, cheese, sweet, try
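For reference, the generative process that LDA assumes can be written compactly. This is the standard textbook formulation rather than anything taken from the notebook. The symbols are introduced here for illustration only: K topics, D documents, Dirichlet parameters \alpha and \beta, topic-word distributions \phi_k, document-topic mixtures \theta_d, and the topic assignment z_{d,n} of the n-th word w_{d,n} in document d.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Read against the topic list, each "Topic k" line shows the highest-probability words of one \phi_k, and each review corresponds to one mixture \theta_d over the ten topics.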
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.9/dist-packages (from gensim->pyLDAvis) (6.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/dist-packages (from jinja2->pyLDAvis) (2.1.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas>=1.3.4->pyLD
Installing collected packages: funcy, joblib, pyLDAvis
Attempting uninstall: joblib
Found existing installation: joblib 1.1.1
Uninstalling joblib-1.1.1:
Successfully uninstalled joblib-1.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the so
pandas-profiling 3.2.0 requires joblib~=1.1.0, but you have joblib 1.2.0 which is incompatible.
Successfully installed funcy-2.0 joblib-1.2.0 pyLDAvis-3.4.0
import gensim
import pyLDAvis.gensim
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:
`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen
[Figure: interactive pyLDAvis view. Left panel: Intertopic Distance Map (via multidimensional scaling) with axes PC1 and PC2, topics 1 to 10 and the marginal topic distribution. Right panel: Top-30 Most Salient Terms: chicken, biryani, service, quantity, great, delivery, veg, quality, less, ambience, paneer, money, nice, starters, staff, cream, rice, ice, main, buffet, best, course, visit, time, music, excellent, awesome, chocolate, bad, friends.]
The topics and topic terms can be visualised to help assess how interpretable the topic model is.
Sentiment Analysis
import plotly.express as px
0 0.600000
1 0.733333
2 0.540000
3 0.800000
4 0.350000
...
9995 0.277841
9996 0.174621
9997 0.082074
9998 0.560000
9999 0.103030
Name: Polarity, Length: 10000, dtype: float64
0 0.900000
1 0.966667
2 0.740000
3 0.750000
4 0.450000
...
9995 0.646591
9996 0.710606
9997 0.501252
9998 0.620000
9999 0.630303
Name: Subjectivity, Length: 10000, dtype: float64
# Create a function to compute the negative, neutral and positive analysis
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
If the score is less than 0, the function returns the string 'Negative'. If the score is equal to 0, the function returns the string 'Neutral'. If the score
is greater than 0, the function returns the string 'Positive'.
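To see the thresholds in action, the sketch below applies the same rule to a handful of made-up polarity scores and tallies the resulting labels. `sample_polarity` is hypothetical data, and `get_analysis` simply mirrors the notebook's `getAnalysis` under a snake_case name.

```python
from collections import Counter

def get_analysis(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score < 0:
        return "Negative"
    if score == 0:
        return "Neutral"
    return "Positive"

# Hypothetical polarity scores, e.g. as produced by a sentiment library.
sample_polarity = [0.6, -0.25, 0.0, 0.35, -0.8, 0.1]

labels = [get_analysis(s) for s in sample_polarity]
label_counts = Counter(labels)
print(label_counts)
```

Because only an exact 0 counts as neutral, even a very weak score such as 0.1 is labelled positive; a tolerance band around zero would be one possible refinement.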
# Apply get analysis function to separate the sentiments from the column
review_df['Analysis'] = review_df['Polarity'].apply(getAnalysis)
[Figure: scatter plot "Sentiment Analysis"; x-axis: Polarity (-1 to 1), y-axis: Subjectivity (0 to 1), points coloured by Analysis (Positive, Negative, Neutral)]
The resulting plot can provide several insights into the sentiment analysis results. Firstly, the histogram bars on the left side of the plot
(negative polarity) indicate that a significant number of reviews expressed negative sentiments. Similarly, the histogram bars on the right side
of the plot (positive polarity) indicate that a significant number of reviews expressed positive sentiments.
Overall, this plot can provide a quick and easy way to visualize the sentiment polarity distribution of the reviews, which can help in
understanding the overall sentiment of the customers towards the restaurants.
Clustering
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning);
3 Shah Ghouse Hotel & Restaurant 800 biryani, north indian, chinese, seafood, bever...
4 Over The Moon Brew Company 1,200 asian, continental, north indian, chinese, med...
3 Shah Ghouse Hotel & Restaurant 800 [biryani, northindian, chinese, seafood, bever...
4 Over The Moon Brew Company 1,200 [asian, continental, northindian, chinese, med...
from sklearn.preprocessing import MultiLabelBinarizer
# Converting a list of labels for each sample into a binary indicator matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
# converting the Cuisines column in the cuisine_df DataFrame into a binary indicator matrix.
cuisine_df = cuisine_df.join(pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(cuisine_df.pop('Cuisines')),
index=cuisine_df.index, columns=mlb.classes_))
# Overview
cuisine_df.head()
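What a binary indicator matrix looks like can be shown without scikit-learn. The sketch below is a plain-Python illustration on three made-up restaurants, not the notebook's code; `cuisines_by_restaurant` and its contents are hypothetical.

```python
# Hypothetical restaurants with already-tokenised cuisine lists.
cuisines_by_restaurant = {
    "Restaurant A": ["chinese", "northindian"],
    "Restaurant B": ["biryani", "northindian", "seafood"],
    "Restaurant C": ["fastfood"],
}

# Vocabulary of all cuisine labels, sorted so the column order is deterministic.
classes = sorted({c for labels in cuisines_by_restaurant.values() for c in labels})

# One row per restaurant, one 0/1 column per cuisine label.
indicator = {
    name: [1 if c in labels else 0 for c in classes]
    for name, labels in cuisines_by_restaurant.items()
}

print(classes)
for name, row in indicator.items():
    print(name, row)
```

Each cuisine becomes its own column, which is why the resulting DataFrame is wide and mostly zeros, and why a sparse representation is a reasonable choice.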
   Name                            Cost  american  andhra  arabian  asian  bakery  bbq  beverages  biryani  ...  northindian  pizza  salad  seafood
0  Beyond Flavours                 800   0         0       0        0      0       0    0          0        ...  1            0      0      0
3  Shah Ghouse Hotel & Restaurant  800   0         0       0        0      0       0    1          1        ...  1            0      0      1
Restaurant Rating
11 B-Dubs 4.810
67 Paradise 4.700
35 Flechazo 4.660
# Combining the information on restaurant cuisine and ratings into a single DataFrame.
df_cluster = cuisine_df.merge(ratings_df, left_on='Name',right_on='Restaurant')
# Overview
df_cluster.head()
   Name                            Cost  american  andhra  arabian  asian  bakery  bbq  beverages  biryani  ...  salad  seafood  southindian  spanish
0  Beyond Flavours                 800   0         0       0        0      0       0    0          0        ...  0      0        1            0
3  Shah Ghouse Hotel & Restaurant  800   0         0       0        0      0       0    1          1        ...  0      1        0            0
   Over The...
# Checking the data type and null counts for newly created variables.
df_cluster.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 100 non-null object
1 Cost 100 non-null object
2 Rating 100 non-null float64
3 american 100 non-null Sparse[int64, 0]
4 andhra 100 non-null Sparse[int64, 0]
5 arabian 100 non-null Sparse[int64, 0]
6 asian 100 non-null Sparse[int64, 0]
7 bbq 100 non-null Sparse[int64, 0]
8 bakery 100 non-null Sparse[int64, 0]
9 beverages 100 non-null Sparse[int64, 0]
10 biryani 100 non-null Sparse[int64, 0]
11 burger 100 non-null Sparse[int64, 0]
12 cafe 100 non-null Sparse[int64, 0]
13 chinese 100 non-null Sparse[int64, 0]
14 continental 100 non-null Sparse[int64, 0]
15 desserts 100 non-null Sparse[int64, 0]
16 european 100 non-null Sparse[int64, 0]
17 fastfood 100 non-null Sparse[int64, 0]
18 fingerfood 100 non-null Sparse[int64, 0]
19 goan 100 non-null Sparse[int64, 0]
20 healthyfood 100 non-null Sparse[int64, 0]
21 hyderabadi 100 non-null Sparse[int64, 0]
22 icecream 100 non-null Sparse[int64, 0]
23 indonesian 100 non-null Sparse[int64, 0]
24 italian 100 non-null Sparse[int64, 0]
25 japanese 100 non-null Sparse[int64, 0]
26 juices 100 non-null Sparse[int64, 0]
27 kebab 100 non-null Sparse[int64, 0]
28 lebanese 100 non-null Sparse[int64, 0]
29 malaysian 100 non-null Sparse[int64, 0]
30 mediterranean 100 non-null Sparse[int64, 0]
31 mexican 100 non-null Sparse[int64, 0]
32 mithai 100 non-null Sparse[int64, 0]
33 modernindian 100 non-null Sparse[int64, 0]
34 momos 100 non-null Sparse[int64, 0]
35 mughlai 100 non-null Sparse[int64, 0]
36 northeastern 100 non-null Sparse[int64, 0]
37 northindian 100 non-null Sparse[int64, 0]
38 pizza 100 non-null Sparse[int64, 0]
39 salad 100 non-null Sparse[int64, 0]
40 seafood 100 non-null Sparse[int64, 0]
41 southindian 100 non-null Sparse[int64, 0]
42 spanish 100 non-null Sparse[int64, 0]
43 streetfood 100 non-null Sparse[int64, 0]
44 sushi 100 non-null Sparse[int64, 0]
45 thai 100 non-null Sparse[int64, 0]
46 wraps 100 non-null Sparse[int64, 0]
dtypes: Sparse[int64, 0](44), float64(1), object(2)
memory usage: 6.7+ KB
# Visualising relationship between the cost of a meal and the rating of a restaurant
sns.lmplot(y='Rating',x='Cost',data=df_cluster,line_kws={'color' :'red'},height=6.27, aspect=11.7/8.27)
<seaborn.axisgrid.FacetGrid at 0x7f8c28230d00>
The resulting plot shows the relationship between the cost of a meal and the rating of a restaurant, with the regression line indicating the
general trend in the data. This can help identify any patterns or correlations between cost and rating.
K-means Clustering
The plot can help to identify the optimal number of clusters based on the elbow point of the curve, where the rate of decrease in inertia score
slows down significantly.
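The elbow idea can be reproduced on a toy example. The sketch below is a minimal pure-Python version of Lloyd's k-means, not scikit-learn's `KMeans`. The twelve 2-D points are made up, and the farthest-point initialisation is an assumption chosen only to keep the run deterministic. It prints the inertia for k = 1 to 4.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def init_centroids(points, k):
    """Deterministic farthest-point initialisation (illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_inertia(points, k, n_iter=10):
    """Run basic Lloyd iterations and return the final inertia."""
    centroids = init_centroids(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical data: three tight groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this data the inertia falls sharply up to k = 3 and barely changes afterwards, so the elbow sits at three clusters, matching how the points were constructed.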
from sklearn.cluster import KMeans
# Initializing a K-Means clustering model with 5 clusters and a fixed random state.
model = KMeans(random_state=11, n_clusters=5)
model.fit(df_cluster.drop('Name',axis=1))
▾ KMeans
KMeans(n_clusters=5, random_state=11)
# Assigning each restaurant to its nearest cluster centre using the trained model.
cluster_lbl = model.predict(df_cluster.drop('Name',axis=1))
df_cluster['labels'] = cluster_lbl
list_of_cluster=[cluster_0,cluster_1,cluster_2,cluster_3,cluster_4]
# Create a scatter plot of the clusters with annotations for top cuisines
plt.figure(figsize=(15,7))
sns.scatterplot(x='Cost', y='Rating', hue='labels', data=df_cluster)
plt.xlabel('Cost')
plt.ylabel('Rating')
plt.title('Clustering of Restaurants')
plt.show()
For each cluster, the top three cuisines are identified and annotated on the plot. The annotation includes the name of the cluster, its centroid
location (mean cost and mean rating), and the top three cuisines and their counts within the cluster. This plot can be used to visually identify
how the restaurants are grouped and the dominant features of each cluster.
Conclusion
The project was successful in achieving the goals of clustering and sentiment analysis. The clustering part provided insights into the grouping
of restaurants based on their features, which can help in decision making for users and businesses. The sentiment analysis part provided
insights into the sentiments expressed by the users in their reviews, which can help businesses in improving their services and user experience.