0% found this document useful (0 votes)

470 views27 pages

Zomato Restaurant Clustering & Sentiment Analysis - Ipynb - Colaboratory

The document summarizes a project analyzing restaurant data from Zomato in India. The project aims to cluster Zomato restaurants into different segments and perform sentiment analysis on customer reviews. The team loads and cleans the Zomato metadata and reviews datasets. They visualize the data, perform text preprocessing, latent Dirichlet allocation for clustering, and sentiment analysis. The findings will provide insights into the Indian food industry and help customers and Zomato.

Uploaded by

bilal nagori

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

470 views27 pages

Zomato Restaurant Clustering & Sentiment Analysis - Ipynb - Colaboratory

Uploaded by

bilal nagori

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.

ipynb - Colaboratory

Project Name - Zomato Restaurant Clustering and Sentiment Analysis.

Project Type - Unsupervised

Contribution - Team

Team Member 1 - Abhishek Nagpure.

Team Member 2 - Priyanka Bajaj.

Team Member 3 - Bhojraj Jadhav

Project Summary -

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato
provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.

India is quite famous for its diverse multi cuisine available in a large number of restaurants and hotel resorts, which is reminiscent of unity in
diversity. Restaurant business in India is always evolving. More Indians are warming up to the idea of eating restaurant food whether by dining
outside or getting food delivered. The growing number of restaurants in every state of India has been a motivation to inspect the data to get
some insights, interesting facts and figures about the Indian food industry in each city. So, this project focuses on analysing the Zomato
restaurant data for each city in India.

There are two separate files, while the columns are self explanatory. Below is a brief description:

Restaurant names and Metadata - This could help in clustering the restaurants into segments. Also the data has valuable information around
cuisine and costing which can be used in cost vs. benefit analysis Restaurant reviews - Data could be used for sentiment analysis. Also the
metadata of reviewers can be used for identifying the critics in the industry.

Steps that are performed:

Importing libraries
Loading the dataset
Shape of dataset
Dataset information
Handling the duplicate values
Handling missing values.
Undeerstanding the columns
Variable description
Data wrangling
Data visualization
Story telling and experimenting with charts.
Text preprocessing,
Latent Direchlet Allocation
Sentiment analysis
Challenges faced
Conclusion.

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 1/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

GitHub Link -

https://github.com/Bhojraj-Jadhav/Zomato-Restaurant-Clustering-and-Sentiment-Analysis

Problem Statement

The Project focuses on Customers and Company, you have to analyze the sentiments of the reviews given by the customer in the data and
made some useful conclusion in the form of Visualizations. Also, cluster the zomato restaurants into different segments. The data is vizualized
as it becomes easy to analyse data at instant. The Analysis also solve some of the business cases that can directly help the customers finding
the Best restaurant in their locality and for the company to grow up and work on the fields they are currently lagging in.

This could help in clustering the restaurants into segments. Also the data has valuable information around cuisine and costing which can be
used in cost vs. benefit analysis

Data could be used for sentiment analysis. Also the metadata of reviewers can be used for identifying the critics in the industry.

Let's Begin !

1. Know Your Data

Import Libraries

# Import Libraries and modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from wordcloud import WordCloud

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 2/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
from sklearn.feature_extraction.text import TfidfTransformer
from textblob import TextBlob
from IPython.display import Image
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
import gensim

import warnings
warnings.filterwarnings('ignore')

Dataset Loading

# mounting drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# Importing datasets.
meta_df_main=pd.read_csv("/content/drive/MyDrive/ML P3/Zomato Metadata.csv")

# Creating the copy of dataset.

meta_df = meta_df_main.copy()

Dataset First View

# Dataset First Look.

meta_df.head()

Biryani, North 11 A
https://www.zomato.com/hyderabad/paradise- Hyderabad's
1 Paradise 800 Indian, to 1
gach... Hottest
Chinese P

Dataset Rows & Columns count

# Dataset Rows & Columns count.

print(f' We have total {meta_df.shape[0]} rows and {meta_df.shape[1]} columns.')

We have total 105 rows and 6 columns.

Dataset Information

# Dataset Info.

meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 105 non-null object
1 Links 105 non-null object
2 Cost 105 non-null object
3 Collections 51 non-null object
4 Cuisines 105 non-null object
5 Timings 104 non-null object
dtypes: object(6)
memory usage: 5.0+ KB

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 3/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Duplicate Values

# Dataset Duplicate Value Count.

meta_df.duplicated(keep='last').sum()

# Resting Index.

meta_df.reset_index(inplace=True)

# Checking duplicate restaurant name.

meta_df['Name'].duplicated().sum()

Missing Values/Null Values

# Missing Values/Null Values Count.

meta_df.isnull().sum()

index 0
Name 0
Links 0
Cost 0
Collections 54
Cuisines 0
Timings 1
dtype: int64

# Checking for Null values.

meta_df[meta_df['Collections'].isnull()].head()

index Name Links Cost Collections Cuisines Timin

Shah
12 No
Ghouse https://www.zomato.com/hyderabad/shah-
7 7 300 NaN Lebanese to
Spl ghouse-s...
Midn
Shawarma

Burger, 11
https://www.zomato.com/hyderabad/kfc-
15 15 KFC 500 NaN Fast to
gachibowli
Food

NorFest - 12 No

# Visualizing the missing values.

plt.figure(figsize=(15,5))
sns.heatmap(meta_df.isnull(),cmap='plasma',annot=False,yticklabels=False)
plt.title(" Visualising Missing Values");

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 4/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

What did you know about your dataset?

Our data has missing values in collection column. Since the column contains sentiments hence no need to impute the null values.

There are 105 total observation with 6 different features.

Feature like collection and timing has null values.
There is no duplicate values i.e., 105 unique data.
Feature cost represent amount but has object data type because these values are separated by comma ','.
Timing represent operational hour but as it is represented in the form of text has object data type.

2. Understanding Your Variables

# Dataset Columns.

meta_df.columns

Index(['index', 'Name', 'Links', 'Cost', 'Collections', 'Cuisines', 'Timings'], dtype='object')

Variables Description

Zomato Restaurant names and Metadata

1. Name : Name of Restaurants

2. Links : URL Links of Restaurants

3. Cost : Per person estimated Cost of dining

4. Collection : Tagging of Restaurants w.r.t. Zomato categories

5. Cuisines : Cuisines served by Restaurants

6. Timings : Restaurant Timings

Zomato Restaurant reviews

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. MetaData : Reviewer Metadata - No. of Reviews and followers

6. Time: Date and Time of Review

7. Pictures : No. of pictures posted with review

3. Data Wrangling

Data Wrangling Code

# Convert the 'Cost' column, deleting the comma and changing the data type into 'int64'.

meta_df['Cost'] = meta_df['Cost'].str.replace(",","").astype('int64')

Convert the 'Cost' column, deleting the comma and changing the data type into 'int64'

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 5/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
# Dataset Info.

meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 105 non-null int64
1 Name 105 non-null object
2 Links 105 non-null object
3 Cost 105 non-null int64
4 Collections 51 non-null object
5 Cuisines 105 non-null object
6 Timings 104 non-null object
dtypes: int64(2), object(5)
memory usage: 5.9+ KB

4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships
between variables

Chart - 1

# Chart - 1 visualization code.

top10_res_by_cost = meta_df[['Name','Cost']].groupby('Name',as_index=False).sum().sort_values(by='Cost',ascending=False).head(10)

# Creating word cloud for expensive restaurants.

plt.figure(figsize=(15,8))
text = " ".join(name for name in meta_df.sort_values('Cost',ascending=False).Name[:30])

# Creating word_cloud with text as argument in .generate() method.

word_cloud = WordCloud(width = 1400, height = 1400,collocations = False, background_color = 'black').generate(text)

# Display the generated Word Cloud.

plt.imshow(word_cloud, interpolation='bilinear')

plt.axis("off");

Chart - 2

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 6/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

# Affordable price restaurants.

plt.figure(figsize=(15,6))

# Performing groupby To get values accourding to Names and sort it for visualisation.
top_10_affor_rest=meta_df[['Name','Cost']].groupby('Name',as_index=False).sum().sort_values(by='Cost',ascending=False).tail(10)

# Lables for X and Y axis

x = top_10_affor_rest['Cost']
y = top_10_affor_rest['Name']

# Assigning the arguments for chart

plt.title("Top 10 Affordable Restaurant",fontsize=20, weight='bold',color=sns.cubehelix_palette(8, start=.5, rot=-.75)[-3])
plt.ylabel("Name",weight='bold',fontsize=15)
plt.xlabel("Cost",weight='bold',fontsize=15)
plt.xticks(rotation=90)
sns.barplot(x=x, y=y,palette='rocket')
plt.show()

The plot shows the top 10 affordable restaurants based on their total cost. The y-axis represents the restaurant names, while the x-axis shows
the total cost. The affordable restaurants are sorted in ascending order of their cost.

Chart - 3

# Visualisation the value counts of collection.

meta_df['Collections'].value_counts()[0:10].sort_values().plot(figsize=(10,8),kind='barh')

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 7/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

<Axes: >

The resulting bar chart shows the top 10 most frequent values in the Collections column on the y-axis and their corresponding counts on the x-
axis. The horizontal orientation of the bars makes it easy to compare the counts of the different collections. The longer the bar, the higher the
count.

Text preprocessing for the meta dataset.

In Order to plot the cuisines from the data we have to count the frequency of the words from the document.(Frequency of cuisine). For that We
have to perform the opration like removing stop words, Convert all the text into lower case, removing punctuations, removing repeated
charactors, removing Numbers and emojies and finally count vectorizer.

# Downloading and importing the dependancies for text cleaning.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data] Unzipping corpora/stopwords.zip.

# Extracting the stopwords from nltk library for English corpus.

sw = stopwords.words('english')

# Creating a function for removing stopwords.

def stopwords(text):
'''a function for removing the stopword'''

# removing the stop words and lowercasing the selected words

text = [word.lower() for word in str(text).split() if word.lower() not in sw]

# joining the list of words with space separator

return " ".join(text)

# Removing stopwords from Cuisines.

meta_df['Cuisines'] = meta_df['Cuisines'].apply(lambda text: stopwords(text))
meta_df['Cuisines'].head()

0 chinese, continental, kebab, european, south i...

1 biryani, north indian, chinese
2 asian, mediterranean, north indian, desserts
3 biryani, north indian, chinese, seafood, bever...
4 asian, continental, north indian, chinese, med...
Name: Cuisines, dtype: object

Stop words are removed successfully

# Defining the function for removing punctuation.

def remove_punctuation(text):
'''a function for removing punctuation'''
import string

# replacing the punctuations with no space,

# which in effect deletes the punctuation marks
translator = str.maketrans('', '', string.punctuation)

# return the text stripped of punctuation marks

return text.translate(translator)

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 8/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

# Removing punctuation from Cuisines.

meta_df['Cuisines'] = meta_df['Cuisines'].apply(lambda x: remove_punctuation(x))
meta_df['Cuisines'].head()

0 chinese continental kebab european south india...

1 biryani north indian chinese
2 asian mediterranean north indian desserts
3 biryani north indian chinese seafood beverages
4 asian continental north indian chinese mediter...
Name: Cuisines, dtype: object

Punctuations present in the text are removed successfully

# Cleaning and removing Numbers.

import re

# Writing a function to remove repeating characters.

def cleaning_repeating_char(text):
return re.sub(r'(.)1+', r'1', text)

# Removing repeating characters from Cuisines.

meta_df['Cuisines'] = meta_df['Cuisines'].apply(lambda x: cleaning_repeating_char(x))
meta_df['Cuisines'].head()

0 chinese continental kebab european south india...

Removed repeated characters successfully

# Removing the Numbers from the data.

def cleaning_numbers(data):
return re.sub('[0-9]+', '', data)

# Implementing the cleaning.

meta_df['Cuisines'] = meta_df['Cuisines'].apply(lambda x: cleaning_numbers(x))
meta_df['Cuisines'].head()

0 chinese continental kebab european south india...

We dont want numbers in the text Hence removed number successfully

# Top 20 Two word Frequencies of Cuisines.

from collections import Counter
text = ' '.join(meta_df['Cuisines'])

# separating each word from the sentences

words = text.split()

# Extracting the first word from the number for cuisines in the sentence.
two_words = {' '.join(words):n for words,n in Counter(zip(words, words[1:])).items() if not words[0][-1]==(',')}

# Extracting the most frequent cuisine present in the collection.

# Counting a frequency for cuisines.
two_words_dfc = pd.DataFrame(two_words.items(), columns=['Cuisine Words', 'Frequency'])

# Sorting the most frequent cuisine at the top and order by descending
two_words_dfc = two_words_dfc.sort_values(by = "Frequency", ascending = False)

# selecting first top 20 frequent cuisine.

two_words_20c = two_words_dfc[:20]
two_words_20c

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 9/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Cuisine Words Frequency

6 north indian 61

9 indian chinese 27

42 fast food 15

4 south indian 9

5 indian north 9

33 chinese north 8

24 indian continental 6

65 italian north 6

8 biryani north 6

28 food north 6

93 continental italian 6

0 chinese continental 5

34 indian kebab 3

84 indian asian 3

77 indian mughlai 3

19 continental north 3

54 chinese biryani 3

Chart105
-4 desserts cafe 3

53 burger fast 3

18
# Visualizingasian
the continental 3
frequency of the Cuisines.

sns.set_style("whitegrid")
plt.figure(figsize = (18, 8))
sns.barplot(y = "Cuisine Words", x = "Frequency", data = two_words_20c, palette = "magma")
plt.title("Top 20 Two-word Frequencies of Cuisines", size = 20)
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.xlabel("Cuisine Words", size = 20)
plt.ylabel(None)
plt.savefig("Top_20_Two-word_Frequencies_of_Cuisines.png")
plt.show()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 10/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

The DataFrame contains two columns: "Cuisine Words" and "Frequency." The "Cuisine Words" column lists the most frequent two-word cuisine
terms, while the "Frequency" column shows the number of times each two-word cuisine term appears in the dataset.This information can be
helpful in understanding the most common cuisine types in the dataset. It can also be used to identify trends and patterns in the types of
cuisines that are popular or in demand among the customers.

Review Dataset Analysis

# Loading the review dataset.

review_df=pd.read_csv("/content/drive/MyDrive/ML P3/Zomato reviews.csv")

Dataset First View

# First look of dataset.

review_df.head()

Restaurant Reviewer Review Rating Metadata Time Pictures

The ambience was

Beyond Rusha 1 Review , 2 5/25/2019
0 good, food was quite 5 0
Flavours Chakraborty Followers 15:54
good . h...

Ambience is too good

Beyond Anusha 3 Reviews , 2 5/25/2019
1 for a pleasant 5 0
Flavours Tirumalaneedi Followers 14:20
evening. S...

A must try.. great food

Beyond Ashok 2 Reviews , 3 5/24/2019
2 great ambience Thnx 5 0

Dataset Information

# Info about review dataset.

review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Restaurant 10000 non-null object
1 Reviewer 9962 non-null object
2 Review 9955 non-null object
3 Rating 9962 non-null object
4 Metadata 9962 non-null object
5 Time 9962 non-null object
6 Pictures 10000 non-null int64
dtypes: int64(1), object(6)
memory usage: 547.0+ KB

Duplicate Values

# Dataset Duplicate Value Count.

review_df.duplicated().sum()

Missing Values/Null Values

review_df.isnull().sum()

Restaurant 0
Reviewer 38
Review 45
Rating 38
Metadata 38
Time 38
Pictures 0
dtype: int64

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 11/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

As we can see, there are few missing values, so I decide to drop them all because there isn't a big loss

This notebook will use bokeh and plotly to see ratings, reviews and cost relationships , will use NLTK,gensim, to convert text to vectors to find
relationships between text. We will also see wordclouds.

# proportion or percentage of occurrences for each unique value in the Rating column.
review_df['Rating'].value_counts(normalize=True)

5 0.384662
4 0.238205
1 0.174162
3 0.119755
2 0.068661
4.5 0.006926
3.5 0.004718
2.5 0.001907
1.5 0.000903
Like 0.000100
Name: Rating, dtype: float64

# Removing like value and taking the mean in the rating column.
review_df.loc[review_df['Rating'] == 'Like'] = np.nan

# Chenging the data type of rating column

review_df['Rating']= review_df['Rating'].astype('float64')

print(review_df['Rating'].mean())

3.601044071880333

# Filling mean in place of null value

review_df['Rating'].fillna(3.6, inplace=True)

# Changing the data type of review column.

review_df['Review'] = review_df['Review'].astype(str)

# Creating a review_length column to check the frequency of each rating.

review_df['Review_length'] = review_df['Review'].apply(len)

review_df['Rating'].value_counts(normalize=True)

5.0 0.3832
4.0 0.2373
1.0 0.1735
3.0 0.1193
2.0 0.0684
4.5 0.0069
3.5 0.0047
3.6 0.0039
2.5 0.0019
1.5 0.0009
Name: Rating, dtype: float64

The Ratings distribution 38% reviews are 5 rated,23% are 4 rated stating that people do rate good food high.

Chart - 5

# Visualizing the rating column against the review length.

# Polting the frequency of the rating on scatter bar plot

import plotly.express as px
fig = px.scatter(review_df, x=review_df['Rating'], y=review_df['Review_length'])
fig.update_layout(title_text="Rating vs Review Length")
fig.update_xaxes(ticks="outside", tickwidth=1, tickcolor='crimson',tickangle=45, ticklen=10)
fig.show()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 12/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Rating vs Review Length

5000

4000
Review_length

3000

2000

1000
The scatter plot confirms that length of review doesnt impact ratings.

0
Chart - 6
1

5
# Creating polarity variable to see sentiments in reviews.(using textblob)
Rating
from textblob import TextBlob
review_df['Polarity'] = review_df['Review'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Visualizing the polarity using histogram.

review_df['Polarity'].plot(kind='hist', bins=100)

<Axes: ylabel='Frequency'>

Polarity is float which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement. Subjective sentences
generally refer to personal opinion, emotion or judgment whereas objective refers to factual information. Subjectivity is also a float which lies in
the range of [0,1].

Removing Stop words

Stop words are used in a language to removed from text data during natural language processing. This helps to reduce the dimensionality of
the feature space and focus on the more important words in the text.

# Importing dependancies and removing stopwords.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Creating argument for stop words.

stop_words = stopwords.words('english')

print(stop_words)

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 13/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
rest_word=['order','restaurant','taste','ordered','good','food','table','place','one','also']
rest_word

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yours
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
['order',
'restaurant',
'taste',
'ordered',
'good',
'food',
'table',
'place',
'one',
'also']

Chart - 7

# We will extrapolate the 15 profiles that have made more reviews.

# Groupby on the basis of rivewer gives the fequency of the reviews

reviewer_list = review_df.groupby('Reviewer').apply(lambda x: x['Reviewer'].count()).reset_index(name='Review_Count')

# Sorting the frequency of reviews decending

reviewer_list = reviewer_list.sort_values(by = 'Review_Count',ascending=False)

# Selecting the top 15 reviewrs

top_reviewers = reviewer_list[:15]

# Visualizing the top 15 reviewers.

plt.figure(figsize=(13,5))
plt.bar(top_reviewers['Reviewer'], top_reviewers['Review_Count'], color = sns.color_palette("hls", 8))
plt.xticks(rotation=75)
plt.title('Top 15 reviews',size=28)
plt.xlabel("Reviewer's Name",size=15)
plt.ylabel('N of reviews',size=15)

Text(0, 0.5, 'N of reviews')

Chart - 8

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 14/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

# Calculate the average of their ratings review.

review_ratings=review_df.groupby('Reviewer').apply(lambda x:np.average(x['Rating'])).reset_index(name='Average_Ratings')
review_ratings=pd.merge(top_reviewers,review_ratings,how='inner',left_on='Reviewer',right_on='Reviewer')
top_reviewers_ratings=review_ratings[:15]

# Average rating of top reviewers.

plt.figure(figsize=(15,6))
x = top_reviewers_ratings['Average_Ratings']
y = top_reviewers_ratings['Reviewer']
plt.title("Top 15 reviewers with average rating of review",fontsize=20, weight='bold',color=sns.cubehelix_palette(8, start=.5, rot=90)[-5
plt.ylabel("Name",weight='bold',fontsize=15)
plt.xlabel("Average Ratings",weight='bold',fontsize=15)
plt.xticks(rotation=90)
sns.barplot(x=x, y=y,palette='plasma')
plt.show()

The output of top 15 reviewers based on the number of reviews they have made in a given dataset. Analyzing the reviews made by these top
reviewers can help in improving customer satisfaction and loyalty, ultimately leading to increased revenue and growth.

Chart - 9

# Removing Special characters and punctuation from review columns.

import re
review_df['Review']=review_df['Review'].map(lambda x: re.sub('[,\.!?]','', x))
review_df['Review']=review_df['Review'].map(lambda x: x.lower())
review_df['Review']=review_df['Review'].map(lambda x: x.split())
review_df['Review']=review_df['Review'].apply(lambda x: [item for item in x if item not in stop_words])
review_df['Review']=review_df['Review'].apply(lambda x: [item for item in x if item not in rest_word])

# Word cloud for positive reviews.

from wordcloud import WordCloud

review_df['Review']=review_df['Review'].astype(str)

ps = PorterStemmer()
review_df['Review']=review_df['Review'].map(lambda x: ps.stem(x))
long_string = ','.join(list(review_df['Review'].values))
long_string
wordcloud = WordCloud(background_color="white", max_words=100, contour_width=3, contour_color='steelblue')
wordcloud.generate(long_string)
wordcloud.to_image()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 15/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Service, taste, time, starters are key to good review.

Chart - 10

# Creating two datasets for positive and negative reviews.

review_df['Rating']= pd.to_numeric(review_df['Rating'],errors='coerce') # The to_numeric() function in pandas is used to convert a pand

pos_rev = review_df[review_df.Rating>= 3]
neg_rev = review_df[review_df.Rating< 3]

# Negative reviews wordcloud.

long_string = ','.join(list(neg_rev['Review'].values))
long_string
wordcloud = WordCloud(background_color="white", max_words=100, contour_width=3, contour_color='steelblue')
wordcloud.generate(long_string)
wordcloud.to_image()

Service , bad chicken , staff behavior, stale food are key reasons for neagtive reviews

Text Cleaning

# Creating word embeddings and t-SNE plot. (for positive and negative reviews).

from gensim.models import word2vec

pos_rev = review_df[review_df.Rating>= 3]
neg_rev = review_df[review_df.Rating< 3]

Dataframe where the Rating column is greater than or equal to 3. This selects all the positive reviews where as the Rating column is less than 3.
This selects all the negative reviews, assuming that the Rating column is a scale from 1 to 5 with 5 being the highest rating.

Create a corpus of words from the negative reviews in the neg_rev DataFrame.

# Plot for negative reviews.

def build_corpus(data):
"Creates a list of lists containing words from each sentence"
corpus = []
for col in ['Review']:
for sentence in data[col].iteritems():
word_list = sentence[1].split(" ")
corpus.append(word_list)

return corpus

# Display the first two elements of the corpus list

corpus = build_corpus(neg_rev)
corpus[0:2]

[["['corn',",
"'cheese',",
"'balls',",
"'manchow',",
"'soup',",
"'paneer',",
"'shashlik',",

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 16/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
"'sizzler',",
"'sizzler',",
"'stale',",
"'paneer',",
"'smelling',",
"'waiter',",
"'impolite',",
"'even',",
"'accept',",
"'mistake',",
"'never',",
"'going']"],
["['went',",
"'team',",
"'lunch',",
"'worst',",
"'tasteless',",
"'service',",
"'slow',",
"'ac',",
"'working',",
"'we’ve',",
"'requested',",
"'multiple',",
"'times',",
"'use',",
"'please',",
"'don’t',",
"'waste',",
"'money',",
"'strictly',",
"'recommend',",
"'prefer',",
"'beyond',",
"'flavours']"]]

Create a corpus of words from the positive reviews in the neg_rev DataFrame.

# Plot for postive reviews

return corpus

# Display the first two elements of the corpus list

corpus = build_corpus(pos_rev)
corpus[0:2]

[["['ambience',",
"'quite',",
"'saturday',",
"'lunch',",
"'cost',",
"'effective',",
"'sate',",
"'brunch',",
"'chill',",
"'friends',",
"'parents',",
"'waiter',",
"'soumen',",
"'das',",
"'really',",
"'courteous',",
"'helpful']"],
["['ambience',",
"'pleasant',",
"'evening',",
"'service',",
"'prompt',",
"'experience',",
"'soumen',",
"'das',",
"'-',",
"'kudos',",
"'service']"]]

# Checking for the implimented code

review_df['Review']

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 17/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

0 ['ambience', 'quite', 'saturday', 'lunch', 'co...

1 ['ambience', 'pleasant', 'evening', 'service',...
2 ['must', 'try', 'great', 'great', 'ambience', ...
3 ['soumen', 'das', 'arun', 'great', 'guy', 'beh...
4 ['goodwe', 'kodi', 'drumsticks', 'basket', 'mu...
...
9995 ['madhumathi', 'mahajan', 'well', 'start', 'ni...
9996 ['never', 'disappointed', 'us', 'courteous', '...
9997 ['bad', 'rating', 'mainly', '"chicken', 'bone'...
9998 ['personally', 'love', 'prefer', 'chinese', 'c...
9999 ['checked', 'try', 'delicious', 'chinese', 'se...
Name: Review, Length: 10000, dtype: object

LDA

Topic Modeling using LDA

LDA is one of the methods to assign topic to texts. If observations are words collected into documents, it posits that each document is a
mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.

from gensim import corpora

from gensim.models import LdaModel
from gensim.utils import simple_preprocess

Plotting the top 10 most occuring words. Topic modeling is a process to automatically identify topics present in a text object and to assign text
corpus to one category of topic.

# Assume that documents is a list of strings representing text documents

# Tokenize the documents

tokenized_docs = [simple_preprocess(doc) for doc in review_df['Review']]

# Create a dictionary from the tokenized documents

dictionary = corpora.Dictionary(tokenized_docs)

# Convert the tokenized documents to a bag-of-words corpus

bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train an LDA model on the bag-of-words corpus

num_topics = 10 # The number of topics to extract
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the topics and their top 10 terms

for topic in lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False):
print('Topic {}: {}'.format(topic[0], ', '.join([term[0] for term in topic[1]])))

Topic 0: veg, starters, buffet, main, course, ambience, non, service, lunch, items
Topic 1: great, best, ambience, music, night, friends, drinks, nice, service, floor
Topic 2: service, great, staff, excellent, visit, awesome, time, experience, thanks, us
Topic 3: chicken, biryani, rice, mutton, fried, veg, soup, pork, cooked, tikka
Topic 4: paneer, butter, indian, curry, north, masala, paratha, dal, roti, naan
Topic 5: ambience, nice, service, really, try, best, great, amazing, must, visit
Topic 6: delivery, ice, cream, time, sauces, shake, chocolate, brownie, hot, awesome
Topic 7: quantity, less, money, quality, shawarma, received, delivered, waste, value, tasty
Topic 8: even, time, service, bad, experience, zomato, worst, us, never, get
Topic 9: chicken, like, burger, spicy, dish, fried, sauce, cheese, sweet, try

pip install pyLDAvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

Collecting pyLDAvis
Downloading pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 35.0 MB/s eta 0:00:00
Collecting funcy
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: numexpr in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (2.8.4)
Requirement already satisfied: numpy>=1.22.0 in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (1.22.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (1.10.1)
Requirement already satisfied: pandas>=1.3.4 in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (1.4.4)
Collecting joblib>=1.2.0
Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 KB 30.9 MB/s eta 0:00:00
Requirement already satisfied: scikit-learn>=1.0.0 in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (1.2.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (67.6.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (3.1.2)
Requirement already satisfied: gensim in /usr/local/lib/python3.9/dist-packages (from pyLDAvis) (4.3.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas>=1.3.4->pyLDAvis) (2.8
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas>=1.3.4->pyLDAvis) (2022.7.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn>=1.0.0->pyLDAvis)

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 18/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.9/dist-packages (from gensim->pyLDAvis) (6.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/dist-packages (from jinja2->pyLDAvis) (2.1.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas>=1.3.4->pyLD
Installing collected packages: funcy, joblib, pyLDAvis
Attempting uninstall: joblib
Found existing installation: joblib 1.1.1
Uninstalling joblib-1.1.1:
Successfully uninstalled joblib-1.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the so
pandas-profiling 3.2.0 requires joblib~=1.1.0, but you have joblib 1.2.0 which is incompatible.
Successfully installed funcy-2.0 joblib-1.2.0 pyLDAvis-3.4.0

import gensim
import pyLDAvis.gensim
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

lda_visualization = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary, mds='tsne')

pyLDAvis.display(lda_visualization)

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

Selected Topic: 0 Previous Topic Next Topic Clear Topic Slide to adjust relevance metric:(2)
λ=1 0.0 0.2 0

Intertopic Distance Map (via multidimensional scaling) Top-30 Most Salient Terms(
0 500 1,000 1,500 2,000
PC2
6 chicken
biryani
8 service
quantity
great
delivery
veg
quality
less

5 ambience
3
paneer
money
2 nice
4 starters
PC1 staff
cream
rice
ice
main
1 buffet
best
course
7 visit
time
9 music
excellent
awesome
chocolate
bad
10
Marginal topic distribution friends

Overall term frequency

2% Estimated term frequency within the selected topic

5% 1. saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for t

2. relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w); see Siev
10%

The topics and topic terms can be visualised to help assess how interpretable the topic model is.

Sentiment Analysis

from textblob import TextBlob

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 19/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
import plotly.express as px

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# Create a function to get the subjectivity

def subjectivity(text):
return TextBlob(text).sentiment.subjectivity

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# Create a function to get the polarity

def polarity(text):
return TextBlob(text).sentiment.polarity

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# Applying subjectivity and the polarity function to the respective columns

review_df['Subjectivity'] = review_df['Review'].apply(subjectivity)
review_df['Polarity'] = review_df['Review'].apply(polarity)

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# Checking for created columns

review_df['Polarity']

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

0 0.600000
1 0.733333
2 0.540000
3 0.800000
4 0.350000
...
9995 0.277841
9996 0.174621
9997 0.082074
9998 0.560000
9999 0.103030
Name: Polarity, Length: 10000, dtype: float64

# Checking for created columns

review_df['Subjectivity']

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

0 0.900000
1 0.966667
2 0.740000
3 0.750000
4 0.450000
...
9995 0.646591
9996 0.710606
9997 0.501252
9998 0.620000
9999 0.630303
Name: Subjectivity, Length: 10000, dtype: float64

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 20/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
# Create a function to compute the negative, neutral and positive analysis
def getAnalysis(score):
if score <0:
return 'Negative'
elif score == 0:
return 'Neutral'
else:
return 'Positive'

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

If the score is less than 0, the function returns the string 'Negative'. If the score is equal to 0, the function returns the string 'Neutral'. If the score
is greater than 0, the function returns the string 'Positive'.

# Apply get analysis function to separate the sentiments from the column
review_df['Analysis'] = review_df['Polarity'].apply(getAnalysis)

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# plot the polarity and subjectivity

fig = px.scatter(review_df,
x='Polarity',
y='Subjectivity',
color = 'Analysis',
size='Subjectivity')

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# Add a vertical line at x=0 for Netural Reviews

fig.update_layout(title='Sentiment Analysis',
shapes=[dict(type= 'line',
yref= 'paper', y0= 0, y1= 1,
xref= 'x', x0= 0, x1= 0)])
fig.show()

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

Sentiment Analysis

Analysis
1 Positive
Negative
Neutral

0.8
Subjectivity

0.6

0.4

0.2

−1 −0.5 0 0.5 1

Polarity

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 21/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

The resulting plot can provide several insights into the sentiment analysis results. Firstly, the histogram bars on the left side of the plot
(negative polarity) indicate that a significant number of reviews expressed negative sentiments. Similarly, the histogram bars on the right side
of the plot (positive polarity) indicate that a significant number of reviews expressed positive sentiments.

Overall, this plot can provide a quick and easy way to visualize the sentiment polarity distribution of the reviews, which can help in
understanding the overall sentiment of the customers towards the restaurants.

Clustering

warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning);

/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning:

`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argumen

# converting the cuisines to lower case

meta_df_main['Cuisines'] = meta_df_main['Cuisines'].apply(lambda x : x.lower());

# Separating the Name, cost and cuisines column.

cuisine_df = meta_df_main.loc[:,['Name','Cost','Cuisines']]

# Overview of separated variables.

cuisine_df.head()

Name Cost Cuisines

0 Beyond Flavours 800 chinese, continental, kebab, european, south i...

1 Paradise 800 biryani, north indian, chinese

2 Flechazo 1,300 asian, mediterranean, north indian, desserts

3 Shah Ghouse Hotel & Restaurant 800 biryani, north indian, chinese, seafood, bever...

4 Over The Moon Brew Company 1,200 asian, continental, north indian, chinese, med...

# Removing spces from cuisine column.

cuisine_df['Cuisines'] = cuisine_df['Cuisines'].str.replace(' ','')

# Spliting the Words in cuisine.

cuisine_df['Cuisines'] = cuisine_df['Cuisines'].str.split(',')

# Overview on text cleaning.

cuisine_df.head()

Name Cost Cuisines

0 Beyond Flavours 800 [chinese, continental, kebab, european, southi...

1 Paradise 800 [biryani, northindian, chinese]

2 Flechazo 1,300 [asian, mediterranean, northindian, desserts]

3 Shah Ghouse Hotel & Restaurant 800 [biryani, northindian, chinese, seafood, bever...

4 Over The Moon Brew Company 1,200 [asian, continental, northindian, chinese, med...

from sklearn.preprocessing import MultiLabelBinarizer

# converting a list of labels for each sample into a binary indicator matrix
mlb = MultiLabelBinarizer(sparse_output=True)

# converting the Cuisines column in the cuisine_df DataFrame into a binary indicator matrix.
cuisine_df = cuisine_df.join(pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(cuisine_df.pop('Cuisines')),
index=cuisine_df.index, columns=mlb.classes_))

# Overview
cuisine_df.head()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 22/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Name Cost american andhra arabian asian bakery bbq beverages biryani ... northindian pizza salad seafood south

Beyond
0 800 0 0 0 0 0 0 0 0 ... 1 0 0 0
Flavours

1 Paradise 800 0 0 0 0 0 0 0 1 ... 1 0 0 0

2 Flechazo 1,300 0 0 0 1 0 0 0 0 ... 1 0 0 0

Shah
Ghouse
3 800 0 0 0 0 0 0 1 1 ... 1 0 0 1
Hotel &
R t t

# Checking the unique for rating.

review_df['Rating'].unique()

array([5. , 4. , 1. , 3. , 2. , 3.5, 4.5, 2.5, 1.5, 3.6])

# Remove nan rating in Rating column.

review_df.dropna(subset=['Rating'],inplace=True)

# Change data type of rating column to float.

review_df['Rating']= review_df['Rating'].astype('float')

# Dropping the null Values from review column.

review_df.dropna(subset =['Review'], inplace=True)

# Grouping the restaurant on the basis of average rating.

ratings_df = review_df.groupby('Restaurant')['Rating'].mean().reset_index()

# Top highly rated 15 restaurants.

ratings_df .sort_values(by='Rating',ascending = False).head(15)

Restaurant Rating

3 AB's - Absolute Barbecues 4.880

11 B-Dubs 4.810

2 3B's - Buddies, Bar & Barbecue 4.760

67 Paradise 4.700

35 Flechazo 4.660

87 The Indi Grill 4.600

97 Zega - Sheraton Hyderabad Hotel 4.450

64 Over The Moon Brew Company 4.340

16 Beyond Flavours 4.280

19 Cascade - Radisson Hyderabad Hitec City 4.260

84 The Fisherman's Wharf 4.220

34 Feast - Sheraton Hyderabad Hotel 4.220

71 Prism Club & Kitchen 4.215

58 Mazzo - Marriott Executive Apartments 4.190

13 Barbeque Nation 4.120

# Combining the information on restaurant cuisine and ratings into a single DataFrame.
df_cluster = cuisine_df.merge(ratings_df, left_on='Name',right_on='Restaurant')

# Overview
df_cluster.head()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 23/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

Name Cost american andhra arabian asian bakery bbq beverages biryani ... salad seafood southindian spanish str

Beyond
0 800 0 0 0 0 0 0 0 0 ... 0 0 1 0
Flavours

1 Paradise 800 0 0 0 0 0 0 0 1 ... 0 0 0 0

2 Flechazo 1,300 0 0 0 1 0 0 0 0 ... 0 0 0 0

Shah
Ghouse
3 800 0 0 0 0 0 0 1 1 ... 0 1 0 0
Hotel &
Restaurant

Over The

# Changing name and order of columns

df_cluster = df_cluster[['Name', 'Cost','Rating', 'american', 'andhra', 'arabian', 'asian', 'bbq',
'bakery', 'beverages', 'biryani', 'burger', 'cafe', 'chinese',
'continental', 'desserts', 'european', 'fastfood', 'fingerfood', 'goan',
'healthyfood', 'hyderabadi', 'icecream', 'indonesian', 'italian',
'japanese', 'juices', 'kebab', 'lebanese', 'malaysian', 'mediterranean',
'mexican', 'mithai', 'modernindian', 'momos', 'mughlai', 'northeastern',
'northindian', 'pizza', 'salad', 'seafood', 'southindian', 'spanish',
'streetfood', 'sushi', 'thai', 'wraps']]

# Checking the data type and null counts for newly created variables.
df_cluster.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 100 non-null object
1 Cost 100 non-null object
2 Rating 100 non-null float64
3 american 100 non-null Sparse[int64, 0]
4 andhra 100 non-null Sparse[int64, 0]
5 arabian 100 non-null Sparse[int64, 0]
6 asian 100 non-null Sparse[int64, 0]
7 bbq 100 non-null Sparse[int64, 0]
8 bakery 100 non-null Sparse[int64, 0]
9 beverages 100 non-null Sparse[int64, 0]
10 biryani 100 non-null Sparse[int64, 0]
11 burger 100 non-null Sparse[int64, 0]
12 cafe 100 non-null Sparse[int64, 0]
13 chinese 100 non-null Sparse[int64, 0]
14 continental 100 non-null Sparse[int64, 0]
15 desserts 100 non-null Sparse[int64, 0]
16 european 100 non-null Sparse[int64, 0]
17 fastfood 100 non-null Sparse[int64, 0]
18 fingerfood 100 non-null Sparse[int64, 0]
19 goan 100 non-null Sparse[int64, 0]
20 healthyfood 100 non-null Sparse[int64, 0]
21 hyderabadi 100 non-null Sparse[int64, 0]
22 icecream 100 non-null Sparse[int64, 0]
23 indonesian 100 non-null Sparse[int64, 0]
24 italian 100 non-null Sparse[int64, 0]
25 japanese 100 non-null Sparse[int64, 0]
26 juices 100 non-null Sparse[int64, 0]
27 kebab 100 non-null Sparse[int64, 0]
28 lebanese 100 non-null Sparse[int64, 0]
29 malaysian 100 non-null Sparse[int64, 0]
30 mediterranean 100 non-null Sparse[int64, 0]
31 mexican 100 non-null Sparse[int64, 0]
32 mithai 100 non-null Sparse[int64, 0]
33 modernindian 100 non-null Sparse[int64, 0]
34 momos 100 non-null Sparse[int64, 0]
35 mughlai 100 non-null Sparse[int64, 0]
36 northeastern 100 non-null Sparse[int64, 0]
37 northindian 100 non-null Sparse[int64, 0]
38 pizza 100 non-null Sparse[int64, 0]
39 salad 100 non-null Sparse[int64, 0]
40 seafood 100 non-null Sparse[int64, 0]
41 southindian 100 non-null Sparse[int64, 0]
42 spanish 100 non-null Sparse[int64, 0]
43 streetfood 100 non-null Sparse[int64, 0]
44 sushi 100 non-null Sparse[int64, 0]
45 thai 100 non-null Sparse[int64, 0]
46 wraps 100 non-null Sparse[int64, 0]

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 24/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory
dtypes: Sparse[int64, 0](44), float64(1), object(2)
memory usage: 6.7+ KB

# Removing commas from the cost variables.

df_cluster['Cost']= df_cluster['Cost'].str.replace(',','')

# Changing the data type of the cost column.

df_cluster['Cost']= df_cluster['Cost'].astype('float')

# Visualising relationship between the cost of a meal and the rating of a restaurant
sns.lmplot(y='Rating',x='Cost',data=df_cluster,line_kws={'color' :'red'},height=6.27, aspect=11.7/8.27)

<seaborn.axisgrid.FacetGrid at 0x7f8c28230d00>

The resulting plot shows the relationship between the cost of a meal and the rating of a restaurant, with the regression line indicating the
general trend in the data. This can help identify any patterns or correlations between cost and rating.

K-means Clustering

from sklearn.preprocessing import StandardScaler,MinMaxScaler

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer

# Create a list of inertia scores for different numbers of clusters

scores = [KMeans(n_clusters=i+2, random_state=11).fit(df_cluster.drop('Name',axis=1)).inertia_
for i in range(8)]

# Create a line plot of inertia scores versus number of clusters

plt.figure(figsize=(7,7))
sns.lineplot(x=np.arange(2, 10), y=scores)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Inertia of k-Means versus number of clusters')
plt.show()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 25/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

The plot can help to identify the optimal number of clusters based on the elbow point of the curve, where the rate of decrease in inertia score
slows down significantly.

# Initializing a K-Means clustering model with number of clusters and random state.
model = KMeans(random_state=11, n_clusters=5)
model.fit(df_cluster.drop('Name',axis=1))

▾ KMeans
KMeans(n_clusters=5, random_state=11)

# predict the cluster label of a new data point based on a trained clustering model.
cluster_lbl = model.predict(df_cluster.drop('Name',axis=1))

df_cluster['labels'] = cluster_lbl

# Creating the data frame for each cluster.

cluster_0 = df_cluster[df_cluster['labels'] == 0].reset_index()
cluster_1 = df_cluster[df_cluster['labels'] == 1].reset_index()
cluster_2 = df_cluster[df_cluster['labels'] == 2].reset_index()
cluster_3 = df_cluster[df_cluster['labels'] == 3].reset_index()
cluster_4 = df_cluster[df_cluster['labels'] == 4].reset_index()

list_of_cluster=[cluster_0,cluster_1,cluster_2,cluster_3,cluster_4]

# Create a scatter plot of the clusters with annotations for top cuisines
plt.figure(figsize=(15,7))
sns.scatterplot(x='Cost', y='Rating', hue='labels', data=df_cluster)

# Add annotations for top cuisines in each cluster

for i, df in enumerate(list_of_cluster):
top_cuisines = df.drop(['index', 'Name', 'Cost', 'Rating', 'labels'], axis=1).sum().sort_values(ascending=False)[:3]
top_cuisines_str = '\n'.join([f'{cuisine}: {count}' for cuisine, count in top_cuisines.items()])
plt.annotate(f'Top cuisines in cluster {i}\n{top_cuisines_str}',
xy=(df['Cost'].mean(), df['Rating'].mean()),
ha='center', va='center', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.xlabel('Cost')
plt.ylabel('Rating')
plt.title('Clustering of Restaurants')
plt.show()

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 26/27
5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.ipynb - Colaboratory

For each cluster, the top three cuisines are identified and annotated on the plot. The annotation includes the name of the cluster, its centroid
location (mean cost and mean rating), and the top three cuisines and their counts within the cluster. This plot can be used to visually identify
how the restaurants are grouped and the dominant features of each cluster.

# Top cuisines in each cluster

for i,df in enumerate(list_of_cluster):
print(f'Top cuisines in cluster {i}\n', df.drop(['index','Name','Cost','Rating','labels'],axis=1).sum().sort_values(ascending=False)[:3

Top cuisines in cluster 0

northindian 16
chinese 9
fastfood 8
dtype: int64

Top cuisines in cluster 1

northindian 11
continental 6
asian 5
dtype: int64

Top cuisines in cluster 2

northindian 18
chinese 18
biryani 11
dtype: int64

Top cuisines in cluster 3

asian 2
italian 2
continental 2
dtype: int64

Top cuisines in cluster 4

northindian 14
chinese 9
italian 7
dtype: int64

Conclusion

The project was successful in achieving the goals of clustering and sentiment analysis. The clustering part provided insights into the grouping
of restaurants based on their features, which can help in decision making for users and businesses. The sentiment analysis part provided
insights into the sentiments expressed by the users in their reviews, which can help businesses in improving their services and user experience.
Colab paid products - Cancel contracts here
Th l t ti l f f t k h i l ti d d l t i l ith d ti t l i

https://colab.research.google.com/drive/1pMVGVX3X9-3qmICCq7t31_i5zWNuZYTz#printMode=true 27/27

Garrison Managerial Accounting 17e
33% (3)
Garrison Managerial Accounting 17e
6 pages
Time Series Forecasting Project (Shoe Sales)
No ratings yet
Time Series Forecasting Project (Shoe Sales)
26 pages
SOWBHAGYA - Interim Report - Research Project Report Template
No ratings yet
SOWBHAGYA - Interim Report - Research Project Report Template
24 pages
Zomato Data Analysis Presentation
No ratings yet
Zomato Data Analysis Presentation
16 pages
REPORT ON DATA ANALYTICS - Docx NANMA
No ratings yet
REPORT ON DATA ANALYTICS - Docx NANMA
52 pages
Internship Report On Ai
No ratings yet
Internship Report On Ai
32 pages
Hackathon Overall Travel Experience of Traveling in Shinkansen Bullet Train Merging Two Data Set
No ratings yet
Hackathon Overall Travel Experience of Traveling in Shinkansen Bullet Train Merging Two Data Set
59 pages
1577141-AI PPT-unit-1 - PROJECT CYCLE-full Chapter With Updated Syllabus
No ratings yet
1577141-AI PPT-unit-1 - PROJECT CYCLE-full Chapter With Updated Syllabus
58 pages
Capstone Project
No ratings yet
Capstone Project
47 pages
FRA Project Report - Chilla Nagaraju
100% (1)
FRA Project Report - Chilla Nagaraju
66 pages
Capstone Notes-Model
No ratings yet
Capstone Notes-Model
20 pages
Starbucks Sentiment Analysis Using VADER
No ratings yet
Starbucks Sentiment Analysis Using VADER
23 pages
Credit Card Default Prediction: Final Project Report
No ratings yet
Credit Card Default Prediction: Final Project Report
28 pages
PYF Project LearnerNotebook LowCode
No ratings yet
PYF Project LearnerNotebook LowCode
6 pages
Data Visualization R Programming Power Bi Lab Record
No ratings yet
Data Visualization R Programming Power Bi Lab Record
29 pages
Esm 101 Module Updated Notes Mmust-2
No ratings yet
Esm 101 Module Updated Notes Mmust-2
80 pages
Predictive Modeling
No ratings yet
Predictive Modeling
21 pages
Hotels Review Classification Final
No ratings yet
Hotels Review Classification Final
34 pages
Dinya Antony MRA ML2
100% (1)
Dinya Antony MRA ML2
24 pages
Wholesale Custumer
100% (1)
Wholesale Custumer
32 pages
Project-Time Series Forecasting
100% (1)
Project-Time Series Forecasting
10 pages
Stock Market Volatility Final
No ratings yet
Stock Market Volatility Final
27 pages
Managing Different Stages of CRM: Dr. Savita Sharma
No ratings yet
Managing Different Stages of CRM: Dr. Savita Sharma
28 pages
Capstone Notes-1
No ratings yet
Capstone Notes-1
18 pages
Mini Project Time Series
No ratings yet
Mini Project Time Series
55 pages
Buisiness Reoprt Extended As Project Report
No ratings yet
Buisiness Reoprt Extended As Project Report
18 pages
Project Data Mining
No ratings yet
Project Data Mining
55 pages
Chapter 5 - Classification Problems
100% (1)
Chapter 5 - Classification Problems
25 pages
Power BI Guide
100% (2)
Power BI Guide
46 pages
Data Mining Assignment: Sudhanva Saralaya
100% (1)
Data Mining Assignment: Sudhanva Saralaya
16 pages
TSF - Project
100% (1)
TSF - Project
5 pages
Zomato Customer Satisfaction Code and Results
No ratings yet
Zomato Customer Satisfaction Code and Results
11 pages
Machine Learning Guided Project
No ratings yet
Machine Learning Guided Project
23 pages
Machine Learning - Customer Segment Project. Approved by UDACITY
100% (1)
Machine Learning - Customer Segment Project. Approved by UDACITY
19 pages
Interview Preparations - NielsenIQ
No ratings yet
Interview Preparations - NielsenIQ
1 page
Predictive Modeling
No ratings yet
Predictive Modeling
38 pages
Nagareddy 18-Nov-2023
No ratings yet
Nagareddy 18-Nov-2023
20 pages
AS Extended Buisnesss Report
No ratings yet
AS Extended Buisnesss Report
25 pages
ML - Project - Business Report
No ratings yet
ML - Project - Business Report
43 pages
Uber Drive Practice DP PDF
No ratings yet
Uber Drive Practice DP PDF
10 pages
VARUNSAINI - 13 Nov 2022
No ratings yet
VARUNSAINI - 13 Nov 2022
14 pages
Predicting Mode of Transport (ML) : Akalya KS
No ratings yet
Predicting Mode of Transport (ML) : Akalya KS
17 pages
Project Detailed Review
No ratings yet
Project Detailed Review
9 pages
Telecommunication Customer Churn (New)
100% (1)
Telecommunication Customer Churn (New)
23 pages
BENCHMARKING (Tata Steel)
No ratings yet
BENCHMARKING (Tata Steel)
42 pages
Clustering Analysis: Prepared by Muralidharan N
100% (1)
Clustering Analysis: Prepared by Muralidharan N
16 pages
Cars Project PDF
No ratings yet
Cars Project PDF
9 pages
Mra Project - Milestone1: Student Name: Gowri Srinivasan Batch: Dsba Online Mar 20
No ratings yet
Mra Project - Milestone1: Student Name: Gowri Srinivasan Batch: Dsba Online Mar 20
30 pages
Capstone Presentation
No ratings yet
Capstone Presentation
58 pages
Capstone Project ON Impact of Quality Management Systems On Performance of A Company (Automobile Sector)
No ratings yet
Capstone Project ON Impact of Quality Management Systems On Performance of A Company (Automobile Sector)
49 pages
House Price Prediction Using Data Science
No ratings yet
House Price Prediction Using Data Science
8 pages
Data Preparation
No ratings yet
Data Preparation
12 pages
Big Data Unit 2
No ratings yet
Big Data Unit 2
19 pages
Data Mining For Customer Segmentation
No ratings yet
Data Mining For Customer Segmentation
13 pages
Report On Linear Regression Using R
No ratings yet
Report On Linear Regression Using R
15 pages
PG Program Dsba Classroom
No ratings yet
PG Program Dsba Classroom
16 pages
Assignment 02
No ratings yet
Assignment 02
9 pages
SMDM Report
No ratings yet
SMDM Report
12 pages
Tourism Adoption Project Report
No ratings yet
Tourism Adoption Project Report
14 pages
Machine Learning Projects PDF
No ratings yet
Machine Learning Projects PDF
5 pages
Internship Report
No ratings yet
Internship Report
12 pages
Sample - Customer Churn Prediction Python Documentation
No ratings yet
Sample - Customer Churn Prediction Python Documentation
33 pages
Data Analyst Roadmap
No ratings yet
Data Analyst Roadmap
2 pages
SMDM Project Report Dipti
No ratings yet
SMDM Project Report Dipti
14 pages
Capstone Project Proposal - HR Audit
No ratings yet
Capstone Project Proposal - HR Audit
3 pages
Data Science Unit 1 Unit 2
No ratings yet
Data Science Unit 1 Unit 2
49 pages
Internship Report - K
No ratings yet
Internship Report - K
30 pages
Assignment 2 Solution
No ratings yet
Assignment 2 Solution
6 pages
Geospatial Crime Analytics
No ratings yet
Geospatial Crime Analytics
41 pages
Big Data, Data Analytics and Artificial Intelligence in Accounting: An
No ratings yet
Big Data, Data Analytics and Artificial Intelligence in Accounting: An
35 pages
1-Intro - Review of Basic Tools 26716
No ratings yet
1-Intro - Review of Basic Tools 26716
39 pages
Power System Simulator: Power World, Students' Evaluation Package
No ratings yet
Power System Simulator: Power World, Students' Evaluation Package
13 pages
Capstone Presentation
No ratings yet
Capstone Presentation
9 pages
Challenges and Opportunities in The Implementation of AI in Manufacturing: A Bibliometric Analysis
No ratings yet
Challenges and Opportunities in The Implementation of AI in Manufacturing: A Bibliometric Analysis
38 pages
2-MUTUALLY COUPLED BRANCHES IN Ybus
100% (1)
2-MUTUALLY COUPLED BRANCHES IN Ybus
7 pages
UNIT 3B Data Visualization
No ratings yet
UNIT 3B Data Visualization
42 pages
Power System Simulator: Power World, Students' Evaluation Package
No ratings yet
Power System Simulator: Power World, Students' Evaluation Package
13 pages
Seaborn
No ratings yet
Seaborn
7 pages
1-Intro - Review of Basic Tools 26716
No ratings yet
1-Intro - Review of Basic Tools 26716
39 pages
Constitutional Development in Pakistan: Dr. Tahir Jamil
No ratings yet
Constitutional Development in Pakistan: Dr. Tahir Jamil
79 pages
Creating A Boiler Plant Dashboard in Power BI Involves Several Steps
No ratings yet
Creating A Boiler Plant Dashboard in Power BI Involves Several Steps
2 pages
Predictive Commitments Management
No ratings yet
Predictive Commitments Management
13 pages
Geo - 10 Chapter-8
No ratings yet
Geo - 10 Chapter-8
16 pages
Descriptiv Minor1
No ratings yet
Descriptiv Minor1
2 pages
Msds iuFUXPCU
No ratings yet
Msds iuFUXPCU
47 pages
Power System Simulator
No ratings yet
Power System Simulator
11 pages
Modul - Data Representation Ver 3.0 - Updated
No ratings yet
Modul - Data Representation Ver 3.0 - Updated
63 pages
Assessment Information - MGT2120
No ratings yet
Assessment Information - MGT2120
12 pages
Carbon Accounting Reporting and AI Driven Climate Solutions
No ratings yet
Carbon Accounting Reporting and AI Driven Climate Solutions
8 pages
International Standards
No ratings yet
International Standards
32 pages
Thesis Leadership and Innovation
100% (1)
Thesis Leadership and Innovation
6 pages
FAFEN Analysis Delimitation Proposals 2018 03 27-Final
No ratings yet
FAFEN Analysis Delimitation Proposals 2018 03 27-Final
28 pages
The Last March of The Lemmings Hegel's Critique of Rousseau's Theory of The General Will
No ratings yet
The Last March of The Lemmings Hegel's Critique of Rousseau's Theory of The General Will
13 pages
CRPF Head Constable Exam Syllabus 2023
No ratings yet
CRPF Head Constable Exam Syllabus 2023
4 pages
L5: Simple Sequential Circuits and Verilog
No ratings yet
L5: Simple Sequential Circuits and Verilog
11 pages
Tableau Resume Sample
100% (1)
Tableau Resume Sample
4 pages
ATSmart Robot
No ratings yet
ATSmart Robot
4 pages
PRASHANT GORADE's Resume
No ratings yet
PRASHANT GORADE's Resume
2 pages
College of Asia and The Pacific, The Australian National University Australian National University
No ratings yet
College of Asia and The Pacific, The Australian National University Australian National University
4 pages
Pakistan National Assembly 2024 Constituency-All-Maps
No ratings yet
Pakistan National Assembly 2024 Constituency-All-Maps
1 page
Data Science Classification Etc
No ratings yet
Data Science Classification Etc
19 pages
Hon S. Chan: Cadre Personnel Management in China: The
No ratings yet
Hon S. Chan: Cadre Personnel Management in China: The
32 pages
Harshil Project Proposal BTECH
No ratings yet
Harshil Project Proposal BTECH
3 pages
189098
No ratings yet
189098
28 pages
Data Analysis and Visualisation
No ratings yet
Data Analysis and Visualisation
3 pages
Solutions To Four Problems of Rolling ADie
No ratings yet
Solutions To Four Problems of Rolling ADie
5 pages
S0003055410000626 PDF
No ratings yet
S0003055410000626 PDF
19 pages
Unsymmetrical Fault: Short Circuit Calculation Chapter 12 Grainger
No ratings yet
Unsymmetrical Fault: Short Circuit Calculation Chapter 12 Grainger
19 pages
Pacific Affairs, University of British Columbia
No ratings yet
Pacific Affairs, University of British Columbia
3 pages
Fin Irjmets1711372102
No ratings yet
Fin Irjmets1711372102
3 pages
Unsymmetrical Fault: Short Circuit Calculation Chapter 12 Grainger
No ratings yet
Unsymmetrical Fault: Short Circuit Calculation Chapter 12 Grainger
19 pages
New Research Paper
No ratings yet
New Research Paper
6 pages
Power Flow Analysis: Chapter 9-Grainger
No ratings yet
Power Flow Analysis: Chapter 9-Grainger
9 pages
Bus Impedance Matrix: Chapter 8-Grainger
No ratings yet
Bus Impedance Matrix: Chapter 8-Grainger
9 pages
Assignment
No ratings yet
Assignment
5 pages
Bus Impedance Matrix: Chapter 8-Grainger
No ratings yet
Bus Impedance Matrix: Chapter 8-Grainger
9 pages
Power Flow Analysis: Chapter 9-Grainger
No ratings yet
Power Flow Analysis: Chapter 9-Grainger
9 pages
Bus Impedance Matrix: Chapter 8-Grainger
No ratings yet
Bus Impedance Matrix: Chapter 8-Grainger
9 pages
2-MUTUALLY COUPLED BRANCHES IN Ybus
No ratings yet
2-MUTUALLY COUPLED BRANCHES IN Ybus
7 pages
NA-246 Constituency Map
No ratings yet
NA-246 Constituency Map
1 page

Zomato Restaurant Clustering & Sentiment Analysis - Ipynb - Colaboratory

Uploaded by

Zomato Restaurant Clustering & Sentiment Analysis - Ipynb - Colaboratory

Uploaded by

5/13/23, 10:51 AM Zomato Restaurant Clustering & Sentiment Analysis.

Project Name - Zomato Restaurant Clustering and Sentiment Analysis.

Project Type - Unsupervised

Team Member 1 - Abhishek Nagpure.

Team Member 2 - Priyanka Bajaj.

Team Member 3 - Bhojraj Jadhav

Steps that are performed:

1. Know Your Data

# Import Libraries and modules

# Creating the copy of dataset.

Dataset First View

# Dataset First Look.

Name Links Cost Collections Cuisines Timing

Dataset Rows & Columns count

# Dataset Rows & Columns count.

print(f' We have total {meta_df.shape[0]} rows and {meta_df.shape[1]} columns.')

We have total 105 rows and 6 columns.

# Dataset Duplicate Value Count.

# Checking duplicate restaurant name.

Missing Values/Null Values

# Missing Values/Null Values Count.

# Checking for Null values.

index Name Links Cost Collections Cuisines Timin

# Visualizing the missing values.

What did you know about your dataset?

There are 105 total observation with 6 different features.

2. Understanding Your Variables

Index(['index', 'Name', 'Links', 'Cost', 'Collections', 'Cuisines', 'Timings'], dtype='object')

Zomato Restaurant names and Metadata

1. Name : Name of Restaurants

2. Links : URL Links of Restaurants

3. Cost : Per person estimated Cost of dining

4. Collection : Tagging of Restaurants w.r.t. Zomato categories

5. Cuisines : Cuisines served by Restaurants

6. Timings : Restaurant Timings

Zomato Restaurant reviews

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. MetaData : Reviewer Metadata - No. of Reviews and followers

6. Time: Date and Time of Review

7. Pictures : No. of pictures posted with review

Data Wrangling Code

# Chart - 1 visualization code.

# Creating word cloud for expensive restaurants.

# Creating word_cloud with text as argument in .generate() method.

# Display the generated Word Cloud.

# Affordable price restaurants.

# Lables for X and Y axis

# Assigning the arguments for chart

# Visualisation the value counts of collection.

Text preprocessing for the meta dataset.

# Downloading and importing the dependancies for text cleaning.

[nltk_data] Downloading package stopwords to /root/nltk_data...

# Extracting the stopwords from nltk library for English corpus.

# Creating a function for removing stopwords.

# removing the stop words and lowercasing the selected words

# joining the list of words with space separator

# Removing stopwords from Cuisines.

0 chinese, continental, kebab, european, south i...

Stop words are removed successfully

# Defining the function for removing punctuation.

# replacing the punctuations with no space,

# return the text stripped of punctuation marks

# Removing punctuation from Cuisines.

0 chinese continental kebab european south india...

Punctuations present in the text are removed successfully

# Cleaning and removing Numbers.

# Writing a function to remove repeating characters.

# Removing repeating characters from Cuisines.

0 chinese continental kebab european south india...

Removed repeated characters successfully

# Removing the Numbers from the data.

# Implementing the cleaning.

0 chinese continental kebab european south india...