Zomato Rating Prediction

The document outlines a mini-project focused on predicting Zomato restaurant ratings using a dataset containing 51,717 entries and 17 features. It details the steps taken for data cleaning, including handling missing values, removing unnecessary columns, and converting data types for analysis. The project aims to prepare the dataset for further analysis and modeling to predict restaurant ratings.


NAME : Kanade Shubhada Sanjay

ROLL NO. : 65
DIV : A

MINI-PROJECT
Zomato-rating-prediction

1. Importing the libraries


In [1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

1.1 Loading the dataset

In [2]: data = pd.read_csv('../input/zomato-bangalore-restaurants/zomato.csv')

In [3]: data
Out[3]:
[DataFrame preview truncated in the export: the output shows the first and last few of 51717 rows × 17 columns, with columns such as url, address, name, online_order and book_table.]

1.2 Checking the shape of the dataset

In [4]: data.shape

Out[4]: (51717, 17)

There are a total of 51,717 samples with 17 features.

In [5]: data.columns

Out[5]: Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
'approx_cost(for two people)', 'reviews_list', 'menu_item',
'listed_in(type)', 'listed_in(city)'],
dtype='object')

1.3 Checking the datatypes

In [6]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
# Column Non-Null Count Dtype
0 url 51717 non-null object
1 address 51717 non-null object
2 name 51717 non-null object
3 online_order 51717 non-null object
4 book_table 51717 non-null object
5 rate 43942 non-null object
6 votes 51717 non-null int64
7 phone 50509 non-null object
8 location 51696 non-null object
9 rest_type 51490 non-null object
10 dish_liked 23639 non-null object
11 cuisines 51672 non-null object
12 approx_cost(for two people) 51371 non-null object
13 reviews_list 51717 non-null object
14 menu_item 51717 non-null object
15 listed_in(type) 51717 non-null object
16 listed_in(city) 51717 non-null object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB

There are many object-type columns. Later we will convert the relevant object columns to numeric types so they can be used for analysis and modeling.
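
As a forward-looking sketch (not part of the original notebook, and using a hypothetical helper name), the yes/no object columns such as 'online_order' and 'book_table' could be mapped to integer codes before modeling, for example:

# Hypothetical sketch: encode categorical object columns as integer codes.
# 'encode_categoricals' is an illustrative name, not from the notebook.
def encode_categoricals(frame, columns):
    frame = frame.copy()
    for col in columns:
        # pandas category codes; missing values become -1
        frame[col] = frame[col].astype('category').cat.codes
    return frame

# Example usage on the raw dataframe loaded above:
# encoded = encode_categoricals(data, ['online_order', 'book_table'])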
2. Data Cleaning
2.1 Checking the missing values

In [7]: data.isnull().sum()

Out[7]: url 0
address 0
name 0
online_order 0
book_table 0
rate 7775
votes 0
phone 1208
location 21
rest_type 227
dish_liked 28078
cuisines 45
approx_cost(for two people) 346
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64

There are many null values. We can clearly see that the 'rate', 'phone', 'location', 'rest_type', 'dish_liked', 'cuisines' and 'approx_cost(for two people)' columns have missing values, so first we have to handle them.
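
Before dropping anything, it can also help to look at the share of missing values per column to decide between dropping and imputing. A minimal sketch (not from the original notebook):

# Percentage of missing values per column, largest first
missing_pct = data.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0].round(2))

A column such as 'dish_liked', with over half of its values missing, would be a candidate for imputation or removal rather than row-wise dropping.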

2.2 Removing the unnecessary columns from the data

In [8]: df = data.drop(['url', 'phone'], axis = 1) # dropped 'url' and 'phone' columns

In [9]: df.head()
Out[9]:
[DataFrame preview truncated in the export: the first five rows after dropping 'url' and 'phone', with columns address, name, online_order, book_table, rate, votes, location, rest_type, dish_liked, ...]
2.3 Handling the null or missing values

In [10]: df.dropna(inplace = True)

In [11]: df.isnull().sum()

Out[11]: address 0
name 0
online_order 0
book_table 0
rate 0
votes 0
location 0
rest_type 0
dish_liked 0
cuisines 0
approx_cost(for two people) 0
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64

Now there are no null values.


2.4 Checking and handling duplicate values

In [12]: df.duplicated().sum()

Out[12]: 11

In [13]: df.drop_duplicates(inplace = True)
df.duplicated().sum()

Out[13]: 0

Now there are no duplicate values.


2.5 Renaming the columns appropriately

In [14]: df = df.rename(columns = {'approx_cost(for two people)':'cost',
                                   'listed_in(type)':'type', 'listed_in(city)': 'city'})

In [15]: df.head()

Out[15]: [DataFrame preview truncated in the export: the first five rows of df with the renamed columns in place.]

Successfully renamed the columns.


2.6 Cleaning the 'cost' column

In [16]: df['cost'].unique()

Out[16]: array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
'750', '200', '850', '1,200', '150', '350', '250', '1,500',
'1,300', '1,000', '100', '900', '1,100', '1,600', '950', '230',
'1,700', '1,400', '1,350', '2,200', '2,000', '1,800', '1,900',
'180', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40',
'1,250', '3,500', '4,000', '2,400', '1,450', '3,200', '6,000',
'1,050', '4,100', '2,300', '120', '2,600', '5,000', '3,700',
'1,650', '2,700', '4,500'], dtype=object)

Here we can see that the data points are strings, and some values, like 5,000 and 6,000, contain a comma (,). We have to remove the ',' from the values and convert them to a numeric type.

In [17]: df['cost'] = df['cost'].apply(lambda x: x.replace(',', ''))  # remove the thousands separator ','
df['cost'] = df['cost'].astype(float)

df['cost'].unique()

Out[17]: array([ 800., 300., 600., 700., 550., 500., 450., 650., 400.,
750., 200., 850., 1200., 150., 350., 250., 1500., 1300.,
1000., 100., 900., 1100., 1600., 950., 230., 1700., 1400.,
1350., 2200., 2000., 1800., 1900., 180., 330., 2500., 2100.,
3000., 2800., 3400., 40., 1250., 3500., 4000., 2400., 1450.,
3200., 6000., 1050., 4100., 2300., 120., 2600., 5000., 3700.,
1650., 2700., 4500.])

Now we have successfully converted the values to a numeric type.
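
An alternative, slightly more defensive conversion (shown only as a sketch, not the notebook's approach) is pd.to_numeric with errors='coerce', which turns any value that still fails to parse into NaN instead of raising an exception:

# Sketch: applied to the raw string column of the original dataframe,
# coercing anything unparseable to NaN instead of raising an error.
raw_cost = data['approx_cost(for two people)']
cost_numeric = pd.to_numeric(raw_cost.str.replace(',', '', regex=False),
                             errors='coerce')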

2.7 Handling the 'rate' column

In [18]: df['rate'].unique()

Out[18]: array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
'3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
'3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
'3.4/5', '2.7/5', '4.7/5', 'NEW', '2.4/5', '2.2/5', '2.3/5',
'4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5',
'2.7 /5', '2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5',
'4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5',
'3.3 /5', '4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5',
'3.5 /5', '3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
'2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

Here the rating column is also of string type; we have to convert it to numeric by removing the '/5' from the given values.

There is also a 'NEW' value which makes no sense, so we have to remove those rows.

In [19]: df = df.loc[df.rate != 'NEW'] # getting rid of 'NEW'

In [20]: df['rate'].unique()
Out[20]: array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
'3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
'3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
'3.4/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '4.8/5',
'3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5', '2.7 /5',
'2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5', '4.4 /5',
'4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5', '3.3 /5',
'4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5', '3.5 /5',
'3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
'2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [21]: df['rate'] = df['rate'].apply(lambda x:x.replace('/5', ''))

df['rate'].unique()

Out[21]: array(['4.1', '3.8', '3.7', '4.6', '4.0', '4.2', '3.9', '3.0', '3.6',
'2.8', '4.4', '3.1', '4.3', '2.6', '3.3', '3.5', '3.8 ', '3.2',
'4.5', '2.5', '2.9', '3.4', '2.7', '4.7', '2.4', '2.2', '2.3',
'4.8', '3.9 ', '4.2 ', '4.0 ', '4.1 ', '2.9 ', '2.7 ', '2.5 ',
'2.6 ', '4.5 ', '4.3 ', '3.7 ', '4.4 ', '4.9', '2.1', '2.0', '1.8',
'3.4 ', '3.6 ', '3.3 ', '4.6 ', '4.9 ', '3.2 ', '3.0 ', '2.8 ',
'3.5 ', '3.1 ', '4.8 ', '2.3 ', '4.7 ', '2.4 ', '2.1 ', '2.2 ',
'2.0 ', '1.8 '], dtype=object)

In [22]: df['rate'] = df['rate'].apply(lambda x: float(x))  # float() also strips the stray trailing spaces, e.g. '3.8 '

df['rate']

Out[22]: 0 4.1
1 4.1
2 3.8
3 3.7
4 3.8
...
51705 3.8
51707 3.9
51708 2.8
51711 2.5
51715 4.3
Name: rate, Length: 23248, dtype: float64

Now our data is cleaned and we can perform visualization. After dropping nulls, duplicates and the 'NEW' ratings, the dataframe has 23,248 rows (down from 51,717).

3. Data Visualization
3.1 Most famous restaurant chains in Bangalore

In [23]: plt.figure(figsize = (17,10))
chains = df['name'].value_counts()[:20]
sns.barplot(x = chains, y = chains.index, palette = 'deep')
plt.title('Most famous restaurant chains in bangalore')
plt.xlabel('Number of outlets')
plt.show()

Insights:

'Onesta', 'Empire Restaurant' & 'KFC' are the most widespread restaurant chains in Bangalore.


3.2 Checking whether restaurants accept online orders

In [24]: v = df['online_order'].value_counts()
fig = plt.gcf()
fig.set_size_inches((10,6))
cmap = plt.get_cmap('Set3')
color = cmap(np.arange(len(v)))

plt.pie(v, labels = v.index, wedgeprops = dict(width = 0.6), autopct = '%0.02f', shadow = True, colors = color)

plt.title('Online orders', fontsize = 20)
plt.show()

Insight:
Most restaurants offer the option of online ordering and delivery.

3.3 Checking whether restaurants offer table booking

In [25]: v = df['book_table'].value_counts()

fig = plt.gcf()
fig.set_size_inches((8,6))
cmap = plt.get_cmap('Set1')
color = cmap(np.arange(len(v)))

plt.pie(v, labels = v.index, wedgeprops = dict(width = 0.6), autopct = '%0.02f', shadow = True, colors = color)

plt.title('Book Table', fontsize = 20)
plt.show()

Insight:

Most restaurants do not offer table booking.

3.4 Rating Distribution

In [26]: plt.figure(figsize = (9,7))
sns.distplot(df['rate'])
plt.title('Rating Distribution')

Out[26]: Text(0.5, 1.0, 'Rating Distribution')


Insight:

We can infer from the plot above that most of the ratings lie between 3.5 and 4.5.
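
To check this impression numerically, one could compute the share of ratings in that interval; a quick sketch (not from the original notebook):

# Fraction of cleaned ratings lying between 3.5 and 4.5 (inclusive)
share = df['rate'].between(3.5, 4.5).mean()
print(f'{share:.1%} of ratings lie between 3.5 and 4.5')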
