Zomato Rating Prediction
ROLL NO. : 65
DIV : A
MINI-PROJECT
Zomato-rating-prediction
import warnings
warnings.filterwarnings('ignore')
In [3]: data
Out[3]:
       url                                                address                                            name                                               online_order
2      https://www.zomato.com/SanchurroBangalore?cont...  1112, Next to KIMS Medical College, 17th Cross...  San Churro Cafe                                    Yes
...
51712  https://www.zomato.com/bangalore/best-brews-fo...  Four Points by Sheraton Bengaluru, 43/3, White...  Best Brews - Four Points by Sheraton Bengaluru...  No
51714  https://www.zomato.com/bangalore/plunge-sherat...  Sheraton Grand Bengaluru Whitefield Hotel & Co...  Plunge - Sheraton Grand Bengaluru Whitefield H...  No
51716  https://www.zomato.com/bangalore/the-nest-the-...  ITPL Main Road, KIADB Export Promotion Industr...  The Nest - The Den Bengaluru                       No
[output truncated: the other rows (including Spice Elephant and Addhuri Udupi Bhojana) are only partly legible, and the columns from book_t... onward are cut off]
In [4]: data.shape
In [5]: data.columns
In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
# Column Non-Null Count Dtype
Many of the columns have the object dtype. We will convert them to numeric types later.
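To see which columns will need conversion, the non-numeric columns can be listed directly. This is a minimal sketch on a toy frame; `sample` is a hypothetical stand-in that reuses two rows' worth of values from the dataset, not the full data:

```python
import pandas as pd

# Toy stand-in for `data` (hypothetical name); only three columns are shown.
sample = pd.DataFrame({
    "name": ["Jalsa", "Spice Elephant"],
    "rate": ["4.1/5", "4.1/5"],
    "votes": [775, 787],
})

# Every column that is not stored as a number still needs conversion.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real frame the same call would be made on `data` instead of `sample`.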
2. Data Cleaning
2.1 Checking for missing values
In [7]: data.isnull().sum()
Out[7]: url 0
address 0
name 0
online_order 0
book_table 0
rate 7775
votes 0
phone 1208
location 21
rest_type 227
dish_liked 28078
cuisines 45
approx_cost(for two people) 346
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64
There are many null values. The columns 'rate', 'phone', 'location', 'rest_type', 'dish_liked', 'cuisines' and 'approx_cost(for two people)' all have missing values, so we first have to handle them.
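The cleaning cell itself is missing from this export, but the later output shows a frame `df` without the `url` and `phone` columns and with no nulls left. The following is a minimal sketch of one way to get there, assuming those two columns were dropped and every remaining incomplete row was discarded. The toy values are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; url and phone values are placeholders.
data = pd.DataFrame({
    "url": ["u0", "u1", "u2"],
    "name": ["Jalsa", "Spice Elephant", "San Churro Cafe"],
    "phone": ["p0", None, "p2"],
    "rate": ["4.1/5", "4.1/5", np.nan],
})

# Assumption: url and phone are dropped as identifiers, then any row
# that still has at least one missing value is removed.
df = data.drop(columns=["url", "phone"]).dropna()

print(df.isnull().sum())
```

Dropping rows is the simplest option but discards data; imputing values would be an alternative that the source does not discuss.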
In [9]: df.head()
Out[9]:
   address                                            name             online_order  book_table  rate   votes  location      rest_type            dish_liked
0  942, 21st Main Road, 2nd Stage, Banashankari, ...  Jalsa            Yes           Yes         4.1/5  775    Banashankari  Casual Dining        ...
1  2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...  Spice Elephant   Yes           No          4.1/5  787    Banashankari  Casual Dining        ...
2  1112, Next to KIMS Medical College, 17th Cross...  San Churro Cafe  Yes           No          3.8/5  918    Banashankari  Cafe, Casual Dining  ...
[output truncated: the remaining rows and the columns after dish_liked are cut off]
In [11]: df.isnull().sum()
Out[11]: address 0
name 0
online_order 0
book_table 0
rate 0
votes 0
location 0
rest_type 0
dish_liked 0
cuisines 0
approx_cost(for two people) 0
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64
In [12]: df.duplicated().sum()
Out[12]: 11
Out[13]: 0
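The cell between these two outputs is not shown. Going from 11 duplicated rows to 0 is consistent with dropping the duplicates, so here is a minimal sketch of that step on toy data (an assumption about what the missing cell did):

```python
import pandas as pd

# Toy frame with one repeated row; values are illustrative only.
df = pd.DataFrame({
    "name": ["Jalsa", "Jalsa", "Spice Elephant"],
    "votes": [775, 775, 787],
})

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and discard the repeats.
df = df.drop_duplicates()
```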
In [15]: df.head()
Out[15]:
   address                                            name             online_order  book_table  rate   votes  location      rest_type            dish_liked
0  942, 21st Main Road, 2nd Stage, Banashankari, ...  Jalsa            Yes           Yes         4.1/5  775    Banashankari  Casual Dining        ...
2  1112, Next to KIMS Medical College, 17th Cross...  San Churro Cafe  Yes           No          3.8/5  918    Banashankari  Cafe, Casual Dining  ...
[output truncated: the remaining rows and the columns after dish_liked are cut off]
In [16]: df['cost'].unique()
Out[16]: array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
'750', '200', '850', '1,200', '150', '350', '250', '1,500',
'1,300', '1,000', '100', '900', '1,100', '1,600', '950', '230',
'1,700', '1,400', '1,350', '2,200', '2,000', '1,800', '1,900',
'180', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40',
'1,250', '3,500', '4,000', '2,400', '1,450', '3,200', '6,000',
'1,050', '4,100', '2,300', '120', '2,600', '5,000', '3,700',
'1,650', '2,700', '4,500'], dtype=object)
Here the values are strings, and some of them, such as 5,000 and 6,000, contain a comma. We have to remove the ',' from these values and convert them to a numeric type.
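The conversion code is missing from the export. This minimal sketch assumes the column `approx_cost(for two people)` was renamed to `cost` in an earlier cell (the notebook uses that name from here on) and that the commas are stripped before casting:

```python
import pandas as pd

# A few of the string values listed in the output above.
df = pd.DataFrame({"approx_cost(for two people)": ["800", "1,200", "6,000", "40"]})

# Shorter column name, matching the `cost` column used in the notebook.
df = df.rename(columns={"approx_cost(for two people)": "cost"})

# Remove the thousands separator, then cast the strings to floats.
df["cost"] = df["cost"].str.replace(",", "", regex=False).astype(float)

print(df["cost"].unique())
```

Passing `regex=False` treats the comma as a literal character, so no pattern escaping is needed.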
df['cost'].unique()
Out[17]: array([ 800., 300., 600., 700., 550., 500., 450., 650., 400.,
750., 200., 850., 1200., 150., 350., 250., 1500., 1300.,
1000., 100., 900., 1100., 1600., 950., 230., 1700., 1400.,
1350., 2200., 2000., 1800., 1900., 180., 330., 2500., 2100.,
3000., 2800., 3400., 40., 1250., 3500., 4000., 2400., 1450.,
3200., 6000., 1050., 4100., 2300., 120., 2600., 5000., 3700.,
1650., 2700., 4500.])
In [18]: df['rate'].unique()
The rating column is also a string type, so we have to convert it to a numeric type. To do that we have to remove the '/5' suffix from the values.
There is also a 'NEW' value, which carries no rating information, so we have to remove those rows.
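The cells that perform this cleaning are not in the export. The sketch below shows one way to carry out the steps just described, using toy values in the same formats as the `rate` column. The exact calls are an assumption, not the author's code:

```python
import pandas as pd

# Toy values in the formats seen in the `rate` column.
df = pd.DataFrame({"rate": ["4.1/5", "3.8 /5", "NEW", "2.9/5"]})

# Drop rows whose rating is the placeholder 'NEW'.
df = df[df["rate"] != "NEW"].copy()

# Strip the '/5' suffix and any stray spaces, then cast to float.
df["rate"] = (
    df["rate"]
    .str.replace("/5", "", regex=False)
    .str.strip()
    .astype(float)
)

print(df["rate"].tolist())
```

The `.str.strip()` step matters because some values, such as '3.8 /5', leave a trailing space once the suffix is removed.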
In [20]: df['rate'].unique()
Out[20]: array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
'3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
'3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
'3.4/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '4.8/5',
'3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5', '2.7 /5',
'2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5', '4.4 /5',
'4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5', '3.3 /5',
'4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5', '3.5 /5',
'3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
'2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
df['rate'].unique()
Out[21]: array(['4.1', '3.8', '3.7', '4.6', '4.0', '4.2', '3.9', '3.0', '3.6',
'2.8', '4.4', '3.1', '4.3', '2.6', '3.3', '3.5', '3.8 ', '3.2',
'4.5', '2.5', '2.9', '3.4', '2.7', '4.7', '2.4', '2.2', '2.3',
'4.8', '3.9 ', '4.2 ', '4.0 ', '4.1 ', '2.9 ', '2.7 ', '2.5 ',
'2.6 ', '4.5 ', '4.3 ', '3.7 ', '4.4 ', '4.9', '2.1', '2.0', '1.8',
'3.4 ', '3.6 ', '3.3 ', '4.6 ', '4.9 ', '3.2 ', '3.0 ', '2.8 ',
'3.5 ', '3.1 ', '4.8 ', '2.3 ', '4.7 ', '2.4 ', '2.1 ', '2.2 ',
'2.0 ', '1.8 '], dtype=object)
Out[22]: 0 4.1
1 4.1
2 3.8
3 3.7
4 3.8
...
51705 3.8
51707 3.9
51708 2.8
51711 2.5
51715 4.3
Name: rate, Length: 23248, dtype: float64
3. Data Visualization
3.1 Most famous restaurant chains in Bangalore
'Onesta', 'Empire Restaurant' and 'KFC' are the most famous restaurant chains in Bangalore.
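The cell that produced this ranking is missing from the export. This minimal sketch shows how such a ranking could be computed, assuming "most famous" means "most outlets listed". The toy counts below are illustrative and are not the real figures:

```python
import pandas as pd

# Toy stand-in for df['name']; the counts are made up for illustration.
names = pd.Series(
    ["Onesta", "Onesta", "Onesta", "KFC", "KFC", "Jalsa"],
    name="name",
)

# Count listings per restaurant name and keep the most frequent ones.
top_chains = names.value_counts().head(2)
print(top_chains)
```

On the real data, `df['name'].value_counts().head(n)` would give the `n` names with the most listings, which could then be passed to a bar plot.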
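In [24]: v = df['online_order'].value_counts()  # number of 'Yes' and 'No' restaurants
fig = plt.gcf()
fig.set_size_inches((10,6))
cmap = plt.get_cmap('Set3')
color = cmap(np.arange(len(v)))  # one color per category
# The plotting call is cut off in this export; a pie chart is assumed here.
plt.pie(v, labels=v.index, colors=color, autopct='%1.1f%%')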
Insight:
Most restaurants offer online ordering and delivery.
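In [25]: v = df['book_table'].value_counts()  # number of 'Yes' and 'No' restaurants
fig = plt.gcf()
fig.set_size_inches((8,6))
cmap = plt.get_cmap('Set1')
color = cmap(np.arange(len(v)))  # one color per category
# The plotting call is cut off in this export; a pie chart is assumed here.
plt.pie(v, labels=v.index, colors=color, autopct='%1.1f%%')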
Insight:
We can infer from the above that most of the ratings fall between 3.5 and 4.5.
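A statement like this can be checked numerically once `rate` is a float column. This is a minimal sketch on toy ratings (illustrative values, not the real distribution):

```python
import pandas as pd

# Toy ratings; illustrative only.
rate = pd.Series([4.1, 3.8, 3.7, 2.8, 4.6, 4.0, 3.9, 4.3])

# Fraction of ratings between 3.5 and 4.5, both ends included.
share = rate.between(3.5, 4.5).mean()
print(f"{share:.0%} of ratings fall between 3.5 and 4.5")
```

`Series.between` is inclusive on both ends by default, so ratings of exactly 3.5 or 4.5 are counted.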