SQL_for DS

This document outlines a two-part assignment for analyzing the Yelp dataset, focusing on data profiling and user analysis. The first part involves answering specific questions about the dataset, including record counts, distinct records, null values, and statistical summaries, while the second part requires students to derive their own insights and analyses based on the data. Students are graded on the correctness of their findings and the readability of their code.

Uploaded by

sonambhanu27

Data Scientist Role Play: Profiling and Analyzing the Yelp Dataset

Coursera Worksheet

This is a 2-part assignment. In the first part, you are asked a
series of questions that will help you profile and understand the
data just like a data scientist would. For this first part of the
assignment, you will be assessed both on the correctness of your
findings, as well as the code you used to arrive at your answer.
You will be graded on how easy your code is to read, so
remember to use proper formatting and comments where
necessary.

In the second part of the assignment, you are asked to come up
with your own inferences and analysis of the data for a particular
research question you want to answer. You will be required to
prepare the dataset for the analysis you choose to do. As with the
first part, you will be graded, in part, on how easy your code is to
read, so use proper formatting and comments to illustrate and
communicate your intent as required.

For both parts of this assignment, use this "worksheet." It
provides all the questions you are being asked, and your job will
be to transfer your answers and SQL coding where indicated into
this worksheet so that your peers can review your work. You
should be able to use any Text Editor (Windows Notepad, Apple
TextEdit, Notepad++, Sublime Text, etc.) to copy and paste your
answers. If you are going to use Word or some other page layout
application, just be careful to make sure your answers and code
are lined up appropriately. In this case, you may want to save as a
PDF to ensure your formatting remains intact for your reviewer.

Part 1: Yelp Dataset Profiling and Understanding

1. Profile the data by finding the total number of records for each
of the tables below:

i. Attribute table = 10000
ii. Business table = 10000
iii. Category table = 10000
iv. Checkin table = 10000
v. elite_years table = 10000
vi. friend table = 10000
vii. hours table = 10000
viii. photo table = 10000
ix. review table = 10000
x. tip table = 10000
xi. user table = 10000
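
The worksheet does not show the SQL used for these counts. A minimal sketch of one way to get them is to run COUNT(*) once per table. The toy in-memory tables and the count_rows helper below are illustrative assumptions, not part of the original submission; the real Yelp tables would be queried the same way.

```python
import sqlite3

# Toy in-memory stand-in for the Yelp database (rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE review (id TEXT PRIMARY KEY, business_id TEXT);
    INSERT INTO business VALUES ('b1', 'Cafe'), ('b2', 'Diner');
    INSERT INTO review VALUES ('r1', 'b1'), ('r2', 'b1'), ('r3', 'b2');
""")


def count_rows(conn, tables):
    """Return {table: row count}. Table names must come from a trusted list."""
    counts = {}
    for table in tables:
        # Identifiers cannot be bound as parameters, so they are quoted inline.
        sql = f'SELECT COUNT(*) FROM "{table}"'
        counts[table] = conn.execute(sql).fetchone()[0]
    return counts


row_counts = count_rows(conn, ["business", "review"])
print(row_counts)
```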

2. Find the total distinct records by either the foreign key or
primary key for each table. If two foreign keys are listed in the
table, please specify which foreign key.

i. Business = 10000 (id)
ii. Hours = 1562 (business_id)
iii. Category = 2643 (business_id)
iv. Attribute = 1115 (business_id)
v. Review = 10000 (id)
vi. Checkin = 493 (business_id)
vii. Photo = 10000 (id)
viii. Tip = 537 (user_id)
ix. User = 10000 (id)
x. Friend = 11 (user_id)
xi. Elite_years = 2780 (user_id)

Note: Primary Keys are denoted in the ER-Diagram with a yellow
key icon.
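
The distinct counts above can be obtained with COUNT(DISTINCT key). The sketch below assumes a toy hours table and a hypothetical count_distinct helper. It returns the total and distinct counts together, which makes it easy to see when a key repeats (a foreign key) or not (a primary key).

```python
import sqlite3

# Toy stand-in for the hours table: several rows can share one business_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hours (hours TEXT, business_id TEXT);
    INSERT INTO hours VALUES
        ('Monday|9:00-17:00', 'b1'),
        ('Tuesday|9:00-17:00', 'b1'),
        ('Monday|10:00-18:00', 'b2');
""")


def count_distinct(conn, table, key):
    """Return (total rows, distinct non-NULL values of key) for a trusted table name."""
    sql = f'SELECT COUNT(*), COUNT(DISTINCT "{key}") FROM "{table}"'
    total, distinct = conn.execute(sql).fetchone()
    return total, distinct


print(count_distinct(conn, "hours", "business_id"))
```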

3. Are there any columns with null values in the Users table?
Indicate "yes," or "no."

Answer: no

SQL code used to arrive at answer:

-- Count NULLs per column; a nonzero total means the column contains NULL values.
select
    sum(case when id is null then 1 else 0 end) as id,
    sum(case when name is null then 1 else 0 end) as name,
    sum(case when review_count is null then 1 else 0 end) as review_count,
    sum(case when yelping_since is null then 1 else 0 end) as yelping_since,
    sum(case when useful is null then 1 else 0 end) as useful,
    sum(case when funny is null then 1 else 0 end) as funny,
    sum(case when cool is null then 1 else 0 end) as cool,
    sum(case when fans is null then 1 else 0 end) as fans,
    sum(case when average_stars is null then 1 else 0 end) as average_stars,
    sum(case when compliment_hot is null then 1 else 0 end) as compliment_hot,
    sum(case when compliment_more is null then 1 else 0 end) as compliment_more,
    sum(case when compliment_profile is null then 1 else 0 end) as compliment_profile,
    sum(case when compliment_cute is null then 1 else 0 end) as compliment_cute,
    sum(case when compliment_list is null then 1 else 0 end) as compliment_list,
    sum(case when compliment_note is null then 1 else 0 end) as compliment_note,
    sum(case when compliment_plain is null then 1 else 0 end) as compliment_plain,
    sum(case when compliment_cool is null then 1 else 0 end) as compliment_cool,
    sum(case when compliment_funny is null then 1 else 0 end) as compliment_funny,
    sum(case when compliment_writer is null then 1 else 0 end) as compliment_writer,
    sum(case when compliment_photos is null then 1 else 0 end) as compliment_photos
from user
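
Listing every column by hand is easy to get wrong on a wide table. As a hedged alternative, the sketch below builds the same per-column NULL count from SQLite's PRAGMA table_info. The toy user table and the null_counts helper are assumptions for illustration only.

```python
import sqlite3

# Toy stand-in for the user table, with one NULL name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT, name TEXT, fans INTEGER);
    INSERT INTO user VALUES ('u1', 'Amy', 3), ('u2', NULL, 0), ('u3', 'Lisa', 1);
""")


def null_counts(conn, table):
    """Return {column: number of NULLs} for every column of a trusted table name."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info('{table}')")]
    # In SQLite, "col IS NULL" evaluates to 0 or 1, so SUM counts the NULLs.
    exprs = ", ".join(f'SUM("{c}" IS NULL)' for c in columns)
    values = conn.execute(f'SELECT {exprs} FROM "{table}"').fetchone()
    return dict(zip(columns, values))


user_nulls = null_counts(conn, "user")
print(user_nulls)
```

If every value in the result is 0, the answer to the question is "no".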

4. For each table and column listed below, display the smallest
(minimum), largest (maximum), and average (mean) value for the
following fields:

i. Table: Review, Column: Stars

min: 1 max: 5 avg: 3.7082


ii. Table: Business, Column: Stars

min: 1 max: 5 avg: 3.6549

iii. Table: Tip, Column: Likes

min: 0 max: 2 avg: 0.0144

iv. Table: Checkin, Column: Count

min: 1 max: 53 avg: 1.9414

v. Table: User, Column: Review_count

min: 0 max: 2000 avg: 24.2995
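
The worksheet lists only the results for this question. A minimal sketch of the kind of query that produces them combines MIN, MAX, and AVG in one SELECT. The toy review table and the summarize helper are illustrative assumptions.

```python
import sqlite3

# Toy stand-in for the review table (star values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT, stars INTEGER);
    INSERT INTO review VALUES ('r1', 1), ('r2', 4), ('r3', 5), ('r4', 4);
""")


def summarize(conn, table, column):
    """Return (min, max, avg) of a numeric column for a trusted table/column name."""
    sql = f'SELECT MIN("{column}"), MAX("{column}"), AVG("{column}") FROM "{table}"'
    return conn.execute(sql).fetchone()


stats = summarize(conn, "review", "stars")
print(stats)
```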

5. List the cities with the most reviews in descending order:

SQL code used to arrive at answer:

SELECT
city,
SUM(review_count) AS reviews_num
FROM business
GROUP BY city
ORDER BY reviews_num DESC

Copy and Paste the Result Below:


+-----------------+-------------+
| city | reviews_num |
+-----------------+-------------+
| Las Vegas | 82854 |
| Phoenix | 34503 |
| Toronto | 24113 |
| Scottsdale | 20614 |
| Charlotte | 12523 |
| Henderson | 10871 |
| Tempe | 10504 |
| Pittsburgh | 9798 |
| Montréal | 9448 |
| Chandler | 8112 |
| Mesa | 6875 |
| Gilbert | 6380 |
| Cleveland | 5593 |
| Madison | 5265 |
| Glendale | 4406 |
| Mississauga | 3814 |
| Edinburgh | 2792 |
| Peoria | 2624 |
| North Las Vegas | 2438 |
| Markham | 2352 |
| Champaign | 2029 |
| Stuttgart | 1849 |
| Surprise | 1520 |
| Lakewood | 1465 |
| Goodyear | 1155 |
+-----------------+-------------+
(Output limit exceeded, 25 of 362 total rows shown)

6. Find the distribution of star ratings to the business in the
following cities:

i. Avon

SQL code used to arrive at answer:

SELECT stars, count(stars)
FROM business b
WHERE b.city = 'Avon'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star
rating and count):

+-------+--------------+
| stars | count(stars) |
+-------+--------------+
| 1.5 | 1|
| 2.5 | 2|
| 3.5 | 3|
| 4.0 | 2|
| 4.5 | 1|
| 5.0 | 1|
+-------+--------------+

ii. Beachwood

SQL code used to arrive at answer:

SELECT stars, count(stars)
FROM business b
WHERE b.city = 'Beachwood'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star
rating and count):

+-------+--------------+
| stars | count(stars) |
+-------+--------------+
| 2.0 | 1|
| 2.5 | 1|
| 3.0 | 2|
| 3.5 | 2|
| 4.0 | 1|
| 4.5 | 2|
| 5.0 | 5|
+-------+--------------+

7. Find the top 3 users based on their total number of reviews:

SQL code used to arrive at answer:

SELECT name,
id,
review_count
FROM user
ORDER BY review_count DESC
LIMIT 3

Copy and Paste the Result Below:

+--------+------------------------+--------------+
| name | id | review_count |
+--------+------------------------+--------------+
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 2000 |
| Sara | -3s52C4zL_DHRK0ULG6qtg | 1629 |
| Yuri | -8lbUNlXVSoXqaRRiHiSNg | 1339 |
+--------+------------------------+--------------+

8. Does posting more reviews correlate with more fans?

Please explain your findings and interpretation of the results:

SELECT name,
id,
review_count ,
fans
FROM user
ORDER BY fans DESC

No, the number of reviews does not appear to correlate strongly
with the number of fans.

+-----------+------------------------+--------------+------+
| name | id | review_count | fans |
+-----------+------------------------+--------------+------+
| Amy | -9I98YbNQnLdAmcYfb324Q | 609 | 503 |
| Mimi | -8EnCioUmDygAbsYZmTeRQ | 968 | 497 |
| Harald | --2vR0DIsmQ6WfcSzKWigw | 1153 | 311 |
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 2000 | 253 |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 930 | 173 |
| Lisa | -g3XIcCb2b-BD0QBCcq2Sw | 813 | 159 |
| Cat | -9bbDysuiWeo2VShFJJtcw | 377 | 133 |
| William | -FZBTkAZEXoP7CYvRV2ZwQ | 1215 | 126 |
| Fran | -9da1xk7zgnnfO1uTVYGkA | 862 | 124 |
| Lissa | -lh59ko3dxChBSZ9U7LfUw | 834 | 120 |
| Mark | -B-QEUESGWHPE_889WJaeg | 861 | 115 |
| Tiffany | -DmqnhW4Omr3YhmnigaqHg | 408 | 111 |
| bernice | -cv9PPT7IHux7XUc9dOpkg | 255 | 105 |
| Roanna | -DFCC64NXgqrxlO8aLU5rg | 1039 | 104 |
| Angela | -IgKkE8JvYNWeGu8ze4P8Q | 694 | 101 |
| .Hon | -K2Tcgh2EKX6e6HqqIrBIQ | 1246 | 101 |
| Ben | -4viTt9UC44lWCFJwleMNQ | 307 | 96 |
| Linda | -3i9bhfvrM3F1wsC9XIB8g | 584 | 89 |
| Christina | -kLVfaJytOJY2-QdQoCcNQ | 842 | 85 |
| Jessica | -ePh4Prox7ZXnEBNGKyUEA | 220 | 84 |
| Greg | -4BEUkLvHQntN6qPfKJP2w | 408 | 81 |
| Nieves | -C-l8EHSLXtZZVfUAUhsPA | 178 | 80 |
| Sui | -dw8f7FLaUmWR7bfJ_Yf0w | 754 | 78 |
| Yuri | -8lbUNlXVSoXqaRRiHiSNg | 1339 | 76 |
| Nicole | -0zEEaDFIjABtPQni0XlHA | 161 | 73 |
+-----------+------------------------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)

The table above shows that the user with the most reviews (2000)
has only 253 fans, while the user with the most fans (503) wrote
only 609 reviews. There are also users who have written a single
review and have very few fans.

Other factors, such as compliment_photos, compliment_profile,
yelping_since, compliment_writer, and compliment_funny, may
influence the number of fans.
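
Reading a sorted table only gives an impression of the relationship. One way to quantify it is the Pearson correlation coefficient between review_count and fans. The sketch below uses made-up rows and a hand-written pearson helper (both assumptions); it says nothing about the value for the real Yelp data, which would have to be computed by running it on the actual user table.

```python
import math
import sqlite3

# Toy stand-in for the user table (values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT, review_count INTEGER, fans INTEGER);
    INSERT INTO user VALUES ('u1', 10, 1), ('u2', 20, 2), ('u3', 30, 3), ('u4', 40, 5);
""")


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rows = conn.execute("SELECT review_count, fans FROM user").fetchall()
r = pearson([rc for rc, _ in rows], [f for _, f in rows])
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.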

9. Are there more reviews with the word "love" or with the word
"hate" in them?

Answer: There are more reviews with the word "love" (1780) than
with the word "hate" (232).

SQL code used to arrive at answer:

SELECT sum(case when text like '%love%' then 1 else 0 end) as words_with_love,
       sum(case when text like '%hate%' then 1 else 0 end) as words_with_hate
FROM review

+-----------------+-----------------+
| words_with_love | words_with_hate |
+-----------------+-----------------+
| 1780 | 232 |
+-----------------+-----------------+
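
Note that LIKE '%love%' is a substring match, so it also counts reviews containing words such as "lovely" or "gloves". The sketch below contrasts the substring count with a whole-word count done in Python. The sample reviews and the whole_word_count helper are assumptions for illustration.

```python
import re
import sqlite3

# Toy stand-in for the review table (texts are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT, text TEXT);
    INSERT INTO review VALUES
        ('r1', 'I love this place'),
        ('r2', 'Lovely staff'),
        ('r3', 'I hate waiting'),
        ('r4', 'Gloves were on sale');
""")

# Substring counts, as in the worksheet query (LIKE ignores ASCII case in SQLite).
substring_counts = conn.execute(
    "SELECT SUM(text LIKE '%love%'), SUM(text LIKE '%hate%') FROM review"
).fetchone()


def whole_word_count(conn, word):
    """Count reviews containing word as a standalone word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    texts = conn.execute("SELECT text FROM review")
    return sum(1 for (text,) in texts if pattern.search(text))


print(substring_counts, whole_word_count(conn, "love"), whole_word_count(conn, "hate"))
```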

10. Find the top 10 users with the most fans:

SQL code used to arrive at answer:

SELECT name, fans
FROM user
ORDER BY fans desc
LIMIT 10

Copy and Paste the Result Below:

+-----------+------+
| name | fans |
+-----------+------+
| Amy | 503 |
| Mimi | 497 |
| Harald | 311 |
| Gerald | 253 |
| Christine | 173 |
| Lisa | 159 |
| Cat | 133 |
| William | 126 |
| Fran | 124 |
| Lissa | 120 |
+-----------+------+

Part 2: Inferences and Analysis

1. Pick one city and category of your choice and group the
businesses in that city or category by their overall star rating.
Compare the businesses with 2-3 stars to the businesses with 4-5
stars and answer the following questions. Include your code.

i. Do the two groups you chose to analyze have a different
distribution of hours?

City = Toronto, category = Restaurants

From the data below, we can observe that the distribution of hours
does not appear to differ with the rating.

+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| name          | city    | category    | stars | hours                | review_count | address               | latitude | longitude | postal_code |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| 99 Cent Sushi | Toronto | Restaurants | 2.0   | Saturday|11:00-23:00 | 5            | 389 Church Street     | 43.6614  | -79.379   | M5B 2E5     |
| Pizzaiolo     | Toronto | Restaurants | 3.0   | Saturday|10:00-4:00  | 34           | 270 Adelaide Street W | 43.6479  | -79.3901  | M5H 1X6     |
| Edulis        | Toronto | Restaurants | 4.0   | Saturday|18:00-23:00 | 89           | 169 Niagara Street    | 43.6419  | -79.4066  | M5V         |
| Sushi Osaka   | Toronto | Restaurants | 4.5   | Saturday|11:00-23:00 | 8            | 5084 Dundas Street W  | 43.6452  | -79.5324  | M9A 1C2     |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+

ii. Do the two groups you chose to analyze have a different
number of reviews?

Yes. In this sample, the restaurants with a low rating (2-3 stars)
have fewer reviews in total (5 and 34) than the ones with a higher
rating (4-5 stars; 89 and 8).

iii. Are you able to infer anything from the location data provided
between these two groups? Explain.

From the postal code, latitude, and longitude data, we can see
that the restaurants in both groups are located close to each other.

SQL code used for analysis:

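-- One sample row per star rating for Toronto restaurants.
-- Note: selecting non-aggregated columns together with GROUP BY relies on
-- SQLite, which returns values from an arbitrary row in each group.
SELECT
    b.name,
    b.city,
    c.category,
    b.stars,
    h.hours,
    b.review_count,
    b.address,
    b.latitude,
    b.longitude,
    b.postal_code
FROM business b
INNER JOIN category c ON b.id = c.business_id
INNER JOIN hours h ON h.business_id = b.id
WHERE b.city = 'Toronto' AND c.category = 'Restaurants'
GROUP BY b.stars;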
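
Because the query above returns a single row per star rating, it compares individual restaurants and not the two groups. A hedged sketch of a group-level comparison is to bucket the star ratings with CASE and aggregate per bucket. The toy rows and the bucket labels below are assumptions, not results from the Yelp data.

```python
import sqlite3

# Toy stand-ins for the business and category tables (values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT, city TEXT, stars REAL, review_count INTEGER);
    CREATE TABLE category (business_id TEXT, category TEXT);
    INSERT INTO business VALUES
        ('b1', 'Toronto', 2.0, 5), ('b2', 'Toronto', 3.0, 35),
        ('b3', 'Toronto', 4.0, 90), ('b4', 'Toronto', 4.5, 10),
        ('b5', 'Phoenix', 5.0, 100);
    INSERT INTO category VALUES
        ('b1', 'Restaurants'), ('b2', 'Restaurants'), ('b3', 'Restaurants'),
        ('b4', 'Restaurants'), ('b5', 'Restaurants');
""")

GROUP_QUERY = """
SELECT CASE WHEN b.stars >= 4 THEN '4-5 stars' ELSE '2-3 stars' END AS rating_group,
       COUNT(*) AS businesses,
       AVG(b.review_count) AS avg_reviews
FROM business b
JOIN category c ON c.business_id = b.id
WHERE b.city = ? AND c.category = ?
  -- keep only ratings that fall in one of the two groups being compared
  AND ((b.stars >= 2 AND b.stars <= 3) OR b.stars >= 4)
GROUP BY rating_group
ORDER BY rating_group
"""

groups = conn.execute(GROUP_QUERY, ("Toronto", "Restaurants")).fetchall()
for row in groups:
    print(row)
```

Averages per group are less sensitive to which single row happens to be picked for each rating.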

2. Group business based on the ones that are open and the ones
that are closed. What differences can you find between the ones
that are still open and the ones that are closed? List at least two
differences and the SQL code you used to arrive at your answer.

i. Difference 1:
There are more listings that are open (8480) than closed (1520).
Open listings also have more reviews in total (269300 versus 35261).

ii. Difference 2:

SELECT
b.is_open,
AVG(r.funny)
FROM business b
JOIN review r
ON b.id= r.business_id
GROUP BY is_open

+---------+----------------+
| is_open | AVG(r.funny) |
+---------+----------------+
| 0 | 0.211267605634 |
| 1 | 0.269026548673 |
+---------+----------------+

Open listings have a higher average funny score (0.269) than
closed ones (0.211).

SQL code used for analysis:

SELECT
is_open,
SUM(review_count),
AVG(stars),
AVG(review_count),
COUNT(*)
FROM business
GROUP BY is_open

+---------+-------------------+---------------+-------------------+----------+
| is_open | SUM(review_count) | AVG(stars)    | AVG(review_count) | COUNT(*) |
+---------+-------------------+---------------+-------------------+----------+
| 0       | 35261             | 3.52039473684 | 23.1980263158     | 1520     |
| 1       | 269300            | 3.67900943396 | 31.7570754717     | 8480     |
+---------+-------------------+---------------+-------------------+----------+

3. For this last part of your analysis, you are going to choose the
type of analysis you want to conduct on the Yelp dataset and are
going to prepare the data for analysis.

Ideas for analysis include: Parsing out keywords and business
attributes for sentiment analysis, clustering businesses to find
commonalities or anomalies between them, predicting the overall
star rating for a business, predicting the number of fans a user
will have, and so on. These are just a few examples to get you
started, so feel free to be creative and come up with your own
problem you want to solve. Provide answers, in-line, to all of the
following:

i. Indicate the type of analysis you chose to do:

Predicting the overall star rating for a business


ii. Write 1-2 brief paragraphs on the type of data you will need for
your analysis and why you chose that data:

- Review votes (useful, funny, cool): most highly rated places
also have high useful, funny, and cool scores.
- Review text: reviews of highly rated places more often contain
words with a positive connotation, such as love, awesome, like, and good.
- Number of reviews: highly rated places tend to have more reviews.
- Star rating: most places are rated 4-5 stars.

iii. Output of your finished dataset:

+-------------------+-------------+------------+-----------+------------------+-----------+
| SUM(review_count) | SUM(useful) | SUM(funny) | SUM(cool) | SUM(rating_type) | rating    |
+-------------------+-------------+------------+-----------+------------------+-----------+
| 66861             | 151         | 64         | 80        | 31               | None      |
| 22469             | 125         | 38         | 36        | 7                | 2-3 Stars |
| 95708             | 277         | 65         | 133       | 76               | 4-5 Stars |
+-------------------+-------------+------------+-----------+------------------+-----------+

iv. Provide the SQL code you used to create your final dataset:

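-- Join each review to its business, bucket by the business's star rating,
-- score the review text, then total the metrics per bucket.
-- Ratings below 2 or equal to 3.5 match neither bucket and get a NULL rating.
-- Caution: in SQLite only the first LIKE in each WHEN is tested; the bare
-- string operands after OR evaluate to false.
SELECT SUM(review_count),
       SUM(useful),
       SUM(funny),
       SUM(cool),
       SUM(rating_type),
       rating
FROM (SELECT *,
             CASE
                 WHEN stars >= 4 THEN '4-5 Stars'
                 WHEN (stars >= 2 AND stars <= 3) THEN '2-3 Stars'
             END AS rating,
             CASE
                 WHEN text LIKE '%love%' OR '%like%' OR '%happy%' OR '%good%' OR '%awesome%' THEN 1
                 WHEN text LIKE '%hate%' OR 'bad%' OR 'disappointed%' OR '%horrible%' OR '%rude%' THEN -1
                 ELSE 0
             END AS rating_type
      FROM (SELECT b.review_count,
                   b.stars,
                   r.text,
                   r.useful,
                   r.funny,
                   r.cool
            FROM business b
            JOIN review r ON b.id = r.business_id) joined_table)
GROUP BY rating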
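
The CASE expression in the query above writes text LIKE '%love%' OR '%like%' OR ..., which in SQLite tests only the first pattern: each bare string after OR is evaluated as a number (0) and is therefore false. A minimal sketch of the difference, using made-up reviews and a shortened keyword list (both assumptions), repeats text LIKE for every pattern:

```python
import sqlite3

# Toy stand-in for the review table (texts are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT, text TEXT);
    INSERT INTO review VALUES
        ('r1', 'I love it'), ('r2', 'Really good food'),
        ('r3', 'Rude staff'), ('r4', 'It was fine');
""")

# Form used in the worksheet: only the first LIKE in each WHEN has any effect.
AS_WRITTEN = """
SELECT id,
       CASE WHEN text LIKE '%love%' OR '%good%' THEN 1
            WHEN text LIKE '%hate%' OR '%rude%' THEN -1
            ELSE 0 END AS rating_type
FROM review ORDER BY id
"""

# Each pattern gets its own LIKE so that every keyword is actually tested.
PER_PATTERN = """
SELECT id,
       CASE WHEN text LIKE '%love%' OR text LIKE '%good%' THEN 1
            WHEN text LIKE '%hate%' OR text LIKE '%rude%' THEN -1
            ELSE 0 END AS rating_type
FROM review ORDER BY id
"""

as_written = conn.execute(AS_WRITTEN).fetchall()
per_pattern = conn.execute(PER_PATTERN).fetchall()
print(as_written)
print(per_pattern)
```

Rerunning the worksheet query with one LIKE per keyword would change the SUM(rating_type) column, so the output table would need to be regenerated.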
