SQL_for DS
Coursera Worksheet
1. Profile the data by finding the total number of records for each
of the tables below:
3. Are there any columns with null values in the Users table?
Indicate "yes," or "no."
Answer: no
-- each output column holds the number of NULLs in that column; all zeros means no NULLs
select
  sum(case when id is null then 1 else 0 end) as id,
  sum(case when name is null then 1 else 0 end) as name,
  sum(case when review_count is null then 1 else 0 end) as review_count,
  sum(case when yelping_since is null then 1 else 0 end) as yelping_since,
  sum(case when useful is null then 1 else 0 end) as useful,
  sum(case when funny is null then 1 else 0 end) as funny,
  sum(case when cool is null then 1 else 0 end) as cool,
  sum(case when fans is null then 1 else 0 end) as fans,
  sum(case when average_stars is null then 1 else 0 end) as average_stars,
  sum(case when compliment_hot is null then 1 else 0 end) as compliment_hot,
  sum(case when compliment_more is null then 1 else 0 end) as compliment_more,
  sum(case when compliment_profile is null then 1 else 0 end) as compliment_profile,
  sum(case when compliment_cute is null then 1 else 0 end) as compliment_cute,
  sum(case when compliment_list is null then 1 else 0 end) as compliment_list,
  sum(case when compliment_note is null then 1 else 0 end) as compliment_note,
  sum(case when compliment_plain is null then 1 else 0 end) as compliment_plain,
  sum(case when compliment_cool is null then 1 else 0 end) as compliment_cool,
  sum(case when compliment_funny is null then 1 else 0 end) as compliment_funny,
  sum(case when compliment_writer is null then 1 else 0 end) as compliment_writer,
  sum(case when compliment_photos is null then 1 else 0 end) as compliment_photos
from user
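Writing one CASE expression per column is easy to get wrong by hand. As a rough sketch only, the same check can be generated from the table's own column list. The snippet assumes a SQLite database and uses a small in-memory stand-in for the user table; the three columns, the sample rows, and the helper name null_counts are hypothetical, not part of the worksheet.

```python
import sqlite3

# Hypothetical in-memory stand-in for the Yelp `user` table (3 columns only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id TEXT, name TEXT, review_count INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("u1", "Ann", 10), ("u2", None, 3), ("u3", "Cy", None)],
)


def null_counts(conn, table):
    """Return {column: number of NULLs} for every column of `table`."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    # COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the NULL count.
    select_list = ", ".join(f'COUNT(*) - COUNT("{c}")' for c in columns)
    row = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip(columns, row))


counts = null_counts(conn, "user")
print(counts)
```

A column with a non-zero count contains NULLs; if every count is zero, the answer to the question is "no".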
4. For each table and column listed below, display the smallest
(minimum), largest (maximum), and average (mean) value for the
following fields:
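A minimal sketch of the kind of query this question calls for, assuming SQLite. The review table, its stars column, and the four rows are hypothetical stand-ins, so the numbers are not worksheet answers.

```python
import sqlite3

# Hypothetical stand-in table; the real worksheet tables are much larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (id TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [("r1", 1), ("r2", 4), ("r3", 5), ("r4", 2)],
)

# One pass over the table returns all three summary statistics.
smallest, largest, mean = conn.execute(
    "SELECT MIN(stars), MAX(stars), AVG(stars) FROM review"
).fetchone()
print(smallest, largest, mean)
```

The same SELECT can be repeated for each table and column pair by changing the table name and the column inside the three aggregates.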
SELECT
city,
SUM(review_count) AS reviews_num
FROM business
GROUP BY city
ORDER BY reviews_num DESC
i. Avon
Copy and Paste the Resulting Table Below (2 columns – star
rating and count):
+-------+--------------+
| stars | count(stars) |
+-------+--------------+
|   1.5 |            1 |
|   2.5 |            2 |
|   3.5 |            3 |
|   4.0 |            2 |
|   4.5 |            1 |
|   5.0 |            1 |
+-------+--------------+
ii. Beachwood
Copy and Paste the Resulting Table Below (2 columns – star
rating and count):
+-------+--------------+
| stars | count(stars) |
+-------+--------------+
|   2.0 |            1 |
|   2.5 |            1 |
|   3.0 |            2 |
|   3.5 |            2 |
|   4.0 |            1 |
|   4.5 |            2 |
|   5.0 |            5 |
+-------+--------------+
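The two tables above have the shape produced by grouping one city's businesses by star rating. A hedged sketch of such a query, assuming SQLite and a business table with city and stars columns; the four sample rows and the helper name star_distribution are hypothetical.

```python
import sqlite3

# Hypothetical sample rows; the real business table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (id TEXT, city TEXT, stars REAL)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [("b1", "Avon", 3.5), ("b2", "Avon", 3.5), ("b3", "Avon", 4.0),
     ("b4", "Beachwood", 5.0)],
)


def star_distribution(conn, city):
    """Return (stars, count) pairs for one city, lowest rating first."""
    return conn.execute(
        "SELECT stars, COUNT(stars) FROM business "
        "WHERE city = ? GROUP BY stars ORDER BY stars",
        (city,),  # bound parameter instead of string formatting
    ).fetchall()


avon = star_distribution(conn, "Avon")
print(avon)
```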
SELECT name,
id,
review_count
FROM user
ORDER BY review_count DESC
LIMIT 3
Copy and Paste the Result Below:
+--------+------------------------+--------------+
| name   | id                     | review_count |
+--------+------------------------+--------------+
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw |         2000 |
| Sara   | -3s52C4zL_DHRK0ULG6qtg |         1629 |
| Yuri   | -8lbUNlXVSoXqaRRiHiSNg |         1339 |
+--------+------------------------+--------------+
SELECT name,
id,
review_count ,
fans
FROM user
ORDER BY fans DESC
The query result shows that the user with the most reviews (2000) has
only 253 fans, while the user with the most fans (503) has written
only 609 reviews. There are also users who have written a single
review and have very few fans. Review count alone therefore does not
determine the number of fans.
Other factors, such as compliment_photos, compliment_profile,
yelping_since, compliment_writer, and compliment_funny, may also
influence the number of fans.
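One way to put a number on this relationship is a Pearson correlation between review_count and fans. The snippet below is an illustrative sketch only: it assumes SQLite, the four rows are hypothetical values rather than Yelp data, and the helper name pearson is made up for this example.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, review_count INTEGER, fans INTEGER)")
# Hypothetical rows, not values taken from the Yelp dataset.
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("a", 100, 20), ("b", 400, 10), ("c", 50, 5), ("d", 900, 30)],
)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = conn.execute("SELECT review_count, fans FROM user").fetchall()
r = pearson([row[0] for row in rows], [row[1] for row in rows])
print(round(r, 3))
```

A coefficient near 1 would indicate that fans rise with review count, and a value near 0 would support looking at the other columns instead.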
9. Are there more reviews with the word "love" or with the word
"hate" in them?
Answer: love
+-----------------+-----------------+
| words_with_love | words_with_hate |
+-----------------+-----------------+
|            1780 |             232 |
+-----------------+-----------------+
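A count like the one above can be produced with two LIKE conditions summed over the review table. This is a sketch under stated assumptions, not the query used for the worksheet: it assumes SQLite (where LIKE is case-insensitive for ASCII text and evaluates to 0 or 1), and the five sample reviews are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (id TEXT, text TEXT)")
# Hypothetical review texts for illustration.
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [("r1", "I love this place"),
     ("r2", "Love the tacos, hate the parking"),
     ("r3", "Would not return"),
     ("r4", "I hate waiting"),
     ("r5", "Lovely staff")],
)

# Each LIKE evaluates to 1 or 0 per row, so SUM counts the matching reviews.
with_love, with_hate = conn.execute(
    "SELECT SUM(text LIKE '%love%'), SUM(text LIKE '%hate%') FROM review"
).fetchone()
print(with_love, with_hate)
```

Note that '%love%' is a substring match, so words such as "Lovely" are counted as well.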
+-----------+------+
| name      | fans |
+-----------+------+
| Amy       |  503 |
| Mimi      |  497 |
| Harald    |  311 |
| Gerald    |  253 |
| Christine |  173 |
| Lisa      |  159 |
| Cat       |  133 |
| William   |  126 |
| Fran      |  124 |
| Lissa     |  120 |
+-----------+------+
1. Pick one city and category of your choice and group the
businesses in that city or category by their overall star rating.
Compare the businesses with 2-3 stars to the businesses with 4-5
stars and answer the following questions. Include your code.
From the data below, the hours of operation do not appear to be
related to the rating: a 2.0-star and a 4.5-star restaurant both open
Saturday 11:00-23:00.
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| name          | city    | category    | stars | hours                | review_count | address               | latitude | longitude | postal_code |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| 99 Cent Sushi | Toronto | Restaurants |   2.0 | Saturday|11:00-23:00 |            5 | 389 Church Street     |  43.6614 |   -79.379 | M5B 2E5     |
| Pizzaiolo     | Toronto | Restaurants |   3.0 | Saturday|10:00-4:00  |           34 | 270 Adelaide Street W |  43.6479 |  -79.3901 | M5H 1X6     |
| Edulis        | Toronto | Restaurants |   4.0 | Saturday|18:00-23:00 |           89 | 169 Niagara Street    |  43.6419 |  -79.4066 | M5V         |
| Sushi Osaka   | Toronto | Restaurants |   4.5 | Saturday|11:00-23:00 |            8 | 5084 Dundas Street W  |  43.6452 |  -79.5324 | M9A 1C2     |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
Yes. In this sample, the restaurants with low ratings (2-3 stars)
have fewer reviews in total (5 + 34 = 39) than the ones with high
ratings (4-5 stars: 89 + 8 = 97).
iii. Are you able to infer anything from the location data provided
between these two groups? Explain.
From the postal code, latitude, and longitude data, we can see
that the businesses in both groups are located close to one another.
-- SQLite fills the non-aggregated columns from one arbitrary row per stars value
SELECT
  b.name,
  b.city,
  c.category,
  b.stars,
  h.hours,
  b.review_count,
  b.address,
  b.latitude,
  b.longitude,
  b.postal_code
FROM business b
INNER JOIN category c ON b.id = c.business_id
INNER JOIN hours h ON h.business_id = b.id
WHERE b.city = 'Toronto' AND c.category = 'Restaurants'
GROUP BY b.stars;
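To compare the two rating groups directly, the businesses can be bucketed with a CASE expression and aggregated per bucket. The sketch below assumes SQLite and loads only the four Toronto restaurant rows shown in the table above into a simplified single-table stand-in (no category or hours join); the column alias rating_group is a name chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (name TEXT, city TEXT, stars REAL, review_count INTEGER)"
)
# The four Toronto restaurant rows from the result table above.
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?, ?)",
    [("99 Cent Sushi", "Toronto", 2.0, 5),
     ("Pizzaiolo", "Toronto", 3.0, 34),
     ("Edulis", "Toronto", 4.0, 89),
     ("Sushi Osaka", "Toronto", 4.5, 8)],
)

query = """
SELECT rating_group, COUNT(*), SUM(review_count)
FROM (SELECT CASE
               WHEN stars >= 4 THEN '4-5 stars'
               WHEN stars >= 2 AND stars <= 3 THEN '2-3 stars'
             END AS rating_group,
             review_count
      FROM business
      WHERE city = 'Toronto')
WHERE rating_group IS NOT NULL   -- drop ratings outside both buckets
GROUP BY rating_group
ORDER BY rating_group
"""
groups = conn.execute(query).fetchall()
print(groups)
```

Aggregating per bucket avoids depending on which row SQLite happens to return for each stars value.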
2. Group business based on the ones that are open and the ones
that are closed. What differences can you find between the ones
that are still open and the ones that are closed? List at least two
differences and the SQL code you used to arrive at your answer.
i. Difference 1:
More listings are open (8480) than closed (1520). The open listings
also have more reviews in total (269300 vs. 35261).
ii. Difference 2:
SELECT
  b.is_open,
  AVG(r.funny)
FROM business b
JOIN review r
  ON b.id = r.business_id
GROUP BY b.is_open
+---------+----------------+
| is_open | AVG(r.funny)   |
+---------+----------------+
|       0 | 0.211267605634 |
|       1 | 0.269026548673 |
+---------+----------------+
Reviews of open listings have a higher average funny score (about
0.27) than reviews of closed listings (about 0.21).
SQL code used for analysis:
SELECT
is_open,
SUM(review_count),
AVG(stars),
AVG(review_count),
COUNT(*)
FROM business
GROUP BY is_open
+---------+-------------------+---------------+-------------------+----------+
| is_open | SUM(review_count) | AVG(stars)    | AVG(review_count) | COUNT(*) |
+---------+-------------------+---------------+-------------------+----------+
|       0 |             35261 | 3.52039473684 |     23.1980263158 |     1520 |
|       1 |            269300 | 3.67900943396 |     31.7570754717 |     8480 |
+---------+-------------------+---------------+-------------------+----------+
3. For this last part of your analysis, you are going to choose the
type of analysis you want to conduct on the Yelp dataset and are
going to prepare the data for analysis.
- Places with high ratings tend to have high useful, funny, and cool
scores.
- Reviews of highly rated places contain more words with a positive
connotation, such as love, awesome, like, and good.
- Highly rated places have more reviews.
- Most places have a 4-5 star rating.
+-------------------+-------------+------------+-----------+------------------+-----------+
| SUM(review_count) | SUM(useful) | SUM(funny) | SUM(cool) | SUM(rating_type) | rating    |
+-------------------+-------------+------------+-----------+------------------+-----------+
|             66861 |         151 |         64 |        80 |               31 | None      |
|             22469 |         125 |         38 |        36 |                7 | 2-3 Stars |
|             95708 |         277 |         65 |       133 |               76 | 4-5 Stars |
+-------------------+-------------+------------+-----------+------------------+-----------+
iv. Provide the SQL code you used to create your final dataset:
SELECT SUM(review_count),
       SUM(useful),
       SUM(funny),
       SUM(cool),
       SUM(rating_type),
       rating
FROM (SELECT *,
             CASE
               WHEN stars >= 4 THEN '4-5 Stars'
               WHEN (stars >= 2 AND stars <= 3) THEN '2-3 Stars'
             END AS rating,
             -- each pattern needs its own LIKE; a bare string after OR is not matched against text
             CASE
               WHEN text LIKE '%love%' OR text LIKE '%like%' OR text LIKE '%happy%'
                 OR text LIKE '%good%' OR text LIKE '%awesome%' THEN 1
               WHEN text LIKE '%hate%' OR text LIKE '%bad%' OR text LIKE '%disappointed%'
                 OR text LIKE '%horrible%' OR text LIKE '%rude%' THEN -1
               ELSE 0
             END AS rating_type
      FROM (SELECT
              b.review_count,
              b.stars,
              r.text,
              r.useful,
              r.funny,
              r.cool
            FROM business b
            JOIN review r
              ON b.id = r.business_id) joined_table)
GROUP BY rating
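One detail worth checking in keyword rules like the rating_type expression is how SQLite parses a list of patterns joined with OR. The sketch below, which assumes SQLite and uses three hypothetical review texts, compares a chained form (one LIKE followed by bare strings) with an explicit form (one LIKE per pattern).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (text TEXT)")
# Hypothetical review texts for illustration.
conn.executemany(
    "INSERT INTO review VALUES (?)",
    [("I love it",), ("Really good food",), ("So rude",)],
)

# Chained form: parsed as (text LIKE '%love%') OR ('%good%').
# The bare string '%good%' is cast to a number (0), which is false.
chained = conn.execute(
    "SELECT SUM(text LIKE '%love%' OR '%good%') FROM review"
).fetchone()[0]

# Explicit form: every pattern is actually compared against text.
explicit = conn.execute(
    "SELECT SUM(text LIKE '%love%' OR text LIKE '%good%') FROM review"
).fetchone()[0]

print(chained, explicit)
```

In the chained form only the first pattern has any effect, so the "Really good food" row is matched by the explicit form alone.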