SQL_for DS

This document outlines a two-part assignment for analyzing the Yelp dataset, focusing on data profiling and user analysis. The first part involves answering specific questions about the dataset, including record counts, distinct records, null values, and statistical summaries, while the second part requires students to derive their own insights and analyses based on the data. Students are graded on the correctness of their findings and the readability of their code.

Uploaded by

sonambhanu27

Data Scientist Role Play: Profiling and Analyzing the Yelp Dataset

Coursera Worksheet

This is a 2-part assignment. In the first part, you are asked a
series of questions that will help you profile and understand the
data just like a data scientist would. For this first part of the
assignment, you will be assessed both on the correctness of your
findings, as well as the code you used to arrive at your answer.
You will be graded on how easy your code is to read, so
remember to use proper formatting and comments where
necessary.

In the second part of the assignment, you are asked to come up
with your own inferences and analysis of the data for a particular
research question you want to answer. You will be required to
prepare the dataset for the analysis you choose to do. As with the
first part, you will be graded, in part, on how easy your code is to
read, so use proper formatting and comments to illustrate and
communicate your intent as required.

For both parts of this assignment, use this "worksheet." It
provides all the questions you are being asked, and your job will
be to transfer your answers and SQL coding where indicated into
this worksheet so that your peers can review your work. You
should be able to use any Text Editor (Windows Notepad, Apple
TextEdit, Notepad++, Sublime Text, etc.) to copy and paste your
answers. If you are going to use Word or some other page layout
application, just be careful to make sure your answers and code
are aligned appropriately. In this case, you may want to save as
a PDF to ensure your formatting remains intact for your reviewer.

Part 1: Yelp Dataset Profiling and Understanding

1. Profile the data by finding the total number of records for each
of the tables below:

i. Attribute table = 10000
ii. Business table = 10000
iii. Category table = 10000
iv. Checkin table = 10000
v. elite_years table = 10000
vi. friend table = 10000
vii. hours table = 10000
viii. photo table = 10000
ix. review table = 10000
x. tip table = 10000
xi. user table = 10000

2. Find the total distinct records by either the foreign key or
primary key for each table. If two foreign keys are listed in the
table, please specify which foreign key.

i. Business = (id: 10000)
ii. Hours = (business_id: 1562)
iii. Category = (business_id: 2643)
iv. Attribute = (business_id: 1115)
v. Review = (id: 10000)
vi. Checkin = (business_id: 493)
vii. Photo = (id: 10000)
viii. Tip = (user_id: 537)
ix. User = (id: 10000)
x. Friend = (user_id: 11)
xi. Elite_years = (user_id: 2780)

Note: Primary Keys are denoted in the ER-Diagram with a yellow
key icon.
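
The worksheet does not show the queries behind questions 1 and 2. A minimal sketch of how such counts could be produced is given below, using Python's built-in sqlite3 module and a tiny in-memory stand-in for the hours table. The rows are hypothetical and are not taken from the Yelp dataset.

```python
import sqlite3

# Hypothetical stand-in for the Yelp `hours` table (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hours (hours TEXT, business_id TEXT)")
conn.executemany(
    "INSERT INTO hours VALUES (?, ?)",
    [
        ("Monday|9:00-17:00", "b1"),
        ("Tuesday|9:00-17:00", "b1"),
        ("Monday|10:00-22:00", "b2"),
    ],
)

# COUNT(*) counts every row; COUNT(DISTINCT col) counts unique key values.
total_rows, distinct_businesses = conn.execute(
    """
    SELECT COUNT(*), COUNT(DISTINCT business_id)
    FROM hours
    """
).fetchone()

print(total_rows, distinct_businesses)
```

Running the same pair of aggregates against each table, with the relevant primary or foreign key in place of business_id, would give the figures asked for in questions 1 and 2.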

3. Are there any columns with null values in the Users table?
Indicate "yes," or "no."

Answer: no

SQL code used to arrive at answer:

-- Count the NULLs in each column of the user table.
-- A result of 0 in every column means the table has no NULL values.
SELECT
    SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS id,
    SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS name,
    SUM(CASE WHEN review_count IS NULL THEN 1 ELSE 0 END) AS review_count,
    SUM(CASE WHEN yelping_since IS NULL THEN 1 ELSE 0 END) AS yelping_since,
    SUM(CASE WHEN useful IS NULL THEN 1 ELSE 0 END) AS useful,
    SUM(CASE WHEN funny IS NULL THEN 1 ELSE 0 END) AS funny,
    SUM(CASE WHEN cool IS NULL THEN 1 ELSE 0 END) AS cool,
    SUM(CASE WHEN fans IS NULL THEN 1 ELSE 0 END) AS fans,
    SUM(CASE WHEN average_stars IS NULL THEN 1 ELSE 0 END) AS average_stars,
    SUM(CASE WHEN compliment_hot IS NULL THEN 1 ELSE 0 END) AS compliment_hot,
    SUM(CASE WHEN compliment_more IS NULL THEN 1 ELSE 0 END) AS compliment_more,
    SUM(CASE WHEN compliment_profile IS NULL THEN 1 ELSE 0 END) AS compliment_profile,
    SUM(CASE WHEN compliment_cute IS NULL THEN 1 ELSE 0 END) AS compliment_cute,
    SUM(CASE WHEN compliment_list IS NULL THEN 1 ELSE 0 END) AS compliment_list,
    SUM(CASE WHEN compliment_note IS NULL THEN 1 ELSE 0 END) AS compliment_note,
    SUM(CASE WHEN compliment_plain IS NULL THEN 1 ELSE 0 END) AS compliment_plain,
    SUM(CASE WHEN compliment_cool IS NULL THEN 1 ELSE 0 END) AS compliment_cool,
    SUM(CASE WHEN compliment_funny IS NULL THEN 1 ELSE 0 END) AS compliment_funny,
    SUM(CASE WHEN compliment_writer IS NULL THEN 1 ELSE 0 END) AS compliment_writer,
    SUM(CASE WHEN compliment_photos IS NULL THEN 1 ELSE 0 END) AS compliment_photos
FROM user

4. For each table and column listed below, display the smallest
(minimum), largest (maximum), and average (mean) value for the
following fields:

i. Table: Review, Column: Stars

min: 1 max: 5 avg: 3.7082

ii. Table: Business, Column: Stars

min: 1 max: 5 avg: 3.6549

iii. Table: Tip, Column: Likes

min: 0 max: 2 avg: 0.0144

iv. Table: Checkin, Column: Count

min: 1 max: 53 avg: 1.9414

v. Table: User, Column: Review_count

min: 0 max: 2000 avg: 24.2995
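
The queries for question 4 are not included in the worksheet. One plausible shape for them is sketched below with Python's sqlite3 module and a few hypothetical review rows; the values are illustrative only and do not reproduce the figures above.

```python
import sqlite3

# Hypothetical stand-in for the Yelp `review` table (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (id TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [("r1", 1), ("r2", 4), ("r3", 5), ("r4", 4)],
)

# One pass over the table returns the smallest, largest, and mean value.
min_stars, max_stars, avg_stars = conn.execute(
    """
    SELECT MIN(stars), MAX(stars), ROUND(AVG(stars), 4)
    FROM review
    """
).fetchone()

print(min_stars, max_stars, avg_stars)
```

The same three aggregates apply to the other table and column pairs listed in the question by swapping the table and column names.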

5. List the cities with the most reviews in descending order:

SQL code used to arrive at answer:

SELECT
city,
SUM(review_count) AS reviews_num
FROM business
GROUP BY city
ORDER BY reviews_num DESC

Copy and Paste the Result Below:

+-----------------+-------------+
| city            | reviews_num |
+-----------------+-------------+
| Las Vegas       |       82854 |
| Phoenix         |       34503 |
| Toronto         |       24113 |
| Scottsdale      |       20614 |
| Charlotte       |       12523 |
| Henderson       |       10871 |
| Tempe           |       10504 |
| Pittsburgh      |        9798 |
| Montréal        |        9448 |
| Chandler        |        8112 |
| Mesa            |        6875 |
| Gilbert         |        6380 |
| Cleveland       |        5593 |
| Madison         |        5265 |
| Glendale        |        4406 |
| Mississauga     |        3814 |
| Edinburgh       |        2792 |
| Peoria          |        2624 |
| North Las Vegas |        2438 |
| Markham         |        2352 |
| Champaign       |        2029 |
| Stuttgart       |        1849 |
| Surprise        |        1520 |
| Lakewood        |        1465 |
| Goodyear        |        1155 |
+-----------------+-------------+
(Output limit exceeded, 25 of 362 total rows shown)

6. Find the distribution of star ratings to the business in the
following cities:

i. Avon

SQL code used to arrive at answer:

SELECT stars, count(stars)
FROM business b
WHERE b.city = 'Avon'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star
rating and count):

+-------+--------------+
| stars | count(stars) |
+-------+--------------+
|   1.5 |            1 |
|   2.5 |            2 |
|   3.5 |            3 |
|   4.0 |            2 |
|   4.5 |            1 |
|   5.0 |            1 |
+-------+--------------+

ii. Beachwood

SQL code used to arrive at answer:

SELECT stars, count(stars)
FROM business b
WHERE b.city = 'Beachwood'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star
rating and count):

+-------+--------------+
| stars | count(stars) |
+-------+--------------+
|   2.0 |            1 |
|   2.5 |            1 |
|   3.0 |            2 |
|   3.5 |            2 |
|   4.0 |            1 |
|   4.5 |            2 |
|   5.0 |            5 |
+-------+--------------+

7. Find the top 3 users based on their total number of reviews:

SQL code used to arrive at answer:

SELECT name,
id,
review_count
FROM user
ORDER BY review_count DESC
LIMIT 3

Copy and Paste the Result Below:

+--------+------------------------+--------------+
| name   | id                     | review_count |
+--------+------------------------+--------------+
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw |         2000 |
| Sara   | -3s52C4zL_DHRK0ULG6qtg |         1629 |
| Yuri   | -8lbUNlXVSoXqaRRiHiSNg |         1339 |
+--------+------------------------+--------------+

8. Does posting more reviews correlate with more fans?

Please explain your findings and interpretation of the results:

SELECT name,
id,
review_count,
fans
FROM user
ORDER BY fans DESC

No, I do not think there is a clear correlation between the number
of reviews and the number of fans.

+-----------+------------------------+--------------+------+
| name      | id                     | review_count | fans |
+-----------+------------------------+--------------+------+
| Amy       | -9I98YbNQnLdAmcYfb324Q |          609 |  503 |
| Mimi      | -8EnCioUmDygAbsYZmTeRQ |          968 |  497 |
| Harald    | --2vR0DIsmQ6WfcSzKWigw |         1153 |  311 |
| Gerald    | -G7Zkl1wIWBBmD0KRy_sCw |         2000 |  253 |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ |          930 |  173 |
| Lisa      | -g3XIcCb2b-BD0QBCcq2Sw |          813 |  159 |
| Cat       | -9bbDysuiWeo2VShFJJtcw |          377 |  133 |
| William   | -FZBTkAZEXoP7CYvRV2ZwQ |         1215 |  126 |
| Fran      | -9da1xk7zgnnfO1uTVYGkA |          862 |  124 |
| Lissa     | -lh59ko3dxChBSZ9U7LfUw |          834 |  120 |
| Mark      | -B-QEUESGWHPE_889WJaeg |          861 |  115 |
| Tiffany   | -DmqnhW4Omr3YhmnigaqHg |          408 |  111 |
| bernice   | -cv9PPT7IHux7XUc9dOpkg |          255 |  105 |
| Roanna    | -DFCC64NXgqrxlO8aLU5rg |         1039 |  104 |
| Angela    | -IgKkE8JvYNWeGu8ze4P8Q |          694 |  101 |
| .Hon      | -K2Tcgh2EKX6e6HqqIrBIQ |         1246 |  101 |
| Ben       | -4viTt9UC44lWCFJwleMNQ |          307 |   96 |
| Linda     | -3i9bhfvrM3F1wsC9XIB8g |          584 |   89 |
| Christina | -kLVfaJytOJY2-QdQoCcNQ |          842 |   85 |
| Jessica   | -ePh4Prox7ZXnEBNGKyUEA |          220 |   84 |
| Greg      | -4BEUkLvHQntN6qPfKJP2w |          408 |   81 |
| Nieves    | -C-l8EHSLXtZZVfUAUhsPA |          178 |   80 |
| Sui       | -dw8f7FLaUmWR7bfJ_Yf0w |          754 |   78 |
| Yuri      | -8lbUNlXVSoXqaRRiHiSNg |         1339 |   76 |
| Nicole    | -0zEEaDFIjABtPQni0XlHA |          161 |   73 |
+-----------+------------------------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)

The table above shows that the user who wrote 2000 reviews has
only 253 fans, while the user with the most fans (503) wrote only
609 reviews. There are also users who have written a single review
and have very few fans.
Other factors, such as compliment_photos, compliment_profile,
yelping_since, compliment_writer, and compliment_funny, may
influence the number of fans.
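
Reading a sorted table gives only an impression of the relationship. One way to put a number on it is the Pearson correlation coefficient, which is not part of the original submission. Since SQLite has no built-in correlation function, the sketch below gathers the required sums in SQL and finishes the arithmetic in Python. The four user rows are hypothetical, so the resulting value says nothing about the real dataset.

```python
import math
import sqlite3

# Hypothetical stand-in for the Yelp `user` table (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, review_count INTEGER, fans INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("a", 10, 1), ("b", 20, 3), ("c", 30, 2), ("d", 40, 5)],
)

# Collect the sums that the Pearson formula needs in a single query.
n, sx, sy, sxy, sxx, syy = conn.execute(
    """
    SELECT COUNT(*),
           SUM(review_count),
           SUM(fans),
           SUM(review_count * fans),
           SUM(review_count * review_count),
           SUM(fans * fans)
    FROM user
    """
).fetchone()

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))
```

A value near 0 would support the "no correlation" reading, while a value near 1 would suggest that fans tend to rise with review count.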

9. Are there more reviews with the word "love" or with the word
"hate" in them?

Answer: There are more reviews with the word "love" (1780) than
with the word "hate" (232).

SQL code used to arrive at answer:

SELECT SUM(CASE WHEN text LIKE '%love%' THEN 1 ELSE 0 END) AS words_with_love,
       SUM(CASE WHEN text LIKE '%hate%' THEN 1 ELSE 0 END) AS words_with_hate
FROM review

+-----------------+-----------------+
| words_with_love | words_with_hate |
+-----------------+-----------------+
|            1780 |             232 |
+-----------------+-----------------+
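
A LIKE pattern with wildcards on both sides matches substrings, not whole words, so counts of this kind can include words such as "lovely", "glove", or "whatever" (which contains "hate"). The sketch below illustrates the effect on four hypothetical review texts using Python's sqlite3 module. It is only meant to show the matching behaviour and is not a claim about how much it affects the real counts.

```python
import sqlite3

# Hypothetical review texts (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (text TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?)",
    [
        ("I love this place",),
        ("Lovely patio, whatever the weather",),
        ("I hate waiting in line",),
        ("Lost a glove here",),
    ],
)

# SQLite's LIKE is case-insensitive for ASCII and matches anywhere in the text.
love_count, hate_count = conn.execute(
    """
    SELECT SUM(CASE WHEN text LIKE '%love%' THEN 1 ELSE 0 END),
           SUM(CASE WHEN text LIKE '%hate%' THEN 1 ELSE 0 END)
    FROM review
    """
).fetchone()

print(love_count, hate_count)
```

Here only one text uses "love" as a word and one uses "hate" as a word, yet the substring counts come out higher for both.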

10. Find the top 10 users with the most fans:

SQL code used to arrive at answer:

SELECT name, fans
FROM user
ORDER BY fans DESC
LIMIT 10

Copy and Paste the Result Below:

+-----------+------+
| name      | fans |
+-----------+------+
| Amy       |  503 |
| Mimi      |  497 |
| Harald    |  311 |
| Gerald    |  253 |
| Christine |  173 |
| Lisa      |  159 |
| Cat       |  133 |
| William   |  126 |
| Fran      |  124 |
| Lissa     |  120 |
+-----------+------+

Part 2: Inferences and Analysis

1. Pick one city and category of your choice and group the
businesses in that city or category by their overall star rating.
Compare the businesses with 2-3 stars to the businesses with 4-5
stars and answer the following questions. Include your code.

i. Do the two groups you chose to analyze have a different
distribution of hours?

city = Toronto, category = Restaurants

The data below shows no clear relationship between opening hours
and rating: a 2.0-star and a 4.5-star restaurant have the same
Saturday hours (11:00-23:00).

+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| name          | city    | category    | stars | hours                | review_count | address               | latitude | longitude | postal_code |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+
| 99 Cent Sushi | Toronto | Restaurants |   2.0 | Saturday|11:00-23:00 |            5 | 389 Church Street     |  43.6614 |   -79.379 | M5B 2E5     |
| Pizzaiolo     | Toronto | Restaurants |   3.0 | Saturday|10:00-4:00  |           34 | 270 Adelaide Street W |  43.6479 |  -79.3901 | M5H 1X6     |
| Edulis        | Toronto | Restaurants |   4.0 | Saturday|18:00-23:00 |           89 | 169 Niagara Street    |  43.6419 |  -79.4066 | M5V         |
| Sushi Osaka   | Toronto | Restaurants |   4.5 | Saturday|11:00-23:00 |            8 | 5084 Dundas Street W  |  43.6452 |  -79.5324 | M9A 1C2     |
+---------------+---------+-------------+-------+----------------------+--------------+-----------------------+----------+-----------+-------------+

ii. Do the two groups you chose to analyze have a different
number of reviews?

Yes. On average, the restaurants with low ratings (2-3 stars; 5 and
34 reviews) have fewer reviews than the ones with high ratings
(4-5 stars; 89 and 8 reviews), although the sample is only four
businesses.

iii. Are you able to infer anything from the location data provided
between these two groups? Explain.

From the postal code, latitude, and longitude data, we can see
that the businesses in both groups are located close to one another.

SQL code used for analysis:

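-- Toronto restaurants with their rating, hours, review count, and location.
-- GROUP BY b.stars keeps one row per star rating; SQLite takes the
-- non-aggregated columns from an arbitrary row within each group.
SELECT
    b.name,
    b.city,
    c.category,
    b.stars,
    h.hours,
    b.review_count,
    b.address,
    b.latitude,
    b.longitude,
    b.postal_code
FROM business b
INNER JOIN category c ON b.id = c.business_id
INNER JOIN hours h ON h.business_id = b.id
WHERE b.city = 'Toronto'
  AND c.category = 'Restaurants'
GROUP BY b.stars;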
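
Because the query above returns a single row per star rating, it compares individual businesses and not the two groups as a whole. An alternative, sketched below under the assumption that per-group aggregates are acceptable for this question, buckets businesses into "2-3 stars" and "4-5 stars" and summarises each bucket. The table contents and the rating_band alias are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical stand-ins for the Yelp `business` and `category` tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (id TEXT, city TEXT, stars REAL, review_count INTEGER)"
)
conn.execute("CREATE TABLE category (business_id TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?, ?)",
    [
        ("b1", "Toronto", 2.0, 5),
        ("b2", "Toronto", 3.0, 34),
        ("b3", "Toronto", 4.0, 89),
        ("b4", "Toronto", 4.5, 8),
        ("b5", "Toronto", 3.5, 12),  # falls in neither band, so it is excluded
    ],
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?)",
    [(bid, "Restaurants") for bid in ("b1", "b2", "b3", "b4", "b5")],
)

# Bucket each business into a rating band, then aggregate per band.
results = conn.execute(
    """
    SELECT CASE
               WHEN b.stars BETWEEN 2 AND 3 THEN '2-3 stars'
               WHEN b.stars >= 4 THEN '4-5 stars'
           END AS rating_band,
           COUNT(*) AS businesses,
           AVG(b.review_count) AS avg_reviews
    FROM business b
    JOIN category c ON c.business_id = b.id
    WHERE b.city = 'Toronto'
      AND c.category = 'Restaurants'
      AND (b.stars BETWEEN 2 AND 3 OR b.stars >= 4)
    GROUP BY rating_band
    ORDER BY rating_band
    """
).fetchall()

for row in results:
    print(row)
```

Joining the hours table in the same way would allow opening hours to be compared per band as well.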

2. Group business based on the ones that are open and the ones
that are closed. What differences can you find between the ones
that are still open and the ones that are closed? List at least two
differences and the SQL code you used to arrive at your answer.

i. Difference 1:
More listings are open (8480) than closed (1520). The open
listings also have more reviews in total (269300 versus 35261).

ii. Difference 2:

SELECT
b.is_open,
AVG(r.funny)
FROM business b
JOIN review r
ON b.id = r.business_id
GROUP BY b.is_open

+---------+----------------+
| is_open | AVG(r.funny)   |
+---------+----------------+
|       0 | 0.211267605634 |
|       1 | 0.269026548673 |
+---------+----------------+

Reviews of open listings have a higher average funny score than
reviews of closed listings.

SQL code used for analysis:

SELECT
is_open,
SUM(review_count),
AVG(stars),
AVG(review_count),
COUNT(*)
FROM business
GROUP BY is_open

+---------+-------------------+---------------+-------------------+----------+
| is_open | SUM(review_count) | AVG(stars)    | AVG(review_count) | COUNT(*) |
+---------+-------------------+---------------+-------------------+----------+
|       0 |             35261 | 3.52039473684 |     23.1980263158 |     1520 |
|       1 |            269300 | 3.67900943396 |     31.7570754717 |     8480 |
+---------+-------------------+---------------+-------------------+----------+

3. For this last part of your analysis, you are going to choose the
type of analysis you want to conduct on the Yelp dataset and are
going to prepare the data for analysis.

Ideas for analysis include: Parsing out keywords and business
attributes for sentiment analysis, clustering businesses to find
commonalities or anomalies between them, predicting the overall
star rating for a business, predicting the number of fans a user
will have, and so on. These are just a few examples to get you
started, so feel free to be creative and come up with your own
problem you want to solve. Provide answers, in-line, to all of the
following:

i. Indicate the type of analysis you chose to do:

Predicting the overall star rating for a business

ii. Write 1-2 brief paragraphs on the type of data you will need for
your analysis and why you chose that data:

- Most of the places with a high rating also have high useful,
funny, and cool scores.
- The places with a high rating have reviews with more words such
as love, awesome, like, and good, which have a positive connotation.
- The number of reviews is high for places with a high star rating.
- Most places have a 4-5 star rating.

iii. Output of your finished dataset:

+-------------------+-------------+------------+-----------+------------------+-----------+
| SUM(review_count) | SUM(useful) | SUM(funny) | SUM(cool) | SUM(rating_type) | rating    |
+-------------------+-------------+------------+-----------+------------------+-----------+
|             66861 |         151 |         64 |        80 |               31 | None      |
|             22469 |         125 |         38 |        36 |                7 | 2-3 Stars |
|             95708 |         277 |         65 |       133 |               76 | 4-5 Stars |
+-------------------+-------------+------------+-----------+------------------+-----------+

iv. Provide the SQL code you used to create your final dataset:

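SELECT SUM(review_count),
       SUM(useful),
       SUM(funny),
       SUM(cool),
       SUM(rating_type),
       rating
FROM (
    SELECT *,
           -- Bucket each business by its overall star rating. Ratings
           -- outside both ranges (for example 3.5) are left as NULL.
           CASE
               WHEN stars >= 4 THEN '4-5 Stars'
               WHEN stars >= 2 AND stars <= 3 THEN '2-3 Stars'
           END AS rating,
           -- Score the review text: +1 for a positive keyword, -1 for a
           -- negative keyword, 0 otherwise. Each pattern needs its own
           -- LIKE; in SQLite, "text LIKE '%a%' OR '%b%'" tests only the
           -- first pattern.
           CASE
               WHEN text LIKE '%love%' OR text LIKE '%like%'
                    OR text LIKE '%happy%' OR text LIKE '%good%'
                    OR text LIKE '%awesome%' THEN 1
               WHEN text LIKE '%hate%' OR text LIKE '%bad%'
                    OR text LIKE '%disappointed%' OR text LIKE '%horrible%'
                    OR text LIKE '%rude%' THEN -1
               ELSE 0
           END AS rating_type
    FROM (
        -- One row per review, with the reviewed business's rating attached.
        SELECT
            b.review_count,
            b.stars,
            r.text,
            r.useful,
            r.funny,
            r.cool
        FROM business b
        JOIN review r
        ON b.id = r.business_id
    ) joined_table
)
GROUP BY rating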
