Beginner's Guide to SQL
2. LENGTH
SELECT
country
FROM
my-first-project-418418.customer_data.customer_address
WHERE
LENGTH(country) > 2
4. Update Query
UPDATE
my-first-project-418418.cars.car_info
SET
price = 5118
WHERE
price = 0
5. CAST
SELECT
CAST(purchase_price AS float64)
#purchase_price
FROM
my-first-project-418418.customer_data.customer_purchase
ORDER BY
CAST(purchase_price AS float64) DESC
SELECT
CAST (date AS date) AS date_needed,
purchase_price
FROM
my-first-project-418418.customer_data.customer_purchase
WHERE
date BETWEEN '2020-12-01' AND '2020-12-31'
MOVIE DATASET SQL COMMANDS
HANDS-ON TASK
A. INSERT INTO
Example of Query
B. CONCAT
There are four types of JOIN in SQL:
• INNER JOIN: returns only the records with matching values in both tables.
• LEFT JOIN: returns all the records from the left table (first mentioned)
and only the matching records from the right table (second mentioned).
• RIGHT JOIN: returns all the records from the right table (second mentioned)
and only the matching records from the left table (first mentioned).
• OUTER JOIN (FULL OUTER JOIN): combines LEFT JOIN and RIGHT JOIN to return all
records from both tables, matched where possible; where a record has no match,
the columns of the other table are filled with NULL.
In the screenshot above, both queries return the same result.
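The four join types described above can be tried on a small local example. This is a minimal sketch, not part of the original exercises: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the tables employees and departments and their rows are hypothetical. Because older SQLite versions lack RIGHT JOIN and FULL OUTER JOIN, the sketch assumes they can be emulated with a swapped LEFT JOIN and a UNION of two LEFT JOINs; BigQuery supports both keywords directly.

```python
import sqlite3

# Hypothetical sample data: one employee without a department,
# and one department without employees.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('Ana', 1), ('Ben', 2), ('Cy', NULL);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Ops'), (3, 'Legal');
""")


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# INNER JOIN: only rows that match in both tables.
inner_rows = run("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    INNER JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.name
""")

# LEFT JOIN: every employee, with NULL (None) where no department matches.
left_rows = run("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.name
""")

# RIGHT JOIN, emulated by swapping the tables of a LEFT JOIN:
# every department, with NULL where no employee matches.
right_rows = run("""
    SELECT e.name, d.dept_name
    FROM departments AS d
    LEFT JOIN employees AS e ON e.dept_id = d.dept_id
    ORDER BY d.dept_name
""")

# FULL OUTER JOIN, emulated as the UNION of both LEFT JOINs:
# all records from both tables, matched where possible.
full_rows = run("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.dept_id = d.dept_id
    UNION
    SELECT e.name, d.dept_name
    FROM departments AS d
    LEFT JOIN employees AS e ON e.dept_id = d.dept_id
""")

for label, rows in [("INNER", inner_rows), ("LEFT", left_rows),
                    ("RIGHT", right_rows), ("FULL", full_rows)]:
    print(label, len(rows), rows)
```

The unmatched employee ('Cy') appears only in the LEFT and FULL results, and the unmatched department ('Legal') only in the RIGHT and FULL results.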
Joins Hands On
Question: In 2015, how many people were of the official age for secondary education, broken down by
region of the world?
SELECT
summary.region,
SUM(edu.value) secondary_edu_population
FROM
`bigquery-public-data.world_bank_intl_education.international_education` AS edu
INNER JOIN
`bigquery-public-data.world_bank_intl_education.country_summary` AS summary
ON edu.country_code = summary.country_code --country_code is our key
WHERE summary.region IS NOT NULL
AND edu.indicator_name = 'Population of the official age for secondary education, both sexes (number)'
AND edu.year = 2015
GROUP BY summary.region
ORDER BY secondary_edu_population DESC
27. LEFT JOIN
Consider this scenario: You have been tasked to provide data for a feature sports article on NCAA
basketball in the 1990s. The writer wants to include a funny twist about which Division 1 team
mascots were the winningest.
SELECT
seasons.market AS university,
seasons.name AS team_name,
mascots.mascot AS team_mascot,
AVG(seasons.wins) AS avg_wins,
AVG(seasons.losses) AS avg_losses,
AVG(seasons.ties) AS avg_ties
FROM `bigquery-public-data.ncaa_basketball.mbb_historical_teams_seasons` AS seasons
LEFT JOIN `bigquery-public-data.ncaa_basketball.mascots` AS mascots
ON seasons.team_id = mascots.id
WHERE seasons.season BETWEEN 1990 AND 1999
AND seasons.division = 1
GROUP BY 1,2,3
ORDER BY avg_wins DESC, university
SELECT
*
FROM
my-first-project-418418.warehouse_orders.orders AS orders
JOIN
my-first-project-418418.warehouse_orders.warehouse warehouse
ON
orders.warehouse_id = warehouse.warehouse_id
29. Taking All the columns from Orders but only 2 columns from Warehouse
SELECT
orders.*,
warehouse.warehouse_alias,
warehouse.state
FROM
my-first-project-418418.warehouse_orders.orders AS orders
JOIN
my-first-project-418418.warehouse_orders.warehouse warehouse
ON
orders.warehouse_id = warehouse.warehouse_id
SELECT
COUNT(warehouse.state) AS num_States
FROM
my-first-project-418418.warehouse_orders.orders AS orders
JOIN
my-first-project-418418.warehouse_orders.warehouse warehouse
ON
orders.warehouse_id = warehouse.warehouse_id
NESTED QUERIES
The explanation is in the Hands-On folder of Data Analytics. I have created a Word file that
explains each of the three queries below.
33. Use a subquery in a SELECT statement
SELECT
station_id,
num_bikes_available,
(SELECT
AVG(num_bikes_available)
FROM bigquery-public-data.new_york.citibike_stations) AS avg_num_bikes_available
FROM bigquery-public-data.new_york.citibike_stations;
34. Use a subquery in a FROM statement
SELECT
station_id,
name,
# number_of_rides is not a column of citibike_trips or citibike_stations. The
# subquery in the FROM clause acts as a temporary table named "station_num_trips",
# and "number_of_rides" is taken from it.
number_of_rides AS number_of_rides_starting_at_station
FROM
(
SELECT
CAST(start_station_id AS STRING) AS start_station_id_str,
COUNT(*) AS number_of_rides
FROM
bigquery-public-data.new_york.citibike_trips
GROUP BY
CAST(start_station_id AS STRING)
) AS station_num_trips
INNER JOIN
bigquery-public-data.new_york.citibike_stations
ON
station_num_trips.start_station_id_str = station_id
ORDER BY
station_num_trips.number_of_rides DESC
35. Use a subquery in a WHERE statement
SELECT
station_id,
name
FROM
bigquery-public-data.new_york.citibike_stations
WHERE
station_id IN
(
SELECT
CAST(start_station_id AS STRING) AS start_station_id_str #**
FROM
bigquery-public-data.new_york.citibike_trips
WHERE
usertype = 'Subscriber'
);
HANDS ON Activity
Scenario: To complete this task, you will create three different subqueries, which will allow you
to gather information about the average trip duration by station, compare trip duration by station,
and determine the five stations with the longest mean trip durations.
36. Query 1: calculate the “average trip duration by station” (nested SELECT inside FROM)
SELECT
subquery.start_station_id,
subquery.avg_duration
FROM
(
SELECT
start_station_id,
AVG(tripduration) as avg_duration
FROM bigquery-public-data.new_york_citibike.citibike_trips
GROUP BY start_station_id) as subquery
ORDER BY avg_duration DESC;
37. Query 2: compare “trip duration by station” (nested SELECT inside SELECT)
SELECT
starttime,
start_station_id,
tripduration,
(
# The nested query is evaluated for each outer row, once the values of the
# first three columns of that row are available.
SELECT ROUND(AVG(tripduration),2)
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE start_station_id = outer_trips.start_station_id
) AS avg_duration_for_station,
ROUND(outer_trips.tripduration - (
SELECT AVG(tripduration)
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE start_station_id = outer_trips.start_station_id), 2) AS
difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
ORDER BY difference_from_avg DESC
LIMIT 25;
Regarding the first subquery: it calculates the average trip duration for trips
starting at the same station as the current trip
(outer_trips.start_station_id).
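The way the nested SELECT depends on the current outer row can be checked on a few rows. This is a minimal sketch under stated assumptions: it runs on Python's built-in sqlite3 module rather than BigQuery, and the five trips in the hypothetical trips table are made up for illustration. Only the column names start_station_id and tripduration and the alias outer_trips come from the query above.

```python
import sqlite3

# Hypothetical sample: two trips from station 1, three from station 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (start_station_id INTEGER, tripduration INTEGER);
    INSERT INTO trips VALUES (1, 100), (1, 300), (2, 50), (2, 70), (2, 90);
""")

# The inner SELECT refers to outer_trips.start_station_id, so its result
# depends on which outer row is being produced (a correlated subquery).
rows = conn.execute("""
    SELECT
        outer_trips.start_station_id,
        outer_trips.tripduration,
        (SELECT AVG(tripduration)
         FROM trips
         WHERE start_station_id = outer_trips.start_station_id
        ) AS avg_duration_for_station
    FROM trips AS outer_trips
    ORDER BY outer_trips.start_station_id, outer_trips.tripduration
""").fetchall()

# Difference between each trip and the average of its own station.
differences = [duration - avg for _, duration, avg in rows]

for row, diff in zip(rows, differences):
    print(row, diff)
```

Every trip from station 1 is compared with the station 1 average and every trip from station 2 with the station 2 average, which is the behavior the explanation above describes.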
38. Query 3: compose a new query that includes only the trips from the five stations
with the longest mean trip duration.
SELECT
tripduration,
start_station_id
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE start_station_id IN
(
SELECT
start_station_id
FROM
(
SELECT
start_station_id,
AVG(tripduration) AS avg_duration
FROM bigquery-public-data.new_york_citibike.citibike_trips
GROUP BY start_station_id
) AS top_five
ORDER BY avg_duration DESC
LIMIT 5
);
39. Using CASE to categorize the data.
CASE works much like an if/else chain or a switch statement in programming
languages such as Java or Python.
SELECT
warehouse.warehouse_id,
CONCAT(warehouse.state,': ',warehouse.warehouse_alias) AS warehouse_name,
COUNT(orders.order_id) AS number_of_orders,
(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders)
AS total_orders,
CASE
WHEN COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) <= 0.2
THEN 'Fulfillment is 0-20%'
WHEN COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) > 0.2
AND COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) <= 0.6
THEN 'Fulfillment is 21-60%'
ELSE 'Fulfillment is more than 60% of orders'
END AS fullfillment_summary
FROM my-first-project-418418.warehouse_orders.warehouse AS warehouse
LEFT JOIN my-first-project-418418.warehouse_orders.orders AS orders
ON orders.warehouse_id = warehouse.warehouse_id
GROUP BY
warehouse.warehouse_id,
warehouse_name
HAVING
COUNT(orders.order_id) > 0
40. Using CASE to categorize the data without HAVING
SELECT
warehouse.warehouse_id,
CONCAT(warehouse.state,': ',warehouse.warehouse_alias) AS warehouse_name,
COUNT(orders.order_id) AS number_of_orders,
(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders)
AS total_orders,
CASE
WHEN COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) <= 0.2
THEN 'Fulfillment is 0-20%'
WHEN COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) > 0.2
AND COUNT(orders.order_id)/(SELECT COUNT(*) FROM my-first-project-418418.warehouse_orders.orders AS orders) <= 0.6
THEN 'Fulfillment is 21-60%'
ELSE 'Fulfillment is more than 60% of orders'
END AS fullfillment_summary
FROM my-first-project-418418.warehouse_orders.warehouse AS warehouse
LEFT JOIN my-first-project-418418.warehouse_orders.orders AS orders
ON orders.warehouse_id = warehouse.warehouse_id
GROUP BY
warehouse.warehouse_id,
warehouse_name
#HAVING
# COUNT(orders.order_id) > 0
Without HAVING, the result also lists warehouses whose number_of_orders is 0,
because their warehouse_id numbers are not present in the orders table.
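The effect of the HAVING clause in queries 39 and 40 can be reproduced on a small example. This is a minimal sketch, assuming Python's built-in sqlite3 module as a local stand-in for BigQuery; the three warehouses and six orders are hypothetical, and the CASE labels are shortened. One difference is labeled in the code: SQLite divides integers as integers, so the count is multiplied by 1.0, whereas the `/` operator in BigQuery already returns a floating-point value.

```python
import sqlite3

# Hypothetical data: warehouse 3 ('East') has no orders at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse (warehouse_id INTEGER, warehouse_alias TEXT);
    CREATE TABLE orders (order_id INTEGER, warehouse_id INTEGER);
    INSERT INTO warehouse VALUES (1, 'North'), (2, 'South'), (3, 'East');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 2);
""")

# {having} is filled in below so the same query can run with and without it.
# "* 1.0" forces floating-point division in SQLite (not needed in BigQuery).
query = """
    SELECT
        w.warehouse_alias,
        COUNT(o.order_id) AS number_of_orders,
        CASE
            WHEN COUNT(o.order_id) * 1.0 / (SELECT COUNT(*) FROM orders) <= 0.2
                THEN '0-20%'
            WHEN COUNT(o.order_id) * 1.0 / (SELECT COUNT(*) FROM orders) <= 0.6
                THEN '21-60%'
            ELSE 'more than 60%'
        END AS share_of_orders
    FROM warehouse AS w
    LEFT JOIN orders AS o ON o.warehouse_id = w.warehouse_id
    GROUP BY w.warehouse_id, w.warehouse_alias
    {having}
    ORDER BY w.warehouse_id
"""

with_having = conn.execute(
    query.format(having="HAVING COUNT(o.order_id) > 0")
).fetchall()
without_having = conn.execute(query.format(having="")).fetchall()

print("with HAVING:   ", with_having)
print("without HAVING:", without_having)
```

Because CASE stops at the first WHEN that is true, the second branch does not need to repeat the "greater than 0.2" test; the queries above spell it out, which is equivalent.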
HANDS-ON SCENARIO:
In this scenario, you are a junior data analyst for a multinational food and beverage
manufacturer. You and your team are responsible for maintaining the safety of a wide
array of food products. Because of the overwhelming number of products on the market,
you have been asked to prioritize which products need to be reviewed by your
stakeholders.
While it's useful to know which food industries receive the most complaints, the more
critical aspect to consider is identifying the complaints that lead to severe health
consequences, such as hospital visits.
42. Getting the 10 product industries with the most reports and counting how many
of their reports involved hospitalization.
SELECT
products_industry_name,
COUNT(report_number) AS count_reports
FROM
bigquery-public-data.fda_food.food_events
WHERE
products_industry_name IN
(
SELECT
products_industry_name,
-- COUNT(report_number) AS count_reports
FROM
`bigquery-public-data.fda_food.food_events`
GROUP BY
products_industry_name
ORDER BY
COUNT(report_number) DESC
LIMIT 10
) AND outcomes LIKE '%Hospitalization%'
GROUP BY
products_industry_name
ORDER BY
count_reports DESC
The query below uses EXTRACT to get the year from the “starttime” column.
SELECT
EXTRACT(YEAR FROM starttime) AS year,
COUNT(*) AS number_of_rides
FROM
`bigquery-public-data.new_york.citibike_trips`
GROUP BY
year
ORDER BY
year
46. Hands-on for basic calculations using SQL.
Subtraction
SELECT
station_name,
ridership_2013,
ridership_2014,
ridership_2014 - ridership_2013 AS change_2014_raw
FROM
`bigquery-public-data.new_york_subway.subway_ridership_2013_present`
SELECT
inventory.*,
COALESCE(avg_sales.avg_quantity_sold_in_a_month, 0) AS
avg_quantity_sold_in_a_month
FROM
my-second-project-421106.sales.Inventory AS inventory
LEFT JOIN (
SELECT
ProductId,
StoreID,
AVG(UnitsSold) AS avg_quantity_sold_in_a_month
FROM sales_history
GROUP BY ProductId, StoreID
) AS avg_sales
ON inventory.ProductID = avg_sales.ProductID
AND inventory.StoreID = avg_sales.StoreID;
51. Hands-On to find the bikeId having the highest trip duration (Using TEMP Table).
WITH longest_bike_duration AS (
SELECT
bike_id,
SUM(duration_minutes) AS trip_duration
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY
bike_id
ORDER BY
trip_duration DESC LIMIT 1
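)
# A WITH clause must be followed by a query that reads from it.
SELECT
bike_id,
trip_duration
FROM longest_bike_duration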
52. SQL query to find the station where the bike with the longest total trip
duration (found in query 51) can most likely be found.
WITH longest_bike_duration AS (
SELECT
bike_id,
SUM(duration_minutes) AS trip_duration
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY
bike_id
ORDER BY
trip_duration DESC LIMIT 1
)
SELECT
trips.bike_id,
trips.start_station_id,
COUNT(*) AS trip_count
FROM longest_bike_duration AS longest
INNER JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
ON trips.bike_id = longest.bike_id
GROUP BY start_station_id, trips.bike_id
ORDER BY trip_count DESC
LIMIT 1
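The WITH pattern of queries 51 and 52 (name a subquery, then join it back to the trips table) can be tried on a few rows. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for BigQuery; the local bikeshare_trips table and its five rows are hypothetical. The last lines list the tables the database knows about, to show that the WITH clause did not store one.

```python
import sqlite3

# Hypothetical data: bike 'A' totals 95 minutes, bike 'B' totals 70 minutes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bikeshare_trips (
        bike_id TEXT, start_station_id INTEGER, duration_minutes INTEGER
    );
    INSERT INTO bikeshare_trips VALUES
        ('A', 1, 30), ('A', 1, 45), ('A', 2, 20),
        ('B', 2, 60), ('B', 3, 10);
""")

# Step 1 (inside WITH): the bike with the highest total duration.
# Step 2 (main query): the start station that bike leaves from most often.
result = conn.execute("""
    WITH longest_bike_duration AS (
        SELECT bike_id, SUM(duration_minutes) AS trip_duration
        FROM bikeshare_trips
        GROUP BY bike_id
        ORDER BY trip_duration DESC
        LIMIT 1
    )
    SELECT trips.bike_id, trips.start_station_id, COUNT(*) AS trip_count
    FROM longest_bike_duration AS longest
    INNER JOIN bikeshare_trips AS trips
        ON trips.bike_id = longest.bike_id
    GROUP BY trips.start_station_id, trips.bike_id
    ORDER BY trip_count DESC
    LIMIT 1
""").fetchone()

# Neither the permanent nor the temporary catalog gained a table named
# longest_bike_duration: the name exists only while the query runs.
permanent_tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
temporary_tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_temp_master WHERE type = 'table'")]

print(result)
print(permanent_tables, temporary_tables)
```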
All the queries below actually create temporary tables in the backend, whereas the
WITH ... AS clause does not create a temporary table; it only mimics one for the
duration of the query.
(BigQuery currently does not recognize the SELECT INTO command. Below is an
example of how SELECT INTO might look in other RDBMS systems.)