w3... Hands On Activity..clean Data Using SQL

1) The document describes a hands-on activity where the user will clean automobile data stored in a SQL database. This involves downloading a CSV file, creating a dataset and table in BigQuery, and using SQL queries to inspect and clean the data. 2) The user cleans the data by inspecting columns for invalid or missing values, identifying errors like misspellings, and ensuring consistency by removing extra spaces. 3) The cleaned data is now ready for analysis to determine the most popular cars and trims for a used car dealership startup.

Uploaded by

Shiv

Hands-On Activity: Clean data using SQL

Activity overview

In previous lessons, you learned about the importance of being able to clean your data where it
lives. When it comes to data stored in databases, that means using SQL queries. In this activity,
you will create a custom dataset and table, import a CSV file, and use SQL queries to clean
automobile data.

In this scenario, you are a data analyst working with a used car dealership startup venture. The
investors want you to find out which cars are most popular with customers so they can make
sure to stock accordingly.

By the time you complete this activity, you will be able to clean data using SQL. This will enable
you to process and analyze data in databases, which is a common task for data analysts.

What you will need

To get started, download the automobile_data CSV file. This is data from an external source that
contains historical sales data on car prices and their features.

Click the link to the automobile_data file to download it. Or you may download the CSV file
directly from the attachments below.

Link to data: automobile_data

Download data: automobile_data (CSV file)

Upload your data

As in a previous BigQuery activity, you will need to create a dataset and a custom table to
house your data. Then, you’ll be able to use SQL queries to explore and clean it. Once you’ve
downloaded the automobile_data file, you can create your dataset.

Step 1: Create a dataset

Go to the Explorer pane in your workspace and click the three dots next to your pinned
project to open the menu. From here, select Create dataset.

From the Create dataset menu, fill out some information about the dataset. Input the Dataset ID
as cars; you can leave the Data location as Default. Then click CREATE DATASET.

The cars dataset should appear under your project in the Explorer pane. Click
on the three dots next to the cars dataset to open it.

Step 2: Create table

After you open your newly created dataset, you will be able to add a custom table for your data.

From the cars dataset, click CREATE TABLE.

Under Source, upload the automobile_data CSV. Under Destination, make sure you are
uploading into your cars dataset and name your table car_info. You can set the schema to
Auto-detect. Then, click Create table.

After creating your table, it will appear in your Explorer pane. You can click on the table to
explore the schema and preview your data. Once you have gotten familiar with your data, you
can start querying it.

Cleaning your data

Your new dataset contains historical sales data, including details such as car features and prices.
You can use this data to find the top 10 most popular cars and trims. But before you can perform
your analysis, you’ll need to make sure your data is clean. If you analyze dirty data, you could
end up presenting the wrong list of cars to the investors. That may cause them to lose money on
their car inventory investment.

Step 1: Inspect the fuel_type column

The first thing you want to do is inspect the data in your table so you can find out if there is any
specific cleaning that needs to be done. According to the data’s description, the fuel_type
column should only have two unique string values: diesel and gas. To check and make sure
that’s true, run the following query:

SELECT DISTINCT fuel_type FROM cars.car_info;

The results should list only diesel and gas, which confirms that the fuel_type column doesn’t
have any unexpected values.
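SELECT DISTINCT shows which values exist but not how often each one occurs. As an optional follow-up, the sketch below counts the rows for each fuel_type value. It assumes the cars.car_info table created earlier; the num_rows alias is only an illustrative name.

```sql
-- Count how many rows carry each fuel_type value.
-- An unexpected value, or a value with a suspiciously small count,
-- would stand out in this list.
SELECT
  fuel_type,
  COUNT(*) AS num_rows  -- illustrative alias, not part of the dataset
FROM cars.car_info
GROUP BY fuel_type
ORDER BY num_rows DESC;
```

A value that appears only once or twice is often worth a closer look, since rare values are a common sign of a typo.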

Step 2: Inspect the length column

Next, you will inspect a column with numerical data. The length column should contain numeric
measurements of the cars. So you will check that the minimum and maximum lengths in the
dataset align with the data description, which states that the lengths in this column should range
from 141.1 to 208.1. Run this query to confirm:

SELECT MIN(length) AS min_length, MAX(length) AS max_length FROM cars.car_info;

Your results should confirm that 141.1 and 208.1 are the minimum and maximum values
respectively in this column.
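If the minimum or maximum had fallen outside the documented range, the next question would be which rows are responsible. The sketch below lists any rows whose length is missing or outside 141.1 to 208.1. It assumes the cars.car_info table and the make and body_style columns used elsewhere in this activity; with clean data it should return no rows.

```sql
-- List rows whose length is outside the documented range or missing.
-- NOT BETWEEN is inclusive of both bounds, so 141.1 and 208.1 pass.
SELECT
  make,
  body_style,
  length
FROM cars.car_info
WHERE length NOT BETWEEN 141.1 AND 208.1
   OR length IS NULL;
```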

Step 3: Fill in missing data

Missing values can create errors or skew your results during analysis. You’re going to want to
check your data for null or missing values. These values might appear as a blank cell or the word
null in BigQuery.

You can check to see if the num_of_doors column contains null values using this query:

SELECT * FROM cars.car_info WHERE num_of_doors IS NULL;

This will select any rows with missing data for the num_of_doors column and return them in your
results table. You should get two results, one Mazda and one Dodge.

In order to fill in these missing values, you check with the sales manager, who states that all
Dodge gas sedans and all Mazda diesel sedans sold had four doors. If you are using the
BigQuery free trial, you can use this query to update your table so that all Dodge gas
sedans have four doors:

UPDATE cars.car_info SET num_of_doors = "four" WHERE make = "dodge" AND fuel_type = "gas" AND body_style = "sedan";

You should get a message telling you that three rows were modified in this table. To make sure,
you can run the previous query again:

SELECT * FROM cars.car_info WHERE num_of_doors IS NULL;

Now, you only have one row with a NULL value for num_of_doors. Repeat this process to
replace the null value for the Mazda.
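One way the Mazda update might look is sketched below. It mirrors the Dodge query and relies on the sales manager's statement that all Mazda diesel sedans had four doors. The lowercase spelling "mazda" is an assumption based on how "dodge" is stored; check the make column if the statement modifies no rows.

```sql
-- Set num_of_doors for all Mazda diesel sedans, following the
-- same pattern as the Dodge update.
-- Assumption: make is stored in lowercase, as with "dodge".
UPDATE cars.car_info
SET num_of_doors = "four"
WHERE make = "mazda"
  AND fuel_type = "diesel"
  AND body_style = "sedan";
```

Running the IS NULL query once more afterward should return no rows.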

If you are using the BigQuery Sandbox, you can skip these UPDATE queries; they will not affect
your ability to complete this activity.

Step 4: Identify potential errors

Once you have finished ensuring that there aren’t any missing values in your data, you’ll want to
check for other potential errors. You can use SELECT DISTINCT to check what values exist in a
column. You can run this query to check the num_of_cylinders column:

SELECT DISTINCT num_of_cylinders FROM cars.car_info;

After running this, you notice that the results contain one more row than expected. There are
two entries for two cylinders, in rows 6 and 7, and the one in row 7 is misspelled as tow.

To correct the misspelling for all rows, you can run this query if you have the BigQuery
free trial:

UPDATE cars.car_info SET num_of_cylinders = "two" WHERE num_of_cylinders = "tow";

You will get a message alerting you that one row was modified after running this statement. To
check that it worked, you can run the previous query again:

SELECT DISTINCT num_of_cylinders FROM cars.car_info;

Next, you can check the compression_ratio column. According to the data description, the
compression_ratio column values should range from 7 to 23. Just like when you checked
the length values, you can use MIN and MAX to check if that’s correct:

SELECT MIN(compression_ratio) AS min_compression_ratio, MAX(compression_ratio) AS max_compression_ratio FROM cars.car_info;

Notice that this returns a maximum of 70. But you know this is an error because the maximum
value in this column should be 23, not 70. So the 70 is most likely a 7.0. Run the above query
again without the row with 70 to make sure that the rest of the values fall within the expected
range of 7 to 23:

SELECT MIN(compression_ratio) AS min_compression_ratio, MAX(compression_ratio) AS max_compression_ratio FROM cars.car_info WHERE compression_ratio <> 70;

Now the highest value is 23, which aligns with the data description. So you’ll want to correct the
70 value. You check with the sales manager again, who says that this row was made in error and
should be removed. Before you delete anything, you should check to see how many rows contain
this erroneous value as a precaution so that you don’t end up deleting 50% of your data. If there
are too many (for instance, 20% of your rows have the incorrect 70 value), then you would want
to check back in with the sales manager to inquire if these should be deleted or if the 70 should
be updated to another value. Use the query below to count how many rows you would be
deleting:

SELECT COUNT(*) AS num_of_rows_to_delete FROM cars.car_info WHERE compression_ratio = 70;

It turns out there is only one row with the erroneous 70 value. So you can delete that row using
this query:

DELETE FROM cars.car_info WHERE compression_ratio = 70;

If you are using the BigQuery Sandbox, you can replace DELETE with SELECT * to see which row
would be deleted.
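The precaution described in this step compares the number of affected rows with the size of the whole table. As a sketch to run before the DELETE, the query below returns both counts and the percentage in one pass, using BigQuery's COUNTIF and SAFE_DIVIDE functions. The three aliases are illustrative names, not columns in the dataset.

```sql
-- Compare the rows that would be deleted with the total row count.
-- SAFE_DIVIDE returns NULL instead of raising an error
-- if the table happens to be empty.
SELECT
  COUNTIF(compression_ratio = 70) AS num_of_rows_to_delete,
  COUNT(*) AS total_rows,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(compression_ratio = 70), COUNT(*)), 2)
    AS pct_of_rows_to_delete
FROM cars.car_info;
```

A small percentage supports deleting the rows; a large one is the signal to check back with the sales manager first.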

Step 5: Ensure consistency

Finally, you want to check your data for any inconsistencies that might cause errors. These
inconsistencies can be tricky to spot — sometimes even something as simple as an extra space
can cause a problem.

Check the drive_wheels column for inconsistencies by running a query with a SELECT
DISTINCT statement:

SELECT DISTINCT drive_wheels FROM cars.car_info;

Notice that 4wd appears twice in the results. However, because you used a SELECT DISTINCT
statement to return unique values, this probably means there’s an extra space in one of the 4wd
entries that makes it different from the other 4wd.

To check if this is the case, you can use the LENGTH function to determine how long each of
these strings is:

SELECT DISTINCT drive_wheels, LENGTH(drive_wheels) AS string_length FROM cars.car_info;

According to these results, some instances of the 4wd string have four characters instead of the
expected three (4wd has 3 characters). In that case, you can use the TRIM function to remove
all extra spaces in the drive_wheels column if you are using the BigQuery free trial:

UPDATE cars.car_info SET drive_wheels = TRIM(drive_wheels) WHERE TRUE;
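If you cannot run UPDATE statements, for example in the BigQuery Sandbox, one option is to apply TRIM inside a SELECT instead. The sketch below previews the trimmed values without changing the table. It assumes the same cars.car_info table; string_length is the alias used in the earlier LENGTH query.

```sql
-- Preview the drive_wheels values as they would look after trimming.
-- TRIM with a single argument removes leading and trailing whitespace.
-- The stored data is not modified.
SELECT DISTINCT
  TRIM(drive_wheels) AS drive_wheels,
  LENGTH(TRIM(drive_wheels)) AS string_length
FROM cars.car_info;
```

Because this only reads the data, you would need to apply TRIM the same way in any later query that groups or filters on drive_wheels.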

Then, you run the SELECT DISTINCT statement again to ensure that there are only three
distinct values in the drive_wheels column:

SELECT DISTINCT drive_wheels FROM cars.car_info;

Now there should only be three unique values in this column, which means your data is
clean, consistent, and ready for analysis!
