Process Data from Dirty to Clean
A good analysis depends on the integrity of the data, and data integrity usually depends on using a
common format. So it is important to double-check how dates are formatted to make sure what you
think is December 10, 2020 isn’t really October 12, 2020, and vice versa.
Data replication compromising data integrity: Continuing with the example, imagine you ask
your international counterparts to verify dates and stick to one format. One analyst copies a
large dataset to check the dates. But because of memory issues, only part of the dataset is
actually copied. The analyst would be verifying and standardizing incomplete data. That
partial dataset would be certified as compliant but the full dataset would still contain dates
that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit
of results would be essential to reveal what happened and correct all dates.
Data transfer compromising data integrity: Another analyst checks the dates in a
spreadsheet and chooses to import the validated and standardized data back to the database.
But suppose the date field from the spreadsheet was incorrectly classified as a text field during
the data import (transfer) process. Now some of the dates in the database are stored as text
strings. At this point, the data needs to be cleaned to restore its integrity.
Data manipulation compromising data integrity: When checking dates, another analyst
notices what appears to be a duplicate record in the database and removes it. But it turns out
that the analyst removed a unique record for a company’s subsidiary and not a duplicate
record for the company. Your dataset is now missing data and the data must be restored for
completeness.
Data constraints and data quality
Each data constraint below is listed with its definition and an example.
Data type: Values must be of a certain type (date, number, percentage, Boolean, etc.). Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory: Values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
Unique: Values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages and values from multiple fields must add up to 100%.
Primary-key: (Databases only) Value must be unique per column. Example: a database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table.
Three related measures describe data quality:
Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color, and both are collected, the data is complete.
Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
Conclusion
Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis.
In this reading, you will review the business objectives associated with three scenarios. You will explore
how clean data and well-aligned business objectives can help you come up with accurate conclusions.
On top of that, you will learn how new variables discovered during data analysis can cause you to set
up data constraints so you can keep the data aligned to a business objective.
To start off, the data analyst verifies that the data exported to spreadsheets is clean and confirms that
the data needed (when users access content) is available. Knowing this, the analyst decides there is
good alignment of the data to the business objective. All that is missing is figuring out exactly how long
it takes each user to view content after their subscription has been activated.
Here are the data processing steps the analyst takes for a user from an account called V&L Consulting.
(These steps would be repeated for each subscribing account, and for each user associated with that
account.)
Steps 1 through 4 (the spreadsheet screenshots are not reproduced here): the analyst looks up the user's subscription activation date and the date the user first viewed content, records the relevant data in a spreadsheet, and calculates the difference between the two dates. Result: 10 days.
Pro tip 1
In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and 3 to
populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function that searches for
a certain value in a column to return a related piece of information. Using VLOOKUP can save a lot of
time; without it, you have to look up dates and names manually.
Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google Sheets.
Inputs: the search key, the range to search, the index of the column to return, and whether the range is sorted (is_sorted).
Return value: the first matched value from the selected range.
For details on exact versus approximate matching, see VLOOKUP exact match or approximate match in the Help Center.
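For example, a minimal sketch of the kind of formula the analyst might use (the sheet name, cell references, and column layout here are hypothetical, not from the original scenario):
=VLOOKUP(A2, 'Activation Dates'!A:C, 3, FALSE)
Here, A2 holds the user ID being looked up, 'Activation Dates'!A:C is the range to search (with user IDs in its first column), 3 returns the value from the third column of that range, and FALSE requests an exact match.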
Pro tip 2
In Step 4 of the above process, the analyst could use the DATEDIF function to automatically calculate
the difference between the dates in column C and column D. The function can calculate the number of
days between two dates.
Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The DAYS360
function does the same thing in accounting spreadsheets that use a 360-day year (twelve 30-day
months).
Refer to the DATEDIF page in the Google Help Center for how to use the function in Google Sheets.
DATEDIF function
Calculates the number of days, months, or years
between two dates.
Warning: Excel provides the DATEDIF function in order to support older workbooks from Lotus
1-2-3. The DATEDIF function may calculate incorrect results under certain scenarios. Please see
the known issues section of this article for further details.
Syntax
DATEDIF(start_date,end_date,unit)
start_date (required): A date that represents the first, or starting, date of a given period. Dates may be entered as text strings within quotation marks (for example, "2001/1/30"), as serial numbers (for example, 36921, which represents January 30, 2001, if you're using the 1900 date system), or as the results of other formulas or functions (for example, DATEVALUE("2001/1/30")).
end_date (required): A date that represents the last, or ending, date of the period.
unit: The type of information that you want returned, where:
"Y" returns the number of complete years in the period.
"M" returns the number of complete months in the period.
"D" returns the number of days in the period.
"MD" returns the difference between the days in start_date and end_date. The months and years of the dates are ignored.
Important: We don't recommend using the "MD" argument, as there are known limitations with it. See the known issues section of the article for further details.
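Tying this back to Pro tip 2: assuming, as described above, that the two dates being compared sit in columns C and D of the analyst's spreadsheet, the formula for row 2 would look like this:
=DATEDIF(C2, D2, "D")
This returns the number of complete days between the two dates, matching the 10-day result in the earlier example.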
Business objective
Cloud Gate, a software company, recently hosted a series of public webinars as free product
introductions. The data analyst and webinar program manager want to identify companies that had
five or more people attend these sessions. They want to give this list of companies to sales managers
who can follow up for potential sales.
The webinar attendance data includes the fields and data shown below.
Name: <First name> <Last name> (required information attendees had to submit)
Email address: <Email address> (required information attendees had to submit)
Company: <Company name> (optional information attendees could provide)
Data cleaning
The webinar attendance data seems to align with the business objective. But the data analyst and
program manager decide that some data cleaning is needed before the analysis. They think data
cleaning is required because:
The company name wasn't a mandatory field. If the company name is blank, it might be derived from the email address. For example, if the email address ends in @google.com, the company field could be filled in with Google for the data analysis (see the formula sketch after this list). This data cleaning step assumes that people with company-assigned email addresses attended a webinar for business purposes.
Attendees could enter any name. Since attendance across a series of webinars is being looked
at, they need to validate names against unique email addresses. For example, if Joe Cox
attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would
be counted as two different people. To prevent this, they need to check his unique email
address to determine that he was the same person. After the validation, Joseph Cox could be
changed to Joe Cox to match the other instance.
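Here is a minimal sketch of the first cleaning step, assuming a hypothetical sheet layout with email addresses in column B and company names in column C:
=IF(C2="", PROPER(REGEXEXTRACT(B2, "@([^.]+)\.")), C2)
The formula keeps the company name when one was provided; otherwise it pulls the domain name out of the email address (for example, google from an @google.com address) and capitalizes it. Rows with personal email providers would still need to be reviewed by hand.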
Business objective
An after-school tutoring company, A+ Education, wants to know if there is a minimum number of
tutoring hours needed before students have at least a 10% improvement in their assessment scores.
The data analyst thinks there is good alignment between the data available and the business objective
because:
Students log in and out of a system for each tutoring session, and the number of hours is
tracked
Assessment scores are regularly recorded
Key takeaways
Hopefully these examples give you a sense of what to look for to know if your data aligns with your
business objective.
When there is clean data and good alignment, you can get accurate insights and make
conclusions the data supports.
If there is good alignment but the data needs to be cleaned, clean the data before you perform
your analysis.
If the data only partially aligns with an objective, think about how you could modify the
objective, or use data constraints to make sure that the subset of data better aligns with the
business objective.
Consider the following data issues and suggestions on how to work around them.
Use the following decision tree as a reminder of how to deal with data errors or not enough
data:
1. Can you fix or request a corrected dataset? If not:
2. Do you have enough data to omit the wrong data? If not:
3. Can you proxy the data? If not:
4. Can you collect more data? If not, modify the business objective (if possible).
Calculating sample size
Before you dig deeper into sample size, familiarize yourself with these terms and
definitions:
Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error: Since a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
Confidence interval: The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.
Don’t use a sample size less than 30. It has been statistically proven that 30 is the
smallest sample size where an average result of a sample starts to represent the
average result of a population.
The confidence level most commonly used is 95%, but 90% can work in some
cases.
Increase the sample size to meet the specific needs of your project, such as a higher confidence level or a smaller margin of error.
This recommendation is based on the Central Limit Theorem (CLT) in the field of
probability and statistics. As sample size increases, the results more closely resemble the
normal (bell-shaped) distribution from a large number of samples. A sample of 30 is the
smallest sample size for which the CLT is still valid. Researchers who rely on regression
analysis – statistical methods to determine the relationships between controlled and
dependent variables – also prefer a minimum sample of 30.
Still curious? Without getting too much into the math, check out these articles:
Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit
Theorem and briefly describes how it can apply to an analysis of a stock index.
Sample Size Formula: This article by Statistics Solutions provides a little more
detail about why some researchers use 30 as a minimum sample size.
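As a rough worked illustration of one common large-population formula (the numbers here are hypothetical, and a sample size calculator will do this math for you): n ≈ z² × p(1−p) / e². With a 95% confidence level (z = 1.96), an assumed proportion p = 0.5, and a 5% margin of error (e = 0.05), this gives n ≈ (3.8416 × 0.25) / 0.0025 ≈ 385 survey responses.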
For example, if you live in a city with a population of 200,000 and get 180,000 people to
respond to a survey, that is a large sample size. But without actually doing that, what
would an acceptable, smaller sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
A sample size of 200 might be large enough if your business problem is to find out
how residents felt about the new library
A sample size of 200 might not be large enough if your business problem is to
determine how residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the
new library versus surveying residents about how they would vote to fund it. For that
reason, you would most likely use a larger sample size for the voter survey.
Refer to the Determine the Best Sample Size video for a demonstration of a sample size
calculator, or refer to the Sample Size Calculator reading for additional information.
Overview
Now that you have learned about how to prepare for data cleaning, you can pause for a moment and
reflect on these steps. In this self-reflection, you will consider your thoughts about the importance of
pre-cleaning activities and respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to apply your
knowledge of pre-cleaning activities and insufficient data to your own data cleaning work. As you
answer questions—and come up with questions of your own—you will consider concepts, practices,
and principles to help refine your understanding and reinforce your learning. You’ve done the hard
work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Review data integrity
Before data analysts can analyze data, they first need to think about and understand the data they're
working with. Assessing data integrity is a key step in this process. As you've learned in previous
lessons, you should complete the following tasks before analyzing data:
1. Determine data integrity by assessing the overall accuracy, consistency, and completeness of the
data.
2. Connect objectives to data by understanding how your business objectives can be served by an
investigation into the data.
Data analysts perform pre-cleaning activities to complete these steps. Pre-cleaning activities help you
determine and maintain data integrity, which is essential to the role of a junior data analyst.
One of the objectives of pre-cleaning activities is to address insufficient data. Recall from previous lessons that data can be insufficient for a number of reasons, such as coming from only one source, continuously updating and being incomplete, being outdated, or being geographically limited.
Reflection
Consider what you have learned about data insufficiency and the steps for how to avoid it:
Here's an example. A nasal version of a vaccine was recently made available. A clinic wants to know
what to expect for contraindications, but just started collecting first-party data from its patients. A
contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would
cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an
open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data
with patient profiles most closely matching the makeup of the patients at the clinic.
There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats including the most basic type, Comma Separated Values (CSV) files.
As with all other kinds of datasets, be on the lookout for duplicate data and Null in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted as the value 0. It is important to understand how Null was used before you start analyzing a dataset with Null data.
Confidence level: The probability that your sample size accurately reflects the
greater population.
Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population.
Population: This is the total number you hope to pull your sample from.
Sample: A part of a population that is representative of the population.
Estimated response rate: If you are running a survey of individuals, this is the
percentage of people you expect will complete your survey out of those who
received the survey.
How to use a sample size calculator
In order to use a sample size calculator, you need to have the population size, confidence level, and acceptable margin of error already decided so you can input them into the tool. Once you have that information ready, try one of the many free sample size calculators available online.
Now that you have the basics, try some calculations using the sample size calculators and
refer back to this reading if you need a refresher on the definitions.
Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.
Incomplete data and incorrect/inaccurate data are other common types of dirty data, each with its own causes and potential harm to businesses.
For further reading on the business impact of dirty data, enter the term “dirty data” into
your preferred browser’s search bar to bring up numerous articles on the topic. Here are a
few impacts cited for certain industries from a previous search:
Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
Digital commerce: Up to 25% of B2B database contacts contain inaccuracies
(source).
Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales
campaigns (source).
Healthcare: Duplicate records can be 10% and even up to 20% of a hospital’s
electronic health records (source).
Additional resources
Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help
you avoid the most common mistakes:
Top ten ways to clean your data: Review an orderly guide to data cleaning in
Microsoft Excel.
10 Google Workspace tips to clean up data: Learn best practices for data cleaning
in Google Sheets.
What was the most challenging part of cleaning the data?
Why is cleaning and transposing data important for data analysis?
If you had to clean this data again, what would you do differently? Why?
In this activity, you cleaned and transposed data on a spreadsheet. A good response would include that cleaning is a fundamental step in data science, as it greatly increases the integrity of the data.
Good data science results rely heavily on the reliability of the data. Data analysts
clean data to make it more accurate and reliable. This is important for making sure
that the projects you will work on as a data analyst are completed properly.
Workflow automation
In this reading, you will learn about workflow automation and how it can help you work
faster and more efficiently. Basically, workflow automation is the process of automating
parts of your work. That could mean creating an event trigger that sends a notification
when a system is updated. Or it could mean automating parts of the data cleaning
process. As you can probably imagine, automating different parts of your work can save
you tons of time, increase productivity, and give you more bandwidth to focus on other
important aspects of the job.
The following list shows each task, whether it can be automated, and why.
Communicating with your team and stakeholders: No. Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications.
Presenting your findings: No. Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can't be automated for the same reasons that communications can't be automated.
Preparing and cleaning data: Partially. Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values (see the SQL sketch after this list).
Data exploration: Partially. Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.
Modeling the data: Yes. Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages.
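For example, here is a minimal sketch of the kind of check that could be scheduled to run automatically. The table and column names are hypothetical, and COUNTIF is BigQuery syntax; other databases would use an expression like SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) instead.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(customer_id IS NULL) AS missing_customer_id,
  COUNTIF(order_date IS NULL) AS missing_order_date
FROM
  sales.orders;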
More resources
There are a lot of tools out there that can help automate your processes, and those tools
are improving all the time. Here are a few articles or blogs you can check out if you want to
learn more about workflow automation and the different tools out there for you to use:
You can start developing your personal approach to cleaning data by creating a standard
checklist to use before your data cleaning process. Think of this checklist as your default
"what to search for" list.
With a good checklist, you can efficiently and, hopefully, swiftly identify all the problem
spots without getting sidetracked. You can also use the checklist to identify the scale and
scope of the dataset itself.
After you have compiled your personal checklist, you can create a list of activities you like
to perform when cleaning data. This list is a collection of procedures that you will
implement when you encounter specific issues present in the data related to your
checklist or every time you clean a new dataset.
For example, suppose that you have a dataset with missing data. How would you handle it? Moreover, if the dataset is very large, what would you do to check for missing data? Outlining some of your preferred methods for cleaning data can help save you time and energy.
Now that you have a personal checklist and your preferred data cleaning methods, you
can create a data cleaning motto to help guide and explain your process. The motto is a
short one or two sentence summary of your philosophy towards cleaning data. For
example, here are a few data cleaning mottos from other data analysts:
1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."
The data you encounter as an analyst won’t always conform to your checklist or activities
list regardless of how comprehensive they are. Data cleaning can be an involved and
complicated process, but surprisingly most data has similar problems. A solid personal
motto and explanation can make the more common data cleaning tasks easy to
understand and complete.
Reflection
Now that you have completed your Data Cleaning Approach Table, take a moment to
reflect on the decisions you made about your data cleaning approach. Write 1-2 sentences
(20-40 words) answering each of the following questions:
What items did you add to your data cleaning checklist? Why did you decide these
were important to check for?
How have your own experiences with data cleaning affected your preferred
cleaning methods? Can you think of an example where you needed to perform one
of these cleaning tasks?
How did you decide on your data cleaning motto?
In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst knows they could get the same results with a single SQL query.
How did working with SQL help you query a larger dataset?
How long do you think it would take a team to query a dataset like this manually?
How does the ability to query large datasets in reasonable amounts of time affect
data analysts?
A good response would include how querying a dataset with billions of items isn’t
feasible without tools such as relational databases and SQL.
Performing large queries by hand would take years and years of manual work. The
ability to query large datasets is an extremely helpful tool for data analysts. You
can gain insights from massive amounts of data to discover trends and
opportunities that wouldn’t be possible to find without tools like SQL.
You must have a BigQuery account to follow along. If you have hopped around courses,
Using BigQuery in the Prepare Data for Exploration course covers how to set up a
BigQuery account.
Next, complete the following steps in your BigQuery console to upload the
Customer Table dataset.
Step 1: Open your BigQuery console and click on the project you want to upload the data
to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your
project name and select Create dataset.
Step 3: In the upcoming video, the name "customer_data" will be used for the dataset. If
you plan to follow along with the video, enter customer_data for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the
customer_data dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to customer_data and select Open.
Step 7: Click the blue + icon at the middle to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Customer Table CSV file you downloaded.
Choose CSV from the file format drop-down.
Step 9: For Table name, enter customer_address if you plan to follow along with the
video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the customer_address table
under your customer_data dataset in your project.
Step 12: Click customer_address and then select the Preview tab. Confirm that you see the
data shown below.
And now you have everything you need to follow along with the next video. This is also a
great table to use to practice querying data on your own. Plus, you can use these steps to
upload any other data you want to work with.
Activity overview
In previous lessons, you learned about the importance of being able to clean your data where it lives.
When it comes to data stored in databases, that means using SQL queries. In this activity, you will
create a custom dataset and table, import a CSV file, and use SQL queries to clean automobile data.
In this scenario, you are a data analyst working with a used car dealership startup venture. The
investors want you to find out which cars are most popular with customers so they can make sure to
stock accordingly.
By the time you complete this activity, you will be able to clean data using SQL. This will enable you to
process and analyze data in databases, which is a common task for data analysts.
Click the link to the automobile_data file to download it. Or you may download the CSV file directly
from the attachments below.
Similarly to a previous BigQuery activity, you will need to create a dataset and a custom table to house
your data. Then, you’ll be able to use SQL queries to explore and clean it. Once you’ve downloaded the
automobile_data file, you can create your dataset.
Step 1: Create a dataset
Go to the Explorer pane in your workspace and click the three dots next to your pinned project to
open the menu. From here, select Create dataset.
From the Create dataset menu, fill out some information about the dataset. Input the Dataset ID as
cars; you can leave the Data location as Default. Then click CREATE DATASET.
The cars dataset should appear under your project in the Explorer pane as shown below. Click on the
three dots next to the cars dataset to open it.
Step 2: Create table
After you open your newly created dataset, you will be able to add a custom table for your data.
Under Source, upload the automobile_data CSV. Under Destination, make sure you are uploading
into your cars dataset and name your table car_info. You can set the schema to Auto-detect. Then,
click Create table.
After creating your table, it will appear in your Explorer pane. You can click on the table to explore the
schema and preview your data. Once you have gotten familiar with your data, you can start querying
it.
Your new dataset contains historical sales data, including details such as car features and prices. You
can use this data to find the top 10 most popular cars and trims. But before you can perform your
analysis, you’ll need to make sure your data is clean. If you analyze dirty data, you could end up
presenting the wrong list of cars to the investors. That may cause them to lose money on their car
inventory investment.
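The activity's first check is the fuel_type column. A query along these lines (a sketch; the activity provides the exact query) lists the column's unique values so you can compare them against the values you expect:
SELECT
  DISTINCT fuel_type
FROM
  cars.car_info;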
This confirms that the fuel_type column doesn’t have any unexpected values.
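Next, the activity checks that the values in the length column fall within the expected range. A sketch of that check using MIN and MAX:
SELECT
  MIN(length) AS min_length,
  MAX(length) AS max_length
FROM
  cars.car_info;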
Your results should confirm that 141.1 and 208.1 are the minimum and maximum values respectively in
this column.
You can check to see if the num_of_doors column contains null values using this query:
SELECT
  *
FROM
  cars.car_info
WHERE
  num_of_doors IS NULL;
This will select any rows with missing data for the num_of_doors column and return them in your results table. You should get two results: one Mazda and one Dodge.
In order to fill in these missing values, you check with the sales manager, who states that all Dodge gas
sedans and all Mazda diesel sedans sold had four doors. If you are using the BigQuery free trial, you can
use this query to update your table so that all Dodge gas sedans have four doors:
UPDATE cars.car_info SET num_of_doors = "four" WHERE make = "dodge" AND fuel_type =
"gas" AND body_style = "sedan";
You should get a message telling you that three rows were modified in this table. To make sure, you can
run the previous query again:
SELECT
  *
FROM
  cars.car_info
WHERE
  num_of_doors IS NULL;
Now, you only have one row with a NULL value for num_of_doors. Repeat this process to replace the
null value for the Mazda.
If you are using the BigQuery Sandbox, you can skip these UPDATE queries; they will not affect your
ability to complete this activity.
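The next column to check is num_of_cylinders. A sketch of the kind of query the activity uses to list its unique values:
SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;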
After running this, you notice that there is one row too many: there are two entries for two cylinders, in rows 6 and 7, but the word two in row 7 is misspelled.
To correct the misspelling for all rows, you can run this query if you have the BigQuery free trial:
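-- A sketch of that correction; it assumes the misspelled value in row 7 is "tow",
-- which is not spelled out in this reading
UPDATE
  cars.car_info
SET
  num_of_cylinders = "two"
WHERE
  num_of_cylinders = "tow";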
You will get a message alerting you that one row was modified after running this statement. To check that it worked, you can run the previous query again:
SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;
Next, you can check the compression_ratio column. According to the data description, the compression_ratio column values should range from 7 to 23. Just like when you checked the length values, you can use MIN and MAX to check if that's correct:
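-- A sketch of that check; the alias names are illustrative
SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info;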
Notice that this returns a maximum of 70. But you know this is an error because the maximum value in
this column should be 23, not 70. So the 70 is most likely a 7.0. Run the above query again without the
row with 70 to make sure that the rest of the values fall within the expected range of 7 to 23.
SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info
WHERE
  compression_ratio <> 70;
Now the highest value is 23, which aligns with the data description. So you’ll want to correct the 70
value. You check with the sales manager again, who says that this row was made in error and should be
removed. Before you delete anything, you should check to see how many rows contain this erroneous
value as a precaution so that you don’t end up deleting 50% of your data. If there are too many (for
instance, 20% of your rows have the incorrect 70 value), then you would want to check back in with the
sales manager to inquire if these should be deleted or if the 70 should be updated to another value. Use
the query below to count how many rows you would be deleting:
SELECT
COUNT(*) AS num_of_rows_to_delete
FROM
cars.car_info
WHERE
compression_ratio = 70;
Turns out there is only one row with the erroneous 70 value. So you can delete that row using this
query:
DELETE cars.car_info
WHERE
  compression_ratio = 70;
If you are using the BigQuery sandbox, you can replace DELETE with SELECT to see which row would be
deleted.
Check the drive_wheels column for inconsistencies by running a query with a SELECT DISTINCT
statement:
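-- A sketch of that check
SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;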
It appears that 4wd appears twice in the results. However, because you used a SELECT DISTINCT statement to return unique values, this probably means there's an extra space in one of the 4wd entries that makes it different from the other 4wd.
To check if this is the case, you can use a LENGTH statement to determine how long each of these string variables is:
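-- A sketch of that check; string_length is an illustrative alias
SELECT
  DISTINCT drive_wheels,
  LENGTH(drive_wheels) AS string_length
FROM
  cars.car_info;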
According to these results, some instances of the 4wd string have four characters instead of the
expected three (4wd has 3 characters). In that case, you can use the TRIM function to remove all
extra spaces in the drive_wheels column if you are using the BigQuery free trial:
UPDATE
cars.car_info
SET
drive_wheels = TRIM(drive_wheels)
WHERE TRUE;
Then, you run the SELECT DISTINCT statement again to ensure that there are only three distinct
values in the drive_wheels column:
Sources of errors: Did you use the right tools and functions to find the source of the errors in
your dataset?
Null data: Did you search for NULLs using conditional formatting and filters?
Misspelled words: Did you locate all misspellings?
Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM
function?
Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function
or DISTINCT in SQL?
Mismatched data types: Did you check that numeric, date, and string data are typecast
correctly?
Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and
meaningful?
Messy (inconsistent) date formats: Did you format the dates consistently throughout your
dataset?
Misleading variable labels (columns): Did you name your columns meaningfully?
Truncated data: Did you check for truncated or missing data that needs correction?
Business logic: Did you check that the data makes sense given your knowledge of the business?
Engineers use engineering change orders (ECOs) to keep track of new product design
details and proposed changes to existing products. Writers use document revision
histories to keep track of changes to document flow and edits. And data analysts use
changelogs to keep track of data transformation and cleaning. Here are some examples of
these:
Google Sheets: 1. Right-click the cell and select Show edit history. 2. Click the left-arrow < or right-arrow > to move backward and forward in the history as needed.
Microsoft Excel: 1. If Track Changes has been enabled for the spreadsheet, click Review. 2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made.
BigQuery: Bring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version.
Finally, a changelog is important for when lots of changes to a spreadsheet or query have
been made. Imagine an analyst made four changes and the change they want to revert to
is change #2. Instead of clicking the undo feature three times to undo change #2 (and
losing changes #3 and #4), the analyst can undo just change #2 and keep all the other
changes. Now, our example was for just 4 changes, but try to think about how important
that changelog would be if there were hundreds of changes to keep track of.
A junior analyst probably only needs to know the above with one exception. If an analyst is
making changes to an existing SQL query that is shared across the company, the company
most likely uses what is called a version control system. An example might be a query that
pulls daily revenue to build a dashboard for senior management.
In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in
the other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the
spreadsheet containing the data the first time you import the data.
The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace
it with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow
access button.
Refer to the Google support page for IMPORTRANGE for the sample usage and syntax.
On Tuesday, they use the following to import the donor names and matched amounts:
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4001")
On Wednesday, another 500 transactions were processed. They increase the range used by 500 to
easily include the latest transactions when importing the data to the individual donor spreadsheet:
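Assuming the same placeholder spreadsheet as in the Tuesday example, the updated import would look like this:
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4501")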
Note: The above examples are for illustrative purposes only. Don't copy and paste them into your
spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you
have multiple tabs) along with the range of cells in the spreadsheet that you have populated with
data.
The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the
range of data that you want to query from, and then use the SQL SELECT command to select the
specific columns. You can also add specific criteria after the SELECT statement by including a
WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes!
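For instance, here is a sketch of a QUERY formula; the sheet name, range, and criteria are made up for illustration:
=QUERY(Events!A1:D100, "SELECT A, B, D WHERE D = 'Webinar'", 1)
Everything inside the outer quotation marks is the SQL-like query, and the optional third argument tells QUERY how many header rows the range contains.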
Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps to connect to a data source and select the tables. In either case, you can be sure that the imported data is verified and clean based on the criteria in the query.
The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY
function can be combined with other functions for more complex calculations. For example, the
QUERY function can be used with other functions like SUM and COUNT to summarize data, but the
FILTER function can't.
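As an illustration with made-up ranges, the first formula below only filters rows, while the second shows QUERY filtering and summarizing in a single step, which FILTER can't do on its own:
=FILTER(A2:C100, C2:C100 > 100)
=QUERY(A1:C100, "SELECT A, SUM(C) WHERE C > 100 GROUP BY A", 1)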
The skills section on your resume likely only has room for 2-4 bullet points, so be sure to use this space effectively. You might want to prioritize technical skills over soft skills. This is a great chance for you to highlight some of the skills you've picked up in these courses, such as working with spreadsheets, SQL, and data visualization.
Many companies use algorithms to screen and filter resumes for keywords. If your resume
does not contain the keywords they are searching for, a human may never even read your
resume. Reserving at least one bullet point to list specific programs you are familiar with is
a great way to make sure your resume makes it past automated keyword screenings and
onto the desk of a recruiter or hiring manager.