
Process of Cleaning Dirty Data

More about data integrity and compliance


This reading illustrates the importance of data integrity using an example of a global company’s data.
Definitions of terms that are relevant to data integrity will be provided at the end. 

Scenario: calendar dates for a global company


Calendar dates are represented in a lot of different short forms. Depending on where you live, a
different format might be used. 

 In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
 In other countries, the national standard is YYYY-MM-DD, so October 12, 2020 becomes 2020-10-12.
 In the United States, MM/DD/YY is the accepted format, so October 12, 2020 is written as 10/12/20.
Now, think about what would happen if you were working as a data analyst for a global company and
didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of
the data would be inaccurate. Imagine ordering extra inventory for December when it was actually
needed in October!

A good analysis depends on the integrity of the data, and data integrity usually depends on using a
common format. So it is important to double-check how dates are formatted to make sure what you
think is December 10, 2020 isn’t really October 12, 2020, and vice versa.

Here are some other things to watch out for:

 Data replication compromising data integrity: Continuing with the example, imagine you ask
your international counterparts to verify dates and stick to one format. One analyst copies a
large dataset to check the dates. But because of memory issues, only part of the dataset is
actually copied. The analyst would be verifying and standardizing incomplete data. That
partial dataset would be certified as compliant but the full dataset would still contain dates
that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit
of results would be essential to reveal what happened and correct all dates.
 Data transfer compromising data integrity: Another analyst checks the dates in a
spreadsheet and chooses to import the validated and standardized data back to the database.
But suppose the date field from the spreadsheet was incorrectly classified as a text field during
the data import (transfer) process. Now some of the dates in the database are stored as text
strings. At this point, the data needs to be cleaned to restore its integrity. 
 Data manipulation compromising data integrity: When checking dates, another analyst
notices what appears to be a duplicate record in the database and removes it. But it turns out
that the analyst removed a unique record for a company’s subsidiary and not a duplicate
record for the company. Your dataset is now missing data and the data must be restored for
completeness.

Conclusion

Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis.

Reference: Data constraints and examples

As you progress in your data journey, you'll come across many types of data constraints (or criteria that determine validity). The list below offers definitions and examples of data constraint terms you might come across.

Data type
Definition: Values must be of a certain type: date, number, percentage, Boolean, etc.
Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid.

Data range
Definition: Values must fall between predefined maximum and minimum values.
Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid.

Mandatory
Definition: Values can't be left blank or empty.
Example: If age is mandatory, that value must be filled in.

Unique
Definition: Values can't have a duplicate.
Example: Two people can't have the same mobile phone number within the same service area.

Regular expression (regex) patterns
Definition: Values must match a prescribed pattern.
Example: A phone number must match ###-###-#### (no other characters allowed).

Cross-field validation
Definition: Certain conditions for multiple fields must be satisfied.
Example: Values are percentages and values from multiple fields must add up to 100%.

Primary-key
Definition: (Databases only) Value must be unique per column.
Example: A database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.

Set-membership
Definition: (Databases only) Values for a column must come from a set of discrete values.
Example: The value for a column must be set to Yes, No, or Not Applicable.

Foreign-key
Definition: (Databases only) Values for a column must be unique values coming from a column in another table.
Example: In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table.

Accuracy
Definition: The degree to which the data conforms to the actual entity being measured or described.
Example: If values for zip codes are validated by street location, the accuracy of the data goes up.

Completeness
Definition: The degree to which the data contains all desired components or measures.
Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.

Consistency
Definition: The degree to which the data is repeatable from different points of entry or collection.
Example: If a customer has the same address in the sales and repair databases, the data is consistent.
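Many of these constraints can be declared directly in a database schema. The SQL sketch below is illustrative only: the table and column names are hypothetical, and the exact syntax and level of enforcement vary by database.

CREATE TABLE states (
  state_code CHAR(2) PRIMARY KEY               -- the set of acceptable foreign-key values
);

CREATE TABLE users (
  user_id     INT PRIMARY KEY,                                       -- primary-key: each value must be unique
  email       VARCHAR(255) NOT NULL UNIQUE,                          -- mandatory and unique
  age         INT CHECK (age BETWEEN 13 AND 120),                    -- data range
  newsletter  VARCHAR(3) CHECK (newsletter IN ('Yes', 'No', 'N/A')), -- set-membership
  state_code  CHAR(2) REFERENCES states (state_code),                -- foreign-key
  signup_date DATE                                                   -- data type
);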

Well-aligned objectives and data


You can gain powerful insights and make accurate conclusions when data is well-aligned to business
objectives. As a data analyst, alignment is something you will need to judge. Good alignment means
that the data is relevant and can help you solve a business problem or determine a course of action to
achieve a given business objective.

In this reading, you will review the business objectives associated with three scenarios. You will explore
how clean data and well-aligned business objectives can help you come up with accurate conclusions.
On top of that, you will learn how new variables discovered during data analysis can cause you to set
up data constraints so you can keep the data aligned to a business objective.  

Clean data + alignment to business objective = accurate conclusions
Business objective
Account managers at Impress Me, an online content subscription service, want to know how soon users
view content after their subscriptions are activated. 

To start off, the data analyst verifies that the data exported to spreadsheets is clean and confirms that
the data needed (when users access content) is available. Knowing this, the analyst decides there is
good alignment of the data to the business objective. All that is missing is figuring out exactly how long
it takes each user to view content after their subscription has been activated.

Here are the data processing steps the analyst takes for a user from an account called V&L Consulting.
(These steps would be repeated for each subscribing account, and for each user associated with that
account.)
Step 1

Data-processing step: Look up the activation date for V&L Consulting
Source of data: Account spreadsheet

Relevant data in spreadsheet:

Result: October 21, 2019

Step 2

Data-processing step: Look up the name of a user belonging to the V&L Consulting account
Source of data: Account spreadsheet (users tab)

Relevant data in spreadsheet:

Result: Maria Ballantyne

Step 3

Data-processing step: Find the first content access date for Maria B.
Source of data: Content usage spreadsheet

Relevant data in spreadsheet:

Result: October 31, 2019

Step 4

Data-processing step: Calculate the time between activation and first content usage for Maria B.
Source of data: New spreadsheet calculation

Relevant data in spreadsheet:

Result: 10 days

Pro tip 1
In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and 3 to
populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function that searches for
a certain value in a column to return a related piece of information. Using VLOOKUP can save a lot of
time; without it, you have to look up dates and names manually.

Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google Sheets.

Inputs

1. search_key: The value to search for in the first column of the range.


2. range: The range to consider for the search. The first column in the range is searched for the value in search_key.
3. index: The index of the column with the return value of the range. The index must
be a positive integer.
4. is_sorted: Optional input. Choose an option:
 FALSE = Exact match. This is recommended.
 TRUE = Approximate match. This is the default if is_sorted is unspecified.
Important: Before you use an approximate match, sort your search key column in
ascending order. Otherwise, you are likely to get a wrong return value (the Google
Help Center explains why).

Return value
The first matched value from the selected range.
VLOOKUP exact match or approximate match

 Use VLOOKUP exact match to find an exact ID.
 Use VLOOKUP approximate match to find the closest match when an exact value isn't required.
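As an illustration of Pro tip 1, a formula along these lines could pull an account's activation date into the Step 4 spreadsheet. The sheet name and cell references here are hypothetical, so adjust them to your own layout:

=VLOOKUP(A2, Accounts!A:B, 2, FALSE)

In this sketch, A2 holds the account name, Accounts!A:B is the range whose first column is searched, 2 returns the value from the second column (the activation date), and FALSE forces an exact match.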

Pro tip 2
In Step 4 of the above process, the analyst could use the DATEDIF function to automatically calculate
the difference between the dates in column C and column D. The function can calculate the number of
days between two dates. 

Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The DAYS360
function does the same thing in accounting spreadsheets that use a 360-day year (twelve 30-day
months).

Refer to the DATEDIF page in the Google Help Center for how to use the function in Google Sheets.

DATEDIF function
Calculates the number of days, months, or years
between two dates.
Warning: Excel provides the DATEDIF function in order to support older workbooks from Lotus
1-2-3. The DATEDIF function may calculate incorrect results under certain scenarios. Please see
the known issues section of this article for further details.

Syntax

DATEDIF(start_date, end_date, unit)

Arguments

start_date (required): A date that represents the first, or starting, date of a given period. Dates may be entered as text strings within quotation marks (for example, "2001/1/30"), as serial numbers (for example, 36921, which represents January 30, 2001, if you're using the 1900 date system), or as the results of other formulas or functions (for example, DATEVALUE("2001/1/30")).

end_date (required): A date that represents the last, or ending, date of the period.

unit: The type of information that you want returned, where:

"Y": The number of complete years in the period.
"M": The number of complete months in the period.
"D": The number of days in the period.
"MD": The difference between the days in start_date and end_date. The months and years of the dates are ignored. Important: We don't recommend using the "MD" argument, as there are known limitations with it; see the known issues section of the Microsoft Support article.
"YM": The difference between the months in start_date and end_date. The days and years of the dates are ignored.
"YD": The difference between the days of start_date and end_date. The years of the dates are ignored.

DAYS360 syntax for Excel

DAYS360(start_date, end_date, [method])
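Tying this back to Pro tip 2, if an activation date were in cell C2 and the first content access date in cell D2 (hypothetical cell references), the number of days between them could be calculated with:

=DATEDIF(C2, D2, "D")

In an accounting spreadsheet that uses a 360-day year, =DAYS360(C2, D2) would be used instead.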

Alignment to business objective + additional data cleaning = accurate conclusions

Business objective
Cloud Gate, a software company, recently hosted a series of public webinars as free product
introductions. The data analyst and webinar program manager want to identify companies that had
five or more people attend these sessions. They want to give this list of companies to sales managers
who can follow up for potential sales.  

The webinar attendance data includes the fields and data shown below.

Name: <First name> <Last name> (required information attendees had to submit)
Email Address: [email protected] (required information attendees had to submit)
Company: <Company name> (optional information attendees could provide)
Data cleaning
The webinar attendance data seems to align with the business objective. But the data analyst and
program manager decide that some data cleaning is needed before the analysis. They think data
cleaning is required because:

 The company name wasn’t a mandatory field. If the company name is blank, it might be found
from the email address. For example, if the email address is [email protected], the
company field could be filled in with Google for the data analysis. This data cleaning step
assumes that people with company-assigned email addresses attended a webinar for business
purposes.
 Attendees could enter any name. Since attendance is being tracked across a series of webinars, they need to validate names against unique email addresses. For example, if Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would be counted as two different people. To prevent this, they need to check his unique email address to determine that he was the same person. After the validation, Joseph Cox could be changed to Joe Cox to match the other instance (a query sketch of this check appears after this list).
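The name-validation step above could also be scripted if the attendance list were loaded into a database. A minimal sketch, assuming a hypothetical webinar_attendance table with name and email columns:

SELECT
  email,
  COUNT(DISTINCT name) AS name_variants
FROM
  webinar_attendance
GROUP BY
  email
HAVING
  COUNT(DISTINCT name) > 1;

Any email address returned here was used with more than one spelling of a name, so those records are candidates for standardization (for example, changing Joseph Cox to Joe Cox).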

Alignment to business objective + newly discovered variables + constraints = accurate conclusions

Business objective
An after-school tutoring company, A+ Education,  wants to know if there is a minimum number of
tutoring hours needed before students have at least a 10% improvement in their assessment scores.

The data analyst thinks there is good alignment between the data available and the business objective
because:

 Students log in and out of a system for each tutoring session, and the number of hours is
tracked
 Assessment scores are regularly recorded  

Data constraints for new variables


After looking at the data, the data analyst discovers that there are other variables to consider. Some
students had consistent weekly sessions while other students had scheduled sessions more randomly
even though their total number of tutoring hours was the same. The data doesn’t align as well with the
original business objective as first thought, so the analyst adds a data constraint to focus only on the
students with consistent weekly sessions. This modification helps to get a more accurate picture about
the enrollment time needed to achieve a 10% improvement in assessment scores. 

Key takeaways
Hopefully these examples give you a sense of what to look for to know if your data aligns with your
business objective. 

 When there is clean data and good alignment, you can get accurate insights and make
conclusions the data supports.
 If there is good alignment but the data needs to be cleaned, clean the data before you perform
your analysis. 
 If the data only partially aligns with an objective, think about how you could modify the
objective, or use data constraints to make sure that the subset of data better aligns with the
business objective.

What to do when you find an issue with your data
When you are getting ready for data analysis, you might realize you don’t have the data
you need or you don’t have enough of it. In some cases, you can use what is known as
proxy data in place of the real data. Think of it like substituting oil for butter in a recipe
when you don’t have butter. In other cases, there is no reasonable substitute and your
only option is to collect more data.

Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data

Possible solution: Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data.
Example: If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.

Possible solution: If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround.
Example: If you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic.

Data issue 2: too little data

Possible solution: Do the analysis using proxy data along with actual data.
Example: If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.

Possible solution: Adjust your analysis to align with the data you already have.
Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.
Data issue 3: wrong data, including data with errors*

Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again.
Example: If you need the data for female voters and received the data for male voters, restate your needs.

Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.
Example: If your data is in a spreadsheet and there is a conditional statement or Boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.

Possible solution: If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias.
Example: If your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data.

* Important note: Sometimes data with errors can be a warning sign that the data isn't reliable. Use your best judgment.

Use the following decision tree as a reminder of how to deal with data errors or not enough
data:

1. Can you fix or request a corrected dataset? If NO:
2. Do you have enough data to omit the wrong data? If NO:
3. Can you proxy the data? If NO:
4. Can you collect more data? If NO: Modify the business objective (if possible).
Calculating sample size
Before you dig deeper into sample size, familiarize yourself with these terms and
definitions:

Terminology and definitions:

Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste, not the whole thing. So if your company is too large to survey every individual, you can survey a representative sample of your population.

Margin of error: Since a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.

Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. The confidence level is targeted before you start your study because it affects how big your margin of error is at the end of your study.

Confidence interval: The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error.

Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.

Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:

 Don’t use a sample size less than 30. It has been statistically proven that 30 is the
smallest sample size where an average result of a sample starts to represent the
average result of a population.
 The confidence level most commonly used is 95%, but 90% can work in some
cases. 
Increase the sample size to meet specific needs of your project:

 For a higher confidence level, use a larger sample size


 To decrease the margin of error, use a larger sample size
 For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More
about these are coming up in the course! Stay tuned.
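As a preview, one commonly used starting point is Cochran's formula for a large population, where z is the z-score for your confidence level (about 1.96 for 95%), p is the estimated proportion (0.5 is the most conservative choice), and e is the margin of error:

n = (z^2 × p × (1 − p)) / e^2

With a 95% confidence level, p = 0.5, and a 5% margin of error, this works out to roughly 385 respondents; calculators then adjust that figure for smaller populations.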
Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution. A sample of 30 is generally considered the smallest sample size for which the CLT still holds. Researchers who rely on regression analysis (statistical methods to determine the relationships between controlled and dependent variables) also prefer a minimum sample of 30.

Still curious? Without getting too much into the math, check out these articles:

 Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit
Theorem and briefly describes how it can apply to an analysis of a stock index.
 Sample Size Formula: This article by Statistics Solutions provides a little more
detail about why some researchers use 30 as a minimum sample size.

Sample sizes vary by business problem


Sample size will vary based on the type of business problem you are trying to solve. 

For example, if you live in a city with a population of 200,000 and get 180,000 people to
respond to a survey, that is a large sample size. But without actually doing that, what
would an acceptable, smaller sample size look like? 

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes. 

 A sample size of 200 might be large enough if your business problem is to find out
how residents felt about the new library
 A sample size of 200 might not be large enough if your business problem is to
determine how residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the
new library versus surveying residents about how they would vote to fund it. For that
reason, you would most likely use a larger sample size for the voter survey.

Larger sample sizes have a higher cost


You also have to weigh the cost against the benefits of more accurate results with a larger
sample size. Someone who is trying to understand consumer preferences for a new line of
products wouldn’t need as large a sample size as someone who is trying to understand the
effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger
sample size. But for consumer preferences, a smaller sample size at a lower cost could
provide good enough results. 

Knowing the basics is helpful


Knowing the basics will help you make the right choices when it comes to sample size. You
can always raise concerns if you come across a sample size that is too small. A sample size
calculator is also a great tool for this. Sample size calculators let you enter a desired
confidence level and margin of error for a given population size. They then calculate the
sample size needed to statistically achieve those results. 

Refer to the Determine the Best Sample Size video for a demonstration of a sample size
calculator, or refer to the Sample Size Calculator reading for additional information.

Note: Explore the Statistics Solutions sample size resources online for more detail.

Overview

Now that you have learned about how to prepare for data cleaning, you can pause for a moment and
reflect on these steps. In this self-reflection, you will consider your thoughts about the importance of
pre-cleaning activities and respond to brief questions. 

This self-reflection will help you develop insights into your own learning and prepare you to apply your
knowledge of pre-cleaning activities and insufficient data to your own data cleaning work. As you
answer questions—and come up with questions of your own—you will consider concepts, practices,
and principles to help refine your understanding and reinforce your learning. You’ve done the hard
work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Review data integrity

Before data analysts can analyze data, they first need to think about and understand the data they're
working with. Assessing data integrity is a key step in this process. As you've learned in previous
lessons, you should complete the following tasks before analyzing data: 

1. Determine data integrity by assessing the overall accuracy, consistency, and completeness of the
data.

2. Connect objectives to data by understanding how your business objectives can be served by an
investigation into the data.

3. Know when to stop collecting data.

Data analysts perform pre-cleaning activities to complete these steps. Pre-cleaning activities help you
determine and maintain data integrity, which is essential to the role of a junior data analyst.

What makes data insufficient

One of the objectives of pre-cleaning activities is to address insufficient data. Recall from previous
lessons that data can be insufficient for a number of reasons. Insufficient data has one or more of the
following problems:

 Comes from only one source


 Continuously updates and is incomplete
 Is outdated
 Is geographically limited
To deal with insufficient data, you can:

 Identify trends within the available data


 Wait for more data if time allows
 Discuss with stakeholders and adjust your objective
 Search for a new dataset

Reflection

Consider what you have learned about data insufficiency and the steps for how to avoid it:

 Why are pre-cleaning steps important to complete prior to data cleaning?


 What problems might occur if you don't follow these steps? 
 Pre-cleaning activities are important because they increase the efficiency and success of your data analysis tasks.
 If you know that your data is accurate, consistent, and complete, you can be
confident that your results will be valid. Stakeholders will be pleased if you
connect the data to business objectives. And, knowing when to stop collecting
data will allow you to finish your tasks in a timely manner without sacrificing data
integrity.
 Suppose that you didn't determine data integrity. You may find that you are
working with inaccurate or missing data, which can cause misleading results in
your analysis. If you don’t connect objectives with the data, your analysis may not
be relevant to the stakeholders. Finally, not understanding when to stop collecting
data can lead to unnecessary delays in completing tasks. By completing pre-
cleaning activities, you avoid these problems.

What to do when there is no data


Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might
have some questions about proxy data, so this reading will give you a few more examples of the types
of datasets that can serve as alternate data sources.

Proxy data examples


Sometimes the data to support a business objective isn’t readily available. This is when proxy data is
useful. Take a look at the following scenarios and where proxy data comes in for each example:

Business scenario: A new car model was just launched a few days ago and the auto dealership can't wait until the end of the month for sales data to come in. They want sales projections now.
How proxy data can be used: The analyst proxies the number of clicks on the car specifications on the dealership's website as an estimate of potential sales at the dealership.

Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years.
How proxy data can be used: The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.

Business scenario: The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren't publicly available yet.
How proxy data can be used: The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.

Open (public) datasets


If you are part of a large organization, you might have access to lots of sources of data. But if you are
looking for something specific or a little outside your line of business, you can also make use of open or
public datasets. (You can refer to this Towards Data Science article for a brief explanation of the
difference between open and public data.)

Here's an example. A nasal version of a vaccine was recently made available. A clinic wants to know
what to expect for contraindications, but just started collecting first-party data from its patients. A
contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would
cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an
open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data
with patient profiles most closely matching the makeup of the patients at the clinic. 
There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats, including the most basic type, Comma Separated Values (CSV) files.

CSV, JSON, SQLite, and BigQuery datasets


 CSV: Check out this Credit card customers dataset, which has information from 10,000
customers including age, salary, marital status, credit card limit, credit card category,
etc. (CC0: Public Domain, Sakshi Goyal).
 JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell
J).
 SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0: Public
Domain, Rachael Tatman).
 BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise
Store (CC0 Public Domain, Google BigQuery).
Refer to the Kaggle documentation for datasets for more information and search for and explore
datasets on your own at kaggle.com/datasets.

As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ in open datasets. Null
most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted
as the value, 0. It is important to understand how Null was used before you start analyzing a dataset
with Null data.
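One way to check how Null is being used before you analyze is to compare missing values against explicit zeros. A minimal BigQuery-style sketch, with hypothetical dataset, table, and column names:

SELECT
  COUNTIF(amount IS NULL) AS null_count,  -- fields left unassigned
  COUNTIF(amount = 0) AS zero_count       -- fields explicitly recorded as 0
FROM
  my_dataset.my_table;

If the two counts differ substantially, treating Null as 0 (or vice versa) would change your results, so it is worth confirming the intended meaning with the dataset's documentation or owner.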


Sample size calculator


In this reading, you will learn the basics of sample size calculators, how to use them, and
how to understand the results. A sample size calculator tells you how many people you
need to interview (or things you need to test) to get results that represent the target
population. Let’s review some terms you will come across when using a sample size
calculator:

 Confidence level: The probability that your sample size accurately reflects the
greater population.
 Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population.
 Population: This is the total number you hope to pull your sample from.
 Sample: A part of a population that is representative of the population.
 Estimated response rate: If you are running a survey of individuals, this is the
percentage of people you expect will complete your survey out of those who
received the survey.
How to use a sample size calculator
In order to use a sample size calculator, you need to have the population size, confidence
level, and the acceptable margin of error already decided so you can input them into the
tool. If this information is ready to go, check out these sample size calculators below:

 Sample size calculator by surveymonkey.com


 Sample size calculator by raosoft.com

What to do with the results


After you have plugged your information into one of these calculators, it will give you a
recommended sample size. Keep in mind, the calculated sample size is the minimum
number to achieve what you input for confidence level and margin of error. If you are
working with a survey, you will also need to think about the estimated response rate to
figure out how many surveys you will need to send out. For example, if you need a sample
size of 100 individuals and your estimated response rate is 10%, you will need to send your
survey to 1,000 individuals to get the 100 responses you need for your analysis.

Now that you have the basics, try some calculations using the sample size calculators and
refer back to this reading if you need a refresher on the definitions. 

What is dirty data?


Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to
the problem you are trying to solve.  This reading summarizes:

 Types of dirty data you may encounter


 What may have caused the data to become dirty
 How dirty data is harmful to businesses
Types of dirty data
Duplicate data
Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.

Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.

Incomplete data
Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
Description: Any data that is complete but inaccurate.
Possible causes: Human error introduced during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors introduced during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers.

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into
your preferred browser’s search bar to bring up numerous articles on the topic. Here are a
few impacts cited for certain industries from a previous search:

 Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
 Digital commerce: Up to 25% of B2B database contacts contain inaccuracies
(source).
 Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales
campaigns (source).
 Healthcare: Duplicate records can be 10% and even up to 20% of a hospital’s
electronic health records (source).

Common data-cleaning pitfalls


In this reading, you will learn the importance of data cleaning and how to identify
common mistakes. Some of the errors you might come across while cleaning your data
could include:
Common mistakes to avoid
 Not checking for spelling errors: Misspellings can be as simple as typing or input
errors. Most of the time the wrong spelling or common grammatical errors can be
detected, but it gets harder with things like names or addresses. For example, if
you are working with a spreadsheet table of customer data, you might come across
a customer named “John” whose name has been input incorrectly as “Jon” in
some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t
double-check for spelling errors and catch this, your analysis will have mistakes in
it. 
 Forgetting to document errors: Documenting your errors can be a big time saver,
as it helps you avoid those errors in the future by showing you how you resolved
them. For example, you might find an error in a formula in your spreadsheet. You
discover that some of the dates in one of your columns haven’t been formatted
correctly. If you make a note of this fix, you can reference it the next time your
formula is broken, and get a head start on troubleshooting. Documenting your
errors also helps you keep track of changes in your work, so that you can backtrack
if a fix didn’t work. 
 Not checking for misfielded values: A misfielded value happens when the values
are entered into the wrong field. These values might still be formatted correctly,
which makes them harder to catch if you aren’t careful. For example, you might
have a dataset with columns for cities and countries. These are the same type of
data, so they are easy to mix up. But if you were trying to find all of the instances of
Spain in the country column, and Spain had mistakenly been entered into the city
column, you would miss key data points. Making sure your data has been entered
correctly is key to accurate, complete analysis. 
 Overlooking missing values: Missing values in your dataset can create errors and
give you inaccurate conclusions. For example, if you were trying to get the total
number of sales from the last three months, but a week of transactions were
missing, your calculations would be inaccurate.  As a best practice, try to keep your
data as clean as possible by maintaining completeness and consistency.
 Only looking at a subset of the data: It is important to think about all of the
relevant data when you are cleaning. This helps make sure you understand the
whole story the data is telling, and that you are paying attention to all possible
errors. For example, if you are working with data about bird migration patterns
from different sources, but you only clean one source, you might not realize that
some of the data is being repeated. This will cause problems in your analysis later
on. If you want to avoid common errors like duplicates, each field of your data
requires equal attention.
 Losing track of business objectives: When you are cleaning data, you might make
new and interesting discoveries about your dataset-- but you don’t want those
discoveries to distract you from the task at hand. For example, if you were working
with weather data to find the average number of rainy days in your city, you might
notice some interesting patterns about snowfall, too. That is really interesting, but
it isn’t related to the question you are trying to answer right now. Being curious is
great! But try not to let it distract you from the task at hand.  
 Not fixing the source of the error: Fixing the error itself is important. But if that
error is actually part of a bigger problem, you need to find the source of the issue.
Otherwise, you will have to keep fixing that same error over and over again. For
example, imagine you have a team spreadsheet that tracks everyone’s progress.
The table keeps breaking because different people are entering different values.
You can keep fixing all of these problems one by one, or you can set up your table
to streamline data entry so everyone is on the same page. Addressing the source of
the errors in your data will save you a lot of time in the long run. 
 Not analyzing the system prior to data cleaning: If you want to clean your data and
avoid future errors, you need to understand the root cause of your dirty data.
Imagine you are an auto mechanic. You would find the cause of the problem before
you started fixing the car, right? The same goes for data. First, you figure out where
the errors come from. Maybe it is from a data entry error, not setting up a spell
check, lack of formats, or from duplicates. Then, once you understand where bad
data comes from, you can control it and keep your data clean.
 Not backing up your data prior to data cleaning: It is always good to be proactive
and create your data backup before you start your data clean-up. If your program
crashes, or if your changes cause a problem in your dataset, you can always go
back to the saved version and restore it. The simple procedure of backing up your
data can save you hours of work-- and most importantly, a headache. 
 Not accounting for data cleaning in your deadlines/process: All good things take
time, and that includes data cleaning. It is important to keep that in mind when
going through your process and looking at your deadlines. When you set aside time
for data cleaning, it helps you get a more accurate estimate for ETAs for
stakeholders, and can help you know when to request an adjusted ETA. 

Additional resources
Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help
you avoid the most common mistakes:

 Top ten ways to clean your data: Review an orderly guide to data cleaning in
Microsoft Excel.
 10 Google Workspace tips to clean up data: Learn best practices for data cleaning
in Google Sheets.
 What was the most challenging part of cleaning the data?
 Why is cleaning and transposing data important for data analysis?
 If you had to clean this data again, what would you do differently? Why?
 In this activity, you cleaned and transposed data in a spreadsheet. A good response would include that cleaning is a fundamental step in data science, as it greatly increases the integrity of the data.
 Good data science results rely heavily on the reliability of the data. Data analysts
clean data to make it more accurate and reliable. This is important for making sure
that the projects you will work on as a data analyst are completed properly.

Workflow automation
In this reading, you will learn about workflow automation and how it can help you work
faster and more efficiently. Basically, workflow automation is the process of automating
parts of your work. That could mean creating an event trigger that sends a notification
when a system is updated. Or it could mean automating parts of the data cleaning
process. As you can probably imagine, automating different parts of your work can save
you tons of time, increase productivity, and give you more bandwidth to focus on other
important aspects of the job. 

What can be automated?


Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some
parts of the job that can’t be automated. Let's take a look at some things we can automate
and some things that we can’t.

Task: Communicating with your team and stakeholders
Can it be automated? No
Why? Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications.

Task: Presenting your findings
Can it be automated? No
Why? Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can't be automated for the same reasons that communications can't be automated.

Task: Preparing and cleaning data
Can it be automated? Partially
Why? Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values.

Task: Data exploration
Can it be automated? Partially
Why? Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.

Task: Modeling the data
Can it be automated? Yes
Why? Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages.

More about automating data cleaning


One of the most important ways you can streamline your data cleaning is to clean data
where it lives. This will benefit your whole team, and it also means you don’t have to
repeat the process over and over. For example, you could create a programming script
that counted the number of words in each spreadsheet file stored in a specific folder.
Using tools that can be used where your data is stored means that you don’t have to
repeat your cleaning steps, saving you and your team time and energy. 

More resources
There are a lot of tools out there that can help automate your processes, and those tools
are improving all the time. Here are a few articles or blogs you can check out if you want to
learn more about workflow automation and the different tools out there for you to use: 

 Towards Data Science’s Automating Scientific Data Analysis


 MIT News’ Automating Big-Data Analysis
 TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software 
As a data analyst, automation can save you a lot of time and energy, and free you up to
focus more on other parts of your project. The more analysis you do, the more ways you
will find to make your processes simpler and more streamlined.

Step 1: Create your checklist 

You can start developing your personal approach to cleaning data by creating a standard
checklist to use before your data cleaning process. Think of this checklist as your default
"what to search for" list. 

With a good checklist, you can efficiently and, hopefully, swiftly identify all the problem
spots without getting sidetracked. You can also use the checklist to identify the scale and
scope of the dataset itself.

Some things you might include in your checklist:

 Size of the data set


 Number of categories or labels
 Missing data
 Unformatted data
 The different data types
You can use your own experiences so far to help you decide what else you want to include
in your checklist! 

Step 2: List your preferred cleaning methods 

After you have compiled your personal checklist, you can create a list of activities you like
to perform when cleaning data. This list is a collection of procedures that you will
implement when you encounter specific issues present in the data related to your
checklist or every time you clean a new dataset. 

For example, suppose that you have a dataset with missing data: how would you handle
it? Moreover, if the dataset is very large, what would you do to check for missing data?
Outlining some of your preferred methods for cleaning data can help save you time and
energy.

Step 3: Choose a data cleaning motto

Now that you have a personal checklist and your preferred data cleaning methods, you
can create a data cleaning motto to help guide and explain your process. The motto is a
short one or two sentence summary of your philosophy towards cleaning data. For
example, here are a few data cleaning mottos from other data analysts:

1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems." 
The data you encounter as an analyst won’t always conform to your checklist or activities
list regardless of how comprehensive they are. Data cleaning can be an involved and
complicated process, but surprisingly most data has similar problems. A solid personal
motto and explanation can make the more common data cleaning tasks easy to
understand and complete.

Reflection

Now that you have completed your Data Cleaning Approach Table, take a moment to
reflect on the decisions you made about your data cleaning approach. Write 1-2 sentences
(20-40 words) answering each of the following questions: 

 What items did you add to your data cleaning checklist? Why did you decide these
were important to check for?
 How have your own experiences with data cleaning affected your preferred
cleaning methods? Can you think of an example where you needed to perform one
of these cleaning tasks? 
 How did you decide on your data cleaning motto?

Using SQL as a junior data analyst


In this reading, you will learn more about how to decide when to use SQL, or Structured Query
Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is one of the tools
that can help make your work a lot easier. SQL is the primary way data analysts extract data from
databases. As a data analyst, you will work with databases all the time, which is why SQL is such a key
skill. Let’s follow along as a junior data analyst uses SQL to solve a business task.  

The business task and context


The junior data analyst in this example works for a social media company. A new business model was
implemented on February 15, 2020 and the company wants to understand how their user-growth
compares to the previous year. Specifically, the data analyst was asked to find out how many users
have joined since February 15, 2020. 
Spreadsheet functions and formulas or SQL queries?
Before they can address this question, this data analyst needs to choose what tool to use. First, they
have to think about where the data lives. If it is stored in a database, then SQL is the best tool for the
job. But if it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet.
In that scenario, they could create a pivot table of the data and then apply specific formulas and filters until they arrived at the number of users who joined after February 15. It isn't a really complicated process, but it would involve a lot of steps.
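For instance, if the spreadsheet had one row per user with join dates in column B (a hypothetical layout), a single formula could produce the count:

=COUNTIF(B:B, ">="&DATE(2020,2,15))

COUNTIF tallies every join date on or after February 15, 2020; unlike the SQL query shown next, though, it assumes each user appears only once.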

In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst
knows they could get the same results with a single SQL query:

SELECT COUNT(DISTINCT user_id) AS count_of_unique_users
FROM table
WHERE join_date >= '2020-02-15'
Spreadsheets and SQL both have their advantages and disadvantages:

Features of Spreadsheets:
 Smaller datasets
 Enter data manually
 Create graphs and visualizations in the same program
 Built-in spell check and other useful functions
 Best when working solo on a project

Features of SQL Databases:
 Larger datasets
 Access tables across a database
 Prepare data for further analysis in another software
 Fast and powerful functionality
 Great for collaborative work and tracking queries run by all users
When it comes down to it, where the data lives will decide which tool you use. If you are working with
data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you
are working with data stored in a database, SQL will be the best tool for you to use for your analysis.
You will learn more about SQL coming up, so that you will be ready to tackle any business problem with
the best tool possible. 

 How did working with SQL help you query a larger dataset?
 How long do you think it would take a team to query a dataset like this manually?
 How does the ability to query large datasets in reasonable amounts of time affect
data analysts?
 A good response would include how querying a dataset with billions of items isn’t
feasible without tools such as  relational databases and SQL.
 Performing large queries by hand would take years and years of manual work. The
ability to query large datasets is an extremely helpful tool for data analysts. You
can gain insights from massive amounts of data to discover trends and
opportunities that wouldn’t be possible to find without tools like SQL.

Optional: Upload the customer dataset to BigQuery
In the next video, the instructor uses a specific dataset. The instructions in this reading are
provided for you to upload the same dataset in your BigQuery console.

You must have a BigQuery account to follow along. If you have hopped around courses,
Using BigQuery in the Prepare Data for Exploration course covers how to set up a
BigQuery account.

Prepare for the next video


 First, download the CSV file from the attachment below.

Customer Table - Sheet1

CSV File

 Next, complete the following steps in your BigQuery console to upload the
Customer Table dataset.
Step 1: Open your BigQuery console and click on the project you want to upload the data
to.

Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your
project name and select Create dataset.
Step 3: In the upcoming video, the name "customer_data" will be used for the dataset. If
you plan to follow along with the video, enter customer_data for the Dataset ID.

Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.

Step 5: In the Explorer on the left, click to expand your project, and then click the
customer_data dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to customer_data and select Open.

Step 7: Click the blue + icon at the middle to open the Create table window.

Step 8: Under Source, for the Create table from selection, choose where the data will be
coming from.

 Select Upload.
 Click Browse to select the Customer Table CSV file you downloaded.
 Choose CSV from the file format drop-down.
Step 9: For Table name, enter customer_address if you plan to follow along with the
video.

Step 10: For Schema, click the Auto detect check box.

Step 11: Click Create table (blue button). You will now see the customer_address table
under your customer_data dataset in your project.

Step 12: Click customer_address and then select the Preview tab. Confirm that you see the
data shown below.

And now you have everything you need to follow along with the next video. This is also a
great table to use to practice querying data on your own. Plus, you can use these steps to
upload any other data you want to work with. 

Activity overview 

In previous lessons, you learned about the importance of being able to clean your data where it lives.
When it comes to data stored in databases, that means using SQL queries. In this activity, you will
create a custom dataset and table, import a CSV file, and use SQL queries to clean automobile data.

In this scenario, you are a data analyst working with a used car dealership startup venture. The
investors want you to find out which cars are most popular with customers so they can make sure to
stock accordingly. 

By the time you complete this activity, you will be able to clean data using SQL. This will enable you to
process and analyze data in databases, which is a common task for data analysts.

What you will need 


To get started, download the automobile_data CSV file. This is data from an external source that
contains historical sales data on car prices and their features.

Click the link to the automobile_data file to download it. Or you may download the CSV file directly
from the attachments below.

Link to data: automobile_data

OR

Download data:

automobile_data

CSV File

Upload your data

Similarly to a previous BigQuery activity, you will need to create a dataset and a custom table to house
your data. Then, you’ll be able to use SQL queries to explore and clean it. Once you’ve downloaded the
automobile_data file, you can create your dataset.
Step 1: Create a dataset
Go to the Explorer pane in your workspace and click the three dots next to your pinned project to
open the menu. From here, select Create dataset.

From the Create dataset menu, fill out some information about the dataset.  Input the Dataset ID as
cars; you can leave the Data location as Default. Then click CREATE DATASET.
The cars dataset should appear under your project in the Explorer pane. Click the three dots next to
the cars dataset to open it.
Step 2: Create table
After you open your newly created dataset, you will be able to add a custom table for your data. 

From the cars dataset, click CREATE TABLE.

Under Source, upload the automobile_data CSV. Under Destination, make sure you are uploading
into your cars dataset and name your table car_info. You can set the schema to Auto-detect. Then,
click Create table.
After creating your table, it will appear in your Explorer pane. You can click on the table to explore the
schema and preview your data. Once you have gotten familiar with your data, you can start querying
it. 

Cleaning your data

Your new dataset contains historical sales data, including details such as car features and prices. You
can use this data to find the top 10 most popular cars and trims. But before you can perform your
analysis, you’ll need to make sure your data is clean. If you analyze dirty data, you could end up
presenting the wrong list of cars to the investors. That may cause them to lose money on their car
inventory investment.

Step 1: Inspect the fuel_type column


The first thing you want to do is inspect the data in your table so you can find out if there is any specific
cleaning that needs to be done. According to the data’s description, the fuel_type column should only
have two unique string values: diesel and gas. To check and make sure that’s true, run the following
query:
SELECT
  DISTINCT fuel_type
FROM
  cars.car_info;

This returns two values, diesel and gas, which confirms that the fuel_type column doesn’t have any
unexpected values. 

Step 2: Inspect the length column 


Next, you will inspect a column with numerical data. The length column should contain numeric
measurements of the cars. So you will check that the minimum and maximum lengths in the dataset
align with the data description, which states that the lengths in this column should range from 141.1 to
208.1. Run this query to confirm:

SELECT
  MIN(length) AS min_length,
  MAX(length) AS max_length
FROM
  cars.car_info;

Your results should confirm that 141.1 and 208.1 are the minimum and maximum values respectively in
this column. 

Step 3: Fill in missing data


Missing values can create errors or skew your results during analysis. You’re going to want to check
your data for null or missing values. These values might appear as a blank cell or the word null in
BigQuery. 

You can check to see if the num_of_doors column contains null values using this query: 

SELECT
  *
FROM
  cars.car_info
WHERE
  num_of_doors IS NULL;
This will select any rows with missing data for the num_of_doors column and return them in your
results table. You should get two results: one Mazda and one Dodge.

In order to fill in these missing values, you check with the sales manager, who states that all Dodge gas
sedans and all Mazda diesel sedans sold had four doors. If you are using the BigQuery free trial, you can
use this query to update your table so that all Dodge gas sedans have four doors:

UPDATE
  cars.car_info
SET
  num_of_doors = "four"
WHERE
  make = "dodge"
  AND fuel_type = "gas"
  AND body_style = "sedan";

You should get a message telling you that three rows were modified in this table. To make sure, you can
run the previous query again:

SELECT
  *
FROM
  cars.car_info
WHERE
  num_of_doors IS NULL;

Now, you only have one row with a NULL value for num_of_doors. Repeat this process to replace the
null value for the Mazda.
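
For reference, the Mazda update is a direct analog of the Dodge query. A sketch, assuming the make and fuel_type values are stored in lowercase, as they are in the Dodge query:

UPDATE
  cars.car_info
SET
  num_of_doors = "four"
WHERE
  make = "mazda"
  AND fuel_type = "diesel"
  AND body_style = "sedan";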

If you are using the BigQuery Sandbox, you can skip these UPDATE queries; they will not affect your
ability to complete this activity.

Step 4: Identify potential errors


Once you have finished ensuring that there aren’t any missing values in your data, you’ll want to check
for other potential errors. You can use SELECT DISTINCT to check what values exist in a column. You
can run this query to check the num_of_cylinders column:

SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;

After running this, you notice that there is one more value than expected: there are two entries for two
cylinders, but one of them is misspelled as "tow". 
To correct the misspelling for all rows, you can run this query if you have the BigQuery free trial:

UPDATE
  cars.car_info
SET
  num_of_cylinders = "two"
WHERE
  num_of_cylinders = "tow";

You will get a message alerting you that one row was modified after running this statement. To check
that it worked, you can run the previous query again:

SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;

Next, you can check the compression_ratio column. According to the data description, the
compression_ratio column values should range from 7 to 23. Just like when you checked the length
values, you can use MIN and MAX to check if that’s correct:

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info;

Notice that this returns a maximum of 70. But you know this is an error because the maximum value in
this column should be 23, not 70. So the 70 is most likely a 7.0. Run the above query again without the
row with 70 to make sure that the rest of the values fall within the expected range of 7 to 23.

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info
WHERE
  compression_ratio <> 70;

Now the highest value is 23, which aligns with the data description. So you’ll want to correct the 70
value. You check with the sales manager again, who says that this row was made in error and should be
removed. Before you delete anything, you should check to see how many rows contain this erroneous
value as a precaution so that you don’t end up deleting 50% of your data. If there are too many (for
instance, 20% of your rows have the incorrect 70 value), then you would want to check back in with the
sales manager to inquire if these should be deleted or if the 70 should be updated to another value. Use
the query below to count how many rows you would be deleting:

SELECT
  COUNT(*) AS num_of_rows_to_delete
FROM
  cars.car_info
WHERE
  compression_ratio = 70;

Turns out there is only one row with the erroneous 70 value. So you can delete that row using this
query: 

DELETE FROM
  cars.car_info
WHERE
  compression_ratio = 70;

If you are using the BigQuery sandbox, you can replace DELETE FROM with SELECT * FROM to preview
which row would be deleted instead of actually deleting it.
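
For example, this query returns the row that would be deleted without modifying the table:

SELECT
  *
FROM
  cars.car_info
WHERE
  compression_ratio = 70;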

Step 5: Ensure consistency


Finally, you want to check your data for any inconsistencies that might cause errors. These
inconsistencies can be tricky to spot — sometimes even something as simple as an extra space can
cause a problem.

Check the drive_wheels column for inconsistencies by running a query with a SELECT DISTINCT
statement:

SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;

Notice that 4wd appears twice in the results. However, because you used a SELECT DISTINCT statement
to return only unique values, this probably means there’s an extra space in one of the 4wd entries that
makes it different from the other.

To check whether this is the case, you can use the LENGTH function to determine how long each of
these string values is:

SELECT
  DISTINCT drive_wheels,
  LENGTH(drive_wheels) AS string_length
FROM
  cars.car_info;

According to these results, some instances of the 4wd string have four characters instead of the
expected three. In that case, you can use the TRIM function to remove all extra spaces in the
drive_wheels column if you are using the BigQuery free trial:
UPDATE
  cars.car_info
SET
  drive_wheels = TRIM(drive_wheels)
WHERE TRUE;

Then, you run the SELECT DISTINCT statement again to ensure that there are only three distinct
values in the drive_wheels column: 

SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;

 Why is cleaning data before your analysis important?


 Which of these cleaning techniques do you think will be most useful for you in the
future?
 A good response would include that cleaning data is an important step of the
analysis process that will save you time and help ensure accuracy in the future. 
 Cleaning data where it lives is incredibly important for analysts. For instance, you
were able to use SQL to complete multiple cleaning tasks, which allows you to
clean data stored in databases. In upcoming activities, you will use your cleaning
skills to prepare for analysis.
 Are there any areas of data processing with SQL that you’ve found particularly
challenging? 
 Are there any data processing skills that you’d like to improve upon? If so, what are
they?
 A good reflection on this topic would describe your challenges with SQL and the
areas that you want to continue learning about or practicing.
 Pausing to reflect on your learning experience helps you identify areas to improve
with practice and further study. This will help you reach your goals more
effectively. If used correctly, SQL makes tasks like removing duplicates or cleaning
string data much easier, especially with datasets that are too large to work on
effectively with spreadsheets. As you build your skills in SQL, you will be able to
process more complex data and start analyzing it.



Data-cleaning verification: A checklist
This reading will give you a checklist of common problems you can refer to when doing your data
cleaning verification, no matter what tool you are using. When it comes to data cleaning verification,
there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects.
Each project has its own organization and data requirements that lead to a unique list of things to run
through for verification. 
Keep in mind, as you receive more data or a better understanding of the project goal(s), you might
want to revisit some or all of these steps. 

Correct the most common problems


Make sure you identified the most common problems and corrected them, including:

 Sources of errors: Did you use the right tools and functions to find the source of the errors in
your dataset?
 Null data: Did you search for NULLs using conditional formatting and filters?
 Misspelled words: Did you locate all misspellings?
 Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
 Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM
function?
 Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function
or DISTINCT in SQL? (A short SQL sketch showing a NULL check and a duplicate check follows this list.)
 Mismatched data types: Did you check that numeric, date, and string data are typecast
correctly?
 Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and
meaningful?
 Messy (inconsistent) date formats: Did you format the dates consistently throughout your
dataset?
 Misleading variable labels (columns): Did you name your columns meaningfully?
 Truncated data: Did you check for truncated or missing data that needs correction?
 Business Logic: Did you check that the data makes sense given your knowledge of the
business? 
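
As a small illustration of a couple of these checks in SQL, here is a minimal sketch. The dataset, table, and column names (my_dataset.my_table, customer_id, city) are hypothetical placeholders, not from any dataset used in this course:

-- Hypothetical names, for illustration only.
-- Null check: how many rows are missing a city?
SELECT
  COUNT(*) AS null_city_rows
FROM
  my_dataset.my_table
WHERE
  city IS NULL;

-- Duplicate check: which customer_id values appear more than once?
SELECT
  customer_id,
  COUNT(*) AS occurrences
FROM
  my_dataset.my_table
GROUP BY
  customer_id
HAVING
  COUNT(*) > 1;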

Review the goal of your project


Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project
and confirm that your data is still aligned with that goal. This is a continuous process that you will do
throughout your project, but here are three steps you can keep in mind while thinking about this: 

 Confirm the business problem 


 Confirm the goal of the project
 Verify that data can solve the problem and is aligned to the goal
Embrace changelogs
What do engineers, writers, and data analysts have in common? Change.

Engineers use engineering change orders (ECOs) to keep track of new product design
details and proposed changes to existing products. Writers use document revision
histories to keep track of changes to document flow and edits. And data analysts use
changelogs to keep track of data transformation and cleaning. Here are some examples of
these:

Automated version control takes you most of the way


Most software applications have some kind of history tracking built in. For example, in Google
Sheets, you can check the version history of an entire sheet or an individual cell and go
back to an earlier version. In Microsoft Excel, you can use a feature called Track Changes.
And in BigQuery, you can view the history to check what has changed.

Here’s how it works:

Google Sheets: 1. Right-click the cell and select Show edit history. 2. Click the left arrow (<) or right
arrow (>) to move backward and forward in the history as needed.

Microsoft Excel: 1. If Track Changes has been enabled for the spreadsheet, click Review. 2. Under Track
Changes, click the Accept/Reject Changes option to accept or reject any change made.

BigQuery: Bring up a previous version (without reverting to it) and figure out what changed by comparing it
to the current version.

Changelogs take you down the last mile


A changelog can build on your automated version history by giving you an even more
detailed record of your work. This is where data analysts record all the changes they make
to the data. Here is another way of looking at it. Version histories record what was done in
a data change for a project, but don't tell us why. Changelogs are super useful for helping
us understand the reasons changes have been made. Changelogs have no set format and
you can even make your entries in a blank document. But if you are using a shared
changelog, it is best to agree with other data analysts on the format of all your log entries.

Typically, a changelog records this type of information:  

 Data, file, formula, query, or any other component that changed


 Description of what changed
 Date of the change
 Person who made the change
 Person who approved the change 
 Version number 
 Reason for the change
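
For instance, a single entry in a plain-text changelog might look something like the sketch below. Every detail (date, names, component, and reason) is invented purely for illustration, since, as noted above, changelogs have no set format:

2023-04-12 | Version 1.3 | Changed by: A. Analyst | Approved by: S. Lead
Component: revenue formula in cell D15 of the Q1 tracking spreadsheet
Description: corrected the cell range used by the SUM formula so the subtotal row is excluded
Reason: the subtotal row was being double-counted in the quarterly total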
Let’s say you made a change to a formula in a spreadsheet because you observed it in
another report and you wanted your data to match and be consistent. If you found out
later that the report was actually using the wrong formula, an automated version history
would help you undo the change. But if you also recorded the reason for the change in a
changelog, you could go back to the creators of the report and let them know about the
incorrect formula. If the change happened a while ago, you might not remember who to
follow up with. Fortunately, your changelog would have that information ready for you! By
following up, you would ensure data integrity outside your project. You would also be
showing personal integrity as someone who can be trusted with data. That is the power of
a changelog!

Finally, a changelog is important for when lots of changes to a spreadsheet or query have
been made. Imagine an analyst made four changes and the change they want to undo
is change #2. Instead of clicking the undo feature three times to undo change #2 (and
losing changes #3 and #4), the analyst can undo just change #2 and keep all the other
changes. Now, our example was for just 4 changes, but try to think about how important
that changelog would be if there were hundreds of changes to keep track of.

What also happens IRL (in real life)

A junior analyst probably only needs to know the above with one exception. If an analyst is
making changes to an existing SQL query that is shared across the company, the company
most likely uses what is called a version control system. An example might be a query that
pulls daily revenue to build a dashboard for senior management. 

Here is how a version control system affects a change to a query:

1. A company has official versions of important queries in their version control


system.
2. An analyst makes sure the most up-to-date version of the query is the one they will
change. This is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a code review
and can be done formally or informally. An informal review could be as simple as
asking a senior analyst to take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version of
the query to a repository in the company's version control system. This is called a
code commit. A best practice is to document exactly what the change was and why
it was made in a comments area. Going back to our example of a query that pulls
daily revenue, a comment might be: Updated revenue to include revenue coming
from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be able to access
and use this new query when they sync to the most up-to-date queries stored in
the version control system.
7. If the query has a problem or business needs change, the analyst can undo the
change to the query using the version control system. The analyst can look at a
chronological list of all changes made to the query and who made each change.
Then, after finding their own change, the analyst can revert to the previous
version.
8. The query is back to what it was before the analyst made the change. And everyone
at the company sees this reverted, original query, too.

 What makes for a good changelog?

 How do you decide if a change is significant enough to include in the changelog?
A changelog should capture any of the following changes to the dataset while cleaning:

 Treated missing data


 Changed formatting
 Changed values or cases for data
You have made some of these changes while cleaning data in previous activities. If you
had kept a changelog during those activities, you would have described and categorized
each change. When in doubt about the significance of a change, you should enter it into
the changelog.

Advanced functions for speedy data cleaning


In this reading, you will learn about some advanced functions that can help you speed up the data
cleaning process in spreadsheets. Below is a summary of three functions and what they do:

IMPORTRANGE
 Syntax: =IMPORTRANGE(spreadsheet_url, range_string)
 Menu option: Paste Link (copy the data first)
 Primary use: Imports (pastes) data from one sheet to another and keeps it automatically updated.

QUERY
 Syntax: =QUERY(Sheet and Range, "Select *")
 Menu option: Data > From Other Sources > From Microsoft Query
 Primary use: Enables pseudo SQL (SQL-like) statements or a wizard to import the data.

FILTER
 Syntax: =FILTER(range, condition1, [condition2, ...])
 Menu option: Filter (conditions per column)
 Primary use: Displays only the data that meets the specified conditions.

Keeping data clean and in sync with a source


The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special option in
Microsoft Excel) both allow you to insert data from one sheet to another. Using these on a large amount
of data is more efficient than manual copying and pasting. They also reduce the chance of errors being
introduced by copying and pasting the wrong data. They are also helpful for data cleaning because you
can “cherry pick” the data you want to analyze and leave behind the data that isn’t relevant to your
project. Basically, it is like canceling noise from your data so you can focus on what is most important
to solve your problem. This functionality is also useful for day-to-day data monitoring; with it, you can
build a tracking spreadsheet to share the relevant data with others. The data is synced with the data
source so when the data is updated in the source file, the tracked data is also refreshed.

In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in
the other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the
spreadsheet containing the data the first time you import the data.

The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace
it with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow
access button.

Refer to the Google support page for IMPORTRANGE for the sample usage and syntax.

Example of using IMPORTRANGE


An analyst monitoring a fundraiser needs to track and ensure that matching funds are distributed. They
use IMPORTRANGE to pull all the matching transactions into a spreadsheet containing all of the
individual donations. This enables them to determine which donations eligible for matching funds still
need to be processed. Because the total number of matching transactions increases daily, they simply
need to change the range used by the function to import the most up-to-date data. 

On Tuesday, they use the following to import the donor names and matched amounts:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4001")

On Wednesday, another 500 transactions were processed. They increase the range used by 500 to
easily include the latest transactions when importing the data to the individual donor spreadsheet:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4501")

Note: The above examples are for illustrative purposes only. Don't copy and paste them into your
spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you
have multiple tabs) along with the range of cells in the spreadsheet that you have populated with
data.

Pulling data from other data sources


The QUERY function is also useful when you want to pull data from another spreadsheet. The
QUERY function's SQL-like ability can extract specific data within a spreadsheet. For a large amount
of data, using the QUERY function is faster than filtering data manually. This is especially true when
repeated filtering is required. For example, you could generate a list of all customers who bought your
company’s products in a particular month using manual filtering. But if you also want to figure out
customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the
data for sales during the following month, and then copy those results for the analysis. With the
QUERY function, you can get all the data for both months without a need to change your original
dataset or copy results.

The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the
range of data that you want to query from, and then use the SQL SELECT command to select the
specific columns. You can also add specific criteria after the SELECT statement by including a
WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes!
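
For instance, a minimal sketch of a QUERY formula is shown below. The sheet name (Sales), the range, and the column letters are hypothetical and not tied to any dataset used in this course:

=QUERY(Sales!A1:D100, "SELECT A, B WHERE D > 100", 1)

Here, columns are referred to by their letters, the SQL-like statement sits inside the quotes, and the final 1 tells the function that the range includes one header row.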

Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets
use a query wizard to guide you through the steps to connect to a data source and select the tables. In
either case, you can be sure that the data you import is verified and clean, based on the criteria in
the query.

Examples of using QUERY


Check out the Google support page for the QUERY function with sample usage, syntax, and examples
you can download in a Google sheet.

Link to make a copy of the sheet: QUERY examples

Real life solution


Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY
function to create multiple tabs (views) of that dataset. For example, one tab could contain all the sales
data for a particular month and another tab could contain all the sales data from a specific region. This
solution illustrates how SQL and spreadsheets are used well together.

Filtering data to get what you want


The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a query
language. The FILTER function lets you view only the rows (or columns) in the source data that meet
your specified conditions. It makes it possible to pre-filter data before you analyze it.
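
For instance, a minimal sketch of a FILTER formula, using hypothetical ranges:

=FILTER(A2:C100, C2:C100 > 100)

This returns only the rows in A2:C100 where the value in column C is greater than 100; additional conditions can be added as extra arguments.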

The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY
function can be combined with other functions for more complex calculations. For example, the
QUERY function can be used with other functions like SUM and COUNT to summarize data, but the
FILTER function can't.

Example of using FILTER


Check out the Google support page for the FILTER function with sample usage, syntax, and examples
you can download in a Google sheet.
Link to make a copy of the sheet: FILTER examples

The skills section on your resume likely only has room for 2-4 bullet points, so be sure to
use this space effectively. You might want to prioritize technical skills over soft skills. This
is a great chance for you to highlight some of the skills you’ve picked up in these courses,
such as:

 Strong analytical skills


 Pattern recognition
 Relational databases and SQL
 Strong data visualization skills
 Proficiency with spreadsheets, SQL, R, and Tableau
Notice how the skills listed above communicate a well-rounded data analyst’s skill set
without being wordy. The skills section summarizes what you’re capable of doing while
listing the technology and tools you are proficient in.

Many companies use algorithms to screen and filter resumes for keywords. If your resume
does not contain the keywords they are searching for, a human may never even read your
resume. Reserving at least one bullet point to list specific programs you are familiar with is
a great way to make sure your resume makes it past automated keyword screenings and
onto the desk of a recruiter or hiring manager.
