Module 2 Clean Data For More Accurate Insights

Every data analyst wants to analyze clean data.

In this part of the course, you’ll learn the difference


between clean and dirty data. Then, you’ll practice cleaning data in spreadsheets and other tools.

A. Data cleaning is a must


1. Clean it up!
Can you guess what inaccurate or bad data costs businesses every year? Thousands of
dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor quality data is
$3.1 trillion in the US alone. That's a lot of zeros. Now, can you guess the number 1 cause of
poor quality data? It's not a new system implementation or a computer technical glitch. The
most common factor is actually human error. Here's a spreadsheet from a law office. It
shows customers, the legal services they bought, the service order number, how much they
paid, and the payment method. Dirty data can be the result of someone typing in a piece of
data incorrectly. Inconsistent formatting, blank fields, or the same piece of data being entered
more than once, which causes duplicates. Dirty data is data that's incomplete, incorrect, or
irrelevant to the problem you're trying to solve. When you work with dirty data, you can't be
sure that your results are correct. In fact, you can pretty much bet they won't be. Earlier, you
learned that data integrity is critical to reliable data analytics results, and clean data helps you
achieve data integrity. Clean data is data that's complete, correct, and relevant to the problem
you're trying to solve. When you work with clean data, you'll find that your projects go much
more smoothly. I remember the first time I witnessed firsthand how important clean data
really is. I had just started using SQL and I thought it worked like magic. I could have the
computer sum up millions of numbers, saving me tons of time and effort. But I quickly
discovered that only works when the data is clean. If there was even one accidental letter in a
column that should only have numbers, the computer wouldn't know what to do. So it would
throw an error, and suddenly I was stuck; there was no way I could add up millions of
numbers by myself. So I had to clean up that data to make it work. The good news is that
there's plenty of effective processes and tools to help you do that. Coming up, you'll gain the
skills and knowledge you need to make sure the data you work with is always clean. Along
the way, we'll dig deeper into the difference between clean and dirty data and why clean data
is so important. We'll also talk about different ways to clean your data and common
problems to look for during the process. Ready to start? Let's do it.
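The stray-letter problem from that story can be made concrete with a short Python sketch (a hypothetical illustration with made-up values, not part of the course's examples). One accidental letter in a numbers-only column breaks the sum, while flagging bad values first keeps the analysis moving:

```python
# Hypothetical example: summing a column that should contain only numbers.
raw_column = ["1200", "3450", "78o0", "515"]  # "78o0" has a letter o instead of a zero

def naive_sum(values):
    # A naive sum fails the moment it hits the bad value.
    try:
        return sum(int(v) for v in values)
    except ValueError:
        return None  # the "error" the computer throws

def clean_and_sum(values):
    # Cleaning first: separate valid numbers from entries that need review.
    good, bad = [], []
    for v in values:
        (good if v.isdigit() else bad).append(v)
    return sum(int(v) for v in good), bad

total, flagged = clean_and_sum(raw_column)
# naive_sum(raw_column) is None; total is 5165 with "78o0" flagged for review
```

The point of the sketch is the order of operations: find and fix (or set aside) the dirty values before asking the computer to aggregate them.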

2. Why data cleaning is critical


Clean data is incredibly important for effective analysis. If a piece of data is entered into a
spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data
formats are inconsistent, the result is dirty data. Small mistakes can lead to big consequences
in the long run. I'll be completely honest with you, data cleaning is like brushing your teeth.
It's something you should do and do properly because otherwise it can cause serious
problems. For teeth, that might be cavities or gum disease. For data, that might be costing
your company money, or an angry boss. But here's the good news. If you keep brushing
twice a day, every day, it becomes a habit. Soon, you don't even have to think about it. It's
the same with data. Trust me, it will make you look great when you take the time to clean up
that dirty data. As a quick refresher, dirty data is incomplete, incorrect, or irrelevant to the
problem you're trying to solve. It can't be used in a meaningful way, which makes analysis
very difficult, if not impossible. On the other hand, clean data is complete, correct, and
relevant to the problem you're trying to solve. This allows you to understand and analyze
information and identify important patterns, connect related information, and draw useful
conclusions. Then you can apply what you learn to make effective decisions. In some cases,
you won't have to do a lot of work to clean data. For example, when you use internal data
that's been verified and cared for by your company's data engineers and data warehouse
team, it's more likely to be clean. Let's talk about some people you'll work with as a data
analyst. Data engineers transform data into a useful format for analysis and give it a reliable
infrastructure. This means they develop, maintain, and test databases, data processors and
related systems. Data warehousing specialists develop processes and procedures to
effectively store and organize data. They make sure that data is available, secure, and backed
up to prevent loss. When you become a data analyst, you can learn a lot by working with the
person who maintains your databases to learn about their systems. If data passes through the
hands of a data engineer or a data warehousing specialist first, you know you're off to a good
start on your project. There's a lot of great career opportunities as a data engineer or a data
warehousing specialist. If this kind of work sounds interesting to you, maybe your career
path will involve helping organizations save lots of time, effort, and money by making sure
their data is sparkling clean. But even if you go in a different direction with your data
analytics career and have the advantage of working with data engineers and warehousing
specialists, you're still likely to have to clean your own data. It's important to remember: no
dataset is perfect. It's always a good idea to examine and clean data before beginning
analysis. Here's an example. Let's say you're working on a project where you need to figure
out how many people use your company's software program. You have a spreadsheet that
was created internally and verified by a data engineer and a data warehousing specialist.
Check out the column labeled "Username." It might seem logical that you can just scroll
down and count the rows to figure out how many users you have.
But that won't work because one person sometimes has more than one username.
Maybe they registered from different email addresses, or maybe they have a work and
personal account. In situations like this, you would need to clean the data by eliminating any
rows that are duplicates.
Once you've done that, there won't be any more duplicate entries. Then your spreadsheet is
ready to be put to work. So far we've discussed working with internal data. But data cleaning
becomes even more important when working with external data, especially if it comes from
multiple sources. Let's say the software company from our example surveyed its customers
to learn how satisfied they are with its software product. But when you review the survey
data, you find that you have several nulls.
A null is an indication that a value does not exist in a data set. Note that it's not the same as a
zero. In the case of a survey, a null would mean the customers skipped that question. A zero
would mean they provided zero as their response. To do your analysis, you would first need
to clean this data. Step one would be to decide what to do with those nulls. You could either
filter them out and communicate that you now have a smaller sample size, or you can keep
them in and learn from the fact that the customers did not provide responses. There's lots of
reasons why this could have happened. Maybe your survey questions weren't written as well
as they could be. Maybe they were confusing or biased, something we learned about earlier.
We've touched on the basics of cleaning internal and external data, but there's lots more to
come. Soon, we'll learn about the common errors to be aware of to ensure your data is
complete, correct, and relevant. See you soon!
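The null-versus-zero distinction, and the two options for handling nulls, can be sketched in a few lines of Python (hypothetical survey values, not data from the course):

```python
# Hypothetical survey responses: None marks a skipped question (a null),
# while 0 is an actual answer of zero. They are not the same thing.
responses = [4, None, 5, 0, 3, None, 2]

answered = [r for r in responses if r is not None]

# Option 1: filter out the nulls and report the smaller sample size.
sample_size = len(answered)            # 5 responses, not 7
average = sum(answered) / sample_size

# Option 2: keep the nulls and learn from them.
skip_rate = responses.count(None) / len(responses)  # share of skipped questions
```

Note that if the nulls were carelessly treated as zeros, the average would be dragged down and the skip rate would be invisible, which is exactly the kind of silent error data cleaning is meant to prevent.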

3. Angie: I love cleaning data


I am Angie. I'm a program manager of engineering at Google. I truly believe that cleaning
data is the heart and soul of data. It's how you get to know your data: its quirks, its flaws, its
mysteries. I love a good mystery. I remember one time I found somebody had purchased, I
think it was one million dollars worth of chicken sandwiches in one transaction. This
mystery drove me nuts. I had all these questions. Could this have really happened? Or maybe
it was a really big birthday party? How did they make a million dollars worth of chicken
sandwiches? I was cleaning my data and trying to figure out where did it go wrong. We
ended up finding out that we'd been squaring and multiplying all of our transactions for a
very specific case. It took us about three days to figure this out. I will never forget the
moment when it was like, aha! We got to the bottom of it. The result is our data was cleaned,
and we had this great dataset that we could use for analysis. But what I loved was just the
mystery of it and getting to know all these weird intricacies about my dataset. It felt like a
superpower almost, like I was a detective, and I had gone in there and I had really solved
something. I love cleaning data!

4. What is dirty data?

Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem
you are trying to solve. This reading summarizes:

 Types of dirty data you may encounter
 What may have caused the data to become dirty
 How dirty data is harmful to businesses

Types of dirty data

Duplicate data

 Description: Any data record that shows up more than once
 Possible causes: Manual data entry, batch data imports, or data migration
 Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval

Outdated data

 Description: Any data that is old and should be replaced with newer and more accurate information
 Possible causes: People changing roles or companies, or software and systems becoming obsolete
 Potential harm to businesses: Inaccurate insights, decision-making, and analytics

Incomplete data

 Description: Any data that is missing important fields
 Possible causes: Improper data collection or incorrect data entry
 Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services

Incorrect/inaccurate data

 Description: Any data that is complete but inaccurate
 Possible causes: Human error introduced during data input, fake information, or mock data
 Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss

Inconsistent data

 Description: Any data that uses different formats to represent the same thing
 Possible causes: Data stored incorrectly or errors inserted during data transfer
 Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred
browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain
industries from a previous search:

 Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
 Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
 Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
 Healthcare: Duplicate records can be 10% and even up to 20% of a hospital’s electronic health records
(source).
Key takeaways

Dirty data includes duplicate data, outdated data, incomplete data, incorrect or inaccurate data, and
inconsistent data. Each type of dirty data can have a significant impact on analyses, leading to
inaccurate insights, poor decision-making, and revenue loss. There are a number of causes of dirty data,
including manual data entry errors, batch data imports, data migration, software obsolescence,
improper data collection, and human errors during data input. As a data professional, you can take steps
to mitigate the impact of dirty data by implementing effective data quality processes.
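As a small illustration of what such a process might check for, here is a hypothetical Python sketch (the records and helper functions are invented for this example) that flags two of the dirty-data types above, duplicates and incomplete records, in a list of rows:

```python
# Hypothetical sketch: flagging duplicate and incomplete records in a
# small list of customer rows (dicts standing in for spreadsheet rows).
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Ben", "email": ""},                 # incomplete: blank field
    {"id": 3, "name": "Ada", "email": "ada@example.com"},  # duplicate of record 1
]

def find_duplicates(rows, keys):
    # A row is a duplicate if the values in `keys` have been seen before.
    seen, dupes = set(), []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature in seen:
            dupes.append(row)
        seen.add(signature)
    return dupes

def find_incomplete(rows, required):
    # A row is incomplete if any required field is empty or missing.
    return [row for row in rows if any(not row.get(k) for k in required)]

dupes = find_duplicates(records, keys=["name", "email"])
incomplete = find_incomplete(records, required=["name", "email"])
```

Real data quality tooling does far more than this, but the principle is the same: define what "clean" means for each field, then check every record against that definition.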

5. Recognize and remedy dirty data


Hey, there. In this video, we'll focus on common issues associated with dirty data. These
include spelling and other text errors; inconsistent labels, formats, and field lengths; missing
data; and duplicates. This will help you recognize problems more quickly and give you the
information you need to fix them when you encounter something similar during your own
analysis. This is incredibly important in data analytics. Let's go back to our law office
spreadsheet. As a quick refresher, we'll start by checking out the different types of dirty data
it shows. Sometimes, someone might key in a piece of data incorrectly. Other times, they
might not keep data formats consistent.
It's also common to leave a field blank.
That's also called a null, which we learned about earlier. If someone adds the same piece of
data more than once, that creates a duplicate.
Let's break that down. Then we'll learn about a few other types of dirty data and strategies
for cleaning it. Misspellings, spelling variations, mixed up letters, inconsistent punctuation,
and typos in general, happen when someone types in a piece of data incorrectly. As a data
analyst, you'll also deal with different currencies. For example, one dataset could be in US
dollars and another in euros, and you don't want to get them mixed up. We want to find these
types of errors and fix them like this.
You'll learn more about this soon. Clean data depends largely on the data integrity rules that
an organization follows, such as spelling and punctuation guidelines. For example, a
beverage company might ask everyone working in its database to enter data about volume in
fluid ounces instead of cups. It's great when an organization has rules like this in place. It
really helps minimize the amount of data cleaning required, but it can't eliminate it
completely. Like we discussed earlier, there's always the possibility of human error. The next
type of dirty data our spreadsheet shows is inconsistent formatting. In this example,
something that should be formatted as currency is shown as a percentage. Until this error is
fixed, like this, the law office will have no idea how much money this customer paid for its
services. We'll learn about different ways to solve this and many other problems soon. We
discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data
requires a little more work than just fixing a spelling error or changing a format. In this
example, the data analysts would need to research which customer had a consultation on July
4th, 2020. Then when they find the correct information, they'd have to add it to the
spreadsheet.
Another common type of dirty data is a duplicate.
Maybe two different people added this appointment on August 13th, not realizing that
someone else had already done it or maybe the person entering the data hit copy and paste by
accident. Whatever the reason, it's the data analyst job to identify this error and correct it by
deleting one of the duplicates.
Now, let's continue on to some other types of dirty data. The first has to do with labeling. To
understand labeling, imagine trying to get a computer to correctly identify panda bears
among images of all different kinds of animals. You need to show the computer thousands of
images of panda bears. They're all labeled as panda bears. Any incorrectly labeled picture,
like the one here that's labeled just "bear," will cause a problem. The next type of dirty data is having
an inconsistent field length. You learned earlier that a field is a single piece of information
from a row or column of a spreadsheet. Field length is a tool for determining how many
characters can be keyed into a field. Assigning a certain length to the fields in your
spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's
birth year, you know the field length is four because all years are four digits long. Some
spreadsheet applications have a simple way to specify field lengths and make sure users can
only enter a certain number of characters into a field. This is part of data validation. Data
validation is a tool for checking the accuracy and quality of data before adding or importing
it. Data validation is a form of data cleansing, which you'll learn more about soon. But first,
you'll get familiar with more techniques for cleaning data. This is a very important part of the
data analyst job. I look forward to sharing these data cleaning strategies with you.
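The field-length validation described for the birth-year column can be sketched in a couple of lines of Python (a hypothetical stand-in for a spreadsheet's data validation rules):

```python
# Hypothetical data-validation sketch: enforce a fixed field length,
# like a four-digit birth year, before a value is accepted.
def validate_birth_year(value, length=4):
    """Accept only strings of exactly `length` digit characters."""
    return len(value) == length and value.isdigit()

# "1984" passes; "84" (too short), "19844" (too long),
# and "198A" (non-digit character) are all rejected.
```

Spreadsheet applications implement the same idea through their data validation settings; the check runs at entry time, so bad values never make it into the dataset in the first place.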

B. First steps toward clean data


1. Data-cleaning tools and techniques
Hi. Now that you're familiar with some of the most common types of dirty data, it's time to
clean them up. As you've learned, clean data is essential to data integrity and reliable
solutions and decisions. The good news is that spreadsheets have all kinds of tools you can
use to get your data ready for analysis. The techniques for data cleaning will be different
depending on the specific data set you're working with. So we won't cover everything you
might run into, but this will give you a great starting point for fixing the types of dirty data
analysts find most often. Think of everything that's coming up as a teaser trailer of data
cleaning tools. I'm going to give you a basic overview of some common tools and
techniques, and then we'll practice them again later on. Here, we'll discuss how to remove
unwanted data, clean up text to remove extra spaces and blanks, fix typos, and make
formatting consistent. However, before removing unwanted data, it's always a good practice
to make a copy of the data set. That way, if you remove something that you end up needing
in the future, you can easily access it and put it back in the data set. Once that's done, then
you can move on to getting rid of the duplicates or data that isn't relevant to the problem
you're trying to solve. Typically, duplicates appear when you're combining data sets from
more than one source or using data from multiple departments within the same business.
You've already learned a bit about duplicates, but let's practice removing them once more
now using this spreadsheet, which lists members of a professional logistics association.
Duplicates can be a big problem for data analysts. So it's really important that you can find
and remove them before any analysis starts. Here's an example of what I'm talking about.
Let's say this association has duplicates of one person's $500 membership in its database.
When the data is summarized, the analyst would think there was $1,000 being paid by this
member and would make decisions based on that incorrect data. But in reality, this member
only paid $500. These problems can be fixed manually, but most spreadsheet applications
also offer lots of tools to help you find and remove duplicates.
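Here's a hypothetical Python sketch of that double-counted membership payment (invented values, standing in for the spreadsheet), showing how deduplicating on a key field restores the correct total:

```python
# Hypothetical sketch of a duplicated $500 membership payment.
payments = [
    {"member_id": "M-104", "dues": 500},
    {"member_id": "M-104", "dues": 500},  # accidental duplicate entry
    {"member_id": "M-207", "dues": 500},
]

total_before = sum(p["dues"] for p in payments)  # M-104 is counted twice

# Deduplicate on member_id, keeping the first occurrence, similar to
# what a spreadsheet's remove-duplicates tool does.
seen, unique = set(), []
for p in payments:
    if p["member_id"] not in seen:
        seen.add(p["member_id"])
        unique.append(p)

total_after = sum(p["dues"] for p in unique)
```

Before cleaning, the summary shows $1,500 collected across what looks like three payments; after deduplication, the true figure of $1,000 across two members emerges.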
Now, irrelevant data, which is data that doesn't fit the specific problem that you're trying to
solve, also needs to be removed. Going back to our association membership list example,
let's say a data analyst was working on a project that focused only on current members. They
wouldn't want to include information on people who are no longer members,
or who never joined in the first place.
Removing irrelevant data takes a little more time and effort because you have to figure out
the difference between the data you need and the data you don't. But believe me, making
those decisions will save you a ton of effort down the road.
The next step is removing extra spaces and blanks. Extra spaces can cause unexpected results
when you sort, filter, or search through your data. And because these characters are easy to
miss, they can lead to unexpected and confusing results. For example, if there's an extra
space in a member ID number, when you sort the column from lowest to highest, this
row will be out of place.
To remove these unwanted spaces or blank cells, you can delete them yourself.
Or again, you can rely on your spreadsheets, which offer lots of great functions for removing
spaces or blanks automatically. The next data cleaning step involves fixing misspellings,
inconsistent capitalization, incorrect punctuation, and other typos. These types of errors can
lead to some big problems. Let's say you have a database of emails that you use to keep in
touch with your customers. If some emails have misspellings, a period in the wrong place, or
any other kind of typo, not only do you run the risk of sending an email to the wrong people,
you also run the risk of spamming random people. Think about our association membership
example again. A misspelling might cause the data analyst to miscount the number of
professional members if they sorted this membership type
and then counted the number of rows.
Like the other problems you've come across, you can also fix these problems manually.
Or you can use spreadsheet tools, such as spellcheck, autocorrect, and conditional formatting
to make your life easier. There's also easy ways to convert text to lowercase, uppercase, or
proper case, which is one of the things we'll check out again later. All right, we're getting
there. The next step is removing formatting. This is particularly important when you get data
from lots of different sources. Every database has its own formatting, which can cause the
data to seem inconsistent. Creating a clean and consistent visual appearance for your
spreadsheets will help make it a valuable tool for you and your team when making key
decisions. Most spreadsheet applications also have a "clear formats" tool, which is a great
time saver. Cleaning data is an essential step in increasing the quality of your data. Now you
know lots of different ways to do that. In the next video, you'll take that knowledge even
further and learn how to clean up data that's come from more than one source.
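The whitespace-trimming and case-conversion steps described above can be sketched in plain Python (hypothetical names, standing in for spreadsheet TRIM and proper-case tools):

```python
# Hypothetical text-cleanup sketch: strip extra spaces and normalize case,
# the way a spreadsheet's TRIM and PROPER functions would.
raw_names = ["  alice SMITH", "BOB jones ", "carol\tWHITE"]

def clean_name(name):
    # split() with no arguments drops leading, trailing, and repeated
    # whitespace (including tabs); title() applies proper case.
    return " ".join(name.split()).title()

cleaned = [clean_name(n) for n in raw_names]
# ["Alice Smith", "Bob Jones", "Carol White"]
```

Once names are normalized like this, sorting, filtering, and duplicate detection all behave predictably, because "  alice SMITH" and "Alice Smith" no longer look like two different people.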

2. Clean data from multiple sources


Welcome back. So far you've learned a lot about dirty data and how to clean up the most
common errors in a dataset. Now we're going to take that a step further and talk about
cleaning up multiple datasets. Cleaning data that comes from two or more sources is very
common for data analysts, but it does come with some interesting challenges. A good
example is a merger, which is an agreement that unites two organizations into a single new
one. In the logistics field, there have been lots of big changes recently, mostly because of the e-commerce boom. With so many people shopping online, it makes sense that the companies
responsible for delivering those products to their homes are in the middle of a big shake-up.
When big things happen in an industry, it's common for two organizations to team up and
become stronger through a merger. Let's talk about how that will affect our logistics
association. As a quick reminder, this spreadsheet lists association member ID numbers, first
and last names, addresses, how much each member pays in dues, when the membership
expires, and the membership types. Now, let's think about what would happen if the
International Logistics Association decided to get together with the Global Logistics
Association in order to help their members handle the incredible demands of e-commerce.
First, all the data from each organization would need to be combined using data merging.
Data merging is the process of combining two or more datasets into a single dataset. This
presents a unique challenge because when two totally different datasets are combined, the
information is almost guaranteed to be inconsistent and misaligned. For example, the Global
Logistics Association's spreadsheet has a separate column for a person's suite, apartment, or
unit number, but the International Logistics Association combines that information with their
street address. This needs to be corrected to make the number of address columns consistent.
Next, check out how the Global Logistics Association uses people's email addresses as their
member ID, while the International Logistics Association uses numbers. This is a big
problem because people in a certain industry, such as logistics, typically join multiple
professional associations. There's a very good chance that these datasets include membership
information on the exact same person, just in different ways. It's super important to remove
those duplicates. Also, the Global Logistics Association has many more member types than
the other organization.
On top of that, it uses a term, "Young Professional" instead of "Student Associate."
But both describe members who are still in school or just starting their careers. If you were
merging these two datasets, you'd need to work with your team to fix the fact that the two
associations describe memberships very differently. Now you understand why the merging
of organizations also requires the merging of data, and that can be tricky. But there's lots of
other reasons why data analysts merge datasets. For example, in one of my past jobs, I
merged a lot of data from multiple sources to get insights about our customers' purchases.
The kinds of insights I gained helped me identify customer buying patterns. When merging
datasets, I always begin by asking myself some key questions to help me avoid redundancy
and to confirm that the datasets are compatible. In data analytics, compatibility describes
how well two or more datasets are able to work together. The first question I would ask is, do
I have all the data I need? To gather customer purchase insights, I wanted to make sure I had
data on customers, their purchases, and where they shopped. Next I would ask, does the data
I need exist within these datasets? As you learned earlier in this program, this involves
considering the entire dataset analytically. Looking through the data before I start using it
lets me get a feel for what it's all about, what the schema looks like, if it's relevant to my
customer purchase insights, and if it's clean data. That brings me to the next question. Do the
datasets need to be cleaned, or are they ready for me to use? Because I'm working with more
than one source, I will also ask myself, are the datasets cleaned to the same standard? For
example, what fields are regularly repeated? How are missing values handled? How recently
was the data updated? Finding the answers to these questions and understanding if I need to
fix any problems at the start of a project is a very important step in data merging. In both of
the examples we explored here, data analysts could use either the spreadsheet tools or SQL
queries to clean up, merge, and prepare the datasets for analysis. Depending on the tool you
decide to use, the cleanup process can be simple or very complex. Soon, you'll learn how to
make the best choice for your situation. As a final note, programming languages like R are
also very useful for cleaning data. You'll learn more about how to use R and other concepts
we covered soon.
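The merging scenario can be sketched in Python (all names, IDs, and the TYPE_MAP are hypothetical inventions for this example, not the associations' real data): two membership lists that identify people differently and label membership types differently are normalized to one standard before being combined, which also removes the cross-list duplicate.

```python
# Hypothetical data-merging sketch: one association keys members by email,
# the other by a numeric ID, and they name membership types differently.
global_assoc = [
    {"member": "sam@example.com", "type": "Young Professional"},
]
intl_assoc = [
    {"member": "1042", "email": "sam@example.com", "type": "Student Associate"},
]

# Map equivalent membership types onto a single agreed-upon label.
TYPE_MAP = {"Young Professional": "Student Associate"}

def normalize(row):
    # Use email as the common key; fall back to the "member" field,
    # which holds the email in the first dataset.
    email = row.get("email", row["member"])
    mtype = TYPE_MAP.get(row["type"], row["type"])
    return {"email": email, "type": mtype}

merged = {}
for row in global_assoc + intl_assoc:
    r = normalize(row)
    merged[r["email"]] = r  # same email -> one record; duplicates collapse
```

After normalization, both rows describe the same person with the same membership label, so the merged dataset holds a single record instead of a hidden duplicate. The same normalize-then-merge pattern applies whether the work is done in a spreadsheet, SQL, or R.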

3. Common data-cleaning pitfalls

In this reading, you will learn the importance of data cleaning and how to identify common mistakes.
Common mistakes to avoid

Some of the errors you might come across while cleaning your data could include:

 Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the
time the wrong spelling or common grammatical errors can be detected, but it gets harder with things
like names or addresses. For example, if you are working with a spreadsheet table of customer data,
you might come across a customer named “John” whose name has been input incorrectly as “Jon” in
some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for
spelling errors and catch this, your analysis will have mistakes in it.
 Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you
avoid those errors in the future by showing you how you resolved them. For example, you might find
an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns
haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your
formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you
keep track of changes in your work, so that you can backtrack if a fix didn’t work.
 Not checking for misfielded values: A misfielded value happens when the values are entered into the
wrong field. These values might still be formatted correctly, which makes them harder to catch if you
aren’t careful. For example, you might have a dataset with columns for cities and countries. These are
the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of
Spain in the country column, and Spain had mistakenly been entered into the city column, you would
miss key data points. Making sure your data has been entered correctly is key to accurate, complete
analysis.
 Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate
conclusions. For example, if you were trying to get the total number of sales from the last three months,
but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try
to keep your data as clean as possible by maintaining completeness and consistency.
 Only looking at a subset of the data: It is important to think about all of the relevant data when you
are cleaning. This helps make sure you understand the whole story the data is telling, and that you are
paying attention to all possible errors. For example, if you are working with data about bird migration
patterns from different sources, but you only clean one source, you might not realize that some of the
data is being repeated. This will cause problems in your analysis later on. If you want to avoid common
errors like duplicates, each field of your data requires equal attention.
 Losing track of business objectives: When you are cleaning data, you might make new and
interesting discoveries about your dataset, but you don't want those discoveries to distract you from
the task at hand. For example, if you were working with weather data to find the average number of
rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really
interesting, but it isn’t related to the question you are trying to answer right now. Being curious is
great! But try not to let it distract you from the task at hand.
●  Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part
of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing
that same error over and over again. For example, imagine you have a team spreadsheet that tracks
everyone’s progress. The table keeps breaking because different people are entering different values.
You can keep fixing all of these problems one by one, or you can set up your table to streamline data
entry so everyone is on the same page. Addressing the source of the errors in your data will save you a
lot of time in the long run.
●  Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future
errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You
would find the cause of the problem before you started fixing the car, right? The same goes for data.
First, figure out where the errors come from. Maybe they come from data entry mistakes, a missing
spell check, inconsistent formatting, or duplicates. Then, once you understand where bad data comes
from, you can control it and keep your data clean.
●  Not backing up your data prior to data cleaning: It is always good to be proactive and create your
data backup before you start your data clean-up. If your program crashes, or if your changes cause a
problem in your dataset, you can always go back to the saved version and restore it. The simple
procedure of backing up your data can save you hours of work-- and most importantly, a headache.
●  Not accounting for data cleaning in your deadlines/process: All good things take time, and that
includes data cleaning. It is important to keep that in mind when you plan your process and set your
deadlines. Setting aside time for data cleaning helps you give stakeholders more accurate ETAs, and
helps you recognize when you need to request an adjusted ETA.

Key takeaways
Data cleaning is essential for accurate analysis and decision-making. Common mistakes to avoid when
cleaning data include spelling errors, misfielded values, missing values, only looking at a subset of the
data, losing track of business objectives, not fixing the source of the error, not analyzing the system
prior to data cleaning, not backing up your data prior to data cleaning, and not accounting for data
cleaning in your deadlines/process. By avoiding these mistakes, you can ensure that your data is clean
and accurate, leading to better outcomes for your business.

Additional resources

Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid
the most common mistakes:

●  Top ten ways to clean your data: Review an orderly guide to data cleaning in Microsoft Excel.
●  10 Google Workspace tips to clean up data: Learn best practices for data cleaning in Google Sheets.

C. Continue cleaning data in spreadsheets


1. Step-by-Step guide: Data-cleaning features in spreadsheets

This reading outlines the steps the instructor performs in the next video, Data-cleaning features in
spreadsheets. In the video, the instructor explains how to use menu options in spreadsheets to fix errors.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.

What you’ll need

If you’d like to follow along with the examples in this video, choose a spreadsheet tool. Google Sheets
or Excel are recommended.

To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.

Link to logistics data: International Logistics Association Memberships - Data for Cleaning

Link to cosmetics data: Cosmetics Inc. - Data for Cleaning

OR

International Logistics Association Memberships - Data for Cleaning


XLSX File

Cosmetics Inc. - Data for Cleaning


XLSX File
Example 1: Use conditional formatting to highlight blank cells

Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific
conditions.

1. Open the spreadsheet International Logistics Association Memberships - Data for Cleaning.
2. Select the range of cells to which you’ll apply conditional formatting. In this example, you’ll select
columns A through L, except for columns F and H. To select all columns except F and H: a. Select
the column A header to highlight column A. b. Hold down the SHIFT key and select the column E
header. This will highlight all the columns between A and E. c. To select the remainder of the
columns, hold down the CONTROL (Windows) or COMMAND (Mac) key while you select the
headers of columns G, I, J, K, and L. d. Columns A through L in your spreadsheet should now be
highlighted, except for Column F and Column H.
3. From the menu, select Format, then Conditional formatting. The columns you’ve selected should
turn a light shade of green, and a new Conditional format rules tool will appear. Additionally, the
Apply to range field should indicate the cells you’ve selected.
4. Next, apply a condition to these cells to change the cell color if the cell is empty. In the Format cells if
drop-down, select Cell is empty.
5. Select the Formatting style field. Select a bright color from the drop-down to make the blank cells
stand out.
6. Select Done.
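The same check can be scripted when your data lives outside a spreadsheet. Below is a minimal Python sketch of finding blank cells in a single column; the address values and the row-numbering convention are invented for illustration, not part of the course dataset.

```python
# Find blank entries in one column -- the scripted cousin of using
# conditional formatting to highlight blank cells.
# The addresses below are invented for illustration.
addresses = ["12 Oak St", "", "34 Elm Ave", "   ", "78 Pine Rd"]

# Treat whitespace-only cells as blank; report spreadsheet-style row
# numbers, where row 1 is the header and data starts at row 2.
blank_rows = [row for row, cell in enumerate(addresses, start=2)
              if cell.strip() == ""]
print(blank_rows)  # [3, 5]
```

Instead of a color cue, the script gives you the row numbers that need attention.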

Example 2: Remove duplicates

Remove duplicates is a spreadsheet tool that automatically searches for and eliminates duplicate entries
from a spreadsheet. This is faster and easier than searching the data by scrolling through it.

1. Create a copy of your dataset by right clicking the Association ABC membership tab and selecting
Duplicate. This is a good practice, as it protects against accidentally deleting important data. Continue
working in the new sheet, Copy of Association ABC memberships.
2. In the menu, select Data, then Data cleanup, then Remove duplicates.
3. Check the box next to Data has header row.
4. Check the box next to Select All to inspect the entire spreadsheet.
5. Select Remove duplicates.

Example 3: Format dates consistently

Format dates to make all of the data in your spreadsheet consistent. This makes items easier to find and
manipulate.

1. Select column J (Membership valid through), which contains dates.


2. From the menu, select Format, then Number, then Date.

Example 4: Use split to separate data into columns


The split menu option is helpful when you have more than one piece of data in a cell and you want to
separate those pieces of data into different cells.

1. Select column L (Certification).


2. In the menu, select Data, then Split text to columns.
3. The delimiter (for example, a comma) will be automatically detected.
4. If needed, specify the separator manually in the dropdown that appears in your spreadsheet.
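Conceptually, Split text to columns is just a delimiter split. A hedged Python sketch, using a hypothetical comma-separated certification list:

```python
# Split one cell's text into separate "columns" around a delimiter,
# like the Split text to columns menu option. The certifications are
# hypothetical examples.
cell = "PMP, CSCP, CLTD"
columns = [part.strip() for part in cell.split(",")]  # comma is the delimiter
print(columns)  # ['PMP', 'CSCP', 'CLTD']
```

Stripping each piece handles the space that often follows the delimiter, much as the spreadsheet tool does automatically.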

Example 5: Use split to fix numbers stored as text

SPLIT is a spreadsheet function that divides text around a specified character and puts each fragment
into a new, separate cell.

1. Open the spreadsheet Cosmetics Inc. - Data for Cleaning.


2. Notice that cell F12 contains an error.
3. Select column E (Orders).
4. In the menu select Data, then select Split text to columns.
5. This removes the quotation marks from cell E12 so the spreadsheet recognizes the data in the cell as a
number. This resolves the error in cell F12.
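The underlying problem, a number stored as text, can also be sketched in code. The figures below are invented; the point is that the quoted value must be converted before arithmetic works.

```python
# A number stored as text breaks arithmetic; converting it fixes the
# calculation. The figures below are invented for illustration.
cost_per_unit = 7.30
units_sold = '"707"'                  # text: note the quotation marks
cleaned = int(units_sold.strip('"'))  # drop the quotes, then convert
total = cost_per_unit * cleaned       # now the multiplication works
print(total)
```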

3. Data-cleaning features in spreadsheets


Hi again. As you learned earlier, there's a lot of different ways to clean up data. I've shown you
some examples of how you can clean data manually, such as searching for and fixing
misspellings or removing empty spaces and duplicates. We also learned that lots of spreadsheet
applications have tools that help simplify and speed up the data cleaning process. There's a lot of
great efficiency tools that data analysts use all the time, such as conditional formatting,
removing duplicates, formatting dates, fixing text strings and substrings, and splitting text to
columns. We'll explore those in more detail now. The first is something called conditional
formatting. Conditional formatting is a spreadsheet tool that changes how cells appear when
values meet specific conditions. Likewise, it can let you know when a cell does not meet the
conditions you've set. Visual cues like this are very useful for data analysts, especially when
we're working in a large spreadsheet with lots of data. Making certain data points stand out
makes the information easier to understand and analyze. For cleaning data, knowing when the
data doesn't follow the condition is very helpful. Let's return to the logistics association
spreadsheet to check out conditional formatting in action. We'll use conditional formatting to
highlight blank cells. That way, we know where there's missing information so we can add it to
the spreadsheet. To do this, we'll start by selecting the range we want to search. For this example,
we're not focused on Address 3 and Address 5, so we'll include all the columns in our
spreadsheet except for F and H. Next, we'll go to Format and choose Conditional formatting.
Great. Our range is automatically indicated in the field. The format rule will be to format cells if
the cell is empty.
Finally, we'll choose the formatting style. I'm going to pick a shade of bright pink, so my blanks
really stand out.
Then click "Done," and the blank cells are instantly highlighted. The next spreadsheet tool
removes duplicates. As you've learned before, it's always smart to make a copy of the data set
before removing anything. Let's do that now.
Great, now we can continue. You might remember that our example spreadsheet has one
association member listed twice.
To fix that, go to Data and select "Remove duplicates." "Remove duplicates" is a tool that
automatically searches for and eliminates duplicate entries from a spreadsheet. Choose "Data
has header row" because our spreadsheet has a row at the very top that describes the contents of
each column. Next, select "All" because we want to inspect our entire spreadsheet. Finally,
"Remove duplicates."
You'll notice the duplicate row was found and immediately removed.
Another useful spreadsheet tool enables you to make formats consistent. For example, some of
the dates in this spreadsheet are in a standard date format.
This could be confusing if you wanted to analyze when association members joined, how often
they renewed their memberships, or how long they've been with the association. To make all of
our dates consistent, first select column J, then go to "Format," select "Number," then "Date."
Now all of our dates have a consistent format. Before we go over the next tool, I want to explain
what a text string is. In data analytics, a text string is a group of characters within a cell, most
often composed of letters. An important characteristic of a text string is its length, which is the
number of characters in it. You'll learn more about that soon. For now, it's also useful to know
that a substring is a smaller subset of a text string. Now let's talk about Split. Split is a tool that
divides a text string around the specified character and puts each fragment into a new and
separate cell. Split is helpful when you have more than one piece of data in a cell and you want
to separate them out. This might be a person's first and last name listed together, or it could be a
cell that contains someone's city, state, country, and zip code, but you actually want each of
those in its own column. Let's say this association wanted to analyze all of the different
professional certifications its members have earned. To do this, you want each certification
separated out into its own column. Right now, the certifications are separated by a comma.
That's the specified text separating each item, also called the delimiter. Let's get them separated.
Highlight the column, then select "Data," and "Split text to columns."
This spreadsheet application automatically knew that the comma was a delimiter and separated
each certification. But sometimes you might need to specify what the delimiter should be. You
can do that here.
Split text to columns is also helpful for fixing instances of numbers stored as text. Sometimes
values in your spreadsheet will seem like numbers, but they're formatted as text. This can
happen when copying and pasting from one place to another or if the formatting's wrong. For
this example, let's check out our new spreadsheet from a cosmetics maker. If a data analyst
wanted to determine total profits, they could add up everything in column F. But there's a
problem; one of the cells has an error. If you check into it, you learn that the "707" in this cell is
text and can't be changed into a number. When the spreadsheet tries to multiply the cost of the
product by the number of units sold, it's unable to make the calculation. But if we select the
orders column and choose "Split text to columns,"
the error is resolved because now it can be treated as a number. Coming up, you'll learn about a
tool that does just the opposite. CONCATENATE is a function that joins multiple text strings
into a single string. Spreadsheets are a very important part of data analytics. They save data
analysts time and effort and help us eliminate errors each and every day. Here, you've learned
about some of the most common tools that we use. But there's a lot more to come. Next, we'll
learn even more about data cleaning with spreadsheet tools. Bye for now!

4. Step-by-Step: Optimize the data-cleaning process

This reading outlines steps the instructor performs in the following video, Optimize the data-cleaning
process. The video teaches some useful spreadsheet functions, which can make your data-cleaning even
more successful.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.

What you’ll need

If you would like to access the spreadsheet the instructor uses in this video, click the link to the dataset
to create a copy. If you don’t have a Google account, you may download the data directly from the
attachments below.

Link to logistics data: International Logistics Association Memberships - Data for Cleaning

Link to cosmetics data: Cosmetics Inc. - Data for Cleaning

OR

International Logistics Association Memberships - Data for Cleaning


XLSX File

Cosmetics Inc. - Data for Cleaning


XLSX File

Example 1: The COUNTIF function


COUNTIF is a spreadsheet function that returns the number of cells within a range that match a
specified value.

Use COUNTIF to find numbers lower than 100

1. Open the International Logistics Association Memberships - Data for Cleaning dataset, and scroll down
to row 74.
1. Note: The dataset has 72 rows, and row 73 is left blank for separation.
2. In cell H74, enter Member Dues < 100 to label the calculation.
3. In cell I74, enter the formula =COUNTIF(I2:I72,"<100") to count how many members in the cell
range I2:I72 pay dues of less than $100. This formula returns a value of 1, indicating one value is
below $100.
4. In cell I55, change -$200 to $200. Cell I74 should now display the value 0.

Use COUNTIF to find numbers higher than 500

1. In cell H75, enter Member Dues > 500.


2. In cell I75, enter the formula =COUNTIF(I2:I72,">500") to count how many members in the cell
range I2:I72 pay dues of more than $500. This formula returns a value of 1, indicating one value is
above $500.
3. In cell I44, change $1,000 to $100. Cell I75 should now display the value 0.
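The two COUNTIF formulas above are conditional counts, which is easy to mirror in code. A hedged Python sketch with made-up dues values; $100 and $500 are the expected lower and upper bounds from the example:

```python
# COUNTIF-style sanity checks on membership dues.
# The dues values are made up; $100 and $500 are the expected bounds.
dues = [100, 250, -200, 500, 1000]

below_100 = sum(1 for d in dues if d < 100)  # =COUNTIF(I2:I72, "<100")
above_500 = sum(1 for d in dues if d > 500)  # =COUNTIF(I2:I72, ">500")
print(below_100, above_500)  # 1 1
```

A nonzero count is your cue to go find and fix the out-of-range entry, just as in the spreadsheet.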

Example 2: The LEN function

The LEN function is useful if you have a certain piece of information in your spreadsheet that you
know must be a certain length.

1. Right-click the column A header.


2. Select Insert 1 column right to create a new, empty column.
3. Select cell B1 and enter LEN to name the new column.
4. In cell B2, enter =LEN(A2). This function references the value of cell A2 and returns its length, 6.
5. Double-click on the lower right corner of cell B2. This will copy the function through the rest of the
column. Each cell will show the length of the Member ID in that row.
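The LEN check can be mirrored in code: compute each value's length and flag anything that isn't six characters. A hedged Python sketch with hypothetical member IDs:

```python
# LEN-style check: flag member IDs that are not exactly six characters.
# The IDs are hypothetical.
member_ids = ["339871", "527394", "1234567"]
bad_ids = [m for m in member_ids if len(m) != 6]  # =LEN(A2) compared to 6
print(bad_ids)  # ['1234567']
```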

Example 3: Use conditional formatting

Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific
conditions.

1. To highlight all of column B except for the header, select the column B header. Then press
CONTROL (Windows) or COMMAND (Mac) and select cell B1 to deselect the header.
2. Navigate to the Format menu, and choose Conditional Formatting.
3. Set the Format rules field to Is not equal to and enter 6 as the value.
4. Select Done.
5. Notice cell B36 is highlighted because its value is 7.

Example 4: The LEFT and RIGHT functions


LEFT is a function that returns a set number of characters from the left side of a text string. RIGHT is
a function that returns a set number of characters from the right side of a text string.

The LEFT function

1. Use the Cosmetics Inc. - Data for Cleaning dataset.


2. Select cell H1, and enter Left.
3. In cell H2, enter =LEFT(A2, 5) to extract the first five characters from cell A2. This function will
show the substring 51993.
4. Select cell H2.
5. Select and hold the fill handle, the small circle in the corner of a selected cell, then drag this formula
down to populate the rest of this column.

The RIGHT function

1. Select cell I1, and enter Right.


2. In cell I2, enter =RIGHT(A2, 4) to extract the last four characters from cell A2. This function will
show the substring Masc.
3. Select cell I2.
4. Select and hold the fill handle and drag this formula down to populate the rest of this column.
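LEFT and RIGHT correspond to slicing a string from either end. A hedged Python sketch using the product-code format from the example (a five-digit number followed by a four-character identifier):

```python
# LEFT and RIGHT as Python slices on a product code made of a
# five-digit number followed by a four-character identifier.
code = "51993Masc"      # hypothetical product code
left5 = code[:5]        # =LEFT(A2, 5)  -> the numeric part
right4 = code[-4:]      # =RIGHT(A2, 4) -> the text identifier
print(left5, right4)    # 51993 Masc
```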

Example 5: The MID function

MID is a function that returns a segment from the middle of a text string.

1. Select cell J1, and enter Mid.


2. In cell J2, enter =MID(D2, 4, 2) to extract the two-letter state code that starts at character four in cell
D2.
3. Double-click the fill handle and to automatically populate the rest of this column.
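MID is a slice from the middle of a string. One wrinkle: spreadsheet positions are 1-based while Python indices are 0-based, so =MID(D2, 4, 2) becomes the slice [3:5]. A hedged sketch with a hypothetical client code:

```python
# MID as a Python slice: two characters starting at position 4.
# Spreadsheet positions are 1-based, so =MID(D2, 4, 2) becomes [3:5].
client_code = "HouTX101"   # hypothetical: city (3) + state (2) + id (3)
state = client_code[3:5]
print(state)  # TX
```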

Example 6: The CONCATENATE function

CONCATENATE is a spreadsheet function that joins together two or more text strings.

1. Select cell K1, and enter Concatenate.


2. In cell K2, enter =CONCATENATE(H2, I2) to combine the values from columns H and I.
3. Double-click the fill handle and to automatically populate the rest of this column.
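CONCATENATE is plain string joining. A hedged Python sketch; the two values stand in for columns H and I from the LEFT and RIGHT examples:

```python
# CONCATENATE: rejoin two substrings into the full product code.
# The values stand in for columns H and I from the LEFT/RIGHT examples.
left5, right4 = "51993", "Masc"
product_code = left5 + right4   # =CONCATENATE(H2, I2)
print(product_code)  # 51993Masc
```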

Example 7: TRIM function

TRIM is a function that removes leading, trailing, and repeated spaces in data.

1. Select cell L1, and enter Trim.


2. In cell L2, enter =TRIM(C2) to remove any leading, trailing, or repeated spaces.
3. Double-click the fill handle and to automatically populate the rest of this column.
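TRIM collapses leading, trailing, and repeated interior spaces, which in Python is a split-and-rejoin. A hedged sketch with an invented client name:

```python
# TRIM: remove leading, trailing, and repeated interior spaces.
# The client name is invented for illustration.
name = "  Pretty  Little   Things  Inc. "
trimmed = " ".join(name.split())   # =TRIM(C2)
print(trimmed)  # Pretty Little Things Inc.
```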

4. Optimize the data-cleaning process


Welcome back. You've learned about some very useful data-cleaning tools that are built right
into spreadsheet applications. Now we'll explore how functions can optimize your efforts to
ensure data integrity. As a reminder, a function is a set of instructions that performs a specific
calculation using the data in a spreadsheet. The first function we'll discuss is called COUNTIF.
COUNTIF is a function that returns the number of cells that match a specified value. Basically,
it counts the number of times a value appears in a range of cells. Let's go back to our
professional association spreadsheet. In this example, we want to make sure the association
membership prices are listed accurately. We'll use COUNTIF to check for some common
problems, like negative numbers or a value that's much less or much greater than expected. To
start, let's find the least expensive membership: $100 for student associates. That'll be the lowest
number that exists in this column. If any cell has a value that's less than 100, COUNTIF will
alert us. We'll add a few more rows at the bottom of our spreadsheet,
then beneath column H, type "member dues less than $100." Next, type the function in the cell
next to it. Every function has a certain syntax that needs to be followed for it to work. Syntax is
a predetermined structure that includes all required information and its proper placement. The
syntax of a COUNTIF function should be like this: Equals COUNTIF, open parenthesis, range,
comma, the specified value in quotation marks and a closed parenthesis. It will show up like
this.
Where I2 through I72 is the range, and the value is less than 100. This tells the function to go
through column I, and return a count of all cells that contain a number less than 100. Turns out
there is one! Scrolling through our data, we find that one piece of data was mistakenly keyed in
as a negative number. Let's fix that now. Now we'll use COUNTIF to search for any values that
are more than we would expect. The most expensive membership type is $500 for corporate
members. Type the function in the cell.
This time it will appear like this: I2 through I72 is still the range, but the value is greater than
500.
There's one here too. Check it out.
This entry has an extra zero. It should be $100.
The next function we'll discuss is called LEN. LEN is a function that tells you the length of the
text string by counting the number of characters it contains. This is useful when cleaning data if
you have a certain piece of information in your spreadsheet that you know must be a certain
length. For example, this association uses six-digit member identification codes. If we'd just
imported this data and wanted to be sure our codes are all the correct number of digits, we'd use
LEN. The syntax of LEN is equals LEN, open parenthesis, the range, and the close parenthesis.
We'll insert a new column after Member ID.
Then type an equals sign and LEN. Add an open parenthesis. The range is the first Member ID
number in A2. Finish the function by closing the parenthesis. It tells us that there are six
characters in cell A2. Let's continue the function through the entire column and find out if any
results are not six. But instead of manually going through our spreadsheet to search for these
instances, we'll use conditional formatting. We talked about conditional formatting earlier. It's a
spreadsheet tool that changes how cells appear when values meet specific conditions. Let's
practice that now. Select all of column B except for the header. Then go to Format and choose
Conditional formatting. The format rule is to format cells if not equal to six.
Click "Done." The cell with the seven inside is highlighted.
Now we're going to talk about LEFT and RIGHT. LEFT is a function that gives you a set
number of characters from the left side of a text string. RIGHT is a function that gives you a set
number of characters from the right side of a text string. As a quick reminder, a text string is a
group of characters within a cell, commonly composed of letters, numbers, or both. To see these
functions in action, let's go back to the spreadsheet from the cosmetics maker from earlier. This
spreadsheet contains product codes. Each has a five-digit numeric code and then a four-character
text identifier.
But let's say we only want to work with one side or the other. You can use LEFT or RIGHT to
give you the specific set of characters or numbers you need. We'll practice cleaning up our data
using the LEFT function first. The syntax of LEFT is equals LEFT, open parenthesis, the range,
a comma, and a number of characters from the left side of the text string we want. Then, we
finish it with a closed parenthesis. Here, our project requires just the five-digit numeric codes. In
a separate column,
type equals LEFT, open parenthesis, then the range. Our range is A2. Then, add a comma, and
then number 5 for our five-digit product code. Finally, finish the function with a closed
parenthesis. Our function should show up like this. Press "Enter." And now, we have a
substring, which is the number part of the product code only.
Click and drag this function through the entire column to separate out the rest of the product
codes by number only.
Now, let's say our project only needs the four-character text identifier.
For that, we'll use the RIGHT function, and we'll begin it in the next column. The syntax
is equals RIGHT, open parenthesis, the range, a comma and the number of characters we want.
Then, we finish with a closed parenthesis. Let's key that in now. Equals right, open parenthesis,
and the range is still A2. Add a comma. This time, we'll tell it that we want the first four
characters from the right. Close up the parenthesis and press "Enter." Then, drag the function
throughout the entire column.
Now, we can analyze the products in our spreadsheet based on either substring: the five-digit
numeric code or the four-character text identifier. Hopefully, that makes it clear how you can use
LEFT and RIGHT to extract substrings from the left and right sides of a string. Now, let's learn
how you can extract something in between. Here's where we'll use something called MID. MID
is a function that gives you a segment from the middle of a text string. This cosmetics company
lists all of its clients using a client code. It's composed of the first three letters of the city where
the client is located, its state abbreviation, and then a three- digit identifier. But let's say a data
analyst needs to work with just the states in the middle. The syntax for MID is equals MID,
open parenthesis, the range, then a comma. When using MID, you always need to supply a
reference point. In other words, you need to set where the function should start. After that, place
another comma, and how many middle characters you want. In this case, our range is D2. Let's
start the function in a new column.
Type equals MID, open parenthesis, D2. Then the first three characters represent a city name, so
that means the starting point is the fourth. Add a comma and four. We also need to tell the
function how many middle characters we want. Add one more comma, and two, because the
state abbreviations are two characters long. Press "Enter" and bam, we just get the state
abbreviation. Continue the MID function through the rest of the column.
We've learned about a few functions that help separate out specific text strings. But what if we
want to combine them instead? For that, we'll use CONCATENATE, which is a function that
joins together two or more text strings. The syntax is equals CONCATENATE, then an open
parenthesis; inside, list each text string you want to join, separated by commas. Then finish
the function with a closed parenthesis. Just for practice, let's say we needed to rejoin the left and
right text strings back into complete product codes. In a new column, let's begin our function.
Type equals CONCATENATE, then an open parenthesis. The first text string we want to join is
in H2. Then add a comma. The second part is in I2. Add a closed parenthesis and press "Enter".
Drag it down through the entire column,
and just like that, all of our product codes are back together.
The last function we'll learn about here is TRIM. TRIM is a function that removes leading,
trailing, and repeated spaces in data. Sometimes when you import data, your cells have extra
spaces, which can get in the way of your analysis.
For example, if this cosmetics maker wanted to look up a specific client name, it won't show up
in the search if it has extra spaces. You can use TRIM to fix that problem. The syntax for TRIM
is equals TRIM, open parenthesis, your range, and closed parenthesis. In a separate column,
type equals TRIM and an open parenthesis. The range is C2, as you want to check out the client
names. Close the parenthesis and press "Enter". Finally, continue the function down the column.
TRIM fixed the extra spaces.
Now we know some very useful functions that can make your data cleaning even more
successful. This was a lot of information. As always, feel free to go back and review the video
and then practice on your own. We'll continue building on these tools soon, and you'll also have
a chance to practice. Pretty soon, these data cleaning steps will become second nature, like
brushing your teeth.

5. Workflow automation

In this reading, you will learn about workflow automation and how it can help you work faster and
more efficiently. Basically, workflow automation is the process of automating parts of your work. That
could mean creating an event trigger that sends a notification when a system is updated. Or it could
mean automating parts of the data cleaning process. As you can probably imagine, automating different
parts of your work can save you tons of time, increase productivity, and give you more bandwidth to
focus on other important aspects of the job.

What can be automated?

Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of the job
that can’t be automated. Let's take a look at some things we can automate and some things that we
can’t.

Task: Communicating with your team and stakeholders
Can it be automated? No
Why? Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communication.

Task: Presenting your findings
Can it be automated? No
Why? Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can't be automated, for the same reasons that communication can't be automated.

Task: Preparing and cleaning data
Can it be automated? Partially
Why? Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values.

Task: Data exploration
Can it be automated? Partially
Why? Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.

Task: Modeling the data
Can it be automated? Yes
Why? Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages.
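The "Partially" entry for data cleaning mentions a programming script that automatically detects missing values. Here is a minimal Python sketch of that idea; the field names and sample rows are illustrative assumptions, not taken from a course dataset.

```python
def find_missing(rows, required_fields):
    """Report which required fields are blank in each row.

    `rows` is a list of dicts (one per spreadsheet row); a value
    counts as missing if the key is absent or the value is empty.
    """
    report = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        blanks = [f for f in required_fields
                  if not str(row.get(f, "")).strip()]
        if blanks:
            report.append((i, blanks))
    return report

# Illustrative sample data
orders = [
    {"Product": "15143Exfo", "Price": "7.30", "Total": "11680"},
    {"Product": "32729Masc", "Price": "", "Total": "10000"},
    {"Product": "", "Price": "6.50", "Total": "9000"},
]
print(find_missing(orders, ["Product", "Price", "Total"]))
# [(3, ['Price']), (4, ['Product'])]
```

A script like this could run on a schedule so that missing values are flagged before anyone starts an analysis.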

More about automating data cleaning

One of the most important ways you can streamline your data cleaning is to clean data where it lives.
This will benefit your whole team, and it also means you don’t have to repeat the process over and
over. For example, you could create a programming script that counted the number of words in each
spreadsheet file stored in a specific folder. Using tools that can be used where your data is stored means
that you don’t have to repeat your cleaning steps, saving you and your team time and energy.
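The word-counting script described above could be sketched in Python as follows. This assumes the spreadsheet files are saved as .csv; the folder path and file format are assumptions for illustration.

```python
import csv
from pathlib import Path

def word_counts(folder):
    """Count the words in every .csv file in `folder`.

    A sketch of the reading's example: tally the whitespace-separated
    words across all cells of each file in one place, so the count
    doesn't have to be repeated by hand for every spreadsheet.
    """
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="") as f:
            n = sum(len(str(cell).split())
                    for row in csv.reader(f) for cell in row)
        counts[path.name] = n
    return counts
```

Pointing the function at the shared folder where the data lives means every teammate gets the same counts without redoing the steps.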

More resources

There are a lot of tools out there that can help automate your processes, and those tools are improving
all the time. Here are a few articles or blogs you can check out if you want to learn more about
workflow automation and the different tools out there for you to use:

 Towards Data Science’s Automating Scientific Data Analysis
 MIT News’ Automating Big-Data Analysis
 TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software

Key takeaways

As a data analyst, automation can save you a lot of time and energy, and free you up to focus more on
other parts of your project. The more analysis you do, the more ways you will find to make your
processes simpler and more streamlined.

6. Step-by-Step: Different data perspectives

This reading outlines the steps the instructor performs in the next video, Different data perspectives.
The video teaches you different methods data analysts use to view data differently and how looking at
different views leads to more efficient and effective data cleaning.
Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.

What you’ll need

If you’d like to follow along with the examples in this video, choose a spreadsheet tool. Google Sheets
or Excel are recommended.

To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.

Link to template: Cosmetics, Inc.

OR

Cosmetics Inc. - Data for Pivot Table and VLOOKUP


XLSX File

Example 1: Pivot tables

A pivot table is a data summarization tool. It can be used in data processing and in data cleaning, for
which pivot tables offer a quick, clutter-free view of your data. Pivot tables help sort, reorganize,
group, count, total, or average data in a dataset.

1. In the Cosmetics Inc. spreadsheet, select the data you'll include. In this case, select all of the data in
Sheet 1 of the spreadsheet by selecting cell A1 then dragging your cursor to cell F31.
2. Select Insert, then Pivot Table. Choose New sheet and Create. Google Sheets creates a new sheet
where you can define the pivot table.
3. Use the Pivot table editor to add specific data to your pivot table.
a. In the Pivot table editor panel, next to Rows, select Add.

b. From the columns list, select Total.

c. Below Rows, from the Order dropdown list, select Descending to put the most profitable items at
the top.

d. Next to Rows, select Add.

e. From the column list, select Products.

f. Notice that the two products with the highest order totals are 15143Exfo and 32729Masc. The rest of
the orders total less than $10,000.
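If it helps to see what the pivot table is doing, the grouping-and-sorting logic can be sketched in Python. The field names and totals below are illustrative, not the actual Cosmetics Inc. values.

```python
from collections import defaultdict

def pivot_totals(rows):
    """Sum order totals per product and sort descending,
    mirroring the pivot table built in Example 1."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Product"]] += row["Total"]
    # Most profitable products first, like the Descending sort order
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative sample orders
orders = [
    {"Product": "15143Exfo", "Total": 6000.0},
    {"Product": "32729Masc", "Total": 10200.0},
    {"Product": "15143Exfo", "Total": 5500.0},
]
print(pivot_totals(orders))
# [('15143Exfo', 11500.0), ('32729Masc', 10200.0)]
```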

Example 2: VLOOKUP
VLOOKUP is a spreadsheet function that vertically searches for a certain value in a column to return a
corresponding piece of information. It's rare for all of the data an analyst will need to be in the same
place. Usually, you'll have to search across multiple sheets or even different databases. VLOOKUP
helps bring the information together.

In the previous example, you found the product codes of the most ordered products. Now, you’ll use
VLOOKUP to find the names of these products.

1. Select the Sheet 1 tab to navigate to Sheet 1 of the spreadsheet.


2. Select cell H2.
3. Enter =VLOOKUP(A2, 'Sheet 2'!A1:B31, 2, false)
1. Note: This references information in another sheet. Make sure you have Sheet 2 in your workbook.
2. This formula takes the value in cell A2 of Sheet 1 and searches for it in the first column of the range
A1:B31 on Sheet 2. Because the formula includes “false,” it searches only for an exact match. When it
finds one, it returns the value from the 2nd column of that range (which corresponds with the 2 in the
formula), that is, the matching value in column B of Sheet 2.
4. Press Enter to input the formula. The result is LashX Mascara.
5. Next, select the cell and drag the fill handle in the lower-right corner down to populate the other cells
in the sheet with the formula.
6. To find the names of the two most profitable products you identified in the previous example, use
the Find and replace tool.
1. Select Edit > Find and Replace.
2. In the Find text box, enter the product code for the most profitable product, 15143Exfo.
3. Select This sheet from the dropdown list next to Search.
4. Select Find to find any cells in this sheet that contain this product code.
5. Notice that cell A31 is a match. This means the VLOOKUP result in cell H31 contains the name of
the most profitable product: SoSoft Exfoliator.
6. Repeat steps a-d with the product code 32729Masc to find the product name of the second most
profitable product. Cell A27 contains 32729Masc, so the product name is Darkest Lashes Mascara.
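To make the lookup logic concrete, here is a small Python sketch that emulates VLOOKUP's exact-match behavior. The product table is reproduced from the example; the function itself is an illustration, not how the spreadsheet implements it.

```python
def vlookup(value, table, col_index):
    """Emulate VLOOKUP(value, table, col_index, FALSE):
    search the first column of `table` for an exact match on `value`
    and return the cell in the 1-based `col_index` column of that row."""
    for row in table:
        if row[0] == value:
            return row[col_index - 1]
    return None  # a spreadsheet would show #N/A here

# Product codes and names from Sheet 2 of the example
sheet2 = [
    ("15143Exfo", "SoSoft Exfoliator"),
    ("32729Masc", "Darkest Lashes Mascara"),
]
print(vlookup("32729Masc", sheet2, 2))  # Darkest Lashes Mascara
```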

Example 3: Plotting

The plotting tool allows analysts to quickly create a graph, chart, table, or other visual from data.
Plotting is useful for identifying skewed data or outliers.

1. In Sheet 1 of the Cosmetics, Inc. spreadsheet, select column B, which contains the prices.
2. Select Insert > Chart.
1. If the chart created is not a column chart, select Column chart from the dropdown menu under Chart
type in the Chart editor.
2. Select and drag the chart to the right so you can view the data in the sheet.
3. Check for obvious outliers and fix them in the spreadsheet. For example, you might notice that an item
in the middle of the chart has an extremely low price of $0.73. The decimal point is in the wrong place.
In cell B14, correct this price to $7.30, and notice that Google Sheets automatically updates the chart.
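The same check can be approximated in code. This sketch flags prices outside a plausible range; the thresholds are illustrative assumptions, since the point of the chart in the video is to spot the outlier visually.

```python
def flag_outliers(prices, low=1.0, high=20.0):
    """Return (row, price) pairs for prices outside [low, high].
    Rows are numbered from 2 because row 1 holds the header."""
    return [(i, p) for i, p in enumerate(prices, start=2)
            if not (low <= p <= high)]

# Illustrative price column, including the misplaced decimal point
prices = [7.5, 6.25, 0.73, 8.0]
print(flag_outliers(prices))
# [(4, 0.73)]
```

Once a row is flagged, you would correct the source cell (here, $0.73 to $7.30), just as in the spreadsheet example.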

7. Different data perspectives


Hi, let's get into it. Motivational speaker Wayne Dyer once said, "If you change the way you
look at things, the things you look at change." This is so true in data analytics. No two analytics
projects are ever exactly the same, so it only makes sense that different projects require us to
focus on information differently.
In this video, we'll explore different methods that data analysts use to look at data differently
and how that leads to more efficient and effective data cleaning. Some of these methods include
sorting and filtering, pivot tables, a function called VLOOKUP, and plotting to find outliers.
Let's start with sorting and filtering. As you learned earlier, sorting and filtering data helps data
analysts customize and organize information the way they need for a particular project. But
these tools are also very useful for data cleaning. You might remember that sorting involves
arranging data into a meaningful order to make it easier to understand, analyze, and visualize.
For data cleaning, you can use sorting to put things in alphabetical or numerical order, so you
can easily find a piece of data. Sorting can also bring duplicate entries closer together for faster
identification.
Filters, on the other hand, are very useful in data cleaning when you want to find a particular
piece of information. You learned earlier that filtering means showing only the data that meets
specific criteria while hiding the rest. This lets you view only the information you need. When
cleaning data, you might use a filter to find only values above a certain number, or just even or
odd values. Again, this helps you find what you need quickly and separates out the information
you want from the rest. That way you can be more efficient when cleaning your data.
Another way to change the way you view data is by using pivot tables. You've learned that a
pivot table is a data summarization tool that is used in data processing. Pivot tables sort,
reorganize, group, count, total, or average data stored in a database. In data cleaning, pivot
tables are used to give you a quick, clutter-free view of your data. You can choose to look at the
specific parts of the dataset that you need to get a visual in the form of a pivot table.
Let's create one now using our cosmetics maker spreadsheet again. To start, select the data we
want to use. Here, we'll choose the entire spreadsheet. Select "Data" and then "Pivot table."
Choose "New sheet" and "Create."
Let's say we're working on a project that requires us to look at only the most profitable
products: items that earn the cosmetics maker at least $10,000 in orders. So the row we'll
include is "Total" for total profits. We'll sort in descending order to put the most profitable
items at the top, and we'll show totals. Next, we'll add another row for products so that we know
what those numbers are about. We can clearly determine that the most profitable products have
the product codes 15143 E-X-F-O and 32729 M-A-S-C. We can ignore the rest for this
particular project because they fall below $10,000 in orders.
Now, we might be able to use context clues to assume we're talking about exfoliants and
mascaras. But we don't know which ones, or if that assumption is even correct. So we need to
confirm what the product codes correspond to. And this brings us to the next tool. It's called
VLOOKUP.
VLOOKUP stands for vertical lookup. It's a function that searches for a certain value in a
column to return a corresponding piece of information. When data analysts look up information
for a project, it's rare for all of the data they need to be in the same place. Usually, you'll have to
search across multiple sheets or even different databases.
The syntax of VLOOKUP is equals VLOOKUP, open parenthesis, then the data you want to
look up. Next is a comma and where you want to look for that data. In our example, this will be
the name of a sheet followed by an exclamation point. The exclamation point indicates that
we're referencing a cell in a different sheet from the one we're currently working in. Again,
that's very common in data analytics. Okay, next is the range in the place where you're looking
for data, indicated using the first and last cell separated by a colon. After one more comma is
the column in the range containing the value to return. Next, another comma and the word
"false," which means that an exact match is what we're looking for. Finally, complete your
function by closing the parentheses.
To put it simply, VLOOKUP searches for the value in the first argument in the leftmost column
of the specified location. Then the value of the third argument tells VLOOKUP to return the
value in the same row from the specified column. The "false" tells VLOOKUP that we want an
exact match. Soon you'll learn the difference between exact and approximate matches. But for
now, just know that VLOOKUP takes the value in one cell and searches for a match in another
place.
Let's begin. We'll type equals VLOOKUP, then add the data we are looking for, which is the
product data. The dollar sign makes sure that the corresponding part of the reference remains
unchanged. You can lock just the column, just the row, or both at the same time. Next, we'll tell
it to look at Sheet 2, in both columns. We added 2 to represent the second column. The last
term, "false," says we want an exact match.
With this information, we can now analyze the data for only the most profitable products.
Going back to the two most profitable products, we can search for 15143 E-X-F-O and 32729
M-A-S-C. Go to Edit and then Find. Type in the product codes and search for them. Now we
know which products we'll be using for this particular project.
The final tool we'll talk about is something called plotting. When you plot data, you put it in a
graph, chart, table, or other visual to help you quickly see what it looks like. Plotting is very
useful when trying to identify any skewed data or outliers. For example, if we want to make
sure the price of each product is correct, we could create a chart. This would give us a visual aid
that helps us quickly figure out if anything looks like an error.
So let's select the column with our prices. Then we'll go to Insert and choose Chart. Pick a
column chart as the type. One of these prices looks extremely low. If we look into it, we
discover that this item has a decimal point in the wrong place. It should be $7.30, not 73 cents.
That would have a big impact on our total profits, so it's a good thing we caught that during data
cleaning.
Looking at data in new and creative ways helps data analysts identify all kinds of dirty data.
Coming up, you'll continue practicing these new concepts so you can get more comfortable with
them. You'll also learn additional strategies for ensuring your data is clean and provides
effective insights. Great work so far.

8. Step-by-Step: Even more data-cleaning techniques

This reading outlines the steps the instructor performs in the next video, Even more data-cleaning
techniques. This video teaches you different methods data analysts use in data mapping. Data mapping
is the process of matching fields from one database to another. It’s critical to the success of data
migration, data integration, and many other data-management activities. This video contains one
activity for you to practice.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skill demonstrated in the video.

What you’ll need

If you’d like to follow along with the example in this video, choose a spreadsheet tool, such as Google
Sheets or Excel.

To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.

Link to templates:
International Logistics Association memberships

Global Logistics Association

Logistics Association Merger

Downloads:

International Logistics Association memberships


XLSX File

Global Logistics Association


XLSX File

Logistics Association Merger


XLSX File

Example: CONCATENATE

CONCATENATE is a function that joins together two or more text strings. In the video, you’ll learn
how to use CONCATENATE to clean data after two datasets have been combined.

1. Open the dataset spreadsheet titled Global Logistics Association. When prompted, select USE
TEMPLATE.
2. Insert a new column to the right of column E. Label it New Address in cell F1.
3. In the second row of the new column (cell F2), enter =CONCATENATE(D2,E2) and press Enter.
1. You will notice that some results need a space between the street address and the unit or suite number,
such as: 25 Dyas RdSte. 101.
2. You could manually clean the data later to add a space between Rd and Ste., but CONCATENATE
can actually do it for you.
3. The CONCATENATE formula can help you format the data as it is merged by entering an additional
string to insert a space between Rd and Ste.
4. Enter =CONCATENATE(D2, " ", E2) and you will have an address that is formatted like this: 25
Dyas Rd Ste. 101. Much better!
4. Ensure the new data in the cell accurately reflects the merging of the two previous columns.
5. Select cell F2 and drag down to apply the formula to all rows in the column.
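The separator trick in step 3 can be sketched in Python. The helper below mimics =CONCATENATE(D2, " ", E2); note that it also skips empty fields so no stray trailing space is added, which is an extra convenience beyond what the spreadsheet formula does.

```python
def concatenate(*parts, sep=" "):
    """Join text fields with a separator, like
    =CONCATENATE(D2, " ", E2) in the reading.
    Blank fields are dropped so the result has no stray spaces."""
    return sep.join(str(p) for p in parts if str(p).strip())

print(concatenate("25 Dyas Rd", "Ste. 101"))  # 25 Dyas Rd Ste. 101
print(concatenate("40 Elm St", ""))           # 40 Elm St
```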

9. Even more data-cleaning techniques


Hello. So far you've learned about a lot of different tools and functions that analysts use to clean
up data for analysis. Now we'll take a step back and talk about some of the really big-picture
aspects of clean data. Knowing how to fix specific problems, either manually with spreadsheet
tools or with functions, is extremely valuable. But it's also important to think about how your
data has moved between systems and how it's evolved along its journey to your data analysis
project. To do this, data analysts use something called data mapping.
Data mapping is the process of matching fields from one database to another. This is very
important to the success of data migration, data integration, and lots of other data management
activities. As you learned earlier, different systems store data in different ways. For example,
the state field in one spreadsheet might show Maryland spelled out, but another spreadsheet
might store it as MD. Data mapping helps us note these kinds of differences so we know that
when data is moved and combined, it will be compatible. As a quick reminder, compatibility
describes how well two or more datasets are able to work together.
The first step to data mapping is identifying what data needs to be moved. This includes the
tables and the fields within them. We also need to define the desired format for the data once it
reaches its destination. To figure out how this works, let's go back to the merger between our
two logistics associations.
Starting with the first data field, we've identified that we need to move both sets of member IDs.
To define the desired format, we'll choose whether to use numbers like this spreadsheet, or email
addresses like the other spreadsheet. Next comes mapping the data. Depending on the schema
and number of primary and foreign keys in a data source, data mapping can be simple or very
complex. As a reminder, a schema is a way of describing how something is organized. A
primary key references a column in which each value is unique, and a foreign key is a field
within a table that is a primary key in another table.
For more challenging projects, there are all kinds of data mapping software programs you can
use. These data mapping tools will analyze, field by field, how to move data from one place to
another, then automatically clean, match, inspect, and validate the data. They also create
consistent naming conventions, ensuring compatibility when the data is transferred from one
source to another. When selecting a software program to map your data, you want to be sure
that it supports the file types you're working with, such as Excel, SQL, Tableau, and others.
Later on, you'll learn more about selecting the right tool for a particular task.
For now, let's practice mapping data manually. First, we need to determine the content of each
section to make sure the data ends up in the right place. For example, the data on when
memberships expire would be consolidated into a single column. This step makes sure that each
piece of information ends up in the most appropriate place in the merged data source. Now, you
might remember that some of the data was inconsistent between the two organizations, like the
fact that one uses a separate column for suite, apartment, or unit number but the other doesn't.
This brings us to the next step: transforming the data into a consistent format. This is a great
time to use CONCATENATE. As you learned before, CONCATENATE is a function that joins
together two or more text strings, which is what we did earlier with our cosmetics company
example. We'll insert a new column and then type equals CONCATENATE, then the two text
strings we want to make one. Drag that through the entire column. Now we have consistency in
the new merged association list of member addresses.
Now that everything's compatible, it's time to transfer the data to its destination. There are a lot
of different ways to move data from one place to another, including querying, import wizards,
and even simple drag and drop. Here's our merged spreadsheet. It looks good, but we still want
to make sure everything was transferred properly, so we'll go into the testing phase of data
mapping. For this, you inspect a sample piece of data to confirm that it's clean and properly
formatted. It's also a smart practice to do spot checks on things such as the number of nulls. For
the test, you can use a lot of the data cleaning tools we discussed previously, such as data
validation, conditional formatting, COUNTIF, sorting, and filtering.
Finally, once you've determined that the data is clean and compatible, you can start using it for
analysis. Data mapping is so important because even one mistake when merging data can ripple
throughout an organization, causing the same error to appear again and again. This leads to poor
results. On the other hand, data mapping can save the day by giving you a clear road map you
can follow to make sure your data arrives safely at its destination. That's why you learn how to
do it.
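A field-by-field mapping like the Maryland/MD example in the video can be as simple as a lookup dictionary applied while the data moves. This is an illustrative sketch; the mapping entries and field format are assumptions.

```python
# Illustrative mapping from source formats to the destination format
STATE_MAP = {"Maryland": "MD", "Md.": "MD", "MD": "MD"}

def map_state(value):
    """Normalize a state field to the destination format.
    Unknown values pass through unchanged so they can be
    spot-checked during the testing phase."""
    return STATE_MAP.get(value.strip(), value.strip())

rows = ["Maryland", "MD", "Md."]
print([map_state(s) for s in rows])  # ['MD', 'MD', 'MD']
```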

10. Working with .csv files

In an earlier course in this certificate program, you worked with .csv files. Data analysts use .csv files
often, so throughout this course you will continue to use .csv files to transfer data into data analysis
programs for further analysis and visualization. .csv files are plain text files with an organized table
structure that includes rows and columns. The values in each row are separated by commas. This table
structure makes them easy to understand, edit, manipulate, and use for data analysis.
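A quick way to see this structure is to parse a tiny .csv with Python's standard csv module. The sample rows below are made up for illustration.

```python
import csv

# A .csv as described above: plain text, rows separated by newlines,
# values within each row separated by commas.
text = "name,service,total\nAda,Will,350\nNoor,Deed,200\n"

rows = list(csv.reader(text.splitlines()))
header, data = rows[0], rows[1:]
print(header)   # ['name', 'service', 'total']
print(data[0])  # ['Ada', 'Will', '350']
```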

A major advantage of .csv files is their widespread compatibility. They can be imported and exported
by a vast range of data analysis tools and software programs.

Download .csv files

To use .csv files and upload them to data analysis programs you will first need to download them to
your local device. Downloading a .csv file from a website can vary depending on your operating
system or internet browser. Here are some ways you can download a .csv file:

 Click the download link or .csv attachment: Locate the link for the .csv file or attachment on the
website. Click on it, and the download process will start.
 Right-click and Save: Right-click on the data table or element containing the .csv data. Choose Save
as… or a similar option. Name the file and make sure the extension on the file is “.csv”.
 Force download: You can use the Alt key on your keyboard while clicking the link. This will trigger
the download, and you will be able to find the .csv file in your Downloads folder.

Note: When using the Chrome browser or ChromeOS, .csv files may open in a new tab instead of
downloading to your machine. If this happens, follow these instructions:
 Select File from the menu bar, then select Save as Google Sheets. This will open the .csv file as a
Google Sheet.
 Select File from the menu bar, then select Download from the dropdown menu, then select Comma
Separated Values (.csv).

Upload .csv files

You will often need to upload .csv files during the data analysis process. Here is how you do this:

 Locate the upload option: Each data analysis platform will have a designated button, menu option, or
drag and drop area labeled Upload or Import. This is where you will upload your .csv file.
 Choose your .csv file: Click Upload or Import on the platform you are using to open your file
explorer. Select your .csv file. If you just downloaded a .csv file from the web, it will be located in
your computer’s Downloads folder.
 Initiate the upload: Once you've selected your .csv file, click Upload or Import. The platform may
display a progress bar or message indicating that the upload is complete.

Note: Some platforms have restrictions on the file size or format of .csv files. Make sure your .csv files
adhere to these requirements before uploading.

Key takeaways

Data analysis programs help us extract insights and knowledge from data. Using .csv files is essential
in data analysis. Understanding how to easily download data from the web or add your data to these
programs will allow you to complete data cleaning, visualizations, analysis, and so much more!

11. Develop your approach to cleaning data

As you continue on your data journey, you’re likely discovering that data is often messy—and you can
expect raw, primary data to be imperfect. In this reading, you’ll consider how to develop your personal
approach to cleaning data. You will explore the idea of a cleaning checklist, which you can use to
guide your cleaning process. Then, you’ll define your preferred methods for cleaning data. By the time
you complete this reading, you’ll have a better understanding of how to methodically approach the data
cleaning process. This will save you time when cleaning data and help you ensure that your data is
clean and usable.

Consider your approach to cleaning data

Data cleaning usually requires a lot of time, energy, and attention. But there are two steps you can take
before you begin to help streamline your process: creating a cleaning checklist and deciding on your
preferred methods. This will help ensure that you know exactly how you want to approach data
cleaning and what you need to do to be confident in the integrity of your data.

Your cleaning checklist

Start developing your personal approach to cleaning data by creating a checklist to help you identify
problems in your data efficiently and identify the scale and scope of your dataset. Think of this
checklist as your default “what to search for” list.
Here are some examples of common data cleaning tasks you could include in your checklist:

 Determine the size of the dataset: Large datasets may have more data quality issues and take longer
to process. This may impact your choice of data cleaning techniques and how much time to allocate to
the project.
 Determine the number of categories or labels: By understanding the number and nature of categories
and labels in a dataset, you can better understand the diversity of the dataset. This understanding also
helps inform data merging and migration strategies.
 Identify missing data: Recognizing missing data helps you understand data quality so you can take
appropriate steps to remediate the problem. Data integrity is important for accurate and unbiased
analysis.
 Identify unformatted data: Identifying improperly or inconsistently formatted data helps analysts
ensure data uniformity. This is essential for accurate analysis and visualization.
 Explore the different data types: Understanding the types of data in your dataset (for instance,
numerical, categorical, text) helps you select appropriate cleaning methods and apply relevant data
analysis techniques.
There might be other data cleaning tasks you’ve been learning about that you also want to prioritize in
your checklist. Your checklist is an opportunity for you to define exactly what you want to remember
about cleaning your data; feel free to make it your own.
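Several items on a checklist like this can themselves be scripted. Here is a minimal Python sketch that profiles a dataset for size, category labels, missing values, and value types; the field names in the sample are illustrative assumptions.

```python
def profile(rows):
    """Run a few checks from the cleaning checklist:
    dataset size, category labels, missing values, and value types."""
    size = len(rows)
    categories = sorted({r.get("category", "") for r in rows} - {""})
    missing = sum(1 for r in rows for v in r.values() if v in ("", None))
    types = {k: type(v).__name__ for k, v in rows[0].items()} if rows else {}
    return {"size": size, "categories": categories,
            "missing_values": missing, "types": types}

# Illustrative sample dataset
rows = [
    {"category": "mascara", "price": 7.3},
    {"category": "", "price": 6.5},
]
print(profile(rows))
```

Running a profile like this before cleaning gives you the scale and scope information the checklist asks for, so you can plan your techniques and time accordingly.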

Your preferred cleaning methods

In addition to creating a checklist, identify which actions or tools you prefer using when cleaning data.
You’ll use these tools and techniques with each new dataset—or whenever you encounter issues in a
dataset—so this list should be compatible with your checklist.

For example, suppose you have a large dataset with missing data. You’ll want to know how to check
for missing data in larger datasets, and how you plan to handle any missing data, before you start
cleaning. Outlining your preferred methods can save you lots of time and energy.

Key takeaways

The data you encounter as an analyst won’t always conform to your checklist or your preferred actions
and tools. But having these things can make common data cleaning tasks much easier to complete. As
is so often the case, thoughtful planning sets up any project for success!

You might also like