Module 2 Clean Data For More Accurate Insights
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem
you are trying to solve. This reading summarizes the following types of dirty data:
Duplicate data
Outdated data
Incomplete data
Incorrect/inaccurate data
Inconsistent data
For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred
browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain
industries from a previous search:
Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital’s electronic health
records (source).
Key takeaways
Dirty data includes duplicate data, outdated data, incomplete data, incorrect or inaccurate data, and
inconsistent data. Each type of dirty data can have a significant impact on analyses, leading to
inaccurate insights, poor decision-making, and revenue loss. There are a number of causes of dirty data,
including manual data entry errors, batch data imports, data migration, software obsolescence,
improper data collection, and human errors during data input. As a data professional, you can take steps
to mitigate the impact of dirty data by implementing effective data quality processes.
In this reading, you will learn the importance of data cleaning and how to identify common mistakes.
Common mistakes to avoid
Some of the errors you might come across while cleaning your data include the following:
Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the
time the wrong spelling or common grammatical errors can be detected, but it gets harder with things
like names or addresses. For example, if you are working with a spreadsheet table of customer data,
you might come across a customer named “John” whose name has been input incorrectly as “Jon” in
some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for
spelling errors and catch this, your analysis will have mistakes in it.
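A spreadsheet’s spellcheck won’t flag name variants like “John” versus “Jon,” but a quick similarity scan can surface them for review. Here is a minimal sketch in Python, using the standard library’s difflib on a hypothetical customer list (the 0.8 threshold is an assumption for illustration):

```python
import difflib

# Hypothetical customer names; "Jon" is a likely misspelling of "John".
names = ["John", "Jon", "Maria", "Mariah", "Alex"]

# Flag pairs of names that are suspiciously similar (ratio above 0.8)
# so a human can decide whether they refer to the same customer.
suspects = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if difflib.SequenceMatcher(None, a, b).ratio() > 0.8:
            suspects.append((a, b))

print(suspects)  # [('John', 'Jon'), ('Maria', 'Mariah')]
```

Note that a high similarity score only marks a pair as worth checking; “Maria” and “Mariah” may well be two different customers.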
Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you
avoid those errors in the future by showing you how you resolved them. For example, you might find
an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns
haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your
formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you
keep track of changes in your work, so that you can backtrack if a fix didn’t work.
Not checking for misfielded values: A misfielded value happens when the values are entered into the
wrong field. These values might still be formatted correctly, which makes them harder to catch if you
aren’t careful. For example, you might have a dataset with columns for cities and countries. These are
the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of
Spain in the country column, and Spain had mistakenly been entered into the city column, you would
miss key data points. Making sure your data has been entered correctly is key to accurate, complete
analysis.
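One way to catch misfielded values is to check each field against a list of values it should never contain. A minimal Python sketch, using hypothetical city and country rows:

```python
# Hypothetical rows with city and country columns; one row has them swapped.
rows = [
    {"city": "Madrid", "country": "Spain"},
    {"city": "Spain", "country": "Madrid"},   # misfielded entry
    {"city": "Lisbon", "country": "Portugal"},
]

known_countries = {"Spain", "Portugal"}

# Flag rows where a known country name appears in the city field.
misfielded = [r for r in rows if r["city"] in known_countries]
print(misfielded)  # the swapped row
```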
Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate
conclusions. For example, if you were trying to get the total number of sales from the last three months,
but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try
to keep your data as clean as possible by maintaining completeness and consistency.
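Before totaling a column, it helps to count how many values are missing so you know whether the total can be trusted. A small Python sketch with hypothetical weekly sales figures, where None marks a missing week:

```python
# Hypothetical weekly sales figures; None marks a missing week.
weekly_sales = [1200, 950, None, 1100, None, 1300]

# Count missing entries and total only the known values.
missing = sum(1 for v in weekly_sales if v is None)
total_known = sum(v for v in weekly_sales if v is not None)

print(missing, total_known)  # 2 4550
```

A nonzero missing count is the signal to investigate before reporting the total as complete.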
Only looking at a subset of the data: It is important to think about all of the relevant data when you
are cleaning. This helps make sure you understand the whole story the data is telling, and that you are
paying attention to all possible errors. For example, if you are working with data about bird migration
patterns from different sources, but you only clean one source, you might not realize that some of the
data is being repeated. This will cause problems in your analysis later on. If you want to avoid common
errors like duplicates, each field of your data requires equal attention.
Losing track of business objectives: When you are cleaning data, you might make new and
interesting discoveries about your dataset, but you don’t want those discoveries to distract you from
the task at hand. For example, if you were working with weather data to find the average number of
rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really
interesting, but it isn’t related to the question you are trying to answer right now. Being curious is
great! But try not to let it distract you from the task at hand.
Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part
of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing
that same error over and over again. For example, imagine you have a team spreadsheet that tracks
everyone’s progress. The table keeps breaking because different people are entering different values.
You can keep fixing all of these problems one by one, or you can set up your table to streamline data
entry so everyone is on the same page. Addressing the source of the errors in your data will save you a
lot of time in the long run.
Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future
errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You
would find the cause of the problem before you started fixing the car, right? The same goes for data.
First, figure out where the errors come from. Maybe they come from data entry mistakes, a missing
spell check, inconsistent formatting, or duplicates. Once you understand where bad data comes from,
you can control it and keep your data clean.
Not backing up your data prior to data cleaning: It is always good to be proactive and create your
data backup before you start your data clean-up. If your program crashes, or if your changes cause a
problem in your dataset, you can always go back to the saved version and restore it. The simple
procedure of backing up your data can save you hours of work, and most importantly, a headache.
Not accounting for data cleaning in your deadlines/process: All good things take time, and that
includes data cleaning. It is important to keep that in mind when going through your process and
looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate
estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.
Key takeaways
Data cleaning is essential for accurate analysis and decision-making. Common mistakes to avoid when
cleaning data include spelling errors, misfielded values, missing values, only looking at a subset of the
data, losing track of business objectives, not fixing the source of the error, not analyzing the system
prior to data cleaning, not backing up your data prior to data cleaning, and not accounting for data
cleaning in your deadlines/process. By avoiding these mistakes, you can ensure that your data is clean
and accurate, leading to better outcomes for your business.
Additional resources
Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid
the most common mistakes:
Top ten ways to clean your data: Review an orderly guide to data cleaning in Microsoft Excel.
10 Google Workspace tips to clean up data: Learn best practices for data cleaning in Google Sheets.
This reading outlines the steps the instructor performs in the next video, Data-cleaning features in
spreadsheets. In the video, the instructor explains how to use menu options in spreadsheets to fix errors.
Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.
If you’d like to follow along with the examples in this video, choose a spreadsheet tool. Google Sheets
or Excel are recommended.
To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.
Link to logistics data: International Logistics Association Memberships - Data for Cleaning
OR
Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific
conditions.
1. Open the spreadsheet International Logistics Association Memberships - Data for Cleaning.
2. Select the range of cells to which you’ll apply conditional formatting. In this example, you’ll select
columns A through L, except for columns F and H. To select all columns except F and H:
a. Select the column A header to highlight column A.
b. Hold down the SHIFT key and at the same time use your mouse to select the column E header. This
will highlight all the columns between A and E.
c. To select the remaining columns, hold down the CONTROL (Windows) or COMMAND (Mac) key
while you select the column headers for G, I, J, K, and L.
d. Columns A through L in your spreadsheet should now be highlighted, except for columns F and H.
3. From the menu, select Format, then Conditional formatting. The columns you’ve selected should
turn a light shade of green, and a new Conditional format rules tool will appear. Additionally, the
Apply to range field should indicate the cells you’ve selected.
4. Next, apply a condition to these cells to change the cell color if the cell is empty. In the Format cells if
drop-down, select Cell is empty.
5. Select the Formatting style field. Select a bright color from the drop-down to make the blank cells
stand out.
6. Select Done.
Remove duplicates is a spreadsheet tool that automatically searches for and eliminates duplicate entries
from a spreadsheet. This is faster and easier than searching the data by scrolling through it.
1. Create a copy of your dataset by right clicking the Association ABC membership tab and selecting
Duplicate. This is a good practice, as it protects against accidentally deleting important data. Continue
working in the new sheet, Copy of Association ABC memberships.
2. In the menu, select Data, then Data cleanup, then Remove duplicates.
3. Check the box next to Data has header row.
4. Check the box next to Select All to inspect the entire spreadsheet.
5. Select Remove duplicates.
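The Remove duplicates tool keeps the first occurrence of each repeated row and discards the rest. The same idea can be sketched in Python with hypothetical membership rows:

```python
# Hypothetical membership rows; the last row duplicates the first.
rows = [
    ("Avery Quinn", "avery@example.com"),
    ("Sam Lee", "sam@example.com"),
    ("Avery Quinn", "avery@example.com"),
]

# Keep the first occurrence of each row, preserving the original order.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(len(deduped))  # 2
```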
Format dates to make all of the data in your spreadsheet consistent. This makes items easier to find and
manipulate.
SPLIT is a spreadsheet function that divides text around a specified character and puts each fragment
into a new, separate cell.
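For reference, SPLIT’s behavior can be approximated in Python with str.split; the .strip() call added here tidies leftover whitespace and is an extra step, not part of SPLIT itself. The address value is hypothetical:

```python
# Roughly what SPLIT("25 Dyas Rd, Ste. 101", ",") does in a spreadsheet:
address = "25 Dyas Rd, Ste. 101"
parts = [p.strip() for p in address.split(",")]
print(parts)  # ['25 Dyas Rd', 'Ste. 101']
```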
This reading outlines steps the instructor performs in the following video, Optimize the data-cleaning
process. The video teaches some useful spreadsheet functions, which can make your data-cleaning even
more successful.
Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.
If you would like to access the spreadsheet the instructor uses in this video, click the link to the dataset
to create a copy. If you don’t have a Google account, you may download the data directly from the
attachments below.
Link to logistics data: International Logistics Association Memberships - Data for Cleaning
OR
1. Open the International Logistics Association Memberships - Data for Cleaning dataset, and scroll down
to row 74.
Note: The dataset has 72 rows, and row 73 is left blank for separation.
2. In cell H74, enter Member Dues < 100 to label the calculation.
3. In cell I74, enter the formula =COUNTIF(I2:I72,"<100") to count how many members in the cell
range I2:I72 pay dues of less than $100. This formula returns a value of 1, indicating one value is
below $100.
4. In cell I55, change -$200 to $200. Cell I74 should now display the value 0.
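The COUNTIF logic in step 3 can be sketched in Python with a hypothetical dues list:

```python
# Hypothetical member dues; -200 is a sign-entry error.
dues = [250, 150, -200, 175, 300]

# Like =COUNTIF(range, "<100"): count values below 100.
count_below_100 = sum(1 for d in dues if d < 100)
print(count_below_100)  # 1: only the -200 entry

# After correcting -200 to 200, the count drops to 0.
dues[dues.index(-200)] = 200
count_after_fix = sum(1 for d in dues if d < 100)
print(count_after_fix)  # 0
```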
The LEN function is useful if you have a piece of information in your spreadsheet that you know must
be a specific length.
Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific
conditions.
1. To highlight all of column B except for the header, select the column B header. Then hold down
CONTROL (Windows) or COMMAND (Mac) and select cell B1 to deselect the header.
2. Navigate to the Format menu, and choose Conditional Formatting.
3. Set the Format rules field to Is not equal to and enter 6 as the value.
4. Select Done.
5. Notice cell B36 is highlighted because its value is 7.
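The length rule behind this check can be sketched in Python with hypothetical member IDs:

```python
# Hypothetical member IDs that are expected to be exactly 6 characters.
member_ids = ["123456", "9876543", "555555"]

# Like the conditional-format rule: flag any value whose length is not 6.
wrong_length = [v for v in member_ids if len(v) != 6]
print(wrong_length)  # ['9876543']
```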
MID is a function that returns a segment from the middle of a text string.
CONCATENATE is a spreadsheet function that joins together two or more text strings.
TRIM is a function that removes leading, trailing, and repeated spaces in data.
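Rough Python equivalents of these three functions, applied to a hypothetical text value (note that MID uses 1-based positions in spreadsheets, while Python slices are 0-based):

```python
text = "  International  Logistics  "

# TRIM: remove leading, trailing, and repeated spaces.
trimmed = " ".join(text.split())
print(trimmed)  # 'International Logistics'

# MID(trimmed, 1, 13): a 13-character segment starting at position 1.
mid = trimmed[0:13]
print(mid)  # 'International'

# CONCATENATE(trimmed, " Association"): join two text strings.
joined = trimmed + " Association"
print(joined)  # 'International Logistics Association'
```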
Workflow automation
In this reading, you will learn about workflow automation and how it can help you work faster and
more efficiently. Basically, workflow automation is the process of automating parts of your work. That
could mean creating an event trigger that sends a notification when a system is updated. Or it could
mean automating parts of the data cleaning process. As you can probably imagine, automating different
parts of your work can save you tons of time, increase productivity, and give you more bandwidth to
focus on other important aspects of the job.
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of the job
that can’t be automated. Let's take a look at some things we can automate and some things that we
can’t.
One of the most important ways you can streamline your data cleaning is to clean data where it lives.
This will benefit your whole team, and it also means you don’t have to repeat the process over and
over. For example, you could create a programming script that counted the number of words in each
spreadsheet file stored in a specific folder. Using tools that can be used where your data is stored means
that you don’t have to repeat your cleaning steps, saving you and your team time and energy.
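The word-counting script mentioned above might look something like this Python sketch; the folder path, the .csv-only scope, and the helper name are assumptions for illustration:

```python
import csv
from pathlib import Path

def count_words_in_folder(folder):
    """Hypothetical helper: count the words in every .csv file in a folder."""
    counts = {}
    for path in Path(folder).glob("*.csv"):
        total = 0
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                # Each cell may contain several space-separated words.
                total += sum(len(cell.split()) for cell in row)
        counts[path.name] = total
    return counts
```

Running a script like this where the data lives means every teammate gets the same counts without repeating the steps by hand.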
More resources
There are a lot of tools out there that can help automate your processes, and those tools are improving
all the time. Here are a few articles or blogs you can check out if you want to learn more about
workflow automation and the different tools out there for you to use:
Key takeaways
As a data analyst, automation can save you a lot of time and energy, and free you up to focus more on
other parts of your project. The more analysis you do, the more ways you will find to make your
processes simpler and more streamlined.
This reading outlines the steps the instructor performs in the next video, Different data perspectives.
The video teaches you different methods data analysts use to view data differently and how looking at
different views leads to more efficient and effective data cleaning.
Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skills demonstrated in the video.
If you’d like to follow along with the examples in this video, choose a spreadsheet tool. Google Sheets
or Excel are recommended.
To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.
OR
A pivot table is a data summarization tool. It can be used in data processing and in data cleaning, for
which pivot tables offer a quick, clutter-free view of your data. Pivot tables help sort, reorganize,
group, count, total, or average data in a dataset.
1. In the Cosmetics Inc. spreadsheet, select the data you'll include. In this case, select all of the data in
Sheet 1 of the spreadsheet by selecting cell A1 then dragging your cursor to cell F31.
2. Select Insert, then Pivot Table. Choose New sheet and Create. Google Sheets creates a new sheet
where you can define the pivot table.
3. Use the Pivot table editor to add specific data to your pivot table.
a. In the Pivot table editor panel, next to Rows, select Add.
c. Below Rows, from the Order dropdown list, select Descending to put the most profitable items at
the top.
f. Notice that the top two most ordered products are 15143Exfo and 32729Masc. The rest of the orders
total less than $10,000.
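The grouping and descending sort that the pivot table performs can be sketched in Python with hypothetical order rows:

```python
from collections import defaultdict

# Hypothetical order rows: (product_code, profit).
orders = [
    ("15143Exfo", 7000),
    ("32729Masc", 5500),
    ("15143Exfo", 4200),
    ("75148Lip", 900),
]

# Pivot: total profit per product, then sort descending,
# like setting the Rows order to Descending in the editor.
totals = defaultdict(int)
for product, profit in orders:
    totals[product] += profit

pivot = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(pivot)
```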
Example 2: VLOOKUP
VLOOKUP is a spreadsheet function that vertically searches for a certain value in a column to return a
corresponding piece of information. It's rare for all of the data an analyst will need to be in the same
place. Usually, you'll have to search across multiple sheets or even different databases. VLOOKUP
helps bring the information together.
In the previous example, you found the product codes of the most ordered products. Now, you’ll use
VLOOKUP to find the names of these products.
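VLOOKUP’s search-and-return behavior maps naturally onto a dictionary lookup in Python. A sketch with a hypothetical product table (the product names here are invented for illustration):

```python
# Hypothetical lookup table: product code -> product name.
products = {
    "15143Exfo": "Exfoliating Cleanser",
    "32729Masc": "Lengthening Mascara",
}

# Like VLOOKUP: search for the code, return the matching name,
# with a fallback when the code is not found.
code = "15143Exfo"
name = products.get(code, "NOT FOUND")
print(name)
```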
Example 3: Plotting
The plotting tool allows analysts to quickly create a graph, chart, table, or other visual from data.
Plotting is useful for identifying skewed data or outliers.
1. In Sheet 1 of the Cosmetics, Inc. spreadsheet, select column B, which contains the prices.
2. Select Insert > Chart.
1. If the chart created is not a column chart, select Column chart from the dropdown menu under Chart
type in the Chart editor.
2. Select and drag the chart to the right so you can view the data in the sheet.
3. Check for obvious outliers and fix them in the spreadsheet. For example, you might notice that an item
in the middle of the chart has an extremely low price of $0.73. The decimal point is in the wrong place.
In cell B14, correct this price to $7.30, and notice that Google Sheets automatically updates the chart.
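A chart makes the $0.73 outlier easy to spot visually; the same check can be sketched programmatically in Python. The price list is hypothetical, and the threshold of half the median is an assumption for illustration:

```python
import statistics

# Hypothetical price column; 0.73 is a misplaced decimal point ($7.30).
prices = [7.10, 6.95, 0.73, 7.45, 7.30, 6.80]

median = statistics.median(prices)

# Flag any price less than half the median as a likely decimal-point error.
outliers = [p for p in prices if p < median / 2]
print(outliers)  # [0.73]
```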
This reading outlines the steps the instructor performs in the next video, Even more data-cleaning
techniques. This video teaches you different methods data analysts use in data mapping. Data mapping
is the process of matching fields from one database to another. It’s critical to the success of data
migration, data integration, and many other data-management activities. This video contains one
activity for you to practice.
Keep this step-by-step guide open as you watch the video. It can serve as a helpful reference if you
need additional context or clarification while following the video steps. This is not a graded activity,
but you can complete these steps to practice the skill demonstrated in the video.
If you’d like to follow along with the example in this video, choose a spreadsheet tool, such as Google
Sheets or Excel.
To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy
of the dataset. If you don’t have a Google account, download the data directly from the attachments
below.
Link to templates:
International Logistics Association memberships
Downloads:
Example: CONCATENATE
CONCATENATE is a function that joins together two or more text strings. In the video, you’ll learn
how to use CONCATENATE to clean data after two datasets have been combined.
1. Open the dataset spreadsheet titled Global Logistics Association. When prompted, select USE
TEMPLATE.
2. Insert a new column to the right of column E. Label it New Address in cell F1.
3. In the second row of the new column (cell F2), enter =CONCATENATE(D2,E2) and press Enter.
1. You will notice that some results need a space between the street address and the unit or suite number,
such as: 25 Dyas RdSte. 101.
2. You could manually clean the data later to add a space between Rd and Ste., but CONCATENATE
can actually do it for you.
3. The CONCATENATE formula can help you format the data as it is merged by entering an additional
string to insert a space between Rd and Ste.
4. Enter =CONCATENATE(D2, " ", E2) and you will have an address that is formatted like this: 25
Dyas Rd Ste. 101. Much better!
4. Ensure the new data in the cell accurately reflects the merging of the two previous columns.
5. Select cell F2 and drag down to apply the formula to all rows in the column.
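The difference between the two formulas can be sketched in Python string terms:

```python
street = "25 Dyas Rd"
unit = "Ste. 101"

# Like =CONCATENATE(D2, E2): joins with no separator.
no_space = street + unit
print(no_space)  # 25 Dyas RdSte. 101

# Like =CONCATENATE(D2, " ", E2): the extra string inserts the space.
with_space = street + " " + unit
print(with_space)  # 25 Dyas Rd Ste. 101
```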
In an earlier course in this certificate program, you worked with .csv files. Data analysts use .csv files
often, so throughout this course you will continue to use .csv files to transfer data into data analysis
programs for further analysis and visualization. .csv files are plain text files with an organized table
structure that includes rows and columns. The values in each row are separated by commas. This table
structure makes them easy to understand, edit, manipulate, and use for data analysis.
A major advantage of .csv files is their widespread compatibility. They can be imported and exported
by a vast range of data analysis tools and software programs.
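Because .csv files are plain text, most tools and languages can parse them with a few lines of code. A minimal Python sketch using the standard csv module on hypothetical data:

```python
import csv
import io

# Hypothetical .csv content: plain text, rows separated by newlines,
# values within each row separated by commas.
raw = "name,dues\nAvery Quinn,150\nSam Lee,95\n"

rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # header row: ['name', 'dues']
print(rows[2])  # ['Sam Lee', '95']
```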
To use .csv files and upload them to data analysis programs, you will first need to download them to
your local device. Downloading a .csv file from a website can vary depending on your operating
system or internet browser. Here are some ways you can download a .csv file:
Click the download link or .csv attachment: Locate the link for the .csv file or attachment on the
website. Click on it, and the download process will start.
Right-click and Save: Right-click on the data table or element containing the .csv data. Choose Save
as… or a similar option. Name the file and make sure the extension on the file is “.csv”.
Force download: You can use the Alt key on your keyboard while clicking the link. This will trigger
the download, and you will be able to find the .csv file in your Downloads folder.
Note: When using the Chrome browser or ChromeOS, .csv files may open in a new tab instead of
downloading to your machine. If this happens, follow these instructions:
Select File from the menu bar, then select Save as Google Sheets. This will open the .csv file as a
Google Sheet.
Select File from the menu bar, then select Download from the dropdown menu, then select Comma
Separated Values (.csv).
You will often need to upload .csv files during the data analysis process. Here is how you do this:
Locate the upload option: Each data analysis platform will have a designated button, menu option, or
drag and drop area labeled Upload or Import. This is where you will upload your .csv file.
Choose your .csv file: Click Upload or Import on the platform you are using to open your file
explorer. Select your .csv file. If you just downloaded a .csv file from the web, it will be located in
your computer’s Downloads folder.
Initiate the upload: Once you've selected your .csv file, click Upload or Import. The platform may
display a progress bar or message indicating that the upload is complete.
Note: Some platforms have restrictions on the file size or format of .csv files. Make sure your .csv files
adhere to these requirements before uploading.
Key takeaways
Data analysis programs help us extract insights and knowledge from data. Using .csv files is essential
in data analysis. Understanding how to easily download data from the web or add your data to these
programs will allow you to complete data cleaning, visualizations, analysis, and so much more!
As you continue on your data journey, you’re likely discovering that data is often messy—and you can
expect raw, primary data to be imperfect. In this reading, you’ll consider how to develop your personal
approach to cleaning data. You will explore the idea of a cleaning checklist, which you can use to
guide your cleaning process. Then, you’ll define your preferred methods for cleaning data. By the time
you complete this reading, you’ll have a better understanding of how to methodically approach the data
cleaning process. This will save you time when cleaning data and help you ensure that your data is
clean and usable.
Data cleaning usually requires a lot of time, energy, and attention. But there are two steps you can take
before you begin to help streamline your process: creating a cleaning checklist and deciding on your
preferred methods. This will help ensure that you know exactly how you want to approach data
cleaning and what you need to do to be confident in the integrity of your data.
Start developing your personal approach to cleaning data by creating a checklist to help you identify
problems in your data efficiently and identify the scale and scope of your dataset. Think of this
checklist as your default “what to search for” list.
Here are some examples of common data cleaning tasks you could include in your checklist:
Determine the size of the dataset: Large datasets may have more data quality issues and take longer
to process. This may impact your choice of data cleaning techniques and how much time to allocate to
the project.
Determine the number of categories or labels: By understanding the number and nature of categories
and labels in a dataset, you can better understand the diversity of the dataset. This understanding also
helps inform data merging and migration strategies.
Identify missing data: Recognizing missing data helps you understand data quality so you can take
appropriate steps to remediate the problem. Data integrity is important for accurate and unbiased
analysis.
Identify unformatted data: Identifying improperly or inconsistently formatted data helps analysts
ensure data uniformity. This is essential for accurate analysis and visualization.
Explore the different data types: Understanding the types of data in your dataset (for instance,
numerical, categorical, text) helps you select appropriate cleaning methods and apply relevant data
analysis techniques.
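Several of these checklist items can be computed in a single pass over the data. A Python sketch over a small hypothetical dataset, where empty strings mark missing values:

```python
# Hypothetical membership rows; empty strings mark missing values.
dataset = [
    {"name": "Avery", "dues": "150", "category": "Gold"},
    {"name": "Sam", "dues": "", "category": "Silver"},
    {"name": "Riley", "dues": "95", "category": "Gold"},
]

size = len(dataset)                                         # dataset size
categories = {row["category"] for row in dataset}           # distinct labels
missing = sum(1 for row in dataset if "" in row.values())   # rows with gaps

print(size, sorted(categories), missing)  # 3 ['Gold', 'Silver'] 1
```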
There might be other data cleaning tasks you’ve been learning about that you also want to prioritize in
your checklist. Your checklist is an opportunity for you to define exactly what you want to remember
about cleaning your data; feel free to make it your own.
In addition to creating a checklist, identify which actions or tools you prefer using when cleaning data.
You’ll use these tools and techniques with each new dataset—or whenever you encounter issues in a
dataset—so this list should be compatible with your checklist.
For example, suppose you have a large dataset with missing data. You’ll want to know how to check
for missing data in larger datasets, and how you plan to handle any missing data, before you start
cleaning. Outlining your preferred methods can save you lots of time and energy.
Key takeaways
The data you encounter as an analyst won’t always conform to your checklist or your preferred actions
and tools. But having these things can make common data cleaning tasks much easier to complete. As
is so often the case, thoughtful planning sets up any project for success!