Best Practices For Data Cleaning - EN - 1802
Table of contents
I. Basic steps for cleaning your data
II. How to clean your data in Excel
II.1. Get rid of extra spaces
II.2. Select and treat all blank cells (missing data)
II.3. Convert numbers stored as text into numbers
II.4. Remove duplicates
II.5. Highlight and correct errors
II.6. Change text to lower/upper/proper case
II.7. Use Text to Columns to parse data in Excel
II.8. Delete all formatting
II.9. Use “find and replace” to clean data in Excel
Data cleaning can be done using different techniques and software (Excel, STATA, SPSS
etc.). This tutorial covers the general principles of data cleaning and then shows
how to implement them in Excel, as this is the most common and versatile tool used
by Tdh.
[Diagram: data cleaning = checking reliability, checking logical coherence, correcting errors]
Remember that the best way of reducing the data cleaning workload is to plan your data
collection carefully from the start [2]: include the analysis you will want to do in the
planning, and test it with fake data before you deploy!
When cleaning our databases, we are looking for the following inconsistencies and errors:
[1] Source: https://support.office.com/en-us/article/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
[2] https://www.mdc-toolkit.org/design-your-forms/
Before, during and after each of these steps, you should check (visually, by using filters
etc.) that you have not made any changes that you were not expecting!
What happens if you do not create a backup copy of the original data and you
inadvertently delete a column containing data (or another element) in your
database?
Well, if it is not in Kobo anymore, that data is lost forever and there is no way to
recover it. This is why it is very important to create a backup copy of the original data
in a separate workbook!
2. Ensure that the data is in a tabular format of rows and columns with: similar data in each
column, all columns and rows visible, no merged cells, no multiple answers in one cell and
no blank rows within the range.
3. Carry out tasks that don’t require column manipulation first, such as getting rid of the
extra spaces or using the “Find and replace” dialog box.
In order to get rid of extra spaces, you can use the TRIM function in Excel. The TRIM
function removes all spaces from text except for single spaces between words. Its syntax
is TRIM(text), where text is a required argument that refers to the text from
which you want spaces removed.
First, type in your formula next to your first text entry. Then drag the formula down to cover
all your text entries. All your text entries will then be cleaned of extra spaces.
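For readers who want to cross-check the result outside Excel, the TRIM rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `excel_trim` and the sample names are made up, and the sketch handles only the ordinary space character, as TRIM does.

```python
import re

def excel_trim(text: str) -> str:
    """Mimic Excel's TRIM: drop leading/trailing spaces, collapse inner runs to one."""
    # Only the ordinary space character is handled; tabs and
    # non-breaking spaces are left untouched.
    return re.sub(r" {2,}", " ", text.strip(" "))

# Made-up sample entries with extra spaces (not taken from the tutorial's data)
names = ["  Anna   Keller ", "Omar  Diallo", "Li Wei"]
cleaned = [excel_trim(n) for n in names]
print(cleaned)
```

Comparing the length of each entry before and after cleaning is a quick way to see which entries actually contained extra spaces.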
[3] Sources: https://trumpexcel.com/clean-data-in-excel/ and https://www.youtube.com/watch?v=e0TfIbZXPeA
[4] The Excel version that was used for this tutorial is Excel 2011. Also, should you have any issues or questions, do not
hesitate to search for your issue/question on the web as there are a multitude of sources out there likely to help you.
You might want to detect all blank cells, as they could represent missing data, and replace them
with “Missing Data” for example. In order to do so, select the entire dataset, click on “Find and
Select” and then on “Go to special”, which opens the “Go to special” dialogue box.
In the dialogue box, select the “Blanks” option in order to make your blank cells appear in
grey.
This will select all the blank cells in your dataset at the same time. The first selected cell will
be blank (and not light grey like the others) as this cell is the active cell. In order to input
M/D in all the blank cells (=missing data), type in this text in the active cell and hit
Ctrl+Enter.
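The same "fill every blank with M/D" step can be sketched in Python as a cross-check. The function name `fill_blanks` and the sample rows are hypothetical. One assumption differs from Excel: this sketch also treats cells containing only spaces as blank, whereas "Go to special" selects only truly empty cells.

```python
def fill_blanks(rows, placeholder="M/D"):
    """Return a copy of rows where empty or whitespace-only cells hold the placeholder."""
    return [
        [placeholder if cell is None or str(cell).strip() == "" else cell
         for cell in row]
        for row in rows
    ]

# Made-up rows: two cells are missing
rows = [
    ["1114", "2018-01-12", "5"],
    ["1115", "", "3"],
    ["1116", "2018-01-14", None],
]
filled = fill_blanks(rows)

# Count how many cells were filled, to compare with what you expected
blank_count = sum(cell == "M/D" for row in filled for cell in row)
print(blank_count)
```

The original `rows` list is left unchanged, in line with the advice to keep a backup copy of the original data.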
In order to convert a number stored as text into a number, select the cells, go to the
“Formatting box” and type in “General”. Numbers stored as text should then become true
numbers, which you can recognise because they are aligned to the right of the cell. If some
values stay aligned to the left, the VALUE(text) function can convert them in a separate column.
By default, numbers are aligned to the right of the cell, whereas text is aligned to the left
of the cell.
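As an illustration of what "number stored as text" means, the following Python sketch converts text values to numbers where possible and leaves genuine text alone. The helper `to_number` and the sample column are assumptions made for this example only.

```python
def to_number(value):
    """Convert a text value to int or float when possible, else return it unchanged."""
    if not isinstance(value, str):
        return value  # already a number (or another non-text type)
    text = value.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return value  # leave genuine text (e.g. "M/D") as it is

# Made-up column mixing text numbers, a missing-data marker and a real number
column = ["12", " 7.5", "M/D", 3]
converted = [to_number(v) for v in column]
print(converted)
```

Leaving unconvertible values untouched, rather than dropping them, keeps markers such as M/D visible for later checks.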
You first have to find where the duplicates are. Select the whole dataset, go to “Home”, click
on “Conditional Formatting”, then on “Highlight Cells Rules” and click on “Duplicate
Values”.
In the example below you can see that the Beneficiary ID 1116 has been highlighted only
in the “Beneficiary ID” column and not in the “Distribution Date” column. This is due to the
fact that those are not true duplicates as the dates of distribution for the two entries are
different: thus they are two different data entries.
Then, go to “Data” and click on “Remove Duplicates” in order to delete the real duplicates.
It then informs you how many duplicates have been deleted and how many data entries
(unique values) have been kept.
As you can see, the 1116 “false” duplication has been kept.
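The difference between a true duplicate (the whole row is repeated) and a "false" one (only the ID is repeated) can be sketched in Python. The function `remove_duplicates` and the dates are made up for illustration; only the Beneficiary ID 1116 case mirrors the example above.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each fully identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # the whole row is compared, not just the ID
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique, len(rows) - len(unique)

# (Beneficiary ID, Distribution Date), with made-up dates
records = [
    ("1115", "2018-01-12"),
    ("1116", "2018-01-12"),
    ("1116", "2018-01-19"),  # same ID, different date: not a true duplicate
    ("1115", "2018-01-12"),  # identical to the first row: a true duplicate
]
unique, removed = remove_duplicates(records)
print(f"{removed} duplicate(s) removed, {len(unique)} unique rows kept")
```

Returning the number of removed rows alongside the cleaned data mirrors the summary Excel shows after "Remove Duplicates".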
In the table below, an indicator named “Distributed quantity per household member” has been
computed from the “Quantity distributed” and “Number of people in HH” indicators. However,
some values have errors (as there are some missing values in the “number of people in HH”
indicator).
In order to detect these errors, go to “Home”, “Conditional Formatting” and click on “New
Rule”.
Excel will then highlight all the cells containing an error. You can therefore filter the
associated columns by colour to work directly on the cells with errors.
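To show why a computed indicator produces errors when the household size is missing, here is a minimal Python sketch. The function `per_member`, the field names and the values are hypothetical; the sketch flags rows where the division cannot be computed instead of highlighting them.

```python
def per_member(quantity, household_size):
    """Return quantity / household_size, or None when it cannot be computed."""
    try:
        return float(quantity) / float(household_size)
    except (TypeError, ValueError, ZeroDivisionError):
        return None  # missing, non-numeric or zero household size

# Made-up rows: one household size is missing, one is zero
rows = [
    {"id": "1114", "quantity": 10, "hh_size": 5},
    {"id": "1115", "quantity": 8, "hh_size": None},
    {"id": "1116", "quantity": 6, "hh_size": 0},
]
for row in rows:
    row["per_member"] = per_member(row["quantity"], row["hh_size"])

# IDs of the rows that need to be checked and corrected by hand
error_ids = [row["id"] for row in rows if row["per_member"] is None]
print(error_ids)
```

Listing the affected IDs plays the same role as filtering by colour: it lets you go straight to the entries that need correcting.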
In the example below, names are written in different ways. You can harmonize them by using
LOWER/UPPER/PROPER functions.
If you want them to be written fully in lower case, you can use the LOWER(text)
function, where text refers to the text to be modified.
If you want them to be written fully in upper case, you can use the UPPER(text)
function.
If you want the first letters of the first and last names to be written in upper case and
the rest in lower case, you can use the PROPER(text) function.
In each case, once you have entered your function for one data entry, drag the formula
down to cover all the data entries that you want to modify.
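For comparison, Python's built-in string methods give similar results to LOWER, UPPER and PROPER. This is a sketch with made-up names; `str.title()` is used here as a close stand-in for PROPER, on the assumption that the names contain only letters and spaces.

```python
# Made-up names written in inconsistent ways
names = ["anna KELLER", "OMAR diallo", "li wei"]

lower = [n.lower() for n in names]    # like LOWER(text)
upper = [n.upper() for n in names]    # like UPPER(text)
proper = [n.title() for n in names]   # close to PROPER(text)

print(lower)
print(upper)
print(proper)
```

Whichever case you choose, applying the same one to the whole column is what makes the entries comparable.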
It might happen that you obtain a .csv version of the database, where all the data is
concentrated in a single column (separated by a comma, tab etc.).
In the “Convert Text to Columns Wizard” box that opens, select “Delimited” and then
click on “Next”.
Excel will then ask how it should separate your data. As a comma separates the
different data entries in our example, select “Comma” and then click on “Next”. Other
options are “Tab”, “Semicolon”, “Space” and “Other”.
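The splitting that the wizard performs can be sketched with Python's standard `csv` module. The function name `text_to_columns` and the data values are assumptions for illustration; the column names are the ones used in this tutorial's examples.

```python
import csv

def text_to_columns(lines, delimiter=","):
    """Split each delimited line into a list of cells."""
    return [row for row in csv.reader(lines, delimiter=delimiter)]

# Made-up lines where each record sits in a single text cell
raw_lines = [
    "Beneficiary ID,Distribution Date,Quantity distributed",
    "1115,2018-01-12,10",
    '1116,2018-01-19,"6,5"',  # quoted value containing the delimiter
]
table = text_to_columns(raw_lines)

# Every row should have the same number of columns as the header
column_counts = {len(row) for row in table}
print(column_counts)
```

Checking that every row has the same number of columns is a quick way to confirm that the right delimiter was chosen.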
If you still have a lot of conditional formatting in your data and wish to get rid of it because it
is no longer useful, select your whole dataset, go to “Home”, then “Clear” and click on
“Clear Formats”.
If you have the same data entries in your dataset but written in different ways (e.g. with
different spellings), the “Find and replace” option in Excel can be very useful. In
order to use it, first find all the data to be replaced by going to “Home”, to “Find & Select”,
then “Find”, and clicking on “Find all”. Check that you will not be replacing data in other
columns by mistake. Then, select all the data that has to be cleaned, go to “Home”, to “Find
& Select” and click on “Replace”.
In the “Find and Replace” box that opens, enter the values that you want to be replaced in
“Find what”, enter the values that you want to replace them with in “Replace with” and
then click on “Replace all”.
Excel will then inform you of how many values have been replaced/modified.
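The idea of limiting a replacement to the intended column and checking the count afterwards can be sketched in Python. The function `replace_in_column`, the column names and the values are hypothetical. The sketch replaces only cells whose whole content matches, which is an assumption made to keep the example simple.

```python
def replace_in_column(rows, column, find_what, replace_with):
    """Replace exact matches of find_what in one column; return new rows and the count."""
    replaced = 0
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay untouched
        if new_row.get(column) == find_what:
            new_row[column] = replace_with
            replaced += 1
        result.append(new_row)
    return result, replaced

# Made-up rows: the same spelling also appears in another column
rows = [
    {"village": "Kabul", "enumerator": "Kaboul"},
    {"village": "Kaboul", "enumerator": "Sara"},
    {"village": "Kaboul", "enumerator": "Omar"},
]
cleaned, count = replace_in_column(rows, "village", "Kaboul", "Kabul")
print(f"{count} value(s) replaced")
```

If the count differs from what the earlier "Find all" check suggested, that is a sign that the replacement touched more or fewer cells than intended.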