Preparing Data For Analysis Using Excel
Most data sets require preparation before analysis. Garbage in, garbage out: your
analysis will only yield meaningful results if your data is of high quality. Particularly
when dealing with small or medium-size data sets, wrong entries and inconsistent
values can have a tremendous impact on your analysis because they introduce bias. The
data preparation process involves different tasks, depending on the type of data you
analyze.
What should you pay attention to when getting your data ready for analysis? And how
can you perform these steps efficiently in Microsoft Excel? Assuming that you have already
collected your data, you should go through the five steps of data preparation. Our
checklist guides you through the data preparation process. The comprehensive
explanations below provide you with some hints on how to implement these data
cleaning tasks in Excel. We point out how PrepJet helps you with these steps, but
we always show how to work with “pure” Excel functionality as well.
https://medium.com/@PrepJet/prepare-your-data-for-analysis-in-five-steps-using-excel-9869006e0a9e 1/10
4/1/2020 Prepare your data for analysis in five steps using Excel
Needless to say, but still worth a reminder: do not forget to save a backup of your raw
data before going on with data preparation in case you want to look up some information
in the original data. One more tip beforehand: if you are dealing with large data sets that
cause Excel to become slow when performing certain operations, try the
operation with a small sample before applying it to the whole data set. You can grab
some coffee while Excel is calculating. If your coffee consumption gets too high while
preparing your data, you might want to try PrepJet, which can perform certain
operations more quickly than Excel formulas (e.g. when looking up data from different
tables).
1. Import data
If you need a more sophisticated separation of your columns, you can use PrepJet’s
“Separate Cell Content” function. It allows you to define the position of the delimiters.
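In “pure” Excel, a simple split at a single delimiter can be sketched with the LEFT, MID and FIND functions. The cell reference A2 and the semicolon delimiter are assumptions for illustration; they do not come from a specific data set.

```excel
=LEFT(A2, FIND(";", A2) - 1)
=MID(A2, FIND(";", A2) + 1, LEN(A2))
```

The first formula returns everything before the first semicolon, the second everything after it. If a cell contains no semicolon, FIND returns a #VALUE! error, so it can help to wrap the formulas in IFERROR.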
If you don’t want to write lengthy formulas, you can use PrepJet’s Extract Text function.
It allows you to extract text parts from a column in a few clicks.
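Without PrepJet, text parts at fixed positions can be extracted with LEFT, MID and RIGHT. A minimal sketch, assuming a hypothetical article code such as "AB-1234-XY" in cell A2:

```excel
=LEFT(A2, 2)
=MID(A2, 4, 4)
=RIGHT(A2, 2)
```

For this sample value, the three formulas return "AB", "1234" and "XY".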
=TRIM(text)
If you do not want to repeat this exercise for every single column, you can use PrepJet’s
Trim Spaces function, which allows you to trim your whole sheet in a single click.
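As a usage example, the following sketch assumes that the text to clean is in cell A2. TRIM removes leading and trailing spaces and reduces repeated spaces between words to a single space. It does not remove non-breaking spaces (character code 160), which often appear in data copied from web pages; the second formula replaces them with normal spaces first.

```excel
=TRIM(A2)
=TRIM(SUBSTITUTE(A2, CHAR(160), " "))
```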
2. Format adjustments
Standardize formats
Before combining and analyzing data, it is crucial to harmonize the formats of your data.
If your data comes from different countries or IT systems with different languages, you
should make sure to have consistent decimal separators (comma vs. dot). The same
applies to date formats (e.g. DD.MM.YYYY vs. MM/DD/YYYY), currencies (e.g. EUR vs.
USD) and measurement units (e.g. miles vs. kilometers). In Excel, the best solution is to
transform data sets with different formats into one standard before combining them. For
dates, you can use Excel’s “Number” group on the Home tab. Click on the small arrow in
its lower right corner to define a custom format.
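If dates or numbers arrive as text, formatting alone does not change them, so they have to be converted first. A minimal sketch with three independent formulas, each assuming a different hypothetical value in cell A2: a date stored as text in the form DD.MM.YYYY, a number stored as text with a decimal comma (e.g. 1.234,56), and a distance in miles.

```excel
=DATE(RIGHT(A2, 4), MID(A2, 4, 2), LEFT(A2, 2))
=NUMBERVALUE(A2, ",", ".")
=CONVERT(A2, "mi", "km")
```

The first formula builds a real Excel date from the text parts, which can then be displayed in any date format. NUMBERVALUE (available since Excel 2013) reads the text with a comma as decimal separator and a dot as group separator. CONVERT translates miles into kilometers. Depending on your regional settings, Excel may expect semicolons instead of commas between function arguments.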
If you want to unify the spelling of your data entries (e.g. UPPERCASE vs. Normal Case),
you can use PrepJet’s Change Case function.
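In “pure” Excel, the functions UPPER, LOWER and PROPER change the case of a text. A minimal sketch, assuming the entry is in cell A2:

```excel
=UPPER(A2)
=LOWER(A2)
=PROPER(A2)
```

For the entry "new york", these return "NEW YORK", "new york" and "New York".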
If your import failed for some reason, some symbols might be corrupted after an
export, as the character encoding might differ between IT systems. Characters like
“ß” might cause problems when you work with your data. We recommend
correcting them using the find and replace function.
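Besides the find and replace dialog, the SUBSTITUTE function can correct such symbols in a helper column. The sketch below assumes one typical case: a UTF-8 encoded "ß" that was read with a Western European (Windows-1252) encoding and therefore shows up as "ÃŸ" in cell A2. Which garbled sequences occur in your data depends on the systems involved, so check a few samples first.

```excel
=SUBSTITUTE(A2, "ÃŸ", "ß")
```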
3. Correct inconsistencies
Check for inconsistent entries using custom rules
The most valuable resource when preparing data for analysis is your own knowledge
about the data. With custom rules (so-called business rules), you can detect wrong data.
The whole process follows the principles of Boolean logic. You can define simple
conditions. Before applying any rules to your data, it is always helpful to spell them out in
natural language, e.g.: “The entries in column ‘Age’ always have to be greater than or
equal to zero.”
PrepJet has a functionality that helps you implement custom validation rules easily
without a lot of nested formulas and additional validation columns. You can also
implement validation rules in Excel using the IF function. You will have an additional
validation column that indicates whether the rule is fulfilled or not (this is also the
column into which you type your formula). Here, we want the output in our validation
column to be “ok” if the Age is greater than or equal to zero and “error” if the
Age is smaller than zero (assuming that column F contains Age):
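A minimal sketch of such a formula, assuming that the data starts in row 2 (the row number is an assumption). Enter it in the first cell of the validation column and drag it down:

```excel
=IF(F2>=0, "ok", "error")
```

Note that Excel treats an empty cell as zero in this comparison, so missing ages are reported as "ok" unless you check for them separately.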
You can also implement more complex conditions. For example, if you want to analyze
your product range, you could apply the rule “The shipping weight of an article of the
category pants is always smaller than 2kg.” To make the implementation in Excel easier,
it helps to first phrase the condition in logical terms: if the category is pants, then the
shipping weight must be smaller than 2 kg.
In Excel, we again have a validation column that indicates whether our rule is violated
(“error”) or not (“ok”).
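To match Excel’s syntax, we have to rephrase the logical statement as nested conditions:
if the category is pants, return “ok” when the shipping weight is smaller than 2 kg and
“error” otherwise; for every other category, return “ok”.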
Assuming that column B contains category and column C contains shipping weight, we
insert the following formula into the validation column and drag it down to the end of
our table:
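A minimal sketch of such a formula, assuming that the data starts in row 2 and that the shipping weight is stored in kilograms (both are assumptions):

```excel
=IF(B2="pants", IF(C2<2, "ok", "error"), "ok")
```

The outer IF checks the category. Only for pants does the inner IF compare the shipping weight against 2; all other categories pass the rule. The text comparison with = is not case-sensitive in Excel, so "Pants" is matched as well.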
You can even go further and introduce conditions with more nesting, conditions that
have to be fulfilled cumulatively (using the AND function) or conditions that can be
fulfilled alternatively (using the OR function). Again, you can make your life a little easier
using PrepJet’s Rule-based Validation function.
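The following sketch shows both variants. It reuses the column layout from the rules above (Age in column F, category in column B, shipping weight in column C); the upper age limit of 120 and the category "shorts" are hypothetical values chosen for illustration.

```excel
=IF(AND(F2>=0, F2<=120), "ok", "error")
=IF(AND(OR(B2="pants", B2="shorts"), C2>=2), "error", "ok")
```

The first formula requires both conditions to hold at once. The second reports an error only if the article belongs to one of the two categories and weighs 2 kg or more.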
PrepJet helps to handle suspicious categories more quickly with an intelligent find and
replace feature that is called Harmonize Categories.
4. Remove duplicates
Deduplicate your data considering fuzzy duplicates
Another important step before you combine and analyze data sets is to remove duplicate
entries. This is simple if you only want to find exact duplicates. You can directly remove
them with Excel’s Remove Duplicates function in the Data Ribbon. If you first want to
check the duplicate entries to examine where they come from, you can highlight and
sort them with PrepJet’s Detect Duplicates function.
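If you prefer to inspect duplicates before removing them without PrepJet, a validation column with COUNTIF can flag them. A minimal sketch, assuming the values to check are in A2:A1000 (a hypothetical range):

```excel
=IF(COUNTIF($A$2:$A$1000, A2) > 1, "duplicate", "unique")
```

Every row whose value occurs more than once is marked, including the first occurrence, so you can sort or filter by this column. COUNTIF ignores upper and lower case when comparing text.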
Often, however, actual duplicates are not exactly equal. They might differ slightly due to
typos or different naming conventions. To detect these kinds of duplicates, you need
either a sophisticated fuzzy matching algorithm or a manual workaround (learn more in
our introduction to fuzzy matching). One manual solution is to include only a few
attributes in your duplicate screening. For example, if you want to find duplicates in a
customer list and names might have been spelled differently, you can, in a separate step,
look only at the address fields. It also helps to standardize categorical data
(see above) before searching for duplicates.
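The address-only screening can be sketched with COUNTIFS, which counts rows that match on several columns at once. The layout is an assumption: street in column C, postal code in column D, data in rows 2 to 1000.

```excel
=IF(COUNTIFS($C$2:$C$1000, C2, $D$2:$D$1000, D2) > 1, "possible duplicate", "")
```

Rows flagged this way share the same address but may differ in the name, so they still need a manual check. This is a workaround and not a fuzzy matching algorithm.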
5. Combine data
If you want to analyze more than a single data set, the final step to get your data ready
for analysis is to combine it. You could simply copy columns from one sheet to another if
your data is sorted. However, this is not recommended, as you risk copying the wrong
data if one data set is missing a row. It is better to identify match criteria and pull your
data based on these criteria from one table into another. If you have one match criterion
that is unique for each row of your data set (a so-called unique identifier), you can use
this match criterion and perform a VLOOKUP in Excel. If you have more than one match
criterion, you have to use a combination of INDEX and MATCH (check out our blog post
on this operation). Or you can use PrepJet’s Lookup Data function, which allows you to
specify as many match criteria as you want in only a few clicks.
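Both lookups can be sketched as follows. The layout is an assumption for illustration: the second table is on a sheet named Sheet2 in rows 2 to 1000, with the match criteria in columns A and B and the value to pull in column C.

```excel
=VLOOKUP(A2, Sheet2!$A$2:$C$1000, 3, FALSE)
=INDEX(Sheet2!$C$2:$C$1000, MATCH(1, (Sheet2!$A$2:$A$1000=A2) * (Sheet2!$B$2:$B$1000=B2), 0))
```

The first formula looks up the unique identifier from A2 and returns the value from the third column of the range; FALSE forces an exact match. The second formula matches on two criteria at once. In Excel versions without dynamic arrays, it has to be confirmed with Ctrl+Shift+Enter as an array formula. Both return #N/A if no matching row exists.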
Now that you are done with the strenuous part — enjoy your analysis! :-)