Preparing Data For Analysis Using Excel

The document discusses five steps for preparing data for analysis in Excel: (1) import data and clean it by splitting entries, extracting parts and trimming spaces; (2) adjust formats by standardizing them, storing data in the correct format and replacing corrupted characters; (3) correct inconsistencies using custom validation rules and checks for outliers, wrong categories and missing entries; (4) remove exact and fuzzy duplicates; (5) combine data sets by looking up data via unique identifiers.


4/1/2020 Prepare your data for analysis in five steps using Excel

Prepare your data for analysis in five steps using Excel


PrepJet
Aug 11, 2016 · 10 min read

The five steps of data preparation

Most data sets require preparation before analysis. Garbage in, garbage out: your
analysis will only yield meaningful results if your data is of high quality. Particularly
with small or medium-sized data sets, wrong entries and inconsistent values can have
a tremendous impact on your analysis because they introduce bias. The data
preparation process involves different tasks, depending on the type of data you
analyze.

What should you pay attention to when getting your data ready for analysis? And how
can you perform these steps efficiently in Microsoft Excel? Assuming that you have
already collected your data, you should go through the five steps of data preparation.
Our checklist guides you through the process, and the explanations below give you
hints on how to implement these data cleaning tasks in Excel. We point out where
PrepJet helps you with these steps, but we always show how to work with “pure”
Excel functionality as well.

https://medium.com/@PrepJet/prepare-your-data-for-analysis-in-five-steps-using-excel-9869006e0a9e

Needless to say, but still worth a reminder: save a backup of your raw data before you
start, in case you want to look up some information in the original data later. One more
tip beforehand: if you are dealing with large data sets that make Excel slow when
performing certain operations, try the operation on a small sample before applying it
to the whole data set. You can grab some coffee while Excel is calculating. If your coffee
consumption gets too high while preparing your data, you might want to try PrepJet,
which can perform certain operations more quickly than Excel formulas (e.g. when
looking up data from different tables).

1. Import data

Split data along delimiters


When you import your data, make sure that your entries are cleanly delimited. IT
systems usually define a delimiter such as a semicolon or a comma. In Microsoft Excel,
we recommend using the import function to get your data into a tabular shape, even if
you import a csv file rather than a txt file. Compared to simply opening the csv file, the
import function has the advantage that you can define the character encoding. Our
animation shows you how to use the import function.

If you need a more sophisticated separation of your columns, you can use PrepJet’s
“Separate Cell Content” function. It allows you to define the position of the delimiters.

Extract parts from data entries


In case your import did not work well or you need only a part of a certain attribute (such
as a part of a nested ID), you have to perform more advanced split operations. For
example, if you want to extract the domains of email addresses, you have to extract the
part between the @ and the last dot. If you use built-in Excel functions, LEFT and RIGHT
will help you to extract text parts. With FIND, you can locate the delimiters (in our case
“@” and “.”), and with LEN, you can determine the length of the character sequences
you want to extract. You have to combine several of these functions to finally get the
domain.
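
One possible combination, as a minimal sketch: assume the email address is in A2 and columns B to D are free helper columns (both are assumptions for illustration). SUBSTITUTE is an additional built-in function that the text above does not mention; it is used here to mark the last dot with "~", which assumes that "~" does not occur in the address. Enter the three formulas in B2, C2 and D2, in that order:

```excel
=RIGHT(A2; LEN(A2)-FIND("@"; A2))
=FIND("~"; SUBSTITUTE(B2; "."; "~"; LEN(B2)-LEN(SUBSTITUTE(B2; "."; ""))))
=LEFT(B2; C2-1)
```

For jane.doe@mail.example.com, B2 returns mail.example.com, C2 returns 13 and D2 returns mail.example. An address without a dot after the @ produces an error in C2, which is itself a hint that the entry needs checking. The formulas use the semicolon as argument separator, like the other formulas in this article; if your Excel uses the comma as list separator, replace the semicolons accordingly.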


If you don’t want to write lengthy formulas, you can use PrepJet’s Extract Text function.
It allows you to extract text parts from a column in a few clicks.

Remove leading and trailing spaces


System exports or web-scraped data often contain unwanted leading and trailing spaces.
You can remove those spaces by using the Excel TRIM function (do not use find and
replace as this will also remove spaces between words!). The syntax is as follows:

=TRIM(text)

If you do not want to repeat this exercise for every single column, you can use PrepJet’s
Trim Spaces function which allows you to trim your whole sheet in a single click.
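
As a concrete sketch of the TRIM step, assume the raw text is in A2 (a hypothetical cell reference). The first formula trims the entry; the second one additionally converts non-breaking spaces (character code 160), which often appear in web-scraped data and which TRIM alone does not remove. Note that TRIM also reduces multiple spaces between words to a single space.

```excel
=TRIM(A2)
=TRIM(SUBSTITUTE(A2; CHAR(160); " "))
```

After filling the formula down, you can copy the helper column and paste it back as values to replace the original entries.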

2. Format adjustments
Standardize formats
Before combining and analyzing data, it is crucial to harmonize the formats of your data.
If your data comes from different countries or IT systems with different languages, you
should make sure to have consistent decimal separators (comma vs. dot). The same
applies to date formats (e.g. DD.MM.YYYY vs. MM/DD/YYYY), currencies (e.g. EUR vs.
USD) and measurement units (e.g. miles vs. kilometers). In Excel, the best solution is to
transform data sets with different formats into one standard before combining them. For
dates, you can use the “Number” section of the Home ribbon: click on the arrow in its
lower right corner to define a custom format.
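
If dates or numbers arrive as text in a foreign format, a custom number format alone will not convert them. A minimal sketch, assuming A2 holds a date as text in the form DD.MM.YYYY and B2 holds a number as text with a decimal comma and dots as thousands separators, such as 1.234,56 (both cell references are hypothetical). NUMBERVALUE is available from Excel 2013 onwards:

```excel
=DATE(RIGHT(A2; 4); MID(A2; 4; 2); LEFT(A2; 2))
=NUMBERVALUE(B2; ","; ".")
```

The first formula returns a real Excel date that you can then display in any date format. The second returns a real number, independent of the regional settings of your Excel.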

If you want to unify the spelling of your data entries (e.g. UPPERCASE vs. Normal Case),
you can use PrepJet’s Change Case function.
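
In plain Excel, the built-in functions UPPER, LOWER and PROPER do the same for a single cell. A minimal sketch, assuming the entry is in A2 (a hypothetical cell reference):

```excel
=UPPER(A2)
=LOWER(A2)
=PROPER(A2)
```

For the entry new york, the three formulas return NEW YORK, new york and New York.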

Store data in the correct format


To make sure that your data is analyzed appropriately, you should store it in the correct
format. For example, your data might contain a numerical identifier that has no
numerical meaning. Tell Excel that this is not a number by classifying it as text. Excel
lets you specify the format in the “Number” section of the Home ribbon (see above). To
perform a final check on the format of your data, you can use PrepJet’s Find
Inconsistencies function, which highlights inconsistent data types automatically.
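
For identifiers that were already imported as numbers and lost their leading zeros, a minimal sketch, assuming the identifier is in A2 and should have five digits (both are assumptions for illustration):

```excel
=TEXT(A2; "00000")
```

The result is stored as text, so 123 becomes 00123.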

Replace unrecognized or corrupted characters


If the character encoding differs between IT systems, some symbols might be corrupted
after an export or a failed import. Characters like “ß” might then be displayed
incorrectly and cause problems when you work with your data. We recommend
correcting them using the find and replace function.
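
In a formula, the same correction can be done with SUBSTITUTE, which leaves the original column untouched. A minimal sketch, assuming A2 holds a text in which “ß” was corrupted to “ÃŸ”, the typical result of reading UTF-8 text as Windows-1252 (the cell reference and the specific corruption are assumptions for illustration):

```excel
=SUBSTITUTE(A2; "ÃŸ"; "ß")
```

Several corrupted sequences can be handled by nesting SUBSTITUTE calls, one per sequence.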

Check for truncated entries


When exporting data from IT systems, a lot of annoying accidents can happen. One of
them is truncation, i.e. data entries are cut off at a certain position. Some manual
screening should quickly lead you to suspicious data. How to fix it? The best solution is
probably to request a new, complete export.
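
A length check can support the manual screening, since truncated entries often end at the same maximum length. A minimal sketch, assuming the entry is in A2 and the suspected cut-off is 255 characters (the cell reference and the limit are assumptions; use the limit of your source system):

```excel
=LEN(A2)
=IF(LEN(A2)>=255; "check"; "ok")
```

Sorting by the first helper column shows whether many entries end at exactly the same length.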

3. Correct inconsistencies
Check for inconsistent entries using custom rules
The most valuable resource when preparing data for analysis is your own knowledge
about the data. With custom rules (so-called business rules), you can detect wrong data.
The whole process follows the principles of Boolean logic, and you can start by defining
simple conditions. Before applying any rules to your data, it is always helpful to spell
them out in natural language, e.g.: “The entries in column ‘Age’ always have to be
greater than or equal to zero.”

PrepJet has a functionality that helps you to implement custom validation rules easily
without a lot of nested formulas and additional validation columns. You can also
implement validation rules in Excel using the IF function. You will need an additional
validation column that indicates whether the rule is fulfilled or not (this is also the
column into which you type your formula). Here, we want the output in our validation
column to be “ok” if the Age is greater than or equal to zero and “error” if the Age is
smaller than zero (assuming that column F contains Age):

=IF(F2>=0; "ok"; "error")
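
One pitfall of this simple rule: in Excel comparisons, text values count as larger than any number and empty cells count as zero, so both pass the check F2>=0. A slightly stricter sketch under the same assumption (Age in column F):

```excel
=IF(AND(ISNUMBER(F2); F2>=0); "ok"; "error")
```

With this version, empty cells and text entries such as n/a are flagged as “error” as well.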

You can also implement more complex conditions. For example, if you want to analyze
your product range, you could apply the rule “The shipping weight of an article of the
category pants is always smaller than 2 kg.” To make the implementation in Excel easier,
it is helpful to phrase the condition closer to logical terminology first:

IF “category” EQUAL TO “pants” THEN “shipping_weight” SMALLER THAN “2”

In Excel, we again have a validation column that indicates whether our rule is violated
(“error”) or not (“ok”).

To be prepared for Excel’s syntax, we have to rephrase the logical statement. The rule is
only violated if the category is pants and the shipping weight is not smaller than 2;
articles of all other categories must pass the check:

IF “category” EQUAL TO “pants” AND “shipping_weight” NOT SMALLER THAN “2” THEN “error” ELSE “ok”

Assuming that column B contains category and column C contains shipping weight, we
insert the following formula into the validation column and drag it down to the end of
our table:

=IF(AND(B2="pants"; C2>=2); "error"; "ok")

You can even go further and introduce conditions with more nesting, conditions that
have to be fulfilled cumulatively (using the AND function) or conditions that can be
fulfilled alternatively (using the OR function). Again, you can make your life a little easier
using PrepJet’s Rule-based Validation function.
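
As a sketch of such a combined condition, assume the weight rule should apply to two categories, pants and shorts (the second category is hypothetical), with category in column B, shipping weight in column C and the validation column in D. The first formula flags violations; the second, placed in any cell outside column D, counts them:

```excel
=IF(AND(OR(B2="pants"; B2="shorts"); C2>=2); "error"; "ok")
=COUNTIF(D:D; "error")
```

The OR part selects the rows the rule applies to, and the AND part marks a row as “error” only if it is selected and too heavy.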

Numerical data: Check for outliers


If you work with numerical data, you should check your data for outliers. Outliers are
values that deviate markedly from the observed distribution of your data (learn more in
our blog post on the basics of outlier detection). As statistical outlier detection is rather
complex to implement in Excel, we recommend sorting the values by size and checking
whether there are any suspicious entries at the upper or lower end of your range.
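
To support the visual check, a simple rule of thumb can be added as a helper column. A minimal sketch, assuming the values are in A2:A101 and using a distance of three standard deviations from the mean as the threshold (the range and the threshold are assumptions, and this is not a substitute for proper outlier detection):

```excel
=IF(ABS(A2-AVERAGE($A$2:$A$101))>3*STDEV($A$2:$A$101); "check"; "ok")
```

Entries marked “check” are candidates for a closer look, not automatically errors.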

Categorical data: Check for wrong categories


If you analyze categorical data, it is important to make sure that there are no different
conventions for assigning categories. For example, if you want to analyze your product
portfolio, make sure that similar products are not put into different categories (e.g. if you
work with a grocery store, all apples, bananas and mangos might be classified as “fruits”
while someone accidentally tagged pineapple as “exotic fruit”). There are different ways
to detect this kind of miscategorization in Excel. If you have a small number of
categories, you can simply insert a filter and manually check all available categories. If
you have a larger range of categories, it is more advisable to plot the frequency of
occurrences in each category (e.g. with a bar chart in Excel). You can then focus on
categories with low frequencies of occurrence. Here in our example chart, in a few cases
“blu” has been entered instead of “blue”. You can easily fix this by using Find and
Replace.
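
The frequencies can also be counted directly in a helper column. A minimal sketch, assuming the categories are in B2:B101 (a hypothetical range):

```excel
=COUNTIF($B$2:$B$101; B2)
```

Filled down, the formula shows next to each row how often its category occurs. Sorting by this column in ascending order brings rare spellings such as “blu” to the top.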

PrepJet helps to handle suspicious categories more quickly with an intelligent find and
replace feature that is called Harmonize Categories.

Missing entries: Add data or remove rows


Missing values are an issue that is hard to fix. If you can obtain the original data with
reasonable effort, the best solution is to add it. Otherwise, you might want to ignore the
rows that contain empty cells when performing certain analyses. A more advanced
option is to impute the data (e.g. by the mean or median or by a logistic regression).
This is only recommended for users with advanced statistical knowledge, as it might
bias your results.
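
A minimal sketch for both options, assuming each record spans columns A to E and the numerical attribute to impute is in C2:C101 (all ranges are hypothetical). The first formula counts the empty cells of a row so that incomplete rows can be filtered out; the second fills a missing value with the column mean:

```excel
=COUNTBLANK(A2:E2)
=IF(ISBLANK(C2); AVERAGE($C$2:$C$101); C2)
```

AVERAGE ignores empty cells, so the mean is computed from the available values only. Keep the imputed values in a separate column so that they remain distinguishable from the original data.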


Survey data: Detect suspicious response patterns


When analyzing survey data, you should be aware of response patterns that are not valid
(e.g. because the respondent was simply too tired to answer all your questions
properly). One example is a person who always chooses the answer in the same position
for each question, so-called straight-lining (e.g. always the leftmost option). More
creative respondents also create alternating patterns when answering surveys (so-called
“Christmas tree behavior”). If you use an online survey tool, as a first step you can try to
identify those answers that have been filled in within a very short time by looking at the
response timers. Most online survey tools export the time the respondents take to fill in
the form along with your survey data. In a second step, you can identify suspicious data
by screening the replies in Excel. This will be easier once you have translated categorical
text responses into numbers (e.g. “I fully agree” = 1, “I agree” = 2 etc.).
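
Once the answers are coded as numbers, straight-lining can be flagged with a helper column. A minimal sketch, assuming the coded answers of one respondent are in B2:K2 and the response time in seconds is in L2 (the ranges and the 60-second threshold are assumptions):

```excel
=IF(MIN(B2:K2)=MAX(B2:K2); "check"; "ok")
=IF(L2<60; "check"; "ok")
```

The first formula marks respondents who gave the same answer to every question, the second marks very fast responses. Alternating patterns are not caught by this check and still require manual screening.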

4. Remove duplicates
Deduplicate your data considering fuzzy duplicates
Another important step before you combine and analyze data sets is to remove duplicate
entries. This is simple if you only want to find exact duplicates. You can directly remove
them with Excel’s Remove Duplicates function in the Data Ribbon. If you first want to
check the duplicate entries to examine where they come from, you can highlight and
sort them with PrepJet’s Detect Duplicates function.
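
To inspect duplicates in plain Excel before removing them, a helper column can mark them. A minimal sketch, assuming the value that identifies a record is in A2:A101 (a hypothetical range):

```excel
=IF(COUNTIF($A$2:$A$101; A2)>1; "duplicate"; "unique")
```

Filtering on “duplicate” shows all affected rows, including the first occurrence of each.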

Often, however, actual duplicates are not exactly equal. They might differ slightly due to
typos or different naming conventions. To detect this kind of duplicate, you either
need a sophisticated fuzzy matching algorithm or a manual workaround (learn more in
our introduction to fuzzy matching). One manual solution is to include only a few
attributes in your duplicate screening. For example, if you want to find duplicates in a
customer list and names might have been spelled differently, you can look only at the
address fields in a separate step. It also helps to standardize categorical data (see
above) before searching for duplicates.
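
The address-based workaround can be sketched with a key column. Assume street is in column C, city is in column D and the key is built in column E for rows 2 to 101 (all hypothetical). The first formula goes into E2, the second into F2:

```excel
=TRIM(C2)&"|"&TRIM(D2)
=IF(COUNTIF($E$2:$E$101; E2)>1; "check"; "ok")
```

COUNTIF ignores upper and lower case, and TRIM removes stray spaces, so entries that differ only in these respects are matched. Typos in the address itself are still not caught.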

5. Combine data sets


Lookup data from other tables

If you want to analyze more than a single data set, the final step to get your data ready
for analysis is to combine the data sets. You could simply copy columns from one sheet
to another if your data is sorted. However, this is not recommended, as you risk copying
the wrong data in case one data set is missing a row. It is better to identify match criteria
and pull your data based on these criteria from one table into another. If you have one
match criterion that is unique for each row of your data set (a so-called unique
identifier), you can use this match criterion and perform a VLOOKUP in Excel. If you have
more than one match criterion, you have to use a combination of INDEX and MATCH
(check out our blog post on this operation). Or you can use PrepJet’s Lookup Data
function, which allows you to specify as many match criteria as you want in only a few
clicks.
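
Both lookups as a minimal sketch, assuming the second table is on a sheet named Sheet2 in rows 2 to 101, with the identifier in column A and the value to pull in column C; for the two-criteria case, the second match criterion is in column B of both sheets (sheet name, ranges and columns are hypothetical):

```excel
=VLOOKUP(A2; Sheet2!$A$2:$C$101; 3; 0)
=INDEX(Sheet2!$C$2:$C$101; MATCH(1; (Sheet2!$A$2:$A$101=A2)*(Sheet2!$B$2:$B$101=B2); 0))
```

The 0 as last argument of VLOOKUP and MATCH requests an exact match. The second formula is an array formula; in Excel versions without dynamic arrays it has to be confirmed with Ctrl+Shift+Enter. Rows without a match return #N/A, which also reveals records that are missing in the second table.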

Now that you are done with the strenuous part — enjoy your analysis! :-)

. . .

Originally published at www.prepjet.de on August 10, 2016.
