02 Data Cleaning and Formulas
WELCOME TO GA
GENERAL ASSEMBLY
Our Learning Goals
Wrangle/Prepare: Clean and prepare relevant data.
The Superstore regional sales director from the central U.S. region has reached
out to you with a request:
We are seeing a high volume of returns. Can you dig into what might be
causing this?
Importing Data | Getting Your Sandbox Ready
But why?
CSV (comma-separated values) is plain text,
while XLSX is a compressed, XML-based file
format that holds information on all the
worksheets, including both content and formatting.
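To see what "plain text" means in practice, here is a minimal Python sketch. The rows and column names are made up for illustration and are not taken from the Superstore file. It writes a few rows as CSV and reads them back, showing that only the raw values survive.

```python
import csv
import io

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    ["order_id", "city_state", "sales"],
    ["1", "Chicago, Illinois", "120.50"],
]

# Write the rows as CSV into an in-memory text buffer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()

# The result is ordinary text: one line per row, commas between values.
print(csv_text)

# Reading it back yields plain strings only; there is nowhere to store
# cell formatting or additional worksheets.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

A value that itself contains a comma is wrapped in quotes so it still lands in one column when the file is read back.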
Data Cleaning
Should this be 0?
Step 1: Right-click on the column to the right of the “city_state” column (it should be
“sub_region”) and choose “Insert” to insert a new blank column to the right of
“city_state.”
Step 2: Click on the “city_state” header to select the entire “city_state” column.
Step 3: Select the “Text to Columns” button in the “Data” menu on the ribbon.
Step 2: Use the TRIM function to take out the extra space in front.
=TRIM(text)
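The same split-then-trim idea can be sketched in Python. This is only an analogy to the Excel steps, and it assumes the city and state are separated by a comma; the sample value and the helper names excel_trim and split_city_state are hypothetical.

```python
def excel_trim(text: str) -> str:
    """Mimic Excel's TRIM: drop leading and trailing spaces and collapse
    runs of spaces between words down to a single space."""
    return " ".join(part for part in text.split(" ") if part)


def split_city_state(city_state: str) -> tuple[str, str]:
    """Split on the first comma, similar to Text to Columns with a comma delimiter."""
    city, _, state = city_state.partition(",")
    return city, state


# Hypothetical sample value, assuming a comma separates city and state.
city, raw_state = split_city_state("Chicago, Illinois")

# After the split, the state still carries the space that followed the comma.
state = excel_trim(raw_state)
```

Here raw_state keeps the leading space left over from the split, which is the extra space TRIM removes.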
Asking the Right Questions
Bad example: Sales and order_id. We can get the average dollar amount
per item in an order_id; for example, the average cost of a product in order
123 was $15. But that doesn’t really lead to many useful insights for the
store. An aggregate of the average order amount across all orders or
particular categories might be more useful.
Let’s brainstorm questions we can ask about the Superstore data set together.
What might be some interesting variables to combine to gain meaningful
insights?
Formulate them into a hypothesis and share it with your class.
Let’s revisit the business problem from earlier: We are seeing a high volume
of returns. Now that you’ve identified the data points you need, open the
lesson worksheet and work with your partner to:
1. Identify the questions you can ask to help gain interesting insights from the data.
2. Then, formulate your questions into a hypothesis. Here’s an example:
“If we compare the shipping cost and the order priority, we might find that high
shipping costs for low-priority orders frequently lead to returns.”
3. List it out in your worksheet.
4. Be prepared to share your work with your class.
Data Referencing
Figure labels: Current Menu, Ribbon, Formula Bar, Cell Name. Example: A2 references A1.
=TEXT(argument1, argument2)
Now, let’s figure out what argument1 and argument2 are.
=TEXT(argument1, argument2)
The MS documentation tells us that the first argument is “Value you want to format.”
So, what is it that we want to format?
We want to format each date in the order_date column! To do so, we need to start with
the first order date. Then, we can drag the formula down to calculate the rest. Thus,
the first argument of our function will be C2.
=TEXT(argument1, argument2)
According to the MS documentation, the second argument is
“Format code you want to apply.” We need to figure out what
these format codes are.
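For readers who think in code, a value-plus-format-code function looks roughly like the Python sketch below. The mapping covers only a handful of Excel date codes and is an illustrative assumption, not a full translation; the helper name text and the sample date are hypothetical.

```python
from datetime import date

# Rough Python counterparts of a few Excel date format codes
# (illustrative, not exhaustive).
EXCEL_TO_STRFTIME = {
    "dddd": "%A",  # full weekday name
    "ddd": "%a",   # abbreviated weekday name
    "mmmm": "%B",  # full month name
    "yyyy": "%Y",  # four-digit year
}


def text(value: date, format_code: str) -> str:
    """Sketch of =TEXT(value, format_code) for the codes listed above."""
    return value.strftime(EXCEL_TO_STRFTIME[format_code])


# Hypothetical order date standing in for cell C2.
order_date = date(2024, 1, 15)

# Day and month names follow the current locale (English by default).
weekday = text(order_date, "dddd")
```

The first argument plays the role of the cell being formatted, and the second chooses how it is displayed, which mirrors the two arguments of TEXT.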
Best practice reminder: Put all formulas to the right side of your data set; don’t mix
them in with the raw data.
COUNTIF is another useful function for data cleaning. It can be used to:
● Count the number of cells in a range that contain specific data.
● Tell us whether or not a single cell contains data based on a condition.
When there is a single cell in the COUNTIF range, the maximum that can be
returned is 1 and the minimum that can be returned is 0.
Syntax:
=COUNTIF(range, condition)
We can now SUM this column to find out the number of orders that were
discounted more than 30%.
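The flag-then-sum pattern can be sketched in Python as follows. The discount values are made up, and countif is a hypothetical helper that only imitates the behavior described above.

```python
def countif(cells, condition) -> int:
    """Count how many cells in the range satisfy the condition."""
    return sum(1 for cell in cells if condition(cell))


# Hypothetical discount values, one per order row.
discounts = [0.10, 0.35, 0.00, 0.50, 0.30]

# With a single cell in the range, the result is 1 if the condition holds, else 0.
flags = [countif([discount], lambda value: value > 0.30) for discount in discounts]

# Summing the flag column gives the number of orders discounted more than 30%.
discounted_orders = sum(flags)
```

An order discounted exactly 30% gets a flag of 0, because the condition is strictly "more than 30%."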
To dive deeper into why Superstore is seeing a high volume of returns, we need
to take a closer look at orders, profit, and sales as well as individual customers.
It’s a lot to look at! But don’t worry, we’ll do this together, step by step. First, let’s
find out if some days of the week see higher volumes in sales and returns.
1. To extract the day of the week from the order_date, write out:
=TEXT(C2, "dddd") OR =TEXT([@[order_date]], "dddd")
2. Looking at profit, does profit margin affect whether or not something gets
returned? To find out, recalculate the profit margin (profit divided by sales) per
row. Insert a new column next to the profit column (Column N).
=N2/M2 or =[@profit]/[@sales]
Next, we will wrap the formula in IFERROR. This handles errors such as #DIV/0!,
which appear when a row has a NULL (blank) or zero sales value.
=IFERROR(formula,"")
3. Now, let’s look at individual customers to see if some customers return more
than others. You need to concatenate the order_info_id and the
order_id_number with a dash in between them to create a single order_id column.
Write out:
=[@order_info_id]&"-"&[@order_id_number] OR =A2&"-"&B2
4. Finally, let’s decipher sales volume! To help us categorize our sales without
relying on the exact dollar amount, we’ll categorize sale amounts above $500 as
“High” and below $500 as “Low.”
Now that you have sales categorized, does it make a difference to returns?
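For comparison, steps 2 through 4 can be sketched as small Python functions. This is an analogy to the Excel formulas, not a replacement for them. The function names and sample values are hypothetical, and the treatment of a sale of exactly $500 is an assumption, since the step only defines "above" and "below."

```python
def profit_margin(profit, sales):
    """Profit divided by sales; returns "" on failure, like =IFERROR(formula, "")."""
    try:
        return profit / sales
    except (TypeError, ZeroDivisionError):
        # Missing (None) or zero sales would otherwise raise an error.
        return ""


def order_id(order_info_id, order_id_number) -> str:
    """Join the two parts with a dash, like =A2&"-"&B2."""
    return f"{order_info_id}-{order_id_number}"


def sales_category(sales, threshold=500) -> str:
    """Label a sale "High" above the threshold, otherwise "Low".

    Assumption: a sale of exactly the threshold is labeled "Low".
    """
    return "High" if sales > threshold else "Low"


# Hypothetical sample rows.
margin = profit_margin(50, 200)
blank = profit_margin(50, None)
combined = order_id("CA", 1001)
label = sales_category(750)
```

Returning an empty string for failed rows keeps error values out of the column, at the cost of mixing text and numbers in it.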
Wrapping Up
Recap
Looking Ahead