Set 9
(Total: 20 Marks)
Data Quality
So far we have worked with many types of data files and analyzed them, but we haven’t yet dealt with
questions of data quality and data manipulation. Data quality is perhaps one of the most
important aspects of a data project. Ensuring your data is in the right format is key to performing all your
analysis. If the data isn’t right, the analysis most certainly won’t be.
In our previous examples, we have always had clean, good data where we assumed we had no issues,
and the data was usually in a good format, ready to be analyzed and used. This may not always be the
case. In real-world scenarios, you may not get a pretty and clean workbook to start your work. Instead,
you will have to filter out relevant data, clean it to make sure there are no issues and combine it from
multiple sources to get the right dataset before you can even start your work.
So let’s spend some time practicing these concepts in Excel within this Lab.
TASK 1 (5 Marks):
a) Split Columns
Open the worksheet “Split Columns”. Here, we will learn to separate data given in one column and
divide it across multiple columns using the Text to Columns option in your Data Tab.
In the first example, we have customer IDs given to us. The letters stand for an area code and the
numbers stand for a registration code. Here they are given as one value, e.g. BH-11710. But what if we
need both these items separately for some analysis?
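For anyone who wants to cross-check the result outside Excel, the same split can be sketched in Python. This is only an illustration of the idea behind Text to Columns with a hyphen delimiter; the function name split_customer_id is made up for this example and is not part of the lab workbook.

```python
def split_customer_id(customer_id: str) -> tuple[str, str]:
    """Split an ID such as 'BH-11710' at its first hyphen.

    The registration code is kept as text so that any leading zeros survive.
    """
    area_code, registration_code = customer_id.split("-", 1)
    return area_code, registration_code


# The example ID from the worksheet description.
print(split_customer_id("BH-11710"))  # ('BH', '11710')
```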
b) Concatenate
Open the “Combine Names” worksheet. Here we have the First and Last Names for a few people. But
you want the entire name given in one column. Use the CONCAT function to achieve this with cell
referencing. (Hint: anything written between double quotation marks can also be concatenated, or joined, with
other strings. This includes spaces such as “ ”, letters such as “abc” and symbols such as “@”.)
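The idea of joining two cells with a literal string in between can be sketched in Python as follows. The function name full_name and its separator parameter are assumptions made for this illustration; in Excel the same effect comes from passing a quoted space between the two cell references.

```python
def full_name(first: str, last: str, separator: str = " ") -> str:
    """Join a first and last name with a literal separator between them."""
    return first + separator + last


print(full_name("Claire", "Gute"))  # Claire Gute
```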
Our dataset has multiple entries, which may or may not be repeating some categories/Names.
Customers may be purchasing multiple times; hence their names will be repeating. The same product
will be bought by multiple customers, so that will also be repeating. And so on.
Here for example we have a column “Ship Mode”. While a glance can tell us that “Second Class” and
“Standard Class” are two types of shipping modes, we can’t be certain of how many different types of
modes exist.
To get this specific information, we will use the Remove Duplicates option in the Data tab.
This allows us to get unique lists, which can help us summarize all our information, including the types of
product categories we have, the segments we have, the cities/states/regions we operate in and our
unique customers. Generate these unique lists in the “Unique Lists” worksheet. The first example has
been done for you as shown below:
(Optional Task:) Now count how many customers this company has based on the unique customers list
you generated. This gives you the accurate number of customers the company has rather than just
running a count function on the entire customers column in the “Sales” worksheet.
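The difference between counting every row and counting unique values can be sketched in Python. The sample list below is hypothetical (it only reuses the three ship mode names mentioned in this lab), and unique_in_order is a helper name invented for this example. Like Remove Duplicates, it keeps the first occurrence of each value.

```python
def unique_in_order(values):
    """Return each distinct value once, in order of first appearance."""
    return list(dict.fromkeys(values))


# Hypothetical sample of a "Ship Mode" column with repeats.
ship_modes = [
    "Second Class", "Second Class", "Standard Class",
    "Second Class", "First Class", "Standard Class",
]

unique_modes = unique_in_order(ship_modes)
print(unique_modes)       # ['Second Class', 'Standard Class', 'First Class']
print(len(ship_modes))    # 6 rows in total
print(len(unique_modes))  # 3 distinct ship modes
```

The same comparison applies to the customer column: the row count overstates the number of customers, while the length of the unique list gives the actual figure.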
You can copy the unique customer list from your previous task and paste that in the given area in the
“Random Numbers” worksheet. The serial numbers given against the names link them as unique
identifiers (e.g. Serial Number 1 means Claire Gute, and so on…).
Now we want to randomly select 5 lucky draw winners out of our customer base. To do that, you will
use the RANDBETWEEN function. (You can explore the RAND and RANDBETWEEN functions first to see
how they work).
1. In column A, generate the random numbers using the RANDBETWEEN function (the range
should be that of the serial numbers against your customers).
2. Copy the results and use Paste Special (values only) to paste them into column B. These
numbers are now static and won’t keep changing (Note: the RANDBETWEEN function
recalculates whenever you enter anything in the worksheet, but we can’t have our
lucky draw winners keep changing, so we keep them static).
3. Use VLOOKUP in column C to locate the customer names for the 5 customers shortlisted.
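The three steps above can be sketched in Python as a rough analogue. Everything except the name Claire Gute is hypothetical: the other customer names are placeholders, the function draw_winners is invented for this example, and the serial numbers are assumed to run without gaps. Storing the drawn numbers in a list plays the role of pasting values, and the dictionary lookup plays the role of VLOOKUP.

```python
import random


def draw_winners(customers, draws=5, seed=None):
    """Draw serial numbers at random and look up the matching names.

    customers maps serial number -> customer name (serials assumed contiguous).
    Like RANDBETWEEN, randint includes both endpoints and may repeat a number.
    """
    rng = random.Random(seed)
    low, high = min(customers), max(customers)
    # Step 1 and 2: generate the numbers once and keep them as fixed values.
    serials = [rng.randint(low, high) for _ in range(draws)]
    # Step 3: look up each serial number, as VLOOKUP would.
    return [(serial, customers[serial]) for serial in serials]


# Serial 1 is from the lab text; the remaining names are placeholders.
customers = {1: "Claire Gute"}
customers.update({n: f"Customer {n}" for n in range(2, 11)})

winners = draw_winners(customers, seed=2022)
for serial, name in winners:
    print(serial, name)
```

Because each number is drawn independently, the same customer can come up twice. If five distinct winners are required, redraw any repeats (in Python, random.sample would avoid them from the start).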
Data entry is usually the source of data errors. You can imagine how easily one may end up making a
mistake if they have to manually enter all the data given in your main sheet. Humans are prone to error.
But this means bad news for your data analysis projects.
Note the different data types in the entire dataset. Some of them are numbers, some of them are dates,
some are a combination and some of them are just the same category repeating itself.
Let’s try to spot the data entry errors in the given “Sales” worksheet by using Data Validation options in
the Data Tab.
This allows you to set criteria on your columns. For example, Row ID can only contain numbers, Customer
Names can contain no numbers, etc.
Once you have set these criteria, you can pick the Circle Invalid Data option to highlight the wrong
values.
Set criteria for the following and circle the invalid data:
Note: The circles will disappear when you save or reopen the file. That is OK. The rules will be saved and
as soon as we click the circling option again, we will see if the rule has been implemented accurately.
How do we deal with this invalid data? This is a longer discussion we can have if you are curious.
Move to the “Data Entry” worksheet. The column titles are already in place for you. Implement the
following rules on the entire column:
Order Date should only be a date for January 2022. Add an Error Message saying: “Please enter
dates for January 2022 only”
Customer Name should be text only. Add an appropriate Error Message.
Ship Mode categories should be limited to Second Class, Standard Class and First Class as shown
below:
Hint: You already have the unique values list from the previous parts.
Product Name categories should be limited to the unique product names, as shown below:
Profit should be less than or equal to the value in the sales column. Profit can never be greater
than the sales amount. Add an appropriate error message.
Note how useful this can be if you implement all these rules in a worksheet meant for data entry. You
almost eliminate the chance of data entry issues while also making it easier for the person entering data to
complete their job.
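The logic behind the rules above can be sketched in Python as a set of checks applied to one row. This is only an illustration of the rule logic, not how Excel implements Data Validation. The function name validate_row, the wording of every error message except the January 2022 one, and the sample rows are assumptions made for this example. The Product Name rule is left out because the product list is not given in this handout.

```python
from datetime import date

# Allowed categories, taken from the Ship Mode rule above.
SHIP_MODES = {"Second Class", "Standard Class", "First Class"}


def validate_row(order_date, customer_name, ship_mode, sales, profit):
    """Return a list of error messages; an empty list means the row is valid."""
    errors = []
    if not (date(2022, 1, 1) <= order_date <= date(2022, 1, 31)):
        errors.append("Please enter dates for January 2022 only")
    if not customer_name.strip() or any(ch.isdigit() for ch in customer_name):
        errors.append("Customer Name must be text with no numbers")
    if ship_mode not in SHIP_MODES:
        errors.append("Ship Mode must be Second Class, Standard Class or First Class")
    if profit > sales:
        errors.append("Profit cannot be greater than Sales")
    return errors


# A valid row and a row that breaks every rule (both hypothetical).
good = validate_row(date(2022, 1, 15), "Claire Gute", "Second Class", 100.0, 20.0)
bad = validate_row(date(2022, 2, 1), "Cl4ire", "Express", 50.0, 75.0)
print(good)  # []
for message in bad:
    print(message)
```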
c) Locking cells
Continue on the “Data Entry” worksheet. We will try to implement a different type of error prevention
technique.
Product ID: Product ID should always match the respective product name and get automatically
filled in as soon as you select the product name. This has already been implemented using a unique
list of product names, their respective codes, and a VLOOKUP function implemented in the product
ID column (You can have a look at the formula itself).
Now if you enter a Product Name, the Product ID should appear automatically. Try it. But you may notice
that the formula cells are open to edits, so the data entry person can easily break the formula,
leading to errors. Let’s lock the formula cells.
The result should be such that if you try to type anything in the Product ID column, it should give
you an error.
Cell locking can be done for many other use cases. Feel free to experiment on your own.
Functions Library
Tracing Function dependents and precedents
Protection options
Sparklines
Excel Templates