Set 9

The document discusses data quality concepts in Excel including splitting columns, concatenating cells, generating unique lists, random number generation, validating data types, and locking cells. It provides tasks to practice these concepts on various worksheets including splitting addresses, combining names, generating unique shipping modes and customers, randomly selecting winners, validating data types and dates, and locking formula cells.

Uploaded by

Sona Kumar

DISC 112: Assignment 9

(Total: 20 Marks)

Follow the instructions in the given order to complete the assignment:

1. Open the provided ‘Lab 9’ Excel workbook.
2. Save the workbook to the D drive and name it with the assignment number and your roll
number (e.g. A9-2020-11-001).
3. This is an individual assignment.

Data Quality

So far we have worked with many different types of data files and analyzed them, but we have not yet dealt with
questions of data quality and data manipulation. Data quality is perhaps one of the most
important aspects of a data project. Ensuring your data is in the right format is key to performing all your
analysis. If the data isn’t right, the analysis most certainly won’t be.

In our previous examples, we have always had clean, good data where we assumed we had no issues,
and the data was usually in a good format, ready to be analyzed and used. This may not always be the
case. In real-world scenarios, you may not get a pretty and clean workbook to start your work. Instead,
you will have to filter out relevant data, clean it to make sure there are no issues and combine it from
multiple sources to get the right dataset before you can even start your work.

So let’s spend some time practicing these concepts in Excel within this Lab.

TASK 1 (5 Marks):
a) Split Columns

Open the worksheet “Split Columns”. Here, we will learn to separate data given in one column and
divide it across multiple columns using the Text to Columns option in your Data Tab.

In the first example, we have customer IDs given to us. The letters stand for an area code and the
numbers stand for a registration code. Here they are given as one value, e.g. BH-11710. But what if we
need both these items separately for some analysis?

Use the Text to Columns option to get the following output:

In the second example, we have vendor emails. All the emails start with the vendor’s name, followed by
the domain. Now we want to extract just the vendor’s name from the entire email and place it in
another column, as shown below, using the Text to Columns option:
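As a cross-check outside Excel, the same delimiter-based split can be sketched in a few lines of Python. This is only an illustration, not part of the assignment: the function names are hypothetical, and the email address is a made-up sample because the worksheet values are not reproduced here. Only `BH-11710` comes from the text above.

```python
def split_customer_id(customer_id):
    """Split an ID such as 'BH-11710' at the first hyphen,
    similar to Text to Columns with '-' as the delimiter."""
    area_code, registration_code = customer_id.split("-", 1)
    return area_code, registration_code


def vendor_name(email):
    """Return the part of an email before '@' (assumed to be the vendor's name)."""
    return email.split("@", 1)[0]


print(split_customer_id("BH-11710"))    # ('BH', '11710')
print(vendor_name("acme@example.com"))  # hypothetical address; prints acme
```

The registration code is kept as text here, so any leading zeros would survive the split.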

b) Concatenate

Open the “Combine Names” worksheet. Here we have the first and last names for a few people, but
you want the entire name in one column. Use the CONCAT function with cell
references to achieve this. (Hint: anything written inside double quotation marks can also be concatenated, or joined, with
other strings. This includes a space “ ”, letters such as “abc” and symbols such as “@”.)

Your output should be as follows:
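The joining logic can be sketched in Python for comparison. In Excel, a formula along the lines of `=CONCAT(A2, " ", B2)` would do the same job, assuming first names sit in column A and last names in column B (the actual layout of the worksheet is not stated here). The function name `full_name` is hypothetical.

```python
def full_name(first, last):
    # The literal " " plays the same role as the quoted space passed to CONCAT.
    return first + " " + last


print(full_name("Claire", "Gute"))  # Claire Gute
```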

TASK 2 (5 Marks):
a) Unique Lists

Our dataset has many entries, and some categories or names may repeat across them.
Customers may purchase multiple times, so their names will repeat. The same product
may be bought by multiple customers, so that will also repeat. And so on.

Here for example we have a column “Ship Mode”. While a glance can tell us that “Second Class” and
“Standard Class” are two types of shipping modes, we can’t be certain of how many different types of
modes exist.

To get this specific information, we will use the Remove Duplicates option in your Data tab.

This allows us to get unique lists, which can help us summarize all our information, including the types of
product categories we have, the segments we have, the cities/states/regions we operate in and our
unique customers. Generate these unique lists in the “Unique Lists” worksheet. The first example has
been done for you as shown below:

(Optional Task) Now count how many customers this company has based on the unique customers list
you generated. This gives you the accurate number of customers the company has, rather than just
running a count function on the entire customers column in the “Sales” worksheet.
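To see why the count of a unique list differs from a count of the whole column, here is a minimal Python sketch. The helper name `unique_in_order` and the sample column are assumptions made for illustration; only the three ship mode labels come from the assignment.

```python
def unique_in_order(values):
    """Drop repeats while keeping the first occurrence of each value."""
    return list(dict.fromkeys(values))


# Made-up sample column; the real "Sales" worksheet has many more rows.
ship_modes = ["Second Class", "Standard Class", "Second Class",
              "First Class", "Standard Class"]

unique_modes = unique_in_order(ship_modes)
print(unique_modes)                         # ['Second Class', 'Standard Class', 'First Class']
print(len(ship_modes), len(unique_modes))   # 5 3
```

Counting the raw column gives 5, while counting the unique list gives 3, which is the number of distinct categories.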

b) Random Number Generation

You can copy the unique customer list from your previous task and paste it in the given area in the
“Random Numbers” worksheet. The serial numbers given against the names act as unique
identifiers (e.g. Serial Number 1 means Claire Gute, and so on).

Now we want to randomly select 5 lucky draw winners out of our customer base. To do that, you will
use the RANDBETWEEN function. (You can explore the RAND and RANDBETWEEN functions first to see
how they work.)

1. In column A, generate the random numbers using the RANDBETWEEN function (the range
should be that of the serial numbers against your customers).
2. Copy the results and use Paste Special to paste them as values only in column B. These
numbers are now static and won’t change. (Note: the RANDBETWEEN function
recalculates whenever you enter anything in the worksheet, but we can’t have our
lucky draw winners keep changing, so we keep them static.)
3. Use VLOOKUP in column C to locate the customer names for the 5 shortlisted customers.

Great! You have randomly selected 5 lucky draw winners.
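The three steps above can be sketched in Python as follows. This is a rough analogue under stated assumptions: serial numbers are assumed to run from 1 to the number of customers, the function name `draw_winners` is hypothetical, and every name except Claire Gute is a placeholder.

```python
import random


def draw_winners(customers, how_many=5, seed=None):
    """Pick serial numbers at random and look up the matching names.

    customers maps serial number -> name, with serials assumed to run 1..N.
    """
    rng = random.Random(seed)  # a fixed seed keeps the draw static, like pasting values
    # randint includes both endpoints, as RANDBETWEEN does; repeats are possible.
    serials = [rng.randint(1, len(customers)) for _ in range(how_many)]
    # The dictionary lookup plays the role of VLOOKUP on the serial number.
    return [(serial, customers[serial]) for serial in serials]


# Only "Claire Gute" appears in the assignment; the other names are placeholders.
customers = {1: "Claire Gute", 2: "Customer B", 3: "Customer C", 4: "Customer D",
             5: "Customer E", 6: "Customer F", 7: "Customer G", 8: "Customer H"}

for serial, name in draw_winners(customers, seed=9):
    print(serial, name)
```

Because each number is drawn independently, the same serial number can come up more than once, both here and with RANDBETWEEN. If the five winners must be distinct, check the result for repeats and draw again.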

TASK 3 (10 Marks):

a) Spotting Data Entry Errors

Data entry is usually the source of data errors. You can imagine how someone might make a
mistake if they had to manually enter all the data given in your main sheet. Humans are prone to error,
and that is bad news for your data analysis projects.

Note the different data types in the entire dataset. Some of them are numbers, some of them are dates,
some are a combination and some of them are just the same category repeating itself.

Let’s try to spot the data entry errors in the given “Sales” worksheet by using the Data Validation options in
the Data tab.

This allows you to set criteria on your columns. For example, Row ID can only contain numbers, Customer
Names cannot contain numbers, and so on.

Once you have set these criteria, you can pick the Circle Invalid Data option to highlight the wrong
values.

Set criteria for the following and circle the invalid data:

1. Row ID to only be numbers
2. Quantity column to be between 0 and 1000
3. Order Date to only be a date

Note: The circles will disappear when you save or reopen the file. That is OK. The rules are saved, and
as soon as you click the circling option again, you will see whether the rules have been implemented accurately.
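The three criteria can also be written out as explicit checks, which may help clarify what each rule accepts and rejects. The sketch below is a Python illustration only. The function names and the two sample rows are made up, and the Quantity range is treated as inclusive of 0 and 1000, matching how a "between" rule behaves.

```python
from datetime import date


def is_number(value):
    # bool is excluded because Python treats True/False as integers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def invalid_fields(row):
    """Return the column names in a row that break the three rules above."""
    bad = []
    if not is_number(row.get("Row ID")):
        bad.append("Row ID")
    quantity = row.get("Quantity")
    if not (is_number(quantity) and 0 <= quantity <= 1000):
        bad.append("Quantity")
    if not isinstance(row.get("Order Date"), date):
        bad.append("Order Date")
    return bad


# Made-up rows: the first is clean, the second breaks every rule.
rows = [
    {"Row ID": 1, "Quantity": 3, "Order Date": date(2022, 1, 5)},
    {"Row ID": "2a", "Quantity": 1500, "Order Date": "5th Jan"},
]
for row in rows:
    print(invalid_fields(row))
# []
# ['Row ID', 'Quantity', 'Order Date']
```

The returned list is the counterpart of the red circles: it names the cells that would be flagged.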

How do we deal with this invalid data? This is a longer discussion we can have if you are curious.

b) Data validation rules

How about we stop these data entry issues from happening in the first place? We can implement
data validation rules on columns to stop the data entry staff from entering incorrect data.

Move to the “Data Entry” worksheet. The column titles are already in place for you. Implement the
following rules on the entire column:

• Order Date should only be a date in January 2022. Add an Error Message saying: “Please enter
dates for January 2022 only”
• Customer Name should be text only. Add an appropriate Error Message.
• Ship Mode categories should be limited to Second Class, Standard Class and First Class as shown
below:

Hint: You already have the unique values list from the previous parts.
• Product Name categories should be limited to the list of unique product names as shown below:

• Profit should be less than or equal to the value in the Sales column. Profit can never be greater
than the sales amount. Add an appropriate Error Message.

Note how useful this can be if you implement all these rules in a worksheet meant for data entry. You
nearly eliminate the chance of data entry errors while also making it easier for the person entering data to
complete their job.

c) Locking cells

Continue on the data entry sheet. We will try to implement a different type of error prevention
technique.

Product ID: The Product ID should always match the respective product name and be filled in
automatically as soon as you select the product name. This has already been implemented using a unique
list of product names, their respective codes, and a VLOOKUP function in the Product
ID column (you can have a look at the formula itself).
Now if you enter a Product Name, the Product ID should appear by itself. Try it. But you may notice
that the formula cells are open to edits, so the data entry person can easily break the formula,
leading to errors. Let’s lock the formula cells.