Lab6 - S1, S2
Our learning lab is about generating more insights from data, understanding it better and
representing it in a better manner for others to see.
Data Quality
We have dealt a lot with different types of data files and analyzed them so far. But we haven’t dealt
with questions regarding data quality and data manipulation yet. Data quality is perhaps one of the
most important aspects of a data project. Ensuring your data is in the right format is key to performing
all your analysis. If the data isn’t right, the analysis most certainly won’t be.
In our previous examples, we have always had clean, good data where we assumed we had no
issues, and the data was usually in a good format, ready to be analyzed and used. This may not
always be the case. In real-world scenarios, you may not get a pretty and clean workbook to start
your work. Instead, you will have to filter out relevant data, clean it to make sure there are no issues
and combine it from multiple sources to get the right dataset before you can even start your work.
So let’s spend some time practicing these concepts in Excel within this Lab.
Text Functions
In this section we will use different functions to edit and work with text. For this part you will be
using the ‘Student Data’ worksheet as your data source. (Some functions that may come up are TRIM(),
UPPER(), PROPER(), RIGHT(), LEFT() and ‘&’; more can be found in Excel help.)
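For readers who want to see what these functions do outside Excel, here is a minimal Python sketch of rough equivalents. The sample value is made up for illustration, and the Python string methods are close analogues rather than exact copies of the Excel functions.

```python
# Rough Python analogues of the Excel text functions named above.
# The sample value is hypothetical and not taken from 'Student Data'.
raw = "  muhammad   ahmed "

trimmed = " ".join(raw.split())   # like TRIM: strips the ends, collapses inner spaces
upper = trimmed.upper()           # like UPPER
proper = trimmed.title()          # like PROPER: capitalizes each word
left3 = trimmed[:3]               # like LEFT(text, 3)
right5 = trimmed[-5:]             # like RIGHT(text, 5)
joined = left3 + "-" + right5     # like the '&' operator

print(trimmed, upper, proper, left3, right5, joined, sep=" | ")
```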
1. Create a new worksheet called ‘Student Details’. [1]
2. Add column headings: #, Text1, Text2, Text3, Text4, Full Name, First Name, Last Name, ID,
Email, Code. [1]
3. Next, go to the ‘Student Data’ worksheet. There are supposed to be 55 students in the class, but
the list is longer. Sort the list and you will see that some names are duplicated. Use the
‘Remove Duplicates’ function on the Data tab of the Excel ribbon to remove the duplicates in
the list. You will end up with 55 students.
4. In the ‘Student Data’ worksheet, separate the student name and ID data using the ‘Text to
Columns’ function. This can be found on the Data tab of the Excel ribbon. Copy and paste the name
information to column Text1 in the ‘Student Details’ worksheet. Also copy and paste the ID information
to Text2. [2]
5. In the ‘Full Name’ column, use a text-related Excel function that capitalizes the first letter of
each word in the name. E.g. muhammad ahmed -> Muhammad Ahmed. [2]
6. Do another ‘Text to Columns’ action on the original data to separate the first name from the
last name in ‘Student Data’. Copy and paste the first name and last name to columns Text3 and Text4
in ‘Student Details’. In the ‘First Name’ and ‘Last Name’ columns, use a text function to make all
the letters upper case. [2]
7. In ‘ID’, use a text-related function to display only the last 4 digits of each student’s ID. [2]
8. In ‘Email’, use text functions (not ‘+’) to help create email IDs of the form
‘[email protected]’. [2]
9. For ‘Code’, use the IFS function (along with some logical functions) such that if the first
name begins with A or B, then assign them as 1, if it begins with C or D then assign 2 and so
on. [3]
a. Use this guide when plugging values into the formula:
   Starting Letter   Code
   A-B               1
   C-D               2
   E-F               3
   G-H               4
   I-J               5
   K-L               6
   M-N               7
   O-P               8
   Q-R               9
   S-T               10
   U-V               11
   W-X               12
   Y-Z               13
10. Fill in the empty serial number column (#), for example by numbering the students from 1 to 55.
11. Create a new sheet called ‘Free Lunch Offer’. Label column A as ‘Random Draw’. Use
random functions (see Excel help for different types) to generate random numbers in column
A that are between 1 and 55 for 15 rows. [2]
12. In column B, copy and paste as values the data from column A. This keeps the numbers fixed and
static, unlike column A, which recalculates. [1]
13. Name column C ‘Free Lunch Offer’. Use VLOOKUP to find student names from the ‘Student
Details’ worksheet based on serial number to find out who gets the free lunch. [2]
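The logic behind steps 3 to 13 can also be sketched in Python, which may help when checking your Excel results. This is only an illustration under stated assumptions: the raw rows, the "name,id" layout and the sample names are hypothetical, the letter code uses arithmetic on the letter's position instead of an IFS chain, and the random seed is fixed only so the draw is repeatable. The Email step is left out because it depends on the address pattern given in step 8.

```python
import random

# Hypothetical raw rows; the real 'Student Data' layout may differ.
raw_rows = [
    "muhammad ahmed,20231234",
    "sara khan,20235678",
    "muhammad ahmed,20231234",   # duplicate entry
    "bilal yousuf,20239012",
]

def code_for(first_name):
    """Letter-pair code from the guide: A-B -> 1, C-D -> 2, ..., Y-Z -> 13."""
    return (ord(first_name[0].upper()) - ord("A")) // 2 + 1

# Step 3: remove duplicates, keeping the first occurrence of each row.
unique_rows = list(dict.fromkeys(raw_rows))

students = []
for serial, row in enumerate(unique_rows, start=1):   # step 10: serial numbers
    name, student_id = row.split(",")                 # step 4: split name and ID
    first, last = name.split(" ", 1)                  # step 6: split first and last name
    students.append({
        "#": serial,
        "Full Name": name.title(),                    # step 5: like PROPER
        "First Name": first.upper(),                  # step 6: like UPPER
        "Last Name": last.upper(),
        "ID": student_id[-4:],                        # step 7: like RIGHT(id, 4)
        "Code": code_for(first),                      # step 9
    })

# Steps 11-13: draw serial numbers, then look up the matching names.
# A dictionary keyed by serial number plays the role of VLOOKUP here.
by_serial = {s["#"]: s["Full Name"] for s in students}
rng = random.Random(0)                                # fixed seed, an assumption
draws = [rng.randint(1, len(students)) for _ in range(3)]
winners = [by_serial[d] for d in draws]

for student in students:
    print(student)
print(draws, winners)
```

Because `draws` is computed once and stored, it behaves like the pasted values in column B rather than the recalculating formulas in column A.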
TASK 2 (15 Marks):
1. Spotting Data Entry Errors
Data entry is usually the source of data errors. You can imagine how one may end up making a
mistake if they have to manually enter all the data given in your main sheet. Humans are prone to
error, and this means bad news for your data analysis projects.
Note the different data types in the entire dataset. Some of them are numbers, some of them are
dates, some are a combination and some of them are just the same category repeating itself.
Let’s try to spot the data entry errors in the given “Sales” worksheet by using Data Validation
options in the Data Tab.
This allows you to set criteria on your columns. For example, Row ID can only contain numbers,
Customer Names can contain no numbers, and so on.
Once you have set these criteria, you can pick the Circle Invalid Data option to highlight the wrong
values.
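The idea of attaching a rule to each column and then flagging the cells that break it can be sketched in Python as follows. The rows and names are hypothetical stand-ins for the "Sales" worksheet, and the two rules are simply the examples mentioned above.

```python
# Hypothetical rows standing in for the "Sales" worksheet.
rows = [
    {"Row ID": "1", "Customer Name": "Aisha Malik"},
    {"Row ID": "2a", "Customer Name": "Omar Farooq"},
    {"Row ID": "3", "Customer Name": "Hina 4 Raza"},
]

# One rule per column: each function returns True when the value is valid.
rules = {
    "Row ID": lambda value: value.isdigit(),
    "Customer Name": lambda value: not any(ch.isdigit() for ch in value),
}

# Collect (row index, column) pairs that break a rule,
# similar in spirit to circling invalid data.
invalid = [
    (index, column)
    for index, row in enumerate(rows)
    for column, is_valid in rules.items()
    if not is_valid(row[column])
]

print(invalid)
```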
Set criteria for the following and circle the invalid data:
Note: The circles will disappear when you save or reopen the file. That is OK. The rules will be saved,
and as soon as we click the circling option again, we will see whether the rule has been implemented
accurately.
How do we deal with this invalid data? This is a longer discussion we can have if you are curious.
How about we stop these data entry issues from happening in the first place? We can implement these
data validation rules on columns to stop the data entry staff from entering incorrect data.
Move to the “Data Entry” worksheet. The column titles are already in place for you. Implement the
following rules on the entire column:
Order Date should only be a date for January 2022. Add an Error Message saying: “Please
enter dates for January 2023 only” [2]
Customer Name should be text only. Add an appropriate Error Message. [1]
Ship Mode categories should be limited to Second Class, Standard Class and First Class as
shown below: [2]
Hint: You might be getting repeats of the same values. How can you fix this? Create another
sheet where you can copy the ship modes from the sales data, then use the functions you
learned earlier in this lab to keep only unique values and use that as the source.
Product Name categories should be limited to the unique product names as shown
below: [2]
Profit should be less than or equal to the value in the sales column. Profit can never be
greater than the sales amount. Add an appropriate error message. [1]
Note how useful this can be if you implement all these rules in a worksheet meant for data entry.
You almost eliminate the chance of data entry issues while also making it easier for the person
entering data to complete their job.
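The entry rules above amount to a set of checks applied before a row is accepted. A minimal Python sketch of that idea follows. The function name, the error wording, the raw ship mode list and the sample values are all hypothetical, and the allowed year is a placeholder constant set to 2022 as an assumption.

```python
from datetime import date

ALLOWED_YEAR, ALLOWED_MONTH = 2022, 1   # placeholder year, an assumption

# Hypothetical raw ship modes with repeats; a set keeps only unique values,
# which is the same idea as the hint about building a unique source list.
raw_ship_modes = ["Second Class", "Standard Class", "First Class", "Standard Class"]
SHIP_MODES = set(raw_ship_modes)

def check_entry(order_date, customer_name, ship_mode, sales, profit):
    """Return a list of error messages; an empty list means the row is accepted."""
    errors = []
    if (order_date.year, order_date.month) != (ALLOWED_YEAR, ALLOWED_MONTH):
        errors.append("Order Date is outside the allowed month")
    if any(ch.isdigit() for ch in customer_name):
        errors.append("Customer Name must be text only")
    if ship_mode not in SHIP_MODES:
        errors.append("Ship Mode is not in the allowed list")
    if profit > sales:
        errors.append("Profit cannot be greater than Sales")
    return errors

good = check_entry(date(2022, 1, 15), "Aisha Malik", "First Class", 100.0, 20.0)
bad = check_entry(date(2022, 2, 1), "Omar 7", "Same Day", 10.0, 25.0)
print(good)
print(bad)
```

Unlike Excel's validation, which rejects a value as it is typed, this sketch collects every problem in a row and reports them together.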
3. Locking cells
Continue on the data entry sheet. We will try to implement a different type of error prevention
technique.
Product ID: Product ID should always match the respective product name and get
automatically filled in as soon as you select the product name. This has already been
set up using a unique list of product names, their respective codes, and a VLOOKUP
function in the Product ID column (you can have a look at the formula itself).
Now if you enter a Product Name, the Product ID should show up by itself. Try it. But you may
notice that the formula cells are open to edits, so the data entry person can easily mess up the
formula, leading to errors. Let’s lock the formula cells.
The result should be such that if you try to type anything in the Product ID column, it should
give you an error. [2]
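As a loose analogy outside Excel, a locked formula cell behaves like a value that is computed from another field and cannot be overwritten by hand. The Python sketch below illustrates that behavior only. The class name, the product names and the codes are hypothetical, and this is not how Excel implements sheet protection.

```python
# Hypothetical product list standing in for the unique names and codes table.
PRODUCT_IDS = {"Stapler": "OFF-1001", "Desk Lamp": "FUR-2002"}

class EntryRow:
    def __init__(self, product_name):
        self.product_name = product_name

    @property
    def product_id(self):
        """Looked up from the product name, like the VLOOKUP column."""
        return PRODUCT_IDS.get(self.product_name)

row = EntryRow("Stapler")
print(row.product_id)

# The property has no setter, so typing over it fails,
# much like typing into a locked cell.
try:
    row.product_id = "typed by hand"
except AttributeError:
    locked = True
else:
    locked = False
print(locked)
```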
Cell locking can be done for many other use cases. Feel free to experiment on your own.