Data Analysis: A Practical Introduction For Absolute Beginners
Data Analysis: A Practical Introduction For Absolute Beginners
Data Set
There are two different data sets for this lab:
Mod3Lab2a.csv
Mod3Lab2b.csv
Overview
In data analysis, a “join” is a combination two (or more) tables into a single table, based on related
columns in the tables. In this lab, we’ll practice manually performing four different types of data join in
Excel. Before we get started, let’s run through a quick refresher on the definitions. Here are four of the
main types of join:
Inner Join: From two tables of data, this type of join returns a new table that only includes matching
values from both tables. Any row or column that only shows up in one table (instead of both) will be
ignored and not added to the join.
Full Outer Join: Also known as simply an Outer Join, this is basically the exact opposite of an Inner Join:
It returns every single row and column from both tables, even if they don’t match the rows/columns in
the other table. If there are rows without matching values, the new join table will return a “null” or “NA”
entry for those cells (or simply be empty). A Full Outer Join will have the largest possible number of
results.
Left Join: A special type of Outer Join where the “left” table (i.e. the table that’s listed first) is favored.
That means it returns every row and column from the left table, but only matching rows from the right
table.
Right Join: A special type of Outer Join where the “right” table (i.e. the table that’s listed second) is
favored. It returns every row and column from the right table, but only matching rows from the left
table.
Exercise 1: Inner Joins
We’ll start by performing an Inner Join on the two data sets.
1. In a new blank worksheet in Excel Online, type TABLE 1 in cell A1, and type TABLE 2 in cell D1.
This’ll help keep your data sets straight.
2. Open both data sets in Excel. To do this, copy and paste the data from the first data set
(Mod3Lab2a.csv) below TABLE 1, and copy and paste the second data set (Mod3Lab2b.csv)
below TABLE 2.
Your two tables both contain “orderID” values — that’s the common variable between Table 1
and Table 2. Table 1 pairs these order IDs with their ship dates, while Table 2 pairs each order ID
with a product name. Since you’re “joining” the two tables, your new table should have columns
for order ID, ship date, and product. Type those in, like so:
For example, order ID number 1301 was shipped on 11.1.18 (according to Table 1), and that
same order included the Ultra Laptop XL product (according to Table 2, though note that order
1301 is in the second row of Table 2). The idea behind the join is that you want to have both
these pieces of information in a single handy table.
5. Now figure out which data to include in the new Inner Join table. It’s not as simple as just
copying all the data for both tables into the new one. Why not?
Check out those order ID numbers. Notice how both tables include the ID numbers 1301, 1302,
1303, and 1304 — but Table 1 includes 1305 as well, which is not in Table 2. Similarly, there’s an
order ID number 1201 in Table 2 that doesn’t show up in Table 1. Since an Inner Join should
ignore values that only show up in one table, you don’t want to include 1305 or 1201 in the new
table. Before looking at the next step below, see if you can copy and paste the correct data from
Tables 1 and 2 to form the new Inner Join table. Here’s a hint, using that graphic from earlier:
6. Got it? Here’s what your Inner Join should look like:
We only included the info for orders 1301, 1302, 1303, and 1304 because the two original tables
had those order numbers in common. Orders 1305 (in the bottom row of Table 1) and 1201 (in
the top row of Table 2) get ignored.
Obviously this can get a lot more complex with larger data sets, but that’s the basic rundown of
an Inner Join!
1. In the same Excel worksheet you used in Exercise 1, set up a new table for your Full Outer Join.
2. Once again, you’re “joining” the two tables, so your new table should have columns for order ID,
ship date, and product.
3. Think it through. A Full Outer Join includes all columns and rows from both tables, with “NA”
(null) values for any cell that only shows up in one table.
You want to “fully” join both tables this time, which means you do want to include those rogue
orders 1305 from Table 1 and 1201 from Table 2. Try it out for yourself before looking down at
the answer below. Here’s the graphic again:
4. Here’s what your Full Outer Join table should look like:
Notice how there’s no ship date for order 1201 (because that order doesn’t show up in Table 1),
and there’s no product name for order 1305 (because that one doesn’t show up in Table 2).
Instead, we add an “NA” or “null” value in those two spots.
Exercise 3: Left Joins
Next up, we’ll run through a Left Join on our two original data sets.
2. In the same Excel worksheet you used in the first two exercises, set up a new table for your Left
Join (down below the joins from the other exercises). Once again, you’ll want to include all three
variables in your new table because you’re joining the two originals.
3. Which data should you include in the new table? In a Left Join, the “left” table (i.e. the table
that’s listed first) is favored. That’s Table 1 in this case. Your new table should return every row
and column from Table 1, but only matching rows from Table 2. It might be helpful to think of it
this way: You’re treating Table 1 like a Full Outer Join, but you’re treating Table 2 like an Inner
Join.
Try to find the answer yourself before looking ahead at the next step. Here’s the graphic again
for a Left Join:
4. Ready? Here’s what your Left Join table should look like:
Notice that we included all five orders from Table 1 (the “left” table), including that weird order
1305 that didn’t show up in Table 2. But we skipped the extra order from Table 2 (1201) because
the “right” table is not favored.
3. Time to think it through again. In a Right Join, the “right” table (i.e. the table that’s listed
second) is favored. That’s Table 2 this time around. The Right Join returns every row and column
from the second table, but only matching rows from the first table. So in this case, it’s kind of
like you’re doing an Inner Join on Table 1, and a Full Outer Join on Table 2.
You know the drill: Try to copy and paste the correct data from the original tables to create your
new Right Join. Don’t look ahead at the next step until you’ve tried it. Here’s the graphic
showing a Right Join: