4 Data Wrangling With Excel
Serious data analysis is usually done using specialized software. For several decades, the main
tools were SPSS, SAS and other commercial statistical software tools. In more recent years,
there has been wide acceptance of open-source tools such as R and Python. For modest
problems and quick-and-dirty tasks, Excel is an excellent platform. It is highly visual, making it
accessible to new and occasional users.
The focus of this text is introducing concepts and ways of thinking, and it is easier to illustrate
these in Excel. Most people have some familiarity with Excel and it is on everyone’s desktop.
Having strong Excel skills is an asset in every organization.
But Excel cannot handle large data sets without serious performance problems. It lacks
advanced tools for building models. For these reasons, many topics may be introduced with
Excel but then we show how to do the same tasks in Python.
Cleaning and transforming data to make it ready for analysis can be a tedious but necessary
first step. Data preparation can take up as much as 80% of your time in doing a data mining
project! We need to ensure that the data is what we think it is and arrange it in a format that
will make analysis and modelling easier. Data scientists often call this data wrangling.
Wrangling usually means to argue or wrestle, but is also used to describe “taming” or
“controlling” cattle or other animals. Some also call it data munging.
To illustrate a flat file, we will use the responses of 811 Saint Mary’s University students to a
national student survey in March 2010. Saint Mary's University is a primarily undergraduate
university, located in Halifax, Nova Scotia, Canada. It offers programs in Arts, Business and
Science, and had approximately 6,500 undergraduate students enrolled in winter 2010. The
data file is an Excel workbook with three sheets (the raw data, a list of questions, and a
summary of question responses).
Frequently, data files are saved as a csv file. These are simple text files in which each line is a
single record of observations. Each record includes the values for each variable, separated by a
comma. Hence the name Comma Separated Values or csv. csv files can be read by any data
analysis software, making them highly portable.
Excel can easily convert these text files to a data file. Select the column with the text, click on
the Data tab and select Text to Columns. Select Delimited, then select Comma and Finish. For
convenience, the data files used in this text are already saved as Excel files.
If our data file had been customer transactional information, it might be equally cryptic. The
firm’s database would store records in a number of fields (variables) and the field names are
often kept very short (often only 8 characters). Even if the field name is more detailed, there
may be many fields with similar names (e.g., customer address, billing address, shipping
address, alternate shipping address, ….) and the field name may not completely describe what
the field contains. The database documentation should have a dictionary. Unfortunately, the
meanings of some variables may change over time. The original data definition may be
grounded in the context in which it was created and over time, institutional memory is lost so
the definition may be incorrectly interpreted. It is very easy to extract data from the wrong
field.
If we go to the second sheet, we find that q5 means In what program are you currently
enrolled? Unfortunately, the sheet does not tell us what program corresponds to a response of
4, or 6, or any other value. Thankfully, the third sheet summarizes the responses for each
variable, including q5.
Excel has a variety of database tools for querying and extracting data from a database. Use of
these tools ensures that you do not damage the original dataset and can also keep track of
query and transformations done to the data. Exploration of these tools is beyond the scope of
this book.
You will be prompted to select what cells are part of the table. Excel chooses what it thinks is
the table (and it usually guesses correctly). Click OK. The sheet will change its appearance.
You can remove the shading if you like, but most people find it easier to read with different
shading in alternating rows.
Python is “open source” software that is free to download and is maintained and upgraded by a
large community of users. Although there are many books you can read, it is often best to start
with following a YouTube tutorial. I found the video, “Python Course for Excel Users”, produced
by freeCodeCamp.org, to be the best one to get me started.
This video got me started and I found the book “Python for Data Analysis”, by Wes McKinney,
O’Reilly Media Inc., 2018, taught me many of the Python features in more depth.
The version of Python used in the text is Python 3 and illustrations are done using Jupyter
Notebook. Both were downloaded from www.anaconda.com.
After downloading and installing Python, open the Anaconda Navigator and then launch Jupyter
Notebook.
The notebook looks nothing like an Excel workbook. Before starting work, name your
notebook. To the right of Jupyter, at the top of the screen, my page shows “Untitled3”. Simply
type over this and give your notebook a name, say “Student Survey Data 2010”.
Our Python notebook will be a record of all the “instructions” that we ask to be executed. We
want to read the data file “Student Survey Data 2010.xlsx”, so will need to tell Python to do
this. This instruction must be typed out and it will be recorded in our notebook. Think of every
point and click action that you do in Excel as an instruction. Now, rather than pointing and
clicking, you will need to write this out and it will be recorded. This takes some getting used to.
Python makes extensive use of libraries. These are like “add-ins” in Excel. We will be using
several libraries in this text.
In Excel, you select a file by first selecting a folder where the file is located. Python assumes
everything is in the current folder. This is the root folder on your computer unless told
otherwise. Let us start by creating a folder that will contain the data and the notebook for this
data analysis project. Creating folders and navigating among them requires us to import the os
library (it ships with Python, so nothing extra needs to be installed).
The blue means we are now in Command mode, whereas before we were in Edit mode. If you
press h, you will see the many shortcuts for things we can do in Command mode. There are a
lot!
To insert a new cell below the current one in Command mode, type b (and type a to insert a new cell above).
I want to create a new folder (directory) for my Python files and within it, one for this project.
Your root directory is likely not C:\\Users\\s1687448, so you will need to change that in the
script below.
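A minimal sketch of such a script (this version uses a relative path under the current folder so it runs on any machine; substitute your own absolute path, such as your directory under C:\\Users, if you prefer):

```python
import os

print(os.getcwd())                                     # show the current working directory
# exist_ok=True avoids an error if the folder already exists (an assumption added here;
# without it, os.makedirs raises an error when rerun)
os.makedirs("Student Survey Project", exist_ok=True)   # create the project folder
os.chdir("Student Survey Project")                     # move into the new folder
print(os.getcwd())                                     # confirm the move
```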
If you are like me, you will make typos and get error messages. Simply go back and edit the
script and run it again.
Notice that each of these instructions had some very fussy syntax.
The os library is treated as an object that has methods (functions) associated with it. We used
several “methods”:
• getcwd(): get the current working directory
• makedirs(): make a new directory (folder)
• chdir(): change the current working directory
Observe that we must tell Python both the object and the method, separated by a dot (e.g.
os.getcwd()). Also, methods are always followed by (), even if there is nothing in the brackets.
Copy the file Student Survey Data 2010.xlsx into your new Student Survey Project folder. You
can check that it is there by typing the line below.
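The check itself is one line; os.listdir() returns the contents of the current working directory:

```python
import os

# list the files and folders in the current working directory
print(os.listdir())
```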
Now we want to read the data set and start exploring it. There are 2 libraries that will be useful
to us, Numpy and Pandas. Numpy (Numerical Python) is a library of tools to do numerical
transformations. Pandas is a library of tools for data analysis. Its name is short for “panel
data”, a term for data sets in econometrics, as well as a play on words for “Python data
analysis”. To load these two libraries, type
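The import statements, using the conventional short aliases np and pd, are:

```python
import numpy as np
import pandas as pd
```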
We want to read in the Excel file Student Survey Data 2010.xlsx, so type
pd.read_excel("Student Survey Data 2010.xlsx")
Remember to enclose the file name in quotes.
The output shows the first and last 5 rows of data and the first 10 and last 10 of the 110
columns. Python automatically codes blanks as NaN (Not a Number). The first column is the
“Index” for the data frame.
In Python, a data set that has multiple columns is called a Data Frame. A single column of data
(excluding the index column) is called a Series.
Python only read in the first sheet of the Excel workbook. It did not read in the additional
sheets that contain valuable information that is useful for interpreting what the variables are and what
the values mean. We will need a separate dictionary, outside Python, to keep track of this.
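If you do want another sheet inside Python, read_excel accepts a sheet_name argument. A small sketch, using a hypothetical two-sheet workbook built on the fly (the file name demo.xlsx and its contents are made up for illustration):

```python
import pandas as pd

# build a small two-sheet workbook as a stand-in for the survey file
with pd.ExcelWriter("demo.xlsx") as writer:
    pd.DataFrame({"q5": [4, 6]}).to_excel(writer, sheet_name="Data", index=False)
    pd.DataFrame({"question": ["q5"],
                  "text": ["In what program are you currently enrolled?"]}
                 ).to_excel(writer, sheet_name="Questions", index=False)

# read_excel loads only the first sheet by default; pass sheet_name to get others
questions = pd.read_excel("demo.xlsx", sheet_name="Questions")
print(questions)
```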
The Jupyter notebook can function as a document editor in which we can save all kinds of text
information. We can even format it using a convention called Markdown. This is beyond the
scope of this text, but it is valuable for you to investigate if you plan to use Python regularly.
Columns in a data frame are often called features. The names of the columns (features) appear
at the top of each column. They should be in the top row of the Excel or CSV file.
The subsequent rows in the data file (Excel or CSV) are known as the observations. If a cell is
empty, it will appear with a value of NaN, for Not a Number. Python expects that each value in
a column will have the same data type. Common data types are Integer, Float (decimal
numbers), and String (text).
To explore this data frame, we need to give it a name. We can be exploring many different data
frames when we are in a Jupyter notebook, so naming them allows Python to understand what
we are referring to. In contrast to Excel where many actions are simply point and click, Python
wants us to write out the instructions associated with every action we take.
Let us call our data frame df_SSD. When naming objects in Python, do not use blanks. I have
used the underscore, _, to represent a blank. To remind me that this object is a data frame, I
started the name with df.
An “object” has associated attributes. For example, what is the shape of the dataframe? A
data frame always has an Index with unique values for each row/observation. I can get a list of
the names of the various columns and I can request what their data types (dtypes) are.
To obtain these attributes, I simply type the name of the data frame, followed by dot and the
attribute. For example:
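A sketch of these attribute requests, using a tiny made-up data frame in place of the real survey data:

```python
import pandas as pd

# tiny stand-in for the survey data; the real df_SSD has 811 rows and 110 columns
df_SSD = pd.DataFrame({"q1": [1, 2, 3], "q5": [4.0, 6.0, None]})

print(df_SSD.shape)     # (number of rows, number of columns)
print(df_SSD.index)     # the unique row labels
print(df_SSD.columns)   # the column (feature) names
print(df_SSD.dtypes)    # the data type of each column
```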
Also associated with objects, we have methods. Methods are functions that we can apply to an
object. We will explore many different methods throughout the text.
In Excel:
Extract this data by copying and pasting these columns into a new worksheet and call it Grad
Expect. Every time you make significant changes to a worksheet, you should copy the data to
a new sheet so that you maintain a trail of your changes.
The variable names should reflect what they represent. We wish to improve our data
understanding, so I recommend that you relabel your variables with descriptors that are short
but intuitive.
Note that in a Data Table, each variable name must be unique. If you try typing the same name
for two variables, Excel will change the second one to name2. You should update your data
dictionary with the old and new variable names. Keep a record of what changes you are making
to your data and include this in your spreadsheet on a separate sheet. Excel does not keep
notes of your changes.
https://youtu.be/bgNmRmszsQ0
In Python, we can extract the columns we want by copying them into a new data frame. Note
that we do not have the risks inherent in modifying a data set in Excel, since we never actually
change the original data file. Further, we can track all of our actions because they are always
saved in our notebook.
Let us name our new data frame df_Grad_Exp. To pull the columns we want out of df_SSD, we
must list the names of the columns we want. The list is enclosed in double square brackets, [[]].
The names of the columns are text strings, so we must enclose them in quotes ‘ ‘ or “ “. If some
columns have names that contain apostrophes, then you must use double quotes “ “, otherwise
Python will interpret the apostrophe as a single quote and give you an error. Below, is the
script to create df_Grad_Exp and then a display of what it looks like. Typing the name of an
object and then Ctrl+Enter (Run) will display the value(s) of the object.
If you wanted the columns in df_Grad_Exp to be in a different sequence than in df_SSD, simply
list the columns in the desired sequence.
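A minimal sketch of this column selection, with made-up column names standing in for the survey's:

```python
import pandas as pd

# tiny stand-in for the survey data; the real columns come from df_SSD
df_SSD = pd.DataFrame({"Program": [4, 6], "Gender": [1, 2],
                       "Salary": [45000, 60000], "Age": [19, 21]})

# double square brackets: the inner list names the columns to keep, in the order we want
df_Grad_Exp = df_SSD[["Program", "Gender", "Salary"]]
print(df_Grad_Exp)
```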
To rename the columns, we must use the rename “method”. A “method” is Python’s term for a
function. We must say that it is the column headings we wish to rename and then list the old
and new names. If we wish to make this change to the data frame and not assign the result to a
new data frame, we tell Python that change will be inplace.
The script is quite long, so you may wish to break it up over several lines as done below. Python
sent a warning about using inplace rather than copying my data frame to a new object. You
can’t undo the inplace action once it is done.
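A sketch of the rename, again with illustrative old and new names:

```python
import pandas as pd

df_Grad_Exp = pd.DataFrame({"q5": [4, 6], "q38": [45000, 60000]})  # illustrative old names

# rename maps old column names to new ones; inplace=True changes the frame directly,
# breaking the script across lines for readability
df_Grad_Exp.rename(columns={"q5": "Program",
                            "q38": "Salary"},
                   inplace=True)
print(df_Grad_Exp.columns.tolist())
```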
In Excel (VLOOKUP):
Let us look at the Program variable in Column C. Insert a column to the left of C. If you label the
column Program, Excel will change the name of the original Program variable to Program2. We
would like to map the values of Program2 into the text equivalent in Program. We need a table
to map the number to text. For example,
Saint Mary’s does not offer Education, Fine Arts, Medicine, Health, Services or Law, but some
students selected these programs. We also do not offer any programs that are “Other” than
those in Humanities, Social Sciences, Business, Science, Mathematics, Environment and
Engineering. What should we do? It is not uncommon to have some data values that are invalid.
We could classify them all as Invalid and maybe will choose to investigate them later or to
exclude them from our analysis later.
Open a new worksheet and Rename it Lookup Tables. Create a look up table in which
• Humanities and Social Sciences are classified as Arts,
• Business as Business,
• Science, Math, Engineering and Environment are Science, and
• all others are classified as Invalid.
Go back to the Grad Expect worksheet and the first cell in the Program column (C2). Type
• =VLOOKUP(D2,
• now select the Lookup Tables worksheet and highlight the table you created.
• In the formula bar at the top of the screen you should see that the VLOOKUP formula is
capturing this information.
• Continue typing in the formula bar to add ,2,False).
• Press Enter.
The rest of the Program column is filled in, but incorrectly. This is because we need to lock the
location of the look up table.
Similarly we can create additional tables in our Lookup Tables worksheet and use them to
create new variables that have the text equivalents for Home, Gender, Parent Ed, and
Language.
VLOOKUP is a function that assigns a value based upon a look up table. It has four arguments.
VLOOKUP(value, table, column, match)
The value is the cell location whose value you would like to match within the look up table.
The table is the location of the lookup table, like the Program table we used.
• You must give the location of the top-left cell and the bottom-right cell, separated by :
• Don’t put a comma between the two cell locations, else Excel will give you an error.
• Since the lookup table will always stay in the same location, you should lock the location
by putting $ signs in front of the row and column (e.g., $A$1:$B$15 in the previous
example).
The column is the column with the new value to assign. Your look up table can have many
columns. Excel will match the value in the first column and assign the value in the column that
you have named in the function.
The match can be TRUE (approximate match) or FALSE (exact match). In the example of
program, we exactly matched each numeric value with a particular program.
In Python:
In Python, there are a variety of ways to perform the equivalent of an exact match. One uses a
very similar method to a lookup table, except that it is a list. To improve readability, we can
make it look like a table. We begin by creating a dict. A dict is a dictionary. It is a list of pairs
with the first entry being the old value and the second being the new value. We will call our
first dict Program_map.
We will use this dict to populate a new variable we will call Program2. In Excel, when we tried
naming a new variable with the same label as an existing variable, Excel automatically renamed
the old variable. This does not happen in Python: assigning to an existing column name overwrites it.
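A sketch of the dict and the mapping step, with made-up code-to-program pairs (the real pairs come from the workbook's summary sheet); the Series .map() method applies the dict:

```python
import pandas as pd

df_Grad_Exp = pd.DataFrame({"Program": [1, 3, 4]})   # illustrative numeric codes

# illustrative code-to-name pairs; the actual mapping comes from the survey documentation
Program_map = {1: "Arts", 2: "Arts", 3: "Business", 4: "Science"}

# .map() looks each value up in the dict, much like an exact-match VLOOKUP
df_Grad_Exp["Program2"] = df_Grad_Exp["Program"].map(Program_map)
print(df_Grad_Exp)
```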
BUT!!!
Python is a very rich programming language. Frequently, there are many ways to achieve the
same outcome. A VLOOKUP is equivalent to joining databases (tables) that share a common
variable. A table is simply a 2-dimensional data frame (a flat file in Excel). “Joining” means
connecting files that have a variable in common. A lookup table is such a file. Suppose we
create a simple array, Program_Name, that has just two columns, with the program numeric
value and its text value, just like Program_map.
Note that Program_Name is a set of “nested” lists. Something enclosed in [] is a list, and here we
have lists within lists. This becomes an “array”, a 2-dimensional matrix of elements.
Merge and Join are important tools for merging databases. Consult Python documentation for
more information on these methods.
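A sketch of the join approach, assuming made-up program codes:

```python
import pandas as pd

# Program_Name as nested lists: each inner list is one row of the lookup "table"
Program_Name = [[1, "Arts"], [3, "Business"], [4, "Science"]]
lookup = pd.DataFrame(Program_Name, columns=["Program", "Program_Text"])

df = pd.DataFrame({"Program": [4, 1, 3]})
# merge performs the database-style join, matching rows on the shared Program column;
# how="left" keeps every row of df even if a code has no match in the lookup table
df = df.merge(lookup, on="Program", how="left")
print(df)
```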
We can also do approximate matches as a way of grouping values. Grouping changes a numeric
variable into a categorical (ordinal) variable. Grouping is also referred to as binning. This can
often simplify interpretation. Look at the salary expectations. The average is around $50,000,
so we could group salaries into categories, such as
We can group the salaries by building a Look Up table in which the range of salaries is defined
by its lowest value.
VLOOKUP will classify salaries into the highest category it can select. For example, $54,000 is
greater than 40000 but less than 60000, so it will be classified into the category starting with
40000. In this case, we would assign the match value to be TRUE to get an approximate match.
=VLOOKUP(L2,'Lookup Tables'!$O$1:$P$7,2,TRUE)
In Python:
In Pandas, there is a function cut that takes numeric data and cuts it into categories. Similar to
Excel, Python creates bins. Whereas Excel asks for the lower limit for each bin, Python asks for
the upper limit. Create a list of upper limits and then use the cut function to assign values to a
new variable, Salary_Grp.
You must also include the start of the first bin (here, 19999) in the list; values at or below the first edge fall outside all of the bins.
Python assigns default category labels showing the limits of each interval. A round bracket, (,
indicates that this value is NOT included in the interval and a square bracket, ], indicates that
the value is included. (19,999.0, 39,999.0] indicates that values must be greater than 19,999
and less than or equal to 39,999.
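A sketch of the cut call, with illustrative salaries and bin edges (the variable names are stand-ins):

```python
import pandas as pd

Salary = pd.Series([25000, 54000, 72000])    # illustrative salary expectations

# the first entry is the start of the first bin; each later entry is a bin's upper limit
bins = [19999, 39999, 59999, 79999]
Salary_Grp = pd.cut(Salary, bins=bins)
print(Salary_Grp)
```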
VLOOKUP treats blanks as zeroes. If the look up table maps 0 to a new value (0 = Male, or 0 =
Very Low), then VLOOKUP applies this relationship. Missing values can be a data quality
problem, but erroneously recoding data is an even more serious data quality issue.
One way to address this issue is to use the IF function. The IF function takes a logical argument
and assigns one value if it is true and another value if it is false. IF(argument, TRUE result, FALSE
result).
For example, with Salary, we could change our VLOOKUP formula to read
=IF(L2<>"",VLOOKUP(L2,'Lookup Tables'!$O$1:$P$7,2,TRUE),"")
This says that if the value in L2 is not equal (<>) to blank (“”), then use VLOOKUP, but otherwise,
keep it blank.
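Pandas sidesteps this particular trap: both map and cut leave missing values as NaN rather than treating them as zero. A quick check with made-up data:

```python
import numpy as np
import pandas as pd

Salary = pd.Series([25000, np.nan, 54000])
Salary_Grp = pd.cut(Salary, bins=[19999, 39999, 59999])
print(Salary_Grp)   # the missing entry stays NaN instead of being binned
```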
https://youtu.be/fWK0shgaHvc
Image Citations:
Figures 4-1 to 4-10: Images courtesy of author using Microsoft Excel