Data Collection - Setting Up Excel
Data entry in Excel - Dana Hince PhD
Although specialised programs are available for data entry, Excel is still
the program we most commonly see used for this purpose. Most of the
analysis, however, is conducted in specialised statistical programs
(e.g. SPSS, Stata or SAS). You will save yourself and your data analyst A LOT
of time if your Excel spreadsheet is constructed in such a way that the
statistical program can “understand” it.
By the end of this manual you will be able to engage in “good data entry” (as
defined at the top of this page) by achieving the following objectives:
1. Structure an Excel spreadsheet that allows easy importation into various
statistical packages.
2. Choose appropriate variable names, variable labels, and value codes and
labels.
3. Construct a codebook/data dictionary.
4. Format Excel cells.
5. Limit Excel cells to valid ranges or prepare drop down options.
6. Know what not to include in your spreadsheet.
In both formats (wide and long):
• the columns represent variables
• the first column is a unique identifier for each participant
• the rows represent cases/events
• the rows must be uniquely identifiable
The difference between the two is how many variables are needed to
uniquely identify a row. Wide format needs only one identifier, whereas
long format needs two or more.
For example, compare Fig1 and Fig2. In wide format (Fig1) knowing the id
and variable name (id=1, variable=t1_bp) leaves only one choice of value
(120), whereas in long format (Fig2) knowing the id and variable name
(id=1, variable=bp) leaves you with two options (120 or 110). In this case,
we also need to know the value of time to limit the number of possible
rows to one.
Long format is often the format required for the analysis of longitudinal or
repeated measures study designs. It is possible to use statistical software
to switch between wide and long format (called restructuring or reshaping
the data). If your data requires restructuring, please DO NOT CUT AND
PASTE IN EXCEL. It is too error prone!
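As an illustration of what restructuring does, here is a minimal sketch in plain Python that turns wide rows into long rows. It assumes time-specific columns follow a t1_bp, t2_bp naming pattern; the function name wide_to_long and the values for participant 2 are invented for the example. In practice your statistical package's own reshape command would do this job.

```python
# Wide format: one row per participant, one column per measurement occasion.
# Participant 1 uses the values from the text; participant 2 is invented.
wide_rows = [
    {"id": 1, "t1_bp": 120, "t2_bp": 110},
    {"id": 2, "t1_bp": 124, "t2_bp": 118},
]


def wide_to_long(rows, id_col, stub):
    """Turn columns named t<time>_<stub> into one row per id and time."""
    suffix = "_" + stub
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith("t") and name.endswith(suffix):
                # "t1_bp" -> the characters between "t" and "_bp" give the time.
                time = int(name[1:-len(suffix)])
                long_rows.append({id_col: row[id_col], "time": time, stub: value})
    return long_rows


long_rows = wide_to_long(wide_rows, "id", "bp")
for long_row in long_rows:
    print(long_row)
```

In the long result each row is identified by id and time together, which is why consistent variable names matter: the time value is read straight from the column name.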
MAKE A START!
1. What is your spreadsheet format?
2. How is each row uniquely identified?
3. What are your id values?
Variable names
• are meaningful and short
• contain all relevant information
• have no spaces between words
• start with a letter (or _)
• take only one row (the first one!) in the spreadsheet
• are consistently named (e.g. heart_rate1, heart_rate2 NOT HR_1,
hrate2). THIS IS PARTICULARLY IMPORTANT IF ANY DATA
RESTRUCTURING IS REQUIRED (see page 3)
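The naming rules above can be checked automatically before any data are entered. The following is a rough sketch only: the function name check_variable_names and the 32-character limit are assumptions made for the example, so check the actual limit of your own statistical package.

```python
import re

# A letter or underscore first, then letters, digits or underscores (no spaces).
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
MAX_LENGTH = 32  # assumed limit; packages differ


def check_variable_names(names):
    """Return a dict mapping each problem name to a list of issues."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if not NAME_PATTERN.fullmatch(name):
            issues.append("must start with a letter or _ and contain no spaces")
        if len(name) > MAX_LENGTH:
            issues.append("too long")
        if name.lower() in seen:
            issues.append("duplicate name")
        seen.add(name.lower())
        if issues:
            problems[name] = issues
    return problems


problems = check_variable_names(["id", "heart_rate1", "heart rate2", "2_SBP"])
for name, issues in problems.items():
    print(name, "->", "; ".join(issues))
```

This check cannot tell whether names are meaningful or consistently styled; that still needs a human eye.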
MAKE A START!
4. Test your knowledge on the next page, then
5. Enter your variable names into a blank Excel file
POP QUIZ 1: what is wrong with these variable names?
Figure 2a), b) and c). Examples of troublesome variable names.
Answers.
a) Time1 and T1 refer to the same time. Needs to be consistently labelled.
b) Beginning variable names with numbers upsets many statistical programmes.
c) Variable names need to be contained in one cell only.
• In the pop up box that follows, choose Date and the format you
would like, then click OK (Fig. 3b: I like the 3rd option down because it
includes the 4-digit year).
MAKE A START!
6. Format all the cells that will contain date values now
• Include the codes and value labels for any categorical variables.
MAKE A START!
7. Set up your data dictionary on the next sheet in your Excel
file, following Fig4 as an example
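If it helps to picture the content of a data dictionary, here is one possible sketch written as a Python structure. Everything in it is hypothetical: the column layout, the labels and the codes (1 = male, 2 = female, 9 = missing) are assumptions made for illustration and are not taken from Fig4.

```python
# Hypothetical data dictionary: one entry per variable in the data sheet.
# Codes and labels below are invented for illustration.
data_dictionary = [
    {"name": "id", "label": "Participant identifier",
     "type": "numeric", "codes": {}},
    {"name": "gender", "label": "Gender of participant",
     "type": "categorical", "codes": {1: "male", 2: "female", 9: "missing"}},
    {"name": "t1_sbp", "label": "Systolic blood pressure at time 1 (mmHg)",
     "type": "numeric", "codes": {}},
]


def allowed_codes(dictionary, variable):
    """Return the sorted list of valid codes for one variable."""
    for entry in dictionary:
        if entry["name"] == variable:
            return sorted(entry["codes"])
    raise KeyError(variable)


print(allowed_codes(data_dictionary, "gender"))
```

Keeping the codes in one place means the same list can drive both the drop down options in Excel and any later checks of the entered data.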
Firstly, let’s set up a drop down list with an error message for the variable
gender. Go to the Data ribbon on the top of the Excel window (1), and click
Data Validation (2).
The window below will pop up. Choose List from the Allow: drop down
menu (3).
Click on the button next to the Source: field (4); this will open another
pop up box.
Now choose the cells that have the values you want to allow as valid
input from your data dictionary. In this case, they are the codes for
gender (5: see also Fig 4). Now click the button marked 6.
You should be back at the Data Validation pop up, with the cell
reference now in the Source: field (7). Now click the Error Alert tab (8).
In the error message box (9), enter the message that you would like to
appear if someone enters an invalid value, and click OK.
When you place the cursor in the cell/s you just set up, a little arrow will
appear next to the cell. Click this to access the allowable values.
Now try putting in an invalid number. Your error message will appear
when you move to the next cell.
There are other options in the Allow: drop down list in the Data
Validation window. Let’s suppose that our id values can only be between
100 and 200. Choose Whole Number (1), and in the Data: drop down
menu that appears (2) choose between and then enter the maximum
and minimum of the allowable range for that variable (3).
Again, you could set up an error alert that will appear if someone tries to
enter an invalid id value.
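The same two rules (a list of allowed codes, and a whole number between 100 and 200) can also be re-checked after entry, for example on an exported copy of the data. This is only a sketch: the function names, the gender codes and the sample entries are hypothetical.

```python
def is_valid_id(value, minimum=100, maximum=200):
    """True for a whole number between minimum and maximum (inclusive)."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return minimum <= value <= maximum


def is_valid_code(value, allowed):
    """True when the value is one of the allowed codes."""
    return value in allowed


GENDER_CODES = (1, 2, 9)  # hypothetical codes from a data dictionary

# Invented (id, gender) pairs as they might be typed into the sheet.
entries = [(101, 1), (250, 2), (150, 3), (120.5, 1)]

errors = []
# Start counting at 2 because row 1 of the sheet holds the variable names.
for row_number, (id_value, gender) in enumerate(entries, start=2):
    if not is_valid_id(id_value):
        errors.append((row_number, "id"))
    if not is_valid_code(gender, GENDER_CODES):
        errors.append((row_number, "gender"))

for row_number, variable in errors:
    print(f"Row {row_number}: invalid value for {variable}")
```

A check like this complements Excel's data validation rather than replacing it, since validation in Excel stops the error at the moment of entry.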
https://support.office.com/en-us/article/Create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b?ui=en-US&rs=en-
MAKE A START!
8. Consider your data and how you might use Excel data
validation to reduce data entry error
9. Set up your variables accordingly!
3. Enter away!
• Take lots of breaks if entering “in bulk”. Fatigue is your enemy when it
comes to accurate entry.
• Try to enter data in real time if at all possible, as this might give you
the chance to collect missing data values before it is too late.
• If you have multiple people entering data, make sure that they have
access to the data dictionary and please use the data validation tool!
• Text and numeric values in the same variable. If you find yourself doing
this you need two separate variables.
• Colour coding – this is fine if it helps you with ensuring accurate data
entry, but all the relevant information needs to be included in the
variables.
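A quick way to spot a variable that mixes text and numbers is to scan each column of an exported copy of the data. The sketch below assumes every cell has been read in as text; the function names and the example column dose are hypothetical.

```python
def is_number(text):
    """True when the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def mixed_columns(rows):
    """Return the names of columns holding both numeric and text values."""
    kinds = {}
    for row in rows:
        for name, value in row.items():
            text = str(value).strip()
            if text == "":
                continue  # empty cells are ignored in this sketch
            kind = "number" if is_number(text) else "text"
            kinds.setdefault(name, set()).add(kind)
    return [name for name, found in kinds.items() if len(found) > 1]


# Invented rows: the dose column mixes numbers with free text.
rows = [
    {"id": "1", "sbp": "120", "dose": "5"},
    {"id": "2", "sbp": "124", "dose": "5 mg"},
    {"id": "3", "sbp": "125", "dose": "unknown"},
]
print(mixed_columns(rows))
```

Any column this flags is a candidate for splitting into two variables, for example one for the numeric value and one for the accompanying note.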
POP QUIZ 2: What is wrong with this spreadsheet? Can you find the errors? How would you
fix them?
Hypothetical data comparing meditation vs no meditation treatment effects on blood pressure, body temperature and 3 anxiety questionnaires, pre (time 1) and post (time 2) intervention.

id  test 1
    sbp  temp  anx_q1  anx_q2  anx_q3  2_SBP  t2_temp  t2_anx_1  t2_anx_2  t2_anx_3  gender
1   120  37    3       3       4       121    37       3         6         5         M
2   124  37    5       2       6       124    36.9     5         6         6         m
3   125  37.1  6       0       5       123    37       6         6         9         f
4   126  36.9  9               2       135    36.9     8         8         9         fem
5   135  36.8  10      9       9       110    36.8     5         9         3         male
6   101  37    2       7       8       112    37       8         6         6         male
7   26   36.6  4       6       3       120    37       5         6         8         female

Colour key: Meditation / No meditation
For further information please contact:
Dana Hince PhD
Research Biostatistician
[email protected]
or
Professor Max Bulsara
Chair of Biostatistics
[email protected]
Institute for Health Research,
University of Notre Dame, Fremantle
19 Mouat Street,
P.O Box 1225, Fremantle, WA 6959
Answers.
1. Two rows used for the variable names; these should be in one row.
2. Variable name starts with a number; instead put the number at the end.
3. Variable names are inconsistent; choose one format and stick to it.
4. Gender is not numerically coded; assign numbers to male, female and
missing and add codes into the data dictionary. NB it is possible to ‘encode’
text values in statistical packages, but it requires extra steps and it isn’t easy
if the text coding is not consistent as is the case here!
5. Row means are included; delete them.
6. Colour coding used to define treatment groups; create another variable that
assigns numbers to the groups and add into the data dictionary.
7. Missing values are not coded (in anx_q2 and T2_temp); decide on an
appropriate code, enter into spreadsheet and into data dictionary.
8. Where is the dictionary?
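To show the extra steps that answer 4 refers to, here is a minimal sketch of recoding the inconsistent gender entries from the quiz spreadsheet into numbers. The codes (1 = male, 2 = female, 9 = missing) and the function name recode_gender are hypothetical; use whatever codes your own data dictionary defines.

```python
# Hypothetical codes; record the real ones in your data dictionary.
MALE, FEMALE, MISSING = 1, 2, 9


def recode_gender(raw):
    """Map a free-text gender entry to a numeric code."""
    text = str(raw).strip().lower()
    if text == "":
        return MISSING
    if text in ("m", "male"):
        return MALE
    if text in ("f", "fem", "female"):
        return FEMALE
    # Stop on anything unrecognised instead of guessing.
    raise ValueError(f"unrecognised gender entry: {raw!r}")


# The gender column exactly as entered in the quiz spreadsheet.
raw_values = ["M", "m", "f", "fem", "male", "male", "female"]
coded = [recode_gender(value) for value in raw_values]
print(coded)
```

Every spelling variant has to be listed by hand, which is exactly the extra work that numeric coding at the point of entry avoids.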