STATISTICAL ANALYSIS USING SPSS AND R SOFTWARE
PREFACE
These notes have been compiled by Dr. Monique Borg Inguanez, Dr. Fiona
Sammut and Dr. David Suda. The aim of this set of notes is to give an
introduction to statistical analysis using two very popular software packages, namely
SPSS and R. Although these notes have been compiled using Windows, both
packages run on other operating systems as well.
The material presented here covers syllabi required for SOR0223, SOR1232
and SOR1220. Some material is beyond the syllabus required for certain study
units and some material could also have been covered in study units which
students might have followed earlier. The authors felt that extra material should
be included so as to have a complete set of notes which students can use
even after they graduate. This will by no means increase the number of
lecturing hours: students can experiment with most of the introductory
material on their own, and the respective lecturers will indicate which sections
are relevant for each study unit.
0. INTRODUCTION – DIFFERENT TYPES OF
VARIABLES
As explained in the preface, this study unit focuses on teaching how to use
statistical software to analyse various types of data sets. To be able to identify
which techniques are most suitable for the analysis, we need to understand the
nature of the variables being considered.
In our data set we may have both quantitative and qualitative variables.
A discrete (count) variable can take only whole numbers as response (no
decimal places), such as the number of employees in a company, age in
years, number of patients in a ward.
A continuous variable can take on any number on the real line or a subset
of it, such as salary, height, weight, blood pressure.
There are also two types of measurement scales for qualitative (categorical)
variables: ordinal and nominal.
For nominal variables, the order in which the categories are listed is irrelevant, and
any statistical analysis carried out on this type of data does not depend on the
ordering of the data. The methods designed for nominal variables give the
same results no matter in what order the categories are listed. On the other
hand, methods designed for ordinal variables utilize the category ordering and
results of ordinal analyses would change if the categories were reordered in
any other way. In view of this, methods designed for ordinal variables should
not be used with nominal variables. Methods designed for nominal variables
may be used with either nominal or ordinal variables.
[Diagram: Data is divided into qualitative (categorical) variables and quantitative variables (covariates).]
Note that in some texts on modelling, the term covariates refers to explanatory variables, which can be qualitative or quantitative.
1. INTRODUCING THE SOFTWARE
This first section will serve as an introduction to SPSS, R and RStudio. The
second section will focus on how to create or import a dataset, transform
variables, manipulate data and perform descriptive statistics using either
software.
The second part of the notes will describe some commonly used descriptive
statistics. In the third part, focus will be placed on graphical representations
and the fourth part will cover more advanced topics such as parametric and
non-parametric tests, statistical modelling and categorical data analysis.
1.1 SPSS
SPSS stands for Statistical Package for the Social Sciences. This software
package was first produced by SPSS, Inc. in Chicago, Illinois and acquired by
IBM in 2009. An updated version of the software is issued every year. The
version used to compile these notes is officially named IBM SPSS Statistics
23. In these notes, the software will always be referred to as SPSS.
To start using SPSS, locate the SPSS icon under the Programs menu item.
Alternatively, if you already have a file that was created by SPSS, you can
start SPSS by double-clicking on that file.
The latter is however not recommended as it usually takes longer to open a file
in this manner. SPSS consists of different windows, each of which is
associated with a particular SPSS file type. The most important of these
windows are the Data Editor ( .sav files) and the Output Viewer ( .spv files)
windows.
The Data Editor is the window that is open at start-up and is used to enter and
store data in a spreadsheet format. The Data View is the sheet that is visible
when you first open the Data Editor, and this is the window which contains
the data; it displays the contents of the currently open dataset, also known as
the working dataset.
The figure which follows shows the Data Editor containing the Employee
data.sav dataset:
Figure 1.1.1
From Figure 1.1.1 you can see that at the top of the Data Editor window there
are several menu items that are useful for performing various operations on
the working dataset. All data manipulations, statistical functions, and other
SPSS procedures operate on the currently open dataset.
If you wish to analyze an SPSS data file which is already stored in the
computer, the file may be opened in the Data Editor window, by choosing the
menu options File, Open, Data and then selecting the directory in which the
dataset is stored.
If the file to be opened is not an SPSS data file, the Open menu option may
also be used to import the file directly into the Data Editor. We shall shortly
demonstrate how this may be done. If the data file is in a format which is not
recognized by SPSS, then the software package in which the file was
originally created may be used to try to translate the file into a format that can
be imported and read by SPSS (e.g. tab-delimited data or an Excel data file).
Note that the Data View window contains variables (information available
for each case, such as I.D. card number, gender, education, etc.) in columns
and cases (individuals or observational units) in rows. Thus, each cell entry is
a number (or word or symbol) giving a single value of one variable for a
particular case.
Apart from the Data View sheet, there is also the Variable View sheet. The
latter may be accessed by clicking on the tab labelled Variable View. While
this second sheet is similar in appearance to the first, it does not actually
contain data. Instead, it contains information about the variables in the
dataset. In fact, by means of this view, one can name and define variables,
define labels and values, define missing data, and select the type of alignment
and measure for the data entered. Note that many of the cells available in the
Variable View spreadsheet (see Figure 1.1.2 below) contain hidden dialogue
boxes that may be activated by clicking on the cells themselves.
Figure 1.1.2
Figure 1.1.3
SPSS should also be provided with information about the type of the variables
being used in the dataset; this is often critical for SPSS to carry out the correct
analyses. As can be seen in Figure 1.1.3, SPSS accepts various variable types.
Numeric: This option is used for variables with numerical values. For
example:
Categorical Variables (in some statistical analyses these are referred to
as Factors) take numerical values from a predefined set of integers. The
Decimals setting is hence set to 0. Such variables are divided into two classes:
(i) Nominal – which refers to variables that have been coded numerically
where order is not important (e.g., recording a subject's gender as 1 if
male, 2 if female, 3 if other, or an answer to a question as 1 if yes and
0 if no, or assigning integers to districts or cities). For such variables,
the Measure should be set as Nominal.
(ii) Ordinal - which refers to variables that have been coded numerically
but there is some form of ordering in the numbers (e.g., a Likert item
with responses 1=Good, 2=Better, 3=Best). For such variables,
the Measure should be set as Ordinal.
Labels for the different values that a categorical variable may take should
always be defined in SPSS. One of the main reasons for doing this is to help
with the interpretation of the output. For example, considering the variable
jobcat in the Employee dataset, it is coded as either 1, 2, or 3 for employment
categories: clerical, custodial and managerial. Now, imagine having to
interpret some plot or some table from the output and all the time reminding
yourself that 1 stands for clerical, 2 stands for custodial and 3 stands for
managerial. But, on letting SPSS know that 1 stands for clerical, that is, by
assigning labels to the values being used, the labels will appear in the output
rather than the numbers, making interpretation much easier. Such labels are
defined through the Values setting as shown below.
Figure 1.1.4
Figure 1.1.5
Enter the numerical value in the Value: field and the corresponding label in
the Label: field. Press Add each time and press OK when you have labelled
all the levels of the fixed factor. The Change and Remove buttons may be used
any time you need to change or remove any predefined labels.
String: This option is used for variables which contain text instead of numeric
values. The values of string variables may include numbers, letters, or
symbols. Examples of string variables are names of individuals, zip codes,
phone numbers, I.D. numbers, free-response answers to survey questions, etc.
Such variables are not used in any calculations. In Figure 1.1.2 it may be noted
that gender is set as String. Note that some SPSS procedures (such as the
independent samples t-test; ANOVA; non-parametric tests; etc.) require that
categorical variables be coded as numeric. In SPSS it is very easy to convert
string variables to numeric as we shall see later on.
Comma: Numeric variables displayed with commas delimiting every three
digits (to the left of the decimal point) and a dot delimiting decimals, for
example 1,000.20 or 23,456.32. This is the convention we use.
Dot: Numeric variables displayed with dots delimiting every three digits (to
the left of the decimal point) and a comma delimiting decimals, for example
1.000,20 or 23.456,32. We do not usually use this convention.
1.1.2 The Output Viewer
All output from statistical analyses, as well as other useful information, is
printed to the Output Viewer window. When you execute a command for a
statistical analysis, the output appears in the Output Viewer. Some other
output that you may want to have printed to the Output Viewer are command
syntax, titles, and error messages. The Output Viewer showing descriptive
statistics for the employee dataset is shown in Figure 1.1.6. Detail on how to
obtain descriptive statistics in SPSS will be given at a later stage.
Figure 1.1.6
The left frame of the Output Viewer contains an outline of the objects
contained in the window. For example, everything under Descriptives in the
outline refers to objects associated with the descriptive statistics, the Title
object refers to the bold title Descriptives in the output while the highlighted
icon labeled Descriptive Statistics refers to the table containing descriptive
statistics. The Notes icon has no referent in the example being considered
here, but it would refer to any notes that appeared between the title and the
table.
Note that, by clicking on an icon, you can move to the location of the output
represented by that icon in the Output Viewer. This makes the outline useful
for navigating in the Output Viewer when there are large amounts of output.
Also, the outline is a tool that may be used for copying or deleting objects:
first highlight the objects of interest in the outline and then perform the
operation as needed.
Figure 1.1.7 shows that what is displayed in the output may also be
customized. This is possible via the Options item on the Edit menu:
Figure 1.1.7
1.1.3 The Syntax Editor
These SPSS for Windows lecture notes will focus on the use of the dialog
boxes to execute procedures. However, there are a couple of important
reasons why you should be aware of SPSS syntax even if you plan to primarily
use the dialog boxes:
Not all procedures are available through the dialog boxes. Therefore,
occasionally, submission of commands will have to be done through
the Syntax Editor.
Your own procedures may be saved as syntax so that they may be rerun
by means of the Syntax Editor at a later date.
SPSS syntax may easily be generated without even having to type in the
Syntax Editor. The process is illustrated below:
The following dialog box is used to generate descriptive statistics. Here, only
the Paste button in the dialog box is relevant; the process used for generating
descriptive statistics is described later. (The dialog boxes available through
the pull-down menus all have a button labeled Paste.)
Figure 1.1.8
By clicking on the Paste button, the procedure that the above dialog box is
prepared to run will be written in SPSS syntax in the Syntax Editor, as follows:
Figure 1.1.9
Note that running the resulting SPSS syntax produces exactly the same output
as clicking the OK button in the above dialog box.
The syntax printed to the Syntax Editor can be saved and run at a later time,
as long as the same dataset, or at least a dataset containing variables with the
same names, is active in the Data Editor window. Saving the syntax also
comes in useful when the analysis needs to be rerun after more data has been
added, or when the same analysis is to be carried out on another dataset that
contains the same variables.
When an Excel file is being imported, one further dialogue box, like the one shown below, will then appear:
Figure 1.1.10
This dialogue box allows you to select a spreadsheet from within the Excel
Workbook. In this particular case, since the Excel file contains five different
spreadsheets, the drop-down menu in Figure 1.1.10 offers five sheets to
choose from.
As SPSS only operates on one spreadsheet at a time, only one spreadsheet
may be selected at a time. Note that, if not all variables of an Excel sheet are
required for the analysis, the range has to be specified, otherwise, the whole
spreadsheet will be imported to SPSS. Also, by ticking Read variable names
from the first row of the data, SPSS reads the first row of the spreadsheet as
the variable names. If the spreadsheet contains no variable names, make sure
that the box remains unmarked. Then, once in the SPSS environment, if we
would like to add variable names to the dataset, we follow the procedure
described in Section 1.1.1.
You should now see data in the Data Editor window. Check to make sure that
all variables and cases were read correctly. Next, save your dataset in SPSS
format by choosing the Save option in the File menu.
The Text Import Wizard will open automatically when an ASCII file, with a
.txt or .dat extension, is to be opened using the Open option in the File menu.
However, if the data file to be imported is an ASCII file without a .txt or .dat
extension, it may be imported by selecting Read Text Data from the File
menu.
The first window to pop up will ask the user to choose the file from the
directory to be imported. Then, a series of dialogue boxes, starting with the
one shown in Figure 1.1.11, will show up and will guide the user with the
retrieval of the data. Once the data has been imported and checked for
accuracy, a copy of the dataset should be saved in SPSS format by selecting
the Save or Save As options.
Figure 1.1.11
To insert a case (or variable), select the row (or column) in which the case (or
variable) is to be added, right click and choose Insert Cases (or Insert
Variables) from the resulting menu and a blank row (or column) will result
automatically. The latter procedure may also be carried out by clicking on the
row's number or on the column's name and then using the insert options
available in the Data menu of the Data Editor.
Again, this will produce an empty row or column in the highlighted area of
the Data Editor. Existing cases and variables will be shifted down or to the
right, respectively.
The computation of any new variable is done using the Compute option
available from the Transform menu in the Data Editor. First, the dialogue box
which follows will appear:
Figure 1.1.12
Figure 1.1.13
Then, to create a new variable, type its name in the box labeled Target
Variable. In this case, it has been named Difference. The expression to be
computed is then either typed directly into the Numeric Expression cell, or
built up using the input values and operators located underneath it.
Now, we might also wish to, for example, compute the difference in salaries,
on the condition that the persons to be considered should have less than 15
years of education. This may easily be done by using the Compute option as
before together with clicking on the If button, to get the dialogue box which
follows:
Figure 1.1.14
In this way, the Include if case satisfies condition option is activated and the
condition for computing the new variable is entered underneath.
Figure 1.1.15
Figure 1.1.15 illustrates the condition that requires cases to have less than 15
years of education in order to be included in the computation of the new
variable.
Clicking the Continue button, we return to the previous dialogue box and then
by clicking Ok once more, the new variable will appear in the rightmost
column of the working dataset:
Figure 1.1.16
Press Transform and then Recode from the menu in the Data Editor. A choice
needs to be made between Into Same Variables, an option which changes the
values of the existing variables, and Into Different Variables, an option which
is used to create a new variable with the recoded values.
Note that both options lead to the same result, but Into Different Variables is
preferred: if you change your mind about your recoding scheme at a later date,
you can still refer back to the original variable values.
Now, opting for Into Different Variables option will produce the following:
Figure 1.1.17
First the variable which requires recoding must be chosen from the existing
dataset (for this example, we shall use jobcat) and then, by clicking on the
arrow button, the same variable name should appear in the cell Input Variable
-> Output Variable. Next, the name of the new variable (jobcatchanged) must
be supplied, together with an optional corresponding label:
Figure 1.1.18
Then, the old and new categories have to be specified in the dialogue box Old
and new Values as shown below:
Figure 1.1.19
Note that, the original value has to be entered in the Old Value cell and
similarly, the new value has to be entered in the New Value cell.
Afterwards, it is important to click on Add to save the recoding, then press
Continue and Ok. The newly coded variable will appear at the rightmost part
of the data entry window. The same process has to be repeated for any other
recoding needed.
It should be noted that the dialogue box shown in Figure 1.1.19 is the same
regardless of whether we are recoding values into the same variable or
creating a new variable.
Now, as for the computation of new variables, recoding of variables may also
be done conditionally. Recoding values given a condition is done in exactly
the same manner as just discussed but with the inclusion of the condition being
specified by means of the If button.
A very good way to look at your data and see if there is any missing data or
any incorrect data entry values, is to sort the data. Sorting of cases allows you
to organize rows of data in ascending or descending order.
For example, we may want our data to be sorted such that the variable jobtime
is in increasing order, or we may want salary sorted within the variable
jobtime. SPSS makes this possible through Data, Sort Cases; the sorting may
be based on one or more variables.
Figure 1.1.20
This dialogue box offers only two options: the variable(s) to be sorted, and
the desired order of sorting. Note that the hierarchy of the sorting is
determined by the order in which variables are entered in the Sort by cell.
Consider the following:
Figure 1.1.21
The sorting requested in Figure 1.1.21 causes the data first to be sorted by the
first variable entered, jobtime. The second variable, salary, is then sorted
within each level of jobtime. The resulting sorted Employee data set will be
as follows:
Figure 1.1.22
Referring again to the Employee data set, consider for example wanting to
analyze the data which corresponds to employees having a current salary
greater than $20,000 only. By means of SPSS, we can in fact analyze a
specific subset of our data by using the Select Cases procedure. This is done
by pressing Data, then press Select Cases from the menu options and hence,
the dialogue box which follows will show on the screen:
Figure 1.1.23
As may be seen, Figure 1.1.23 contains a list of the variables in the active data
file on the left and several options for selecting cases on the right.
Now, by default SPSS considers All Cases for analysis, so, on pressing All
Cases, the data will remain unchanged. However, if either one of the other
options is chosen, say If condition is satisfied, then the If button underneath
should be pressed so that a second dialogue box, which will ask for the
particular specifications, will show up:
Figure 1.1.24
For our example, where we would like to choose only those employees who
earn more than $20,000, the if statement should read:
Figure 1.1.25
Note that the portion of the dialogue box in Figure 1.1.23 labeled Output gives
the option of temporarily or permanently removing data from the dataset. The
Filtered option will remove data from subsequent analyses until the All Cases
option is reset, at which time all cases will again be active and used in further
analyses. If the Deleted option is selected, the unselected cases will be
removed from the working dataset. If the dataset is subsequently saved, these
cases will be permanently deleted.
For our example, the Filtered option has been chosen, hence SPSS will
indicate the inactive cases in the Data Editor by placing a slash over the row
number:
Figure 1.1.26
To select the entire dataset again, return to the Select Cases dialog box and
select the All Cases option. Otherwise delete the filter variable (the last
column in the data) from the data.
1.1.11 Listing Cases
By means of SPSS, we may also extract a listing of the cases for some (or all)
of the variables in a data set of interest. This procedure cannot be performed
using dialogue boxes; it may only be done through command syntax.
Figure 1.1.27
Note that the command LIST VARIABLES = ALL, typed instead of naming
specific variables, produces a listing of all the variables in the dataset, while
the subcommand /CASES FROM 1 TO 15 instructs SPSS to print only the
first fifteen cases. If the latter subcommand were omitted, all cases would be
listed in the output.
Now, to execute commands written in the Syntax Editor, first highlight the
commands, then either click on the right-facing green arrow or choose a
selection from the Run menu. Execution of the command will give:
Figure 1.1.28
1.2 R
R's first appearance goes back to 1993, and the popularity of this software has
increased rapidly, especially over the last 10 years. R is downloadable for free
under the GNU General Public License (GPL); it is known as an open source
program as its source code is freely available. The base system may be
downloaded from the CRAN homepage, and further documentation is
available on the same site:
https://cran.r-project.org/
https://cran.r-project.org/other-docs.html
1.2.1 STARTING R
Find R in the list of programs available on your computer and click on the
R icon; if you have created a shortcut on the desktop, double-click on the R
icon. A console window will pop up:
Figure 1.2.1: The main R window (GUI)
Working with R:
R is case sensitive
R has an inbuilt library of packages
Many more packages are available from the R website; keep on the
lookout for new packages. A package for, say, fitting a specific model
may not be available now but may be in a few months' time.
Packages available for download may be accessed through:
https://cran.r-project.org/web/packages/available_packages_by_name.html
Say you wish to install the package 'gnm', a package for generalized
nonlinear models. An alternative way of installing packages is to use
the command install.packages('package'); for the package required in
this case, type install.packages('gnm').
A further alternative is that of downloading the zip file of the required
package onto your computer and then using the menu Packages – Install
package(s) from local zip files. When using this procedure, you need to
make sure that you also install any dependencies (any other packages
that the package you are currently installing relies on).
If we then wish to load a specific dataset from a package, say the dataset
barley from the package gnm, we use the command
data(barley, package='gnm'). Typing barley at the prompt then displays
all the data in the barley dataset.
Whenever you are feeling lost, you can make use of the help facilities
available in R. There are various ways to access help in R.
Some documentation to get you going in R is available from the menu
Help – Manuals (in pdf).
Especially if you know which R command you are going to make use
of, you can use the Help menu in R. Say you wish to know how to work
out the trimmed mean in R, and you know that you have to use the
function mean. Then go to the menu Help – R functions (text) – mean.
A new window will pop up in your browser with details on the function
mean. Writing the command ?mean or help('mean') at the prompt will
lead you to the same information. We will see how to enter data and
how the function mean may be used shortly.
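As a quick preview, a minimal sketch (the vector used here is made up purely for illustration):
> x<-c(1,5,7,2,100)
> mean(x)
[1] 23
> mean(x,trim=0.2)
[1] 4.666667
With trim=0.2, the smallest 20% and largest 20% of the observations (here, the values 1 and 100) are discarded before the mean is computed.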
1.3 R STUDIO
RStudio is an integrated development environment (IDE) for R. It may be downloaded from:
https://www.rstudio.com/products/rstudio/download/
Let us start our first session with RStudio so that we can better appreciate the
advantages of using RStudio rather than R directly.
1.3.1 STARTING R STUDIO
Find RStudio in the list of programs available on your computer and click on
the RStudio icon; if you have created a shortcut on the desktop, double-click
on the RStudio icon.
If you look at RStudio's main window and compare it with the main window
obtained from R, you might already notice that more menus are readily
available for use through the RStudio interface. RStudio is in fact more user
friendly. We shall see shortly how RStudio, for example, is much simpler to
use for data import/export; it also offers point-and-click exploration of data
frames and other data objects. To mention some other advantages: with
RStudio it is easier to install and update packages, easier to save and export
plots, and different colours are used whilst writing a script.
The print screen which follows shows part of a script. Note that comments,
which are defined by a #, are given the colour green so that they stand out
from the remaining text, while the commands library and for are given the
colour blue, and so on.
Figure 1.3.2: Writing a script in RStudio
RStudio can also be used for version control using Git. Git is another open
source program, designed to ease the interaction of different people working
on the same project.
All the commands specified earlier for use with R software may also be used
with RStudio. In RStudio however, installing, loading and updating of
packages may be carried out directly through the menu Packages found in the
middle of the right hand side of the main window (refer to the circled part in
Figure 1.3.3). In the same circle of Figure 1.3.3, you may also notice the
presence of another Help menu. Again this menu has been introduced for ease
of use.
Once you are in the Packages menu, a package is then loaded by a single click
in the box next to the package required.
Note that since packages such as datasets, graphics, stats etc. are base
packages, they are always readily available and loaded for use in RStudio.
Unlike SPSS, RStudio (and R) can also be used as a very powerful calculator.
Some simple examples are given in the following section. More complex
examples are provided in Appendix A.
1.3.2 Basic Arithmetic and Objects
> 12+5
[1] 17
> exp(-5)
[1] 0.006737947
> x<-7
> x
[1] 7
If instead of <-, we used the symbol =, we would get the same value for x.
> x=7
> x
[1] 7
The use of the symbol <- dates back to when the R language was first created,
when it was the only choice for an assignment operator. The symbol comes
from the language APL, where the arrow notation was used to distinguish
assignment from equality, that is, to distinguish 'assign to x the value 7' from
'is x equal to 7?'. Nowadays, many R users still prefer the symbol <- when
assigning values to variables, because R uses the symbol = for associating
function arguments with values, such as:
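(The example itself is missing from this extract; presumably it was a call along the lines of the following, where sd names a function argument rather than a variable:)
> rnorm(10, sd = 5)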
In this case, sd is declared within the scope of the function, so it does not exist
in the user workspace.
> sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x)) x else as.double(x), na.rm = na.rm))
<bytecode: 0x000000000cb09808>
<environment: namespace:stats>
Note that the rnorm command is used to generate 10 observations from a
normal distribution with mean fixed at 0 and standard deviation fixed at 5:
> rnorm(10, mean = 0, sd = 5)
 [1] -0.5593297  1.0933179 -0.3005551  1.2798228 -7.9541652
 [6] -2.9788505  4.9220503  3.5936779 -6.2354377 -0.5117681
Some functions which may come handy when working with R are:
> seq(4,9,2)
[1] 4 6 8
> 4:9
[1] 4 5 6 7 8 9
> x<-c(1,2,3)
> sum(x)
[1] 6
> rep(x,4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(3:4,c(5,2))
[1] 3 3 3 3 3 4 4
Now suppose that our data consists of the numbers 2, 5, 7, 9, and that we wish
to extract only some of the elements of this vector. So:
> x<-c(2,5,7,9)
We may wish, for example, to extract all the numbers in x that are greater
than 5:
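(The commands and output are missing from this extract; for this x they would be:)
> x>5
[1] FALSE FALSE  TRUE  TRUE
> x[x>5]
[1] 7 9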
We may also wish to combine logical operators:
> (x>2) & (x<9)
[1] FALSE TRUE TRUE FALSE
> !(x<7)
[1] FALSE FALSE TRUE TRUE
> x[!(x<7)]
[1] 7 9
By default, R will try to access/save datasets, load source files or save plots
in its own directory. Should you wish R to import/export a dataset from/to a
specific directory, change the working directory to your desired directory via
File – Change Directory, or otherwise type setwd(file path).
In RStudio, paths should be changed using the menu Session – Set Working
Directory – Choose Directory.
If you have changed the working directory and you wish to check in which
working directory you are currently working, in either R or RStudio, use the
command:
> getwd()
[1] "C:/Users/user/Desktop "
Note that on Windows, the path returned uses / as the path separator (rather
than \ with which you are most probably more familiar).
If you have already specified the directory from which you would like to
retrieve a dataset, importing a dataset in R is carried out using the following
commands.
From Excel and SPSS – save the .xls or .sav file in comma-separated .csv
format and use the command:
x <- read.csv('test.csv', header = TRUE)
x<-read.csv('Employee data.csv',header=TRUE)
> names(x)
 [1] "id"       "gender"   "bdate"    "educ"     "jobcat"   "salary"
 [7] "salbegin" "jobtime"  "prevexp"  "minority"
> x[1,]
  id gender      bdate educ jobcat salary salbegin jobtime prevexp minority
1  1      m 02/03/1952   15      3  57000    27000      98     144        0
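The command that created x2 below is missing from this extract; presumably the same file was read without treating the first row as variable names, along the lines of:
> x2<-read.csv('Employee data.csv',header=FALSE)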
> names(x2)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10"
> x2[1,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 id gender bdate educ jobcat salary salbegin jobtime prevexp minority
There are alternative commands that may also be used for importing data into
R from Excel and SPSS: read.table, read.csv2, read.delim, read.delim2.
Also note that rather than changing the working directory, you can also
specify the directory from where to get a file directly in the read.csv
command:
x<-read.csv('C:/Users/user/Desktop/Employee data.csv')
x<-read.csv(file.choose(),header=TRUE)
which will let you choose the file you wish to import interactively, without
having to write down the actual path.
To import a dataset in RStudio, you can use the same commands used in R
or otherwise you may also make use of the menu Import Dataset. Say we
wish to import the file Employee data.csv. Working with the latest version of
RStudio as of August 2017 (Version 1.0.153), importing can be carried out
by going to Import Dataset, From CSV, choosing the file to import by clicking
on the Browse button as shown in Figure 1.3.4, and then pressing Import.
Figure 1.3.4: Importing Data in RStudio
Alternatively, one may use: Import Dataset – From Text File – choose the
directory from where you would like to get your file – open the file Employee
data.csv.
Note that RStudio automatically assigns a name to an imported dataset (the
name of the dataset may also be specified throughout the import procedure
and the name may also be changed once the dataset has been imported in
RStudio).
The command View(name of file) can be used for viewing the dataset in
spreadsheet form. For our example, the name automatically assigned by
RStudio to the imported employee dataset is Employee.data, so the command
View(Employee.data) displays the employee dataset in spreadsheet format.
Clicking once on the name Employee.data, in the menu underneath Data on
the right hand side of RStudio's main window, gives the same result.
The commands used for exporting data are the same for both R and RStudio.
One of the most commonly used commands is:
write.csv(x,'test.csv')
where we would be saving the dataset called x in the file called test.csv. The
new file test.csv will be created in the current working directory.
The output below shows that for the moment, R is still considering the
variables jobcat and minority as continuous. From the output below we can
also check whether there might be any values in the data which stand out in
terms of being too low or too high (minimum/maximum value of each
variable).
> summary(Employee.data)
       id         gender        bdate          educ           jobcat          salary
 Min.   :  1.0   f:216   02/04/1934:  2   Min.   : 8.00   Min.   :1.000   Min.   : 15750
 1st Qu.:119.2   m:258   02/08/1962:  2   1st Qu.:12.00   1st Qu.:1.000   1st Qu.: 24000
 Median :237.5           02/12/1964:  2   Median :12.00   Median :1.000   Median : 28875
 Mean   :237.5           04/05/1966:  2   Mean   :13.49   Mean   :1.411   Mean   : 34420
 3rd Qu.:355.8           05/11/1965:  2   3rd Qu.:15.00   3rd Qu.:1.000   3rd Qu.: 36938
 Max.   :474.0           10/20/1959:  2   Max.   :21.00   Max.   :3.000   Max.   :135000
                         (Other)   :462
    salbegin        jobtime         prevexp          minority
 Min.   : 9000   Min.   :63.00   Min.   :  0.00   Min.   :0.0000
 1st Qu.:12488   1st Qu.:72.00   1st Qu.: 19.25   1st Qu.:0.0000
 Median :15000   Median :81.00   Median : 55.00   Median :0.0000
 Mean   :17016   Mean   :81.11   Mean   : 95.86   Mean   :0.2194
 3rd Qu.:17490   3rd Qu.:90.00   3rd Qu.:138.75   3rd Qu.:0.0000
 Max.   :79980   Max.   :98.00   Max.   :476.00   Max.   :1.0000
> summary(salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15750 24000 28880 34420 36940 135000
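The factor commands themselves are missing from this extract; they were presumably along these lines (assuming the dataset has been attached, with attach(Employee.data), so that variables may be referenced by name):
> jobcat<-factor(jobcat,levels=c(1,2,3),labels=c('Clerk','Custodial','Manager'))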
Note that levels refers to the numeric codes present in the data, whilst labels
attributes string labels to those codes. In the case of the variable jobcat, we
are attributing the names Clerk, Custodial and Manager to levels 1, 2 and 3
respectively.
> variable<-factor(variable,levels=c(1,2,3),ordered=TRUE)
Now suppose that we need to rearrange our dataset such that the values
obtained on salary are in increasing order. The command used for this
purpose is:
> sal.inc<-Employee.data[order(salary),]
Note that in this case, a new dataset called sal.inc has been created so that you
can compare the data in Employee.data with that in sal.inc. Of course you
could have ordered the values in Employee.data directly by using:
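(The command is missing from this extract; it would presumably be:)
> Employee.data<-Employee.data[order(salary),]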
If you then wish to get your dataset ordered according to the person’s id, use
the command:
> Employee.data <-Employee.data[order(id),]
Now suppose that we wish to select cases. We make use of the logical
operators introduced earlier on in the notes.
Should we wish to select only those persons whose salary is greater than say
50000, we would use the command:
> Over.50000<-Employee.data[salary>50000,]
where we would be selecting all rows in the data that satisfy the condition
stated.
Columns may also be reordered by indexing. For example, to move the
variable bdate (column 3) to the last position:
> bdate.final<-Employee.data[,c(1,2,4:10,3)]
Also, now suppose that we wish to categorize the variable salary so that we
have a group of persons with a salary that is less than or equal to 50000 and a
group of persons with a salary that is more than 50000. Due to the way that
the command that will be used for this purpose works, it makes sense to check
whether there are any persons with a salary of 50000. This can be checked by
means of:
> length(which(salary==50000))
[1] 1
One way in which the variable salary can be categorized is then by means of:
> sal.cat <- cut(salary,
+ breaks=c(-Inf, 50000, Inf),
+ labels=c("low","high"))
> summary(sal.cat)
low high
403 71
Note that by default, for the cut function, the ranges defined by breaks are
open on the left and closed on the right, as in (-Inf, 50000]. If we needed to
divide the data into two groups, one group having a salary less than 50000 and
the other group having a salary of 50000 or more, then we would need to use
the command:
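(The command itself is missing here; the standard way, assumed rather than taken from the original, is to add the argument right=FALSE so that the intervals become closed on the left and open on the right:)
> sal.cat <- cut(salary,
+               breaks=c(-Inf, 50000, Inf),
+               right=FALSE,
+               labels=c("low","high"))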
If we then wish to add the newly created variable sal.cat to the employee
dataset, we use the command:
> new.data<-cbind(Employee.data,sal.cat)
The same command may also be used when labels need to be changed. Say
we made a mistake in using the label Manager for the third level of the
variable jobcat, and we wish to change the label Manager to Managerial. One
way of doing the recoding, without relying on the package plyr, is by using
the following commands:
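(The commands are missing from this extract; a base-R version consistent with the summary below, assumed rather than the authors' original, is:)
> jobcat.changed<-factor(jobcat,levels=c(1,2,3),labels=c('Clerk','Custodial','Managerial'))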
> summary(jobcat.changed)
Clerk Custodial Managerial
363 27 84
Now suppose that we wish to create a new variable where job categories are
grouped into Manager and Other. One way in which this recoding can be
done is by means of the command recode( ) from the package car:
> require(car)
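(The recode command itself is missing from this extract; a version consistent with the summary below, assumed rather than the authors' original, is:)
> jobcat.2cat<-recode(jobcat.changed,"'Managerial'='Manager'; else='Other'")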
> summary(jobcat.2cat)
Manager Other
84 390
You can check whether you have any missing values in the data by using the
command is.na(variable of interest). The possible outputs from this
command are TRUE or FALSE: TRUE will result wherever a value is
missing.
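(The definition of x is missing from this extract; the output below is consistent with, for example:)
> x<-c(3,5,7,NA,9)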
> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
and if we wish to count how many NAs we have in our variable x, then:
> sum(is.na(x))
[1] 1
To count the number of incomplete cases (rows containing at least one
missing value) in a whole dataset, use:
> sum(!complete.cases(Employee.data))
[1] 0
In a similar manner to the recoding shown earlier, we can also recode a
specific value to the missing value NA:
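(The recoding command is missing here; given the previous x, it was presumably of the form:)
> x[x==9]<-NA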
> x
[1] 3 5 7 NA NA
Should we wish to exclude missing values from the analysis, we make use of
the argument na.rm=TRUE, as follows:
> mean(x)   # x has two missing values, so its mean cannot be computed
[1] NA
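With na.rm = TRUE the missing values are ignored and the mean of the remaining values (3, 5 and 7) is returned:
> mean(x, na.rm=TRUE)
[1] 5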
The previous argument may be used so that missing data is ignored by a
specific command. We can, however, also create a new dataset without the
missing data: the command na.omit() carries out listwise deletion of missing
data.
> new.x<-na.omit(x)
> new.x
[1] 3 5 7
attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"
> x<-matrix(c(1,NA,3,4,6,NA),nrow=3)
> x
[,1] [,2]
[1,] 1 4
[2,] NA 6
[3,] 3 NA
> new.x<-na.omit(x)
> new.x
[,1] [,2]
[1,] 1 4
attr(,"na.action")
[1] 2 3
attr(,"class")
[1] "omit"
Advanced handling of missing data which goes beyond listwise deletion, may
also be carried out in R/RStudio by means of various packages. To mention
a few of these packages: Amelia, VIM, mice.
Objects that you create during an R/RStudio session are contained in what is
known as the workspace. This workspace is not saved unless you tell
R/RStudio to do so. Every time you are ending an R/RStudio session, you
will be asked whether you wish to save your workspace. If you do not save
the objects created during your session, these objects will be lost and you will
not be able to retrieve them in a subsequent session.
To see which variables have been created in the current session, use the
command ls().
To remove an object from the workspace, use rm(object).
To remove all objects from the workspace, use rm(list = ls() ).
> x<-c(1,2,3)
> ls()
[1] "x"
> rm(x)
> x
Error: object 'x' not found
Suppose that during one particular R/RStudio session you have created two
datasets x and y, and that you would like to use these datasets in future
sessions. Check that the directory in which you are currently working is the
one in which you would like to save. If it is, proceed to save x and y, using:
save(x, y, file="x.Rdata")
save(jobcat.2cat, file='C:/Users/user/Desktop/job2cat.Rdata')
Suppose that instead of saving one or two variables created during a session,
we wish to save the whole workspace; then we use the command
save.image(file path). To save on the desktop:
save.image("C:/Users/user/Desktop/wholeworkspace.RData")
If you then wish to load the saved workspace at a later date, you can do so by
means of the command load(file name), as in:
load("C:/Users/user/Desktop/wholeworkspace.RData")
Note that the save function also extends to saving images created in
R/RStudio. Consider the following:
Install the package plot3D, load this package and then enter the following
commands taken from Soetaert, K. (2013, Pg6):
par(mfrow = c(2, 2), mar = c(0, 0, 0, 0))
# Shape 1
M <- mesh(seq(0, 6*pi, length.out = 80), seq(pi/3, pi,
length.out = 80))
u <- M$x ; v <- M$y
x <- u/2 * sin(v) * cos(u)
y <- u/2 * sin(v) * sin(u)
z <- u/2 * cos(v)
surf3D(x, y, z, colvar = z, colkey = FALSE, box = FALSE)
# Shape 2: add border
M <- mesh(seq(0, 2*pi, length.out = 80),
seq(0, 2*pi, length.out = 80))
u <- M$x ; v <- M$y
x <- sin(u)
y <- sin(v)
z <- sin(u + v)
surf3D(x, y, z, colvar = z, border = "black", colkey = FALSE)
# shape 3: uses same mesh, white facets
x <- (3 + cos(v/2)*sin(u) - sin(v/2)*sin(2*u))*cos(v)
y <- (3 + cos(v/2)*sin(u) - sin(v/2)*sin(2*u))*sin(v)
z <- sin(v/2)*sin(u) + cos(v/2)*sin(2*u)
surf3D(x, y, z, colvar = z, colkey = FALSE, facets = FALSE)
# shape 4: more complex colvar
M <- mesh(seq(-13.2, 13.2, length.out = 50),
seq(-37.4, 37.4, length.out = 50))
u <- M$x ; v <- M$y
b <- 0.4; r <- 1 - b^2; w <- sqrt(r)
D <- b*((w*cosh(b*u))^2 + (b*sin(w*v))^2)
x <- -u + (2*r*cosh(b*u)*sinh(b*u)) / D
y <- (2*w*cosh(b*u)*(-(w*cos(v)*cos(w*v)) -
sin(v)*sin(w*v)))/D
z <- (2*w*cosh(b*u)*(-(w*sin(v)*cos(w*v)) +
cos(v)*sin(w*v)))/D
surf3D(x, y, z, colvar = sqrt(x + 8.3), colkey = FALSE,
border = "black", box = FALSE)
To export the resulting plot to Word from RStudio, the following procedure
may be used:
Export – Copy Plot to Clipboard – choose Metafile – Copy Plot, and then
paste into Word.
To export the resulting plot to Word from R, right-click on the plot and
choose the format in which you would like to save your picture.
The plot which results from the commands given is the following:
Figure 1.3.6: An example of a 3D Plot obtained with R/RStudio
Figure 1.3.7: Entering R Script in R
To save the script file: activate the R-editor by clicking on it, then click
on File and select Save as.
To open a script file, hit File and select Open script.
There are two possible ways of running the code in an R-script:
1. You can run the commands one line at a time. To do this, make sure
the cursor is at the first command, then hit the Run line or selection
button repeatedly. Each time you hit the button, a command is
executed and the cursor moves to the next line.
2. Alternatively you can highlight a section of the R-Script or all of the
Script and hit the Run line or selection button to execute the
selected commands at one go.
An Untitled R Script window will appear, and commands can be entered
in this window. You can also add comments in a script: all comments
should be preceded by a '#', and R will not execute anything that follows
the '#' on that line.
Commands are executed in a similar way as is done in the R-editor
window in R. Instead of the Run line or selection button there is a
Run button in the top right corner of the R Script window. Commands
in RStudio may also be run one step at a time by using the shortcut Ctrl
+ Enter on the keyboard.
Note that any script file assembled in R can be executed and edited in RStudio
and vice versa.
1.3.9 Data Manipulation
In this subsection we shall see how one can create new variables in an existing
data file. These new variables can be some transformed variables from the
original data set or indices computed from existing variables.
For more detail on importing data in R and RStudio refer to Section 1.3.4.
First we create a new dataframe so as not to alter the original data set and add
an empty column to this new data frame which we call Salary_Increase:
> Employee.data.new<-Employee.data
> Employee.data.new["Salary_Increase"] <- NA   # creates a new column named "Salary_Increase" filled with NA
> names(Employee.data.new)
 [1] "X...id"          "gender"          "bdate"           "educ"
 [5] "jobcat"          "salary"          "salbegin"        "jobtime"
 [9] "prevexp"         "minority"        "Salary_Increase"
Employee.data.new["THT"] <- NA
Employee.data.new$THT<-(2*Employee.data.new$salary - 0.5*Em
ployee.data.new$salbegin)/Employee.data.new$jobtime
summary(Employee.data.new$THT)
Note that the command cbind() may also be used to add a new column to a
dataset, assuming a vector such as salarydiff (say, the difference between
salary and salbegin) has already been computed:
> Employee.data.new <-cbind(Employee.data, salarydiff)
In the next section we shall see how descriptive statistics are carried out using
both SPSS and RStudio (hence also R).
2. DESCRIPTIVE STATISTICS
In the first section we discussed some of the basics of SPSS and R/RStudio.
In particular, we used the software to get familiar with importing and
manipulating datasets. In this section, we discuss procedures to obtain
various descriptive measures using SPSS and R/RStudio.
Note that:
For SPSS we shall make use of the Employee data.sav, which is
provided with SPSS.
When working with RStudio we shall make use of the dataset ‘pima’
which is found in the package ‘faraway’.
Using R version 3.1.2 or later
Packages Used in this section:
faraway
psych
e1071
plotrix
ggplot2
car
graphics
plyr
Recall that to obtain information about a package one can use the command
help in R. For example, to obtain information about what is contained in the
package faraway, type the following command:
help(package=faraway)
2.1 SUMMARIZING DATA
In SPSS, basic descriptive statistics are obtained by choosing Analyze,
Descriptive Statistics, Descriptives, which produces the dialogue box shown
below:
Figure 2.1.1
Now, to view the available descriptive statistics, click on the button Options,
which will produce the following dialogue box:
Figure 2.1.2
The ticked statistics in the above dialogue box are those descriptive statistics
which SPSS outputs by default, when this procedure is run. All the other
statistics may also be chosen. When all required statistics are chosen, pressing
Continue and Ok, will generate the output with the selected statistics in the
Output Viewer. For example, the selections from the preceding example
would produce the following output:
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Current Salary 474 $15,750 $135,000 $34,419.57 $17,075.661
Beginning Salary 474 $9,000 $79,980 $17,016.09 $7,870.638
Valid N (listwise) 474
Table 2.1.1
So, from Table 2.1.1 we know that 474 cases were considered for this
analysis. The average current salary (salary) of the respondents is $34,419.57
and their average starting out salary (salbegin) was $17,016.09.
Also, beginning salaries ranged from $9000 to $79,980 whereas current
salaries range from $15,750 to $135,000.
The standard deviation describes the typical amount by which values differ
from the mean. For example, in this case a current salary of $51,000.00 is
nearly one standard deviation above the mean, since mean ($34,419.57) +
standard deviation ($17,075.66) = $51,495.23.
Since the variance (or the standard deviation) is a key aspect in many
statistical analyses, it is useful to note that the standard deviation resulting
from the current salaries ($17,075.66) is much greater than the standard
deviation resulting from the beginning salaries ($7,870.64). Thus, there is a
larger spread in the current salaries.
If other statistics like variance, range, kurtosis and skewness were required,
the resulting output would be as follows:
Table 2.1.2
From Table 2.1.2 it may be noted that indeed the current salary has a much
larger range than the beginning salary, $119,250 as opposed to $70,980 and
as expected the current salary also has a larger variance (this was already seen
from the standard deviation results of the previous output).
It may also be noted that both variables have positive skewness and kurtosis;
indeed, both have extreme positive kurtosis. Thus, both distributions have
heavier tails than the normal distribution, salbegin in particular.
It should be noted that the variables chosen for the descriptive statistics
procedure are both continuous. In fact, it would not have made sense if we
decided to find the descriptive statistics of variables like gender or jobcat.
What would the mean of jobcat mean? So, even though the descriptive
statistics procedure is useful for summarizing data with an underlying
continuous distribution, it will not prove helpful for interpreting categorical
data. When analyzing categorical data, it makes more sense to obtain
information on the number of cases (frequencies) that fall into the various
categories. Descriptive measures for categorical data will be given shortly.
Another procedure which offers even more descriptive statistics is the Explore
procedure. This procedure is available through Analyze, Descriptive Statistics,
Explore. The resulting dialogue box is as follows:
Figure 2.1.3
Figure 2.1.4
Descriptives
Statistic Std. Error
Current Salary Mean $34,419.57 $784.311
95% Confidence Interval for Mean: Lower Bound $32,878.40, Upper Bound $35,960.73
5% Trimmed Mean $32,455.19
Median $28,875.00
Variance 291578214.500
Std. Deviation $17,075.661
Minimum $15,750
Maximum $135,000
Range $119,250
Interquartile Range $13,162
Skewness 2.125 .112
Kurtosis 5.378 .224
Beginning Salary Mean $17,016.09 $361.510
95% Confidence Interval for Mean: Lower Bound $16,305.72, Upper Bound $17,726.45
5% Trimmed Mean $16,041.71
Median $15,000.00
Variance 61946944.960
Std. Deviation $7,870.638
Minimum $9,000
Maximum $79,980
Range $70,980
Interquartile Range $5,168
Skewness 2.853 .112
Kurtosis 12.390 .224
Table 2.1.3
Figure 2.1.5
Frequency tables are obtained by choosing Analyze, Descriptive Statistics,
Frequencies (Figure 2.1.5). On moving the variables of interest into the
Variable(s) cell and then clicking Ok, the frequency tables are obtained. The
frequency tables below result when jobcat and educ are considered to be the
variables of interest:
Employment Category
Cumulative
Frequency Percent Valid Percent Percent
Valid Clerical 363 76.6 76.6 76.6
Custodial 27 5.7 5.7 82.3
Manager 84 17.7 17.7 100.0
Total 474 100.0 100.0
Table 2.1.4
Educational Level (years)
Cumulative
Frequency Percent Valid Percent Percent
Valid 8 53 11.2 11.2 11.2
12 190 40.1 40.1 51.3
14 6 1.3 1.3 52.5
15 116 24.5 24.5 77.0
16 59 12.4 12.4 89.5
17 11 2.3 2.3 91.8
18 9 1.9 1.9 93.7
19 27 5.7 5.7 99.4
20 2 .4 .4 99.8
21 1 .2 .2 100.0
Total 474 100.0 100.0
Table 2.1.5
Statistics
Employment Educational
Category Level (years)
N Valid 474 474
Missing 0 0
Mean 1.41 13.49
Median 1.00 12.00
Percentiles 25 1.00 12.00
50 1.00 12.00
75 1.00 15.00
Table 2.1.6
Table 2.1.6 confirms that finding, for example, the mean, median and
percentiles of jobcat, a categorical variable, does not make sense.
Clicking on the Charts button, however, produces the following dialogue box
which allows you to graphically examine any type of variable, in several
different formats:
Figure 2.1.6
Here we shall see how the statistics discussed in the previous subsection are
computed using R. Since interpretation is the same we shall not give too much
detail here but refer the reader to the previous sub-section for such detail.
Since we are going to make use of the dataset pima found in the faraway
package, we start by loading this package in R:
> library(faraway)
To get some information about the variables being used we can type
> ?pima
This dataset contains the following variables: pregnant (number of times
pregnant), glucose (plasma glucose concentration), diastolic (diastolic blood
pressure in mm Hg), triceps (triceps skin fold thickness in mm), insulin
(2-hour serum insulin), bmi (body mass index), diabetes (diabetes pedigree
function), age (in years) and test (whether the patient tested positive for
diabetes, coded 0/1).
Note that there are some missing values for the variables glucose, diastolic,
triceps, insulin, bmi and diabetes. These values are denoted as '0' in the data
set. We need to identify them as missing values, or else they will be used
when producing plots or calculating statistics. As mentioned in the
introductory session on R/RStudio, the missing value code used by R and
RStudio is NA. Thus our first step is to set all such zero values to NA through
the following commands:
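A minimal sketch of such recoding, one line per variable named above:
pima$glucose[pima$glucose == 0] <- NA
pima$diastolic[pima$diastolic == 0] <- NA
pima$triceps[pima$triceps == 0] <- NA
pima$insulin[pima$insulin == 0] <- NA
pima$bmi[pima$bmi == 0] <- NA
pima$diabetes[pima$diabetes == 0] <- NA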
To obtain basic summary statistics for all the covariates in the data one can
use the following command:
> summary(pima)
This gives the following output:
If on the other hand we want summary statistics for the variable glucose, only:
> summary(pima$glucose)
A command for calculating skewness and kurtosis can be found in the package
psych. Other packages such as moments may be used to calculate these
statistics. Therefore first we load the package:
> library(psych)
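(The describe command itself is missing from this extract; presumably it was a call such as the following, whose tabulated output is not reproduced here:)
> describe(pima, na.rm = TRUE, trim = .1)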
na.rm = TRUE is used to inform the software that missing values must be
ignored, while trim=.1 declares the proportion of trimming used in the
trimmed mean.
where:
vars = variable number
n= number of valid cases
mean = sample mean
sd= standard deviation
trimmed =trimmed mean (with trim defaulting to .1)
median=median
mad = median absolute deviation (from the median)
min =minimum value
max=maximum value
skew = skewness
kurtosis = kurtosis
se=standard error of the mean
Individual commands:
> mean(pima$bmi, na.rm = TRUE)
[1] 32.45746
> var(pima$bmi, na.rm = TRUE)
[1] 47.95546
> sd(pima$bmi, na.rm = TRUE)
[1] 6.924988
> median(pima$bmi, na.rm = TRUE)
[1] 32.3
> min(pima$bmi, na.rm = TRUE)
[1] 18.2
> max(pima$bmi, na.rm = TRUE)
[1] 67.1
> skew(pima$bmi, na.rm = TRUE)
[1] 0.5916179
> library(e1071)
> kurtosis(pima$bmi, na.rm = TRUE)
[1] 0.839607
Recall that the kth percentile of a sample of values divides the sample such
that k% of the values lie below or are equal to the kth percentile, and
(100-k)% of the values lie above it. For example, the 15th, 25th and 35th
percentiles of the variable bmi are obtained as follows:
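(The command is missing from this extract; the standard way is via the quantile function:)
> quantile(pima$bmi, probs = c(0.15, 0.25, 0.35), na.rm = TRUE)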
Interpretation:
The 15th percentile denoted by 15% in the output shown above, implies
that 15% of the BMI values are less than or equal to 25.10 while 85%
are greater than 25.10.
The 35th percentile, denoted by 35% in the output shown above,
implies that 35% of the BMI values are less than or equal to 29.56 and
hence 65% of the BMI values are greater than 29.56.
Similar interpretation holds for the 25th percentile denoted by 25%.
Quantiles are the same as percentiles but are typically indexed by sample
fractions rather than by sample percentages; in R they are calculated as
percentiles. So for the variable bmi the quartiles are obtained as follows:
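(Again the command is missing; by default, quantile returns the minimum, the three quartiles and the maximum:)
> quantile(pima$bmi, na.rm = TRUE)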
Interpretation:
The lower quartile, Q1, corresponds to the 25th percentile. From the
output above we can conclude that 25% of the BMI values are ⩽ 27.5,
while the other 75% of the values are > 27.5.
The median, Q2, corresponds to the 50th percentile. From the previous
output we can conclude that half the BMI values are ⩽ 32.3, while the
other half are > 32.3.
The upper quartile, Q3, corresponds to the 75th percentile. From the
output above we can conclude that three-fourths of the data are ⩽ 36.6,
while the remaining one-fourth are > 36.6.
The minimum and maximum values of the data are 18.2 and 67.1,
respectively.
The variable test is a categorical variable – a factor. R will treat such variables
as quantitative unless it is told that they should be treated as factors (look back
at the output obtained from summary(pima) earlier on). To designate such
variables as factors so that they are treated appropriately, we use the following
command:
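Assuming, as in the summary output below, that test is coded 0/1:
> pima$test <- factor(pima$test)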
Then
> summary(pima$test)
0 1
500 268
3. GRAPHICAL REPRESENTATIONS
In this chapter we shall continue using the ‘Employee’ dataset when using
SPSS and the ‘pima’ dataset when using R/RStudio.
PIE CHARTS
Pie charts are a very common type of graph best suited for a qualitative
variable (for example gender or jobcat in the Employee data set or test in the
pima data set). They consist of a circle divided into slices. Each slice corresponds to a different level of the categorical variable, while the size of each slice represents the frequency of that level: the bigger the slice, the higher the frequency of that level.
BAR GRAPHS
Bar graphs are a very common type of graph best suited for a categorical variable (for example gender or jobcat in the Employee data set or test in the pima data set).
Since there is no uniform distance between levels of a categorical variable,
the discrete nature of the individual bars is well suited for this type of variable.
In fact, bar graphs are a common way to graphically display the frequency of
each level of a categorical variable. Each column represents a different level
of the categorical variable, while the height of each column represents the frequency of that level. Three important bar charts are typically considered, namely: the simple, the clustered and the stacked bar chart.
HISTOGRAMS
Histograms are similar to simple bar graphs except that each bar represents a
range of variable values rather than just a single value. What makes this
different from a regular bar graph is that each bar represents a summary of
data rather than an independent value.
BOX PLOTS
Box plots summarize the distribution of a covariate through its median, quartiles and extreme values; they are discussed in detail in Section 3.1.4.
3.1.1 The Pie Chart
Click on Graphs, Legacy Dialogs, Pie. The following dialog box will appear:
Figure 3.1.1
Choose Summaries for groups of cases and click on Define. The dialog box in Figure 3.1.2 will then appear. Insert gender in the Define Slices by: field and then click on the OK button.
Figure 3.1.2
Figure 3.1.3
By double clicking on the pie chart in the SPSS output window, the chart editor window is opened, which allows one to edit this plot to make it more informative or attractive, or simply to change the colours of the slices. For example, if we click on Elements, Show data labels, in the chart editor, a new window will open which allows us to add values onto the slices, as can be seen in the figure which follows.
Figure 3.1.4
Figure 3.1.5
Figure 3.1.6
3.1.2 The Bar Graph (also known as a Bar Chart)
Click on Graphs, Legacy Dialogs, Bar. The following dialog box will appear:
Figure 3.1.7
As you can see in Figure 3.1.7, SPSS makes it possible for us to plot any of the three types of bar charts discussed earlier.
Click Simple and then Define. The following dialog box will appear:
Figure 3.1.8
Place the variable jobcat under Category Axis: and then click OK. The
following output will appear in the output viewer.
Figure 3.1.9
This is a simple bar chart for the variable jobcat. It shows clearly that in the
sample under study, most of the employees are clerical workers. As was the
case for pie charts it is possible to edit this output by double clicking on the
plot in the output viewer. This will allow us to change the colour of the bars,
edit legends and also add data labels. For example we can change Figure 3.1.9
as follows:
Figure 3.1.10
Figure 3.1.10 is more informative than Figure 3.1.9. It shows that 76.59% of the individuals selected in the sample under study are clerical workers, 17.72% are managers and the remaining 5.70% are custodial workers.
Click Clustered and then Define. The following dialog box will appear:
Figure 3.1.11
Place the variable jobcat under Category Axis:, gender under Define Clusters by: and click OK. A clustered bar graph for these two factors will appear in the output viewer and, after some editing (as was done for the simple bar graph), we obtain the following output:
Figure 3.1.12: Clustered bar graph for jobcat and gender
From the clustered bar graph in Figure 3.1.12 we note that in the sample under
study, 43.46% of the individuals are female clerical workers while 33.12% are
male clerical workers. 74 out of 84 managers are male. All 27 custodial
workers are male.
Click Stacked and then Define. The following dialog box will appear:
Figure 3.1.13
Place the variable jobcat under Category Axis:, gender under Define Stacks
by: and then click OK. A stacked bar graph for these two factors will appear
in the output viewer and after some editing (as was done for the simple bar
graph) we obtain the following output:
Figure 3.1.14: Stacked bar graph for jobcat and gender
Note that the stacked bar graph in Figure 3.1.14 displays the same information as the clustered bar graph in Figure 3.1.12; the information is just displayed in a slightly different format. Hence the interpretation provided after Figure 3.1.12 applies to Figure 3.1.14 as well.
3.1.3 The Histogram
Click on Graphs, Legacy Dialogs, Histogram. Then move the Current Salary into the variable list as follows:
Figure 3.1.15
As shall be seen later on in the notes, a number of statistical tests rely on the
normality assumption of the data. It may thus be of interest to plot the normal
distribution curve superimposed on our histogram. So as to obtain this
overlay, one may select Display Normal Curve before pressing OK.
Figure 3.1.16: Histogram for the variable Current Salary
Comparing the histogram with the normal curve may help one visualize the skewness and kurtosis features. Reminder: the normal curve has zero skewness and zero excess kurtosis. From Table 2.1.2 we know that the distribution of Current Salary has positive skewness and positive excess kurtosis. Figure 3.1.16, in fact, shows that the data is indeed positively skewed (due to its asymmetry) and also leptokurtic (our histogram has heavier tails than the normal curve).
3.1.4 The Box Plot
Each box plot shows the median, quartiles and extreme values of a covariate.
If we would like to obtain the median, quartiles and extreme values of the variable Current Salary, this can be produced by clicking on: Graphs, Legacy Dialogs, Boxplot, Simple.
Figure 3.1.17
Select Summaries of separate variables and click Define. Then move Current Salary (covariate) into the Boxes Represent list and press OK.
Figure 3.1.18
The following box plot is obtained:
Figure 3.1.19: Box plot for Current Salary
First of all, note that a box plot may be displayed either vertically, as in Figure 3.1.19, or horizontally. Now, the box plot is interpreted as follows:
The upper edge (hinge) of the box indicates the 75th percentile of the
data set and the lower hinge indicates the 25th percentile. The range
between the lower and upper quartiles is known as the inter-quartile
range.
The line in the box indicates the median value of the data and if the
median line within the box is not equidistant from the hinges, then the
data is skewed.
In our example, the line is closer to the 25th percentile, thus the box plot
is also showing that the distribution of the data is positively skewed.
The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case each whisker extends to the most extreme data point lying within 1.5 times the inter-quartile range from the box.
The points outside the ends of the whiskers are outliers or suspected outliers. Hence, from our plot, it is obvious that there are many outliers or suspected outliers in our data.
A box plot can also be used to analyze a covariate and a categorical variable
simultaneously. The covariate is summarized within the levels (categories) of
the categorical variable. Each box shows the median, quartiles and extreme
values within a category. If we would like to obtain the median, quartiles and
extreme values of Current Salary according to Employment Category, this can
be produced by clicking on: Graphs, Legacy Dialogs, Boxplot, Simple.
Figure 3.1.20
Select Summaries for groups of cases and click Define. Then move Current Salary (covariate) into the Variable list and Employment Category (factor) into the Category Axis field as follows:
Figure 3.1.21
If we would like to obtain the median, quartiles and extreme values of the
Current Salary according to Employment Category clustered by the factor
gender, the procedure is to click on Graphs, Legacy Dialogs, Boxplot, Clustered, Define.
Figure 3.1.23
Then do selections as shown in the figure which follows and press OK.
Figure 3.1.24
This will give the following plot:
Figure 3.1.25: Clustered box plot for Current Salary by Employment Category and gender
3.1.5 The Scatter Plot
Click on Graphs, Legacy Dialogs, Scatter/Dot, choose Simple Scatter and click Define. Then, move the variable Current Salary in the y-axis and the Beginning Salary in the x-axis as shown in Figure 3.1.26:
Figure 3.1.26: Creating the scatter plot
Upon clicking OK, a scatter plot appears in the output viewer. Double click
on the graph to open the chart editor and hence select Elements and Fit Line
at Total to get:
Figure 3.1.27: Scatter plot for Current and Beginning Salary
A straight line was plotted over the data in Figure 3.1.27 since there seems to
be a linear relationship between the two salary variables. This straight line is
known as the regression line or line of best fit. One common extension of the
above example is to plot separate regression lines for subgroups. For example,
we could plot separate regression lines for males and females in the above
example to visually examine whether the relationship between current salary
and beginning salary is the same for both males and females. To do this, we first place a categorical variable in the Set Markers by: field, as shown earlier, then double click on the resulting chart and select Elements and Fit Line at Subgroups to get:
Figure 3.1.28: Scatter plot for Current and Beginning Salary by
Gender
3.2.1 The Pie Chart
1. Simple Pie Charts
Commands:
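The commands used to produce Figure 3.2.1 are of the following form (inferred from the ReporteRs example given further down):
> slices <- summary(pima$test)
> pie(slices, labels = levels(pima$test), main = "Pie Chart of Test")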
Figure 3.2.1
For the advanced user, the package ReporteRs, together with the package ReporteRjars, may be used to export high-quality charts from R/RStudio to Word. Using the latter procedure, Figure 3.2.1 would need to be created as follows:
library(ReporteRs)
library(ReporteRjars)
mydoc <- docx()
slices <- summary(pima$test)
mydoc <- addPlot(mydoc, function() pie(slices,
    labels = levels(pima$test), main = "Pie Chart of Test"))
writeDoc(mydoc, file = "Pie Chart of Test.docx")
The chart produced will be saved in a new Word document entitled Pie Chart of Test.docx in the directory in which you are currently working. The command getwd() will remind you of the directory in which R/RStudio is currently working. If you use 64-bit R and you get a Java-related error on running the above commands, you will need to install 64-bit Java to be able to use the commands just given.
A further note: Encapsulated PostScript (.eps) used to be an alternative format for producing high-quality pictures for use in a Word document; however, as of April 2017, Microsoft no longer supports this format.
2. Pie Chart Displaying Percentages or Counts
Commands (Percentages):
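One possible set of commands (the way the labels are built is illustrative):
> slices <- summary(pima$test)
> pct <- round(slices / sum(slices) * 100)
> lbls <- paste(levels(pima$test), pct, "%")   # e.g. "negative 65 %"
> pie(slices, labels = lbls, main = "Pie Chart of Test")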
Commands (Count):
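Similarly, the counts themselves may be used as labels:
> lbls2 <- paste(levels(pima$test), slices)
> pie(slices, labels = lbls2, main = "Pie Chart of Test")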
Figure 3.2.3
3. 3D Pie Chart
Commands (Count):
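A 3D pie chart is not available in the base graphics package; one common option is the pie3D command from the plotrix package (the explode value is illustrative):
> library(plotrix)
> pie3D(slices, labels = slices, explode = 0.1,
+       main = "Pie Chart of Test")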
Figure 3.2.4
3.2.2 The Bar Graph
1. Simple Bar Graph
There are various colour options for the bars of a bar chart. A list of the possible colours that one can use is obtained by using the command 'palette' as follows:
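In older versions of R (before R 4.0), the default palette is:
> palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"  "gray"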
Such a graph can be obtained by using the command ‘barplot’ which is found
in the inbuilt package ‘graphics’. To gain more information about this
command type the following in R or RStudio:
help("barplot").
Commands:
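The commands that produce the bar chart itself would be of the following form (the names.arg values are taken from the levels of test); the object plot stores the x-coordinates of the bar midpoints, which are needed for the labels:
> counts <- summary(pima$test)
> plot <- barplot(counts, main = "Test Distribution",
+                 names.arg = c("negative", "positive"))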
> text(plot, 100, labels=counts,col="red") # adding labels
Figure 3.2.5
Now, let's create a more complex simple bar graph using various arguments:
Commands:
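A sketch of the kind of call intended, with the title and axis labels read off Figure 3.2.6 (the remaining argument values are illustrative):
> plot <- barplot(counts, main = "Test Distribution",
+                 xlab = "Test Outcome", ylab = "My Y Values",
+                 names.arg = c("negative", "positive"),
+                 col = "blue", density = c(10, 30), ylim = c(0, 550))
> text(plot, 100, labels = counts, col = "red")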
Figure 3.2.6: Bar chart entitled "Test Distribution", with Test Outcome (negative, positive) on the x-axis, My Y Values on the y-axis and the counts 500 and 268 displayed on the bars
Note that the ‘density’ field is used to add the shading – changing the values
in brackets changes the intensity of the shading.
2. Stacked Bar Graph
Two categorical variables are needed to construct a stacked bar chart. The data as it stands does not have two categorical variables. If the analysis justifies the categorization of a covariate, we can proceed by recoding age into a new factor, agecat, with levels:
1 = 20-45 "Young", 2 = 46-65 "Middle Aged", 3 = 66+ "Elder"
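One possible way of creating this factor (the break points assume age is recorded in whole years):
> pima$agecat <- cut(pima$age, breaks = c(19, 45, 65, Inf),
+                    labels = c("Young", "Middle Aged", "Elder"))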
Having created a new factor, we can proceed to plotting a stacked bar graph for test and agecat.
Commands:
First one must construct a contingency table which will be used to plot the bar
chart:
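This is the same table that is printed further down:
> counts2 <- table(pima$test, pima$agecat)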
Note that: the field 'width' in the command 'barplot' defines the width of each bar, 'legend.text=T' places the legend in a default position, and 'xlim=c(lower,upper)' is used to specify the horizontal limits of the plot.
Commands:
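A sketch of the call that produces Figure 3.2.7 (the width, xlim and title values are illustrative):
> barplot(counts2, width = 2, legend.text = T, xlim = c(0, 20),
+         main = "Test by Age Group")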
Figure 3.2.7
If we want to change the location of the legend we can use the field ‘inset’ as
shown below:
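barplot passes extra legend settings through its args.legend argument; for example (the position and inset values are illustrative):
> barplot(counts2, width = 2, legend.text = T, xlim = c(0, 20),
+         args.legend = list(x = "topright", inset = c(0.2, 0)),
+         main = "Test by Age Group")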
Figure 3.2.8
Commands:
> library(ggplot2)
> qplot(pima$test, geom = "bar", fill = pima$agecat,
+       xlab = "Test outcome", main = "Test by Age Group Distribution")
Figure 3.2.9
The legend title may also be changed by including + labs(fill = 'NEW LEGEND TITLE') at the end of the previous command.
The command ‘barplot’ does not offer an option to show percentages or counts
on the stacked bar charts. If we want the counts or percentages on the charts
we can use the command ‘ggplot’ in package ‘ggplot2’.
> counts2<-table(pima$test,pima$agecat)
> counts2
Elder Middle Aged Young
negative 7 45 440
positive 2 47 210
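The counts must then be arranged in a data frame with one row per combination of age group and test outcome. One way of constructing the data frame shown below (the column names and capitalised outcome labels are taken from the printed output):
> Data <- data.frame(
+   Age_group = rep(c("Elder", "Middle Aged", "Young"), each = 2),
+   Test_outcome = rep(c("Negative", "Positive"), 3),
+   Frequency = c(7, 2, 45, 47, 440, 210))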
> Data
Age_group Test_outcome Frequency
1 Elder Negative 7
2 Elder Positive 2
3 Middle Aged Negative 45
4 Middle Aged Positive 47
5 Young Negative 440
6 Young Positive 210
Then to obtain a stacked bar chart with labels we use the following commands:
library(plyr)
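A sketch of one common approach: plyr's ddply computes the cumulative heights at which each count is printed, and ggplot draws the stacked bars with the labels on top (the vjust and colour values are illustrative, and the label positions may need adjusting depending on the stacking order):
> Data <- ddply(Data, "Age_group", transform,
+               label_ypos = cumsum(Frequency))
> library(ggplot2)
> ggplot(Data, aes(x = Age_group, y = Frequency, fill = Test_outcome)) +
+     geom_bar(stat = "identity") +
+     geom_text(aes(y = label_ypos, label = Frequency),
+               vjust = 1.6, colour = "white")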
Figure 3.2.10
3. Clustered Bar Graph
Here we can use the command barplot again, but we need to add the field beside=T, which tells R to place the bars next to each other rather than stacking them one on top of the other.
Commands:
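A sketch of the side-by-side call (the title is illustrative); the returned object plot2 holds the x-coordinates of the bar midpoints, which are needed for the labels:
> plot2 <- barplot(counts2, beside = T, legend.text = T,
+                  main = "Test by Age Group")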
The next command computes the vertical positions at which the count labels will be placed, just above each bar. Change the value 49 to see what happens.
> ypos.outside<-apply(t(counts2), 1, function(x) x + 49)
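The labels are then placed at these positions, for example:
> text(plot2, ypos.outside, labels = counts2)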
Figure 3.2.11
3.2.3 The Histogram
Command:
h<-hist(pima$diastolic,main="Histogram for Diastolic",
xlab="Diastolic",ylim=c(0,250))
Figure 3.2.12
We can add a normal distribution curve to this histogram by adding the
following commands after the previous command:
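One common recipe, which uses the histogram object h created above to scale the normal density to the counts (the number of points and line settings are illustrative):
> xfit <- seq(min(pima$diastolic, na.rm = TRUE),
+             max(pima$diastolic, na.rm = TRUE), length = 40)
> yfit <- dnorm(xfit, mean = mean(pima$diastolic, na.rm = TRUE),
+               sd = sd(pima$diastolic, na.rm = TRUE))
> yfit <- yfit * diff(h$mids[1:2]) * length(na.omit(pima$diastolic))
> lines(xfit, yfit, col = "blue", lwd = 2)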
Figure 3.2.13
3.2.4 The Boxplot
Commands:
> library(car)
> Boxplot(pima$diastolic,main="Diastolic",id.method="none")
Figure 3.2.14
Hence if instead we type:
> Boxplot(pima$diastolic,main="Diastolic")
[1] 19 126 598 600 44 85 107 178 363 550 659 663 673 692
Note that the values 19, 126, etc. identify the rows of the data set in which the outliers can be found.
Figure 3.2.15
Next we obtain a boxplot for ‘diastolic’ for each level of the variable ‘test’.
Command:
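Since the car package is already loaded, one form of the command is:
> Boxplot(pima$diastolic ~ pima$test, main = "Diastolic")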
Figure 3.2.16
Command:
> myboxplot2 <- boxplot(pima$diastolic ~ pima$test,
+                       main = "Diastolic", xlab = "Test Result",
+                       ylab = "Diastolic Blood Pressure")
Typing the name of the object then prints the statistics on which the plot is based (the $stats component is not reproduced below):
> myboxplot2
$n
[1] 481 252
$conf
[,1] [,2]
[1,] 68.84733 72.90751
[2,] 71.15267 76.09249
$out
[1] 30 122 108 110 24 106 106 40 110 30 110 114
$group
[1] 1 1 1 1 1 1 1 2 2 2 2 2
$names
[1] "negative" "positive"
where,
stats = a matrix; each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for each group/plot
n = a vector with the number of observations in each group
conf = a matrix where each column contains the lower and upper extremes of the notch
out = the values of any data points which lie beyond the extremes of the whiskers
group = a vector of the same length as out whose elements indicate to which group the outlier belongs
names = a vector of names for the groups
Figure 3.2.17
For more details on how to interpret a box plot please refer to Section 3.1.4.
3.2.5 The Scatter Plot
1. Bivariate
Command:
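Judging by the regression command that follows, the scatter plot has glucose on the y-axis and bmi on the x-axis (the axis labels are illustrative):
> plot(pima$bmi, pima$glucose, xlab = "BMI", ylab = "Glucose")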
Figure 3.2.18
To fit a simple regression line onto Figure 3.2.18, issue the following command after the previous one:
> abline(lm(pima$glucose ~ pima$bmi))
# fitting a regression line on a scatter plot
Figure 3.2.19
2. Matrix
The following commands may be used to obtain a matrix of scatter plots for
the first 4 covariates in the data set ‘pima’.
Command:
> pairs(pima[1:4])
Figure 3.2.20