
STATISTICAL ANALYSIS

USING
SPSS AND R SOFTWARE

Department of Statistics and O.R.


University of Malta

PREFACE

These notes have been compiled by Dr. Monique Borg Inguanez, Dr. Fiona Sammut and Dr. David Suda. The aim of this set of notes is to give an introduction to statistical analysis using two very popular software packages, namely SPSS and R. Although these notes have been compiled using Windows, both packages also run on other operating systems.

The material presented here covers the syllabi required for SOR0223, SOR1232 and SOR1220. Some material is beyond the syllabus required for certain study units, and some material may also have been covered in study units which students might have followed earlier. The authors felt that the extra material should be included so as to have a complete set of notes which students can use even after they graduate. This will by no means increase the number of lecturing hours. Students can experiment with most of the introductory material on their own. The respective lecturers will indicate which sections are relevant for each study unit.

0. INTRODUCTION – DIFFERENT TYPES OF
VARIABLES

As explained in the preface, this study unit focuses on teaching how to use
statistical software to analyse various types of data sets. To be able to identify
which techniques are most suitable for the analysis, we need to understand the
nature of the variables being considered.

In our data set we may have both quantitative and qualitative variables.

• The response given on a quantitative variable is something that can be counted or measured, such as the number of persons waiting at a bus stop, the number of assembly machines that are out of order, the tensile strength of concrete, the total cholesterol level, etc.

• The response given on a qualitative variable can be viewed as a
description, a name or a label, such as the means of transport used to get
to university (car, bus, bicycle, motorcycle, on foot), the stage of a disease
(initial, advanced) etc. A qualitative variable is also known as a
categorical variable. A categorical variable is one for which the
measurement scale consists of a set of categories. Categorical scales are
commonly used in the social sciences for measuring attitudes and opinions
on various issues. Also, categorical scales frequently occur in the health
sciences to measure responses such as whether a patient survives an
operation (yes, no); severity of an injury (none, mild, moderate, severe).
Despite categorical variables being common in the social and health
sciences, they are not limited to only these two areas. Categorical
variables frequently occur in the behavioural sciences as well. For
example, in education: an exam in which students have to choose whether
an answer is true or false, in zoology: categorizing different species (fish,
invertebrate, reptile, mammal), in public health: awareness of the public
of risks for obesity (yes, no), in engineering sciences where items are
classified according to whether or not they conform to certain standards,
in marketing: preference out of brand A, B or C.
There are two types of quantitative variables, discrete and continuous:

• A discrete (count) variable can take only whole numbers as response (no
decimal places), such as the number of employees in a company, age in
years, number of patients in a ward.

• A continuous variable can take on any number on the real line or a subset
of it, such as salary, height, weight, blood pressure.

Variables that are either discrete or continuous will be referred to as covariates throughout these lecture notes.

There are also two types of measurement scales for qualitative (categorical)
variables, ordinal and nominal:

• Those categorical variables whose response has a natural ordering, such
as: too low, about right, too high, are said to have ordered scales and
are called ordinal variables. A Likert item, commonly used in
attitudinal measurements, is measured on an ordinal scale. A Likert item typically consists of a five-point scale: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree. A
variable with this type of response may for example be used to measure
level of satisfaction with the health system in the Maltese Islands. A
Likert item may also be rated on a 4-point, 6-point or 7-point response
scale.

• Categorical variables having unordered scales are called nominal
variables. For example: religious affiliation (Catholic, Jewish,
Protestant, Muslim, etc) or transportation to work (car, bicycle, bus,
walk). A nominal variable with only two possible response categories
is also called a dichotomous variable.

So for nominal variables, the order of listing the categories is irrelevant and
any statistical analysis carried out on this type of data does not depend on the
ordering of the data. The methods designed for nominal variables give the
same results no matter in what order the categories are listed. On the other
hand, methods designed for ordinal variables utilize the category ordering and
results of ordinal analyses would change if the categories were reordered in
any other way. In view of this, methods designed for ordinal variables should
not be used with nominal variables. Methods designed for nominal variables
may be used with either nominal or ordinal variables.

The types of variables described above can be summarized in the following hierarchy:

Data
• Qualitative variables (Categorical): Ordinal or Nominal
• Quantitative variables (Covariates): Continuous or Discrete

Note that in some texts on modelling, the term covariates refers to explanatory variables, which can be qualitative or quantitative.
1. INTRODUCING THE SOFTWARE

This first section will serve as an introduction to SPSS, R and RStudio. The
second section will focus on how to create or import a dataset, transform
variables, manipulate data and perform descriptive statistics using either
software.

The second part of the notes will describe some commonly used descriptive
statistics. In the third part, focus will be placed on graphical representations
and the fourth part will cover more advanced topics such as parametric and
non-parametric tests, statistical modelling and categorical data analysis.

1.1 SPSS

SPSS stands for Statistical Package for the Social Sciences. This software
package was first produced by SPSS, Inc. in Chicago, Illinois and acquired by
IBM in 2009. An updated version of the software is issued every year. The
version used to compile these notes is officially named IBM SPSS Statistics
23. In these notes, the software will always be referred to as SPSS.

SPSS is used for conducting statistical analyses, manipulating data and
generating tables and graphs that summarize data. Statistical analyses can
range from basic descriptive statistics, such as averages and frequencies, to
advanced inferential and multivariate statistical procedures such as analysis
of variance (ANOVA), time series analysis, factor analysis, cluster analysis
and categorical data analysis, to mention a few. SPSS also contains several
tools for manipulating data, including functions for recoding data and
computing new variables as well as merging and aggregating datasets.

To start using SPSS, locate the SPSS icon under the Programs menu item. Otherwise, if you are already in possession of a file that has been created by SPSS, you can also start SPSS by double-clicking on that SPSS file.

The latter is, however, not recommended as it usually takes longer to open a file
in this manner. SPSS consists of different windows, each of which is
associated with a particular SPSS file type. The most important of these
windows are the Data Editor (.sav files) and the Output Viewer (.spv files)
windows.

1.1.1 The Data Editor

The Data Editor is the window that is open at start-up and is used to enter and
store data in a spreadsheet format. The Data View is the sheet that is visible
when you first open the Data Editor, and this is the window which contains
the data; it displays the contents of the currently open dataset, also known as the working dataset.

The figure which follows shows the Data Editor containing the Employee
data.sav dataset:

Figure 1.1.1
From Figure 1.1.1 you can see that at the top of the Data Editor window there
are several menu items that are useful for performing various operations on
the working dataset. All data manipulations, statistical functions, and other
SPSS procedures operate on the currently open dataset.

If you wish to analyze an SPSS data file which is already stored in the
computer, the file may be opened in the Data Editor window, by choosing the
menu options File, Open, Data and then selecting the directory in which the
dataset is stored.

If the file to be opened is not an SPSS data file, the Open menu option may
also be used to import the file directly into the Data Editor. We shall shortly
demonstrate how this may be done. If the data file is in a format which is not
recognized by SPSS, then the software package in which the file was
originally created may be used to try to translate the file into a format that can
be imported and read by SPSS (e.g. tab-delimited data or an Excel data file).

Note that, the Data View window contains variables (information available
for each case, such as I.D. card number, gender, education, etc.) in columns
and cases (individuals or observational units) in rows. Thus, cell entries will
be numbers (or words or symbols) telling a single value of a variable for a
particular case.

Apart from the Data View sheet, there is also the Variable View sheet. The
latter sheet may be accessed by clicking on the tab labelled Variable View and
while the second sheet is similar in appearance to the first, it does not actually contain data. Instead, this second sheet contains information about the variables in the dataset. In fact, by means of this view, one can name and define variables, define labels and values, define missing data and also select
the type of alignment and measure for the data entered. Note that many of the
cells available in the Variable view spreadsheet (see Figure 1.1.2 below)
contain hidden dialogue boxes that may be activated by clicking on the cells
themselves.

Figure 1.1.2

By default, on entering new data, the variable names are VAR00001,


VAR00002, etc.; however, it is good practice to change the default names to
make the data analysis process easier and less prone to errors. Variable names
may easily be replaced by just listing the new names of the variables in the
Name column in the Variable View window. Note that, no spaces are allowed
in the variable names and the names must always start with a letter. Multiple
words in a variable name should be linked by an underscore.

Figure 1.1.3

SPSS should also be provided with information about the type of variables
being used in the dataset. This is often critical for SPSS to process the correct
analyses. As can be seen in Figure 1.1.3, SPSS accepts various variable types.

Numeric: This option is used for variables with numerical values. For
example:

• Continuous variables (such as salary, height, weight, blood pressure, ...): such variables can take on any number on the real line or a subset of it. The number of decimal places required is set in the Decimals setting. For such
variables, the Measure setting should be defined as Scale.

• Discrete or count variables (such as the number of employees in a company, age in years, the number of patients in a ward, ...): such variables take only integer values, hence 0 decimal places. For such variables, the Measure setting
should also be defined as Scale.

• Categorical variables (in some statistical analyses these are referred to
as Factors) take numerical values from a predefined set of integers. The
Decimals setting is hence set to 0. Such variables are divided into two classes:

(i) Nominal – which refers to variables that have been coded numerically
where order is not important (e.g., recording a subject's gender as 1 if
male, 2 if female, 3 if other, or an answer to a question as 1 if yes and
0 if no, or assigning integers to districts or cities). For such variables,
the Measure should be set as Nominal.

(ii) Ordinal - which refers to variables that have been coded numerically
but there is some form of ordering in the numbers (e.g., a Likert item
with responses 1=Good, 2=Better, 3=Best). For such variables,
the Measure should be set as Ordinal.

Labels for the different values that a categorical variable may take should
always be defined in SPSS. One of the main reasons for doing this is to help
with the interpretation of the output. For example, considering the variable
jobcat in the Employee dataset, it is coded as either 1, 2, or 3 for employment
categories: clerical, custodial and managerial. Now, imagine having to
interpret some plot or some table from the output and all the time reminding
yourself that 1 stands for clerical, 2 stands for custodial and 3 stands for
managerial. But, on letting SPSS know that 1 stands for clerical, that is, by
assigning labels to the values being used, the labels will appear in the output
rather than the numbers, making interpretation much easier. Such labels are
defined through the Values setting as shown below.

Consider the variable gender:

Figure 1.1.4

Click on the icon to obtain the following window:

Figure 1.1.5

Enter the numerical value in the Value: field and the corresponding label in
the Label: field. Press Add each time and press OK when you have labelled
all the levels of the factor. The Change and Remove buttons may be used
any time you need to change or remove any predefined labels.

String: This option is used for variables which contain text instead of numeric
values. The values of string variables may include numbers, letters, or
symbols. Examples of string variables are names of individuals, zip codes,
phone numbers, I.D. numbers, free-response answers to survey questions, etc.
Such variables are not used in any calculations. In Figure 1.1.2 it may be noted
that gender is set as String. Note that some SPSS procedures (such as the
independent samples t-test; ANOVA; non-parametric tests; etc.) require that
categorical variables be coded as numeric. In SPSS it is very easy to convert
string variables to numeric as we shall see later on.

Comma: Numeric variables that include commas that delimit every three
places (to the left of the decimals) and use a dot to delimit decimals. For
example 1,000.20 or 23,456.32. This is the convention we use.

Dot: Numeric variables that include dots that delimit every three places (to
the left of the decimals) and use a comma to delimit decimals. Example
1.000,20 or 23.456,32. We do not usually use this convention.

Width: refers to how many characters a value can hold.

The other options are self-explanatory.

1.1.2 The Output Viewer

All output from statistical analyses, as well as other useful information, is printed to the Output Viewer window. When you execute a command for a statistical analysis, the output will appear in the Output Viewer. Other items that you may want printed to the Output Viewer include command syntax, titles, and error messages.
statistics for the employee dataset is shown in Figure 1.1.6. Detail on how to
obtain descriptive statistics in SPSS will be given at a later stage.

Figure 1.1.6

The left frame of the Output Viewer contains an outline of the objects
contained in the window. For example, everything under Descriptives in the
outline refers to objects associated with the descriptive statistics, the Title
object refers to the bold title Descriptives in the output while the highlighted
icon labeled Descriptive Statistics refers to the table containing descriptive
statistics. The Notes icon has no referent in the example being considered
here, but it would refer to any notes that appeared between the title and the
table.
Note that, by clicking on an icon, you can move to the location of the output
represented by that icon in the Output Viewer. This makes the outline useful
for navigating in the Output Viewer when there are large amounts of output.
Also, the outline is a tool that may be used for copying or deleting objects;
proceeding by first highlighting the objects of interest in the outline and then
performing the operation as needed.

Figure 1.1.7 shows that what is displayed in the output may also be
customized. This is possible by using the Options menu item on the Edit
menu:

Figure 1.1.7

1.1.3 The Syntax Editor

Another important window in the SPSS environment is the Syntax Editor. In
the early versions of SPSS, all of the procedures performed by SPSS were
submitted through the use of syntax which instructed SPSS on how to process
the data (DOS based). All recent versions of SPSS contain pull-down menus
with dialog boxes that allow submission of commands to SPSS without the
need to write syntax.

These SPSS for Windows lecture notes will focus on the use of the dialog
boxes to execute procedures. However, there are a couple of important
reasons why you should be aware of SPSS syntax even if you plan to primarily
use the dialog boxes:

• Not all procedures are available through the dialog boxes. Therefore,
occasionally, submission of commands will have to be done through
the Syntax Editor.

• Your own procedures may be saved as syntax so that they may be rerun
by means of the Syntax Editor at a later date.

SPSS syntax may easily be generated without even having to type in the
Syntax Editor. The process is illustrated below:

The following dialog box is used to generate descriptive statistics. Here, only the Paste button in the dialog box is relevant; the process used for generating descriptive statistics is described later. (The dialog boxes available through
the pull-down menus all have a button labeled Paste).

Figure 1.1.8

By clicking on the Paste button, the procedure that the above dialog box is
prepared to run will be written in SPSS syntax in the Syntax Editor, as follows:

Figure 1.1.9

Note that running the resulting SPSS syntax produces exactly the same output as would have been generated by clicking the OK button in the above dialog box.
Also, the syntax that is printed to the Syntax Editor can then be saved and run at a later time, as long as the same dataset, or at least a dataset containing variables with the same names, is active in the Data Editor window. Saving the syntax also comes in useful if the analysis needs to be rerun after more data has been added to some variables, or if the same analysis is to be carried out on another dataset that contains the same variables.

1.1.4 Importing Data From Excel Files

Since we may have our datasets saved on an Excel spreadsheet, it is desirable
to work with a software package which can read Excel files. SPSS has this
desirable feature. An Excel spreadsheet may be imported into SPSS just by
selecting File, Open, Data in the Data Editor window, selecting the desired
location on disk, then selecting Excel from the Files of type drop-down menu and then double-clicking on the file, which should now show in the main box
in the Open File dialog box.

One further dialogue box, like the one shown below, will then appear:

Figure 1.1.10

This dialogue box allows you to select a spreadsheet from within the Excel
Workbook. In this particular case, since the Excel file contains five different
spreadsheets, the drop-down menu in Figure 1.1.10 offers five sheets to
choose from.
As SPSS only operates on one spreadsheet at a time, only one spreadsheet
may be selected at a time. Note that, if not all variables of an Excel sheet are
required for the analysis, the range has to be specified, otherwise, the whole
spreadsheet will be imported to SPSS. Also, by ticking Read variable names
from the first row of the data, SPSS reads the first row of the spreadsheet as
the variable names. If the spreadsheet contains no variable names, make sure
that the box remains unmarked. If, once in the SPSS environment, we would like to add variable names to the dataset, we can follow the procedure described in Section 1.1.1.

You should now see data in the Data Editor window. Check to make sure that
all variables and cases were read correctly. Next, save your dataset in SPSS
format by choosing the Save option in the File menu.

1.1.5 Importing Data From ASCII Files

Data is often stored in an ASCII file format, alternatively known as a text or
flat file format. Typically, columns of data in an ASCII file are separated by
a space, tab, comma or some other character. By means of a Text Import
Wizard, SPSS is able to import data saved in an ASCII file format.

The Text Import Wizard will open automatically when an ASCII file, with a
.txt or .dat extension, is to be opened using the Open option in the File menu.
However, if the data file to be imported does not have a .txt or .dat extension but is an ASCII file, then it may be imported by selecting Read Text Data from the File menu, which opens the Text Import Wizard.

The first window to pop up will ask the user to choose the file from the
directory to be imported. Then, a series of dialogue boxes, starting with the
one shown in Figure 1.1.11, will show up and will guide the user with the
retrieval of the data. Once the data has been imported and checked for
accuracy, a copy of the dataset should be saved in SPSS format by selecting
the Save or Save As options.

Figure 1.1.11

1.1.6 Inserting and Deleting Cases and Variables

Frequently, we would like to add new variables or cases to an existing dataset. SPSS provides us with easy ways to do both tasks. To insert a new
variable, just enter the datapoints one underneath the other in a new column
and then change the default name assigned to the variable as desired.

To insert a case (or variable), select the row (or column) in which the case (or
variable) is to be added, right click and choose Insert Cases (or Insert
Variables) from the resulting menu and a blank row (or column) will result
automatically. The latter procedure could also be done by clicking on the
row's number or on the column's name, then use the insert options available
in the Data menu in the Data Editor.

Again, this will produce an empty row or column in the highlighted area of
the Data Editor. Existing cases and variables will be shifted down or to the
right, respectively.

A similar process may be carried out if a deletion of a case or a variable is
required. To delete a case (or variable), select the row (or column) in which
the case (or variable) is to be deleted, right click and choose Clear from the
resulting menu and the row (or column) will be removed entirely. This
procedure may also be carried out by clicking on the row's number or on the
column's name, then use the clear option available in the Edit menu in the
Data Editor.

1.1.7 Computing New Variables

A useful feature of SPSS is the possibility to compute a new variable out of existing variables. For example, in the Employee dataset, we might wish to include a variable showing the difference between the current salary and the starting salary of the employees. As another example, we might want to present the variable jobtime in number of years rather than in number of months (dividing by 12).

The computation of any new variable is done using the Compute option
available from the Transform menu in the Data Editor. First, the dialogue box
which follows will appear:

Figure 1.1.12

Figure 1.1.13
Then, so as to create a new variable, type its name in the box labeled Target
Variable. In this case, it has been named Difference. Next, the
expression to be computed is either typed into the Numeric Expression cell
directly or the expression may also be entered using the input values or
operators located underneath the Numeric Expression cell.

The new variable created in this particular case is Difference = salary - salbegin, that is, the difference between an employee's current salary and the employee's beginning salary. On pressing OK, the new variable will be computed and it
will appear in the rightmost column of the working dataset.
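For comparison, and looking ahead to the R sections, the same derived variable can be created in R with a single assignment. This is a minimal sketch which assumes the Employee data has already been imported as a data frame named Employee.data (see Section 1.3.4):

> # create a new column holding the difference between the two salaries
> Employee.data$Difference <- Employee.data$salary - Employee.data$salbegin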

Now, we might also wish to, for example, compute the difference in salaries,
on the condition that the persons to be considered should have less than 15
years of education. This may easily be done by using the Compute option as
before together with clicking on the If button, to get the dialogue box which
follows:

Figure 1.1.14

In this way, the Include if case satisfies condition option is activated and the condition for computing the new variable is entered underneath.

Figure 1.1.15

Figure 1.1.15 illustrates the condition that requires cases to have less than 15
years of education in order to be included in the computation of the new
variable.

Clicking the Continue button, we return to the previous dialogue box and then
by clicking Ok once more, the new variable will appear in the rightmost
column of the working dataset:

Figure 1.1.16
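A comparable conditional computation can be sketched in R with the ifelse() function, again assuming the data frame is named Employee.data; cases not satisfying the condition are given a missing value (NA), mirroring the SPSS behaviour:

> # difference only for employees with less than 15 years of education
> Employee.data$Difference <- with(Employee.data,
+     ifelse(educ < 15, salary - salbegin, NA))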

1.1.8 Recoding Variables

There are many instances in which we need to recode variables, such as
changing values from letters to numbers, increasing or decreasing the number
of possible values, imposing a cut-off score, etc. SPSS allows us to recode
variables and then use the recoded variables in statistical analyses.

For example, the variable jobcat in the Employee dataset is classified into 3 different categories. Consider now that, for a particular analysis, we would
like to combine category 1 and 2 into a single category. Thus, the
categorization of the variable would then have to be recoded such that the
previous categories 1 and 2 would now fall into category 1 and the previous
category 3 would now be the second category. The procedure is as follows:

Press Transform and then Recode from the menu in the Data Editor. A choice needs to be made between Into Same Variables, an option which changes the values of the existing variable, and Into Different Variables, an option which creates a new variable with the recoded values.

Note that both options will lead to the same result, but Into Different Variables is preferred because, if you change your mind about your recoding scheme at a later date, you still have access to the original variable values.

Now, opting for Into Different Variables option will produce the following:

Figure 1.1.17

First the variable which requires recoding must be chosen from the existing
dataset (for this example, we shall use jobcat) and then, by clicking on the
arrow button, the same variable name should appear in the cell Input Variable
-> Output Variable. Next, the name of the new variable (jobcatchanged) must
be supplied, together with an optional corresponding label:

Figure 1.1.18

Then, the old and new categories have to be specified in the dialogue box Old and New Values as shown below:

Figure 1.1.19

Note that, the original value has to be entered in the Old Value cell and
similarly, the new value has to be entered in the New Value cell.

Afterwards, it is important to click on Add to save the recoding. Then press Continue and OK. The newly coded variable will appear at the rightmost part of the data entry window. The same process has to be repeated for any other recoding needed.

It should be noted that the dialogue box shown in Figure 1.1.19 is the same
regardless of whether we are recoding values into the same variable or
creating a new variable.

Now, as for the computation of new variables, recoding of variables may also
be done conditionally. Recoding values given a condition is done in exactly
the same manner as just discussed but with the inclusion of the condition being
specified by means of the If button.
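As a point of comparison, the same recoding can be sketched in R. Assuming jobcat is still stored as the numeric codes 1, 2 and 3, and using jobcatchanged as a hypothetical name mirroring the SPSS example:

> # old categories 1 and 2 become 1; old category 3 becomes 2
> Employee.data$jobcatchanged <- ifelse(Employee.data$jobcat %in% c(1,2), 1, 2)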

1.1.9 Sorting Cases

A very good way to look at your data and see whether there is any missing data or any incorrect data entry is to sort the data. Sorting of cases allows you to organize rows of data in ascending or descending order.

For example, we may want our data to be sorted such that the variable jobtime
is in increasing order or we may want the salary sorted within the variable
jobtime. SPSS makes it all possible, since the procedure for sorting may be
done on the basis of one or more variables.

Sorting of cases in SPSS is carried out by pressing Data, followed by Sort
Cases in the Data Editor, to give:

Figure 1.1.20

Only two options come with this dialogue box: the first being the variable(s) to be sorted, and the second, the desired order of sorting. Note that the
hierarchy of such a sorting is determined by the order in which variables are
entered in the Sort by cell. Consider the following:

Figure 1.1.21
The sorting requested in Figure 1.1.21 causes the data to first be sorted by the first variable entered, that is, according to jobtime. This is then followed by sorting on the second variable, salary, within the first variable, jobtime. The resulting sorted Employee dataset will be as follows:

Figure 1.1.22
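The same two-level sorting can be reproduced in R, where the order() function accepts several sort keys in decreasing order of priority. A minimal sketch, assuming the data frame Employee.data:

> # sort by jobtime first, then by salary within jobtime
> Employee.data[order(Employee.data$jobtime, Employee.data$salary), ]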

1.1.10 Selecting Cases

Referring again to the Employee dataset, consider for example wanting to analyze only the data which corresponds to employees having a current salary greater than $20,000.
specific subset of our data by using the Select Cases procedure. This is done by pressing Data, then Select Cases from the menu options, after which the dialogue box which follows will show on the screen:

Figure 1.1.23

As may be seen, Figure 1.1.23 contains a list of the variables in the active data
file on the left and several options for selecting cases on the right.

Now, by default SPSS considers All Cases for analysis, so, on pressing All
Cases, the data will remain unchanged. However, if either one of the other
options is chosen, say If condition is satisfied, then the If button underneath
should be pressed so that a second dialogue box, which will ask for the
particular specifications, will show up:

Figure 1.1.24

For our example, where we would like to choose only those employees who
earn more than $20,000, the if statement should read:

Figure 1.1.25

Note that the portion of the dialogue box in Figure 1.1.23 labeled Output gives
the option of temporarily or permanently removing data from the dataset. The
Filtered option will remove data from subsequent analyses until the All Cases
option is reset, at which time all cases will again be active and used in further
analyses. If the Deleted option is selected, the unselected cases will be
removed from the working dataset. If the dataset is subsequently saved, these
cases will be permanently deleted.

For our example, the Filtered option has been chosen, hence SPSS will
indicate the inactive cases in the Data Editor by placing a slash over the row
number:

Figure 1.1.26

To select the entire dataset again, return to the Select Cases dialog box and
select the All Cases option. Otherwise delete the filter variable (the last
column in the data) from the data.
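For comparison, the same selection is a one-line subsetting operation in R, as we shall see in Section 1.3.5 (Over.20000 is a hypothetical name chosen for this sketch):

> # keep only the employees with a current salary above $20,000
> Over.20000 <- Employee.data[Employee.data$salary > 20000, ]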

1.1.11 Listing Cases

By means of SPSS, we may also extract a list of cases of a number of (or all)
variables from a data set of interest. The procedure for doing this cannot be
performed using dialogue boxes, but may only be done through command
syntax.

Reminder: The syntax window is obtained by going to File, New, Syntax. Then, the syntax that should be used to obtain, for example, the first 15 cases
for the variables salary and prevexp is as shown below:

Figure 1.1.27

Note that typing the command LIST VARIABLES = ALL instead of naming any variables will produce a list of all the variables in the dataset, and the subcommand /CASES FROM 1 TO 15 is an instruction to SPSS to print only the first fifteen cases for all the variables. If the latter instruction were omitted, all cases would be listed in the output.
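For comparison, the equivalent listing in R is plain indexing on the data frame, assuming the dataset has been imported as Employee.data:

> # first 15 cases of the two variables of interest
> Employee.data[1:15, c("salary", "prevexp")]
> # first 15 cases of all the variables
> Employee.data[1:15, ]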

Now, to execute commands written in the Syntax Editor, first highlight the commands, then either click on the right-facing green arrow or choose a selection from the Run menu. Execution of the command will give:

Figure 1.1.28

1.2 R

R is a programming language that is very widely used by statisticians and data
analysts for computational statistics and visualization, in industry and in the
academic sector. R performs simple and more advanced statistical analysis,
produces high quality graphics and its uses are extended by the possibility of
writing one’s own functions.

R was designed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently maintained by the R Development
Core Team.

R’s first appearance goes back to 1993. The popularity of this software has
increased rapidly especially in the last 10 years. R is downloadable for free
under the GNU General Public License (GPL). It is known as an open source
program as its source code is freely available. The base system is downloaded
from its homepage:

https://cran.r-project.org/

For installation on Windows, go to the above link, click on Download R for
Windows, click on base, then click on the current version of R available for
Windows and complete the installation by running the executable file.

R can be obtained in various versions and for different computational environments - Windows, Linux, MacOS.

R provides an unparalleled platform for programming in a straightforward manner, but it has a steeper learning curve when compared to drop-down menu
software such as SPSS.

GOOD NEWS: plenty of introductory tutorials, books and lecture notes are available. A list of documentation may be obtained for free from:

https://cran.r-project.org/other-docs.html

BUT: Practice is key!

1.2.1 STARTING R

Find R from the list of programs available on your computer and click on the R icon. If you have created a shortcut on the desktop, double-click on the R icon. A console window will pop up:

Figure 1.2.1: The main R window (GUI)

• Use the question-and-answer model: each time, you enter a command at the prompt '>' and then press Enter to get the required output.
• If relevant and depending on the command entered, the system will respond by outputting results. Graphs are created in a new window.
• R prints the result and asks for more input.
• When R is ready, '>' will again show up.
• If a mistaken command has been used, R will follow this up by printing a warning.
• All data objects created during a particular session are kept in memory.
• To get the last expression entered, use the up arrow.
• R may be exited by clicking on the X at the upper right-hand corner of the screen or otherwise by using the command q().

Working with R:

• R is case sensitive.
• R has an inbuilt library of packages.
• Many more packages are available from the R website: keep on the lookout for new packages, since a package for, say, fitting a specific model may not be available now but may be in a few months' time. Packages available for download may be accessed through:
https://cran.r-project.org/web/packages/available_packages_by_name.html

The Bioconductor Repository is another package repository which
contains nearly 500 additional packages that focus solely on biological
and bioinformatics applications. For more information on this
repository, go to: http://www.bioconductor.org/install/.

• You can create your own package and upload it.

• Installing a package: There are different ways in which a package may
be installed. You need to be connected to the internet to be able to
download packages.
You can use the menu Packages – Install Packages – Choose CRAN mirror – Choose package to install from the whole list available.
A mirror is a distribution site for software. The location closest to us
from the whole list is Italy (Palermo). Generally, all mirrors will have
all copies of documentation/files available. A number of mirrors are
available for balancing load across different servers and also for better
downloading speeds.

Say you wish to install the package ‘gnm’ – a package for generalized
nonlinear models. An alternative way of installing packages is that of using the command install.packages('package'). For the package required in this case, type install.packages('gnm').

A further alternative is that of downloading the zip file of the required
package on your computer and then using the menu Packages – Install
package(s) from local zip files. On using such a procedure, you need to
make sure that you also install any dependencies (any other packages
that the package you are currently installing relies on).

• The inbuilt packages are loaded automatically; all the remaining
packages need to be loaded prior to use. The command search() lists
all the packages that are currently loaded in R. library() lists all
packages that are installed in the system and so are available to load.
• Loading a package: Use the menu Packages – Load Package – Select
package to be loaded. Alternatively, library(package) or
require(package) also load the required package. Say that we wish to
load the package gnm, the command used for this purpose is
library(gnm).
• To view all the files of a package, say e.g. the package MASS: library(help='MASS')
• There are different ways of updating currently installed packages. You can use the menu Packages – Update Packages – choose the package(s) that you would like to update. Alternatively, you can use the command update.packages(oldPkgs = 'package'); note that the package name goes in the oldPkgs argument, since the first argument of update.packages() is a library path. Say you wish to update the package gnm; then use the command update.packages(oldPkgs = 'gnm').

• The command for loading datasets currently available in R is data().
A new window will pop up with a list of datasets currently available for
use (including datasets from base packages and datasets from all the
loaded packages).
We may also use the command data(package='name') to view all the data files in a specific package. Say we wish to view the files in the package gnm; then we use the command data(package='gnm').

If we then wish to load a specific dataset from a package, say the dataset barley from the package gnm, we use the command data(barley, package='gnm'). Typing barley at the prompt will then display all the data in the barley dataset.

• Whenever you are feeling slightly (or totally) lost, you can also make use of the help functions available in R. There are various ways to access the help available in R.
Some documentation to get you going in R is available from the menu
Help – Manuals (in pdf).

If you know which R command you are going to use, you can make use of the Help menu in R. Say you wish to know how to work out the trimmed mean in R, and you know that you have to use the function mean. Then go to the menu Help – R functions (text) – mean. A new window will pop up in your browser with details on the function mean. Writing the command ?mean or help('mean') at the prompt will lead you to the same information. We will see how to enter data and how the function mean may be used shortly.
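As a small self-contained illustration of the help example above: mean() has a trim argument which discards the stated fraction of observations from each end of the sorted data before averaging.

> x <- c(2, 5, 7, 9, 100)   # a small vector with one outlier
> mean(x)                   # ordinary mean, pulled up by the outlier
[1] 24.6
> mean(x, trim = 0.2)       # 20% trimmed mean: drops 2 and 100
[1] 7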

The Help menu may also be used to access documentation of specific
packages. Say you are going to fit a specific model. Since most of the
packages are contributed packages, model specification, extracting a summary of results from a model, etc. may differ from one package to another. It is recommended to always consult the documentation of the specific package being used. Documentation of, say, gnm may be accessed by going to Help – Search Help – gnm. A
new window with details on the package gnm will pop up. A pdf for
this package may then be downloaded from the upper right hand side
of the page.


Figure 1.2.2: Downloading Documentation for the package gnm

If you do not know which function should be used for a specific purpose, then remember that Google is your friend. PDFs for various
packages may also be found through a Google search.

1.3 R STUDIO

RStudio is a powerful user interface (IDE – integrated development
environment) for R. RStudio is available as an open source edition and as a
commercial edition. Both editions may be used with Windows, Linux and
MacOS. The open source edition is available for download for free. RStudio
may be downloaded from:

https://www.rstudio.com/products/rstudio/download/

IMPORTANT: Since RStudio is used as the user interface of R, R needs to be
installed prior to installing RStudio.

Let us start our first session with RStudio so that we can better appreciate the
advantages of using RStudio rather than R directly.
1.3.1 STARTING R STUDIO

Find RStudio from the list of programs available on your computer and click on the RStudio icon. If you have created a shortcut on the desktop, double-click on the RStudio icon.

The following console window will pop up:

Figure 1.3.1: RStudio main window

If you look at RStudio’s main window and compare it with the main window
obtained from R, you might already notice that more menus are readily
available for use through the RStudio interface. RStudio is in fact more user
friendly. We shall see shortly how RStudio, for example, is much simpler to
use for data import/export and it also offers the use of point-and-click
exploration of data frames and other data objects. To mention some other
advantages, with RStudio it is also easier to install and update packages, easier
to save and export plots and different colours are used whilst writing a script.
The print screen which follows shows part of a script. Note that comments, for example, which are defined by a #, are attributed the colour green so that they stand out from the remaining text. The commands library and for are attributed the colour blue, and so on.
Figure 1.3.2: Writing a script in RStudio

RStudio can also be used for version control using Git. Git is another open source software package designed to ease the interaction of different people working on the same project.

All the commands specified earlier for use with R software may also be used
with RStudio. In RStudio however, installing, loading and updating of
packages may be carried out directly through the menu Packages found in the
middle of the right hand side of the main window (refer to the circled part in
Figure 1.3.3). In the same circle of Figure 1.3.3, you may also notice the
presence of another Help menu. Again this menu has been introduced for ease
of use.

Using the RStudio Packages menu:

If you wish to install a new package, go to Packages - Install Packages – write
the package that you wish to install (note that RStudio helps us select the
required file by means of the auto-complete function) – Install.

Once you are in the Packages menu, a package is then loaded by a single click
in the box next to the package required.

Note that since packages such as datasets, graphics, stats, etc. are base
packages, they are always readily available and loaded for use in RStudio.

Updating of packages in RStudio is also carried out through the menu Packages. By pressing Check for Updates, a whole list of packages for which an updated version is available is outputted. You may then proceed to update all the packages, or only those required.

Figure 1.3.3: The Packages and Help Menus

Note that any statistical analysis or mathematical computation that is done in
R can be done in pretty much the same way in RStudio. Since RStudio tends
to be more user friendly, in these notes the focus will be on RStudio.

Unlike SPSS, RStudio (and R) can also be used as a very powerful calculator.
Some simple examples are given in the following section. More complex
examples are provided in Appendix A.

1.3.2 Basic Arithmetic and Objects

We may use R to calculate arithmetic expressions such as 12 + 5 or exp(-5):

> 12+5
[1] 17

> exp(-5)
[1] 0.006737947

We may also wish to assign a number to some variable x. Such an assignment
is carried out by means of the symbol <-. Say we wish to assign the value of
7 to the variable x, then:

> x<-7
> x
[1] 7

If instead of <-, we used the symbol =, we would get the same value for x.
> x=7
> x
[1] 7

The use of the symbol <- dates back to when the R language was first created.
The symbol <- was the only choice for an assignment operator. The use of
this symbol comes from the language APL, where the arrow notation was used
to distinguish assignment from equality, that is to distinguish between ‘Assign
to x the value of 7’ versus ‘is x equal to 7?’ Nowadays, many R users still
prefer to use the symbol <- when assigning values to variables. The reason
for this is that R uses the symbol = for associating function arguments with
values, such as:

> rnorm(10, mean = 0, sd = 5)

In this case, sd is declared within the scope of the function, so it does not exist
in the user workspace.

> sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x)) x else as.double(x), na.rm = na.
rm))
<bytecode: 0x000000000cb09808>
<environment: namespace:stats>

Note that the rnorm command is used to generate 10 observations from a
normal distribution with mean fixed at 0 and standard deviation fixed at 5:
> rnorm(10, mean = 0, sd = 5)
[1] -0.5593297 1.0933179 -0.3005551 1.2798228 -7.9541652
-2.9788505 4.9220503 3.5936779 -6.2354377 -0.5117681

Having assigned the value 7 to x, we may then wish to subtract 1 from x. We
can do so by means of the command:
> x-1
[1] 6

We may also wish to assign a whole vector to x or a whole matrix to x. An
example of assigning a vector to x is the following:
> x<-c(1,2,3)
> x
[1] 1 2 3

An example of assigning a matrix to x is the following:
> x<-matrix(c(1,2,3,4),nrow=2)
> x
[,1] [,2]
[1,] 1 3
[2,] 2 4
> x<-matrix(c(1,2,3,4),nrow=2,byrow=T)
>
> x
[,1] [,2]
[1,] 1 2
[2,] 3 4

Some functions which may come in handy when working with R are:
> seq(4,9,2)
[1] 4 6 8

> 4:9
[1] 4 5 6 7 8 9

> x<-c(1,2,3)

> sum(x)
[1] 6

> rep(x,4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3

> rep(3:4,c(5,2))
[1] 3 3 3 3 3 4 4

Now suppose that our data consists of the numbers 2, 5, 7, 9 and that we wish to extract only some of the elements of the vector. So:
> x<-c(2,5,7,9)

> x[c(1,2)]          # selecting the first two terms only
[1] 2 5

> x[c(2:length(x))]  # keeping the last three terms only
[1] 5 7 9

> x[-c(1,3)]         # removing the 1st and 3rd terms
[1] 5 9

Note that an alternative command to x[c(1,2)] is:
> x[c(1:2)]
[1] 2 5

and an alternative command to x[c(2:length(x))] is:
> x[c(-1)]
[1] 5 7 9

We may also wish, for example, to extract all the numbers in x that are greater than 5:

> x>5                # checking that we actually have values greater than 5
[1] FALSE FALSE TRUE TRUE

> x[x>5]             # extracting those numbers that are greater than 5
[1] 7 9

Other possible logical comparisons may also be entertained:

> x==5               # checking whether values in x are equal to 5
[1] FALSE TRUE FALSE FALSE

> x<=5               # checking whether values in x are smaller than or equal to 5
[1] TRUE TRUE FALSE FALSE

> x!=3               # checking whether values in x are not equal to 3
[1] TRUE TRUE TRUE TRUE

We may also wish to combine logical operators:
> (x>2) & (x<9)
[1] FALSE TRUE TRUE FALSE
> !(x<7)
[1] FALSE FALSE TRUE TRUE
> x[!(x<7)]
[1] 7 9

Important: Commands are entered interactively at the R user prompt. The up and down arrow keys scroll through your command history.

1.3.3 The Working Directory

Whilst R provides a number of datasets, frequently we need to use our own dataset. There are ways in which new data may be entered in R/RStudio such that new variables/matrices may be created. We have seen a few examples of how
vectors and matrices may be created. More detail on working with vectors
and matrices will be provided later on. Here we will focus on when the data
has already been entered in Excel or SPSS and we wish to import the data for
use in R/RStudio.

By default, R will try to access/save datasets, load source files or save plots in its own directory. Should you wish R to import/export a dataset from/to a specific directory, the directory in which R is working should be changed to your desired directory from File – Change Directory, or otherwise by typing setwd('file path').

In RStudio, paths should be changed using the menu Session – Set Working
Directory – Choose Directory.

If you have changed the working directory and you wish to check in which
working directory you are currently working, in either R or RStudio, use the
command:

> getwd()
[1] "C:/Users/user/Desktop "

Note that on Windows, the path returned uses / as the path separator (rather
than \ with which you are most probably more familiar).
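For example, to point R at a folder on the desktop (a hypothetical path, shown with forward slashes), one could type:

> setwd("C:/Users/user/Desktop")   # change the working directory
> getwd()                          # confirm the change
[1] "C:/Users/user/Desktop"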

1.3.4 Importing/Exporting Data

1.3.4.1 Importing Data

If you have already specified the directory from which you would like to
retrieve a dataset, importing a dataset in R is carried out using the following
commands.

From Excel and SPSS – change the extension .xls or .sav to comma-separated .csv and use the command:
x = read.csv('test.csv', header = TRUE)

Focusing on our training dataset, the file Employee data.csv: so as to import our data into R, we should use the commands below:

x<-read.csv('Employee data.csv',header=TRUE)
> names(x)
[1] "id" "gender" "bdate" "educ" "jobcat"
"salary"
[7] "salbegin" "jobtime" "prevexp" "minority"

> x[1,]
id gender bdate educ jobcat salary salbegin jobtime
prevexp minority
1 1 m 02/03/1952 15 3 57000 27000 98
144 0

header = TRUE is a default setting of the command read.csv, so you can actually opt to leave this part of the command out. On mistakenly writing header = FALSE, we would get the following:

> x2<-read.csv('Employee data.csv',header=FALSE)

> names(x2)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10"

> x2[1,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 id gender bdate educ jobcat salary salbegin jobtime prevexp minority

There are alternative commands that may also be used for importing data into
R from Excel and SPSS: read.table, read.csv2, read.delim, read.delim2.

Also note that rather than changing the working directory, you can also
specify the directory from where to get a file directly in the read.csv
command:

x<-read.csv('C:/Users/user/Desktop/Employee data.csv')

Alternatively you may also use the command:

x<-read.csv(file.choose(), header = TRUE)

which will let you browse for the file which you wish to import, without having to write down the actual path.

To import a dataset in RStudio, you can use the same commands used in R
or otherwise you may also make use of the menu Import Dataset. Say we
wish to import the file Employee data.csv. Working with the latest version of
RStudio as of August 2017 (Version 1.0.153), importing of data can be carried
out by going to Import Dataset, From CSV, choose the file to import in
RStudio by clicking on the browse button as shown in Figure 1.3.5 and then
press Import.

Figure 1.3.4: Importing Data in RStudio

Figure 1.3.5: Choosing the File to Import in RStudio

In previous versions of RStudio, data can be imported by going to:

Import Dataset – From Text file – Choose the directory from where you would
like to get your file – Open the file Employee data.csv.

Note that RStudio automatically assigns a name to an imported dataset (the
name of the dataset may also be specified throughout the import procedure
and the name may also be changed once the dataset has been imported in
RStudio).

The command View(name of file) can be used for viewing the dataset in
spreadsheet form. For our example, the name automatically assigned by
RStudio to the imported employee dataset is Employee.data. So by using the
command View(Employee.data) we can get to view the employee dataset in
spreadsheet format. Clicking once on the name Employee.data, in the menu
underneath Data, on the right hand side of the RStudio’s main window, will
also give us the same output.

1.3.4.2 Exporting Data

The commands used for exporting data are the same for both R and RStudio.
One of the most commonly used commands is:

write.csv(x, 'test.csv')

where we would be saving the dataset called x, in the file called test.csv. The
new file test.csv will be created in the current working directory.

A .csv file may be obtained from both Excel and SPSS.

Another command that may be used for exporting data is write.table.
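As a minimal sketch, write.table gives more control over the field separator and row names than write.csv; here a tab-delimited text file is written (the file name test.txt is just an example):

> write.table(x, 'test.txt', sep = '\t', row.names = FALSE)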

1.3.5 Preparing Data for Analysis

By default, unless the levels of a categorical variable are defined using
characters, once a dataset has been entered or imported in R/RStudio, all
variables are considered to be continuous variables. Should there be any
categorical variables in the data (nominal or ordinal), these variables should
be defined accordingly before any data analysis is carried out.

The output below shows that for the moment, R is still considering the
variables jobcat and minority as continuous. From the output below we can
also check whether there might be any values in the data which stand out in
terms of being too low or too high (minimum/maximum value of each
variable).

> summary(Employee.data)
       id        gender         bdate          educ           jobcat          salary
 Min.   :  1.0   f:216   02/04/1934:  2   Min.   : 8.00   Min.   :1.000   Min.   : 15750
 1st Qu.:119.2   m:258   02/08/1962:  2   1st Qu.:12.00   1st Qu.:1.000   1st Qu.: 24000
 Median :237.5           02/12/1964:  2   Median :12.00   Median :1.000   Median : 28875
 Mean   :237.5           04/05/1966:  2   Mean   :13.49   Mean   :1.411   Mean   : 34420
 3rd Qu.:355.8           05/11/1965:  2   3rd Qu.:15.00   3rd Qu.:1.000   3rd Qu.: 36938
 Max.   :474.0           10/20/1959:  2   Max.   :21.00   Max.   :3.000   Max.   :135000
                         (Other)   :462
    salbegin        jobtime         prevexp          minority
 Min.   : 9000   Min.   :63.00   Min.   :  0.00   Min.   :0.0000
 1st Qu.:12488   1st Qu.:72.00   1st Qu.: 19.25   1st Qu.:0.0000
 Median :15000   Median :81.00   Median : 55.00   Median :0.0000
 Mean   :17016   Mean   :81.11   Mean   : 95.86   Mean   :0.2194
 3rd Qu.:17490   3rd Qu.:90.00   3rd Qu.:138.75   3rd Qu.:0.0000
 Max.   :79980   Max.   :98.00   Max.   :476.00   Max.   :1.0000

To be able to define the variables as categorical, we first need to be able to
access these variables in the dataset.

So as to access information about a specific variable making up the whole
employee dataset, we can either use the command:
Employee.data$variable of interest

or otherwise, a better option is that of using the command:
attach(Employee.data).

The first command is used to be able to access information about a variable
of interest specifically for this one instance. The second command makes it
possible for us to call each variable in the dataset directly by its name any time
needed for the duration of the session. So, once the data has been attached, the command which follows may be used:

> summary(salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15750 24000 28880 34420 36940 135000

to get descriptive statistics specifically on the variable salary.

Categorical variables may then be defined by means of the command factor.
The following are the commands used for jobcat and minority respectively:
> jobcat<-factor(jobcat,levels=c(1,2,3),labels= c("Clerk",
"Custodial", "Manager"))
> summary(jobcat)
Clerk Custodial Manager
363 27 84

> minority <- factor(minority, levels = c(0,1))
> summary(minority)
0 1
370 104

Note that labels is used to attach descriptive (string) labels, whilst levels
specifies the codes as they appear in the data. In the case of the variable
jobcat, we are attributing the names Clerk, Custodial and Manager to levels
1, 2 and 3 respectively.

If we have an ordinal variable with 3 levels, then specification of this
variable is carried out by either using the command:

> variable<-factor(variable,levels=c(1,2,3),ordered=TRUE)

or otherwise, we can make use of the command ordered( ).
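
As a sketch, suppose a hypothetical variable rating is coded 1, 2 and 3 for
low, medium and high. An ordered factor may then be created with:

> rating <- c(1, 3, 2, 2, 1)
> rating <- factor(rating, levels = c(1, 2, 3),
+                  labels = c("low", "medium", "high"), ordered = TRUE)
> summary(rating)
   low medium   high
     2      2      1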

Now suppose that we need to rearrange our dataset such that the values
obtained on salary are in increasing order. The command used for this
purpose is:
> sal.inc<-Employee.data[order(salary),]

Ordering in decreasing order may also be carried out by means of the
command:
> sal.dec<-Employee.data[order(-salary),]

Note that in each case a new dataset (sal.inc, sal.dec) has been created, so
that you can compare the data in Employee.data with the reordered versions.
Of course you could have ordered the values in Employee.data directly by using:

> Employee.data <-Employee.data[order(salary),]

If you then wish to get your dataset ordered according to the person’s id, use
the command:
> Employee.data <-Employee.data[order(id),]

Now suppose that we wish to select cases. We make use of the logical
operators introduced earlier on in the notes.

Should we wish to select only those persons whose salary is greater than say
50000, we would use the command:
> Over.50000<-Employee.data[salary>50000,]

Or suppose that we wish to select females only, then:

> Females<-Employee.data[gender == 'f',]

where we would be selecting all rows in the data that satisfy the condition
stated.
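
Conditions may also be combined using the logical operators & (and) and |
(or). For instance, a sketch which selects the female employees earning more
than 50000:

> Females.over.50000 <- Employee.data[gender == 'f' & salary > 50000,]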

Now suppose that we wish to reorganize the employee dataset so that
responses on the variable bdate are moved to the final column. As a first
step we might need to check how many columns there are in our dataset:
> ncol(Employee.data)
[1] 10

Knowing that bdate is currently in the third column, we then use:

> bdate.final<-Employee.data[,c(1,2,4:10,3)]

Also, now suppose that we wish to categorize the variable salary so that we
have a group of persons with a salary that is less than or equal to 50000 and
a group of persons with a salary that is more than 50000. Because of the way
the command used for this purpose treats its boundary values, it makes sense
to check whether there are any persons with a salary of exactly 50000. This
can be checked by means of:

> length(which(salary==50000))
[1] 1

So there is one person with a salary of 50,000.

One way in which the variable salary can be categorized is then by means of:
> sal.cat <- cut(salary,
+ breaks=c(-Inf, 50000, Inf),
+ labels=c("low","high"))

> summary(sal.cat)
low high
403 71

Note that by default, for the cut function, the ranges defined for breaks are
open on the left and closed on the right, as in (-Inf, 50000]. If we needed to
divide the data into two groups, one group having a salary less than 50000 and
the other group having a salary of 50000 or more, then we would need to use
the command:

> sal.cat <- cut(salary,
+ breaks=c(-Inf, 50000, Inf),
+ labels=c("low","high"), right=FALSE)
> summary(sal.cat)
low high
402 72

so that the range being defined in this case is [50000, Inf).
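
The same function extends naturally to more than two groups. A sketch with
arbitrary cut points:

> sal.cat3 <- cut(salary,
+ breaks=c(-Inf, 30000, 50000, Inf),
+ labels=c("low","medium","high"))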

If we then wish to add the newly created variable sal.cat to the
employee dataset, we use the command:

> new.data<-cbind(Employee.data,sal.cat)

where we are binding a new column to the data.

We might also need to recode the levels of a categorical variable. Suppose
that the levels 0 and 1 attributed to the variable minority need to be changed
into 1 and 2. One way to go about this recoding is by means of the function
mapvalues from the package plyr. So we first need to install the package
plyr, load it, and then proceed to do the recoding:
> require(plyr)
> minority.rec <- mapvalues(minority, from = c(0, 1), to = c(1, 2))

The same command may also be used in case labels need to be changed. Say
we made a mistake in using the label Manager for the third level of the variable
jobcat. Suppose we wish to change the label Manager to Managerial.
Then:

> jobcat.changed <- mapvalues(jobcat,
+ from = c('Clerk', 'Custodial', 'Manager'),
+ to = c('Clerk', 'Custodial', 'Managerial'))
> summary(jobcat.changed)
Clerk Custodial Managerial
363 27 84

One alternative way of doing the recoding without relying on the package plyr
is by using the following commands (note that jobcat.changed is first
initialized as a character copy of jobcat, so that the new label Managerial
can be assigned):

> jobcat.changed <- as.character(jobcat)
> jobcat.changed[jobcat=="Clerk"] <- "Clerk"
> jobcat.changed[jobcat=="Custodial"] <- "Custodial"
> jobcat.changed[jobcat=="Manager"] <- "Managerial"
> jobcat.changed <- factor(jobcat.changed)

> summary(jobcat.changed)
Clerk Custodial Managerial
363 27 84

Now suppose that we wish to create a new variable where job categories are
grouped into Manager and Other. One way in which this recoding can be
done is by means of the command recode( ) from the package car:

> require(car)

> jobcat.2cat <- recode(jobcat,
+ "c('Clerk','Custodial')='Other'; 'Manager'='Manager'")

> summary(jobcat.2cat)
Manager Other
84 390

1.3.6 Missing Data

In R/RStudio missing values are represented by the symbol NA (not
available). NA is not a string or a numeric value, but a symbol of
missingness. The symbol NaN (not a number) is used for impossible values,
resulting for example when you divide zero by zero.

You can check whether you have any missing values in the data by using the
command is.na(variable of interest). The possible outputs from this command
are either TRUE or FALSE; TRUE will result wherever a response is missing.

For some variable x with values say 3, 5, 7, NA, 9:

> x<-c(3, 5, 7, NA, 9)

> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE

and if we wish to count how many NAs we have in our variable x, then:
> sum(is.na(x))
[1] 1

Similarly, we can also count how many incomplete cases we have in
Employee.data using:

> sum(!complete.cases(Employee.data))
[1] 0

The command complete.cases( ) returns a logical vector indicating which
rows are complete.

In a similar manner to the recoding shown earlier, we can also recode to show
that a specific value is a missing value NA:

> x<-c(3, 5, 7, NA, 99)
> x[x==99]<-NA

> x
[1] 3 5 7 NA NA

Should we wish to exclude missing values from a computation, we make use of
the argument na.rm=TRUE as follows:

> mean(x)   # x has two missing values, so its mean cannot be computed
[1] NA

> mean(x, na.rm=TRUE)   # the mean is computed once the missing values are ignored
[1] 5

The previous argument makes missing data be ignored by a specific command.
We can, however, also create a new dataset without the missing data: the
command na.omit( ) carries out listwise deletion of missing data.

> new.x<-na.omit(x)

> new.x
[1] 3 5 7

attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"

> x<-matrix(c(1,NA,3,4,6,NA),nrow=3)
> x
[,1] [,2]
[1,] 1 4
[2,] NA 6
[3,] 3 NA

> new.x<-na.omit(x)

> new.x
[,1] [,2]
[1,] 1 4

attr(,"na.action")
[1] 2 3
attr(,"class")
[1] "omit"

Advanced handling of missing data which goes beyond listwise deletion, may
also be carried out in R/RStudio by means of various packages. To mention
a few of these packages: Amelia, VIM, mice.
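
As a minimal sketch of what such packages offer (assuming the package mice
has been installed with install.packages("mice"), and with mydata standing
for a hypothetical data frame containing missing values):

> library(mice)
> imp <- mice(mydata, m = 5, seed = 1)   # five multiply-imputed datasets
> completed <- complete(imp, 1)          # extract the first completed dataset

Unlike listwise deletion, the missing entries are here filled in with
plausible values drawn from the observed data.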

1.3.7 Managing the Workspace

Objects that you create during an R/RStudio session are contained in what is
known as the workspace. This workspace is not saved unless you tell
R/RStudio to do so. Every time you are ending an R/RStudio session, you
will be asked whether you wish to save your workspace. If you do not save
the objects created during your session, these objects will be lost and you will
not be able to retrieve them in a subsequent session.

• To see which variables have been created in the current session, use the
  command ls().
• To remove an object from the workspace, use rm(object).
• To remove all objects from the workspace, use rm(list = ls()).

> x<-c(1,2,3)
> ls()
[1] "x"
> rm(x)

> x
Error: object 'x' not found

Suppose that during one particular R/RStudio session you have created two
datasets x and y, and that you would like to use these datasets in future
sessions. Check that the directory in which you are currently working is the
one in which you would like to save. If it is, proceed to save x and y, using:
save(x, y, file="x.Rdata").

So if we wish to save, say, the variable jobcat.2cat on the desktop, we should use:

save(jobcat.2cat, file='C:/Users/user/Desktop/job2cat.Rdata')

Suppose that instead of saving one or two variables created during a session,
we wish to save the whole workspace; then we use the command
save.image(file name). To save on the desktop:
save.image("C:/Users/user/Desktop/wholeworkspace.RData")

If you then wish to load the saved workspace at a later date, you can do so by
means of the command load(file name), as in:
load("C:/Users/user/Desktop/wholeworkspace.RData")

Saving also extends to images created in R/RStudio. Consider the following
example: install the package plot3D, load this package and then enter the
following commands, taken from Soetaert, K. (2013, Pg 6):
par(mfrow = c(2, 2), mar = c(0, 0, 0, 0))
# Shape 1
M <- mesh(seq(0, 6*pi, length.out = 80), seq(pi/3, pi,
length.out = 80))
u <- M$x ; v <- M$y
x <- u/2 * sin(v) * cos(u)
y <- u/2 * sin(v) * sin(u)
z <- u/2 * cos(v)
surf3D(x, y, z, colvar = z, colkey = FALSE, box = FALSE)
# Shape 2: add border
M <- mesh(seq(0, 2*pi, length.out = 80),
seq(0, 2*pi, length.out = 80))
u <- M$x ; v <- M$y
x <- sin(u)
y <- sin(v)
z <- sin(u + v)

surf3D(x, y, z, colvar = z, border = "black", colkey = FALSE)
# shape 3: uses same mesh, white facets
x <- (3 + cos(v/2)*sin(u) - sin(v/2)*sin(2*u))*cos(v)
y <- (3 + cos(v/2)*sin(u) - sin(v/2)*sin(2*u))*sin(v)
z <- sin(v/2)*sin(u) + cos(v/2)*sin(2*u)
surf3D(x, y, z, colvar = z, colkey = FALSE, facets = FALSE)
# shape 4: more complex colvar
M <- mesh(seq(-13.2, 13.2, length.out = 50),
seq(-37.4, 37.4, length.out = 50))
u <- M$x ; v <- M$y
b <- 0.4; r <- 1 - b^2; w <- sqrt(r)
D <- b*((w*cosh(b*u))^2 + (b*sin(w*v))^2)
x <- -u + (2*r*cosh(b*u)*sinh(b*u)) / D
y <- (2*w*cosh(b*u)*(-(w*cos(v)*cos(w*v)) -
sin(v)*sin(w*v)))/D
z <- (2*w*cosh(b*u)*(-(w*sin(v)*cos(w*v)) +
cos(v)*sin(w*v)))/D
surf3D(x, y, z, colvar = sqrt(x + 8.3), colkey = FALSE,
border = "black", box = FALSE)

The resulting plot may be saved in various formats by opening the
corresponding graphics device before issuing the plotting commands:

pdf('mygraph.pdf')            # pdf file
win.metafile('mygraph.wmf')   # windows metafile
png('mygraph.png')            # png file
jpeg('mygraph.jpg')           # jpeg file
bmp('mygraph.bmp')            # bmp file
postscript('mygraph.ps')      # postscript file

In each case the device must be closed with dev.off() once the plotting
commands have been issued; otherwise the file is not written.
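
As a minimal sketch, to write a plot to a pdf file (mygraph.pdf is an
arbitrary file name and the plotted values are arbitrary too):

> pdf('mygraph.pdf')     # open the pdf graphics device
> plot(1:10, (1:10)^2)   # any plotting commands
> dev.off()              # close the device so that the file is written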

To export the resulting plot to Word from RStudio, the following procedure
may be used: Export – Copy Plot to Clipboard – Choose Metafile – Copy Plot,
and then paste into Word.

To export the resulting plot to Word from R, right click on the resulting
plot and choose the format in which you would like to save your picture.

The plot which results from the commands given is the following:

Figure 1.3.6: An example of a 3D Plot obtained with R/RStudio

1.3.8 Introduction to R Scripts


When working with R or RStudio, rather than entering the commands in the
console window, it might be more convenient to write the commands in an R
script, as this allows us to save and edit the commands for later use. This
feature is especially useful when R is used for programming.

Directions to create an R-script using R-editor:

• Go to File and select New Script.

• An R-editor window will appear. Commands can be entered in this
  window. You can also add comments in these scripts. All comments
  should be preceded by a '#'; R will not execute any text that
  follows the '#'.

Figure 1.3.7: Entering R Script in R

• To save the script file: activate the R-editor by clicking on it and then
  click on File and select Save as.
• To open a script file, hit File and select Open script.
• There are two possible ways of running the code in an R script:
  1. You can run the commands one line at a time. To do this make
     sure the cursor is in front of the first command. Hit the Run line or
     selection button repeatedly. Each time you hit the button a
     command is executed and the cursor moves to the next line.
  2. Alternatively you can highlight a section of the R script, or all of
     it, and hit the Run line or selection button to execute the
     selected commands at one go.

Directions to create an R-script using RStudio:

• Go to File, select New File and then select R Script.

• An Untitled R Script window will appear. Commands can be entered in
  this window. You can also add comments in a script. All comments
  should be preceded by a '#'; R will not execute any comments that
  follow the '#'.
• Commands are executed in a similar way as in the R-editor
  window in R. Instead of the Run line or selection button there is a
  Run button in the top right corner of the R Script window. Commands
  in RStudio may also be run one step at a time by using the shortcut Ctrl
  + Enter on the keyboard.

Figure 1.3.8: Entering R Script in RStudio

• To save the script file: activate the R Script by clicking on it and
  then click on File and select Save as.
• To open a script file, hit File and select Open File.

Note that any script file assembled in R can be executed and edited in RStudio
and vice versa.

1.3.9 Data Manipulation

In this subsection we shall see how one can create new variables in an existing
data file. These new variables can be some transformed variables from the
original data set or indices computed from existing variables.

We will once again make use of the ‘Employee data.csv’ file.

We start by uploading the data and preparing it for analysis:

> # upload data file Employee data.csv
> Employee.data <- read.csv("~/R Course/Employee data.csv")
> attach(Employee.data)

> names(Employee.data)   # list the variables in the data
 [1] "X...id"   "gender"   "bdate"    "educ"     "jobcat"   "salary"   "salbegin"
 [8] "jobtime"  "prevexp"  "minority"

> str(Employee.data)   # list the structure of the data
'data.frame': 474 obs. of 10 variables:
 $ X...id  : int 1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : Factor w/ 2 levels "f","m": 2 2 1 1 2 2 2 1 1 1 ...
 $ bdate   : Factor w/ 462 levels " ","1/10/1964",..: 166 275 362 228 176 397 240 290 36 143 ...
 $ educ    : int 15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int 3 1 1 1 1 1 1 1 1 1 ...
 $ salary  : int 57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...
 $ salbegin: int 27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int 98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int 144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int 0 0 0 0 0 0 0 0 0 0 ...

> View(Employee.data)   # to view the data file

> # designating factors
> jobcat <- factor(Employee.data$jobcat, levels=c(1,2,3),
+                  labels=c("Clerk", "Custodial", "Manager"))
> minority <- factor(Employee.data$minority, levels = c(0,1))

For more detail on importing data in R and RStudio refer to Section 1.3.4.

Creating the new variable Salary_Increase by subtracting salbegin from
salary:

First we create a new data frame, so as not to alter the original dataset,
and add an empty column, which we call Salary_Increase, to this new data
frame:

> Employee.data.new<-Employee.data
> Employee.data.new["Salary_Increase"] <- NA # Creates the
new column named "Salary_Increase" filled with "NA"

To compute the values for this new column:

> Employee.data.new$Salary_Increase <- Employee.data.new$salary - Employee.data.new$salbegin

> names(Employee.data.new)
 [1] "X...id"          "gender"          "bdate"           "educ"
 [5] "jobcat"          "salary"          "salbegin"        "jobtime"
 [9] "prevexp"         "minority"        "Salary_Increase"

To see the new dataset:

> View(Employee.data.new)

Suppose we want a new column with ln(salary):

> Employee.data.new["ln_salary"] <- NA
> Employee.data.new$ln_salary <- log(Employee.data.new$salary)

Suppose that an index is defined by THT = (2*current salary - 0.5*beginning
salary)/jobtime. This index is computed as follows:

Employee.data.new["THT"] <- NA
Employee.data.new$THT <- (2*Employee.data.new$salary - 0.5*Employee.data.new$salbegin)/Employee.data.new$jobtime
summary(Employee.data.new$THT)

Note that the command cbind() may also be used to add a new column to a
dataset, where salarydiff below stands for a vector of salary differences
created beforehand (for instance salarydiff <- salary - salbegin):

> Employee.data.new <- cbind(Employee.data, salarydiff)

In the next section we shall see how descriptive statistics are carried out using
both SPSS and RStudio (hence also R).

2. DESCRIPTIVE STATISTICS

In the first section we discussed some of the basics of SPSS and R/RStudio.
In particular, we used the software to get familiar with manipulating and
importing our datasets. In this section, we will discuss procedures to obtain
various descriptive measures using SPSS and R/RStudio.

Note that:
• For SPSS we shall make use of the Employee data.sav file, which is
  provided with SPSS.
• When working with RStudio we shall make use of the dataset 'pima',
  which is found in the package 'faraway'.
• We are using R version 3.1.2 or later.
• Packages used in this section:
  - faraway
  - psych
  - e1071
  - plotrix
  - ggplot2
  - car
  - graphics
  - plyr

Recall that to obtain information about the packages one can use the command
‘help’ in R. For example to obtain information about what is contained in the
package ‘faraway’ type the following command:
help(package=faraway)

For more detail on help offered by R/RStudio refer to Section 1.2.1.

2.1 SUMMARIZING DATA

2.1.1 Summary Statistics using SPSS

When analyzing a dataset, it is usual to begin with some descriptive measures
to get a general idea of the nature of the data. Several summary statistics
that may be used for continuous or discrete variables are available from
Analyze, Descriptive Statistics, Descriptives. If we wish to find the
descriptive statistics for the variables salbegin and salary, the
Descriptives window should be as follows:

Figure 2.1.1

Now, to view the available descriptive statistics, click on the button Options,
which will produce the following dialogue box:

Figure 2.1.2

The ticked statistics in the above dialogue box are those descriptive statistics
which SPSS outputs by default, when this procedure is run. All the other
statistics may also be chosen. When all required statistics are chosen, pressing
Continue and Ok, will generate the output with the selected statistics in the
Output Viewer. For example, the selections from the preceding example
would produce the following output:

Descriptive Statistics

                     N    Minimum   Maximum    Mean         Std. Deviation
Current Salary       474  $15,750   $135,000   $34,419.57   $17,075.661
Beginning Salary     474  $9,000    $79,980    $17,016.09   $7,870.638
Valid N (listwise)   474

Table 2.1.1

So, from Table 2.1.1 we know that 474 cases were considered for this
analysis. The average current salary (salary) of the respondents is $34,419.57
and their average starting salary (salbegin) was $17,016.09.

Also, beginning salaries ranged from $9000 to $79,980 whereas current
salaries range from $15,750 to $135,000.

Now, the standard deviation measures the amount of variability in the
distribution of a variable: the more the individual data points differ from
each other, the larger the standard deviation will be. Conversely, the closer
together the points are, the smaller the standard deviation. Reminder: the
standard deviation is the square root of the variance.

The standard deviation describes the typical amount by which values differ
from the mean. For example, in this case, a current salary of $51,000.00 is
nearly one standard deviation above the mean, since mean ($34,419.57) +
standard deviation ($17,075.66) = $51,495.23.

Since the variance (or the standard deviation) is a key aspect in many
statistical analyses, it is useful to note that the standard deviation resulting
from the current salaries ($17,075.66) is much greater than the standard
deviation resulting from the beginning salaries ($7,870.64). Thus, there is a
larger spread in the current salaries.

If other statistics like variance, range, kurtosis and skewness were required,
the resulting output would be as follows:

Table 2.1.2

From Table 2.1.2 it may be noted that indeed the current salary has a much
larger range than the beginning salary, $119,250 as opposed to $70,980 and
as expected the current salary also has a larger variance (this was already seen
from the standard deviation results of the previous output).

It may also be noted that, both variables have positive skewness and kurtosis.

Now, skewness measures to what extent a distribution of values deviates from
symmetry around the mean. A value of zero represents a symmetric, evenly
balanced distribution; positive skewness indicates a greater number of smaller
values (that is, skewed to the right) and negative skewness indicates a greater
number of larger values (that is, skewed to the left). So, for our example, for
both variables, the distribution of the values is skewed to the right.

Kurtosis looks at the thickness of the tails of a distribution. A kurtosis
value of zero indicates a shape close to the normal distribution, a positive
value indicates heavier tails than those of the normal distribution and a
negative value indicates lighter tails than those of the normal distribution.
Note that an extreme positive kurtosis (greater than 5) indicates a
distribution where most of the values are in the tails of the distribution
rather than around the mean.

So, in our case, both variables have an extreme positive kurtosis. Thus, both
distributions have heavier tails when compared to the normal distribution;
salbegin in particular.

It should be noted that the variables chosen for the descriptive statistics
procedure are both continuous. In fact, it would not have made sense if we
decided to find the descriptive statistics of variables like gender or jobcat.
What would the mean of jobcat mean? So, even though the descriptive
statistics procedure is useful for summarizing data with an underlying
continuous distribution, it will not prove helpful for interpreting categorical
data. When analyzing categorical data, it makes more sense to obtain
information on the number of cases (frequencies) that fall into the various
categories. Descriptive measures for categorical data will be given shortly.

Another procedure which offers even more descriptive statistics is the Explore
procedure. This procedure is available through Analyze, Descriptive Statistics,
Explore. The resulting dialogue box is as follows:

Figure 2.1.3

Start by moving the variables of interest into the Dependent List cell. In
this case the variables of interest are current and beginning salary. Then
click on the Statistics button to produce a dialogue box with several
additional descriptive statistics, as shown below:

Figure 2.1.4

By ticking the boxes Descriptives and Percentiles the following output is
obtained:

Descriptives

                                                    Statistic        Std. Error
Current Salary    Mean                              $34,419.57       $784.311
                  95% Confidence     Lower Bound    $32,878.40
                  Interval for Mean  Upper Bound    $35,960.73
                  5% Trimmed Mean                   $32,455.19
                  Median                            $28,875.00
                  Variance                          291578214.500
                  Std. Deviation                    $17,075.661
                  Minimum                           $15,750
                  Maximum                           $135,000
                  Range                             $119,250
                  Interquartile Range               $13,162
                  Skewness                          2.125            .112
                  Kurtosis                          5.378            .224
Beginning Salary  Mean                              $17,016.09       $361.510
                  95% Confidence     Lower Bound    $16,305.72
                  Interval for Mean  Upper Bound    $17,726.45
                  5% Trimmed Mean                   $16,041.71
                  Median                            $15,000.00
                  Variance                          61946944.960
                  Std. Deviation                    $7,870.638
                  Minimum                           $9,000
                  Maximum                           $79,980
                  Range                             $70,980
                  Interquartile Range               $5,168
                  Skewness                          2.853            .112
                  Kurtosis                          12.390           .224

Table 2.1.3

The Frequencies procedure provides various options for the analysis of
datasets, and some of its descriptive measures are also suitable for the
analysis of categorical variables. This procedure is available through
Analyze, Descriptive Statistics, Frequencies. The resulting dialogue box is
as follows:

Figure 2.1.5

After moving the variables of interest into the Variable(s) cell and then
clicking Ok, the frequency tables are obtained. The frequency tables below
result when jobcat and educ are considered to be the variables of interest:

Employment Category

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Clerical        363      76.6        76.6              76.6
       Custodial        27       5.7         5.7              82.3
       Manager          84      17.7        17.7             100.0
       Total           474     100.0       100.0

Table 2.1.4

Educational Level (years)

            Frequency   Percent   Valid Percent   Cumulative Percent
Valid  8          53      11.2        11.2              11.2
       12        190      40.1        40.1              51.3
       14          6       1.3         1.3              52.5
       15        116      24.5        24.5              77.0
       16         59      12.4        12.4              89.5
       17         11       2.3         2.3              91.8
       18          9       1.9         1.9              93.7
       19         27       5.7         5.7              99.4
       20          2        .4          .4              99.8
       21          1        .2          .2             100.0
       Total     474     100.0       100.0

Table 2.1.5

Clicking on the Statistics button produces a dialogue box with several
additional descriptive statistics, most of which have already been discussed
earlier. Note that none of these descriptive statistics are suitable for
analyzing categorical data. A typical output is as follows:

Statistics

                       Employment Category   Educational Level (years)
N            Valid             474                    474
             Missing             0                      0
Mean                            1.41                  13.49
Median                          1.00                  12.00
Percentiles  25                 1.00                  12.00
             50                 1.00                  12.00
             75                 1.00                  15.00

Table 2.1.6

Table 2.1.6 shows that indeed finding for example the mean, median and
percentiles for jobcat, a categorical variable, does not make sense.

Clicking on the Charts button, however, produces the following dialogue box
which allows you to graphically examine any type of variable, in several
different formats:

Figure 2.1.5

The information contained in the frequency table may be used to obtain a
graphical representation, which is easier to comprehend. Graphical
representations will be discussed in more detail at a later stage.

2.1.2 Summary Statistics using RStudio / R

Here we shall see how the statistics discussed in the previous subsection are
computed using R. Since interpretation is the same we shall not give too much
detail here but refer the reader to the previous sub-section for such detail.

Since we are going to make use of the dataset pima found in the faraway
package, we start by loading this package in R:

> library(faraway)

To get some information about the variables being used we can type
> ?pima

This dataset contains the following variables:

pregnant - Number of times pregnant
glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic - Diastolic blood pressure (mm Hg)
triceps - Triceps skin fold thickness (mm)
insulin - 2-hour serum insulin (mu U/ml)
bmi - Body mass index (weight in kg/(height in metres squared))
diabetes - Diabetes pedigree function
age - Age (years)
test - test whether the patient shows signs of diabetes (coded 0 if negative,
1 if positive)

Note that there are some missing values for the variables glucose, diastolic,
triceps, insulin and bmi. These values are denoted by '0' in the dataset, and
we need to identify them as missing values, else they will be used when
issuing plots or calculating statistics. As mentioned in the introductory
session on R/RStudio, the missing value code used by R or RStudio is NA.
Thus our first step is to set all such zero values to NA through the following
commands:

> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA
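
The same recoding can also be written more compactly (a sketch using the
base functions lapply and replace):

> vars <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
> pima[vars] <- lapply(pima[vars], function(v) replace(v, v == 0, NA))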

To obtain basic summary statistics for all the covariates in the data one can
use the following command:

> summary(pima)

This gives the following output:

If on the other hand we want summary statistics for the variable glucose only:
> summary(pima$glucose)

This gives the following output:


Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
44.0 99.0 117.0 121.7 141.0 199.0 5

A command for calculating skewness and kurtosis can be found in the package
psych. Other packages such as moments may be used to calculate these
statistics. Therefore first we load the package:

> library(psych)

Then we use the command describe() to compute a series of summary
statistics, including skewness and kurtosis:

> describe(pima, na.rm = TRUE, trim=.1)

na.rm = TRUE is used to inform the software that missing values must be
ignored;
trim=.1 is used to declare the proportion of trimming used in the trimmed
mean.

The following output is obtained:

where:
vars = variable number
n= number of valid cases
mean = sample mean
sd= standard deviation
trimmed =trimmed mean (with trim defaulting to .1)
median=median
mad = median absolute deviation (from the median)
min =minimum value
max=maximum value
skew = skewness
kurtosis = kurtosis
se=standard error of the mean

Individual commands:
> mean(pima$bmi, na.rm = TRUE)
[1] 32.45746
> var(pima$bmi, na.rm = TRUE)
[1] 47.95546
> sd(pima$bmi, na.rm = TRUE)
[1] 6.924988
> median(pima$bmi, na.rm = TRUE)
[1] 32.3
> min(pima$bmi, na.rm = TRUE)
[1] 18.2
> max(pima$bmi, na.rm = TRUE)
[1] 67.1
> skew(pima$bmi, na.rm = TRUE)
[1] 0.5916179

Kurtosis may be obtained by using the package e1071:

> library(e1071)
> kurtosis(pima$bmi, na.rm = TRUE)
[1] 0.839607

Measures of location such as quartiles/percentiles are computed using the
'quantile' function, as shown below.

Recall that the kth-percentile of a sample of values divides the sample such
that k% of the values lie below or are equal to the kth-percentile and (100-k)%
of the values lie above the kth-percentile. For example the 15th, 25th, and
35th percentiles of the variable bmi are obtained as follows:

> quantile(pima$bmi, na.rm = TRUE, prob=c(0.15, 0.25, 0.35))
  15%   25%   35%
25.10 27.50 29.56

Interpretation:
• The 15th percentile, denoted by 15% in the output shown above, implies
  that 15% of the BMI values are less than or equal to 25.10 while 85%
  are greater than 25.10.
• The 35th percentile, denoted by 35% in the output shown above, implies
  that 35% of the BMI values are less than or equal to 29.56 and hence
  65% of the BMI values are greater than 29.56.
• A similar interpretation holds for the 25th percentile, denoted by 25%.

Quartiles are the same as percentiles but are typically indexed by sample
fractions rather than by sample percentages. In R these are calculated as
percentiles. So for the variable bmi the quartiles are obtained as follows:

# Determine and interpret the quartiles for the variable bmi

> quantile(pima$bmi, na.rm = TRUE)
   0%   25%   50%   75%  100%
 18.2  27.5  32.3  36.6  67.1

Interpretation:

• The lower quartile, Q1, corresponds to the 25th percentile. From the
  output above we can conclude that 25% of the BMI values are ⩽ 27.5,
  while the other 75% of the values are > 27.5.
• The median, Q2, corresponds to the 50th percentile. From the previous
  output we can conclude that half the BMI values are ⩽ 32.3, while the
  other half are > 32.3.
• The upper quartile, Q3, corresponds to the 75th percentile. From the
  output above we can conclude that three-fourths of the data are ⩽ 36.6,
  while the remaining one-fourth are > 36.6.
• The minimum and maximum values of the data are 18.2 and 67.1,
  respectively.
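
The inter-quartile range, Q3 - Q1, may also be obtained directly; here this
is 36.6 - 27.5:

> IQR(pima$bmi, na.rm = TRUE)
[1] 9.1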

The variable test is a categorical variable – a factor. R will treat such variables
as quantitative unless it is told that they should be treated as factors (look back
at the output obtained from summary(pima) earlier on). To designate such
variables as factors so that they are treated appropriately, we use the following
command:

> pima$test <- factor(pima$test)

Then
> summary(pima$test)
0 1
500 268

Setting labels to the factor levels:


> levels(pima$test) <- c("negative","positive")
> summary(pima$test)
negative positive
500 268

3. GRAPHICAL REPRESENTATIONS

In this section we shall explore how one can produce graphical
representations, such as bar charts, pie charts, histograms, box plots and
scatter plots, using SPSS and R/RStudio. Note that the first two types of
chart are typically used to display graphically the frequency of values for
categorical variables (also referred to as qualitative variables or factors),
while histograms and box plots are used to display graphically the
distribution of values for covariates. Scatter plots are used for a visual
inspection of a possible relationship between two covariates.

In this chapter we shall continue using the ‘Employee’ dataset when using
SPSS and the ‘pima’ dataset when using R/RStudio.

Before proceeding to show how each of the abovementioned plots is obtained
through software, we shall first present a short description of each graph.

PIE CHARTS

Pie charts are a very common type of graph best suited for a qualitative
variable (for example gender or jobcat in the Employee data set or test in the
pima data set). They consist of a circle divided into slices of different portions.
Each slice corresponds to a different level of the categorical variable while the
portion of each slice represents the frequency of that level. The bigger the
portion the bigger the frequency of the level.

BAR GRAPHS OR BAR CHARTS

Bar graphs are a very common type of graph best suited for a categorical
variable (for example gender or jobcat in the Employee data set or test in the
pima data set).

Since there is no uniform distance between levels of a categorical variable,
the discrete nature of the individual bars is well suited for this type of variable.
In fact, bar graphs are a common way to graphically display the frequency of
each level of a categorical variable. Each column represents a different level
of the categorical variable while height of each column represents the
frequency of that level. Three important bar charts are typically considered,
namely:

• Simple – the most common, used to graph the frequencies of the levels
  of a single variable. The category axis of a simple bar graph is the
  variable and each bar represents a level of the variable.
• Clustered – used to display two categorical variables (factors) on one
  bar chart, by grouping together bars representing levels of a category.
• Stacked – also used to display two categorical variables (factors) on one
  bar chart. It consists of a bar for each level of one of the variables, while
  the levels of the other variable are placed on top of each other within
  each bar. So, each bar is subdivided into sections that show the relative
  sizes of the levels of the latter variable. In software these subdivisions
  are typically represented by different colours.

HISTOGRAMS

Histograms are similar to simple bar graphs except that each bar represents a
range of variable values rather than just a single value. What makes this
different from a regular bar graph is that each bar represents a summary of
data rather than an independent value.

For a histogram, the y-axis is almost always measured on a scalar scale
representing the count of how many of the sample values fall within each
range considered on the x-axis.

BOX PLOTS

In 1977, John Tukey published an efficient method for displaying a five-
number data summary. The graph is called a boxplot (also known as a box
and whisker plot) and it provides a graphical depiction of the median, the
upper and lower quartiles, and the minimum and maximum data values. The
boxplot also shows any outliers present in the data and it can give an insight
into the spread of the data.

3.1 USING SPSS

Consider once again the ‘Employee’ data set.

3.1.1 The Pie Chart

Let us plot a pie chart for the categorical variable gender.

Click on Graphs, Legacy Dialogs, Pie. The following dialog box will appear:

Figure 3.1.1

Choose Summaries for groups of cases and click on Define. The dialog box in
Figure 3.1.2 will then appear. Insert gender in Define Slices by: field and then
click on the OK button.

Figure 3.1.2

The following output will be obtained.

Figure 3.1.3

By double clicking on the pie chart in the SPSS output window, the chart
editor window is opened which will allow one to edit this plot to make it more
informative, attractive or simply change the colours of the slices. For example
if we click on Elements, Show data labels, in the chart editor, a new window
will open which will allow us to add values onto the slices as can be seen in
the figure which follows.

Figure 3.1.4

Some other possible formats for pie charts are as follows:

Figure 3.1.5

Figure 3.1.6

3.1.2 The Bar Graph (also known as a Bar Chart)

Click on Graphs, Legacy Dialogs, Bar. The following dialog box will appear:

Figure 3.1.7

As you can see in Figure 3.1.7, SPSS makes it possible for us to plot any of
the three types of bar charts discussed earlier.

The Simple Bar Chart

Click Simple and then Define. The following dialog box will appear:

Figure 3.1.8

Place the variable jobcat under Category Axis: and then click OK. The
following output will appear in the output viewer.

Figure 3.1.9

This is a simple bar chart for the variable jobcat. It shows clearly that in the
sample under study, most of the employees are clerical workers. As was the
case for pie charts it is possible to edit this output by double clicking on the
plot in the output viewer. This will allow us to change the colour of the bars,
edit legends and also add data labels. For example we can change Figure 3.1.9
as follows:

Figure 3.1.10
Figure 3.1.10 is more informative than Figure 3.1.9. It shows that 76.59% of
the individuals selected in the sample under study are clerical workers,
17.72% are Managers and the remaining 5.7% are custodial workers.

The Clustered Bar Chart

Click Clustered and then Define. The following dialog box will appear:

Figure 3.1.11

Place the variable jobcat under Category Axis:, gender under Define Clusters
by: and click OK. A clustered bar graph for these two factors will appear in
the output viewer and after some editing (as was done for the simple bar
graph) we obtain the following output:

Figure 3.1.12: Clustered bar graph for jobcat and gender

From the clustered bar graph in Figure 3.1.12 we note that in the sample under
study, 43.46% of the individuals are female clerical workers while 33.12% are
male clerical workers. 74 out of 84 managers are male. All 27 custodial
workers are male.

The Stacked Bar Chart

Click Stacked and then Define. The following dialog box will appear:

Figure 3.1.13

Place the variable jobcat under Category Axis:, gender under Define Stacks
by: and then click OK. A stacked bar graph for these two factors will appear
in the output viewer and after some editing (as was done for the simple bar
graph) we obtain the following output:

Figure 3.1.14: Stacked bar graph for jobcat and gender

Note that the stacked bar graph in Figure 3.1.14 displays the same information
as for the clustered bar graph in Figure 3.1.12. The information is just
displayed in a slightly different format. Hence the interpretation provided after
Figure 3.1.12 applies to Figure 3.1.14 as well.

3.1.3 The Histogram

A histogram is an important graphical representation which shows the
frequency of values in different intervals of a selected covariate. A
histogram is constructed by representing the grouped observations on a
horizontal scale and the frequency in each group on a vertical scale. A
histogram of the variable Current Salary is produced by selecting:
Graphs → Legacy Dialogs → Histogram.

Then move the Current Salary into the variable list as follows:

Figure 3.1.15

As shall be seen later on in the notes, a number of statistical tests rely on the
normality assumption of the data. It may thus be of interest to plot the normal
distribution curve superimposed on our histogram. So as to obtain this
overlay, one may select Display Normal Curve before pressing OK.

As can be seen in Figure 3.1.16, the normal curve superimposed on the
histogram of the variable of interest has the same mean and standard deviation
as those of the covariate being considered.

Figure 3.1.16: Histogram for the variable Current Salary

Comparing the histogram with the normal curve may help one visualize the
skewness and kurtosis features. Reminder: the normal curve has zero
skewness and zero kurtosis. From Table 2.1.2 we know that the distribution
of Current Salary has positive skewness and kurtosis. Figure 3.1.16, in fact,
shows that the data is indeed positively skewed (due to its asymmetry) and
also leptokurtic (our histogram has heavier tails than the normal curve).

Since the assumption of normality is invoked in many inferential statistics, a
visual assessment of the distribution is certainly not enough. Tests for
normality will be discussed in Chapter 4.

3.1.4 The Box Plot

Each box plot shows the median, quartiles and extreme values of a covariate.
If we would like to obtain the median, quartiles and extreme values of the
variable Current Salary, this can be produced by clicking on:
Graphs → Legacy Dialogs → Boxplot → Simple.

Figure 3.1.17

Select Summaries of separate variables and click Define. Then move Current
Salary (covariate) in the Boxes Represent list and press OK.

Figure 3.1.18

The following box plot is obtained:

Figure 3.1.19: Box plot for the variable Current Salary

First of all note that a box plot may be displayed either vertically, as in
Figure 3.1.19, or horizontally. Now, the boxplot is interpreted as follows:

• The box itself contains the middle 50% of the data.

• The upper edge (hinge) of the box indicates the 75th percentile of the
  data set and the lower hinge indicates the 25th percentile. The range
  between the lower and upper quartiles is known as the inter-quartile
  range.

• The line in the box indicates the median value of the data; if the median
  line within the box is not equidistant from the hinges, then the data is
  skewed. In our example, the line is closer to the 25th percentile, so the
  box plot is also showing that the distribution of the data is positively
  skewed.

• The ends of the vertical lines or "whiskers" indicate the minimum and
  maximum data values, unless outliers are present, in which case the
  whiskers extend to a maximum of 1.5 times the inter-quartile range.

• The points outside the ends of the whiskers are outliers or suspected
  outliers. Hence, from our plot, it is obvious that there are many outliers
  or suspected outliers in our data.

A box plot can also be used to analyze a covariate and a categorical variable
simultaneously. The covariate is summarized within the levels (categories) of
the categorical variable, and each box shows the median, quartiles and extreme
values within a category. If we would like to obtain the median, quartiles and
extreme values of Current Salary according to Employment Category, this can
be produced by clicking on: Graphs → Legacy Dialogs → Boxplot → Simple.

Figure 3.1.20

Select Summaries of separate variables and click Define. Then move Current
Salary (covariate) in the variable list and Employment Category (factor) in the
category axis as follows:

Figure 3.1.21

Press OK to obtain the following plot:

Figure 3.1.22: Box plot for the variable Current Salary by Employment Category

If we would like to obtain the median, quartiles and extreme values of the
Current Salary according to Employment Category, clustered by the factor
gender, the procedure is to click on Graphs → Legacy Dialogs → Boxplot →
Clustered → Define.

Figure 3.1.23

Then do selections as shown in the figure which follows and press OK.

Figure 3.1.24

This will give the following plot:

Figure 3.1.25: Box plot for the variable Current Salary by Employment
Category and Gender

3.1.5 The Scatter Plot

If we want to reveal important relationships between two covariates, a scatter
plot is of vital importance. A scatter plot can also reveal outliers and unusual
combinations of values in numerical data. The procedure to obtain a scatter
plot in SPSS is to click on Graph → Legacy Dialogs → Scatter → Simple →
Define.

Then, move the variable Current Salary in the y-axis and the Beginning Salary
in the x-axis as shown in Figure 3.1.26:

Figure 3.1.26: Creating the scatter plot

Upon clicking OK, a scatter plot appears in the output viewer. Double click
on the graph to open the chart editor and hence select Elements and Fit Line
at Total to get:

Figure 3.1.27: Scatter plot for Current and Beginning Salary

A straight line was plotted over the data in Figure 3.1.27 since there seems to
be a linear relationship between the two salary variables. This straight line is
known as the regression line or line of best fit. One common extension of the
above example is to plot separate regression lines for subgroups. For example,
we could plot separate regression lines for males and females in the above
example to visually examine whether the relationship between current salary
and beginning salary is the same for both males and females. To do this, first
we place a categorical variable in the Set Markers by, as shown earlier, then
double click on the resulting chart, hence select Elements and Fit Line at
Subgroups to get:

Figure 3.1.28: Scatter plot for Current and Beginning Salary by
Gender

3.2 USING R/RSTUDIO

In this section we shall show how the graphical representations discussed in
Section 3.1 are obtained in R. Although the plots may vary slightly from those
issued using SPSS, interpretation is the same, so in this section we shall not
give too much detail on interpretation; for this, refer to Section 3.1.

3.2.1 The Pie Chart

Remember that before proceeding to work with a categorical variable (factor),
you need to define the variable as a factor in R; refer to the end of
Section 2.1.2.

1. Simple Pie Charts

Commands:

> ?pie   # to access documentation on the command
> slices <- summary(pima$test)
> pie(slices, labels=levels(pima$test), main="Pie Chart of Test")

The following graph is obtained:

Figure 3.2.1

The procedure used to copy Figure 3.2.1 is as described in Section 1.3.7.
Alternatively one can copy and paste a graph into a Word document, but
beware that this works well only with the outputs issued by R as, at times,
pictures saved in Bitmap and Metafile format from RStudio might not be of
good enough quality.

For the advanced user, the package ReporteRs together with the package
ReporteRjars may be used to import high quality charts into Word from
R/RStudio. Using the latter procedure, Figure 3.2.1 would need to be created
as follows:

library(ReporteRs)
library(ReporteRjars)
mydoc= docx()
slices<-summary(pima$test)
mydoc <- addPlot(mydoc, function() pie(slices,
labels=levels(pima$test),main="Pie Chart of Test" ))
writeDoc(mydoc, file="Pie Chart of Test.docx")

The chart produced will be saved in a new Word document entitled Pie Chart
of Test.docx, in the directory in which you are currently working. The
command getwd() will remind you which directory you are currently working
in with R/RStudio. If you use R 64-bit and you get a Java related error on
running the above commands, you will need to download Java 64-bit to be
able to use the commands just given.

A further note: An alternative format which was used to produce high quality
pictures for use with a Word document, used to be Postscript (.EPS). As of
April 2017, Microsoft is not supporting such a format anymore.

2. Pie Chart with Annotated Percentages/Counts

Commands (Percentages):

> slices <- summary(pima$test)
> lbls <- levels(pima$test)
> pct <- round(slices/sum(slices)*100)
> lbls <- paste(lbls, pct)         # add percentages to labels
> lbls <- paste(lbls,"%",sep="")   # add the % sign to labels
> pie(slices, labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Test")

The following graph is obtained.


Figure 3.2.2

Commands (Count):

> slices <- summary(pima$test)
> lbls <- levels(pima$test)
> pct <- round(slices)
> lbls <- paste(lbls, pct)   # add counts to labels
> pie(slices, labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Test")

The following graph is obtained.

Figure 3.2.3

3. 3D Pie Chart

Commands (Count):

> slices <- summary(pima$test)
> lbls <- levels(pima$test)
> pct <- round(slices)
> lbls <- paste(lbls, pct)   # add counts to labels
> library(plotrix)
> pie3D(slices, labels=lbls, explode=0.1,
main="Pie Chart of Test")

The following graph is obtained.

Figure 3.2.4

3.2.2 The Bar Graph / Bar Chart

There are various colour options for the bars of a bar chart. A list of the
possible colours that one can use is obtained by using the command 'palette'
as follows:

> palette() # gives the list of possible colours for bars
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"  "gray"

1. Simple Bar Chart

Such a graph can be obtained by using the command ‘barplot’ which is found
in the inbuilt package ‘graphics’. To gain more information about this
command type the following in R or RStudio:
help("barplot").

Commands:

> counts <- table(pima$test)
> plot <- barplot(counts, main="Test Distribution",
names.arg=levels(pima$test), col="blue")

> text(plot, 100, labels=counts,col="red") # adding labels

The following graph is obtained.

Figure 3.2.5

Now, let's create a more complex simple bar graph using various arguments:

Commands:

> barplot(counts, main="Test Distribution", xlab="Test Outcome",
ylab="My Y Values", names.arg=levels(pima$test),
border="red", density=c(90, 70, 50, 40, 30, 20, 10))
> text(plot, 100, labels=counts, col="red")   # adding labels

The following graph is obtained:

Figure 3.2.6: Bar chart titled Test Distribution (x-axis: Test Outcome;
y-axis: My Y Values; bar labels 500 and 268)

Note that the ‘density’ field is used to add the shading – changing the values
in brackets changes the intensity of the shading.

2. Stacked Bar Charts

Two categorical variables are needed to construct a stacked bar chart. The
data as it stands does not have two categorical variables. If the analysis
justifies the categorization of a covariate, we can proceed as follows.

We can create agecat, a categorized version of the variable age.
Categorization is typically carried out according to categories previously
used in the literature, or otherwise according to the research question. From
the summary statistics we saw that age varies between 21 and 81, so for the
purpose of these notes we are going to use the following age categories:

1 = 20-45 "Young", 2 = 46-65 "Middle Aged", 3 = 66+ "Elder"

Recall that in R this is done as follows (the boundaries are written so that
every age falls into exactly one category; note in particular that ages 46 and
66 belong to the Middle Aged and Elder groups respectively):

> pima$agecat[pima$age >= 66] <- "Elder"
> pima$agecat[pima$age >= 46 & pima$age <= 65] <- "Middle Aged"
> pima$agecat[pima$age <= 45] <- "Young"

Having created a new categorical variable, we can proceed to plot a stacked
bar graph for test and age group.

Commands:
First one must construct a contingency table which will be used to plot the bar
chart:

> counts2 <- table(pima$test, pima$agecat)
> counts2
           Elder Middle Aged Young
  negative     7          45   440
  positive     2          47   210

Function ‘barplot’ is then used to plot the bar chart.

Note that: the field ‘width’ in the command ‘barplot’ defines the width of
each bar, ‘legend.text=T’ places the legend in a default position,
‘xlim=c(lower,upper)’ is used to specify the horizontal "limits" of the image.

Commands:

> plot<-barplot(counts2, main="Test by Age Group Distribution",
xlab="Age group", col=c("blue","green"), width=1,
xlim=c(0,5), legend.text=T)

The following graph is obtained.

Figure 3.2.7

If we want to change the location of the legend we can use the field ‘inset’ as
shown below:

> plot<-barplot(counts2, main="Test by Age Group Distribution",
xlab="Age group", col=c("blue","green"), width=1)
> legend("topright", inset=c(0.4,0), fill=c("blue","green"),
legend=rownames(counts2))

The following graph is obtained.

Figure 3.2.8

Another way of constructing a stacked bar chart is by using the command
'qplot' in the package 'ggplot2'. For this command you do not need to tabulate
your values beforehand, as with barplot. If you set the field geom = "bar", it
will count the number of instances of each class.

Commands:

> library(ggplot2)
> qplot(pima$test, geom="bar", fill=pima$agecat, xlab="Test outcome",
main="Test by Age Group Distribution")

The following graph is obtained.

Figure 3.2.9

The legend title may also be changed by including +labs(fill='NEW LEGEND
TITLE') at the end of the previous command.
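
For instance, to change the legend title in Figure 3.2.9 to Age group (an
arbitrary title):

> qplot(pima$test, geom="bar", fill=pima$agecat, xlab="Test outcome",
main="Test by Age Group Distribution") + labs(fill="Age group")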

3. Stacked Bar Charts with Annotated Percentages/Counts

The command ‘barplot’ does not offer an option to show percentages or counts
on the stacked bar charts. If we want the counts or percentages on the charts
we can use the command ‘ggplot’ in package ‘ggplot2’.

First we need to create a data frame containing the information needed to
produce the chart. This is constructed as follows:

> counts2 <- table(pima$test, pima$agecat)
> counts2
           Elder Middle Aged Young
  negative     7          45   440
  positive     2          47   210

> Age_group <- c(rep(c("Elder", "Middle Aged", "Young"), each = 2))
> Test_outcome <- c(rep(c("Negative", "Positive"), times = 3))
> Frequency <- as.vector(counts2)
> Data <- data.frame(Age_group, Test_outcome, Frequency)

> Data
    Age_group Test_outcome Frequency
1       Elder     Negative         7
2       Elder     Positive         2
3 Middle Aged     Negative        45
4 Middle Aged     Positive        47
5       Young     Negative       440
6       Young     Positive       210

Then to obtain a stacked bar chart with labels we use the following commands:

> library(plyr)

The following command is used to calculate the midpoints of the bars:

> Data <- ddply(Data, .(Test_outcome), transform,
pos = cumsum(Frequency) - (0.5 * Frequency))

Plotting the bars and adding the count values:

> # plot bars and add text
> p <- ggplot(Data, aes(x = Test_outcome, y = Frequency)) +
geom_bar(aes(fill = Age_group), stat="identity") +
geom_text(aes(label = Frequency, y = pos), size = 3)
> p

The following graph is obtained:

Figure 3.2.10

4. Clustered Bar Charts

Here we can use the command barplot again, but we need to add the field
besid=T which tells R to place the bars next to each other rather then stacking
them one on top of the other.

Commands:

> counts2 <- table(pima$agecat, pima$test)
> plot<-barplot(counts2, main="Test by Age Group Distribution",
xlab="Test outcome", col=c("blue","green","red"), width=1,
beside=T, xlim=c(0,12), ylim=c(0,500), legend.text=T)

> #legend("topright", inset=c(0,0), fill=c("blue","green","red"),
> #       legend=rownames(counts2))

The next command gives the position of the value. Change the value 49 to see
what happens.

> ypos.outside <- apply(t(counts2), 1, function(x) x + 49)

The next command adds the values onto the bars:

> text(plot, ypos.outside, counts2)

The following graph is obtained.

Figure 3.2.11

3.2.3 The Histogram

Plotting a histogram in R is a very easy task. Say we want to plot a histogram
for the covariate diastolic.

Command:

> h <- hist(pima$diastolic, main="Histogram for Diastolic",
xlab="Diastolic", ylim=c(0,250))

The following graph is obtained:

Figure 3.2.12

We can add a normal distribution curve to this histogram by adding the
following commands after the previous command:

> # adding normality curve
> m=mean(pima$diastolic, na.rm = TRUE)
  # na.rm = TRUE removes any NA values from the variable before calculating the mean
> std=sqrt(var(pima$diastolic, na.rm = TRUE))
> xfit<-seq(min(pima$diastolic, na.rm = TRUE), max(pima$diastolic, na.rm = TRUE), length=40)
> yfit<-dnorm(xfit, mean=m, sd=std)
> yfit <- yfit*diff(h$mids[1:2])*length(pima$diastolic)
> lines(xfit, yfit, col="blue", lwd=2)

The following graph is obtained:

Figure 3.2.13

3.2.4 The Boxplot

To obtain a boxplot in R, use the command 'Boxplot' in the package 'car'.
Next we produce a boxplot for the variable diastolic.

Commands:

> library(car)
> Boxplot(pima$diastolic,main="Diastolic",id.method="none")

The following graph is obtained:

Figure 3.2.14

Note that the field 'id.method' has the following options:

• if "y" (the default), all outlying points are labeled;
• if "identify", points may be labeled interactively;
• if "none", no point identification is performed.

Hence if instead we type:

> Boxplot(pima$diastolic,main="Diastolic")
[1] 19 126 598 600 44 85 107 178 363 550 659 663 673 692

Note that the values 19, 126, etc. identify the row numbers at which the
outliers can be found in the dataset.
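
These row numbers may then be used to inspect the flagged observations
directly, for example:

> pima[c(19, 126),]   # view two of the flagged rows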

The following graph is obtained:

Figure 3.2.15

Next we obtain a boxplot for ‘diastolic’ for each level of the variable ‘test’.

Command:

> myboxplot <- Boxplot(pima$diastolic~pima$test, main="Diastolic",
xlab="Test Result", ylab="Diastolic Blood Pressure")

The following plot is obtained:

Figure 3.2.16

Alternatively, boxplots may also be obtained by using the function 'boxplot'
from the inbuilt package 'graphics'.

Command:

> myboxplot2<-boxplot(pima$diastolic~pima$test,
main="Diastolic", xlab="Test Result",
ylab="Diastolic Blood Pressure")

The following output can be obtained by printing the object:

> myboxplot2
$stats
     [,1]  [,2]
[1,]   38  48.0
[2,]   62  68.0
[3,]   70  74.5
[4,]   78  84.0
[5,]  100 108.0
attr(,"class")
 negative
"integer"

$n
[1] 481 252

$conf
         [,1]     [,2]
[1,] 68.84733 72.90751
[2,] 71.15267 76.09249

$out
 [1]  30 122 108 110  24 106 106  40 110  30 110 114

$group
 [1] 1 1 1 1 1 1 1 2 2 2 2 2

$names
[1] "negative" "positive"

where:

stats = a matrix; each column contains the extreme of the lower whisker, the
        lower hinge, the median, the upper hinge and the extreme of the upper
        whisker for each group/plot
n = a vector with the number of observations in each group
conf = a matrix where each column contains the lower and upper extremes of
       the notch
out = the values of any data points which lie beyond the extremes of the
      whiskers
group = a vector of the same length as out whose elements indicate to which
        group the outlier belongs
names = a vector of names for the groups

The following plot is obtained:

Figure 3.2.17

For more details on how to interpret a box plot please refer to Section 3.1.4.

3.2.5 The Scatter Plot

1. Bivariate

To obtain a scatter plot to investigate the relationship between the variables
bmi and glucose:

Command:

> plot(pima$bmi, pima$glucose, xlab="BMI", ylab="Glucose")

The following graph is obtained:

Figure 3.2.18

To fit a simple regression line onto Figure 3.2.18, add the following command
after the previous one:

> abline(lm(pima$glucose ~ pima$bmi))
# fitting a regression line on the scatter plot

The following graph is obtained:

Figure 3.2.19
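
The strength of the linear association suggested by the fitted line may be
quantified with the sample correlation, where use = "complete.obs" discards
the missing values:

> cor(pima$bmi, pima$glucose, use = "complete.obs")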

2. Matrix

The following command may be used to obtain a matrix of scatter plots for
the first 4 covariates in the dataset 'pima'.

Command:

> pairs(pima[1:4])

The following graph is obtained:

Figure 3.2.20

