R Programming
Articles
R Programming
Introduction
Sample Session
Manage your workspace
Settings
Documentation
Control Structures
Working with functions
Debugging
Using C or Fortran
Utilities
Estimation utilities
Packages
Data types
Working with data frames
Importing and exporting data
Text Processing
Times and Dates
Graphics
Grammar of graphics
Publication quality output
Descriptive Statistics
Mathematics
Optimization
Probability Distributions
Random Number Generation
Maximum Likelihood
Method of Moments
Bayesian Methods
Bootstrap
Multiple Imputation
Nonparametric Methods
Linear Models
Quantile Regression
Binomial Models
Multinomial Models
Tobit And Selection Models
Count Data Models
Duration Analysis
Time Series
Factor Analysis
Ordination
Clustering
Network Analysis
Profiling R code
Parallel computing with R
Sources
Index
References
Article Sources and Contributors
Image Sources, Licenses and Contributors
Article Licenses
License
R Programming
'R' programming Wikibook
This book is designed to be a practical guide to the R programming language [1].
R is free software designed for statistical computing. There is already great documentation for the standard R packages on the Comprehensive R Archive Network (CRAN) [2] and many resources in specialized books, forums such as Stackoverflow [3] and personal blogs [4], but all of these resources are scattered and therefore difficult to find and to compare. The aim of this Wikibook is to be the place where anyone can share his or her knowledge and tricks on R. It is organized by task rather than by discipline [5]: we try to make a cross-disciplinary book, i.e. a book that can be used by anyone applying statistics to any specific field.
How can you share your R experience?
Explain the syntax of a command.
Compare the different ways of performing each task using R.
Try to make unique examples based on fake data (i.e. simulated data sets).
As with any Wikibook please feel free to make corrections, expand explanations, and make additions where necessary.
Some rules:
Put the name of functions in the code environment: <code>lm()</code>
Put the name of packages in bold: '''stats'''
Except for the default packages, put the name of the package in parentheses after each function: <code>MCMCprobit()</code> ('''MCMCpack''')
Put examples in the source environment with argument lang="rsplus": <source lang="rsplus"></source>
Use references to package documentations, academic literature and Wikipedia.
Use the citation templates to make citations: Template:Cite book, Template:Cite web, Template:Cite journal
If you want to add a graph, you should load it on Commons and add the tag {{Created with R}} and, if it is applicable, [[Category:Images with R source code]].
Prerequisites
We assume that readers have a background in statistics. This book is not a book about statistics but a book about how to implement statistical methods using R. We try to use terms which are already defined on Wikipedia, so that readers can refer to the corresponding Wikipedia page whenever they have doubts about a notion.
We also assume that readers are familiar with computers and that they know how to use software with a
command-line interface. There are some graphical user interfaces for R but we are not going to explain how to use
them in this textbook. Beginners should have a look at the Sample session for a first session with R. They can also
have a look at the Statistical Analysis: an Introduction using R book.
References
[1] R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
[2] CRAN manuals (http://cran.r-project.org/doc/manuals/)
[3] R on Stackoverflow (http://stackoverflow.com/questions/tagged/r)
[4] See R-Bloggers (http://www.r-bloggers.com/)
[5] The CRAN Task View is already organized by discipline.
Introduction
What is R?
R is statistical software which is used for data analysis. It includes a huge number of statistical procedures such as
t-test, chi-square tests, standard linear models, instrumental variables estimation, local polynomial regressions, etc. It
also provides high-level graphics capabilities.
Why use R?
R is free software. R is an official GNU project and is distributed under the GNU General Public License (GPL).
R is a powerful data-analysis package with many standard and cutting-edge statistical functions. See the Comprehensive R Archive Network's Task Views [1] to get an idea of what you can do with R.
R is a programming language, so its abilities can easily be extended through the use of user-defined functions. A large collection of user-contributed functions and packages can be found at CRAN's Contributed Packages [2] page.
R is widely used in political science, statistics, econometrics, actuarial sciences, sociology, finance, etc.
R is available for all major operating systems (Windows, Mac OS, GNU-Linux).
R is object oriented. Virtually anything (e.g., complex data structures) can be stored as an R object.
R is a matrix language.
R syntax is much more systematic than Stata or SAS syntax.
R can be installed on your USB stick [3].
Alternatives to R
S-PLUS is a commercial implementation of the S programming language, of which R is a free implementation.
Gretl is free software for econometrics. It has a graphical user interface and is nice for beginners.
SPSS is proprietary software which is often used in sociology, psychology and marketing. It is known to be easy
to use.
GNU PSPP is a free-software alternative to SPSS.
SAS is proprietary software that can be used with very large datasets such as census data.
Stata is proprietary software that is often used in economics and epidemiology.
MATLAB is proprietary software used widely in the mathematical sciences and engineering.
Octave is free software similar to MATLAB. The syntax is largely the same, and most MATLAB code can be used in Octave.
Python is a general programming language. It includes some specific libraries for data analysis such as Pandas.
Beginners can have a look at GNU PSPP or Gretl. Intermediate users can check out Stata. Advanced users who like
matrix programming may prefer MATLAB or Octave. Very advanced users may use C or Fortran.
R programming style
R is an object oriented programming language. This means that virtually everything can be stored as an R object.
Each object has a class. This class describes what the object contains and what each function does with it. For
instance, plot(x) produces different outputs depending on whether x is a regression object or a vector.
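A small illustration of this dispatch, using simulated data:

```r
x <- rnorm(100)          # a numeric vector
y <- 2 * x + rnorm(100)  # a response linearly related to x
fit <- lm(y ~ x)         # a regression object of class "lm"

plot(x)                  # for a vector: an index plot of the values
plot(fit)                # for an "lm" object: regression diagnostic plots
```

The generic function plot() looks at the class of its argument and calls the matching method (here plot.default() or plot.lm()).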
The assignment symbol is "<-". Alternatively, the classical "=" symbol can be used.
The two following statements are equivalent:
> a <- 2
> a = 2
Arguments are passed to functions inside round brackets (parentheses).
One can easily combine functions. For instance you can directly type
mean(rnorm(1000)^2)
The symbol "#" comments to the end of the line:
# This is a comment
5 + 7 # This is also a comment
Commands are normally separated by a newline. If you want to put more than one statement on a line, you can
use the ";" delimiter.
a <- 1:10 ; mean(a)
You can also have one statement on multiple lines.
R is case sensitive: a and A are two different objects.
Traditionally underscores "_" are not used in names. It is often better to use dots ".". One should avoid using an
underscore as the first character of an object name.
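A few lines illustrating these conventions:

```r
# A statement can span several lines: R keeps reading
# as long as the expression is syntactically incomplete
total <- sum(1, 2, 3,
             4, 5)

# R is case sensitive: a and A are two different objects
a <- 1
A <- 2
a == A          # FALSE

# dots rather than underscores in names (traditional style)
my.mean <- mean(1:10)
```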
How you can help
Here are some things editors do to keep this book internally consistent. If you have something to contribute, go
ahead and make your contribution. Other editors can touch up your edits afterwards so that they conform to the
guidelines.
The local manual of style WB:LMOS for the R programming book, including a brief explanation of why we do it
that way, is:
Examples use "source" tags: <source lang="rsplus"> a <- 1:10 ; mean(a) </source>. That makes them look pretty to our readers.
The names of packages are in bold: '''Hmisc'''.
The names of functions are in "code" tags: <code>lm()</code>.
Page titles -- the part after "R Programming/" -- are in sentence case, like "R Programming/Working with data frames". We couldn't decide between sentence case and title case, so I flipped a coin.
Every page has <noinclude>{{R Programming/Navigation}}</noinclude> at the top
and {{R Programming/Navbar|Mathematics|Probability Distributions}} at the
bottom. That makes it easier to navigate from one page to another online.
See Also
Google's R Style Guide [4]: a set of rules for R programmers
References
[1] http://cran.r-project.org/web/views/
[2] http://cran.r-project.org/web/packages/
[3] Portable R by Andrew Redd http://sourceforge.net/projects/rportable/
[4] http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
Sample Session
This page is an introduction to the R programming language. It shows how to perform very simple tasks using R.
First you need to have R installed (see the Settings page). If you use Windows or Mac OS, the easiest solution is to
use the R Graphical User Interface (click on its icon). If you use Linux, open a terminal and type R at the command
prompt.
Usually when you open R, you see a message similar to the following in the console:
R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
You can type your code after the angle bracket >.
R can be used as a simple calculator to perform basic computations.
> # Sample Session
> # This is a comment
>
> 2 # print a number
[1] 2
> 2+3 # perform a simple calculation
[1] 5
> log(2)
[1] 0.6931472
We can also store numeric or string objects.
> x <- 2 # store an object
> x # print this object
[1] 2
> (x <- 3) # store and print an object
[1] 3
>
> x <- "Hello" # store a string object
> x
[1] "Hello"
We can also store vectors.
> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector
> Height # print the vector
[1] 168 177 177 177 178 172 165 171 178 170
>
> Height[2] # Print the second component
[1] 177
> Height[2:5] # Print the second, the 3rd, the 4th and 5th component
[1] 177 177 177 178
>
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
[1] 1 2 3 4 5 6 7 8 9 10
>
> Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
>
> BMI <- Weight/((Height/100)^2) # Performs a simple calculation using vectors
> BMI
[1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
[9] 16.09645 25.95156
We can also describe the vector with length(), mean() and var().
> length(Height)
[1] 10
> mean(Height) # Compute the sample mean
[1] 173.3
> var(Height)
[1] 22.23333
We can also define a matrix.
> M <- cbind(obs,Height,Weight,BMI) # Create a matrix
> typeof(M) # Give the type of the matrix
[1] "double"
> class(M) # Give the class of an object
[1] "matrix"
> is.matrix(M) # Check if M is a matrix
[1] TRUE
> is.vector(M) # M is not a vector
[1] FALSE
> dim(M) # Dimensions of a matrix
[1] 10 4
We can plot the data using plot().
> plot(Height,Weight,ylab="Weight",xlab="Height",main="Corpulence")
We can define a dataframe.
> mydat <- data.frame(M) # Creates a dataframe
> names(mydat) # Give the names of each variable
[1] "obs" "Height" "Weight" "BMI"
> str(mydat) # give the structure of your data
'data.frame': 10 obs. of 4 variables:
$ obs : num 1 2 3 4 5 6 7 8 9 10
$ Height: num 168 177 177 177 178 172 165 171 178 170
$ Weight: num 88 72 85 52 71 69 61 61 51 75
$ BMI : num 31.2 23 27.1 16.6 22.4 ...
>
> View(mydat) # Look at your data
>
> summary(mydat) # Descriptive Statistics
obs Height Weight BMI
Min. : 1.00 Min. :165.0 Min. :51.00 Min. :16.10
1st Qu.: 3.25 1st Qu.:170.2 1st Qu.:61.00 1st Qu.:21.25
Median : 5.50 Median :174.5 Median :70.00 Median :22.70
Mean : 5.50 Mean :173.3 Mean :68.50 Mean :22.89
3rd Qu.: 7.75 3rd Qu.:177.0 3rd Qu.:74.25 3rd Qu.:25.29
Max. :10.00 Max. :178.0 Max. :88.00 Max. :31.18
>
You can save an R session (all the objects in memory) and load the session.
> save.image(file="~/Documents/Logiciels/R/test.rda")
> load("~/Documents/Logiciels/R/test.rda")
We can define a working directory. Note for Windows users: R uses the slash ("/") in directory paths instead of the backslash ("\").
> setwd("~/Desktop") # Sets working directory (character string enclosed in "...")
> getwd() # Returns current working directory
[1] "/Users/username/Desktop"
> dir() # Lists the content of the working directory
There are some special values in R:
NA: Not Available (i.e. missing values)
NaN: Not a Number (e.g. 0/0)
Inf: Infinity
-Inf: Minus Infinity
For instance 0 divided by 0 gives NaN, whereas 1 divided by 0 gives Inf:
> 0/0
[1] NaN
> 1/0
[1] Inf
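Missing values propagate through most computations; many functions have an na.rm argument to discard them:

```r
x <- c(1, 2, NA, 4)
mean(x)               # NA, because one component is missing
mean(x, na.rm = TRUE) # 2.333333, the mean of the observed values
is.na(x)              # FALSE FALSE TRUE FALSE
```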
We can exit R using q(). The "no" argument specifies that the R session is not saved.
q("no")
Manage your workspace
This page explains how to manage your workspace.
Basic functions
ls() lists the objects in your workspace.
list.files() lists the files located in the working directory.
rm() removes objects from your workspace; rm(list = ls()) removes them all.
rm(list=ls()) # remove all the objects in the workspace
Each object can be saved to the disk using the save() function. It can then be loaded back into memory using load().

# assume you want to save an object called 'df'
save(df, file = "file.Rda")
...
load("file.Rda")

save.image() saves your whole workspace.
Information about the session
sessionInfo() gives information about your session, i.e., loaded packages, R version, etc.
R.version provides information about the R version.
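For instance, both can be queried from a script:

```r
R.version.string   # the running R version as a single character string
sessionInfo()      # R version, platform, locale and loaded packages
```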
Memory usage
Note: According to R version 2.15.2 on Linux and Mac, memory.size() and memory.limit() are Windows-specific.
memory.size() gives the total amount of memory currently used by R.
> memory.size()
[1] 10.18
memory.limit() without any argument gives the limit of memory used by R. This can also be used to increase
the limit. The maximum amount is limited by the memory of the computer.
> memory.limit()
[1] 1535
> memory.limit(size=2000) # 2000 stands for 2000 MB
[1] 2000
object.size() returns the size of an R object. You can print the result and choose the unit (bytes, kilobytes, megabytes, etc.).
> a <- rnorm(10^7)
> object.size(a)
80000024 bytes
> print(object.size(a),units="b")
80000024 bytes
> print(object.size(a),units="Kb")
78125 Kb
> print(object.size(a),units="Mb")
76.3 Mb
> print(object.size(a),units="Gb")
0.1 Gb
> print(object.size(a),units="auto")
76.3 Mb
memory.profile() returns more details.
> memory.profile()
NULL symbol pairlist closure environment promise
1 4959 61794 1684 255 3808
language special builtin char logical integer
14253 46 687 5577 2889 4060
double complex character ... any list
523 1 11503 0 0 1024
expression bytecode externalptr weakref raw S4
1 0 497 117 118 642
gc() triggers a garbage collection and reports on memory usage.
References
External links
Dumping functions from the global environment into an R script file (http://www.r-statistics.com/2010/09/dumping-functions-from-the-global-environment-into-an-r-script-file/)
Settings
This page shows how to install R, customize it and choose a working environment. Once you have installed R, you may want to choose a working environment. This can be a simple text editor (such as Emacs, Vim or Gedit), an integrated development environment (IDE) or a graphical user interface (GUI). RStudio is now a popular option.
Installation
Linux
Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing sudo aptitude install r-base or sudo apt-get install r-base (don't forget that this has to be done as root), or installing the package r-base using your favourite package manager, for example Synaptic. There is also a bunch of packages extending R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.
Installation with apt-get (Debian, Ubuntu and all linux distributions based on Debian)
sudo apt-get install r-base
sudo apt-get install r-recommended
Installation with aptitude (Debian, Ubuntu and all linux distributions based on Debian)
sudo aptitude install r-base
sudo aptitude install r-recommended
Mac OS
Installation: Visit the R project website (http://r-project.org/), select the "CRAN" page and choose a mirror. Download the disk image (dmg file) and install R.
The default graphical user interface for Mac is much better than the one for Windows. It includes
a dataframe manager,
a history of all commands,
a program editor which supports syntax highlighting.
Windows
(Section source [1])
Download
To install R under the Windows operating system you have to download the binaries from the web. First go to r-project.org [2], click CRAN under the download section on the left panel and select a mirror site from which to download the required content. The best idea is to pick a mirror closest to your actual geographical location, but other ones should work as well. Then click Windows and, under subdirectories, base. The Windows binary is an exe file of the form R-x.x.x-win32.exe, where x denotes the actual version of the program. Regardless of the version, the setup has the same steps.
Setup
As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.
1. On the welcome screen click Next.
2. Read or just notice the GNU license, and click Next.
3. Select the location where R should be installed. In case you don't prefer a particular location on your hard disc, the default choice will be OK for you.
4. During the next step you can specify which parts of R you want to install. Choices are: User installation, Minimal user installation, Full installation and Custom installation. Notice the required space under the selection panel (varies between 20 and 66 MB). In case you are a beginner in R, choose the default User installation.
5. In this step you can choose between 2 ways. If you accept the defaults, you skip the 3 "extra" steps during installation (see below).
6. You can specify the Start menu folder.
7. In the next step you can choose between shortcut possibilities (desktop icon and/or quick launch icon) and specify registry entries.
With these steps you can customize the R graphical user interface.
You can choose if you want an R graphical user interface covering the whole screen (MDI) or a smaller window (SDI).
You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help file will be the same regardless of your choice; here you specify just the appearance of that particular window.
In the next step you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.
Update
Updating R on Windows requires several steps:
1. Downloading/installing the latest version of R
2. Copying your packages from the library folder of the old installation to the one in the new R installation
Both of these steps can easily be done using the installr [3] package, by running the following command (which both installs the package and updates R) [4]:

# installing/loading the package:
if(!require(installr)) {
install.packages("installr"); require(installr)} # load / install+load installr
updateR() # updates R

There is also the possibility of using a "global" package library; see here [5] for more details.
Portable R for Windows
There is a portable version of R that you can install on your USB stick [6]. This is useful if you don't have admin rights on a computer. The basic installation requires something like 115 MB but you may need more if you want to install add-on packages.
Working environment
Once you have installed R, you need to choose a working environment. In this section, we review all possible working environments. These include a basic terminal as well as integrated development environments (IDEs), text editors and graphical user interfaces (GUIs).
A graphical user interface provides menus which make it possible to run R without writing code. This is a good solution for beginners.
A text editor makes it easy to write code.
An integrated development environment provides a text editor connected to R, which makes it easy to write R scripts, to run them and to correct them.
Note that there are some task specific GUIs. For instance speedR provides a GUI to import data into R.
Terminal
R in a Terminal window on Linux.
For Linux and Mac OS users it is possible to use R from
the terminal.
$ R
> q("no") # to leave R and return to the terminal
R Gui
For Mac OS and Windows users, there is a graphical user interface. In Mac OS, the GUI includes a package
manager, a program editor with syntax highlighting and a data browser. In Windows, the GUI is not better than a
Terminal.
Graphical User Interface
This section includes material for beginners (e.g. people who are not familiar with computing).
Poor Man's GUI (pmg)
A simple GUI for learning R. It is recommended for beginners.
> install.packages("pmg", dependencies=TRUE)
# Windows users may also run the following scripts to install required libraries
> source("http://www.math.csi.cuny.edu/pmg/installpmg.R")
> library(pmg)
Jaguar : Java GUI for R
Jaguar: Java GUI for R [7] is available for Linux, Mac and Windows (screenshots [8]). It is good for beginners.
R commander
Rcommander [9], developed by John Fox, provides a menu in the standard graphical user interface (screenshots [10]). It works on Linux, Mac and Windows. It is a good interface for beginners and for people who are not used to script editing.
> install.packages("Rcmdr") # installation
> library("Rcmdr") # usage
Ubuntu users can also install R Commander from the software center.
Integrated development environment
RStudio
RStudio on Ubuntu 12.10.
RStudio [11] is an integrated development environment for R [12].
It works on Mac, Windows and Linux platforms.
It supports Sweave and LaTeX.
It includes syntax highlighting for R, LaTeX and Sweave.
It includes a way to view variables and dataframes.
It makes it easy to load and install packages, to navigate in the help files and to manage your workspace.
It supports code and file name completion.
It can be installed on a USB stick.
John Verzani has written a book dedicated to this new interface, Getting Started with RStudio [13], and Jeffrey Racine recommends RStudio for Sweave [14].
RKward
RKWard is an IDE and a GUI for Linux (KDE) (screenshots [15]). RKWard aims to provide an easily extensible, easy-to-use IDE/GUI for R. RKWard tries to combine the power of the R language with the (relative) ease of use of commercial statistics tools.
Eclipse with StatET
Eclipse with the StatET plugin [16] provides an IDE for R. It supports Sweave.
Rattle GUI
Rattle [17] for Linux, Windows and Mac (screenshots [18]) [19].
Tinn R
For Windows only
Tinn-R [20] is a good IDE for Windows users. One can easily define keyboard shortcuts to execute selected R code from Tinn-R.
Notepad++ and NpptoR
For Windows only.
Notepad++ [21] and NPPtoR [22] provide syntax highlighting and hotkeys (by default F8) to send lines of code to R. Syntax highlighting can be easily modified using the dialog box to manage user-defined languages (Menu/View/User Define Dialog...). NPPtoR provides a method to generate syntax highlighting dynamically (depending on all the available packages in the R environment).
Vi, Vim and GVim
Vim and GVim provide syntax highlighting.
Vim is for advanced users only.
The Vim-R-plugin [23] allows communication between Vim and R.
Emacs and ESS
Emacs with ESS (Emacs Speaks Statistics) [24].
Emacs is for advanced users only.
For Linux users, you just have to install Emacs and ESS using your standard package manager (Synaptic, aptitude, yum, etc.).
For Mac and Windows users, you can have a look at Vincent Goulet's page, which has binaries with Emacs and ESS [25].
For Mac users, Aquamacs Emacs is a good solution. It is an enhancement of the standard Emacs editor.
For Windows users, XEmacs is a good solution.
Once the installation of Emacs and ESS is done, you just have to open Emacs and open or create a file with
extension .R (C-x C-f). ESS will be automatically loaded.
C-c M-j evaluates the current line
C-c M-r evaluates the current region
C-c M-b evaluates the current buffer
See John Fox's webpage http://socserv.mcmaster.ca/jfox/Books/Companion/ESS/ess-xemacs.pdf to learn about Emacs and ESS.
WinEdt
How to use R for Windows with the RWinEdt extension? by Andy Eggers [26]
WinEdt is not open source.
WinEdt is for Windows only.
Install the RWinEdt package.
gedit with gedit-r-plugin
For Linux users only.
There is also a plugin for gedit called gedit-r-plugin. This can be installed using Synaptic or any other package manager on a Linux platform.
Customizing R
R profile
R can be customized using the Rprofile file. On Linux, this file is stored in the home directory. You can edit it by running the following command in a terminal:
$ gedit ~/.Rprofile
If you use some packages very often, you can load them systematically using the Rprofile file. You can also change
the default options.
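For instance, a minimal ~/.Rprofile might look like the following (the options and the package chosen here are only illustrations, not recommendations):

```r
# ~/.Rprofile -- executed at the start of every R session

# change some default options
options(digits = 4)                 # print 4 significant digits
options(show.signif.stars = FALSE)  # no significance stars in summaries

# .First() runs after the default packages are loaded,
# so it is a safe place to attach frequently used packages
.First <- function() {
  if (require("MASS", quietly = TRUE)) {
    message("MASS loaded")
  }
}
```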
Options
The function options() without any argument shows all options:
> options()
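Individual options can be queried or changed by name; options() returns the previous values, so they can be restored later:

```r
getOption("digits")         # query a single option (7 by default)
old <- options(digits = 3)  # change it, keeping the old value
print(pi)                   # now printed as 3.14
options(old)                # restore the previous setting
```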
The linguistic and encoding options can be modified using Sys.setlocale():
> Sys.setlocale()
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/en_US.UTF-8"
By default, error messages are in the local language. However, it is possible to set them to English using Sys.setenv():
Sys.setenv(LANGUAGE='en')
References
[1] This section was imported from the Wikiversity project Installation, How to use R course
[2] http://www.r-project.org/
[3] http://cran.r-project.org/web/packages/installr/
[4] Updating R from R (on Windows) using the {installr} package (http://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/)
[5] http://www.r-statistics.com/2010/04/changing-your-r-upgrading-strategy-and-the-r-code-to-do-it-on-windows/
[6] Portable R http://sourceforge.net/projects/rportable/
[7] http://jgr.markushelbig.org/JGR.html
[8] http://www.rforge.net/JGR/screenshots.html
[9] http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
[10] http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/Rcmdr-screenshot.html
[11] http://www.rstudio.com/
[12] rstudio.org (http://www.rstudio.org/)
[13] John Verzani "Getting Started with RStudio: An Integrated Development Environment for R", O'Reilly Media, September 2011
[14] Jeffrey Racine, (forthcoming), "RStudio: A Platform Independent IDE for R and Sweave," Journal of Applied Econometrics.
[15] http://sourceforge.net/apps/mediawiki/rkward/index.php?title=Screenshots
[16] StatET: http://www.walware.de/goto/statet
[17] Rattle: http://rattle.togaware.com/
[18] http://rattle.togaware.com/rattle-screenshots.html
[19] Graham J Williams. Rattle: A Data Mining GUI for R. The R Journal, 1(2):45-55, December 2009
[20] Tinn stands for Tinn Is Not Notepad http://www.sciviews.org/Tinn-R/
[21] Note that Notepad++ can be installed on a USB stick http://sourceforge.net/projects/notepadpluspe/
[22] NPPtoR is also portable software http://sourceforge.net/projects/npptor/
[23] http://www.vim.org/scripts/script.php?script_id=2628
[24] ESS: http://ess.r-project.org/
[25] Vincent Goulet's Emacs page http://vgoulet.act.ulaval.ca/emacs
[26] http://www.people.fas.harvard.edu/~aeggers/RWinEdt_installation.pdf
Documentation
Obtaining Help
For each package you have a reference manual available as an HTML file from within R or as a PDF on the CRAN
website. You also often have Vignettes or comprehensive articles in the R Journal, the Journal of Statistical
Software, etc.
>library(help="package_name")
>vignette("np",package="np")
>vignette(all=FALSE) # vignettes for all attached packages
>vignette(all=TRUE) # vignettes for all packages on the computer
You can search for help inside all loaded packages using help() or ?. Usually you do not need to add quotes to
function names, but sometimes it can be useful. args() gives the full syntax of a function.
> help(lm)
> ?lm
> ?"for"
> ?"[["
> args("lm")
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL
apropos() and find() look for all the functions in the loaded packages containing a keyword or a regular expression [1].
> apropos("norm")
[1] "dlnorm" "dnorm" "plnorm"
[4] "pnorm" "qlnorm" "qnorm"
[7] "qqnorm" "qqnorm.default" "rlnorm"
[10] "rnorm" "normalizePath"
You can search for help in all installed packages using help.search() or its shortcut ??.
> ??"lm"
> help.search("covariance")
RSiteSearch() looks for help in all packages and in the R mailing lists. The sos package improves the
RSiteSearch() function with the findFn() function. ??? is a wrapper for findFn().
> RSiteSearch("spline")
> library("sos")
> findFn("spline", maxPages = 2)
> ???"spline"(2)
hints() in the hints package suggests what to do with an object.
> fit <- lm(y ~ x)
> library("hints")
> hints(fit) # returns a list of functions using lm objects.
Handouts
An Introduction to R [2]
The R Reference Manual
Robert Kabacoff's Quick R [3]
Grant Farnsworth's Econometrics in R [4]: the best introduction for an economist (about 20 pages)
UCLA R Computing Resources [5]
A Handbook of Statistical Analyses Using R [6] by Brian S. Everitt and Torsten Hothorn
(fr+en) Arthur Charpentier's R for actuaries [7]
Dan Goldstein's video tutorial [8]
(fr) Julien Barnier's introduction to R for sociologists [9]
Rosetta Code [10] presents solutions to the same task in different programming languages.
'R language for programmers', by John Cook [11]
A Brief Guide to R for Beginners in Econometrics [12]
R Tutorial by Kelly Black [13]
Reference Sheet
List of Cheat Sheets [14]
Daniel Kaplan's Reference Sheet [15]
Teaching Resources
François Briatte has a nice introduction to data analysis using R [16]
Simon Jackman's Political Methodology Classes [17]
Jonathan Katz's Political Methodology Classes [18]
A Brief Guide to R for Beginners in Econometrics [12]
PRISM luncheons [19]
Statistical Analysis: an Introduction using R - which includes a course on R
Biostatistics with R, aka An R companion to Wayne Daniel's Biostatistics book
Blogs
Planet R [20]: the first R blog aggregator
R Bloggers [21]: the news pulse for the R blogosphere
"R" you Ready? [22]
One R Tip a Day [23]
Revolution computing blog [24]
Yu Sung Su's Blog: R [25]
(fr) Freakonometrics (in French) [26]: lots of code chunks
(fr) Baptiste Coulmont (in French) [27]
(fr) Quanti Sciences Sociales (in French) [28]: R blog for sociologists
Journals
The R Journal [29]
Journal of Statistical Software [30] contains lots of articles on R packages.
The Political Methodologist [31] contains lots of articles on R for political scientists.
Books
Venables and Ripley: Modern Applied Statistics with S [32]. A very good introduction to R covering numerous topics.
A Handbook of Statistical Analyses Using R [6] (Brian S. Everitt and Torsten Hothorn, Chapman & Hall/CRC, 2008)
An Introduction to Data Technologies, by Paul Murrell [33]. Everything you need to know about data management.
A First Course in Statistical Programming with R, by John Braun and Duncan Murdoch.
Peter Dalgaard (2009). ISwR: Introductory Statistics with R. R package version 2.0-4. http://CRAN.R-project.org/package=ISwR
Springer Use R! Series [34]
John Fox: An R and S-PLUS Companion to Applied Regression [35]
Gelman and Hill: Data Analysis Using Regression and Multilevel/Hierarchical Models [36]
useR! and other R conferences
useR! 2009 [37]
useR! 2010 [38]
London R homepage [39]
R/Finance [40] conferences in 2009 and 2010
Search Engine
Rseek [41]
Google Code Search with the keyword "lang:r" [42] gives access to R programs matching a request. For instance, the request optim lang:r [43] returns all the R programs containing optim.
Q&A / Forums
Nabble R: http://r.789695.n4.nabble.com/
Stackoverflow [44]
The #rstats hashtag [45] on Twitter
IRC: #r on freenode
r-soc [46]: a mailing list for French sociologists
References
[1] If you want to know more about regular expressions, have a look at the Regular expressions section in the Text Processing page.
[2] http://cran.r-project.org/doc/manuals/R-intro.html
[3] http://www.statmethods.net/
[4] http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
[5] http://www.ats.ucla.edu/stat/r/
[6] http://cran.r-project.org/web/packages/HSAUR2/index.html
[7] http://perso.univ-rennes1.fr/arthur.charpentier/slides-R.pdf
[8] http://www.decisionsciencenews.com/?p=261
[9] http://alea.fr.eu.org/j/intro_R.html
[10] http://rosettacode.org/wiki/Category:R
[11] http://www.johndcook.com/R_language_for_programmers.html
[12] http://people.su.se/~ma/R_intro/R_intro.pdf
[13] http://www.cyclismo.org/tutorial/R/
[14] http://devcheatsheet.com/tag/r/
[15] http://www.macalester.edu/~kaplan/ISM/r-commands.pdf
[16] Introduction to Data Analysis (http://f.briatte.org/teaching/ida/)
[17] http://jackman.stanford.edu/classes/index.php
[18] http://jkatz.caltech.edu/classes/ss228.html
[19] http://polisci.osu.edu/prism/luncheons.htm
[20] http://planetR.stderr.org
[21] http://www.r-bloggers.com/
[22] http://ryouready.wordpress.com/
[23] http://onertipaday.blogspot.com/
[24] http://blog.revolution-computing.com/
[25] http://yusung.blogspot.com/search/label/R
[26] http://freakonometrics.blog.free.fr/index.php
[27] http://coulmont.com/blog/tag/r/
[28] http://quanti.hypotheses.org/tag/r/
[29] http://journal.r-project.org/current.html
[30] http://www.jstatsoft.org/
[31] http://polmeth.wustl.edu/thepolmeth.php
[32] http://www.stats.ox.ac.uk/pub/MASS4/
[33] http://www.stat.auckland.ac.nz/~paul/ItDT/HTML/
[34] http://www.springer.com/series/6991
[35] http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/
[36] http://www.stat.columbia.edu/~gelman/arm/
[37] http://www.agrocampus-ouest.fr/math/useR-2009/
[38] http://www.useR2010.org
[39] http://www.londonr.org/
[40] http://www.RinFinance.com
[41] http://www.rseek.org/
[42] http://www.google.com/codesearch
[43] http://www.google.com/codesearch?q=optim+lang%3Ar&hl=en&btnG=Search+Code
[44] http://stackoverflow.com/questions/tagged/r
[45] http://hashtags.org/tag/rstats/messages
[46] http://alea.fr.eu.org/j/index.html#sec-2
Control Structures
Conditional execution
Help for programming:
> ?Control
if accepts a unidimensional condition. Note that at the console, else must follow the closing brace on the same line.
> if (condition){
+ statement
+ } else {
+ alternative
+ }
The unidimensional condition may be one of TRUE or FALSE, T or F, 1 or 0 or a statement using the truth
operators:
x == y "x is equal to y"
x != y "x is not equal to y"
x > y "x is greater than y"
x < y "x is less than y"
x <= y "x is less than or equal to y"
x >= y "x is greater than or equal to y"
You may combine conditions using the & or && operators for AND. | or || are the operators for OR.
> if(TRUE){
+ print("This is true")
+ }
[1] "This is true"
> x <- 2 # x gets the value 2
> if(x==3){
+ print("This is true")
+ } else {
+ print("This is false")
+ }
[1] "This is false"
> y <- 4 # y gets the value 4
> if(x==2 && y>2){
+ print("x equals 2 and y is greater than 2")
+ }
[1] "x equals 2 and y is greater than 2"
The ifelse() command takes as first argument the condition, as second argument the treatment if the condition
is true and as third argument the treatment if the condition is false. In that case, the condition can be a vector. For
instance we generate a sequence from 1 to 10 and we want to display the values which are lower than 5 or greater than
8.
> x <- 1:10
> ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10
Sets
R has some very useful handlers for sets to select a subset of a vector:
> x = runif(10)
> x<.5
[1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
> x
[1] 0.32664759 0.57826623 0.98171138 0.01718607 0.24564238 0.62190808 0.74839301
[8] 0.32957783 0.19302650 0.06013694
> x[x<.5]
[1] 0.32664759 0.01718607 0.24564238 0.32957783 0.19302650 0.06013694
to exclude a subset of a vector:
> x = 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[-1:-5]
[1] 6 7 8 9 10
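Beyond logical and negative-index subsetting, base R also provides proper set operations on vectors; a short illustration (the vectors a and b are made up for this example):

```r
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

union(a, b)      # elements in either vector, duplicates dropped: 1 2 3 4 5
intersect(a, b)  # elements present in both: 3 4
setdiff(a, b)    # elements of a that are not in b: 1 2
5 %in% a         # membership test: FALSE
```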
Loops
Implicit loops
[Figure: example of fast code using vectorisation]
R has support for implicit loops, which is called vectorization. This is built into many functions and standard operators. For example, the + operator can add two arrays of numbers without the need for an explicit loop. Explicit loops are generally slow, and it is better to avoid them when possible.
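The vectorisation point can be seen directly: arithmetic on vectors works element by element, with no explicit loop.

```r
x <- 1:5
y <- 6:10
x + y    # element-wise addition: 7 9 11 13 15
x * 2    # the scalar 2 is recycled: 2 4 6 8 10
sqrt(x)  # most basic functions are vectorised too
```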
apply() can apply a function to elements of a matrix or an array.
This may be the rows of a matrix (1) or the columns (2).
lapply() applies a function to each element of a list (for a dataframe, each column) and returns a list.
sapply() is similar but the output is simplified. It may be a vector or a matrix depending on the function.
tapply() applies the function for each level of a factor.
> N <- 10
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> male <- rbinom(N,1,.48)
> y <- 1 + x1 + x2 + male + rnorm(N)
> mydat <- data.frame(y,x1,x2,male)
> lapply(mydat,mean) # returns a list
$y
[1] 3.247
$x1
[1] 0.1415
$x2
[1] 1.29
$male
[1] 0.5
> sapply(mydat,mean) # returns a vector
y x1 x2 male
3.2468 0.1415 1.2900 0.5000
> apply(mydat,1,mean) # applies the function to each row
[1] 1.1654 2.8347 -0.9728 0.6512 -0.0696 3.9206 -0.2492 3.1060 2.0478 0.5116
> apply(mydat,2,mean) # applies the function to each column
y x1 x2 male
3.2468 0.1415 1.2900 0.5000
> tapply(mydat$y,mydat$male,mean) # applies the function to each level of the factor
0 1
1.040 5.454
See also aggregate() which is similar to tapply() but is applied to a dataframe instead of a vector.
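The aggregate() variant mentioned above can be sketched with a small made-up data frame (column names y and male mirror the earlier example):

```r
mydat <- data.frame(y = c(1, 2, 3, 4), male = c(0, 0, 1, 1))

# mean of y for each level of male, returned as a data frame
aggregate(y ~ male, data = mydat, FUN = mean)
#   male   y
# 1    0 1.5
# 2    1 3.5
```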
Explicit loops
R provides three ways to write loops: for, repeat and while. The for statement is the simplest. You just have to define an index (here k) and a vector (in the example below the vector is 1:5) and specify the action you want between braces.
> for (k in 1:5){
+ print(k)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
When it is not possible to use the for statement, you can use repeat with break, or while, by specifying a breaking rule. One should be careful with this kind of loop: if the breaking rule is misspecified, the loop will never end. In the two examples below, draws from the standard normal distribution are made as long as the value is lower than 1. The cat() function is used to display the current value on screen.
> repeat {
+ g <- rnorm(1)
+ if (g > 1.0) break
+ cat(g,"\n")
+ }
-1.214395
0.6393124
0.05505484
-1.217408
> g <- 0
> while (g < 1){
+ g <- rnorm(1)
+ cat(g,"\n")
+ }
-0.08111594
0.1732847
-0.2428368
0.3359238
-0.2080000
0.05458533
0.2627001
1.009195
Iterators
Loops in R are generally slow. Iterators may be more efficient than loops. See this entry on the Revolution Computing blog [1].
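The idea behind an iterator can be sketched in base R with a closure that yields one value per call; the iterators package provides a polished version of this pattern, and the function name make_iter below is made up for illustration:

```r
make_iter <- function(x) {
  i <- 0
  function() {                # each call advances the internal counter
    i <<- i + 1
    if (i > length(x)) NULL else x[i]
  }
}

nxt <- make_iter(c(10, 20, 30))
nxt()  # 10
nxt()  # 20
```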
References
[1] http://blog.revolution-computing.com/2009/07/counting-with-iterators.html
Working with functions
Looking at the code of a function
You can type the name of the function in the console without any round brackets after the name. This will print
the code of the function in the console.
You can also use the page() function which opens a new editor window and prints the code of the function in
this editor.
You can also use the trCopy() function in the TinnR package to copy the code of the function. Then you just
have to paste it in a text editor to have a look at it.
Here is an example with the lm() function.
> lm
> page(lm)
> library(TinnR)
> trCopy(lm)
Creating your own function
A simple function without arguments which doesn't return anything:
> fn <- function(){
+ print("hello")
+ }
> fn()
[1] "hello"
Returning an object
By default the value of the last line is returned. In the following example, we have a simple function with two
objects. The last one is returned.
> test <- function(){
+ x <-1
+ z <- 2
+ }
> res <- test()
> res
[1] 2
The function can return an object using return().
> test <- function(){
+ x <-1
+ z <- 2
+ return(x)
+ }
> res <- test()
> res
[1] 1
Adding arguments
It is possible to add arguments.
square <- function(x){
x2 <- x^2
return(x2)
}
square(x = 2)
The ... argument means that you can add other arguments which will be passed to functions inside the function.
plot2 <- function(x,...){
plot(x, type = "l", ...)
}
plot2(runif(100), main = "line plot", col = "red")
It is possible to add a dataframe as argument [1]. Here is an example:
redplot <- function(z, y, data, env=parent.frame()) {
if(!missing(data)){
z <- data[,deparse(substitute(z))]
y <- data[,deparse(substitute(y))]
}
plot(z,y, col = "red", pch = 15)
}
mydat <- data.frame(vm = rnorm(10),output = rnorm(10))
redplot(vm,output,data=mydat)
For estimation commands it is possible to add formulas as arguments. For instance, we can create our own function for ordinary least squares using a formula interface.
ols <- function(formula, data = list()){
mf <- model.frame(formula=formula, data=data)
X <- model.matrix(attr(mf, "terms"), data=mf)
y <- model.response(mf)
beta <- solve(t(X)%*%X)%*%t(X)%*%y
se <- sqrt( 1/(nrow(X) - ncol(X)) * sum((y - X%*%beta)^2) * diag(solve(t(X)%*%X)))
res <- cbind(beta,se)
colnames(res)<- c("Coefficients","Standard errors")
return(res)
}
N <- 100
u <- rnorm(N)
x <- rnorm(N) + 1
y <- 1 + x + u
ols(y~x)
Recursive functions
R supports recursive functions. The function below computes Fibonacci numbers recursively.
> fib <- function(n) if (n>2) c(fib(n-1),sum(tail(fib(n-1),2))) else if (n>=0) rep(1,n)
> fib(10)
[1] 1 1 2 3 5 8 13 21 34 55
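Note that the recursive version above computes fib(n-1) twice per call, which is wasteful for large n. An iterative sketch (the name fib_iter is made up) avoids the repeated work:

```r
fib_iter <- function(n) {
  out <- numeric(n)
  out[1:min(n, 2)] <- 1                  # the first two Fibonacci numbers are 1
  if (n > 2) {
    for (k in 3:n) out[k] <- out[k - 1] + out[k - 2]
  }
  out
}

fib_iter(10)  # 1 1 2 3 5 8 13 21 34 55
```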
Functions as Objects
R functions can be treated as objects
> a <- function(n) function(a) runif(a)
> b <- a(1)
> b(10)
[1] 0.8726873 0.9512367 0.5971435 0.5540743 0.6378967 0.4030071 0.2750673 0.1777123 0.6960378 0.3969920
This can be useful when wanting to make many different kinds of functions
> a <- list()
> b <- function(i){ i; function() runif(i)}
> for (i in 1:10) a[[i]] <- b(i)
> a[[1]]()
[1] 0.2617396
> a[[2]]()
[1] 0.8822248 0.3374574
> a[[3]]()
[1] 0.0348156 0.4212788 0.6107646
Higher-order functions
You can use higher-order functions in R. Contrary to common belief, using them instead of loops is not faster, because the apply functions have a for loop inside their definition. Use them only to improve the clarity of your code. (Reference: Patrick Burns, The R Inferno, p. 24)
apply
apply is the most basic of R's map functions. lapply, sapply and mapply are convenient relatives that work on lists, vectors and multiple vectors respectively.
apply takes as arguments an array, a vector of the dimension to map along and a function. The following example is
based on the apply documentation. It uses apply to compute column and row sums of a matrix.
x <- matrix(round(rnorm(100)),10,10)
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
tapply
tapply is similar to apply, but applies a function to each cell of a ragged array, that is to each (non-empty) group of
values given by a unique combination of the levels of certain factors.
> x1 <- rnorm(10)
> x2 <- sample(1:2, 10, replace = T)
> cbind(x1,x2)
x1 x2
[1,] -1.7905021 1
[2,] 1.2908169 2
[3,] -2.1902513 2
[4,] 0.4845488 1
[5,] 0.2281593 1
[6,] 0.2201302 1
[7,] 2.1574243 1
[8,] 0.5789705 2
[9,] 1.3315188 1
[10,] -1.0029822 2
> tapply(x1, x2, sum)
1 2
2.631279 -1.323446
Reduce
This example from the Reduce documentation cumulatively adds the elements of a vector:
> cadd <- function(x) Reduce("+", x, accumulate = TRUE)
> cadd(1:10)
[1] 1 3 6 10 15 21 28 36 45 55
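Base R ships Filter and Map in the same family of higher-order functions; a quick illustration:

```r
Filter(function(x) x %% 2 == 0, 1:10)  # keeps the even numbers: 2 4 6 8 10
Map("+", 1:3, 4:6)                     # element-wise over two vectors, returns a list: 5 7 9
```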
References
[1] http://r.789695.n4.nabble.com/Optional-data-argument-for-a-function-tp850247p850247.html
Debugging
Some basic tips
Use print() statements in your functions to print variable values. Although this technique is considered
low-tech or even old fashioned by some, it can still be a quick and easy way to trace an error.
Place a browser() statement in the function just before the crashing line. When the function is called, it will
be executed up to the browser() line. The command-line interface then switches to the function environment,
so that all variables in the function can be inspected or changed. See below for commands available in
browser() mode.
Tracing errors with traceback()
A standard error message in R will tell you which function threw the error. Consider as an example the following
function whose sole purpose is to throw an error.
myFun <- function(){
stop("Woops! An error")
}
A call to myFun() gives
> myFun()
Error in myFun() : Woops! An error
After an error is raised, the traceback() function allows you to show the call stack leading to the error. For
example, the function below calls myFun.
myFun2 <- function(){
myFun()
}
Calling myFun2() and traceback() gives
> myFun2()
Error in myFun() : Woops! An error
> traceback()
3: stop("Woops! An error")
2: myFun()
1: myFun2()
The traceback() function can be executed automatically each time an error is raised with the option
options(error=traceback)
It may be switched off again with
options(error=NULL)
Executing code line by line
A function can be executed line by line by setting it to debugging mode with debug(FUNCTION_NAME). Then, when the function is called, a browser in that function's environment is opened so that it can be executed line by line. In the debugging browser, apart from all standard R functionality, the following commands are available.
Command Meaning
n Advance to next step. An empty line also works.
c, cont Continue to the end of the current context, e.g. to the end of the current loop or to the end of the function.
where Print the stack of function calls (where are you?)
Q Exit the browser and return to the top-level R prompt.
Debugging can be switched off with
undebug(FUNCTION_NAME)
There are a few related functions as well:
debugonce() Debugs the function only on its next call, then switches debugging off.
isdebugged() Checks if a function is in debugging mode.
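The debugging flag can be set and checked without entering the interactive browser; for example (f is a throwaway function made up for the demonstration, and f itself is never called here so no browser opens):

```r
f <- function(x) x^2

debug(f)
isdebugged(f)  # TRUE: the next call to f() would open the browser
undebug(f)
isdebugged(f)  # FALSE: debugging is switched off again
```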
Browsing the call stack
This is the most advanced debugging option in R base. By setting options(error=recover) you get the
opportunity to browse any environment in the call stack. For example,
> options(error=recover)
> myFun2()
Error in myFun() : Woops! An error
Enter a frame number, or 0 to exit
1: myFun2()
2: myFun()
Selection:
By typing '1' or '2' at the Selection: prompt, the browser will jump to the selected environment. Once in the browser, all standard R functionality is at your disposal, as well as the commands in the table below.
Command Meaning
c, cont Exit the browser and continue at the next statement. An empty line will do the same.
n Enter the step-through debugger (this changes the meaning of c)
where Print a stack trace of active function calls (where are you in the stack?).
Q Exit the browser, do not continue at the next statement but go back to the top-level R browser.
Recovery mode can be switched off by
options(error=NULL)
Using C or Fortran
For some tasks, R can be slow. In that case, it is possible to write a program in C or Fortran and to use it from R.
This page is for advanced programmers only.
See Wikiversity: Connecting Fortran and R
Link C with R [1]
References
[1] http://yusung.blogspot.com/2008/08/link-c-with-r.html
Utilities
This page includes material about some utilities. Most of the functions presented here have nothing to do with
statistical analysis but may be useful when working on a project. Many functions are just similar to standard unix
functions.
System (Unix/DOS)
system() gives access to the system shell (DOS or Unix). The option wait=FALSE means that R does not wait until the task is finished.
Some examples:
You can convert an image from PS to PNG using the Unix convert utility. If you want to know more about this utility, open a Terminal application and type man convert (this should work on Mac OS and Linux).
You can open Stata and run a program.
You can run pdflatex from R and directly open the pdf in a pdf browser.
system("convert W:/toto.ps W:/toto.png") # converts toto.ps to toto.png
system("D:/Stata10/stata.exe do D:/pgm.do", wait = F) # opens Stata and runs pgm.do
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
system("open file.pdf") # opens the pdf
system("open M:/.../doc/*.pdf") # opens all the pdf in a directory
See also sys() in the Hmisc package, shell() and shell.exec().
File Handling
dir() lists all the files in a directory. It is similar to the Unix function ls. dir.create() creates a new
directory. It is similar to mkdir in Unix.
file.info() gives information about a file.
> file.info("taille.txt")
size isdir mode mtime ctime atime exe
taille.txt 444 FALSE 666 2009-06-26 12:25:44 2009-06-26 12:25:43 2009-06-26 12:25:43 no
Removing files with a specific pattern:
file.remove(dir(path="directoryname", pattern="*.log"))
file.edit() opens a file in the text editor.
file.show() opens a file in a new window.
tempfile() creates a temporary file.
getZip() in the Hmisc package.
Internet
browseURL() opens a URL in an internet browser. download.file() downloads a file from the internet.
> browseURL("http://en.wikibooks.org/wiki/R_Programming")
To see the default browser, use getOption()
getOption("browser")
We can change the default browser using the options() command. It is safer to store the old options beforehand.
oldoptions <- options() # save the options
options(browser = "D:/FramafoxPortable/FramafoxPortable.exe")
You can download a file from the internet using download.file(). Note that very often you don't need to
download a file from the internet and you can directly load it into R from the internet using standard functions. For
instance, if you want to read a text file from the internet, you can use read.table(), scan() or
readLines().
# For example, we download "http://en.wikibooks.org/wiki/R_Programming/Text_Processing" on our Desktop
download.file(url="http://en.wikibooks.org/wiki/R_Programming/Text_Processing",destfile= "~/Desktop/test_processing.html")
# You can also read it into R using readLines()
text <- readLines("http://en.wikibooks.org/wiki/R_Programming/Text_Processing")
See also RCurl
Computing time
If you perform computer-intensive tasks you may want to measure the computing time. Two functions are available: system.time() and proc.time(). Both return a vector of values; the first element is the user CPU time.
> system.time(x<-rnorm(10^6))
[1] 1.14 0.07 1.83 0.00 0.00
> debut <- proc.time()
> x <- rnorm(10^6)
> proc.time()-debut
[1] 1.66 0.10 10.32 0.00 0.00
Computing process
user.prompt() (Zelig) pauses the computation process (useful if you want to do a demo).
waitReturn() (cwhmisc) does the same job. Sys.sleep() stops the computation for a given number of seconds.
> user.prompt()
Press <return> to continue:
> Sys.sleep(5)
It is possible to stop the computing process if a logical condition is not true using stopifnot().
Miscellaneous
trCopy() (TinnR package) copies an object to the clipboard. It is useful if you want to copy a large object, for instance the code of a function, and paste it into a text editor.
> trCopy(lm)
[1] TRUE
sessionInfo() gives information on the current session (R version + loaded packages). This function may be useful for reproducible computing. getRversion() gives the current R version. R.version gives more details about the computer and R.Version() returns the same information as a list.
See Also
See the R.utils package
[1]
References
[1] Henrik Bengtsson (2009). R.utils: Various programming utilities. R package version 1.1.7. http://CRAN.R-project.org/package=R.utils
Estimation utilities
This page deals with methods which are available for most estimation commands. This can be useful for all kind of
regression models.
Formulas
Most estimation commands use a formula interface. The outcome is left of the ~ and the covariates are on the right.
y ~ x1 + x2
It is easy to include a multinomial variable as a predictor in a model. If the variable is not already a factor, one just needs to use the as.factor() function. This will create a set of dummy variables.
y ~ as.factor(x)
For instance, we can use the Star data in the Ecdat package :
library("Ecdat")
data(Star)
summary(lm(tmathssk ~ as.factor(classk), data = Star))
I() takes arguments "as is". For instance, if you want to include a modified variable in your equation, such as a squared term or the sum of two variables, you may use I().
lm(y ~ x1 + I(x1^2) + x2)
lm(y ~ I(x1 + x2))
lm(I(y-100) ~ I(x1-100) + I(x2 - 100))
It is easy to include interaction between variables by using : or *. : adds all interaction terms whereas * adds
interaction terms and individual terms.
lm(y~x1:x2) # interaction term only
lm(y~x1*x2) # interaction and individual terms
It is also possible to generate polynomials using the poly() function with option raw = TRUE.
lm(y ~ poly(x, degree = 3, raw = TRUE))
There is also an advanced formula interface which is useful for instrumental variables models and mixed models. For
instance ivreg() (AER) uses this advanced formulas interface. The instrumental variables are entered after the |.
See the Instrumental Variables section if you want to learn more.
library("AER")
ivreg(y ~ x | z)
Output
In addition to the summary() and print() functions which display the output for most estimation commands,
some authors have developed simplified output functions. One of them is the display() function in the arm
package. Another one is the coefplot() in the arm package which displays the coefficients with confidence
intervals in a plot. According to the standards defined by Nathaniel Beck
[1]
, Jeff Gill developped
graph.summary()
[2]
. This command does not show useless auxiliary statistics.
R code Output
source("http://artsci.wustl.edu/~jgill/Models/graph.summary.R")
N <- 1000
u <- rnorm(N)
x1 <- 1 + rnorm(N)
x2 <- 1 + rnorm(N) + x1
y <- 1 + x1 + x2 + u
graph.summary(lm(y ~ x1 + x2))
Family: gaussian
Link function: identity
Coef Std.Err. 0.95 Lower 0.95 Upper CIs:ZE+RO
(Intercept) 0.980 0.056 0.871 1.089 |o|
x1 1.040 0.043 0.955 1.125 |o|
x2 0.984 0.031 0.923 1.045 |o|
N: 1000 Estimate of Sigma: 0.998
library("arm")
display(lm(y ~ x1 + x2))
lm(formula = y ~ x1 + x2)
coef.est coef.se
(Intercept) 0.89 0.05
x1 1.05 0.04
x2 1.02 0.03
---
n = 1000, k = 3
residual sd = 0.96, R-Squared = 0.86
Delta Method
If you want to know the standard error of a transformation of one of your parameters, you need to use the delta method. It is implemented by:
deltamethod() in the msm package [3].
delta.method() in the alr3 package.
deltaMethod() in the car package.
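The computation behind these helpers can be sketched by hand in base R: for a smooth function g of the estimates, Var(g(b)) is approximated by grad' V grad, where V is the covariance matrix of the coefficients. The quantity of interest below (a ratio of coefficients) is made up for illustration:

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

b <- coef(fit)   # estimated coefficients (intercept, slope)
V <- vcov(fit)   # their estimated covariance matrix

# g(b) = slope / intercept; gradient of g with respect to (b0, b1)
g    <- b[2] / b[1]
grad <- c(-b[2] / b[1]^2, 1 / b[1])

se_g <- sqrt(t(grad) %*% V %*% grad)  # delta-method standard error of g(b)
```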
Zelig: the pseudo-bootstrap method
Zelig [4] is a post-estimation package which simulates from the distribution of the estimated parameters and computes the quantities of interest such as marginal effects or predicted probabilities. This is especially useful for non-linear models. Zelig comes with a set of vignettes which explain how to deal with each kind of model. There are three commands.
zelig() estimates the model and draws from the distribution of estimated parameters.
setx() fixes the values of explanatory variables.
sim() computes the quantities of interest.
References
[1] Nathaniel Beck "Making regression and related output more helpful to users" The Political Methodologist 2010 http://politics.as.nyu.edu/docs/IO/2576/beck_tpm_edited.pdf
[2] Jeff Gill graph.summary() http://artsci.wustl.edu/~jgill/Models/graph.summary.s
[3] See the example on the UCLA Statistics webpage: http://www.ats.ucla.edu/stat/r/faq/deltamethod.htm
[4] Kosuke Imai, Gary King and Olivia Lau (2009). Zelig: Everyone's Statistical Software. R package version 3.4-5. http://CRAN.R-project.org/package=Zelig
Packages
An R package includes a set of functions and datasets. Packages are often developed as supplementary material to
books. For instance the MASS package was developed by Venables and Ripley for their book Modern Applied
Statistics with S and the car package was developed by John Fox for his book An R and S plus Companion to
Applied Regression.
Load a package
A package is loaded into the current R environment using the library() function. A list of functions and
datasets included in a package can be obtained by using the h or help argument of the library function.
library("stats4") # loads the package "stats4"
library(h=stats4) # gives help for all functions
data(package="stats4") # gives the list of all available datasets
A package can be detached from the current environment by using the detach() function :
> detach("package:prettyR")
Without any arguments the library() function lists all of the packages currently available to the user. env() (gdata) describes all loaded environments (i.e. packages). search() gives the list of all loaded packages.
> library() # returns the description of all the packages available on the computer
> dir(.libPaths()) # returns the name of all the packages available on the computer (quicker than the previous one)
> search()
> env(unit="MB")
current.packages() (Zelig) shows all the required and suggested packages.
> current.packages("sem")
Where are my packages stored?
The .libPaths() function without arguments prints the library directories
The .libPaths() function with a directory as argument defines a new directory where to store new libraries.
> .libPaths()
[1] "/Users/username/Library/R/library"
[2] "/Library/Frameworks/R.framework/Resources/library"
> .libPaths("W:/AppData/R/library")
Install new packages
The exhaustive list of all available packages is on the CRAN website [2].
To install a new package, you generally just have to specify the name of the package as an argument of
install.packages() function.
Sometimes you need to specify more options. For instance, this is the case if you are not an administrator of your
computer.
"lib" specifies the directory where you want to store the package
"repos" specifies a list of repositories. Note that you can specify a vector of repositories.
"dep=T" specifies that all the required packages are also downloaded and installed.
> install.packages("faraway")
> install.packages("rgrs", lib="W:/AppData/R/library" ,
repos=c("http://r-forge.r-project.org","http://cran.fr.r-project.org/"),
dep=TRUE)
Stay up to date
If you want to be aware of the latest packages, type new.packages() in R or visit the Revolution Computing Blog [24], which gives each month a list of the new and the updated packages.
> new.packages() # displays all the packages available in the repositories
> update.packages() # updates all the packages installed with the newest version available in the repositories
We can also install bundles of packages using install.views() or update.views() (ctv).
> install.packages("ctv")
> library("ctv")
> install.views("Econometrics")
> update.views("Econometrics")
We can also remove packages with remove.packages().
Building R Packages
You can write your own R packages. But all packages submitted to CRAN (or Bioconductor) must follow specific guidelines, including the folder structure of the package and other files like DESCRIPTION, NAMESPACE and so on.
See Friedrich Leisch's introduction (PDF [1], 20 pages) [2]
See also Duncan Murdoch's tools for building packages using Windows [3]
References
[1] http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
[2] Friedrich Leisch Creating R Packages: A Tutorial http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
[3] http://www.r-project.org/conferences/useR-2008/slides/Murdoch.pdf
Data types
Vectors are the simplest R objects, an ordered list of primitive R objects of a given type (e.g. real numbers, strings,
logicals). Vectors are indexed by integers starting at 1. Factors are similar to vectors but where each element is
categorical, i.e. one of a fixed number of possibilities (or levels). A matrix is like a vector but with a specific
instruction for the layout such that it looks like a matrix, i.e. the elements are indexed by two integers, each starting
at 1. Arrays are similar to matrices but can have more than 2 dimensions. A list is similar to a vector, but the
elements need not all be of the same type. The elements of a list can be indexed either by integers or by named
strings, i.e. an R list can be used to implement what is known in other languages as an "associative array", "hash
table", "map" or "dictionary". A dataframe is like a matrix but does not assume that all columns have the same type.
A dataframe is a list of variables/vectors of the same length. An object's class defines how the object behaves and is attached to the object as an attribute. All R objects have a class, a type and a dimension.
> class(object)
> typeof(object)
> dim(object)
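For instance, for a small matrix (note that in recent R versions a matrix also has the class "array"):

```r
m <- matrix(1:6, nrow = 2)
class(m)   # "matrix" "array" (just "matrix" in R versions before 4.0)
typeof(m)  # "integer"
dim(m)     # 2 3
```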
Vectors
You can create a vector using the c() function which concatenates some elements. You can create a sequence
using the : symbol or the seq() function. For instance 1:5 gives all the number between 1 and 5. The seq()
function lets you specify the interval between the successive numbers. You can also repeat a pattern using the rep() function. You can also create a numeric vector of zeros using numeric(), a character vector of empty strings using character() and a logical vector of FALSE values using logical().
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> c("a","b","c","d","e")
[1] "a" "b" "c" "d" "e"
> c(T,F,T,F)
[1] TRUE FALSE TRUE FALSE
> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5,by=.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> rep(1,5)
[1] 1 1 1 1 1
> rep(1:2,5)
[1] 1 2 1 2 1 2 1 2 1 2
> numeric(5)
[1] 0 0 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(5)
[1] "" "" "" "" ""
The length() function computes the length of a vector. last() (sfsmisc [1]) returns the last element of a vector, but this can also be achieved simply without the need for an extra package.
x <- seq(1,5,by=.5) # Create a sequence of number
x # Display this object
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> length(x) # Get length of object x
[1] 9
> library(sfsmisc)
> last(x) # Select the last element of x
[1] 5.0
> x[length(x)] # Select the last element without an extra package.
[1] 5.0
Factors
factor() transforms a vector into a factor. A factor can also be ordered with the option ordered=T or the
function ordered(). levels() returns the levels of a factor. gl() generates factors: n is the number of
levels, k the number of repetitions of each level and length the total length of the result. labels is optional
and gives a label to each level.
Factors can be most easily thought of as categorical variables. An important function for working with factors is
table(), which offers a type of summary. When considering the types of statistical data (nominal,
ordinal, interval and ratio), factors can be nominal, ordinal or interval. Nominal factors are categorical names,
for example country names paired with some other information. An example of an ordinal factor would be a set of
race times for a particular athlete paired with the athlete's finishing place (first, second, ...); the ordered()
example below shows how to order the levels of such a factor yourself. Finally, an example of interval-level factors
would be age brackets such as "20 - 29", "30 - 39", etc. In general, R can automatically order numbers stored as
factors appropriately, but a programmer may use the same techniques with this type of data to order it in the manner
most appropriate to their application.
See also is.factor(), as.factor(), is.ordered() and as.ordered().
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes no yes maybe maybe no maybe no no
Levels: maybe no yes
>
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"), ordered = T)
[1] yes no yes maybe maybe no maybe no no
Levels: maybe < no < yes
>
> ordered(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes no yes maybe maybe no maybe no no
Levels: maybe < no < yes
>
> ordered(as.factor(c("First","Third","Second","Fifth","First","First","Third")),
+ levels = c("First","Second","Third","Fourth","Fifth"))
[1] First Third Second Fifth First First Third
Levels: First < Second < Third < Fourth < Fifth
>
> gl(n=2, k=2, length=10, labels = c("Male", "Female")) # generate factor levels
[1] Male Male Female Female Male Male Female Female Male Male
Levels: Male Female
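As noted above, table() gives a quick count of each level of a factor. A minimal sketch (the answers vector is a made-up example):

```r
# A small factor of categorical answers
answers <- factor(c("yes", "no", "yes", "maybe", "no", "no"))

# table() counts the occurrences of each level
# (levels are sorted alphabetically by default)
table(answers)
# answers
# maybe    no   yes
#     1     3     2
```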
Matrix
If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of
data, the number of rows and/or columns and finally you can specify if you want R to read your vector by row or
by column (the default option). If the data vector is shorter than the matrix, it is recycled, as in the second
example below where 1:15 is reused to fill a 5 by 5 matrix.
> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
[,1] [,2] [,3] [,4] [,5]
[1,] NA NA NA NA NA
[2,] NA NA NA NA NA
[3,] NA NA NA NA NA
[4,] NA NA NA NA NA
[5,] NA NA NA NA NA
> matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 1 2 3 4 5
[5,] 6 7 8 9 10
Functions cbind() and rbind() combine vectors into matrices in a column by column or row by row mode:
> v1 <- 1:5
> v2 <- 5:1
> v2
[1] 5 4 3 2 1
> cbind(v1,v2)
v1 v2
[1,] 1 5
[2,] 2 4
[3,] 3 3
[4,] 4 2
[5,] 5 1
> rbind(v1,v2)
[,1] [,2] [,3] [,4] [,5]
v1 1 2 3 4 5
v2 5 4 3 2 1
The dimension of a matrix can be obtained using the dim() function. Alternatively, nrow() and ncol()
return the number of rows and columns of a matrix:
> X<-matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
> dim(X)
[1] 5 5
> nrow(X)
[1] 5
> ncol(X)
[1] 5
Function t() transposes a matrix:
> t(X)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 1 6
[2,] 2 7 12 2 7
[3,] 3 8 13 3 8
[4,] 4 9 14 4 9
[5,] 5 10 15 5 10
Unlike data frames, all elements of a matrix must be of the same type, e.g. all numeric or all character. Adding a character row to a numeric matrix coerces every element to character:
> a=matrix(2,2,2)
> a
[,1] [,2]
[1,] 2 2
[2,] 2 2
> a = rbind(a,c("A","A"))
> a
[,1] [,2]
[1,] "2" "2"
[2,] "2" "2"
[3,] "A" "A"
Arrays
An array stores elements of the same type along n dimensions. An array of
one dimension containing one element may be constructed as follows.
> x = array(c(T,F),dim=c(1))
> print(x)
[1] TRUE
The array x was created with a single dimension (dim=c(1)), so only the first element of the data vector c(T,F) is
used. A similar array, y, can be created with a single dimension and two values.
> y = array(c(T,F),dim=c(2))
> print(y)
[1] TRUE FALSE
A three dimensional array - 3 by 3 by 3 - may be created as follows.
> z = array(1:27,dim=c(3,3,3))
> dim(z)
[1] 3 3 3
> print(z)
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
, , 3
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
R arrays are accessed in a manner similar to arrays in other languages: by integer index, starting at 1 (not 0). The
following code selects the third slice along the third dimension of the 3 by 3 by 3 array. The result is a 3
by 3 matrix.
> z[,,3]
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
Specifying two of the three indices returns a one-dimensional array (a vector).
> z[,3,3]
[1] 25 26 27
Specifying all three indices returns a single element of the 3 by 3 by 3 array.
> z[3,3,3]
[1] 27
More complex partitions of an array may also be selected.
> z[,c(2,3),c(2,3)]
, , 1
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18
, , 2
[,1] [,2]
[1,] 22 25
[2,] 23 26
[3,] 24 27
Arrays need not have the same extent in every dimension. The following code creates a 3 by 3 by 2 array, i.e. a pair of 3 by 3 slices.
> w = array(1:18,dim=c(3,3,2))
> print(w)
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
All elements of an array must be of the same type, but they need not be numbers.
> u = array(c(T,F),dim=c(3,3,2))
> print(u)
, , 1
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] FALSE TRUE FALSE
[3,] TRUE FALSE TRUE
, , 2
[,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE FALSE
Lists
A list is a collection of R objects. list() creates a list. unlist() transforms a list into a vector. The objects in a
list do not have to be of the same type or length.
> x <- c(1:4)
> y <- FALSE
> z <- matrix(c(1:4),nrow=2,ncol=2)
> myList <- list(x,y,z)
> myList
[[1]]
[1] 1 2 3 4
[[2]]
[1] FALSE
[[3]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
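unlist(), mentioned above, flattens a list into a single vector, coercing all elements to a common type. A minimal sketch reusing the myList object from the example:

```r
x <- c(1:4)
y <- FALSE
z <- matrix(c(1:4), nrow = 2, ncol = 2)
myList <- list(x, y, z)

# unlist() flattens the list into one vector; the logical FALSE
# is coerced to 0 so that all elements share the integer type
unlist(myList)
# [1] 1 2 3 4 0 1 2 3 4
```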
Lists have very flexible referencing methods. By index number:
> a = list()
> a
list()
> a[[1]] = "A"
> a
[[1]]
[1] "A"
> a[[2]]="B"
> a
[[1]]
[1] "A"
[[2]]
[1] "B"
By name:
> a
list()
> a$fruit = "Apple"
> a
$fruit
[1] "Apple"
> a$color = "green"
> a
$fruit
[1] "Apple"
$color
[1] "green"
Indexing can also be recursive, and both styles can be combined:
> a = list()
> a[[1]] = "house"
> a$park = "green's park"
> a
[[1]]
[1] "house"
$park
[1] "green's park"
> a$park = "green's park"
> a[[1]]$address = "1 main st."
> a
[[1]]
[[1]][[1]]
[1] "house"
[[1]]$address
[1] "1 main st."
$park
[1] "green's park"
Using R's scoping rules, one can also dynamically name and create list elements:
> a = list()
> n = 1:10
> fruit = paste("number of coconuts in bin",n)
> my.number = paste("I have",10:1,"coconuts")
> for (i in 1:10)a[fruit[i]] = my.number[i]
> a$'number of coconuts in bin 7'
[1] "I have 4 coconuts"
Data Frames
A dataframe has been referred to as "a list of variables/vectors of the same length". In the following example, a
dataframe of two vectors is created, each of five elements. The first vector, v1, is composed of the
integers 1 through 5. The second vector, v2, is composed of five logical values (T and F). The dataframe
is then created from the vectors. The columns of the data frame can be accessed using integer subscripts or
the column name and the $ symbol.
> v1 = 1:5
> v2 = c(T,T,F,F,T)
> df = data.frame(v1,v2)
> print(df)
v1 v2
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
5 5 TRUE
> df[,1]
[1] 1 2 3 4 5
> df$v2
[1] TRUE TRUE FALSE FALSE TRUE
The dataframe may also be created directly, naming each vector
composing the dataframe as part of the argument list.
> df = data.frame(foo=1:5,bar=c(T,T,F,F,T))
> print(df)
foo bar
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
5 5 TRUE
External links
data.frame objects in R [2] (a sample chapter from the R in Action book)
Aggregation and Restructuring of data.frame objects [3] (a sample chapter from the R in Action book)
References
[1] http://cran.r-project.org/web/packages/sfsmisc/index.html
[2] http://www.r-statistics.com/2011/12/data-frame-objects-in-r-via-r-in-action/
[3] http://www.r-bloggers.com/aggregation-and-restructuring-data-from-%E2%80%9Cr-in-action%E2%80%9D/
Working with data frames
In this section, we deal with methods to read, manage and clean up a data frame.
In R, a dataframe is a list of vectors of the same length. They don't have to be of the same type. For instance, you
can combine a logical, a character and a numeric vector in one dataframe.
Reading and saving data
If data are already in an R format (.Rda or .Rdata), you can load them in memory using load(). You can save
data to the R format using save().
load("mydata.Rda")
save(mydata,file="mydata.Rda")
Example Datasets
Most packages include example datasets to test the functions.
The data() function without argument gives the list of all example datasets in all the loaded packages.
If you want to load them in memory, you just need to use the data function and include the name of the dataset as
an argument.
str_data() (sfsmisc) gives the structure of all datasets in a package.
> data() # lists all the datasets in all the packages in memory
> data(package="datasets") # lists all the datasets in the "datasets" package
> data(Orange) # loads the orange dataset in memory
> ?Orange # Help for the "Orange" Datasets
> str_data("datasets") # gives the structure of all the datasets in the datasets package.
Some packages include lots of datasets.
The datasets package.
The AER package [1] includes replication datasets for some important textbooks in econometrics.
The Ecdat package [2] includes a replication archive for the Journal of Applied Econometrics, the Journal of
Business and Economic Statistics, etc.
Building your own data frames
You can create a dataframe using vectors.
N <- 100
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + u
mydat <- data.frame(y,x1,x2)
R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.
mydat <- edit(data.frame())
Read table from the clipboard :
> mydat <- read.table("clipboard")
You can also read space delimited tables in your code using gsource() (Zelig). Here is an example with Yule's
1899 data [3].
mydat <- gsource(var.names = "id union pauperism out old pop",
variables = "
1 Kensington 27 5 104 136
2 Paddington 47 12 115 111
3 Fulham 31 21 85 174
")
Describing a data frame
str() gives a very brief description of the data.
names() gives the name of each variable.
summary() gives some basic summary statistics for each variable.
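A minimal sketch of these three functions on a small made-up data frame:

```r
df <- data.frame(age = c(21, 35, 42), employed = c(TRUE, FALSE, TRUE))

names(df)    # variable names
# [1] "age"      "employed"

str(df)      # compact overview: number of observations and type of each column
summary(df)  # basic summary statistics for each column
```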
Browsing data
You can browse your data in a spreadsheet-like window using View(). Depending on your operating system, this
option is not always available and the result is not always the same.
You can print the first lines using head() and the last lines using tail().
View(mydata)
head(mydata, n = 20) # n = 20 means that the first 20 lines are printed in the R console
RStudio has a nice data browser (View(mydata)).
RKWard also has a nice data browser.
Paul Murrell is currently developing the rdataviewer package (pdf [4]).
Attaching data
One of the big advantages of R over Stata is that you can deal with multiple datasets at the same time. You just need
to specify the name of the dataset and a "$" symbol before each variable name (for instance mydat1$var1 and
mydat2$var1). If you only work with one dataset and you don't want to repeat the name of the
dataset as a prefix for each variable, you can use attach().
mydata$var1
attach(mydata)
var1
detach(mydata)
Detecting duplicates
When you want to clean up a data set, it is very often useful to check that you don't have the same information twice
in the data. R provides some functions to detect duplicates.
duplicated() looks for duplicated elements and returns a logical vector. You can use table() to summarize
this vector.
Duplicated() (sfsmisc) generalizes this command: it only marks unique values with NA.
remove.dup.rows() (cwhmisc) removes duplicated rows.
unique() keeps only the unique rows of a dataset.
library("Zelig")
mydat <- gsource(
variables = "
1 1 1 1
1 1 1 1
1 2 3 4
1 2 3 4
1 2 2 2
1 2 3 2")
unique(mydat) # keep unique rows
library(cwhmisc)
remove.dup.rows(mydat) # similar to unique()
table(duplicated(mydat)) # table duplicated lines
mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
Creating and removing variables
To create a new variable :
mydata$newvar <- oldvar
If you want to delete a variable in a dataset, you can assign NULL to it :
# Delete the x variable in the df data frame.
df$x <- NULL
Renaming variables
It is possible to rename variables by redefining the vector of names of a data frame.
There is also a rename() function in the reshape package.
df <- data.frame(x = 1:10, y = 21:30)
names(df)
names(df) <- c("toto","tata")
names(df)
names(df)[2] <- "titi"
names(df)
Creating a subset of the data
One can subset the data using subset(). The first argument is the name of the dataset, the second argument is a
logical condition which says which lines will be included in the new dataset and the last argument is the list of
variables which will be included in the new dataset.
In the following example, we generate a fake dataset and we use the subset() command to select the lines and
columns of interest. We choose the lines such that x1 > 0 and x2 < 0 and we only keep x1 and x2 as variables.
N <- 100
x1 <- rnorm(N)
x2 <- 1 + rnorm(N) + x1
x3 <- rnorm(N) + x2
mydat <- data.frame(x1,x2,x3)
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.
It is also possible to reorder the columns using the select option.
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x2,x1))
Sorting and ordering
order() returns the permutation of row indices which sorts its arguments.
mydat[order(mydat$var1, mydat$var2),]
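A short worked example of ordering a data frame (the df object and its columns are made up for illustration):

```r
df <- data.frame(var1 = c(2, 1, 2), var2 = c(30, 20, 10))

# order() returns the permutation of row indices that sorts
# by var1 first, breaking ties with var2
df[order(df$var1, df$var2), ]
#   var1 var2
# 2    1   20
# 3    2   10
# 1    2   30
```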
Suppose you want to randomize the order of the rows in a data set. You just need to generate a vector from a
uniform distribution and sort by that vector.
df[order(runif(nrow(df))),]
Detecting missing values
is.na() returns TRUE for each missing element; applied to a data frame, it returns a logical matrix flagging the
missing cells.
complete.cases() returns a logical vector indicating TRUE for rows which are complete and FALSE otherwise.
> table(complete.cases(df))
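A short sketch on a made-up data frame:

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(df)             # logical matrix flagging each missing cell
complete.cases(df)    # TRUE only for rows with no missing value
# [1]  TRUE FALSE FALSE
table(complete.cases(df))
# FALSE  TRUE
#     2     1
```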
Reshaping a dataframe
This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation
per unit and a variable for each time period or in a long format with one observation per unit and time period.
reshape() reshapes a dataset in a wide or long format.
> country <- c("Angola","UK","France")
> gdp.1960 <- c(1,2,3)
> gdp.1970 <- c(2,4,6)
> mydat <- data.frame(country,gdp.1960,gdp.1970)
> mydat # wide format
country gdp.1960 gdp.1970
1 Angola 1 2
2 UK 2 4
3 France 3 6
> reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format
country time gdp id
1.1 Angola 1 1 1
2.1 UK 1 2 2
3.1 France 1 3 3
1.2 Angola 2 2 1
2.2 UK 2 4 2
3.2 France 2 6 3
varying gives the numbers of the columns which are time-varying
v.names gives the prefix of the time-varying variables
direction gives the direction, either "long" or "wide".
See also :
reShape() (Hmisc)
Hadley Wickham's reshape package [5]
Duncan Murdoch's tables package [6]
External links
Printing nested tables in R: bridging between the {reshape} and {tables} packages [7]
Expanding a dataset
Sometimes we need to duplicate some lines in a dataset, for instance to generate a fake dataset with a
panel data structure. In that case, we would first generate time-invariant variables and then duplicate each line a
given number of times in order to create time-varying variables.
It is possible to use the expand() function in the epicalc package, which replicates each line a given number of times.
N <- 1000
T <- 5
wide <- data.frame(id = 1:N,f = rnorm(N), rep = T)
library("epicalc")
long <- expand(wide,index.var = "rep")
long$time <- rep(1:T,N)
We can also use a do-it-yourself solution and create our own function. The idea is simple. We create a vector which
gives for each line the number of times it should be replicated (dups in the following example). Then we use the
rep() function to create a vector which repeats the line numbers according to the desired pattern. The last step
creates a new dataset which repeats lines according to that pattern.
expand <- function(df, dups){
  df$dups <- dups
  pattern <- rep(1:nrow(df), times = df$dups)
  df2 <- df[pattern,]
  index <- function(x){
    1:length(x)
  }
  df2$year <- unlist(tapply(df2$dups, df2$id, index))
  df2$dups <- NULL
  return(df2)
}
df <- data.frame(x = rnorm(3), id = 1:3)
dups <- c(3,1,2)
expand(df, dups)
Merging dataframes
Merging data can be very confusing, especially in the case of multiple merges. Here is a simple example :
We have one table describing authors :
> authors <- data.frame(
+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> authors
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
and one table describing books
> books <- data.frame(
+ name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
> books
name title other.author
1 Tukey Exploratory Data Analysis <NA>
2 Venables Modern Applied Statistics ... Ripley
3 Tierney LISP-STAT <NA>
4 Ripley Spatial Statistics <NA>
5 Ripley Stochastic Simulation <NA>
6 McNeil Interactive Data Analysis <NA>
7 R Core An Introduction to R Venables & Smith
We want to merge the books and authors tables by author's name ("surname" in the first dataset and "name" in the
second one). We use the merge() command. We specify the names of the first and second datasets, then by.x and
by.y specify the identifiers in both datasets. all.x and all.y specify whether we want to keep all the observations of
the first and the second dataset. In this case we want to keep all the observations from the books dataset but only
the observations from the authors dataset which match an observation in the books dataset.
> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F)
> final
name title other.author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics ... Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
7 R Core An Introduction to R Venables & Smith <NA> <NA>
It is also possible to merge two data.frame objects while preserving the row order of one of the two merged
objects [8].
Resources
R Data Manual [9][10].
Paul Murrell's Introduction to Data Technologies [11].
References
[1] The AER Package http://cran.r-project.org/web/packages/AER/index.html
[2] The Ecdat Package http://cran.r-project.org/web/packages/Ecdat/index.html
[3] "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - G. U. Yule -
Journal of the Royal Statistical Society, June 1899, p. 283
[4] http://www.stat.auckland.ac.nz/~paul/Talks/viewer.pdf
[5] Reshaping Data with the reshape Package: http://www.jstatsoft.org/v21/i12
[6] Vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
[7] http://www.r-statistics.com/2012/01/printing-nested-tables-in-r-bridging-between-the-reshape-and-tables-packages/
[8] Merging data frames while preserving the rows order (http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/)
[9] http://cran.r-project.org/doc/manuals/R-data.html
[10] R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
[11] Paul Murrell's Introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/
Importing and exporting data
Data can be stored in a large variety of formats. Each statistical package has its own format (xls for
Microsoft Excel, dta for Stata, sas7bdat for SAS, ...). R can read almost all file formats. We present a method for
each kind of file. If none of the following methods works, you can use specific data-conversion software such as
the free software OpenRefine or the commercial software Stat Transfer [1]. In any case, most statistical software can
export data in the CSV (comma-separated values) format and all of them can read CSV data. This is often the best
solution to make data available to everyone.
Graphical user interfaces
Some IDEs and GUIs provide push-button solutions to import data.
You may also have a look at speedR, a graphical user interface which helps with importing data from Excel,
OpenOffice Calc, CSV and other text files [2].
library(speedR)
speedR()
CSV (csv,txt,dat)
You can import data from a text file (often CSV) using read.table(), read.csv() or read.csv2(). The
option header = TRUE indicates that the first line of the CSV file should be interpreted as variable names and
the sep option gives the field separator (generally "," or ";").
csv.get() (Hmisc) is another possibility.
mydata <- read.table("data.txt",header=TRUE)
mydata <- read.table("data.csv", header = TRUE, sep=",") # import from a CSV
mydata <- read.csv("data.csv", header=T)
mydata <- read.table("data.csv", header = TRUE, sep=";")
mydata <- read.csv2("data.csv", header=T)
Note that there is no problem if your data are stored on the internet.
df <- read.table("http://www.mywebsite.com/.../data.csv", header = TRUE, sep = ",")
By default, strings are converted to factors. If you want to avoid this conversion, you can specify the option
stringsAsFactors = FALSE.
You can export data to a text file using write.table().
write.table(mydat,file="mydat.csv",quote=T,append=F,sep=",",eol = "\n", na = "NA", dec = ".", row.names = T,col.names = T)
For large CSV files, it is possible to use the ff package.
library("ff")
df <- read.csv.ffdf(file="large_csv_file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000)
Fixed width text files
read.fwf() and write.fwf() read and write fixed width files.
Some fixed width text files are provided with a SAS script to import them. Anthony Damico has created the SAScii
package to easily import those data.
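A minimal sketch of read.fwf(), writing a small fixed width file first so the example is self-contained (the file name and column layout are made up):

```r
# Write a small fixed width file: a 4-digit year followed by a 2-character code
writeLines(c("2011ab", "2012cd"), "fixed.txt")

# widths gives the number of characters occupied by each column
df <- read.fwf("fixed.txt", widths = c(4, 2), col.names = c("year", "code"))
df
#   year code
# 1 2011   ab
# 2 2012   cd
```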
Unstructured text files
See scan() and readLines() in the Reading and writing text files section.
Stata (dta)
We can read Stata data using read.dta() in the foreign package and export to Stata data format using
write.dta().
Note that string variables in Stata are limited to 244 characters. This can be an issue during the exportation
process.
See also Stata.file() in the memisc package and stata.get() in the Hmisc package.
> library("foreign")
> mydata <- read.dta("mydata.dta",convert.dates = TRUE, convert.factors = TRUE, convert.underscore = TRUE)
> names(mydata)
> write.dta(mydata, file = "mydata.dta")
SAS (sas7bdat)
Experimental support for SAS databases having the sas7bdat extension is provided by the sas7bdat package [3].
However, sas7bdat files generated by 64-bit versions of SAS, or by SAS running on non-Microsoft Windows
platforms, are not yet supported.
SAS (xpt)
See also sasexport.get() and sas.get() in the Hmisc package.
See also the SASxport package.
library("foreign")
mydata<-read.xport("SASData.xpt")
names(mydata)
SPSS (sav)
read.spss() (foreign) and spss.get() (Hmisc)
> library("foreign")
> mydata<-read.spss("SPSSData.sav")
> names(mydata)
EViews
readEViews() in the hexView package for EViews files.
Excel (xls,xlsx)
Importing data from Excel is not easy. The solution depends on your operating system. If none of the methods below
works, you can always export each Excel spreadsheet to CSV format and read the CSV in R. This is often the
simplest and quickest solution.
XLConnect supports reading and writing both xls and xlsx file formats. Since it is based on Apache POI it only
requires a Java installation and as such works on many platforms including Windows, UNIX/Linux and Mac.
Besides reading & writing data it provides a number of additional features such as adding plots, cell styling & style
actions and many more.
require("XLConnect")
wb <- loadWorkbook("myfile.xls", create = FALSE)
# Show a summary of the workbook (shows worksheets,
# defined names, hidden sheets, active sheet name, ...)
summary(wb)
# Read data from a worksheet interpreting the first row as column names
df1 <- readWorksheet(wb, sheet = "mysheet")
# Read data from a named region/range interpreting the first row as column
# names
df2 <- readNamedRegion(wb, name = "myname", header = TRUE)
The RODBC solution (works fine on Windows):
library("RODBC")
channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection
sqlTables(channel) # List all the tables
effec <- sqlFetch(channel, "effec") # Read one spreadsheet as an R table
odbcClose(channel) # close the connection (don't forget)
The xlsReadWrite package.
library(xlsReadWrite)
mydat <- read.xls("myfile.xls", colNames = T, sheet = "mysheet", type = "data.frame", from = 1, checkNames = TRUE)
"sheet" specifies the name or the number of the sheet you want to import.
"from" specifies the first row of the spreadsheet.
The gnumeric package [4]. This package uses an external program called ssconvert, which is usually installed with
gnumeric, the GNOME Office spreadsheet. The read.gnumeric.sheet() function reads xls and xlsx files.
library("gnumeric")
df1 <- read.gnumeric.sheet(file = "df.xls", head = TRUE, sheet.name = "Feuille1")
df2 <- read.gnumeric.sheet(file = "df.xlsx", head = TRUE, sheet.name = "Feuille1")
See also xlsx for Excel 2007 documents and read.xls() (gdata).
Google Spread Sheets
You should make the spreadsheet public and publish it as a CSV file. Then you can read it in R using read.csv().
See more on the Revolution Computing blog (link [5]). See also RGoogleDocs (link [6]).
# Read from a Google SpreadSheet.
require(RCurl)
myCsv <-
getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv")
read.csv(textConnection(myCsv))
gnumeric spreadsheets
The gnumeric package. read.gnumeric.sheet() reads one sheet and read.gnumeric.sheets()
reads all sheets, storing them in a list.
library("gnumeric")
df <- read.gnumeric.sheet(file = "df.gnumeric", head = TRUE, sheet.name = "df.csv")
View(df)
df <- read.gnumeric.sheets(file = "df.gnumeric", head = TRUE)
View(df$df.csv)
OpenOffice and LibreOffice (ods)
The gnumeric package does a good job for ODS files.
library("gnumeric")
df <- read.gnumeric.sheet(file = "df.ods", head = TRUE, sheet.name = "Feuille1")
speedR is another alternative.
library("speedR")
df <- speedR.importany(file = "df.ods")
Note that you can also use the speedR graphical user interface (speedR()) which will return the command line for
replication.
library("speedR")
speedR()
Or you may use ROpenOffice [7].
library("ROpenOffice")
df <- read.ods(file = "df.ods")
JSON
JSON (JavaScript Object Notation) is a very common format on the internet. The rjson library makes it easy to
import data from the JSON format [8].
# json.txt : a text file including data in the JSON format
library("rjson")
df <- fromJSON(paste(readLines("json.txt"), collapse=""))
It is easy to export a list or a dataframe to the JSON format using the toJSON() function :
# df : a data frame
library("rjson")
json <- toJSON(df)
dBase (dbf)
read.dbf() in the foreign package.
library("foreign")
df <- read.dbf("file.dbf")
str(df)
Hierarchical Data Format (hdf5)
hdf5 data can be read using the hdf5 package [9].
DICOM and NIfTI
See "Working with the {DICOM} and {NIfTI} Data Standards in R" in the Journal of Statistical Software [10].
Resources
R Data Manual [11].
Paul Murrell's Introduction to Data Technologies [12].
References
[1] Stat Transfer http://www.stattransfer.com/
[2] speedR http://speedr.r-forge.r-project.org/
[3] sas7bdat http://cran.r-project.org/web/packages/sas7bdat/index.html
[4] This command has been tested using Ubuntu 10.10 and R 2.11.1
[5] http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
[6] http://www.omegahat.org/RGoogleDocs/
[7] The Omega Project for Statistical Computing (http://www.omegahat.org/)
[8] http://cran.r-project.org/web/packages/rjson/index.html
[9] http://cran.r-project.org/web/packages/hdf5/index.html
[10] Brandon Whitcher, Volker J. Schmid, Andrew Thorton "Working with the {DICOM} and {NIfTI} Data Standards in R", Journal of
Statistical Software Vol. 44, Issue 6, Oct 2011, link (http://www.jstatsoft.org/v44/i06)
[11] R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
[12] Paul Murrell's Introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/
Text Processing
This page includes all the material you need to deal with strings in R. The section on regular expressions may help
you understand the rest of the page, but it is not necessary if you only need to perform some simple tasks.
This page may be useful to :
perform statistical text analysis.
collect data from an unformatted text file.
deal with character variables.
In this page, we learn how to read a text file and how to use R functions for characters. There are two kinds of
functions for characters: simple functions and regular expressions. Many functions are part of the standard R base
package.
help.search(keyword = "character", package = "base")
However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr
package, which defines functions with similar behaviour but names that are easier to remember and a much
more systematic syntax [1].
Keywords : text mining, natural language processing
See the CRAN Task View on Natural Language Processing [2].
See also the following packages: tm, tau, languageR, scrapeR.
Reading and writing text files
R can read any text file using readLines() or scan(). It is possible to specify the encoding of the imported
text file with readLines(). The entire contents of the text file can be read into an R object (e.g., a character
vector). scan() is more flexible. The kind of data expected can be specified in the second argument (e.g.,
character(0) for a string).
text <- readLines("file.txt",encoding="UTF-8")
scan("file.txt", character(0)) # separate each word
scan("file.txt", character(0), quote = NULL) # get rid of quotes
scan("file.txt", character(0), sep = ".") # separate each sentence
scan("file.txt", character(0), sep = "\n") # separate each line
We can write the content of an R object into a text file using cat() or writeLines(). By default cat()
concatenates vectors when writing to the text file. You can change it by adding options sep="\n" or
fill=TRUE. The default encoding depends on your computer.
cat(text,file="file.txt",sep="\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)
Before reading a text file, you can look at its properties. nlines() (parser package) and countLines()
(R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of
bytes and characters in each line of a file. You can also display a text file using file.show().
Character encoding
R provides functions to deal with various encoding schemes. This is useful if you deal with text files which
have been created under another operating system, especially if the language is not English and uses many accents
and special characters. For instance, the standard encoding scheme on Linux is "UTF-8" whereas the standard
encoding scheme on Windows is "Latin1". The Encoding() function returns the encoding of a string.
iconv() is similar to the unix command iconv and converts the encoding.
iconvlist() gives the list of encoding schemes available on your computer.
readLines(), scan() and file.show() also have an encoding option.
is.utf8() (tau) tests if the encoding is "utf8".
is.locale() (tau) tests if encoding is the same as the default encoding on your computer.
translate() (tau) translates the encoding into the current locale.
fromUTF8() (descr) is less general than iconv().
utf8ToInt() (base)
Example
The following example was run under Windows. Thus, the default encoding is "latin1".
> texte <- "Hé hé"
> Encoding(texte)
[1] "latin1"
> texte2 <- iconv(texte,"latin1","UTF-8")
> Encoding(texte2)
[1] "UTF-8"
Regular Expressions
A regular expression describes a pattern in a set of strings. For instance, one could look for the following pattern: 2
digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular
expressions are used in R [3]:
extended regular expressions, used when perl = FALSE (the default),
Perl-like regular expressions, used when perl = TRUE.
There is also an option fixed = TRUE, which treats the pattern as a literal string rather than a regular expression.
fixed() (stringr) is equivalent to fixed=TRUE in the standard regex functions. These functions are by default
case sensitive. This can be changed by specifying the option ignore.case = TRUE.
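A quick sketch of these two options with grepl():

```r
x <- c("version 2.10", "version 2x10")

# In a regular expression "." matches any character, so both strings match:
grepl("2.10", x)                        # TRUE TRUE
# With fixed = TRUE the pattern is taken literally, so only the real dot matches:
grepl("2.10", x, fixed = TRUE)          # TRUE FALSE
# ignore.case = TRUE makes the match case insensitive:
grepl("VERSION", x, ignore.case = TRUE) # TRUE TRUE
```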
If you are not a specialist in regular expressions, you may find glob2rx() useful. This function translates a
wildcard ("glob") pattern into the corresponding regular expression:
> glob2rx("abc.*")
[1] "^abc\\."
Functions which use regular expressions in R
sub(), gsub(), str_replace() (stringr) make some substitutions in a string.
grep(), str_extract() (stringr) extract some value
grepl(), str_detect() (stringr) detect the presence of a pattern.
see also splitByPattern() (R.utils)
See also gsubfn() in the gsubfn package.
Extended regular expressions (The default)
"." stands for any character.
"[ABC]" means A,B or C.
"[A-Z]" means any upper letter between A and Z.
"[0-9]" means any digit between 0 and 9.
Here is the list of metacharacters: $ * + . ? [ ] ^ { } | ( ) \. If you need to match one of these
characters literally, precede it with a double backslash, e.g. "\\." for a literal dot.
Here are some predefined character classes. For numbers:
[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9.
For letters:
[:alpha:] Alphabetic characters: [:lower:] and [:upper:].
[:upper:] Upper-case letters.
[:lower:] Lower-case letters.
Note that the set of alphabetic characters includes accented letters such as é, è or ç, which are very common in some
languages like French. Therefore, it is more general than "[A-Za-z]", which does not include accented letters.
For other characters:
[:punct:] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ]
^ _ ` { | } ~.
[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
[:blank:] Blank characters: space and tab.
[:cntrl:] Control characters.
For combinations of other classes:
[:alnum:] Alphanumeric characters: [:alpha:] and [:digit:].
[:graph:] Graphical characters: [:alnum:] and [:punct:].
[:print:] Printable characters: [:alnum:], [:punct:] and space.
[:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
You can quantify the number of repetitions by appending one of the following quantifiers to a regular expression:
? The preceding item is optional and will be matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
^ anchors the match to the beginning of the string.
$ anchors the match to the end of the string.
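For instance, combining a character class, a quantifier and both anchors, the following pattern matches strings made of 2 to 4 digits and nothing else:

```r
x <- c("1", "12", "1234", "12345", "a12")
# ^ and $ force the whole string to consist of 2 to 4 digits
grepl("^[0-9]{2,4}$", x)
# [1] FALSE  TRUE  TRUE FALSE FALSE
```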
If you want to know more, have a look at the following two help files:
>?regexp # gives some general explanations
>?grep # help file for grep(),regexpr(),sub(),gsub(),etc
Perl-like regular expressions
It is also possible to use "perl-like" regular expressions. You just need to use the option perl=TRUE.
Examples
If you want to remove whitespace in a string, you can use the Perl \s class (written "\\s" in an R string). Note that sub() only replaces the first match; gsub() replaces them all.
sub('\\s', '',x, perl = TRUE)
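For instance, to collapse runs of whitespace into single spaces with gsub():

```r
# \s+ matches one or more whitespace characters (spaces, tabs, newlines)
gsub("\\s+", " ", "too   many    spaces", perl = TRUE)
# [1] "too many spaces"
```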
Concatenating strings
paste() concatenates strings.
str_c() (stringr) does a similar job.
cat() prints and concatenates strings.
Examples
> paste("toto","tata",sep=' ')
[1] "toto tata"
> paste("toto","tata",sep=",")
[1] "toto,tata"
> str_c("toto","tata",sep=",")
[1] "toto,tata"
> x <- c("a","b","c")
> paste(x,collapse=" ")
[1] "a b c"
> str_c(x, collapse = " ")
[1] "a b c"
> cat(c("a","b","c"), sep = "+")
a+b+c
Splitting a string
strsplit() splits the elements of a character vector x into substrings according to the matches to the substring
split within them.
See also str_split() (stringr).
> unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"
tokenize() (tau) splits a string into tokens.
> tokenize("abc defghk")
[1] "abc" " " "defghk"
Counting the number of characters in a string
nchar() gives the length of a string.
See also str_length() (stringr).
> nchar("abcdef")
[1] 6
> str_length("abcdef")
[1] 6
> nchar(NA)
[1] 2
> str_length(NA)
[1] NA
Detecting the presence of a substring
Detecting a pattern in a string ?
grepl() returns a logical value (TRUE or FALSE).
str_detect() (stringr) does a similar job.
> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> grepl(pattern = regexp, x = string)
[1] TRUE
> str_detect(string, regexp)
[1] TRUE
> grepl(pattern = regexp, x = string2)
[1] FALSE
The first pattern matches and the second does not, because the day in string2 has only one digit while the pattern requires two.
Counting the occurrence of each pattern in a string ?
textcnt() (tau) counts the occurrence of each pattern or each term in a text.
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> textcnt(string,n=1L,method="string")
blabla mai
2 2
attr(,"class")
[1] "textcnt"
Extracting the position of a substring or a pattern in a string
Extracting the position of a substring ?
cpos() (cwhmisc) returns the position of a substring in a string.
substring.location() (cwhmisc) does the same job but returns the first and the last position.
> cpos("abcdefghijklmnopqrstuvwxyz","p",start=1)
[1] 16
> substring.location("abcdefghijklmnopqrstuvwxyz","def")
$first
[1] 4
$last
[1] 6
Extracting the position of a pattern in a string ?
regexpr() returns the position of the first match of the regular expression; str_locate() (stringr) does the same job.
gregexpr() is similar to regexpr(), but returns the starting position of every match;
str_locate_all() (stringr) does the same job.
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> regexpr(pattern = regexp, text = string)
[1] 8
attr(,"match.length")
[1] 11
> gregexpr(pattern = regexp, text = string)
[[1]]
[1] 8 27
attr(,"match.length")
[1] 11 11
> str_locate(string,regexp)
start end
[1,] 8 18
> str_locate_all(string,regexp)
[[1]]
start end
[1,] 8 18
[2,] 27 37
Extracting a substring from a string
Extracting a fixed width substring ?
substr() extracts a substring.
str_sub() (stringr) is similar.
> substr("simple text",1,3)
[1] "sim"
> str_sub("simple text",1,3)
[1] "sim"
Extracting the first word in a string ?
first.word() (Hmisc) extracts the first word of a string or expression.
> first.word("abc def ghk")
[1] "abc"
Extracting a pattern in a string ?
grep() returns the matching elements themselves if value=TRUE and their positions if value=FALSE.
> grep(pattern = regexp, x = string , value = T)
[1] "23 mai 2000"
> grep(pattern = regexp, x = string2 , value = T)
character(0)
> grep(pattern = regexp, x = string , value = F)
[1] 1
> grep(pattern = regexp, x = string2 , value = F)
integer(0)
str_extract(), str_extract_all(), str_match(), str_match_all() (stringr) and m()
(caroline package) are similar to grep(). str_extract() and str_extract_all() return a vector.
str_match() and str_match_all() return a matrix and m() a dataframe.
> library("stringr")
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> str_extract(string,regexp)
[1] "23 mai 2000"
> str_extract_all(string,regexp)
[[1]]
[1] "23 mai 2000" "18 mai 2004"
> str_match(string,regexp)
[,1] [,2] [,3] [,4]
[1,] "23 mai 2000" "23" "mai" "2000"
> str_match_all(string,regexp)
[[1]]
[,1] [,2] [,3] [,4]
[1,] "23 mai 2000" "23" "mai" "2000"
[2,] "18 mai 2004" "18" "mai" "2004"
> library("caroline")
> m(pattern = regexp, vect = string, names = c("day","month","year"), types = rep("character",3))
day month year
1 18 mai 2004
Making some substitution inside a string
Substituting a pattern in a string
sub() makes a substitution.
gsub() is similar to sub() but replaces all occurrences of the pattern, whereas sub() only replaces the first
occurrence.
str_replace() (stringr) is similar.
In the following example, we have a French date. The pattern is the following: 2 digits, a blank, some letters,
a blank, 4 digits. We capture the 2 digits with the [[:digit:]]{2} expression, the letters with
[[:alpha:]]+ and the 4 digits with [[:digit:]]{4}. Each of these three sub-expressions is surrounded by
parentheses. The first captured substring is referred to as "\\1", the second one as "\\2" and the third one as "\\3".
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
sub(pattern = regexp, replacement = "\\2", x = string) # returns the second part
sub(pattern = regexp, replacement = "\\3", x = string) # returns the third part
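Back-references can also be combined to rearrange the captured pieces, for instance to rewrite the date in a year-month-day order:

```r
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
# "\\3", "\\2" and "\\1" refer to the year, month and day captured above
sub(pattern = regexp, replacement = "\\3-\\2-\\1", x = string)
# [1] "2000-mai-23"
```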
In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space
whereas the second one removes all spaces in the text.
> text <- "abc def ghk"
> sub(pattern = " ", replacement = "", x = text)
[1] "abcdef ghk"
> gsub(pattern = " ", replacement = "", x = text)
[1] "abcdefghk"
Substituting characters in a string ?
chartr() substitutes characters in a string; the name stands for "character translation".
replacechar() (cwhmisc) does the same job,
as does str_replace_all() (stringr).
> chartr(old="a",new="o",x="baba")
[1] "bobo"
> chartr(old="ab",new="ot",x="baba")
[1] "toto"
> replacechar("abc.def.ghi.jkl",".","_")
[1] "abc_def_ghi_jkl"
> str_replace_all("abc.def.ghi.jkl","\\.","_")
[1] "abc_def_ghi_jkl"
Converting letters to lower or upper-case
tolower() converts upper-case characters to lower-case.
toupper() converts lower-case characters to upper-case.
capitalize() (Hmisc) capitalizes the first letter of a string.
See also cap(), capitalize(), lower(), lowerize() and CapLeading() in the cwhmisc package.
> tolower("ABCdef")
[1] "abcdef"
> toupper("ABCdef")
[1] "ABCDEF"
> capitalize("abcdef")
[1] "Abcdef"
Filling a string with some character
padding() (cwhmisc) fills a string with some characters to fit a given length. See also str_pad() (stringr).
> library("cwhmisc")
> padding("abc",10," ","center") # adds blanks such that the length of the string is 10.
[1] " abc "
> str_pad("abc",width=10,side="center", pad = "+")
[1] "+++abc++++"
> str_pad(c("1","11","111","1111"),3,side="left",pad="0")
[1] "001" "011" "111" "1111"
Note that str_pad() can be very slow: for a vector of length 10,000 the computing time is substantial.
padding() does not seem to handle character vectors directly, but combining it with sapply() may be the
best solution.
>library("stringr")
>library("cwhmisc")
>a <- rep(1,10^4)
> system.time(b <- str_pad(a,3,side="left",pad="0"))
utilisateur système écoulé
50.968 0.208 73.322
> system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
utilisateur système écoulé
7.700 0.020 12.206
Removing leading and trailing spaces
trimws() (memisc package) trims leading and trailing white space; since R 3.2.0 a function of the same name is also part of base R.
trim() (gdata package) does the same job.
See also str_trim() (stringr)
> library("memisc")
> trimws(" abc def ")
[1] "abc def"
> library("gdata")
> trim(" abc def ")
[1] "abc def"
> str_trim(" abd def ")
[1] "abd def"
Comparing two strings
Assessing if they are identical
== returns TRUE if both strings are identical and FALSE otherwise.
> "abc"=="abc"
[1] TRUE
> "abc"=="abd"
[1] FALSE
Computing distance between strings
A few packages implement the Levenshtein distance between two strings:
adist() in base package utils
stringMatch() in MiscPsycho
stringdist() in stringdist
levenshteinDist() in RecordLinkage
A benchmark comparing the speed of levenshteinDist() and stringdist() is available here: [4].
Example with utils
> adist("test","tester")
[1] 2
Example with MiscPsycho
stringMatch() (MiscPsycho) computes the Levenshtein distance. If normalize="YES", the distance is
divided by the maximum of the two string lengths.
> library("MiscPsycho")
> stringMatch("test","tester",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 2
Approximate matching
agrep() searches for approximate matches using the Levenshtein distance.
If value = TRUE, it returns the matching strings themselves.
If value = FALSE, it returns their positions.
max specifies the maximal Levenshtein distance allowed.
> agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
[1] "1 lazy"
Miscellaneous
deparse() : Turn unevaluated expressions into character strings.
char.expand() (base) expands a string with respect to a target.
pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among
those of their second.
> pmatch(c("a","b","c","d"),table = c("b","c"), nomatch = 0)
[1] 0 1 2 0
make.unique() makes the elements of a character vector unique. This is useful if you want to use strings as
identifiers in your data.
> make.unique(c("a", "a", "a"))
[1] "a" "a.1" "a.2"
References
[1] Hadley Wickham, "stringr: modern, consistent string processing", The R Journal, December 2010, Vol. 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
[2] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
[3] In former versions (< 2.10) there were also basic regular expressions in R:
extended regular expressions, used when extended = TRUE (the default),
basic regular expressions, used when extended = FALSE (obsolete in R 2.10).
Since basic regular expressions (extended = FALSE) are now obsolete, the extended option is obsolete as of version 2.11.
[4] http://www.markvanderloo.eu/yaRb/2013/09/07/a-bit-of-benchmarking-with-string-distances/
Times and Dates
R contains a set of object types for holding date and time information. The system time and date can also be
requested.
Format
Many time and date units are recognised. These include:
Unit Symbol Example
4 digit year %Y 1932
2 digit year %y 84
Numerical Month %m 03
Full Month %B January
Abbreviated Month %b Jan
Day of the month %d 31
Full weekday %A Wednesday
Abbreviated weekday %a Wed
Hours (24hr clock) %H 16
Minutes %M 35
Seconds %S 52
The default format is yyyy-mm-dd hh:mm:ss, i.e. %Y-%m-%d %H:%M:%S, for example 2010-02-13 23:12:24.
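These codes can be combined freely in format() to print a date or time object any way you like (weekday and month names depend on your locale):

```r
d <- as.Date("2010-02-13")
format(d, "%d/%m/%y")    # day/month/2-digit year
# [1] "13/02/10"
format(d, "%A %d %B %Y") # full weekday and month names, locale dependent
```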
System Date and Time
To get the system date and time:
> Sys.time()
[1] "2010-02-13 23:12:24 COT"
> format(Sys.time(),"%H %M") # in a different format and without the date
[1] "23 13"
> Sys.Date()
[1] "2010-02-13"
> date() # returns the current date and time as a character string
[1] "Wed Jul 18 10:59:42 2012"
Convert strings to date/time objects
Convert a string representing the date or time into a Date/Time object:
> my.date <- as.Date("2010-12-20")
> print(my.date)
[1] "2010-12-20"
> my.date2 <- as.Date("12/20/10", format="%m/%d/%y") # input date in a different format
> print(my.date2)
[1] "2010-12-20"
> my.time <- strptime("12/20/10 14.34.35", format="%m/%d/%y %H.%M.%S") # input time and date
> print(my.time)
[1] "2010-12-20 14:34:35"
> my.string <- as.character(Sys.time()) # convert a date/time object to a normal string
> print(my.string)
[1] "2011-02-03 02:31:15"
Extracting information from dates
Get weekday, month and an integer representing the number of days since the beginning of epoch:
> weekdays(my.date) # Get a string representing the weekday of the specified date
[1] "Monday"
> months(my.date) # Get the month as well
[1] "December"
> my.date
[1] "2010-12-20"
> julian(my.date) # Get the integer number of days since the beginning of epoch
[1] 14963
attr(,"origin")
[1] "1970-01-01"
Note that weekdays() and months() return their results in the local language. For instance, if you switch R to
French, you get weekdays and months in French [1]:
> require("lubridate")
> Sys.setlocale(locale="fr_FR.UTF-8")
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8"
> mydate <- ymd("2002-04-21")
> weekdays(mydate)
[1] "Dimanche"
> months(mydate)
[1] "avril"
Generating sequences of dates
> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("10/01/12","%d/%m/%y"), by = "day")
#create the 10 first days of January 2012
[1] "2012-01-01" "2012-01-02" "2012-01-03" "2012-01-04" "2012-01-05" "2012-01-06"
[7] "2012-01-07" "2012-01-08" "2012-01-09" "2012-01-10"
> seq(from = as.Date("20/01/12", "%d/%m/%y"), to = as.Date("20/12/12","%d/%m/%y"), by = "month")
#create the 20th of each month in 2012
[1] "2012-01-20" "2012-02-20" "2012-03-20" "2012-04-20" "2012-05-20" "2012-06-20"
[7] "2012-07-20" "2012-08-20" "2012-09-20" "2012-10-20" "2012-11-20" "2012-12-20"
> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("31/01/12","%d/%m/%y"), length.out = 16)
# create a sequence of every other day in January 2012
[1] "2012-01-01" "2012-01-03" "2012-01-05" "2012-01-07" "2012-01-09" "2012-01-11"
[7] "2012-01-13" "2012-01-15" "2012-01-17" "2012-01-19" "2012-01-21" "2012-01-23"
[13] "2012-01-25" "2012-01-27" "2012-01-29" "2012-01-31"
References
[1] Issue on Stackoverflow (http://stackoverflow.com/questions/17836966/french-names-using-wday-in-lubridate)
External links
Do more with dates and times in R with lubridate 1.1.0 (http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/) (a sample chapter from the R in Action book)
Graphics
R includes at least three graphical systems: the standard graphics package, the lattice package for Trellis graphs [1]
and the grammar-of-graphics ggplot2 package [2]. R has good graphical capabilities but there are some alternatives
like gnuplot.
Interactive Graphics
This section discusses some ways to draw graphics without writing R scripts.
The playwith package provides a graphical user interface to customize graphs, add a title, a grid, some text, etc.,
and it exports the R code you need if you want to replicate the analysis [3]. If you want to know more, you can have a
look at the screenshots on the website (link [4]). See also the example on "R you Ready" [5]. This package requires the
GTK+ libraries.
library("playwith")
playwith(plot(x1))
There is also a graphical user interface, GrapheR, which makes it very easy for beginners to draw graphs. This
solution is cross-platform.
> library(GrapheR)
latticist (link [6]) is another similar project.
Note also that some graphical user interfaces such as RKWard and R Commander make it easy to draw graphs.
Standard R graphs
In this section we present what you need to know if you want to customize your graphs in the default graph system.
plot() is the main function for graphics. The arguments can be a single point such as 0 or c(.3,.7), a single
vector, a pair of vectors or many other R objects.
par() is another important function which defines the default settings for plots.
There are many other plot functions which are specific to some tasks such as hist(), boxplot(), etc. Most of
them take the same arguments as the plot() function.
> N <- 10^2
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> plot(0)
> plot(0,1)
> plot(x1)
> plot(x1,x2) # scatter plot x1 on the horizontal axis and x2 on the vertical axis
> plot(x2 ~ x1) # the same but using a formula (x2 as a function of x1)
> methods(plot) # show all the available methods for plot (depending on the number of loaded packages).
Titles, legends and annotations
Titles
main gives the main title, sub the subtitle. They can be passed as argument of the plot() function or using the
title() function. xlab the name of the x axis and ylab the name of the y axis.
plot(x1,x2, main = "Main title", sub = "sub title" , ylab = "Y axis", xlab = "X axis")
plot(x1,x2 , ylab = "Y axis", xlab = "X axis")
title(main = "Main title", sub = "sub title" )
The size of the text can be modified using the parameters cex.main, cex.lab, cex.sub and cex.axis. These
parameters define a scaling factor, i.e. the value of the parameter multiplies the default size of the text. If you choose
cex.main=2, the main title will be twice as big as usual.
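For instance, the following sketch doubles the title size and shrinks the axis annotations:

```r
x1 <- rnorm(100)
plot(x1, main = "Main title",
     cex.main = 2,    # title twice the default size
     cex.lab  = 0.8,  # axis names at 80% of the default size
     cex.axis = 0.8)  # tick labels at 80% of the default size
```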
Legend
legend(). The position can be "bottomleft", "bottomright", "topleft", "topright" or exact coordinates.
plot(x1, type = "l", col = 1, lty = 1)
lines(x2, col = 2, lty = 2)
legend("bottomleft", legend = c("x1","x2"), col = 1:2, lty = 1:2)
Text in the margin
mtext() puts some texts in the margin. The margin can be at the bottom (1), the left (2), the top (3) or the right (4).
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 1) # the bottom
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 2) # the left
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 3) # the top
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 4) # the right margin
Text in the graph
text()
Mathematical annotations
We can add mathematical symbols using expression() and make substitutions in a formula using
substitute().
?plotmath # gives help for mathematical annotations
Types
The type of a plot can be :
n for none (nothing is printed),
p for points,
l for lines,
b for both,
o for both overlayed,
h for histogram-like
and s/S for steps.
x1 <- rnorm(50)
png("plottype.png")
par(mfrow = c(2,2))
plot(x1, type = "p", main = "points", ylab = "", xlab = "")
plot(x1, type = "l", main = "lines", ylab = "", xlab = "")
plot(x1, type = "b", main = "both", ylab = "", xlab = "")
plot(x1, type = "o", main = "both overplot", ylab = "", xlab = "")
dev.off()
Axes
By default the axes are drawn. We can remove them with axes=FALSE. We can also customize them using the
axis() function.
> plot(x1,x2,axes=FALSE)
>
> plot(x1,x2,axes=FALSE)
> axis(1,col="red",col.axis="blue",font.axis=3)
> axis(2,col="red",col.axis="blue",font.axis=2,las=2)
las specifies the style of axis labels. It can be 0, 1, 2 or 3.
0 : always parallel to the axis [default],
1 : always horizontal,
2 : always perpendicular to the axis,
3 : always vertical.
x1 <- rnorm(100)
par(mfrow = c(2,2))
plot(x1, las = 0, main = "las = 0", sub = "always parallel to the axis", xlab = "", ylab = "")
plot(x1, las = 1, main = "las = 1", sub = "always horizontal", xlab = "", ylab = "")
plot(x1, las = 2, main = "las = 2", sub = "always perpendicular to the axis", xlab = "", ylab = "")
plot(x1, las = 3, main = "las = 3", sub = "always vertical", xlab = "", ylab = "")
It is also possible to add another y axis on the right with axis(4).
Margins
Margins can be specified in inches or in lines of text. The default is par(mar = c(5,4,4,2)), which means that
there are 5 lines at the bottom, 4 lines on the left, 4 lines at the top and 2 lines on the right. This can be modified
using the par() function. If you want to specify margins in inches, use par(mai = c(bottom, left,
top, right)). If you want to modify margins in lines, use par(mar = c(bottom, left, top, right)).
See ?par to learn more about the topic.
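For instance, to widen the left margin for a long axis label and restore the previous settings afterwards:

```r
op <- par(mar = c(5, 6, 4, 1))  # bottom, left, top, right, in lines of text
plot(rnorm(100), ylab = "a long label that fits in the wide left margin")
par(op)                         # restore the previous margins
```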
Colors
The color of the points or lines can be changed using the col argument, fg for foreground colors (boxes and axes)
and bg for background colors.
show.col(object=NULL) (Hmisc) plots the main R colors with their numeric codes.
The list of all colors in R (pdf [7]).
colors() # list the r colors
show.col(object=NULL) # graphs the main R colors
plot(x1, col = "blue")
plot(x1, col = "red")
plot(x1, col = "red", col.axis = "dodgerblue", col.lab = "firebrick",
col.main = "darkgreen", col.sub = "cyan4", main = "Testing colors", sub
= "sub titles", ylab = "y axis", xlab = "x axis")
We can also generate new colors using the rgb() function. The first argument is the intensity of red, the
second the intensity of green and the third the intensity of blue. They vary between 0 and 1 by default, but this
can be modified with the option maxColorValue = 255. col2rgb() returns the RGB code of R colors.
col2hex() (gplots) gives the hexadecimal code. col2grey() and col2gray() (TeachingDemos) convert
colors to grey scale.
> mycolor <- rgb(.2,.4,.6)
> plot(x1, col = mycolor)
> col2rgb("pink")
[,1]
red 255
green 192
blue 203
> library("gplots")
> col2hex("pink")
[1] "#FFC0CB"
Points
For points, the symbol can be changed using the pch option, which takes an integer value between 0 and 25 or a
single character. pch can also take a vector as argument; in that case the first point uses the first element of
the vector as its symbol, and so on.
plot(x1, type = "p", pch = 0)
plot(x1, type = "p", pch = 10)
plot(x1, type = "p", pch = 25)
plot(x1, type = "p", pch = "a")
plot(x1, type = "p", pch = "*")
plot(x1[1:26], type = "p", pch = 0:25)
plot(x1[1:26], type = "p", pch = letters)
The following code displays all the symbols on the same plot :
x <- rep(1,25)
plot(x, pch = 1:25, axes = F, xlab = "", ylab = "")
text(1:25,.95,labels = 1:25)
points() adds points to an existing plot.
> plot(x1, pch = 0) # plot x1
> points(x2, pch = 1, col = "red") # add x2 to the existing plot
Lines
We can change the line type with lty. The argument is a string ("blank", "solid", "dashed", "dotted", "dotdash",
"longdash", or "twodash") or an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash,
6=twodash). The line width can be changed with lwd. The default is lwd=1. lwd=2 means that the width is twice
the normal width.
plot(x1, type = "l", lty = "blank")
plot(x1, type = "l", lty = "solid")
plot(x1, type = "l", lty = "dashed")
plot(x1, type = "l", lty = "dotted")
plot(x1, type = "l", lty = "dotdash")
plot(x1, type = "l", lty = "longdash")
plot(x1, type = "l", lty = "twodash")
lines() adds additional lines to an existing graph.
plot(x1, type = "l", lty = "solid")
lines(x2, type = "l", lty = "dashed", col = "red")
abline() adds a horizontal line (h=), a vertical line (v=) or a linear function of x to the current plot (a= for the
intercept and b= for the slope). abline() can also plot a regression line.
> plot(x1, type = "l", lty = "solid")
> abline(h= -3, lty = "dashed", col = "gray")
> abline(v = 0, lty = "dashed", col = "gray")
> abline(a = -3 , b = .06, lty = "dotted", col = "red")
Boxes
Each graph is framed by a box. bty specifies the box type.
plot(x1, bty = "o") # the default
plot(x1, bty = "n") # no box
plot(x1, bty = "l")
plot(x1, bty = "7")
plot(x1, bty = "u")
plot(x1, bty = "c")
plot(x1, bty = "]")
See also box() to add a box to an existing plot.
Grid
grid() adds a grid to the current graph.
> plot(x1)
> grid()
Although grid has an optional argument nx for setting the number of grid lines, it is not possible to tell it explicitly
where to place those lines (it will usually not place them at integer values). A more precise and manageable
alternative is to use abline().
> abline(v=(seq(0,100,5)), col="lightgray", lty="dotted")
> abline(h=(seq(0,100,5)), col="lightgray", lty="dotted")
Other figures
We can also add a circle to a plot with the circle() function in the calibrate package.
Background
You can choose the background of your plot. For instance, you can change the background color with par(bg=).
par(bg="whitesmoke")
par(bg="transparent")
Overlaying plots
matplot() plots the columns of one matrix against the columns of another, overlaying several series in a single call.
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1
y <- 1 + x1 + x2 + rnorm(N)
mydat <- data.frame(y,x1,x2)
matplot(mydat[,1],mydat[,2:3], pch = 1:2)
Multiple plots
With par() we can display multiple figures on the same page. mfrow = c(3,2) prints 6 figures with 3 rows
and 2 columns, filled row by row. mfcol = c(3,2) does the same but fills the figures column by column.
par(mfrow = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")
par(mfcol = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")
Plotting a function
curve() plots a function. It can be added to an existing plot with the option add = TRUE.
plot() can also plot functions.
curve(x^2, from = -1 , to = 1, main = "Quadratic function", ylab = "f(x)=x^2")
plot(rnorm(100))
curve((x/100)^2, add = TRUE, col = "red")
(Figure: square root function, made using plot().)
Exporting graphs
How can you export a graph ?
First you can plot the graph and use the context menu (right click on Windows and Linux, or control + click on
Mac) to copy or save the graph. The available options depend on your operating system. On Windows, you can
also copy the current graph to the clipboard as a bitmap (raster graphics) using CTRL + C or as a
Windows Metafile (vector graphics) using CTRL + W. You can then paste it into another application.
You can export a plot to pdf, png, jpeg, bmp or tiff by adding pdf("filename.pdf"),
png("filename.png"), jpeg("filename.jpg"), bmp("filename.bmp") or
tiff("filename.tiff") prior to the plotting, and dev.off() after the plotting.
You can also use the savePlot() function to save existing graphs.
Sweave also produces ps and pdf graphics (see the Sweave section).
It is generally better to use vector devices such as pdf, postscript or svg.
How can you know the list of all available devices ?
?Devices
Use the capabilities() function to see the list of available devices on your computer.
> capabilities()
jpeg png tiff tcltk X11 aqua http/ftp sockets
TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
libxml fifo cledit iconv NLS profmem cairo
TRUE FALSE TRUE TRUE TRUE TRUE FALSE
png("r_plot.png", width = 420, height = 340)
plot(x1, main = " Example")
dev.off()
pdf("r_plot.pdf", width = 6, height = 5) # note: pdf() sizes are in inches, not pixels
plot(x1, main = " Example")
dev.off()
postscript(file="graph1.ps",horizontal=F,pagecentre=F,paper="special",width=8.33,height=5.56)
plot(x1, main = "Example")
dev.off()
plot(x1, main = "Example")
savePlot("W:/Bureau/plot.pdf", type = "pdf")
savePlot("W:/Bureau/plot.png", type = "png")
We can also export to SVG using the svg() function.
svg("scatterplot.svg", width = 7, height = 7)
plot(x, y)
dev.off()
The RSvgDevice package, which was used in earlier versions of R, now seems outdated.
Advanced topics
Animated plots
The animation package provides dynamic graphics capabilities. It is possible to export the animation in Flash, MPEG or GIF format. There are more examples on the aniwiki website: http://animation.yihui.name/.
You can also create motion charts using the googleVis package [8].
Examples
Interactive Graphics
The iplots package provides interactive data visualization in R [9][10].
The Deducer GUI (from version 0.4-2) connects with iplots to offer interactive graphics [11].
To create an interactive, animated plot viewable in a web browser, the animint package [12] can be used. The main idea is to define an interactive animation as a list of ggplots with two new aesthetics:
showSelected=variable means that only the subset of the data that corresponds to the selected value of variable
will be shown.
clickSelects=variable means that clicking a plot element will change the currently selected value of variable.
Graphics gallery
In this section, we review the main kinds of statistical plots and the alternative ways to draw them in R. This includes code for the standard graphics package, the lattice package and the ggplot2 package. We also add some examples from the Commons repository, keeping only examples which are provided with their R code. You can click on any graph and find the R code.
Line plot
To draw a line plot, use the generic plot() function by setting type="l".
> x <- seq(0, 2*pi, pi/10)
> plot(x, sin(x), type="l")
Then, you can add further lines on the same plot using the lines() function.
> lines(x, cos(x))
Examples
Scatter plot
plot(x,y)
plot(y ~ x)
xyplot(y ~ x) (lattice)
qplot(x,y) (ggplot2)
Log scale
Sometimes it is useful to plot the log of a variable and to have a log scale on the axis. This is done with the log option of the plot() function.
For a log-log plot, use log = "xy".
For a log scale on the x axis only, use log = "x".
For a log scale on the y axis only, use log = "y".
plot(x, y, log = "xy")
Label points in a plot
It is possible to add labels with the text() function.
textxy() in the calibrate package makes it easy to add labels.
N <- 10
u <- rnorm(N)
x <- 1 + rnorm(N)
y <- 1 + x + u
library("calibrate")
plot(x, y)
textxy(x, y, labs = signif(x, 3), cx = 0.7)
Examples
Histogram
hist()
histogram() (lattice)
You can learn more about histograms in the Non parametric methods page.
Examples
Box plot
boxplot()
Examples
Bar charts
See Bar charts on Wikipedia.
barplot() takes a table as argument and returns a bar chart.
qplot() (ggplot2) with the option geom = "bar" takes a variable as argument and returns a bar chart [13].
barchart() (lattice) takes a variable as argument and returns a bar chart.
Examples
Dot plot
See also Dot plot on Wikipedia.
dotchart()
Examples
Pie charts
pie()
Examples
Treemap
The tmPlot() function in the treemap package makes it easy to draw a treemap.
Confidence interval plot
Error bar charts are very useful to plot several estimates with confidence intervals.
The Hmisc package has an errbar() function. This function takes the upper and lower bounds of the confidence intervals as arguments [14].
The coefplot() function in Gelman and Hill's arm package is designed to display estimation results. It takes point estimates and standard errors as arguments.
coefs <- c(0.2, 1.4, 2.3, 0.5,.3) # vector of point estimates
se <- c(0.12, 0.24, 0.23, 0.15,.2) # standard errors of point estimates
variable <- 1:5 # variable names
library("arm")
# we use CI = qnorm(.975) to have 95% confidence interval
coefplot(coefs, se, variable, vertical = T, CI = qnorm(.975))
coefplot(coefs, se, variable, vertical = F, CI = qnorm(.975))
library("Hmisc")
errbar(variable, coefs, coefs - qnorm(.975) * se, coefs + qnorm(.975) * se)
See also
There is another errbar() function in the sfsmisc package.
plotCI() (gplots) also plots error bars.
plotmeans() (gplots)
ciplot() (hacks)
See also Error bar on Wikipedia
3D plots
contour(), image(), persp()
plot3d() (rgl)
wireframe() (lattice)
Examples
Example with wireframe() (lattice); click on the graph to see the R code.
Diagrams
the grid package by Paul Murrell [15]
the diagram package [16]
the Rgraphviz package
the igraph package
Arc Diagrams
It is also possible to draw Arc Diagrams.
Dendrograms
It is possible to plot dendrograms in R.
Treemap
It is possible to draw a treemap using the treemap() function in the treemap package [17].
Wordcloud
There are:
the wordcloud() function in the wordcloud package
the tagcloud() function in the tagcloud package
Timeline
timeline() in the timeline package
Resources
Tables 2 Graphs [18]
R Graphics by Paul Murrell [19]
ggplot2 [20]
References
[1] D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008. ISBN 9780387759685.
[2] ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham, and a list of examples on his own website: http://had.co.nz/ggplot2/
[3] playwith: http://code.google.com/p/playwith/
[4] http://code.google.com/p/playwith/wiki/Screenshots
[5] http://ryouready.wordpress.com/2010/03/23/playing-with-the-playwith-package/
[6] http://code.google.com/p/latticist/
[7] http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
[8] Tutorial for the googleVis package: http://stackoverflow.com/questions/4646779/embedding-googlevis-charts-into-a-web-site/4649753#4649753
[9] http://www.r-bloggers.com/interactive-graphics-with-the-iplots-package-from-%E2%80%9Cr-in-action%E2%80%9D/
[10] Interactive Graphics with the iplots Package (http://www.r-statistics.com/2012/01/interactive-graphics-with-the-iplots-package-from-r-in-action/) - a chapter from the R in Action book
[11] http://www.r-statistics.com/2010/10/r-gui-now-offers-interactive-graphics-deducer-0-4-2-connects-with-iplots/
[12] https://github.com/tdhock/animint
[13] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Springer, 2009
[14] The default output of errbar() changed between R 2.8.1 and R 2.9.2: axes are no longer displayed by default.
[15] Paul Murrell, Drawing Diagrams with R, The R Journal, 2009. http://journal.r-project.org/2009-1/RJournal_2009-1_Murrell.pdf
[16] Example: Using a binary tree diagram for describing a Bernoulli process (http://www.r-statistics.com/2011/11/diagram-for-a-bernoulli-process-using-r/)
[17] http://cran.r-project.org/web/packages/treemap/treemap.pdf
[18] http://tables2graphs.com/doku.php
[19] http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html
[20] http://had.co.nz/ggplot2/
Grammar of graphics
Hadley Wickham has developed ggplot2, a graphics library designed according to the principles of the Grammar of Graphics.
Plotting a function
We use qplot() with the option stat = "function":
# Plot the quadratic function
library("ggplot2")
square <- function(x) {
x^2
}
mode(square) # "function"
qplot(c(0, 2), stat = "function", fun = square, geom = "line")
Here is another example with the sine function:
# plot the sine function
qplot(c(-10, 10), stat = "function", fun = sin, geom = "line")
Bibliography
Leland Wilkinson, The Grammar of Graphics (Statistics and Computing), Springer, 2005
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Use R!, Springer, 2009
Resources
Documentation page for ggplot2 [1]
References
[1] http://docs.ggplot2.org/current/
Publication quality output
Formatting numbers
You can use the format() function to control the number of digits and other characteristics of a displayed object.
> df <- data.frame(x = rnorm(10), y = rnorm(10))
> print(df)
x y
1 -0.4350953 -0.6426477
2 -0.5947293 -0.2389625
3 -0.7061850 -2.4382016
4 -0.3384038 -0.6322842
5 0.2713353 0.5396409
6 -1.1144711 -2.0321274
7 -1.0356184 1.7217443
8 -2.6665278 -0.3621377
9 0.2975570 0.1598905
10 1.4631458 -0.7995652
> print(format(df, digits=3, scientific=T))
x y
1 -4.35e-01 -6.43e-01
2 -5.95e-01 -2.39e-01
3 -7.06e-01 -2.44e+00
4 -3.38e-01 -6.32e-01
5 2.71e-01 5.40e-01
6 -1.11e+00 -2.03e+00
7 -1.04e+00 1.72e+00
8 -2.67e+00 -3.62e-01
9 2.98e-01 1.60e-01
10 1.46e+00 -8.00e-01
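Besides format(), base R offers round(), signif(), sprintf() and formatC() for controlling how numbers are displayed; a quick base-R sketch:

```r
# round() controls decimal places, signif() significant digits
round(3.14159, 2)                  # 3.14
signif(123456, 2)                  # 120000
# sprintf() uses C-style format strings and returns a character string
sprintf("%.3f", pi)                # "3.142"
# formatC() can pad with leading zeros
formatC(7, width = 3, flag = "0")  # "007"
```

Unlike format(), sprintf() and formatC() always return character vectors, which is convenient when building labels or report text.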
Sweave
Sweave [1] is a literate programming tool which integrates LaTeX and R code. The Sweave file generates a LaTeX file and an R file which can in turn be compiled. Roger Koenker [2], Meredith and Racine (2009) [3] and Charles Geyer [4] argue that Sweave favors reproducible econometric and statistical research.
There are some alternatives to Sweave for literate programming. One of them is Babel, which is included in Emacs Org-mode [5]. This tool allows export to LaTeX and HTML. It is also possible to include code chunks from various programming languages (R, Ruby, etc.).
Syntax
The main idea is that you write a file which includes LaTeX and R code. LaTeX code begins with @ and R code
with <<>>= (some options can be included between << and >>).
@
% Some LaTeX code
\section{Results}
I show that ...
<<>>=
# Some R code
qnorm(.975)
@
% Some LaTeX code
$$
\Phi^{-1}(.975) = 1.96
$$
The file is stored with extension .Rnw or .rnw. At the end, you extract from this file an R file using Stangle()
and a LaTeX file using Sweave(). Here is an example with a file called file.Rnw which generates file.tex
and file.R
> Sweave("file.Rnw")
Writing to file file.tex
Processing code chunks ...
1 : echo keep.source term verbatim pdf
2 : echo keep.source term verbatim pdf
> Stangle("file.Rnw")
Writing to file file.R
Then you can run LaTeX on your file.tex. This can be done using the system() function or texi2dvi().
# Example under Windows :
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
system("open file.pdf") # opens the pdf
Note that you may need to download Sweave.sty from the internet since it is not part of the standard MikTeX
distribution.
You can also include R results inline in your text using \Sexpr{}.
$
\Phi^{-1}(.975) = \Sexpr{qnorm(.975)}
$
Options
There are some options. These options can be included for each code chunk or in the Sweave command.
For figures, you can either include them in the tex file using fig=T or not include them using fig=F.
By default, figures are exported as pdf and eps files. If you only want one format suppress the other one with
pdf=F or eps=F option.
The R code can be displayed in the tex file using echo=T. If you don't want to include it in the tex file, use
echo=F.
The R code can be evaluated using eval=T. If you don't want to evaluate the R code, use eval=F.
The results :
results=tex treats the output as LaTeX code
results=verbatim treats the output as Verbatim (the default)
results=hide does not include the results in the LaTeX output
These options can be passed to the Sweave() function.
Sweave("file.Rnw", pdf = T, eps=F, echo = F, results = "verbatim")
They can also be passed to each code chunk.
<<fig=T,pdf=T,eps=F>>=
plot(rnorm(100), col = "red")
@
Text editor for Sweave
The main issue with Sweave is that few text editors include syntax highlighting for Sweave. Here are some exceptions:
RStudio is a very good solution. It is easy to install and use, and it includes buttons to run Sweave files.
Vim provides syntax highlighting for Sweave files (the "rnoweb" syntax).
Emacs + ESS (Emacs Speaks Statistics) provides full support for Sweave files. It includes a keyboard shortcut to run Sweave files and syntax highlighting that switches between LaTeX and R.
The Eclipse StatET plugin supports Sweave (LaTeX/R) documents with all basic features (syntax highlighting, bracket matching, toggle comment, ...) and detection of R chunks.
Export to LaTeX
R has many functions for exporting results to LaTeX [6].
General functions
toLatex() in the utils package.
Note that toLatex() does not handle matrices.
toLatex() has been adapted to handle matrices and ftables in the memisc package.
> toLatex(sessionInfo())
\begin{itemize}
\item R version 2.2.0, 2005-10-06, \verb|powerpc-apple-darwin7.9.0|
\item Base packages: base, datasets, grDevices,
graphics, methods, stats, utils
\end{itemize}
mat2tex() (sfsmisc) exports a matrix to LaTeX.
tex.table() (cwhmisc) exports a data frame into a LaTeX table.
> tex.table(mydat)
\begin{table}[ht]
\begin{center}
\begin{footnotesize}
\begin{tabular}{r|rrr}
\hline
& y & x1 & x2\\ \hline
1 & -0.09 & -0.37 & -1.04\\
2 & 0.31 & 0.19 & -0.09\\
3 & 3.78 & 0.58 & 0.62\\
4 & 2.09 & 1.40 & -0.95\\
5 & -0.18 & -0.73 & -0.54\\
6 & 3.16 & 1.30 & 0.58\\
7 & 2.78 & 0.34 & 0.77\\
8 & 2.59 & 1.04 & 0.46\\
9 & -1.96 & 0.92 & -0.89\\
10 & 0.91 & 0.72 & -1.1\\
\hline
\end{tabular}
\end{footnotesize}
\end{center}
\end{table}
xtable() (xtable) exports various objects, including tables, data frames, lm, aov, and anova, to LaTeX.
> # lm example
> library(xtable)
> x <- rnorm(100)
> y <- 2*x + rnorm(100)
> lin <- lm(y~x)
> xtable(lin)
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 21:54:04 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
\hline
& Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\
\hline
(Intercept) & -0.0407 & 0.0984 & -0.41 & 0.6803 \\
x & 2.0466 & 0.1043 & 19.63 & 0.0000 \\
\hline
\end{tabular}
\end{center}
\end{table}
> # table example
> x <- sample(1:10, 30, replace = T)
> tab <- table(x)
> tab <- cbind(tab, prop.table(tab))
> colnames(tab) <- c("N.", "Prop.")
> xtable(tab, digits = c(0, 0, 2))
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 22:06:36 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrr}
\hline
& N. & Prop. \\
\hline
1 & 5 & 0.17 \\
3 & 1 & 0.03 \\
4 & 3 & 0.10 \\
5 & 6 & 0.20 \\
6 & 5 & 0.17 \\
7 & 3 & 0.10 \\
8 & 2 & 0.07 \\
9 & 2 & 0.07 \\
10 & 3 & 0.10 \\
\hline
\end{tabular}
\end{center}
\end{table}
See also:
The highlight package by Romain François exports R code to LaTeX and HTML.
format.df() and latex() in the Hmisc package.
The memisc and quantreg packages include other latex() functions.
Descriptive statistics
The estout package.
The reporttools package includes functions for tables of descriptive statistics [7].
Estimation results
The stargazer package provides an easy way to export the results of regressions to LaTeX [8].
The texreg package provides the same kind of features [9].
The estout package provides functions similar to Stata's esttab and estout utilities
[10]
. Estimates are
stored using eststo() and printed using esttab(). They can be exported to CSV and LaTeX. These
functions support lm, glm and plm objects (see plm package).
apsrtable() (apsrtable) exports the results of multiple regression to LaTeX in a way similar to the American
Political Science Review publication standard.
The xtable() function (xtable package) exports data frames, matrices and estimation results [11]. xtable() can also be used to export the results to an HTML file.
The outreg() function [12][13] developed by Paul Johnson is similar to Stata's outreg [14] command. See the "R you ready?" post [15] on this topic.
mtable() and toLatex() in the memisc package.
N <- 10^3
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
y <- 1 + x1 + x2 + u
lm1 <- lm(y ~ x1 + x2 )
lm2 <- lm(y ~ x1 + x2 + I(x1*x2))
library(estout)
estclear() # clear all the eststo objects
eststo(lm1)
eststo(lm2)
esttab() # print it
library("apsrtable")
apsrtable(lm1,lm2)
library(xtable)
xtable(lm1)
tab <- xtable(lm1)
print(tab,type="html")
source("http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R")
outreg(list(lm1,lm2))
library("memisc")
toLatex(mtable(lm1,lm2))
Export to HTML
rpublisher [16] is a literate programming tool which publishes results in HTML (it is based on Python and was last updated in 2008).
See R2HTML, xtable, hwriter, prettyR, highlight, HTMLUtils
wiki.table() in the hacks package exports a matrix or a data frame into MediaWiki [17] table markup (as used on this wiki and many others).
> wiki.table(matrix(1:16,4),caption="Test")
{|
|+ Test
| 1 || 5 || 9 || 13
|-
| 2 || 6 || 10 || 14
|-
| 3 || 7 || 11 || 15
|-
| 4 || 8 || 12 || 16
|}
References
[1] The Sweave Homepage: http://www.stat.uni-muenchen.de/~leisch/Sweave/
[2] http://www.econ.uiuc.edu/~roger/repro.html
[3] Meredith, E. and J.S. Racine (2009), "Towards Reproducible Econometric Research: The Sweave Framework", Journal of Applied Econometrics, Volume 24, pp. 366-374.
[4] Charles Geyer, "Why Reproducible Research is the Right Thing", http://www.stat.umn.edu/~charlie/Sweave/
[5] Babel in Emacs Org-mode: http://orgmode.org/worg/org-contrib/babel/intro.html
[6] See the LaTeX Wikibook if you want to learn about LaTeX.
[7] reporttools: R Functions to Generate LaTeX Tables of Descriptive Statistics (http://www.jstatsoft.org/v31/c01)
[8] http://www.r-statistics.com/2013/01/stargazer-package-for-beautiful-latex-tables-from-r-statistical-models-output/
[9] http://www.r-bloggers.com/texreg-a-package-for-beautiful-and-easily-customizable-latex-regression-tables-from-r/
[10] estout: http://repec.org/bocode/e/estout/
[11] xtable on the dataninja blog (http://archive.is/20121225194652/dataninja.wordpress.com/2006/02/11/getting-tables-from-r-output/)
[12] http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R
[13] The outreg() function: http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R
[14] Stata outreg: http://ideas.repec.org/c/boc/bocode/s375201.html
[15] http://ryouready.wordpress.com/2009/06/19/r-function-to-create-tables-in-latex-or-lyx-to-display-regression-models-results/
[16] rpublisher: http://code.google.com/p/rpublisher/
[17] http://www.mediawiki.org
Descriptive Statistics
In this section, we present descriptive statistics, i.e., a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.
Generic Functions
We introduce some functions to describe a dataset.
names() gives the names of each variable
str() gives the structure of the dataset
summary() gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.
> summary(mydat)
describe() (Hmisc package) gives more details than summary()
> library("Hmisc")
> describe(mydat)
contents() (Hmisc package)
dims() in the Zelig package.
descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for
factors and length for character vectors.
whatis() (YaleToolkit) gives a good description of a dataset.
describe() in the psych package also provides summary statistics:
> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
var n mean sd median trimmed mad min max range skew kurtosis se
sample.data 1 100 0.37 3.21 0.00 0.07 0.31 -9.02 24.84 33.86 4.79 36.91 0.32
x 2 100 0.54 0.28 0.56 0.55 0.35 0.02 1.00 0.98 -0.12 -1.13 0.03
z 3 100 0.12 6.28 0.02 -0.01 1.14 -30.40 37.93 68.33 1.49 22.33 0.63
y 4 100 -0.01 1.07 0.09 -0.02 1.12 -2.81 2.35 5.16 0.00 -0.30 0.11
Univariate analysis
Continuous variable
Moments
mean() computes the mean.
var() computes the variance.
sd() computes the standard deviation.
skewness() computes the skewness (fUtilities, moments or e1071).
kurtosis() computes the kurtosis (fUtilities, moments or e1071).
moment() and all.moments() in the moments package compute moments of any order.
> library(moments)
> x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # mean, variance, skewness and kurtosis
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the skewness
[1] 0.0626506
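As a sanity check on these definitions, note that var() divides by n - 1 (the unbiased estimator), while the second central moment divides by n; a base-R sketch:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
m <- mean(x)                                 # sample mean (here 5)
# var() uses the n - 1 denominator
all.equal(var(x), sum((x - m)^2) / (n - 1))  # TRUE
# the second central moment uses the n denominator
sum((x - m)^2) / n                           # 4
```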
Order statistics
range() returns the range of a vector (its minimum and maximum), min() the minimum and max() the maximum.
IQR() computes the interquartile range. median() computes the median and mad() the median absolute
deviation.
quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg package compute the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is large.
> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
0% 25% 50% 75% 100%
-3.07328999 -0.66800917 0.02010969 0.72620061 2.92897970
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
0.00 0.25 0.50 0.75 1.00
-3.07328999 -0.66901899 0.02157989 0.72378407 2.92897970
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
0% 25% 50% 75% 100%
-3.07328999 -0.66800917 0.02010969 0.72620061 2.92897970
attr(,"class")
[1] "kuantile"
Inequality Index
The Gini coefficient: Gini() (ineq) and gini() (reldist).
ineq() (ineq) computes all the inequality indexes.
> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) # Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694
Concentration index
> library(ineq)
> Herfindahl(x)
[1] 0.003091162
> Rosenbluth(x)
[1] 0.002141646
Poverty index
> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index
Plotting the distribution
We can plot the distribution using a box plot (boxplot()), a histogram (hist()), a kernel density estimate (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and the third quartiles.
A box plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum.
stripchart() and stem() are also available.
> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
> qqnorm(x)
> qqline(x, col="red") # does not create a plot, but adds a line to the existing one
Goodness of fit tests
Kolmogorov-Smirnov test:
The KS test is a one-sample goodness of fit test. The test statistic is simply the maximum of the absolute value of the difference between the empirical cumulative distribution function and the theoretical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and test whether it fits a Beta(2,2), a Beta(1,1) and a uniform distribution.
> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if it fits a uniform distribution (the Beta(1,1) is the uniform distribution)
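To see this definition at work, the KS statistic can be computed by hand as the largest gap between the empirical and the theoretical CDF, checking both sides of each jump of the ECDF, and compared with the value returned by ks.test() (a base-R sketch for a continuous sample without ties):

```r
set.seed(1)
y <- rbeta(1000, 2, 2)
n <- length(y)
Ft <- pbeta(sort(y), 2, 2)  # theoretical CDF evaluated at the sorted sample
# largest gap between the ECDF and the CDF, before and after each jump
D <- max((1:n) / n - Ft, Ft - (0:(n - 1)) / n)
all.equal(D, as.numeric(ks.test(y, "pbeta", 2, 2)$statistic))  # TRUE
```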
Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where the parameters are unknown. It is implemented as lillie.test() in the nortest package. shapiro.test() implements the Shapiro-Wilk normality test.
> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)
Lilliefors (Kolmogorov-Smirnov) normality test
data: x
D = 0.0955, p-value = 0.9982
> shapiro.test(x)
Shapiro-Wilk normality test
data: x
W = 0.9916, p-value = 0.7902
Anderson-Darling test:
> library("nortest")
> ad.test(x)
Anderson-Darling normality test
data: x
A = 0.2541, p-value = 0.7247
See also the ADGofTest package for another version of this test [1].
Shapiro-Francia normality test :
> sf.test(x)
Shapiro-Francia normality test
data: x
W = 0.9866, p-value = 0.9953
Pearson chi-square normality test :
> library("nortest")
> pearson.test(x)
Pearson chi-square normality test
data: x
P = 0.8, p-value = 0.8495
Cramer-von Mises normality test
> cvm.test(x)
Cramer-von Mises normality test
data: x
W = 0.0182, p-value = 0.9756
Jarque-Bera test (jarque.bera.test() in the tseries package):
> library("tseries")
> jarque.bera.test(x)
Jarque Bera Test
data: x
X-squared = 0.6245, df = 2, p-value = 0.7318
Discrete variable
We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart
(pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot()
(lattice)).
freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.
> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x)
x
Frequency Percent
A 32 32
B 34 34
C 34 34
Total 100 100
Multivariate analysis
Continuous variables
Covariance : cov()
Pearson's linear correlation : cor().
Pearson's correlation test cor.test() performs the test.
Spearman's rank correlation :
cor() with method = "spearman".
spearman() (Hmisc)
Spearman's rank correlation test :
spearman2() (Hmisc)
spearman.test() (Hmisc)
spearman.test() (pspearman package) performs Spearman's rank correlation test with a precomputed exact null distribution for n <= 22.
Kendall's correlation : cor() with method = "kendall". See also the Kendall package.
> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")
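Spearman's rank correlation is simply Pearson's correlation applied to the ranks of the data, which can be verified directly (assuming no ties, as with continuous data):

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50)
# Spearman correlation = Pearson correlation of the ranks
all.equal(cor(x, y, method = "spearman"),
          cor(rank(x), rank(y)))  # TRUE
```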
Discrete variables
table(), xtabs() and prop.table() for contingency tables.
assocplot() and mosaicplot() for graphical display of contingency table.
CrossTable() (descr) is similar to SAS PROC FREQ. It returns a contingency table with chi-square and Fisher independence tests.
my.table.NA() and my.table.margin() (cwhmisc)
chisq.detail() (TeachingDemos)
Discrete and Continuous variables
bystats() Statistics by Categories in the Hmisc package
summaryBy() (doBy)
Multiple box plots : plot() or boxplot()
> N <- 100
> x <- sample(1:4,N, replace = T)
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean)
> bystats(y , as.factor(x), fun = quantile)
Equality of two sample means: t.test() and wilcox.test(). Equality of variances: var.test(). Equality of two distributions: ks.test().
N <- 100
x <- sample(0:1,N, replace = T)
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)
References
[1] Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http:/ / CRAN. R-project. org/
package=ADGofTest
Mathematics
Basics
?Arithmetic
?Special
Linear Algebra
Vectors
The inner product
The inner product is also called the dot product or the scalar product. It is the sum of the item-by-item product.
> u <- rep(3,3)
> v <- 1:3
> u%*%v # the inner product
[,1]
[1,] 18
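Since %*% always returns a matrix (here 1 x 1), the same inner product can also be obtained as a plain scalar; a base-R sketch:

```r
u <- rep(3, 3)
v <- 1:3
sum(u * v)       # 18: the inner product as a scalar
crossprod(u, v)  # 1 x 1 matrix, equivalent to t(u) %*% v
drop(u %*% v)    # drop() strips the matrix dimensions
```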
The outer product
The outer product of two vectors is the matrix whose (i, j) entry is the product of the i-th element of the first vector and the j-th element of the second. (It should not be confused with the three-dimensional cross product, which is a different operation.)
> v <- matrix(1:3,1,3)
> u <- matrix(1:3,3,1)
> u%*%v
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
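Base R also provides outer() (and its operator form %o%), which builds the same matrix without reshaping the vectors into matrices by hand:

```r
u <- 1:3
v <- 1:3
outer(u, v)  # same 3 x 3 matrix as u %*% t(v)
u %o% v      # operator form of outer()
```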
Matrix Algebra
If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of data,
the number of rows and/or columns and finally you can specify if you want R to read your vector by row or by
column (the default option) with byrow. You can also combine vectors using cbind() or rbind(). The
dimension of a matrix can be obtained using the dim() function or alternatively nrow() and ncol().
> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
> matrix(data = 1:25, nrow = 5, ncol = 5, byrow = T)
> v1 <- 1:5
> v2 <- 5:1
> X <- cbind(v1,v2)
> rbind(v1,v2)
> dim(X)
> nrow(X)
> ncol(X)
Some special matrices
The identity matrix has ones on the diagonal and zeros elsewhere:
eye() (matlab)
diag(1, nrow = 10, ncol = 10)
diag(rep(1, 10))
The J matrix is full of ones:
ones() (matlab)
A matrix full of zeros:
zeros() (matlab)
> library(matlab)
> eye(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> ones(3)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> zeros(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
Diagonal matrix
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
Upper triangular
> n <- 3
> round(upper.tri(matrix(1, n, n)))
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 0 0 1
[3,] 0 0 0
If you also need ones on the diagonal:
> round(upper.tri(matrix(1, 3, 3), diag = TRUE))
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 0 1 1
[3,] 0 0 1
Lower triangular
Same as upper triangular, but using lower.tri() instead.
Create a Hilbert matrix using hilbert() (fUtilities).
Matrix calculations
Compute a matrix multiplication with X %*% Y.
> b <- matrix(nrow = 2, ncol = 2, c(1, 2, 3, 4))
> a <- matrix(nrow = 2, ncol = 2, c(1, 0, 0, -1))
> a
[,1] [,2]
[1,] 1 0
[2,] 0 -1
> b
[,1] [,2]
[1,] 1 3
[2,] 2 4
> a%*%b
[,1] [,2]
[1,] 1 3
[2,] -2 -4
> b%*%a
[,1] [,2]
[1,] 1 -3
[2,] 2 -4
Compute the Kronecker product [1] using %x% or kron() (fUtilities).
> M <- matrix(rep(2,4),nrow = 2)
> M
[,1] [,2]
[1,] 2 2
[2,] 2 2
> I <- eye(2)
> I
[,1] [,2]
[1,] 1 0
[2,] 0 1
> I %x% M
[,1] [,2] [,3] [,4]
[1,] 2 2 0 0
[2,] 2 2 0 0
[3,] 0 0 2 2
[4,] 0 0 2 2
> library(fUtilities)
> kron(I,M)
[,1] [,2] [,3] [,4]
[1,] 2 2 0 0
[2,] 2 2 0 0
[3,] 0 0 2 2
[4,] 0 0 2 2
Matrix transposition
Transpose a matrix using t(). For example, with M <- cbind(c(1,0,1), c(0,1,2), c(0,0,1)):
> t(M)
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 2
[3,] 0 0 1
The trace and determinant of a matrix
Compute the trace of a matrix using tr() (fUtilities).
Compute the determinant using det().
Compute the rank of a matrix using rk() (fBasics).
Matrix inversion
Invert a matrix using solve() or inv() (fUtilities). We can also compute the generalized inverse using
ginv() in the MASS package.
> M <- cbind(c(1,0,1),c(0,1,2),c(0,0,1))
> solve(M)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] -1 -2 1
> solve(M)%*%M
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
Solving a linear equation
> m=matrix(nrow=2,ncol=2,c(1,-.8,1,.2))
> m
[,1] [,2]
[1,] 1.0 1.0
[2,] -0.8 0.2
>
> l=matrix(c(1.0+25.0/18,25.0/18.0))
> l
[,1]
[1,] 2.388889
[2,] 1.388889
>
> k=solve(m,l)
> k
[,1]
[1,] -0.9111111
[2,] 3.3000000
>
> m%*%k #checking the answer
[,1]
[1,] 2.388889
[2,] 1.388889
>
Eigenvalue, eigenvector and eigenspace
Eigenvalues and eigenvectors are computed with eigen(). Here M is the triangular matrix from the inversion example above:
> eigen(M)
$values
[1] 1 1 1
$vectors
[,1] [,2] [,3]
[1,] 0 2.220446e-16 0.000000e+00
[2,] 0 0.000000e+00 1.110223e-16
[3,] 1 -1.000000e+00 -1.000000e+00
Misc
Compute the norm of a matrix using norm() (fUtilities).
Check if a matrix is positive definite with isPositiveDefinite() (fUtilities).
Make a matrix positive definite with makePositiveDefinite() (fUtilities).
fUtilities also provides functions for row and column statistics.
Extract the upper and the lower part of a matrix with triang() and Triang() (fUtilities).
See also the matrix, matlab, matrixcalc, matrixStats packages.
Analysis
Logarithms and Exponents
Powers are written 10^3 or 10**3. exp() is the exponential function, log() the natural logarithm and log10() the base-10 logarithm.
> 10^3 # exponent
[1] 1000
> 10**3 # exponent
[1] 1000
> exp(1) # exponential
[1] 2.718282
> log(2.71) # natural logarithm
[1] 0.9969486
> log10(1000) # base 10 logarithm
[1] 3
> log(1000,base = 10) # base 10 logarithm
[1] 3
Polynomial equations
To solve the polynomial equation a0 + a1*x + ... + an*x^n = 0, where a0, ..., an are given numbers, pass the vector of coefficients in increasing order of powers:
> polyroot(c(a0, a1, ..., an))
So, for example, to calculate the roots of the equation 2x^2 - 5x - 3 = 0 one would do as follows:
> polyroot(c(-3,-5,2))
[1] -0.5+0i 3.0-0i
and the solutions can be read to be x = -0.5 and x = 3.
See also the polynom and multipol packages.
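Since polyroot() returns complex roots, a natural check is to evaluate the polynomial at each returned root; for the quadratic example above:

```r
roots <- polyroot(c(-3, -5, 2))      # coefficients in increasing order
f <- function(x) 2 * x^2 - 5 * x - 3
# both roots should evaluate to (numerically) zero
max(Mod(f(roots)))
```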
Derivatives
Symbolic calculations
R can compute the derivative of an expression. You need to convert your function to an expression using the expression() function; otherwise you get an error message.
Here are some examples:
> D(expression(x^n),"x")
x^(n - 1) * n
> D(expression(exp(a*x)),"x")
exp(a * x) * a
> D(expression(1/x),"x")
-(1/x^2)
> D(expression(x^3),"x")
3 * x^2
> D(expression(pnorm(x)),"x")
dnorm(x)
> D(expression(dnorm(x)),"x")
-(x * dnorm(x))
Numerical approximation
numDeriv package
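For example, grad() is the basic interface of the numDeriv package (assuming it is installed); here we check the numerical derivative of x^3 at x = 2 against the exact value 12:

```r
library(numDeriv)

f <- function(x) x^3
grad(f, 2)  # numerical approximation of f'(2) = 3 * 2^2 = 12
```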
Integration
R can perform one-dimensional integration. For example, we can integrate the density of the normal distribution
between -Inf and Inf, and over narrower intervals:
> integrate(dnorm,-Inf,Inf)
1 with absolute error < 9.4e-05
> integrate(dnorm,-1.96,1.96)
0.9500042 with absolute error < 1.0e-11
> integrate(dnorm,-1.64,1.64)
0.8989948 with absolute error < 6.8e-14
# we can also store the result in an object
> ci90 <- integrate(dnorm,-1.64,1.64)
> ci90$value
[1] 0.8989948
> integrate(dnorm,-1.64,1.64)$value
[1] 0.8989948
see the adapt package for multivariate integration.
> library(adapt)
> ?adapt
> ir2pi <- 1/sqrt(2*pi)
> fred <- function(z) { ir2pi^length(z) * exp(-0.5 * sum(z * z))}
>
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred)
value relerr minpts lenwrk ifail
1.039222 0.0007911264 231 73 0
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-4)
value relerr minpts lenwrk ifail
1.000237 1.653498e-05 655 143 0
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-6)
value relerr minpts lenwrk ifail
1.000039 3.22439e-07 1719 283 0
See also integrate.gh() in the ecoreg package.
Probability
The number of combinations of length k within n numbers :
> choose(100, 5)
[1] 75287520
Union and intersection
> union(1:10, 5:7)
[1] 1 2 3 4 5 6 7 8 9 10
> intersect(1:10, 5:7)
[1] 5 6 7
Arithmetic
The factorial function
factorial() returns the factorial of an integer. This can also be computed by applying prod() (the product function) to
the vector of integers between 1 and the number of interest.
> factorial(3)
[1] 6
> prod(1:3)
[1] 6
Note that by convention 0! = 1, and factorial(0) returns 1. This is not the case with prod().
> factorial(0)
[1] 1
> prod(0)
[1] 0
Factorials grow very quickly and cannot be computed for large values.
> factorial(170)
[1] 7.257416e+306
> factorial(171)
[1] Inf
Warning message:
In factorial(171) : value out of range in 'gammafn'
The modulo function and euclidean division
Modulo and integer division (i.e. euclidean division)
> 5%%2
[1] 1
> 5%/%2
[1] 2
Note that, because non-integer numbers cannot be represented exactly in binary floating point, modulo and euclidean division can give surprising results with non-integer arguments.
> .5%/%.1 # we get 4 instead of 5
[1] 4
> .5%%.1 # we get .1 instead of 0
[1] 0.1
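The cause is that decimal fractions such as .1 have no exact binary floating point representation, so results that look exact on paper can be off by one unit in the last place. A base R illustration:

```r
# the stored doubles for 0.1 and 0.2 are slightly above their decimal
# values, so their sum overshoots the stored double for 0.3
0.1 + 0.2 == 0.3               # FALSE
print(0.1 + 0.2, digits = 17)  # 0.30000000000000004
all.equal(0.1 + 0.2, 0.3)      # TRUE: compare with a tolerance instead of ==
```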
Geometry
pi the constant
cos(), sin(), tan() the trigonometric functions.
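For example (the results are exact only up to floating point error):

```r
cos(pi)    # -1
sin(pi/2)  # 1
tan(pi/4)  # 1
```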
Symbolic calculus
rSymPy (rsympy
[2]
) provides sympy (link
[3]
) functions in R.
If you want to do more symbolic calculus, see Maxima
[4]
, SAGE
[5]
, Mathematica
[6]
References
[1] http:/ / en. wikipedia. org/ wiki/ Kronecker_product
[2] http:/ / code. google. com/ p/ rsympy/
[3] http:/ / code. google. com/ p/ sympy/
[4] Maxima is open source http:/ / maxima. sourceforge.net/
[5] SAGE is an open source package which includes R and Maxima : http:/ / www. sagemath. org/
[6] Mathematica is not open source http:/ / www.wolfram.com/ products/ mathematica/ index. html
Optimization
optimize() is devoted to one-dimensional optimization problems.
optim(), nlm(), ucminf() (ucminf) can be used for multidimensional optimization problems.
nlminb() for constrained optimization.
quadprog, minqa, rgenoud, trust packages
Some work has been done to improve optimization in R. See Updating and improving optim(), Use R 2009 slides
[1][2]
,
the R-forge optimizer page
[3][4]
and the corresponding packages including optimx
[5]
.
Numerical Methods
One dimensional problem
The one dimensional problem :
> func <- function(x){
+ return ( (x-2)^2 )
+ }
> (func(-2))
[1] 16
>
> # plot your function using the 'curve function'
> curve(func,-4,8)
>
> # Here is another way to plot the function
> # using a grid
> grid <- seq(-10,10,by=.1)
> func(grid)
> plot(grid,func(grid))
>
> # you can find the minimum using the optimize function
> optimize(f=func,interval=c(-10,10))
$minimum
[1] 2
$objective
[1] 0
Newton-Raphson
nlm() provides a Newton algorithm.
maxLik package for maximization of a likelihood function. This package includes the Newton Raphson method.
newtonraphson() in the spuRs package.
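The idea behind these functions can be sketched in a few lines of base R. Here we find the root of f(x) = x^2 - 2 by iterating x <- x - f(x)/f'(x) (a toy illustration, not one of the packaged implementations):

```r
f  <- function(x) x^2 - 2  # root is sqrt(2)
fp <- function(x) 2 * x    # derivative of f
x <- 1                     # starting value
for (i in 1:10) x <- x - f(x) / fp(x)  # Newton-Raphson iterations
x                          # close to 1.414214
```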
BFGS
The BFGS method
> func <- function(x){
+ out <- (x[1]-2)^2 + (x[2]-1)^2
+ return(out)
+ }
> optim(par=c(0,0), fn=func, gr = NULL,
+ method = c("BFGS"),
+ lower = -Inf, upper = Inf,
+ control = list(), hessian = T)
> optim(par=c(0,0), fn=func, gr = NULL,
+ method = c("L-BFGS-B"),
+ lower = -Inf, upper = Inf,
+ control = list(), hessian = T)
Conjugate gradient method
optim() with method="CG".
Trust Region Method
"trust"
[6]
package for trust region method
The Nelder-Mead simplex method
The Nelder Mead Method
> func <- function(x){
+ out <- (x[1]-2)^2 + (x[2]-1)^2
+ return(out)
+ }
>
> optim(par=c(0,0), fn=func, gr = NULL,
+ method = c("Nelder-Mead"),
+ lower = -Inf, upper = Inf,
+ control = list(), hessian = T)
The boot package includes another simplex method
Simulation methods
Simulated Annealing
Simulated annealing is an algorithm which is useful for optimising non-smooth functions. It is
implemented in optim() as method = "SANN".
> func <- function(x){
+ out <- (x[1]-2)^2 + (x[2]-1)^2
+ return(out)
+ }
> optim(par=c(0,0), fn=func, gr = NULL,
+ method = c("SANN"),
+ lower = -Inf, upper = Inf,
+ control = list(), hessian = T)
Genetic Algorithm
rgenoud package for genetic algorithm
[7]
gaoptim package for genetic algorithm
[8]
References
Venables and Ripley, Chapter 16.
Cameron and Trivedi, Microeconometrics, chapter 10
Braun and Murdoch (Chapter 7)
[9]
is a very good reference on optimization using R.
[1] http:/ / www. agrocampus-ouest. fr/ math/ useR-2009/ slides/ Nash+ Varadhan. pdf
[2] Updating and improving optim(), Use R 2009 slides http:/ / www. agrocampus-ouest. fr/ math/ useR-2009/ slides/ Nash+ Varadhan. pdf
[3] http:/ / optimizer.r-forge. r-project.org/
[4] R-forge optimizer http:/ / optimizer. r-forge. r-project. org/
[5] http:/ / r-forge.r-project. org/ R/ ?group_id=395
[6] http:/ / www. stat. umn. edu/ geyer/ trust/
[7] Jasjeet Sekhon homepage : http:/ / sekhon. berkeley.edu/ rgenoud/
[8] gaoptim on CRAN: http:/ / cran. r-project.org/ web/ packages/ gaoptim/ index. html
[9] A first course in statistical programming with R http:/ / portal. acm. org/ citation. cfm?id=1385416
Probability Distributions
This page reviews the main probability distributions and describes the main R functions used to deal with them.
R has lots of probability functions.
r is the generic prefix for random variable generator such as runif(), rnorm().
d is the generic prefix for the probability density function such as dunif(), dnorm().
p is the generic prefix for the cumulative distribution function such as punif(), pnorm().
q is the generic prefix for the quantile function such as qunif(), qnorm().
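For the normal distribution, for instance, the four prefixes give:

```r
dnorm(0)      # density at 0: 1/sqrt(2*pi), about 0.3989
pnorm(1.96)   # P(X <= 1.96), about 0.975
qnorm(0.975)  # quantile: about 1.96, inverting the line above
rnorm(3)      # three random draws
```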
Discrete distributions
Benford Distribution
The Benford Distribution
[1]
is the distribution of the first digit of a number. It is due to Benford 1938
[2]
and
Newcomb 1881
[3]
.
> library(VGAM)
> dbenf(c(1:9))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679 0.05799195 0.05115252 0.04575749
Bernoulli
We can draw from a Bernoulli distribution
[4]
using sample(), runif() or rbinom() with size = 1.
> n <- 1000
> x <- sample(c(0,1), n, replace=T)
> x <- sample(c(0,1), n, replace=T, prob=c(0.3,0.7))
> x <- runif(n) > 0.3
> x <- rbinom(n, size=1, prob=0.2)
Binomial
We can sample from a binomial distribution
[5]
using the rbinom() function with arguments n for the number of
samples to take, size defining the number of trials and prob defining the probability of success in each trial.
> x <- rbinom(n=100,size=10,prob=0.5)
Hypergeometric distribution
We can sample n times from a hypergeometric distribution
[6]
using the rhyper() function.
> x <- rhyper(n=1000, 15, 5, 5)
Geometric distribution
The geometric distribution
[7]
.
> N <- 10000
> x <- rgeom(N, .5)
> x <- rgeom(N, .01)
Multinomial
The multinomial distribution
[8]
.
> sample(1:6, 100, replace=T, prob= rep(1/6,6))
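Base R also provides rmultinom(), which returns counts rather than individual draws. For example, the counts from 100 rolls of a fair die:

```r
x <- rmultinom(n = 1, size = 100, prob = rep(1/6, 6))
x       # a 6 x 1 matrix of counts, one row per category
sum(x)  # the counts always sum to size = 100
```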
Negative binomial distribution
The negative binomial distribution
[9]
is the distribution of the number of failures before k successes in a series of
Bernoulli events.
> N <- 100000
> x <- rnbinom(N, 10, .25)
Poisson distribution
We can draw n values from a Poisson distribution
[10]
with a mean set by the argument lambda.
> x <- rpois(n=100, lambda=3)
Zipf's law
The distribution of the frequency of words is known as Zipf's Law
[11]
. It is also a good description of the
distribution of city sizes
[12]
. dzipf() and pzipf() (VGAM)
> library(VGAM)
> dzipf(x=2, N=1000, s=2)
Continuous distributions
Beta and Dirichlet distributions
beta distribution
[13]
Dirichlet
[14]
in gtools and MCMCpack
>library(gtools)
>?rdirichlet
>library(bayesm)
>?rdirichlet
>library(MCMCpack)
>?Dirichlet
Cauchy
We can sample n values from a Cauchy distribution
[15]
with a given location parameter (default is 0) and
scale parameter (default is 1) using the rcauchy() function.
> x <- rcauchy(n=100, location=0, scale=1)
Chi Square distribution
Quantile of the Chi square (χ²) distribution
[16]
> qchisq(.95,1)
[1] 3.841459
> qchisq(.95,10)
[1] 18.30704
> qchisq(.95,100)
[1] 124.3421
Exponential
We can sample n values from an exponential distribution
[17]
with a given rate (default is 1) using the rexp()
function
> x <- rexp(n=100, rate=1)
Fisher-Snedecor
We can draw the density of a Fisher distribution
[18]
(F-distribution) :
> par(mar=c(3,3,1,1))
> x <- seq(0,5,len=1000)
> plot(range(x),c(0,2),type="n")
> grid()
> lines(x,df(x,df1=1,df2=1),col="black",lwd=3)
> lines(x,df(x,df1=2,df2=1),col="blue",lwd=3)
> lines(x,df(x,df1=5,df2=2),col="green",lwd=3)
> lines(x,df(x,df1=100,df2=1),col="red",lwd=3)
> lines(x,df(x,df1=100,df2=100),col="grey",lwd=3)
> legend(2,1.5,legend=c("n1=1, n2=1","n1=2, n2=1","n1=5,
n2=2","n1=100, n2=1","n1=100,
n2=100"),col=c("black","blue","green","red","grey"),lwd=3,bty="n")
Gamma
We can sample n values from a gamma distribution
[19]
with a given shape parameter and scale parameter
using the rgamma() function. Alternatively a shape parameter and rate parameter can be given.
> x <- rgamma(n=100, scale=1, shape=0.4)
> x <- rgamma(n=100, scale=1, rate=0.8)
Levy
We can sample n values from a Levy distribution
[20]
with a given location parameter (defined by the argument
m, default is 0) and scaling parameter (given by the argument s, default is 1) using the rlevy() function.
> x <- rlevy(n=100, m=0, s=1)
Log-normal distribution
We can sample n values from a log-normal distribution
[21]
with a given meanlog (default is 0) and sdlog
(default is 1) using the rlnorm() function
> x <- rlnorm(n=100, meanlog=0, sdlog=1)
Normal and related distributions
We can sample n values from a normal
[22]
or Gaussian distribution with a given mean (default is 0) and sd
(default is 1) using the rnorm() function
> x <- rnorm(n=100, mean=0, sd=1)
Quantile of the normal distribution
> qnorm(.95)
[1] 1.644854
> qnorm(.975)
[1] 1.959964
> qnorm(.99)
[1] 2.326348
The mvtnorm package includes functions for multivariate normal distributions.
rmvnorm() generates a multivariate normal distribution.
> library(mvtnorm)
> sig <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
> r <- rmvnorm(1000, sigma = sig)
> cor(r)
[,1] [,2]
[1,] 1.0000000 0.8172368
[2,] 0.8172368 1.0000000
Pareto Distributions
Generalized Pareto
[23]
dgpd() in evd
dpareto(), ppareto(), rpareto(), qpareto() in actuar
The VGAM package also has functions for the Pareto distribution.
Student's t distribution
Quantile of the Student t distribution
[24]
> qt(.975,30)
[1] 2.042272
> qt(.975,100)
[1] 1.983972
> qt(.975,1000)
[1] 1.962339
The following lines plot the .975th quantile of the t distribution as a function of the degrees of freedom :
curve(qt(.975,x), from = 2 , to = 100, ylab = "Quantile 0.975 ", xlab = "Degrees of freedom", main = "Student t distribution")
abline(h=qnorm(.975), col = 2)
Uniform distribution
We can sample n values from a uniform distribution
[25]
(also known as a rectangular distribution) between two
values (defaults are 0 and 1) using the runif() function
> runif(n=100, min=0, max=1)
Weibull
We can sample n values from a Weibull distribution
[26]
with a given shape and scale parameter (default is
1) using the rweibull() function.
> x <- rweibull(n=100, shape=0.5, scale=1)
Extreme values and related distribution
The Gumbel distribution
[27]
The logistic distribution
[28]
: distribution of the difference of two gumbel distributions.
plogis, qlogis, dlogis, rlogis
Frechet dfrechet() evd
Generalized Extreme Value dgev() evd
Gumbel dgumbel() evd
Burr, dburr, pburr, qburr, rburr in actuar
Distribution in circular statistics
Functions for circular statistics are included in the CircStats package.
dvm() Von Mises
[29]
(also known as the circular normal or Tikhonov distribution) density function
dtri() triangular density
[30]
function
dmixedvm() Mixed Von Mises density
dwrpcauchy() wrapped Cauchy density
dwrpnorm() wrapped normal density.
References
[1] http:/ / en. wikipedia. org/ wiki/ Benford_distribution
[2] Benford, F. (1938) The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78, 551–572.
[3] Newcomb, S. (1881) Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4, 39–40.
[4] http:/ / en. wikipedia. org/ wiki/ Bernoulli_distribution
[5] http:/ / en. wikipedia. org/ wiki/ Binomial_distribution
[6] http:/ / en. wikipedia. org/ wiki/ Hypergeometric_distribution
[7] http:/ / en. wikipedia. org/ wiki/ Geometric_distribution
[8] http:/ / en. wikipedia. org/ wiki/ Multinomial_distribution
[9] http:/ / en. wikipedia. org/ wiki/ Negative_binomial_distribution
[10] http:/ / en. wikipedia. org/ wiki/ Poisson_distribution
[11] http:/ / en. wikipedia. org/ wiki/ Zipf%27s_Law
[12] Gabaix, Xavier (August 1999). "Zipf's Law for Cities: An Explanation". Quarterly Journal of Economics 114 (3): 739–767.
doi:10.1162/003355399556133. ISSN 0033-5533. http:/ / pages. stern. nyu. edu/ ~xgabaix/ papers/ zipf. pdf.
[13] http:/ / en. wikipedia. org/ wiki/ Beta_distribution
[14] http:/ / en. wikipedia. org/ wiki/ Dirichlet_distribution
[15] http:/ / en. wikipedia. org/ wiki/ Cauchy_distribution
[16] http:/ / en. wikipedia. org/ wiki/ Chi-square_distribution
[17] http:/ / en. wikipedia. org/ wiki/ Exponential_distribution
[18] http:/ / en. wikipedia. org/ wiki/ F-distribution
[19] http:/ / en. wikipedia. org/ wiki/ Gamma_distribution
[20] http:/ / en. wikipedia. org/ wiki/ Levy_distribution
[21] http:/ / en. wikipedia. org/ wiki/ Lognormal
[22] http:/ / en. wikipedia. org/ wiki/ Normal_distribution
[23] http:/ / en. wikipedia. org/ wiki/ Pareto_distribution
[24] http:/ / en. wikipedia. org/ wiki/ Student%27s_t-distribution
[25] http:/ / en. wikipedia. org/ wiki/ Uniform_distribution_(continuous)
[26] http:/ / en. wikipedia. org/ wiki/ Weibull_distribution
[27] http:/ / en. wikipedia. org/ wiki/ Gumbel_distribution
[28] http:/ / en. wikipedia. org/ wiki/ Logistic_distribution
[29] http:/ / en. wikipedia. org/ wiki/ Von_Mises_distribution
[30] http:/ / en. wikipedia. org/ wiki/ Triangular_distribution
Random Number Generation
Random Number Generators
To a very high degree computers are deterministic and therefore are not a reliable source of significant amounts of
random values. In general pseudo random number generators are used. The default algorithm in R is
Mersenne-Twister but a long list of methods is available. See the help of RNGkind() to learn about random number
generators.
?RNGkind
It is possible to use true random numbers. Some of them are collected on random.org (link
[1]
). The random (link
[2]
) package gives an access to them.
Randu
Randu is an old linear congruential pseudorandom number generator. There is a dataset generated with Randu in the
datasets package. The function which was used to generate the dataset is given on its help page.
library("datasets")
?randu
Seed
A pseudo random number generator is an algorithm based on a starting point called "seed". If you want to perform
an exact replication of your program, you have to specify the seed using the function set.seed(). The argument of
set.seed has to be an integer.
> set.seed(1)
> runif(1)
[1] 0.2655087
> set.seed(1)
> runif(1)
[1] 0.2655087
Sampling in a vector
Toss 10 coins
> sample(0:1,10,replace=T)
[1] 1 0 0 0 1 0 0 1 1 1
Roll 10 dice
> sample(1:6,10,replace=T)
[1] 4 1 5 3 2 5 5 6 3 2
Play the lottery (6 random numbers out of 49 without replacement)
> sample(1:49,6,replace=F)
[1] 18 35 29 1 33 11
You can sample in a multinomial distribution :
>mydat <- sample(1:4,1000,rep=TRUE,prob=c(.2,.3,.2,.3))
>table(mydat)
Sampling in a standard univariate distribution
You can use rnorm, rt, etc.
Misspecified argument
Note that if you pass a vector instead of a number as the argument of rnorm(), R takes by default the length of the vector
as the number of draws instead of returning an error. Here is an example:
x <- rnorm(10) # Sample a normal random vector
set.seed(1) # use the seed
z <- rnorm(x) # put a vector instead of a number as an argument of rnorm
set.seed(1) # initialize the seed again
z2 <- rnorm(length(x)) # sample in a vector with the same length as x
plot(z2,z) # check that z and z2 are the same
Inverse CDF method
If you know the inverse CDF (quantile function), you can generate the random variable by sampling from the
standard uniform distribution and transforming with the quantile function.
For instance, if you want to simulate from a standard normal distribution, you can simulate from a standard uniform
and transform it using the quantile function of the normal distribution.
N <- 100
qnorm(runif(N))
This gives the same results as the rnorm() function but the computing time is higher :
> N <- 10^7
> system.time(qnorm(runif(N)))
user system elapsed
1.67 0.00 1.70
> system.time(rnorm(N))
user system elapsed
1.50 0.00 1.51
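The method is most convenient when the quantile function has a closed form. For the exponential distribution the quantile function is -log(1 - p)/rate, so we can simulate draws without rexp() (a sketch; the sample mean should be close to the theoretical mean 1/rate):

```r
N <- 10^5
rate <- 2
u <- runif(N)            # standard uniform draws
x <- -log(1 - u) / rate  # quantile function of the exponential distribution
mean(x)                  # close to 1/rate = 0.5
```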
Importance sampling
See Jeff Gill's routine
[3]
Gibbs algorithm
rgs package (link
[4]
)
Quasi random numbers
quasi RNG
[5]
Sometimes you need to generate quasi random sequences
[6]
. The randtoolbox library provides several quasi
random number generators.
See also sHalton() and QUnif() (sfsmisc).
Halton Sequence
> library(randtoolbox)
> halton(10, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
[,1] [,2]
[1,] 0.5000 0.33333333
[2,] 0.2500 0.66666667
[3,] 0.7500 0.11111111
[4,] 0.1250 0.44444444
[5,] 0.6250 0.77777778
[6,] 0.3750 0.22222222
[7,] 0.8750 0.55555556
[8,] 0.0625 0.88888889
[9,] 0.5625 0.03703704
[10,] 0.3125 0.37037037
You can compare Halton draws with the standard R (pseudo) random number generator. Halton draws are much
more systematic.
>random <- cbind(runif(1000),runif(1000))
>halton <- halton(1000, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
>par(mfrow=c(2,2))
>plot(halton[,1],halton[,2])
>plot(random[,1],random[,2])
Examples
Resources
Revolution Computing entry on pseudo random numbers
[7]
Statistical Distributions Module Wessa.net
[8]
It is an online application which generates random numbers using R.
You can access the R code and use it in your own programs.
References
[1] http:/ / www. random.org/
[2] http:/ / dirk.eddelbuettel.com/ code/ random.html
[3] http:/ / artsci.wustl. edu/ ~jgill/ papers/ sir4. txt
[4] http:/ / code. google. com/ p/ rgs-package/
[5] http:/ / bm2. genes.nig. ac. jp/ RGM2/ R_current/ library/ randtoolbox/ man/ quasiRNG. html
[6] http:/ / mathworld. wolfram. com/ QuasirandomSequence. html
[7] http:/ / blog. revolution-computing. com/ 2009/ 02/ how-to-choose-a-random-number-in-r. html
[8] http:/ / www. wessa. net/ distributions. wasp
Maximum Likelihood
Introduction
Maximum likelihood estimation is just an optimization problem. You have to write down your log likelihood
function and use some optimization technique. Sometimes you also need to write your score (the first derivative of
the log likelihood) and/or the Hessian (the second derivative of the log likelihood).
One dimension
If there is only one parameter, we can optimize the log likelihood using optimize().
Example with a type 1 Pareto distribution
We provide an example with a type 1 Pareto distribution. Note that in this example we treat the minimum as known
and do not estimate it. Therefore this is a one-dimensional problem.
We use the rpareto1() (actuar) function to generate a random vector from a type 1 Pareto distribution with
shape equal to 1 and minimum value equal to 500. We use the dpareto1() (actuar) function with option log
= TRUE to write the log likelihood. Then we just need to use optimize() with maximum=TRUE. We provide a
minimum and a maximum value for the parameter with the interval option.
> library(actuar)
> y <- rpareto1(1000, shape = 1, min = 500)
> ll <- function(mu, x) {
+ sum(dpareto1(x,mu[1],min = min(x),log = TRUE))
+ }
> optimize(f = ll, x = y, interval = c(0,10), maximum = TRUE)
Multiple dimension
fitdistr() (MASS package) fits univariate distributions by maximum likelihood. It is a wrapper for
optim().
If you need to program yourself your maximum likelihood estimator (MLE) you have to use a built-in optimizer
such as nlm(), optim(). R also includes the following optimizers :
mle() in the stats4 package
The maxLik package
Example with a logistic distribution
For instance, we draw from a logistic distribution and estimate its parameters using fitdistr().
> # draw from a gumbel distribution using the inverse cdf simulation method
> e.1 <- -log(-log(runif(10000,0,1)))
> e.2 <- -log(-log(runif(10000,0,1)))
> u <- e.2 - e.1 # u follows a logistic distribution (difference between two gumbels.)
> fitdistr(u,densfun=dlogis,start=list(location=0,scale=1))
Example with a Cauchy distribution
For instance, we can write a simple maximum likelihood estimator for a Cauchy distribution using the nlm()
optimizer. We first draw a vector x from a Cauchy distribution. Then we define the log likelihood function and then
we optimize using the nlm() function. Note that nlm() is a minimizer, not a maximizer.
> n <- 100
> x <- rcauchy(n)
> mlog.1 <- function(mu, x) {
+ - sum(dcauchy(x, location = mu, log = TRUE))
+ }
> mu.start <- median(x)
> out <- nlm(mlog.1, mu.start, x = x)
Example with a beta distribution
Here is another example with the Beta distribution and the optim() function.
> y <- rbeta(1000,2,2)
> loglik <- function(mu, x) {
+ sum(-dbeta(x,mu[1],mu[2],log = TRUE))
+ }
>
> out <- optim(par = c(1,1), fn=loglik,x=y,method = "L-BFGS-B",lower=c(0,0))
Tests
Likelihood Ratio Test
lrtest() in the lmtest package
[1]
.
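lrtest() compares two nested fitted models. The same statistic can also be computed by hand with base R, which makes the logic explicit (a sketch, not the lmtest implementation):

```r
N <- 200
x <- rnorm(N)
y <- 1 + x + rnorm(N)
m0 <- glm(y ~ 1)  # restricted model
m1 <- glm(y ~ x)  # unrestricted model
lr <- as.numeric(2 * (logLik(m1) - logLik(m0)))  # likelihood ratio statistic
p <- pchisq(lr, df = 1, lower.tail = FALSE)      # asymptotic chi-square p-value
```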
Some Specific cases
gum.fit() (ismev package) provides MLE for a Gumbel distribution
Resources
Charles Geyer : Maximum Likelihood in R (pdf, 9 pages)
[2]
Marco Steenbergen Maximum Likelihood Programming in R (pdf, 7 pages)
[3]
References
[1] [1] Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3),
7-10. URL http:/ / CRAN. R-project. org/ doc/ Rnews/
[2] http:/ / www. stat. umn. edu/ geyer/ 5931/ mle/ mle.pdf
[3] http:/ / www. unc.edu/ ~monogan/ computing/ r/ MLE_in_R. pdf
Method of Moments
Package gmm implements the generalized method of moment and the generalized empirical likelihood.
First, it is possible to estimate a simple linear model or a simple linear model with instrumental variables using the
gmm() function. The GMM method is often used to estimate heteroskedastic instrumental variable models.
> # Simple linear model
> library(gmm)
> N <- 1000
> u <- rnorm(N)
> x <- 1 + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, x)
> # Simple linear model with instrumental variables.
> N <- 1000
> u <- rnorm(N)
> z <- rnorm(N)
> x <- 1 + z + u + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, z)
> summary(res)
Bayesian Methods
Introduction
R has lots of Bayesian estimation procedures, many more than Stata or SAS.
LearnBayes by Jim Albert
bayesm by Peter Rossi and Rob McCulloch
BaM by Jeff Gill
arm package by Jennifer Hill and Andrew Gelman.
MCMCpack package.
mcsm package by Christian Robert and George Casella.
umacs (link
[1]
) by Jouni Kerman and Andrew Gelman.
Interface with WinBugs
WinBugs/OpenBugs is a popular statistical package for MCMC techniques.
Andrew Gelman has some instructions for using R and WinBugs on his webpage
[2]
There is also an interface with JAGS.
Resources
Christian Robert and Jean Michel Marin The Bayesian Core (link
[3]
including slides and R scripts)
Jim Albert : Bayesian Computation with R, Use R!, Springer 2007.
Christian Robert and George Casella Introducing Monte Carlo Methods with R, Use R!, Springer 2009 (link
[4]
including the mcsm package).
Peter Rossi, Greg Allenby, Robert McCulloch : Bayesian Statistics and Marketing and "bayesm" package
[5]
.
CRAN task view for bayesian statistics (link
[6]
)
References
[1] http:/ / www. stat. columbia. edu/ ~kerman/ Research/ umacs. html
[2] http:/ / www. stat. columbia. edu/ ~gelman/ bugsR/
[3] http:/ / www. ceremade.dauphine. fr/ ~xian/ BCS/
[4] http:/ / www. ceremade.dauphine. fr/ ~xian/ books. html
[5] http:/ / faculty. chicagobooth. edu/ peter. rossi/ research/ bsm. html
[6] http:/ / cran. r-project. org/ web/ views/ Bayesian.html
Bootstrap
boot package includes functions from the book Bootstrap Methods and Their Applications by A. C. Davison and
D. V. Hinkley (1997, CUP)
bootstrap package.
Quick how-to
Do a bootstrap of some data for some function (here, the mean):
library(boot)
data <- rnorm(100)  # some example data
b <- boot(data, function(data, id) { mean(data[id]) }, 1000)
Use this to compute a 90% confidence interval:
boot.ci(b, .9, type="norm")
References
Instructions for the boot package: http:/ / www. statmethods. net/ advstats/ bootstrapping. html
Sample using the boot package: http:/ / druedin. com/ 2012/ 11/ 10/ bootstrapping-in-r/
Multiple Imputation
Multiple imputation of missing data generally includes two steps. First, an imputation step which results in multiple
complete datasets. Second, combining the results obtained by applying the chosen technique to each separate
dataset. The packages needed for these two steps are not necessarily the same, but can be.
"mi" package; see Andrew Gelman's blog entry on the "mi" package
[1]
"mice" package
[2]
.
Amelia
[3]
mitools
[4]
by Thomas Lumley
survey and Zelig have support for multiple imputated datasets.
References
[1] http:/ / www. stat. columbia. edu/ ~cook/ movabletype/ archives/ 2009/ 06/ multiple_imputa_4. html
[2] http:/ / www. multiple-imputation.com/
[3] http:/ / cran. r-project. org/ web/ packages/ Amelia/ index.html
[4] http:/ / cran. us. r-project.org/ web/ packages/ mitools/ index. html
Nonparametric Methods
This page deals with a set of non-parametric methods
[1]
including the estimation of the cumulative distribution
function (CDF), the estimation of the probability density function (PDF) with histograms and kernel methods, and the
estimation of flexible regression models such as local regressions and generalized additive models.
For an introduction to nonparametric methods you can have a look at the following books and handouts :
Nonparametric Econometrics: A Primer by Jeffrey S. Racine
[2]
.
Li and Racine's handbook, Nonparametric econometrics
[3]
.
Larry Wasserman All of Nonparametric Statistics
[4]
Empirical distribution function
The easiest way to estimate the empirical CDF uses the rank() and the length() functions.
ecdf() computes the empirical cumulative distribution function.
ecdf.ksCI() (sfsmisc) plots the empirical distribution function with confidence intervals.
> N <- 1000
> x <- rnorm(N)
> edf <- rank(x)/length(x)
> plot(x,edf)
> plot(ecdf(x),xlab = "x",ylab = "Distribution of x")
> grid()
> library("sfsmisc")
> ecdf.ksCI(x)
Density Estimation
Histogram
hist() is the standard function for drawing histograms. If you store the histogram as an object the estimated
parameters are returned in this object.
> x <- rnorm(1000)
> hist(x, probability = T) # The default uses Sturges method.
> # Sturges, H. A. (1926) The choice of a class interval.
> # Journal of the American Statistical Association 21, 65–66.
> hist(x, breaks = "Sturges", probability = T)
>
> # Freedman, D. and Diaconis, P. (1981) On the histogram as a density estimator: L_2 theory.
> # Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 453–476.
> # (n^1/3 * range)/(2 * IQR).
> hist(x, breaks = "FD", probability = T)
>
> # Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
> # ceiling[n^1/3 * range/(3.5 * s)].
> hist(x, breaks = "scott", probability = T)
>
> # Wand, M. P. (1995). Data-based choice of histogram binwidth.
> # The American Statistician, 51, 59–64.
> library("KernSmooth")
> h <- dpih(x)
> bins <- seq(min(x)-h, max(x)+h, by=h)
> hist(x, breaks=bins, probability = T)
It is also possible to choose the break points.
> x <- rnorm(1000)
> hist(x, breaks = seq(-4,4,.1))
n.bins() (car package) includes several methods to compute the number of bins for a histogram.
histogram() (lattice)
truehist() (MASS)
hist.scott() (MASS) plots a histogram with automatic bin width selection, using the Scott or
Freedman–Diaconis formulae.
histogram package.
Kernel Density Estimation
density() estimates the kernel density
[5]
of a vector.
Choose the bandwidth selection method with bw.
Check the sensitivity of the bandwidth choice using adjust. The default is one. It is good practice to look at
adjust=.5 and adjust=2.
> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = .5, kernel = "gaussian"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 2, kernel = "gaussian"), col = 3)
> legend("topright", legend = c("adjust = 1", "adjust = .5", "adjust = 2"), col = 1:3, lty = 1)
Choose the kernel function with kernel : "gaussian", "epanechnikov", "rectangular", "triangular",
"biweight", "cosine", "optcosine".
> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "epanechnikov"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "rectangular"), col = 3)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "triangular"), col = 4)
> legend("topright", legend = c("gaussian", "epanechnikov", "rectangular", "triangular"), col = 1:4, lty = 1)
tkdensity() (sfsmisc) is a nice function which allows you to dynamically choose the kernel and the bandwidth with
a handy graphical user interface. This is a good way to check the sensitivity of the bandwidth and/or kernel choice
on the density estimate.
> x <- rnorm(10^3)
> library("sfsmisc")
> tkdensity(x)
kde2d() (MASS) estimates a bivariate kernel density.
> N <- 1000
> x <- rnorm(N)
> y <- 1 + x^2 + rnorm(N)
> dd <- kde2d(y,x) # estimate the bivariate kernel
> contour(dd) # plot the bivariate density
> image(dd) # another plot the bivariate density
Examples
Local Regression
loess() is the standard function for local linear regression.
lowess() is similar to loess() but does not use the standard formula syntax y ~ x. It is the
ancestor of loess() (with different defaults!).
ksmooth() (stats) computes the Nadaraya–Watson kernel regression estimate.
locpoly() (KernSmooth package)
npreg() (np package)
locpol computes local polynomial estimators
locfit local regression, likelihood and density estimation
Examples
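A minimal loess() example on simulated data, comparing the local fit to the true curve:

```r
N <- 500
x <- runif(N, -2, 2)
y <- x^2 + rnorm(N, sd = 0.5)  # quadratic signal plus noise
fit <- loess(y ~ x)            # local polynomial regression
o <- order(x)
plot(x, y, col = "grey")
lines(x[o], fitted(fit)[o], col = "red", lwd = 2)  # estimated curve
lines(x[o], x[o]^2, col = "blue", lwd = 2)         # true curve
```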
Generalized additive semiparametric models (GAM)
gam() (gam)
gam() (mgcv)
> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> y <- 1 + x1^2 + x2^3 + u
>
> library(gam)
> g1 <- gam(y ~ x1 + x2 ) # Standard linear model
> par(mfrow=c(1,2))
> plot(g1, se = T)
>
> g1 <- gam(y ~ s(x1) + x2 ) # x1 is locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
>
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
>
> library(mgcv)
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
References
[1] http://en.wikipedia.org/wiki/Non-parametric_statistics
[2] Jeffrey S. Racine, Nonparametric Econometrics: A Primer. http://socserv.mcmaster.ca/racine/ECO0301.pdf and the R code examples at http://socserv.mcmaster.ca/racine/primer_code.zip
[3] Qi Li, Jeffrey S. Racine, Nonparametric Econometrics, Princeton University Press, 2007.
[4] Wasserman, Larry, "All of Nonparametric Statistics", Springer (2007) (ISBN: 0387251456)
[5] http://en.wikipedia.org/wiki/Kernel_Density_Estimation
Linear Models
Standard linear model
In this section we present estimation functions for the standard linear model estimated by ordinary least squares
(OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().
Fake data simulations
We first generate a fake dataset such that there is no hetereoskedasticity, no endogeneity and no correlation between
the error terms. Therefore the ordinary least square estimator is unbiased and efficient. We choose a model with two
variables and take all the coefficients equal to one.
> N <- 1000
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> y <- 1 + x1 + x2 + u
> df <- data.frame(y,x1,x2)
Least squares estimation
The standard function to estimate a simple linear model is lm().
lsfit() performs the least squares procedure but its output is not formatted in a convenient way.
ols() (Design) is another alternative.
We estimate the model using lm(). We store the results in fit and print the result using summary() which is
the standard function.
> fit <- lm(y ~ x1 + x2, data = df)
> summary(fit)
There are several alternatives for displaying the results.
display() in the arm package is one of them.
coefplot() (arm) graphs the estimated coefficients with confidence intervals. This is a good way to present
the results.
mtable() in the memisc package can display the results of a set of regressions in the same table.
> library("arm")
> display(fit)
> coefplot(fit)
The fitted model fit is a list of components. You can see their names by typing names(fit). We can also apply
extractor functions to fit.
We can get the estimated coefficients using fit$coeff or coef(fit).
> fit$coeff
(Intercept) x1 x2
1.2026522 0.8427403 1.5146775
> coef(fit)
(Intercept) x1 x2
1.2026522 0.8427403 1.5146775
> output <- summary(fit)
> coef(output)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1945847 0.2298888 5.196359 0.001258035
x1 0.6458170 0.3423214 1.886581 0.101182585
x2 0.6175165 0.2083628 2.963660 0.020995713
se.coef() (arm) returns the standard error of the estimated coefficients.
The vector of fitted values can be returned via fit$fitted, fitted(fit) or the predict() function. The
predict() function also returns standard error and confidence intervals for predictions.
> fit$fitted
> fitted(fit)
The vector of residuals:
> fit$resid
> residuals(fit)
The number of degrees of freedom:
> fit$df
Confidence intervals
We can get the confidence intervals using confint() or conf.intervals() in the alr3 package.
> confint(fit, level = .9)
5 % 95 %
(Intercept) -0.7263261 1.200079
x1 -0.5724022 1.909924
x2 0.6185011 2.475079
> confint(fit, level = .95)
2.5 % 97.5 %
(Intercept) -0.9652970 1.439050
x1 -0.8803353 2.217858
x2 0.3881923 2.705388
> confint(fit, level = .99)
0.5 % 99.5 %
(Intercept) -1.5422587 2.016012
x1 -1.6237963 2.961319
x2 -0.1678559 3.261436
> library(alr3)
> conf.intervals(fit)
2.5 % 97.5 %
(Intercept) -0.9652970 1.439050
x1 -0.8803353 2.217858
x2 0.3881923 2.705388
Tests
coeftest() (lmtest) performs the Student t test and z test on coefficients.
> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)
linear.hypothesis() (car) performs a finite sample F test on a linear hypothesis or an asymptotic Wald test
using chi-squared statistics.
> library("car")
> linear.hypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests Beta0 = Beta1 = Beta2 = 1
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests Beta0 = Beta1 = Beta2 = 0
> linear.hypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0
See also waldtest() (lmtest) for nested models.
Analysis of variance
We can also make an analysis of variance using anova().
> anova(fit)
Model Search and information criteria
> # Akaike Information Criteria
> AIC(fit)
[1] 26.72857
> # Bayesian Information Criteria
> AIC(fit,k=log(N))
[1] 27.93891
The stats4 package includes the AIC() and BIC() functions:
> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226
The step() function performs a model search using the Akaike Information Criterion.
> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> x3 <- rnorm(N)
> y <- 1+ x1 + x2 + u
> fit <- lm(y~x1+x2 + x3)
> step.fit <- step(fit)
Zelig
Least squares estimation is also supported in Zelig.
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> z.out <- zelig(y ~ x, model = "ls", data = mydat)
> x.out <- setx(z.out, x = 10)
> s.out <- sim(z.out, x.out)
> summary(s.out)
Bayesian estimation
MCMCregress() (MCMCpack)
BLR() (BLR)
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
>
> posterior <- MCMCregress(y ~ x, data = mydat)
> summary(posterior)
> plot(posterior)
Heteroskedasticity
See the lmtest and sandwich packages.
gls() (nlme) computes the generalized least squares estimator.
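For instance, heteroskedasticity-robust (White) standard errors can be obtained by combining coeftest() (lmtest) with a vcovHC() covariance matrix (sandwich). The sketch below reuses the fake dataset df simulated above:

```r
library("lmtest")
library("sandwich")
fit <- lm(y ~ x1 + x2, data = df)
# replace the usual covariance matrix with a heteroskedasticity-consistent one
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```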
See "Cluster-robust standard errors using R" (pdf) [1] by Mahmood Arai. He suggests two functions for
cluster-robust standard errors: clx() allows for one-way clustering and mclx() for two-way clustering. They can
be loaded with the command source("http://people.su.se/~ma/clmclx.R").
> N <- 10 # 10 people
> T <- 5 # 5 times
> id <- rep(1:N,T)
> f <- rep(rnorm(N),T) # f is individual specific
> u <- rnorm(N*T)
> x1 <- rnorm(N*T)
> x2 <- rnorm(N*T) + x1
> y <- 1 + x1 + x2 + f + u
> fit <- lm(y ~ x1 + x2 )
> source("http://people.su.se/~ma/clmclx.R")
> clx(fit, 1, id)
Robustness
Cook's distance
> library(car)
> cookd(fit)
1 2 3 4 5
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699
6 7 8 9 10
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871
Influence plot:
> influence.plot(fit)
Leverage plots:
> leverage.plot(fit, term.name = "x1")
> leverage.plot(fit, term.name = "x2")
Bonferroni's outlier test:
> outlier.test(fit)
max|rstudent| = 2.907674, degrees of freedom = 6,
unadjusted p = 0.02706231, Bonferroni p = 0.2706231
Observation: 3
See also outlier.t.test() in the alr3 package.
inf.index() in the alr3 package computes all the robustness statistics (Cook's distance, studentized residuals,
outlier test, etc.).
rlm() performs a robust estimation
See the UCLA example [2].
See also the robustbase package.
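As a minimal sketch of robust estimation, rlm() (MASS) can be compared with lm() on simulated data contaminated by outliers:

```r
library(MASS)
N <- 100
x <- rnorm(N)
y <- 1 + x + rnorm(N)
y[1:5] <- y[1:5] + 10          # contaminate a few observations
coef(lm(y ~ x))                # OLS is pulled towards the outliers
coef(rlm(y ~ x))               # the M-estimator downweights them
```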
Instrumental Variables
ivreg() in the AER package [3]
tsls() in the sem package.
It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.
Fake data simulations
We first simulate a fake data set with x correlated with u, z independent of u, and x correlated with z. Thus x is an
endogenous explanatory variable of y and z is a valid instrument for x.
> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N)
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and the instrument z
> y <- 1 + x + u
Two stage least squares
Then we estimate the model with OLS (lm()) and IV using z as an instrument for x.
> ols <- lm(y ~ x)
> summary(ols) # ols are biased
> library("AER")
> iv <- ivreg(y ~ x | z)
> summary(iv) # IV estimates are unbiased
> library("sem")
> iv2 <- tsls(y ~ x, instruments = ~ z)
> summary(iv2)
> library("gmm")
> iv3 <- gmm(y ~ x, z)
> summary(iv3)
We plot the results:
> plot(y ~ x, col = "gray")
> abline(a = 1,b = 1, lty = 1, col = 1, lwd = 2)
> abline(ols, lty = 2, col = 2 , lwd = 2)
> abline(iv, lty = 3, col = 3, lwd = 2)
> legend("topleft", legend = c("True values","OLS","IV"), col = 1:3, lwd = rep(2,3), lty = 1:3)
Panel Data
plm() (plm) implements the standard random effects, fixed effects and first differences methods [4]. It is similar
to Stata's xtreg command.
Note that plm outputs are not compatible with xtable() and mtable() for publication quality output.
The lme4 and gee packages implement random effects and multilevel models.
See also BayesPanel
Random effects model
To implement a random effects model [5] we generate a fake data set with 1000 individuals observed over 5 time periods.
> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T)
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])
We estimate the random effect model with the plm() function and the model = "random" option.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> # panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)
Fixed effects model
For a fixed effects model we generate a fake dataset and we correlate the fixed effects f with covariates :
> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) + long$f
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])
We first transform our data into a plm data frame using plm.data(). We estimate the fixed effects model using
plm() with model = "within" as an option. Then, we compare the estimates with the random effects model
and perform a Hausman test. Finally, we plot the density of the fixed effects.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> #panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))
Dynamic panel data
pgmm() (plm) implements the Arellano-Bond estimation procedure [6]. It is similar to xtabond2 in Stata [7].
Simultaneous equations model
For a simultaneous equations model the following packages are needed:
sem package
systemfit package
References
[1] http://people.su.se/~ma/clustering.pdf
[2] http://www.ats.ucla.edu/stat/r/dae/rreg.htm
[3] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
[4] Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http://www.jstatsoft.org/v27/i02/.
[5] http://en.wikipedia.org/wiki/Random_effects_model
[6] M. Arellano, S. Bond, "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations", The Review of Economic Studies, 1991.
[7] David Roodman, XTABOND2: Stata module to extend xtabond dynamic panel data estimator, http://ideas.repec.org/c/boc/bocode/s435901.html
External links
Visualization of regression coefficients (http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/)
Quantile Regression
Quantile regression is a very old method which has become popular only in recent years thanks to progress in
computing. One of the main researchers in this area is also an R practitioner and has developed a specific package for
quantile regression (quantreg) [1] [2].
In theory, quantile regressions are also linear and thus could have been included in the Linear Models page.
However, this is a very specific topic and we think it is worth devoting a specific page to it.
Simple quantile model
We simulate from a simple quantile model. We first generate a uniform error term u and a covariate x.
N <- 10^3
u <- runif(N)
x <- 1 + rnorm(N)
y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
We estimate the quantile model for some values of tau (the quantile) and plot the coefficients :
library("quantreg")
q1 <- rq(y ~ x, tau = seq(.1,.9,.1))
summary(q1)
plot(q1)
We then plot the scatterplot, the predicted values using a standard linear model and the predicted values using a
quantile linear model :
plot(x,y, col = "grey")
abline(lm(y ~ x), col = "red")
taus <- seq(.1,.9,.1)
for (i in 1:length(taus)){
abline(rq(y ~ x, tau = taus[i]), col = "blue")
}
grid()
We can also estimate the model for all quantiles at the same time:
q2 <- rq(y ~ x, tau = -1)
plot(q2, nrow = 2, ncol = 1)
Computing time
For large data sets it is better to use the "fn" or "pfn" method.
> N <- 10^5
> u <- runif(N)
> x <- 1 + rnorm(N)
> y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
> system.time(rq(y ~ x, tau = .5, method = "br"))
user system elapsed
1.48 0.00 1.48
> system.time(rq(y ~ x, tau = .5, method = "fn"))
user system elapsed
0.60 0.00 0.61
> system.time(rq(y ~ x, tau = .5, method = "pfn"))
user system elapsed
0.30 0.00 0.29
Resources
Koenker, Roger (2005) Quantile Regression, Cambridge University Press. ISBN 0-521-60827-9
References
[1] Roger Koenker (2010). quantreg: Quantile Regression. R package version 4.50. http://CRAN.R-project.org/package=quantreg
[2] Roger Koenker's personal webpage (http://www.econ.uiuc.edu/~roger/research/rq/rq.html)
Binomial Models
In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory
variables.
This kind of model can be analyzed using a linear probability model. However, a drawback of this model for the
parameter p of the Bernoulli distribution is that, unless restrictions are placed on p, the estimated coefficients can
imply probabilities outside the unit interval [0, 1]. For this reason, models such as the logit model or the probit
model are more commonly used. If you want to estimate a linear probability model, have a look at the Linear Models
page.
Logit model
The model takes the form P(y = 1 | x) = G(x'beta) with the inverse link function G(z) = exp(z) / (1 + exp(z)). It can
be estimated using maximum likelihood or using Bayesian methods.
Fake data simulations
> x <- 1 + rnorm(1000,1)
> xbeta <- -1 + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)
Maximum likelihood estimation
The standard way to estimate a logit model is the glm() function with family binomial and link logit.
lrm() (Design) is another implementation of the logistic regression model.
There is an implementation in the Zelig package [1].
In this example, we simulate a model with one continuous predictor and estimate this model using the glm()
function.
> res <- glm(y ~ x , family = binomial(link=logit))
> summary(res) # results
> confint(res) # confidence intervals
> names(res)
> exp(res$coefficients) # odds ratio
> exp(confint(res)) # confidence intervals for the odds ratios
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities
Zelig
The Zelig package makes it easy to compute all the quantities of interest.
We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we
estimate the model using zelig() with the model = "logit" option.
We then look at the predicted values of y at the means of x1 and x2.
Then we look at the predicted values when x1 = 0 and x2 = 0.
We also look at what happens when x1 changes from the 3rd to the 1st quartile.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1 + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
>
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest
> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)
> # What happens if x1 change from the 3rd quartile to the 1st quartile ?
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2))
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(mydat$x2))
> s.out2<-sim(z.out, x=x.high, x1=x.low)
> plot(s.out2)
ROC curves are available in the verification package.
Zelig has a rocplot() function.
See the UCLA Statistical Computing example [2].
Bayesian estimation
bayesglm() in the arm package.
MCMClogit() in the MCMCpack package for a Bayesian estimation of the logit model.
> # Data generating process
> x <- 1 + rnorm(1000,1)
> xbeta <- -1 + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
>
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)
> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)
Probit model
The probit model is a binary model in which we assume that the link function is the cumulative distribution function
of a normal distribution.
We simulate fake data. First, we draw two random variables x1 and x2 from any distributions (this does not matter).
Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and
we draw the binary variable y as a Bernoulli random variable.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1 + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
Maximum likelihood
We can use the glm() function with the family = binomial(link = probit) option or the probit()
function in the sampleSelection package, which is a wrapper around the former.
> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
>
> library("sampleSelection")
> res2 <- probit(y ~ x1 + x2, data = mydat)
> summary(res2)
Bayesian estimation
MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)
See Also
There is an example of a probit model in R on the UCLA statistical computing website [3].
Semi-Parametric models
The Klein and Spady estimator [4] is implemented in the np package [5] (see npindex() with the method =
"kleinspady" option).
References
[1] Kosuke Imai, Gary King, and Olivia Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
[2] http://www.ats.ucla.edu/stat/r/dae/logit.htm
[3] UCLA statistical computing probit example: http://www.ats.ucla.edu/stat/R/dae/probit.htm
[4] Klein, R. W. and R. H. Spady (1993), An efficient semiparametric estimator for binary response models, Econometrica, 61, 387-421.
[5] Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.
Multinomial Models
Multinomial Logit
mlogit package.
multinom() (nnet)
vglm() with the multinomial family (VGAM)
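A minimal multinom() sketch on fake data with three alternatives (note that the simulated errors below are normal rather than extreme-value, so this only illustrates the syntax, not the exact multinomial logit data generating process):

```r
library(nnet)
N <- 1000
x <- rnorm(N)
# latent utilities of three alternatives; the chosen one maximizes utility
u1 <- rnorm(N)
u2 <- x + rnorm(N)
u3 <- -x + rnorm(N)
y <- factor(apply(cbind(u1, u2, u3), 1, which.max))
fit <- multinom(y ~ x)
summary(fit)
```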
Conditional Logit
clogit() in the survival package
mclogit package.
Multinomial Probit
mprobit package [1]
MNP package [2] to fit a multinomial probit.
Multinomial ordered logit model
We consider a multinomial ordered logit model with unknown thresholds. First, we simulate fake data. We draw
the residuals from a logistic distribution. Then we draw an explanatory variable x and we define the latent
variable ys as a linear function of x. Note that we set the constant to 0 because the constant and the thresholds cannot
be identified simultaneously in this model, so we need to fix one of the parameters. Then, we define thresholds
(-1, 0, 1) and we define our observed variable y using the cut() function. Thus y is an ordered multinomial variable.
N <- 10000
u <- rlogis(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(y,x)
Maximum likelihood estimation
This model can be estimated by maximum likelihood using the polr() function in the MASS package. Since it is
not possible to identify both the constant and the thresholds, R assumes by default that the constant is
equal to 0.
library(MASS)
fit <- polr(y ~ x, method = "logistic", data = df)
summary(fit)
Bayesian estimation
bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered logit:
library("arm")
fit <- bayespolr(y ~ x, method = "logistic", data = df)
summary(fit)
Multinomial ordered probit model
We generate fake data by drawing an error term in normal distribution and cutting the latent variables in 4
categories.
N <- 1000
u <- rnorm(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(x,y)
Maximum likelihood estimation
The model can be fitted by maximum likelihood using the polr() function in the MASS package with the probit
method.
library(MASS)
fit <- polr(y ~ x, method = "probit", data = df)
summary(fit)
Bayesian estimation
bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered probit.
Rank Ordered Logit Model
This model was introduced in econometrics by Beggs, Cardell and Hausman in 1981 [3]. One application is the
Combes et al. paper explaining the ranking of candidates to become professors. It is also known as the Plackett-Luce
model in the biomedical literature or as the exploded logit model in marketing.
Conditionally Ordered Hierarchical Probit
The Conditionally Ordered Hierarchical Probit can be estimated using the anchors package developed by Gary
King and his coauthors [4].
References
[1] Harry Joe, Laing Wei Chou and Hongbin Zhang (2006). mprobit: Multivariate probit model for binary/ordinal response. R package version 0.9-2.
[2] http://imai.princeton.edu/software/MNP.html
[3] Beggs, S., Cardell, S., Hausman, J., 1981. Assessing the potential demand for electric cars. Journal of Econometrics 17 (1), 1-19 (September).
[4] Jonathan Wand, Gary King, Olivia Lau (2009). anchors: Software for Anchoring Vignette Data. Journal of Statistical Software, Forthcoming. URL http://www.jstatsoft.org/.
Tobit And Selection Models
Tobit (type 1 Tobit)
In this section, we look at simple tobit model where the outcome variable is observed only if it is above or below a
given threshold.
tobit() in the AER package
[1]
. This is a wrapper for survreg().
N <- 1000
u <- rnorm(N)
x <- - 1 + rnorm(N)
ystar <- 1 + x + u
y <- ystar*(ystar > 0)
hist(y)
ols <- lm(y ~ x)
summary(ols)
library(AER)
tobit <- tobit(y ~ x,left=0,right=Inf,dist = "gaussian")
Selection models (type 2 tobit or heckit)
In this section we look at endogenous selection process. The outcome y is observe only if d is equal to one with d a
binary variable which is correlated with the error term of y.
heckit() and selection() in sampleSelection
[2]
. The command is called heckit() in honor of James
Heckman
[3]
.
N <- 1000
u <- rnorm(N)
v <- rnorm(N)
x <- - 1 + rnorm(N)
z <- 1 + rnorm(N)
d <- (1 + x + z + u + v > 0)
ystar <- 1 + x + u
y <- ystar*(d == 1)
hist(y)
ols <- lm(y ~ x)
summary(ols)
library(sampleSelection)
heckit.ml <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "ml")
summary(heckit.ml)
heckit.2step <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "2step")
summary(heckit.2step)
Truncation
truncreg package
DTDA: "An R package for analyzing truncated data" (pdf) [4].
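A minimal truncreg() sketch on fake left-truncated data (only observations with y > 0 are kept; the point and direction arguments describe the truncation):

```r
library(truncreg)
N <- 10000
x <- rnorm(N)
ystar <- 1 + x + rnorm(N)
df <- data.frame(y = ystar, x)[ystar > 0, ]   # sample truncated at 0
fit <- truncreg(y ~ x, data = df, point = 0, direction = "left")
summary(fit)                                  # corrects the truncation bias of OLS
```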
References
[1] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
[2] Sample Selection Models in R: Package sampleSelection. http://www.jstatsoft.org/v27/i07
[3] James Heckman, "Sample selection bias as a specification error", Econometrica: Journal of the Econometric Society, 1979.
[4] http://www.agrocampus-ouest.fr/math/useR-2009/slides/Moreira+DeUnaAlvarez+Crujeiras.pdf
Count Data Models
The Poisson model
Fake data simulations
We assume that y follows a Poisson distribution with mean exp(1 + 1 * x). We store the data in the df data frame.
N <- 1000
x <- rnorm(N)
alpha <- c(1,1)
y <- rpois(N,exp(alpha[1] + alpha[2] * x))
df <- data.frame(x,y)
plot(x,y)
Maximum likelihood
We estimate this simple model using the glm() function with family = poisson as option.
fit <- glm(y ~ x, family = poisson, data = df)
summary(fit)
Bayesian estimation
The model can also be estimated using bayesian methods with the MCMCpoisson() function which is provided in
the MCMCpack.
library("MCMCpack")
posterior <- MCMCpoisson(y ~ x, data = df)
plot(posterior)
summary(posterior)
Overdispersion test
dispersiontest() (AER package) provides a test for equidispersion.
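Continuing the Poisson example above (df and the glm() fit as simulated earlier), a minimal sketch of the test:

```r
library(AER)
fit <- glm(y ~ x, family = poisson, data = df)
dispersiontest(fit)   # H0: equidispersion, against overdispersion
```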
Zero inflated model
See the zic package [1].
Bivariate poisson regression
bivpois package for bivariate poisson regression.
References
See the UCLA website for an example [2].
Zeileis, A., Kleiber, C. and Jackman, S. Regression Models for Count Data in R [3].
Replication files for Cameron and Trivedi's 1998 book [4] are provided in the AER package [5]. You can simply type ?CameronTrivedi1998 and you will find the source code.
[1] Markus Jochmann (2010). zic: Bayesian Inference for Zero-Inflated Count Models. R package version 0.5-3. http://CRAN.R-project.org/package=zic
[2] http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
[3] http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
[4] Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.
[5] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
Duration Analysis
Using R for Survival Analysis (pdf) [1]
See the survival package
bootkm() computes bootstrap Kaplan-Meier estimates (Hmisc package)
event.chart() draws a flexible event chart for time-to-event data (Hmisc package)
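A minimal Kaplan-Meier sketch with the survival package on simulated (partly censored) durations:

```r
library(survival)
N <- 200
time <- rexp(N, rate = 0.1)            # fake durations
status <- rbinom(N, 1, 0.8)            # 1 = event observed, 0 = censored
km <- survfit(Surv(time, status) ~ 1)  # Kaplan-Meier estimate
plot(km)                               # survival curve with confidence bands
```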
References
[1] http://www.math.unm.edu/~bedrick/PIBBS/Rsurv.pdf
Time Series
Introduction
In the following examples we will use the dataset Mpyr, which is included in the R package Ecdat. It can be loaded
and viewed in R with the following code.
#Installs the package Ecdat.
install.packages("Ecdat")
#Loads the package Ecdat.
library(Ecdat)
#Loads the dataset Mpyr.
data(Mpyr)
#Shows the dataset Mpyr.
Mpyr
Time Series:
Start = 1900
End = 1989
Frequency = 1
m p y r
1900 1.718774 2.092641 0.9030195 4.380000
1901 1.856318 2.086574 1.0131038 4.280000
1902 1.936512 2.120476 1.0114817 4.920000
Creating time-series objects
The function ts() is used to create time-series objects.
The function as.ts() coerces an object to a time-series.
The function is.ts() tests whether an object is a time-series.
Example:
> data.a<-seq(1,24,by=1)
> is.ts(data.a)
[1] FALSE
> ts(data.a, start=c(2005,1), frequency=12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005 1 2 3 4 5 6 7 8 9 10 11 12
2006 13 14 15 16 17 18 19 20 21 22 23 24
> data.b<-seq(1,24,by=1)
> is.ts(data.b)
[1] FALSE
> is.ts(as.ts(data.b))
[1] TRUE
Creating lagged and differenced variables
The function lag() creates a lagged variable.
The function diff() creates a differenced variable.
Example:
> data.a<-seq(1,12,by=1)
> ts.a<-ts(data.a, start=c(2005,1), frequency=4)
> lag.a<-lag(ts.a,k=1)
> diff.a<-diff(ts.a,lag=1,difference=1)
> ts.a
Qtr1 Qtr2 Qtr3 Qtr4
2005 1 2 3 4
2006 5 6 7 8
2007 9 10 11 12
> lag.a
Qtr1 Qtr2 Qtr3 Qtr4
2004 1
2005 2 3 4 5
2006 6 7 8 9
2007 10 11 12
> diff.a
Qtr1 Qtr2 Qtr3 Qtr4
2005 1 1 1
2006 1 1 1 1
2007 1 1 1 1
Plotting time-series objects
The function plot.ts() is used for plotting time-series objects.
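For instance, on a simulated monthly series:

```r
# simulated random walk turned into a monthly time-series object
ts.a <- ts(cumsum(rnorm(100)), start = c(2005, 1), frequency = 12)
plot.ts(ts.a)   # equivalent to plot(ts.a) for a ts object
```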
Fit Autoregressive Models to Time-series by OLS
In order to fit an autoregressive time series model to the data by ordinary least squares it is possible to use the
function ar.ols() which is part of the "stats" package.
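A minimal sketch on a simulated AR(1) series:

```r
set.seed(1)
e <- rnorm(500)
x <- filter(e, 0.7, method = "recursive")   # AR(1) with coefficient 0.7
fit <- ar.ols(x, order.max = 1, aic = FALSE)
fit$ar                                      # estimate should be close to 0.7
```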
Autocorrelation function
The function acf() computes (and by default plots) estimates of the autocovariance or autocorrelation function.
The function pacf() computes the partial autocorrelations. The function ccf() computes the cross-correlation or
cross-covariance of two univariate series. [1]
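For example, on a simulated AR(1) series:

```r
x <- arima.sim(model = list(ar = 0.7), n = 500)
acf(x)    # autocorrelations decay geometrically
pacf(x)   # only the first partial autocorrelation should be large
```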
Useful R-packages
fBasics, tis, zoo, tseries, xts, urca
References
[1] http://www.inside-r.org/r-doc/stats/acf
http://cran.r-project.org/web/views/TimeSeries.html
http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf
Factor Analysis
Introduction
Factor analysis is a set of techniques for reducing the dimensionality of data. The goal is to describe the dataset with
a smaller number of variables (i.e., underlying factors). Factor analysis was developed in the early part of the 20th
century by L. L. Thurstone and others. Correspondence analysis was originally developed by Jean-Paul Benzécri in
the 1960s and 1970s. Factor analysis is mainly used in marketing, sociology and psychology. It is also known as data
mining, multivariate data analysis or exploratory data analysis.
There are three main methods. Principal Component Analysis deals with continuous variables. Correspondence
Analysis deals with a contingency table (two qualitative variables) and Multiple correspondence analysis is a
generalization of the correspondence analysis with more than two qualitative variables. The major difference
between Factor Analysis and Principal Components Analysis is that in FA, only the variance which is common to
multiple variables is analysed, while in PCA, all of the variance is analysed. Factor Analysis is a difficult procedure
to use properly, and is often misapplied in the psychological literature. One of the major issues in FA (and PCA) is
the number of factors to extract from the data. Incorrect numbers of factors can cause difficulties with the
interpretation and analysis of the data.
There are a number of techniques which can be applied to assess how many factors to extract. The two most useful
are parallel analysis and the minimum average partial criterion. Parallel analysis works by simulating a matrix of the
same rank as the data and extracting eigenvalues from the simulated data set. The point at which the simulated
eigenvalues are greater than those of the data is the point at which the "correct" number of factors have been
extracted. The Minimum Average Partial criterion uses a different approach but can often be more accurate.
Simulation studies have established these two methods as the most accurate. Both of these methods are available in
the psych package via the fa.parallel() and VSS() functions.
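A minimal parallel-analysis sketch on fake data with two underlying factors:

```r
library(psych)
N <- 1000
f1 <- rnorm(N); f2 <- rnorm(N)
dat <- data.frame(x1 = f1 + rnorm(N), x2 = f1 + rnorm(N),
                  x3 = f2 + rnorm(N), x4 = f2 + rnorm(N))
fa.parallel(dat)   # compares observed eigenvalues with those of simulated data
```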
Another issue in factor analysis is which rotation (if any) to choose. Essentially, rotations transform the loadings so
that they are more easily interpretable. There are two major classes of rotations, orthogonal and oblique. Orthogonal
rotations assume that the factors are uncorrelated, while oblique rotations allow the factors to correlate (but do not
force this). Oblique rotations are recommended by some (e.g. MacCallum et al. 1999) because an orthogonal
solution can be obtained from an oblique rotation, but not vice versa.
One of the issues surrounding factor analysis is that there are an infinite number of rotations which explain the same
amount of variance, so it can be difficult to assess which model is correct. In response to such concerns, Structural
Equation Modelling (SEM), of which Confirmatory Factor Analysis (CFA) is a special case, was developed by
Jöreskog in the 1970s. The essential principle of SEM is that, given a model, it attempts to reproduce the covariance
matrix observed in the data. How well a model reproduces the data can be used as a test of that model. SEM is
implemented in R in the sem and lavaan packages, as well as the OpenMx package (which is not available on
CRAN).
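As an illustration, a minimal CFA in lavaan might look as follows. This is a sketch on simulated two-factor data, assuming lavaan has been installed from CRAN:

```r
library(lavaan)

# simulate data with two latent factors
set.seed(1)
N <- 500
f1 <- rnorm(N); f2 <- rnorm(N)
dat <- data.frame(x1 = f1 + rnorm(N), x2 = f1 + rnorm(N),
                  x3 = f2 + rnorm(N), x4 = f2 + rnorm(N))

# lavaan model syntax: '=~' reads "is measured by"
model <- '
  f1 =~ x1 + x2
  f2 =~ x3 + x4
'
fit <- cfa(model, data = dat)
summary(fit, fit.measures = TRUE)   # loadings plus chi-square, CFI, RMSEA, ...
```

summary() reports the estimated loadings along with global fit measures, which is how the model's ability to reproduce the observed covariance matrix is judged.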
See the following packages: FactoMineR (website [1]), amap, ade4, anacor, vegan, psych.
Principal Component Analysis (PCA)
PCA deals with continuous variables
prcomp() in the stats package.
princomp() in the stats package.
PCA() (FactoMineR)
See also factanal() in the stats package.
See also fa() and principal() in the psych package.
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
x1 <- rnorm(N) + factor1
x2 <- rnorm(N) + factor1
x3 <- rnorm(N) + factor2
x4 <- rnorm(N) + factor2
mydat <- data.frame(x1,x2,x3,x4)
pca <- prcomp(mydat)
names(pca)
plot(pca) # plot the eigenvalues
biplot(pca) # A two dimensional plot
pca2 <- princomp(mydat)
biplot(pca2)
pca2 <- princomp(~ x1 + x2 + x3 + x4, data = mydat) # princomp with a formula syntax
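To inspect how much variance each component captures, summary() and the rotation matrix are usually the first stops. A self-contained sketch with data simulated as above (scale. = TRUE standardizes the variables first, which is usually advisable when they are measured on different scales):

```r
set.seed(1)
N <- 1000
factor1 <- rnorm(N); factor2 <- rnorm(N)
mydat <- data.frame(x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
                    x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2)

pca <- prcomp(mydat, scale. = TRUE)  # standardize the variables first
summary(pca)       # standard deviation and proportion of variance per component
pca$rotation       # loadings of each variable on each component
head(pca$x)        # component scores for the first observations
```

With two underlying factors driving four variables, the first two components should account for most of the variance here.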
Correspondence Analysis (CA)
Correspondence analysis is a tool for analyzing contingency tables.
corresp() in the MASS package
Michael Greenacre's ca package (JSS article [2])
Correspondence Analysis and Related Methods network (link [3])
Quick-R's page (link [4])
Simple and Canonical Correspondence Analysis Using the R Package anacor (pdf [5], JSS article [6])
the multiv package
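As a quick sketch of corresp(), using the caith data set of Scottish hair and eye colour counts that ships with MASS (this mirrors the example in the corresp() documentation):

```r
library(MASS)

data(caith)   # Scottish hair and eye colour counts
caith         # rows: eye colour; columns: hair colour

ca <- corresp(caith, nf = 2)   # two-dimensional correspondence analysis
ca$cor                         # canonical correlations of the two dimensions
biplot(ca)                     # joint plot of row and column scores
```

The biplot places row and column categories in the same space, so hair and eye colours that co-occur often appear close together.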
References
[1] http://factominer.free.fr/index.html
[2] http://www.jstatsoft.org/v20/i03/paper
[3] http://www.carme-n.org/
[4] http://www.statmethods.net/advstats/ca.html
[5] http://cran.r-project.org/web/packages/anacor/vignettes/anacor.pdf
[6] http://www.jstatsoft.org/v31/i05
Ordination
Overview
This page provides basic code for creating a distance matrix [1] and for running and plotting a Non-metric
Multidimensional Scaling [2] (NMDS) ordination. Read more about Ordination [3] on Wikipedia.
This code relies on the vegan package for R by Jari Oksanen [4].
Data
First, import data and load required libraries:
require(MASS)
require(vegan)
data(varespec) # species data
data(varechem) # environmental data
Distance matrix
bray <- vegdist(varespec, method = "bray") # calculate a distance matrix
# There are many distance measure options for vegdist(),
# discoverable by running '?vegdist'. Common distance measures include:
# 'bray' = Bray-Curtis
# 'canberra' = Canberra
# 'euclidean' = Euclidean
Unconstrained Ordination
Displaying dissimilarity using NMDS
NMDS analysis and plotting:
nmds <- metaMDS(varespec, k = 2,
distance = 'bray', autotransform = FALSE) # semi-black box NMDS function
ordiplot(nmds, type = "text") # Plot NMDS ordination
fit <- envfit(nmds, varechem[ ,1:4]) # Calculates environmental vectors
fit # Lists vector endpoint coordinates and r-squared values
plot(fit) # adds environmental vectors
# a linear representation of environmental variables is not always appropriate
# we could also add a smooth surface of the variable to the plot
ordisurf(nmds, varechem$N, add = TRUE, col = "darkgreen")
nmds$stress # stress value
[Figure: resulting NMDS plot]
In the metaMDS function, k is user-defined and sets the number of dimensions the ordination is constrained to.
Conventional wisdom suggests that stress should not exceed 10-12%. Stress is reduced by increasing the number
of dimensions; however, increasing dimensionality makes a 2-dimensional plot of the first two NMDS axes a less
faithful summary.
We can also run a nMDS with 3 dimensions, fit environmental vectors
and create a dynamic graph:
nmds3d <- metaMDS(varespec, k = 3,
distance = 'bray', autotransform = FALSE) # run nmds with 3 dimensions
nmds3d$stress # stress drops
fit3d <- envfit(nmds3d, varechem[ ,1:4], choices = 1:3) # fit environmental vectors to 3d space
ordirgl(nmds3d, envfit = fit3d) # dynamic 3D graph
Running a principal component analysis (PCA) on environmental data
chem_pca <- rda(varechem, scale = TRUE) # Run PCA
biplot(chem_pca, scaling = 2) # display biplot
[Figure: PCA biplot]
References
[1] http://en.wikipedia.org/wiki/Distance_matrix
[2] http://en.wikipedia.org/wiki/Multidimensional_scaling
[3] http://en.wikipedia.org/wiki/Ordination_(statistics)
[4] http://cc.oulu.fi/~jarioksa/
Clustering
Basic clustering
K-Means Clustering
You can use the kmeans() function.
First create some data (here, 100 observations of 10 variables):
> dat <- matrix(rnorm(1000), nrow=100, ncol=10)
To apply kmeans(), you need to specify the number of clusters:
> cl <- kmeans(dat, 3) # here 3 is the number of clusters
> table(cl$cluster)
1 2 3
38 44 18
Since the data are random, your cluster sizes will differ.
Hierarchical Clustering
The basic hierarchical clustering function is hclust(), which works on a dissimilarity structure as produced by the
dist() function:
> hc <- hclust(dist(dat)) # data matrix from the example above
> plot(hc)
The resulting tree can be cut using the cutree() function.
Cutting it at a given height:
> cl <- cutree(hc, h=5.1)
> table(cl)
cl
1 2 3 4 5
23 33 29 4 11
Cutting it to obtain given number of clusters:
> cl <- cutree(hc, k=5)
> table(cl)
cl
1 2 3 4 5
23 33 29 4 11
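hclust() supports several linkage criteria through its method argument, and the choice can change the tree considerably. A sketch on simulated data (method = "ward.D2" assumes a reasonably recent version of R; older versions spelled it "ward"):

```r
set.seed(7)
dat <- matrix(rnorm(1000), nrow = 100)  # 100 observations, 10 variables
d <- dist(dat)

hc_complete <- hclust(d)                       # default: complete linkage
hc_single   <- hclust(d, method = "single")    # nearest-neighbour linkage
hc_average  <- hclust(d, method = "average")   # UPGMA
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance criterion

table(cutree(hc_ward, k = 3))   # cluster sizes under Ward linkage
```

Single linkage tends to produce long "chained" clusters, while Ward's criterion favours compact clusters of similar size; it is worth plotting the dendrograms side by side before committing to one.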
Available alternatives
See the packages class, amap and cluster.
See the R bioinformatics page on clustering [1].
References
"The Elements of Statistical Learning" [2]
External links
Clustergram: visualization and diagnostics for cluster analysis [3]
[1] http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#CLUSTERBACK
[2] http://www-stat.stanford.edu/~tibs/ElemStatLearn/
[3] http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
Network Analysis
Introduction
We mainly use the following packages to demonstrate network analysis in R: statnet, sna and igraph. They are not
a complete list; see the CRAN task view gR, graphical models in R [1] for one.
Creating simple graphs with igraph
> # load the appropriate library
> library(igraph)
> # now create a few simple graphs
> g <- graph.empty(10,directed=FALSE)
> g2 <- graph.ring(10,directed=TRUE)
> g3 <- graph.full(10,directed=FALSE)
> # now get information about these graphs
> summary(g)
Creating graphs from data
First load the igraph package
library(igraph)
then you can choose your preferred format. Below are examples of data provided as edge list and as adjacency
matrix.
Creating graph from an edge list
An edge list is formed by a two-column matrix, with each row defining one edge. An edge is drawn from each
element in the first column to the corresponding element in the second one. Use the graph.edgelist()
function to import your data.
# producing some random data in edge list form
el <- cbind(sample(1:10, 10), sample(1:10, 10))
Network Analysis
154
# creating and plotting the graph from the edge list
gr <- graph.edgelist(el)
plot(gr)
Creating graph from an adjacency matrix
An adjacency matrix is an n × n matrix for a graph with n vertices, in which the entry a_ij gives the number of edges
from vertex i to vertex j. To import your adjacency matrix, use the graph.adjacency() function.
# producing a random adjacency matrix
adj <- matrix(sample(0:1, 100, replace=T), 10, 10)
# creating and plotting the graph from the adjacency matrix
gr <- graph.adjacency(adj)
plot(gr)
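Once a graph object exists, igraph can compute standard descriptive measures. A brief sketch, reusing the random adjacency matrix idea from above:

```r
library(igraph)

set.seed(3)
adj <- matrix(sample(0:1, 100, replace = TRUE), 10, 10)
gr <- graph.adjacency(adj)   # directed graph by default

vcount(gr)        # number of vertices
ecount(gr)        # number of edges
degree(gr)        # edges incident to each vertex
betweenness(gr)   # vertex betweenness centrality
```

Such summaries are often more informative than the plot itself for larger networks, where the drawing quickly becomes unreadable.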
References
The Statnet website [2] includes all the documentation on network analysis using R.
Julien Barnier's introduction (in French) [3]
Journal of Statistical Software volume 24: Special Issue on Networks in R [4]
[1] http://cran.r-project.org/web/views/gR.html
[2] http://csde.washington.edu/statnet/
[3] http://alea.fr.eu.org/j/reseaux_R.html
[4] http://www.jstatsoft.org/v24
Profiling R code
Before starting with parallel or high-performance computing, it is important to analyze and optimize your R code.
R provides some useful tools for analyzing and profiling R code. A good, short introduction is provided in the
Writing R Extensions manual [1].
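As a sketch of the base-R workflow: Rprof() samples the call stack at regular intervals while code runs, and summaryRprof() tabulates where the time went. The workload below is an arbitrary stand-in for real code:

```r
Rprof("profile.out")          # start sampling, writing to profile.out

# some deliberately heavy linear algebra as a stand-in for real code
for (i in 1:5) {
  m <- matrix(rnorm(1e6), 1000)
  inv <- solve(m %*% t(m) + diag(1000))
}

Rprof(NULL)                   # stop profiling
prof <- summaryRprof("profile.out")
prof$by.self                  # time spent in each function
```

When only a total timing is needed, system.time(expr) is the lighter-weight alternative.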
References
[1] http://cran.r-project.org/doc/manuals/R-exts.html#Tidying-and-profiling-R-code
Parallel computing with R
There are many packages and tools available for parallel computing with R. A good overview is provided by the
CRAN Task View: High-Performance and Parallel Computing with R [1] and by several publications:
State of the Art in Parallel Computing with R; Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu,
Luke Tierney, Ulrich Mansmann; Journal of Statistical Software, 2009: JSS [2]
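As a small taste, the parallel package (shipped with base R since version 2.14.0) can distribute an lapply() over worker processes; the workload below is arbitrary:

```r
library(parallel)

cl <- makeCluster(2)                 # launch two worker processes
res <- parLapply(cl, 1:4, function(i) mean(rnorm(1e5)))
stopCluster(cl)                      # always release the workers

str(res)   # a list of four means, computed in parallel

# On Unix-alikes, mclapply() does the same via forked processes:
# res <- mclapply(1:4, function(i) mean(rnorm(1e5)), mc.cores = 2)
```

The socket clusters made by makeCluster() work on all platforms; forking via mclapply() avoids copying data to workers but is unavailable on Windows.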
References
[1] http://cran.r-project.org/web/views/HighPerformanceComputing.html
[2] http://www.jstatsoft.org/v31/i01
Sources
For the following resources, the authors have explicitly given permission to include their material in the R
programming wikibook. Remember that even though they have given their permission, they should still be correctly
cited.
Blogs
R-statistics (the R category) [1] (a link to a post [2] which provides the proper licence for approving this content for
use).
Getting Genetics Done [3] - R tag. The R content is available from
http://gettinggeneticsdone.blogspot.com/search/label/R. The R code is copyrighted under the open source BSD
license (as described at http://gettinggeneticsdone.blogspot.com/p/copyright.html). The content itself is licensed
under a Creative Commons Attribution-Share-Alike 3.0 Unported License (as shown at the bottom of every post).
Bottom line: the R code and written content can be used freely, with attribution.
Struggling Through Problems: http://strugglingthroughproblems.blogspot.com/search/label/R
Backsidesmack R-stuff category [4]. Copyright information is in the footer and explicit permission is in this post [5].
Al3xandr3: http://al3xandr3.github.com/tags/r.html
Cloudnumbers.com (the R category) [6]: posts about high-performance computing and cloud computing with R. A
link [7] to a post which provides the proper license for approving this content for use.
The R Tutorial Series (http://rtutorialseries.blogspot.com) by John M. Quick [8] provides a collection of
user-friendly guides for researchers, students, and others who want to learn how to use R for their statistical
analyses. Its content is available for use in the R Programming wikibook under a Creative Commons BY-SA
License.
Exploring Indian census data using R [9], and useful scripts to download weather-related data from websites. The
content is available for use in the R wikibook under a CC-SA license.
Plain Data Analysis tips at www.danielmarcelino.com [10]. Topics covered in the blog are related to the social
sciences, but there is a great variety of them.
References
[1] http://www.r-statistics.com/category/r/
[2] http://www.r-statistics.com/2011/06/calling-r-lovers-and-bloggers-to-work-together-on-the-r-programming-wikibook/
[3] http://GettingGeneticsDone.com/
[4] http://www.backsidesmack.com/category/r-stuff/
[5] http://www.backsidesmack.com/2011/06/no-steal-this-blog/
[6] http://www.cloudnumbers.com/category/Rproject/
[7] http://cloudnumbers.com/the-r-programming-wikibook
[8] http://www.johnmquick.com
[9] http://anandram.wordpress.com/tag/r/
[10] http://www.danielmarcelino.com
Index
This page provides tables which make it easy to find the functions for common statistical tasks in R, Stata and SAS.
Other software, such as SPSS, may also be included in the future.
Data management
Function | R | Stata | SAS
Merge | merge() | merge / mmerge | -
Reshape | reshape() | reshape | -
Expand a dataset | expand() (epicalc) | expand | -
Descriptive Statistics
Function | R | Stata | SAS
Mean | mean() | mean | proc means
Histogram | hist() | hist | -
Frequency table | table() | ta | proc freq
Regression models
Function | R | Stata | SAS
Least squares | lm() | reg | proc reg
GLM | glm() | glm | proc glm
Probit models | glm(, family = binomial(link="probit")) | probit | -
Logit models | glm(, family = binomial(link="logit")) | logit | -
Linear fixed effects model | plm( , model = "within") (plm) | xtreg , fe | -
Linear random effects model | plm( , model = "random") (plm) | xtreg , re | -
Linear quantile regression | rq() (quantreg) | qreg | -
Ordinal logistic regression | polr() (MASS) | ologit | -
Linear IV (2SLS) | ivreg() (AER) | ivreg | proc syslin (2sls)
Programming
Function | R | Stata | SAS
Check some condition | stopifnot() | assert | -
Article Sources and Contributors
R Programming Source: http://en.wikibooks.org/w/index.php?oldid=2676243 Contributors: Adam majewski, Adrignola, Albmont, Brylie, Chainsawriot, Dallas1278, Darklama, Dcljr, Herbee,
Jtneill, Loquetudigas, Mandingo, Mwtoews, Orderud, PAC, PAC2, Panic2k4, Pep Roca, QuiteUnusual, Talgalili, Taxman, Thenub314, Tplate, 29 anonymous edits
Introduction Source: http://en.wikibooks.org/w/index.php?oldid=2561505 Contributors: Adrignola, Albmont, Alice Springs, DavidCary, Dcljr, Eddelbuettel, Lannajin, PAC, PAC2, Panic2k4,
Petter Lquist, Seemu, Tayste, TomyDuby, Twarzin, 13 anonymous edits
Sample Session Source: http://en.wikibooks.org/w/index.php?oldid=2549960 Contributors: Dcljr, F.jackson, Fiero9, PAC2, 2 anonymous edits
Manage your workspace Source: http://en.wikibooks.org/w/index.php?oldid=2665727 Contributors: Fercho, Fiero9, PAC2, Richierocks, Talgalili, Xania, 2 anonymous edits
Settings Source: http://en.wikibooks.org/w/index.php?oldid=2584255 Contributors: Adrignola, Albmont, Calimo, DimAlice, F.jackson, PAC, PAC2, Panic2k4, Pep Roca, Talgalili, ,
12 anonymous edits
Documentation Source: http://en.wikibooks.org/w/index.php?oldid=2665928 Contributors: Adrignola, Dcljr, Eddelbuettel, Herbee, PAC, PAC2, Panic2k4, Timcdlucas, ZeroOne, 1 anonymous
edits
Control Structures Source: http://en.wikibooks.org/w/index.php?oldid=2359198 Contributors: Adam majewski, Dcljr, Edgester, PAC2, Timcdlucas, 10 anonymous edits
Working with functions Source: http://en.wikibooks.org/w/index.php?oldid=2622195 Contributors: Gibravo, PAC2, Thomas Levine, 3 anonymous edits
Debugging Source: http://en.wikibooks.org/w/index.php?oldid=2213466 Contributors: Dcljr, Markvanderloo, PAC2, Shabbychef
Using C or Fortran Source: http://en.wikibooks.org/w/index.php?oldid=2094135 Contributors: PAC2
Utilities Source: http://en.wikibooks.org/w/index.php?oldid=2586362 Contributors: PAC2, Timcdlucas, 2 anonymous edits
Estimation utilities Source: http://en.wikibooks.org/w/index.php?oldid=2438898 Contributors: PAC2, Timcdlucas, 1 anonymous edits
Packages Source: http://en.wikibooks.org/w/index.php?oldid=2630876 Contributors: Derek Farn, HgDeviasse, Karstew, PAC2, Shabbychef
Data types Source: http://en.wikibooks.org/w/index.php?oldid=2597748 Contributors: Asterisk-lee, Benwing, PAC2, Rob smith, Talgalili, Thomas Levine, Timcdlucas, 11 anonymous edits
Working with data frames Source: http://en.wikibooks.org/w/index.php?oldid=2610783 Contributors: Adrignola, Gibravo, PAC, PAC2, Rob smith, Sigma 7, Talgalili, ZeroOne, 15
anonymous edits
Importing and exporting data Source: http://en.wikibooks.org/w/index.php?oldid=2532512 Contributors: PAC2
Text Processing Source: http://en.wikibooks.org/w/index.php?oldid=2676765 Contributors: Edgester, EtudiantEco, NigelW, PAC, PAC2, Robbiemorrison, Tom Morris, 7 anonymous edits
Times and Dates Source: http://en.wikibooks.org/w/index.php?oldid=2546276 Contributors: Edgester, PAC2, Pep Roca, Talgalili, Timcdlucas, 3 anonymous edits
Graphics Source: http://en.wikibooks.org/w/index.php?oldid=2575663 Contributors: Adrignola, Albmont, Dcljr, Gibravo, Henrybissonnette, Mwtoews, Orderud, PAC, PAC2, QuiteUnusual,
Saric, Talgalili, 21 anonymous edits
Grammar of graphics Source: http://en.wikibooks.org/w/index.php?oldid=2530637 Contributors: PAC2
Publication quality ouput Source: http://en.wikibooks.org/w/index.php?oldid=2676375 Contributors: Adrignola, Avraham, Dcljr, Gibravo, PAC, PAC2, Rotlink, Rp, Talgalili, Timcdlucas, 14
anonymous edits
Descriptive Statistics Source: http://en.wikibooks.org/w/index.php?oldid=2482743 Contributors: Adam majewski, Adrignola, Dcljr, Mlm2764, PAC, PAC2, Timcdlucas, ZeroOne, 4
anonymous edits
Mathematics Source: http://en.wikibooks.org/w/index.php?oldid=2619859 Contributors: Adrignola, Albmont, Dcljr, Mcneale, PAC, PAC2, RuneL87, Taxman, Timcdlucas, ZeroOne, 7
anonymous edits
Optimization Source: http://en.wikibooks.org/w/index.php?oldid=2488152 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, ZeroOne, 5 anonymous edits
Probability Distributions Source: http://en.wikibooks.org/w/index.php?oldid=2195106 Contributors: Albmont, Mmessner, PAC, PAC2, Recent Runes, Timcdlucas, Xania, ZeroOne, 3
anonymous edits
Random Number Generation Source: http://en.wikibooks.org/w/index.php?oldid=2213476 Contributors: Adrignola, Billymac00, Dcljr, Derek Farn, PAC, PAC2, ZeroOne, 2 anonymous edits
Maximum Likelihood Source: http://en.wikibooks.org/w/index.php?oldid=2621042 Contributors: Adrignola, Dcljr, Fishpi, PAC, PAC2, Thenub314, Timcdlucas, Wfoolhill, 4 anonymous edits
Method of Moments Source: http://en.wikibooks.org/w/index.php?oldid=2482598 Contributors: PAC2, Timcdlucas
Bayesian Methods Source: http://en.wikibooks.org/w/index.php?oldid=2443557 Contributors: Adrignola, PAC, PAC2, ZeroOne, 4 anonymous edits
Bootstrap Source: http://en.wikibooks.org/w/index.php?oldid=2457470 Contributors: Adrignola, Dcljr, Dourouc05, PAC, PAC2, 1 anonymous edits
Multiple Imputation Source: http://en.wikibooks.org/w/index.php?oldid=2321697 Contributors: Maartenh, PAC, PAC2, Talgalili
Nonparametric Methods Source: http://en.wikibooks.org/w/index.php?oldid=2213501 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, ZeroOne
Linear Models Source: http://en.wikibooks.org/w/index.php?oldid=2546268 Contributors: Adrignola, Dcljr, PAC, PAC2, Talgalili, Timcdlucas, ZeroOne, 2 anonymous edits
Quantile Regression Source: http://en.wikibooks.org/w/index.php?oldid=2485148 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, 1 anonymous edits
Binomial Models Source: http://en.wikibooks.org/w/index.php?oldid=2536965 Contributors: Adrignola, PAC, PAC2, Timcdlucas, ZeroOne, 1 anonymous edits
Multinomial Models Source: http://en.wikibooks.org/w/index.php?oldid=2204083 Contributors: PAC, PAC2, Timcdlucas
Tobit And Selection Models Source: http://en.wikibooks.org/w/index.php?oldid=2171656 Contributors: PAC2, Timcdlucas
Count Data Models Source: http://en.wikibooks.org/w/index.php?oldid=2200503 Contributors: PAC, PAC2, Timcdlucas
Duration Analysis Source: http://en.wikibooks.org/w/index.php?oldid=2018602 Contributors: Adrignola, PAC, PAC2
Time Series Source: http://en.wikibooks.org/w/index.php?oldid=2147584 Contributors: PAC, PAC2, Recent Runes, RuneL87, Timcdlucas, 5 anonymous edits
Factor Analysis Source: http://en.wikibooks.org/w/index.php?oldid=2679297 Contributors: Adrignola, Fishpi, PAC, PAC2, Panic2k4, Richiemorrisroe, Timcdlucas, 2 anonymous edits
Ordination Source: http://en.wikibooks.org/w/index.php?oldid=2122988 Contributors: Devanmcg, Edi06331, PAC2
Clustering Source: http://en.wikibooks.org/w/index.php?oldid=2552057 Contributors: Gibravo, PAC, PAC2, Talgalili, 2 anonymous edits
Network Analysis Source: http://en.wikibooks.org/w/index.php?oldid=2425027 Contributors: Adrignola, Gibravo, PAC, PAC2, Seemu, 3 anonymous edits
Profiling R code Source: http://en.wikibooks.org/w/index.php?oldid=2129261 Contributors: Markus.schmidberger, PAC2
Parallel computing with R Source: http://en.wikibooks.org/w/index.php?oldid=2129263 Contributors: Markus.schmidberger, PAC2
Sources Source: http://en.wikibooks.org/w/index.php?oldid=2564621 Contributors: Dmsilv, Markus.schmidberger, PAC2, Protonk, Talgalili, 8 anonymous edits
Index Source: http://en.wikibooks.org/w/index.php?oldid=2123450 Contributors: PAC2
Image Sources, Licenses and Contributors
Image:R setup1.JPG Source: http://en.wikibooks.org/w/index.php?title=File:R_setup1.JPG License: GNU General Public License Contributors: The R Foundation for Statistical Computing
Image:R setup2.JPG Source: http://en.wikibooks.org/w/index.php?title=File:R_setup2.JPG License: GNU General Public License Contributors: The R Foundation for Statistical Computing
Image:R setup3.JPG Source: http://en.wikibooks.org/w/index.php?title=File:R_setup3.JPG License: GNU General Public License Contributors: The R Foundation for Statistical Computing
Image:R setup4.JPG Source: http://en.wikibooks.org/w/index.php?title=File:R_setup4.JPG License: GNU General Public License Contributors: The R Foundation for Statistical Computing
Image:R setup5.JPG Source: http://en.wikibooks.org/w/index.php?title=File:R_setup5.JPG License: GNU General Public License Contributors: The R Foundation for Statistical Computing
File:R Terminal.png Source: http://en.wikibooks.org/w/index.php?title=File:R_Terminal.png License: GNU General Public License Contributors: TrantorFr (talk) 23:48, 22 July 2010 (UTC)
File:Rstudio.png Source: http://en.wikibooks.org/w/index.php?title=File:Rstudio.png License: unknown Contributors: PAC2
File:Mandelbrot Creation Animation (800x600).gif Source: http://en.wikibooks.org/w/index.php?title=File:Mandelbrot_Creation_Animation_(800x600).gif License: Creative Commons
Attribution-Share Alike Contributors: Jarekt
Image:Plottype.png Source: http://en.wikibooks.org/w/index.php?title=File:Plottype.png License: Public Domain Contributors: PAC2
Image:Rplotaxisstyle.png Source: http://en.wikibooks.org/w/index.php?title=File:Rplotaxisstyle.png License: Public Domain Contributors: PAC2
Image:R-basic sqrt plot example.svg Source: http://en.wikibooks.org/w/index.php?title=File:R-basic_sqrt_plot_example.svg License: Public Domain Contributors: HYanWong
Image:Mandelbrot Creation Animation.gif Source: http://en.wikibooks.org/w/index.php?title=File:Mandelbrot_Creation_Animation.gif License: Creative Commons Attribution-Share Alike
Contributors: Jarekt
File:Em_old_faithful.gif Source: http://en.wikibooks.org/w/index.php?title=File:Em_old_faithful.gif License: Creative Commons Attribution-Sharealike 3.0 Contributors: 3mta3 (talk) 16:55,
23 March 2009 (UTC)
Image:NZ Govt debt 1990-2011.svg Source: http://en.wikibooks.org/w/index.php?title=File:NZ_Govt_debt_1990-2011.svg License: Creative Commons Attribution-Sharealike 3.0
Contributors: Mrfebruary
Image:100m-rec men.png Source: http://en.wikibooks.org/w/index.php?title=File:100m-rec_men.png License: Creative Commons Attribution-Sharealike 2.5 Contributors: Thomas Steiner
File:EurUsd.png Source: http://en.wikibooks.org/w/index.php?title=File:EurUsd.png License: Creative Commons Attribution-Sharealike 2.5 Contributors: origin Thomas Steiner
Image:R-car_stopping_distances_1920.svg Source: http://en.wikibooks.org/w/index.php?title=File:R-car_stopping_distances_1920.svg License: Public Domain Contributors: HYanWong
File:Ggplot2scatter.png Source: http://en.wikibooks.org/w/index.php?title=File:Ggplot2scatter.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Protonk
Image:Histogram example.svg Source: http://en.wikibooks.org/w/index.php?title=File:Histogram_example.svg License: Public Domain Contributors: Jkv
Image:Histogram4.png Source: http://en.wikibooks.org/w/index.php?title=File:Histogram4.png License: Public Domain Contributors: Sigbert
Image:Histogram_of_arrivals_per_minute.svg Source: http://en.wikibooks.org/w/index.php?title=File:Histogram_of_arrivals_per_minute.svg License: Creative Commons
Attribution-Sharealike 3.0 Contributors: DanielPenfield
Image:Black_cherry_tree_histogram.svg Source: http://en.wikibooks.org/w/index.php?title=File:Black_cherry_tree_histogram.svg License: Creative Commons Attribution-ShareAlike 3.0
Unported Contributors: Mwtoews, 2 anonymous edits
Image:Michelsonmorley-boxplot.svg Source: http://en.wikibooks.org/w/index.php?title=File:Michelsonmorley-boxplot.svg License: Public Domain Contributors: User:Mwtoews, User:Schutz
Image:Simple barplot.svg Source: http://en.wikibooks.org/w/index.php?title=File:Simple_barplot.svg License: Public Domain Contributors: Hagezussa
Image:Nz ghgs-650-550-2007.png Source: http://en.wikibooks.org/w/index.php?title=File:Nz_ghgs-650-550-2007.png License: Creative Commons Attribution-Sharealike 3.0 Contributors:
Mrfebruary
Image:Hu_real_gdp_growth.svg Source: http://en.wikibooks.org/w/index.php?title=File:Hu_real_gdp_growth.svg License: Public Domain Contributors: Jkv
File:BinomialTest.svg Source: http://en.wikibooks.org/w/index.php?title=File:BinomialTest.svg License: Public Domain Contributors: Sigbert
File:Fi real gdp growth.svg Source: http://en.wikibooks.org/w/index.php?title=File:Fi_real_gdp_growth.svg License: Public Domain Contributors: Jkv
Image:NewZealand score YaleEnvironmentalProtectionIndex.svg Source: http://en.wikibooks.org/w/index.php?title=File:NewZealand_score_YaleEnvironmentalProtectionIndex.svg
License: Creative Commons Attribution-Sharealike 3.0 Contributors: Mrfebruary
Image:English dialects1997.svg Source: http://en.wikibooks.org/w/index.php?title=File:English_dialects1997.svg License: GNU Free Documentation License Contributors: AnonMoos, Man
vyi, Marek Mazurkiewicz, Mwtoews, Senator2029, 4 anonymous edits
Image:Capacidad por fuente Colombia 2005.svg Source: http://en.wikibooks.org/w/index.php?title=File:Capacidad_por_fuente_Colombia_2005.svg License: Creative Commons
Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Germarquezm
File:Mandate2006.png Source: http://en.wikibooks.org/w/index.php?title=File:Mandate2006.png License: Creative Commons Attribution-Sharealike 2.5 Contributors: Thomas Steiner
Image:Bvn-small.png Source: http://en.wikibooks.org/w/index.php?title=File:Bvn-small.png License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Adam majewski,
Bdolicki, Btyner, Kilom691, 2 anonymous edits
Image:Binomial_distribution_cdf.svg Source: http://en.wikibooks.org/w/index.php?title=File:Binomial_distribution_cdf.svg License: Public Domain Contributors: Tayste
Image:Chi-Squared-pdf.png Source: http://en.wikibooks.org/w/index.php?title=File:Chi-Squared-pdf.png License: Creative Commons Attribution-Sharealike 2.5 Contributors: Thomas
Steiner
Image:Chi-Squared-pdf_and_cdf.png Source: http://en.wikibooks.org/w/index.php?title=File:Chi-Squared-pdf_and_cdf.png License: Creative Commons Attribution-Sharealike 2.5
Contributors: Thomas Steiner
File:Halton sequence 2D.svg Source: http://en.wikibooks.org/w/index.php?title=File:Halton_sequence_2D.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors:
User:Jheald
File:Hammersley set 2D.svg Source: http://en.wikibooks.org/w/index.php?title=File:Hammersley_set_2D.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Jheald
File:Kernel density.svg Source: http://en.wikibooks.org/w/index.php?title=File:Kernel_density.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Mwtoews,
1 anonymous edits
File:R-horsekick totals-density.svg Source: http://en.wikibooks.org/w/index.php?title=File:R-horsekick_totals-density.svg License: Public Domain Contributors: HYanWong
File:R-US state areas-density.svg Source: http://en.wikibooks.org/w/index.php?title=File:R-US_state_areas-density.svg License: Public Domain Contributors: HYanWong
Image:NW cb16.png Source: http://en.wikibooks.org/w/index.php?title=File:NW_cb16.png License: Public Domain Contributors: Sigbert
File:Nmds.pdf Source: http://en.wikibooks.org/w/index.php?title=File:Nmds.pdf License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Edi06331
File:Pca.pdf Source: http://en.wikibooks.org/w/index.php?title=File:Pca.pdf License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Edi06331
License
Creative Commons Attribution-Share Alike 3.0
//creativecommons.org/licenses/by-sa/3.0/