BIO503: Introduction To Programming and Statistical Modeling in R
Course Description
This course is an introduction to R, a powerful and flexible statistical language and environment that also provides more flexible graphics capabilities than other popular statistical packages. The course will introduce students to the basics of using R for statistical programming, computation, graphics, and modeling. We will start with a basic introduction to the R language, reading and writing data, and graphics. We then discuss writing functions in R and tips on programming in R. Finally, the latter part of the course will focus on using R to fit some important types of statistical models, including linear regression. Our goal is to get students up and running with R so that they can use R in their research and are in a good position to expand their knowledge of R on their own. Basic knowledge of statistics, at the level of a basic understanding of linear regression, is required. The first 3 lectures will focus on R basics. Depending on course progress, I am happy to tailor the last lectures to students' interests.
Learning Objectives
1. Use R for statistical programming, computation, graphics, and modeling.
2. Write functions and use R in an efficient way.
3. Fit some basic types of statistical models.
4. Use R in their own research.
5. Be able to expand their knowledge of R on their own.
There are no formal prerequisites, but in order to appreciate the abilities of R and for the later classes that explore various statistical models, we expect that students will have some basic knowledge of statistics, at the level of a basic understanding of linear regression. The intended audience is doctoral students in departments other than biostatistics who need a flexible statistical environment for their research. Master's students are also allowed. We do not expect any prior experience with R, but experience with another programming or statistical language may be helpful to a limited extent. Beginning R users with basic knowledge may also find the course useful.
Instructors
Primary classroom and grading instructor: Aedin Culhane, Dana-Farber Cancer Institute, Smith 822C (8th floor of the Smith building at the end of Shattuck St) Phone: (617) 617-2468 e-mail: [email protected]
Faculty sponsor: Chris Paciorek Room 407, Building 2 Phone: (617) 432-4912 e-mail: [email protected]
Course Material
Students may use either of these two books depending on their needs and background:
1. Peter Dalgaard. Introductory Statistics with R (Paperback). 1st Edition. Springer-Verlag New York, Inc. ISBN 0-387-95475-9. http://www.amazon.com/Introductory-Statistics-R-Peter-Dalgaard/dp/0387954759
Introductory Statistics with R provides a very basic introduction to R, targeting nonstatistician scientists. It may be sufficient for students who will use R for basic statistics.
2. W. N. Venables and B. D. Ripley. 2002. Modern Applied Statistics with S. 4th Edition. Springer. ISBN 0-387-95457-0. http://www.amazon.com/Modern-Applied-Statistics-W-N-Venables/dp/0387954570
Modern Applied Statistics with S is a more comprehensive introduction to statistical computing using S and R.
Other useful references:
An Introduction to R. Online manual at the R website at http://cran.r-project.org/manuals.html
Andreas Krause, Melvin Olson. 2005. The Basics of S-PLUS. 4th edition. Springer-Verlag, New York. ISBN 0-387-26109-5
Materials suggested at http://cran.r-project.org/manuals.html
Jose Pinheiro, Douglas Bates. 2000. Mixed-effects models in S and S-PLUS. Springer-Verlag, Berlin. ISBN 0-387-98957-9
Software
R is available for free from http://cran.r-project.org/ for UNIX/Linux, Windows, and Mac. It is also available in the IT microlabs.
Class Format
There will be five 3-hour class sessions. They will be held in the microlab and will combine lecture, demonstration, and laboratory components, with an emphasis on demonstration and hands-on experience.
Grading/Assessment
Course note: Pass/Fail or audit grading option only. There will be 3 practical assignments, requiring students to use and expand on the material discussed in class. Pass/Fail grading will be based on submission of and performance on these assignments, and on attendance.
Course topics
1. Introduction to the R language
   SAS versus R
   R, S, and S-plus
   Obtaining and managing R
   Objects - types of objects, classes, creating and accessing objects
   Arithmetic and matrix operations
   Introduction to functions
2. More details on working with R
   Reading and writing data
   R libraries
   Functions and R programming
      the if statement
      looping: for, repeat, while
      writing functions
      function arguments and options
3. Graphics
   Basic plotting
   Manipulating the plotting window
   Advanced plotting using lattice library
   Saving plots
4. Standard statistical models in R
   Model formulae and model options
   Output and extraction from fitted models
   Models considered:
      Linear regression: lm()
      Logistic regression: glm()
      Poisson regression: glm()
      Survival analysis: Surv(), coxph()
      Linear mixed models: lme()
5. Advanced R
   Extensions of topics discussed in lectures 1, 2 and 3 based on a course survey
      Data management (importing, subsetting, merging, new variables, missing data etc.)
      Plotting
      Loops and functions
   Further topics to be determined by student interest/requirements but may include
      Migration SAS to R
      Plotting and Graphics in R
      Writing R functions, optimizing R code
      Bioconductor, analysis of gene expression and genomics data
      More on linear models
      Multivariate analysis, cluster analysis, dimension reduction methods (PCA)
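
For a rough sense of what the model-fitting topics look like in practice, here is a minimal sketch. It is not taken from the course materials. The built-in mtcars dataset, the chosen variables, and the object names (fit_lm, fit_logit, fit_pois, p_manual) are assumptions made purely for illustration. Surv(), coxph(), and lme() are left out because they need the survival and nlme packages to be loaded first.

```r
# Built-in example dataset; no files or add-on packages are needed.
data(mtcars)

# Linear regression: miles per gallon modeled on weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression: transmission type (0 = automatic, 1 = manual) on weight.
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Poisson regression: number of carburetors (a count) on horsepower.
fit_pois <- glm(carb ~ hp, data = mtcars, family = poisson)

# Output and extraction from fitted models.
print(coef(fit_lm))               # named vector of estimated coefficients
print(summary(fit_lm)$r.squared)  # proportion of variance explained
print(head(residuals(fit_lm)))    # first few residuals

# Predicted probability of a manual transmission for a hypothetical car
# with wt = 3 (weight in units of 1000 lbs).
p_manual <- predict(fit_logit, newdata = data.frame(wt = 3), type = "response")
print(p_manual)
```

The same formula notation (response ~ predictors) is used in all three calls; only the fitting function and the family argument change.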