CSC228 R Programming Note
Introduction to R Programming
R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In 1993
the first announcement of R was made to the public. In 1995 Martin Mächler made an important contribution by
convincing Ross and Robert to release R as free software.
R is commonly used for:
1. Data Cleaning
2. Data Analysis
3. Data Visualization
R runs on almost any standard computing platform and operating system. Its open-source nature means that anyone is
free to adapt the software to whatever platform they choose.
Why learn R?
You may be wondering why you should learn R. These are a few of the reasons you should
learn R:
Installation of R
Windows: http://cran.r-project.org/bin/windows/base/
macOS: http://cran.r-project.org/bin/macosx/
LABORATORY FOR INTERDISCIPLINARY STATISTICAL ANALYSIS (LISA)
R Console
The R console is a pane in which a user can type R commands, submit them for execution, and view the results.
Note that R is case sensitive, i.e. Base, base and BASE are all different names.
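Case sensitivity can be checked directly at the console. The short sketch below assigns three different values to the names Base, base and BASE (the values themselves are arbitrary and chosen only for illustration):

```r
# Three names that differ only in case are three separate objects
Base <- 1
base <- 2
BASE <- 3

# Each name keeps its own value
print(c(Base, base, BASE))
```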
RStudio
RStudio is an integrated development environment (IDE) for R programming. It makes programming in R easier and
friendlier. The RStudio window is divided into four panes:
1. Editor
2. Workspace and History
3. Console
4. Output
[Figure: RStudio window with the Editor, Output and Console panes labelled]
R as a Calculator
In its most basic form, R can be used as a calculator to perform arithmetic operations.
The # sign is used to add comments, so that others can understand what the R code is about.
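A minimal sketch of R used as a calculator, with a comment on each line. The numbers are arbitrary; the operators are standard base R.

```r
# Addition, subtraction, multiplication and division
2 + 3
7 - 4
6 * 5
20 / 8

# Exponentiation
2 ^ 10

# Remainder and integer division
17 %% 5
17 %/% 5

# Brackets control the order of evaluation; the result can be stored in a name
result <- (2 + 3) * 4
print(result)
```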
Getting started
The first thing to do when you open RStudio and want to work is to get and set your working directory. Now
the question is, "WHAT IS A WORKING DIRECTORY?"
The working directory is the folder that the current R session is linked to: files are imported from it and exported
to it by default. getwd() (get working directory) returns an absolute file path representing the current
working directory of the R process. It is very important to always know and set your working directory before working.
A new working directory can be set by clicking Session on the menu bar, selecting Set Working Directory and
then clicking Choose Directory..., by pressing the shortcut Ctrl+Shift+H, or by calling setwd() with the folder path.
For example, in the call setwd("C:/users/user/Desktop/HANDOVER"), "C:/users/user/Desktop" is the file path and "/HANDOVER" is the name of the folder.
Keep in mind that setwd() cannot create folders for you; it can only access folders that already exist.
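The sketch below shows getwd() and setwd() together. Because setwd() cannot create a folder, the folder is created first with dir.create(). The HANDOVER folder is placed inside R's temporary directory here purely as a stand-in for a real path on your own computer.

```r
# Remember where the session currently is
original_dir <- getwd()
print(original_dir)

# Stand-in for a real path such as "C:/users/user/Desktop/HANDOVER"
target_dir <- file.path(tempdir(), "HANDOVER")

# setwd() fails on a folder that does not exist, so create it first
if (!dir.exists(target_dir)) {
  dir.create(target_dir)
}

# Move into the folder and confirm the change
setwd(target_dir)
current_dir <- getwd()
print(current_dir)

# Go back to where we started
setwd(original_dir)
```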
To read data use: A = read.csv("MCBDOUT.csv"), where "A" is the name given to the data that is read,
"MCBDOUT" is the name of the file to be read, and ".csv" (comma-separated values) is the format of the file, which Excel can open and save.
To write data into the current working directory use: write.csv(data, file = "Newdata.csv"), where write.csv() is the
function that writes the data, "data" is the name of the data to be saved, "Newdata" is the name under which we want to
save it, and ".csv" is the format in which we want to save it.
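A small round trip illustrates both functions. The data frame and its column names are hypothetical, and the file is written to R's temporary directory rather than the working directory so that the example leaves nothing behind.

```r
# Hypothetical data to be saved
data <- data.frame(Name = c("Ada", "Bola", "Chidi"),
                   Age  = c(21, 35, 42))

# Full path of the file to write (temporary folder, for illustration only)
csv_path <- file.path(tempdir(), "Newdata.csv")

# Write the data; row.names = FALSE avoids an extra column of row numbers
write.csv(data, file = csv_path, row.names = FALSE)

# Read it back under the name A
A <- read.csv(csv_path)
print(A)
```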
DATA TYPES IN R
1. VECTOR: A vector holds the values of a single variable, such as Income, Age or Gender. To write a vector in R we use
the concatenate function c().
2. DATA FRAME: A data frame is a list of variables that all have the same number of rows, with unique row names, and
has the class "data.frame". To create a data frame we use the data.frame() function.
3. Matrices: A matrix is a two-dimensional object that consists of rows and columns. To create a matrix use:
matrix(data, nrow, ncol, byrow), where:
data is the list or vector of elements that will fill the matrix
nrow, ncol are the number of rows and columns respectively
byrow = TRUE fills the matrix row by row; the default, byrow = FALSE, fills it column by column
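The three data types described above can be sketched as follows. All values and names (Age, Gender, Income, people) are made up for illustration.

```r
# Vectors: one variable each, built with c()
Age    <- c(23, 31, 45, 28)
Gender <- c("F", "M", "M", "F")
Income <- c(1200, 1850, 2400, 1500)

# Data frame: several vectors of equal length become the columns
people <- data.frame(Age, Gender, Income)
print(people)
class(people)

# Matrix: the same six numbers filled by column (default) and by row
M_by_col <- matrix(1:6, nrow = 2, ncol = 3)
M_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(M_by_col)
print(M_by_row)
```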
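NOTE: A matrix can also be built by binding vectors together as columns with cbind(). If the rbind() function is used
instead of the cbind() function, the vectors become rows, so the outcome is the transpose of the matrix.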
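The relationship between cbind() and rbind() can be checked with two short vectors (x and y are illustrative names and values):

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Bind the vectors as columns: a 3 x 2 matrix
by_columns <- cbind(x, y)
print(by_columns)

# Bind the same vectors as rows: a 2 x 3 matrix
by_rows <- rbind(x, y)
print(by_rows)

# t() transposes a matrix; the row-bound matrix equals the transpose
same <- all(by_rows == t(by_columns))
print(same)
```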
1. MULTIPLICATION: One of the operations that can be carried out on matrices is multiplication. It is denoted
by "%*%", which is used only for matrix multiplication (the plain "*" operator multiplies element by element). We can
also multiply a matrix by its transpose, which is obtained with t().
2. Determinant, Diagonal and Trace of a matrix: To find the determinant of a matrix A use the function det(A), to
find the diagonal use the function diag(A), and for the trace, which is the sum of the diagonal elements, use the
function sum(diag(A)).
3. Inverse of a matrix: To find the inverse of a square, non-singular matrix use the function solve(A), where A is the
name of the matrix. One of the properties of a matrix is that multiplying it by its inverse (with %*%) gives the identity matrix.
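The three operations above can be tried on small matrices. A and B below are arbitrary 2 x 2 examples, not taken from the original notes.

```r
# Two small matrices (filled column by column)
A <- matrix(c(2, 1, 1, 3), nrow = 2)
B <- matrix(c(1, 0, 2, 1), nrow = 2)

# Matrix multiplication, and a matrix times its transpose
AB  <- A %*% B
AAt <- A %*% t(A)
print(AB)
print(AAt)

# Determinant, diagonal and trace
det_A   <- det(A)
diag_A  <- diag(A)
trace_A <- sum(diag(A))
print(c(det_A, trace_A))

# Inverse; A times its inverse is the identity matrix (up to rounding)
A_inv <- solve(A)
print(round(A %*% A_inv, 10))
```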
MANIPULATION
Manipulation of matrices and vectors is one of the most common tasks you will undertake in R. Thankfully,
it is an easy and (eventually) intuitive process to arrange your data. First, some points regarding the
indexing of matrices. Take for example this matrix
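A matrix of this kind can be produced as follows. The 3 x 3 matrix A is an illustrative stand-in for the one in the original notes; the printed form is shown in the comments.

```r
# A 3 x 3 matrix holding the numbers 1 to 9, filled column by column
A <- matrix(1:9, nrow = 3)
print(A)
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
```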
As you may have noticed, there are commas next to the numbers indicating the rows and
columns, for example [1, ]. The comma means "take all the elements": if it comes after the
number it means all the elements in that row; if it comes before the number, all the elements in that column. This
convention follows the common (row, column) system of listing the row number and then the column number
when identifying an element of a matrix.
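To use the indexing of the matrix to access its elements, type the matrix name followed by the element(s)
you want in square brackets, for example A[2, 3] for the element in row 2 and column 3 of a matrix named A.
A negative index drops elements: with A[, -2], R takes the matrix, removes the second column, and shifts everything over.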
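A short sketch of matrix indexing, using an illustrative 3 x 3 matrix A:

```r
A <- matrix(1:9, nrow = 3)

# A single element: row 2, column 3
element <- A[2, 3]

# A whole row (comma after the number) and a whole column (comma before)
first_row  <- A[1, ]
second_col <- A[, 2]

# A negative index removes that column
without_col2 <- A[, -2]

print(element)
print(first_row)
print(second_col)
print(without_col2)
```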
If you want to change the actual values of the data, access the part of the matrix you want to change
and assign to it with the "=" operator (or the more common "<-").
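Changing values in place might look like this (again with an illustrative matrix A and arbitrary replacement values):

```r
A <- matrix(1:9, nrow = 3)

# Replace a single element
A[2, 3] = 100

# Replace a whole row
A[1, ] = c(0, 0, 0)

# Replace a whole column; a single value is recycled down the column
A[, 2] <- 5

print(A)
```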
Datasets in R
R comes preloaded with various datasets that are readily available for practice. To list the preloaded datasets,
use the function data().
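Typing data() on its own at the console opens the list of datasets. The sketch below stores that list instead and then looks at mtcars, one of the datasets shipped with base R (chosen here only as an example):

```r
# Store the list of available datasets instead of displaying it
available <- data()

# The names of the first few datasets
head(available$results[, "Item"])

# Look at one preloaded dataset
head(mtcars)
dim(mtcars)
```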
DATA VISUALIZATION
One of the most powerful aspects of R is its suite of graphic devices. Visualizing data is often one of the best ways
of describing a data set and interpreting what you have collected. Towards this goal, R has a variety of different
plotting tools that you can use and customize to your needs.
1. Dot plot:
One of the simplest (and most useful) plotting tools in R is the dot plot, also known as a scatter plot. It is produced
with the plot() command, which takes two vectors of equal length and plots them against each
other (vectors in the mathematical sense; R can plot input of all the most common data types). The first vector
provided goes on the x-axis and the second vector provided goes on the y-axis.
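A minimal dot plot, assuming the built-in mtcars dataset as the source of the two vectors (any two numeric vectors of equal length would do):

```r
# Two vectors of equal length
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

# First vector on the x-axis, second on the y-axis
plot(x, y,
     xlab = "Weight",
     ylab = "Miles per gallon",
     main = "Dot plot of mpg against weight")
```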
If you plot one graph, as we have done, and then plot another graph with a second call to plot(),
you will see that your old plot is gone and has been replaced by the new plot.
We might not always want to replace our old plots, or we might want to place different plots side by side on the
same page. To achieve this we first use the par() function to set some general graphical parameters.
In a call such as par(mfrow = c(1, 2), bg = "grey", cex = 0.75, bty = "n"):
mfrow = c(1, 2) is the parameter that divides the page into a grid of plots, where 1 is the number of rows and 2
is the number of columns; both can be changed as you wish.
bg = "grey" is the background color on which the graphs will be plotted.
cex = 0.75 is the font size of the graph labels, relative to the default size of 1.
bty = "n" is the kind of box drawn around your plots: "n" for none, "l" for only the x and y axes, and "c" for a C-like shape.
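The sketch below places two plots side by side using these parameters. The mtcars columns are used only as convenient example data, and the old settings are restored afterwards so that later plots are not affected.

```r
# Set the parameters and keep the previous values
old_par <- par(mfrow = c(1, 2), bg = "grey", cex = 0.75, bty = "n")

# First plot goes in the left-hand cell
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "Miles per gallon", main = "mpg vs weight")

# Second plot goes in the right-hand cell instead of replacing the first
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon", main = "mpg vs horsepower")

# Restore the previous settings
par(old_par)
```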
2. Line Plots: A line chart is a graph that connects a series of points by drawing line segments between
them. The points are ordered by one of their coordinates (usually the x-coordinate). Line charts
are usually used to identify trends in data.
The plot() function in R is used to create the line graph.
The basic syntax to create a line chart in R is plot(x, y, type = "l"), which can be extended with the arguments described below.
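The type argument takes one of the following values (in lowercase, since R is case sensitive):
Type   Description
"p"    Points
"l"    Lines
"b"    Both lines and points
"c"    The lines part alone of "b"
"o"    Both lines and points, overplotted
"h"    Histogram-like vertical lines
"s"    Stair steps
"n"    Does not produce any points or lines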
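To see the difference between some of these types, the same vector can be drawn four times in a 2 x 2 grid. The vector v is arbitrary example data.

```r
# Example values to plot
v <- c(7, 12, 28, 3, 41)

# Four plots on one page, one per type
old_par <- par(mfrow = c(2, 2))
plot_types <- c("p", "l", "b", "o")
for (plot_type in plot_types) {
  plot(v, type = plot_type, main = paste0("type = ", plot_type))
}

# Restore the previous layout
par(old_par)
```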
col determines the color of the lines and points. If you spell out the color it must be written as a string, e.g.
col = "red". The built-in colors in R's default palette can also be given as numbers:
1 black
2 red
3 green
4 blue
5 cyan
6 magenta
7 yellow etc.
xlab is the title of the x axis
ylab is the title of the y axis
main is the title for the plot
lwd is the width of the line
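pch is the point character, given as a number: 1 (circle), 2 (triangle), 3 (plus), 4 (cross), 5 (diamond), 6 (inverted triangle), 7 (square with a cross).
ylim is the limits of the y axis, given as c(lower, upper). Before this can be determined you will need to consider the range of values you want to plot.
xaxt = "n" suppresses the x axis, so that no tick marks or tick labels are drawn on it.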
font.lab is used to change the font style of the x and y label
cex.lab is used to change the font size of the x and y label
font.axis is used to change the font style of the axis
cex.axis is used to change the font size of the axis
cex scales the size of text and plotting symbols in the plot relative to the default: values above 1 enlarge them and values below 1 shrink them.
Practice
Create an Excel sheet containing the months of the year and the numbers of males and females that attended R seminars
in a year. The Excel file should be named ATTENDANCE and must be saved in CSV (MS-DOS) format.
After creating the document in Excel and saving it in .csv format, we move to R, read the file with read.csv(),
and plot the line graph with plot(), adding a key with legend().
In the legend() call:
"topleft" is the position where the legend will be displayed on the graph.
c("Male", "Female") is written in the order in which the variables were plotted; the same applies to
pch = c(2, 1), col = c("red", "blue") and text.col = c("red", "blue").
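One possible version of the practice plot is sketched below. The attendance figures and the column names Month, Male and Female are hypothetical; in the exercise the data frame would come from read.csv("ATTENDANCE.csv") instead of being typed in.

```r
# Hypothetical attendance figures standing in for ATTENDANCE.csv
attendance <- data.frame(
  Month  = month.abb,
  Male   = c(12, 15, 9, 20, 18, 25, 22, 17, 14, 19, 23, 16),
  Female = c(10, 18, 12, 16, 21, 19, 24, 20, 11, 22, 18, 21)
)

# A y-axis range wide enough for both lines
y_range <- range(attendance$Male, attendance$Female)

# First line: males; the x axis is suppressed so months can be added by name
plot(attendance$Male, type = "o", col = "red", pch = 2, lwd = 2,
     ylim = y_range, xaxt = "n",
     xlab = "Month", ylab = "Number attending",
     main = "Attendance at R seminars")

# Second line: females, added to the same plot
lines(attendance$Female, type = "o", col = "blue", pch = 1, lwd = 2)

# Month names on the x axis
axis(1, at = 1:12, labels = attendance$Month)

# Legend entries in the same order as the lines were plotted
legend("topleft", legend = c("Male", "Female"),
       pch = c(2, 1), col = c("red", "blue"), text.col = c("red", "blue"))
```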
Please don't just copy what I did: be dynamic and creative in your coding, and use these lines of code to create more
captivating graphs.
3. HISTOGRAM
Histograms are another powerful way to visualize the distribution of data. Histograms in R are created using the hist()
command. A histogram represents the frequencies of the values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges; the height
of each bar represents the number of values that fall in that range.
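A minimal histogram, assuming the mpg column of the built-in mtcars dataset as example data. hist() also returns the ranges and counts it used, which can be inspected.

```r
# A numeric variable to summarise
mpg <- mtcars$mpg

# Draw the histogram and keep the object that hist() returns
h <- hist(mpg,
          col  = "lightblue",
          xlab = "Miles per gallon",
          main = "Histogram of mpg")

# The boundaries of the ranges and the number of values in each
print(h$breaks)
print(h$counts)
```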
4. BAR CHARTS
A bar chart is appropriate for summarizing the distribution of a categorical variable.
Categorical variables represent types of data which can be divided into groups. Examples of categorical data include
Gender, Race, Tribe, Group, Nationality etc.
In R, the bars in a bar chart can be plotted horizontally or vertically. The categories being compared are shown on one
axis of the chart, and the other axis represents the measured value.
Note that R doesn't create bar charts directly from the categorical variable. Instead, we first create a table that
contains the frequency for each level of the variable by using the function table().
The function for creating a bar chart is barplot().
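The two steps, table() and then barplot(), can be sketched with the number of cylinders in the built-in mtcars dataset treated as a categorical variable (an example choice, not from the original notes):

```r
# A categorical variable: number of cylinders
cylinders <- factor(mtcars$cyl)

# Step 1: frequency of each level
counts <- table(cylinders)
print(counts)

# Step 2: vertical bar chart of the frequencies
barplot(counts,
        xlab = "Number of cylinders", ylab = "Number of cars",
        main = "Cars by number of cylinders")

# The same chart with horizontal bars
barplot(counts, horiz = TRUE,
        main = "Cars by number of cylinders")
```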
5. BOXPLOT
Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=),
where x is a formula and data= denotes the data frame providing the data. An example of a formula is y~group,
where a separate boxplot for the numeric variable y is generated for each value of group. Add varwidth=TRUE to
make boxplot widths proportional to the square root of the sample sizes. Add horizontal=TRUE to reverse the
axis orientation.
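A grouped boxplot using the formula interface, assuming the built-in mtcars dataset with mpg as the numeric variable and cyl as the group:

```r
# One box of mpg for each value of cyl
b <- boxplot(mpg ~ cyl, data = mtcars,
             varwidth   = TRUE,   # box widths reflect group sizes
             horizontal = TRUE,   # boxes run left to right
             main = "Miles per gallon by number of cylinders")

# Number of observations in each group
print(b$n)
```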
A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" (hence the name), which goes
from the first quartile (Q1) to the third quartile (Q3). The description here assumes a boxplot drawn horizontally (horizontal=TRUE in R); in R's default vertical orientation the same elements appear rotated by 90 degrees.
Within the box, a vertical line is drawn at Q2, the median of the data set. Two horizontal lines, called whiskers, extend
from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the
back whisker goes from Q3 to the largest non-outlier.
Here is how to read a boxplot. The median is indicated by the vertical line that runs down the center of the box. In the
boxplot above, the median is between 4 and 6, around 5.
Additionally, boxplots display two common measures of the variability or spread in a data set.
Range. If you are interested in the spread of all the data, it is represented on a boxplot by the horizontal distance
between the smallest value and the largest value, including any outliers. In the boxplot above, data values range
from about 0 (the smallest non-outlier) to about 16 (the largest outlier), so the range is 16. If you ignore outliers,
the range is illustrated by the distance between the opposite ends of the whiskers, about 10 in the boxplot
above.
Interquartile range (IQR). The middle half of a data set falls within the interquartile range. In a boxplot, the
interquartile range is represented by the width of the box (Q3 minus Q1). In the chart above, the interquartile
range is equal to about 7 minus 3 or about 4.
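The quantities a boxplot displays can also be computed directly. The sketch below uses a small made-up data set (the numbers 1 to 9), not the data behind the boxplot discussed above.

```r
# Small illustrative data set
x <- 1:9

# Q1, median (Q2) and Q3
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Interquartile range: Q3 minus Q1
iqr <- IQR(x)
print(iqr)

# Range: largest value minus smallest value
full_range <- diff(range(x))
print(full_range)

# The corresponding horizontal boxplot
boxplot(x, horizontal = TRUE)
```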
And finally, boxplots often provide information about the shape of a data set. The examples below show some common
patterns.
[Figure: three boxplots on a scale from 2 to 16, labelled Skewed right, Symmetric and Skewed left]
Each of the above boxplots illustrates a different skewness pattern. If most of the observations are concentrated on the
low end of the scale, the distribution is skewed right; and vice versa. If a distribution is symmetric, the observations will be
evenly split at the median, as shown above in the middle figure.
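The three patterns can be reproduced with small made-up data sets. The values below are invented so that each shape is easy to see; comparing the mean with the median gives a quick numerical check of the direction of skew.

```r
# Most values at the low end, a few large ones: skewed right
skew_right <- c(2, 2, 3, 3, 3, 4, 4, 5, 7, 10, 15)

# Values mirrored around the middle: symmetric
symmetric <- c(2, 4, 6, 7, 8, 9, 10, 11, 12, 14, 16)

# Mirror image of the first data set: skewed left
skew_left <- 18 - skew_right

# Draw the three boxplots together
boxplot(skew_right, symmetric, skew_left,
        names = c("Skewed right", "Symmetric", "Skewed left"),
        horizontal = TRUE)

# Mean above the median suggests right skew, below suggests left skew
print(c(mean(skew_right), median(skew_right)))
print(c(mean(symmetric), median(symmetric)))
print(c(mean(skew_left), median(skew_left)))
```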