
SYLLABUS

M. Sc. STATISTICS WITH COMPUTER SCIENCE


SEMESTER III
WITH EFFECT FROM 2018-2019 ADMITTED BATCH OF STUDENTS
Paper - 3.1 R PROGRAMMING

UNIT - I - Familiarizing with R environment, Using R console as a calculator, R atomic types, methods
of creating vectors, combining vectors and repeating vectors, different ways of subsetting vectors
using indexing, names and logicals. Arithmetic and logical operations. Using character vectors for text
data, manipulating text using strsplit(), paste(), cat(), grep(), gsub() functions; handling factor data.
Working with dates.

UNIT - II - Creating Matrices, getting values in and out of matrices, performing matrix calculations;
Working with multidimensional Arrays; creating data frames, getting values in and out of data
frames, adding rows to data frame, adding variables to data frame; creating lists, extracting components
from a list, changing values of components of lists. Getting data into and out of R - reading data in
CSV files, EXCEL files, SPSS files and working with other data types. Getting data out of R - working
with write.csv() and write.table() functions.

UNIT - III - Writing Scripts and functions in R. Writing functions with named, default and optional
arguments. Using functions as arguments. Debugging your code. Control statements in R - conditional
control using if, if-else, ifelse; looping control using for, while, repeat; transfer of control using break
and next. Manipulating and processing data - creating subsets of data, use of merge() function, sorting
and ordering of data. Group manipulation using apply family of functions - apply, sapply, lapply, tapply.

UNIT - IV - Base graphics. Use of high-level plotting functions for creating histograms, scatter
plots, box-whiskers plot, bar plot, dot plot, Q-Q plot and curves. Controlling plot options using low-level
plotting functions - Adding lines, segments, points, polygon, grid to the plotting region; Add text
using legend, text, mtext; and Modify/add axes, Putting multiple plots on a single page.

UNIT - V - Working with probability distributions - normal, binomial, Poisson and other distributions.
Summary statistics, hypothesis testing - one and two-sample Student's t-tests, Wilcoxon U-test,
paired t-test, paired U-test, correlation and covariance, correlation tests, tests for association - Chi-squared
test and goodness-of-fit tests. Formula notation, one-way and two-way ANOVA and post-hoc
testing, graphical summary of ANOVA and post-hoc testing, extracting means and summary statistics;
Simple linear regression
Text Books:

1. Mark Gardener(2012), Beginning R - The Statistical Programming Language, Wiley India Pvt
Ltd.

2. Andrie de Vries and Joris Meys(2015), R Programming for Dummies, Wiley India Pvt Ltd.

3. Jared P. Lander(2014), R For Everyone - Advanced Analytics and Graphics, Pearson Education
Inc.
R - Vectors-1

R - A Quick Start
Types of data

The basic data types in R are called atomic types. They are:
numeric,
integer,
character,
logical,
complex, and
raw.

> is the command prompt of R system.


Both <- and = can be used as assignment operators. Of the two, <- is the more widely used.
The R system is case sensitive.

Operators

Some operators are:

Operator   Purpose
+          addition
-          subtraction
/          division
*          multiplication
^          exponentiation
%%         modulus
&          logical AND
|          logical OR

> 2 + 3
[1] 5
> 2 -3
[1] -1
>
> 2 * 3
[1] 6
>
> 2 / 3
[1] 0.6666667
>
> 2^3
[1] 8
> 2%%3
[1] 2
>

September 6, 2020

There are no scalars in R: a single value such as 2 is a vector of length one.

Objects:

Variables in R are called objects. The rules for naming objects are similar to those of the
C language. In addition, you can use the period symbol to create multi-word object
names. For example, x.bar is a valid object name in R, which you might use to refer to the
average of a number of values. Another character you can use in multi-word object names is
the underscore symbol (for example, x_bar).

> x <- 2
> y <- 3
>
> x + y
[1] 5
> x - y
[1] -1
>
> x * y
[1] 6
> x / y
[1] 0.6666667
>
> x^y
[1] 8
>

Data Structures - Vectors


Creating Vectors:

Mostly, we use the c() function to create a vector at command level.

For example, a vector, named x, is created using the c() function as follows:

> x <- c(5,2,6,10)


>

The above command creates a numeric vector consisting of the values 5, 2, 6, and 10, in that
order, and stores it in the object x. In the second line you again see the command prompt. It
means that the command was executed without error and R is waiting for your next command.


> x
[1] 5 2 6 10
>
> mode(x)
[1] "numeric"
>

If you simply type the vector name and press Enter, the contents of the vector are
displayed, as in the above example. The mode() function returns the mode (basic type) of an
object. In the case of the object x above, it is numeric.
Create a character vector, for example,

> char <- c("a","e","i","o","u")


> char
[1] "a" "e" "i" "o" "u"
>
>
> mode(char)
[1] "character"

The object char is a character vector. Similarly, you can create a logical vector, as in the
following example:

> logiVec <- c(TRUE,TRUE,FALSE,TRUE)


>
> logiVec
[1] TRUE TRUE FALSE TRUE
>
> mode(logiVec)
[1] "logical"
>

A vector is meant for storing values of the same type and for manipulating them. What happens
if we put different types into the same vector?

> char.num <- c(1,2,"a","b")


> char.num
[1] "1" "2" "a" "b"
>

Observe that the numeric values when mixed with character values will be converted into
character type.


> char.logi <- c("a","b", TRUE,FALSE)


> char.logi
[1] "a" "b" "TRUE" "FALSE"
>
> mode(char.logi)
[1] "character"
>

Logical values get converted into characters if they occur together with character values
in a vector.
> num.logi <- c(1,2,TRUE,FALSE)
> num.logi
[1] 1 2 1 0
>
> mode(num.logi)
[1] "numeric"
>

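Logical values are converted into numeric type (TRUE becomes 1 and FALSE becomes 0) if they
appear together with numeric values.
This is called coercion. The R system automatically converts the lower-type data in a
vector to the highest type present, in the order logical, integer, numeric, complex, character.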
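
The coercion order can be checked directly at the console. The following is a minimal sketch; the object names v1 to v4 and num are illustrative only, and class() is used here (rather than mode()) because it distinguishes integer from numeric.

```r
# Coercion inside c(): logical < integer < numeric < complex < character
v1 <- c(TRUE, 2L)      # logical + integer   -> integer
v2 <- c(2L, 3.5)       # integer + numeric   -> numeric
v3 <- c(3.5, 1+2i)     # numeric + complex   -> complex
v4 <- c(1+2i, "a")     # complex + character -> character

class(v1)   # "integer"
class(v2)   # "numeric"
class(v3)   # "complex"
class(v4)   # "character"

# Converting back down is explicit and may lose information:
# "a" cannot be read as a number, so it becomes NA (R also warns about this;
# the warning is suppressed here only to keep the output short).
num <- suppressWarnings(as.numeric(c("1", "2", "a")))
num         # 1 2 NA
```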
Exercise: (a) Create a vector called mid.marks with the values 18, 20, 12, 15.
(b) Create a vector called grades whose members are "A", "O",
"A+", "A", "B".
(c) Create a vector called results whose members are 5 >= 2,
5 > 2, 5 < 2, 5 <= 2, 5 == 2.

Fetching the values of a vector

Now, let us inspect the contents of the object x. In R, indexing of elements of a vector
starts at 1. You can fetch the members of a vector by suffixing the object name with a pair
of square brackets [] and enclosing an integer inside of it. For example, x[2] means the
second member of the vector x. For example,

> x[3]
[1] 6

Whenever you see [1] as the first character of the R System response, it means that the
result of your command is a vector and the index of the first member is 1. Here, the result
of the command x[3] is a vector(because you see a []) of size one, since 6 is the only value it
displayed.
You can fetch more than one value using a command of the type

> vec1[vec2]

This works in two steps. First, create a vector, say vec2, whose elements are the indices of the
members of the vector vec1 that you want to pull out. Second, pass vec2 as the index of vec1.
To get the first two members of our earlier vector x, use the command:

> x[c(1,2)]
[1] 5 2

So, the result of the command x[c(1,2)] is a vector whose first member is 5 and the second
member is 2.
Similarly, x[c(4,2,3)] results in the vector 10, 2, and 6.

The : operator

The : operator creates a sequence of whole numbers differing by one. The general syntax is
> start_value : end_value

For example, the command 1 : 5 results in a vector consisting of values 1, 2, 3, 4, 5.


The command, 5:1 results in vector of values 5, 4, 3, 2, 1.
The colon operator is very useful in subsetting vectors.
For example, x[c(1,2)] is equivalent to x[1:2].
In R, you can use a negative integer as an index. If a single negative integer is used as an
index, that particular element is omitted from the result.
For example, to display all the elements except the second member of the object x,

> x[-2]
[1] 5 6 10
>
> x
[1] 5 2 6 10

Remember that the command x[-2] does not remove the element from x. To delete the
second element from x, you have to assign x[-2] to x:

> x <- x[-2]


>
> x
[1] 5 6 10

To find the size of a vector, use the function length(). For example, for a numeric vector y of seven values:


> y
[1] 15 43 11 8 26 51 30
>
> length(y)
[1] 7

To display all the members except the last two:


> n <- length(y)


> y[-c((n-1),n)]
[1] 15 43 11 8 26
>

After the deletion above, the length of x is 3. Unlike in C, the size of a vector in R can be
increased or decreased.
To add an element, say 100, in the beginning of the vector x,

> x <- c(100,x)


>
> x
[1] 100 5 6 10
>

To add an element, say 200, at the end of the vector x,


>
> x <- c(x,200)
>
> x
[1] 100 5 6 10 200
>

To add an element, say 222, between 6 and 10 of the vector x,


>
> x <- c(x[1:3],222,x[4:length(x)])
>
> x
[1] 100 5 6 222 10 200
>
>
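
As an alternative to splitting and recombining the vector by hand, base R provides the append() function, which inserts values after a given position. A minimal sketch, using the same values as above (the names x1 and x2 are illustrative):

```r
x <- c(100, 5, 6, 10, 200)

# Manual approach used in the text: split x after position 3 and recombine
x1 <- c(x[1:3], 222, x[4:length(x)])

# Same result with append(): insert 222 after the 3rd element
x2 <- append(x, 222, after = 3)

x2                  # 100 5 6 222 10 200
identical(x1, x2)   # TRUE
```

The append() form avoids writing the index arithmetic yourself, which matters most when the insertion point is the very end of the vector.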

Create a new vector y as follows:

> y <- c(13, 43, 11, 8, 26, 51, 30)

Displaying all the members except first two:

> y[-c(1:2)]
[1] 11 8 26 51 30

To change the contents of particular elements

> y[1] <- 15 # modifying the contents of y[1]


>
> y
[1] 15 43 11 8 26 51 30
>


To change the contents of more than one element

> y[c(4,6)] <- c(10,50)


> y
[1] 15 43 11 10 26 50 30
>
Exercise: (a) Create a character vector called 'weekdays' whose members are
'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat' (in that order).
(b) Fetch the subvector ('Sun','Mon','Sat'), and name that vector
'sms'.
Do this using
(i) numeric indexing
(ii) logical values as indices
(c) Create a vector called 'mid.marks' with elements 18, 15, 19, and
20.
Suppose the entry 15 was actually 18. Make the necessary correction
to the 'mid.marks' vector.
(d) You have the 'mid.marks' vector as (18, 18, 19, 20). Rearrange the
members of mid.marks by swapping element-1 with element-4,
and element-3 with element-2.
Do this using
(i) numeric indexing
(ii) logical values as indices

Vectorization:
>
> x <- 1:5
> x
[1] 1 2 3 4 5
>
> x^2
[1] 1 4 9 16 25
>
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
>
> sqrt(x^2)
[1] 1 2 3 4 5
>
>

Notice that the command x^2 computes the square of each element of the vector x.
Similarly, the command sqrt(x) computes the square root of every member of the vector x.
Such operations are called vectorized operations. To achieve this in a procedure-oriented
programming language such as C, one has to use a "for loop". In R, vectorized operations
avoid the need for explicit "for loops".
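
To make the contrast concrete, the sketch below computes the squares of 1 to 5 twice: once with an explicit loop, as one would in C, and once with the vectorized expression. The names sq.loop and sq.vec are illustrative only.

```r
x <- 1:5

# Loop version: fill a result vector one element at a time
sq.loop <- numeric(length(x))
for (i in seq_along(x)) {
  sq.loop[i] <- x[i]^2
}

# Vectorized version: one expression, no explicit loop
sq.vec <- x^2

sq.vec                   # 1 4 9 16 25
all(sq.loop == sq.vec)   # TRUE
```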


Recycling:
Recycling refers to the way a shorter vector is repeated to match the length of the
longer vector when an operation is performed on two vectors of unequal length. We illustrate
this with reference to the addition operation. For example,

(i) If both the vectors are of equal length, the addition of two vectors is performed as follows:

> a <- c(1,2,3)


> b <- c(9,7,5)
> a + b
[1] 10 9 8

(ii) If one of the vectors involved in an addition operation has a length twice that of the
other, the smaller vector gets recycled until its length equals the length of the larger
vector, and then the addition is performed on the resulting vectors.

> a1 <- c(a,4,5,6)


> a1
[1] 1 2 3 4 5 6
> b <- c(9,7,5)
> a1 + b #(1,2,3,4,5,6)+(9,7,5,9,7,5)
[1] 10 9 8 13 12 11
>

(iii) Suppose the two vectors involved in an addition operation are unequal in their lengths
and the length of the larger vector is not an integer multiple of the smaller vector.
Then the smaller vector gets recycled to meet the length of the larger vector and
then addition operation is performed. In this case, we also get a warning about the
differences in the lengths of the vectors involved in the operation.

> a2 <- c(a,4)


> a2
[1] 1 2 3 4
> b <- c(9,7,5)
> a2 + b # (1,2,3,4)+(9,7,5,9)
[1] 10 9 8 13
Warning message:
In a2 + b : longer object length is not a multiple of
shorter object length
>
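
Recycling can be made explicit with the base R function rep_len(), which stretches a vector to a requested length. The sketch below reproduces case (iii) by hand; the names b.long and s are illustrative. Because the stretched vector already has the right length, no warning is issued.

```r
a2 <- c(1, 2, 3, 4)
b  <- c(9, 7, 5)

# Stretch b to the length of a2, exactly as recycling would
b.long <- rep_len(b, length(a2))   # 9 7 5 9
s <- a2 + b.long                   # 10 9 8 13

# A length-one vector is recycled in the same way
a2 * 2                             # 2 4 6 8
```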

More on Subsetting Vectors:


The process of obtaining a subset of members of a vector is called filtering or subsetting.

Filtering a vector for a subset of its elements can be done in one of the following methods:


(i) using a vector of indices,


(ii) using a vector of logical values,
(iii) using a vector of member names,
(iv) using negative indices

Filtering by numeric indexing:


Indexing of members of a vector starts at 1. In general, subsetting of a vector takes the
form:
vector1[vector2]

where vector1 is the vector to be filtered and vector2 is a vector of indices or logical
values or member names.
> x<- c(5,16,18,8,1,11,4,10,15)
> x
[1] 5 16 18 8 1 11 4 10 15
>

Suppose we want to filter the vector for 4th, 5th and 1st elements in that order. Then,
with reference to the general syntax for subsetting a vector, vector2 is then c(4,5,1) and
vector1 is x. Thus, x[c(4,5,1)] is the command to be used to fetch the required members
from the vector x.
> x[c(4,5,1)]
[1] 8 1 5
>

Filtering by negative indexing


Negative indices were introduced above. They allow us to omit one or
more elements of a vector.
For example, x[-1] displays all the elements of x except the first element; x[-c(2,5)] displays
all the elements of x except 2nd and 5th elements.


> x
[1] 5 16 18 8 1 11 4 10 15
>
> # omitting the 3rd element
> x[-3]
[1] 5 16 8 1 11 4 10 15
>
> # leaving the last element
> x[-length(x)]
[1] 5 16 18 8 1 11 4 10
>
> # Getting all the elements except the first three
> x[-c(1:3)]
[1] 8 1 11 4 10 15
>

Filtering by logical values


A vector of logical values can also be used as an indexing vector. For example, if y = c(1,2,3,4,5),
then to fetch the alternate values starting at index 1 use the command

y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] or
y[c(TRUE,FALSE)]

> y <- 1:5


> y
[1] 1 2 3 4 5
> y[c(TRUE,FALSE,TRUE,FALSE,TRUE)]
[1] 1 3 5
>
> y[c(TRUE,FALSE)]
[1] 1 3 5
>

A logical indexing vector normally has the same length as the vector to be filtered. If it
is shorter, R recycles it until its length equals that of the vector being filtered. This is
why the command y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] is equivalent to y[c(TRUE,FALSE)].
In practice, the logical vector is created from a condition to be met by the elements of the
vector. For example, suppose we are interested in the members of x whose values are larger
than 6. The condition is x>6, and evaluating it results in a vector of 9 (the length of x) logical
values. This is because the operation x>6 involves two vectors of unequal length.
By the recycling property, the smaller one (6, being a vector of size 1) gets recycled
to meet the size of the larger vector. Now we have

(5,16,18,8,1,11,4,10,15) > (6,6,6,6,6,6,6,6,6),


which is equivalent to the vector

(5>6, 16>6, 18>6, 8>6, 1>6, 11>6, 4>6, 10>6, 15>6 ).

In fact, operators such as + and > are all functions. As the operation involved here is
relational, its result is logical. Hence, we have

(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE).

R keeps the elements at positions where the indexing vector is TRUE and drops those where
it is FALSE. As an index, the logical vector

c(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)

selects the same elements as the numeric index vector c(2,3,4,6,8,9), namely all the values
of x larger than 6.
> x
[1] 5 16 18 8 1 11 4 10 15
>
> x>6
[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
>
> x[x>6]
[1] 16 18 8 11 10 15
>

In practice, we use a function called which() to get the indices of a vector satisfying some
condition. In the above illustration, which(x>6) will return the indices of the vector at which
the values of x are larger than 6.

> which(x>6) # returns indices


[1] 2 3 4 6 8 9
>
> x[which(x>6)] # returns actual values
[1] 16 18 8 11 10 15
>
> x[x>6] # returns actual values
[1] 16 18 8 11 10 15
>
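
Conditions can be combined with the logical operators & and | listed earlier, and negated with !. The following sketch applies a few such filters to the same vector x; it is an illustration of the idea rather than an exhaustive list.

```r
x <- c(5, 16, 18, 8, 1, 11, 4, 10, 15)

x[x > 6 & x < 15]      # both conditions hold:   8 11 10
x[x < 5 | x > 15]      # either condition holds: 16 18 1 4
x[!(x > 6)]            # negated condition:      5 1 4
sum(x > 6)             # number of TRUE values:  6
which(x == max(x))     # position of the largest value: 3
```

Note that sum() works on a logical vector because TRUE and FALSE are coerced to 1 and 0.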

As a C programmer, you know the basic types (integer, floating-point, character, logical) supported by the
C language. R too has some basic types, which are usually referred to as atomic types. They are: numeric,
integer, character, logical, complex and raw. The numeric type is like double in the C language.

Like C, R is a case sensitive language.

R supports a number of data structures, namely vectors, matrices, arrays, data frames and lists.

There are no scalars in R language.

">" is the command prompt in R.

"<-" is the assignment operator. You can also use "=" as an assignment operator, but the preferred symbol
is "<-". It is a matter of personal taste.

Consider the following:

> x <- 2

>

> x

[1] 2

[1] represents the index of the value 2 in vector named x. [] is the indexing operator.

The indexing of the members of vectors starts at 1 and not at 0 as in the C language.

The above command is the same as


> x[1]

[1] 2

If you type the number 2 at the command prompt of R, it responds as

> 2

[1] 2

That is 2 is the first member of an unnamed vector. Vectors in R work like one-dimensional arrays in C
language. So, the members of a vector must be of the same atomic type or mode. That is, we can have
numeric vectors, character vectors, logical vectors, vectors of complex numbers.

You can also assign the value 2 to the object x as follows:

> 2 -> x

> x

[1] 2

>

This form is not conventional, but note that it is possible.

A number such as 2 is by default numeric.

To check whether 2 is an integer, invoke the function is.integer() on 2.


> is.integer(2)

[1] FALSE

To check whether 2 is numeric, invoke the function is.numeric() on 2.

> is.numeric(2)

[1] TRUE

>

If you want to convert the number 2 into an integer, use the function as.integer().

> x <- as.integer(2)

> is.integer(x)

[1] TRUE

By suffixing a numeric constant with the letter L, you can write that number as an integer rather than
as a numeric value.

> is.integer(2L)

[1] TRUE

You may be wondering whether R also has functions such as

is.character(),

is.logical(),

is.complex(),

as.character(),

as.logical(),

as.complex().

It does: the R system provides all of these functions.
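
A few calls to these functions are sketched below. The input values are arbitrary illustrations; the results shown in the comments follow from the documented behaviour of the base R functions.

```r
is.character("a")      # TRUE
is.logical(TRUE)       # TRUE
is.complex(1+2i)       # TRUE

as.character(12)       # "12"
as.logical(0)          # FALSE (zero is FALSE, any non-zero number is TRUE)
as.logical("T")        # TRUE
as.complex(2)          # 2+0i
as.integer(3.9)        # 3 (the fractional part is dropped, not rounded)
```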

The symbol # is used to make comments in R scripts. R ignores everything written after #.

R as a Calculator:

------------------

> 2 + 3 # + is for addition

[1] 5

>

>

> 2 - 3 # - is for subtraction

[1] -1

>

>

> 2 * 3 # * is for multiplication

[1] 6

>

>
> 2 / 3 # / is for division

[1] 0.6666667

>

>

> 2^3 # ^ is for exponentiation

[1] 8

>

>

> 2%%3 # %% is for getting the remainder, x mod y

[1] 2

>

> 5%/%2 # %/% is for integer division

[1] 2

>
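
The two operators %/% and %% belong together: the integer quotient times the divisor, plus the remainder, gives back the original number. A short sketch with illustrative values (the names q, r and v are not from the notes):

```r
x <- 17
y <- 5

q <- x %/% y          # integer quotient: 3
r <- x %% y           # remainder: 2
q * y + r == x        # TRUE

# Typical use of %%: pick out the even numbers
v <- 1:6
v[v %% 2 == 0]        # 2 4 6
```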

Mostly, we store data in variables and use those variables in our computations. A variable name must
begin with a letter (or with a period that is not followed by a digit), and the remaining characters may be
letters, digits, periods or underscores. The period (.) and the underscore (_) are commonly used to create
multi-word variable names; for example, x.mean (or x_mean) could be a variable name representing the
average value of the variable x.

> x <- 2

> y <- 3

>

> x + y

[1] 5

> x - y
[1] -1

>

> x * y

[1] 6

> x / y

[1] 0.6666667

>

> x^y

[1] 8

>

Creating Sequences of Numbers:

------------------------------

The colon operator

------------------

While subsetting the various data structures, we frequently need sequences of integers. The
colon (:) operator creates a vector whose successive members differ by one. The left operand of the
colon is the starting value and the right operand is the ending value of the sequence. If the left operand
is smaller than the right operand, the result is an increasing sequence; if it is larger, the result is a
decreasing sequence. The operands need not be integers: the sequence starts at the left operand and
stops before it would pass the right operand.

> 1:5

[1] 1 2 3 4 5

>
> 5:1

[1] 5 4 3 2 1

>

>

> 1.2:5

[1] 1.2 2.2 3.2 4.2

>

> 5.05:3

[1] 5.05 4.05 3.05

>
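
One point worth checking when the colon operator is combined with arithmetic is operator precedence: the colon is evaluated before the binary + and -, but after a unary minus. The sketch below uses an illustrative value n = 5.

```r
n <- 5

1:n - 1      # 0 1 2 3 4   (1:n is formed first, then 1 is subtracted)
1:(n - 1)    # 1 2 3 4     (parentheses force the subtraction first)

-1:2         # -1 0 1 2    (unary minus applies to 1 before the colon)
-(1:2)       # -1 -2       (the form needed for negative indexing)
```

This is why the earlier examples write x[-c(1:3)] or y[-c((n-1),n)] with explicit parentheses.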

The seq() function:

-------------------

The seq() function will be useful to create vectors of specific pattern. The following are different forms
of seq() function using which we can create vectors:

(i) seq(from,to)

For example seq(from=1, to=5) or simply, seq(1,5) will create a vector of values starting from 1 to 5
incremented by 1. That is, it creates the vector (1,2,3,4,5). Here the start value is smaller than the end
value. If the start value is larger than the end value, it creates a vector of values in descending order. For
example, seq(5,1) will return the vector (5,4,3,2,1).

> seq( 1, 5 )

[1] 1 2 3 4 5
> seq( 5, 1 )

[1] 5 4 3 2 1

>

> seq(1.1,6)

[1] 1.1 2.1 3.1 4.1 5.1

>

> seq(5.2,1.2)

[1] 5.2 4.2 3.2 2.2 1.2

>

> seq(5.2,0)

[1] 5.2 4.2 3.2 2.2 1.2 0.2

>

Note that the first argument to the function seq() is assigned to the named parameter from and the
second argument is assigned to the named parameter to. Both from and to have a default value of 1.

If our objective is not a sequence of integer values, we can specify the increment value using the named
parameter by. The general form of seq() function with three parameters is:

(ii) seq(from,to,by)

Note that by also has a default value, which is obtained from the expression
(to-from)/(length.out-1), where length.out is the size of the vector.

seq(1,5) creates the vector (1,2,3,4,5) since by=(5-1)/(5-1)=1.


By specifying a particular value we get the desired vector of values. For example, seq(1,3,0.5) will return
the vector (1.0, 1.5, 2.0, 2.5, 3.0).

> seq( 1, 5, 2 )

[1] 1 3 5

> seq( 3, 1, -.5 )

[1] 3.0 2.5 2.0 1.5 1.0

On some occasions, we want a vector of a specific length whose start value and end value are
known. On such occasions, use the following form of the seq() function:

(iii) seq(from, to, length.out)

Suppose we want a vector of 12 values that starts with 1 and ends with 2.

> seq(1,2,len=12)

[1] 1.000000 1.090909 1.181818 1.272727 1.363636 1.454545 1.545455 1.636364

[9] 1.727273 1.818182 1.909091 2.000000

>

Note that, the parameter length.out can be abbreviated as len or length.

Consider the function call seq(5). It generates the sequence of integers 1,2,3,4,5, because a single
numeric argument of length one is treated as the end value, as in seq(1,len=5). That is, the function calls
seq(from=5) and seq(length.out=5) are equivalent. The related form seq(along.with=x) generates the indices
1,2,...,length(x) of a vector x; note that seq(along.with=5) returns just 1, because 5 is a vector of length
one. So the seq() function also supports the function prototypes:
(iv) seq(from)

(v) seq(along.with=)

(vi) seq(length.out=)
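
The one-argument forms can be compared side by side. The sketch below uses an illustrative vector marks; seq_len() and seq_along() are base R shortcuts for forms (vi) and (v) respectively.

```r
marks <- c(18, 20, 12, 15)

seq(5)                    # 1 2 3 4 5
seq(length.out = 5)       # 1 2 3 4 5
seq(along.with = marks)   # 1 2 3 4 (one index per element of marks)

# Base R shortcuts for the last two forms
seq_len(5)                # 1 2 3 4 5
seq_along(marks)          # 1 2 3 4

# by and length.out can be combined when to is omitted
seq(from = 10, by = 5, length.out = 4)   # 10 15 20 25
```

seq_along(marks) is the usual way to loop over the positions of a vector, since it also behaves sensibly when the vector is empty.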

The rep() function:

-------------------

Use the rep() function to create a vector in which a value or a pattern of values is repeated.

rep(x, times)

where x is the vector to be repeated and times gives the number of repetitions. If times is a single number,
the whole of x is repeated that many times; if times is a vector of the same length as x, it specifies the
number of times each member of x is to be repeated. For example,
rep(4,3) returns the vector (4,4,4).

> rep(4,3)

[1] 4 4 4

To create the vector (1,1,1,3,3)

> rep(c(1,3),c(3,2))

[1] 1 1 1 3 3

>

> rep( 1:3, 2 )

[1] 1 2 3 1 2 3
>

More complicated patterns can be repeated by specifying pairs of equal-sized vectors.

> rep(c("S","F"),c(2,3))

[1] "S" "S" "F" "F" "F"

>
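
Besides the number of repetitions, rep() also accepts the named arguments each and length.out. The sketch below shows how they interact; the input values are illustrative.

```r
rep(1:3, times = 2)                # 1 2 3 1 2 3
rep(1:3, each = 2)                 # 1 1 2 2 3 3
rep(1:3, each = 2, times = 2)      # 1 1 2 2 3 3 1 1 2 2 3 3
rep(c("S", "F"), length.out = 5)   # "S" "F" "S" "F" "S"
```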

The c() function

----------------

The most frequently used function to create a vector is c(). The name of the function c is usually
understood to represent combine or concatenate. The general form of creating a vector using the c()
function is:

c(list of values separated by commas)

Examples:

(a) c(2,4,7,10) creates a numeric vector consisting of values 2,4,7,10.

(b) c('a','b','c')(or c("a","b","c")) creates a character vector whose members are "a", "b", "c".

(c) c(TRUE,FALSE,FALSE,TRUE) will create a logical vector.


(d) c(1,2,'a','b') creates a character vector whose members are "1","2","a","b".

(e) The c() function can be used to combine two or more vectors into a single vector.

For example, if x=c(1,2,3) and y=c(4,5,6), then

--- c(x,y) results in the vector (1,2,3,4,5,6).

--- c(x,y,7) results in the vector (1,2,3,4,5,6,7).

The members of a vector can also have names attached to them.

For example,

c(Maths=20, Phy=18, Chem=19)

will create a vector with named members. These names will be useful in filtering the vector.

> x <- c(Maths=20, Phy=18, Chem=19)

>

> x

Maths Phy Chem

20 18 19

>

It is possible to give names to an already created vector using the function names().
For example, let x = c(1,2,3). Then

> names(x) <- c("One","Two","Three")

> names(x)

[1] "One" "Two" "Three"

>

> x

One Two Three

1 2 3

Subsetting Vectors

------------------

Fetching a subset of members of a vector is called filtering or subsetting a vector.

Filtering a vector for a subset of its elements can be done in one of the following methods:

(i) using a vector of indices,

(ii) using a vector of logicals,

(iii) using a vector of member names,

(iv) using negative indices

Filtering by numeric indexing:

-----------------------------
Indexing of members of a vector starts at 1. In general, subsetting of a vector takes the form:

vector1[vector2]

where vector1 is the vector to be filtered and vector2 is a vector of indices or logicals or member names.

> x<- c(5,16,18,8,1,11,4,10,15)

> x

[1] 5 16 18 8 1 11 4 10 15

>

Suppose we want to filter the vector for 4th, 5th and 1st elements in that order. Then, with reference to
the general syntax for subsetting a vector, vector2 is then c(4,5,1) and vector1 is x. Thus, x[c(4,5,1)] is
the command to be used to fetch the required members from the vector x.

> x[c(4,5,1)]

[1] 8 1 5

>

Filtering by negative indexing

------------------------------

The R system also allows us to use negative indices. Negative indices allow us to omit one or more
elements of a vector.

For example, x[-1] displays all the elements of x except the first element; x[-c(2,5)] displays all the
elements of x except 2nd and 5th elements.
> x

[1] 5 16 18 8 1 11 4 10 15

>

> # omitting the 3rd element

> x[-3]

[1] 5 16 8 1 11 4 10 15

>

> # leaving the last element

> x[-length(x)]

[1] 5 16 18 8 1 11 4 10

>

> # Getting all the elements except the first three

> x[-c(1:3)]

[1] 8 1 11 4 10 15

>

Filtering by logical values

---------------------------

A vector of logical values can also be used as an indexing vector. For example, if y = c(1,2,3,4,5), then to
fetch the alternate values starting at index 1 use the command

y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] or

y[c(TRUE,FALSE)]
> y <- 1:5

> y

[1] 1 2 3 4 5

> y[c(TRUE,FALSE,TRUE,FALSE,TRUE)]

[1] 1 3 5

>

> y[c(TRUE,FALSE)]

[1] 1 3 5

>

A logical indexing vector normally has the same length as the vector to be filtered. If it is shorter, R
recycles it until its length equals that of the vector being filtered. This is why the command
y[c(TRUE,FALSE,TRUE,FALSE,TRUE)] is equivalent to y[c(TRUE,FALSE)].

In practice, the logical vector is created from a condition to be met by the elements of the vector. For
example, suppose we are interested in the members of x whose values are larger than 6. The condition is
x>6, and evaluating it results in a vector of 9 (the length of x) logical values. This is because the
operation x>6 involves two vectors of unequal length. By the recycling property, the smaller one (6, being a
vector of size 1) gets recycled to meet the size of the larger vector. Now we have

(5,16,18,8,1,11,4,10,15) > (6,6,6,6,6,6,6,6,6),

which is equivalent to the vector

(5>6, 16>6, 18>6, 8>6, 1>6, 11>6, 4>6, 10>6, 15>6 ).


In fact, operators such as + and > are all functions. As the operation involved here is relational, its
result is logical. Hence, we have

(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE).

R keeps the elements at positions where the indexing vector is TRUE and drops those where it is FALSE.
As an index, the logical vector

c(FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)

selects the same elements as the numeric index vector c(2,3,4,6,8,9), namely all the values of x larger
than 6.

> x

[1] 5 16 18 8 1 11 4 10 15

>

> x>6

[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE

>

> x[x>6]

[1] 16 18 8 11 10 15

>

In practice, we use a function called which() to get the indices of a vector satisfying some condition. In
the above illustration, which(x>6) will return the indices of the vector at which the values of x are larger
than 6.
> which(x>6) # returns indices

[1] 2 3 4 6 8 9

>

> x[which(x>6)] # returns actual values

[1] 16 18 8 11 10 15

>

> x[x>6] # returns actual values

[1] 16 18 8 11 10 15

>

Filtering by names

------------------

The c() function can be used to create a vector having its members given some names(named vectors).
For example,

> n = c(phy=12,math=13,chem=20)

> n["phy"]

phy

12

> n[c("phy","chem")]

phy chem

12 20

>

> n[c(TRUE,FALSE,TRUE)]

phy chem
12 20

>

Note that the member names must be enclosed in quotes in the indexing vector.
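
Names can also be combined with the other filtering methods. The sketch below reuses the vector n from above; the name "bio" is a deliberately non-existent member, included to show what R returns in that case.

```r
n <- c(phy = 12, math = 13, chem = 20)

high <- names(n)[n > 12]          # names of members above 12: "math" "chem"
rest <- n[names(n) != "math"]     # drop a member by name: phy 12, chem 20
val  <- n[["chem"]]               # 20, the value only, without its name
miss <- n["bio"]                  # NA: there is no member named "bio"
```

Note that negative indexing does not work with names: an expression such as n[-"math"] is an error, so the comparison names(n) != "math" is used instead.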

------------------------------------------------------------------------------

Operators:

operator function

-------- --------

+ addition

- subtraction

* multiplication

/ division

^ exponentiation

%% modulus

: create a sequence of numbers differing by 1

-------- ------------------

<- assignment operator

= assignment operator

-> assignment operator

-------- -------------------

> greater than

< less than


>= greater than or equal to

<= less than or equal to

== equality comparison

!= not equal to

-------- -------------------

Functions:

(a) creating vectors

(i) c()

(ii) seq()

(iii) rep()

(b) to check atomic types

(i) is.numeric()

(ii) is.integer()

(iii) is.character()

(iv) is.logical()

(c) to convert atomic types

(i) as.numeric()

(ii) as.integer()

(iii) as.character()

(iv) as.logical()
(d) which() --- returns the indices at which a logical vector is TRUE.

(e) names() --- when invoked on a vector, it returns the names of the members(if the members are
named) otherwise returns NULL
R - The sample() function

The R system provides a function called sample(), with which you can draw random samples with
replacement, without replacement, or with unequal selection probabilities. The first parameter of
this function is the data vector from which to sample. The second parameter is the sample
size. To sample with replacement, use the option replace = TRUE. To sample without
replacement, use replace = FALSE, which is the default.
So, if you use only the first two parameters, you get a without-replacement sample, provided
the sample size does not exceed the length of the data vector. This function can also be used
to sample different values with different probabilities. To achieve this, use the
prob = option. It is given a vector of probabilities whose length is the same as
that of the data vector. To illustrate the usage of the sample() function, suppose the data
vector consists of 0 and 1. These values may correspond to, say, tail and head respectively,
in a coin toss experiment. So, the command

> sample(c(0,1),5,replace =TRUE)


[1] 0 0 1 0 1
>

will result in a sequence of zeros and ones and will represent a realization of a coin toss
experiment for 5 times.
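
The sketch below shows the unequal-probability option and the without-replacement default. The seed value 123 and the probabilities 0.3 and 0.7 are arbitrary choices made for illustration; because the draws are random, no particular output is shown for them.

```r
set.seed(123)   # assumption: any fixed seed, used only to make the run repeatable

# Biased coin: tail (0) with probability 0.3, head (1) with probability 0.7
tosses <- sample(c(0, 1), size = 10, replace = TRUE, prob = c(0.3, 0.7))
length(tosses)    # 10
mean(tosses)      # proportion of heads in this realization

# Without replacement and with the size omitted: a random permutation of 1..5
perm <- sample(1:5)
sort(perm)        # 1 2 3 4 5
```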

Systematic Sampling

Step 1: Create a vector of the names of your classmates, in the order in which they appear in the
attendance register. Make sure each name is a single word. Call this vector mcs.stu.

Step 2: Use the sample() function to create a vector called marks. The data to the
sample() function must have values in the range 12 to 20. The length of the marks vector
must equal the length of the vector created in Step 1.

Step 3: Find the average of the vector marks. This is the mean of the population.

Step 4: Use names() function to assign the names in mcs.stu vector to the members of
marks vector created in Step 2.

Step 5: Use the sample() function to generate random start value between 1 and 5, where
5 is the Sampling Interval. Name this object as random.start

Step 6: Use the seq() function to generate the systematic sample labels. Name this vector
as sys.labels.

Step 7: Use the sys.labels vector as the indexing vector in marks vector to get a systematic
sample. Name the resulting vector as sys.sample.

Step 8: Invoke the mean() on sys.sample to get the systematic sample mean.

Step 9: Compare the agreement between the the population mean and the sample mean.
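
One possible way of carrying out the steps above is sketched below. The class size of 20, the placeholder names Stu1, ..., Stu20, the seed and the helper names pop.mean and sys.mean are assumptions; in the exercise the real register names are to be used.

```r
set.seed(2020)                           # assumption: fixed seed for a repeatable run

mcs.stu <- paste0("Stu", 1:20)           # Step 1: placeholder single-word names
marks   <- sample(12:20, length(mcs.stu),
                  replace = TRUE)        # Step 2: marks in the range 12 to 20
pop.mean <- mean(marks)                  # Step 3: population mean
names(marks) <- mcs.stu                  # Step 4: attach the names

random.start <- sample(1:5, 1)           # Step 5: random start, interval 5
sys.labels <- seq(random.start,
                  length(marks), by = 5) # Step 6: systematic sample labels
sys.sample <- marks[sys.labels]          # Step 7: the systematic sample
sys.mean   <- mean(sys.sample)           # Step 8: systematic sample mean

c(population = pop.mean, sample = sys.mean)   # Step 9: compare the two means
```

With 20 students and a sampling interval of 5, every possible random start gives a sample of 4 students.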

September 22, 2020
Manipulating Text Data

L. V. Rao

October 26, 2020

- ,, October 26, 2020 1/1


Manipulating Text Data

Some of the most basic tasks in character string processing are:


--- splitting
--- matching and
--- replacing.



strsplit() function
Split the elements of a character vector x into substrings according
to the matches to substring split within them.

strsplit(x, split)

where

x      character vector, each element of which is to be split.
split  character vector containing regular expression(s) to use
       for splitting. If split has length 0, x is split into
       single characters.

The strsplit() function returns a list object.


> string1 <- "A rolling stone gathers no mass"
>
> strsplit(string1, split=" ")
[[1]]
[1] "A" "rolling" "stone" "gathers" "no" "mass"
>



strsplit() function

> string1 <- "A rolling stone gathers no mass"


>
> string2 <- "Empty vessels make much noise"
>
> strVec <- c(string1, string2)
>
> strsplit(strVec, split = " ")
[[1]]
[1] "A" "rolling" "stone" "gathers" "no" "mass"

[[2]]
[1] "Empty" "vessels" "make" "much" "noise"
>



strsplit() function

If the strsplit() function is given a single string and the split
argument is an empty string, then strsplit() returns a list
whose single member is a vector of the individual characters.

> strsplit("Abdul Kalam", split="")


[[1]]
[1] "A" "b" "d" "u" "l" " " "K" "a" "l" "a" "m"
>

If the input is a vector of several strings, then the list contains
as many members as there are strings in the vector.


unlist() function

To get a vector object instead of a list, wrap the strsplit() call
in the unlist() function.

unlist( strsplit( x, split ) )

> unlist( strsplit( string1, split = " ") )


[1] "A" "rolling" "stone" "gathers" "no" "mass"
>



nchar() function
The nchar() function counts the number of characters in
--- a character string or
--- a vector of character strings.
The nchar() function returns a numeric vector whose elements
contain the sizes of the corresponding elements of the input vector.

nchar( x )

where x is a character vector.

> nchar("x")
[1] 1
> nchar("xy")
[1] 2
> nchar("xy ")
[1] 3



nchar() function

The presence or absence of white space in character strings is
important because it influences many text analysis results.


nchar() function

> str1 <- c("Math","Phy","Chem")


>
> nchar(str1)
[1] 4 3 4
>

nchar() returns 0 for an empty string. The related function nzchar()
is a fast way to find out whether the elements of a character vector
are non-empty strings or not.
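
A brief sketch of checking for empty strings follows. The vector x is made up for illustration, and its last element is a single space.

```r
x <- c("Math", "", "Chem", " ")   # hypothetical data; the last element is one space

nchar(x)        # number of characters in each element
nzchar(x)       # TRUE for non-empty strings; a single space counts as non-empty
x[nzchar(x)]    # keep only the non-empty strings
```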



grep() function

--- The function grep() is used to search for a pattern in a vector
    of strings. It returns the indices of the elements of the vector
    containing this pattern.
The syntax of the grep() function is:

grep( pattern, x, ignore.case = FALSE, value = FALSE)


grep() function

--- pattern is the character string (a regular expression) we want to match.
--- x is the character vector in which we want to find matches.
--- ignore.case specifies whether the matching should ignore case.
    By default, pattern matching is case sensitive, since it is
    assigned the value FALSE.
--- value specifies whether the grep() function returns the
    positions of the elements of x in which the pattern is found, or
    the elements themselves.
    By default, it returns the positions of the elements,
    since value = FALSE.
    By assigning TRUE to the value argument, we retrieve the
    matching elements of x.
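
The effect of the ignore.case and value arguments described above can be sketched as follows. The vector fruits is a made-up example.

```r
fruits <- c("Apple", "banana", "pineapple", "Cherry")   # hypothetical data

grep("apple", fruits)                        # case sensitive: position 3 only
grep("apple", fruits, ignore.case = TRUE)    # positions 1 and 3
grep("apple", fruits, ignore.case = TRUE,
     value = TRUE)                           # the matching elements themselves
```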



grep() function

Is there any car with the name Mazda included in the mtcars
dataset?

> car.names <- rownames(mtcars)


>
> grep(pattern = "Mazda", x = car.names)
[1] 1 2
>
> grep(pattern = "Mazda", x = car.names,
value = TRUE)
[1] "Mazda RX4" "Mazda RX4 Wag"
>



Subset the mtcars dataset such that it contains information only
on "Merc" cars

> grep(pattern = "Merc", x = car.names)


[1] 8 9 10 11 12 13 14
>
> grep( pattern = "Merc",
x = car.names,
value = TRUE)
[1] "Merc 240D" "Merc 230" "Merc 280" "Merc 280C"
[5] "Merc 450SE" "Merc 450SL" "Merc 450SLC"
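
The calls above locate the "Merc" rows. One possible way to complete the subsetting is to use those positions as row indices. The object names merc.rows and merc.cars are assumptions introduced for this sketch.

```r
car.names <- rownames(mtcars)

merc.rows <- grep(pattern = "Merc", x = car.names)  # positions of the Merc cars
merc.cars <- mtcars[merc.rows, ]                    # keep only those rows

nrow(merc.cars)        # number of Merc cars in the subset
rownames(merc.cars)    # their names
```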



noquote() function

Print character strings without quotes.

Usage : noquote(obj, right = FALSE)

obj    any R object, typically a vector of character strings.
right  logical; whether or not strings should be right aligned.


noquote() function
The following are two equivalent ways of printing character strings
without quotes

> rownames(mtcars)[1:4]
[1] "Mazda RX4" "Mazda RX4 Wag"
[3] "Datsun 710" "Hornet 4 Drive"
>
> noquote(rownames(mtcars))[1:4]
[1] Mazda RX4 Mazda RX4 Wag
[3] Datsun 710 Hornet 4 Drive
>
> print(rownames(mtcars)[1:4], quote = FALSE)
[1] Mazda RX4 Mazda RX4 Wag
[3] Datsun 710 Hornet 4 Drive



sub() function
The sub() function replaces the first occurrence of a substring in
a string with another substring.
The syntax of the sub() function is as follows.

sub(old, new, string)

where,

old     is the old substring (treated as a regular expression)
        that has to be replaced,
new     is the new substring that will take the place of the
        old substring,
string  is the string (or character vector) in which the
        substring has to be replaced.

In R's own documentation these three arguments are named pattern,
replacement and x.
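
A short sketch contrasting sub() with gsub(), which replaces every occurrence rather than only the first. The strings used are made up for illustration.

```r
s <- c("no pain no gain", "banana")   # hypothetical data

sub("no", "NO", s)     # only the first "no" in each element is replaced
gsub("no", "NO", s)    # every "no" in each element is replaced

gsub("a", "", "banana")   # replacing with "" deletes the matches
```

Elements with no match, such as "banana" in the first two calls, are returned unchanged.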



substr() function
While working with strings, a common operation is the extraction
or replacement of some characters.
The substr() function can be used for extracting or replacing
substrings in a character vector.

usage:
substr(x, start, stop)

x      is a character vector,
start  is the position of the first character to be extracted
       or replaced, and
stop   is the position of the last character to be extracted
       or replaced.


substr() function
The substr() function extracts parts of character
vectors.
--- To extract "gram" from "Programming"

> substr("Programming",4,7)
[1] "gram"
>

--- Replace the 1st character with J

> str<-c("no", "pain", "no", "gain")


> substr(str,1,1) <- "J"
> str
[1] "Jo" "Jain" "Jo" "Jain"
>



-------------------------------------------------------
1. How many car names have just one word in their name?

> cnames <- rownames(mtcars)
>
> car.lst <- strsplit(cnames, split=" ")
>
> y <- NULL
>
> for(k in 1:length(cnames)) y[k] <- length(car.lst[[k]])
>
> y
[1] 2 3 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 3 2 2 2
>
> sum(y==1)
[1] 1
>
> cnames[y==1]
[1] "Valiant"
>

-------------------------------------------------------

2. How many car names have two words in their name?

> car.lst <- strsplit(cnames, split=" ")


>
> y <- NULL
>
> for(k in 1:length(cnames)) y[k] <- length(car.lst[[k]])
>
> y
[1] 2 3 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 3 2 2 2
>
> sum(y==2)
[1] 28
>
> cnames[y==2]
[1] "Mazda RX4" "Datsun 710"
[3] "Hornet Sportabout" "Duster 360"
[5] "Merc 240D" "Merc 230"
[7] "Merc 280" "Merc 280C"
[9] "Merc 450SE" "Merc 450SL"
[11] "Merc 450SLC" "Cadillac Fleetwood"
[13] "Lincoln Continental" "Chrysler Imperial"
[15] "Fiat 128" "Honda Civic"
[17] "Toyota Corolla" "Toyota Corona"
[19] "Dodge Challenger" "AMC Javelin"
[21] "Camaro Z28" "Pontiac Firebird"
[23] "Fiat X1-9" "Porsche 914-2"
[25] "Lotus Europa" "Ferrari Dino"
[27] "Maserati Bora" "Volvo 142E"
>

-----------------------------------------------------------

3. How many car names have three words in their name? What are they?

> car.lst <- strsplit(cnames, split=" ")


>
> y <- NULL
>
> for(k in 1:length(cnames)) y[k] <- length(car.lst[[k]])
>
> y
[1] 2 3 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 3 2 2 2
>
> sum(y==3)
[1] 3
>
> cnames[y==3]
[1] "Mazda RX4 Wag" "Hornet 4 Drive" "Ford Pantera L"
>

--------------------------------------------------------------

4. How many one word, two word and three word car names are there?

> car.lst <- strsplit(cnames, split=" ")


> y <- NULL
> for(k in 1:length(cnames)) y[k] <- length(car.lst[[k]])
> y
[1] 2 3 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 3 2 2 2
> table(y)
y
1 2 3
1 28 3
>
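
The word counts in questions 1 to 4 can also be obtained without an explicit loop. The sketch below uses lengths(), available in base R from version 3.2.0 onwards. The name n.words is an assumption introduced here.

```r
cnames <- rownames(mtcars)

# lengths() returns the length of each member of the list from strsplit()
n.words <- lengths(strsplit(cnames, split = " "))

table(n.words)            # how many one, two and three word names
cnames[n.words == 3]      # the three word names
```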

---------------------------------------------------------

5. What is the longest car name?

> cnames[which(nchar(cnames)== max(nchar(cnames)))]


[1] "Lincoln Continental"
>

-----------------------------------------------------------

6. What is the shortest car name?

> cnames[which(nchar(cnames)== min(nchar(cnames)))]


[1] "Valiant"
>

--------------------------------------------------------------

7. How many Merc cars are there?

> length(grep("Merc", cnames))


[1] 7
>

------------------------------------------------------------

8. What is the minimum and maximum mileage of a "Merc" car?

> grep("Merc",cnames)
[1] 8 9 10 11 12 13 14
> ind <- grep("Merc",cnames)
> mtcars[ind,"mpg"]
[1] 24.4 22.8 19.2 17.8 16.4 17.3 15.2
> range(mtcars[ind,"mpg"])
[1] 15.2 24.4
> diff(range(mtcars[ind,"mpg"]))
[1] 9.2
>
---------------------------------------------------------------

9. Which Merc car has the maximum mileage?

> grep("Merc",cnames)
[1] 8 9 10 11 12 13 14
> ind <- grep("Merc",cnames)
> mtcars[ind,"mpg"]
[1] 24.4 22.8 19.2 17.8 16.4 17.3 15.2
>
> cnames[ind[which(mtcars[ind,"mpg"]==max(mtcars[ind,"mpg"]))]]
[1] "Merc 240D"
>

---------------------------------------------------------------

10. What is the average mileage of Merc cars with 8 cylinders?

> cnames <- rownames(mtcars)


>
> grep("Merc", cnames)
[1] 8 9 10 11 12 13 14
>
> ind <- grep("Merc", cnames)
>
> mtcars[ind, c("mpg","cyl")]
mpg cyl
Merc 240D 24.4 4
Merc 230 22.8 4
Merc 280 19.2 6
Merc 280C 17.8 6
Merc 450SE 16.4 8
Merc 450SL 17.3 8
Merc 450SLC 15.2 8
>
> merc.df <- mtcars[ind, c("mpg","cyl")]
>
> merc.df$cyl==8
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> merc.df[merc.df$cyl==8,"mpg"]
[1] 16.4 17.3 15.2
>
> mean(merc.df[merc.df$cyl==8,"mpg"])
[1] 16.3
Manipulating Text Data

L. V. Rao

October 29, 2020

- ,, October 29, 2020 1/1


substring() function

The assignment form of substring() allows replacement of selected
portions of character strings, but it never changes the length of
the string: a replacement that is longer than the implied substring
is truncated, and if the replacement is shorter than the implied
substring, only as many characters as the replacement contains are
overwritten.
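
The behaviour of the assignment form can be sketched as follows. The string pet is a made-up example.

```r
pet <- "dog cat duck"              # hypothetical string

substring(pet, 5, 7) <- "feline"   # too long: only "fel" fits in positions 5 to 7
pet

substring(pet, 5, 7) <- "a"        # too short: only position 5 is overwritten
pet

nchar(pet)                         # the length of the string has not changed
```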



substring() function

Individual characters of character values are not accessible
through ordinary subscripting. Instead, the substring() function
can be used either to extract parts of character strings,
or to change the values of parts of character strings. In addition
to the string being operated on, substring() accepts a
first= argument giving the first character of the desired
substring, and a last= argument giving the last character. If
not specified, last= defaults to a large number, so that specifying
just a first= value will operate from that character
to the end of the string. Like most functions in R, substring()
is vectorized, operating on multiple strings at once.


substring() function

The function substring() is used to extract or replace
sub-strings from a string.

Extraction Function
substring(text, first, last = 1000000L)

Replacement Function

substring(text, first, last = 1000000L) <- value



substring() function

The function substring() is used to extract sub-strings from a
string.

> substring("Programming", first = 4:6,


last = c(7,7,7))
[1] "gram" "ram" "am"
>
> Sys.time()
[1] "2020-10-20 20:25:13 IST"
> date()
[1] "Tue Oct 20 20:25:25 2020"
>



substring() function
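The year, month and day can be extracted from the current date and
time as follows (hours, minutes and seconds can be extracted in the
same way from positions 12-13, 15-16 and 18-19):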

> Sys.time()
[1] "2020-10-20 20:25:13 IST"
> date()
[1] "Tue Oct 20 20:25:25 2020"

> substring(Sys.time(), c(1, 6, 9), c(4, 7, 10))


[1] "2020" "10" "20"
>
> as.numeric(substring(Sys.time(), c(1, 6, 9),
c(4, 7, 10)))
[1] 2020 10 20
>
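
A sketch of extracting all six components is given below. To keep the example repeatable it uses a fixed time stamp (an assumption) instead of Sys.time(), and it formats the time explicitly with format() so that the character positions are known in advance.

```r
# Fixed time stamp, assumed for illustration; Sys.time() could be used instead
t0 <- as.POSIXct("2020-10-20 20:25:13", tz = "UTC")

stamp <- format(t0, "%Y-%m-%d %H:%M:%S")    # "YYYY-MM-DD HH:MM:SS"

parts <- as.numeric(substring(stamp,
                              c(1, 6, 9, 12, 15, 18),
                              c(4, 7, 10, 13, 16, 19)))
names(parts) <- c("year", "month", "day", "hour", "min", "sec")
parts
```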



substring() function

Extracting the last three characters from a string:

> x <- "Bombay"


> substring(x, nchar(x)-2, nchar(x))
[1] "bay"
>

Extracting the last three characters from a vector of strings:

> stu <- c("Ramani","Bhavani","Sravani")


> substring(stu, nchar(stu)-2, nchar(stu))
[1] "ani" "ani" "ani"
>



toupper() function

The toupper() function, as the name suggests, turns the input
character vector to upper case.
The syntax of the toupper() function is very simple.

toupper(x)

where x is the input character vector.

> toupper("Department of Statistics")


[1] "DEPARTMENT OF STATISTICS"
>



tolower() function

The tolower() function does the opposite of the toupper()
function. It turns the input character vector to lower case.
The syntax of the tolower() function is as follows.

tolower(x)

where x is the input character vector.

> tolower("Department of Statistics")


[1] "department of statistics"



Determining Features
# Computing document similarity
doc1 <- "Julia loves me more than Linda loves me"
doc1.words <- unlist(strsplit(doc1,split=" "))
doc2 <- "Jane likes me more than Julia loves me"
doc2.words <- unlist(strsplit(doc2,split=" "))
doc3 <- "John loves books more than music"
doc3.words <- unlist(strsplit(doc3,split=" "))
docs <- c(doc1.words,doc2.words,doc3.words)
words <- unique(docs)
words

[1] "Julia" "loves" "me" "more" "than"


[6] "Linda" "Jane" "likes" "John" "books"
[11] "music"



Creating Feature Vectors
# feature vector representation
x <- NULL; y <- NULL; z <- NULL
for(w in words)
{
x <- c(x,length(grep(w, doc1.words)))
y <- c(y,length(grep(w, doc2.words)))
z <- c(z,length(grep(w, doc3.words)))
}
x; y; z

> x
[1] 1 2 2 1 1 1 0 0 0 0 0
> y
[1] 1 1 2 1 1 0 1 1 0 0 0
> z
[1] 0 1 0 1 1 0 0 0 1 1 1
>
Computing Cosine Similarity

sim.xy <- sum(x*y)/(sqrt(sum(x^2))*sqrt(sum(y^2)))


sim.xz <- sum(x*z)/(sqrt(sum(x^2))*sqrt(sum(z^2)))
sim.yz <- sum(z*y)/(sqrt(sum(z^2))*sqrt(sum(y^2)))

sim.xy; sim.xz; sim.yz

> sim.xy
[1] 0.8215838
> sim.xz
[1] 0.4714045
> sim.yz
[1] 0.3872983
>
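
The three similarity computations repeat the same formula, so they can be wrapped in a function. The sketch below also counts words by exact equality rather than with grep(), which matches substrings. This is a swapped-in counting method that happens to give the same feature vectors for these three documents. The function names cosine.sim and count.words are assumptions introduced here.

```r
# Cosine similarity between two numeric vectors of equal length
cosine.sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Count how often each vocabulary word occurs (exact matches only)
count.words <- function(doc, vocab) {
  doc.words <- unlist(strsplit(doc, split = " "))
  as.vector(table(factor(doc.words, levels = vocab)))
}

doc1 <- "Julia loves me more than Linda loves me"
doc2 <- "Jane likes me more than Julia loves me"
doc3 <- "John loves books more than music"

vocab <- unique(unlist(strsplit(c(doc1, doc2, doc3), split = " ")))

x <- count.words(doc1, vocab)
y <- count.words(doc2, vocab)
z <- count.words(doc3, vocab)

c(xy = cosine.sim(x, y), xz = cosine.sim(x, z), yz = cosine.sim(y, z))
```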



scan() function

L.V. Rao

January 25, 2021

L.V. Rao scan() function


scan() command - numeric data
While creating an R data object using the c() command, the data
items are separated by commas.
The scan() command is helpful for reading a simple vector of
data values from the keyboard.
Unlike the c() command, for reading numeric values the
scan() command is used with empty parentheses and the
data values are separated by spaces.
> x <- scan()
1: 1 2 3 4 5
6:
Read 5 items
>
When we press the Enter key on an empty line, the R system
understands that the user has finished entering data
values. It also reports how many values were read.
scan() command - character data

For reading character data values using the scan() command,
we have to use the optional parameter what with the value
'character'. That is,
obj.name <- scan( what = 'character' )
> days1 <- scan(what = 'character')
1: 'Mon' 'Tue' 'Wed' 'Thu' 'Fri'
6:
Read 5 items
> days1
[1] "Mon" "Tue" "Wed" "Thu" "Fri"
>



scan() command - Clipboard data

The scan() command can be used to create data objects with the
data from other programs such as spreadsheets or notepad.
1. If the data are numbers in a spreadsheet, simply type the
command in R as usual before switching to the spreadsheet
containing the data.
2. Highlight the necessary cells in the spreadsheet and copy
them to the clipboard.
3. Return to R and paste the data from the clipboard into R.
As usual, R waits until a blank line is entered before ending
the data entry so you can continue to copy and paste more
data as required.
4. Once you are finished, enter a blank line to complete data
entry.



scan() command - Clipboard data

If the data are separated with simple spaces, you can simply
copy and paste.
If the data are separated with some other character, you need
to tell R which character is used as the separator.
For example, a CSV (comma-separated values) file uses
commas to separate the data items. To tell R you are using
this separator, simply add an extra part to your command like
so:
obj.name <- scan( sep = ',' )


scan() command - Data File

It is also possible to get the scan() command to read a file
directly.
To read a file with the scan() command, you simply add
file = 'filename' to the command.
Note that the filename must be enclosed in quotes (single or
double).
You can include the instruction file.choose() as part of your
scan() command. This opens a browser-type window where
you can navigate to and select the file you want to read.
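
Reading a comma-separated file with scan() can be sketched as follows. The file is a temporary one written by the example itself, so its name and contents are assumptions made only for illustration.

```r
# Create a small comma-separated file to read (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6"), tmp)

x <- scan(file = tmp, sep = ",")   # reads all six values into one numeric vector
x

# Interactive alternative: x <- scan(file = file.choose(), sep = ",")

unlink(tmp)                        # remove the temporary file
```

Note that scan() returns a single vector; the division of the file into lines is not retained.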



Looking into Current Directory

To know which files and folders exist in the current
directory, use the commands:
dir()
list.files()
The dir() command can also be given a path, specified
in quotes, to see the files in other directories.
The dir() command lists the files in alphabetical order, together
with their extensions. Folders are displayed simply by their
names.
To see the hidden files, use the command
> dir( all.files = TRUE )


Factor Data
Graphics in R

L. V. Rao

July 31, 2019

Data Visualization - Dr. L. V. Rao,, July 31, 2019 1 / 33


factor data

In versions of R before 4.0.0, character vectors included in
data frames were coerced by default to a different type called
factors; since R 4.0.0 they remain character vectors unless
stringsAsFactors = TRUE is specified.
A factor is a vector object that is used to provide a
compact way to represent categorical data.
A distinct value of a factor variable is called a level.
Each level of a factor vector represents a unique category
(e.g., female or male).
Levels of a factor are ordered alphabetically by default.
A factor variable stores the vector along with a list of the
levels of the factor.
The factor() function converts a vector into a factor
vector.


> gender <- c( "male", "female", "male",
"male", "female", "female")
> gender
[1] "male" "female" "male" "male" "female" "female"
>
> gender.f <- factor(gender)
> gender.f
[1] male female male male female female
Levels: female male
>
> str(gender.f)
Factor w/ 2 levels "female","male": 2 1 2 2 1 1
>



Factor output

Note the differences between the output of the factor vector
and a character vector.
1. Firstly, the absence of quotation marks indicates that the
   vector is no longer a character vector.
2. Internally, the factor vector (gender.f) is actually an
   integer vector containing only 1's and 2's, in which
   1 is defined as the level 'female' and 2 is defined as the
   level 'male'.
3. When printed, each entry is represented by a label, and
   the levels contained in the factor are listed below.


Ordering factor levels

Although the order of factor levels has no bearing on
most statistical procedures, and for many applications
alphabetical ordering is as valid as any other arrangement,
for some analyses (particularly those involving contrasts)
it is necessary to know the arrangement of factor levels.
Furthermore, for graphical summaries of some data,
alphabetical factor levels might not represent the natural
trends among groups.


Ordering Factor Levels
The order of existing factor levels can also be altered by
redefining a factor:
> (income <- sample(c("Low","Middle","High"),10,
replace=TRUE) )
[1] "Middle" "Middle" "High" "Middle" "Low"
[6] "Low" "Low" "Middle" "High" "Middle"
> ( income.f1 <- factor(income) )
[1] Middle Middle High Middle Low Low Low Middle
[9] High Middle
Levels: High Low Middle
> (income.f <- factor( income,
levels=c("Low","Middle","High")))
[1] Middle Middle High Middle Low Low Low Middle
[9] High Middle
Levels: Low Middle High
>
Ordering Factor Levels
In addition, some analyses perform different operations on
factors that are defined as 'ordered' compared to 'unordered'
factors. Regardless of whether you have altered the ordering of
factor levels or not, by default all factors are implicitly
considered 'unordered' until otherwise defined using the
ordered() function or the ordered = TRUE argument of factor().
> income.o <- factor(income, ordered=TRUE,
levels=c("Low","Middle","High"))
> income.o
[1] Middle Middle High Middle Low Low Low Middle
[9] High Middle
Levels: Low < Middle < High
>
> income.o[2] > income.o[5]
[1] TRUE
>
Ordering Factor Levels

There are a number of more convenient ways to generate
factors in R. Combinations of the rep() function and the
concatenation (c()) function can be used in a variety of ways
to produce identical results:

> shade <- factor(c(rep("no", 5), rep("full", 5)))


> shade <- factor(rep(c("no", "full"), c(5, 5)))
> shade <- factor(rep(c("no", "full"), each = 5))
> shade
[1] no no no no no full full full full full
Levels: full no



gl() function

Another convenient method of generating a factor when each
level of the factor has an equal number of entries (replicates)
is to use the gl() function. The gl() function takes as arguments
1. the number of factor levels,
2. the number of consecutive replicates per factor level,
3. the total length of the factor, and
4. a vector of factor level labels.


gl() function

# generate a factor with the levels 'no' and 'full',
# each repeated 5 times in a row
> shade <- gl(2, 5, 10, c("no", "full"))
> shade
[1] no no no no no full full full full full
Levels: no full
> shade <- gl(2, 1, 10, c("no", "full"))
> shade
[1] no full no full no full no full no full
Levels: no full
Notice that by default, the factor() function arranges the
factor levels in alphabetical order, whereas the gl() function
orders the factor levels in the order in which they are included
in the expression.
Order of Factor Levels in gl()
Consider a dataset that includes a factor variable with the
levels 'high', 'medium' and 'low'. Presented alphabetically, the
levels of the factor would be 'high' 'low' 'medium'. Those data
would probably be more effectively presented in the more
natural order of 'high' 'medium' 'low' or 'low' 'medium' 'high'.
When creating a factor, the order of factor levels can be
specified as a vector of labels. For example, consider a factor with
the levels 'low', 'medium' and 'high':

> income <- gl(3, 2, 6, c("low", "medium", "high"))


> income
[1] low low medium medium high high
Levels: low medium high



If the levels of a categorical variable in the data set are
coded as numbers, we need to convert the type of the
variable to factor using the factor() function, so that R
recognizes it as categorical.
You can use the function is.factor() to examine whether a
variable is a factor.
By default the levels are ordered alphabetically; they can be
examined using the levels() function.


A factor behaves like a special type of character vector, although
internally it is stored as integer codes.
In most cases character data is used to describe the other
data, and is not used in calculations. However, for some
computations qualitative variables are used.
To store character data as qualitative variables, a factor
data type is used. We will use qualitative or categorical
variables in some statistical techniques such as two
sample tests, experimental design, logistic regression etc.


ANOVA always uses some type of factor variable. The function
as.factor() can be used to convert data to data type factor.
Different factor levels are sometimes referred to as
"treatments" in ANOVA.


Factor variables serve to subdivide the data set into
categories.
Simple examples of factors are an individual's shoe size,
gender, race, and socio-economic status.
The possible values of a factor are called its levels. For
instance, the factor gender would have two levels, namely,
male and female. Socio-economic status typically has
three levels: high, middle, and low.


Types of Factor Variables

Factors may be of two types: nominal and ordinal.
Nominal factors have levels that correspond to names of
the categories, with no implied ordering.
Examples of nominal factors would be hair color, gender,
race, or political party. There is no natural ordering to
"Democrat" and "Republican"; the categories are just
names associated with different groups of people.
In contrast, ordinal factors have some sort of ordered
structure to the underlying factor levels. For instance,
socio-economic status would be an ordinal categorical
variable because the levels correspond to ranks associated
with income, education, and occupation. Another
example of ordinal categorical data would be class rank.


is.factor() function

If the levels of a categorical variable in a data set are coded as
numbers, we need to convert the type of the variable to factor
using the factor() function, so that R recognizes it as
categorical.
You can use the function is.factor() to examine whether a
variable is a factor.
For example, the smoke variable (smoking status) in some
data set is coded as 0 for mothers who did not smoke during
their pregnancy and 1 for mothers who smoked during their
pregnancy. R automatically considers this variable as
numerical.
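
The conversion described above can be sketched with a small made-up smoke vector. The data values and the labels "nonsmoker" and "smoker" are assumptions chosen for illustration.

```r
smoke <- c(0, 1, 0, 0, 1, 1, 0)      # hypothetical 0/1 smoking codes
is.factor(smoke)                     # a plain numeric vector is not a factor

# Convert the codes to a factor, attaching a descriptive label to each code
smoke.f <- factor(smoke, levels = c(0, 1),
                  labels = c("nonsmoker", "smoker"))

is.factor(smoke.f)
levels(smoke.f)
table(smoke.f)                       # counts per category
```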



Creating Factor Variables

Factors are typically created from a numeric or a character
vector (note that you cannot fill matrices or multidimensional
arrays using factor values; factors can only take the form of
vectors). To create a factor vector, use the function factor().


levels() function

The most important extra piece of information (or attribute)
that a factor object contains is its levels, which store the
possible values in the factor. These levels are printed at the
bottom of each factor vector. You can extract the levels as a
vector of character strings using the levels() function.
You can also relabel a factor using the levels() function.
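
Both uses of levels() can be sketched as follows. The four income values and the short labels "L", "M" and "H" are assumptions used for illustration.

```r
income.f <- factor(c("Middle", "High", "Low", "Low"),
                   levels = c("Low", "Middle", "High"))   # hypothetical data

levels(income.f)                      # extract the levels as a character vector

levels(income.f) <- c("L", "M", "H")  # relabel: new labels replace the old ones
                                      # position by position
income.f
```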



Subsetting Factor variables

Factor-valued vectors are subsetted in the same way as any
other vector.
Note that after subsetting a factor object, the object continues
to store all defined levels even if some of the levels are no
longer represented in the subsetted object.
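
The retention of unused levels can be sketched as follows; droplevels() from base R removes them when they are not wanted. The name sub.shade is an assumption introduced here.

```r
shade <- factor(rep(c("no", "full"), each = 5))

sub.shade <- shade[1:5]          # only the "no" entries remain

levels(sub.shade)                # both levels are still stored
table(sub.shade)                 # the unused level appears with a count of 0

levels(droplevels(sub.shade))    # unused levels removed
```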



Some functions

Functions like length() and which(), for example, work the same
way on factor objects as on other vectors.


> income.f
[1] Middle Middle High Middle Low Low Low
[8] Middle High Middle
Levels: Low Middle High
> table(income.f)
income.f
Low Middle High
3 5 2
> income.tab <- table(income.f)
> income.tab
income.f
Low Middle High
3 5 2
> prop.table(income.tab)
income.f
Low Middle High
0.3 0.5 0.2
> stu.tab <- with(df,table(Gender,Residence))
> stu.tab
Residence
Gender Nonresident Resident
Female 2 3
Male 2 3
> margin.table(stu.tab,1)
Gender
Female Male
5 5
> margin.table(stu.tab,2)
Residence
Nonresident Resident
4 6
>



How many students have scored more than 65, by gender?
> str(with(df,table(Gender,Score>65)))
'table' int [1:2, 1:2] 1 2 4 3
- attr(*, "dimnames")=List of 2
..$ Gender: chr [1:2] "Female" "Male"
..$ : chr [1:2] "FALSE" "TRUE"
>
> with(df,table(Gender,Score>65))

Gender FALSE TRUE


Female 1 4
Male 2 3
>
> with(df,table(Gender,Score>65))[,2]
Female Male
4 3
>
How many students have scored more than 65, by Residence
type?

> with(df,table(Residence,Score>65))

Residence FALSE TRUE


Nonresident 3 1
Resident 0 6
>
> with(df,table(Residence,Score>65))[,2]
Nonresident Resident
1 6
>



> with(df,tapply(Score,Gender,mean))
Female Male
73.4 73.6
>
> with(df,tapply(Score,Residence,mean))
Nonresident Resident
61.75000 81.33333
>
> with(df,tapply(Score,list(Gender,Residence),mean))
Nonresident Resident
Female 62.5 80.66667
Male 61.0 82.00000
>



mtcars data

What is the average mileage by gearbox type and number of
cylinders?

> with(mtcars,table(am,cyl))
cyl
am 4 6 8
0 3 4 12
1 8 3 2
> with(mtcars,tapply(mpg,list(am,cyl),mean))
4 6 8
0 22.900 19.12500 15.05
1 28.075 20.56667 15.40
>



> am.cyl.tab <- with(mtcars,tapply(mpg,list(am,cyl),mean))
> am.cyl.tab
4 6 8
0 22.900 19.12500 15.05
1 28.075 20.56667 15.40
> barplot(am.cyl.tab )
>



> with( mtcars, boxplot(mpg ~ am))

> with( mtcars, boxplot(mpg ~ cyl))
R - Matrices

A matrix is a two-dimensional R object, which can hold only data of the same type (integer,
numeric, character). The dimensions are called rows and columns.
There are at least three ways of creating a matrix:

— Using matrix() function


— Using rbind() and cbind() functions
— Using array() function

matrix() function
The basic function used to create a matrix is the matrix() function. It requires at least two
arguments: the first is the data (usually, a vector) out of which the matrix is to be
created, and the second specifies either the number of rows or the number of columns of the
matrix. By default, the matrix() function fills the elements of the matrix column-wise.
The syntax for the matrix() function is:

matrix(x, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)

where
x         is a vector
nrow      the number of rows of the matrix
ncol      the number of columns of the matrix
byrow     logical. If TRUE, elements are filled row-wise
dimnames  specifies the names of the rows and columns of the matrix

To illustrate creating a matrix using the matrix() function, let us first create a vector
consisting of the values from 1 to 12, say. Then, use this vector as the data of our matrix.
> x <- 1:12
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
>
> mat.x <- matrix(x, nrow = 3)
>
> mat.x
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
>

Note that, by default, the matrix() function fills the entries of the matrix column-wise.
Also, we used only the nrow= option of the matrix() function. From the length of the vector
and the size of the row dimension, it determines the number of columns. Let us now try the
ncol= option instead with the same data.

1
September 16, 2020
R - Matrices

> mat.x1 <- matrix(x, ncol = 4)


> mat.x1
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

From the above output, we observe that both function calls return the same matrix.
This is because the matrix() function fills the entries of the matrix column-wise. To force
the matrix() function to fill the entries of the matrix by row, set the byrow= option to TRUE.

> mat.x2 <- matrix(x, nrow = 3, byrow = TRUE)


> mat.x2
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12

To know the dimension of a matrix, use the dim() function.

> dim(mat.x2)
[1] 3 4

The dim() function when invoked on a matrix returns a vector consisting of two values:
the first member represents the number of rows and the second, the number of columns.

Giving Names to Rows and Columns

You can give names to the rows and columns of a matrix in a couple of ways: using the
dimnames= option of the matrix() function, or using the rownames() and colnames()
functions. The dimnames= option is used to name the rows and columns of the matrix at the
time of its creation. It expects a two-element list as its value (a list is a data structure in R):
the first member is a vector consisting of the names of the rows and the second member is a
vector containing the names of the columns.
Naming Rows and Columns using dimnames Option
> mat.x3 <- matrix( x, nrow = 3, byrow = TRUE,
dimnames = list( c("R1","R2","R3"),
c("C1","C2","C3","C4")))
> mat.x3
C1 C2 C3 C4
R1 1 2 3 4
R2 5 6 7 8
R3 9 10 11 12

The rownames() and colnames() functions are used to set the names of rows and
columns of an already created matrix. To illustrate the use of these functions, let us create
a matrix, called mat.y, as given below:


> ( mat.y <- matrix( 11:19, nrow = 3) )


[,1] [,2] [,3]
[1,] 11 14 17
[2,] 12 15 18
[3,] 13 16 19

To set the names to the rows of the matrix mat.y, use the rownames() as in the following
example:

> rownames( mat.y ) <- c("R-1","R-2","R-3")


> mat.y
[,1] [,2] [,3]
R-1 11 14 17
R-2 12 15 18
R-3 13 16 19

The rownames() function can also be used to get the row names of a matrix:

> rownames(mat.y)
[1] "R-1" "R-2" "R-3"
>

To set the names to the columns of the matrix mat.y, use the colnames() as in the
following example:

> colnames( mat.y ) <- c("C-1","C-2","C-3")


> mat.y
C-1 C-2 C-3
R-1 11 14 17
R-2 12 15 18
R-3 13 16 19

The colnames() function can also be used to get the column names of a matrix:

> colnames( mat.y )


[1] "C-1" "C-2" "C-3"

rbind() function
A matrix can be created out of several vectors of the same type and size by binding them
either row-wise, using the rbind() function, or column-wise, using the cbind() function.
Let us create three vectors, say:

> x <- c(1, 2, 3)


> y <- c(4, 5, 6)
> z <- c(7, 8, 9)

Let us now use the rbind() function to create a matrix out of the data already existing
in the form of the vectors x, y and z. Each vector becomes a row of the matrix created, and
the order of the vectors passed as arguments to the function determines the order of the rows
of the matrix. Also, the names of the vectors become the names of the rows of the matrix
created. The colnames() function can be used to set the column names of the matrix. The
rownames() function can be used to get the row names of the matrix as well as to change
these default names.
rbind() Function

> ( mat.r <- rbind(x,y,z) )


[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 9

Remember that all the vectors used with the rbind() function must be of the same type
and size. If they are of different sizes, a matrix will still be created, but it may not be the
desired matrix; R also issues a warning message to remind us about the difference in
sizes.
Use of rbind() Function with Vectors of Different Sizes
> x <- c(1:3) # length = 3
> y <- c(4:6) # length = 3
> z <- c(7,8) # length = 2
>
> rbind(x,y,z)
[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 7
Warning message:
In rbind(x, y, z) :
number of columns of result is not a
multiple of vector length (arg 3)

Note that recycling takes place in completing the last row elements.
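Because recycling only produces a warning, it can be useful to check the lengths before binding. The following is a minimal sketch of such a check; the helper name safe_rbind is hypothetical and is not part of R.

```r
# Hypothetical helper (not part of base R): bind vectors row-wise,
# but stop with an error if their lengths differ.
safe_rbind <- function(...) {
  vecs <- list(...)
  if (length(unique(lengths(vecs))) != 1) {
    stop("all vectors must have the same length")
  }
  do.call(rbind, vecs)
}

x <- 1:3
y <- 4:6
z <- c(7, 8)

m <- safe_rbind(x = x, y = y)     # works: both vectors have length 3

# The call with z fails instead of silently recycling 7, 8 to 7, 8, 7.
res <- tryCatch(safe_rbind(x = x, y = y, z = z),
                error = function(e) "length mismatch")
res
```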

cbind() function
The cbind() function can also be used to create a matrix using vectors of the same type and
size. However, the vectors used become the columns of the matrix and their names become
the names of the columns. The colnames() and rownames() functions can be used to modify
the names of the columns and rows respectively. The cbind() and rbind() functions can
also be used to add new columns and rows respectively to an existing matrix.


cbind() Function

> team1 <- c("Sujatha","Lalitha","Kavitha")


> team2 <- c("Somaiah","Rajaiah","Ramaiah")
> team3 <- c("John","Paul","Hogg")
>
> cbind(team1,team2,team3)
team1 team2 team3
[1,] "Sujatha" "Somaiah" "John"
[2,] "Lalitha" "Rajaiah" "Paul"
[3,] "Kavitha" "Ramaiah" "Hogg"

The functions rownames() and colnames() can still be used to set as well as get the
names of the rows and columns of matrices created using the cbind() or rbind() functions.
For example, suppose you have a matrix defined as below:

> teams <- rbind(team1,team2,team3)


>
> teams
[,1] [,2] [,3]
team1 "Sujatha" "Lalitha" "Kavitha"
team2 "Somaiah" "Rajaiah" "Ramaiah"
team3 "John" "Paul" "Hogg"
>

Adding Rows and Columns


Appending a new row or column to an existing matrix can be achieved using the rbind() and
cbind() functions. Suppose you want to add another team

> team4 <- c("Jalaja","Sailaja","Vanaja")

to the matrix teams as a new row either at the beginning or at the end. You can do that as:
Appending a New Row at the End
> team4 <- c("Jalaja","Sailaja","Vanaja")
>
> # Adding as a last row
>
> teams.1 <- rbind(teams,team4)
> teams.1
[,1] [,2] [,3]
team1 "Sujatha" "Lalitha" "Kavitha"
team2 "Somaiah" "Rajaiah" "Ramaiah"
team3 "John" "Paul" "Hogg"
team4 "Jalaja" "Sailaja" "Vanaja"
>


Appending a New Row at the Beginning


> # Adding as a first row
>
> teams.2 <- rbind(team4,teams)
> teams.2
[,1] [,2] [,3]
team4 "Jalaja" "Sailaja" "Vanaja"
team1 "Sujatha" "Lalitha" "Kavitha"
team2 "Somaiah" "Rajaiah" "Ramaiah"
team3 "John" "Paul" "Hogg"
>
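Columns are appended in the same way with cbind(). The sketch below assumes a hypothetical fourth member for each team; the vector member4 and its values are invented purely for illustration.

```r
team1 <- c("Sujatha", "Lalitha", "Kavitha")
team2 <- c("Somaiah", "Rajaiah", "Ramaiah")
team3 <- c("John", "Paul", "Hogg")
teams <- rbind(team1, team2, team3)

# Hypothetical fourth member of each team: one value per existing row.
member4 <- c("Anitha", "Ravi", "Mark")

teams.3 <- cbind(teams, member4)   # new column appended as the last column
teams.4 <- cbind(member4, teams)   # new column placed first

dim(teams.3)                       # 3 rows, 4 columns
```

As with rbind(), the vector being added must have one element per existing row (here 3), otherwise recycling takes place.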

Subsetting Matrices
Remember that a matrix is a two-dimensional object. To fetch an element of a matrix
object, you require the row index and column index of that element. These indices must be
separated by a comma inside the indexing operator []. Suppose we have the matrix:

> (x <- matrix(1:9, nrow = 3))


[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
>

To fetch the element in the 2nd row and 3rd column:

> x[2,3]
[1] 8
>

To fetch an entire row of a matrix object, drop the column index.

> x[3,]
[1] 3 6 9
>

To fetch an entire column of a matrix object, drop the row index.

> x[,2]
[1] 4 5 6
>

Note that the results of the commands x[2,3], x[3,] and x[,2] are all vectors. If a
matrix is desired instead, use the drop = FALSE argument.


> x[2, , drop = FALSE]


[,1] [,2] [,3]
[1,] 2 5 8
>
> x[,1,drop=FALSE]
[,1]
[1,] 1
[2,] 2
[3,] 3

The square brackets work like a function, and drop is one of its arguments.
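To make the remark about square brackets concrete, here is a small sketch in which the indexing operator is called in ordinary function form. The object names a, b and r are used here only for illustration.

```r
x <- matrix(1:9, nrow = 3)

a <- x[2, 3]                      # usual bracket notation
b <- `[`(x, 2, 3)                 # the same call written in function form
r <- `[`(x, 2, , drop = FALSE)    # drop is passed like any other argument

identical(a, b)                   # TRUE
dim(r)                            # 1 3: the row is kept as a 1 x 3 matrix
```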

Dropping rows and columns

Negative indexing is allowed with matrices as well. As we saw earlier, a row or a column
of a matrix is a vector. So, negative indices for the rows or columns of a matrix drop the
corresponding rows or columns. For example, to drop the second row from the matrix x, use
the command x[-2,]:
> x[-2,]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 3 6 9

To drop the second column from the matrix x:

> x[,-2]
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9

To drop the second row as well as the second column from the matrix x:

> x[-2,-2]
[,1] [,2]
[1,] 1 7
[2,] 3 9

Fetching a Submatrix

Having understood how to fetch a row, a column and a particular element of a matrix, let
us now consider how to get a submatrix of the given matrix. Suppose from the matrix x
defined above, we want to extract the submatrix

5 8
6 9

This submatrix consists of all the elements except those in the first row and first column.
So, the command x[-1,-1] will fetch the above submatrix.

> x[-1,-1]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>

This can also be achieved in several different ways. Let us consider the command x[2:3,].
This command results in the submatrix

2 5 8
3 6 9

In this submatrix, we do not require the first column. Therefore, the command x[2:3,-1]
results in the desired submatrix.
> x[2:3,-1]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>

In the above command you are omitting the column that is not required. Instead, you can
specify which columns are required. That is, the command x[2:3,2:3] outputs the desired
submatrix.
> x[2:3,2:3]
[,1] [,2]
[1,] 5 8
[2,] 6 9
>

Think about other equivalent commands that fetch the specified submatrix. Suppose you
want to extract the submatrix

1 4
3 6

This submatrix may be extracted using the command x[c(1,3),c(1,2)]:

> x[c(1,3),c(1,2)]
[,1] [,2]
[1,] 1 4
[2,] 3 6
>

Think of other equivalent commands to achieve the desired effect.
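A few possible answers are sketched below; they are only some of the alternatives, and identical() is used to confirm that each gives the same submatrix. The names target, alt1, alt2 and alt3 are illustrative.

```r
x <- matrix(1:9, nrow = 3)
target <- x[c(1, 3), c(1, 2)]

alt1 <- x[-2, -3]                       # drop row 2 and column 3
alt2 <- x[c(TRUE, FALSE, TRUE), 1:2]    # logical index for the rows
alt3 <- x[c(1, 3), -3]                  # positive rows, negative column

identical(target, alt1)                 # TRUE
identical(target, alt2)                 # TRUE
identical(target, alt3)                 # TRUE
```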


Modifying Members of a Matrix

We may need to change one or more values of a matrix. First, let us consider modifying
the value of a single element in the matrix x. To modify the value of the element in the 2nd
row and 2nd column:
> x[2,2] <- 15
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 15 8
[3,] 3 6 9
>

Let us now consider modifying more than one value of a matrix. Suppose we want to change
the values of the first two elements in the 2nd row of x to 12 and 15. These elements are
x[2,1:2], which is a vector. Assign the vector c(12, 15) to x[2,1:2].

> x[2, 1:2] <- c(12, 15)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 12 15 8
[3,] 3 6 9

Now, consider modifying the values of the 1st and 3rd elements in the first column of the
matrix x to 11 and 13.
> x[c(1,3),1] <- c(11,13)
> x
[,1] [,2] [,3]
[1,] 11 4 7
[2,] 12 15 8
[3,] 13 6 9

Now, consider modifying the values in an entire column of a matrix. For example, let us
change the values of the first column of x back to 1, 2, 3.
> x[,1] <- c(1,2,3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 15 8
[3,] 3 6 9

Similarly, we can modify the elements of an entire row.


Let us now consider modifying the values of a submatrix.


> x[1:2, 2:3] <- c(10:13)


> x
[,1] [,2] [,3]
[1,] 1 10 12
[2,] 2 11 13
[3,] 3 6 9

Note that the submatrix elements are filled column-wise by default. Instead, if we want
to modify the elements row-wise use the byrow= option of the matrix() function. Suppose,
you want to modify the values of the above submatrix with (18, 14, 17, 19), such that
the first row elements of the new submatrix are (18, 14) and the second row (17,19).
> y <- c(18,14,17,19)
> x[1:2,2:3] <- matrix(y, nrow=2, byrow=TRUE)
> x
[,1] [,2] [,3]
[1,] 1 18 14
[2,] 2 17 19
[3,] 3 6 9

Subsetting Matrices by Names

You learned that the members of a vector can be selected using positive numeric indices,
negative indices, logical indices or names. Subsetting of matrices can also be achieved
using any of these procedures. We have just seen subsetting matrices using numeric and
negative indices. Let us now consider subsetting matrices using the names of the rows and
columns. To illustrate this, let us create three vectors x, y and z and then bind them into a
matrix using the rbind() function.
> x <- c(1,2,3)
> y <- c(11,22,33)
> z <- c(12,23,34)
>
> ( row.lab <- rbind(x,y,z) )
[,1] [,2] [,3]
x 1 2 3
y 11 22 33
z 12 23 34
>

Now, single rows of the above matrix row.lab can be fetched as follows:

> row.lab["x",]
[1] 1 2 3
>
> row.lab["z",]
[1] 12 23 34


Two rows can be fetched as follows:


> row.lab[c("z","x"),]
[,1] [,2] [,3]
z 12 23 34
x 1 2 3
>

An element in a row of a matrix can be fetched as follows:


> row.lab["z",3]
z
34
>

A submatrix can be fetched as follows:


> row.lab[c("z","y"),2:3]
[,1] [,2]
z 23 34
y 22 33
>

Use of single index

The elements of a matrix are stored in the memory one column after another in contiguous
memory locations. This means that, the members of a matrix can also be accessed using
single numeric indexing.
> matx
C1 C2 C3 C4
R1 1 2 3 4
R2 5 6 7 8
R3 9 10 11 12
>

With reference to the above matrix, matx[3]=9; matx[4]=2; and so on.
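The column-by-column storage order can be checked directly. The sketch below assumes that matx is the same row-filled matrix as mat.x3 created earlier, and it shows how a single index k maps back to a (row, column) pair; the names k and pos are illustrative.

```r
# Assumption: matx is the row-filled 3 x 4 matrix shown above.
matx <- matrix(1:12, nrow = 3, byrow = TRUE,
               dimnames = list(c("R1", "R2", "R3"),
                               c("C1", "C2", "C3", "C4")))

as.vector(matx)   # storage order: 1 5 9 2 6 10 3 7 11 4 8 12
matx[3]           # 9
matx[4]           # 2

# Position k lies in row ((k - 1) %% nrow) + 1 and column ((k - 1) %/% nrow) + 1.
k <- 8
pos <- c(row = (k - 1) %% nrow(matx) + 1,
         col = (k - 1) %/% nrow(matx) + 1)
pos                                        # row 2, column 3
matx[k] == matx[pos["row"], pos["col"]]    # TRUE: both give 7
```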

Operations on Matrices
Element-Wise Operations
Let A and B be two matrices of the same dimension. The operators

+   -   /   *

when used with matrices of the same dimension, perform the required operation on the
corresponding elements of the matrices and result in a new matrix of the same dimension.
These operations are usually referred to as element-wise or element-by-element operations.

Element-Wise Operations

Operator   A op B    Meaning
+          A + B     Adds the corresponding elements of A and B
-          A - B     Subtracts the elements of B from the corresponding elements of A
/          A / B     Divides the elements of A by the corresponding elements of B
*          A * B     Multiplies the elements of A by the corresponding elements of B
^(-1)      A^(-1)    Results in a matrix whose elements are the reciprocals of the elements of A

To get the usual matrix multiplication, as seen in Linear Algebra, use the %*% operator.

For example, multiplication of A with B is done as in

A %*% B,

provided the matrices are conformable for multiplication.


For example (try out the other operations listed in the table above):

> ( x <- matrix( 1:12, nrow = 3) )


[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
>
> ( y <- matrix( seq( from = 11, len = 12), nrow = 3) )
[,1] [,2] [,3] [,4]
[1,] 11 14 17 20
[2,] 12 15 18 21
[3,] 13 16 19 22
>
> x + y
[,1] [,2] [,3] [,4]
[1,] 12 18 24 30
[2,] 14 20 26 32
[3,] 16 22 28 34
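The difference between the element-wise product and the matrix product can be seen with the same two matrices. In this sketch x and y are both 3 x 4, so x %*% y is not conformable; y is therefore transposed first. The names elem and mprod are illustrative.

```r
x <- matrix(1:12, nrow = 3)
y <- matrix(seq(from = 11, length.out = 12), nrow = 3)

elem  <- x * y         # element-wise product, still 3 x 4
mprod <- x %*% t(y)    # matrix product of a 3 x 4 with a 4 x 3 matrix: 3 x 3

dim(elem)              # 3 4
dim(mprod)             # 3 3
elem[1, 1]             # 1 * 11 = 11
mprod[1, 1]            # 1*11 + 4*14 + 7*17 + 10*20 = 386
```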


Functions used with Matrices

Function      Example        Purpose
nrow()        nrow(A)        returns the number of rows of A
ncol()        ncol(A)        returns the number of columns of A
rowSums()     rowSums(A)     returns the sum of each row of the matrix A
colSums()     colSums(A)     returns the sum of each column of the matrix A
rowMeans()    rowMeans(A)    computes the mean of each row of the matrix A
colMeans()    colMeans(A)    computes the mean of each column of the matrix A
upper.tri()   upper.tri(A)   returns a logical matrix that is TRUE in the upper triangle of A;
                             A[upper.tri(A)] extracts those elements as a vector
lower.tri()   lower.tri(A)   returns a logical matrix that is TRUE in the lower triangle of A;
                             A[lower.tri(A)] extracts those elements as a vector
det()         det(A)         returns the determinant of the square matrix A
solve()       solve(A)       returns the inverse of the non-singular matrix A
diag()        diag(A)        returns the diagonal elements of the matrix A as a vector;
                             diag(diag(A)) gives the corresponding diagonal matrix
t()           t(A)           returns the transpose of the matrix A
eigen()       eigen(A)       returns the eigenvalues and eigenvectors of the square matrix A
is.matrix()   is.matrix(A)   returns TRUE or FALSE depending on whether A is a matrix or not
as.matrix()   as.matrix(x)   converts x (for example, a vector or a data frame) to a matrix
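A short sketch that exercises several of the functions in the table on a small matrix. The matrix A below is chosen only for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a small non-singular matrix

det(A)                 # determinant: 2*3 - 1*1 = 5
A.inv <- solve(A)      # inverse of A
A %*% A.inv            # identity matrix, up to rounding error
diag(A)                # diagonal elements as a vector: 2 3
upper.tri(A)           # logical matrix marking the upper triangle
A[upper.tri(A)]        # the element above the diagonal: 1
t(A)                   # transpose (A is symmetric, so t(A) equals A)
```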


is.matrix() function

To verify whether a given R object is a matrix, use the is.matrix() function. Let us
create a data frame and invoke is.matrix() on that object.

> df <- data.frame(x1 = 1:4, x2 = 5:8, x3 = 9:12)


> df
x1 x2 x3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
>
> is.matrix(df)
[1] FALSE
>
> is.data.frame(df)
[1] TRUE
>

as.matrix() function

Let us now convert the dataframe object into a matrix object using the as.matrix() function
and then again invoke the is.matrix() on the resulting object to verify whether the function
successfully converted the dataframe object into a matrix object.

> df.mat <- as.matrix(df)


> df.mat
x1 x2 x3
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> is.matrix(df.mat)
[1] TRUE
>

rowSums() and colSums() functions

The rowSums() and colSums() functions can be used to compute the sums of the rows and
columns of a matrix object. We know that each row of a transition probability matrix (TPM)
must sum to 1. So, let us create a TPM and verify whether its rows sum to 1.

1
September 22, 2020
R - Matrices

> s1 <- c(0.1, 0.0, 0.9)


> s2 <- c(0.5, 0.2, 0.3)
> s3 <- c(0.6, 0.2, 0.2)
>
> P <- rbind(s1, s2, s3)
> P
[,1] [,2] [,3]
s1 0.1 0.0 0.9
s2 0.5 0.2 0.3
s3 0.6 0.2 0.2
>

To achieve matrix multiplication as in Linear Algebra, we have to use the operator
%*%. We know that if P is a TPM, then P^n is also a stochastic matrix for every positive
integer n.

> P2 <- P%*%P


> P2
[,1] [,2] [,3]
s1 0.55 0.18 0.27
s2 0.33 0.10 0.57
s3 0.28 0.08 0.64
> rowSums(P2)
s1 s2 s3
1 1 1
>
> P3 <- P2%*%P
> P3
[,1] [,2] [,3]
s1 0.307 0.090 0.603
s2 0.425 0.134 0.441
s3 0.452 0.144 0.404
> rowSums(P3)
s1 s2 s3
1 1 1
>

It is easy to observe that the Markov chain corresponding to the given TPM is finite,
irreducible and aperiodic. We know that, for such a Markov chain, a stationary distribution
exists. Having P3, we can compute P6, P12 and P24 by repeated squaring. Print the contents
of the P24 matrix, confirm that the stationary distribution has been obtained (all rows are
identical), and then verify that the matrix is a TPM.


> P6 <- P3%*%P3


> P12 <- P6%*%P6
> P24 <- P12%*%P12
>
> round(P24,digits=4)
[,1] [,2] [,3]
s1 0.3919 0.1216 0.4865
s2 0.3919 0.1216 0.4865
s3 0.3919 0.1216 0.4865
>
> rowSums(round(P24,digits=4))
s1 s2 s3
1 1 1
>
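The repeated squaring above can also be written as a loop that stops once the powers no longer change. This is a minimal sketch; the tolerance 1e-12, the limit of 20 iterations and the names Pn, nxt and pi.hat are assumptions made for illustration.

```r
P <- rbind(s1 = c(0.1, 0.0, 0.9),
           s2 = c(0.5, 0.2, 0.3),
           s3 = c(0.6, 0.2, 0.2))

# Square the matrix repeatedly until the entries stop changing.
Pn <- P
for (i in 1:20) {
  nxt <- Pn %*% Pn
  if (max(abs(nxt - Pn)) < 1e-12) break
  Pn <- nxt
}

pi.hat <- Pn[1, ]                  # any row approximates the stationary distribution
round(pi.hat, 4)                   # 0.3919 0.1216 0.4865
max(abs(pi.hat %*% P - pi.hat))    # close to zero, i.e. pi P = pi
```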

#
# Illustrating the paste(), apply() and cbind() functions
#
> x <- sample(10:20,12,replace=T)
> x
[1] 11 18 13 12 12 19 20 11 10 15 14 12
> x <- matrix(x,nrow=4)
> x
[,1] [,2] [,3]
[1,] 11 12 10
[2,] 18 19 15
[3,] 13 20 14
[4,] 12 11 12
>
> apply(x,1,sum)
[1] 33 52 47 35
>
> cbind(x,apply(x,1,sum))
[,1] [,2] [,3] [,4]
[1,] 11 12 10 33
[2,] 18 19 15 52
[3,] 13 20 14 47
[4,] 12 11 12 35
>


> paste("Stu",1:4)
[1] "Stu 1" "Stu 2" "Stu 3" "Stu 4"
> paste("Stu",1:4,sep="")
[1] "Stu1" "Stu2" "Stu3" "Stu4"
>
> marks <- x
> marks
[,1] [,2] [,3]
[1,] 11 12 10
[2,] 18 19 15
[3,] 13 20 14
[4,] 12 11 12
> rownames(marks) <- paste("Stu",1:4,sep="")
> marks
[,1] [,2] [,3]
Stu1 11 12 10
Stu2 18 19 15
Stu3 13 20 14
Stu4 12 11 12
> colnames(marks) <- paste("P",1:3,sep="")
> marks
P1 P2 P3
Stu1 11 12 10
Stu2 18 19 15
Stu3 13 20 14
Stu4 12 11 12
>
> Total <- apply(marks,1,sum)
> Total
Stu1 Stu2 Stu3 Stu4
33 52 47 35
>
> cbind(marks,Total)
P1 P2 P3 Total
Stu1 11 12 10 33
Stu2 18 19 15 52
Stu3 13 20 14 47
Stu4 12 11 12 35
>
> marks <- cbind(marks,Total)
> marks
P1 P2 P3 Total
Stu1 11 12 10 33
Stu2 18 19 15 52
Stu3 13 20 14 47
Stu4 12 11 12 35
>
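The second argument of apply() selects the margin: 1 applies the function to each row and 2 to each column. The sketch below uses the marks printed above as fixed values (instead of sample()) so that it is reproducible; the name Average is illustrative.

```r
# Fixed marks (the values printed above), entered column by column.
marks <- matrix(c(11, 18, 13, 12,
                  12, 19, 20, 11,
                  10, 15, 14, 12), nrow = 4,
                dimnames = list(paste("Stu", 1:4, sep = ""),
                                paste("P", 1:3, sep = "")))

Total   <- apply(marks, 1, sum)    # MARGIN = 1: one value per row (student)
Average <- apply(marks, 2, mean)   # MARGIN = 2: one value per column (paper)

Total                              # 33 52 47 35
Average                            # 13.50 15.50 12.75
all(Total == rowSums(marks))       # TRUE: apply(, 1, sum) agrees with rowSums()
```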

Exercise-1:

(a) What is the correlation between scores of males and females?
(b) What are the standard deviations of males and females scores?
(c) Find the name of the student having maximum score?
(d) Among males, who scored max marks?

Solutions:

(a)
> f.score <- df[df$Gender=="Female","Score"]
> f.score
[1] 77 66 88 67 59
>
> m.score <- df[df$Gender=="Male","Score"]
> m.score
[1] 80 60 82 84 62
>
> cor(f.score,m.score)
[1] 0.6618506
>

(b)
> with(df,tapply(Score,Gender,sd))
Female Male
11.28273 11.61034
>

(c)
> df[df$Score==max(df$Score),"Student"]
[1] "Vanaja"
>

(d)
> df[df$Gender=="Male","Score"]
[1] 80 60 82 84 62
> max(df[df$Gender=="Male","Score"])
[1] 84
> df$Score==max(df[df$Gender=="Male","Score"])
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
> df[df$Score==max(df[df$Gender=="Male","Score"]),]
Student Gender Residence Score
7 Tharun Male Resident 84
> df[df$Score==max(df[df$Gender=="Male","Score"]),"Student"]
[1] "Tharun"
>
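In solution (d), the comparison df$Score == max(...) is applied to all rows, so a female student with the same score would also be returned. A possible alternative, sketched here with a cut-down copy of the data, is to subset the males first and then use which.max(). The names males and top are illustrative.

```r
# Cut-down copy of the data frame (Pavani's score already corrected to 67).
df <- data.frame(
  Student = c("Sarayu", "Rayudu", "Gowtam", "Vasant", "Vinuta",
              "Vanaja", "Tharun", "Pavani", "Venkat", "Janaki"),
  Gender  = c("Female", "Male", "Male", "Male", "Female",
              "Female", "Male", "Female", "Male", "Female"),
  Score   = c(77, 80, 60, 82, 66, 88, 84, 67, 62, 59),
  stringsAsFactors = FALSE
)

males <- df[df$Gender == "Male", ]               # keep only the male students
top <- males[which.max(males$Score), "Student"]  # first row with the largest score
top                                              # "Tharun"
```

Note that which.max() returns only the first maximum, so ties among the males would need separate handling.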
R-Dataframes

Outline
..What is a data frame?
..How to create a data frame?
..... data.frame(),
..... as.data.frame(),
..... read.csv()
..How to subset a data frame?
..Data Fetching
..... a single value,
..... a row,
..... a column
..Modifying data frame structure
..... Changing a value,
..... adding/deleting a row
..... adding/deleting a column
..How to write a data frame to a file

A data frame is a two-dimensional data structure in which all columns are of the same length
and, within each column, all data values are of the same atomic type.

Creating a data frame

The basic function used to create a data frame is data.frame(). In practice, data frames are
usually created from a .CSV file or an Excel file, or imported from statistical packages such
as SPSS, SAS and so on.
If the data is in the form of several vectors of the same length, use the data.frame()
function. Consider the following examples:

> data.frame(1:5,11:15)
X1.5 X11.15
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

So, when we create a data frame with unnamed vectors, R assigns default column names,
which turn out to be very clumsy.
We can give names to the vectors inside the data.frame() function while creating the data
frame; these become the column names of the data frame, as shown below:

1
September 23, 2020
R-Dataframes

> data.frame(a=c(12,32,15,15,24),b=c(22,23,21,24,23))
a b
1 12 22
2 32 23
3 15 21
4 15 24
5 24 23
>

You can also create a dataframe by first creating the data vectors and then using them within
the data.frame() function.
> a <-c(12,32,15,15,24)
> b <- c(22,23,21,24,23)
> df <- data.frame(a,b)
> df
a b
1 12 22
2 32 23
3 15 21
4 15 24
5 24 23
>

In practice, we usually create a data frame by reading data from a disk file. If the data is
in a file, use the read.csv() function to create a data frame.

Suppose we have a text file in our working directory with the name data1.txt.
Data File
Student Gender Residence Score
Sarayu Female Resident 77
Rayudu Male Resident 80
Gowtam Male Nonresident 60
Vasant Male Resident 82
Vinuta Female Nonresident 66
Vanaja Female Resident 88
Tharun Male Resident 84
Pavani female resident 77
Venkat male nonresident 62
Janaki female nonresident 59

It can be read as a data frame using the read.csv() function.

> df<-read.csv(file="data1.txt",sep =" ")

Let us now look at the contents of the data frame df:


> df
Student Gender Residence Score
1 Sarayu Female Resident 77
2 Rayudu Male Resident 80
3 Gowtam Male Nonresident 60
4 Vasant Male Resident 82
5 Vinuta Female Nonresident 66
6 Vanaja Female Resident 88
7 Tharun Male Resident 84
8 Pavani female resident 77
9 Venkat male nonresident 62
10 Janaki female nonresident 59

Now, have a look at the structure of the data frame.

> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : Factor w/ 10 levels "Gowtam","Janaki",..: 5 4 1 8 10 7 6 3 9 2
$ Gender : Factor w/ 4 levels "female","Female",..: 2 4 4 4 2 2 4 1 3 1
$ Residence: Factor w/ 4 levels "nonresident",..: 4 4 2 4 2 4 4 3 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>

df is a data frame with 10 observations (rows) and 4 variables (columns). Further, note that
3 of the variables (Student, Gender, Residence) are factor variables and only one variable
(Score) is numeric.
In the R version used here, read.csv() (like data.frame()) treats character data as factors by
default; from R 4.0.0 onwards the default is stringsAsFactors = FALSE, so character columns
stay as characters. If you want to keep the characters as characters regardless of the
version, set the stringsAsFactors argument to FALSE. For the sake of practice, let us do it:

> df <- read.csv(file="data1.txt",sep=" ", stringsAsFactors=FALSE)


> df
Student Gender Residence Score
1 Sarayu Female Resident 77
2 Rayudu Male Resident 80
3 Gowtam Male Nonresident 60
4 Vasant Male Resident 82
5 Vinuta Female Nonresident 66
6 Vanaja Female Resident 88
7 Tharun Male Resident 84
8 Pavani female resident 77
9 Venkat male nonresident 62
10 Janaki female nonresident 59
>


> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ Residence: chr "Resident" "Resident" "Nonresident" "Resident" ...
$ Score : int 77 80 60 82 66 88 84 77 62 59
>

Now we observe that the variables Student, Gender, and Residence are character variables.
Note that there are some case differences among the values of the variables Gender and
Residence. You need to convert them to a uniform case before you attempt any analysis
using them. Consider the Residence variable:

> df[8:10,3]
[1] "resident" "nonresident" "nonresident"
> c("Resident",rep("Nonresident",2))
[1] "Resident" "Nonresident" "Nonresident"
>
> df[8:10,3] <- c("Resident",rep("Nonresident",2))
> df[,3]
[1] "Resident" "Resident" "Nonresident" "Resident"
[5] "Nonresident" "Resident" "Resident" "Resident"
[9] "Nonresident" "Nonresident"
>
> df[,3] <- as.factor(df$Residence)

Verify whether the intended corrections have been carried out correctly.

> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ Residence: Factor w/ 2 levels "Nonresident",..: 2 2 1 2 1 2 2 2 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>

Query your data frame to find the levels of the factor variable and the number of
observations at each level.
> levels(df$Residence)
[1] "Nonresident" "Resident"
> table(df$Residence)

Nonresident Resident
4 6
>

Note that the variable Residence is now a factor variable with two levels.
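Correcting values by position, as done above, works for a small file but does not scale. One possible alternative is to normalise the case programmatically before converting to a factor. The helper name cap.first in this sketch is hypothetical and is not part of R.

```r
# The Residence values as they appear in the data file.
res <- c("Resident", "Resident", "Nonresident", "Resident", "Nonresident",
         "Resident", "Resident", "resident", "nonresident", "nonresident")

# Hypothetical helper: upper-case the first letter, lower-case the rest.
cap.first <- function(s) {
  paste(toupper(substring(s, 1, 1)), tolower(substring(s, 2)), sep = "")
}

res.f <- factor(cap.first(res))
levels(res.f)     # "Nonresident" "Resident"
table(res.f)      # 4 Nonresident, 6 Resident
```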


Exercise
Try to correct the data errors in the Gender variable and then change it into a factor
variable.

After the conversion the structure of the data frame becomes

> str(df)
’data.frame’: 10 obs. of 4 variables:
$ Student : chr "Sarayu" "Rayudu" "Gowtam" "Vasant" ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 2 1
$ Residence: Factor w/ 2 levels "Nonresident",..: 2 2 1 2 1 2 2 2 1 1
$ Score : int 77 80 60 82 66 88 84 77 62 59
>

You can use as.data.frame() function to create a data frame out of a matrix.
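A minimal sketch of this conversion. The matrix m and its column names are chosen only for illustration.

```r
m <- matrix(1:6, nrow = 3,
            dimnames = list(NULL, c("a", "b")))

m.df <- as.data.frame(m)
is.data.frame(m.df)     # TRUE
names(m.df)             # column names come from the matrix: "a" "b"
str(m.df)
```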

Naming rows and columns

By default, the observations (rows) of a data frame are named "1", "2", "3", and so on. You
can check this using the rownames() function.

> rownames(df)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
>

The names() or colnames() functions can be used to fetch the names of the variables
(columns) of a data frame.
> names(df)
[1] "Student" "Gender" "Residence" "Score"
>
> colnames(df)
[1] "Student" "Gender" "Residence" "Score"
>

The names() function can also be used to assign names to columns of a data frame.

To change the variable name Gender to Sex, use the command:

> names(df)[2]
[1] "Gender"
> names(df)[2] <- "Sex"
> names(df)
[1] "Student" "Sex" "Residence" "Score"
>

Now change the column name from Sex to Gender using the colnames() function.


> colnames(df)[2] <- "Gender"


> names(df)
[1] "Student" "Gender" "Residence" "Score"
>

Unlike matrices, you cannot delete the row names of a data frame: assigning NULL to
rownames(df) merely resets them to the default "1", "2", "3", and so on.

Subsetting a data frame


A data frame is a two-dimensional data structure whose dimensions are referred to as rows
and columns. The elements of a data frame are retrieved by suffixing the name of the data
frame with a pair of square brackets containing a pair of reference indices separated by a
comma. An element of a data frame can be fetched using numeric indices, names, or
logical vectors.

Is Vasant a resident student?

The name Vasant appears in the 4th row and Residence is the 3rd variable of our
data frame. So, to fetch the required information, issue the command:

> df[4,3]
[1] Resident
Levels: Nonresident Resident
>

While dealing with large data frames, it is easier to remember the column names than
their numbers. R supports fetching the values in a data frame using the names of
the columns.

A column (variable) can be fetched by writing the data frame name, a $ symbol and the
column name.

For example, the Student variable of the data frame df can be accessed as df$Student.

> df$Student
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>

To know whether there is a student named Vasant, compare the string "Vasant" to df$Student.

> df$Student=="Vasant"
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
>

Note that the 4th element of the result of the above command is TRUE and every other element
is FALSE. So, when we pass this vector as the row index for the data frame df and supply
"Residence" as the column index, we get:


> df[df$Student=="Vasant","Residence"]
[1] Resident
Levels: Nonresident Resident
>

Fetching a Row

To fetch an observation( or case or record or row), we leave the second dimension empty as
in the case of matrices. For example, 8th row of the data frame can be accessed using the
commands
> df[8,]
Student Gender Residence Score
8 Pavani Female Resident 77
>
> df[df$Student=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 77
>
> df[df=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 77
>

Suppose, Pavani’s score was wrongly noted as 77 instead of 67. You modify Pavani’s
score using the command
> ## fetching Pavani’s Score
>
> df[df$Student=="Pavani","Score"]
[1] 77
>
> # Modifying Pavani’s Score
>
> df[df$Student=="Pavani","Score"] <- 67
>
> # View modified record
>
> df[df$Student=="Pavani",]
Student Gender Residence Score
8 Pavani Female Resident 67
>

Fetching Multiple rows

Who are the resident students?


> # Fetch the rows for residents


>
> df[df$Residence=="Resident",]
Student Gender Residence Score
1 Sarayu Female Resident 77
2 Rayudu Male Resident 80
4 Vasant Male Resident 82
6 Vanaja Female Resident 88
7 Tharun Male Resident 84
8 Pavani Female Resident 67
>
> # Print the names of the resident students
>
> df[df$Residence=="Resident","Student"]
[1] "Sarayu" "Rayudu" "Vasant" "Vanaja" "Tharun" "Pavani"
>

Who are resident male students?


> # Fetch the records of resident male students
>
> df[df$Residence=="Resident"&df$Gender=="Male",]
Student Gender Residence Score
2 Rayudu Male Resident 80
4 Vasant Male Resident 82
7 Tharun Male Resident 84
>
> # Print the names of the resident male students
>
> df[df$Residence=="Resident"&df$Gender=="Male","Student"]
[1] "Rayudu" "Vasant" "Tharun"
>

What are the records of female students?


> df$Gender=="Female"
[1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
>
>
> df[df$Gender=="Female",]
Student Gender Residence Score
1 Sarayu Female Resident 77
5 Vinuta Female Nonresident 66
6 Vanaja Female Resident 88
8 Pavani Female Resident 67
10 Janaki Female Nonresident 59
>


How many Nonresident female students are there? Who are they?

> df$Gender=="Female"&df$Residence=="Nonresident"
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE

> sum(df$Gender=="Female"&df$Residence=="Nonresident")
[1] 2
>

> df[df$Gender=="Female"&df$Residence=="Nonresident",]
Student Gender Residence Score
5 Vinuta Female Nonresident 66
10 Janaki Female Nonresident 59
>

What is the average score by Gender?

> with(df,tapply(Score,Gender,mean))
Female Male
71.4 73.6
>

What is the average score by Residence?

> with(df,tapply(Score,Residence,mean))
Nonresident Resident
61.75000 79.66667
>

What is the average score by Gender and Residence?

> with(df,tapply(Score,list(Gender,Residence),mean))
Nonresident Resident
Female 62.5 77.33333
Male 61.0 82.00000
>
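The same group means can also be obtained in long form, one row per combination, with the aggregate() function from base R. This sketch rebuilds the cleaned data frame so that it is self-contained; the name agg is illustrative.

```r
# The cleaned data frame (case corrected, Pavani's score set to 67).
df <- data.frame(
  Student   = c("Sarayu", "Rayudu", "Gowtam", "Vasant", "Vinuta",
                "Vanaja", "Tharun", "Pavani", "Venkat", "Janaki"),
  Gender    = c("Female", "Male", "Male", "Male", "Female",
                "Female", "Male", "Female", "Male", "Female"),
  Residence = c("Resident", "Resident", "Nonresident", "Resident", "Nonresident",
                "Resident", "Resident", "Resident", "Nonresident", "Nonresident"),
  Score     = c(77, 80, 60, 82, 66, 88, 84, 67, 62, 59)
)

# One row per Gender/Residence combination, with the mean Score.
agg <- aggregate(Score ~ Gender + Residence, data = df, FUN = mean)
agg
```

tapply() returns a two-way table, which is convenient for reading; aggregate() returns a data frame, which is convenient for further processing.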

Count the number of residents and nonresidents.


> table(df$Residence)

Nonresident Resident
4 6

What is the gender composition in the data frame?


> table(df$Gender)

Female Male
5 5
>

Exercise-2
(a) What is the correlation between scores of males and females?
(b) What are the standard deviations of males and females scores?
(c) Find the name of the student having maximum score?
(d) Among males, who scored max marks?

Fetching a column

A column in a data frame is a variable. Unlike matrices and arrays, data frames are not
internally stored as a single vector. They are stored as a list of vectors, one per column.

We can use numeric indices, names and logical vectors for selection of variables as with
matrices. We can also select a variable by inserting a $ symbol in between the data frame
name and column name, in that order.

Who are the students in the data frame?

> df[,1]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[,"Student"]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[,c(TRUE,rep(FALSE,3))]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df$Student
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>

Note that, the output is a vector in all the above cases. If we want to fetch a column as a
data frame, then use the drop=FALSE option.


> df[,1,drop=FALSE]
Student
1 Sarayu
2 Rayudu
3 Gowtam
4 Vasant
5 Vinuta
6 Vanaja
7 Tharun
8 Pavani
9 Venkat
10 Janaki

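You can also use square brackets with a single index to get a column of a data frame, since
a data frame is stored internally as a list whose components are its columns. Single brackets
return a one-column data frame; double brackets return the column itself as a vector.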

> df["Student"]
Student
1 Sarayu
2 Rayudu
3 Gowtam
4 Vasant
5 Vinuta
6 Vanaja
7 Tharun
8 Pavani
9 Venkat
10 Janaki
>
>
> df[["Student"]]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta" "Vanaja" "Tharun"
[8] "Pavani" "Venkat" "Janaki"
>
> df[["Student"]][1:5]
[1] "Sarayu" "Rayudu" "Gowtam" "Vasant" "Vinuta"
>

Use of $ symbol to fetch a single variable is more convenient in many instances.

Fetching Two columns

We can fetch two or more columns, as in the case of matrices.


Methods of Subsetting

Each of the following three commands selects the same two columns and prints the same output:

> df[,c(1,4)]
> df[,c("Student","Score")]
> df[,c(T, F, F, T)]
   Student Score
1   Sarayu    77
2   Rayudu    80
3   Gowtam    60
4   Vasant    82
5   Vinuta    66
6   Vanaja    88
7   Tharun    84
8   Pavani    67
9   Venkat    62
10  Janaki    59

Since the result is itself a data frame, it can be subsetted further like any other data frame.


> # Names and scores of the students whose score is more than 70
>
> df[c("Student","Score")][df$Score>70,]
Student Score
1 Sarayu 77
2 Rayudu 80
4 Vasant 82
6 Vanaja 88
7 Tharun 84
>

Look at the following:

> "["(df,c("Student","Score"))
Student Score
1 Sarayu 77
2 Rayudu 80
3 Gowtam 60
4 Vasant 82
5 Vinuta 66
6 Vanaja 88
7 Tharun 84
8 Pavani 67
9 Venkat 62
10 Janaki 59
>

"[" is itself a function: its first argument is the data frame and its second argument is a
column index.
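
Because "[" is an ordinary function, it can be passed to other functions. The sketch below applies it with lapply() to pick the same columns from several data frames; sec_a and sec_b are hypothetical data frames invented for this illustration.

```r
# Two small hypothetical data frames with the same columns
sec_a <- data.frame(Student = c("Anu", "Bil"), Gender = c("F", "M"),
                    Score = c(77, 80))
sec_b <- data.frame(Student = c("Ali", "Dip"), Gender = c("M", "F"),
                    Score = c(60, 82))

# Calling "[" by name is the same as sec_a[c("Student", "Score")]
sub_a <- "["(sec_a, c("Student", "Score"))

# Passing "[" to lapply() selects the same columns from each data frame
subs <- lapply(list(sec_a, sec_b), "[", c("Student", "Score"))
print(subs)
```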


Modifying Dataframes
Adding One Observation

Suppose our dataframe is as follows:

> df1
Math Phy Chem
1 30 25 20
2 26 24 22
3 23 23 19
4 21 21 23
5 24 24 22
6 25 25 23
>

(1) If the Data frame contains only numeric values and the rows have default names:

> df1[nrow(df1)+1,] <- c(20,18,15)


> df1
Math Phy Chem
1 30 25 20
2 26 24 22
3 23 23 19
4 21 21 23
5 24 24 22
6 25 25 23
7 20 18 15
>
> rownames(df1)
[1] "1" "2" "3" "4" "5" "6" "7"
>

Alternatively, you can add a new row to the data frame using the rbind() function as
follows (starting again from the original data frame):

> df1 <- rbind(df1, c(20,18,15))


> df1
Math Phy Chem
1 30 25 20
2 26 24 22
3 23 23 19
4 21 21 23
5 24 24 22
6 25 25 23
7 20 18 15
>

(2) If the data frame contains only numeric values and the rows are labelled (df here denotes the original six-row data frame of marks):


> rownames(df) <- LETTERS[1:6]


> df
Math Phy Chem
A 30 25 20
B 26 24 22
C 23 23 19
D 21 21 23
E 24 24 22
F 25 25 23
>

A new row can be added to an existing data frame as follows:

> rbind(df,"G" = c(20,18,15))


Math Phy Chem
A 30 25 20
B 26 24 22
C 23 23 19
D 21 21 23
E 24 24 22
F 25 25 23
G 20 18 15
>

(3) Suppose different columns of the data frame contain different atomic types:

> df1 <- data.frame(Names="Raj", Math=25, Phy=18, Chem=15)


> df1
Names Math Phy Chem
1 Raj 25 18 15
> rbind(df, df1)
Names Math Phy Chem
1 Anu 30 25 20
2 Bil 26 24 22
3 Ali 23 23 19
4 Dip 21 21 23
5 Sri 24 24 22
6 Hsu 25 25 23
7 Raj 25 18 15
>

Adding More Than One Row

Create a new data frame whose column names are the same as those of the old data frame,
and in the same order. Then use the rbind() function to combine the two data frames into a
single data frame. For rbind() to work, the column names of the two data frames must match
exactly, including case.
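
The steps above can be sketched as follows. The data frames old and new and their values are hypothetical; stringsAsFactors = FALSE is set explicitly so that the names stay character strings in every R version.

```r
# Existing data frame (hypothetical values)
old <- data.frame(Names = c("Anu", "Bil"), Math = c(30, 26), Phy = c(25, 24),
                  stringsAsFactors = FALSE)

# New rows, with the same column names in the same order
new <- data.frame(Names = c("Raj", "Sam"), Math = c(25, 22), Phy = c(18, 20),
                  stringsAsFactors = FALSE)

# Check that the column names match exactly before binding
stopifnot(identical(names(old), names(new)))

combined <- rbind(old, new)
print(combined)
```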

The read.spss function in the foreign package reads all versions of SPSS files, both .sav and .por types.

library(foreign)

> df <- read.spss(file="Exercise02.sav",to.data.frame=T)

re-encoding from CP1252

There is no canned function to write out a complete SPSS dataset, but there are two auxiliary functions
in the foreign package that allow users to write out a text data file and then an input syntax file that will
read the data in and make the "right" variable and value labels.

... writeForeignSPSS() takes three arguments: the first is the R data frame you want to write out, the
second is the name of a data file to which the data will be written, and the third is the name of a code
file to which the code to input the data will be written.
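
A minimal sketch of this two-file export follows, using write.foreign(), the exported front end in the foreign package that takes the same three arguments plus package=. The data frame scores and the temporary file names are assumptions made for illustration.

```r
library(foreign)

# Hypothetical data frame to export
scores <- data.frame(id = 1:3, score = c(77, 80, 60))

# Temporary files: one for the raw data, one for the SPSS syntax
datafile <- tempfile(fileext = ".txt")
codefile <- tempfile(fileext = ".sps")

# Writes the text data file and the syntax file that reads it into SPSS
write.foreign(scores, datafile = datafile, codefile = codefile,
              package = "SPSS")

file.exists(datafile)
file.exists(codefile)
```

Running the generated syntax file inside SPSS then reads the text data file; nothing here creates a .sav file directly.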

Stata

... The read.dta function in the foreign package reads in Stata datasets saved in formats earlier than
Stata 13.

library(foreign)

dat <- read.dta('xyz123.dta')

... To read Stata files from version 13 or later, you can use the read.dta13 function in the readstata13
package. First, you have to install the package:

install.packages('readstata13')

library(readstata13)
dat <-read.dta13('xyz.dta', nonint.factors=T)

write.dta() writes a Stata .dta file of the dataset. The benefit here is that factors remain defined as
variables with labels in Stata. Those attributes go away in the text files.

... writeForeignStata() has the same arguments as the SPSS version.

-------------------------------------

tidyverse - haven

... Haven enables R to read and write various data formats used by other statistical packages.

... Haven is part of the tidyverse, but it is not attached by library(tidyverse); load it explicitly with library(haven).

... Currently it supports:

SAS: read_sas() reads .sas7bdat + .sas7bcat files and read_xpt() reads SAS transport files (version 5 and version 8).

SPSS: read_sav() reads .sav files and read_por() reads the older .por files. write_sav() writes .sav files. read_spss() uses either read_por() or read_sav() based on the file extension.

Usage

read_sav(file, encoding = NULL, user_na = FALSE)

read_por(file, user_na = FALSE)

write_sav(data, path, compress = FALSE)


read_spss(file, user_na = FALSE)

Stata: read_dta() reads .dta files (up to version 15).

write_dta() writes .dta files (versions 8-15).

The output objects:

Are tibbles, which have a better print method for very long and very wide files.

Translate value labels into a new labelled() class, which preserves the original semantics and can easily
be coerced to factors with as_factor(). Special missing values are preserved.

Dates and times are converted to R date/time classes. Character vectors are not converted to factors.

Read SPSS (.sav, .zsav, .por) files. Write .sav and .zsav files.

read_sav() reads both .sav and .zsav files; write_sav() creates .zsav files when compress = TRUE. read_por() reads .por files. read_spss() uses either read_por() or read_sav() based on the file extension.

------------------------------------------------

SPSS File (.sav)

install.packages("haven")
library(haven)

Import SPSS (".sav") File

object <- read_sav("filename.sav")

Export SPSS (".sav") File

write_sav(object, "filename.sav")

For example, if you wanted to download the package that would allow you to read .sas7bdat files, you would do:

install.packages('sas7bdat')
R can read data from a wide variety of sources and in a wide variety of formats.

The foreign package

The foreign package contains methods to read SAS permanent datasets (SAS7BDAT files) using
read.ssd, Stata DTA files with read.dta, and SPSS data files with read.spss. Data can be exported for
each of these packages with write.foreign, which writes a text data file together with a code file for reading it.

The xlsx package

The xlsx package is Java-based and cross-platform, so at least in theory it can read any Excel file on
any system. It provides a choice of functions for reading Excel files: spreadsheets can be imported with
read.xlsx and read.xlsx2, which do more processing in R and in Java, respectively.

Importing data from an excel file into R

Many statistical packages (SAS, SPSS) can save data as an EXCEL file.

(1) Using EXCEL "Save As" option

Import any type of data into R by using EXCEL and saving the data file there in comma delimited (*.csv) format.

Once the comma delimited file is created using the "Save As" feature in EXCEL you can import it into R
using either the read.table() or the read.csv() function. Before importing, determine which separator was
used in the ".csv" file (comma or semi-colon). Then:

Option 1: The separator is a comma (,) and NO headers

object <- read.table("filename.csv", header=FALSE, sep=",")

Option 2: The separator is a semi-colon (;) and NO headers

object <- read.table("filename.csv", header=FALSE, sep=";")

(2) using the clipboard:

Open the *.xls file in EXCEL

Select the table from the excel file, copy, go to the R Console and type:

mydf <- read.table("clipboard", header=TRUE, sep="\t")

(3) Data from ".csv" (interactively)

mydata <- read.csv(file.choose(), header = TRUE)

(4) Data from ".csv" using address of the file

mydf <- read.csv("address of the .csv file", header=TRUE)

Export an R object as a (".csv") File

Option 1: The separator is a comma (,) and NO headers

write.table(object, file="filename.csv", col.names=FALSE, sep=",")


Option 2: The separator is a semi-colon (;) and NO headers

write.table(object, file="filename.csv", col.names=FALSE, sep=";")

-----------------------------------------------

Excel File (.xlsx)

Package Required = openxlsx

install.packages("openxlsx")

library(openxlsx)

Import Excel (".xlsx") File

Option 1: Sheet=1, NO column headings

object <- read.xlsx("filename.xlsx", sheet=1, colNames=FALSE)

Option 2: Sheet=1, Column headings

object <- read.xlsx("filename.xlsx", sheet=1, colNames=TRUE)

Export Excel (".xlsx") File

Option 1: NO column headings

write.xlsx(object, file="filename.xlsx", colNames=FALSE)

Option 2: Column headings

write.xlsx(object, file="filename.xlsx", colNames=TRUE)


Importing Data Into R
read.csv() function

Dr. L. V. Rao

August 25, 2019



Exporting Data From R

R provides two functions for writing objects to files in ASCII format:
write(), which is suitable for the same kinds of data as scan(), and
write.table(), which is suitable for the types of data which would normally be read using read.table().


write() function - Help
Description: The data (usually a matrix) x are written to file file.
If x is a two-dimensional matrix you need to transpose it to get the
columns in file the same as those in the internal representation.
Usage

write(x, file = "data",
      ncolumns = if(is.character(x)) 1 else 5,
      append = FALSE, sep = " ")

Arguments
x         the data to be written out, usually an atomic vector.
file      a connection, or a character string naming the file to write to. If "", print to the standard output connection.
ncolumns  the number of columns to write the data in.
append    if TRUE the data x are appended to the connection.
sep       a string used to separate columns. Using sep = "\t" gives tab delimited output; default is " ".
write() function

The write() function accepts an R object and the name of a file or connection object, and writes an
ASCII representation of the object to the appropriate destination.
The ncolumns= argument can be used to specify the number of values to write on each line; it defaults to
5 for numeric variables, and
1 for character variables.
To build up an output file incrementally, the append=TRUE argument can be used.


write() function

We know that matrices are internally stored by columns, and hence, will be written to any output
connection in that order.
To write a matrix in row-wise order, use its transpose and adjust the ncolumns= argument appropriately.


> mat
  Maths Phy Chem
1   100  95   92
2    98  82   84
3    89  79   81
4    95  88   80
>
> write(mat,file="")
100 98 89 95 95
82 79 88 92 84
81 80
> write(t(mat),file="")
100 95 92 98 82
84 89 79 81 95
88 80
>
> write(t(mat),file="",
        ncolumns=ncol(mat))
100 95 92
98 82 84
89 79 81
95 88 80
> write(colnames(mat),
        file="mat-write.txt",
        ncolumns=ncol(mat))
> write(t(mat),file="mat-write.txt",
        ncolumns=ncol(mat),
        append=TRUE)
> read.csv("mat-write.txt",sep=" ")
  Maths Phy Chem
1   100  95   92
2    98  82   84
3    89  79   81
4    95  88   80
>
write.table() function - Help

Description: write.table prints its required argument x (after converting it to a data frame if it is not
one nor a matrix) to a file or connection.
Usage

write.table(x, file = "", append = FALSE, quote = TRUE,
            sep = " ", eol = "\n", na = "NA", dec = ".",
            row.names = TRUE, col.names = TRUE,
            qmethod = c("escape", "double"),
            fileEncoding = "")

Arguments


x             the object to be written, preferably a matrix or data frame. If not, it is attempted to coerce x to a data frame.
file          either a character string naming a file or a connection open for writing. "" indicates output to the console.
append        logical. Only relevant if file is a character string. If TRUE, the output is appended to the file. If FALSE, any existing file of the name is destroyed.
quote         a logical value (TRUE or FALSE) or a numeric vector. If TRUE, any character or factor columns will be surrounded by double quotes. If a numeric vector, its elements are taken as the indices of columns to quote. In both cases, row and column names are quoted if they are written. If FALSE, nothing is quoted.
sep           the field separator string. Values within each row of x are separated by this string.
eol           the character(s) to print at the end of each line (row). For example, eol = "\r\n" will produce Windows' line endings on a Unix-alike OS, and eol = "\r" will produce files as expected by Excel:mac 2004.
na            the string to use for missing values in the data.
dec           the string to use for decimal points in numeric or complex columns: must be a single character.
row.names     either a logical value indicating whether the row names of x are to be written along with x, or a character vector of row names to be written.
col.names     either a logical value indicating whether the column names of x are to be written along with x, or a character vector of column names to be written. See the section on "CSV files" for the meaning of col.names = NA.
qmethod       a character string specifying how to deal with embedded double quote characters when quoting strings. Must be one of "escape" (default for write.table), in which case the quote character is escaped in C style by a backslash, or "double" (default for write.csv and write.csv2), in which case it is doubled. You can specify just the initial letter.
fileEncoding  character string: if non-empty declares the encoding to be used on a file (not a connection) so the character data can be re-encoded as they are written. See file.
...           (write.csv and write.csv2 only) arguments to write.table: append, col.names, sep, dec and qmethod cannot be altered.


write.table() function

For mixed-mode data, like data frames, the basic tool to produce ASCII files is the write.table() function.
The only required argument to write.table is the name of a data frame or matrix; with just a single
argument, the output will be printed on the console, making it easy to test that the file you will be
creating is in the correct format.
Usually, the second argument, file=, will be used to specify the destination as either a character string
to represent a file, or a connection.


write.table() function

By default, character strings are surrounded by quotes by write.table; use the quote=FALSE argument
to suppress this feature.

> write.table(mat)
"Maths" "Phy" "Chem"
"1" 100 95 92
"2" 98 82 84
"3" 89 79 81
"4" 95 88 80
> write.table(mat, file="temp.txt")
>
> write.table(mat, quote=FALSE)
Maths Phy Chem
1 100 95 92
2 98 82 84
3 89 79 81
4 95 88 80
> write.table(mat, file="temp1.txt", quote=FALSE)


write.table() function
To suppress row names or column names from being written to the file, use the row.names=FALSE or
col.names=FALSE arguments, respectively.

> write.table(mat, quote=FALSE, row.names=FALSE)
Maths Phy Chem
100 95 92
98 82 84
89 79 81
95 88 80
> write.table(mat, file="mat-write.txt",
              row.names=FALSE, quote=FALSE)
>
> write.table(mat, row.names=FALSE, col.names=FALSE)
100 95 92
98 82 84
89 79 81
95 88 80
> write.table(mat, file="mat-write.txt",
              row.names=FALSE, col.names=FALSE)
Similar to read.csv() and read.csv2(), the functions write.csv() and write.csv2() are provided as
wrappers to write.table(), with appropriate options set to produce comma- or semicolon-separated files.
To save the file somewhere other than in the working directory, enter the full path for the file as shown.
> write.csv(dataset, "C:/folder/filename.csv")
If a file with your chosen name already exists in the specified location, R overwrites the original file
without giving a warning. You should check the files in the destination folder beforehand to make sure
you are not overwriting anything important.
By default, the write.csv() and write.table() functions write the row names (by default the observation
numbers) as an extra first column in the file. To prevent this, set the row.names argument to
FALSE.
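
The effect of row.names can be seen by writing the same data frame twice and reading both files back. This is a small sketch; the data frame d and the temporary file names are made up for the illustration.

```r
# Hypothetical data frame
d <- data.frame(id = c(101, 102), score = c(77, 80))

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(d, with_rn)                        # row names written as first column
write.csv(d, without_rn, row.names = FALSE)  # row names suppressed

# Reading back: the first file has one extra column
n_with    <- ncol(read.csv(with_rn))     # 3
n_without <- ncol(read.csv(without_rn))  # 2
print(c(n_with, n_without))
```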



write.table() function

Note that col.names=TRUE (the default) produces the same sort of headers that are read using the
header=TRUE argument of read.table.
The sep= argument can be used to specify a separator other than a blank space.
Two common choices are sep="," (comma-separated) and sep="\t" (tab-separated).


write.table() function

The write.table() function acts to write a delimited file, just as read.table() reads one.
And, just as there are the read.csv() and read.csv2() analogs
to read.table(), R also provides write.csv() and write.csv2().
We normally pass write.table() a data frame, though a matrix
can be written as well, and we generally supply the delimiter
with the sep argument, since the default choice of a space is
rarely a good one.
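
As a sketch of why the space default is rarely suitable, the example below writes a data frame whose names contain spaces as a tab-delimited file and reads it back with read.delim(). The data frame d and the temporary file are assumptions for illustration.

```r
# Hypothetical data frame with spaces inside a character column
d <- data.frame(name = c("Rama Rao", "Usha Rani"), marks = c(12, 13),
                stringsAsFactors = FALSE)

tab_file <- tempfile(fileext = ".txt")

# Tab separator, no row names, no quotes around strings
write.table(d, file = tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# read.delim() assumes a header line and a tab separator
back <- read.delim(tab_file, stringsAsFactors = FALSE)
print(back)
```

With a tab separator the embedded spaces in the names are unambiguous even without quoting.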



Example: Importing/exporting .csv files

This example illustrates how to export the contents of a data frame to a .csv file, and how to import
the data from a .csv file into an R data frame.

> # create a data frame
> dates <- c("3/27/1995", "4/3/1995",
             "4/10/1995", "4/18/1995")
> prices <- c(11.1, 7.9, 1.9, 7.3)
> d <- data.frame(dates = dates, prices = prices)
>
> # create the .csv file
> filename <- "temp.csv"
> write.table(d, file = filename, sep = ",",
              row.names = FALSE)


Example: Importing/exporting .csv files

The new file temp.csv can be opened in most spreadsheets. When displayed in a text editor (not a
spreadsheet), the file temp.csv contains the following lines (without the leading spaces).

"dates","prices"
"3/27/1995",11.1
"4/3/1995",7.9
"4/10/1995",1.9
"4/18/1995",7.3


Example: Importing/exporting .csv files

Most .csv format files can be read using read.table. In addition there are functions read.csv and
read.csv2 designed for .csv files.

> # read the .csv file
> read.table(file = filename, sep = ",", header = TRUE)
> read.csv(file = filename) #same thing
      dates prices
1 3/27/1995   11.1
2  4/3/1995    7.9
3 4/10/1995    1.9
4 4/18/1995    7.3
>


Importing Data
(read functions)

Dr. L. V. Rao

February 11, 2021



Importing Data

A large number of competing methods can be used to import data from a wide variety of sources:
read.table()
read.csv()
read.csv2()
read.delim()
read.delim2()


Import from text file

A text file can be created in all spreadsheet, database and statistical software packages.
In a text file, data are separated or delimited by a specific
character, which in turn defines what sort of text file it is.
The text file should broadly represent the format of the data
frame. That is, variables should be in columns and sampling
units in rows.
The first row should contain the variable names and if there
are row names, these should be in the first column.


read.table() function
read.table(file=, header= , row.names=1, sep= ",")
read.table(file=, header= , row.names=1, sep= "\t")

The argument file= must be provided with a string that specifies the name of the text file to be imported.
The header= argument is a logical value that indicates whether
the first row of the file contains the names of the columns of
the data frame.
The row.names= argument indicates which column in the data
set contains the row names. If there are no row names in the
data set, then the row.names= argument should be omitted.
Finally, the sep= argument specifies which character is used
as the delimiter to separate data entries. The syntax "\t"
indicates a tab character. Field (data) separators are not
restricted to commas or tabs; just about any character can be
defined as a separator.
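
The arguments described above can be tried without creating a file by passing the data through the text= argument of read.table(). The sketch below uses made-up comma-separated data in which the first column holds the row names.

```r
# Hypothetical comma-separated data: header line, row names in column 1
csv_text <- "id,Math,Phy
A,30,25
B,26,24
C,23,23"

marks <- read.table(text = csv_text, header = TRUE, row.names = 1, sep = ",")

rownames(marks)      # "A" "B" "C"
marks["B", "Phy"]    # 24
```

Replacing text = csv_text with file = "some-file.csv" (a placeholder name) reads the same layout from disk.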
Importing from the clipboard

The read.table() function can also be used to import data (into a data frame) that has been placed
on the clipboard by other software, thereby providing a very quick and convenient
way of obtaining data from spreadsheets. Simply replace the
filename argument with the word "clipboard" and indicate a
tab field separator ("\t").


Import from other software

SPSS
> library(foreign)
> xyz <- read.spss("xyz.sav", to.data.frame = T)
MINITAB
> library(foreign)
> xyz <- as.data.frame(read.mtp("xyz.mtp"))
SAS
> library(foreign)
> xyz <- read.xport("xyz")
SYSTAT
> library(foreign)
> xyz <- read.systat("xyz.syd", to.data.frame = T)


Excel

Excel is more than just a spreadsheet – it contains macros, formulae, multiple worksheets and formatting.
The easiest way to import data from Excel is either to save
the worksheet as a text file (comma or tab delimited) and
import the data as a text file, or to copy the data to the
clipboard in Excel and import the clipboard data into R.


Exporting data

Plain text files can be read by a wide variety of other applications, ensuring that the ability to
retrieve the data will continue indefinitely.
Also, as they are neither compressed nor encoded, a corruption
to one section of the file does not necessarily reduce the
ability to correctly read other parts of the file. Hence, this is
also an important consideration for the storage of datasets.
The write.table() function is used to save data frames.


read.table() Function
Importing Data

Dr. L. V. Rao

August 20, 2019



read.table() function

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".",
           numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE,
           fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text,
           skipNul = FALSE)


read.table() function

The read.table() function is used to read data into R in the form of a data frame.
read.table() always returns a data frame, which means that it
is ideally suited to read data with mixed modes.
read.table() expects each field (variable) in the input source
to be separated by one or more separators, by default any of
spaces, tabs, newlines or carriage returns.
The sep= argument can be used to specify an alternative
separator.
If there are no consistent separators in the input data, but
each variable occupies the same columns for every
observation, the read.fwf() function can be used.
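
A minimal sketch of the fixed-width case follows. The two-line file, its column layout (5 characters for the name, then two 2-digit marks) and the column names are all invented for the illustration.

```r
# Hypothetical fixed-width file: name in columns 1-5, marks in 6-7 and 8-9
fixed_file <- tempfile(fileext = ".txt")
writeLines(c("Rama 1215",
             "Usha 1312"), fixed_file)

# widths= gives the number of characters occupied by each field
marks <- read.fwf(fixed_file, widths = c(5, 2, 2),
                  col.names = c("name", "m1", "m2"),
                  stringsAsFactors = FALSE, strip.white = TRUE)
print(marks)
```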



read.table() function - column names

If the first line in your input data consists of variable names separated by the same separator as the
data, the header=TRUE argument can be passed to read.table() to use
these names to identify the columns of the output data frame.
Alternatively, the col.names= argument to read.table() can
specify a character vector containing the variable names.
Without other guidance, read.table() will name the variables
using a V followed by the column number.
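
The two naming behaviours can be compared on the same headerless input. This is a small sketch with made-up data supplied through the text= argument.

```r
# Hypothetical headerless, space-separated data
txt <- "Rama 12 15
Usha 13 12"

# Default names: V1, V2, V3
default_names <- read.table(text = txt)

# Names supplied through col.names=
named <- read.table(text = txt, col.names = c("Student", "Prob", "Dist"))

names(default_names)   # "V1" "V2" "V3"
names(named)           # "Student" "Prob" "Dist"
```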



file= argument

The only required argument to read.table() is
a file name,
URL, or
connection object.


stringsAsFactors= argument

In R versions before 4.0.0, read.table() automatically converts character variables to factors
(stringsAsFactors=TRUE). This offers increased efficiency in storage. Since R 4.0.0 the default is
stringsAsFactors=FALSE, and character variables are left as character strings.
Converting characters to factors may cause some problems
when trying to use the variables as simple character strings.
Conversion to factors can be prevented by using
stringsAsFactors=FALSE.
To ensure that character variables are never converted to
factors in the older versions, the system option stringsAsFactors can be set to
FALSE using
> options(stringsAsFactors=FALSE)


as.is= argument

The as.is= argument can be used to suppress factor conversion for a subset of the variables in your
data, by supplying a vector of indices specifying the columns not to be
converted, or a logical vector with length equal to the number
of columns to be read and TRUE wherever factor conversion
is to be suppressed.


Illustrating the as.is= argument
"Rama Rao" M 12 15 12 13
"Subba Rao" M 14 15 15 15
"Usha Rani" F 13 12 15 14
"Yohan Babu" M 11 11 12 11
"Thilak" M 12 14 15 11
"Sudha Rani" F 12 13 14 15
> str(read.table("noheadspa-fact.txt"))
’data.frame’: 6 obs. of 6 variables:
$ V1: Factor w/ 6 levels "Rama Rao","Subba Rao",..: 1 2 5
$ V2: Factor w/ 2 levels "F","M": 2 2 1 2 2 1
$ V3: int 12 14 13 11 12 12
$ V4: int 15 15 12 11 14 13
$ V5: int 12 15 15 12 15 14
$ V6: int 13 15 14 11 11 15
In R versions before 4.0.0, stringsAsFactors=TRUE by default, and hence the file read has
the above structure.
as.is= argument
Data File
"Rama Rao" M 12 15 12 13
"Subba Rao" M 14 15 15 15
"Usha Rani" F 13 12 15 14
"Yohan Babu" M 11 11 12 11
"Thilak" M 12 14 15 11
"Sudha Rani" F 12 13 14 15
Look at the structure of the file after setting as.is= argument:
> str(read.table("noheadspa-fact.txt", as.is = c(1)))
’data.frame’: 6 obs. of 6 variables:
$ V1: chr "Rama Rao" "Subba Rao" "Usha Rani" "Yohan Babu"
$ V2: Factor w/ 2 levels "F","M": 2 2 1 2 2 1
$ V3: int 12 14 13 11 12 12
$ V4: int 15 15 12 11 14 13
$ V5: int 12 15 15 12 15 14
$ V6: int 13 15 14 11 11 15
>
row.names= argument

The row.names= argument can be used to pass a vector of character values to be used as row names
to identify the output and which can be used instead of numeric subscripts
when indexing the data frame.
An argument of row.names = NULL will use a character
representation of the observation number for the row names.


read.table() function - Missing Values

read.table() will automatically treat the symbol NA as representing a missing value for any data type.
In numeric columns, blank fields are also read as missing, while NaN, Inf and -Inf are read as the
corresponding special numeric values.
To modify this behavior, the na.strings argument can be
passed a vector of character values that should be interpreted
as representing missing values.
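
As a sketch, suppose a data source codes missing scores as -999 or as a single dot; both codes are assumptions made for this example. Listing them in na.strings turns them into NA while the column is still read as numeric.

```r
# Hypothetical data in which -999 and "." both mean "missing"
txt <- "name,score
Anu,77
Bil,-999
Ali,."

d <- read.table(text = txt, header = TRUE, sep = ",",
                na.strings = c("NA", "-999", "."))

is.na(d$score)             # FALSE TRUE TRUE
mean(d$score, na.rm = TRUE)
```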



dec= argument

For locales which use a character other than the period (.) as
a decimal point, the dec= argument can be used to specify an
alternative.
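
For instance, data exported with a decimal comma and a semicolon separator could be read as in the sketch below; the values are made up.

```r
# Hypothetical semicolon-separated data with decimal commas
txt <- "x;y
1,5;2,25
3,0;4,75"

d <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

# Both columns are numeric, so arithmetic works directly
d$x + d$y
```

Without dec=",", the two columns would be read as character data (or factors in older R versions) rather than as numbers.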



encoding= argument

The encoding= argument can be used to interpret non-ASCII characters in your input data.


Controlling input size

You can control which lines are read from your input source
using the skip= argument that specifies a number of lines to
skip at the beginning of your file, and the nrows= argument
which specifies the maximum number of rows to read.
For very large inputs, specifying a value for nrows= which is
close to but greater than the number of rows to be read may
provide an increase in speed.
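
A small sketch of skip= and nrows= together: the file below is invented, with one title line before the header, and only the first two data rows are read.

```r
# Hypothetical file: a title line, then a header, then four data rows
input_file <- tempfile(fileext = ".txt")
writeLines(c("Scores exported from the register",
             "id score",
             "1 77",
             "2 80",
             "3 60",
             "4 82"), input_file)

# Skip the title line, treat the next line as header, read two rows only
first_two <- read.table(input_file, header = TRUE, skip = 1, nrows = 2)
print(first_two)
```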



Handling unequal row sizes

read.table() expects the same number of fields on each line, and will report an error if it detects
otherwise. If the unequal
numbers of fields are due to the fact that some observations
naturally have more variables than others, the fill=TRUE
argument can be used to fill in observations with fewer
variables using blanks or NAs.
If read.table() reports that there are unequal numbers of fields
on some of the lines, the count.fields() function can often help
determine where the problem is.
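
The sketch below shows both tools on an invented file whose second line is one field short: count.fields() locates the short line, and fill=TRUE pads it with NA.

```r
# Hypothetical file in which line 2 has only three fields
ragged_file <- tempfile(fileext = ".txt")
writeLines(c("Anu 12 15 13",
             "Bil 14 15",
             "Ali 13 12 14"), ragged_file)

# Number of fields found on each line
n_fields <- count.fields(ragged_file)
which(n_fields != max(n_fields))   # the short line(s)

# fill = TRUE pads the short line instead of stopping with an error
d <- read.table(ragged_file, fill = TRUE)
print(d)
```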



Which columns to read?

read.table() accepts a colClasses= argument, similar to the what= argument of the scan() function, to
specify the modes of the columns to be read.
Since read.table() will automatically recognize character and
numeric data, this argument is most useful when you want to
perform more complex conversions as the data is being read,
or if you need to skip some of the fields in your input
connection.
Explicitly declaring the types of the columns may also improve
the efficiency of reading data. To specify the column classes,
provide a vector of character values representing the data
types; any type for which there is an as. method can be used.
A value of "NULL" (as a character string) instructs read.table() to skip that column,
and a value of NA lets read.table() decide the format to use
when reading that column.
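
As a sketch, the made-up input below has a code column with leading zeros that should stay as text, and a third column that is not needed. colClasses handles both.

```r
# Hypothetical data: name, code with leading zeros, unwanted mark, wanted mark
txt <- "Anu 007 12 15
Bil 012 14 15"

d <- read.table(text = txt,
                colClasses = c("character", "character", "NULL", "numeric"))

d$V2       # "007" "012": leading zeros are kept
ncol(d)    # 3: the "NULL" column was skipped
```

Left to itself, read.table() would read the code column as the integers 7 and 12, losing the leading zeros.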



> read.table("noheadspa.txt", sep=" ",
stringsAsFactors=FALSE)
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
> read.table("noheadspa.txt", sep="",
stringsAsFactors=FALSE)
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
> read.table("noheadtab.txt", sep="\t",
stringsAsFactors=FALSE)
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
> read.table("noheadtab.txt", sep="",
stringsAsFactors=FALSE)
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>
Data File:

"Rama Rao";12;15;12;13
"Subba Rao";14;15;15;15
"Usha Rani";13;12;15;14
"Yohan Babu";11;11;12;11
"Thilak";12;14;15;11

> read.table("noheadcol.txt", sep = ";",
stringsAsFactors = FALSE)
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>



# This file has no header and tab separated
"Rama Rao" 12 15 12 13
"Subba Rao" 14 15 15 15
"Usha Rani" 13 12 15 14
"Yohan Babu" 11 11 12 11
"Thilak" 12 14 15 11

> read.table("noheadtab-com.txt")
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>



read.table() function

# This file has no header and tab separated
"Rama Rao" 12 15 12 13
"Subba Rao" 14 15 15 15
"Usha Rani" 13 12 15 14
"Yohan Babu" 11 11 12 11
"Thilak" 12 14 15 11

> read.table("noheadtab-com.txt",comment.char="#")
V1 V2 V3 V4 V5
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>



read.csv() function

read.csv(file, header = TRUE, sep = ",",
quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)



read.csv2() function

read.csv2(file, header = TRUE, sep = ";",
quote = "\"", dec = ",",
fill = TRUE, comment.char = "", ...)
> read.csv2("headcol.txt")
Student Prob Dist Estn Prog
1 Rama Rao 12 15 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 11
>



read.csv2() function
Data File - Illustrating dec=","
Student;Prob;Dist;Estn;Prog
"Rama Rao";12;15;12;13,2
"Subba Rao";14;15;15;15
"Usha Rani";13;12;15;14
"Yohan Babu";11;11;12;11
"Thilak";12;14;15;11
> read.csv2("headcol-dec.txt")
Student Prob Dist Estn Prog
1 Rama Rao 12 15 12 13.2
2 Subba Rao 14 15 15 15.0
3 Usha Rani 13 12 15 14.0
4 Yohan Babu 11 11 12 11.0
5 Thilak 12 14 15 11.0
>
read.csv2() function
Data File
Student;Prob;Dist;Estn;Prog
"Rama Rao";12,0;15,0;12,0;13,5
"Subba Rao";14,5;15,0;15,0;15,0
"Usha Rani";13,5;12,5;15,0;14,5
"Yohan Babu";11,0;11,0;12,5;11,5
"Thilak";12;14;15;11
> read.csv2("headcol-dec.txt")
Student Prob Dist Estn Prog
1 Rama Rao 12.0 15.0 12.0 13.5
2 Subba Rao 14.5 15.0 15.0 15.0
3 Usha Rani 13.5 12.5 15.0 14.5
4 Yohan Babu 11.0 11.0 12.5 11.5
5 Thilak 12.0 14.0 15.0 11.0
>
read.csv2() function
Data file with Missing Values
Student;Prob;Dist;Estn;Prog
"Rama Rao";12;;12;13
"Subba Rao";14;15;15;15
"Usha Rani";13;12;15;14
"Yohan Babu";11;11;12;11
"Thilak";12;14;15;

> read.csv2("headcolmiss.txt")
Student Prob Dist Estn Prog
1 Rama Rao 12 NA 12 13
2 Subba Rao 14 15 15 15
3 Usha Rani 13 12 15 14
4 Yohan Babu 11 11 12 11
5 Thilak 12 14 15 NA
>
read.delim() function

read.delim( file,
header = TRUE,
sep = "\t",
quote = "\"",
dec = ".",
fill = TRUE,
comment.char = "", ...)



read.delim2() function

read.delim2( file,
header = TRUE,
sep = "\t",
quote = "\"",
dec = ",",
fill = TRUE,
comment.char = "", ...)



Function         sep=    dec=
read.csv()       ","     "."
read.delim()     "\t"    "."
read.csv2()      ";"     ","
read.delim2()    "\t"    ","


read.fwf() Function
Importing Data

Dr. L. V. Rao

August 26, 2019

Reading Data into R - Dr. L. V. Rao,, August 26, 2019 1/1


Table-Format Files
Table-format files are best thought of as plain-text files with three
key features that fully define how R should read the data.
Header If a header is present, it’s always the first line of the
file. This optional feature is used to provide names for each
column of data. When importing a file into R, you need to tell
R whether a header is present so that it knows whether to
treat the first line as variable names or, alternatively, observed
data values.
Delimiter The all-important delimiter is a character used to
separate the entries in each line. The delimiter character
cannot be used for anything else in the file. This tells R when
a specific entry begins and ends.
Missing value This is another unique character string used
exclusively to denote a missing value. When reading the file,
R will turn these entries into the form it recognizes: NA.
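
The three features map directly onto arguments of read.table(). The sketch below assumes an invented file that has a header line, uses a semicolon as delimiter, and marks missing values with an asterisk.

```r
# Hypothetical file: header line, ";" as delimiter, "*" marking missing values.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Student;Prob;Dist",
             "Rama Rao;12;15",
             "Usha Rani;*;12"), tmp)

marks <- read.table(tmp,
                    header = TRUE,      # first line holds the column names
                    sep = ";",          # the delimiter
                    na.strings = "*",   # the missing-value string
                    stringsAsFactors = FALSE)
print(marks)
```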



Table-Format Files

Typically, these files have a .txt extension (highlighting the
plain-text style) or .csv (for comma-separated values).


Fixed-Width-Field Files

For the common cases of reading in data whose fields are
separated by commas or tabs, R provides three convenience
functions,
read.csv(),
read.csv2(), and
read.delim().
These functions are wrappers for read.table() function, with
appropriate arguments set for comma-, semicolon-, or
tab-delimited data, respectively.
Since these functions will accept any of the optional
arguments to read.table() function, they are often more
convenient than using read.table() and setting the appropriate
arguments manually.



read.fwf() function

read.fwf( file, widths,
header = FALSE, sep = "\t",
skip = 0, row.names,
col.names, n = -1,
buffersize = 2000,
fileEncoding = "", ...)



read.fwf() function

Sometimes input data is stored with no delimiters between the
values, but with each variable occupying the same columns on
each line of input.
In cases like this, the read.fwf() function can be used.
Choice in reading Variables The widths= argument can be
a vector containing the widths of the fields to be read, using
negative numbers to indicate columns to be skipped.
Observations spanning more than one line: If the data for
each observation occupies more than one line, widths= can be
a list of as many vectors as there are lines per observation.
The header=, row.names=, and col.names= arguments
behave similarly to those in read.table().
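
The list form of widths= can be sketched as below. The file layout is an assumption made up for illustration: every observation occupies two lines, the first holding a 2-character code and a 2-digit value, the second a 3-digit value.

```r
# Hypothetical file: each observation spans two lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB12", "345",
             "CD67", "890"), tmp)

# One widths vector per line of the record.
rec <- read.fwf(tmp, widths = list(c(2, 2), 3), as.is = TRUE)
print(rec)
```

Each line must be exactly as wide as its widths vector declares, because the lines of a record are joined before the fields are cut out.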



read.fwf() function
Data file: fwf-tab.txt ( tab separated data )

12 13 14 15
14 12 12 12
15 15 13 12
14 12 12 12
15 14 14 12

> read.fwf("fwf-tab.txt",widths=c(2,-1,2,-1,2,-1,2))
V1 V2 V3 V4
1 12 13 14 15
2 14 12 12 12
3 15 15 13 12
4 14 12 12 12
5 15 14 14 12

Use a width of -1 to skip the tab character between fields.
read.fwf() function
Data File: fwf.txt

12131415
14121212
15151312
14121212
15141412

> read.fwf("fwf.txt",widths=c(2,2,2,2))
V1 V2 V3 V4
1 12 13 14 15
2 14 12 12 12
3 15 15 13 12
4 14 12 12 12
5 15 14 14 12
>



read.fwf() function
Data File: fwf.txt

12131415
14121212
15151312
14121212
15141412

> read.fwf("fwf.txt", widths=c(2,2,2,-2))
V1 V2 V3
1 12 13 14
2 14 12 12
3 15 15 13
4 14 12 12
5 15 14 14
>



Illustrating read.fwf() function

Consider the following lines, showing the 10 counties of the United
States with the highest population density (measured in population
per square mile):

New York, NY 66,834.6
Kings, NY 34,722.9
Bronx, NY 31,729.8
Queens, NY 20,453.0
San Francisco, CA 16,526.2
Hudson, NJ 12,956.9
Suffolk, MA 11,691.6
Philadelphia, PA 11,241.1
Washington, DC 9,378.0
Alexandria IC, VA 8,552.2



read.fwf() function

Since the county names contain blanks and are not surrounded
by quotes, read.table() will have difficulty reading the data.
However, since the names are always in the same columns, we
can use read.fwf() function.
The commas in the population values will force read.fwf() to
treat them as character values, and, like read.table() in R
versions before 4.0.0, it will convert them to factors by
default, which may prove inconvenient later.
If we wanted to extract the state values from the county
names, we might want to suppress factor conversion for these
values as well, and as.is=TRUE will be used.
Assuming that the data is stored in a file named city.txt, the
values could be read as follows:



read.fwf() function
> city = read.fwf("city.txt",
widths=c(18,-19,8),
as.is=TRUE)
> city
V1 V2
1 New York, NY 66,834.6
2 Kings, NY 34,722.9
3 Bronx, NY 31,729.8
4 Queens, NY 20,453.0
5 San Francisco, CA 16,526.2
6 Hudson, NJ 12,956.9
7 Suffolk, MA 11,691.6
8 Philadelphia, PA 11,241.1
9 Washington, DC 9,378.0
10 Alexandria IC, VA 8,552.2
>
Cleaning Data
Before using V2 as a numeric variable, the commas would
need to be removed using gsub() function.
> city$V2 = as.numeric(gsub(",","",city$V2))
> city
V1 V2
1 New York, NY 66834.6
2 Kings, NY 34722.9
3 Bronx, NY 31729.8
4 Queens, NY 20453.0
5 San Francisco, CA 16526.2
6 Hudson, NJ 12956.9
7 Suffolk, MA 11691.6
8 Philadelphia, PA 11241.1
9 Washington, DC 9378.0
10 Alexandria IC, VA 8552.2
>
Importing Data Into R
read.csv() function

Dr. L. V. Rao

August 21, 2019

Importing Data Into R - Dr. L. V. Rao,, August 21, 2019 1 / 24


Often data to be analyzed is stored in external files.
Typically, data is stored in plain text files,
delimited by white space such as
tabs or spaces, or
by special characters such as
commas or semicolons.



scan() - Univariate Data

Univariate data from an external file can be read into a
vector by the scan() command.


read.table() - Table Format Data

If the file contains a data frame or a matrix, or is in csv
format (comma separated values), use the read.table()
function.
The read.table() function has many options to support
different file formats.



read.table() function

The principal tool for reading in delimited files in R is the
read.table() function, together with its offspring read.csv()
and read.delim().
These three functions are identical except for their default
settings.
All three of these produce data frames.
If a matrix is needed, the best approach is to construct a data
frame and then convert it via as.matrix().



European-style data format

Two more functions, read.csv2() and read.delim2(), are also
available; these are just like read.csv() and read.delim()
except that they expect the European-style comma for the decimal
point and that read.csv2() uses the semicolon as its delimiter.


read.table() function

read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".",
numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE,
fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text,
skipNul = FALSE)



header= argument

header= a logical indicating whether the disk file has header
labels in the first row. If so, set this to TRUE and those labels
will be used as column headers.
By default, header=FALSE.



sep= argument

sep= the separator character. This might be a comma for a
CSV, a tab (written \t) for a tab-separated file, a semicolon,
or something else.
By default, it is sep="".


quote= argument

By default, the set of quote characters is both ' and "
in read.table(), and just " in the other functions.
This means that a string inside quotation marks such as
"State Bank of India" is treated as a single unit. This is
essential when the separator is a space, since otherwise
that phrase would be read as three separate fields.
Treating the single quote mark as a quote character is useful
for text that follows the British convention, where single
quotation marks are more common.
However, the single quote also serves as an apostrophe, for
example,
D'Alembert's Ratio Test
We therefore often turn the interpretation of quotation marks
off by passing quote = "", or restrict it to the double
quotation mark by passing quote = "\"".
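
A small sketch of the apostrophe problem, using invented input. With the default quote set of read.table(), the apostrophe in the first name would be taken as the start of a quoted string; restricting quote= to the double quotation mark avoids that.

```r
# Hypothetical input in which a name contains an apostrophe.
txt <- "D'Alembert 1717\nEuler 1707"

# Only the double quotation mark is treated as a quote character.
d <- read.table(text = txt, quote = "\"", stringsAsFactors = FALSE)
print(d)
```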



comment.char= argument

Lots of code has comments, but comments in data are rare. If
there are any, they need to be taken care of while reading the
data.
The argument comment.char= is used to set the character for
a comment line in the file.
By default this argument is set to "#", that is,
comment.char="#"
It can be turned off by setting
comment.char=""


stringsAsFactors= argument

stringsAsFactors= a logical that determines whether columns
that appear to be characters need to be converted to factors
or left as characters.
In R versions before 4.0.0, this argument is set by default to
stringsAsFactors=TRUE
From R 4.0.0 onwards, the default is stringsAsFactors=FALSE.


colClasses= argument

colClasses= a vector that explicitly gives the class of each
column.
By default this argument is set to
colClasses=NA



na.strings= argument

na.strings= a vector specifying the indicator(s) of missing
values in the input data.
By default this argument is set to
na.strings="NA"


skip= argument

The skip= argument is used to specify the number of lines
skipped before the reading starts.
Usage: When the data file contains some documentation
regarding the variables and their nature, those lines must be
skipped because they are not part of the data to be
analysed.
By default, this argument is set to 0, that is,
skip=0
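
As a sketch of skip= in use, suppose a file starts with two lines of documentation followed by the data. The lines below are invented for illustration.

```r
# Hypothetical file contents: two documentation lines, then the data.
txt <- c("Internal marks, Semester III",
         "Columns: student code, marks",
         "S01 12",
         "S02 14")

# Skip the two documentation lines before reading starts.
d <- read.table(text = txt, skip = 2, stringsAsFactors = FALSE)
print(d)
```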



Other options control
the maximum number of rows to read (nrows=), and
whether blank lines are omitted or included (blank.lines.skip=).


read.table() function

The choice of sep character can often be inferred from the
name of the file:
comma for files whose names end in CSV and tab for TSV,
although this is not a requirement.
When the separator is unknown, we either try the usual ones,
use an external program to examine the first few lines of the
file, or resort to the scan() function.
The default value of sep is the empty string, "", which
indicates that any amount of white space (including tabs)
serves as the delimiter. This is intended for text that has been
formatted to line up nicely on the page (so that extra spaces
or tabs have been added for readability).
Setting sep to be the space character, " ", means that
read.table() will split the line at every space (and never at a
tab).
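
The difference between sep="" and sep=" " can be seen on a single invented line that contains both a space and a tab:

```r
# Hypothetical line: "1", a space, "2", a tab, "3".
line <- "1 2\t3"

# sep = "": any run of white space (spaces or tabs) separates fields.
a <- read.table(text = line, sep = "")

# sep = " ": only a space separates fields, so the tab stays inside a field.
b <- read.table(text = line, sep = " ", stringsAsFactors = FALSE)

ncol(a)   # three fields
ncol(b)   # two fields
```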



What is a .csv file?

A CSV file is just a normal text file that commonly begins with a
header line listing the names of the variables, each separated by a
comma. The remainder of the file after the header row is expected
to consist of rows of data that record the observations. For each
observation, the fields are separated by commas, delimiting the
actual observation of each of the variables.
Data is often supplied in comma-separated-values (.csv) format,
which is a text file that separates data with special text characters
called delimiters. Files in .csv format can be opened in most
spreadsheet applications. Spreadsheet data should be saved in .csv
format before importing into R.



The data can contain different numbers of columns in
different rows, with missing columns at the end of a row being
filled with NAs. This is handled using the fill= argument of
read.csv(), which is TRUE by default.
In a .csv file, the dates are likely to be given as strings,
delimited by double quotation marks.



The additional arguments of read.csv() are used to fine-tune
how the data is read into R.
A header row is the first row and lists the variable or column
names.
We set header=FALSE if the data file has no header row; in
that case we can supply the variable names using the
col.names= (column names) argument.
We set header=TRUE if the first row contains the variable or
column names.
We set strip.white= to TRUE to strip spaces from the data to
ensure we do not get extra white space in any columns.
If missing values are denoted in the file by a question mark,
we tell the function so with the na.strings= argument.
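
A minimal sketch combining these arguments, assuming invented data with no header row and a question mark for the missing value. The column names id and score are chosen only for illustration.

```r
# Hypothetical CSV content: no header row, "?" marks a missing score.
txt <- "1,10\n2,?\n3,15"

d <- read.csv(text = txt,
              header = FALSE,                  # the first row is data
              col.names = c("id", "score"),    # so we supply the names
              na.strings = "?")                # "?" becomes NA
print(d)
```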



write.table() function

The write.table() function acts to write a delimited file, just
as read.table() reads one.
And, just as there are the read.csv() and read.csv2() analogs
to read.table(), R also provides write.csv() and write.csv2().
We normally pass write.table() a data frame, though a matrix
can be written as well, and we generally supply the delimiter
with the sep argument, since the default choice of a space is
rarely a good one.



Example: Importing/exporting .csv files

This example illustrates how to export the contents of a data
frame to a .csv file, and how to import the data from a .csv file
into an R data frame.

> # create a data frame
> dates <- c("3/27/1995", "4/3/1995",
"4/10/1995", "4/18/1995")
> prices <- c(11.1, 7.9, 1.9, 7.3)
> d <- data.frame(dates = dates, prices = prices)
>
> # create the .csv file
> filename <- "temp.csv"
> write.table(d, file = filename, sep = ",",
row.names = FALSE)


Example: Importing/exporting .csv files

The new file temp.csv can be opened in most spreadsheets. When
displayed in a text editor (not a spreadsheet), the file temp.csv
contains the following lines (without the leading spaces).

"dates","prices"
"3/27/1995",11.1
"4/3/1995",7.9
"4/10/1995",1.9
"4/18/1995",7.3



Example: Importing/exporting .csv files

Most .csv format files can be read using read.table(). In addition,
there are the functions read.csv() and read.csv2(), designed for
.csv files.

> # read the .csv file
> read.table(file = filename, sep = ",", header = TRUE)
> read.csv(file = filename) #same thing
dates prices
1 3/27/1995 11.1
2 4/3/1995 7.9
3 4/10/1995 1.9
4 4/18/1995 7.3
>



Quoting

A few other arguments are useful as well. First, the resulting
entries from character and factor columns are quoted by
default; we often turn this behavior off with quote = FALSE,
depending on what the recipient of the output is expecting.
Quoting becomes necessary, though, when character values
might contain the delimiter, or when it is important to retain
leading zeros in identifiers that look numeric (01, 02, etc.).
Second, row names are written by default; we rarely want
these, so we generally specify row.names = FALSE. In
contrast, the default setting of the col.names argument, which
is TRUE, usually is what we want. The exception is when we
plan to do a number of writes to a single file. In that case,
the first write will usually specify col.names = TRUE and
append = FALSE, and subsequent ones will specify
col.names= FALSE and append = TRUE.
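
The pattern of several writes to one file can be sketched as follows. The two small data frames and the temporary file name are assumptions made for illustration.

```r
# Two hypothetical pieces of the same table.
part1 <- data.frame(id = 1:2, price = c(11.1, 7.9))
part2 <- data.frame(id = 3:4, price = c(1.9, 7.3))
out <- tempfile(fileext = ".csv")

# First write: create the file and include the column names.
write.table(part1, file = out, sep = ",", row.names = FALSE,
            col.names = TRUE, append = FALSE)

# Later writes: add rows only, without repeating the column names.
write.table(part2, file = out, sep = ",", row.names = FALSE,
            col.names = FALSE, append = TRUE)

back <- read.csv(out)
print(back)
```

write.table() is used for both writes here because write.csv() is not intended for appending to an existing file.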



Control Structures

L. V. Rao

October 29, 2020

- ,, October 29, 2020 1/1


Control Structures in R

• Selection Control Structures
  • if
  • if · · · else
  • ifelse
• Looping Control Structures
  • for
  • while
  • repeat
• Others
  • break
  • next


if
The syntax for if control structure is

if(condition)
{
# do something
}

If the condition evaluates to TRUE, the associated block of code
gets executed; otherwise it is skipped.
The parentheses around condition are mandatory.
The curly braces are optional when the body of if has ONLY one
statement to be executed.
The if statement, with or without else, tests a single logical
value; it is not an element-wise (vector) function.
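
Because if tests a single value, a condition built from a vector has to be reduced to one TRUE or FALSE first, for instance with all() or any(). A minimal sketch with an invented vector (from R 4.2.0 onwards, passing a condition of length greater than one to if is an error):

```r
x <- c(3, -1, 4)

# x > 0 is a logical vector of length 3, so reduce it to one value first.
if (all(x > 0)) {
  msg <- "all values are positive"
} else {
  msg <- "at least one value is not positive"
}
print(msg)
```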



Example: if

n <- 6
m <- 3
if(n %% m == 0)
{
print(paste(n,"is divisible by",m ))
}

[1] "6 is divisible by 3"



if · · · else
The syntax for if · · · else control structure is

if(cond)
{
# do something
} else {
# do something else
}

• If the cond evaluates to TRUE, then the block of code
following the closing parenthesis of if gets executed.
• If the cond evaluates to FALSE, then the block of code
associated with the else part gets executed.
• The parentheses around cond are mandatory.


Example: if · · · else

n <- 7
m <- 3
if(n %% m == 0)
{
print(paste(n,"is divisible by",m ))
} else {
print(paste(n,"is NOT divisible by",m ))
}

[1] "7 is NOT divisible by 3"



ifelse

• The ifelse() function accepts a logical vector as its first
argument, and two other arguments: the first provides a value
for the case where elements of the input logical vector are
true, and the second for the case where they are false.
• The ifelse() function operates on vectors: it evaluates the
test element by element, returning the value from its second
argument where the test is TRUE and from its third argument
otherwise. ifelse() is a vectorized version of the if/else
construct.


Syntax: ifelse

The syntax for ifelse control structure is

ifelse(test, exp1, exp2)

test a logical vector.
exp1 return values for true elements of test.
exp2 return values for false elements of test.

The result has the same length as test; exp1 and exp2 are recycled
to that length if they are shorter.


%in% operator
%in% operator in R, is used to identify if an element belongs to a
vector or data frame.

> x <- 1:5
> 3 %in% x
[1] TRUE
> 13 %in% x
[1] FALSE
>

> mtcars$cyl %in% 4
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[22] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[29] FALSE FALSE FALSE TRUE
>
Example

For some recoding tasks, the ifelse function may be more useful
than manipulating logical variables directly. Suppose we have a
variable called group that takes on values in the range of 1 to 5,
and we wish to create a new variable that will be equal to 1 if the
original variable is either 1 or 5, and equal to 2 otherwise.

> group <- 1:5
> ifelse(group %in% c(1,5),1,2)
[1] 1 2 2 2 1
>


ifelse

# Illustrating ifelse
# giving discount
bill <- c(12500, 10131, 567, 8999)
bill.amt <- ifelse(bill > 10000, bill * 0.8, bill)
bill.amt
[1] 10000.0 8104.8 567.0 8999.0



Nested if

if(cond1){
expr1
} else if(cond2){
expr2
} else {
expr3
}



Nested if

# illustrating nesting of if...else
x <- -5
if(x > 0)
{
print(paste(x,"is positive"))
} else if(x == 0)
{
print(paste(x,"is zero"))
} else {
print(paste(x,"is negative"))
}



Loops

• Explicit loops in R are often much slower than the
equivalent vectorized operations.
• Use of loops should be avoided whenever possible.
• Vectorization should be used wherever possible.
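
As a small illustration of the same computation done both ways (the task, a sum of squares, is chosen only as an example):

```r
x <- 1:10

# Loop version: accumulate the sum of squares one element at a time.
total <- 0
for (v in x) {
  total <- total + v^2
}

# Vectorized version: square the whole vector at once, then sum.
vec_total <- sum(x^2)

print(total)
print(vec_total)
```

Both give the same result; the vectorized form is shorter and leaves the looping to R's internal code.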



for

The for control structure syntax is

for(var in seq) expr

• In the above, for and in are R’s reserved words.
• var is the name of the loop control variable
• seq an expression evaluating to a vector
• var successively acquires values in the vector seq
• expr is usually a block of code to be executed repetitively



for

#computing factorial
fact <- 1
for(k in 1:5)
{
fact <- fact * k
print(fact)
}

# using built-in function
factorial(1:5)


for

# printing the pattern
x <- NULL
for(k in 1:4)
{
x <- paste(x, "*")
cat(x, "\n")
}



while

The syntax for while control structure is

while(cond) expr

• while is R’s reserved word.
• cond is a conditional expression.
• As long as cond evaluates to TRUE, expr is executed
repeatedly; when cond evaluates to FALSE, the loop ends
and expr is skipped.
• expr is usually a block of code to be executed repetitively.


while

# to reverse a number
revNum <- 0
num <- 1234
while(num>0)
{
digit <- num%%10
revNum <- revNum*10+digit
num <- num%/%10
}
revNum



Nested Control Structures
# to print a series of even numbers
n <- 50
y <- 1:100
evens <- NULL
x <- sample(y,n,replace=T)
for(k in x)
{
if(k%%2 == 0)
{
evens <- c(evens,k)
}
}
> evens
[1] 86 100 32 68 36 52 96 96 8 56
[11] 10 82 64 58 38 66 46 86 18 78
[21] 58 82 52 40 10
Vectorization:

> x[x %% 2 == 0]
[1] 86 100 32 68 36 52 96 96 8 56
[11] 10 82 64 58 38 66 46 86 18 78
[21] 58 82 52 40 10



repeat

The syntax for the repeat loop is

repeat{
statement
}

• repeat is the reserved word of R.
• statement is usually a block of code containing a test
condition to break the loop. In other words, in the statement
block, we must use the break statement to exit the loop.
• When the test is placed at the end of the block, the
statements in the block get executed at least once.


Example: repeat

x <- 1
repeat
{
print(x)
x = x+1
if (x == 6)
{
break
}
}



break

• In R programming, a normal looping sequence can be altered
using the break or the next statement.
• A break statement is used inside a loop (repeat, for, while) to
stop the iterations and the control is passed to a statement
following the current loop.
• In a nested looping situation, the break statement exits from
the innermost loop that is being evaluated and control is
passed to outer loop.
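
The behaviour of break inside nested loops can be sketched as follows. The vector name visited is chosen only for illustration.

```r
visited <- character(0)

for (i in 1:3) {
  for (j in 1:3) {
    if (j == 2) break                   # leaves the inner loop only
    visited <- c(visited, paste(i, j))
  }
  # control arrives here after the break; the outer loop carries on
}
print(visited)
```

The outer loop still runs for i = 1, 2 and 3; only the inner loop is cut short each time j reaches 2.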



break statement

The syntax of break statement is:

if (condition) {
break
}



Example - break

x <- 1:5
for (val in x)
{
if(val == 3)
{
break
}
print(val)
}

[1] 1
[1] 2



next

• A next statement is useful when we want to skip the current
iteration of a loop without terminating it.
• On encountering next, R skips the remaining statements of
the current iteration and starts the next iteration of the loop.
• The syntax of next statement is:

if( condition )
{
next
}



Example: next

x <- 1:5
for( val in x )
{
if (val == 3)
{
next
}
print(val)
}

[1] 1
[1] 2
[1] 4
[1] 5



Summary Statistics
Formula objects in R

Dr. L. V. Rao

August 28, 2019

Formula Notation - Dr. L. V. Rao,, August 28, 2019 1 / 19


sort() function

The syntax for the sort() function is

sort(x, decreasing = FALSE, na.last = NA, ...)

where x is an R object with a class or
a vector of atomic type.

Note that it has two arguments with default values:

decreasing = FALSE
na.last = NA


sort() function - vector objects

> # the vector object
> vec
[1] 15 18 NA 11 17 20 18 NA
>
> # sort(vec, na.last=NA, decreasing=FALSE)
> sort(vec)
[1] 11 15 17 18 18 20
>
> # sort(vec, na.last=NA, decreasing=TRUE)
> sort(vec,decreasing=TRUE)
[1] 20 18 18 17 15 11
>
> # sort(vec, na.last=TRUE, decreasing=FALSE)
> sort(vec,na.last=TRUE)
[1] 11 15 17 18 18 20 NA NA
>
> # sort(vec, na.last=FALSE, decreasing=FALSE)
> sort(vec,na.last=FALSE)
[1] NA NA 11 15 17 18 18 20
>
sort() function

It is used to rearrange the items in an R object.
By default, it arranges the items in ascending order, dropping
the NA items, if any.
The default ascending order can be changed using the
decreasing=TRUE argument.
The default way of treating the NA items can be changed
using the na.last= argument. This argument has a default
value of NA ( na.last = NA ).
The na.last= argument is used to determine the placement of
the NA value - whether to keep towards the last of the list of
values or to keep the NA values at the beginning.
na.last=TRUE will display the NA values towards the end.
na.last=FALSE will display the NA values in the beginning.



sort() function - matrix objects

> Maths <- c(100,98,89,95)
> Phy <- c(95,82,79,88)
> Chem <- c(92,84,81,80)
> mat <- cbind(Maths,Phy,Chem)
> row.names(mat) <- c("Sarayu","Sarala",
"Saroja","Samata")
> mat
Maths Phy Chem
Sarayu 100 95 92
Sarala 98 82 84
Saroja 89 79 81
Samata 95 88 80
>



sort() function - matrix objects

> sort(mat)
[1] 79 80 81 82
[5] 84 88 89 92
[9] 95 95 98 100
>
> # Who is doing good?
> sort(mat[,"Maths"])
Saroja Samata Sarala Sarayu
89 95 98 100
> sort(mat[,"Phy"])
Saroja Sarala Samata Sarayu
79 82 88 95
> sort(mat[,"Chem"])
Samata Saroja Sarala Sarayu
80 81 84 92
>
> # Which subject is done well?
> sort(mat[1,])
Chem Phy Maths
92 95 100
> sort(mat[2,])
Phy Chem Maths
82 84 98
> sort(mat[3,])
Phy Chem Maths
79 81 89
> sort(mat[4,])
Chem Phy Maths
80 88 95
>

All did well in maths.
sort() function - matrix objects

Sort the matrix by row names:

> mat1[sort(row.names(mat1)),]
Maths Phy Chem
Samata 95 88 80
Sarala 95 82 84
Sarayu 100 95 92
Saroja 89 79 NA
>

Sort the matrix by column names:

> mat1[,sort(colnames(mat1))]
Chem Maths Phy
Sarayu 92 100 95
Sarala 84 95 82
Saroja NA 89 79
Samata 80 95 88
>


sort() function - Data frame objects
Do not use a sort() command on an entire data frame, even if
all the columns are of same data type.
You may consider a single row or column of a data frame.

> mat2 <- mat1
> rownames(mat2)<- NULL
> df <- data.frame(
Student=rnames,
mat2 )
> df
Student Maths Phy Chem
1 Sarayu 100 95 92
2 Sarala 95 82 84
3 Saroja 89 79 NA
4 Samata 95 88 80
>
> sort(df[,1])
[1] Samata Sarala Sarayu Saroja
Levels: Samata Sarala Sarayu Sar
> sort(df[,2])
[1] 89 95 95 100
> sort(df[,"Chem"],na.last=TRUE)
[1] 80 84 92 NA
> sort(df[,"Chem"])
[1] 80 84 92
> sort(df$Maths)
[1] 89 95 95 100
>
sort() function - List Objects

You will need to use a slightly different convention to work with
the elements of lists:
first extract the component you need, for example with $ and
the component's name.
The sort(), order(), or rank() commands can then be applied to
the extracted list item.


sort() function - List Objects

> (lst <- list(Maths=Maths,
Matrix=mat, dFrame=df))
$Maths
[1] 100 98 89 95
$Matrix
Maths Phy Chem
Sarayu 100 95 92
Sarala 98 82 84
Saroja 89 79 NA
Samata 95 88 80
$dFrame
Student Maths Phy Chem
1 Sarayu 100 95 92
2 Sarala 95 82 84
3 Saroja 89 79 NA
4 Samata 95 88 80
>
> lst$Maths
[1] 100 98 89 95
> sort(lst$Maths)
[1] 89 95 98 100
> sort(lst$Matrix[,3])
Samata Sarala Sarayu
80 84 92
> sort(lst$Matrix[,3],
na.last=TRUE)
Samata Sarala Sarayu Saroja
80 84 92 NA
> sort(lst$Matrix[,2],
decreasing=TRUE)
Sarayu Samata Sarala Saroja
95 88 82 79
>
order() function

order(..., na.last = TRUE, decreasing = FALSE,
      method = c("auto", "shell", "radix"))

...    a sequence of numeric, complex, character or logical
       vectors, all of the same length, or a classed R object.

The order() function returns the positions (indices) of the
values in sorted order. Using this result as an indexing vector
gives the sorted vector.

> x <- c(12, 8, 15, 2, 3)
> #      1  2   3  4  5    (positions)
> order(x)
[1] 4 5 2 1 3
> x[order(x)]
[1]  2  3  8 12 15
> sort(x)
[1]  2  3  8 12 15
>
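One reason to prefer order() over sort() is that its index vector can rearrange a second, parallel vector. The sketch below shows this; the vectors students and marks are illustrative values assumed for the example.

```r
# Illustrative data (assumed for this sketch)
students <- c("Sarayu", "Sarala", "Saroja", "Samata")
marks    <- c(100, 98, 89, 95)

idx <- order(marks)               # positions of marks in increasing order
sorted_marks    <- marks[idx]     # same result as sort(marks)
sorted_students <- students[idx]  # names rearranged to match the marks

# Decreasing order: highest marks first
top_first <- students[order(marks, decreasing = TRUE)]

print(idx)
print(sorted_students)
print(top_first)
```

sort(marks) alone would lose the link between each mark and its student, which is why the index vector is kept.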

The order() function also supports the na.last= argument.
By default, na.last=TRUE for the order() function, so the
positions of missing values are placed at the end.

> x.ord
[1] 12 NA  8 12 15  2  5
> #    1  2  3  4  5  6  7    (positions)
> order(x.ord)
[1] 6 7 3 1 4 5 2
> # the last entry, 2, is the index of the NA value


order() function - matrix objects

> mat1
       Maths Phy Chem
Sarayu   100  95   92
Sarala    95  82   84
Saroja    89  79   NA
Samata    95  88   80
> order(mat1[,1])
[1] 3 2 4 1
> mat1[order(mat1[,1]),]
       Maths Phy Chem
Saroja    89  79   NA
Sarala    95  82   84
Samata    95  88   80
Sarayu   100  95   92
>
> mat1[order(mat1[,1],
    mat1[,3]),]
       Maths Phy Chem
Saroja    89  79   NA
Samata    95  88   80
Sarala    95  82   84
Sarayu   100  95   92
> mat1[order(mat1[,1],mat1[,3],
    decreasing=TRUE),]
       Maths Phy Chem
Sarayu   100  95   92
Sarala    95  82   84
Samata    95  88   80
Saroja    89  79   NA
>
order() function - Data Frame Objects

> df
  Student Maths Phy Chem
1  Sarayu   100  95   92
2  Sarala    95  82   84
3  Saroja    89  79   NA
4  Samata    95  88   80
>
> order(df[,2])
[1] 3 2 4 1
>
> df[order(df[,2]),2]
[1]  89  95  95 100
>
> df[order(df[,2]),]
  Student Maths Phy Chem
3  Saroja    89  79   NA
2  Sarala    95  82   84
4  Samata    95  88   80
1  Sarayu   100  95   92
>
> df[order(df[,2],df[,4]),]
  Student Maths Phy Chem
3  Saroja    89  79   NA
4  Samata    95  88   80
2  Sarala    95  82   84
1  Sarayu   100  95   92
>
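The decreasing= argument applies to all sort keys at once. A common way to mix directions, sketched below, is to negate a numeric key. The data frame is rebuilt here from the values printed on the slide; the variable names idx and ranked are introduced only for this illustration.

```r
# Rebuild the data frame shown on the slide
df <- data.frame(
  Student = c("Sarayu", "Sarala", "Saroja", "Samata"),
  Maths   = c(100, 95, 89, 95),
  Phy     = c(95, 82, 79, 88),
  Chem    = c(92, 84, NA, 80)
)

# Maths descending, ties broken by Chem ascending.
# Negating a numeric key reverses its direction.
idx <- order(-df$Maths, df$Chem)
ranked <- df[idx, ]

print(ranked)
```

The two students with 95 in Maths are separated by their Chem marks, with the lower Chem mark listed first.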


rank() function
Returns the sample ranks of the values in a vector. Ties (i.e., equal
values) and missing values can be handled in several ways.
Usage

rank(x, na.last = TRUE,
     ties.method = c("average", "first", "last",
                     "random", "max", "min"))

x            a numeric, complex, character or logical vector.
na.last      controls the treatment of NAs:
             if TRUE, missing values in the data are put last;
             if FALSE, they are put first;
             if NA, they are removed;
             if "keep", they are kept with rank NA.
ties.method  a character string specifying how ties are treated.


If all components are different (and there are no NAs), the ranks
are well defined, with values in seq_along(x).
With some values equal (called "ties"), the argument
ties.method determines the result at the corresponding indices.
The "first" method results in a permutation with increasing
values at each index set of ties, and analogously "last" with
decreasing values.
The "random" method puts these in random order, whereas
the default, "average", replaces them by their mean, and
"max" and "min" replace them by their maximum and
minimum respectively, the latter being the typical sports
ranking.
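The effect of ties.method can be seen by ranking the same vector several ways. This is a small sketch; the vector x below is an assumed example with one tie and no missing values.

```r
x <- c(12, 8, 12, 15, 2, 5)   # 12 appears twice (a tie); no NAs

r_avg   <- rank(x, ties.method = "average")  # tied values share the mean rank
r_first <- rank(x, ties.method = "first")    # earlier occurrence ranks lower
r_min   <- rank(x, ties.method = "min")      # both get the smaller rank
r_max   <- rank(x, ties.method = "max")      # both get the larger rank

print(rbind(x, r_avg, r_first, r_min, r_max))
```

Only the positions holding the tied value 12 differ between the four results.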


rank() function

The rank() function returns a vector of the ranks of the
observations in an input vector. It also handles tied
observations.

> x
[1] 12 NA  8 12 15  2  5

Ignoring ties, the ranks of the observations would be 4 7 3 5 6 1 2.
The two observations equal to 12 are tied for ranks 4 and 5, so
each receives a rank of (4+5)/2 = 4.5.

> rank(x)
[1] 4.5 7.0 3.0 4.5 6.0 1.0 2.0

Note that the missing value NA is placed at the end.
apply() Functions
(Syntax and Examples)

L. V. Rao

February 4, 2021

Data Visualization - Dr. L. V. Rao,, February 4, 2021 1/1


tapply() function

Usage: to create tabular summaries of the subgroups in a data set.

This function takes three arguments:

X: a vector
INDEX: a factor or a list of factors
FUN: a function


The tapply() function applies a function on subsets of a vector
made by the levels of a factor variable.

Example
Suppose we have the heights of 1000 individuals (500 males and 500
females) in the form of a data frame (one column for heights and
another for gender), and we want to know the average heights of
males and females. We can then group the heights by gender and
calculate the average height for each level of gender.

Height  Gender
175     Male
155     Female
180     Male
169     Male
170     Female
...     ...
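The heights example can be sketched with simulated data. Everything numeric below (the seed, the means, and the standard deviations) is a made-up assumption used only to produce a data frame of the described shape.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Simulated data: the means and standard deviations are made-up values
heights <- data.frame(
  Height = c(rnorm(500, mean = 172, sd = 7),   # males
             rnorm(500, mean = 160, sd = 6)),  # females
  Gender = factor(rep(c("Male", "Female"), each = 500))
)

# Group Height by Gender and compute the mean of each group
avg_height <- with(heights, tapply(Height, Gender, mean))
print(round(avg_height, 1))
```

The result has one entry per level of Gender, named after the level.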


tapply() function

The basic form of the tapply() function is:

tapply(X, INDEX, FUN = NULL, ...)

where
X      is the variable to which we want the function applied;
       usually, it is a response variable.
INDEX  describes how we want the X variable to be split up.
FUN    is the function to be applied.

INDEX could be a single factor variable.
If you have more than one grouping variable, you can supply a list
of several factors as INDEX.


Illustration:

> with(iris,tapply(Sepal.Length,Species,mean))
setosa versicolor virginica
5.006 5.936 6.588
>

The above command tells R to take the Sepal.Length column
of the iris data, split it according to Species, and then calculate the
mean of each group.


In the above example, we split a vector into groups, apply a
function to each group, and then combine the result into a
vector.
This is an important idiom in R, and it usually goes by the
name Split, Apply and Combine (SAC).
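The three steps of the idiom can be written out separately, as in this sketch using base R's split() and sapply() on the built-in iris data. The variable names groups, means_sac, and means_tapply are introduced here for illustration.

```r
# Split: a list with one numeric vector per species
groups <- split(iris$Sepal.Length, iris$Species)

# Apply + combine: sapply() applies mean() to each piece and
# simplifies the results to a named numeric vector
means_sac <- sapply(groups, mean)

# The same computation in one step
means_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

print(means_sac)
```

tapply() performs the split, the apply, and the combine in a single call, which is why it is the usual choice for this pattern.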



The tapply() function can also be used to compute the group-wise
quantities (for example, group sizes and group means) that make up
a one-way ANOVA model.
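As a sketch of that idea, the sums of squares of a one-way ANOVA can be assembled from tapply() results. The choice of the iris data and all variable names below are assumptions for illustration; this is not necessarily the computation the author had in mind.

```r
y <- iris$Sepal.Length
g <- iris$Species

n_i    <- tapply(y, g, length)   # group sizes
mean_i <- tapply(y, g, mean)     # group means
grand  <- mean(y)                # grand mean

# Between-group and within-group sums of squares
ss_between <- sum(n_i * (mean_i - grand)^2)
ss_within  <- sum(tapply(y, g, function(v) sum((v - mean(v))^2)))

# F statistic for the one-way layout
k <- length(n_i)   # number of groups
N <- length(y)     # total number of observations
f_stat <- (ss_between / (k - 1)) / (ss_within / (N - k))

print(c(ss_between = ss_between, ss_within = ss_within, F = f_stat))
```

The same quantities appear in the table returned by anova(lm(y ~ g)), which can serve as a check.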


tapply()

If you have two grouping variables, specify them
using a list() command, as illustrated below:
tapply(X, INDEX = list(V1, V2), FUN = NULL, ...)
In the above case, the first variable V1 specified in the list()
becomes the rows of the output and the second variable V2
becomes the columns of the output produced by the tapply()
function.
When there are more than two grouping variables, the result is
a higher-dimensional array, which is printed as a series of tables.


tapply() function
Illustration: mtcars
The str() function shows that the variable am in the mtcars data
set is a numeric vector that indicates whether the car has
an automatic (0) or manual (1) gearbox. Because this is not
very descriptive, create a new object called cars, which is a
copy of mtcars, and convert the column am to a factor.
cars <- transform(mtcars,
    am = factor(am, levels = 0:1,
    labels = c("Automatic", "Manual")))
Now use tapply() to find the mean miles per gallon (mpg) for each
type of gearbox:
> with(cars,tapply(mpg,am,mean))
Automatic    Manual
 17.14737  24.39231
>
To get a two-dimensional table with the number of gears (gear)
in the rows and the type of gearbox (am) in the columns:
> with(cars,tapply(mpg,list(gear,am),mean))
  Automatic Manual
3  16.10667     NA
4  21.05000 26.275
5        NA 21.380
>


tapply() vs. table()

We use the tapply() function to create tabular summaries of
data. In this respect it is somewhat similar to the table() function.
The table() command can only create contingency
tables (that is, tables of counts), whereas with the tapply()
command we can specify any function as the aggregation
function. In other words, with tapply() we can calculate
counts, means, or any other summary value.
If we want summary statistics on a single vector, tapply()
is very useful and quick to use.
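The comparison can be made concrete with the built-in mtcars data. In this sketch, tapply() with length reproduces the counts of table(), and swapping in mean gives a summary that table() cannot produce. The variable names are introduced for illustration.

```r
cyl <- mtcars$cyl

counts_table  <- table(cyl)                       # counts only
counts_tapply <- tapply(mtcars$mpg, cyl, length)  # same counts via tapply()
mean_mpg      <- tapply(mtcars$mpg, cyl, mean)    # any other summary

print(counts_table)
print(round(mean_mpg, 2))
```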


Introduction
Applying the Same Function to All Rows or Columns of a Matrix
Applying the Same Function to All Elements of a List
Functions Are First-Class Objects

apply() Functions
(Syntax and Examples)

L. V. Rao

November 4, 2020

Data Visualization - Dr. L. V. Rao,, November 4, 2020 1 / 32



1 Introduction

2 Applying the Same Function to All Rows or Columns of a Matrix


apply() function

3 Applying the Same Function to All Elements of a List


lapply() function

4 Functions Are First-Class Objects


The apply() function and its friends

Explicit loops are often avoided in R when execution speed is of
prime interest, in particular when dealing with large data sets or
running lengthy simulations.

The apply() function and its variants express such repeated
computations compactly, although they are not guaranteed to run
faster than a well-written loop.

They follow the split-apply-combine paradigm.

Every member of the apply family of functions takes at least two
arguments:
an R object, and
a function that is to be applied to the members of
the R object.
Every member of the apply family of functions returns a result.

Function   R object     Function's view of       Result
                        the object's members
apply()    matrix       rows / columns           vector/matrix/array/list
           array        rows / columns           vector/matrix/array/list
                        or any dimension
           data frame   rows / columns           vector/matrix/array/list
sapply()   vector       elements                 vector/matrix/list
           data frame   variables                vector/matrix/list
           list         elements                 vector/matrix/list
lapply()   vector       elements                 list
           data frame   variables                list
           list         elements                 list

apply() function
When to use?
The apply() function is used, when it is required to perform
the same function either for all the rows or columns of a
matrix or a data frame.
Technically, apply() is for matrices, so it will attempt to
coerce a data frame into a matrix.
The apply() function is a general function, in that it works
with arrays, matrices, and data frames.
The apply() function works on anything that has dimensions.


What is the syntax?

It requires three arguments:
1 an R object (matrix or data frame),
2 a dimension code (1 for rows, 2 for columns), and
3 a function to be applied.

The general form of the apply() function is

apply(X, MARGIN, FUN, ...)

where
X       is a matrix, an array, or a data frame,
MARGIN  is 1 or 2, according to whether we will operate on
        rows or columns,
FUN     is the function to be applied, and
...     are optional further arguments to be passed on to FUN.
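The optional arguments in the general form are forwarded unchanged to FUN. The sketch below shows this with na.rm and probs; the matrix m is an assumed example.

```r
# Illustrative matrix with one missing value
m <- matrix(c(1, 4, NA,
              2, 5, 8,
              3, 6, 9), nrow = 3, byrow = TRUE)

# na.rm = TRUE is not an argument of apply(); it is forwarded to mean()
row_means <- apply(m, 1, mean, na.rm = TRUE)

# probs is forwarded to quantile(); each column yields two values,
# so the result has two rows and one column per column of m
col_q <- apply(m, 2, quantile, probs = c(0.25, 0.75), na.rm = TRUE)

print(row_means)
print(col_q)
```

Passing the extra arguments by name keeps them from being confused with the arguments of apply() itself.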

Example

Let us apply the built-in R function sum to the rows of a matrix.

>
> (x <- matrix( 1:9, ncol=3 ) )
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
>
> apply( x, 1, sum )
[1] 12 15 18
>


> df <- data.frame( Id = c(1001, 1002, 1003, 1004),
                    Math = c(13, 15, 14, 15),
                    Stat = c(13, 12, 10, 15) )
>
> df
Id Math Stat
1 1001 13 13
2 1002 15 12
3 1003 14 10
4 1004 15 15
>
> apply(df[-1],2,sd)
Math Stat
0.9574271 2.0816660


> avg <- apply(df[c(-1,-4)],1,mean)
> avg
[1] 13.0 13.5 12.0 15.0
> df <- cbind(df,avg)
> df
Id Math Stat avg
1 1001 13 13 13.0
2 1002 15 12 13.5
3 1003 14 10 12.0
4 1004 15 15 15.0
>
> apply(df[c(-1,-4)],2,min)
Math Stat
13 10
>

Here is an example of working on rows, using our own function:

> z <- matrix(1:6, ncol = 2)
> f <- function(x) x/c(2,8)
> y <- apply(z,1,f)
> y
     [,1]  [,2] [,3]
[1,]  0.5 1.000 1.50
[2,]  0.5 0.625 0.75

You might be surprised that the size of the result here is 2 × 3
rather than 3 × 2. If the function to be applied returns a vector of
k components, the result of apply() will have k rows. You can use
the matrix transpose function t() to change it.

> # computing stochastic matrix from x


> ( x <- matrix( 1:9, ncol = 3 ) )
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
>
> f <- function(y) y/sum(y)
>
> t( apply( x, 1, f ) )
[,1] [,2] [,3]
[1,] 0.08333333 0.3333333 0.5833333
[2,] 0.13333333 0.3333333 0.5333333
[3,] 0.16666667 0.3333333 0.5000000


Randomly generating stochastic matrices

> x <- matrix( sample( 0:20, 9, replace = T), ncol = 3 )


>
> f <- function(y) y/sum(y)
>
> round( t( apply(x, 1, f ) ), digits = 3 )
[,1] [,2] [,3]
[1,] 0.125 0.125 0.750
[2,] 0.696 0.000 0.304
[3,] 0.486 0.000 0.514
>


The lapply() function

The analogue of the apply() function for lists is lapply(). It applies the
given function to all elements of the specified list.

Example
> list(1:3,25:30)
[[1]]
[1] 1 2 3

[[2]]
[1] 25 26 27 28 29 30

> lapply(list(1:3,25:30),median)
[[1]]
[1] 2

[[2]]
[1] 27.5

>

In this example the list was created only as a temporary measure,
so we should convert the result back to a numeric vector:

> as.numeric(lapply(list(1:3,25:27),median))
[1]  2 26

Functions can be used as arguments, assigned to other objects,
etc. For instance,
> f1 <- function(a,b) return(a+b)
> f2 <- function(a,b) return(a-b)
> f <- f1
> f(3,2)
[1] 5
> f <- f2
> f(3,2)
[1] 1
> g <- function(h,a,b) h(a,b)
> g(f1,3,2)
[1] 5
> g(f2,3,2)
[1] 1

> x <- sample(5:20, 9, replace = TRUE)


> x
[1] 14 5 12 19 5 6 10 13 19
> matrix(x, ncol=3)
[,1] [,2] [,3]
[1,] 14 19 10
[2,] 5 5 13
[3,] 12 6 19
> birds <- matrix(x, ncol = 3)
> colnames(birds) <- c("Sparrow","Pigeon","Dove")
> birds
Sparrow Pigeon Dove
[1,] 14 19 10
[2,] 5 5 13
[3,] 12 6 19

> birds
Sparrow Pigeon Dove
[1,] 14 19 10
[2,] 5 5 13
[3,] 12 6 19
>
> colSums(birds)
Sparrow Pigeon Dove
31 30 42
>
> rowSums(birds)
[1] 43 23 37


> birds
Sparrow Pigeon Dove
[1,] 14 19 10
[2,] 5 5 13
[3,] 12 6 19
>
>
> apply(birds, 2, sum)
Sparrow Pigeon Dove
31 30 42
>
>
> apply(birds, 1, sum)
[1] 43 23 37
>
>

The apply() function splits up the matrix (or data frame) into
rows (or columns).
If we select a single row or column, R will, by default, simplify
that to a vector.
The apply() function then uses these vectors one by one as an
argument to the function we specified. So, the function
specified should be able to deal with vectors.
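The mechanism just described can be written as an explicit loop for comparison. This sketch rebuilds the birds matrix from the values printed earlier; the loop is only an illustration of the idea, not how apply() is implemented internally.

```r
birds <- matrix(c(14, 5, 12, 19, 5, 6, 10, 13, 19), ncol = 3)
colnames(birds) <- c("Sparrow", "Pigeon", "Dove")

# What apply(birds, 2, max) does, written as an explicit loop
col_max_loop <- numeric(ncol(birds))
names(col_max_loop) <- colnames(birds)
for (j in seq_len(ncol(birds))) {
  column <- birds[, j]            # a plain vector, not a matrix
  col_max_loop[j] <- max(column)  # the function sees one vector at a time
}

col_max_apply <- apply(birds, 2, max)
print(col_max_apply)
```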


Adding extra arguments


Suppose there are some missing values.
> birds[1,2] <- NA
> birds
Sparrow Pigeon Dove
[1,] 14 NA 10
[2,] 5 5 13
[3,] 12 6 19
> apply(birds,2,max)
Sparrow Pigeon Dove
14 NA 19
> apply(birds,2,max, na.rm = TRUE)
Sparrow Pigeon Dove
14 6 19
>

apply functions for list-like objects

We have two related functions from the apply family at our
disposal:
lapply() and
sapply().
The l in lapply stands for list, and the s in sapply stands for
simplify.
The two functions work basically the same - the only
difference is that
lapply() always returns a list with the result, whereas,
sapply() tries to simplify the final object if possible.
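The difference in return types can be checked directly, as in the sketch below. It also includes vapply(), a base R relative of sapply() that checks the type of each result; the list scores is an assumed example.

```r
scores <- list(a = c(1, 2, 3), b = c(25, 26, 27, 28, 29, 30))

res_l <- lapply(scores, median)               # always a list
res_s <- sapply(scores, median)               # simplified to a named vector
res_v <- vapply(scores, median, numeric(1))   # like sapply(), but each
                                              # result must be one number

# When the function returns vectors of equal length, sapply() builds a matrix
res_m <- sapply(scores, range)

print(res_s)
print(res_m)
```

In scripts, vapply() is often the safer choice because an unexpected result type raises an error instead of silently changing the shape of the output.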


lapply() function

lapply() is helpful in avoiding loops when working with lists;
sapply() and vapply() do the same but simplify the result to a
vector or matrix where possible, and mapply() applies a function
to the corresponding elements of several vectors or lists;
tapply() performs an action on subsets of a vector defined by a factor.
The foreach and plyr packages provide similar looping constructs
that can also run in parallel (see also the parallel package).
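A short sketch of mapply() and vapply() from base R follows. The inputs are assumed examples chosen so that the results are easy to verify by hand.

```r
# mapply() walks over several vectors in parallel:
# the i-th call receives the i-th element of each argument
powers <- mapply(function(base, exponent) base^exponent,
                 base = 2:4, exponent = c(3, 2, 1))

# vapply() is sapply() with a declared result type (here: one integer)
lens <- vapply(list(1:3, letters, NULL), length, integer(1))

print(powers)   # 2^3, 3^2, 4^1
print(lens)
```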

tapply() function

In a call such as tapply(y, x, mean), the tapply() function applies the
function given as the third argument (here mean()) to the vector in the
first argument (y), stratified by every unique combination of values of
the factor or list of factors in the second argument (x). It returns a
vector or array with one result of the function per group.

sapply() function

The sapply() function is a wrapper around lapply() (list apply), which
applies a function to each element of a collection and returns a list. The main
difference is that, by default, sapply() tries to simplify the output into
a vector or an array.

sapply() function
The command sapply(df, class) will return the names and classes
(e.g., numeric, integer, or character) of each variable within a
dataframe.

> df <- data.frame(Math=13:15,Phy=11:13,Chem=12:14)


> df
Math Phy Chem
1 13 11 12
2 14 12 13
3 15 13 14
> sapply(df,class)
Math Phy Chem
"integer" "integer" "integer"
>


tapply() function
> with(ToothGrowth, tapply(len, supp, mean) )
OJ VC
20.66333 16.96333
>
For tapply, as with split, the grouping variable is a factor or list of
factors. In the latter case, all combinations are computed before
splitting:
> with(ToothGrowth,
tapply(len, list(supp, dose), mean) )
0.5 1 2
OJ 13.23 22.70 26.06
VC 7.98 16.77 26.14
>

apply() function

> x <-c(1:3,NA,4,5)
> sum(x)
[1] NA
> sum(x,na.rm=T)
[1] 15
> apply(matrix(x,nrow=1),1,sum, na.rm=T)
[1] 15
> apply(matrix(x,nrow=1),1,sum)
[1] NA


apply() function

> with(mtcars, table(cyl,gear))


gear
cyl 3 4 5
4 1 8 2
6 2 4 1
8 12 0 2
> apply(with(mtcars, table(cyl,gear)),1,sum)
4 6 8
11 7 14
>



stem-and-leaf

Stem-and-leaf plots are text-based graphics that are particularly
useful to describe the distribution of small datasets.

stem(x, scale = 1)

The scale option can be used to increase or decrease the number
of stems (default value is 1).

> x <- sample(1:29, 25, replace = T)
> stem(x)

  The decimal point is 1 digit(s) to the right of the |

  0 | 112577
  1 | 2234456677799
  2 | 056789

> stem(x, scale = 2)

  The decimal point is 1 digit(s) to the right of the |

  0 | 112
  0 | 577
  1 | 22344
  1 | 56677799
  2 | 0
  2 | 56789

>

The aplpack library has more options for stem-and-leaf plots.

> install.packages("aplpack")
> library(aplpack)
>
> stem.leaf(x)
1 | 2: represents 12
 leaf unit: 1
            n: 25
    3    0* | 112
    6    0. | 577
   11    1* | 22344
   (8)   1. | 56677799
    6    2* | 0
    5    2. | 56789
>
Two sets of data values can be compared using a stem-and-leaf
plot.
The basic stem-and-leaf chart can be modified to display two data sets
in a back-to-back manner. The built-in stem() function does not do this;
stem.leaf.backback() from the aplpack package does.
The chart contains much more information than the basic chart produced by
the stem() function. The leftmost and rightmost columns record the depth
(cumulative count) of the data in each data set, and the stems are recorded
in the middle.

> y <- sample(1:29, 25, replace = T)
>
> stem.leaf.backback(x, y, rule.line = "Sturges")
------------------------------------
  1 | 2: represents 12, leaf unit: 1
         x              y
------------------------------------
   3        211| 0* |112334    6
   6        775| 0. |6         7
  11      44322| 1* |2         8
  (8)  99777665| 1. |5577899  (7)
   6          0| 2* |00024    10
   5      98765| 2. |56689     5
               | 3* |
------------------------------------
n:           25      25
------------------------------------
barplot() Function in R
Barplot

The purpose of the barplot is to display the frequencies (or proportions) of the levels of a factor
variable. For example, a barplot can be used to display the frequencies (or proportions)
of individuals in various socio-economic groups (a factor with levels high, middle, and low). Such a plot
provides a visual comparison among the factor levels.
In a barplot, the factor levels are placed on the x-axis and their frequencies (or proportions)
on the y-axis. For each factor level, one bar of uniform width is drawn, with
height proportional to the frequency (or proportion) of that level.
The barplot() function is in the graphics package of R's system library. The barplot()
function must be supplied with at least one argument. The R help calls this height, and it must
be either a vector or a matrix. If it is a vector, its values give the heights of the bars, one bar per factor level.
To illustrate barplot(), consider the following data preparation:

> grades <- c( "A+", "A-", "B+", "B", "C" )
> Marks <- sample( grades, 40, replace = T, prob = c(.2,.3,.25,.15,.1 ) )
> Marks
[1] "A+" "A-" "B+" "A-" "A+" "B" "A+" "B+" "A-" "B" "A+" "A-"
[13] "A-" "B+" "A-" "A-" "A-" "A-" "A+" "A-" "A+" "A+" "C" "C"
[25] "B" "C" "B+" "C" "B+" "B+" "B+" "A+" "B+" "A-" "A+" "A-"
[37] "A-" "B" "C" "A+"
>

A bar plot of the Marks vector is obtained from

> barplot( table( Marks ), main = "Mid-Marks in Algorithms")

Notice that the barplot() function places the factor levels on the x-axis in the lexicographical
order of the levels, because table() sorts them that way. The names.arg parameter only changes the
labels under the bars, not the order of the bars, so supplying names.arg = grades would attach the
wrong labels to the counts. To place the bars in the order stated in the vector grades, tabulate a
factor whose levels are given in that order:

# bars in the order given by grades
> barplot( table( factor( Marks, levels = grades ) ), main = "Mid-Marks in Algorithms")

Coloured bars can be drawn using the col= parameter.

> barplot( table( factor( Marks, levels = grades ) ), col = c("lightblue",
    "lightcyan", "lavender", "mistyrose", "cornsilk"),
    main = "Mid-Marks in Algorithms")

A bar plot with horizontal bars can be obtained as follows:

> barplot( table( factor( Marks, levels = grades ) ), horiz = TRUE, col = c("lightblue",
    "lightcyan", "lavender", "mistyrose", "cornsilk"),
    main = "Mid-Marks in Algorithms")
A bar plot with proportions on the y-axis can be obtained as follows:

> barplot( prop.table( table( factor( Marks, levels = grades ) ) ), col = c("lightblue",
    "lightcyan", "lavender", "mistyrose", "cornsilk"),
    main = "Mid-Marks in Algorithms")

The size of the factor-level names on the x-axis can be increased using the cex.names
parameter.

> barplot( prop.table( table( factor( Marks, levels = grades ) ) ), col = c("lightblue",
    "lightcyan", "lavender", "mistyrose", "cornsilk"),
    main = "Mid-Marks in Algorithms", cex.names = 2)

The height parameter of barplot() can also be a matrix. For example, it could be a matrix
whose columns are the various subjects taken in a course and whose rows are the
grades. Consider the following matrix:

> gradTab
   Algorithms Operating Systems Discrete Math
A-         13                10             7
A+         10                 7             2
B           4                 2            14
B+          8                19            12
C           5                 2             5
To draw a stacked bar, simply use the command:

> barplot( gradTab, col = c( "lightblue", "lightcyan",
    "lavender", "mistyrose", "cornsilk" ), legend.text = TRUE,
    main = "Mid-Marks in Algorithms")

To draw juxtaposed bars, use the beside parameter, as given below:

> barplot( gradTab, beside = TRUE,
           col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"),
           legend.text = grades, main = "Mid-Marks in Algorithms")

A horizontal bar plot can be obtained using the horiz = TRUE parameter:

> barplot( gradTab, beside = TRUE, horiz = TRUE,
           col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"),
           legend.text = grades, cex.names = 0.75,
           main = "Mid-Marks in Algorithms")
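As a hedged sketch of the matrix form, the block below rebuilds gradTab from the values printed above and checks what barplot() returns for stacked and juxtaposed bars. The object names mids_stacked, mids_beside and gradProp are illustrative only, and the legend is taken from rownames(gradTab) so that it always matches the row order.

```r
# Rebuild the gradTab matrix from the values printed above
gradTab <- matrix(c(13, 10,  4,  8, 5,
                    10,  7,  2, 19, 2,
                     7,  2, 14, 12, 5),
                  nrow = 5,
                  dimnames = list(c("A-", "A+", "B", "B+", "C"),
                                  c("Algorithms", "Operating Systems",
                                    "Discrete Math")))

# plot = FALSE returns the bar midpoints without drawing anything
mids_stacked <- barplot(gradTab, plot = FALSE)                 # one bar per column
mids_beside  <- barplot(gradTab, beside = TRUE, plot = FALSE)  # one bar per cell

# Column proportions, if each subject should be shown on a common 0-1 scale
gradProp <- prop.table(gradTab, margin = 2)

if (interactive()) {
  barplot(gradProp, beside = TRUE, legend.text = rownames(gradTab),
          main = "Grade proportions by subject")
}
```

The returned midpoints are useful when text or error bars have to be placed on top of the bars later.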

Boxplot
Data Visualization in R

Dr. L. V. Rao

September 9, 2019



Boxplots are used to display the sample characteristics of a single sample or
the differences between multiple samples.
One of the advantages of boxplots is that their widths are not
usually meaningful. This allows you to compare the
distribution of many groups in a single graph.
Side-by-side box plots are very useful for comparing groups
(i.e., the levels of a categorical variable) on a numerical
variable.
Notched boxplots provide an approximate method for
visualizing whether groups differ. Although not a formal test,
if the notches of two boxplots do not overlap, there is strong
evidence (95% confidence) that the medians of the two
groups differ.



Box-and-Whiskers Plot

The boxplot is a graphical device based on the five-number summary.
The five-number summary of a univariate data set consists of
the minimum value,
the maximum value,
the first quartile Q1,
the third quartile Q3, and
the median.
The five numbers provide a good summary of even very large data
sets.



The basic box plot has the following features:
A box is drawn from the first quartile to the third. This
represents the middle 50% of the data.
The median is drawn as a line indicating the center of the data; it
splits the box into two areas, each representing 25% of the data.
The top and bottom 25% of the data is represented by
whiskers which stretch to the minimum and maximum values.



Some definitions

Lower Hinge: The 25th percentile of the batch.
Upper Hinge: The 75th percentile of the batch.
H-Spread: The difference between the hinges.
Step: 1.5 times the H-spread is called a step.
Inner fences: Inner fences are 1 step beyond the hinges.
Outer fences: Outer fences are 2 steps beyond the hinges.
Adjacent values:
the largest value below the upper inner fence and
the smallest value above the lower inner fence.
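The definitions above can be traced on a small batch. The data vector is made up for illustration, and fivenum() is used here because it returns the hinges directly.

```r
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 13, 30)   # made-up batch

five <- fivenum(x)             # min, lower hinge, median, upper hinge, max
lower_hinge <- five[2]
upper_hinge <- five[4]
h_spread <- upper_hinge - lower_hinge
step     <- 1.5 * h_spread

inner_fences <- c(lower_hinge - step,     upper_hinge + step)
outer_fences <- c(lower_hinge - 2 * step, upper_hinge + 2 * step)

# Adjacent values: the most extreme observations still inside the inner fences
inside   <- x[x >= inner_fences[1] & x <= inner_fences[2]]
adjacent <- range(inside)

# Observations beyond the inner fences are the ones drawn as separate points
outside <- x[x < inner_fences[1] | x > inner_fences[2]]
```

For this batch the hinges are 5 and 12, so the step is 10.5 and the value 30 falls beyond the upper inner fence of 22.5.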



A box plot provides an excellent visual summary of many
important aspects of a distribution. The box stretches from
the lower hinge to the upper hinge and therefore contains the
middle half of the scores in the distribution.
The median is shown as a line across the box. Therefore 1/4
of the distribution is between this line and the top of the box
and 1/4 of the distribution is between this line and the
bottom of the box.



The notches pinch the box to give it a waist at Q̂2 and slowly
draw up to the original box’s edge below and above the
median.
In R, the notch extends approximately $\pm 1.58\,\widehat{\mathrm{IQR}}/\sqrt{n}$ from Q̂2;
it gives a rough location for where the true median may lie,
so that any other boxplot whose notch overlaps it likely has a
similar median.
If the notch distance extends past the corresponding hinge,
the notch is drawn back on itself, producing sharply pointed
edges on the box.
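The notch limits can be reproduced by hand and compared with the values R stores. This is a sketch on simulated data; the seed and sample size are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(40)
n <- length(x)

hinges <- fivenum(x)[c(2, 4)]    # boxplot() uses the hinges as Q1 and Q3
iqr    <- diff(hinges)

# Median plus and minus 1.58 * IQR / sqrt(n)
notch <- median(x) + c(-1.58, 1.58) * iqr / sqrt(n)

# boxplot.stats() reports the same interval in its 'conf' component
notch_r <- boxplot.stats(x)$conf
```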



A conventional modification to the boxplot is to mark outliers separately.
These are defined as values more than 1.5 × IQR below Q1 or
above Q3 . The whiskers are drawn to cover all points
that are not outliers, leaving the outliers to be marked by individual points.
With this, it is easy to identify:
center The median clearly marks the center.
spread The IQR is the length of the box and a measure of
spread.
shape The regions of the boxplot pair off. If one area is much
longer than its corresponding pair, then the data is skewed.



boxplot(x, ...)

## S3 method for class 'formula'
boxplot(formula, data = NULL, ..., subset,
na.action = NULL, drop = FALSE, sep = ".",
lex.order = FALSE)

## Default S3 method:
boxplot(x, ..., range = 1.5, width = NULL,
varwidth = FALSE, notch = FALSE,
outline = TRUE, names, plot = TRUE,
border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5,
outwex = 0.5), horizontal = FALSE,
add = FALSE, at = NULL)



boxplot() Function

The boxplot() function can be given several vectors as arguments
to display, or can use a formula interface y ~ x (which will generate a
boxplot of y for each level of the variable x).

A number of useful options are available, including
varwidth to draw the boxplots with widths proportional to
the square root of the number of observations in that group,
horizontal to reverse the default orientation,
notch to display notched boxplots, and
names to specify a vector of labels for the groups.



You can let the whiskers always extend to the minimum and the
maximum by setting the range argument of the boxplot()
function to 0.
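A short sketch of the effect of range, using a made-up vector with one extreme value. The argument plot = FALSE is used so that only the computed statistics are returned.

```r
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 13, 30)   # made-up data with one extreme value

b_default <- boxplot(x, plot = FALSE)             # range = 1.5
b_full    <- boxplot(x, range = 0, plot = FALSE)  # whiskers reach the extremes

b_default$out            # points that would be drawn individually
b_full$out               # empty: nothing is treated as an outlier
b_full$stats[c(1, 5)]    # whisker ends equal min(x) and max(x)
```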



> str(iris)
’data.frame’: 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 .
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0
$ Species : Factor w/ 3 levels "setosa","versicolor",.
>



How does the variable Sepal.Length vary across the Species?
> boxplot(iris$Sepal.Length ~ iris$Species,
xlab = "Species", ylab="Sepal Length",
main = "Differences in Sepal Length by Species")
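The numbers behind such a plot can be inspected without drawing it. The sketch below uses the data argument of the formula interface instead of the iris$ prefix, which is a stylistic choice and not a requirement.

```r
# plot = FALSE returns the statistics that boxplot() would draw
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

b$names        # one box per level of Species
b$n            # number of observations in each group
b$stats[3, ]   # group medians (third row of the five-number summaries)

# The same medians computed directly
med <- tapply(iris$Sepal.Length, iris$Species, median)
```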



> boxplot(iris$Sepal.Length ~ iris$Species,
xlab = "Species", ylab = "Sepal Length",
main = "Differences in Sepal Length by Species",
col = 1:nlevels(iris$Species) + 1 )



> boxplot(iris$Sepal.Length ~ iris$Species,
xlab = "Species", ylab = "Sepal Length",
main = "Differences in Sepal Length by Species",
col = 1:nlevels(iris$Species) + 1, notch = TRUE)



> # Species and Sepal.Length are referenced by name here, which assumes attach(iris)
> unique(Species)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
>
> as.character(unique(Species))
[1] "setosa" "versicolor" "virginica"
> boxplot(list(Sepal.Length[1:50],
Sepal.Length[51:100],
Sepal.Length[101:150]),
names = as.character(unique(Species)))
>



> str(mtcars)
’data.frame’: 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
>



Box plot using plot() function

> with(mtcars,
plot(mpg ~ factor(cyl), col = c(2,3,4),
xlab = "No. of Cylinders",
main = "Mileage against No. of Cylinders",
notch = TRUE ))
Warning message:
In bxp(list(stats = c(21.4, 22.8, 26, 30.4, 33.9, 17.8,
18.65, 19.7, :
some notches went outside hinges (’box’):
maybe set notch=FALSE
>

[Figure: notched box plots of mpg for 4, 6 and 8 cylinders, titled "Mileage against No. of Cylinders"; x-axis: No. of Cylinders; y-axis: mpg]


Q-Q Plot
Data Visualization in R

Dr. L. V. Rao

December 7, 2020



1 The quantile function is the inverse of the cumulative
distribution function. The p-quantile is the value with the
property that there is probability p of getting a value less than
or equal to it. The median is by definition the 50% quantile.
2 Tables of statistical distributions are almost always given in
terms of quantiles. For a fixed set of probabilities, the table
shows the boundary that a test statistic must cross in order to
be considered significant at that level. This is purely for
operational reasons; it is almost superfluous when you have
the option of computing p exactly.
3 Theoretical quantiles are commonly used for the calculation of
confidence intervals and for power calculations in connection
with designing and dimensioning experiments. A simple
example of a confidence interval can be given here.
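The relation between quantiles and the cumulative distribution function can be checked numerically. This is a minimal sketch for the standard normal distribution; the observed statistic z is a hypothetical value.

```r
p <- c(0.025, 0.5, 0.975)

q <- qnorm(p)          # quantiles of the standard normal distribution
pnorm(q)               # applying the c.d.f. recovers the probabilities p

qnorm(0.5)             # the median of the standard normal is 0

# Critical value for a two-sided test at the 5% level, as a table would list it
crit <- qnorm(1 - 0.05 / 2)

# The exact two-sided p-value for an observed z statistic (hypothetical value)
z <- 2.3
p_value <- 2 * pnorm(-abs(z))
```

Comparing abs(z) with crit and comparing p_value with 0.05 lead to the same decision, which is why the tabulated boundary becomes nearly superfluous once p can be computed.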



If we have n normally distributed observations with the same
mean µ and standard deviation σ, then it is known that the
average x̄ is normally distributed around µ with standard
deviation $\sigma/\sqrt{n}$. A 95% confidence interval for µ can be
obtained as

$$\bar{x} + \frac{\sigma}{\sqrt{n}}\, N_{0.025} \le \mu \le \bar{x} + \frac{\sigma}{\sqrt{n}}\, N_{0.975}$$

where $N_{0.025}$ is the 2.5% quantile of the standard normal distribution.



If σ = 12 and we have measured n = 5 persons and found an
average of x̄ = 83, then we can compute the relevant quantities as
("sem" means standard error of the mean)

> xbar <- 83
> sigma <- 12
> n <- 5
> sem <- sigma/sqrt(n)
> sem
[1] 5.366563
> xbar + sem * qnorm(0.025)
[1] 72.48173
> xbar + sem * qnorm(0.975)
[1] 93.51827

and thus find a 95% confidence interval for µ going from 72.48 to
93.52.
Here we have assumed that σ is known.
Since the normal distribution is symmetric, so that

$$N_{0.025} = -N_{0.975},$$

it is common to write the formula for the confidence interval as

$$\bar{x} \pm \frac{\sigma}{\sqrt{n}}\, N_{0.975}.$$

The quantile itself is often written $\Phi^{-1}(0.975)$, where Φ is
standard notation for the cumulative distribution function of the
normal distribution (pnorm).
Another application of quantiles is in connection with Q − Q plots,
which can be used to assess whether a set of data can reasonably
be assumed to come from a given distribution.



Empirical quantiles may be obtained with the function quantile like
this:

> x <- rnorm(100)
> round(quantile(x), digits=4)
0% 25% 50% 75% 100%
-2.5302 -0.6585 -0.0725 0.7717 2.1725

By default you get the minimum, the maximum, and the three
quartiles - the 0.25, 0.50, and 0.75 quantiles - so named because
they correspond to a division into four parts. Similarly, we have
deciles for 0.1, 0.2, · · · , 0.9, and centiles or percentiles. The
difference between the first and third quartiles is called the
interquartile range (IQR) and is sometimes used as a robust
alternative to the standard deviation.



It is also possible to obtain other quantiles; this is done by adding
an argument containing the desired percentage points. This, for
example, is how to get the deciles:

> prob <- seq(0,1,0.1)
> round(quantile(x,prob), digits=4)
0% 10% 20% 30% 40% 50%
-2.5302 -1.3431 -0.9364 -0.5564 -0.2960 -0.0725
60% 70% 80% 90% 100%
0.2289 0.4281 0.8981 1.2118 2.1725



Be aware that there are several possible definitions of empirical
quantiles. The one R uses by default is based on a sum polygon
where the ith ranking observation is the (i − 1)/(n − 1) quantile
and intermediate quantiles are obtained by linear interpolation. It
sometimes confuses students that in a sample of 10 there will be 3
observations below the first quartile with this definition. Other
definitions are available via the type argument to quantile.
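The default definition can be reproduced by hand. The sketch below uses a made-up sample of 10 values and interpolates linearly between the sorted observations; the helper names h, lo and q_manual are illustrative only.

```r
x <- c(12, 15, 11, 19, 14, 17, 13, 20, 16, 18)   # made-up sample of 10
n <- length(x)
xs <- sort(x)
p <- 0.25

# The i-th sorted value is the (i - 1)/(n - 1) quantile, so the p-quantile
# sits at the (possibly fractional) position h among the sorted values
h  <- (n - 1) * p + 1
lo <- floor(h)
q_manual <- xs[lo] + (h - lo) * (xs[min(lo + 1, n)] - xs[lo])

q_r <- quantile(x, p)          # default definition (type = 7)

sum(x < q_manual)              # observations strictly below the first quartile
quantile(x, p, type = 6)       # another definition gives a different value
```

With these ten values the first quartile is 13.25, which leaves three observations below it.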



ECDF

The empirical cumulative distribution function is defined as the
fraction of data smaller than or equal to x. That is, if x is the
k-th smallest observation, then the proportion k/n of the data is
smaller than or equal to x (7/10 if x is no. 7 of 10). The empirical
cumulative distribution function can be plotted as follows, where x
is the simulated data vector.

> x <- rnorm(25)
> n <- length(x)
> plot(sort(x),(1:n)/n, type="s",ylim=c(0,1))

[Figure: step plot of the empirical c.d.f.; x-axis: sort(x); y-axis: (1:n)/n]


ECDF

Some more elaborate displays of empirical cumulative distribution
functions are available via the ecdf function. This is also more
precise regarding the mathematical definition of the step function.
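A minimal sketch of ecdf() on simulated data follows; the seed is arbitrary and Fn is just an illustrative name for the returned step function.

```r
set.seed(2)              # arbitrary seed
x <- rnorm(25)
n <- length(x)

Fn <- ecdf(x)            # Fn is itself a function

Fn(0)                    # fraction of the data less than or equal to 0
mean(x <= 0)             # the same fraction computed directly

# At the k-th smallest observation the ECDF equals k/n
Fn(sort(x))

if (interactive()) {
  plot(Fn, main = "Empirical c.d.f.")   # the plot method marks the jump points
}
```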



Q-Q Plot

One purpose of calculating the empirical cumulative distribution
function (c.d.f.) is to see whether data can be assumed normally
distributed. For a better assessment, you might plot the k-th
smallest observation against the expected value of the k-th
smallest observation out of n in a standard normal distribution.
The point is that in this way you would expect to obtain a straight
line if data come from a normal distribution with any mean and
standard deviation.



Q-Q Plots - qqnorm

Creating such a plot is slightly complicated. Fortunately, there is a
built-in function for doing it, qqnorm. You only have to write

> qqnorm(x)

As the title of the plot indicates, plots of this kind are also called
Q-Q plots (quantile versus quantile). Notice that x and y are
interchanged relative to the empirical c.d.f. - the observed values
are now drawn along the y-axis. You should notice that with this
convention the distribution has heavy tails if the outer parts of the
curve are steeper than the middle part.



Some readers will have been taught probability plots, which are
similar but have the axes interchanged. It can be argued that the
way R draws the plot is the better one since the theoretical
quantiles are known in advance, while the empirical quantiles
depend on data. You would normally choose to draw fixed values
horizontally and variable values vertically.

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]


Q-Q Plot

It is often useful to compare a data set to the normal distribution.
The qqnorm(...) command plots the sample quantiles against
the quantiles from a normal distribution.
A qqline(...) command after qqnorm(...) will draw a straight
line through the coordinates corresponding to the first and
third quartiles. We would expect a sample from a normal distribution to
yield points on the qq-plot that are close to this line.
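What qqnorm() and qqline() compute can be sketched without drawing. The data are simulated with an arbitrary seed, and slope and intercept are illustrative names for the line through the quartile pairs.

```r
set.seed(3)                        # arbitrary seed
x <- rnorm(50, mean = 10, sd = 2)  # simulated data
n <- length(x)

# qqnorm() pairs each observation with a standard normal quantile
qq   <- qqnorm(x, plot.it = FALSE)
theo <- qnorm(ppoints(n))          # plotting positions used for the x-axis

# qqline() passes through the first and third quartiles of both axes
slope     <- unname(diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(x, 0.25)) - slope * qnorm(0.25)

if (interactive()) {
  qqnorm(x)
  qqline(x, col = 2)
  abline(intercept, slope, lty = 2)   # should coincide with qqline()
}
```

For normal data the slope estimates the standard deviation and the intercept estimates the mean, which is why any mean and standard deviation give a straight line.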



QQ-plots are a better way to assess how closely a sample
follows a certain distribution.



Regression Analysis

L. V. Rao

October 2, 2019

Linear Regression Analysis

Objective:
Regression Analysis uses correlation as a basis to predict the
value of one variable from the value of a second variable or a
combination of several variables.
Terminology:
The variable whose value is to be predicted is called the
response variable (or dependent variable, criterion variable,
or outcome variable) and is usually denoted by Y.
The variable that is used to predict the value of the response variable
is called the predictor variable (or independent variable) and
is denoted by x.
Linear Regression analysis provides information about the
strength of the relationship between the response
variable and the predictor variable.

Assumptions

Linear: The relationship between response variable and
predictor variables is linear in nature.
Normality: The response variable is a continuous variable
and is normally distributed.
Homoscedasticity: The degree of random noise in the
response variable is constant over the values of the predictor
variable.

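The equation for the simple linear regression model is given by

$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$, for $i = 1, 2, \cdots, n$.

The estimates of the parameters are given by

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r_{xy}\,\frac{s_y}{s_x}$$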
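The estimators can be evaluated directly and compared with lm(). This sketch uses the built-in mtcars data with wt as predictor and mpg as response; the names beta_hat, alpha_hat and beta_alt are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Equivalent form using the correlation and the standard deviations
beta_alt <- cor(x, y) * sd(y) / sd(x)

# Compare with the coefficients estimated by lm()
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```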

Assessing the Prediction

Coefficient of Determination, R²
It measures how much of the variation in the response variable is
explained by the predictor variable.
Sum of Squares:

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here SSR denotes the residual sum of squares.

$$R^2 = 1 - \frac{SSR}{SST}$$

0 ≤ R² ≤ 1; the closer R² is to 1, the better is the
prediction.

If SSR is the same as SST, then the prediction
using the regression equation is no different from prediction
using the mean of the response variable. That is, R² = 0.
If SSR is smaller than SST, then R² will be greater than zero.
That is, prediction due to regression is better than prediction
from the mean of the response variable. The closer R² is to
1, the better is the prediction due to regression.
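The two cases can be illustrated numerically. The sketch below again assumes the built-in mtcars data; fit0 is an intercept-only model introduced here only to show the case SSR = SST.

```r
fit <- lm(mpg ~ wt, data = mtcars)
y   <- mtcars$mpg

SST <- sum((y - mean(y))^2)        # total sum of squares
SSR <- sum((y - fitted(fit))^2)    # residual sum of squares

R2 <- 1 - SSR / SST

# A model with only an intercept predicts mean(y) everywhere, so SSR = SST
fit0 <- lm(mpg ~ 1, data = mtcars)
R2_null <- 1 - sum(residuals(fit0)^2) / SST
```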

> str( mtcars )
’data.frame’: 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
>

It is reasonable to assume that the mileage per gallon decreases as
the weight of the car increases. This can be observed in
a scatter plot of mileage against weight.
> plot( mtcars$wt, mtcars$mpg, pch = 20, col = "blue",
main = "mtcars data")
> abline( lm( mpg ~ wt, data = mtcars ), col = 2 )

lm() function

Simple linear regression can be carried out using the lm() function of R.

> slr.lm <- lm(mpg ~ wt, data = mtcars)
> summary(slr.lm)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes:
0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

The five goodness-of-fit characterizations included in this
summary are the
residual standard error,
the multiple R-squared,
the adjusted R-squared,
the F-statistic and
the p-value associated with the F-statistic
Small p-values provide supporting evidence that the model
parameter is significant in the sense that omitting it (or,
equivalently, setting it to zero) would result in a poorer
model.
The most important point is that the p-values associated with
the individual coefficients are telling us something about the
utility of each term in the model, while the p-value given in
the last line of the summary is telling us about the overall fit
quality of the model.
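Both kinds of p-value can be extracted from the summary object. A sketch assuming the same mtcars model as above; p_coef and p_overall are illustrative names.

```r
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

# p-values of the individual coefficients (one per term)
p_coef <- s$coefficients[, "Pr(>|t|)"]

# summary() stores the F statistic and its degrees of freedom, but the
# overall p-value printed on the last line has to be computed from them
f <- s$fstatistic                       # value, numdf, dendf
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```

With a single predictor the F statistic is the square of the t statistic for the slope, so the overall p-value coincides with the p-value of wt; with several predictors the two generally differ.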

lm() function
α̂ = 37.2851 and SE (α̂) = 1.8776

β̂ = −5.3445 and SE (β̂) = 0.5591

The estimated regression equation is

ŷ = 37.2851 − 5.3445x

The residual standard error can be computed from

σ̂ = sd(slr.lm$residuals) * sqrt((length(slr.lm$residuals) - 1)/slr.lm$df.residual) = 3.045882

σ̂ can also be obtained as follows:

> sum.lm <- summary(slr.lm)
> sum.lm$sigma
[1] 3.045882
>
R² = 0.7528 and adjusted R² = 0.7446
R² and adjusted R² values can be fetched using the following
code:
> sum.lm$r.squared
[1] 0.7528328
>
> sum((slr.lm$fitted.values-mean(mtcars$mpg))^2)/
sum((mtcars$mpg-mean(mtcars$mpg))^2)
[1] 0.7528328
>
> sum.lm$adj.r.squared
[1] 0.7445939
>
$$SE(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
> sqrt(sum(sum.lm$residuals^2)/slr.lm$df.residual)/
sqrt(sum((mtcars$wt-mean(mtcars$wt))^2))
[1] 0.559101
Assign sum.lm <- summary(slr.lm). Then
> sum.lm$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 8.241799e-19
wt -5.344472 0.559101 -9.559044 1.293959e-10
> class(sum.lm$coefficients)
[1] "matrix"
Note that the class of sum.lm$coefficients is matrix (in R 4.0.0 and
later, class() also reports "array"), whose
row names are (Intercept) and wt
and column names are "Estimate", "Std. Error",
"t value", "Pr(>|t|)".
So, we fetch the regression coefficient of wt using the command
> sum.lm$coefficients[2,1]
[1] -5.344472
The t-value for testing the hypothesis β = 0 is obtained by
computing
t = β̂/SE(β̂) = −5.344472/0.559101 = −9.559044
Reading Significance Codes

---
Signif. codes:
0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

The above significance codes should be read as follows:

Code    p-value
'***'   0 ≤ p ≤ 0.001
'**'    0.001 < p ≤ 0.01
'*'     0.01 < p ≤ 0.05
'.'     0.05 < p ≤ 0.1
' '     0.1 < p ≤ 1
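The coding can be imitated with cut(), using the cutpoints read off the legend. This is a sketch on made-up p-values and not the code that summary() itself runs.

```r
p <- c(0.0004, 0.004, 0.03, 0.07, 0.4)   # made-up p-values

stars <- cut(p,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)      # intervals are closed on the right

data.frame(p = p, code = as.character(stars))
```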

> qqnorm( slr.lm$residuals,
main = "Normal Q-Q Plot for Residuals" )
> qqline( slr.lm$residuals, col=2 )

> mlr.lm <- lm( mpg ~ wt + cyl, data = mtcars)
> summary( mlr.lm )
Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-4.2893 -1.5512 -0.4684 1.5743 6.1004

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
wt -3.1910 0.7569 -4.216 0.000222 ***
cyl -1.5078 0.4147 -3.636 0.001064 **
---
Signif. codes:
0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12

> qqnorm( mlr.lm$residuals,
main = "Normal Q-Q Plot for Residuals" )
> qqline( mlr.lm$residuals, col=2 )

Regression

Examples

Linear Regression

1. People often predict children’s future height by using their 2-year-old height. A common
rule is to double the height. The following Table contains data for eight people’s heights
as 2-year-olds and as adults.

Age 2 (in.) 39 30 32 34 35 36 36 30
Adult (in.) 71 63 63 67 68 68 70 64

(a) Draw the scatter diagram. Is the relation linear?


(b) Compute Pearson’s correlation coefficient between heights at age 2 and adult heights.
(c) Use the above data set to build a simple linear regression model for adult height
using height at 2-years as the predictor.
(d) Interpret the estimate of the regression coefficient and examine its statistical significance.
(e) Find the 95% confidence interval for the regression coefficient.
(f) Find the value of R² and show that it is equal to the square of the sample correlation coefficient.
(g) Create simple diagnostic plots for your model and identify possible outliers.
(h) Using the data, what is the predicted adult height for a 2-year-old who is 33 inches
tall?

Solution:

(a) Draw the scatter diagram. Is the relation linear?


> age2 <- c(39, 30, 32, 34, 35, 36, 36, 30)
> adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
> plot(age2, adult, pch=20, col=2, xlab="Height at age 2",
ylab="Adult Height", main="Scatter diagram")
>

[Figure: scatter diagram; x-axis: Height at age 2; y-axis: Adult Height]
December 23, 2020

The scatter diagram shows that the heights at age 2 and the adult heights are
approximately linearly related.
(b) Compute Pearson’s correlation coefficient between heights at age 2 and adult heights.
> r <- cor(age2, adult)
> r
[1] 0.9456109
>
The relationship between height at age 2 and adult height is positive,
and the linear relationship is strong (r = 0.9456109).
(c) Use the above data set to build a simple linear regression model for adult height
using height at 2-years as the predictor.
> out.lm <- lm(adult ~ age2)
> out.lm

Call:
lm(formula = adult ~ age2)

Coefficients:
(Intercept) age2
35.1786 0.9286
>
Therefore, the regression equation of adult on age2 is given by

adult.height = 35.1786 + 0.9286 ∗ age2

> plot(age2, adult, pch=20, col=2, xlab="Height at age 2",
ylab="Adult Height", main="Scatter diagram")
> abline(out.lm,col="blue",lwd=2)

[Figure: scatter diagram with the fitted regression line; x-axis: Height at age 2; y-axis: Adult Height]

(d) Interpret the estimate of the regression coefficient and examine its statistical significance.
The regression coefficient of adult height on height at age 2 is 0.9285714286:
each additional inch of height at age 2 is associated with an estimated increase
of about 0.93 inches in adult height. With 95% confidence, this rate of change
lies in the interval (0.6094693, 1.2476735) (see the confidence interval for β below).
To assess the null hypothesis H0 : β = 0, which is interpreted as no linear relationship
between the response variable and the explanatory variable, the test statistic is
t = b/SEb and the corresponding p-value is obtained as follows:

if Ha : β ≠ 0, pobs = 2 × P (T ≥ |t|),

where T has the t-distribution with n − 2 degrees of freedom.

> summary( lm( adult ~ age2) )$coef[2,]
Estimate Std. Error t value Pr(>|t|)
0.9285714286 0.1304101327 7.1203932476 0.0003860027
>
As the p-value is less than 0.05, we reject the null hypothesis H0 : β = 0 and conclude
that the regression coefficient is significant.
(e) Find the 95% confidence interval for the regression coefficient. The 95% confidence
interval for β is
b ± tcritical SEb
where tcritical is the 97.5% quantile of the t-distribution with n − 2 = 6 degrees of freedom.
> sum.lm <- summary(out.lm)
> # critical value
> critical <- qt(0.975, 6)
>
> # half-width
> hw <- critical * sum.lm$coef[2,2]
> hw
[1] 0.3191021
>
> # 95% confidence interval
>
> sum.lm$coef[2,1] + c(-1,1) * hw
[1] 0.6094693 1.2476735
>
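As a cross-check, the interval b ± t × SE can also be obtained from confint(). The sketch below repeats the data so that it runs on its own; ci, est and manual are illustrative names.

```r
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
out.lm <- lm(adult ~ age2)

# 95% confidence intervals for the intercept and the slope
ci <- confint(out.lm, level = 0.95)
ci["age2", ]

# The same interval for the slope from b +/- t * SE,
# with the t quantile taken at n - 2 residual degrees of freedom
est    <- summary(out.lm)$coefficients["age2", ]
manual <- est["Estimate"] +
  c(-1, 1) * qt(0.975, df.residual(out.lm)) * est["Std. Error"]
```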
(f) Find the value of R² and show that it is equal to the square of the sample correlation coefficient.
> tss <- sum((adult - mean(adult))^2)
> 1-sum(out.lm$resid^2)/tss
[1] 0.8941799
>
> r <- cor(age2,adult)
> r
[1] 0.9456109
> R2 <- r^2
> R2
[1] 0.8941799
>


(g) Create simple diagnostic plots for your model and identify possible outliers.
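One way to produce such diagnostics is the plot() method for lm objects, sketched below together with the numerical quantities behind the panels. The 2 × 3 layout and the names res_std, lev, cook and worst are choices made for this illustration.

```r
age2  <- c(39, 30, 32, 34, 35, 36, 36, 30)
adult <- c(71, 63, 63, 67, 68, 68, 70, 64)
out.lm <- lm(adult ~ age2)

# Numerical counterparts of the diagnostic plots
res_std <- rstandard(out.lm)        # standardized residuals
lev     <- hatvalues(out.lm)        # leverages h_ii
cook    <- cooks.distance(out.lm)   # Cook's distances

# Observation with the largest raw residual in absolute value
worst <- which.max(abs(residuals(out.lm)))

if (interactive()) {
  op <- par(mfrow = c(2, 3))
  plot(out.lm, which = 1:6)         # the six standard diagnostic panels
  par(op)
}
```

Observation 3 has the largest residual in absolute value, so it is the first candidate to examine as a possible outlier.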

[Figure: diagnostic plots for the fitted model in six panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance, Residuals vs Leverage, and Cook's dist vs Leverage]

(h) Using the data, what is the predicted adult height for a 2-year-old who is 33 inches
tall?

> # Given a height of 33 inches at age 2, the predicted adult height is computed below
>
> # using predict() function
>
> predict(out.lm, newdata=data.frame(age2=33))
1
65.82143
>
> # using the fitted model equation
>
> out.lm$coef[1] + out.lm$coef[2]*33
(Intercept)
65.82143
>
